Monitoring Grafana Loki with Vigilmon: Ready Endpoint, Metrics Availability, Query API Health & SSL Alerts

Grafana Loki is the log aggregation layer for your observability stack: it receives log streams from Promtail, Grafana Alloy, and OpenTelemetry collectors, and serves them to Grafana dashboards during incidents. When Loki's ingester goes down, logs stop being written and your team is flying blind during an outage. When the query frontend becomes unavailable, engineers can't search logs in Grafana, right when they need it most. When the distributor falls behind, log ingestion latency spikes and you start missing log lines. Vigilmon gives you external visibility into Loki's operational health: the ready endpoint, metrics availability, query API responsiveness, and SSL certificate expiry.

What You'll Build

A monitor on Loki's /ready endpoint to detect ingestion failures
A metrics endpoint check to confirm Loki's Prometheus scrape target is available
A query API health check that validates the query frontend
SSL certificate monitoring for your Loki domain
An alerting setup that distinguishes ingestion failures from query failures

Prerequisites

A running Grafana Loki instance (single-binary or microservices mode) accessible over HTTP/HTTPS
A domain or IP for Loki (e.g., https://loki.example.com or an internal address)
A free account at vigilmon.online

Step 1: Verify Loki's Ready Endpoint

Loki exposes a /ready endpoint that confirms the instance is ready to receive and serve log data:

curl https://loki.example.com/ready

A healthy Loki returns HTTP 200 with a plain-text body:

ready

Loki returns HTTP 503 with Not ready: ... during startup, after a failed ring join, or when the ingester cannot reach the storage backend. This makes /ready a useful signal for the health of the ingestion pipeline.

Port: Loki's HTTP port defaults to 3100. If you're accessing Loki directly (not through a reverse proxy), use http://loki.example.com:3100/ready. If your Loki is behind a reverse proxy, use the proxied HTTPS URL.

Microservices mode: In microservices deployments, each component (distributor, ingester, querier, query-frontend) has its own /ready endpoint. Monitor the component URLs individually if you route traffic per component.
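The status and body rules above can be sketched as a small shell helper. This is a minimal illustration, not part of Loki or Vigilmon: the function name classify_ready and its output labels are hypothetical, and the example URL in the comment is a placeholder.

```shell
# classify_ready STATUS BODY
# Maps an HTTP status code and response body from /ready to a short label.
# The function name and the labels it prints are illustrative assumptions.
classify_ready() {
    status=$1
    # Drop whitespace so a body of "ready" followed by a newline still matches.
    body=$(printf '%s' "$2" | tr -d '[:space:]')
    case "$status" in
        200)
            if [ "$body" = "ready" ]; then
                echo "ready"
            else
                echo "unexpected-body"
            fi
            ;;
        503) echo "not-ready" ;;
        000) echo "unreachable" ;;   # curl prints 000 when no HTTP response arrived
        *)   echo "unexpected-status" ;;
    esac
}

# Example against a live instance (placeholder URL):
#   status=$(curl -s -m 10 -o /tmp/loki_ready -w '%{http_code}' https://loki.example.com/ready)
#   classify_ready "$status" "$(cat /tmp/loki_ready)"
```

Comparing the whole body, rather than searching for the word ready, avoids a false positive on a Not ready: ... response.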

Step 2: Create a Vigilmon HTTP Monitor for the Ready Endpoint

Log in to Vigilmon → Add Monitor → HTTP.
URL: https://loki.example.com/ready.
Check interval: 60 seconds.
Response timeout: 10 seconds.
Expected status: 200.
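Keyword: ready (the exact response body of a healthy Loki instance). The 503 body Not ready: ... also contains this word, so keep the expected-status check enabled alongside the keyword.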
Click Save.

This monitor catches:

Loki process crashes or OOM kills (common when ingesting high-cardinality streams)
Storage backend failures (S3, GCS, or local filesystem issues that cause Loki to go not-ready)
Ring membership failures in distributed mode
Loki startup failures after configuration changes

Alert sensitivity: Set alerts to trigger after 1 consecutive failure. When Loki is not ready, log ingestion stops. Gaps in log data during an ongoing incident make root-cause analysis much harder.

Step 3: Monitor Loki's Metrics Endpoint

Loki exposes Prometheus metrics at /metrics. This endpoint is used by Prometheus or Grafana Alloy to scrape Loki's own operational metrics (ingestion rate, query latency, error rates). Monitoring it from Vigilmon confirms that the metrics pipeline is functioning:

curl -s https://loki.example.com/metrics | head -20

A healthy Loki returns a large text response starting with:

# HELP loki_build_info A metric with a constant '1' value labeled by version...
# TYPE loki_build_info gauge
loki_build_info{...} 1

Add Monitor → HTTP.
URL: https://loki.example.com/metrics.
Check interval: 2 minutes.
Response timeout: 15 seconds.
Expected status: 200.
Keyword: loki_build_info (always present in Loki metrics output).
Label: Loki metrics endpoint.
Click Save.

Why monitor the metrics endpoint? If Loki's metrics endpoint goes down, your Prometheus scrapes start failing — you lose visibility into Loki's own health metrics (ingestion lag, query errors, chunk utilization). Vigilmon catches this independently of your Prometheus stack so you don't need your observability stack to monitor your observability stack.
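To see what the loki_build_info keyword check amounts to, here is a minimal sketch that looks for the sample line in metrics text and pulls out a version label. The function names are hypothetical, and the presence of a label named version is an assumption based on the HELP text shown earlier, since the label set itself is elided there.

```shell
# has_build_info: reads Prometheus text format on stdin and succeeds
# if a loki_build_info sample line (not just HELP/TYPE comments) is present.
has_build_info() {
    grep -q '^loki_build_info{'
}

# build_info_version: reads the same text on stdin and prints the value of
# the version label. Assumes a label literally named "version" exists.
# The optional group ending in a comma stops "goversion" from matching.
build_info_version() {
    sed -n 's/^loki_build_info{\(.*,\)\{0,1\}version="\([^"]*\)".*/\2/p' | head -n 1
}

# Example against a live instance (placeholder URL):
#   curl -s -m 15 https://loki.example.com/metrics | build_info_version
```

A keyword monitor only needs the first check. The version extraction is a convenience for confirming which build answered after an upgrade.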

Step 4: Monitor the Loki Query API

The Loki query API (/loki/api/v1/query and related endpoints) serves LogQL queries from Grafana dashboards and the logcli command-line tool. A healthy query API is what enables engineers to search logs during incidents. Rather than running a full LogQL query, test the query path with the lightweight labels endpoint:

curl "https://loki.example.com/loki/api/v1/labels"

A healthy Loki query frontend returns:

{"status":"success","data":[...]}

Add Monitor → HTTP.
URL: https://loki.example.com/loki/api/v1/labels.
Check interval: 2 minutes.
Response timeout: 15 seconds.
Expected status: 200.
Keyword: "status":"success" (confirms the query frontend processed the request).
Label: Loki query API.
Click Save.

When the query API monitor fires but the /ready endpoint is green, you have a query frontend issue separate from the ingestion pipeline — common in microservices mode when the querier or query-frontend component is degraded while the distributor/ingester continues working.

Authentication: If Loki is behind an auth proxy (e.g., Grafana's built-in auth proxy or nginx basic auth), the /loki/api/v1/labels endpoint may require authentication headers. In that case, use the /ready endpoint as your primary health check and monitor the auth proxy separately.
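The status and keyword check from this step can be sketched as follows. The function name query_api_ok is hypothetical. The pattern tolerates optional whitespace around the colon, which is an extra safety margin beyond what an exact-keyword match provides.

```shell
# query_api_ok STATUS BODY
# Succeeds only when the status is 200 and the JSON body reports
# "status":"success" (with or without spaces around the colon).
query_api_ok() {
    [ "$1" = "200" ] || return 1
    printf '%s' "$2" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"success"'
}

# Example against a live instance (placeholder URL):
#   status=$(curl -s -m 15 -o /tmp/loki_labels -w '%{http_code}' \
#       https://loki.example.com/loki/api/v1/labels)
#   query_api_ok "$status" "$(cat /tmp/loki_labels)" && echo "query API healthy"
```

If the keyword monitor fails while this looser pattern passes, the likely cause is a spacing difference introduced by a proxy or formatter in front of Loki.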

Step 5: Monitor the Loki Push API (Ingestion Health)

Loki's push API at /loki/api/v1/push accepts log streams from Promtail, Alloy, and other agents. You can verify it's available with a minimal POST that tests routing without writing actual data:

curl -X POST https://loki.example.com/loki/api/v1/push \
  -H "Content-Type: application/json" \
  -d '{"streams":[]}'

An available push endpoint returns HTTP 204 No Content for an empty streams array — proving the ingestion path is reachable:

Add Monitor → HTTP (POST).
URL: https://loki.example.com/loki/api/v1/push.
Method: POST.
Headers: Content-Type: application/json.
Body: {"streams":[]}.
Expected status: 204.
Check interval: 2 minutes.
Label: Loki push API (ingestion).
Click Save.

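This confirms that the push endpoint is routable and accepting requests without writing actual log data. Because the payload is empty, it does not prove that the ingester can persist new streams; the /ready monitor covers that side.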
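When the push monitor reports something other than 204, the status code usually narrows down the cause. The sketch below is a hedged illustration: the function name and labels are hypothetical, and every mapping other than 204 follows general HTTP semantics rather than anything stated in this guide.

```shell
# push_status_meaning STATUS
# Translates the status code of the empty-streams POST into a short label.
# Only the 204 case comes from this guide; the rest are assumptions based
# on standard HTTP status meanings.
push_status_meaning() {
    case "$1" in
        204) echo "ok" ;;
        401) echo "auth-required" ;;   # e.g. credentials or tenant header missing
        429) echo "rate-limited" ;;
        5??) echo "server-error" ;;
        000) echo "unreachable" ;;     # curl prints 000 when no response arrived
        *)   echo "unexpected" ;;
    esac
}

# Example against a live instance (placeholder URL). If multi-tenancy is
# enabled, an X-Scope-OrgID header would also be needed (assumption about
# your deployment):
#   status=$(curl -s -m 15 -o /dev/null -w '%{http_code}' -X POST \
#       -H 'Content-Type: application/json' -d '{"streams":[]}' \
#       https://loki.example.com/loki/api/v1/push)
#   push_status_meaning "$status"
```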

Step 6: Monitor SSL Certificates

Loki is typically accessed by log-shipping agents running on every node in your infrastructure. An expired Loki SSL certificate causes:

All Promtail, Fluentd, and Alloy agents to fail TLS validation and stop shipping logs
Grafana's Loki data source connection to fail, making all log panels blank
logcli queries to fail with certificate errors

Add Monitor → SSL Certificate.
Domain: loki.example.com.
Alert when expiry is within: 30 days.
Alert again: 14 days, 7 days, 3 days, 1 day.
Click Save.

Internal Loki deployments: If Loki is on an internal network with a private CA certificate, Vigilmon's external SSL monitor can still check certificate validity if the endpoint is reachable. For truly internal deployments, focus monitoring on the /ready endpoint via a Vigilmon agent or reverse proxy that exposes it externally.
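The 30/14/7/3/1-day reminder schedule can be expressed as simple arithmetic on the certificate's expiry time. This is a minimal sketch with hypothetical function names and labels; it mirrors the thresholds configured above and is not how Vigilmon itself is implemented.

```shell
# days_until EXPIRY_EPOCH NOW_EPOCH
# Prints whole days remaining, or -1 if the certificate has already expired.
days_until() {
    if [ "$1" -le "$2" ]; then
        echo -1
    else
        echo $(( ($1 - $2) / 86400 ))
    fi
}

# expiry_tier DAYS
# Prints the tightest reminder threshold that DAYS falls within.
expiry_tier() {
    days=$1
    if   [ "$days" -lt 0 ];  then echo "expired"
    elif [ "$days" -le 1 ];  then echo "1-day"
    elif [ "$days" -le 3 ];  then echo "3-day"
    elif [ "$days" -le 7 ];  then echo "7-day"
    elif [ "$days" -le 14 ]; then echo "14-day"
    elif [ "$days" -le 30 ]; then echo "30-day"
    else echo "ok"
    fi
}

# Example against a live host (placeholder domain; date -d requires GNU date):
#   end=$(echo | openssl s_client -connect loki.example.com:443 \
#       -servername loki.example.com 2>/dev/null \
#       | openssl x509 -noout -enddate | cut -d= -f2)
#   expiry_tier "$(days_until "$(date -d "$end" +%s)" "$(date +%s)")"
```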

Step 7: Configure Alerting

In Vigilmon under Settings → Notifications, configure your alert channels:

| Monitor | Trigger | Action |
|---|---|---|
| /ready | 503 or ready keyword missing | Loki not ready; check storage backend; inspect Loki logs |
| /metrics | Non-200 or keyword missing | Metrics endpoint down; Prometheus scrapes failing |
| Query API (/loki/api/v1/labels) | Non-200 or keyword missing | Query frontend degraded; Grafana log panels broken |
| Push API (/loki/api/v1/push) | Non-204 | Ingestion path broken; agents can't write logs |
| SSL certificate | < 30 days to expiry | Renew cert; agents will fail TLS validation when it expires |

Alert after: 1 consecutive failure for the /ready monitor. 2 consecutive failures for query and push monitors (brief query timeouts can happen under load without indicating a full failure).
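The difference between a threshold of 1 and a threshold of 2 is easiest to see with a short simulation. The sketch below is illustrative only: should_alert is a hypothetical name, and the ok/fail result strings are an assumed encoding of check outcomes.

```shell
# should_alert THRESHOLD RESULT...
# RESULT values are "ok" or "fail", oldest first. Succeeds when the most
# recent THRESHOLD results are all "fail", i.e. the current failure streak
# has reached the threshold.
should_alert() {
    threshold=$1
    shift
    streak=0
    for result in "$@"; do
        if [ "$result" = "fail" ]; then
            streak=$((streak + 1))
        else
            streak=0   # any success resets the streak
        fi
    done
    [ "$streak" -ge "$threshold" ]
}

# Example: a single timeout on the query monitor does not alert with a
# threshold of 2, but the same result on /ready alerts with a threshold of 1.
#   should_alert 2 ok ok fail || echo "query: no alert yet"
#   should_alert 1 ok ok fail && echo "/ready: alert"
```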

Common Loki Failure Modes and What Vigilmon Catches

| Scenario | Vigilmon monitor |
|---|---|
| Loki OOM kill (high-cardinality labels) | /ready unreachable; alert within 60 s |
| S3/GCS storage backend unreachable | /ready returns 503; ingestion stops |
| Ingester ring membership lost | /ready returns 503; restart restores ring |
| Query frontend crash (microservices) | Query API monitor fires; /ready stays green |
| Promtail/Alloy agents stop shipping | Push API monitor (if POST check used); /ready stays green |
| Compactor falls behind (disk pressure) | Metrics show compactor lag; /ready may stay green |
| SSL certificate expires | SSL monitor alerts; all agents fail TLS and stop shipping |
| Loki upgrade breaks query API | Query API monitor fires; /ready may still pass |
| DNS misconfiguration | All monitors fire simultaneously |
| Retention job deletes active chunks | Query API returns errors; chunk health issues visible in metrics |
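Reading the monitors together is what separates an ingestion failure from a query failure. The sketch below condenses that reasoning into one function. The name triage, the up/down inputs, and the output labels are all hypothetical, and the mapping is a simplification of the scenarios listed above.

```shell
# triage READY QUERY PUSH
# Each argument is "up" or "down" for the corresponding monitor.
# Prints a coarse guess at where the failure sits.
triage() {
    ready=$1
    query=$2
    push=$3
    case "$ready/$query/$push" in
        up/up/up)       echo "healthy" ;;
        down/down/down) echo "instance-or-network" ;;  # e.g. DNS, crash, OOM kill
        down/*)         echo "not-ready" ;;            # e.g. storage or ring trouble
        up/down/up)     echo "query-path" ;;
        up/up/down)     echo "push-path" ;;
        *)              echo "query-and-push" ;;
    esac
}

# Example:
#   triage up down up    # prints: query-path
```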

Loki is part of your observability stack — which creates a monitoring paradox: if Loki goes down, you lose the tool you'd normally use to diagnose why it went down. Vigilmon breaks that dependency by giving you external, independent visibility into Loki's health from outside your observability infrastructure. The /ready endpoint, metrics scrape target, query API, and SSL certificate are all monitored by a system that doesn't depend on Loki being healthy to tell you it's not.

Start monitoring Loki in under 5 minutes — register free at vigilmon.online.