InfluxDB is the time-series database at the center of observability stacks — receiving metrics from Telegraf agents, Prometheus remote write, and application SDKs, then serving those metrics to Grafana dashboards, alerting systems, and capacity planning tools. When InfluxDB goes down, metrics stop flowing in and dashboards go blank; when the write API is degraded, you get silent data gaps that corrupt trend analysis and alerting baselines. When the query API is slow or unavailable, dashboards fail to render and on-call engineers lose visibility at exactly the moment they need it. Vigilmon gives you external visibility into InfluxDB's health: the built-in health and ping endpoints, write and query API availability, and SSL certificate expiry.
What You'll Build
- A monitor on InfluxDB's /health endpoint to detect when the database is down or degraded
- A ping check at /ping as a lightweight liveness probe
- A write API availability check to confirm metric ingestion is working
- A query API monitor to verify Flux query execution is available
- SSL certificate monitoring for your InfluxDB domain
Prerequisites
- A running InfluxDB v2.0+ instance with a network-reachable domain or IP
- HTTPS configured (e.g., https://influxdb.example.com)
- An InfluxDB API token with read access (for the query API monitor)
- A free account at vigilmon.online
Step 1: Understand InfluxDB's Health and Ping Endpoints
InfluxDB v2 exposes two diagnostic endpoints that don't require authentication:
# Health endpoint — full component status
curl https://influxdb.example.com/health
# Ping endpoint — simple liveness probe
curl -I https://influxdb.example.com/ping
The /health endpoint returns JSON with a status field:
{
"name": "influxdb",
"message": "ready for queries and writes",
"status": "pass",
"checks": [],
"version": "2.7.1",
"commit": "..."
}
The /ping endpoint returns HTTP 204 No Content with no body — a minimal liveness probe that confirms the InfluxDB process is running and accepting connections.
| Endpoint | Normal response | What a failure means |
|---|---|---|
| /health | 200 with "status":"pass" | InfluxDB is degraded or down |
| /ping | 204 | InfluxDB process is down or not accepting connections |
| /api/v2/write | 204 (success) or 401 (no auth) | Write pipeline is broken |
| /api/v2/query | 200 or 401 (no auth) | Query engine is unavailable |
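The table above can also be written as a small lookup that a script checks probe results against. This is a minimal sketch, assuming curl is installed. BASE_URL and the function names (expected_status, probe_status, check_endpoint) are illustrative and not part of InfluxDB or Vigilmon.

```shell
# Illustrative default; override BASE_URL for your own instance.
BASE_URL="${BASE_URL:-https://influxdb.example.com}"

# expected_status PATH
# Prints the status code this guide treats as healthy for an unauthenticated probe.
expected_status() {
  case "$1" in
    /health)        echo 200 ;;
    /ping)          echo 204 ;;
    /api/v2/write*) echo 401 ;;
    /api/v2/query*) echo 401 ;;
    *)              return 1 ;;   # unknown endpoint
  esac
}

# probe_status METHOD PATH
# Prints the HTTP status code; curl prints 000 when no response was received.
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 15 -X "$1" "${BASE_URL}$2"
}

# check_endpoint METHOD PATH
# Exit 0 when the observed code matches the expected one, 2 for an unknown path.
check_endpoint() {
  local want got
  want="$(expected_status "$2")" || return 2
  got="$(probe_status "$1" "$2")" || true
  [ "$got" = "$want" ]
}
```

For example, `check_endpoint GET /ping && echo "ping ok"` reports success only when the endpoint answers 204.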
Step 2: Create a Vigilmon HTTP Monitor for the Health Endpoint
- Log in to Vigilmon → Add Monitor → HTTP.
- URL: https://influxdb.example.com/health
- Check interval: 60 seconds.
- Response timeout: 15 seconds.
- Expected status: 200
- Keyword: "status":"pass" (confirms InfluxDB reports itself healthy).
- Click Save.
This monitor catches:
- InfluxDB process crashes
- Storage engine failures (InfluxDB marks itself degraded when TSM files are corrupted)
- Bolt/SQLite metadata database failures
- Out-of-memory kills
- Deployment failures after InfluxDB upgrades
Alert sensitivity: Set to trigger after 1 consecutive failure. When InfluxDB is down, every metric ingestion pipeline starts buffering or dropping data.
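To reproduce the same check by hand, the status code and keyword test can be combined in a few lines of shell. This is a rough sketch rather than Vigilmon's actual matching logic. The function names are hypothetical, and the whitespace-tolerant pattern is an assumption in case the JSON is formatted differently from the sample above.

```shell
# health_body_passes
# Reads a /health response body on stdin; exit 0 only if it reports "status":"pass".
# Whitespace around the colon is tolerated (an assumption about formatting).
health_body_passes() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"pass"'
}

# classify_health STATUS_CODE BODY
# Prints: healthy | degraded | down
# A code of 000 is what curl's %{http_code} reports when no response arrived.
classify_health() {
  local code="$1" body="$2"
  if [ "$code" = "000" ]; then
    echo down
  elif [ "$code" = "200" ] && printf '%s\n' "$body" | health_body_passes; then
    echo healthy
  else
    echo degraded
  fi
}
```

Pass it the status code and body from your own curl call against /health. Anything other than `healthy` corresponds to the conditions this monitor alerts on.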
Step 3: Add a Ping Monitor as a Lightweight Liveness Probe
The ping endpoint is cheaper to serve than the health endpoint, which makes it a good first-line liveness check with a shorter interval and timeout:
- Add Monitor → HTTP.
- URL: https://influxdb.example.com/ping
- Check interval: 30 seconds.
- Response timeout: 5 seconds.
- Expected status: 204
- Label: InfluxDB ping
- Click Save.
Why both? The ping monitor fires faster (lower timeout) and confirms the InfluxDB process is alive. The health monitor takes longer but confirms storage engines and subsystems are ready. If the health monitor fires while ping stays green, the process is up but internally degraded.
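One way to read the two monitors together is as a small decision table. The sketch below is illustrative only: the `diagnose` function and its `up`/`down` arguments are hypothetical names for the state of each monitor.

```shell
# diagnose PING_STATE HEALTH_STATE
# Each argument is "up" (monitor passing) or "down" (monitor failing).
diagnose() {
  case "$1:$2" in
    up:up)     echo "healthy" ;;
    up:down)   echo "process up, internally degraded" ;;
    down:down) echo "process down or unreachable" ;;
    down:up)   echo "inconsistent: recheck the ping monitor" ;;
    *)         echo "usage: diagnose up|down up|down" >&2; return 2 ;;
  esac
}
```

The `down:up` case should be rare. It usually points at a transient probe failure, or a ping timeout that is too tight, rather than at InfluxDB itself.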
Step 4: Monitor the Write API
InfluxDB's write API at /api/v2/write is what Telegraf, Prometheus remote write, and application SDKs use to push metrics. An unauthenticated request returns 401, which confirms the write path is up:
curl -I 'https://influxdb.example.com/api/v2/write?org=myorg&bucket=metrics&precision=s'
# Returns 401 — correct; the endpoint is up and enforcing auth
- Add Monitor → HTTP.
- URL: https://influxdb.example.com/api/v2/write?org=myorg&bucket=metrics&precision=s (replace myorg and metrics with your org and bucket names).
- Check interval: 2 minutes.
- Response timeout: 10 seconds.
- Expected status: 401
- Label: InfluxDB write API
- Click Save.
A 401 response from the write API means the endpoint is running and enforcing authentication, which is the correct signal for an unauthenticated health check. If the write pipeline crashes or the storage engine fills up, you'll get 503 or a connection error instead.
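The status codes described above map to a short classifier, which can help when triaging from a terminal. This is a sketch under the assumptions stated in this guide. The `classify_write_probe` name and its verdict strings are made up for illustration, and `000` is the value curl's `%{http_code}` prints when no HTTP response arrives.

```shell
# classify_write_probe CODE
# Prints a verdict for the status code returned by a write-API probe.
classify_write_probe() {
  case "$1" in
    401) echo "up: endpoint reachable and enforcing auth" ;;
    204) echo "up: write accepted" ;;
    503) echo "unavailable: write path or storage problem" ;;
    000) echo "unreachable: connection error or timeout" ;;
    *)   echo "unexpected: HTTP $1" ;;
  esac
}

# Example (not run here): feed it the code from a probe of your own instance.
#   code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 -I \
#     'https://influxdb.example.com/api/v2/write?org=myorg&bucket=metrics&precision=s')"
#   classify_write_probe "$code"
```

Only the 401 case matches what the Vigilmon monitor expects. Every other verdict corresponds to an alert.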
Step 5: Monitor the Query API
InfluxDB's query API at /api/v2/query executes Flux queries — used by Grafana data source plugins, alerting systems, and any tool reading metrics. An unauthenticated POST returns 401:
curl -s -o /dev/null -w '%{http_code}\n' -X POST 'https://influxdb.example.com/api/v2/query?org=myorg'
# Returns 401 — confirms the query engine is up
- Add Monitor → HTTP.
- URL: https://influxdb.example.com/api/v2/query?org=myorg (replace myorg with your org name).
- Method: POST
- Check interval: 2 minutes.
- Response timeout: 10 seconds.
- Expected status: 401
- Label: InfluxDB query API
- Click Save.
If ping and health are green but the query API monitor returns a non-401 error, the Flux query engine may be in a degraded state. This can happen after a crash recovery where the TSM compaction queue is large: writes are accepted but queries time out.
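The unauthenticated probe shows only that the endpoint answers. To confirm that Flux queries actually execute, you can run a trivial query by hand using the read token from the prerequisites. This is a minimal sketch, assuming curl is installed. The INFLUX_URL, INFLUX_ORG, and INFLUX_TOKEN variables and both function names are hypothetical, and `buckets()` is used only as a cheap query.

```shell
# flux_query_status
# Runs a trivial Flux query with a read token and prints the HTTP status code.
# Assumes INFLUX_URL, INFLUX_ORG and INFLUX_TOKEN are set in the environment.
flux_query_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -X POST "${INFLUX_URL}/api/v2/query?org=${INFLUX_ORG}" \
    -H "Authorization: Token ${INFLUX_TOKEN}" \
    -H "Content-Type: application/vnd.flux" \
    -H "Accept: application/csv" \
    --data-binary 'buckets()'
}

# query_verdict CODE
# Interprets the status code of the authenticated query above.
query_verdict() {
  case "$1" in
    200)     echo "query engine ok" ;;
    401|403) echo "token rejected" ;;
    000)     echo "no response" ;;
    *)       echo "query engine error: HTTP $1" ;;
  esac
}
```

For example: `query_verdict "$(flux_query_status)"`. A 200 here means the query path executed end to end, which the 401-based monitor alone cannot show.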
Step 6: Monitor SSL Certificates
InfluxDB's TLS certificate affects every client: Telegraf agents, Grafana plugins, and any application using the InfluxDB SDK. An expired certificate causes TLS handshake failures across all metric ingestion and query paths:
openssl s_client -connect influxdb.example.com:8086 </dev/null 2>/dev/null | openssl x509 -noout -dates
- Add Monitor → SSL Certificate.
- Domain: influxdb.example.com
- Alert when expiry is within: 30 days.
- Alert again: 14 days, 7 days, 3 days, 1 day.
- Click Save.
Custom port: If InfluxDB listens on a port other than 443 (e.g., its default :8086), make sure the SSL monitor checks the host and port that clients actually connect to. Most certificate monitors check port 443 by default; if InfluxDB runs on 8086 behind a reverse proxy, the reverse proxy's certificate is what matters to clients.
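To spot-check from a terminal which of the reminder thresholds above a certificate has crossed, a rough sketch follows. It assumes openssl and GNU date (`date -d`) are available, so the date arithmetic will not work with BSD or macOS date. Both function names are hypothetical.

```shell
# cert_days_left HOST PORT
# Prints whole days until the served certificate expires (negative if already expired).
# Assumes GNU date for parsing the notAfter timestamp.
cert_days_left() {
  local end
  end="$(openssl s_client -connect "$1:$2" -servername "$1" </dev/null 2>/dev/null \
         | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)"
  [ -n "$end" ] || return 1   # no certificate retrieved
  echo $(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))
}

# expiry_tier DAYS
# Prints the tightest threshold (1, 3, 7, 14 or 30) that DAYS falls within, else "none".
expiry_tier() {
  local days="$1" t
  for t in 1 3 7 14 30; do
    if [ "$days" -le "$t" ]; then
      echo "$t"
      return 0
    fi
  done
  echo none
}
```

For example, `expiry_tier "$(cert_days_left influxdb.example.com 8086)"` prints `none` while the certificate has more than 30 days left.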
Step 7: Configure Alerting
In Vigilmon under Settings → Notifications, configure your alert channels:
| Monitor | Trigger | Action |
|---|---|---|
| /health | Non-200 or "status":"pass" missing | Check InfluxDB process and logs; inspect storage engine |
| /ping | Non-204 or connection error | InfluxDB process down; check systemd unit or Docker container |
| Write API | Non-401 or connection error | Metric ingestion broken; Telegraf/Prometheus writes failing |
| Query API | Non-401 or connection error | Dashboard queries failing; check Flux engine logs |
| SSL certificate | < 30 days to expiry | Renew certificate; test Telegraf and Grafana connections after renewal |
Alert after: 1 consecutive failure for all InfluxDB monitors — metric loss and dashboard blindness begin immediately.
Common InfluxDB Failure Modes and What Vigilmon Catches
| Scenario | Vigilmon monitor |
|---|---|
| InfluxDB process crash | /ping returns connection error; alert within 30 s |
| TSM storage engine corruption | /health returns degraded; alert fires |
| Bolt database (metadata) failure | Health check reports unhealthy; write API returns 500 |
| Disk full (TSM data directory) | Write API returns 503 instead of 401; write monitor fires |
| Flux query engine OOM killed | Query API returns 503; dashboard queries fail |
| SSL certificate expires | SSL monitor alerts at 30-day threshold; all client connections fail |
| InfluxDB upgrade migration fails | Health endpoint returns degraded during migration |
| Cardinality limit reached | Writes succeed but series get dropped; requires application-level monitoring |
| DNS misconfiguration | All monitors fire simultaneously |
| Reverse proxy certificate mismatch | SSL monitor fires; Telegraf/Grafana connections rejected |
InfluxDB is the data store for your entire observability stack. When it fails, you lose the ability to see what else is failing — dashboards go blank at exactly the moment when you need them most. Vigilmon gives you an independent, external view of InfluxDB's health that doesn't depend on InfluxDB being up: ping checks, health status, write and query API availability, and SSL certificate expiry monitored from outside your infrastructure.
Start monitoring InfluxDB in under 5 minutes — register free at vigilmon.online.