Elasticsearch is a distributed system with a lot of moving parts (nodes, shards, replicas, indices), and when any of them misbehaves, query performance can degrade or writes can start failing before anyone notices. A cluster that reports yellow status has lost redundancy: at least one replica shard is unassigned. A cluster that goes red has at least one unassigned primary shard, so searches and writes that touch the affected indices fail. You need external, continuous monitoring to catch these transitions soon after they happen.
Vigilmon gives you HTTP probe monitoring for the Elasticsearch health API and heartbeat monitoring for your indexing pipelines. This tutorial shows you how to set both up.
Why Elasticsearch Health Monitoring Matters
Elasticsearch ships with built-in APIs that expose cluster state, but those APIs are only useful if something is actively checking them. Without external monitoring, your team finds out about cluster problems from:
- User-facing search returning stale or incomplete results
- Application error logs flooded with NoNodeAvailableException or BulkItemFailure
- A database growing unboundedly because indexing jobs silently stopped consuming it
Vigilmon probes the cluster health API from outside your infrastructure on a regular interval. It catches problems that process-level monitoring (systemd, Docker health checks) completely misses:
- Cluster status transitions from green → yellow → red
- Unassigned shards blocking index availability
- Network partitions between your application and the Elasticsearch cluster
- Indexing pipeline failures — jobs that silently stop without crashing
Step 1: Expose an Elasticsearch Health Endpoint
Elasticsearch exposes health information through two built-in REST endpoints:
- GET /_cluster/health: overall cluster status (green, yellow, red), node count, shard counts
- GET /_cat/health?v: human-readable cluster health table
For Vigilmon to probe these directly, you need the Elasticsearch HTTP port (default 9200) accessible from Vigilmon's probe network, or you need a thin proxy in your application layer that forwards the check. The proxy approach is preferred for clusters not exposed to the internet.
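To make the response shape concrete, here is a minimal sketch of how a checker might reduce a /_cluster/health payload to the fields a probe cares about. The sample payload is illustrative rather than output from a real cluster, and the helper name summarize_health is hypothetical; the field names (status, number_of_nodes, unassigned_shards) are ones the cluster health API returns.

```python
import json

# Illustrative payload only; real responses carry more fields.
SAMPLE = json.loads("""
{
  "cluster_name": "my-cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 3,
  "active_primary_shards": 10,
  "active_shards": 18,
  "unassigned_shards": 2
}
""")


def summarize_health(payload: dict) -> dict:
    """Reduce a cluster health payload to the fields a probe cares about."""
    status = payload.get("status", "unknown")
    return {
        "status": status,
        "nodes": payload.get("number_of_nodes", 0),
        "unassigned_shards": payload.get("unassigned_shards", 0),
        # Only green and yellow clusters still have every primary shard assigned.
        "healthy": status in ("green", "yellow"),
    }


if __name__ == "__main__":
    print(summarize_health(SAMPLE))
```

Treating a missing or unrecognized status as unhealthy is a deliberate choice in this sketch: an error body from a proxy or from Elasticsearch itself should not be mistaken for a green cluster.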
Option A: Direct Probe (cluster exposed via secure proxy)
If you have an Nginx or HAProxy reverse proxy in front of Elasticsearch with authentication, configure Vigilmon to probe:
https://es-proxy.example.com/_cluster/health
With Basic Auth credentials if required.
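If you want to confirm the proxy accepts your credentials before configuring the monitor, the following sketch builds the same kind of authenticated GET request using only the Python standard library. The username and password are placeholders, and the helper names are hypothetical; the request is prepared but not sent unless you uncomment the final call.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_probe_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request carrying Basic Auth credentials."""
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", basic_auth_header(username, password))
    return req


if __name__ == "__main__":
    # Placeholder credentials for illustration only.
    req = build_probe_request(
        "https://es-proxy.example.com/_cluster/health", "monitor", "s3cret"
    )
    print(req.get_method(), req.full_url)
    # To actually send it:
    # with urllib.request.urlopen(req, timeout=5) as resp:
    #     print(resp.status, resp.read().decode())
```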
Option B: Application-Layer Health Wrapper
For clusters on private networks, add a /health/elasticsearch endpoint to your application:
# healthcheck.py: FastAPI Elasticsearch health proxy
import os

import httpx
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
ES_URL = os.environ["ELASTICSEARCH_URL"]  # e.g. http://es:9200


@app.get("/health/elasticsearch")
async def elasticsearch_health():
    try:
        async with httpx.AsyncClient() as client:
            resp = await client.get(f"{ES_URL}/_cluster/health", timeout=5)
        # Treat auth failures and server errors from Elasticsearch as "down"
        resp.raise_for_status()
        data = resp.json()
        status = data.get("status", "unknown")
        # Green and yellow return 200; red or an unrecognized status returns 503
        code = 200 if status in ("green", "yellow") else 503
        return JSONResponse(status_code=code, content={"status": status, "detail": data})
    except Exception as e:
        return JSONResponse(status_code=503, content={"status": "down", "error": str(e)})
// healthcheck.js: Express Elasticsearch health proxy
const express = require('express');
const axios = require('axios');

const app = express();
const ES_URL = process.env.ELASTICSEARCH_URL;

app.get('/health/elasticsearch', async (req, res) => {
  try {
    const { data } = await axios.get(`${ES_URL}/_cluster/health`, { timeout: 5000 });
    const status = data.status;
    if (status === 'red') {
      return res.status(503).json({ status: 'red', detail: data });
    }
    return res.status(200).json({ status, detail: data });
  } catch (err) {
    return res.status(503).json({ status: 'down', error: err.message });
  }
});

app.listen(3001);
Verify it manually before adding to Vigilmon:
curl -i https://your-app.example.com/health/elasticsearch
# HTTP/1.1 200 OK
# {"status":"green","detail":{"cluster_name":"my-cluster","status":"green",...}}
Step 2: Configure a Vigilmon HTTP Monitor for Elasticsearch
- Log in to vigilmon.online and go to Monitors → New Monitor
- Choose HTTP / HTTPS
- Set the URL to your Elasticsearch health endpoint
- Set the check interval to 1 minute
- Under Expected response, configure:
  - Status code: 200
  - Response body must not contain: "status":"red"
  - Response time threshold: 3000ms
- Under Alert channels, assign your Slack or PagerDuty channel
- Save the monitor
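The three expectations configured above (status code, forbidden body text, response time) can be reasoned about with a small sketch. This is not Vigilmon's implementation; the ProbeResult type and evaluate function are hypothetical names used only to show how the rules combine.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    status_code: int
    body: str
    elapsed_ms: float


def evaluate(result: ProbeResult,
             expected_status: int = 200,
             forbidden_text: str = '"status":"red"',
             max_ms: float = 3000.0) -> list[str]:
    """Return the failed expectations; an empty list means the check passed."""
    failures = []
    if result.status_code != expected_status:
        failures.append(f"status code {result.status_code} != {expected_status}")
    if forbidden_text in result.body:
        failures.append(f"body contains {forbidden_text}")
    if result.elapsed_ms > max_ms:
        failures.append(f"response took {result.elapsed_ms:.0f}ms > {max_ms:.0f}ms")
    return failures


if __name__ == "__main__":
    print(evaluate(ProbeResult(200, '{"status":"green"}', 120.0)))
    print(evaluate(ProbeResult(503, '{"status":"red"}', 120.0)))
```

One thing the sketch makes visible: a plain substring rule is whitespace-sensitive. A body serialized as "status": "red" (with a space after the colon) would not match "status":"red", so it is worth checking how your endpoint formats its JSON and relying on the status code as the primary signal.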
Vigilmon probes from multiple geographic regions simultaneously and requires multi-region consensus before declaring an incident. You get confident, actionable alerts rather than false positives from transient probe blips.
What This Catches
| Failure | Internal tools | Vigilmon |
|---|---|---|
| Elasticsearch process crash | ✓ | ✓ |
| Cluster status red (write failures) | ✗ | ✓ |
| Cluster status yellow (replica loss) | ✗ | ✓ |
| Network partition from app to ES | ✗ | ✓ |
| Authentication/TLS misconfiguration | ✗ | ✓ |
Step 3: Monitor Shard Allocation and Index Replication
Beyond the cluster-level health check, you may want visibility into specific index health — particularly for write-heavy indices where shard imbalance or underreplication can cause subtle data loss.
Extend your health endpoint to include index-level detail:
@app.get("/health/elasticsearch/index/{index_name}")
async def index_health(index_name: str):
    try:
        async with httpx.AsyncClient() as client:
            resp = await client.get(
                f"{ES_URL}/_cluster/health/{index_name}",
                params={"level": "shards"},
                timeout=5,
            )
        # An error body (for example a 404) must not be read as a healthy index
        resp.raise_for_status()
        data = resp.json()
        unassigned = data.get("unassigned_shards", 0)
        status = data.get("status", "unknown")
        if status == "red" or unassigned > 0:
            return JSONResponse(status_code=503, content={
                "status": status,
                "unassigned_shards": unassigned,
                "detail": data,
            })
        return JSONResponse(status_code=200, content={"status": status, "detail": data})
    except Exception as e:
        return JSONResponse(status_code=503, content={"status": "down", "error": str(e)})
Create a separate Vigilmon monitor for each critical index:
- https://your-app.example.com/health/elasticsearch/index/products
- https://your-app.example.com/health/elasticsearch/index/orders
- https://your-app.example.com/health/elasticsearch/index/logs
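When an index monitor fires, the level=shards detail tells you which shards are affected. The sketch below walks that nested structure and lists every shard that is not green. The sample payload is hand-written for illustration (a real response has more fields per shard), and problem_shards is a hypothetical helper name.

```python
def problem_shards(health: dict) -> list[tuple[str, str, str]]:
    """List (index, shard id, status) for every shard that is not green."""
    problems = []
    for index_name, index in health.get("indices", {}).items():
        for shard_id, shard in index.get("shards", {}).items():
            if shard.get("status") != "green":
                problems.append((index_name, shard_id, shard.get("status", "unknown")))
    return sorted(problems)


# Hand-written sample in the shape returned with level=shards.
SAMPLE = {
    "status": "yellow",
    "unassigned_shards": 1,
    "indices": {
        "products": {
            "status": "yellow",
            "shards": {
                "0": {"status": "green", "primary_active": True, "unassigned_shards": 0},
                "1": {"status": "yellow", "primary_active": True, "unassigned_shards": 1},
            },
        }
    },
}

if __name__ == "__main__":
    for index_name, shard_id, status in problem_shards(SAMPLE):
        print(f"{index_name}[{shard_id}] is {status}")
```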
Step 4: Heartbeat Monitoring for Elasticsearch Indexing Jobs
Indexing pipelines are often the first thing to break when Elasticsearch degrades, and they tend to fail silently. A Logstash pipeline, a custom Python indexer, or a Kafka consumer writing to Elasticsearch can stop making progress without crashing or raising an error that your process monitor sees.
Vigilmon's heartbeat monitors detect silent pipeline death. Your indexing job pings Vigilmon after each successful processing cycle. When pings stop arriving, Vigilmon fires an alert.
Set Up the Heartbeat Monitor
- In Vigilmon, go to Monitors → New Monitor → Heartbeat
- Set the name: elasticsearch-indexer
- Set the expected interval based on your job frequency (e.g., 5 minutes for a near-real-time indexer)
- Set the grace period: 10 minutes
- Save and copy the heartbeat URL, e.g. https://vigilmon.online/heartbeat/abc123xyz
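To reason about how the interval and grace period interact, here is a small sketch. It assumes the monitor alerts once no ping has arrived for longer than the expected interval plus the grace period; that rule is an assumption for illustration, not a documented description of Vigilmon's behavior, and is_overdue is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_ping: datetime, now: datetime,
               interval: timedelta, grace: timedelta) -> bool:
    """True when no ping has arrived within interval + grace (assumed semantics)."""
    return now - last_ping > interval + grace


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    interval = timedelta(minutes=5)
    grace = timedelta(minutes=10)
    for minutes in (4, 14, 16):
        now = last + timedelta(minutes=minutes)
        print(minutes, is_overdue(last, now, interval, grace))
```

Under that assumption, the settings above tolerate up to 15 minutes of silence. A shorter grace period detects a stalled indexer sooner but is more likely to alert on a single slow batch.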
Wire It Into Your Indexing Pipeline
Python indexer:
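import os

import requests

VIGILMON_HB = os.environ["VIGILMON_HEARTBEAT_URL"]


def index_documents(docs):
    # es is your existing Elasticsearch client; build_bulk_actions is your own helper
    resp = es.bulk(body=build_bulk_actions(docs))
    # The bulk API reports per-item failures in the response body rather than raising
    if resp["errors"]:
        raise RuntimeError("bulk request reported item failures")
    # Ping Vigilmon only after confirmed successful index
    requests.get(VIGILMON_HB, timeout=5)

# Call index_documents in your processing loop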
Node.js indexer:
const axios = require('axios');

async function indexBatch(docs) {
  await esClient.bulk({ body: buildBulkActions(docs) });
  // Confirm success before pinging heartbeat
  await axios.get(process.env.VIGILMON_HEARTBEAT_URL).catch(() => {});
}
Logstash pipeline (via HTTP output):
output {
  elasticsearch {
    hosts => ["http://es:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
  http {
    url => "${VIGILMON_HEARTBEAT_URL}"
    http_method => "get"
  }
}
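A custom indexer that pings after every batch can send far more heartbeats than the monitor needs, and a failed ping should never interrupt indexing. The sketch below shows one way to handle both concerns: a wrapper that pings at most once per window and swallows delivery errors. ThrottledHeartbeat and send_ping are hypothetical names, and the 60-second default is an arbitrary choice that should stay well below the expected interval configured on the monitor.

```python
import time
import urllib.request
from typing import Callable, Optional


def send_ping(url: str) -> None:
    """Default sender: a plain GET with a short timeout."""
    with urllib.request.urlopen(url, timeout=5):
        pass


class ThrottledHeartbeat:
    """Ping at most once per min_interval seconds and never raise into the caller."""

    def __init__(self, url: str, min_interval: float = 60.0,
                 sender: Callable[[str], None] = send_ping,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.url = url
        self.min_interval = min_interval
        self._sender = sender
        self._clock = clock
        self._last_sent: Optional[float] = None

    def beat(self) -> bool:
        """Send a ping if one is due; return True only when a ping was delivered."""
        now = self._clock()
        if self._last_sent is not None and now - self._last_sent < self.min_interval:
            return False  # still inside the throttle window
        try:
            self._sender(self.url)
        except Exception:
            return False  # monitoring failures must not break indexing
        self._last_sent = now
        return True
```

In an indexing loop you would create one instance at startup and call beat() after each successful batch. The sender and clock parameters exist so the behavior can be tested without network access or real waiting.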
Step 5: Alert Routing for Yellow and Red Status
Elasticsearch cluster health follows a predictable degradation path: green → yellow (replica loss) → red (primary shard loss, write failures). Each level needs a different alert priority.
In Vigilmon:
- Cluster health monitor (returns 503 on red) → immediate Slack + PagerDuty page (P1)
- Index health monitors (returns 503 on unassigned shards) → Slack + email (P2)
- Heartbeat monitors (indexing pipeline) → Slack + email (P2)
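If you keep alert routing documented alongside your code, the mapping above can be written down as plain data. This is only a sketch of the policy for review purposes: the routing itself is configured in Vigilmon, and the key names and the route_alert helper are hypothetical.

```python
# Routing policy from the list above, expressed as data.
ROUTES = {
    "cluster_health": {"priority": "P1", "channels": ["slack", "pagerduty"]},
    "index_health": {"priority": "P2", "channels": ["slack", "email"]},
    "heartbeat": {"priority": "P2", "channels": ["slack", "email"]},
}


def route_alert(monitor_type: str) -> dict:
    """Look up the priority and channels for a monitor type."""
    try:
        return ROUTES[monitor_type]
    except KeyError:
        raise ValueError(f"no route configured for monitor type {monitor_type!r}") from None


if __name__ == "__main__":
    for name in ROUTES:
        route = route_alert(name)
        print(name, route["priority"], ", ".join(route["channels"]))
```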
Configure response time thresholds as early warnings:
- Alert at 2000ms for the health endpoint (slow health responses often precede node pressure issues)
- Alert at 5000ms for application endpoints backed by Elasticsearch
Create an Elasticsearch Status Page in Vigilmon that groups the cluster health monitor, index monitors, and indexer heartbeats together. Share this page with your data team so they have a single pane of glass during incidents.
Summary
Elasticsearch is a complex distributed system where silent degradation (unassigned shards, stale replicas, stalled indexers) is common. Vigilmon gives you:
| Monitor Type | What It Covers |
|---|---|
| HTTP monitor on /_cluster/health | Overall cluster status (green/yellow/red) |
| HTTP monitor on index health | Shard allocation and replication per index |
| Heartbeat monitor | Indexing pipeline and job liveness |
Get started free at vigilmon.online — your first Elasticsearch monitor is running in under two minutes.