How to Monitor Elasticsearch Health with Vigilmon

Elasticsearch is a distributed system with a lot of moving parts (nodes, shards, replicas, indices), and when any of them misbehaves, query performance can degrade or writes can start failing before anyone notices. A cluster that reports yellow status has lost redundancy: at least one replica shard is unassigned. A cluster that goes red has at least one unassigned primary shard, so searches against the affected indices return partial results and writes to the missing shards fail. You need external, continuous monitoring to catch these transitions soon after they happen.

Vigilmon gives you HTTP probe monitoring for the Elasticsearch health API and heartbeat monitoring for your indexing pipelines. This tutorial shows you how to set both up.

Why Elasticsearch Health Monitoring Matters

Elasticsearch ships with built-in APIs that expose cluster state, but those APIs are only useful if something is actively checking them. Without external monitoring, your team finds out about cluster problems from:

User-facing search returning stale or incomplete results
Application error logs flooded with NoNodeAvailableException or BulkItemFailure
A source database or queue growing without bound because indexing jobs silently stopped consuming from it

Vigilmon probes the cluster health API from outside your infrastructure on a regular interval. It catches problems that process-level monitoring (systemd, Docker health checks) typically misses, because the Elasticsearch process can keep running while the cluster is unhealthy:

Cluster status transitions from green → yellow → red
Unassigned shards blocking index availability
Network partitions between your application and the Elasticsearch cluster
Indexing pipeline failures — jobs that silently stop without crashing

Step 1: Expose an Elasticsearch Health Endpoint

Elasticsearch exposes health information through two built-in REST endpoints:

GET /_cluster/health — overall cluster status (green, yellow, red), node count, shard counts
GET /_cat/health?v — human-readable cluster health table
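
To make the first endpoint concrete, here is a minimal sketch of how a script might interpret a /_cluster/health response. The field names (status, number_of_nodes, unassigned_shards) are part of the Elasticsearch cluster health API; the helper name summarize_health and the sample payload values are illustrative assumptions, not output from a real cluster.

```python
import json

# Trimmed sample of a /_cluster/health response body (values are made up).
SAMPLE = (
    '{"cluster_name":"my-cluster","status":"yellow",'
    '"number_of_nodes":3,"active_shards":40,"unassigned_shards":2}'
)


def summarize_health(body: str) -> dict:
    """Reduce a cluster health JSON body to the fields a probe cares about."""
    data = json.loads(body)
    status = data.get("status", "unknown")
    return {
        "status": status,
        "nodes": data.get("number_of_nodes", 0),
        "unassigned_shards": data.get("unassigned_shards", 0),
        # Only green and yellow mean every primary shard is allocated.
        "primaries_allocated": status in ("green", "yellow"),
    }


print(summarize_health(SAMPLE))
```

Treating anything other than green or yellow as unhealthy means an unexpected or missing status field fails closed rather than passing silently.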

For Vigilmon to probe these directly, the Elasticsearch HTTP port (default 9200) must be reachable from Vigilmon's probe network. Otherwise, add a thin health endpoint in your application layer that forwards the check. The application-layer approach (Option B) is preferred for clusters not exposed to the internet.

Option A: Direct Probe (cluster exposed via secure proxy)

If you have an Nginx or HAProxy reverse proxy in front of Elasticsearch with authentication, configure Vigilmon to probe:

https://es-proxy.example.com/_cluster/health

Supply Basic Auth credentials in the monitor configuration if the proxy requires them.
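
Before pointing a monitor at the proxy, it can help to reproduce the authenticated request yourself. The sketch below builds the same kind of request with Python's standard library. The function names and the credentials are placeholders chosen for illustration, and the URL is the example host from above.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_probe_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request carrying Basic Auth."""
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", basic_auth_header(username, password))
    return req


# Placeholder credentials; substitute the ones configured on your proxy.
req = build_probe_request(
    "https://es-proxy.example.com/_cluster/health", "monitor", "change-me"
)
# To actually send it: urllib.request.urlopen(req, timeout=5)
```

A dedicated, read-only account for the probe keeps the blast radius small if the credentials leak.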

Option B: Application-Layer Health Wrapper

For clusters on private networks, add a /health/elasticsearch endpoint to your application:

# healthcheck.py — FastAPI Elasticsearch health proxy
import os
import httpx
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
ES_URL = os.environ["ELASTICSEARCH_URL"]  # e.g. http://es:9200

@app.get("/health/elasticsearch")
async def elasticsearch_health():
    try:
        async with httpx.AsyncClient() as client:
            resp = await client.get(f"{ES_URL}/_cluster/health", timeout=5)
            resp.raise_for_status()  # a 4xx/5xx from Elasticsearch must not pass as healthy
            data = resp.json()

        status = data.get("status", "unknown")
        # Green and yellow return 200; red or any unexpected status returns 503
        if status in ("green", "yellow"):
            return JSONResponse(status_code=200, content={"status": status, "detail": data})
        return JSONResponse(status_code=503, content={"status": status, "detail": data})
    except Exception as e:
        return JSONResponse(status_code=503, content={"status": "down", "error": str(e)})

// healthcheck.js — Express Elasticsearch health proxy
const express = require('express');
const axios = require('axios');

const app = express();
const ES_URL = process.env.ELASTICSEARCH_URL;

app.get('/health/elasticsearch', async (req, res) => {
  try {
    const { data } = await axios.get(`${ES_URL}/_cluster/health`, { timeout: 5000 });
    const status = data.status;

    if (status === 'red') {
      return res.status(503).json({ status: 'red', detail: data });
    }
    return res.status(200).json({ status, detail: data });
  } catch (err) {
    return res.status(503).json({ status: 'down', error: err.message });
  }
});

app.listen(3001);

Verify it manually before adding to Vigilmon:

curl -i https://your-app.example.com/health/elasticsearch
# HTTP/1.1 200 OK
# {"status":"green","detail":{"cluster_name":"my-cluster","status":"green",...}}

Step 2: Configure a Vigilmon HTTP Monitor for Elasticsearch

Log in to vigilmon.online and go to Monitors → New Monitor
Choose HTTP / HTTPS
Set the URL to your Elasticsearch health endpoint
Set the check interval to 1 minute
Under Expected response, configure:
- Status code: 200
- Response body must not contain: "status":"red"
- Response time threshold: 3000ms
Under Alert channels, assign your Slack or PagerDuty channel
Save the monitor
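
The three expected-response rules can be read as a simple pass/fail evaluation of each probe. The following is a rough sketch of that logic under stated assumptions: ProbeResult and evaluate are hypothetical names, and this illustrates the rules as configured above rather than how Vigilmon implements them internally.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    status_code: int
    body: str
    elapsed_ms: float


def evaluate(result: ProbeResult,
             expected_status: int = 200,
             forbidden: str = '"status":"red"',
             max_ms: float = 3000.0) -> list:
    """Return a list of rule violations; an empty list means the probe passed."""
    failures = []
    if result.status_code != expected_status:
        failures.append(f"status code {result.status_code} != {expected_status}")
    if forbidden in result.body:
        failures.append(f"body contains forbidden text {forbidden}")
    if result.elapsed_ms > max_ms:
        failures.append(f"response took {result.elapsed_ms:.0f}ms > {max_ms:.0f}ms")
    return failures


healthy = ProbeResult(200, '{"status":"green"}', 120.0)
degraded = ProbeResult(200, '{"status":"red"}', 4200.0)
print(evaluate(healthy))
print(evaluate(degraded))
```

Note that a plain substring rule is whitespace-sensitive: a body formatted as "status" : "red" (with spaces around the colon) would not match, so probe the compact JSON form and avoid pretty-printing on the health URL.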

Vigilmon probes from multiple geographic regions simultaneously and requires multi-region consensus before declaring an incident. You get confident, actionable alerts rather than false positives from transient probe blips.

What This Catches

| Failure | Internal tools | Vigilmon |
|---|---|---|
| Elasticsearch process crash | ✓ | ✓ |
| Cluster status red (write failures) | ✗ | ✓ |
| Cluster status yellow (replica loss) | ✗ | ✓ |
| Network partition from app to ES | ✗ | ✓ |
| Authentication/TLS misconfiguration | ✗ | ✓ |

Step 3: Monitor Shard Allocation and Index Replication

Beyond the cluster-level health check, you may want visibility into the health of specific indices, particularly write-heavy ones, where unassigned shards can block writes and under-replication leaves data one node failure away from loss.

Extend your health endpoint to include index-level detail:

@app.get("/health/elasticsearch/index/{index_name}")
async def index_health(index_name: str):
    try:
        async with httpx.AsyncClient() as client:
            resp = await client.get(
                f"{ES_URL}/_cluster/health/{index_name}",
                params={"level": "shards"},
                timeout=5
            )
            resp.raise_for_status()  # a missing index returns 404, which must not pass as healthy
            data = resp.json()

        unassigned = data.get("unassigned_shards", 0)
        status = data.get("status", "unknown")

        if status == "red" or unassigned > 0:
            return JSONResponse(status_code=503, content={
                "status": status,
                "unassigned_shards": unassigned,
                "detail": data,
            })
        return JSONResponse(status_code=200, content={"status": status, "detail": data})
    except Exception as e:
        return JSONResponse(status_code=503, content={"status": "down", "error": str(e)})

Create a separate Vigilmon monitor for each critical index:

https://your-app.example.com/health/elasticsearch/index/products
https://your-app.example.com/health/elasticsearch/index/orders
https://your-app.example.com/health/elasticsearch/index/logs
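
If the list of critical indices changes often, generating the monitor URLs from a single list avoids typos. This is a small sketch: the base URL is the example host used above, and monitor_urls is a hypothetical helper name.

```python
from urllib.parse import quote

# Example base path from this tutorial; replace with your application's host.
BASE = "https://your-app.example.com/health/elasticsearch/index"


def monitor_urls(indices):
    """Build one health URL per index, percent-encoding unsafe characters."""
    return [f"{BASE}/{quote(name, safe='')}" for name in indices]


for url in monitor_urls(["products", "orders", "logs"]):
    print(url)
```

Percent-encoding the index name keeps characters such as a slash from being interpreted as extra path segments.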

Step 4: Heartbeat Monitoring for Elasticsearch Indexing Jobs

Indexing pipelines are often the first thing to break when Elasticsearch degrades, and they can fail silently. A Logstash pipeline, a custom Python indexer, or a Kafka consumer writing to Elasticsearch can stop making progress without crashing or raising an error that your process monitor sees.

Vigilmon's heartbeat monitors detect silent pipeline death. Your indexing job pings Vigilmon after each successful processing cycle. When pings stop arriving, Vigilmon fires an alert.

Set Up the Heartbeat Monitor

In Vigilmon, go to Monitors → New Monitor → Heartbeat
Set the name: elasticsearch-indexer
Set the expected interval based on your job frequency (e.g., 5 minutes for a near-real-time indexer)
Set the grace period: 10 minutes
Save and copy the heartbeat URL, e.g. https://vigilmon.online/heartbeat/abc123xyz
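
To reason about how the interval and grace period interact, it helps to model the deadline explicitly. The sketch below assumes a heartbeat counts as missed once the expected interval plus the grace period has elapsed since the last ping. That combination rule is an assumption made for illustration, so check how Vigilmon defines it before relying on exact timings.

```python
from datetime import datetime, timedelta, timezone

INTERVAL = timedelta(minutes=5)   # expected ping frequency from the steps above
GRACE = timedelta(minutes=10)     # extra slack before alerting


def is_overdue(last_ping: datetime, now: datetime,
               interval: timedelta = INTERVAL,
               grace: timedelta = GRACE) -> bool:
    """True once more than interval + grace has passed since the last ping."""
    return now - last_ping > interval + grace


last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_overdue(last, last + timedelta(minutes=14)))
print(is_overdue(last, last + timedelta(minutes=16)))
```

Under this model, a job that normally pings every 5 minutes would trigger an alert a little over 15 minutes after its last successful cycle, so size the grace period to cover your longest legitimate batch.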

Wire It Into Your Indexing Pipeline

Python indexer:

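import os
import requests
from elasticsearch import Elasticsearch

VIGILMON_HB = os.environ["VIGILMON_HEARTBEAT_URL"]
es = Elasticsearch(os.environ["ELASTICSEARCH_URL"])

def index_documents(docs):
    resp = es.bulk(body=build_bulk_actions(docs))  # build_bulk_actions: your own helper
    # bulk() reports per-item failures in the response instead of raising
    if resp["errors"]:
        raise RuntimeError("bulk request had item failures")
    # Ping Vigilmon only after confirmed successful index
    try:
        requests.get(VIGILMON_HB, timeout=5)
    except requests.RequestException:
        pass  # a failed ping should not crash the indexer

# Call index_documents in your processing loop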
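
A bulk request can return successfully while individual documents are rejected, so "confirmed successful" should mean inspecting the response. The sketch below pulls the failed items out of a bulk response. The response shape (an errors flag plus an items list keyed by action type) follows the Elasticsearch Bulk API; failed_items and the sample response are illustrative assumptions.

```python
def failed_items(bulk_response: dict) -> list:
    """Collect the items of a bulk response that carry an error."""
    failures = []
    for item in bulk_response.get("items", []):
        # Each item has a single key naming its action:
        # index, create, update, or delete.
        for action, result in item.items():
            if "error" in result:
                failures.append({
                    "action": action,
                    "id": result.get("_id"),
                    "status": result.get("status"),
                    "error": result["error"],
                })
    return failures


# Made-up response: one document indexed, one rejected by the mapping.
sample = {
    "took": 12,
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception"}}},
    ],
}

for failure in failed_items(sample):
    print(failure["id"], failure["status"], failure["error"]["type"])
```

Skipping the heartbeat ping whenever this list is non-empty makes partial failures visible as a missed heartbeat instead of a silent data gap.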

Node.js indexer:

const axios = require('axios');

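async function indexBatch(docs) {
  const resp = await esClient.bulk({ body: buildBulkActions(docs) });
  // bulk() resolves even when individual items fail; check the errors flag
  // (client v7 wraps the result in resp.body, v8 returns it directly)
  const result = resp.body ?? resp;
  if (result.errors) throw new Error('bulk request had item failures');
  // Confirm success before pinging heartbeat; ignore ping failures
  await axios.get(process.env.VIGILMON_HEARTBEAT_URL, { timeout: 5000 }).catch(() => {});
}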

Logstash pipeline (via HTTP output, which by default sends one request per event, so expect a heartbeat ping for every indexed event):

output {
  elasticsearch {
    hosts => ["http://es:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
  http {
    url => "${VIGILMON_HEARTBEAT_URL}"
    http_method => "get"
  }
}

Step 5: Alert Routing for Yellow and Red Status

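Elasticsearch cluster health follows a predictable degradation path: green → yellow (replica loss) → red (primary shard loss, write failures). Each level needs a different alert priority. The health wrapper from Step 1 returns 200 on yellow, so a yellow cluster alerts no one unless you add a separate monitor or body-match rule for "status":"yellow"; the routing below covers red.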

In Vigilmon:

Cluster health monitor (returns 503 on red) → immediate Slack + PagerDuty page (P1)
Index health monitors (returns 503 on unassigned shards) → Slack + email (P2)
Heartbeat monitors (indexing pipeline) → Slack + email (P2)
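
Writing the routing down as data makes it easy to review and keep consistent across monitors. The sketch below encodes the three rules above as a lookup table. The monitor type keys, the ROUTES structure, and the route function are hypothetical and do not reflect Vigilmon's configuration format.

```python
# Routing rules from the list above, expressed as a lookup table.
ROUTES = {
    "cluster_health": {"priority": "P1", "channels": ["slack", "pagerduty"]},
    "index_health": {"priority": "P2", "channels": ["slack", "email"]},
    "heartbeat": {"priority": "P2", "channels": ["slack", "email"]},
}


def route(monitor_type: str) -> dict:
    """Return the priority and channels for a monitor type."""
    try:
        return ROUTES[monitor_type]
    except KeyError:
        # Fail loudly so a new monitor type cannot go unrouted by accident.
        raise ValueError(f"no alert route defined for {monitor_type!r}") from None


print(route("cluster_health"))
```

Raising on an unknown monitor type is a deliberate choice here: a silent default could leave a new monitor without anyone on call for it.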

Configure response time thresholds as early warnings:

Alert at 2000ms for the health endpoint (slow health responses often precede node pressure issues)
Alert at 5000ms for application endpoints backed by Elasticsearch

Create an Elasticsearch Status Page in Vigilmon that groups the cluster health monitor, index monitors, and indexer heartbeats together. Share this page with your data team so they have a single pane of glass during incidents.

Summary

Elasticsearch is a complex distributed system where silent degradation (unassigned shards, stale replicas, stalled indexers) is common. Vigilmon gives you:

| Monitor Type | What It Covers |
|---|---|
| HTTP monitor on /_cluster/health | Overall cluster status (green/yellow/red) |
| HTTP monitor on index health | Shard allocation and replication per index |
| Heartbeat monitor | Indexing pipeline and job liveness |

Get started free at vigilmon.online — your first Elasticsearch monitor is running in under two minutes.