Elasticsearch and OpenSearch power search, logging, and analytics in production systems ranging from e-commerce search bars to security information and event management (SIEM) pipelines. When a cluster degrades to yellow or drops to red, search latency spikes, ingestion backs up, and data can be permanently lost if shards go unassigned long enough. Vigilmon gives you external visibility into your cluster's health: the cluster health API status, index availability, and the green/yellow/red traffic-light signal that tells you exactly how serious a problem is.
What You'll Build
- A monitor on Elasticsearch's /_cluster/health API with keyword checks for cluster status
- Index-level availability checks for your most critical indices
- SSL certificate monitoring for your Elasticsearch endpoint
- An alerting setup that distinguishes yellow (degraded) from red (data unavailable) cluster states
Prerequisites
- A running Elasticsearch 7+ or OpenSearch 1+ cluster with HTTP API access
- An Elasticsearch endpoint accessible over HTTP or HTTPS (e.g., https://elastic.example.com)
- A free account at vigilmon.online
Step 1: Verify the Cluster Health API
Elasticsearch exposes a cluster health endpoint at /_cluster/health that returns a JSON summary of the cluster's state:
curl https://elastic.example.com/_cluster/health
A healthy cluster returns:
{
  "cluster_name": "my-cluster",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 12,
  "active_shards": 24,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0
}
The status field is the key signal:
- green: All primary and replica shards are assigned and healthy
- yellow: All primary shards are assigned, but some replicas are unassigned (reduced redundancy; data is available but not fully protected)
- red: One or more primary shards are unassigned (some data is unavailable; searches or indexing may fail)
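As a rough illustration of how the three states map to alert severity, here is a minimal Python sketch. The severity labels and the `classify_health` helper are hypothetical names chosen for this example; only the `status` field comes from the cluster health response shown above.

```python
import json

# Hypothetical severity labels for this sketch; not part of any API.
SEVERITY = {"green": "ok", "yellow": "warning", "red": "critical"}


def classify_health(body: str) -> str:
    """Map a /_cluster/health JSON body to a severity label."""
    try:
        status = json.loads(body)["status"]
        return SEVERITY.get(status, "unknown")
    except (ValueError, KeyError, TypeError):
        # Not JSON, not an object, or no usable "status" field.
        return "unknown"


sample = '{"cluster_name": "my-cluster", "status": "yellow", "unassigned_shards": 4}'
print(classify_health(sample))  # warning
```

Treating anything unparseable as "unknown" rather than "ok" keeps a broken or truncated response from being mistaken for a healthy cluster.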
Authentication: If your cluster uses security (X-Pack or OpenSearch Security), add credentials:
curl -u elastic:password https://elastic.example.com/_cluster/health
Create a read-only monitoring user for Vigilmon rather than using the elastic superuser.
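The `curl -u` flag sends HTTP Basic authentication. For readers who script their own checks, the sketch below builds the equivalent request with Python's standard library. The user name `vigilmon_ro` and the password are placeholders for a dedicated read-only account, and `build_health_request` is a hypothetical helper; the request is only constructed here, not sent.

```python
import base64
import urllib.request


def build_health_request(base_url: str, user: str, password: str) -> urllib.request.Request:
    """Build a cluster health request carrying an HTTP Basic auth header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(base_url.rstrip("/") + "/_cluster/health")
    req.add_header("Authorization", f"Basic {token}")
    return req


# Placeholder credentials for a hypothetical read-only monitoring user.
req = build_health_request("https://elastic.example.com", "vigilmon_ro", "change-me")
print(req.full_url)  # https://elastic.example.com/_cluster/health
```

Passing the result to `urllib.request.urlopen(req, timeout=15)` would perform the actual call. On Elasticsearch, a role limited to the `monitor` cluster privilege is usually enough for this endpoint.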
Step 2: Create a Vigilmon HTTP Monitor for Cluster Health
- Log in to Vigilmon → Add Monitor → HTTP.
- URL: https://elastic.example.com/_cluster/health
- Check interval: 60 seconds.
- Response timeout: 15 seconds.
- Expected status: 200
- Keyword: "green" (matches the status field when the cluster is fully healthy).
- Click Save.
This monitor alerts when:
- The cluster goes down entirely (connection refused or 5xx)
- The cluster degrades to yellow or red (keyword "green" is absent from the response)
Note on yellow clusters: A keyword check for "green" will alert on both yellow and red states. This is intentional: yellow means replica loss and warrants attention even though primary data is available.
Step 3: Monitor for Red Cluster Status Separately
Add a second monitor specifically for red cluster status — this is your critical/urgent alert while the green-check is your warning:
- Add Monitor → HTTP.
- URL: https://elastic.example.com/_cluster/health
- Check interval: 60 seconds.
- Expected status: 200
- Keyword: "red", with the monitor configured to alert when the keyword is present (inverted keyword check).
- Label: Elasticsearch cluster RED (data unavailable).
A red cluster means primary shards are unassigned and some data cannot be read or written. This should trigger immediate escalation to on-call engineers.
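To make the interplay between the two monitors concrete, the sketch below models a keyword check and an inverted keyword check in Python. It illustrates the logic only; it is not Vigilmon's implementation. The function names are hypothetical, and it assumes that an unexpected HTTP status fails both monitors.

```python
def keyword_monitor_fails(status_code: int, body: str, keyword: str,
                          inverted: bool = False) -> bool:
    """Return True when the monitor should alert."""
    if status_code != 200:
        return True  # assumption: an unexpected status always fails
    present = keyword in body
    # Normal check fails when the keyword is missing;
    # inverted check fails when the keyword is present.
    return present if inverted else not present


def evaluate(status_code: int, body: str) -> dict:
    """Evaluate the green-check (warning) and the red-check (critical)."""
    return {
        "warning": keyword_monitor_fails(status_code, body, '"green"'),
        "critical": keyword_monitor_fails(status_code, body, '"red"', inverted=True),
    }


yellow = '{"cluster_name":"my-cluster","status":"yellow"}'
print(evaluate(200, yellow))  # {'warning': True, 'critical': False}
```

One caveat with plain substring matching: a cluster literally named `green` would satisfy the `"green"` keyword even while red. Since the API returns compact JSON by default, a narrower keyword such as `"status":"green"` avoids that false match.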
Step 4: Monitor Index Availability
For your most critical indices, add individual index-level health checks. The /_cluster/health/{index-name} endpoint narrows the health report to a single index:
curl https://elastic.example.com/_cluster/health/my-critical-index
This returns the same structure as cluster health but scoped to one index. Add a monitor for each business-critical index:
- Add Monitor → HTTP.
- URL: https://elastic.example.com/_cluster/health/my-critical-index
- Check interval: 2 minutes.
- Expected status: 200
- Keyword: "green" (or "yellow" if this index has replicas that can never be assigned, for example on a single-node cluster).
- Label: Index: my-critical-index.
Prioritize indices that:
- Back a user-facing search feature
- Receive real-time ingestion (logs, events, metrics)
- Are large enough that re-indexing would take hours
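If you track more than a handful of indices, generating the monitor URLs from a list avoids copy-paste mistakes. This is a small sketch; `index_health_urls` is a hypothetical helper, and `orders-v2` is an example index name that does not come from this guide.

```python
from urllib.parse import quote


def index_health_urls(base_url: str, indices: list) -> list:
    """Build one /_cluster/health/{index} URL per index name."""
    base = base_url.rstrip("/")
    # quote() percent-encodes characters that are unsafe in a URL path segment.
    return [f"{base}/_cluster/health/{quote(name, safe='')}" for name in indices]


for url in index_health_urls("https://elastic.example.com",
                             ["my-critical-index", "orders-v2"]):
    print(url)
```

Each printed URL can be pasted into a separate HTTP monitor using the settings listed above.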
Step 5: Monitor SSL Certificates
If your Elasticsearch cluster is exposed over HTTPS (required in production), add SSL certificate monitoring:
- Add Monitor → SSL Certificate.
- Domain: elastic.example.com
- Alert when expiry is within: 30 days.
- Alert again: 14 days, 7 days, 3 days, 1 day.
- Click Save.
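Elasticsearch certificate expiry is particularly disruptive because nodes use TLS for inter-node transport as well as for the HTTP API. If the HTTP API certificate expires, clients lose access; if a transport certificate expires, nodes can no longer authenticate to each other and may drop out of the cluster. An external SSL monitor only sees the certificate served on the HTTP endpoint, so track transport certificates separately unless the same certificate is used for both.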
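The reminder schedule above amounts to simple date arithmetic. The sketch below shows one way to compute it, assuming the certificate's `notAfter` value is in the text format that Python's `ssl` module reports (for example from `SSLSocket.getpeercert()`). The helper names are hypothetical, and the current time is passed in so the example is deterministic.

```python
import ssl

# Reminder thresholds from the monitor settings above, in days.
THRESHOLDS_DAYS = (30, 14, 7, 3, 1)


def days_until_expiry(not_after: str, now: float) -> int:
    """Whole days between `now` (epoch seconds) and the notAfter timestamp."""
    expires = ssl.cert_time_to_seconds(not_after)
    return int((expires - now) // 86400)


def crossed_thresholds(days_left: int) -> list:
    """Thresholds that have already been reached for this certificate."""
    return [t for t in THRESHOLDS_DAYS if days_left <= t]


not_after = "Jan  5 09:34:43 2018 GMT"  # example timestamp
now = ssl.cert_time_to_seconds(not_after) - 10 * 86400  # pretend it is 10 days earlier
left = days_until_expiry(not_after, now)
print(left, crossed_thresholds(left))  # 10 [30, 14]
```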
Step 6: Monitor the Cat Nodes API for Node Count
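A cluster can report green status while running at reduced capacity if node losses don't affect shard assignments. The /_cat/nodes endpoint lists the nodes currently in the cluster, which is useful for manual inspection:
curl 'https://elastic.example.com/_cat/nodes?h=name,node.role,heap.percent,disk.avail'
For automated monitoring, the number_of_nodes field in the cluster health response is easier to keyword-match. Add a monitor that checks for the expected node count: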
- Add Monitor → HTTP.
- URL: https://elastic.example.com/_cluster/health?pretty=false
- Check interval: 2 minutes.
- Keyword: "number_of_nodes":3, (replace 3 with your expected node count; the trailing comma keeps 3 from also matching 30 or 31).
- Label: Elasticsearch node count.
When a node leaves the cluster, this monitor fires even if the status stays green or returns to green once shards are reallocated, so a loss of capacity does not go unnoticed.
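Because a keyword check is a substring match, the node count keyword can match more than intended. The sketch below contrasts the substring test with parsing the JSON; `node_count_ok` is a hypothetical helper, and the sample body is made up for illustration.

```python
import json


def node_count_ok(body: str, expected: int) -> bool:
    """Compare the parsed number_of_nodes field against the expected count."""
    try:
        return json.loads(body)["number_of_nodes"] == expected
    except (ValueError, KeyError, TypeError):
        return False  # unparseable response counts as a failure


body = '{"status":"green","number_of_nodes":30,"number_of_data_nodes":30}'
print('"number_of_nodes":3' in body)   # True: the substring also matches 30
print('"number_of_nodes":3,' in body)  # False: the trailing comma pins the value
print(node_count_ok(body, 3))          # False
```

When the monitoring tool only supports keyword matching, including the character that follows the number in the compact JSON (here a comma) gives the same effect as an exact comparison.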
Step 7: Configure Alerting
In Vigilmon under Settings → Notifications, configure your alert channels:
| Monitor | Trigger | Action |
|---|---|---|
| Cluster health (green check) | Keyword green absent | Investigate cluster state; check unassigned shards |
| Cluster health (red check) | Keyword red present | Escalate immediately; data availability is impacted |
| Index health | Non-green status | Check shard allocation for that specific index |
| SSL certificate | < 30 days to expiry | Renew certificate; check cert-manager or manual renewal |
| Node count | Expected count absent | A data node has left the cluster; check node logs |
Alert after: 2 consecutive failures for yellow-detection monitors (brief shard movements during rolling restarts create transient yellow states). 1 failure for red-detection monitors — red means data unavailability.
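The consecutive-failure rule can be pictured as a small counter that resets on every successful check. This is an illustrative sketch of that idea, not a description of how Vigilmon implements it; the class name is hypothetical.

```python
class ConsecutiveFailureAlert:
    """Fire once when a streak of failed checks reaches the threshold."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.failures = 0

    def record(self, check_failed: bool) -> bool:
        """Record one check result; return True if an alert should fire now."""
        self.failures = self.failures + 1 if check_failed else 0
        return self.failures == self.threshold


yellow_monitor = ConsecutiveFailureAlert(threshold=2)  # tolerate one transient failure
red_monitor = ConsecutiveFailureAlert(threshold=1)     # alert on the first failure

# A single failed check followed by recovery does not alert the yellow monitor.
print([yellow_monitor.record(f) for f in (True, False, True, True)])
# [False, False, False, True]
print(red_monitor.record(True))  # True
```

With a 60-second interval, a threshold of 2 delays the yellow alert by roughly one extra check cycle, which is the trade-off for ignoring brief shard movements.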
Common Elasticsearch Failure Modes and What Vigilmon Catches
| Scenario | Vigilmon monitor |
|---|---|
| Cluster process down | /_cluster/health connection refused; alert within 60 s |
| Node OOM / crash | Node count monitor fires; cluster may go yellow/red |
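| Disk full on a data node | Indices become read-only at the flood-stage watermark; the cluster may go yellow or red if shards can no longer be allocated, at which point the green-check or red-check fires |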
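| Network partition splits cluster | Node count monitor fires; the cluster may go yellow or red, or become unreachable, depending on which side of the partition the monitor reaches |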
| Index mapping conflict blocks indexing | Not visible in shard health (the index can stay green while writes are rejected); needs an application-level indexing check |
| SSL certificate expires | SSL monitor alerts at 30-day threshold |
| Rolling restart causes transient yellow | Green-check fires; red-check stays silent (confirms non-critical) |
| Master node loss | Cluster may become unavailable; all monitors fire |
Elasticsearch clusters fail in ways that are often invisible to application error logs: a yellow cluster keeps serving reads with reduced redundancy, and a red cluster can return partial search results without an obvious error. Vigilmon's external monitoring of the cluster health API with green/yellow/red keyword checks gives you the traffic-light view of your cluster from outside the system, catching degraded states before they escalate to full data unavailability.
Start monitoring Elasticsearch in under 5 minutes — register free at vigilmon.online.