Monitoring HashiCorp Nomad with Vigilmon: Agent Health, Leader Availability, Node Counts & SSL Alerts

HashiCorp Nomad schedules and runs workloads across your infrastructure — containers, binaries, Java applications, and more. When Nomad's server cluster loses its leader, new job submissions stall and running allocations can't be rescheduled. When client nodes become unhealthy or disconnect, tasks fail over silently and capacity shrinks. Vigilmon gives you external visibility into Nomad's health: the agent health endpoint, leader availability, node count checks, and SSL certificate expiry — so you know about cluster degradation before your deployments do.

What You'll Build

A monitor on Nomad's /v1/agent/health endpoint for agent and server availability
A leader availability check to detect quorum loss before deployments fail
A node count monitor to catch client nodes dropping off the cluster
SSL certificate monitoring for your Nomad endpoint
An alerting setup that distinguishes leader elections from full cluster failures

Prerequisites

A running HashiCorp Nomad 1.5+ cluster with at least one server and one client node
The Nomad HTTP API accessible over HTTPS (e.g., https://nomad.example.com)
A free account at vigilmon.online

Step 1: Understand Nomad's Agent Health Endpoint

Nomad's /v1/agent/health endpoint reports the health of the agent that answers the request. For a server agent, the check covers whether the server can see a cluster leader; for a client agent, whether the client knows of any servers to talk to. An agent running both roles reports both:

curl https://nomad.example.com/v1/agent/health

A healthy server returns:

{
  "server": {
    "ok": true,
    "message": "ok"
  }
}

A healthy client returns:

{
  "client": {
    "ok": true,
    "message": "ok"
  }
}

If ok is false or the endpoint returns a non-200 status, the agent is unhealthy and may not be participating in the cluster correctly.
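The rule above can be sketched as a small function. This is a minimal illustration, assuming the response shape shown in this step; the function name agent_health_ok is hypothetical and not part of Nomad or Vigilmon.

```python
import json


def agent_health_ok(status_code: int, body: str) -> bool:
    """Return True only if the status is 200 and every reported role is ok."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict) or not payload:
        return False
    # Each top-level key ("server", "client") carries its own "ok" flag.
    return all(
        isinstance(role, dict) and role.get("ok") is True
        for role in payload.values()
    )


if __name__ == "__main__":
    sample = '{"server": {"ok": true, "message": "ok"}}'
    print(agent_health_ok(200, sample))
```

Parsing the JSON rather than searching for text means an agent that reports one healthy role and one unhealthy role is treated as unhealthy.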

ACL note: Nomad's API documentation lists /v1/agent/health and /v1/status/leader as requiring no ACL token, so these checks should keep working with ACLs enabled. The /v1/nodes endpoint used in Step 4 does require one: create a read-only monitoring token whose policy grants node:read and pass it as -H "X-Nomad-Token: your-token". If a proxy in front of Nomad enforces its own authentication, allow the health path through it.
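If an endpoint you monitor does need a token, the X-Nomad-Token header from the note above can be attached from a script as well as from curl. The sketch below uses only Python's standard library; the helper name nomad_request and the example token are placeholders.

```python
from typing import Optional
import urllib.request


def nomad_request(url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request, adding the Nomad ACL token header when one is given."""
    request = urllib.request.Request(url, method="GET")
    if token:
        request.add_header("X-Nomad-Token", token)
    return request


if __name__ == "__main__":
    req = nomad_request("https://nomad.example.com/v1/nodes", token="your-token")
    # To actually send it: urllib.request.urlopen(req, timeout=10)
    print(req.full_url, req.header_items())
```

Keep the token read-only and scoped as narrowly as your policy allows, since it will live in your monitoring configuration.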

Step 2: Create a Vigilmon HTTP Monitor for Agent Health

Log in to Vigilmon → Add Monitor → HTTP.
URL: https://nomad.example.com/v1/agent/health.
Check interval: 60 seconds.
Response timeout: 10 seconds.
Expected status: 200.
Keyword: "ok":true (confirms the agent health check passed).
Click Save.
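A keyword check is sensitive to whitespace, which matters because the JSON samples in Step 1 are pretty-printed for readability. The sketch below assumes the monitor does a plain substring match (an assumption about Vigilmon's behavior, not a documented fact) and shows why the compact form "ok":true only matches a compact response body.

```python
import json


def keyword_check(status_code: int, body: str,
                  expected_status: int = 200,
                  keyword: str = '"ok":true') -> bool:
    """Assumed monitor logic: exact status match plus plain substring match."""
    return status_code == expected_status and keyword in body


def compact_json(body: str) -> str:
    """Re-serialize JSON without whitespace so a compact keyword can match."""
    return json.dumps(json.loads(body), separators=(",", ":"))


if __name__ == "__main__":
    compact = '{"server":{"ok":true,"message":"ok"}}'
    pretty = '{\n  "server": {\n    "ok": true\n  }\n}'
    print(keyword_check(200, compact))               # compact body
    print(keyword_check(200, pretty))                # pretty-printed body
    print(keyword_check(200, compact_json(pretty)))  # normalized body
```

Before saving the monitor, run the curl command from Step 1 without any pretty-printing and confirm the keyword appears in the raw output exactly as you typed it.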

This monitor fires when:

The Nomad server process crashes or stops
The agent fails its own health checks (e.g., Raft errors, storage issues)
The host becomes unreachable
A deployment upgrade leaves the agent in a degraded state

Alert sensitivity: Set to trigger after 1 consecutive failure. A failed Nomad server agent means job scheduling is impaired.

Step 3: Monitor Leader Availability

In a Nomad server cluster, the leader handles all job scheduling decisions. If the cluster loses quorum and cannot elect a leader, no new jobs can be submitted, failed allocations are not rescheduled, and nomad job run commands fail (typically with a "No cluster leader" error) or hang. Check leader availability via the /v1/status/leader endpoint:

curl https://nomad.example.com/v1/status/leader

When a leader is elected, this returns the leader's address as a non-empty JSON string (IP and server RPC port):

"10.0.1.5:4647"

When there is no leader, the response is either an empty string "" with a 200 status code or an error status, depending on how the request fails.
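Both no-leader cases described above can be handled by one small parser. This is a sketch under the assumption that the body is a single JSON string; leader_address is a hypothetical helper name.

```python
import json
from typing import Optional


def leader_address(status_code: int, body: str) -> Optional[str]:
    """Return the leader's address, or None when no leader is reported."""
    if status_code != 200:
        return None
    try:
        value = json.loads(body)
    except json.JSONDecodeError:
        return None
    # An empty or non-string value means no leader is currently known.
    if not isinstance(value, str) or not value.strip():
        return None
    return value


if __name__ == "__main__":
    print(leader_address(200, '"10.0.1.5:4647"'))
    print(leader_address(200, '""'))
```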

Add a Vigilmon keyword monitor that fails on an empty leader response:

Add Monitor → HTTP.
URL: https://nomad.example.com/v1/status/leader.
Check interval: 60 seconds.
Expected status: 200.
Keyword: :4647 (the default server RPC port, which appears in the leader address unless you have changed ports.rpc in the server configuration; adjust the keyword if you have).
Label: Nomad leader availability.
Click Save.

During leader elections: Nomad's Raft protocol elects a new leader in seconds under normal conditions. If this monitor fires and recovers within 1–2 check cycles, a routine leader election occurred. If it fires and stays alerting, the cluster has lost quorum — check that a majority of server nodes are up.
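The triage rule above can be written down explicitly. The sketch below is a heuristic, not something Vigilmon or Nomad provides: the two-cycle cutoff mirrors the "1–2 check cycles" guidance, and has_quorum applies the usual Raft majority rule of floor(n/2) + 1 servers.

```python
from typing import Sequence


def trailing_failures(results: Sequence[bool]) -> int:
    """Count failed checks at the end of the history (oldest result first)."""
    count = 0
    for passed in reversed(results):
        if passed:
            break
        count += 1
    return count


def classify_leader_state(results: Sequence[bool], election_cycles: int = 2) -> str:
    """Label the current state from recent leader-check results."""
    failures = trailing_failures(results)
    if failures == 0:
        return "healthy"
    if failures <= election_cycles:
        return "possible leader election"
    return "suspected quorum loss"


def has_quorum(servers_up: int, servers_total: int) -> bool:
    """A Raft cluster needs a strict majority of its servers to elect a leader."""
    return servers_up >= servers_total // 2 + 1


if __name__ == "__main__":
    print(classify_leader_state([True, True, False]))
    print(has_quorum(servers_up=2, servers_total=3))
```

With three servers you can lose one and keep a leader; with five, two. An even server count adds no extra failure tolerance.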

Step 4: Monitor Node Count

Client node count is your cluster's capacity signal. When nodes drop off due to crashes, network partitions, or failed drain operations, your cluster can schedule fewer workloads and existing allocations may not be reschedulable. Check the node list:

curl https://nomad.example.com/v1/nodes | jq 'length'

The full list also reveals node status:

curl https://nomad.example.com/v1/nodes | jq '[.[] | select(.Status == "ready")] | length'
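The same count can be computed in a script if you want to feed it into your own tooling. This sketch mirrors the jq filter above; the require_eligible option is an addition that assumes each node entry also carries a SchedulingEligibility field, which lets you exclude nodes that are ready but not accepting new work.

```python
import json


def count_ready_nodes(body: str, require_eligible: bool = False) -> int:
    """Count nodes whose Status is "ready" in a /v1/nodes response body."""
    nodes = json.loads(body)
    ready = [n for n in nodes if n.get("Status") == "ready"]
    if require_eligible:
        # Optionally drop nodes that are up but marked ineligible for scheduling.
        ready = [n for n in ready if n.get("SchedulingEligibility") == "eligible"]
    return len(ready)


if __name__ == "__main__":
    sample = json.dumps([
        {"Name": "client-1", "Status": "ready", "SchedulingEligibility": "eligible"},
        {"Name": "client-2", "Status": "ready", "SchedulingEligibility": "ineligible"},
        {"Name": "client-3", "Status": "down", "SchedulingEligibility": "eligible"},
    ])
    print(count_ready_nodes(sample), count_ready_nodes(sample, require_eligible=True))
```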

Add a monitor that checks the node list endpoint is reachable and returning data:

Add Monitor → HTTP.
URL: https://nomad.example.com/v1/nodes.
Check interval: 5 minutes.
Expected status: 200.
Keyword: "Status":"ready" (at least one ready node in the list).
Label: Nomad client nodes.
Click Save.

Scaling events: If you auto-scale your Nomad client pool, the node count fluctuates normally. Monitor for the "Status":"ready" keyword rather than a fixed count, so the monitor stays valid during scale-in and scale-out events.
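The keyword approach only proves that at least one node is ready. If you want a stronger signal without hard-coding a node count, one option is a proportional floor evaluated by your own script. The sketch below is an illustration outside of what a keyword monitor can do; the 50% default is an arbitrary example value.

```python
from typing import Dict, List


def ready_fraction(nodes: List[Dict]) -> float:
    """Fraction of registered nodes whose Status is "ready" (0.0 for an empty list)."""
    if not nodes:
        return 0.0
    ready = sum(1 for n in nodes if n.get("Status") == "ready")
    return ready / len(nodes)


def capacity_ok(nodes: List[Dict], min_fraction: float = 0.5) -> bool:
    """True when at least min_fraction of registered nodes are ready."""
    return ready_fraction(nodes) >= min_fraction


if __name__ == "__main__":
    pool = [{"Status": "ready"}, {"Status": "down"}, {"Status": "down"}]
    print(ready_fraction(pool), capacity_ok(pool))
```

Because the floor is relative to the nodes currently registered, it keeps working as the pool scales in and out.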

Step 5: Monitor the Nomad Web UI (if Enabled)

The Nomad web UI is served from the same HTTP listener as the API. A UI availability check exercises the frontend routing layer and can catch reverse proxy misconfigurations that the API monitor wouldn't detect:

curl https://nomad.example.com/ui/

Add Monitor → HTTP.
URL: https://nomad.example.com/ui/.
Check interval: 5 minutes.
Expected status: 200.
Keyword: Nomad (appears in the page title of the Nomad UI).
Label: Nomad Web UI.
Click Save.

Step 6: Monitor SSL Certificates

Every consumer of the Nomad HTTP API (the CLI, agents, and your own applications) validates the TLS certificate on each connection. An expired certificate causes all nomad CLI commands and any CI/CD pipelines that call the API to fail with TLS errors. Add a certificate monitor:

Add Monitor → SSL Certificate.
Domain: nomad.example.com.
Alert when expiry is within: 30 days.
Alert again: 14 days, 7 days, 3 days, 1 day.
Click Save.
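The reminder schedule above amounts to a ladder of thresholds. As a rough sketch of how the current alert level follows from the days remaining (the function name is hypothetical, and this is not a description of Vigilmon's internals):

```python
from typing import Optional, Sequence


def crossed_threshold(days_left: float,
                      thresholds: Sequence[int] = (30, 14, 7, 3, 1)) -> Optional[int]:
    """Return the tightest threshold that days_left has reached, or None."""
    crossed = [t for t in thresholds if days_left <= t]
    return min(crossed) if crossed else None


if __name__ == "__main__":
    for days in (45, 30, 10, 2, 0):
        print(days, crossed_threshold(days))
```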

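Internal CA: If your Nomad cluster uses certificates from an internal CA (e.g., Vault PKI), those certificates are often shorter-lived (90 days or less). Check your actual expiry with openssl s_client -connect nomad.example.com:443 -servername nomad.example.com </dev/null 2>/dev/null | openssl x509 -noout -dates and raise your alert threshold if needed so the first alert still leaves time to renew. Redirecting stdin from /dev/null makes s_client exit after the handshake instead of waiting for input.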
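If you would rather compute the remaining days in a script than read the dates by eye, Python's standard ssl module can do it. This is a minimal sketch: fetch_not_after assumes the certificate chains to a CA your system trusts (for an internal CA you would pass its bundle via ssl.create_default_context(cafile=...)), and both function names are illustrative.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until(not_after: str, now: datetime) -> float:
    """Days between now and a certificate notAfter string such as
    "Jan  5 09:34:43 2018 GMT"."""
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect with certificate verification and return the notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    # Replace the host with your own Nomad endpoint.
    not_after = fetch_not_after("nomad.example.com")
    print(round(days_until(not_after, datetime.now(timezone.utc)), 1))
```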

Step 7: Configure Alerting

In Vigilmon under Settings → Notifications, configure your alert channels:

| Monitor | Trigger | Action |
|---|---|---|
| /v1/agent/health | Non-200 or "ok":true missing | Check Nomad server process; review journalctl -u nomad |
| Leader availability | :4647 missing from response | Check quorum; ensure majority of servers are up and network is healthy |
| Client nodes | "Status":"ready" missing | Check client node health; review drain status; verify network connectivity |
| SSL certificate | < 30 days to expiry | Renew certificate; check Vault PKI or ACME automation |
| Web UI | Non-200 or keyword missing | Check reverse proxy config and Nomad HTTP listener |

Alert after: 1 consecutive failure for agent health and leader checks; 2 consecutive failures for the node count check, since node status can flap briefly during rolling deploys.
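To make the effect of the two thresholds concrete, here is a small sketch of consecutive-failure counting. It illustrates the general technique only; the class name is hypothetical and Vigilmon's actual implementation is not documented here.

```python
class ConsecutiveFailureAlert:
    """Fire once when a check has failed `threshold` times in a row."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0

    def record(self, passed: bool) -> bool:
        """Record one check result; return True on the check that trips the alert."""
        if passed:
            self.failures = 0  # any success resets the streak
            return False
        self.failures += 1
        return self.failures == self.threshold


if __name__ == "__main__":
    leader = ConsecutiveFailureAlert(threshold=1)
    nodes = ConsecutiveFailureAlert(threshold=2)
    print(leader.record(False))                      # single failure on a threshold of 1
    print(nodes.record(False), nodes.record(False))  # first and second failure on a threshold of 2
```

A threshold of 2 on a 5-minute interval means a node-count problem takes up to about 10 minutes to alert, which is the trade-off for ignoring single-check flaps.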

Common Nomad Failure Modes and What Vigilmon Catches

| Scenario | Vigilmon monitor |
|---|---|
| Nomad server process crash | /v1/agent/health unreachable; alert within 60 s |
| Quorum loss (majority of servers down) | Leader monitor fires; no :4647 in response |
| Leader election in progress | Leader monitor fires briefly; recovers in seconds if cluster is healthy |
| All client nodes drained/unreachable | Node count monitor shows no ready nodes |
| Rolling deploy removes node temporarily | Node count monitor may flap; use 2-failure threshold |
| SSL certificate expires | SSL monitor alerts at 30-day threshold; all API/CLI access fails |
| Reverse proxy misconfiguration | UI monitor fires while API monitor stays green |
| DNS misconfiguration | All monitors fire simultaneously |
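The combinations in the table can be read as a rough triage order. The sketch below encodes that reading as a heuristic; the monitor labels ("agent", "leader", "nodes", "ui", "ssl") and the function name are made up for this example, and a real incident may not fit any single row.

```python
from typing import Iterable

MONITORS = {"agent", "leader", "nodes", "ui", "ssl"}


def likely_cause(failing: Iterable[str]) -> str:
    """Map the set of currently failing monitors to the most likely scenario."""
    failing = set(failing)
    unknown = failing - MONITORS
    if unknown:
        raise ValueError(f"unknown monitors: {sorted(unknown)}")
    if not failing:
        return "all monitors green"
    if failing == MONITORS:
        return "DNS or network path to Nomad"
    if failing == {"ui"}:
        return "reverse proxy misconfiguration"
    if failing == {"ssl"}:
        return "certificate nearing expiry"
    if "agent" in failing:
        return "Nomad server process down or unhealthy"
    if "leader" in failing:
        return "leader election or quorum loss"
    if "nodes" in failing:
        return "no ready client nodes"
    return "unclassified"


if __name__ == "__main__":
    print(likely_cause({"ui"}))
    print(likely_cause(MONITORS))
```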

HashiCorp Nomad is the scheduling layer that keeps your workloads running — when the leader disappears or client nodes go missing, jobs stall, fail-overs don't complete, and deployments hang without obvious errors. Vigilmon gives you external visibility into Nomad's agent health, leader availability, node pool status, and SSL certificates so you detect cluster degradation the moment it happens rather than when a deployment mysteriously stops working.

Start monitoring Nomad in under 5 minutes — register free at vigilmon.online.