Monitoring CockroachDB with Vigilmon: Health Endpoint, SQL Port TCP, Admin UI Availability & SSL Certificate Alerts

CockroachDB is a distributed SQL database designed to survive node failures, datacenter outages, and network partitions — but "surviving" and "serving your application" are different things. A node that is technically alive but failing its readiness checks is not serving SQL queries. A SQL port that is unreachable is as good as a crashed database for your application. A certificate expiry on a CockroachDB cluster cascades into every inter-node connection failing simultaneously. Vigilmon gives you external monitoring of CockroachDB's health from the outside: the readiness endpoint, SQL port TCP connectivity, admin UI availability, and SSL certificate expiry.

What You'll Build

A monitor on CockroachDB's /health?ready=1 readiness endpoint
A TCP monitor on the SQL port to confirm your application can connect
An admin UI availability check to detect dashboard and management failures
SSL certificate monitoring for both the SQL listener and admin UI
An alerting setup that distinguishes node-level failures from cluster-wide outages

Prerequisites

A running CockroachDB cluster (single-node or multi-node) with the HTTP health endpoint exposed
CockroachDB accessible over HTTPS for the admin UI (default port: 8080)
SQL port accessible from outside (default: 26257)
A free account at vigilmon.online

Step 1: Verify CockroachDB's Health Endpoints

CockroachDB exposes two health endpoints on its HTTP port (default 8080):

/health — liveness check: returns 200 if the node process is alive
/health?ready=1 — readiness check: returns 200 only if the node is ready to serve SQL queries (i.e., it has joined the cluster, completed Raft recovery, and is not draining)

curl 'https://cockroachdb.example.com:8080/health?ready=1'

A ready node returns HTTP 200 with an empty JSON object as the body:

{}

A node that is alive but not yet ready (during startup, Raft recovery, or graceful drain) returns 503 Service Unavailable with an error message in the body.

Always use /health?ready=1 for monitoring — /health alone will return 200 even when a node is draining or not yet part of the cluster.
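The status codes described above can be turned into a small command-line probe. This is a minimal sketch, assuming curl is installed; the function names classify_readiness and probe_readiness are illustrative, and the 10-second timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Map an HTTP status code from /health?ready=1 to a coarse node state.
# "000" is what curl's %{http_code} prints when no HTTP response arrived.
classify_readiness() {
  case "$1" in
    200) echo "ready" ;;
    503) echo "not-ready" ;;
    000) echo "unreachable" ;;
    *)   echo "unexpected" ;;
  esac
}

# Probe one node's readiness endpoint and print its state.
# Usage: probe_readiness https://cockroachdb.example.com:8080
probe_readiness() {
  local code
  code=$(curl --silent --output /dev/null --max-time 10 \
              --write-out '%{http_code}' "$1/health?ready=1") || true
  classify_readiness "${code:-000}"
}
```

Separating "not-ready" from "unreachable" mirrors the difference between a node that answers with 503 and a node whose process or network path is gone.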

TLS on the HTTP port: CockroachDB's HTTP port uses TLS by default. If your monitoring can't reach the HTTPS endpoint directly, expose the admin UI through a reverse proxy (NGINX, Traefik) that terminates TLS.

Step 2: Create a Vigilmon HTTP Monitor for the Readiness Endpoint

Log in to Vigilmon → Add Monitor → HTTP.
URL: https://cockroachdb.example.com:8080/health?ready=1.
Check interval: 60 seconds.
Response timeout: 10 seconds.
Expected status: 200.
Click Save (no keyword needed — a 200 confirms readiness).

For multi-node clusters, add a readiness monitor for each node:

https://cockroachdb-1.example.com:8080/health?ready=1 — Node 1
https://cockroachdb-2.example.com:8080/health?ready=1 — Node 2
https://cockroachdb-3.example.com:8080/health?ready=1 — Node 3

This monitor catches:

Node process crashes or OOM kills
Raft recovery failures after ungraceful restarts
Nodes that are alive but have lost quorum and cannot serve reads or writes
Drain operations that take longer than expected
Network partitions that isolate a node from the cluster

Alert sensitivity: Trigger after 1 consecutive failure per node. A 3-node cluster keeps serving requests with one node down, but a second failure costs the cluster its quorum, so every single-node alert deserves prompt attention.
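The quorum arithmetic behind that advice can be sketched in a few lines. This is a simplified illustration: it treats quorum at the node level, which is only an approximation, since CockroachDB maintains quorum per range. It is reasonable for a three-node cluster where every range has a replica on each node. The function names are hypothetical, and count_ready assumes curl is available.

```shell
#!/usr/bin/env bash
# Smallest number of voters that forms a majority of N (integer division).
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit 0) when READY nodes out of TOTAL still form a majority.
# Usage: has_quorum READY TOTAL
has_quorum() {
  [ "$1" -ge "$(quorum_size "$2")" ]
}

# Count how many of the given node base URLs pass the readiness check.
# Usage: count_ready https://cockroachdb-1.example.com:8080 ...
count_ready() {
  local ready=0 url
  for url in "$@"; do
    # --fail makes curl exit non-zero on HTTP 4xx/5xx responses such as 503.
    if curl --silent --fail --output /dev/null --max-time 10 \
            "$url/health?ready=1"; then
      ready=$((ready + 1))
    fi
  done
  echo "$ready"
}
```

With three nodes, has_quorum 2 3 succeeds and has_quorum 1 3 fails, which is why the second node failure is the one that takes the cluster down.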

Step 3: Monitor the SQL Port via TCP Check

Your application connects to CockroachDB on the SQL port (default: 26257) using the PostgreSQL wire protocol. TCP-level monitoring confirms that the port is accepting connections — independent of whether any SQL queries succeed:

nc -zv cockroachdb.example.com 26257

Add Monitor → TCP.
Host: cockroachdb.example.com (or your load balancer / HAProxy frontend).
Port: 26257.
Check interval: 60 seconds.
Label: CockroachDB SQL port.
Click Save.

Load balancers and HAProxy: In production CockroachDB deployments, applications typically connect through a load balancer or HAProxy that distributes connections across nodes. Monitor the load balancer's SQL frontend — if that port is down, no application can connect regardless of cluster health.

When the SQL TCP monitor fires while the HTTP readiness monitors are green, the failure is in your load balancer, HAProxy, or network routing — not in CockroachDB itself. This distinction immediately directs your troubleshooting.
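The same triage logic can be written down as a lookup. The sketch below is illustrative only: the "down/up" case comes from the paragraph above, while the wording of the other three cases is an assumption about what each combination most likely means. tcp_open relies on bash's /dev/tcp redirection and the coreutils timeout command.

```shell
#!/usr/bin/env bash
# Succeeds if a TCP connection to HOST:PORT opens within 5 seconds.
# Usage: tcp_open cockroachdb.example.com 26257
tcp_open() {
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Suggest where to look first, given the two monitor states ("up" or "down").
# Usage: triage SQL_TCP_STATE READINESS_STATE
triage() {
  case "$1/$2" in
    up/up)     echo "healthy" ;;
    down/up)   echo "load balancer or network path" ;;
    up/down)   echo "node not ready behind an open port" ;;
    down/down) echo "node or cluster outage" ;;
    *)         echo "unknown input" ;;
  esac
}
```

For example, triage down up prints "load balancer or network path", matching the case where the SQL TCP monitor fires while the readiness monitors stay green.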

Step 4: Monitor the CockroachDB Admin UI

The CockroachDB admin UI (port 8080) provides cluster-wide metrics, statement statistics, replication health, and node status. When the admin UI is unavailable, DBAs and SREs lose visibility into cluster internals and cannot investigate slow queries or replication lag:

curl https://cockroachdb.example.com:8080
# Returns HTML containing "CockroachDB"

Add Monitor → HTTP.
URL: https://cockroachdb.example.com:8080.
Check interval: 5 minutes.
Response timeout: 15 seconds.
Expected status: 200.
Keyword: CockroachDB (appears in the admin UI page title).
Label: CockroachDB Admin UI.
Click Save.

If your admin UI is behind a reverse proxy at a custom domain (e.g., https://crdb-admin.example.com), use that URL instead of the direct port. Monitoring the proxy URL also validates that the proxy configuration is correct.
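A keyword check of this kind can be reproduced locally before you configure the monitor. The following is a rough sketch, assuming curl and grep are installed; the function names are hypothetical, and curl's --fail flag only approximates an "expected status 200" rule because it rejects 4xx and 5xx responses rather than requiring exactly 200.

```shell
#!/usr/bin/env bash
# Succeeds if the text on stdin contains the keyword as a literal string.
body_has_keyword() {
  grep -qF -- "$1"
}

# Fetch a page and require both an HTTP success and the keyword in the body.
# Usage: check_admin_ui https://cockroachdb.example.com:8080 CockroachDB
check_admin_ui() {
  local body
  body=$(curl --silent --fail --max-time 15 "$1") || return 1
  printf '%s' "$body" | body_has_keyword "$2"
}
```

Requiring the keyword as well as the status code helps catch a proxy that answers with its own placeholder or error page in place of the admin UI.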

Step 5: Monitor SSL Certificates

CockroachDB uses TLS for three distinct communication paths: the SQL wire protocol, inter-node (gossip and Raft), and the HTTP admin UI. Certificate expiry on the node certificates causes inter-node communication to fail, which can cascade into a full cluster outage. Monitor the certificate on your CockroachDB domain:

openssl s_client -connect cockroachdb.example.com:8080 -servername cockroachdb.example.com </dev/null 2>/dev/null | openssl x509 -noout -dates

Add Monitor → SSL Certificate.
Domain: cockroachdb.example.com (the admin UI domain).
Alert when expiry is within: 30 days.
Alert again: 14 days, 7 days, 3 days, 1 day.
Click Save.
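To see how the alert schedule above maps onto a real certificate, the expiry date printed by openssl can be converted into days remaining. This is a minimal sketch that assumes GNU date (the -d option is not available in BSD or macOS date); the function names days_until_expiry and alert_tier are illustrative.

```shell
#!/usr/bin/env bash
# Whole days between NOW (epoch seconds) and a certificate's notAfter date.
# Input is the line printed by `openssl x509 -noout -enddate`,
# e.g. "notAfter=Jan 31 00:00:00 2030 GMT". Requires GNU date.
days_until_expiry() {
  local not_after=${1#notAfter=} now=$2 expiry
  expiry=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Print the tightest threshold (30, 14, 7, 3, 1) that has been reached,
# or "none" if more than 30 days remain.
alert_tier() {
  local days=$1 tier hit=none
  for tier in 30 14 7 3 1; do
    if [ "$days" -le "$tier" ]; then
      hit=$tier
    fi
  done
  echo "$hit"
}
```

A typical call would pass the output of openssl x509 -noout -enddate as the first argument and $(date +%s) as the second, then feed the result to alert_tier.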

CockroachDB certificate rotation: Nodes can pick up new certificates without a restart. Replace the certificate files in the certs directory, then send SIGHUP to the cockroach process so it reloads them. Auto-rotation tools like cert-manager only update Kubernetes secrets, so verify that your rotation pipeline actually delivers new certificates to the nodes rather than assuming it does. A 30-day alert window gives you time to investigate before certificates expire.

Step 6: Configure Alerting

In Vigilmon under Settings → Notifications, configure your alert channels:

| Monitor | Trigger | Action |
|---|---|---|
| /health?ready=1 per node | Non-200 | Check node status; run cockroach node status --certs-dir=certs (use --insecure only on clusters started without TLS); investigate Raft recovery |
| SQL port TCP | Connection refused or timed out | Check load balancer health; verify HAProxy frontend; check firewall rules on port 26257 |
| Admin UI | Non-200 or keyword missing | HTTP listener issue; check reverse proxy; node may still be serving SQL |
| SSL certificate | < 30 days to expiry | Rotate node certificates; check cert-manager pipeline or manual rotation schedule |

Alert after: 1 consecutive failure for the readiness and SQL port monitors. 2 consecutive failures for the admin UI monitor.
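The "consecutive failures" rule can be illustrated with a short function. This is only a sketch of the general idea, not a description of how Vigilmon implements it; should_alert and its "ok"/"fail" inputs are hypothetical.

```shell
#!/usr/bin/env bash
# Succeeds when the most recent THRESHOLD results are all "fail".
# Usage: should_alert THRESHOLD RESULT... (oldest first, newest last)
should_alert() {
  local threshold=$1 streak=0 result
  shift
  for result in "$@"; do
    if [ "$result" = "fail" ]; then
      streak=$((streak + 1))
    else
      streak=0   # any success resets the streak
    fi
  done
  [ "$streak" -ge "$threshold" ]
}
```

With a threshold of 2, the sequence ok fail fail triggers an alert while fail ok fail does not, which is the behavior you want for a dashboard that occasionally times out once.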

Common CockroachDB Failure Modes and What Vigilmon Catches

| Scenario | Vigilmon monitor |
|---|---|
| Node OOM killed | /health?ready=1 returns 503 or connection refused; alert within 60 s |
| Node takes too long to rejoin cluster after restart | /health?ready=1 returns 503 during startup; resolves automatically |
| Load balancer misconfiguration | SQL TCP monitor fires; node health monitors stay green |
| Quorum loss (2 of 3 nodes down) | Two or more node readiness monitors fire |
| SSL certificate expiry | SSL monitor alerts at 30-day threshold; inter-node TLS handshakes fail |
| Admin UI behind reverse proxy fails | Admin UI monitor fires; readiness and SQL monitors stay green |
| Firewall blocks SQL port 26257 | SQL TCP monitor fires; application gets connection refused |
| Disk full on data directory | Node marks itself not ready; readiness monitor alerts |
| DNS misconfiguration | All monitors fire simultaneously |
| Graceful drain (rolling upgrade) | Readiness monitor briefly fires per node; resolves in sequence |

CockroachDB's distributed architecture means failures can be subtle — a node may be alive but not ready, the SQL port may be unreachable while the HTTP endpoint is fine, or a certificate may be days from expiry without any obvious symptom until all inter-node connections fail at once. Vigilmon gives you external monitoring at every layer: per-node readiness, SQL port TCP connectivity, admin UI availability, and certificate expiry, so you catch CockroachDB problems before they become application outages.

Start monitoring CockroachDB in under 5 minutes — register free at vigilmon.online.