Monitoring ClickHouse with Vigilmon: Health Endpoint, Cluster Health, Replica Sync & SSL Alerts

ClickHouse is the engine powering real-time analytics for many data-intensive applications — aggregating billions of rows per second for dashboards, BI tools, and event pipelines. When ClickHouse goes down or a replica falls out of sync, analytics queries fail, dashboards go blank, and data ingestion backs up. Vigilmon gives you external visibility into ClickHouse's health: the built-in HTTP ping endpoint, cluster node status, replica sync state, and SSL certificate expiry — so you catch problems before your users do.

What You'll Build

A monitor on ClickHouse's /ping HTTP endpoint for basic availability
A cluster health check via the ClickHouse HTTP query interface
A replica sync status monitor detecting lagging or out-of-sync replicas
SSL certificate monitoring for your ClickHouse domain
An alerting setup that distinguishes process failures from cluster-level issues

Prerequisites

A running ClickHouse 22.x+ instance with the HTTP interface enabled (default port 8123)
A domain accessible over HTTPS (e.g., https://clickhouse.example.com)
A free account at vigilmon.online

Step 1: Verify ClickHouse's HTTP Ping Endpoint

ClickHouse ships with a minimal /ping endpoint on its HTTP interface that returns Ok. when the server is running and accepting connections:

curl https://clickhouse.example.com/ping

A healthy server returns:

Ok.

The /ping endpoint bypasses authentication, query processing, and storage: it confirms only that the HTTP server process is alive and listening. It does not confirm that storage is accessible or that the cluster is healthy, but it is the lightest availability check ClickHouse offers.

Port note: ClickHouse's HTTP interface listens on port 8123 by default, and the HTTPS interface conventionally uses 8443 once https_port is enabled. If clients reach ClickHouse directly rather than through a reverse proxy, your URL will be https://clickhouse.example.com:8443/ping. Check config.xml or the config.d/ directory for the https_port setting.
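
To see which HTTP and HTTPS ports are actually configured, a helper along these lines can scan the configuration files. This is a sketch that assumes the configuration lives under /etc/clickhouse-server (the usual location for package installs, so adjust it if yours differs); the function name list_http_ports is illustrative. It skips entries commented out on a single line but does not understand multi-line XML comments, so treat the output as a hint and confirm it against the file.

```shell
# Print every http_port / https_port setting found under a ClickHouse
# config directory. The default path is an assumption (package installs).
list_http_ports() {
    conf_dir="${1:-/etc/clickhouse-server}"
    # -r: recurse into config.d/; -h: omit file names from the output
    grep -rhE '<https?_port>[0-9]+</https?_port>' "$conf_dir" 2>/dev/null |
        grep -v '<!--' |                       # drop single-line comments
        sed -E 's|.*<(https?_port)>([0-9]+)</.*|\1=\2|' |
        sort -u
}
```

Calling list_http_ports with no argument scans the default directory; pass another path to scan a different installation.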

Step 2: Create a Vigilmon HTTP Monitor for the Ping Endpoint

Log in to Vigilmon → Add Monitor → HTTP.
URL: https://clickhouse.example.com/ping.
Check interval: 60 seconds.
Response timeout: 10 seconds.
Expected status: 200.
Keyword: Ok. (the literal response body from a healthy ClickHouse server).
Click Save.

This monitor fires when:

The ClickHouse process crashes or is OOM-killed
The HTTP listener stops accepting connections
The host becomes unreachable
A deployment or upgrade breaks the HTTP server

Alert sensitivity: Set to trigger after 1 consecutive failure. A down ClickHouse means all analytics queries and data ingestion pipelines fail immediately.
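
Before relying on the monitor, it can help to reproduce its checks by hand. The sketch below mirrors the settings from this step (10-second timeout, status 200, keyword Ok.); it is a local approximation, not Vigilmon's implementation, and the function names classify_ping and check_ping are made up for illustration.

```shell
# Classify a /ping response the way the monitor above is configured:
# healthy only if the status is 200 and the body contains "Ok.".
classify_ping() {
    status="$1"
    body="$2"
    if [ "$status" != "200" ]; then
        echo "down: unexpected status $status"
        return 1
    fi
    case "$body" in
        *"Ok."*) echo "up" ;;
        *) echo "down: keyword missing"; return 1 ;;
    esac
}

# Fetch a URL with the same 10-second timeout as the monitor and classify it.
check_ping() {
    url="$1"
    # -w appends the HTTP status code on its own line after the body.
    response=$(curl -sS --max-time 10 -w '\n%{http_code}' "$url") || {
        echo "down: connection failed"
        return 1
    }
    status=$(printf '%s\n' "$response" | tail -n 1)
    body=$(printf '%s\n' "$response" | sed '$d')
    classify_ping "$status" "$body"
}
```

Running check_ping https://clickhouse.example.com/ping from a machine outside your network gives a rough preview of what the external monitor will see.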

Step 3: Monitor Cluster Health via HTTP Query Interface

For ClickHouse clusters, the /ping endpoint only confirms that the local node is alive; it does not reveal whether other cluster nodes are reachable or healthy. Use the HTTP query interface to check the system.clusters table:

curl "https://clickhouse.example.com/?query=SELECT+count()+FROM+system.clusters+WHERE+is_local=0+AND+errors_count>0" \
  -u monitoring_user:monitoring_password

A cluster with no unhealthy remote nodes returns 0. Any non-zero value means the local server has recorded failed attempts to reach at least one remote node.
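
When the count is non-zero, the next question is which host is failing. The sketch below pulls a per-host breakdown and filters it locally. It assumes the monitoring user described in this step; the function names are illustrative. Using curl's -G with --data-urlencode avoids hand-encoding the query string.

```shell
# Read tab-separated rows of (cluster, host_name, errors_count) on stdin and
# print one line per host whose errors_count is greater than zero.
unhealthy_hosts() {
    awk -F '\t' '$3 + 0 > 0 { printf "%s/%s errors=%s\n", $1, $2, $3 }'
}

# Fetch the rows over the HTTP interface.
# $1: base URL, e.g. https://clickhouse.example.com/
# $2: credentials in user:password form for the monitoring user
fetch_cluster_errors() {
    base_url="$1"
    credentials="$2"
    curl -sS --fail -G "$base_url" -u "$credentials" \
        --data-urlencode "query=SELECT cluster, host_name, errors_count FROM system.clusters WHERE is_local = 0 FORMAT TabSeparated"
}
```

For example, fetch_cluster_errors https://clickhouse.example.com/ monitoring_user:monitoring_password | unhealthy_hosts prints nothing when every remote node is reachable.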

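Create a monitor for the configured cluster node count. Note that system.clusters reflects the cluster configuration rather than live reachability, so this count catches topology or configuration drift; the errors_count query above is what surfaces unreachable peers: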

Add Monitor → HTTP.
URL: https://clickhouse.example.com/?query=SELECT%20count()%20FROM%20system.clusters (URL-encoded).
Check interval: 2 minutes.
Expected status: 200.
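Keyword: your expected row count (e.g., 3 for a single 3-node cluster). Run the query once by hand first, since system.clusters has one row per replica of every cluster defined in the configuration.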
Label: ClickHouse cluster nodes.
Auth: Add HTTP Basic Auth credentials for your monitoring user.
Click Save.
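
One caveat with a numeric keyword: if the match is a substring match (an assumption here, since matching behaviour varies between monitoring tools), a keyword of 3 also matches bodies such as 13 or 30. A scripted check can compare the whole body instead, as in this minimal sketch with a hypothetical count_matches helper.

```shell
# Succeed only when the response body, ignoring surrounding whitespace,
# is exactly the expected number. A substring keyword such as "3" would
# also match "13" or "30".
count_matches() {
    expected="$1"
    actual=$(printf '%s' "$2" | tr -d ' \r\n')
    [ "$actual" = "$expected" ]
}
```

For example, count_matches 3 "$(curl -sS ...)" succeeds only for a body of exactly 3.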

Monitoring user: Create a read-only ClickHouse user with access only to system.* tables for your monitoring queries. Never use the default user for external monitoring.

CREATE USER monitoring_user IDENTIFIED BY 'strong_password';
GRANT SELECT ON system.* TO monitoring_user;

Step 4: Monitor Replica Sync Status

ClickHouse's ReplicatedMergeTree tables maintain their own sync state, tracked in system.replicas. A lagging or out-of-sync replica can silently serve stale data or fail inserts without triggering a process-level failure:

curl "https://clickhouse.example.com/?query=SELECT+database,table,is_readonly,last_queue_update_exception+FROM+system.replicas+WHERE+is_readonly=1+LIMIT+5" \
  -u monitoring_user:monitoring_password

If the query returns any rows, those replicas are in read-only mode and cannot accept writes. INSERTs routed to them fail, while INSERTs routed to other replicas may still succeed, which can mask the problem.

Add a Vigilmon monitor that alerts when any replica goes read-only:

Add Monitor → HTTP.
URL: https://clickhouse.example.com/?query=SELECT%20count()%20FROM%20system.replicas%20WHERE%20is_readonly%3D1.
Check interval: 5 minutes.
Expected status: 200.
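Keyword: 0 (zero read-only replicas is healthy; any other value warrants investigation). If the keyword is matched as a substring, counts such as 10 or 20 also contain 0, so prefer an exact-match option if one is available.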
Label: ClickHouse replica sync.
Click Save.

Interpret carefully: If your ClickHouse has no ReplicatedMergeTree tables, this query returns 0 regardless. Confirm that system.replicas has rows before relying on this check: SELECT count() FROM system.replicas.
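
The read-only check does not cover replicas that are writable but lagging. system.replicas also exposes absolute_delay (lag in seconds) and is_session_expired, so one query can count all three conditions. The sketch below builds that query; the 300-second default threshold and the function names are assumptions chosen for illustration, so tune the threshold to your own replication patterns.

```shell
# Build a query counting replicas that are read-only, have lost their
# coordination session, or lag by more than the given number of seconds.
replica_problem_query() {
    delay_limit="${1:-300}"   # seconds; 300 is an illustrative default
    printf 'SELECT count() FROM system.replicas WHERE is_readonly = 1 OR is_session_expired = 1 OR absolute_delay > %d' "$delay_limit"
}

# Run the query over the HTTP interface and print the count.
# $1: base URL, $2: user:password, $3: optional delay limit in seconds
count_replica_problems() {
    base_url="$1"
    credentials="$2"
    limit="${3:-300}"
    curl -sS --fail -G "$base_url" -u "$credentials" \
        --data-urlencode "query=$(replica_problem_query "$limit")"
}
```

A result of 0 means no replica is read-only, disconnected from its coordination service, or behind by more than the threshold.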

Step 5: Monitor the ClickHouse Web UI (Optional)

If you have the built-in Play UI or a Tabix/Superset frontend deployed alongside ClickHouse, add a monitor for the web interface:

Add Monitor → HTTP.
URL: https://clickhouse.example.com/play (or your BI tool's URL).
Check interval: 5 minutes.
Expected status: 200.
Keyword: ClickHouse (appears in the Play UI's HTML title).
Label: ClickHouse Play UI.
Click Save.

Step 6: Monitor SSL Certificates

ClickHouse clients — drivers, JDBC connectors, and HTTP clients — validate the server certificate for every connection. An expired certificate breaks all client connectivity simultaneously:

Add Monitor → SSL Certificate.
Domain: clickhouse.example.com.
Alert when expiry is within: 30 days.
Alert again: 14 days, 7 days, 3 days, 1 day.
Click Save.

Multi-port SSL: If ClickHouse clients connect directly to ClickHouse's own HTTPS listener on port 8443 rather than through a reverse proxy on 443, ensure your SSL monitor targets that port. Some SSL checks default to port 443, so verify that the monitored port matches your actual ClickHouse HTTPS listener.
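
To confirm by hand which certificate a given port serves, openssl can fetch it and test its remaining lifetime. This is a sketch that assumes the openssl command-line tool is installed; the helper names are illustrative, and the 8443 default simply mirrors the port discussed above.

```shell
# Convert a day count into seconds for openssl's -checkend option.
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# Succeed if the certificate served on host:port is still valid after the
# given number of days. Fails if it expires sooner or cannot be fetched.
# $1: host, $2: days, $3: optional port (defaults to 8443)
cert_valid_for_days() {
    host="$1"
    days="$2"
    port="${3:-8443}"
    openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null |
        openssl x509 -noout -checkend "$(days_to_seconds "$days")" >/dev/null
}
```

For example, cert_valid_for_days clickhouse.example.com 30 8443 exits non-zero once the certificate on port 8443 has fewer than 30 days left.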

Step 7: Configure Alerting

In Vigilmon under Settings → Notifications, configure your alert channels:

| Monitor | Trigger | Action |
|---|---|---|
| /ping | Non-200 or Ok. missing | Check ClickHouse process; review /var/log/clickhouse-server/ logs |
| Cluster node count | Unexpected count | Check inter-node connectivity; review system.clusters |
| Replica sync | Non-zero read-only replicas | Check system.replicas and system.replication_queue; resolve ZooKeeper issues |
| SSL certificate | < 30 days to expiry | Renew certificate; check ACME or cert automation |
| Web UI | Non-200 or keyword missing | Check proxy config; confirm ClickHouse process is healthy |

Alert after: 1 consecutive failure for ping and SSL monitors; 2 consecutive failures for cluster and replica checks (transient ZooKeeper hiccups can cause brief anomalies).
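
The consecutive-failure rule can be pictured as counting the current run of failures and comparing it with the threshold. The sketch below illustrates that logic only; it is not how Vigilmon is implemented, and should_alert is a hypothetical name.

```shell
# Decide whether to alert, given a threshold and the most recent check
# results in order (oldest first), each either "ok" or "fail".
should_alert() {
    threshold="$1"
    shift
    streak=0
    for result in "$@"; do
        if [ "$result" = "fail" ]; then
            streak=$((streak + 1))   # extend the current run of failures
        else
            streak=0                 # any success resets the run
        fi
    done
    [ "$streak" -ge "$threshold" ]
}
```

With a threshold of 2, a single failed check between two successes never alerts, which is the behaviour wanted for the cluster and replica checks.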

Common ClickHouse Failure Modes and What Vigilmon Catches

| Scenario | Vigilmon monitor |
|---|---|
| ClickHouse process crash or OOM | /ping unreachable; alert within 60 s |
| HTTP listener misconfiguration | /ping returns 404 or connection refused |
| Cluster node partitioned from peers | errors_count rises in system.clusters (Step 3 query); the configured node count itself does not change |
| ZooKeeper failure (replicated tables) | Replica sync monitor shows read-only replicas |
| Full disk causes read-only mode | Replica sync monitor fires; inserts fail across all tables |
| SSL certificate expires | SSL monitor alerts at 30-day threshold |
| Memory limit causes query failures | Ping still healthy; use query-level monitoring for this |
| DNS misconfiguration | All monitors fire simultaneously |

ClickHouse failures are often silent at the application layer — a crashed node may leave your application retrying against remaining replicas while data falls behind, and a read-only replica silently rejects writes. Vigilmon gives you layered external monitoring of ClickHouse's availability, cluster topology, replica sync state, and SSL certificates so you catch problems at the infrastructure level before they surface as missing data or query timeouts.

Start monitoring ClickHouse in under 5 minutes — register free at vigilmon.online.