ClickHouse is the analytical database behind real-time dashboards, event pipelines, and business intelligence systems that can't afford downtime. When ClickHouse goes silent — due to a crashed process, a query that consumes all memory, or a disk that fills with compressed column data — every dashboard that reads from it goes blank and every pipeline that writes to it starts queuing or dropping data.
Vigilmon gives you external monitoring for ClickHouse uptime, query endpoint health, and replicated table liveness. This tutorial covers the complete setup from exposing a health endpoint to configuring multi-region checks and instant alerts.
Why ClickHouse Needs External Monitoring
ClickHouse ships with system tables like system.processes and system.replicas that give deep internal observability. But internal observability has a blind spot: it only works when ClickHouse is up and reachable. External monitoring catches the failures that internal checks can't:
- Whether ClickHouse is reachable from your application servers and data pipeline nodes
- Whether a query is completing in acceptable time (a bad query plan can exhaust memory and get the server process OOM-killed)
- Whether replicated tables have fallen behind the leader
- Whether a Materialized View or scheduled Merge has silently broken
- Whether ClickHouse is returning correct data (a corrupt part might cause specific queries to return empty results)
External HTTP probes from Vigilmon test the complete path — DNS → network → ClickHouse process → query execution — that your application actually depends on.
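To make that layered path concrete, here is a minimal Python sketch of a probe that checks each stage in order and reports the first one that fails. The function name `probe_path`, the stage labels, and the default port are illustrative assumptions; this is not how Vigilmon implements its checks.

```python
import http.client
import socket
import urllib.parse
import urllib.request


def probe_path(host: str, port: int = 8123, timeout: float = 5.0) -> dict:
    """Check DNS, TCP, and query execution in order; report the first failure."""
    # Stage 1: DNS resolution
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return {"ok": False, "failed_stage": "dns", "error": str(exc)}

    # Stage 2: TCP connection to the HTTP interface
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:
        return {"ok": False, "failed_stage": "network", "error": str(exc)}

    # Stage 3: the server answers HTTP and executes a trivial query.
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    url = f"http://{host}:{port}/?" + urllib.parse.urlencode({"query": "SELECT 1"})
    try:
        with opener.open(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace").strip()
    except (OSError, http.client.HTTPException) as exc:
        return {"ok": False, "failed_stage": "http", "error": str(exc)}

    # Stage 4: the query returned the expected value
    if body != "1":
        return {"ok": False, "failed_stage": "query", "error": f"unexpected body: {body!r}"}
    return {"ok": True, "failed_stage": None, "error": None}
```

Knowing which stage failed narrows the diagnosis: a `dns` or `network` failure points at infrastructure, while an `http` or `query` failure points at the ClickHouse process itself.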
Step 1: Use the ClickHouse HTTP Interface
ClickHouse has a built-in HTTP interface on port 8123 that you can probe directly. The simplest health check is a SELECT 1:
curl "http://localhost:8123/?query=SELECT%201"
# 1
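This verifies ClickHouse is running and can execute queries. For a more informative check, call a server function such as uptime(). Like SELECT 1, it does not read from a table, so it confirms the query layer is alive but does not exercise the storage engine: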
curl "http://localhost:8123/?query=SELECT+uptime()"
# 86401
The uptime() function returns the number of seconds the server has been running. If the server restarts unexpectedly, your monitoring history captures the uptime dip.
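One way to turn recorded uptime() readings into a restart signal is to compare consecutive samples: uptime should only grow between checks, so any decrease means the server restarted in between. A minimal sketch, where `find_restarts` and the sample format are assumptions made for illustration:

```python
from typing import List, Tuple


def find_restarts(samples: List[Tuple[float, int]]) -> List[float]:
    """Return the check timestamps at which a restart was first observed.

    Each sample is (check_time_epoch_seconds, uptime_seconds), in time order.
    A lower uptime than the previous sample means the server restarted
    somewhere between the two checks.
    """
    restarts = []
    for (_, prev_uptime), (curr_time, curr_uptime) in zip(samples, samples[1:]):
        if curr_uptime < prev_uptime:
            restarts.append(curr_time)
    return restarts


history = [(1000.0, 86341), (1060.0, 86401), (1120.0, 12), (1180.0, 72)]
print(find_restarts(history))  # [1120.0]
```

This only works if the probe stores the uptime value from each check; a plain up/down status would miss a restart that completes between two checks.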
Step 2: Build a Dedicated Health Endpoint
Rather than exposing port 8123 publicly, add a thin health proxy to your application server or create a dedicated lightweight handler:
Node.js Health Proxy
// health/clickhouse.js
const express = require('express');
const { createClient } = require('@clickhouse/client');

const app = express();
const client = createClient({
  host: process.env.CLICKHOUSE_HOST || 'http://localhost:8123',
  username: process.env.CLICKHOUSE_USER || 'default',
  password: process.env.CLICKHOUSE_PASSWORD || '',
});

app.get('/health/clickhouse', async (req, res) => {
  try {
    const result = await client.query({
      query: 'SELECT uptime() AS uptime, version() AS version',
      format: 'JSONEachRow',
    });
    const rows = await result.json();
    const { uptime, version } = rows[0];
    const uptimeSeconds = parseInt(uptime, 10);

    // Report low uptime (a recent restart) as "recovering"; the HTTP status stays 200
    const status = uptimeSeconds < 60 ? 'recovering' : 'ok';
    return res.status(200).json({ status, uptime_seconds: uptimeSeconds, version });
  } catch (err) {
    // A connection or query failure means ClickHouse is not usable right now
    return res.status(503).json({ status: 'down', error: err.message });
  }
});

app.listen(3001);
Python Health Endpoint (FastAPI)
# health/clickhouse.py
import os

import clickhouse_connect
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()


@app.get("/health/clickhouse")
def health():
    # Declared with plain def so FastAPI runs it in a worker thread;
    # the ClickHouse client call below is blocking.
    try:
        client = clickhouse_connect.get_client(
            host=os.getenv("CLICKHOUSE_HOST", "localhost"),
            port=int(os.getenv("CLICKHOUSE_PORT", "8123")),
            username=os.getenv("CLICKHOUSE_USER", "default"),
            password=os.getenv("CLICKHOUSE_PASSWORD", ""),
        )
        result = client.query("SELECT uptime() AS uptime, version() AS version")
        row = result.first_row
        uptime_seconds = int(row[0])
        return {
            "status": "ok",
            "uptime_seconds": uptime_seconds,
            "version": row[1],
        }
    except Exception as e:
        # JSONResponse escapes the error message, so the body is always valid JSON
        return JSONResponse(
            status_code=503,
            content={"status": "down", "error": str(e)},
        )
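The Node.js proxy reports a `recovering` status for the first 60 seconds after a restart, while the FastAPI handler always reports `ok`. If you want the same behavior in Python, a small helper along these lines could be called from the handler. The name `classify_uptime` and the parameter `recovering_window` are hypothetical; the 60-second window is taken from the Node.js example.

```python
def classify_uptime(uptime_seconds: int, recovering_window: int = 60) -> str:
    """Map server uptime to the status string used in the health payload."""
    if uptime_seconds < 0:
        raise ValueError("uptime_seconds must be non-negative")
    # A very recent start suggests the server has just restarted
    return "recovering" if uptime_seconds < recovering_window else "ok"


print(classify_uptime(12))     # recovering
print(classify_uptime(86401))  # ok
```

Keep in mind that a monitor configured to require `"status":"ok"` in the response body will treat a `recovering` response as a failure, even though the HTTP status is 200.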
Step 3: Monitor Replicated Tables
If you're using ClickHouse replication with Keeper (or ZooKeeper), add a replication lag check:
# health/clickhouse.py (continued)
from fastapi.responses import JSONResponse


@app.get("/health/clickhouse/replication")
def replication_health():
    try:
        # Pass the same connection settings as the uptime endpoint
        client = clickhouse_connect.get_client(...)
        result = client.query("""
            SELECT
                database,
                table,
                is_leader,
                absolute_delay,
                queue_size
            FROM system.replicas
            WHERE is_readonly = 0
              AND absolute_delay > 60  -- alert if more than 60s behind
            LIMIT 10
        """)
        lagging = result.result_rows
        if lagging:
            # Every returned row is a replicated table more than 60 seconds behind
            return JSONResponse(
                status_code=503,
                content={"status": "degraded", "lagging_tables": len(lagging)},
            )
        return {"status": "ok", "replication": "healthy"}
    except Exception as e:
        # JSONResponse escapes the error message, so the body is always valid JSON
        return JSONResponse(
            status_code=503,
            content={"status": "down", "error": str(e)},
        )
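The degraded response above reports only how many tables are lagging. A small helper can turn the rows from system.replicas into a payload that also names the affected tables, which makes an alert easier to act on. This is a sketch: `summarize_lag` and its output fields are hypothetical, and the tuple layout assumes the column order of the SELECT in this step.

```python
import json
from typing import Sequence


def summarize_lag(rows: Sequence[Sequence], max_delay_seconds: int = 60) -> dict:
    """Build a health payload from
    (database, table, is_leader, absolute_delay, queue_size) rows."""
    lagging = [
        {
            "table": f"{database}.{table}",
            "delay_seconds": int(absolute_delay),
            "queue_size": int(queue_size),
        }
        for database, table, _is_leader, absolute_delay, queue_size in rows
        if int(absolute_delay) > max_delay_seconds
    ]
    # Worst offenders first
    lagging.sort(key=lambda item: item["delay_seconds"], reverse=True)
    return {
        "status": "degraded" if lagging else "ok",
        "lagging_tables": len(lagging),
        "tables": lagging,
    }


rows = [("analytics", "events", 0, 184, 27), ("analytics", "sessions", 1, 3, 0)]
print(json.dumps(summarize_lag(rows), indent=2))
```

The handler would still return HTTP 503 whenever `status` is `degraded`, so the Vigilmon monitor behaves the same; the extra fields only enrich the response body.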
Step 4: Monitor ClickHouse with Vigilmon
Log in to Vigilmon and go to Monitors → New Monitor.
ClickHouse Uptime Monitor
- Select HTTP / HTTPS as the monitor type
- Set the URL to your health endpoint: https://your-server.example.com/health/clickhouse
- Set the check interval to 1 minute
- Under Expected response, configure:
  - Status code: 200
  - (Optional) Response body contains: "status":"ok"
- Set the timeout to 10000ms, since ClickHouse queries can legitimately take a few seconds
- Save the monitor
Monitor via the Vigilmon API
# ClickHouse primary uptime monitor
curl -X POST https://api.vigilmon.online/v1/monitors \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "ClickHouse - Primary",
"type": "http",
"url": "https://your-server.example.com/health/clickhouse",
"interval": 60,
"timeout": 10000,
"expected_status": 200,
"expected_body": "\"status\":\"ok\"",
"regions": ["us-east", "eu-west"]
}'
# Replication health monitor
curl -X POST https://api.vigilmon.online/v1/monitors \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "ClickHouse - Replication",
"type": "http",
"url": "https://your-server.example.com/health/clickhouse/replication",
"interval": 120,
"timeout": 10000,
"expected_status": 200
}'
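If you manage many monitors, it can be easier to generate these request bodies in code than to hand-escape JSON inside a shell string. The sketch below only builds the bodies and does not send them. The field names are copied from the curl examples above; `monitor_payload` and its validation rules are illustrative assumptions, not documented Vigilmon behavior.

```python
import json
from typing import List, Optional


def monitor_payload(
    name: str,
    url: str,
    interval: int = 60,
    timeout_ms: int = 10000,
    expected_body: Optional[str] = None,
    regions: Optional[List[str]] = None,
) -> dict:
    """Build the JSON body for a monitor, mirroring the curl examples."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    if interval <= 0 or timeout_ms <= 0:
        raise ValueError("interval and timeout_ms must be positive")
    if timeout_ms / 1000 >= interval:
        # Assumed sanity rule: a check should finish before the next one starts
        raise ValueError("timeout must be shorter than the check interval")

    payload = {
        "name": name,
        "type": "http",
        "url": url,
        "interval": interval,
        "timeout": timeout_ms,
        "expected_status": 200,
    }
    if expected_body is not None:
        payload["expected_body"] = expected_body
    if regions:
        payload["regions"] = list(regions)
    return payload


primary = monitor_payload(
    "ClickHouse - Primary",
    "https://your-server.example.com/health/clickhouse",
    expected_body='"status":"ok"',
    regions=["us-east", "eu-west"],
)
# json.dumps takes care of escaping the quotes inside expected_body
print(json.dumps(primary, indent=2))
```

The printed JSON can be passed to curl with `-d @file.json`, or posted with any HTTP client using the same Authorization header as above.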
Step 5: Monitor Response Time Thresholds
ClickHouse is built for fast analytical queries, but a slow query or heavy merge can saturate resources and make your health endpoint itself slow to respond. Configure latency alerting:
- Open the ClickHouse monitor in Vigilmon
- Go to Latency Alerting
- Set a threshold of 5000ms for the health endpoint (normal responses take <200ms)
- Enable latency alerts; a slow health response indicates ClickHouse is under resource pressure
This catches scenarios like a runaway query consuming all available memory before the operating system's OOM killer terminates the ClickHouse process.
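Alerting only after several consecutive slow checks (the summary table at the end uses three) avoids paging on a single slow response. The logic can be sketched as follows; `slow_streak_alert` is a hypothetical helper that illustrates the idea and does not describe Vigilmon's internal implementation.

```python
from typing import Iterable


def slow_streak_alert(
    latencies_ms: Iterable[float],
    threshold_ms: float = 5000,
    required_streak: int = 3,
) -> bool:
    """Return True once `required_streak` checks in a row exceed the threshold."""
    streak = 0
    for latency in latencies_ms:
        # A fast response resets the streak
        streak = streak + 1 if latency > threshold_ms else 0
        if streak >= required_streak:
            return True
    return False


print(slow_streak_alert([180, 6200, 150, 5400, 5900, 7100]))  # True
print(slow_streak_alert([180, 6200, 6100, 150, 5400]))        # False
```

With 1-minute checks, a streak of three means the endpoint has been slow for roughly three minutes before anyone is notified, which is the trade-off for fewer false alarms.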
Step 6: Configure Alerts
Go to Alert Channels → Add Channel and set up your notification route. An alert delivered to a webhook channel carries a payload like this:
{
"monitor_name": "ClickHouse - Primary",
"status": "down",
"url": "https://your-server.example.com/health/clickhouse",
"error": "HTTP 503 Service Unavailable",
"started_at": "2026-01-20T03:45:00Z",
"duration_seconds": 90
}
For a ClickHouse production cluster, configure:
- PagerDuty or Opsgenie webhook for on-call escalation
- Slack webhook for team awareness
- Email for audit trail
Assign alert channels to both the uptime and replication monitors.
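If you route alerts through your own webhook receiver, a payload shaped like the JSON example in this step can be condensed into a one-line message for chat or logs. A sketch only: `format_alert` is hypothetical, and it assumes the incoming fields match that example.

```python
def format_alert(payload: dict) -> str:
    """Render an alert payload as a single human-readable line."""
    minutes, seconds = divmod(int(payload.get("duration_seconds", 0)), 60)
    duration = f"{minutes}m {seconds}s" if minutes else f"{seconds}s"
    return (
        f"[{payload['status'].upper()}] {payload['monitor_name']}: "
        f"{payload.get('error', 'no error detail')} "
        f"(since {payload['started_at']}, {duration}) {payload['url']}"
    )


alert = {
    "monitor_name": "ClickHouse - Primary",
    "status": "down",
    "url": "https://your-server.example.com/health/clickhouse",
    "error": "HTTP 503 Service Unavailable",
    "started_at": "2026-01-20T03:45:00Z",
    "duration_seconds": 90,
}
print(format_alert(alert))
# [DOWN] ClickHouse - Primary: HTTP 503 Service Unavailable (since 2026-01-20T03:45:00Z, 1m 30s) https://your-server.example.com/health/clickhouse
```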
Monitoring Setup Summary
| Monitor | Type | Endpoint | Alert threshold |
|---------|------|----------|----------------|
| ClickHouse uptime | HTTP | /health/clickhouse → 200 | 1 failure |
| Replication lag | HTTP | /health/clickhouse/replication → 200 | 1 failure |
| Query latency | HTTP | /health/clickhouse < 5s | 3 consecutive slow |
| TCP port | TCP | clickhouse-host:8123 | 1 failure |
Get Started for Free
Vigilmon monitors ClickHouse and your analytical infrastructure from multiple global regions with 1-minute checks and instant alerts. The free tier includes up to 5 monitors with no credit card required.
Start monitoring ClickHouse in under two minutes: vigilmon.online