ClickHouse Database Uptime Monitoring with Vigilmon

ClickHouse is the analytical database behind real-time dashboards, event pipelines, and business intelligence systems that can't afford downtime. When ClickHouse goes silent — due to a crashed process, a query that consumes all memory, or a disk that fills with compressed column data — every dashboard that reads from it goes blank and every pipeline that writes to it starts queuing or dropping data.

Vigilmon gives you external monitoring for ClickHouse uptime, query endpoint health, and replicated table liveness. This tutorial covers the complete setup from exposing a health endpoint to configuring multi-region checks and instant alerts.

Why ClickHouse Needs External Monitoring

ClickHouse ships with system tables like system.processes and system.replicas that give deep internal observability. But internal observability has a blind spot: it only works when ClickHouse is up and reachable. External monitoring catches the failures that internal checks can't:

Whether ClickHouse is reachable from your application servers and data pipeline nodes
Whether queries complete in acceptable time (a memory-hungry query can get the server OOM-killed)
Whether replicated tables have fallen behind the leader
Whether a Materialized View or scheduled Merge has silently broken
Whether ClickHouse is returning correct data (a corrupt part might cause specific queries to return empty results)

External HTTP probes from Vigilmon test the complete path — DNS → network → ClickHouse process → query execution — that your application actually depends on.
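To make the stages of that path concrete, here is a minimal sketch of an external probe that reports which stage failed. It uses only the Python standard library. The function names and stage labels are illustrative assumptions, not part of Vigilmon or ClickHouse.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc: BaseException) -> str:
    """Map an exception raised by an HTTP probe to the stage that failed."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return "query"  # the server answered, but with an error status
    # URLError wraps the underlying socket error in .reason.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, socket.gaierror):
        return "dns"  # the hostname did not resolve
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"  # no answer within the deadline
    if isinstance(reason, ConnectionRefusedError):
        return "process"  # host reachable, nothing listening on the port
    return "network"


def probe(url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> dict:
    """Run one probe; `opener` is injectable so the logic can be tested offline."""
    try:
        with opener(url, timeout=timeout) as resp:
            return {"ok": resp.status == 200, "stage": None}
    except Exception as exc:
        return {"ok": False, "stage": classify_failure(exc)}
```

A probe like this only sees what an outside client sees, which is the point: a "dns" or "network" result is a failure that ClickHouse's own system tables cannot report.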

Step 1: Use the ClickHouse HTTP Interface

ClickHouse has a built-in HTTP interface on port 8123 that you can probe directly. The simplest health check is a SELECT 1:

curl "http://localhost:8123/?query=SELECT%201"
# 1

This verifies that ClickHouse is running and can execute queries. For a check that also tells you how long the server has been up, call uptime():

curl "http://localhost:8123/?query=SELECT+uptime()"
# 86401

The uptime() function returns the number of seconds the server has been running. If the server restarts unexpectedly, your monitoring history captures the uptime dip.
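A restart can be detected by comparing two consecutive uptime() readings. The following is a small sketch, assuming you store the previous reading and know how many seconds passed between samples. The function name and the `tolerance` parameter are hypothetical.

```python
def restarted_between(prev_uptime: int, curr_uptime: int,
                      elapsed: int, tolerance: int = 5) -> bool:
    """Return True if the server restarted between two uptime() samples.

    Without a restart, uptime grows by roughly the time elapsed between
    samples. A reading that falls short of that means the counter was reset.
    """
    return curr_uptime < prev_uptime + elapsed - tolerance


# Hypothetical readings taken 60 seconds apart.
samples = [86401, 86461, 42, 102]
restarts = [
    i for i in range(1, len(samples))
    if restarted_between(samples[i - 1], samples[i], elapsed=60)
]
print(restarts)  # indexes of samples taken after a restart
```

Comparing against the expected growth, rather than only checking whether the value went down, also catches a restart on a server that had been up for less time than the sampling interval.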

Step 2: Build a Dedicated Health Endpoint

Rather than exposing port 8123 publicly, add a thin health proxy to your application server or create a dedicated lightweight handler:

Node.js Health Proxy

// health/clickhouse.js
const express = require('express');
const { createClient } = require('@clickhouse/client');

const app = express();

const client = createClient({
  host: process.env.CLICKHOUSE_HOST || 'http://localhost:8123',
  username: process.env.CLICKHOUSE_USER || 'default',
  password: process.env.CLICKHOUSE_PASSWORD || '',
});

app.get('/health/clickhouse', async (req, res) => {
  try {
    const result = await client.query({
      query: 'SELECT uptime() AS uptime, version() AS version',
      format: 'JSONEachRow',
    });
    const rows = await result.json();
    const { uptime, version } = rows[0];
    const uptimeSeconds = Number(uptime);

    // Report a recent restart as "recovering". The response is still 200, but
    // a monitor that matches the body on "status":"ok" will treat it as a failure.
    if (uptimeSeconds < 60) {
      return res.status(200).json({
        status: 'recovering',
        uptime_seconds: uptimeSeconds,
        version,
      });
    }

    return res.status(200).json({
      status: 'ok',
      uptime_seconds: uptimeSeconds,
      version,
    });
  } catch (err) {
    return res.status(503).json({ status: 'down', error: err.message });
  }
});

app.listen(3001);

Python Health Endpoint (FastAPI)

# health/clickhouse.py
import os

import clickhouse_connect
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()


# Plain "def" rather than "async def": the clickhouse_connect client is
# synchronous, so FastAPI runs this handler in its thread pool instead of
# blocking the event loop.
@app.get("/health/clickhouse")
def health():
    try:
        client = clickhouse_connect.get_client(
            host=os.getenv("CLICKHOUSE_HOST", "localhost"),
            port=int(os.getenv("CLICKHOUSE_PORT", "8123")),
            username=os.getenv("CLICKHOUSE_USER", "default"),
            password=os.getenv("CLICKHOUSE_PASSWORD", ""),
        )

        result = client.query("SELECT uptime() AS uptime, version() AS version")
        row = result.first_row
        uptime_seconds = int(row[0])

        return {
            "status": "ok",
            "uptime_seconds": uptime_seconds,
            "version": row[1],
        }

    except Exception as e:
        # JSONResponse escapes quotes and newlines in the error message,
        # which a hand-built f-string would not.
        return JSONResponse(
            status_code=503,
            content={"status": "down", "error": str(e)},
        )
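The Node.js proxy reports a "recovering" status when the server restarted within the last 60 seconds, while the FastAPI endpoint always reports "ok". If you want the same behavior in Python, the rule can be sketched as a small pure function. The function name and the `warmup_seconds` parameter are assumptions for illustration.

```python
def classify_uptime(uptime_seconds: int, warmup_seconds: int = 60) -> dict:
    """Build the health payload, flagging a recently restarted server."""
    status = "recovering" if uptime_seconds < warmup_seconds else "ok"
    return {"status": status, "uptime_seconds": uptime_seconds}


print(classify_uptime(12))
print(classify_uptime(86401))
```

Keeping the rule separate from the HTTP handler makes it easy to unit-test and to change the warm-up window without touching the endpoint.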

Step 3: Monitor Replicated Tables

If you're using ClickHouse replication with Keeper (or ZooKeeper), add a replication lag check:

from fastapi.responses import JSONResponse


@app.get("/health/clickhouse/replication")
def replication_health():
    try:
        # Use the same connection settings as the Step 2 endpoint.
        client = clickhouse_connect.get_client(...)

        # Note: the is_readonly = 0 filter skips read-only replicas, so a
        # replica that has lost its Keeper session is not reported here.
        result = client.query("""
            SELECT
                database,
                table,
                is_leader,
                absolute_delay,
                queue_size
            FROM system.replicas
            WHERE is_readonly = 0
              AND absolute_delay > 60  -- alert if more than 60s behind
            LIMIT 10
        """)

        lagging = result.result_rows
        if lagging:
            return JSONResponse(
                status_code=503,
                content={"status": "degraded", "lagging_tables": len(lagging)},
            )

        return {"status": "ok", "replication": "healthy"}

    except Exception as e:
        # JSONResponse escapes the error text so the body is always valid JSON.
        return JSONResponse(
            status_code=503,
            content={"status": "down", "error": str(e)},
        )
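The endpoint above returns only a count of lagging tables. When an alert fires, it usually helps to know which tables are behind. Here is a sketch of a helper that turns rows shaped like the query's columns into a more detailed payload. The helper name and output keys are hypothetical.

```python
from typing import Iterable, Tuple

# database, table, is_leader, absolute_delay, queue_size
Row = Tuple[str, str, int, int, int]


def summarize_lag(rows: Iterable[Row], max_delay: int = 60) -> dict:
    """List replicas that are more than max_delay seconds behind, worst first."""
    lagging = [
        {"table": f"{db}.{tbl}", "delay_seconds": delay, "queue_size": queue}
        for db, tbl, _is_leader, delay, queue in rows
        if delay > max_delay
    ]
    lagging.sort(key=lambda r: r["delay_seconds"], reverse=True)
    return {"status": "degraded" if lagging else "ok", "lagging_tables": lagging}


# Hypothetical rows in the same column order as the query.
rows = [
    ("analytics", "events", 1, 0, 0),
    ("analytics", "sessions", 0, 340, 18),
    ("billing", "invoices", 0, 95, 4),
]
print(summarize_lag(rows))
```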

Step 4: Monitor ClickHouse with Vigilmon

ClickHouse Uptime Monitor

Select HTTP / HTTPS as the monitor type
Set the URL to your health endpoint: https://your-server.example.com/health/clickhouse
Set the check interval to 1 minute
Under Expected response, configure:
- Status code: 200
- (Optional) Response body contains: "status":"ok"
Set the timeout to 10000ms — ClickHouse queries can legitimately take a few seconds
Save the monitor
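Before relying on the optional body match, it is worth confirming that your endpoint emits exactly that substring. The sketch below assumes the match is a plain substring test, which is an assumption about Vigilmon's behavior rather than something stated here. It shows how whitespace in the serialized JSON can make the check fail.

```python
import json

EXPECTED_BODY = '"status":"ok"'  # substring from the monitor settings above


def check_passes(status_code: int, body: str,
                 expected_status: int = 200,
                 expected_body: str = EXPECTED_BODY) -> bool:
    """Assumed check logic: status code must match and body must contain the substring."""
    return status_code == expected_status and expected_body in body


payload = {"status": "ok", "uptime_seconds": 86401}
compact = json.dumps(payload, separators=(",", ":"))
spaced = json.dumps(payload)  # default separators put a space after each colon

print(check_passes(200, compact))  # True
print(check_passes(200, spaced))   # False
```

Express's res.json and FastAPI's default JSON response both produce the compact form, so the endpoints in Step 2 match as written. A handler that serializes with Python's json.dumps defaults would not.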

Monitor via the Vigilmon API

# ClickHouse primary uptime monitor
curl -X POST https://api.vigilmon.online/v1/monitors \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ClickHouse - Primary",
    "type": "http",
    "url": "https://your-server.example.com/health/clickhouse",
    "interval": 60,
    "timeout": 10000,
    "expected_status": 200,
    "expected_body": "\"status\":\"ok\"",
    "regions": ["us-east", "eu-west"]
  }'

# Replication health monitor
curl -X POST https://api.vigilmon.online/v1/monitors \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ClickHouse - Replication",
    "type": "http",
    "url": "https://your-server.example.com/health/clickhouse/replication",
    "interval": 120,
    "timeout": 10000,
    "expected_status": 200
  }'
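If you manage several monitors, the same requests can be built from a script. This is a sketch that mirrors the fields used in the curl examples above and nothing more. The helper names are hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import List, Optional

API_URL = "https://api.vigilmon.online/v1/monitors"  # endpoint from the curl examples


def build_monitor(name: str, url: str, interval: int = 60,
                  timeout_ms: int = 10000,
                  expected_body: Optional[str] = None,
                  regions: Optional[List[str]] = None) -> dict:
    """Assemble a monitor definition using the same fields as the curl payloads."""
    payload = {
        "name": name,
        "type": "http",
        "url": url,
        "interval": interval,
        "timeout": timeout_ms,
        "expected_status": 200,
    }
    if expected_body is not None:
        payload["expected_body"] = expected_body
    if regions:
        payload["regions"] = list(regions)
    return payload


def build_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Wrap the payload in an authenticated POST request."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


primary = build_monitor(
    "ClickHouse - Primary",
    "https://your-server.example.com/health/clickhouse",
    expected_body='"status":"ok"',
    regions=["us-east", "eu-west"],
)
request = build_request(primary, "YOUR_API_KEY")
# To send it: urllib.request.urlopen(request, timeout=10)
```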

Step 5: Monitor Response Time Thresholds

ClickHouse is built for fast analytical queries, but a slow query or heavy merge can saturate resources and make your health endpoint itself slow to respond. Configure latency alerting:

Open the ClickHouse monitor in Vigilmon
Go to Latency Alerting
Set a threshold of 5000ms for the health endpoint (normal responses take <200ms)
Enable latency alerts — a slow health response indicates ClickHouse is under resource pressure

This catches scenarios like a runaway query consuming all available memory, and it does so before the kernel's OOM killer terminates the ClickHouse process.
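The summary table at the end of this tutorial lists three consecutive slow checks as the latency alert condition. That rule can be sketched as follows. How Vigilmon implements it internally is not described here, so treat the function as an illustration of the idea.

```python
from typing import Iterable


def should_alert(latencies_ms: Iterable[float],
                 threshold_ms: float = 5000,
                 consecutive: int = 3) -> bool:
    """Return True once `consecutive` checks in a row exceed the threshold."""
    streak = 0
    for latency in latencies_ms:
        # A single fast response resets the streak.
        streak = streak + 1 if latency > threshold_ms else 0
        if streak >= consecutive:
            return True
    return False


print(should_alert([180, 6000, 150, 7000]))    # False: slow checks are isolated
print(should_alert([150, 5200, 6100, 9000]))   # True: three slow checks in a row
```

Requiring a streak filters out a single slow response caused by a brief merge or a network blip, at the cost of a few minutes of detection delay.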

Step 6: Configure Alerts

Go to Alert Channels → Add Channel and set up your notification route. Webhook channels receive a JSON payload describing the incident, for example:

{
  "monitor_name": "ClickHouse - Primary",
  "status": "down",
  "url": "https://your-server.example.com/health/clickhouse",
  "error": "HTTP 503 Service Unavailable",
  "started_at": "2026-01-20T03:45:00Z",
  "duration_seconds": 90
}

For a ClickHouse production cluster, configure:

PagerDuty or Opsgenie webhook for on-call escalation
Slack webhook for team awareness
Email for audit trail

Assign alert channels to both the uptime and replication monitors.
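If you route alerts through your own webhook receiver, the payload shown above can be turned into a one-line message for chat or logs. This is a sketch that assumes the payload carries exactly the fields in the sample. The function name is hypothetical.

```python
import json


def format_alert(raw: str) -> str:
    """Render an alert payload as a single human-readable line."""
    event = json.loads(raw)
    minutes, seconds = divmod(int(event.get("duration_seconds", 0)), 60)
    return (
        f"[{event['status'].upper()}] {event['monitor_name']}: {event['error']} "
        f"(since {event['started_at']}, {minutes}m {seconds}s) {event['url']}"
    )


sample = json.dumps({
    "monitor_name": "ClickHouse - Primary",
    "status": "down",
    "url": "https://your-server.example.com/health/clickhouse",
    "error": "HTTP 503 Service Unavailable",
    "started_at": "2026-01-20T03:45:00Z",
    "duration_seconds": 90,
})
print(format_alert(sample))
```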

Monitoring Setup Summary

| Monitor | Type | Endpoint | Alert threshold |
|---------|------|----------|----------------|
| ClickHouse uptime | HTTP | /health/clickhouse → 200 | 1 failure |
| Replication lag | HTTP | /health/clickhouse/replication → 200 | 1 failure |
| Query latency | HTTP | /health/clickhouse < 5s | 3 consecutive slow |
| TCP port | TCP | clickhouse-host:8123 | 1 failure |
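The TCP row in the table is the only check not walked through in the steps above. As a rough sketch of what such a check tests, the function below tries to open a connection to the port and reports whether it succeeded. The function name is hypothetical, and `clickhouse-host` is the placeholder hostname from the table.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (not executed here):
# tcp_port_open("clickhouse-host", 8123)
```

A successful connection shows only that something is listening on the port. It does not show that ClickHouse can execute queries, which is why the HTTP health checks remain the primary signal.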

Get Started for Free

Vigilmon monitors ClickHouse and your analytical infrastructure from multiple global regions with 1-minute checks and instant alerts. The free tier includes up to 5 monitors with no credit card required.

Start monitoring ClickHouse in under two minutes: vigilmon.online