
Cassandra Monitoring and Alerting for Production

Apache Cassandra's distributed architecture makes failures subtle: a node goes down, compaction falls behind, or the CQL port stops responding without obvious errors. Here's how to set up external monitoring and alerting for Cassandra clusters with Vigilmon.

Running Cassandra in production is running a distributed system where things fail in ways that don't announce themselves loudly. A node silently stops accepting writes. Compaction falls behind and read latencies spike. A replica goes down, dropping your replication factor below quorum without triggering any obvious error. By the time your application returns query timeouts, the problem has usually been building for minutes.

I've worked with Cassandra clusters from single-node development setups to multi-datacenter production deployments, and the monitoring lesson is consistent: internal JMX metrics tell you what happened; external probes tell you whether your cluster is reachable and healthy right now. You need both. This tutorial focuses on setting up the external layer with Vigilmon — HTTP probes, TCP port monitoring, and heartbeat checks for Cassandra-dependent applications.


How Cassandra Fails in Production

Understanding failure modes shapes what you monitor:

Node failure — A node leaves the ring (crash, OOM kill, network partition). Depending on your replication factor and consistency level, this may immediately affect write availability or cause read timeouts as the coordinator retries against fewer replicas.

CQL port unavailability: The Cassandra process is running but the native transport port (9042 by default) stops accepting connections. This can happen during heavy GC pauses, under JVM heap pressure, or once the connection limit set by native_transport_max_concurrent_connections is reached.

Compaction backlog — SSTable compaction falls behind under heavy write load. Read latencies climb as the read path must merge more SSTables per query. No single process crashes, but the cluster becomes effectively degraded.

Schema disagreement — Nodes disagree on schema version during rolling upgrades or DDL operations. Queries routed to nodes with schema mismatches can fail or return unexpected results.

Gossip-level isolation — A node remains running but stops participating in gossip (the peer-to-peer health protocol). It serves local reads and writes but doesn't see topology changes.

The external monitoring strategy: check CQL port availability per node, build HTTP health endpoints that test actual query execution, and heartbeat-monitor any application workers that consume from Cassandra.


Step 1: Build a Cassandra Health Endpoint

Cassandra doesn't expose an HTTP health API natively. I add a thin health endpoint in the application layer (or a dedicated sidecar) that executes a CQL query and returns the result.

Node.js (Express + cassandra-driver)

// health/cassandra.js
const express = require('express');
const cassandra = require('cassandra-driver');

const app = express();

const client = new cassandra.Client({
  contactPoints: (process.env.CASSANDRA_HOSTS || 'localhost').split(','),
  localDataCenter: process.env.CASSANDRA_DC || 'datacenter1',
  credentials: {
    username: process.env.CASSANDRA_USER || 'cassandra',
    password: process.env.CASSANDRA_PASS || 'cassandra',
  },
  socketOptions: { connectTimeout: 3000, readTimeout: 5000 },
});

client.connect().catch(console.error);

app.get('/health/cassandra', async (req, res) => {
  try {
    // system.local gives us cluster name and schema version without touching user data
    const result = await client.execute(
      'SELECT cluster_name, schema_version, cql_version FROM system.local',
      [],
      { consistency: cassandra.types.consistencies.localOne }
    );

    const row = result.first();
    if (!row) throw new Error('No response from system.local');

    // Check how many peers are known (rough node count)
    const peers = await client.execute(
      'SELECT peer FROM system.peers',
      [],
      { consistency: cassandra.types.consistencies.localOne }
    );

    return res.status(200).json({
      status: 'ok',
      cluster: row.cluster_name,
      schema_version: row.schema_version,
      cql_version: row.cql_version,
      known_peers: peers.rows.length,
    });
  } catch (err) {
    return res.status(503).json({ status: 'down', error: err.message });
  }
});

app.listen(3005, () => console.log('Cassandra health endpoint on :3005'));

Python (FastAPI + cassandra-driver)

# health/cassandra_health.py
import os
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.policies import DCAwareRoundRobinPolicy
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

auth = PlainTextAuthProvider(
    username=os.environ.get("CASSANDRA_USER", "cassandra"),
    password=os.environ.get("CASSANDRA_PASS", "cassandra"),
)
cluster = Cluster(
    contact_points=os.environ.get("CASSANDRA_HOSTS", "localhost").split(","),
    auth_provider=auth,
    load_balancing_policy=DCAwareRoundRobinPolicy(
        local_dc=os.environ.get("CASSANDRA_DC", "datacenter1")
    ),
    connect_timeout=3,
)
session = cluster.connect()

@app.get("/health/cassandra")
def cassandra_health():
    try:
        row = session.execute(
            "SELECT cluster_name, schema_version FROM system.local"
        ).one()

        # Fail fast before running the second query
        if not row:
            return JSONResponse(status_code=503, content={
                "status": "down", "error": "no response from system.local"
            })

        # Count known peers (rough node count, excludes the node that answered)
        peers = session.execute("SELECT peer FROM system.peers")
        peer_count = len(list(peers))

        return {
            "status": "ok",
            "cluster": row.cluster_name,
            "schema_version": str(row.schema_version),
            "known_peers": peer_count,
        }
    except Exception as e:
        return JSONResponse(status_code=503, content={
            "status": "down", "error": str(e)
        })

Verify the endpoint before wiring up Vigilmon:

curl -i https://your-app.example.com/health/cassandra
# HTTP/1.1 200 OK
# {"status":"ok","cluster":"production","schema_version":"abc123","known_peers":2}
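
The endpoints above report the schema version of one node only. As a minimal sketch of how the same data could flag the schema disagreement failure mode, the helpers below compare the version from system.local with the schema_version column of system.peers. The function names and the "local" label are illustrative assumptions, not part of either driver.

```python
from collections import defaultdict


def group_by_schema_version(local_version, peer_versions):
    """Group node addresses by the schema version they report.

    local_version: schema_version read from system.local
    peer_versions: mapping of peer address -> schema_version from system.peers
                   (None for peers that have not reported a version)
    """
    groups = defaultdict(list)
    groups[str(local_version)].append("local")  # "local" is an illustrative label
    for peer, version in peer_versions.items():
        if version is None:
            continue  # skip peers with no reported version
        groups[str(version)].append(peer)
    return dict(groups)


def schema_in_agreement(local_version, peer_versions):
    """True when every node that reports a version reports the same one."""
    return len(group_by_schema_version(local_version, peer_versions)) == 1
```

In the FastAPI endpoint, peer_versions could be built from "SELECT peer, schema_version FROM system.peers". A brief disagreement right after a DDL statement is expected, so it is probably safer to surface this as a field in the JSON body than to return 503 on the first mismatch.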

Step 2: Multi-Node Monitoring Setup

For a Cassandra cluster, monitoring a single node isn't enough. I set up one HTTP monitor per node so I can see exactly which node is degraded, not just that "the cluster is unhealthy."

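Per-node health endpoints: deploy the health sidecar on each Cassandra node and point contactPoints at localhost so the health check targets that specific node's CQL port. Contact points only bootstrap the connection, and the driver then discovers and routes queries to the rest of the cluster, so also restrict the load balancing policy to the local node (for example, WhiteListRoundRobinPolicy in the Python driver). Each sidecar then answers on its own URL: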

# Node 1 health endpoint
curl https://cassandra-node1.internal:3005/health/cassandra

# Node 2 health endpoint
curl https://cassandra-node2.internal:3005/health/cassandra

# Node 3 health endpoint
curl https://cassandra-node3.internal:3005/health/cassandra

Name monitors clearly in Vigilmon:

  • [cassandra-node1] cql-health
  • [cassandra-node2] cql-health
  • [cassandra-node3] cql-health

Group all node monitors into a single Status Page so on-call engineers see the cluster health at a glance.
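
Before creating the monitors, it can help to confirm that every per-node endpoint answers the way the monitor will expect. The sketch below polls each URL and reports which nodes are degraded. It uses only the standard library, and the helper names and the injectable fetch parameter are assumptions made for illustration.

```python
import json
import urllib.request


def fetch_health(url, timeout=5.0):
    """Return (http_status, parsed_json_body) for a health URL."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, json.loads(resp.read().decode("utf-8"))


def check_nodes(urls_by_node, fetch=fetch_health):
    """Return {node_name: 'ok' or 'down: <reason>'} for every node."""
    results = {}
    for node, url in urls_by_node.items():
        try:
            status, body = fetch(url)
            healthy = status == 200 and body.get("status") == "ok"
            results[node] = (
                "ok" if healthy
                else f"down: HTTP {status}, body status {body.get('status')!r}"
            )
        except Exception as exc:  # refused, timeout, HTTP 5xx, or invalid JSON
            results[node] = f"down: {exc}"
    return results


if __name__ == "__main__":
    nodes = {
        f"cassandra-node{i}": f"https://cassandra-node{i}.internal:3005/health/cassandra"
        for i in (1, 2, 3)
    }
    for name, state in check_nodes(nodes).items():
        print(f"[{name}] cql-health: {state}")
```

Passing fetch as a parameter keeps the loop testable without a live cluster.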


Step 3: Configure Vigilmon HTTP Monitor for Cassandra

  1. Log in to vigilmon.online and go to Monitors → New Monitor
  2. Choose HTTP / HTTPS
  3. Set the URL: https://cassandra-node1.internal:3005/health/cassandra
  4. Set the check interval: 1 minute
  5. Under Expected response:
    • Status code: 200
    • Response body contains: "status":"ok"
    • Response time threshold: 5000ms (CQL queries can be slower than HTTP APIs)
  6. Under Alert channels, assign your on-call Slack channel or PagerDuty
  7. Save — repeat for each node

Vigilmon probes from multiple geographic regions and requires multi-region consensus before opening an incident. A slow single probe won't page your on-call — a genuine CQL failure that affects all probe regions will.
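
To make the consensus idea concrete, here is a small illustration of the decision rule. It is not Vigilmon's implementation, which this tutorial does not describe. The region names, the function name, and the min_failing_fraction parameter are all assumptions.

```python
def should_open_incident(region_results, min_failing_fraction=1.0):
    """Decide whether failed probes amount to an incident.

    region_results: mapping of region name -> True if the probe succeeded.
    min_failing_fraction: share of regions that must fail (1.0 = all of them).
    """
    if not region_results:
        return False  # no data is not evidence of an outage
    failing = sum(1 for ok in region_results.values() if not ok)
    return failing / len(region_results) >= min_failing_fraction


# One slow or failed region does not open an incident...
print(should_open_incident({"eu": False, "us": True, "ap": True}))
# ...but a failure seen from every region does.
print(should_open_incident({"eu": False, "us": False, "ap": False}))
```

The useful property is that a failure has to be visible from several vantage points before anyone is paged, which filters out problems on the probe's own network path.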


Step 4: TCP Port Monitoring for the CQL Native Transport

In addition to the HTTP health endpoint, I add a TCP port monitor for each node's CQL port (9042). The TCP monitor catches failures where the Cassandra process is still running but the native transport has stopped accepting connections — a common symptom of JVM heap pressure or GC pause storms.

  1. In Vigilmon, go to Monitors → New Monitor → TCP Port
  2. Set the host: cassandra-node1.internal
  3. Set the port: 9042
  4. Set the check interval: 30 seconds
  5. Repeat for each cluster node

The TCP monitor provides an earlier signal than the HTTP probe — connection failures register immediately, without waiting for the health query to time out.
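
For reference, a TCP port check amounts to little more than the following. This is a minimal sketch with the standard library, useful for reproducing the probe by hand from a host that can reach the node. The function name is an assumption.

```python
import socket


def cql_port_open(host, port=9042, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure, or timeout
        return False


if __name__ == "__main__":
    print(cql_port_open("cassandra-node1.internal"))
```

A successful connect only shows that something is listening on the port. It does not prove that CQL queries succeed, which is why the TCP monitor complements the HTTP health endpoint rather than replacing it.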


Step 5: Alerting for Node Failures

Cassandra node failures need tiered alerting because single-node failures in a properly replicated cluster are often non-urgent, while multi-node failures are critical.

Configure alert routing in Vigilmon:

| Monitor | Alert channels | Priority |
|---|---|---|
| Primary data center node | Slack + PagerDuty | P2 |
| Second node in same DC | Slack + PagerDuty | P1 (quorum at risk) |
| Cross-DC replication node | Slack | P2 |
| CQL TCP port | Slack | P2 |

For response time alerting, alert at 3000ms, which is tighter than the 5000ms response time threshold from Step 3. A slow CQL query to system.local (which should be sub-10ms) indicates the coordinator is under severe GC pressure or disk I/O contention.
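
The reasoning behind the tiers is quorum arithmetic: QUORUM needs floor(RF / 2) + 1 replicas, so with a replication factor of 3 the first failure leaves quorum intact and the second one breaks it. A sketch of that rule follows. The function names are assumptions, and it takes the conservative view that every down node is a replica of the same token range.

```python
def quorum_size(replication_factor):
    """Replicas that must respond for QUORUM or LOCAL_QUORUM."""
    return replication_factor // 2 + 1


def alert_priority(replication_factor, nodes_down):
    """Map the number of unreachable replicas to an alert priority.

    Returns None when nothing is down, "P2" while enough replicas remain
    for quorum, and "P1" once they do not.
    """
    if nodes_down <= 0:
        return None
    live = replication_factor - nodes_down
    return "P2" if live >= quorum_size(replication_factor) else "P1"


for down in range(4):
    print(f"RF=3, {down} down -> {alert_priority(3, down)}")
```

In a cluster with more nodes than the replication factor, two failures only break quorum for the token ranges those two nodes share, so treating the second failure as P1 errs on the side of paging early.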


Step 6: Heartbeat Monitoring for Cassandra-Dependent Workers

Batch processing jobs, ETL pipelines, and analytics workers that read from or write to Cassandra will stall silently when a node fails or CQL becomes unavailable. Vigilmon heartbeat monitors detect this: your worker sends a ping after each successful processing cycle, and Vigilmon pages you when pings stop.

# etl_job.py — Cassandra ETL with Vigilmon heartbeat
import os
import requests
from cassandra.cluster import Cluster

cluster = Cluster(contact_points=["cassandra-node1.internal"])
session = cluster.connect("analytics")

def process_batch(rows):
    for row in rows:
        # ... process and write ...
        pass

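def run():
    # The date literal is a placeholder; a scheduled job would select its own partition
    rows = session.execute("SELECT * FROM events WHERE date = '2026-06-30'")
    process_batch(rows)

    # Ping Vigilmon only after a successful batch; an exception above skips the ping
    requests.get(os.environ["VIGILMON_HEARTBEAT_URL"], timeout=5)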

if __name__ == "__main__":
    run()

Set the heartbeat interval in Vigilmon to your expected batch frequency, with a 2x grace period before alerting.
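
As a worked example of the grace period, the sketch below treats "2x" as meaning that an alert fires once the silence exceeds twice the expected interval. That reading is an assumption, and Vigilmon's own setting may be defined differently. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def heartbeat_overdue(last_ping, now, expected_interval, grace_factor=2.0):
    """True when no ping has arrived within expected_interval * grace_factor."""
    return now - last_ping > expected_interval * grace_factor


last = datetime(2026, 1, 1, 0, 0)
hourly = timedelta(hours=1)

# 90 minutes of silence on an hourly job is still inside the grace window
print(heartbeat_overdue(last, last + timedelta(minutes=90), hourly))
# just over 2 hours of silence is not
print(heartbeat_overdue(last, last + timedelta(minutes=121), hourly))
```

For an hourly batch this means one late or skipped run is tolerated, and the second missed run triggers the alert.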


Summary

Cassandra failures are subtle — they often degrade quietly before causing visible errors. Here's the monitoring stack that catches them early:

| Monitor Type | What It Catches |
|---|---|
| HTTP probe per node | CQL query availability, driver connectivity, schema version |
| TCP port 9042 per node | Native transport failures, GC-induced connection drops |
| Heartbeat monitor | Batch job and worker stalls |

Multi-region probe consensus in Vigilmon means you get confident alerts — not false positives from transient network blips.

Start monitoring free at vigilmon.online — your first Cassandra node monitor is running in under two minutes.
