Database Health Monitoring with Uptime Checks: A Practical Guide for 2026

The database is the most consequential single point of failure in most application stacks. When the database goes down, everything goes down. When replication falls behind, reads from replicas return stale data. When the backup job silently stops running, you discover the gap only when you need to restore.

Internal database metrics — query latency, connection pool utilization, buffer hit ratios — are important for performance tuning. But they miss the availability failure modes that matter most to end users: the database is unreachable, the TCP port is closed, the replica is lagging so far behind it's serving stale data, or last night's backup didn't run.

This guide covers external database monitoring with Vigilmon: why it matters beyond internal metrics, TCP port monitoring for database availability, health endpoint patterns, replication lag detection via keyword checks, backup job heartbeats, and practical Vigilmon configurations for PostgreSQL, MySQL, MongoDB, Redis, and ClickHouse.


Why External Database Monitoring Matters

Internal monitoring tools — Prometheus exporters, cloud provider dashboards, database built-in metrics — give you rich visibility into database internals: query plans, cache hit ratios, lock waits, connection pool saturation. This is invaluable for performance tuning and capacity planning.

External monitoring catches a different set of failures:

Network-level failures: A database might be running normally but unreachable due to a firewall rule change, a security group misconfiguration, a certificate renewal that caused a connection drop, or a routing issue between your application servers and the database. Internal metrics show the database as healthy because it is running — the failure is in the network path between clients and the database.

Port binding failures: A database restart that fails to re-bind to the configured port leaves the internal monitoring agent (often co-located) able to connect via local socket while remote connections fail. TCP port monitoring from an external vantage point catches this.

Application-layer health via health API: The database process is running, the port is open, but the application cannot query it — connection pool exhausted, authentication misconfigured, disk full preventing writes. An application-layer health endpoint that actually executes a test query surfaces this.

Replication lag: A replica that's running fine internally but lagging 30 minutes behind the primary is serving stale data to every query routed to it. The replica's own metrics show it as healthy. External monitoring of a replication-lag-aware health endpoint catches this.

Backup job failures: The most dangerous silent failure. A backup job that stops running leaves no visible trace — no errors, no alerts from the database, no log warnings. The data is simply not being backed up. Heartbeat monitoring catches this.


TCP Port Monitoring for Database Availability

The fastest way to monitor database reachability: monitor the TCP port.

Vigilmon's TCP monitors open a raw socket connection to the host and port, confirm the connection succeeds, and alert if the connection fails or times out. This is the lowest-level check — it confirms the database process is bound to the port and the network path is open.
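To make the check concrete, here is a minimal sketch of a TCP reachability probe in Python. It is an illustration of the technique, not Vigilmon's implementation; the function name `tcp_port_open` and the 5-second default timeout are assumptions chosen for the example.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A successful handshake only shows that something is listening on the port. It says nothing about whether the database can authenticate clients or execute queries, which is why the later sections add application-layer checks.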

Default Database Ports

| Database | Default Port |
|---|---|
| PostgreSQL | 5432 |
| MySQL / MariaDB | 3306 |
| MongoDB | 27017 |
| Redis | 6379 |
| ClickHouse (native) | 9000 |
| ClickHouse (HTTP) | 8123 |
| Elasticsearch | 9200 |
| Cassandra | 9042 |

Setting Up TCP Monitoring in Vigilmon

  1. Create a new TCP monitor in Vigilmon
  2. Set the host to your database server hostname or IP
  3. Set the port to the database's listening port
  4. Set check interval (1 minute recommended for production databases)
  5. Enable multi-region consensus alerting (default in Vigilmon)

Example: monitoring a PostgreSQL primary and its read replica:

  • Monitor 1 (TCP): db-primary.internal.example.com:5432
  • Monitor 2 (TCP): db-replica.internal.example.com:5432

If your databases are not publicly addressable (common for production setups), see the "Monitoring Private Databases" section below.


Health Endpoints via Application API

TCP port monitoring confirms the database is accepting connections. It doesn't confirm the application can actually query the database. For that, use a health endpoint in your application that performs a real test query.

Standard Health Endpoint Pattern

Most web frameworks support a health check route that validates database connectivity:

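# FastAPI / Python example
import os

import asyncpg
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
DATABASE_URL = os.environ["DATABASE_URL"]  # assumption: the DSN is supplied via the environment

@app.get("/health/db")
async def db_health():
    try:
        conn = await asyncpg.connect(DATABASE_URL, timeout=5)
        try:
            await conn.fetchval("SELECT 1")
        finally:
            await conn.close()
        return {"status": "ok", "database": "connected"}
    except Exception as e:
        # FastAPI does not treat a (body, status) tuple as a status code, so return
        # an explicit JSONResponse to make the endpoint answer 503.
        return JSONResponse(
            status_code=503,
            content={"status": "error", "database": "unreachable", "detail": str(e)},
        )

// Express / Node.js example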
app.get('/health/db', async (req, res) => {
  try {
    await db.query('SELECT 1');
    res.json({ status: 'ok', database: 'connected' });
  } catch (err) {
    res.status(503).json({ status: 'error', database: 'unreachable' });
  }
});

Vigilmon HTTP monitor configuration:

  • URL: https://api.yourdomain.com/health/db
  • Expected status: 200
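  • Keyword check: "status":"ok" (FastAPI and Express serialize JSON compactly, with no space after the colon; if the keyword is matched as a literal substring, it must match the exact bytes your endpoint returns)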
  • Interval: 1 minute

If the database becomes unreachable, the health endpoint returns 503, the keyword check fails, and Vigilmon alerts.
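The status and keyword rules above can be modeled in a few lines. The sketch below is a hypothetical model of such a check, assuming the keyword is matched as a literal substring of the response body; `check_passes` is an illustrative name, not a Vigilmon API. It also shows why whitespace in the serialized JSON matters when choosing the keyword.

```python
import json
from typing import Optional


def check_passes(status_code: int, body: str,
                 expected_status: int = 200,
                 keyword: Optional[str] = None) -> bool:
    """Pass only if the status matches and the keyword (if any) appears in the body."""
    if status_code != expected_status:
        return False
    return keyword is None or keyword in body


# The same payload serialized two ways: compact, and Python's default spacing.
compact = json.dumps({"status": "ok"}, separators=(",", ":"))
spaced = json.dumps({"status": "ok"})
```

Under this model, `check_passes(200, compact, keyword='"status": "ok"')` fails even though the endpoint is healthy, because the compact body has no space after the colon. Copying the keyword from a real response avoids that class of false alarm.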


Replication Lag Detection via Keyword Checks

Read replicas that lag too far behind the primary serve stale data to your application's read paths. Standard uptime monitoring checks whether the replica is reachable — it doesn't check whether the replica is current.

Pattern: Replication-Aware Health Endpoint

Add a replication lag check to your health endpoint:

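# PostgreSQL replication lag check
from fastapi.responses import JSONResponse

LAG_THRESHOLD_SECONDS = 30

@app.get("/health/replication")
async def replication_health():
    conn = await asyncpg.connect(REPLICA_URL)
    try:
        # pg_last_xact_replay_timestamp() returns the commit time of the last
        # transaction replayed on the standby. If the primary is idle, the value
        # stops advancing, so this lag grows even when the replica is caught up.
        lag = await conn.fetchval("""
            SELECT EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp()))
        """)
    finally:
        await conn.close()

    if lag is None:
        # Replica hasn't replayed anything yet or is a primary
        return {"status": "unknown", "lag_seconds": None}
    lag = float(lag)  # EXTRACT returns numeric (Decimal) on PostgreSQL 14 and later
    if lag > LAG_THRESHOLD_SECONDS:
        # FastAPI ignores (body, status) tuples, so set the 503 explicitly
        return JSONResponse(status_code=503, content={"status": "lagging", "lag_seconds": round(lag)})
    return {"status": "ok", "lag_seconds": round(lag)}

# MySQL replication lag check (an alternative to the PostgreSQL endpoint above;
# db is assumed to be an async client whose rows allow access by column name)
@app.get("/health/replication")
async def mysql_replication_health():
    result = await db.execute("SHOW REPLICA STATUS")
    row = result.fetchone()

    if not row:
        # This server is not configured as a replica
        return {"status": "not_replica"}

    lag = row["Seconds_Behind_Source"]

    if lag is None:
        # NULL means replication is not running
        return JSONResponse(status_code=503, content={"status": "error", "message": "replication_stopped"})
    if lag > LAG_THRESHOLD_SECONDS:
        return JSONResponse(status_code=503, content={"status": "lagging", "lag_seconds": lag})
    return {"status": "ok", "lag_seconds": lag}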

Vigilmon HTTP monitor for replication:

  • URL: https://api.yourdomain.com/health/replication
  • Expected status: 200
  • Keyword check: "status":"ok" (compact form, as FastAPI serializes it)
  • Interval: 2 minutes

If the replica lags beyond the threshold, the endpoint returns 503, keyword check fails, and Vigilmon alerts — even though the replica is reachable and the TCP port check would show green.


Backup Job Heartbeats

Backup job failure is the highest-stakes silent failure in database operations. A backup job that stops running produces no errors and no alerts from standard monitoring. You discover the gap when you need to restore — and realize the last good backup is weeks old.

Vigilmon's heartbeat monitoring is designed exactly for this scenario.

How Heartbeat Monitoring Works

Create a heartbeat monitor in Vigilmon. Vigilmon generates a unique ping URL. Your backup script sends an HTTP POST to this URL on each successful completion. If no ping arrives within the configured timeout window, Vigilmon fires an alert.

The inversion is the key: Vigilmon waits for your job to check in, rather than probing your job. The job's absence is the alert.
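The inversion can be expressed as a simple deadline test. The sketch below is a hypothetical model of the idea, not Vigilmon's internal logic; the function name and the choice to treat a never-pinged monitor as overdue are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def heartbeat_overdue(last_ping: Optional[datetime],
                      now: datetime,
                      timeout: timedelta) -> bool:
    """Return True when no ping has arrived within the timeout window."""
    if last_ping is None:
        # Assumption: a monitor that has never been pinged counts as overdue.
        return True
    return now - last_ping > timeout
```

With a 25-hour timeout, a ping received 24 hours ago is still inside the window, while one received 26 hours ago triggers the alert. Nothing on the job's side has to run for the alert to fire, which is what makes the pattern suitable for jobs that fail by not running at all.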

PostgreSQL Backup Heartbeat

#!/bin/bash
# pg_backup.sh

set -e

BACKUP_DIR="/backups/postgresql"
DB_NAME="production"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.dump"

# Run the backup
pg_dump -Fc "$DB_NAME" > "$BACKUP_FILE"

# Verify the backup is non-zero size
if [ ! -s "$BACKUP_FILE" ]; then
    echo "ERROR: Backup file is empty"
    exit 1
fi

# Upload to object storage
aws s3 cp "$BACKUP_FILE" "s3://your-backups-bucket/postgresql/"

# Ping Vigilmon heartbeat on success
curl -fsS -X POST https://hb.vigilmon.online/ping/YOUR_HEARTBEAT_ID

echo "Backup completed: $BACKUP_FILE"

Vigilmon heartbeat configuration:

  • Timeout: 25 hours (for a daily backup — allows the job to run at any point within the day plus 1 hour buffer)
  • Alert: email + Slack webhook

If the backup cron job fails partway through (before the curl ping line), or if the job stops being scheduled entirely, the ping never arrives and Vigilmon alerts within 25 hours — long before you discover the gap the hard way.

MySQL Backup Heartbeat

#!/bin/bash
# mysql_backup.sh

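# pipefail makes the pipeline fail when mysqldump fails; without it, gzip's
# success masks the error and leaves a small but non-empty archive behind
set -eo pipefail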

BACKUP_DIR="/backups/mysql"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/production_${TIMESTAMP}.sql.gz"

mysqldump --single-transaction --all-databases | gzip > "$BACKUP_FILE"

# Verify backup
if [ ! -s "$BACKUP_FILE" ]; then
    exit 1
fi

# Upload and ping
aws s3 cp "$BACKUP_FILE" "s3://your-backups-bucket/mysql/"
curl -fsS -X POST https://hb.vigilmon.online/ping/YOUR_HEARTBEAT_ID

MongoDB Backup Heartbeat

#!/bin/bash
# mongo_backup.sh

set -e

BACKUP_DIR="/backups/mongodb"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

mongodump --uri="$MONGO_URI" --out="$BACKUP_DIR/dump_${TIMESTAMP}" --gzip

# Archive
tar -czf "$BACKUP_DIR/mongo_${TIMESTAMP}.tar.gz" -C "$BACKUP_DIR" "dump_${TIMESTAMP}"
rm -rf "$BACKUP_DIR/dump_${TIMESTAMP}"

# Upload and ping
aws s3 cp "$BACKUP_DIR/mongo_${TIMESTAMP}.tar.gz" s3://your-backups-bucket/mongodb/
curl -fsS -X POST https://hb.vigilmon.online/ping/YOUR_HEARTBEAT_ID

Database-Specific Monitoring Patterns

PostgreSQL

PostgreSQL health monitoring layers:

Layer 1 — TCP availability:

  • TCP monitor: postgres-primary.yourdomain.com:5432
  • TCP monitor: postgres-replica.yourdomain.com:5432 (if using read replicas)

Layer 2 — Application-layer health:

  • HTTP monitor: https://api.yourdomain.com/health/db
  • Keyword: "status":"ok"
  • Tests an actual SELECT 1 query through the connection pool

Layer 3 — Replication health:

  • HTTP monitor: https://api.yourdomain.com/health/replication
  • Keyword: "status":"ok"
  • Checks pg_last_xact_replay_timestamp() lag

Layer 4 — Backup heartbeat:

  • Heartbeat monitor, timeout: 25 hours
  • pg_dump script pings on successful completion

Layer 5 — Vacuum/maintenance job:

  • Heartbeat monitor, timeout: 1 week
  • Autovacuum is automatic, but if you run manual VACUUM ANALYZE on large tables, monitoring completion is prudent

MySQL / MariaDB

MySQL health monitoring layers:

Layer 1 — TCP: mysql.yourdomain.com:3306

Layer 2 — Health API:

SELECT 1; -- test connection
SHOW REPLICA STATUS; -- check replication

Layer 3 — Replication lag: HTTP endpoint checking Seconds_Behind_Source

Layer 4 — Backup heartbeat: mysqldump or xtrabackup script

Additional: Monitor your MySQL slow query log export job if you ship slow queries to an analytics system.

MongoDB

MongoDB health monitoring layers:

Layer 1 — TCP: mongodb.yourdomain.com:27017

Layer 2 — Health API:

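from fastapi.responses import JSONResponse

@app.get("/health/mongo")
async def mongo_health():
    try:
        result = await client.admin.command("ping")
        if result.get("ok") == 1.0:
            return {"status": "ok"}
        # FastAPI ignores (body, status) tuples, so set the 503 explicitly
        return JSONResponse(status_code=503, content={"status": "error"})
    except Exception:
        return JSONResponse(status_code=503, content={"status": "unreachable"})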

Layer 3 — Replica set health:

from fastapi.responses import JSONResponse

@app.get("/health/mongo/replication")
async def mongo_replica_health():
    status = await client.admin.command("replSetGetStatus")
    primary_found = any(m["stateStr"] == "PRIMARY" for m in status["members"])
    lagging_members = [
        m for m in status["members"]
        if m["stateStr"] == "SECONDARY" and m.get("optimeDate") is not None
        # total_seconds() includes whole days; .seconds wraps around every 24 hours
        and (status["date"] - m["optimeDate"]).total_seconds() > 30
    ]
    if not primary_found or lagging_members:
        return JSONResponse(status_code=503, content={"status": "degraded"})
    return {"status": "ok"}

Layer 4 — Backup heartbeat: mongodump script with ping on success.

Redis

Redis availability monitoring is usually simpler. Redis replication is asynchronous and lag is typically small, so many setups skip a dedicated lag check. If you serve reads from replicas, keep in mind that lag can still grow under heavy write load or network trouble.

Layer 1 — TCP: redis.yourdomain.com:6379

Layer 2 — Health API:

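from fastapi.responses import JSONResponse

@app.get("/health/redis")
async def redis_health():
    try:
        pong = await redis_client.ping()
        if pong:
            return {"status": "ok"}
        # FastAPI ignores (body, status) tuples, so set the 503 explicitly
        return JSONResponse(status_code=503, content={"status": "error"})
    except Exception:
        return JSONResponse(status_code=503, content={"status": "unreachable"})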

Layer 3 — Memory pressure monitoring (via health endpoint):

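from fastapi.responses import JSONResponse

@app.get("/health/redis/memory")  # hypothetical path, pick one that fits your API
async def redis_memory_health():
    info = await redis_client.info("memory")
    # maxmemory is 0 when no limit is configured; report 0% in that case
    used_pct = info["used_memory"] / info["maxmemory"] * 100 if info.get("maxmemory") else 0

    if used_pct > 90:
        return JSONResponse(status_code=503, content={"status": "degraded", "memory_pct": round(used_pct)})
    return {"status": "ok", "memory_pct": round(used_pct)}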

Layer 4 — RDB dump heartbeat: If you rely on Redis persistence, monitor the RDB dump job:

# After confirming dump.rdb was updated recently:
curl -fsS -X POST https://hb.vigilmon.online/ping/YOUR_REDIS_BACKUP_HEARTBEAT_ID
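The "updated recently" test in the comment above can be done by comparing the file's modification time against a maximum age. Here is a minimal sketch; the function name, the path, and the one-hour limit in the usage note are assumptions, and the right limit depends on your Redis save configuration.

```python
import os
import time
from typing import Optional


def file_updated_within(path: str, max_age_seconds: float,
                        now: Optional[float] = None) -> bool:
    """Return True if the file exists and was modified within max_age_seconds."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # Missing or unreadable file: treat as not fresh.
        return False
    current = time.time() if now is None else now
    return current - mtime <= max_age_seconds
```

A scheduled script could call `file_updated_within("/var/lib/redis/dump.rdb", 3600)` and send the heartbeat ping only when it returns True. Redis may skip snapshots when no keys have changed, so a quiet instance can legitimately have an older file.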

ClickHouse

ClickHouse exposes an HTTP interface on port 8123, making it natively compatible with Vigilmon HTTP monitoring:

Layer 1 — HTTP health (ClickHouse built-in):

  • URL: http://clickhouse.yourdomain.com:8123/ping (8123 is the plain-HTTP port; with TLS enabled, use https:// on the HTTPS port, 8443 by default)
  • Expected status: 200
  • Keyword: Ok.

This is ClickHouse's built-in ping endpoint — it returns Ok. when healthy.

Layer 2 — Replica sync health:

-- ClickHouse replicated table health check
SELECT
    database,
    table,
    is_leader,
    absolute_delay
FROM system.replicas
WHERE absolute_delay > 30

Wrap in a health API endpoint and monitor with Vigilmon.
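One way to do the wrapping is to map the query result to a status code and a small JSON body. The sketch below covers only that mapping and assumes the rows arrive as dictionaries keyed by the column names in the query; the function name and the response fields are illustrative, and fetching the rows is left to whichever ClickHouse client you use.

```python
from typing import Any, Dict, List, Tuple


def replica_health(lagging_rows: List[Dict[str, Any]]) -> Tuple[int, Dict[str, Any]]:
    """Map rows from the system.replicas query to (HTTP status, JSON body)."""
    if not lagging_rows:
        # The query only returns replicas over the delay threshold.
        return 200, {"status": "ok"}
    worst = max(row["absolute_delay"] for row in lagging_rows)
    tables = sorted(f'{row["database"]}.{row["table"]}' for row in lagging_rows)
    return 503, {"status": "lagging", "max_delay_seconds": worst, "tables": tables}
```

Because the query filters on `absolute_delay > 30`, an empty result means every replicated table is within the threshold, so the endpoint can answer 200 without inspecting individual rows.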

Layer 3 — Backup heartbeat: ClickHouse BACKUP command completion ping.


Monitoring Private Databases

Most production databases are not publicly addressable. They live on private networks, VPCs, or internal LANs. Vigilmon monitors from external probe nodes, so direct TCP or HTTP monitoring of private database addresses is not possible.

The standard solution: proxy monitoring through your application's health API.

Your application servers sit inside the private network and can reach the database. Your application's health endpoints are publicly accessible via your load balancer or API gateway. Vigilmon monitors the public health API endpoints, which internally validate database connectivity.

This architectural boundary is actually a feature: the health endpoint tests the exact same network path your application code uses. If your app can't reach the database, the health endpoint returns 503, and Vigilmon alerts — because that's what users experience.

For TCP-level monitoring of private databases, options include:

  • A lightweight health proxy/sidecar in the same private network that exposes an HTTP health endpoint based on TCP connectivity tests
  • VPN-based monitoring (Vigilmon does not currently support VPN-based probe routing — use the application health endpoint pattern instead)
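The health proxy option can be very small. Below is a minimal sketch using only the Python standard library: it serves one path and answers 200 or 503 depending on whether a TCP connection to the database succeeds. The path `/health/tcp`, the listening port 8080, and the target hostname are assumptions for illustration, and a real deployment would still need to publish the endpoint through your load balancer with appropriate access controls.

```python
import json
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(db_host: str, db_port: int, timeout: float = 3.0):
    """Build a request handler that reports TCP reachability of db_host:db_port."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health/tcp":
                self.send_error(404)
                return
            try:
                with socket.create_connection((db_host, db_port), timeout=timeout):
                    reachable = True
            except OSError:
                reachable = False
            body = json.dumps({"status": "ok" if reachable else "unreachable"}).encode()
            self.send_response(200 if reachable else 503)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            # Silence per-request logging; frequent probes would flood the log.
            pass

    return HealthHandler


if __name__ == "__main__":
    # Hypothetical target; replace with your database host and port.
    handler = make_handler("db-primary.internal.example.com", 5432)
    HTTPServer(("0.0.0.0", 8080), handler).serve_forever()
```

The sidecar tests the path from its own host to the database, which may differ from the path your application servers use, so it complements the application health endpoint rather than replacing it.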

Complete Vigilmon Configuration for a Production Database Stack

Here's a complete Vigilmon monitoring configuration for a typical PostgreSQL-backed web application:

| # | Type | Target | Check | Interval | What It Catches |
|---|---|---|---|---|---|
| 1 | TCP | db-primary:5432 | Connection | 1 min | Primary down/unreachable |
| 2 | TCP | db-replica:5432 | Connection | 1 min | Replica down/unreachable |
| 3 | HTTP | /health/db | status: ok | 1 min | App can't query DB |
| 4 | HTTP | /health/replication | status: ok | 2 min | Replica lag > 30s |
| 5 | Heartbeat | PG backup job | timeout: 25h | — | Daily backup stopped |
| 6 | Heartbeat | DB migration job | timeout: 10m | — | Migration job hung |
| 7 | HTTP | /health/connections | status: ok | 2 min | Connection pool saturated |

This covers the main layers of the availability stack: network reachability, application-layer connectivity, replication currency, and job health.
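The table lists a /health/connections check without describing what sits behind it. One plausible shape, sketched here under assumptions, is a function that compares connections in use against the pool's maximum size. The function name, the 90% threshold, and the response fields are hypothetical, and how you read the in-use count depends on your pool library.

```python
from typing import Any, Dict, Tuple


def pool_health(in_use: int, max_size: int,
                threshold: float = 0.9) -> Tuple[int, Dict[str, Any]]:
    """Map pool utilization to (HTTP status, JSON body)."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    utilization = in_use / max_size
    status = "ok" if utilization < threshold else "saturated"
    code = 200 if status == "ok" else 503
    return code, {"status": status, "utilization_pct": round(utilization * 100)}
```

For example, 5 of 20 connections in use maps to a 200 response, while 19 of 20 crosses the threshold and maps to 503.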


Summary

External database monitoring with Vigilmon covers the availability and correctness gaps that internal metrics miss:

  • TCP port monitoring for immediate detection of database process failures and network unreachability
  • Application health API monitoring for confirming the app can actually query the database through its connection pool
  • Replication lag endpoints for detecting when read replicas drift too far behind the primary
  • Backup job heartbeats for the most dangerous silent failure: discovering a backup gap when you need to restore

These patterns work for PostgreSQL, MySQL, MongoDB, Redis, ClickHouse, and any database that your application can query and expose via a health API. The external monitoring perspective catches the failures that matter to end users — not just the internal metrics that explain why the failure is happening.

Try Vigilmon free at vigilmon.online — 5 monitors, no credit card, no trial expiry, multi-region consensus alerting from the first monitor.


Tags: #database #monitoring #uptime #postgresql #mysql #mongodb #redis #clickhouse #heartbeat #vigilmon #devops #sre #2026
