Database availability is the most consequential single-point failure in most application stacks. When the database goes down, everything goes down. When replication falls behind, reads from replicas return stale data. When the backup job silently stops running, you discover the gap only when you need to restore.
Internal database metrics — query latency, connection pool utilization, buffer hit ratios — are important for performance tuning. But they miss the availability failure modes that matter most to end users: the database is unreachable, the TCP port is closed, the replica is lagging so far behind it's serving stale data, or last night's backup didn't run.
This guide covers external database monitoring with Vigilmon: why it matters beyond internal metrics, TCP port monitoring for database availability, health endpoint patterns, replication lag detection via keyword checks, backup job heartbeats, and practical Vigilmon configurations for PostgreSQL, MySQL, MongoDB, Redis, and ClickHouse.
Why External Database Monitoring Matters
Internal monitoring tools — Prometheus exporters, cloud provider dashboards, database built-in metrics — give you rich visibility into database internals: query plans, cache hit ratios, lock waits, connection pool saturation. This is invaluable for performance tuning and capacity planning.
External monitoring catches a different set of failures:
Network-level failures: A database might be running normally but unreachable due to a firewall rule change, a security group misconfiguration, a certificate renewal that caused a connection drop, or a routing issue between your application servers and the database. Internal metrics show the database as healthy because it is running — the failure is in the network path between clients and the database.
Port binding failures: A database restart that fails to re-bind to the configured port leaves the internal monitoring agent (often co-located) able to connect via local socket while remote connections fail. TCP port monitoring from an external vantage point catches this.
Application-layer failures: The database process is running and the port is open, but the application cannot query it: the connection pool is exhausted, authentication is misconfigured, or a full disk is preventing writes. An application-layer health endpoint that actually executes a test query surfaces this.
Replication lag: A replica that's running fine internally but lagging 30 minutes behind the primary serves stale data to every read routed to it. The replica's own metrics show it as healthy. External monitoring of a replication-lag-aware health endpoint catches this.
Backup job failures: The most dangerous silent failure. A backup job that stops running leaves no visible trace — no errors, no alerts from the database, no log warnings. The data is simply not being backed up. Heartbeat monitoring catches this.
TCP Port Monitoring for Database Availability
The fastest way to monitor database reachability: monitor the TCP port.
Vigilmon's TCP monitors open a raw socket connection to the host and port, confirm the connection succeeds, and alert if the connection fails or times out. This is the lowest-level check — it confirms the database process is bound to the port and the network path is open.
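To make the check concrete, here is a minimal Python sketch of what a TCP reachability probe does. It illustrates the technique and is not Vigilmon's implementation; the function name `tcp_port_open` and the 5-second default timeout are assumptions made for this example.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the hostname and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling `tcp_port_open("db-primary.internal.example.com", 5432)` from a machine other than the database host asks the same question a TCP monitor asks: can a remote client complete a handshake on that port?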
Default Database Ports
| Database | Default Port |
|---|---|
| PostgreSQL | 5432 |
| MySQL / MariaDB | 3306 |
| MongoDB | 27017 |
| Redis | 6379 |
| ClickHouse (native) | 9000 |
| ClickHouse (HTTP) | 8123 |
| Elasticsearch | 9200 |
| Cassandra | 9042 |
Setting Up TCP Monitoring in Vigilmon
- Create a new TCP monitor in Vigilmon
- Set the host to your database server hostname or IP
- Set the port to the database's listening port
- Set check interval (1 minute recommended for production databases)
- Enable multi-region consensus alerting (default in Vigilmon)
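This guide does not spell out how multi-region consensus decides when to alert, so the following is only a sketch of the general idea: a single failing probe region is treated as a probable network blip, and an alert fires once more than a chosen fraction of regions agree. The function name, the region labels, and the simple-majority default are all assumptions for illustration.

```python
def should_alert(region_results: dict, quorum: float = 0.5) -> bool:
    """Alert only when more than `quorum` of the probe regions report a failure.

    region_results maps a region label to True (check passed) or False (check failed).
    """
    if not region_results:
        return False
    failures = sum(1 for passed in region_results.values() if not passed)
    return failures / len(region_results) > quorum
```

With three hypothetical regions, one failure (`{"us": False, "eu": True, "ap": True}`) stays quiet, while two failures cross the majority threshold and alert.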
Example: monitoring a PostgreSQL primary and its read replica:
- Monitor 1 (TCP): db-primary.internal.example.com:5432
- Monitor 2 (TCP): db-replica.internal.example.com:5432
If your databases are not publicly addressable (common for production setups), see the "Monitoring Private Databases" section below.
Health Endpoints via Application API
TCP port monitoring confirms the database is accepting connections. It doesn't confirm the application can actually query the database. For that, use a health endpoint in your application that performs a real test query.
Standard Health Endpoint Pattern
Most web frameworks support a health check route that validates database connectivity:
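# FastAPI / Python example
import os

import asyncpg
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
DATABASE_URL = os.environ["DATABASE_URL"]  # assumption: connection string supplied via the environment

@app.get("/health/db")
async def db_health():
    try:
        conn = await asyncpg.connect(DATABASE_URL)
        await conn.fetchval("SELECT 1")
        await conn.close()
        return {"status": "ok", "database": "connected"}
    except Exception as e:
        # FastAPI would serialize a Flask-style (body, 503) tuple as a JSON array
        # with status 200, so return an explicit response to set the status code.
        return JSONResponse(
            status_code=503,
            content={"status": "error", "database": "unreachable", "detail": str(e)},
        )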
// Express / Node.js example
app.get('/health/db', async (req, res) => {
  try {
    await db.query('SELECT 1');
    res.json({ status: 'ok', database: 'connected' });
  } catch (err) {
    res.status(503).json({ status: 'error', database: 'unreachable' });
  }
});
Vigilmon HTTP monitor configuration:
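- URL: https://api.yourdomain.com/health/db
- Expected status: 200
- Keyword check: "status": "ok" (if the keyword is matched as a literal substring, copy the exact spacing your framework emits; FastAPI and Express serialize compact JSON by default, i.e. "status":"ok")
- Interval: 1 minute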
If the database becomes unreachable, the health endpoint returns 503, the keyword check fails, and Vigilmon alerts.
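The pass/fail decision for such a monitor can be sketched in a few lines of Python. This is an illustration of the logic, not Vigilmon's code: both function names are hypothetical, and the sketch assumes the keyword is matched as a plain substring. It also shows why a substring match is sensitive to JSON whitespace, and how comparing the parsed field avoids that.

```python
import json


def keyword_check(status_code: int, body: str,
                  expected_status: int = 200,
                  keyword: str = '"status":"ok"') -> bool:
    """Pass only when the status code matches and the keyword appears verbatim."""
    return status_code == expected_status and keyword in body


def json_field_check(status_code: int, body: str,
                     expected_status: int = 200,
                     field: str = "status", expected: str = "ok") -> bool:
    """Whitespace-tolerant variant: parse the body and compare one field."""
    if status_code != expected_status:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get(field) == expected
```

A compact body such as `{"status":"ok","database":"connected"}` passes both checks. A body written with a space after the colon fails the substring check with this keyword but passes the parsed comparison, so it is worth confirming the keyword against a real response from your endpoint.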
Replication Lag Detection via Keyword Checks
Read replicas that lag too far behind the primary serve stale data to your application's read paths. Standard uptime monitoring checks whether the replica is reachable — it doesn't check whether the replica is current.
Pattern: Replication-Aware Health Endpoint
Add a replication lag check to your health endpoint:
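# PostgreSQL replication lag check
from fastapi.responses import JSONResponse

LAG_THRESHOLD_SECONDS = 30

@app.get("/health/replication")
async def replication_health():
    conn = await asyncpg.connect(REPLICA_URL)
    # pg_last_xact_replay_timestamp() returns the time of the last transaction replayed
    # during recovery. On an idle primary this difference keeps growing even when the
    # replica is fully caught up, so pick a threshold that suits your write rate.
    lag = await conn.fetchval("""
        SELECT EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp()))
    """)
    await conn.close()
    if lag is None:
        # NULL: the replica hasn't replayed anything yet, or this server is a primary
        return {"status": "unknown", "lag_seconds": None}
    if lag > LAG_THRESHOLD_SECONDS:
        # Explicit response: FastAPI does not treat a (body, 503) tuple as a status code
        return JSONResponse(
            status_code=503,
            content={"status": "lagging", "lag_seconds": round(lag)},
        )
    return {"status": "ok", "lag_seconds": round(lag)}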
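# MySQL replication lag check
# (MySQL 8.0.22+; older versions and MariaDB use SHOW SLAVE STATUS / Seconds_Behind_Master)
from fastapi.responses import JSONResponse

LAG_THRESHOLD_SECONDS = 30

@app.get("/health/replication")
async def mysql_replication_health():
    # `db` is assumed to be an async client whose results return rows as dictionaries
    result = await db.execute("SHOW REPLICA STATUS")
    row = result.fetchone()
    if not row:
        # An empty result means this server is not configured as a replica
        return {"status": "not_replica"}
    lag = row["Seconds_Behind_Source"]
    if lag is None:
        # NULL means the replication threads are not running
        return JSONResponse(
            status_code=503,
            content={"status": "error", "message": "replication_stopped"},
        )
    if lag > LAG_THRESHOLD_SECONDS:
        return JSONResponse(status_code=503, content={"status": "lagging", "lag_seconds": lag})
    return {"status": "ok", "lag_seconds": lag}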
Vigilmon HTTP monitor for replication:
- URL: https://api.yourdomain.com/health/replication
- Expected status: 200
- Keyword check: "status": "ok"
- Interval: 2 minutes
If the replica lags beyond the threshold, the endpoint returns 503, keyword check fails, and Vigilmon alerts — even though the replica is reachable and the TCP port check would show green.
Backup Job Heartbeats
Backup job failure is the highest-stakes silent failure in database operations. A backup job that stops running produces no errors and no alerts from standard monitoring. You discover the gap when you need to restore — and realize the last good backup is weeks old.
Vigilmon's heartbeat monitoring is designed exactly for this scenario.
How Heartbeat Monitoring Works
Create a heartbeat monitor in Vigilmon. Vigilmon generates a unique ping URL. Your backup script sends an HTTP POST to this URL on each successful completion. If no ping arrives within the configured timeout window, Vigilmon fires an alert.
The inversion is the key: Vigilmon waits for your job to check in, rather than probing your job. The job's absence is the alert.
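The evaluation side of that inversion is small enough to sketch. The following Python is an illustration of the dead-man's-switch idea, not Vigilmon's implementation; the function name `heartbeat_overdue` and its parameters are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def heartbeat_overdue(last_ping: datetime, timeout: timedelta,
                      now: Optional[datetime] = None) -> bool:
    """Return True when no ping has arrived within the timeout window."""
    current = now if now is not None else datetime.now(timezone.utc)
    # The monitor never contacts the job; it only compares clocks.
    return current - last_ping > timeout
```

With a 25-hour timeout, a job that last pinged 24 hours ago is still healthy, while one that last pinged 26 hours ago is overdue and would trigger the alert.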
PostgreSQL Backup Heartbeat
#!/bin/bash
# pg_backup.sh
set -e
BACKUP_DIR="/backups/postgresql"
DB_NAME="production"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.dump"
# Run the backup
pg_dump -Fc "$DB_NAME" > "$BACKUP_FILE"
# Verify the backup is non-zero size
if [ ! -s "$BACKUP_FILE" ]; then
    echo "ERROR: Backup file is empty"
    exit 1
fi
# Upload to object storage
aws s3 cp "$BACKUP_FILE" "s3://your-backups-bucket/postgresql/"
# Ping Vigilmon heartbeat on success
curl -fsS -X POST https://hb.vigilmon.online/ping/YOUR_HEARTBEAT_ID
echo "Backup completed: $BACKUP_FILE"
Vigilmon heartbeat configuration:
- Timeout: 25 hours (for a daily backup — allows the job to run at any point within the day plus 1 hour buffer)
- Alert: email + Slack webhook
If the backup cron job fails partway through (before the curl ping line), or if the job stops being scheduled entirely, the ping never arrives and Vigilmon alerts within 25 hours — long before you discover the gap the hard way.
MySQL Backup Heartbeat
#!/bin/bash
# mysql_backup.sh
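# pipefail aborts the script when mysqldump fails; with plain `set -e`, the pipeline's
# exit status is gzip's, and gzip still writes a small non-empty file on empty input
set -euo pipefail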
BACKUP_DIR="/backups/mysql"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/production_${TIMESTAMP}.sql.gz"
mysqldump --single-transaction --all-databases | gzip > "$BACKUP_FILE"
# Verify backup
if [ ! -s "$BACKUP_FILE" ]; then
    exit 1
fi
# Upload and ping
aws s3 cp "$BACKUP_FILE" "s3://your-backups-bucket/mysql/"
curl -fsS -X POST https://hb.vigilmon.online/ping/YOUR_HEARTBEAT_ID
MongoDB Backup Heartbeat
#!/bin/bash
# mongo_backup.sh
set -e
BACKUP_DIR="/backups/mongodb"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
mongodump --uri="$MONGO_URI" --out="$BACKUP_DIR/dump_${TIMESTAMP}" --gzip
# Archive
tar -czf "$BACKUP_DIR/mongo_${TIMESTAMP}.tar.gz" -C "$BACKUP_DIR" "dump_${TIMESTAMP}"
rm -rf "$BACKUP_DIR/dump_${TIMESTAMP}"
# Upload and ping
aws s3 cp "$BACKUP_DIR/mongo_${TIMESTAMP}.tar.gz" s3://your-backups-bucket/mongodb/
curl -fsS -X POST https://hb.vigilmon.online/ping/YOUR_HEARTBEAT_ID
Database-Specific Monitoring Patterns
PostgreSQL
PostgreSQL health monitoring layers:
Layer 1 — TCP availability:
- TCP monitor: postgres-primary.yourdomain.com:5432
- TCP monitor: postgres-replica.yourdomain.com:5432 (if using read replicas)
Layer 2 — Application-layer health:
- HTTP monitor: https://api.yourdomain.com/health/db
- Keyword: "status": "ok"
- Tests an actual SELECT 1 query through the connection pool
Layer 3 — Replication health:
- HTTP monitor: https://api.yourdomain.com/health/replication
- Keyword: "status": "ok"
- Checks pg_last_xact_replay_timestamp() lag
Layer 4 — Backup heartbeat:
- Heartbeat monitor, timeout: 25 hours
- pg_dump script pings on successful completion
Layer 5 — Vacuum/maintenance job:
- Heartbeat monitor, timeout: 1 week
- Autovacuum is automatic, but if you run manual VACUUM ANALYZE on large tables, monitoring completion is prudent
MySQL / MariaDB
MySQL health monitoring layers:
Layer 1 — TCP: mysql.yourdomain.com:3306
Layer 2 — Health API:
SELECT 1; -- test connection
SHOW REPLICA STATUS; -- check replication
Layer 3 — Replication lag: HTTP endpoint checking Seconds_Behind_Source
Layer 4 — Backup heartbeat: mysqldump or xtrabackup script
Additional: Monitor your MySQL slow query log export job if you ship slow queries to an analytics system.
MongoDB
MongoDB health monitoring layers:
Layer 1 — TCP: mongodb.yourdomain.com:27017
Layer 2 — Health API:
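from fastapi.responses import JSONResponse

@app.get("/health/mongo")
async def mongo_health():
    try:
        result = await client.admin.command("ping")
        if result.get("ok") == 1.0:
            return {"status": "ok"}
        # Explicit response: FastAPI does not treat a (body, 503) tuple as a status code
        return JSONResponse(status_code=503, content={"status": "error"})
    except Exception:
        return JSONResponse(status_code=503, content={"status": "unreachable"})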
Layer 3 — Replica set health:
from fastapi.responses import JSONResponse

@app.get("/health/mongo/replication")
async def mongo_replica_health():
    status = await client.admin.command("replSetGetStatus")
    primary_found = any(m["stateStr"] == "PRIMARY" for m in status["members"])
    # total_seconds() includes whole days; timedelta.seconds alone wraps every 24 hours
    # and would hide a secondary that is more than a day behind
    lagging_members = [
        m for m in status["members"]
        if m["stateStr"] == "SECONDARY" and m.get("optimeDate") is not None
        and (status["date"] - m["optimeDate"]).total_seconds() > 30
    ]
    if not primary_found or lagging_members:
        return JSONResponse(status_code=503, content={"status": "degraded"})
    return {"status": "ok"}
Layer 4 — Backup heartbeat: mongodump script with ping on success.
Redis
Redis availability monitoring is usually simpler. Redis replication is asynchronous, and on a healthy network replica lag is typically small enough that most setups do not monitor it separately.
Layer 1 — TCP: redis.yourdomain.com:6379
Layer 2 — Health API:
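from fastapi.responses import JSONResponse

@app.get("/health/redis")
async def redis_health():
    try:
        pong = await redis_client.ping()
        if pong:
            return {"status": "ok"}
        # Explicit response: FastAPI does not treat a (body, 503) tuple as a status code
        return JSONResponse(status_code=503, content={"status": "error"})
    except Exception:
        return JSONResponse(status_code=503, content={"status": "unreachable"})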
Layer 3 — Memory pressure monitoring (via health endpoint):
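from fastapi.responses import JSONResponse

@app.get("/health/redis/memory")  # hypothetical route name
async def redis_memory_health():
    info = await redis_client.info("memory")
    # maxmemory is 0 when no limit is configured; report 0% in that case
    used_pct = info["used_memory"] / info["maxmemory"] * 100 if info.get("maxmemory") else 0
    if used_pct > 90:
        return JSONResponse(status_code=503, content={"status": "degraded", "memory_pct": round(used_pct)})
    return {"status": "ok", "memory_pct": round(used_pct)}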
Layer 4 — RDB dump heartbeat: If you rely on Redis persistence, monitor the RDB dump job:
# After confirming dump.rdb was updated recently:
curl -fsS -X POST https://hb.vigilmon.online/ping/YOUR_REDIS_BACKUP_HEARTBEAT_ID
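The "confirm, then ping" step can be sketched in Python using only the standard library. Everything specific here is an assumption for illustration: the function names, the idea of judging freshness by file modification time, and whatever path and age limit you pass in (the dump location depends on your Redis `dir` setting and save policy).

```python
import os
import time
import urllib.request
from typing import Optional


def rdb_is_fresh(path: str, max_age_seconds: float,
                 now: Optional[float] = None) -> bool:
    """True if the file exists and was modified within max_age_seconds."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # Missing or unreadable file counts as "not fresh".
        return False
    current = time.time() if now is None else now
    return current - mtime <= max_age_seconds


def ping_heartbeat(url: str, timeout: float = 10.0) -> None:
    """Send an empty HTTP POST; raises on network errors or HTTP error responses."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout):
        pass


def ping_if_fresh(path: str, url: str, max_age_seconds: float) -> bool:
    """Ping the heartbeat only when the dump file is recent; return whether it pinged."""
    if not rdb_is_fresh(path, max_age_seconds):
        return False
    ping_heartbeat(url)
    return True
```

Run from cron, a call such as `ping_if_fresh("/var/lib/redis/dump.rdb", "https://hb.vigilmon.online/ping/YOUR_REDIS_BACKUP_HEARTBEAT_ID", 3600)` stays silent when the dump is stale, and that silence is what lets the heartbeat monitor alert.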
ClickHouse
ClickHouse exposes an HTTP interface on port 8123, making it natively compatible with Vigilmon HTTP monitoring:
Layer 1 — HTTP health (ClickHouse built-in):
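- URL: http://clickhouse.yourdomain.com:8123/ping (8123 is the plain-HTTP port; if TLS is enabled on the server, use https:// with the HTTPS port, 8443 by default)
- Expected status: 200
- Keyword: Ok.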
This is ClickHouse's built-in ping endpoint — it returns Ok. when healthy.
Layer 2 — Replica sync health:
-- ClickHouse replicated table health check
SELECT
    database,
    table,
    is_leader,
    absolute_delay
FROM system.replicas
WHERE absolute_delay > 30
Wrap in a health API endpoint and monitor with Vigilmon.
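One possible shape for that wrapper is sketched below in standard-library Python. It assumes the ClickHouse HTTP interface is reachable from the application at a base URL you supply and that the default user can read system.replicas without credentials; the function names and the response fields are hypothetical.

```python
import json
import urllib.parse
import urllib.request

REPLICA_QUERY = (
    "SELECT database, table, absolute_delay "
    "FROM system.replicas WHERE absolute_delay > 30 FORMAT JSON"
)


def fetch_lagging_replicas(base_url: str, timeout: float = 5.0) -> list:
    """Run the query over the ClickHouse HTTP interface and return the result rows."""
    url = f"{base_url}/?{urllib.parse.urlencode({'query': REPLICA_QUERY})}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)["data"]


def replica_health(rows: list) -> tuple:
    """Map lagging rows to an HTTP status code and a JSON-serializable body."""
    if not rows:
        return 200, {"status": "ok"}
    # 64-bit integers may arrive as quoted strings in JSON output, so coerce them.
    worst = max(int(row["absolute_delay"]) for row in rows)
    return 503, {"status": "lagging", "tables": len(rows), "max_delay_seconds": worst}
```

A health route would call `replica_health(fetch_lagging_replicas("http://clickhouse.internal:8123"))` and return the resulting status code and body. Keeping the decision in a pure function makes the threshold logic easy to test without a live cluster.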
Layer 3 — Backup heartbeat: ClickHouse BACKUP command completion ping.
Monitoring Private Databases
Most production databases are not publicly addressable. They live on private networks, VPCs, or internal LANs. Vigilmon monitors from external probe nodes, so direct TCP or HTTP monitoring of private database addresses is not possible.
The standard solution: proxy monitoring through your application's health API.
Your application servers sit inside the private network and can reach the database. Your application's health endpoints are publicly accessible via your load balancer or API gateway. Vigilmon monitors the public health API endpoints, which internally validate database connectivity.
This architectural boundary is actually a feature: the health endpoint tests the exact same network path your application code uses. If your app can't reach the database, the health endpoint returns 503, and Vigilmon alerts — because that's what users experience.
For TCP-level monitoring of private databases, options include:
- A lightweight health proxy/sidecar in the same private network that exposes an HTTP health endpoint based on TCP connectivity tests
- VPN-based monitoring (Vigilmon does not currently support VPN-based probe routing — use the application health endpoint pattern instead)
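The health proxy option can be sketched with Python's standard library alone. Treat this as a starting point under stated assumptions rather than a hardened service: the database host, the /health/tcp path, the listening port, and all names are hypothetical, and the sidecar still needs to be exposed through your load balancer or gateway.

```python
import json
import socket
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

DB_HOST = "db-primary.internal.example.com"  # assumption: private database address
DB_PORT = 5432


def tcp_health(host: str, port: int, timeout: float = 3.0) -> tuple:
    """Return (http_status, body) based on whether a TCP connect succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 200, {"status": "ok"}
    except OSError as exc:
        return 503, {"status": "unreachable", "detail": type(exc).__name__}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health/tcp":
            self.send_error(404)
            return
        status, body = tcp_health(DB_HOST, DB_PORT)
        payload = json.dumps(body, separators=(",", ":")).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(bind: str = "0.0.0.0", port: int = 8080) -> None:
    """Start the sidecar; blocks until the process is stopped."""
    ThreadingHTTPServer((bind, port), HealthHandler).serve_forever()
```

Calling `serve()` inside the private network turns a TCP connectivity test into an HTTP endpoint that an external HTTP monitor can poll with a 200 status and a "status":"ok" keyword.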
Complete Vigilmon Configuration for a Production Database Stack
Here's a complete Vigilmon monitoring configuration for a typical PostgreSQL-backed web application:
| # | Type | Target | Check | Interval | What It Catches |
|---|---|---|---|---|---|
| 1 | TCP | db-primary:5432 | Connection | 1 min | Primary down/unreachable |
| 2 | TCP | db-replica:5432 | Connection | 1 min | Replica down/unreachable |
| 3 | HTTP | /health/db | status: ok | 1 min | App can't query DB |
| 4 | HTTP | /health/replication | status: ok | 2 min | Replica lag > 30s |
| 5 | Heartbeat | PG backup job | timeout: 25h | — | Daily backup stopped |
| 6 | Heartbeat | DB migration job | timeout: 10m | — | Migration job hung |
| 7 | HTTP | /health/connections | status: ok | 2 min | Connection pool saturated |
This gives you visibility across the full availability stack: network reachability, application-layer connectivity, replication currency, and job health. Each failure mode described in this guide is covered by at least one monitor.
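Row 7 refers to a /health/connections endpoint that the earlier sections do not define. One way such an endpoint could decide its response is sketched below; the function name, the 90% threshold, and the response fields are assumptions, and the in-use and maximum counts would come from whatever connection pool your application uses.

```python
def pool_health(in_use: int, max_size: int, threshold: float = 0.9) -> tuple:
    """Return (http_status, body) for a connection pool utilization check."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    utilization_pct = round(100 * in_use / max_size)
    if in_use >= threshold * max_size:
        # Saturated pools make new requests queue or fail, so report unhealthy.
        return 503, {"status": "saturated", "utilization_pct": utilization_pct}
    return 200, {"status": "ok", "utilization_pct": utilization_pct}
```

A pool with 5 of 20 connections in use reports 200 with "status": "ok", while 19 of 20 reports 503, which fails both the status and keyword checks on the monitor.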
Summary
External database monitoring with Vigilmon covers the availability and correctness gaps that internal metrics miss:
- TCP port monitoring for immediate detection of database process failures and network unreachability
- Application health API monitoring for confirming the app can actually query the database through its connection pool
- Replication lag endpoints for detecting when read replicas drift too far behind the primary
- Backup job heartbeats for the most dangerous silent failure: discovering a backup gap when you need to restore
These patterns work for PostgreSQL, MySQL, MongoDB, Redis, ClickHouse, and any database that your application can query and expose via a health API. The external monitoring perspective catches the failures that matter to end users — not just the internal metrics that explain why the failure is happening.
Try Vigilmon free at vigilmon.online — 5 monitors, no credit card, no trial expiry, multi-region consensus alerting from the first monitor.