PostgreSQL is extraordinarily reliable, which makes its failures all the more surprising when they happen. A primary failover, connection pool exhaustion, a replica falling hours behind, or a pg_dump job that silently stopped running — these are the incidents that cause long outages because nobody was watching.
Vigilmon provides external monitoring for PostgreSQL through HTTP health probes on your database API layer and heartbeat monitors for scheduled database jobs. This tutorial covers the full stack: primary, replicas, connection pool, and background jobs.
PostgreSQL Health Check Patterns
PostgreSQL itself speaks the Postgres wire protocol, not HTTP. For external HTTP monitoring, you need a health check endpoint that wraps pg_isready or a live connection check. There are several patterns:
Pattern 1: Application-level /health Endpoint
Most PostgreSQL deployments already have an API service in front of the database. Add a health endpoint to that service:
# FastAPI + asyncpg
import asyncpg
import os
from fastapi import FastAPI
from fastapi.responses import JSONResponse
app = FastAPI()
@app.get("/health/db")
async def db_health():
try:
conn = await asyncpg.connect(os.environ["DATABASE_URL"])
result = await conn.fetchval("SELECT 1")
await conn.close()
if result != 1:
raise ValueError("Unexpected query result")
return {"status": "ok", "database": "postgresql"}
except Exception as e:
return JSONResponse(status_code=503, content={
"status": "down",
"error": str(e),
})
// Express + pg
const { Pool } = require('pg');
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
app.get('/health/db', async (req, res) => {
try {
const result = await pool.query('SELECT 1 AS ok');
if (result.rows[0].ok !== 1) throw new Error('Query returned unexpected result');
res.status(200).json({ status: 'ok' });
} catch (err) {
res.status(503).json({ status: 'down', error: err.message });
}
});
Pattern 2: pg_isready Sidecar
If you don't control the application layer, run a minimal HTTP sidecar that shells out to pg_isready:
# pg-healthcheck.sh — serve via netcat or a minimal HTTP server
#!/bin/bash
PGHOST=${PGHOST:-localhost}
PGPORT=${PGPORT:-5432}
PGUSER=${PGUSER:-postgres}
PGDATABASE=${PGDATABASE:-postgres}
if pg_isready -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -d "$PGDATABASE" -q; then
echo -e "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n{\"status\":\"ok\"}"
else
echo -e "HTTP/1.1 503 Service Unavailable\r\nContent-Type: application/json\r\n\r\n{\"status\":\"down\"}"
fi
Or use a Go binary like pg_health_check that runs on port 8086 and exposes /health. This is common for containerised PostgreSQL deployments.
Pattern 3: Enhanced Health with Replica Lag
For primary/replica setups, include replication lag in the health response:
-- Run this on the primary to check replica lag
SELECT
client_addr,
state,
EXTRACT(EPOCH FROM (now() - write_lag)) AS write_lag_seconds,
EXTRACT(EPOCH FROM (now() - replay_lag)) AS replay_lag_seconds
FROM pg_stat_replication;
@app.get("/health/db/replication")
async def replication_health():
try:
conn = await asyncpg.connect(os.environ["DATABASE_URL"])
rows = await conn.fetch("""
SELECT
client_addr::text,
state,
EXTRACT(EPOCH FROM (now() - replay_lag))::float AS lag_seconds
FROM pg_stat_replication
""")
await conn.close()
replicas = [dict(r) for r in rows]
lagging = [r for r in replicas if r["lag_seconds"] and r["lag_seconds"] > 30]
status = "degraded" if lagging else "ok"
code = 503 if lagging else 200
return JSONResponse(status_code=code, content={
"status": status,
"replicas": replicas,
"lagging_replicas": lagging,
})
except Exception as e:
return JSONResponse(status_code=503, content={"status": "down", "error": str(e)})
Step 1: Configure Vigilmon HTTP Monitors
Primary Database Monitor
- Log in to vigilmon.online and go to Monitors → New Monitor
- Choose HTTP / HTTPS
- Set the URL:
https://your-api.example.com/health/db - Check interval: 1 minute
- Expected response:
- Status code:
200 - Response body contains:
"status":"ok" - Response time threshold:
3000ms
- Status code:
- Alert channels: Slack + PagerDuty (primary database downtime is P1)
- Save
Replica Monitor
Add a separate monitor for each read replica:
- Monitors → New Monitor → HTTP / HTTPS
- URL:
https://your-api.example.com/health/db/replication - Check interval: 2 minutes
- Expected response:
- Status code:
200 - Response body contains:
"status":"ok"
- Status code:
- Alert channels: Slack only (replica degradation is P2)
- Save
Use a naming convention to keep your monitors organized:
[postgres-primary] api /health/db[postgres-replica-1] replication /health/db/replication[postgres-replica-2] replication /health/db/replication
Step 2: Monitor PgBouncer Connection Pool
PgBouncer sits between your application and PostgreSQL, maintaining a pool of long-lived database connections. When PgBouncer is healthy but PostgreSQL is not, your connection pool fills with stale connections and new queries queue indefinitely. When PgBouncer itself fails, your application cannot reach PostgreSQL at all.
PgBouncer exposes an admin console via a pseudo-database named pgbouncer. Wrap it in an HTTP endpoint:
import asyncpg
@app.get("/health/pgbouncer")
async def pgbouncer_health():
try:
# Connect to PgBouncer's admin pseudo-database
conn = await asyncpg.connect(
host=os.environ["PGBOUNCER_HOST"],
port=int(os.environ.get("PGBOUNCER_PORT", 6432)),
user=os.environ["PGBOUNCER_ADMIN_USER"],
password=os.environ["PGBOUNCER_ADMIN_PASSWORD"],
database="pgbouncer",
)
stats = await conn.fetch("SHOW POOLS")
await conn.close()
# Check for pools with too many waiting clients
pools = [dict(r) for r in stats]
overloaded = [p for p in pools if p.get("cl_waiting", 0) > 10]
status = "degraded" if overloaded else "ok"
code = 503 if overloaded else 200
return JSONResponse(status_code=code, content={
"status": status,
"pools": pools,
"overloaded_pools": overloaded,
})
except Exception as e:
return JSONResponse(status_code=503, content={"status": "down", "error": str(e)})
Add a Vigilmon monitor for this endpoint at https://your-api.example.com/health/pgbouncer. A spike in cl_waiting predicts connection exhaustion before it becomes a full outage.
Step 3: Heartbeat Monitoring for pg_dump and WAL Archiving
Database backup jobs are invisible to uptime monitors. A pg_dump that started failing six weeks ago, a WAL archive job that ran out of disk space, a logical replication slot that stopped consuming — these failures compound silently until they become disasters.
Vigilmon heartbeat monitors flip the model: your job pings Vigilmon on success, and Vigilmon alerts if the ping stops arriving.
Set Up the Heartbeat Monitor
- Monitors → New Monitor → Heartbeat
- Name:
pg-nightly-backup - Expected interval: 1 day
- Grace period: 2 hours (alert if backup is 2 hours late)
- Save — copy the heartbeat URL:
https://vigilmon.online/heartbeat/abc123xyz
Wire It Into Your Backup Script
#!/bin/bash
# pg-backup.sh
set -euo pipefail
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="/backups/pg_dump_${TIMESTAMP}.sql.gz"
VIGILMON_HEARTBEAT_URL="${VIGILMON_HEARTBEAT_URL:?Must set VIGILMON_HEARTBEAT_URL}"
echo "Starting pg_dump at $(date)"
PGPASSWORD="$DB_PASSWORD" pg_dump \
--host="$DB_HOST" \
--username="$DB_USER" \
--dbname="$DB_NAME" \
--format=custom \
--compress=9 \
--file="$BACKUP_FILE"
echo "Backup complete: $BACKUP_FILE ($(du -sh $BACKUP_FILE | cut -f1))"
# Only ping Vigilmon if pg_dump exited 0
curl -fsS "$VIGILMON_HEARTBEAT_URL" > /dev/null
echo "Heartbeat sent to Vigilmon"
Run this via cron:
# /etc/cron.d/pg-backup
0 2 * * * postgres /opt/scripts/pg-backup.sh >> /var/log/pg-backup.log 2>&1
WAL Archiving Heartbeat
For continuous WAL archiving, configure archive_command in postgresql.conf to ping Vigilmon on each successful archive:
# postgresql.conf
archive_mode = on
archive_command = 'cp %p /wal-archive/%f && curl -fsS https://vigilmon.online/heartbeat/xyz789 > /dev/null'
Set the heartbeat interval to 30 minutes with a 1 hour grace period — if no WAL file has been archived in an hour, something is wrong.
Step 4: Alert Routing
| Monitor | Severity | Alert Channel |
|---|---|---|
| Primary database /health/db | P1 — immediate page | Slack + PagerDuty |
| Replica replication health | P2 — degraded | Slack only |
| PgBouncer pool health | P2 — degraded | Slack only |
| Nightly pg_dump heartbeat | P2 — backup failure | Email + Slack |
| WAL archive heartbeat | P1 — data loss risk | Slack + PagerDuty |
Configure these in Vigilmon under Alert Channels — assign different channels to different monitors based on severity.
Summary
PostgreSQL failures range from immediate (primary down) to slow-burn (replica lag, backup failure). Vigilmon covers both:
| Layer | Vigilmon Monitor Type |
|---|---|
| Primary availability | HTTP monitor on /health/db |
| Replica sync status | HTTP monitor on /health/db/replication |
| Connection pool | HTTP monitor on /health/pgbouncer |
| Backup jobs | Heartbeat monitor |
| WAL archiving | Heartbeat monitor |
Start monitoring your PostgreSQL for free at vigilmon.online — no credit card required.