How to Monitor PostgreSQL Uptime with Vigilmon

PostgreSQL is extraordinarily reliable, which makes its failures all the more surprising when they happen. A primary failover, connection pool exhaustion, a replica falling hours behind, or a pg_dump job that silently stopped running — these are the incidents that cause long outages because nobody was watching.

Vigilmon provides external monitoring for PostgreSQL through HTTP health probes on your database API layer and heartbeat monitors for scheduled database jobs. This tutorial covers the full stack: primary, replicas, connection pool, and background jobs.

PostgreSQL Health Check Patterns

PostgreSQL itself speaks the Postgres wire protocol, not HTTP. For external HTTP monitoring, you need a health check endpoint that wraps pg_isready or a live connection check. There are several patterns:

Pattern 1: Application-level `/health` Endpoint

Most PostgreSQL deployments already have an API service in front of the database. Add a health endpoint to that service:

# FastAPI + asyncpg
import asyncpg
import os
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health/db")
async def db_health():
    try:
        conn = await asyncpg.connect(os.environ["DATABASE_URL"])
        result = await conn.fetchval("SELECT 1")
        await conn.close()

        if result != 1:
            raise ValueError("Unexpected query result")

        return {"status": "ok", "database": "postgresql"}
    except Exception as e:
        return JSONResponse(status_code=503, content={
            "status": "down",
            "error": str(e),
        })

// Express + pg
const { Pool } = require('pg');
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

app.get('/health/db', async (req, res) => {
  try {
    const result = await pool.query('SELECT 1 AS ok');
    if (result.rows[0].ok !== 1) throw new Error('Query returned unexpected result');
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    res.status(503).json({ status: 'down', error: err.message });
  }
});

Pattern 2: pg_isready Sidecar

If you don't control the application layer, run a minimal HTTP sidecar that shells out to pg_isready:

# pg-healthcheck.sh — serve via netcat or a minimal HTTP server
#!/bin/bash
PGHOST=${PGHOST:-localhost}
PGPORT=${PGPORT:-5432}
PGUSER=${PGUSER:-postgres}
PGDATABASE=${PGDATABASE:-postgres}

if pg_isready -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -d "$PGDATABASE" -q; then
  echo -e "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n{\"status\":\"ok\"}"
else
  echo -e "HTTP/1.1 503 Service Unavailable\r\nContent-Type: application/json\r\n\r\n{\"status\":\"down\"}"
fi

Or use a Go binary like pg_health_check that runs on port 8086 and exposes /health. This is common for containerised PostgreSQL deployments.

Pattern 3: Enhanced Health with Replica Lag

For primary/replica setups, include replication lag in the health response:

-- Run this on the primary to check replica lag
SELECT
  client_addr,
  state,
  EXTRACT(EPOCH FROM (now() - write_lag)) AS write_lag_seconds,
  EXTRACT(EPOCH FROM (now() - replay_lag)) AS replay_lag_seconds
FROM pg_stat_replication;

@app.get("/health/db/replication")
async def replication_health():
    try:
        conn = await asyncpg.connect(os.environ["DATABASE_URL"])
        rows = await conn.fetch("""
            SELECT
                client_addr::text,
                state,
                EXTRACT(EPOCH FROM (now() - replay_lag))::float AS lag_seconds
            FROM pg_stat_replication
        """)
        await conn.close()

        replicas = [dict(r) for r in rows]
        lagging = [r for r in replicas if r["lag_seconds"] and r["lag_seconds"] > 30]

        status = "degraded" if lagging else "ok"
        code = 503 if lagging else 200

        return JSONResponse(status_code=code, content={
            "status": status,
            "replicas": replicas,
            "lagging_replicas": lagging,
        })
    except Exception as e:
        return JSONResponse(status_code=503, content={"status": "down", "error": str(e)})

Step 1: Configure Vigilmon HTTP Monitors

Primary Database Monitor

Log in to vigilmon.online and go to Monitors → New Monitor
Choose HTTP / HTTPS
Set the URL: https://your-api.example.com/health/db
Check interval: 1 minute
Expected response:
- Status code: 200
- Response body contains: "status":"ok"
- Response time threshold: 3000ms
Alert channels: Slack + PagerDuty (primary database downtime is P1)
Save

Replica Monitor

Add a separate monitor for each read replica:

Monitors → New Monitor → HTTP / HTTPS
URL: https://your-api.example.com/health/db/replication
Check interval: 2 minutes
Expected response:
- Status code: 200
- Response body contains: "status":"ok"
Alert channels: Slack only (replica degradation is P2)
Save

Use a naming convention to keep your monitors organized:

[postgres-primary] api /health/db
[postgres-replica-1] replication /health/db/replication
[postgres-replica-2] replication /health/db/replication

Step 2: Monitor PgBouncer Connection Pool

PgBouncer sits between your application and PostgreSQL, maintaining a pool of long-lived database connections. When PgBouncer is healthy but PostgreSQL is not, your connection pool fills with stale connections and new queries queue indefinitely. When PgBouncer itself fails, your application cannot reach PostgreSQL at all.

PgBouncer exposes an admin console via a pseudo-database named pgbouncer. Wrap it in an HTTP endpoint:

import asyncpg

@app.get("/health/pgbouncer")
async def pgbouncer_health():
    try:
        # Connect to PgBouncer's admin pseudo-database
        conn = await asyncpg.connect(
            host=os.environ["PGBOUNCER_HOST"],
            port=int(os.environ.get("PGBOUNCER_PORT", 6432)),
            user=os.environ["PGBOUNCER_ADMIN_USER"],
            password=os.environ["PGBOUNCER_ADMIN_PASSWORD"],
            database="pgbouncer",
        )

        stats = await conn.fetch("SHOW POOLS")
        await conn.close()

        # Check for pools with too many waiting clients
        pools = [dict(r) for r in stats]
        overloaded = [p for p in pools if p.get("cl_waiting", 0) > 10]

        status = "degraded" if overloaded else "ok"
        code = 503 if overloaded else 200

        return JSONResponse(status_code=code, content={
            "status": status,
            "pools": pools,
            "overloaded_pools": overloaded,
        })
    except Exception as e:
        return JSONResponse(status_code=503, content={"status": "down", "error": str(e)})

Add a Vigilmon monitor for this endpoint at https://your-api.example.com/health/pgbouncer. A spike in cl_waiting predicts connection exhaustion before it becomes a full outage.

Step 3: Heartbeat Monitoring for pg_dump and WAL Archiving

Database backup jobs are invisible to uptime monitors. A pg_dump that started failing six weeks ago, a WAL archive job that ran out of disk space, a logical replication slot that stopped consuming — these failures compound silently until they become disasters.

Vigilmon heartbeat monitors flip the model: your job pings Vigilmon on success, and Vigilmon alerts if the ping stops arriving.

Set Up the Heartbeat Monitor

Monitors → New Monitor → Heartbeat
Name: pg-nightly-backup
Expected interval: 1 day
Grace period: 2 hours (alert if backup is 2 hours late)
Save — copy the heartbeat URL: https://vigilmon.online/heartbeat/abc123xyz

Wire It Into Your Backup Script

#!/bin/bash
# pg-backup.sh
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="/backups/pg_dump_${TIMESTAMP}.sql.gz"
VIGILMON_HEARTBEAT_URL="${VIGILMON_HEARTBEAT_URL:?Must set VIGILMON_HEARTBEAT_URL}"

echo "Starting pg_dump at $(date)"

PGPASSWORD="$DB_PASSWORD" pg_dump \
  --host="$DB_HOST" \
  --username="$DB_USER" \
  --dbname="$DB_NAME" \
  --format=custom \
  --compress=9 \
  --file="$BACKUP_FILE"

echo "Backup complete: $BACKUP_FILE ($(du -sh $BACKUP_FILE | cut -f1))"

# Only ping Vigilmon if pg_dump exited 0
curl -fsS "$VIGILMON_HEARTBEAT_URL" > /dev/null
echo "Heartbeat sent to Vigilmon"

Run this via cron:

# /etc/cron.d/pg-backup
0 2 * * * postgres /opt/scripts/pg-backup.sh >> /var/log/pg-backup.log 2>&1

WAL Archiving Heartbeat

For continuous WAL archiving, configure archive_command in postgresql.conf to ping Vigilmon on each successful archive:

# postgresql.conf
archive_mode = on
archive_command = 'cp %p /wal-archive/%f && curl -fsS https://vigilmon.online/heartbeat/xyz789 > /dev/null'

Set the heartbeat interval to 30 minutes with a 1 hour grace period — if no WAL file has been archived in an hour, something is wrong.

Step 4: Alert Routing

| Monitor | Severity | Alert Channel | |---|---|---| | Primary database /health/db | P1 — immediate page | Slack + PagerDuty | | Replica replication health | P2 — degraded | Slack only | | PgBouncer pool health | P2 — degraded | Slack only | | Nightly pg_dump heartbeat | P2 — backup failure | Email + Slack | | WAL archive heartbeat | P1 — data loss risk | Slack + PagerDuty |

Configure these in Vigilmon under Alert Channels — assign different channels to different monitors based on severity.

Summary

PostgreSQL failures range from immediate (primary down) to slow-burn (replica lag, backup failure). Vigilmon covers both:

| Layer | Vigilmon Monitor Type | |---|---| | Primary availability | HTTP monitor on /health/db | | Replica sync status | HTTP monitor on /health/db/replication | | Connection pool | HTTP monitor on /health/pgbouncer | | Backup jobs | Heartbeat monitor | | WAL archiving | Heartbeat monitor |

Start monitoring your PostgreSQL for free at vigilmon.online — no credit card required.