How to Monitor MySQL and MariaDB Uptime with Vigilmon

MySQL and MariaDB power some of the world's busiest applications — but the failures that matter most often go undetected by standard process monitors. A read replica that has fallen 30 seconds behind the primary looks healthy from the OS perspective while serving stale data to every read request. A mysqldump that silently exits with a non-zero code leaves you with no backup while your cron job cheerfully logs "done."

Vigilmon gives you external visibility into MySQL and MariaDB health through HTTP probe monitoring and heartbeat monitoring for replication jobs and backup tasks. This tutorial covers both.

Why Database Process Monitoring Is Not Enough

systemd, Docker health checks, and RDS enhanced monitoring confirm the mysqld process is running. They do not tell you:

Whether MySQL is reachable from your application servers across the network
Whether replication lag on a read replica has grown past an acceptable threshold
Whether your connection pool is saturated and new connections are being rejected
Whether mysqldump or a replication lag monitoring job has silently stopped
Whether a slow query is holding a table lock and blocking writes

External monitoring through Vigilmon probes the actual connectivity and response behaviour your application depends on.

Step 1: Build a MySQL Health Endpoint

MySQL does not expose an HTTP health interface natively. Add a thin health route to your application server.

Node.js / Express Example

// health/mysql.js
const express = require('express');
const mysql2 = require('mysql2/promise');

const app = express();
const pool = mysql2.createPool({
  uri: process.env.MYSQL_URL,
  waitForConnections: true,
  connectionLimit: 2,
});

app.get('/health/mysql', async (req, res) => {
  let conn;
  try {
    conn = await pool.getConnection();

    // Verify basic connectivity
    await conn.query('SELECT 1');

    // Check replication lag on replicas
    const [rows] = await conn.query('SHOW SLAVE STATUS');
    if (rows.length > 0) {
      const lag = rows[0].Seconds_Behind_Master;
      if (lag === null) {
        return res.status(503).json({
          status: 'degraded',
          reason: 'replication_not_running',
        });
      }
      if (lag > 30) {
        return res.status(503).json({
          status: 'degraded',
          reason: 'replication_lag',
          lag_seconds: lag,
        });
      }
    }

    return res.status(200).json({ status: 'ok' });
  } catch (err) {
    return res.status(503).json({ status: 'down', error: err.message });
  } finally {
    if (conn) conn.release();
  }
});

app.listen(3003);

Python (FastAPI) Example

# health_mysql.py
import aiomysql, os
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
pool = None

@app.on_event("startup")
async def startup():
    global pool
    pool = await aiomysql.create_pool(
        host=os.environ["MYSQL_HOST"],
        user=os.environ["MYSQL_USER"],
        password=os.environ["MYSQL_PASSWORD"],
        db=os.environ["MYSQL_DB"],
        minsize=1, maxsize=2,
    )

@app.get("/health/mysql")
async def mysql_health():
    try:
        async with pool.acquire() as conn:
            async with conn.cursor(aiomysql.DictCursor) as cur:
                await cur.execute("SELECT 1")
                await cur.execute("SHOW SLAVE STATUS")
                rows = await cur.fetchall()
                if rows:
                    lag = rows[0].get("Seconds_Behind_Master")
                    if lag is None:
                        return JSONResponse(status_code=503, content={
                            "status": "degraded",
                            "reason": "replication_not_running",
                        })
                    if int(lag) > 30:
                        return JSONResponse(status_code=503, content={
                            "status": "degraded",
                            "reason": "replication_lag",
                            "lag_seconds": int(lag),
                        })
        return {"status": "ok"}
    except Exception as e:
        return JSONResponse(status_code=503, content={"status": "down", "error": str(e)})

Verify the endpoint before wiring up Vigilmon:

curl -i https://your-app.example.com/health/mysql
# HTTP/1.1 200 OK
# {"status":"ok"}

Step 2: Configure Vigilmon HTTP Monitor for MySQL

Log in to vigilmon.online and go to Monitors → New Monitor
Choose HTTP / HTTPS
Set the URL to your health endpoint: https://your-app.example.com/health/mysql
Set the check interval to 1 minute
Under Expected response, configure:
- Status code: 200
- Response body contains: "status":"ok"
- Response time threshold: 2000ms
Under Alert channels, assign your Slack or PagerDuty channel
Save the monitor

Vigilmon probes from multiple regions simultaneously, requiring multi-region confirmation before opening an incident. You get actionable alerts, not noise.

Monitoring Primary and Replicas Separately

Create one Vigilmon monitor per database role:

[mysql-primary] /health/mysql — immediate P1 page on failure
[mysql-replica-1] /health/mysql — P2 Slack alert (includes replication lag check)
[mysql-replica-2] /health/mysql — P2 Slack alert

The replica health endpoints use SHOW SLAVE STATUS to surface replication lag directly in the HTTP response, so Vigilmon detects lagging replicas as easily as downed ones. Point your read replica monitor URL at the replica application server instance to get accurate topology coverage.

Step 3: Heartbeat Monitoring for mysqldump and Replication Lag Jobs

Scheduled database tasks fail silently more often than you would expect. mysqldump can exit with a non-zero code while the cron job wrapper logs no error. A replication monitoring script can crash and leave you with no visibility into lag for hours.

Vigilmon heartbeat monitors detect silent failures: your job pings Vigilmon at the end of each successful run. If the ping stops arriving within the expected window, Vigilmon fires an alert.

Set Up the Heartbeat Monitor

In Vigilmon, go to Monitors → New Monitor → Heartbeat
Set the name: mysqldump-nightly
Set the expected interval: 24 hours
Set the grace period: 1 hour
Save — copy the heartbeat URL, e.g. https://vigilmon.online/heartbeat/abc123xyz

Wire It Into Your Backup Script

#!/bin/bash
# backup-mysql.sh
set -euo pipefail

BACKUP_FILE="/backups/mysql-$(date +%Y%m%d).sql.gz"

mysqldump \
  --host="$MYSQL_HOST" \
  --user="$MYSQL_USER" \
  --password="$MYSQL_PASSWORD" \
  --all-databases \
  --single-transaction \
  --routines \
  | gzip > "$BACKUP_FILE"

# Only ping if dump succeeded (set -e exits on error)
curl -fsS "$VIGILMON_HEARTBEAT_URL" > /dev/null
echo "Backup complete and heartbeat sent: $BACKUP_FILE"

The set -euo pipefail directive ensures the heartbeat is only sent when the dump completes without errors. If mysqldump fails, the script exits early and the heartbeat ping is never sent — Vigilmon alerts within the grace period.

Step 4: MySQL vs MariaDB Monitoring Differences

MySQL and MariaDB share the same wire protocol, but there are a few monitoring differences to be aware of.

Replication status command: MariaDB 10.5+ uses SHOW REPLICA STATUS (MySQL 8.0.22+ also changed to this). For compatibility across both, use a fallback:

let replicationRows;
try {
  [replicationRows] = await conn.query('SHOW REPLICA STATUS');
} catch {
  [replicationRows] = await conn.query('SHOW SLAVE STATUS'); // older versions
}

User statistics: MariaDB exposes SHOW USER_STATISTICS for per-user query counts, useful for identifying application users hammering the database. MySQL requires Performance Schema queries instead.

Galera Cluster (MariaDB multi-primary): expose wsrep_cluster_size and wsrep_local_state_comment in your health endpoint:

SHOW STATUS LIKE 'wsrep_cluster_size';
SHOW STATUS LIKE 'wsrep_local_state_comment';
-- Expected: wsrep_local_state_comment = 'Synced'

Return HTTP 503 if wsrep_local_state_comment is not Synced — the node is not in a healthy cluster state.

Step 5: Connection Pool Alert Routing

Connection pool exhaustion causes a sharp latency cliff: every new request blocks waiting for a slot instead of executing immediately. Surface the connection count in your health endpoint:

SHOW STATUS LIKE 'Threads_connected';
SHOW VARIABLES LIKE 'max_connections';

const [[{ Value: currentConn }]] = await conn.query(
  "SHOW STATUS LIKE 'Threads_connected'"
);
const [[{ Value: maxConn }]] = await conn.query(
  "SHOW VARIABLES LIKE 'max_connections'"
);

const usage = parseInt(currentConn) / parseInt(maxConn);
if (usage > 0.85) {
  return res.status(503).json({
    status: 'degraded',
    reason: 'connection_pool_near_limit',
    current: parseInt(currentConn),
    max: parseInt(maxConn),
  });
}

Configure Vigilmon alert routing:

| Monitor | Alert Channel | Priority | |---|---|---| | MySQL primary /health/mysql | Slack + PagerDuty | P1 | | MySQL replica /health/mysql | Slack | P2 | | Heartbeat: mysqldump | Email | P2 | | Heartbeat: replication lag job | Slack | P2 |

Summary

MySQL and MariaDB failures — replication lag, connection exhaustion, silent backup failures — don't always produce visible errors. Vigilmon catches them before they become user-facing incidents:

| Monitor Type | What It Covers | |---|---| | HTTP monitor on /health/mysql (primary) | Connectivity, connection pool saturation | | HTTP monitor on /health/mysql (replicas) | Replication lag, replica availability | | Heartbeat monitor | mysqldump success, replication job liveness |

Get started free at vigilmon.online — your first MySQL monitor is running in under two minutes.