How to Monitor Redis Uptime and Health with Vigilmon

Redis is fast, lightweight, and deceptively easy to operate, until a cache stampede overwhelms your instance at peak traffic, the OOM killer terminates the process, or a misconfigured maxmemory-policy silently starts evicting keys. When Redis goes down, your application doesn't degrade gracefully; it hammers the database with every request that used to hit the cache, amplifying the outage.

Vigilmon gives you external visibility into Redis availability through HTTP probe monitoring and heartbeat monitoring for Redis-dependent background workers. This tutorial walks through setting up both.

Why Redis Monitoring Matters

Internal process monitoring (systemd, supervisor, Docker health checks) tells you whether the Redis process is running. It cannot tell you:

Whether Redis is reachable from your application servers across the network
Whether Redis has hit its maxmemory limit and is refusing writes
Whether replica lag has grown large enough that replicas serve stale data
Whether your Redis-dependent background jobs have silently stopped processing
Whether an eviction storm is returning empty responses without any error codes

These failure modes all return healthy from a process-level check while causing real user-facing impact. External monitoring through Vigilmon catches them by testing the actual path your application uses.

Step 1: Build a Redis Health Endpoint

Redis does not expose an HTTP health endpoint natively. You need a thin wrapper in your application layer (or a dedicated sidecar) that checks Redis and returns HTTP 200/503.

Node.js Example

// healthcheck.js — Redis health endpoint for Express
const express = require('express');
const redis = require('redis');

const app = express();
const client = redis.createClient({ url: process.env.REDIS_URL });

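// node-redis emits 'error' events; without a listener, an error can crash the process
client.on('error', (err) => console.error('Redis client error:', err));
client.connect().catch(console.error);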

app.get('/health/redis', async (req, res) => {
  try {
    // PING/PONG verifies the connection is alive
    const pong = await client.ping();
    if (pong !== 'PONG') throw new Error('Unexpected PING response');

    // Check memory: fail if usage exceeds 95% of maxmemory
    const info = await client.info('memory');
    const usedMatch = info.match(/^used_memory:(\d+)/m);
    const maxMatch = info.match(/^maxmemory:(\d+)/m);

    const used = usedMatch ? parseInt(usedMatch[1], 10) : 0;
    const max = maxMatch ? parseInt(maxMatch[1], 10) : 0;
    // maxmemory of 0 means no limit is configured, so skip the ratio check
    const memPressure = max > 0 ? used / max : 0;

    if (memPressure > 0.95) {
      return res.status(503).json({
        status: 'critical',
        reason: 'memory_pressure',
        used_bytes: used,
        max_bytes: max,
      });
    }

    return res.status(200).json({ status: 'ok', pong });
  } catch (err) {
    return res.status(503).json({ status: 'down', error: err.message });
  }
});

app.listen(3001);

Python (FastAPI) Example

# healthcheck.py — Redis health endpoint for FastAPI
import os

import redis
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
r = redis.Redis.from_url(os.environ["REDIS_URL"], decode_responses=True)

@app.get("/health/redis")
def redis_health():
    try:
        pong = r.ping()
        info = r.info("memory")
        used = info.get("used_memory", 0)
        max_mem = info.get("maxmemory", 0)

        if max_mem > 0 and used / max_mem > 0.95:
            return JSONResponse(status_code=503, content={
                "status": "critical",
                "reason": "memory_pressure",
            })

        return {"status": "ok", "pong": pong}
    except Exception as e:
        return JSONResponse(status_code=503, content={
            "status": "down",
            "error": str(e),
        })

Deploy this endpoint on the same host as your Redis client code. Once it's live, verify it manually:

curl -i https://your-app.example.com/health/redis
# HTTP/1.1 200 OK
# {"status":"ok","pong":true}
# (FastAPI example; the Node.js example returns "pong":"PONG")
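The endpoints above check connectivity and memory headroom but never write to Redis. With maxmemory-policy noeviction, a full instance can reject writes while PING still succeeds, so a write round trip is a useful extra check. The following is a minimal sketch, not part of the original examples: the function name `redis_write_probe` and the key prefix `healthcheck:probe:` are hypothetical, and `client` is assumed to be a redis-py style client created with `decode_responses=True`.

```python
import uuid


def redis_write_probe(client, ttl_seconds=30):
    """Return (ok, reason) after a SET/GET round trip on a throwaway key.

    `client` is assumed to expose redis-py style set(name, value, ex=...)
    and get(name) methods and to return str values.
    """
    key = f"healthcheck:probe:{uuid.uuid4().hex}"  # hypothetical key prefix
    token = uuid.uuid4().hex
    try:
        # ex= sets a TTL so probe keys expire on their own
        if not client.set(key, token, ex=ttl_seconds):
            return False, "write_rejected"
        if client.get(key) != token:
            return False, "read_mismatch"
    except Exception as exc:  # e.g. an OOM error under noeviction, or a timeout
        return False, f"error: {exc}"
    return True, "ok"
```

In the FastAPI handler this could be called as `ok, reason = redis_write_probe(r)` and mapped to a 503 when `ok` is false.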

Step 2: Configure a Vigilmon HTTP Monitor for Redis

1. Log in to vigilmon.online and go to Monitors → New Monitor
2. Choose HTTP / HTTPS
3. Set the URL to your Redis health endpoint: https://your-app.example.com/health/redis
4. Set the check interval to 1 minute
5. Under Expected response, configure:
   - Status code: 200
   - Response body contains: "status":"ok"
   - Response time threshold: 1000ms
6. Under Alert channels, assign your Slack or email channel
7. Save the monitor

Vigilmon probes from multiple geographic regions simultaneously. A transient single-probe blip will not page you; Vigilmon requires multi-region consensus before opening an incident. You get confident, actionable alerts — not alert fatigue from flapping monitors.
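To make the idea of multi-region consensus concrete, here is an illustrative sketch. It is not Vigilmon's actual algorithm, which is not described here; the strict-majority rule, the function name `should_open_incident`, and the `quorum` parameter are all assumptions.

```python
def should_open_incident(region_results, quorum=0.5):
    """Decide whether enough regions agree that the target is down.

    region_results maps a region name to True if the probe passed there.
    An incident opens only when the failing share exceeds `quorum`.
    """
    if not region_results:
        return False  # no data is not treated as an outage
    failures = sum(1 for passed in region_results.values() if not passed)
    return failures / len(region_results) > quorum
```

Under this rule a single failing region out of three does not open an incident, while two out of three does.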

What This Catches

| Failure | systemd/Docker | Vigilmon |
|---|---|---|
| Redis process crash | ✓ | ✓ |
| Network partition to Redis | ✗ | ✓ |
| Memory pressure (OOM imminent) | ✗ | ✓ |
| Blocked by maxmemory-policy noeviction | ✗ | ✓ |
| Wrong REDIS_URL in app config | ✗ | ✓ |

Step 3: Heartbeat Monitoring for Redis-Dependent Workers

Background workers that consume from Redis queues (Sidekiq, BullMQ, Celery with Redis broker, RQ) will stop silently when Redis goes down — often long before any HTTP endpoint becomes unhealthy. The worker just blocks waiting for a connection and never processes another job.

Vigilmon's heartbeat monitors detect silent worker death: your worker sends a ping to Vigilmon after each successful processing cycle. If the ping stops arriving, Vigilmon fires an alert.

Set Up the Heartbeat Monitor

1. In Vigilmon, go to Monitors → New Monitor → Heartbeat
2. Set the name: redis-queue-worker
3. Set the expected interval: 5 minutes (adjust to your job frequency)
4. Set the grace period: 10 minutes
5. Save, then copy the unique heartbeat URL, e.g. https://vigilmon.online/heartbeat/abc123xyz
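The interval and grace period combine into a deadline. The sketch below assumes that a heartbeat counts as missed once the time since the last ping exceeds the interval plus the grace period; Vigilmon's exact semantics may differ, and the function name `heartbeat_overdue` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_overdue(last_ping, now,
                      interval=timedelta(minutes=5),
                      grace=timedelta(minutes=10)):
    """Return True when the last ping is older than interval + grace."""
    return now - last_ping > interval + grace


# Example: with the settings above, a worker that last pinged at 12:00 UTC
# would be considered overdue some time after 12:15 UTC.
last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(heartbeat_overdue(last, last + timedelta(minutes=16)))
```

A longer grace period tolerates bursty queues at the cost of slower detection.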

Wire It Into Your Worker

Node.js / BullMQ:

import { Worker } from 'bullmq';
import axios from 'axios';

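// processJob is your own job handler
const worker = new Worker('my-queue', async (job) => {
  await processJob(job);
}, { connection: { host: 'localhost', port: 6379 } });

// Ping Vigilmon after each completed job; ping failures are ignored so they never affect the worker
worker.on('completed', async () => {
  await axios.get(process.env.VIGILMON_HEARTBEAT_URL, { timeout: 5000 }).catch(() => {});
});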

Python / RQ:

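import os

import requests
from redis import Redis
from rq import Queue, Worker

def after_job(job, connection, result, *args, **kwargs):
    """RQ success callback: ping Vigilmon once a job has finished successfully."""
    try:
        requests.get(os.environ["VIGILMON_HEARTBEAT_URL"], timeout=5)
    except requests.RequestException:
        pass  # keep a failed ping from affecting the job

q = Queue(connection=Redis.from_url(os.environ["REDIS_URL"]))

# Producer side: attach the callback when enqueueing, for example
#   q.enqueue(process_job, payload, on_success=after_job)
# (process_job and payload stand in for your own function and arguments)

worker = Worker([q], connection=q.connection)
worker.work(burst=False)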

Ruby / Sidekiq:

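# config/initializers/vigilmon_heartbeat.rb
require "net/http"

class VigilmonHeartbeat
  include Sidekiq::ServerMiddleware

  def call(worker, job, queue)
    yield # job errors propagate so Sidekiq can still retry the job
    begin
      # Only heartbeat failures are rescued here
      Net::HTTP.get(URI(ENV["VIGILMON_HEARTBEAT_URL"]))
    rescue => e
      Rails.logger.warn("Vigilmon heartbeat failed: #{e}")
    end
  end
end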

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add VigilmonHeartbeat
  end
end
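On a busy queue, pinging after every job sends far more heartbeats than a 5-minute interval needs. One option is to throttle the pings on the worker side. The class below is a sketch under that assumption; `ThrottledHeartbeat`, `min_interval`, and the injected `send` callable are hypothetical names, not part of any Vigilmon client.

```python
import time


class ThrottledHeartbeat:
    """Call `send` at most once every `min_interval` seconds."""

    def __init__(self, send, min_interval=60.0, clock=time.monotonic):
        self._send = send              # callable that performs the HTTP ping
        self._min_interval = min_interval
        self._clock = clock            # injectable for testing
        self._last_sent = None

    def beat(self):
        """Return True if a ping was sent, False if skipped or failed."""
        now = self._clock()
        if self._last_sent is not None and now - self._last_sent < self._min_interval:
            return False               # pinged recently, skip this one
        try:
            self._send()
        except Exception:
            return False               # ping failed; the next beat() retries
        self._last_sent = now
        return True
```

A worker could build it once, for example `hb = ThrottledHeartbeat(lambda: requests.get(url, timeout=5))`, and call `hb.beat()` after each successful job. Keep `min_interval` well below the expected interval configured in the monitor.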

Step 4: Redis Cluster Monitoring Considerations

If you run Redis Cluster or Redis Sentinel, monitor each node on its own terms:

Monitor the primary node with an HTTP probe — this is the node handling writes, and its downtime is immediately user-impacting.

Monitor each replica independently with separate monitors. Replica health checks reveal replication lag and silent replica failures:

# Check replica sync status directly with redis-cli
redis-cli -h replica1 INFO replication | grep master_sync_in_progress
# 0 = no full resync in progress, 1 = replica is resyncing from the primary (reads may be stale)

Expose the replica's link status in your health endpoint, inside the handler from Step 1:

const replication = await client.info('replication');
const role = replication.match(/role:(\w+)/)?.[1];
const masterLinkDown = replication.includes('master_link_status:down');

if (role === 'slave' && masterLinkDown) {
  return res.status(503).json({ status: 'degraded', reason: 'replica_disconnected' });
}
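The same check can be sketched in Python as a pure function over the dictionary that redis-py returns from `client.info("replication")`. The status labels, the function name `replica_status`, and the `max_idle_seconds` threshold are assumptions chosen for illustration. `master_last_io_seconds_ago` is the number of seconds since the replica last heard from its primary, used here as a rough staleness signal rather than an exact lag measurement.

```python
def replica_status(info, max_idle_seconds=30):
    """Classify a node from its INFO replication fields.

    `info` is assumed to be a dict with parsed values, as returned by
    redis-py's client.info("replication").
    """
    if info.get("role") != "slave":
        return "ok"  # primaries have no upstream link to check
    if info.get("master_link_status") != "up":
        return "replica_disconnected"
    if info.get("master_sync_in_progress", 0) == 1:
        return "resyncing"
    if info.get("master_last_io_seconds_ago", 0) > max_idle_seconds:
        return "replica_stale"
    return "ok"
```

A health endpoint could return 503 for any status other than "ok", or treat "resyncing" as a warning, depending on how much stale data your reads can tolerate.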

Use a naming convention in Vigilmon to distinguish nodes at a glance:

[redis-primary] cache /health
[redis-replica-1] cache /health
[redis-replica-2] cache /health

Group all Redis monitors into a single Status Page so your team has a single pane of glass for the cache layer.

Step 5: Alert Routing for Cache Outages

Cache outages follow a pattern: Redis goes down → application response times spike → database CPU spikes → cascading failure. You want to page your on-call engineer before the cascade completes.

In Vigilmon, configure alert channels per monitor:

Primary Redis monitor → immediate Slack + PagerDuty page (P1)
Replica monitors → Slack only (P2 — degraded, not down)
Worker heartbeat monitors → Slack + email (P2 — background jobs failing)
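The routing above can be written down as a small lookup table, which is handy for documenting or reviewing the policy before configuring it in the UI. This is a local illustration only and not a Vigilmon configuration format; the `ROUTES` table, the `route_alert` function, and the fallback route are hypothetical, and the monitor names follow the naming used earlier in this tutorial.

```python
# (name prefix, severity, channels), checked in order
ROUTES = [
    ("redis-primary", "P1", ["slack", "pagerduty"]),
    ("redis-replica", "P2", ["slack"]),
    ("redis-queue-worker", "P2", ["slack", "email"]),
]


def route_alert(monitor_name):
    """Map a monitor name such as "[redis-replica-1] cache /health" to a route."""
    # Take the text inside the leading [...] tag, or the whole name if untagged
    tag = monitor_name.strip().lstrip("[").split("]")[0]
    for prefix, severity, channels in ROUTES:
        if tag.startswith(prefix):
            return {"severity": severity, "channels": channels}
    # Hypothetical fallback for monitors that match no rule
    return {"severity": "P2", "channels": ["slack"]}
```

For example, `route_alert("[redis-primary] cache /health")` resolves to the P1 route with Slack and PagerDuty.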

Set response time thresholds as an early warning:

Alert at 500ms for Redis health endpoint (cache should be fast; a slow health response may indicate memory pressure before OOM)
Alert at 2000ms for endpoints that depend on Redis (latency spike signals cache misses)

Summary

Redis failures are fast and catastrophic — don't wait for users to report them. Vigilmon gives you:

| Monitor Type | What It Covers |
|---|---|
| HTTP monitor on /health/redis | Redis process, connectivity, memory pressure |
| HTTP monitor on replica health | Replica sync status, disconnection |
| Heartbeat monitor | Background worker liveness |

Get started free at vigilmon.online — your first Redis monitor is running in under two minutes.