Redis is fast, lightweight, and deceptively easy to operate — until a cache stampede clears your instance at peak traffic, an OOM killer terminates the process, or a misconfigured maxmemory-policy silently starts dropping keys. When Redis goes down, your application doesn't gracefully degrade; it hammers the database with every request that used to hit the cache, amplifying the outage.
Vigilmon gives you external visibility into Redis availability through HTTP probe monitoring and heartbeat monitoring for Redis-dependent background workers. This tutorial walks through setting up both.
Why Redis Monitoring Matters
Internal process monitoring (systemd, supervisor, Docker health checks) tells you whether the Redis process is running. It cannot tell you:
- Whether Redis is reachable from your application servers across the network
- Whether Redis has hit its maxmemory limit and is refusing writes
- Whether replica lag has grown large enough to serve stale data
- Whether your Redis-dependent background jobs have silently stopped processing
- Whether an eviction storm is returning empty responses without any error codes
These failure modes all return healthy from a process-level check while causing real user-facing impact. External monitoring through Vigilmon catches them by testing the actual path your application uses.
Step 1: Build a Redis Health Endpoint
Redis does not expose an HTTP health endpoint natively. You need a thin wrapper in your application layer (or a dedicated sidecar) that checks Redis and returns HTTP 200/503.
Node.js Example
// healthcheck.js — Redis health endpoint for Express
const express = require('express');
const redis = require('redis');
const app = express();
const client = redis.createClient({ url: process.env.REDIS_URL });
client.connect().catch(console.error);
app.get('/health/redis', async (req, res) => {
  try {
    // PING/PONG verifies the connection is alive
    const pong = await client.ping();
    if (pong !== 'PONG') throw new Error('Unexpected PING response');
    // Check memory: fail the check if usage exceeds 95% of maxmemory
    const info = await client.info('memory');
    const usedMatch = info.match(/used_memory:(\d+)/);
    const maxMatch = info.match(/maxmemory:(\d+)/);
    const used = usedMatch ? parseInt(usedMatch[1], 10) : 0;
    // maxmemory is 0 when no limit is configured, so the pressure check is skipped
    const max = maxMatch ? parseInt(maxMatch[1], 10) : 0;
    const memPressure = max > 0 ? used / max : 0;
    if (memPressure > 0.95) {
      return res.status(503).json({
        status: 'critical',
        reason: 'memory_pressure',
        used_bytes: used,
        max_bytes: max,
      });
    }
    return res.status(200).json({ status: 'ok', pong });
  } catch (err) {
    return res.status(503).json({ status: 'down', error: err.message });
  }
});
app.listen(3001);
Python (FastAPI) Example
# healthcheck.py — Redis health endpoint for FastAPI
import os
import redis
from fastapi import FastAPI
from fastapi.responses import JSONResponse
app = FastAPI()
r = redis.Redis.from_url(os.environ["REDIS_URL"], decode_responses=True)
@app.get("/health/redis")
def redis_health():
    try:
        pong = r.ping()
        info = r.info("memory")
        used = info.get("used_memory", 0)
        max_mem = info.get("maxmemory", 0)  # 0 means no memory limit is configured
        # Fail the check once usage exceeds 95% of maxmemory
        if max_mem > 0 and used / max_mem > 0.95:
            return JSONResponse(status_code=503, content={
                "status": "critical",
                "reason": "memory_pressure",
            })
        return {"status": "ok", "pong": pong}
    except Exception as e:
        return JSONResponse(status_code=503, content={
            "status": "down",
            "error": str(e),
        })
Deploy this endpoint on the same host as your Redis client code. Once it's live, verify it manually:
curl -i https://your-app.example.com/health/redis
# HTTP/1.1 200 OK
# {"status":"ok","pong":true}
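The memory check in both endpoints reduces to one ratio: used_memory divided by maxmemory, compared against the 95% threshold. The following is a minimal sketch of that calculation as a standalone function, assuming the raw text form of INFO memory (the form redis-cli prints). The names parse_info and memory_status are illustrative and not part of any library.

```python
def parse_info(raw: str) -> dict:
    """Parse the 'key:value' lines of a Redis INFO section into a dict of strings."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_status(raw_info: str, threshold: float = 0.95) -> tuple[int, str]:
    """Return (http_status, label) for an INFO memory payload."""
    fields = parse_info(raw_info)
    used = int(fields.get("used_memory", 0))
    max_mem = int(fields.get("maxmemory", 0))
    if max_mem == 0:
        return 200, "ok"  # no limit configured, so there is no ratio to check
    if used / max_mem > threshold:
        return 503, "critical"
    return 200, "ok"


# Hand-written sample payload, not captured from a real server
sample = (
    "# Memory\r\n"
    "used_memory:980\r\n"
    "used_memory_human:980B\r\n"
    "maxmemory:1000\r\n"
    "maxmemory_policy:noeviction\r\n"
)
print(memory_status(sample))  # (503, 'critical')
```

Keeping the ratio logic separate from the web framework makes it easy to unit test the threshold without a running Redis instance.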
Step 2: Configure a Vigilmon HTTP Monitor for Redis
- Log in to vigilmon.online and go to Monitors → New Monitor
- Choose HTTP / HTTPS
- Set the URL to your Redis health endpoint: https://your-app.example.com/health/redis
- Set the check interval to 1 minute
- Under Expected response, configure:
  - Status code: 200
  - Response body contains: "status":"ok"
  - Response time threshold: 1000ms
- Under Alert channels, assign your Slack or email channel
- Save the monitor
Vigilmon probes from multiple geographic regions simultaneously. A transient single-probe blip will not page you; Vigilmon requires multi-region consensus before opening an incident. You get confident, actionable alerts — not alert fatigue from flapping monitors.
What This Catches
| Failure | systemd/Docker | Vigilmon |
|---|---|---|
| Redis process crash | ✓ | ✓ |
| Network partition to Redis | ✗ | ✓ |
| Memory pressure (OOM imminent) | ✗ | ✓ |
| Blocked by maxmemory-policy noeviction | ✗ | ✓ |
| Wrong REDIS_URL in app config | ✗ | ✓ |
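Note that PING and the memory ratio do not exercise a write, so a health endpoint that should catch writes rejected under noeviction may also want a write probe. The following is a rough sketch of one, assuming a client that exposes redis-py style set(name, value, ex=...) and get(name). The function name write_probe, the key prefix healthcheck:canary, and the FakeRedis stand-in are all hypothetical and exist only for this illustration.

```python
import time


def write_probe(client, key_prefix: str = "healthcheck:canary", ttl_seconds: int = 60) -> dict:
    """Write a short-lived canary key and read it back.

    A unique key per probe avoids races when several probes run at once.
    """
    token = str(time.time_ns())
    key = f"{key_prefix}:{token}"
    try:
        client.set(key, token, ex=ttl_seconds)
        stored = client.get(key)
    except Exception as exc:  # e.g. an OOM error reply under noeviction, or a connection error
        return {"status": "down", "reason": "write_failed", "error": str(exc)}
    if isinstance(stored, bytes):
        stored = stored.decode()  # clients without decode_responses return bytes
    if stored != token:
        return {"status": "degraded", "reason": "canary_mismatch"}
    return {"status": "ok"}


class FakeRedis:
    """In-memory stand-in used only to demonstrate the probe without a server."""

    def __init__(self, reject_writes: bool = False):
        self.data = {}
        self.reject_writes = reject_writes

    def set(self, name, value, ex=None):
        if self.reject_writes:
            raise RuntimeError("OOM command not allowed when used memory > 'maxmemory'.")
        self.data[name] = value
        return True

    def get(self, name):
        return self.data.get(name)


print(write_probe(FakeRedis()))                              # {'status': 'ok'}
print(write_probe(FakeRedis(reject_writes=True))["reason"])  # write_failed
```

The short TTL keeps canary keys from accumulating, at the cost of a small amount of extra write traffic per check.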
Step 3: Heartbeat Monitoring for Redis-Dependent Workers
Background workers that consume from Redis queues (Sidekiq, BullMQ, Celery with Redis broker, RQ) will stop silently when Redis goes down — often long before any HTTP endpoint becomes unhealthy. The worker just blocks waiting for a connection and never processes another job.
Vigilmon's heartbeat monitors detect silent worker death: your worker sends a ping to Vigilmon after each successful processing cycle. If the ping stops arriving, Vigilmon fires an alert.
Set Up the Heartbeat Monitor
- In Vigilmon, go to Monitors → New Monitor → Heartbeat
- Set the name: redis-queue-worker
- Set the expected interval: 5 minutes (adjust to your job frequency)
- Set the grace period: 10 minutes
- Save, then copy the unique heartbeat URL, e.g. https://vigilmon.online/heartbeat/abc123xyz
Wire It Into Your Worker
Node.js / BullMQ:
import { Worker } from 'bullmq';
import axios from 'axios';
const worker = new Worker('my-queue', async (job) => {
  await processJob(job);
}, { connection: { host: 'localhost', port: 6379 } });
// Ping Vigilmon after each successfully completed job; heartbeat errors are ignored
worker.on('completed', async () => {
  await axios.get(process.env.VIGILMON_HEARTBEAT_URL).catch(() => {});
});
Python / RQ:
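import os
import requests
from redis import Redis
from rq import Queue, Worker
def after_job(job, connection, result, *args, **kwargs):
    # RQ success callback: ping Vigilmon after each job that finishes cleanly
    try:
        requests.get(os.environ["VIGILMON_HEARTBEAT_URL"], timeout=5)
    except requests.RequestException:
        pass  # a failed heartbeat must not fail the job
redis_conn = Redis.from_url(os.environ["REDIS_URL"])
q = Queue(connection=redis_conn)
# Attach the callback when enqueueing, e.g. q.enqueue(my_task, on_success=after_job)
# (my_task is a placeholder for your own job function)
worker = Worker([q], connection=redis_conn)
worker.work(burst=False)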
Ruby / Sidekiq:
# config/initializers/vigilmon_heartbeat.rb
require "net/http"
class VigilmonHeartbeat
  include Sidekiq::ServerMiddleware
  def call(worker, job, queue)
    yield
    # Ping only after the job succeeds; job errors propagate so Sidekiq can retry them
    begin
      Net::HTTP.get(URI(ENV["VIGILMON_HEARTBEAT_URL"]))
    rescue StandardError => e
      Rails.logger.warn("Vigilmon heartbeat failed: #{e}")
    end
  end
end
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add VigilmonHeartbeat
  end
end
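The wiring examples above ping Vigilmon after every completed job, which on a busy queue means one outbound HTTP request per job. One option is to throttle the ping. The following is a minimal sketch, assuming a throttle interval well below the monitor's expected interval. The class name ThrottledHeartbeat and its parameters are hypothetical, and the send and clock arguments are injectable only so the behaviour can be shown without network access.

```python
import time
import urllib.request
from typing import Optional


def http_ping(url: str, timeout: float = 5.0) -> None:
    """Send a GET request to the heartbeat URL using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


class ThrottledHeartbeat:
    """Ping at most once per `min_interval` seconds, however often beat() is called."""

    def __init__(self, url, min_interval=60.0, send=http_ping, clock=time.monotonic):
        self.url = url
        self.min_interval = min_interval
        self.send = send
        self.clock = clock
        self._last_sent: Optional[float] = None

    def beat(self) -> bool:
        """Return True if a ping was attempted, False if it was throttled."""
        now = self.clock()
        if self._last_sent is not None and now - self._last_sent < self.min_interval:
            return False
        self._last_sent = now
        try:
            self.send(self.url)
        except Exception:
            pass  # a failed heartbeat must never fail the job that triggered it
        return True


# Demonstration with a fake clock and a recording sender (no network calls)
sent = []
fake_now = [0.0]
hb = ThrottledHeartbeat(
    "https://vigilmon.online/heartbeat/abc123xyz",
    min_interval=60,
    send=sent.append,
    clock=lambda: fake_now[0],
)
results = []
for t in (0, 10, 59, 60, 130):
    fake_now[0] = t
    results.append(hb.beat())
print(results)  # [True, False, False, True, True]
```

In a worker, beat() would be called from the same success hook that the examples above use for the direct HTTP call.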
Step 4: Redis Cluster Monitoring Considerations
If you run Redis Cluster or Redis Sentinel, each node needs its own consideration:
Monitor the primary node with an HTTP probe — this is the node handling writes, and its downtime is immediately user-impacting.
Monitor each replica independently with separate monitors. Replica health checks reveal replication lag and silent replica failures:
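# Check replica sync status directly with redis-cli
redis-cli -h replica1 INFO replication | grep -E 'master_link_status|master_sync_in_progress'
# master_link_status:up with master_sync_in_progress:0 = connected and not resyncing
# master_link_status:down or master_sync_in_progress:1 = reads may be stale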
Expose this in your health endpoint:
const replication = await client.info('replication');
const role = replication.match(/role:(\w+)/)?.[1];
const masterLinkDown = replication.includes('master_link_status:down');
// INFO replication reports replicas as role:slave
if (role === 'slave' && masterLinkDown) {
  return res.status(503).json({ status: 'degraded', reason: 'replica_disconnected' });
}
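For a Python health endpoint, the same replica check might look like the sketch below. It assumes a dict shaped like the one redis-py returns from client.info("replication") and reads the standard INFO replication fields role, master_link_status, master_sync_in_progress and master_last_io_seconds_ago. The function name replica_status and the 10-second default are illustrative choices, not recommendations from Redis or Vigilmon.

```python
def replica_status(info: dict, max_io_gap_seconds: int = 10) -> tuple[int, dict]:
    """Map INFO replication fields to (http_status, response_body)."""
    if info.get("role") != "slave":
        # Primaries have no upstream link to check
        return 200, {"status": "ok", "role": info.get("role")}
    if info.get("master_link_status") != "up":
        return 503, {"status": "degraded", "reason": "replica_disconnected"}
    if int(info.get("master_sync_in_progress", 0)) == 1:
        return 503, {"status": "degraded", "reason": "resync_in_progress"}
    if int(info.get("master_last_io_seconds_ago", 0)) > max_io_gap_seconds:
        return 503, {"status": "degraded", "reason": "replica_stale"}
    return 200, {"status": "ok", "role": "slave"}


# Hand-written sample payloads, not captured from a real server
healthy = {"role": "slave", "master_link_status": "up",
           "master_sync_in_progress": 0, "master_last_io_seconds_ago": 1}
disconnected = {"role": "slave", "master_link_status": "down",
                "master_sync_in_progress": 0, "master_last_io_seconds_ago": -1}
print(replica_status(healthy)[0])       # 200
print(replica_status(disconnected)[1])  # {'status': 'degraded', 'reason': 'replica_disconnected'}
```

Checking the link status first matters because master_last_io_seconds_ago is reported as -1 while the link is down.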
Use a naming convention in Vigilmon to distinguish nodes at a glance:
- [redis-primary] cache /health
- [redis-replica-1] cache /health
- [redis-replica-2] cache /health
Group all Redis monitors into a single Status Page so your team has a single pane of glass for the cache layer.
Step 5: Alert Routing for Cache Outages
Cache outages follow a pattern: Redis goes down → application response times spike → database CPU spikes → cascading failure. You want to page your on-call engineer before the cascade completes.
In Vigilmon, configure alert channels per monitor:
- Primary Redis monitor → immediate Slack + PagerDuty page (P1)
- Replica monitors → Slack only (P2 — degraded, not down)
- Worker heartbeat monitors → Slack + email (P2 — background jobs failing)
Set response time thresholds as an early warning:
- Alert at 500ms for the Redis health endpoint (cache should be fast; a slow health response may indicate memory pressure before OOM)
- Alert at 2000ms for endpoints that depend on Redis (latency spike signals cache misses)
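If you keep the routing plan in version control alongside your infrastructure code, it can be written down as plain data. The sketch below is only an illustration of the plan described in this step. It is not Vigilmon's configuration format, and the keys, channel names and helper functions are hypothetical.

```python
# Routing plan from this step, expressed as data (hypothetical structure)
ROUTES = {
    "redis-primary": {"priority": "P1", "channels": ["slack", "pagerduty"]},
    "redis-replica": {"priority": "P2", "channels": ["slack"]},
    "worker-heartbeat": {"priority": "P2", "channels": ["slack", "email"]},
}

# Early-warning response time thresholds in milliseconds
LATENCY_WARN_MS = {"redis-health": 500, "redis-dependent": 2000}


def channels_for(monitor_kind: str) -> list[str]:
    """Return the alert channels assigned to a monitor kind."""
    return ROUTES[monitor_kind]["channels"]


def is_slow(endpoint_kind: str, response_ms: float) -> bool:
    """Return True when a response time reaches the early-warning threshold."""
    return response_ms >= LATENCY_WARN_MS[endpoint_kind]


print(channels_for("redis-primary"))  # ['slack', 'pagerduty']
print(is_slow("redis-health", 620))   # True
```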
Summary
Redis failures are fast and catastrophic — don't wait for users to report them. Vigilmon gives you:
| Monitor Type | What It Covers |
|---|---|
| HTTP monitor on /health/redis | Redis process, connectivity, memory pressure |
| HTTP monitor on replica health | Replica sync status, disconnection |
| Heartbeat monitor | Background worker liveness |
Get started free at vigilmon.online — your first Redis monitor is running in under two minutes.