How to Monitor Redis Performance with Vigilmon

Redis is the database that makes your application feel instant — until it doesn't. A cache stampede, an OOM kill, a misconfigured eviction policy, or a slow replica can make Redis go from sub-millisecond to a bottleneck in minutes. By the time your APM tool shows elevated latency, the damage is already reaching your users.

I've been running Redis in production for years, and the lesson I keep learning is that Redis performance problems usually telegraph themselves before they become outages. Memory pressure builds gradually. Hit rate drops before the cache is empty. Command queues grow before the server blocks. External monitoring with Vigilmon catches these signals early and pages you before users notice.

This tutorial walks through setting up Redis performance monitoring with Vigilmon — covering the metrics that matter, how to expose them over HTTP, and how to configure alerting for the failure patterns that actually bite you in production.

The Redis Metrics That Matter for Production

Before setting up monitoring, it's worth understanding which metrics predict real failure. Redis exposes everything via INFO, but most of it is noise. The signals that matter:

Memory pressure — used_memory vs maxmemory. When Redis approaches its memory limit under a noeviction policy, writes start returning OOM command not allowed. Under eviction policies, hit rate degrades silently as keys get evicted before their natural expiry.

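Hit/miss ratio — keyspace_hits / (keyspace_hits + keyspace_misses). A healthy cache should be above 90%, though the right target depends on the workload. Both counters are cumulative since the server started (or since the last CONFIG RESETSTAT), so compare successive samples to spot a sudden drop. A sharp fall usually means keys are being evicted or expiring faster than they are repopulated, which is what a stampede looks like.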

Commands per second — instantaneous_ops_per_sec. A spike can indicate a runaway client loop; a sudden drop can indicate a blocked command or a client disconnect.

Connected clients — connected_clients. Approaching maxclients causes new connections to be refused with ERR max number of clients reached.

Replication lag — master_repl_offset vs replica offsets. A large and growing gap means replicas are serving stale data.

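Blocked clients — blocked_clients. A non-zero value means clients are waiting on a blocking call such as BLPOP, BRPOP, or WAIT. That is normal for idle queue consumers, so watch for a count that departs from your usual baseline, which can indicate a stalled producer or consumer.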
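Two of these signals need a little arithmetic on top of raw INFO output. Here is a minimal sketch, assuming you sample INFO yourself on a fixed interval: the first function computes hit rate over the window between two samples instead of over the server's lifetime, and the second turns the replication section of a master into a per-replica byte lag. The function names are mine, not part of any library.

```python
import re


def windowed_hit_rate(prev, curr):
    """Hit rate over the interval between two INFO stats samples.

    prev and curr are dicts holding the cumulative keyspace_hits and
    keyspace_misses counters. Returns None when there were no lookups in
    the window or a counter went backwards (restart or CONFIG RESETSTAT).
    """
    hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    if hits < 0 or misses < 0 or hits + misses == 0:
        return None
    return hits / (hits + misses)


# On a master, INFO replication lists each replica on a line such as:
# slave0:ip=10.0.0.5,port=6379,state=online,offset=1180,lag=1
_REPLICA_LINE = re.compile(r"^slave\d+:(.+)$", re.MULTILINE)


def replica_lag_bytes(replication_text):
    """Map each replica's 'ip:port' to how many bytes it trails the master."""
    master = re.search(r"^master_repl_offset:(\d+)", replication_text, re.MULTILINE)
    if master is None:
        return {}
    master_offset = int(master.group(1))
    lags = {}
    for line in _REPLICA_LINE.findall(replication_text):
        # strip() drops the trailing carriage return that INFO lines carry
        fields = dict(pair.split("=", 1) for pair in line.strip().split(","))
        lags[f"{fields['ip']}:{fields['port']}"] = master_offset - int(fields["offset"])
    return lags
```

A windowed rate reacts within one sampling interval, while the lifetime ratio on a long-running server can take hours to move. How many bytes of lag count as "too far behind" depends on your write volume, so treat any threshold as something to tune.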

Step 1: Build a Redis Performance Health Endpoint

Redis doesn't expose HTTP natively, so I build a thin sidecar that reads INFO and returns a structured health payload. This endpoint becomes the probe target for Vigilmon.

Node.js (Express + ioredis)

// health/redis-perf.js
const express = require('express');
const Redis = require('ioredis');

const app = express();
const redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');

app.get('/health/redis', async (req, res) => {
  try {
    // Verify connectivity with PING
    await redis.ping();

    // Fetch the relevant INFO sections in parallel
    const [memInfo, statsInfo, clientsInfo, replicationInfo] = await Promise.all([
      redis.info('memory'),
      redis.info('stats'),
      redis.info('clients'),
      redis.info('replication'),
    ]);

    // Anchor each key to the start of a line so it cannot match inside a longer field name
    const parse = (section, key) => {
      const match = section.match(new RegExp(`^${key}:(\\S+)`, 'm'));
      return match ? match[1] : null;
    };

    const usedMemory = parseInt(parse(memInfo, 'used_memory') || '0');
    const maxMemory = parseInt(parse(memInfo, 'maxmemory') || '0');
    const memRatio = maxMemory > 0 ? usedMemory / maxMemory : 0;

    const hits = parseInt(parse(statsInfo, 'keyspace_hits') || '0');
    const misses = parseInt(parse(statsInfo, 'keyspace_misses') || '0');
    const hitRate = hits + misses > 0 ? hits / (hits + misses) : 1.0;

    const opsPerSec = parseInt(parse(statsInfo, 'instantaneous_ops_per_sec') || '0');
    const blockedClients = parseInt(parse(clientsInfo, 'blocked_clients') || '0');
    const connectedClients = parseInt(parse(clientsInfo, 'connected_clients') || '0');

    const role = parse(replicationInfo, 'role');
    const masterLinkStatus = parse(replicationInfo, 'master_link_status');

    // Evaluate health
    if (memRatio > 0.95) {
      return res.status(503).json({
        status: 'critical',
        reason: 'memory_pressure',
        mem_ratio: memRatio.toFixed(3),
        used_bytes: usedMemory,
        max_bytes: maxMemory,
      });
    }

    if (hitRate < 0.7 && hits + misses > 1000) {
      return res.status(503).json({
        status: 'degraded',
        reason: 'low_hit_rate',
        hit_rate: hitRate.toFixed(3),
      });
    }

    if (role === 'slave' && masterLinkStatus === 'down') {
      return res.status(503).json({
        status: 'degraded',
        reason: 'replica_disconnected',
      });
    }

    return res.status(200).json({
      status: 'ok',
      mem_ratio: memRatio.toFixed(3),
      hit_rate: hitRate.toFixed(3),
      ops_per_sec: opsPerSec,
      connected_clients: connectedClients,
      blocked_clients: blockedClients,
      role,
    });
  } catch (err) {
    return res.status(503).json({ status: 'down', error: err.message });
  }
});

app.listen(3001, () => console.log('Redis health endpoint on :3001'));

Python (FastAPI + redis-py)

# health/redis_perf.py
import os
import redis
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
r = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"))

@app.get("/health/redis")
def redis_health():
    try:
        r.ping()

        mem = r.info("memory")
        stats = r.info("stats")
        clients = r.info("clients")
        replication = r.info("replication")

        used = mem.get("used_memory", 0)
        maxmem = mem.get("maxmemory", 0)
        mem_ratio = used / maxmem if maxmem > 0 else 0.0

        hits = stats.get("keyspace_hits", 0)
        misses = stats.get("keyspace_misses", 0)
        hit_rate = hits / (hits + misses) if (hits + misses) > 0 else 1.0

        if mem_ratio > 0.95:
            return JSONResponse(status_code=503, content={
                "status": "critical",
                "reason": "memory_pressure",
                "mem_ratio": round(mem_ratio, 3),
            })

        if hit_rate < 0.7 and (hits + misses) > 1000:
            return JSONResponse(status_code=503, content={
                "status": "degraded",
                "reason": "low_hit_rate",
                "hit_rate": round(hit_rate, 3),
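            })

        # Mirror the Node.js check: a replica that lost its master link is degraded
        if replication.get("role") == "slave" and replication.get("master_link_status") == "down":
            return JSONResponse(status_code=503, content={
                "status": "degraded",
                "reason": "replica_disconnected",
            })
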
        return {
            "status": "ok",
            "mem_ratio": round(mem_ratio, 3),
            "hit_rate": round(hit_rate, 3),
            "ops_per_sec": stats.get("instantaneous_ops_per_sec", 0),
            "connected_clients": clients.get("connected_clients", 0),
            "blocked_clients": clients.get("blocked_clients", 0),
        }
    except Exception as e:
        return JSONResponse(status_code=503, content={"status": "down", "error": str(e)})

Deploy this sidecar alongside the application that talks to Redis. Verify it manually before wiring up Vigilmon:

curl -i https://your-app.example.com/health/redis
# HTTP/1.1 200 OK
# {"status":"ok","mem_ratio":"0.420","hit_rate":"0.940","ops_per_sec":1823,...}

Step 2: Configure Vigilmon HTTP Monitor for Redis Performance

Log in to vigilmon.online and go to Monitors → New Monitor
Choose HTTP / HTTPS
Set the URL: https://your-app.example.com/health/redis
Set the check interval: 1 minute
Under Expected response:
- Status code: 200
- Response body contains: "status":"ok"
- Response time threshold: 500ms (a slow Redis health endpoint often indicates backpressure)
Under Alert channels, assign your Slack or PagerDuty channel
Save
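One thing to watch with the body check: I'm assuming "Response body contains" is a plain substring match (I haven't confirmed how Vigilmon implements it), which makes it sensitive to how your JSON is serialized. A minimal sketch of that pitfall, with a hypothetical helper standing in for the monitor's check:

```python
import json

# The fragment configured in the monitor (note: no space after the colon)
EXPECTED_FRAGMENT = '"status":"ok"'


def passes_body_check(body, fragment=EXPECTED_FRAGMENT):
    """Return True when the raw response body contains the expected fragment."""
    return fragment in body


payload = {"status": "ok", "hit_rate": 0.94}

# Compact separators give the same shape JSON.stringify produces in the Node.js example
compact = json.dumps(payload, separators=(",", ":"))

# json.dumps defaults to ", " and ": " separators, which puts a space after the colon
spaced = json.dumps(payload)

print(passes_body_check(compact))  # True
print(passes_body_check(spaced))   # False
```

If you hand-roll the response with a pretty-printing serializer, the monitor would report Down even though Redis is healthy, so check the exact bytes with curl first.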

Vigilmon checks your endpoint from multiple geographic regions simultaneously and requires consensus across regions before opening an incident. A single slow probe won't page you — but a genuine Redis failure that affects all your users will.
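Vigilmon's exact consensus rule isn't something I can show here, but the general idea is easy to sketch. The version below assumes a simple majority quorum; the function name and the 0.5 threshold are illustrative, not Vigilmon's actual algorithm.

```python
def should_open_incident(region_results, quorum=0.5):
    """Decide whether failed probes amount to a real outage.

    region_results maps a region name to True (probe passed) or False
    (probe failed). An incident opens only when the share of failing
    regions is strictly greater than quorum.
    """
    if not region_results:
        return False
    failures = sum(1 for ok in region_results.values() if not ok)
    return failures / len(region_results) > quorum


# One slow region out of three stays quiet; a failure seen everywhere pages
print(should_open_incident({"us-east": True, "eu-west": False, "ap-south": True}))
print(should_open_incident({"us-east": False, "eu-west": False, "ap-south": False}))
```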

What Each Status Means

| /health/redis response | Vigilmon status | What's happening |
|---|---|---|
| 200 {"status":"ok"} | Up | Healthy cache, normal operations |
| 503 {"reason":"memory_pressure"} | Down | Redis approaching OOM, writes failing soon |
| 503 {"reason":"low_hit_rate"} | Down | Cache eviction storm or stampede in progress |
| 503 {"reason":"replica_disconnected"} | Down | Replica has lost its master link and may serve stale data |
| 503 {"status":"down"} | Down | Redis unreachable or not answering (crash or network partition) |
| Connection refused | Down | The health sidecar itself is not running |

Step 3: Monitor the Redis TCP Port Directly

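The HTTP health endpoint catches application-layer issues, but I also recommend a TCP port monitor as a lower-level check. If the Redis port (default: 6379) stops accepting connections, the health endpoint can only report a generic down status, and if the sidecar itself dies you learn nothing about Redis at all. A direct TCP check tells you whether the Redis process is still listening.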

In Vigilmon, go to Monitors → New Monitor → TCP Port
Set the host: redis.internal (your Redis hostname; it has to be resolvable and reachable from wherever the probe runs)
Set the port: 6379
Set the check interval: 30 seconds
Alert on connection failure immediately — no response time threshold needed here

The TCP monitor is your "is Redis even running" check. The HTTP monitor is your "is Redis healthy enough to serve traffic" check. Together they give you a complete picture.
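To make clear what a TCP port check does and does not tell you, here is a minimal sketch of the same probe in Python. It only attempts a connection and closes it; the function name and default timeout are my own choices.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host and completes the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures
        return False
```

Calling tcp_port_open("redis.internal", 6379) from a machine inside your network is a quick way to sanity-check the monitor's target. A successful handshake proves something is listening on the port, not that Redis is answering commands, which is why the HTTP monitor stays in place.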

Step 4: Alerting for Redis Performance Degradation

Cache outages follow a predictable cascade: hit rate drops → database query rate spikes → response latency climbs → user-facing errors. You want to catch the hit rate drop before the database is overwhelmed.

In Vigilmon, configure alert escalation per monitor:

Redis HTTP monitor → immediate Slack notification + PagerDuty page
Redis TCP monitor → immediate Slack + PagerDuty page (even faster signal of process crash)

For response time thresholds, add a second alert tier:

Warn at 200ms for the Redis health endpoint. Redis operations are normally sub-millisecond, so once you allow for the usual network round trip to your endpoint, a 200ms response suggests the Redis event loop is under pressure
Page at 1000ms, when the server is probably backed up with slow commands
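The two tiers boil down to a small classification. A minimal sketch, assuming the 200ms and 1000ms thresholds above; the function and constant names are illustrative:

```python
WARN_MS = 200
PAGE_MS = 1000


def latency_tier(response_ms, warn_ms=WARN_MS, page_ms=PAGE_MS):
    """Classify a health-check response time as 'ok', 'warn', or 'page'."""
    if response_ms >= page_ms:
        return "page"
    if response_ms >= warn_ms:
        return "warn"
    return "ok"


for sample in (35, 240, 1500):
    print(sample, latency_tier(sample))
```

If your probes run far from the endpoint, raise both thresholds by your typical round-trip time so ordinary network latency doesn't register as Redis pressure.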

Step 5: Heartbeat Monitoring for Redis-Dependent Workers

If you run background workers that consume from Redis queues (BullMQ, Celery, Sidekiq, RQ), the workers will stall silently when Redis degrades — even before your HTTP endpoint returns an error. Vigilmon heartbeat monitors catch this: your worker pings Vigilmon after each successful processing cycle, and Vigilmon alerts if the pings stop.

BullMQ Example

import { Worker } from 'bullmq';
import fetch from 'node-fetch';

const worker = new Worker('payments', async (job) => {
  await processPayment(job.data);
}, { connection: { host: 'redis.internal', port: 6379 } });

worker.on('completed', () => {
  // Ping Vigilmon after each successful job
  fetch(process.env.VIGILMON_HEARTBEAT_URL).catch(() => {});
});

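Set up the heartbeat monitor in Vigilmon with an expected interval slightly longer than your slowest job and a grace period of about twice that interval. A stopped worker will miss its heartbeat window and trigger an alert before you'd notice from metrics alone. Because this worker only pings when a job completes, an idle queue also stops the heartbeat, so size the interval for your quietest period or ping on a timer as well.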
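For Python workers (Celery, RQ), the same pattern can be sketched with the standard library. This version throttles pings so a high-throughput worker doesn't send one request per job. The Heartbeat class, its parameters, and the 30-second default are my own assumptions, not a Vigilmon client.

```python
import time
import urllib.request


class Heartbeat:
    """Pings a heartbeat URL at most once per min_interval seconds."""

    def __init__(self, url, min_interval=30.0, send=None, clock=time.monotonic):
        self.url = url
        self.min_interval = min_interval
        self._send = send or self._http_get  # injectable for tests
        self._clock = clock
        self._last_ping = None

    @staticmethod
    def _http_get(url):
        # A short timeout keeps a slow monitoring endpoint from stalling the worker
        with urllib.request.urlopen(url, timeout=5):
            pass

    def beat(self):
        """Call after each successful job. Returns True if a ping was sent."""
        now = self._clock()
        if self._last_ping is not None and now - self._last_ping < self.min_interval:
            return False
        try:
            self._send(self.url)
        except Exception:
            # A monitoring failure must never fail the job; retry on the next beat
            return False
        self._last_ping = now
        return True
```

Create one instance per worker process with the URL from VIGILMON_HEARTBEAT_URL and call beat() wherever a job finishes successfully. Keep min_interval comfortably shorter than the interval configured on the monitor, otherwise the throttle itself can cause a missed heartbeat.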

Summary

Redis performance problems are easy to miss before they become catastrophic. Here's the full monitoring stack:

| Monitor | Type | What It Catches |
|---|---|---|
| /health/redis HTTP probe | HTTP | Memory pressure, hit rate drops, replica disconnects |
| Port 6379 TCP probe | TCP | Process crash, network partition |
| Worker heartbeat | Heartbeat | Silent job queue stalls |

Multi-region consensus in Vigilmon means you get confident, actionable alerts instead of alert fatigue from transient probe failures.

Start monitoring free at vigilmon.online — your first Redis monitor is running in under two minutes.