How to Monitor Celery Workers and Scheduled Tasks with Vigilmon

Celery is Python's most popular distributed task queue — and one of the most common sources of silent production failures. Workers crash without alerting anyone. The beat scheduler stops dispatching tasks because the broker went down and nobody noticed. A task that should run every five minutes hasn't run in six hours, and your users haven't noticed yet because the symptom is delayed email or a stale report rather than a hard error.

Vigilmon gives you heartbeat monitoring for Celery workers and scheduled tasks, and HTTP probe monitoring for a Celery health API you can build in minutes. This tutorial shows you how to wire both up.

Why Celery Monitoring Matters

Celery workers fail in ways that process monitors miss:

Worker crash: the process exits, but if it's managed by supervisor and immediately restarted, the task mid-flight is lost without any trace
Worker stuck: the process is running and appears healthy, but is deadlocked or waiting for a resource and processing no tasks
Beat scheduler down: celery beat stops dispatching periodic tasks; workers remain healthy and idle; no error is raised anywhere
Broker connection lost: workers lose connectivity to RabbitMQ or Redis; they retry silently, but tasks queue indefinitely on the broker side

Without external monitoring, you discover these failures from:

Users reporting that scheduled reports haven't arrived
Data pipelines that should run hourly having a 12-hour gap in their output
Email deliveries stalled for hours after a Redis blip

Vigilmon catches these failures through two complementary mechanisms: heartbeat monitors that verify task execution, and HTTP probes that verify worker availability.

Step 1: Build a Celery Health Endpoint

Celery exposes worker status through the celery inspect ping command, which broadcasts a ping to all connected workers and collects a reply from each. The same call is available from Python as app.control.inspect().ping(), so you can wrap it in an HTTP health endpoint:

# healthcheck.py: FastAPI health endpoint for Celery workers
import os
from celery import Celery
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app_api = FastAPI()

# Initialize Celery app — adjust broker URL to your config
celery_app = Celery(
    "myapp",
    broker=os.environ.get("CELERY_BROKER_URL", "redis://redis:6379/0"),
)

@app_api.get("/health/celery")
def celery_health():
    try:
        inspector = celery_app.control.inspect(timeout=3)
        active = inspector.ping()

        if not active:
            return JSONResponse(status_code=503, content={
                "status": "no_workers",
                "detail": "No Celery workers responded to ping",
            })

        worker_count = len(active)
        return JSONResponse(status_code=200, content={
            "status": "ok",
            "workers": worker_count,
            "detail": list(active.keys()),
        })
    except Exception as e:
        return JSONResponse(status_code=503, content={
            "status": "down",
            "error": str(e),
        })

For Django projects (with or without django-celery-beat), the same check fits in a plain view:

# myapp/views/health.py
from celery import current_app
from django.http import JsonResponse

def celery_health(request):
    try:
        inspector = current_app.control.inspect(timeout=3)
        ping_result = inspector.ping()

        if not ping_result:
            return JsonResponse({"status": "no_workers"}, status=503)

        return JsonResponse({
            "status": "ok",
            "workers": len(ping_result),
        })
    except Exception as e:
        return JsonResponse({"status": "down", "error": str(e)}, status=503)

Add the view to your URL configuration:

# urls.py
from django.urls import path
from myapp.views.health import celery_health

urlpatterns = [
    path("health/celery", celery_health),
]

Verify before adding to Vigilmon:

curl -i https://your-app.example.com/health/celery
# HTTP/1.1 200 OK
# {"status":"ok","workers":3}

Step 2: Configure a Vigilmon HTTP Monitor for Celery

Log in to vigilmon.online and go to Monitors → New Monitor
Choose HTTP / HTTPS
Set the URL to https://your-app.example.com/health/celery
Set the check interval to 2 minutes (every probe broadcasts a ping to all workers and can wait up to the 3-second inspect timeout, so probing more often adds broker traffic without much benefit)
Under Expected response, configure:
- Status code: 200
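- Response body contains: "status":"ok" (this matches the compact JSON the FastAPI endpoint emits; Django's JsonResponse writes "status": "ok" with a space after the colon, so use that form for the Django view)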
- Response time threshold: 5000ms
Under Alert channels, assign your Slack or email channel
Save the monitor
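
A "response body contains" rule is usually a literal substring match, so whitespace in the serialized JSON decides whether the probe passes. The sketch below reproduces that kind of check locally. It assumes Vigilmon matches the body as a plain substring, which this tutorial does not confirm; `body_contains_ok` and `parsed_status_ok` are hypothetical helper names used only for illustration.

```python
import json

# The string configured under "Response body contains" in Step 2.
EXPECTED_SUBSTRING = '"status":"ok"'


def body_contains_ok(body: str, needle: str = EXPECTED_SUBSTRING) -> bool:
    """Literal substring match, the way a simple HTTP probe would do it."""
    return needle in body


def parsed_status_ok(body: str) -> bool:
    """Whitespace-insensitive alternative: parse the JSON and compare the field."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


payload = {"status": "ok", "workers": 3}
compact = json.dumps(payload, separators=(",", ":"))  # no spaces, like the FastAPI response
spaced = json.dumps(payload)  # json.dumps defaults, like Django's JsonResponse

print(body_contains_ok(compact), body_contains_ok(spaced))
print(parsed_status_ok(compact), parsed_status_ok(spaced))
```

Running this against a body captured with curl is a quick way to confirm the configured substring matches before relying on the monitor.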

What This Catches

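| Failure | Supervisor / systemd | Vigilmon |
|---|---|---|
| All workers crashed | ✗ | ✓ |
| Workers running but disconnected from broker | ✗ | ✓ |
| No workers registered for a queue | ✗ | ✓ |
| Worker count dropped below expected | ✗ | ✓ |

The last two rows hold only if the endpoint is extended to check queue registration and to compare the worker count against an expected minimum. As written in Step 1, the endpoint returns 200 as long as at least one worker answers the ping.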
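
One way to cover the "worker count dropped below expected" row is to turn the ping result into a status code with an explicit minimum. This is a minimal sketch, not part of the endpoint shown in Step 1: `evaluate_ping` and `expected_workers` are hypothetical names, and the ping result is assumed to be the dict of worker name to reply that `inspect().ping()` returns (or `None` when nobody answers).

```python
from typing import Dict, Optional, Tuple


def evaluate_ping(
    ping_result: Optional[Dict[str, dict]],
    expected_workers: int = 1,
) -> Tuple[int, dict]:
    """Map an inspect().ping() result to an HTTP status code and JSON body."""
    workers = sorted(ping_result or {})

    if not workers:
        return 503, {"status": "no_workers", "workers": 0}

    if len(workers) < expected_workers:
        # Some workers are alive, but fewer than the deployment should have.
        return 503, {
            "status": "degraded",
            "workers": len(workers),
            "expected": expected_workers,
            "detail": workers,
        }

    return 200, {"status": "ok", "workers": len(workers), "detail": workers}


# Example: two workers expected, only one replied.
code, body = evaluate_ping({"celery@worker-1": {"ok": "pong"}}, expected_workers=2)
print(code, body["status"])
```

Inside the health view, the returned pair can be passed straight to `JSONResponse(status_code=code, content=body)`; the expected count would typically come from an environment variable.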

Step 3: Heartbeat Monitoring for Periodic Celery Tasks

The inspect ping probe tells you workers are alive and connected. It does not tell you whether your periodic tasks are actually executing on schedule. A celery beat process failure, a task that always raises an exception and retries forever, or a routing misconfiguration that drops tasks into the wrong queue will all pass an inspect ping while your scheduled work is silently not happening.

Vigilmon's heartbeat monitors solve this: your task pings Vigilmon after each successful execution. If Vigilmon doesn't receive a ping within the expected window, it fires an alert.

Set Up the Heartbeat Monitor

In Vigilmon, go to Monitors → New Monitor → Heartbeat
Set the name: celery-beat-daily-report
Set the expected interval to match your task schedule (e.g., 24 hours for a daily report, 1 hour for an hourly task)
Set the grace period: 15 minutes (allows for delayed broker delivery and the task's own runtime)
Save and copy the heartbeat URL, e.g. https://vigilmon.online/heartbeat/abc123xyz

Wire It Into Your Celery Tasks

Add a heartbeat ping at the end of each periodic task you want to monitor:

# tasks.py
import os
import requests
from celery import shared_task

VIGILMON_HEARTBEAT_URL = os.environ.get("VIGILMON_HEARTBEAT_URL_DAILY_REPORT")

@shared_task(name="myapp.tasks.generate_daily_report")
def generate_daily_report():
    # Your task logic
    report_data = compile_daily_stats()
    send_report_email(report_data)

    # Ping Vigilmon — only after confirmed success
    if VIGILMON_HEARTBEAT_URL:
        try:
            requests.get(VIGILMON_HEARTBEAT_URL, timeout=5)
        except requests.RequestException:
            pass  # Don't fail the task if the ping fails
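
When several periodic tasks need the same treatment, the ping-after-success pattern can be factored into a decorator. The following is a sketch under stated assumptions: `heartbeat_on_success` and `_http_ping` are hypothetical helpers, and the ping uses the standard library's `urllib` rather than `requests` so the example has no third-party dependencies.

```python
import functools
import urllib.request
from typing import Callable, Optional


def _http_ping(url: str, timeout: float = 5.0) -> None:
    """Send a GET request to the heartbeat URL and discard the response."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def heartbeat_on_success(
    url: Optional[str],
    ping: Callable[[str], None] = _http_ping,
):
    """Ping `url` only after the wrapped function returns without raising."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # If func raises, the exception propagates and no ping is sent,
            # so the heartbeat monitor eventually notices the missed run.
            result = func(*args, **kwargs)
            if url:
                try:
                    ping(url)
                except Exception:
                    pass  # A failed ping must not fail the task itself
            return result

        return wrapper

    return decorator
```

To use it, place `@heartbeat_on_success(VIGILMON_HEARTBEAT_URL)` directly beneath `@shared_task(...)` so that Celery registers the wrapped function. The `ping` parameter exists mainly so tests can substitute a fake.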

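For tasks that fan out into subtasks, ping from a final callback that runs only after the whole workflow completes:

from celery import chain, group

@shared_task
def monitor_heartbeat_callback(_result):
    """Final step of the workflow: runs only if everything upstream succeeded."""
    if VIGILMON_HEARTBEAT_URL:
        try:
            requests.get(VIGILMON_HEARTBEAT_URL, timeout=5)
        except requests.RequestException:
            pass  # Don't fail the workflow if the ping fails

# Celery upgrades a group followed by a task into a chord, so
# aggregate_results runs once every chunk has finished.
# Chords require a result backend to be configured.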
pipeline = chain(
    group(process_chunk.s(chunk) for chunk in data_chunks),
    aggregate_results.s(),
    monitor_heartbeat_callback.s(),
)
pipeline.apply_async()

Step 4: Heartbeat Monitoring for Celery Beat

celery beat is the scheduler that dispatches periodic tasks. It's a separate process from your workers, and if it dies, all scheduled tasks stop — silently. Workers remain healthy; the only symptom is tasks not running.

Add a dedicated Celery beat health task that runs frequently (every minute is fine) and pings a separate Vigilmon heartbeat monitor:

# tasks.py

VIGILMON_BEAT_HEARTBEAT = os.environ.get("VIGILMON_HEARTBEAT_URL_BEAT")

@shared_task(name="myapp.tasks.beat_healthcheck")
def beat_healthcheck():
    """Runs every minute via celery beat to confirm the scheduler is alive."""
    if VIGILMON_BEAT_HEARTBEAT:
        try:
            requests.get(VIGILMON_BEAT_HEARTBEAT, timeout=5)
        except requests.RequestException:
            pass

Add it to your Celery beat schedule:

# settings.py (Django, loaded with namespace="CELERY").
# In a standalone celery.py, assign the same dict to app.conf.beat_schedule.
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "beat-healthcheck": {
        "task": "myapp.tasks.beat_healthcheck",
        "schedule": 60.0,  # every 60 seconds
    },
    "daily-report": {
        "task": "myapp.tasks.generate_daily_report",
        "schedule": crontab(hour=6, minute=0),
    },
    # ... your other periodic tasks
}

Configure the Vigilmon heartbeat monitor for this check:

Name: celery-beat-liveness
Expected interval: 2 minutes
Grace period: 5 minutes

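If celery beat stops, no further pings arrive. Once the expected interval and the grace period have both elapsed (roughly 7 minutes with these settings, if the two are additive), Vigilmon will alert you. Because the task itself runs on a worker, this heartbeat also goes quiet when no worker is consuming the queue.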
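
The detection delay for a heartbeat monitor can be worked out with simple date arithmetic. The sketch below assumes the alert fires once the expected interval plus the grace period has passed since the last ping; how Vigilmon actually combines the two is not stated in this tutorial, and `alert_deadline` and `is_overdue` are hypothetical helpers.

```python
from datetime import datetime, timedelta


def alert_deadline(last_ping: datetime, interval: timedelta, grace: timedelta) -> datetime:
    """Earliest moment an alert could fire if no further ping arrives."""
    return last_ping + interval + grace


def is_overdue(now: datetime, last_ping: datetime, interval: timedelta, grace: timedelta) -> bool:
    """True once the interval and grace period have both elapsed."""
    return now > alert_deadline(last_ping, interval, grace)


# Settings from the celery-beat-liveness monitor above.
last_ping = datetime(2024, 1, 1, 12, 0, 0)
deadline = alert_deadline(last_ping, timedelta(minutes=2), timedelta(minutes=5))
print(deadline.time())  # 12:07:00
```

Under that assumption, shortening the expected interval to match the 60-second schedule would cut the worst-case detection time by about a minute.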

Step 5: Monitor Celery Task Backlogs

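If your workers are alive but slow, task queues can grow without bound. With a Redis broker, Celery stores each queue as a Redis list keyed by the queue name, so a depth check can read the list length (RabbitMQ needs a different approach):

# healthcheck.py (continued): reuses os, app_api and JSONResponse from Step 1
import redis as redis_client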

REDIS_URL = os.environ.get("CELERY_BROKER_URL", "redis://redis:6379/0")

@app_api.get("/health/celery/queue/{queue_name}")
def celery_queue_health(queue_name: str, threshold: int = 1000):
    try:
        r = redis_client.from_url(REDIS_URL)
        depth = r.llen(queue_name)

        if depth > threshold:
            return JSONResponse(status_code=503, content={
                "status": "backlog",
                "queue": queue_name,
                "depth": depth,
                "threshold": threshold,
            })
        return JSONResponse(status_code=200, content={
            "status": "ok",
            "queue": queue_name,
            "depth": depth,
        })
    except Exception as e:
        return JSONResponse(status_code=503, content={"status": "down", "error": str(e)})
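
A single `llen` call undercounts when task priorities are in use, because the Redis transport then spreads one logical queue over several lists. The sketch below sums those lists. It assumes kombu's Redis transport naming (queue name, then the separator `\x06\x16`, then the priority step, with the lowest step stored under the bare queue name) and its default steps of 0, 3, 6 and 9; verify both against your kombu version and any `priority_steps` transport option. `priority_keys` and `total_depth` are hypothetical helpers.

```python
from typing import Callable, List, Sequence

# Assumed defaults for kombu's Redis transport; override if your config differs.
KOMBU_PRIORITY_SEP = "\x06\x16"
DEFAULT_PRIORITY_STEPS = (0, 3, 6, 9)


def priority_keys(
    queue: str,
    steps: Sequence[int] = DEFAULT_PRIORITY_STEPS,
    sep: str = KOMBU_PRIORITY_SEP,
) -> List[str]:
    """Redis list keys that together hold one logical Celery queue."""
    return [queue if step == 0 else f"{queue}{sep}{step}" for step in steps]


def total_depth(
    queue: str,
    llen: Callable[[str], int],
    steps: Sequence[int] = DEFAULT_PRIORITY_STEPS,
) -> int:
    """Sum the length of every priority list belonging to the queue."""
    return sum(llen(key) for key in priority_keys(queue, steps))


# Example with a fake LLEN backed by a dict instead of a live Redis server.
fake_lengths = {"default": 5, "default\x06\x163": 2}
print(total_depth("default", lambda key: fake_lengths.get(key, 0)))  # 7
```

In the endpoint above, `depth = total_depth(queue_name, r.llen)` would replace the single `r.llen(queue_name)` call. Tasks that workers have already prefetched are not counted by either approach.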

Create Vigilmon monitors per critical queue:

https://your-app.example.com/health/celery/queue/default
https://your-app.example.com/health/celery/queue/high-priority
https://your-app.example.com/health/celery/queue/email

Step 6: Alert Routing for Worker and Beat Failures

In Vigilmon, configure alert priorities:

Worker liveness (no workers respond to ping) → immediate Slack + PagerDuty (P1)
Beat liveness (beat heartbeat missed) → Slack + email (P2 — scheduled tasks not running)
Per-task heartbeats (critical task missed) → Slack + PagerDuty (P1 for revenue-critical tasks, P2 for reports)
Queue depth monitors (backlog growing) → Slack + email (P2 — workers struggling)

Group all Celery monitors on a Status Page in Vigilmon labeled "Task Processing" for your team's visibility.

Summary

Celery failures are often discovered too late because the failure mode is silence, not an error. Vigilmon gives you proactive coverage:

| Monitor Type | What It Covers |
|---|---|
| HTTP monitor on /health/celery | Worker availability and broker connectivity |
| Heartbeat per periodic task | Task actually executing on schedule |
| Heartbeat for celery beat | Scheduler liveness |
| HTTP monitor on queue depth | Worker throughput and backlog |

Get started free at vigilmon.online — your first Celery monitor is running in under two minutes.