Celery is Python's most popular distributed task queue — and one of the most common sources of silent production failures. Workers crash without alerting anyone. The beat scheduler stops dispatching tasks because the broker went down and nobody noticed. A task that should run every five minutes hasn't run in six hours, and your users haven't noticed yet because the symptom is delayed email or a stale report rather than a hard error.
Vigilmon gives you heartbeat monitoring for Celery workers and scheduled tasks, and HTTP probe monitoring for a Celery health API you can build in minutes. This tutorial shows you how to wire both up.
Why Celery Monitoring Matters
Celery workers fail in ways that process monitors miss:
- Worker crash: the process exits, but if it's managed by supervisor and immediately restarted, the task mid-flight is lost without any trace
- Worker stuck: the process is running and appears healthy, but is deadlocked or waiting for a resource and processing no tasks
- Beat scheduler down: celery beat stops dispatching periodic tasks; workers remain healthy and idle; no error is raised anywhere
- Broker connection lost: workers lose connectivity to RabbitMQ or Redis; they retry silently, but tasks queue indefinitely on the broker side
Without external monitoring, you discover these failures from:
- Users reporting that scheduled reports haven't arrived
- Data pipelines that should run hourly having a 12-hour gap in their output
- Email deliveries stalled for hours after a Redis blip
Vigilmon catches these failures through two complementary mechanisms: heartbeat monitors that verify task execution, and HTTP probes that verify worker availability.
Step 1: Build a Celery Health Endpoint
Celery exposes worker status through the celery inspect ping command, which broadcasts a ping to all connected workers and returns a response from each. You can wrap this as an HTTP health endpoint:
# healthcheck.py: FastAPI Celery health endpoint
import os

from celery import Celery
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app_api = FastAPI()

# Initialize Celery app; adjust broker URL to your config
celery_app = Celery(
    "myapp",
    broker=os.environ.get("CELERY_BROKER_URL", "redis://redis:6379/0"),
)

@app_api.get("/health/celery")
def celery_health():
    try:
        # ping() returns a dict keyed by worker name, or None if nobody replies
        inspector = celery_app.control.inspect(timeout=3)
        active = inspector.ping()
        if not active:
            return JSONResponse(status_code=503, content={
                "status": "no_workers",
                "detail": "No Celery workers responded to ping",
            })
        worker_count = len(active)
        return JSONResponse(status_code=200, content={
            "status": "ok",
            "workers": worker_count,
            "detail": list(active.keys()),
        })
    except Exception as e:
        # Broker unreachable or another unexpected error
        return JSONResponse(status_code=503, content={
            "status": "down",
            "error": str(e),
        })
For Django projects using django-celery-beat, add this to a health view:
# myapp/views/health.py
from celery import current_app
from django.http import JsonResponse

def celery_health(request):
    try:
        inspector = current_app.control.inspect(timeout=3)
        ping_result = inspector.ping()
        if not ping_result:
            return JsonResponse({"status": "no_workers"}, status=503)
        return JsonResponse({
            "status": "ok",
            "workers": len(ping_result),
        })
    except Exception as e:
        return JsonResponse({"status": "down", "error": str(e)}, status=503)
Add the view to your URL configuration:
# urls.py
from django.urls import path
from myapp.views.health import celery_health

urlpatterns = [
    path("health/celery", celery_health),
]
Verify before adding to Vigilmon:
curl -i https://your-app.example.com/health/celery
# HTTP/1.1 200 OK
# {"status":"ok","workers":3}
Step 2: Configure a Vigilmon HTTP Monitor for Celery
- Log in to vigilmon.online and go to Monitors → New Monitor
- Choose HTTP / HTTPS
- Set the URL to https://your-app.example.com/health/celery
- Set the check interval to 2 minutes (the inspect ping call has a 3-second timeout; a 1-minute interval may overlap under load)
- Under Expected response, configure:
  - Status code: 200
  - Response body contains: "status":"ok"
  - Response time threshold: 5000ms
- Under Alert channels, assign your Slack or email channel
- Save the monitor
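The "Response body contains" check is presumably a plain substring match (an assumption about Vigilmon, not something stated in this tutorial), so the exact JSON spacing matters. Starlette's JSONResponse, which FastAPI uses, serializes with compact separators, while Django's JsonResponse uses Python's default json.dumps spacing. The sketch below reproduces both serializations with the standard library so you can see which needle each one satisfies. `body_contains` and `reports_ok` are hypothetical helpers, not part of any library.

```python
import json

payload = {"status": "ok", "workers": 3}

# Compact separators: what Starlette's JSONResponse (used by FastAPI) emits.
compact_body = json.dumps(payload, separators=(",", ":"))

# Default separators: what Django's JsonResponse emits unless
# json_dumps_params overrides them.
spaced_body = json.dumps(payload)


def body_contains(body: str, needle: str) -> bool:
    """Hypothetical stand-in for a substring-based body check."""
    return needle in body


def reports_ok(body: str) -> bool:
    """Spacing-independent check: parse the JSON and compare the field."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object
        return False
```

If you serve the Django view, either set the monitor's needle to `"status": "ok"` (with the space) or pass `json_dumps_params={"separators": (",", ":")}` to JsonResponse so both endpoints emit the same bytes.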
What This Catches
| Failure | Supervisor / systemd | Vigilmon |
|---|---|---|
| All workers crashed | ✗ | ✓ |
| Workers running but disconnected from broker | ✗ | ✓ |
| No workers registered for a queue | ✗ | ✓ |
| Worker count dropped below expected | ✗ | ✓ |
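Note that the endpoint from Step 1 returns 200 as soon as any single worker answers, so the "worker count dropped below expected" row only holds if the endpoint also compares the count against an expectation. A minimal sketch of that comparison follows. `evaluate_workers` and the `expected_min` threshold are hypothetical names, and the function takes the dict that `inspect().ping()` returns (or `None`) so it can be exercised without a broker. Checking per-queue registration would similarly need extra logic, for example around `inspect().active_queues()`.

```python
from typing import Optional


def evaluate_workers(ping_result: Optional[dict], expected_min: int = 1):
    """Map an inspect().ping() result to (http_status, body).

    ping_result is a dict keyed by worker name, or None when no worker
    replied. expected_min is an assumed, deployment-specific value.
    """
    workers = sorted(ping_result) if ping_result else []
    if not workers:
        return 503, {"status": "no_workers", "workers": 0}
    if len(workers) < expected_min:
        return 503, {
            "status": "degraded",
            "workers": len(workers),
            "expected_min": expected_min,
        }
    return 200, {"status": "ok", "workers": len(workers), "detail": workers}
```

Inside the FastAPI handler this could be used as `status, body = evaluate_workers(inspector.ping(), expected_min=3)` followed by `return JSONResponse(status_code=status, content=body)`.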
Step 3: Heartbeat Monitoring for Periodic Celery Tasks
The inspect ping probe tells you workers are alive and connected. It does not tell you whether your periodic tasks are actually executing on schedule. A celery beat process failure, a task that always raises an exception and retries forever, or a routing misconfiguration that drops tasks into the wrong queue will all pass an inspect ping while your scheduled work is silently not happening.
Vigilmon's heartbeat monitors solve this: your task pings Vigilmon after each successful execution. If Vigilmon doesn't receive a ping within the expected window, it fires an alert.
Set Up the Heartbeat Monitor
- In Vigilmon, go to Monitors → New Monitor → Heartbeat
- Set the name: celery-beat-daily-report
- Set the expected interval to match your task schedule (e.g., 1 hour for an hourly task)
- Set the grace period: 15 minutes (allows for delayed broker delivery)
- Save and copy the heartbeat URL, e.g. https://vigilmon.online/heartbeat/abc123xyz
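To pick an interval and grace period, it helps to reason about when an alert would fire. The sketch below models the usual heartbeat rule, assuming a monitor counts as overdue once the expected interval plus the grace period has passed since the last ping. Vigilmon's exact evaluation logic is not described in this tutorial, so treat `is_overdue` as an illustrative model rather than its implementation.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_ping, now, interval, grace):
    """Return True once interval + grace has elapsed since last_ping."""
    return now - last_ping > interval + grace


last_ping = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
interval = timedelta(hours=1)    # expected interval for an hourly task
grace = timedelta(minutes=15)    # grace period from the steps above

# 70 minutes after the last ping: late, but still inside the grace period.
late_but_ok = is_overdue(
    last_ping, last_ping + timedelta(minutes=70), interval, grace
)

# 76 minutes after the last ping: interval + grace (75 minutes) exceeded.
alerting = is_overdue(
    last_ping, last_ping + timedelta(minutes=76), interval, grace
)
```

Under this model, a shorter grace period detects a missed run sooner but is more likely to alert on a run that was merely delayed in the broker.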
Wire It Into Your Celery Tasks
Add a heartbeat ping at the end of each periodic task you want to monitor:
# tasks.py
import os

import requests
from celery import shared_task

VIGILMON_HEARTBEAT_URL = os.environ.get("VIGILMON_HEARTBEAT_URL_DAILY_REPORT")

@shared_task(name="myapp.tasks.generate_daily_report")
def generate_daily_report():
    # Your task logic
    report_data = compile_daily_stats()
    send_report_email(report_data)
    # Ping Vigilmon only after confirmed success
    if VIGILMON_HEARTBEAT_URL:
        try:
            requests.get(VIGILMON_HEARTBEAT_URL, timeout=5)
        except requests.RequestException:
            pass  # Don't fail the task if the ping fails
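If several periodic tasks need the same ping-on-success behaviour, the pattern can be factored into a decorator. This is a sketch, not part of Celery or Vigilmon: `heartbeat_on_success` and `default_ping` are hypothetical names, and it uses the standard library's `urllib` in place of `requests` only to stay dependency-free. The ping runs only when the wrapped function returns without raising, and ping failures are swallowed so they never fail the task.

```python
import functools
import urllib.request


def default_ping(url: str) -> None:
    """Send a GET to the heartbeat URL with a short timeout."""
    with urllib.request.urlopen(url, timeout=5):
        pass


def heartbeat_on_success(url, ping=default_ping):
    """Decorator: ping `url` after the wrapped function returns normally."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)  # an exception here skips the ping
            if url:
                try:
                    ping(url)
                except Exception:
                    pass  # monitoring must not break the task itself
            return result
        return wrapper
    return decorator
```

To use it with Celery, place `@heartbeat_on_success(VIGILMON_HEARTBEAT_URL)` directly beneath `@shared_task(...)` so that Celery registers the wrapped function. The `ping` parameter exists mainly so the behaviour can be tested without network access.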
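For tasks that fan out to subtasks, ping after the chain or group completes:
from celery import chain, group

@shared_task
def monitor_heartbeat_callback(_result):
    # Receives the previous task's result; swallow ping errors as above
    if VIGILMON_HEARTBEAT_URL:
        try:
            requests.get(VIGILMON_HEARTBEAT_URL, timeout=5)
        except requests.RequestException:
            pass

# A group followed by another task in a chain is upgraded to a chord,
# which requires a result backend. process_chunk, aggregate_results and
# data_chunks stand in for your own tasks and data.
pipeline = chain(
    group(process_chunk.s(chunk) for chunk in data_chunks),
    aggregate_results.s(),
    monitor_heartbeat_callback.s(),
)
pipeline.apply_async()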
Step 4: Heartbeat Monitoring for Celery Beat
celery beat is the scheduler that dispatches periodic tasks. It's a separate process from your workers, and if it dies, all scheduled tasks stop — silently. Workers remain healthy; the only symptom is tasks not running.
Add a dedicated Celery beat health task that runs frequently (every minute is fine) and pings a separate Vigilmon heartbeat monitor:
# tasks.py
VIGILMON_BEAT_HEARTBEAT = os.environ.get("VIGILMON_HEARTBEAT_URL_BEAT")

@shared_task(name="myapp.tasks.beat_healthcheck")
def beat_healthcheck():
    """Runs every minute via celery beat to confirm the scheduler is alive."""
    if VIGILMON_BEAT_HEARTBEAT:
        try:
            requests.get(VIGILMON_BEAT_HEARTBEAT, timeout=5)
        except requests.RequestException:
            pass
Add it to your Celery beat schedule:
# settings.py (Django, with the Celery config namespace set to "CELERY").
# In a plain celery.py, assign the same dict to app.conf.beat_schedule.
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "beat-healthcheck": {
        "task": "myapp.tasks.beat_healthcheck",
        "schedule": 60.0,  # every 60 seconds
    },
    "daily-report": {
        "task": "myapp.tasks.generate_daily_report",
        "schedule": crontab(hour=6, minute=0),
    },
    # ... your other periodic tasks
}
Configure the Vigilmon heartbeat monitor for this check:
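- Name: celery-beat-liveness
- Expected interval: 2 minutes
- Grace period: 5 minutes
If celery beat stops, the heartbeat stops arriving, and Vigilmon alerts you once the expected interval and grace period have passed without a ping, which is a matter of minutes with these settings. Because the health task itself runs on a worker, this heartbeat also goes quiet when no worker is consuming the queue.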
Step 5: Monitor Celery Task Backlogs
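If your workers are alive but slow, task queues can grow without bound. The check below assumes a Redis broker, where the pending messages of a queue sit in a Redis list named after the queue. Add a queue depth check: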
# Continues healthcheck.py from Step 1 (reuses os, app_api and JSONResponse)
import redis as redis_client

REDIS_URL = os.environ.get("CELERY_BROKER_URL", "redis://redis:6379/0")

@app_api.get("/health/celery/queue/{queue_name}")
def celery_queue_health(queue_name: str, threshold: int = 1000):
    try:
        r = redis_client.from_url(REDIS_URL)
        depth = r.llen(queue_name)  # number of messages waiting in the queue
        if depth > threshold:
            return JSONResponse(status_code=503, content={
                "status": "backlog",
                "queue": queue_name,
                "depth": depth,
                "threshold": threshold,
            })
        return JSONResponse(status_code=200, content={
            "status": "ok",
            "queue": queue_name,
            "depth": depth,
        })
    except Exception as e:
        return JSONResponse(status_code=503, content={"status": "down", "error": str(e)})
Create Vigilmon monitors per critical queue:
- https://your-app.example.com/health/celery/queue/default
- https://your-app.example.com/health/celery/queue/high-priority
- https://your-app.example.com/health/celery/queue/email
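Because the queue name comes straight from the URL path, the endpoint as written will report the length of any Redis list a caller names. One way to narrow that is an allow-list. The sketch below is a minimal version with hypothetical names (`ALLOWED_QUEUES`, `check_queue`), and it takes the depth lookup as a callable so it runs without Redis. In the real endpoint you would pass `r.llen`.

```python
ALLOWED_QUEUES = {"default", "high-priority", "email"}  # assumed queue names


def check_queue(queue_name, get_depth, threshold=1000):
    """Return (http_status, body) for one queue.

    get_depth is any callable mapping a queue name to its length,
    for example the llen method of a Redis client.
    """
    if queue_name not in ALLOWED_QUEUES:
        # Refuse to inspect keys that are not known Celery queues
        return 404, {"status": "unknown_queue", "queue": queue_name}
    depth = int(get_depth(queue_name))
    body = {"queue": queue_name, "depth": depth, "threshold": threshold}
    if depth > threshold:
        return 503, {"status": "backlog", **body}
    return 200, {"status": "ok", **body}
```

Returning 404 for unknown names also makes a mistyped monitor URL fail visibly instead of reporting an empty, healthy-looking queue.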
Step 6: Alert Routing for Worker and Beat Failures
In Vigilmon, configure alert priorities:
- Worker liveness (no workers respond to ping) → immediate Slack + PagerDuty (P1)
- Beat liveness (beat heartbeat missed) → Slack + email (P2 — scheduled tasks not running)
- Per-task heartbeats (critical task missed) → Slack + PagerDuty (P1 for revenue-critical tasks, P2 for reports)
- Queue depth monitors (backlog growing) → Slack + email (P2 — workers struggling)
Group all Celery monitors on a Status Page in Vigilmon labeled "Task Processing" for your team's visibility.
Summary
Celery failures are often discovered too late because the failure mode is silence, not an error. Vigilmon gives you proactive coverage:
| Monitor Type | What It Covers |
|---|---|
| HTTP monitor on /health/celery | Worker availability and broker connectivity |
| Heartbeat per periodic task | Task actually executing on schedule |
| Heartbeat for celery beat | Scheduler liveness |
| HTTP monitor on queue depth | Worker throughput and backlog |
Get started free at vigilmon.online — your first Celery monitor is running in under two minutes.