Apache Airflow is the orchestration layer for data pipelines, ML training jobs, and ETL workflows. When Airflow's scheduler stops, DAG runs silently miss their schedules — no errors thrown, no logs written, just data that stops flowing. When the webserver goes down, operators lose visibility into every pipeline running in your organization. When a critical DAG fails and no one configured an alert, your downstream dashboards fill with stale data.
Vigilmon gives you external monitoring for Airflow's webserver API, scheduler liveness, and DAG health through HTTP probe monitoring and heartbeat checks. This tutorial covers the full setup.
Why Airflow Needs External Monitoring
Airflow's internal alerting (email on DAG failure, SLA miss callbacks) is useful but incomplete. It does not tell you:
- Whether the Airflow webserver is reachable from external users and tools
- Whether the scheduler has crashed and all DAG runs have stopped silently
- Whether the Celery workers (if using CeleryExecutor) are alive and processing tasks
- Whether the metadata database is responsive (Airflow stores all state in PostgreSQL/MySQL)
- Whether a critical DAG hasn't run at all in the expected window
External monitoring through Vigilmon catches these scenarios by probing from outside Airflow's own infrastructure.
Step 1: Use the Airflow REST API Health Endpoint
Airflow ships with a built-in health endpoint since version 2.0. Enable the REST API and use the /health endpoint as your primary probe target:
# airflow.cfg
[api]
auth_backend = airflow.api.auth.backend.basic_auth
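# or, on later 2.x releases where the option is named auth_backends and the
# backend ships in the FAB provider (check the docs for your Airflow version):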
# auth_backends = airflow.providers.fab.auth_manager.api.auth.backend.basic_auth
Test the health endpoint:
curl -u admin:your_password http://localhost:8080/api/v1/health
Response when healthy:
{
  "metadatabase": {
    "status": "healthy"
  },
  "scheduler": {
    "status": "healthy",
    "latest_scheduler_heartbeat": "2026-01-20T10:00:05+00:00"
  },
  "triggerer": {
    "status": "healthy",
    "latest_triggerer_heartbeat": "2026-01-20T10:00:04+00:00"
  }
}
The scheduler.status field is particularly important: Airflow reports the scheduler as unhealthy if its heartbeat has not updated within the threshold set by scheduler_health_check_threshold in the [scheduler] section (30 seconds by default), even while the webserver remains responsive.
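To make the heartbeat check concrete, here is a minimal sketch that computes the heartbeat age directly from the health response shown above, instead of relying only on the status string. The function names and the 30-second default are assumptions for illustration; use whatever threshold your deployment is configured with.

```python
from datetime import datetime, timezone
from typing import Optional


def scheduler_heartbeat_age(health: dict, now: datetime) -> Optional[float]:
    """Return seconds since the last scheduler heartbeat, or None if it is missing."""
    raw = health.get("scheduler", {}).get("latest_scheduler_heartbeat")
    if not raw:
        return None
    # The timestamp is ISO 8601 with a UTC offset, as in the sample response.
    last_heartbeat = datetime.fromisoformat(raw)
    return (now - last_heartbeat).total_seconds()


def scheduler_is_fresh(health: dict, now: datetime, max_age_s: float = 30.0) -> bool:
    """True when a heartbeat exists and is no older than max_age_s (assumed threshold)."""
    age = scheduler_heartbeat_age(health, now)
    return age is not None and age <= max_age_s


if __name__ == "__main__":
    sample = {
        "scheduler": {
            "status": "healthy",
            "latest_scheduler_heartbeat": "2026-01-20T10:00:05+00:00",
        }
    }
    now = datetime(2026, 1, 20, 10, 0, 20, tzinfo=timezone.utc)
    print(scheduler_heartbeat_age(sample, now))  # 15.0
```

Passing `now` in explicitly keeps the function easy to test; in a live check you would pass `datetime.now(timezone.utc)`.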
Step 2: Create a Lightweight Health Proxy
The Airflow REST API requires authentication, which makes it tricky to probe directly with Vigilmon. Create a thin proxy that checks the /health endpoint and returns 200/503 without exposing credentials:
Python (FastAPI) Health Proxy
# health/airflow.py
import os

import httpx
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

AIRFLOW_URL = os.getenv("AIRFLOW_URL", "http://localhost:8080")
AIRFLOW_USER = os.getenv("AIRFLOW_HEALTH_USER", "monitor")
AIRFLOW_PASS = os.getenv("AIRFLOW_HEALTH_PASS", "")


@app.get("/health/airflow")
async def airflow_health():
    try:
        async with httpx.AsyncClient(timeout=10.0) as client:
            r = await client.get(
                f"{AIRFLOW_URL}/api/v1/health",
                auth=(AIRFLOW_USER, AIRFLOW_PASS),
            )
        # Treat auth failures and server errors as "down" rather than parsing an error page.
        r.raise_for_status()
        data = r.json()
    except Exception as e:
        # JSONResponse escapes the message, so quotes in the error cannot break the JSON body.
        return JSONResponse({"status": "down", "error": str(e)}, status_code=503)

    scheduler_status = data.get("scheduler", {}).get("status", "unknown")
    db_status = data.get("metadatabase", {}).get("status", "unknown")
    if scheduler_status != "healthy" or db_status != "healthy":
        return JSONResponse(
            {"status": "degraded", "scheduler": scheduler_status, "database": db_status},
            status_code=503,
        )
    return {"status": "ok", "scheduler": scheduler_status, "database": db_status}
Node.js Health Proxy
// health/airflow.js
// Requires Node.js 18+ for the built-in fetch and AbortSignal.timeout.
const express = require('express');

const app = express();

const AIRFLOW_URL = process.env.AIRFLOW_URL || 'http://localhost:8080';
const AIRFLOW_USER = process.env.AIRFLOW_HEALTH_USER || 'monitor';
const AIRFLOW_PASS = process.env.AIRFLOW_HEALTH_PASS || '';

app.get('/health/airflow', async (req, res) => {
  try {
    const credentials = Buffer.from(`${AIRFLOW_USER}:${AIRFLOW_PASS}`).toString('base64');
    const response = await fetch(`${AIRFLOW_URL}/api/v1/health`, {
      headers: { Authorization: `Basic ${credentials}` },
      signal: AbortSignal.timeout(10000),
    });
    const data = await response.json();

    const schedulerStatus = data?.scheduler?.status;
    const dbStatus = data?.metadatabase?.status;

    if (schedulerStatus !== 'healthy' || dbStatus !== 'healthy') {
      return res.status(503).json({
        status: 'degraded',
        scheduler: schedulerStatus,
        database: dbStatus,
      });
    }
    res.status(200).json({ status: 'ok', scheduler: schedulerStatus, database: dbStatus });
  } catch (err) {
    // Network errors, timeouts, and unparseable responses all count as "down".
    res.status(503).json({ status: 'down', error: err.message });
  }
});

app.listen(3002);
Step 3: Monitor DAG Heartbeats with Vigilmon
For individual critical DAGs, use Vigilmon's heartbeat monitoring to detect when a DAG hasn't run in its expected window.
Create a heartbeat monitor in Vigilmon, then POST to it from your DAG on each successful run:
# dags/etl_daily.py
import os
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

VIGILMON_HEARTBEAT_URL = os.environ.get(
    "VIGILMON_HEARTBEAT_URL",
    "https://heartbeats.vigilmon.online/hb/YOUR_HEARTBEAT_TOKEN",
)


def run_etl(**context):
    # ... your ETL logic here ...
    print("ETL completed successfully")


def ping_vigilmon(**context):
    # A failed ping is logged but never fails the DAG run.
    try:
        response = requests.post(VIGILMON_HEARTBEAT_URL, timeout=5)
        response.raise_for_status()  # surface 4xx/5xx replies in the task log
    except Exception as e:
        print(f"Vigilmon heartbeat ping failed (non-blocking): {e}")


with DAG(
    dag_id="etl_daily",
    schedule_interval="0 6 * * *",  # 6 AM UTC daily; use schedule= on Airflow 2.4+
    start_date=datetime(2024, 1, 1),  # any fixed past date; days_ago() is deprecated
    catchup=False,
    tags=["etl", "monitored"],
) as dag:
    run_task = PythonOperator(
        task_id="run_etl",
        python_callable=run_etl,
    )

    # Runs only after run_etl succeeds, so a failed ETL never sends a heartbeat.
    heartbeat_task = PythonOperator(
        task_id="ping_vigilmon_heartbeat",
        python_callable=ping_vigilmon,
    )

    run_task >> heartbeat_task
Configure the heartbeat monitor in Vigilmon:
- Go to Monitors → New Monitor
- Select Heartbeat / Cron Job as the monitor type
- Set the expected interval to 24h 30m (slightly longer than the DAG schedule)
- Copy the heartbeat URL into your DAG's VIGILMON_HEARTBEAT_URL environment variable
- Save
Vigilmon will alert you if the heartbeat is not received within the expected window, which means the DAG either did not run on schedule or did not finish successfully.
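The heartbeat check can be pictured as a deadline that resets on every ping. The sketch below is not Vigilmon's implementation, which this tutorial does not describe; `is_overdue` and its parameters are hypothetical names used only to show why the interval is set slightly longer than the 24-hour schedule.

```python
from datetime import datetime, timedelta, timezone

# Interval from the monitor configuration above: 24 hours plus 30 minutes of slack.
EXPECTED_INTERVAL = timedelta(hours=24, minutes=30)


def is_overdue(
    last_ping: datetime,
    now: datetime,
    expected_interval: timedelta = EXPECTED_INTERVAL,
) -> bool:
    """True when more than expected_interval has passed since the last ping."""
    return now - last_ping > expected_interval


if __name__ == "__main__":
    # Yesterday's run pinged at 06:02 UTC.
    last_ping = datetime(2026, 1, 20, 6, 2, tzinfo=timezone.utc)
    # Today's run is slower and has not pinged yet at 06:20: still inside the window.
    print(is_overdue(last_ping, datetime(2026, 1, 21, 6, 20, tzinfo=timezone.utc)))  # False
    # By 06:40 nothing has arrived: the heartbeat is late.
    print(is_overdue(last_ping, datetime(2026, 1, 21, 6, 40, tzinfo=timezone.utc)))  # True
```

With an interval of exactly 24 hours, the slower run at 06:20 would already count as a miss, so the extra 30 minutes absorbs normal variation in run time.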
Step 4: Monitor Airflow with Vigilmon HTTP Checks
Log in to Vigilmon and go to Monitors → New Monitor.
Airflow Webserver + Scheduler Monitor
- Select HTTP / HTTPS as the monitor type
- Set the URL to your health proxy: https://your-server.example.com/health/airflow
- Set the check interval to 1 minute
- Under Expected response, set:
  - Status code: 200
  - (Optional) Response body contains: "status":"ok"
- Set the timeout to 15000ms (Airflow can be slow under load)
- Save the monitor
Monitor via the Vigilmon API
# Airflow webserver + scheduler combined health
curl -X POST https://api.vigilmon.online/v1/monitors \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Airflow - Webserver & Scheduler",
    "type": "http",
    "url": "https://your-server.example.com/health/airflow",
    "interval": 60,
    "timeout": 15000,
    "expected_status": 200,
    "expected_body": "\"status\":\"ok\"",
    "regions": ["us-east", "eu-west"]
  }'
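If you manage monitors from a deployment script rather than the shell, the same request can be assembled in Python. This is a sketch that mirrors the curl call above field for field; `build_monitor_request` is a hypothetical helper name, and it only constructs the request so you can inspect it before sending.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.vigilmon.online/v1/monitors"


def build_monitor_request(api_key: str, name: str, url: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates an HTTP monitor."""
    payload = {
        "name": name,
        "type": "http",
        "url": url,
        "interval": 60,        # seconds between checks
        "timeout": 15000,      # milliseconds
        "expected_status": 200,
        "expected_body": '"status":"ok"',
        "regions": ["us-east", "eu-west"],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_monitor_request(
        "YOUR_API_KEY",
        "Airflow - Webserver & Scheduler",
        "https://your-server.example.com/health/airflow",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

To send it, pass the returned object to `urllib.request.urlopen(req)` and handle `urllib.error.HTTPError` for rejected requests.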
Step 5: Monitor Celery Workers (CeleryExecutor Deployments)
If you're running Airflow with CeleryExecutor, worker availability is critical. Celery workers can die silently while the webserver and scheduler remain healthy.
Add a Celery Flower check to your health endpoint:
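# Add to health/airflow.py, which already defines app and imports httpx.
import os

from fastapi.responses import JSONResponse

# Placeholder variable names; adjust them to your Flower deployment.
FLOWER_URL = os.getenv("FLOWER_URL", "http://localhost:5555")
FLOWER_USER = os.getenv("FLOWER_USER", "")
FLOWER_PASS = os.getenv("FLOWER_PASS", "")


@app.get("/health/airflow/workers")
async def workers_health():
    try:
        async with httpx.AsyncClient(timeout=10.0) as client:
            # Celery Flower API
            r = await client.get(
                f"{FLOWER_URL}/api/workers",
                auth=(FLOWER_USER, FLOWER_PASS),
            )
        r.raise_for_status()
        workers = r.json()
    except Exception as e:
        return JSONResponse({"status": "down", "error": str(e)}, status_code=503)

    # Assumes each worker entry carries an "active_tasks" key;
    # verify the response shape against your Flower version.
    active_workers = [w for w in workers if workers[w].get("active_tasks") is not None]
    if not active_workers:
        return JSONResponse({"status": "degraded", "workers": 0}, status_code=503)
    return {"status": "ok", "workers": len(active_workers)}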
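The decision logic in the worker endpoint is easier to verify when it is pulled out into a pure function. The sketch below assumes the same response shape as the endpoint above (a mapping of worker name to a details object with an `active_tasks` key); `evaluate_workers` and the `min_workers` parameter are hypothetical additions, not part of Flower or Vigilmon.

```python
from typing import Dict, Tuple


def evaluate_workers(workers: Dict[str, dict], min_workers: int = 1) -> Tuple[int, dict]:
    """Return (http_status, body) for a Flower-style worker mapping.

    A worker counts as live when its entry has a non-None "active_tasks" value,
    matching the filter used by the endpoint above.
    """
    live = [name for name, info in workers.items() if info.get("active_tasks") is not None]
    if len(live) < min_workers:
        return 503, {"status": "degraded", "workers": len(live)}
    return 200, {"status": "ok", "workers": len(live)}


if __name__ == "__main__":
    sample = {
        "celery@worker-1": {"active_tasks": []},  # idle but alive
        "celery@worker-2": {},                    # no task info reported
    }
    print(evaluate_workers(sample))
    print(evaluate_workers(sample, min_workers=2))
    print(evaluate_workers({}))
```

Raising `min_workers` lets the check fail when the pool shrinks below the capacity you need, not only when it is empty.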
Step 6: Configure Alerts
Go to Alert Channels → Add Channel and configure notification routing for Airflow outages. An alert for the monitor from Step 4 carries fields like these:
{
  "monitor_name": "Airflow - Webserver & Scheduler",
  "status": "down",
  "url": "https://your-server.example.com/health/airflow",
  "error": "HTTP 503 - scheduler degraded",
  "started_at": "2026-01-20T06:05:00Z",
  "duration_seconds": 300
}
For data engineering teams, configure:
- Slack webhook → #data-platform-alerts channel
- PagerDuty for scheduler outages during business hours
- Email to the data engineering team for DAG heartbeat misses
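The routing rules in this list can be written down as a small function, which helps when reviewing who gets paged for what. This is a sketch only: the channel labels, the `channels_for` name, the assumption that Slack receives every alert, and the Monday to Friday 09:00-17:00 definition of business hours are all illustrative choices, not Vigilmon settings.

```python
from datetime import datetime
from typing import List


def is_business_hours(at: datetime) -> bool:
    """Assumed definition: Monday to Friday, 09:00 up to (not including) 17:00."""
    return at.weekday() < 5 and 9 <= at.hour < 17


def channels_for(monitor_type: str, status: str, at: datetime) -> List[str]:
    """Return the channels an alert should go to, following the list above."""
    if status != "down":
        return []
    channels = ["slack:#data-platform-alerts"]  # assumed to receive every alert
    if monitor_type == "http" and is_business_hours(at):
        channels.append("pagerduty")  # scheduler outages during business hours
    if monitor_type == "heartbeat":
        channels.append("email:data-engineering")  # DAG heartbeat misses
    return channels


if __name__ == "__main__":
    # Tuesday 10:00: a scheduler outage pages the on-call engineer.
    print(channels_for("http", "down", datetime(2026, 1, 20, 10, 0)))
    # Saturday 10:00: the same outage goes to Slack only.
    print(channels_for("http", "down", datetime(2026, 1, 24, 10, 0)))
    # A missed DAG heartbeat goes to Slack and email.
    print(channels_for("heartbeat", "down", datetime(2026, 1, 20, 3, 0)))
```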
Complete Monitoring Setup Summary
| Monitor | Type | Endpoint | Purpose |
|---------|------|----------|---------|
| Airflow webserver + scheduler | HTTP | /health/airflow → 200 | Catches scheduler crashes and DB issues |
| Airflow Celery workers | HTTP | /health/airflow/workers → 200 | Catches worker process deaths |
| ETL Daily DAG | Heartbeat | Vigilmon heartbeat URL | Alerts if DAG misses its daily schedule |
| ML Training DAG | Heartbeat | Vigilmon heartbeat URL | Alerts if training job doesn't run |
This combination covers three layers: Airflow infrastructure health (webserver, scheduler, and metadata database), worker availability, and per-DAG schedule liveness.
Get Started for Free
Vigilmon monitors Apache Airflow infrastructure and individual DAG schedules from multiple global regions with 1-minute checks and instant alerts. The free tier covers up to 5 monitors with no credit card needed.
Set up your first Airflow monitor in under two minutes: vigilmon.online