How to Monitor Apache Airflow with Vigilmon: DAGs, Scheduler & API Health

Apache Airflow is the orchestration layer for data pipelines, ML training jobs, and ETL workflows. When Airflow's scheduler stops, DAG runs silently miss their schedules — no errors thrown, no logs written, just data that stops flowing. When the webserver goes down, operators lose visibility into every pipeline running in your organization. When a critical DAG fails and no one configured an alert, your downstream dashboards fill with stale data.

Vigilmon gives you external monitoring for Airflow's webserver API, scheduler liveness, and DAG health through HTTP probe monitoring and heartbeat checks. This tutorial covers the full setup.

Why Airflow Needs External Monitoring

Airflow's internal alerting (email on DAG failure, SLA miss callbacks) is useful but incomplete. It does not tell you:

Whether the Airflow webserver is reachable from external users and tools
Whether the scheduler has crashed and all DAG runs have stopped silently
Whether the Celery workers (if using CeleryExecutor) are alive and processing tasks
Whether the metadata database is responsive (Airflow stores all state in PostgreSQL/MySQL)
Whether a critical DAG hasn't run at all in the expected window

External monitoring through Vigilmon catches these scenarios by probing from outside Airflow's own infrastructure.

Step 1: Use the Airflow REST API Health Endpoint

Airflow 2.x exposes a health endpoint through its stable REST API at /api/v1/health. Configure an API auth backend and use this endpoint as your primary probe target:

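# airflow.cfg
[api]
# Airflow 2.0 to 2.2 use the singular option name:
auth_backend = airflow.api.auth.backend.basic_auth
# Airflow 2.3+ renamed the option to auth_backends (a comma-separated list):
# auth_backends = airflow.api.auth.backend.basic_auth
# Newer 2.x releases that ship the FAB provider use its backend path instead:
# auth_backends = airflow.providers.fab.auth_manager.api.auth.backend.basic_auth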

Test the health endpoint:

curl -u admin:your_password http://localhost:8080/api/v1/health

Response when healthy:

{
  "metadatabase": {
    "status": "healthy"
  },
  "scheduler": {
    "status": "healthy",
    "latest_scheduler_heartbeat": "2026-01-20T10:00:05+00:00"
  },
  "triggerer": {
    "status": "healthy",
    "latest_triggerer_heartbeat": "2026-01-20T10:00:04+00:00"
  }
}

The scheduler.status field is particularly important: Airflow reports the scheduler as unhealthy if its heartbeat has not updated within the configured grace period, even while the webserver remains responsive.
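
To make the grace-period idea concrete, here is a minimal sketch that evaluates a health payload like the one above. The function name `scheduler_is_healthy` and the 30-second `max_age` are illustrative assumptions; on the Airflow side the threshold is controlled by the `[scheduler] scheduler_health_check_threshold` setting, so check your own airflow.cfg for the value in effect.

```python
from datetime import datetime, timedelta, timezone


def scheduler_is_healthy(health, now, max_age=timedelta(seconds=30)):
    """Return True only if the scheduler reports healthy AND its heartbeat is recent.

    `health` is the parsed JSON body of the health endpoint.
    `now` must be a timezone-aware datetime.
    """
    scheduler = health.get("scheduler") or {}
    if scheduler.get("status") != "healthy":
        return False

    heartbeat = scheduler.get("latest_scheduler_heartbeat")
    if not heartbeat:
        return False

    # fromisoformat understands offsets such as +00:00
    last_beat = datetime.fromisoformat(heartbeat)
    return now - last_beat <= max_age


sample = {
    "scheduler": {
        "status": "healthy",
        "latest_scheduler_heartbeat": "2026-01-20T10:00:05+00:00",
    }
}
now = datetime(2026, 1, 20, 10, 0, 20, tzinfo=timezone.utc)
print(scheduler_is_healthy(sample, now))
```

Checking the heartbeat age yourself is redundant when Airflow already computes the status, but it shows what the status field encodes and lets a proxy apply a stricter threshold if you want one.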

Step 2: Create a Lightweight Health Proxy

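Probing the Airflow REST API directly from Vigilmon is awkward for two reasons: the API is normally locked behind authentication, and the health endpoint reports component failures in the JSON body rather than in the HTTP status code, so a status-code check alone can miss an unhealthy scheduler. Create a thin proxy that checks /api/v1/health and returns 200 or 503 without exposing credentials: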

Python (FastAPI) Health Proxy

# health/airflow.py
import os

import httpx
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

AIRFLOW_URL = os.getenv("AIRFLOW_URL", "http://localhost:8080")
AIRFLOW_USER = os.getenv("AIRFLOW_HEALTH_USER", "monitor")
AIRFLOW_PASS = os.getenv("AIRFLOW_HEALTH_PASS", "")

@app.get("/health/airflow")
async def airflow_health():
    try:
        async with httpx.AsyncClient(timeout=10.0) as client:
            r = await client.get(
                f"{AIRFLOW_URL}/api/v1/health",
                auth=(AIRFLOW_USER, AIRFLOW_PASS),
            )
        # Treat bad credentials and server errors as "down" instead of parsing them
        r.raise_for_status()
        data = r.json()

        scheduler_status = data.get("scheduler", {}).get("status", "unknown")
        db_status = data.get("metadatabase", {}).get("status", "unknown")

        if scheduler_status != "healthy" or db_status != "healthy":
            # JSONResponse serializes the dict, so values are always escaped correctly
            return JSONResponse(
                status_code=503,
                content={
                    "status": "degraded",
                    "scheduler": scheduler_status,
                    "database": db_status,
                },
            )

        return {
            "status": "ok",
            "scheduler": scheduler_status,
            "database": db_status,
        }

    except Exception as e:
        # Timeouts, connection errors, and invalid JSON all land here
        return JSONResponse(
            status_code=503,
            content={"status": "down", "error": str(e)},
        )

Node.js Health Proxy

// health/airflow.js
// Uses the global fetch and AbortSignal.timeout, so it needs Node.js 18 or later
const express = require('express');
const app = express();

const AIRFLOW_URL = process.env.AIRFLOW_URL || 'http://localhost:8080';
const AIRFLOW_USER = process.env.AIRFLOW_HEALTH_USER || 'monitor';
const AIRFLOW_PASS = process.env.AIRFLOW_HEALTH_PASS || '';

app.get('/health/airflow', async (req, res) => {
  try {
    const credentials = Buffer.from(`${AIRFLOW_USER}:${AIRFLOW_PASS}`).toString('base64');
    const response = await fetch(`${AIRFLOW_URL}/api/v1/health`, {
      headers: { Authorization: `Basic ${credentials}` },
      signal: AbortSignal.timeout(10000),
    });

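    // Non-2xx answers (bad credentials, webserver errors) count as "down"
    if (!response.ok) throw new Error(`Airflow responded with HTTP ${response.status}`);

    const data = await response.json();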
    const schedulerStatus = data?.scheduler?.status;
    const dbStatus = data?.metadatabase?.status;

    if (schedulerStatus !== 'healthy' || dbStatus !== 'healthy') {
      return res.status(503).json({
        status: 'degraded',
        scheduler: schedulerStatus,
        database: dbStatus,
      });
    }

    res.status(200).json({ status: 'ok', scheduler: schedulerStatus, database: dbStatus });
  } catch (err) {
    res.status(503).json({ status: 'down', error: err.message });
  }
});

app.listen(3002);

Step 3: Monitor DAG Heartbeats with Vigilmon

For individual critical DAGs, use Vigilmon's heartbeat monitoring to detect when a DAG hasn't run in its expected window.

Create a heartbeat monitor in Vigilmon, then POST to it from your DAG on each successful run:

# dags/etl_daily.py
import os
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

VIGILMON_HEARTBEAT_URL = os.environ.get(
    "VIGILMON_HEARTBEAT_URL",
    "https://heartbeats.vigilmon.online/hb/YOUR_HEARTBEAT_TOKEN"
)

def run_etl(**context):
    # ... your ETL logic here ...
    print("ETL completed successfully")

def ping_vigilmon(**context):
    # Runs only after run_etl succeeds, so a missing ping means a missed or failed run
    try:
        response = requests.post(VIGILMON_HEARTBEAT_URL, timeout=5)
        response.raise_for_status()  # surface 4xx/5xx replies in the task log
    except Exception as e:
        # Swallow the error so a monitoring hiccup never fails the pipeline itself
        print(f"Vigilmon heartbeat ping failed (non-blocking): {e}")

with DAG(
    dag_id="etl_daily",
    schedule_interval="0 6 * * *",  # 6 AM UTC daily; Airflow 2.4+ also accepts schedule=
    start_date=datetime(2026, 1, 1),  # placeholder; use a fixed date rather than days_ago()
    catchup=False,
    tags=["etl", "monitored"],
) as dag:

    run_task = PythonOperator(
        task_id="run_etl",
        python_callable=run_etl,
    )

    heartbeat_task = PythonOperator(
        task_id="ping_vigilmon_heartbeat",
        python_callable=ping_vigilmon,
    )

    # The heartbeat is sent only when the ETL task has finished successfully
    run_task >> heartbeat_task
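
A single failed ping would trigger a false "DAG missed its schedule" alert, so it may be worth retrying before giving up. The following is a sketch of one way to do that using only the standard library. The helper name `ping_heartbeat`, the retry count, and the linear backoff are assumptions made for illustration, not part of Vigilmon's or Airflow's API.

```python
import time
import urllib.error
import urllib.request


def ping_heartbeat(url, attempts=3, backoff_seconds=2.0,
                   opener=urllib.request.urlopen, sleep=time.sleep):
    """POST an empty body to the heartbeat URL, retrying on failure.

    Returns True as soon as one attempt gets a 2xx reply, False otherwise.
    `opener` and `sleep` are injectable so the helper can be tested offline.
    """
    for attempt in range(1, attempts + 1):
        try:
            request = urllib.request.Request(url, data=b"", method="POST")
            with opener(request, timeout=5) as response:
                if 200 <= response.status < 300:
                    return True
        except (urllib.error.URLError, OSError) as exc:
            # urlopen raises HTTPError (a URLError subclass) for 4xx/5xx replies
            print(f"heartbeat attempt {attempt} failed: {exc}")
        if attempt < attempts:
            # Linear backoff: 2s, 4s, ... with the defaults
            sleep(backoff_seconds * attempt)
    return False
```

Inside the DAG, `ping_vigilmon` could call `ping_heartbeat(VIGILMON_HEARTBEAT_URL)` and log the boolean result, which keeps the task non-blocking while tolerating brief network errors.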

Configure the heartbeat monitor in Vigilmon:

1. Go to Monitors → New Monitor
2. Select Heartbeat / Cron Job as the monitor type
3. Set the expected interval to 24h 30m (the 24-hour DAG schedule plus a 30-minute allowance for run time)
4. Copy the heartbeat URL and set it as the VIGILMON_HEARTBEAT_URL environment variable on the Airflow workers that run the DAG
5. Save

Vigilmon will alert you if the heartbeat is not received within the expected window — meaning the DAG didn't run on schedule.
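
The arithmetic behind the 24h 30m window can be sketched as follows. This assumes the monitor measures the gap from the last received ping, which is how heartbeat monitors commonly work but is an assumption about Vigilmon here; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_deadline(last_ping,
                       schedule_interval=timedelta(hours=24),
                       grace=timedelta(minutes=30)):
    """Latest moment the next ping may arrive before an alert fires."""
    return last_ping + schedule_interval + grace


def is_overdue(last_ping, now,
               schedule_interval=timedelta(hours=24),
               grace=timedelta(minutes=30)):
    """True once `now` has passed the deadline derived from the last ping."""
    return now > heartbeat_deadline(last_ping, schedule_interval, grace)


# Day 1: the DAG starts at 06:00 UTC and the ETL takes 10 minutes
last_ping = datetime(2026, 1, 20, 6, 10, tzinfo=timezone.utc)
print(heartbeat_deadline(last_ping).isoformat())
print(is_overdue(last_ping, datetime(2026, 1, 21, 6, 45, tzinfo=timezone.utc)))
```

Under this model the grace period has to absorb day-to-day variation in run time: if the ETL takes 10 minutes one day and 50 minutes the next, the gap between pings is 24h 40m and a 30-minute grace would raise a false alert. Size the grace to your longest expected run.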

Step 4: Monitor Airflow with Vigilmon HTTP Checks

Airflow Webserver + Scheduler Monitor

1. Go to Monitors → New Monitor
2. Select HTTP / HTTPS as the monitor type
3. Set the URL to your health proxy: https://your-server.example.com/health/airflow
4. Set the check interval to 1 minute
5. Under Expected response, set:
   - Status code: 200
   - (Optional) Response body contains: "status":"ok"
6. Set the timeout to 15000 ms (Airflow can be slow under load)
7. Save the monitor
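
The optional body check is worth a closer look, because a literal substring match is sensitive to whitespace. The sketch below assumes the check is a plain "contains" comparison (an assumption about Vigilmon's behaviour) and uses a hypothetical `check_passes` helper to show the difference between compact and spaced JSON.

```python
import json


def check_passes(status_code, body,
                 expected_status=200, expected_body='"status":"ok"'):
    """Mimic an HTTP check: exact status code plus a literal substring match."""
    return status_code == expected_status and expected_body in body


# Compact JSON, with no space after the colon
compact = json.dumps({"status": "ok"}, separators=(",", ":"))
# Python's default json.dumps output, with a space after the colon
spaced = json.dumps({"status": "ok"})

print(compact, check_passes(200, compact))
print(spaced, check_passes(200, spaced))
```

FastAPI's default JSON response and Express's `res.json` both emit compact JSON, so the proxies in Step 2 should match the pattern as written. If you put a different framework or a pretty-printing middleware in front of the proxy, confirm the exact bytes with curl before relying on the body check.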

Monitor via the Vigilmon API

# Airflow webserver + scheduler combined health
curl -X POST https://api.vigilmon.online/v1/monitors \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Airflow - Webserver & Scheduler",
    "type": "http",
    "url": "https://your-server.example.com/health/airflow",
    "interval": 60,
    "timeout": 15000,
    "expected_status": 200,
    "expected_body": "\"status\":\"ok\"",
    "regions": ["us-east", "eu-west"]
  }'
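
If you manage several monitors, the same request can be scripted. This is a rough Python equivalent of the curl call above using only the standard library. The endpoint and field names are copied from that example; the helper names `build_monitor_payload` and `build_request` are hypothetical.

```python
import json
import urllib.request

API_URL = "https://api.vigilmon.online/v1/monitors"


def build_monitor_payload(name, url, interval=60, timeout_ms=15000,
                          regions=("us-east", "eu-west")):
    """Assemble the JSON body used in the curl example."""
    return {
        "name": name,
        "type": "http",
        "url": url,
        "interval": interval,
        "timeout": timeout_ms,
        "expected_status": 200,
        "expected_body": '"status":"ok"',
        "regions": list(regions),
    }


def build_request(api_key, payload):
    """Wrap the payload in an authenticated POST request (not yet sent)."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = build_monitor_payload(
    "Airflow - Webserver & Scheduler",
    "https://your-server.example.com/health/airflow",
)
request = build_request("YOUR_API_KEY", payload)
# To send it: urllib.request.urlopen(request, timeout=15)
```

Separating payload construction from sending makes it easy to loop over a list of endpoints, for example the combined health URL and the workers URL from Step 5.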

Step 5: Monitor Celery Workers (CeleryExecutor Deployments)

If you're running Airflow with CeleryExecutor, worker availability is critical. Celery workers can die silently while the webserver and scheduler remain healthy.

Add a Celery Flower check to the FastAPI health proxy from Step 2:

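# health/airflow.py (continued; reuses os, httpx, and app from Step 2)
from fastapi.responses import JSONResponse

# Assumed environment variable names; adjust them to your deployment
FLOWER_URL = os.getenv("FLOWER_URL", "http://localhost:5555")
FLOWER_USER = os.getenv("FLOWER_USER", "")
FLOWER_PASS = os.getenv("FLOWER_PASS", "")

@app.get("/health/airflow/workers")
async def workers_health():
    try:
        async with httpx.AsyncClient(timeout=10.0) as client:
            # Celery Flower API: a JSON object keyed by worker name
            r = await client.get(
                f"{FLOWER_URL}/api/workers",
                auth=(FLOWER_USER, FLOWER_PASS),
            )
        r.raise_for_status()
        workers = r.json()
        # Assumes each worker entry carries an "active_tasks" field;
        # verify this against the output of your Flower version
        active_workers = [
            name for name, info in workers.items()
            if info.get("active_tasks") is not None
        ]

        if not active_workers:
            return JSONResponse(
                status_code=503,
                content={"status": "degraded", "workers": 0},
            )

        return {"status": "ok", "workers": len(active_workers)}

    except Exception as e:
        # Flower unreachable, bad credentials, or an unexpected response shape
        return JSONResponse(
            status_code=503,
            content={"status": "down", "error": str(e)},
        )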
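
Alerting only when zero workers remain can hide a pool that has quietly shrunk. As a sketch, the decision could be factored into a small pure function with a minimum-worker threshold. The name `evaluate_workers` and the `min_workers` parameter are hypothetical, and the function simply counts every worker name in the dictionary, the same dictionary shape the endpoint above iterates over.

```python
def evaluate_workers(workers, min_workers=1):
    """Map a worker dictionary to an (http_status, body) pair.

    `workers` is a dict keyed by worker name; every key counts as one worker.
    """
    count = len(workers)
    if count < min_workers:
        return 503, {"status": "degraded", "workers": count, "required": min_workers}
    return 200, {"status": "ok", "workers": count}


print(evaluate_workers({"celery@worker-1": {}, "celery@worker-2": {}}, min_workers=3))
```

Keeping this logic separate from the HTTP call makes it easy to unit test and to tune the threshold per environment.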

Step 6: Configure Alerts

Go to Alert Channels → Add Channel and configure notification routing for Airflow outages. When the combined monitor goes down, the alert carries a payload along these lines:

{
  "monitor_name": "Airflow - Webserver & Scheduler",
  "status": "down",
  "url": "https://your-server.example.com/health/airflow",
  "error": "HTTP 503 - scheduler degraded",
  "started_at": "2026-01-20T06:05:00Z",
  "duration_seconds": 300
}

For data engineering teams, configure:

Slack webhook → #data-platform-alerts channel
PagerDuty for scheduler outages during business hours
Email to the data engineering team for DAG heartbeat misses
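
If you relay alerts through your own webhook receiver rather than a built-in integration, the payload shown above can be turned into a Slack message with a few lines of code. This sketch assumes the field names from that example payload and uses a hypothetical `format_slack_message` helper; Slack incoming webhooks accept a JSON body with a `text` field.

```python
def format_slack_message(alert):
    """Build a Slack incoming-webhook body from an alert payload."""
    minutes = alert.get("duration_seconds", 0) // 60
    text = (
        f"{alert['monitor_name']} is {alert['status'].upper()}: "
        f"{alert['error']} "
        f"(since {alert['started_at']}, {minutes} min so far) "
        f"{alert['url']}"
    )
    return {"text": text}


alert = {
    "monitor_name": "Airflow - Webserver & Scheduler",
    "status": "down",
    "url": "https://your-server.example.com/health/airflow",
    "error": "HTTP 503 - scheduler degraded",
    "started_at": "2026-01-20T06:05:00Z",
    "duration_seconds": 300,
}
print(format_slack_message(alert)["text"])
```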

Complete Monitoring Setup Summary

| Monitor | Type | Endpoint | Purpose |
|---------|------|----------|---------|
| Airflow webserver + scheduler | HTTP | /health/airflow → 200 | Catches scheduler crashes and DB issues |
| Airflow Celery workers | HTTP | /health/airflow/workers → 200 | Catches worker process deaths |
| ETL Daily DAG | Heartbeat | Vigilmon heartbeat URL | Alerts if DAG misses its daily schedule |
| ML Training DAG | Heartbeat | Vigilmon heartbeat URL | Alerts if training job doesn't run |

This combination gives you full coverage: Airflow infrastructure health, worker availability, and per-DAG schedule liveness.

Get Started for Free

Vigilmon monitors Apache Airflow infrastructure and individual DAG schedules from multiple global regions with 1-minute checks and instant alerts. The free tier covers up to 5 monitors with no credit card needed.

Set up your first Airflow monitor in under two minutes: vigilmon.online