Monitoring Your Flask App with Vigilmon: Health Endpoints, Workers & Alerts

Flask's simplicity is a strength, but it makes monitoring easy to skip. A broken database connection returns a 500, a crashed Celery worker silently stops processing jobs, and a failed Redis connection can take down your session store without any visible error in your logs. Vigilmon catches all of these. In this tutorial you'll add comprehensive monitoring to a Flask application — from a health endpoint to background worker heartbeats to alert routing.

What You'll Build

A Flask /health blueprint that checks DB and Redis
A Vigilmon HTTP monitor pointed at your app
A Celery beat heartbeat task that pings Vigilmon on success
An APScheduler alternative for apps without Celery
Gunicorn/uWSGI deployment health tips
Email and Slack alert channels

Prerequisites

Python 3.10+
A Flask project
A free Vigilmon account
Optionally: SQLAlchemy, Redis, and Celery

Step 1: Create the Health Blueprint

Blueprints keep your health check isolated from application logic. This is important — if your app's main blueprint has a bug, the health endpoint should still respond.

# app/health/views.py
import time
from flask import Blueprint, current_app, jsonify
from sqlalchemy import text
import redis

health_bp = Blueprint("health", __name__, url_prefix="")


def _check_database():
    """Ping the SQLAlchemy database connection."""
    try:
        db = current_app.extensions["sqlalchemy"]
        with db.engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        return "ok", None
    except Exception as exc:
        return "error", str(exc)


def _check_redis():
    """Ping the Redis connection (if configured)."""
    redis_url = current_app.config.get("REDIS_URL")
    if not redis_url:
        return "not_configured", None
    try:
        client = redis.from_url(redis_url, socket_timeout=2)
        client.ping()
        return "ok", None
    except Exception as exc:
        return "error", str(exc)


@health_bp.route("/health")
def health_check():
    checks = {}
    overall_status = "ok"

    db_status, db_err = _check_database()
    checks["database"] = db_status if not db_err else f"error: {db_err}"
    if db_status == "error":
        overall_status = "degraded"

    redis_status, redis_err = _check_redis()
    checks["redis"] = redis_status if not redis_err else f"error: {redis_err}"
    if redis_status == "error":
        overall_status = "degraded"

    http_status = 200 if overall_status == "ok" else 503
    return jsonify(
        status=overall_status,
        timestamp=time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        checks=checks,
    ), http_status

# app/__init__.py
from flask import Flask
from app.health.views import health_bp


def create_app(config=None):
    app = Flask(__name__)
    # ... other setup ...

    app.register_blueprint(health_bp)
    return app

Test it:

curl -s http://localhost:5000/health | python3 -m json.tool
# {
#   "status": "ok",
#   "timestamp": "2025-06-29T10:00:00Z",
#   "checks": {
#     "database": "ok",
#     "redis": "ok"
#   }
# }

When your database is unreachable, the endpoint returns 503 Service Unavailable. Vigilmon treats any non-2xx response as a failure.
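
One way to test the status logic without a live database or Redis is to pull it into a pure function. The sketch below is a hypothetical refactor: the name `summarize_checks` and its input shape are assumptions, not part of the blueprint above. It applies the same rule as the view: any check reporting "error" makes the overall status "degraded" and the response a 503.

```python
def summarize_checks(results):
    """Turn {name: (status, error)} into (response_body, http_status).

    Hypothetical helper mirroring the rule in health_check():
    one failing check degrades the whole endpoint.
    """
    overall = "ok"
    rendered = {}
    for name, (status, err) in results.items():
        # Same formatting as the view: append the error text when present
        rendered[name] = status if not err else f"error: {err}"
        if status == "error":
            overall = "degraded"
    http_status = 200 if overall == "ok" else 503
    return {"status": overall, "checks": rendered}, http_status
```

A "not_configured" Redis check does not degrade the result, which matches the behaviour of the view when REDIS_URL is unset.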

Step 2: Create a Vigilmon HTTP Monitor

| Field | Value |
|---|---|
| URL | https://yourapp.com/health |
| Method | GET |
| Check interval | 60 seconds |
| Expected status | 200 |
| Timeout | 10 seconds |
| Regions | 2–3 for triangulation |

Under Alert Channels, add your email. Slack comes in Step 6.

Tip: if you're behind a load balancer or reverse proxy, point the monitor at the public URL — this validates the full network path, not just the Flask process. If you also want to monitor individual instances, use internal monitors from each host.

Step 3: Celery Beat Heartbeat Task

Celery workers can silently stop consuming tasks after an unhandled exception or OOM event. The heartbeat pattern keeps Vigilmon informed: your beat task pings a heartbeat URL on success. If pings stop arriving, Vigilmon alerts you.
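
The receiving side of this pattern is a dead man's switch: the monitor raises an alert when the last ping is older than the expected interval. How Vigilmon implements this internally is not covered here, so the following is only an illustration of the idea. The function name and the `grace_seconds` parameter are assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_ping, now, interval_seconds, grace_seconds=0):
    """Return True when no ping has arrived within interval + grace.

    Illustrative only; not Vigilmon's actual logic.
    """
    allowed = timedelta(seconds=interval_seconds + grace_seconds)
    return now - last_ping > allowed


# Example: a task scheduled every 60 seconds
last = datetime(2025, 6, 29, 10, 0, 0, tzinfo=timezone.utc)
on_time = is_overdue(last, last + timedelta(seconds=59), 60)
late = is_overdue(last, last + timedelta(seconds=90), 60)
```

A grace period slightly above zero absorbs normal scheduling jitter, so a ping that arrives a few seconds late does not trigger an alert.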

First, grab your Heartbeat URL from Vigilmon (Dashboard → Heartbeat Monitors → New):

https://vigilmon.online/api/heartbeats/YOUR-UUID/ping

Create a Celery task that does real work and then pings:

# app/tasks/heartbeat.py
import logging
import os
import requests
from celery import shared_task

logger = logging.getLogger(__name__)

VIGILMON_HEARTBEAT_URL = os.environ.get("VIGILMON_HEARTBEAT_URL")


@shared_task(name="tasks.heartbeat", bind=True, max_retries=0)
def heartbeat_ping(self):
    """
    Run periodically via Celery Beat. Pings Vigilmon on success
    so a silent worker failure triggers an alert automatically.
    """
    try:
        # Your actual scheduled work goes here:
        # e.g. send_pending_notifications()
        #      refresh_exchange_rates()
        #      prune_expired_sessions()
        logger.info("[heartbeat] scheduled work complete")

        # Only ping Vigilmon when work succeeds
        if VIGILMON_HEARTBEAT_URL:
            resp = requests.get(VIGILMON_HEARTBEAT_URL, timeout=5)
            resp.raise_for_status()
            logger.info("[heartbeat] pinged Vigilmon, status %d", resp.status_code)

    except Exception as exc:
        logger.error("[heartbeat] failed: %s", exc)
        # Do NOT ping — silence is the signal to Vigilmon
        raise

# celery_config.py (or wherever you configure Celery)
beat_schedule = {
    "heartbeat-every-minute": {
        "task": "tasks.heartbeat",
        "schedule": 60.0,  # seconds — matches your Vigilmon heartbeat window
    },
}

Add the environment variable:

VIGILMON_HEARTBEAT_URL=https://vigilmon.online/api/heartbeats/YOUR-UUID/ping

Start the beat scheduler (a Celery worker must also be running to execute the task):

celery -A app.celery beat --loglevel=info

Vigilmon will alert if the ping stops arriving — whether because the beat process died, a Redis connection error stopped task dispatch, or a worker got OOM-killed.
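
If several scheduled jobs need the same "ping only on success" behaviour, it can be factored into a decorator. This is a minimal sketch under stated assumptions: `ping_on_success` is a hypothetical name, and the ping is passed in as a plain callable so the logic can be tested without network access. Unlike the task above, it logs a failed ping instead of re-raising, so a monitoring outage does not fail the job itself. That is a design choice, not a requirement.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def ping_on_success(ping):
    """Decorator factory: call ping() only after the wrapped function returns.

    `ping` is any zero-argument callable (hypothetical interface).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # If func raises, the exception propagates and ping() never runs:
            # silence is the signal to the monitor.
            result = func(*args, **kwargs)
            try:
                ping()
            except Exception as exc:
                # A failed ping is logged but does not fail the job
                logger.warning("heartbeat ping failed: %s", exc)
            return result
        return wrapper
    return decorator
```

In practice the callable could be something like `lambda: requests.get(VIGILMON_HEARTBEAT_URL, timeout=5).raise_for_status()`, reusing the environment variable from this step.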

Step 4: APScheduler Alternative (No Celery)

If your app doesn't use Celery, APScheduler is a lightweight alternative that runs inside your Flask process:

pip install apscheduler

# app/scheduler.py
import logging
import os
import requests
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.interval import IntervalTrigger

logger = logging.getLogger(__name__)

VIGILMON_HEARTBEAT_URL = os.environ.get("VIGILMON_HEARTBEAT_URL")


def heartbeat_job():
    """Runs every 60 seconds inside the Flask process."""
    try:
        # Your scheduled work here
        logger.info("[scheduler] job ran successfully")

        if VIGILMON_HEARTBEAT_URL:
            resp = requests.get(VIGILMON_HEARTBEAT_URL, timeout=5)
            resp.raise_for_status()
    except Exception as exc:
        logger.error("[scheduler] job failed: %s", exc)


def init_scheduler(app):
    """Call this from your application factory."""
    scheduler = BackgroundScheduler()
    scheduler.add_job(
        heartbeat_job,
        trigger=IntervalTrigger(seconds=60),
        id="heartbeat",
        replace_existing=True,
    )
    scheduler.start()

    # Shut down cleanly when the interpreter exits
    import atexit
    atexit.register(scheduler.shutdown)

    return scheduler

Wire it in your factory:

# app/__init__.py
from flask import Flask
from app.scheduler import init_scheduler

def create_app(config=None):
    app = Flask(__name__)
    # ... other setup ...

    with app.app_context():
        init_scheduler(app)

    return app

Gunicorn caveat: when Gunicorn spawns multiple workers (e.g. --workers 4), each worker process runs init_scheduler, so the job runs and pings once per worker in every interval. The extra pings are harmless for Vigilmon, but the duplicated scheduled work usually is not. Use a file-based lock or move to Celery Beat for multi-worker deployments:

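import fcntl

# Module-level reference: if the file object were garbage-collected,
# its descriptor would close and the lock would be released.
_scheduler_lock = None


def init_scheduler(app):
    global _scheduler_lock
    lock = open("/tmp/scheduler.lock", "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        lock.close()
        return None  # Another worker already holds the lock
    _scheduler_lock = lock
    # ... rest of scheduler init ...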

Step 5: Gunicorn/uWSGI Deployment Health Tips

Gunicorn

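# gunicorn.conf.py
bind = "0.0.0.0:5000"
workers = 4
worker_class = "gthread"
threads = 2
# Critical: workers that time out are killed and restarted.
# Keep this lower than Vigilmon's check timeout. With the 10-second
# monitor timeout from Step 2, that means lowering this value or
# raising the monitor's timeout.
timeout = 30
keepalive = 5

# How long workers get to finish in-flight requests on restart or shutdown
graceful_timeout = 25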

Run with:

gunicorn -c gunicorn.conf.py "app:create_app()"

Health check in Docker:

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:5000/health || exit 1
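
Slim Python base images often ship without curl. In that case the same check can be written with the standard library. The sketch below is an assumption-laden alternative, not part of the original setup: the filename `healthcheck.py` and the function names are hypothetical, and the default URL simply mirrors the curl command above.

```python
# healthcheck.py (hypothetical filename)
import http.client
import urllib.error
import urllib.request

DEFAULT_URL = "http://localhost:5000/health"


def is_healthy(url, timeout=5.0):
    """Return True only when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers 4xx/5xx responses (HTTPError), refused connections and timeouts
        return False


def main(argv):
    """Return a process exit code: 0 for healthy, 1 for unhealthy."""
    url = argv[1] if len(argv) > 1 else DEFAULT_URL
    return 0 if is_healthy(url) else 1
```

To use it from a Dockerfile, add the usual `if __name__ == "__main__": sys.exit(main(sys.argv))` guard (plus `import sys`) and replace the curl command with `python healthcheck.py`.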

uWSGI

; uwsgi.ini
[uwsgi]
; The factory call returns the WSGI app, so no separate "callable" option is needed
module = app:create_app()
master = true
processes = 4
; Kill workers that exceed 30s to prevent hangs
harakiri = 30
py-autoreload = 0

; Expose a stats socket for monitoring
stats = /tmp/uwsgi-stats.sock

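Both servers can also export metrics for tools like Prometheus: Gunicorn through its StatsD integration and uWSGI through the stats socket above, each typically via an exporter. For a quick sanity check, the Vigilmon HTTP monitor against /health is sufficient.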

Step 6: Alert Routing

Email Alerts

Configure under Vigilmon → Alert Channels → Email. You receive alerts when:

/health returns non-2xx
The endpoint doesn't respond within timeout
The heartbeat window expires without a ping

Slack Webhook

Create a Slack incoming webhook
Vigilmon → Alert Channels → Add Channel → Webhook → paste the Slack URL
Assign to your HTTP monitor and heartbeat monitor

Example alert:

🔴 *yourapp.com/health* is DOWN
Status: 503 | checks.database: error: FATAL: remaining connection slots reserved
Duration: 1m 12s

You can also integrate with PagerDuty for on-call escalations or send to a Microsoft Teams channel.
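
Before pasting the webhook URL into Vigilmon, it can help to confirm that the webhook itself delivers messages. Slack incoming webhooks accept a JSON body with a `text` field. The exact payload Vigilmon sends is not specified in this tutorial, so the sketch below only imitates the example alert above, and both function names are hypothetical.

```python
import json
import urllib.request


def build_slack_payload(monitor, http_status, detail):
    """Build a Slack message shaped like the example alert (illustrative only)."""
    text = f":red_circle: *{monitor}* is DOWN\nStatus: {http_status} | {detail}"
    return {"text": text}


def post_to_slack(webhook_url, payload, timeout=5):
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Calling `post_to_slack(url, build_slack_payload("yourapp.com/health", 503, "checks.database: error"))` with your real webhook URL should make a test message appear in the channel.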

Step 7: Test the Full Loop

Simulate a DB failure: set an invalid DATABASE_URL and restart Flask — /health should return 503.
Verify Vigilmon detects it: within one check interval, your monitor goes red and an alert fires.
Kill the scheduler: stop the Celery beat process (or the worker that executes the task) and wait for the Vigilmon heartbeat window to expire; you should get an alert.
Redis failure: point REDIS_URL at a non-existent host — /health should show redis: error: ... and return 503.
Recover: fix each issue and verify Vigilmon sends "back online" notifications.

Production Checklist

[ ] /health blueprint is isolated — no imports from your main app blueprint
[ ] Database and Redis checks have explicit timeouts
[ ] VIGILMON_HEARTBEAT_URL is set in the environment, not hard-coded
[ ] Celery Beat (or APScheduler) only pings on successful task completion
[ ] Gunicorn timeout is lower than Vigilmon's check timeout
[ ] Alert channels tested end-to-end
[ ] Maintenance windows configured for deployments
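
For the timeout item, driver-level settings (such as the `socket_timeout` passed to Redis in Step 1) are the first line of defence. As a fallback, a check can also be given a hard time budget from the outside. The following is a minimal sketch, assuming each check is a zero-argument callable returning a `(status, error)` tuple. `run_with_timeout` and `slow_check` are hypothetical names.

```python
import concurrent.futures
import time


def run_with_timeout(check, seconds):
    """Run check() in a worker thread and give up after `seconds`."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check)
    try:
        return future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        return "error", f"timed out after {seconds}s"
    except Exception as exc:
        return "error", str(exc)
    finally:
        # Do not block on a stuck check; its thread keeps running in the background
        executor.shutdown(wait=False)


def slow_check():
    """Stand-in for a database ping that hangs."""
    time.sleep(0.3)
    return "ok", None
```

Python cannot forcibly stop the worker thread, so a hung call still occupies a thread until it returns. This wrapper bounds the response time of /health, not the resource usage of the stuck check.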

Wrapping Up

You now have layered monitoring for your Flask app:

Uptime: Vigilmon polls /health every 60 seconds, checking DB and Redis
Heartbeat: Celery Beat (or APScheduler) confirms background workers are alive
Alert routing: email and Slack, with optional PagerDuty escalation

The combination means you'll know about a production failure before your users do — and the specific check that failed (database, redis, or heartbeat silence) tells you exactly where to look.

Have questions or a specific Flask setup? Drop it in the comments!