Uptime Monitoring for FastAPI Applications (Free, Multi-Region)

FastAPI is beloved for its async speed and automatic OpenAPI docs. But none of that matters if your app goes down at 3 AM and you don't find out until a customer emails you at 9.

By the end of this guide you'll have a structured health endpoint, background task heartbeats for your async workers, external uptime monitoring from multiple regions, and a public status page — all on the free tier.

The two failure modes FastAPI developers miss

Endpoint outages: your API starts returning 500s or times out. Maybe a database connection pool is exhausted, a dependency is unreachable, or a bad deploy broke a route. Your process is still running, but users are getting errors.

Silent background task failures: an asyncio task or APScheduler job raises an exception. APScheduler logs the traceback and carries on with its schedule; an un-awaited asyncio task simply holds the exception until it is garbage-collected. Either way the app keeps serving requests. Your nightly cleanup has been failing for three days, and nothing in your health check reflects it.

Both are invisible without external monitoring. Let's fix that.
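To make the second failure mode concrete, here is a minimal sketch using only the standard library. The job name `flaky_cleanup` and the callback `log_task_failure` are made up for illustration. It shows that a failing task does not interrupt the rest of the program, and that a done-callback is one way to at least get the error into your logs.

```python
import asyncio
import logging

logger = logging.getLogger("tasks")


async def flaky_cleanup() -> None:
    """Hypothetical background job that always fails."""
    raise RuntimeError("cleanup failed")


def log_task_failure(task: asyncio.Task) -> None:
    """Done-callback: log an exception that would otherwise sit unread on the task."""
    if not task.cancelled() and task.exception() is not None:
        logger.error("background task %s failed: %r", task.get_name(), task.exception())


async def main() -> list[str]:
    events: list[str] = []
    task = asyncio.create_task(flaky_cleanup(), name="cleanup")
    task.add_done_callback(log_task_failure)
    task.add_done_callback(lambda t: events.append(f"{t.get_name()} done"))

    # The task fails in the background; nothing raises here.
    await asyncio.sleep(0.01)
    events.append("app still serving")
    return events


# Usage: asyncio.run(main())
```

Logging the failure helps, but it still depends on someone reading the logs. That gap is what the heartbeat in Step 3 closes.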

Step 1: Add a structured health check endpoint

FastAPI makes it trivial to add a health endpoint. The key is returning enough structured data to distinguish which dependency is unhealthy when things go wrong.

# app/routers/health.py
from fastapi import APIRouter, status
from fastapi.responses import JSONResponse
import asyncpg
import os

router = APIRouter()


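async def check_database() -> dict:
    """Check PostgreSQL connectivity with a short-lived connection."""
    try:
        conn = await asyncpg.connect(os.environ["DATABASE_URL"], timeout=3)
        try:
            await conn.execute("SELECT 1")
        finally:
            # Close even if the query fails, so failed checks don't leak connections
            await conn.close()
        return {"status": "ok"}
    except Exception as e:
        # str(e) can include hostnames; trim it if /health is publicly reachable
        return {"status": "error", "detail": str(e)}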


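async def check_redis() -> dict:
    """Check Redis connectivity (if used)."""
    try:
        # aioredis is no longer maintained; its code now ships in redis-py as
        # redis.asyncio (pip install redis; aclose() needs version 5.0.1+)
        from redis import asyncio as aioredis

        client = aioredis.from_url(
            os.environ.get("REDIS_URL", "redis://localhost"),
            socket_connect_timeout=3,
        )
        try:
            await client.ping()
        finally:
            # Release the connection even if the ping fails
            await client.aclose()
        return {"status": "ok"}
    except Exception as e:
        return {"status": "error", "detail": str(e)}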


@router.get("/health")
async def health_check():
    db = await check_database()
    checks = {"database": db}

    # Only include Redis if configured
    if os.environ.get("REDIS_URL"):
        checks["redis"] = await check_redis()

    all_ok = all(c["status"] == "ok" for c in checks.values())
    http_status = status.HTTP_200_OK if all_ok else status.HTTP_503_SERVICE_UNAVAILABLE

    return JSONResponse(
        status_code=http_status,
        content={
            "status": "ok" if all_ok else "degraded",
            "checks": checks,
        }
    )

# main.py
from fastapi import FastAPI
from app.routers import health

app = FastAPI()
app.include_router(health.router)

Test it locally:

curl http://localhost:8000/health
# {"status":"ok","checks":{"database":{"status":"ok"}}}

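When the database is unreachable, the response becomes HTTP 503 and detail carries the driver's error message (the exact wording depends on the driver and the failure):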

{
  "status": "degraded",
  "checks": {
    "database": {
      "status": "error",
      "detail": "could not connect to server: Connection refused"
    }
  }
}

This 503 response is exactly the signal your external monitor needs to open an incident.
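One thing to watch: the handler above awaits its checks one after another, so a single slow dependency delays the whole response and the monitor may record a timeout instead of a clean 503. A minimal sketch of running checks concurrently with a per-check timeout follows. The names `run_check`, `run_checks`, `fast_ok`, and `hangs` are hypothetical and not part of the code above; the payload shape mirrors /health.

```python
import asyncio
from typing import Awaitable, Callable

Check = Callable[[], Awaitable[dict]]


async def run_check(check: Check, timeout: float) -> dict:
    """Run one check, turning a timeout or exception into an error result."""
    try:
        return await asyncio.wait_for(check(), timeout=timeout)
    except asyncio.TimeoutError:
        return {"status": "error", "detail": f"timed out after {timeout}s"}
    except Exception as exc:
        return {"status": "error", "detail": type(exc).__name__}


async def run_checks(checks: dict[str, Check], timeout: float = 3.0) -> dict:
    """Run all checks concurrently and build the same payload shape as /health."""
    results = await asyncio.gather(*(run_check(c, timeout) for c in checks.values()))
    by_name = dict(zip(checks.keys(), results))
    all_ok = all(r["status"] == "ok" for r in by_name.values())
    return {"status": "ok" if all_ok else "degraded", "checks": by_name}


# Stand-in checks for demonstration only
async def fast_ok() -> dict:
    return {"status": "ok"}


async def hangs() -> dict:
    await asyncio.sleep(10)
    return {"status": "ok"}


# Usage: asyncio.run(run_checks({"database": fast_ok, "redis": hangs}, timeout=0.05))
```

In the real endpoint you would pass `check_database` and `check_redis` instead of the stand-ins, and keep the per-check timeout comfortably below whatever timeout your monitor applies.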

Step 2: Set up external HTTP monitoring with Vigilmon

With /health live, point Vigilmon at it:

Sign up at vigilmon.online — free tier, no credit card
Click New Monitor → HTTP
Enter https://yourdomain.com/health
Set the check interval (5 minutes on free tier)
Save

Vigilmon probes your endpoint from multiple geographic regions. If any location gets a non-2xx response or a timeout, it opens an incident and sends you an alert immediately.
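If you want to see what a single external check observes before wiring up the monitor, the sketch below sends one GET request and classifies the result by the rule described above (2xx is up; anything else, or no response at all, is down). This is an illustration using the standard library, not Vigilmon's implementation. The function names `classify` and `probe` are made up.

```python
import urllib.error
import urllib.request
from typing import Optional


def classify(status_code: Optional[int]) -> str:
    """Map a probe result to 'up' or 'down'. None means no HTTP response arrived."""
    if status_code is not None and 200 <= status_code < 300:
        return "up"
    return "down"


def probe(url: str, timeout: float = 10.0) -> dict:
    """Send one GET request and report what an external checker would see."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            code = resp.status
    except urllib.error.HTTPError as exc:
        code = exc.code  # 4xx/5xx responses surface as HTTPError
    except OSError:
        code = None  # timeout, DNS failure, connection refused (URLError is an OSError)
    return {"url": url, "status_code": code, "state": classify(code)}


# Usage: probe("https://yourdomain.com/health")
```

Running this from a machine outside your own network gives a rough preview of what the monitor will report.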

For FastAPI APIs, add monitors for each critical surface:

| Endpoint | What it catches |
|---|---|
| /health | Database down, Redis down |
| /docs | App startup failure (FastAPI serves the docs page whenever the app is up, unless you disabled it with docs_url=None) |
| /api/v1/items | Business logic routes |

Step 3: Heartbeat monitoring for background tasks

HTTP uptime checks confirm your app is responding, but they won't tell you if a scheduled background job silently stopped running.

The heartbeat pattern: a background job pings a unique URL after each successful run. If Vigilmon stops receiving that ping within the expected interval, it fires an alert — even if your app is technically "up."
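The monitor side of this pattern is sometimes called a dead man's switch. A minimal sketch of the logic, assuming a single job and an in-memory timestamp, looks like the following. The class `HeartbeatMonitor` is hypothetical and only illustrates the idea; it says nothing about how Vigilmon implements it.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class HeartbeatMonitor:
    """Tracks the last ping for one job and reports whether the next one is overdue."""

    def __init__(self, name: str, expected_interval: timedelta) -> None:
        self.name = name
        self.expected_interval = expected_interval
        self.last_ping: Optional[datetime] = None

    def ping(self, now: datetime) -> None:
        """Record a successful run."""
        self.last_ping = now

    def is_overdue(self, now: datetime) -> bool:
        # Before the first ping there is nothing to measure against.
        if self.last_ping is None:
            return False
        return now - self.last_ping > self.expected_interval


# Usage: a nightly job with a 25-hour window
monitor = HeartbeatMonitor("nightly_cleanup", timedelta(hours=25))
monitor.ping(datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc))
```

The important property is that the alert is driven by silence. The job never has to report its own failure, which is exactly the case where a crashed or unscheduled job cannot.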

Using APScheduler

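APScheduler is a common choice for scheduling background jobs in FastAPI apps. The code below uses the 3.x API (AsyncIOScheduler), so pin the version below 4 when you install it:

pip install "apscheduler<4" httpx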

Add the scheduler to your app's lifespan:

# main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from app.routers import health
from app.tasks import scheduled_tasks

scheduler = AsyncIOScheduler()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Register jobs on startup
    scheduler.add_job(
        scheduled_tasks.nightly_cleanup,
        trigger="cron",
        hour=2,
        minute=0,
        id="nightly_cleanup",
        replace_existing=True,
    )
    scheduler.start()
    yield
    scheduler.shutdown()


app = FastAPI(lifespan=lifespan)
app.include_router(health.router)

Define your job with a heartbeat ping:

# app/tasks/scheduled_tasks.py
import os
import httpx
import logging

logger = logging.getLogger(__name__)


async def nightly_cleanup():
    """Run nightly data cleanup and ping heartbeat on success."""
    try:
        # Your actual cleanup logic
        await run_cleanup()
    except Exception:
        # Log but don't ping: the absence of a ping triggers the alert
        logger.exception("nightly_cleanup: failed, skipping heartbeat")
        raise

    # Only ping on success: a missed ping IS the alert
    heartbeat_url = os.environ.get("CLEANUP_HEARTBEAT_URL")
    if not heartbeat_url:
        return
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get(heartbeat_url, timeout=10)
            response.raise_for_status()  # a 4xx/5xx means the ping was not recorded
        logger.info("nightly_cleanup: heartbeat pinged")
    except httpx.HTTPError:
        # The cleanup itself succeeded; report the ping failure separately
        logger.exception("nightly_cleanup: cleanup succeeded but heartbeat ping failed")


async def run_cleanup():
    """Actual cleanup logic goes here."""
    pass

In Vigilmon:

Click New Monitor → Heartbeat
Set the expected interval (e.g. 25 hours for a nightly job — give yourself a buffer)
Copy the unique ping URL
Add it to your environment:

# .env
CLEANUP_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/your-unique-token

Now if your job fails midway, throws an exception, or simply isn't scheduled anymore, Vigilmon will alert you within one missed interval.

One heartbeat per critical job

The goal is one heartbeat monitor per job that must run for your system to be healthy:

CLEANUP_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/token-1
SYNC_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/token-2
REPORT_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/token-3

Each has its own expected interval and alert policy. You get a targeted alert when a specific job stops, not a vague "something is wrong."
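With several jobs, repeating the ping logic in each one gets tedious. One possible approach is a small decorator that pings only after the wrapped job returns without raising. This is a sketch: `with_heartbeat`, `fake_ping`, `sync_job`, and `broken_job` are hypothetical names, and `fake_ping` stands in for a real HTTP call to the job's heartbeat URL.

```python
import asyncio
import functools
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)


def with_heartbeat(ping: Callable[[], Awaitable[None]]):
    """Wrap an async job so `ping` runs only after the job returns without raising."""

    def decorator(job: Callable[..., Awaitable[None]]):
        @functools.wraps(job)
        async def wrapper(*args, **kwargs):
            await job(*args, **kwargs)  # an exception here propagates; no ping is sent
            try:
                await ping()
            except Exception:
                # The job succeeded; a failed ping is logged, not treated as a job failure.
                logger.exception("%s: heartbeat ping failed", job.__name__)

        return wrapper

    return decorator


# Stand-in ping for demonstration; a real one would GET the job's heartbeat URL.
pings: list[str] = []


async def fake_ping() -> None:
    pings.append("pinged")


@with_heartbeat(fake_ping)
async def sync_job() -> None:
    """Hypothetical job that succeeds."""


@with_heartbeat(fake_ping)
async def broken_job() -> None:
    """Hypothetical job that fails before any ping is sent."""
    raise RuntimeError("boom")
```

Each job would get its own ping callable bound to its own URL (for example one reading SYNC_HEARTBEAT_URL), so the one-monitor-per-job mapping stays intact.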

Step 4: Webhook alerts to Slack or Discord

Configure alert delivery in Vigilmon:

For Slack:

Create an incoming webhook in your Slack workspace
In Vigilmon go to Notifications → New Channel → Slack
Paste your webhook URL
Enable it on your monitors

For Discord:

In your Discord server, go to Integrations → Webhooks → New Webhook
Copy the webhook URL
In Vigilmon, go to Notifications → New Channel → Discord
Paste and enable

You'll receive:

🔴 DOWN: yourdomain.com/health
Status: 503 Service Unavailable
Regions: US-East, EU-West
Started: 3 minutes ago

And recovery:

✅ RECOVERED: yourdomain.com/health is back UP
Total downtime: 8 minutes

For heartbeat monitors:

🔴 MISSED: nightly_cleanup heartbeat
Expected every: 25 hours
Last ping: 27 hours ago
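Before pasting a webhook URL into Vigilmon, it can help to confirm the URL accepts messages at all. The sketch below posts a test message using only the standard library. Slack incoming webhooks take a JSON body with a `text` field and Discord webhooks take one with a `content` field; the function names and the User-Agent string are made up for this example.

```python
import json
import urllib.request


def build_payload(service: str, message: str) -> dict:
    """Return the minimal JSON body each service's incoming webhook accepts."""
    if service == "slack":
        return {"text": message}
    if service == "discord":
        return {"content": message}
    raise ValueError(f"unsupported service: {service}")


def send_test_message(webhook_url: str, service: str, message: str) -> int:
    """POST a test message and return the HTTP status code (2xx means accepted)."""
    body = json.dumps(build_payload(service, message)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # An explicit User-Agent; some services reject urllib's default one
            "User-Agent": "webhook-test/1.0",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.status


# Usage: send_test_message(os.environ["SLACK_WEBHOOK_URL"], "slack", "test from FastAPI")
```

Treat webhook URLs as secrets: anyone who has one can post to your channel, so keep them in environment variables rather than in source control.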

Step 5: Expose a public status page

When you have an incident, your users will Google "is X down" or tweet at you. A public status page short-circuits both.

In Vigilmon:

Go to Status Pages → New Status Page
Name it (e.g. "Acme API Status")
Select the monitors to include
Save and copy the public URL

Link to it from your docs, README, or error responses:

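# main.py
from fastapi.responses import JSONResponse

# Runs when a route raises HTTPException(status_code=503). Responses a route
# returns directly (like the JSONResponse from /health) bypass this handler.
@app.exception_handler(503)
async def service_unavailable_handler(request, exc):
    return JSONResponse(
        status_code=503,
        content={
            "error": "Service temporarily unavailable",
            "status_page": "https://status.yourdomain.com"
        }
    )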

Users who hit the status page during an incident know the team is on it — they're less likely to churn or file duplicate support tickets.

Bonus: Separate liveness from readiness

For container-based deployments (Kubernetes, Docker Swarm), you often want two separate endpoints:

@router.get("/health/live")
async def liveness():
    """Is the process running? (For k8s liveness probe)"""
    return {"status": "ok"}


@router.get("/health/ready")
async def readiness():
    """Are dependencies ready? (For k8s readiness probe)"""
    db = await check_database()
    all_ok = db["status"] == "ok"
    return JSONResponse(
        status_code=200 if all_ok else 503,
        content={"status": "ok" if all_ok else "not_ready", "database": db}
    )

Point Vigilmon at /health/ready: it reflects whether your app can actually serve traffic correctly
Point the Kubernetes liveness probe at /health/live: Kubernetes won't restart a healthy pod just because Postgres is temporarily slow

What you've built

| What | How |
|---|---|
| Structured health endpoint | /health with per-dependency status + HTTP 503 on failure |
| External uptime monitoring | Vigilmon HTTP monitor (multi-region) |
| Background job monitoring | APScheduler + heartbeat ping on success |
| Instant alerts | Slack/Discord webhook notifications |
| Public status page | Vigilmon status page |
| Kubernetes-ready | Separate liveness and readiness endpoints |

The full setup takes under 30 minutes and runs free on Vigilmon's free tier. You'll catch the next silent background job failure long before it cascades into a user-facing incident.

Get started free at vigilmon.online — your first monitor is running in under a minute.