Uptime Monitoring for FastAPI Applications (Free, Multi-Region)
FastAPI is beloved for its async speed and automatic OpenAPI docs. But none of that matters if your app goes down at 3 AM and you don't find out until a customer emails you at 9.
By the end of this guide you'll have a structured health endpoint, background task heartbeats for your async workers, external uptime monitoring from multiple regions, and a public status page — all on the free tier.
The two failure modes FastAPI developers miss
Endpoint outages — your API starts returning 500s or times out. Maybe a database connection pool is exhausted, a dependency is unreachable, or a bad deploy broke a route. Your process is still running, but users are getting errors.
Silent background task failures — an asyncio task or APScheduler job throws an exception. Python logs a traceback, swallows the error, and keeps going. Your nightly cleanup stopped running three days ago. Nothing in your health check reflects this.
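The second failure mode is easy to reproduce. The sketch below is only an illustration: `flaky_job`, `record_failure`, and the `failures` list are hypothetical names, not part of FastAPI or APScheduler. It shows a bare asyncio task raising while the rest of the program carries on, and a done-callback as one way to surface the exception.

```python
import asyncio
import logging
from typing import List

logger = logging.getLogger("background")
failures: List[str] = []  # hypothetical in-memory record, for illustration only


async def flaky_job() -> None:
    """Stand-in for a background job that blows up."""
    raise RuntimeError("cleanup failed")


def record_failure(task: asyncio.Task) -> None:
    """Done-callback: surface an exception that would otherwise go unnoticed."""
    if task.cancelled():
        return
    exc = task.exception()
    if exc is not None:
        failures.append(f"{task.get_name()}: {exc}")
        logger.error("background task %s failed", task.get_name(), exc_info=exc)


async def main() -> str:
    task = asyncio.create_task(flaky_job(), name="nightly_cleanup")
    task.add_done_callback(record_failure)
    await asyncio.sleep(0.01)  # the rest of the app keeps running
    return "app still serving requests"


result = asyncio.run(main())
```

Without the callback, nothing in the running app reacts to the failure. With it, the error is at least logged promptly, though a log line still depends on someone reading it.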
Both are invisible without external monitoring. Let's fix that.
Step 1: Add a structured health check endpoint
FastAPI makes it trivial to add a health endpoint. The key is returning enough structured data to distinguish which dependency is unhealthy when things go wrong.
# app/routers/health.py
from fastapi import APIRouter, status
from fastapi.responses import JSONResponse
import asyncpg
import os

router = APIRouter()


async def check_database() -> dict:
    """Check PostgreSQL connectivity."""
    try:
        conn = await asyncpg.connect(os.environ["DATABASE_URL"], timeout=3)
        try:
            await conn.execute("SELECT 1")
        finally:
            # Always release the connection, even if the query fails
            await conn.close()
        return {"status": "ok"}
    except Exception as e:
        return {"status": "error", "detail": str(e)}
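async def check_redis() -> dict:
    """Check Redis connectivity (if used)."""
    try:
        # The standalone aioredis package is unmaintained; redis-py (>= 4.2)
        # ships the same asyncio client as redis.asyncio.
        from redis import asyncio as aioredis

        url = os.environ.get("REDIS_URL", "redis://localhost")
        # The context manager closes the connection when the block exits
        async with aioredis.from_url(url, socket_connect_timeout=3) as redis:
            await redis.ping()
        return {"status": "ok"}
    except Exception as e:
        return {"status": "error", "detail": str(e)}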
@router.get("/health")
async def health_check():
    db = await check_database()
    checks = {"database": db}
    # Only include Redis if configured
    if os.environ.get("REDIS_URL"):
        checks["redis"] = await check_redis()
    all_ok = all(c["status"] == "ok" for c in checks.values())
    http_status = status.HTTP_200_OK if all_ok else status.HTTP_503_SERVICE_UNAVAILABLE
    return JSONResponse(
        status_code=http_status,
        content={
            "status": "ok" if all_ok else "degraded",
            "checks": checks,
        },
    )
Register the router in your main app:
# main.py
from fastapi import FastAPI
from app.routers import health
app = FastAPI()
app.include_router(health.router)
Test it locally:
curl http://localhost:8000/health
# {"status":"ok","checks":{"database":{"status":"ok"}}}
When the database is unreachable, the response becomes HTTP 503:
{
  "status": "degraded",
  "checks": {
    "database": {
      "status": "error",
      "detail": "could not connect to server: Connection refused"
    }
  }
}
This 503 response is exactly the signal your external monitor needs to open an incident.
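As the number of dependencies grows, awaiting each check in turn makes the endpoint as slow as the sum of its checks. One possible variation, sketched here with only the standard library, runs the checks concurrently and bounds each with a timeout. `run_checks`, `fake_db`, and `slow_redis` are hypothetical names standing in for the real check functions; this is an illustration of the aggregation logic, not a drop-in replacement.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Check = Callable[[], Awaitable[dict]]


async def run_checks(checks: Dict[str, Check], timeout: float = 3.0) -> dict:
    """Run all checks concurrently; a slow or crashing check becomes an error entry."""

    async def guarded(check: Check) -> dict:
        try:
            return await asyncio.wait_for(check(), timeout=timeout)
        except asyncio.TimeoutError:
            return {"status": "error", "detail": f"timed out after {timeout}s"}
        except Exception as exc:
            return {"status": "error", "detail": str(exc)}

    names = list(checks)
    results = await asyncio.gather(*(guarded(checks[n]) for n in names))
    report = dict(zip(names, results))
    all_ok = all(r["status"] == "ok" for r in report.values())
    return {
        "http_status": 200 if all_ok else 503,
        "body": {"status": "ok" if all_ok else "degraded", "checks": report},
    }


# Hypothetical stand-ins for check_database() and check_redis()
async def fake_db() -> dict:
    return {"status": "ok"}


async def slow_redis() -> dict:
    await asyncio.sleep(1)  # simulates a dependency that hangs
    return {"status": "ok"}


outcome = asyncio.run(
    run_checks({"database": fake_db, "redis": slow_redis}, timeout=0.05)
)
```

Here the slow Redis stand-in is cut off at the timeout, so the report marks it as an error and the overall status maps to 503 while the database entry stays "ok".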
Step 2: Set up external HTTP monitoring with Vigilmon
With /health live, point Vigilmon at it:
- Sign up at vigilmon.online — free tier, no credit card
- Click New Monitor → HTTP
- Enter https://yourdomain.com/health
- Set the check interval (5 minutes on free tier)
- Save
Vigilmon probes your endpoint from multiple geographic regions. If any location gets a non-2xx response or a timeout, it opens an incident and sends you an alert immediately.
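The rule described above (any region seeing a non-2xx response or a timeout counts as a failure) can be written down in a few lines. This is only a sketch of the stated rule, not Vigilmon's actual implementation; `ProbeResult`, `probe_failed`, `failing_regions`, and the "AP-South" region name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    region: str
    status_code: Optional[int]  # None means the request timed out


def probe_failed(result: ProbeResult) -> bool:
    """A probe fails on a timeout or on any status outside 200-299."""
    return result.status_code is None or not (200 <= result.status_code < 300)


def failing_regions(results: List[ProbeResult]) -> List[str]:
    """Regions whose probe failed; a non-empty list would open an incident."""
    return [r.region for r in results if probe_failed(r)]


results = [
    ProbeResult("US-East", 503),
    ProbeResult("EU-West", 200),
    ProbeResult("AP-South", None),
]
down_in = failing_regions(results)
```

Read literally, the rule also treats a 3xx redirect as a failure, so it is safest to point the monitor at the final URL rather than one that redirects.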
For FastAPI APIs, add monitors for each critical surface:
| Endpoint | What it catches |
|---|---|
| /health | Database down, Redis down |
| /docs | App startup failure (FastAPI serves /docs whenever the app is up, unless you disabled it with docs_url=None) |
| /api/v1/items | Business logic routes |
Step 3: Heartbeat monitoring for background tasks
HTTP uptime checks confirm your app is responding, but they won't tell you if a scheduled background job silently stopped running.
The heartbeat pattern: a background job pings a unique URL after each successful run. If Vigilmon stops receiving that ping within the expected interval, it fires an alert — even if your app is technically "up."
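On the monitoring side, the check reduces to comparing the time since the last ping with the expected interval. The function below is a hypothetical illustration of that comparison, not Vigilmon's code.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missed(
    last_ping: datetime, expected_interval: timedelta, now: datetime
) -> bool:
    """True when more time has passed since the last ping than the interval allows."""
    return now - last_ping > expected_interval


now = datetime(2024, 1, 2, 5, 0, tzinfo=timezone.utc)
interval = timedelta(hours=25)

late = heartbeat_missed(now - timedelta(hours=27), interval, now)
on_time = heartbeat_missed(now - timedelta(hours=20), interval, now)
```

With a 25-hour interval, a last ping 27 hours ago counts as missed, while one 20 hours ago does not. Note that the job itself never reports a failure; it only stops reporting success.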
Using APScheduler
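APScheduler is a common scheduler for FastAPI background jobs. The code below uses the APScheduler 3.x API (the 4.x line replaces AsyncIOScheduler with a different interface), so pin the major version when installing:
pip install "apscheduler<4" httpx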
Add the scheduler to your app's lifespan:
# main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from app.routers import health
from app.tasks import scheduled_tasks

scheduler = AsyncIOScheduler()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Register jobs on startup
    scheduler.add_job(
        scheduled_tasks.nightly_cleanup,
        trigger="cron",
        hour=2,
        minute=0,
        id="nightly_cleanup",
        replace_existing=True,
    )
    scheduler.start()
    yield
    scheduler.shutdown()


app = FastAPI(lifespan=lifespan)
app.include_router(health.router)
Define your job with a heartbeat ping:
# app/tasks/scheduled_tasks.py
import os
import httpx
import logging

logger = logging.getLogger(__name__)


async def nightly_cleanup():
    """Run nightly data cleanup and ping heartbeat on success."""
    try:
        # Your actual cleanup logic
        await run_cleanup()
        # Only ping on success: a missed ping IS the alert
        heartbeat_url = os.environ.get("CLEANUP_HEARTBEAT_URL")
        if heartbeat_url:
            async with httpx.AsyncClient() as client:
                await client.get(heartbeat_url, timeout=10)
            logger.info("nightly_cleanup: heartbeat pinged")
    except Exception:
        # Log but don't ping: the absence of a ping triggers the alert
        logger.exception("nightly_cleanup: failed, skipping heartbeat")
        raise


async def run_cleanup():
    """Actual cleanup logic goes here."""
    pass
In Vigilmon:
- Click New Monitor → Heartbeat
- Set the expected interval (e.g. 25 hours for a nightly job — give yourself a buffer)
- Copy the unique ping URL
- Add it to your environment:
# .env
CLEANUP_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/your-unique-token
Now if your job fails midway, throws an exception, or simply isn't scheduled anymore, Vigilmon will alert you within one missed interval.
One heartbeat per critical job
The goal is one heartbeat monitor per job that must run for your system to be healthy:
CLEANUP_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/token-1
SYNC_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/token-2
REPORT_HEARTBEAT_URL=https://vigilmon.online/api/heartbeat/token-3
Each has its own expected interval and alert policy. You get a targeted alert when a specific job stops, not a vague "something is wrong."
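With several jobs, repeating the ping logic in each one gets tedious. One way to factor it out is a small decorator keyed on the environment variable name. This is a sketch under assumptions: `with_heartbeat`, `default_ping`, `fake_ping`, and the two demo jobs are hypothetical, and the default ping uses the standard library rather than httpx so the example has no third-party dependencies.

```python
import asyncio
import functools
import logging
import os
import urllib.request
from typing import Awaitable, Callable, List

logger = logging.getLogger(__name__)
Ping = Callable[[str], Awaitable[None]]


async def default_ping(url: str) -> None:
    """GET the heartbeat URL in a worker thread so the event loop is not blocked."""

    def _get() -> None:
        with urllib.request.urlopen(url, timeout=10):
            pass

    await asyncio.to_thread(_get)


def with_heartbeat(env_var: str, ping: Ping = default_ping):
    """Ping the URL stored in `env_var` only after the wrapped job succeeds."""

    def decorator(job):
        @functools.wraps(job)
        async def wrapper(*args, **kwargs):
            result = await job(*args, **kwargs)  # an exception here skips the ping
            url = os.environ.get(env_var)
            if url:
                try:
                    await ping(url)
                except Exception:
                    # A failed ping is logged but does not fail the job itself
                    logger.exception("%s: heartbeat ping failed", job.__name__)
            return result

        return wrapper

    return decorator


# Demo with a fake ping so nothing touches the network
pinged: List[str] = []


async def fake_ping(url: str) -> None:
    pinged.append(url)


@with_heartbeat("SYNC_HEARTBEAT_URL", ping=fake_ping)
async def sync_job() -> str:
    return "synced"


@with_heartbeat("REPORT_HEARTBEAT_URL", ping=fake_ping)
async def report_job() -> None:
    raise RuntimeError("report failed")
```

One design choice to be aware of: here a failed ping is logged and swallowed, so a monitoring outage cannot make a healthy job look failed to the scheduler. The missed heartbeat still raises an alert on the monitoring side.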
Step 4: Webhook alerts to Slack or Discord
Configure alert delivery in Vigilmon:
For Slack:
- Create an incoming webhook in your Slack workspace
- In Vigilmon go to Notifications → New Channel → Slack
- Paste your webhook URL
- Enable it on your monitors
For Discord:
- In your Discord server, go to Integrations → Webhooks → New Webhook
- Copy the webhook URL
- In Vigilmon, go to Notifications → New Channel → Discord
- Paste and enable
You'll receive:
🔴 DOWN: yourdomain.com/health
Status: 503 Service Unavailable
Regions: US-East, EU-West
Started: 3 minutes ago
And recovery:
✅ RECOVERED: yourdomain.com/health is back UP
Total downtime: 8 minutes
For heartbeat monitors:
🔴 MISSED: nightly_cleanup heartbeat
Expected every: 25 hours
Last ping: 27 hours ago
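Before relying on a channel, it can help to confirm that the webhook URL itself accepts messages. The sketch below posts a test message directly, using the JSON field each service reads (`text` for Slack incoming webhooks, `content` for Discord webhooks). `build_payload` and `send_test_alert` are hypothetical helper names, and this bypasses Vigilmon entirely; it only checks the webhook.

```python
import json
import urllib.request


def build_payload(service: str, message: str) -> dict:
    """Slack incoming webhooks read `text`; Discord webhooks read `content`."""
    if service == "slack":
        return {"text": message}
    if service == "discord":
        return {"content": message}
    raise ValueError(f"unsupported service: {service}")


def send_test_alert(webhook_url: str, service: str, message: str) -> int:
    """POST a test message to the webhook and return the HTTP status code."""
    body = json.dumps(build_payload(service, message)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Some endpoints reject urllib's default User-Agent, so set one
            "User-Agent": "webhook-smoke-test/1.0",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_test_alert(url, "slack", "test alert")` with a real webhook URL should make the message appear in the target channel; any 2xx status means the webhook accepted it.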
Step 5: Expose a public status page
When you have an incident, your users will Google "is X down" or tweet at you. A public status page short-circuits both.
In Vigilmon:
- Go to Status Pages → New Status Page
- Name it (e.g. "Acme API Status")
- Select the monitors to include
- Save and copy the public URL
Link to it from your docs, README, or error responses:
from fastapi.responses import JSONResponse


# Runs when a route raises HTTPException(status_code=503)
@app.exception_handler(503)
async def service_unavailable_handler(request, exc):
    return JSONResponse(
        status_code=503,
        content={
            "error": "Service temporarily unavailable",
            "status_page": "https://status.yourdomain.com",
        },
    )
Users who hit the status page during an incident know the team is on it — they're less likely to churn or file duplicate support tickets.
Bonus: Add a readiness vs liveness distinction
For container-based deployments (Kubernetes, Docker Swarm), you often want two separate endpoints:
@router.get("/health/live")
async def liveness():
    """Is the process running? (For k8s liveness probe)"""
    return {"status": "ok"}


@router.get("/health/ready")
async def readiness():
    """Are dependencies ready? (For k8s readiness probe)"""
    db = await check_database()
    all_ok = db["status"] == "ok"
    return JSONResponse(
        status_code=200 if all_ok else 503,
        content={"status": "ok" if all_ok else "not_ready", "database": db},
    )
- Vigilmon monitors /health/ready: this reflects whether your app is actually serving traffic correctly
- Kubernetes uses /health/live for the liveness probe: it won't restart a healthy pod just because Postgres is temporarily slow
What you've built
| What | How |
|---|---|
| Structured health endpoint | /health with per-dependency status + HTTP 503 on failure |
| External uptime monitoring | Vigilmon HTTP monitor (multi-region) |
| Background job monitoring | APScheduler + heartbeat ping on success |
| Instant alerts | Slack/Discord webhook notifications |
| Public status page | Vigilmon status page |
| Kubernetes-ready | Separate liveness and readiness endpoints |
The full setup takes under 30 minutes and fits within Vigilmon's free tier. You'll catch the next silent background job failure long before it cascades into a user-facing incident.
Get started free at vigilmon.online — your first monitor is running in under a minute.