Monitoring Fly.io Applications with Vigilmon: Global Regions, Alerts & Health Checks

Fly.io's global anycast network runs your app close to users in dozens of regions simultaneously. That distributed deployment model is great for latency — but it also means failures can be regional, partial, and surprisingly hard to detect from within the platform. A region-specific VM crash, a machine that passes health checks locally but serves 503s externally, an SSL certificate that expired in one region but not another: these are real failure modes that platform-level dashboards often miss.

Vigilmon polls your Fly.io app's health endpoint from external global locations and alerts you the moment any check fails. This tutorial shows you how to add health monitoring to a Fly.io application in under 20 minutes.

What You'll Build

A /health endpoint in your Fly.io app that checks real dependencies
Vigilmon HTTP monitors targeting your Fly.io app URL
Multi-region uptime coverage with regional failure alerts
Deploy-aware maintenance windows to suppress false positives during fly deploy

Prerequisites

A Fly.io account and the flyctl CLI installed
An app already deployed on Fly.io (any language/framework)
A free Vigilmon account

Step 1: Add a Health Endpoint to Your App

Vigilmon needs an HTTP endpoint to poll. Add a /health route that checks the dependencies your app actually relies on — a route that always returns 200 OK is useless for monitoring.

Node.js / Express example

// health.js
// Assumes an existing Express `app` and a `db` client that exposes query()
app.get('/health', async (req, res) => {
  const checks = {}
  let ok = true

  // Database connectivity
  try {
    await db.query('SELECT 1')
    checks.database = 'ok'
  } catch (err) {
    checks.database = `error: ${err.message}`
    ok = false
  }

  res.status(ok ? 200 : 503).json({
    status: ok ? 'ok' : 'degraded',
    checks,
    region: process.env.FLY_REGION ?? 'unknown',  // Fly injects this automatically
    timestamp: new Date().toISOString(),
  })
})

Python / FastAPI example

# health.py
import os
from datetime import datetime, timezone

from fastapi import APIRouter
from fastapi.responses import JSONResponse

router = APIRouter()

@router.get("/health")
async def health_check():
    checks = {}
    ok = True

    try:
        # Your DB ping here, e.g. run SELECT 1 against your connection pool
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"
        ok = False

    return JSONResponse(
        status_code=200 if ok else 503,
        content={
            "status": "ok" if ok else "degraded",
            "checks": checks,
            "region": os.getenv("FLY_REGION", "unknown"),  # Fly injects this automatically
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    )

The FLY_REGION environment variable is injected by Fly.io automatically — including it in the response makes it easy to correlate Vigilmon alerts with specific regions.
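
One thing the examples above don't cover is a dependency that hangs instead of failing fast: if the DB ping blocks for 30 seconds, the health check times out rather than returning a clean 503. Here is a minimal sketch of one way to bound each check, assuming your dependency pings are async callables. The names `run_check`, `build_health_payload`, and `probes` are made up for illustration and are not part of FastAPI or Fly.io.

```python
import asyncio
import os
from datetime import datetime, timezone


async def run_check(name, probe, timeout_s=2.0):
    """Run one dependency probe; a timeout or exception counts as a failure."""
    try:
        await asyncio.wait_for(probe(), timeout=timeout_s)
        return name, "ok"
    except asyncio.TimeoutError:
        return name, f"error: timed out after {timeout_s}s"
    except Exception as exc:
        return name, f"error: {exc}"


async def build_health_payload(probes, timeout_s=2.0):
    """Run all probes concurrently and return (status_code, body).

    `probes` maps a check name to a zero-argument async callable
    (hypothetical shape, chosen for this sketch).
    """
    results = await asyncio.gather(
        *(run_check(name, probe, timeout_s) for name, probe in probes.items())
    )
    checks = dict(results)
    ok = all(value == "ok" for value in checks.values())
    body = {
        "status": "ok" if ok else "degraded",
        "checks": checks,
        "region": os.getenv("FLY_REGION", "unknown"),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return (200 if ok else 503), body
```

Inside the FastAPI handler you would await `build_health_payload({"database": ping_db})` and pass the result to `JSONResponse`, where `ping_db` stands in for your own ping coroutine. Keep the per-check timeout well under the timeouts you configure in Fly and Vigilmon, so the endpoint always answers before the caller gives up.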

Step 2: Configure Fly.io Health Checks

Fly.io has its own internal health check system that controls whether traffic is routed to a VM. These are separate from Vigilmon (which monitors from the outside), but both are valuable. Configure Fly's checks in fly.toml:

# fly.toml
app = "my-app"
primary_region = "iad"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = "stop"
  auto_start_machines = true
  min_machines_running = 1

  [[http_service.checks]]
    interval = "15s"
    timeout = "5s"
    grace_period = "10s"
    method = "GET"
    path = "/health"
    protocol = "http"
    tls_skip_verify = false

Fly's internal checks prevent traffic routing to unhealthy VMs. Vigilmon's external checks tell you when Fly's routing itself is broken, when a region is unreachable from the public internet, or when HTTPS/TLS terminates incorrectly at Fly's edge.

Step 3: Deploy and Note Your Health URL

Deploy your app:

fly deploy

Your health endpoint URL follows the pattern:

https://<app-name>.fly.dev/health

If you've configured a custom domain:

https://api.yourdomain.com/health

Verify it's reachable from outside Fly's network:

curl -s https://my-app.fly.dev/health | jq
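
If you want more than a one-off curl, the sketch below polls the endpoint a few times and tallies which Fly region answered, using the `region` and `status` fields from the Step 1 response. It uses only the Python standard library. The function names are illustrative, and the output only reflects the regions your own network path reaches.

```python
import json
import urllib.error
import urllib.request
from collections import Counter


def parse_body(raw):
    """Parse a response body as a JSON object; return {} if it isn't one."""
    try:
        data = json.loads(raw)
    except ValueError:
        return {}
    return data if isinstance(data, dict) else {}


def fetch_health(url, timeout_s=10.0):
    """GET the health URL once and return (status_code, parsed_body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status, parse_body(resp.read())
    except urllib.error.HTTPError as err:
        # A 503 from /health still carries the JSON body, so keep it.
        return err.code, parse_body(err.read())


def tally_regions(responses):
    """Count responses per (region, status) pair."""
    return Counter(
        (body.get("region", "unknown"), body.get("status", "unknown"))
        for _code, body in responses
    )
```

For example, `tally_regions(fetch_health("https://my-app.fly.dev/health") for _ in range(5))` returns a counter keyed by `(region, status)`. From a single location you will usually see one region, which is the gap multi-region external monitoring fills.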

Step 4: Create Vigilmon HTTP Monitors

Primary health monitor

Click Add Monitor → HTTP:

| Field | Value |
|---|---|
| URL | https://my-app.fly.dev/health |
| Method | GET |
| Check interval | 60 seconds |
| Expected status | 200 |
| Timeout | 10 seconds |
| Regions | Enable 3+ Vigilmon regions for triangulation |

Under Advanced, add a JSON body assertion:

Path: status
Expected value: ok

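This catches a 200 response whose body still reports "status": "degraded", or is not the expected JSON at all. The /health handler from Step 1 returns 503 when a dependency fails, so the assertion is a second line of defense: it protects you if a later code change, a framework default, or an intermediary turns a degraded state into a 200.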
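
To make the combined status and body assertion concrete, here is a small sketch of the logic such a check applies. This is an illustration of the idea, not Vigilmon's actual implementation; `evaluate_health` and its parameters are invented for the example.

```python
import json


def evaluate_health(status_code, body, expected_status=200,
                    path="status", expected_value="ok"):
    """Return (passed, reason) for one health response.

    Fails if the status code is wrong, the body is not a JSON object,
    or the top-level field `path` does not equal `expected_value`.
    """
    if status_code != expected_status:
        return False, f"unexpected status {status_code}"
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        return False, "body is not valid JSON"
    if not isinstance(payload, dict):
        return False, "body is not a JSON object"
    actual = payload.get(path)
    if actual != expected_value:
        return False, f"{path} is {actual!r}, expected {expected_value!r}"
    return True, "ok"
```

A plain status check would pass `evaluate_health(200, '{"status": "degraded"}')`; with the body assertion it fails, and the reason names the offending field.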

Root URL monitor (keyword check)

Add a second monitor for your app's root URL:

| Field | Value |
|---|---|
| URL | https://my-app.fly.dev/ |
| Expected status | 200 |
| Keyword check | Your app name or a unique string on the page |

This catches an error or placeholder page served with a 200 status, which a status-only HTTP check misses.
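
The keyword check boils down to a status test plus a substring test. A minimal sketch, with `keyword_check` as a made-up helper name rather than anything from Vigilmon:

```python
def keyword_check(status_code, body, keyword, expected_status=200):
    """Pass only if the status matches and the keyword appears in the body."""
    if status_code != expected_status:
        return False
    # Accept bytes or str; undecodable bytes are replaced rather than raising.
    text = body.decode("utf-8", errors="replace") if isinstance(body, bytes) else body
    return keyword in text
```

Pick a keyword that appears only on the real page. A generic word such as "Home" can also show up on an error page and defeat the check.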

Step 5: Alert on Regional Failures

Fly.io runs across dozens of regions. When a Vigilmon monitor fires, the alert includes which of Vigilmon's external check locations detected the failure — giving you a signal about whether the problem is global or regional.

Configure your alert channels to capture this context:

Go to Alert Channels → Add Channel → Webhook.
Add your Slack incoming webhook URL.
Assign the channel to both monitors.

A regional failure alert looks like:

🔴 my-app.fly.dev/health is DOWN
Checked from: us-east, eu-west, ap-southeast
Failing regions: eu-west
Status: 503 | Duration: 2m 10s

When only one Vigilmon region reports a failure, it's likely a regional issue or a routing problem specific to that path. When all regions fail simultaneously, it's a global outage.
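
That rule of thumb is easy to encode if you post-process alerts yourself, for example in a webhook handler. The sketch below assumes you already have a pass/fail result per check location. The input shape and the name `classify_outage` are assumptions for illustration, not Vigilmon's payload format, and treating every partial failure as "regional" is a simplification.

```python
def classify_outage(results):
    """Classify an incident from per-location results.

    `results` maps a check location name to True (passed) or False (failed).
    Returns (scope, failing_locations), where scope is "healthy",
    "regional", or "global".
    """
    if not results:
        raise ValueError("no check results to classify")
    failing = sorted(name for name, passed in results.items() if not passed)
    if not failing:
        return "healthy", []
    if len(failing) == len(results):
        return "global", failing
    return "regional", failing
```

For the sample alert above, `classify_outage({"us-east": True, "eu-west": False, "ap-southeast": True})` returns `("regional", ["eu-west"])`.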

Step 6: Deploy-Safe Maintenance Windows

fly deploy triggers a rolling restart. Without a maintenance window, Vigilmon may fire alerts during the 30–60 seconds when old VMs are stopping and new ones are starting.

Two approaches:

Option A: Vigilmon maintenance window (manual)

Before a deploy, open your monitor in Vigilmon, click Maintenance Window, and set a 5-minute mute. After deploy completes, close the window.

Option B: Automate via CI/CD

If you use GitHub Actions for deploys, add a Vigilmon API call to your deploy workflow:

# .github/workflows/deploy.yml
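- name: Mute Vigilmon during deploy
  env:
    # Assumed repository secret names; adjust to match your setup
    VIGILMON_MONITOR_ID: ${{ secrets.VIGILMON_MONITOR_ID }}
    VIGILMON_API_KEY: ${{ secrets.VIGILMON_API_KEY }}
  run: |
    curl -s -X POST "https://vigilmon.online/api/monitors/$VIGILMON_MONITOR_ID/maintenance" \
      -H "Authorization: Bearer $VIGILMON_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"duration_minutes": 5}'

- name: Deploy to Fly.io
  env:
    FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
  run: fly deploy --wait-timeout 120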

The maintenance window automatically expires after 5 minutes, so you don't need to remember to re-enable monitoring.
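
If your deploy script is Python rather than a GitHub Actions workflow, the same call can be made with the standard library. This sketch reuses the endpoint, headers, and `duration_minutes` field from the curl example above and assumes nothing else about the Vigilmon API. The function names are illustrative.

```python
import json
import urllib.request

VIGILMON_API_BASE = "https://vigilmon.online/api"  # taken from the curl example


def build_maintenance_request(monitor_id, api_key, duration_minutes=5):
    """Build (but do not send) the POST that opens a maintenance window."""
    body = json.dumps({"duration_minutes": duration_minutes}).encode("utf-8")
    return urllib.request.Request(
        f"{VIGILMON_API_BASE}/monitors/{monitor_id}/maintenance",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def mute_monitor(monitor_id, api_key, duration_minutes=5, timeout_s=10.0):
    """Send the request and return True on a 2xx response.

    Network and HTTP errors return False instead of raising, so a failed
    mute does not block the deploy itself.
    """
    req = build_maintenance_request(monitor_id, api_key, duration_minutes)
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:  # URLError, HTTPError, and socket timeouts are all OSError
        return False
```

Returning False instead of raising is a deliberate choice here: a noisy alert during a deploy is annoying, but blocking a release because the mute call failed is usually worse.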

Step 7: SSL Certificate Monitoring

Fly.io issues and renews SSL certificates automatically, but renewal can fail in edge cases, typically on custom domains (DNS misconfiguration, Let's Encrypt rate limits). Add an SSL monitor, and point it at your custom domain if you use one:

Add Monitor → SSL Certificate
URL: https://my-app.fly.dev
Alert when expiry is less than: 14 days

Vigilmon will alert you with two weeks' lead time if the certificate fails to auto-renew — enough time to intervene before users see TLS errors.
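
For a sense of what an expiry check computes, here is a sketch using Python's `ssl` module. It reads the certificate's `notAfter` field and compares the remaining days against the 14-day threshold. It illustrates the arithmetic only and is not how Vigilmon implements its monitor; the helper names are invented.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout_s=10.0):
    """Return the certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days between `now` (epoch seconds, default: current time) and expiry."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def should_alert(days_left, threshold_days=14):
    """Alert when fewer than `threshold_days` remain (or the cert has expired)."""
    return days_left < threshold_days
```

Running `should_alert(days_until_expiry(fetch_not_after("my-app.fly.dev")))` from a cron job gives you a crude second opinion. An already expired certificate fails the TLS handshake in `fetch_not_after`, which is itself the signal you want.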

Production Checklist

[ ] /health checks all critical dependencies and returns 503 on failure
[ ] FLY_REGION included in health response for regional correlation
[ ] Fly.io internal checks configured in fly.toml
[ ] Vigilmon HTTP monitor with JSON body assertion
[ ] Vigilmon keyword monitor for root URL
[ ] SSL certificate monitor with 14-day expiry alert
[ ] Maintenance window configured for deploy pipeline
[ ] Alert channels tested end-to-end

Wrapping Up

Fly.io's global reach is a superpower — and Vigilmon's external monitoring matches that global footprint to give you the independent view you need to catch regional failures, HTTPS issues, and downtime that Fly's own platform tooling won't surface.

You now have:

External HTTP monitoring from multiple regions
JSON body assertions catching degraded-but-alive states
Regional failure attribution in alert payloads
Deploy-safe maintenance windows

Deploying on Fly.io? Share what you're monitoring in the comments.