Monitoring Google Cloud Run with Vigilmon
Google Cloud Run scales your containers from zero to thousands of instances in seconds. That elasticity is powerful, but it also hides failure modes that traditional monitoring misses: a revision with a broken dependency check can pass Cloud Run's internal health gate while returning errors to real users, and a regional Cloud Run endpoint can degrade while your global load balancer quietly reroutes traffic.
Vigilmon gives you an external, independent health check that fires alerts the moment your Cloud Run service stops responding — regardless of what Google's own dashboards say.
This tutorial covers:
- A health endpoint in your Cloud Run service
- Vigilmon HTTP monitor setup
- Alerting on cold-start failures and deployment regressions
- Heartbeat monitoring for Cloud Run Jobs (scheduled workloads)
- Multi-region monitoring strategy
Step 1: Add a health check route to your Cloud Run service
Cloud Run routes HTTP traffic to your container. Add a /health route that verifies real dependencies instead of returning a static 200.
Node.js / Express:
// src/health.js
import { Router } from 'express';
import { Firestore } from '@google-cloud/firestore';

const db = new Firestore();

// Mount in src/index.js with: app.use(healthRouter)
export const healthRouter = Router();

healthRouter.get('/health', async (req, res) => {
  const checks = {};
  let healthy = true;

  // Firestore connectivity probe
  try {
    await db.collection('_health').doc('probe').set({ ts: Date.now() });
    checks.firestore = 'ok';
  } catch (err) {
    checks.firestore = `error: ${err.message}`;
    healthy = false;
  }

  // Downstream HTTP dependency (built-in fetch and AbortSignal.timeout need Node.js 18+)
  try {
    const resp = await fetch(process.env.DOWNSTREAM_URL + '/ping', {
      signal: AbortSignal.timeout(3000),
    });
    checks.downstream = resp.ok ? 'ok' : `http_${resp.status}`;
    if (!resp.ok) healthy = false;
  } catch (err) {
    checks.downstream = `error: ${err.message}`;
    healthy = false;
  }

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    region: process.env.CLOUD_RUN_REGION ?? 'unknown',
    checks,
    timestamp: new Date().toISOString(),
  });
});
Python / FastAPI:
# app/health.py
import os
from datetime import datetime, timezone

from fastapi import APIRouter, Response
from google.cloud import firestore

router = APIRouter()
db = firestore.Client()


@router.get('/health')
def health(response: Response):
    # Plain `def`: FastAPI runs it in a threadpool, so the blocking
    # Firestore client does not stall the event loop.
    checks = {}
    healthy = True

    try:
        db.collection('_health').document('probe').set(
            {'ts': datetime.now(tz=timezone.utc).isoformat()}
        )
        checks['firestore'] = 'ok'
    except Exception as e:
        checks['firestore'] = f'error: {e}'
        healthy = False

    # Return 503 only when a dependency check failed
    if not healthy:
        response.status_code = 503

    return {
        'status': 'ok' if healthy else 'degraded',
        'region': os.getenv('CLOUD_RUN_REGION', 'unknown'),
        'checks': checks,
        'timestamp': datetime.now(tz=timezone.utc).isoformat(),
    }
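Cloud Run injects K_SERVICE, K_REVISION, and K_CONFIGURATION into service containers automatically. The region is not exposed as an environment variable, so set CLOUD_RUN_REGION yourself at deploy time (for example with --set-env-vars) or read the region from the metadata server. Including these values in the health response makes region-specific debugging easier.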
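If CLOUD_RUN_REGION turns out to be unset in your environment, the region can also be read from the metadata server. The sketch below assumes the documented /computeMetadata/v1/instance/region path, which returns a value of the form projects/PROJECT-NUMBER/regions/REGION. The helper names parseRegion and fetchRegion are illustrative and not part of any library.

```javascript
// Extracts "us-central1" from "projects/123456789/regions/us-central1".
function parseRegion(metadataValue) {
  const match = /\/regions\/([a-z0-9-]+)$/.exec(String(metadataValue).trim());
  return match ? match[1] : 'unknown';
}

// Prefers an explicitly set CLOUD_RUN_REGION, then falls back to the metadata server.
// Only resolves to a real region when running on Google Cloud.
async function fetchRegion(env = process.env) {
  if (env.CLOUD_RUN_REGION) return env.CLOUD_RUN_REGION;
  try {
    const resp = await fetch(
      'http://metadata.google.internal/computeMetadata/v1/instance/region',
      { headers: { 'Metadata-Flavor': 'Google' }, signal: AbortSignal.timeout(1000) },
    );
    return resp.ok ? parseRegion(await resp.text()) : 'unknown';
  } catch {
    return 'unknown'; // e.g. running locally, where the metadata server is unreachable
  }
}
```

Calling fetchRegion() once at startup and caching the result keeps the lookup out of the health check's hot path.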
Dockerfile tip: Make sure your image listens on the port Cloud Run expects:
ENV PORT=8080
EXPOSE 8080
CMD ["node", "src/index.js"]
Step 2: Configure Cloud Run startup and liveness probes
Cloud Run supports container-level health probes via your service definition. Add a startup probe so Cloud Run waits for your service to be ready before routing traffic:
Cloud Run YAML (via gcloud run services replace):
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-app
spec:
  template:
    spec:
      containers:
        - image: gcr.io/my-project/my-app:latest
          startupProbe:
            httpGet:
              path: /health
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
            periodSeconds: 30
Terraform:
resource "google_cloud_run_v2_service" "app" {
  name     = "my-app"
  location = "us-central1"

  template {
    containers {
      image = "gcr.io/my-project/my-app:latest"

      startup_probe {
        http_get { path = "/health" }
        initial_delay_seconds = 5
        period_seconds        = 10
        failure_threshold     = 3
      }

      liveness_probe {
        http_get { path = "/health" }
        period_seconds = 30
      }
    }
  }
}
The startup probe prevents failed revisions from receiving traffic. The liveness probe restarts containers that become unhealthy mid-run.
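To sanity-check the probe settings against your real startup time, it helps to estimate how long a container has before it is treated as failed. The sketch below assumes Kubernetes-style probe semantics (probing begins after the initial delay, and the container fails after failureThreshold consecutive failed probes spaced one period apart), so the result is an approximation, not an exact Cloud Run guarantee. The function name is illustrative.

```javascript
// Approximate time (seconds) a container has to pass its startup probe
// before the revision is treated as failed.
function startupBudgetSeconds({ initialDelaySeconds, periodSeconds, failureThreshold }) {
  return initialDelaySeconds + periodSeconds * failureThreshold;
}

// Values from the startup probe configured above.
const budget = startupBudgetSeconds({
  initialDelaySeconds: 5,
  periodSeconds: 10,
  failureThreshold: 3,
});
console.log(`Startup budget: ~${budget}s`); // Startup budget: ~35s
```

If your service routinely needs longer than this to connect to its dependencies, raising failureThreshold or periodSeconds is usually safer than loosening what /health checks.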
Step 3: Set up Vigilmon external HTTP monitoring
Cloud Run's internal probes only run inside Google's infrastructure. You need an external probe that monitors your service from the internet and alerts you when users can't reach it.
- Sign up at vigilmon.online (free, no card required)
- Click New Monitor → HTTP
- URL: https://my-app-<hash>-uc.a.run.app/health (or your custom domain)
- Check interval: 1 minute (paid) or 5 minutes (free)
- Response timeout: 15 seconds, since Cloud Run cold starts can add 2–10 seconds for large images
- Expected status: 200
- JSON assertion (optional): path status, expected value ok
- Save
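Before saving the monitor, it can be useful to run the same check by hand. The sketch below mirrors the settings in the list above (15-second timeout, expected status 200, JSON field status equal to ok). It is a local smoke test, not Vigilmon's implementation, and the function names are illustrative.

```javascript
// Applies the monitor's rules: HTTP 200 and a JSON body whose status field is "ok".
function evaluateCheck(httpStatus, bodyText) {
  if (httpStatus !== 200) return { up: false, reason: `unexpected status ${httpStatus}` };
  try {
    const body = JSON.parse(bodyText);
    return body?.status === 'ok'
      ? { up: true, reason: 'ok' }
      : { up: false, reason: `status field was ${JSON.stringify(body?.status)}` };
  } catch {
    return { up: false, reason: 'response body is not valid JSON' };
  }
}

// Runs one check against a URL with the same 15-second timeout (Node.js 18+).
async function checkOnce(url) {
  try {
    const resp = await fetch(url, { signal: AbortSignal.timeout(15000) });
    return evaluateCheck(resp.status, await resp.text());
  } catch (err) {
    return { up: false, reason: `request failed: ${err.message}` };
  }
}

// Usage: checkOnce('https://<your-service-url>/health').then(console.log);
```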
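Cold start tip: With checks every minute, Cloud Run will usually have a warm instance available, because idle instances can be kept for up to 15 minutes. This is not guaranteed, so set a minimum instance count of 1 if you need a warm instance at all times. Use Vigilmon's Confirm Down After setting (set to 2 failures) to suppress alerts from isolated cold-start timeouts while still catching sustained outages.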
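The effect of requiring two consecutive failures can be sketched in a few lines. This illustrates the general confirmation technique only; how Vigilmon implements the setting internally is not described here, and the function name is hypothetical.

```javascript
// Returns true once the most recent `threshold` checks have all failed.
// `results` is ordered oldest to newest; true means the check passed.
function isConfirmedDown(results, threshold = 2) {
  if (results.length < threshold) return false;
  return results.slice(-threshold).every((passed) => !passed);
}

console.log(isConfirmedDown([true, false, true, false])); // false: isolated failures
console.log(isConfirmedDown([true, true, false, false])); // true: two failures in a row
```

The trade-off is detection latency: with a threshold of 2, an outage is confirmed one check interval later than it would be with a threshold of 1.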
Step 4: Alert on deployment failures
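Each Cloud Run deployment creates a new immutable revision. By default, Cloud Run sends 100% of traffic to the new revision as soon as it is ready, so a revision that starts cleanly but fails on real requests affects every user at once. If you roll out gradually with traffic splitting, Vigilmon can catch the degradation before the new revision reaches 100%.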
In Vigilmon, configure alert channels at Notifications → New Channel:
Slack:
- Create an incoming webhook at api.slack.com/apps
- Paste the webhook URL into Vigilmon
- Enable it on your Cloud Run monitors
Email:
- Add your on-call address as a notification channel
PagerDuty:
- Create a Vigilmon integration in PagerDuty and paste the integration key
When a bad deployment causes failures, Vigilmon sends:
🔴 DOWN: my-app.example.com/health
Status: 503 (degraded)
Duration: 2m 15s
You can then roll back with:
gcloud run services update-traffic my-app \
--to-revisions my-app-00001-abc=100 \
--region us-central1
Step 5: Heartbeat monitoring for Cloud Run Jobs
Cloud Run Jobs are batch workloads that run to completion and are often triggered on a schedule (e.g., via Cloud Scheduler). Because a job exposes no HTTP endpoint, a standard HTTP monitor can't detect a scheduled job that fails silently or never starts.
Add a Vigilmon heartbeat ping at the end of each successful job run:
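// jobs/sync.js
// Uses the built-in fetch and AbortSignal.timeout (Node.js 18+).
// runSyncJob() stands in for your job's own logic; import or define it here.

async function pingHeartbeat() {
  const heartbeatUrl = process.env.VIGILMON_HEARTBEAT_URL;
  if (!heartbeatUrl) return;
  try {
    await fetch(heartbeatUrl, { method: 'GET', signal: AbortSignal.timeout(10000) });
  } catch (err) {
    // A failed ping should not fail a job that succeeded; the missed heartbeat will alert.
    console.warn('Heartbeat ping failed:', err.message);
  }
}

async function main() {
  try {
    await runSyncJob();
    // Ping Vigilmon heartbeat only on success
    await pingHeartbeat();
    console.log('Job complete');
    process.exit(0);
  } catch (err) {
    console.error('Job failed:', err);
    process.exit(1);
  }
}

main();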
Configure the heartbeat URL as a Cloud Run Job environment variable (store the value in Secret Manager):
# printf avoids the trailing newline that a here-string would store in the secret
printf '%s' "https://vigilmon.online/heartbeat/your-monitor-id" | \
  gcloud secrets create vigilmon-heartbeat-url --data-file=-
gcloud run jobs update my-sync-job \
--set-secrets VIGILMON_HEARTBEAT_URL=vigilmon-heartbeat-url:latest \
--region us-central1
In Vigilmon, create a Heartbeat Monitor with a grace period slightly longer than your Cloud Scheduler interval. If Vigilmon doesn't receive a ping within the grace period, it fires an alert.
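The grace-period rule can be made concrete with a small calculation. The sketch below shows one way to choose a grace period (schedule interval plus expected run time plus a safety margin) and how an overdue heartbeat would be detected. The function names and the 5-minute default margin are assumptions for illustration, not Vigilmon settings.

```javascript
// A heartbeat is overdue when no ping arrived within the grace period.
function isHeartbeatOverdue(lastPingAt, gracePeriodMs, now = Date.now()) {
  return now - lastPingAt > gracePeriodMs;
}

// One way to pick a grace period: schedule interval + expected run time + safety margin.
function suggestGracePeriodMs(intervalMs, expectedRunMs, marginMs = 5 * 60 * 1000) {
  return intervalMs + expectedRunMs + marginMs;
}

const HOUR = 60 * 60 * 1000;
const grace = suggestGracePeriodMs(HOUR, 10 * 60 * 1000); // hourly job that runs ~10 minutes
console.log(grace / 60000); // 75 (minutes)
```

Including the expected run time matters because the ping is sent at the end of the job, not at the scheduled start.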
Step 6: Multi-region Cloud Run monitoring
If you deploy Cloud Run services to multiple regions behind a global external Application Load Balancer, create one Vigilmon monitor per region, because the load balancer's single URL hides which region is failing:
- Deploy your service to us-central1, europe-west1, and asia-northeast1
- In Vigilmon, create an HTTP monitor for each regional URL
- Group all monitors under a single Status Page
The health response includes a region field populated from CLOUD_RUN_REGION, so as long as that variable is set on each regional deployment, Vigilmon's check log tells you exactly which region degraded.
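As a small illustration of why the region field is useful, the sketch below takes parsed /health response bodies, one per regional monitor, and lists the regions that are not reporting ok. The sample data is made up, and the function name is hypothetical.

```javascript
// Each entry is a parsed /health response body from one regional monitor.
function findDegradedRegions(healthBodies) {
  return healthBodies
    .filter((body) => body.status !== 'ok')
    .map((body) => body.region);
}

// Made-up sample responses for the three regions named above.
const sample = [
  { status: 'ok', region: 'us-central1' },
  { status: 'degraded', region: 'europe-west1' },
  { status: 'ok', region: 'asia-northeast1' },
];
console.log(findDegradedRegions(sample)); // [ 'europe-west1' ]
```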
What you've built
| What | How |
|------|-----|
| Health endpoint | /health route checking real dependencies |
| Startup protection | Cloud Run startup probe prevents bad revisions |
| Liveness | Cloud Run liveness probe restarts stuck containers |
| External uptime monitoring | Vigilmon HTTP monitor on public URL |
| Deployment failure alerts | Vigilmon catches degradation as traffic shifts |
| Scheduled job monitoring | Heartbeat ping in Cloud Run Jobs |
| Multi-region visibility | One Vigilmon monitor per regional deployment |
| Slack/email/PagerDuty | Vigilmon notification channels |
Cloud Run handles scaling and self-healing. Vigilmon handles visibility. You know the moment something breaks — not when a user files a ticket.
Next steps
- Add Vigilmon monitors for each environment (prod, staging, canary)
- Watch response time trends in Vigilmon to catch gradual resource exhaustion before failures
- Set up heartbeat monitors for every Cloud Run Job that does important work
Get started free at vigilmon.online.