Monitoring Google Cloud Run with Vigilmon

Google Cloud Run scales your containers from zero to thousands of instances in seconds. That elasticity is powerful, but it also hides failure modes that traditional monitoring misses: a revision with a broken dependency check can pass Cloud Run's internal health gate while returning errors to real users, and a regional Cloud Run endpoint can degrade while your global load balancer quietly reroutes traffic.

Vigilmon gives you an external, independent health check that fires alerts the moment your Cloud Run service stops responding — regardless of what Google's own dashboards say.

This tutorial covers:

A health endpoint in your Cloud Run service
Vigilmon HTTP monitor setup
Alerting on cold-start failures and deployment regressions
Heartbeat monitoring for Cloud Run Jobs (scheduled workloads)
Multi-region monitoring strategy

Step 1: Add a health check route to your Cloud Run service

Cloud Run routes HTTP traffic to your container. Add a /health route that verifies real dependencies instead of returning a static 200.

Node.js / Express:

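// src/health.js
import express from 'express';
import { Firestore } from '@google-cloud/firestore';

const db = new Firestore();
// Assumes this module owns the Express app; src/index.js imports it and calls app.listen()
export const app = express();

app.get('/health', async (req, res) => {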
  const checks = {};
  let healthy = true;

  // Firestore connectivity probe
  try {
    await db.collection('_health').doc('probe').set({ ts: Date.now() });
    checks.firestore = 'ok';
  } catch (err) {
    checks.firestore = `error: ${err.message}`;
    healthy = false;
  }

  // Downstream HTTP dependency
  try {
    const resp = await fetch(process.env.DOWNSTREAM_URL + '/ping', {
      signal: AbortSignal.timeout(3000),
    });
    checks.downstream = resp.ok ? 'ok' : `http_${resp.status}`;
    if (!resp.ok) healthy = false;
  } catch (err) {
    checks.downstream = `error: ${err.message}`;
    healthy = false;
  }

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    region: process.env.CLOUD_RUN_REGION ?? 'unknown',
    checks,
    timestamp: new Date().toISOString(),
  });
});

Python / FastAPI:

# app/health.py
import os
from fastapi import APIRouter, Response
from google.cloud import firestore
from datetime import datetime, timezone

router = APIRouter()
db = firestore.Client()

@router.get('/health')
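# Plain def: FastAPI runs it in a threadpool, so the blocking Firestore call does not stall the event loop
def health(response: Response):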
    checks = {}
    healthy = True

    try:
        db.collection('_health').document('probe').set({'ts': datetime.now(tz=timezone.utc).isoformat()})
        checks['firestore'] = 'ok'
    except Exception as e:
        checks['firestore'] = f'error: {e}'
        healthy = False
        response.status_code = 503

    return {
        'status': 'ok' if healthy else 'degraded',
        'region': os.getenv('CLOUD_RUN_REGION', 'unknown'),
        'checks': checks,
        'timestamp': datetime.now(tz=timezone.utc).isoformat(),
    }
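The Node.js handler above also probes a downstream HTTP dependency, which the FastAPI version leaves out. A minimal sketch of an equivalent check using only the standard library follows. The helper name check_downstream is hypothetical; the /ping path and the 3-second timeout mirror the Node.js example.

```python
import urllib.error
import urllib.request


def check_downstream(base_url: str, timeout: float = 3.0) -> str:
    """Return 'ok', 'http_<status>', or 'error: <reason>' for base_url + '/ping'."""
    try:
        with urllib.request.urlopen(base_url + "/ping", timeout=timeout) as resp:
            # urlopen raises HTTPError for 4xx/5xx, so this branch is a success response.
            return "ok" if 200 <= resp.status < 300 else f"http_{resp.status}"
    except urllib.error.HTTPError as exc:
        # The dependency answered, but with an error status.
        return f"http_{exc.code}"
    except OSError as exc:
        # Covers URLError, timeouts, and refused connections.
        return f"error: {exc}"
```

Inside the handler, the result could be stored as checks['downstream'], with healthy set to False and the status code set to 503 whenever the value is not 'ok'. The call blocks, so it belongs in a plain def handler or a worker thread rather than directly in an async def.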

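Cloud Run injects K_SERVICE, K_REVISION, and K_CONFIGURATION automatically. It does not inject the region for services, so set CLOUD_RUN_REGION yourself at deploy time (for example with --set-env-vars CLOUD_RUN_REGION=us-central1) or read the region from the metadata server. Include these values in the health response to make region-specific debugging easier.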
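If CLOUD_RUN_REGION turns out to be unset in your container, the region is also available from the instance metadata server. The sketch below prefers the environment variable and falls back to the metadata endpoint, which returns a value of the form projects/PROJECT-NUMBER/regions/REGION. The function names are hypothetical, and the 2-second timeout is an arbitrary choice.

```python
import os
import urllib.request

METADATA_REGION_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/region"
)


def parse_region(metadata_value: str) -> str:
    """Extract 'us-central1' from 'projects/123456789/regions/us-central1'."""
    return metadata_value.strip().rsplit("/", 1)[-1]


def get_region(timeout: float = 2.0) -> str:
    """Prefer an explicitly set CLOUD_RUN_REGION, else ask the metadata server."""
    explicit = os.getenv("CLOUD_RUN_REGION")
    if explicit:
        return explicit
    request = urllib.request.Request(
        METADATA_REGION_URL, headers={"Metadata-Flavor": "Google"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return parse_region(resp.read().decode("utf-8"))
    except OSError:
        # Not running on Google Cloud (for example, local development).
        return "unknown"
```

Resolving the region once at startup and caching it keeps the lookup out of the health check's hot path.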

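Dockerfile tip: Cloud Run tells your container which port to listen on through the PORT environment variable (8080 by default). Make sure your server binds to it; the ENV line below only supplies a default for local runs: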

ENV PORT=8080
EXPOSE 8080
CMD ["node", "src/index.js"]
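For the FastAPI variant, the same port contract can be honored in a few lines. This is a minimal sketch; resolve_port is a hypothetical helper name, and the 8080 fallback matches the Dockerfile default above.

```python
import os


def resolve_port(environ=os.environ, default: int = 8080) -> int:
    """Return the port to listen on: $PORT if it is set, else the default."""
    raw = environ.get("PORT", "")
    return int(raw) if raw.strip() else default
```

The result can then be passed to the server at startup, for example as the port argument of uvicorn.run, with the host set to 0.0.0.0 so the container accepts external connections.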

Step 2: Configure Cloud Run startup and liveness probes

Cloud Run supports container-level health probes via your service definition. Add a startup probe so Cloud Run waits for your service to be ready before routing traffic:

Cloud Run YAML (via gcloud run services replace):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-app
spec:
  template:
    spec:
      containers:
        - image: gcr.io/my-project/my-app:latest
          startupProbe:
            httpGet:
              path: /health
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
            periodSeconds: 30

Terraform:

resource "google_cloud_run_v2_service" "app" {
  name     = "my-app"
  location = "us-central1"

  template {
    containers {
      image = "gcr.io/my-project/my-app:latest"

      startup_probe {
        http_get { path = "/health" }
        initial_delay_seconds = 5
        period_seconds        = 10
        failure_threshold     = 3
      }

      liveness_probe {
        http_get { path = "/health" }
        period_seconds = 30
      }
    }
  }
}

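The startup probe prevents failed revisions from receiving traffic. The liveness probe restarts containers that become unhealthy mid-run. Probes time out after 1 second by default, so either keep /health fast or raise timeoutSeconds (it must not exceed periodSeconds).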
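To reason about how long a slow-starting container has before the startup probe gives up, a rough upper bound is the initial delay plus one period per allowed failure. The sketch below applies that estimate to the values used in the probe configuration above; the exact cutoff depends on probe timing and the probe timeout, so treat the result as an approximation rather than a guarantee.

```python
def startup_window_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Rough upper bound on how long a container has to pass its startup probe."""
    return initial_delay + period * failure_threshold


# Values from the probe configuration above.
print(startup_window_seconds(initial_delay=5, period=10, failure_threshold=3))  # 35
```

If your service regularly needs longer than this to warm up, raising failureThreshold or periodSeconds widens the window.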

Step 3: Set up Vigilmon external HTTP monitoring

Cloud Run's internal probes only run inside Google's infrastructure. You need an external probe that monitors your service from the internet and alerts you when users can't reach it.

Sign up at vigilmon.online — free, no card required
Click New Monitor → HTTP
URL: https://my-app-<hash>-uc.a.run.app/health (or your custom domain)
Check interval: 1 minute (paid) or 5 minutes (free)
Response timeout: 15 seconds — Cloud Run cold starts can add 2–10 seconds for large images
Expected status: 200
JSON assertion (optional): path status, expected value ok
Save
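Before saving the monitor, it can help to confirm that your endpoint would pass the settings listed above. The sketch below is an illustration of the two assertions (status code, then the JSON field), not Vigilmon's actual implementation; evaluate_check and its parameter names are hypothetical.

```python
import json


def evaluate_check(status_code: int, body: str, expected_status: int = 200,
                   json_path: str = "status", expected_value: str = "ok") -> tuple[bool, str]:
    """Mimic the monitor settings above: status code first, then the JSON assertion."""
    if status_code != expected_status:
        return False, f"expected HTTP {expected_status}, got {status_code}"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False, "body is not valid JSON"
    # Only top-level keys are supported in this sketch.
    actual = payload.get(json_path) if isinstance(payload, dict) else None
    if actual != expected_value:
        return False, f"expected {json_path}={expected_value!r}, got {actual!r}"
    return True, "up"
```

Feeding it the status code and body from a request made with a 15-second timeout reproduces the conditions the monitor is configured with.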

Cold start tip: At 1-minute check intervals, the checks themselves usually keep an instance warm, but Cloud Run does not guarantee it; set a minimum instance count if you need that guarantee. Use Vigilmon's Confirm Down After setting (set to 2 failures) to suppress alerts from isolated cold-start timeouts while still catching sustained outages.
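The effect of a confirm-after threshold can be illustrated in a few lines. This is a sketch of the general idea (alert only after N consecutive failures), not a description of how Vigilmon implements the setting; should_alert is a hypothetical name.

```python
def should_alert(results: list[bool], confirm_after: int = 2) -> bool:
    """True when the most recent `confirm_after` checks all failed.

    `results` is ordered oldest to newest; True means the check passed.
    """
    if confirm_after < 1 or len(results) < confirm_after:
        return False
    return not any(results[-confirm_after:])
```

With a threshold of 2, a single cold-start timeout followed by a passing check never alerts, while two failures in a row do. The trade-off is that a real outage is reported one check interval later.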

Step 4: Alert on deployment failures

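By default, Cloud Run sends 100% of traffic to a new revision as soon as it is ready. If the revision is broken in a way the startup probe misses, every user hits it at once, and Vigilmon's next check catches the failure. If you roll out gradually with traffic splitting instead, Vigilmon can catch the degradation before the revision reaches 100%.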

In Vigilmon, configure alert channels at Notifications → New Channel:

Slack:

Create an incoming webhook at api.slack.com/apps
Paste the webhook URL into Vigilmon
Enable it on your Cloud Run monitors

Email:

Add your on-call address as a notification channel

PagerDuty:

Create a Vigilmon integration in PagerDuty and paste the integration key

When a bad deployment causes failures, Vigilmon sends:

🔴 DOWN: my-app.example.com/health
Status: 503 (degraded)
Duration: 2m 15s

You can then roll back with:

gcloud run services update-traffic my-app \
  --to-revisions my-app-00001-abc=100 \
  --region us-central1

Step 5: Heartbeat monitoring for Cloud Run Jobs

Cloud Run Jobs are batch workloads that run to completion, typically triggered on a schedule (e.g., via Cloud Scheduler). A standard HTTP monitor won't detect a scheduled job that fails silently, because a job exposes no endpoint to probe.

Add a Vigilmon heartbeat ping at the end of each successful job run:

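// jobs/sync.js
// Uses the global fetch built into Node.js 18+, so no node-fetch import is needed.
// runSyncJob is your own job logic, defined or imported elsewhere.

async function main() {
  try {
    await runSyncJob();

    // Ping Vigilmon heartbeat only on success
    const heartbeatUrl = process.env.VIGILMON_HEARTBEAT_URL;
    if (heartbeatUrl) {
      // Bound the request so a slow heartbeat endpoint cannot hang the job
      await fetch(heartbeatUrl, { method: 'GET', signal: AbortSignal.timeout(5000) });
    }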

    console.log('Job complete');
    process.exit(0);
  } catch (err) {
    console.error('Job failed:', err);
    process.exit(1);
  }
}

main();
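For jobs written in Python, the same pattern can be sketched with the standard library. The names run_job and ping_heartbeat are hypothetical, and the 10-second timeout is an arbitrary choice. One deliberate difference from the Node.js version: a failed heartbeat ping is logged rather than turned into a job failure, on the assumption that the work itself succeeded.

```python
import os
import sys
import urllib.request


def ping_heartbeat(url: str, timeout: float = 10.0) -> None:
    """Send the success ping; raises on network errors or error responses."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def run_job(job, ping=ping_heartbeat, environ=os.environ) -> int:
    """Run `job`, ping the heartbeat only on success, and return an exit code."""
    try:
        job()
    except Exception as exc:
        print(f"Job failed: {exc}", file=sys.stderr)
        return 1
    url = environ.get("VIGILMON_HEARTBEAT_URL")
    if url:
        try:
            ping(url)
        except OSError as exc:
            # The work succeeded; log the missed ping instead of failing the job.
            print(f"Heartbeat ping failed: {exc}", file=sys.stderr)
    print("Job complete")
    return 0
```

The entry point would then end with sys.exit(run_job(your_job_function)), so Cloud Run records a non-zero exit code when the job raises.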

Configure the heartbeat URL as a Cloud Run Job environment variable (store the value in Secret Manager):

# printf avoids storing the trailing newline that a here-string would add
printf '%s' "https://vigilmon.online/heartbeat/your-monitor-id" | \
  gcloud secrets create vigilmon-heartbeat-url --data-file=-

gcloud run jobs update my-sync-job \
  --set-secrets VIGILMON_HEARTBEAT_URL=vigilmon-heartbeat-url:latest \
  --region us-central1

In Vigilmon, create a Heartbeat Monitor with a grace period slightly longer than your Cloud Scheduler interval. If Vigilmon doesn't receive a ping within the grace period, it fires an alert.
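The grace-period rule amounts to a single comparison. The sketch below illustrates the idea, not Vigilmon's internals; heartbeat_overdue is a hypothetical name, and the hourly schedule with a 70-minute grace period is an assumed example.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_overdue(last_ping: datetime, grace_period: timedelta, now: datetime) -> bool:
    """True when more than `grace_period` has passed since the last ping."""
    return now - last_ping > grace_period


# Assumed example: the job runs hourly, the grace period is 70 minutes.
last_ping = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
grace = timedelta(minutes=70)
print(heartbeat_overdue(last_ping, grace, last_ping + timedelta(minutes=65)))  # False
print(heartbeat_overdue(last_ping, grace, last_ping + timedelta(minutes=75)))  # True
```

Because the ping is sent at the end of the run, the margin above the schedule interval should also cover normal variation in how long the job takes.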

Step 6: Multi-region Cloud Run monitoring

If you deploy Cloud Run services to multiple regions behind a global external Application Load Balancer, create one Vigilmon monitor per region, since the load balancer can mask a single degraded region:

Deploy your service to us-central1, europe-west1, and asia-northeast1
In Vigilmon, create an HTTP monitor for each regional URL
Group all monitors under a single Status Page

The health response includes a region field, so as long as CLOUD_RUN_REGION is set on each regional deployment, Vigilmon's check log tells you exactly which region degraded.
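When comparing regions by hand (for example, during an incident), the per-region health payloads can be reduced to a list of problem regions. This is a minimal sketch; degraded_regions is a hypothetical helper that expects the parsed /health JSON for each region, or None when the request failed.

```python
from typing import Optional


def degraded_regions(reports: dict[str, Optional[dict]]) -> list[str]:
    """Return the regions whose health payload is missing or not 'ok'."""
    return sorted(
        region
        for region, payload in reports.items()
        if not payload or payload.get("status") != "ok"
    )


reports = {
    "us-central1": {"status": "ok"},
    "europe-west1": {"status": "degraded"},
    "asia-northeast1": None,  # request timed out
}
print(degraded_regions(reports))  # ['asia-northeast1', 'europe-west1']
```

Treating a missing payload as degraded keeps an unreachable region from being silently counted as healthy.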

What you've built

| What | How |
|------|-----|
| Health endpoint | /health route checking real dependencies |
| Startup protection | Cloud Run startup probe prevents bad revisions |
| Liveness | Cloud Run liveness probe restarts stuck containers |
| External uptime monitoring | Vigilmon HTTP monitor on public URL |
| Deployment failure alerts | Vigilmon catches degradation as traffic shifts |
| Scheduled job monitoring | Heartbeat ping in Cloud Run Jobs |
| Multi-region visibility | One Vigilmon monitor per regional deployment |
| Slack/email/PagerDuty | Vigilmon notification channels |

Cloud Run handles scaling and self-healing. Vigilmon handles visibility. You know the moment something breaks — not when a user files a ticket.

Next steps

Add Vigilmon monitors for each environment (prod, staging, canary)
Watch response time trends in Vigilmon to catch gradual resource exhaustion before failures
Set up heartbeat monitors for every Cloud Run Job that does important work

Get started free at vigilmon.online.