Kubernetes gives you livenessProbe and readinessProbe out of the box. They restart crashed pods and pull unhealthy ones from service rotation. So why do production Kubernetes clusters still experience outages that engineers only discover from user complaints?
Because internal probes only see what the node sees. They cannot tell you whether the external load balancer is routing traffic correctly, whether DNS resolves outside the cluster, or whether an Ingress controller update quietly broke path routing. Vigilmon fills that gap with external monitoring from multiple geographic regions — the view your users have, not the view your kubelet has.
Why Internal Probes Are Not Enough
Liveness and readiness probes are essential, but their architecture has hard limits:
They only probe inside the cluster network. A livenessProbe.httpGet check runs from the kubelet on the same node as the pod. It traverses none of the external infrastructure that real traffic uses: cloud load balancers, external DNS, ingress controllers, TLS termination, CDN edge nodes.
They report per-pod, not per-service. A pod can be Ready while the Service pointing to it has incorrect selectors, the LoadBalancer IP has been de-provisioned, or a NodePort has stopped binding after a node reboot.
They cannot detect external DNS failures. If your api.example.com DNS record stops resolving — misconfiguration, registrar issue, TTL quirk — Kubernetes reports all pods healthy while every external client sees connection errors.
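They have no concept of latency SLOs. A probe's timeoutSeconds (1 second by default) is a pass/fail cutoff, not a latency objective: with a 5-second timeout, a probe that returns 200 in 4 seconds counts as healthy. Vigilmon lets you configure response time thresholds and alert when latency degrades, even before the service is technically "down."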
What You'll Set Up
- External HTTP monitoring for k8s Ingress endpoints
- Heartbeat monitoring for Kubernetes CronJobs
- Multi-cluster coverage strategy
- Alert routing to Slack or PagerDuty
You'll need a free Vigilmon account — no credit card required.
Step 1: Add a Proper Health Endpoint to Your Service
Before configuring external monitoring, give your service a /health endpoint that checks real dependencies, not just that the process is running.
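// main.go: Go example with dependency checks
package main

import (
	"context"
	"database/sql"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"time"

	_ "github.com/lib/pq" // registers the "postgres" driver
)

// HealthResponse is the JSON body returned by /health.
type HealthResponse struct {
	Status string            `json:"status"`
	Checks map[string]string `json:"checks"`
	Uptime float64           `json:"uptime_seconds"`
}

var startTime = time.Now()

func healthHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		checks := map[string]string{}
		status := "ok"

		// Bound the dependency check so a hung database cannot hang the health endpoint.
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			checks["database"] = "error: " + err.Error()
			status = "degraded"
		} else {
			checks["database"] = "ok"
		}

		// Any failed dependency turns the response into a 503.
		code := http.StatusOK
		if status != "ok" {
			code = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		json.NewEncoder(w).Encode(HealthResponse{
			Status: status,
			Checks: checks,
			Uptime: time.Since(startTime).Seconds(),
		})
	}
}

func main() {
	// DATABASE_URL and port 8080 are placeholders; adjust them for your service.
	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/health", healthHandler(db))
	log.Fatal(http.ListenAndServe(":8080", nil))
}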
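Deploy this alongside your existing probes. A readiness probe can hit /health and so can Vigilmon, but from entirely different network paths. Be cautious about pointing a liveness probe at an endpoint that checks dependencies: a database outage would then restart pods that are otherwise healthy.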
Step 2: Monitor Your Kubernetes Ingress Endpoint
The Ingress is where most external traffic enters your cluster and where silent failures most commonly occur. Set up an HTTP monitor targeting your public hostname:
- Log in to vigilmon.online and go to Monitors → New Monitor
- Choose HTTP / HTTPS
- Set the URL to your ingress hostname: https://api.yourdomain.com/health
- Set the check interval to 1 minute
- Under Expected response, set:
  - Status code: 200
  - Response body contains: "status":"ok"
  - Response time threshold: 2000ms (alert if latency exceeds 2 seconds)
- Save the monitor
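To make the three expectations listed above concrete, here is a minimal Go sketch of the kind of evaluation an external check performs on each response. It is an illustration under stated assumptions, not Vigilmon's implementation: `Expectation`, `CheckResult`, and `evaluate` are hypothetical names, and the values mirror the monitor settings above.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Expectation mirrors the monitor settings (hypothetical type, not a Vigilmon API).
type Expectation struct {
	StatusCode   int
	BodyContains string
	MaxLatency   time.Duration
}

// CheckResult is what a single probe observed.
type CheckResult struct {
	StatusCode int
	Body       string
	Latency    time.Duration
}

// evaluate returns an empty string when the result meets every expectation,
// otherwise a short reason describing the first expectation that failed.
func evaluate(exp Expectation, res CheckResult) string {
	switch {
	case res.StatusCode != exp.StatusCode:
		return fmt.Sprintf("status %d, want %d", res.StatusCode, exp.StatusCode)
	case !strings.Contains(res.Body, exp.BodyContains):
		return "body does not contain " + exp.BodyContains
	case res.Latency > exp.MaxLatency:
		return fmt.Sprintf("latency %s exceeds %s", res.Latency, exp.MaxLatency)
	}
	return ""
}

func main() {
	exp := Expectation{StatusCode: 200, BodyContains: `"status":"ok"`, MaxLatency: 2 * time.Second}
	slow := CheckResult{StatusCode: 200, Body: `{"status":"ok"}`, Latency: 4 * time.Second}
	fmt.Println(evaluate(exp, slow)) // prints "latency 4s exceeds 2s"
}
```

The slow response passes the status and body checks but still fails, which is the case a kubelet probe with a generous timeout would report as healthy.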
Vigilmon probes from multiple geographic regions. A single probe glitch won't fire an alert — Vigilmon uses multi-region consensus, requiring that independent probes from different locations agree the service is unreachable before opening an incident. You get confident alerts with minimal false positives.
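The consensus idea can be sketched as a simple quorum rule. This is only an illustration: the article does not state how many regions must agree, so `quorum` is a parameter here, `shouldOpenIncident` is a hypothetical name, and the region `ap-south` is invented for the example.

```go
package main

import "fmt"

// shouldOpenIncident reports whether enough regions agree that the target is down.
// results maps a region name to true when the probe from that region succeeded.
// quorum is the minimum number of failing regions required (an assumed parameter).
func shouldOpenIncident(results map[string]bool, quorum int) bool {
	failing := 0
	for _, ok := range results {
		if !ok {
			failing++
		}
	}
	return failing >= quorum
}

func main() {
	oneGlitch := map[string]bool{"us-east": false, "eu-west": true, "ap-south": true}
	outage := map[string]bool{"us-east": false, "eu-west": false, "ap-south": true}

	fmt.Println(shouldOpenIncident(oneGlitch, 2)) // false: a single failing region is ignored
	fmt.Println(shouldOpenIncident(outage, 2))    // true: two regions agree
}
```

A higher quorum trades slower detection of regional problems for fewer false alarms.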
What This Catches That Your Probes Miss
| Failure Mode | kubelet probe | Vigilmon |
|---|---|---|
| Pod crash | ✓ | ✓ |
| App logic error (500) | ✗ | ✓ |
| Ingress routing broken | ✗ | ✓ |
| Cloud load balancer failure | ✗ | ✓ |
| External DNS failure | ✗ | ✓ |
| TLS certificate expired | ✗ | ✓ |
| Latency spike | ✗ | ✓ |
Step 3: Heartbeat Monitoring for CronJobs
Kubernetes CronJobs are invisible to uptime monitors — there's no endpoint to probe. A CronJob that fails, runs too slowly, or stops being scheduled reports nothing externally. The only way to know it ran is to have it report in.
Vigilmon's heartbeat monitors flip the model: instead of Vigilmon probing your service, your CronJob pings Vigilmon on completion. If the ping doesn't arrive within the expected window, Vigilmon fires an alert.
Set Up the Heartbeat Monitor
- In Vigilmon, go to Monitors → New Monitor → Heartbeat
- Set the name: nightly-report-job
- Set the expected interval: 1 day
- Set the grace period: 30 minutes (alert if ping is 30 minutes late)
- Save. You'll get a unique ping URL like https://vigilmon.online/heartbeat/abc123xyz
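The interval and grace period combine into a single deadline. The sketch below shows one plausible way to express that rule in Go; it assumes the deadline is measured from the last received ping, which the article does not confirm, and `isOverdue`, `expectedInterval`, and `gracePeriod` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

const (
	expectedInterval = 24 * time.Hour   // "1 day" from the monitor settings
	gracePeriod      = 30 * time.Minute // grace period from the monitor settings
)

// isOverdue reports whether the time since the last ping exceeds the
// expected interval plus the grace period.
func isOverdue(sinceLastPing, interval, grace time.Duration) bool {
	return sinceLastPing > interval+grace
}

func main() {
	// 10 minutes late: still inside the grace period.
	fmt.Println(isOverdue(24*time.Hour+10*time.Minute, expectedInterval, gracePeriod)) // false
	// 45 minutes late: 15 minutes past the deadline.
	fmt.Println(isOverdue(24*time.Hour+45*time.Minute, expectedInterval, gracePeriod)) // true
}
```

Size the grace period to cover normal variation in job duration, since the ping is sent when the job finishes rather than when it starts.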
Wire It Into Your CronJob
# cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
  namespace: production
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: your-registry/nightly-report:latest
              env:
                - name: VIGILMON_HEARTBEAT_URL
                  valueFrom:
                    secretKeyRef:
                      name: vigilmon-secrets
                      key: heartbeat-url
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  echo "Running nightly report..."
                  /app/run-report
                  # Only ping if the job succeeded
                  curl -fsS "$VIGILMON_HEARTBEAT_URL" > /dev/null
                  echo "Heartbeat sent."
Store the heartbeat URL as a Kubernetes Secret:
kubectl create secret generic vigilmon-secrets \
  --from-literal=heartbeat-url='https://vigilmon.online/heartbeat/abc123xyz' \
  -n production
Now if the CronJob fails, is never scheduled, or overruns its window, you'll know as soon as the grace period expires, not when someone notices stale data the next business day.
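If the job is itself a Go binary, or its image does not ship curl, the same "ping only on success" rule can live in the program. The following is a rough sketch under assumptions: `runThenPing` and `pingURL` are hypothetical helpers, the work function is a placeholder, and treating any non-2xx response as a failure only approximates `curl -f`.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"os"
	"time"
)

// errNoURL is returned when no heartbeat URL is configured.
var errNoURL = errors.New("heartbeat URL is not set")

// runThenPing runs work and calls ping only if work succeeded, mirroring the
// `set -e` shell wrapper: a failed job must never send a heartbeat.
func runThenPing(work func() error, ping func() error) error {
	if err := work(); err != nil {
		return fmt.Errorf("job failed, heartbeat skipped: %w", err)
	}
	if err := ping(); err != nil {
		return fmt.Errorf("job succeeded but heartbeat failed: %w", err)
	}
	return nil
}

// pingURL sends a GET to the heartbeat URL and treats any non-2xx status as an error.
func pingURL(url string) error {
	if url == "" {
		return errNoURL
	}
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return errors.New("unexpected heartbeat status: " + resp.Status)
	}
	return nil
}

func main() {
	url := os.Getenv("VIGILMON_HEARTBEAT_URL")
	if url == "" {
		fmt.Println("VIGILMON_HEARTBEAT_URL not set; nothing to do in this demo")
		return
	}
	work := func() error {
		// Placeholder for the real report logic.
		return nil
	}
	if err := runThenPing(work, func() error { return pingURL(url) }); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // a non-zero exit lets restartPolicy: OnFailure retry the job
	}
}
```

Passing the work and the ping as functions keeps the ordering rule testable without a network connection.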
Step 4: Multi-Cluster Monitoring Strategy
Running more than one cluster — staging, production, multiple regions? Each cluster needs its own set of monitors. Here's a scalable naming convention:
| Monitor Name | URL | Cluster |
|---|---|---|
| [prod-us] api /health | https://api.us.example.com/health | prod-us |
| [prod-eu] api /health | https://api.eu.example.com/health | prod-eu |
| [staging] api /health | https://staging-api.example.com/health | staging |
| [prod-us] CronJob: nightly-report | heartbeat | prod-us |
Group related monitors into a single Status Page in Vigilmon:
- Go to Status Pages → New Status Page
- Name it "Production Infrastructure"
- Add all prod monitors, grouped by cluster or service
- Share the URL with your SRE team and stakeholders
When a regional incident hits, you can see at a glance which clusters are affected and which are healthy.
Step 5: Configure Alert Channels
Webhook to Slack
- Create a Slack Incoming Webhook for your #alerts channel
- In Vigilmon, go to Alert Channels → New Channel → Webhook
- Paste the Slack webhook URL
- Assign the channel to all your k8s monitors
The payload Vigilmon sends on a downtime event:
{
  "monitor_name": "[prod-us] api /health",
  "status": "down",
  "url": "https://api.us.example.com/health",
  "started_at": "2026-06-30T10:23:00Z",
  "duration_seconds": 0,
  "regions_failing": ["us-east", "eu-west"]
}
The regions_failing field immediately tells you whether this is a regional issue or a global outage.
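If you route the webhook through your own tooling, the payload can be decoded and summarized in a few lines. This is a sketch that assumes the payload has exactly the shape of the example above; `Alert`, `scope`, and `summarize` are hypothetical names, and `totalRegions` is an assumption because the payload does not say how many regions probe the monitor.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Alert matches the fields of the example payload.
type Alert struct {
	MonitorName    string   `json:"monitor_name"`
	Status         string   `json:"status"`
	URL            string   `json:"url"`
	StartedAt      string   `json:"started_at"`
	RegionsFailing []string `json:"regions_failing"`
}

// scope classifies an alert as "global" when every probing region is failing,
// otherwise "regional". totalRegions must be supplied by the caller.
func scope(a Alert, totalRegions int) string {
	if len(a.RegionsFailing) >= totalRegions {
		return "global"
	}
	return "regional"
}

// summarize decodes a payload and renders a one-line message.
func summarize(payload []byte, totalRegions int) (string, error) {
	var a Alert
	if err := json.Unmarshal(payload, &a); err != nil {
		return "", fmt.Errorf("decode alert: %w", err)
	}
	return fmt.Sprintf("%s is %s (%s: %s)", a.MonitorName, a.Status,
		scope(a, totalRegions), strings.Join(a.RegionsFailing, ", ")), nil
}

func main() {
	payload := []byte(`{"monitor_name":"[prod-us] api /health","status":"down",
		"url":"https://api.us.example.com/health","started_at":"2026-06-30T10:23:00Z",
		"duration_seconds":0,"regions_failing":["us-east","eu-west"]}`)
	msg, err := summarize(payload, 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(msg) // prints "[prod-us] api /health is down (regional: us-east, eu-west)"
}
```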
Auto-Correlation With kubectl
Save this script to your runbook — it's the first thing to run when a Vigilmon alert fires:
#!/bin/bash
# k8s-triage.sh — run immediately on Vigilmon alert
NAMESPACE=${1:-production}
APP=${2:-my-api}
echo "=== Pod Status ==="
kubectl get pods -n "$NAMESPACE" -l app="$APP"
echo "=== Endpoints (are pods registered?) ==="
kubectl get endpoints "$APP" -n "$NAMESPACE"
echo "=== Recent Events ==="
kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' | tail -20
echo "=== Ingress Status ==="
kubectl get ingress -n "$NAMESPACE"
echo "=== Test from inside cluster ==="
kubectl run tmp-curl --image=curlimages/curl --rm -it --restart=Never \
  --namespace="$NAMESPACE" -- \
  curl -s "http://$APP.$NAMESPACE.svc.cluster.local/health"
If the in-cluster curl succeeds but Vigilmon shows the service down, the problem is external — check your cloud load balancer, Ingress controller, and DNS. If the in-cluster curl also fails, the problem is internal — check your pods and services.
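The decision rule in that paragraph can be written down as a small function, which is handy if you want a triage tool to print the next step. This is a sketch with a hypothetical name, `diagnose`; the two branches where the external check succeeds are not covered by the text above and are added here as assumptions.

```go
package main

import "fmt"

// diagnose maps the two observations (in-cluster curl, external monitor)
// to the layer that most likely needs attention.
func diagnose(inClusterOK, externalOK bool) string {
	switch {
	case inClusterOK && !externalOK:
		return "external: check cloud load balancer, Ingress controller, and DNS"
	case !inClusterOK && !externalOK:
		return "internal: check pods and services"
	case !inClusterOK && externalOK:
		// Assumption: not covered by the article; the in-cluster test itself is suspect.
		return "inconsistent: verify the Service name and port used by the in-cluster test"
	default:
		// Assumption: both paths work, so the incident may already have recovered.
		return "healthy: both paths succeed"
	}
}

func main() {
	fmt.Println(diagnose(true, false))
	fmt.Println(diagnose(false, false))
}
```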
Summary
Kubernetes internal probes tell you when pods need to be restarted. Vigilmon tells you when your users can't reach your service. You need both.
| Layer | Tool | Scope |
|---|---|---|
| Pod health | liveness probe | Per-pod, in-cluster |
| Traffic routing | readiness probe | Per-pod, in-cluster |
| External reachability | Vigilmon HTTP monitor | Full external path |
| DNS + TLS | Vigilmon HTTP monitor | Full external path |
| CronJob execution | Vigilmon heartbeat | Job completion |
| Latency SLOs | Vigilmon response time threshold | External latency |
Get started free at vigilmon.online — your first monitor is running in under two minutes, and the free tier covers clusters of any size.