Kubernetes Pod Health Monitoring: Beyond Liveness and Readiness Probes

Kubernetes gives you livenessProbe and readinessProbe out of the box. They restart crashed pods and pull unhealthy ones from service rotation. So why do production Kubernetes clusters still experience outages that engineers only discover from user complaints?

Because internal probes only see what the node sees. They cannot tell you whether the external load balancer is routing traffic correctly, whether DNS resolves outside the cluster, or whether an Ingress controller update quietly broke path routing. Vigilmon fills that gap with external monitoring from multiple geographic regions — the view your users have, not the view your kubelet has.

Why Internal Probes Are Not Enough

Liveness and readiness probes are essential, but their architecture has hard limits:

They only probe inside the cluster network. A livenessProbe.httpGet check runs from the kubelet on the same node as the pod. It traverses none of the external infrastructure that real traffic uses: cloud load balancers, external DNS, ingress controllers, TLS termination, CDN edge nodes.

They report per-pod, not per-service. A pod can be Ready while the Service pointing to it has incorrect selectors, the LoadBalancer IP has been de-provisioned, or a NodePort has stopped binding after a node reboot.

They cannot detect external DNS failures. If your api.example.com DNS record stops resolving — misconfiguration, registrar issue, TTL quirk — Kubernetes reports all pods healthy while every external client sees connection errors.

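They have no concept of latency SLOs. A probe only passes or fails against a single timeoutSeconds cutoff (1 second by default), so a response that returns 200 just inside that limit counts as fully healthy, and a generous timeout hides slow responses entirely. Vigilmon lets you configure response time thresholds and alert when latency degrades, even before the service is technically "down."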
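To make the difference concrete, here is a minimal sketch of what an external check exercises that a kubelet probe does not. It uses Go's standard net/http/httptrace package to time DNS resolution, the TLS handshake, and time to first byte for one request. The names Timing, probe, and exceeds are illustrative, the 2-second threshold is borrowed from Step 2, and none of this is Vigilmon's actual implementation.

```go
package main

import (
    "crypto/tls"
    "fmt"
    "net/http"
    "net/http/httptrace"
    "os"
    "time"
)

// Timing records how long each externally visible stage of a request took.
type Timing struct {
    DNS  time.Duration // name resolution (zero if the host is an IP or cached)
    TLS  time.Duration // TLS handshake (zero for plain HTTP)
    TTFB time.Duration // request start to first response byte
    Code int           // HTTP status code
}

// probe issues one GET and times the stages an in-cluster probe never touches.
func probe(url string, timeout time.Duration) (Timing, error) {
    var t Timing
    var dnsStart, tlsStart time.Time
    start := time.Now()

    trace := &httptrace.ClientTrace{
        DNSStart:             func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
        DNSDone:              func(httptrace.DNSDoneInfo) { t.DNS = time.Since(dnsStart) },
        TLSHandshakeStart:    func() { tlsStart = time.Now() },
        TLSHandshakeDone:     func(tls.ConnectionState, error) { t.TLS = time.Since(tlsStart) },
        GotFirstResponseByte: func() { t.TTFB = time.Since(start) },
    }

    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return t, err
    }
    req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

    client := &http.Client{Timeout: timeout}
    resp, err := client.Do(req)
    if err != nil {
        return t, err
    }
    defer resp.Body.Close()
    t.Code = resp.StatusCode
    return t, nil
}

// exceeds reports whether time to first byte breaches the latency threshold.
func exceeds(t Timing, threshold time.Duration) bool {
    return t.TTFB > threshold
}

func main() {
    if len(os.Args) < 2 {
        fmt.Println("usage: probe <url>")
        return
    }
    t, err := probe(os.Args[1], 10*time.Second)
    if err != nil {
        fmt.Println("unreachable:", err)
        return
    }
    fmt.Printf("status=%d dns=%v tls=%v ttfb=%v slow=%v\n",
        t.Code, t.DNS, t.TLS, t.TTFB, exceeds(t, 2*time.Second))
}
```

Run it from a machine outside the cluster against your public hostname. The default client follows redirects, so with a redirecting URL the recorded timings reflect the last hop.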

What You'll Set Up

External HTTP monitoring for k8s Ingress endpoints
Heartbeat monitoring for Kubernetes CronJobs
Multi-cluster coverage strategy
Alert routing to Slack or PagerDuty

You'll need a free Vigilmon account — no credit card required.

Step 1: Add a Proper Health Endpoint to Your Service

Before configuring external monitoring, give your service a /health endpoint that checks real dependencies, not just that the process is running.

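// main.go: Go example with dependency checks
package main

import (
    "context"
    "database/sql"
    "encoding/json"
    "log"
    "net/http"
    "os"
    "time"

    _ "github.com/lib/pq" // registers the "postgres" driver
)

// HealthResponse is the JSON body returned by /health.
type HealthResponse struct {
    Status string            `json:"status"`
    Checks map[string]string `json:"checks"`
    Uptime float64           `json:"uptime_seconds"`
}

var startTime = time.Now()

func healthHandler(db *sql.DB) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        checks := map[string]string{}
        status := "ok"

        // Bound the database check so a hung connection cannot stall the endpoint.
        ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
        defer cancel()

        if err := db.PingContext(ctx); err != nil {
            checks["database"] = "error: " + err.Error()
            status = "degraded"
        } else {
            checks["database"] = "ok"
        }

        // Any failed dependency turns the response into a 503.
        code := http.StatusOK
        if status != "ok" {
            code = http.StatusServiceUnavailable
        }

        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(code)
        json.NewEncoder(w).Encode(HealthResponse{
            Status: status,
            Checks: checks,
            Uptime: time.Since(startTime).Seconds(),
        })
    }
}

func main() {
    // DATABASE_URL and port 8080 are placeholders; use your own configuration.
    db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    http.HandleFunc("/health", healthHandler(db))
    log.Fatal(http.ListenAndServe(":8080", nil))
}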

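Both your in-cluster probe and Vigilmon can hit /health, but they reach it over entirely different network paths. Because this endpoint checks dependencies, it is usually better suited to a readiness probe than a liveness probe: with a liveness probe, a database outage would make the kubelet restart every pod, which does nothing to fix the database.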

Step 2: Monitor Your Kubernetes Ingress Endpoint

The Ingress is where most external traffic enters your cluster and where silent failures most commonly occur. Set up an HTTP monitor targeting your public hostname:

Log in to vigilmon.online and go to Monitors → New Monitor
Choose HTTP / HTTPS
Set the URL to your ingress hostname: https://api.yourdomain.com/health
Set the check interval to 1 minute
Under Expected response, set:
- Status code: 200
- Response body contains: "status":"ok"
- Response time threshold: 2000ms (alert if latency exceeds 2 seconds)
Save the monitor
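The three expectations above amount to a small predicate over the response. The sketch below shows one way to express it; the Expectation type and evaluate function are hypothetical names for illustration, not part of any Vigilmon API.

```go
package main

import (
    "fmt"
    "strings"
    "time"
)

// Expectation mirrors the three settings from the monitor form.
type Expectation struct {
    StatusCode   int
    BodyContains string
    MaxLatency   time.Duration
}

// evaluate returns one message per failed expectation; an empty result means
// the check passed.
func evaluate(exp Expectation, status int, body string, elapsed time.Duration) []string {
    var failures []string
    if status != exp.StatusCode {
        failures = append(failures, fmt.Sprintf("status %d, want %d", status, exp.StatusCode))
    }
    if !strings.Contains(body, exp.BodyContains) {
        failures = append(failures, fmt.Sprintf("body does not contain %s", exp.BodyContains))
    }
    if elapsed > exp.MaxLatency {
        failures = append(failures, fmt.Sprintf("took %v, limit %v", elapsed, exp.MaxLatency))
    }
    return failures
}

func main() {
    exp := Expectation{StatusCode: 200, BodyContains: `"status":"ok"`, MaxLatency: 2 * time.Second}

    // A degraded response, as the Step 1 handler would produce on a database failure.
    body := `{"status":"degraded","checks":{"database":"error: timeout"},"uptime_seconds":12.5}`
    for _, f := range evaluate(exp, 503, body, 2500*time.Millisecond) {
        fmt.Println("FAIL:", f)
    }
}
```

The body match is a plain substring test, so it is whitespace-sensitive. The Step 1 handler emits compact JSON, which matches "status":"ok" exactly; a pretty-printed body with a space after the colon would not.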

Vigilmon probes from multiple geographic regions. A single probe glitch won't fire an alert — Vigilmon uses multi-region consensus, requiring that independent probes from different locations agree the service is unreachable before opening an incident. You get confident alerts with minimal false positives.
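As a rough illustration of the consensus idea, the sketch below opens an incident only when a minimum number of regions report a failure. This is an assumption about how such a rule could work, not Vigilmon's documented algorithm; the function name, the quorum of 2, and the ap-south region are hypothetical.

```go
package main

import "fmt"

// shouldOpenIncident reports whether at least quorum regions saw the check fail.
// The map value is true when that region's probe failed.
func shouldOpenIncident(failed map[string]bool, quorum int) bool {
    count := 0
    for _, down := range failed {
        if down {
            count++
        }
    }
    return count >= quorum
}

func main() {
    // One region glitches while the others succeed: no incident.
    glitch := map[string]bool{"us-east": true, "eu-west": false, "ap-south": false}
    fmt.Println("single-region glitch:", shouldOpenIncident(glitch, 2))

    // Two independent regions agree: open an incident.
    outage := map[string]bool{"us-east": true, "eu-west": true, "ap-south": false}
    fmt.Println("two regions failing:", shouldOpenIncident(outage, 2))
}
```

A higher quorum trades detection speed for fewer false positives; a quorum of 1 degenerates to single-probe alerting.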

What This Catches That Your Probes Miss

| Failure Mode | kubelet probe | Vigilmon |
|---|---|---|
| Pod crash | ✓ | ✓ |
| App logic error (500) | ✗ | ✓ |
| Ingress routing broken | ✗ | ✓ |
| Cloud load balancer failure | ✗ | ✓ |
| External DNS failure | ✗ | ✓ |
| TLS certificate expired | ✗ | ✓ |
| Latency spike | ✗ | ✓ |

Step 3: Heartbeat Monitoring for CronJobs

Kubernetes CronJobs are invisible to uptime monitors — there's no endpoint to probe. A CronJob that fails, runs too slowly, or stops being scheduled reports nothing externally. The only way to know it ran is to have it report in.

Vigilmon's heartbeat monitors flip the model: instead of Vigilmon probing your service, your CronJob pings Vigilmon on completion. If the ping doesn't arrive within the expected window, Vigilmon fires an alert.

Set Up the Heartbeat Monitor

In Vigilmon, go to Monitors → New Monitor → Heartbeat
Set the name: nightly-report-job
Set the expected interval: 1 day
Set the grace period: 30 minutes (alert if ping is 30 minutes late)
Save — you'll get a unique ping URL like https://vigilmon.online/heartbeat/abc123xyz
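The interval and grace period combine into a single deadline. A minimal sketch of that rule, assuming an alert fires once the time since the last ping exceeds interval plus grace (Vigilmon's exact timing rules are not specified here, and the function name is hypothetical):

```go
package main

import (
    "fmt"
    "time"
)

// overdue reports whether a heartbeat is late: more than interval + grace has
// elapsed since the last ping arrived.
func overdue(sinceLastPing, interval, grace time.Duration) bool {
    return sinceLastPing > interval+grace
}

func main() {
    lastPing := time.Date(2026, 6, 29, 2, 5, 0, 0, time.UTC) // illustrative timestamp
    interval, grace := 24*time.Hour, 30*time.Minute

    checks := []time.Time{
        lastPing.Add(24*time.Hour + 10*time.Minute), // late, but inside the grace period
        lastPing.Add(24*time.Hour + 31*time.Minute), // past the grace period
    }
    for _, now := range checks {
        fmt.Println(now.Format(time.RFC3339), "overdue:", overdue(now.Sub(lastPing), interval, grace))
    }
}
```

Size the grace period to cover normal variation in the job's run time; a report that usually takes 5 minutes but occasionally takes 20 fits comfortably in a 30-minute window.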

Wire It Into Your CronJob

# cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
  namespace: production
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: your-registry/nightly-report:latest
              env:
                - name: VIGILMON_HEARTBEAT_URL
                  valueFrom:
                    secretKeyRef:
                      name: vigilmon-secrets
                      key: heartbeat-url
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  echo "Running nightly report..."
                  /app/run-report
                  # Only ping if the job succeeded
                  curl -fsS "$VIGILMON_HEARTBEAT_URL" > /dev/null
                  echo "Heartbeat sent."

Store the heartbeat URL as a Kubernetes Secret:

kubectl create secret generic vigilmon-secrets \
  --from-literal=heartbeat-url='https://vigilmon.online/heartbeat/abc123xyz' \
  -n production

Now if the CronJob fails, is mis-scheduled, or takes too long, Vigilmon alerts you as soon as the grace period expires, not when someone notices stale data the next business day.
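If the job image has no shell or curl (a distroless image, for example), the same ping-on-success pattern can live in the job binary itself. The sketch below is one way to do it in Go; runReport is a hypothetical stand-in for the real work, and only the VIGILMON_HEARTBEAT_URL variable comes from the manifest above.

```go
package main

import (
    "fmt"
    "net/http"
    "os"
    "time"
)

// runThenPing runs job and calls ping only if job succeeded, mirroring the
// `set -e` followed by curl in the shell version.
func runThenPing(job, ping func() error) error {
    if err := job(); err != nil {
        return fmt.Errorf("job failed, heartbeat skipped: %w", err)
    }
    if err := ping(); err != nil {
        return fmt.Errorf("job succeeded but heartbeat failed: %w", err)
    }
    return nil
}

// pingURL returns a function that sends the heartbeat GET. Any non-2xx status
// is an error, which is slightly stricter than curl -f (fails on 400 and above).
func pingURL(url string) func() error {
    return func() error {
        client := &http.Client{Timeout: 10 * time.Second}
        resp, err := client.Get(url)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode < 200 || resp.StatusCode > 299 {
            return fmt.Errorf("unexpected status %s", resp.Status)
        }
        return nil
    }
}

// runReport is a placeholder for the real report logic.
func runReport() error {
    fmt.Println("Running nightly report...")
    return nil
}

func main() {
    url := os.Getenv("VIGILMON_HEARTBEAT_URL")
    if url == "" {
        fmt.Fprintln(os.Stderr, "VIGILMON_HEARTBEAT_URL is not set")
        os.Exit(1)
    }
    if err := runThenPing(runReport, pingURL(url)); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1) // non-zero exit marks the Job attempt as failed
    }
    fmt.Println("Heartbeat sent.")
}
```

As in the shell version, a failed ping fails the container, so with restartPolicy: OnFailure the whole job reruns. If the report is not safe to repeat, log the ping failure and exit zero instead.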

Step 4: Multi-Cluster Monitoring Strategy

Running more than one cluster — staging, production, multiple regions? Each cluster needs its own set of monitors. Here's a scalable naming convention:

| Monitor Name | URL | Cluster |
|---|---|---|
| [prod-us] api /health | https://api.us.example.com/health | prod-us |
| [prod-eu] api /health | https://api.eu.example.com/health | prod-eu |
| [staging] api /health | https://staging-api.example.com/health | staging |
| [prod-us] CronJob: nightly-report | heartbeat | prod-us |
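With more than a handful of clusters, it helps to generate these names rather than type them. A small sketch, assuming you keep a list of clusters and public hostnames somewhere in code or config (the Cluster type and helper names are hypothetical):

```go
package main

import "fmt"

// Cluster pairs a cluster label with the public hostname of its API.
type Cluster struct {
    Name string
    Host string
}

// monitorName builds a name in the "[cluster] service path" convention.
func monitorName(cluster, service, path string) string {
    return fmt.Sprintf("[%s] %s %s", cluster, service, path)
}

// monitorURL builds the HTTPS URL that the external monitor should target.
func monitorURL(host, path string) string {
    return "https://" + host + path
}

func main() {
    clusters := []Cluster{
        {Name: "prod-us", Host: "api.us.example.com"},
        {Name: "prod-eu", Host: "api.eu.example.com"},
        {Name: "staging", Host: "staging-api.example.com"},
    }
    for _, c := range clusters {
        fmt.Printf("%s -> %s\n", monitorName(c.Name, "api", "/health"), monitorURL(c.Host, "/health"))
    }
}
```

Putting the cluster label first means an alphabetical monitor list groups by cluster, which is the view you want during a regional incident.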

Group related monitors into a single Status Page in Vigilmon:

Go to Status Pages → New Status Page
Name it "Production Infrastructure"
Add all prod monitors, grouped by cluster or service
Share the URL with your SRE team and stakeholders

When a regional incident hits, you can see at a glance which clusters are affected and which are healthy.

Step 5: Configure Alert Channels

Webhook to Slack

Create a Slack Incoming Webhook for your #alerts channel
In Vigilmon, go to Alert Channels → New Channel → Webhook
Paste the Slack webhook URL
Assign the channel to all your k8s monitors

The payload Vigilmon sends on a downtime event:

{
  "monitor_name": "[prod-us] api /health",
  "status": "down",
  "url": "https://api.us.example.com/health",
  "started_at": "2026-06-30T10:23:00Z",
  "duration_seconds": 0,
  "regions_failing": ["us-east", "eu-west"]
}

The regions_failing field immediately tells you whether this is a regional issue or a global outage.
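Slack incoming webhooks expect a JSON body with a text field (or blocks), not an arbitrary payload. Whether Vigilmon's webhook channel formats messages for Slack is not stated here, so if alerts do not appear in your channel, a small relay can translate the payload. The sketch below parses the payload shown above and renders a Slack-compatible body; the Alert type, the function names, and the total region count of 3 are assumptions for illustration.

```go
package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

// Alert mirrors the fields of the webhook payload shown above.
type Alert struct {
    MonitorName     string   `json:"monitor_name"`
    Status          string   `json:"status"`
    URL             string   `json:"url"`
    StartedAt       string   `json:"started_at"`
    DurationSeconds int      `json:"duration_seconds"`
    RegionsFailing  []string `json:"regions_failing"`
}

// parseAlert decodes a raw webhook body into an Alert.
func parseAlert(raw []byte) (Alert, error) {
    var a Alert
    err := json.Unmarshal(raw, &a)
    return a, err
}

// scope labels an alert "global" when every probing region fails and
// "regional" otherwise. totalRegions is an assumption: the payload does not
// say how many regions probe the monitor.
func scope(a Alert, totalRegions int) string {
    switch n := len(a.RegionsFailing); {
    case n == 0:
        return "none"
    case n >= totalRegions:
        return "global"
    default:
        return "regional"
    }
}

// slackMessage renders the alert as the {"text": ...} body that Slack
// incoming webhooks accept.
func slackMessage(a Alert, totalRegions int) ([]byte, error) {
    text := fmt.Sprintf("%s is %s (scope: %s; failing regions: %s)",
        a.MonitorName, a.Status, scope(a, totalRegions), strings.Join(a.RegionsFailing, ", "))
    return json.Marshal(map[string]string{"text": text})
}

func main() {
    raw := []byte(`{
      "monitor_name": "[prod-us] api /health",
      "status": "down",
      "url": "https://api.us.example.com/health",
      "started_at": "2026-06-30T10:23:00Z",
      "duration_seconds": 0,
      "regions_failing": ["us-east", "eu-west"]
    }`)

    a, err := parseAlert(raw)
    if err != nil {
        fmt.Println("bad payload:", err)
        return
    }
    msg, err := slackMessage(a, 3) // 3 probing regions is an assumed value
    if err != nil {
        fmt.Println("encode error:", err)
        return
    }
    fmt.Println(string(msg))
}
```

In a real relay, these functions would sit behind an HTTP handler that receives the Vigilmon webhook and POSTs the result to the Slack webhook URL.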

Auto-Correlation With kubectl

Save this script to your runbook — it's the first thing to run when a Vigilmon alert fires:

#!/bin/bash
# k8s-triage.sh: run immediately on Vigilmon alert
# Usage: ./k8s-triage.sh [namespace] [app]
# Assumes the Service is named after the app label and listens on port 80.
NAMESPACE="${1:-production}"
APP="${2:-my-api}"

echo "=== Pod Status ==="
kubectl get pods -n "$NAMESPACE" -l app="$APP"

echo "=== Endpoints (are pods registered?) ==="
kubectl get endpoints "$APP" -n "$NAMESPACE"

echo "=== Recent Events ==="
kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' | tail -20

echo "=== Ingress Status ==="
kubectl get ingress -n "$NAMESPACE"

echo "=== Test from inside cluster ==="
# -sS keeps curl quiet but still prints errors; --max-time stops a hung request.
kubectl run tmp-curl --image=curlimages/curl --rm -it --restart=Never \
  --namespace="$NAMESPACE" -- \
  curl -sS --max-time 10 "http://$APP.$NAMESPACE.svc.cluster.local/health"

If the in-cluster curl succeeds but Vigilmon shows the service down, the problem is external — check your cloud load balancer, Ingress controller, and DNS. If the in-cluster curl also fails, the problem is internal — check your pods and services.

Summary

Kubernetes internal probes tell you when pods need to be restarted. Vigilmon tells you when your users can't reach your service. You need both.

| Layer | Tool | Scope |
|---|---|---|
| Pod health | liveness probe | Per-pod, in-cluster |
| Traffic routing | readiness probe | Per-pod, in-cluster |
| External reachability | Vigilmon HTTP monitor | Full external path |
| DNS + TLS | Vigilmon HTTP monitor | Full external path |
| CronJob execution | Vigilmon heartbeat | Job completion |
| Latency SLOs | Vigilmon response time threshold | External latency |

Get started free at vigilmon.online — your first monitor is running in under two minutes, and the free tier covers clusters of any size.