Kubernetes has Prometheus, Grafana, liveness probes, readiness probes, and a rich ecosystem of internal observability tooling. But there's a class of failure that none of those tools will catch: the moment when everything inside your cluster looks green while your users see downtime.
This guide explains why external monitoring is essential even when you're running a full Prometheus stack, and how to set up Vigilmon to provide that outside-in view for your Kubernetes workloads.
Why Internal Kubernetes Monitoring Isn't Enough
When a Kubernetes cluster fails in a way your users notice, it almost never looks like a pod crash. Pod crashes are easy — Kubernetes restarts them, and if Prometheus is running, you'll see the restart counter tick.
The failures that cause real outages tend to look like this:
Ingress controller misconfiguration. After a Helm upgrade, your nginx-ingress or Traefik controller reloads its config with an error. Pods are Running. Services exist. But the HTTP routing broke. kubectl get pods -n ingress-nginx shows green. Your users get 502s.
LoadBalancer IP provisioning failure. A cloud provider hiccup or quota limit means a Service of type LoadBalancer never gets an external IP. The pod is Running. The Service exists, with its EXTERNAL-IP stuck at <pending>. There's just no external endpoint. Nothing in Kubernetes treats this as a failure: from its perspective, provisioning is simply still in progress.
SSL certificate expiry. cert-manager renews certificates automatically, but renewal can fail due to DNS propagation issues, ACME rate limits, or misconfigured ClusterIssuer resources. The certificate expires. Your users get TLS errors. Prometheus has no opinion about this unless you've specifically set up custom certificate expiry alerting.
DNS resolution failure. Your cluster-external DNS record for the service stops resolving. Maybe a TTL expired and a stale record wasn't cleaned up. Maybe a Cloudflare zone got misconfigured after an infrastructure change. The pods are fine. The cluster is fine. Users can't reach the service.
Cross-region load balancer routing. A geo-DNS or Anycast routing change silently drops traffic for users in a specific region. Your internal monitoring, running inside the same region or inside the cluster, sees no problem.
The pattern: Kubernetes reports everything green while users see downtime. External monitoring — from outside your cluster, from multiple geographic regions — is the only reliable way to catch this class of failure.
What External Monitoring Covers That Internal Monitoring Doesn't
| Failure Type | Prometheus / Internal | External Monitoring (Vigilmon) |
|---|---|---|
| Pod crash / restart | ✅ | ✅ |
| Container OOMKill | ✅ | ✅ |
| Ingress routing broken | ❌ | ✅ |
| LoadBalancer IP missing | ❌ | ✅ |
| SSL certificate expired | ❌ without custom rules | ✅ |
| DNS resolution failed | ❌ | ✅ |
| Node unreachable from internet | ❌ | ✅ |
| CDN or edge routing failure | ❌ | ✅ |
| Cloud provider network partition | ❌ | ✅ |
External monitoring and internal monitoring are additive. Running Prometheus doesn't make external uptime monitoring redundant — they catch entirely different failure modes.
What to Monitor Externally for a Kubernetes Deployment
1. Ingress HTTP Endpoints
Your primary HTTP health check target should be the publicly-routed URL of your service — the same URL your users hit. Not the pod IP, not the cluster IP, not the NodePort. The full external URL through your Ingress controller.
For a service at api.example.com:
Monitor: https://api.example.com/health
Type: HTTP
Method: GET
Expected status: 200
Body match: {"status":"ok"} (optional)
This check exercises:
- DNS resolution of api.example.com
- Your CDN or load balancer routing
- The Ingress controller's routing rules
- The service endpoint connectivity
- The pod's actual HTTP handler
A single Vigilmon HTTP monitor on this URL catches failures at every layer of that stack.
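To make the layers concrete, here is a minimal sketch of the same check done by hand with curl from any machine outside the cluster. It is not Vigilmon's implementation; the function names `evaluate_response` and `probe` are hypothetical, and the status and body expectations mirror the example monitor above.

```shell
# Sketch of an outside-in HTTP check. Function names are illustrative only.

# evaluate_response STATUS BODY EXPECTED_STATUS [EXPECTED_SUBSTRING]
# Prints "up" or "down" from the status code and an optional body match.
evaluate_response() {
  local status="$1" body="$2" want_status="$3" want_body="${4:-}"
  if [ "$status" != "$want_status" ]; then
    echo "down"; return 1
  fi
  if [ -n "$want_body" ]; then
    case "$body" in
      *"$want_body"*) ;;                 # substring found, keep going
      *) echo "down"; return 1 ;;
    esac
  fi
  echo "up"
}

# probe URL [EXPECTED_STATUS] [EXPECTED_SUBSTRING]
# Fetches the URL from wherever this script runs and evaluates the result.
probe() {
  local url="$1" tmp status body
  tmp="$(mktemp)" || return 2
  # curl reports 000 when DNS, TCP, or TLS fails before any HTTP response.
  status="$(curl -sS --max-time 10 -o "$tmp" -w '%{http_code}' "$url" 2>/dev/null)"
  body="$(cat "$tmp")"; rm -f "$tmp"
  evaluate_response "$status" "$body" "${2:-200}" "${3:-}"
}
```

Running `probe https://api.example.com/health 200 '"status":"ok"'` from a laptop or a VM in another network prints `up` only if every layer in the list above worked for that one request.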
2. SSL Certificate Monitoring
cert-manager handles automatic certificate renewal, but renewal failures are silent until the certificate expires. Add an SSL expiry check that alerts you with enough lead time to intervene:
Monitor: https://api.example.com
Type: HTTP (SSL monitoring enabled)
SSL expiry warning: 21 days before expiry
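This gives you three weeks to diagnose a renewal failure before users start seeing TLS errors. With cert-manager's default settings, renewal is attempted once about a third of the certificate's lifetime remains (30 days for a 90-day Let's Encrypt certificate), so a certificate with only 21 days left means renewal has already been failing for over a week. On a daily check cadence, you'll catch that well before it becomes an outage.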
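For a quick manual cross-check of what an expiry monitor sees, the following sketch reads the certificate a host actually serves and computes the days remaining. It assumes the openssl CLI and GNU date (for `-d`); the function names are hypothetical and this is not how Vigilmon itself performs the check.

```shell
# Sketch: days remaining on the certificate presented by a public endpoint.

# days_between NOW_EPOCH EXPIRY_EPOCH
# Prints whole days from now to expiry, truncated toward zero.
days_between() {
  echo $(( ($2 - $1) / 86400 ))
}

# cert_days_left HOST [PORT]
cert_days_left() {
  local host="$1" port="${2:-443}" not_after expiry
  # Read notAfter from the leaf certificate served for this SNI name.
  not_after="$(echo | openssl s_client -servername "$host" -connect "$host:$port" 2>/dev/null \
    | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)"
  [ -n "$not_after" ] || return 1       # handshake failed or no certificate
  expiry="$(date -d "$not_after" +%s)" || return 1
  days_between "$(date +%s)" "$expiry"
}
```

For example, `[ "$(cert_days_left api.example.com)" -lt 21 ] && echo "renewal is overdue"` applies the same 21-day threshold as the monitor.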
3. TCP Checks for Non-HTTP Services
Not everything runs over HTTP. Databases, message brokers, and SMTP services need TCP-level checks:
Monitor: postgres.example.com:5432
Type: TCP
For a Kubernetes service exposed via LoadBalancer (for a database accessed by external applications) or NodePort, a TCP check confirms the port is reachable and the connection handshake succeeds. It won't test authentication or query execution — but it will tell you immediately if the pod stopped listening or if the service routing broke.
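A TCP check of this kind can be approximated locally as follows. This is a sketch that assumes bash (for its `/dev/tcp` redirection) and the coreutils `timeout` command are installed; `tcp_check` is a hypothetical helper name.

```shell
# Sketch of a TCP reachability check.
# tcp_check HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if the TCP handshake completes, non-zero otherwise.
tcp_check() {
  local host="$1" port="$2" timeout_s="${3:-5}"
  # Opening /dev/tcp/HOST/PORT makes bash call connect(); timeout bounds
  # the case where packets are silently dropped and connect() hangs.
  timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

`tcp_check postgres.example.com 5432 && echo reachable` confirms only that something accepted the connection, which is the same limit described above: no authentication and no query is attempted.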
4. API Server Health (For Self-Managed Clusters)
If you run a self-managed Kubernetes cluster where the API server is externally accessible (or you've exposed it through a bastion), the API server's /readyz endpoint is a useful top-level health check:
Monitor: https://api.k8s.example.com/readyz
Type: HTTP
Expected status: 200
This is most relevant for teams managing their own clusters. For EKS, GKE, and AKS, the API server is managed by the cloud provider and this check is usually unnecessary.
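When the /readyz check does fail, the endpoint's `?verbose` form lists each sub-check as a `[+]name ok` or `[-]name failed` line. The sketch below pulls out the failing ones; `failed_checks` is a hypothetical helper, and the hostname is the example one used above.

```shell
# Sketch: list the failing sub-checks from /readyz?verbose output.
# failed_checks reads the verbose output on stdin and prints one failing
# check name per line (nothing if every check passed).
failed_checks() {
  sed -n 's/^\[-\]\([^ ]*\) failed.*/\1/p'
}
```

Typical use would be `curl -s "https://api.k8s.example.com/readyz?verbose" | failed_checks`, assuming the endpoint is reachable without credentials from where you run it.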
5. Cronjob Heartbeats
Kubernetes CronJobs are a common source of silent failures. If a CronJob fails (due to an image pull failure, a resource quota breach, or an error in the job script), it produces no user-visible symptoms until the accumulation of missed jobs becomes a real problem.
Heartbeat monitoring inverts the check. Instead of probing the CronJob, Vigilmon waits for the job to report success. Configure the job to send a request to a Vigilmon heartbeat URL after each successful run. If the ping doesn't arrive within the configured window, the alert fires. Note that a logic error that still exits 0 will still send the ping, so have the job verify its own output and exit non-zero when the result is wrong.
# In your CronJob container command:
run-my-job.sh && curl -sf https://heartbeat.vigilmon.online/YOUR_HEARTBEAT_ID
This catches:
- Jobs that stop running entirely (scheduler issue, resource quota)
- Jobs that run but exit with an error
- Jobs that run but don't complete within the expected time window
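If the one-liner grows unwieldy, the same pattern can be wrapped in a small function. This is a sketch under assumptions: `run_with_heartbeat` is a hypothetical helper, and the retry and timeout values are illustrative rather than anything Vigilmon prescribes.

```shell
# Sketch: run a job and ping the heartbeat URL only when it succeeds.
# run_with_heartbeat HEARTBEAT_URL COMMAND [ARGS...]
# Returns the job's own exit code, whether or not the ping went through.
run_with_heartbeat() {
  local url="$1" rc=0
  shift
  "$@" || rc=$?
  if [ "$rc" -eq 0 ]; then
    # --retry covers transient network errors so that a flaky ping is not
    # mistaken for a failed job; a ping failure is logged, not fatal.
    curl -fsS --max-time 10 --retry 3 "$url" >/dev/null \
      || echo "heartbeat ping failed" >&2
  fi
  return "$rc"
}
```

Usage would look like `run_with_heartbeat https://heartbeat.vigilmon.online/YOUR_HEARTBEAT_ID run-my-job.sh`. Returning the job's exit code keeps the Kubernetes Job status accurate even when the ping itself cannot be delivered.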
Setting Up Vigilmon for a Kubernetes Deployment
Step 1: Create a Vigilmon Account
Sign up at vigilmon.online. The free tier gives you 5 monitors with no credit card and no trial expiry. For a standard Kubernetes deployment, 5 monitors cover the critical path:
- Primary HTTP endpoint
- SSL certificate
- Secondary HTTP endpoint (admin or internal API)
- TCP check for a critical database
- Heartbeat for your most critical CronJob
Step 2: Add Your Primary HTTP Monitor
In the Vigilmon dashboard, click "Add Monitor":
- Type: HTTP
- URL: https://api.example.com/health
- Check interval: 1 minute (or 5 minutes on free tier)
- Alert threshold: 2 consecutive failures before alert (reduces noise from transient failures)
Vigilmon's multi-region consensus means this check dispatches from multiple probe locations simultaneously. An alert only fires when a majority of probes independently confirm the failure — not when a single probe has a bad second.
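The "2 consecutive failures" threshold from the settings above can be pictured as a streak counter over successive check results. The sketch below is an assumption about the general technique, not Vigilmon's code; `should_alert` is a hypothetical name.

```shell
# Sketch: alert only after N consecutive failed checks.
# should_alert THRESHOLD RESULT...
# RESULTs are "up" or "down", ordered oldest to newest.
# Returns 0 when the most recent THRESHOLD results are all "down".
should_alert() {
  local threshold="$1" streak=0 r
  shift
  for r in "$@"; do
    if [ "$r" = "down" ]; then
      streak=$((streak + 1))
    else
      streak=0                         # any success resets the streak
    fi
  done
  [ "$streak" -ge "$threshold" ]
}
```

With a threshold of 2, the sequence `up down down` alerts while `down up down` does not, which is how a single transient failure gets filtered out.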
Step 3: Configure SSL Monitoring
If you're using HTTPS (you should be), enable SSL certificate monitoring on the same monitor. Set the expiry warning to 21 days — this gives you three full weeks to respond to a cert-manager renewal failure before users are affected.
Step 4: Add TCP Checks for Database Services
For each critical TCP service:
- Type: TCP
- Host: The external hostname of your service (via LoadBalancer or NodePort)
- Port: The service port (e.g., 5432 for PostgreSQL, 6379 for Redis)
Step 5: Set Up Heartbeats for CronJobs
For each critical CronJob:
- In Vigilmon, add a Heartbeat monitor with the expected job period (e.g., 1 hour for an hourly job, 24 hours for a nightly backup)
- Add a grace period of 5–10 minutes for job execution time
- Copy the heartbeat URL
- Add && curl -sf HEARTBEAT_URL to the end of your CronJob's command
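How the period and grace period combine can be sketched as a simple comparison. This assumes the alert window is the sum of the two, which is a common convention for heartbeat monitors but is an assumption here; `heartbeat_overdue` is a hypothetical name.

```shell
# Sketch: decide whether a heartbeat is overdue.
# heartbeat_overdue LAST_PING_EPOCH PERIOD_SECONDS GRACE_SECONDS NOW_EPOCH
# Returns 0 (alert) when more than period + grace has passed since the ping.
heartbeat_overdue() {
  local last="$1" period="$2" grace="$3" now="$4"
  [ $((now - last)) -gt $((period + grace)) ]
}
```

For a nightly backup with a 24-hour period and a 10-minute grace period, `heartbeat_overdue "$last" 86400 600 "$(date +%s)"` starts returning 0 once 87,000 seconds have passed without a ping.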
For a Kubernetes CronJob:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: my-backup-image:latest
              command:
                - /bin/sh
                - -c
                - |
                  run-backup.sh && \
                  curl -sf https://heartbeat.vigilmon.online/YOUR_HEARTBEAT_ID
          restartPolicy: OnFailure
If run-backup.sh exits non-zero, curl never runs, and the heartbeat doesn't arrive. Vigilmon alerts after the configured grace period.
Step 6: Configure Alert Channels
Connect Vigilmon to your team's notification channels:
- Slack: Add the Vigilmon Slack webhook to your #alerts or #oncall channel
- PagerDuty: Connect to your PagerDuty service for on-call escalation
- OpsGenie: Route to your OpsGenie team
- Email: Direct email alerts (included on all tiers)
- Webhook: POST to any custom endpoint (Zapier, n8n, custom handler)
For a Kubernetes production environment, the typical setup is: Slack for immediate team visibility, PagerDuty for overnight on-call escalation.
Multi-Region Consensus and Why It Matters for k8s
Kubernetes services often sit behind Anycast IP addresses, CDN edges, or global load balancers. A monitoring tool that checks from a single location can raise alerts for a routing problem on the path to that one probe, not for a failure users are actually experiencing.
Vigilmon dispatches every check simultaneously from multiple geographically distributed probe nodes. An alert fires only when a majority of those probes independently confirm the failure. This eliminates:
- False positives from a CDN edge flap that only affected one region
- False positives from a DNS propagation issue in one geographic area
- False positives from a probe node's own network having a bad moment
For teams with global Kubernetes deployments behind CDNs or geo-DNS, this is especially important. A failure visible only to a single probe in Southeast Asia shouldn't page your on-call as though the whole service were down.
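The majority rule itself is simple to state in code. The sketch below illustrates the general idea of consensus alerting and is not Vigilmon's implementation; `majority_down` is a hypothetical name, and treating a tie as "no alert" is an assumption.

```shell
# Sketch: majority consensus across probe regions.
# majority_down RESULT...
# Each RESULT is "up" or "down" as reported by one probe.
# Returns 0 (alert) only when strictly more than half report "down".
majority_down() {
  local total=$# down=0 r
  for r in "$@"; do
    if [ "$r" = "down" ]; then
      down=$((down + 1))
    fi
  done
  [ "$total" -gt 0 ] && [ $((down * 2)) -gt "$total" ]
}
```

With three probes, `majority_down down down up` alerts, while `majority_down down up up` stays quiet because only one region saw the failure.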
Recommended Monitor Setup by Kubernetes Deployment Size
Small (1–3 services)
- 1 HTTP monitor per public endpoint
- 1 SSL check per domain
- 1 heartbeat per CronJob
Typical count: 3–5 monitors (fits Vigilmon free tier)
Medium (5–15 services)
- HTTP monitor per public API endpoint
- SSL monitors for each domain
- TCP monitors for shared databases
- Heartbeats for critical background jobs
- Status page for customer communication
Typical count: 15–40 monitors
Large (15+ services, multi-cluster)
- All of the above, per cluster
- Synthetic flows for critical user journeys
- Separate monitors for each side of blue/green and canary traffic splits
- Heartbeats for each critical batch process
External Monitoring vs. Internal Kubernetes Health Checks
Kubernetes liveness and readiness probes run inside the cluster and test pod responsiveness from the node's perspective. They're essential — they drive Kubernetes's self-healing loop. But they don't replace external monitoring.
| Property | k8s liveness/readiness probes | Vigilmon external checks |
|---|---|---|
| Perspective | Node → Pod (internal) | Internet → Service (external) |
| Tests ingress routing | ❌ | ✅ |
| Tests DNS resolution | ❌ | ✅ |
| Tests SSL certificate validity | ❌ | ✅ |
| Tests LoadBalancer routing | ❌ | ✅ |
| Multi-region coverage | ❌ | ✅ |
| Drives pod restart | ✅ | ❌ |
| Drives traffic routing | ✅ | ❌ |
Run both. Kubernetes probes keep your pods healthy from the inside. Vigilmon tells you what users actually see from the outside.
Conclusion
External uptime monitoring is a necessary complement to any Kubernetes observability stack. Prometheus tells you what's happening inside your cluster. Vigilmon tells you what users are experiencing from the outside — and those two perspectives are not the same thing.
The failure modes that matter most to your users — ingress routing breaks, LoadBalancer provisioning failures, SSL expiry, DNS failures, CDN routing issues — are invisible to internal monitoring and clearly visible to external probes.
Setting up Vigilmon for a Kubernetes deployment takes under ten minutes: add an HTTP monitor for each public endpoint, enable SSL monitoring, add TCP checks for critical services, and wire up heartbeats for CronJobs. The multi-region consensus alerting model means you get alerts only when a real failure is affecting real users — not when a single probe node had a bad network second.
Try Vigilmon free at vigilmon.online — 5 monitors, no credit card, multi-region consensus from the first check.