Monitoring a microservices architecture is a different problem from monitoring a monolith. A monolith either responds or it doesn't. A microservices architecture can have 30 services, each independently available, but a dependency failure in service A causes service B to time out, which causes service C to return errors, which surfaces as a user-facing failure in service D. By the time the customer complains, the cascade has made the origin hard to find.
This guide covers how to use uptime monitoring tools effectively in a microservices environment: health endpoint conventions, inter-service dependency modeling, cascading failure detection, heartbeat monitoring for background workers, and how to set up Vigilmon for a distributed system.
The Uptime Monitoring Problem in Microservices
Traditional uptime monitoring was designed for monoliths: add a monitor on your main URL, get paged when it's down, done. In a microservices architecture, that approach misses most failure modes:
- Service A is up, but its downstream dependency (Service B) is degraded. The health check on Service A returns 200, but users are seeing errors because Service A's calls to Service B are timing out.
- The API gateway is up, but a critical internal service is unreachable. Users hit the gateway, which responds but returns errors from the failed upstream.
- A background worker stops processing jobs. No HTTP endpoint reports this. The queue builds silently.
- Service C is healthy, but one geographic region can't route to it. A single-location health check misses the regional failure.
Effective uptime monitoring in microservices requires monitoring at multiple levels: the external-facing layer (what users reach), the internal-service layer (what matters for the critical path), and the asynchronous layer (background workers and scheduled jobs).
Health Endpoint Conventions for Microservices
The health endpoint is the foundation of uptime monitoring for each service. Consistency across services matters: when every service exposes health data in the same shape, your monitoring configuration is uniform and your incident response tooling can consume all services identically.
The Three-Level Health Model
Liveness — is the process running and able to accept requests at all?
GET /health/live
{
  "status": "ok"
}
Return HTTP 200 if the process is running. Return HTTP 503 if the process is in a failed state. Keep this endpoint fast and dependency-free — its purpose is to tell orchestrators (Kubernetes, ECS) whether to restart the container.
Readiness — is the service ready to handle production traffic?
GET /health/ready
{
  "status": "ok",
  "checks": {
    "database": "ok",
    "cache": "ok"
  }
}
Return HTTP 200 if all critical dependencies are healthy. Return HTTP 503 if a critical dependency is unavailable. Load balancers and service meshes use this to decide whether to route traffic to an instance. This is the endpoint your uptime monitoring tool should check for the service's true operational status.
Deep health / status — what is the service's current operational state, including non-critical dependencies?
GET /health/status
{
  "status": "degraded",
  "version": "2.4.1",
  "uptime_seconds": 86400,
  "checks": {
    "database": "ok",
    "cache": "ok",
    "email_provider": "degraded",
    "search_index": "ok"
  },
  "timestamp": "2026-06-30T10:23:01Z"
}
This endpoint returns the full operational picture, including degraded non-critical dependencies. Do not use this as your monitor's check endpoint — a non-critical degradation returning 503 will alert on things that don't require immediate response. Use it as a diagnostic tool during incidents.
Health Endpoint Rules for Microservices
Return the correct HTTP status code. HTTP 200 means healthy. HTTP 503 means unhealthy. Do not return HTTP 200 with {"status": "error"} in the body. Uptime monitoring tools read status codes; inconsistent semantics cause missed alerts and false alerts.
Check only critical-path dependencies. A readiness endpoint that marks the service unhealthy because an optional analytics API is slow will cause false alerts. Check the dependencies that would cause you to serve errors to users — typically: your primary database, any message broker you read from synchronously, any downstream service your critical path requires. Optional enhancements that degrade gracefully don't belong in the readiness check.
Keep health endpoints fast. A health endpoint that takes 800ms because it runs a database query per check will skew response time history and put unnecessary load on your database at high check frequency. Use connection pool status checks rather than query execution where possible. Keep the liveness endpoint sub-10ms — it must respond even when the service is under load.
Version-tag your health responses. Including "version": "2.4.1" in health responses makes it possible to verify that the right version is deployed during incident investigation. It also lets your monitoring tool detect when a deployment propagated across all instances.
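The rules above can be sketched as a small readiness handler. This is a minimal illustration, not a prescribed implementation: the function name, the check registry, and the "error" state label are assumptions made for the example.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical check registry: dependency name -> callable returning True when healthy.
Check = Callable[[], bool]


def readiness_response(critical: Dict[str, Check], version: str) -> Tuple[int, str]:
    """Run critical-path checks only and return (HTTP status code, JSON body)."""
    results = {}
    for name, check in critical.items():
        try:
            results[name] = "ok" if check() else "error"
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = "error"
    healthy = all(state == "ok" for state in results.values())
    body = {
        "status": "ok" if healthy else "error",
        "version": version,  # version tag, as recommended above
        "checks": results,
    }
    # Status code and body always agree: 200 only when every check passed.
    return (200 if healthy else 503), json.dumps(body)
```

Optional dependencies are simply left out of the registry, so they can never flip the readiness result.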
What to Monitor in a Microservices Architecture
Layer 1: External-Facing Endpoints
These are the endpoints your users reach. They represent the ultimate test of whether your system is serving requests.
API gateway or ingress:
- Primary API base URL (https://api.yourdomain.com/)
- A lightweight health or ping endpoint (/health or /ping)
- Check from multiple geographies with consensus alerting: your API gateway being unreachable from one region is a user-visible failure
Public-facing web application:
- Primary application URL
- Critical authenticated user flows if a lightweight check can cover them
Third-party services your application depends on:
- Payment processing API availability
- Email delivery provider
- Any partner API your application calls synchronously in the request path
Monitoring third-party dependencies externally gives you visibility into their availability independent of what their status page claims.
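To make the multi-geography idea concrete, here is one way probe results from several regions could be combined. The article does not describe Vigilmon's actual consensus rule, so the majority rule, the function name, and the three labels below are assumptions for illustration.

```python
from typing import Dict


def classify(results: Dict[str, bool]) -> str:
    """Combine per-region probe results (region -> True if the check passed).

    Returns "up", "regional" (a minority of regions fail), or "down"
    (a strict majority of regions fail). The majority rule is an assumption.
    """
    if not results:
        return "up"  # no data: nothing to alert on in this sketch
    failures = sum(1 for ok in results.values() if not ok)
    if failures == 0:
        return "up"
    if failures * 2 > len(results):
        return "down"
    return "regional"
```

Keeping "regional" as its own state means a single failing probe does not page anyone as a full outage, while a real regional routing problem still stays visible.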
Layer 2: Internal Services
Internal services are not publicly accessible — but their health determines whether external-facing services can serve requests correctly.
For critical internal services:
- Add health endpoints on each service following the conventions above
- Monitor the readiness endpoint from your monitoring infrastructure or via a probe that can reach internal network ranges
- Use response body matching to verify the response contains "status": "ok" rather than just checking the HTTP status code
For inter-service communication that must be verified:
- Add a dedicated health check endpoint on the downstream service that is checked as part of the upstream service's own readiness check
- This creates a health dependency chain that surfaces downstream failures as upstream degradation
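The body-matching check mentioned above can be sketched as a function that requires both signals to agree. The function name is hypothetical; the response shape follows the readiness example earlier in this guide.

```python
import json


def probe_is_healthy(status_code: int, body_text: str) -> bool:
    """Healthy only if the status code is 200 AND the JSON body reports status ok."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body_text)
    except ValueError:
        # Not JSON at all (for example an HTML error page served with a 200).
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

This catches the case where a proxy or error page answers with 200 on behalf of a service that is not actually healthy.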
Layer 3: Asynchronous Workers and Cron Jobs
This is the most commonly missed monitoring layer. HTTP health checks tell you nothing about whether your message queue consumers, event processors, or cron jobs are running.
Heartbeat monitoring covers this layer. The pattern is simple:
- At the end of each successful job run, send a ping to a heartbeat URL
- Your monitoring tool waits for that ping on the expected schedule
- If the ping doesn't arrive within the configured window, an alert fires
Every background process that must run reliably is a candidate for heartbeat monitoring:
| Process Type | Example | Heartbeat Window |
|---|---|---|
| Scheduled cron job | Database backup at 2 AM | 25 hours (daily + buffer) |
| Periodic worker | Cache refresh every 15 minutes | 20 minutes |
| Event consumer | Order processing queue worker | 5 minutes |
| Data sync pipeline | CRM import every 4 hours | 5 hours |
| Email sender | Notification dispatch every minute | 3 minutes |
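As a rule of thumb, set the heartbeat window 20–50% longer than the job's normal interval. This prevents alerts from firing during normal scheduling variation without masking actual failures. For very long or very short intervals a fixed buffer is often more practical, as with the daily backup (one extra hour) and the per-minute email sender (two extra minutes) in the table above.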
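As a worked example of the window arithmetic, the sketch below derives a window from an interval and a buffer fraction, then decides whether a heartbeat is overdue. Both function names and the 25% default are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def heartbeat_window(interval: timedelta, buffer_fraction: float = 0.25) -> timedelta:
    """Return the interval plus a proportional buffer (0.25 means 25% longer)."""
    if buffer_fraction < 0:
        raise ValueError("buffer_fraction must not be negative")
    return interval * (1 + buffer_fraction)


def is_overdue(last_ping: datetime, now: datetime, window: timedelta) -> bool:
    """True when more time than the window has passed since the last ping."""
    return now - last_ping > window
```

For a 15-minute cache refresh with a 25% buffer, the window is 18 minutes 45 seconds; a ping that arrives 18 minutes after the previous one is on time, while silence at 19 minutes counts as overdue.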
Detecting Cascading Failures
Cascading failures in microservices have a pattern: one service degrades, its callers experience increased latency or errors, those callers' response times spike, and errors propagate upstream until a user-visible failure surfaces.
Uptime monitoring detects cascading failures most effectively when you have coverage at multiple layers:
Detection sequence for a database failure:
1. The database TCP port monitor fires on the first check cycle after the failure (if you monitor the DB port)
2. Service A's readiness endpoint starts returning 503 (Service A depends on the database)
3. Service B's response time spikes (Service B calls Service A synchronously)
4. The external API gateway monitor detects elevated response times or error status codes
5. If Service A runs a background consumer with a heartbeat monitor, that monitor fires when jobs stop processing
Without multi-layer monitoring, step 4 is the first signal — by which time the cascade has already fully developed. With internal service monitoring, step 2 fires before the cascade reaches the external layer.
Response Time History as Early Warning
Response time degradation often precedes complete unavailability. A service that takes 200ms normally but is suddenly at 1,800ms is likely experiencing a dependency problem before it fails completely.
Vigilmon's response time history records latency trends on every monitored endpoint. Setting a response time alert threshold (e.g., alert when response time exceeds 2× baseline) gives you an early warning signal before the endpoint starts returning errors.
For a microservices architecture, configure response time thresholds on:
- Your primary API gateway endpoint
- Any critical internal service readiness endpoint you can reach
- Any third-party API you depend on for the critical request path
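The "2× baseline" rule can be illustrated with a short check against recent history. Using the median as the baseline is an assumption made here (it is less sensitive to a single slow sample than the mean); the function name is hypothetical.

```python
from statistics import median
from typing import Sequence


def exceeds_baseline(history_ms: Sequence[float], latest_ms: float, factor: float = 2.0) -> bool:
    """True when the latest response time is more than factor x the median of history."""
    if not history_ms:
        return False  # no baseline yet, so nothing to compare against
    return latest_ms > factor * median(history_ms)
```

With a history around 200ms, a 1,800ms sample trips the threshold while a 350ms sample does not.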
Alerting by Dependency Tier
When a shared dependency fails, many services alert simultaneously. To avoid chasing symptoms instead of root causes, group your monitors by dependency tier:
Tier 1 — Shared infrastructure (highest priority): database, message broker, cache cluster. Alert immediately; when Tier 1 fires, treat all co-occurring Tier 2 alerts as downstream effects.
Tier 2 — Core platform services: auth, user service, payment service, any service many others call. Alert and page the responsible team; investigate after confirming Tier 1 is healthy.
Tier 3 — Feature services: leaf services with fewer upstream callers. Notify the responsible team; page only if the failure is user-visible and Tier 1/2 are healthy.
This tiering prevents 20 simultaneous alerts from the same root cause from drowning your incident response in noise.
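One way to apply the tiers during an incident is to split the firing monitors into those to investigate first and those that are probably downstream effects. This is a sketch of the triage rule described above; the monitor names, the default tier for unmapped monitors, and the output keys are assumptions.

```python
from typing import Dict, List


def triage(firing: List[str], tiers: Dict[str, int]) -> Dict[str, List[str]]:
    """Split firing monitors by tier: the lowest tier number firing is looked at first."""
    if not firing:
        return {"investigate_first": [], "likely_downstream": []}
    # Monitors without a tier mapping are treated as Tier 3 (assumption).
    tier_of = {name: tiers.get(name, 3) for name in firing}
    lowest = min(tier_of.values())
    return {
        "investigate_first": [m for m in firing if tier_of[m] == lowest],
        "likely_downstream": [m for m in firing if tier_of[m] != lowest],
    }
```

If the database (Tier 1), auth (Tier 2) and search (Tier 3) all fire together, only the database lands in the investigate-first group.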
Service Mesh Health Monitoring
In service mesh architectures (Istio, Linkerd, Consul Connect), the mesh provides rich internal observability — traffic metrics, latency histograms, error rates between services. But even with a service mesh, outside-in monitoring remains valuable:
- The mesh observes traffic that successfully enters the mesh. DNS failures, load balancer failures, and TLS termination errors upstream of the mesh are invisible to mesh telemetry.
- SSL certificate monitoring for your ingress is outside the mesh's visibility scope.
- External user experience (what a user from Tokyo experiences reaching your US-West infrastructure) is not captured by mesh telemetry.
- Cron jobs and batch workers running outside the mesh have no mesh visibility.
Treat service mesh telemetry and outside-in uptime monitoring as complementary layers, not alternatives. The mesh tells you about internal traffic health; Vigilmon tells you what users experience when they try to reach your infrastructure.
Setting Up Vigilmon for a Microservices Architecture
Step 1: Map Your Critical Path
Before adding monitors, identify the services and endpoints that form your critical user-facing path. A reasonable starting structure:
External users
└── API gateway / load balancer
    ├── Auth service
    ├── Product service
    │   └── Product database
    ├── Order service
    │   ├── Order database
    │   └── Payment API (external)
    └── Notification service
        └── Email provider (external)
Prioritize monitoring from the outside in: start at the API gateway, then add internal services in priority order based on their impact if they fail.
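The outside-in ordering can be derived mechanically once the critical path is written down as a dependency map. The sketch below encodes the example tree above with hypothetical service identifiers and walks it breadth-first from the gateway.

```python
from collections import deque
from typing import Dict, List

# Hypothetical identifiers for the services in the example tree above.
CRITICAL_PATH: Dict[str, List[str]] = {
    "api-gateway": ["auth", "product", "order", "notification"],
    "product": ["product-db"],
    "order": ["order-db", "payment-api"],
    "notification": ["email-provider"],
}


def monitoring_order(graph: Dict[str, List[str]], entry: str) -> List[str]:
    """Breadth-first walk from the entry point: closest-to-the-user services come first."""
    seen = {entry}
    order: List[str] = []
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        order.append(node)
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return order
```

Breadth-first order is only a starting point; a service with high business impact (payments, for instance) may deserve a monitor earlier than its depth suggests.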
Step 2: Add HTTP Monitors for External Endpoints
In Vigilmon:
- Click Add Monitor → select HTTP/HTTPS
- Enter your API gateway base URL or health endpoint
- Set check interval (5 minutes on free tier; 1 minute on paid)
- Configure expected status code (200)
- Set a response time threshold based on your baseline
- Enable multi-region consensus (Vigilmon default)
Repeat for each external-facing endpoint. In a typical microservices deployment, external monitors cover:
- Primary application URL
- API base URL or health endpoint
- Auth service URL (if independently accessible)
- Any externally accessible service URL with its own domain or subdomain
Step 3: Add Heartbeat Monitors for Background Workers
For each background process:
- Click Add Monitor → select Heartbeat
- Name the monitor descriptively (e.g., "Order processor — queue consumer")
- Set the expected ping interval to match the process schedule
- Set the grace period (20–50% above normal interval)
- Copy the heartbeat URL
- Add the ping to your process code
Example — Node.js queue consumer:
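const https = require('https');

class OrderProcessor {
  async processNextBatch() {
    const orders = await this.queue.dequeue(100);
    for (const order of orders) {
      await this.processOrder(order);
    }
    // Signal successful completion to Vigilmon
    await this.pingHeartbeat();
  }

  pingHeartbeat() {
    return new Promise((resolve) => {
      const url = 'https://vigilmon.online/heartbeat/YOUR_HEARTBEAT_ID';
      const req = https.get(url, { timeout: 5000 }, (res) => {
        res.resume(); // Drain the response so the socket is released
        resolve();
      });
      // Give up on a slow ping instead of stalling the worker
      req.on('timeout', () => { req.destroy(); resolve(); });
      // Non-blocking: don't let monitoring fail the job
      req.on('error', () => resolve());
    });
  }
}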
Example — Python cron job:
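import logging

import requests


def run_nightly_sync():
    try:
        sync_records()
    except Exception as e:
        logging.error(f"Sync failed: {e}")
        raise  # Don't ping heartbeat on failure: absence of ping is the signal

    # Ping heartbeat only after the sync succeeded
    try:
        requests.get(
            'https://vigilmon.online/heartbeat/YOUR_HEARTBEAT_ID',
            timeout=5
        )
    except requests.RequestException as e:
        # A failed ping must not be reported as a failed sync
        logging.warning(f"Heartbeat ping failed: {e}")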
The heartbeat ping should come only after successful completion. A failed job should not ping the heartbeat — the absence of the ping is the signal that something went wrong.
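When many jobs need the same behaviour, the ping-only-on-success rule can be factored into a reusable wrapper. The sketch below uses only the standard library; the decorator and helper names are hypothetical, and the heartbeat URL follows the placeholder format used in the examples above.

```python
import functools
import urllib.request
from typing import Any, Callable


def ping_url(url: str, timeout: float = 5.0) -> None:
    """Best-effort GET; monitoring problems must never fail the job itself."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            pass
    except Exception:
        pass  # a lost ping shows up as a missed heartbeat, which is acceptable


def heartbeat_on_success(ping: Callable[[], None]) -> Callable:
    """Decorator: call ping() only after the wrapped job returns without raising."""
    def decorator(job: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(job)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            result = job(*args, **kwargs)  # exceptions propagate, so no ping is sent
            try:
                ping()
            except Exception:
                pass
            return result
        return wrapper
    return decorator
```

A job would then be declared with `@heartbeat_on_success(lambda: ping_url("https://vigilmon.online/heartbeat/YOUR_HEARTBEAT_ID"))`. Passing the ping in as a callable also makes the behaviour easy to test without network access.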
Step 4: Configure TCP Monitors for Infrastructure
For databases, message brokers, and other non-HTTP services:
- Click Add Monitor → select TCP
- Enter the hostname and port
- Set check interval
- Configure notifications
Common TCP monitors in microservices:
| Service | Default Port |
|---|---|
| PostgreSQL | 5432 |
| MySQL | 3306 |
| Redis | 6379 |
| RabbitMQ management | 15672 |
| Elasticsearch | 9200 |
| Kafka | 9092 |
Note: TCP monitors for internal services require network access to those ports from Vigilmon's probe network. If your database is in a private VPC, use application-level health endpoints to surface database health instead.
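For reference, a TCP check verifies only that a connection to the port can be established, which a few lines of standard-library Python can reproduce. This is an illustrative sketch (the function name and default timeout are assumptions), useful for example when testing reachability from inside a private network.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A successful connect says nothing about whether the database can answer queries, which is why the application-level readiness endpoint remains the stronger signal.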
Step 5: Route Alerts by Service and Team
For a microservices architecture with multiple teams, route alerts to the appropriate channel via Vigilmon webhooks:
API gateway alert → #ops-alerts Slack + PagerDuty (Tier 1)
Database alert → #dba-oncall + PagerDuty (Tier 1)
Payment service → #payments-oncall + PagerDuty (Tier 2)
Auth service → #security-ops (Tier 2)
Background workers → #data-team-alerts (Tier 3)
Vigilmon supports webhook notifications to any HTTPS endpoint. Configure per-monitor webhook routing to send each alert to the team responsible for that service — not a single shared channel that everyone ignores.
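If alerts pass through your own webhook receiver, the routing table above can live in code next to the service catalogue. The sketch below is an assumption about how such a receiver might look up destinations; the service keys and the fallback channel are hypothetical, and the channel names mirror the example above.

```python
from typing import Dict, List

# Hypothetical routing table mirroring the example above.
ROUTES: Dict[str, List[str]] = {
    "api-gateway": ["#ops-alerts", "pagerduty"],
    "database": ["#dba-oncall", "pagerduty"],
    "payment-service": ["#payments-oncall", "pagerduty"],
    "auth-service": ["#security-ops"],
    "background-workers": ["#data-team-alerts"],
}
FALLBACK: List[str] = ["#ops-alerts"]  # assumption: unmapped monitors still reach someone


def destinations(service: str) -> List[str]:
    """Return the channels for a service, never an empty list."""
    return list(ROUTES.get(service, FALLBACK))
```

The explicit fallback matters more than the table itself: a new service without a route should produce a visible alert somewhere rather than none at all.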
Step 6: Integrate with Deployment Pipelines
Pause monitors during deployments to prevent false alerts during rolling restarts:
# pre-deploy.sh
curl -s -X PATCH https://vigilmon.online/api/monitors/API_GATEWAY_MONITOR_ID \
  -H "Authorization: Bearer $VIGILMON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"paused": true}'

# post-deploy.sh
curl -s -X PATCH https://vigilmon.online/api/monitors/API_GATEWAY_MONITOR_ID \
  -H "Authorization: Bearer $VIGILMON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"paused": false}'
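One risk with separate pre- and post-deploy scripts is that a failed deployment never reaches the resume step and the monitor stays paused. A deploy driver written in Python could guard against that with a context manager, as sketched below. The endpoint and payload are taken from the curl calls above; the function names are hypothetical, and the network call is injected so the pause/resume logic can be exercised on its own.

```python
import json
import urllib.request
from contextlib import contextmanager
from typing import Callable, Iterator

API_BASE = "https://vigilmon.online/api/monitors"  # same endpoint as the curl examples


def set_paused(monitor_id: str, paused: bool, api_key: str) -> None:
    """Send the same PATCH request as the curl examples above."""
    req = urllib.request.Request(
        f"{API_BASE}/{monitor_id}",
        data=json.dumps({"paused": paused}).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=10):
        pass


@contextmanager
def monitor_paused(monitor_id: str, setter: Callable[[str, bool], None]) -> Iterator[None]:
    """Pause on entry; always resume on exit, even if the deployment raises."""
    setter(monitor_id, True)
    try:
        yield
    finally:
        setter(monitor_id, False)
```

Usage would look like `with monitor_paused(monitor_id, lambda m, p: set_paused(m, p, api_key)): run_deploy()`, where `run_deploy` stands in for whatever performs the rollout.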
At scale, automate monitor creation and deletion alongside service deployment:
# Create monitor when a service is deployed
curl -s -X POST https://vigilmon.online/api/monitors \
  -H "Authorization: Bearer $VIGILMON_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"name\": \"${SERVICE_NAME} [PROD]\",
    \"url\": \"${SERVICE_HEALTH_URL}\",
    \"interval\": 60
  }"
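Building the JSON body by string interpolation breaks as soon as a service name contains a quote or backslash. If monitor creation is scripted in Python instead, the body can be serialized safely. The sketch below assumes the same three fields as the curl example above; the function name and the environment suffix parameter are hypothetical.

```python
import json


def monitor_payload(service_name: str, health_url: str,
                    interval_seconds: int = 60, env: str = "PROD") -> str:
    """Serialize the create-monitor body with proper JSON escaping."""
    return json.dumps({
        "name": f"{service_name} [{env}]",
        "url": health_url,
        "interval": interval_seconds,
    })
```

The resulting string can be sent as the request body of the POST call shown above.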
Microservices Monitoring Checklist
Per-Service Checklist
- [ ] /health/ready endpoint returns 200 when healthy, 503 when unavailable
- [ ] Readiness endpoint checks only critical-path dependencies
- [ ] Health endpoint responds in under 100ms
- [ ] Version tag included in health response body
- [ ] HTTP monitor configured on the service's health or readiness endpoint
- [ ] Response time threshold set based on observed baseline
- [ ] Each background worker has a heartbeat monitor with appropriate window
- [ ] TCP monitor configured for any directly accessible non-HTTP service port
Architecture-Level Checklist
- [ ] External API gateway / load balancer monitored from multiple geographies
- [ ] Critical third-party dependencies monitored independently
- [ ] SSL certificate expiry monitoring configured for all external-facing domains
- [ ] Alert routing maps each monitor to the team responsible for that service
- [ ] Monitors grouped by dependency tier; Tier 1 alerts treated as root cause when co-occurring
- [ ] Deployment pipeline pauses and resumes monitors during rolling restarts
- [ ] Monitor creation/deletion automated via API alongside service deployment
- [ ] Runbook entry exists for each Tier 1 and Tier 2 monitor
Conclusion
Uptime monitoring in a microservices architecture is not simpler than in a monolith — but the tools and conventions exist to do it well. The key shifts from monolith monitoring:
- Monitor at multiple layers — external gateway, critical internal services, and asynchronous workers — not just the public URL
- Use health endpoint conventions consistently across services so monitoring configuration is uniform
- Add heartbeat monitors for every background process — this is the gap most teams have
- Use consensus alerting to avoid alert fatigue from single-probe false positives in high-check-frequency environments
- Group alerts by dependency tier so cascading failures don't create noise that hides the root cause
- Integrate monitoring into deployments so rolling restarts don't trigger false alerts
Vigilmon's free tier — 5 monitors with permanent consensus alerting — covers a basic microservices monitoring setup at no cost. The REST API lets you expand coverage and automate monitor lifecycle alongside service deployment as your architecture grows.
Get started at vigilmon.online — no agents, no credit card, monitoring up in minutes.