Designing Health Check Endpoints for Monitoring 2026

A health check endpoint is the contract between your service and the monitoring systems that watch it. Get the design wrong and your monitoring produces false positives (alerts when the service is fine), false negatives (silence when the service is broken), or noise that trains engineers to ignore alerts. Get it right and you have a precise, reliable signal that tells your monitoring systems exactly what they need to know — and tells your on-call engineers what's actually wrong.

This guide covers what separates a good health endpoint from a bad one, the distinction between liveness, readiness, and startup checks, what to include in health check responses, HTTP status codes for health checks, deep versus shallow checks, dependency health propagation, standardizing paths across your services, and practical examples in Express, FastAPI, Django, Go, and Rails.


What Makes a Good Health Endpoint vs. a Bad One

Bad Health Endpoint: Always Returns 200

GET /health
{"status":"ok"}

If this endpoint always returns 200 regardless of the state of your application's dependencies, it's not a health check — it's a static file. Your uptime monitor sees 200 and reports "up" while your database connection pool is exhausted, your downstream services are timing out, and users are receiving 503 errors.

The failure mode: Your monitoring says the service is healthy. Your users are experiencing errors. Your on-call engineer is not paged.

Bad Health Endpoint: Returns 500 for Any Dependency Blip

An endpoint that returns 500 whenever any dependency shows any latency spike is also a bad health check. If a transient 50ms database slowdown causes your health check to return 500, and your monitoring service has a 1-minute check interval, you'll get spurious alerts during every normal load spike.

The failure mode: Your on-call engineer is paged for non-incidents. Alert fatigue sets in. The real incidents start getting ignored.

Good Health Endpoint: Returns Status That Reflects Actual Serviceability

A good health check answers the question: "Can this service currently handle requests?" Not "is every dependency perfectly healthy" and not "does any code run without crashing."

The answer should be:

  • 200 when the service can handle requests (even if some non-critical dependencies are degraded)
  • 503 when the service genuinely cannot serve requests — critical dependencies are unavailable, the service is starting up, or it's draining traffic before a graceful shutdown

Everything between those extremes requires judgment about which dependencies are critical to request handling and which are degraded-but-functional.
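The 200-versus-503 rule above can be sketched as a small function. This is a minimal illustration rather than a prescribed implementation; the names `InstanceState` and `health_status_code` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    """Hypothetical snapshot of what the instance knows about itself."""
    starting: bool = False               # still running its startup sequence
    draining: bool = False               # shedding traffic before shutdown
    critical_dependencies_ok: bool = True
    non_critical_degraded: bool = False  # e.g. cache down, fallback in use


def health_status_code(state: InstanceState) -> int:
    """Return 503 only when the instance cannot serve requests."""
    if state.starting or state.draining or not state.critical_dependencies_ok:
        return 503
    # Non-critical degradation does not stop the instance from serving.
    return 200
```

Note that `non_critical_degraded` never changes the status code; it would only show up in the response body.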


Liveness vs. Readiness vs. Startup Checks

Kubernetes formalized the distinction between three types of health checks, and that distinction is useful even outside Kubernetes:

Liveness Check

Question it answers: Is this process alive and not stuck in a deadlock or infinite loop?

What it checks: Minimal — often just "can the process accept a connection and return a response." Not dependencies, not application state. If the process is so broken it can't respond to a liveness check, Kubernetes restarts it.

Failure action: Restart the container / process.

What NOT to check: External dependencies. A liveness check that fails because a database is down causes the container to restart repeatedly without fixing anything. This is a common and painful misconfiguration.

GET /healthz/live
200 OK
{"alive": true}

Readiness Check

Question it answers: Is this instance ready to receive traffic?

What it checks: Critical dependencies that the service needs to serve requests. If the database is disconnected, a readiness check should fail — the service can't serve requests. If a cache is unavailable but requests can proceed without it, the readiness check should still pass.

Failure action: Remove the instance from the load balancer rotation (stop sending it traffic) — but don't restart it.

The distinction from liveness: A failing readiness check means "don't send me traffic"; a failing liveness check means "I'm broken, restart me." An instance can be alive but not ready (starting up, warming caches), and it can be alive and ready (healthy). An instance cannot be ready but not alive.

GET /healthz/ready
200 OK
{"ready": true, "dependencies": {"database": "ok", "cache": "ok"}}

503 Service Unavailable
{"ready": false, "dependencies": {"database": "unreachable", "cache": "ok"}}

Startup Check

Question it answers: Has the service finished its startup sequence?

What it checks: Similar to a readiness check, but specifically evaluated during startup. Kubernetes uses startup probes to give slow-starting applications time to initialize before liveness probes begin.

Why it exists: If a liveness probe fires before a slow-starting application finishes initialization, Kubernetes may restart an application that's healthy but hasn't finished loading yet.

Failure action: Keep waiting (don't restart) until the probe succeeds. If the failure threshold is exceeded first, Kubernetes restarts the container.

GET /healthz/startup
200 OK
{"started": true}

503 Service Unavailable  
{"started": false, "initializing": ["database_connection", "cache_warmup"]}
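The three probe types can share one small router. The sketch below is framework-agnostic and returns a `(status_code, body)` pair; `ProbeState` and `handle_probe` are hypothetical names, and each critical check is assumed to be a zero-argument callable that returns `True` when the dependency is usable.

```python
from typing import Callable


class ProbeState:
    """Hypothetical holder for startup progress and critical checks."""

    def __init__(self, critical_checks: dict[str, Callable[[], bool]]):
        self.started = False
        self.critical_checks = critical_checks


def handle_probe(path: str, state: ProbeState) -> tuple[int, dict]:
    if path == "/healthz/live":
        # Liveness: no dependency calls; answering at all is the signal.
        return 200, {"alive": True}
    if path == "/healthz/startup":
        return (200 if state.started else 503), {"started": state.started}
    if path == "/healthz/ready":
        if not state.started:
            return 503, {"ready": False, "dependencies": {}}
        deps = {}
        for name, check in state.critical_checks.items():
            try:
                deps[name] = "ok" if check() else "unreachable"
            except Exception:
                deps[name] = "unreachable"
        ready = all(value == "ok" for value in deps.values())
        return (200 if ready else 503), {"ready": ready, "dependencies": deps}
    return 404, {"error": "not found"}
```

The liveness branch deliberately ignores `critical_checks`, so a database outage fails readiness without triggering restarts.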

For External Uptime Monitoring

External uptime monitors like Vigilmon typically check a single endpoint. For external monitoring:

  • Use the readiness endpoint as your primary uptime check — it tests whether the service can handle requests
  • The liveness endpoint is primarily for container orchestration, not external monitors
  • If you don't distinguish liveness from readiness, your primary /health endpoint should behave like a readiness check

What to Include in Health Check Responses

Minimum Viable Health Response

{
  "status": "ok"
}

With an appropriate HTTP status code (200 for healthy, 503 for unhealthy), this is technically sufficient for binary up/down uptime monitoring.

Practical Health Response with Dependency Status

{
  "status": "ok",
  "version": "2.14.1",
  "uptime_seconds": 86431,
  "dependencies": {
    "database": {
      "status": "ok",
      "latency_ms": 3
    },
    "cache": {
      "status": "ok",
      "latency_ms": 1
    },
    "storage": {
      "status": "ok"
    }
  }
}

This response lets an engineer who reaches the health endpoint during an incident immediately see which dependency is failing, without needing to open additional dashboards.
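One way to produce the per-dependency `status` and `latency_ms` fields is to wrap each check in a timing helper. This is a sketch under the assumption that every check is a zero-argument callable that raises on failure; `timed_check` is a hypothetical name.

```python
import time
from typing import Callable


def timed_check(check: Callable[[], None]) -> dict:
    """Run one dependency check and report its status and latency."""
    start = time.perf_counter()
    try:
        check()
    except Exception:
        # Report a coarse status only; keep exception details in server logs.
        return {"status": "unreachable"}
    latency_ms = round((time.perf_counter() - start) * 1000)
    return {"status": "ok", "latency_ms": latency_ms}


# Usage: the lambda stands in for a real "SELECT 1" against your database.
dependencies = {"database": timed_check(lambda: None)}
```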

Degraded State Response

Not all dependency failures should cause the health check to return 503. A cache being temporarily unavailable might mean requests are slower but the service is still functional. Introducing a "degraded" status covers this case:

{
  "status": "degraded",
  "details": "Cache unavailable — requests served from database (increased latency expected)",
  "dependencies": {
    "database": {
      "status": "ok",
      "latency_ms": 8
    },
    "cache": {
      "status": "unavailable",
      "error": "connection refused"
    }
  }
}

HTTP status for a degraded response: 200 if the service can still handle requests (monitoring systems should not alert), or 503 if the degradation prevents serving requests (monitoring should alert).

What NOT to Include in Health Responses

Sensitive configuration: Database passwords, API keys, connection strings, or internal IP addresses should never appear in health check responses. Health endpoints are often not authenticated — they're accessible by monitoring services and anyone who can reach the network.

Verbose error messages: "Connection to postgres://user:password@host:5432/dbname failed" leaks credentials and internal topology. Return "database: unreachable" instead.

User data: The health endpoint should not return any data about users, user sessions, or user activity. A health check is an operational endpoint, not an application endpoint.

Excessive diagnostic information: Long stack traces, full configuration dumps, or environment variable contents don't belong in health responses. A health endpoint has one job — reporting serviceability status.
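A simple way to keep sensitive details out of responses is to map exceptions to a fixed set of coarse status strings and send the raw error to server-side logs only. The sketch below assumes standard Python exception types; `public_status` and the `"health"` logger name are hypothetical.

```python
import logging
import socket

logger = logging.getLogger("health")


def public_status(exc: BaseException) -> str:
    """Map an exception to a coarse status string that is safe to expose.

    The raw message may contain hostnames or credentials, so it is logged
    server-side and never returned to the caller.
    """
    logger.warning("dependency check failed: %r", exc)
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return "unreachable"
    return "error"
```

Because the return value is always one of three fixed strings, a connection string embedded in an exception message cannot reach the response body.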


HTTP Status Codes for Health Checks

The HTTP status code is the signal your uptime monitor reads. The response body is for human engineers. Use status codes correctly:

| Status | Meaning | When to Use |
|---|---|---|
| 200 OK | Service is healthy and ready to handle requests | All critical dependencies up, service operational |
| 200 OK with degraded status in body | Service degraded but functional | Non-critical dependency unavailable; requests still served |
| 503 Service Unavailable | Service cannot handle requests | Critical dependency unavailable, or service shutting down |
| 429 Too Many Requests | Health check itself is rate-limited | If you rate-limit health endpoints (unusual; avoid) |
| 500 Internal Server Error | The health check itself encountered an unexpected error | Use sparingly; distinguish from deliberate 503 |

The 200 vs 503 decision rule: Ask "can this instance serve a normal user request right now?" If yes: 200. If no: 503.

Avoid using 200 with a body that says {"status":"error"}. Uptime monitors read HTTP status codes, not response bodies (unless configured to do body matching). A 200 with an error body tells monitoring systems the service is healthy.

Avoiding 500 from Health Checks

If your health check endpoint itself throws an unhandled exception (e.g., because the dependency check code has a bug), it returns 500. This is generally indistinguishable from a service error from the monitor's perspective.

Design health check handlers to catch and handle all exceptions, returning a structured 503 response rather than allowing unhandled exceptions to produce 500 responses with HTML error pages.
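The catch-all can live in a wrapper so individual handlers don't need to repeat it. This is a minimal sketch that models a handler as a function returning `(status_code, body)`; the decorator name `never_500` and the `health_check_failed` marker are hypothetical.

```python
import functools
from typing import Callable

HealthResult = tuple[int, dict]


def never_500(handler: Callable[[], HealthResult]) -> Callable[[], HealthResult]:
    """Wrap a health handler so bugs inside it surface as a structured 503."""

    @functools.wraps(handler)
    def wrapper() -> HealthResult:
        try:
            return handler()
        except Exception:
            # A bug in the check code is reported like a failed check, so
            # the monitor always receives parseable JSON.
            return 503, {"status": "unhealthy", "error": "health_check_failed"}

    return wrapper


@never_500
def buggy_health() -> HealthResult:
    return 200, {"status": "ok", "latency": 1 / 0}  # deliberate bug
```

Calling `buggy_health()` returns the structured 503 instead of raising `ZeroDivisionError`.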


Deep vs. Shallow Health Checks

Shallow Check

A shallow check verifies that the application process is running and can accept connections. It makes no downstream calls.

GET /health → 200 (if the process can respond)

Pros: Fast, no dependency risk, can't cause cascading failures.

Cons: Tells you nothing about whether the service can actually do useful work.

Use case: Liveness checks, load balancer ping checks where you only want to know if the process is alive.

Deep Check

A deep check verifies that the application's critical dependencies are actually functional by performing real (or representative) operations against them.

GET /health
- Queries the database (SELECT 1)
- Checks cache connectivity (PING)
- Verifies critical background queues are processing

Pros: Accurate reflection of service capability; surfaces dependency failures before they affect users.

Cons: Adds latency (each dependency check takes time); creates load on dependencies (at high check frequency); if a dependency is slow, the health check is slow; if a check hangs, the health check hangs.

Use case: Readiness checks, primary uptime monitoring endpoint.

Balancing Depth with Risk

Deep checks that actually perform write operations to verify dependencies introduce risk: a misconfigured health check that writes to production data on every check, at 1-minute intervals, can pollute production data or create load.

Best practice for database health checks: Use a read-only, lightweight query (SELECT 1 or equivalent). Don't write test records. Don't query large tables. Don't join across tables.

Best practice for timeouts: Set aggressive timeouts on dependency checks within health endpoints. If a database query hasn't returned in 2 seconds during a health check, treat it as unhealthy. Don't let dependency checks hang the health endpoint.

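# Set a 2-second timeout on the database health check.
# `db`, `status`, and `overall_healthy` are assumed to exist in the handler.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

executor = ThreadPoolExecutor(max_workers=1)
try:
    # Run the query in a worker thread and wait at most 2 seconds for it
    executor.submit(db.execute, "SELECT 1").result(timeout=2)
    status["database"] = "ok"
except TimeoutError:
    status["database"] = "timeout"
    overall_healthy = False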

Dedicated Health Check Connections

For database-backed services, consider using a dedicated database connection (or a pool of 1–2 connections) for health checks, separate from your application's main connection pool. This keeps health checks from competing with application requests for connections, and keeps a briefly saturated pool from failing the health check while the application is still serving requests. The trade-off is that a dedicated connection cannot detect exhaustion of the main pool, so track pool saturation as a separate signal.
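A dedicated health check connection might look like the sketch below. `sqlite3` is only a stand-in so the example runs anywhere; `DedicatedDbProbe` is a hypothetical name, and `connect` is assumed to be any zero-argument callable that returns a DB-API connection.

```python
import sqlite3


class DedicatedDbProbe:
    """Holds one connection that is used only for health checks."""

    def __init__(self, connect):
        self._connect = connect
        self._conn = None

    def check(self) -> bool:
        try:
            if self._conn is None:
                self._conn = self._connect()
            cursor = self._conn.cursor()
            cursor.execute("SELECT 1")
            cursor.fetchone()
            return True
        except Exception:
            # Drop the broken connection so the next check reconnects.
            self._conn = None
            return False


probe = DedicatedDbProbe(lambda: sqlite3.connect(":memory:"))
```

With a real driver you would also set a connect timeout and a statement timeout on this connection, since the probe itself does not enforce one.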


Dependency Health Propagation

When your service has multiple downstream dependencies, you need a policy for how their health propagates to your service's health status.

Critical vs. Non-Critical Dependencies

Classify each dependency:

Critical: Failure means the service cannot handle requests. Return 503.

  • Primary database (for read/write applications)
  • Authentication service (for applications that require auth)
  • Payment processor (for checkout flows)

Non-critical: Failure degrades the service but doesn't prevent request handling. Return 200 with degraded status.

  • Cache layer (if fallback to database is implemented)
  • Search index (if basic functionality works without search)
  • Analytics service (if failure is invisible to users)
  • Non-essential third-party integrations

Optional: Failure is informational. Return 200 with status in response.

  • Feature flag service (if flags have defaults)
  • A/B testing service (if tests degrade gracefully)

Document this classification for your team. It should be a deliberate decision, not implicit in your health check implementation.

Partial Outage Response

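When some dependencies are down and others are up, the overall status should reflect the worst state among the critical dependencies. In the first response below only non-critical dependencies are down; in the second, a critical dependency is down: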

{
  "status": "degraded",
  "http_status": 200,
  "dependencies": {
    "database": {"status": "ok"},
    "cache": {"status": "unavailable", "critical": false},
    "search_index": {"status": "unavailable", "critical": false}
  },
  "message": "Cache and search unavailable. Core functionality operational with degraded performance."
}

{
  "status": "unhealthy",
  "http_status": 503,
  "dependencies": {
    "database": {"status": "unreachable", "critical": true},
    "cache": {"status": "ok"}
  },
  "message": "Database unreachable. Service cannot handle requests."
}
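Responses like the two above can be derived mechanically from per-dependency results. The sketch below assumes each dependency entry carries a `status` string and a `critical` flag; the function name `aggregate` is hypothetical.

```python
def aggregate(dependencies: dict[str, dict]) -> dict:
    """Derive the overall status from per-dependency results.

    Each entry looks like {"status": "ok" or another string, "critical": bool}.
    A missing "critical" key is treated as non-critical.
    """
    failed = {name: dep for name, dep in dependencies.items()
              if dep["status"] != "ok"}
    if any(dep.get("critical", False) for dep in failed.values()):
        status, http_status = "unhealthy", 503
    elif failed:
        status, http_status = "degraded", 200
    else:
        status, http_status = "ok", 200
    return {"status": status, "http_status": http_status,
            "dependencies": dependencies}
```

Keeping the `critical` flag in data rather than in handler logic makes the classification easy to review and document.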

Circuit Breakers in Health Checks

If your service uses circuit breakers for dependency calls, your health check can reflect circuit breaker state:

{
  "status": "unhealthy",
  "dependencies": {
    "payments_service": {
      "status": "circuit_open",
      "circuit_state": "open",
      "error_rate_5m": 0.85,
      "critical": true
    }
  }
}

An open circuit breaker on a critical dependency should cause the health check to return 503 — the service is refusing to route payment requests because the dependency is failing.
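Mapping breaker state to health status can be sketched as follows. `BreakerSnapshot` is a hypothetical view of whatever your circuit breaker library exposes, and treating a half-open breaker as healthy is an assumption of this sketch, not a rule.

```python
from dataclasses import dataclass


@dataclass
class BreakerSnapshot:
    """Hypothetical view of a circuit breaker's current state."""
    name: str
    state: str            # "closed", "open", or "half_open"
    error_rate_5m: float
    critical: bool


def breaker_health(breaker: BreakerSnapshot) -> tuple[int, dict]:
    tripped = breaker.state == "open"  # assumption: half_open counts as ok
    entry = {
        "status": "circuit_open" if tripped else "ok",
        "circuit_state": breaker.state,
        "error_rate_5m": breaker.error_rate_5m,
        "critical": breaker.critical,
    }
    if tripped and breaker.critical:
        return 503, {"status": "unhealthy", "dependencies": {breaker.name: entry}}
    status = "degraded" if tripped else "ok"
    return 200, {"status": status, "dependencies": {breaker.name: entry}}
```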


Standardizing Health Check Paths

Multiple conventions exist for health check endpoint paths. Standardizing across your services makes configuration consistent:

| Path | Common Context |
|---|---|
| /health | General-purpose; most common convention |
| /healthz | Kubernetes convention (the z suffix is a Google internal convention that leaked into the Kubernetes ecosystem) |
| /_health | Leading underscore signals operational endpoint, not user-facing |
| /ping | Shallow liveness-only check; typically returns pong or 200 with minimal body |
| /ready | Readiness-only; used when separate liveness and readiness paths are needed |
| /live | Liveness-only |
| /status | Sometimes used for more detailed status pages; can return HTML or JSON |

Recommendations

For external uptime monitoring with Vigilmon, any consistent path works — Vigilmon checks the URL you specify. What matters is consistency within your organization:

  • Pick one path convention and document it in your service template
  • Use the same path across all your services — monitoring configuration, runbooks, and dashboards all benefit from consistency
  • Expose separate /healthz/live and /healthz/ready if you use Kubernetes, and configure Vigilmon against /healthz/ready (the more meaningful external check)
  • Protect health endpoints from accidental inclusion in your application's authentication middleware — they should be accessible to monitoring services without credentials (or with a dedicated monitoring token)
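If you choose the dedicated monitoring token option from the last recommendation, the check itself can stay very small. In this sketch the header name `X-Monitor-Token` and the function name `monitor_authorized` are assumptions; use whatever header your monitoring service lets you configure.

```python
import hmac


def monitor_authorized(headers: dict[str, str], expected_token: str) -> bool:
    """Check a dedicated monitoring token sent in a request header."""
    if not expected_token:
        # Refuse to authorize anyone if no token has been configured.
        return False
    supplied = headers.get("X-Monitor-Token", "")
    # compare_digest avoids leaking the token through timing differences.
    return hmac.compare_digest(supplied.encode(), expected_token.encode())
```

Keep this token separate from user credentials so it can be rotated without touching application authentication.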

Practical Examples

Express.js (Node.js)

const express = require('express');
const app = express();

app.get('/health', async (req, res) => {
  const health = {
    status: 'ok',
    version: process.env.APP_VERSION || 'unknown',
    uptime_seconds: Math.floor(process.uptime()),
    dependencies: {}
  };

  let isHealthy = true;

  // Check database
  try {
    await Promise.race([
      db.query('SELECT 1'),
      new Promise((_, reject) => setTimeout(() => reject(new Error('timeout')), 2000))
    ]);
    health.dependencies.database = { status: 'ok' };
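  } catch (err) {
    // Log details server-side; raw driver errors can leak hostnames or credentials
    console.error('health: database check failed:', err.message);
    health.dependencies.database = { status: 'unreachable' };
    isHealthy = false;
  }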

  // Check Redis cache (non-critical)
  try {
    await Promise.race([
      redis.ping(),
      new Promise((_, reject) => setTimeout(() => reject(new Error('timeout')), 1000))
    ]);
    health.dependencies.cache = { status: 'ok' };
  } catch (err) {
    console.error('health: cache check failed:', err.message);
    health.dependencies.cache = { status: 'unavailable' };
    // Don't set isHealthy = false: the cache is non-critical
  }

  health.status = isHealthy ? 'ok' : 'unhealthy';
  res.status(isHealthy ? 200 : 503).json(health);
});

FastAPI (Python)

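from fastapi import FastAPI
from fastapi.responses import JSONResponse
import asyncio  # asyncio.timeout() requires Python 3.11+
import time

# `db` and `redis` are assumed to be async clients created elsewhere in your app.
app = FastAPI()
startup_time = time.time()  # captured at import; used for uptime_seconds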

@app.get("/health")
async def health_check():
    health = {
        "status": "ok",
        "uptime_seconds": int(time.time() - startup_time),
        "dependencies": {}
    }
    is_healthy = True

    # Check database (critical)
    try:
        async with asyncio.timeout(2.0):
            await db.execute("SELECT 1")
        health["dependencies"]["database"] = {"status": "ok"}
    except Exception:  # includes the TimeoutError raised by asyncio.timeout()
        health["dependencies"]["database"] = {"status": "unreachable"}
        is_healthy = False

    # Check Redis (non-critical)
    try:
        async with asyncio.timeout(1.0):
            await redis.ping()
        health["dependencies"]["cache"] = {"status": "ok"}
    except Exception:
        health["dependencies"]["cache"] = {"status": "unavailable"}
        # Non-critical: don't set is_healthy = False

    health["status"] = "ok" if is_healthy else "unhealthy"
    status_code = 200 if is_healthy else 503
    return JSONResponse(content=health, status_code=status_code)

Django (Python)

# health/views.py
from django.http import JsonResponse
from django.db import connection
from django.core.cache import cache

def health(request):
    health_data = {
        "status": "ok",
        "dependencies": {}
    }
    is_healthy = True

    # Check database (critical)
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
        health_data["dependencies"]["database"] = {"status": "ok"}
    except Exception:
        health_data["dependencies"]["database"] = {"status": "unreachable"}
        is_healthy = False

    # Check cache (non-critical)
    try:
        cache.set("_health_check", "1", timeout=10)
        if cache.get("_health_check") == "1":
            health_data["dependencies"]["cache"] = {"status": "ok"}
        else:
            raise ValueError("Cache read/write mismatch")
    except Exception:
        health_data["dependencies"]["cache"] = {"status": "unavailable"}

    health_data["status"] = "ok" if is_healthy else "unhealthy"
    status_code = 200 if is_healthy else 503
    return JsonResponse(health_data, status=status_code)

# urls.py
from django.urls import path
from health.views import health

urlpatterns = [
    path("health", health),
]

Go (net/http)

package main

import (
    "context"
    "encoding/json"
    "net/http"
    "time"
)

type HealthResponse struct {
    Status       string                 `json:"status"`
    Dependencies map[string]interface{} `json:"dependencies"`
}

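// db (*sql.DB) and redisClient (a go-redis client) are assumed to be
// package-level variables initialized elsewhere.
func healthHandler(w http.ResponseWriter, r *http.Request) {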
    response := HealthResponse{
        Status:       "ok",
        Dependencies: make(map[string]interface{}),
    }
    isHealthy := true

    // Check database (critical) — 2 second timeout
    ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
    defer cancel()

    if err := db.PingContext(ctx); err != nil {
        response.Dependencies["database"] = map[string]string{
            "status": "unreachable",
        }
        isHealthy = false
    } else {
        response.Dependencies["database"] = map[string]string{
            "status": "ok",
        }
    }

    // Check Redis (non-critical) — 1 second timeout
    rctx, rcancel := context.WithTimeout(r.Context(), 1*time.Second)
    defer rcancel()

    if err := redisClient.Ping(rctx).Err(); err != nil {
        response.Dependencies["cache"] = map[string]string{
            "status": "unavailable",
        }
        // Non-critical: don't set isHealthy = false
    } else {
        response.Dependencies["cache"] = map[string]string{
            "status": "ok",
        }
    }

    statusCode := http.StatusOK
    if !isHealthy {
        response.Status = "unhealthy"
        statusCode = http.StatusServiceUnavailable
    }

    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(statusCode)
    json.NewEncoder(w).Encode(response)
}

Ruby on Rails

# config/routes.rb
Rails.application.routes.draw do
  get "/health", to: "health#show"
end

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  skip_before_action :authenticate_user!, raise: false

  def show
    health = { status: "ok", dependencies: {} }
    is_healthy = true

    # Check database (critical)
    begin
      Timeout.timeout(2) { ActiveRecord::Base.connection.execute("SELECT 1") }
      health[:dependencies][:database] = { status: "ok" }
    rescue => e
      health[:dependencies][:database] = { status: "unreachable" }
      is_healthy = false
    end

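    # Check Redis cache (non-critical)
    # Note: some cache stores (for example RedisCacheStore) rescue connection
    # errors and return nil, so this read may not raise when Redis is down.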
    begin
      Timeout.timeout(1) { Rails.cache.read("_health_check") }
      health[:dependencies][:cache] = { status: "ok" }
    rescue => e
      health[:dependencies][:cache] = { status: "unavailable" }
      # Non-critical: don't set is_healthy = false
    end

    health[:status] = is_healthy ? "ok" : "unhealthy"
    status_code = is_healthy ? :ok : :service_unavailable
    render json: health, status: status_code
  end
end

Configuring Uptime Monitoring for Health Endpoints

Once you have well-designed health endpoints, configuring uptime monitoring is straightforward:

With Vigilmon

  1. Add an HTTP monitor pointed at https://your-service.com/health
  2. Expect HTTP status 200 — Vigilmon's multi-region probes will confirm availability from multiple geographic locations
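  3. Optionally configure body matching: confirm the response body contains "status":"ok" to catch cases where the endpoint returns 200 with an error body. Match the exact serialization your framework emits, since some JSON encoders write "status": "ok" with a space after the colon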
  4. Set check interval based on your SLA requirements (1-minute checks for production critical services)

Body Matching for Extra Precision

If your health endpoint returns 200 for both "ok" and "degraded" states, and you want monitoring to alert only on hard failures (503), rely on HTTP status code matching.

If you want monitoring to alert on degraded state as well, configure body matching to look for "status":"ok" and treat any other body content as a failure — even if the HTTP status is 200.

Vigilmon's body matching option lets you specify a string that must be present in the response body. Pair this with an HTTP 200 check for maximum precision in detecting degraded states.
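To see how substring matching interacts with JSON formatting, here is a sketch of the two evaluation strategies a generic probe could use. It is not Vigilmon's implementation; `substring_match` and `parsed_match` are hypothetical names.

```python
import json


def substring_match(status_code: int, body: str, needle: str) -> bool:
    """Up only if the status is 200 and the body contains `needle`."""
    return status_code == 200 and needle in body


def parsed_match(status_code: int, body: str) -> bool:
    """Up only if the status is 200 and the JSON field status equals "ok"."""
    if status_code != 200:
        return False
    try:
        parsed = json.loads(body)
    except ValueError:
        return False
    return isinstance(parsed, dict) and parsed.get("status") == "ok"


# The same payload serialized two ways
compact = json.dumps({"status": "ok"}, separators=(",", ":"))
spaced = json.dumps({"status": "ok"})
```

A substring match on `"status":"ok"` succeeds against `compact` but fails against `spaced`, so when only substring matching is available, configure the string exactly as your service serializes it.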

Check Frequency Trade-offs

| Check Interval | Use Case | Time-to-Detection |
|---|---|---|
| 30 seconds | Critical revenue path, strict SLA | ~30 seconds |
| 1 minute | Production services, SOC 2 evidence | ~1 minute |
| 5 minutes | Staging, non-critical | ~5 minutes |
| 15 minutes | Batch jobs, infrequent checks | ~15 minutes |

Note that time-to-detection is the check interval plus Vigilmon's consensus confirmation time — multi-region probes must agree before the alert fires, which adds robustness at the cost of a small additional detection delay.
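The arithmetic is simple enough to write down. In this sketch the confirmation time is a placeholder you would measure for your own setup, not a published Vigilmon figure, and the function name is hypothetical.

```python
def worst_case_detection_seconds(check_interval_s: float,
                                 confirmation_s: float) -> float:
    """Worst case: the failure begins just after a successful check."""
    return check_interval_s + confirmation_s


# Example with a 60-second interval and an assumed 20-second confirmation
print(worst_case_detection_seconds(60, 20))  # prints 80
```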


Health Check Anti-Patterns to Avoid

Anti-pattern: checking services that don't affect request handling. Failing the health check because a metrics aggregation service, a feature flag service with local fallbacks, or a non-essential third-party integration is unavailable causes false alerts.

Anti-pattern: making health checks expensive. A health check that runs a full database migration check, queries multiple large tables, or makes HTTP calls to external APIs adds latency and creates load. Health checks run frequently, so keep them lightweight.

Anti-pattern: using health check endpoints for diagnostics. Don't include stack traces, configuration dumps, or verbose diagnostic information in health responses. Health responses should be terse operational status. Build a separate /debug or /diagnostics endpoint (protected by authentication) for verbose diagnostic output.

Anti-pattern: letting health endpoints trigger side effects. A health check that creates database records, sends notifications, or modifies application state repeats that side effect on every check: every minute, from multiple probe regions, forever.

Anti-pattern: no timeout on dependency checks. A hung database connection can hang your health check handler indefinitely. Set aggressive timeouts (1–3 seconds) on every external call within a health check handler.


Conclusion

A well-designed health check endpoint is a small investment that pays dividends across your entire monitoring stack. It gives uptime monitors a precise signal to alert on, gives on-call engineers diagnostic context during incidents, and prevents the alert fatigue that comes from monitoring endpoints that don't accurately reflect service health.

The core principles: check what matters for request handling (not everything), use 200 for healthy and degraded-but-functional, use 503 for genuinely unable-to-serve, set timeouts on all dependency checks, exclude sensitive data from responses, and pick a consistent path convention across your services.

Once your health endpoints are well-designed, configuring Vigilmon to monitor them gives you multi-region consensus-based alerting — multiple independent probes confirming your service is reachable from the outside world, not just from your internal infrastructure. The combination of a good health endpoint and multi-region external monitoring gives you a signal you can trust.

Try Vigilmon free at vigilmon.online — no agents, no instrumentation, no credit card, multi-region consensus alerting. Point it at your /health endpoint and know when your services are really down.

