How to Write a Perfect Health Check Endpoint

A /health endpoint sounds like one of those "just do it" things that gets added in 10 minutes and forgotten. In practice, a badly designed health endpoint creates subtle, hard-to-debug problems: your monitoring says green while your database is unreachable, your load balancer keeps routing to a broken instance, your status page says up while half your users can't log in.

This guide covers what a good health endpoint actually looks like — what to include, what to avoid, and how to make it genuinely useful for the tools that consume it.


The Most Common Mistake: Returning 200 When You're Broken

The single most damaging health check anti-pattern is returning HTTP 200 unconditionally, regardless of actual application state.

# BAD: This tells every tool "I'm healthy" even when I'm not
@app.get("/health")
def health():
    return {"status": "ok"}

This passes all monitoring checks permanently. Your uptime monitor shows 100%. Your load balancer routes traffic normally. Meanwhile your database connection pool is exhausted, your Redis cache is unreachable, and your background jobs have been silently failing for 45 minutes.

The whole point of a health check is to reflect actual health. An endpoint that always returns 200 only proves the process can answer HTTP requests. That is arguably worse than having no health endpoint at all, because it gives your monitoring and your on-call engineers false confidence.


What a Health Check Should Return

A good health response includes:

1. Overall status

A simple top-level status field with a value that maps to HTTP status codes:

  • ok → 200
  • degraded → 200 (optional — service is up but something is wrong)
  • error → 503

{
  "status": "ok"
}

2. Version information

Include your application version and build metadata. This confirms the right code is deployed and lets you verify which deployment you're actually hitting.

{
  "status": "ok",
  "version": "2.4.1",
  "commit": "a3f8c92",
  "environment": "production"
}
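
One way to avoid hard-coding these values is to have the build or deploy pipeline inject them as environment variables. The sketch below assumes that approach; the variable names APP_VERSION, GIT_COMMIT, and APP_ENV are hypothetical and should be replaced with whatever your pipeline actually sets.

```python
import os


def build_info(env=os.environ):
    """Return version metadata for the health response.

    APP_VERSION, GIT_COMMIT and APP_ENV are hypothetical names used for
    illustration; missing values are reported as "unknown".
    """
    return {
        "version": env.get("APP_VERSION", "unknown"),
        # Shorten the commit hash to the usual 7-character form.
        "commit": env.get("GIT_COMMIT", "unknown")[:7],
        "environment": env.get("APP_ENV", "unknown"),
    }
```

Merging the returned dict into the health response keeps the version fields in step with what was actually deployed.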

3. Dependency checks

Check each critical external dependency and report its status individually. This is the most valuable part — it tells you not just "something is wrong" but what is wrong.

{
  "status": "error",
  "version": "2.4.1",
  "dependencies": {
    "database": {
      "status": "ok",
      "latency_ms": 4
    },
    "redis": {
      "status": "error",
      "error": "connection refused"
    },
    "email_service": {
      "status": "ok",
      "latency_ms": 120
    }
  }
}
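
The top-level status can be derived from the per-dependency results rather than set by hand. The following is a minimal sketch, assuming you maintain a set of dependency names that are critical for core function; the function name and the `critical` parameter are illustrative.

```python
def overall_status(dependencies, critical):
    """Roll per-dependency results up into "ok", "degraded" or "error".

    `dependencies` maps a name to a result dict with a "status" key.
    `critical` is the set of names the service cannot work without
    (an assumption of this sketch, chosen per application).
    """
    failed = {
        name
        for name, result in dependencies.items()
        if result.get("status") != "ok"
    }
    if failed & set(critical):
        return "error"      # a required dependency is down
    if failed:
        return "degraded"   # only non-critical dependencies are failing
    return "ok"
```

With the response above, treating Redis as critical yields "error"; treating only the database as critical yields "degraded".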

4. Response time

The time taken to complete the health check itself, measured inside the handler. This becomes useful over time — a health check that normally completes in 5ms but is suddenly taking 800ms tells you something is slow before it's broken.

{
  "status": "ok",
  "check_duration_ms": 12
}
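
Timing can be factored into a small helper so every check is measured the same way. This is a sketch under the assumption that each check is a zero-argument callable that raises on failure; `timed_check` is a hypothetical name. It uses `time.perf_counter()`, which is monotonic and better suited to measuring durations than wall-clock time.

```python
import time


def timed_check(check):
    """Run a zero-argument check and return a result dict with latency_ms.

    The check is considered failed if it raises. Only the exception type
    is reported, so the message cannot leak connection details.
    """
    start = time.perf_counter()
    try:
        check()
        result = {"status": "ok"}
    except Exception as exc:
        result = {"status": "error", "error": type(exc).__name__}
    # Record latency for failures too; a slow failure usually means a timeout.
    result["latency_ms"] = round((time.perf_counter() - start) * 1000)
    return result
```

The same pattern, wrapped around the whole handler, gives the total check_duration_ms.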

5. Correct HTTP status codes

Map your health status to appropriate HTTP codes:

| Status | HTTP Code | Meaning |
|---|---|---|
| ok | 200 | All dependencies healthy |
| degraded | 200 | Service is up, some dependencies slow or non-critical |
| error | 503 | Service is broken, dependency required for core function is down |

The HTTP status code matters because load balancers, Kubernetes, and monitoring tools read the status code — many don't parse the body.
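
The mapping is small enough to keep in one lookup table. In this sketch, the fallback to 503 for unrecognized values is an assumption (fail closed rather than report healthy by accident), not something the mapping itself requires.

```python
# Status value -> HTTP status code, following the mapping described above.
HTTP_STATUS = {"ok": 200, "degraded": 200, "error": 503}


def http_code_for(status):
    """Return the HTTP code for a health status.

    Unknown values map to 503 so that a typo or a new, unhandled status
    is never reported as healthy (an assumption of this sketch).
    """
    return HTTP_STATUS.get(status, 503)
```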


A Complete Example

Here's a well-structured health endpoint in Python (FastAPI):

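import os
import time

import psycopg2
import redis
from fastapi import FastAPI, Response

app = FastAPI()

# Assumption: connection URLs come from the environment (names are illustrative).
DATABASE_URL = os.environ["DATABASE_URL"]
REDIS_URL = os.environ["REDIS_URL"]

# Plain `def` rather than `async def`: psycopg2 and redis-py are blocking,
# so FastAPI runs this handler in its threadpool instead of on the event loop.
@app.get("/health")
def health(response: Response):
    start = time.perf_counter()

    checks = {}
    overall = "ok"

    # Database check (short timeout so a hung database cannot hang the probe)
    try:
        db_start = time.perf_counter()
        conn = psycopg2.connect(DATABASE_URL, connect_timeout=2)
        try:
            conn.cursor().execute("SELECT 1")
        finally:
            conn.close()
        checks["database"] = {
            "status": "ok",
            "latency_ms": round((time.perf_counter() - db_start) * 1000)
        }
    except Exception as e:
        # Expose only the exception type; str(e) can contain hosts or credentials
        checks["database"] = {"status": "error", "error": type(e).__name__}
        overall = "error"

    # Redis check
    try:
        r_start = time.perf_counter()
        r = redis.from_url(REDIS_URL, socket_connect_timeout=2, socket_timeout=2)
        r.ping()
        checks["redis"] = {
            "status": "ok",
            "latency_ms": round((time.perf_counter() - r_start) * 1000)
        }
    except Exception as e:
        checks["redis"] = {"status": "error", "error": type(e).__name__}
        overall = "error"

    if overall == "error":
        response.status_code = 503

    return {
        "status": overall,
        "version": "2.4.1",
        "environment": "production",
        "dependencies": checks,
        "check_duration_ms": round((time.perf_counter() - start) * 1000)
    }


Liveness vs. Readiness Checks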

If you're running in Kubernetes, you need to distinguish between two types of health checks with different implications:

Liveness probe

Answers: Is the process alive?

If the liveness probe fails, Kubernetes restarts the container. This is a nuclear option — use it only for detecting when the process has genuinely wedged (deadlock, memory corruption, infinite loop).

A liveness probe should be fast and minimal. It should not check dependencies. It should not connect to a database. It should only verify that the process is responding at all.

GET /health/live
→ 200 {"status": "ok"}

Because a failed liveness check restarts the container, a probe that is too aggressive (for example, one that fails whenever a dependency is slow) can cause restart loops.

Readiness probe

Answers: Is the process ready to receive traffic?

If the readiness probe fails, Kubernetes stops routing traffic to that pod — but does not restart it. This is the right place for dependency checks. A pod that can't reach the database shouldn't receive requests, but it doesn't need to be restarted.

GET /health/ready
→ 200 {"status": "ok", "dependencies": {...}}
→ 503 {"status": "error", "dependencies": {"database": {"status": "error"}}}
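
The split between the two probes can be sketched as two framework-agnostic functions that return a status code and a body. This assumes each dependency check is a zero-argument callable that raises on failure; the function names are illustrative, and wiring them to /health/live and /health/ready is left to your web framework.

```python
def liveness():
    """Liveness: touch no dependencies, just prove the process can respond."""
    return 200, {"status": "ok"}


def readiness(checks):
    """Readiness: run every dependency check; any failure means 503.

    `checks` maps a dependency name to a zero-argument callable that
    raises on failure (an assumption of this sketch).
    """
    dependencies = {}
    for name, check in checks.items():
        try:
            check()
            dependencies[name] = {"status": "ok"}
        except Exception as exc:
            # Report the exception type only, never the full message.
            dependencies[name] = {"status": "error", "error": type(exc).__name__}

    healthy = all(d["status"] == "ok" for d in dependencies.values())
    body = {"status": "ok" if healthy else "error", "dependencies": dependencies}
    return (200 if healthy else 503), body
```

A database outage then makes readiness return 503 while liveness keeps returning 200, so the pod is taken out of rotation without being restarted.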

Startup probe

For slow-starting applications, add a startup probe that gives the container time to initialize before liveness and readiness probes kick in. Without this, a slow-starting application can be killed before it finishes starting up.


What Vigilmon Keyword Monitors Can Parse

Vigilmon supports keyword monitors — HTTP checks that verify a specific string or JSON value is present in the response body. This is useful for confirming your health endpoint is returning meaningful data, not just returning 200.

A useful Vigilmon keyword monitor configuration:

  • URL: https://yourapp.com/health
  • Expected keyword: "status":"ok" or "status": "ok"
  • Alert if keyword missing: Yes
  • Alert on non-200: Yes

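This catches two failure modes: the endpoint returning a non-200 status (the obvious failure), and the endpoint returning 200 while the body reports "status":"error" (a handler that detects the problem but never sets the status code). A keyword check cannot catch a handler that hard-codes {"status": "ok"} as in the anti-pattern above; only real dependency checks fix that.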

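If you want to monitor individual dependency states, you can add separate monitors for specific keywords. Leave off the closing brace, because a healthy dependency entry continues with a latency_ms field after the status:

  • Keyword: "database":{"status":"ok" (alerts if the database check fails)
  • Keyword: "redis":{"status":"ok" (alerts if the Redis check fails)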

This gives you component-level monitoring visibility from a single health endpoint.
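
To make the logic of a keyword check concrete, here is a generic sketch of what such a monitor evaluates. It is not Vigilmon's implementation; `keyword_check` is a hypothetical function that only illustrates the two alert conditions configured above. Whitespace is stripped before comparing so that "status":"ok" and "status": "ok" are treated the same.

```python
import re


def _strip_ws(text):
    # Remove all whitespace so differently formatted JSON compares equal.
    # This also strips spaces inside string values, which is acceptable
    # for short keywords like the ones used here.
    return re.sub(r"\s+", "", text)


def keyword_check(status_code, body, keyword):
    """Return a list of alert reasons; an empty list means the check passed."""
    alerts = []
    if status_code != 200:
        alerts.append("non-200 status")
    if _strip_ws(keyword) not in _strip_ws(body):
        alerts.append("keyword missing")
    return alerts
```

A response with status 200 but "status":"error" in the body produces a "keyword missing" alert, which a status-code-only monitor would not raise.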


JSON vs. Plain Text

Always use JSON. The reasons:

  1. Machines parse it reliably. Monitoring tools, load balancers, and deployment scripts can extract specific fields.
  2. It's human-readable enough. Developers can curl /health | jq during debugging.
  3. It's extensible. You can add fields without breaking existing consumers.

Plain text ("OK") tells you exactly one bit of information and nothing more. The overhead of JSON serialization is negligible for a health endpoint.

Set Content-Type: application/json on the response so tools don't have to guess.
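
As an illustration of the first and third points, a consumer can pull out just the fields it needs and ignore anything it does not recognize. This is a sketch using only the standard library; `summarize` is a hypothetical helper name.

```python
import json


def summarize(body):
    """Return (overall status, sorted names of failing dependencies).

    Fields the function does not know about are ignored, so the health
    response can grow without breaking this consumer.
    """
    data = json.loads(body)
    dependencies = data.get("dependencies", {})
    failing = sorted(
        name
        for name, result in dependencies.items()
        if result.get("status") != "ok"
    )
    return data.get("status"), failing
```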


Security Considerations

Your health endpoint reveals information about your internal dependency graph. Take reasonable precautions:

  • Don't include credentials, hostnames, or connection strings in error messages. "error": "connection refused" is fine. "error": "could not connect to postgres://admin:password@db.internal:5432/prod" is not.
  • Consider rate limiting or IP allowlisting for detailed health responses. Public /health can return minimal info; a restricted /health/detailed can return dependency breakdown.
  • Log access to health endpoints sparingly. High-frequency monitoring will generate significant log volume if every check is logged at INFO level. Use DEBUG or a sampled logger.
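
For the first point, one approach is to never pass the raw exception text through at all and instead map exceptions to a fixed vocabulary of messages. The sketch below uses only built-in exception types; the mapping and the `safe_error` name are assumptions, and in practice you would add your driver's own exception classes and log the full exception server-side.

```python
# Ordered from most to least specific: ConnectionRefusedError is a
# subclass of ConnectionError, so it must be tested first.
SAFE_MESSAGES = (
    (ConnectionRefusedError, "connection refused"),
    (TimeoutError, "timeout"),
    (ConnectionError, "connection error"),
)


def safe_error(exc):
    """Return a fixed, non-sensitive message for an exception.

    The original message is deliberately discarded because it may
    contain hostnames, credentials, or a full connection string.
    """
    for exc_type, message in SAFE_MESSAGES:
        if isinstance(exc, exc_type):
            return message
    return "unavailable"
```

Because the output is always one of a few known strings, nothing from a connection string can reach the response body.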

Quick Checklist

  • [ ] Returns 503 (not 200) when critical dependencies are down
  • [ ] Includes version and environment fields
  • [ ] Checks each critical dependency individually
  • [ ] Reports per-dependency latency and status
  • [ ] Includes total check_duration_ms
  • [ ] Uses Content-Type: application/json
  • [ ] Error messages don't expose credentials or internal hostnames
  • [ ] Kubernetes deployments have separate /health/live and /health/ready endpoints
  • [ ] An external monitor verifies the response body, not just the status code

Conclusion

A good health endpoint takes 30 minutes to write and saves hours of debugging when something goes wrong. The key properties: it reflects actual state, it fails loudly with a non-200 status when dependencies are down, it tells you which dependency is the problem, and it's parseable by the tools that consume it.

Pair it with external monitoring that validates the response body — not just the HTTP status code — and you'll catch the "always returns 200" anti-pattern before it bites you in production.

Monitor your health endpoints externally with Vigilmon — free tier includes keyword monitoring, multi-region consensus checks, and instant Slack alerts.


Tags: #webdev #devops #monitoring #healthcheck #kubernetes
