Monitoring AI and LLM API Endpoints in Production 2026

AI APIs have become critical infrastructure. In 2026, a significant share of production SaaS applications depend on OpenAI, Anthropic, Cohere, or self-hosted inference servers to deliver their core features — and when those APIs are unavailable or slow, the application fails in ways that are harder to diagnose and harder to explain to users than traditional database or cache failures.

This guide covers why AI API availability demands a different monitoring approach, health check patterns for inference servers like Ollama, vLLM, and LiteLLM, latency alerting for LLM APIs, fallback provider monitoring, heartbeat monitoring for AI background jobs, and how to configure Vigilmon to cover the full surface area of an AI-powered SaaS in production.

Why AI API Availability Matters More Than Ever

LLM API calls sit on the critical path for an increasing number of user-facing features. Summarization, generation, classification, and chat features all depend on inference API calls completing successfully. When the inference layer fails:

Features silently degrade: A summarization feature that returns an empty string looks broken, not unavailable. A chat assistant that returns a generic error message is a worse UX than a clear "this feature is temporarily down" message.
Error semantics differ from traditional APIs: A 429 (rate limit) means your quota is exhausted while the service is still up; a 503 (service unavailable) means the service itself is failing. Taking 30 seconds to respond is normal for a long LLM completion, whereas the same delay from a conventional REST endpoint indicates a serious problem.
Self-hosted models add operational complexity: Teams running Ollama or vLLM are operating their own inference infrastructure — GPU memory management, model loading times, and worker concurrency are failure modes that don't exist with hosted APIs.
Provider incidents affect many services simultaneously: When OpenAI or Anthropic experiences an incident, thousands of applications that depend on those APIs degrade at the same time. Knowing about the incident seconds after it starts — rather than when users start reporting problems — is materially valuable.

Monitoring Hosted LLM APIs

What to Monitor for Each Provider

Hosted LLM providers publish status pages and uptime metrics, but these often lag real incidents. External checks against the API itself typically detect an incident earlier, and they reflect what your own authenticated requests would see.

OpenAI (api.openai.com)

OpenAI's API uses standard HTTP. The most reliable outside-in check is a lightweight GET request against the /v1/models endpoint, which lists available models and requires a valid API key but does not trigger a billed inference call:

# Health check command (run from a monitoring script)
curl -s -o /dev/null -w "%{http_code}" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  https://api.openai.com/v1/models

A 200 response confirms the API is accepting authenticated requests. A 401 indicates an API key problem (not a service outage). A 503 or timeout indicates a service availability problem.
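
The status-code interpretation above can be written down as a small classifier. This is a minimal sketch: the function name and the verdict labels are hypothetical, and the 429 branch follows the rate-limit distinction made earlier in this guide.

```python
from typing import Optional


def classify_probe(status_code: Optional[int]) -> str:
    """Map the result of a models-listing probe to a coarse verdict.

    status_code is None when the request timed out or could not connect.
    """
    if status_code is None:
        return "outage"        # timeout or connection failure
    if status_code == 200:
        return "healthy"       # API is accepting authenticated requests
    if status_code in (401, 403):
        return "credentials"   # API key problem, not a provider outage
    if status_code == 429:
        return "rate_limited"  # quota pressure; the service is still up
    if status_code >= 500:
        return "outage"        # e.g. 503 service unavailable
    return "unknown"
```

Keeping "credentials" and "rate_limited" separate from "outage" lets a monitoring script route key problems to the owning team instead of paging on-call.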

Anthropic (api.anthropic.com)

Anthropic's API exposes a similar models listing endpoint. The request needs the x-api-key and anthropic-version headers:

curl -s -o /dev/null -w "%{http_code}" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  https://api.anthropic.com/v1/models

Cohere (api.cohere.com)

Cohere's health check follows the same pattern against their /v1/models endpoint with an Authorization: Bearer header.

Configuring Vigilmon for Hosted LLM API Monitoring

For hosted AI APIs, configure Vigilmon's HTTP monitors pointing to the provider endpoint. Since these calls require authentication headers, use Vigilmon's custom header support:

Create an HTTP monitor targeting https://api.openai.com/v1/models
Add the Authorization: Bearer {your_key} header
Validate status code 200
Set check interval to 1 minute (or 5 minutes on free tier)
Configure alert routing to your on-call channel

The key insight: Vigilmon's multi-region consensus means a single slow response from one probe does not trigger an alert. Only when the majority of probes observe a failure — confirming it's not a transient network blip to one probe location — does the alert fire.
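
To illustrate the idea of majority consensus (not Vigilmon's actual implementation, which is not described here), a strict-majority rule over per-probe results might look like the sketch below. The function name and the input shape are assumptions.

```python
from typing import Iterable


def consensus_down(probe_failed: Iterable[bool]) -> bool:
    """Return True only when a strict majority of probes saw a failure.

    Each item is True if that probe location observed the check failing.
    """
    results = list(probe_failed)
    if not results:
        return False  # no data is not evidence of an outage
    failures = sum(results)
    return failures > len(results) / 2
```

With three probes, one failing location does not alert, while two do. An even split is treated as "not down", which favors fewer false positives.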

What Not to Monitor with Uptime Checks

Do not run inference completion requests (generating tokens) as health checks in Vigilmon. These are billed calls and incur cost on every check cycle. The models listing endpoint confirms API reachability without triggering inference billing.

Monitoring Self-Hosted Inference Servers

Self-hosted inference servers introduce infrastructure monitoring concerns that hosted API users don't face. GPU memory exhaustion, model loading failures, worker process crashes, and CUDA errors all require visibility.

Ollama

Ollama exposes a built-in health endpoint at /:

# Returns "Ollama is running" with HTTP 200 when healthy
curl http://localhost:11434/

If Ollama sits behind a reverse proxy, make sure the proxy forwards this path so the check works from outside. A more thorough check also confirms that the model your application needs is present:

# List the models available on this server
curl http://localhost:11434/api/tags

The /api/tags endpoint lists the models that have been pulled to the server, whether or not they are currently loaded into memory (/api/ps lists the models that are loaded). If your application depends on a specific model (e.g., llama3:8b), validate that its name appears in the response.
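
If you run this check from your own script rather than a body-contains rule, the model lookup can be done against the parsed JSON. This is a minimal sketch that assumes the /api/tags response has a top-level "models" list whose entries carry a "name" field; the function name is hypothetical.

```python
import json


def model_listed(tags_body: str, wanted: str) -> bool:
    """Return True if `wanted` (e.g. "llama3:8b") appears in an /api/tags body."""
    payload = json.loads(tags_body)
    # Tolerate a missing "models" key or entries without a "name" field.
    names = {entry.get("name") for entry in payload.get("models", [])}
    return wanted in names
```

Matching on the exact name, tag included, avoids a false pass when only a different variant of the model (for example llama3:70b) is present.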

Vigilmon setup for Ollama:

If your Ollama instance is publicly accessible (on a VPS or behind a reverse proxy):

Create an HTTP monitor targeting https://your-llm-host.example.com/
Validate response body contains Ollama is running
Add a second HTTP monitor targeting https://your-llm-host.example.com/api/tags
Validate response body contains the model name your application depends on

TCP monitor for Ollama:

If the Ollama API port (11434) is exposed directly on your infrastructure:

Create a TCP monitor targeting your-llm-host.example.com:11434
This confirms the port is reachable even without HTTP-level validation

vLLM

vLLM's OpenAI-compatible server exposes a health endpoint:

# Health check
curl http://localhost:8000/health

# Model listing (OpenAI-compatible)
curl http://localhost:8000/v1/models

A healthy vLLM server returns HTTP 200 from /health. The /v1/models endpoint additionally validates that models are loaded and serving.

Configure Vigilmon to check /health for availability and optionally /v1/models for model readiness.

LiteLLM Proxy

LiteLLM acts as a unified proxy in front of multiple LLM providers. Its health endpoint:

# LiteLLM health check (checks all configured model backends)
curl http://localhost:4000/health

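LiteLLM's /health endpoint actively checks each configured model backend and returns a structured report of healthy and unhealthy endpoints. Validate the response body as well as the status code: the signal you want is a report with no unhealthy endpoints, which confirms all configured backends are reachable. If the proxy is protected by a master key, the check also needs an Authorization: Bearer header. Because the check reaches out to every upstream, it can be slower than a plain liveness probe and may consume provider quota, so a longer interval is reasonable.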

This makes LiteLLM's health endpoint particularly valuable for monitoring: a single check validates Ollama, OpenAI, Anthropic, and any other configured provider simultaneously.
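
A script consuming the health report could reduce it to a single verdict plus the list of failing backends. The sketch below assumes the body contains "healthy_endpoints" and "unhealthy_endpoints" lists whose entries include a "model" field; verify the field names against your LiteLLM version. The function name is hypothetical.

```python
import json
from typing import List, Tuple


def summarize_litellm_health(body: str) -> Tuple[bool, List[str]]:
    """Return (all_backends_healthy, names_of_failing_models)."""
    payload = json.loads(body)
    healthy = payload.get("healthy_endpoints", [])
    unhealthy = payload.get("unhealthy_endpoints", [])
    failing = [str(entry.get("model", "unknown")) for entry in unhealthy]
    # Require at least one healthy backend so an empty report is not a pass.
    return bool(healthy) and not unhealthy, failing
```

Returning the failing model names alongside the verdict makes the alert message actionable without opening the proxy logs.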

Latency Alerting for LLM APIs

LLM API latency is fundamentally different from traditional REST API latency. For a database query on a hot path, a 500ms response is a serious problem. For an LLM API generating a 1,000-token completion, a 500ms response is fast.

Expected Latency Ranges

| Provider / Model | Time to First Token | Full Completion |
|---|---|---|
| OpenAI GPT-4o (streaming) | 300–800ms | 3–15s (varies by length) |
| OpenAI GPT-4o-mini | 150–400ms | 1–5s |
| Anthropic Claude Haiku | 200–500ms | 1–4s |
| Anthropic Claude Sonnet | 400–900ms | 3–12s |
| Local Ollama (CPU, Llama 3 8B) | 1–5s | 10–60s |
| Local Ollama (GPU, Llama 3 8B) | 100–400ms | 1–5s |

Vigilmon's response time history tracks check latency over time. This is most useful for detecting relative degradation — if your Ollama instance normally responds to the health check in 80ms but recent checks show 800ms, GPU memory pressure or process issues may be developing.
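
If you export response times and want to flag relative degradation yourself, one simple approach is to compare the median of recent checks against a baseline median. This is a sketch under stated assumptions: the function name and the 3x factor are illustrative choices, not Vigilmon settings.

```python
from statistics import median
from typing import Sequence


def is_degraded(
    baseline_ms: Sequence[float],
    recent_ms: Sequence[float],
    factor: float = 3.0,
) -> bool:
    """Flag degradation when the recent median exceeds factor x the baseline median."""
    if not baseline_ms or not recent_ms:
        return False  # not enough data to compare
    return median(recent_ms) > factor * median(baseline_ms)
```

Medians are used instead of means so that one slow check caused by a network blip does not trip the comparison.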

Setting Latency Thresholds

For Vigilmon's TCP and HTTP monitors targeting LLM inference servers, configure check timeouts that match normal operational behavior:

Hosted APIs (OpenAI, Anthropic): 10-second timeout for the models listing endpoint (which should respond quickly). A timeout here indicates serious API degradation.
Self-hosted Ollama/vLLM health endpoint: 5-second timeout. These should respond immediately without triggering inference.
Self-hosted health endpoint under GPU load: If your server is actively processing long inference requests, health check response times may momentarily increase. Use a 10-second timeout to avoid false positives under load.

Fallback Provider Monitoring

Many production AI applications implement provider fallback: if OpenAI is unavailable, route requests to Anthropic or a local model. This architecture requires monitoring the fallback path, not just the primary path.

The Fallback Monitoring Pattern

If your application can fall back from OpenAI to Anthropic:

Monitor both api.openai.com/v1/models and api.anthropic.com/v1/models independently
Alert when both are degraded simultaneously (the fallback path is also down)
Alert when either one is degraded, with a more sensitive threshold on the primary provider

Vigilmon's multi-monitor setup covers this naturally: separate monitors for each provider give you independent visibility into each path.
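
If you combine the two monitor states in your own alert-handling code, the pattern reduces to a four-way decision. The severity labels below are hypothetical and should be mapped to whatever your on-call tooling uses.

```python
def fallback_severity(primary_up: bool, fallback_up: bool) -> str:
    """Combine primary and fallback provider health into one severity."""
    if primary_up and fallback_up:
        return "ok"
    if not primary_up and not fallback_up:
        return "critical"  # no healthy path: AI features are down
    if not primary_up:
        return "high"      # traffic is running on the fallback
    return "warning"       # primary is fine but there is no safety net
```

The "warning" case is the one most setups miss: nothing is broken for users yet, but the next primary incident would become a full outage.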

Monitoring LiteLLM Fallback Configuration

If you use LiteLLM's fallback routing:

# litellm_config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  # Fallback deployment; its model_name must match the name used in `fallbacks`
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: least-busy
  num_retries: 2
  fallbacks: [{"gpt-4o": ["claude-sonnet"]}]

LiteLLM's /health endpoint validates all configured backends. A Vigilmon monitor on this endpoint gives you a single signal that covers the entire fallback chain.

Heartbeat Monitoring for AI Background Jobs

Many AI-powered applications run inference in background jobs: document processing pipelines, embedding generation, content moderation, report generation, and batch summarization. These jobs have the same silent failure modes as any background job — but the consequences can be harder to notice until a significant backlog has accumulated.

Common AI Background Job Patterns

Embedding pipeline: A job that processes newly uploaded documents, generates vector embeddings via OpenAI or a local sentence-transformer model, and stores them in a vector database. If the job silently fails, search quality degrades as new documents aren't indexed.

Nightly batch summarization: A job that summarizes the day's user activity for an analytics report. If it fails, the report is missing — but the failure may not surface until the report is opened the next day.

Content moderation: A job that runs new submissions through a moderation model before publishing. If the job fails, submissions queue up unmoderated.

Setting Up Heartbeat Monitoring

For each AI background job, instrument with a Vigilmon heartbeat ping at the end of successful execution:

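import logging
import os

import httpx
from openai import AsyncOpenAI

logger = logging.getLogger(__name__)
openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

# fetch_unprocessed_documents() and store_embedding() are application-specific helpers.
async def process_document_embeddings():
    try:
        documents = await fetch_unprocessed_documents()
        for doc in documents:
            embedding = await openai_client.embeddings.create(
                input=doc.content,
                model="text-embedding-3-small",
            )
            await store_embedding(doc.id, embedding.data[0].embedding)
    except Exception:
        # Do NOT call the heartbeat; let it expire so the alert fires
        logger.exception("Embedding pipeline failed")
        raise

    # Only ping after every document has been processed successfully
    async with httpx.AsyncClient(timeout=10.0) as client:
        response = await client.get(os.environ["VIGILMON_EMBEDDING_HEARTBEAT_URL"])
        response.raise_for_status()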

Configure the heartbeat window generously for AI jobs that call hosted APIs: if the job processes 1,000 documents and each embedding call takes 100ms, the job runs for ~100 seconds. Add a 15-minute grace period to absorb API latency variability, rate limiting backoff, and retry delays.
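
The arithmetic above can be kept next to the job definition so the window is recalculated when batch sizes change. This is a minimal sketch with hypothetical function names; it assumes calls run sequentially, as in the example job.

```python
from datetime import timedelta


def expected_runtime(num_items: int, per_call_ms: int) -> timedelta:
    """Estimate the runtime of a job that makes one sequential API call per item."""
    return timedelta(milliseconds=num_items * per_call_ms)


def heartbeat_window(runtime: timedelta, grace: timedelta) -> timedelta:
    """Total time allowed after the job starts before a missing ping is an alert."""
    return runtime + grace
```

For the example above, `expected_runtime(1000, 100)` gives 100 seconds, and adding a 15-minute grace period gives a window of 1,000 seconds, or 16 minutes 40 seconds.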

AI Job Heartbeat Window Sizing

| Job Type | Typical Duration | Recommended Grace Period |
|---|---|---|
| Hourly embedding sync (small batch) | 2–10 minutes | 20 minutes |
| Nightly batch summarization | 15–60 minutes | 2 hours |
| Document classification pipeline | 5–30 minutes | 45 minutes |
| Weekly report generation | 30–120 minutes | 3 hours |
| Real-time inference health ping | Continuous | 5 minutes |
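
To make the role of the grace period concrete, here is a sketch of how an expiry decision can be evaluated: a heartbeat counts as missed once the expected interval plus the grace period has elapsed since the last ping. This illustrates the concept and is not Vigilmon's internal logic; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_overdue(
    last_ping: datetime,
    interval: timedelta,
    grace: timedelta,
    now: datetime,
) -> bool:
    """Return True once `now` is past last_ping + interval + grace."""
    return now > last_ping + interval + grace


# Example: a nightly job that last pinged at 02:00 UTC, with a 2-hour grace period.
last = datetime(2026, 1, 10, 2, 0, tzinfo=timezone.utc)
day, grace = timedelta(hours=24), timedelta(hours=2)
still_ok = heartbeat_overdue(last, day, grace, datetime(2026, 1, 11, 3, 30, tzinfo=timezone.utc))
expired = heartbeat_overdue(last, day, grace, datetime(2026, 1, 11, 4, 1, tzinfo=timezone.utc))
```

Here `still_ok` is False because 03:30 is inside the grace period, while `expired` is True because 04:01 is past the 04:00 deadline.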

Practical Vigilmon Setup for an AI-Powered SaaS

A complete AI application monitoring setup with Vigilmon:

HTTP Uptime Monitors:

Primary AI API (e.g., OpenAI /v1/models) — 1-minute interval
Fallback AI API (e.g., Anthropic /v1/models) — 1-minute interval
Self-hosted LLM health endpoint (/health) — 30-second interval
LiteLLM proxy health endpoint (/health) — 1-minute interval
Application API health endpoint — 1-minute interval

TCP Port Monitors:

Ollama or vLLM TCP port — 1-minute interval
Vector database port (Qdrant :6333, Weaviate :8080; Pinecone is a hosted service, so use an HTTP monitor for it instead) — 5-minute interval

Heartbeat Monitors:

Embedding pipeline (runs hourly): 1h interval, 20m grace
Nightly summarization job (runs at 02:00 UTC): 24h interval, 2h grace
Content moderation pipeline (runs every 15 minutes): 15m interval, 5m grace
Weekly report generation: 7d interval, 3h grace

Alert Routing:

AI API availability failures → PagerDuty (immediate page)
Self-hosted LLM server failures → PagerDuty (immediate page)
Heartbeat expiries → PagerDuty (immediate page for business-critical jobs)
Heartbeat expiries → Slack #ai-ops (for lower-priority jobs)

Common AI Monitoring Mistakes

Not Separating Inference Latency from Availability

An LLM API that responds in 45 seconds to a completion request is "available" by HTTP status code but degraded in a way that likely breaks your user-facing feature. Vigilmon's response time history for health check endpoints (which respond quickly) surfaces server-level issues, but end-to-end inference latency needs APM instrumentation inside your application.

Using Inference Calls as Health Checks

Running a real completion request (/v1/chat/completions) on every monitoring check cycle incurs API costs and adds load to rate-limited endpoints. Use models listing endpoints for availability checks — they confirm the API accepts requests without triggering inference billing.

Monitoring Only the Primary Provider

If your application has fallback logic but you only monitor the primary provider, you don't know if your fallback is healthy until the primary fails and the fallback is already needed. Monitor all providers your application may route to.

Missing Heartbeats for AI Background Workers

AI jobs that call hosted APIs are especially prone to silent failures: rate limit errors that exhaust retry budgets, model deprecation errors from provider API changes, or provider incidents that cause jobs to fail without HTTP-level notification to your infrastructure. Every scheduled AI job needs a heartbeat monitor.

Conclusion

AI APIs are now critical infrastructure, and their failure modes — latency spikes, provider incidents, self-hosted GPU exhaustion, and silent background job failures — require monitoring that goes beyond traditional server health checks.

The monitoring approach is straightforward: use lightweight models listing endpoints for availability checks on hosted providers, deploy health endpoints on self-hosted inference servers, configure response time history to detect latency degradation, add heartbeat monitors for every scheduled AI job, and cover fallback providers independently.

Vigilmon provides outside-in visibility for the entire AI infrastructure surface area — hosted provider availability, self-hosted server reachability, and background job heartbeats — without agents, instrumentation, or SDK integration.

Try Vigilmon free at vigilmon.online — no credit card, no agents, multi-region consensus alerting, HTTP, TCP, and heartbeat monitoring for AI-powered applications, free tier permanent.
