Uptime Monitoring for GraphQL APIs: A Complete Guide
GraphQL APIs have a quirk that trips up most generic uptime monitors: everything goes to a single endpoint via POST. There's no /users to GET or /products/123 to check. A naive monitor that just does a GET request to https://api.example.com/graphql will get a 400 Bad Request and conclude your API is down — even when it's perfectly healthy.
This guide explains how to monitor GraphQL APIs correctly using Vigilmon, covering introspection health checks, custom health endpoints, and Apollo/Hasura-specific approaches.
The Core Challenge: GraphQL Is POST-First
A standard GraphQL request looks like this:
curl -X POST https://api.example.com/graphql \
-H 'Content-Type: application/json' \
-d '{"query": "{ __typename }"}'
# {"data":{"__typename":"Query"}}
This returns 200 with valid data. But a plain GET to the same URL typically returns:
curl https://api.example.com/graphql
# 400 Bad Request: Must provide query string.
Your uptime monitor needs to send a proper GraphQL request body. Here are three strategies.
Strategy 1: Dedicated HTTP Health Endpoint (Recommended)
The cleanest solution is a separate, non-GraphQL /health endpoint on the same server. This is the recommended approach because it:
- Decouples your monitoring probe from your GraphQL schema
- Avoids adding monitoring noise to your operation logs
- Works even if your schema evolves
Apollo Server (Node.js)
// server.js
const express = require('express');
const { ApolloServer } = require('@apollo/server');
const { expressMiddleware } = require('@apollo/server/express4');
// typeDefs, resolvers, and db (e.g. a Knex instance) come from your own modules

async function main() {
  const app = express();
  const server = new ApolloServer({ typeDefs, resolvers });
  // Apollo Server 4 must be started before expressMiddleware(server) is mounted
  await server.start();

  // Health endpoint, checked by Vigilmon
  app.get('/health', async (req, res) => {
    try {
      // Lightweight DB ping
      await db.raw('SELECT 1');
      res.json({ status: 'ok', service: 'graphql-api' });
    } catch (err) {
      res.status(503).json({ status: 'down', error: err.message });
    }
  });

  // GraphQL endpoint
  app.use('/graphql', express.json(), expressMiddleware(server));
  app.listen(4000);
}

main();
Apollo Server with TypeScript
import express, { Request, Response } from 'express';
import { ApolloServer } from '@apollo/server';
import { expressMiddleware } from '@apollo/server/express4';
// `server` (a started ApolloServer) and `dataSource` (e.g. a TypeORM DataSource)
// are assumed to be created elsewhere in your app.
const app = express();
app.get('/health', async (_req: Request, res: Response) => {
  const checks = { database: false };
  try {
    await dataSource.query('SELECT 1');
    checks.database = true;
  } catch {
    // Leave checks.database as false; the response below reports the failure
  }
  const healthy = Object.values(checks).every(Boolean);
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    checks,
  });
});
app.use('/graphql', express.json(), expressMiddleware(server));
Hasura
Hasura exposes /healthz out of the box:
curl https://your-hasura-instance.com/healthz
# OK
Hasura returns 200 OK (body: OK) when healthy and 500 when the metadata database is unreachable. Point Vigilmon directly at https://your-hasura-instance.com/healthz.
Strategy 2: GraphQL Introspection Ping via POST Monitor
If you can't add a separate health endpoint (for example, because the API sits behind a third-party gateway), use Vigilmon's HTTP POST monitoring with a minimal query. The { __typename } query is the lightest valid operation, since it asks only for the name of the root type:
{ "query": "{ __typename }" }
In Vigilmon:
- Create a new monitor, select type HTTP
- Set Method to POST
- Set URL to https://api.example.com/graphql
- Add header: Content-Type: application/json
- Set Request body to: {"query":"{ __typename }"}
- Under Keyword check, add "data" (this ensures the response body contains a GraphQL data envelope, not only an error object)
- Save
This verifies the GraphQL layer is actually parsing and resolving queries, not just serving a 200 from a gateway that's already dropped the backend.
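As a rough sketch of what such a POST probe does, the JavaScript below builds the same request and applies the same keyword check. The function names (`buildTypenameProbe`, `passesKeywordCheck`, `probe`) are illustrative and not part of Vigilmon, and `probe` assumes Node.js 18+ for the global `fetch`.

```javascript
// Build the fetch() options for a minimal GraphQL probe.
function buildTypenameProbe() {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: '{ __typename }' }),
  };
}

// Mimic a keyword check: the status must be 200 and the body must contain the keyword.
function passesKeywordCheck(status, bodyText, keyword = '"data"') {
  return status === 200 && bodyText.includes(keyword);
}

// Run the probe against a URL and report whether it passed.
async function probe(url) {
  const res = await fetch(url, buildTypenameProbe());
  return passesKeywordCheck(res.status, await res.text());
}
```

Running `probe('https://api.example.com/graphql')` from a cron job or CI step gives you the same signal as the hosted monitor, which can be handy when debugging a failing check.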
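A note on introspection in production: many teams disable introspection there for security reasons. That setting typically blocks the __schema and __type fields, while __typename is an ordinary meta-field that keeps working, so the probe above is usually unaffected. Naming the operation still helps, because it makes the probe easy to spot and filter in your operation logs: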
{ "query": "query HealthPing { __typename }" }
Or even better, add a dedicated health query to your schema:
type Query {
health: HealthStatus!
}
type HealthStatus {
ok: Boolean!
version: String!
}
// resolver
const resolvers = {
  Query: {
    health: async () => ({
      ok: true,
      version: process.env.APP_VERSION ?? '0.0.0',
    }),
  },
};
Then monitor with:
{ "query": "{ health { ok version } }" }
Set the keyword check to "ok":true.
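A substring match on `"ok":true` depends on the server emitting compact JSON; a pretty-printed body containing `"ok": true` would not match. If you script the check yourself, parsing the body is more robust. A minimal sketch, where the helper name `healthQueryPassed` is hypothetical:

```javascript
// Returns true only when the parsed response carries data.health.ok === true
// and no non-empty top-level errors array.
function healthQueryPassed(bodyText) {
  let body;
  try {
    body = JSON.parse(bodyText);
  } catch {
    return false; // not JSON at all, e.g. an HTML error page from a proxy
  }
  if (body === null || typeof body !== 'object') return false;
  if (Array.isArray(body.errors) && body.errors.length > 0) return false;
  return body.data?.health?.ok === true;
}
```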
Strategy 3: Apollo Server Built-in Health Check
Apollo Server 2.x and 3.x ship with a built-in health check route at /.well-known/apollo/server-health. Apollo Server 4 removed it, so on 4.x you either add the route yourself or probe the GraphQL endpoint directly.
// Apollo Server 3.x: /.well-known/apollo/server-health is served by default
const server = new ApolloServer({
  typeDefs,
  resolvers,
  // healthCheckPath: '/health', // Apollo Server 3.x option for a custom path
});
For Apollo Server 4 with Express, define the route explicitly:
// Apollo Server 4 has no built-in health route, so re-create it in Express
app.get('/.well-known/apollo/server-health', (req, res) => {
  res.json({ status: 'pass' });
});
Point Vigilmon at https://api.example.com/.well-known/apollo/server-health with a keyword check for "pass". Alternatively, Apollo's documentation for version 4 suggests a GraphQL-level check: a GET request to /graphql?query=%7B__typename%7D with the header apollo-require-preflight: true.
Step 3: Create the Vigilmon Monitor
For a /health or /healthz endpoint:
- Log in to Vigilmon → Monitors → New Monitor
- Type: HTTP
- Method: GET
- URL: https://api.example.com/health
- Interval: 1 minute
- Keyword check: "status":"ok" (or OK for Hasura)
- Save
For a POST-based GraphQL probe:
- Type: HTTP
- Method: POST
- URL: https://api.example.com/graphql
- Headers: Content-Type: application/json
- Body: {"query":"{ __typename }"}
- Keyword check: "data"
- Save
Step 4: Handle Vigilmon Webhooks in Your GraphQL Server
When a monitor changes state, Vigilmon can call a webhook endpoint on your server. Add a mutation or a REST handler:
REST handler (recommended, since webhook calls carry no user session)
// Express route alongside your GraphQL endpoint
app.post('/webhooks/vigilmon', express.json(), (req, res) => {
  const { event, monitor } = req.body;
  if (event === 'down') {
    console.error(`[Vigilmon] DOWN: ${monitor.name}`);
    // Notify Slack, PagerDuty, create incident ticket
  } else if (event === 'up') {
    console.info(`[Vigilmon] RECOVERED: ${monitor.name}`);
  }
  res.sendStatus(200);
});
Configure this URL in Vigilmon under Alert Channels → Webhook.
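Because a webhook route is reachable from the public internet, it is worth rejecting calls that do not come from your monitor. The sketch below assumes you can attach a shared secret to the webhook request, for example as a custom header; whether and how Vigilmon supports this is an assumption, and both the `x-webhook-secret` header name and the `isAuthorizedWebhook` helper are hypothetical.

```javascript
const nodeCrypto = require('node:crypto');

// Compare a presented secret with the expected one in constant time.
function isAuthorizedWebhook(presented, expected) {
  if (typeof presented !== 'string' || typeof expected !== 'string') return false;
  if (expected.length === 0) return false; // refuse to run with an unset secret
  // Hash both values so the buffers have equal length, as timingSafeEqual requires.
  const a = nodeCrypto.createHash('sha256').update(presented).digest();
  const b = nodeCrypto.createHash('sha256').update(expected).digest();
  return nodeCrypto.timingSafeEqual(a, b);
}

// Usage inside the Express route (header and env var names are hypothetical):
// if (!isAuthorizedWebhook(req.get('x-webhook-secret'), process.env.WEBHOOK_SECRET)) {
//   return res.sendStatus(401);
// }
```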
Step 5: Subscription and WebSocket Monitoring
If you expose GraphQL subscriptions over WebSocket, your HTTP health check won't cover the WebSocket transport. Add a secondary monitor or a heartbeat:
// Send a heartbeat from the subscription server every 60 seconds
const HEARTBEAT_URL = process.env.VIGILMON_HEARTBEAT_URL;
setInterval(async () => {
  if (!HEARTBEAT_URL) return;
  try {
    await fetch(HEARTBEAT_URL);
  } catch (e) {
    console.warn('[Vigilmon] Heartbeat ping failed:', e.message);
  }
}, 60_000);
Create a Heartbeat monitor in Vigilmon with a 2-minute expected interval.
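A heartbeat sent unconditionally only proves that the Node.js process is alive. One possible refinement is to gate the ping on a self-check of the subscription transport, so a wedged WebSocket server stops sending heartbeats and the monitor alerts. In the sketch below, `sendHeartbeatIfHealthy` and the `isHealthy` callback are hypothetical names; what `isHealthy` inspects (for example, whether the WebSocket server is still accepting connections) is left to you.

```javascript
// Ping the heartbeat URL only when the self-check passes.
// Resolves to true if a ping was sent, false if it was skipped.
async function sendHeartbeatIfHealthy(url, isHealthy, fetchImpl = fetch) {
  if (!url) return false; // heartbeat not configured
  if (!(await isHealthy())) return false; // stay silent so the monitor times out
  await fetchImpl(url);
  return true;
}
```

The `fetchImpl` parameter defaults to the global `fetch` (Node.js 18+) and exists mainly so the function can be exercised with a stub in tests.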
Step 6: Monitoring Federated GraphQL (Apollo Federation)
For a federated supergraph, monitor at multiple levels:
| Layer | What to monitor | Monitor type |
|-------|-----------------|--------------|
| Router / Gateway | https://gateway.example.com/health | HTTP GET |
| Auth subgraph | https://auth-subgraph.example.com/health | HTTP GET |
| Products subgraph | https://products-subgraph.example.com/health | HTTP GET |
If a subgraph goes down, the gateway may still respond 200 but return partial errors. Monitoring each subgraph independently lets you pinpoint which service failed.
To catch this from the gateway side as well, probe with a query that touches the subgraph's data:
{ "query": "{ products { id } }" }
Then check that "data" is present and "errors" is absent.
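A substring match can confirm that `"data"` is present, but asserting that `"errors"` is absent needs either a negative keyword match or a scripted check. A minimal sketch of the latter follows; `classifyGraphQLResponse` and its three labels are illustrative names, not part of any library.

```javascript
// Classify a GraphQL response body:
//   'healthy' - data present, no errors
//   'partial' - data present alongside errors (e.g. one subgraph failed)
//   'failed'  - no usable data, or the body is not valid JSON
function classifyGraphQLResponse(bodyText) {
  let body;
  try {
    body = JSON.parse(bodyText);
  } catch {
    return 'failed';
  }
  if (body === null || typeof body !== 'object') return 'failed';
  const hasErrors = Array.isArray(body.errors) && body.errors.length > 0;
  const hasData = body.data !== undefined && body.data !== null;
  if (!hasData) return 'failed';
  return hasErrors ? 'partial' : 'healthy';
}
```

Treating 'partial' as a warning rather than an outage is one reasonable choice for a federated gateway, since the other subgraphs are still serving data.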
Step 7: Configure Alert Escalation
In Monitors → (your monitor) → Alert Channels, configure:
- Immediate: Email or Slack when the monitor first fails
- Escalation (10 min): Page the on-call engineer if unacknowledged
- Recovery alert: Notify when the API comes back up so you can close the incident
For critical GraphQL APIs serving paying customers, set the interval to 1 minute and escalation to 5 minutes.
Common Pitfalls
1. Monitoring the wrong thing
Don't monitor a CDN or gateway cache; monitor the actual origin. A cached 200 from a CDN doesn't mean your GraphQL server is healthy.
2. Ignoring partial errors
GraphQL returns 200 even for { "errors": [...] } responses. Use a keyword check to verify "data" is present, not just the HTTP status code.
3. Requiring authentication on health endpoints
Your /health endpoint should be accessible without auth — monitoring probes can't log in. Use a separate path that's explicitly public.
4. Overly heavy health checks
Don't run schema validation or resolver benchmarks inside your health endpoint. A simple SELECT 1 or ping is enough. The health check itself should complete in under 50ms.
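One way to keep a health check from hanging on a slow dependency is to race the ping against a timer. The helper below is a sketch with a hypothetical name (`withTimeout`); note that it stops waiting for the underlying operation but does not cancel it.

```javascript
// Reject if the given promise does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`health check exceeded ${ms}ms`)), ms);
  });
  // Clear the timer either way so it never keeps the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage inside a /health handler (db is your own database client):
// await withTimeout(db.raw('SELECT 1'), 50);
```

With this in place, a stalled database produces a fast 503 from the handler's catch branch instead of a probe that times out on the monitor's side.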
Summary
| Strategy | Best for |
|----------|----------|
| Dedicated /health GET endpoint | Any GraphQL server you control |
| Introspection POST probe | Third-party APIs or managed services |
| Hasura /healthz | Hasura Cloud or self-hosted Hasura |
| Apollo health route | Apollo Server 2.x/3.x (built in); Apollo Server 4+ if you add the route yourself |
With Vigilmon monitoring your GraphQL API, you'll know within 60 seconds when resolvers stop working, databases disconnect, or the server crashes — before users start seeing broken queries.