GraphQL API Monitoring Best Practices: Why REST Tools Fall Short

GraphQL APIs break in ways that REST monitoring tools are architecturally unfit to detect. A REST uptime monitor hits GET /api/users and checks for a 200 response. A GraphQL API returns 200 for nearly every request — including ones that contain errors, timed-out resolvers, and responses with null data where records should be.

If you're running a GraphQL API in production and relying on standard HTTP status monitoring, you have a significant blind spot. This guide covers what actually fails in GraphQL production systems and how to monitor it correctly with Vigilmon.

Why REST Monitoring Tools Fail for GraphQL

GraphQL has one endpoint — typically POST /graphql. Every query, mutation, and subscription goes through it. The HTTP layer almost always returns 200 OK, even when:

A resolver throws an exception
A database query times out
A required field returns null because the underlying record was deleted
A type mismatch causes a partial response

The GraphQL specification puts errors in the response body, not in HTTP status codes:

{
  "data": { "user": null },
  "errors": [
    {
      "message": "User not found",
      "locations": [{ "line": 2, "column": 3 }],
      "path": ["user"]
    }
  ]
}

A monitor that only checks for HTTP 200 will mark this as healthy. A monitor that checks for an "errors" key in the body is closer but still fragile: a nullable field can resolve to null without raising any error, so you need field-level visibility into whether the data you expect is actually present.
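As a rough illustration of field-level checking, the sketch below treats a parsed response body as healthy only when it has no errors and a required field path resolves to a non-null value. The function name evaluateGraphQLResponse and the requiredPath parameter are assumptions made for this example, not part of any library.

```javascript
// Hypothetical helper: decides whether a parsed GraphQL response body is healthy.
// requiredPath is the list of keys that must resolve to a non-null value under "data".
function evaluateGraphQLResponse(body, requiredPath) {
    // Any entry in "errors" means at least one resolver or validation step failed
    if (Array.isArray(body.errors) && body.errors.length > 0) {
        const messages = body.errors.map((e) => e.message).join('; ');
        return { healthy: false, reason: `errors: ${messages}` };
    }

    // Walk the path; a null or missing value anywhere along it counts as a failure
    let value = body.data;
    for (const key of requiredPath) {
        if (value === null || value === undefined) break;
        value = value[key];
    }
    if (value === null || value === undefined) {
        return { healthy: false, reason: `missing data at ${requiredPath.join('.')}` };
    }
    return { healthy: true, reason: 'ok' };
}

// The error response shown earlier is reported as unhealthy
const failing = evaluateGraphQLResponse(
    { data: { user: null }, errors: [{ message: 'User not found', path: ['user'] }] },
    ['user', 'id']
);
console.log(failing.healthy, failing.reason);
```

For the error response above this prints false errors: User not found. A body such as {"data": {"user": null}} with no errors key is also flagged, which is the case a plain "errors" check misses.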

The right approach is layered: health endpoint monitoring for uptime, response time monitoring for performance, webhook-based error rate monitoring for application-level errors, and, if you run Hasura, explicit checks on metadata and Action handlers.

Layer 1: Health Endpoint Monitoring

Hasura exposes a dedicated health check endpoint out of the box, and Apollo Server either ships one or lets you add one in a few lines, depending on the version. These are your primary uptime monitors.

Apollo Server

Apollo Server 2 and 3 expose /.well-known/apollo/server-health:

curl https://api.yourdomain.com/.well-known/apollo/server-health
# {"status":"pass"}

Apollo Server 4 removed the built-in health check endpoint, so on v4 add a custom health route:

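// apollo-server.js (Express integration, Apollo Server 4)
const express = require('express');
const { ApolloServer } = require('@apollo/server');
const { expressMiddleware } = require('@apollo/server/express4');

// typeDefs, resolvers, and db (a knex-style client) are assumed to be
// defined elsewhere in your application.

// CommonJS has no top-level await, so startup lives in an async function
async function main() {
    const app = express();
    const server = new ApolloServer({ typeDefs, resolvers });

    await server.start();

    // Health check endpoint: checks server state and a fast DB probe
    app.get('/health', async (req, res) => {
        try {
            // assertStarted returns nothing; it throws if the server is not running
            server.assertStarted('health check');
            // Quick connectivity check, not a full query
            await db.raw('SELECT 1');
            res.status(200).json({
                status: 'ok',
                graphql: 'ready',
                timestamp: new Date().toISOString()
            });
        } catch (err) {
            res.status(503).json({ status: 'error', message: err.message });
        }
    });

    // expressMiddleware expects a JSON body parser in front of it
    app.use('/graphql', express.json(), expressMiddleware(server));
    app.listen(4000);
}

main().catch((err) => {
    console.error(err);
    process.exit(1);
});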

Hasura

Hasura exposes /healthz:

curl https://your-hasura.domain.com/healthz
# OK

For Hasura Cloud or self-hosted, also check the version endpoint to confirm the engine is answering API requests and running the release you expect:

curl https://your-hasura.domain.com/v1/version
# {"version":"v2.36.0","console_assets_version":"v2.36.0"}

Set Up Vigilmon HTTP Monitors

For Apollo Server:

Log in to vigilmon.online and go to Monitors → New Monitor
Choose HTTP / HTTPS
URL: https://api.yourdomain.com/.well-known/apollo/server-health (or https://api.yourdomain.com/health if you use the custom route)
Check interval: 1 minute
Expected: status code 200, body contains "pass" (or "ok" for the custom route)
Response time alert: 2000ms

For Hasura:

Same process, URL: https://your-hasura.domain.com/healthz
Expected: status code 200, body contains OK

These monitors tell you the GraphQL server process is alive and the underlying infrastructure is reachable. That's your uptime signal.

Layer 2: Response Time Monitoring for Query Performance

GraphQL resolvers can become slow silently. A query that returns data correctly but takes 8 seconds is causing users to abandon your app. Response time monitoring catches this before it becomes a user complaint.

Vigilmon's HTTP monitors include response time alerting. Configure a threshold on your health endpoint monitor — if the health check itself starts taking longer than expected, your server is under load or a dependency (database, Redis, downstream API) is degrading.

For a probe that goes through the GraphQL layer itself, add a lightweight query monitor. A { __typename } query exercises the full request path (HTTP handling, parsing, validation, execution) without touching your data:

# Measure /graphql response time with a minimal query
curl -X POST https://api.yourdomain.com/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ __typename }"}'
# {"data":{"__typename":"Query"}}

Set up a Vigilmon HTTP monitor for this endpoint with the POST method and the same query body. Set a response time threshold appropriate for your SLA.

Note: Servers that disable introspection by rejecting the __schema and __type fields, as Apollo Server does, still answer { __typename }. The probe therefore keeps working if you turn introspection off in production to protect a sensitive schema.
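If you want to run the same probe from your own scripts, for example in CI or a cron job, a minimal sketch could look like the following. The function name probeGraphQL, the thresholdMs parameter, and the injectable fetchImpl argument are assumptions for this example; the global fetch it defaults to requires Node 18 or later.

```javascript
// Hypothetical probe: posts { __typename } and reports latency against a budget.
// fetchImpl defaults to the global fetch available in Node 18+.
async function probeGraphQL(url, thresholdMs, fetchImpl = fetch) {
    const started = Date.now();
    const response = await fetchImpl(url, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ query: '{ __typename }' })
    });
    const body = await response.json();
    const durationMs = Date.now() - started;

    // "answered" means: HTTP success, no GraphQL errors, and a root type name came back
    const answered = response.ok && !body.errors && typeof body.data?.__typename === 'string';
    return { ok: answered && durationMs <= thresholdMs, answered, durationMs };
}

// Example call (commented out so the snippet has no network side effects):
// probeGraphQL('https://api.yourdomain.com/graphql', 2000).then(console.log);
```

Passing the fetch implementation as a parameter keeps the probe easy to test with a stub, and returning answered separately from ok lets you tell a slow but correct response apart from a broken one.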

Layer 3: Error Rate Monitoring via Webhooks

Application-level GraphQL errors (resolver failures, authorization errors, malformed queries from clients) don't appear in HTTP status codes. You need to collect them in your application and push them to an alert channel.

Structured Error Logging in Apollo Server

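// apollo-server.js: error tracking plugin
// reportToMonitoring is assumed to be your own async function that ships
// the payload to your logging or metrics backend.
const errorTrackingPlugin = {
    async requestDidStart() {
        const startTime = Date.now();
        return {
            async willSendResponse(context) {
                const duration = Date.now() - startTime;
                // Incremental (@defer/@stream) responses have no singleResult,
                // so only read errors when the body kind is 'single'
                const { body } = context.response;
                const errors = body.kind === 'single' ? body.singleResult.errors ?? [] : [];

                // Report failed requests and requests slower than 3 seconds
                if (errors.length > 0 || duration > 3000) {
                    // Not awaited, so reporting never delays or breaks the response
                    reportToMonitoring({
                        query: context.request.query?.substring(0, 200),
                        operationName: context.request.operationName,
                        duration,
                        errorCount: errors.length,
                        errors: errors.map(e => e.message),
                        timestamp: new Date().toISOString()
                    }).catch(err => console.error('reportToMonitoring failed:', err));
                }
            }
        };
    }
};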

const server = new ApolloServer({
    typeDefs,
    resolvers,
    plugins: [errorTrackingPlugin]
});

Push Error Spikes to Vigilmon Webhooks

Webhooks can connect Vigilmon and your error tracking in either direction. To have Vigilmon deliver incident status changes to your own monitoring backend, set up a webhook channel:

In Vigilmon, go to Alert Channels → New Channel → Webhook
Enter your internal monitoring endpoint URL
Vigilmon will deliver incident status changes to this URL

Alternatively, reverse the direction: when the error rate spikes, your application posts to a Vigilmon-registered webhook URL to trigger your alerting pipeline:

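// error-rate-monitor.js: runs server-side (global fetch requires Node 18+)
// getErrorCountLastMinute is assumed to be your own function backed by your error store
async function checkErrorRate() {
    const threshold = 10; // alert if > 10 GraphQL errors per minute

    try {
        const recentErrors = await getErrorCountLastMinute();
        if (recentErrors <= threshold) return;

        const response = await fetch(process.env.VIGILMON_WEBHOOK_URL, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({
                event: 'graphql_error_spike',
                error_count: recentErrors,
                threshold,
                timestamp: new Date().toISOString()
            })
        });
        if (!response.ok) {
            console.error(`Webhook delivery failed: HTTP ${response.status}`);
        }
    } catch (err) {
        // A failed check should be logged, not crash the process
        console.error('Error rate check failed:', err);
    }
}

setInterval(checkErrorRate, 60_000);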
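The snippets in this section call reportToMonitoring and getErrorCountLastMinute without defining them. One possible in-memory sketch of both, assuming a single server process and a one-minute sliding window, is shown below. The optional now parameter is an addition for this example so the window logic can be exercised with fixed timestamps.

```javascript
// Hypothetical in-memory implementations of the two helpers used in this section.
// Suitable for a single process only; multiple instances need a shared store.
const WINDOW_MS = 60_000;
const errorTimestamps = [];

// Called by the Apollo plugin: record one timestamp per GraphQL error in the event.
// Slow requests without errors have errorCount 0 and do not affect the count.
async function reportToMonitoring(event, now = Date.now()) {
    for (let i = 0; i < event.errorCount; i++) {
        errorTimestamps.push(now);
    }
}

// Called by checkErrorRate: drop entries older than the window, count the rest.
// Relies on timestamps being appended in non-decreasing order.
async function getErrorCountLastMinute(now = Date.now()) {
    while (errorTimestamps.length > 0 && errorTimestamps[0] <= now - WINDOW_MS) {
        errorTimestamps.shift();
    }
    return errorTimestamps.length;
}
```

An in-process array loses its contents on restart and undercounts behind a load balancer, so treat this as a starting point and move the counter to a shared store once you run more than one instance.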

Layer 4: Monitoring Hasura Metadata and Actions

Hasura's GraphQL engine can be healthy at the HTTP layer while having broken metadata (pending migrations, failed remote schema connections, broken action handlers). Check these explicitly:

# Check metadata consistency
curl -X POST https://your-hasura.domain.com/v1/metadata \
  -H "X-Hasura-Admin-Secret: $ADMIN_SECRET" \
  -H "Content-Type: application/json" \
  -d '{"type": "get_inconsistent_metadata", "args": {}}'
# {"is_consistent":true,"inconsistent_objects":[]}
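To turn a metadata check into something a monitor can act on, you can ask Hasura for its inconsistent objects and reduce the answer to a pass or fail result. The sketch below uses the get_inconsistent_metadata call of the Hasura metadata API, which reports is_consistent and an inconsistent_objects list. The function name checkHasuraMetadata, its parameters, and the injectable fetchImpl are assumptions for this example.

```javascript
// Hypothetical check built on Hasura's get_inconsistent_metadata metadata API call.
// baseUrl and adminSecret are placeholders; fetchImpl defaults to the global fetch (Node 18+).
async function checkHasuraMetadata(baseUrl, adminSecret, fetchImpl = fetch) {
    const response = await fetchImpl(`${baseUrl}/v1/metadata`, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            'X-Hasura-Admin-Secret': adminSecret
        },
        body: JSON.stringify({ type: 'get_inconsistent_metadata', args: {} })
    });
    if (!response.ok) {
        return { consistent: false, problems: [`HTTP ${response.status}`] };
    }

    const result = await response.json();
    // Each inconsistent object carries a type and a reason
    const problems = (result.inconsistent_objects ?? []).map((o) => `${o.type}: ${o.reason}`);
    return { consistent: result.is_consistent === true, problems };
}

// Example call (commented out so the snippet has no network side effects):
// checkHasuraMetadata('https://your-hasura.domain.com', process.env.ADMIN_SECRET).then(console.log);
```

One option is to run this inside an internal health route that returns 503 when consistent is false, so an HTTP monitor can poll the route without ever holding the admin secret.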

For production Hasura deployments, set up a Vigilmon monitor for each external Action handler endpoint — these are REST APIs that Hasura calls internally and their failures manifest as GraphQL errors:

https://your-action-handler.yourdomain.com/health

Complete Monitoring Stack

| Monitor | Type | URL/Endpoint | What It Catches |
|---|---|---|---|
| Apollo health | HTTP | /.well-known/apollo/server-health or custom /health | Server process; DB connectivity with the custom route |
| Hasura health | HTTP | /healthz | Engine initialization |
| { __typename } probe | HTTP POST | /graphql with { __typename } | GraphQL request path latency |
| Hasura Action handlers | HTTP | Per-handler /health | Broken remote REST handlers |
| Error rate spike | Webhook | App → Vigilmon | Application-level GraphQL errors |

Status Page for API Consumers

If you have external clients consuming your GraphQL API, a public status page reduces support noise during incidents:

Go to Status Pages → New Status Page in Vigilmon
Name it "GraphQL API Status"
Add your Apollo/Hasura health monitors
Share the URL in your API documentation

Get started free at vigilmon.online — no credit card, monitors live in under two minutes. Your GraphQL API deserves more than a 200 OK check.