Hasura GraphQL Engine Monitoring with Vigilmon
Hasura is an open-source GraphQL engine that auto-generates a GraphQL API over your Postgres (or other supported) database. It's popular for rapidly building data-driven backends and for adding GraphQL capabilities to existing databases. When Hasura goes down, your entire GraphQL API layer disappears — every client querying your data fails simultaneously.
This guide covers how to monitor Hasura with Vigilmon, including built-in health endpoints, metadata status checks, and alerting configuration.
Hasura's Built-in Health Endpoints
Hasura ships with dedicated health endpoints you should monitor. Unlike a generic GraphQL API where you might craft a health query, Hasura exposes these out of the box:
/healthz — Primary Health Check
GET https://your-hasura-instance.com/healthz
A healthy Hasura instance responds with:
HTTP/1.1 200 OK
OK
An unhealthy instance returns a non-200 status. This endpoint checks that the Hasura process is running and the database connection is active.
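To make the check concrete, here is a minimal sketch of how a script could interpret a /healthz response. The function names, the placeholder base URL, and the timeout are illustrative assumptions. The WARN branch reflects behavior described in Hasura's documentation (a 200 status with a WARN body when metadata is inconsistent), and it is worth verifying against the version you run.

```javascript
// Classify a /healthz response from its status code and body text.
// Assumption: a 200 response whose body starts with "WARN" means Hasura is
// serving traffic but has inconsistent metadata. Verify this on your version.
function classifyHealthz(status, body) {
  const text = String(body).trim();
  if (status === 200 && text === 'OK') return 'healthy';
  if (status === 200 && text.startsWith('WARN')) return 'degraded';
  return 'down';
}

// Hypothetical probe using the global fetch available in Node 18+.
// baseUrl is a placeholder such as https://your-hasura-instance.com
async function probeHealthz(baseUrl, timeoutMs = 5000) {
  try {
    const res = await fetch(`${baseUrl}/healthz`, {
      signal: AbortSignal.timeout(timeoutMs),
    });
    return classifyHealthz(res.status, await res.text());
  } catch {
    // Network error, DNS failure, or timeout all count as down.
    return 'down';
  }
}
```

Matching on the body as well as the status code is what lets a keyword check tell a plain OK apart from any other 200 response.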
/v1/version — Version and Metadata
GET https://your-hasura-instance.com/v1/version
Response:
{
"version": "v2.36.0",
"is_metadata_inconsistent": false
}
The is_metadata_inconsistent field is critical. When it is true, Hasura has loaded but its metadata (table relationships, permissions, event triggers) is in a broken state. Queries may partially succeed or return incomplete data. This is harder to catch than a full outage and easy to miss without a keyword monitor. Response fields can differ between Hasura versions, so confirm that your instance actually returns this field before relying on it.
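If you want to check the field from a script rather than by keyword, parsing the JSON avoids any dependence on formatting. The sketch below assumes the response shape shown above; the function name and the 'unknown' result for a missing field are illustrative choices, not part of Hasura or Vigilmon.

```javascript
// Interpret the body of a /v1/version response.
// Returns 'consistent', 'inconsistent', or 'unknown' (invalid JSON, or the
// is_metadata_inconsistent field is absent from the response).
function metadataState(bodyText) {
  let parsed;
  try {
    parsed = JSON.parse(bodyText);
  } catch {
    return 'unknown';
  }
  if (parsed === null || typeof parsed !== 'object') return 'unknown';
  if (parsed.is_metadata_inconsistent === true) return 'inconsistent';
  if (parsed.is_metadata_inconsistent === false) return 'consistent';
  return 'unknown';
}
```

Treating a missing field as 'unknown' rather than 'consistent' keeps the check from passing silently if the response shape is not what you expected.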
Monitoring the Hasura Health Endpoint
Step 1: Create a Ping Monitor
- Log in to Vigilmon → Monitors → New Monitor
- Type: HTTP
- Method: GET
- URL: https://your-hasura-instance.com/healthz
- Interval: 1 minute
- Expected status: 200
- Keyword check: OK
This gives you 60-second detection of Hasura process failures and database connectivity issues.
Step 2: Metadata Consistency Monitor
Create a second monitor targeting /v1/version:
- Type: HTTP
- Method: GET
- URL: https://your-hasura-instance.com/v1/version
- Interval: 5 minutes
- Keyword check: "is_metadata_inconsistent":false
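When Hasura's metadata becomes inconsistent (after a bad migration, a dropped table, or a permissions error), the keyword no longer appears in the response and this monitor fires, even if /healthz still returns 200. If the keyword check is a literal substring match, the keyword must match the raw response exactly, including any whitespace after the colon, so copy it from a real response rather than from a pretty-printed example.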
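The following sketch illustrates how sensitive a literal keyword match is to formatting. It assumes the keyword check behaves like a plain substring search; keywordMatches and its ignoreWhitespace option are hypothetical helpers for illustration, not Vigilmon settings.

```javascript
// Emulate a keyword check as a substring search over the raw response body.
// ignoreWhitespace is a hypothetical convenience for local testing only.
function keywordMatches(body, keyword, { ignoreWhitespace = false } = {}) {
  if (!ignoreWhitespace) return body.includes(keyword);
  const strip = (s) => s.replace(/\s+/g, '');
  return strip(body).includes(strip(keyword));
}

// The same JSON document, serialized compactly and pretty-printed.
const compact = '{"version":"v2.36.0","is_metadata_inconsistent":false}';
const pretty = '{\n  "version": "v2.36.0",\n  "is_metadata_inconsistent": false\n}';
const keyword = '"is_metadata_inconsistent":false';

console.log(keywordMatches(compact, keyword)); // true
console.log(keywordMatches(pretty, keyword)); // false: space after the colon
console.log(keywordMatches(pretty, keyword, { ignoreWhitespace: true })); // true
```

Running a check like this against the body your instance really returns (for example, from curl) shows whether the configured keyword will match.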
Monitoring Hasura on Docker or Kubernetes
If you're running Hasura via Docker Compose, the health endpoint is reachable on the mapped port:
# docker-compose.yml excerpt
services:
  hasura:
    image: hasura/graphql-engine:v2.36.0
    ports:
      - "8080:8080"
    environment:
      HASURA_GRAPHQL_DATABASE_URL: postgres://...
      HASURA_GRAPHQL_ENABLE_CONSOLE: "true"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3
In Vigilmon, point to your public domain or load balancer, not the internal Docker host.
On Kubernetes, the standard liveness/readiness probes mirror what Vigilmon checks externally:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
Vigilmon provides an external perspective that your cluster's internal probes cannot — it catches networking failures, ingress misconfigurations, and TLS certificate issues.
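As one illustration of an issue that only shows up from outside the cluster, the sketch below reads the TLS certificate presented at your public hostname and computes the days left before it expires. It is a minimal example using Node's built-in tls module; the function names are hypothetical, and the host you pass in is assumed to be your public Hasura domain.

```javascript
const tls = require('node:tls');

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days between `now` and the certificate's expiry date.
// Negative values mean the certificate has already expired.
function daysUntil(validTo, now = new Date()) {
  return Math.floor((new Date(validTo).getTime() - now.getTime()) / MS_PER_DAY);
}

// Connect to host:port, read the peer certificate, and resolve with the
// number of days remaining. Rejects on connection or handshake errors.
function certDaysRemaining(host, port = 443) {
  return new Promise((resolve, reject) => {
    const socket = tls.connect({ host, port, servername: host }, () => {
      const cert = socket.getPeerCertificate();
      socket.end();
      resolve(daysUntil(cert.valid_to));
    });
    socket.on('error', reject);
  });
}
```

A call such as certDaysRemaining('your-hasura-instance.com') would need to run from outside the cluster to see the same certificate your clients see.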
Monitoring the Hasura Console
If you expose the Hasura console for your team, monitor that separately:
GET https://your-hasura-instance.com/console
The console serves a React app. A 200 response with keyword Hasura in the HTML confirms the static assets are being served. This is useful for deployments where the API is up but a CDN or proxy is blocking console access.
Webhook Alerting on Hasura Failures
When Hasura goes down, you lose your entire GraphQL data layer. Configure Vigilmon to fire an alert immediately.
Slack Alerting
In Vigilmon → Alert Channels → Slack, add your Slack webhook URL. Configure the alert to trigger after 1 failed check — Hasura failures cascade quickly to user-facing errors and should not wait for multiple confirmation checks.
Webhook Integration
For custom incident handling:
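// Example webhook handler (Node.js/Express)
// The payload shape ({ event, monitor }) is illustrative; check what Vigilmon actually sends.
const express = require('express');

const app = express();

// Placeholders: replace these with your paging or ticketing integration.
function notifyPagerDuty(incident) {
  console.log('Would page on-call:', incident);
}
function resolvePagerDutyIncident() {
  console.log('Would resolve the open incident');
}

app.post('/webhooks/vigilmon', express.json(), (req, res) => {
  const { event, monitor = {} } = req.body ?? {};
  if (event === 'down') {
    console.error(`[Vigilmon] Hasura DOWN: ${monitor.url}`);
    // Page on-call, create incident ticket, notify team
    notifyPagerDuty({
      summary: 'Hasura GraphQL Engine is DOWN',
      severity: 'critical',
      source: monitor.url,
    });
  } else if (event === 'up') {
    console.info(`[Vigilmon] Hasura RECOVERED: ${monitor.url}`);
    resolvePagerDutyIncident();
  }
  res.json({ received: true });
});

app.listen(3000);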
Register this URL under Alert Channels → Webhook in Vigilmon.
Monitoring Hasura Cloud
If you're on Hasura Cloud rather than self-hosted, you still monitor the same endpoints — Hasura Cloud instances expose /healthz and /v1/version identically. Your instance URL is the hasura.app subdomain assigned to your project.
One additional consideration for Hasura Cloud: monitor your Postgres database host separately. Hasura Cloud can report healthy while your underlying database is degraded. Add a second Vigilmon monitor for your database's connection pooler or health endpoint if your provider exposes one.
Alerting Strategy
| Scenario | Monitor | Alert Threshold |
|----------|---------|-----------------|
| Hasura process down | /healthz returns non-200 | 1 failed check |
| Database connection lost | /healthz returns non-200 | 1 failed check |
| Metadata inconsistent | /v1/version keyword missing | 2 failed checks |
| Console inaccessible | /console returns non-200 | 3 failed checks |
| Response latency spike | /healthz > 2000ms | Warning alert |
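The thresholds in the table amount to counting consecutive failed checks per monitor. The sketch below shows that logic under simple assumptions: results are an array of booleans with the newest check last, and the monitor keys (healthz, metadata, console) are hypothetical labels for the rows above, not Vigilmon identifiers.

```javascript
// Failed-check thresholds taken from the table above.
const THRESHOLDS = { healthz: 1, metadata: 2, console: 3 };

// Count failures at the end of the history (newest result last, true = pass).
function consecutiveFailures(results) {
  let count = 0;
  for (let i = results.length - 1; i >= 0 && results[i] === false; i--) {
    count++;
  }
  return count;
}

// Decide whether a monitor has failed enough times in a row to alert.
function shouldAlert(monitor, results) {
  const threshold = THRESHOLDS[monitor];
  if (threshold === undefined) throw new Error(`unknown monitor: ${monitor}`);
  return consecutiveFailures(results) >= threshold;
}

// Latency warning: strictly above the limit (2000 ms in the table).
function latencyWarning(responseMs, limitMs = 2000) {
  return responseMs > limitMs;
}
```

For example, shouldAlert('metadata', [true, false, false]) is true, while a single failed metadata check is not enough to alert. A lower threshold means faster detection at the cost of more false alarms, which is why the process check alerts on the first failure and the console check waits for three.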
Summary
| Component | Endpoint | Vigilmon Check |
|-----------|----------|----------------|
| Process + DB health | /healthz | Status 200 + keyword OK |
| Metadata consistency | /v1/version | Keyword "is_metadata_inconsistent":false |
| Console availability | /console | Status 200 |
| Event trigger latency | Custom endpoint | Response time threshold |
Hasura is a high-leverage component — one instance often serves many frontend clients. With Vigilmon's 1-minute check interval and instant alerting, you'll detect engine failures before users start reporting missing data.