Monitoring GlitchTip with Vigilmon: Health Endpoint, Dashboard, API Liveness & SSL Alerts

GlitchTip is a self-hosted open-source error tracking platform compatible with the Sentry SDK — it collects exceptions, logs, and performance events from your applications and presents them in a searchable dashboard for debugging. Engineering teams choose GlitchTip to get Sentry-compatible error tracking on their own infrastructure without SaaS costs or data residency constraints. When GlitchTip goes down, errors from your applications stop being captured and recorded: your SDK clients silently fail to deliver events, exception spikes go undetected, and production bugs become invisible until users report them directly. The background worker queue is particularly critical — it processes incoming error payloads asynchronously, so a stalled worker means events are accepted at the API but never written to the database. Vigilmon gives you external visibility into GlitchTip's health endpoint, web dashboard, API liveness, and SSL certificate so failures are caught within 60 seconds.

What You'll Build

A monitor on GlitchTip's /.well-known/healthz health endpoint
An HTTP monitor for the GlitchTip web dashboard
An HTTP monitor for the API liveness check (/api/0/ returning 401 confirms the API is alive)
A cron heartbeat monitor for the GlitchTip worker queue health
SSL certificate monitoring for your GlitchTip domain
Alerting thresholds tuned for a production-critical error pipeline

Prerequisites

A running GlitchTip instance with a public or network-reachable domain
HTTPS configured (e.g., https://glitchtip.example.com)
A free account at vigilmon.online

Step 1: Verify GlitchTip's Health Endpoint

GlitchTip exposes a health check at /.well-known/healthz that confirms the Django application server and its backing database are responding:

curl -i https://glitchtip.example.com/.well-known/healthz

A healthy instance returns HTTP 200:

{
  "status": "ok"
}

This endpoint requires no authentication and is designed for uptime probes. A 200 response confirms the Django WSGI server is running and PostgreSQL (GlitchTip's primary database for events, issues, teams, and projects) is reachable. A non-200 response or a timeout indicates the application has crashed, the database is unreachable, or the container is mid-restart.

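Fallback: The health path can differ between GlitchTip versions and deployment methods. If /.well-known/healthz returns 404, first check which path your version documents (GlitchTip's Helm chart, for example, probes /_health/). If no dedicated health path is available, use the root path / as your primary availability monitor: it either returns 200 directly or redirects to the dashboard, which most HTTP monitors follow to a final 200. Add a keyword check so you know the response really comes from GlitchTip and not from a proxy error page.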
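The health check and its fallback can be sketched in a few lines of Python. This is a minimal illustration, not Vigilmon's implementation: the function names `http_fetch` and `probe_glitchtip` are hypothetical, and the logic simply assumes the rules described above (200 plus the keyword ok on the health path, and a plain 200 on the root path when the health path returns 404).

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple

# A fetcher takes a URL and returns (status_code, body_text).
Fetch = Callable[[str], Tuple[int, str]]


def http_fetch(url: str, timeout: float = 15.0) -> Tuple[int, str]:
    """Fetch a URL with the standard library.

    4xx/5xx responses are returned as values. Timeouts and connection
    errors still raise, and the caller should treat those as "down".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", errors="replace")


def probe_glitchtip(base_url: str, fetch: Fetch = http_fetch) -> Tuple[str, bool]:
    """Return (url_checked, healthy) for a GlitchTip base URL."""
    base = base_url.rstrip("/")
    health_url = base + "/.well-known/healthz"
    status, body = fetch(health_url)
    if status == 404:
        # Assumed fallback: no health path on this deployment, use the root path.
        root_url = base + "/"
        status, _ = fetch(root_url)
        return root_url, status == 200
    # Require both the status and the keyword, so a bare 200 from a proxy fails.
    return health_url, status == 200 and "ok" in body
```

Passing the fetcher as a parameter keeps the decision logic separate from the network call, which makes the rules easy to check against canned responses before pointing the probe at a real instance.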

Step 2: Create a Vigilmon HTTP Monitor for the Health Endpoint

Log in to Vigilmon → Add Monitor → HTTP.
URL: https://glitchtip.example.com/.well-known/healthz.
Check interval: 60 seconds.
Response timeout: 15 seconds.
Expected status: 200.
Keyword: ok.
Label: GlitchTip Health.
Click Save.

This monitor catches:

Django WSGI server crashes or OOM kills from large error ingestion spikes
PostgreSQL connectivity failures — GlitchTip stores all issues, events, teams, projects, alerts, and user accounts in PostgreSQL; a database outage makes all error tracking non-functional
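Redis connectivity failures, but only if your deployment's health check covers Redis; when it does not, the health endpoint can stay green while the Celery queue stalls, which is what the worker heartbeat in Step 5 is for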
Container restart loops triggered by misconfigured environment variables or secret mismatches

The ok keyword check ensures the server is reporting a healthy state — not just a 200 from a reverse proxy in front of an unreachable Django backend.

PostgreSQL dependency in error pipelines: GlitchTip's PostgreSQL database is the source of truth for every captured exception, issue group, alert rule, and team membership. If it goes down during a deployment, all incoming SDK error events are silently lost and developers cannot access the issue dashboard to investigate production problems. The health endpoint is your earliest warning of a database-layer failure.

Step 3: Monitor the GlitchTip Web Dashboard

The GlitchTip web dashboard is where your development team reviews exceptions, triages issues, and monitors performance events. Monitor it independently from the API to catch reverse proxy failures and static asset problems:

Add Monitor → HTTP.
URL: https://glitchtip.example.com.
Check interval: 60 seconds.
Expected status: 200.
Keyword: GlitchTip.
Label: GlitchTip Dashboard.
Click Save.

This monitor catches nginx or reverse proxy failures, CDN misconfiguration, and Django static file serving errors that prevent developers from accessing the issue list — even when the event ingestion API is still accepting SDK payloads. A broken dashboard means your team cannot investigate an active production incident, even though errors are technically being collected.

Step 4: Monitor the API Liveness

GlitchTip serves its Sentry-compatible REST API under /api/0/, from the same Django application that handles SDK event ingestion. Calling it without authentication returns 401 Unauthorized, the expected response from an API server that is alive and enforcing authentication:

curl -i https://glitchtip.example.com/api/0/
# Expected: HTTP 401 (API is alive, authentication is enforced)

Add Monitor → HTTP.
URL: https://glitchtip.example.com/api/0/.
Check interval: 60 seconds.
Expected status: 401.
Label: GlitchTip API.
Click Save.

A 401 is the correct liveness signal: it proves the Django API server accepted the connection, ran request routing and authentication middleware, and returned a proper HTTP response. A 502 or 504 means the reverse proxy is running but the Django process is not responding. A timeout means the application or network layer has failed entirely and your SDK clients are sending error events into a black hole.
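The interpretation of each outcome can be written down as a small lookup. This is only a sketch of the reasoning in the paragraph above; the function name `classify_api_response` and the label strings are hypothetical, and `None` is used here to represent a check that timed out without any HTTP response.

```python
from typing import Optional


def classify_api_response(status: Optional[int]) -> str:
    """Map the result of an unauthenticated GET on /api/0/ to a diagnosis."""
    if status is None:
        # No HTTP response at all: application or network layer has failed.
        return "down"
    if status == 401:
        # Routing and authentication middleware ran: the API is alive.
        return "alive"
    if status in (502, 504):
        # The reverse proxy answered, but Django behind it did not.
        return "proxy up, backend not responding"
    # Anything else (200, 404, 500, ...) is not the expected liveness signal.
    return "unexpected"
```

Treating every non-401 status as a failure mirrors the monitor configuration above, where the expected status is exactly 401.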

Why this monitor is critical for error pipelines: Sentry SDK clients in your applications send error events asynchronously in the background. If the API is down, the SDKs fail silently: no events are delivered, and you only discover the gap when you notice missing issues in the dashboard hours after the outage. This monitor detects the failure within one 60-second check interval, so an error tracking blackout does not go unnoticed while you are already dealing with a production incident.

Step 5: Monitor Worker Queue Health with a Cron Heartbeat

GlitchTip relies on Celery workers to process incoming error payloads asynchronously — the API accepts events from SDK clients, places them on a Redis queue, and the workers write them to PostgreSQL, group them into issues, and trigger alert rules. If the workers stall, the API appears healthy but events accumulate in the queue and are never persisted.

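To detect a stalled worker queue, pair a Vigilmon cron (heartbeat) monitor with a periodic job that the GlitchTip workers execute, such as a GlitchTip monitor (available under Project → Monitors) or a periodic Celery task. The job pings Vigilmon each time it completes successfully, so a missing ping means the workers have stopped processing: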

In Vigilmon, Add Monitor → Cron and copy the heartbeat URL.
In GlitchTip, configure a periodic Celery task to curl that heartbeat URL after each successful run.
Expected interval: set to match your Celery beat schedule (e.g., every 5 minutes).
Grace period: 2 minutes.
Label: GlitchTip Worker Heartbeat.
Click Save.
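The interval and grace period combine into a single deadline: with a 5-minute schedule and a 2-minute grace period, a heartbeat counts as missed once more than 7 minutes have passed since the last ping. The sketch below shows that arithmetic. It assumes the deadline is simply interval plus grace, which is a common heartbeat convention and not a documented Vigilmon rule, and the function name `heartbeat_overdue` is hypothetical.

```python
from datetime import datetime, timedelta


def heartbeat_overdue(
    last_ping: datetime,
    now: datetime,
    interval: timedelta = timedelta(minutes=5),
    grace: timedelta = timedelta(minutes=2),
) -> bool:
    """Return True once the time since the last ping exceeds interval + grace."""
    return now - last_ping > interval + grace
```

A grace period shorter than the typical run time of the periodic task will produce false alarms, so size it to cover normal scheduling jitter.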

Alternatively, configure a Celery beat task that performs a test database write and sends the heartbeat only on success. If the heartbeat stops, Vigilmon alerts you that the worker queue has stalled even though the HTTP health endpoint is still returning 200.
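A "ping only on success" wrapper might look like the following. It is a sketch under stated assumptions: `HEARTBEAT_URL` is a placeholder for the URL copied from Vigilmon, `run_with_heartbeat` and `send_heartbeat` are hypothetical names, and the job passed in stands for whatever test write your periodic task performs. Wiring it into Celery beat is left out because it depends on how your deployment registers tasks.

```python
import logging
import urllib.request
from typing import Callable

# Placeholder: replace with the heartbeat URL copied from your Vigilmon cron monitor.
HEARTBEAT_URL = "https://example.invalid/your-heartbeat-url"

log = logging.getLogger("glitchtip.heartbeat")


def send_heartbeat(url: str = HEARTBEAT_URL, timeout: float = 10.0) -> None:
    """Send a plain GET to the heartbeat URL."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def run_with_heartbeat(
    job: Callable[[], None],
    ping: Callable[[], None] = send_heartbeat,
) -> bool:
    """Run the job and ping only if it finished without raising."""
    try:
        job()
    except Exception:
        # No ping is sent, so the monitor sees a missed heartbeat.
        log.exception("periodic job failed; heartbeat withheld")
        return False
    ping()
    return True
```

Because the ping is sent from inside the worker, its absence covers both a crashed worker and a job that runs but fails, which an external HTTP check cannot distinguish.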

Step 6: Monitor SSL Certificates

An expired SSL certificate on your GlitchTip instance has cascading effects across your entire application portfolio:

The GlitchTip web dashboard becomes inaccessible to all developers
Sentry SDK clients in your applications reject the TLS certificate and stop sending error events — creating a complete error tracking blackout
Any CI/CD pipelines that post test failure events to GlitchTip break with TLS errors
Alert notification webhooks that GlitchTip fires on new issues fail if their targets are served behind the same expired certificate

Add Monitor → SSL Certificate.
Domain: glitchtip.example.com.
Alert when expiry is within: 30 days.
Alert again: 14 days, 7 days, 3 days, 1 day.
Click Save.
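To see where a certificate currently stands against these thresholds, the days remaining can be computed with Python's standard library. This is a rough sketch: the function names are hypothetical, the threshold tuple just mirrors the schedule above, and `fetch_not_after` uses default certificate verification, so it raises an error for a certificate that has already expired.

```python
import socket
import ssl
from datetime import datetime, timezone
from typing import List

# Alert schedule from the monitor settings above, in days before expiry.
THRESHOLDS = (30, 14, 7, 3, 1)


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jan 31 00:00:00 2030 GMT'."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between now (timezone-aware) and the notAfter timestamp."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def thresholds_crossed(days_left: int) -> List[int]:
    """List every alert threshold the certificate has already reached."""
    return [t for t in THRESHOLDS if days_left <= t]
```

For a live check, call `days_until_expiry(fetch_not_after("glitchtip.example.com"), datetime.now(timezone.utc))` and pass the result to `thresholds_crossed`.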

Step 7: Configure Alerting

In Vigilmon under Settings → Notifications, configure your alert channels:

| Monitor | Trigger | Action |
|---|---|---|
| /.well-known/healthz | Non-200 or ok missing | Check GlitchTip container; inspect PostgreSQL and Redis connectivity; review Django logs |
| Web Dashboard | Non-200 or keyword missing | Check nginx/reverse proxy; verify static asset serving; inspect container logs |
| API (/api/0/) | Non-401 response | Check Django API process; inspect Gunicorn workers; verify Redis is reachable |
| Worker Heartbeat | Missed heartbeat | Check Celery worker containers; inspect Redis queue depth; review worker logs for stalled tasks |
| SSL certificate | < 30 days to expiry | Renew certificate; verify Let's Encrypt auto-renewal is functioning |

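Alert after: 2 consecutive failures for HTTP monitors. At a 60-second check interval, a failure is detected on the first failed check and the alert fires on the second, roughly one to two minutes after the outage begins. For the worker heartbeat, treat a single missed heartbeat as actionable, because a stalled queue means events are being silently dropped.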
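The consecutive-failure rule is easy to reason about as a streak counter: a single failed check followed by a success resets the count, so one transient blip never pages anyone. The sketch below illustrates the rule and is not Vigilmon's code; `first_alert_index` is a hypothetical name, and each list element stands for one check result in order.

```python
from typing import Iterable, Optional


def first_alert_index(results: Iterable[bool], threshold: int = 2) -> Optional[int]:
    """Return the index of the check at which an alert would first fire.

    `results` holds one boolean per check (True = passed). The alert fires
    once `threshold` failures have occurred in a row; None means no alert.
    """
    streak = 0
    for index, passed in enumerate(results):
        streak = 0 if passed else streak + 1
        if streak >= threshold:
            return index
    return None
```

With checks every 60 seconds, the returned index times 60 gives the approximate number of seconds between the first check in the list and the alert.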

Common GlitchTip Failure Modes and What Vigilmon Catches

| Scenario | Vigilmon monitor |
|---|---|
| Django server OOM-killed by event ingestion spike | Health endpoint unreachable; alert within 60 s |
| PostgreSQL disk full from event volume growth | Health check fails; all issues and events inaccessible |
| Redis failure stalls Celery queue | Health check may pass; worker heartbeat stops; events backlog silently |
| Celery workers crash after memory leak | Worker heartbeat stops; events accepted at API but never written |
| Reverse proxy misconfiguration after nginx update | Dashboard monitor fires; API may still be reachable directly |
| Static asset serving failure after Django upgrade | Dashboard keyword check fails; blank page or 500 on load |
| SSL certificate expires | SDK clients reject TLS; all error events from all applications stop |
| DNS misconfiguration | All monitors fire simultaneously |
| Database migration failure after version upgrade | Health check fails or Django returns 500 on all routes |
| Environment variable misconfiguration (SECRET_KEY, DB_URL) | Container restart loop; health endpoint intermittently returns 503 |
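Because each failure mode trips a different combination of monitors, the pattern of alerts is itself a first triage hint. The sketch below encodes a few rows of the table as a lookup. It is a deliberately rough heuristic: the monitor labels and the function name `likely_cause` are hypothetical, and real incidents can produce combinations that this does not cover.

```python
from typing import FrozenSet, Iterable

# Hypothetical short labels for the five monitors set up in this guide.
ALL_MONITORS: FrozenSet[str] = frozenset(
    {"health", "dashboard", "api", "heartbeat", "ssl"}
)


def likely_cause(failing: Iterable[str]) -> str:
    """Suggest where to look first, given the set of failing monitors."""
    failed = frozenset(failing)
    if not failed:
        return "no failure"
    if failed == ALL_MONITORS:
        # Everything firing at once points below the application layer.
        return "DNS or network-level failure"
    if failed == {"heartbeat"}:
        # HTTP checks pass but the workers have stopped reporting.
        return "Celery workers or Redis"
    if failed == {"dashboard"}:
        return "reverse proxy or static assets"
    if failed == {"ssl"}:
        return "certificate approaching expiry"
    if "health" in failed:
        return "Django application or PostgreSQL"
    return "unclassified; inspect each monitor"
```

For example, `likely_cause({"heartbeat"})` points at the workers or Redis, matching the rows where the health check may still pass while events back up.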

Error tracking is your first line of defence against production bugs — a GlitchTip outage during an incident is the worst possible time to discover your error pipeline was also down. Vigilmon watches GlitchTip's health endpoint, web dashboard, API, worker queue, and SSL certificate so you're alerted within 60 seconds of any failure, before your team loses the visibility they need to debug a live production problem.

Start monitoring GlitchTip in under 5 minutes — register free at vigilmon.online.