Monitoring Drone CI (and Woodpecker CI) with Vigilmon: Healthz Endpoint, Web UI, Runner TCP & SSL Alerts

Drone CI (and its open-source fork Woodpecker CI) powers automated build and deployment pipelines for development teams. When the Drone server goes down, builds queue and never run, pull request checks hang indefinitely, and deployments stop. When runner agents disconnect, pipelines appear to accept triggers but never execute. Vigilmon gives you external visibility into Drone's health: the built-in healthz endpoint, web UI availability, runner agent TCP connectivity, and SSL certificate expiry — so you catch CI failures before your developers do.

What You'll Build

A monitor on Drone's /healthz endpoint for server availability
A web UI uptime check to catch frontend routing failures
A TCP port monitor for runner agent connectivity
SSL certificate monitoring for your Drone domain
An alerting setup that distinguishes server failures from runner disconnections

Prerequisites

A running Drone CI 2.x or Woodpecker CI 2.x server with a public or network-reachable domain
HTTPS configured (e.g., https://drone.example.com)
At least one runner agent deployed
A free account at vigilmon.online

Step 1: Verify Drone's Healthz Endpoint

Drone CI exposes a /healthz endpoint that confirms the server process is running, the database connection is healthy, and the core services are initialized:

curl https://drone.example.com/healthz

A healthy Drone server returns:

OK

Woodpecker CI uses the same path:

curl https://woodpecker.example.com/healthz

A healthy Woodpecker server returns:

{"status":"OK"}

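Drone vs. Woodpecker response format: Drone returns the plain string OK; Woodpecker returns a JSON object {"status":"OK"}. Status codes and bodies can differ between releases, so run curl -i against your own server and set the Vigilmon expected status and keyword to match what it actually returns.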

If /healthz returns a non-200 status or the body does not contain OK, the server process is degraded — the database may be unreachable, or the process may be in a restart loop.
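The health rule above (HTTP 200 plus OK in the body) can be sketched as a small Python function. This is an illustrative sketch only: the function name healthz_ok is hypothetical, and for JSON bodies it compares the status field exactly, which is stricter than a plain substring keyword match.

```python
import json


def healthz_ok(status: int, body: str, keyword: str = "OK") -> bool:
    """Return True when a /healthz response looks healthy.

    Healthy means HTTP 200 and the keyword present, whether the body is
    plain text (OK) or a JSON object ({"status":"OK"}).
    """
    if status != 200:
        return False
    text = body.strip()
    try:
        parsed = json.loads(text)
    except ValueError:
        # Not JSON, so treat the body as plain text.
        parsed = None
    if isinstance(parsed, dict):
        # JSON form: compare the "status" field rather than the raw text.
        return str(parsed.get("status", "")) == keyword
    # Plain-text form: the body should contain the keyword.
    return keyword in text
```

For example, healthz_ok(200, "OK") and healthz_ok(200, '{"status":"OK"}') both pass, while a 503 response or an empty body does not.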

Step 2: Create a Vigilmon HTTP Monitor for the Healthz Endpoint

Log in to Vigilmon → Add Monitor → HTTP.
URL: https://drone.example.com/healthz (or https://woodpecker.example.com/healthz).
Check interval: 60 seconds.
Response timeout: 10 seconds.
Expected status: 200.
Keyword: OK (works for both Drone's plain-text and Woodpecker's JSON response).
Click Save.

This monitor fires when:

The Drone/Woodpecker server process crashes or is OOM-killed
The database (SQLite, MySQL, or PostgreSQL) becomes unreachable
A failed upgrade leaves the server in a startup loop
The host becomes unreachable

Alert sensitivity: Set to trigger after 1 consecutive failure. A down CI server means all builds are queued indefinitely and PR status checks never resolve.
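To see how those failure cases surface in a single check, here is a rough Python sketch of one probe using only the standard library. It is not how Vigilmon is implemented; the function name probe and the returned reason strings are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def probe(url: str, keyword: str = "OK", timeout: float = 10.0) -> str:
    """Fetch url once and return "up" or a short failure reason."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx: the server answered, but not with 200.
        return f"bad status {exc.code}"
    except OSError as exc:
        # DNS failure, refused connection, TLS error, or timeout.
        return f"unreachable: {exc}"
    if status != 200:
        return f"bad status {status}"
    if keyword not in body:
        return "keyword missing"
    return "up"
```

Calling probe("https://drone.example.com/healthz") from a host outside your network approximates what the external monitor sees: a crashed process or unreachable host shows up as "unreachable", while a degraded server that still answers shows up as a bad status or a missing keyword.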

Step 3: Monitor the Web UI

The Drone/Woodpecker web UI is the interface developers use to view build logs, trigger rebuilds, and manage secrets. A UI availability check exercises the HTTP layer independently from the /healthz API endpoint — a reverse proxy misconfiguration can break the UI while the healthz endpoint stays green:

curl https://drone.example.com/

Add Monitor → HTTP.
URL: https://drone.example.com/ (root of your Drone domain).
Check interval: 5 minutes.
Expected status: 200.
Keyword: Drone, or Woodpecker for Woodpecker CI (the name appears in the HTML page title).
Label: Drone CI Web UI.
Click Save.

Login redirect: If your Drone instance redirects unauthenticated requests to a login page or OAuth provider, the keyword check should target a string that appears on the login page, such as Login or GitHub. Set the expected status to 200 (after redirect) and adjust the keyword to what actually appears in the redirected page body.
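One way to choose a keyword is to look at the HTML your instance actually serves after any redirect. The Python sketch below extracts the page title and picks the first candidate string found in the page; the helper names page_title and pick_keyword, and the sample HTML in the usage note, are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional, Sequence


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html: str) -> str:
    """Return the stripped text of the page's <title>, or "" if absent."""
    parser = _TitleParser()
    parser.feed(html)
    return parser.title.strip()


def pick_keyword(html: str, candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate keyword present in the page, else None."""
    for word in candidates:
        if word in html:
            return word
    return None
```

Given a saved copy of the page (for example from curl -L), pick_keyword(html, ["Drone", "Woodpecker", "Login", "GitHub"]) returns the first string that is safe to use as the monitor keyword.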

Step 4: Monitor Runner Agent TCP Connectivity

Runners and agents keep a long-lived connection to the server. Woodpecker agents connect over gRPC (the server's gRPC listener defaults to port 9000, set by WOODPECKER_GRPC_ADDR). Drone 1.x and 2.x runners poll the server over HTTP(S) at the address in DRONE_RPC_HOST, which is normally the same port as the web UI. When all runners disconnect, the server accepts pipeline triggers but queues them indefinitely without executing, a failure mode that's invisible to the healthz endpoint.

Monitor the TCP port the runners connect on:

Add Monitor → TCP Port.
Host: drone.example.com.
Port: 9000 for Woodpecker's default gRPC listener, or the port implied by your runners' DRONE_RPC_HOST and DRONE_RPC_PROTO settings for Drone (443 when runners use the HTTPS endpoint).
Check interval: 2 minutes.
Label: Drone runner port.
Click Save.

Woodpecker agents: In addition to the server-side port check, Woodpecker agents expose their own health endpoint. If the agent hosts are reachable from Vigilmon, monitor each agent directly (e.g., http://agent-host:3000/healthz).

Firewall note: If the runner gRPC port is not publicly accessible (runners connect from a private network), this TCP monitor will fail even when runners are healthy. In that case, rely on the healthz endpoint and build queue depth instead of a TCP check.
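Before adding the TCP monitor, it can help to confirm from an outside host that the port is reachable at all. A minimal Python sketch of such a check follows; the function name tcp_port_open is hypothetical, and the host and port you pass are whatever your runners actually use.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution errors.
        return False
```

If tcp_port_open("drone.example.com", 9000) returns False from outside your network while runners are working normally, the port is private and the firewall note above applies.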

Step 5: Monitor SSL Certificates

Drone's HTTPS certificate protects the web UI, OAuth callback URLs, and the webhook endpoints that GitHub/GitLab send build triggers to. An expired certificate breaks all three: the web UI shows a certificate error, OAuth redirects fail, and incoming webhook payloads are rejected:

Add Monitor → SSL Certificate.
Domain: drone.example.com.
Alert when expiry is within: 30 days.
Alert again: 14 days, 7 days, 3 days, 1 day.
Click Save.

Let's Encrypt auto-renewal: Drone has built-in Let's Encrypt support (set DRONE_TLS_AUTOCERT=true). If auto-renewal is enabled, SSL alerts indicate that the renewal process has failed — check Drone logs for ACME errors and ensure port 80 is reachable for the HTTP-01 challenge.
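The expiry thresholds above amount to simple date arithmetic on the certificate's notAfter field. The following Python sketch shows one way to compute it with the standard ssl module; the function names are hypothetical, and the reminder schedule simply mirrors the 30/14/7/3/1-day settings from this step.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between now (timezone-aware) and a notAfter timestamp.

    not_after uses the format returned by getpeercert(), for example
    "Jun  1 12:00:00 2030 GMT".
    """
    seconds = ssl.cert_time_to_seconds(not_after)
    expires = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return (expires - now).days


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to host and return the leaf certificate's notAfter string.

    Raises an ssl.SSLError subclass if the certificate is already
    expired or otherwise fails verification.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def due_reminder(days_left: int, thresholds=(30, 14, 7, 3, 1)):
    """Return the tightest threshold already crossed, or None if none."""
    crossed = [t for t in thresholds if days_left <= t]
    return min(crossed) if crossed else None
```

For a live check, pass fetch_not_after("drone.example.com") and datetime.now(timezone.utc) to days_until_expiry, then feed the result to due_reminder to see which reminder stage applies.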

Step 6: Configure Alerting

In Vigilmon under Settings → Notifications, configure your alert channels. The table below maps each monitor's trigger to a first response:

| Monitor | Trigger | Action |
|---|---|---|
| /healthz | Non-200 or OK missing | Check Drone/Woodpecker process; verify DB connectivity; review container logs |
| Web UI | Non-200 or keyword missing | Check reverse proxy/nginx config; verify Drone HTTP listener |
| Runner TCP port | Connection refused or timeout | Check if runner agents are running; verify network/firewall rules |
| SSL certificate | < 30 days to expiry | Check Let's Encrypt renewal; verify port 80 challenge accessibility |

Alert after: 1 consecutive failure for healthz and SSL; 2 consecutive failures for runner TCP (transient during runner restarts) and UI (transient during deployments).
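The "consecutive failures" rule can be pictured as a small counter per monitor. The Python sketch below is an illustration of that idea, not Vigilmon's internal logic; the class name FailureGate is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureGate:
    """Raises an alert only after `threshold` consecutive failed checks."""

    threshold: int
    streak: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if check_passed:
            # Any success resets the streak, absorbing transient blips.
            self.streak = 0
            return False
        self.streak += 1
        # Fire exactly once, on the check that reaches the threshold.
        return self.streak == self.threshold
```

With FailureGate(threshold=1) the healthz monitor alerts on the first failed check, while FailureGate(threshold=2) for the runner port stays quiet through a single failed check during a runner restart.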

Common Drone/Woodpecker Failure Modes and What Vigilmon Catches

| Scenario | Vigilmon monitor |
|---|---|
| Drone server crash or OOM | /healthz unreachable; alert within 60 s |
| Database goes down | /healthz returns non-OK; builds queue but don't run |
| All runner agents disconnect | TCP port monitor fires; builds queue indefinitely |
| Runner agents up but server rejects them | Build queue grows; TCP port open but builds don't start |
| SSL certificate expires | SSL monitor alerts at 30-day threshold; webhooks and OAuth fail |
| Reverse proxy misconfiguration | Web UI monitor fires while healthz stays green |
| Port 9000 blocked by firewall | Runner TCP monitor fires; investigate network rules |
| DNS misconfiguration | All monitors fire simultaneously |
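The table can also be read in reverse: given which monitors are currently failing, where should you look first? The Python sketch below encodes that reading as a rough triage helper. The monitor names and the returned hints are hypothetical labels chosen for this example, and real incidents may not map this cleanly.

```python
def likely_cause(failing: set) -> str:
    """Map the set of failing monitor names to a first place to look."""
    all_monitors = {"healthz", "web_ui", "runner_tcp", "ssl"}
    if not failing:
        return "all monitors healthy"
    if failing >= all_monitors:
        # Everything failing at once points below the application layer.
        return "DNS or host-level outage"
    if "healthz" in failing:
        return "server process or database"
    if failing == {"web_ui"}:
        return "reverse proxy or UI routing"
    if failing == {"runner_tcp"}:
        return "runner port, firewall, or runner agents"
    if failing == {"ssl"}:
        return "certificate renewal"
    return "multiple partial failures; inspect each monitor"
```

For instance, likely_cause({"web_ui"}) points at the reverse proxy, matching the case where the Web UI monitor fires while healthz stays green.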

Drone CI and Woodpecker CI failures are often invisible until a developer asks why their PR has been "pending" for 30 minutes. A down server, a disconnected runner pool, or an expired SSL certificate can silently halt all CI activity while the Git hosting side shows builds as pending rather than failed. Vigilmon gives you layered external monitoring of the server health, web UI, runner connectivity, and SSL certificates so you catch CI failures as soon as they happen.

Start monitoring Drone CI in under 5 minutes — register free at vigilmon.online.