DevOps Monitoring Checklist 2026: What to Monitor Before You Ship

Most outage post-mortems contain a line that reads something like: "We didn't have monitoring on X, which is why we didn't catch this sooner." After the incident, the monitoring gets added. Before the incident, it felt like overhead.

This checklist is the thing you add before the incident. It covers the monitoring categories that matter for production services in 2026, what specifically to configure in each category, and how to implement the common items with Vigilmon in under an hour.


Why This Checklist Exists

There's a difference between having monitoring and having monitoring that works when it matters. Teams often have some monitoring — a cloud provider health check here, a basic ping there — but the coverage is inconsistent and the gaps are exactly where failures hide.

This checklist is opinionated. It reflects what SREs and DevOps engineers have learned is necessary after running production systems through real incidents. Not every item applies to every system, but every item on this list has a corresponding incident story somewhere.


Category 1: Uptime Monitors (External HTTP)

External uptime monitors send HTTP requests to your endpoints from outside your infrastructure and alert when requests fail. They catch the failures that matter most to users: the service is unreachable from the internet.

Checklist

  • [ ] Root domain: Monitor https://yourdomain.com — catches DNS failures, load balancer issues, TLS certificate problems
  • [ ] Primary API base URL: Monitor https://api.yourdomain.com or your primary API endpoint
  • [ ] Health check endpoint: If you expose /health or /api/health, monitor it — this is your service's explicit liveness declaration
  • [ ] Critical user-facing routes: Login, checkout, primary data endpoints — the routes where a 5xx costs you revenue or users
  • [ ] Check interval: 1 minute or less for revenue-critical endpoints; 5 minutes acceptable for secondary routes
  • [ ] Multi-region probing: Probes from multiple geographic locations catch CDN regional failures and DNS propagation issues invisible from a single probe location
  • [ ] Status code validation: Configure expected codes (typically 200, 204) — a 301 redirect loop or unexpected 403 should alert, not silently pass
  • [ ] Response time alerting: Set a threshold for response time (e.g., alert if p50 exceeds 2 seconds) — slow responses are degraded service even if status codes pass
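
The status code and response time items above can be sketched in a few lines of Node.js. This is a minimal illustration, not a description of how Vigilmon works internally: evaluateProbe, probe, and the default thresholds are hypothetical names and values chosen for the example, and the code assumes Node 18+ for the built-in fetch.

```javascript
// Hypothetical names throughout; nothing here is Vigilmon's API.
// Requires Node 18+ for the global fetch and AbortSignal.timeout.

// Classify one probe result against the expected status codes and a latency budget.
function evaluateProbe(result, { expectedStatus = [200, 204], maxMs = 2000 } = {}) {
  if (result.error) {
    return { ok: false, reason: `request failed: ${result.error}` };
  }
  if (!expectedStatus.includes(result.status)) {
    return { ok: false, reason: `unexpected status ${result.status}` };
  }
  if (result.durationMs > maxMs) {
    return { ok: false, reason: `slow response: ${result.durationMs}ms` };
  }
  return { ok: true, reason: 'ok' };
}

// Perform one probe. In Node, redirect: 'manual' returns the 3xx response itself,
// so a 301 is reported instead of being silently followed.
async function probe(url, timeoutMs = 5000) {
  const started = Date.now();
  try {
    const res = await fetch(url, {
      redirect: 'manual',
      signal: AbortSignal.timeout(timeoutMs),
    });
    return { status: res.status, durationMs: Date.now() - started };
  } catch (err) {
    return { error: err.name, durationMs: Date.now() - started };
  }
}
```

A caller would typically run something like evaluateProbe(await probe(url)) on a timer and hand any failed result to whatever alerting logic it uses.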

Implementation with Vigilmon:

vigilmon.online → Add Monitor → HTTP
URL: https://api.yourapp.com/health
Interval: 1 minute
Alert when: 2 consecutive failures
Notify: Slack webhook
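
To make the "2 consecutive failures" rule concrete, here is a rough sketch of the alert decision, combined with a simple multi-region quorum. The function names, the quorum rule, and the region labels are assumptions for illustration; Vigilmon's own logic may differ.

```javascript
// Hypothetical sketch: the names and rules below are illustrative assumptions.

// Treat the endpoint as down only when at least `quorum` regions report a failure,
// so a single unhealthy probe location does not trigger an alert on its own.
// `regionResults` maps a region label to true (probe passed) or false (probe failed).
function isDownByQuorum(regionResults, quorum = 2) {
  const failures = Object.values(regionResults).filter((ok) => !ok).length;
  return failures >= quorum;
}

// Alert only after `threshold` consecutive failed checks.
// Any successful check resets the counter.
function createFailureTracker(threshold = 2) {
  let consecutive = 0;
  return function record(ok) {
    consecutive = ok ? 0 : consecutive + 1;
    return consecutive >= threshold; // true means "alert now"
  };
}
```

Requiring both a quorum and consecutive failures trades a minute or two of detection time for far fewer false alarms from transient network blips.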

Category 2: Health Endpoint Standards

A health endpoint is a dedicated route that your application exposes specifically for monitoring systems to probe. It's worth getting right.

What a good health endpoint should do

  • [ ] Return 200 OK only when the service is genuinely healthy and ready to serve traffic
  • [ ] Return a non-2xx code (typically 503 Service Unavailable) when critical dependencies are unavailable
  • [ ] Include basic metadata in the response body: service name, version, uptime, and dependency status
  • [ ] Respond quickly — health checks should complete in under 500ms; anything slower risks false positives from probe timeouts
  • [ ] Not require authentication — monitors need to reach it without credentials

Minimal health endpoint (Express/Node.js):

// Assumes `app` is an Express application and `db` is your database client
app.get('/health', async (req, res) => {
  try {
    await db.query('SELECT 1');  // verify DB connection
    res.status(200).json({
      status: 'ok',
      service: 'api',
      // npm_package_version is only set when the process is started via npm
      version: process.env.npm_package_version,
      uptime: process.uptime()  // seconds since the process started
    });
  } catch (err) {
    console.error('Health check failed:', err);  // keep details in logs, not the response
    res.status(503).json({ status: 'degraded', error: 'database unavailable' });
  }
});

  • [ ] Add the health endpoint to your uptime monitor immediately after deploying it
  • [ ] Test that it returns 503 (not 200) when the database connection fails. The most common health check mistake is returning 200 regardless of dependency state
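
One way to keep a health endpoint fast and to report per-dependency status is to run each dependency check under a timeout. The sketch below is framework-independent and makes several assumptions: withTimeout and runHealthChecks are hypothetical helper names, and the 400 ms budget is simply a value that stays under the 500 ms target mentioned above.

```javascript
// Hypothetical helpers; the names and the 400 ms budget are assumptions.

// Reject if a dependency check takes longer than `ms`, so the endpoint never hangs.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run every check in parallel and summarise the outcome as a status code plus body.
// `checks` maps a dependency name to a function that returns a promise.
async function runHealthChecks(checks, budgetMs = 400) {
  const names = Object.keys(checks);
  const settled = await Promise.allSettled(
    names.map((name) => withTimeout(Promise.resolve().then(checks[name]), budgetMs))
  );
  const dependencies = {};
  settled.forEach((outcome, i) => {
    dependencies[names[i]] = outcome.status === 'fulfilled' ? 'ok' : 'unavailable';
  });
  const healthy = Object.values(dependencies).every((state) => state === 'ok');
  return {
    statusCode: healthy ? 200 : 503,
    body: { status: healthy ? 'ok' : 'degraded', dependencies },
  };
}
```

Inside a route handler, the result could be used as res.status(statusCode).json(body) after calling runHealthChecks({ database: () => db.query('SELECT 1') }), where db is your own client.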

Category 3: SSL Certificate Monitoring

TLS certificate expiry is a fully preventable outage that still takes down production services regularly. Modern certificate management tools (Let's Encrypt auto-renewal, ACM) have reduced it but not eliminated it.

Checklist

  • [ ] Monitor certificate expiry for all HTTPS endpoints — alert with at least 14 days of warning
  • [ ] Monitor wildcard certificates separately — a wildcard cert expiry takes down all subdomains at once
  • [ ] Monitor CDN TLS — your origin certificate may be valid but the CDN edge certificate on a custom domain can expire independently
  • [ ] Verify auto-renewal is actually renewing — Let's Encrypt renewal jobs fail silently more often than you'd expect; monitor the renewal job with a cron heartbeat
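  • [ ] Verify certificate alerts before trusting them: check the expiry date manually (for example with openssl s_client or your browser's certificate viewer) and confirm it matches what the monitor reports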

Vigilmon monitors SSL certificate expiry automatically for all HTTPS monitors you add — it alerts before the certificate expires based on your configured warning threshold.
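
If you want to cross-check an expiry date yourself, a small Node.js script can read it straight from the TLS handshake. This is a minimal sketch using Node's built-in tls module; daysUntilExpiry and checkCertificate are hypothetical helper names, and the 14-day threshold in the usage note comes from the checklist above.

```javascript
const tls = require('node:tls');

// Whole days between `now` and the certificate's expiry date (negative once expired).
function daysUntilExpiry(validTo, now = new Date()) {
  const msPerDay = 24 * 60 * 60 * 1000;
  return Math.floor((new Date(validTo).getTime() - now.getTime()) / msPerDay);
}

// Connect, read the leaf certificate, and resolve with its remaining lifetime in days.
// With Node's default settings an already-expired or untrusted certificate makes the
// handshake fail, so this promise rejects in that case instead of resolving.
function checkCertificate(host, port = 443, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    const socket = tls.connect({ host, port, servername: host }, () => {
      const cert = socket.getPeerCertificate();
      socket.end();
      resolve({ host, validTo: cert.valid_to, daysLeft: daysUntilExpiry(cert.valid_to) });
    });
    socket.setTimeout(timeoutMs, () => socket.destroy(new Error('TLS handshake timed out')));
    socket.on('error', reject);
  });
}
```

Calling checkCertificate('yourdomain.com') and warning when daysLeft drops below 14 gives you an independent number to compare against the monitor's alert.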


Category 4: Cron Job and Scheduled Task Monitoring

Silent cron job failures are one of the most common categories of "we didn't know until a user told us" incidents. A cron job that stops running produces no errors — it simply produces silence. Without active monitoring, that silence is invisible.

Checklist

  • [ ] Inventory every scheduled job in production — list them, including frequency and what breaks if they don't run
  • [ ] Add a heartbeat monitor for each critical job — the job pings a unique URL at completion; the monitor alerts if the ping doesn't arrive on schedule
  • [ ] Monitor job success, not just job execution: a job that runs but fails silently is just as bad as one that doesn't run, so ping only after verifying your success condition
  • [ ] Cover the common silent failure jobs:
    • Daily database backups
    • Data sync and ETL jobs
    • Report generation jobs
    • Cleanup and archival jobs
    • Third-party data pull jobs (payments reconciliation, CRM sync)
  • [ ] Set heartbeat windows tightly — a daily job should have a 30-minute to 2-hour window, not a 24-hour window that gives you no useful detection time

Implementation with Vigilmon heartbeats (bash):

#!/bin/bash
# Fail the pipeline if pg_dump fails, not only if gzip fails
set -o pipefail

# Your backup script
backup_file="backup_$(date +%Y%m%d).gz"

# Ping the heartbeat only if the whole pipeline succeeded
if pg_dump mydb | gzip > "$backup_file"; then
  curl -fsS --retry 3 "https://vigilmon.online/heartbeat/YOUR_UNIQUE_TOKEN" > /dev/null
else
  echo "Backup failed; heartbeat NOT sent" | mail -s "BACKUP FAILED" ops@yourcompany.com
fi

  • [ ] Configure each heartbeat to alert your on-call channel, not just email
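
On the monitoring side, a heartbeat check boils down to comparing the last ping time against the expected interval plus a grace window. The sketch below illustrates that comparison under stated assumptions: heartbeatStatus and its field names are hypothetical, and it is not a description of Vigilmon's implementation.

```javascript
// Hypothetical sketch of a heartbeat deadline check.
// All times are milliseconds since the Unix epoch.
function heartbeatStatus({ lastPingAt, intervalMs, graceMs }, now = Date.now()) {
  // The next ping is due one interval after the last one, plus the grace window.
  const deadline = lastPingAt + intervalMs + graceMs;
  return {
    overdue: now > deadline,
    deadline,
    lateByMs: Math.max(0, now - deadline),
  };
}
```

For a daily job with a 1-hour grace window, a ping that is 24.5 hours old is still on time, while one that is 25.5 hours old is overdue by 30 minutes. A tighter grace window shortens the time between the failure and the alert.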

Category 5: API Latency and Response Time Baselines

Performance degradation is a form of outage. A service that responds in 8 seconds instead of 200ms is technically "up" according to a simple success/failure check — but it's delivering a broken user experience.

Checklist

  • [ ] Establish response time baselines for your critical endpoints — measure p50 and p95 over a week of normal traffic before setting alert thresholds
  • [ ] Set latency alert thresholds — a common starting point: alert if p95 exceeds 2x your baseline p95
  • [ ] Monitor response time trends, not just instantaneous values — a slow 10% week-over-week degradation is a growing problem, not a spike
  • [ ] Track response time by region — latency spikes in one region but not others indicate a CDN or routing issue, not an application issue
  • [ ] Alert on sustained degradation, not single slow responses — a 2-second response at 3am is different from a 2-second p95 across 1000 requests during peak
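
The baseline and threshold items above come down to simple arithmetic over a window of response times. Here is a minimal sketch, assuming you already collect response times in milliseconds; percentile and isLatencyDegraded are hypothetical helper names, and the nearest-rank method is one of several common percentile definitions.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Alert when the recent p95 exceeds `factor` times the baseline p95.
// Judging a whole window means one slow response cannot trigger the alert by itself.
function isLatencyDegraded(recentMs, baselineP95Ms, factor = 2) {
  return percentile(recentMs, 95) > factor * baselineP95Ms;
}
```

With a baseline p95 of 200 ms, a window of twenty requests containing a single 5-second outlier does not alert, because the p95 of that window is still a normal value. A window where most requests take 500 ms does.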

Vigilmon tracks response time history with period selectors (24h, 7d, 30d) for every HTTP monitor. You can see response time trends over time and compare recent behavior to historical baselines.


Category 6: Alert Channels and Escalation

Monitoring is useless if alerts go unread. Alert channel hygiene is as important as monitor coverage.

Checklist

  • [ ] Slack/Teams integration: Route alerts to a dedicated #incidents or #alerts channel — not #general
  • [ ] Email backup: Configure email alerts as a backup to Slack, not primary — email has higher latency for on-call response
  • [ ] Avoid alert fatigue: If a monitor sends more than 2–3 alerts per week that turn out to be false positives, tune the thresholds or check interval. Alert fatigue kills incident response time.
  • [ ] Test your alert channels: Deliberately trigger a test alert and verify it arrives in the configured channels before relying on them in production
  • [ ] Document your incident response procedure: What does the on-call engineer do when an uptime alert fires? Write it down. Undocumented procedures add minutes to incident response.
  • [ ] Set up escalation for critical monitors: If the primary alert channel is a Slack channel that's noisy, add a separate high-priority alert to PagerDuty, Opsgenie, or direct Slack DM for your most critical monitors
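
If you route alerts yourself rather than relying on a tool's built-in integrations, the escalation and fatigue rules above can be expressed as a small routing table. Everything in this sketch is an assumption for illustration: the severity levels, the channel names, the message format, and the weekly false-positive budget of 3. The one external fact it relies on is that Slack incoming webhooks accept a JSON body with a text field.

```javascript
// Hypothetical routing table: which channels receive an alert at each severity.
const ROUTES = {
  critical: ['pager', 'slack', 'email'],
  warning: ['slack', 'email'],
  info: ['slack'],
};

// Unknown severities are treated as warnings rather than dropped.
function channelsFor(severity) {
  return ROUTES[severity] ?? ROUTES.warning;
}

// Build the JSON body for a Slack incoming webhook.
function buildSlackPayload(alert) {
  const level = alert.severity.toUpperCase();
  return { text: `[${level}] ${alert.monitor} is ${alert.state}: ${alert.reason}` };
}

// Flag a monitor for threshold tuning once its weekly false positives exceed the budget.
function needsTuning(falsePositivesThisWeek, budget = 3) {
  return falsePositivesThisWeek > budget;
}

// Post the alert to a Slack webhook URL (Node 18+ global fetch).
async function sendToSlack(webhookUrl, alert) {
  const res = await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildSlackPayload(alert)),
  });
  if (!res.ok) throw new Error(`Slack webhook returned ${res.status}`);
}
```

Keeping the routing table in one place makes it easy to review which monitors can page someone and which only post to a channel.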

Category 7: Status Page for Users

Internal monitoring tells your team when something is wrong. A public status page tells your users. Both matter.

Checklist

  • [ ] Create a public status page that reflects the health of your main user-facing services
  • [ ] Update it during incidents — a status page that says "all systems operational" during a known outage destroys user trust faster than the outage itself
  • [ ] Link the status page from your docs, dashboard, and support channels — it should be the first thing users check before filing a support ticket
  • [ ] Post incident updates during active incidents: acknowledge, update on investigation, update on resolution, post a timeline after resolution

Vigilmon includes a public status page for all monitors. It automatically reflects real-time probe results and lets you post manual incident updates.


The Pre-Ship Monitoring Checklist

Use this as a gate before your next production deployment:

Before shipping:

  • [ ] Uptime monitor exists for all new endpoints (root domain + health + critical routes)
  • [ ] SSL certificate monitoring is configured for any new domains
  • [ ] All new cron jobs have heartbeat monitors
  • [ ] Alert channels have been tested
  • [ ] Health endpoint returns a non-2xx code (typically 503) when dependencies are down

Within 24 hours of shipping:

  • [ ] Verify monitors are running against real traffic and have produced no false-positive alerts in the first 24 hours
  • [ ] Confirm response time baselines are being captured
  • [ ] Status page reflects new services if customer-facing

Implementation Time Estimate

If you're starting from zero, here's an honest time estimate for this checklist:

| Category | Time |
|---|---|
| Set up Vigilmon account + add HTTP monitors | 15 minutes |
| Add health endpoint to your application | 30–60 minutes |
| Configure SSL expiry monitoring | Already included for HTTPS monitors |
| Add cron heartbeats (1 job) | 10 minutes |
| Configure Slack and email alert channels | 10 minutes |
| Enable public status page | 5 minutes |

Total: ~1–2 hours for a complete baseline setup. Most of the time is the health endpoint if you don't already have one.


Summary

Production outages are expensive and preventable. The monitoring categories in this checklist — uptime probes, health endpoints, SSL expiry, cron heartbeats, latency baselines, alert channels, and a public status page — cover the failure modes that appear most commonly in incident post-mortems.

Each category takes minutes to configure, not days. The cost of not having this monitoring isn't the setup time you avoided — it's the silent outage that runs for two hours before a user tells you.

Get started at vigilmon.online — free tier includes 5 monitors with 1-minute intervals, cron heartbeats, multi-region consensus, and a public status page.


Tags: #devops #monitoring #checklist #sre #uptime #infrastructure #webdev
