The Complete Production Readiness Monitoring Checklist 2026

Going to production is a moment of commitment. Once users are hitting your system, every failure is a user experience problem, and every missed alert is an incident you'll hear about from a customer before you hear about it from your own tools.

The monitoring layer is one of the most commonly incomplete parts of a production readiness review. Teams that ship solid code, pass CI, and clear code review sometimes skip the monitoring verification entirely — or configure it weeks after launch when the first major incident makes its absence obvious.

This checklist covers everything your monitoring setup should include before a production deployment, organized by category so you can work through it systematically.


Before You Start: What Production Monitoring Actually Covers

Production monitoring isn't just "is the site up?" It spans several distinct concerns:

  • Uptime: Is the service reachable and responding with success codes?
  • Performance: Are response times within acceptable bounds?
  • SSL/TLS: Are certificates valid and not approaching expiry?
  • APIs: Are your API endpoints returning correct responses?
  • Databases: Is the database reachable, and are backups completing and restorable?
  • Background jobs: Are cron tasks, workers, and pipelines completing successfully?
  • Alerting pipelines: Do alerts actually reach the right people?
  • Status communication: Do customers and stakeholders know when things are degraded?

Each concern has its own verification. Work through all of them before you deploy to production, not after.


Part 1: Uptime Monitoring

Core HTTP/HTTPS Checks

  • [ ] External uptime monitor is configured for your primary production domain
  • [ ] Monitor checks the final HTTPS URL directly (not an HTTP URL that only redirects to it)
  • [ ] Check interval is appropriate for your SLO (≤5 minutes for a 99.9% availability target; 1 minute for critical services)
  • [ ] Monitor checks from multiple geographic locations (not a single probe)
  • [ ] Alerting requires consensus from multiple probes before firing (prevents false positives from single-probe failures)
  • [ ] HTTP status code assertion is configured (2xx expected; alert on 4xx/5xx)
  • [ ] Response content validation is set if you monitor a specific endpoint (optional but recommended for health endpoints)
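
The multi-probe consensus rule above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular monitoring tool implements it: the `ProbeResult` type, the region labels, and the default quorum of 2 are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    region: str                 # hypothetical region label, e.g. "eu-west"
    status_code: Optional[int]  # None means the probe could not connect


def probe_failed(result: ProbeResult) -> bool:
    """A probe fails if it could not connect or received a non-2xx status."""
    if result.status_code is None:
        return True
    return not (200 <= result.status_code < 300)


def should_alert(results: list, quorum: int = 2) -> bool:
    """Fire only when at least `quorum` probes agree that the check failed."""
    failures = sum(1 for r in results if probe_failed(r))
    return failures >= quorum
```

With three probes and a quorum of 2, a single region with a network problem does not page anyone, while a real outage seen from two or more regions does.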

Health Endpoints

  • [ ] /health or /api/health endpoint exists and returns 200 when the service is healthy
  • [ ] Health endpoint response time is under 100ms under normal load
  • [ ] Health endpoint checks critical dependencies (database connectivity, cache availability, queue reachability)
  • [ ] Health endpoint returns 503 when dependencies are unhealthy (not 200 with error body)
  • [ ] If Kubernetes: separate /health/live (liveness) and /health/ready (readiness) probes are configured
  • [ ] Health endpoint path is documented in your runbook and monitoring configuration
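
As a rough sketch of a health endpoint that follows the rules above, here is a standard-library Python version. The `check_database` and `check_cache` functions are hypothetical placeholders (replace them with real connectivity checks), and the port is an assumption. The key behavior is that any failing dependency turns the response into a 503 rather than a 200 with an error body.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database() -> bool:
    # Hypothetical placeholder: run a cheap query such as SELECT 1 here.
    return True


def check_cache() -> bool:
    # Hypothetical placeholder: ping the cache here.
    return True


CHECKS = {"database": check_database, "cache": check_cache}


def run_checks(checks):
    """Run each check; a raised exception counts as unhealthy."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    body = {"status": "ok" if status == 200 else "unhealthy", "checks": results}
    return status, body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = run_checks(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the health server (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint locally. Keep the individual checks cheap so the endpoint stays fast under load.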

Alert Configuration

  • [ ] Alert fires after two consecutive missed checks (not on a single transient failure)
  • [ ] Alert notifications go to the correct channel (on-call rotation, Slack engineering channel, or PagerDuty)
  • [ ] Alert recovery notification is configured (so you know when the service comes back up)
  • [ ] Alert silence/acknowledge workflow is understood by the on-call team
  • [ ] Test alert has been fired and confirmed received before launch
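
The firing and recovery behavior described above amounts to a small state machine. The sketch below is one way to express it, assuming a threshold of two consecutive failures; the `AlertState` class and its return values are illustrative names, not part of any tool.

```python
from typing import Optional


class AlertState:
    """Tracks consecutive failures for one monitor and reports state changes."""

    def __init__(self, failure_threshold: int = 2):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.alerting = False

    def record(self, check_passed: bool) -> Optional[str]:
        """Return "alert" or "recovered" on a state change, otherwise None."""
        if check_passed:
            self.consecutive_failures = 0
            if self.alerting:
                self.alerting = False
                return "recovered"
            return None
        self.consecutive_failures += 1
        if not self.alerting and self.consecutive_failures >= self.failure_threshold:
            self.alerting = True
            return "alert"
        return None
```

A single failed check followed by a success produces no notification. Two failures in a row produce one alert, and the next success produces one recovery notification.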

Part 2: SSL Certificate Monitoring

SSL certificate expiry is one of the most avoidable production incidents. Certificates expire on a fixed date that's known months in advance — and yet expired SSL certificates take down production services regularly because no one set up expiry monitoring.

  • [ ] SSL certificate expiry monitoring is enabled for all production domains
  • [ ] Warning alert is configured at 30 days before expiry
  • [ ] Critical alert is configured at 7 days before expiry
  • [ ] Certificate auto-renewal is configured (Let's Encrypt + Certbot, AWS ACM auto-renew, or equivalent)
  • [ ] If auto-renewal: verify it's actually working, not just configured (test the renewal manually or check renewal logs)
  • [ ] Certificate covers all domains served by the application (including subdomains and www variants)
  • [ ] Wildcard certificates: verify the wildcard covers all expected subdomains
  • [ ] Certificate chain is complete (intermediate certificates installed; verify with an SSL checker)
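
If you want to check expiry yourself, the following Python sketch uses only the standard library. The function names are illustrative, and the 30-day and 7-day thresholds mirror the checklist above. Because it uses the default TLS context, `fetch_cert_expiry` also raises an error when the chain or hostname cannot be validated.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_cert_expiry(hostname: str, port: int = 443, timeout: float = 5.0) -> datetime:
    """Connect over TLS and return the certificate's notAfter time in UTC."""
    context = ssl.create_default_context()  # validates chain and hostname
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def classify_expiry(expires_at: datetime, now: datetime,
                    warning_days: int = 30, critical_days: int = 7) -> str:
    """Map the remaining lifetime to "ok", "warning", "critical", or "expired"."""
    days_left = (expires_at - now).days
    if days_left < 0:
        return "expired"
    if days_left <= critical_days:
        return "critical"
    if days_left <= warning_days:
        return "warning"
    return "ok"
```

A scheduled job could call `classify_expiry(fetch_cert_expiry("example.com"), datetime.now(timezone.utc))` for each production domain and alert on anything other than "ok".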

Part 3: API Endpoint Monitoring

Public API Endpoints

  • [ ] Your primary public API endpoint is monitored (not just the website root)
  • [ ] Authentication endpoint is checked (login/token issuance confirms auth infrastructure is working)
  • [ ] Core data endpoints are verified — at minimum one read and one write path if available
  • [ ] API version header is checked if your API is versioned (detect version routing failures)
  • [ ] Response time thresholds are set for critical API paths

API Health Check Best Practices

  • [ ] API health endpoint returns dependency status (database, cache, queue, external APIs)
  • [ ] API returns structured error responses (JSON with error code and message — not bare 500 HTML)
  • [ ] Rate limiting is in place and tested (returns 429, not 500 under load)
  • [ ] CORS configuration is correct for browser-facing APIs (verify from expected origins)
  • [ ] API documentation is accessible and reflects current endpoints

Third-Party API Dependencies

  • [ ] Critical third-party API dependencies are identified (payment processor, auth provider, email delivery, storage)
  • [ ] External uptime monitor checks each critical dependency's public status endpoint
  • [ ] Third-party status page alerts are subscribed for each dependency
  • [ ] Graceful degradation is implemented and tested for each critical dependency failure
  • [ ] Timeout and retry logic is configured for all external API calls
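
Timeout and retry logic for external calls might look like the sketch below. The attempt count, base delay, and 3-second timeout are assumptions to tune per dependency, and `call_with_retries` is an illustrative helper rather than a library function.

```python
import random
import time
import urllib.request


def call_with_retries(operation, attempts: int = 3, base_delay: float = 0.5,
                      sleep=time.sleep):
    """Run `operation`, retrying on OSError with exponential backoff and jitter.

    urllib's URLError, HTTPError, and socket timeouts are all OSError
    subclasses, so they are retried. Narrow the exception type for calls
    that are not safe to repeat.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: let the caller degrade gracefully
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))


def fetch_status(url: str, timeout: float = 3.0) -> int:
    """Fetch a URL with an explicit timeout and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status
```

For example, `call_with_retries(lambda: fetch_status("https://api.example.com/health"))` gives a dependency three chances before the error reaches your fallback path. The URL here is a placeholder.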

Part 4: Database Monitoring

Connectivity and Availability

  • [ ] Health endpoint verifies database connectivity (not just application availability)
  • [ ] Database connection pool metrics are accessible, and the pool is sized so it is not exhausted under production load
  • [ ] Read replica availability is checked separately if you use read replicas
  • [ ] Database alert fires if health endpoint reports database unreachable

Backup Verification

  • [ ] Automated database backups are running and verified to complete
  • [ ] Backup completion is monitored via heartbeat (backup job sends ping on success; alert if no ping within window)
  • [ ] Last backup age is monitored (alert if backup is older than expected interval)
  • [ ] Restore procedure is documented and has been tested (a backup you can't restore is not a backup)
  • [ ] Backup storage location is separate from primary infrastructure (off-site or different cloud region)
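
The backup age check is simple arithmetic. A minimal sketch, assuming you can read the timestamp of the last successful backup from somewhere; the 30-minute grace period is an assumption.

```python
from datetime import datetime, timedelta, timezone


def backup_is_stale(last_backup_at: datetime, now: datetime,
                    expected_interval: timedelta,
                    grace: timedelta = timedelta(minutes=30)) -> bool:
    """True when the newest backup is older than the interval plus a grace period."""
    return now - last_backup_at > expected_interval + grace
```

For a daily backup, `backup_is_stale(last, datetime.now(timezone.utc), timedelta(hours=24))` returns True once the newest backup is more than 24.5 hours old.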

Performance Baselines

  • [ ] Baseline query response times are documented at expected production load
  • [ ] Slow query logging is enabled with a threshold appropriate to your application
  • [ ] Database disk usage alert fires well before storage is full (for example, at 80% usage)

Part 5: Background Jobs and Scheduled Tasks

Background jobs are the monitoring blind spot in most applications. HTTP endpoint checks confirm your API is up; they say nothing about whether your cron jobs ran, your email queue is draining, or your data pipeline completed.

Heartbeat Monitoring Setup

  • [ ] Every scheduled cron job sends an HTTP ping on successful completion
  • [ ] Heartbeat monitor is configured for each job with an appropriate alert window (e.g., job runs every hour → alert if no ping in 90 minutes)
  • [ ] Heartbeat alert fires to the same channel as other infrastructure alerts
  • [ ] Job failure (non-zero exit code) does NOT send the heartbeat ping (so the missing ping triggers the alert)
  • [ ] Each job's heartbeat URL is documented in the job's configuration or comments
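
One way to get "ping only on success" is a small wrapper around the job command. The sketch below is an assumption about how you might wire it up: the heartbeat URL format depends on your monitoring tool, and `run_with_heartbeat` and `alert_window` are illustrative names.

```python
import subprocess
import urllib.request
from datetime import timedelta


def send_ping(url: str, timeout: float = 10.0) -> None:
    """Send the heartbeat as a plain HTTP GET."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def run_with_heartbeat(command: list, heartbeat_url: str, ping=send_ping) -> int:
    """Run `command`; ping only when it exits with status 0."""
    result = subprocess.run(command)
    if result.returncode == 0:
        try:
            ping(heartbeat_url)
        except OSError:
            # The job itself succeeded. A failed ping will surface as a
            # missed heartbeat on the monitoring side.
            pass
    return result.returncode


def alert_window(interval: timedelta, slack_factor: float = 1.5) -> timedelta:
    """Alert window for a job interval, e.g. hourly job -> 90 minutes."""
    return interval * slack_factor
```

Because a non-zero exit code skips the ping, a failing job and a job that never ran look the same to the monitor: both result in a missed heartbeat and an alert.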

Jobs That Need Heartbeat Monitoring

Consider adding heartbeat monitoring to:

  • [ ] Database backup job
  • [ ] Data import/export pipelines
  • [ ] Report generation jobs
  • [ ] Cache warming jobs
  • [ ] Email/notification queue processors
  • [ ] Cleanup and maintenance jobs (log rotation, soft-delete purging, token expiry cleanup)
  • [ ] Billing and payment retry jobs
  • [ ] Search index rebuild jobs
  • [ ] Media processing workers (image resizing, video transcoding)
  • [ ] External data sync jobs (CRM sync, analytics export, webhook replay)

Queue Depth Monitoring

If your application uses a message queue or job queue:

  • [ ] Queue depth is monitored (alert if queue grows beyond expected bounds)
  • [ ] Worker health is monitored (alert if worker pool drops to zero)
  • [ ] Alert fires when the dead letter queue or failed job count grows
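
The three queue checks above can be evaluated from periodic snapshots of queue metrics. This sketch assumes you can read depth, active worker count, and dead letter count from your queue system; the `QueueSnapshot` type and alert names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSnapshot:
    depth: int              # messages waiting in the queue
    active_workers: int     # workers currently consuming
    dead_letter_count: int  # messages in the dead letter queue


def queue_alerts(current: QueueSnapshot, previous: QueueSnapshot,
                 max_depth: int) -> list:
    """Return the names of the alert conditions that currently hold."""
    alerts = []
    if current.depth > max_depth:
        alerts.append("queue_depth_exceeded")
    if current.active_workers == 0:
        alerts.append("no_active_workers")
    if current.dead_letter_count > previous.dead_letter_count:
        alerts.append("dead_letter_growing")
    return alerts
```

Choose `max_depth` from observed behavior at normal load, since a fixed number that is too low will page on every traffic spike.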

Part 6: Alerting Pipelines

A monitoring system that generates alerts nobody sees is not a monitoring system — it's a log. Verify the full alert pipeline end-to-end before launch.

Alert Routing

  • [ ] On-call rotation is configured with real people assigned to real shifts
  • [ ] P1 (complete outage) alerts page the on-call engineer immediately (phone/SMS)
  • [ ] P2 (degraded service) alerts go to Slack engineering channel and on-call
  • [ ] P3 (performance degradation) alerts go to Slack engineering channel
  • [ ] Certificate expiry and background job failures route to appropriate team
  • [ ] After-hours alerts for non-P1 issues don't wake the entire team (noise is itself an incident)
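
The routing rules above can be written down as a small table, which makes them easy to review and test. The channel names below are hypothetical, and treating an unknown severity as P1 is an assumption chosen so that a mislabeled alert is never silently dropped.

```python
# Hypothetical channel names; map them to your real integrations.
ROUTES = {
    "P1": ["page_on_call"],                          # complete outage
    "P2": ["slack_engineering", "notify_on_call"],   # degraded service
    "P3": ["slack_engineering"],                     # performance degradation
}


def route_alert(severity: str) -> list:
    """Return the channels for a severity; unknown severities are treated as P1."""
    return list(ROUTES.get(severity, ROUTES["P1"]))


def pages_someone(severity: str) -> bool:
    """True only when the route includes an immediate page."""
    return "page_on_call" in route_alert(severity)
```

Only P1 includes a paging channel, so P2 and P3 alerts raised after hours reach Slack or the on-call engineer's notifications without waking the whole team.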

Alert Pipeline Verification

  • [ ] Test alert has been sent through the full pipeline (monitor → webhook → PagerDuty/OpsGenie → on-call phone)
  • [ ] Alert recovery notification is confirmed working (resolved state clears the incident)
  • [ ] Escalation path is documented: who gets called if primary on-call doesn't respond?
  • [ ] Runbook links are included in alert bodies (engineer getting paged at 3 AM needs context)
  • [ ] On-call engineer knows how to silence, acknowledge, and resolve alerts

Integration Health

  • [ ] Monitoring webhook is authenticated (signed secret or auth token — not bare URL)
  • [ ] Webhook delivery failures are logged and alerted
  • [ ] If using PagerDuty/OpsGenie: service is configured for deduplication (one incident per outage, not one per check)
  • [ ] Incident auto-resolution is configured when monitoring confirms recovery
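
For webhook authentication, a common pattern is an HMAC signature over the request body. The sketch below assumes HMAC-SHA256 with a shared secret and a hex-encoded signature; real providers differ in header names and encoding, so check your tool's documentation. The `dedup_key` helper is likewise an illustrative way to build one stable key per monitor.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Compare signatures in constant time to avoid timing leaks."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, received_signature)


def dedup_key(monitor_id: str) -> str:
    """Stable key per monitor so repeated failed checks update one incident."""
    return f"monitor-{monitor_id}"
```

Verify the signature against the raw bytes of the body before parsing the JSON, and send the same deduplication key on both the trigger and the resolve event so recovery closes the right incident.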

Part 7: Status Pages

A status page answers the question users and customers ask during an outage before they contact support: "Is this a me problem or a them problem?"

Status Page Basics

  • [ ] Status page is accessible at a predictable URL (status.yourdomain.com or similar)
  • [ ] Status page is hosted independently of your production infrastructure (it shouldn't go down when your app goes down)
  • [ ] Status page reflects current production status (not manually updated — connected to your monitoring)
  • [ ] Status page shows uptime history (30/60/90 days makes SLA conversations concrete)
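
To make the uptime history concrete, it helps to convert a percentage target into a downtime budget. The arithmetic below is a minimal sketch; the function names are illustrative.

```python
def allowed_downtime_minutes(target_percent: float, window_days: int) -> float:
    """Downtime budget in minutes for an availability target over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - target_percent / 100)


def uptime_percent(total_checks: int, failed_checks: int) -> float:
    """Uptime as the share of successful checks, in percent."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return 100 * (total_checks - failed_checks) / total_checks
```

A 99.9% target over 30 days allows about 43.2 minutes of downtime. With a 5-minute check interval and an alert after two missed checks, detection alone can consume roughly 10 of those minutes, which is why critical services use shorter intervals.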

Incident Communication

  • [ ] Process for posting incident updates during an outage is documented
  • [ ] Team knows who is responsible for updating the status page during incidents
  • [ ] Status page subscriber notifications are configured (users can opt in to email updates)
  • [ ] Post-incident policy states how quickly after resolution a post-mortem will be published

Part 8: Performance Baselines

Monitoring catches outages. Performance baselines let you catch degradation before it becomes an outage.

  • [ ] Response time baseline is documented for critical endpoints at expected production load
  • [ ] Response time monitoring is enabled in your uptime tool (each check records its response time)
  • [ ] Alert threshold is set for response time SLO violation (e.g., alert when P95 > 500ms for 3 consecutive checks)
  • [ ] Response time history is accessible for trend analysis (detect slow drift upward over days/weeks)
  • [ ] Load test has been run at expected peak traffic to identify performance ceiling before users find it
  • [ ] Memory and CPU usage baselines are documented for your production instances
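
The "P95 above threshold for 3 consecutive checks" rule can be sketched as follows. This version assumes each evaluation window is a list of response times in milliseconds and uses the nearest-rank percentile method; other tools may interpolate and give slightly different values.

```python
import math


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_violated(windows: list, threshold_ms: float = 500,
                 consecutive: int = 3) -> bool:
    """True if each of the last `consecutive` windows has P95 above the threshold."""
    if len(windows) < consecutive:
        return False
    recent = windows[-consecutive:]
    return all(percentile(w, 95) > threshold_ms for w in recent)
```

Requiring several consecutive violations filters out one-off slow windows while still catching sustained degradation.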

Part 9: Pre-Launch Verification Run

Before you flip the production DNS or enable public traffic, run through this final verification sequence:

  1. Trigger a test failure: take a test endpoint offline and confirm the alert fires within the expected interval
  2. Verify alert delivery: confirm the alert reaches every configured channel (email, Slack, PagerDuty)
  3. Confirm recovery notification: bring the test endpoint back up and verify the "resolved" notification fires
  4. Test a heartbeat miss: stop a job from sending its heartbeat ping and verify the alert fires at the expected window boundary
  5. Check SSL validity: run your production domain through an SSL checker; confirm the chain is complete and expiry is tracked
  6. Confirm backup completion: verify the most recent database backup completed successfully and the heartbeat fired
  7. Walk the on-call flow: the engineer going on-call first should trigger a test alert and receive it on their own device to confirm the full pipeline

Monitoring Tool Recommendation

For the uptime and heartbeat monitoring layer, Vigilmon covers most of what this checklist requires:

  • Multi-region consensus alerting for all HTTP/HTTPS and TCP monitors (free tier and paid)
  • Heartbeat monitoring for background jobs and cron tasks
  • Response time history with color-coded latency bands
  • Webhook notifications that integrate with PagerDuty, OpsGenie, Slack, or any custom endpoint
  • REST API for programmatic monitor management in deployment pipelines
  • Status badge and basic status page

The free tier covers up to 5 monitors with full multi-region consensus — sufficient for a small production environment and a starting point for larger ones.

Get started at vigilmon.online — no credit card required, monitoring configured in under 5 minutes.


Summary Checklist

Use this abbreviated list as a final pass before launch:

Uptime

  • [ ] External uptime monitor active on primary production domain
  • [ ] Multi-region / consensus-based alerting (not single probe)
  • [ ] Health endpoint exists, responds 200, checks dependencies

SSL

  • [ ] Certificate expiry monitored with a 30-day warning and a 7-day critical alert
  • [ ] Auto-renewal configured and verified

APIs

  • [ ] Primary API endpoints monitored
  • [ ] Third-party dependencies monitored
  • [ ] Graceful degradation implemented for critical dependencies

Databases

  • [ ] Health endpoint checks DB connectivity
  • [ ] Backup job monitored via heartbeat
  • [ ] Restore procedure tested

Background Jobs

  • [ ] All cron jobs and workers send heartbeat pings on success
  • [ ] Heartbeat alert windows are configured

Alerting

  • [ ] On-call rotation is configured with real people
  • [ ] Alert routes verified end-to-end with a test fire
  • [ ] Escalation path documented

Status Page

  • [ ] Status page live on independent infrastructure
  • [ ] Incident update process documented
