The Complete Production Readiness Monitoring Checklist 2026

Going to production is a moment of commitment. Once users are hitting your system, every failure is a user experience problem, and every missed alert is an incident you'll hear about from a customer before you hear about it from your own tools.

The monitoring layer is one of the most commonly incomplete parts of a production readiness review. Teams that ship solid code, pass CI, and clear code review sometimes skip the monitoring verification entirely — or configure it weeks after launch when the first major incident makes its absence obvious.

This checklist covers everything your monitoring setup should include before a production deployment, organized by category so you can work through it systematically.

Before You Start: What Production Monitoring Actually Covers

Production monitoring isn't just "is the site up?" It spans several distinct concerns:

Uptime: Is the service reachable and responding with success codes?
Performance: Are response times within acceptable bounds?
SSL/TLS: Are certificates valid and not approaching expiry?
APIs: Are your API endpoints returning correct responses?
Background jobs: Are cron tasks, workers, and pipelines completing successfully?
Alerting pipelines: Do alerts actually reach the right people?
Status communication: Do customers and stakeholders know when things are degraded?

Each concern has its own verification. Work through all of them before you deploy to production, not after.

Part 1: Uptime Monitoring

Core HTTP/HTTPS Checks

[ ] External uptime monitor is configured for your primary production domain
[ ] Monitor targets the final HTTPS URL directly (not an HTTP URL that only passes because it redirects)
[ ] Check interval is appropriate for your SLO (≤5 minutes for a 99.9% availability target; 1 minute for critical services)
[ ] Monitor checks from multiple geographic locations (not a single probe)
[ ] Alerting requires consensus from multiple probes before firing (prevents false positives from single-probe failures)
[ ] HTTP status code assertion is configured (2xx expected; alert on 4xx/5xx)
[ ] Response content validation is set if you monitor a specific endpoint (optional but recommended for health endpoints)
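
The consensus rule in this list is easy to get subtly wrong, so here is a minimal sketch of the idea in Python. The `ProbeResult` shape, the region names, and the default quorum of 2 are assumptions made for illustration; your monitoring tool will have its own data model and its own consensus setting.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one check from one probe location (hypothetical shape)."""
    region: str
    status_code: Optional[int]  # None means a timeout or connection failure


def is_failure(result: ProbeResult) -> bool:
    """A check fails when there is no response or the status is not 2xx."""
    return result.status_code is None or not (200 <= result.status_code < 300)


def should_alert(results: List[ProbeResult], quorum: int = 2) -> bool:
    """Alert only when at least `quorum` probes agree that the check failed."""
    failures = sum(1 for r in results if is_failure(r))
    return failures >= quorum


# One failing region out of three is treated as a probe-side problem.
single = [ProbeResult("us-east", 200), ProbeResult("eu-west", 503), ProbeResult("ap-south", 200)]
print(should_alert(single))  # False

# Two failing regions reach the quorum and fire the alert.
double = [ProbeResult("us-east", None), ProbeResult("eu-west", 503), ProbeResult("ap-south", 200)]
print(should_alert(double))  # True
```

A quorum of 2 out of 3 is a common starting point; raising it reduces false positives further at the cost of missing regional outages.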

Health Endpoints

[ ] /health or /api/health endpoint exists and returns 200 when the service is healthy
[ ] Health endpoint response time is under 100ms under normal load
[ ] Health endpoint checks critical dependencies (database connectivity, cache availability, queue reachability)
[ ] Health endpoint returns 503 when dependencies are unhealthy (not 200 with error body)
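[ ] If Kubernetes: separate /health/live (liveness) and /health/ready (readiness) probes are configured, with dependency checks on the readiness probe only so a dependency outage does not cause restart loops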
[ ] Health endpoint path is documented in your runbook and monitoring configuration
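
To make the 200-versus-503 rule concrete, here is a framework-neutral sketch of the logic behind a health endpoint. The function name `build_health_response`, the check names, and the JSON body layout are hypothetical; the only behavior taken from the checklist is that any unhealthy dependency produces a 503 rather than a 200 with an error body.

```python
import json
from typing import Callable, Dict, Tuple


def build_health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each dependency check and return (HTTP status, JSON body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "unhealthy"
        except Exception:
            # A check that raises is treated as an unhealthy dependency.
            results[name] = "unhealthy"
    healthy = all(state == "ok" for state in results.values())
    status = 200 if healthy else 503
    body = json.dumps({"status": "ok" if healthy else "degraded", "checks": results})
    return status, body


def database_down() -> bool:
    """Stand-in for a real connectivity check (hypothetical)."""
    raise ConnectionError("database unreachable")


print(build_health_response({"database": lambda: True, "cache": lambda: True})[0])  # 200
print(build_health_response({"database": database_down, "cache": lambda: True})[0])  # 503
```

Your web framework's route handler would call a function like this and copy the status and body into the response. Keep each check cheap and give it a short timeout so the endpoint itself stays fast.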

Alert Configuration

[ ] Alert fires after two consecutive missed checks (not on a single transient failure)
[ ] Alert notifications go to the correct channel (on-call rotation, Slack engineering channel, or PagerDuty)
[ ] Alert recovery notification is configured (so you know when the service comes back up)
[ ] Alert silence/acknowledge workflow is understood by the on-call team
[ ] Test alert has been fired and confirmed received before launch
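
The "two missed checks" and "recovery notification" items describe a small state machine. The sketch below shows one way it might work; the class name `AlertState` and the event strings are invented for illustration and do not correspond to any particular tool.

```python
from typing import Optional


class AlertState:
    """Tracks consecutive failures for one monitor (illustrative only)."""

    def __init__(self, failure_threshold: int = 2) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.alerting = False

    def record(self, check_passed: bool) -> Optional[str]:
        """Record one check result; return "alert", "recovered", or None."""
        if check_passed:
            self.consecutive_failures = 0
            if self.alerting:
                self.alerting = False
                return "recovered"
            return None
        self.consecutive_failures += 1
        if not self.alerting and self.consecutive_failures >= self.failure_threshold:
            self.alerting = True
            return "alert"
        return None


state = AlertState(failure_threshold=2)
events = [state.record(ok) for ok in [True, False, True, False, False, False, True]]
print(events)  # [None, None, None, None, 'alert', None, 'recovered']
```

Note that the single failure early in the sequence never fires, the alert fires once rather than on every failing check, and the first passing check after an alert produces the recovery event.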

Part 2: SSL Certificate Monitoring

SSL certificate expiry is one of the most avoidable production incidents. Certificates expire on a fixed date that's known months in advance — and yet expired SSL certificates take down production services regularly because no one set up expiry monitoring.

[ ] SSL certificate expiry monitoring is enabled for all production domains
[ ] Warning alert is configured at 30 days before expiry
[ ] Critical alert is configured at 7 days before expiry
[ ] Certificate auto-renewal is configured (Let's Encrypt + Certbot, AWS ACM auto-renew, or equivalent)
[ ] If auto-renewal: verify it's actually working, not just configured (test the renewal manually or check renewal logs)
[ ] Certificate covers all domains served by the application (including subdomains and www variants)
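[ ] Wildcard certificates: verify the wildcard covers all expected subdomains (a wildcard such as *.example.com matches a single label only, so it covers neither the apex domain nor nested names like a.b.example.com)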
[ ] Certificate chain is complete (intermediate certificates installed; verify with an SSL checker)
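
If you want to see what an expiry check does under the hood, the following sketch uses only the Python standard library. The 30-day and 7-day thresholds come from the checklist above; the function names and the returned labels are assumptions for illustration, and a hosted monitor will normally do this for you.

```python
import socket
import ssl


def classify_expiry(not_after_epoch: float, now_epoch: float,
                    warn_days: int = 30, critical_days: int = 7) -> str:
    """Map the time left on a certificate to an alert level."""
    days_left = (not_after_epoch - now_epoch) / 86400
    if days_left <= 0:
        return "expired"
    if days_left <= critical_days:
        return "critical"
    if days_left <= warn_days:
        return "warning"
    return "ok"


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> float:
    """Return the certificate's notAfter time as seconds since the epoch.

    The default context verifies the chain and hostname, so an expired or
    mismatched certificate raises ssl.SSLCertVerificationError here instead
    of returning a value. Treat that exception as a critical result.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    return ssl.cert_time_to_seconds(cert["notAfter"])


# Pure classification example with fixed timestamps (no network needed).
now = 1_700_000_000.0
print(classify_expiry(now + 20 * 86400, now))  # warning
```

In a scheduled job you would call `fetch_not_after("yourdomain.com")`, pass the result and `time.time()` to `classify_expiry`, and route anything other than "ok" to your alerting channel.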

Part 3: API Endpoint Monitoring

Public API Endpoints

[ ] Your primary public API endpoint is monitored (not just the website root)
[ ] Authentication endpoint is checked (login/token issuance confirms auth infrastructure is working)
[ ] Core data endpoints are verified — at minimum one read and one write path if available
[ ] API version header is checked if your API is versioned (detect version routing failures)
[ ] Response time thresholds are set for critical API paths

API Health Check Best Practices

[ ] API health endpoint returns dependency status (database, cache, queue, external APIs)
[ ] API returns structured error responses (JSON with error code and message — not bare 500 HTML)
[ ] Rate limiting is in place and tested (returns 429, not 500 under load)
[ ] CORS configuration is correct for browser-facing APIs (verify from expected origins)
[ ] API documentation is accessible and reflects current endpoints

Third-Party API Dependencies

[ ] Critical third-party API dependencies are identified (payment processor, auth provider, email delivery, storage)
[ ] External uptime monitor checks each critical dependency's public status endpoint
[ ] Third-party status page alerts are subscribed for each dependency
[ ] Graceful degradation is implemented and tested for each critical dependency failure
[ ] Timeout and retry logic is configured for all external API calls
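
For the timeout-and-retry item, here is a small sketch of retrying with exponential backoff. The helper name `call_with_retry` and its defaults are hypothetical. It retries on any exception to keep the example short; real code would usually narrow that to transient errors (timeouts, connection resets, 5xx responses) and retry only idempotent calls.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(operation: Callable[[], T], attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation`, retrying on exception with exponential backoff.

    Delays double each time: base_delay, 2 * base_delay, 4 * base_delay, ...
    The last failure is re-raised so the caller can degrade gracefully.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


# Simulated dependency that fails twice, then succeeds.
calls = {"count": 0}


def flaky_dependency() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("upstream too slow")
    return "ok"


recorded_delays = []
print(call_with_retry(flaky_dependency, sleep=recorded_delays.append))  # ok
print(recorded_delays)  # [0.5, 1.0]
```

The timeout itself belongs inside `operation`, for example by passing `timeout=` to whatever HTTP client you use, so that a hung dependency cannot block a worker indefinitely.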

Part 4: Database Monitoring

Connectivity and Availability

[ ] Health endpoint verifies database connectivity (not just application availability)
[ ] Database connection pool metrics are accessible (so you can see pool exhaustion coming under production load)
[ ] Read replica availability is checked separately if you use read replicas
[ ] Database alert fires if health endpoint reports database unreachable

Backup Verification

[ ] Automated database backups are running and verified to complete
[ ] Backup completion is monitored via heartbeat (backup job sends ping on success; alert if no ping within window)
[ ] Last backup age is monitored (alert if backup is older than expected interval)
[ ] Restore procedure is documented and has been tested (a backup you can't restore is not a backup)
[ ] Backup storage location is separate from primary infrastructure (off-site or different cloud region)
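
The "last backup age" check is simple arithmetic, sketched below. The function name and the one-hour grace period are assumptions; how you obtain the timestamp of the latest backup depends on where your backups are stored.

```python
from datetime import datetime, timedelta, timezone


def backup_is_stale(last_backup_at: datetime, now: datetime,
                    expected_interval: timedelta,
                    grace: timedelta = timedelta(hours=1)) -> bool:
    """True when the newest backup is older than the interval plus a grace period."""
    return now - last_backup_at > expected_interval + grace


now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
daily = timedelta(hours=24)

# A 26-hour-old backup on a daily schedule is past the 25-hour limit.
print(backup_is_stale(now - timedelta(hours=26), now, daily))  # True

# A backup that ran 30 minutes late is still inside the grace period.
print(backup_is_stale(now - timedelta(hours=24, minutes=30), now, daily))  # False
```

The grace period plays the same role as the heartbeat alert window: it absorbs normal variation in job duration so that only a genuinely missed backup raises an alert.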

Performance Baselines

[ ] Baseline query response times are documented at expected production load
[ ] Slow query logging is enabled with a threshold appropriate to your application
[ ] Database disk usage alert is configured before storage saturation

Part 5: Background Jobs and Scheduled Tasks

Background jobs are the monitoring blind spot in most applications. HTTP endpoint checks confirm your API is up; they say nothing about whether your cron jobs ran, your email queue is draining, or your data pipeline completed.

Heartbeat Monitoring Setup

[ ] Every scheduled cron job sends an HTTP ping on successful completion
[ ] Heartbeat monitor is configured for each job with an appropriate alert window (e.g., job runs every hour → alert if no ping in 90 minutes)
[ ] Heartbeat alert fires to the same channel as other infrastructure alerts
[ ] Job failure (non-zero exit code) does NOT send the heartbeat ping (the missing ping is what triggers the alert)
[ ] Each job's heartbeat URL is documented in the job's configuration or comments
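
A wrapper script is one way to satisfy the "ping on success, stay silent on failure" items without touching the job itself. This is a sketch using the Python standard library; the function names and the heartbeat URL are placeholders, and your monitoring provider's ping URL format will differ.

```python
import subprocess
import sys
import urllib.request
from typing import Callable, Sequence


def send_ping(url: str) -> None:
    """Send the heartbeat as a plain HTTP GET with a short timeout."""
    with urllib.request.urlopen(url, timeout=10):
        pass


def run_with_heartbeat(command: Sequence[str], heartbeat_url: str,
                       ping: Callable[[str], None] = send_ping) -> int:
    """Run `command`; ping the heartbeat URL only if it exits with code 0."""
    exit_code = subprocess.run(command).returncode
    if exit_code == 0:
        ping(heartbeat_url)
    return exit_code


# Demonstration with a recording stub instead of a real HTTP request.
HEARTBEAT_URL = "https://heartbeat.example.com/ping/nightly-backup"  # placeholder
sent = []
run_with_heartbeat([sys.executable, "-c", "raise SystemExit(0)"], HEARTBEAT_URL, ping=sent.append)
run_with_heartbeat([sys.executable, "-c", "raise SystemExit(3)"], HEARTBEAT_URL, ping=sent.append)
print(len(sent))  # 1
```

Only the successful run produced a ping. In production the failed run stays silent, the expected ping never arrives, and the monitor alerts once the window passes.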

Jobs That Need Heartbeat Monitoring

Consider adding heartbeat monitoring to:

[ ] Database backup job
[ ] Data import/export pipelines
[ ] Report generation jobs
[ ] Cache warming jobs
[ ] Email/notification queue processors
[ ] Cleanup and maintenance jobs (log rotation, soft-delete purging, token expiry cleanup)
[ ] Billing and payment retry jobs
[ ] Search index rebuild jobs
[ ] Media processing workers (image resizing, video transcoding)
[ ] External data sync jobs (CRM sync, analytics export, webhook replay)

Queue Depth Monitoring

If your application uses a message queue or job queue:

[ ] Queue depth is monitored (alert if queue grows beyond expected bounds)
[ ] Worker health is monitored (alert if worker pool drops to zero)
[ ] Dead letter queue / failed job count triggers an alert when it grows
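
The three queue checks can be evaluated together from periodic snapshots of the queue. The sketch below assumes you can read depth, worker count, and dead letter count from your queue system; the `QueueSnapshot` fields, alert names, and threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QueueSnapshot:
    """Point-in-time queue metrics (hypothetical field names)."""
    depth: int
    active_workers: int
    dead_letter_count: int


def queue_alerts(previous: QueueSnapshot, current: QueueSnapshot,
                 max_depth: int) -> List[str]:
    """Compare two snapshots and return the names of any alerts to raise."""
    alerts = []
    if current.depth > max_depth:
        alerts.append("queue_depth_exceeded")
    if current.active_workers == 0:
        alerts.append("no_active_workers")
    if current.dead_letter_count > previous.dead_letter_count:
        alerts.append("dead_letter_growth")
    return alerts


before = QueueSnapshot(depth=10, active_workers=4, dead_letter_count=2)
after = QueueSnapshot(depth=5000, active_workers=0, dead_letter_count=7)
print(queue_alerts(before, after, max_depth=1000))
# ['queue_depth_exceeded', 'no_active_workers', 'dead_letter_growth']
```

Comparing the dead letter count against the previous snapshot, rather than against a fixed number, means a backlog of old failures does not alert forever while new failures still do.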

Part 6: Alerting Pipelines

A monitoring system that generates alerts nobody sees is not a monitoring system — it's a log. Verify the full alert pipeline end-to-end before launch.

Alert Routing

[ ] On-call rotation is configured with real people assigned to real shifts
[ ] P1 (complete outage) alerts page the on-call engineer immediately (phone/SMS)
[ ] P2 (degraded service) alerts go to Slack engineering channel and on-call
[ ] P3 (performance degradation) alerts go to Slack engineering channel
[ ] Certificate expiry and background job failures route to appropriate team
[ ] After-hours alerts for non-P1 issues don't wake the entire team (noise is itself an incident)

Alert Pipeline Verification

[ ] Test alert has been sent through the full pipeline (monitor → webhook → PagerDuty/OpsGenie → on-call phone)
[ ] Alert recovery notification is confirmed working (resolved state clears the incident)
[ ] Escalation path is documented: who gets called if primary on-call doesn't respond?
[ ] Runbook links are included in alert bodies (engineer getting paged at 3 AM needs context)
[ ] On-call engineer knows how to silence, acknowledge, and resolve alerts

Integration Health

[ ] Monitoring webhook is authenticated (signed secret or auth token — not bare URL)
[ ] Webhook delivery failures are logged and alerted
[ ] If using PagerDuty/OpsGenie: service is configured for deduplication (one incident per outage, not one per check)
[ ] Incident auto-resolution is configured when monitoring confirms recovery
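
Two of these items, webhook authentication and deduplication, can be illustrated briefly. The sketch below shows HMAC-SHA256 signing of a webhook body and a stable per-monitor key for grouping alerts into one incident. The function names, the payload, and the key format are assumptions; check your monitoring and incident tools for the header names and key fields they actually use.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 signature of a webhook body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected.encode(), received_signature.encode())


def dedup_key(monitor_id: str) -> str:
    """Stable key per monitor, so repeated failing checks map to one incident."""
    return f"monitor-{monitor_id}"


secret = b"replace-with-a-real-shared-secret"  # placeholder value
body = b'{"monitor": "api", "state": "down"}'
signature = sign_payload(secret, body)

print(verify_signature(secret, body, signature))         # True
print(verify_signature(secret, body + b" ", signature))  # False
```

The receiver must verify the signature against the raw request bytes, before any JSON parsing or re-serialization, because even a whitespace change in the body invalidates it.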

Part 7: Status Pages

A status page answers the question users and customers ask during an outage before they contact support: "Is this a me problem or a them problem?"

Status Page Basics

[ ] Status page is accessible at a predictable URL (status.yourdomain.com or similar)
[ ] Status page is hosted independently of your production infrastructure (it shouldn't go down when your app goes down)
[ ] Status page reflects current production status (not manually updated — connected to your monitoring)
[ ] Status page shows uptime history (30/60/90 days makes SLA conversations concrete)

Incident Communication

[ ] Process for posting incident updates during an outage is documented
[ ] Team knows who is responsible for updating the status page during incidents
[ ] Status page subscriber notifications are configured (users can opt in to email updates)
[ ] Post-incident policy is defined: how quickly after resolution a post-mortem will be posted

Part 8: Performance Baselines

Monitoring catches outages. Performance baselines let you catch degradation before it becomes an outage.

[ ] Response time baseline is documented for critical endpoints at expected production load
[ ] Response time monitoring is enabled in your uptime tool (each check records its response time)
[ ] Alert threshold is set for response time SLO violation (e.g., alert when P95 > 500ms for 3 consecutive evaluation windows)
[ ] Response time history is accessible for trend analysis (detect slow drift upward over days/weeks)
[ ] Load test has been run at expected peak traffic to identify performance ceiling before users find it
[ ] Memory and CPU usage baselines are documented for your production instances
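
A P95 threshold only makes sense over a window of samples, so it helps to see how one might be computed. The sketch below uses the nearest-rank definition of a percentile, which is one of several in common use, and treats each window as a list of response times in milliseconds. The function names and defaults are illustrative.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_breached(windows: Sequence[Sequence[float]],
                 threshold_ms: float = 500.0, consecutive: int = 3) -> bool:
    """True when P95 exceeds the threshold in each of the last `consecutive` windows."""
    if len(windows) < consecutive:
        return False
    return all(percentile(w, 95) > threshold_ms for w in windows[-consecutive:])


fast = [100.0] * 20
slow = [100.0] * 18 + [900.0, 950.0]  # two slow requests out of twenty

print(percentile(slow, 95))                    # 900.0
print(slo_breached([fast, slow, slow, slow]))  # True
print(slo_breached([slow, slow, fast]))        # False
```

Requiring several consecutive windows plays the same role here as requiring consecutive missed checks for uptime: a single slow window does not page anyone.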

Part 9: Pre-Launch Verification Run

Before you flip the production DNS or enable public traffic, run through this final verification sequence:

Trigger a test failure: take a test endpoint offline and confirm the alert fires within the expected interval
Verify alert delivery: confirm the alert reaches every configured channel (email, Slack, PagerDuty)
Confirm recovery notification: bring the test endpoint back up and verify the "resolved" notification fires
Test a heartbeat miss: stop a job that sends heartbeat pings (or skip its ping) and verify the alert fires at the expected window boundary
Check SSL validity: run your production domain through an SSL checker; confirm chain is complete and expiry is tracked
Confirm backup completion: verify the most recent database backup completed successfully and the heartbeat fired
Walk the on-call flow: the engineer taking the first on-call shift should trigger a test alert and confirm it arrives on their own device through the full pipeline

Monitoring Tool Recommendation

For the uptime and heartbeat monitoring layer, Vigilmon covers most of what this checklist requires:

Multi-region consensus alerting for all HTTP/HTTPS and TCP monitors (free tier and paid)
Heartbeat monitoring for background jobs and cron tasks
Response time history with color-coded latency bands
Webhook notifications that integrate with PagerDuty, OpsGenie, Slack, or any custom endpoint
REST API for programmatic monitor management in deployment pipelines
Status badge and basic status page

The free tier covers up to 5 monitors with full multi-region consensus — sufficient for a small production environment and a starting point for larger ones.

Get started at vigilmon.online — no credit card required, monitoring configured in under 5 minutes.

Summary Checklist

Use this abbreviated list as a final pass before launch:

Uptime

[ ] External uptime monitor active on primary production domain
[ ] Multi-region / consensus-based alerting (not single probe)
[ ] Health endpoint exists, responds 200, checks dependencies

SSL

[ ] Certificate expiry monitored with a 30-day warning and a 7-day critical alert
[ ] Auto-renewal configured and verified

APIs

[ ] Primary API endpoints monitored
[ ] Third-party dependencies monitored
[ ] Graceful degradation implemented for critical dependencies

Databases

[ ] Health endpoint checks DB connectivity
[ ] Backup job monitored via heartbeat
[ ] Restore procedure tested

Background Jobs

[ ] All cron jobs and workers send heartbeat pings on success
[ ] Heartbeat alert windows are configured

Alerting

[ ] On-call rotation is configured with real people
[ ] Alert routes verified end-to-end with a test fire
[ ] Escalation path documented

Status Page

[ ] Status page live on independent infrastructure
[ ] Incident update process documented

Tags: #monitoring #devops #production #sre #checklist #uptime #2026