Going to production is a moment of commitment. Once users are hitting your system, every failure is a user experience problem, and every missed alert is an incident you'll hear about from a customer before you hear about it from your own tools.
The monitoring layer is one of the most commonly incomplete parts of a production readiness review. Teams that ship solid code, pass CI, and clear code review sometimes skip the monitoring verification entirely — or configure it weeks after launch when the first major incident makes its absence obvious.
This checklist covers everything your monitoring setup should include before a production deployment, organized by category so you can work through it systematically.
Before You Start: What Production Monitoring Actually Covers
Production monitoring isn't just "is the site up?" It spans several distinct concerns:
- Uptime: Is the service reachable and responding with success codes?
- Performance: Are response times within acceptable bounds?
- SSL/TLS: Are certificates valid and not approaching expiry?
- APIs: Are your API endpoints returning correct responses?
- Background jobs: Are cron tasks, workers, and pipelines completing successfully?
- Alerting pipelines: Do alerts actually reach the right people?
- Status communication: Do customers and stakeholders know when things are degraded?
Each concern has its own verification. Work through all of them before you deploy to production, not after.
Part 1: Uptime Monitoring
Core HTTP/HTTPS Checks
- [ ] External uptime monitor is configured for your primary production domain
- [ ] Monitor checks the final HTTPS URL directly (not an HTTP URL that redirects to it)
- [ ] Check interval is appropriate for your SLO (≤5 minutes for a 99.9% uptime target; 1 minute for critical services)
- [ ] Monitor checks from multiple geographic locations (not a single probe)
- [ ] Alerting requires consensus from multiple probes before firing (prevents false positives from single-probe failures)
- [ ] HTTP status code assertion is configured (2xx expected; alert on 4xx/5xx)
- [ ] Response content validation is set if you monitor a specific endpoint (optional but recommended for health endpoints)
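To make the consensus item concrete, here is a minimal sketch of how a monitor might combine results from several probes before firing. The `ProbeResult` type, the region labels, and the default quorum of 2 are illustrative assumptions, not the behavior of any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    region: str                 # hypothetical probe location label
    status_code: Optional[int]  # None means no response (timeout, DNS, TLS error)


def probe_failed(result: ProbeResult) -> bool:
    """A probe fails if no response arrived or the status is not 2xx."""
    return result.status_code is None or not (200 <= result.status_code < 300)


def should_alert(results: list[ProbeResult], quorum: int = 2) -> bool:
    """Fire only when at least `quorum` probes agree the target is failing."""
    failures = sum(1 for r in results if probe_failed(r))
    return failures >= quorum
```

With three probes and a quorum of 2, a single regional network blip stays silent, while a real outage seen from two or more locations fires.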
Health Endpoints
- [ ] `/health` or `/api/health` endpoint exists and returns 200 when the service is healthy
- [ ] Health endpoint response time is under 100ms under normal load
- [ ] Health endpoint checks critical dependencies (database connectivity, cache availability, queue reachability)
- [ ] Health endpoint returns 503 when dependencies are unhealthy (not 200 with error body)
- [ ] If Kubernetes: separate `/health/live` (liveness) and `/health/ready` (readiness) probes are configured
- [ ] Health endpoint path is documented in your runbook and monitoring configuration
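The dependency and status-code items above can be sketched as a small framework-independent function. The check names and the JSON shape are assumptions for illustration; wire the returned status and body into whatever web framework you use.

```python
import json
from typing import Callable


def build_health_response(checks: dict[str, Callable[[], bool]]) -> tuple[int, str]:
    """Run each dependency check and return (http_status, json_body).

    `checks` maps a hypothetical dependency name to a callable that returns
    True when the dependency is reachable. A check that raises counts as failed.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception:
            results[name] = "fail"

    healthy = all(state == "ok" for state in results.values())
    # 503 (not 200 with an error body) so status-code assertions catch it.
    status = 200 if healthy else 503
    body = json.dumps({"status": "ok" if healthy else "unhealthy", "checks": results})
    return status, body
```

Keep each check cheap (for example, a `SELECT 1` against the database) so the endpoint stays within its response-time budget.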
Alert Configuration
- [ ] Alert fires after two consecutive missed checks (not on a single transient failure)
- [ ] Alert notifications go to the correct channel (on-call rotation, Slack engineering channel, or PagerDuty)
- [ ] Alert recovery notification is configured (so you know when the service comes back up)
- [ ] Alert silence/acknowledge workflow is understood by the on-call team
- [ ] Test alert has been fired and confirmed received before launch
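One way to picture the "two missed checks" and recovery items is as a tiny state machine. This is a sketch under the assumption that the monitor feeds it one boolean per check; the class name and the event strings are made up for the example.

```python
from typing import Optional


class AlertState:
    """Tracks consecutive failures and emits 'down' / 'recovered' events."""

    def __init__(self, failures_to_alert: int = 2) -> None:
        self.failures_to_alert = failures_to_alert
        self.consecutive_failures = 0
        self.alerting = False

    def record(self, check_ok: bool) -> Optional[str]:
        """Record one check result; return an event name or None."""
        if check_ok:
            self.consecutive_failures = 0
            if self.alerting:
                self.alerting = False
                return "recovered"
            return None

        self.consecutive_failures += 1
        if not self.alerting and self.consecutive_failures >= self.failures_to_alert:
            self.alerting = True
            return "down"
        return None
```

A single failed check followed by a success produces no event, and an ongoing outage produces exactly one "down" event rather than one per check.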
Part 2: SSL Certificate Monitoring
SSL certificate expiry is one of the most avoidable production incidents. Certificates expire on a fixed date that's known months in advance — and yet expired SSL certificates take down production services regularly because no one set up expiry monitoring.
- [ ] SSL certificate expiry monitoring is enabled for all production domains
- [ ] Warning alert is configured at 30 days before expiry
- [ ] Critical alert is configured at 7 days before expiry
- [ ] Certificate auto-renewal is configured (Let's Encrypt + Certbot, AWS ACM auto-renew, or equivalent)
- [ ] If auto-renewal: verify it's actually working, not just configured (test the renewal manually or check renewal logs)
- [ ] Certificate covers all domains served by the application (including subdomains and www variants)
- [ ] Wildcard certificates: verify the wildcard covers all expected subdomains
- [ ] Certificate chain is complete (intermediate certificates installed; verify with an SSL checker)
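If you want to see what the 30-day and 7-day thresholds look like in code, the sketch below reads a certificate's expiry with Python's standard `ssl` module and classifies it. The function names and the returned labels are assumptions for illustration; a hosted monitor does the equivalent for you.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> datetime:
    """Connect over TLS and return the leaf certificate's expiry time in UTC.

    With the default context, an already expired or untrusted certificate
    makes the handshake raise ssl.SSLCertVerificationError instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # notAfter is a string such as "Jun  1 12:00:00 2026 GMT".
    expiry_epoch = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(expiry_epoch, tz=timezone.utc)


def classify_expiry(
    not_after: datetime,
    now: datetime,
    warn_days: int = 30,
    critical_days: int = 7,
) -> str:
    """Map time remaining to 'ok', 'warning', 'critical', or 'expired'."""
    days_left = (not_after - now).total_seconds() / 86400
    if days_left <= 0:
        return "expired"
    if days_left <= critical_days:
        return "critical"
    if days_left <= warn_days:
        return "warning"
    return "ok"
```

A scheduled job could call `classify_expiry(fetch_not_after("example.com"), datetime.now(timezone.utc))` for each production domain, where `example.com` stands in for your own hostname.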
Part 3: API Endpoint Monitoring
Public API Endpoints
- [ ] Your primary public API endpoint is monitored (not just the website root)
- [ ] Authentication endpoint is checked (login/token issuance confirms auth infrastructure is working)
- [ ] Core data endpoints are verified — at minimum one read and one write path if available
- [ ] API version header is checked if your API is versioned (detect version routing failures)
- [ ] Response time thresholds are set for critical API paths
API Health Check Best Practices
- [ ] API health endpoint returns dependency status (database, cache, queue, external APIs)
- [ ] API returns structured error responses (JSON with error code and message — not bare 500 HTML)
- [ ] Rate limiting is in place and tested (returns 429, not 500 under load)
- [ ] CORS configuration is correct for browser-facing APIs (verify from expected origins)
- [ ] API documentation is accessible and reflects current endpoints
Third-Party API Dependencies
- [ ] Critical third-party API dependencies are identified (payment processor, auth provider, email delivery, storage)
- [ ] External uptime monitor checks each critical dependency's public status endpoint
- [ ] Third-party status page alerts are subscribed for each dependency
- [ ] Graceful degradation is implemented and tested for each critical dependency failure
- [ ] Timeout and retry logic is configured for all external API calls
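For the timeout and retry item, a minimal sketch of retrying with exponential backoff might look like this. The helper name, the default of 3 attempts, and the 0.5 second base delay are assumptions; the per-call timeout itself belongs inside the operation you pass in (for example, the `timeout` argument of your HTTP client).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on any exception with exponential backoff.

    Delays are base_delay, 2 * base_delay, 4 * base_delay, and so on.
    The last failure is re-raised so the caller can degrade gracefully.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Catching every exception is deliberately broad here. In real code you would likely retry only on timeouts and transient errors, and only for requests that are safe to repeat.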
Part 4: Database Monitoring
Connectivity and Availability
- [ ] Health endpoint verifies database connectivity (not just application availability)
- [ ] Database connection pool metrics are accessible (so you can confirm the pool is not exhausted under production load)
- [ ] Read replica availability is checked separately if you use read replicas
- [ ] Database alert fires if health endpoint reports database unreachable
Backup Verification
- [ ] Automated database backups are running and verified to complete
- [ ] Backup completion is monitored via heartbeat (backup job sends ping on success; alert if no ping within window)
- [ ] Last backup age is monitored (alert if backup is older than expected interval)
- [ ] Restore procedure is documented and has been tested (a backup you can't restore is not a backup)
- [ ] Backup storage location is separate from primary infrastructure (off-site or different cloud region)
Performance Baselines
- [ ] Baseline query response times are documented at expected production load
- [ ] Slow query logging is enabled with a threshold appropriate to your application
- [ ] Database disk usage alert is configured before storage saturation
Part 5: Background Jobs and Scheduled Tasks
Background jobs are the monitoring blind spot in most applications. HTTP endpoint checks confirm your API is up; they say nothing about whether your cron jobs ran, your email queue is draining, or your data pipeline completed.
Heartbeat Monitoring Setup
- [ ] Every scheduled cron job sends an HTTP ping on successful completion
- [ ] Heartbeat monitor is configured for each job with an appropriate alert window (e.g., job runs every hour → alert if no ping in 90 minutes)
- [ ] Heartbeat alert fires to the same channel as other infrastructure alerts
- [ ] Job failure (non-zero exit code) does NOT send the heartbeat ping (alert fires on failure)
- [ ] Each job's heartbeat URL is documented in the job's configuration or comments
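The heartbeat items above can be sketched as a thin wrapper around the job plus a staleness check on the monitoring side. The function names and the heartbeat URL are placeholders, not any specific vendor's API; the only behavior that matters is that a non-zero exit code sends no ping.

```python
import subprocess
import sys
import urllib.request
from datetime import datetime, timedelta
from typing import Callable


def http_ping(url: str) -> None:
    """Send the heartbeat as a plain GET with a short timeout."""
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()


def run_job_with_heartbeat(
    command: list[str],
    heartbeat_url: str,
    ping: Callable[[str], None] = http_ping,
) -> int:
    """Run the job; ping only on exit code 0 so failures trip the alert."""
    exit_code = subprocess.run(command).returncode
    if exit_code == 0:
        ping(heartbeat_url)
    else:
        print(f"job exited with {exit_code}; heartbeat withheld", file=sys.stderr)
    return exit_code


def heartbeat_overdue(last_ping: datetime, now: datetime, window: timedelta) -> bool:
    """True when no ping arrived within the alert window."""
    return now - last_ping > window
```

For an hourly job with a 90-minute window, a last ping 95 minutes ago is overdue while one 60 minutes ago is not.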
Jobs That Need Heartbeat Monitoring
Consider adding heartbeat monitoring to:
- [ ] Database backup job
- [ ] Data import/export pipelines
- [ ] Report generation jobs
- [ ] Cache warming jobs
- [ ] Email/notification queue processors
- [ ] Cleanup and maintenance jobs (log rotation, soft-delete purging, token expiry cleanup)
- [ ] Billing and payment retry jobs
- [ ] Search index rebuild jobs
- [ ] Media processing workers (image resizing, video transcoding)
- [ ] External data sync jobs (CRM sync, analytics export, webhook replay)
Queue Depth Monitoring
If your application uses a message queue or job queue:
- [ ] Queue depth is monitored (alert if queue grows beyond expected bounds)
- [ ] Worker health is monitored (alert if worker pool drops to zero)
- [ ] Alert fires when the dead letter queue or failed job count grows
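As a rough illustration of the three queue checks, the sketch below compares two snapshots of queue metrics. The `QueueSnapshot` fields and the alert names are hypothetical; how you collect the numbers depends on your queue system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSnapshot:
    depth: int              # jobs waiting to be processed
    active_workers: int     # workers currently connected
    dead_letter_count: int  # jobs that failed permanently


def queue_alerts(current: QueueSnapshot, previous: QueueSnapshot, max_depth: int) -> list[str]:
    """Return the names of queue alerts that should fire for this snapshot."""
    alerts = []
    if current.depth > max_depth:
        alerts.append("queue_depth_exceeded")
    if current.active_workers == 0:
        alerts.append("no_active_workers")
    if current.dead_letter_count > previous.dead_letter_count:
        alerts.append("dead_letter_growth")
    return alerts
```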
Part 6: Alerting Pipelines
A monitoring system that generates alerts nobody sees is not a monitoring system — it's a log. Verify the full alert pipeline end-to-end before launch.
Alert Routing
- [ ] On-call rotation is configured with real people assigned to real shifts
- [ ] P1 (complete outage) alerts page the on-call engineer immediately (phone/SMS)
- [ ] P2 (degraded service) alerts go to Slack engineering channel and on-call
- [ ] P3 (performance degradation) alerts go to Slack engineering channel
- [ ] Certificate expiry and background job failures route to appropriate team
- [ ] After-hours alerts for non-P1 issues don't wake the entire team (noise is itself an incident)
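The routing rules above boil down to a small lookup table. Here is a minimal sketch; the destination strings such as `page:on-call` are invented labels, and a real setup would express the same mapping in your alerting tool's configuration.

```python
# Severity-to-destination table mirroring the checklist items above.
ROUTES = {
    "P1": ["page:on-call"],                         # complete outage: phone/SMS page
    "P2": ["slack:engineering", "notify:on-call"],  # degraded service
    "P3": ["slack:engineering"],                    # performance degradation
}


def route_alert(severity: str) -> list[str]:
    """Return the destinations for a severity; reject unknown severities loudly."""
    try:
        return list(ROUTES[severity])
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Failing on an unknown severity is a design choice for this sketch: a misconfigured alert that silently goes nowhere is harder to spot than one that raises an error.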
Alert Pipeline Verification
- [ ] Test alert has been sent through the full pipeline (monitor → webhook → PagerDuty/OpsGenie → on-call phone)
- [ ] Alert recovery notification is confirmed working (resolved state clears the incident)
- [ ] Escalation path is documented: who gets called if primary on-call doesn't respond?
- [ ] Runbook links are included in alert bodies (engineer getting paged at 3 AM needs context)
- [ ] On-call engineer knows how to silence, acknowledge, and resolve alerts
Integration Health
- [ ] Monitoring webhook is authenticated (signed secret or auth token — not bare URL)
- [ ] Webhook delivery failures are logged and alerted
- [ ] If using PagerDuty/OpsGenie: service is configured for deduplication (one incident per outage, not one per check)
- [ ] Incident auto-resolution is configured when monitoring confirms recovery
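For the authenticated webhook item, one common approach is an HMAC signature over the request body. The sketch below assumes the sender puts a hex-encoded HMAC-SHA256 of the body in a header; the exact header name and encoding vary by tool, so treat these function names and that scheme as assumptions to check against your provider's documentation.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 signature a sender would attach."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Check the received signature using a constant-time comparison."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected.encode(), received_signature.encode())
```

The receiver should verify against the raw request bytes, before any JSON parsing, and reject the request when verification fails.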
Part 7: Status Pages
A status page answers the question users and customers ask during an outage before they contact support: "Is this a me problem or a them problem?"
Status Page Basics
- [ ] Status page is accessible at a predictable URL (status.yourdomain.com or similar)
- [ ] Status page is hosted independently of your production infrastructure (it shouldn't go down when your app goes down)
- [ ] Status page reflects current production status (not manually updated — connected to your monitoring)
- [ ] Status page shows uptime history (30/60/90 days makes SLA conversations concrete)
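To see why uptime history makes SLA conversations concrete, it helps to convert a percentage into minutes. The two helpers below are a small arithmetic sketch with made-up names: a 99.9% target over 30 days leaves about 43.2 minutes of allowed downtime.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int) -> float:
    """Downtime budget in minutes for an uptime target over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def uptime_percent(downtime_minutes: float, window_days: int) -> float:
    """Observed uptime percentage given total downtime in the window."""
    total_minutes = window_days * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes
```

The same arithmetic ties back to check intervals: with a 5-minute interval and a two-check confirmation, a single outage can consume roughly 10 minutes of that budget before anyone is paged.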
Incident Communication
- [ ] Process for posting incident updates during an outage is documented
- [ ] Team knows who is responsible for updating the status page during incidents
- [ ] Status page subscriber notifications are configured (users can opt in to email updates)
- [ ] Post-incident policy is defined: how quickly after resolution a post-mortem will be published
Part 8: Performance Baselines
Monitoring catches outages. Performance baselines let you catch degradation before it becomes an outage.
- [ ] Response time baseline is documented for critical endpoints at expected production load
- [ ] Response time monitoring is enabled in your uptime tool (each check records its response time)
- [ ] Alert threshold is set for response time SLO violation (e.g., alert when P95 > 500ms for 3 consecutive checks)
- [ ] Response time history is accessible for trend analysis (detect slow drift upward over days/weeks)
- [ ] Load test has been run at expected peak traffic to identify performance ceiling before users find it
- [ ] Memory and CPU usage baselines are documented for your production instances
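The example threshold above (P95 over 500ms for 3 consecutive checks) can be sketched in a few lines. This uses the nearest-rank percentile definition, which is one of several in common use, and the function names are assumptions for illustration.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_breached(
    p95_history_ms: list[float],
    threshold_ms: float = 500.0,
    consecutive: int = 3,
) -> bool:
    """True when the most recent `consecutive` P95 values all exceed the threshold."""
    recent = p95_history_ms[-consecutive:]
    return len(recent) == consecutive and all(value > threshold_ms for value in recent)
```

Requiring several consecutive breaches plays the same role here as the two-missed-checks rule for uptime: one slow window does not page anyone, a sustained one does.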
Part 9: Pre-Launch Verification Run
Before you flip the production DNS or enable public traffic, run through this final verification sequence:
- Trigger a test failure — take a test endpoint offline and confirm the alert fires within the expected interval
- Verify alert delivery — confirm the alert reaches every configured channel (email, Slack, PagerDuty)
- Confirm recovery notification — bring the test endpoint back up and verify the "resolved" notification fires
- Test a heartbeat miss — stop a heartbeat monitor and verify the alert fires at the expected window boundary
- Check SSL validity — run your production domain through an SSL checker; confirm chain is complete and expiry is tracked
- Confirm backup completion — verify the most recent database backup completed successfully and the heartbeat fired
- Walk the on-call flow — the engineer going on-call first should trigger a test alert on their own device to confirm the full pipeline
Monitoring Tool Recommendation
For the uptime and heartbeat monitoring layer, Vigilmon covers most of what this checklist requires:
- Multi-region consensus alerting for all HTTP/HTTPS and TCP monitors (free tier and paid)
- Heartbeat monitoring for background jobs and cron tasks
- Response time history with color-coded latency bands
- Webhook notifications that integrate with PagerDuty, OpsGenie, Slack, or any custom endpoint
- REST API for programmatic monitor management in deployment pipelines
- Status badge and basic status page
The free tier covers up to 5 monitors with full multi-region consensus — sufficient for a small production environment and a starting point for larger ones.
Get started at vigilmon.online — no credit card required, monitoring configured in under 5 minutes.
Summary Checklist
Use this abbreviated list as a final pass before launch:
Uptime
- [ ] External uptime monitor active on primary production domain
- [ ] Multi-region / consensus-based alerting (not single probe)
- [ ] Health endpoint exists, responds 200, checks dependencies
SSL
- [ ] Certificate expiry monitored with 30-day and 7-day warnings
- [ ] Auto-renewal configured and verified
APIs
- [ ] Primary API endpoints monitored
- [ ] Third-party dependencies monitored
- [ ] Graceful degradation implemented for critical dependencies
Databases
- [ ] Health endpoint checks DB connectivity
- [ ] Backup job monitored via heartbeat
- [ ] Restore procedure tested
Background Jobs
- [ ] All cron jobs and workers send heartbeat pings on success
- [ ] Heartbeat alert windows are configured
Alerting
- [ ] On-call rotation is configured with real people
- [ ] Alert routes verified end-to-end with a test fire
- [ ] Escalation path documented
Status Page
- [ ] Status page live on independent infrastructure
- [ ] Incident update process documented