Most SaaS products get monitoring wrong in the same way: they add a simple uptime ping to the homepage, call it done, and discover the gaps at 2 AM when a customer reports that the app has been broken for an hour.
Good SaaS monitoring covers more than a single URL. It covers the full failure surface of a production application: APIs, background jobs, databases, certificates, integrations, and the communication infrastructure you use to tell customers when something breaks. This checklist walks through each of those areas.
Use it when launching a new SaaS product, auditing your existing monitoring setup, or onboarding a new team member to your infrastructure practices.
Why SaaS Monitoring Is Different
SaaS products carry a specific failure profile that makes comprehensive monitoring non-optional:
Customers have contractual expectations. Free tools can go down. SaaS products that charge money have implicit — and often explicit — uptime commitments. When an outage happens, customers have a reference point: "You charged me $99/month for this. It was down for 3 hours."
The blast radius includes your customers' customers. When a developer tool goes down, your customers' users can't use their products. When a payment processor integration fails, your customers' checkouts break. You're responsible for downstream impact, not just your own users.
SaaS revenue is recurring. An outage doesn't just lose a sale; it risks a churn event. Customers who experience unreliable service cancel. The economics of churn mean a single preventable outage can cost you months of subscription revenue.
Your surface area is bigger than you think. Beyond your main application, you're running webhooks, cron jobs, background queues, email delivery, SSL certificates, third-party integrations, and documentation. Any of these can fail independently of your core product.
The Checklist
✅ Core Application Monitoring
[ ] Monitor your main application URL
The homepage or login page — the first thing a user hits. This check catches complete application failures. Set a 1–5 minute check interval depending on your SLA commitment.
- Verify HTTP 200 response
- Verify response body contains expected content (not just a status code)
- Alert threshold: any failure, immediately
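The two assertions above (status code plus body content) can be sketched in a few lines of Python. This is a minimal illustration rather than a replacement for a hosted monitor: `evaluate_response` and `check_url` are hypothetical helper names, and the URL and `expected_text` marker are placeholders you would choose for your own login page.

```python
import urllib.error
import urllib.request


def evaluate_response(status: int, body: str, expected_text: str) -> tuple[bool, str]:
    """Pure decision logic: pass only on HTTP 200 with the expected marker in the body."""
    if status != 200:
        return False, f"unexpected status {status}"
    if expected_text not in body:
        return False, "expected content missing from body"
    return True, "ok"


def check_url(url: str, expected_text: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Fetch the URL and apply the assertions; any network error counts as a failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate_response(resp.status, body, expected_text)
    except urllib.error.HTTPError as exc:
        # urlopen raises for 4xx/5xx responses instead of returning them.
        return False, f"unexpected status {exc.code}"
    except OSError as exc:
        # Covers DNS failures, refused connections, and timeouts.
        return False, f"request failed: {exc}"
```

A scheduler would call something like `check_url("https://app.yourdomain.com/login", "Sign in")` every 1 to 5 minutes. The body assertion is what catches an error page that is still served with a 200.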
[ ] Monitor your primary API endpoint
If you have a public API or your frontend communicates with a backend API, monitor it separately. Application and API can fail independently.
- HTTP check on /api/health or equivalent
- Alert on non-2xx responses
- Track response time trends
[ ] Monitor your user authentication endpoint
Authentication failures affect every logged-in user. Monitor the login/token endpoint specifically:
- POST /auth/login or equivalent
- Verify success response structure
- Alert on failure immediately — auth outages have the highest user-visible impact
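As a rough sketch of what "verify success response structure" might look like, the Python below posts credentials for a dedicated synthetic test account and checks the shape of the JSON that comes back. The field names in `REQUIRED_FIELDS`, the JSON request body, and both function names are assumptions for illustration; substitute whatever your own login endpoint actually accepts and returns.

```python
import json
import urllib.error
import urllib.request

# Assumed response shape; adjust to your own API's contract.
REQUIRED_FIELDS = ("access_token", "expires_in")


def validate_login_response(status: int, raw_body: str) -> tuple[bool, str]:
    """Pass only when the status is 200 and the body is a JSON object with the required fields."""
    if status != 200:
        return False, f"unexpected status {status}"
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return False, "body is not valid JSON"
    if not isinstance(payload, dict):
        return False, "body is not a JSON object"
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        return False, "missing fields: " + ", ".join(missing)
    return True, "ok"


def check_login(url: str, username: str, password: str, timeout: float = 10.0) -> tuple[bool, str]:
    """POST the test credentials as JSON and validate the response."""
    data = json.dumps({"username": username, "password": password}).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}, method="POST"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return validate_login_response(resp.status, body)
    except urllib.error.HTTPError as exc:
        return False, f"unexpected status {exc.code}"
    except OSError as exc:
        return False, f"request failed: {exc}"
```

Using an account that exists only for monitoring keeps real user credentials out of the check and makes its login events easy to filter from analytics.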
[ ] Monitor your most critical customer-facing feature
Every SaaS product has an "if this is broken, it's down" feature. For a project management tool, it's task creation. For an email tool, it's sending. For a database tool, it's the query endpoint. Monitor this path explicitly.
✅ API and Webhook Monitoring
[ ] Monitor all public API endpoints
If your product has a public API, each major endpoint category should be monitored:
- List/read endpoints (your highest traffic paths)
- Write endpoints (where failures lose customer data)
- Webhook delivery endpoints (where customers' systems depend on yours)
[ ] Monitor your webhook delivery system
Webhooks fail silently. If your product sends webhooks to customers and those webhooks stop delivering, customers won't know until their own automation breaks. Monitor:
- Webhook queue health (if accessible)
- Webhook delivery success rate
- Webhook endpoint response from a test receiver
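One way to turn "delivery success rate" into an alert condition is sketched below in Python. It assumes you already record each delivery attempt somewhere you can query; the `DeliveryAttempt` record, the 95% threshold, and the 20-sample minimum are illustrative choices, not recommendations from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeliveryAttempt:
    endpoint: str
    status_code: int  # 0 is used here to mean "no response" (timeout or connection error)


def success_rate(attempts: list[DeliveryAttempt]) -> float:
    """Fraction of attempts that received a 2xx response; 1.0 when there was no traffic."""
    if not attempts:
        return 1.0
    delivered = sum(1 for attempt in attempts if 200 <= attempt.status_code < 300)
    return delivered / len(attempts)


def should_alert(
    attempts: list[DeliveryAttempt], min_rate: float = 0.95, min_samples: int = 20
) -> bool:
    """Alert only when there is enough traffic for the rate to be meaningful."""
    return len(attempts) >= min_samples and success_rate(attempts) < min_rate
```

The minimum sample size keeps a single failed delivery during a quiet period from paging anyone. A test receiver that you control covers the opposite case, where there is too little traffic to measure.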
[ ] Monitor inbound webhook endpoints
If you receive webhooks from payment processors, authentication providers, or third-party services, verify that those endpoints are reachable. An unreachable webhook endpoint means missed events that may be unrecoverable.
✅ Background Job Monitoring
Background jobs are the most commonly unmonitored component in SaaS applications. They fail silently — no user-facing error, no HTTP response, no log that anyone reads — until a customer reports that something hasn't happened.
[ ] Set up heartbeat monitoring for all cron jobs
Every cron job should send an HTTP ping to a heartbeat monitor after each successful execution. If the ping doesn't arrive within the expected window, an alert fires.
Common cron jobs to monitor:
- Email digest sends
- Usage calculation and billing jobs
- Data export generation
- Database cleanup and archival
- Third-party sync jobs (CRM, analytics, etc.)
- Report generation
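The ping-after-success pattern can be sketched as a small Python wrapper around any job function. The heartbeat URL is a placeholder for whatever your monitoring tool issues per job, and `run_with_heartbeat` and `ping_heartbeat` are hypothetical names, not part of any library.

```python
import logging
import urllib.request
from typing import Callable

logger = logging.getLogger("cron")


def ping_heartbeat(url: str, timeout: float = 10.0) -> None:
    """Send the success ping with a plain GET request."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def run_with_heartbeat(
    job: Callable[[], None],
    heartbeat_url: str,
    ping: Callable[[str], None] = ping_heartbeat,
) -> bool:
    """Run the job and ping only if it finished without raising."""
    try:
        job()
    except Exception:
        # No ping is sent, so the monitor's missing-heartbeat alert fires.
        logger.exception("job failed; heartbeat withheld")
        return False
    try:
        ping(heartbeat_url)
    except OSError:
        # The job itself succeeded; a failed ping is logged but does not fail it.
        logger.warning("job succeeded but heartbeat ping failed")
    return True
```

The important design choice is that the ping happens after the work, never before it. A job that starts and then crashes should look exactly like a job that never ran.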
[ ] Monitor background queue workers
Queue consumers that process user-initiated work (file uploads, image processing, async API calls) should send heartbeats on each successful batch. Worker death or queue stalls need immediate visibility.
[ ] Monitor scheduled payment and billing jobs
Billing failures are catastrophic for SaaS revenue. Monitor every billing-related cron job with heartbeat checks:
- Invoice generation
- Payment retry jobs
- Subscription renewal processing
- Failed charge notification sends
Tools like Vigilmon support heartbeat monitoring alongside HTTP checks — background workers send a ping after each successful run, and the monitoring tool alerts if no ping arrives within the configured window.
✅ Database and Infrastructure Monitoring
[ ] Monitor database connectivity via health endpoint
Your application's health endpoint should check database connectivity and return a degraded status if the database is unreachable. External monitoring then picks up the degraded health check.
Design your health endpoint to:
- Execute a simple query (SELECT 1)
- Return within 100ms
- Return HTTP 503 if database is unreachable
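The three design points above can be sketched as a framework-independent Python function that returns a status code and a JSON-serializable body. The example uses the standard library's `sqlite3` purely as a stand-in so it runs anywhere; in practice `connect` would open a connection to your production database, and the response field names are assumptions.

```python
import sqlite3
import time
from typing import Callable


def health_check(
    connect: Callable[[], sqlite3.Connection], budget_ms: float = 100.0
) -> tuple[int, dict]:
    """Run SELECT 1 and report 503 if the database cannot be reached."""
    started = time.monotonic()
    try:
        conn = connect()
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
    except Exception as exc:
        return 503, {
            "status": "unavailable",
            "database": "unreachable",
            "error": type(exc).__name__,
        }
    elapsed_ms = (time.monotonic() - started) * 1000
    # A slow but working database is reported as degraded rather than down.
    status = "ok" if elapsed_ms <= budget_ms else "degraded"
    return 200, {"status": status, "database": "ok", "elapsed_ms": round(elapsed_ms, 1)}
```

Your web framework's route handler would call `health_check` and pass the code and body through. Returning only the exception class name, not its message, avoids leaking connection strings through a public endpoint.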
[ ] Monitor TCP port for each database
External TCP port monitoring on your database ports (5432 for PostgreSQL, 3306 for MySQL, etc.) catches database process failures before your health endpoint even checks it. Combine TCP monitoring with the health check for layered detection.
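A TCP port check is just an attempt to complete a connection within a timeout. The Python sketch below shows the idea; `tcp_port_open` is a hypothetical helper name, and the host in the usage note is a placeholder.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, unreachable hosts, DNS failures, and timeouts all land here.
        return False
```

For example, `tcp_port_open("db.internal.example", 5432)` would probe a PostgreSQL listener. Database ports are usually not exposed to the public internet, so a probe like this typically needs to run from a network location that is allowed to reach the database. A successful connection only proves the process is listening, which is why it complements the health endpoint rather than replacing it.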
[ ] Monitor cache service connectivity
If your application depends on Redis or Memcached for session storage, rate limiting, or caching, monitor cache connectivity via your health endpoint. Cache failure often degrades performance severely before causing complete outages.
[ ] Monitor message queue health
If you use RabbitMQ, Kafka, or a cloud queue service (SQS, Google Pub/Sub), monitor queue depth and consumer health. A growing queue with no consumers processing it is an incident that won't show up as a health check failure.
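The "growing queue with no consumers" condition can be expressed over a short series of samples. The Python below is a broker-agnostic sketch: it assumes you collect depth and consumer counts from your broker's own metrics yourself, and the `QueueSample` record and three-sample window are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSample:
    depth: int      # messages waiting in the queue
    consumers: int  # active consumers reported by the broker


def queue_is_stalled(samples: list[QueueSample], window: int = 3) -> bool:
    """True when messages wait with no consumer, or depth rose across the whole window."""
    if not samples:
        return False
    latest = samples[-1]
    if latest.depth > 0 and latest.consumers == 0:
        return True
    if len(samples) < window:
        return False
    recent = samples[-window:]
    # Strictly increasing depth means consumers are not keeping up.
    return all(later.depth > earlier.depth for earlier, later in zip(recent, recent[1:]))
```

Alerting on the trend rather than on an absolute depth avoids picking a magic number that is wrong for every queue except the one it was tuned on.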
✅ SSL Certificate Monitoring
SSL certificate expiry is the most preventable outage category for SaaS products — and one of the most common. A certificate expires, HTTPS fails, browsers show security warnings, and every customer is affected simultaneously.
[ ] Monitor SSL certificate expiry for your primary domain
Set up certificate expiry monitoring for your main domain with at minimum three alert thresholds:
- 30 days: First warning — time to investigate if auto-renewal is configured
- 14 days: Escalation — manual renewal should start now
- 7 days: Critical — same-day action required
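The three thresholds above map directly onto a small classifier. The Python sketch below pairs it with a certificate lookup that uses the standard library's `ssl` module; the level names are illustrative labels and the hostname in the usage note is a placeholder.

```python
import socket
import ssl
import time

# (days remaining, level), checked from most to least urgent.
THRESHOLDS = ((7, "critical"), (14, "escalation"), (30, "warning"))


def classify_expiry(days_remaining: float) -> str:
    """Map days until expiry onto the 30/14/7-day alert levels."""
    for limit, level in THRESHOLDS:
        if days_remaining <= limit:
            return level
    return "ok"


def days_until_expiry(hostname: str, port: int = 443, timeout: float = 10.0) -> float:
    """Connect over TLS and compute days until the certificate's notAfter date."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400
```

A scheduled job could call `classify_expiry(days_until_expiry("app.yourdomain.com"))` once a day. With the default context, a certificate that has already expired makes the handshake raise `ssl.SSLCertVerificationError`, which a caller should treat as a critical failure in its own right.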
[ ] Monitor SSL for all customer-facing subdomains
SaaS products often run multiple HTTPS subdomains that can expire independently:
- app.yourdomain.com: main application
- api.yourdomain.com: public API
- docs.yourdomain.com: documentation
- status.yourdomain.com: status page
- dashboard.yourdomain.com: customer dashboard
[ ] Monitor SSL for customer custom domains
If your product supports custom domains (where customers bring their own domain and you provision a certificate), monitor those certificates too. Customer custom domain certificate failures are indistinguishable from your core product failing from the customer's perspective.
[ ] Don't rely solely on auto-renewal
Let's Encrypt auto-renewal via Certbot, Caddy, or Traefik is reliable — until it isn't. DNS changes, firewall rules, or permission issues can cause renewal to fail silently. External certificate monitoring is your safety net when auto-renewal breaks.
✅ Status Page and Incident Communication
[ ] Run a public status page
Customers need somewhere to check during incidents. A public status page at status.yourdomain.com (or linked prominently from your product) serves multiple functions:
- Reduces support ticket volume during incidents ("is it just me?")
- Demonstrates transparency and operational maturity
- Gives customers a reference point for SLA calculations
[ ] Monitor your status page itself
Your status page should be monitored as a separate system from your main application. If your main app goes down and takes your status page with it, customers have nowhere to check. Run your status page on a different hosting provider or use a third-party status page service.
[ ] Set up automated status updates from your monitoring
Manual status page updates during an incident are error-prone — when you're debugging a production issue, updating a status page is easily forgotten. Configure your monitoring tool to automatically update your status page when a monitor fails.
[ ] Communicate proactively during incidents
Beyond the status page:
- Email customers before they email you
- Post in-app banners during known incidents
- Tweet from your product account for high-impact issues
- Be specific: "database failover in progress, expected resolution 14:30 UTC" beats "we're aware of issues"
✅ Third-Party Integration Monitoring
[ ] Monitor payment processor connectivity
For B2B SaaS, payment processor downtime is revenue-critical. Monitor:
- Stripe/Braintree/PayPal API status
- Your checkout flow end-to-end
- Webhook receipts from your payment processor
Subscribe to your payment processor's status page notifications and build degradation handling for payment provider outages.
[ ] Monitor email delivery service
Transactional email failure is often invisible — emails just don't arrive, with no error surfaced to your users. Monitor:
- SendGrid/Mailgun/Postmark API status
- Delivery rate from your monitoring (send a test email via API, check delivery)
- Bounce rates (elevated bounce rates signal deliverability problems)
[ ] Monitor authentication provider (SSO/OAuth)
If your product uses a third-party auth provider (Auth0, Okta, Google OAuth), monitor their API status. Auth provider failure means no users can log in — total product downtime even if your application is perfectly healthy.
[ ] Monitor CDN and edge delivery
If you use a CDN (Cloudflare, Fastly, AWS CloudFront), monitor the CDN-fronted URLs in addition to your origin. CDN configuration errors can cause failures that an origin-only health check never sees, because that check bypasses the edge.
✅ Multi-Region Coverage
[ ] Monitor from multiple geographic locations
Single-probe monitoring can create false negatives (regional issues that only affect some users) and false positives (probe-side issues that aren't real outages).
Multi-region monitoring provides:
- False positive elimination: an alert requires multiple independent observers
- Regional failure detection: discover that your service is down in Europe but not North America
- Geographic latency visibility: identify regions where your response times are unacceptably slow
[ ] Choose monitoring with consensus-based alerting
Not all multi-region monitoring is equal. Some tools run checks from multiple locations but still alert when a single location reports failure. Tools with consensus-based alerting — like Vigilmon — only alert when a quorum of probes independently confirm failure, structurally eliminating the most common source of alert fatigue.
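To make the idea of a quorum concrete, here is a generic Python sketch of consensus-based alerting. It is not a description of how Vigilmon or any other product implements it; the region names and the strict-majority default are assumptions.

```python
def failing_regions(probe_results: dict[str, bool]) -> list[str]:
    """probe_results maps region name to True when that probe's check passed."""
    return sorted(region for region, passed in probe_results.items() if not passed)


def quorum_failed(probe_results: dict[str, bool], quorum: float = 0.5) -> bool:
    """Alert only when more than the quorum fraction of probes report failure."""
    if not probe_results:
        return False
    failures = len(failing_regions(probe_results))
    return failures / len(probe_results) > quorum
```

With three probes and the default quorum, a single failing probe is recorded but does not page anyone, while two failing probes do. The list returned by `failing_regions` is still worth logging, because a lone failing region can be a genuine regional outage rather than probe noise.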
[ ] Monitor user latency by region
For SaaS products serving global customers, regional latency matters. A US-based application may be fast for North American users but slow for European and Asian users. Track response times by region and set region-specific alert thresholds.
✅ Developer and API User Experience
[ ] Monitor API documentation availability
If developers integrate with your API, your documentation must be available. Monitor docs.yourdomain.com the same way you monitor your application.
[ ] Monitor API rate limit headers
If your API enforces rate limits, monitor that rate limit headers are present and accurate. Missing or incorrect rate limit headers cause integrations to break in unexpected ways.
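A header check of this kind might look like the Python sketch below. The `X-RateLimit-*` names are a common convention rather than a universal standard, so treat `EXPECTED_HEADERS` as an assumption and replace it with the headers your API documents.

```python
# Assumed header names; substitute the ones your API actually documents.
EXPECTED_HEADERS = ("X-RateLimit-Limit", "X-RateLimit-Remaining", "X-RateLimit-Reset")


def rate_limit_header_problems(headers: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the headers look sane."""
    lowered = {name.lower(): value for name, value in headers.items()}
    problems: list[str] = []
    values: dict[str, int] = {}
    for name in EXPECTED_HEADERS:
        raw = lowered.get(name.lower())
        if raw is None:
            problems.append(f"{name} missing")
            continue
        try:
            number = int(raw)
        except ValueError:
            problems.append(f"{name} is not an integer")
            continue
        if number < 0:
            problems.append(f"{name} is negative")
        values[name] = number
    limit = values.get("X-RateLimit-Limit")
    remaining = values.get("X-RateLimit-Remaining")
    if limit is not None and remaining is not None and remaining > limit:
        problems.append("remaining exceeds limit")
    return problems
```

Header names are compared case-insensitively because HTTP header names are case-insensitive and proxies often rewrite their casing.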
[ ] Set up monitoring for your developer portal or sandbox
API sandbox environments used by developers building integrations can fail independently of production. Monitor your sandbox environment and alert on failures — sandbox downtime blocks customers from completing integrations.
✅ Alert Architecture
[ ] Define alert severity levels
| Severity | Condition | Response |
|---|---|---|
| P1 | Application completely down | Immediate on-call page, all-hands response |
| P2 | Core feature broken or SLO breached | On-call page, response within 15 minutes |
| P3 | Degraded performance or partial failure | Engineering channel alert, response within 1 hour |
| P4 | Background job failure | Slack alert, response within 4 hours |
| P5 | SSL expiry warning | Email, response within 7 days |
[ ] Route alerts to the right channels
- P1–P2: PagerDuty or Opsgenie on-call rotation
- P3–P4: Engineering Slack channel
- P5: Email to platform or DevOps team
- All: Status page update (automated where possible)
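The routing rules above amount to a lookup table. A minimal Python sketch follows; the channel names are illustrative labels rather than identifiers from PagerDuty, Slack, or any other tool, and paging on an unrecognized severity is an assumed policy.

```python
# Severity to channel mapping, mirroring the routing list above.
ROUTES: dict[str, list[str]] = {
    "P1": ["on_call_pager"],
    "P2": ["on_call_pager"],
    "P3": ["engineering_slack"],
    "P4": ["engineering_slack"],
    "P5": ["platform_email"],
}

# Every alert also updates the status page.
ALWAYS = ["status_page"]


def route_alert(severity: str) -> list[str]:
    """Return the channels an alert of this severity should be sent to."""
    # Assumption: an unrecognized severity is paged rather than silently dropped.
    channels = ROUTES.get(severity.upper(), ROUTES["P1"])
    return channels + ALWAYS
```

Keeping the mapping in one table makes it easy to review during the quarterly alert-path test, and failing loud on unknown severities means a typo in a monitor's configuration cannot mute an alert.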
[ ] Test your alert paths regularly
An alerting pipeline that's never tested is an alerting pipeline that fails when you need it most. Quarterly:
- Trigger a test alert from each monitoring check
- Verify it arrives in the correct channel
- Verify the on-call rotation has the right people
- Verify the runbook linked in the alert is current
[ ] Set up on-call rotation for P1 incidents
Solo founders can carry a pager alone, but teams of 3+ should implement an on-call rotation. Tools like PagerDuty and Opsgenie make rotation scheduling straightforward and integrate with monitoring tools via webhook.
✅ Runbooks
[ ] Write runbooks for your top 5 alert types
A runbook is a documented response procedure for a specific incident type. It answers:
- What are the symptoms?
- Who should respond?
- What do you check first?
- What are the likely causes?
- How do you resolve each cause?
- When do you escalate?
Link the runbook from the alert. Engineers shouldn't have to search for documentation during an incident.
[ ] Document your rollback procedure
When a deployment causes an incident, you need to roll back fast. Document:
- How to roll back a failed deployment
- How to restore from a previous database backup or fail over to a replica
- How to disable a feature flag that's causing problems
- Who has production access and how to request emergency access
Quick-Start Monitoring Setup
If you're starting from zero, here's the order of operations:
1. Set up uptime monitoring for your primary URL and API health endpoint (takes about 5 minutes)
2. Add heartbeat monitoring for your highest-impact cron jobs (billing, email)
3. Enable SSL monitoring for your primary domain (usually automatic with HTTPS monitoring)
4. Add a public status page, even a minimal one
5. Route P1 alerts to on-call: email is fine for solo founders, PagerDuty for teams
6. Expand coverage to secondary URLs, background jobs, and third-party integrations
A monitoring tool like Vigilmon covers steps 1–3 in a single setup flow — HTTP monitoring, TCP monitoring, heartbeat monitoring, and SSL monitoring are all included in the free tier. No credit card required.
Monitoring Coverage Assessment
Score your current monitoring setup against this rubric:
| Coverage Area | Basic | Good | Complete |
|---|---|---|---|
| Application availability | Homepage checked | All core flows checked | Multi-region consensus |
| API health | Health endpoint checked | All critical endpoints | Functional assertions |
| Background jobs | None | Key cron jobs | All workers + queues |
| SSL certificates | None | Primary domain | All domains + custom |
| Status page | None | Manual updates | Auto-updated from monitoring |
| Third-party APIs | None | Status subscriptions | External endpoint checks |
| Alert routing | Everything to email | Severity-based channels | On-call rotation + runbooks |
Getting from Basic to Good is the most valuable increment. Getting from Good to Complete is where SaaS operational maturity lives.
Conclusion
Comprehensive SaaS monitoring isn't complicated — but it does require intentional coverage of every failure surface. The checklist above covers what every SaaS product that charges money needs to monitor: application health, APIs, background jobs, certificates, status communication, third-party dependencies, and alert architecture.
Most monitoring gaps aren't gaps in tooling — they're gaps in coverage. A team that picks one monitoring tool and uses it to cover all of these surfaces is in better shape than a team with four monitoring products and unchecked cron jobs.
Start with the basics, build toward the complete picture, and make sure your alert paths are tested before you need them.
Set up Vigilmon's free monitoring at vigilmon.online — covers HTTP, TCP, heartbeat, and SSL monitoring in one tool, free for up to 5 monitors, no credit card required.