Monitoring is the part of the software lifecycle that developers most commonly hand off to someone else — and then resent when that someone isn't available at 2 AM. In 2026, the tools to own your own monitoring are better than they've ever been, and the argument for developers owning it rather than delegating it has never been stronger.
This guide covers why developers should own monitoring, what to monitor, how to instrument it, on-call practices that don't burn people out, and a practical walkthrough of setting up monitoring with Vigilmon.
Why Developers Should Own Monitoring
The conventional wisdom is that monitoring belongs to SRE or operations. In practice, the developers who build a system understand it better than anyone else. They know which failure modes exist, which endpoints are critical, which background jobs are fragile, and what "degraded but technically up" looks like for a specific service.
When monitoring is owned by a separate team:
- Monitors are configured based on what that team can see (usually just HTTP endpoints)
- Heartbeat monitoring for background jobs is often missing because ops doesn't know the jobs exist
- Alert thresholds are guesses rather than informed by how the service actually behaves
- Incidents require a handoff: ops notifies development, development investigates, and communication becomes the bottleneck
When developers own monitoring:
- Monitor coverage maps to actual failure modes the team understands
- Alert thresholds are calibrated to baseline behavior the team has observed
- Incident response is direct: the person who gets paged is the person who can fix it
- Monitoring evolves with the code rather than lagging months behind
This doesn't mean developers run a 24/7 on-call operation solo. It means developers configure and own the monitoring for the services they build, integrate it into their deployment processes, and participate in the on-call rotation rather than being woken by a proxy.
What Developers Should Monitor
1. HTTP/HTTPS Endpoints
The foundational check: is your service responding at its public URL?
A good HTTP monitor:
- Checks your production URL at regular intervals (1–5 minutes depending on SLA)
- Verifies the HTTP status code is in the 2xx range
- Checks response time against a threshold
- Checks from multiple geographic locations to avoid false positives from single-probe failures
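The pass/fail rules in this list can be sketched as a small probe function. This is a minimal illustration using only Python's standard library; the names `CheckResult`, `evaluate`, and `probe`, and the 2000 ms default threshold, are assumptions made for the example and not part of any monitoring tool's API.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    status_code: Optional[int]  # None when no HTTP response was received
    elapsed_ms: float
    healthy: bool
    reason: str


def evaluate(status_code, elapsed_ms, threshold_ms=2000.0):
    """Apply the two rules: a 2xx status and a response time under the threshold."""
    if status_code is None:
        return CheckResult(None, elapsed_ms, False, "no response")
    if not 200 <= status_code < 300:
        return CheckResult(status_code, elapsed_ms, False, f"unexpected status {status_code}")
    if elapsed_ms > threshold_ms:
        return CheckResult(status_code, elapsed_ms, False, "slow response")
    return CheckResult(status_code, elapsed_ms, True, "ok")


def probe(url, timeout_s=10.0, threshold_ms=2000.0):
    """Fetch the URL once, time it, and evaluate the outcome."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # urllib reports 4xx/5xx responses as exceptions
    except (urllib.error.URLError, OSError):
        status = None  # DNS failure, refused connection, or timeout
    elapsed_ms = (time.monotonic() - start) * 1000
    return evaluate(status, elapsed_ms, threshold_ms)
```

Running `probe` from a single machine covers only the first three bullets. The multi-location requirement needs the same check to run from several networks, which is the part a hosted monitor provides.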
Configure HTTP monitoring for:
- Your primary application URL
- Your API base URL
- Your health endpoint (/health or /api/health)
- Any critical sub-paths that have independent failure modes (e.g., /api/v2/ if v2 runs on different infrastructure)
2. TCP Port Monitoring
Some services don't speak HTTP. TCP port monitoring checks whether a port is accepting connections — useful for:
- Database ports (if accessible from your monitoring infrastructure)
- Mail server ports (SMTP, IMAP)
- Custom protocol services
- Internal service ports accessible via your monitoring network
If your service exposes only HTTP, TCP monitoring is lower priority. If you run mixed infrastructure, TCP covers what HTTP checks miss.
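For reference, a TCP check amounts to attempting a connection and seeing whether it completes. A minimal sketch with Python's standard `socket` module follows; the function name `tcp_port_open` and the 3-second default timeout are illustrative choices.

```python
import socket


def tcp_port_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A successful connection shows only that something is listening on the port. It says nothing about whether the service behind it can answer queries, so pair it with an application-level check where one is available.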
3. Heartbeat Monitoring for Background Jobs
This is the monitoring gap most developer teams have. HTTP endpoint checks confirm your web server is up. They say nothing about whether your cron jobs ran.
Heartbeat monitoring works by inverting the check: instead of your monitor probing your service, your service pings your monitor on each successful run. If the ping doesn't arrive within the expected window, the alert fires.
Add a heartbeat ping to the end of each job, so that it is sent only after a successful run:

```bash
# Simple cron job with heartbeat
0 * * * * /usr/bin/python3 /app/scripts/hourly_sync.py && curl -fsS https://vigilmon.online/heartbeat/YOUR_HEARTBEAT_ID
```
```python
# Python job with heartbeat
import requests


def run_data_sync():
    # ... your job logic ...
    pass


if __name__ == "__main__":
    run_data_sync()
    # Reached only if run_data_sync() did not raise, so a failed run sends no ping.
    requests.get("https://vigilmon.online/heartbeat/YOUR_HEARTBEAT_ID", timeout=5)
```
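```javascript
// Node.js job with heartbeat
const https = require('https');

async function runJob() {
  // ... your job logic ...
}

runJob()
  .then(() => {
    // Ping only after a successful run; log network errors instead of crashing.
    https
      .get('https://vigilmon.online/heartbeat/YOUR_HEARTBEAT_ID')
      .on('error', console.error);
  })
  .catch((err) => {
    console.error(err);
    process.exitCode = 1; // surface the failure to the scheduler
  });
```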
Configure the heartbeat's alert window to be 20–50% longer than the expected interval between pings, not the job's run time. An hourly job should have a window of 72 to 90 minutes. A job that runs every 45 minutes should have a window of roughly 55 to 65 minutes. This prevents alerts from firing during normal variation without hiding failures.
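The window arithmetic is easy to get wrong by hand, so here is a small sketch of it. It assumes the margin is applied to the interval between pings; `alert_window` and its `margin` parameter are names invented for this example.

```python
from datetime import timedelta


def alert_window(ping_interval, margin=0.5):
    """Return how long to wait for a ping before alerting.

    margin is the extra slack as a fraction of the interval (0.2 to 0.5).
    """
    if not 0.2 <= margin <= 0.5:
        raise ValueError("margin should stay between 0.2 and 0.5")
    return ping_interval * (1 + margin)
```

With the default margin, `alert_window(timedelta(hours=1))` gives a 90-minute window for an hourly job. Jobs with very stable timing can use a smaller margin to detect failures sooner.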
Every job that should run on a schedule is a candidate for heartbeat monitoring. Common examples:
- Database backups
- Email/notification queue workers
- Data import and export pipelines
- Cache warming jobs
- Billing retry jobs
- Search index rebuilds
- Log aggregation and rotation
- External data sync (CRM, analytics, webhooks)
4. Response Time History
A service that's technically available but responding at 3× its baseline latency is failing its users even though a simple uptime check shows green. Response time history lets you:
- Track latency trends over time (catch gradual drift before it becomes an outage)
- Set alert thresholds based on actual baseline behavior rather than guesses
- Correlate latency spikes with deployments, traffic patterns, or upstream changes
- Document SLA compliance with objective latency data
Enable response time tracking on your HTTP monitors and establish baselines within the first week of production operation.
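As a rough illustration of what "establish a baseline" can mean in practice, the sketch below summarises a window of response times and flags a reading that exceeds a multiple of the median. The function names and the factor of 3 (taken from the 3× example above) are assumptions, and real tools may compute baselines differently.

```python
import statistics


def latency_baseline(samples_ms):
    """Summarise response times (milliseconds) as median and 95th percentile."""
    cut_points = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": statistics.median(samples_ms), "p95": cut_points[94]}


def is_degraded(current_ms, baseline, factor=3.0):
    """Flag a response slower than `factor` times the baseline median."""
    return current_ms > factor * baseline["p50"]
```

Feeding it a week of samples gives a baseline you can write down. Recomputing it after major releases keeps the threshold tied to how the service actually behaves.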
How to Instrument Your Application for Monitoring
Build a Useful Health Endpoint
The health endpoint is what your monitor checks — and what engineers check first during an incident. Build it to be genuinely informative.
Minimal health endpoint:
```json
{
  "status": "ok",
  "timestamp": "2026-06-30T10:23:01Z"
}
```
More useful health endpoint:
```json
{
  "status": "ok",
  "version": "3.1.4",
  "uptime_seconds": 172800,
  "checks": {
    "database": "ok",
    "cache": "ok",
    "queue": "ok"
  },
  "timestamp": "2026-06-30T10:23:01Z"
}
```
Design principles for health endpoints:
Return correct HTTP status codes. HTTP 200 means healthy. HTTP 503 means unhealthy. Do not return HTTP 200 with {"status": "error"} in the body — your monitoring tool reads status codes; inconsistent semantics cause missed alerts and false alerts.
Keep it fast. A health endpoint that takes 500ms because it checks four dependencies should be split into liveness (is the process running?) and readiness (is the process ready for traffic?). A slow health check can cause cascading failures in load balancers and container orchestrators.
Check the critical path, not everything. A health endpoint that checks every dependency will false-alarm whenever any non-critical system hiccups. Check the dependencies that, if unavailable, would cause you to serve errors to users. Optional enhancements that degrade gracefully don't belong in the critical path health check.
Express health in terms of the system's ability to serve requests. "Is the database reachable?" is less useful than "Can we complete a database read within 50ms?" The latter reflects actual user experience.
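These principles can be combined in a framework-independent helper that runs the critical-path checks and returns both the status code and the JSON body. This is a sketch under assumptions: `build_health_response` is a hypothetical name, and each check is assumed to be a zero-argument callable that returns True when the dependency is usable.

```python
import json
import time
from datetime import datetime, timezone

START_TIME = time.monotonic()


def build_health_response(checks, version="3.1.4"):
    """Run each critical-path check and return (http_status, json_body).

    `checks` maps a dependency name to a callable returning True when healthy.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "error"
        except Exception:
            results[name] = "error"  # a crashing check counts as a failure
    healthy = all(state == "ok" for state in results.values())
    body = {
        "status": "ok" if healthy else "error",
        "version": version,
        "uptime_seconds": int(time.monotonic() - START_TIME),
        "checks": results,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # The status code carries the verdict: 200 when healthy, 503 otherwise.
    return (200 if healthy else 503), json.dumps(body)
```

A web handler would call this with only the dependencies on the critical path and write the returned status and body to the response. Keeping each check behind a short timeout keeps the endpoint fast.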
Structured Application Logging
Good monitoring and good logging are complementary. When an alert fires, logs tell you why.
Structured logs (JSON or key=value) are searchable and parseable by log aggregation systems:
```json
{
  "level": "error",
  "timestamp": "2026-06-30T10:23:01Z",
  "message": "Payment API request failed",
  "service": "billing",
  "error": "connection timeout",
  "endpoint": "/api/v2/charge",
  "duration_ms": 5000,
  "request_id": "req_abc123"
}
```
Include at minimum: log level, timestamp, message, service name, and request ID. Request IDs let you trace a single request through multiple services and log lines during incident investigation.
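One way to produce logs in this shape with Python's standard `logging` module is a custom formatter. The sketch below emits the five minimum fields; the class name `JsonFormatter` and the convention of passing `request_id` through `extra` are choices made for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, timezone.utc)
        entry = {
            "level": record.levelname.lower(),
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "message": record.getMessage(),
            "service": self.service,
            # request_id is supplied per call through the `extra` argument.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="billing"))
logger = logging.getLogger("billing")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment API request failed", extra={"request_id": "req_abc123"})
```

Additional fields such as `endpoint` or `duration_ms` can be passed through `extra` in the same way and copied into the entry.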
Expose Internal Metrics
For services that matter enough to monitor deeply, expose metrics alongside your health endpoint:
- Request count and error rate per endpoint
- Response time percentiles (P50, P95, P99)
- Queue depth (if applicable)
- Active connection count
- Cache hit rate
- Background job execution time and success rate
Prometheus-compatible /metrics endpoints are the standard format. Even if you're not running Prometheus, the format is readable and easily parsed by other tools.
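To show how readable the text format is, here is a minimal sketch that renders a single counter by hand. The helper `render_counter` and the metric name are illustrative, label-value escaping is omitted, and a production service would normally use an official client library rather than string formatting.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Label values are assumed to contain no quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    counts = {(("endpoint", "/api/v2/charge"), ("status", "500")): 3}
    print(render_counter("http_requests_total", "Total HTTP requests.", counts), end="")
```

Each sample is one line: the metric name, its labels in braces, and the current value.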
Alerting That Doesn't Cause Burnout
The False Positive Problem
Alert fatigue is when engineers stop responding to alerts because most alerts don't represent real problems. It's the most dangerous failure mode in monitoring — an ignored alert for an actual outage means hours of silent user impact.
The primary cause of false positives in uptime monitoring is single-probe failures. If your monitoring tool uses one probe location, that probe's momentary bad day — transient packet loss, a DNS anomaly, a routing hiccup — triggers an alert. The service was never actually down; the probe was just having a bad second.
Solution: multi-region consensus alerting. Require independent confirmation from multiple geographically distributed probes before an alert fires. A single probe with a problem cannot achieve consensus against probes on healthy paths. Tools like Vigilmon implement this by default: every check is dispatched simultaneously from multiple regions, and an alert fires only on quorum.
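The quorum idea can be expressed in a few lines. The sketch below is an illustration of the general technique, not Vigilmon's actual algorithm; the region names and the default of a strict majority are assumptions.

```python
def should_alert(probe_results, quorum=None):
    """Decide whether to alert from per-region results (True means the check failed).

    By default a strict majority of probes must report failure.
    """
    failures = sum(1 for failed in probe_results.values() if failed)
    if quorum is None:
        quorum = len(probe_results) // 2 + 1
    return failures >= quorum


if __name__ == "__main__":
    # One probe on a bad network path does not reach quorum.
    print(should_alert({"us-east": True, "eu-west": False, "ap-south": False}))
    # Two of three probes agreeing does.
    print(should_alert({"us-east": True, "eu-west": True, "ap-south": False}))
```

Raising the quorum trades slower detection of partial, regional outages for fewer false alarms.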
Alert Severity Tiers
Not every alert should wake you up. Define severity tiers and route them differently:
| Severity | Condition | Response | Channel |
|---|---|---|---|
| P1 | Service completely unreachable | Page immediately | On-call rotation (phone/SMS) |
| P2 | Error rate above SLO threshold | Respond within 15 minutes | PagerDuty + Slack |
| P3 | Response time degraded | Respond next business hour | Slack engineering channel |
| P4 | Background job missed one run | Investigate during work hours | Slack or email |
| P5 | SSL expiry within 30 days | Scheduled renewal task | Email |
Mixing all of these into the same alert channel trains engineers to ignore alerts. When the P5 SSL warning arrives at 2 AM with the same urgency as a P1 outage, both get treated as noise.
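Routing by severity can be captured as data plus one small function. In this sketch the channel names are copied from the table, while the `pages` flag, the 09:00 to 18:00 weekday business hours, and the rule that only P1 interrupts people off-hours are assumptions to adapt to your own policy.

```python
from datetime import datetime, time

# Channels come from the severity table; the "pages" flag is an assumption.
ROUTES = {
    "P1": {"channel": "On-call rotation (phone/SMS)", "pages": True},
    "P2": {"channel": "PagerDuty + Slack", "pages": True},
    "P3": {"channel": "Slack engineering channel", "pages": False},
    "P4": {"channel": "Slack or email", "pages": False},
    "P5": {"channel": "Email", "pages": False},
}


def route_alert(severity, at, day_start=time(9, 0), day_end=time(18, 0)):
    """Return the destination channel and whether to interrupt someone now."""
    rule = ROUTES[severity]
    business_hours = at.weekday() < 5 and day_start <= at.time() < day_end
    # Only P1 interrupts people outside business hours.
    page_now = rule["pages"] and (severity == "P1" or business_hours)
    return {"channel": rule["channel"], "page_now": page_now}
```

With this policy, a P5 certificate warning at 2 AM lands in email without paging anyone, while a P1 at the same hour still pages the on-call engineer.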
On-Call Rotation Basics
A sustainable on-call rotation:
Rotate frequently enough that no one carries the full burden. Weekly rotation is a common starting point. Monthly rotation leaves one person carrying an entire month alone, for example all of November with its deployment risk.
Respect off-hours. Non-P1 alerts should not page engineers outside business hours. Save the 2 AM pages for actual outages. Alert fatigue from unnecessary night interruptions is a retention problem, not just a comfort problem.
Document runbooks for your most common alerts. When an engineer gets paged, they need context immediately — what this alert means, what to check first, what the likely causes are, and how to resolve each one. A runbook link in the alert body is the fastest way to get a new on-call engineer productive during an incident.
Conduct blameless post-mortems. After every significant incident, document what happened, what the impact was, what was done to resolve it, and what could prevent recurrence. Focus on systemic causes rather than individual errors. Post-mortems that result in monitoring improvements prevent the same alert from firing again.
Include developers in rotation. The developer who built a service knows its failure modes. Having them participate in on-call — not exclusively, but as part of the rotation — improves incident response time and creates direct feedback between production failures and future development priorities.
Setting Up Vigilmon: A Practical Walkthrough
Step 1: Create Your Free Account
Go to vigilmon.online and create a free account. No credit card required. The free tier covers 5 monitors with full multi-region consensus alerting.
Step 2: Add Your First HTTP Monitor
- Click Add Monitor → select HTTP/HTTPS
- Enter your production URL (e.g., https://api.yourdomain.com/health)
- Set the check interval (5 minutes on free tier; 1 minute on paid)
- Configure expected status code (200)
- Set a response time threshold if desired
- Save — your monitor is active immediately
Vigilmon begins dispatching checks from multiple geographic locations. Response time history is recorded from the first check.
Step 3: Configure Webhook Notifications
- Go to Settings → Notifications
- Add a webhook URL for your alerting destination:
  - Slack: Create an incoming webhook in Slack's App Directory; paste the URL
  - PagerDuty: Use PagerDuty's Events API v2 webhook endpoint
  - OpsGenie: Use OpsGenie's Heartbeat API or webhook integration
  - Custom: Any HTTPS endpoint that accepts POST requests
Vigilmon will POST a JSON payload to this URL when an alert fires and when it recovers.
Step 4: Add Heartbeat Monitors for Background Jobs
- Click Add Monitor → select Heartbeat
- Name the monitor (e.g., "Database Backup - Nightly")
- Set the expected ping interval (match or slightly exceed your job's schedule)
- Set the grace period (how long after the expected interval before alerting)
- Copy the heartbeat URL
- Add the curl ping to the end of your job
Example for a nightly database backup:
- Schedule: daily at 2 AM
- Expected interval: 24 hours
- Grace period: 1 hour
- Alert fires: if no ping by 3 AM
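The example above reduces to date arithmetic. The sketch below assumes the deadline is measured from the previous ping (interval plus grace period); Vigilmon's exact rule may differ, and the function names are invented for illustration.

```python
from datetime import datetime, timedelta


def alert_deadline(last_ping, expected_interval, grace_period):
    """Return the moment after which a missing ping should raise an alert."""
    return last_ping + expected_interval + grace_period


def is_overdue(now, last_ping, expected_interval, grace_period):
    """True once the deadline has passed without a new ping."""
    return now > alert_deadline(last_ping, expected_interval, grace_period)


if __name__ == "__main__":
    last = datetime(2026, 6, 29, 2, 0)  # last night's backup pinged at 2 AM
    print(alert_deadline(last, timedelta(hours=24), timedelta(hours=1)))
```

If the backup itself takes a while to finish, the ping arrives later than 2 AM and the deadline shifts with it, which is one reason to keep the grace period comfortably larger than the job's normal run-time variation.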
Step 5: Add TCP Monitors If Needed
For non-HTTP services:
- Click Add Monitor → select TCP
- Enter the hostname and port
- Set check interval
- Configure notifications
Step 6: Verify the Alert Pipeline
Before trusting your monitoring in production:
- Trigger a test failure: temporarily take a test endpoint offline (or misconfigure a check URL) and verify the alert fires within the expected interval
- Confirm delivery: check that the alert reached your configured webhook destination
- Verify recovery: bring the endpoint back up and confirm the recovery notification fires
- Test a missed heartbeat: stop a heartbeat check from receiving pings and confirm the alert fires at the configured window boundary
Only when you've confirmed the full alert pipeline end-to-end should you treat your monitoring as production-ready.
Step 7: Use the REST API for Deployment Integration
Vigilmon's REST API lets you integrate monitoring into your deployment pipeline:
```bash
# Pause a monitor during deployment
curl -X PATCH https://vigilmon.online/api/monitors/MONITOR_ID \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"paused": true}'

# Resume after deployment completes
curl -X PATCH https://vigilmon.online/api/monitors/MONITOR_ID \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"paused": false}'
```
This prevents false alerts during rolling deployments where endpoints are briefly unavailable as instances restart.
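A risk with pausing is forgetting to resume when a deployment fails halfway. One way to guard against that is a context manager that always sends the resume request. The sketch below reuses the endpoint, headers, and body from the curl commands above; the helper names `set_paused` and `monitor_paused` and the injectable `send` parameter are assumptions for this example.

```python
import json
import urllib.request
from contextlib import contextmanager

API_BASE = "https://vigilmon.online/api/monitors"


def set_paused(monitor_id, api_key, paused):
    """Send the same PATCH request as the curl commands."""
    request = urllib.request.Request(
        f"{API_BASE}/{monitor_id}",
        data=json.dumps({"paused": paused}).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


@contextmanager
def monitor_paused(monitor_id, api_key, send=set_paused):
    """Pause a monitor for the duration of a block and always resume it."""
    send(monitor_id, api_key, True)
    try:
        yield
    finally:
        send(monitor_id, api_key, False)  # runs even if the deployment raises
```

Wrapping the deployment step in `with monitor_paused("MONITOR_ID", "YOUR_API_KEY"):` keeps the pause window as short as the deployment itself, so a real outage right after the release is still caught.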
Monitoring as Part of Your Development Process
Good monitoring is not a one-time setup — it's a practice that evolves with your codebase.
Add monitors when you add services. Every new service, endpoint, or background job should have corresponding monitors before it ships to production. Make it part of your definition of "done" for new services.
Include monitor configuration in your codebase. Vigilmon's API makes it possible to define monitors in code and apply them as part of your deployment pipeline. Infrastructure-as-code for monitoring means new environments get the right monitors automatically.
Review alert history after incidents. When an incident occurs, check whether your monitoring detected it promptly. If not, update your monitors to catch the same failure mode faster next time.
Audit heartbeat monitors when jobs change. If a cron job's schedule changes, its heartbeat monitor's expected interval should change too. Stale heartbeat configurations are silent — the alert window is wrong, but it doesn't look wrong until it matters.
Share monitoring visibility with the team. Response time history and status badges that the whole team can access make performance trends visible without requiring anyone to go looking for them.
Quick Reference: Monitoring Setup Checklist for Developers
Before a service goes to production:
- [ ] HTTP monitor on primary production URL (multi-region, 5-minute or 1-minute interval)
- [ ] HTTP monitor on API base URL or health endpoint
- [ ] Heartbeat monitor for each cron job, worker, and scheduled task
- [ ] TCP monitor for any non-HTTP service endpoints that matter
- [ ] Webhook notifications configured and tested (Slack, PagerDuty, or OpsGenie)
- [ ] Response time history enabled; baseline documented
- [ ] Alert pipeline tested end-to-end (test fire and confirmed delivery)
- [ ] Runbook entry for each monitor documenting what to check when it fires
- [ ] Monitor configuration documented or added to deployment pipeline
Conclusion
Monitoring in 2026 is too important to delegate to a team that doesn't understand what it's monitoring — and too accessible to treat as someone else's problem. The tools exist, the free tiers are genuinely useful, and the setup time for a new service is measured in minutes, not days.
The developers who own their monitoring have shorter incident response times, more accurate alert signals, and better coverage of the failure modes that matter. They also sleep better because they've verified that their monitoring actually works before users find the failure.
Start with your most important service. Add an HTTP monitor, a heartbeat monitor for your most critical background job, and a webhook to wherever your team gets notifications. Verify the pipeline. Then build from there.
Get started at vigilmon.online — free account, no credit card, monitoring up in under 5 minutes.