Why Your Uptime Monitor Is Lying to You (And How to Fix It)
You get paged at 2:14am. Heart rate spikes. You grab your laptop, open a terminal, and... the site loads fine. You check the logs — nothing. You check the database — healthy. By the time you've ruled out every possible cause, 20 minutes have passed and you're wide awake with nowhere to direct your frustration.
Your monitor told you the site was down. It wasn't.
This happens to developers everywhere, every week. And the root cause isn't noisy infrastructure or misconfigured alerts. It's a fundamental architectural flaw in how most uptime monitors work.
The Single-Probe Problem
Here's how most uptime monitoring services operate by default:
- One probe server in one region (let's say Virginia, US)
- Every 5 minutes (or 1 minute), that probe sends an HTTP request to your URL
- If the request fails or times out, the service marks your site as "down" and fires an alert
- If the next check succeeds, it marks it "up" again
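The loop above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the function name `single_probe_monitor` and the injected `check` callable are assumptions made so the example runs without a network.

```python
from typing import Callable, List


def single_probe_monitor(check: Callable[[], bool], rounds: int) -> List[str]:
    """Run `rounds` checks from one probe and return the alerts it would fire.

    `check` is a hypothetical stand-in for "send an HTTP request and report
    whether it succeeded". A single failed check flips the state immediately.
    """
    alerts: List[str] = []
    state = "up"
    for i in range(rounds):
        ok = check()
        if not ok and state == "up":
            state = "down"
            alerts.append(f"check {i}: DOWN alert")
        elif ok and state == "down":
            state = "up"
            alerts.append(f"check {i}: recovered")
    return alerts


# One transient failure between two successful checks is enough to page someone.
blip = iter([True, False, True, True])
print(single_probe_monitor(lambda: next(blip), rounds=4))
```

Nothing in this loop can tell a failed probe apart from a failed site, which is the weakness the rest of this post is about.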
This model has one critical flaw: a single probe is not a reliable witness.
Networks are noisy. BGP routes flap. DNS resolvers have transient failures. The probe server's local ISP might experience packet loss for 30 seconds. A CDN edge node might return a 503 to one region while serving every other region perfectly.
None of these are your site being "down." But every one of them will trigger a 2am page.
What False Positives Actually Cost
The obvious cost is lost sleep. But the downstream costs compound:
Alert fatigue. After the fifth 2am false alarm, engineers start ignoring alerts. They build a mental discount for monitoring noise. Then when a real outage happens, the pager goes off and the on-call engineer assumes it's another false positive — and rolls over.
On-call burnout. False alerts during off-hours are one of the most common reasons experienced engineers quit. It's not just the lost sleep — it's the sustained low-grade stress of knowing your phone might go off for no reason.
Trust erosion. When stakeholders ask "were we down last night?" and you have to say "the monitoring says yes but probably not," you've lost credibility. Monitoring data that you can't trust is worse than no monitoring data.
Wasted escalations. If your alerting policy pages a second on-call after 5 minutes of no response, false positives cascade through your entire on-call rotation.
The Specific Scenarios That Cause False Positives
Let me be concrete about what "network blip" actually means in practice:
BGP route instability. Border Gateway Protocol is how ISPs exchange routing information. BGP routes change constantly, and during a route update, some paths to your server temporarily disappear. This is often regional: ISP A in Northern Virginia loses the route to your server for 40 seconds while ISP B in Oregon continues routing normally. A Virginia-based probe will see a timeout. An Oregon-based probe will see nothing unusual.
DNS propagation and TTL issues. The cached DNS record for your domain expires just before a check, and the probe's resolver has to query an upstream resolver that is slow, briefly unreachable, or still holding an old record. Your probe gets a SERVFAIL, a timeout, or a stale IP. Your site is fine. The probe thinks it's gone.
CDN edge failures. Major CDNs serve traffic from hundreds of edge nodes. Individual nodes occasionally fail, return errors, or get pulled for maintenance. If the probe happens to be routed through the failing edge node — and your real users aren't — you get a false alert.
Probe-side issues. The monitoring service's own probe servers have problems. They run out of file descriptors. They get rate-limited. Their outbound connections get blackholed by an overzealous network filter. You get paged for their problem.
SSL/TLS handshake timeouts. The probe is slow to complete a TLS handshake because its own networking stack is briefly degraded. Your server accepted the connection; the probe gave up.
In every one of these cases, the problem is between the probe and your server, not in your server itself. Your users are fine. You are not.
Why "Alert After 2 Consecutive Failures" Doesn't Really Help
The standard response from monitoring vendors is: configure alerts to fire only after 2 or 3 consecutive failures.
This helps a little. But it doesn't fix the problem — it just makes the problem intermittently silent.
If your check interval is 5 minutes and you require 2 consecutive failures, a real outage now takes between 5 and 10 minutes to detect, depending on where in the interval it starts, instead of at most 5 minutes with a single check. That's a trade: you reduce false positives slightly while making real outage detection slower.
More importantly, it doesn't eliminate false positives. A localized network issue that outlasts the check interval (like a flapping BGP route that's mostly down for 8 minutes) will still generate 2 consecutive failures and page you.
The consecutive-failure heuristic is trying to approximate what you actually need: independent corroboration from multiple sources. It's a hack. There's a better way.
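To see what the consecutive-failure rule costs in detection time, here is a small sketch. It assumes checks run on a fixed schedule at t = 0, interval, 2 × interval, and so on, and that every check during the outage fails. The function name `detection_delay` is hypothetical.

```python
import math


def detection_delay(outage_start: float, interval: float, required: int) -> float:
    """Time from the start of an outage until `required` consecutive checks
    have failed, with checks scheduled at multiples of `interval`.

    All arguments use the same time unit (minutes in the example below).
    """
    # First scheduled check at or after the moment the outage begins.
    first_failed_check = math.ceil(outage_start / interval) * interval
    # Each additional required failure costs one more full interval.
    alert_time = first_failed_check + (required - 1) * interval
    return alert_time - outage_start


# 5-minute interval, 2 consecutive failures required.
print(detection_delay(outage_start=5, interval=5, required=2))  # outage starts exactly at a check
print(detection_delay(outage_start=1, interval=5, required=2))  # outage starts just after a check
```

Under these assumptions the first call returns 5 minutes and the second returns 9: the delay depends on where in the interval the outage begins, and every extra required failure adds a full interval on top.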
The Fix: Multi-Region Consensus Alerting
The correct solution is architecturally simple: instead of one probe asking "is the site down?", use many probes across independent network regions and require a majority to agree before alerting.
Here's how consensus alerting works:
- 3+ probe nodes in geographically and network-topologically distinct locations each independently check your site
- Each probe checks at your configured interval
- An alert fires only when 3+ nodes agree the site is unreachable
- If 1 probe fails and 2 others succeed, no alert fires — because it's clearly a probe-side or regional issue
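The decision rule in the list above fits in one small function. This is a sketch under stated assumptions: the name `should_alert`, the region labels, and the `quorum` parameter are illustrative, and each probe is assumed to report a plain success or failure for the same check window.

```python
from typing import Dict


def should_alert(probe_results: Dict[str, bool], quorum: int = 3) -> bool:
    """Return True only if at least `quorum` probes failed to reach the site.

    `probe_results` maps a probe name to True (check succeeded) or
    False (check failed).
    """
    failures = sum(1 for ok in probe_results.values() if not ok)
    return failures >= quorum


# One regional failure: no alert.
print(should_alert({"virginia": False, "frankfurt": True, "singapore": True}))

# Every probe agrees the site is unreachable: alert.
print(should_alert({"virginia": False, "frankfurt": False, "singapore": False}))
```

The first call prints False and the second prints True. The interesting work in a real system is in running the probes independently; the vote itself is deliberately trivial.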
The key insight is independence. A BGP route flapping in Virginia doesn't affect a probe in Frankfurt or Singapore. A CDN edge failing in Tokyo doesn't affect a probe checking from São Paulo. For all 3+ probes to independently agree the site is down, the site must actually be down.
This is not complicated monitoring theory. It's the same principle as requiring multiple witnesses before convicting someone of a crime. One witness might be wrong. Five independent witnesses who corroborate each other are almost certainly right.
The Math: How Consensus Changes False Alert Probability
Let's model it simply.
Assume each individual probe has a 0.1% chance of a false positive on any given check (one erroneous failure per 1,000 checks). This is actually generous — real-world single-probe false positive rates are higher.
Single probe model: 0.1% false positive rate per check.
3-probe consensus (all must agree): 0.1% × 0.1% × 0.1% = 0.0000001% false positive rate (one in a billion checks), assuming the probes fail independently.
That's a 1,000,000× reduction in false positive rate. For a service checking every minute (1,440 checks per day), a 0.1% single-probe false positive rate means roughly 1 to 2 false alerts per day. With 3-probe consensus, you'd expect about 1 false alert per 1,900 years.
In practice, probe failures are correlated within a region and only approximately independent across regions, so real-world numbers will be less dramatic than this idealized model. The consensus model still reduces false positives by orders of magnitude.
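The arithmetic is easy to reproduce. This sketch takes the 0.1% per-check rate from the model above and assumes fully independent probes and one check per minute. The function and variable names are illustrative.

```python
def consensus_false_positive_rate(p_single: float, probes: int) -> float:
    """Probability that all `probes` independent probes fail spuriously
    on the same check, given a per-probe false positive rate `p_single`."""
    return p_single ** probes


CHECKS_PER_DAY = 24 * 60  # one check per minute

p1 = 0.001  # 0.1% single-probe false positive rate (the modeling assumption)
p3 = consensus_false_positive_rate(p1, probes=3)

false_alerts_per_day_single = p1 * CHECKS_PER_DAY
years_between_false_alerts = 1 / (p3 * CHECKS_PER_DAY * 365)

print(f"single probe: {false_alerts_per_day_single:.2f} false alerts per day")
print(f"3-probe consensus: one false alert every {years_between_false_alerts:,.0f} years")
print(f"reduction factor: {p1 / p3:,.0f}x")
```

Changing `p1` or `probes` shows how sensitive the result is to the independence assumption: any correlation between probes pushes the real rate above what this product predicts.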
What This Looks Like in Practice
Developers who switch to consensus-based monitoring consistently report the same thing: their monitoring goes quiet except when something is genuinely wrong.
The 2am page stops happening for network blips. When the phone does ring, you know it's real. You open your laptop with focus instead of frustration, because you know the site is actually down.
This changes the psychological relationship with monitoring. Alerts become signal, not noise. Engineers start trusting the system again. On-call stops feeling like a tax on your off hours.
One pattern that emerges: teams using single-probe monitoring often have elaborate alert suppression rules, snooze windows, and "don't page me for this" filters built up over months. These are all attempts to cope with a noisy signal. With consensus alerting, you can delete most of those rules, because the signal itself is clean.
Other Improvements Worth Making
Consensus alerting is the highest-leverage fix, but it pairs well with a few other practices:
Monitor from the same path your users take. Don't monitor your server's internal health endpoint. Monitor the same URL your users hit, through the same CDN. You want to know what your users experience.
Set meaningful timeouts. A 30-second timeout is too generous for detecting a slow-loading page. A 5-second timeout is more honest. If your page takes more than 5 seconds to respond, your users are experiencing a problem even if the server technically "responded."
Track response time, not just up/down. A monitor that reports your site "up" but with 8-second response times is technically correct and practically useless. Response time percentiles tell you whether your site is actually usable.
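As a rough illustration of tracking percentiles rather than a single up/down bit, the sketch below summarizes a list of response times using Python's standard `statistics.quantiles`. The function name `latency_summary`, the choice of p50/p95/p99, and the sample data are assumptions for the example.

```python
import statistics
from typing import Dict, List


def latency_summary(samples_ms: List[float]) -> Dict[str, float]:
    """Return the 50th, 95th, and 99th percentile of response times in ms."""
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Mostly fast responses with a slow tail that an up/down check would hide.
samples = [120.0] * 90 + [900.0] * 8 + [8000.0] * 2
summary = latency_summary(samples)
print(summary)
```

Here every request "succeeded", yet the tail percentiles make the 8-second responses visible. Which percentiles to alert on is a judgment call that depends on your traffic.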
Monitor what users depend on, not just what's easy to check. If your service is composed of an API, a worker, and a queue, monitor the API endpoint users hit. Don't rely on a health check endpoint that only confirms the process is running.
The Bottom Line
Most uptime monitors are simple, honest tools that happen to be bad at the core job: telling you when your users can't reach your service.
A single probe making a single request from a single region is not a reliable way to answer "is my site down?" It's a coin flip with better odds. False positives are the predictable, inevitable result.
The fix — multi-region consensus — is not a new idea. It's just not implemented in most monitoring tools because it requires running distributed probe infrastructure across multiple regions and network providers. It's genuinely harder to build than a single-probe poller.
When you're evaluating uptime monitoring tools, ask one question: how many independent probes must agree before an alert fires?
If the answer is "one," you're going to get paged at 2am for something that isn't broken.
Vigilmon uses multi-region consensus to eliminate false alerts — try it free at vigilmon.online
No credit card required. 5 monitors free. Only alerts when 3+ nodes agree something is genuinely down.