SLA Monitoring Guide: How to Track Uptime SLAs in 2026

If your service has a Service Level Agreement (SLA), you've made a promise about uptime. Tracking whether you're meeting that promise — and knowing fast when you're about to break it — is the job of SLA monitoring. This guide explains what uptime SLAs mean, how to calculate the downtime budgets they imply, and how to use Vigilmon to track and report SLA compliance in 2026.

What Is an SLA?

A Service Level Agreement is a formal commitment about service availability, performance, or quality — typically expressed as a percentage of time the service will be reachable and functional. SLAs appear in:

B2B SaaS contracts: "We guarantee 99.9% uptime per calendar month."
API provider agreements: "The API will be available 99.95% of the time, excluding scheduled maintenance."
Internal engineering commitments: "The payments service will maintain 99.99% availability per quarter."
Cloud provider terms: AWS, GCP, and Azure all publish SLAs for individual services.

The three nines and four nines formulations are the most common benchmarks, but what do they actually mean in terms of real downtime?

SLA Percentages: What They Actually Allow

The math on SLA percentages surprises most developers the first time they work it out.

| SLA | Allowed downtime per month | Allowed downtime per year |
|---|---|---|
| 99% | ~7 hours 18 minutes | ~3 days 15 hours |
| 99.5% | ~3 hours 39 minutes | ~1 day 19 hours |
| 99.9% ("three nines") | ~43 minutes | ~8 hours 46 minutes |
| 99.95% | ~21 minutes | ~4 hours 23 minutes |
| 99.99% ("four nines") | ~4 minutes | ~52 minutes |
| 99.999% ("five nines") | ~26 seconds | ~5 minutes |

A 99.9% SLA — common in mid-market SaaS — allows just 43 minutes of downtime per month. A single undetected outage that lasts an hour violates it. A 99.99% SLA gives you a budget of only 4 minutes per month — less than the time it takes most monitoring systems to alert and for an on-call engineer to acknowledge.
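
The figures in the table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `downtime_budget` is made up for this example, and it assumes an "average month" of 365/12 days, which appears to be what the monthly column is based on. A strict 30-day month gives slightly smaller numbers (43.2 minutes at 99.9%).

```python
from datetime import timedelta

# Assumption: an "average month" is one twelfth of a 365-day year.
AVERAGE_MONTH = timedelta(days=365) / 12
YEAR = timedelta(days=365)


def downtime_budget(sla_percent: float, period: timedelta) -> timedelta:
    """Return the maximum downtime an SLA percentage allows over a period."""
    if not 0 < sla_percent <= 100:
        raise ValueError("sla_percent must be in the range (0, 100]")
    allowed_fraction = 1 - sla_percent / 100
    return timedelta(seconds=period.total_seconds() * allowed_fraction)


if __name__ == "__main__":
    for sla in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
        monthly = downtime_budget(sla, AVERAGE_MONTH)
        yearly = downtime_budget(sla, YEAR)
        print(f"{sla}%: {monthly} per month, {yearly} per year")
```

If your contract defines the measurement window differently (calendar month, rolling 30 days, quarter), pass that window as `period` instead.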

The gap between what teams think their SLA allows and what it actually allows is where SLA breaches happen.

Error Budgets: Spending Your Downtime Allowance

Google's SRE book popularized the concept of the error budget: instead of thinking of the SLA as a constraint to avoid, treat it as a budget to spend.

If your SLA is 99.9%, your error budget for the month is 43 minutes. That budget can be spent on:

Unplanned outages (detected, diagnosed, and resolved)
Planned maintenance windows
Partial degradation events (some users affected, not total outage)
Deployment-related downtime

Tracking your error budget in real time changes how you think about risk. If you've already spent 35 of your 43 monthly minutes, you don't deploy a risky change at 4 PM on a Friday — you wait until the budget resets.
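
One way to keep that number visible is to compute it from the downtime recorded so far. This is a minimal sketch, not a Vigilmon feature: `ErrorBudget` and its fields are hypothetical names, and it assumes a 30-day period (43,200 minutes) with downtime already summed in minutes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    budget_minutes: float     # total downtime the SLA allows in the period
    spent_minutes: float      # downtime recorded so far
    remaining_minutes: float  # budget left; negative means the SLA is breached
    spent_fraction: float     # 0.0 = untouched, 1.0 = fully spent


def error_budget(sla_percent: float, downtime_minutes: float,
                 period_minutes: float = 43_200) -> ErrorBudget:
    """Summarize how much of the period's error budget has been used."""
    if not 0 < sla_percent < 100:
        raise ValueError("sla_percent must be strictly between 0 and 100")
    budget = period_minutes * (1 - sla_percent / 100)
    return ErrorBudget(
        budget_minutes=budget,
        spent_minutes=downtime_minutes,
        remaining_minutes=budget - downtime_minutes,
        spent_fraction=downtime_minutes / budget,
    )


if __name__ == "__main__":
    status = error_budget(sla_percent=99.9, downtime_minutes=35)
    print(f"{status.remaining_minutes:.1f} minutes left, "
          f"{status.spent_fraction:.0%} of the budget spent")
```

With 35 minutes already spent against a 99.9% SLA, roughly 8 minutes remain, which is the situation where holding a risky deploy makes sense.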

SLA vs SLO vs SLI: Getting the Terminology Right

These three acronyms often get conflated:

SLI (Service Level Indicator): The raw measurement. "Current availability is 99.97% over the last 30 days." An SLI is a metric.

SLO (Service Level Objective): An internal target. "We aim to maintain 99.9% availability." An SLO is a goal you set for yourself — often more ambitious than the external SLA.

SLA (Service Level Agreement): The external commitment. "We contractually guarantee 99.9% availability." Breaking an SLA typically has consequences: credits, contract clauses, customer churn.

In practice, set your SLO stricter than your SLA. If your SLA is 99.9%, set your internal SLO at 99.95%. The SLO then works as an early warning: missing it tells you to act while the SLA itself is still intact.
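
The relationship between the three terms can be expressed as a simple comparison. The function and status labels below are hypothetical, chosen only to illustrate the idea: the SLI is the measured value, and it is checked first against the stricter internal SLO and then against the contractual SLA.

```python
def classify_sli(sli_percent: float, slo_percent: float, sla_percent: float) -> str:
    """Compare a measured SLI against the internal SLO and the external SLA."""
    if slo_percent < sla_percent:
        # An SLO looser than the SLA cannot act as an early warning.
        raise ValueError("SLO should be at least as strict as the SLA")
    if sli_percent >= slo_percent:
        return "healthy"        # meeting the internal goal
    if sli_percent >= sla_percent:
        return "slo_missed"     # early warning: SLA still intact
    return "sla_breached"       # contractual commitment broken


if __name__ == "__main__":
    for measured in (99.97, 99.92, 99.88):
        print(measured, classify_sli(measured, slo_percent=99.95, sla_percent=99.9))
```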

How to Track SLAs with Vigilmon

Step 1: Add Your Monitored Endpoints

Vigilmon can monitor three kinds of targets:

HTTP/HTTPS endpoints — REST APIs, web applications, health check URLs
TCP ports — database connections, message brokers, internal services
Cron job heartbeats — scheduled workers and background jobs

For SLA tracking, create a dedicated monitor for each SLA-covered service with a descriptive name. Example: "Payments API — SLA Monitor" rather than "payments-api-prod".

Step 2: Configure Check Intervals Appropriately

Your check interval determines how quickly you detect a failure — and therefore how much of your error budget burns before you know something is wrong.

For a 99.99% SLA (about 4 minutes of budget per month), a 5-minute check interval is too coarse: if an outage starts right after a check passes, the whole month's downtime budget can be gone before the next check even runs. Use 1-minute intervals for SLA-critical services.

Vigilmon's paid plans support 1-minute check intervals. If you're managing against four-nines SLAs, the math on check frequency matters.
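
The check-frequency math can be sketched as follows. This is a simplified model with stated assumptions, not a description of how Vigilmon schedules checks: the outage is assumed to start immediately after a passing check, an alert is assumed to fire after `failures_to_confirm` consecutive failed checks (a hypothetical parameter), and notification and human response time are ignored.

```python
def worst_case_detection_minutes(check_interval_minutes: float,
                                 failures_to_confirm: int = 1) -> float:
    """Longest an outage can run before an alert fires, under the model above."""
    return check_interval_minutes * failures_to_confirm


def budget_fraction_lost_before_alert(sla_percent: float,
                                      check_interval_minutes: float,
                                      failures_to_confirm: int = 1,
                                      period_minutes: float = 43_200) -> float:
    """Share of the period's error budget that can burn before detection."""
    budget = period_minutes * (1 - sla_percent / 100)
    delay = worst_case_detection_minutes(check_interval_minutes, failures_to_confirm)
    return delay / budget


if __name__ == "__main__":
    for interval in (5, 1):
        lost = budget_fraction_lost_before_alert(99.99, interval)
        print(f"{interval}-minute checks: up to {lost:.0%} of the monthly "
              f"budget gone before the first alert")
```

A value above 100% means a single incident can breach the SLA before anyone is notified, which is the case for 5-minute checks against a four-nines target.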

Step 3: Configure Alerts to Minimize Response Time

SLA compliance depends on mean time to detect (MTTD) and mean time to resolve (MTTR). Faster detection means less error budget burned per incident.

Configure Vigilmon webhook alerts to route directly to:

Slack or Discord for team visibility
PagerDuty or OpsGenie for on-call paging
Custom endpoints in your incident management workflow
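
As an illustration of the custom-endpoint route, the sketch below turns an incoming alert into a Slack message. The alert fields (`monitor_name`, `status`, `url`) are assumptions made for this example and are not taken from Vigilmon's documentation, so check the actual webhook payload before relying on them. The Slack side uses the standard incoming-webhook format, a JSON body with a `text` field.

```python
import json
import urllib.request


def build_slack_message(alert: dict) -> dict:
    """Convert a (hypothetical) alert payload into a Slack incoming-webhook body."""
    name = alert.get("monitor_name", "unknown monitor")
    status = str(alert.get("status", "unknown")).upper()
    url = alert.get("url", "")
    return {"text": f"[{status}] {name} {url}".strip()}


def forward_to_slack(alert: dict, slack_webhook_url: str, timeout: float = 5.0) -> int:
    """POST the formatted alert to Slack and return the HTTP status code."""
    body = json.dumps(build_slack_message(alert)).encode("utf-8")
    request = urllib.request.Request(
        slack_webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    example = {"monitor_name": "Payments API", "status": "down",
               "url": "https://example.com/health"}
    print(build_slack_message(example))
```

Keeping the formatting in a pure function makes it easy to test without sending anything over the network.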

Vigilmon's multi-region consensus means alerts fire only when a genuine outage is detected, not when a single probe has a bad moment. This matters for SLA management: every false positive that triggers an incident response wastes on-call time that should go to real incidents.

Step 4: Track Response Time History

SLAs often include performance commitments, not just availability. "99.9% uptime with p95 response time under 500ms" is a common API SLA structure.

Vigilmon's response time history dashboard shows your endpoint's performance over time with color-coded latency bands. This lets you identify degradation trends before they become SLA violations:

Green: Normal response time
Yellow: Elevated latency — potential early warning
Red: Response time exceeding threshold — SLA risk

Check response time history weekly when managing against performance SLAs.
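
If you export response time samples, a p95 check against a latency commitment can be sketched like this. The function names are hypothetical, the 500 ms default mirrors the example SLA above, and the percentile uses the nearest-rank method, which is only one of several common definitions. Confirm which definition your SLA uses.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], percentile: float) -> float:
    """Return the given percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least `percentile` percent
    # of all samples at or below it.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]


def meets_latency_target(samples_ms: Sequence[float],
                         target_ms: float = 500,
                         percentile: float = 95) -> bool:
    """True if the chosen percentile of response times is within the target."""
    return percentile_nearest_rank(samples_ms, percentile) <= target_ms


if __name__ == "__main__":
    samples = [120, 135, 140, 150, 180, 210, 230, 260, 480, 900]
    print(percentile_nearest_rank(samples, 95), meets_latency_target(samples))
```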

Step 5: Calculate SLA Compliance from Downtime Data

Vigilmon's uptime data gives you the raw material for SLA reporting. To calculate monthly availability:

Availability % = ((total_minutes - downtime_minutes) / total_minutes) × 100

For a 30-day month: total_minutes = 43,200

If you had two incidents totaling 25 minutes of downtime:

Availability % = ((43,200 - 25) / 43,200) × 100 = 99.942%

This meets a 99.9% SLA. Three incidents totaling 50 minutes, however, would put you at 99.884%, which is a breach.

Use Vigilmon's uptime data to run this calculation at the end of each reporting period. Export it as evidence for SLA reports to customers, executives, or compliance teams.
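
The same calculation can be run directly on incident start and end times. This sketch assumes you have the incidents as pairs of datetimes and that they do not overlap. The function name and input shape are illustrative, not a Vigilmon export format.

```python
from datetime import datetime
from typing import Iterable, Tuple

Incident = Tuple[datetime, datetime]  # (start, end) of one outage


def availability_percent(incidents: Iterable[Incident],
                         period_start: datetime,
                         period_end: datetime) -> float:
    """Availability over the period, counting only downtime inside it."""
    total_seconds = (period_end - period_start).total_seconds()
    if total_seconds <= 0:
        raise ValueError("period_end must be after period_start")
    down_seconds = 0.0
    for start, end in incidents:
        # Clip each incident to the reporting period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down_seconds += (clipped_end - clipped_start).total_seconds()
    return (total_seconds - down_seconds) / total_seconds * 100


if __name__ == "__main__":
    june = (datetime(2026, 6, 1), datetime(2026, 7, 1))  # a 30-day month
    incidents = [
        (datetime(2026, 6, 3, 10, 0), datetime(2026, 6, 3, 10, 10)),     # 10 minutes
        (datetime(2026, 6, 17, 22, 0), datetime(2026, 6, 17, 22, 15)),   # 15 minutes
    ]
    print(f"{availability_percent(incidents, *june):.3f}%")
```

Clipping incidents to the period keeps an outage that spans a month boundary from being counted twice across two reports.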

SLA Monitoring Best Practices for 2026

Define What Counts as Downtime

Not all failures count equally toward SLA calculations. Define explicitly:

Total outage: Service unreachable from all probe locations — always counts
Partial degradation: Some features unavailable or some regions affected — depends on SLA language
Scheduled maintenance: Most SLAs exclude pre-announced maintenance windows
Third-party failures: Many SLAs exclude downtime caused by upstream provider failures (AWS, DNS, CDN)

Review your SLA language carefully. Track each category separately so you can prove exclusions when needed.
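
Tracking categories separately might look like the sketch below. The category names follow the list above, but the `Incident` type and the choice of which causes are excluded are assumptions for illustration. Your contract decides what actually counts.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Set, Tuple


class Cause(Enum):
    TOTAL_OUTAGE = "total_outage"
    PARTIAL_DEGRADATION = "partial_degradation"
    SCHEDULED_MAINTENANCE = "scheduled_maintenance"
    THIRD_PARTY = "third_party"


@dataclass(frozen=True)
class Incident:
    minutes: float
    cause: Cause


def split_downtime(incidents: Iterable[Incident],
                   excluded_causes: Set[Cause]) -> Tuple[float, float]:
    """Return (counted_minutes, excluded_minutes) for SLA reporting."""
    counted = 0.0
    excluded = 0.0
    for incident in incidents:
        if incident.cause in excluded_causes:
            excluded += incident.minutes
        else:
            counted += incident.minutes
    return counted, excluded


if __name__ == "__main__":
    month = [
        Incident(20, Cause.TOTAL_OUTAGE),
        Incident(30, Cause.SCHEDULED_MAINTENANCE),
        Incident(10, Cause.THIRD_PARTY),
    ]
    # Hypothetical contract: maintenance and upstream failures are excluded.
    exclusions = {Cause.SCHEDULED_MAINTENANCE, Cause.THIRD_PARTY}
    print(split_downtime(month, exclusions))
```

Reporting both numbers, rather than silently dropping the excluded minutes, gives you the evidence to back up each exclusion.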

Set Internal Alerts Before the SLA Threshold

Don't wait for an SLA breach to act. Configure internal alerts at 75% and 90% of your monthly error budget:

75% spent (e.g., about 32 of 43 minutes): Engineering review triggered. What changed recently? What's at risk?
90% spent (e.g., about 39 of 43 minutes): Change freeze. No deploys, no risky operations until the month resets.
100% spent: SLA breach — initiate customer communications and post-mortem
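
The thresholds above map naturally onto a small decision function. This is a sketch under the assumption that you already know the budget and the minutes spent. The action labels are hypothetical names for the three responses listed.

```python
def budget_action(spent_minutes: float, budget_minutes: float) -> str:
    """Map error budget consumption to the response defined for that level."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    fraction = spent_minutes / budget_minutes
    if fraction >= 1.0:
        return "sla_breach"          # customer communications and post-mortem
    if fraction >= 0.90:
        return "change_freeze"       # no deploys until the period resets
    if fraction >= 0.75:
        return "engineering_review"  # review recent changes and risks
    return "ok"


if __name__ == "__main__":
    for spent in (20, 33, 39, 43):
        print(spent, budget_action(spent, budget_minutes=43))
```

Checking the strictest threshold first keeps the function correct if the cut-offs are later tuned.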

Separate Monitoring From Your Application

A monitoring system hosted on the same infrastructure it monitors defeats the purpose. Vigilmon runs as external SaaS with independent probe infrastructure. If your application goes down, Vigilmon's probes are unaffected — they'll detect the failure and alert you even if your entire hosting environment is offline.

Document Your SLA Compliance Trail

For contractual SLAs, maintain a monthly record:

Dates and durations of all incidents
Response time data showing performance SLA compliance
Calculated availability percentages
Notes on maintenance windows and exclusions

Vigilmon's response time history and uptime records supply the raw data. Keep a monthly summary document as evidence for customer conversations, compliance audits, and renewal discussions.

Real-World SLA Scenarios

"We only need 99.9% — that's easy"

99.9% means 43 minutes of downtime per month. A single deploy that goes wrong and takes 30 minutes to roll back spends 70% of your monthly budget in one event. Two such incidents in a month is a breach.

99.9% is not as forgiving as teams assume. Monitor it.

"Our customers are in the US, latency doesn't matter"

Latency SLAs often hide in API documentation without being treated as seriously as availability SLAs. A p95 response time commitment of 200ms is easy to break under load, during a deploy, or if your database query plan changes. Track response time continuously, not just during incidents.

"We'll deal with SLA reporting at the end of the month"

Real-time awareness beats monthly reconciliation. When you know your error budget in real time, you make better deployment and change management decisions throughout the month — not just after a breach.

Getting Started

Sign up at vigilmon.online — no credit card required
Add monitors for all SLA-covered services (HTTP, TCP, heartbeats)
Set check intervals to match how strict each SLA is
Set up webhooks to route alerts to your incident response workflow
Track response time history weekly for performance SLA compliance
Calculate monthly availability from Vigilmon's uptime data and document it

SLA monitoring doesn't require enterprise tooling or a dedicated SRE team. It requires accurate, reliable detection of when services fail — and fast enough alerting to minimize how long they stay down.

Try Vigilmon free at vigilmon.online. No credit card required.

Tags: #monitoring #sla #devops #sre #uptime #reliability