SLA, SLO, and Uptime Monitoring: A Complete Guide for 2026

Every engineering team running a production service eventually faces the question: "How reliable does this need to be?" The answer lives in the vocabulary of SLAs, SLOs, and SLIs: three terms that are often used interchangeably but have distinct meanings.

Getting these concepts right matters. Misdefining an SLO leads to either over-engineering (burning resources chasing 99.999% availability on a service that doesn't need it) or under-engineering (promising 99.9% availability and failing to build the reliability that requires). And for teams with paying customers or enterprise contracts, SLA violations have financial and legal consequences.

This guide covers the definitions, the math, how to measure uptime against your commitments, and how to use Vigilmon data for SLA reporting and SLO dashboards in 2026.


SLA, SLO, and SLI: The Definitions

SLI — Service Level Indicator

An SLI is a quantitative measurement of a service's behavior — the raw metric you observe. SLIs are the inputs that tell you whether you're meeting your targets.

Examples of SLIs:

  • The percentage of HTTP requests that returned a 2xx status code in the last 30 days
  • The P95 response time for API requests over the last 7 days
  • The fraction of cron job executions that completed successfully in the last month
  • The availability percentage of an endpoint, as measured by external probes

An SLI is just a number. It becomes meaningful when compared against a target.

SLO — Service Level Objective

An SLO is an internal target for an SLI — what you commit to achieving within your own team and engineering organization. SLOs are the internal agreement about what "good enough" means for a given service.

Examples of SLOs:

  • "Our API availability SLI will be ≥ 99.9% over any 30-day rolling window"
  • "P95 response time for our checkout API will be ≤ 500ms"
  • "Our nightly backup job will complete successfully ≥ 99.5% of scheduled runs"

SLOs are set by engineering teams and reviewed by product and leadership. They're the targets your engineering decisions are optimized around. Choosing an SLO involves real tradeoffs: a higher SLO requires more redundancy, faster incident response, more careful change management, and ongoing reliability investment.

SLA — Service Level Agreement

An SLA is an external commitment made to customers or partners, typically with contractual and financial consequences for violations. An SLA is a business document that describes what happens when you fail to deliver the promised reliability.

Examples of SLAs:

  • "We guarantee 99.9% monthly uptime. If we fall below this threshold, affected customers receive a 10% service credit for each additional 0.1% below the target."
  • "Our enterprise tier includes a 99.95% monthly uptime SLA. Violations result in credits calculated as a percentage of monthly fees."
  • "API response time will not exceed 2 seconds for 95% of requests. SLA violations are credited at 1 day of service per hour of breach."

A well-designed SLA is derived from your SLOs — you commit externally to a target you're confident you can meet internally, with some buffer. An SLA set at the same level as your SLO leaves no room for error and creates contractual risk for unavoidable incidents.

The Hierarchy

SLI (measurement) → SLO (internal target) → SLA (external commitment)

You measure SLIs constantly, set SLOs based on what those SLIs can reliably deliver, and define SLAs based on SLOs with appropriate headroom. Violations flow in the opposite direction: an SLA violation is discovered through SLO tracking, which is measured through SLIs.


Uptime Percentages: The Math

Uptime is expressed as a percentage, but the practical meaning — how much downtime is actually allowed — depends on the time window.

Monthly Downtime Allowance

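| Uptime % | Monthly downtime allowed |
|---|---|
| 99% | ~7 hours 18 minutes |
| 99.5% | ~3 hours 39 minutes |
| 99.9% | ~43 minutes 50 seconds |
| 99.95% | ~21 minutes 55 seconds |
| 99.99% | ~4 minutes 23 seconds |
| 99.999% | ~26 seconds |

These figures assume an average month of about 30.44 days (365.25 / 12). A 30-day month allows slightly less, for example 43 minutes 12 seconds at 99.9%.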

Annual Downtime Allowance

| Uptime % | Annual downtime allowed |
|---|---|
| 99% | ~3 days 15 hours 39 minutes |
| 99.5% | ~1 day 19 hours 50 minutes |
| 99.9% | ~8 hours 46 minutes |
| 99.95% | ~4 hours 23 minutes |
| 99.99% | ~52 minutes 36 seconds |
| 99.999% | ~5 minutes 16 seconds |
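
Every allowance in these tables comes from one multiplication: the allowed failure fraction times the length of the window. A minimal shell sketch, assuming `awk` is available; the function name `allowed_downtime_minutes` is illustrative and not part of any tool:

```shell
# allowed_downtime_minutes TARGET_PCT PERIOD_DAYS
# Prints the downtime allowance in minutes for an uptime target over a window.
allowed_downtime_minutes() {
  awk -v target="$1" -v days="$2" 'BEGIN {
    printf "%.2f\n", (100 - target) / 100 * days * 24 * 60
  }'
}

allowed_downtime_minutes 99.9 30        # 30-day month      -> 43.20
allowed_downtime_minutes 99.9 30.4375   # average month     -> 43.83
allowed_downtime_minutes 99.99 365.25   # average year      -> 52.60
```

The choice of window length matters: the same 99.9% target allows 43.20 minutes in a 30-day month and 43.83 minutes in an average month, so state the window explicitly wherever you publish a target.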

What These Numbers Mean in Practice

99.9% (three nines): The most common SLO target for SaaS APIs and web applications. Allows about 44 minutes of downtime per month. Achievable with a single-region deployment, good monitoring, and a responsive on-call rotation. A 20-minute incident once a month keeps you on track; a 2-hour incident blows your monthly budget.

99.95%: Common for enterprise-tier SaaS commitments. Allows about 22 minutes per month. Requires faster incident detection and response, typically multi-AZ or redundant infrastructure, and more careful change management (deploy during low-traffic windows, canary deploys).

99.99% (four nines): Serious infrastructure investment required. 4 minutes of monthly downtime means a single deployment rollback that takes 5 minutes violates the SLO. Requires multi-region active-active deployment, zero-downtime deploys, extensive automated testing, and sub-minute incident detection. Typically found in infrastructure services (cloud providers, CDNs, DNS) rather than application-layer SaaS.

99.999% (five nines): The domain of telecoms, financial clearing systems, and critical infrastructure. Extremely expensive to build and operate. Not a realistic target for most application-layer services.

Calculating Your Actual Uptime

Uptime % = (Total minutes in period - Downtime minutes) / Total minutes in period × 100

For a 30-day month (43,200 total minutes):

Uptime % = (43,200 - downtime_minutes) / 43,200 × 100

If you had 2 incidents totaling 38 minutes of downtime:

Uptime % = (43,200 - 38) / 43,200 × 100 = 99.912%

This is above the 99.9% SLO, which allows 43.2 minutes of downtime in a 30-day month, so you're in compliance.
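
The same calculation can be scripted so it runs identically every month. This is a small sketch under the assumption that downtime has already been totaled in minutes; `uptime_pct` and `meets_slo` are hypothetical helper names:

```shell
# uptime_pct PERIOD_MINUTES DOWNTIME_MINUTES
# Prints uptime as a percentage with three decimal places.
uptime_pct() {
  awk -v total="$1" -v down="$2" 'BEGIN {
    printf "%.3f\n", (total - down) / total * 100
  }'
}

# meets_slo UPTIME_PCT TARGET_PCT
# Exit status 0 when the uptime is at or above the target, 1 otherwise.
meets_slo() {
  awk -v u="$1" -v t="$2" 'BEGIN { exit !(u >= t) }'
}

pct=$(uptime_pct 43200 38)   # 30-day month, 38 minutes of downtime
if meets_slo "$pct" 99.9; then
  echo "$pct% - SLO met"
else
  echo "$pct% - SLO violated"
fi
```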


Error Budgets

An error budget is the allowable amount of downtime or failure that falls within your SLO — the complement of your availability target.

For a 99.9% monthly SLO, your error budget is 0.1% of the month:

  • Monthly error budget: ~43 minutes 50 seconds
  • Weekly error budget: ~10 minutes 5 seconds

Error budgets shift the mental model from "are we up or down?" to "how much of our allowance have we used?" This framing is powerful:

  • If you've used 20% of your monthly error budget by day 5: You're burning budget about 1.2× faster than the SLO allows (20% of the budget in 16.7% of the month) and will run out around day 25. Investigate before it gets worse.
  • If you've used 50% of your monthly error budget by day 15: You're burning at exactly the SLO rate. Watch closely.
  • If you've used 0% of your monthly error budget by day 28: You have buffer. A 20-minute incident now wouldn't violate the SLO.
  • If your error budget is depleted on day 10: You have no room left for the rest of the month, and any further downtime violates the SLO. Focus on stabilization; feature work takes lower priority.

Error Budget Burn Rate

The burn rate measures how quickly you're consuming your error budget relative to the expected rate.

  • Burn rate = 1: Consuming budget at exactly the SLO rate. If you continue at this rate, you'll hit the limit exactly at the end of the period.
  • Burn rate > 1: Consuming budget faster than the SLO allows. A burn rate of 2 means you'll exhaust your budget in half the time.
  • Burn rate < 1: Consuming budget slower than the SLO rate. You're ahead of target.

Alert on high burn rates before the budget is exhausted:

| Burn Rate | Alert Urgency | Meaning |
|---|---|---|
| > 14.4× | Page on-call immediately | Uses 2% of a 30-day budget per hour; budget exhausts in about 2 days |
| > 6× | Alert engineering team | Uses 5% of a 30-day budget in 6 hours; budget exhausts in 5 days |
| > 3× | Warning notification | Uses 10% of a 30-day budget in 1 day; budget exhausts in 10 days |
| > 1× | Information | Consuming faster than target |
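
To see what a burn rate implies in wall-clock time, divide the SLO window by the rate: at a sustained burn rate of 14.4, a 30-day budget lasts 50 hours. The sketch below applies that arithmetic and maps a rate to the urgency thresholds used in this section. The function names and the urgency labels (`page`, `alert`, `warning`, `info`, `ok`) are illustrative assumptions:

```shell
# hours_to_exhaustion BURN_RATE WINDOW_DAYS
# Hours until a full budget is gone if the burn rate stays constant.
hours_to_exhaustion() {
  awk -v rate="$1" -v days="$2" 'BEGIN { printf "%.1f\n", days * 24 / rate }'
}

# burn_urgency BURN_RATE
# Maps a burn rate to an urgency label using the thresholds 14.4, 6, 3 and 1.
burn_urgency() {
  awk -v rate="$1" 'BEGIN {
    if (rate > 14.4)    print "page"
    else if (rate > 6)  print "alert"
    else if (rate > 3)  print "warning"
    else if (rate > 1)  print "info"
    else                print "ok"
  }'
}

hours_to_exhaustion 14.4 30   # -> 50.0
hours_to_exhaustion 6 30      # -> 120.0
burn_urgency 7.5              # -> alert
```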


How Vigilmon Data Supports SLA Reporting

Vigilmon's outside-in monitoring provides the raw data for uptime SLA reporting. Every probe check, its result (up or down), and its response time is recorded and queryable through the Vigilmon dashboard and API.

What Vigilmon Measures

  • Check results: Each scheduled check returns a pass (service reachable, correct status code) or fail (connection error, timeout, unexpected status code, body mismatch)
  • Response time: The time from probe to response, measured in milliseconds, recorded per check
  • Downtime periods: Continuous sequences of failed checks from multiple probe regions, timestamped for duration calculation
  • Response time history: Historical trends of response times, useful for P95/P99 latency SLO tracking

Calculating SLA-Compliant Uptime from Vigilmon Data

Using the Vigilmon API to retrieve downtime data for your reporting period:

# Get monitor status history for a date range
curl "https://vigilmon.online/api/v1/monitors/mon_abc123/history?from=2026-06-01&to=2026-06-30" \
  -H "Authorization: Bearer $VIGILMON_API_TOKEN"

The response includes check results with timestamps. Calculate downtime by summing the duration of consecutive failed check windows.
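
Summing consecutive failed windows can be sketched as follows. This assumes the check history has already been flattened to one line per check, sorted by time, with an epoch timestamp followed by `up` or `down`. That line format and the function name `downtime_seconds` are assumptions for illustration, not the shape of the API response:

```shell
# downtime_seconds
# Reads "<epoch_seconds> <up|down>" lines on stdin, sorted by time.
# An outage runs from the first failed check to the next passing check.
# An outage still open at the end of the input is not counted.
downtime_seconds() {
  awk '
    $2 == "down" && start == "" { start = $1 }                        # outage begins
    $2 == "up"   && start != "" { total += $1 - start; start = "" }   # outage ends
    END { print total + 0 }
  '
}

# Two outages: 60 -> 180 (120 s) and 300 -> 360 (60 s)
printf '%s\n' "0 up" "60 down" "120 down" "180 up" "240 up" "300 down" "360 up" \
  | downtime_seconds   # -> 180
```

Measuring from the first failed check to the first passing check slightly overstates or understates the true outage, depending on when within a check interval the service actually failed, so shorter check intervals give tighter numbers.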

Scheduled SLA Reports

Automate monthly SLA reports with a scheduled script that queries the Vigilmon API, calculates the uptime percentage, and prints a report:

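#!/bin/bash
# monthly-sla-report.sh: report last month's uptime for one monitor.
# Requires GNU date, curl, jq, and bc.
set -euo pipefail

MONITOR_ID="mon_abc123"
VIGILMON_TOKEN="${VIGILMON_API_TOKEN:?VIGILMON_API_TOKEN must be set}"
SLO_TARGET="99.9"

# Previous calendar month, anchored on the 1st of the current month
THIS_MONTH_START="$(date +%Y-%m-01)"
MONTH_START="$(date -d "$THIS_MONTH_START -1 month" +%Y-%m-%d)"
MONTH_END="$(date -d "$THIS_MONTH_START -1 day" +%Y-%m-%d)"

# Fetch history (-f makes curl exit non-zero on HTTP errors)
history=$(curl -sf \
  "https://vigilmon.online/api/v1/monitors/$MONITOR_ID/history?from=$MONTH_START&to=$MONTH_END" \
  -H "Authorization: Bearer $VIGILMON_TOKEN")

# Count total and failed checks
total_checks=$(echo "$history" | jq '.checks | length')
failed_checks=$(echo "$history" | jq '[.checks[] | select(.status == "down")] | length')

# Avoid dividing by zero when the API returns no checks
if [ "$total_checks" -eq 0 ]; then
  echo "No checks returned for $MONTH_START to $MONTH_END" >&2
  exit 1
fi

# Multiply before dividing so bc's fixed scale does not truncate the ratio
uptime_pct=$(echo "scale=4; ($total_checks - $failed_checks) * 100 / $total_checks" | bc)

echo "SLA Report: $MONTH_START to $MONTH_END"
echo "Total checks: $total_checks"
echo "Failed checks: $failed_checks"
echo "Uptime: ${uptime_pct}%"
echo "SLO target: ${SLO_TARGET}%"

if (( $(echo "$uptime_pct >= $SLO_TARGET" | bc -l) )); then
  echo "Status: SLO MET"
else
  echo "Status: SLO VIOLATED"
fi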

Reporting Downtime for SLA Credits

When an SLA breach occurs and customers are owed service credits, your SLA report needs:

  1. Start time of incident: When Vigilmon first detected consecutive failures across multiple probe regions
  2. End time of incident: When probes returned to a passing state consistently
  3. Total downtime duration: The difference between start and end times
  4. Affected monitors: Which services were impacted (API, frontend, specific endpoints)
  5. Proof of measurement: An export of Vigilmon check data for the incident period

Vigilmon's response time history and check logs provide the raw evidence for SLA credit calculations. Export the data from the Vigilmon API and include it in your SLA credit communication to affected customers.


Building SLO Dashboards

An effective SLO dashboard answers three questions at a glance:

  1. Are we meeting our current SLO?
  2. How much error budget do we have left this period?
  3. Is our error budget burn rate concerning?

Dashboard Data Sources

Pull the following from Vigilmon for your dashboard:

# Current monitor status
GET /api/v1/monitors/{id}
# Returns: current status, last check time, consecutive failures

# Response time history (for latency SLOs)
GET /api/v1/monitors/{id}/response-times?period=30d
# Returns: array of {timestamp, responseTime} for the last 30 days

# Downtime periods (for availability calculations)
GET /api/v1/monitors/{id}/incidents?from=2026-06-01&to=2026-06-30
# Returns: list of incident periods with start/end timestamps

Key Dashboard Panels

Current Status Panel: Is the service currently up or down? Green/red indicator with time since last state change.

30-Day Uptime Percentage: Calculated from Vigilmon check history. Compare against SLO target.

Error Budget Remaining: Expressed as minutes remaining and as a percentage of total budget.

Error budget remaining = Total monthly allowance - Downtime in current period

For a 99.9% SLO (43.8 min/month) with 12 minutes of downtime so far this month:

Error budget remaining = 43.8 - 12 = 31.8 minutes (72.6% remaining)

Error Budget Burn Rate: How fast are we consuming the budget?

Burn rate = (budget consumed / total budget) / (days elapsed / total days in period)

If we've consumed 72% of our monthly budget by day 15 (50% of the month):

Burn rate = 0.72 / 0.50 = 1.44×

We're consuming budget 44% faster than the target rate.
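
Both panel calculations are short enough to script directly. A minimal sketch using the numbers from the two examples above; `budget_remaining` and `burn_rate` are hypothetical helper names, and the inputs are assumed to be computed elsewhere:

```shell
# budget_remaining BUDGET_MINUTES DOWNTIME_MINUTES
# Prints "<minutes remaining> <percent remaining>".
budget_remaining() {
  awk -v b="$1" -v d="$2" 'BEGIN {
    r = b - d
    printf "%.1f %.1f\n", r, r / b * 100
  }'
}

# burn_rate CONSUMED_FRACTION DAYS_ELAPSED DAYS_IN_PERIOD
# Fraction of budget consumed divided by fraction of the period elapsed.
burn_rate() {
  awk -v c="$1" -v e="$2" -v p="$3" 'BEGIN { printf "%.2f\n", c / (e / p) }'
}

budget_remaining 43.8 12   # -> 31.8 72.6
burn_rate 0.72 15 30       # -> 1.44
```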

Response Time Trend: P50, P95, and P99 response time charts from Vigilmon response time history. Color code against your latency SLO thresholds:

  • Green: P95 < 200ms (target)
  • Yellow: P95 200–500ms (degraded)
  • Red: P95 > 500ms (SLO at risk)
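
If your dashboard tool does not compute percentiles for you, P95 can be derived from raw response times. The sketch below uses the nearest-rank method, which is one of several percentile definitions and may differ slightly from what a given charting tool reports. It assumes one response time in milliseconds per line on stdin; the function names are illustrative:

```shell
# p95
# Reads one response time (ms) per line and prints the nearest-rank P95.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # integer form of ceil(0.95 * NR)
      print v[rank]
    }'
}

# latency_color P95_MS
# Applies the thresholds above: under 200 green, 200 to 500 yellow, over 500 red.
latency_color() {
  awk -v p="$1" 'BEGIN {
    if (p < 200)       print "green"
    else if (p <= 500) print "yellow"
    else               print "red"
  }'
}

value=$(printf '%s\n' 150 90 300 120 110 | p95)   # -> 300
latency_color "$value"                            # -> yellow
```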

Incident History: Timeline of downtime periods in the current month, with duration labels.


Alerting on Error Budgets

Configure alerts that fire before your SLO is violated, not after. The goal is to catch high error budget burn rates while you still have budget remaining.

Alert Thresholds for a 99.9% Monthly SLO

| Alert | Condition | Action |
|---|---|---|
| Burn rate critical | > 14.4× (budget exhausts in about 2 days) | Page on-call immediately |
| Burn rate high | > 6× (budget exhausts in 5 days) | Notify engineering team, assess incident |
| Burn rate elevated | > 3× (budget exhausts in 10 days) | Engineering team awareness, review changes |
| Budget 50% consumed | 50% of monthly budget used | Team awareness, review trends |
| Budget 80% consumed | 80% of monthly budget used | Freeze non-critical deployments |
| SLO violated | Monthly availability drops below 99.9% | Customer communication, post-mortem required |

Vigilmon Webhook for Alert Integration

Use Vigilmon webhooks to feed downtime events into your SLO alerting pipeline:

{
  "event": "monitor_down",
  "monitor": {
    "id": "mon_abc123",
    "name": "Production API",
    "url": "https://api.yourapp.com/health"
  },
  "downtime": {
    "startedAt": "2026-06-15T14:23:00Z",
    "confirmedRegions": ["us-east", "eu-west", "ap-southeast"]
  }
}

Your incident management system (PagerDuty, Opsgenie) receives this webhook and:

  1. Opens an incident with priority based on the monitor tier
  2. Updates your SLO dashboard with the start of a downtime period
  3. Triggers error budget burn rate recalculation
  4. Escalates if burn rate exceeds critical threshold

SLA Communication to Customers

When your service falls below an SLA threshold, customer communication requires:

Immediate Communication (During Incident)

Use your status page (Vigilmon's embeddable status badge can drive this) to communicate real-time status. Update every 30 minutes with:

  • What is affected
  • Current status of resolution efforts
  • Estimated time to resolution (if known)

Post-Incident SLA Report

Within 5 business days of an SLA-violating incident, send affected customers:

  1. Incident summary: What happened, what the root cause was
  2. Timeline: Exact start and end times (from Vigilmon check logs)
  3. Impact duration: Total downtime duration in minutes
  4. Uptime percentage for the period: Compared against SLA threshold
  5. Credit calculation: Based on your SLA credit schedule
  6. Remediation steps: What you've done to prevent recurrence

SLA Credit Tiers (Common Structure)

| Monthly Uptime Achieved | Service Credit |
|---|---|
| 99.9% or higher | 0% (SLA met) |
| 99.0% to below 99.9% | 10% of monthly fee |
| 95.0% to below 99.0% | 25% of monthly fee |
| Below 95.0% | 50% of monthly fee |

Set your credit schedule based on the revenue risk and customer expectations in your market. Enterprise SaaS typically has more generous credits than SMB tiers.
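
A tiered schedule like this is easy to encode so that credits are calculated the same way every time. The sketch below hard-codes the example tiers from this section, treating each lower bound as inclusive; the function names and the 2,000 monthly fee are assumptions for illustration, and your contract's tiers should replace these values:

```shell
# service_credit_pct UPTIME_PCT
# Returns the credit percentage for the example tiers in this section.
service_credit_pct() {
  awk -v u="$1" 'BEGIN {
    if (u >= 99.9)      print 0
    else if (u >= 99.0) print 10
    else if (u >= 95.0) print 25
    else                print 50
  }'
}

# credit_amount MONTHLY_FEE CREDIT_PCT
credit_amount() {
  awk -v fee="$1" -v pct="$2" 'BEGIN { printf "%.2f\n", fee * pct / 100 }'
}

pct=$(service_credit_pct 99.5)   # -> 10
credit_amount 2000 "$pct"        # -> 200.00
```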


Common SLO Mistakes

Setting SLOs without baseline data: Don't set a 99.9% SLO before measuring what your service actually achieves. If you're currently delivering 99.7% uptime, committing to 99.9% externally before you've closed the gap creates immediate SLA risk.

No scheduled maintenance exclusions: Define whether your SLA/SLO calculations exclude pre-announced maintenance windows. Most SLAs exclude downtime that was announced at least 24-48 hours in advance. Document this clearly in your SLA.
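
How an exclusion is applied changes the result, so it helps to write the arithmetic down. One common convention, assumed here rather than universal, removes the announced maintenance window from both the measured period and the downtime total. The function name is hypothetical:

```shell
# uptime_excluding_maintenance PERIOD_MIN MAINTENANCE_MIN UNPLANNED_DOWN_MIN
# Uptime over scheduled service time only (period minus announced maintenance).
uptime_excluding_maintenance() {
  awk -v t="$1" -v m="$2" -v d="$3" 'BEGIN {
    scheduled = t - m
    printf "%.3f\n", (scheduled - d) / scheduled * 100
  }'
}

# 30-day month, 120 min announced maintenance, 30 min unplanned downtime
uptime_excluding_maintenance 43200 120 30   # -> 99.930
# Same month with the maintenance window counted as downtime
uptime_excluding_maintenance 43200 0 150    # -> 99.653
```

In this example the exclusion is the difference between meeting and missing a 99.9% target, which is why the rule belongs in the SLA text itself.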

Single-probe measurement: Using a monitoring tool that measures from a single probe location means a probe-side anomaly (not a real outage) counts against your SLA calculation. Tools like Vigilmon that use multi-region consensus alerting — where downtime is only recorded when multiple independent probes agree — give you cleaner SLA data with fewer false incidents.

Setting the same target for all services: Not every service needs the same SLO. An internal admin dashboard has different availability requirements than your customer-facing payment API. Tiered SLOs — where critical paths have higher targets and non-critical services have lower ones — align engineering investment with business value.

Measuring uptime from inside your infrastructure: An inside-out health check (a service calling its own /health endpoint) doesn't catch network-level outages, routing failures, or CDN problems. Measure your SLIs from outside-in probes that simulate the path a real user request takes.


SLA/SLO Quick Reference

| Term | What It Is | Who Sets It | Consequences of Miss |
|---|---|---|---|
| SLI | Raw measurement (e.g., 99.92% uptime) | Engineering | Data only |
| SLO | Internal target (e.g., ≥ 99.95% uptime) | Engineering + Product | Internal priority / error budget depleted |
| SLA | External commitment (e.g., ≥ 99.9% uptime) | Business + Legal | Credits, legal exposure |

| Uptime Target | Monthly Downtime Budget | Suitable For |
|---|---|---|
| 99% | ~7h 18m | Internal tools, dev environments |
| 99.5% | ~3h 39m | Non-critical SaaS features |
| 99.9% | ~43m 50s | Standard SaaS API / web app |
| 99.95% | ~21m 55s | Enterprise-tier SaaS |
| 99.99% | ~4m 23s | Infrastructure, payments, auth |
| 99.999% | ~26s | Telecoms, financial clearing |


Conclusion

SLAs, SLOs, and SLIs are the language of reliability engineering — they turn vague commitments ("we try to keep it up") into measurable, reviewable, improvable targets. The math is straightforward: pick an uptime percentage, understand the downtime allowance it implies, set an error budget, and monitor burn rate to catch problems before the budget is exhausted.

Outside-in monitoring with multi-region consensus, the kind Vigilmon provides, gives you SLI data that reflects what users actually experience. Because downtime is recorded only when probes in multiple independent geographic locations fail, your downtime records reflect real outages, not monitoring tool anomalies. That accuracy matters when SLA credits and customer trust are on the line.

Start measuring your uptime SLIs today with Vigilmon's free tier at vigilmon.online — permanent free tier, multi-region consensus alerting, response time history, and webhook notifications included.

