A postmortem is only as good as the data behind it. The timeline reconstruction that takes three hours in a post-incident meeting — "wait, when did the first alarm fire? who noticed it first? when exactly did the fix go out?" — is precisely the part that monitoring data makes definitive rather than anecdotal.
This guide covers how to use Vigilmon's monitoring data — response time history, downtime logs, and webhook delivery records — to write rigorous postmortems, calculate meaningful MTTR, and generate action items that actually prevent recurrence.
Why Good Postmortems Are Hard
Most postmortems fail not because the team lacks intent but because the underlying data is missing or unreliable.
Common failure modes:
- Blurry timelines: "Around 3 AM" isn't useful. Timelines built from memory, Slack messages, and inconsistent log timestamps leave ambiguity that obscures the true cause and impact duration
- Missing blast radius data: "Some users were affected" is not actionable. Without response time metrics and regional availability data, you can't characterize who was affected, how severely, and for how long
- Unmeasured MTTR: Mean time to recovery is a critical SRE metric. If you don't know exactly when the incident started and exactly when recovery was confirmed, MTTR is a guess
- Unverifiable evidence trail: Action items decided in the postmortem meeting need to reference specific evidence, such as the alert timestamp, the webhook payload that shows a failing request, or the response time chart showing latency spikes before the outage. Without that evidence, an action item rests on recollection and cannot be checked later
Vigilmon provides the raw material to fix all of these.
The Data Vigilmon Provides for Postmortems
1. Response Time History
Vigilmon records response times for every probe, from every region, on every check cycle. This gives you a time-series chart of latency from the user's perspective — not from inside your network.
In a postmortem, response time history answers questions like:
- Did latency degrade before the outage? (Degradation precursors often appear in the response time chart 5–15 minutes before an endpoint goes fully down)
- Which regions were affected and when? (If the US region showed normal response times while EU latency spiked, that points to a CDN or routing issue rather than a backend failure)
- When did performance return to baseline after recovery? (The "incident resolved" time in your postmortem should match when response times normalized — not when the on-call engineer closed the ticket)
Access response time history in the Vigilmon dashboard under each monitor's detail view. Export or screenshot the time-series chart covering the incident window for inclusion in your postmortem document.
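One way to check for a degradation precursor is to compare the latency samples from the minutes before the outage against a baseline taken from earlier. The sketch below assumes the response times have been copied into a list of (timestamp, milliseconds) pairs. That layout, the function name `find_precursor`, and the "twice the baseline median" threshold are illustrative assumptions, not Vigilmon's export format or a recommended setting.

```python
from datetime import datetime, timedelta
from statistics import median


def find_precursor(samples, outage_start,
                   lookback=timedelta(minutes=15),
                   baseline_window=timedelta(hours=1),
                   factor=2.0):
    """Return the timestamp of the first slow sample before the outage, or None.

    samples: list of (datetime, latency_ms) pairs (hypothetical layout).
    A sample counts as slow when it exceeds factor x the median latency of the
    baseline window that immediately precedes the lookback window.
    """
    lookback_start = outage_start - lookback
    baseline_start = lookback_start - baseline_window
    baseline = [ms for ts, ms in samples if baseline_start <= ts < lookback_start]
    if not baseline:
        return None  # no history to define "normal" latency
    threshold = factor * median(baseline)
    for ts, ms in sorted(samples):
        if lookback_start <= ts < outage_start and ms > threshold:
            return ts
    return None


# Synthetic data: 120 ms until ten minutes before the outage, then 480 ms.
outage = datetime(2024, 1, 1, 3, 0)
samples = [(outage - timedelta(minutes=m), 120.0) for m in range(75, 10, -1)]
samples += [(outage - timedelta(minutes=m), 480.0) for m in range(10, 0, -1)]
first_slow = find_precursor(samples, outage)
print(first_slow)  # 2024-01-01 02:50:00
```

A median baseline is used here because a single slow probe in the baseline window would skew a mean. The threshold factor is a judgment call and should be tuned against your own latency history.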
2. Downtime Logs
Vigilmon maintains a log of every downtime event per monitor: start time (when consensus failure was first detected), end time (when consensus recovery was confirmed), and duration.
This is your authoritative incident timeline anchor. The start time in the downtime log is when your monitoring system confirmed — from multiple independent regions simultaneously — that the endpoint was unreachable. It is the most reliable proxy for "when did the outage actually begin from the user's perspective."
Key fields in the downtime log for postmortem use:
| Field | Postmortem Use |
|---|---|
| First failure timestamp | Incident start time (confirmed, multi-region) |
| Recovery timestamp | Incident end time (confirmed recovery) |
| Duration | Incident impact duration |
| Affected regions | Blast radius: which users were affected geographically |
| Check interval at time of incident | Resolution of detection (1-minute checks = 1-minute maximum detection lag) |
3. Webhook Delivery History
Vigilmon's webhook delivery history shows every outbound webhook notification: timestamp, payload, delivery status, and HTTP response from the receiving endpoint. This creates an auditable trail of alert delivery.
In a postmortem, webhook history answers:
- Did the alert actually fire when the incident started? (Confirms your alerting pipeline worked correctly — or reveals it didn't)
- Was the webhook delivery successful? (If PagerDuty received the webhook but no one was paged, the problem is in your PagerDuty routing, not Vigilmon)
- What was the exact payload? (Useful for automated post-incident tooling that ingests alert data)
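The first two questions above can be answered mechanically once the delivery records are in hand. The sketch below assumes each record has been loaded as a dict with the keys `sent_at` and `status_code`. Those key names and the function `audit_alert_delivery` are hypothetical, chosen for illustration, and do not describe Vigilmon's actual export schema.

```python
from datetime import datetime


def audit_alert_delivery(deliveries, incident_start):
    """Summarize alert delivery for one incident.

    deliveries: list of dicts with hypothetical keys
      'sent_at'     - datetime the webhook was sent
      'status_code' - HTTP status returned by the receiving endpoint
    """
    after = sorted(
        (d for d in deliveries if d["sent_at"] >= incident_start),
        key=lambda d: d["sent_at"],
    )
    if not after:
        # No webhook was sent at or after the incident start.
        return {"fired": False, "first_attempt_ok": False,
                "eventually_accepted": False, "lag_seconds": None}
    first = after[0]
    return {
        "fired": True,
        # Any 2xx status is treated as an accepted delivery.
        "first_attempt_ok": 200 <= first["status_code"] < 300,
        "eventually_accepted": any(200 <= d["status_code"] < 300 for d in after),
        "lag_seconds": (first["sent_at"] - incident_start).total_seconds(),
    }


# Synthetic example: first attempt rejected, second attempt accepted.
start = datetime(2024, 1, 1, 3, 0)
deliveries = [
    {"sent_at": datetime(2024, 1, 1, 3, 1), "status_code": 500},
    {"sent_at": datetime(2024, 1, 1, 3, 2), "status_code": 200},
]
result = audit_alert_delivery(deliveries, start)
print(result)
```

If `fired` is true and `eventually_accepted` is true but nobody was paged, the evidence points downstream of the webhook, which is the distinction the second question is drawing.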
The gap between the downtime log's first failure timestamp and your incident response team's first response action — findable in Slack history or your incident management tool — is your detection-to-response time. This is different from MTTR and often more actionable as an improvement target.
MTTR Calculation Using Vigilmon Data
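For a single incident, time to recovery runs from incident start to confirmed recovery. Strictly speaking, MTTR is the mean of that value across incidents, though this guide follows common usage and applies the term to the per-incident figure as well. Using Vigilmon data makes the per-incident figure precise: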
MTTR = Recovery timestamp (downtime log) - First failure timestamp (downtime log)
More useful for action items is decomposing MTTR into components:
| Component | Formula | What to Improve |
|---|---|---|
| Time to detect | Alert timestamp - First failure timestamp | Check interval, alert routing |
| Time to respond | First action timestamp - Alert timestamp | On-call SLA, escalation policy |
| Time to resolve | Recovery timestamp - First action timestamp | Runbook quality, deployment speed |
| MTTR | Recovery timestamp - First failure timestamp | Sum of above |
The Vigilmon downtime log provides the two anchors (first failure, recovery). Your incident management tool (PagerDuty, Linear, Slack) provides the middle timestamps. Putting them in a table in your postmortem transforms abstract MTTR into a decomposed diagnostic.
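The decomposition above is plain timestamp subtraction, so it is easy to script. The following is a minimal sketch that assumes the four timestamps have already been collected by hand from the sources named above. The class `IncidentTimestamps` and the functions `decompose` and `mean_time_to_recovery` are illustrative names, not part of any Vigilmon API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimestamps:
    first_failure: datetime  # from the downtime log
    alert_fired: datetime    # from the webhook delivery history
    first_action: datetime   # from the incident management tool
    recovery: datetime       # from the downtime log


def decompose(t):
    """Split one incident's time to recovery into its three components."""
    if not (t.first_failure <= t.alert_fired <= t.first_action <= t.recovery):
        raise ValueError("timestamps out of order; check sources and time zones")
    return {
        "time_to_detect": t.alert_fired - t.first_failure,
        "time_to_respond": t.first_action - t.alert_fired,
        "time_to_resolve": t.recovery - t.first_action,
        "time_to_recovery": t.recovery - t.first_failure,
    }


def mean_time_to_recovery(incidents):
    """Average time to recovery across a non-empty list of incidents."""
    total = sum((i.recovery - i.first_failure for i in incidents), timedelta())
    return total / len(incidents)


# Synthetic incidents for illustration only.
a = IncidentTimestamps(datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 1, 3, 1),
                       datetime(2024, 1, 1, 3, 9), datetime(2024, 1, 1, 3, 42))
b = IncidentTimestamps(datetime(2024, 2, 1, 9, 0), datetime(2024, 2, 1, 9, 2),
                       datetime(2024, 2, 1, 9, 5), datetime(2024, 2, 1, 9, 18))
parts = decompose(a)
print(parts["time_to_detect"], parts["time_to_respond"], parts["time_to_resolve"])
print(mean_time_to_recovery([a, b]))  # 0:30:00
```

Rejecting out-of-order timestamps is a deliberate choice in this sketch: a negative component usually means two sources disagree about time zones, which is worth fixing before the numbers go into a postmortem.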
Postmortem Template Using Vigilmon Data
Here is a structured template that incorporates Vigilmon data sources at each section:
Incident Summary
Date and time: [First failure timestamp from Vigilmon downtime log]
Duration: [Duration from Vigilmon downtime log]
Affected services: [List monitors that went down]
Affected regions: [Regions showing failure in Vigilmon]
Severity: P0 / P1 / P2
Timeline
| Timestamp | Event | Source |
|---|---|---|
| [T+0] | First probe failure detected (multi-region consensus) | Vigilmon downtime log |
| [T+1m] | Alert fired via webhook | Vigilmon webhook history |
| [T+Xm] | On-call engineer acknowledged | PagerDuty / Slack |
| [T+Ym] | Root cause identified | Incident Slack thread |
| [T+Zm] | Fix deployed | CI/CD deployment log |
| [T+Wm] | Recovery confirmed (multi-region consensus) | Vigilmon downtime log |
Use exact timestamps from Vigilmon logs. Do not reconstruct from memory.
Response Time History
[Embed or link to Vigilmon response time chart for affected monitor(s) covering the incident window]
Note: If latency spiked 10 minutes before the full outage, include that window. Degradation precursors are often more actionable than the outage itself for prevention.
Root Cause
[Single sentence describing the root technical cause — not symptoms]
Contributing factors:
- [Factor 1]
- [Factor 2]
Impact
- User-facing impact: [What users experienced — requests failing, timeouts, specific features unavailable]
- Regions affected: [From Vigilmon regional data]
- Duration: [From Vigilmon downtime log]
- Estimated users affected: [If calculable from traffic data]
MTTR Breakdown
| Metric | Value |
|---|---|
| Incident start | [Vigilmon first failure timestamp] |
| Alert fired | [Vigilmon webhook timestamp] |
| Time to detect | [Alert - Start] |
| Time to respond | [First action - Alert] |
| Time to resolve | [Recovery - First action] |
| Total MTTR | [Recovery - Start] |
Action Items
| Action | Owner | Due Date | Prevents |
|---|---|---|---|
| [Specific change] | [Name/team] | [Date] | [Recurrence or impact] |
Good action items reference specific evidence from the postmortem — the response time chart showing degradation 10 minutes before the outage, the webhook log showing alert delivery failed, the MTTR breakdown showing detection lag was 8 minutes on a 5-minute check interval.
Evidence Trail
- Vigilmon downtime log export: [link or attachment]
- Vigilmon webhook delivery log: [link or attachment]
- Response time chart (incident window): [link or attachment]
- Incident Slack thread: [link]
- Deploy log: [link]
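The Timeline table in the template can be generated from the collected timestamps rather than typed by hand, which also forces every source onto one clock. The sketch below assumes each event is a (timestamp, description, source) tuple with a timezone-aware timestamp. The function name `render_timeline` and the output format are illustrative choices.

```python
from datetime import datetime, timezone


def render_timeline(events):
    """Render events as a pipe table with UTC timestamps and T+ offsets.

    events: list of (aware datetime, description, source) tuples.
    Offsets are whole minutes since the earliest event, rounded down.
    """
    rows = ["| Timestamp | Event | Source |", "|---|---|---|"]
    if not events:
        return "\n".join(rows)
    ordered = sorted(events, key=lambda e: e[0])
    start = ordered[0][0]
    for ts, description, source in ordered:
        offset_min = int((ts - start).total_seconds() // 60)
        stamp = ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
        rows.append(f"| {stamp} (T+{offset_min}m) | {description} | {source} |")
    return "\n".join(rows)


# Synthetic events, deliberately listed out of order.
events = [
    (datetime(2024, 1, 1, 3, 1, tzinfo=timezone.utc),
     "Alert fired via webhook", "Vigilmon webhook history"),
    (datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
     "First probe failure detected", "Vigilmon downtime log"),
]
table = render_timeline(events)
print(table)
```

Requiring timezone-aware timestamps is intentional: naive timestamps pasted from different tools are a common source of the inconsistent timelines described earlier in this guide.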
What Rigorous Postmortems Prevent
Teams with good postmortem discipline — precise timelines, measured MTTR, evidence-backed action items — tend to converge on a handful of structural improvements that have compounding value:
- Tighter check intervals: If detection lag is consistently 8+ minutes, dropping from 5-minute to 1-minute check intervals cuts the maximum polling delay by 80% (from 5 minutes to 1); any remaining lag, such as alert delivery time, is unaffected
- Better on-call routing: If webhook delivery succeeds but time to respond is 30 minutes, the problem is escalation policy, not monitoring
- Runbook quality: If time-to-resolve is consistently high, the bottleneck is documentation of what to do, not detection speed
- SSL certificate expiry process: If SSL failures appear in the evidence trail, an automated 30-day expiry alert is a low-cost prevention
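The check-interval arithmetic in the first item can be made explicit. The sketch below uses a simplified model in which worst-case detection lag is one full check interval plus a fixed overhead for everything else (for example, alert delivery). The model and the function names are assumptions for illustration, not a description of how Vigilmon schedules probes.

```python
def worst_case_detection_lag(check_interval_min, fixed_overhead_min=0.0):
    """Worst case: the failure begins just after a check, plus fixed overhead."""
    return check_interval_min + fixed_overhead_min


def relative_reduction(before, after):
    """Fraction by which 'after' is smaller than 'before'."""
    return (before - after) / before


# Polling delay alone: 5-minute checks versus 1-minute checks.
polling_only = relative_reduction(worst_case_detection_lag(5),
                                  worst_case_detection_lag(1))
print(polling_only)  # 0.8

# Same change when 3 of the observed 8 minutes are fixed overhead.
with_overhead = relative_reduction(worst_case_detection_lag(5, 3),
                                   worst_case_detection_lag(1, 3))
print(with_overhead)  # 0.5
```

Under this model the polling portion of detection lag shrinks by 80%, while the total shrinks by less whenever part of the observed lag comes from steps that the check interval does not control. The MTTR breakdown is what tells you which case applies.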
Conclusion
Vigilmon provides the three data sources that make postmortems rigorous rather than anecdotal: response time history that shows degradation before and after the incident, downtime logs that provide authoritative start and end timestamps, and webhook delivery history that verifies your alerting pipeline worked.
With this data, MTTR becomes a measured metric rather than a rough estimate. Timelines become auditable. Action items reference specific evidence. The postmortem stops being a retrospective meeting and starts being an operational improvement cycle.
Set up monitoring for your services at vigilmon.online — 5 monitors, 1-minute intervals, multi-region consensus, response time history, webhook delivery logs. Free tier, no credit card required.