How to Set Up Effective On-Call Alerts with Vigilmon

Bad on-call setups have a predictable failure mode: too many alerts, too little signal, and engineers who stop responding with urgency because they've been burned by false positives too many times. When the real P1 eventually arrives, the response is slow.

This guide covers how to configure Vigilmon for on-call workflows that actually work: the right alert channels, escalation design, noise suppression, threshold design, and the multi-step verification that separates real outages from probe-side glitches.


Why Alert Quality Is the First Problem

Before configuring any specific channel or threshold, understand the core principle: the goal of on-call alerting is to page someone only when their immediate attention is required.

This sounds obvious, but most monitoring setups fail at it for three reasons:

  1. Single-probe checking fires alerts when one monitoring node encounters a network anomaly, regardless of whether the service is actually down for users
  2. Overly sensitive thresholds trigger on temporary spikes that resolve before any human can act
  3. Missing deduplication floods on-call channels with repeated alerts for the same ongoing incident

Vigilmon's architecture addresses point 1 structurally: every check requires multi-region consensus. Several independent probe nodes must each confirm that a service is unreachable before any alert fires, and a failure seen by only one node is silently discarded. This removes the most common cause of on-call noise before it reaches anyone's pager.

Points 2 and 3 are configuration problems — solvable with the right setup.


Alert Channels: Choosing the Right Delivery Path

Vigilmon supports multiple alert delivery mechanisms. The right on-call setup uses more than one.

Email Alerts

Email is Vigilmon's default alert channel. It's reliable, auditable, and appropriate for non-urgent notifications — slower-moving monitors, SSL expiry notices, or secondary recipients who need awareness without being paged.

Email is not the right primary channel for on-call. Response latency is too variable, and most engineers don't have push notifications configured for email at 2 AM.

Configure email for: secondary recipients, business-hours monitoring notifications, daily digest summaries

Webhook Alerts

Vigilmon's webhook integration is the most flexible delivery mechanism. Configure a webhook endpoint and Vigilmon will POST a structured JSON payload on every status change — up-to-down and down-to-up transitions, including monitor name, check time, and status details.
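
As a rough illustration of what a receiver does with that POST body, the sketch below decodes it into a small typed record. The field names (`monitor_name`, `status`, `checked_at`, `details`) are assumptions made for this example, not Vigilmon's documented schema; check a real test payload and rename them accordingly.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusChange:
    monitor: str
    status: str      # "up", "down", or "degraded" in this sketch
    checked_at: str  # timestamp string, passed through unparsed
    details: str


def parse_status_change(body: bytes) -> StatusChange:
    """Decode one webhook POST body into a StatusChange.

    All JSON keys used here are hypothetical placeholders.
    """
    data = json.loads(body)
    return StatusChange(
        monitor=data["monitor_name"],
        status=data["status"].lower(),
        checked_at=data["checked_at"],
        details=data.get("details", ""),
    )


# Hand-written sample body, not captured from a real webhook.
example = b'{"monitor_name": "api-prod", "status": "DOWN", "checked_at": "2024-01-01T03:00:00Z"}'
event = parse_status_change(example)
```

Normalizing the status to lowercase once, at the edge, keeps every downstream mapping table simple.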

Use webhooks to integrate with:

  • Slack — post directly to a #incidents channel or a dedicated on-call channel
  • Discord — same approach for teams using Discord
  • Custom alerting pipelines — if you have an internal alert router, fan out to multiple destinations from a single Vigilmon webhook
  • PagerDuty via webhook: route alerts to PagerDuty's Events API using the PagerDuty integration key, through a small relay that reshapes the payload

Slack Integration via Webhook

To connect Vigilmon to Slack:

  1. In your Slack workspace, create an Incoming Webhook for your #alerts or #incidents channel
  2. Copy the Slack webhook URL (format: https://hooks.slack.com/services/...)
  3. In Vigilmon, open the monitor settings and paste the webhook URL
  4. Test with Vigilmon's "Send test alert" feature to confirm the message appears in Slack
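
The steps above need no code when Vigilmon posts straight to Slack. If alerts instead pass through your own relay or alert router, the relay has to build the Slack message itself. A minimal sketch, assuming only that Slack Incoming Webhooks accept a JSON body with a `text` field; the function name, the emoji choice, and the placeholder URL are illustrative.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, monitor: str, status: str) -> urllib.request.Request:
    """Build (but do not send) a POST for a Slack Incoming Webhook."""
    icon = ":red_circle:" if status == "down" else ":large_green_circle:"
    body = {"text": f"{icon} {monitor} is {status.upper()}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; the real one comes from step 2.
req = build_slack_request("https://hooks.slack.com/services/EXAMPLE", "api-prod", "down")
# To actually send it: urllib.request.urlopen(req, timeout=5)
```

Building the request separately from sending it makes the formatting easy to unit-test without touching the network.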

Format tip: Slack messages from Vigilmon arrive as structured text. For better visibility, configure your #incidents channel as a dedicated alert destination rather than posting to a busy general engineering channel.

PagerDuty Integration

PagerDuty's Events API accepts alerts via webhook POST. To integrate:

  1. In PagerDuty, create a new service and add an Events API v2 integration
  2. Copy the Integration Key
  3. Set up a relay endpoint (a small serverless function or proxy) that accepts Vigilmon's webhook payload and forwards it to PagerDuty's Events API at https://events.pagerduty.com/v2/enqueue
  4. Map Vigilmon's status field to PagerDuty's severity field: down → critical, degraded → warning
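
The translation step inside the relay might look like the sketch below. It builds a PagerDuty Events API v2 body (`routing_key`, `event_action`, `dedup_key`, and a `payload` with `summary`, `source`, `severity`). The status strings on the Vigilmon side and the `vigilmon-<monitor>` dedup key format are assumptions for illustration.

```python
# down -> critical, degraded -> warning, as in step 4.
SEVERITY_BY_STATUS = {"down": "critical", "degraded": "warning"}


def to_pagerduty_event(routing_key: str, monitor: str, status: str) -> dict:
    """Translate one status change into an Events API v2 request body.

    `status` is assumed to be "up", "down", or "degraded"; any other
    value raises KeyError so that unmapped states are noticed early.
    """
    event = {
        "routing_key": routing_key,          # the Integration Key from step 2
        "dedup_key": f"vigilmon-{monitor}",  # hypothetical key format
    }
    if status == "up":
        # A recovery resolves the incident opened under the same dedup_key.
        event["event_action"] = "resolve"
        return event
    event["event_action"] = "trigger"
    event["payload"] = {
        "summary": f"{monitor} is {status}",
        "source": monitor,
        "severity": SEVERITY_BY_STATUS[status],
    }
    return event
```

Reusing one `dedup_key` per monitor means repeated down events for the same monitor attach to a single open incident, and the later up event closes it.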

This setup gives you full PagerDuty escalation policies, on-call rotations, and acknowledgement workflows driven by Vigilmon's outage detection.

Opsgenie Integration

Opsgenie (Atlassian) follows the same pattern as PagerDuty — an Opsgenie Alert API integration key and a small relay layer to format Vigilmon's webhook payload into Opsgenie's alert schema. The relay can be an AWS Lambda function, a Cloudflare Worker, or a small Express endpoint.
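
A sketch of the equivalent formatting step for Opsgenie, assuming the Alert API's create-alert endpoint with `GenieKey` authorization. The mapping from status to priority (down to P1, everything else to P3) and the alias format are choices made for this example, not something Vigilmon or Opsgenie prescribes.

```python
import json
import urllib.request

OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"
PRIORITY_BY_STATUS = {"down": "P1", "degraded": "P3"}  # illustrative mapping


def build_opsgenie_request(api_key: str, monitor: str, status: str) -> urllib.request.Request:
    """Build (but do not send) a create-alert request for Opsgenie."""
    body = {
        "message": f"{monitor} is {status}"[:130],  # message is limited to 130 characters
        "alias": f"vigilmon-{monitor}",             # hypothetical alias, used for deduplication
        "priority": PRIORITY_BY_STATUS.get(status, "P3"),
    }
    return urllib.request.Request(
        OPSGENIE_ALERTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"GenieKey {api_key}",
        },
        method="POST",
    )
```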


Escalation Policy Design

A well-designed escalation policy ensures that every alert eventually reaches someone, even if the primary on-call contact doesn't respond.

The Baseline Escalation Stack

For most teams:

  1. Primary on-call — Slack DM + PagerDuty push notification, 5-minute acknowledgement window
  2. Secondary on-call — PagerDuty escalation if primary doesn't acknowledge within the window
  3. Engineering manager — escalated if both primary and secondary are unresponsive after 15 minutes
  4. Email to team distribution list — always fires in parallel for record-keeping

Configure Vigilmon's webhook to fire immediately on monitor transition to down. Route that webhook to your Slack channel for team visibility, and simultaneously to PagerDuty to trigger the escalation chain.
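
In practice the timing of the stack lives in PagerDuty's escalation policy rather than in code, but writing it out as data makes the windows easy to review. A minimal sketch using the 5-minute and 15-minute marks from the list above; the type and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    target: str
    starts_at_min: int  # minutes after the alert fires, if still unacknowledged


POLICY = [
    EscalationStep("primary on-call", 0),
    EscalationStep("secondary on-call", 5),
    EscalationStep("engineering manager", 15),
]


def notified_so_far(minutes_unacked: int) -> list[str]:
    """Everyone who has been paged after this many unacknowledged minutes."""
    return [step.target for step in POLICY if minutes_unacked >= step.starts_at_min]
```

For example, `notified_so_far(7)` returns the primary and secondary contacts, and the manager is added once the value reaches 15.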

Alert Suppression by Time Window

Not all monitoring alerts warrant an immediate on-call page at 3 AM. Consider:

  • Low-traffic services — a dev environment or staging monitor that goes down overnight is a morning notification, not an emergency page
  • Maintenance windows — schedule Vigilmon monitoring pauses for known maintenance periods to avoid false escalations during planned work
  • Rate-limiting escalations — if a flapping service triggers 20 alerts in an hour, PagerDuty's deduplication rules should collapse these into a single incident

Configure separate Vigilmon monitors for production vs. non-production, and route them to different alert channels with different escalation policies. Production down events always page. Non-production down events go to a Slack channel with no escalation.
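
If a single relay receives alerts for every environment, the routing rule above can be sketched as a small pure function. The environment labels, the return values, and the idea of passing maintenance windows as a list of time ranges are all assumptions for illustration; Vigilmon's own monitor pauses would normally handle the maintenance case before any webhook fires.

```python
from datetime import datetime, timezone


def route_alert(
    environment: str,
    now: datetime,
    maintenance_windows: list[tuple[datetime, datetime]],
) -> str:
    """Return "suppress", "page", or "slack-only" for a down event."""
    # Planned work: drop the alert entirely.
    for start, end in maintenance_windows:
        if start <= now < end:
            return "suppress"
    # Production always pages; everything else is a channel message.
    if environment == "production":
        return "page"
    return "slack-only"


window = (
    datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
)
decision = route_alert("staging", datetime(2024, 1, 1, 4, 0, tzinfo=timezone.utc), [window])
```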


Alert Threshold Design

Check Intervals

Vigilmon's check intervals determine how quickly outage detection occurs:

  • Free tier: 5-minute intervals — appropriate for services where a 5-minute detection window is acceptable
  • Paid plans: 1-minute intervals — tighter detection for production services where every minute of undetected downtime has measurable cost

For on-call setups, 1-minute intervals are strongly recommended for any service covered by an SLA or with significant revenue exposure. Five minutes of undetected downtime is five additional minutes of customer impact before anyone is paged.
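
The arithmetic behind that recommendation can be made explicit. The sketch below assumes an outage starts at a uniformly random moment between two checks and that the first failing check raises the alert; it ignores the time taken by the check itself and by cross-region confirmation, so real figures will be somewhat higher.

```python
def detection_delay_minutes(check_interval_min: float) -> tuple[float, float]:
    """Return (average, worst-case) minutes from outage start to the first failing check."""
    return check_interval_min / 2, check_interval_min


free_avg, free_worst = detection_delay_minutes(5)  # 5-minute interval
paid_avg, paid_worst = detection_delay_minutes(1)  # 1-minute interval
```

Under these assumptions the 5-minute interval averages 2.5 minutes of undetected downtime per outage, against 0.5 minutes for the 1-minute interval.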

Response Time Thresholds

Beyond binary up/down monitoring, Vigilmon tracks response time history. Configure alerts on response time degradation:

  • Set a baseline from your normal response time distribution (e.g., p95 = 800ms)
  • Alert when response time exceeds 2–3× your baseline on two or more consecutive checks
  • Route response-time alerts to a Slack channel rather than PagerDuty — they indicate degradation, not necessarily a full outage requiring immediate human response
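
The threshold rule in the list above can be sketched as follows. This is an illustration of the logic rather than a Vigilmon setting: the function name and its default multiplier and streak length are assumptions.

```python
def should_alert_on_latency(
    samples_ms: list[float],
    baseline_ms: float,
    multiplier: float = 2.0,
    consecutive: int = 2,
) -> bool:
    """True when the most recent `consecutive` samples all exceed multiplier * baseline."""
    if len(samples_ms) < consecutive:
        return False
    limit = baseline_ms * multiplier
    return all(sample > limit for sample in samples_ms[-consecutive:])
```

With a baseline of 800 ms and the default 2× multiplier, the limit is 1600 ms: the series `[700, 1700, 1900]` alerts, while `[700, 1900, 900]` does not, because the spike resolved on the next check.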

TCP vs. HTTP Monitoring

For backend services, TCP port monitoring often provides earlier warning than HTTP endpoint monitoring:

  • A failed TCP connection shows that the port itself is unreachable, a condition that surfaces before the application can return any error of its own
  • Configure TCP monitors on database ports, message queue ports, and internal service endpoints
  • Layer HTTP monitoring on top for application-level health validation

Vigilmon supports both in the same monitor dashboard. A robust on-call setup uses both: TCP for network-layer availability, HTTP for application-level health.
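
To make the distinction concrete, here is a minimal sketch of the two kinds of check using only the Python standard library. It shows what each layer tests, not how Vigilmon's probes are implemented; the function names and timeouts are illustrative.

```python
import http.client
import socket
import urllib.request


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Network layer: can a TCP connection to host:port be opened at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_healthy(url: str, timeout: float = 5.0) -> bool:
    """Application layer: does the endpoint answer with a 2xx status?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError (raised for 4xx/5xx) are OSError subclasses.
        return False
```

A service can pass `tcp_reachable` and still fail `http_healthy`, for instance when the process accepts connections but returns 500s, which is why the two checks complement each other.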


Multi-Step Verification: What Vigilmon Does So You Don't Have To

Many monitoring tools offer manual "re-check before alert" features — run 2–3 checks before paging. This is better than immediate single-check alerting but still imperfect, because consecutive checks from the same probe location can all fail on the same probe-side issue.

Vigilmon's approach is structurally different: multi-region consensus. Each check runs from geographically distributed probe nodes. An alert fires only when multiple independent nodes in different geographic regions all confirm the service is down.
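
The idea can be sketched as a quorum over per-region results. Vigilmon's actual consensus rule is not specified here, so the `quorum` parameter, the region names, and the function itself are assumptions for illustration only.

```python
def consensus_down(results: dict[str, bool], quorum: int = 2) -> bool:
    """Decide whether to alert from per-region probe results.

    `results` maps a region name to True if that probe reached the
    service. The alert fires only when at least `quorum` regions failed.
    """
    failures = sum(1 for reachable in results.values() if not reachable)
    return failures >= quorum


# One region sees a failure: treated as a probe-side issue, no alert.
single_blip = consensus_down({"us-east": False, "eu-west": True, "ap-south": True})
# Every region sees a failure: treated as a real outage.
real_outage = consensus_down({"us-east": False, "eu-west": False, "ap-south": False})
```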

This matters for on-call workflows because:

  • BGP routing anomalies affecting one network path don't trigger alerts — the service is reachable from other regions
  • Probe-side DNS hiccups affecting one monitoring node are filtered out automatically
  • Transient cloud provider network events in one availability zone don't page your on-call at 3 AM

The result: when Vigilmon fires an alert, it has already completed its own multi-step verification across regions. You're not getting paged for a 30-second network blip. You're getting paged because multiple independent sources agree the service is genuinely unreachable.

For on-call engineers, this is the difference between a reliable alerting system and an alerting system that requires constant second-guessing.


Suppressing Weekend Noise

For many services, weekend traffic is significantly lower than weekday traffic. Weekend outages may have different business impact thresholds than weekday outages.

Practical approaches:

  1. Separate weekend escalation policies in PagerDuty — weekday pages go to primary on-call immediately; weekend pages have a longer acknowledgement window before escalating
  2. Slack channel routing — weekend alerts go to a #weekend-incidents channel with a longer response SLA
  3. Monitor coverage tiers — classify monitors as P1 (always immediate page), P2 (business hours escalation only), and P3 (Slack notification only, no escalation)

Vigilmon's webhook flexibility supports this pattern: different monitors can have different webhook destinations, which can route to different PagerDuty services or Slack channels with different escalation rules.
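
If the routing decision is made in a relay rather than by separate webhook destinations, the tiers above could be expressed as in this sketch. The 09:00 to 18:00 business-hours window, the `#incidents` channel name for weekdays, and the `"pagerduty"` label are assumptions; `now` is expected in the team's local time.

```python
from datetime import datetime


def route_by_tier(tier: str, now: datetime) -> str:
    """Return the destination for a down event on a P1, P2, or P3 monitor."""
    is_weekend = now.weekday() >= 5  # Monday is 0, so 5 and 6 are Saturday and Sunday
    in_business_hours = not is_weekend and 9 <= now.hour < 18
    slack_channel = "#weekend-incidents" if is_weekend else "#incidents"

    if tier == "P1":
        return "pagerduty"  # always an immediate page
    if tier == "P2" and in_business_hours:
        return "pagerduty"  # escalates during business hours only
    return slack_channel    # P3, or P2 outside business hours
```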


Practical Setup Checklist

Walk through this checklist to go from zero to a production-ready on-call setup:

  • [ ] Add all production HTTP and TCP monitors in Vigilmon with 1-minute check intervals
  • [ ] Add heartbeat monitors for all scheduled jobs and cron tasks
  • [ ] Configure Slack webhook for immediate team visibility on down events
  • [ ] Set up PagerDuty (or Opsgenie) integration for escalation chain
  • [ ] Define primary and secondary on-call rotation in PagerDuty
  • [ ] Set acknowledgement windows (5 min primary → 10 min secondary → manager)
  • [ ] Configure separate monitors and escalation paths for non-production environments
  • [ ] Pause monitors during planned maintenance windows
  • [ ] Test the full alerting chain: manually take a test service down and verify the page fires within 2 minutes of the monitor detecting the outage
  • [ ] Document the escalation policy so every engineer on the rotation understands it

Conclusion

Effective on-call alerting is not about getting more alerts — it's about getting the right alerts at the right time through the right channels. Vigilmon's multi-region consensus model handles the hardest part: ensuring every alert represents a real outage rather than a probe-side anomaly.

Layer on top of that a thoughtful channel configuration (email for secondary, Slack for team visibility, PagerDuty or Opsgenie for escalation), sensible thresholds that distinguish production from non-production, and weekend noise suppression — and you have an on-call setup where engineers stay responsive because the signal is trustworthy.

Set up your on-call monitoring at vigilmon.online — free tier available, no credit card required.

