Vigilmon vs Prometheus: External Uptime Monitoring vs Pull-Based Metrics

Prometheus is one of the most widely adopted monitoring tools in the cloud-native ecosystem. It's also one of the most misunderstood — particularly when developers reach for it to answer the question "is my website up?" This comparison explains what Prometheus actually does, where it falls short for external uptime monitoring, and how Vigilmon fills the gap.

What Is Prometheus?

Prometheus is an open-source metrics collection and alerting system, originally built at SoundCloud and now a graduated CNCF project. It's designed for recording time-series metrics — CPU usage, request rates, error rates, memory consumption — from instrumented applications and infrastructure components.

Prometheus's core capabilities include:

Pull-based scraping: Prometheus polls a /metrics HTTP endpoint exposed by each target on a regular interval
PromQL: A powerful query language for aggregating and analyzing time-series data
Alertmanager: A companion service for routing, grouping, deduplicating, and silencing alerts
Service discovery: Native integrations with Kubernetes, Consul, AWS EC2, and static configs to auto-discover scrape targets
Long-term storage (via remote write): Push metrics to Thanos, Cortex, or other long-term stores
Grafana integration: Prometheus is the de facto data source for Grafana dashboards
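To make the pull model concrete, here is a minimal sketch of what a scrape target serves at /metrics. It uses only the Python standard library rather than the official prometheus_client package. The metric name http_requests_total, its labels, the counter values, and port 8000 are illustrative assumptions. The line layout (# HELP, # TYPE, then one `name{labels} value` sample per line) follows the Prometheus text exposition format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters keyed by (method, status code).
REQUEST_COUNTS = {("GET", "200"): 1027, ("GET", "500"): 3}


def render_metrics(counts):
    """Render one counter family in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, code), value in sorted(counts.items()):
        lines.append(f'http_requests_total{{method="{method}",code="{code}"}} {value}')
    # The format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering Prometheus scrapes at http://<host>:<port>/metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint, and Prometheus then pulls from it on every scrape interval. Nothing is pushed anywhere. That is why Prometheus only sees targets it can reach.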

Prometheus is excellent at answering questions like "what is the p99 latency of my /checkout endpoint over the last 7 days?" or "how many 5xx errors per minute is my API returning?" Those are deeply internal questions answered by data your own services produce.
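In PromQL, the p99 question is usually answered by applying histogram_quantile to the per-second rate of a histogram's _bucket series. For example: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{handler="/checkout"}[7d]))). The metric and label names here are assumptions about how the service is instrumented. The Python sketch below mirrors the core of what histogram_quantile does with those buckets: it locates the bucket containing the target rank and interpolates linearly inside it. It is a simplified illustration that assumes positive bucket bounds and at least one finite bucket. It is not a port of Prometheus's own code.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with (math.inf, total_count), i.e. the shape of a
    Prometheus histogram's `_bucket{le="..."}` series. Assumes positive bounds
    and at least one finite bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First finite bucket whose cumulative count reaches the target rank.
    b = next(
        (i for i, (_, cum) in enumerate(buckets[:-1]) if cum >= rank),
        len(buckets) - 1,
    )
    if b == len(buckets) - 1:
        # Rank lands in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[b]
    lower = 0.0
    if b > 0:
        lower, prev = buckets[b - 1]
        count -= prev
        rank -= prev
    # Linear interpolation within the selected bucket.
    return lower + (upper - lower) * (rank / count)
```

Take buckets [(0.1, 50), (0.25, 90), (0.5, 99), (1.0, 100), (math.inf, 100)]. Here 99 of 100 observations fall at or below 0.5 s, so the p99 estimate is exactly 0.5. More generally, the estimate can never be more precise than the bucket boundaries you chose when instrumenting.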

What Is Vigilmon?

Vigilmon is a purpose-built external uptime monitoring service. It stands outside your infrastructure and probes your services from multiple geographic locations, asking the same question a real user would ask: "Can I reach this endpoint right now?"

Vigilmon's core capabilities include:

HTTP/HTTPS endpoint monitoring with status code and response body validation
TCP port monitoring
Cron job heartbeat monitoring (dead man's switch)
Multi-region consensus checking — alerts only when a quorum of independent probes agree the target is down
Response time history with period selectors
Webhook, email, and Slack alert channels
Status page badge embeds for customer-facing transparency
REST API for programmatic monitor management
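Vigilmon runs these probes for you, so there is no code to write on your side. Purely to illustrate what "status code and response body validation" means, here is a minimal single-location sketch using Python's standard library. The function names, the 10-second timeout, and the "body must contain a substring" rule are assumptions, not Vigilmon's implementation.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    up: bool
    status: Optional[int]
    elapsed_ms: float
    reason: str


def evaluate(status, body, expected_status=200, must_contain=None):
    """Apply the pass/fail rules to an HTTP response; returns (up, reason)."""
    if status != expected_status:
        return False, f"unexpected status {status}"
    if must_contain is not None and must_contain not in body:
        return False, f"body does not contain {must_contain!r}"
    return True, "ok"


def check_http(url, expected_status=200, must_contain=None, timeout=10.0):
    """Probe `url` once from this machine and validate status code and body."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError but still carry a status code.
        status, body = err.code, ""
    except OSError as err:
        # DNS failures, refused connections, TLS errors, and timeouts
        # (urllib.error.URLError is a subclass of OSError).
        elapsed = (time.monotonic() - start) * 1000
        return CheckResult(False, None, elapsed, f"unreachable: {err}")
    elapsed = (time.monotonic() - start) * 1000
    up, reason = evaluate(status, body, expected_status, must_contain)
    return CheckResult(up, status, elapsed, reason)
```

One call to check_http is one probe from one place. A hosted monitor adds the parts that are hard to run yourself: scheduling the probe and executing it from several independent regions.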

There's no PromQL. No scrape config. No Alertmanager. Just fast, reliable external checks with zero infrastructure to operate.

Why Prometheus Is Not an Uptime Monitor

This distinction matters: Prometheus is a pull-based metrics system, not an external probe.

1. Prometheus can only scrape what it can reach

Prometheus typically scrapes your services' /metrics endpoints from inside your network. It has no built-in notion of "check whether this public URL is reachable from Tokyo." Suppose your load balancer is misconfigured and users can't reach your site, but your internal services keep running. Prometheus will report everything as healthy, because the internal /metrics endpoints are still responding.

External outages caused by DNS failures, CDN misconfigurations, TLS certificate expiry, or upstream network issues are, by design, invisible to a Prometheus server that only scrapes internal /metrics endpoints.
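TLS certificate expiry is a good example of a failure that only an outside-in check sees, because it depends on the certificate the public endpoint actually presents. The following is a minimal standard-library sketch, not how Vigilmon or blackbox_exporter implement it. The helper names are hypothetical.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time (negative if already past)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def fetch_not_after(host, port=443, timeout=10.0):
    """Return the notAfter field of the certificate `host` serves publicly."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # The default context verifies the chain, so an already-expired
        # certificate raises ssl.SSLCertVerificationError here.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

An external check should treat the verification error itself as "down." days_until_expiry is for warning ahead of that point, for example when fewer than 14 days remain (an arbitrary threshold chosen for illustration).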

2. Blackbox Exporter is a workaround, not a solution

The Prometheus ecosystem does offer a blackbox_exporter that adds active probing (HTTP, TCP, ICMP, DNS). But deploying it adds significant complexity:

You must deploy, configure, and operate blackbox_exporter as a separate service
Probes run from wherever blackbox_exporter is deployed — usually inside your own infrastructure, which means they share the same failure domain as your services
Alertmanager must be separately configured to route alerts from blackbox probe metrics
Multi-region probing requires deploying blackbox_exporter in multiple locations — with the attendant operational overhead for each

Even fully configured, blackbox_exporter running inside your VPC still can't reliably emulate what a user in São Paulo experiences when reaching your app.

3. Self-hosted infrastructure overhead

Running Prometheus in production means operating:

A Prometheus server (or cluster for HA)
Alertmanager (for alert routing and deduplication)
Long-term storage (Thanos, VictoriaMetrics, or Cortex) for retention beyond Prometheus's default 15-day local retention
blackbox_exporter for HTTP checks
Grafana for dashboards

Each component needs provisioning, configuration, upgrades, and on-call coverage. If your monitoring stack goes down, you have no monitoring.

Feature Comparison

| Feature | Prometheus | Vigilmon |
|---|---|---|
| HTTP/HTTPS uptime monitoring | ✅ (via blackbox_exporter, complex) | ✅ built-in, 2-minute setup |
| TCP port monitoring | ✅ (via blackbox_exporter) | ✅ built-in |
| Cron job heartbeat monitoring | ⚠️ partial (Pushgateway plus custom alert rules) | ✅ built-in |
| External probing (outside your infra) | ❌ by default | ✅ always |
| Multi-region consensus alerting | ❌ | ✅ default behavior |
| Internal application metrics | ✅ core strength | ❌ |
| PromQL query language | ✅ | ❌ |
| Grafana dashboards | ✅ | ❌ |
| Alertmanager routing | ✅ | ❌ |
| Setup time for basic uptime check | 30–120 minutes | ~2 minutes |
| Operational overhead | High (self-hosted stack) | None (SaaS) |
| Status page for users | ❌ (third-party add-ons) | ✅ included |
| Pricing | Free and open source (you pay infrastructure and ops costs) | Free tier + flat monthly |
| SaaS option | ⚠️ not first-party; managed Prometheus-compatible services exist (e.g., Grafana Cloud, Amazon Managed Service for Prometheus) | ✅ |

Setting Up an Uptime Check: A Concrete Comparison

Prometheus + blackbox_exporter approach

```yaml
# 1. blackbox_exporter config (blackbox.yml)
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET

# 2. Prometheus config (prometheus.yml)
# The hostnames blackbox-exporter and alertmanager are placeholders for your deployment.
rule_files:
  - alerts.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://myapp.com/health
    relabel_configs:
      # Pass the URL to blackbox_exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Scrape blackbox_exporter itself, not the target URL.
      - target_label: __address__
        replacement: blackbox-exporter:9115

# 3. Alerting rule (alerts.yml), evaluated by Prometheus and forwarded to Alertmanager
groups:
  - name: uptime
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} is down"

# 4. Alertmanager routing config (alertmanager.yml)
route:
  receiver: 'slack-notifications'
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
```

That's four configuration files, three running services (Prometheus, blackbox_exporter, and Alertmanager), and a Slack webhook URL you need to obtain separately. Every probe also runs from a single vantage point: wherever blackbox_exporter is deployed.

Vigilmon approach

1. Go to vigilmon.online
2. Enter: https://myapp.com/health
3. Paste your Slack webhook URL
4. Done — multi-region probing active in 30 seconds

The Use Case Split: When to Use Each

Use Prometheus for:

Application performance metrics — request rate, error rate, duration histograms (RED method)
Infrastructure metrics — CPU, memory, disk, network I/O on your own servers
Custom business metrics — orders per minute, active sessions, queue depth, payment processing latency
Kubernetes monitoring — kube-state-metrics, cAdvisor, node-exporter integration
Long-term trends and capacity planning — PromQL aggregations over weeks or months

Use Vigilmon for:

External uptime confirmation — verifying your service is reachable from the public internet
Multi-region availability checks — ensuring users in different geographies all get responses
Cron job monitoring — receiving a heartbeat from scheduled jobs and alerting when they go silent
TCP port monitoring — confirming database ports, SMTP endpoints, and non-HTTP services are open
Customer-facing status pages — letting users self-serve outage information without building a separate page
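A heartbeat (dead man's switch) monitor reverses the usual direction of a check. The scheduled job calls out to the monitor when it finishes, and the monitor alerts when those calls stop arriving. The sketch below shows only the monitor-side decision. It assumes a hypothetical model in which each job has an expected period plus a grace window. The class and field names are illustrative, and Vigilmon's own heartbeat URLs and timing rules are not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class HeartbeatMonitor:
    created_at: datetime        # when monitoring started
    period: timedelta           # how often the job is expected to check in
    grace: timedelta            # slack allowed for slow or late runs
    last_ping: Optional[datetime] = None

    def record_ping(self, at: Optional[datetime] = None) -> None:
        """Called when the job reports success (e.g. hits its heartbeat URL)."""
        self.last_ping = at or datetime.now(timezone.utc)

    def is_overdue(self, now: Optional[datetime] = None) -> bool:
        """True once no ping has arrived within period + grace."""
        now = now or datetime.now(timezone.utc)
        # Before the first ping, measure from when monitoring started.
        reference = self.last_ping or self.created_at
        return now - reference > self.period + self.grace
```

The monitor alerts on silence rather than on errors. That is what catches a job that never started at all, such as one with a deleted crontab entry or a stopped scheduler.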

Use both:

The most complete production monitoring setup uses Prometheus and Vigilmon together. Prometheus tells you what's happening inside your services — the request rate dropped, a queue is backing up, memory is climbing. Vigilmon tells you whether users can reach your services at all. These are complementary signals, not redundant ones.

An outage where your internal Prometheus metrics look healthy but Vigilmon fires an alert is the most important class of failure to catch: something between your services and your users has broken.

Multi-Region Probing: A Critical Difference

Vigilmon's multi-region consensus model deserves specific attention because it solves a real alert-fatigue problem.

Single-probe uptime monitors fire false alerts when a single regional network hiccup affects one probe location. You get paged at 2 AM for a 30-second network anomaly in one datacenter that users never noticed.

Vigilmon requires a quorum of independent probes from different geographic regions to agree that a target is down before triggering an alert. Single-region blips are absorbed. Only genuine, multi-geography outages reach your team.
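The quorum rule itself is simple; the hard part is operating independent probe locations. Here is a hypothetical sketch. It assumes each region reports the latest pass/fail result for a target and that a simple majority of regions must agree. Vigilmon's actual region list and quorum size are not stated here, and the region names are placeholders.

```python
def consensus_down(results, quorum=None):
    """Return True only when at least `quorum` regions see the target as down.

    `results` maps a region name to True if that region's latest probe
    succeeded and False if it failed. With no explicit quorum, a simple
    majority of reporting regions is required.
    """
    if not results:
        return False
    if quorum is None:
        quorum = len(results) // 2 + 1
    failing = sum(1 for ok in results.values() if not ok)
    return failing >= quorum


# One region's network blip is absorbed; a multi-region failure alerts.
print(consensus_down({"us-east": True, "eu-west": False, "ap-northeast": True}))   # False
print(consensus_down({"us-east": False, "eu-west": False, "ap-northeast": True}))  # True
```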

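Replicating this with Prometheus + blackbox_exporter would take several steps. You would deploy blackbox_exporter in each target region and label each deployment's probes by region. You would then write PromQL alerting rules that count failing regions per target (for example, count by (instance) (probe_success == 0) >= 2) and route the resulting alerts through Alertmanager. That's a non-trivial project.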

With Vigilmon, it's the default.

Summary and Recommendation

Prometheus and Vigilmon solve different problems in the monitoring stack. The confusion arises because both can technically perform HTTP checks — but Prometheus's model is fundamentally pull-based, internal, and requires significant infrastructure to operate.

Choose Prometheus if you need deep application and infrastructure metrics, custom alerting on business KPIs, PromQL-based analysis, or Kubernetes cluster monitoring. It's one of the best tools ever built for its intended purpose.

Choose Vigilmon if you need external uptime monitoring — confirmation that public URLs are reachable from the internet, TCP port checks, cron heartbeat monitoring, and a status page for users. Setup takes two minutes with no infrastructure to maintain.

Run both for comprehensive coverage. They occupy different layers of the monitoring stack and answer different questions. Prometheus answers "what is happening inside my system?" Vigilmon answers "can users reach my system at all?" Neither answer is enough without the other.

Start external uptime monitoring for free at vigilmon.online — 5 monitors, 1-minute intervals, multi-region consensus, status page, and Slack alerts included at $0/month.

Tags: #monitoring #devops #prometheus #uptime #sre #kubernetes #observability