Monitoring AWS ECS/Fargate Containers with Vigilmon

Add external uptime monitoring to containerized apps on AWS ECS/Fargate: container health check routes, Vigilmon monitor setup, and alerting on task failure.

AWS ECS and Fargate handle container orchestration, auto-scaling, and self-healing. But "self-healing" has limits: ECS can restart a failed task, yet it won't notify you that the restart happened, or that your load balancer was routing to a broken target for 3 minutes before it did.

External monitoring fills that gap. Vigilmon probes your container's public URL from outside AWS, catches outages that CloudWatch misses, and sends alerts as soon as a probe fails.

This tutorial covers:

  • A health check route in your container
  • ECS task definition health check configuration
  • Vigilmon external HTTP monitoring
  • Alerts on ECS task failure
  • Heartbeat monitoring for ECS Scheduled Tasks

Step 1: Add a health check route to your container

Every container that serves HTTP traffic needs a /health endpoint. This is used both by ECS/ALB internally and by Vigilmon externally.

Node.js / Express:

// src/health.js
app.get('/health', async (req, res) => {
  const checks = {};
  let healthy = true;

  // Check database if used
  try {
    await db.query('SELECT 1');
    checks.database = 'ok';
  } catch {
    checks.database = 'error';
    healthy = false;
  }

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});

Python / FastAPI:

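# app/health.py
from datetime import datetime, timezone

from fastapi import APIRouter, Response

router = APIRouter()


async def check_database() -> None:
    """Placeholder: replace with a real probe, e.g. SELECT 1 against your DB."""


@router.get("/health")
async def health(response: Response):
    checks = {}
    healthy = True

    try:
        await check_database()
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "error"
        healthy = False
        # 503 tells the ALB and external monitors that the app is degraded
        response.status_code = 503

    return {
        "status": "ok" if healthy else "degraded",
        "checks": checks,
        # datetime.utcnow() is deprecated since Python 3.12
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }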

Go:

// internal/health/handler.go
package health

import (
    "encoding/json"
    "net/http"
    "time"
)

func Handler(w http.ResponseWriter, r *http.Request) {
    status := map[string]interface{}{
        "status":    "ok",
        "timestamp": time.Now().UTC().Format(time.RFC3339),
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(status)
}

Keep the health endpoint fast: the ALB uses it for routing decisions and ECS uses it to decide when to replace a task, so slow checks delay recovery.
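One way to keep the endpoint fast is to cap each dependency check with a timeout and run the checks concurrently. The sketch below is illustrative only: `check_with_timeout`, `run_checks`, and the two stand-in probes are hypothetical names, and the 0.2-second budget is an assumption you would tune for your own service.

```python
import asyncio


async def check_with_timeout(check, timeout_s=1.0):
    """Run one dependency check, treating a slow answer as a failure."""
    try:
        await asyncio.wait_for(check(), timeout=timeout_s)
        return "ok"
    except asyncio.TimeoutError:
        return "timeout"
    except Exception:
        return "error"


async def run_checks(checks, timeout_s=1.0):
    """Run all checks concurrently; return (healthy, per-check results)."""
    names = list(checks)
    outcomes = await asyncio.gather(
        *(check_with_timeout(checks[name], timeout_s) for name in names)
    )
    results = dict(zip(names, outcomes))
    return all(value == "ok" for value in results.values()), results


# Hypothetical stand-ins for real dependency probes.
async def fast_db():
    await asyncio.sleep(0.01)


async def slow_cache():
    await asyncio.sleep(5)


if __name__ == "__main__":
    healthy, results = asyncio.run(
        run_checks({"database": fast_db, "cache": slow_cache}, timeout_s=0.2)
    )
    print(healthy, results)
```

Because the checks run concurrently, the handler's worst-case latency is roughly one timeout rather than the sum of all of them.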


Step 2: Configure health checks in your task definition

ECS supports container-level health checks directly in your task definition. Add this to your taskDefinition.json or CDK/Terraform config:

JSON task definition:

{
  "containerDefinitions": [
    {
      "name": "app",
      "image": "your-ecr-image:latest",
      "healthCheck": {
        "command": [
          "CMD-SHELL",
          "curl -f http://localhost:8080/health || exit 1"
        ],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      },
      "portMappings": [
        { "containerPort": 8080, "protocol": "tcp" }
      ]
    }
  ]
}

Terraform:

resource "aws_ecs_task_definition" "app" {
  family = "app"

  container_definitions = jsonencode([{
    name  = "app"
    image = "your-ecr-image:latest"

    healthCheck = {
      command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
      interval    = 30
      timeout     = 5
      retries     = 3
      startPeriod = 60
    }

    portMappings = [{ containerPort = 8080 }]
  }])
}

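startPeriod (60 seconds) gives the container a grace period during initialization: failed checks in that window do not count toward retries, so ECS won't kill a container that is still booting. Adjust it to match your actual cold-start time. The command runs inside the container, so curl must be installed in the image.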
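To make the interaction between startPeriod and retries concrete, here is a simplified model of how a sequence of check results could be classified. It is a sketch for reasoning about the settings, not the ECS agent's actual implementation; `container_health` is a hypothetical helper, and it assumes that a success during the grace period ends the grace early.

```python
def container_health(results, retries=3, start_period=60):
    """Classify a container from (seconds_since_start, passed) check results.

    Failures inside the start period are ignored until the first success;
    after that, `retries` consecutive failures mark the container unhealthy.
    """
    consecutive_failures = 0
    seen_success = False
    for elapsed, passed in results:
        if passed:
            seen_success = True
            consecutive_failures = 0
            continue
        if not seen_success and elapsed < start_period:
            continue  # still booting: this failure does not count
        consecutive_failures += 1
        if consecutive_failures >= retries:
            return "UNHEALTHY"
    return "HEALTHY" if seen_success else "UNKNOWN"


# A slow boot that recovers inside the grace period.
slow_boot = [(30, False), (59, False), (90, True)]
# A container that was healthy and then fails three checks in a row.
crashed = [(30, True), (60, False), (90, False), (120, False)]

print(container_health(slow_boot))
print(container_health(crashed))
```

With startPeriod set to 0, the same three early failures would count immediately, which is why an undersized grace period can cause restart loops on slow-starting images.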


Step 3: Configure ALB health checks

If your ECS service sits behind an Application Load Balancer, configure ALB target group health checks to use the same endpoint:

AWS Console:

  1. Go to EC2 → Target Groups → your target group
  2. Click Health checks → Edit
  3. Set Health check path to /health
  4. Set Success codes to 200
  5. Set Interval to 30 seconds, Healthy threshold to 2, and Unhealthy threshold to 3

Terraform:

resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  # Fargate tasks use awsvpc networking and register by IP address
  target_type = "ip"

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 30
    timeout             = 5
    matcher             = "200"
  }
}

ALB health checks control routing. Vigilmon health checks control your alerting. Both should point to the same endpoint.
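The thresholds above translate directly into how long the ALB keeps routing to a broken target. A rough calculation, ignoring per-check timeouts and where in the interval the failure occurs (`alb_reaction_times` is a hypothetical helper):

```python
def alb_reaction_times(interval_s, healthy_threshold, unhealthy_threshold):
    """Approximate seconds for a target to change state in the target group."""
    return {
        "seconds_to_unhealthy": interval_s * unhealthy_threshold,
        "seconds_to_healthy": interval_s * healthy_threshold,
    }


# Values from the Terraform health_check block above.
times = alb_reaction_times(interval_s=30, healthy_threshold=2, unhealthy_threshold=3)
print(times)
```

Under these settings a failing task can receive traffic for about 90 seconds before the ALB stops routing to it, and a replacement needs about 60 seconds of passing checks before it receives traffic.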


Step 4: Set up Vigilmon external monitoring

ECS and ALB health checks are internal: they run inside the container or the VPC. You also need an external probe that monitors your service from the internet.

Connect your ECS service to Vigilmon:

  1. Sign up at vigilmon.online — free, no card required
  2. Click New Monitor → HTTP
  3. URL: https://api.yourdomain.com/health (your ALB or CloudFront URL)
  4. Check interval: 1 minute (paid) or 5 minutes (free)
  5. Expected status: 200
  6. Save

Vigilmon catches things CloudWatch and ECS health checks miss:

  • ALB misconfiguration (target group removed from listener)
  • Security group changes blocking traffic
  • Route 53 DNS failures
  • CloudFront distribution errors
  • Certificate expiry

These are network-path and configuration failures: the container is healthy, but external traffic can't reach it.
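Conceptually, an external HTTP probe does little more than request the public URL, compare the status code with the expected one, and record the latency. The sketch below uses only the Python standard library and is an illustration of the idea, not Vigilmon's implementation; `probe` and its return shape are hypothetical.

```python
import time
import urllib.error
import urllib.request


def probe(url, expected_status=200, timeout_s=10.0):
    """Fetch `url` once and report whether it answered with the expected status."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, TLS error, or timeout
        return {"up": False, "status": None, "error": str(err)}
    elapsed_ms = round((time.monotonic() - started) * 1000)
    return {"up": status == expected_status, "status": status, "elapsed_ms": elapsed_ms}
```

Calling `probe("https://api.yourdomain.com/health")` from a machine outside AWS exercises DNS, TLS, the load balancer, and the container in one request, which is what makes this kind of check sensitive to the failures listed above.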


Step 5: Configure alerts

In Vigilmon, go to Notifications → New Channel:

Slack:

  1. Create an incoming webhook at api.slack.com/apps
  2. Paste the webhook URL into Vigilmon
  3. Enable it on your ECS service monitors

Email:

  1. Add an email address as a notification channel
  2. Enable it on your monitors

PagerDuty (for on-call rotation):

  1. Create a Vigilmon integration in PagerDuty
  2. Add the integration key in Vigilmon's notification settings

When ECS task replacement causes brief downtime, you may see an alert like this:

🔴 DOWN: api.yourdomain.com/health
Status: 503
Duration: 45s

This is normal during ECS rolling updates. Use Vigilmon's Confirm Down After setting to suppress alerts for brief replacement windows:

  • Set Confirm Down After to 2 failures
  • This ignores single-probe failures from task replacement
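The effect of a confirm-after-N-failures rule can be modeled as a small state machine over probe results. This is a generic sketch of the technique, assuming consecutive failures are what is counted; how Vigilmon implements the setting internally is not documented here, and `alert_transitions` is a hypothetical name.

```python
def alert_transitions(probe_results, confirm_down_after=2):
    """Return (index, "DOWN" | "UP") events for a sequence of probe booleans."""
    consecutive_failures = 0
    down = False
    events = []
    for i, ok in enumerate(probe_results):
        if ok:
            consecutive_failures = 0
            if down:
                down = False
                events.append((i, "UP"))
        else:
            consecutive_failures += 1
            if not down and consecutive_failures >= confirm_down_after:
                down = True
                events.append((i, "DOWN"))
    return events


# One failed probe during a rolling update: no alert.
blip = [True, False, True, True]
# A real outage: alert on the second failure, recover on the next success.
outage = [True, False, False, False, True]

print(alert_transitions(blip))
print(alert_transitions(outage))
```

The trade-off is detection delay: with a 1-minute interval and a threshold of 2, a real outage is confirmed roughly one probe interval later than it would be with a threshold of 1.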

Step 6: CloudWatch → Vigilmon alert routing

For task-level failures (not service-level), you can combine CloudWatch alarms with Vigilmon:

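Create a CloudWatch alarm on the ALB's UnhealthyHostCount metric, which counts targets (your ECS tasks) that are failing the target group health check: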

resource "aws_cloudwatch_metric_alarm" "ecs_unhealthy_tasks" {
  alarm_name          = "ecs-unhealthy-tasks"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "UnhealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Maximum"
  threshold           = 0

  dimensions = {
    TargetGroup  = aws_lb_target_group.app.arn_suffix
    LoadBalancer = aws_lb.app.arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

Route the SNS topic to Vigilmon's webhook receiver (available in Vigilmon's integration settings) or to Slack directly. This gives you ECS-level task health on top of Vigilmon's external probe.
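If you route the SNS topic to your own HTTP endpoint or a small function, the receiver has to unwrap two layers of JSON: the SNS envelope, and the alarm payload carried as a string in its Message field. A minimal sketch follows; `summarize_alarm` is a hypothetical helper, and the output format is an arbitrary choice.

```python
import json


def summarize_alarm(sns_body):
    """Turn an SNS notification carrying a CloudWatch alarm into a one-line summary."""
    envelope = json.loads(sns_body)
    if envelope.get("Type") != "Notification":
        # e.g. SubscriptionConfirmation, which must be handled separately
        return None
    alarm = json.loads(envelope["Message"])  # the alarm payload is a JSON string
    return "{state}: {name} ({reason})".format(
        state=alarm.get("NewStateValue", "UNKNOWN"),
        name=alarm.get("AlarmName", "unnamed alarm"),
        reason=alarm.get("NewStateReason", "no reason given"),
    )


# Hand-written sample for illustration, not captured from a real alarm.
sample = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "AlarmName": "ecs-unhealthy-tasks",
        "NewStateValue": "ALARM",
        "NewStateReason": "Threshold Crossed",
    }),
})

print(summarize_alarm(sample))
```

A production receiver would also need to confirm the SNS subscription and verify message signatures, both of which are omitted here.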


Step 7: Heartbeat monitoring for ECS Scheduled Tasks

ECS Scheduled Tasks (cron jobs running on Fargate) aren't covered by HTTP monitoring. If a scheduled task fails silently, nothing external detects it.

Add a heartbeat ping at the end of each successful run:

// tasks/sync.js
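// Not needed on Node.js 18+, which ships a global fetch
import fetch from 'node-fetch';

async function main() {
  try {
    await runSyncJob();

    // Ping only after a successful run, so a missing ping signals a failed or skipped job
    const heartbeatUrl = process.env.VIGILMON_SYNC_HEARTBEAT;
    if (heartbeatUrl) {
      const res = await fetch(heartbeatUrl, { method: 'GET' });
      // fetch does not throw on HTTP error statuses, so check explicitly
      if (!res.ok) {
        console.warn(`Heartbeat ping returned ${res.status}`);
      }
    }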

    console.log('Sync complete');
    process.exit(0);
  } catch (err) {
    console.error('Sync failed:', err);
    process.exit(1);
  }
}

async function runSyncJob() {
  // your scheduled work
}

main();

Pass the heartbeat URL as a task definition environment variable:

{
  "environment": [
    {
      "name": "VIGILMON_SYNC_HEARTBEAT",
      "value": "https://vigilmon.online/heartbeat/your-monitor-id"
    }
  ]
}

Or store it in AWS Secrets Manager and reference it as a secret:

{
  "secrets": [
    {
      "name": "VIGILMON_SYNC_HEARTBEAT",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:vigilmon-heartbeat"
    }
  ]
}

In Vigilmon, create a Heartbeat Monitor:

  1. Click New Monitor → Heartbeat
  2. Set the interval to match your ECS cron schedule
  3. Copy the ping URL
  4. Add it to your task definition environment

Step 8: Public status page

Go to Status Pages → New Status Page in Vigilmon, add all your ECS service monitors, and publish:

Service issues? https://status.yourdomain.com

Each monitor generates a live badge:

![API Status](https://vigilmon.online/badge/your-monitor-slug)

What you've built

| What | How |
|------|-----|
| Health endpoint | Container-native /health route |
| ECS task health | Task definition healthCheck |
| ALB routing protection | Target group health check |
| External uptime monitoring | Vigilmon HTTP monitor on public URL |
| Task failure alerts | Vigilmon + optional CloudWatch SNS routing |
| Slack/email/PagerDuty alerts | Vigilmon notification channels |
| Scheduled task monitoring | Heartbeat ping in ECS cron tasks |
| Status page | Vigilmon public status page |

ECS handles self-healing. Vigilmon handles visibility. CloudWatch tells you what happened afterward; Vigilmon tells you while it is happening.


Next steps

  • Add Vigilmon monitors for each ECS service environment (prod, staging, canary)
  • Watch response time trends to catch gradual container resource exhaustion before it causes failures
  • Add heartbeat monitors for every ECS Scheduled Task that does important work — billing jobs, data sync, report generation

Get started free at vigilmon.online.
