Monitoring AWS ECS/Fargate Containers with Vigilmon

AWS ECS and Fargate handle container orchestration, auto-scaling, and self-healing. But "self-healing" has limits: ECS can restart a failed task, but by default it won't notify you that the restart happened, or that your load balancer was routing to a broken target for 3 minutes before it did.

External monitoring fills that gap. Vigilmon probes your container's public URL from outside AWS, catches outages that CloudWatch misses, and sends alerts the moment something breaks.

This tutorial covers:

A health check route in your container
ECS task definition health check configuration
Vigilmon external HTTP monitoring
Alerts on ECS task failure
Heartbeat monitoring for ECS Scheduled Tasks

Step 1: Add a health check route to your container

Every container that serves HTTP traffic needs a /health endpoint. This is used both by ECS/ALB internally and by Vigilmon externally.

Node.js / Express:

// src/health.js
// Assumes an Express `app` and a database client `db` are already set up.
app.get('/health', async (req, res) => {
  const checks = {};
  let healthy = true;

  // Check database if used
  try {
    await db.query('SELECT 1');
    checks.database = 'ok';
  } catch {
    checks.database = 'error';
    healthy = false;
  }

  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});

Python / FastAPI:

# app/health.py
from fastapi import APIRouter, Response
from datetime import datetime, timezone

router = APIRouter()

@router.get("/health")
async def health(response: Response):
    checks = {}
    healthy = True

    try:
        # check your DB/cache here, e.g. run a cheap query
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "error"
        healthy = False
        response.status_code = 503

    return {
        "status": "ok" if healthy else "degraded",
        "checks": checks,
        # timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

Go:

// internal/health/handler.go
package health

import (
    "encoding/json"
    "net/http"
    "time"
)

func Handler(w http.ResponseWriter, r *http.Request) {
    status := map[string]interface{}{
        "status":    "ok",
        "timestamp": time.Now().UTC().Format(time.RFC3339),
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(status)
}

Keep the health endpoint fast. The ALB uses it for routing decisions and ECS uses it to decide when to replace a task, so slow checks delay recovery.
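One way to keep the endpoint fast is to put a hard time limit on every dependency check. The sketch below is framework-independent and uses only the standard library; `run_check`, `collect_checks`, and the 1-second default are illustrative names and values, not part of any library.

```python
import asyncio


async def run_check(check, timeout_s: float = 1.0) -> str:
    """Run one async dependency check; report 'error' on failure or timeout."""
    try:
        await asyncio.wait_for(check(), timeout=timeout_s)
        return "ok"
    except Exception:  # covers timeouts as well as errors raised by the check
        return "error"


async def collect_checks(checks: dict, timeout_s: float = 1.0):
    """Run all checks concurrently so total latency is bounded by timeout_s."""
    names = list(checks)
    results = await asyncio.gather(
        *(run_check(checks[name], timeout_s) for name in names)
    )
    report = dict(zip(names, results))
    healthy = all(result == "ok" for result in report.values())
    return healthy, report
```

A health route could call `collect_checks({"database": ping_db})` (where `ping_db` is your own coroutine function) and return 503 when `healthy` is false.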

Step 2: Configure health checks in your task definition

ECS supports container-level health checks directly in your task definition. Add this to your taskDefinition.json or CDK/Terraform config:

JSON task definition:

{
  "containerDefinitions": [
    {
      "name": "app",
      "image": "your-ecr-image:latest",
      "healthCheck": {
        "command": [
          "CMD-SHELL",
          "curl -f http://localhost:8080/health || exit 1"
        ],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      },
      "portMappings": [
        { "containerPort": 8080, "protocol": "tcp" }
      ]
    }
  ]
}

Terraform:

resource "aws_ecs_task_definition" "app" {
  family = "app"

  container_definitions = jsonencode([{
    name  = "app"
    image = "your-ecr-image:latest"

    healthCheck = {
      command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
      interval    = 30
      timeout     = 5
      retries     = 3
      startPeriod = 60
    }

    portMappings = [{ containerPort = 8080 }]
  }])
}

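startPeriod (60 seconds) is a grace period: health check failures during this window don't count toward the retry limit, so ECS won't kill a container that is still initializing. Adjust it to match your actual cold-start time. The health check command runs inside the container, so curl must be installed in your image.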
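To get a feel for how these settings interact, here is a rough back-of-the-envelope estimate of how long a broken container can run before it is marked unhealthy. It assumes each check runs to its full timeout and the next one starts one interval later; the function name is hypothetical and the result is an approximation, not a guarantee from ECS.

```python
def container_unhealthy_after(interval_s, timeout_s, retries, start_period_s=0):
    """Approximate worst-case seconds until a failing container is marked unhealthy."""
    # Each failed attempt costs roughly one interval plus the time it takes to time out.
    steady_state = retries * (interval_s + timeout_s)
    return {
        "steady_state_s": steady_state,
        # Failures inside the start period are not counted, so add it for cold starts.
        "from_cold_start_s": start_period_s + steady_state,
    }


# Values from the task definition above.
estimate = container_unhealthy_after(
    interval_s=30, timeout_s=5, retries=3, start_period_s=60
)
print(estimate)
```

With the values above, this comes to about 105 seconds in steady state and about 165 seconds for a container that never finishes starting.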

Step 3: Configure ALB health checks

If your ECS service sits behind an Application Load Balancer, configure ALB target group health checks to use the same endpoint:

AWS Console:

Go to EC2 → Target Groups → your target group
Click Health checks → Edit
Set Health check path to /health
Set Success codes to 200
Set Interval to 30 seconds, Healthy threshold to 2, and Unhealthy threshold to 3

Terraform:

resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 30
    timeout             = 5
    matcher             = "200"
  }
}

ALB health checks control routing. Vigilmon health checks control your alerting. Both should point to the same endpoint.
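The thresholds and interval together determine how long the ALB keeps routing to a broken target, and how long a recovered target waits before it receives traffic again. A minimal sketch of that arithmetic (the function name is illustrative, and real timings vary slightly with check timeouts and scheduling):

```python
def alb_detection_windows(interval_s, healthy_threshold, unhealthy_threshold):
    """Approximate seconds for the ALB to change a target's health state."""
    return {
        # Consecutive failed checks needed before traffic stops going to the target.
        "seconds_to_mark_unhealthy": interval_s * unhealthy_threshold,
        # Consecutive passed checks needed before the target receives traffic again.
        "seconds_to_mark_healthy": interval_s * healthy_threshold,
    }


# Values from the target group configuration above.
windows = alb_detection_windows(
    interval_s=30, healthy_threshold=2, unhealthy_threshold=3
)
print(windows)
```

With this configuration a failing target can keep receiving traffic for roughly 90 seconds, which is the kind of window an external probe is meant to surface.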

Step 4: Set up Vigilmon external monitoring

ECS and ALB health checks are internal: they run inside the container and the VPC. You also need an external probe that monitors your service from the internet.

Connect your ECS service to Vigilmon:

Sign up at vigilmon.online — free, no card required
Click New Monitor → HTTP
URL: https://api.yourdomain.com/health (your ALB or CloudFront URL)
Check interval: 1 minute (paid) or 5 minutes (free)
Expected status: 200
Save

Vigilmon catches things CloudWatch and ECS health checks miss:

ALB misconfiguration (target group removed from listener)
Security group changes blocking traffic
Route 53 DNS failures
CloudFront distribution errors
Certificate expiry

These are failures in the network path in front of the container: the container is healthy, but external traffic can't reach it.
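Conceptually, an external HTTP probe does something like the following. This is a simplified sketch using only the Python standard library, not Vigilmon's actual implementation; `probe` and its return shape are hypothetical.

```python
import urllib.error
import urllib.request


def probe(url: str, expected_status: int = 200, timeout_s: float = 10.0) -> dict:
    """Fetch a URL from outside and classify the result as up or down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        status = err.code
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, TLS error, or timeout: no HTTP status at all.
        return {"up": False, "status": None, "error": str(err)}
    return {"up": status == expected_status, "status": status, "error": None}
```

The second failure branch is the one internal health checks never see: DNS, TLS, and connectivity problems surface only when the request comes from outside your network.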

Step 5: Configure alerts

In Vigilmon, go to Notifications → New Channel:

Slack:

Create an incoming webhook at api.slack.com/apps
Paste the webhook URL into Vigilmon
Enable it on your ECS service monitors

Email:

Add email address as notification channel
Enable on monitors

PagerDuty (for on-call rotation):

Create a Vigilmon integration in PagerDuty
Add the integration key in Vigilmon's notification settings

When ECS task replacement causes brief downtime:

🔴 DOWN: api.yourdomain.com/health
Status: 503
Duration: 45s

This is normal during ECS rolling updates. Use Vigilmon's Confirm Down After setting to suppress alerts for brief replacement windows:

Set Confirm Down After to 2 failures
This ignores single-probe failures from task replacement
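The idea behind a confirmation threshold can be sketched as a small state machine: alert only after N consecutive failed probes, and report recovery on the first success afterwards. This illustrates the general technique, assuming that is roughly how such a setting behaves; it is not Vigilmon's code, and `DownConfirmer` is a hypothetical name.

```python
from typing import Optional


class DownConfirmer:
    """Suppress alerts until a failure has been seen on consecutive probes."""

    def __init__(self, confirm_after: int = 2):
        self.confirm_after = confirm_after
        self.consecutive_failures = 0
        self.alerted = False

    def record(self, probe_ok: bool) -> Optional[str]:
        """Return 'down' or 'recovered' when the state changes, else None."""
        if probe_ok:
            self.consecutive_failures = 0
            if self.alerted:
                self.alerted = False
                return "recovered"
            return None
        self.consecutive_failures += 1
        if not self.alerted and self.consecutive_failures >= self.confirm_after:
            self.alerted = True
            return "down"
        return None
```

A single failed probe during a rolling update produces no event, while two failures in a row produce exactly one "down" event followed by one "recovered" event.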

Step 6: CloudWatch → Vigilmon alert routing

For task-level failures (not service-level), you can combine CloudWatch alarms with Vigilmon:

Create a CloudWatch alarm for unhealthy ECS tasks:

resource "aws_cloudwatch_metric_alarm" "ecs_unhealthy_tasks" {
  alarm_name          = "ecs-unhealthy-tasks"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "UnhealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Maximum"
  threshold           = 0

  dimensions = {
    TargetGroup  = aws_lb_target_group.app.arn_suffix
    LoadBalancer = aws_lb.app.arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

Route the SNS topic to Vigilmon's webhook receiver (available in Vigilmon's integration settings) or to Slack directly. This adds target-level health, as seen by the ALB, on top of Vigilmon's external probe.
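SNS cannot reshape a payload on its own, so one common way to forward the alarm is a small Lambda function subscribed to the topic. The sketch below assumes a hypothetical `ALERT_WEBHOOK_URL` environment variable and a receiver that accepts a JSON body with a `text` field (the shape Slack incoming webhooks use); check your receiver's expected format before relying on it.

```python
import json
import os
import urllib.request


def format_alarm(sns_message: str) -> dict:
    """Turn a CloudWatch alarm notification (a JSON string) into a webhook payload."""
    alarm = json.loads(sns_message)
    state = alarm.get("NewStateValue", "UNKNOWN")
    name = alarm.get("AlarmName", "unknown alarm")
    reason = alarm.get("NewStateReason", "")
    return {"text": f"{state}: {name}\n{reason}"}


def handler(event, context):
    """Lambda entry point: forward each SNS record to the webhook."""
    webhook_url = os.environ["ALERT_WEBHOOK_URL"]  # hypothetical variable name
    for record in event["Records"]:
        payload = format_alarm(record["Sns"]["Message"])
        request = urllib.request.Request(
            webhook_url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=5) as response:
            response.read()
    return {"forwarded": len(event["Records"])}
```

Keeping `format_alarm` separate from the network call makes the message formatting easy to test without deploying anything.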

Step 7: Heartbeat monitoring for ECS Scheduled Tasks

ECS Scheduled Tasks (cron jobs running on Fargate) aren't covered by HTTP monitoring. If a scheduled task fails silently, nothing external detects it.

Add a heartbeat ping at the end of each successful run:

// tasks/sync.js
import fetch from 'node-fetch';

async function main() {
  try {
    await runSyncJob();

    const heartbeatUrl = process.env.VIGILMON_SYNC_HEARTBEAT;
    if (heartbeatUrl) {
      await fetch(heartbeatUrl, { method: 'GET' });
    }

    console.log('Sync complete');
    process.exit(0);
  } catch (err) {
    console.error('Sync failed:', err);
    process.exit(1);
  }
}

async function runSyncJob() {
  // your scheduled work
}

main();
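For scheduled tasks written in Python, the same pattern might look like this. It is a sketch using only the standard library; `ping_heartbeat` and `run_with_heartbeat` are hypothetical helpers, and unlike the Node.js version it adds a timeout and treats a failed ping as a warning rather than a job failure, which is a design choice you may or may not want.

```python
import sys
import urllib.request


def ping_heartbeat(url, timeout_s: float = 5.0) -> bool:
    """Send the heartbeat ping; return False instead of raising on any failure."""
    if not url:
        return False
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):  # network and HTTP errors, or a malformed URL
        return False


def run_with_heartbeat(job, url) -> int:
    """Run the job, ping only on success, and return a process exit code."""
    try:
        job()
    except Exception as err:
        # No ping on failure: the missing heartbeat is what triggers the alert.
        print(f"Job failed: {err}", file=sys.stderr)
        return 1
    if not ping_heartbeat(url):
        print("Warning: heartbeat ping did not succeed", file=sys.stderr)
    return 0
```

An entry point would then end with something like `sys.exit(run_with_heartbeat(run_sync_job, os.environ.get("VIGILMON_SYNC_HEARTBEAT")))`, where `run_sync_job` is your own function and `os` is imported alongside `sys`.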

Pass the heartbeat URL as a task definition environment variable:

{
  "environment": [
    {
      "name": "VIGILMON_SYNC_HEARTBEAT",
      "value": "https://vigilmon.online/heartbeat/your-monitor-id"
    }
  ]
}

Or store it in AWS Secrets Manager and reference it as a secret:

{
  "secrets": [
    {
      "name": "VIGILMON_SYNC_HEARTBEAT",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:vigilmon-heartbeat"
    }
  ]
}

In Vigilmon, create a Heartbeat Monitor:

Click New Monitor → Heartbeat
Set the interval to match your ECS cron schedule
Copy the ping URL
Add it to your task definition environment

Step 8: Public status page

Go to Status Pages → New Status Page in Vigilmon, add all your ECS service monitors, and publish:

Service issues? https://status.yourdomain.com

Each monitor generates a live badge:

![API Status](https://vigilmon.online/badge/your-monitor-slug)

What you've built

| What | How |
|------|-----|
| Health endpoint | Container-native /health route |
| ECS task health | Task definition healthCheck |
| ALB routing protection | Target group health check |
| External uptime monitoring | Vigilmon HTTP monitor on public URL |
| Task failure alerts | Vigilmon + optional CloudWatch SNS routing |
| Slack/email/PagerDuty alerts | Vigilmon notification channels |
| Scheduled task monitoring | Heartbeat ping in ECS cron tasks |
| Status page | Vigilmon public status page |

ECS handles self-healing. Vigilmon handles visibility. CloudWatch tells you what happened afterwards; Vigilmon tells you the moment it happens.

Next steps

Add Vigilmon monitors for each ECS service environment (prod, staging, canary)
Watch response time trends to catch gradual container resource exhaustion before it causes failures
Add heartbeat monitors for every ECS Scheduled Task that does important work — billing jobs, data sync, report generation

Get started free at vigilmon.online.