Monitoring AWS ECS/Fargate Containers with Vigilmon
AWS ECS and Fargate handle container orchestration, auto-scaling, and self-healing. But "self-healing" has limits: ECS can restart a failed task, but it won't alert you that the restart happened, or that your load balancer was routing to a broken target for 3 minutes before it did.
External monitoring fills that gap. Vigilmon probes your container's public URL from outside AWS, catches outages that CloudWatch misses, and sends alerts the moment something breaks.
This tutorial covers:
- A health check route in your container
- ECS task definition health check configuration
- Vigilmon external HTTP monitoring
- Alerts on ECS task failure
- Heartbeat monitoring for ECS Scheduled Tasks
Step 1: Add a health check route to your container
Every container that serves HTTP traffic needs a /health endpoint. This is used both by ECS/ALB internally and by Vigilmon externally.
Node.js / Express:
// src/health.js
// Assumes `app` is your Express app and `db` is your database client.
app.get('/health', async (req, res) => {
  const checks = {};
  let healthy = true;
  // Check database if used
  try {
    await db.query('SELECT 1');
    checks.database = 'ok';
  } catch {
    checks.database = 'error';
    healthy = false;
  }
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    checks,
    timestamp: new Date().toISOString(),
  });
});
Python / FastAPI:
# app/health.py
from datetime import datetime, timezone
from fastapi import APIRouter, Response

router = APIRouter()

@router.get("/health")
async def health(response: Response):
    checks = {}
    healthy = True
    try:
        # check your DB/cache here (for example, run a trivial query)
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "error"
        healthy = False
        response.status_code = 503
    return {
        "status": "ok" if healthy else "degraded",
        "checks": checks,
        # timezone-aware replacement for the deprecated datetime.utcnow()
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
Go:
// internal/health/handler.go
package health

import (
	"encoding/json"
	"net/http"
	"time"
)

func Handler(w http.ResponseWriter, r *http.Request) {
	status := map[string]interface{}{
		"status":    "ok",
		"timestamp": time.Now().UTC().Format(time.RFC3339),
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(status)
}
Keep the health endpoint fast: the ALB uses it for routing decisions and ECS uses it to decide when to replace a task, so slow checks delay recovery.
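One way to keep the endpoint fast is to cap every dependency check with a timeout and run the checks concurrently. The sketch below is a framework-independent illustration of that idea, not part of any library: `run_checks`, the `Check` type, and the one-second default are assumed names and values. Each check is an async callable that raises on failure.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

# Hypothetical type: a check is an async callable that raises when the dependency is down.
Check = Callable[[], Awaitable[None]]


async def run_checks(
    checks: Dict[str, Check], timeout_s: float = 1.0
) -> Tuple[bool, Dict[str, str]]:
    """Run all checks concurrently, capping each one at timeout_s seconds."""

    async def run_one(check: Check) -> str:
        try:
            await asyncio.wait_for(check(), timeout=timeout_s)
            return "ok"
        except asyncio.TimeoutError:
            return "timeout"
        except Exception:
            return "error"

    names = list(checks)
    outcomes = await asyncio.gather(*(run_one(checks[name]) for name in names))
    results = dict(zip(names, outcomes))
    healthy = all(outcome == "ok" for outcome in results.values())
    return healthy, results
```

Inside the FastAPI handler above, this could be called as `healthy, checks = await run_checks({"database": ping_db})`, where `ping_db` stands in for your own check. Because the checks run concurrently, the endpoint takes roughly as long as the slowest check, bounded by `timeout_s`.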
Step 2: Configure health checks in your task definition
ECS supports container-level health checks directly in your task definition. Add this to your taskDefinition.json or CDK/Terraform config:
JSON task definition:
{
  "containerDefinitions": [
    {
      "name": "app",
      "image": "your-ecr-image:latest",
      "healthCheck": {
        "command": [
          "CMD-SHELL",
          "curl -f http://localhost:8080/health || exit 1"
        ],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      },
      "portMappings": [
        { "containerPort": 8080, "protocol": "tcp" }
      ]
    }
  ]
}
Terraform:
resource "aws_ecs_task_definition" "app" {
  family = "app"
  container_definitions = jsonencode([{
    name  = "app"
    image = "your-ecr-image:latest"
    healthCheck = {
      command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
      interval    = 30
      timeout     = 5
      retries     = 3
      startPeriod = 60
    }
    portMappings = [{ containerPort = 8080 }]
  }])
}
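startPeriod (60 seconds) is a grace period: health check failures during it don't count toward retries, so ECS won't kill a container that is still initializing. Adjust it to match your actual cold-start time. The command runs inside the container, so curl must be installed in the image.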
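Slim images often ship without curl. If the image already includes a Python interpreter, a small script can stand in for the curl command. The following is a minimal sketch under that assumption; the file name `healthcheck.py` and the function names are illustrative. It bypasses any proxy settings because the target is localhost, and treats every failure (refused connection, timeout, non-200 status) as unhealthy.

```python
import urllib.request

# Ignore proxy environment variables: the check only ever talks to localhost.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_healthy(url: str, timeout_s: float = 3.0) -> bool:
    """Return True only if the endpoint answers HTTP 200 within timeout_s."""
    try:
        with _opener.open(url, timeout=timeout_s) as response:
            return response.status == 200
    except Exception:
        # Connection errors, timeouts, and 4xx/5xx responses all mean "unhealthy".
        return False


def main(argv: list) -> int:
    """Return the exit code ECS expects: 0 for healthy, 1 for unhealthy."""
    url = argv[1] if len(argv) > 1 else "http://localhost:8080/health"
    return 0 if is_healthy(url) else 1


# When saved as healthcheck.py, end the file with:
#   import sys; raise SystemExit(main(sys.argv))
```

The task definition command would then look like `["CMD", "python", "/app/healthcheck.py"]`, where the path is an assumption about where the script is copied in your image. Keep the script's timeout below the `timeout` value in the task definition so the script, not ECS, decides the outcome.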
Step 3: Configure ALB health checks
If your ECS service sits behind an Application Load Balancer, configure ALB target group health checks to use the same endpoint:
AWS Console:
- Go to EC2 → Target Groups → your target group
- Click Health checks → Edit
- Set Health check path to /health
- Set Success codes to 200
- Set Interval to 30 seconds, Threshold to 2
Terraform:
resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id
  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 30
    timeout             = 5
    matcher             = "200"
  }
}
ALB health checks control routing. Vigilmon health checks control your alerting. Both should point to the same endpoint.
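The threshold and interval settings determine how long a broken target can keep receiving traffic. As a rough estimate (ignoring the per-check timeout and where in the interval the failure starts), the window is the interval multiplied by the number of consecutive results required. The helper names below are illustrative only.

```python
def seconds_to_mark_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate time until the ALB stops routing to a failing target."""
    return interval_s * unhealthy_threshold


def seconds_to_mark_healthy(interval_s: int, healthy_threshold: int) -> int:
    """Approximate time until a recovered target receives traffic again."""
    return interval_s * healthy_threshold


# Values from the Terraform example: interval = 30, unhealthy = 3, healthy = 2
print(seconds_to_mark_unhealthy(30, 3))  # 90
print(seconds_to_mark_healthy(30, 2))    # 60
```

With these settings a failing task can serve errors for about a minute and a half before the ALB takes it out of rotation. Lowering the interval shortens that window at the cost of more health check traffic.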
Step 4: Set up Vigilmon external monitoring
ECS health checks are internal — they run inside the VPC. You also need an external probe that monitors your service from the internet.
Connect your ECS service to Vigilmon:
- Sign up at vigilmon.online — free, no card required
- Click New Monitor → HTTP
- URL: https://api.yourdomain.com/health (your ALB or CloudFront URL)
- Check interval: 1 minute (paid) or 5 minutes (free)
- Expected status: 200
- Save
Vigilmon catches things CloudWatch and ECS health checks miss:
- ALB misconfiguration (target group removed from listener)
- Security group changes blocking traffic
- Route 53 DNS failures
- CloudFront distribution errors
- Certificate expiry
These are network-path and configuration failures: the container is healthy, but external traffic can't reach it.
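Certificate expiry is a good example of a failure that no container-level check can see. To illustrate what an outside check looks at, here is a minimal sketch using Python's standard `ssl` module. It is not how Vigilmon implements its check; the function names are assumptions, and any hostname you pass in is your own.

```python
import socket
import ssl
import time
from typing import Optional


def days_until(not_after: str, now: Optional[float] = None) -> float:
    """Days from `now` until a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    for example 'Jan  5 09:34:43 2030 GMT'.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(hostname: str, port: int = 443, timeout_s: float = 5.0) -> str:
    """Open a TLS connection and return the server certificate's notAfter field.

    With the default context, an already expired certificate fails the
    handshake and raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout_s) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

A call such as `days_until(fetch_not_after("api.yourdomain.com"))` returns the remaining lifetime in days, which you could compare against a warning threshold of your choosing.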
Step 5: Configure alerts
In Vigilmon, go to Notifications → New Channel:
Slack:
- Create an incoming webhook at api.slack.com/apps
- Paste the webhook URL into Vigilmon
- Enable it on your ECS service monitors
Email:
- Add email address as notification channel
- Enable on monitors
PagerDuty (for on-call rotation):
- Create a Vigilmon integration in PagerDuty
- Add the integration key in Vigilmon's notification settings
When ECS task replacement causes brief downtime:
🔴 DOWN: api.yourdomain.com/health
Status: 503
Duration: 45s
This is normal during ECS rolling updates. Use Vigilmon's Confirm Down After setting to suppress alerts for brief replacement windows:
- Set Confirm Down After to 2 failures
- This ignores single-probe failures from task replacement
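The effect of a confirmation threshold can be sketched in a few lines. This assumes the setting means "alert once N consecutive probes have failed", which is the usual interpretation but is not confirmed here as Vigilmon's exact behavior. The function name is hypothetical.

```python
from typing import Iterable, List


def confirmed_down_events(probe_ok: Iterable[bool], confirm_after: int = 2) -> List[int]:
    """Return the probe indices at which a DOWN alert would fire.

    An alert fires once per outage, at the probe where the number of
    consecutive failures first reaches confirm_after.
    """
    alerts = []
    consecutive_failures = 0
    for index, ok in enumerate(probe_ok):
        if ok:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures == confirm_after:
            alerts.append(index)
    return alerts


# One isolated failure (index 1) is ignored; the three-probe outage alerts at index 4.
history = [True, False, True, False, False, False, True]
print(confirmed_down_events(history, confirm_after=2))  # [4]
```

The trade-off is detection delay: with a 1-minute interval and a threshold of 2, a real outage is reported roughly one probe interval later than it would be with a threshold of 1.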
Step 6: CloudWatch → Vigilmon alert routing
For task-level failures (not service-level), you can combine CloudWatch alarms with Vigilmon:
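Create a CloudWatch alarm on the ALB's UnhealthyHostCount metric, which counts targets (your ECS tasks) that are failing the target group health check: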
resource "aws_cloudwatch_metric_alarm" "ecs_unhealthy_tasks" {
  alarm_name          = "ecs-unhealthy-tasks"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "UnhealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Maximum"
  threshold           = 0
  dimensions = {
    TargetGroup  = aws_lb_target_group.app.arn_suffix
    LoadBalancer = aws_lb.app.arn_suffix
  }
  alarm_actions = [aws_sns_topic.alerts.arn]
}
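Route the SNS topic to Vigilmon's webhook receiver (available in Vigilmon's integration settings) or to Slack. Slack incoming webhooks expect their own payload format, so SNS messages need a small function in between to reformat them. This gives you task-level health from the ALB on top of Vigilmon's external probe.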
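If the SNS topic should end up in Slack, one option is a small Lambda function subscribed to the topic. The sketch below assumes the standard SNS-to-Lambda event shape, where CloudWatch delivers the alarm as a JSON string in the `Message` field. The environment variable `SLACK_WEBHOOK_URL` and the function names are hypothetical.

```python
import json
import os
import urllib.request
from typing import Any, Dict, List


def alarm_summaries(event: Dict[str, Any]) -> List[str]:
    """Turn the CloudWatch alarm notifications in an SNS event into one-line summaries."""
    summaries = []
    for record in event.get("Records", []):
        # CloudWatch publishes the alarm details as a JSON string inside the SNS message.
        alarm = json.loads(record["Sns"]["Message"])
        summaries.append(
            f"{alarm['NewStateValue']}: {alarm['AlarmName']} ({alarm['NewStateReason']})"
        )
    return summaries


def build_slack_payload(event: Dict[str, Any]) -> bytes:
    """Encode the summaries in the {"text": ...} shape Slack incoming webhooks accept."""
    return json.dumps({"text": "\n".join(alarm_summaries(event))}).encode("utf-8")


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Lambda entry point: POST the summaries to the Slack webhook."""
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # hypothetical variable name
        data=build_slack_payload(event),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return {"status": response.status}
```

Keeping the parsing in `alarm_summaries` separate from the HTTP call makes the formatting easy to test without touching Slack.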
Step 7: Heartbeat monitoring for ECS Scheduled Tasks
ECS Scheduled Tasks (cron jobs running on Fargate) aren't covered by HTTP monitoring. If a scheduled task fails silently, nothing external detects it.
Add a heartbeat ping at the end of each successful run:
// tasks/sync.js
import fetch from 'node-fetch';

async function main() {
  try {
    await runSyncJob();
    const heartbeatUrl = process.env.VIGILMON_SYNC_HEARTBEAT;
    if (heartbeatUrl) {
      await fetch(heartbeatUrl, { method: 'GET' });
    }
    console.log('Sync complete');
    process.exit(0);
  } catch (err) {
    console.error('Sync failed:', err);
    process.exit(1);
  }
}

async function runSyncJob() {
  // your scheduled work
}

main();
Pass the heartbeat URL as a task definition environment variable:
{
  "environment": [
    {
      "name": "VIGILMON_SYNC_HEARTBEAT",
      "value": "https://vigilmon.online/heartbeat/your-monitor-id"
    }
  ]
}
Or store it in AWS Secrets Manager and reference it as a secret:
{
  "secrets": [
    {
      "name": "VIGILMON_SYNC_HEARTBEAT",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:vigilmon-heartbeat"
    }
  ]
}
In Vigilmon, create a Heartbeat Monitor:
- Click New Monitor → Heartbeat
- Set the interval to match your ECS cron schedule
- Copy the ping URL
- Add it to your task definition environment
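For scheduled tasks written in Python, the same pattern could look like the sketch below. It reads the `VIGILMON_SYNC_HEARTBEAT` variable used above; the function names are illustrative. One deliberate difference from the Node.js example: a failed ping is logged but does not change the exit code, on the assumption that a job whose work succeeded should not be marked as failed in ECS.

```python
import os
import urllib.request
from typing import Callable


def ping_heartbeat(url: str, timeout_s: float = 10.0) -> bool:
    """Send the heartbeat GET; return False instead of raising if it fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except Exception as exc:
        print(f"Heartbeat ping failed: {exc}")
        return False


def run_with_heartbeat(
    job: Callable[[], None], ping: Callable[[str], bool] = ping_heartbeat
) -> int:
    """Run the job and ping only after success. Returns the process exit code."""
    try:
        job()
    except Exception as exc:
        print(f"Job failed: {exc}")
        return 1  # no ping is sent, so the missed heartbeat triggers the alert
    url = os.environ.get("VIGILMON_SYNC_HEARTBEAT")
    if url:
        ping(url)
    return 0
```

The task's entry point would end with something like `raise SystemExit(run_with_heartbeat(run_sync_job))`, where `run_sync_job` is your own function.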
Step 8: Public status page
Go to Status Pages → New Status Page in Vigilmon, add all your ECS service monitors, and publish:
Service issues? https://status.yourdomain.com
Each monitor generates a live badge.
What you've built
| What | How |
|------|-----|
| Health endpoint | Container-native /health route |
| ECS task health | Task definition healthCheck |
| ALB routing protection | Target group health check |
| External uptime monitoring | Vigilmon HTTP monitor on public URL |
| Task failure alerts | Vigilmon + optional CloudWatch SNS routing |
| Slack/email/PagerDuty alerts | Vigilmon notification channels |
| Scheduled task monitoring | Heartbeat ping in ECS cron tasks |
| Status page | Vigilmon public status page |
ECS handles self-healing. Vigilmon handles visibility. CloudWatch tells you what happened afterwards; Vigilmon tells you the moment it happens.
Next steps
- Add Vigilmon monitors for each ECS service environment (prod, staging, canary)
- Watch response time trends to catch gradual container resource exhaustion before it causes failures
- Add heartbeat monitors for every ECS Scheduled Task that does important work — billing jobs, data sync, report generation
Get started free at vigilmon.online.