Docker Swarm is deceptively capable for teams that don't need Kubernetes complexity. You get built-in load balancing, rolling updates, and service health checks, but those health checks only report what Docker can see from inside the container. They cannot tell you whether your published port is reachable from the internet, whether the routing mesh is functioning correctly, or whether your ingress proxy is serving traffic after a configuration change.
Vigilmon adds the external layer that Docker Swarm lacks: monitoring from the user's perspective, heartbeat monitoring for Swarm-scheduled jobs, and multi-region validation that survives manager node failures.
Understanding the Monitoring Gap
The Dockerfile HEALTHCHECK instruction, which Swarm uses to decide whether a task is healthy, is powerful but bounded:
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl -fsS http://localhost:8080/health || exit 1
This check runs inside the container, hitting the service on localhost. It cannot see:
- The Swarm routing mesh (docker_gwbridge and overlay network)
- The published port on the host (-p 80:8080)
- The ingress proxy (Traefik, Nginx, HAProxy) in front of your services
- External DNS pointing at your Swarm manager VIPs
- TLS termination at the load balancer or ingress
A container can pass HEALTHCHECK while being completely unreachable from outside the cluster. Vigilmon probes from external regions using the same path your users take — through DNS, load balancer, ingress, and all.
Step 1: Add Proper Health Endpoints to Your Services
Before configuring external monitoring, give each service a /health endpoint that reflects real dependency status:
// Node.js / Express
app.get('/health', async (req, res) => {
  const checks = {};
  let degraded = false;

  // Check database connectivity
  try {
    await db.query('SELECT 1');
    checks.database = 'ok';
  } catch (err) {
    checks.database = `error: ${err.message}`;
    degraded = true;
  }

  // Check Redis connectivity
  try {
    await redisClient.ping();
    checks.cache = 'ok';
  } catch (err) {
    checks.cache = `error: ${err.message}`;
    degraded = true;
  }

  const status = degraded ? 'degraded' : 'ok';
  const code = degraded ? 503 : 200;
  res.status(code).json({ status, checks, uptime: process.uptime() });
});
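One gap in a handler like the one above is that a hung dependency can stall the response past the HEALTHCHECK timeout. The following is a minimal sketch of one way to bound each check, not part of the original service: the helper names withTimeout and runChecks, and the 2000 ms default, are assumptions chosen for illustration.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so it never keeps the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical helper: run every named check concurrently and fold the
// outcomes into the same { status, code, checks } shape the handler returns.
// `checks` maps a name to a function that returns a promise.
async function runChecks(checks, timeoutMs = 2000) {
  const names = Object.keys(checks);
  const settled = await Promise.allSettled(
    names.map((name) =>
      // Promise.resolve().then(...) also captures checks that throw synchronously.
      withTimeout(Promise.resolve().then(() => checks[name]()), timeoutMs, name)
    )
  );

  const results = {};
  let degraded = false;
  settled.forEach((outcome, i) => {
    if (outcome.status === 'fulfilled') {
      results[names[i]] = 'ok';
    } else {
      results[names[i]] = `error: ${outcome.reason.message}`;
      degraded = true;
    }
  });

  return {
    status: degraded ? 'degraded' : 'ok',
    code: degraded ? 503 : 200,
    checks: results,
  };
}
```

Inside the handler, this could be called as runChecks({ database: () => db.query('SELECT 1'), cache: () => redisClient.ping() }) and the returned code passed to res.status(). Keeping the per-check timeout below the HEALTHCHECK timeout means a slow dependency shows up as a 503 rather than as a timed-out probe.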
Configure the HEALTHCHECK in your Dockerfile to use this endpoint:
FROM node:20-alpine
WORKDIR /app
COPY . .
# --production is deprecated in current npm; --omit=dev is the replacement
RUN npm ci --omit=dev
HEALTHCHECK --interval=15s --timeout=5s --start-period=30s --retries=3 \
  CMD wget -qO- http://localhost:3000/health | grep -q '"status":"ok"' || exit 1
EXPOSE 3000
CMD ["node", "server.js"]
And in your Compose / Stack file:
# docker-compose.yml (Swarm stack)
version: "3.8"
services:
  api:
    image: registry.example.com/api:latest
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      restart_policy:
        condition: on-failure
        max_attempts: 3
      resources:
        limits:
          cpus: "0.5"
          memory: 512M
      # In Swarm mode Traefik reads service labels, so they belong under deploy
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.api.rule=Host(`api.example.com`)"
        - "traefik.http.services.api.loadbalancer.server.port=3000"
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 30s
    ports:
      - "3000:3000"
    networks:
      - traefik_proxy
      - backend
networks:
  # Assumes the Traefik overlay network was created before this stack is deployed
  traefik_proxy:
    external: true
  backend:
Step 2: Monitor Swarm Service Endpoints with Vigilmon
With health endpoints in place, set up external HTTP monitoring:
- Log in to vigilmon.online and go to Monitors → New Monitor
- Choose HTTP / HTTPS
- Set the URL to your public service endpoint: https://api.example.com/health
- Check interval: 1 minute
- Expected response:
  - Status code: 200
  - Response body contains: "status":"ok"
  - Response time threshold: 2000ms
- Assign alert channels
- Save
Vigilmon probes from multiple geographic regions, requiring consensus before opening an incident. A single region's connectivity blip will not wake your on-call engineer.
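To make the monitor settings and the consensus idea concrete, here is a rough sketch of the logic an external checker applies. It is an illustration only: the result shape ({ statusCode, body, elapsedMs }), the function names, and the "at least two failing regions" threshold are assumptions, not Vigilmon's documented behavior.

```javascript
// Expectations copied from the monitor settings listed above.
const DEFAULT_EXPECTED = {
  statusCode: 200,
  bodyContains: '"status":"ok"',
  maxMs: 2000,
};

// Evaluate one probe result against the configured expectations.
// `result` is a hypothetical shape: { statusCode, body, elapsedMs }.
function evaluateProbe(result, expected = DEFAULT_EXPECTED) {
  const failures = [];
  if (result.statusCode !== expected.statusCode) {
    failures.push(`status ${result.statusCode}`);
  }
  if (!String(result.body).includes(expected.bodyContains)) {
    failures.push('body mismatch');
  }
  if (result.elapsedMs > expected.maxMs) {
    failures.push(`slow: ${result.elapsedMs}ms`);
  }
  return { ok: failures.length === 0, failures };
}

// Open an incident only when at least `minFailing` regions fail, so a single
// region's connectivity blip is ignored. The threshold of 2 is an assumption.
function shouldOpenIncident(regionResults, minFailing = 2) {
  const failing = Object.entries(regionResults)
    .filter(([, result]) => !evaluateProbe(result).ok)
    .map(([region]) => region);
  return { open: failing.length >= minFailing, failing };
}
```

With three regions, one failing region leaves open as false, while two failing regions flip it to true and list which regions disagreed.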
Monitor the Swarm Manager Nodes
Swarm manager nodes run the control plane. If you lose quorum (fewer than ⌊n/2⌋+1 managers healthy, for example fewer than 2 of 3 or 3 of 5), the cluster cannot accept new deployments or scaling events, although tasks that are already running keep serving. Monitor each manager node directly:
# Add a simple TCP/HTTP monitor for each manager node
# Manager nodes expose port 2377 for cluster management (internal only)
# Monitor the published service port on each manager's IP:
curl -i http://MANAGER_IP:3000/health
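Create a separate Vigilmon monitor for each manager node's published service port. The routing mesh answers on every node, so a failing probe here means that host or its mesh routing is down, even while the service stays reachable through other nodes. That gives you early warning of degraded infrastructure before enough managers fail to break quorum.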
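Quorum in Swarm's Raft-based control plane is a strict majority of managers, which is floor(n/2) + 1. The short sketch below works through that arithmetic; the function names are illustrative and not part of any Docker API.

```javascript
// Smallest number of healthy managers that still forms a majority.
function quorumSize(managers) {
  return Math.floor(managers / 2) + 1;
}

// How many managers can fail before the cluster loses quorum.
function faultTolerance(managers) {
  return managers - quorumSize(managers);
}

// True while enough managers are healthy to accept deployments and scaling.
function hasQuorum(totalManagers, healthyManagers) {
  return healthyManagers >= quorumSize(totalManagers);
}

// Print the trade-off for common cluster sizes.
for (const n of [1, 3, 5, 7]) {
  console.log(`${n} managers: quorum ${quorumSize(n)}, tolerates ${faultTolerance(n)} failure(s)`);
}
```

A three-manager cluster tolerates one failure and a five-manager cluster tolerates two. An even count adds no tolerance (four managers still tolerate only one failure), which is why odd manager counts are the usual recommendation.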
Naming Convention
- [swarm] api.example.com /health: public ingress
- [swarm-manager-1] 192.168.1.10:3000 /health: manager node 1
- [swarm-manager-2] 192.168.1.11:3000 /health: manager node 2
- [swarm-manager-3] 192.168.1.12:3000 /health: manager node 3
Step 3: Service Endpoint Probing vs Container-Level Health
This is the key distinction Swarm engineers often miss:
| Layer | Tool | What It Sees |
|---|---|---|
| Container process | HEALTHCHECK | localhost inside container |
| Swarm task health | docker service ps | Container exit codes, health status |
| Service VIP / routing mesh | Internal curl | Overlay network routing |
| Published port on host | External curl | Docker iptables rules |
| Ingress proxy | Traefik/Nginx logs | Proxy routing rules |
| Public endpoint | Vigilmon | Full external path including DNS, TLS |
Vigilmon's role is the last row: probing the same URL your users hit, from outside your infrastructure entirely.
Example gap scenario: You deploy a new version of api with a Traefik routing label typo. The container is healthy (HEALTHCHECK passes), the Swarm service shows Running, but Traefik is serving a 404 to all external traffic because the router label is malformed. Vigilmon catches this within 60 seconds. Swarm's internal health checks see nothing wrong.
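When an external probe fails, the layer table above suggests a triage order: walk from the container outward and stop at the first layer that reports a problem. The sketch below encodes that walk. The layer keys and the observations object are hypothetical names for results you would gather yourself (HEALTHCHECK status, docker service ps, an internal curl, and so on).

```javascript
// Layers ordered from innermost to outermost, mirroring the table above.
const LAYERS = ['container', 'task', 'mesh', 'port', 'ingress', 'public'];

// Return the innermost layer observed as failing, or null if none failed.
// `observations` maps a layer name to true (healthy) or false (failing);
// layers that were not checked can simply be omitted.
function firstFailingLayer(observations) {
  for (const layer of LAYERS) {
    if (observations[layer] === false) {
      return layer;
    }
  }
  return null;
}

// The Traefik label typo scenario: everything inside the cluster looks fine,
// but the proxy and the public endpoint both fail.
const typoScenario = {
  container: true,
  task: true,
  mesh: true,
  port: true,
  ingress: false,
  public: false,
};
console.log(firstFailingLayer(typoScenario)); // ingress
```

If only the public layer fails while every inner layer passes, the remaining suspects are the pieces that only the external path exercises, such as DNS and TLS.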
Step 4: Heartbeat Monitoring for Swarm Scheduled Tasks
Docker Swarm doesn't have a native CronJob equivalent — teams typically run scheduled tasks with cron inside a service container, a one-shot container triggered by an external cron, or a custom scheduler service. All of these can fail silently.
Vigilmon heartbeat monitors give scheduled tasks a voice:
Set Up the Heartbeat Monitor
- Monitors → New Monitor → Heartbeat
- Name: swarm-nightly-cleanup
- Expected interval: 1 day
- Grace period: 2 hours
- Save, then copy the heartbeat URL
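The interval and grace period interact in a simple way: a job is only considered missing once both have elapsed since the last ping. The sketch below models that with three states. It assumes generic heartbeat semantics; the state names and the exact boundaries are illustrative, not Vigilmon's documented rules.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Classify a heartbeat from the time of its last ping.
// Defaults match the monitor above: 1 day interval, 2 hour grace period.
function heartbeatState(lastPingMs, nowMs, intervalMs = 24 * HOUR_MS, graceMs = 2 * HOUR_MS) {
  const elapsed = nowMs - lastPingMs;
  if (elapsed <= intervalMs) {
    return 'up'; // the next ping is not due yet
  }
  if (elapsed <= intervalMs + graceMs) {
    return 'late'; // overdue, but still inside the grace period
  }
  return 'down'; // interval and grace period have both elapsed
}
```

With these settings, a nightly job that last pinged at 03:00 is still tolerated at 04:59 the next day and would be flagged from 05:00 onward. A grace period shorter than the job's normal runtime variation will produce false alarms.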
Trigger via External Cron
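# /etc/cron.d/swarm-cleanup (runs on a Swarm manager node)
# cron.d entries must fit on one line, and a literal % must be escaped as \%.
# The ping fires when docker service create returns, which does not prove that cleanup finished.
0 3 * * * root docker service create --restart-condition=none --name cleanup-$(date +\%s) registry.example.com/cleanup:latest && curl -fsS https://vigilmon.online/heartbeat/abc123xyz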
Trigger via Swarm Service (One-Shot Pattern)
# cleanup-task.yml — deploy with: docker stack deploy -c cleanup-task.yml cleanup
version: "3.8"
services:
  cleanup:
    image: registry.example.com/cleanup:latest
    command: >
      sh -c "/app/cleanup.sh &&
             curl -fsS $$VIGILMON_HEARTBEAT_URL"
    deploy:
      replicas: 1
      restart_policy:
        condition: none
    environment:
      - VIGILMON_HEARTBEAT_URL
Step 5: Vigilmon as External Validator Beyond Docker Health Checks
Consider a rolling update scenario:
docker service update \
  --image registry.example.com/api:v2.1.0 \
  --update-parallelism 1 \
  --update-delay 10s \
  --update-failure-action rollback \
  api
Docker's --update-failure-action rollback triggers when an updated task fails to start or fails its HEALTHCHECK during the update. But if the new image has a bug that only manifests under external traffic (a broken middleware, a missing environment variable, a bad TLS config), Docker's internal health check may pass while Vigilmon shows the external endpoint returning 500.
You can integrate Vigilmon into your deployment pipeline:
#!/bin/bash
# deploy.sh — deploy with external validation of the public endpoint

# Abort early if VERSION is unset, instead of deploying an image with an empty tag
: "${VERSION:?Set VERSION to the image tag to deploy}"

SERVICE_URL="https://api.example.com/health"
IMAGE="registry.example.com/api:${VERSION}"

echo "Deploying $IMAGE..."
docker service update --image "$IMAGE" api

echo "Waiting for rollout..."
sleep 30

echo "Validating external health..."
for i in {1..10}; do
  # --max-time keeps a hung connection from stalling the loop
  HTTP_STATUS=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$SERVICE_URL")
  if [ "$HTTP_STATUS" = "200" ]; then
    echo "External health check passed (HTTP $HTTP_STATUS)"
    exit 0
  fi
  echo "Check $i/10: HTTP $HTTP_STATUS — waiting..."
  sleep 10
done

echo "External health check failed after 10 attempts — rolling back"
docker service rollback api
exit 1
This script polls the same public URL that Vigilmon probes and requires a 200 after the rollout, rather than trusting Docker's report of healthy containers alone.
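A single 200 can be a lucky hit, so a stricter variant of the loop above requires several consecutive successes before declaring the rollout good. The following Node.js sketch shows that idea under stated assumptions: waitForStableOk and probeHealth are hypothetical names, the thresholds are arbitrary defaults, and probeHealth relies on the global fetch available in Node 18 and later.

```javascript
// Poll `probe` until it returns HTTP 200 `required` times in a row.
// `probe` is any async function that resolves to a status code.
async function waitForStableOk(probe, { required = 3, attempts = 10, delayMs = 10000 } = {}) {
  let streak = 0;
  for (let i = 1; i <= attempts; i++) {
    let status;
    try {
      status = await probe();
    } catch {
      status = 0; // treat network errors and timeouts as a failed check
    }
    // Any non-200 response resets the streak.
    streak = status === 200 ? streak + 1 : 0;
    if (streak >= required) {
      return { ok: true, attemptsUsed: i };
    }
    if (i < attempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return { ok: false, attemptsUsed: attempts };
}

// Hypothetical default probe: fetch the URL with a 5 second timeout.
async function probeHealth(url) {
  const response = await fetch(url, { signal: AbortSignal.timeout(5000) });
  return response.status;
}
```

A deploy script could call waitForStableOk(() => probeHealth('https://api.example.com/health')) and run docker service rollback api when ok comes back false.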
Step 6: Migration from Swarm to Kubernetes
If you're planning a Kubernetes migration, Vigilmon makes the transition safer. Your monitors continue to test the public endpoints regardless of what's serving them behind the scenes. During a blue/green migration:
- Keep Swarm monitors active
- Add K8s monitors for the new cluster endpoints
- Run both in parallel while migrating traffic
- Compare monitor status to validate parity before cutting over
- Decommission Swarm monitors after migration is complete
The monitor names will change, but Vigilmon's status page gives you a single view across both environments during the transition.
Summary
Docker Swarm's internal health checks are valuable but insufficient for production reliability. Vigilmon adds the external layer your Swarm cluster needs:
| Coverage | Docker Swarm | Vigilmon |
|---|---|---|
| Container liveness | ✓ HEALTHCHECK | ✓ |
| Overlay network routing | ✓ (internal) | ✗ |
| Published port reachability | ✗ | ✓ |
| Ingress proxy routing | ✗ | ✓ |
| External DNS + TLS | ✗ | ✓ |
| Scheduled task execution | ✗ | ✓ (heartbeat) |
| Multi-region validation | ✗ | ✓ |
Get started free at vigilmon.online — your first Swarm service monitor is live in under two minutes.