How to Monitor Docker Swarm Services with Vigilmon

Docker Swarm is deceptively capable for teams that don't need Kubernetes complexity. You get built-in load balancing, rolling updates, and service health checks — but those health checks only tell you what the Swarm manager can see from inside the overlay network. They cannot tell you whether your published port is reachable from the internet, whether the routing mesh is functioning correctly, or whether your ingress proxy is serving traffic after a configuration change.

Vigilmon adds the external layer that Docker Swarm lacks: monitoring from the user's perspective, heartbeat monitoring for Swarm-scheduled jobs, and multi-region validation that survives manager node failures.

Understanding the Monitoring Gap

The HEALTHCHECK instruction that Swarm relies on to judge task health is powerful but bounded:

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl -fsS http://localhost:8080/health || exit 1

This check runs inside the container, hitting the service on localhost. It cannot see:

- The Swarm routing mesh (the ingress overlay network and docker_gwbridge)
- The published port on the host (-p 80:8080)
- The ingress proxy (Traefik, Nginx, HAProxy) in front of your services
- External DNS pointing at your Swarm manager VIPs
- TLS termination at the load balancer or ingress

A container can pass HEALTHCHECK while being completely unreachable from outside the cluster. Vigilmon probes from external regions using the same path your users take — through DNS, load balancer, ingress, and all.

Step 1: Add Proper Health Endpoints to Your Services

Before configuring external monitoring, give each service a /health endpoint that reflects real dependency status:

// Node.js / Express
// Assumes `app` (Express), `db` (SQL client), and `redisClient` are initialized elsewhere
app.get('/health', async (req, res) => {
  const checks = {};
  let degraded = false;

  // Check database connectivity
  try {
    await db.query('SELECT 1');
    checks.database = 'ok';
  } catch (err) {
    checks.database = `error: ${err.message}`;
    degraded = true;
  }

  // Check Redis connectivity
  try {
    await redisClient.ping();
    checks.cache = 'ok';
  } catch (err) {
    checks.cache = `error: ${err.message}`;
    degraded = true;
  }

  const status = degraded ? 'degraded' : 'ok';
  const code = degraded ? 503 : 200;

  res.status(code).json({ status, checks, uptime: process.uptime() });
});
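One gap in the handler above: a dependency that hangs rather than fails keeps /health waiting until the caller's own timeout fires. A minimal sketch of bounding each check follows. The helpers `withTimeout` and `runCheck` are hypothetical (they are not part of Express or any database client), and the 2-second default is an arbitrary assumption.

```javascript
// Hypothetical helper: reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical helper: run one dependency check and return 'ok' or an error string.
async function runCheck(label, fn, ms = 2000) {
  try {
    await withTimeout(fn(), ms, label);
    return 'ok';
  } catch (err) {
    return `error: ${err.message}`;
  }
}
```

Inside the handler, `checks.database = await runCheck('database', () => db.query('SELECT 1'));` could stand in for the first try/catch, with `degraded` set whenever the result is not 'ok'.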

Configure the HEALTHCHECK in your Dockerfile to use this endpoint:

FROM node:20-alpine
WORKDIR /app
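# Install production dependencies first so this layer is cached across code changes
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .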

HEALTHCHECK --interval=15s --timeout=5s --start-period=30s --retries=3 \
  CMD wget -qO- http://localhost:3000/health | grep -q '"status":"ok"' || exit 1

EXPOSE 3000
CMD ["node", "server.js"]

And in your Compose / Stack file:

# docker-compose.yml (Swarm stack)
version: "3.8"
services:
  api:
    image: registry.example.com/api:latest
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      restart_policy:
        condition: on-failure
        max_attempts: 3
      resources:
        limits:
          cpus: "0.5"
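          memory: 512M
      # In Swarm mode Traefik reads labels from the service, so they go under deploy
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.api.rule=Host(`api.example.com`)"
        - "traefik.http.services.api.loadbalancer.server.port=3000"
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 30s
    ports:
      - "3000:3000"
    networks:
      - traefik_proxy
      - backend

networks:
  # Assumed to exist already, for example created by the Traefik stack
  traefik_proxy:
    external: true
  backend:
    driver: overlay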

Step 2: Monitor Swarm Service Endpoints with Vigilmon

With health endpoints in place, set up external HTTP monitoring:

1. Log in to vigilmon.online and go to Monitors → New Monitor
2. Choose HTTP / HTTPS
3. Set the URL to your public service endpoint: https://api.example.com/health
4. Check interval: 1 minute
5. Expected response:
   - Status code: 200
   - Response body contains: "status":"ok"
   - Response time threshold: 2000ms
6. Assign alert channels
7. Save
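The expected-response settings above amount to three independent assertions on every probe. To reproduce a failing check locally, a rough sketch like the following may help. `evaluateProbe`, `probe`, and the field names are hypothetical and not part of any Vigilmon API, and the 10-second request timeout is an assumption.

```javascript
// Hypothetical helper, not a Vigilmon API. Mirrors the three settings above.
const EXPECTED = { status: 200, bodyContains: '"status":"ok"', maxMs: 2000 };

// Returns a list of human-readable failures; an empty list means the probe passed.
function evaluateProbe(result, expected = EXPECTED) {
  const failures = [];
  if (result.status !== expected.status) {
    failures.push(`status ${result.status}, expected ${expected.status}`);
  }
  if (!result.body.includes(expected.bodyContains)) {
    failures.push(`body missing ${expected.bodyContains}`);
  }
  if (result.elapsedMs > expected.maxMs) {
    failures.push(`took ${result.elapsedMs}ms, limit ${expected.maxMs}ms`);
  }
  return failures;
}

// Fetch a URL with the global fetch (Node 18+) and evaluate the response.
async function probe(url, expected = EXPECTED) {
  const started = performance.now();
  const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
  const body = await res.text();
  const elapsedMs = Math.round(performance.now() - started);
  return evaluateProbe({ status: res.status, body, elapsedMs }, expected);
}
```

Running `probe('https://api.example.com/health').then(console.log)` from a machine outside the cluster prints the list of failed assertions. Note that the body match is a plain substring test, so a pretty-printed JSON response with a space after the colon would not match.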

Vigilmon probes from multiple geographic regions, requiring consensus before opening an incident. A single region's connectivity blip will not wake your on-call engineer.
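To make the consensus idea concrete, here is an illustrative majority rule. It is only a sketch of the general technique: Vigilmon's actual consensus algorithm is not described in this article, and the function name, region names, and threshold are all assumptions.

```javascript
// Illustrative majority rule, not Vigilmon's documented behavior.
// `results` maps a region name to true (probe passed) or false (probe failed).
function shouldOpenIncident(results, minFailingFraction = 0.5) {
  const outcomes = Object.values(results);
  if (outcomes.length === 0) return false; // no data, no incident
  const failing = outcomes.filter((ok) => !ok).length;
  // Strictly more than the given fraction of regions must agree on the failure.
  return failing / outcomes.length > minFailingFraction;
}

// One region failing out of three is treated as a local blip.
console.log(shouldOpenIncident({ 'us-east': true, 'eu-west': false, 'ap-south': true }));
// Two regions failing out of three opens an incident.
console.log(shouldOpenIncident({ 'us-east': false, 'eu-west': false, 'ap-south': true }));
```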

Monitor the Swarm Manager Nodes

Swarm manager nodes run the control plane. If you lose quorum (fewer than ⌊n/2⌋+1 of n managers healthy, for example fewer than 2 of 3), the cluster cannot accept new deployments or scaling events, although tasks that are already running keep serving traffic. Monitor each manager node directly:

# Add a simple TCP/HTTP monitor for each manager node
# Manager nodes expose port 2377 for cluster management (internal only)
# Monitor the published service port on each manager's IP:
curl -i http://MANAGER_IP:3000/health

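Create a separate Vigilmon monitor for each manager node's published service port, using an address the external probes can reach. Because the routing mesh publishes the port on every node, this check confirms that the node is up and forwarding traffic; it does not inspect Raft state directly. If a manager node stops responding while the service is still reachable via the public endpoint, you'll see the degraded infrastructure before it causes a quorum failure.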
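Swarm managers use Raft, which needs a strict majority of managers to agree. The arithmetic is small enough to sketch, which helps when deciding how many failing manager monitors should page someone. The function names here are illustrative only.

```javascript
// Raft needs a strict majority of managers to agree.
function quorumSize(managers) {
  return Math.floor(managers / 2) + 1;
}

// How many managers can fail before the cluster loses quorum.
function faultTolerance(managers) {
  return managers - quorumSize(managers);
}

for (const n of [1, 3, 5, 7]) {
  console.log(
    `${n} managers: quorum ${quorumSize(n)}, tolerates ${faultTolerance(n)} failure(s)`
  );
}
```

With three managers, a single failing manager monitor is a warning and a second one means quorum is gone. An even manager count adds no tolerance: four managers still survive only one failure.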

Naming Convention

[swarm] api.example.com /health — public ingress
[swarm-manager-1] 192.168.1.10:3000 /health — manager node 1
[swarm-manager-2] 192.168.1.11:3000 /health — manager node 2
[swarm-manager-3] 192.168.1.12:3000 /health — manager node 3

Step 3: Service Endpoint Probing vs Container-Level Health

This is the key distinction Swarm engineers often miss:

| Layer | Tool | What It Sees |
|---|---|---|
| Container process | HEALTHCHECK | localhost inside container |
| Swarm task health | docker service ps | Container exit codes, health status |
| Service VIP / routing mesh | Internal curl | Overlay network routing |
| Published port on host | External curl | Docker iptables rules |
| Ingress proxy | Traefik/Nginx logs | Proxy routing rules |
| Public endpoint | Vigilmon | Full external path including DNS, TLS |

Vigilmon's role is the last row: probing the same URL your users hit, from outside your infrastructure entirely.

Example gap scenario: You deploy a new version of api with a typo in a Traefik routing label. The container is healthy (HEALTHCHECK passes) and the Swarm service shows Running, but Traefik returns 404 to all external traffic because the router rule never matches. With a 1-minute check interval, Vigilmon catches this on its next probe cycle. Swarm's internal health checks see nothing wrong.

Step 4: Heartbeat Monitoring for Swarm Scheduled Tasks

Docker Swarm doesn't have a native CronJob equivalent — teams typically run scheduled tasks with cron inside a service container, a one-shot container triggered by an external cron, or a custom scheduler service. All of these can fail silently.

Vigilmon heartbeat monitors give scheduled tasks a voice:

Set Up the Heartbeat Monitor

1. Monitors → New Monitor → Heartbeat
2. Name: swarm-nightly-cleanup
3. Expected interval: 1 day
4. Grace period: 2 hours
5. Save, then copy the heartbeat URL
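The interval and grace period combine into a single deadline. The sketch below shows how such a deadline is typically computed. It illustrates the general technique rather than Vigilmon's implementation, and the function names are hypothetical.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical helper: the latest moment the next ping may arrive.
// Defaults mirror the settings above: 1 day interval, 2 hour grace period.
function heartbeatDeadline(lastPingMs, intervalMs = 24 * HOUR_MS, graceMs = 2 * HOUR_MS) {
  return lastPingMs + intervalMs + graceMs;
}

// True once the deadline has passed without a new ping.
function isOverdue(lastPingMs, nowMs, intervalMs, graceMs) {
  return nowMs > heartbeatDeadline(lastPingMs, intervalMs, graceMs);
}
```

Under these settings, a job that last pinged at 03:00 is treated as late only after 05:00 the following day, which leaves room for a slow run without hiding a skipped one.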

Trigger via External Cron

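# /etc/cron.d/swarm-cleanup (runs on a Swarm manager node)
# cron needs the whole entry on one line, and a literal % must be escaped as \%
# The ping follows service creation; it does not by itself verify that the cleanup exited successfully
0 3 * * * root docker service create --restart-condition=none --name cleanup-$(date +\%s) registry.example.com/cleanup:latest && curl -fsS https://vigilmon.online/heartbeat/abc123xyz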

Trigger via Swarm Service (One-Shot Pattern)

# cleanup-task.yml — deploy with: docker stack deploy -c cleanup-task.yml cleanup
version: "3.8"
services:
  cleanup:
    image: registry.example.com/cleanup:latest
    command: >
      sh -c "/app/cleanup.sh &&
             curl -fsS $$VIGILMON_HEARTBEAT_URL"
    deploy:
      replicas: 1
      restart_policy:
        condition: none
    environment:
      - VIGILMON_HEARTBEAT_URL
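The one-shot pattern above pings only when cleanup.sh exits successfully. If the job itself is a Node.js script, the same rule can be applied in-process. This is a sketch under assumptions: `runJobAndPing` is a hypothetical wrapper, and it assumes a plain GET to the heartbeat URL registers a ping, as the curl calls above suggest.

```javascript
// Hypothetical wrapper: run a job, then ping the heartbeat URL only if it succeeded.
// `fetchImpl` is injectable for testing; it defaults to Node's global fetch (Node 18+).
async function runJobAndPing(job, heartbeatUrl, fetchImpl = fetch) {
  await job(); // a thrown error skips the ping, so the monitor goes overdue
  const res = await fetchImpl(heartbeatUrl, { method: 'GET' });
  if (!res.ok) {
    throw new Error(`heartbeat ping failed with HTTP ${res.status}`);
  }
  return res.status;
}
```

A job script could end with `runJobAndPing(cleanup, process.env.VIGILMON_HEARTBEAT_URL).catch((err) => { console.error(err); process.exitCode = 1; });`, where `cleanup` is whatever async function does the work. Skipping the ping on failure is deliberate: the missed heartbeat is the alert.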

Step 5: Vigilmon as External Validator Beyond Docker Health Checks

Consider a rolling update scenario:

docker service update \
  --image registry.example.com/api:v2.1.0 \
  --update-parallelism 1 \
  --update-delay 10s \
  --update-failure-action rollback \
  api

Docker's --update-failure-action rollback triggers when an updated task fails to start or is reported unhealthy by its HEALTHCHECK during the update's monitoring window. But if the new image has a bug that only manifests under external traffic (a broken middleware, a missing environment variable, a bad TLS config), Docker's internal health check may pass while Vigilmon shows the external endpoint returning 500.

You can integrate Vigilmon into your deployment pipeline:

#!/bin/bash
# deploy.sh — deploy with Vigilmon validation
SERVICE_URL="https://api.example.com/health"
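# Fail early if VERSION is unset instead of deploying an image with an empty tag
IMAGE="registry.example.com/api:${VERSION:?set VERSION to the image tag to deploy}"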

echo "Deploying $IMAGE..."
docker service update --image "$IMAGE" api

echo "Waiting for rollout..."
sleep 30

echo "Validating external health..."
for i in {1..10}; do
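  # --max-time keeps a hung connection from stalling the loop; curl prints 000 on failure
  HTTP_STATUS=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$SERVICE_URL")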
  if [ "$HTTP_STATUS" = "200" ]; then
    echo "External health check passed (HTTP $HTTP_STATUS)"
    exit 0
  fi
  echo "Check $i/10: HTTP $HTTP_STATUS — waiting..."
  sleep 10
done

echo "External health check failed after 10 attempts — rolling back"
docker service rollback api
exit 1

This script validates that the same URL Vigilmon probes returns 200 after the rollout, not just that Docker reports healthy containers.
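One caveat with accepting the first 200: requests are load-balanced across replicas, so a single pass can be luck when only some replicas are broken. A stricter variant requires several consecutive passes. The sketch below is an assumption-laden illustration: `waitForStable`, its option names, and the default of three consecutive passes are all hypothetical.

```javascript
// Hypothetical helper. `probe` is an async function resolving to an HTTP status code.
// `sleep` is injectable so the loop can be tested without real delays.
async function waitForStable(
  probe,
  { attempts = 10, required = 3, delayMs = 10_000, sleep = defaultSleep } = {}
) {
  let streak = 0;
  for (let i = 1; i <= attempts; i++) {
    // Network errors and thrown exceptions count as a failed check.
    const status = await Promise.resolve().then(probe).catch(() => 0);
    streak = status === 200 ? streak + 1 : 0; // any failure resets the streak
    if (streak >= required) return true;
    if (i < attempts) await sleep(delayMs);
  }
  return false;
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

A pipeline step could call `waitForStable(() => fetch(url).then((r) => r.status))` and run `docker service rollback api` when it resolves to false.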

Step 6: Migration from Swarm to Kubernetes

If you're planning a Kubernetes migration, Vigilmon makes the transition safer. Your monitors continue to test the public endpoints regardless of what's serving them behind the scenes. During a blue/green migration:

1. Keep Swarm monitors active
2. Add K8s monitors for the new cluster endpoints
3. Run both in parallel while migrating traffic
4. Compare monitor status to validate parity before cutting over
5. Decommission Swarm monitors after migration is complete

The monitor names will change, but Vigilmon's status page gives you a single view across both environments during the transition.
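Comparing monitor status for parity can be as simple as diffing the latest result per path from each environment. The sketch below assumes a hypothetical data shape (`{ path, status }` per probe) and a hypothetical `/v1/users` path; how results are exported from Vigilmon is not covered here.

```javascript
// Hypothetical data shape: each entry is { path, status } from probing one environment.
function parityReport(swarmResults, k8sResults) {
  const k8sByPath = new Map(k8sResults.map((r) => [r.path, r.status]));
  const mismatches = [];
  for (const { path, status } of swarmResults) {
    // null marks a path that has no monitor on the Kubernetes side yet.
    const other = k8sByPath.has(path) ? k8sByPath.get(path) : null;
    if (other !== status) mismatches.push({ path, swarm: status, k8s: other });
  }
  return mismatches;
}

const swarm = [{ path: '/health', status: 200 }, { path: '/v1/users', status: 200 }];
const k8s = [{ path: '/health', status: 200 }, { path: '/v1/users', status: 404 }];
console.log(parityReport(swarm, k8s));
```

An empty report is a reasonable precondition for cutting traffic over; any entry points at a route the new cluster does not serve the same way.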

Summary

Docker Swarm's internal health checks are valuable but insufficient for production reliability. Vigilmon adds the external layer your Swarm cluster needs:

| Coverage | Docker Swarm | Vigilmon |
|---|---|---|
| Container liveness | ✓ HEALTHCHECK | ✓ |
| Overlay network routing | ✓ (internal) | ✗ |
| Published port reachability | ✗ | ✓ |
| Ingress proxy routing | ✗ | ✓ |
| External DNS + TLS | ✗ | ✓ |
| Scheduled task execution | ✗ | ✓ (heartbeat) |
| Multi-region validation | ✗ | ✓ |

Get started free at vigilmon.online — your first Swarm service monitor is live in under two minutes.