
Uptime Monitoring for AWS Lambda Functions: A Practical Guide

How to add production-grade uptime monitoring to AWS Lambda — health endpoints, cold start impact, Vigilmon webhook integration, and heartbeat monitors for scheduled functions.

AWS Lambda is one of the most widely adopted serverless platforms in the world. The operational model is compelling: no servers to provision, automatic scaling, and pay-per-invocation pricing. But "serverless" doesn't mean "worry-free." Lambda functions can fail silently, cold starts inflate response times, and scheduled jobs can stop running without any visible error. Vigilmon gives you the external visibility layer that Lambda's built-in metrics can't provide: an independent check, made from outside AWS, that confirms your function actually responds correctly.

This guide walks through adding production-grade uptime monitoring to your Lambda functions.

What You'll Build

  • A health endpoint Lambda function that checks real dependencies
  • A Vigilmon HTTP monitor targeting your function via API Gateway or Lambda URL
  • A heartbeat monitor for scheduled (EventBridge) Lambda jobs
  • Alert channels via Slack and webhook

Prerequisites

  • An AWS account with Lambda and API Gateway (or Lambda Function URLs) configured
  • Node.js or Python Lambda functions (examples cover both)
  • A free account at vigilmon.online

Step 1: Add a Health Endpoint to Your Lambda Function

The simplest and most reliable monitoring pattern for Lambda is a dedicated /health route that probes your real dependencies — database, downstream APIs, cache — and returns a structured status. This goes beyond a static 200 and tells you whether the function is actually usable.
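
As a minimal sketch of this pattern, the Python below runs a set of named probe callables and folds the results into the same response shape the handlers in this step return. The helper names (`run_checks`, `build_response`, `failing_probe`) are hypothetical and exist only for illustration; a real probe would call your database, cache, or downstream API.

```python
import json


def run_checks(probes):
    """Run each named probe; a probe passes if it returns without raising."""
    checks = {}
    for name, probe in probes.items():
        try:
            probe()
            checks[name] = "ok"
        except Exception as exc:  # any failure marks that dependency as unhealthy
            checks[name] = f"error: {exc}"
    return checks


def build_response(checks):
    """Fold probe results into an API Gateway / Function URL style response."""
    ok = all(result == "ok" for result in checks.values())
    return {
        "statusCode": 200 if ok else 503,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"status": "ok" if ok else "degraded", "checks": checks}),
    }


def failing_probe():
    """Hypothetical probe that simulates an unreachable dependency."""
    raise ConnectionError("connection refused")
```

Calling `build_response(run_checks({"cache": lambda: None, "database": failing_probe}))` yields a 503 with `status` set to `degraded`, because a single failing probe is enough to mark the function as unusable.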

Node.js (direct Lambda handler behind API Gateway or a Function URL)

// health.js
const { DynamoDBClient, DescribeTableCommand } = require("@aws-sdk/client-dynamodb");

const dynamo = new DynamoDBClient({ region: process.env.AWS_REGION });

exports.handler = async (event) => {
  const checks = {};
  let ok = true;

  // DynamoDB probe — describe the table to verify IAM permissions + connectivity
  try {
    await dynamo.send(new DescribeTableCommand({ TableName: process.env.TABLE_NAME }));
    checks.dynamodb = "ok";
  } catch (err) {
    checks.dynamodb = `error: ${err.message}`;
    ok = false;
  }

  // Downstream API probe (global fetch and AbortSignal.timeout require the Node.js 18+ runtime)
  try {
    const resp = await fetch("https://api.example.com/ping", {
      signal: AbortSignal.timeout(3000),
    });
    checks.upstream = resp.ok ? "ok" : `http_${resp.status}`;
    if (!resp.ok) ok = false;
  } catch (err) {
    // Covers timeouts, DNS failures, and refused connections
    checks.upstream = "timeout_or_unreachable";
    ok = false;
  }

  return {
    statusCode: ok ? 200 : 503,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ status: ok ? "ok" : "degraded", checks }),
  };
};

Python (direct Lambda handler)

import json
import os
import boto3

def health_handler(event, context):
    checks = {}
    ok = True

    # S3 probe — check bucket accessibility
    try:
        s3 = boto3.client("s3")
        s3.head_bucket(Bucket=os.environ["BUCKET_NAME"])
        checks["s3"] = "ok"
    except Exception as e:
        checks["s3"] = f"error: {str(e)}"
        ok = False

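    # RDS connectivity probe: open a connection and run a trivial query
    conn = None
    try:
        import psycopg2  # not included in the Lambda runtime; bundle it or ship it in a layer
        conn = psycopg2.connect(
            host=os.environ["DB_HOST"],
            database=os.environ["DB_NAME"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASS"],
            connect_timeout=3,
        )
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
        checks["rds"] = "ok"
    except Exception as e:
        checks["rds"] = f"error: {str(e)}"
        ok = False
    finally:
        # Always release the connection, even when the query fails
        if conn is not None:
            conn.close()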

    return {
        "statusCode": 200 if ok else 503,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"status": "ok" if ok else "degraded", "checks": checks}),
    }

Deploy this as a Lambda function and expose it through API Gateway or a Lambda Function URL. Your health endpoint URL will look like:

https://<api-id>.execute-api.<region>.amazonaws.com/prod/health
# or with Lambda Function URL:
https://<url-id>.lambda-url.<region>.on.aws/health

Step 2: Understanding Cold Start Impact on Monitoring

Lambda functions that haven't been invoked recently go "cold": the execution environment is reclaimed, and a new one must be initialized on the next request. Cold starts typically add roughly 200ms–2s of latency, depending on runtime, memory, package size, and how much work runs at initialization.

This matters for monitoring because:

  1. The first probe after an idle period may time out or fail if your monitor's timeout is shorter than the cold start duration.
  2. Cold start latency looks like a performance incident in response time graphs.
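
One way to tell cold start latency apart from a real slowdown is to have the health endpoint report whether the invocation was cold. The sketch below relies on the fact that module-level code runs once per execution environment. The `cold_start` and `environment_age_s` field names are assumptions made for this example, not fields Vigilmon requires.

```python
import json
import time

# Module-level code runs once per execution environment, i.e. once per cold start.
_INIT_TIME = time.time()
_cold = True


def handler(event, context):
    """Health handler that reports whether this invocation was a cold start."""
    global _cold
    was_cold = _cold
    _cold = False  # later invocations in this environment are warm
    body = {
        "status": "ok",
        "cold_start": was_cold,
        "environment_age_s": round(time.time() - _INIT_TIME, 1),
    }
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }
```

With this in place, a slow probe whose body shows `"cold_start": true` can be read as initialization cost rather than a regression.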

Vigilmon settings for cold-start-aware monitoring

When creating your monitor in Vigilmon:

  • Set Timeout to at least 10 seconds for Lambda endpoints. This leaves room for a cold start without false alarms.
  • Set Check interval to 60 seconds. Frequent checks also keep at least one execution environment warm, which reduces (but does not eliminate) cold starts for production traffic, since concurrent requests can still land on new environments.
  • Enable Multi-region consensus. Vigilmon probes from multiple regions simultaneously, and a slow or failed response seen by one probe location won't fire an alert unless other regions also see a failure.

This last point is critical. A Lambda function runs only in the AWS region where it is deployed, so every probe location hits the same function, and simultaneous probes can be served by different execution environments. One probe landing on a cold environment doesn't mean your users are affected. Multi-region consensus keeps that one-off cold start noise from waking your on-call.
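
The idea behind consensus can be sketched as a simple quorum rule. This is an illustration only: this guide does not describe Vigilmon's actual consensus algorithm, and the `should_alert` name and the default quorum of 2 are assumptions.

```python
def should_alert(region_results, quorum=2):
    """Return True when at least `quorum` probe regions report a failed check.

    region_results maps a probe region name to True (check passed) or
    False (check failed or timed out).
    """
    failures = [region for region, passed in region_results.items() if not passed]
    return len(failures) >= quorum
```

Under this rule, a single probe that times out on a cold start stays below the quorum, while an outage seen from two or more locations triggers the alert.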


Step 3: Configure Vigilmon HTTP Monitor

  1. Log in to Vigilmon and click Add Monitor → HTTP.
  2. Set URL to your Lambda health endpoint URL.
  3. Set Check interval to 60 seconds.
  4. Set Timeout to 10 seconds.
  5. Set Expected status code to 200.
  6. Under Advanced → JSON body assertion, add:
    • Path: status
    • Expected value: ok
  7. Save and verify the first check appears green.
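
To make the status code and JSON body assertions concrete, here is a small sketch of the same two checks that you could run locally against a captured response before pointing the monitor at the endpoint. The `evaluate_check` function is hypothetical, and it only illustrates the logic; how Vigilmon evaluates assertions internally is not covered in this guide.

```python
import json


def evaluate_check(status_code, body_text, expected_status=200,
                   path="status", expected_value="ok"):
    """Return (passed, reason) for one probe response."""
    if status_code != expected_status:
        return False, f"expected HTTP {expected_status}, got {status_code}"
    try:
        payload = json.loads(body_text)
    except (json.JSONDecodeError, TypeError):
        return False, "body is not valid JSON"
    if not isinstance(payload, dict):
        return False, "body is not a JSON object"
    # Only a top-level key is supported in this sketch
    actual = payload.get(path)
    if actual != expected_value:
        return False, f"expected {path}={expected_value!r}, got {actual!r}"
    return True, "ok"
```

Checking the body as well as the status code guards against a misconfigured route that returns 200 with an unrelated payload.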

Adding alert channels

Navigate to Alerts → Channels and set up:

  • Slack: paste your Slack incoming webhook URL — Vigilmon will post a message to your chosen channel when your Lambda goes down.
  • Webhook: configure a webhook URL for PagerDuty, Opsgenie, or any HTTP endpoint to receive structured JSON incident payloads.
  • Email: add your on-call email for direct paging.

Step 4: Heartbeat Monitor for Scheduled Lambda Functions

If your Lambda runs on an EventBridge schedule (cron or rate expression), a standard HTTP monitor won't catch it — the function has no public endpoint to probe between runs. Instead, configure the function to send a heartbeat ping to Vigilmon at the end of each successful run.

EventBridge-triggered Node.js function

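// scheduled-job.js
exports.handler = async (event) => {
  try {
    await runYourScheduledWork();
  } catch (err) {
    // Do NOT ping the heartbeat on failure; Vigilmon will fire an alert after the grace period
    console.error("Scheduled job failed:", err);
    throw err;
  }

  // Notify Vigilmon the job completed successfully. A failed ping is logged rather than
  // rethrown, so Lambda's async retry does not rerun work that already succeeded.
  try {
    const resp = await fetch(`https://vigilmon.online/api/heartbeat/${process.env.VIGILMON_HEARTBEAT_ID}`, {
      method: "POST",
      signal: AbortSignal.timeout(5000),
    });
    if (!resp.ok) console.warn(`Heartbeat ping returned HTTP ${resp.status}`);
  } catch (err) {
    console.warn("Heartbeat ping failed:", err);
  }

  return { status: "ok" };
};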
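
For Python functions, a hedged counterpart using only the standard library might look like the following. It assumes the same heartbeat URL shape as the Node.js example. `run_scheduled_work` is a placeholder for your job, and the `work` and `ping` parameters are illustrative additions that let the handler be exercised without network access; Lambda itself only passes `event` and `context`.

```python
import os
import urllib.request

# Same URL shape as the Node.js example above
HEARTBEAT_BASE = "https://vigilmon.online/api/heartbeat"


def run_scheduled_work():
    """Placeholder for the real scheduled job (hypothetical)."""


def ping_heartbeat(heartbeat_id, timeout=5):
    """POST to the heartbeat URL; return True on a 2xx response."""
    request = urllib.request.Request(
        f"{HEARTBEAT_BASE}/{heartbeat_id}", data=b"", method="POST"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError as exc:  # URLError, HTTPError, and socket timeouts are OSError subclasses
        print(f"Heartbeat ping failed: {exc}")
        return False


def handler(event, context, work=run_scheduled_work, ping=ping_heartbeat):
    # If the work raises, the exception propagates and no heartbeat is sent,
    # so the missing ping surfaces as an alert after the grace period.
    work()
    ping(os.environ.get("VIGILMON_HEARTBEAT_ID", ""))
    return {"status": "ok"}
```

A failed ping is logged instead of raised, so a transient network error on the heartbeat call does not mark an otherwise successful run as failed.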

Set up the heartbeat in Vigilmon

  1. In Vigilmon, click Add Monitor → Heartbeat.
  2. Name it (e.g., nightly-report-job).
  3. Set Grace period to a few minutes longer than your schedule interval — e.g., for a 15-minute EventBridge rate, set grace to 20 minutes.
  4. Copy the heartbeat URL and store its final path segment, the heartbeat ID, as a Lambda environment variable (VIGILMON_HEARTBEAT_ID). The function interpolates that ID into the ping URL.

If Vigilmon doesn't receive a ping within the grace period, it fires an alert. This catches:

  • Function crashes or unhandled exceptions
  • EventBridge rule disabled or misconfigured
  • IAM permission revocations that prevent execution
  • Deployment failures that break the handler
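
The grace period logic amounts to a deadline check, sketched below. It assumes the grace period is measured from the most recent ping, which matches the 15-minute schedule and 20-minute grace example above; the `heartbeat_overdue` function is hypothetical and is not Vigilmon's implementation.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_overdue(last_ping, grace, now=None):
    """Return True when more than `grace` has elapsed since the last ping."""
    now = now or datetime.now(timezone.utc)
    return now - last_ping > grace
```

With a 20-minute grace, a run that pings 16 minutes after the previous one is still on time, while silence for 21 minutes counts as a missed heartbeat.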

Step 5: Webhook Integration for Incident Automation

Vigilmon can POST structured JSON to any webhook endpoint when a monitor changes state (up → down or down → up). This enables automation beyond simple alerting.

Example payload Vigilmon sends:

{
  "monitor": "lambda-health",
  "status": "down",
  "url": "https://<your-lambda>.lambda-url.eu-west-1.on.aws/health",
  "region": "eu-west-1",
  "timestamp": "2026-06-30T14:22:00Z",
  "response_time_ms": null,
  "error": "Connection timed out"
}

You can wire this to a Lambda function that:

  • Opens a GitHub issue or Jira ticket automatically
  • Triggers an AWS Systems Manager Automation runbook
  • Posts a structured incident update to a Slack channel

Configure the webhook URL in Alerts → Channels → Add Webhook.
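
As a rough sketch of such a receiver, the Lambda handler below parses the payload shown above and turns it into a Slack incoming webhook message, which takes a JSON object with a `text` field. The `format_incident` helper is hypothetical, the handler assumes the payload arrives as the request body of a Function URL or API Gateway proxy event, and treating `"up"` as the recovery status is an assumption based on the state changes described in this step.

```python
import json


def format_incident(payload):
    """Build a Slack incoming-webhook message from a Vigilmon payload."""
    status = str(payload.get("status", "unknown"))
    if status == "down":
        prefix = "DOWN"
    elif status == "up":  # assumed recovery value
        prefix = "RECOVERED"
    else:
        prefix = status.upper()
    monitor = payload.get("monitor", "unknown monitor")
    region = payload.get("region", "n/a")
    lines = [f"[{prefix}] {monitor} ({region})"]
    if payload.get("error"):
        lines.append(f"Error: {payload['error']}")
    if payload.get("timestamp"):
        lines.append(f"At: {payload['timestamp']}")
    return {"text": "\n".join(lines)}


def handler(event, context):
    """Receive the webhook and return the formatted incident message."""
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}
    if not isinstance(payload, dict):
        return {"statusCode": 400, "body": json.dumps({"error": "expected a JSON object"})}
    message = format_incident(payload)
    # Forwarding `message` to Slack, a ticketing system, or a runbook would happen here.
    print(json.dumps(message))
    return {"statusCode": 200, "body": json.dumps(message)}
```

Keeping the formatting in a pure function makes it easy to reuse the same message for a ticket title or a runbook input.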


Monitoring Coverage Summary

| Failure scenario | Detection method |
|---|---|
| Lambda function throws unhandled exception | HTTP monitor sees 5xx response |
| Downstream database is unreachable | Health check returns degraded (503) |
| Cold start causes timeout | HTTP monitor with 10s timeout catches runaway starts |
| Scheduled job silently stops running | Heartbeat monitor fires after grace period |
| Lambda function URL or API Gateway misconfigured | HTTP monitor returns connection error |
| SSL certificate on custom domain expires | Vigilmon certificate monitor alerts 14 days before expiry |


Lambda's abstraction layer makes deployments faster and operations simpler, but it doesn't eliminate the need for external monitoring. Vigilmon gives you the independent vantage point that Amazon CloudWatch metrics can't: a probe from outside your AWS account that confirms the function responds correctly to real requests, from multiple geographies, with noise-free alerting via multi-region consensus.

Set up monitoring for your Lambda functions in under 5 minutes — start free at vigilmon.online.


Tags: #aws #lambda #serverless #monitoring
