
Monitoring Multi-Tenant SaaS Applications 2026

Multi-tenant SaaS monitoring requires capabilities that single-tenant application monitoring doesn't. When one platform serves hundreds or thousands of customers, a degradation that affects Tenant A doesn't necessarily affect Tenant B — and generic "is the app up?" checks won't tell you which tenants are experiencing problems. Tenant-scoped health monitoring, per-tenant SLA tracking, and incident communications that correctly scope affected customers require deliberate monitoring architecture.

This guide covers the challenges of per-tenant health monitoring, shared vs. dedicated infrastructure patterns and their monitoring implications, tenant-aware health endpoint design, how to use Vigilmon to monitor per-tenant endpoints, scoping incidents by affected tenants, and tracking SLA compliance per customer tier.


Why Multi-Tenant Monitoring Is Different

The Shared Infrastructure Problem

In a standard SaaS application, all tenants share infrastructure: the same database clusters, the same application servers, the same message queues, the same CDN. A generic health check (/api/health) validates that the shared infrastructure is operational — but it cannot tell you whether any specific tenant is experiencing degraded service.

A few concrete scenarios where generic monitoring misses tenant-level problems:

Database partition degradation: A PostgreSQL schema-per-tenant architecture has 500 tenant schemas. A corrupted index in one tenant's schema causes that tenant's queries to time out. /api/health checks the database connection using your system schema, so it passes. Meanwhile, Tenant 234's users are experiencing multi-second page loads.

Queue backlog for specific tenants: A multi-tenant job queue prioritizes by tenant tier. A queue consumer bug causes Premium tenants' jobs to be requeued indefinitely. Basic tenants process normally. Global queue depth is within normal range. The health check passes.

Feature flag rollout affecting a subset of tenants: A new feature is rolled out to 10% of tenants for A/B testing. The feature has a bug that causes failures for tenants with specific locale settings. The global health check never touches locale-specific code paths, so the affected tenants see errors while the check stays green.

Storage quota enforcement bug: A tenant hits their storage quota. Instead of rejecting new uploads, buggy enforcement code corrupts the tenant's existing storage index. The generic health check passes, but the affected tenant's users can't access their files.

None of these failures are visible to a monitoring setup that only checks global health endpoints.

The Incident Scoping Problem

When a tenant-scoped failure occurs, support teams need to know immediately which tenants are affected. "The system is down" is an unhelpful incident description when 490 tenants are fine and 10 are affected. Effective multi-tenant incident management requires:

  • Identifying affected tenants within minutes of incident start
  • Communicating with affected tenants specifically (not blasting all tenants with "we're investigating an issue")
  • Scoping the postmortem to understand why those specific tenants were affected
  • Tracking per-tenant restoration as the incident is resolved

This is only possible with tenant-level monitoring visibility.


Infrastructure Patterns and Their Monitoring Implications

Shared Infrastructure (One Database, One App Cluster)

The most common SaaS architecture: all tenants share a single database (with logical separation by tenant_id column or schema), a single application cluster, and shared caches and queues.

Monitoring challenges:

  • No natural per-tenant endpoint to probe
  • Tenant-specific degradation is invisible to global health checks
  • Response time metrics are averages across all tenants — a degraded tail doesn't appear unless P99 is tracked

Monitoring approach:

  • Global health endpoints for infrastructure status
  • Synthetic tenant-scoped health endpoints (described below)
  • Response time P95/P99 monitoring (not just average) to catch tail latency affecting slow tenants
  • Heartbeat monitors for background processing that should affect all tenants equally
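
To illustrate why averages hide a degraded tenant, here is a minimal Python sketch. The sample values are invented for illustration: 980 fast requests from healthy tenants and 20 slow requests from a single degraded tenant.

```python
import statistics

def latency_summary(samples_ms):
    """Return mean, P95 and P99 for a list of response times in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index 94 is P95, index 98 is P99
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }

# Invented sample: 980 requests at 80 ms, 20 requests at 4000 ms
samples = [80.0] * 980 + [4000.0] * 20
summary = latency_summary(samples)
print(summary)
```

With this sample the mean stays near 158 ms and P95 stays at 80 ms, while P99 jumps to 4000 ms. When only 2% of requests are slow, P99 is the first percentile that surfaces the problem.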

Schema-Per-Tenant (Shared DB, Separate Schemas)

Each tenant gets a separate PostgreSQL (or similar) schema. The database server is shared, but tenant data and indexes are isolated.

Monitoring challenges:

  • One tenant's table bloat, index corruption, or runaway query can affect that tenant's performance without affecting others
  • Standard health checks use the system schema — they don't exercise tenant schemas

Monitoring approach:

  • Per-tenant synthetic health endpoints that execute a lightweight query against that tenant's schema
  • Prioritize monitoring for high-tier tenants (Premium, Enterprise) at 1-minute intervals
  • Standard-tier tenants can use 5-minute intervals for the synthetic checks

Database-Per-Tenant (Dedicated DB per Tenant)

Each tenant has a fully dedicated database instance. True isolation at the data layer.

Monitoring challenges:

  • With 100+ tenants, that means 100+ databases to monitor
  • A failing tenant database needs immediate detection regardless of which tenant

Monitoring approach:

  • TCP port monitors for each tenant database instance (where the databases are reachable from monitoring probes)
  • Per-tenant synthetic HTTP health checks that validate database connectivity for that tenant
  • Automated monitor provisioning via Vigilmon API when new tenants are onboarded
  • Database-level heartbeat monitors if scheduled jobs run against each tenant database

Dedicated Infrastructure per Tier

Enterprise tenants get dedicated infrastructure; SMB tenants share. Common in usage-heavy SaaS (data platforms, video processing, AI inference).

Monitoring challenges:

  • Enterprise tenant infrastructure must be monitored to SLA — often stricter than the default
  • Shared SMB infrastructure can use standard monitoring intervals
  • Incidents on Enterprise infrastructure are high-severity regardless of scope

Monitoring approach:

  • Separate monitor sets for Enterprise vs. shared infrastructure
  • 1-minute intervals and immediate PagerDuty for Enterprise tenant health checks
  • 5-minute intervals for shared infrastructure serving SMB tenants
  • Dedicated on-call rotation or escalation path for Enterprise infrastructure failures

Tenant-Aware Health Endpoint Design

The foundation of per-tenant monitoring is a health endpoint that exercises infrastructure in the context of a specific tenant.

Standard Global Health Check (Not Enough)

GET /api/health
Response: {"status":"ok","db":"connected","cache":"connected"}

This validates that:

  • A database connection can be established (using system credentials)
  • Cache is reachable
  • The application process is running

It does not validate that any specific tenant's data is accessible, their specific schema/partition is healthy, or their associated background processing is running.

Tenant-Scoped Health Check Design

Add a health endpoint parameterized by tenant identifier:

GET /api/health/tenant/{tenant_id}
Response: {
  "status": "ok",
  "tenant_id": "acme-corp",
  "db": "connected",
  "schema_query": "ok",
  "queue_depth": 12,
  "last_job_processed": "2026-06-30T14:18:00Z",
  "storage": "accessible"
}

This endpoint:

  1. Connects to the database using credentials appropriate for that tenant's data context
  2. Executes a lightweight query against the tenant's schema or partition (e.g., SELECT 1 FROM tenant_acme.users LIMIT 1)
  3. Checks that the tenant's associated queue is being processed (last job timestamp within expected window)
  4. Validates that the tenant's storage context is accessible
  5. Returns aggregate status

Authentication considerations: Tenant-scoped health endpoints should authenticate via a dedicated monitoring API key (not tenant user credentials) and should return only operational status data, never tenant data, PII, or records. The endpoint confirms that the tenant's infrastructure is working; it does not return their data.
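
The aggregation step can be sketched in framework-free Python. This is an illustration under assumptions, not a prescribed implementation: `tenant_health` and the two probe functions are hypothetical names, and real probes would query the tenant's schema, queue, and storage.

```python
from typing import Callable, Dict

def tenant_health(tenant_id: str, probes: Dict[str, Callable[[str], bool]]) -> dict:
    """Run each probe for one tenant and return operational status only."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = "ok" if probe(tenant_id) else "failed"
        except Exception:
            # A probe that raises counts as a failure; error details stay in server logs
            results[name] = "failed"
    overall = "ok" if all(v == "ok" for v in results.values()) else "degraded"
    return {"status": overall, "tenant_id": tenant_id, **results}

# Hypothetical probes standing in for real schema and storage checks
def schema_probe(tenant_id: str) -> bool:
    return True

def storage_probe(tenant_id: str) -> bool:
    raise TimeoutError("storage timeout")

report = tenant_health(
    "acme-corp", {"schema_query": schema_probe, "storage": storage_probe}
)
print(report)
```

An HTTP handler wrapping this function would typically return a non-200 status whenever the overall status is not ok, so that a monitor checking the status code detects the failure. The response carries only status strings, which keeps tenant data out of the payload.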

Monitoring Probe Key

Create a dedicated API key for your monitoring system with read-only access to health endpoints:

GET /api/health/tenant/acme-corp
Authorization: Bearer monitoring_key_readonly_xyz

This key has access only to health check endpoints — nothing else. It cannot read tenant data. Configure this key in Vigilmon's HTTP monitor header configuration.
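
On the application side, the check for this key might look like the following sketch. The function name `is_monitoring_request` is hypothetical, and how the expected key is stored and loaded is left to your own secret management.

```python
import hmac

def is_monitoring_request(auth_header: str, expected_key: str) -> bool:
    """Accept only 'Bearer <key>' where the key matches the monitoring key."""
    scheme, _, token = auth_header.partition(" ")
    if scheme != "Bearer" or not token:
        return False
    # compare_digest avoids leaking key contents through timing differences
    return hmac.compare_digest(token.encode(), expected_key.encode())

print(is_monitoring_request("Bearer monitoring_key_readonly_xyz",
                            "monitoring_key_readonly_xyz"))
```

Apply a check like this only to the health routes, so the key grants nothing beyond operational status.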

Lightweight Schema Probe Pattern

For schema-per-tenant architectures, the schema query inside the health endpoint should be minimal:

-- Good: touches the tenant schema with negligible load
SELECT 1 FROM acme_corp.system_health_ping LIMIT 1;

-- Or: check for recent activity as a proxy for schema health
SELECT MAX(updated_at) FROM acme_corp.records
WHERE updated_at > NOW() - INTERVAL '1 hour';

-- Bad: full table scan, slow query that burdens the database
SELECT COUNT(*) FROM acme_corp.records;

Keep the health check lightweight. You're checking that the schema is accessible and responsive, not generating analytics.


Configuring Vigilmon for Per-Tenant Monitoring

Monitor Structure for Multi-Tenant SaaS

Use a naming convention that encodes the tenant and tier:

{tier}/{tenant_slug}/health
{tier}/{tenant_slug}/api
{tier}/{tenant_slug}/heartbeat/{job_name}

Examples:

enterprise/acme-corp/health
enterprise/acme-corp/api
enterprise/acme-corp/heartbeat/nightly-export

premium/widget-co/health
premium/widget-co/heartbeat/billing-retry

standard/small-startup/health

This naming scheme enables:

  • Filtering all Enterprise monitors: GET /api/monitors?search=enterprise/
  • Filtering all monitors for a specific tenant: GET /api/monitors?search=acme-corp
  • Reporting uptime by tier: aggregate all enterprise/ monitors, premium/ monitors, etc.
  • Per-tier alerting thresholds in your webhook router

Check Interval by Tenant Tier

| Tier | Check Interval | Alert Routing | SLA Target |
|---|---|---|---|
| Enterprise | 1 minute | PagerDuty P1 immediately | 99.9% |
| Premium | 2 minutes | PagerDuty P2 within 5 min | 99.5% |
| Standard | 5 minutes | Slack + email | 99.0% |
| Free | 5 minutes | Slack only (internal) | Best effort |

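The check interval sets your worst-case detection latency: a failure that begins just after a check completes goes unnoticed until the next check runs. For Enterprise tenants whose SLAs include response time commitments, 1-minute intervals keep that delay to roughly one minute, plus any time spent confirming the failure from multiple regions.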
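
The arithmetic behind these intervals can be sketched as follows. The 30-day month and the `confirmations` parameter (how many consecutive failed checks are needed before alerting) are assumptions for illustration, not documented Vigilmon settings.

```python
def monthly_downtime_budget_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a month of `days` days at the given SLA."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0

def worst_case_detection_minutes(interval_minutes: float, confirmations: int = 1) -> float:
    """Upper bound on detection delay when `confirmations` consecutive
    failed checks are required before an alert fires."""
    return interval_minutes * confirmations

budget = monthly_downtime_budget_minutes(99.9)
delay_1m = worst_case_detection_minutes(1, confirmations=2)
delay_5m = worst_case_detection_minutes(5, confirmations=2)
print(f"99.9% budget: {budget:.1f} min/month")
print(f"detection delay at 1-minute checks: {delay_1m} min, at 5-minute checks: {delay_5m} min")
```

A 99.9% target leaves about 43 minutes of downtime per 30-day month. Under these assumptions, a 5-minute interval with two confirmations could spend up to 10 of those minutes on detection alone, against 2 minutes at a 1-minute interval.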

Scaling Monitor Count with Tenant Count

At small scale (10–50 tenants), manually provisioning monitors per tenant is reasonable. At larger scale, monitor provisioning should be automated via the Vigilmon API.

Automated provisioning on tenant signup:

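// Assumes `vigilmon` is an initialized Vigilmon API client and
// MONITORING_API_KEY holds the read-only monitoring key.

// Check interval per tier, matching the table above. The string format is
// an assumption, modeled on the '24h' style used for the heartbeat below.
function tierInterval(tier) {
  const intervals = { enterprise: '1m', premium: '2m', standard: '5m' };
  return intervals[tier] ?? '5m';
}

// Called when a new tenant signs up
async function onTenantCreated(tenant) {
  const tier = tenant.subscriptionTier; // 'enterprise' | 'premium' | 'standard'
  const slug = tenant.slug; // 'acme-corp'

  // Create HTTP health monitor
  await vigilmon.monitors.create({
    name: `${tier}/${slug}/health`,
    url: `https://app.yourproduct.com/api/health/tenant/${slug}`,
    type: 'http',
    interval: tierInterval(tier),
    headers: { 'Authorization': `Bearer ${MONITORING_API_KEY}` },
    expectedStatus: 200,
    expectedBody: '"status":"ok"'
  });

  // Create heartbeat monitor for nightly job
  await vigilmon.monitors.create({
    name: `${tier}/${slug}/heartbeat/nightly-export`,
    type: 'heartbeat',
    interval: '24h',
    grace: '2h' // Allow 2h beyond expected window before alerting
  });
}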

Deprovisioning on tenant churn:

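async function onTenantCancelled(tenant) {
  const monitors = await vigilmon.monitors.list({
    search: tenant.slug
  });
  // A search can also match tenants whose slug contains this one
  // (e.g. 'acme' matches 'acme-corp'), so compare the slug segment exactly
  const owned = monitors.filter(m => m.name.split('/')[1] === tenant.slug);
  await Promise.all(owned.map(m => vigilmon.monitors.delete(m.id)));
}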
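
Lifecycle hooks can be missed, for example when a signup webhook fails or a tenant changes tier. A periodic reconciliation pass is one way to catch drift. The sketch below is a hypothetical helper, not part of any Vigilmon SDK: it compares the active tenant list against existing monitor names and reports which health monitors to create or delete.

```python
def reconcile(tenants, monitor_names):
    """Compare active tenants with existing monitor names.

    tenants: dict of slug -> tier for every active tenant
    monitor_names: iterable of names following {tier}/{tenant_slug}/health
    Returns (missing, orphaned): names to create and names to delete.
    """
    expected = {f"{tier}/{slug}/health" for slug, tier in tenants.items()}
    existing = {n for n in monitor_names if n.endswith("/health")}
    return sorted(expected - existing), sorted(existing - expected)

active_tenants = {"acme-corp": "enterprise", "widget-co": "premium"}
current_monitors = [
    "enterprise/acme-corp/health",
    "enterprise/acme-corp/heartbeat/nightly-export",
    "standard/small-startup/health",
]
missing, orphaned = reconcile(active_tenants, current_monitors)
print("create:", missing)
print("delete:", orphaned)
```

Because the tier is part of the name, a tenant that moved from Premium to Enterprise shows up as one orphaned monitor and one missing monitor, which prompts re-creation at the new tier's interval.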

Incident Scoping by Tenant

Detecting Which Tenants Are Affected

When a multi-region consensus alert fires for a specific tenant health endpoint, the incident is scoped immediately. The monitor name tells you exactly which tenant is affected: enterprise/acme-corp/health.

If multiple tenant monitors alert within a short window, it signals shared infrastructure failure:

enterprise/acme-corp/health      DOWN at 14:22:01
premium/widget-co/health         DOWN at 14:22:03
premium/startup-xyz/health       DOWN at 14:22:04
standard/small-co/health         DOWN at 14:22:05

Four monitors alerting within 4 seconds indicates infrastructure-wide failure, not tenant-specific. Your incident response switches from "investigate Tenant A" to "investigate shared infrastructure."

Conversely, a single tenant alerting while others remain up indicates a tenant-specific problem:

enterprise/acme-corp/health      DOWN at 14:22:01
enterprise/other-corp/health     UP
premium/widget-co/health         UP

Immediate scope: the problem is isolated to Acme Corp's infrastructure context.
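
This distinction can be automated. The following Python sketch groups DOWN alerts that arrive close together and labels the burst. The 30-second window and the threshold of three tenants are assumed values to tune against your own alert history.

```python
from datetime import datetime, timedelta

def classify_scope(down_events, window=timedelta(seconds=30), threshold=3):
    """Label a burst of DOWN alerts as shared-infrastructure or tenant-specific.

    down_events: list of (monitor_name, datetime) tuples for DOWN alerts.
    Returns (scope, tenants) where tenants is a sorted list of affected slugs.
    """
    if not down_events:
        return "none", []
    events = sorted(down_events, key=lambda e: e[1])
    first = events[0][1]
    # Distinct tenants (second name segment) that went down within the window
    tenants = sorted(
        {name.split("/")[1] for name, ts in events if ts - first <= window}
    )
    scope = "shared-infrastructure" if len(tenants) >= threshold else "tenant-specific"
    return scope, tenants

base = datetime(2026, 6, 30, 14, 22, 1)
burst = [
    ("enterprise/acme-corp/health", base),
    ("premium/widget-co/health", base + timedelta(seconds=2)),
    ("premium/startup-xyz/health", base + timedelta(seconds=3)),
    ("standard/small-co/health", base + timedelta(seconds=4)),
]
scope, affected = classify_scope(burst)
print(scope, affected)
```

A single alert passed through the same function comes back as tenant-specific, which matches the second scenario above.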

Automated Incident Scope Tagging

Configure your webhook router to extract the tenant tier and slug from the monitor name and tag incidents accordingly:

function parseMonitorName(name) {
  const parts = name.split('/');
  return {
    tier: parts[0],       // 'enterprise'
    tenant: parts[1],     // 'acme-corp'
    checkType: parts[2],  // 'health'
    jobName: parts[3]     // 'nightly-export' for heartbeat monitors, otherwise undefined
  };
}

function handleVigilmonWebhook(payload) {
  const { tier, tenant, checkType } = parseMonitorName(payload.monitorName);

  if (tier === 'enterprise') {
    // Create P1 PagerDuty incident
    pagerduty.createIncident({
      title: `[ENTERPRISE] ${tenant} health check failed`,
      severity: 'critical',
      body: `Tenant: ${tenant}\nTier: ${tier}\nCheck: ${checkType}`,
      tags: ['enterprise', tenant, 'multi-tenant']
    });
  }
  // Notify affected tenant's CSM
  notifyAccountTeam(tenant, payload);
}

Per-Tenant Incident Communication

For tenants with SLA commitments, incidents require direct communication — not a generic "we're experiencing issues" status page post. Effective multi-tenant incident communication:

  1. Identify affected tenant(s) from the monitor alert
  2. Notify CSM or account manager for Enterprise tenants immediately
  3. Send direct email/Slack to affected tenant contacts rather than relying on a generic status page post
  4. Scope the status page update to "Service degraded for a subset of users" rather than a global outage
  5. Track per-tenant restoration — send resolution communication to affected tenants specifically

This targeted communication model is only possible with tenant-scoped monitoring. Generic monitoring tells you "the app is down" — tenant-scoped monitoring tells you "Acme Corp's services are down, and everyone else is fine."


SLA Tracking per Customer Tier

What SLA Monitoring Requires

A per-tenant SLA tracking system needs:

  • Timestamped check history for each tenant's monitor
  • Downtime duration calculations per tenant per billing period
  • Uptime percentage per tenant per billing period
  • SLA credit calculation when commitments are missed

Vigilmon's check history API provides the timestamped event data. Your SLA calculation layer reads this data and computes per-tenant availability.

SLA Calculation from Vigilmon Data

async function calculateMonthlyUptime(tenantSlug, year, month) {
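  // Use UTC boundaries so the billing period doesn't shift with the server's time zone
  const startDate = new Date(Date.UTC(year, month - 1, 1));
  const endDate = new Date(Date.UTC(year, month, 1));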
  const totalMinutes = (endDate - startDate) / 60000;

  // Fetch check events for this tenant's health monitor
  const events = await vigilmon.events.list({
    monitorSearch: `${tenantSlug}/health`,
    from: startDate.toISOString(),
    to: endDate.toISOString()
  });

  // Sum the duration of outage events (assumes each event carries durationMinutes)
  const downtimeMinutes = events
    .filter(e => e.type === 'outage')
    .reduce((sum, e) => sum + e.durationMinutes, 0);

  const uptimePercentage = ((totalMinutes - downtimeMinutes) / totalMinutes) * 100;

  return {
    tenant: tenantSlug,
    period: `${year}-${String(month).padStart(2, '0')}`,
    totalMinutes,
    downtimeMinutes,
    uptimePercentage: uptimePercentage.toFixed(4),
    slaBreached: uptimePercentage < tierSLATarget(tenantSlug)
  };
}
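
One detail the calculation above leaves open is an outage that spans a billing period boundary. A minimal Python sketch of clipping outages to the period follows. The tuple shape of `outages` is an assumption for illustration, not the format of Vigilmon's event data.

```python
from datetime import datetime, timezone

def downtime_minutes(outages, period_start, period_end):
    """Sum outage minutes that fall inside [period_start, period_end).

    outages: list of (start, end) datetimes; end may be None for an ongoing outage.
    """
    total = 0.0
    for start, end in outages:
        end = end or period_end
        # Clip to the billing period so an outage crossing the boundary is split correctly
        overlap = min(end, period_end) - max(start, period_start)
        if overlap.total_seconds() > 0:
            total += overlap.total_seconds() / 60
    return total

utc = timezone.utc
june_start = datetime(2026, 6, 1, tzinfo=utc)
june_end = datetime(2026, 7, 1, tzinfo=utc)
outages = [
    (datetime(2026, 5, 31, 23, 50, tzinfo=utc), datetime(2026, 6, 1, 0, 10, tzinfo=utc)),
    (datetime(2026, 6, 15, 12, 0, tzinfo=utc), datetime(2026, 6, 15, 12, 30, tzinfo=utc)),
]
june_downtime = downtime_minutes(outages, june_start, june_end)
total_minutes = (june_end - june_start).total_seconds() / 60
print(june_downtime, round((total_minutes - june_downtime) / total_minutes * 100, 4))
```

Only the 10 minutes of the first outage that fall in June count toward June's total. The other 10 minutes belong to May's report.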

Per-Tier SLA Credit Policies

| Tier | SLA Target (allowed downtime) | Credit if Missed |
|---|---|---|
| Enterprise | 99.9% (8.76h/year) | 10% per 0.1% below target |
| Premium | 99.5% (43.8h/year) | 5% per 0.1% below target |
| Standard | 99.0% (87.6h/year) | No contractual credit |
| Free | Best effort | No commitment |
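
A credit calculation based on the table above might look like this sketch. Two details are assumptions, since the policy text doesn't specify them: only full 0.1-point steps below target count, and the credit is capped at 100% of the monthly fee. `Decimal` is used so that values such as 99.7 don't pick up floating-point rounding errors.

```python
from decimal import Decimal

# (SLA target, credit percent per full 0.1 point below target) from the tier table
CREDIT_POLICY = {
    "enterprise": (Decimal("99.9"), Decimal("10")),
    "premium": (Decimal("99.5"), Decimal("5")),
}

def sla_credit_percent(tier: str, uptime_percent: str) -> Decimal:
    """Return the credit owed as a percentage of the monthly fee."""
    if tier not in CREDIT_POLICY:
        return Decimal("0")  # Standard and Free carry no contractual credit
    target, rate = CREDIT_POLICY[tier]
    shortfall = target - Decimal(uptime_percent)
    if shortfall <= 0:
        return Decimal("0")
    steps = shortfall // Decimal("0.1")  # full 0.1-point steps below target
    return min(steps * rate, Decimal("100"))

print(sla_credit_percent("enterprise", "99.65"))
print(sla_credit_percent("premium", "99.5"))
```

Passing the uptime as a string, such as the `toFixed(4)` output of the uptime calculation, avoids converting through a binary float.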

Monthly SLA reports for Enterprise and Premium tenants should include:

  • Uptime percentage for the month
  • Incident count, durations, and dates
  • Confirmation of SLA achievement or credit owed
  • Prior-month trend (improving or declining)

Heartbeat Monitoring for Tenant-Specific Background Jobs

Multi-tenant SaaS applications often run per-tenant background jobs: nightly data exports, usage invoice generation, report compilation, or data sync tasks. In a schema-per-tenant architecture, these jobs run independently per tenant.

Per-tenant heartbeat pattern:

Each tenant job pings its own Vigilmon heartbeat URL on completion:

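# nightly_export.py - runs per tenant
import requests

# Maps tenant slug -> Vigilmon heartbeat ID, filled in when monitors are provisioned.
# export_tenant_data and log_error are defined elsewhere in the application.
TENANT_HEARTBEAT_IDS = {}

def run_nightly_export(tenant_slug):
    try:
        export_tenant_data(tenant_slug)
        # Ping heartbeat on success; the timeout keeps a slow ping from hanging the job
        requests.post(
            f"https://vigil.vigilmon.online/api/heartbeat/{TENANT_HEARTBEAT_IDS[tenant_slug]}",
            timeout=10,
        )
    except Exception as e:
        log_error(tenant_slug, e)
        # Don't ping: Vigilmon will alert when the heartbeat window expires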

If 48 of 50 tenants ping their heartbeats successfully but 2 don't, Vigilmon alerts specifically for those 2 tenant heartbeat monitors. You know immediately which tenants' nightly exports failed — without checking 50 log files.
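
The rule a heartbeat monitor applies can be illustrated in a few lines of Python. This is a sketch of the interval-plus-grace idea, not Vigilmon's internal logic. The 24-hour interval and 2-hour grace mirror the provisioning example, and the tenant timestamps are invented.

```python
from datetime import datetime, timedelta, timezone

def overdue_tenants(last_pings, now,
                    interval=timedelta(hours=24), grace=timedelta(hours=2)):
    """Return tenant slugs whose last heartbeat is older than interval + grace.

    last_pings: dict of tenant slug -> datetime of last ping, or None if never pinged.
    """
    deadline = now - (interval + grace)
    return sorted(
        slug for slug, ts in last_pings.items()
        if ts is None or ts < deadline
    )

utc = timezone.utc
now = datetime(2026, 6, 30, 14, 0, tzinfo=utc)
last_pings = {
    "acme-corp": datetime(2026, 6, 30, 2, 0, tzinfo=utc),   # 12 hours ago
    "widget-co": datetime(2026, 6, 29, 11, 0, tzinfo=utc),  # 27 hours ago
    "small-startup": None,                                   # never pinged
}
overdue = overdue_tenants(last_pings, now)
print(overdue)
```

The same function is handy in an internal dashboard or a test suite for confirming that every tenant's job is actually wired to a heartbeat.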


Common Multi-Tenant Monitoring Mistakes

Monitoring Only the Global Health Endpoint

A single /api/health check misses every tenant-scoped failure pattern. For any SaaS application with more than a handful of tenants, add tenant-scoped health checks for at minimum your top-tier customers.

Uniform Check Intervals Across Tiers

Monitoring all tenants at 5-minute intervals treats a $50/month Standard tenant the same as a $5,000/month Enterprise tenant. Enterprise tenants with SLA commitments need 1-minute intervals for detection that supports the contractual response time you've committed to.

No Automated Monitor Provisioning

Manually provisioning monitors when tenants sign up does not scale. At 100 tenants it becomes burdensome; at 500 it's untenable. Wire monitor creation into your tenant onboarding flow via the Vigilmon API from the start.

Generic Incident Communication

Sending "we're experiencing service disruption" messages to all 500 tenants when only 3 are affected destroys trust with the 497 tenants who were unaffected. Tenant-scoped monitoring enables tenant-scoped incident communication. Use both.

Neglecting Per-Tenant Job Monitoring

Per-tenant background jobs that fail silently, without producing HTTP errors or triggering database errors, are invisible to standard uptime monitoring. Heartbeat monitoring for per-tenant job execution detects these failures before tenants report missing data.


Conclusion

Multi-tenant SaaS monitoring is primarily about scope granularity: knowing not just that "the service" is degraded, but which tenants are experiencing degradation and why. This requires tenant-aware health endpoints, per-tenant monitoring with tier-appropriate check intervals, automated monitor provisioning via API, and webhook routing that extracts tenant context from alert payloads.

The monitoring architecture described here — tenant-scoped health endpoints, per-tier check intervals, heartbeat monitoring for per-tenant jobs, automated provisioning on tenant lifecycle events, and SLA uptime calculations from check history — gives SaaS teams the visibility they need to manage multi-tenant reliability at scale.

Vigilmon's REST API makes this architecture practical: monitors are programmable, webhooks carry the monitor name for routing, and check history is accessible for SLA calculations. The free tier covers prototyping the tenant monitoring model before scaling to a full tenant fleet.

Start building your multi-tenant monitoring architecture at vigilmon.online — REST API access, webhook-based alerting, heartbeat monitoring, and a permanent free tier to get started without commitment.


