Monitoring Temporal with Vigilmon: Frontend Service Health, Web UI Uptime, gRPC Port & Namespace Availability

Temporal is the workflow orchestration engine that powers durable business processes: payment flows, data pipelines, onboarding sequences, and any long-running operation that needs to survive server restarts and partial failures. When Temporal's frontend service goes down, workers can no longer poll for tasks, running workflows stall mid-execution, and new workflow starts are rejected. When the web UI is unavailable, engineers lose visibility into running and failed workflows. When the gRPC port is unreachable, every Temporal SDK client (Go, Java, TypeScript, Python) loses connectivity to the cluster. Vigilmon gives you external visibility into Temporal's availability: web UI uptime, frontend gRPC port reachability, HTTP API and namespace health, and SSL certificate expiry.

What You'll Build

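An HTTP monitor on Temporal's web UI to detect dashboard, ingress, and deployment failures
A TCP monitor on the gRPC port to catch SDK connectivity failures
An HTTP API monitor to confirm the frontend answers requests (where the HTTP API is enabled)
A namespace availability check to confirm your workflow namespace is active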
SSL certificate monitoring for your Temporal domain
An alerting setup that distinguishes frontend failures from worker connectivity issues

Prerequisites

A running Temporal cluster (self-hosted or Temporal Cloud) with a network-reachable domain
The Temporal web UI exposed via HTTPS (e.g., https://temporal.example.com)
The gRPC frontend port accessible (default 7233)
A free account at vigilmon.online

Step 1: Understand Temporal's Service Architecture

Temporal runs as a set of internal services. For external monitoring, the key services are:

| Service | Default port | Role |
|---|---|---|
| Frontend | 7233 (gRPC) | Entry point for all SDK clients and the tctl/temporal CLI |
| Web UI | 8080 (HTTP) in the temporalio/ui container; 8233 with temporal server start-dev | Browser dashboard for workflow inspection and management |
| History | 7234 (gRPC, internal) | Stores workflow event history |
| Matching | 7235 (gRPC, internal) | Routes tasks to workers |

External monitoring focuses on the Frontend service and the Web UI — these are the externally reachable components. History and Matching are internal services managed by the Temporal cluster itself.

Step 2: Create a Vigilmon HTTP Monitor for the Web UI

The Temporal web UI is the most directly observable component via HTTP. It serves the dashboard that engineers use to inspect workflows, activities, and task queues:

curl https://temporal.example.com
# Returns HTML with "Temporal" in the page content

Log in to Vigilmon → Add Monitor → HTTP.
URL: https://temporal.example.com.
Check interval: 60 seconds.
Response timeout: 15 seconds.
Expected status: 200.
Keyword: Temporal (appears in the web UI page title and content).
Click Save.

This monitor catches:

Temporal web UI service crashes
Kubernetes pod failures for the temporal-web container
Ingress controller failures blocking access to Temporal
Deployment failures after Temporal upgrades

Alert sensitivity: Set to trigger after 1 consecutive failure. When the web UI is down, engineers have no visibility into running workflows, stuck activities, or worker health.
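To make the monitor's pass/fail rule concrete, here is a minimal Python sketch of the same logic: one request, an expected status code, and a keyword that must appear in the body. It is an illustration under assumptions, not Vigilmon's implementation; the function names evaluate_http_check and run_http_check are hypothetical, and only the standard library is used.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def evaluate_http_check(status: int, body: str, expected_status: int = 200,
                        keyword: Optional[str] = None) -> Tuple[bool, str]:
    """Pass only if the status matches and, when given, the keyword is in the body."""
    if status != expected_status:
        return False, f"expected status {expected_status}, got {status}"
    if keyword is not None and keyword not in body:
        return False, f"keyword {keyword!r} not found in response body"
    return True, "ok"


def run_http_check(url: str, timeout: float = 15.0, expected_status: int = 200,
                   keyword: Optional[str] = None) -> Tuple[bool, str]:
    """Fetch the URL once and evaluate the response; network errors are failures."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate_http_check(resp.status, body, expected_status, keyword)
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; judge them by their status code
        return evaluate_http_check(err.code, "", expected_status, keyword)
    except OSError as err:
        # URLError (DNS failure, refused connection) and timeouts are OSError subclasses
        return False, f"request failed: {err}"
```

For example, run_http_check("https://temporal.example.com", keyword="Temporal") mirrors the settings above. Keeping the evaluation separate from the fetch lets you test the rule without a live cluster.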

Step 3: Monitor the Frontend gRPC Port via TCP Check

The Temporal frontend service listens on gRPC port 7233. Every Temporal SDK client — Go (go.temporal.io/sdk), Java, TypeScript, Python — connects directly to this port. A TCP check confirms the port is accepting connections at the network level:

# Test gRPC port connectivity
nc -zv temporal.example.com 7233

Add Monitor → TCP.
Host: temporal.example.com.
Port: 7233.
Check interval: 60 seconds.
Label: Temporal frontend gRPC port.
Click Save.

Why TCP monitoring for gRPC? Temporal SDKs talk to the cluster exclusively over gRPC; there is no HTTP fallback in the SDKs. If the gRPC port is down, workflows stop starting, workers stop polling, and running workflows stall. A TCP check detects port-level failures (firewall rules, load balancer misconfiguration, process crashes) faster than waiting for SDK timeouts. It only proves that something is accepting connections on 7233, not that the frontend can serve requests, so pair it with the HTTP-level checks below.
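A TCP check is simply a completed TCP handshake. The Python sketch below is roughly equivalent to nc -z; tcp_port_open is a hypothetical helper name, and the 5-second timeout is an assumption rather than a Vigilmon default.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP handshake to host:port completes within timeout."""
    try:
        # create_connection resolves the host and tries each returned address in turn
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, unreachable networks and DNS errors
        return False
```

tcp_port_open("temporal.example.com", 7233) returning False corresponds to a firing TCP monitor. Note that a True result says nothing about whether gRPC requests succeed.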

Step 4: Monitor the HTTP API (Temporal 1.20+)

Temporal 1.20+ includes a built-in HTTP API alongside gRPC. If your cluster exposes it, this provides an HTTP-native health signal:

# List namespaces via HTTP API
curl https://temporal.example.com/api/v1/namespaces
# Returns a JSON object with a namespaces array

Add Monitor → HTTP.
URL: https://temporal.example.com/api/v1/namespaces.
Check interval: 60 seconds.
Response timeout: 10 seconds.
Expected status: 200.
Keyword: namespaces (present in the namespace list response).
Label: Temporal HTTP API.
Click Save.

If your Temporal version is older than 1.20 or the HTTP API is disabled, skip this step and rely on the web UI monitor and gRPC TCP check instead.
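A keyword match on namespaces passes for any body that contains that word, including some error payloads. If you script your own probe, parsing the JSON is stricter. This Python sketch assumes the ListNamespaces response uses protobuf JSON field names (namespaces, namespaceInfo, name); confirm them against your own curl output. The helper names are hypothetical.

```python
import json
from typing import List


def namespace_names(body: str) -> List[str]:
    """Return namespace names from a ListNamespaces JSON body.

    Assumed shape: {"namespaces": [{"namespaceInfo": {"name": "default", ...}}, ...]}
    """
    data = json.loads(body)  # raises ValueError on HTML or truncated bodies
    namespaces = data.get("namespaces")
    if not isinstance(namespaces, list):
        raise ValueError("response has no 'namespaces' array")
    return [ns.get("namespaceInfo", {}).get("name", "") for ns in namespaces]


def api_check_passes(body: str, required_namespace: str) -> bool:
    """Fail on non-JSON, on a missing array, or when the namespace is not listed."""
    try:
        return required_namespace in namespace_names(body)
    except (ValueError, AttributeError):
        # AttributeError covers JSON that is not an object, e.g. a bare list
        return False
```

A hypothetical error body such as {"message": "namespaces: permission denied"} would satisfy a plain keyword check but fails this one, because it has no namespaces array.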

Step 5: Monitor Namespace Availability

Temporal workflows run inside namespaces. If your primary namespace becomes unavailable (corrupted, accidentally deleted, or degraded), workflow starts and task polls fail even when the cluster itself is healthy. Use the Temporal HTTP API to check namespace availability:

curl https://temporal.example.com/api/v1/namespaces/default
# Returns namespace details for the "default" namespace

Add Monitor → HTTP.
URL: https://temporal.example.com/api/v1/namespaces/default (replace default with your namespace name).
Check interval: 2 minutes.
Response timeout: 10 seconds.
Expected status: 200.
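Keyword: REGISTERED (the namespace state; a deprecated or deleted namespace reports a different state). The exact JSON spelling, such as NAMESPACE_STATE_REGISTERED, and whether a space follows the colon depend on the server version and encoder, so copy the keyword from your own curl output instead of matching the full "state":"REGISTERED" pair.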
Label: Temporal namespace: default.
Click Save.
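The same idea applies to the namespace check: read namespaceInfo.state from the parsed JSON instead of matching a string whose spacing and enum spelling can vary. This Python sketch assumes the DescribeNamespace response nests the state under namespaceInfo; the set of accepted spellings is an assumption to trim once you have inspected your server's output, and the function names are hypothetical.

```python
import json

# Accepted spellings of the registered state. Which one appears depends on the
# server's JSON encoding, so treat this set as an assumption and trim it to match.
REGISTERED_SPELLINGS = {"REGISTERED", "NAMESPACE_STATE_REGISTERED", "Registered"}


def namespace_state(body: str) -> str:
    """Read namespaceInfo.state from a DescribeNamespace JSON body."""
    data = json.loads(body)
    return str(data["namespaceInfo"]["state"])


def namespace_is_registered(body: str) -> bool:
    """True only for a parseable body whose state is a registered spelling."""
    try:
        return namespace_state(body) in REGISTERED_SPELLINGS
    except (ValueError, KeyError, TypeError):
        # ValueError: not JSON; KeyError: error payload; TypeError: unexpected shape
        return False
```

Because the body is parsed, compact and pretty-printed JSON are treated the same way.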

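Temporal Cloud: Cloud namespaces are reached through a gRPC endpoint, typically of the form <namespace>.<account-id>.tmprl.cloud:7233, and every request must be authenticated with mTLS or an API key, so an unauthenticated HTTP keyword check will not return namespace details. For Cloud, point the TCP check from Step 3 at your namespace endpoint instead. The namespace health check is particularly valuable in multi-tenant setups where different teams own different namespaces.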

Step 6: Monitor SSL Certificates

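Temporal's TLS configuration can protect both the web UI (HTTPS) and the frontend's gRPC connections. An expired certificate breaks browser access and causes every SDK client that verifies TLS to reject the connection. The two listeners can present different certificates, so check each one:

# Web UI certificate (port 443)
openssl s_client -connect temporal.example.com:443 -servername temporal.example.com </dev/null 2>/dev/null | openssl x509 -noout -dates
# gRPC frontend certificate (only if TLS is enabled on port 7233)
openssl s_client -connect temporal.example.com:7233 -servername temporal.example.com -alpn h2 </dev/null 2>/dev/null | openssl x509 -noout -dates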

Add Monitor → SSL Certificate.
Domain: temporal.example.com.
Alert when expiry is within: 30 days.
Alert again: 14 days, 7 days, 3 days, 1 day.
Click Save.
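The expiry arithmetic behind the SSL monitor can be sketched in Python with the standard ssl module. fetch_not_after reads the certificate a server presents (pass port=7233 to inspect a TLS-enabled gRPC frontend), days_until_expiry converts the notAfter timestamp, and crossed_thresholds reports which of the 30/14/7/3/1-day alert levels have been reached. The function names are hypothetical. A default SSL context refuses an already-expired certificate, so a verification error should itself be treated as an alert.

```python
import socket
import ssl
import time
from typing import List, Optional, Sequence

ALERT_THRESHOLDS = (30, 14, 7, 3, 1)  # days, matching the monitor settings above


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the server certificate's notAfter string, e.g. 'Jan 31 00:00:00 2030 GMT'."""
    ctx = ssl.create_default_context()
    # Offer h2 as well: some gRPC servers reject TLS clients that do not offer it
    ctx.set_alpn_protocols(["h2", "http/1.1"])
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname sends SNI and enables hostname verification;
        # an expired certificate raises ssl.SSLCertVerificationError here
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days from now (epoch seconds, default: current time) until notAfter."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def crossed_thresholds(days_left: float,
                       thresholds: Sequence[int] = ALERT_THRESHOLDS) -> List[int]:
    """Alert levels the certificate is already inside, largest first."""
    return sorted((t for t in thresholds if days_left <= t), reverse=True)
```

A certificate with 5 days left has crossed the 30-, 14- and 7-day levels, so the next reminder is due at 3 days.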

mTLS for gRPC: If your Temporal cluster uses mutual TLS for gRPC (where both client and server present certificates), client-side certificate expiry is also a failure mode. SDK clients that use mTLS certificate files will start failing when those certificates expire — monitor those separately or rotate them via your certificate automation toolchain.

Step 7: Configure Alerting

In Vigilmon under Settings → Notifications, configure your alert channels:

| Monitor | Trigger | Action |
|---|---|---|
| Web UI | Non-200 or Temporal missing | Check temporal-web pod; inspect ingress; verify web UI service is healthy |
| gRPC port TCP | Connection refused | Frontend service down; SDK clients disconnected; workers stopped polling |
| HTTP API / namespaces | Non-200 or keyword missing | Frontend API degraded; check Temporal frontend pod logs |
| Namespace availability | Non-200 or REGISTERED missing | Namespace corrupted or deleted; workflow starts will fail |
| SSL certificate | < 30 days to expiry | Renew certificate; test browser and SDK access after renewal |

Alert after: 1 consecutive failure for the gRPC port and web UI monitors. 2 consecutive failures for namespace and HTTP API monitors.
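The consecutive-failure policy can be modeled as a small state machine that alerts once when a failure streak reaches its threshold and reports recovery on the next success. This is a hypothetical Python model of the policy described above, not Vigilmon's code; the monitor keys are illustrative names.

```python
from dataclasses import dataclass, field
from typing import Dict

# Consecutive-failure thresholds from the alerting policy above (keys are illustrative)
ALERT_AFTER = {"grpc_port": 1, "web_ui": 1, "http_api": 2, "namespace": 2}


@dataclass
class AlertTracker:
    thresholds: Dict[str, int]
    failures: Dict[str, int] = field(default_factory=dict)
    alerting: Dict[str, bool] = field(default_factory=dict)

    def record(self, monitor: str, ok: bool) -> str:
        """Record one check result; return 'alert', 'recovered' or 'none'."""
        if ok:
            was_alerting = self.alerting.get(monitor, False)
            self.failures[monitor] = 0
            self.alerting[monitor] = False
            return "recovered" if was_alerting else "none"
        count = self.failures.get(monitor, 0) + 1
        self.failures[monitor] = count
        # Fire exactly once, when the streak first reaches the threshold
        if count == self.thresholds[monitor]:
            self.alerting[monitor] = True
            return "alert"
        return "none"
```

With these thresholds, a single failed namespace check stays quiet, which filters out one-off blips on the slower 2-minute monitor, while a single failed gRPC check alerts immediately.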

Common Temporal Failure Modes and What Vigilmon Catches

| Scenario | Vigilmon monitor |
|---|---|
| Frontend service pod OOM killed | gRPC TCP and HTTP API monitors fire if no healthy frontend replica remains; the web UI page may still load because it is served separately; alert within about 60 s |
| Web UI container crash | Web UI monitor fires; gRPC port may still be healthy |
| Cassandra/PostgreSQL backend down | Frontend responds but workflow operations fail; HTTP API monitor catches degraded state |
| gRPC port blocked by firewall rule | TCP monitor fires on its next check; web UI may still load |
| SSL certificate expires | SSL monitor alerts at 30-day threshold; browser and SDK connections fail |
| Namespace accidentally deleted | Namespace monitor fires; workflow starts return namespace-not-found errors |
| Temporal cluster upgrade failure | gRPC TCP, HTTP API, and web UI monitors fire, depending on which components fail to start |
| Worker connectivity lost (network partition) | Caught only if the partition also blocks Vigilmon's probes; a partition between workers and the cluster alone leaves external monitors green while workflows stall |
| DNS misconfiguration | All monitors fire simultaneously |
| History service degraded | Workflow execution stalls; not directly catchable via HTTP external monitoring |
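Because each scenario trips a different combination of monitors, the table can be turned into a first-pass triage helper. The Python sketch below is a hypothetical mapping derived from the rows above; it suggests where to look first and does not replace reading the individual alerts. The keys (True meaning "failing") mirror the monitors in this guide and are assumptions.

```python
from typing import Mapping

MONITORS = ("web_ui", "grpc_port", "http_api", "namespace", "ssl")


def likely_cause(failing: Mapping[str, bool]) -> str:
    """Map which monitors are failing to a first place to look."""
    web, grpc, api, ns, cert = (failing.get(name, False) for name in MONITORS)
    if web and grpc and api:
        return "DNS, ingress/load balancer or whole-cluster outage"
    if grpc and not web:
        return "frontend down or gRPC port blocked (firewall, load balancer)"
    if web and not grpc:
        return "web UI container or its ingress route"
    if api and not grpc:
        return "frontend reachable but degraded (check persistence backend and frontend logs)"
    if ns and not api:
        return "namespace deprecated, deleted or missing"
    if cert:
        return "certificate expiring or expired"
    if not any((web, grpc, api, ns, cert)):
        return "all external checks passing"
    return "mixed failures; inspect each monitor"
```

The order of the checks matters: the broadest pattern (everything failing) is tested first so that it is not misread as a single-component fault.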

Monitoring Worker Health

Vigilmon monitors the Temporal service endpoints — the control plane that workers connect to. Worker health (whether your Go, Java, or TypeScript workers are polling and executing tasks) is a separate concern that requires application-level observability:

Temporal SDK metrics: when you configure the SDK with a Prometheus exporter, workers expose metrics on a port you choose. Monitor temporal_worker_task_slots_available (a value of 0 means the worker has no free task slots) and temporal_request_failure_total.
Workflow execution monitoring: Set up Temporal's WorkflowExecutionTimeout and use the web UI's task queue view to check for task queue backlog.
Dead-letter queue patterns: Use Temporal's activity retry and timeout configuration to route permanently-failed executions to a dead-letter workflow for alerting.
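If your workers already expose Prometheus metrics, a small scripted check can flag workers that have run out of task slots. This Python sketch parses the Prometheus text exposition format with a simplified regex; the label names task_queue and worker_type used in practice are assumptions, so match them to what your SDK actually emits. A worker reporting zero available slots is saturated and typically will not pick up new tasks until a slot frees up.

```python
import re
from typing import Dict, List, Tuple

# Simplified sample-line pattern: metric_name{labels} value
# (escaped quotes inside label values are not handled; a trailing timestamp is ignored)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text: str, metric: str) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, value) pairs for every sample of the named metric."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m and m.group(1) == metric:
            labels = dict(LABEL_RE.findall(m.group(2) or ""))
            samples.append((labels, float(m.group(3))))
    return samples


def exhausted_slots(text: str) -> List[Dict[str, str]]:
    """Label sets whose temporal_worker_task_slots_available reads 0."""
    return [labels for labels, value in
            parse_samples(text, "temporal_worker_task_slots_available") if value == 0]
```

Run against a scrape of a worker's metrics endpoint, a non-empty result from exhausted_slots points at the task queue and worker type that need more capacity.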

Vigilmon catches infrastructure-level failures (cluster down, port unreachable, certificate expired). Application-level worker health requires SDK-level instrumentation.

Temporal is the backbone for durable workflows: when the frontend goes down, workers stop receiving tasks and in-flight business processes stall silently mid-execution. Vigilmon gives you external visibility into Temporal's availability that doesn't depend on your workers reporting status: web UI uptime, gRPC port connectivity, namespace health, and SSL certificate expiry. You know the moment the cluster becomes unreachable and can restore workflow execution before business processes time out.

Start monitoring Temporal in under 5 minutes — register free at vigilmon.online.