Monitoring Your Phoenix LiveView App with Vigilmon

Elixir is famous for "let it crash." Supervisors restart crashed BEAM processes, keep supervision trees healthy, and let your app recover from most failures without human intervention.

But some failures don't self-heal: the node is unreachable from the internet, the Postgres connection pool is exhausted, a LiveView deploy left a port misbound. These require external eyes — a monitoring service that hits your app from outside and tells you when it can't get through.

This tutorial adds production observability to a Phoenix app:

A health check plug (zero-dependency, framework-idiomatic)
HTTP uptime monitoring with Vigilmon
Multi-region checks that benefit distributed Elixir deployments
Heartbeat monitoring for Oban jobs and GenServer-based workers
Slack alerts and a status page

Step 1: Add a health check plug

Phoenix doesn't need a library for a basic health check. A plug is idiomatic, fast, and has zero dependencies.

# lib/my_app_web/plugs/health_check.ex
defmodule MyAppWeb.Plugs.HealthCheck do
  import Plug.Conn

  def init(opts), do: opts

  def call(%Plug.Conn{request_path: "/health"} = conn, _opts) do
    checks = run_checks()
    status = if Enum.all?(checks, fn {_, v} -> v == :ok end), do: 200, else: 503

    conn
    |> put_resp_content_type("application/json")
    |> send_resp(status, Jason.encode!(%{status: status_label(status), checks: checks}))
    |> halt()
  end

  def call(conn, _opts), do: conn

  defp run_checks do
    %{
      database: check_database(),
      memory: check_memory()
    }
  end

  defp check_database do
    case Ecto.Adapters.SQL.query(MyApp.Repo, "SELECT 1", []) do
      {:ok, _} -> :ok
      {:error, _} -> :error
    end
  end

  defp check_memory do
    # Report :error when system memory usage reaches 90%.
    # :memsup ships with :os_mon, so add :os_mon to extra_applications in mix.exs.
    case :memsup.get_system_memory_data() do
      [] ->
        :ok
      data ->
        total = Keyword.get(data, :total_memory, 1)
        free = Keyword.get(data, :free_memory, total)
        used_pct = (total - free) / total * 100
        if used_pct < 90, do: :ok, else: :error
    end
  end

  defp status_label(200), do: "ok"
  defp status_label(_), do: "degraded"
end

Plug it in before your router (so it bypasses authentication middleware):

# lib/my_app_web/endpoint.ex
defmodule MyAppWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :my_app

  plug MyAppWeb.Plugs.HealthCheck  # ← add before the router

  # ... rest of your plugs
  plug MyAppWeb.Router
end

Test it:

mix phx.server
curl http://localhost:4000/health
# {"status":"ok","checks":{"database":"ok","memory":"ok"}}  (key order may differ)

When the status is 503, the response body shows exactly which check failed. That precision matters when you're triaging at 2 AM.
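One caveat with check_memory: on Linux, :free_memory excludes page cache, so a healthy host can report more than 90% used. A minimal sketch that prefers :available_memory when :memsup reports it follows. The module name MyApp.MemoryCheck is hypothetical, and the function takes the memsup data as an argument so it can be exercised without :os_mon running.

```elixir
defmodule MyApp.MemoryCheck do
  @moduledoc """
  Hypothetical helper: classifies :memsup data as :ok or :error.
  Assumes :os_mon is listed under extra_applications in mix.exs,
  otherwise :memsup is not running in the first place.
  """

  @threshold_pct 90

  # Pure function: pass in the result of :memsup.get_system_memory_data/0.
  def status(data, threshold_pct \\ @threshold_pct)

  # No data reported (for example on an unsupported platform): do not fail.
  def status([], _threshold_pct), do: :ok

  def status(data, threshold_pct) do
    total = Keyword.get(data, :system_total_memory) || Keyword.get(data, :total_memory)
    # Prefer available_memory, which accounts for reclaimable cache on Linux.
    free = Keyword.get(data, :available_memory) || Keyword.get(data, :free_memory)

    cond do
      not is_integer(total) or total <= 0 -> :ok
      not is_integer(free) -> :ok
      (total - free) / total * 100 < threshold_pct -> :ok
      true -> :error
    end
  end
end
```

Inside the plug, check_memory would then reduce to `MyApp.MemoryCheck.status(:memsup.get_system_memory_data())`.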

Optional: use the `plug_checkup` library

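If you'd rather use a library that runs your checks with a timeout and renders the JSON report for you (you still write the check functions yourself):

# mix.exs
{:plug_checkup, "~> 0.6"}

# Each check is a plain function that returns :ok or {:error, reason}.
defmodule MyApp.Checks do
  def database, do: :ok   # your logic
  def redis, do: :ok      # your logic
end

# Build these options and pass them when mounting PlugCheckup (see its README).
checks = [
  %PlugCheckup.Check{name: "db", module: MyApp.Checks, function: :database},
  %PlugCheckup.Check{name: "redis", module: MyApp.Checks, function: :redis}
]

PlugCheckup.Options.new(json_encoder: Jason, error_code: 503, checks: checks)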

Either approach gives you a URL that returns 200 when healthy and 503 with details when not.
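The hand-written plug runs its checks inline, so a hung database query would hang the health endpoint as well. A sketch of one way to bound that, running every check in its own task under a shared deadline, is below. MyApp.CheckRunner and the 2-second default are illustrative assumptions, not part of the original plug.

```elixir
defmodule MyApp.CheckRunner do
  @moduledoc "Hypothetical helper: runs health checks concurrently under one deadline."

  # checks: map of name => zero-arity function returning :ok or :error.
  def run(checks, timeout_ms \\ 2_000) do
    pairs = Enum.to_list(checks)
    tasks = Enum.map(pairs, fn {_name, fun} -> Task.async(fn -> safe_call(fun) end) end)

    results =
      tasks
      |> Task.yield_many(timeout_ms)
      |> Enum.map(fn {task, result} ->
        # A nil result means the task is still running: kill it and count a failure.
        case result || Task.shutdown(task, :brutal_kill) do
          {:ok, :ok} -> :ok
          _ -> :error
        end
      end)

    pairs
    |> Enum.map(fn {name, _fun} -> name end)
    |> Enum.zip(results)
    |> Map.new()
  end

  # Convert raises, throws, and exits into :error so the caller never crashes.
  defp safe_call(fun) do
    fun.()
  rescue
    _ -> :error
  catch
    _kind, _value -> :error
  end
end
```

With this in place, run_checks could call `MyApp.CheckRunner.run(%{database: &check_database/0, memory: &check_memory/0})` and keep the same map shape in the response.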

Step 2: Set up HTTP monitoring in Vigilmon

Point Vigilmon at your health endpoint:

Sign up at vigilmon.online
Click New Monitor → HTTP
Enter https://yourdomain.com/health
Set check interval: 1 minute (paid) or 5 minutes (free)
Save

Vigilmon pings from multiple geographic regions. This is particularly valuable for Phoenix apps:

Why multi-region checks matter for Elixir:

Phoenix and LiveView apps often run as distributed clusters (libcluster, Fly.io regions, Render multi-region). Multi-region monitoring catches partial outages where your app is reachable from one region but not another, which a single probe location would miss.

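If you're deploying to Fly.io with multiple regions, keep in mind that the default app-name.fly.dev address is anycast and routes each probe to the nearest region. To check regions individually, each one needs its own address. The regional hostnames below are placeholders for whatever your setup exposes:

# Add a monitor for each regional endpoint (placeholder hostnames)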
https://app-name.fly.dev/health           # primary
https://lhr.app-name.fly.dev/health       # London
https://ord.app-name.fly.dev/health       # Chicago

Each regional failure alerts independently, so you know whether a failure is local or global.
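The local-versus-global distinction can be made explicit when you triage. The sketch below summarizes a set of per-region results. The module name, the region labels, and the :up/:down representation are all assumptions for illustration, not a Vigilmon data format.

```elixir
defmodule MyApp.OutageScope do
  @moduledoc "Hypothetical helper: summarizes per-region probe results."

  # results: non-empty map of region name => :up | :down
  def classify(results) when map_size(results) > 0 do
    down = for {region, :down} <- results, do: region

    cond do
      down == [] -> :healthy
      length(down) == map_size(results) -> :global_outage
      true -> {:regional_outage, Enum.sort(down)}
    end
  end
end
```

A regional result points at routing, DNS, or one region's machines. A global result points at the app itself or a shared dependency such as the database.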

Step 3: Alerts via Slack

In Vigilmon, go to Notifications → New Channel → Slack and paste your Slack incoming webhook URL.

To create a webhook in Slack:

api.slack.com/apps → Create New App → From scratch
Enable Incoming Webhooks → Add New Webhook
Pick your alerts channel and copy the URL

Enable the Slack channel on your monitor. When Phoenix is unreachable, Vigilmon sends:

🔴 DOWN: yourdomain.com/health
Status: 503 Service Unavailable
Detected from: EU-West, US-East
5 minutes ago

And when it recovers:

✅ RECOVERED: yourdomain.com/health
Downtime: 12 minutes

The recovery notification is often the most important one — it tells you when it's safe to stop firefighting.

Step 4: Heartbeat monitoring for Oban jobs and GenServers

LiveView handles its own process restarts. But scheduled Oban jobs and long-running GenServers can fail silently: the process stays up, the supervisor is happy, but work has stopped happening.

Heartbeat pattern: your job or GenServer pings a unique URL at the end of every successful execution cycle. If Vigilmon doesn't receive a ping within the expected window, it alerts you.
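To make the "expected window" concrete, here is a small model of the decision a heartbeat monitor makes. It is an assumption about how such monitors generally behave, not Vigilmon's implementation. The module name and the grace period parameter are hypothetical.

```elixir
defmodule MyApp.HeartbeatWindow do
  @moduledoc """
  Hypothetical model of a heartbeat monitor's decision: a worker is overdue
  once more than interval + grace seconds have passed since its last ping.
  """

  # last_ping and now are DateTime structs; interval_s and grace_s are seconds.
  def overdue?(last_ping, now, interval_s, grace_s \\ 0) do
    DateTime.diff(now, last_ping, :second) > interval_s + grace_s
  end
end
```

The practical consequence: pick an interval slightly longer than the job's real schedule, so a run that finishes a little late does not page anyone.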

Oban job heartbeat

# lib/my_app/workers/daily_digest_worker.ex
defmodule MyApp.Workers.DailyDigestWorker do
  use Oban.Worker, queue: :default

  require Logger

  @impl Oban.Worker
  def perform(%Oban.Job{}) do
    with :ok <- generate_digest(),
         :ok <- send_digest() do
      ping_heartbeat()
      :ok
    else
      error ->
        Logger.error("DailyDigestWorker failed: #{inspect(error)}")
        {:error, error}
    end
  end

  defp ping_heartbeat do
    url = Application.get_env(:my_app, :vigilmon)[:digest_heartbeat_url]
    if url do
      case Req.get(url, receive_timeout: 5_000) do
        {:ok, _} -> :ok
        {:error, reason} -> Logger.warning("Heartbeat ping failed: #{inspect(reason)}")
      end
    end
  end

  defp generate_digest, do: :ok   # your logic
  defp send_digest, do: :ok        # your logic
end

Add the config (the worker calls Req, so add {:req, "~> 0.5"} to your deps in mix.exs if it isn't there already):

# config/runtime.exs
config :my_app, :vigilmon,
  digest_heartbeat_url: System.get_env("VIGILMON_DIGEST_HEARTBEAT_URL")

GenServer heartbeat

For long-running GenServers (polling external APIs, syncing data), add a heartbeat on each successful tick:

# lib/my_app/sync_server.ex
defmodule MyApp.SyncServer do
  use GenServer

  require Logger

  @interval :timer.minutes(5)

  def start_link(_), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  @impl true
  def init(state) do
    schedule_tick()
    {:ok, state}
  end

  @impl true
  def handle_info(:tick, state) do
    case sync_data() do
      :ok ->
        ping_heartbeat()
      {:error, reason} ->
        Logger.error("Sync failed: #{inspect(reason)}")
        # No ping → Vigilmon alerts after the window expires
    end

    schedule_tick()
    {:noreply, state}
  end

  defp schedule_tick, do: Process.send_after(self(), :tick, @interval)

  defp ping_heartbeat do
    url = Application.get_env(:my_app, :vigilmon)[:sync_heartbeat_url]
    if url, do: Req.get(url, receive_timeout: 5_000)
  end

  defp sync_data, do: :ok  # your logic
end
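Both workers above carry their own ping_heartbeat, and the GenServer version makes the HTTP call inside handle_info, so a slow monitoring endpoint delays the next tick. One possible refactor is a shared helper that sends the ping from a separate task. MyApp.Heartbeat is a hypothetical module, and the injectable http_get argument exists only to make the helper testable.

```elixir
defmodule MyApp.Heartbeat do
  @moduledoc "Hypothetical shared helper for heartbeat pings."
  require Logger

  # key: the config key under :vigilmon, e.g. :sync_heartbeat_url.
  # http_get: injectable for tests; defaults to Req.get/2.
  def ping(key, http_get \\ &Req.get/2) do
    case Application.get_env(:my_app, :vigilmon, [])[key] do
      nil ->
        # No URL configured (for example in dev or test): do nothing.
        :skipped

      url ->
        # Fire and forget so a slow monitoring endpoint never stalls the caller.
        Task.start(fn ->
          case http_get.(url, receive_timeout: 5_000, retry: false) do
            {:ok, %{status: status}} when status in 200..299 -> :ok
            other -> Logger.warning("Heartbeat ping failed: #{inspect(other)}")
          end
        end)

        :ok
    end
  end
end
```

The trade-off is that a failed ping is only logged. That is usually acceptable here, because a missing ping is exactly what the heartbeat monitor is designed to notice.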

In Vigilmon, create a Heartbeat Monitor for each critical worker:

Click New Monitor → Heartbeat
Set expected interval (e.g. 5 minutes for the sync worker, 24 hours for the digest)
Copy the ping URL
Set it as an env variable in your release config

Now if a worker crashes and its supervisor gives up retrying, you get an alert rather than a silent gap in your data.
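The sync server reads :sync_heartbeat_url, which the earlier config/runtime.exs snippet does not set. A version with both keys might look like this. VIGILMON_SYNC_HEARTBEAT_URL is an assumed variable name.

```elixir
# config/runtime.exs
import Config

config :my_app, :vigilmon,
  digest_heartbeat_url: System.get_env("VIGILMON_DIGEST_HEARTBEAT_URL"),
  # Assumed variable name; use whatever you set in your release environment.
  sync_heartbeat_url: System.get_env("VIGILMON_SYNC_HEARTBEAT_URL")
```

Leaving a variable unset yields nil, and both workers skip the ping in that case, which keeps dev and test environments quiet.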

Step 5: LiveView deployment health

LiveView runs over long-lived WebSocket connections (with long-polling as an optional fallback). After a deploy, existing clients reconnect to the new node, and if something goes wrong during that reconnect window, users see a broken interface.

Add a monitor specifically for your LiveView websocket endpoint:

https://yourdomain.com/live/websocket

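Vigilmon's HTTP monitor verifies that the endpoint responds. A plain GET without WebSocket upgrade headers does not complete the handshake, so the endpoint typically answers with a non-200 status; what matters is that a response arrives at all. This catches port binding failures, SSL termination issues, and misconfigured nginx/Caddy upstreams after a deploy. If your monitor only accepts 2xx responses, point it at a LiveView-rendered page instead.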

You can also use a keyword check: in Vigilmon's HTTP monitor, add a body keyword match for content that should appear on your homepage (like your app name or a page title). If the response is a 200 but the wrong page, the keyword check fails.
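The logic of a keyword check is simple enough to model in a few lines. This is an illustrative sketch with a hypothetical module name, not Vigilmon's implementation.

```elixir
defmodule MyApp.KeywordCheck do
  @moduledoc "Hypothetical model of a status-plus-keyword check."

  # Passes only when the status is 200 AND the body contains the keyword.
  def passes?(status, body, keyword) when is_binary(body) and is_binary(keyword) do
    status == 200 and String.contains?(body, keyword)
  end
end
```

This is what catches a reverse proxy serving its default page: the status is 200, but your app name is nowhere in the body.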

Step 6: Status page and badge

Status page:

Go to Status Pages → New Status Page in Vigilmon
Add your monitors
Copy the public URL

Share it in your README, error pages, or in your Slack channel topic so the team can check it first when users report issues.

README badge:

![Uptime](https://vigilmon.online/badge/your-monitor-slug)

As an HTML embed:

<a href="https://status.yourdomain.com">
  <img src="https://vigilmon.online/badge/your-monitor-slug" alt="Uptime">
</a>

The badge shows live status and response time.

What you've built

| What | How |
|------|-----|
| Health check endpoint | Custom HealthCheck plug, zero dependencies |
| DB + memory checks | Ecto.Adapters.SQL.query/3, :memsup |
| HTTP uptime monitoring | Vigilmon HTTP monitor → /health |
| Multi-region coverage | Vigilmon multi-probe checks |
| Slack downtime alerts | Vigilmon Slack notification channel |
| Oban job monitoring | Heartbeat ping on perform/1 success |
| GenServer monitoring | Heartbeat ping on each successful tick |
| Status page | Vigilmon public status page |
| README badge | /badge/{slug} SVG embed |

BEAM keeps your processes alive. Vigilmon keeps external eyes on the result.

Next steps

Add :memsup and :cpu_sup data to your health response for richer monitoring context
Use Vigilmon's response time history to catch slow Ecto queries before they cause timeouts
Add separate heartbeat monitors for every Oban queue that processes business-critical jobs
If you run multiple Fly.io regions, add a monitor per region to tell regional outages apart from global ones

Get started free at vigilmon.online.