Observability for Elixir Applications
OpenTelemetry integration, distributed tracing, and structured logging
You cannot debug what you cannot see. Sounds obvious; then you watch a team spend three days hunting a performance regression that proper instrumentation would have surfaced in three minutes.
Observability is not monitoring. Monitoring tells you when something is wrong; observability tells you why. In a distributed Elixir system--where dozens of processes handle a single request, where failures are routine and recovery is automatic--the "why" demands a fundamentally different approach than tailing log files and staring at CPU graphs.
The Three Pillars: Logs, Metrics, Traces
Logs are discrete events with context. "User 42 failed authentication at 14:32:07 because their password hash did not match." They answer what happened; they carry arbitrary metadata and every event is unique. They are also expensive to store and query at scale--a tradeoff that bites you exactly when you need them most.
Metrics are aggregated measurements over time. "Authentication failures averaged 12 per minute over the last hour." They answer how much; they're cheap to store because you're compressing many events into summary statistics. But they lose the individual context. You know failures increased. You don't know why.
Traces connect the dots across process and network boundaries. "Request ABC started in the API gateway, called the auth service, which queried the database, which timed out." They answer how did we get here--and in a system where a single user action might touch fifteen GenServers, that question matters more than most teams realize.
Elixir's ecosystem has first-class support for all three. The foundation is Telemetry.
Telemetry: The Backbone of Elixir Instrumentation
Telemetry is a small library that does one thing well: dynamic event dispatch. Libraries emit events; your application attaches handlers to process them. The decoupling matters--library authors don't need to know whether you're sending data to Prometheus, Datadog, or a text file on your desktop.
A Telemetry event has three components:
:telemetry.execute(
  [:my_app, :request, :complete],   # Event name (list of atoms)
  %{duration: 42_000_000},          # Measurements (numeric values)
  %{route: "/users", status: 200}   # Metadata (arbitrary context)
)
The event name is a hierarchical identifier. By convention, the first element is your application or library name. Measurements are the numeric data you care about--durations, counts, sizes. Metadata is everything else: request IDs, user IDs, route names, error types.
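Emitting such an event around a unit of work takes one call plus some timing. A minimal sketch, where do_process/2 stands in for your own code and is assumed to return a status and a result:

def process_request(route, params) do
  start = System.monotonic_time()
  {status, result} = do_process(route, params)

  :telemetry.execute(
    [:my_app, :request, :complete],
    %{duration: System.monotonic_time() - start},   # native time units
    %{route: route, status: status}
  )

  result
end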
Attaching Handlers
You attach handlers at application startup, typically in your Application module:
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    # Attach telemetry handlers before starting the supervision tree
    attach_telemetry_handlers()

    children = [
      MyApp.Repo,
      MyAppWeb.Endpoint
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end

  defp attach_telemetry_handlers do
    :telemetry.attach_many(
      "my-app-handlers",
      [
        [:my_app, :request, :complete],
        [:my_app, :request, :exception],
        [:phoenix, :endpoint, :stop],
        [:my_app, :repo, :query]
      ],
      &MyApp.TelemetryHandler.handle_event/4,
      nil
    )
  end
end
The handler function receives four arguments: the event name, measurements, metadata, and your handler config (the fourth argument to attach_many).
defmodule MyApp.TelemetryHandler do
  require Logger

  def handle_event([:my_app, :request, :complete], measurements, metadata, _config) do
    Logger.info("Request completed",
      duration_ms: System.convert_time_unit(measurements.duration, :native, :millisecond),
      route: metadata.route,
      status: metadata.status
    )
  end

  def handle_event([:my_app, :repo, :query], measurements, metadata, _config) do
    # 100ms, assuming nanosecond native time units (the typical case)
    if measurements.total_time > 100_000_000 do
      Logger.warning("Slow query detected",
        query: metadata.query,
        duration_ms: System.convert_time_unit(measurements.total_time, :native, :millisecond),
        source: metadata.source
      )
    end
  end

  def handle_event(_event, _measurements, _metadata, _config), do: :ok
end
One thing that trips people up: Telemetry handlers run in the process that emits the event. If a handler raises, Telemetry catches the error and permanently detaches that handler, so your instrumentation silently stops reporting. If a handler blocks, it blocks the emitting process. Keep handlers fast and defensive.
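One way to stay defensive, sketched below: do only cheap work inline, hand anything heavier to another process, and rescue unexpected errors so a malformed event cannot cost you the handler. MyApp.MetricsAggregator is a hypothetical GenServer, not part of the examples above.

defmodule MyApp.SafeHandler do
  require Logger

  def handle_event(event, measurements, metadata, _config) do
    # Cheap, non-blocking work only; push the rest to a separate process.
    GenServer.cast(MyApp.MetricsAggregator, {:event, event, measurements, metadata})
  rescue
    exception ->
      # Swallow the error so Telemetry does not detach this handler.
      Logger.warning("Telemetry handler failed: #{Exception.message(exception)}")
  end
end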
What Phoenix and Ecto Already Emit
You get substantial instrumentation for free. Phoenix emits events for every request:
- [:phoenix, :endpoint, :start] -- request received
- [:phoenix, :endpoint, :stop] -- response sent
- [:phoenix, :router_dispatch, :start] -- routing began
- [:phoenix, :router_dispatch, :stop] -- controller invoked
Ecto emits events for every query:
- [:my_app, :repo, :query] -- query executed (includes query string, params, timing)
LiveView, Oban, Broadway, Finch--most major Elixir libraries follow this pattern. Consult their documentation for the specific events they emit.
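When you want to see exactly what one of these events carries, it is easy to poke at it from IEx. A sketch; the "debug-print" handler id is arbitrary:

# Attach a throwaway printer to the events you are curious about:
:telemetry.attach_many(
  "debug-print",
  [[:phoenix, :endpoint, :stop], [:my_app, :repo, :query]],
  fn event, measurements, metadata, _config ->
    IO.inspect({event, measurements, Map.keys(metadata)}, label: "telemetry")
  end,
  nil
)

# List handlers attached under a prefix, then clean up:
:telemetry.list_handlers([:phoenix])
:telemetry.detach("debug-print")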
OpenTelemetry: The Industry Standard
Telemetry is Elixir-specific. OpenTelemetry is vendor-neutral; it standardizes telemetry data across languages and platforms. If you run a polyglot architecture, or want to ship data to commercial observability platforms, this is the integration point.
The Elixir packages you'll need:
# mix.exs
defp deps do
  [
    {:opentelemetry, "~> 1.4"},
    {:opentelemetry_api, "~> 1.3"},
    {:opentelemetry_exporter, "~> 1.7"},
    {:opentelemetry_phoenix, "~> 1.2"},
    {:opentelemetry_ecto, "~> 1.2"},
    {:opentelemetry_oban, "~> 1.1"}  # if using Oban
  ]
end
Configuration happens in your config files:
# config/runtime.exs
config :opentelemetry,
  resource: [
    service: [
      name: "my-app",
      version: Application.spec(:my_app, :vsn) |> to_string()
    ]
  ],
  span_processor: :batch,
  traces_exporter: :otlp

config :opentelemetry_exporter,
  otlp_protocol: :grpc,
  otlp_endpoint: System.get_env("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
The library packages (opentelemetry_phoenix, opentelemetry_ecto) automatically translate Telemetry events into OpenTelemetry spans. Set them up at application start:
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    OpentelemetryPhoenix.setup()
    OpentelemetryEcto.setup([:my_app, :repo])

    children = [
      MyApp.Repo,
      MyAppWeb.Endpoint
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end
That's the minimal setup. Every Phoenix request and Ecto query now generates spans that flow to your configured exporter. Two function calls and you've got distributed tracing.
Distributed Tracing Across Services
Tracing earns its keep in distributed systems. When service A calls service B, the trace context has to propagate; otherwise you get two disconnected traces instead of one story about what actually happened.
OpenTelemetry handles this through context propagation. When making HTTP requests with a client that supports it (Finch with the appropriate middleware, for instance), trace headers are injected automatically:
defmodule MyApp.ExternalService do
  require OpenTelemetry.Tracer, as: Tracer

  def fetch_user_data(user_id) do
    Tracer.with_span "external_service.fetch_user" do
      Tracer.set_attributes([
        {:user_id, user_id},
        {:service, "user-service"}
      ])

      url = "https://user-service.internal/users/#{user_id}"

      case Finch.build(:get, url) |> Finch.request(MyApp.Finch) do
        {:ok, %{status: 200, body: body}} ->
          Tracer.set_attribute(:status, "success")
          {:ok, Jason.decode!(body)}

        {:ok, %{status: status}} ->
          Tracer.set_attribute(:status, "error")
          Tracer.set_attribute(:http_status, status)
          {:error, :unexpected_status}

        {:error, reason} ->
          Tracer.record_exception(reason)
          {:error, reason}
      end
    end
  end
end
For manual context propagation--when you're not using auto-instrumented HTTP clients--you extract and inject the context explicitly:
# Extracting context from incoming request headers
def handle_incoming_request(headers) do
  :otel_propagator_text_map.extract(headers)
  # Context is now set for this process
end

# Injecting context into outgoing request headers
def make_outgoing_request(url, body) do
  headers = :otel_propagator_text_map.inject([])
  # headers now contains trace context
  HTTPClient.post(url, body, headers)
end
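The same concern exists inside a single node: OpenTelemetry context lives in the process dictionary, so it does not follow messages to spawned processes. A sketch of carrying it into a Task, using OpenTelemetry.Ctx from the opentelemetry_api package:

def fetch_in_background(user_id) do
  # Capture the caller's context before leaving this process...
  ctx = OpenTelemetry.Ctx.get_current()

  Task.async(fn ->
    # ...and restore it in the spawned process so child spans join the same trace.
    OpenTelemetry.Ctx.attach(ctx)
    MyApp.ExternalService.fetch_user_data(user_id)
  end)
end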
Structured Logging with Logger Metadata
Elixir's Logger supports metadata--key-value pairs attached to log messages. Pair this with a structured logging backend and your logs stop being opaque strings; they become queryable data.
Set metadata at process boundaries:
defmodule MyAppWeb.Plugs.RequestContext do
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    request_id = get_req_header(conn, "x-request-id") |> List.first() || generate_request_id()

    Logger.metadata(
      request_id: request_id,
      user_id: conn.assigns[:current_user][:id],
      remote_ip: conn.remote_ip |> :inet.ntoa() |> to_string()
    )

    conn
  end

  defp generate_request_id, do: :crypto.strong_rand_bytes(16) |> Base.encode16(case: :lower)
end
Every log statement in that process now carries this metadata automatically. In a GenServer handling background work:
defmodule MyApp.JobWorker do
  use GenServer
  require Logger

  def handle_cast({:process, job}, state) do
    Logger.metadata(job_id: job.id, job_type: job.type)
    Logger.info("Starting job processing")

    case process_job(job) do
      {:ok, result} ->
        Logger.info("Job completed successfully", result_size: byte_size(result))
        {:noreply, state}

      {:error, reason} ->
        Logger.error("Job failed", reason: inspect(reason))
        {:noreply, state}
    end
  end
end
For JSON-formatted logs--required by most log aggregation systems--configure a JSON formatter such as the logger_json library:

# config/prod.exs
# logger_json v6+ plugs in as an Erlang :logger formatter on the default handler
config :logger, :default_handler,
  formatter: {LoggerJSON.Formatters.GoogleCloud, metadata: :all}
Now your logs are machine-parseable. You can query for all logs where job_type = "invoice_generation" and duration > 5000; no more grepping through walls of text at 3 AM hoping the right line scrolls past.
Building Custom Telemetry Reporters
Pre-built integrations don't always cover what you need. I've worked on systems where the most important metrics were domain-specific--queue depths for specific job types, processing latency broken down by customer tier, that sort of thing.
A custom reporter wires together metric definitions, event handlers, and periodic reporting:
defmodule MyApp.Metrics do
  import Telemetry.Metrics

  def metrics do
    [
      # Counters: count occurrences
      counter("phoenix.endpoint.stop.duration",
        tags: [:route, :status],
        tag_values: &tag_values/1
      ),

      # Distributions: track value distributions (for percentiles)
      distribution("phoenix.endpoint.stop.duration",
        unit: {:native, :millisecond},
        tags: [:route],
        tag_values: &tag_values/1,
        reporter_options: [buckets: [10, 50, 100, 250, 500, 1000, 2500, 5000]]
      ),

      # Summaries: track statistics
      summary("my_app.repo.query.total_time",
        unit: {:native, :millisecond},
        tags: [:source]
      ),

      # Last value: track current state
      last_value("vm.memory.total", unit: :byte),
      last_value("vm.total_run_queue_lengths.total")
    ]
  end

  defp tag_values(%{conn: conn}) do
    %{
      route: Phoenix.Controller.controller_module(conn),
      status: conn.status
    }
  end
end
For a console reporter during development:
defmodule MyApp.Metrics.ConsoleReporter do
  use GenServer
  require Logger

  def start_link(opts) do
    metrics = Keyword.fetch!(opts, :metrics)
    GenServer.start_link(__MODULE__, metrics, name: __MODULE__)
  end

  @impl true
  def init(metrics) do
    groups = Enum.group_by(metrics, & &1.event_name)

    for {event, metrics} <- groups do
      :telemetry.attach(
        {__MODULE__, event, self()},
        event,
        &__MODULE__.handle_event/4,
        metrics
      )
    end

    {:ok, %{}}
  end

  def handle_event(_event_name, measurements, metadata, metrics) do
    for metric <- metrics do
      measurement = extract_measurement(metric, measurements)
      tags = extract_tags(metric, metadata)

      Logger.debug("Metric: #{inspect(metric.name)}",
        value: measurement,
        tags: tags
      )
    end
  end

  defp extract_measurement(metric, measurements) do
    case metric.measurement do
      fun when is_function(fun) -> fun.(measurements)
      key -> Map.get(measurements, key)
    end
  end

  defp extract_tags(metric, metadata) do
    tag_values = metric.tag_values.(metadata)
    Map.take(tag_values, metric.tags)
  end
end
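To use it, pass the metric definitions in when you add the reporter to the supervision tree. A sketch:

children = [
  MyApp.Repo,
  {MyApp.Metrics.ConsoleReporter, metrics: MyApp.Metrics.metrics()},
  MyAppWeb.Endpoint
]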
Connecting to Observability Backends
Each backend has its own integration pattern. The good news: most of the wiring is boilerplate you set up once.
Prometheus
Prometheus uses a pull model--it scrapes a metrics endpoint you expose.
# mix.exs
{:telemetry_metrics_prometheus, "~> 1.1"}

# application.ex -- the reporter runs its own scrape endpoint
children = [
  {TelemetryMetricsPrometheus, metrics: MyApp.Metrics.metrics()}
]

TelemetryMetricsPrometheus starts a standalone HTTP server for the scrape endpoint (port 9568 by default, configurable with the :port option), so no Phoenix route is needed. Prometheus scrapes http://your-app:9568/metrics on its configured interval and ingests the data.
Datadog
Datadog accepts metrics via StatsD protocol or their agent:
# mix.exs
{:telemetry_metrics_statsd, "~> 0.7"}

# application.ex
children = [
  {TelemetryMetricsStatsd,
   metrics: MyApp.Metrics.metrics(),
   host: "localhost",
   port: 8125,
   formatter: :datadog}
]
For traces, configure the OpenTelemetry exporter to send to Datadog's OTLP endpoint, or use the Datadog agent as a collector.
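A sketch of the exporter config for the agent route, assuming the agent's OTLP gRPC receiver is enabled on its default port (4317); the hostname is a placeholder:

# config/runtime.exs
config :opentelemetry_exporter,
  otlp_protocol: :grpc,
  otlp_endpoint: "http://datadog-agent:4317"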
Honeycomb
Honeycomb ingests OpenTelemetry data natively:
# config/runtime.exs
config :opentelemetry_exporter,
  otlp_protocol: :http_protobuf,
  otlp_endpoint: "https://api.honeycomb.io:443",
  otlp_headers: [
    {"x-honeycomb-team", System.fetch_env!("HONEYCOMB_API_KEY")}
  ]
Honeycomb handles high-cardinality data well. Unlike Prometheus--which struggles when you have many unique tag values--Honeycomb treats every event as a wide structured row rather than a pre-aggregated series. This makes it a natural fit for Elixir's process-heavy architecture; you can query by individual process IDs without worrying about cardinality explosion.
Self-Hosted: Grafana Stack
For self-hosted observability, the Grafana stack (Prometheus, Loki, Tempo) covers all three pillars:
# Metrics to Prometheus (as shown above)

# Traces to Tempo via OTLP
config :opentelemetry_exporter,
  otlp_protocol: :grpc,
  otlp_endpoint: "http://tempo:4317"

# Logs to Loki via a logger backend
config :logger,
  backends: [LokiLoggerBackend]

config :loki_logger_backend,
  url: "http://loki:3100/loki/api/v1/push",
  labels: %{app: "my_app", env: "production"}
Practical Patterns
A few patterns I keep coming back to in production Elixir systems:
Correlation IDs everywhere. Generate a unique ID at the edge of your system and propagate it through every process, log statement, and external call. When something goes wrong, you can reconstruct the entire request flow. I've seen teams skip this and then spend hours manually correlating log lines across services; it's the kind of shortcut that costs you tenfold.
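A sketch of forwarding the correlation ID set by the RequestContext plug above onto outgoing calls; MyApp.Finch and the x-request-id header mirror the earlier examples, and downstream_url is a placeholder:

def call_downstream(downstream_url, body) do
  # Reuse the request_id this process already carries in its Logger metadata.
  request_id = Logger.metadata()[:request_id] || "unknown"

  Finch.build(:post, downstream_url, [{"x-request-id", request_id}], body)
  |> Finch.request(MyApp.Finch)
end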
Sample expensive operations. Tracing every request in a high-throughput system generates enormous data volumes. Configure sampling:
config :opentelemetry,
  sampler: {:parent_based, %{root: {:trace_id_ratio_based, 0.1}}}

This samples 10% of new traces at the root and follows the parent's sampling decision for requests that arrive with trace context already set. Adjust based on your traffic and budget.
Measure what matters. Instrument business metrics, not just technical ones. "Orders processed per minute" tells you more than "requests per minute." "Payment failures by reason" tells you more than "HTTP 500s." The teams I've seen get the most value from observability are the ones who instrument domain events first and infrastructure second.
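A sketch of what that looks like in practice: emit a domain event where the work happens and define a matching metric. The event name and the order and customer fields are illustrative:

# Where an order is successfully processed:
:telemetry.execute(
  [:my_app, :orders, :processed],
  %{count: 1, total_cents: order.total_cents},
  %{payment_method: order.payment_method, customer_tier: customer.tier}
)

# And in MyApp.Metrics.metrics/0:
counter("my_app.orders.processed.count",
  tags: [:payment_method, :customer_tier]
)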
Alert on symptoms, debug with causes. Your alerting should fire on user-facing symptoms--elevated error rates, increased latency, failed transactions. Your observability stack helps you find the cause once the alert fires. Getting this backwards (alerting on causes like CPU or memory) leads to noisy paging and missed incidents.
What This Actually Buys You
Observability in Elixir is not a bolt-on afterthought. Telemetry, the OpenTelemetry integrations, and Logger's structured metadata form a system that fits the way Elixir applications actually work--processes communicating through messages, supervisors restarting failures, work distributed across schedulers.
The investment compounds. Every hour spent on instrumentation saves days of debugging later; every span you add to a trace is context you won't have to reconstruct from memory during an incident. Every structured log field is a query you can run instead of a grep you have to write.
I won't pretend any of this is glamorous work. But the team that instruments early is the team that sleeps through the night.