Tags: elixir, distributed-systems, otp, clustering

Distribution is Elixir's superpower and its most dangerous footgun. The BEAM makes connecting nodes trivially easy — a single Node.connect/1 call and you have transparent message passing across machines; the kind of thing that takes weeks to build in most languages happens in one line. This elegance is the problem. Teams distribute too early, before they've exhausted simpler solutions, before they understand what network partitions will do to their carefully crafted supervision trees.

I've watched teams spend months debugging split-brain scenarios that would never have occurred if they'd just run two independent instances behind a load balancer. Distribution is not a feature you adopt. It's a trade-off you accept.

The Siren Song of Node Clustering

Connecting Elixir nodes manually works fine for development. Production demands automatic discovery; libcluster has become the de facto standard here, supporting multiple strategies for nodes to find each other.

DNS-based discovery is the simplest strategy for Kubernetes deployments:

# config/runtime.exs
config :libcluster,
  topologies: [
    k8s_dns: [
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "myapp-headless",
        application_name: "myapp",
        polling_interval: 5_000
      ]
    ]
  ]

This polls a headless Kubernetes service every five seconds, picking up new nodes as pods scale. The application_name must match your release name — libcluster uses it to construct the full node name.

Gossip-based discovery works outside Kubernetes:

config :libcluster,
  topologies: [
    gossip: [
      strategy: Cluster.Strategy.Gossip,
      config: [
        port: 45892,
        if_addr: "0.0.0.0",
        multicast_addr: "230.1.1.1",
        broadcast_only: true
      ]
    ]
  ]

Gossip uses UDP multicast to announce node presence; no external infrastructure required. The catch is that your network has to allow multicast traffic. Many cloud VPCs don't. AWS blocks it by default. So does GCP. Azure supports it in specific configurations — check your provider's docs before committing to this strategy.

For Fly.io, DNS-based discovery against their internal DNS works reliably:

config :libcluster,
  topologies: [
    fly: [
      strategy: Cluster.Strategy.DNSPoll,
      config: [
        polling_interval: 5_000,
        query: "myapp.internal",
        node_basename: "myapp"
      ]
    ]
  ]

Start the libcluster supervisor early in your application tree:

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    topologies = Application.get_env(:libcluster, :topologies, [])

    children = [
      {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]},
      # ... other children
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end

Clustering is the easy part. Keeping state consistent across that cluster — that's where things get ugly.

Process Groups: The Modern Way

OTP 23 introduced :pg, a complete rewrite of the older :pg2 module. The difference matters: :pg is partition-tolerant by design. Each node maintains its own view of group membership; views eventually converge after a partition heals.

Starting a scope:

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    topologies = Application.get_env(:libcluster, :topologies, [])

    children = [
      {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]},
      %{id: :pg, start: {:pg, :start_link, [MyApp.PG]}},
      # ... other children
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end

Processes join and leave groups dynamically:

defmodule MyApp.Worker do
  use GenServer

  def start_link(topic) do
    GenServer.start_link(__MODULE__, topic)
  end

  def init(topic) do
    :pg.join(MyApp.PG, {:subscribers, topic}, self())
    {:ok, %{topic: topic}}
  end

  def terminate(_reason, state) do
    # :pg removes dead members automatically; the explicit leave only
    # matters for clean shutdowns where the process keeps running.
    :pg.leave(MyApp.PG, {:subscribers, state.topic}, self())
    :ok
  end
end

Broadcasting means iterating over the membership list:

defmodule MyApp.Broadcaster do
  def notify(topic, message) do
    members = :pg.get_members(MyApp.PG, {:subscribers, topic})

    for pid <- members do
      send(pid, {:notification, message})
    end

    {:ok, length(members)}
  end

  def notify_local(topic, message) do
    # Only notify processes on this node — faster, no network hops
    members = :pg.get_local_members(MyApp.PG, {:subscribers, topic})

    for pid <- members do
      send(pid, {:notification, message})
    end

    {:ok, length(members)}
  end
end

The split between get_members/2 and get_local_members/2 is worth internalizing. The former returns members across the entire cluster; the latter returns only processes on the current node. When you can scope to local members, you eliminate network round-trips entirely. Faster. Simpler.

Process groups solve pub/sub elegantly. They don't solve singleton processes or distributed state.

One thing that catches newcomers: :pg group membership isn't persistent. A crashed process gets removed from all groups automatically; when the supervisor restarts it, the new process must rejoin explicitly. This is correct behavior — a restarted process is a new process with a new PID — but it's the kind of thing you discover at 2 AM when your message fanout silently drops to zero.

:pg also supports multiple scopes. You can run separate instances for different subsystems; this prevents namespace collisions and lets you tune sync behavior independently:

# Two independent process group scopes
children = [
  %{id: :pg_pubsub, start: {:pg, :start_link, [MyApp.PubSub]}},
  %{id: :pg_workers, start: {:pg, :start_link, [MyApp.Workers]}}
]

Global vs Local Registration: Choose Your Pain

Elixir gives you two registration models. They fail in completely different ways.

Local registration through Registry is fast and partition-proof:

# In your supervision tree
{Registry, keys: :unique, name: MyApp.Registry}

# In your GenServer
def start_link(id) do
  GenServer.start_link(__MODULE__, id, name: {:via, Registry, {MyApp.Registry, id}})
end

# Looking up a process
case Registry.lookup(MyApp.Registry, worker_id) do
  [{pid, _value}] -> {:ok, pid}
  [] -> {:error, :not_found}
end

Local registration works entirely within a single node. During a network partition, nothing breaks — you simply can't see processes on other nodes. That's a feature.

Global registration through :global gives you cluster-wide unique names:

def start_link(id) do
  case GenServer.start_link(__MODULE__, id, name: {:global, {:worker, id}}) do
    {:ok, pid} -> {:ok, pid}
    {:error, {:already_started, pid}} -> {:ok, pid}
  end
end

# Looking up a process anywhere in the cluster
case :global.whereis_name({:worker, id}) do
  :undefined -> {:error, :not_found}
  pid -> {:ok, pid}
end

Global registration exacts a heavy toll. Every registration and lookup requires coordination across nodes; during a partition, :global must choose between allowing duplicate names — violating its core guarantee — or blocking operations until the partition heals. It blocks.

I've seen :global cause cascading failures when a single node becomes unreachable. The registration system locks up waiting for consensus; timeouts cascade through calling code; suddenly your entire cluster is wedged. Three nodes, all healthy, all frozen because one peer dropped off the network for forty seconds.

Use :global only when you genuinely need exactly one instance of something across the cluster and you can tolerate blocking during partitions. For most use cases, local registration with application-level routing is safer.
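That application-level routing can be as small as hashing the key to a node and doing a local Registry lookup there. A sketch, assuming a hypothetical MyApp.Router module and the MyApp.Registry shown above running on every node:

```elixir
defmodule MyApp.Router do
  # Hypothetical sketch: deterministically map each key to a node, then
  # resolve the pid through that node's local Registry. The node list
  # changes with cluster membership, so keys re-map on scale events; a
  # real implementation would use consistent hashing to limit reshuffling.
  def node_for(key) do
    nodes = Enum.sort([Node.self() | Node.list()])
    Enum.at(nodes, :erlang.phash2(key, length(nodes)))
  end

  def call(key, msg, timeout \\ 5_000) do
    target = node_for(key)

    if target == Node.self() do
      case Registry.lookup(MyApp.Registry, key) do
        [{pid, _value}] -> GenServer.call(pid, msg, timeout)
        [] -> {:error, :not_found}
      end
    else
      # :erpc.call bounds the remote hop with an explicit timeout,
      # so a partitioned peer produces an error instead of a hang
      :erpc.call(target, __MODULE__, :call, [key, msg, timeout], timeout)
    end
  end
end
```

No cluster-wide consensus is involved: each node computes the same routing decision independently, which is exactly why it stays responsive during a partition.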

Network Partitions: When Theory Meets Reality

The CAP theorem says a distributed system can provide at most two of three guarantees: Consistency, Availability, and Partition tolerance. Since partitions will happen — they're not optional — you're really choosing between consistency and availability.

Elixir's default primitives lean toward availability. Process groups stay available during partitions; each side maintains its own membership view. Messages to processes on the other side fail, but local operations continue. No coordination required. No coordination possible.

Detecting partitions means monitoring node connections:

defmodule MyApp.ClusterMonitor do
  use GenServer
  require Logger

  def start_link(_) do
    GenServer.start_link(__MODULE__, [], name: __MODULE__)
  end

  def init(_) do
    :net_kernel.monitor_nodes(true, node_type: :visible)
    # Seed with the nodes already connected at startup
    {:ok, %{nodes: MapSet.new(Node.list())}}
  end

  def handle_info({:nodeup, node, _info}, state) do
    Logger.info("Node joined: #{node}")
    {:noreply, %{state | nodes: MapSet.put(state.nodes, node)}}
  end

  def handle_info({:nodedown, node, _info}, state) do
    Logger.warning("Node left: #{node}")
    # Trigger partition handling logic here
    handle_potential_partition(node)
    {:noreply, %{state | nodes: MapSet.delete(state.nodes, node)}}
  end

  defp handle_potential_partition(_node) do
    # Your partition response strategy:
    # - Fence off writes?
    # - Switch to degraded mode?
    # - Attempt reconnection?
    :ok
  end
end

The hard problem isn't detecting nodedown events. It's knowing what they mean. A crashed node is gone; a partitioned node is still running, potentially accepting writes that will conflict with yours. From inside the partition, you can't tell the difference.

Consider a payment processing system where workers claim jobs from a queue. Partition happens. Both sides claim the same job. Without coordination, the payment processes twice. The conservative response: stop claiming jobs when any node is unreachable. The aggressive response: keep processing and deduplicate later using idempotency keys. Neither is universally correct; the choice depends on whether duplicate operations or temporary unavailability causes more damage to your specific business.
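The idempotency-key approach leans on a database unique constraint as the arbiter. A sketch in which every name is hypothetical: a processed_payments table with a unique index on idempotency_key, and an Ecto repo named MyApp.Repo:

```elixir
defmodule MyApp.Payments do
  # Hypothetical sketch: claim the idempotency key before doing the work.
  # Whichever side of the partition inserts first wins; the other side
  # hits the unique index, inserts zero rows, and skips the charge.
  def process(job) do
    case MyApp.Repo.query(
           "INSERT INTO processed_payments (idempotency_key) VALUES ($1) ON CONFLICT DO NOTHING",
           [job.idempotency_key]
         ) do
      {:ok, %{num_rows: 1}} -> charge(job)
      {:ok, %{num_rows: 0}} -> {:ok, :already_processed}
      {:error, _} = error -> error
    end
  end

  # Placeholder for the real payment work
  defp charge(_job), do: {:ok, :charged}
end
```

Note what's doing the coordination here: the database, not the cluster. Both partitioned nodes can reach it, so the duplicate resolves without any node-to-node agreement.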

Erlang defaults to aggressive — nodes continue operating independently. If you need conservative behavior, you build it yourself. Libraries like Partisan offer alternative distribution protocols with different trade-offs, but adopting them is a significant investment.

Distributed State: Pick Your Consistency Model

Shared state across nodes is where distributed Elixir gets genuinely difficult. The ecosystem gives you several options; each trades off differently.

Horde provides distributed supervisors and registries using delta-state CRDTs:

defmodule MyApp.DistributedRegistry do
  use Horde.Registry

  def start_link(_) do
    Horde.Registry.start_link(__MODULE__, [keys: :unique], name: __MODULE__)
  end

  def init(opts) do
    [members: members()]
    |> Keyword.merge(opts)
    |> Horde.Registry.init()
  end

  defp members do
    [Node.self() | Node.list()]
    |> Enum.map(fn node -> {__MODULE__, node} end)
  end
end

defmodule MyApp.DistributedSupervisor do
  use Horde.DynamicSupervisor

  def start_link(_) do
    Horde.DynamicSupervisor.start_link(__MODULE__, [], name: __MODULE__)
  end

  def init(_) do
    Horde.DynamicSupervisor.init(
      strategy: :one_for_one,
      members: :auto
    )
  end

  def start_worker(id) do
    spec = {MyApp.Worker, id}
    Horde.DynamicSupervisor.start_child(__MODULE__, spec)
  end
end

Horde guarantees eventual consistency. After a partition heals, conflicting registrations resolve by keeping one process and terminating duplicates. Your processes must handle unexpected termination gracefully — which means they already should, but distributed systems have a way of finding the edge cases in your shutdown logic.
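A worker that traps exits gets a chance to run terminate/2 before Horde's duplicate resolution kills it. A sketch, with persist_state/1 as a hypothetical persistence hook:

```elixir
defmodule MyApp.ResilientWorker do
  use GenServer

  def start_link(id), do: GenServer.start_link(__MODULE__, id)

  def init(id) do
    # Trap exits so terminate/2 runs when this process is shut down
    # to resolve a duplicate registration after a partition heals.
    Process.flag(:trap_exit, true)
    {:ok, %{id: id}}
  end

  def terminate(_reason, state) do
    # Persist anything you can't afford to lose before dying.
    persist_state(state)
    :ok
  end

  # Hypothetical hook: write to your database, a cache, wherever.
  defp persist_state(_state), do: :ok
end
```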

DeltaCrdt offers lower-level CRDT primitives for custom data structures:

defmodule MyApp.SharedMap do
  def start_link(name) do
    DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, name: name, sync_interval: 50)
  end

  def put(map, key, value) do
    # :add sets the key with add-wins, last-writer-wins semantics
    DeltaCrdt.mutate(map, :add, [key, value])
  end

  def get(map, key) do
    DeltaCrdt.read(map)
    |> Map.get(key)
  end
end

CRDTs provide strong eventual consistency — all replicas converge to the same state without coordination. The trade-off: not all operations are expressible as CRDTs. You can't build a consistent counter that decrements arbitrarily; you need a PN-Counter that tracks increments and decrements separately. AWLWWMap (Add-Wins Last-Writer-Wins Map) works for key-value data where last write wins. ORSet (Observed-Remove Set) handles sets where concurrent adds and removes must both succeed. Choosing the wrong CRDT means your data converges to the wrong value. Silently.
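A PN-Counter is simple enough to sketch from scratch: two grow-only maps keyed by node, one counting increments and one counting decrements. Each node only ever grows its own entries, so concurrent updates never conflict:

```elixir
defmodule PNCounter do
  # Minimal PN-Counter sketch (not production code).
  defstruct p: %{}, n: %{}

  def new, do: %__MODULE__{}

  def increment(%__MODULE__{} = c, node \\ node()) do
    %{c | p: Map.update(c.p, node, 1, &(&1 + 1))}
  end

  def decrement(%__MODULE__{} = c, node \\ node()) do
    %{c | n: Map.update(c.n, node, 1, &(&1 + 1))}
  end

  def value(%__MODULE__{p: p, n: n}) do
    Enum.sum(Map.values(p)) - Enum.sum(Map.values(n))
  end

  # Merge takes the per-node maximum. It is commutative, associative,
  # and idempotent, so replicas converge regardless of delivery order.
  def merge(%__MODULE__{} = a, %__MODULE__{} = b) do
    %__MODULE__{
      p: Map.merge(a.p, b.p, fn _k, x, y -> max(x, y) end),
      n: Map.merge(a.n, b.n, fn _k, x, y -> max(x, y) end)
    }
  end
end
```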

Phoenix.Tracker is purpose-built for presence tracking:

defmodule MyApp.Presence do
  use Phoenix.Tracker

  def start_link(opts) do
    Phoenix.Tracker.start_link(__MODULE__, opts, opts)
  end

  def init(opts) do
    server = Keyword.fetch!(opts, :pubsub_server)
    {:ok, %{pubsub_server: server}}
  end

  def handle_diff(diff, state) do
    # React to presence changes
    {:ok, state}
  end

  def track(pid, topic, key, meta) do
    Phoenix.Tracker.track(__MODULE__, pid, topic, key, meta)
  end

  def list(topic) do
    Phoenix.Tracker.list(__MODULE__, topic)
  end
end

Phoenix.Tracker uses a CRDT internally; it's optimized for tracking which users are present in which channels. If your use case fits that model, use it. It's battle-tested in ways your custom implementation won't be for years.

When Not to Distribute

Distribution adds complexity that compounds over time. Quietly, at first. Then all at once when a partition hits at 3 AM.

Stateless nodes behind a load balancer sidestep the whole problem. If your nodes share nothing, partitions can't cause inconsistency; use PostgreSQL or Redis for shared state and let the database handle coordination. Sticky sessions offer a middle ground — route all requests from a user to the same node, let that node own the user's state locally. Failover means losing in-memory state, but you avoid distributed coordination entirely.

Need distributed locks? Redis or etcd. Leader election? PostgreSQL advisory locks. These systems have decades of production hardening. Your custom Elixir solution has whatever you built last sprint.
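Leader election with advisory locks fits in a few lines. A sketch using Postgrex directly; note that session-level advisory locks belong to the connection that acquired them, so this needs a dedicated connection rather than one borrowed from a pool:

```elixir
defmodule MyApp.Leader do
  # Hypothetical sketch: whoever holds the advisory lock is the leader.
  # pg_try_advisory_lock returns immediately: true if we got the lock,
  # false if another session already holds it.
  @lock_key 42  # any application-chosen bigint

  def try_become_leader(conn) do
    case Postgrex.query!(conn, "SELECT pg_try_advisory_lock($1)", [@lock_key]) do
      %Postgrex.Result{rows: [[true]]} -> :leader
      %Postgrex.Result{rows: [[false]]} -> :follower
    end
  end
end
```

The lock releases automatically when the connection dies, which is exactly the failover behavior you want: a crashed leader's lock frees up, and the next caller of try_become_leader/1 wins.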

Sometimes the business can tolerate brief inconsistency. Two nodes both processing the same job might result in a duplicate email. Is that worse than the complexity of distributed locking? Often not.

The honest heuristic: if you can solve the problem with a database transaction, do that. PostgreSQL's SKIP LOCKED, advisory locks, and serializable isolation levels cover most coordination problems without any node-to-node communication.
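The SKIP LOCKED pattern, sketched against a hypothetical jobs table through an Ecto repo named MyApp.Repo:

```elixir
defmodule MyApp.JobQueue do
  # Hypothetical sketch: atomically claim one pending job, skipping
  # rows other workers have already locked instead of blocking on them.
  # The single UPDATE ... RETURNING statement needs no explicit
  # transaction; the row lock and state change happen atomically.
  def claim do
    result =
      MyApp.Repo.query!("""
      UPDATE jobs SET state = 'running'
      WHERE id = (
        SELECT id FROM jobs
        WHERE state = 'pending'
        ORDER BY inserted_at
        FOR UPDATE SKIP LOCKED
        LIMIT 1
      )
      RETURNING id
      """)

    case result.rows do
      [[id]] -> {:ok, id}
      [] -> :empty
    end
  end
end
```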

When real-time coordination is genuinely required — live collaboration, multiplayer games, distributed rate limiting — distribution earns its complexity. But verify the requirement first. Many "real-time" features can tolerate 100 milliseconds of latency; that's enough time to round-trip to a database.

The Partition Playbook

When you do distribute, have a plan for partitions before they happen. Not a vague plan. A written-down, tested plan.

Instrument everything. Log node connections and disconnections; track message delivery failures; monitor process group membership changes. You cannot debug a partition after the fact without this data. I've tried. You end up staring at application logs wondering why half your processes stopped receiving messages at 14:32 and the other half kept running like nothing happened.

Design for graceful degradation. What does your system do when it can't reach a peer? Returning an error is acceptable. Hanging indefinitely is not — and :global will happily hang indefinitely if you let it.
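Bounding cross-node calls is mostly discipline. A small wrapper (a sketch) turns a hang into an error tuple the caller can degrade on:

```elixir
defmodule MyApp.SafeCall do
  # Sketch: GenServer.call exits on timeout or when the target is
  # down; catching the exit converts "hang, then crash the caller"
  # into a value the caller can handle.
  def call(server, msg, timeout \\ 5_000) do
    try do
      {:ok, GenServer.call(server, msg, timeout)}
    catch
      :exit, {:timeout, _} -> {:error, :timeout}
      :exit, {:noproc, _} -> {:error, :not_found}
      :exit, _reason -> {:error, :unavailable}
    end
  end
end
```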

Test partition behavior explicitly. Toxiproxy can simulate network failures; run chaos experiments in staging before production forces the lesson on you.

Document your consistency choices. Future you — or future maintainers — need to understand why you chose availability over consistency in that particular subsystem. The reasoning matters more than the decision; business constraints change, and without context the next person will second-guess every trade-off.

Distribution in Elixir is not difficult to implement. It's difficult to operate. The code that connects nodes is trivial; the wisdom to know when connection is the wrong choice takes longer to acquire.


What do you think?

Share your thoughts. You can tweet me at @allanmacgregor.