Active/Active HA
bilbycast-manager supports an active/active HA topology: two manager instances run against the same Postgres 18 cluster and serve traffic concurrently. The licence-gated FEATURE_HA controls visibility of the cluster management surfaces (instances list, multi-region settings panel) — the cluster mechanics themselves (heartbeats, LISTEN/NOTIFY, advisory locks) run in every build, so operators can stand up a two-instance pair against shared Postgres for evaluation without a licence, just without the management UI.
Why it matters
Broadcast operations expect reliability you’d associate with a hardware control plane: a primary failure can’t take the manager offline for the time it takes to spin up a replacement. Active/active across two instances against shared Postgres gives you:
- Zero-downtime failover — both instances are live, both can answer REST + WS at any moment. DNS failover is the only operator action.
- Rolling deploys — drain one instance, swap the binary via systemd, rejoin. The other instance keeps serving the entire time.
- Geographic redundancy — instances in two regions, one Postgres cluster reachable from both.
- Honest observability — every Prometheus sample carries instance_id and region labels so dashboards can split metrics per node.
Architecture
            Browser / Edge nodes
                     │
                     ▼  wss://manager.example.com
               ┌────────────┐
               │    DNS     │  (Route 53 / Cloudflare / etc.)
               └─────┬──────┘
                     │
         ┌───────────┴───────────┐
         ▼                       ▼
    ┌──────────────┐      ┌──────────────┐
    │  Manager A   │      │  Manager B   │
    │  region=syd  │      │  region=mel  │
    └──────┬───────┘      └──────┬───────┘
           │                     │
           └──────────┬──────────┘
                      ▼
              ┌──────────────┐
              │ Postgres 18  │  (shared writer + read replicas if you have them)
              └──────────────┘

Both instances:
- Read and write to the same Postgres 18 cluster.
- Heartbeat every few seconds to manager_instances; rows whose heartbeat lapses past a 15 s window are reaped.
- Coordinate cross-instance work (node hub command fan-out, browser broadcast aggregation, license / cache invalidation, session cache invalidation) via Postgres LISTEN/NOTIFY (small payloads inline; large payloads spill via cross_instance_rpc, then the NOTIFY carries the row id).
- Coordinate one-at-a-time operations (DR backup, master-key rotation) via Postgres advisory locks (pg_try_advisory_lock(42) for backup; per-resource locks for routine schedule fires).
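The heartbeat reaping rule in the list above can be sketched as a small in-memory model. This is only an illustration: the real manager works against the manager_instances table in Postgres, and the `InstanceRow` struct, its field names, and the `is_stale` / `reap` helpers are hypothetical. The sketch also assumes a row is reaped only once its age strictly exceeds 15 s.

```rust
use std::time::Duration;

/// Reap window from the text: heartbeats older than this are considered lapsed.
const REAP_WINDOW: Duration = Duration::from_secs(15);

/// Hypothetical in-memory stand-in for one manager_instances row.
#[derive(Debug, Clone, PartialEq)]
struct InstanceRow {
    instance_id: String,
    /// Time of the last heartbeat, in seconds since an arbitrary epoch.
    last_heartbeat_secs: u64,
}

/// Returns true when the row's heartbeat age exceeds the reap window.
/// Assumption: an age of exactly 15 s is still considered live.
fn is_stale(row: &InstanceRow, now_secs: u64) -> bool {
    now_secs.saturating_sub(row.last_heartbeat_secs) > REAP_WINDOW.as_secs()
}

/// Removes stale rows in place and returns the ids that were reaped.
fn reap(rows: &mut Vec<InstanceRow>, now_secs: u64) -> Vec<String> {
    let (stale, live): (Vec<_>, Vec<_>) =
        rows.drain(..).partition(|row| is_stale(row, now_secs));
    *rows = live;
    stale.into_iter().map(|row| row.instance_id).collect()
}
```

With rows last seen at t=100 and t=80, calling `reap` at t=110 would drop only the second row, since its heartbeat is 30 s old.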
Edge nodes connect to whichever instance DNS sends them to. If the connected instance dies, the edge reconnects (via the same DNS name) and lands on the surviving instance — the surviving instance reads the edge’s stored secret from Postgres and authenticates the reconnect transparently.
Instance identity
Each manager process resolves a stable UUID via:
1. BILBYCAST_INSTANCE_ID env var, if set.
2. <data_dir>/instance_id (auto-generated 0600 file on first boot).
3. Fresh v4 UUID (last resort).
BILBYCAST_REGION (free-form string) tags the instance for cross-region observability. Both values land on the manager_instances row + every Prometheus sample.
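The instance-id fallback order described in this section can be sketched as a pure function. This is a minimal illustration, not the manager's code: `resolve_instance_id`, `IdSource`, and the parameter names are hypothetical, the UUID generator is passed in as a closure so the sketch needs no external crate, and treating blank values as unset is an assumption.

```rust
/// Where the resolved instance id came from.
#[derive(Debug, PartialEq)]
enum IdSource {
    EnvVar,
    DataDirFile,
    Generated,
}

/// Trims a candidate value and treats empty strings as unset (an assumption).
fn non_empty(value: Option<&str>) -> Option<String> {
    value
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
}

/// Resolves the instance id in the order the text gives.
///
/// `env_value` stands in for BILBYCAST_INSTANCE_ID, `file_value` for the
/// contents of <data_dir>/instance_id, and `generate` for a v4 UUID
/// generator. All three names are hypothetical.
fn resolve_instance_id(
    env_value: Option<&str>,
    file_value: Option<&str>,
    generate: impl FnOnce() -> String,
) -> (String, IdSource) {
    if let Some(id) = non_empty(env_value) {
        return (id, IdSource::EnvVar);
    }
    if let Some(id) = non_empty(file_value) {
        return (id, IdSource::DataDirFile);
    }
    // Last resort: nothing configured and nothing persisted.
    (generate(), IdSource::Generated)
}
```

Passing the environment value and file contents in as arguments keeps the precedence logic testable without touching the process environment or the filesystem.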
HA lifecycle CLI
Three subcommands handle the lifecycle:
| Command | What it does |
|---|---|
| bilbycast-manager promote | Flips this instance’s manager_instances.role to primary and demotes peers atomically. Used after failover to mark which instance is now the writer-of-record. |
| bilbycast-manager rejoin | Re-registers a row that the heartbeat reaper dropped during an outage. Used when restarting an instance after a long downtime. |
| bilbycast-manager upgrade [--drain-secs N] | Writes the BILBYCAST_DRAIN signal file the running serve watches. The instance drains in-flight WS connections and exits 0 so systemd can swap the binary. The peer keeps serving the whole time. |
The runbook for DNS failover, region promotion, and rolling upgrades lives in DNS_FAILOVER.md.
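The effect of promote on the role column can be modelled in memory as follows. This is a sketch under assumptions: the `Instance` struct, the `Role` enum, and the `Standby` name for demoted peers are hypothetical, and the real command applies the change atomically to manager_instances rows rather than to a slice.

```rust
/// Hypothetical role values; only "primary" is named in the text.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    Primary,
    Standby,
}

/// Hypothetical in-memory stand-in for a manager_instances row.
#[derive(Debug, Clone, PartialEq)]
struct Instance {
    id: String,
    role: Role,
}

/// Marks `target_id` as primary and every other instance as standby.
/// Returns false and leaves all roles untouched if the target is unknown,
/// so a typo cannot leave the cluster without a primary.
fn promote(instances: &mut [Instance], target_id: &str) -> bool {
    if !instances.iter().any(|inst| inst.id == target_id) {
        return false;
    }
    for inst in instances.iter_mut() {
        inst.role = if inst.id == target_id {
            Role::Primary
        } else {
            Role::Standby
        };
    }
    true
}
```

The existence check runs before any role is changed, which mirrors the all-or-nothing behaviour the table describes.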
Provisioning
    bilbycast-manager init --mode ha-primary
    bilbycast-manager init --mode ha-standby --master-key <paste from primary> --jwt-secret <paste from primary> --database-url postgres://…shared cluster…

init generates secrets + self-signed TLS + <confdir>/manager.env + a systemd unit stub. It deliberately does not run systemctl, vendor Postgres, or apply migrations; those happen on first serve. The HA-standby invocation reuses the primary’s master key + JWT secret so both instances can decrypt the same shared ciphertexts.
Cross-instance pubsub
manager-core/src/pubsub.rs wraps Postgres LISTEN/NOTIFY behind a typed envelope. Used for:
- Node-hub command fan-out. A browser connected to instance A can address a command to a node connected to instance B. Instance A publishes; instance B’s hub picks it up and dispatches.
- Browser-broadcast aggregation. Stats, health, and event broadcasts to dashboards combine across instances so any browser sees every node’s data regardless of which instance it’s connected to.
- License / cache invalidation. A license change on instance A invalidates the moka caches on instance B in milliseconds.
- Session cache invalidation. Membership / share / ownership mutations broadcast SESSION_INVALIDATE so peer instances drop cached entries the moment the change commits.
NOTIFY payloads under 8000 bytes ship inline; larger payloads spill to a cross_instance_rpc row and the NOTIFY ferries the row id (consume + DELETE on read).
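The inline-versus-spill decision can be sketched like this. It is an illustration only: the `Envelope` enum, `build_envelope`, and the `store_row` callback (which would insert into cross_instance_rpc and return the new row id) are hypothetical names, not the types in pubsub.rs.

```rust
/// Threshold from the text: payloads under this many bytes ship inline.
const INLINE_LIMIT_BYTES: usize = 8000;

/// Hypothetical shape of what travels in the NOTIFY.
#[derive(Debug, PartialEq)]
enum Envelope {
    /// The payload itself rides in the NOTIFY.
    Inline(Vec<u8>),
    /// The payload was written to a row; only its id rides in the NOTIFY.
    Spilled { row_id: i64 },
}

/// Chooses between inline delivery and spilling to a row.
///
/// `store_row` is a hypothetical callback standing in for the INSERT into
/// cross_instance_rpc; it is only invoked for payloads at or over the limit.
fn build_envelope(payload: Vec<u8>, store_row: impl FnOnce(Vec<u8>) -> i64) -> Envelope {
    if payload.len() < INLINE_LIMIT_BYTES {
        Envelope::Inline(payload)
    } else {
        Envelope::Spilled {
            row_id: store_row(payload),
        }
    }
}
```

On the receiving side, a `Spilled` envelope would be resolved by reading the row and deleting it, matching the consume + DELETE behaviour described above.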
Multi-region observability
Every Prometheus sample carries instance_id and region labels so dashboards split per node. Cross-region RPC samples land in the bilbycast_region_latency_ms histogram so you can chart how long inter-instance commands take.
The node_connections table records which instance + region each node is currently live on; the settings UI renders a per-region / per-instance breakdown. Operations teams use this to balance load across regions and to spot drift (e.g. all nodes have re-homed onto one instance after a partial outage).
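The per-region / per-instance breakdown and the drift check mentioned above could be computed along these lines. This is a sketch with assumed names: the `NodeConnection` struct and its fields, `breakdown`, and `all_on_one_instance` are hypothetical, and the real table likely carries more columns (such as a node id).

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one node_connections row (node id omitted).
struct NodeConnection {
    instance_id: String,
    region: String,
}

/// Counts live nodes per (region, instance_id) pair.
fn breakdown(rows: &[NodeConnection]) -> BTreeMap<(String, String), usize> {
    let mut counts = BTreeMap::new();
    for row in rows {
        *counts
            .entry((row.region.clone(), row.instance_id.clone()))
            .or_insert(0) += 1;
    }
    counts
}

/// Flags the drift case from the text: several instances are registered,
/// but every connected node currently sits on a single one.
fn all_on_one_instance(rows: &[NodeConnection], registered_instances: usize) -> bool {
    registered_instances > 1 && !rows.is_empty() && breakdown(rows).len() == 1
}
```

A `BTreeMap` keeps the output sorted by region and then instance, which is convenient for rendering a stable table.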
Backup safety
The DR-grade bilbycast-manager backup CLI takes pg_try_advisory_lock(42) so two simultaneous invocations across the HA pair cannot race on pg_dump from the single Postgres writer. The first wins, the second backs off cleanly. See Encrypted Backup & Restore for the full backup model.
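The "first wins, second backs off" behaviour follows from how a non-blocking advisory lock works. The toy model below imitates session-level advisory lock semantics in memory. It is not the backup CLI's code: `AdvisoryLocks` and the numeric session ids are hypothetical, and real locks live inside Postgres.

```rust
use std::collections::HashMap;

/// Advisory lock key the text names for backups.
const BACKUP_LOCK_KEY: i64 = 42;

/// Toy model of session-level advisory locks: key -> (holder session, hold count).
#[derive(Default)]
struct AdvisoryLocks {
    held: HashMap<i64, (u32, u32)>,
}

impl AdvisoryLocks {
    /// Imitates pg_try_advisory_lock: never waits, reports whether the lock
    /// was obtained. Repeat requests by the holder succeed and stack.
    fn try_lock(&mut self, session: u32, key: i64) -> bool {
        let entry = self.held.entry(key).or_insert((session, 0));
        if entry.0 != session {
            return false; // held by another session: back off
        }
        entry.1 += 1;
        true
    }

    /// Imitates pg_advisory_unlock: releases one hold and returns false if
    /// this session does not hold the key.
    fn unlock(&mut self, session: u32, key: i64) -> bool {
        let remaining = match self.held.get_mut(&key) {
            Some(entry) if entry.0 == session => {
                entry.1 -= 1;
                entry.1
            }
            _ => return false,
        };
        if remaining == 0 {
            self.held.remove(&key);
        }
        true
    }
}
```

In this model, if session 1 obtains `BACKUP_LOCK_KEY`, a concurrent `try_lock` from session 2 returns false immediately instead of queueing, which is the clean back-off the text describes.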
Master-key rotation in HA
Rotation runs in a single Postgres transaction that touches every _enc row. Stop both instances before running rotate-master-key: leaving one alive will deadlock against the rotation transaction. A single CLI invocation against the shared Postgres covers the whole cluster; you do not need to run it on each host. See master-key-rotation.md for the full runbook.
Licence model
FEATURE_HA gates only the management surfaces:
- GET /api/v1/instances: the instances-list endpoint.
- The multi-region panel in the settings UI.
The cluster mechanics — heartbeats, LISTEN/NOTIFY, advisory locks, cross-instance RPC — run in every build. Operators without FEATURE_HA can still stand up two instances against shared Postgres for evaluation; they just can’t observe the topology in the UI.
Reference
- Operator runbook: DNS_FAILOVER.md.
- Provisioning walkthrough: installer/README.md in the bilbycast-manager repo.
- API reference: API.md (“Cluster (HA)” + “Metrics”).
- Backup safety: Encrypted Backup & Restore.