Problem
There is no documentation on docs.flagsmith.com or in any Flagsmith GitHub repo covering high availability, disaster recovery, or failover for self-hosted Flagsmith. Searches across the docs site, the flagsmith-docs repo, the flagsmith-charts repo, and the main flagsmith repo for terms like "disaster recovery", "failover", "RTO", "RPO", and "high availability" return zero results.
This gap is a recurring blocker in security reviews for regulated self-hosted customers (financial services, healthcare, public sector). The defensible position today is "Flagsmith is stateless, your DR plan is your Postgres DR plan", but customers reasonably expect us to publish that position rather than have it surfaced ad-hoc on calls.
What's missing
The docs are silent on:
- HA topology recommendations (multi-AZ, multi-region)
- Recommended replica counts and PodDisruptionBudgets for production
- Postgres HA/replication guidance (managed services vs in-cluster operators)
- Backup and restore procedures
- RTO/RPO targets we recommend customers design for
- Failover testing approach
- Multi-region active/passive or active/active topology (or an explicit statement that we don't support active/active and why)
What good looks like
A new page under deployment-self-hosting/ titled "High availability and disaster recovery", covering at minimum:
- Stateless tiers - api, frontend, and task-processor scale horizontally. Document recommended minimum replicas, PodDisruptionBudget guidance, and deployment strategy settings.
- Stateful tier (Postgres) - explicit statement that DR is delegated to whichever Postgres deployment the customer runs, with concrete recommendations for common managed offerings (RDS Multi-AZ, Cloud SQL HA, Azure Flexible Server) and for self-managed options such as CloudNativePG, Crunchy, or Patroni.
- Backup and restore - what data lives where, what to snapshot, how to restore into a fresh Flagsmith deployment.
- Reference RTO/RPO - what's achievable with each Postgres topology.
- Multi-region story - explicit statement of what we support and what the topology would look like (or what we explicitly don't support).
- Failover runbook - step-by-step for a primary-region outage.
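As a rough illustration of the kind of replica and PodDisruptionBudget guidance the stateless-tier section could carry, here is a minimal Python sketch that checks whether a tier's replica count and PDB `minAvailable` value leave room for voluntary evictions such as node drains. The `Tier` class, the function names, and every number below are hypothetical and for illustration only; they are not Flagsmith recommendations or chart defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One stateless tier and its PodDisruptionBudget setting (hypothetical model)."""
    name: str
    replicas: int
    min_available: int  # value of the PDB's minAvailable field


def allowed_disruptions(tier: Tier) -> int:
    """Pods that can be evicted voluntarily when every replica is healthy."""
    if tier.replicas < 0 or tier.min_available < 0:
        raise ValueError("replicas and min_available must be non-negative")
    return max(tier.replicas - tier.min_available, 0)


def review(tiers: list[Tier]) -> list[str]:
    """Return one warning per tier whose settings block drains or permit a full outage."""
    warnings = []
    for tier in tiers:
        if allowed_disruptions(tier) == 0:
            warnings.append(f"{tier.name}: PDB allows no evictions, node drains will block")
        elif tier.min_available == 0:
            warnings.append(f"{tier.name}: PDB allows every pod to be evicted at once")
    return warnings


# Assumed values for illustration only, not Flagsmith recommendations.
EXAMPLE_TIERS = [
    Tier("api", replicas=3, min_available=2),
    Tier("frontend", replicas=1, min_available=1),
    Tier("task-processor", replicas=2, min_available=0),
]

if __name__ == "__main__":
    for line in review(EXAMPLE_TIERS):
        print(line)
```

The point of the sketch is that a single-replica tier with `minAvailable: 1` blocks node drains, while `minAvailable: 0` offers no protection at all. That trade-off is what the published guidance would need to spell out for each tier.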
Why now
This comes up in virtually every security review with a regulated customer. Filing this to put the gap on the roadmap, not as a P1.