Docs gap: no HA / DR / failover reference architecture for self-hosted Flagsmith

## Problem

There is no documentation on docs.flagsmith.com or in any Flagsmith GitHub repo covering high availability, disaster recovery, or failover for self-hosted Flagsmith. Searches across the docs site, the `flagsmith-docs` repo, the `flagsmith-charts` repo, and the main `flagsmith` repo for terms like "disaster recovery", "failover", "RTO", "RPO", and "high availability" return zero results.

This gap is a recurring blocker in security reviews for regulated self-hosted customers (financial services, healthcare, public sector). The defensible position today is "Flagsmith is stateless, your DR plan is your Postgres DR plan", but customers reasonably expect us to publish that position rather than have it surface ad hoc on calls.

## What's missing

The docs are silent on:

- HA topology recommendations (multi-AZ, multi-region)
- Recommended replica counts and PodDisruptionBudgets for production
- Postgres HA/replication guidance (managed services vs in-cluster operators)
- Backup and restore procedures
- RTO/RPO targets we recommend customers design for
- Failover testing approach
- Multi-region active/passive or active/active topology (or an explicit statement that we don't support active/active and why)
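To illustrate the kind of replica-count and PodDisruptionBudget guidance the docs could carry, here is a minimal sketch that generates one PDB manifest per stateless tier. The tier names come from this issue. The `flagsmith-` name prefix, the `app.kubernetes.io/component` label key, and the default of `minAvailable: 1` are assumptions for illustration, not values taken from `flagsmith-charts`.

```python
import json

# Tier names as listed in this issue. The label key and name prefix below
# are hypothetical; the real chart's labels may differ.
STATELESS_TIERS = ("api", "frontend", "task-processor")


def pdb_manifest(tier: str, min_available: int = 1) -> dict:
    """Build a policy/v1 PodDisruptionBudget for one stateless tier."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"flagsmith-{tier}"},  # hypothetical naming
        "spec": {
            "minAvailable": min_available,
            "selector": {
                # Assumed label; must match the labels on the tier's pods.
                "matchLabels": {"app.kubernetes.io/component": tier}
            },
        },
    }


def min_replicas_for(min_available: int) -> int:
    """Smallest replica count that still permits one voluntary eviction.

    If replicas == minAvailable, the PDB blocks every voluntary eviction,
    so node drains would stall.
    """
    return min_available + 1


if __name__ == "__main__":
    for tier in STATELESS_TIERS:
        # kubectl accepts JSON manifests as well as YAML.
        print(json.dumps(pdb_manifest(tier), indent=2))
```

The point of `min_replicas_for` is that a PDB only helps when the replica count exceeds `minAvailable`, so replica and PDB recommendations should be published together.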

## What good looks like

A new page under `deployment-self-hosting/` titled "High availability and disaster recovery", covering at minimum:

1. **Stateless tiers** - api / frontend / task-processor scale horizontally, recommended minimum replicas, PodDisruptionBudget guidance, deployment strategy settings.
2. **Stateful tier (Postgres)** - explicit statement that DR is delegated to the operator's Postgres choice, with concrete recommendations for common managed offerings (RDS Multi-AZ, Cloud SQL HA, Azure Database for PostgreSQL Flexible Server, plus operator-installed options like CloudNativePG / Crunchy / Patroni).
3. **Backup and restore** - what data lives where, what to snapshot, how to restore into a fresh Flagsmith deployment.
4. **Reference RTO/RPO** - what's achievable with each Postgres topology.
5. **Multi-region story** - explicit statement of what we support and what the topology would look like (or what we explicitly don't support).
6. **Failover runbook** - step-by-step for a primary-region outage.
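As a rough sketch of how the "Reference RTO/RPO" section could reason about each Postgres topology, the helper below derives a worst-case RPO from the way data reaches the standby or backup. The `PostgresTopology` type, its field names, and the example numbers are all hypothetical; real figures would need to come from the chosen provider and from measurement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PostgresTopology:
    """Hypothetical description of how a Postgres deployment protects data."""
    name: str
    synchronous_replica: bool = False
    async_replication_lag_s: Optional[float] = None  # observed worst-case lag
    snapshot_interval_s: Optional[float] = None      # time between backups


def worst_case_rpo_seconds(t: PostgresTopology) -> float:
    """Upper bound on data loss, in seconds, for a primary failure."""
    if t.synchronous_replica:
        # Commits are acknowledged only after the replica has them.
        return 0.0
    if t.async_replication_lag_s is not None:
        # Anything not yet shipped to the replica can be lost.
        return t.async_replication_lag_s
    if t.snapshot_interval_s is not None:
        # Everything written since the last snapshot can be lost.
        return t.snapshot_interval_s
    raise ValueError(f"{t.name}: no replica or backup configured")


if __name__ == "__main__":
    # Illustrative numbers only, not measured or vendor-published values.
    examples = [
        PostgresTopology("sync standby", synchronous_replica=True),
        PostgresTopology("async cross-region replica", async_replication_lag_s=30),
        PostgresTopology("nightly snapshots only", snapshot_interval_s=24 * 3600),
    ]
    for t in examples:
        print(f"{t.name}: worst-case RPO {worst_case_rpo_seconds(t):.0f}s")
```

RTO is deliberately left out of the sketch: it depends on failover automation, DNS or connection-string changes, and how quickly the stateless tiers reconnect, which is what the failover runbook would need to pin down.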

## Why now

This comes up in virtually every security review with a regulated customer. Filing this to put the gap on the roadmap, not as a P1.

Docs gap: no HA / DR / failover reference architecture for self-hosted Flagsmith #7428
