Docs gap: no self-hosted upgrade runbook (helm upgrade procedure, rollback, breaking changes) #7429

@Holmus

Problem

The K8s/OpenShift hosting page on docs.flagsmith.com contains a single "Key upgrade notes" entry (about v0.37.0 dropping bundled in-cluster Postgres data). Beyond that, there is no documented self-hosted upgrade procedure.

The current implicit guidance to operators is "run helm upgrade, migrations auto-run via Django, good luck." For production deployments (especially regulated ones running change-management processes), this is insufficient.

What's missing

  • Helm upgrade procedure - recommended command, version-pinning guidance, what happens to each deployment during the upgrade
  • Database migration behavior - how the migrateDb job runs, where to find migration logs, the fact that migrations are idempotent and safe to re-run, what to do when migrations silently fail
  • Pre-flight checks - before upgrading, verify (a) DB role has DDL privileges, (b) DB connection auth is stable for the duration of migrations, (c) chart values schema hasn't broken
  • Rollback procedure - does helm rollback work cleanly given the forward-only schema migrations? What's the operator's rollback story if a migration fails midway?
  • Downtime expectations - what's expected to be available during a rolling upgrade, and what isn't
  • Breaking-change communication - the chart's CHANGELOG.md is per-commit and doesn't flag breaking changes. Sub-component bumps (e.g. flagsmith-sse Redis key format change in v3 -> v4) deserve a prominent upgrade notes section.
  • Sub-component upgrade gotchas - flagsmith-sse, flagsmith-edge-proxy, flagsmith-task-processor (where applicable) version-skew rules
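
To make the runbook ask concrete, the pinned-upgrade and rollback items above might be sketched roughly like this (the release name, namespace, and chart reference are assumptions for illustration, not verified chart defaults):

```shell
# Sketch only: release name "flagsmith", namespace "flagsmith", and the
# flagsmith/flagsmith chart reference are assumptions, not chart defaults.

# Pre-flight: refresh the repo and confirm which chart version you are
# pinning to before touching the cluster.
helm repo update
helm search repo flagsmith/flagsmith --versions | head

# Pin the chart version explicitly rather than floating to latest.
helm upgrade flagsmith flagsmith/flagsmith \
  --namespace flagsmith \
  --version <chart-version> \
  --values values.yaml \
  --atomic --timeout 10m
```

Note that `--atomic` (or a later `helm rollback flagsmith <revision>`) restores the Kubernetes manifests only; it does not undo Django's forward-only schema migrations, which is exactly the rollback-story gap the docs would need to spell out.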

Known gotchas worth documenting

These have been observed in real upgrades and are worth surfacing publicly so operators can avoid or diagnose them:

  1. Silent migration failures - a partial migration can leave tables missing (e.g. environment-documents), causing endpoint-specific 500s. The diagnostic is to inspect the migrateDb job logs (or the startup logs of new API pods); the resolution is to re-run the migrations, which are idempotent.
  2. Short-lived DB token auth - when Postgres auth uses a sidecar that refreshes tokens periodically (e.g. Azure managed identity), long-running migrations can fail mid-run when the token expires.
  3. Postgres role DDL privileges - production roles with stricter grants than dev (e.g. no CREATE TABLE) cause migrations that pass in dev to fail in prod. python manage.py createcachetable is one command that needs CREATE TABLE.
  4. flagsmith-sse v3 -> v4 - Redis storage format changed. No data loss but in-flight subscribers will miss updates briefly during repopulation. Plan for low-traffic window.
  5. Frontend image rolling vulns - upgrading to fix one CVE can surface new dependency CVEs. Operators using container security tooling will see this; the doc should set that expectation.
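
Gotcha 1 could be diagnosed and resolved with commands along these lines (the job, deployment, and namespace names below are assumptions; substitute your release's actual resource names):

```shell
# Sketch only: resource names are assumptions, not chart defaults.

# Inspect the migration job's logs for a silent or partial failure:
kubectl -n flagsmith logs job/flagsmith-migrate-db

# If the job has already been cleaned up, the API pods' startup logs may
# also show the failing migration:
kubectl -n flagsmith logs deploy/flagsmith-api --tail=200

# Re-run the migrations from an API pod; Django migrations are
# idempotent, so a re-run is safe:
kubectl -n flagsmith exec deploy/flagsmith-api -- python manage.py migrate
```

This is the kind of fragment the proposed upgrade page could carry verbatim, so operators diagnose missing-table 500s without a support ticket.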

What good looks like

A new page under deployment-self-hosting/ titled "Upgrading self-hosted Flagsmith", covering the items above. Plus a structured "Breaking changes and upgrade notes" section in the chart README that's maintained per release rather than only via CHANGELOG.md.

Why now

This is a recurring source of customer support load that could be defused with docs. Filing to put the gap on the roadmap.
