Docs gap: no self-hosted upgrade runbook (helm upgrade procedure, rollback, breaking changes) #7429

@Holmus

Problem

The K8s/OpenShift hosting page on docs.flagsmith.com contains a single "Key upgrade notes" entry (about v0.37.0 dropping bundled in-cluster Postgres data). Beyond that, there is no documented self-hosted upgrade procedure.

The current implicit guidance to operators is "run helm upgrade, migrations auto-run via Django, good luck." For production deployments (especially regulated ones running change-management processes), this is insufficient.

What's missing

  • Helm upgrade procedure - recommended command, version-pinning guidance, what happens to each deployment during the upgrade
  • Database migration behavior - how the migrateDb job runs, where to find migration logs, the fact that migrations are idempotent and safe to re-run, what to do when migrations silently fail
  • Pre-flight checks - before upgrading, verify (a) DB role has DDL privileges, (b) DB connection auth is stable for the duration of migrations, (c) chart values schema hasn't broken
  • Rollback procedure - does helm rollback work cleanly given the forward-only schema migrations? What's the operator's rollback story if a migration fails midway?
  • Downtime expectations - what's expected to be available during a rolling upgrade, and what isn't
  • Breaking-change communication - the chart's CHANGELOG.md is per-commit and doesn't flag breaking changes. Sub-component bumps (e.g. flagsmith-sse Redis key format change in v3 -> v4) deserve a prominent upgrade notes section.
  • Sub-component upgrade gotchas - flagsmith-sse, flagsmith-edge-proxy, flagsmith-task-processor (where applicable) version-skew rules
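
To make the runbook ask concrete, the pinned-upgrade and rollback items above might be sketched roughly like this (the release name, namespace, and chart reference are assumptions for illustration, not verified chart defaults):

```shell
# Sketch only: release name "flagsmith", namespace "flagsmith", and the
# flagsmith/flagsmith chart reference are assumptions, not chart defaults.

# Pre-flight: refresh the repo and confirm which chart version you are
# pinning to before touching the cluster.
helm repo update
helm search repo flagsmith/flagsmith --versions | head

# Pin the chart version explicitly rather than floating to latest.
helm upgrade flagsmith flagsmith/flagsmith \
  --namespace flagsmith \
  --version <chart-version> \
  --values values.yaml \
  --atomic --timeout 10m
```

Note that `--atomic` (or a later `helm rollback flagsmith <revision>`) restores the Kubernetes manifests only; it does not undo Django's forward-only schema migrations, which is exactly the rollback-story gap the docs would need to spell out.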

Known gotchas worth documenting

These have been observed in real upgrades and are worth surfacing publicly so operators can avoid or diagnose them:

  1. Silent migration failures - a partial migration can leave tables missing (e.g. environment-documents), causing endpoint-specific 500s. The diagnostic is to inspect the migrateDb job logs (or the startup logs of new API pods); the resolution is to re-run the migrations, which are idempotent.
  2. Short-lived DB token auth - when Postgres auth uses a sidecar that refreshes tokens periodically (e.g. Azure managed identity), long-running migrations can fail mid-run when the token expires.
  3. Postgres role DDL privileges - production roles with stricter grants than dev (e.g. no CREATE TABLE) cause migrations that pass in dev to fail in prod. python manage.py createcachetable is one command that needs CREATE TABLE.
  4. flagsmith-sse v3 -> v4 - Redis storage format changed. No data loss but in-flight subscribers will miss updates briefly during repopulation. Plan for low-traffic window.
  5. Frontend image rolling vulns - upgrading to fix one CVE can surface new dependency CVEs. Operators using container security tooling will see this; the doc should set that expectation.
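
Gotcha 1 could be diagnosed and resolved with commands along these lines (the job, deployment, and namespace names below are assumptions; substitute your release's actual resource names):

```shell
# Sketch only: resource names are assumptions, not chart defaults.

# Inspect the migration job's logs for a silent or partial failure:
kubectl -n flagsmith logs job/flagsmith-migrate-db

# If the job has already been cleaned up, the API pods' startup logs may
# also show the failing migration:
kubectl -n flagsmith logs deploy/flagsmith-api --tail=200

# Re-run the migrations from an API pod; Django migrations are
# idempotent, so a re-run is safe:
kubectl -n flagsmith exec deploy/flagsmith-api -- python manage.py migrate
```

This is the kind of fragment the proposed upgrade page could carry verbatim, so operators diagnose missing-table 500s without a support ticket.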

What good looks like

A new page under deployment-self-hosting/ titled "Upgrading self-hosted Flagsmith", covering the items above. Plus a structured "Breaking changes and upgrade notes" section in the chart README that's maintained per release rather than only via CHANGELOG.md.

Why now

This is a recurring source of customer support load that could be defused with docs. Filing to put the gap on the roadmap.
