Problem
The K8s/OpenShift hosting page on docs.flagsmith.com contains a single "Key upgrade notes" entry (about v0.37.0 dropping bundled in-cluster Postgres data). Beyond that, there is no documented self-hosted upgrade procedure.
The current implicit guidance to operators is "run helm upgrade, migrations auto-run via Django, good luck." For production deployments (especially regulated ones running change-management processes), this is insufficient.
What's missing
- Helm upgrade procedure - recommended command, version-pinning guidance, what happens to each deployment during the upgrade
- Database migration behavior - how the migrateDb job runs, where to find migration logs, the fact that migrations are idempotent and safe to re-run, what to do when migrations silently fail
- Pre-flight checks - before upgrading, verify (a) DB role has DDL privileges, (b) DB connection auth is stable for the duration of migrations, (c) chart values schema hasn't broken
- Rollback procedure - does helm rollback work cleanly given the forward-only schema migrations? What's the operator's rollback story if a migration fails midway?
- Downtime expectations - what's expected to be available during a rolling upgrade, and what isn't
- Breaking-change communication - the chart's CHANGELOG.md is per-commit and doesn't flag breaking changes. Sub-component bumps (e.g. flagsmith-sse Redis key format change in v3 -> v4) deserve a prominent upgrade notes section.
- Sub-component upgrade gotchas - flagsmith-sse, flagsmith-edge-proxy, flagsmith-task-processor (where applicable) version-skew rules
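To make the "Helm upgrade procedure" item concrete, here is a rough sketch of what a version-pinned upgrade command could look like. This is illustrative only, not the chart's documented procedure: the release name, namespace, chart reference, values file, timeout, and the 0.0.0 version are all placeholders, and build_upgrade_command is a hypothetical helper. The flags used (--version, --namespace, --values, --wait, --timeout) are standard helm upgrade flags.

```python
import shlex


def build_upgrade_command(release, chart, version, namespace, values_file, timeout="15m"):
    """Return the argv for a version-pinned `helm upgrade` (hypothetical helper)."""
    if not version:
        # Refuse an unpinned upgrade so the chart version is always explicit.
        raise ValueError("pin an explicit chart version instead of taking the latest")
    return [
        "helm", "upgrade", release, chart,
        "--version", version,      # pin the chart version
        "--namespace", namespace,
        "--values", values_file,
        "--wait",                  # wait for resources to become ready
        "--timeout", timeout,      # placeholder; size it to cover migrations
    ]


if __name__ == "__main__":
    # All argument values below are placeholders, not recommended settings.
    cmd = build_upgrade_command(
        "flagsmith", "flagsmith/flagsmith", "0.0.0", "flagsmith", "values.yaml"
    )
    print(shlex.join(cmd))
```

The docs page would ideally state the recommended flags and timeout outright, so operators don't have to guess them.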
Known gotchas worth documenting
These have been observed in real upgrades and are worth surfacing publicly so operators can avoid or diagnose them:
- Silent migration failures - a partial migration can leave tables missing (e.g. environment-documents), causing endpoint-specific 500s. Diagnostic is to inspect the migrateDb job logs (or new API pod startup logs). Resolution is to re-run migrations (idempotent).
- Short-lived DB token auth - when Postgres auth uses a sidecar that refreshes tokens periodically (e.g. Azure managed identity), long-running migrations can fail mid-run when the token expires.
- Postgres role DDL privileges - production roles with stricter grants than dev (e.g. no CREATE TABLE) cause migrations that pass in dev to fail in prod. python manage.py createcachetable is one command that needs CREATE TABLE.
- flagsmith-sse v3 -> v4 - Redis storage format changed. No data loss, but in-flight subscribers will briefly miss updates during repopulation. Plan the upgrade for a low-traffic window.
- Frontend image rolling vulns - upgrading to fix one CVE can surface new dependency CVEs. Operators using container security tooling will see this; the docs should set that expectation.
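The token-expiry and DDL-privilege gotchas above could be caught by a small pre-flight script along these lines. This is a minimal sketch under stated assumptions, not something Flagsmith ships: both function names are hypothetical, the 5-minute safety margin is arbitrary, the target schema is assumed to be public, the migration duration is the operator's own estimate, and the %s placeholder assumes a psycopg-style DB-API driver. has_schema_privilege is a built-in PostgreSQL function.

```python
from datetime import datetime, timedelta, timezone


def token_outlives_migration(expires_at, estimated_duration, now=None,
                             margin=timedelta(minutes=5)):
    """True if the DB auth token should stay valid for the whole migration run.

    `estimated_duration` is the operator's own estimate; `margin` is an
    arbitrary safety buffer (assumption).
    """
    now = now or datetime.now(timezone.utc)
    return expires_at - now >= estimated_duration + margin


def role_can_create_tables(cursor, schema="public"):
    """True if the connected role holds CREATE on `schema`.

    `cursor` is any DB-API cursor on the Flagsmith database; the `%s`
    placeholder style assumes a psycopg-like driver (assumption).
    """
    cursor.execute(
        "SELECT has_schema_privilege(current_user, %s, 'CREATE')", (schema,)
    )
    return bool(cursor.fetchone()[0])
```

If either check returns False, the upgrade would be postponed until the token is refreshed or the grants are fixed, rather than finding out mid-migration.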
What good looks like
A new page under deployment-self-hosting/ titled "Upgrading self-hosted Flagsmith", covering the items above. Plus a structured "Breaking changes and upgrade notes" section in the chart README that's maintained per release rather than only via CHANGELOG.md.
Why now
Recurring source of customer support load that could be defused with docs. Filing to put the gap on the roadmap.