Summary
Generic/custom (staged) graphs currently rebuild in place: the live graph is deleted and recreated under the same id, leaving a window of seconds to minutes in which reads hit an empty or half-built graph. Blue-green materialization (build a {graph_id}-wip alongside the live graph, then swap atomically) already ships for entity graphs, and every LadybugDB primitive it relies on is graph-type-agnostic. This spec extends blue-green to the source="staged" path so generic graphs get the same zero-downtime rebuild. The rename swap is the only mutation to the live graph, so a failed rebuild leaves it fully intact.
Status
Draft
Problem Statement: Current State
- materialize_cmd routes by graph type (operations/graph/commands/materialize.py:230-233): entity → "extensions", everything else → "staged".
- The staged rebuild (direct_materialization.py:150-189) is destructive in-place: delete_database(graph_id, preserve_duckdb=True) → create_database(graph_id=graph_id, ...) → materialize staged files back into the same id.
- Between the delete and the end of materialization the graph is "rebuilding": reads hit an empty or half-built graph.
- Entity graphs avoid this entirely: _materialize_blue_green builds {graph_id}-wip next to the live graph and swaps atomically ("downtime is only the milliseconds of the file rename swap").
- A consumer with a generic custom-schema graph that rebuilds nightly or on feed upload can only mask the window by running multiple replicas behind a load balancer, and even that has a hole (a replica stays "serving" through its own in-place rebuild). Single-replica graphs can't be masked at all.
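To make the window concrete, here is a toy simulation rather than the real rebuild code. Each rebuild is modeled as a generator that pauses after every step so a reader can probe the live graph. The in-memory dict of databases, the row lists, and all function names are illustrative assumptions.

```python
def in_place_rebuild(databases, graph_id, rows):
    """Model of today's staged rebuild: delete, recreate, refill the same id."""
    databases.pop(graph_id, None)        # stands in for delete_database(graph_id, ...)
    yield
    databases[graph_id] = []             # stands in for create_database(graph_id=graph_id, ...)
    yield
    for row in rows:                     # materialize staged data back into graph_id
        databases[graph_id].append(row)
        yield


def blue_green_rebuild(databases, graph_id, rows):
    """Model of the blue-green rebuild: fill a -wip copy, then swap it in."""
    wip_id = f"{graph_id}-wip"
    databases[wip_id] = []
    yield
    for row in rows:
        databases[wip_id].append(row)
        yield
    databases[graph_id] = databases.pop(wip_id)   # the only mutation of the live id
    yield


def count_bad_reads(rebuild, rows):
    """Probe the live graph after every rebuild step; count missing or partial reads."""
    databases = {"catalog": list(rows)}  # live graph holds the previous generation
    bad = 0
    for _ in rebuild(databases, "catalog", rows):
        live = databases.get("catalog")
        if live is None or len(live) < len(rows):
            bad += 1
    return bad
```

In this model every probe during the in-place rebuild except the last sees a missing or partial graph, while the blue-green variant never exposes one, because the live id is only touched by the final swap.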
Problem Statement: Desired State
Generic/staged graphs rebuild blue-green: a {graph_id}-wip is built from the live graph's staged DuckDB, then promoted via an atomic swap. Zero failed/empty reads through a rebuild (including single-replica graphs); the live graph is never mutated except by the rename swap; a failed rebuild discards the WIP and leaves the active graph intact.
Problem Statement: Why Now?
Surfaced from a generic custom-schema graph (a nightly-rebuilt product catalog) whose only availability mitigation today is extra replicas + a load balancer — which still can't cover a replica's own in-place rebuild or a single-replica graph. The fix is wiring over already-shipped, battle-tested primitives, so cost/risk is low and it removes a structural downtime window.
Proposed Solution: Approach
This is wiring, not new infrastructure — every blue-green primitive already exists and is graph-type-agnostic:
- Swap endpoint: POST /databases/{graph_id}/swap promotes {graph_id}-wip → active, deletes the old active, atomic rename (swap.py). One-way by design.
- WIP reads source DuckDB (tables/materialize.py + models/tables.py): a materialize call can pass source_graph_id so the WIP LadybugDB ingests from the original graph's staged DuckDB without re-staging.
- Lock: MaterializationLock resolves {id}-wip/{id}-prev → base {id}, so WIP build + swap share one per-graph lock (a concurrent rebuild gets a 409).
- Listing hygiene: manager.py already filters -wip/-prev out of database listings.
- Client: client.swap_database(graph_id, lock_token=…).
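The lock-resolution and listing-hygiene behavior described above can be sketched as two small helpers. This is a minimal illustration of the suffix convention only. The function names are hypothetical and are not taken from materialization_lock.py or manager.py.

```python
WIP_SUFFIX = "-wip"
PREV_SUFFIX = "-prev"


def base_graph_id(graph_id: str) -> str:
    """Map a -wip or -prev id back to the base graph id that owns the lock."""
    for suffix in (WIP_SUFFIX, PREV_SUFFIX):
        if graph_id.endswith(suffix):
            return graph_id[: -len(suffix)]
    return graph_id


def visible_databases(names: list) -> list:
    """Hide internal -wip/-prev databases from a listing."""
    return [name for name in names if not name.endswith((WIP_SUFFIX, PREV_SUFFIX))]
```

Because both the WIP build and the swap resolve to the same base id, they contend for one lock, which is what makes a second concurrent rebuild fail fast instead of racing the first.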
The one real code gap: the staged helper materialize_table_chunked (chunked_materialization.py:26) does not accept source_graph_id — it materializes graph_id's DuckDB into graph_id. Thread source_graph_id through it → client.materialize_table (the graph_api endpoint already honors it).
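A sketch of what the pass-through might look like, assuming the helper simply forwards the new argument. Chunking is omitted, and RecordingClient and the table_name parameter are hypothetical stand-ins. Only source_graph_id and the two function names come from this spec.

```python
from typing import Any, Optional


class RecordingClient:
    """Hypothetical stand-in for the graph_api client; it only records calls."""

    def __init__(self) -> None:
        self.calls = []

    def materialize_table(self, graph_id: str, table_name: str,
                          source_graph_id: Optional[str] = None) -> None:
        self.calls.append((graph_id, table_name, source_graph_id))


def materialize_table_chunked(client: Any, graph_id: str, table_name: str,
                              source_graph_id: Optional[str] = None) -> None:
    """Sketch of the helper with the new pass-through (chunking left out)."""
    client.materialize_table(
        graph_id=graph_id,
        table_name=table_name,
        source_graph_id=source_graph_id,  # None keeps today's same-id behavior
    )


client = RecordingClient()
materialize_table_chunked(client, "catalog", "Product")
materialize_table_chunked(client, "catalog-wip", "Product", source_graph_id="catalog")
```

Defaulting the parameter to None means existing callers are unaffected; only the blue-green branch needs to pass it.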
Add a blue-green branch to the staged rebuild mirroring _materialize_blue_green minus its Postgres-staging step (staged data is already in the graph's DuckDB):
- Acquire the per-graph MaterializationLock.
- Clean up any leftover {graph_id}-wip from a prior failed run.
- create_database(wip_id, schema_type=custom, custom_schema_ddl=<active schema>).
- For each table with staged data, materialize_table_chunked(client, graph_id=wip_id, …, source_graph_id=graph_id): build the WIP from the live graph's staged DuckDB.
- On success → client.swap_database(graph_id, lock_token=…) (atomic; old active deleted/-prev).
- On any error → discard the WIP (delete_database(wip_id, preserve_duckdb=True)), active graph untouched, re-raise.
rebuild=False (first load into a brand-new graph) stays in-place — no active graph to protect — exactly as the entity path decides.
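The steps above can be sketched end to end against an in-memory fake. This is an illustration of the control flow under stated assumptions, not the actual branch. FakeGraphClient, materialization_lock, rebuild_blue_green, and the simplified method signatures are all hypothetical.

```python
from contextlib import contextmanager


class FakeGraphClient:
    """In-memory stand-in for the graph_api client (hypothetical)."""

    def __init__(self):
        self.databases = {}  # database id -> {table name: row count}
        self.staged = {}     # graph id -> {table name: staged row count}

    def create_database(self, graph_id, schema_ddl):
        self.databases[graph_id] = {}

    def delete_database(self, graph_id, preserve_duckdb=True):
        self.databases.pop(graph_id, None)

    def materialize_table(self, graph_id, table_name, source_graph_id=None):
        source = source_graph_id or graph_id
        self.databases[graph_id][table_name] = self.staged[source][table_name]

    def swap_database(self, graph_id, lock_token):
        # Promote the WIP under the live id in a single step.
        self.databases[graph_id] = self.databases.pop(f"{graph_id}-wip")


@contextmanager
def materialization_lock(graph_id):
    """Placeholder for MaterializationLock; yields a made-up token."""
    yield f"token-{graph_id}"


def rebuild_blue_green(client, graph_id, schema_ddl, tables):
    wip_id = f"{graph_id}-wip"
    with materialization_lock(graph_id) as lock_token:
        # Clear any WIP left behind by a prior failed run, then start fresh.
        client.delete_database(wip_id, preserve_duckdb=True)
        client.create_database(wip_id, schema_ddl)
        try:
            for table in tables:
                # Build the WIP from the live graph's staged data.
                client.materialize_table(wip_id, table, source_graph_id=graph_id)
            client.swap_database(graph_id, lock_token=lock_token)
        except Exception:
            # Discard the WIP; the active graph was never touched.
            client.delete_database(wip_id, preserve_duckdb=True)
            raise
```

The live id appears only in the swap call, so any failure before that point leaves the active graph exactly as it was.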
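Components Affected
- /robosystems/operations/
- /robosystems/dagster/
- /robosystems/graph_api/
- /robosystems/config/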
Key Changes
- operations/graph/engine/direct_materialization.py::materialize_graph_directly (small-graph fast path): replace the in-place delete+recreate (on rebuild=True against an existing db) with the blue-green flow.
- dagster/jobs/graph.py::materialize_graph_tables (materialize_graph_job, large-graph path): the same blue-green branch. Both staged paths must adopt it, since _should_use_direct_materialization splits by staged size.
- chunked_materialization.py::materialize_table_chunked: add source_graph_id: str | None = None, pass through to client.materialize_table.
Data Model Changes
None.
API Changes
None — reuses the existing POST /databases/{graph_id}/swap and the source_graph_id materialize parameter.
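Implementation Plan
- Thread source_graph_id through materialize_table_chunked → client.materialize_table.
- Blue-green branch in materialize_graph_directly (small-graph path), gated on rebuild=True + existing db; leftover-WIP cleanup on start + on failure.
- The same branch in materialize_graph_tables (Dagster large-graph path).
- Feature flag STAGED_BLUE_GREEN_ENABLED (mirroring DIRECT_GRAPH_MATERIALIZATION_ENABLED).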
Dependencies: the shipped entity-path _materialize_blue_green (template); the swap endpoint; MaterializationLock -wip/-prev resolution; source_graph_id materialize support.
Migration strategy: none — flag-gated behavior change, no data migration.
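Testing
- Concurrent rebuild=True runs → second 409s; no partial state.
- … GRAPH_MATERIALIZATION_THRESHOLD_MB).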
Rollout
Environments: Development → Staging → Production.
Feature flag: STAGED_BLUE_GREEN_ENABLED — ship off (in-place, today's behavior); validate on a non-critical generic graph (rebuild while reading → zero failed reads; graph_id stable across swap; -prev cleanup); then default on.
Rollback plan: flip the flag off → in-place path. The entity blue-green path is unflagged and battle-tested, lowering risk.
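The gating described in this spec (flag on, rebuild=True, and an existing database) might be expressed as below. This is a sketch under assumptions: how the flag is actually read and which string values count as "on" are not specified here, and both function names are hypothetical.

```python
import os


def staged_blue_green_enabled(env=None):
    """Read the flag from an environment mapping; defaults to off (in-place)."""
    env = os.environ if env is None else env
    value = env.get("STAGED_BLUE_GREEN_ENABLED", "false")
    return value.strip().lower() in {"1", "true", "yes"}  # accepted values are an assumption


def choose_rebuild_strategy(rebuild, database_exists, flag_enabled):
    """Blue-green only when there is a live graph to protect and the flag is on."""
    if rebuild and database_exists and flag_enabled:
        return "blue_green"
    return "in_place"
```

Keeping the decision in one pure function makes rollback trivial to reason about: with the flag off, every input combination resolves to the in-place path.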
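Success Criteria
- graph_id is stable across the swap; consumers see no id churn.
Open Questions
- -prev retention: swap is one-way today (old active deleted). Keep -prev briefly for a manual rollback path? (Out of scope for v1; noted.)
- materialize_embeddings=True must build indexes into the WIP before swap (verify the chunked path targets the WIP).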
References
- Already shipped (template): entity-graph blue-green, operations/extensions/materialize.py::_materialize_blue_green.
- Related primitives: graph_api/routers/databases/swap.py, graph_api/core/ladybug/manager.py, graph_api/core/ladybug/materialization_lock.py, graph_api/routers/databases/tables/materialize.py.