Blue-green materialization for generic (staged) graphs — zero-downtime rebuilds #764

Description

@jfrench9

Summary

Generic/custom (staged) graphs currently rebuild in place: the live graph is deleted and recreated under the same id, leaving a window of seconds to minutes in which reads hit an empty or half-built graph. Blue-green materialization (build a {graph_id}-wip alongside the live graph, then swap atomically) already ships for entity graphs, and every LadybugDB primitive it relies on is graph-type-agnostic. This spec extends blue-green to the source="staged" path so generic graphs get the same zero-downtime rebuild. The rename swap is the only mutation to the live graph, so a failed rebuild leaves it fully intact.

Status

Draft

Problem Statement: Current State

  • materialize_cmd routes by graph type (operations/graph/commands/materialize.py:230-233): entity → "extensions", everything else → "staged".
  • The staged rebuild (direct_materialization.py:150-189) is destructive in-place: delete_database(graph_id, preserve_duckdb=True) → create_database(graph_id=graph_id, ...) → materialize staged files back into the same id.
  • Between the delete and end-of-materialization the graph is "rebuilding" — reads hit an empty or half-built graph.
  • Entity graphs avoid this entirely: _materialize_blue_green builds {graph_id}-wip next to the live graph and swaps atomically ("downtime is only the milliseconds of the file rename swap").
  • A consumer with a generic custom-schema graph that rebuilds nightly / on feed upload can only mask the window by running multiple replicas behind a load balancer — and even that has a hole (a replica stays "serving" through its own in-place rebuild). Single-replica graphs can't be masked at all.

Problem Statement: Desired State

Generic/staged graphs rebuild blue-green: a {graph_id}-wip is built from the live graph's staged DuckDB, then promoted via an atomic swap. Zero failed/empty reads through a rebuild (including single-replica graphs); the live graph is never mutated except by the rename swap; a failed rebuild discards the WIP and leaves the active graph intact.

Problem Statement: Why Now?

Surfaced from a generic custom-schema graph (a nightly-rebuilt product catalog) whose only availability mitigation today is extra replicas + a load balancer — which still can't cover a replica's own in-place rebuild or a single-replica graph. The fix is wiring over already-shipped, battle-tested primitives, so cost/risk is low and it removes a structural downtime window.

Proposed Solution: Approach

This is wiring, not new infrastructure — every blue-green primitive already exists and is graph-type-agnostic:

  • Swap endpoint: POST /databases/{graph_id}/swap promotes {graph_id}-wip → active, deletes the old active, and performs an atomic rename (swap.py). One-way by design.
  • WIP reads source DuckDB: tables/materialize.py + models/tables.py. A materialize can pass source_graph_id so the WIP LadybugDB ingests from the original graph's staged DuckDB without re-staging.
  • Lock: MaterializationLock resolves {id}-wip/{id}-prev → base {id}, so WIP build + swap share one per-graph lock (concurrent rebuilds 409).
  • Listing hygiene: manager.py already filters -wip/-prev out of database listings.
  • Client: client.swap_database(graph_id, lock_token=…).
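
The lock behavior in the list above can be illustrated with a small in-memory sketch. This is not the shipped MaterializationLock, whose internals this spec does not show; resolve_base_graph_id, InMemoryMaterializationLock, and LockConflict are hypothetical names, and the conflict exception only stands in for the HTTP 409 a concurrent rebuild would receive.

```python
import threading

WIP_SUFFIX = "-wip"
PREV_SUFFIX = "-prev"


def resolve_base_graph_id(graph_id: str) -> str:
    """Map '{id}-wip' or '{id}-prev' to the base '{id}'; other ids pass through."""
    for suffix in (WIP_SUFFIX, PREV_SUFFIX):
        if graph_id.endswith(suffix) and len(graph_id) > len(suffix):
            return graph_id[: -len(suffix)]
    return graph_id


class LockConflict(Exception):
    """Hypothetical stand-in for the 409 returned to a concurrent rebuild."""


class InMemoryMaterializationLock:
    """Toy per-graph lock keyed on the base id (assumption, not the real class)."""

    def __init__(self):
        self._held = {}  # base graph id -> token of the current holder
        self._guard = threading.Lock()

    def acquire(self, graph_id: str, token: str) -> None:
        base = resolve_base_graph_id(graph_id)
        with self._guard:
            holder = self._held.get(base)
            if holder is not None and holder != token:
                raise LockConflict(f"{base} is already being materialized")
            # Same token may re-enter, so the WIP build and the swap share one lock.
            self._held[base] = token

    def release(self, graph_id: str, token: str) -> None:
        base = resolve_base_graph_id(graph_id)
        with self._guard:
            if self._held.get(base) == token:
                del self._held[base]
```

Because both the WIP id and the live id resolve to the same key, a second rebuild of the same graph is rejected regardless of which id it targets.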

The one real code gap: the staged helper materialize_table_chunked (chunked_materialization.py:26) does not accept source_graph_id — it materializes graph_id's DuckDB into graph_id. Thread source_graph_id through it → client.materialize_table (the graph_api endpoint already honors it).
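
A minimal sketch of that threading change, assuming a much-simplified helper: the real materialize_table_chunked also handles chunking, which is omitted here, and the table_name parameter and the RecordingClient test double are assumptions for illustration. Optional[str] is used as the equivalent of str | None.

```python
from typing import Any, Optional


def materialize_table_chunked(
    client: Any,
    graph_id: str,
    table_name: str,  # assumed parameter name, not confirmed by the spec
    source_graph_id: Optional[str] = None,
    **kwargs: Any,
) -> Any:
    """Simplified stand-in: forwards source_graph_id only when it is set."""
    if source_graph_id is not None:
        # Target graph_id ingests from source_graph_id's staged DuckDB.
        kwargs["source_graph_id"] = source_graph_id
    return client.materialize_table(graph_id=graph_id, table_name=table_name, **kwargs)


class RecordingClient:
    """Hypothetical test double that records what it was asked to materialize."""

    def __init__(self):
        self.calls = []

    def materialize_table(self, **kwargs: Any) -> dict:
        self.calls.append(kwargs)
        return {"status": "ok"}
```

Defaulting to None keeps existing callers unchanged: without the argument, the call is identical to today's in-place behavior.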

Add a blue-green branch to the staged rebuild mirroring _materialize_blue_green minus its Postgres-staging step (staged data is already in the graph's DuckDB):

  1. Acquire the per-graph MaterializationLock.
  2. Clean up any leftover {graph_id}-wip from a prior failed run.
  3. create_database(wip_id, schema_type=custom, custom_schema_ddl=<active schema>).
  4. For each table with staged data, materialize_table_chunked(client, graph_id=wip_id, …, source_graph_id=graph_id) — build the WIP from the live graph's staged DuckDB.
  5. On success → client.swap_database(graph_id, lock_token=…) (atomic; old active deleted/-prev).
  6. On any error → discard the WIP (delete_database(wip_id, preserve_duckdb=True)), leave the active graph untouched, and re-raise.

rebuild=False (first load into a brand-new graph) stays in-place — no active graph to protect — exactly as the entity path decides.
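
The numbered flow above can be sketched against an in-memory fake. Everything here is illustrative: FakeGraphClient and rebuild_staged_blue_green are hypothetical names, the lock (step 1) is assumed to be held by the caller, who passes its token, and client.materialize_table stands in for the per-table materialize_table_chunked call.

```python
class FakeGraphClient:
    """In-memory stand-in for the graph API client (hypothetical)."""

    def __init__(self):
        self.databases = {}  # graph_id -> {table_name: source graph id}
        self.fail_on_table = None  # set to a table name to inject a failure

    def create_database(self, graph_id, **kwargs):
        self.databases[graph_id] = {}

    def delete_database(self, graph_id, preserve_duckdb=True):
        self.databases.pop(graph_id, None)

    def materialize_table(self, graph_id, table_name, source_graph_id=None):
        if table_name == self.fail_on_table:
            raise RuntimeError(f"materialize failed for {table_name}")
        self.databases[graph_id][table_name] = source_graph_id or graph_id

    def swap_database(self, graph_id, lock_token=None):
        # Stand-in for the atomic rename: WIP becomes the active graph.
        self.databases[graph_id] = self.databases.pop(f"{graph_id}-wip")


def rebuild_staged_blue_green(client, graph_id, tables, schema_ddl, lock_token):
    """Steps 2-6 of the flow; step 1 (lock) is assumed done by the caller."""
    wip_id = f"{graph_id}-wip"
    # Step 2: clear any WIP left behind by a prior failed run.
    client.delete_database(wip_id, preserve_duckdb=True)
    try:
        # Step 3: create the WIP with the active graph's schema.
        client.create_database(wip_id, schema_type="custom", custom_schema_ddl=schema_ddl)
        # Step 4: build the WIP from the live graph's staged data.
        for table_name in tables:
            client.materialize_table(wip_id, table_name, source_graph_id=graph_id)
        # Step 5: promote the WIP; this is the only mutation of the live graph.
        client.swap_database(graph_id, lock_token=lock_token)
    except Exception:
        # Step 6: discard the WIP, leave the active graph alone, re-raise.
        client.delete_database(wip_id, preserve_duckdb=True)
        raise
```

In this sketch the live entry in client.databases is only ever replaced by the swap, which mirrors the invariant the spec relies on for failure isolation.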

Components Affected

  • Operations (/robosystems/operations/)
  • Dagster (/robosystems/dagster/)
  • Graph API (/robosystems/graph_api/)
  • Configuration (/robosystems/config/)

Key Changes

  • operations/graph/engine/direct_materialization.py::materialize_graph_directly — small-graph fast path: replace the in-place delete+recreate (on rebuild=True against an existing db) with the blue-green flow.
  • dagster/jobs/graph.py::materialize_graph_tables (materialize_graph_job) — large-graph path: the same blue-green branch. Both staged paths must adopt it, since _should_use_direct_materialization splits by staged size.
  • chunked_materialization.py::materialize_table_chunked — add source_graph_id: str | None = None, pass through to client.materialize_table.

Data Model Changes

None.

API Changes

None — reuses the existing POST /databases/{graph_id}/swap and the source_graph_id materialize parameter.

Implementation Plan

  • Phase 1: Thread source_graph_id through materialize_table_chunked → client.materialize_table.
  • Phase 2: Add the blue-green branch to materialize_graph_directly (small-graph path), gated on rebuild=True + existing db; leftover-WIP cleanup on start + on failure.
  • Phase 3: Add the same branch to materialize_graph_tables (Dagster large-graph path).
  • Phase 4: Gate behind STAGED_BLUE_GREEN_ENABLED (mirroring DIRECT_GRAPH_MATERIALIZATION_ENABLED).

Dependencies: the shipped entity-path _materialize_blue_green (template); the swap endpoint; MaterializationLock -wip/-prev resolution; source_graph_id materialize support.

Migration strategy: none — flag-gated behavior change, no data migration.
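
The gating rule from Phases 2 and 4 reduces to a small decision function. This is a sketch under assumptions: how the real configuration module reads STAGED_BLUE_GREEN_ENABLED is not shown in this spec, so the environment lookup and the accepted truthy strings here are hypothetical, as are both function names.

```python
import os
from typing import Mapping, Optional


def staged_blue_green_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the flag; defaults to off, matching the 'ship off' rollout step."""
    env = os.environ if env is None else env
    value = env.get("STAGED_BLUE_GREEN_ENABLED", "false")
    return value.strip().lower() in {"1", "true", "yes", "on"}


def choose_rebuild_strategy(rebuild: bool, database_exists: bool, flag_enabled: bool) -> str:
    """Blue-green only when there is a live graph to protect and the flag is on."""
    if flag_enabled and rebuild and database_exists:
        return "blue_green"
    # First loads (rebuild=False or no existing db) and flag-off stay in place.
    return "in_place"
```

Keeping the decision in one pure function would let both staged paths (direct and Dagster) share it and makes the rollback path, flipping the flag off, trivially testable.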

Testing

  • Read-through-rebuild: continuous reads during a rebuild → zero failed/empty reads.
  • Swap atomicity: graph_id unchanged after swap; row counts match the new data.
  • Failure isolation: inject a materialize error → active graph unchanged, WIP gone, error surfaced.
  • Concurrency: two concurrent rebuild=True → second 409s; no partial state.
  • Single-replica zero-downtime (headline case): one graph, no replicas, rebuild with live reads → uninterrupted.
  • Both staged paths: direct (small) and Dagster (large, > GRAPH_MATERIALIZATION_THRESHOLD_MB).
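
As a rough illustration of the read-through-rebuild check, the harness below reads continuously from an in-memory fake while rebuilds run and counts empty reads. FakeGraphStore and count_empty_reads_during_rebuilds are hypothetical; a real test would issue queries against the graph API instead, and the single reference assignment only stands in for the atomic rename swap.

```python
import threading


class FakeGraphStore:
    """Toy store: 'live' is what readers see (hypothetical stand-in)."""

    def __init__(self, rows):
        self.live = list(rows)

    def rebuild_blue_green(self, new_rows):
        wip = []
        for row in new_rows:
            wip.append(row)  # build the WIP; 'live' is untouched
        self.live = wip  # swap: one reference assignment


def count_empty_reads_during_rebuilds(store, rebuilds):
    """Run rebuilds while a reader thread polls; return read statistics."""
    stop = threading.Event()
    first_read = threading.Event()
    stats = {"reads": 0, "empty": 0}

    def reader():
        while not stop.is_set():
            snapshot = store.live
            stats["reads"] += 1
            if not snapshot:
                stats["empty"] += 1
            first_read.set()

    thread = threading.Thread(target=reader)
    thread.start()
    first_read.wait()  # make sure reads are in flight before rebuilding
    for rows in rebuilds:
        store.rebuild_blue_green(rows)
    stop.set()
    thread.join()
    return stats
```

The same harness run against an in-place rebuild (clear, then refill) would be expected to observe empty reads, which is the contrast the headline test case is meant to demonstrate.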

Rollout

Environments: Development → Staging → Production.
Feature flag: STAGED_BLUE_GREEN_ENABLED — ship off (in-place, today's behavior); validate on a non-critical generic graph (rebuild while reading → zero failed reads; graph_id stable across swap; -prev cleanup); then default on.
Rollback plan: flip the flag off → in-place path. The entity blue-green path is unflagged and battle-tested, lowering risk.

Success Criteria

  • Generic/staged graphs rebuild with zero failed/empty reads through the rebuild (incl. single-replica).
  • graph_id is stable across the swap; consumers see no id churn.
  • A failed/aborted rebuild leaves the active graph fully intact (WIP discarded).
  • Concurrent rebuilds of one graph are mutually exclusive (second 409s).
  • Both the direct and Dagster staged paths use blue-green when the flag is on.

Open Questions

  • Rollback / -prev retention — swap is one-way today (old active deleted). Keep -prev briefly for a manual rollback path? (Out of scope for v1; noted.)
  • Embeddings / HNSW: materialize_embeddings=True must build indexes into the WIP before swap (verify the chunked path targets the WIP).
  • Disk headroom — the WIP transiently doubles on-disk graph size; gate or document a tier check if large generic graphs are common.
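
One possible shape for the disk-headroom gate, sketched with the standard library only. The function names, the 1.2 safety factor, and the idea of measuring the live graph's on-disk path are all assumptions; the spec does not say how graph size or tier limits are tracked.

```python
import shutil
from pathlib import Path


def has_wip_headroom(live_size_bytes: int, free_bytes: int, safety_factor: float = 1.2) -> bool:
    """The WIP is roughly a second copy, so require free >= live size * margin."""
    return free_bytes >= live_size_bytes * safety_factor


def path_size_bytes(path: Path) -> int:
    """Size of a single file, or the total of all files under a directory."""
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def check_wip_headroom(live_graph_path, safety_factor: float = 1.2) -> bool:
    """Compare the live graph's size with free space on the same volume."""
    path = Path(live_graph_path)
    free_bytes = shutil.disk_usage(path.parent).free
    return has_wip_headroom(path_size_bytes(path), free_bytes, safety_factor)
```

A check like this could run before creating the WIP, so a rebuild that cannot fit fails fast without touching the live graph.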

References

  • Already shipped (template): entity-graph blue-green — operations/extensions/materialize.py::_materialize_blue_green.
  • Related primitives: graph_api/routers/databases/swap.py, graph_api/core/ladybug/manager.py, graph_api/core/ladybug/materialization_lock.py, graph_api/routers/databases/tables/materialize.py.
