EPIC: Wire the eval gates into the daily-release engine with two-phase rollback #161

@maziyarpanahi

Description

Summary

Section 7.8 requires the harness to be the gate on the daily path: every daily candidate runs openmed benchmark on golden + public SHIELD in CI, is compared against gates/baseline.json, and fails closed (quarantine + open issue/chip, never auto-publish). Rollback is two-phase: the candidate is staged, the gates run against the staged artifacts, and only a green result flips the manifest pointer (the candidate becomes last-green). A regression caught later by the nightly full-suite or the status monitor triggers 'openmed release rollback '. The <10-min/zero-human rollback SLO is measured here. This epic is the orchestration layer atop the release-gate harness and depends on the manifest, the HF publish step, and scheduled CI. Decompose before starting.
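
The fail-closed comparison described above could look roughly like the sketch below. It is illustrative only: the real schema of gates/baseline.json is not given in this issue, so a flat mapping of metric name to minimum score is assumed, and `GateResult`, `check_gates`, and the metric names in the usage note are hypothetical rather than part of release_gates.py.

```python
import math
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    failures: list[str] = field(default_factory=list)


def check_gates(candidate: dict, baseline: dict, tolerance: float = 0.0) -> GateResult:
    """Compare candidate metrics to baseline floors; any doubt counts as a failure.

    Assumed (hypothetical) schema: both dicts map metric name -> score,
    where higher is better.
    """
    failures: list[str] = []
    if not baseline:
        # Fail closed: with no baseline there is nothing to prove the candidate is safe.
        failures.append("baseline is empty; refusing to pass")
    for name, floor in baseline.items():
        score = candidate.get(name)
        if not isinstance(score, (int, float)) or math.isnan(score):
            # Fail closed: a missing, non-numeric, or NaN metric is treated as a regression.
            failures.append(f"{name}: missing or invalid in candidate run")
        elif score < floor - tolerance:
            failures.append(f"{name}: {score:.4f} < baseline {floor:.4f}")
    return GateResult(passed=not failures, failures=failures)
```

For example, `check_gates({"golden_f1": 0.91}, {"golden_f1": 0.90, "shield_f1": 0.85})` fails because the SHIELD metric is absent, which is the behaviour a fail-closed gate needs: the candidate would be quarantined rather than published.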

Scope

  • Decompose before starting.
  • Wire release_gates.py into the daily CI job so each candidate is benchmarked on golden + SHIELD and gate-checked against gates/baseline.json, failing closed (quarantine + issue/chip).
  • Implement two-phase staged rollback: stage candidate, gate against staged artifacts, flip the manifest pointer only on green.
  • Extend 'openmed release rollback ' (manifest pointer flip + card/leaderboard/status regen), leveraging each artifact's repro_hash.
  • Measure and assert the <10-min/zero-human rollback SLO.
  • Publish job writes gates/baseline.json on a green release.
  • Trigger nightly full-suite + status monitor.
  • Regenerate model cards, benchmark cards, leaderboard, and status page from the manifest on every green result.
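
The stage, gate, then flip sequence in the scope items above can be sketched as a small in-memory state machine. This is a minimal sketch under stated assumptions: the actual manifest format is defined elsewhere (it is a dependency of this epic), so the `Manifest` fields and the `stage`, `promote_if_green`, and `rollback` helpers are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Manifest:
    # Hypothetical fields; the real manifest schema is not specified in this issue.
    current: Optional[str] = None                        # artifact the public pointer serves
    staged: Optional[str] = None                         # candidate awaiting its gate run
    history: list[str] = field(default_factory=list)     # earlier green releases, oldest first
    quarantined: list[str] = field(default_factory=list)


def stage(m: Manifest, candidate: str) -> None:
    """Phase 1: record the candidate without touching the public pointer."""
    m.staged = candidate


def promote_if_green(m: Manifest, run_gates: Callable[[str], bool]) -> bool:
    """Phase 2: gate the staged artifact and flip the pointer only on green."""
    if m.staged is None:
        raise ValueError("nothing staged")
    candidate, m.staged = m.staged, None
    try:
        green = run_gates(candidate)
    except Exception:
        green = False  # fail closed: a crashing gate run counts as a failure
    if not green:
        m.quarantined.append(candidate)
        return False
    if m.current is not None:
        m.history.append(m.current)
    m.current = candidate
    return True


def rollback(m: Manifest) -> str:
    """Restore the previous green release and quarantine the regressed one."""
    if not m.history:
        raise RuntimeError("no earlier green release to roll back to")
    m.quarantined.append(m.current)
    m.current = m.history.pop()
    return m.current
```

Because the pointer only moves inside `promote_if_green`, a red or crashing gate run leaves `current` untouched, and `rollback` is a single pointer move that the card, leaderboard, and status regeneration can then follow.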

Acceptance criteria

  • Epic decomposed into S/M/L tasks before implementation begins.
  • A failing-gate candidate is quarantined and never published; an issue/chip is opened.
  • A green candidate flips the manifest pointer and regenerates all trust artifacts.
  • 'openmed release rollback ' restores the last-green pointer, regenerates cards, and meets the <10-min/zero-human SLO under test.
  • Test suite green: .venv/bin/python -m pytest tests/ -q
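
One way the timing half of the rollback SLO criterion might be asserted in a test is sketched below. The helper names and the injectable clock are assumptions for illustration, not existing project code; the zero-human half is not measured here and would presumably be covered by running the rollback non-interactively in CI.

```python
import time
from typing import Callable

ROLLBACK_SLO_SECONDS = 10 * 60  # "<10-min" from the acceptance criteria


def measure_rollback(rollback_fn: Callable[[], None],
                     clock: Callable[[], float] = time.monotonic) -> float:
    """Return the elapsed time of one rollback run, in seconds."""
    start = clock()
    rollback_fn()
    return clock() - start


def assert_rollback_slo(rollback_fn: Callable[[], None],
                        budget_s: float = ROLLBACK_SLO_SECONDS,
                        clock: Callable[[], float] = time.monotonic) -> float:
    """Raise AssertionError unless the rollback finishes strictly within budget."""
    elapsed = measure_rollback(rollback_fn, clock)
    if elapsed >= budget_s:
        raise AssertionError(
            f"rollback took {elapsed:.1f}s, SLO is < {budget_s:.0f}s"
        )
    return elapsed
```

A monotonic clock is used so that wall-clock adjustments on the CI runner cannot distort the measurement, and the clock is a parameter so the SLO check itself can be unit-tested with a fake clock instead of waiting ten minutes.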

Out of scope

  • The gate harness scoring logic (OM-031b, orchestrated here).
  • DUA-corpus periodic promotion gate (section 3.3).

Files

  • .github/workflows/release-gates.yml

Task: OM-047 · Milestone: v2.0 · Priority: P1 · Size: XL
Depends on: OM-031b, OM-032, OM-024 · Blocks: —
Roadmap: section 7.8
Spec: PLANS/V2/EXECUTION/tasks/OM-047.md

Metadata

Labels

  • P1: High
  • epic: Large; decompose into child issues first
  • roadmap-v2: OpenMed V2 roadmap backlog
