
feat(behavior1k): add BEHAVIOR-1K benchmark integration #57

Open
MilkClouds wants to merge 6 commits into main from feat/behavior1k-integration

Conversation

@MilkClouds (Collaborator)

Summary

Adds the BEHAVIOR-1K (OmniGibson + NVIDIA Isaac Sim 4.5.0) benchmark, plus a zero-action baseline and a demo-replay model server used to verify env wiring against the released LeRobot v2.1 trajectories.

Replaces #56 (auto-closed when its stacked base feat/rlbench-license-guard was deleted on merge of #55). Rebased onto main.

What's in here

Benchmark module

  • src/vla_eval/benchmarks/behavior1k/benchmark.py — Behavior1KBenchmark(StepBenchmark): R1Pro robot, 23-D action space, RGB head + L/R wrist cameras. The async bridge is overridden so reset / step / cleanup run on a worker thread; Isaac Sim's SimulationApp.__init__ calls signal.signal, which assumes the main thread, so it is monkey-patched while the worker thread is set up (sketched below).
  • Lazy imports for OmniGibson (heavy startup, can't be at module level — registry resolves the class without loading the sim).
  • gm.HEADLESS=True set before og.launch (required to avoid an XR-extension segfault on the cluster).
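
A minimal sketch of the worker-thread bridge and lazy-import pattern; names like _launch_omnigibson and reset_on_worker are hypothetical, and the real benchmark.py wires this through the StepBenchmark async bridge:

```python
import signal
import threading

from anyio import to_thread

_original_signal = signal.signal

def _thread_safe_signal(signum, handler):
    # SimulationApp.__init__ installs signal handlers unconditionally,
    # but signal.signal raises ValueError off the main thread, so the
    # call becomes a no-op while the worker thread bootstraps the sim.
    if threading.current_thread() is threading.main_thread():
        return _original_signal(signum, handler)
    return None

def _launch_omnigibson() -> None:
    # Runs on the worker thread.  Imports stay out of module scope so
    # the registry can resolve the benchmark class without loading the sim.
    setattr(signal, "signal", _thread_safe_signal)
    try:
        from omnigibson.macros import gm
        gm.HEADLESS = True  # must be set before og.launch
        import omnigibson as og
        og.launch()
    finally:
        setattr(signal, "signal", _original_signal)

async def reset_on_worker() -> None:
    # The overridden bridge pushes blocking sim calls onto a worker
    # thread so the event loop stays responsive.
    await to_thread.run_sync(_launch_omnigibson)
```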

Model servers

  • behavior1k_baseline.py — zero-action 23-D baseline. Smoke-test sanity check.
  • behavior1k_demo_replay.py — plays back recorded actions from a LeRobot v2.1 parquet episode. Used to verify the dataloader / action-space wiring matches the released dataset (sketched below).
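
A hedged sketch of the replay loop, assuming the LeRobot v2.1 column names (episode_index, frame_index, action) and a 23-D float action; the real server additionally keys its replay cursor per session (see the review fixes below):

```python
import numpy as np
import pandas as pd

class DemoReplayPolicy:
    """Replays recorded actions from one LeRobot v2.1 parquet episode."""

    def __init__(self, parquet_path: str, episode_index: int = 0) -> None:
        df = pd.read_parquet(parquet_path)
        df = df[df["episode_index"] == episode_index]
        # One 23-D action vector per recorded step, in frame order.
        self._actions = np.stack(df.sort_values("frame_index")["action"].to_list())
        self._step = 0

    def predict(self, observation: dict) -> np.ndarray:
        # Past the end of the demo, hold the zero action (the same
        # output the behavior1k_baseline server produces every step).
        if self._step >= len(self._actions):
            return np.zeros(23, dtype=np.float32)
        action = np.asarray(self._actions[self._step], dtype=np.float32)
        self._step += 1
        return action
```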

Docker

  • Dockerfile.behavior1k — installs Isaac Sim 4.5.0 from pypi.nvidia.com (26 omni-* / isaacsim-* wheels), bddl3 + OmniGibson[eval] + joylo from BEHAVIOR-1K v3.7.2.
  • Build is gated behind ARG ACCEPT_NVIDIA_EULA=YES (NVIDIA Omniverse EULA — https://docs.omniverse.nvidia.com/eula/). Surfaced via docker/build.sh --accept-license behavior1k, dispatched through the EULA_GATED map added in #55 (docker: opt-in license gate for rlbench builds).
  • Listed under NO_REDIST in docker/push.sh, so the image is build-locally-only.

Configs / docs

  • configs/behavior1k_eval.yaml — turning_on_radio, task instance 1, max 2000 steps.
  • configs/model_servers/behavior1k/baseline.yaml — zero-action server.
  • configs/model_servers/behavior1k/demo_replay.yaml — demo-replay server (parquet path placeholder).
  • docs/reproductions/behavior1k.md — full repro write-up. Result data archived under docs/reproductions/data/.

Verification

# Demo replay (task = turning_on_radio, instance 1)
vla-eval run -c configs/behavior1k_eval.yaml \
            -m configs/model_servers/behavior1k/demo_replay.yaml
# → success=True, finished at step 1364/2000, wall 2933.8s

# Zero-action baseline
vla-eval run -c configs/behavior1k_eval.yaml \
            -m configs/model_servers/behavior1k/baseline.yaml
# → success=False at max_steps (expected)

The demo-replay success step (1364) falls inside the human-annotated press-skill window [1162, 1434] from the BEHAVIOR Dataset annotations, which gives reasonable confidence that the env wiring (action space, observation cameras, success detection) matches the upstream evaluation.

Skill notes

.claude/skills/add-benchmark and .claude/skills/add-model-server gain a short note not to add tests/test_<name>_benchmark.py or tests/test_<name>_server.py with mocked sim / model libraries. tests/ is for harness mechanics, not per-sim integration; mocked omnigibson / sapien / mujoco modules drift from upstream and miss the real bugs (import paths, action encoding, physics determinism). Verification is done via the smoke-test commands above.

Checklist

Code changes:

  • make check passes (ruff + ty)
  • make test passes (pytest) — 295 passed, 1 skipped
  • Smoke-tested affected configs (demo replay + baseline runs above)

Smoke test commands run:

make check
make test
docker/build.sh behavior1k --accept-license behavior1k
vla-eval run -c configs/behavior1k_eval.yaml -m configs/model_servers/behavior1k/baseline.yaml
vla-eval run -c configs/behavior1k_eval.yaml -m configs/model_servers/behavior1k/demo_replay.yaml

Commit messages

Adds the BEHAVIOR-1K (OmniGibson + NVIDIA Isaac Sim 4.5.0) benchmark.
The integration covers the standard StepBenchmark surface plus a
demo-replay model server used to verify the dataloader against the
released LeRobot v2.1 trajectories.

What's added:

- ``src/vla_eval/benchmarks/behavior1k/benchmark.py``
  Behavior1KBenchmark with the required StepBenchmark methods.
  R1Pro robot, 23-D action, RGB head + L/R wrist cameras.  The async
  bridge is overridden so reset/step/cleanup run on a worker thread
  (Isaac Sim's SimulationApp calls ``signal.signal`` during init;
  those calls assume the main thread, so they are monkey-patched
  while the worker thread is set up).

- ``src/vla_eval/model_servers/behavior1k_baseline.py``
  Zero-action baseline (smoke-test sanity check).

- ``src/vla_eval/model_servers/behavior1k_demo_replay.py``
  Plays back the recorded actions from a LeRobot v2.1 parquet
  episode.  Used to verify the env wiring matches the dataset.

- ``docker/Dockerfile.behavior1k``
  Isaac Sim 4.5.0 from pypi.nvidia.com (26 omni-* / isaacsim-*
  wheels), bddl3 + OmniGibson[eval] + joylo from BEHAVIOR-1K
  v3.7.2.  Gated behind ``ARG ACCEPT_NVIDIA_EULA=YES`` (NVIDIA
  Omniverse EULA, see https://docs.omniverse.nvidia.com/eula/).

- ``configs/behavior1k_eval.yaml`` — turning_on_radio task instance 1
- ``configs/model_servers/behavior1k/baseline.yaml`` — zero-action server
- ``docs/reproductions/behavior1k.md`` — repro write-up + data files

The behavior1k entry is registered in ``docker/build.sh`` (gated
via ``--accept-license behavior1k``) and listed under ``NO_REDIST``
in ``docker/push.sh`` so the image is built locally only.

Verification:

- demo-replay on turning_on_radio (task instance 1) → success=True
  at step 1364/2000 (within the human-annotated press-skill window
  [1162, 1434] from the BEHAVIOR Dataset annotations).
- zero-action baseline → success=False at max_steps (expected).

Skill notes (``.claude/skills/add-benchmark`` and ``add-model-server``)
gain a short reminder not to add ``tests/test_<name>_benchmark.py``
or ``tests/test_<name>_server.py`` with mocked sim/model libraries —
``tests/`` is for harness mechanics, not per-sim integration, and
mocked modules drift from upstream and miss real bugs.

Companion config to baseline.yaml — points the demo-replay server at a
LeRobot v2.1 parquet episode.  ``demo_path`` is a placeholder; users
swap in their own path before running.

CI surfaced three ty errors after the rebase:

- ``anyio.to_thread.run_sync(...)`` was unresolved through the module
  attribute path.  Use the same import-as-name style the rest of the
  codebase already uses (``predict.py``, ``serve.py``, ``rtc.py``).
- ``signal.signal = lambda ...`` triggered ``invalid-assignment``.
  Use ``setattr`` so the rebinding is opaque to the type checker
  (the runtime behaviour — restoring the handler in ``finally`` —
  is unchanged; see the sketch after this list).
- Drop the leftover ``# type: ignore`` mypy-style pragmas that were
  carrying the old workarounds; ty doesn't honour them anyway.
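
For reference, a minimal sketch of the ``setattr`` fix (hypothetical handler name; the ``finally`` restore mirrors the bridge code in benchmark.py):

```python
import signal

def _noop_signal(signum, handler):  # hypothetical stand-in handler
    return None

_original = signal.signal

# Direct rebinding trips ty:
#     signal.signal = _noop_signal   # invalid-assignment
# setattr keeps the rebinding opaque to the checker; the runtime
# behaviour (restore the real handler in finally) is unchanged.
setattr(signal, "signal", _noop_signal)
try:
    pass  # e.g. construct SimulationApp on the worker thread
finally:
    setattr(signal, "signal", _original)
```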

While here, refresh the docs that mention benchmark coverage:

- ``README.md``: BEHAVIOR-1K badge promoted from ``planned`` to
  ``integrated``; rlbench dropped from the registry-pulled image
  table; new "Build-locally images" note covering rlbench and
  behavior1k; build-script example shows ``--accept-license``.
- ``CONTRIBUTING.md``: integrated-benchmark roster updated to match
  the actual contents of ``src/vla_eval/benchmarks/`` (was missing
  LIBERO-Plus/Mem, RoboMME, MolmoSpaces, Kinetix; now also adds
  BEHAVIOR-1K).

Build-locally images (rlbench, behavior1k) now appear in both the
top-of-README support table and the Docker Images table, with a
🔒 marker indicating they're not pulled from ghcr.io and require an
explicit license opt-in.

- Top support table: 🔒 appended after the rlbench and behavior1k
  badges.  Status legend gains a fourth entry explaining 🔒.
- Docker Images table: rlbench is restored (was dropped in the prior
  pass), behavior1k is added at its 23.6 GB position.  For both, the
  Image column shows the name without a ghcr.io link, and the row
  carries 🔒.
- Replaces the earlier "Build-locally images" paragraph with a single
  caption under the table that explains the marker.

Reverts the 🔒 markers added next to the RLBench / BEHAVIOR-1K
badges in the top support table, and the matching legend entry.
Build-mechanism details belong in the Docker Images table further
down — the support table just tracks integration / reproduction
status.

Three issues from review:

- ``Behavior1KBenchmark.task_instance_id`` was set once at
  construction and never varied, so ``episodes_per_task > 1`` runs
  reloaded the same TRO state every episode (and aggregate scores
  could not match the 50-task × 10-instance challenge protocol).
  Accept ``int | list[int] | None`` and index by
  ``task["episode_idx"]`` cyclically when a list is given; the
  scalar form preserves the demo-replay use case (see the sketch
  after this list).

- ``Behavior1KDemoReplayModelServer`` kept a single
  ``_current_episode_id`` / ``_step_idx`` for the whole process,
  so two concurrent benchmark sessions on one server would race
  and consume a mixed action stream.  Key the cursor on
  ``(session_id, episode_id)``, initialise it in
  ``on_episode_start`` and free it in ``on_episode_end`` so the
  dict stays bounded (also sketched below).

- ``docs/reproductions/behavior1k.md`` build command did not pass
  ``--accept-license behavior1k``, so the new gated build skipped
  the image and the next step failed with "image not found".
  Updated the command and added the license URL inline.
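
A sketch of the first two fixes, with hypothetical names (``resolve_instance_id``; a module-level dict stands in for server state):

```python
# (session_id, episode_id) -> next step index in the recorded stream
_cursors: dict[tuple[str, str], int] = {}

def resolve_instance_id(
    task_instance_id: int | list[int] | None,
    episode_idx: int,
    default: int = 0,
) -> int:
    # Scalar keeps the old demo-replay behaviour; a list is indexed
    # cyclically so episodes_per_task > 1 visits distinct TRO states.
    if task_instance_id is None:
        return default
    if isinstance(task_instance_id, int):
        return task_instance_id
    return task_instance_id[episode_idx % len(task_instance_id)]

def on_episode_start(session_id: str, episode_id: str) -> None:
    # Per-(session, episode) cursor: concurrent sessions on one server
    # no longer race over a single shared step index.
    _cursors[(session_id, episode_id)] = 0

def on_episode_end(session_id: str, episode_id: str) -> None:
    # Free the cursor so the dict stays bounded.
    _cursors.pop((session_id, episode_id), None)
```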