
feat(behavior1k): add BEHAVIOR-1K benchmark integration #57

Open
MilkClouds wants to merge 6 commits into main from feat/behavior1k-integration

Conversation

@MilkClouds (Collaborator)

Summary

Adds the BEHAVIOR-1K (OmniGibson + NVIDIA Isaac Sim 4.5.0) benchmark, plus a zero-action baseline and a demo-replay model server used to verify env wiring against the released LeRobot v2.1 trajectories.

Replaces #56 (auto-closed when its stacked base feat/rlbench-license-guard was deleted on merge of #55). Rebased onto main.

What's in here

Benchmark module

  • src/vla_eval/benchmarks/behavior1k/benchmark.py — Behavior1KBenchmark(StepBenchmark): R1Pro robot, 23-D action space, RGB head + L/R wrist cameras. The async bridge is overridden so reset / step / cleanup run on a worker thread; Isaac Sim's SimulationApp.__init__ calls signal.signal, which assumes the main thread, so it is monkey-patched while the worker thread is set up (sketched below).
  • Lazy imports for OmniGibson (heavy startup, can't be at module level — registry resolves the class without loading the sim).
  • gm.HEADLESS=True set before og.launch (required to avoid an XR-extension segfault on the cluster).
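
A minimal sketch of the worker-thread bridge and lazy-import pattern; names like _launch_omnigibson and reset_on_worker are hypothetical, and the real benchmark.py wires this through the StepBenchmark async bridge:

```python
import signal
import threading

from anyio import to_thread

_original_signal = signal.signal

def _thread_safe_signal(signum, handler):
    # SimulationApp.__init__ installs signal handlers unconditionally,
    # but signal.signal raises ValueError off the main thread, so the
    # call becomes a no-op while the worker thread bootstraps the sim.
    if threading.current_thread() is threading.main_thread():
        return _original_signal(signum, handler)
    return None

def _launch_omnigibson() -> None:
    # Runs on the worker thread.  Imports stay out of module scope so
    # the registry can resolve the benchmark class without loading the sim.
    setattr(signal, "signal", _thread_safe_signal)
    try:
        from omnigibson.macros import gm
        gm.HEADLESS = True  # must be set before og.launch
        import omnigibson as og
        og.launch()
    finally:
        setattr(signal, "signal", _original_signal)

async def reset_on_worker() -> None:
    # The overridden bridge pushes blocking sim calls onto a worker
    # thread so the event loop stays responsive.
    await to_thread.run_sync(_launch_omnigibson)
```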

Model servers

  • behavior1k_baseline.py — zero-action 23-D baseline. Smoke-test sanity check.
  • behavior1k_demo_replay.py — plays back recorded actions from a LeRobot v2.1 parquet episode. Used to verify the dataloader / action-space wiring matches the released dataset (sketched below).
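
A hedged sketch of the replay loop, assuming the LeRobot v2.1 column names (episode_index, frame_index, action) and a 23-D float action; the real server additionally keys its replay cursor per session (see the review fixes below):

```python
import numpy as np
import pandas as pd

class DemoReplayPolicy:
    """Replays recorded actions from one LeRobot v2.1 parquet episode."""

    def __init__(self, parquet_path: str, episode_index: int = 0) -> None:
        df = pd.read_parquet(parquet_path)
        df = df[df["episode_index"] == episode_index]
        # One 23-D action vector per recorded step, in frame order.
        self._actions = np.stack(df.sort_values("frame_index")["action"].to_list())
        self._step = 0

    def predict(self, observation: dict) -> np.ndarray:
        # Past the end of the demo, hold the zero action (the same
        # output the behavior1k_baseline server produces every step).
        if self._step >= len(self._actions):
            return np.zeros(23, dtype=np.float32)
        action = np.asarray(self._actions[self._step], dtype=np.float32)
        self._step += 1
        return action
```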

Docker

  • Dockerfile.behavior1k — installs Isaac Sim 4.5.0 from pypi.nvidia.com (26 omni-* / isaacsim-* wheels), bddl3 + OmniGibson[eval] + joylo from BEHAVIOR-1K v3.7.2.
  • Build is gated behind ARG ACCEPT_NVIDIA_EULA=YES (NVIDIA Omniverse EULA — https://docs.omniverse.nvidia.com/eula/). Surfaced via docker/build.sh --accept-license behavior1k, dispatched through the EULA_GATED map added in #55 (docker: opt-in license gate for rlbench builds).
  • Listed under NO_REDIST in docker/push.sh, so the image is build-locally-only.

Configs / docs

  • configs/behavior1k_eval.yaml — turning_on_radio, task instance 1, max 2000 steps.
  • configs/model_servers/behavior1k/baseline.yaml — zero-action server.
  • configs/model_servers/behavior1k/demo_replay.yaml — demo-replay server (parquet path placeholder).
  • docs/reproductions/behavior1k.md — full repro write-up. Result data archived under docs/reproductions/data/.

Verification

# Demo replay (task = turning_on_radio, instance 1)
vla-eval run -c configs/behavior1k_eval.yaml \
            -m configs/model_servers/behavior1k/demo_replay.yaml
# → success=True, finished at step 1364/2000, wall 2933.8s

# Zero-action baseline
vla-eval run -c configs/behavior1k_eval.yaml \
            -m configs/model_servers/behavior1k/baseline.yaml
# → success=False at max_steps (expected)

The demo-replay success step (1364) falls inside the human-annotated press-skill window [1162, 1434] from the BEHAVIOR Dataset annotations, which gives reasonable confidence that the env wiring (action space, observation cameras, success detection) matches the upstream evaluation.

Skill notes

.claude/skills/add-benchmark and .claude/skills/add-model-server gain a short note not to add tests/test_<name>_benchmark.py or tests/test_<name>_server.py with mocked sim / model libraries. tests/ is for harness mechanics, not per-sim integration; mocked omnigibson / sapien / mujoco modules drift from upstream and miss the real bugs (import paths, action encoding, physics determinism). Verification is done via the smoke-test commands above.

Checklist

Code changes:

  • make check passes (ruff + ty)
  • make test passes (pytest) — 295 passed, 1 skipped
  • Smoke-tested affected configs (demo replay + baseline runs above)

Smoke test commands run:

make check
make test
docker/build.sh behavior1k --accept-license behavior1k
vla-eval run -c configs/behavior1k_eval.yaml -m configs/model_servers/behavior1k/baseline.yaml
vla-eval run -c configs/behavior1k_eval.yaml -m configs/model_servers/behavior1k/demo_replay.yaml

Commit messages

Adds the BEHAVIOR-1K (OmniGibson + NVIDIA Isaac Sim 4.5.0) benchmark.
The integration covers the standard StepBenchmark surface plus a
demo-replay model server used to verify the dataloader against the
released LeRobot v2.1 trajectories.

What's added:

- ``src/vla_eval/benchmarks/behavior1k/benchmark.py``
  Behavior1KBenchmark with the required StepBenchmark methods.
  R1Pro robot, 23-D action, RGB head + L/R wrist cameras.  The async
  bridge is overridden so reset/step/cleanup run on a worker thread
  (Isaac Sim's SimulationApp calls ``signal.signal`` during init;
  those calls assume the main thread, so they are monkey-patched
  while the worker thread is set up).

- ``src/vla_eval/model_servers/behavior1k_baseline.py``
  Zero-action baseline (smoke-test sanity check).

- ``src/vla_eval/model_servers/behavior1k_demo_replay.py``
  Plays back the recorded actions from a LeRobot v2.1 parquet
  episode.  Used to verify the env wiring matches the dataset.

- ``docker/Dockerfile.behavior1k``
  Isaac Sim 4.5.0 from pypi.nvidia.com (26 omni-* / isaacsim-*
  wheels), bddl3 + OmniGibson[eval] + joylo from BEHAVIOR-1K
  v3.7.2.  Gated behind ``ARG ACCEPT_NVIDIA_EULA=YES`` (NVIDIA
  Omniverse EULA, see https://docs.omniverse.nvidia.com/eula/).

- ``configs/behavior1k_eval.yaml`` — turning_on_radio task instance 1
- ``configs/model_servers/behavior1k/baseline.yaml`` — zero-action server
- ``docs/reproductions/behavior1k.md`` — repro write-up + data files

The behavior1k entry is registered in ``docker/build.sh`` (gated
via ``--accept-license behavior1k``) and listed under ``NO_REDIST``
in ``docker/push.sh`` so the image is built locally only.

Verification:

- demo-replay on turning_on_radio (task instance 1) → success=True
  at step 1364/2000 (within the human-annotated press-skill window
  [1162, 1434] from the BEHAVIOR Dataset annotations).
- zero-action baseline → success=False at max_steps (expected).

Skill notes (``.claude/skills/add-benchmark`` and ``add-model-server``)
gain a short reminder not to add ``tests/test_<name>_benchmark.py``
or ``tests/test_<name>_server.py`` with mocked sim/model libraries —
``tests/`` is for harness mechanics, not per-sim integration, and
mocked modules drift from upstream and miss real bugs.

Companion config to baseline.yaml — points the demo-replay server at a
LeRobot v2.1 parquet episode.  ``demo_path`` is a placeholder; users
swap in their own path before running.

CI surfaced three ty errors after the rebase:

- ``anyio.to_thread.run_sync(...)`` was unresolved through the module
  attribute path.  Use the same import-as-name style the rest of the
  codebase already uses (``predict.py``, ``serve.py``, ``rtc.py``).
- ``signal.signal = lambda ...`` triggered ``invalid-assignment``.
  Use ``setattr`` so the rebinding is opaque to the type checker
  (the runtime behaviour — restoring the handler in ``finally`` —
  is unchanged; see the sketch after this list).
- Drop the leftover ``# type: ignore`` mypy-style pragmas that were
  carrying the old workarounds; ty doesn't honour them anyway.
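
For reference, a minimal sketch of the ``setattr`` fix (hypothetical handler name; the ``finally`` restore mirrors the bridge code in benchmark.py):

```python
import signal

def _noop_signal(signum, handler):  # hypothetical stand-in handler
    return None

_original = signal.signal

# Direct rebinding trips ty:
#     signal.signal = _noop_signal   # invalid-assignment
# setattr keeps the rebinding opaque to the checker; the runtime
# behaviour (restore the real handler in finally) is unchanged.
setattr(signal, "signal", _noop_signal)
try:
    pass  # e.g. construct SimulationApp on the worker thread
finally:
    setattr(signal, "signal", _original)
```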

While here, refresh the docs that mention benchmark coverage:

- ``README.md``: BEHAVIOR-1K badge promoted from ``planned`` to
  ``integrated``; rlbench dropped from the registry-pulled image
  table; new "Build-locally images" note covering rlbench and
  behavior1k; build-script example shows ``--accept-license``.
- ``CONTRIBUTING.md``: integrated-benchmark roster updated to match
  the actual contents of ``src/vla_eval/benchmarks/`` (was missing
  LIBERO-Plus/Mem, RoboMME, MolmoSpaces, Kinetix; now also adds
  BEHAVIOR-1K).

Build-locally images (rlbench, behavior1k) now appear in both the
top-of-README support table and the Docker Images table, with a
🔒 marker indicating they're not pulled from ghcr.io and require an
explicit license opt-in.

- Top support table: 🔒 appended after the rlbench and behavior1k
  badges.  Status legend gains a fourth entry explaining 🔒.
- Docker Images table: rlbench is restored (was dropped in the prior
  pass), behavior1k is added at its 23.6 GB position.  For both, the
  Image column shows the name without a ghcr.io link, and the row
  carries 🔒.
- Replaces the earlier "Build-locally images" paragraph with a single
  caption under the table that explains the marker.

Reverts the 🔒 markers added next to the RLBench / BEHAVIOR-1K
badges in the top support table, and the matching legend entry.
Build-mechanism details belong in the Docker Images table further
down — the support table just tracks integration / reproduction
status.

Three issues from review:

- ``Behavior1KBenchmark.task_instance_id`` was set once at
  construction and never varied, so ``episodes_per_task > 1`` runs
  reloaded the same TRO state every episode (and aggregate scores
  could not match the 50-task × 10-instance challenge protocol).
  Accept ``int | list[int] | None`` and index by
  ``task["episode_idx"]`` cyclically when a list is given; the
  scalar form preserves the demo-replay use case (see the sketch
  after this list).

- ``Behavior1KDemoReplayModelServer`` kept a single
  ``_current_episode_id`` / ``_step_idx`` for the whole process,
  so two concurrent benchmark sessions on one server would race
  and consume a mixed action stream.  Key the cursor on
  ``(session_id, episode_id)``, initialise it in
  ``on_episode_start`` and free it in ``on_episode_end`` so the
  dict stays bounded (also sketched below).

- ``docs/reproductions/behavior1k.md`` build command did not pass
  ``--accept-license behavior1k``, so the new gated build skipped
  the image and the next step failed with "image not found".
  Updated the command and added the license URL inline.
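
A sketch of the first two fixes, with hypothetical names (``resolve_instance_id``; a module-level dict stands in for server state):

```python
# (session_id, episode_id) -> next step index in the recorded stream
_cursors: dict[tuple[str, str], int] = {}

def resolve_instance_id(
    task_instance_id: int | list[int] | None,
    episode_idx: int,
    default: int = 0,
) -> int:
    # Scalar keeps the old demo-replay behaviour; a list is indexed
    # cyclically so episodes_per_task > 1 visits distinct TRO states.
    if task_instance_id is None:
        return default
    if isinstance(task_instance_id, int):
        return task_instance_id
    return task_instance_id[episode_idx % len(task_instance_id)]

def on_episode_start(session_id: str, episode_id: str) -> None:
    # Per-(session, episode) cursor: concurrent sessions on one server
    # no longer race over a single shared step index.
    _cursors[(session_id, episode_id)] = 0

def on_episode_end(session_id: str, episode_id: str) -> None:
    # Free the cursor so the dict stays bounded.
    _cursors.pop((session_id, episode_id), None)
```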