feat(cli): vla-eval data fetch for external benchmark datasets #58
Open
MilkClouds wants to merge 3 commits into feat/behavior1k-integration from
Conversation
Some benchmarks ship the simulator in the docker image but expect the
dataset to come from a separate, licence-restricted source (BEHAVIOR-1K
is the first concrete consumer — its dataset is governed by the
BEHAVIOR Dataset ToS and OmniGibson assets). Previously, users had to
manually run a docker invocation that wired ``download_*`` helpers
inside the image and then edit ``configs/behavior1k_eval.yaml`` so the
volume pointed at their host download directory: two awkward steps,
neither of which appears anywhere a first-time reader would naturally
look.
This change introduces a uniform mechanism the harness can use for any
benchmark with the same shape:
- ``Benchmark.data_requirements()`` (classmethod, base default returns
  ``None``) declares a :class:`DataRequirement`: licence id + URL,
  the in-container data path, a marker file, and the argv to run
  inside the image. ``Behavior1KBenchmark`` returns the BEHAVIOR
  Dataset ToS opt-in plus the canonical ``download_*`` helper script
  (a rough sketch follows this list).
- New ``vla-eval data fetch -c <config> --accept-license <id>`` CLI:
resolves the benchmark class, mounts a host cache directory
read-write at the container's data path, and runs the declared
download command. Idempotent (skips when the marker is already
present). Symmetric with ``docker/build.sh --accept-license``,
so opt-in surface is the same shape across build and fetch.
- Default cache path: ``${VLA_EVAL_DATA_DIR}/<benchmark>`` if the env
var is set, else ``${HOME}/.cache/vla-eval/<benchmark>``.
``--data-dir`` overrides explicitly. ``configs/behavior1k_eval.yaml``
resolves the same expression in its ``volumes:`` line via OmegaConf,
so a fresh clone + ``vla-eval data fetch`` + ``vla-eval run``
works without any per-host config edits.
- ``cli/config_loader.py`` now always runs ``OmegaConf.to_container``
with ``resolve=True`` (previously only on configs that used
``extends:``), so ``${oc.env:VAR,default}`` interpolations are
honoured uniformly. No-op for configs without interpolations.
- ``docs/reproductions/behavior1k.md`` replaces the manual docker
download incantation with the new ``vla-eval data fetch`` step
and notes the auto-resolved cache path.
Three concrete cleanups from the simplify pass:

- Use ``gpu_docker_flag(...)`` instead of building ``--gpus`` by hand.
  The previous form passed ``--gpus 0`` literally; ``docker run`` needs
  ``--gpus device=0`` for non-``all`` specs, so this was a real bug that
  only manifested when ``--gpus 0`` was passed at fetch time (sketched
  just after these lists).
- Add ``cache_key`` to ``DataRequirement`` so the cache subdir is
  declared by the benchmark instead of derived from import-string
  parsing (the old form sliced ``parts[-2]`` from the module path,
  which silently couples the CLI to a package layout convention).
- De-duplicate ``_stderr_console`` — the new fetch handler had
  reimplemented it with a different rich Console signature. Hoisted to
  ``cli/_console.py``; both ``main.py`` and ``cmd_data.py`` now import
  it from one place.

While here:

- Drop the redundant ``image`` parameter from ``_build_docker_argv``
  (it was always ``docker_cfg.image``).
- Drop a dead ``or []`` branch in ``set(args.accept_license or [])``
  (argparse already defaults to ``[]``).
- Trim the ``DataRequirement`` and
  ``Behavior1KBenchmark.data_requirements`` docstrings to match the
  project's existing density.
- Refresh the ``config_loader.load_config`` docstring (it claimed
  no-extends configs went through ``yaml.safe_load`` only, which is no
  longer true since the prior commit always resolved interpolations).
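A sketch of what the first cleanup implies, assuming a helper of this shape (the repo's actual ``gpu_docker_flag`` may differ):

```python
# Sketch of the fixed behaviour; the repo's gpu_docker_flag may differ.
def gpu_docker_flag(spec: str) -> list[str]:
    if spec == "all":
        return ["--gpus", "all"]
    # docker run reads a bare integer as a GPU *count*, so device ids
    # must be spelled device=<ids>; passing "--gpus 0" literally was the bug.
    return ["--gpus", f"device={spec}"]
```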
Two issues from review:
- ``Behavior1KBenchmark.get_metadata()`` did not declare
  ``action_dim``, so realtime modes (``Orchestrator`` reads
  ``metadata.get("action_dim", 7)`` to size the action buffer)
  would fall back to 7 and instantly crash on the first
  ``step()`` once it saw a 23-D R1Pro action. Add the entry
  (sketched after this list).
- Drop ``--data-dir`` from ``vla-eval data fetch``. When users
passed ``--data-dir /custom`` for fetch, the dataset went there,
but ``vla-eval run`` mounts the path resolved by the YAML's
``${VLA_EVAL_DATA_DIR:-~/.cache/vla-eval}/<cache_key>`` —
fetch-target and run-target diverged silently. ``VLA_EVAL_DATA_DIR``
is the single source of truth for the cache base; both fetch and
run honour it. Users who need a different disk set the env var
once.
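A minimal sketch of the first fix, assuming ``get_metadata()`` returns a plain dict, which is how the ``Orchestrator`` line quoted above reads it:

```python
# Illustrative fix; assumes get_metadata() returns a plain dict, which is how
# the Orchestrator line quoted above (metadata.get("action_dim", 7)) reads it.
class Behavior1KBenchmark:
    def get_metadata(self) -> dict:
        return {
            # ... existing metadata entries ...
            "action_dim": 23,  # R1Pro actions are 23-D; without this entry the
                               # buffer is sized for 7-D and step() crashes.
        }
```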
Summary
Adds a uniform fetch flow for benchmarks whose dataset can't ship in the docker image because of licensing — currently BEHAVIOR-1K, but the abstraction is benchmark-agnostic. Replaces the awkward "manually run a docker command + edit the config" two-step that the BEHAVIOR-1K reproduction had been carrying.
What's in here
Abstraction layer
- `Benchmark.data_requirements()` — new classmethod on `vla_eval.benchmarks.base.Benchmark`. Default returns `None` (most benchmarks bundle data in-image). Subclasses opt in by returning a `DataRequirement` describing the licence opt-in, in-container mount path, marker file, and download argv.
- `DataRequirement` dataclass declared next to `StepResult`; frozen, simple field set.
CLI

- `vla-eval data fetch -c <config> --accept-license <id>` runs the declared download argv inside the benchmark's docker image with the host cache mounted read-write at the container's data path. Idempotent (skips when the marker file already exists).
- `--force` to refetch, `--dry-run` to print the docker command without executing, `--gpus` to override the GPU spec. (An earlier `--data-dir` override was dropped in review; see the review notes above.)
- Mirrors `docker/build.sh --accept-license` (the same flag shape, just at a different verb), so the opt-in surface stays consistent across build and fetch.
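Roughly, the fetch flow looks like this. A hedged sketch, not the repo's actual handler: `bench`, `image`, and `default_data_dir` (the cache helper sketched under "Cache convention" below) are all hypothetical stand-ins:

```python
# Hedged sketch of the fetch flow, not the repo's actual handler. `bench` is a
# Benchmark class, `image` the benchmark's docker image; default_data_dir is
# the hypothetical cache helper sketched under "Cache convention" below.
import subprocess
from pathlib import Path


def fetch(bench, image: str, accepted: set[str],
          force: bool = False, dry_run: bool = False) -> None:
    req = bench.data_requirements()
    if req is None:
        return  # benchmark bundles its data in-image; nothing to fetch
    if req.license_id not in accepted:
        raise SystemExit(f"re-run with --accept-license {req.license_id}")
    host_dir: Path = default_data_dir(req.cache_key)
    if (host_dir / req.marker_file).exists() and not force:
        return  # idempotent: marker already present
    argv = ["docker", "run", "--rm",
            "-v", f"{host_dir}:{req.data_path}:rw",  # host cache, read-write
            image, *req.download_argv]
    if dry_run:
        print(" ".join(argv))
        return
    host_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(argv, check=True)
```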
Cache convention

- `${VLA_EVAL_DATA_DIR}/<benchmark>` if the env var is set, else `${HOME}/.cache/vla-eval/<benchmark>`.
- `configs/behavior1k_eval.yaml` resolves the same expression in its `volumes:` line via OmegaConf, so `vla-eval data fetch` + `vla-eval run` works on a fresh clone with zero per-host config edits.
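In Python terms, the convention amounts to something like this (the helper name is hypothetical):

```python
import os
from pathlib import Path


def default_data_dir(cache_key: str) -> Path:
    # ${VLA_EVAL_DATA_DIR}/<key> if set, else ~/.cache/vla-eval/<key>.
    base = os.environ.get("VLA_EVAL_DATA_DIR")
    if base:
        return Path(base) / cache_key
    return Path.home() / ".cache" / "vla-eval" / cache_key
```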
Plumbing

- `cli/config_loader.py` now always runs `OmegaConf.to_container(merged, resolve=True)` (previously only when `extends:` was present), so `${oc.env:VAR,default}` interpolations work everywhere uniformly. No-op for configs without interpolations — `vla-eval test --validate` still passes (63/63 configs).
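A self-contained illustration of what uniform resolution buys; the volume string is illustrative, not copied from the config:

```python
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {"volumes": ["${oc.env:VLA_EVAL_DATA_DIR,/tmp/vla-eval}/behavior1k:/data:rw"]}
)
# resolve=True substitutes ${oc.env:VAR,default}: the env var when set, the
# default otherwise. Configs without interpolations pass through unchanged.
print(OmegaConf.to_container(cfg, resolve=True))
```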
BEHAVIOR-1K consumer

- `Behavior1KBenchmark.data_requirements()` returns the `behavior-dataset-tos` opt-in plus the canonical `download_omnigibson_robot_assets` / `download_behavior_1k_assets(accept_license=True)` / `download_2025_challenge_task_instances` helper script.
- `docs/reproductions/behavior1k.md` step 2 now reads `vla-eval data fetch -c configs/behavior1k_eval.yaml --accept-license behavior-dataset-tos`. The "edit the config volume" footnote is gone.
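Putting the pieces together, the override plausibly looks like this, reusing the `DataRequirement` sketch from the first commit above; every concrete value here is illustrative, not taken from the repo:

```python
# Plausible shape of the override, reusing the DataRequirement sketch above.
# Every concrete value here is illustrative, not taken from the repo.
class Behavior1KBenchmark(Benchmark):
    @classmethod
    def data_requirements(cls) -> DataRequirement:
        return DataRequirement(
            license_id="behavior-dataset-tos",
            license_url="...",                  # elided: the BEHAVIOR Dataset ToS URL
            data_path="/data",                  # assumed in-container mount point
            marker_file=".fetch_complete",      # hypothetical marker name
            download_argv=["python", "-c", "..."],  # wraps the download_* helpers above
            cache_key="behavior1k",
        )
```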
Verification

The end-to-end docker invocation hasn't been re-run on the cluster (it would re-download ~35 GiB for no extra coverage); the CLI orchestration is verified through `--dry-run`, and the existing reproduction artefact still applies.
Checklist

Code changes:

- `make check` passes (ruff + ty)
- `make test` passes (pytest) — 296 passed, 1 skipped

Smoke test commands run: