
feat(cli): vla-eval data fetch for external benchmark datasets #58

Open
MilkClouds wants to merge 3 commits into feat/behavior1k-integration from feat/data-fetch-cli

Conversation

@MilkClouds (Collaborator)

Summary

Adds a uniform fetch flow for benchmarks whose dataset can't ship in the docker image because of licensing — currently BEHAVIOR-1K, but the abstraction is benchmark-agnostic. Replaces the awkward "manually run a docker command + edit the config" two-step that the BEHAVIOR-1K reproduction had been carrying.

Stacked on #57 (feat/behavior1k-integration). Merge that one first; this PR's diff against main will then be just the CLI/abstraction layer + behavior1k consumer.

What's in here

Abstraction layer

  • Benchmark.data_requirements() — new classmethod on vla_eval.benchmarks.base.Benchmark. Default returns None (most benchmarks bundle data in-image). Subclasses opt in by returning a DataRequirement describing the licence opt-in, in-container mount path, marker file, and download argv.
  • DataRequirement dataclass declared next to StepResult; frozen, with a deliberately small field set (a sketch follows).
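
The PR doesn't reproduce the dataclass body in this description, so the following is a minimal sketch of the shape described above; the field names are hypothetical (the concepts are the PR's, and cache_key is added per the later simplify-pass commit):

from dataclasses import dataclass


@dataclass(frozen=True)
class DataRequirement:
    license_id: str                  # opt-in id, e.g. "behavior-dataset-tos"
    license_url: str                 # where the licence terms live
    container_data_path: str         # in-container mount point for the dataset
    marker_file: str                 # presence of this file means "already fetched"
    download_argv: tuple[str, ...]   # command run inside the benchmark image
    cache_key: str                   # host cache subdir, declared by the benchmark


class Benchmark:
    @classmethod
    def data_requirements(cls) -> DataRequirement | None:
        # Default: the benchmark bundles its data in the docker image.
        return None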

CLI

  • vla-eval data fetch -c <config> --accept-license <id> runs the declared download argv inside the benchmark's docker image with the host cache mounted read-write at the container's data path. Idempotent (skips when the marker file already exists). --force to refetch, --dry-run to print the docker command without executing, --data-dir to override the host cache path, --gpus to override the GPU spec.
  • Symmetric with docker/build.sh --accept-license (the same flag shape, just at a different verb), so the opt-in surface stays consistent across build and fetch.
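
A minimal sketch of the idempotence check (helper name and message wording hypothetical):

from pathlib import Path


def should_skip_fetch(data_dir: Path, marker_file: str, force: bool) -> bool:
    # The fetch is idempotent: a marker file inside the host cache records
    # a completed download, and --force bypasses the check.
    marker = data_dir / marker_file
    if marker.exists() and not force:
        print(f"{marker} present; dataset already fetched (use --force to refetch)")
        return True
    return False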

Cache convention

  • Default path: ${VLA_EVAL_DATA_DIR}/<benchmark> if the env var is set, else ${HOME}/.cache/vla-eval/<benchmark>.
  • configs/behavior1k_eval.yaml resolves the same expression in its volumes: line via OmegaConf, so vla-eval data fetch + vla-eval run works on a fresh clone with zero per-host config edits.
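
A plausible shape for that volumes: line, assuming OmegaConf's oc.env resolver with a nested default; the image name and mount target are taken from the dry-run output in Verification below:

docker:
  image: ghcr.io/allenai/vla-evaluation-harness/behavior1k:latest
  volumes:
    # Same expression the CLI resolves, so fetch and run agree on the path.
    - ${oc.env:VLA_EVAL_DATA_DIR,${oc.env:HOME}/.cache/vla-eval}/behavior1k:/app/BEHAVIOR-1K/datasets:ro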

Plumbing

  • cli/config_loader.py now always runs OmegaConf.to_container(merged, resolve=True) (previously only when extends: was present), so ${oc.env:VAR,default} interpolations work everywhere uniformly. No-op for configs without interpolations — vla-eval test --validate still passes (63/63 configs).
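
The change is small; a minimal sketch of the resolve step (extends: merging elided):

from omegaconf import OmegaConf


def load_config(path: str):
    merged = OmegaConf.load(path)  # extends: merging elided in this sketch
    # Always resolve, whether or not extends: was used, so
    # ${oc.env:VAR,default} interpolations behave uniformly.
    return OmegaConf.to_container(merged, resolve=True)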

BEHAVIOR-1K consumer

  • Behavior1KBenchmark.data_requirements() returns the behavior-dataset-tos opt-in + the canonical download_omnigibson_robot_assets / download_behavior_1k_assets(accept_license=True) / download_2025_challenge_task_instances helper script.
  • docs/reproductions/behavior1k.md step 2 now reads vla-eval data fetch -c configs/behavior1k_eval.yaml --accept-license behavior-dataset-tos. The "edit the config volume" footnote is gone.
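
Put together, the consumer override plausibly looks like the sketch below; the marker file name and helper-script path are hypothetical, while the licence id, URL, mount path, and cache key all appear elsewhere in this PR:

class Behavior1KBenchmark(Benchmark):
    @classmethod
    def data_requirements(cls) -> DataRequirement:
        return DataRequirement(
            license_id="behavior-dataset-tos",
            license_url="https://behavior.stanford.edu/dataset",
            container_data_path="/app/BEHAVIOR-1K/datasets",
            marker_file="DOWNLOAD_COMPLETE",  # hypothetical marker name
            # Hypothetical script path; it wires up the three canonical
            # download_* helpers named above.
            download_argv=("python", "/app/scripts/fetch_behavior1k_data.py"),
            cache_key="behavior1k",
        )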

Verification

$ uv run vla-eval data fetch -c configs/behavior1k_eval.yaml --dry-run
ERROR: this dataset requires accepting licence 'behavior-dataset-tos'.
  Read: https://behavior.stanford.edu/dataset
  Re-run: vla-eval data fetch -c configs/behavior1k_eval.yaml --accept-license behavior-dataset-tos

$ uv run vla-eval data fetch -c configs/behavior1k_eval.yaml \
        --accept-license behavior-dataset-tos --dry-run
Fetching data → /home/.../.cache/vla-eval/behavior1k
  image: ghcr.io/allenai/vla-evaluation-harness/behavior1k:latest
  mount: /home/.../.cache/vla-eval/behavior1k → /app/BEHAVIOR-1K/datasets
  --dry-run: would run:
    docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all -e OMNIGIBSON_HEADLESS=1 …

$ VLA_EVAL_DATA_DIR=/fast/ssd uv run python \
    -c "from vla_eval.cli.config_loader import load_config; \
        print(load_config('configs/behavior1k_eval.yaml')['docker']['volumes'])"
['/fast/ssd/behavior1k:/app/BEHAVIOR-1K/datasets:ro']

$ uv run vla-eval test --validate
✓ validate  63/63 configs valid                                                  0.2s

The end-to-end docker invocation hasn't been re-run on the cluster (would re-download ~35 GiB for no extra coverage); the CLI orchestration is verified through --dry-run and the existing reproduction artefact still applies.

Checklist

Code changes:

  • make check passes (ruff + ty)
  • make test passes (pytest) — 296 passed, 1 skipped
  • Smoke-tested affected configs

Smoke test commands run:

make check
make test
uv run vla-eval test --validate
uv run vla-eval data fetch -c configs/behavior1k_eval.yaml --dry-run
uv run vla-eval data fetch -c configs/behavior1k_eval.yaml --accept-license behavior-dataset-tos --dry-run

Commit messages

Some benchmarks ship the simulator in the docker image but expect the
dataset to come from a separate, licence-restricted source (BEHAVIOR-1K
is the first concrete consumer; its dataset, including the OmniGibson
assets, is governed by the BEHAVIOR Dataset ToS).  Previously, users
had to run a manual docker invocation that wired up the ``download_*`` helpers
inside the image and then edit ``configs/behavior1k_eval.yaml`` so the
volume pointed at their host download directory.  Two awkward steps
that don't appear anywhere a first-time reader would naturally look.

This change introduces a uniform mechanism the harness can use for any
benchmark with the same shape:

- ``Benchmark.data_requirements()`` (classmethod, base default returns
  ``None``) declares a :class:`DataRequirement`: licence id + URL,
  the in-container data path, a marker file, and the argv to run
  inside the image.  ``Behavior1KBenchmark`` returns the BEHAVIOR
  Dataset ToS opt-in plus the canonical ``download_*`` helper script.

- New ``vla-eval data fetch -c <config> --accept-license <id>`` CLI:
  resolves the benchmark class, mounts a host cache directory
  read-write at the container's data path, and runs the declared
  download command.  Idempotent (skips when the marker is already
  present).  Symmetric with ``docker/build.sh --accept-license``,
  so opt-in surface is the same shape across build and fetch.

- Default cache path: ``${VLA_EVAL_DATA_DIR}/<benchmark>`` if the env
  var is set, else ``${HOME}/.cache/vla-eval/<benchmark>``.
  ``--data-dir`` overrides explicitly.  ``configs/behavior1k_eval.yaml``
  resolves the same expression in its ``volumes:`` line via OmegaConf,
  so a fresh clone + ``vla-eval data fetch`` + ``vla-eval run``
  works without any per-host config edits.

- ``cli/config_loader.py`` now always runs ``OmegaConf.to_container``
  with ``resolve=True`` (previously only on configs that used
  ``extends:``), so ``${oc.env:VAR,default}`` interpolations are
  honoured uniformly.  No-op for configs without interpolations.

- ``docs/reproductions/behavior1k.md`` replaces the manual docker
  download incantation with the new ``vla-eval data fetch`` step
  and notes the auto-resolved cache path.

Three concrete cleanups from the simplify pass:

- Use ``gpu_docker_flag(...)`` instead of building ``--gpus`` by hand.
  The previous form passed ``--gpus 0`` literally, but ``docker run``
  needs ``--gpus device=0`` for non-``all`` specs, so this was a real
  bug that only manifested when a bare device index was passed at
  fetch time (see the sketch after this list).

- Add ``cache_key`` to ``DataRequirement`` so the cache subdir is
  declared by the benchmark instead of derived from import-string
  parsing (the old form sliced ``parts[-2]`` from the module path,
  which silently couples the CLI to a package layout convention).

- De-duplicate ``_stderr_console`` — the new fetch handler had
  reimplemented it with a different rich Console signature.  Hoisted
  to ``cli/_console.py``; both ``main.py`` and ``cmd_data.py`` now
  import it from one place.
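
For reference, a sketch of the helper's behaviour (the implementation isn't shown in this diff): docker run treats a bare integer after --gpus as a GPU count, so a device index needs the device= prefix.

def gpu_docker_flag(spec: str) -> list[str]:
    # "--gpus 0" would request zero GPUs; device indices need "device=".
    if spec == "all":
        return ["--gpus", "all"]
    return ["--gpus", f"device={spec}"]


assert gpu_docker_flag("0") == ["--gpus", "device=0"]
assert gpu_docker_flag("all") == ["--gpus", "all"]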

While here:

- Drop the redundant ``image`` parameter from ``_build_docker_argv``
  (it was always ``docker_cfg.image``).
- Drop a dead ``or []`` branch in ``set(args.accept_license or [])``
  (argparse already defaults to ``[]``).
- Trim the ``DataRequirement`` and ``Behavior1KBenchmark.data_requirements``
  docstrings to match the project's existing density.
- Refresh the ``config_loader.load_config`` docstring (it claimed
  no-extends configs went through ``yaml.safe_load`` only, which is
  no longer true since the prior commit always resolved
  interpolations).

Two issues from review:

- ``Behavior1KBenchmark.get_metadata()`` did not declare
  ``action_dim``, so realtime modes (the ``Orchestrator`` reads
  ``metadata.get("action_dim", 7)`` to size the action buffer)
  would fall back to 7 and crash on the first ``step()`` that
  saw a 23-D R1Pro action.  Add the entry (sketched after this
  list).

- Drop ``--data-dir`` from ``vla-eval data fetch``.  When users
  passed ``--data-dir /custom`` for fetch, the dataset went there,
  but ``vla-eval run`` mounts the path resolved by the YAML's
  ``${VLA_EVAL_DATA_DIR:-~/.cache/vla-eval}/<cache_key>`` —
  fetch-target and run-target diverged silently.  ``VLA_EVAL_DATA_DIR``
  is the single source of truth for the cache base; both fetch and
  run honour it.  Users who need a different disk set the env var
  once.
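
A sketch of the first fix, with the surrounding metadata fields elided:

class Behavior1KBenchmark(Benchmark):
    def get_metadata(self) -> dict:
        return {
            # ...other metadata fields elided in this sketch...
            # R1Pro actions are 23-D; without this entry the Orchestrator's
            # metadata.get("action_dim", 7) fallback sizes the buffer wrong.
            "action_dim": 23,
        }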