
feat(cli): vla-eval data fetch for external benchmark datasets #58

Open
MilkClouds wants to merge 3 commits into feat/behavior1k-integration from feat/data-fetch-cli

Conversation

@MilkClouds (Collaborator)

Summary

Adds a uniform fetch flow for benchmarks whose dataset can't ship in the docker image because of licensing — currently BEHAVIOR-1K, but the abstraction is benchmark-agnostic. Replaces the awkward "manually run a docker command + edit the config" two-step that the BEHAVIOR-1K reproduction had been carrying.

Stacked on #57 (feat/behavior1k-integration). Merge that one first; this PR's diff against main will then be just the CLI/abstraction layer + behavior1k consumer.

What's in here

Abstraction layer

  • Benchmark.data_requirements() — new classmethod on vla_eval.benchmarks.base.Benchmark. Default returns None (most benchmarks bundle data in-image). Subclasses opt in by returning a DataRequirement describing the licence opt-in, in-container mount path, marker file, and download argv.
  • DataRequirement dataclass declared next to StepResult; frozen, with a deliberately small field set (a sketch follows).
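
The PR doesn't reproduce the dataclass body in this description, so the following is a minimal sketch of the shape described above; the field names are hypothetical (the concepts are the PR's, and cache_key is added per the later simplify-pass commit):

from dataclasses import dataclass


@dataclass(frozen=True)
class DataRequirement:
    license_id: str                  # opt-in id, e.g. "behavior-dataset-tos"
    license_url: str                 # where the licence terms live
    container_data_path: str         # in-container mount point for the dataset
    marker_file: str                 # presence of this file means "already fetched"
    download_argv: tuple[str, ...]   # command run inside the benchmark image
    cache_key: str                   # host cache subdir, declared by the benchmark


class Benchmark:
    @classmethod
    def data_requirements(cls) -> DataRequirement | None:
        # Default: the benchmark bundles its data in the docker image.
        return None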

CLI

  • vla-eval data fetch -c <config> --accept-license <id> runs the declared download argv inside the benchmark's docker image with the host cache mounted read-write at the container's data path. Idempotent (skips when the marker file already exists). --force to refetch, --dry-run to print the docker command without executing, --data-dir to override the host cache path, --gpus to override the GPU spec.
  • Symmetric with docker/build.sh --accept-license (the same flag shape, just at a different verb), so the opt-in surface stays consistent across build and fetch.
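
A minimal sketch of the idempotence check (helper name and message wording hypothetical):

from pathlib import Path


def should_skip_fetch(data_dir: Path, marker_file: str, force: bool) -> bool:
    # The fetch is idempotent: a marker file inside the host cache records
    # a completed download, and --force bypasses the check.
    marker = data_dir / marker_file
    if marker.exists() and not force:
        print(f"{marker} present; dataset already fetched (use --force to refetch)")
        return True
    return False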

Cache convention

  • Default path: ${VLA_EVAL_DATA_DIR}/<benchmark> if the env var is set, else ${HOME}/.cache/vla-eval/<benchmark>.
  • configs/behavior1k_eval.yaml resolves the same expression in its volumes: line via OmegaConf, so vla-eval data fetch + vla-eval run works on a fresh clone with zero per-host config edits.
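
A plausible shape for that volumes: line, assuming OmegaConf's oc.env resolver with a nested default; the image name and mount target are taken from the dry-run output in Verification below:

docker:
  image: ghcr.io/allenai/vla-evaluation-harness/behavior1k:latest
  volumes:
    # Same expression the CLI resolves, so fetch and run agree on the path.
    - ${oc.env:VLA_EVAL_DATA_DIR,${oc.env:HOME}/.cache/vla-eval}/behavior1k:/app/BEHAVIOR-1K/datasets:ro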

Plumbing

  • cli/config_loader.py now always runs OmegaConf.to_container(merged, resolve=True) (previously only when extends: was present), so ${oc.env:VAR,default} interpolations work everywhere uniformly. No-op for configs without interpolations — vla-eval test --validate still passes (63/63 configs).
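
The change is small; a minimal sketch of the resolve step (extends: merging elided):

from omegaconf import OmegaConf


def load_config(path: str):
    merged = OmegaConf.load(path)  # extends: merging elided in this sketch
    # Always resolve, whether or not extends: was used, so
    # ${oc.env:VAR,default} interpolations behave uniformly.
    return OmegaConf.to_container(merged, resolve=True)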

BEHAVIOR-1K consumer

  • Behavior1KBenchmark.data_requirements() returns the behavior-dataset-tos opt-in + the canonical download_omnigibson_robot_assets / download_behavior_1k_assets(accept_license=True) / download_2025_challenge_task_instances helper script.
  • docs/reproductions/behavior1k.md step 2 now reads vla-eval data fetch -c configs/behavior1k_eval.yaml --accept-license behavior-dataset-tos. The "edit the config volume" footnote is gone.
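
Put together, the consumer override plausibly looks like the sketch below; the marker file name and helper-script path are hypothetical, while the licence id, URL, mount path, and cache key all appear elsewhere in this PR:

class Behavior1KBenchmark(Benchmark):
    @classmethod
    def data_requirements(cls) -> DataRequirement:
        return DataRequirement(
            license_id="behavior-dataset-tos",
            license_url="https://behavior.stanford.edu/dataset",
            container_data_path="/app/BEHAVIOR-1K/datasets",
            marker_file="DOWNLOAD_COMPLETE",  # hypothetical marker name
            # Hypothetical script path; it wires up the three canonical
            # download_* helpers named above.
            download_argv=("python", "/app/scripts/fetch_behavior1k_data.py"),
            cache_key="behavior1k",
        )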

Verification

$ uv run vla-eval data fetch -c configs/behavior1k_eval.yaml --dry-run
ERROR: this dataset requires accepting licence 'behavior-dataset-tos'.
  Read: https://behavior.stanford.edu/dataset
  Re-run: vla-eval data fetch -c configs/behavior1k_eval.yaml --accept-license behavior-dataset-tos

$ uv run vla-eval data fetch -c configs/behavior1k_eval.yaml \
        --accept-license behavior-dataset-tos --dry-run
Fetching data → /home/.../.cache/vla-eval/behavior1k
  image: ghcr.io/allenai/vla-evaluation-harness/behavior1k:latest
  mount: /home/.../.cache/vla-eval/behavior1k → /app/BEHAVIOR-1K/datasets
  --dry-run: would run:
    docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all -e OMNIGIBSON_HEADLESS=1 …

$ VLA_EVAL_DATA_DIR=/fast/ssd uv run python \
    -c "from vla_eval.cli.config_loader import load_config; \
        print(load_config('configs/behavior1k_eval.yaml')['docker']['volumes'])"
['/fast/ssd/behavior1k:/app/BEHAVIOR-1K/datasets:ro']

$ uv run vla-eval test --validate
✓ validate  63/63 configs valid                                                  0.2s

The end-to-end docker invocation hasn't been re-run on the cluster (would re-download ~35 GiB for no extra coverage); the CLI orchestration is verified through --dry-run and the existing reproduction artefact still applies.

Checklist

Code changes:

  • make check passes (ruff + ty)
  • make test passes (pytest) — 296 passed, 1 skipped
  • Smoke-tested affected configs

Smoke test commands run:

make check
make test
uv run vla-eval test --validate
uv run vla-eval data fetch -c configs/behavior1k_eval.yaml --dry-run
uv run vla-eval data fetch -c configs/behavior1k_eval.yaml --accept-license behavior-dataset-tos --dry-run

Commit messages

Some benchmarks ship the simulator in the docker image but expect the
dataset to come from a separate, licence-restricted source (BEHAVIOR-1K
is the first concrete consumer; its dataset, including the OmniGibson
assets, is governed by the BEHAVIOR Dataset ToS).  Previously, users
had to run a manual docker invocation that wired up the ``download_*`` helpers
inside the image and then edit ``configs/behavior1k_eval.yaml`` so the
volume pointed at their host download directory.  Two awkward steps
that don't appear anywhere a first-time reader would naturally look.

This change introduces a uniform mechanism the harness can use for any
benchmark with the same shape:

- ``Benchmark.data_requirements()`` (classmethod, base default returns
  ``None``) declares a :class:`DataRequirement`: licence id + URL,
  the in-container data path, a marker file, and the argv to run
  inside the image.  ``Behavior1KBenchmark`` returns the BEHAVIOR
  Dataset ToS opt-in plus the canonical ``download_*`` helper script.

- New ``vla-eval data fetch -c <config> --accept-license <id>`` CLI:
  resolves the benchmark class, mounts a host cache directory
  read-write at the container's data path, and runs the declared
  download command.  Idempotent (skips when the marker is already
  present).  Symmetric with ``docker/build.sh --accept-license``,
  so opt-in surface is the same shape across build and fetch.

- Default cache path: ``${VLA_EVAL_DATA_DIR}/<benchmark>`` if the env
  var is set, else ``${HOME}/.cache/vla-eval/<benchmark>``.
  ``--data-dir`` overrides explicitly.  ``configs/behavior1k_eval.yaml``
  resolves the same expression in its ``volumes:`` line via OmegaConf,
  so a fresh clone + ``vla-eval data fetch`` + ``vla-eval run``
  works without any per-host config edits.

- ``cli/config_loader.py`` now always runs ``OmegaConf.to_container``
  with ``resolve=True`` (previously only on configs that used
  ``extends:``), so ``${oc.env:VAR,default}`` interpolations are
  honoured uniformly.  No-op for configs without interpolations.

- ``docs/reproductions/behavior1k.md`` replaces the manual docker
  download incantation with the new ``vla-eval data fetch`` step
  and notes the auto-resolved cache path.

Three concrete cleanups from the simplify pass:

- Use ``gpu_docker_flag(...)`` instead of building ``--gpus`` by hand.
  The previous form passed ``--gpus 0`` literally, but ``docker run``
  needs ``--gpus device=0`` for non-``all`` specs, so this was a real
  bug that only manifested when a bare device index was passed at
  fetch time (see the sketch after this list).

- Add ``cache_key`` to ``DataRequirement`` so the cache subdir is
  declared by the benchmark instead of derived from import-string
  parsing (the old form sliced ``parts[-2]`` from the module path,
  which silently couples the CLI to a package layout convention).

- De-duplicate ``_stderr_console`` — the new fetch handler had
  reimplemented it with a different rich Console signature.  Hoisted
  to ``cli/_console.py``; both ``main.py`` and ``cmd_data.py`` now
  import it from one place.
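
For reference, a sketch of the helper's behaviour (the implementation isn't shown in this diff): docker run treats a bare integer after --gpus as a GPU count, so a device index needs the device= prefix.

def gpu_docker_flag(spec: str) -> list[str]:
    # "--gpus 0" would request zero GPUs; device indices need "device=".
    if spec == "all":
        return ["--gpus", "all"]
    return ["--gpus", f"device={spec}"]


assert gpu_docker_flag("0") == ["--gpus", "device=0"]
assert gpu_docker_flag("all") == ["--gpus", "all"]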

While here:

- Drop the redundant ``image`` parameter from ``_build_docker_argv``
  (it was always ``docker_cfg.image``).
- Drop a dead ``or []`` branch in ``set(args.accept_license or [])``
  (argparse already defaults to ``[]``).
- Trim the ``DataRequirement`` and ``Behavior1KBenchmark.data_requirements``
  docstrings to match the project's existing density.
- Refresh the ``config_loader.load_config`` docstring (it claimed
  no-extends configs went through ``yaml.safe_load`` only, which is
  no longer true since the prior commit always resolved
  interpolations).

Two issues from review:

- ``Behavior1KBenchmark.get_metadata()`` did not declare
  ``action_dim``, so realtime modes (the ``Orchestrator`` reads
  ``metadata.get("action_dim", 7)`` to size the action buffer)
  would fall back to 7 and crash on the first ``step()`` that
  saw a 23-D R1Pro action.  Add the entry (sketched after this
  list).

- Drop ``--data-dir`` from ``vla-eval data fetch``.  When users
  passed ``--data-dir /custom`` for fetch, the dataset went there,
  but ``vla-eval run`` mounts the path resolved by the YAML's
  ``${VLA_EVAL_DATA_DIR:-~/.cache/vla-eval}/<cache_key>`` —
  fetch-target and run-target diverged silently.  ``VLA_EVAL_DATA_DIR``
  is the single source of truth for the cache base; both fetch and
  run honour it.  Users who need a different disk set the env var
  once.
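
A sketch of the first fix, with the surrounding metadata fields elided:

class Behavior1KBenchmark(Benchmark):
    def get_metadata(self) -> dict:
        return {
            # ...other metadata fields elided in this sketch...
            # R1Pro actions are 23-D; without this entry the Orchestrator's
            # metadata.get("action_dim", 7) fallback sizes the buffer wrong.
            "action_dim": 23,
        }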