Skip to content

UCSC-VLAA/VisualClaw

Repository files navigation

VisualClaw icon
VisualClaw

Self-evolving multimodal agent infrastructure for vision-language agents

arXiv 2606.16295 Project page VisualClawArena dataset

VisualClaw demo preview


VisualClaw is a modular multimodal agent system for tool-using vision-language agents. It sits between agent frameworks such as Claude Code, Codex, and OpenClaw and upstream LLM providers, then adds prompt-time skills, retrieved memory, self-evolution, video processing, live vision, and evaluation hooks.

VisualClaw is the reusable agent system code. VisualClawArena is the separate 200-scenario benchmark/data release with videos, multi-round questions, workspaces, and evaluation outputs.


Highlights

Feature What it does How to enable
Gateway proxy OpenAI/Anthropic-compatible gateway with pre/post hooks visualclaw start, then point OPENAI_BASE_URL or ANTHROPIC_BASE_URL at the gateway
Claude Code backend Runs scenarios with Claude Code as the tool-using agent claude setup-token, then scripts/run_visualclawarena_agent_smoke.sh
Codex backend Runs scenarios with Codex as the tool-using agent CODEX_ACCESS_TOKEN=... BACKEND=codex scripts/run_visualclawarena_agent_smoke.sh
Skill injection Retrieves relevant SKILL.md files and injects them into agent prompts Configure skill.skills_dir, or pass --skills-mode inject --skills-dir ... to the arena runner
Memory retrieval Stores prior outcomes and retrieves relevant context later Gateway memory module, or runner flag --memory
Self-evolution Evolves the skill bank from failed rounds and successful memory patterns scripts/run_visualclawarena_self_evolve.sh
Video cascade CPU-friendly dHash, lightweight encoding, change gate, and keyframe context; also includes sb_cascade, a streaming codec-metadata selector that avoids per-frame visual feature encoding Install .[video]; use --keyframe-mode cascade or --keyframe-mode sb_cascade --max-keyframes 8
Live vision / glasses Streams phone/glasses camera frames through the live gateway Install .[live,video], enable live config, see docs/IOS_SETUP.md
VisualClawArena packaging Prepares the 200-scenario benchmark for Hugging Face scripts/prepare_visualclawarena_hf.py

One-Line Start

git clone https://github.com/UCSC-VLAA/VisualClaw.git visualclaw && cd visualclaw && pip install -e ".[arena,live,video]"

Start the gateway:

visualclaw start

Run one VisualClawArena smoke case with Codex:

VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena BACKEND=codex SCENARIO=mmt_q1 scripts/run_visualclawarena_agent_smoke.sh

Run the same smoke case with Claude Code:

VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena SCENARIO=mmt_q1 scripts/run_visualclawarena_agent_smoke.sh

To use VisualClaw as a transparent gateway, point your agent framework's API base to http://localhost:30100.


Use Cases

  1. Agent evaluation - run Claude Code or Codex on VisualClawArena scenarios with consistent video context, model settings, and scoring.
  2. Self-evolving agents - use failure-aware memory pruning and schema/file-output preflight to evolve reusable SKILL.md instructions.
  3. Gateway deployment - add skill and memory hooks to OpenAI/Anthropic-compatible agent stacks without changing the agent code.
  4. Live vision and AI glasses - stream phone or smart-glasses camera frames into the same multimodal gateway and context pipeline.

Agent Framework Pipeline

VisualClaw agent pipeline

VisualClaw keeps the agent framework separate from the infrastructure that prepares context. The gateway can inject skills, retrieve memory, normalize multimodal inputs, collect traces, evolve skills, and forward requests to the chosen provider.


VisualClawArena Agent Backends

VisualClawArena runs can use either Claude Code or Codex as the tool-using agent backend. Credentials are read from environment variables or the official CLI credential stores; do not commit tokens into this repository.

VisualClawArena scenario example with video clip, agent files, workspace, and multi-round instructions

Each VisualClawArena scenario packages a video clip, role/context files, a mutable workspace, and multi-round instructions that require agents to inspect visual evidence, update files, and pass machine-checkable evaluations.

Install the benchmark/runtime dependencies:

pip install -e ".[arena]"
cp .env.example .env

Download the HF-format VisualClawArena dataset from Hugging Face, or place an equivalent local copy so it has this shape:

VisualClawArena/
  scenarios/<scenario_id>/spec/questions.json
  scenarios/<scenario_id>/data/clip/*.mp4

Claude Code

Claude Code is the default agent backend for the smoke helper.

claude setup-token
# Put these in .env:
# CLAUDE_CODE_OAUTH_TOKEN=<token printed by setup-token>
# VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena

scripts/bootstrap_agent_backend.sh claude-code
scripts/run_visualclawarena_agent_smoke.sh

You can also rely on an existing claude auth login session, or set ANTHROPIC_API_KEY instead of CLAUDE_CODE_OAUTH_TOKEN.

Codex

Codex uses the installed codex CLI. Use an access token or API key, then run the same smoke helper with BACKEND=codex:

# Put these in .env:
# CODEX_ACCESS_TOKEN=<codex access token>
# or: OPENAI_API_KEY=<OpenAI API key>
# VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena

scripts/bootstrap_agent_backend.sh codex
BACKEND=codex scripts/run_visualclawarena_agent_smoke.sh

The helper scripts load .env automatically. Process environment variables still take priority, so CI can inject secrets without editing files. The bootstrap script validates CLI availability and auth without writing secrets into the repo. The smoke helper runs one packaged scenario round, stages keyframes, invokes the agent backend, and writes a normal VisualClawArena results.json. Defaults are SCENARIO=mmt_q1, MAX_ROUNDS=1, MAX_KEYFRAMES=8, and EFFORT=medium.


Run VisualClawArena

Place or download the HF-format VisualClawArena dataset outside this code repo:

VisualClawArena/
  scenarios/<scenario_id>/spec/questions.json
  scenarios/<scenario_id>/spec/scripts/
  scenarios/<scenario_id>/data/workspace/
  scenarios/<scenario_id>/data/clip/*.mp4

Set the dataset root:

export VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena

Smoke test:

# Claude Code, default
scripts/run_visualclawarena_agent_smoke.sh

# Codex
BACKEND=codex scripts/run_visualclawarena_agent_smoke.sh

Useful overrides:

SCENARIO=mmt_s11 MAX_ROUNDS=3 MODEL=gpt-5.5 EFFORT=medium \
BACKEND=codex scripts/run_visualclawarena_agent_smoke.sh

Direct runner form:

python -m benchmark.visualclawarena.runner \
  --scenario-dir "$VISUALCLAW_ARENA_ROOT/scenarios/mmt_q1/spec" \
  --backend codex \
  --model gpt-5.5 \
  --max-keyframes 8 \
  --max-rounds 1 \
  --output-dir runs/mmt_q1_codex_smoke

Run Self-Evolution

Self-evolution uses a writable skill bank seeded from visualclaw/skills_seed/seed_universal_mc, memory retrieval, schema/file preflight, and a cross-scenario failure buffer.

Claude Code self-evolve:

# Put VISUALCLAW_ARENA_ROOT and CLAUDE_CODE_OAUTH_TOKEN in .env.

SCENARIO=mmt_q1 MAX_ROUNDS=0 scripts/run_visualclawarena_self_evolve.sh

Codex self-evolve:

# Put VISUALCLAW_ARENA_ROOT and OPENAI_API_KEY in .env.

BACKEND=codex SCENARIO=mmt_q1 MAX_ROUNDS=0 scripts/run_visualclawarena_self_evolve.sh

Key knobs:

SKILLS_EVOLVE_EVERY=5    # evolve after every N failures within a scenario
MEMORY_TOP_K=5           # retrieved memory cap
MEMORY_MAX_CHARS=4000    # memory render cap
EVOLVER_BACKEND=openai   # openai, claude-code, or bedrock
EVOLVER_MODEL=gpt-5.2    # optional model override

Outputs are written under runs/visualclawarena_self_evolve/, including results.json, a run-local evolved skill bank, and the failure buffer.


Package VisualClawArena For Hugging Face

The data release is intentionally separate from the system repo.

python scripts/prepare_visualclawarena_hf.py \
  --out /path/to/VisualClawArena \
  --copy-mode hardlink \
  --overwrite

The package includes scenario specs, clips/workspaces, manifests, evaluation summaries, per-question rows, and sanitized raw result JSONs. Upload with:

huggingface-cli upload <org-or-user>/VisualClawArena /path/to/VisualClawArena --repo-type dataset

Modules

Module Endpoint prefix Description
Proxy /v1/chat/completions, /v1/messages Transparent LLM forwarding: OpenAI / Anthropic / Bedrock / OpenRouter
Skill /v1/skill/ Template/embedding/multimodal skill retrieval; LLM-driven skill evolution
Memory /v1/memory/ FAISS vector memory; text+vision retrieval; user memory file
RL /v1/rl/ PRM scoring; GRPO/OPD training; active model hot-swap
Multimodal/Image /v1/multimodal/image/ Image extraction, description (VLM, disk-cached), format normalization
Multimodal/Video /v1/multimodal/video/ WebSocket frame stream; 5-stage edge+server video pipeline
Governance /v1/governance/ Constitution rules; cost tracking; skill evolution safety
Scheduler /v1/scheduler/ Idle/sleep-window RL trigger; optional Google Calendar integration
Live /v1/live/ Gemini Live + Meta Ray-Ban glasses bridge (dedicated port 8765)

Installation

# Core (gateway + proxy)
pip install -e .

# With embedding skill retrieval
pip install -e ".[embedding]"

# With RL training backend
pip install -e ".[rl]"

# With idle scheduler (psutil + Google Calendar)
pip install -e ".[scheduler]"

# With Gemini Live + audio
pip install -e ".[live]"

# With video pipeline (OpenCV)
pip install -e ".[video]"

# With VisualClawArena runner dependencies
pip install -e ".[arena]"

# Full public-system install
pip install -e ".[arena,embedding,evolve,live,video,scheduler]"

Configuration

Default config lives in visualclaw/config/defaults.yaml. Override by creating config.yaml in your project root:

gateway:
  port: 30100
  enabled_modules: [proxy, skill, memory, multimodal_image, multimodal_video]

skill:
  skills_dir: skills          # your local SKILL.md bank
  retrieval_mode: template    # template | embedding | multimodal

proxy:
  providers:
    openai:
      api_base: "https://api.openai.com/v1"
      api_key: "${OPENAI_API_KEY}"
    anthropic:
      api_base: "https://api.anthropic.com"
      api_key: "${ANTHROPIC_API_KEY}"

rl:
  enabled: false              # true = collect + train; false = collect only

Environment variable pattern: MM_<SECTION>_<KEY> (e.g. MM_GATEWAY_PORT=30200).

To migrate from v2 flat config:

visualclaw config migrate

Benchmark Pipeline And Results

VisualClawArena benchmark pipeline

VisualClawArena packages each scenario as text rounds plus video/workspace assets. The runner stages clips, extracts keyframes, invokes the chosen agent backend, records tool traces, and writes standard results.json outputs for later aggregation.

VisualClawArena per-day accuracy curve

VisualClawArena case study

The figures above show the benchmark visualization style used for the 200-scenario VisualClawArena analysis: per-day accuracy curves and frame-level case studies with decision labels, memory, and evolved skills.


Plugin Integration

Framework adapters in plugins/ inject context and collect training data via gateway HTTP endpoints:

Framework Plugin Hooks
Claude Code plugins/claude_code/hooks_multi.sh UserPromptSubmit -> inject; Stop -> collect
OpenClaw plugins/openclaw/src/index.ts before_model_resolve, before_prompt_build, agent_end
Codex plugins/codex/hooks.sh UserPromptSubmit, Stop

Proxy mode (zero config): ANTHROPIC_BASE_URL=http://localhost:30100 - the gateway handles the full Pre/Post pipeline automatically.


Skills

Skills are plain SKILL.md files stored under the configured skill.skills_dir. The gateway retrieves relevant skills automatically and injects them into the prompt. The benchmark runner can use the same mechanism through --skills-mode inject --skills-dir <bank>.

For VisualClawArena, use the shipped seed bank:

visualclaw/skills_seed/seed_universal_mc

For your own deployment, create a local bank:

mkdir -p skills/my-skill

SKILL.md format:

---
name: debug-systematically
description: Use when diagnosing a bug to avoid guessing.
category: coding
---

## Debug Systematically

1. Reproduce the error with a minimal test case...

Self-evolution writes new skills into a run-local writable bank, never directly into the version-controlled seed bank.


Live Vision And AI Glasses

VisualClaw can run as a live vision gateway for a phone camera or Meta Ray-Ban Display companion app.

Install:

pip install -e ".[live,video]"

Add to config.yaml:

gateway:
  port: 30100
  enabled_modules: [proxy, skill, memory, multimodal_image, multimodal_video, live]

live:
  enabled: true
  glasses_http_enabled: true
  glasses_port: 8765
  gemini_live:
    enabled: true
    model: "gemini-2.5-flash"
    api_key: "${GEMINI_API_KEY}"

Start and check:

# Put GEMINI_API_KEY in .env.
visualclaw start
curl http://127.0.0.1:30100/v1/live/status

Endpoints:

POST      /v1/video/frame
POST      /v1/glasses/chat
WebSocket /ws/live
WebSocket /v1/live/glasses

Build the iOS companion app from ios/ and set its gateway host/port. Full setup is in docs/IOS_SETUP.md.


Project Layout

assets/                     # VisualClaw logo, teaser, demo, and pipeline media
visualclaw/
├── config/                 # Bundled: defaults.yaml, constitution_default.yaml
├── gateway/                # Core infrastructure (app, registry, pipeline, types)
└── modules/
    ├── proxy/              # LLM transparent proxy + Claw adapters
    ├── skill/              # Skill manager + evolver
    ├── memory/             # FAISS store + retriever
    ├── rl/                 # PRM scorer + trainer + formatter
    ├── multimodal/
    │   ├── image/          # Image extractor + describer
    │   └── video/          # 5-stage video pipeline (WebSocket)
    ├── governance/         # Constitution + cost tracker
    ├── scheduler/          # Idle window scheduler + calendar
    └── live/               # Gemini Live + MetaGlassesBridge

plugins/                    # Framework hook adapters
benchmark/                  # VisualClawArena harness; data is released separately
visualclaw/skills_seed/    # Benchmark seed skill banks
scripts/                    # Packaging, auth, smoke, and self-evolve helpers
docs/
├── architecture.md
├── quickstart.md
├── api.md
└── visualclawarena.md
constitution.yaml           # Project-level governance rules (customizable)

Large benchmark data, case-study media, and paper experiment outputs should live outside this repo or in the separate VisualClawArena dataset release.


Documentation


Citation

@misc{visualclaw2026,
  title        = {{VisualClaw}: A Real-Time, Personalized Agent for the Physical World},
  author       = {Tu, Haoqin and Chen, Jianwen and Wang, Zijun and Han, Siwei and Wu, Juncheng and Chen, Hardy and Ji, Haonian and Xiong, Kaiwen and Liu, Jiaqi and Xia, Peng and Mei, Jieru and Fei, Hongliang and Eshraghian, Jason and Zheng, Zeyu and Zhou, Yuyin and Yao, Huaxiu and Xie, Cihang},
  year         = {2026},
  eprint       = {2606.16295},
  archivePrefix = {arXiv},
  url          = {https://arxiv.org/abs/2606.16295},
  code         = {https://github.com/UCSC-VLAA/VisualClaw}
}

Acknowledgements

VisualClaw builds on ideas and infrastructure from MetaClaw, VisionClaw, Claude Code, Codex, OpenAI/Anthropic-compatible agent APIs, Gemini Live, Hugging Face datasets, and related work in self-evolving agents, multimodal agents, and egocentric-video benchmarks.

About

Official Implementation of VisualClaw: A Real-Time, Personalized Agent for the Physical World

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors