VisualClaw is a modular multimodal agent system for tool-using vision-language agents. It sits between agent frameworks such as Claude Code, Codex, and OpenClaw and upstream LLM providers, then adds prompt-time skills, retrieved memory, self-evolution, video processing, live vision, and evaluation hooks.
VisualClaw is the reusable agent system code. VisualClawArena is the separate 200-scenario benchmark/data release with videos, multi-round questions, workspaces, and evaluation outputs.
| Feature | What it does | How to enable |
|---|---|---|
| Gateway proxy | OpenAI/Anthropic-compatible gateway with pre/post hooks | visualclaw start, then point OPENAI_BASE_URL or ANTHROPIC_BASE_URL at the gateway |
| Claude Code backend | Runs scenarios with Claude Code as the tool-using agent | claude setup-token, then scripts/run_visualclawarena_agent_smoke.sh |
| Codex backend | Runs scenarios with Codex as the tool-using agent | CODEX_ACCESS_TOKEN=... BACKEND=codex scripts/run_visualclawarena_agent_smoke.sh |
| Skill injection | Retrieves relevant SKILL.md files and injects them into agent prompts |
Configure skill.skills_dir, or pass --skills-mode inject --skills-dir ... to the arena runner |
| Memory retrieval | Stores prior outcomes and retrieves relevant context later | Gateway memory module, or runner flag --memory |
| Self-evolution | Evolves the skill bank from failed rounds and successful memory patterns | scripts/run_visualclawarena_self_evolve.sh |
| Video cascade | CPU-friendly dHash, lightweight encoding, change gate, and keyframe context; also includes sb_cascade, a streaming codec-metadata selector that avoids per-frame visual feature encoding |
Install .[video]; use --keyframe-mode cascade or --keyframe-mode sb_cascade --max-keyframes 8 |
| Live vision / glasses | Streams phone/glasses camera frames through the live gateway | Install .[live,video], enable live config, see docs/IOS_SETUP.md |
| VisualClawArena packaging | Prepares the 200-scenario benchmark for Hugging Face | scripts/prepare_visualclawarena_hf.py |
git clone https://github.com/UCSC-VLAA/VisualClaw.git visualclaw && cd visualclaw && pip install -e ".[arena,live,video]"Start the gateway:
visualclaw startRun one VisualClawArena smoke case with Codex:
VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena BACKEND=codex SCENARIO=mmt_q1 scripts/run_visualclawarena_agent_smoke.shRun the same smoke case with Claude Code:
VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena SCENARIO=mmt_q1 scripts/run_visualclawarena_agent_smoke.shTo use VisualClaw as a transparent gateway, point your agent framework's API base to http://localhost:30100.
- Agent evaluation - run Claude Code or Codex on VisualClawArena scenarios with consistent video context, model settings, and scoring.
- Self-evolving agents - use failure-aware memory pruning and schema/file-output preflight to evolve reusable
SKILL.mdinstructions. - Gateway deployment - add skill and memory hooks to OpenAI/Anthropic-compatible agent stacks without changing the agent code.
- Live vision and AI glasses - stream phone or smart-glasses camera frames into the same multimodal gateway and context pipeline.
VisualClaw keeps the agent framework separate from the infrastructure that prepares context. The gateway can inject skills, retrieve memory, normalize multimodal inputs, collect traces, evolve skills, and forward requests to the chosen provider.
VisualClawArena runs can use either Claude Code or Codex as the tool-using agent backend. Credentials are read from environment variables or the official CLI credential stores; do not commit tokens into this repository.
Each VisualClawArena scenario packages a video clip, role/context files, a mutable workspace, and multi-round instructions that require agents to inspect visual evidence, update files, and pass machine-checkable evaluations.
Install the benchmark/runtime dependencies:
pip install -e ".[arena]"
cp .env.example .envDownload the HF-format VisualClawArena dataset from Hugging Face, or place an equivalent local copy so it has this shape:
VisualClawArena/
scenarios/<scenario_id>/spec/questions.json
scenarios/<scenario_id>/data/clip/*.mp4
Claude Code is the default agent backend for the smoke helper.
claude setup-token
# Put these in .env:
# CLAUDE_CODE_OAUTH_TOKEN=<token printed by setup-token>
# VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena
scripts/bootstrap_agent_backend.sh claude-code
scripts/run_visualclawarena_agent_smoke.shYou can also rely on an existing claude auth login session, or set
ANTHROPIC_API_KEY instead of CLAUDE_CODE_OAUTH_TOKEN.
Codex uses the installed codex CLI. Use an access token or API key, then run
the same smoke helper with BACKEND=codex:
# Put these in .env:
# CODEX_ACCESS_TOKEN=<codex access token>
# or: OPENAI_API_KEY=<OpenAI API key>
# VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena
scripts/bootstrap_agent_backend.sh codex
BACKEND=codex scripts/run_visualclawarena_agent_smoke.shThe helper scripts load .env automatically. Process environment variables still take priority, so CI can inject secrets without editing files. The bootstrap script validates CLI availability and auth without writing secrets into the repo. The smoke helper runs one packaged scenario round, stages keyframes, invokes the agent backend, and writes a normal VisualClawArena results.json. Defaults are SCENARIO=mmt_q1, MAX_ROUNDS=1, MAX_KEYFRAMES=8, and EFFORT=medium.
Place or download the HF-format VisualClawArena dataset outside this code repo:
VisualClawArena/
scenarios/<scenario_id>/spec/questions.json
scenarios/<scenario_id>/spec/scripts/
scenarios/<scenario_id>/data/workspace/
scenarios/<scenario_id>/data/clip/*.mp4
Set the dataset root:
export VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArenaSmoke test:
# Claude Code, default
scripts/run_visualclawarena_agent_smoke.sh
# Codex
BACKEND=codex scripts/run_visualclawarena_agent_smoke.shUseful overrides:
SCENARIO=mmt_s11 MAX_ROUNDS=3 MODEL=gpt-5.5 EFFORT=medium \
BACKEND=codex scripts/run_visualclawarena_agent_smoke.shDirect runner form:
python -m benchmark.visualclawarena.runner \
--scenario-dir "$VISUALCLAW_ARENA_ROOT/scenarios/mmt_q1/spec" \
--backend codex \
--model gpt-5.5 \
--max-keyframes 8 \
--max-rounds 1 \
--output-dir runs/mmt_q1_codex_smokeSelf-evolution uses a writable skill bank seeded from visualclaw/skills_seed/seed_universal_mc, memory retrieval, schema/file preflight, and a cross-scenario failure buffer.
Claude Code self-evolve:
# Put VISUALCLAW_ARENA_ROOT and CLAUDE_CODE_OAUTH_TOKEN in .env.
SCENARIO=mmt_q1 MAX_ROUNDS=0 scripts/run_visualclawarena_self_evolve.shCodex self-evolve:
# Put VISUALCLAW_ARENA_ROOT and OPENAI_API_KEY in .env.
BACKEND=codex SCENARIO=mmt_q1 MAX_ROUNDS=0 scripts/run_visualclawarena_self_evolve.shKey knobs:
SKILLS_EVOLVE_EVERY=5 # evolve after every N failures within a scenario
MEMORY_TOP_K=5 # retrieved memory cap
MEMORY_MAX_CHARS=4000 # memory render cap
EVOLVER_BACKEND=openai # openai, claude-code, or bedrock
EVOLVER_MODEL=gpt-5.2 # optional model overrideOutputs are written under runs/visualclawarena_self_evolve/, including results.json, a run-local evolved skill bank, and the failure buffer.
The data release is intentionally separate from the system repo.
python scripts/prepare_visualclawarena_hf.py \
--out /path/to/VisualClawArena \
--copy-mode hardlink \
--overwriteThe package includes scenario specs, clips/workspaces, manifests, evaluation summaries, per-question rows, and sanitized raw result JSONs. Upload with:
huggingface-cli upload <org-or-user>/VisualClawArena /path/to/VisualClawArena --repo-type dataset| Module | Endpoint prefix | Description |
|---|---|---|
| Proxy | /v1/chat/completions, /v1/messages |
Transparent LLM forwarding: OpenAI / Anthropic / Bedrock / OpenRouter |
| Skill | /v1/skill/ |
Template/embedding/multimodal skill retrieval; LLM-driven skill evolution |
| Memory | /v1/memory/ |
FAISS vector memory; text+vision retrieval; user memory file |
| RL | /v1/rl/ |
PRM scoring; GRPO/OPD training; active model hot-swap |
| Multimodal/Image | /v1/multimodal/image/ |
Image extraction, description (VLM, disk-cached), format normalization |
| Multimodal/Video | /v1/multimodal/video/ |
WebSocket frame stream; 5-stage edge+server video pipeline |
| Governance | /v1/governance/ |
Constitution rules; cost tracking; skill evolution safety |
| Scheduler | /v1/scheduler/ |
Idle/sleep-window RL trigger; optional Google Calendar integration |
| Live | /v1/live/ |
Gemini Live + Meta Ray-Ban glasses bridge (dedicated port 8765) |
# Core (gateway + proxy)
pip install -e .
# With embedding skill retrieval
pip install -e ".[embedding]"
# With RL training backend
pip install -e ".[rl]"
# With idle scheduler (psutil + Google Calendar)
pip install -e ".[scheduler]"
# With Gemini Live + audio
pip install -e ".[live]"
# With video pipeline (OpenCV)
pip install -e ".[video]"
# With VisualClawArena runner dependencies
pip install -e ".[arena]"
# Full public-system install
pip install -e ".[arena,embedding,evolve,live,video,scheduler]"Default config lives in visualclaw/config/defaults.yaml. Override by creating config.yaml in your project root:
gateway:
port: 30100
enabled_modules: [proxy, skill, memory, multimodal_image, multimodal_video]
skill:
skills_dir: skills # your local SKILL.md bank
retrieval_mode: template # template | embedding | multimodal
proxy:
providers:
openai:
api_base: "https://api.openai.com/v1"
api_key: "${OPENAI_API_KEY}"
anthropic:
api_base: "https://api.anthropic.com"
api_key: "${ANTHROPIC_API_KEY}"
rl:
enabled: false # true = collect + train; false = collect onlyEnvironment variable pattern: MM_<SECTION>_<KEY> (e.g. MM_GATEWAY_PORT=30200).
To migrate from v2 flat config:
visualclaw config migrateVisualClawArena packages each scenario as text rounds plus video/workspace assets. The runner stages clips, extracts keyframes, invokes the chosen agent backend, records tool traces, and writes standard results.json outputs for later aggregation.
The figures above show the benchmark visualization style used for the 200-scenario VisualClawArena analysis: per-day accuracy curves and frame-level case studies with decision labels, memory, and evolved skills.
Framework adapters in plugins/ inject context and collect training data via gateway HTTP endpoints:
| Framework | Plugin | Hooks |
|---|---|---|
| Claude Code | plugins/claude_code/hooks_multi.sh |
UserPromptSubmit -> inject; Stop -> collect |
| OpenClaw | plugins/openclaw/src/index.ts |
before_model_resolve, before_prompt_build, agent_end |
| Codex | plugins/codex/hooks.sh |
UserPromptSubmit, Stop |
Proxy mode (zero config): ANTHROPIC_BASE_URL=http://localhost:30100 - the gateway handles the full Pre/Post pipeline automatically.
Skills are plain SKILL.md files stored under the configured skill.skills_dir.
The gateway retrieves relevant skills automatically and injects them into the
prompt. The benchmark runner can use the same mechanism through
--skills-mode inject --skills-dir <bank>.
For VisualClawArena, use the shipped seed bank:
visualclaw/skills_seed/seed_universal_mcFor your own deployment, create a local bank:
mkdir -p skills/my-skillSKILL.md format:
---
name: debug-systematically
description: Use when diagnosing a bug to avoid guessing.
category: coding
---
## Debug Systematically
1. Reproduce the error with a minimal test case...Self-evolution writes new skills into a run-local writable bank, never directly into the version-controlled seed bank.
VisualClaw can run as a live vision gateway for a phone camera or Meta Ray-Ban Display companion app.
Install:
pip install -e ".[live,video]"Add to config.yaml:
gateway:
port: 30100
enabled_modules: [proxy, skill, memory, multimodal_image, multimodal_video, live]
live:
enabled: true
glasses_http_enabled: true
glasses_port: 8765
gemini_live:
enabled: true
model: "gemini-2.5-flash"
api_key: "${GEMINI_API_KEY}"Start and check:
# Put GEMINI_API_KEY in .env.
visualclaw start
curl http://127.0.0.1:30100/v1/live/statusEndpoints:
POST /v1/video/frame
POST /v1/glasses/chat
WebSocket /ws/live
WebSocket /v1/live/glasses
Build the iOS companion app from ios/ and set its gateway host/port. Full setup is in docs/IOS_SETUP.md.
assets/ # VisualClaw logo, teaser, demo, and pipeline media
visualclaw/
├── config/ # Bundled: defaults.yaml, constitution_default.yaml
├── gateway/ # Core infrastructure (app, registry, pipeline, types)
└── modules/
├── proxy/ # LLM transparent proxy + Claw adapters
├── skill/ # Skill manager + evolver
├── memory/ # FAISS store + retriever
├── rl/ # PRM scorer + trainer + formatter
├── multimodal/
│ ├── image/ # Image extractor + describer
│ └── video/ # 5-stage video pipeline (WebSocket)
├── governance/ # Constitution + cost tracker
├── scheduler/ # Idle window scheduler + calendar
└── live/ # Gemini Live + MetaGlassesBridge
plugins/ # Framework hook adapters
benchmark/ # VisualClawArena harness; data is released separately
visualclaw/skills_seed/ # Benchmark seed skill banks
scripts/ # Packaging, auth, smoke, and self-evolve helpers
docs/
├── architecture.md
├── quickstart.md
├── api.md
└── visualclawarena.md
constitution.yaml # Project-level governance rules (customizable)
Large benchmark data, case-study media, and paper experiment outputs should live outside this repo or in the separate VisualClawArena dataset release.
- Architecture — Gateway design, module lifecycle, Pre/Post pipeline
- Quickstart & Usage Guide — Setup, framework integration, config reference
- API Reference — All HTTP/WebSocket endpoints
- VisualClawArena — Benchmark/data release boundary
@misc{visualclaw2026,
title = {{VisualClaw}: A Real-Time, Personalized Agent for the Physical World},
author = {Tu, Haoqin and Chen, Jianwen and Wang, Zijun and Han, Siwei and Wu, Juncheng and Chen, Hardy and Ji, Haonian and Xiong, Kaiwen and Liu, Jiaqi and Xia, Peng and Mei, Jieru and Fei, Hongliang and Eshraghian, Jason and Zheng, Zeyu and Zhou, Yuyin and Yao, Huaxiu and Xie, Cihang},
year = {2026},
eprint = {2606.16295},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2606.16295},
code = {https://github.com/UCSC-VLAA/VisualClaw}
}VisualClaw builds on ideas and infrastructure from MetaClaw, VisionClaw, Claude Code, Codex, OpenAI/Anthropic-compatible agent APIs, Gemini Live, Hugging Face datasets, and related work in self-evolving agents, multimodal agents, and egocentric-video benchmarks.






