GitHub - UCSC-VLAA/VisualClaw: Official Implementation of VisualClaw: A Real-Time, Personalized Agent for the Physical World

Self-evolving multimodal agent infrastructure for vision-language agents

VisualClaw is a modular multimodal agent system for tool-using vision-language agents. It sits between agent frameworks such as Claude Code, Codex, and OpenClaw and upstream LLM providers, then adds prompt-time skills, retrieved memory, self-evolution, video processing, live vision, and evaluation hooks.

VisualClaw is the reusable agent system code. VisualClawArena is the separate 200-scenario benchmark/data release with videos, multi-round questions, workspaces, and evaluation outputs.

Highlights

Feature	What it does	How to enable
Gateway proxy	OpenAI/Anthropic-compatible gateway with pre/post hooks	`visualclaw start`, then point `OPENAI_BASE_URL` or `ANTHROPIC_BASE_URL` at the gateway
Claude Code backend	Runs scenarios with Claude Code as the tool-using agent	`claude setup-token`, then `scripts/run_visualclawarena_agent_smoke.sh`
Codex backend	Runs scenarios with Codex as the tool-using agent	`CODEX_ACCESS_TOKEN=... BACKEND=codex scripts/run_visualclawarena_agent_smoke.sh`
Skill injection	Retrieves relevant `SKILL.md` files and injects them into agent prompts	Configure `skill.skills_dir`, or pass `--skills-mode inject --skills-dir ...` to the arena runner
Memory retrieval	Stores prior outcomes and retrieves relevant context later	Gateway memory module, or runner flag `--memory`
Self-evolution	Evolves the skill bank from failed rounds and successful memory patterns	`scripts/run_visualclawarena_self_evolve.sh`
Video cascade	CPU-friendly dHash, lightweight encoding, change gate, and keyframe context; also includes `sb_cascade`, a streaming codec-metadata selector that avoids per-frame visual feature encoding	Install `.[video]`; use `--keyframe-mode cascade` or `--keyframe-mode sb_cascade --max-keyframes 8`
Live vision / glasses	Streams phone/glasses camera frames through the live gateway	Install `.[live,video]`, enable `live` config, see `docs/IOS_SETUP.md`
VisualClawArena packaging	Prepares the 200-scenario benchmark for Hugging Face	`scripts/prepare_visualclawarena_hf.py`

One-Line Start

git clone https://github.com/UCSC-VLAA/VisualClaw.git visualclaw && cd visualclaw && pip install -e ".[arena,live,video]"

Start the gateway:

visualclaw start

Run one VisualClawArena smoke case with Codex:

VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena BACKEND=codex SCENARIO=mmt_q1 scripts/run_visualclawarena_agent_smoke.sh

Run the same smoke case with Claude Code:

VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena SCENARIO=mmt_q1 scripts/run_visualclawarena_agent_smoke.sh

To use VisualClaw as a transparent gateway, point your agent framework's API base to http://localhost:30100.

Use Cases

Agent evaluation - run Claude Code or Codex on VisualClawArena scenarios with consistent video context, model settings, and scoring.
Self-evolving agents - use failure-aware memory pruning and schema/file-output preflight to evolve reusable SKILL.md instructions.
Gateway deployment - add skill and memory hooks to OpenAI/Anthropic-compatible agent stacks without changing the agent code.
Live vision and AI glasses - stream phone or smart-glasses camera frames into the same multimodal gateway and context pipeline.

Agent Framework Pipeline

VisualClaw keeps the agent framework separate from the infrastructure that prepares context. The gateway can inject skills, retrieve memory, normalize multimodal inputs, collect traces, evolve skills, and forward requests to the chosen provider.

VisualClawArena Agent Backends

VisualClawArena runs can use either Claude Code or Codex as the tool-using agent backend. Credentials are read from environment variables or the official CLI credential stores; do not commit tokens into this repository.

Each VisualClawArena scenario packages a video clip, role/context files, a mutable workspace, and multi-round instructions that require agents to inspect visual evidence, update files, and pass machine-checkable evaluations.

Install the benchmark/runtime dependencies:

pip install -e ".[arena]"
cp .env.example .env

Download the HF-format VisualClawArena dataset from Hugging Face, or place an equivalent local copy so it has this shape:

VisualClawArena/
  scenarios/<scenario_id>/spec/questions.json
  scenarios/<scenario_id>/data/clip/*.mp4

Claude Code

Claude Code is the default agent backend for the smoke helper.

claude setup-token
# Put these in .env:
# CLAUDE_CODE_OAUTH_TOKEN=<token printed by setup-token>
# VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena

scripts/bootstrap_agent_backend.sh claude-code
scripts/run_visualclawarena_agent_smoke.sh

You can also rely on an existing claude auth login session, or set ANTHROPIC_API_KEY instead of CLAUDE_CODE_OAUTH_TOKEN.

Codex

Codex uses the installed codex CLI. Use an access token or API key, then run the same smoke helper with BACKEND=codex:

# Put these in .env:
# CODEX_ACCESS_TOKEN=<codex access token>
# or: OPENAI_API_KEY=<OpenAI API key>
# VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena

scripts/bootstrap_agent_backend.sh codex
BACKEND=codex scripts/run_visualclawarena_agent_smoke.sh

The helper scripts load .env automatically. Process environment variables still take priority, so CI can inject secrets without editing files. The bootstrap script validates CLI availability and auth without writing secrets into the repo. The smoke helper runs one packaged scenario round, stages keyframes, invokes the agent backend, and writes a normal VisualClawArena results.json. Defaults are SCENARIO=mmt_q1, MAX_ROUNDS=1, MAX_KEYFRAMES=8, and EFFORT=medium.

Run VisualClawArena

Place or download the HF-format VisualClawArena dataset outside this code repo:

VisualClawArena/
  scenarios/<scenario_id>/spec/questions.json
  scenarios/<scenario_id>/spec/scripts/
  scenarios/<scenario_id>/data/workspace/
  scenarios/<scenario_id>/data/clip/*.mp4

Set the dataset root:

export VISUALCLAW_ARENA_ROOT=/path/to/VisualClawArena

Smoke test:

# Claude Code, default
scripts/run_visualclawarena_agent_smoke.sh

# Codex
BACKEND=codex scripts/run_visualclawarena_agent_smoke.sh

Useful overrides:

SCENARIO=mmt_s11 MAX_ROUNDS=3 MODEL=gpt-5.5 EFFORT=medium \
BACKEND=codex scripts/run_visualclawarena_agent_smoke.sh

Direct runner form:

python -m benchmark.visualclawarena.runner \
  --scenario-dir "$VISUALCLAW_ARENA_ROOT/scenarios/mmt_q1/spec" \
  --backend codex \
  --model gpt-5.5 \
  --max-keyframes 8 \
  --max-rounds 1 \
  --output-dir runs/mmt_q1_codex_smoke

Run Self-Evolution

Self-evolution uses a writable skill bank seeded from visualclaw/skills_seed/seed_universal_mc, memory retrieval, schema/file preflight, and a cross-scenario failure buffer.

Claude Code self-evolve:

# Put VISUALCLAW_ARENA_ROOT and CLAUDE_CODE_OAUTH_TOKEN in .env.

SCENARIO=mmt_q1 MAX_ROUNDS=0 scripts/run_visualclawarena_self_evolve.sh

Codex self-evolve:

# Put VISUALCLAW_ARENA_ROOT and OPENAI_API_KEY in .env.

BACKEND=codex SCENARIO=mmt_q1 MAX_ROUNDS=0 scripts/run_visualclawarena_self_evolve.sh

Key knobs:

SKILLS_EVOLVE_EVERY=5    # evolve after every N failures within a scenario
MEMORY_TOP_K=5           # retrieved memory cap
MEMORY_MAX_CHARS=4000    # memory render cap
EVOLVER_BACKEND=openai   # openai, claude-code, or bedrock
EVOLVER_MODEL=gpt-5.2    # optional model override

Outputs are written under runs/visualclawarena_self_evolve/, including results.json, a run-local evolved skill bank, and the failure buffer.

Package VisualClawArena For Hugging Face

The data release is intentionally separate from the system repo.

python scripts/prepare_visualclawarena_hf.py \
  --out /path/to/VisualClawArena \
  --copy-mode hardlink \
  --overwrite

The package includes scenario specs, clips/workspaces, manifests, evaluation summaries, per-question rows, and sanitized raw result JSONs. Upload with:

huggingface-cli upload <org-or-user>/VisualClawArena /path/to/VisualClawArena --repo-type dataset

Modules

Module	Endpoint prefix	Description
Proxy	`/v1/chat/completions`, `/v1/messages`	Transparent LLM forwarding: OpenAI / Anthropic / Bedrock / OpenRouter
Skill	`/v1/skill/`	Template/embedding/multimodal skill retrieval; LLM-driven skill evolution
Memory	`/v1/memory/`	FAISS vector memory; text+vision retrieval; user memory file
RL	`/v1/rl/`	PRM scoring; GRPO/OPD training; active model hot-swap
Multimodal/Image	`/v1/multimodal/image/`	Image extraction, description (VLM, disk-cached), format normalization
Multimodal/Video	`/v1/multimodal/video/`	WebSocket frame stream; 5-stage edge+server video pipeline
Governance	`/v1/governance/`	Constitution rules; cost tracking; skill evolution safety
Scheduler	`/v1/scheduler/`	Idle/sleep-window RL trigger; optional Google Calendar integration
Live	`/v1/live/`	Gemini Live + Meta Ray-Ban glasses bridge (dedicated port 8765)

Installation

# Core (gateway + proxy)
pip install -e .

# With embedding skill retrieval
pip install -e ".[embedding]"

# With RL training backend
pip install -e ".[rl]"

# With idle scheduler (psutil + Google Calendar)
pip install -e ".[scheduler]"

# With Gemini Live + audio
pip install -e ".[live]"

# With video pipeline (OpenCV)
pip install -e ".[video]"

# With VisualClawArena runner dependencies
pip install -e ".[arena]"

# Full public-system install
pip install -e ".[arena,embedding,evolve,live,video,scheduler]"

Configuration

Default config lives in visualclaw/config/defaults.yaml. Override by creating config.yaml in your project root:

gateway:
  port: 30100
  enabled_modules: [proxy, skill, memory, multimodal_image, multimodal_video]

skill:
  skills_dir: skills          # your local SKILL.md bank
  retrieval_mode: template    # template | embedding | multimodal

proxy:
  providers:
    openai:
      api_base: "https://api.openai.com/v1"
      api_key: "${OPENAI_API_KEY}"
    anthropic:
      api_base: "https://api.anthropic.com"
      api_key: "${ANTHROPIC_API_KEY}"

rl:
  enabled: false              # true = collect + train; false = collect only

Environment variable pattern: MM_<SECTION>_<KEY> (e.g. MM_GATEWAY_PORT=30200).

To migrate from v2 flat config:

visualclaw config migrate

Benchmark Pipeline And Results

VisualClawArena packages each scenario as text rounds plus video/workspace assets. The runner stages clips, extracts keyframes, invokes the chosen agent backend, records tool traces, and writes standard results.json outputs for later aggregation.

The figures above show the benchmark visualization style used for the 200-scenario VisualClawArena analysis: per-day accuracy curves and frame-level case studies with decision labels, memory, and evolved skills.

Plugin Integration

Framework adapters in plugins/ inject context and collect training data via gateway HTTP endpoints:

Framework	Plugin	Hooks
Claude Code	`plugins/claude_code/hooks_multi.sh`	`UserPromptSubmit` -> inject; `Stop` -> collect
OpenClaw	`plugins/openclaw/src/index.ts`	`before_model_resolve`, `before_prompt_build`, `agent_end`
Codex	`plugins/codex/hooks.sh`	`UserPromptSubmit`, `Stop`

Proxy mode (zero config): ANTHROPIC_BASE_URL=http://localhost:30100 - the gateway handles the full Pre/Post pipeline automatically.

Skills

Skills are plain SKILL.md files stored under the configured skill.skills_dir. The gateway retrieves relevant skills automatically and injects them into the prompt. The benchmark runner can use the same mechanism through --skills-mode inject --skills-dir <bank>.

For VisualClawArena, use the shipped seed bank:

visualclaw/skills_seed/seed_universal_mc

For your own deployment, create a local bank:

mkdir -p skills/my-skill

SKILL.md format:

---
name: debug-systematically
description: Use when diagnosing a bug to avoid guessing.
category: coding
---

## Debug Systematically

1. Reproduce the error with a minimal test case...

Self-evolution writes new skills into a run-local writable bank, never directly into the version-controlled seed bank.

Live Vision And AI Glasses

VisualClaw can run as a live vision gateway for a phone camera or Meta Ray-Ban Display companion app.

Install:

pip install -e ".[live,video]"

Add to config.yaml:

gateway:
  port: 30100
  enabled_modules: [proxy, skill, memory, multimodal_image, multimodal_video, live]

live:
  enabled: true
  glasses_http_enabled: true
  glasses_port: 8765
  gemini_live:
    enabled: true
    model: "gemini-2.5-flash"
    api_key: "${GEMINI_API_KEY}"

Start and check:

# Put GEMINI_API_KEY in .env.
visualclaw start
curl http://127.0.0.1:30100/v1/live/status

Endpoints:

POST      /v1/video/frame
POST      /v1/glasses/chat
WebSocket /ws/live
WebSocket /v1/live/glasses

Build the iOS companion app from ios/ and set its gateway host/port. Full setup is in docs/IOS_SETUP.md.

Project Layout

assets/                     # VisualClaw logo, teaser, demo, and pipeline media
visualclaw/
├── config/                 # Bundled: defaults.yaml, constitution_default.yaml
├── gateway/                # Core infrastructure (app, registry, pipeline, types)
└── modules/
    ├── proxy/              # LLM transparent proxy + Claw adapters
    ├── skill/              # Skill manager + evolver
    ├── memory/             # FAISS store + retriever
    ├── rl/                 # PRM scorer + trainer + formatter
    ├── multimodal/
    │   ├── image/          # Image extractor + describer
    │   └── video/          # 5-stage video pipeline (WebSocket)
    ├── governance/         # Constitution + cost tracker
    ├── scheduler/          # Idle window scheduler + calendar
    └── live/               # Gemini Live + MetaGlassesBridge

plugins/                    # Framework hook adapters
benchmark/                  # VisualClawArena harness; data is released separately
visualclaw/skills_seed/    # Benchmark seed skill banks
scripts/                    # Packaging, auth, smoke, and self-evolve helpers
docs/
├── architecture.md
├── quickstart.md
├── api.md
└── visualclawarena.md
constitution.yaml           # Project-level governance rules (customizable)

Large benchmark data, case-study media, and paper experiment outputs should live outside this repo or in the separate VisualClawArena dataset release.

Documentation

Architecture — Gateway design, module lifecycle, Pre/Post pipeline
Quickstart & Usage Guide — Setup, framework integration, config reference
API Reference — All HTTP/WebSocket endpoints
VisualClawArena — Benchmark/data release boundary

Citation

@misc{visualclaw2026,
  title        = {{VisualClaw}: A Real-Time, Personalized Agent for the Physical World},
  author       = {Tu, Haoqin and Chen, Jianwen and Wang, Zijun and Han, Siwei and Wu, Juncheng and Chen, Hardy and Ji, Haonian and Xiong, Kaiwen and Liu, Jiaqi and Xia, Peng and Mei, Jieru and Fei, Hongliang and Eshraghian, Jason and Zheng, Zeyu and Zhou, Yuyin and Yao, Huaxiu and Xie, Cihang},
  year         = {2026},
  eprint       = {2606.16295},
  archivePrefix = {arXiv},
  url          = {https://arxiv.org/abs/2606.16295},
  code         = {https://github.com/UCSC-VLAA/VisualClaw}
}

Acknowledgements

VisualClaw builds on ideas and infrastructure from MetaClaw, VisionClaw, Claude Code, Codex, OpenAI/Anthropic-compatible agent APIs, Gemini Live, Hugging Face datasets, and related work in self-evolving agents, multimodal agents, and egocentric-video benchmarks.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.github/workflows		.github/workflows
assets		assets
benchmark		benchmark
docs		docs
ios		ios
plugins		plugins
scripts		scripts
tests		tests
visualclaw		visualclaw
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
constitution.yaml		constitution.yaml
pyproject.toml		pyproject.toml
requirements-visualclawarena.txt		requirements-visualclawarena.txt
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Self-evolving multimodal agent infrastructure for vision-language agents

Highlights

One-Line Start

Use Cases

Agent Framework Pipeline

VisualClawArena Agent Backends

Claude Code

Codex

Run VisualClawArena

Run Self-Evolution

Package VisualClawArena For Hugging Face

Modules

Installation

Configuration

Benchmark Pipeline And Results

Plugin Integration

Skills

Live Vision And AI Glasses

Project Layout

Documentation

Citation

Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Self-evolving multimodal agent infrastructure for vision-language agents

Highlights

One-Line Start

Use Cases

Agent Framework Pipeline

VisualClawArena Agent Backends

Claude Code

Codex

Run VisualClawArena

Run Self-Evolution

Package VisualClawArena For Hugging Face

Modules

Installation

Configuration

Benchmark Pipeline And Results

Plugin Integration

Skills

Live Vision And AI Glasses

Project Layout

Documentation

Citation

Acknowledgements

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages