Authors: Pomilon (Pomilon Intelligence Research)
Date: May 2026
Repository: github.com/Pomilon-Intelligence-Lab/DVOC
Paper: research.pomilon.xyz/papers/dvoc
We introduce DVOC (Draft-Verify-Action Loop), a recursive spatial refinement protocol that enables weak vision-language models to perform pixel-precise UI grounding through iterative visual feedback. DVOC runs a closed loop: render a predicted click position as a visual overlay (crosshair + uncertainty circle), ask the model to estimate the offset to the target, apply the correction, and repeat until convergence.
In a multi-model benchmark across 4 vision-language models (Gemini 3.1 Flash Lite, Gemini 2.5 Flash Lite, Gemma 4 26B, Gemma 4 31B) on 11 UI grounding tasks (6 synthetic + 5 web), DVOC achieved 53–67% raw convergence (63–100% protocol convergence) within 30px across all models. The strongest model, Gemma 4 31B, reached 67% DVOC raw success (93% protocol) with a mean final distance of 34px. Gemma 4 26B achieved 100% protocol convergence on every completed trial (20px mean distance). The naive one-shot baseline achieved 0–14% — all models consistently fixated on wrong coordinates with confidence 1.0, unable to self-correct without spatial feedback.
DVOC trades per-call accuracy for iterative refinement, making cheap models viable for tasks that previously required expensive, pixel-perfect vision models. The protocol is more important than the model: even the weakest model with DVOC outperforms every model's naive baseline.
UI grounding — mapping natural language instructions ("click the login button") to precise pixel coordinates — is a fundamental capability for autonomous agents. Current approaches fall into two camps:
- One-shot vision models (GPT-4V, Gemini 2.5 Pro, etc.) attempt to predict coordinates in a single inference. This works when the model is large and capable, but is expensive (~$5–10 per 1M tokens) and brittle: a single misprediction fails silently.
- DOM-based approaches (Playwright, Selenium) use structured document access rather than vision, but fail on canvas-based UI, shadow DOM, virtualized lists, and any non-browser environment.
DVOC takes a third approach: instead of making the model smarter, make the loop smarter. A cheap vision model iteratively refines its prediction by observing the visual consequences of its previous estimate. The model doesn't need to be pixel-perfect — it only needs to be directionally correct ("the target is to the right and down from the crosshair").
Planner → Backend (capture) → Renderer (overlay) → Verifier (VLM) → Corrector → Policy → ACT
Planner generates the initial coordinate hypothesis. Supports manual mode (fixed target for testing) and model mode (VLM generates initial guess from task + screenshot).
Renderer draws visual feedback on the captured screenshot:
- Red crosshair (+) at the predicted coordinate, 3px thick, 30px arms
- Blue uncertainty circle, 3px thick, fixed 30px radius (provides spatial reference)
- Iteration counter and coordinate label with white-on-black text
Verifier is the core innovation. The annotated screenshot is sent to a vision-language model together with a prompt that explains the DVOC mechanism and supplies the previous iteration history and the task. The model responds by calling a report_offset(dx, dy, confidence) function.
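As an illustration of the Verifier's tool interface, the sketch below declares report_offset in the OpenAI-style function-calling format that LiteLLM accepts through its `tools` argument. Only the function name and its three arguments come from the text. The descriptions, the sign convention, the confidence range and the `parse_report_offset` helper are assumptions made for this example.

```python
import json

# Hypothetical tool declaration. Only the name "report_offset" and the
# arguments dx, dy, confidence are taken from the paper; the rest is assumed.
REPORT_OFFSET_TOOL = {
    "type": "function",
    "function": {
        "name": "report_offset",
        "description": "Report the offset from the crosshair to the target.",
        "parameters": {
            "type": "object",
            "properties": {
                # Assumed convention: image coordinates, +x right, +y down.
                "dx": {"type": "number", "description": "Horizontal offset in pixels."},
                "dy": {"type": "number", "description": "Vertical offset in pixels."},
                "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
            },
            "required": ["dx", "dy", "confidence"],
        },
    },
}


def parse_report_offset(arguments_json: str) -> tuple[float, float, float]:
    """Parse the JSON arguments of a report_offset call, clamping confidence to [0, 1]."""
    args = json.loads(arguments_json)
    dx, dy = float(args["dx"]), float(args["dy"])
    confidence = min(1.0, max(0.0, float(args["confidence"])))
    return dx, dy, confidence
```

Clamping the confidence is a defensive choice for this sketch, since a model may return values outside the declared range.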
Corrector applies the offset with exponential alpha decay, floored at 0.25, where k is the iteration index:
α(0) = 1.0, α(1) = 0.5, α(k) = 0.25 for k ≥ 2
new_position = old_position + α(k) × error_vector
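The correction rule above can be sketched in a few lines of Python. The function names `alpha` and `correct` are chosen for this example and are not taken from the DVOC codebase. The schedule values are the ones stated in the text.

```python
def alpha(iteration: int) -> float:
    """Damping factor: 1.0 on the first step, 0.5 on the second, 0.25 afterwards."""
    if iteration == 0:
        return 1.0
    if iteration == 1:
        return 0.5
    return 0.25


def correct(position: tuple[float, float],
            offset: tuple[float, float],
            iteration: int) -> tuple[float, float]:
    """Apply new_position = old_position + alpha * error_vector."""
    a = alpha(iteration)
    return (position[0] + a * offset[0], position[1] + a * offset[1])


# Example: the first step is applied in full, later steps are damped.
first = correct((100, 100), (375, 333), iteration=0)
second = correct(first, (-40, 20), iteration=1)
```

With these inputs the first step moves the crosshair by the full reported offset, and the second step applies only half of the reported (-40, 20).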
Policy decides the next action:
- REFINE: error too large or confidence too low, continue loop
- ACT_SAFE: error < ε and confidence > threshold for 2 consecutive iterations
- ABORT_OSCILLATION: error magnitude non-decreasing for 3+ iterations
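A minimal sketch of such a policy follows. It assumes that "error" means the magnitude of the offset reported by the Verifier, and it uses the paper's stated values of ε = 30px, a confidence threshold of 0.5 and a 2-iteration window. Reading "non-decreasing for 3+ iterations" as "the last three reported errors never decrease" is one possible interpretation, and the `decide` function and its parameter names are hypothetical.

```python
import math
from enum import Enum


class Action(Enum):
    REFINE = "REFINE"
    ACT_SAFE = "ACT_SAFE"
    ABORT_OSCILLATION = "ABORT_OSCILLATION"


def decide(history: list[tuple[float, float, float]],
           epsilon: float = 30.0,
           min_confidence: float = 0.5,
           window: int = 2,
           stall_limit: int = 3) -> Action:
    """history holds (dx, dy, confidence) reports, oldest first."""
    errors = [math.hypot(dx, dy) for dx, dy, _ in history]

    # ACT_SAFE: the last `window` reports are all small and confident.
    recent = history[-window:]
    if len(recent) == window and all(
        math.hypot(dx, dy) < epsilon and conf >= min_confidence
        for dx, dy, conf in recent
    ):
        return Action.ACT_SAFE

    # ABORT_OSCILLATION (assumed reading): the last `stall_limit` errors never decrease.
    tail = errors[-stall_limit:]
    if len(tail) == stall_limit and all(b >= a for a, b in zip(tail, tail[1:])):
        return Action.ABORT_OSCILLATION

    return Action.REFINE
```

Checking ACT_SAFE before the oscillation test means a run that is already inside ε is never aborted, which seems the safer ordering for a sketch.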
The Verifier uses a ReAct (Reasoning + Acting) prompt that:
- Explains the DVOC mechanism (how crosshair, circle, and correction work)
- Shows previous iteration history ("You were at (100,100) and said dx=375, dy=333...")
- Asks the model to reason step-by-step before calling the tool
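To make the history component concrete, the sketch below formats previous iterations into text in the style of the example quoted above. The exact wording of the real prompt is not given in the paper, so the phrasing and the `format_history` helper are assumptions.

```python
def format_history(steps: list[tuple[tuple[int, int], tuple[int, int]]]) -> str:
    """Render ((x, y), (dx, dy)) pairs, oldest first, as prompt text."""
    if not steps:
        return "No previous iterations."
    lines = []
    for i, ((x, y), (dx, dy)) in enumerate(steps, start=1):
        lines.append(f"Iteration {i}: you were at ({x},{y}) and said dx={dx}, dy={dy}.")
    return "\n".join(lines)


history_text = format_history([((100, 100), (375, 333)), ((475, 433), (-40, 20))])
```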
Convergence requires error magnitude < ε (30px) AND confidence ≥ 0.5 for 2 consecutive iterations.
- Scene (synthetic): 800×600 synthetic desktop with a dark background (RGB 40,42,48) containing UI elements at 6 target positions (S1–S6)
- Scene (web): Styled HTML dashboard at 1280×720 viewport containing 5 UI element types (W1–W5)
- Models: 4 vision-language models: gemini-3.1-flash-lite, gemini-2.5-flash-lite, gemma-4-26b, gemma-4-31b
- Access via LiteLLM unified interface
- Epsilon: 30px
- Convergence window: 2 consecutive iterations
- Rate limiting: Free API tiers with ~15 req/min limits; scheduler uses quota-aware retry (3-retry cap, 30s cooldown) and hybrid provider interleaving
- Select a random starting crosshair position $p_0 \in [50, 750] \times [50, 550]$
- Run the DVOC loop until convergence or 15 iterations
- Record: final distance to target, iterations, model calls
- Each (model, task, method) combination was repeated 3–5 times from varying starting positions
- Total: 233 DVOC trials + 230 naive trials = 463 total (396 valid, 67 lost to API errors)
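The DVOC trial procedure can be sketched end to end with a stand-in for the vision-language model. In the sketch below, `simulated_verifier` is a purely hypothetical substitute that reports 80% of the true offset with a fixed confidence of 0.9. It is not a model of any benchmarked VLM and exists only so the loop can run offline. The start range, ε, confidence threshold, window and iteration cap are the values stated in the text.

```python
import math
import random


def alpha(iteration: int) -> float:
    return 1.0 if iteration == 0 else 0.5 if iteration == 1 else 0.25


def random_start(rng: random.Random) -> tuple[float, float]:
    """Sample p0 uniformly from [50, 750] x [50, 550]."""
    return (rng.uniform(50, 750), rng.uniform(50, 550))


def simulated_verifier(position, target, gain=0.8):
    """Hypothetical stand-in for the VLM: under-reports the true offset by `gain`."""
    return (gain * (target[0] - position[0]),
            gain * (target[1] - position[1]),
            0.9)


def run_trial(start, target, epsilon=30.0, min_confidence=0.5,
              window=2, max_iters=15, verifier=simulated_verifier):
    position, streak, calls = start, 0, 0
    for iteration in range(max_iters):
        dx, dy, conf = verifier(position, target)
        calls += 1
        if math.hypot(dx, dy) < epsilon and conf >= min_confidence:
            streak += 1
            if streak >= window:
                break  # converged: stop without moving again
        else:
            streak = 0
        a = alpha(iteration)
        position = (position[0] + a * dx, position[1] + a * dy)
    return {
        "final_distance": math.dist(position, target),
        "model_calls": calls,
        "converged": streak >= window,
    }


result = run_trial(start=(100, 100), target=(400, 300))
```

Note that convergence here is judged on the reported offset, as in the protocol, while `final_distance` is measured against the ground-truth target, as in the benchmark.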
- Present raw screenshot (no overlay) to the model
- Ask "Click the target. Call click_at(x, y, confidence)."
- If predicted position is within ε of target, stop
- Otherwise, repeat with the same raw screenshot (no feedback)
- Maximum 5 retries per attempt
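For comparison, the naive baseline reduces to a retry loop with no feedback. The sketch below treats "maximum 5 retries" as 5 model calls in total, which is an assumption. The `click_model` callable is a hypothetical stand-in for the click_at(x, y, confidence) tool call.

```python
import math


def run_naive(click_model, target, epsilon=30.0, max_calls=5):
    """Ask for a click up to `max_calls` times; the input never changes between calls."""
    calls = 0
    x = y = None
    for _ in range(max_calls):
        x, y, _confidence = click_model()  # same raw screenshot, no overlay, no history
        calls += 1
        if math.dist((x, y), target) < epsilon:
            return {"success": True, "model_calls": calls,
                    "final_distance": math.dist((x, y), target)}
    return {"success": False, "model_calls": calls,
            "final_distance": math.dist((x, y), target)}


# A model that always answers (500, 500) with confidence 1.0 cannot recover.
fixated = run_naive(lambda: (500, 500, 1.0), target=(400, 300))
```

Because nothing in the input changes between calls, a deterministic model that misses once will miss every time, which matches the fixation behaviour reported for the baseline.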
DVOC was benchmarked across 4 models, 11 tasks (6 synthetic S1–S6, 5 web W1–W5), and 2 methods (DVOC vs naive). A total of 463 trials (396 valid, 67 API errors) were processed over 21.9 hours of cumulative benchmark time. The convergence threshold was 30px from the target center.
| Model | Method | Total | Err | Valid | Success | Raw Rate | Proto Rate | Mean Dist | Mean Iters |
|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Flash Lite | DVOC | 115 | 1 | 114 | 72 | 63% | 63% | 41.9px | 5.2 |
| Gemini 3.1 Flash Lite | NAIVE | 115 | 1 | 114 | 15 | 13% | 13% | 178.9px | 4.5 |
| Gemini 2.5 Flash Lite | DVOC | 6 | 0 | 6 | 2 | 33% | 33% | 114.8px | 3.7 |
| Gemini 2.5 Flash Lite | NAIVE | 4 | 0 | 4 | 0 | 0% | 0% | 274.7px | 5.0 |
| Gemma 4 26B | DVOC | 55 | 26 | 29 | 29 | 53% | 100% | 19.6px | 5.6 |
| Gemma 4 26B | NAIVE | 55 | 11 | 44 | 6 | 11% | 14% | 170.8px | 4.5 |
| Gemma 4 31B | DVOC | 57 | 16 | 41 | 38 | 67% | 93% | 34.4px | 7.1 |
| Gemma 4 31B | NAIVE | 56 | 12 | 44 | 8 | 14% | 18% | 166.6px | 4.3 |
Note: "Raw Rate" = successes / total (API errors counted as failures). "Proto Rate" = successes / valid (API errors excluded).
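The two rates can be recomputed from the Total, Err and Success columns. The helper below is a small sketch with a name chosen for this example. It uses the Gemma 4 26B and Gemma 4 31B DVOC rows as inputs.

```python
def rates(total: int, errors: int, successes: int) -> tuple[float, float]:
    """Return (raw_rate, protocol_rate) as fractions.

    raw_rate counts API errors as failures; protocol_rate excludes them.
    """
    valid = total - errors
    raw = successes / total
    proto = successes / valid if valid else float("nan")
    return raw, proto


raw_26b, proto_26b = rates(total=55, errors=26, successes=29)
raw_31b, proto_31b = rates(total=57, errors=16, successes=38)
print(f"Gemma 4 26B: raw {raw_26b:.0%}, protocol {proto_26b:.0%}")
print(f"Gemma 4 31B: raw {raw_31b:.0%}, protocol {proto_31b:.0%}")
```

The gap between the two rates for Gemma 4 26B comes entirely from the 26 trials lost to API errors.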
Key findings:
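- DVOC dominates naive across every model: for the three models with substantial trial counts, DVOC raw success rates (53–67%, 63–100% protocol) far exceed naive (11–14% raw, 13–18% protocol). Gemini 2.5 Flash Lite, limited to 10 trials by quota, shows the same direction (33% DVOC vs 0% naive).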
- Gemma 4 31B is the strongest model: 67% raw DVOC success (93% protocol) with mean 34.4px final distance.
- Gemma 4 26B achieves perfect DVOC protocol success (100%) on every completed trial, with the lowest mean distance (19.6px), but suffers a high API error rate (47% of DVOC trials lost to timeouts/auth failures).
- The naive baseline is consistently broken: Across 230 naive trials, the pattern is identical — the model fixates on a single wrong coordinate (typically (500, ~500) or (500, ~650)) with confidence 1.0 and never self-corrects across retries.
| Task | Gemini 3.1 FL | Gemma 4 26B | Gemma 4 31B | Description |
|---|---|---|---|---|
| S1 | 100%, 24px | 100%, 9px | 100%, 12px | Single green button (400, 300) |
| S2 | 27%, 39px | 100%, 12px | 100%, 17px | Small green button (400, 300) |
| S3 | 53%, 26px | 100%, 12px | 100%, 23px | Multi-button grid, target "Submit" (360, 220) |
| S4 | 100%, 17px | 100%, 8px | 100%, 14px | Edge target top-left (40, 25) |
| S5 | 33%, 40px | 100%, 9px | 100%, 43px | Text input field (325, 240) |
| S6 | 53%, 25px | 100%, 10px | 100%, 19px | Blue button among distractors (190, 370) |
| W1 | 100%, 65px | 100%, 25px | 100%, 25px | Sign In button (large) |
| W2 | 80%, 32px | 100%, 21px | 80%, 31px | Nav bar "Docs" link |
| W3 | 100%, 77px | 100%, 52px | 100%, 90px | Card grid "Beta" card |
| W4 | 0%, 214px | 100%, 25px | 50%, 32px | Dashboard table "Edit Bob" |
| W5 | 80%, 66px | 100%, 19px | 100%, 43px | Dropdown menu "Profile" |
Format: Success rate (%), mean final distance (px). Only counts valid (non-error) trials.
Findings:
- Synthetic tasks are easier than web tasks: Mean DVOC distance across all models: ~20px (synthetic) vs ~50px (web) among valid trials.
- When DVOC completes without API errors, success rates are high: Gemma 4 26B achieved 100% DVOC success on every task that completed, and Gemma 4 31B achieved 100% on 9 of 11 tasks.
- Task W4 (dashboard table Edit) is the hardest task: Only Gemma 4 26B achieved 100% on this task (2/2 trials). Gemini 3.1 Flash Lite scored 0% (mean 214px). The model tends to fixate on nearby rows or disabled buttons.
- Gemini 3.1 Flash Lite struggles with small or ambiguous targets: S2 (27%), S5 (33%), S3 (53%), S6 (53%) — all below 60% success.
- Task S4 (edge target) and S1 (central button) are the easiest: 100% success across all models with data.
DVOC convergence follows a characteristic pattern across all models: an initial large correction (often 100–300px in one step), followed by progressively smaller oscillations until settling within ε.
Key trajectory observations:
- First-step accuracy varies by model: Gemma 4 31B's initial corrections are the most directionally correct. Gemma 4 26B often under-corrects on the first step.
- Convergence typically requires 4–10 iterations when successful.
- The alpha decay schedule ($\alpha = 1.0 \to 0.5 \to 0.25$) causes near-misses: Several Gemini 3.1 Flash Lite trials landed at 32–47px, within a single correction step of success.
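One way to see how late-stage damping can leave a trial just outside ε is to assume, purely for illustration, that the Verifier reports the true offset exactly. Under that idealisation each step with damping α leaves a residual of (1 − α) times the previous error. The sketch below counts the steps needed to get under 30px. It is a simplified arithmetic exercise, not an analysis taken from the benchmark.

```python
def steps_to_epsilon(error: float, alpha: float = 0.25,
                     epsilon: float = 30.0, max_steps: int = 100) -> int:
    """Steps until the residual drops below epsilon, assuming a perfect offset report."""
    steps = 0
    while error >= epsilon and steps < max_steps:
        error *= (1.0 - alpha)  # damped step removes only a fraction alpha of the error
        steps += 1
    return steps


damped_32 = steps_to_epsilon(32.0)                 # 32 -> 24
damped_47 = steps_to_epsilon(47.0)                 # 47 -> 35.25 -> 26.4375
undamped_47 = steps_to_epsilon(47.0, alpha=1.0)    # 47 -> 0
```

In this idealised setting a trial stuck at 47px would reach ε in one undamped step but needs two steps at α = 0.25, so a run that hits the iteration cap at that point is counted as a miss.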
| Model | Naive Trials | Naive Success | Mean Distance | Fixation Pattern |
|---|---|---|---|---|
| Gemini 3.1 Flash Lite | 114 | 13% | 179px | (500, ~500) or (500, ~650), confidence 1.0 |
| Gemma 4 26B | 44 | 14% | 171px | Iterates between (410, 400) and (300, 400) |
| Gemma 4 31B | 44 | 18% | 167px | Varies by task, no self-correction |
| Model | Web DVOC Success | Web DVOC Mean Dist | Web DVOC Mean Iters |
|---|---|---|---|
| Gemini 3.1 Flash Lite | 72% (17/24) | 90.7px | 6.6 |
| Gemma 4 26B | 100% (14/14) | 47.0px | 8.0 |
| Gemma 4 31B | 86% (18/21) | 54.0px | 7.8 |
Findings:
- DVOC transfers reliably from synthetic to real HTML pages across all tested models.
- Web tasks require more iterations on average (6.6–8.0 vs 4.5–5.5 for synthetic) due to the larger viewport and visual complexity.
- Among completed trials, DVOC achieves 72–100% web task success.
Ablation studies were conducted on Gemini 3.1 Flash Lite with synthetic tasks S1–S6 (30 trials per condition).
| Prediction mode | Success | Mean Dist | Median Dist | Mean Iters |
|---|---|---|---|---|
| Relative (Δx, Δy) | 83% | 26.8px | 22px | 5.0 |
| Absolute (x, y) | 17% | 175.4px | 201px | 3.0 |
Relative offset parameterisation is the decisive architectural choice: 83% vs 17% success.
| Condition | Success | Mean Dist | Median Dist |
|---|---|---|---|
| Text-only (no overlay) | 35% | 135.2px | 103px |
| Crosshair only | 100% | 25.7px | 22px |
| Crosshair + circle (full) | 82% | 27.3px | 24px |
The crosshair provides the majority of the benefit. The uncertainty circle slightly degrades performance on this model-task combination (82% vs 100%), likely due to visual clutter.
| Schedule | Success | Mean Dist | Mean Iters |
|---|---|---|---|
| Decay (1.0 → 0.5 → 0.25) | 87% | 26.1px | 4.8 |
| Constant α = 0.5 | 97% | 23.0px | 4.8 |
| Undamped α = 1.0 | 97% | 29.1px | 4.5 |
All schedules achieve high success on Gemini 3.1 Flash Lite synthetic tasks. Constant α = 0.5 gives the lowest mean distance (23.0px) and ties undamped α = 1.0 for the highest success (97%). The decay schedule trails on success (87%), with a mean distance (26.1px) between the other two. These results suggest that the choice of damping schedule is a modest effect on capable models with simple tasks, but may be more important for noisier models.
Current limitations:
- API rate limits and quota exhaustion caused significant data loss: Gemma 4 26B lost 26/55 DVOC trials to timeouts/auth errors; Gemini 2.5 Flash Lite was limited to 10 total trials by quota
- No comparison against larger premium models (GPT-4V, Gemini 2.5 Pro, Claude 3.5 Sonnet)
- Web benchmark remains limited in scope (5 elements, 1 page design)
- Web backend uses Playwright screenshot coordinates, not real OS-level mouse events
- No systematic cost modeling ($/successful-click) across model tiers and providers
- Alpha damping's limited impact on this model-task combination suggests it may be a second-order effect for capable models
Future work:
- Provisioned throughput: Benchmark with paid API tiers or local model execution to eliminate data loss from rate limits
- Premium model comparison: Compare DVOC-enhanced cheap models against GPT-4V, Gemini 2.5 Pro, and Claude 3.5 Sonnet one-shot performance
- Adaptive alpha decay: PID controller tuned to convergence velocity
- Multi-target scenarios: Forms, dropdowns, drag-and-drop, and sequential interactions
- Canvas and non-DOM UIs: Where DOM-based approaches fundamentally fail
- Cost modeling: "$/successful-click" analysis across model tiers
See Point Refine [Mittal et al., 2026] also renders a red crosshair at the model's previous prediction and iterates based on visual feedback. DVOC differs by using relative offset vectors (not absolute coordinates) and includes kinematic damping.
GUI-Cursor [Zhao et al., 2025] reframes grounding as an interactive search via reinforcement learning. DVOC requires no training or reward engineering.
Iterative Narrowing [Nguyen, 2024] uses progressive cropping to refine predictions, which discards global visual context. DVOC preserves full-resolution context throughout.
Set-of-Marks (SoM) [Yang et al., 2023] augments screenshots with numerical labels for elements. DVOC uses an interactive marker that moves based on model feedback.
ReAct [Yao et al., 2022] demonstrates that reasoning traces improve LLM task performance. DVOC's ReAct-style verifier prompt similarly lets the model reason about spatial relationships before reporting offsets.
We presented DVOC, a recursive spatial refinement loop that makes weak vision models viable for pixel-precise UI grounding. In a multi-model benchmark across 4 vision-language models and 11 UI grounding tasks (6 synthetic + 5 web), DVOC achieved 53–67% raw convergence (63–100% protocol convergence) within 30px, while the same models without DVOC achieved 0–14%. The pattern is consistent and unambiguous: iterative visual feedback enables reliable convergence; one-shot prediction does not.
Ablations confirm that relative offset parameterisation is the primary driver (83% vs 17% success), followed by the visual crosshair overlay (100% vs 35% for text-only). The protocol is model-agnostic, environment-agnostic, and directly addresses a fundamental limitation of current UI agents.
Thanks to Google for providing the Gemini API free tier which made these experiments possible. Thanks to LiteLLM for providing a clean unified interface to multiple model providers.
The full implementation is available at github.com/Pomilon-Intelligence-Lab/DVOC.
Repository structure:
- dvoc-core/src/dvoc_core/: Core loop engine (10 modules, 61 tests)
- dvoc-os/src/dvoc_os/: Desktop backend (mss + pynput)
- dvoc-web/src/dvoc_web/: Web browser backend (Playwright, 13 tests)
- dvoc-py/src/dvoc/: SDK, used as DVOC(backend, model).run(task)
- examples/: Test harnesses, benchmark scripts
- docs/: System design spec and implementation plan

Dependencies: Python 3.11+, Pillow, LiteLLM, Playwright
Annotated screenshots from each DVOC iteration are available in examples/benchmark/output/artifacts/.
The canonical paper is published at research.pomilon.xyz/papers/dvoc.