Web Agent Evaluation Framework

A modular framework for implementing and benchmarking different web agents with a standardized interface.

Overview

This framework provides a standardized way to implement and evaluate web agents built on different LLM providers (e.g., OpenAI, Anthropic) behind a consistent interface. The project separates agent implementations from benchmarking logic, making it easy to add new agents and run comparative evaluations.

Project Structure

src/
├── agents/                 # Agent implementations
│   ├── interface.py        # Core interface and data models
│   ├── openai/             # OpenAI agent implementation
│   ├── anthropic/          # Anthropic agent implementation (template)
│   └── browser_use/        # Browser-based agent implementation (template)
├── benchmark/              # Benchmarking tools
│   └── test.py             # Example test script
└── utils/                  # Utility functions

Key Components

Agent Interface

All agent implementations must adhere to the standardized interface defined in src/agents/interface.py. This ensures consistent input/output formatting across different implementations:

AgentTaskExecutionInput: Standardized input format with task description and optional parameters
AgentStep: Format for recording intermediate steps during task execution
AgentTaskExecutionResult: Standardized output format for agent responses
AgentInterface: Abstract base class that all agent implementations must extend
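To make the four names above concrete, here is a minimal sketch of what such an interface could look like. It uses standard-library dataclasses and abc purely for illustration; the actual definitions in src/agents/interface.py may use a different base (for example pydantic models) and different fields. Only task, debug, answer, final_url, and execution_time_seconds appear elsewhere in this README; the AgentStep fields and the steps list are assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentTaskExecutionInput:
    """Input handed to an agent: the task text plus optional parameters."""
    task: str
    debug: bool = False


@dataclass
class AgentStep:
    """One intermediate step. Both fields are hypothetical placeholders."""
    action: str
    url: Optional[str] = None


@dataclass
class AgentTaskExecutionResult:
    """Output returned by an agent after it finishes a task."""
    task: str
    answer: str
    final_url: str
    execution_time_seconds: float
    # Assumption: results carry the recorded steps; the real model may differ.
    steps: List[AgentStep] = field(default_factory=list)


class AgentInterface(ABC):
    """Abstract base class; subclasses must implement execute_task."""

    @abstractmethod
    def execute_task(self, input: AgentTaskExecutionInput) -> AgentTaskExecutionResult:
        ...
```

Because execute_task is abstract, instantiating AgentInterface directly raises a TypeError, which catches incomplete agent implementations early.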

Sample Implementation

The project includes a sample OpenAI agent implementation (src/agents/openai/openai_agent.py) that demonstrates how to implement the interface. This implementation:

Uses a Playwright-based browser controller
Handles task execution, timing, and result formatting
Follows the standardized input/output interface
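The timing part of that list can be sketched independently of any browser. The helper below is hypothetical (run_timed is not a name from this project) and only shows one common way to measure how long a task takes so the value can be reported as execution_time_seconds.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Call fn and return its value together with the elapsed seconds."""
    start = time.perf_counter()
    value = fn()
    elapsed = time.perf_counter() - start
    return value, elapsed


# Usage: wrap whatever performs the task; a lambda stands in for it here.
answer, seconds = run_timed(lambda: "done")
```

time.perf_counter is used because it is monotonic and intended for measuring short durations; time.time would also work if wall-clock timestamps are preferred.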

Installation

This project uses Poetry for dependency management.

Mac Installation

# Install Poetry
brew install poetry

# Configure Poetry to create virtual environments in the project directory
poetry config virtualenvs.in-project true --local

# Install dependencies
poetry install

# Run a script
poetry run python src/benchmark/test.py

Adding Dependencies

To add new dependencies:

poetry add <package-name>

Usage

Here's a simple example of how to use an agent:

from src.agents import OpenAI_Agent
from src.agents.interface import AgentTaskExecutionInput

# Create an agent
agent = OpenAI_Agent()

# Define a task (named task_input to avoid shadowing the built-in input())
task_input = AgentTaskExecutionInput(
    task="Find out how the pydantic model can be converted into a json schema by going to the pydantic website and finding the exact latest documentation.",
    debug=True
)

# Execute the task
result = agent.execute_task(input=task_input)

# Print the result
print(result)

Extending with New Agents

To add a new agent implementation:

Create a new directory under src/agents/ for your implementation (e.g., src/agents/my_agent/)

Implement the AgentInterface class:

import time

from src.agents.interface import AgentInterface, AgentTaskExecutionResult, AgentTaskExecutionInput

class MyAgent(AgentInterface):
    # Optional: you can omit __init__ if your agent needs no setup.
    def __init__(self):
        # Initialize your agent
        pass

    def execute_task(self, input: AgentTaskExecutionInput) -> AgentTaskExecutionResult:
        start_time = time.time()

        # Implement task execution logic
        # ...

        # Elapsed wall-clock time for the task, in seconds
        execution_time = time.time() - start_time

        # Return result in standardized format
        return AgentTaskExecutionResult(
            task=input.task,
            answer="Your agent's answer",
            final_url="https://final-url.com",
            execution_time_seconds=execution_time
        )

Add your agent to the src/agents/__init__.py file:

from .interface import AgentTaskExecutionResult, AgentInterface
from .openai.openai_agent import OpenAI_Agent
from .my_agent.my_agent import MyAgent

Create test cases in the benchmark directory to evaluate your agent
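As a rough illustration of such a test case, the sketch below runs every agent on every task and collects the results into rows that can be compared. It is not the project's benchmark code: run_benchmark and EchoAgent are hypothetical names, and the two dataclasses are stand-ins so the example runs on its own (in the project they would be imported from src.agents.interface).

```python
from dataclasses import dataclass
from typing import Any, Dict, List


# Stand-ins for the models in src/agents/interface.py (assumed fields only).
@dataclass
class AgentTaskExecutionInput:
    task: str
    debug: bool = False


@dataclass
class AgentTaskExecutionResult:
    task: str
    answer: str
    final_url: str
    execution_time_seconds: float


class EchoAgent:
    """Hypothetical stub agent that returns a fixed answer without a browser."""

    def execute_task(self, input: AgentTaskExecutionInput) -> AgentTaskExecutionResult:
        return AgentTaskExecutionResult(
            task=input.task,
            answer="stub answer",
            final_url="https://example.com",
            execution_time_seconds=0.0,
        )


def run_benchmark(agents: Dict[str, Any], tasks: List[str]) -> List[dict]:
    """Run each agent on each task and return one flat row per run."""
    rows = []
    for name, agent in agents.items():
        for task in tasks:
            result = agent.execute_task(input=AgentTaskExecutionInput(task=task))
            rows.append({
                "agent": name,
                "task": result.task,
                "answer": result.answer,
                "final_url": result.final_url,
                "seconds": result.execution_time_seconds,
            })
    return rows


rows = run_benchmark({"echo": EchoAgent()}, ["task A", "task B"])
```

Because every agent returns the same result type, the runner needs no agent-specific code; adding a new agent to the comparison means adding one entry to the agents dictionary.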

Contributing

Please ensure that any new agent implementations follow the standardized interface defined in src/agents/interface.py.
