A modular framework for implementing and benchmarking different web agents with a standardized interface.
This framework defines a consistent interface for implementing and evaluating web agents built on different LLM providers (e.g., OpenAI, Anthropic). Agent implementations are kept separate from the benchmarking logic, which makes it easy to add new agents and run comparative evaluations.
src/
├── agents/ # Agent implementations
│ ├── interface.py # Core interface and data models
│ ├── openai/ # OpenAI agent implementation
│ ├── anthropic/ # Anthropic agent implementation (template)
│ └── browser_use/ # Browser-based agent implementation (template)
├── benchmark/ # Benchmarking tools
│ └── test.py # Example test script
└── utils/ # Utility functions
All agent implementations must adhere to the standardized interface defined in src/agents/interface.py. This ensures consistent input/output formatting across different implementations:
- AgentTaskExecutionInput: Standardized input format with task description and optional parameters
- AgentStep: Format for recording intermediate steps during task execution
- AgentTaskExecutionResult: Standardized output format for agent responses
- AgentInterface: Abstract base class that all agent implementations must extend
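To make the contract concrete, here is a minimal sketch of what such an interface could look like. It is not the contents of src/agents/interface.py: plain dataclasses are an assumption, and the AgentStep fields and the steps list are hypothetical. Only task, debug, answer, final_url, execution_time_seconds, and execute_task(input=...) come from the examples in this README.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentTaskExecutionInput:
    task: str            # natural-language task description
    debug: bool = False  # optional flag, as used in the usage example


@dataclass
class AgentStep:
    # Hypothetical fields: the README does not say what a step records.
    description: str
    url: Optional[str] = None


@dataclass
class AgentTaskExecutionResult:
    task: str
    answer: str
    final_url: str
    execution_time_seconds: float
    steps: List[AgentStep] = field(default_factory=list)  # assumed field


class AgentInterface(ABC):
    @abstractmethod
    def execute_task(self, input: AgentTaskExecutionInput) -> AgentTaskExecutionResult:
        """Run one task and return the result in the standardized format."""


class EchoAgent(AgentInterface):
    """Toy agent that only demonstrates the contract; it does no browsing."""

    def execute_task(self, input: AgentTaskExecutionInput) -> AgentTaskExecutionResult:
        return AgentTaskExecutionResult(
            task=input.task,
            answer="n/a",
            final_url="about:blank",
            execution_time_seconds=0.0,
        )


result = EchoAgent().execute_task(input=AgentTaskExecutionInput(task="demo"))
print(result)
```

Because execute_task is abstract, instantiating AgentInterface directly raises TypeError, so an implementation that forgets the method fails at construction rather than in the middle of a benchmark run.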
The project includes a sample OpenAI agent implementation (src/agents/openai/openai_agent.py) that demonstrates how to implement the interface. This implementation:
- Uses a Playwright-based browser controller
- Handles task execution, timing, and result formatting
- Follows the standardized input/output interface
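The timing part can be sketched independently of the browser layer. The helper below is illustrative only: run_timed is a hypothetical name, not a function from this project, and the lambda stands in for the Playwright-driven work the real agent performs.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Call fn and return (its result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    value = fn()
    elapsed = time.perf_counter() - start
    return value, elapsed


# Stand-in for a browser-driven task.
answer, seconds = run_timed(lambda: "placeholder answer")
print(f"{answer!r} took {seconds:.4f}s")
```

time.perf_counter() is used instead of time.time() because it is monotonic and intended for measuring short durations.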
This project uses Poetry for dependency management.
# Install Poetry (macOS, via Homebrew)
brew install poetry
# Configure Poetry to create virtual environments in the project directory
poetry config virtualenvs.in-project true --local
# Install dependencies
poetry install
# Run a script
poetry run python src/benchmark/test.py
To add new dependencies:
poetry add <package-name>
Here's a simple example of how to use an agent:
from src.agents import OpenAI_Agent
from src.agents.interface import AgentTaskExecutionInput
# Create an agent
agent = OpenAI_Agent()
# Define a task
# (named task_input to avoid shadowing Python's built-in input())
task_input = AgentTaskExecutionInput(
    task="Find out how the pydantic model can be converted into a json schema by going to the pydantic website and finding the exact latest documentation.",
    debug=True
)
# Execute the task
result = agent.execute_task(input=task_input)
# Print the result
print(result)
To add a new agent implementation:
- Create a new directory under src/agents/ for your implementation (e.g., src/agents/my_agent/).
- Implement the AgentInterface class:

    import time

    from src.agents.interface import AgentInterface, AgentTaskExecutionResult, AgentTaskExecutionInput

    class MyAgent(AgentInterface):
        # Optional: omit __init__ if your agent needs no setup.
        def __init__(self):
            # Initialize your agent
            pass

        def execute_task(self, input: AgentTaskExecutionInput) -> AgentTaskExecutionResult:
            start = time.perf_counter()
            # Implement task execution logic
            # ...
            execution_time = time.perf_counter() - start
            # Return result in standardized format
            return AgentTaskExecutionResult(
                task=input.task,
                answer="Your agent's answer",
                final_url="https://final-url.com",
                execution_time_seconds=execution_time
            )

- Add your agent to the src/agents/__init__.py file:

    from .interface import AgentTaskExecutionResult, AgentInterface
    from .openai.openai_agent import OpenAI_Agent
    from .my_agent.my_agent import MyAgent

- Create test cases in the benchmark directory to evaluate your agent.
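A benchmark test case can be as small as a loop over agents and tasks. The sketch below is one possible shape under stated assumptions, not the project's src/benchmark/test.py: Result and FakeAgent are stand-ins for the real interface classes, run_benchmark is a hypothetical helper, and scoring by expected substring is a deliberately crude placeholder metric.

```python
import time
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:  # stand-in for AgentTaskExecutionResult
    task: str
    answer: str
    final_url: str
    execution_time_seconds: float


class FakeAgent:  # stand-in for an AgentInterface subclass
    def __init__(self, canned_answer: str):
        self.canned_answer = canned_answer

    def execute_task(self, input: str) -> Result:
        start = time.perf_counter()
        answer = self.canned_answer  # a real agent would browse here
        return Result(input, answer, "about:blank", time.perf_counter() - start)


def run_benchmark(agents: Dict[str, FakeAgent], tasks: Dict[str, str]) -> List[dict]:
    """Run every agent on every task; tasks maps task text to an expected substring."""
    rows = []
    for name, agent in agents.items():
        for task, expected in tasks.items():
            result = agent.execute_task(input=task)
            rows.append({
                "agent": name,
                "task": task,
                "passed": expected.lower() in result.answer.lower(),
                "seconds": result.execution_time_seconds,
            })
    return rows


rows = run_benchmark(
    agents={"good": FakeAgent("Call model_json_schema()"), "bad": FakeAgent("I don't know")},
    tasks={"How is a pydantic model converted to a JSON schema?": "model_json_schema"},
)
for row in rows:
    print(row)
```

Since every agent returns the same result type, the loop needs no agent-specific code, which is the practical benefit of the shared interface when comparing implementations.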
Please ensure that any new agent implementations follow the standardized interface defined in src/agents/interface.py.