Evaluation infrastructure for GUI agent benchmarks, built for OpenAdapt.
OpenAdapt Evals is a unified framework for evaluating GUI automation agents against standardized benchmarks such as Windows Agent Arena (WAA). It provides benchmark adapters, agent interfaces, Azure VM infrastructure for parallel evaluation, and result visualization -- everything needed to go from "I have a GUI agent" to "here are its benchmark scores."
More screenshots:

- Task Detail View -- step-by-step replay with screenshots, actions, and execution logs
- Cost Tracking Dashboard -- real-time Azure VM cost monitoring with tiered sizing and spot instances
- Benchmark adapters for WAA (live, mock, and local modes), with an extensible base for OSWorld, WebArena, and others
- Agent interfaces including `ApiAgent` (Claude / GPT), `RetrievalAugmentedAgent`, `RandomAgent`, and `PolicyAgent`
- Azure VM infrastructure with `AzureVMManager`, `PoolManager`, `SSHTunnelManager`, and `VMMonitor` for running evaluations at scale
- CLI tools -- `oa-vm` for VM and pool management (50+ commands), benchmark CLI for running evals
- Cost optimization -- tiered VM sizing, spot instance support, and real-time cost tracking
- Results visualization -- HTML viewer with step-by-step screenshot replay, execution logs, and domain breakdowns
- Trace export for converting evaluation trajectories into training data
- Configuration via pydantic-settings with automatic `.env` loading
Install with pip:

```bash
pip install openadapt-evals
```

With optional dependencies:

```bash
pip install openadapt-evals[azure]      # Azure VM management
pip install openadapt-evals[retrieval]  # Demo retrieval agent
pip install openadapt-evals[viewer]     # Live results viewer
pip install openadapt-evals[all]        # Everything
```

Quick sanity check with the mock adapter (no VM required):

```bash
openadapt-evals mock --tasks 10
```

For a live run on Azure:

```bash
# Start with a single Azure VM
oa-vm pool-create --workers 1
oa-vm pool-wait
# Run evaluation
openadapt-evals run --agent api-claude --task notepad_1
# View results
openadapt-evals view --run-name live_eval
# Clean up (stop billing)
oa-vm pool-cleanup -y
```

The same flow from Python:

```python
from openadapt_evals import (
    ApiAgent,
    WAALiveAdapter,
    WAALiveConfig,
    evaluate_agent_on_benchmark,
    compute_metrics,
)

adapter = WAALiveAdapter(WAALiveConfig(server_url="http://localhost:5001"))
agent = ApiAgent(provider="anthropic")
results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_1"])
metrics = compute_metrics(results)
print(f"Success rate: {metrics['success_rate']:.1%}")
```

```bash
# Create a pool of VMs and distribute tasks
oa-vm pool-create --workers 5
oa-vm pool-wait
oa-vm pool-run --tasks 50
# Or use Azure ML orchestration
openadapt-evals azure --workers 10 --waa-path /path/to/WindowsAgentArena
```

Project layout:

```text
openadapt_evals/
├── agents/                 # Agent implementations
│   ├── base.py             # BenchmarkAgent ABC
│   ├── api_agent.py        # ApiAgent (Claude, GPT)
│   ├── retrieval_agent.py  # RetrievalAugmentedAgent
│   └── policy_agent.py     # PolicyAgent (trained models)
├── adapters/               # Benchmark adapters
│   ├── base.py             # BenchmarkAdapter ABC + data classes
│   └── waa/                # WAA live, mock, and local adapters
├── infrastructure/         # Azure VM and pool management
│   ├── azure_vm.py         # AzureVMManager
│   ├── pool.py             # PoolManager
│   ├── ssh_tunnel.py       # SSHTunnelManager
│   └── vm_monitor.py       # VMMonitor dashboard
├── benchmarks/             # Evaluation runner, CLI, viewers
│   ├── runner.py           # evaluate_agent_on_benchmark()
│   ├── cli.py              # Benchmark CLI (run, mock, live, view)
│   ├── vm_cli.py           # VM/Pool CLI (oa-vm, 50+ commands)
│   ├── viewer.py           # HTML results viewer
│   ├── pool_viewer.py      # Pool results viewer
│   └── trace_export.py     # Training data export
├── waa_deploy/             # Docker agent deployment
├── server/                 # WAA server extensions
├── config.py               # Settings (pydantic-settings, .env)
└── __init__.py
```
Runtime architecture:

```text
LOCAL MACHINE                          AZURE VM (Ubuntu)
┌─────────────────────┐                ┌──────────────────────┐
│ oa-vm CLI           │   SSH Tunnel   │  Docker              │
│ (pool management)   │ ─────────────> │  └─ QEMU (Win 11)    │
│                     │ :5001 → :5000  │     ├─ WAA Flask API │
│ openadapt-evals     │ :8006 → :8006  │     └─ Agent         │
│ (benchmark runner)  │                │                      │
└─────────────────────┘                └──────────────────────┘
```
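The forwarded ports are what let the local runner address the remote WAA server as `http://localhost:5001`. Below is a minimal sketch of the tunnel that `SSHTunnelManager` manages for you, assuming key-based SSH access to the VM; the address and key path are placeholders, not values from this project.

```python
import os
import subprocess

# Placeholders -- substitute your VM's address and SSH key.
VM_ADDRESS = "azureuser@<vm-public-ip>"
SSH_KEY = os.path.expanduser("~/.ssh/id_rsa")

# Forward local 5001 -> remote 5000 (WAA Flask API) and 8006 -> 8006,
# mirroring the port mapping shown in the diagram above.
tunnel = subprocess.Popen([
    "ssh", "-i", SSH_KEY, "-N",
    "-L", "5001:localhost:5000",
    "-L", "8006:localhost:8006",
    VM_ADDRESS,
])

# ... run evaluations against http://localhost:5001 ...

tunnel.terminate()
```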
Benchmark CLI (`openadapt-evals`):

| Command | Description |
|---|---|
| `run` | Run live evaluation (localhost:5001 default) |
| `mock` | Run with mock adapter (no VM required) |
| `live` | Run against a WAA server (full control) |
| `azure` | Run parallel evaluation on Azure ML |
| `probe` | Check if a WAA server is ready |
| `view` | Generate HTML viewer for results |
| `estimate` | Estimate Azure costs |
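When scripting around the CLI, a rough stand-in for `probe` is to check that something is accepting connections on the tunneled port. This is a sketch only: it confirms the port is reachable, not that Windows and the WAA server have finished booting.

```python
import socket


def waa_port_open(host: str = "localhost", port: int = 5001, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the (tunneled) WAA port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("WAA port reachable:", waa_port_open())
```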
VM and pool CLI (`oa-vm`):

| Command | Description |
|---|---|
| `pool-create` | Create N VMs with Docker and WAA |
| `pool-wait` | Wait until WAA is ready on all workers |
| `pool-run` | Distribute tasks across pool workers |
| `pool-status` | Show status of all pool VMs |
| `pool-cleanup` | Delete all pool VMs and resources |
| `vm monitor` | Dashboard with SSH tunnels |
| `vm setup-waa` | Deploy WAA container on a VM |
Run `oa-vm --help` for the full list of 50+ commands.
Settings are loaded automatically from environment variables or a `.env` file in the project root via pydantic-settings.

```bash
# .env
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...

# Azure (required for VM management)
AZURE_SUBSCRIPTION_ID=...
AZURE_ML_RESOURCE_GROUP=...
AZURE_ML_WORKSPACE_NAME=...
```

See `openadapt_evals/config.py` for all available settings.
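Under the hood this is standard pydantic-settings behavior. The sketch below shows the general pattern with illustrative field names; the actual class and fields are defined in `openadapt_evals/config.py`.

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Illustrative only -- see openadapt_evals/config.py for the actual fields."""

    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    # Field names below are assumptions for the sketch, not the real schema.
    anthropic_api_key: str | None = None
    openai_api_key: str | None = None
    azure_subscription_id: str | None = None


settings = Settings()  # values resolve from the environment or .env automatically
```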
Implement the `BenchmarkAgent` interface to evaluate your own agent:
```python
from openadapt_evals import BenchmarkAgent, BenchmarkAction, BenchmarkObservation, BenchmarkTask


class MyAgent(BenchmarkAgent):
    def act(
        self,
        observation: BenchmarkObservation,
        task: BenchmarkTask,
        history: list[tuple[BenchmarkObservation, BenchmarkAction]] | None = None,
    ) -> BenchmarkAction:
        # Your agent logic here
        return BenchmarkAction(type="click", x=0.5, y=0.5)
    def reset(self) -> None:
        pass
```
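Continuing from the class above, a custom agent plugs into the same runner shown in the quick start (assuming `MyAgent` takes no constructor arguments):

```python
from openadapt_evals import (
    WAALiveAdapter,
    WAALiveConfig,
    compute_metrics,
    evaluate_agent_on_benchmark,
)

# Evaluate the custom agent against a live WAA server on the tunneled port.
adapter = WAALiveAdapter(WAALiveConfig(server_url="http://localhost:5001"))
results = evaluate_agent_on_benchmark(MyAgent(), adapter, task_ids=["notepad_1"])
print(compute_metrics(results))
```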
We welcome contributions. To get started:

```bash
git clone https://github.com/OpenAdaptAI/openadapt-evals.git
cd openadapt-evals
pip install -e ".[dev]"
pytest tests/ -v
```

See `CLAUDE.md` for development conventions and architecture details.
| Project | Description |
|---|---|
| OpenAdapt | Desktop automation with demo-conditioned AI agents |
| openadapt-ml | Training and policy runtime |
| openadapt-capture | Screen recording and demo sharing |
| openadapt-grounding | UI element localization |


