OpenAdapt Evals

Python 3.10+ | License: MIT

Evaluation infrastructure for GUI agent benchmarks, built for OpenAdapt.

What is OpenAdapt Evals?

OpenAdapt Evals is a unified framework for evaluating GUI automation agents against standardized benchmarks such as Windows Agent Arena (WAA). It provides benchmark adapters, agent interfaces, Azure VM infrastructure for parallel evaluation, and result visualization -- everything needed to go from "I have a GUI agent" to "here are its benchmark scores."

Benchmark Viewer

Benchmark Viewer Animation

Task Detail View -- step-by-step replay with screenshots, actions, and execution logs.

Cost Tracking Dashboard -- real-time Azure VM cost monitoring with tiered sizing and spot instances.

Key Features

  • Benchmark adapters for WAA (live, mock, and local modes), with an extensible base for OSWorld, WebArena, and others
  • Agent interfaces including ApiAgent (Claude / GPT), RetrievalAugmentedAgent, RandomAgent, and PolicyAgent
  • Azure VM infrastructure with AzureVMManager, PoolManager, SSHTunnelManager, and VMMonitor for running evaluations at scale
  • CLI tools -- oa-vm for VM and pool management (50+ commands) and openadapt-evals for running evaluations
  • Cost optimization -- tiered VM sizing, spot instance support, and real-time cost tracking
  • Results visualization -- HTML viewer with step-by-step screenshot replay, execution logs, and domain breakdowns
  • Trace export for converting evaluation trajectories into training data
  • Configuration via pydantic-settings with automatic .env loading

Installation

pip install openadapt-evals

With optional dependencies:

pip install openadapt-evals[azure]      # Azure VM management
pip install openadapt-evals[retrieval]  # Demo retrieval agent
pip install openadapt-evals[viewer]     # Live results viewer
pip install openadapt-evals[all]        # Everything

Quick Start

Run a mock evaluation (no VM required)

openadapt-evals mock --tasks 10

Run a live evaluation against a WAA server

# Start with a single Azure VM
oa-vm pool-create --workers 1
oa-vm pool-wait

# Run evaluation
openadapt-evals run --agent api-claude --task notepad_1

# View results
openadapt-evals view --run-name live_eval

# Clean up (stop billing)
oa-vm pool-cleanup -y

Python API

from openadapt_evals import (
    ApiAgent,
    WAALiveAdapter,
    WAALiveConfig,
    evaluate_agent_on_benchmark,
    compute_metrics,
)

adapter = WAALiveAdapter(WAALiveConfig(server_url="http://localhost:5001"))
agent = ApiAgent(provider="anthropic")

results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_1"])
metrics = compute_metrics(results)
print(f"Success rate: {metrics['success_rate']:.1%}")

Parallel evaluation on Azure

# Create a pool of VMs and distribute tasks
oa-vm pool-create --workers 5
oa-vm pool-wait
oa-vm pool-run --tasks 50

# Or use Azure ML orchestration
openadapt-evals azure --workers 10 --waa-path /path/to/WindowsAgentArena
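
pool-run handles assigning tasks to workers; the actual scheduling is up to PoolManager and is not specified here. As an illustration only (not necessarily how PoolManager distributes work), a simple round-robin split of 50 tasks across 5 workers looks like this:

# Illustration only -- a round-robin split, not the actual PoolManager scheduling logic.
tasks = [f"task_{i}" for i in range(50)]
num_workers = 5

assignments = {w: tasks[w::num_workers] for w in range(num_workers)}
for worker, chunk in assignments.items():
    print(f"worker {worker}: {len(chunk)} tasks")  # 10 tasks per worker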

Architecture

openadapt_evals/
├── agents/               # Agent implementations
│   ├── base.py            #   BenchmarkAgent ABC
│   ├── api_agent.py       #   ApiAgent (Claude, GPT)
│   ├── retrieval_agent.py #   RetrievalAugmentedAgent
│   └── policy_agent.py    #   PolicyAgent (trained models)
├── adapters/             # Benchmark adapters
│   ├── base.py           #   BenchmarkAdapter ABC + data classes
│   └── waa/              #   WAA live, mock, and local adapters
├── infrastructure/       # Azure VM and pool management
│   ├── azure_vm.py       #   AzureVMManager
│   ├── pool.py           #   PoolManager
│   ├── ssh_tunnel.py     #   SSHTunnelManager
│   └── vm_monitor.py     #   VMMonitor dashboard
├── benchmarks/           # Evaluation runner, CLI, viewers
│   ├── runner.py         #   evaluate_agent_on_benchmark()
│   ├── cli.py            #   Benchmark CLI (run, mock, live, view)
│   ├── vm_cli.py         #   VM/Pool CLI (oa-vm, 50+ commands)
│   ├── viewer.py         #   HTML results viewer
│   ├── pool_viewer.py    #   Pool results viewer
│   └── trace_export.py   #   Training data export
├── waa_deploy/           # Docker agent deployment
├── server/               # WAA server extensions
├── config.py             # Settings (pydantic-settings, .env)
└── __init__.py

How it fits together

LOCAL MACHINE                          AZURE VM (Ubuntu)
┌─────────────────────┐                ┌──────────────────────┐
│  oa-vm CLI          │   SSH Tunnel   │  Docker              │
│  (pool management)  │ ─────────────> │  └─ QEMU (Win 11)    │
│                     │  :5001 → :5000 │     ├─ WAA Flask API │
│  openadapt-evals    │  :8006 → :8006 │     └─ Agent         │
│  (benchmark runner) │                │                      │
└─────────────────────┘                └──────────────────────┘
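
Before launching a run, it can help to confirm the tunneled port is actually reachable from the local machine. The openadapt-evals probe command does the real readiness check against the WAA API; the sketch below is a quicker TCP-level check using only the standard library, assuming the default localhost:5001 mapping shown above.

import socket

def waa_port_open(host: str = "localhost", port: int = 5001, timeout: float = 3.0) -> bool:
    """Return True if the tunneled WAA port accepts TCP connections.

    This only verifies the SSH tunnel / port mapping, not that the WAA Flask API
    inside the Windows VM is fully initialized -- use `openadapt-evals probe` for that.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("WAA port reachable:", waa_port_open())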

CLI Reference

Benchmark CLI (openadapt-evals)

Command    Description
run        Run a live evaluation (localhost:5001 by default)
mock       Run with the mock adapter (no VM required)
live       Run against a WAA server (full control)
azure      Run parallel evaluation on Azure ML
probe      Check whether a WAA server is ready
view       Generate an HTML viewer for results
estimate   Estimate Azure costs

VM/Pool CLI (oa-vm)

Command        Description
pool-create    Create N VMs with Docker and WAA
pool-wait      Wait until WAA is ready on all workers
pool-run       Distribute tasks across pool workers
pool-status    Show the status of all pool VMs
pool-cleanup   Delete all pool VMs and resources
vm monitor     Dashboard with SSH tunnels
vm setup-waa   Deploy the WAA container on a VM

Run oa-vm --help for the full list of 50+ commands.

Configuration

Settings are loaded automatically from environment variables or a .env file in the project root via pydantic-settings.

# .env
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...

# Azure (required for VM management)
AZURE_SUBSCRIPTION_ID=...
AZURE_ML_RESOURCE_GROUP=...
AZURE_ML_WORKSPACE_NAME=...

See openadapt_evals/config.py for all available settings.
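
For reference, a settings class of the kind config.py defines looks roughly like the sketch below. The field names mirror the environment variables above and are illustrative only; the actual attribute names live in openadapt_evals/config.py.

# Sketch of a pydantic-settings configuration class. Field names are illustrative;
# see openadapt_evals/config.py for the real ones.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    # API keys for ApiAgent
    anthropic_api_key: str | None = None
    openai_api_key: str | None = None

    # Azure (required for VM management)
    azure_subscription_id: str | None = None
    azure_ml_resource_group: str | None = None
    azure_ml_workspace_name: str | None = None


settings = Settings()  # reads environment variables and .env automatically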

Custom Agents

Implement the BenchmarkAgent interface to evaluate your own agent:

from openadapt_evals import BenchmarkAgent, BenchmarkAction, BenchmarkObservation, BenchmarkTask

class MyAgent(BenchmarkAgent):
    def act(
        self,
        observation: BenchmarkObservation,
        task: BenchmarkTask,
        history: list[tuple[BenchmarkObservation, BenchmarkAction]] | None = None,
    ) -> BenchmarkAction:
        # Your agent logic here
        return BenchmarkAction(type="click", x=0.5, y=0.5)

    def reset(self) -> None:
        pass
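
A custom agent plugs into the same runner as the built-in ones. A minimal sketch, reusing the server URL from the Python API example above and assuming BenchmarkAgent's constructor takes no required arguments:

from openadapt_evals import (
    WAALiveAdapter,
    WAALiveConfig,
    compute_metrics,
    evaluate_agent_on_benchmark,
)

# Evaluate the custom agent exactly like ApiAgent in the Python API example.
adapter = WAALiveAdapter(WAALiveConfig(server_url="http://localhost:5001"))
agent = MyAgent()  # assumes no constructor arguments are required

results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_1"])
print(compute_metrics(results))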

Contributing

We welcome contributions. To get started:

git clone https://github.com/OpenAdaptAI/openadapt-evals.git
cd openadapt-evals
pip install -e ".[dev]"
pytest tests/ -v

See CLAUDE.md for development conventions and architecture details.

Related Projects

Project               Description
OpenAdapt             Desktop automation with demo-conditioned AI agents
openadapt-ml          Training and policy runtime
openadapt-capture     Screen recording and demo sharing
openadapt-grounding   UI element localization

License

MIT
