feat: establish robust evaluation framework for workflow benchmarks by cocosheng-g · Pull Request #457 · google-github-actions/run-gemini-cli

cocosheng-g · 2026-02-04T22:36:30Z

Overview

This PR introduces a robust, automated evaluation framework for Gemini CLI example workflows (Triage, Review, Fixer, Assistant), fulfilling the requirements of #219.

Key Features

Isolated TestRig: Secure, concurrent environment using temporary GEMINI_CLI_HOME to prevent interference with local settings.
Mock MCP Server: A dedicated mock-mcp-server.ts providing high-fidelity GitHub API simulation, enabling realistic PR Review benchmarks.
Gold-Standard Datasets: Structured benchmarks in evals/data/ using high-signal technical assertions (e.g., detecting eval vulnerabilities or quadratic complexity).
Nightly Automation: Integrated GitHub Action matrix testing across 5 Gemini models.
Automated Reporting: Aggregate reporting script for GitHub Job Summaries.
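
The isolation idea behind the TestRig can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the class name `IsolatedHome` and its methods are hypothetical, and the only detail taken from the description above is that each run gets a temporary `GEMINI_CLI_HOME`.

```typescript
import { existsSync, mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper (assumption): not the PR's TestRig, just the isolation pattern.
export class IsolatedHome {
  readonly homeDir: string;

  constructor(prefix: string = "gemini-eval-") {
    // mkdtempSync appends random characters to the prefix, so rigs created
    // concurrently each get their own directory.
    this.homeDir = mkdtempSync(join(tmpdir(), prefix));
  }

  // Environment for a child process: inherits the parent environment but
  // points GEMINI_CLI_HOME at the temporary directory.
  env(): Record<string, string | undefined> {
    return { ...process.env, GEMINI_CLI_HOME: this.homeDir };
  }

  // True while the temporary directory still exists on disk.
  get alive(): boolean {
    return existsSync(this.homeDir);
  }

  // Remove the temporary directory and everything inside it.
  cleanup(): void {
    rmSync(this.homeDir, { recursive: true, force: true });
  }
}
```

A test would create one `IsolatedHome` per evaluation case, pass `env()` to whatever spawns the CLI, and call `cleanup()` in a `finally` block so that local settings are never read or modified.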

Next Steps

Data Expansion: Add more complex edge cases to the existing datasets.
Prompt Tuning: Use this baseline to fine-tune the workflow prompts for even higher reliability.

Related to: #219

- Implement Isolated `TestRig` for environment-safe, concurrent evaluations.
- Add gold-standard datasets for Issue Triage, Scheduled Triage, Assistant, and Issue Fixer.
- Implement Mock MCP Server for high-fidelity PR Review benchmarking.
- Add nightly evaluation workflow with multi-model strategy matrix.
- Automated aggregate reporting for GitHub Job Summaries.

Next Steps:
- Expand evaluation datasets with more edge cases.
- Fine-tune workflow prompts based on baseline quality analysis.

Refs: #219
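
As a rough sketch of what aggregate reporting across a model matrix might look like, the function below folds per-case results into a Markdown table. The `EvalResult` shape, the `summarize` name, and the column layout are assumptions for illustration; the PR's actual report format is not shown here.

```typescript
// Hypothetical result shape (assumption): one record per evaluated case.
interface EvalResult {
  model: string;
  workflow: string;
  passed: boolean;
}

// Build a Markdown table with one row per model, in first-seen order.
export function summarize(results: EvalResult[]): string {
  const byModel: Record<string, { passed: number; total: number }> = {};
  for (const r of results) {
    const entry = byModel[r.model] || { passed: 0, total: 0 };
    entry.total += 1;
    if (r.passed) entry.passed += 1;
    byModel[r.model] = entry;
  }

  const lines = [
    "| Model | Passed | Total | Pass rate |",
    "| --- | --- | --- | --- |",
  ];
  for (const model of Object.keys(byModel)) {
    const { passed, total } = byModel[model];
    const rate = ((passed / total) * 100).toFixed(1);
    lines.push(`| ${model} | ${passed} | ${total} | ${rate}% |`);
  }
  return lines.join("\n");
}
```

In a GitHub Actions job, appending the returned string to the file named by the `GITHUB_STEP_SUMMARY` environment variable makes the table appear in the job summary.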

gemini-cli · 2026-02-04T22:36:43Z

🤖 Hi @cocosheng-g, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

.github/workflows/evals-nightly.yml

cocosheng-g · 2026-02-05T01:14:30Z

/gemini review

cocosheng-g requested review from a team as code owners February 4, 2026 22:36

cocosheng-g requested review from R2wenD2, allenhutchison, haroonc and umairidris February 4, 2026 22:36

github-advanced-security bot found potential problems Feb 4, 2026

.github/workflows/evals-nightly.yml (fixed)

cocosheng-g marked this pull request as draft February 4, 2026 22:38

fix: address YAML linting issues and apply formatting fixes

3864f93

github-advanced-security bot found potential problems Feb 4, 2026

.github/workflows/evals-nightly.yml (fixed)

fix: address ratchet unpinned references and shell quoting

c34479d

github-advanced-security bot found potential problems Feb 5, 2026

.github/workflows/evals-nightly.yml (fixed)

fix: add explicit GITHUB_TOKEN permissions

4657361

cocosheng-g force-pushed the feat/eval-framework branch from 7d1fc3e to 6fb3486 Compare February 5, 2026 01:22

fix: regenerate package-lock.json with public registry URLs

70d3297

cocosheng-g force-pushed the feat/eval-framework branch from 6fb3486 to 70d3297 Compare February 5, 2026 01:28

cocosheng-g marked this pull request as ready for review February 5, 2026 01:33

cocosheng-g requested review from kschaab and removed request for R2wenD2, allenhutchison, haroonc and umairidris February 5, 2026 01:33

cocosheng-g linked an issue Feb 5, 2026 that may be closed by this pull request

Evaluate and Improve Example Workflows #219

Open

cocosheng-g enabled auto-merge (squash) February 6, 2026 16:40

cocosheng-g added 3 commits February 6, 2026 12:11

fix(evals): install dependencies locally and add CLI installation in …

3bd3baa

…workflow

fix(workflow): quote strings to satisfy yaml linter

225f8a0

fix(deps): regenerate package-lock.json using public registry

aaa4ce1

kschaab approved these changes Feb 6, 2026

cocosheng-g requested a review from jerop February 6, 2026 18:53

cocosheng-g wants to merge 8 commits into main from feat/eval-framework

2 participants