feat: establish robust evaluation framework for workflow benchmarks #457

Open
cocosheng-g wants to merge 8 commits into main from feat/eval-framework
Conversation

@cocosheng-g
Collaborator

Overview

This PR introduces a robust, automated evaluation framework for Gemini CLI example workflows (Triage, Review, Fixer, Assistant), fulfilling the requirements of #219.

Key Features

  • Isolated TestRig: Secure, concurrent environment using temporary GEMINI_CLI_HOME to prevent interference with local settings.
  • Mock MCP Server: A dedicated mock-mcp-server.ts providing high-fidelity GitHub API simulation, enabling realistic PR Review benchmarks.
  • Gold-Standard Datasets: Structured benchmarks in evals/data/ using high-signal technical assertions (e.g., detecting eval vulnerabilities or quadratic complexity).
  • Nightly Automation: Integrated GitHub Action matrix testing across 5 Gemini models.
  • Automated Reporting: Aggregate reporting script for GitHub Job Summaries.
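The isolation idea behind the TestRig can be sketched as follows. This is a minimal illustration, not the PR's implementation: `createIsolatedHome` and the `IsolatedHome` shape are hypothetical names, and the only detail taken from the description is that each run gets its own temporary GEMINI_CLI_HOME. Every call creates a fresh directory, so concurrent runs never share settings.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical shape for illustration; the PR's TestRig may look different.
export interface IsolatedHome {
  homeDir: string;
  env: Record<string, string | undefined>;
  cleanup(): void;
}

// Creates a unique temporary directory and an environment that points
// GEMINI_CLI_HOME at it, leaving the caller's real settings untouched.
export function createIsolatedHome(prefix = "gemini-eval-"): IsolatedHome {
  // mkdtempSync appends random characters, so parallel calls get distinct dirs.
  const homeDir = fs.mkdtempSync(path.join(os.tmpdir(), prefix));

  // Copy the current environment and override only the home variable.
  const env: Record<string, string | undefined> = {
    ...process.env,
    GEMINI_CLI_HOME: homeDir,
  };

  return {
    homeDir,
    env,
    // Remove the temporary directory once the evaluation has finished.
    cleanup: () => fs.rmSync(homeDir, { recursive: true, force: true }),
  };
}
```

The returned `env` would be passed to whatever spawns the CLI under test (for example as the `env` option of `child_process.spawn`), and `cleanup` would be called in a `finally` block so that failed runs do not leave directories behind.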

Next Steps

  • Data Expansion: Add more complex edge cases to the existing datasets.
  • Prompt Tuning: Use this baseline to fine-tune the workflow prompts for even higher reliability.

Related to: #219

- Implement Isolated `TestRig` for environment-safe, concurrent evaluations.
- Add gold-standard datasets for Issue Triage, Scheduled Triage, Assistant, and Issue Fixer.
- Implement Mock MCP Server for high-fidelity PR Review benchmarking.
- Add nightly evaluation workflow with multi-model strategy matrix.
- Add automated aggregate reporting for GitHub Job Summaries.
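As a rough sketch of what aggregate reporting across the model matrix could look like, the function below folds per-case results into one Markdown table. The `EvalResult` shape and `renderSummary` name are assumptions made for illustration; the PR's actual reporting script and result format are not shown in this conversation.

```typescript
// Hypothetical result record; the real eval output format may differ.
interface EvalResult {
  model: string;
  workflow: string;
  passed: boolean;
}

interface Tally {
  model: string;
  workflow: string;
  passed: number;
  total: number;
}

// Groups results by (model, workflow) and renders a Markdown table.
export function renderSummary(results: EvalResult[]): string {
  const tallies = new Map<string, Tally>();

  for (const r of results) {
    const key = JSON.stringify([r.model, r.workflow]);
    const tally =
      tallies.get(key) ??
      { model: r.model, workflow: r.workflow, passed: 0, total: 0 };
    tally.total += 1;
    if (r.passed) tally.passed += 1;
    tallies.set(key, tally);
  }

  const lines = [
    "| Model | Workflow | Passed | Total | Pass rate |",
    "| --- | --- | ---: | ---: | ---: |",
  ];
  // Map preserves insertion order, so rows appear in first-seen order.
  for (const t of tallies.values()) {
    const rate = ((t.passed / t.total) * 100).toFixed(1);
    lines.push(
      `| ${t.model} | ${t.workflow} | ${t.passed} | ${t.total} | ${rate}% |`,
    );
  }
  return lines.join("\n");
}
```

In a GitHub Actions step, the returned string can be appended to the file named by the `GITHUB_STEP_SUMMARY` environment variable so that the table shows up in the job summary.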

Next Steps:
- Expand evaluation datasets with more edge cases.
- Fine-tune workflow prompts based on baseline quality analysis.

Refs: #219
@gemini-cli
Contributor

gemini-cli bot commented Feb 4, 2026

🤖 Hi @cocosheng-g, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@cocosheng-g cocosheng-g marked this pull request as draft February 4, 2026 22:38
@cocosheng-g
Collaborator Author

/gemini review

@cocosheng-g cocosheng-g marked this pull request as ready for review February 5, 2026 01:33
@cocosheng-g cocosheng-g requested review from kschaab and removed request for R2wenD2, allenhutchison, haroonc and umairidris February 5, 2026 01:33
@cocosheng-g cocosheng-g linked an issue Feb 5, 2026 that may be closed by this pull request
@cocosheng-g cocosheng-g enabled auto-merge (squash) February 6, 2026 16:40
@cocosheng-g cocosheng-g requested a review from jerop February 6, 2026 18:53
Development

Successfully merging this pull request may close these issues.

Evaluate and Improve Example Workflows

2 participants