[EvalResultConvert] Leverage evaluator metrics from evaluators API #44209
base: main
Conversation
Pull request overview
This PR enhances the evaluation result conversion functionality by adding support for extracting evaluator metrics directly from the evaluator API configuration. This allows metrics to be determined from the evaluator_config parameter when they are not available in the evaluation metadata.
Key Changes
- Added `evaluator_config` parameter to `_convert_results_to_aoai_evaluation_results()` function to enable metric extraction from evaluator definitions (sketched below)
- Implemented fallback logic to extract metrics from `_evaluator_definition.metrics` in the evaluator config when metadata doesn't provide them
- Added comprehensive test data file to validate the new metric extraction path
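For orientation, here is a minimal sketch of the kind of `evaluator_config` entry the fallback can read from. The nesting (criteria name, then `_evaluator_definition`, then `metrics` with per-metric `type` and `desirable_direction`) is inferred from the diff snippets quoted later in this review; the criteria and metric names are hypothetical, and this is not the SDK's documented schema.

```python
# Sketch only: shape inferred from the diff snippets in this review.
evaluator_config = {
    "my_custom_criteria": {                  # criteria name (hypothetical)
        "_evaluator_definition": {
            "metrics": {
                "my_custom_metric": {        # metric name (hypothetical)
                    "type": "boolean",
                    "desirable_direction": "decrease",
                },
            },
        },
    },
}

# The converter can then use these definitions when eval_meta_data carries no
# testing-criteria metrics, e.g. (keyword name per this PR):
# _convert_results_to_aoai_evaluation_results(..., evaluator_config=evaluator_config)
```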
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py | Added evaluator_config parameter and implemented logic to extract metrics from evaluator definition config as a fallback when metadata metrics are not available |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_evaluate.py | Updated test to load and pass the new evaluator config test data to validate the metric extraction functionality |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_eval_config.json | New test data file containing evaluator configurations with metric definitions for testing the new extraction logic |
Comments suppressed due to low confidence (1)
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py:1810
- The docstring is incomplete and missing documentation for the new `evaluator_config` parameter. The docstring should document all parameters including `logger`, `eval_id`, `eval_run_id`, `evaluators`, `eval_run_summary`, `eval_meta_data`, and the newly added `evaluator_config`. Additionally, the order of parameters in the docstring should match the function signature. (A possible completion is sketched after the quoted docstring below.)
"""
Convert evaluation results to AOAI evaluation results format.
Each row of input results.rows looks like:
{"inputs.query":"What is the capital of France?","inputs.context":"France is in Europe",
"inputs.generated_response":"Paris is the capital of France.","inputs.ground_truth":"Paris is the capital of France.",
"outputs.F1_score.f1_score":1.0,"outputs.F1_score.f1_result":"pass","outputs.F1_score.f1_threshold":0.5}
Convert each row into new RunOutputItem object with results array.
:param results: The evaluation results to convert
:type results: EvaluationResult
:param eval_meta_data: The evaluation metadata, containing eval_id, eval_run_id, and testing_criteria
:type eval_meta_data: Dict[str, Any]
:param logger: Logger instance
:type logger: logging.Logger
:return: EvaluationResult with converted evaluation results in AOAI format
:rtype: EvaluationResult
"""
API Change Check: APIView identified API level changes in this PR and created the following API reviews.
    evaluator_config
    and criteria_name in evaluator_config
    and isinstance(evaluator_config[criteria_name], dict)
    and "_evaluator_definition" in evaluator_config[criteria_name]
Why is this condition needed?
It is the fallback for local evaluation: in that case evaluator_config doesn't contain _evaluator_definition, so it falls back to the hardcoded mappings.
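In other words, the condition gates which source of metric definitions is used. A rough, self-contained sketch of that decision (not the PR's actual code; the function and parameter names here are hypothetical):

```python
from typing import Any, Dict, Optional


def _resolve_metrics_for_criteria(
    criteria_name: str,
    evaluator_config: Optional[Dict[str, Any]],
    hardcoded_mappings: Dict[str, Any],
) -> Dict[str, Any]:
    """Illustrative sketch: prefer metric definitions from the evaluator
    config, otherwise fall back to the hardcoded name->metrics mappings
    used for local evaluation."""
    config = (evaluator_config or {}).get(criteria_name)
    if isinstance(config, dict) and "_evaluator_definition" in config:
        return config["_evaluator_definition"].get("metrics", {})
    # Local evaluation: no _evaluator_definition present, use the fallback map.
    return hardcoded_mappings.get(criteria_name, {})
```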
    decrease_boolean_metrics = []
    if evaluator_config and isinstance(evaluator_config, dict):
        for criteria_name, config in evaluator_config.items():
            if config and isinstance(config, dict) and "_evaluator_definition" in config:
Do we need to parse the eval config multiple times? Seems like we can parse it once, and construct the mapping from criteria to metric and direction.
Updated to parse only once.
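As a sketch of the parse-once suggestion (not the PR's actual implementation), the config could be walked a single time to build a criteria-to-metric-properties map that later lookups reuse:

```python
from typing import Any, Dict, Optional


def _build_metric_direction_map(
    evaluator_config: Optional[Dict[str, Any]],
) -> Dict[str, Dict[str, Dict[str, Any]]]:
    """Sketch: one pass over evaluator_config producing
    {criteria_name: {metric_name: {"type": ..., "desirable_direction": ...}}}."""
    mapping: Dict[str, Dict[str, Dict[str, Any]]] = {}
    if not isinstance(evaluator_config, dict):
        return mapping
    for criteria_name, config in evaluator_config.items():
        if not (isinstance(config, dict) and "_evaluator_definition" in config):
            continue
        metrics = config["_evaluator_definition"].get("metrics", {}) or {}
        mapping[criteria_name] = {
            metric_name: {
                "type": metric_info.get("type"),
                "desirable_direction": metric_info.get("desirable_direction"),
            }
            for metric_name, metric_info in metrics.items()
            if isinstance(metric_info, dict)
        }
    return mapping
```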
    return False


def _get_decrease_boolean_metrics(
Let's not use the terms decrease or increase in the function name or outside of it; that is internal implementation detail.
I would strongly suggest using semantic names in the code, so that the code is easy to understand.
For these evaluators the logic is inverse, or negated, so we should have something like a logic inverter.
The function should return whether the result is_pass or is_failed, because sometimes True means pass and sometimes True means fail (e.g., for these evaluators).
Updated to _get_inverse_metrics(); also updated the other related decrease* names.
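To illustrate the semantic-naming suggestion (a sketch, not the code in this PR), a pass/fail interpretation helper could own the inversion so callers never reason about decrease vs. increase directly:

```python
def _is_pass(metric_value: bool, is_inverse_metric: bool) -> bool:
    """Sketch of the reviewer's suggestion: for inverse (negated) boolean
    metrics, True means the undesirable behavior was detected, so the
    result is a pass only when the raw value is False."""
    return not metric_value if is_inverse_metric else bool(metric_value)


# e.g. a violation detector returning True (issue found) is a failed check:
# _is_pass(True, is_inverse_metric=True)  -> False
```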
    metric_info
    and isinstance(metric_info, dict)
    and "desirable_direction" in metric_info
    and metric_info["desirable_direction"] is True
Should it be checking the direction to determine inverse? Shouldn't inverse correspond to a desirable_direction of decrease?
Thanks for catching that, updated to:
    metric_info["desirable_direction"] == "decrease"
| and "desirable_direction" in metric_info | ||
| and metric_info["desirable_direction"] is True | ||
| and "type" in metric_info | ||
| and metric_info["type"] == "boolean" |
why does boolean matter?
Per the previous reverse logic, only binary decrease evaluators need the special reversal of the label, score, and passed metric values.
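As a sketch of that reversal: the label, score, and passed fields come from the reply above, but the concrete reversed values shown here are illustrative assumptions, not this PR's actual output format.

```python
from typing import Any, Dict


def _maybe_reverse_boolean_result(result: Dict[str, Any], needs_reverse: bool) -> Dict[str, Any]:
    """Sketch only: for binary decrease metrics, a truthy raw score means the
    undesirable behavior occurred, so label, score, and passed are flipped;
    other metric types pass through unchanged."""
    if not needs_reverse:
        return result
    reversed_result = dict(result)
    detected = bool(result.get("score"))               # raw boolean-style score
    reversed_result["passed"] = not detected
    reversed_result["score"] = 0.0 if detected else 1.0
    reversed_result["label"] = "fail" if detected else "pass"
    return reversed_result
```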
OK, maybe call this inverse_boolean_metrics then, if it is only specific to boolean?
Just found that Waqas (@w-javed) has a comment suggesting we avoid internal terms like decrease/increase in the method name; isn't the boolean type a similar internal term?
#44209 (comment)
How about _get_metrics_need_extra_reverse(), focusing on the purpose without putting internal terms in the method name?
    test_eval_input_metadata = {}
    with open(test_input_eval_metadata_path, "r") as f:
        test_eval_input_metadata = json.load(f)
    test_eval_input_config = {}
I see that you don't have any new test. Is there a way to verify the config is actually being used correctly in the new code?
Updated the custom evaluator type to an inverse binary evaluator, so it will trigger the new code logic since there is no hardcoded mapping for it.
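For reference, an illustrative shape of such a test entry, shown as a Python dict rather than the JSON file's exact content (which this review does not include): a custom criteria whose single metric is boolean with desirable_direction set to decrease, so no hardcoded mapping exists and the new extraction path is exercised.

```python
# Hypothetical analogue of an entry in evaluation_util_convert_eval_config.json;
# the real file's contents are not shown in this review.
test_eval_input_config = {
    "my_inverse_binary_evaluator": {
        "_evaluator_definition": {
            "metrics": {
                "my_inverse_binary_metric": {
                    "type": "boolean",
                    "desirable_direction": "decrease",
                },
            },
        },
    },
}
```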
    _EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS["code_vulnerability"],
    _EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS["protected_material"],
    _EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS["eci"],
    _EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS["ungrounded_attributes"],
Can we remove this hardcoded list of evaluators?
Per our chat, it's also for the fallback for local eval.