
Conversation

Contributor

@YoYoJa commented Dec 1, 2025

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copilot AI review requested due to automatic review settings December 1, 2025 08:49
@YoYoJa requested a review from a team as a code owner December 1, 2025 08:49
@github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) Dec 1, 2025
Contributor

Copilot AI left a comment

Pull request overview

This PR enhances the evaluation result conversion functionality by adding support for extracting evaluator metrics directly from the evaluator API configuration. This allows metrics to be determined from the evaluator_config parameter when they are not available in the evaluation metadata.
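
For readers skimming the diff, here is a minimal sketch of that fallback, assuming evaluator_config is a dict keyed by criteria name whose entries may carry an _evaluator_definition with a metrics mapping; the helper name _metrics_from_config is hypothetical, not the merged code:

def _metrics_from_config(evaluator_config, criteria_name):
    """Return metric names from a criteria's _evaluator_definition, or [] when it is absent."""
    config = (evaluator_config or {}).get(criteria_name)
    if isinstance(config, dict) and "_evaluator_definition" in config:
        metrics = (config["_evaluator_definition"] or {}).get("metrics") or {}
        return list(metrics.keys())
    return []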

Key Changes

  • Added evaluator_config parameter to _convert_results_to_aoai_evaluation_results() function to enable metric extraction from evaluator definitions
  • Implemented fallback logic to extract metrics from _evaluator_definition.metrics in the evaluator config when metadata doesn't provide them
  • Added comprehensive test data file to validate the new metric extraction path

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File / Description

  • sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py: Added the evaluator_config parameter and implemented logic to extract metrics from the evaluator definition config as a fallback when metadata metrics are not available.
  • sdk/evaluation/azure-ai-evaluation/tests/unittests/test_evaluate.py: Updated the test to load and pass the new evaluator config test data, validating the metric extraction functionality.
  • sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_eval_config.json: New test data file containing evaluator configurations with metric definitions for testing the new extraction logic.

Comments suppressed due to low confidence (1)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py:1810

  • The docstring is incomplete and missing documentation for the new evaluator_config parameter. The docstring should document all parameters including logger, eval_id, eval_run_id, evaluators, eval_run_summary, eval_meta_data, and the newly added evaluator_config. Additionally, the order of parameters in the docstring should match the function signature.
    """
    Convert evaluation results to AOAI evaluation results format.

    Each row of input results.rows looks like:
    {"inputs.query":"What is the capital of France?","inputs.context":"France is in Europe",
     "inputs.generated_response":"Paris is the capital of France.","inputs.ground_truth":"Paris is the capital of France.",
     "outputs.F1_score.f1_score":1.0,"outputs.F1_score.f1_result":"pass","outputs.F1_score.f1_threshold":0.5}

    Convert each row into new RunOutputItem object with results array.

    :param results: The evaluation results to convert
    :type results: EvaluationResult
    :param eval_meta_data: The evaluation metadata, containing eval_id, eval_run_id, and testing_criteria
    :type eval_meta_data: Dict[str, Any]
    :param logger: Logger instance
    :type logger: logging.Logger
    :return: EvaluationResult with converted evaluation results in AOAI format
    :rtype: EvaluationResult
    """

@github-actions bot commented Dec 1, 2025

API Change Check

APIView identified API-level changes in this PR and created the following API reviews:

azure-ai-evaluation

evaluator_config
and criteria_name in evaluator_config
and isinstance(evaluator_config[criteria_name], dict)
and "_evaluator_definition" in evaluator_config[criteria_name]
Member

Why is this condition needed?

Contributor Author

It is the fallback for local evaluation: when evaluator_config doesn't contain _evaluator_definition, it falls back to the hardcoded mappings.

decrease_boolean_metrics = []
if evaluator_config and isinstance(evaluator_config, dict):
    for criteria_name, config in evaluator_config.items():
        if config and isinstance(config, dict) and "_evaluator_definition" in config:
Member

Do we need to parse the eval config multiple times? Seems like we can parse it once, and construct the mapping from criteria to metric and direction.

Contributor Author

Updated to parse the config only once.
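
As an illustration of that parse-once approach (a sketch under the same assumed config shape; _build_criteria_metric_map is an illustrative name, not the merged code):

def _build_criteria_metric_map(evaluator_config):
    """Walk evaluator_config once, mapping criteria -> {metric: (type, desirable_direction)}."""
    criteria_map = {}
    if not isinstance(evaluator_config, dict):
        return criteria_map
    for criteria_name, config in evaluator_config.items():
        if not (isinstance(config, dict) and "_evaluator_definition" in config):
            continue
        metrics = (config["_evaluator_definition"] or {}).get("metrics") or {}
        criteria_map[criteria_name] = {
            name: (info.get("type"), info.get("desirable_direction"))
            for name, info in metrics.items()
            if isinstance(info, dict)
        }
    return criteria_map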

return False


def _get_decrease_boolean_metrics(
Contributor

@w-javed Dec 4, 2025

Let's not use the terms decrease or increase in the function name or outside the function name; that's internal implementation detail. I would highly suggest using semantic names in the code, so that the code is easy to understand.

For these evaluators, the logic is inverse (negated), so we should have something like a logic inverter.

The function should return is_pass or is_failed, because sometimes True means pass and sometimes True means fail (e.g. for these evaluators).

Contributor Author

Updated to _get_inverse_metrics(), and also renamed the other related decrease* identifiers.

metric_info
and isinstance(metric_info, dict)
and "desirable_direction" in metric_info
and metric_info["desirable_direction"] is True
Member

Should it be checking the direction to determine inverse? Shouldn't inverse correspond to "decrease"?

Contributor Author

Thanks for catching that, updated to:
metric_info["desirable_direction"] == "decrease"

and "desirable_direction" in metric_info
and metric_info["desirable_direction"] is True
and "type" in metric_info
and metric_info["type"] == "boolean"
Member

why does boolean matter?

Contributor Author

Per the previous reversal logic, only binary decrease evaluators need the special reversal of the label, score, and passed metric values.
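
To make the boolean point concrete, a hedged sketch of the kind of reversal being discussed; the result field names (score, label, passed) and the helper name are illustrative, not the SDK's actual implementation:

def _maybe_invert_result(result, metric_info):
    """Flip passed/label/score only for boolean metrics whose desirable direction is 'decrease'."""
    is_inverse_boolean = (
        isinstance(metric_info, dict)
        and metric_info.get("type") == "boolean"
        and metric_info.get("desirable_direction") == "decrease"
    )
    if not is_inverse_boolean:
        return result
    inverted = dict(result)
    inverted["passed"] = not result.get("passed")
    inverted["label"] = "pass" if inverted["passed"] else "fail"
    inverted["score"] = 1 - (result.get("score") or 0)
    return inverted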

Member

OK, maybe call this inverse_boolean_metrics then, if it is only specific to boolean metrics?

Contributor Author

@YoYoJa Dec 11, 2025

Just found that Waqas (@w-javed) has a comment asking to avoid internal terms like decrease/increase in the method name; isn't the boolean type a similar internal term?
#44209 (comment)

How about _get_metrics_need_extra_reverse(), which focuses on the purpose without putting internal terms in the method name?

test_eval_input_metadata = {}
with open(test_input_eval_metadata_path, "r") as f:
    test_eval_input_metadata = json.load(f)
test_eval_input_config = {}
Member

I see that you don't have any new test. Is there a way to verify the config is actually being used correctly in the new code?

Contributor Author

Updated the custom evaluator type to an inverse binary evaluator, so it triggers the new code path, since there is no hardcoded mapping for it.
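
For illustration only (the actual JSON fixture in evaluation_util_convert_eval_config.json may be shaped differently), such an inverse binary custom evaluator entry could look like:

test_eval_input_config = {
    "my_custom_check": {  # hypothetical criteria name
        "_evaluator_definition": {
            "metrics": {
                "my_custom_check": {
                    "type": "boolean",
                    "desirable_direction": "decrease",
                }
            }
        }
    }
}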

_EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS["code_vulnerability"],
_EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS["protected_material"],
_EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS["eci"],
_EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS["ungrounded_attributes"],
Contributor

Can we remove this hardcoded list of evaluators?

Contributor Author

Per our chat, it's also needed as the fallback for local evaluation.
