[EvalResultConvert] Leverage evaluator metrics from evaluators API #44209
base: main
Conversation
Pull request overview
This PR enhances the evaluation result conversion functionality by adding support for extracting evaluator metrics directly from the evaluator API configuration. This allows metrics to be determined from the evaluator_config parameter when they are not available in the evaluation metadata.
Key Changes
- Added `evaluator_config` parameter to `_convert_results_to_aoai_evaluation_results()` function to enable metric extraction from evaluator definitions (sketched below)
- Implemented fallback logic to extract metrics from `_evaluator_definition.metrics` in the evaluator config when metadata doesn't provide them
- Added comprehensive test data file to validate the new metric extraction path
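For orientation, here is a minimal sketch of the kind of `evaluator_config` entry the fallback can read from. The nesting (criteria name, then `_evaluator_definition`, then `metrics` with per-metric `type` and `desirable_direction`) is inferred from the diff snippets quoted later in this review; the criteria and metric names are hypothetical, and this is not the SDK's documented schema.

```python
# Sketch only: shape inferred from the diff snippets in this review.
evaluator_config = {
    "my_custom_criteria": {                  # criteria name (hypothetical)
        "_evaluator_definition": {
            "metrics": {
                "my_custom_metric": {        # metric name (hypothetical)
                    "type": "boolean",
                    "desirable_direction": "decrease",
                },
            },
        },
    },
}

# The converter can then use these definitions when eval_meta_data carries no
# testing-criteria metrics, e.g. (keyword name per this PR):
# _convert_results_to_aoai_evaluation_results(..., evaluator_config=evaluator_config)
```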
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py | Added evaluator_config parameter and implemented logic to extract metrics from evaluator definition config as a fallback when metadata metrics are not available |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_evaluate.py | Updated test to load and pass the new evaluator config test data to validate the metric extraction functionality |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/data/evaluation_util_convert_eval_config.json | New test data file containing evaluator configurations with metric definitions for testing the new extraction logic |
Comments suppressed due to low confidence (1)
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py:1810
- The docstring is incomplete and missing documentation for the new `evaluator_config` parameter. The docstring should document all parameters including `logger`, `eval_id`, `eval_run_id`, `evaluators`, `eval_run_summary`, `eval_meta_data`, and the newly added `evaluator_config`. Additionally, the order of parameters in the docstring should match the function signature. (A possible completion is sketched after the quoted docstring below.)
"""
Convert evaluation results to AOAI evaluation results format.
Each row of input results.rows looks like:
{"inputs.query":"What is the capital of France?","inputs.context":"France is in Europe",
"inputs.generated_response":"Paris is the capital of France.","inputs.ground_truth":"Paris is the capital of France.",
"outputs.F1_score.f1_score":1.0,"outputs.F1_score.f1_result":"pass","outputs.F1_score.f1_threshold":0.5}
Convert each row into new RunOutputItem object with results array.
:param results: The evaluation results to convert
:type results: EvaluationResult
:param eval_meta_data: The evaluation metadata, containing eval_id, eval_run_id, and testing_criteria
:type eval_meta_data: Dict[str, Any]
:param logger: Logger instance
:type logger: logging.Logger
:return: EvaluationResult with converted evaluation results in AOAI format
:rtype: EvaluationResult
"""
API Change Check: APIView identified API level changes in this PR and created the following API reviews.
    evaluator_config
    and criteria_name in evaluator_config
    and isinstance(evaluator_config[criteria_name], dict)
    and "_evaluator_definition" in evaluator_config[criteria_name]
Why is this condition needed?
It is the fallback for local evaluation: in that case evaluator_config doesn't contain _evaluator_definition, so it falls back to the hardcoded mappings.
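In other words, the condition gates which source of metric definitions is used. A rough, self-contained sketch of that decision (not the PR's actual code; the function and parameter names here are hypothetical):

```python
from typing import Any, Dict, Optional


def _resolve_metrics_for_criteria(
    criteria_name: str,
    evaluator_config: Optional[Dict[str, Any]],
    hardcoded_mappings: Dict[str, Any],
) -> Dict[str, Any]:
    """Illustrative sketch: prefer metric definitions from the evaluator
    config, otherwise fall back to the hardcoded name->metrics mappings
    used for local evaluation."""
    config = (evaluator_config or {}).get(criteria_name)
    if isinstance(config, dict) and "_evaluator_definition" in config:
        return config["_evaluator_definition"].get("metrics", {})
    # Local evaluation: no _evaluator_definition present, use the fallback map.
    return hardcoded_mappings.get(criteria_name, {})
```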
    decrease_boolean_metrics = []
    if evaluator_config and isinstance(evaluator_config, dict):
        for criteria_name, config in evaluator_config.items():
            if config and isinstance(config, dict) and "_evaluator_definition" in config:
Do we need to parse the eval config multiple times? Seems like we can parse it once, and construct the mapping from criteria to metric and direction.
Updated to parse only once.
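As a sketch of the parse-once suggestion (not the PR's actual implementation), the config could be walked a single time to build a criteria-to-metric-properties map that later lookups reuse:

```python
from typing import Any, Dict, Optional


def _build_metric_direction_map(
    evaluator_config: Optional[Dict[str, Any]],
) -> Dict[str, Dict[str, Dict[str, Any]]]:
    """Sketch: one pass over evaluator_config producing
    {criteria_name: {metric_name: {"type": ..., "desirable_direction": ...}}}."""
    mapping: Dict[str, Dict[str, Dict[str, Any]]] = {}
    if not isinstance(evaluator_config, dict):
        return mapping
    for criteria_name, config in evaluator_config.items():
        if not (isinstance(config, dict) and "_evaluator_definition" in config):
            continue
        metrics = config["_evaluator_definition"].get("metrics", {}) or {}
        mapping[criteria_name] = {
            metric_name: {
                "type": metric_info.get("type"),
                "desirable_direction": metric_info.get("desirable_direction"),
            }
            for metric_name, metric_info in metrics.items()
            if isinstance(metric_info, dict)
        }
    return mapping
```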
    return False


def _get_decrease_boolean_metrics(
Let's not use the terms decrease or increase in the function name or outside of it; that is internal implementation detail.
I would strongly suggest using semantic names in the code, so that the code is easy to understand.
For these evaluators the logic is inverse, or negated, so we should have something like a logic inverter.
The function should return whether the result is_pass or is_failed, because sometimes True means pass and sometimes True means fail (e.g., for these evaluators).
Updated to _get_inverse_metrics(); also updated the other related decrease* names.
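To illustrate the semantic-naming suggestion (a sketch, not the code in this PR), a pass/fail interpretation helper could own the inversion so callers never reason about decrease vs. increase directly:

```python
def _is_pass(metric_value: bool, is_inverse_metric: bool) -> bool:
    """Sketch of the reviewer's suggestion: for inverse (negated) boolean
    metrics, True means the undesirable behavior was detected, so the
    result is a pass only when the raw value is False."""
    return not metric_value if is_inverse_metric else bool(metric_value)


# e.g. a violation detector returning True (issue found) is a failed check:
# _is_pass(True, is_inverse_metric=True)  -> False
```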
    metric_info
    and isinstance(metric_info, dict)
    and "desirable_direction" in metric_info
    and metric_info["desirable_direction"] is True
Should it be checking the direction to determine inverse? Shouldn't inverse correspond to a desirable_direction of decrease?
Thanks for catching that, updated to:
    metric_info["desirable_direction"] == "decrease"
| and "desirable_direction" in metric_info | ||
| and metric_info["desirable_direction"] is True | ||
| and "type" in metric_info | ||
| and metric_info["type"] == "boolean" |
why does boolean matter?
Per the previous reverse logic, only binary decrease evaluators need the special reversal of the label, score, and passed metric values.
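As a sketch of that reversal: the label, score, and passed fields come from the reply above, but the concrete reversed values shown here are illustrative assumptions, not this PR's actual output format.

```python
from typing import Any, Dict


def _maybe_reverse_boolean_result(result: Dict[str, Any], needs_reverse: bool) -> Dict[str, Any]:
    """Sketch only: for binary decrease metrics, a truthy raw score means the
    undesirable behavior occurred, so label, score, and passed are flipped;
    other metric types pass through unchanged."""
    if not needs_reverse:
        return result
    reversed_result = dict(result)
    detected = bool(result.get("score"))               # raw boolean-style score
    reversed_result["passed"] = not detected
    reversed_result["score"] = 0.0 if detected else 1.0
    reversed_result["label"] = "fail" if detected else "pass"
    return reversed_result
```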
OK, maybe call this inverse_boolean_metrics then, if it is only specific to boolean?
Just found that Waqas (@w-javed) has a comment suggesting we avoid internal terms like decrease/increase in the method name; isn't the boolean type a similar internal term?
#44209 (comment)
How about _get_metrics_need_extra_reverse(), focusing on the purpose without putting internal terms in the method name?
    test_eval_input_metadata = {}
    with open(test_input_eval_metadata_path, "r") as f:
        test_eval_input_metadata = json.load(f)
    test_eval_input_config = {}
I see that you don't have any new test. Is there a way to verify the config is actually being used correctly in the new code?
Updated the custom evaluator type to an inverse binary evaluator, so it will trigger the new code logic since there is no hardcoded mapping for it.
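For reference, an illustrative shape of such a test entry, shown as a Python dict rather than the JSON file's exact content (which this review does not include): a custom criteria whose single metric is boolean with desirable_direction set to decrease, so no hardcoded mapping exists and the new extraction path is exercised.

```python
# Hypothetical analogue of an entry in evaluation_util_convert_eval_config.json;
# the real file's contents are not shown in this review.
test_eval_input_config = {
    "my_inverse_binary_evaluator": {
        "_evaluator_definition": {
            "metrics": {
                "my_inverse_binary_metric": {
                    "type": "boolean",
                    "desirable_direction": "decrease",
                },
            },
        },
    },
}
```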
    _EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS["code_vulnerability"],
    _EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS["protected_material"],
    _EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS["eci"],
    _EvaluatorMetricMapping.EVALUATOR_NAME_METRICS_MAPPINGS["ungrounded_attributes"],
Can we remove this hardcoded list of evaluators?
Per our chat, it's also for the fallback for local eval.