Fix: Storing built-in feature bins in program + Fix: Using llm_feedback_weight in final score #401
base: main
Conversation
Pull request overview
This PR fixes two inconsistencies in OpenEvolve’s evolution bookkeeping: (1) built-in MAP-Elites feature bins (complexity/diversity) were being computed for coordinates but not persisted on the Program object, and (2) the final combined_score weighting ignored llm_feedback_weight and used a fixed constant.
Changes:
- Persist built-in MAP-Elites complexity and diversity bin indices onto the Program before saving.
- Update combined_score recomputation to weight LLM feedback using self.config.llm_feedback_weight instead of a fixed 0.3.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| openevolve/database.py | Stores computed built-in feature bin indices (complexity, diversity) into the Program so they persist to disk. |
| openevolve/evaluator.py | Uses llm_feedback_weight when recomputing combined_score after LLM evaluation. |
```diff
  eval_result.metrics["combined_score"] = (
-     accuracy * 0.7 + llm_average * 0.3
+     accuracy * (1-self.config.llm_feedback_weight) + llm_average * self.config.llm_feedback_weight
  )
```
Copilot AI · Feb 9, 2026
combined_score now depends on llm_feedback_weight, but there’s no guard ensuring the weight is within [0.0, 1.0]. If a user misconfigures this, the score can become negative or exceed expected bounds; consider clamping or raising a clear config error before using it here.
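A minimal validation sketch along those lines (assuming a dataclass-style config with a `__post_init__` hook; the class name and hook placement are assumptions, only `llm_feedback_weight` itself comes from this PR):

```python
# Sketch only: validate llm_feedback_weight once at config construction time so a
# misconfigured value fails fast instead of silently producing out-of-range scores.
from dataclasses import dataclass


@dataclass
class EvaluatorConfig:  # hypothetical name, not OpenEvolve's actual config class
    llm_feedback_weight: float = 0.3

    def __post_init__(self) -> None:
        if not 0.0 <= self.llm_feedback_weight <= 1.0:
            raise ValueError(
                f"llm_feedback_weight must be in [0.0, 1.0], got {self.llm_feedback_weight}"
            )
```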
```diff
+     accuracy * (1-self.config.llm_feedback_weight) + llm_average * self.config.llm_feedback_weight
  )
```
Copilot AI · Feb 9, 2026
This line exceeds the configured Black line length (100) and is missing spaces around operators (e.g., 1 - weight). Reformatting will improve readability and avoid formatting/lint churn in future diffs.
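One way to reformat within the limit (a sketch; pulling the weight into a local keeps the line under 100 characters and restores operator spacing):

```python
# Bind the configured weight to a local so the combined-score expression stays
# within Black's 100-character line length and uses spaced operators.
llm_weight = self.config.llm_feedback_weight
eval_result.metrics["combined_score"] = (
    accuracy * (1 - llm_weight) + llm_average * llm_weight
)
```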
@copilot open a new pull request to apply changes based on this feedback
| if "combined_score" in eval_result.metrics: | ||
| # Original combined_score is just accuracy | ||
| accuracy = eval_result.metrics["combined_score"] | ||
| # Combine with LLM average (70% accuracy, 30% LLM quality) | ||
| eval_result.metrics["combined_score"] = ( | ||
| accuracy * 0.7 + llm_average * 0.3 | ||
| accuracy * (1-self.config.llm_feedback_weight) + llm_average * self.config.llm_feedback_weight | ||
| ) |
Copilot AI · Feb 9, 2026
Test coverage: the combined_score weighting behavior changed to depend on llm_feedback_weight, but there doesn’t appear to be a unit test asserting the new formula. Adding a focused test (including edge weights like 0.0/1.0) would prevent regressions.
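A focused test could look like this (a sketch; `recombine_score` is a hypothetical helper that mirrors the expression in the diff above, since the evaluator's internal entry point isn't shown here):

```python
import pytest


def recombine_score(accuracy: float, llm_average: float, llm_feedback_weight: float) -> float:
    # Hypothetical helper mirroring the new combined_score expression.
    return accuracy * (1 - llm_feedback_weight) + llm_average * llm_feedback_weight


@pytest.mark.parametrize(
    "weight, expected",
    [
        (0.0, 0.8),                    # weight 0.0: LLM feedback ignored, score == accuracy
        (1.0, 0.5),                    # weight 1.0: score == llm_average
        (0.3, 0.8 * 0.7 + 0.5 * 0.3),  # mid-range weighting
    ],
)
def test_combined_score_respects_llm_feedback_weight(weight, expected):
    score = recombine_score(accuracy=0.8, llm_average=0.5, llm_feedback_weight=weight)
    assert score == pytest.approx(expected)
```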
```diff
  # Use code length as complexity measure
  complexity = len(program.code)
  bin_idx = self._calculate_complexity_bin(complexity)
+ program.complexity = bin_idx  # Store complexity bin in program
  coords.append(bin_idx)
```
Copilot AI · Feb 9, 2026
Assigning the bin index into Program.complexity/Program.diversity is semantically ambiguous (the dataclass defines these as derived feature values, currently typed as float). Consider either casting to float for consistency, or introducing explicit fields like complexity_bin/diversity_bin to avoid confusing bins with raw feature values.
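One way to make the distinction explicit (a sketch; the `Program` fields shown are simplified and the `complexity_bin`/`diversity_bin` names are suggestions, not existing fields):

```python
from dataclasses import dataclass


@dataclass
class Program:  # simplified; the real dataclass has many more fields
    code: str = ""
    # Raw derived feature values (existing semantics, typed as float)
    complexity: float = 0.0
    diversity: float = 0.0
    # Explicit MAP-Elites bin indices, kept separate from the raw feature values
    complexity_bin: int = 0
    diversity_bin: int = 0
```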
@copilot open a new pull request to apply changes based on this feedback
```diff
  # Use cached diversity calculation with reference set
  if len(self.programs) < 2:
      bin_idx = 0
  else:
      diversity = self._get_cached_diversity(program)
      bin_idx = self._calculate_diversity_bin(diversity)
+ program.diversity = bin_idx  # Store diversity bin in program
  coords.append(bin_idx)
```
Copilot AI · Feb 9, 2026
In the cold-start branch (len(self.programs) < 2) diversity’s bin_idx is forced to 0 but program.diversity is not updated, so saved programs may still show the default value rather than the computed bin. Set program.diversity in this branch as well for consistency with the complexity handling.
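A possible adjustment, shown as a fragment of the method body from the diff above (a sketch, not the committed change): compute the bin in both branches, then assign once so the cold-start path also persists it.

```python
# Use cached diversity calculation with reference set
if len(self.programs) < 2:
    bin_idx = 0
else:
    diversity = self._get_cached_diversity(program)
    bin_idx = self._calculate_diversity_bin(diversity)
# Assign after the branch so the cold-start path (< 2 programs) also writes
# the bin back to the program instead of leaving the default value.
program.diversity = bin_idx
coords.append(bin_idx)
```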
@copilot open a new pull request to apply changes based on this feedback
| elif dim == "complexity": | ||
| # Use code length as complexity measure | ||
| complexity = len(program.code) | ||
| bin_idx = self._calculate_complexity_bin(complexity) | ||
| program.complexity = bin_idx # Store complexity bin in program | ||
| coords.append(bin_idx) | ||
| elif dim == "diversity": | ||
| # Use cached diversity calculation with reference set | ||
| if len(self.programs) < 2: | ||
| bin_idx = 0 | ||
| else: | ||
| diversity = self._get_cached_diversity(program) | ||
| bin_idx = self._calculate_diversity_bin(diversity) | ||
| program.diversity = bin_idx # Store diversity bin in program | ||
| coords.append(bin_idx) |
Copilot AI · Feb 9, 2026
Test coverage: this change is intended to persist built-in feature bins into the saved Program, but there’s no test asserting that Program.complexity/diversity are updated after coordinate calculation/add(). Add a unit test that loads/saves a program and verifies these fields are non-default when built-in dimensions are used.
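A test along these lines could cover it (a sketch; `make_database`/`make_program` are hypothetical factories, and only `add()`, the private bin helpers, and the built-in dimension names come from the diff above):

```python
def test_builtin_feature_bins_persisted_after_add():
    # Hypothetical factories standing in for however the suite builds these objects;
    # the constructor arguments are assumptions about OpenEvolve's API.
    db = make_database(feature_dimensions=["complexity", "diversity"])
    first = make_program(code="def f(x):\n    return x\n")
    second = make_program(code="def g(x):\n    return x * x + 42\n")
    db.add(first)
    db.add(second)
    # After add(), the built-in bin indices should be written back onto the
    # Program rather than left at their 0.0 defaults.
    assert second.complexity == db._calculate_complexity_bin(len(second.code))
    assert isinstance(second.diversity, int)  # a bin index, not the float default
```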
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Problem 1
Currently, feature bins are computed and shown correctly in the output but not saved on their respective programs, so they are logged in the program as:
"complexity": 0.0,
"diversity": 0.0,
Solution 1
In addition to appending the bin indices to coords, store them on the Program so they persist with the saved features.
Problem 2
Even though llm_feedback_weight is documented as being reflected in the final score, the final score uses a fixed weighting (0.3) and does not consider llm_feedback_weight at all.
Solution 2
Removed the fixed weighting and used llm_feedback_weight in the final score instead.
Files changed
- openevolve/database.py - Storing built-in feature bins in program
- openevolve/evaluator.py - Using llm_feedback_weight in final score