[perf] fix: correct torch profiler logic for distributed trace collection #4707
base: main
Conversation
Code Review
This pull request provides a much-needed fix for the torch_profiler to correctly support distributed environments. The changes address critical issues like race conditions during trace file saving by introducing distributed barriers, and improve configuration validation to prevent ambiguous or invalid setups. The logic is now more robust and correct. I have one high-severity suggestion to replace assert statements with ValueError exceptions for configuration validation to ensure that these checks are not disabled in production environments.
```python
assert self.tool_config.step_start >= 0, "[ERROR] Profile step start must be greater than 0"
assert self.tool_config.step_end >= 0, "[ERROR] Profile step end must be greater than 0"
assert self.tool_config.step_start < self.tool_config.step_end, (
    "[ERROR] Profile step start must be less than step end"
)
```
Using assert for validating user configuration is not robust, as assertions can be disabled globally in Python (e.g., with the -O flag). This could lead to the profiler silently misbehaving with invalid step configurations. It's better to use ValueError for handling configuration errors, which is consistent with other checks in this method and ensures validation is always performed.
Suggested change:
```diff
-assert self.tool_config.step_start >= 0, "[ERROR] Profile step start must be greater than 0"
-assert self.tool_config.step_end >= 0, "[ERROR] Profile step end must be greater than 0"
-assert self.tool_config.step_start < self.tool_config.step_end, (
-    "[ERROR] Profile step start must be less than step end"
-)
+if self.tool_config.step_start < 0:
+    raise ValueError("[Profiler] Configuration Error: 'step_start' must be non-negative.")
+if self.tool_config.step_end < 0:
+    raise ValueError("[Profiler] Configuration Error: 'step_end' must be non-negative.")
+if self.tool_config.step_start >= self.tool_config.step_end:
+    raise ValueError(
+        f"[Profiler] Configuration Error: 'step_start' ({self.tool_config.step_start}) must be less than 'step_end' ({self.tool_config.step_end})."
+    )
```
@AkiRusProd Please format code according to: https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting
Thanks for pointing that out. I have applied the pre-commit hooks and formatted the code accordingly. The latest commit should pass the linting checks now.
What does this PR do?
This PR fixes the logic of torch_profiler within Verl to enable correct trace collection in distributed environments. Previously, the profiler logic prevented valid trace generation or caused race conditions during file saving.

Key changes include:
- Fixed _validate to correctly handle all_ranks vs. specific ranks. It now raises a clear error if both are set and correctly defaults to rank 0 if nothing is specified.
- Added torch.distributed.barrier() in the save() method, ensuring that all ranks stay synchronized around directory creation and trace file writes (a minimal sketch of this flow follows the list).
- Used self.tool_config.step_start/end inside the save filename generation (previously it might have referenced incorrect config attributes).
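As a rough illustration of how the barriers fit around trace saving, here is a minimal sketch of a Profiler.save-style method; the attribute names (profile_ranks, save_path) and the filename pattern are assumptions for illustration, not verl's exact implementation:

```python
import os

import torch.distributed as dist


def save(self):
    # Sketch only: every rank enters this method so the collective barriers stay matched.
    if self.rank == 0:
        # Only rank 0 creates the output directory, avoiding redundant mkdir calls.
        os.makedirs(self.tool_config.save_path, exist_ok=True)

    if dist.is_initialized():
        dist.barrier()  # wait until the directory exists before any rank writes

    if self.rank in self.profile_ranks and self.profiler is not None:
        trace_file = os.path.join(
            self.tool_config.save_path,
            f"rank{self.rank}_steps{self.tool_config.step_start}_{self.tool_config.step_end}.json",
        )
        self.profiler.export_chrome_trace(trace_file)

    if dist.is_initialized():
        dist.barrier()  # keep ranks in lockstep until all trace files are written
```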
Test
Verified by running a training job with profiling enabled.
Result: Chrome trace .json files are now correctly generated and saved to the specified output directory for the requested ranks.

API and Usage Example
No API signature changes, but the configuration logic is now stricter and safer.
Config Example (config.yaml or dict):
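A hypothetical snippet showing the relevant fields as a Python dict (the key names are assumptions based on this PR's description, not verl's exact schema):

```python
# Illustrative profiler settings; do not combine all_ranks=True with an explicit ranks list.
profiler_config = {
    "enable": True,
    "all_ranks": False,        # profile only the ranks listed below
    "ranks": [0, 1],           # raises ValueError if set together with all_ranks=True
    "step_start": 2,           # first profiled step, must be >= 0
    "step_end": 4,             # must be strictly greater than step_start
    "save_path": "./profile_traces",
}
```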
Note: Setting both all_ranks=True and ranks=[...] will now raise a ValueError to avoid ambiguity.

Design & Code Changes
Profiler._validate (a validation sketch follows this list):
- Raises an error if all_ranks and ranks are set simultaneously.
- Computes self.profile_ranks based on world size when all_ranks=True.
- Uses torch.distributed.get_world_size() to determine the world size.

Profiler.save:
- Only rank 0 calls os.makedirs.
- Adds torch.distributed.barrier() before and after saving to synchronize workers.
- Uses self.tool_config parameters consistently.
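A minimal sketch of the validation flow described above, with attribute names assumed from this PR's description rather than taken from verl's source:

```python
import torch.distributed as dist


def _validate(self):
    # Reject ambiguous configurations outright.
    if self.tool_config.all_ranks and self.tool_config.ranks:
        raise ValueError("[Profiler] Set either 'all_ranks=True' or an explicit 'ranks' list, not both.")

    if self.tool_config.all_ranks:
        # Profile every rank in the job.
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        self.profile_ranks = list(range(world_size))
    elif self.tool_config.ranks:
        self.profile_ranks = list(self.tool_config.ranks)
    else:
        # Nothing specified: default to profiling rank 0 only.
        self.profile_ranks = [0]

    if self.tool_config.step_start < 0 or self.tool_config.step_end < 0:
        raise ValueError("[Profiler] 'step_start' and 'step_end' must be non-negative.")
    if self.tool_config.step_start >= self.tool_config.step_end:
        raise ValueError("[Profiler] 'step_start' must be less than 'step_end'.")
```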
Related Issues:
#3985
Related PRs:
#4689
This PR supersedes #4689.
While #4689 fixes the immediate AttributeError regarding tool_config, this PR provides a complete fix for the profiler in distributed settings:
- Adds torch.distributed.barrier() to prevent race conditions during file saving (critical for multi-rank training).
- Improves validation of all_ranks vs. ranks, resolving the ambiguity that existed previously.