
Conversation

@AkiRusProd AkiRusProd commented Dec 28, 2025

What does this PR do?

This PR fixes the torch_profiler logic in verl to enable correct trace collection in distributed environments. Previously, the profiler logic either prevented valid trace generation or caused race conditions when saving trace files.

Key changes include:

  1. Configuration Validation: Fixed the logic in _validate to correctly handle all_ranks vs. specific ranks. It now raises a clear error if both are set and defaults to rank 0 if neither is specified.
  2. Distributed Synchronization: Added torch.distributed.barrier() in the save() method (see the sketch after this list). This ensures that:
    • Rank 0 creates the directory before other ranks attempt to write.
    • All ranks finish writing their traces before the process continues or terminates.
  3. Variable Fixes: Corrected usage of self.tool_config.step_start/end inside the save filename generation (previously it might have referenced incorrect config attributes).
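
For illustration, here is a minimal sketch of the synchronized save pattern described in item 2, written as a standalone function. The name save_trace, the prof handle, and the filename format are assumptions made for this sketch, not verl's exact API:

import os

import torch.distributed as dist


def save_trace(prof, save_path, step_start, step_end):
    """Export this rank's Chrome trace with rank-0 directory creation and barriers."""
    rank = dist.get_rank()

    # Rank 0 creates the output directory; the barrier keeps other ranks
    # from writing into a directory that does not exist yet.
    if rank == 0:
        os.makedirs(save_path, exist_ok=True)
    dist.barrier()

    # Each profiled rank exports its own Chrome trace file.
    trace_file = os.path.join(save_path, f"rank{rank}_step{step_start}_{step_end}.json")
    prof.export_chrome_trace(trace_file)

    # Second barrier: no rank continues (or terminates) until every rank
    # has finished writing its trace.
    dist.barrier()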

Test

Verified by running a training job with profiling enabled.
Result: Chrome trace .json files are now correctly generated and saved to the specified output directory for the requested ranks.

API and Usage Example

No API signature changes, but the configuration logic is now stricter/safer.

Config Example (config.yaml or dict):

actor_rollout_ref:
  actor:
    profiler:
      tool: torch
      enable: True
      ranks: [0, 8, 16, 32]  # for example
      # all_ranks: True
      save_path: "/<your_path>"  # profiler saving path
      tool_config:
        torch:
          _target_: verl.utils.profiler.config.TorchProfilerToolConfig
          step_start: 1
          step_end: 3

Note: Setting both all_ranks=True and ranks=[...] will now raise a ValueError to avoid ambiguity.

Design & Code Changes

  • Profiler._validate (see the sketch after this list):
    • Added checks to prevent setting all_ranks and ranks simultaneously.
    • Correctly populates self.profile_ranks based on world size when all_ranks=True.
    • Added validation to ensure requested ranks exist within torch.distributed.get_world_size().
  • Profiler.save:
    • Only rank 0 calls os.makedirs.
    • Added torch.distributed.barrier() before and after saving to synchronize workers.
    • Fixed filename generation to use self.tool_config parameters consistently.
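
The rank-selection rules in Profiler._validate can be sketched roughly as the helper below; select_profile_ranks is a hypothetical name introduced for illustration, not verl's API:

import torch.distributed as dist


def select_profile_ranks(all_ranks, ranks):
    """Resolve which ranks should be profiled, mirroring the validation rules above."""
    world_size = dist.get_world_size()

    if all_ranks and ranks:
        # Ambiguous configuration: refuse rather than guess.
        raise ValueError("Set either 'all_ranks' or 'ranks', not both.")

    if all_ranks:
        return list(range(world_size))

    if ranks:
        # Every requested rank must exist in the current job.
        invalid = [r for r in ranks if r < 0 or r >= world_size]
        if invalid:
            raise ValueError(f"Requested ranks {invalid} are outside world size {world_size}.")
        return list(ranks)

    # Nothing specified: default to profiling rank 0 only.
    return [0]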

Related Issues:
#3985

Related PRs:
#4689

This PR supersedes #4689.
While #4689 fixes the immediate AttributeError regarding tool_config, this PR provides a complete fix for the profiler in distributed settings:

  1. Rank Synchronization: Adds torch.distributed.barrier() to prevent race conditions between ranks during file saving (critical for multi-rank training).
  2. Logic Fixes: Correctly implements the logic for all_ranks vs ranks, resolving the ambiguity that existed previously.
  3. IO Safety: Ensures the output directory is created safely by rank 0 before other ranks attempt to write.

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request provides a much-needed fix for the torch_profiler to correctly support distributed environments. The changes address critical issues like race conditions during trace file saving by introducing distributed barriers, and improve configuration validation to prevent ambiguous or invalid setups. The logic is now more robust and correct. I have one high-severity suggestion to replace assert statements with ValueError exceptions for configuration validation to ensure that these checks are not disabled in production environments.

Comment on lines +106 to +110
assert self.tool_config.step_start >= 0, "[ERROR] Profile step start must be greater than 0"
assert self.tool_config.step_end >= 0, "[ERROR] Profile step end must be greater than 0"
assert self.tool_config.step_start < self.tool_config.step_end, (
    "[ERROR] Profile step start must be less than step end"
)

Severity: high

Using assert for validating user configuration is not robust, as assertions can be disabled globally in Python (e.g., with the -O flag). This could lead to the profiler silently misbehaving with invalid step configurations. It's better to use ValueError for handling configuration errors, which is consistent with other checks in this method and ensures validation is always performed.

Suggested change
-assert self.tool_config.step_start >= 0, "[ERROR] Profile step start must be greater than 0"
-assert self.tool_config.step_end >= 0, "[ERROR] Profile step end must be greater than 0"
-assert self.tool_config.step_start < self.tool_config.step_end, (
-    "[ERROR] Profile step start must be less than step end"
-)
+if self.tool_config.step_start < 0:
+    raise ValueError("[Profiler] Configuration Error: 'step_start' must be non-negative.")
+if self.tool_config.step_end < 0:
+    raise ValueError("[Profiler] Configuration Error: 'step_end' must be non-negative.")
+if self.tool_config.step_start >= self.tool_config.step_end:
+    raise ValueError(
+        f"[Profiler] Configuration Error: 'step_start' ({self.tool_config.step_start}) must be less than 'step_end' ({self.tool_config.step_end})."
+    )

@wuxibin89 wuxibin89 changed the title [tool] fix: correct torch profiler logic for distributed trace collection [perf] fix: correct torch profiler logic for distributed trace collection Dec 30, 2025
@wuxibin89 (Collaborator)

@AkiRusProd (Author)

Thanks for pointing that out. I have applied the pre-commit hooks and formatted the code accordingly. The latest commit should pass the linting checks now.

@AkiRusProd AkiRusProd requested a review from wuxibin89 December 31, 2025 09:11