[perf] fix: correct torch profiler logic for distributed trace collection #4707
base: main
Conversation
Code Review
This pull request provides a much-needed fix for the torch_profiler to correctly support distributed environments. The changes address critical issues like race conditions during trace file saving by introducing distributed barriers, and improve configuration validation to prevent ambiguous or invalid setups. The logic is now more robust and correct. I have one high-severity suggestion to replace assert statements with ValueError exceptions for configuration validation to ensure that these checks are not disabled in production environments.
```python
assert self.tool_config.step_start >= 0, "[ERROR] Profile step start must be greater than 0"
assert self.tool_config.step_end >= 0, "[ERROR] Profile step end must be greater than 0"
assert self.tool_config.step_start < self.tool_config.step_end, (
    "[ERROR] Profile step start must be less than step end"
)
```
Using assert for validating user configuration is not robust, as assertions can be disabled globally in Python (e.g., with the -O flag). This could lead to the profiler silently misbehaving with invalid step configurations. It's better to use ValueError for handling configuration errors, which is consistent with other checks in this method and ensures validation is always performed.
Suggested change:
```diff
-assert self.tool_config.step_start >= 0, "[ERROR] Profile step start must be greater than 0"
-assert self.tool_config.step_end >= 0, "[ERROR] Profile step end must be greater than 0"
-assert self.tool_config.step_start < self.tool_config.step_end, (
-    "[ERROR] Profile step start must be less than step end"
-)
+if self.tool_config.step_start < 0:
+    raise ValueError("[Profiler] Configuration Error: 'step_start' must be non-negative.")
+if self.tool_config.step_end < 0:
+    raise ValueError("[Profiler] Configuration Error: 'step_end' must be non-negative.")
+if self.tool_config.step_start >= self.tool_config.step_end:
+    raise ValueError(
+        f"[Profiler] Configuration Error: 'step_start' ({self.tool_config.step_start}) must be less than 'step_end' ({self.tool_config.step_end})."
+    )
```
@AkiRusProd Please format code according to: https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting
Thanks for pointing that out. I have applied the pre-commit hooks and formatted the code accordingly. The latest commit should pass the linting checks now.
What does this PR do?
This PR fixes the logic of torch_profiler within Verl to enable correct trace collection in distributed environments. Previously, the profiler logic prevented valid trace generation or caused race conditions during file saving.

Key changes include:
- Fixed _validate to correctly handle all_ranks vs. specific ranks. It now raises a clear error if both are set and correctly defaults to rank 0 if nothing is specified.
- Added torch.distributed.barrier() in the save() method, ensuring that all ranks stay synchronized around directory creation and trace file writes (a minimal sketch of this flow follows the list).
- Used self.tool_config.step_start/end inside the save filename generation (previously it might have referenced incorrect config attributes).
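As a rough illustration of how the barriers fit around trace saving, here is a minimal sketch of a Profiler.save-style method; the attribute names (profile_ranks, save_path) and the filename pattern are assumptions for illustration, not verl's exact implementation:

```python
import os

import torch.distributed as dist


def save(self):
    # Sketch only: every rank enters this method so the collective barriers stay matched.
    if self.rank == 0:
        # Only rank 0 creates the output directory, avoiding redundant mkdir calls.
        os.makedirs(self.tool_config.save_path, exist_ok=True)

    if dist.is_initialized():
        dist.barrier()  # wait until the directory exists before any rank writes

    if self.rank in self.profile_ranks and self.profiler is not None:
        trace_file = os.path.join(
            self.tool_config.save_path,
            f"rank{self.rank}_steps{self.tool_config.step_start}_{self.tool_config.step_end}.json",
        )
        self.profiler.export_chrome_trace(trace_file)

    if dist.is_initialized():
        dist.barrier()  # keep ranks in lockstep until all trace files are written
```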
Test
Verified by running a training job with profiling enabled.
Result: Chrome trace .json files are now correctly generated and saved to the specified output directory for the requested ranks.

API and Usage Example
No API signature changes, but the configuration logic is now stricter and safer.
Config Example (config.yaml or dict):
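A hypothetical snippet showing the relevant fields as a Python dict (the key names are assumptions based on this PR's description, not verl's exact schema):

```python
# Illustrative profiler settings; do not combine all_ranks=True with an explicit ranks list.
profiler_config = {
    "enable": True,
    "all_ranks": False,        # profile only the ranks listed below
    "ranks": [0, 1],           # raises ValueError if set together with all_ranks=True
    "step_start": 2,           # first profiled step, must be >= 0
    "step_end": 4,             # must be strictly greater than step_start
    "save_path": "./profile_traces",
}
```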
Note: Setting both all_ranks=True and ranks=[...] will now raise a ValueError to avoid ambiguity.

Design & Code Changes
Profiler._validate (a validation sketch follows this list):
- Raises an error if all_ranks and ranks are set simultaneously.
- Computes self.profile_ranks based on world size when all_ranks=True.
- Uses torch.distributed.get_world_size() to determine the world size.

Profiler.save:
- Only rank 0 calls os.makedirs.
- Adds torch.distributed.barrier() before and after saving to synchronize workers.
- Uses self.tool_config parameters consistently.
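A minimal sketch of the validation flow described above, with attribute names assumed from this PR's description rather than taken from verl's source:

```python
import torch.distributed as dist


def _validate(self):
    # Reject ambiguous configurations outright.
    if self.tool_config.all_ranks and self.tool_config.ranks:
        raise ValueError("[Profiler] Set either 'all_ranks=True' or an explicit 'ranks' list, not both.")

    if self.tool_config.all_ranks:
        # Profile every rank in the job.
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        self.profile_ranks = list(range(world_size))
    elif self.tool_config.ranks:
        self.profile_ranks = list(self.tool_config.ranks)
    else:
        # Nothing specified: default to profiling rank 0 only.
        self.profile_ranks = [0]

    if self.tool_config.step_start < 0 or self.tool_config.step_end < 0:
        raise ValueError("[Profiler] 'step_start' and 'step_end' must be non-negative.")
    if self.tool_config.step_start >= self.tool_config.step_end:
        raise ValueError("[Profiler] 'step_start' must be less than 'step_end'.")
```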
Related Issues:
#3985
Related PRs:
#4689
This PR supersedes #4689.
While #4689 fixes the immediate AttributeError regarding tool_config, this PR provides a complete fix for the profiler in distributed settings:
- Adds torch.distributed.barrier() to prevent race conditions during file saving (critical for multi-rank training).
- Improves validation of all_ranks vs. ranks, resolving the ambiguity that existed previously.