Support megatron generate for vlm #773
Conversation
📝 Walkthrough

Introduces vision-input-aware forwarding in megatron_generate.py and megatron_prefill.py. Detects vision-language model (VLM) inputs via pixel_values, image_grid_thw, and image_sizes, constructing appropriate position IDs and attention masks. Implements dual-path handling that routes VLM inputs through specialized forward logic while maintaining existing text-only paths.
Sequence Diagram

```mermaid
sequenceDiagram
    participant DataIterator as Data Iterator
    participant ForwardStep as Forward Step Func
    participant VisionDetector as Vision Input Detector
    participant Model as Model
    DataIterator->>ForwardStep: data_dict (tokens + optional vision inputs)
    ForwardStep->>VisionDetector: Detect VLM inputs
    alt Vision Inputs Present
        VisionDetector->>ForwardStep: VLM detected
        ForwardStep->>ForwardStep: Construct vlm_position_ids<br/>Construct vlm_attention_mask<br/>Build forward_args with vision inputs
        ForwardStep->>Model: model(**forward_args)
    else Text-Only Path
        VisionDetector->>ForwardStep: No vision inputs
        ForwardStep->>Model: Original text-only call
    end
    Model-->>ForwardStep: Output
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/utils/plugins/megatron_generate.py (1)
339-339: Potential division by zero when only one token is generated.

If the first generated token is an EOS token, `output_ids.shape[-1]` equals 1, causing division by zero.

Proposed fix:

```diff
- "tps": time_remaining_outputs / (output_ids.shape[-1] - 1),
+ "tps": time_remaining_outputs / max(output_ids.shape[-1] - 1, 1),
```
🤖 Fix all issues with AI agents
In @modelopt/torch/utils/plugins/megatron_generate.py:
- Around line 218-243: The VLM branch currently omits passing inference_context
and always builds vlm_position_ids starting at 0, which disables/breaks
KV-cache during decoding; fix by: when has_vision_inputs is true and an
inference_context is provided (and enable_kv_cache is true), include
inference_context in forward_args (forward_args["inference_context"] =
inference_context) and compute vlm_position_ids by adding the decode offset from
the context (e.g., base = getattr(inference_context, "position_offset",
getattr(inference_context, "curr_seq_len", 0)); vlm_position_ids =
torch.arange(base, base + seq_len, dtype=torch.long,
device=device).unsqueeze(0).expand(batch_size, -1)); alternatively, if you
prefer to disallow KV-cache for VLMs, explicitly set
forward_args["inference_context"] = None (or skip passing it) and ensure
enable_kv_cache is treated as disabled when has_vision_inputs is true. (A sketch of this offset logic follows the list below.)
- Around line 286-292: The vision inputs (pixel_values, image_grid_thw,
image_sizes) are being added to data_dict on every decode step; change the logic
so these keys are only added during the prefill/first generation step (e.g.,
when step == 0 or when an is_prefill flag is true). Locate the block building
data_dict (symbols: data_dict, tokens, pixel_values, image_grid_thw,
image_sizes) inside the generation loop/function and wrap the conditional
additions of pixel_values, image_grid_thw, and image_sizes so they execute only
for the initial prefill step.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
modelopt/torch/utils/plugins/megatron_generate.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/utils/plugins/megatron_generate.py (1)
62-87: `megatron_prefill` accepts vision parameters but does not use them.

The function signature at lines 44-46 accepts `pixel_values`, `image_grid_thw`, and `image_sizes`, but the inner `_forward_step_func` doesn't handle vision inputs and these parameters aren't passed to `data_iterator`. If VLM support for prefill is intended, the same vision-aware logic from `megatron_generate` should be applied here.
```python
data_dict = {"tokens": tokens}
if pixel_values is not None:
    data_dict["pixel_values"] = pixel_values
if image_grid_thw is not None:
    data_dict["image_grid_thw"] = image_grid_thw
if image_sizes is not None:
    data_dict["image_sizes"] = image_sizes
```
Vision inputs are passed on every generation step instead of just prefill.
Vision inputs should only be processed during the prefill phase (step 0). Passing them on every decode step is wasteful and may cause unexpected behavior in some VLM architectures.
Proposed fix: Only include vision inputs on the first step

```diff
  data_dict = {"tokens": tokens}
- if pixel_values is not None:
-     data_dict["pixel_values"] = pixel_values
- if image_grid_thw is not None:
-     data_dict["image_grid_thw"] = image_grid_thw
- if image_sizes is not None:
-     data_dict["image_sizes"] = image_sizes
+ # Vision inputs should only be processed during prefill (step 0)
+ if step == 0:
+     if pixel_values is not None:
+         data_dict["pixel_values"] = pixel_values
+     if image_grid_thw is not None:
+         data_dict["image_grid_thw"] = image_grid_thw
+     if image_sizes is not None:
+         data_dict["image_sizes"] = image_sizes
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
data_dict = {"tokens": tokens}
# Vision inputs should only be processed during prefill (step 0)
if step == 0:
    if pixel_values is not None:
        data_dict["pixel_values"] = pixel_values
    if image_grid_thw is not None:
        data_dict["image_grid_thw"] = image_grid_thw
    if image_sizes is not None:
        data_dict["image_sizes"] = image_sizes
```
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #773      +/-   ##
==========================================
- Coverage   74.24%   74.17%   -0.08%
==========================================
  Files         192      192
  Lines       19033    19246     +213
==========================================
+ Hits        14132    14276     +144
- Misses       4901     4970      +69
```

☔ View full report in Codecov by Sentry.
Signed-off-by: James Shen <yueshen@nvidia.com>
What does this PR do?

Type of change: New feature

Overview: This PR adds support for VLM generation to megatron_generate.

Usage

# Add a code snippet demonstrating how to use this
Testing

Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Improvements