Fix a nvfp4 weight amax attribute issue during export #785
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main     #785   +/-   ##
=======================================
  Coverage   74.17%   74.17%
=======================================
  Files         192      192
  Lines       19246    19246
=======================================
  Hits        14276    14276
  Misses       4970     4970
```
I think @meenchen's question is legitimate: do you feel we need to do this for all the quant formats, not just NVFP4? And even with this weight calibration, the activation amax is still not present. How will this PR be able to generate a valid HF checkpoint?
I think we can include other cases as needed later. "How will this PR be able to generate a valid HF checkpoint?" What do you mean? This patch has been tested by the Google team; they were able to generate the kimi-k2-thinking nvfp4 checkpoint.
My question is: if the weights are not quantized because the expert has not been activated yet, then even if you quantize the weights, the inputs are not quantized and the input scales are not available. How can the deployment framework deploy this checkpoint without complaining that the input scales are not present?
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Force-pushed afec8d2 to 4d54f55
📝 Walkthrough: Added an internal helper function to lazily calibrate weight quantizers by resetting amax and collecting statistics. Updated weight scaling factor computations for NVFP4 variants to call this helper, ensuring proper calibration before deriving scaling factors.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes. Pre-merge checks: ✅ 3 passed.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@modelopt/torch/export/quant_utils.py`:
- Around line 239-256: The current helper _ensure_weight_quantizer_calibrated
should only produce weight scales and must not attempt to produce activation
scales; ensure it remains weight-only by keeping the stats collection scoped to
the provided weight_quantizer (use enable_stats_collection(weight_quantizer) /
finish_stats_collection(weight_quantizer) as shown) and do not add any global or
activation quantizer calibration here; if activation quantization support is
needed, add a separate explicit code path elsewhere that checks an "activation
quantization enabled" flag and performs offline activation calibration (do not
rely on this weight-only helper to populate activation/global scales).
🧹 Nitpick comments (1)
modelopt/torch/export/quant_utils.py (1)
239-256: Consider adding a warning when force-calibrating weights. The implementation correctly follows the existing calibration pattern used elsewhere in this file. However, based on a previous review discussion, it would be valuable to add a warning when this lazy calibration is triggered, as it indicates the layer was not quantized during the main calibration phase (possibly due to a small `calib_size`).

💡 Proposed enhancement to add a warning:

```diff
 def _ensure_weight_quantizer_calibrated(
     weight_quantizer: TensorQuantizer, weight: torch.Tensor
 ) -> None:
     """Calibrate weight quantizer if amax is not set.

     This is a lazy calibration pattern used during export when weight quantizers
     may not have been calibrated during the main calibration phase.

     Args:
         weight_quantizer: The weight quantizer to calibrate
         weight: The weight tensor to use for calibration
     """
     if not hasattr(weight_quantizer, "_amax") or weight_quantizer._amax is None:
+        logger.warning(
+            "Weight quantizer was not calibrated during calibration phase. "
+            "Force-calibrating weights during export. Consider increasing calib_size."
+        )
         weight_quantizer.reset_amax()
         enable_stats_collection(weight_quantizer)
         weight_quantizer(weight)
         finish_stats_collection(weight_quantizer)
```
The diff hunk under review (the new helper in modelopt/torch/export/quant_utils.py):

```python
def _ensure_weight_quantizer_calibrated(
    weight_quantizer: TensorQuantizer, weight: torch.Tensor
) -> None:
    """Calibrate weight quantizer if amax is not set.

    This is a lazy calibration pattern used during export when weight quantizers
    may not have been calibrated during the main calibration phase.

    Args:
        weight_quantizer: The weight quantizer to calibrate
        weight: The weight tensor to use for calibration
    """
    if not hasattr(weight_quantizer, "_amax") or weight_quantizer._amax is None:
        weight_quantizer.reset_amax()
        enable_stats_collection(weight_quantizer)
        weight_quantizer(weight)
        finish_stats_collection(weight_quantizer)
```
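For illustration, a minimal sketch of how an NVFP4 weight scaling-factor path can call this helper before reading amax. The function name, constants, and formula below are illustrative (they follow the usual NVFP4 two-level scaling convention), not the exact code in quant_utils.py:

```python
import torch

FP8_E4M3_MAX = 448.0  # max representable magnitude of FP8 (E4M3)
FP4_E2M1_MAX = 6.0    # max representable magnitude of FP4 (E2M1)


def get_nvfp4_per_tensor_weight_scale(weight_quantizer, weight: torch.Tensor) -> torch.Tensor:
    """Illustrative: derive the per-tensor (second-level) NVFP4 weight scale from amax."""
    # Lazily calibrate so weight_quantizer._amax exists even if this layer
    # (e.g., a rarely-routed MoE expert) saw no data during calibration.
    _ensure_weight_quantizer_calibrated(weight_quantizer, weight)
    amax = weight_quantizer._amax.float()
    return amax / (FP4_E2M1_MAX * FP8_E4M3_MAX)
```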
🧩 Analysis chain
🌐 Web query:
vLLM NVFP4 quantization activation scales requirements
💡 Result:
For vLLM NVFP4 (FP4) quantization, the activation-scale requirements are:
- You must have per-tensor “global” activation scales computed offline via calibration data (i.e., NVFP4 requires a calibration dataset to determine these global activation scales). The per-group (group size = 16) local activation scales are generated dynamically at inference time. [1]
- Group size is fixed at 16 (you can’t change it). [2]
- Activation quantization only runs on NVIDIA Blackwell-class GPUs (SM100+); on older GPUs, vLLM will run weight-only (no activation quantization). [1]
- vLLM’s Marlin FP4 path does not support pairing NVFP4 weights with INT8/FP8 activations. [3]
Sources: [1] [2] [3]
🌐 Web query:
TensorRT-LLM NVFP4 quantization input scales deployment
💡 Result:
In TensorRT-LLM, NVFP4 “input/activation scales” are not something you pass at serving time; they’re produced/embedded during quantization/calibration and then used by the engine at runtime.
- Weights (W) in NVFP4: TensorRT requires per-block scales with block size 16 for NVFP4 weight quantization, and those scales live with the quantized weights/checkpoint/engine artifacts. [1]
- Activations (A) in NVFP4: TensorRT uses dynamic, per-block (block size 16) activation quantization. Practically, the engine computes a per-block scale from the current activation block at runtime. [1]
- What TensorRT-LLM calls “activation global scale”: For NVFP4 activations, TensorRT supports Dynamic Double Quantization, where the per-block activation scales are themselves quantized (to FP8), and that scale-quantization needs an offline-calibrated per-tensor scalar, often described as a “global scale” (globalSf). This globalSf is what you calibrate offline and then deploy with the engine; again, it is not provided per request. [2]
- Deployment workflow (where scales come from): Use a quantization flow (e.g., NeMo/ModelOpt → TensorRT-LLM export) that includes a calibration step to obtain algorithm-specific scaling factors, producing a TensorRT-LLM-ready checkpoint/dir used to build the serving engine. [3]
- How you “enable” NVFP4 in TRT-LLM code/config: TensorRT-LLM has an NVFP4 quantization mode flag (e.g., QuantMode.NVFP4), which is what causes the runtime to use the NVFP4 quantized kernels/scales. [4]
Implication for serving: your client still sends normal (e.g., FP16/BF16) inputs; the TensorRT(-LLM) engine applies NVFP4 activation quantization internally using the calibrated globalSf plus runtime-computed per-block scales. [1][2]
Sources: [1] NVIDIA TensorRT “Working with Quantized Types” (NVFP4 block size + activation scheme) https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html, [2] same doc section on Dynamic Double Quantization / globalSf (NVFP4) https://docs.nvidia.com/deeplearning/tensorrt/10.13.0/inference-library/work-quantized-types.html, [3] NeMo Quantization workflow (calibration produces scaling factors; output used to build TRT-LLM engine) https://docs.nvidia.com/nemo-framework/user-guide/25.09/model-optimization/quantization/quantization.html, [4] TensorRT-LLM quantization mode flags (includes NVFP4) https://nvidia.github.io/TensorRT-LLM/_modules/tensorrt_llm/quantization/mode.html
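As a rough illustration of that two-level scheme (not TensorRT-LLM's actual code; the constants follow the FP8/FP4 ranges above, the block size is 16, and the tensor is assumed to have a multiple of 16 elements):

```python
import torch

FP8_E4M3_MAX = 448.0
FP4_E2M1_MAX = 6.0
BLOCK_SIZE = 16


def nvfp4_global_scale(calib_amax: float) -> float:
    """Offline: per-tensor global scale (globalSf-style) from a calibrated amax."""
    return calib_amax / (FP4_E2M1_MAX * FP8_E4M3_MAX)


def nvfp4_block_scales(x: torch.Tensor, global_scale: float) -> torch.Tensor:
    """Per-block local scales, themselves stored in FP8 relative to the global scale."""
    blocks = x.reshape(-1, BLOCK_SIZE)        # assumes numel is a multiple of 16
    block_amax = blocks.abs().amax(dim=-1)
    block_scale = block_amax / FP4_E2M1_MAX   # local scale per 16-value block
    return (block_scale / global_scale).to(torch.float8_e4m3fn)  # "double quantization"
```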
🌐 Web query:
NVFP4 weight-only quantization activation amax
💡 Result:
- NVFP4 (TensorRT) is FP4E2M1 with per-block scaling (block size = 16), with data clipped to [-6, 6] during quantization. Weights can be quantized to NVFP4, and activations can be quantized too, but activation quantization is dynamic (computed at inference time). [1]
- If you are doing NVFP4 weight-only quantization, then activations are not quantized, so there is no “activation amax” parameter to precompute/provide (amax is only needed when you actually quantize activations). [1]
- When activations are NVFP4-quantized, TensorRT’s dynamic quantization computes the per-block scale from the block’s amax = max(abs(x)) at runtime (conceptually):

$$\text{scale} = \max_i \left(\frac{|x_i|}{qTypeMax}\right), \quad qTypeMax = 6 \text{ for FP4E2M1}$$

so “activation amax” corresponds to that runtime max-abs statistic. [2]
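For a concrete instance of that formula: a 16-value activation block whose largest magnitude is 3.0 gets a runtime scale of

$$\text{scale} = \frac{3.0}{6} = 0.5, \qquad \hat{x}_i = \mathrm{FP4E2M1}\!\left(\frac{x_i}{0.5}\right),$$

so no precomputed activation amax is needed for that local scale; only the per-tensor global scale comes from offline calibration.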
Sources:
[1] NVIDIA TensorRT Docs — Working with Quantized Types (NVFP4 scheme + activation quantization requirement) (docs.nvidia.com)
[2] NVIDIA TensorRT Docs — Working with Quantized Types (dynamic quantization scale formula / max-abs) (docs.nvidia.com)
Citations:
- 1: https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html
- 2: https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html
Activation scales are required by deployment frameworks, but only if activation quantization is enabled.
Both vLLM and TensorRT-LLM require offline-calibrated activation scales (global scales/globalSf) when NVFP4 activation quantization is used. However, for weight-only NVFP4 deployment, activation scales are not needed.
This fix addresses weight quantizer calibration, which is sufficient for weight-only NVFP4. For MoE models where experts may not be activated during calibration, missing activation scales would be problematic only if activation quantization is required. If the Kimi-K2 checkpoint uses weight-only NVFP4, this limitation is not a concern.
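As a sketch of how such a weight-only distinction could be checked at export time, assuming a layer exposes an `input_quantizer` with an `is_enabled` flag and the same `_amax` convention used by the helper above (the function name and warning wording here are hypothetical):

```python
import warnings


def check_activation_scales(layer_name: str, module) -> None:
    """Hypothetical export-time check: warn if activation quantization is enabled
    for a layer but no activation amax was collected during calibration."""
    input_quantizer = getattr(module, "input_quantizer", None)
    if input_quantizer is None or not getattr(input_quantizer, "is_enabled", False):
        # Weight-only quantization: no activation scale is required at deployment time.
        return
    if getattr(input_quantizer, "_amax", None) is None:
        warnings.warn(
            f"{layer_name}: activation quantization is enabled but no input amax was "
            "collected (e.g., an MoE expert never routed to during calibration); "
            "the exported checkpoint may be missing input scales."
        )
```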
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Resolved offline.
What does this PR do?
Type of change: Bugfix
Overview: Fix an NVFP4 weight amax attribute issue during export, especially when the calibration size is small. Context: sgl-project/sglang#14677 (comment)
Usage
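An illustrative PTQ-plus-export flow that exercises this path (the model/dataset loaders and export directory are placeholders, assuming the standard `mtq.quantize` and unified HF export APIs):

```python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model = load_model()              # hypothetical helper returning an HF causal LM
calib_loader = load_calib_data()  # hypothetical helper; a small calib set may skip some MoE experts


def forward_loop(m):
    # Run a few calibration batches through the model to collect amax statistics.
    for batch in calib_loader:
        m(**batch)


# NVFP4 PTQ; with a small calib_size, some expert weight quantizers may end up with
# no amax, which the lazy calibration in this PR handles at export time.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
export_hf_checkpoint(model, export_dir="nvfp4-checkpoint")  # export_dir value is illustrative
```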
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
Bug Fixes