
Conversation

@Edwardf0t1 (Contributor) commented Jan 15, 2026

What does this PR do?

Type of change: Bugfix

Overview: Fix an nvfp4 weight amax attribute issue during export, which shows up especially when the calibration size is small. Context: sgl-project/sglang#14677 (comment)
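
As a rough illustration of the failure mode (the diagnostic function below is hypothetical and not part of this PR; it only mirrors the attribute names the fix checks, `weight_quantizer` and `_amax`): with a small calibration set, some MoE experts are never routed to, so their weight quantizers never record an amax and export trips over the missing attribute.

```python
import torch.nn as nn

def find_uncalibrated_weight_quantizers(model: nn.Module) -> list[str]:
    """Return names of modules whose weight quantizer never collected an amax."""
    missing = []
    for name, module in model.named_modules():
        wq = getattr(module, "weight_quantizer", None)
        if wq is not None and getattr(wq, "_amax", None) is None:
            # Typically an MoE expert that was never activated during a short
            # calibration run (e.g., --calib_size 20 on Kimi-K2).
            missing.append(name)
    return missing
```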

Usage

python3 hf_ptq.py --pyt_ckpt_path /home/scratch.jingyux_coreai/kimi-k2/models/Kimi-K2-Thinking-BF16 --qformat nvfp4_mlp_only --export_path /home/omniml_data_3/zhiyuc/checkpoints/Kimi-K2-Thinking-NVFP4 --kv_cache_qformat none --calib_size 20 --trust_remote_code --dataset cnn_dailymail

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

Bug Fixes

  • Improved weight quantizer calibration to ensure quantizers are properly initialized with calibration statistics before computing scaling factors.
  • Enhanced reliability and consistency of quantized model exports.


@Edwardf0t1 requested review from a team as code owners on January 15, 2026 01:13

codecov bot commented Jan 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.17%. Comparing base (b44c60a) to head (8c0eb8f).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #785   +/-   ##
=======================================
  Coverage   74.17%   74.17%           
=======================================
  Files         192      192           
  Lines       19246    19246           
=======================================
  Hits        14276    14276           
  Misses       4970     4970           


@cjluo-nv (Collaborator)

I think @meenchen's question is legitimate:

> Do you feel we need to do this for all the quant formats, not just for NVFP4?
>
> And even with this weight calibration, the activation amax is still not present. How will this PR be able to generate a valid HF checkpoint?

@Edwardf0t1 (Author)

> I think @meenchen's question is legitimate:
>
> Do you feel we need to do this for all the quant formats, not just for NVFP4?
>
> And even with this weight calibration, the activation amax is still not present. How will this PR be able to generate a valid HF checkpoint?

I think we can include other cases as needed later.

"How will this PR be able to generate a valid HF checkpoint?" What do you mean? This patch has been tested by the Google team; they were able to generate the kimi-k2-thinking nvfp4 checkpoint.

@cjluo-nv (Collaborator)

> > I think @meenchen's question is legitimate:
> > Do you feel we need to do this for all the quant formats, not just for NVFP4?
> > And even with this weight calibration, the activation amax is still not present. How will this PR be able to generate a valid HF checkpoint?
>
> I think we can include other cases as needed later.
>
> "How will this PR be able to generate a valid HF checkpoint?" What do you mean? This patch has been tested by the Google team; they were able to generate the kimi-k2-thinking nvfp4 checkpoint.

My question is:

If the weights are not quantized because the expert has not been activated yet, then even if you quantize the weights, the inputs are not quantized and the input scales are not available. How can the deployment framework deploy this checkpoint without complaining that the input scales are not present?

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 force-pushed the zhiyu/fix-nvfp4-amax-attribute branch from afec8d2 to 4d54f55 on January 23, 2026 01:38

coderabbitai bot commented Jan 23, 2026


📝 Walkthrough

Added an internal helper function to lazily calibrate weight quantizers by resetting amax and collecting statistics. Updated weight scaling factor computations for NVFP4 variants to call this helper, ensuring proper calibration before deriving scaling factors.
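
A minimal sketch of how that integration looks (the wrapper name and the scale expression below are illustrative assumptions; only _ensure_weight_quantizer_calibrated and the quantizer's amax come from the PR, and the real computation lives in modelopt/torch/export/quant_utils.py):

```python
import torch

def nvfp4_weight_scaling_factor_2(weight_quantizer, weight: torch.Tensor) -> torch.Tensor:
    # New in this PR: lazily calibrate quantizers that never saw data, e.g. MoE
    # experts that were never routed to when calib_size is small.
    _ensure_weight_quantizer_calibrated(weight_quantizer, weight)
    # With amax now guaranteed to be populated, a per-tensor (second-level) scale
    # can be derived from it; 6.0 is the FP4 E2M1 max and 448.0 the FP8 E4M3 max.
    return weight_quantizer._amax.float() / (6.0 * 448.0)
```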

Changes

Cohort / File(s): Weight Quantizer Calibration — modelopt/torch/export/quant_utils.py
Summary: Introduced _ensure_weight_quantizer_calibrated() helper function for lazy amax calibration. Integrated calibration checks into the NVFP4 family quantization formats (NVFP4, NVFP4_AWQ, NVFP4_SVDQUANT) and the W4A8_NVFP4_FP8 variant prior to scaling factor computation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks — ✅ 3 passed
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title accurately describes the main change: fixing an nvfp4 weight amax attribute issue during export, which directly aligns with the introduced _ensure_weight_quantizer_calibrated helper and its application to NVFP4 quantization variants.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).





@coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
modelopt/torch/export/quant_utils.py (1)

239-256: Consider adding a warning when force-calibrating weights.

The implementation correctly follows the existing calibration pattern used elsewhere in this file. However, based on a previous review discussion, it would be valuable to add a warning when this lazy calibration is triggered, as it indicates the layer was not quantized during the main calibration phase (possibly due to small calib_size).

💡 Proposed enhancement to add warning
 def _ensure_weight_quantizer_calibrated(
     weight_quantizer: TensorQuantizer, weight: torch.Tensor
 ) -> None:
     """Calibrate weight quantizer if amax is not set.

     This is a lazy calibration pattern used during export when weight quantizers
     may not have been calibrated during the main calibration phase.

     Args:
         weight_quantizer: The weight quantizer to calibrate
         weight: The weight tensor to use for calibration
     """
     if not hasattr(weight_quantizer, "_amax") or weight_quantizer._amax is None:
+        logger.warning(
+            "Weight quantizer was not calibrated during calibration phase. "
+            "Force-calibrating weights during export. Consider increasing calib_size."
+        )
         weight_quantizer.reset_amax()
         enable_stats_collection(weight_quantizer)
         weight_quantizer(weight)
         finish_stats_collection(weight_quantizer)

Comment on lines 239 to 256
def _ensure_weight_quantizer_calibrated(
    weight_quantizer: TensorQuantizer, weight: torch.Tensor
) -> None:
    """Calibrate weight quantizer if amax is not set.

    This is a lazy calibration pattern used during export when weight quantizers
    may not have been calibrated during the main calibration phase.

    Args:
        weight_quantizer: The weight quantizer to calibrate
        weight: The weight tensor to use for calibration
    """
    if not hasattr(weight_quantizer, "_amax") or weight_quantizer._amax is None:
        weight_quantizer.reset_amax()
        enable_stats_collection(weight_quantizer)
        weight_quantizer(weight)
        finish_stats_collection(weight_quantizer)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

vLLM NVFP4 quantization activation scales requirements

💡 Result:

For vLLM NVFP4 (FP4) quantization, the activation-scale requirements are:

  • You must have per-tensor “global” activation scales computed offline via calibration data (i.e., NVFP4 requires a calibration dataset to determine these global activation scales). The per-group (group size = 16) local activation scales are generated dynamically at inference time. [1]
  • Group size is fixed at 16 (you can’t change it). [2]
  • Activation quantization only runs on NVIDIA Blackwell-class GPUs (SM100+); on older GPUs, vLLM will run weight-only (no activation quantization). [1]
  • vLLM’s Marlin FP4 path does not support pairing NVFP4 weights with INT8/FP8 activations. [3]

Sources: [1] [2] [3]


🌐 Web query:

TensorRT-LLM NVFP4 quantization input scales deployment

💡 Result:

In TensorRT-LLM, NVFP4 “input/activation scales” are not something you pass at serving time; they’re produced/embedded during quantization/calibration and then used by the engine at runtime.

  • Weights (W) in NVFP4: TensorRT requires per-block scales with block size 16 for NVFP4 weight quantization, and those scales live with the quantized weights/checkpoint/engine artifacts. [1]
  • Activations (A) in NVFP4: TensorRT uses dynamic, per-block (block size 16) activation quantization. Practically, the engine computes a per-block scale from the current activation block at runtime. [1]
  • What TensorRT-LLM calls “activation global scale”: For NVFP4 activations, TensorRT supports Dynamic Double Quantization, where the per-block activation scales are themselves quantized (to FP8), and that scale-quantization needs an offline-calibrated per-tensor scalar (often described as a “global scale”, globalSf). This globalSf is what you calibrate offline and then deploy with the engine—again, not provided per request. [2]
  • Deployment workflow (where scales come from): Use a quantization flow (e.g., NeMo/ModelOpt → TensorRT-LLM export) that includes a calibration step to obtain algorithm-specific scaling factors, producing a TensorRT-LLM-ready checkpoint/dir used to build the serving engine. [3]
  • How you “enable” NVFP4 in TRT-LLM code/config: TensorRT-LLM has an NVFP4 quantization mode flag (e.g., QuantMode.NVFP4), which is what causes the runtime to use the NVFP4 quantized kernels/scales. [4]

Implication for serving: your client still sends normal (e.g., FP16/BF16) inputs; the TensorRT(-LLM) engine applies NVFP4 activation quantization internally using the calibrated globalSf plus runtime-computed per-block scales. [1][2]

Sources: [1] NVIDIA TensorRT “Working with Quantized Types” (NVFP4 block size + activation scheme) https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html, [2] same doc section on Dynamic Double Quantization / globalSf (NVFP4) https://docs.nvidia.com/deeplearning/tensorrt/10.13.0/inference-library/work-quantized-types.html, [3] NeMo Quantization workflow (calibration produces scaling factors; output used to build TRT-LLM engine) https://docs.nvidia.com/nemo-framework/user-guide/25.09/model-optimization/quantization/quantization.html, [4] TensorRT-LLM quantization mode flags (includes NVFP4) https://nvidia.github.io/TensorRT-LLM/_modules/tensorrt_llm/quantization/mode.html


🌐 Web query:

NVFP4 weight-only quantization activation amax

💡 Result:

  • NVFP4 (TensorRT) is FP4E2M1 with per-block scaling (block size = 16), with data clipped to [-6, 6] during quantization. Weights can be quantized to NVFP4, and activations can be quantized too, but activation quantization is dynamic (computed at inference time). [1]
  • If you are doing NVFP4 weight-only quantization, then activations are not quantized, so there is no “activation amax” parameter to precompute/provide (amax is only needed when you actually quantize activations). [1]
  • When activations are NVFP4-quantized, TensorRT's dynamic quantization computes the per-block scale from the block's amax = max(abs(x)) at runtime, conceptually

    $$\text{scale} = \max_i\left(\frac{|x_i|}{\text{qTypeMax}}\right), \quad \text{qTypeMax} = 6 \text{ for FP4E2M1},$$

    so "activation amax" corresponds to that runtime max-abs statistic (a worked instance follows below). [2]

Sources:
[1] NVIDIA TensorRT Docs — Working with Quantized Types (NVFP4 scheme + activation quantization requirement) (docs.nvidia.com)
[2] NVIDIA TensorRT Docs — Working with Quantized Types (dynamic quantization scale formula / max-abs) (docs.nvidia.com)
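
As a quick worked instance of the dynamic-quantization formula quoted above (numbers purely illustrative): a 16-element activation block whose largest-magnitude entry is 3.0 gets

$$\text{scale} = \max_i\left(\frac{|x_i|}{\text{qTypeMax}}\right) = \frac{3.0}{6} = 0.5, \qquad x_i^{\mathrm{fp4}} = \operatorname{round\_fp4}\!\left(\frac{x_i}{0.5}\right),$$

and this scale is recomputed per block at inference time rather than stored in the checkpoint.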



Activation scales are required by deployment frameworks, but only if activation quantization is enabled.

Both vLLM and TensorRT-LLM require offline-calibrated activation scales (global scales/globalSf) when NVFP4 activation quantization is used. However, for weight-only NVFP4 deployment, activation scales are not needed.

This fix addresses weight quantizer calibration, which is sufficient for weight-only NVFP4. For MoE models where experts may not be activated during calibration, missing activation scales would be problematic only if activation quantization is required. If the Kimi-K2 checkpoint uses weight-only NVFP4, this limitation is not a concern.
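
To make that distinction concrete, here is a small sketch of the scale tensors each deployment mode relies on (the key names are assumptions for illustration, not guaranteed to match the exported checkpoint exactly):

```python
# Illustrative only: which per-layer tensors each NVFP4 deployment mode needs.
WEIGHT_ONLY_NVFP4 = {
    "weight",          # packed FP4 weights
    "weight_scale",    # per-block (group size 16) weight scales
    "weight_scale_2",  # per-tensor global weight scale derived from the weight amax
}
W4A4_NVFP4 = WEIGHT_ONLY_NVFP4 | {
    "input_scale",     # offline-calibrated activation global scale, which an expert
                       # that was never activated during calibration cannot provide
}
```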

🤖 Prompt for AI Agents
In `@modelopt/torch/export/quant_utils.py` around lines 239 - 256, The current
helper _ensure_weight_quantizer_calibrated should only produce weight scales and
must not attempt to produce activation scales; ensure it remains weight-only by
keeping the stats collection scoped to the provided weight_quantizer (use
enable_stats_collection(weight_quantizer) /
finish_stats_collection(weight_quantizer) as shown) and do not add any global or
activation quantizer calibration here; if activation quantization support is
needed, add a separate explicit code path elsewhere that checks an "activation
quantization enabled" flag and performs offline activation calibration (do not
rely on this weight-only helper to populate activation/global scales).

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 (Author)

> My question is:
>
> If the weights are not quantized because the expert has not been activated yet, then even if you quantize the weights, the inputs are not quantized and the input scales are not available. How can the deployment framework deploy this checkpoint without complaining that the input scales are not present?

Resolved offline.

@Edwardf0t1 enabled auto-merge (squash) on January 23, 2026 01:48
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 merged commit 4f4558a into main on Jan 23, 2026
36 checks passed
@Edwardf0t1 deleted the zhiyu/fix-nvfp4-amax-attribute branch on January 23, 2026 08:58