Fix a nvfp4 weight amax attribute issue during export #785
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main     #785   +/-   ##
=======================================
  Coverage   74.17%   74.17%
=======================================
  Files         192      192
  Lines       19246    19246
=======================================
  Hits        14276    14276
  Misses       4970     4970
```
I think @meenchen's question is legitimate: do you feel we need to do this for all the quant formats, not just NVFP4? And even with this weight calibration, the activation amax is still not present. How will this PR be able to generate a valid HF checkpoint?
I think we can include other cases as needed later. "How will this PR be able to generate a valid HF checkpoint?" What do you mean? This patch has been tested by the Google team; they were able to generate the kimi-k2-thinking nvfp4 checkpoint.
My question is: if the weights are not quantized because the expert has not been activated yet, then even if you quantize the weights, the inputs are not quantized and the input scales are not available. How can the deployment framework deploy this checkpoint without complaining that the input scales are not present?
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Force-pushed afec8d2 to 4d54f55
📝 Walkthrough: Added an internal helper function to lazily calibrate weight quantizers by resetting amax and collecting statistics. Updated weight scaling factor computations for NVFP4 variants to call this helper, ensuring proper calibration before deriving scaling factors.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes. Pre-merge checks: ✅ 3 passed.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@modelopt/torch/export/quant_utils.py`:
- Around line 239-256: The current helper _ensure_weight_quantizer_calibrated
should only produce weight scales and must not attempt to produce activation
scales; ensure it remains weight-only by keeping the stats collection scoped to
the provided weight_quantizer (use enable_stats_collection(weight_quantizer) /
finish_stats_collection(weight_quantizer) as shown) and do not add any global or
activation quantizer calibration here; if activation quantization support is
needed, add a separate explicit code path elsewhere that checks an "activation
quantization enabled" flag and performs offline activation calibration (do not
rely on this weight-only helper to populate activation/global scales).
🧹 Nitpick comments (1)
modelopt/torch/export/quant_utils.py (1)
239-256: Consider adding a warning when force-calibrating weights. The implementation correctly follows the existing calibration pattern used elsewhere in this file. However, based on a previous review discussion, it would be valuable to add a warning when this lazy calibration is triggered, as it indicates the layer was not quantized during the main calibration phase (possibly due to a small `calib_size`).

💡 Proposed enhancement to add a warning:

```diff
 def _ensure_weight_quantizer_calibrated(
     weight_quantizer: TensorQuantizer, weight: torch.Tensor
 ) -> None:
     """Calibrate weight quantizer if amax is not set.

     This is a lazy calibration pattern used during export when weight quantizers
     may not have been calibrated during the main calibration phase.

     Args:
         weight_quantizer: The weight quantizer to calibrate
         weight: The weight tensor to use for calibration
     """
     if not hasattr(weight_quantizer, "_amax") or weight_quantizer._amax is None:
+        logger.warning(
+            "Weight quantizer was not calibrated during calibration phase. "
+            "Force-calibrating weights during export. Consider increasing calib_size."
+        )
         weight_quantizer.reset_amax()
         enable_stats_collection(weight_quantizer)
         weight_quantizer(weight)
         finish_stats_collection(weight_quantizer)
```
The diff hunk under review (the new helper in modelopt/torch/export/quant_utils.py):

```python
def _ensure_weight_quantizer_calibrated(
    weight_quantizer: TensorQuantizer, weight: torch.Tensor
) -> None:
    """Calibrate weight quantizer if amax is not set.

    This is a lazy calibration pattern used during export when weight quantizers
    may not have been calibrated during the main calibration phase.

    Args:
        weight_quantizer: The weight quantizer to calibrate
        weight: The weight tensor to use for calibration
    """
    if not hasattr(weight_quantizer, "_amax") or weight_quantizer._amax is None:
        weight_quantizer.reset_amax()
        enable_stats_collection(weight_quantizer)
        weight_quantizer(weight)
        finish_stats_collection(weight_quantizer)
```
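For illustration, a minimal sketch of how an NVFP4 weight scaling-factor path can call this helper before reading amax. The function name, constants, and formula below are illustrative (they follow the usual NVFP4 two-level scaling convention), not the exact code in quant_utils.py:

```python
import torch

FP8_E4M3_MAX = 448.0  # max representable magnitude of FP8 (E4M3)
FP4_E2M1_MAX = 6.0    # max representable magnitude of FP4 (E2M1)


def get_nvfp4_per_tensor_weight_scale(weight_quantizer, weight: torch.Tensor) -> torch.Tensor:
    """Illustrative: derive the per-tensor (second-level) NVFP4 weight scale from amax."""
    # Lazily calibrate so weight_quantizer._amax exists even if this layer
    # (e.g., a rarely-routed MoE expert) saw no data during calibration.
    _ensure_weight_quantizer_calibrated(weight_quantizer, weight)
    amax = weight_quantizer._amax.float()
    return amax / (FP4_E2M1_MAX * FP8_E4M3_MAX)
```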
🧩 Analysis chain
🌐 Web query:
vLLM NVFP4 quantization activation scales requirements
💡 Result:
For vLLM NVFP4 (FP4) quantization, the activation-scale requirements are:
- You must have per-tensor “global” activation scales computed offline via calibration data (i.e., NVFP4 requires a calibration dataset to determine these global activation scales). The per-group (group size = 16) local activation scales are generated dynamically at inference time. [1]
- Group size is fixed at 16 (you can’t change it). [2]
- Activation quantization only runs on NVIDIA Blackwell-class GPUs (SM100+); on older GPUs, vLLM will run weight-only (no activation quantization). [1]
- vLLM’s Marlin FP4 path does not support pairing NVFP4 weights with INT8/FP8 activations. [3]
Sources: [1] [2] [3]
🌐 Web query:
TensorRT-LLM NVFP4 quantization input scales deployment
💡 Result:
In TensorRT-LLM, NVFP4 “input/activation scales” are not something you pass at serving time; they’re produced/embedded during quantization/calibration and then used by the engine at runtime.
- Weights (W) in NVFP4: TensorRT requires per-block scales with block size 16 for NVFP4 weight quantization, and those scales live with the quantized weights/checkpoint/engine artifacts. [1]
- Activations (A) in NVFP4: TensorRT uses dynamic, per-block (block size 16) activation quantization. Practically, the engine computes a per-block scale from the current activation block at runtime. [1]
- What TensorRT-LLM calls “activation global scale”: For NVFP4 activations, TensorRT supports Dynamic Double Quantization, where the per-block activation scales are themselves quantized (to FP8), and that scale-quantization needs an offline-calibrated per-tensor scalar, often described as a “global scale” (globalSf). This globalSf is what you calibrate offline and then deploy with the engine; again, it is not provided per request. [2]
- Deployment workflow (where scales come from): Use a quantization flow (e.g., NeMo/ModelOpt → TensorRT-LLM export) that includes a calibration step to obtain algorithm-specific scaling factors, producing a TensorRT-LLM-ready checkpoint/dir used to build the serving engine. [3]
- How you “enable” NVFP4 in TRT-LLM code/config: TensorRT-LLM has an NVFP4 quantization mode flag (e.g., QuantMode.NVFP4), which is what causes the runtime to use the NVFP4 quantized kernels/scales. [4]
Implication for serving: your client still sends normal (e.g., FP16/BF16) inputs; the TensorRT(-LLM) engine applies NVFP4 activation quantization internally using the calibrated globalSf plus runtime-computed per-block scales. [1][2]
Sources: [1] NVIDIA TensorRT “Working with Quantized Types” (NVFP4 block size + activation scheme) https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html, [2] same doc section on Dynamic Double Quantization / globalSf (NVFP4) https://docs.nvidia.com/deeplearning/tensorrt/10.13.0/inference-library/work-quantized-types.html, [3] NeMo Quantization workflow (calibration produces scaling factors; output used to build TRT-LLM engine) https://docs.nvidia.com/nemo-framework/user-guide/25.09/model-optimization/quantization/quantization.html, [4] TensorRT-LLM quantization mode flags (includes NVFP4) https://nvidia.github.io/TensorRT-LLM/_modules/tensorrt_llm/quantization/mode.html
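As a rough illustration of that two-level scheme (not TensorRT-LLM's actual code; the constants follow the FP8/FP4 ranges above, the block size is 16, and the tensor is assumed to have a multiple of 16 elements):

```python
import torch

FP8_E4M3_MAX = 448.0
FP4_E2M1_MAX = 6.0
BLOCK_SIZE = 16


def nvfp4_global_scale(calib_amax: float) -> float:
    """Offline: per-tensor global scale (globalSf-style) from a calibrated amax."""
    return calib_amax / (FP4_E2M1_MAX * FP8_E4M3_MAX)


def nvfp4_block_scales(x: torch.Tensor, global_scale: float) -> torch.Tensor:
    """Per-block local scales, themselves stored in FP8 relative to the global scale."""
    blocks = x.reshape(-1, BLOCK_SIZE)        # assumes numel is a multiple of 16
    block_amax = blocks.abs().amax(dim=-1)
    block_scale = block_amax / FP4_E2M1_MAX   # local scale per 16-value block
    return (block_scale / global_scale).to(torch.float8_e4m3fn)  # "double quantization"
```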
🌐 Web query:
NVFP4 weight-only quantization activation amax
💡 Result:
- NVFP4 (TensorRT) is FP4E2M1 with per-block scaling (block size = 16), with data clipped to [-6, 6] during quantization. Weights can be quantized to NVFP4, and activations can be quantized too, but activation quantization is dynamic (computed at inference time). [1]
- If you are doing NVFP4 weight-only quantization, then activations are not quantized, so there is no “activation amax” parameter to precompute/provide (amax is only needed when you actually quantize activations). [1]
- When activations are NVFP4-quantized, TensorRT’s dynamic quantization computes the per-block scale from the block’s amax = max(abs(x)) at runtime (conceptually):

$$\text{scale} = \max_i \left(\frac{|x_i|}{qTypeMax}\right), \quad qTypeMax = 6 \text{ for FP4E2M1}$$

so “activation amax” corresponds to that runtime max-abs statistic. [2]
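For a concrete instance of that formula: a 16-value activation block whose largest magnitude is 3.0 gets a runtime scale of

$$\text{scale} = \frac{3.0}{6} = 0.5, \qquad \hat{x}_i = \mathrm{FP4E2M1}\!\left(\frac{x_i}{0.5}\right),$$

so no precomputed activation amax is needed for that local scale; only the per-tensor global scale comes from offline calibration.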
Sources:
[1] NVIDIA TensorRT Docs — Working with Quantized Types (NVFP4 scheme + activation quantization requirement) (docs.nvidia.com)
[2] NVIDIA TensorRT Docs — Working with Quantized Types (dynamic quantization scale formula / max-abs) (docs.nvidia.com)
Citations:
- 1: https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html
- 2: https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html
Activation scales are required by deployment frameworks, but only if activation quantization is enabled.
Both vLLM and TensorRT-LLM require offline-calibrated activation scales (global scales/globalSf) when NVFP4 activation quantization is used. However, for weight-only NVFP4 deployment, activation scales are not needed.
This fix addresses weight quantizer calibration, which is sufficient for weight-only NVFP4. For MoE models where experts may not be activated during calibration, missing activation scales would be problematic only if activation quantization is required. If the Kimi-K2 checkpoint uses weight-only NVFP4, this limitation is not a concern.
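As a sketch of how such a weight-only distinction could be checked at export time, assuming a layer exposes an `input_quantizer` with an `is_enabled` flag and the same `_amax` convention used by the helper above (the function name and warning wording here are hypothetical):

```python
import warnings


def check_activation_scales(layer_name: str, module) -> None:
    """Hypothetical export-time check: warn if activation quantization is enabled
    for a layer but no activation amax was collected during calibration."""
    input_quantizer = getattr(module, "input_quantizer", None)
    if input_quantizer is None or not getattr(input_quantizer, "is_enabled", False):
        # Weight-only quantization: no activation scale is required at deployment time.
        return
    if getattr(input_quantizer, "_amax", None) is None:
        warnings.warn(
            f"{layer_name}: activation quantization is enabled but no input amax was "
            "collected (e.g., an MoE expert never routed to during calibration); "
            "the exported checkpoint may be missing input scales."
        )
```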
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Resolved offline.
What does this PR do?
Type of change: Bugfix
Overview: Fix an NVFP4 weight amax attribute issue during export, especially when the calibration size is small. Context: sgl-project/sglang#14677 (comment)
Usage
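An illustrative PTQ-plus-export flow that exercises this path (the model/dataset loaders and export directory are placeholders, assuming the standard `mtq.quantize` and unified HF export APIs):

```python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model = load_model()              # hypothetical helper returning an HF causal LM
calib_loader = load_calib_data()  # hypothetical helper; a small calib set may skip some MoE experts


def forward_loop(m):
    # Run a few calibration batches through the model to collect amax statistics.
    for batch in calib_loader:
        m(**batch)


# NVFP4 PTQ; with a small calib_size, some expert weight quantizers may end up with
# no amax, which the lazy calibration in this PR handles at export time.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
export_hf_checkpoint(model, export_dir="nvfp4-checkpoint")  # export_dir value is illustrative
```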
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
Bug Fixes