
@yurekami

Summary

  • Fix dtype mismatch crash when resuming Muon optimizer training from checkpoint with bf16 enabled
  • Add load_state_dict override to cast momentum_buffer to match parameter dtype after loading
  • Add unit test for bf16 checkpoint resume scenario

Root Cause

When resuming training from a checkpoint with bf16 enabled:

  1. momentum_buffer is saved as fp32 in the checkpoint
  2. After load_state_dict(), momentum_buffer remains fp32
  3. Gradients in bf16 mode are bf16
  4. momentum.lerp_(grad, 1 - beta) raises a dtype-mismatch RuntimeError, since lerp_ requires its inputs to share a dtype (minimal repro below)
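For illustration, a minimal standalone repro of the failure mode (plain PyTorch, outside DeepSpeed; shapes and beta are arbitrary):

```python
import torch

# momentum_buffer comes back from the checkpoint as fp32 ...
momentum = torch.zeros(16, dtype=torch.float32)
# ... while bf16 training produces bf16 gradients
grad = torch.randn(16, dtype=torch.bfloat16)
beta = 0.95

# lerp_ requires self and end to share a dtype, so this raises a RuntimeError
momentum.lerp_(grad, 1 - beta)
```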

Fix

Added a load_state_dict override to all Muon optimizer classes that casts the optimizer state buffers to the parameter dtype after loading (a sketch follows the table):

Class                        Buffers Fixed
Muon                         momentum_buffer
SingleDeviceMuon             momentum_buffer
MuonWithAuxAdam              momentum_buffer, exp_avg, exp_avg_sq
SingleDeviceMuonWithAuxAdam  momentum_buffer, exp_avg, exp_avg_sq
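A minimal sketch of the pattern (illustrative only, not the exact DeepSpeed code; DtypeSafeSGD is a hypothetical stand-in built on torch.optim.SGD, whose momentum state happens to use the same momentum_buffer key):

```python
import torch

class DtypeSafeSGD(torch.optim.SGD):
    # Buffers to realign; the hybrid Muon classes also carry the Adam buffers.
    STATE_BUFFERS = ("momentum_buffer", "exp_avg", "exp_avg_sq")

    def load_state_dict(self, state_dict):
        super().load_state_dict(state_dict)
        # After restoring, cast each state buffer to its parameter's dtype
        # so later in-place ops (lerp_, etc.) see matching dtypes.
        for group in self.param_groups:
            for p in group["params"]:
                state = self.state.get(p, {})
                for name in self.STATE_BUFFERS:
                    buf = state.get(name)
                    if torch.is_tensor(buf) and buf.dtype != p.dtype:
                        state[name] = buf.to(p.dtype)
```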

Test Plan

  • Added a TestMuonBF16CheckpointResume test class (condensed sketch after this list) that:
    1. Creates a model with bf16 enabled and a Muon optimizer
    2. Trains for a few steps (creating momentum_buffer state)
    3. Saves a checkpoint
    4. Loads the checkpoint
    5. Resumes training (validates the fix)
  • Tests both ZeRO stage 1 and stage 2
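A condensed sketch of the scenario (plain PyTorch, no DeepSpeed engine or ZeRO; the fp32 upcast mimics the checkpoint storing optimizer state in fp32, and DtypeSafeSGD is the hypothetical stand-in from the sketch above):

```python
import torch

model = torch.nn.Linear(8, 4).to(torch.bfloat16)
opt = DtypeSafeSGD(model.parameters(), lr=1e-2, momentum=0.9)

# Steps 1-2: a few bf16 training steps populate momentum_buffer.
for _ in range(3):
    model(torch.randn(2, 8, dtype=torch.bfloat16)).sum().backward()
    opt.step()
    opt.zero_grad()

# Step 3: checkpoint; upcast state to fp32 to mimic the saved format.
ckpt = opt.state_dict()
for s in ckpt["state"].values():
    s["momentum_buffer"] = s["momentum_buffer"].float()

# Step 4: load; the override realigns buffers with the bf16 params.
opt.load_state_dict(ckpt)

# Step 5: resuming must not raise a dtype-mismatch error.
model(torch.randn(2, 8, dtype=torch.bfloat16)).sum().backward()
opt.step()
```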

Fixes: #7746

🤖 Generated with Claude Code
