Changes from all commits
5911 commits
3b83c3f
Revert "Dynamic engine suspend/resume via prefill. (#1982)"
ko3n1g Nov 18, 2025
19d0422
fix: Pass the timeout argument for the EP group (#2268)
yanring Nov 19, 2025
efdc681
JIT for MoE router and preprocess (#1919)
yaox12 Nov 19, 2025
00884a8
Hotfix to CI, until the fix gets reviewed (#2298)
tdene Nov 19, 2025
f885d9c
Add functional test for DP coordinator throughput (#2189)
tdene Nov 19, 2025
70db86a
Add asyncio Queue like in Python 3.13 (#2224)
tdene Nov 19, 2025
744505e
Fixes for PR#1982 (#2303)
lmcafee-nvidia Nov 19, 2025
314a378
Fix PP KV cache allocation and enable multi-node PP inference (#2182)
santhnm2 Nov 19, 2025
21968ea
Revert active-buffer-size-gb arg name. (#2257)
lmcafee-nvidia Nov 19, 2025
712dff8
feat: check: api backwards compatibility (#2251)
pablo-garay Nov 19, 2025
6c8cdd5
Add MambaInferenceStateConfig dataclass (#2265)
santhnm2 Nov 19, 2025
dc473f9
Fix typo in inference example (#2311)
santhnm2 Nov 20, 2025
7dec856
feat: initialization of API backward compatibility verification (#2310)
pablo-garay Nov 20, 2025
e4b7259
Fix Mamba TP and remove confusing legacy initialization (#2202)
jaredcasper Nov 20, 2025
8463257
Refactor KD to use ModelOpt plugins file (#2305)
AAnoosheh Nov 20, 2025
9ce2482
mcore trigger mbridge
pablo-garay Nov 20, 2025
c2b1c7c
mcore trigger mbridge
pablo-garay Nov 20, 2025
a813740
mcore trigger mbridge
pablo-garay Nov 20, 2025
7e18da2
Revert "Refactor KD to use ModelOpt plugins file (#2305)"
ko3n1g Nov 20, 2025
8e830a1
Fix dynamic context syntax and remove redundant tensors (#2336)
kanz-nv Nov 20, 2025
475d7fa
Improve asyncio exception handling (#2300)
tdene Nov 20, 2025
5ab6392
ci: Upload to testpypi only on main (#2342)
ko3n1g Nov 21, 2025
0634924
implement graph config (#2203)
kanz-nv Nov 21, 2025
ddc55cd
Revert "implement graph config (#2203)"
ko3n1g Nov 21, 2025
f7fb5ec
feat: required check adjustment (#2350)
pablo-garay Nov 21, 2025
e772e06
synthesize, optimize
pablo-garay Nov 21, 2025
2cc0736
synthesize, optimize
pablo-garay Nov 21, 2025
f426230
Change default baseline commit for api compat check
pablo-garay Nov 21, 2025
f07cb14
fix: load iteration 0 for release checkpoints (#2351)
ananthsub Nov 21, 2025
81a87e2
Break apart dynamic inference step into 2 methods (#2192)
tdene Nov 21, 2025
c90160d
Bugfix for Mamba with Chunked-Prefill (#2293)
sidsingh-nvidia Nov 21, 2025
c9d2c8f
Explicitly zero out padding token activations for dynamic inference (…
santhnm2 Nov 21, 2025
63d4e7d
Refactor KD to use ModelOpt plugins file (v2) (#2355)
AAnoosheh Nov 21, 2025
29a810e
Prevent unnecessarily overwriting the default Hugging Face chat templ…
santhnm2 Nov 21, 2025
7994405
add FIM dataset support (#2291)
dimapihtar Nov 21, 2025
e35495d
Update DEFAULT_BASELINE in workflow configuration
pablo-garay Nov 22, 2025
233b5b0
Revert "Explicitly zero out padding token activations for dynamic inf…
chtruong814 Nov 22, 2025
90c8536
Clean up DP coord code & unit test (#2277)
tdene Nov 22, 2025
8daf046
[4/4] Merge Megatron-RL into LM (#2002)
tdene Nov 22, 2025
53bbf7a
Update coordinator control logic to be compatible with RL (#2227)
tdene Nov 22, 2025
8954e04
ci: Update backwards compat check baseline to 53bbf7a (#2361)
chtruong814 Nov 22, 2025
d313c6d
Account for test regression caused by prints (#2354)
tdene Nov 22, 2025
14464d1
Remove dependency on `megatron.training` within `megatron.core` (#2274)
ananthsub Nov 22, 2025
9873958
Fixes for gpt-oss (#2038)
cuichenx Nov 22, 2025
26b2e72
update
pablo-garay Nov 24, 2025
326ec8c
[HOT FIX] Fix bug of hybrid-ep backend in flex-dispatcher (#2286)
Autumn1998 Nov 24, 2025
17cd106
ci: Remove nemo-ci environment (#2364)
chtruong814 Nov 24, 2025
278e058
ci: Pass COMMUNITY_PROJECT_ID to community bot (#2366)
chtruong814 Nov 24, 2025
d61029f
ci: Remove environment from community-bot (#2376)
chtruong814 Nov 24, 2025
9269dda
monitoring & results in mcore
pablo-garay Nov 24, 2025
77b65ed
Add mbridge_ref input to select MBridge branch
pablo-garay Nov 24, 2025
aa7a564
Fix: Use correct repo NVIDIA-NeMo/Megatron-Bridge and add mbridge_ref…
pablo-garay Nov 24, 2025
7f70e22
gha action
pablo-garay Nov 24, 2025
c28b84e
ci: Bump commit for api check to d61029f (#2386)
chtruong814 Nov 24, 2025
ab1e26e
tidy / synthesize / enhance
pablo-garay Nov 24, 2025
56e8810
Merge branch 'main' of https://github.com/NVIDIA/Megatron-LM
pablo-garay Nov 24, 2025
bc242d9
Revert: trigger_mbridge_tests.yml file change (#2389)
pablo-garay Nov 25, 2025
49eef58
build: Upgrade deps (#2289)
ko3n1g Nov 25, 2025
2a51d86
Change KV cache init to empty to speedup graph recording and first pr…
kanz-nv Nov 25, 2025
4c7d3d6
Handle UVM compile lock issues (#2299)
tdene Nov 25, 2025
14b791b
Remove experimental tags for fused kernels. (#2233)
Victarry Nov 25, 2025
ffb8c35
Reduce Overhead in Timers (#2210)
yaox12 Nov 25, 2025
60df5c2
Revert "build: Upgrade deps (#2289)"
ko3n1g Nov 25, 2025
ba9caf4
Fix the entropy sign. (#2374)
yobibyte Nov 25, 2025
77a2d8b
Remove RL use of mock dataloader and kill RL inference interface on e…
jon-barker Nov 25, 2025
6f65536
Fix block_bag for RL (#2399)
kanz-nv Nov 25, 2025
13efcb8
adding action for checking whether PR author is nvidia employee or no…
theothermike Nov 25, 2025
898d633
Added top n log probs (#2262)
shanmugamr1992 Nov 25, 2025
3f91727
Fix logging when no IS is enabled. (#2375)
yobibyte Nov 26, 2025
6fc13a9
fix: exit failure when PR author is external contributor removed (#2410)
theothermike Nov 26, 2025
ebb2e91
Various small fixes for Megatron-FSDP. (#2346)
cspades Nov 26, 2025
f5531b0
Add grpo loop functional test (#2403)
jon-barker Nov 26, 2025
cb8f94e
Revert "Add grpo loop functional test (#2403)"
ko3n1g Nov 26, 2025
5153663
YARN position embedding clear forward method lru cache in init functi…
guyueh1 Nov 27, 2025
0819f3c
Graph Config Implementation (#2380)
kanz-nv Nov 27, 2025
b96d876
fix: adding k8s taints for ephermeral jobs (#2420)
theothermike Nov 27, 2025
9f15fed
ci: Enable functional tests (#2419)
ko3n1g Nov 27, 2025
40ef044
Reapply "build: Upgrade deps (NVIDIA#2289)" (#2408)
ko3n1g Nov 27, 2025
b21bbad
fix: use a script to do node tainting in the cicd workflow (#2421)
theothermike Nov 27, 2025
65ce253
ci: Mark gpt_dynamic_inference_tp1_pp1_583m_cuda_graphs_fp8_logitsmat…
ko3n1g Nov 28, 2025
6646d1a
ci: Disable `gpt_static_inference_cuda_graphs_pad_tp4_pp1_ep4_16B_log…
ko3n1g Nov 28, 2025
66c07b0
Fix rl training with data reuse. (#2428)
yobibyte Nov 28, 2025
8cde93d
Reapply - Add grpo loop functional test (#2411)
jon-barker Nov 28, 2025
6cc29a2
Revert "Reapply - Add grpo loop functional test (#2411)"
ko3n1g Nov 28, 2025
a62e237
chore: Add copyright to run_simple_mcore_train_loop.py (#2441)
chtruong814 Dec 1, 2025
66407fa
Retry inference test on different device if throughput slower than ex…
mathemakitten Dec 1, 2025
6d2a123
feat: mcore trigger mbridge (#2340)
pablo-garay Dec 1, 2025
7f4df2c
Reapply "Reapply - Add grpo loop functional test (#2411)"
ko3n1g Dec 1, 2025
848bff1
Remove redundant reduce in aux_loss logging (#2095)
BestJuly Dec 2, 2025
e2bd0db
Update DEFAULT_BASELINE in workflow configuration
pablo-garay Dec 2, 2025
9927a85
Add support for fake distributed process groups. (#2280)
Victarry Dec 2, 2025
0150d73
[Fix] Pass metadata to sharded_state_dict in load_modelopt_checkpoint…
kevalmorabia97 Dec 2, 2025
a6764e0
chore: Update codeowners for post-training (#2462)
ko3n1g Dec 2, 2025
77bc0f5
fix: Add merge_group support with pre-flight pattern (#2463)
pablo-garay Dec 2, 2025
3cacd5b
Add assertion for mxfp8 params without dp overlap (#2271)
kunlunl Dec 2, 2025
409f954
Add missing checkpoint arguments for MoE models (#2465)
santhnm2 Dec 2, 2025
40a4674
Clean log probs (#2404)
shanmugamr1992 Dec 2, 2025
08fdf5b
ci: Bump copyright workflow (#2473)
ko3n1g Dec 3, 2025
209bd6c
Fix `ImportError` and `NameError` in `examples/run_simple_mcore_train…
marksverdhei Dec 2, 2025
e8749f8
fix: Revert "Clean log probs (#2404)" (#2475)
chtruong814 Dec 3, 2025
2e6b2bc
Make grpo CI test use read-only data (#2472)
jon-barker Dec 3, 2025
54c33cb
Update golden values to allow new PRs to be merged (#2478)
tdene Dec 3, 2025
2d3459a
Clean log probs copy (#2477)
shanmugamr1992 Dec 3, 2025
c0b5c2c
Fix default.yaml for HFDatasetAgent use in countdown (#2487)
jon-barker Dec 3, 2025
e847643
Attention mask as PackedSeqParams (#2461)
jalbericiola Dec 3, 2025
299034c
fp8 param cuda graph support main (#2088)
kunlunl Dec 4, 2025
f9d02e9
docs: Add changelog for 0.15 (#2499)
ko3n1g Dec 4, 2025
50deedf
feat: improve external contributor single use ephemeral nodes (#2503)
theothermike Dec 4, 2025
f534416
Fix sequence parallel. (#2444)
yobibyte Dec 4, 2025
7e22b9c
update API check baseline (#2505)
pablo-garay Dec 4, 2025
c7ed7c6
Associate default rl cuda graphs attributes with args (#2453)
yobibyte Dec 4, 2025
b27818c
No using tokenizer in request record. (#2382)
lmcafee-nvidia Dec 4, 2025
abba836
make default --inference-dynamic-batching-cuda-graph-max-tokens value…
jon-barker Dec 4, 2025
8950d1a
Adjust the default CG size for functional test (#2544)
tdene Dec 4, 2025
c46a8ca
feat: API compat: ignore AttributeChangedValueBreakage (not a signatu…
pablo-garay Dec 4, 2025
f32dfec
feat: add decorator: experimental_api (#2539)
pablo-garay Dec 4, 2025
e79d9a8
ci: Add release workflows (#2507)
ko3n1g Dec 5, 2025
c3e1d2d
Fixing PG routing for inference & training separation (#2485)
wdykas Dec 5, 2025
b3a814e
ci: Fix release workflow (#2553)
ko3n1g Dec 5, 2025
286c806
fix: Duplicate artifact names (#2556)
ko3n1g Dec 5, 2025
fcc1aaf
ci: Avoid naming collision (#2558)
ko3n1g Dec 5, 2025
5883064
ci: Fixing naming collision (#2559)
ko3n1g Dec 5, 2025
dcee1c5
fix: publish release wheel and github release version number (#2561)
ko3n1g Dec 5, 2025
e152473
Revert "Fixing PG routing for inference & training separation (#2485)"
ko3n1g Dec 5, 2025
ecb948e
Fix MoE capacity handling (#2214)
DaizeDong Dec 5, 2025
04d202a
Avoid calling set_save_original_input with FP8 delayed scaling (#1860)
dalgarak Dec 5, 2025
b9d3736
build: Bump TE to 2.10 (#2496)
ko3n1g Dec 5, 2025
d2e7060
Add per-module TE quant config. (#2359)
kwyss-nvidia Dec 5, 2025
2c06b04
add more tokenizer arguments (#2377)
dimapihtar Dec 5, 2025
bcf07a2
Make check_large_grads non-fatal (#2307)
kwyss-nvidia Dec 5, 2025
416687f
fix for sequence packing plus sequence parallel: padding the sequence…
jalbericiola Dec 5, 2025
32ebde7
Revert "Make check_large_grads non-fatal (#2307)"
ko3n1g Dec 5, 2025
972d9b6
Torch symmetric - new latency optimized NVLS communication kernels fo…
sidsingh-nvidia Dec 5, 2025
8c4df6b
[Main] Support MTP packed-seq in main branch (#2173)
BestJuly Dec 7, 2025
8a5f379
Various quality-of-life improvements in training loop (#2580)
deepakn94 Dec 7, 2025
f7dfb99
Support TP greater than num_kv_heads by supporting QKV activation sub…
deepakn94 Dec 7, 2025
dfc3913
Fix FA3 import (#2577)
santhnm2 Dec 7, 2025
5232820
Fix runaway Etpt in straggler detector by resetting FLOPs accumulator…
cms42 Dec 7, 2025
4cf809c
Rename TensorRT Model Optimizer to Model Optimizer (#2373)
AAnoosheh Dec 7, 2025
e2199af
Reapply "Make check_large_grads non-fatal (#2307)"
ko3n1g Dec 7, 2025
f6e0d42
Fix aux loss scale when CP is enabled. (#2237)
Victarry Dec 7, 2025
01aad93
Save memory using main_param for moe in param_l2_norm (#2249)
BestJuly Dec 7, 2025
b51db3e
Changes to support latent MoEs (#2296)
deepakn94 Dec 8, 2025
03b6d31
update API compat check baseline to b51db3e (#2588)
pablo-garay Dec 8, 2025
f4957d1
Fix invalid argument failing tests on main (#2589)
tdene Dec 8, 2025
2bc35c5
Add openmathinstruct config. (#2586)
yobibyte Dec 8, 2025
e2cf81c
Move model configs to github. (#2587)
yobibyte Dec 8, 2025
8d18afd
fix: Assign tokenizer to Encoder.tokenizer in legacy mode (#2498)
iuyo5678 Dec 9, 2025
c21bf6e
Delete redundant import in yaml_arguments.py (#2139)
wplf Dec 9, 2025
dfb78dc
Fix world size mismatch causing distributed init deadlock (Issue #245…
CodersAcademy006 Dec 9, 2025
4a4f23a
Improve performance of request_metadata logic (#2378)
tdene Dec 9, 2025
7b11553
Fix broken Table of Contents links in README.md (#1954)
JungHoyoun Dec 9, 2025
9fba363
Add minor log update (#2080)
gautham-kollu Dec 9, 2025
bd32927
Fix link to NeMo performance summary documentation (#2190)
janbernloehr Dec 9, 2025
ef12f16
Prep for refit (#2590)
wdykas Dec 9, 2025
a2aafe3
feat: API compat: ignore ParameterMovedBreakage for __init__ methods …
pablo-garay Dec 9, 2025
5c54bb6
Revert "Prep for refit (#2590)"
ko3n1g Dec 9, 2025
be4baad
Fix NameError in pretrain_retro.py (add import_module), remove unused…
vignesh1507 Dec 10, 2025
d2b500f
Use the latest Hybrid-EP (#2479)
Autumn1998 Dec 10, 2025
dd54609
QK logits clipping (non-split version) (#1929)
BoxiangW Dec 10, 2025
93da800
update checkpointing documentation (#2606)
dimapihtar Dec 10, 2025
72230d2
[training migration] add training config dataclass and arg generation…
maanug-nv Dec 10, 2025
d9c911a
Check skip_prompt_log_probs in add_request (#2593)
tdene Dec 10, 2025
587a0ff
Refit prep 2 (#2608)
wdykas Dec 10, 2025
d2bd9fa
Batch Invariance (#2308)
wdykas Dec 10, 2025
5ab481c
Remove flattened_range code paths for distributed optimizer checkpoin…
dimapihtar Dec 11, 2025
5a24ff3
update commit (#2631)
dimapihtar Dec 11, 2025
f67b7bd
tests: Disable grads test
ko3n1g Dec 12, 2025
44899aa
Create separate teacher Layer Spec in KD mode (#2429)
AAnoosheh Dec 12, 2025
6b186c1
Dynamic context | Re-add max_requests arg. (#2488)
lmcafee-nvidia Dec 12, 2025
2ab9253
Inference | Fix entangled request generations. (#2584)
lmcafee-nvidia Dec 12, 2025
3a9f086
fix gpt3_mcore_reruns_resume_check_grads (#2646)
dimapihtar Dec 12, 2025
f5daa16
Nemotron nano v2 vl changes for Megatron Bridge (#2078)
cuichenx Dec 12, 2025
4f700f7
[docs] Migrate docs to new Sphinx (#2489)
Phlip79 Dec 12, 2025
1c6f6eb
Add option to only log inference every N steps (#2637)
tdene Dec 12, 2025
0a59bea
[docs] Use autodoc2 and remove automodule (#2542)
Phlip79 Dec 12, 2025
845617a
add backward compatibility support for loading mcore 0.15 checkpoints…
dimapihtar Dec 12, 2025
12b4406
add offline eagle3 instructions to readme (#2246)
yeyu-nvidia Dec 13, 2025
fe7fb73
Only initialize symmetric memory when needed (#2665)
sidsingh-nvidia Dec 15, 2025
e869218
Simplify parameter sync for checkpoint save (#2344)
ananthsub Dec 15, 2025
4a9b4a2
Update docstrings for dataset (#2666)
Phlip79 Dec 15, 2025
597e88a
[Megatron-FSDP] Support both old and new DeviceMesh APIs. (#2575)
cspades Dec 15, 2025
ff45bd4
Enable hybrid tensor + expert + data parallelism in mcore inference (…
sidsingh-nvidia Dec 16, 2025
43a0c33
Fix failing functional tests (#2679)
sidsingh-nvidia Dec 16, 2025
5f5741d
M4 + Dist Checkpoint: Replace global parallel state with explicit gro…
dimapihtar Dec 16, 2025
4bdd7b1
fix deprecated decorator import (#2680)
dimapihtar Dec 16, 2025
36a9081
Added integration for Kitchen extensions' SDPA and FA implementations…
frsun-nvda Dec 16, 2025
bacd164
Inference | Add request only if no paused requests. (#2600)
lmcafee-nvidia Dec 16, 2025
e9082fd
Pipeline parallelism fix in RL and sequence packing rewriting (#2632)
jalbericiola Dec 16, 2025
815d86c
Add oncall rotation (#2622)
Phlip79 Dec 16, 2025
bdc362a
Upgrade GitHub Actions to latest versions (#2678)
salmanmkc Dec 16, 2025
2a8bcf0
docs: Adding documentation.md to cover building documentation. (#2683)
aschilling-nv Dec 16, 2025
cf39a4d
Add moe layer perf UT. (#2673)
Victarry Dec 16, 2025
732bb8d
[Megatron-FSDP] Build default FSDP DeviceMesh, and remove model arg f…
cspades Dec 16, 2025
ae774fe
[docs] Add ability to disable autodoc2 for local builds (#2669)
Phlip79 Dec 17, 2025
d944ef9
Fix oncall assignment (#2686)
Phlip79 Dec 17, 2025
5613ed0
docs(readme): update Latest News section (#2684)
sbhavani Dec 17, 2025
2485495
Update RNG sharding to include EP rank (#2658)
paul-gibbons Dec 17, 2025
0b8a2ff
Add CODEOWNER for API backwards compatibility check files (#2687)
pablo-garay Dec 17, 2025
72416d0
Mark API backwards compatibility checks as OPTIONAL (non-blocking) (#…
pablo-garay Dec 17, 2025
9288125
pip install uv during GH action (#2695)
Phlip79 Dec 17, 2025
32b9ee4
chore: rotate oncall schedule
github-actions[bot] Dec 17, 2025
ff4a622
Don't delete svcnvidia-nemo-ci team from oncall (#2703)
Phlip79 Dec 17, 2025
3d1a5c8
RL: Rollouts should be distributed over the regular data parallel gro…
sidsingh-nvidia Dec 17, 2025
d321026
Use pull_request_target and don't use uv (#2702)
Phlip79 Dec 17, 2025
94b4759
Optimize TE cudagraph input memory (#2392)
buptzyb Dec 18, 2025
c7e5489
ci(fix): Pin gojq to stable version (#2480)
ko3n1g Dec 18, 2025
f19b59e
NVLS - fused reduce-scatter + residual + rms-norm + all-gather kernel…
sidsingh-nvidia Dec 18, 2025
0170e70
Default UVM level to 0. (#2450)
lmcafee-nvidia Dec 18, 2025
0b13c98
docs: improve documentation organization and add additional guides (#…
sbhavani Dec 18, 2025
d81b37b
Revert "Default UVM level to 0. (#2450)" (#2713)
chtruong814 Dec 18, 2025
a6f822c
Add missing imports in no-triton fallback (#2711)
maanug-nv Dec 18, 2025
1503c33
Fixes for #2450. (#2714)
lmcafee-nvidia Dec 18, 2025
1a2257b
Add RL parameter to set parallel generation tasks (#2712)
tdene Dec 19, 2025
30694e0
Refit prep 3 (#2708)
wdykas Dec 19, 2025
000c2e2
chore: Add cudagraph codeowners (#2720)
ko3n1g Dec 19, 2025
703bc36
[docs] Add developer section to docs (#2717)
Phlip79 Dec 19, 2025
7f471d7
Fix UVM argument for RL (#2722)
tdene Dec 19, 2025
ddf691d
[dcos] Update docs title to Megatron Core (#2729)
Phlip79 Dec 20, 2025
4193f3a
remove fp16 assert in moe_grouped_gemm & EP (#2495)
HaochenYuan Dec 22, 2025
a057662
Improve ModelOpt paths & add more Nemotron/hybrid model support (#2131)
jenchen13 Dec 22, 2025
cfd980b
Add options to improve data loader initialization time, especially at…
asolergi-nv Dec 22, 2025
1c67e7e
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Dec 22, 2025
8ea3b8d
ci: Fix copy-pr-bot update (#2736)
ko3n1g Dec 22, 2025
5b1ef07
Add oncall to all new PRs (#2734)
Phlip79 Dec 22, 2025
cc1b0b5
Hsdp register submesh fix lifuz mirror (#2467)
tomlifu Dec 23, 2025
1febe9f
Adding stop word support (#2685)
shanmugamr1992 Dec 23, 2025
a477766
Fix oncall assign (#2737)
Phlip79 Dec 23, 2025
f5d4c3a
Add support for non-decode CUDA graphs for Mamba models (#2474)
santhnm2 Dec 23, 2025
0a77122
Update sequence packing case when dummy PackedSeqParams are used (#2743)
mathemakitten Dec 24, 2025
876a046
feat: manual registration mode for nccl-ub option when using megatron…
youngeunkwon0405 Dec 24, 2025
ede9ae4
chore: rotate oncall schedule
github-actions[bot] Dec 24, 2025
3cf7a63
Update oncall for next few weeks (#2748)
Phlip79 Dec 24, 2025
dd7c9f4
Prep work for migrating to types from ModuleSpec (#2668)
nschank Dec 24, 2025
2b343d7
feat(MoE): Refactor cuda_graph_scope (#1920)
buptzyb Dec 30, 2025
a2d7d67
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Dec 31, 2025
11c9680
Fix merge conflict in #1920 (#2781)
tdene Dec 31, 2025
40d590d
ci: Allow disabling external contributors (#2784)
chtruong814 Dec 31, 2025
6977db9
chore: rotate oncall schedule
github-actions[bot] Dec 31, 2025
f33e009
Reflect the changes made by #1920 in RL (#2780)
tdene Dec 31, 2025
852e791
Fix 2780 (#2791)
tdene Dec 31, 2025
1909eb2
Only assign oncall to main PRs (#2755)
Phlip79 Dec 31, 2025
52bf635
Ignore bot for oncall (#2756)
Phlip79 Dec 31, 2025
0e33828
Update PR message (#2778)
Phlip79 Dec 31, 2025
ccc9ad3
Explicitly zero out padding token outputs when using quantization sca…
santhnm2 Dec 31, 2025
a427c47
Synchronize total block count across pipeline parallel ranks (#2578)
santhnm2 Dec 31, 2025
7843a80
Optimize TE CUDA Graph capturing time (#2482)
buptzyb Jan 2, 2026
1eed1d2
Do a pass of typing fixes on transformer/ (#2766)
nschank Jan 2, 2026
939f520
moe: remove unused variable scale_up (#1670)
WineChord Jan 4, 2026
e8dbcf7
build: Pin down `nvidia-nvshmem-cu13` (#2798) (#2803)
ko3n1g Jan 4, 2026
5 changes: 0 additions & 5 deletions .coveragerc

This file was deleted.

4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
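For context on the `extend-ignore` list above: E203 (whitespace before `:`) is commonly suppressed because Black-style slice formatting triggers it. A minimal hypothetical illustration (not part of the PR) of code this config would leave unflagged:

```python
# E203 fires on the space before ':' in a sliced expression,
# but this spacing is exactly what the Black formatter emits
# for slices with complex bounds. With extend-ignore = E203,
# flake8 accepts it.
data = [0, 1, 2, 3, 4, 5]
window = data[1 + 1 : 5]  # flake8 would report E203 here without the ignore
print(window)
```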
59 changes: 59 additions & 0 deletions .github/CODEOWNERS
@@ -0,0 +1,59 @@
megatron/core/ @NVIDIA/core-adlr @NVIDIA/core-nemo

megatron/core/models/gpt/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/gpt

megatron/core/models/multimodal/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/multi-modal

megatron/core/models/mamba/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/hybrid-mamba
megatron/core/ssm/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/hybrid-mamba

megatron/core/datasets/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/datasets

megatron/core/distributed/fsdp/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/megatron-fsdp

megatron/core/transformer/fsdp_dtensor_checkpoint.py @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/megatron-fsdp

megatron/core/dist_checkpointing/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/dist-checkpointing

megatron/core/optimizer/distrib_optimizer/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/dist-optimizer

megatron/core/inference/modelopt_support @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/quantization-and-inference

megatron/core/datasets/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/datasets

megatron/core/pipeline_parallel/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/pipeline-parallelism

megatron/core/transformer/ @NVIDIA/core-adlr @NVIDIA/core-nemo

megatron/core/transformer/moe/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/mixture-of-experts-adlr @NVIDIA/mixture-of-experts-devtech

megatron/core/inference/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/inference

megatron/core/parallel_state.py @NVIDIA/core-adlr @NVIDIA/core-nemo

megatron/core/post_training/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/post-training

megatron/post_training/ @NVIDIA/post-training

megatron/core/transformer/cuda_graphs.py @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/cuda-graphs

.gitlab/ @NVIDIA/ci
.github/ @NVIDIA/ci
.gitlab-ci.yml @NVIDIA/ci
docker/ @NVIDIA/ci
tests/functional_tests/python_test_utils/ @NVIDIA/ci
tests/functional_tests/shell_test_utils/ @NVIDIA/ci
tests/test_utils/recipes/ @NVIDIA/ci
tests/unit_tests/run_ci_test.sh @NVIDIA/ci

# API Backwards Compatibility Check
scripts/check_api_backwards_compatibility.py @NVIDIA/ci @pablo-garay
scripts/README_API_COMPAT.md @NVIDIA/ci @pablo-garay
.github/workflows/check_api_backwards_compatibility_workflow.yml @NVIDIA/ci @pablo-garay
docs/api-backwards-compatibility-check.md @NVIDIA/ci @pablo-garay
tests/unit_tests/test_api_backwards_compat_setup.py @NVIDIA/ci @pablo-garay

megatron/rl/ @NVIDIA/reinforcement-learning
examples/rl/ @NVIDIA/reinforcement-learning
test/unit_tests/test_rl_utils.py @NVIDIA/reinforcement-learning
train_rl.py @NVIDIA/reinforcement-learning
28 changes: 28 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,28 @@
---
name: Bug report
about: Create a report to help us improve the repository or project
title: ""
labels: bug
assignees: ''

---

**Describe the bug**

A clear and concise description of what the bug is.

**Steps/Code to reproduce bug**

Please list *minimal* steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.


**Expected behavior**

A clear and concise description of what you expected to happen.


**Additional context**

Add any other context about the problem here.
2 changes: 2 additions & 0 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,2 @@
blank_issues_enabled: false

20 changes: 20 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,20 @@
---
name: Feature request
about: Suggest an idea for this project
title: ""
labels: enhancement
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,12 @@
---
name: QUESTION
about: Ask a question about Megatron-LM that is not a bug, regression or enhancement request
title: "[QUESTION]"
labels: ''
assignees: ''

---

**Your question**
Ask a clear and concise question about Megatron-LM.
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/regression.md
@@ -0,0 +1,39 @@
---
name: REGRESSION
about: Report a regression in speed or accuracy due to a Megatron-LM update
title: "[REGRESSION]"
labels: ''
assignees: ''

---

**Describe the regression**
A clear and concise description of what the regression is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce, the faster it will get maintainer attention.

**Previous performance**
What speed or accuracy did you previously see?

**New performance**
What speed or accuracy do you see after the update?

**Stack trace/logs**
If applicable, add the stack trace or logs related to the regression.

**Environment (please complete the following information):**
- Previous Megatron-LM commit ID
- New Megatron-LM commit ID
- Previous PyTorch version
- New PyTorch version
- Previous CUDA version
- New CUDA version
- Previous NCCL version
- New NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue, state it here or link to a PR.

**Additional context**
Add any other context about the problem here.