fix: resolve multi-iteration tensor file overwrite and simplify precision checker #104

chen2021673 · 2026-01-22T07:55:38Z

Summary

Fix precision checker file accumulation issue during multi-iteration runs and simplify the overall implementation.

Changes

Counter Mechanism Fix

Add ResetCounters() method to reset tensor counter at iteration boundaries
Move counter management to PrecisionCheckEnv with thread_local storage for thread safety
Call ResetCounters() at the start of each training step in gpt2/llama3
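
The counter behaviour described above can be modelled in a few lines. This is a minimal Python sketch, not the C++ code from this PR: `PrecisionCheckEnvSketch`, `next_index`, and `reset_counters` are hypothetical names, and `threading.local` stands in for C++ `thread_local` storage. It only illustrates that resetting at each step boundary makes tensor indices repeat across iterations, and that each thread keeps its own count.

```python
import threading


class PrecisionCheckEnvSketch:
    """Hypothetical Python model of a per-thread tensor counter."""

    def __init__(self):
        # Attributes set on a threading.local object are private to each thread.
        self._local = threading.local()

    def next_index(self):
        """Return the calling thread's current counter value, then increment it."""
        idx = getattr(self._local, "counter", 0)
        self._local.counter = idx + 1
        return idx

    def reset_counters(self):
        """Restart numbering for the calling thread only."""
        self._local.counter = 0


env = PrecisionCheckEnvSketch()
indices_per_step = []
for step in range(2):
    env.reset_counters()  # called at the start of each training step
    indices_per_step.append([env.next_index() for _ in range(3)])
```

With the reset in place, both steps in this model see the same indices, so any name derived from the index is stable from one iteration to the next.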

Precision Checker Refactoring

Remove baseline comparison functionality (use separate script instead)
Remove table format output, keep only simple and md5 formats
Add SaveNpy() function with rank subdirectory support
Simplify log format: [GAS-X] [L-Y] name_idx_stage tensor[i]: dtype=... shape=... min=... max=... mean=... [values] [NaN:X Inf:Y]
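
The statistics in the log line (min, max, mean, NaN count, Inf count) can be illustrated with a small Python sketch. The field names follow the TensorStats struct named in the commit message, but `compute_stats` is a hypothetical helper, and the choice to compute min/max/mean over finite values only is an assumption; the PR does not say how non-finite values are treated.

```python
import math
from dataclasses import dataclass


@dataclass
class TensorStats:
    min: float
    max: float
    mean: float
    nan_count: int
    inf_count: int


def compute_stats(values):
    """Summarise a flat sequence of floats.

    Assumption: min/max/mean ignore NaN and Inf entries, which are
    counted separately.
    """
    nan_count = sum(1 for v in values if math.isnan(v))
    inf_count = sum(1 for v in values if math.isinf(v))
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        # Nothing finite to summarise; report NaN for the three statistics.
        return TensorStats(math.nan, math.nan, math.nan, nan_count, inf_count)
    return TensorStats(
        min(finite),
        max(finite),
        sum(finite) / len(finite),
        nan_count,
        inf_count,
    )
```

Counting NaN and Inf separately keeps one bad element from hiding the range of the remaining values.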

New Scripts

scripts/precision_check/precision_compare.py - Offline NPY comparison tool
scripts/precision_check/run_precision_check_gpt2.sh - GPT2 verification script
scripts/precision_check/run_precision_check_llama3.sh - LLaMA3 verification script

Documentation

Update docs/precision_checker_guide.md to reflect current implementation

Usage Example

# Basic check
./build/gpt2 --precision_check "level=1" --num_iteration 1

# Save NPY files
./build/gpt2 --precision_check "level=1,save_tensors=true" --num_iteration 1

# MD5 format
./build/gpt2 --precision_check "level=1,format=md5" --num_iteration 1

# Compare two runs
python scripts/precision_check/precision_compare.py \
    --dir1 ./precision_check/run1 \
    --dir2 ./precision_check/run2
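
How precision_compare.py matches files between the two directories is not described in this PR. One plausible first step, sketched below, is to pair NPY files by their path relative to each run directory, which would also cover rank subdirectories. `pair_npy_files` is a hypothetical helper and not part of the script.

```python
from pathlib import Path


def pair_npy_files(dir1, dir2):
    """Return (common, only_in_dir1, only_in_dir2) as sorted relative paths."""
    root1, root2 = Path(dir1), Path(dir2)
    # rglob descends into subdirectories, so nested layouts are included.
    files1 = {p.relative_to(root1).as_posix() for p in root1.rglob("*.npy")}
    files2 = {p.relative_to(root2).as_posix() for p in root2.rglob("*.npy")}
    return (
        sorted(files1 & files2),
        sorted(files1 - files2),
        sorted(files2 - files1),
    )
```

Files that appear in only one run are returned separately so that a missing tensor is reported rather than silently skipped.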

Testing Example

Run verification script:

bash scripts/precision_check/run_precision_check_gpt2.sh

…sion checker

Counter mechanism:
- Add ResetCounters() to clear tensor counter at iteration boundaries
- Move counter management to PrecisionCheckEnv with thread_local storage
- Call ResetCounters() at start of each training step in gpt2/llama3

Precision checker refactoring:
- Remove baseline comparison functionality (use separate script instead)
- Remove table format output, keep only simple and md5 formats
- Add TensorStats struct with min/max/mean/nan_count/inf_count
- Add SaveNpy() function for NPY file saving with rank subdirectories
- Simplify log output format with dtype, shape, stats, and first 6 values
- Change stage names from "Module Forward/Backward Output" to "Forward/Backward Output"
- Use std::filesystem instead of sys/stat.h for directory creation

Documentation and scripts:
- Update docs/precision_checker_guide.md with current implementation
- Add precision_compare.py for offline NPY comparison
- Add run_precision_check_gpt2.sh and run_precision_check_llama3.sh

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add GlobalModuleHookRegistry singleton to decouple PrecisionChecker from Module::operator(), allowing any hook to be registered globally
- Add md5_tolerance config option for PrecisionChecker to handle BF16 precision differences (e.g., md5_tolerance=1e-3 makes 4.0003 and 4.0004 produce the same MD5 hash)
- Update gpt2 and llama3 examples to use the new hook registration API

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
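
The md5_tolerance behaviour can be illustrated with a short Python sketch. The PR does not show how values are quantized before hashing, so the rounding scheme here (snap each value to the nearest multiple of the tolerance, then hash the integer bucket indices) is an assumption, and `quantized_md5` is a hypothetical helper.

```python
import hashlib
import struct


def quantized_md5(values, tolerance):
    """MD5 of values after snapping each to the nearest multiple of tolerance.

    Assumes every value is finite; round() raises on NaN or Inf.
    """
    # Hash integer bucket indices so sub-tolerance float noise cannot leak in.
    buckets = [round(v / tolerance) for v in values]
    payload = struct.pack(f"<{len(buckets)}q", *buckets)
    return hashlib.md5(payload).hexdigest()


same = quantized_md5([4.0003], 1e-3) == quantized_md5([4.0004], 1e-3)
different = quantized_md5([4.0003], 1e-3) != quantized_md5([4.002], 1e-3)
```

Under this scheme 4.0003 and 4.0004 both land in bucket 4000 and hash identically. Two values closer than the tolerance can still straddle a bucket boundary (4.0004 and 4.0006, for example) and hash differently; that is a property of this rounding sketch and says nothing about the PR's implementation.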

chen2021673 and others added 3 commits January 22, 2026 07:49

docs: translate Chinese comments to English in GlobalModuleHookRegistry

2d8abfe

Replace all Chinese comments with English translations in global_module_hook_registry.h for better international accessibility.
