chen2021673 commented Jan 22, 2026

Summary

Fix precision checker file accumulation issue during multi-iteration runs and simplify the overall implementation.

Changes

Counter Mechanism Fix

  • Add a ResetCounters() method that resets the tensor counter at iteration boundaries
  • Move counter management into PrecisionCheckEnv, using thread_local storage for thread safety
  • Call ResetCounters() at the start of each training step in the gpt2 and llama3 examples
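
The counter behaviour described above can be illustrated with a small Python analogue. This is only a sketch: the real implementation is C++ and uses `thread_local`, for which `threading.local` stands in here. The per-name counter table, the method names `next_index` and `reset_counters`, the helper `file_names_for_steps`, and the `name_idx_stage.npy` file naming are all assumptions made for illustration, not the project's API.

```python
import threading


class PrecisionCheckEnv:
    """Hypothetical Python stand-in for the C++ PrecisionCheckEnv."""

    # threading.local gives every thread its own attribute namespace,
    # which plays the role of C++ thread_local storage in this sketch.
    _local = threading.local()

    @classmethod
    def _counters(cls):
        # Lazily create the per-thread counter table on first use.
        if not hasattr(cls._local, "counters"):
            cls._local.counters = {}
        return cls._local.counters

    @classmethod
    def next_index(cls, name):
        """Return the current index for `name`, then increment it."""
        counters = cls._counters()
        idx = counters.get(name, 0)
        counters[name] = idx + 1
        return idx

    @classmethod
    def reset_counters(cls):
        """Clear the calling thread's counters (start of a training step)."""
        cls._counters().clear()


def file_names_for_steps(num_steps, calls_per_step=2):
    """Collect the (assumed) file names produced over several steps."""
    names = []
    for _ in range(num_steps):
        PrecisionCheckEnv.reset_counters()  # indices restart every step
        for _ in range(calls_per_step):
            idx = PrecisionCheckEnv.next_index("linear")
            names.append(f"linear_{idx}_forward.npy")
    return names
```

With the reset in place, every step yields the same set of names, so later iterations overwrite earlier files instead of adding new ones. Without it, the index would keep growing and the number of files would scale with the iteration count.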

Precision Checker Refactoring

  • Remove baseline comparison functionality (use the offline scripts/precision_check/precision_compare.py script instead)
  • Remove table format output, keep only simple and md5 formats
  • Add SaveNpy() function with rank subdirectory support
  • Simplify log format: [GAS-X] [L-Y] name_idx_stage tensor[i]: dtype=... shape=... min=... max=... mean=... [values] [NaN:X Inf:Y]
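
To make the simplified log format concrete, here is a rough Python sketch that computes the listed statistics and assembles one line in that layout. It is not the C++ implementation. The function names `compute_stats` and `format_line`, the choice to compute min/max/mean over finite values only, the number formatting, and the reading of `GAS` and `L` as two integer indices passed in by the caller are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class TensorStats:
    min: float
    max: float
    mean: float
    nan_count: int
    inf_count: int


def compute_stats(values):
    """Summarise a flat list of floats, counting NaN and Inf separately."""
    nan_count = sum(1 for v in values if math.isnan(v))
    inf_count = sum(1 for v in values if math.isinf(v))
    # Assumption: min/max/mean ignore non-finite entries.
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        return TensorStats(math.nan, math.nan, math.nan, nan_count, inf_count)
    mean = sum(finite) / len(finite)
    return TensorStats(min(finite), max(finite), mean, nan_count, inf_count)


def format_line(gas, layer, name, idx, stage, i, dtype, shape, values):
    """Build one log line in the layout described by the PR."""
    s = compute_stats(values)
    head = ", ".join(f"{v:.6g}" for v in values[:6])  # first 6 values only
    return (
        f"[GAS-{gas}] [L-{layer}] {name}_{idx}_{stage} tensor[{i}]: "
        f"dtype={dtype} shape={shape} "
        f"min={s.min:.6g} max={s.max:.6g} mean={s.mean:.6g} "
        f"[{head}] [NaN:{s.nan_count} Inf:{s.inf_count}]"
    )
```

For example, `format_line(0, 1, "linear", 0, "forward", 0, "float32", (5,), [1.0, 2.0, float("nan"), float("inf"), 3.0])` produces a single line whose trailing field reports one NaN and one Inf.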

New Scripts

  • scripts/precision_check/precision_compare.py - Offline NPY comparison tool
  • scripts/precision_check/run_precision_check_gpt2.sh - GPT2 verification script
  • scripts/precision_check/run_precision_check_llama3.sh - LLaMA3 verification script

Documentation

  • Update docs/precision_checker_guide.md to reflect current implementation

Usage Example

# Basic check
./build/gpt2 --precision_check "level=1" --num_iteration 1

# Save NPY files
./build/gpt2 --precision_check "level=1,save_tensors=true" --num_iteration 1

# MD5 format
./build/gpt2 --precision_check "level=1,format=md5" --num_iteration 1

# Compare two runs
python scripts/precision_check/precision_compare.py \
    --dir1 ./precision_check/run1 \
    --dir2 ./precision_check/run2

Testing Example

Run the GPT2 verification script:

bash scripts/precision_check/run_precision_check_gpt2.sh

chen2021673 and others added 3 commits January 22, 2026 07:49
…sion checker

Counter mechanism:
- Add ResetCounters() to clear tensor counter at iteration boundaries
- Move counter management to PrecisionCheckEnv with thread_local storage
- Call ResetCounters() at start of each training step in gpt2/llama3

Precision checker refactoring:
- Remove baseline comparison functionality (use separate script instead)
- Remove table format output, keep only simple and md5 formats
- Add TensorStats struct with min/max/mean/nan_count/inf_count
- Add SaveNpy() function for NPY file saving with rank subdirectories
- Simplify log output format with dtype, shape, stats, and first 6 values
- Change stage names from "Module Forward/Backward Output" to "Forward/Backward Output"
- Use std::filesystem instead of sys/stat.h for directory creation

Documentation and scripts:
- Update docs/precision_checker_guide.md with current implementation
- Add precision_compare.py for offline NPY comparison
- Add run_precision_check_gpt2.sh and run_precision_check_llama3.sh

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add GlobalModuleHookRegistry singleton to decouple PrecisionChecker
  from Module::operator(), allowing any hook to be registered globally
- Add md5_tolerance config option for PrecisionChecker to handle BF16
  precision differences (e.g., md5_tolerance=1e-3 makes 4.0003 and
  4.0004 produce the same MD5 hash)
- Update gpt2 and llama3 examples to use the new hook registration API

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
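
One plausible way to get the md5_tolerance behaviour described in this commit is to quantize each value to a multiple of the tolerance before hashing. The sketch below illustrates that idea in Python; it is an assumption about the approach, not the checker's actual C++ code. The function name `md5_with_tolerance` and the byte layout fed to the hash are hypothetical, and inputs are assumed to be finite.

```python
import hashlib
import struct


def md5_with_tolerance(values, tolerance=0.0):
    """MD5 of a list of finite floats, optionally quantized first."""
    if tolerance > 0.0:
        # Snap every value to the nearest multiple of `tolerance`, so
        # values that land in the same bucket hash identically.
        buckets = [round(v / tolerance) for v in values]
        payload = struct.pack(f"<{len(buckets)}q", *buckets)
    else:
        # No tolerance: hash the exact double-precision bit patterns.
        payload = struct.pack(f"<{len(values)}d", *values)
    return hashlib.md5(payload).hexdigest()


exact_a = md5_with_tolerance([4.0003])
exact_b = md5_with_tolerance([4.0004])
loose_a = md5_with_tolerance([4.0003], tolerance=1e-3)
loose_b = md5_with_tolerance([4.0004], tolerance=1e-3)
```

Here `exact_a` and `exact_b` differ while `loose_a` and `loose_b` match, mirroring the 4.0003 / 4.0004 example in the commit message. A limitation of bucket-based quantization is that two values closer than the tolerance can still straddle a bucket boundary (for instance 4.0004 and 4.0006 at 1e-3) and then hash differently.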
Replace all Chinese comments with English translations in
global_module_hook_registry.h for better international accessibility.