Performance: Replace heap allocations with stackalloc in hot paths
Overview
NumSharp has 74+ instances of heap allocations in critical hot paths that should use stackalloc. This causes significant GC pressure and performance degradation, especially for reduction operations.
Evidence (Issue #509): sum(axis=0) on 500×500×500 float32 takes 11 seconds vs NumPy's 75ms (147× slower). User profiling confirmed: "GetOffset in Shape class creates list coords millions of times". A stackalloc fix reduced time from 6287ms to 2315ms (63% improvement).
Problem
Array dimension sizes are typically ≤20, yet NumSharp allocates new int[ndim] on the heap for every coordinate lookup, slice operation, broadcast, transpose, and reduction. These small, short-lived allocations:
- Trigger frequent GC collections
- Fragment the managed heap
- Prevent JIT optimization (allocation checks, write barriers)
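The GetOffset hot path quoted in the evidence above illustrates the cost. Below is a simplified, hypothetical sketch (the real method in src/NumSharp.Core/View/Shape.cs is more involved); the method names and signatures here are illustrative, not NumSharp's actual API:

```csharp
// BEFORE: a coordinate buffer is allocated on the heap for every lookup
// (a params parameter compiles to a hidden new int[] at each call site).
int GetOffsetSlow(int[] strides, params int[] coords)
{
    int offset = 0;
    for (int i = 0; i < coords.Length; i++)
        offset += strides[i] * coords[i];
    return offset;
}

// AFTER: the caller reuses one stack buffer across millions of lookups.
int GetOffsetFast(ReadOnlySpan<int> strides, ReadOnlySpan<int> coords)
{
    int offset = 0;
    for (int i = 0; i < coords.Length; i++)
        offset += strides[i] * coords[i];
    return offset;
}

// Caller side:
// Span<int> coords = stackalloc int[ndim];
// ... mutate coords inside the reduction loop, call GetOffsetFast(strides, coords) ...
```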
Solution
Replace new int[n] with Span<int> buffer = stackalloc int[n] where n is bounded (typically ≤32 dimensions); for cases where n is not statically bounded, fall back to a heap array or ArrayPool<int>.
// BEFORE (slow - heap allocation)
var dims = new int[ndim];
// AFTER (fast - stack allocation)
Span<int> dims = stackalloc int[32];
dims = dims[..ndim]; // Slice to actual size

Findings by Severity
TIER 1: CRITICAL (Tight Loops - Millions of Allocations)
1. np.repeat.cs - Per-Element Allocation in Nested Loop
File: src/NumSharp.Core/Manipulation/np.repeat.cs
Lines: 32, 45, 54, 63, 72, 81, 90, 99, 108, 117, 126, 135, 144, 165
// CURRENT: Allocates int[1] PER ITERATION (millions of times!)
for (int i = 0; i < data.size; i++)
for (int j = 0; j < repeats; j++)
ret.itemset(new int[1] {i * repeats + j}, data.GetAtIndex(i));
// FIX: Hoist the allocation outside the loop
// (assumes an itemset overload accepting ReadOnlySpan<int>;
//  if only int[] overloads exist, reuse a single int[1] instance instead)
Span<int> idx = stackalloc int[1];
for (int i = 0; i < data.size; i++)
for (int j = 0; j < repeats; j++)
{
    idx[0] = i * repeats + j;
    ret.itemset(idx, data.GetAtIndex(i));
}

Impact: 90-95% GC pressure reduction for repeat operations
2. Default.NonZero.cs - List + Array Clone Per Element
File: src/NumSharp.Core/Backends/Default/Indexing/Default.NonZero.cs
Lines: 60, 77, 99, 113, 127, 141, 155, 169, 183, 197, 211, 225, 239, 253
// CURRENT: Allocates List + clones array per nonzero element
var nonzeroCoords = new List<int[]>(x.size / 3);
// ... in loop:
nonzeroCoords.Add(coords.CloneArray());

Impact: High for sparse arrays with many nonzero elements
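One possible restructuring is to record flat (linear) indices and decode coordinates only on demand, eliminating the per-element array clone. This is a sketch under assumptions, not NumSharp's actual API; `data`, `size`, and `shape` stand in for the backend's real accessors:

```csharp
// Sketch: one list of flat indices instead of one cloned int[] per element.
var flat = new List<int>();
for (int i = 0; i < size; i++)
    if (data[i] != 0)
        flat.Add(i);

// Decode a flat index into coordinates on demand, into a caller-provided
// (typically stackalloc'd) span - no allocation per element.
static void Unravel(int flatIndex, ReadOnlySpan<int> shape, Span<int> coords)
{
    for (int d = shape.Length - 1; d >= 0; d--)
    {
        coords[d] = flatIndex % shape[d];
        flatIndex /= shape[d];
    }
}
```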
3. Shape.Slice() - List Allocations
File: src/NumSharp.Core/View/Shape.cs
Lines: 906-907
// CURRENT
var sliced_axes = new List<int>();
var sliced_strides_list = new List<int>();
// FIX: Use stackalloc with manual count
Span<int> sliced_axes = stackalloc int[32];
Span<int> sliced_strides = stackalloc int[32];
int count = 0;

TIER 2: HIGH (Per-Operation Allocations)
4. Broadcasting Operations
File: src/NumSharp.Core/Backends/Default/ArrayManipulation/Default.Broadcasting.cs
| Line | Allocation | Method |
|---|---|---|
| 28 | new int[nd] | ResolveReturnShape() |
| 215 | new int[nd] | Broadcast(Shape[]) |
| 246 | new int[nd] | Broadcast(Shape, Shape) - inside loop! |
| 282 | new int[rightShape.NDim] | Scalar broadcast |
| 296 | new int[leftShape.NDim] | Scalar broadcast |
Impact: 5 allocations per binary operation (add, multiply, etc.)
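A sketch of the fix for the shape-resolution case (method name and validation assumptions are illustrative; it assumes the input shapes have already been checked for broadcast compatibility): compute the result shape in a stack buffer and allocate a single array only at return.

```csharp
// Sketch: broadcast two shapes with one allocation instead of several.
static int[] ResolveReturnShapeSketch(ReadOnlySpan<int> left, ReadOnlySpan<int> right)
{
    int nd = Math.Max(left.Length, right.Length);
    Span<int> dims = stackalloc int[32];     // ndim is bounded in practice
    for (int i = 0; i < nd; i++)
    {
        // Right-align the shorter shape, padding with 1s (NumPy-style rules).
        int l = i < nd - left.Length ? 1 : left[i - (nd - left.Length)];
        int r = i < nd - right.Length ? 1 : right[i - (nd - right.Length)];
        dims[i] = Math.Max(l, r);            // valid because shapes are pre-validated
    }
    return dims[..nd].ToArray();             // single allocation, at return only
}
```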
5. Transpose/SwapAxes
File: src/NumSharp.Core/Backends/Default/ArrayManipulation/Default.Transpose.cs
| Line | Allocation | Method |
|---|---|---|
| 85 | new int[ndims] | SwapAxes() |
| 114 | new List<int>(n) | RollAxis() |
| 126-127 | new int[nd.ndim] × 2 | Transpose() |
| 160 | new int[n] | Transpose() empty case |
| 178-179 | new int[n] × 2 | Transpose() |
Impact: up to 6 allocations per transpose operation
6. Reduction Operations (keepdims path)
Files: src/NumSharp.Core/Backends/Default/Math/Reduction/
| File | Lines | Allocation |
|---|---|---|
| Default.Reduction.Add.cs | 38, 64 | new int[arr.ndim] |
| Default.Reduction.Mean.cs | 24, 39 | new int[arr.ndim] |
| Default.Reduction.AMax.cs | 24, 42 | new int[arr.ndim] |
| Default.Reduction.AMin.cs | 24, 42 | new int[arr.ndim] |
| Default.Reduction.Product.cs | 24, 40 | new int[arr.ndim] |
| Default.Reduction.Std.cs | 21, 37 | new int[arr.ndim] |
| Default.Reduction.Var.cs | 20, 36 | new int[arr.ndim] |
Impact: 14 allocations across reduction operations with keepdims=true
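For the keepdims path, the intermediate shape array can be assembled in a stack buffer and materialized once. A minimal sketch (the helper name and single-axis signature are assumptions, not the actual reduction code):

```csharp
// Sketch: build the keepdims result shape without intermediate heap arrays.
// The real reduction methods allocate new int[arr.ndim] twice per call.
static int[] KeepDimsShape(ReadOnlySpan<int> shape, int axis)
{
    Span<int> dims = stackalloc int[32];     // ndim bounded in practice
    for (int i = 0; i < shape.Length; i++)
        dims[i] = i == axis ? 1 : shape[i];  // reduced axis collapses to length 1
    return dims[..shape.Length].ToArray();   // single allocation at return
}
```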
TIER 3: MEDIUM (Less Frequent)
7. Shape.Unmanaged.cs - GetSubshape()
File: src/NumSharp.Core/View/Shape.Unmanaged.cs
Lines: 69-70, 95
var retShape = new int[newNDim];
var retStrides = new int[newNDim];
var innerShape = new int[newNDim];

Note: Line 52 already uses stackalloc - follow this pattern!
8. Selection/Indexing
Files:
- src/NumSharp.Core/Selection/NDArray.Indexing.Selection.Getter.cs (lines 321, 368, 384, 388, 399)
- src/NumSharp.Core/Selection/NDArray.Indexing.Selection.Setter.cs (lines 414, 419, 430)
9. Coordinate Incrementors
Files: src/NumSharp.Core/Utilities/Incrementors/
- NDCoordinatesAxisIncrementor.cs (lines 33, 66)
- NDCoordinatesIncrementor.cs (lines 21, 39, 102, 115)
- NDCoordinatesLeftToAxisIncrementor.cs (line 32)
- NDExtendedCoordinatesIncrementor.cs (line 26)
10. Logic Operations
- np.all.cs line 75: new int[inputShape.Length]
- np.any.cs line 76: new int[inputShape.Length]
11. Dot Product / MatMul
- Default.Dot.cs lines 80, 88: new int[shape.NDim + 1]
- Default.Dot.NDMD.cs line 24: new int[retlen]
12. Utilities
File: src/NumSharp.Core/Utilities/Arrays.cs
- Lines 289, 328: new int[array.Rank]
- Lines 300, 339: new List<int>(16)
- Line 562: new int[1]
Implementation Checklist
Phase 1: Critical (Immediate Impact)
- np.repeat.cs - Hoist allocation outside loops (14 locations)
- Shape.cs:Slice() - Replace Lists with stackalloc spans
- Default.NonZero.cs - Pool or restructure coordinate collection
Phase 2: High Priority
- Default.Broadcasting.cs - 5 allocations
- Default.Transpose.cs - 6 allocations
- Default.Reduction.*.cs - 14 allocations across 7 files
Phase 3: Medium Priority
- Shape.Unmanaged.cs:GetSubshape() - 3 allocations
- Selection/Indexing - 8 allocations
- Coordinate incrementors - 8 allocations
- Remaining utilities and logic operations
Pattern to Follow
// Standard pattern for dimension arrays (ndim ≤ 32 typical)
Span<int> dims = stackalloc int[32];
dims = dims[..actualSize];
// For methods that need to return arrays
Span<int> temp = stackalloc int[32];
// ... populate temp ...
return temp[..count].ToArray(); // Only allocate at return
// For loops with index arrays
Span<int> idx = stackalloc int[1];
for (...)
{
idx[0] = computedIndex;
array.itemset(idx, value);
}

Verification
After fixes, benchmark with:
var arr = np.random.rand(500, 500, 500).astype(NPTypeCode.Single);
var sw = Stopwatch.StartNew();
var result = np.sum(arr, axis: 0);
Console.WriteLine($"Time: {sw.ElapsedMilliseconds}ms");
// Before: ~11000ms, After: ~2300ms (expected 63%+ improvement)

Related Issues
- Extremely poor performance on sum reduce #509 - sum(axis=0) performance (11s vs 75ms)
- np.argmax is slow #451 - argmax slowness (500ms for 1×13×512×512)
- Performance on np.matmul #427 - matmul performance
Labels
performance, optimization, gc-pressure, good-first-issue (for individual file fixes)