
Performance: Replace heap allocations with stackalloc in 74+ hot paths #575

@Nucs

Description



Overview

NumSharp has 74+ instances of heap allocations in critical hot paths that should use stackalloc. This causes significant GC pressure and performance degradation, especially for reduction operations.

Evidence (Issue #509): sum(axis=0) on 500×500×500 float32 takes 11 seconds vs NumPy's 75ms (147× slower). User profiling confirmed: "GetOffset in Shape class creates list coords millions of times". A stackalloc fix reduced time from 6287ms to 2315ms (63% improvement).

Problem

Array ranks (ndim) are typically ≤ 20, yet NumSharp allocates a new int[ndim] on the heap for every coordinate lookup, slice operation, broadcast, transpose, and reduction. These small, short-lived allocations:

  • Trigger frequent GC collections
  • Fragment the managed heap
  • Prevent JIT optimization (allocation checks, write barriers)

Solution

Replace new int[n] with Span<int> buffer = stackalloc int[n] where n is bounded (typically ≤32 dimensions).

// BEFORE (slow - heap allocation)
var dims = new int[ndim];

// AFTER (fast - stack allocation)
Span<int> dims = stackalloc int[32];
dims = dims[..ndim];  // Slice to actual size
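
When the rank is not provably bounded at a call site, a common defensive variant of this pattern (a sketch; MaxStackDims is an illustrative constant, not a NumSharp identifier) falls back to the heap only for oversized inputs:

```csharp
// Guarded stackalloc: stack buffer for the common small-rank case, heap
// fallback for pathological ranks. Both paths flow through the same Span.
const int MaxStackDims = 32;
Span<int> dims = ndim <= MaxStackDims
    ? stackalloc int[MaxStackDims]
    : new int[ndim];
dims = dims[..ndim];
```

The conditional-expression form of stackalloc requires C# 8 or later; on older language versions the same guard can be written as an if/else assigning to the Span.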

Findings by Severity

TIER 1: CRITICAL (Tight Loops - Millions of Allocations)

1. np.repeat.cs - Per-Element Allocation in Nested Loop

File: src/NumSharp.Core/Manipulation/np.repeat.cs
Lines: 32, 45, 54, 63, 72, 81, 90, 99, 108, 117, 126, 135, 144, 165

// CURRENT: Allocates int[1] PER ITERATION (millions of times!)
for (int i = 0; i < data.size; i++)
    for (int j = 0; j < repeats; j++)
        ret.itemset(new int[1] {i * repeats + j}, data.GetAtIndex(i));

// FIX: Hoist the buffer out of the loop (assumes an itemset overload
// accepting Span<int>; with the existing int[] overload, reuse one int[1])
Span<int> idx = stackalloc int[1];
for (int i = 0; i < data.size; i++)
    for (int j = 0; j < repeats; j++)
    {
        idx[0] = i * repeats + j;
        ret.itemset(idx, data.GetAtIndex(i));
    }

Impact: 90-95% GC pressure reduction for repeat operations


2. Default.NonZero.cs - List + Array Clone Per Element

File: src/NumSharp.Core/Backends/Default/Indexing/Default.NonZero.cs
Lines: 60, 77, 99, 113, 127, 141, 155, 169, 183, 197, 211, 225, 239, 253

// CURRENT: Allocates List + clones array per nonzero element
var nonzeroCoords = new List<int[]>(x.size / 3);
// ... in loop:
nonzeroCoords.Add(coords.CloneArray());

Impact: High for sparse arrays with many nonzero elements
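
One way to restructure this (a sketch under assumed types; CollectNonZeroCoords and its signature are illustrative, not the NumSharp API) is a two-pass scan that writes every coordinate into one flat buffer, replacing the per-element List entries and array clones with a single allocation:

```csharp
// Two-pass approach: count the nonzero elements first, then write all
// coordinates into one contiguous buffer (ndim ints per hit) - a single
// allocation instead of one List node plus one cloned int[] per hit.
static int[] CollectNonZeroCoords(ReadOnlySpan<double> data, ReadOnlySpan<int> shape)
{
    int ndim = shape.Length;
    int count = 0;
    foreach (var v in data)
        if (v != 0) count++;

    var flat = new int[count * ndim];        // the only heap allocation
    Span<int> coords = stackalloc int[ndim]; // reusable coordinate cursor
    int write = 0;
    for (int i = 0; i < data.Length; i++)
    {
        if (data[i] != 0)
        {
            coords.CopyTo(flat.AsSpan(write * ndim, ndim));
            write++;
        }
        // odometer-style increment of the row-major coordinate cursor
        for (int d = ndim - 1; d >= 0; d--)
        {
            if (++coords[d] < shape[d]) break;
            coords[d] = 0;
        }
    }
    return flat; // coordinates of hit k live at flat[k*ndim .. k*ndim + ndim)
}
```

The second pass costs one extra sweep over the data, but removes all per-element allocations and leaves the coordinates cache-friendly for downstream consumers.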


3. Shape.Slice() - List Allocations

File: src/NumSharp.Core/View/Shape.cs
Lines: 906-907

// CURRENT
var sliced_axes = new List<int>();
var sliced_strides_list = new List<int>();

// FIX: Use stackalloc with manual count
Span<int> sliced_axes = stackalloc int[32];
Span<int> sliced_strides = stackalloc int[32];
int count = 0;

TIER 2: HIGH (Per-Operation Allocations)

4. Broadcasting Operations

File: src/NumSharp.Core/Backends/Default/ArrayManipulation/Default.Broadcasting.cs

Line  Allocation                 Method
28    new int[nd]                ResolveReturnShape()
215   new int[nd]                Broadcast(Shape[])
246   new int[nd]                Broadcast(Shape, Shape) - inside loop!
282   new int[rightShape.NDim]   Scalar broadcast
296   new int[leftShape.NDim]    Scalar broadcast

Impact: 5 allocations per binary operation (add, multiply, etc.)
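
The shape-resolution part of these methods can keep its scratch entirely on the stack and allocate only the escaping result. A sketch (NumPy-style broadcasting rules; BroadcastShape and its signature are illustrative, not the NumSharp internals):

```csharp
// Sketch: compute the broadcast result shape with exactly one allocation,
// made only when the result escapes the method.
static int[] BroadcastShape(ReadOnlySpan<int> left, ReadOnlySpan<int> right)
{
    int nd = Math.Max(left.Length, right.Length);
    Span<int> result = stackalloc int[nd];   // scratch lives on the stack
    for (int i = 0; i < nd; i++)
    {
        // Missing leading dimensions are treated as length 1
        int l = i < nd - left.Length ? 1 : left[i - (nd - left.Length)];
        int r = i < nd - right.Length ? 1 : right[i - (nd - right.Length)];
        if (l != r && l != 1 && r != 1)
            throw new InvalidOperationException("shapes are not broadcastable");
        result[i] = Math.Max(l, r);
    }
    return result.ToArray();                 // the only heap allocation
}
```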


5. Transpose/SwapAxes

File: src/NumSharp.Core/Backends/Default/ArrayManipulation/Default.Transpose.cs

Line     Allocation             Method
85       new int[ndims]         SwapAxes()
114      new List<int>(n)       RollAxis()
126-127  new int[nd.ndim] × 2   Transpose()
160      new int[n]             Transpose() empty case
178-179  new int[n] × 2         Transpose()

Impact: 4+ allocations per transpose operation
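
The temporaries here follow the "only allocate at return" pattern from the Solution section. A sketch (Permute is illustrative, not the NumSharp internals):

```csharp
// Sketch: permute shape and strides through stack scratch, allocating only
// the two arrays that must escape the method.
static (int[] shape, int[] strides) Permute(
    ReadOnlySpan<int> shape, ReadOnlySpan<int> strides, ReadOnlySpan<int> axes)
{
    int nd = shape.Length;
    Span<int> s = stackalloc int[nd];
    Span<int> st = stackalloc int[nd];
    for (int i = 0; i < nd; i++)
    {
        s[i] = shape[axes[i]];     // result axis i comes from source axis axes[i]
        st[i] = strides[axes[i]];
    }
    return (s.ToArray(), st.ToArray()); // allocations only for escaping results
}
```

If the permuted shape/strides can be written directly into the new Shape's storage, even these two ToArray() copies disappear.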


6. Reduction Operations (keepdims path)

Files: src/NumSharp.Core/Backends/Default/Math/Reduction/

File                          Lines   Allocation
Default.Reduction.Add.cs      38, 64  new int[arr.ndim]
Default.Reduction.Mean.cs     24, 39  new int[arr.ndim]
Default.Reduction.AMax.cs     24, 42  new int[arr.ndim]
Default.Reduction.AMin.cs     24, 42  new int[arr.ndim]
Default.Reduction.Product.cs  24, 40  new int[arr.ndim]
Default.Reduction.Std.cs      21, 37  new int[arr.ndim]
Default.Reduction.Var.cs      20, 36  new int[arr.ndim]

Impact: 14 allocations across reduction operations with keepdims=true
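
With keepdims=true the temporary shape is only needed long enough to construct the result, so the same stack-scratch pattern applies. A sketch for a single-axis reduction (KeepDimsShape is illustrative, not the NumSharp internals):

```csharp
// Sketch: build the keepdims result shape on the stack, allocating once
// only when the array escapes.
static int[] KeepDimsShape(ReadOnlySpan<int> shape, int axis)
{
    Span<int> result = stackalloc int[shape.Length];
    shape.CopyTo(result);
    result[axis] = 1;            // the reduced axis collapses to length 1
    return result.ToArray();     // single allocation for the escaping array
}
```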


TIER 3: MEDIUM (Less Frequent)

7. Shape.Unmanaged.cs - GetSubshape()

File: src/NumSharp.Core/View/Shape.Unmanaged.cs
Lines: 69-70, 95

var retShape = new int[newNDim];
var retStrides = new int[newNDim];
var innerShape = new int[newNDim];

Note: Line 52 already uses stackalloc - follow this pattern!


8. Selection/Indexing

Files:

  • src/NumSharp.Core/Selection/NDArray.Indexing.Selection.Getter.cs (lines 321, 368, 384, 388, 399)
  • src/NumSharp.Core/Selection/NDArray.Indexing.Selection.Setter.cs (lines 414, 419, 430)

9. Coordinate Incrementors

Files: src/NumSharp.Core/Utilities/Incrementors/

  • NDCoordinatesAxisIncrementor.cs (lines 33, 66)
  • NDCoordinatesIncrementor.cs (lines 21, 39, 102, 115)
  • NDCoordinatesLeftToAxisIncrementor.cs (line 32)
  • NDExtendedCoordinatesIncrementor.cs (line 26)
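
Note that the incrementors keep their coordinate buffer as instance state, so stackalloc cannot be used inside the class itself (the stack frame would not outlive the object). One option, sketched here with a hypothetical ref struct that is not the NumSharp API, is to borrow a caller-provided span so the caller decides where the buffer lives:

```csharp
// Sketch: a ref struct incrementor that borrows the caller's buffer
// (which may be stackalloc'd) instead of allocating its own int[].
ref struct CoordIncrementor
{
    private readonly ReadOnlySpan<int> _shape;
    private readonly Span<int> _coords;   // caller-owned storage

    public CoordIncrementor(ReadOnlySpan<int> shape, Span<int> coords)
    {
        _shape = shape;
        _coords = coords;
        _coords.Clear();                  // start at the origin
    }

    public ReadOnlySpan<int> Current => _coords;

    // Odometer increment; returns false once all coordinates are exhausted.
    public bool Next()
    {
        for (int d = _shape.Length - 1; d >= 0; d--)
        {
            if (++_coords[d] < _shape[d]) return true;
            _coords[d] = 0;
        }
        return false;
    }
}

// Usage: Span<int> buf = stackalloc int[shape.Length];
//        var it = new CoordIncrementor(shape, buf);
```

Being a ref struct, it cannot be stored in fields or captured by lambdas, which is the trade-off for holding a stack-backed span; where the existing class-based incrementors must persist, array pooling is the alternative.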

10. Logic Operations

  • np.all.cs line 75: new int[inputShape.Length]
  • np.any.cs line 76: new int[inputShape.Length]

11. Dot Product / MatMul

  • Default.Dot.cs lines 80, 88: new int[shape.NDim + 1]
  • Default.Dot.NDMD.cs line 24: new int[retlen]

12. Utilities

File: src/NumSharp.Core/Utilities/Arrays.cs

  • Lines 289, 328: new int[array.Rank]
  • Lines 300, 339: new List<int>(16)
  • Line 562: new int[1]

Implementation Checklist

Phase 1: Critical (Immediate Impact)

  • np.repeat.cs - Hoist allocation outside loops (14 locations)
  • Shape.cs:Slice() - Replace Lists with stackalloc spans
  • Default.NonZero.cs - Pool or restructure coordinate collection

Phase 2: High Priority

  • Default.Broadcasting.cs - 5 allocations
  • Default.Transpose.cs - 7 allocations
  • Default.Reduction.*.cs - 14 allocations across 7 files

Phase 3: Medium Priority

  • Shape.Unmanaged.cs:GetSubshape() - 3 allocations
  • Selection/Indexing - 8 allocations
  • Coordinate incrementors - 8 allocations
  • Remaining utilities and logic operations

Pattern to Follow

// Standard pattern for dimension arrays (ndim ≤ 32 typical)
Span<int> dims = stackalloc int[32];
dims = dims[..actualSize];

// For methods that need to return arrays
Span<int> temp = stackalloc int[32];
// ... populate temp ...
return temp[..count].ToArray();  // Only allocate at return

// For loops with index arrays
Span<int> idx = stackalloc int[1];
for (...)
{
    idx[0] = computedIndex;
    array.itemset(idx, value);
}

Verification

After fixes, benchmark with:

var arr = np.random.rand(500, 500, 500).astype(NPTypeCode.Single);
var sw = Stopwatch.StartNew();
var result = np.sum(arr, axis: 0);
Console.WriteLine($"Time: {sw.ElapsedMilliseconds}ms");
// Before: ~11000ms, After: ~2300ms (expected 63%+ improvement)

Related Issues

Labels

performance, optimization, gc-pressure, good-first-issue (for individual file fixes)

Metadata

Labels

core (Internal engine: Shape, Storage, TensorEngine, iterators), performance (Performance improvements or optimizations)
