
Performance: Replace heap allocations with stackalloc in 74+ hot paths #575

@Nucs

Description



Overview

NumSharp has 74+ instances of heap allocations in critical hot paths that should use stackalloc. This causes significant GC pressure and performance degradation, especially for reduction operations.

Evidence (Issue #509): sum(axis=0) on 500×500×500 float32 takes 11 seconds vs NumPy's 75ms (147× slower). User profiling confirmed: "GetOffset in Shape class creates list coords millions of times". A stackalloc fix reduced time from 6287ms to 2315ms (63% improvement).

Problem

Array ranks (ndim) are typically ≤ 20, yet NumSharp allocates a new int[ndim] on the heap for every coordinate lookup, slice operation, broadcast, transpose, and reduction. These small, short-lived allocations:

  • Trigger frequent GC collections
  • Fragment the managed heap
  • Prevent JIT optimization (allocation checks, write barriers)

Solution

Replace new int[n] with Span<int> buffer = stackalloc int[n] where n is bounded (typically ≤32 dimensions).

// BEFORE (slow - heap allocation)
var dims = new int[ndim];

// AFTER (fast - stack allocation)
Span<int> dims = stackalloc int[32];
dims = dims[..ndim];  // Slice to actual size
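
When the rank is not provably bounded at a call site, a common defensive variant of this pattern (a sketch; MaxStackDims is an illustrative constant, not a NumSharp identifier) falls back to the heap only for oversized inputs:

```csharp
// Guarded stackalloc: stack buffer for the common small-rank case, heap
// fallback for pathological ranks. Both paths flow through the same Span.
const int MaxStackDims = 32;
Span<int> dims = ndim <= MaxStackDims
    ? stackalloc int[MaxStackDims]
    : new int[ndim];
dims = dims[..ndim];
```

The conditional-expression form of stackalloc requires C# 8 or later; on older language versions the same guard can be written as an if/else assigning to the Span.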

Findings by Severity

TIER 1: CRITICAL (Tight Loops - Millions of Allocations)

1. np.repeat.cs - Per-Element Allocation in Nested Loop

File: src/NumSharp.Core/Manipulation/np.repeat.cs
Lines: 32, 45, 54, 63, 72, 81, 90, 99, 108, 117, 126, 135, 144, 165

// CURRENT: Allocates int[1] PER ITERATION (millions of times!)
for (int i = 0; i < data.size; i++)
    for (int j = 0; j < repeats; j++)
        ret.itemset(new int[1] {i * repeats + j}, data.GetAtIndex(i));

// FIX: Hoist the buffer out of the loop (assumes an itemset overload
// accepting Span<int>; with the existing int[] overload, reuse one int[1])
Span<int> idx = stackalloc int[1];
for (int i = 0; i < data.size; i++)
    for (int j = 0; j < repeats; j++)
    {
        idx[0] = i * repeats + j;
        ret.itemset(idx, data.GetAtIndex(i));
    }

Impact: 90-95% GC pressure reduction for repeat operations


2. Default.NonZero.cs - List + Array Clone Per Element

File: src/NumSharp.Core/Backends/Default/Indexing/Default.NonZero.cs
Lines: 60, 77, 99, 113, 127, 141, 155, 169, 183, 197, 211, 225, 239, 253

// CURRENT: Allocates List + clones array per nonzero element
var nonzeroCoords = new List<int[]>(x.size / 3);
// ... in loop:
nonzeroCoords.Add(coords.CloneArray());

Impact: High for sparse arrays with many nonzero elements
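
One way to restructure this (a sketch under assumed types; CollectNonZeroCoords and its signature are illustrative, not the NumSharp API) is a two-pass scan that writes every coordinate into one flat buffer, replacing the per-element List entries and array clones with a single allocation:

```csharp
// Two-pass approach: count the nonzero elements first, then write all
// coordinates into one contiguous buffer (ndim ints per hit) - a single
// allocation instead of one List node plus one cloned int[] per hit.
static int[] CollectNonZeroCoords(ReadOnlySpan<double> data, ReadOnlySpan<int> shape)
{
    int ndim = shape.Length;
    int count = 0;
    foreach (var v in data)
        if (v != 0) count++;

    var flat = new int[count * ndim];        // the only heap allocation
    Span<int> coords = stackalloc int[ndim]; // reusable coordinate cursor
    int write = 0;
    for (int i = 0; i < data.Length; i++)
    {
        if (data[i] != 0)
        {
            coords.CopyTo(flat.AsSpan(write * ndim, ndim));
            write++;
        }
        // odometer-style increment of the row-major coordinate cursor
        for (int d = ndim - 1; d >= 0; d--)
        {
            if (++coords[d] < shape[d]) break;
            coords[d] = 0;
        }
    }
    return flat; // coordinates of hit k live at flat[k*ndim .. k*ndim + ndim)
}
```

The second pass costs one extra sweep over the data, but removes all per-element allocations and leaves the coordinates cache-friendly for downstream consumers.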


3. Shape.Slice() - List Allocations

File: src/NumSharp.Core/View/Shape.cs
Lines: 906-907

// CURRENT
var sliced_axes = new List<int>();
var sliced_strides_list = new List<int>();

// FIX: Use stackalloc with manual count
Span<int> sliced_axes = stackalloc int[32];
Span<int> sliced_strides = stackalloc int[32];
int count = 0;

TIER 2: HIGH (Per-Operation Allocations)

4. Broadcasting Operations

File: src/NumSharp.Core/Backends/Default/ArrayManipulation/Default.Broadcasting.cs

Line  Allocation                 Method
28    new int[nd]                ResolveReturnShape()
215   new int[nd]                Broadcast(Shape[])
246   new int[nd]                Broadcast(Shape, Shape) - inside loop!
282   new int[rightShape.NDim]   Scalar broadcast
296   new int[leftShape.NDim]    Scalar broadcast

Impact: 5 allocations per binary operation (add, multiply, etc.)
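
The shape-resolution part of these methods can keep its scratch entirely on the stack and allocate only the escaping result. A sketch (NumPy-style broadcasting rules; BroadcastShape and its signature are illustrative, not the NumSharp internals):

```csharp
// Sketch: compute the broadcast result shape with exactly one allocation,
// made only when the result escapes the method.
static int[] BroadcastShape(ReadOnlySpan<int> left, ReadOnlySpan<int> right)
{
    int nd = Math.Max(left.Length, right.Length);
    Span<int> result = stackalloc int[nd];   // scratch lives on the stack
    for (int i = 0; i < nd; i++)
    {
        // Missing leading dimensions are treated as length 1
        int l = i < nd - left.Length ? 1 : left[i - (nd - left.Length)];
        int r = i < nd - right.Length ? 1 : right[i - (nd - right.Length)];
        if (l != r && l != 1 && r != 1)
            throw new InvalidOperationException("shapes are not broadcastable");
        result[i] = Math.Max(l, r);
    }
    return result.ToArray();                 // the only heap allocation
}
```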


5. Transpose/SwapAxes

File: src/NumSharp.Core/Backends/Default/ArrayManipulation/Default.Transpose.cs

Line     Allocation             Method
85       new int[ndims]         SwapAxes()
114      new List<int>(n)       RollAxis()
126-127  new int[nd.ndim] × 2   Transpose()
160      new int[n]             Transpose() empty case
178-179  new int[n] × 2         Transpose()

Impact: 4+ allocations per transpose operation
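
The temporaries here follow the "only allocate at return" pattern from the Solution section. A sketch (Permute is illustrative, not the NumSharp internals):

```csharp
// Sketch: permute shape and strides through stack scratch, allocating only
// the two arrays that must escape the method.
static (int[] shape, int[] strides) Permute(
    ReadOnlySpan<int> shape, ReadOnlySpan<int> strides, ReadOnlySpan<int> axes)
{
    int nd = shape.Length;
    Span<int> s = stackalloc int[nd];
    Span<int> st = stackalloc int[nd];
    for (int i = 0; i < nd; i++)
    {
        s[i] = shape[axes[i]];     // result axis i comes from source axis axes[i]
        st[i] = strides[axes[i]];
    }
    return (s.ToArray(), st.ToArray()); // allocations only for escaping results
}
```

If the permuted shape/strides can be written directly into the new Shape's storage, even these two ToArray() copies disappear.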


6. Reduction Operations (keepdims path)

Files: src/NumSharp.Core/Backends/Default/Math/Reduction/

File                          Lines   Allocation
Default.Reduction.Add.cs      38, 64  new int[arr.ndim]
Default.Reduction.Mean.cs     24, 39  new int[arr.ndim]
Default.Reduction.AMax.cs     24, 42  new int[arr.ndim]
Default.Reduction.AMin.cs     24, 42  new int[arr.ndim]
Default.Reduction.Product.cs  24, 40  new int[arr.ndim]
Default.Reduction.Std.cs      21, 37  new int[arr.ndim]
Default.Reduction.Var.cs      20, 36  new int[arr.ndim]

Impact: 14 allocations across reduction operations with keepdims=true
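
With keepdims=true the temporary shape is only needed long enough to construct the result, so the same stack-scratch pattern applies. A sketch for a single-axis reduction (KeepDimsShape is illustrative, not the NumSharp internals):

```csharp
// Sketch: build the keepdims result shape on the stack, allocating once
// only when the array escapes.
static int[] KeepDimsShape(ReadOnlySpan<int> shape, int axis)
{
    Span<int> result = stackalloc int[shape.Length];
    shape.CopyTo(result);
    result[axis] = 1;            // the reduced axis collapses to length 1
    return result.ToArray();     // single allocation for the escaping array
}
```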


TIER 3: MEDIUM (Less Frequent)

7. Shape.Unmanaged.cs - GetSubshape()

File: src/NumSharp.Core/View/Shape.Unmanaged.cs
Lines: 69-70, 95

var retShape = new int[newNDim];
var retStrides = new int[newNDim];
var innerShape = new int[newNDim];

Note: Line 52 already uses stackalloc - follow this pattern!


8. Selection/Indexing

Files:

  • src/NumSharp.Core/Selection/NDArray.Indexing.Selection.Getter.cs (lines 321, 368, 384, 388, 399)
  • src/NumSharp.Core/Selection/NDArray.Indexing.Selection.Setter.cs (lines 414, 419, 430)

9. Coordinate Incrementors

Files: src/NumSharp.Core/Utilities/Incrementors/

  • NDCoordinatesAxisIncrementor.cs (lines 33, 66)
  • NDCoordinatesIncrementor.cs (lines 21, 39, 102, 115)
  • NDCoordinatesLeftToAxisIncrementor.cs (line 32)
  • NDExtendedCoordinatesIncrementor.cs (line 26)
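
Note that the incrementors keep their coordinate buffer as instance state, so stackalloc cannot be used inside the class itself (the stack frame would not outlive the object). One option, sketched here with a hypothetical ref struct that is not the NumSharp API, is to borrow a caller-provided span so the caller decides where the buffer lives:

```csharp
// Sketch: a ref struct incrementor that borrows the caller's buffer
// (which may be stackalloc'd) instead of allocating its own int[].
ref struct CoordIncrementor
{
    private readonly ReadOnlySpan<int> _shape;
    private readonly Span<int> _coords;   // caller-owned storage

    public CoordIncrementor(ReadOnlySpan<int> shape, Span<int> coords)
    {
        _shape = shape;
        _coords = coords;
        _coords.Clear();                  // start at the origin
    }

    public ReadOnlySpan<int> Current => _coords;

    // Odometer increment; returns false once all coordinates are exhausted.
    public bool Next()
    {
        for (int d = _shape.Length - 1; d >= 0; d--)
        {
            if (++_coords[d] < _shape[d]) return true;
            _coords[d] = 0;
        }
        return false;
    }
}

// Usage: Span<int> buf = stackalloc int[shape.Length];
//        var it = new CoordIncrementor(shape, buf);
```

Being a ref struct, it cannot be stored in fields or captured by lambdas, which is the trade-off for holding a stack-backed span; where the existing class-based incrementors must persist, array pooling is the alternative.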

10. Logic Operations

  • np.all.cs line 75: new int[inputShape.Length]
  • np.any.cs line 76: new int[inputShape.Length]

11. Dot Product / MatMul

  • Default.Dot.cs lines 80, 88: new int[shape.NDim + 1]
  • Default.Dot.NDMD.cs line 24: new int[retlen]

12. Utilities

File: src/NumSharp.Core/Utilities/Arrays.cs

  • Lines 289, 328: new int[array.Rank]
  • Lines 300, 339: new List<int>(16)
  • Line 562: new int[1]

Implementation Checklist

Phase 1: Critical (Immediate Impact)

  • np.repeat.cs - Hoist allocation outside loops (14 locations)
  • Shape.cs:Slice() - Replace Lists with stackalloc spans
  • Default.NonZero.cs - Pool or restructure coordinate collection

Phase 2: High Priority

  • Default.Broadcasting.cs - 5 allocations
  • Default.Transpose.cs - 7 allocations
  • Default.Reduction.*.cs - 14 allocations across 7 files

Phase 3: Medium Priority

  • Shape.Unmanaged.cs:GetSubshape() - 3 allocations
  • Selection/Indexing - 8 allocations
  • Coordinate incrementors - 8 allocations
  • Remaining utilities and logic operations

Pattern to Follow

// Standard pattern for dimension arrays (ndim ≤ 32 typical)
Span<int> dims = stackalloc int[32];
dims = dims[..actualSize];

// For methods that need to return arrays
Span<int> temp = stackalloc int[32];
// ... populate temp ...
return temp[..count].ToArray();  // Only allocate at return

// For loops with index arrays
Span<int> idx = stackalloc int[1];
for (...)
{
    idx[0] = computedIndex;
    array.itemset(idx, value);
}

Verification

After fixes, benchmark with:

var arr = np.random.rand(500, 500, 500).astype(NPTypeCode.Single);
var sw = Stopwatch.StartNew();
var result = np.sum(arr, axis: 0);
Console.WriteLine($"Time: {sw.ElapsedMilliseconds}ms");
// Before: ~11000ms, After: ~2300ms (expected 63%+ improvement)

Related Issues

Labels

performance, optimization, gc-pressure, good-first-issue (for individual file fixes)

Metadata

Labels

core (Internal engine: Shape, Storage, TensorEngine, iterators), performance (Performance improvements or optimizations)
