Status: Open
Labels: core (Internal engine: Shape, Storage, TensorEngine, iterators), enhancement (New feature or request), performance (Performance improvements or optimizations)
Description
Overview
Extend the IL kernel generator's SIMD reduction support beyond the current Sum/Max/Min element-wise operations.
Parent issue: #545
Current State
| Reduction | Element-wise (axis=null) | Axis Reduction |
|---|---|---|
| Sum | ✅ SIMD | ❌ NDIterator |
| Max | ✅ SIMD | ❌ NDIterator |
| Min | ✅ SIMD | ❌ NDIterator |
| Prod | ❌ Scalar | ❌ NDIterator |
| ArgMax | ❌ Scalar | ❌ NDIterator |
| ArgMin | ❌ Scalar | ❌ NDIterator |
Existing SIMD code: ILKernelGenerator.cs:3996-4095
Task List
Tier 1: Element-wise Improvements
- [ ] SIMD Prod (Product reduction)
  - Challenge: No Vector256.HorizontalMultiply() in .NET
  - Solution: Lane-wise multiply accumulator, horizontal product at the end
  - Expected: 1.5-2× speedup

```csharp
Vector256<T> prodVec = Vector256.Create((T)1);
for (i = 0; i <= vectorEnd; i += vectorCount)
    prodVec = prodVec * Vector256.Load(input + i);
result = prodVec[0] * prodVec[1] * ... * prodVec[N-1];
```
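To make the lane-wise idea concrete, here is a minimal standalone C# sketch of the approach (the `ProdSketch` name, the `double` element type, and the array-based API are illustrative only — the real kernel would be emitted as IL by ILKernelGenerator, and would be generic over T):

```csharp
using System;
using System.Runtime.Intrinsics;

static class ProdSketch
{
    // Lane-wise multiply accumulator; one horizontal product at the end.
    public static double Prod(double[] data)
    {
        int vc = Vector256<double>.Count;   // 4 lanes for double
        int i = 0;
        double result = 1.0;
        if (Vector256.IsHardwareAccelerated && data.Length >= vc)
        {
            var prodVec = Vector256.Create(1.0);              // multiplicative identity
            for (; i <= data.Length - vc; i += vc)
                prodVec *= Vector256.LoadUnsafe(ref data[i]); // per-lane multiply
            for (int lane = 0; lane < vc; lane++)             // horizontal product
                result *= prodVec.GetElement(lane);
        }
        for (; i < data.Length; i++)                          // scalar tail
            result *= data[i];
        return result;
    }
}
```

Note the scalar tail loop: unlike Sum, there is no cheap masked load for the remainder, so the last `length % vectorCount` elements are folded in scalar-wise.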
- [ ] SIMD ArgMax/ArgMin
  - Challenge: Need to track indices alongside values
  - Solution: Dual-accumulator (value + index vectors)
  - Expected: 2× speedup

```csharp
Vector256<T> maxVec = Vector256.Create(T.MinValue);
Vector256<int> idxVec = Vector256.Create(0, 1, 2, ...);
for (i = 0; i <= vectorEnd; i += vectorCount)
{
    var vec = Vector256.Load(input + i);
    var mask = Vector256.GreaterThan(vec, maxVec);
    maxVec = Vector256.ConditionalSelect(mask, vec, maxVec);
    idxVec = Vector256.ConditionalSelect(mask.AsInt32(), currentIdx, idxVec);
}
```
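A fleshed-out sketch of the dual-accumulator idea, assuming `float` data with `int` lane indices (8 lanes each, so the value and index vectors stay aligned). The `ArgMaxSketch` name and API are hypothetical; the strict `GreaterThan` plus smallest-index-wins lane reduction is one way to preserve NumPy's first-occurrence tie-breaking:

```csharp
using System;
using System.Runtime.Intrinsics;

static class ArgMaxSketch
{
    public static int ArgMax(float[] data)
    {
        if (data.Length == 0) throw new ArgumentException("empty array");
        int vc = Vector256<float>.Count;   // 8 lanes for float
        int i = 0, best = 0;
        float bestVal = float.MinValue;
        if (Vector256.IsHardwareAccelerated && data.Length >= vc)
        {
            var maxVec = Vector256.Create(float.MinValue);
            var idxVec = Vector256<int>.Zero;
            var curIdx = Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7);
            var step = Vector256.Create(vc);
            for (; i <= data.Length - vc; i += vc)
            {
                var vec = Vector256.LoadUnsafe(ref data[i]);
                var mask = Vector256.GreaterThan(vec, maxVec);  // strict: earlier index wins ties
                maxVec = Vector256.ConditionalSelect(mask, vec, maxVec);
                idxVec = Vector256.ConditionalSelect(mask.AsInt32(), curIdx, idxVec);
                curIdx += step;                                 // advance candidate indices
            }
            // Horizontal pass over the lanes; smallest index wins on equal values.
            best = int.MaxValue;
            for (int lane = 0; lane < vc; lane++)
            {
                float v = maxVec.GetElement(lane);
                int idx = idxVec.GetElement(lane);
                if (v > bestVal || (v == bestVal && idx < best)) { bestVal = v; best = idx; }
            }
        }
        for (; i < data.Length; i++)                            // scalar tail
            if (data[i] > bestVal) { bestVal = data[i]; best = i; }
        return best;
    }
}
```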
Tier 2: Axis Reductions
- [ ] SIMD for inner-axis reductions
  - Current: Default.Reduction.AMax.cs uses slow NDIterator loops
  - Opportunity: For axis=-1 (inner), the reduction dimension is contiguous
  - Expected: 3-5× speedup for large axis reductions

```csharp
// Current (slow):
for each outer_coord:
    iter = arr[outer_coord, :].AsIterator()
    while iter.HasNext():
        val = iter.MoveNext()  // Virtual calls!

// With SIMD:
for each outer_coord:
    T* rowPtr = basePtr + outer_coord * outerStride
    result[outer_coord] = SimdReduce(rowPtr, innerSize)  // Vector256 loop
```
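The inner loop above can be sketched in safe C# for the Max case over a row-major [rows, cols] matrix. `AxisReduceSketch`/`MaxAxisLast` are hypothetical names and a float-only simplification; the point is that each output element reduces one contiguous row, so a flat Vector256 loop replaces the NDIterator entirely:

```csharp
using System;
using System.Runtime.Intrinsics;

static class AxisReduceSketch
{
    // axis = -1 Max over a row-major [rows, cols] matrix.
    public static void MaxAxisLast(float[] data, int rows, int cols, float[] result)
    {
        int vc = Vector256<float>.Count;
        for (int r = 0; r < rows; r++)
        {
            int offset = r * cols;          // row start: contiguous in row-major layout
            float max = float.MinValue;
            int i = 0;
            if (Vector256.IsHardwareAccelerated && cols >= vc)
            {
                var maxVec = Vector256.Create(float.MinValue);
                for (; i <= cols - vc; i += vc)
                    maxVec = Vector256.Max(maxVec, Vector256.LoadUnsafe(ref data[offset + i]));
                for (int lane = 0; lane < vc; lane++)   // horizontal max
                    max = Math.Max(max, maxVec.GetElement(lane));
            }
            for (; i < cols; i++)                       // scalar tail
                max = Math.Max(max, data[offset + i]);
            result[r] = max;
        }
    }
}
```

For short rows (cols below one vector width) this degrades gracefully to the scalar loop, which is still faster than per-element virtual iterator calls.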
Files to Modify
| File | Changes |
|---|---|
| ILKernelGenerator.cs | Add Prod/ArgMax/ArgMin to element-wise SIMD paths |
| ReductionKernel.cs | Add AxisReductionKernel delegate if needed |
| Default.Reduction.*.cs | Refactor axis reductions to use IL kernels |
| DefaultEngine.ReductionOp.cs | Wire up new kernels |
Benchmarks to Add

```csharp
[Benchmark] public double Prod_10M() => np.prod(_array);
[Benchmark] public int ArgMax_10M() => np.argmax(_array);
[Benchmark] public NDArray Sum_Axis0() => np.sum(_matrix, axis: 0);
[Benchmark] public NDArray Max_Axis1() => np.amax(_matrix, axis: 1);
```

Success Criteria
- Prod element-wise: ≥1.5× faster than current
- ArgMax/ArgMin element-wise: ≥1.5× faster than current
- Inner-axis reductions: ≥2× faster than current
- All existing reduction tests pass