
[SIMD] Reductions: Add SIMD for Prod, ArgMax/ArgMin, and Axis Reductions #576

@Nucs

Description


Overview

Extend the IL kernel generator's SIMD reduction support beyond the current Sum/Max/Min element-wise operations.

Parent issue: #545

Current State

| Reduction | Element-wise (axis=null) | Axis Reduction |
| --- | --- | --- |
| Sum | ✅ SIMD | ❌ NDIterator |
| Max | ✅ SIMD | ❌ NDIterator |
| Min | ✅ SIMD | ❌ NDIterator |
| Prod | ❌ Scalar | ❌ NDIterator |
| ArgMax | ❌ Scalar | ❌ NDIterator |
| ArgMin | ❌ Scalar | ❌ NDIterator |

Existing SIMD code: ILKernelGenerator.cs:3996-4095

Task List

Tier 1: Element-wise Improvements

  • SIMD Prod (Product reduction)

    • Challenge: No Vector256.HorizontalMultiply() in .NET
    • Solution: Lane-wise multiply accumulator, horizontal product at end
    • Vector256<T> prodVec = Vector256<T>.One;
      for (i = 0; i <= vectorEnd; i += vectorCount)
          prodVec *= Vector256.Load(input + i);
      // horizontal product across lanes, then multiply in the scalar tail
      result = prodVec[0] * prodVec[1] * ... * prodVec[N-1];
    • Expected: 1.5-2× speedup
  • SIMD ArgMax/ArgMin

    • Challenge: Need to track indices alongside values
    • Solution: Dual-accumulator (value + index vectors)
    • Vector256<T> maxVec = Vector256.Create(T.MinValue);
      Vector256<int> idxVec = Vector256.Create(0, 1, 2, ...);
      for (i = 0; i <= vectorEnd; i += vectorCount) {
          var vec = Vector256.Load(input + i);
          var curIdx = Vector256.Create(i) + laneOffsets;  // absolute indices
          var mask = Vector256.GreaterThan(vec, maxVec);
          maxVec = Vector256.ConditionalSelect(mask, vec, maxVec);
          idxVec = Vector256.ConditionalSelect(mask.AsInt32(), curIdx, idxVec);
      }
      // finish: horizontal max over maxVec, then the smallest idxVec lane
      // holding that max (preserves first-occurrence semantics)
    • Expected: 2× speedup
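The lane-wise product strategy above can be modeled in NumPy (matching the `np.prod` the benchmarks call) to check it against a plain product; `simd_prod` and `lanes` are hypothetical stand-ins for the Vector256 kernel, not existing NumSharp code:

```python
import numpy as np

# Model of the proposed Prod kernel: an 8-lane product accumulator
# (mirroring Vector256<double>), a horizontal product across lanes,
# then a scalar tail for the leftover elements.
def simd_prod(a, lanes=8):
    acc = np.ones(lanes, dtype=a.dtype)        # prodVec = Vector256<T>.One
    vector_end = len(a) - len(a) % lanes
    for i in range(0, vector_end, lanes):      # lane-wise multiply-accumulate
        acc *= a[i:i + lanes]
    result = acc.prod()                        # horizontal product
    for x in a[vector_end:]:                   # scalar tail
        result *= x
    return result

a = np.random.default_rng(0).uniform(0.9, 1.1, 1005)
assert np.isclose(simd_prod(a), np.prod(a))
```

Note that the lane-wise accumulation reorders the multiplications, so floating-point results can differ from the sequential product in the last bits; the existing SIMD Sum path has the same property.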
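The dual-accumulator ArgMax can likewise be sanity-checked in NumPy, including the tie-breaking in the horizontal step; `simd_argmax` is a hypothetical model of the proposed kernel:

```python
import numpy as np

# Model of the dual-accumulator ArgMax: per-lane running max plus the
# absolute index where each lane last improved, resolved by a horizontal
# pass (smallest index among lanes holding the global max) plus a tail.
def simd_argmax(a, lanes=8):
    n = len(a) - len(a) % lanes
    max_vec = np.full(lanes, -np.inf)            # maxVec = Create(T.MinValue)
    idx_vec = np.arange(lanes)                   # idxVec = Create(0,1,...,7)
    for i in range(0, n, lanes):
        vec = a[i:i + lanes]
        cur_idx = i + np.arange(lanes)           # absolute element indices
        mask = vec > max_vec                     # Vector256.GreaterThan
        max_vec = np.where(mask, vec, max_vec)   # ConditionalSelect
        idx_vec = np.where(mask, cur_idx, idx_vec)
    best = max_vec.max()                         # horizontal max
    best_idx = idx_vec[max_vec == best].min()    # first occurrence wins
    for i in range(n, len(a)):                   # scalar tail (strict >)
        if a[i] > best:
            best, best_idx = a[i], i
    return int(best_idx)

a = np.random.default_rng(1).standard_normal(1003)
assert simd_argmax(a) == int(np.argmax(a))
```

The strict `>` in both the vector loop and the tail, plus taking the minimum index among tied lanes, reproduces NumPy's first-occurrence semantics for duplicated maxima.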

Tier 2: Axis Reductions

  • SIMD for inner-axis reductions
    • Current: Default.Reduction.AMax.cs uses slow NDIterator loops
    • Opportunity: For axis=-1 (the inner axis) on C-contiguous arrays, the reduction dimension is contiguous in memory
    • // Current (slow):
      for each outer_coord:
          iter = arr[outer_coord, :].AsIterator()
          while iter.HasNext(): val = iter.MoveNext()  // Virtual calls!
      
      // With SIMD:
      for each outer_coord:
          T* rowPtr = basePtr + outer_coord * outerStride
          result[outer_coord] = SimdReduce(rowPtr, innerSize)  // Vector256 loop
    • Expected: 3-5× speedup for large axis reductions
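The inner-axis scheme above can be modeled in NumPy: for a C-ordered matrix each row is a contiguous slice of the flat buffer, so a per-row element-wise SIMD reduce replaces the NDIterator. Here `simd_max` stands in for the hypothetical SimdReduce kernel:

```python
import numpy as np

# Per-row lane-wise max with a horizontal finish and scalar tail,
# modeling the element-wise SIMD kernel run on one contiguous row.
def simd_max(row, lanes=8):
    n = len(row) - len(row) % lanes
    acc = np.full(lanes, -np.inf)
    for i in range(0, n, lanes):
        acc = np.maximum(acc, row[i:i + lanes])   # lane-wise max
    result = acc.max()                            # horizontal max
    return max(result, row[n:].max(initial=result))

# Model of the proposed axis=-1 path: slice the flat C-order buffer
# row by row (rowPtr = basePtr + outer_coord * outerStride).
def max_axis1(m):
    flat = m.ravel()
    outer, inner = m.shape                        # inner stride is 1
    return np.array([simd_max(flat[r * inner:(r + 1) * inner])
                     for r in range(outer)])

m = np.random.default_rng(2).standard_normal((7, 129))
assert np.allclose(max_axis1(m), m.max(axis=1))
```

For other axes the same path applies after permuting the reduced axis to the last position, provided the resulting view is still contiguous (or is materialized first).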

Files to Modify

| File | Changes |
| --- | --- |
| ILKernelGenerator.cs | Add Prod/ArgMax/ArgMin to element-wise SIMD paths |
| ReductionKernel.cs | Add AxisReductionKernel delegate if needed |
| Default.Reduction.*.cs | Refactor axis reductions to use IL kernels |
| DefaultEngine.ReductionOp.cs | Wire up new kernels |

Benchmarks to Add

[Benchmark] public double Prod_10M() => np.prod(_array);
[Benchmark] public int ArgMax_10M() => np.argmax(_array);
[Benchmark] public NDArray Sum_Axis0() => np.sum(_matrix, axis: 0);
[Benchmark] public NDArray Max_Axis1() => np.amax(_matrix, axis: 1);

Success Criteria

  1. Prod element-wise: ≥1.5× faster than current
  2. ArgMax/ArgMin element-wise: ≥1.5× faster than current
  3. Inner-axis reductions: ≥2× faster than current
  4. All existing reduction tests pass

Metadata

Labels

core (Internal engine: Shape, Storage, TensorEngine, iterators), enhancement (New feature or request), performance (Performance improvements or optimizations)
