
[SIMD] Reductions: Add SIMD for Prod, ArgMax/ArgMin, and Axis Reductions #576

@Nucs

Description


Overview

Extend the IL kernel generator's SIMD reduction support beyond the current Sum/Max/Min element-wise operations.

Parent issue: #545

Current State

| Reduction | Element-wise (axis=null) | Axis Reduction |
| --- | --- | --- |
| Sum | ✅ SIMD | ❌ NDIterator |
| Max | ✅ SIMD | ❌ NDIterator |
| Min | ✅ SIMD | ❌ NDIterator |
| Prod | ❌ Scalar | ❌ NDIterator |
| ArgMax | ❌ Scalar | ❌ NDIterator |
| ArgMin | ❌ Scalar | ❌ NDIterator |

Existing SIMD code: ILKernelGenerator.cs:3996-4095

Task List

Tier 1: Element-wise Improvements

  • SIMD Prod (Product reduction)

    • Challenge: No Vector256.HorizontalMultiply() in .NET
    • Solution: Lane-wise multiply accumulator, horizontal product at end
    • Vector256<T> prodVec = Vector256<T>.One;
      for (i = 0; i <= vectorEnd; i += vectorCount)
          prodVec *= Vector256.Load(input + i);
      // horizontal product across lanes, then multiply in the scalar tail
      result = prodVec[0] * prodVec[1] * ... * prodVec[N-1];
    • Expected: 1.5-2× speedup
  • SIMD ArgMax/ArgMin

    • Challenge: Need to track indices alongside values
    • Solution: Dual-accumulator (value + index vectors)
    • Vector256<T> maxVec = Vector256.Create(T.MinValue);
      Vector256<int> idxVec = Vector256.Create(0, 1, 2, ...);
      for (i = 0; i <= vectorEnd; i += vectorCount) {
          var vec = Vector256.Load(input + i);
          var curIdx = Vector256.Create(i) + laneOffsets;  // absolute indices
          var mask = Vector256.GreaterThan(vec, maxVec);
          maxVec = Vector256.ConditionalSelect(mask, vec, maxVec);
          idxVec = Vector256.ConditionalSelect(mask.AsInt32(), curIdx, idxVec);
      }
      // finish: horizontal max over maxVec, then the smallest idxVec lane
      // holding that max (preserves first-occurrence semantics)
    • Expected: 2× speedup
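The lane-wise product strategy above can be modeled in NumPy (matching the `np.prod` the benchmarks call) to check it against a plain product; `simd_prod` and `lanes` are hypothetical stand-ins for the Vector256 kernel, not existing NumSharp code:

```python
import numpy as np

# Model of the proposed Prod kernel: an 8-lane product accumulator
# (mirroring Vector256<double>), a horizontal product across lanes,
# then a scalar tail for the leftover elements.
def simd_prod(a, lanes=8):
    acc = np.ones(lanes, dtype=a.dtype)        # prodVec = Vector256<T>.One
    vector_end = len(a) - len(a) % lanes
    for i in range(0, vector_end, lanes):      # lane-wise multiply-accumulate
        acc *= a[i:i + lanes]
    result = acc.prod()                        # horizontal product
    for x in a[vector_end:]:                   # scalar tail
        result *= x
    return result

a = np.random.default_rng(0).uniform(0.9, 1.1, 1005)
assert np.isclose(simd_prod(a), np.prod(a))
```

Note that the lane-wise accumulation reorders the multiplications, so floating-point results can differ from the sequential product in the last bits; the existing SIMD Sum path has the same property.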
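The dual-accumulator ArgMax can likewise be sanity-checked in NumPy, including the tie-breaking in the horizontal step; `simd_argmax` is a hypothetical model of the proposed kernel:

```python
import numpy as np

# Model of the dual-accumulator ArgMax: per-lane running max plus the
# absolute index where each lane last improved, resolved by a horizontal
# pass (smallest index among lanes holding the global max) plus a tail.
def simd_argmax(a, lanes=8):
    n = len(a) - len(a) % lanes
    max_vec = np.full(lanes, -np.inf)            # maxVec = Create(T.MinValue)
    idx_vec = np.arange(lanes)                   # idxVec = Create(0,1,...,7)
    for i in range(0, n, lanes):
        vec = a[i:i + lanes]
        cur_idx = i + np.arange(lanes)           # absolute element indices
        mask = vec > max_vec                     # Vector256.GreaterThan
        max_vec = np.where(mask, vec, max_vec)   # ConditionalSelect
        idx_vec = np.where(mask, cur_idx, idx_vec)
    best = max_vec.max()                         # horizontal max
    best_idx = idx_vec[max_vec == best].min()    # first occurrence wins
    for i in range(n, len(a)):                   # scalar tail (strict >)
        if a[i] > best:
            best, best_idx = a[i], i
    return int(best_idx)

a = np.random.default_rng(1).standard_normal(1003)
assert simd_argmax(a) == int(np.argmax(a))
```

The strict `>` in both the vector loop and the tail, plus taking the minimum index among tied lanes, reproduces NumPy's first-occurrence semantics for duplicated maxima.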

Tier 2: Axis Reductions

  • SIMD for inner-axis reductions
    • Current: Default.Reduction.AMax.cs uses slow NDIterator loops
    • Opportunity: For axis=-1 (the inner axis) on C-contiguous arrays, the reduction dimension is contiguous in memory
    • // Current (slow):
      for each outer_coord:
          iter = arr[outer_coord, :].AsIterator()
          while iter.HasNext(): val = iter.MoveNext()  // Virtual calls!
      
      // With SIMD:
      for each outer_coord:
          T* rowPtr = basePtr + outer_coord * outerStride
          result[outer_coord] = SimdReduce(rowPtr, innerSize)  // Vector256 loop
    • Expected: 3-5× speedup for large axis reductions
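The inner-axis scheme above can be modeled in NumPy: for a C-ordered matrix each row is a contiguous slice of the flat buffer, so a per-row element-wise SIMD reduce replaces the NDIterator. Here `simd_max` stands in for the hypothetical SimdReduce kernel:

```python
import numpy as np

# Per-row lane-wise max with a horizontal finish and scalar tail,
# modeling the element-wise SIMD kernel run on one contiguous row.
def simd_max(row, lanes=8):
    n = len(row) - len(row) % lanes
    acc = np.full(lanes, -np.inf)
    for i in range(0, n, lanes):
        acc = np.maximum(acc, row[i:i + lanes])   # lane-wise max
    result = acc.max()                            # horizontal max
    return max(result, row[n:].max(initial=result))

# Model of the proposed axis=-1 path: slice the flat C-order buffer
# row by row (rowPtr = basePtr + outer_coord * outerStride).
def max_axis1(m):
    flat = m.ravel()
    outer, inner = m.shape                        # inner stride is 1
    return np.array([simd_max(flat[r * inner:(r + 1) * inner])
                     for r in range(outer)])

m = np.random.default_rng(2).standard_normal((7, 129))
assert np.allclose(max_axis1(m), m.max(axis=1))
```

For other axes the same path applies after permuting the reduced axis to the last position, provided the resulting view is still contiguous (or is materialized first).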

Files to Modify

| File | Changes |
| --- | --- |
| ILKernelGenerator.cs | Add Prod/ArgMax/ArgMin to element-wise SIMD paths |
| ReductionKernel.cs | Add AxisReductionKernel delegate if needed |
| Default.Reduction.*.cs | Refactor axis reductions to use IL kernels |
| DefaultEngine.ReductionOp.cs | Wire up new kernels |

Benchmarks to Add

[Benchmark] public double Prod_10M() => np.prod(_array);
[Benchmark] public int ArgMax_10M() => np.argmax(_array);
[Benchmark] public NDArray Sum_Axis0() => np.sum(_matrix, axis: 0);
[Benchmark] public NDArray Max_Axis1() => np.amax(_matrix, axis: 1);

Success Criteria

  1. Prod element-wise: ≥1.5× faster than current
  2. ArgMax/ArgMin element-wise: ≥1.5× faster than current
  3. Inner-axis reductions: ≥2× faster than current
  4. All existing reduction tests pass

Metadata

Labels

core (Internal engine: Shape, Storage, TensorEngine, iterators), enhancement (New feature or request), performance (Performance improvements or optimizations)
