[SIMD] Unary Ops: Add Floor/Ceil/Round and Transcendental Vectorization #577

@Nucs

Overview

Extend the IL kernel generator's SIMD unary operation support beyond the current Negate/Abs/Sqrt.

Parent issue: #545

Current State

| Operation | SIMD Status | Implementation |
|---|---|---|
| Negate | ✅ SIMD | Vector256.op_UnaryNegation |
| Abs | ✅ SIMD | Vector256.Abs() |
| Sqrt | ✅ SIMD | Vector256.Sqrt() |
| Floor | ❌ Scalar | Math.Floor per-element |
| Ceil | ❌ Scalar | Math.Ceiling per-element |
| Round | ❌ Scalar | Math.Round per-element |
| Exp | ❌ Scalar | Math.Exp per-element |
| Log | ❌ Scalar | Math.Log per-element |
| Sin/Cos/Tan | ❌ Scalar | Math.Sin/Cos/Tan per-element |

SIMD eligibility check: ILKernelGenerator.cs:2086
Vector operation dispatch: ILKernelGenerator.cs:2980-3014

Task List

Tier 1: Quick Wins (Vector256 methods exist in .NET)

  • SIMD Floor

    • .NET has Vector256.Floor()
    • Add to CanUseUnarySimd() eligibility
    • Add to EmitUnaryVectorOperation() dispatch
    • Expected: 2× speedup
  • SIMD Ceiling

    • .NET has Vector256.Ceiling()
    • Same implementation pattern as Floor
    • Expected: 2× speedup
  • SIMD Truncate

    • .NET has Vector256.Truncate() (appears to require .NET 9+; verify on the target framework)
    • Same implementation pattern
    • Expected: 2× speedup
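
As a rough illustration of what the Tier 1 SIMD path computes, the sketch below processes a buffer in Vector256<double>-sized chunks and finishes with a scalar tail. `FloorLoopSketch` and its method are hypothetical names used only for this example, not part of ILKernelGenerator; the generated IL would presumably have a similar shape. It assumes .NET 7 or later, where Vector256.Floor, LoadUnsafe and StoreUnsafe are available.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class FloorLoopSketch
{
    // Hypothetical helper: element-wise floor of src into dst.
    public static void Floor(ReadOnlySpan<double> src, Span<double> dst)
    {
        if (dst.Length < src.Length)
            throw new ArgumentException("Destination is too short.", nameof(dst));

        ref double s = ref MemoryMarshal.GetReference(src);
        ref double d = ref MemoryMarshal.GetReference(dst);

        int width = Vector256<double>.Count; // 4 doubles per 256-bit vector
        int i = 0;

        // Vector body: one Vector256.Floor call per 4 elements.
        for (; i <= src.Length - width; i += width)
        {
            Vector256<double> v = Vector256.LoadUnsafe(ref s, (nuint)i);
            Vector256.Floor(v).StoreUnsafe(ref d, (nuint)i);
        }

        // Scalar tail for the remaining 0..3 elements.
        for (; i < src.Length; i++)
            dst[i] = Math.Floor(src[i]);
    }
}
```

The same loop shape applies to Ceiling by swapping the two calls; only float and double have Vector256.Floor/Ceiling overloads, so other element types would stay on the scalar path.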

Tier 2: Medium Effort

  • SIMD Round
    • May need Vector256.Round() or composition
    • Check availability on the target framework (Vector256.Round() appears to require .NET 9+)
    • Expected: 1.5-2× speedup
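
If Vector256.Round() is not available on the target framework, one possible composition is the add-and-subtract 2^52 trick. Under the default IEEE 754 rounding mode it rounds halves to even, the same midpoint rule as the parameterless Math.Round. The sketch below covers double only, and `RoundSketch` is a hypothetical name used for illustration; it has not been checked against the existing scalar kernel.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class RoundSketch
{
    // Hypothetical helper: round-half-to-even without Vector256.Round.
    public static Vector256<double> RoundHalfToEven(Vector256<double> x)
    {
        // 2^52: above this magnitude every double is already an integer.
        Vector256<double> magic = Vector256.Create(4503599627370496.0);
        Vector256<double> signBit = Vector256.Create(-0.0);

        Vector256<double> sign = x & signBit;      // keep the sign bits
        Vector256<double> abs = Vector256.Abs(x);  // work on magnitudes

        // Adding 2^52 pushes the fraction out of the mantissa, so the
        // hardware rounds to nearest-even; subtracting restores the scale.
        Vector256<double> rounded = (abs + magic) - magic;

        // Large values, infinities and NaN pass through unchanged.
        Vector256<double> isSmall = Vector256.LessThan(abs, magic);
        Vector256<double> magnitude = Vector256.ConditionalSelect(isSmall, rounded, abs);

        return magnitude | sign;                   // restore the sign
    }
}
```

A float version would use 2^23 (8388608f) as the constant in the same way.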

Tier 3: Transcendentals (Complex)

  • SIMD Exp/Log (research)

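    • No Vector256.Exp() in the .NET 8 BCL; newer releases (.NET 9+) appear to add Vector256.Exp/Log for float and double, so verify against the target framework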
    • Options:
      1. Polynomial approximation (Remez/minimax)
      2. External library (MathNet.Numerics)
      3. P/Invoke to Intel SVML
      4. Wait for .NET Tensor primitives
    • Expected: 2-4× speedup if implemented
  • SIMD Sin/Cos/Tan (research)

    • Same challenge as Exp/Log
    • Range reduction + polynomial approximation
    • More complex due to periodicity

Implementation Details

Floor/Ceil (Tier 1)

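// In CanUseUnarySimd (line ~2086), add:
// Note: Vector256.Floor/Ceiling only have float and double overloads,
// so eligibility should also be limited to those element types.
|| key.Op == UnaryOp.Floor
|| key.Op == UnaryOp.Ceil

// In EmitUnaryVectorOperation (line ~2980), add:
case UnaryOp.Floor:
{
    // Resolves Vector256.Floor(Vector256<float>) or Vector256.Floor(Vector256<double>);
    // GetMethod returns null for any other element type.
    var vectorType = typeof(Vector256<>).MakeGenericType(GetClrType(type));
    var floorMethod = typeof(Vector256).GetMethod("Floor", new[] { vectorType })
        ?? throw new NotSupportedException($"No Vector256.Floor overload for {vectorType}");
    il.EmitCall(OpCodes.Call, floorMethod, null);
    break;
}

case UnaryOp.Ceil:
{
    // Same lookup as Floor, targeting Vector256.Ceiling.
    var vectorType = typeof(Vector256<>).MakeGenericType(GetClrType(type));
    var ceilMethod = typeof(Vector256).GetMethod("Ceiling", new[] { vectorType })
        ?? throw new NotSupportedException($"No Vector256.Ceiling overload for {vectorType}");
    il.EmitCall(OpCodes.Call, ceilMethod, null);
    break;
}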
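
To check the reflection lookup and the emitted call in isolation, a standalone sketch along these lines may help. It builds a tiny DynamicMethod that loads its argument, calls the resolved Vector256 method, and returns. `UnaryEmitSketch` and `Build` are hypothetical names; the real generator emits this call inside a larger loop body.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;
using System.Runtime.Intrinsics;

public static class UnaryEmitSketch
{
    // Hypothetical helper: builds a delegate that calls
    // Vector256.<methodName>(Vector256<double>) through emitted IL.
    public static Func<Vector256<double>, Vector256<double>> Build(string methodName)
    {
        Type vectorType = typeof(Vector256<double>);

        // Same lookup pattern as in the issue: static method by name and
        // exact parameter type. Returns null if no such overload exists.
        MethodInfo method = typeof(Vector256).GetMethod(methodName, new[] { vectorType })
            ?? throw new NotSupportedException($"No Vector256.{methodName} overload for {vectorType}");

        var dm = new DynamicMethod("Unary_" + methodName, vectorType, new[] { vectorType });
        ILGenerator il = dm.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);                 // push the input vector
        il.EmitCall(OpCodes.Call, method, null);  // call Vector256.Floor / Ceiling
        il.Emit(OpCodes.Ret);                     // return the result vector

        return (Func<Vector256<double>, Vector256<double>>)dm.CreateDelegate(
            typeof(Func<Vector256<double>, Vector256<double>>));
    }
}
```

Calling `UnaryEmitSketch.Build("Floor")` or `Build("Ceiling")` yields a delegate that can be compared element by element against Math.Floor and Math.Ceiling.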

Transcendentals (Tier 3) - Research Notes

Exp approximation approach:

// Exp(x) via range reduction + polynomial
// 1. Clamp x to avoid overflow
// 2. n = round(x / ln2), r = x - n*ln2
// 3. exp(r) ≈ polynomial (|r| < ln2/2)
// 4. result = 2^n * exp(r)
public static Vector256<float> Exp(Vector256<float> x)
{
    var ln2 = Vector256.Create(0.693147180559945f);
    var invLn2 = Vector256.Create(1.44269504088896f);
    // ... polynomial coefficients ...
}
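
One way the four steps above could be filled in is sketched below for float. It is an illustration under stated assumptions, not a tuned implementation: it uses plain Taylor coefficients (degree 6) rather than a Remez/minimax fit, splits ln2 into a high and a low part to reduce cancellation error in the range reduction, and builds 2^n by writing the biased exponent bits directly. `ExpSketch` is a hypothetical name, NaN inputs get no special handling, and only .NET 7+ Vector256 APIs are assumed.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class ExpSketch
{
    public static Vector256<float> Exp(Vector256<float> x)
    {
        // 1. Clamp so that 2^n stays a normal float (n in [-126, 127]).
        x = Vector256.Min(Vector256.Max(x, Vector256.Create(-87f)), Vector256.Create(88f));

        // 2. n = round(x / ln2), computed as floor(x * (1/ln2) + 0.5).
        Vector256<float> n = Vector256.Floor(
            x * Vector256.Create(1.44269504088896f) + Vector256.Create(0.5f));

        //    r = x - n*ln2, with ln2 split into high and low parts.
        Vector256<float> r = x - n * Vector256.Create(0.693359375f);
        r = r - n * Vector256.Create(-2.12194440e-4f);

        // 3. exp(r) via a degree-6 Taylor polynomial in Horner form.
        Vector256<float> p = Vector256.Create(1f / 720f);
        p = p * r + Vector256.Create(1f / 120f);
        p = p * r + Vector256.Create(1f / 24f);
        p = p * r + Vector256.Create(1f / 6f);
        p = p * r + Vector256.Create(0.5f);
        p = p * r + Vector256.Create(1f);
        p = p * r + Vector256.Create(1f);

        // 4. 2^n: place (n + 127) into the float exponent field.
        Vector256<int> exponent = Vector256.ConvertToInt32(n) + Vector256.Create(127);
        Vector256<float> pow2n = Vector256.ShiftLeft(exponent, 23).AsSingle();

        return p * pow2n;
    }
}
```

A minimax polynomial would likely reach the same accuracy with fewer terms; the Taylor form is used here only because its coefficients are self-evident.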

Files to Modify

| File | Changes |
|---|---|
| ILKernelGenerator.cs:2086 | Add Floor/Ceil/Round to eligibility |
| ILKernelGenerator.cs:2980 | Add vector operation dispatch |
| SimdKernels.cs (optional) | C# fallback implementations |

Benchmarks

[Benchmark] public NDArray Floor_10M() => np.floor(_array);
[Benchmark] public NDArray Ceil_10M() => np.ceil(_array);
[Benchmark] public NDArray Exp_10M() => np.exp(_array);
[Benchmark] public NDArray Sin_10M() => np.sin(_array);

NumPy Baseline (10M float64)

| Operation | NumPy Time |
|---|---|
| np.floor | ~8 ms |
| np.ceil | ~8 ms |
| np.exp | ~20 ms |
| np.sin | ~50 ms |

Success Criteria

  1. Floor/Ceil SIMD: ≥1.5× faster than current scalar
  2. All existing unary tests pass
  3. No accuracy regression vs scalar implementation

Metadata

Labels: core (Internal engine: Shape, Storage, TensorEngine, iterators), enhancement (New feature or request), performance (Performance improvements or optimizations)
