Implementing the ParallelIterator trait for Windows might be quite helpful for accelerating use cases such as convolution and correlation over large ndarrays. Please let me know if you have already considered this and ran into issues implementing it.
Here is an example:
let a = arr
    .windows((3, 3))
    .into_par_iter()
    .map(|w| (&w * &kernel).sum())
    .collect::<Array1<f32>>();