Conversation
Could these things maybe live in https://github.com/anicusan/AcceleratedKernels.jl in the future?
There is a dependency ordering issue: GPUArrays is the common infrastructure, and this would be the fallback for a common implementation. So GPUArrays would need to take a dependency on something like AcceleratedKernels.jl.
Of course JLArrays doesn't work: it uses the CPU backend, and this kernel isn't valid there.
I was considering it as "leave it to AcceleratedKernels" to implement these. Well, it's a very young package, but I was wondering if it could be a path towards the future ;) |
Just to write down my current understanding of the JLArray issue: the kernel is not valid for the CPU backend in KA right now due to the synchronization it uses. GPU execution on all vendors should still work, and Arrays should have their own implementation somewhere else. It's just that the JLArray tests will fail for a bit here.
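For what it's worth, a minimal sketch of the kind of pattern at play, assuming the limitation is the CPU backend's requirement that `@synchronize` only appear at the top level of a kernel. The kernel name and arguments below are hypothetical, not the PR's actual code:

```julia
using KernelAbstractions

# Illustrative sketch only: a tree reduction keeps the barrier inside a loop,
# which GPU backends accept but KernelAbstractions' CPU backend (what JLArray
# uses) rejects, since it only supports @synchronize at the kernel's top level.
@kernel function tree_reduce!(op, result, A, ::Val{items}) where {items}
    i = @index(Local, Linear)
    shared = @localmem eltype(A) items
    shared[i] = A[i]                      # assumes length(A) == items
    d = 1
    while d < items
        @synchronize()                    # barrier inside control flow
        index = 2 * d * (i - 1) + 1
        if index + d <= items
            shared[index] = op(shared[index], shared[index + d])
        end
        d *= 2
    end
    if i == 1
        result[1] = shared[1]             # group leader writes the result
    end
end
```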
src/host/mapreduce.jl
Outdated
```julia
# reduce_items = launch_configuration(kernel)
reduce_items = 512
```
Suggested change:
```diff
-# reduce_items = launch_configuration(kernel)
-reduce_items = 512
+# reduce_items = compute_items(launch_configuration(kernel))
+reduce_items = compute_items(512)
```
But it also has to become dynamic, of course.
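For illustration, a hedged sketch of what "dynamic" could look like here, loosely following the CUDA.jl mapreduce heuristic; `compute_items` and its arguments are placeholders, and the upper bound would eventually come from an occupancy query rather than the hardcoded 512:

```julia
# Hypothetical helper: clamp the number of items per group to the size of the
# reduction domain, rounded to a power of two (powers of two cooperate best in
# a tree reduction). `max_items` stands in for an occupancy-derived limit.
function compute_items(max_items, reduce_length)
    wanted = nextpow(2, reduce_length)
    return wanted > max_items ? prevpow(2, max_items) : wanted
end

# e.g., while KernelAbstractions has no occupancy query to replace the constant:
# reduce_items = compute_items(512, length(Rreduce))
```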
src/host/mapreduce.jl
Outdated
```julia
# we need multiple steps to cover all values to reduce
partial = similar(R, (size(R)..., reduce_groups))
if init === nothing
    # without an explicit initializer we need to copy from the output container
    partial .= R
end
reduce_kernel(f, op, init, Val(items), Rreduce, Rother, partial, A; ndrange)

GPUArrays.mapreducedim!(identity, op, R′, partial; init=init)
```
This may be a good time to add support for grid stride loops to KA.jl and handle this with a single kernel launch + atomic writes to global memory?
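To make the suggestion concrete, here is a rough sketch of a single-launch reduction using a grid-stride-style loop plus atomics. It is specialized to summation, since generalizing the atomic update to an arbitrary `op` would need a CAS loop, and `sum_atomic_kernel!`/`nitems` are made-up names for illustration:

```julia
using KernelAbstractions
using Atomix: @atomic

# Sketch only: each work-item strides over the input, accumulates a private
# partial sum, and combines it into global memory with a single atomic add.
@kernel function sum_atomic_kernel!(result, A, nitems)
    i = @index(Global, Linear)
    acc = zero(eltype(result))
    j = i
    while j <= length(A)          # grid-stride-style loop over the input
        acc += A[j]
        j += nitems
    end
    @atomic result[1] += acc
end

# Usage sketch: launch a fixed number of work-items regardless of input size.
# backend = get_backend(A); nitems = 4096
# sum_atomic_kernel!(backend)(result, A, nitems; ndrange = nitems)
```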
force-pushed from 7348bba to f418d7a
force-pushed from 4974a5e to 2314e24
If we continue this, see JuliaGPU/CUDA.jl#2778.
I thought the idea was to move towards depending on AK.jl for these kernels?
Ideally it would be. I pushed the suggested changes so it would be easier to benchmark the existing implementations against the equivalent KA port (for Metal at least; CUDA has some differences, as previously discussed).
Ported from oneAPI.jl