Shared Memory in CUDA using Julia and CUDAnative
This series continues to outline what I am learning while building a machine learning algorithm (Neuroevolution) in Julia with CUDAnative.
As mentioned in the title, I am using CUDAnative to program on GPUs.
My setup is a late-2016 MacBook Pro with a GeForce GTX 1080 Ti. How I connect a GeForce GTX 1080 Ti to a MacBook Pro may be worth an article of its own.
Shared memory is typically used to share data between the threads of a block, in combination with a synchronization barrier. It also has the added benefit of being much faster than global memory. However, only a few dozen kilobytes of shared memory are available to all the threads in a block, so one cannot simply dump everything into shared memory just to get the speed increase.
using CUDAdrv, CUDAnative

function kernel(x)
    i = threadIdx().x
    # A single Int64 in shared memory, visible to every thread in the block.
    shared = @cuStaticSharedMem(Int64, 1)
    # Only the first thread writes the value.
    if i == 1
        shared[1] = 255
    end
    # Barrier: ensure the write has happened before any thread reads it.
    sync_threads()
    x[i] = shared[1]
    return nothing
end
d_x = CuArray{Int64,1}(10)   # uninitialized device array of 10 Int64s
@cuda (1, 10) kernel(d_x)    # launch 1 block of 10 threads
x = Array(d_x)               # copy the result back to the host
println(x)
OUTPUT: [255, 255, 255, 255, 255, 255, 255, 255, 255, 255]
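Shared memory really pays off when threads cooperate on intermediate results. Below is a minimal sketch of a block-level sum reduction built from the same pieces used above (@cuStaticSharedMem, sync_threads, and the @cuda launch). The kernel name reduce_kernel, the hard-coded block size of 16 (needed because @cuStaticSharedMem wants a compile-time size), and the array names d_in/d_out are my own choices for illustration.

using CUDAdrv, CUDAnative

function reduce_kernel(x, out)
    i = threadIdx().x
    # Stage one element per thread into shared memory.
    shared = @cuStaticSharedMem(Int64, 16)
    shared[i] = x[i]
    sync_threads()
    # Tree reduction: halve the number of active threads at each step.
    s = blockDim().x ÷ 2
    while s > 0
        if i <= s
            shared[i] += shared[i + s]
        end
        # Barrier so partial sums are visible before the next step reads them.
        sync_threads()
        s ÷= 2
    end
    # The first thread writes the block's total back to global memory.
    if i == 1
        out[1] = shared[1]
    end
    return nothing
end

d_in = CuArray(collect(Int64, 1:16))   # upload 1..16 to the device
d_out = CuArray{Int64,1}(1)
@cuda (1, 16) reduce_kernel(d_in, d_out)
println(Array(d_out))

If everything works, this should print [136], the sum of 1 through 16. Note that sync_threads() sits outside the if so that every thread reaches the barrier, and the loop runs the same number of times on every thread. The same pattern extends to multiple blocks by having each block write its partial sum to out[blockIdx().x].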