Shared Memory in CUDA using Julia and CUDAnative

This series continues to outline what I am learning while building a machine learning algorithm (Neuroevolution) in Julia with CUDAnative.

As the title suggests, I am using CUDAnative to program GPUs from Julia.

My setup is a late-2016 MacBook Pro with an external GeForce GTX 1080 Ti. How I connect a GeForce GTX 1080 Ti to a MacBook Pro may be worth another article.

Shared memory is typically used to share data and coordinate across the GPU threads of a block. It also has the added benefit of being much faster than global memory. However, only a few dozen kilobytes of shared memory are available to all the threads in a block, so one cannot simply dump everything into shared memory to get the speed increase.
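To see the actual limit on your card, you can query the device. Here is a minimal sketch, assuming CUDAdrv exposes the per-block shared memory limit under the attribute name MAX_SHARED_MEMORY_PER_BLOCK (the exact spelling has shifted between versions):

using CUDAdrv

dev = CuDevice(0)
# Per-block shared memory limit in bytes; 49152 (48 KB) on a GTX 1080 Ti.
println(attribute(dev, CUDAdrv.MAX_SHARED_MEMORY_PER_BLOCK))

With that budget in mind, here is a small kernel that demonstrates the mechanics: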

using CUDAdrv, CUDAnative

function kernel(x)
    i = threadIdx().x
    # Statically allocate a single Int64 in shared memory,
    # visible to every thread in the block.
    shared = @cuStaticSharedMem(Int64, 1)
    # Only the first thread writes the value.
    if i == 1
        shared[1] = 255
    end
    # Barrier: wait until the write is visible to all threads in the block.
    sync_threads()
    # Every thread reads the same shared value.
    x[i] = shared[1]
    return nothing
end

d_x = CuArray{Int64,1}(10)   # uninitialized device array of 10 Int64s
@cuda (1, 10) kernel(d_x)    # launch 1 block of 10 threads
x = Array(d_x)               # copy the result back to the host
println(x)

OUTPUT: [255, 255, 255, 255, 255, 255, 255, 255, 255, 255]
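The kernel above only shows the mechanics. A more typical use of shared memory is a block-level reduction, where threads stage data in fast shared memory and cooperate through repeated barriers. Below is a minimal sketch using the same CUDAnative API as above; the kernel name and sizes (reduce_kernel, 256 threads) are my own choices for illustration, not something from this series' algorithm.

using CUDAdrv, CUDAnative

function reduce_kernel(x, out)
    i = threadIdx().x
    shared = @cuStaticSharedMem(Int64, 256)   # one slot per thread
    shared[i] = x[i]                          # stage data in shared memory
    sync_threads()
    # Tree reduction: halve the number of active threads each step.
    s = blockDim().x ÷ 2
    while s > 0
        if i <= s
            shared[i] += shared[i + s]
        end
        sync_threads()
        s ÷= 2
    end
    # The first thread writes the block's total back to global memory.
    if i == 1
        out[1] = shared[1]
    end
    return nothing
end

d_x = CuArray(collect(Int64, 1:256))
d_out = CuArray{Int64,1}(1)
@cuda (1, 256) reduce_kernel(d_x, d_out)
println(Array(d_out))   # prints [32896], i.e. sum(1:256)

Each pass of the loop reads values written by other threads in the previous pass, which is exactly why the sync_threads() barrier sits inside the loop: without it, a thread could read a slot before its partner has finished writing it.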