Shared Memory in CUDA using Julia and CUDAnative

This series continues to outline what I am learning from building a machine learning algorithm (Neuroevolution) in Julia and CUDAnative.

As mentioned in the title, I am using CUDAnative to program on GPUs.

My setup is a late-2016 MacBook Pro with a GeForce GTX 1080 Ti. If you want to know how I connect a GeForce GTX 1080 Ti to a MacBook Pro, that may be worth another article.

Shared memory is typically used to communicate and synchronize across GPU threads within a block. It also has the added benefit of being much faster than global memory. However, there are only a few dozen KB of shared memory available to all threads in a block, so you cannot simply dump everything into shared memory to get the speed increase.

using CUDAdrv, CUDAnative

function kernel(x)
    i = threadIdx().x
    # Statically allocate a single Int64 in shared memory, visible to every thread in the block
    shared = @cuStaticSharedMem(Int64, 1)
    # Only thread 1 writes; the barrier below ensures all threads then see the value
    if i == 1
        shared[1] = 255
    end
    sync_threads()
    x[i] = shared[1]
    return nothing
end

d_x = CuArray{Int64,1}(10)   # uninitialized device array of 10 Int64s
@cuda (1, 10) kernel(d_x)    # launch 1 block of 10 threads
x = Array(d_x)
println(x)

OUTPUT: [255, 255, 255, 255, 255, 255, 255, 255, 255, 255]
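@cuStaticSharedMem needs the size to be known at compile time. When it is not, CUDAnative also provides @cuDynamicSharedMem, whose size is instead supplied at launch time as a third argument to @cuda (in bytes). Here is a minimal, untested sketch along the same lines as the kernel above; the kernel name and the reverse-copy trick are mine, purely for illustration:

```julia
using CUDAdrv, CUDAnative

function kernel_dyn(x)
    i = threadIdx().x
    n = blockDim().x
    # The size of this buffer comes from the launch configuration, not the source
    shared = @cuDynamicSharedMem(Int64, n)
    shared[i] = x[i]
    sync_threads()
    # Each thread reads the slot written by its mirror thread,
    # reversing the array through shared memory
    x[i] = shared[n - i + 1]
    return nothing
end

d_x = CuArray(collect(Int64, 1:10))
# Third launch argument: bytes of dynamic shared memory to reserve
@cuda (1, 10, 10 * sizeof(Int64)) kernel_dyn(d_x)
println(Array(d_x))
```

I have not benchmarked this; it only illustrates the API difference between the static and dynamic variants.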

Iterative Quicksort in Julia that also works on CUDA

This series outlines what I am learning from building a machine learning algorithm (Neuroevolution) in Julia and attempting to make it run on GPUs (CUDA) using CUDAnative.

Recursion is among the things that you lose when programming on GPUs. Hence, if you need quicksort, you need an iterative version of it.

Why would you want to program on GPUs? Currently my Neuroevolution runs many times faster on a GPU than on a CPU. My CPU is a 2.7 GHz Intel Core i7 on a late-2016 MacBook Pro. But, my GPU is a GeForce GTX 1080 Ti, which I initially bought for gaming.

Here I adapted an iterative quicksort by Darel Rex Finley to Julia.

Please note that the algorithm bails out and returns false if the stack index i ever reaches 1,000, which is the capacity of the begins and ends arrays.

This version runs in Julia on CPUs, but to get it to run on GPUs (CUDA) using CUDAnative, you must also remove all the return statements.

Please also note that this version of quicksort is not optimal on GPUs unless you are running multiple quicksorts concurrently to saturate the GPU.

function qsort!(xs::Array{Int64,1}, n::Int64)
    i = 1
    # An explicit stack of subarray bounds replaces the recursion
    begins = Array{Int64}(1000)
    ends = Array{Int64}(1000)
    begins[1] = 1
    ends[1] = n + 1
    while i >= 1
        left = begins[i]
        right = ends[i] - 1
        if left < right
            pivot = xs[left]
            # Bail out if the explicit stack is full
            if i == 1000
                return false
            end
            # Partition the subarray in place around the pivot, closing in from both ends
            while left < right
                while xs[right] >= pivot && left < right
                    right -= 1
                end
                if left < right
                    xs[left] = xs[right]
                    left += 1
                end
                while xs[left] <= pivot && left < right
                    left += 1
                end
                if left < right
                    xs[right] = xs[left]
                    right -= 1
                end
            end
            xs[left] = pivot
            # Push the right partition onto the stack; the current entry shrinks to the left partition
            begins[i + 1] = left + 1
            ends[i + 1] = ends[i]
            ends[i] = left
            i += 1
        else
            i -= 1
        end
    end
    return true
end

xs = [11, 13, 14, 15, 12, 1233, 32, 1, -3, 442, 45, 55, 5, 9, 3, 132, 4, 111, 2, 994]
len = length(xs)
qsort!(xs, len)
println(xs)

OUTPUT: [-3, 1, 2, 3, 4, 5, 9, 11, 12, 13, 14, 15, 32, 45, 55, 111, 132, 442, 994, 1233]
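To illustrate what the CUDAnative adaptation might look like, here is a hedged, untuned sketch of a device-side version: the early return false becomes a loop condition, the host Arrays become @cuStaticSharedMem buffers, and each block sorts its own column of a matrix so that many sorts run concurrently (as suggested above). The kernel name qsort_kernel! and the column-per-block layout are my own choices, not from the original algorithm:

```julia
using CUDAdrv, CUDAnative

function qsort_kernel!(xs, n)
    b = blockIdx().x    # each block (one thread per block) sorts its own column of xs
    # Per-block stacks in shared memory instead of host Arrays
    begins = @cuStaticSharedMem(Int64, 1000)
    ends = @cuStaticSharedMem(Int64, 1000)
    i = 1
    begins[1] = 1
    ends[1] = n + 1
    # The early `return false` becomes the i < 1000 loop condition
    while i >= 1 && i < 1000
        left = begins[i]
        right = ends[i] - 1
        if left < right
            pivot = xs[left, b]
            while left < right
                while xs[right, b] >= pivot && left < right
                    right -= 1
                end
                if left < right
                    xs[left, b] = xs[right, b]
                    left += 1
                end
                while xs[left, b] <= pivot && left < right
                    left += 1
                end
                if left < right
                    xs[right, b] = xs[left, b]
                    right -= 1
                end
            end
            xs[left, b] = pivot
            begins[i + 1] = left + 1
            ends[i + 1] = ends[i]
            ends[i] = left
            i += 1
        else
            i -= 1
        end
    end
    return nothing
end

# Launch one block per independent sort, e.g. 32 concurrent sorts of length n:
# d_xs = CuArray(rand(Int64(1):Int64(1000), n, 32))
# @cuda (32, 1) qsort_kernel!(d_xs, n)
```

This keeps the exact control flow of the CPU version, so it is easy to check against it; whether shared-memory stacks beat plain local buffers here is something I have not measured.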