Julia provides built-in support for various kinds of parallel processing on one or more machines. This material focuses on some standard approaches that are (mostly) analogous to functionality in Python and R. However there is other functionality available, including the ability to control tasks and sending data between processes in a fine-grained way.
In addition to parallelization, the second to last section discusses some issues related to efficiency with for loops, in particular fused operations. This is not directly related to parallelization but given the focus on loops in this document, it’s useful and interesting to know about.
Finally, the last section discusses offloading computation to the GPU, i.e., massive parallelization on many GPU cores.
Threaded calculations are done in parallel on software threads.
Threads share objects in memory with the parent process, which is useful for avoiding copies but raises the danger of a “race condition”, where different threads modify data that other threads are using and cause errors.
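To make the danger concrete, here is a minimal sketch (not one of the examples used later): many threads increment a shared counter without any synchronization, so some updates get lost, while an atomic counter gives the correct answer.
using Base.Threads

function race_demo(n)
    unsafe = 0                        # shared, unsynchronized counter
    safe = Threads.Atomic{Int}(0)     # atomic counter
    @threads for i in 1:n
        unsafe += 1                   # read-modify-write; updates from different threads can collide
        Threads.atomic_add!(safe, 1)  # safe under concurrent updates
    end
    return (unsafe, safe[])
end

race_demo(1_000_000)   # with multiple threads, the first count is typically less than 1000000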
As with Python and R, Julia uses BLAS, a standard library of basic linear algebra operations (written in Fortran or C), for linear algebra operations. A fast BLAS can greatly speed up linear algebra relative to the default BLAS on a machine. Julia uses a fast, open source, free BLAS library called OpenBLAS. In addition to being fast when used on a single core, the OpenBLAS library is threaded: if your computer has multiple cores and there are free resources, your linear algebra will use multiple cores.
Here’s an example.
using BenchmarkTools
using LinearAlgebra
using Distributions
n = 7000
x = rand(Uniform(0,1), n,n);

println(BLAS.get_num_threads())
4
function chol_xtx(x)
    z = x'*x   ## z is positive definite
    C = cholesky(z)
end

BLAS.set_num_threads(4)
@btime chol = chol_xtx(x);
9.413 s (5 allocations: 747.68 MiB)
BLAS.set_num_threads(1)
@btime chol = chol_xtx(x);
11.551 s (5 allocations: 747.68 MiB)
We see that using four threads is faster than one, but in this case we don’t get a four-fold speedup.
By default, Julia will set the number of threads for linear algebra equal to the number of processors on your machine.
As seen above, you can check the number of threads being used with:
BLAS.get_num_threads()
1
Other ways to control the number of threads used for linear algebra include setting the OMP_NUM_THREADS environment variable in the shell before starting Julia, and calling BLAS.set_num_threads(n).

In Julia, you can directly set up software threads to use for parallel processing.
Here we’ll see some examples of running a for loop in parallel, both acting on a single object and used as a parallel map operation.
Here we can operate on a vector in parallel:
using Base.Threads
n = 50000000;
x = rand(n);

@threads for i in 1:length(x)
    x[i] = exp(x[i]) + sin(x[i]);
end
We could also use threads to carry out a parallel map operation, implemented as a for loop.
n = 1000

function test(n)
    x = rand(Uniform(0,1), n,n)
    z = x'*x
    C = cholesky(z)
    return(C.U[1,1])
end

a = zeros(12)
@threads for i in 1:12
    a[i] = test(n)
end
You can also create (aka ‘spawn’) individual tasks on threads, with the tasks running in parallel.
Let’s see an example (taken from here) of sorting a vector in parallel, by sorting subsets of the vector in separate threads.
import Base.Threads.@spawn
# sort the elements of `v` in place, from indices `lo` to `hi` inclusive
function psort!(v, lo::Int=1, hi::Int=length(v))
    println(current_task(), ' ', lo, ' ', hi)
    if lo >= hi                      # 1 or 0 elements; nothing to do
        return v
    end
    if hi - lo < 100000              # below some cutoff, run in serial
        sort!(view(v, lo:hi), alg = MergeSort)
        return v
    end

    mid = (lo+hi)>>>1                # find the midpoint

    ### Set up parallelization here ###
    ## Sort two halves in parallel, one in current call and one in a new task
    ## in a separate thread:
    half = @spawn psort!(v, lo, mid) # task to sort the lower half
    psort!(v, mid+1, hi)             # sort the upper half in the current call
    wait(half)                       # wait for the lower half to finish

    temp = v[lo:mid]                 # workspace for merging

    i, k, j = 1, lo, mid+1           # merge the two sorted sub-arrays
    @inbounds while k < j <= hi
        if v[j] < temp[i]
            v[k] = v[j]
            j += 1
        else
            v[k] = temp[i]
            i += 1
        end
        k += 1
    end
    @inbounds while k < j
        v[k] = temp[i]
        k += 1
        i += 1
    end
    return v
end
psort! (generic function with 3 methods)
How does this work? Let’s consider an example where we sort a vector of length 250000.
The vector gets split into elements 1:125000 (run in task #1) and 125001:250000 (run in the main call). Then the elements 1:125000 are split into 1:62500 (run in task #2) and 62501:125000 (run in task #1), while the elements 125001:250000 are split into 125001:187500 (run in task #3) and 187501:250000 (run in the main call). No more splitting occurs because vectors of length less than 100000 are run in serial.
Assuming we have at least four threads (including the main process), each of the tasks will run in a separate thread, and all four sorts on the vector subsets will run in parallel.
x = rand(250000);
psort!(x);
Task (runnable) @0x00007f91333bad60 1 250000
Task (runnable) @0x00007f91333bad60 125001 250000
Task (runnable) @0x00007f91333bad60 187501 250000
Task (runnable) @0x00007f91352101a0 1 125000
Task (runnable) @0x00007f91352101a0 62501 125000
Task (runnable) @0x00007f9135210330 125001 187500
Task (runnable) @0x00007f91352104c0 1 62500
We see that the output from current_task()
shows that the task labels correspond with what I stated above.
The number of tasks running in parallel will be at most the number of threads set in the Julia session.
You can see the number of threads available:
Threads.nthreads()
1
You can control the number of threads used for threading in Julia (apart from linear algebra) either by setting the JULIA_NUM_THREADS environment variable in the shell before starting Julia, or by using the -t (or --threads) flag, e.g.: julia -t 4.

Note that we can’t use OMP_NUM_THREADS as the Julia threading is not based on OpenMP.
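For example, if Julia were started with julia -t 4 (an assumed invocation, just for illustration), we could confirm the setup like this:
using Base.Threads
Threads.nthreads()   # 4, the number of Julia threads
Threads.threadid()   # 1, the id of the current (main) thread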
We can use pmap
to run a parallel map operation across multiple Julia processes (on one or more machines). pmap
is good for cases where each task takes a non-negligible amount of time, as there is overhead (latency) in starting the tasks.
Here we’ll carry out multiple computationally expensive calculations in the map.
We need to import packages and create the function on each of the worker processes using @everywhere
.
using Distributed
if nprocs() == 1
addprocs(4)
end
nprocs()
WARNING: using Distributed.@spawn in module Main conflicts with an existing identifier.
5
@everywhere begin
    using Distributions
    using LinearAlgebra
    function test(n)
        x = rand(Uniform(0,1), n,n)
        z = transpose(x)*x
        C = cholesky(z)
        return C.U[2,3]
    end
end
result = pmap(test, repeat([5000],12))
12-element Vector{Float64}:
11.46796239567315
10.98360399939223
11.855528600147663
11.809624824425185
11.271693682820008
11.191565722035467
11.603740118960042
11.540730849990373
11.362238174937845
11.789584816924307
11.194481356957468
11.393640476610342
One can use static allocation (prescheduling) with the batch_size
argument, thereby sending tasks to the workers in batches of that size to reduce latency.
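As a small sketch (reusing the test function from above; the batch size of 4 is arbitrary), prescheduling looks like this:
# Tasks are dispatched to the workers in batches of 4 rather than one at a time.
result = pmap(test, repeat([5000], 12), batch_size = 4)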
One can execute for loops in parallel across multiple worker processes as follows. This is particularly handy for cases where one uses a reduction operator (e.g., the +
here) so that little data needs to be copied back to the main process. (And in this case we don’t copy any data to the workers either.)
Here we’ll sum over a large number of random numbers with chunks done on each of the workers, comparing the time to a basic for loop.
function forfun(n)
    sum = 0.0
    for i in 1:n
        sum += rand(1)[1]
    end
    return(sum)
end

function pforfun(n)
    out = @sync @distributed (+) for i = 1:n
        rand(1)[1]
    end
    return(out)
end
n = 50000000
@time forfun(n);
3.441998 seconds (50.01 M allocations: 2.981 GiB, 17.53% gc time, 0.59% compilation time)
@time pforfun(n);
2.179350 seconds (498.54 k allocations: 33.226 MiB, 24.30% compilation time)
The use of @sync
causes the operation to block until the result is available so we can get the correct timing.
Without a reduction operation, one would generally end up passing a lot of data back to the main process, and this could take a lot of time. For such calculations, one would generally be better off using threaded for loops in order to take advantage of shared memory.
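As an illustrative sketch of that alternative (not code used elsewhere in this document), here is a threaded version of the sum in which each task accumulates into its own slot of a preallocated vector, so nothing is copied between processes and no two tasks write to the same location:
using Base.Threads

function tforfun(n)
    nt = nthreads()
    partial = zeros(nt)
    @threads for t in 1:nt
        s = 0.0
        for i in t:nt:n      # each task handles its own strided chunk of the iterations
            s += rand()
        end
        partial[t] = s
    end
    return sum(partial)
end

@time tforfun(n);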
We’d have to look into how the random number seed is set on each worker to better understand any issues that might arise from parallel random number generation, but I believe that each worker has a different seed (but note that this does not explicitly ensure that the random number streams on the workers are distinct, as is the case if one uses the L’Ecuyer algorithm).
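If one did want to control the seeding explicitly, one option (a sketch; the offset of 1000 is arbitrary, and this does not by itself guarantee statistically distinct streams) is to seed each worker separately:
@everywhere using Random
for p in workers()
    remotecall_fetch(seed -> Random.seed!(seed), p, 1000 + p)   # distinct seed per worker
end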
With multiple workers, particularly on more than one machine, one generally wants to be careful about having to copy large data objects to each worker, as that could make up a substantial portion of the time involved in the computation.
One can explicitly copy a variable to the workers in an @everywhere
block by using Julia’s interpolation syntax:
@everywhere begin
    x = $x       # copy to workers using interpolation syntax
    println(pointer_from_objref(x), ' ', x[1])
end
Ptr{Nothing} @0x00007f91334cc2b0 3.0383289700841587e-6
From worker 2: Ptr{Nothing} @0x00007fb9f003c280 3.0383289700841587e-6
From worker 3: Ptr{Nothing} @0x00007f8345d34280 3.0383289700841587e-6
From worker 4: Ptr{Nothing} @0x00007f7cc4498280 3.0383289700841587e-6
From worker 5: Ptr{Nothing} @0x00007f607f094280 3.0383289700841587e-6
We see based on pointer_from_objref
that each copy of x
is stored at a distinct location in memory, even when processes are on the same machine.
Also note that if one creates a variable within an @everywhere
block, that variable is available to all tasks run on the worker, so it is ‘global’ with respect to those tasks. Note the repeated values in the result here.
@everywhere begin
    x = rand(5)
    function test(i)
        return sum(x)
    end
end

result = pmap(test, 1:12, batch_size = 3)
12-element Vector{Float64}:
0.9939036615065234
1.4005465832334134
1.7529749131894112
3.2801398110921394
0.9939036615065234
1.4005465832334134
1.7529749131894112
3.2801398110921394
0.9939036615065234
1.4005465832334134
1.7529749131894112
3.2801398110921394
If one wants to have multiple processes all work on the same object, without copying it, one can consider using Julia’s SharedArray (one machine) or DArray from the DistributedArrays package (multiple machines) types, which break up arrays into pieces, with different pieces stored locally on different processes.
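Here is a minimal sketch of the single-machine case (the values stored are just for illustration): a SharedArray can be filled in place by the workers without copying the whole array to each of them.
using Distributed
@everywhere using SharedArrays

S = SharedArray{Float64}(12)
@sync @distributed for i in 1:12
    S[i] = myid()   # record which worker handled each element
end
S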
One can use the Distributed.@spawnat
macro to run tasks on processes, in a fashion similar to using Threads.@spawn
. More details can be found here.
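As a quick sketch (the choice of worker 2 and the computations are arbitrary):
using Distributed

fut = @spawnat 2 sum(rand(1_000_000))          # run on worker 2
fetch(fut)                                     # block until the result is available

fut2 = @spawnat :any maximum(rand(1_000_000))  # let Julia pick a worker
fetch(fut2)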
In addition to using processes on one machine, one can use processes across multiple machines. One can either start the processes when you start the main Julia session or you can start them from within the Julia session. In both cases you’ll need to have the ability to ssh to the other machines without entering your password.
To start the processes when starting Julia, create a “machinefile” that lists the names of the machines and the number of worker processes to start on each machine.
Here’s an example machinefile:
arwen.berkeley.edu
arwen.berkeley.edu
gandalf.berkeley.edu
gandalf.berkeley.edu
Note that if you’re using Slurm on a Linux cluster, you could generate that file in the shell from within your Slurm allocation like this:
srun hostname > machines
Then start Julia like this:
julia --machine-file machines
From within Julia, you can add processes like this (first we’ll remove the existing worker processes started using addprocs()
previously):
rmprocs(workers())
addprocs([("arwen", 2), ("gandalf", 2)])
4-element Vector{Int64}:
6
7
8
9
To check on the number of processes:
nprocs()
5
Consider the following vectorized code that you might run in a variety of languages (e.g., Julia, Python, R).
x = tan(x) + 3*sin(x)
If run as vectorized code, this has downsides. First, it will use additional memory (temporary arrays will be created to store tan(x)
, sin(x)
, 3*sin(x)
). Second, multiple for loops will have to get executed when the vectorized code is run, looping over the elements of x
to calculate tan(x)
, sin(x)
, etc. (For example in R or Python/numpy, multiple for loops would get run in the underlying C code.)
Contrast that to running directly as a for loop (e.g., in Julia or in C/C++):
for i in 1:length(x)
    x[i] = tan(x[i]) + 3*sin(x[i])
end
Here temporary arrays don’t need to be allocated and there is only a single for loop.
Combining loops is called ‘fusing’ and is an important optimization that Julia can do. (It’s also a key optimization done by XLA, a compiler used with JAX and Tensorflow.)
Of course you might ask why use vectorized code at all given that Julia will JIT compile the for loop above and run it really quickly. That’s true, but reading and writing vectorized code is easier than writing for loops.
Let’s compare the speed of the following approaches. We’ll put everything into functions as generally recommended when timing Julia code to avoid global variables that incur a performance penalty because their type can change.
First, let’s find the time when directly using a for loop, as a baseline.
n = 50000000
y = Array{Float64}(undef, n);
x = rand(n);
function direct_for_loop_calc(x, y)
    for i in 1:length(x)
        y[i] = exp(x[i]) + sin(x[i])
    end
end
using BenchmarkTools
@benchmark direct_for_loop_calc(x, y)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  646.281 ms … 693.152 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     655.665 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   664.190 ms ±  17.358 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
Notice the lack of additional memory use.
Now let’s try a basic vectorized calculation (for which we need the various periods to get vectorization), without fusing. We’ll reassign the result to the allocated y
vector for comparability to the for loop implementation above.
function basic_vectorized_calc(x, y)
    y .= exp.(x) + 3 * sin.(x)
end
using BenchmarkTools
@benchmark basic_vectorized_calc(x, y)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.369 s … 1.631 s  ┊ GC (min … max):  4.95% … 16.17%
 Time  (median):     1.485 s            ┊ GC (median):    10.83%
 Time  (mean ± σ):   1.493 s ± 121.252 ms  ┊ GC (mean ± σ):  10.78% ± 6.77%
 Memory estimate: 1.49 GiB, allocs estimate: 8.
The original x
array is 400 MB; notice the additional memory allocation and that this takes almost twice as long as the original for loop.
Here’s a fused version of the vectorized calculation, where the @.
causes the loops to be fused.
function fused_vectorized_calc(x, y)
    y .= @. tan(x) + 3 * sin(x)
end
@benchmark fused_vectorized_calc(x, y)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.088 s … 1.144 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.105 s            ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.112 s ± 25.876 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
We see that the time and (lack of) memory allocation are essentially the same as the original basic for loop.
Finally one can achieve the same fusion by having the function just compute scalar quantities and then using the vectorized version of the function (by using scalar_calc.()
instead of scalar_calc()
), which also does the fusion.
function scalar_calc(x)
    return(tan(x) + 3 * sin(x))
end
@benchmark y .= scalar_calc.(x)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.040 s … 1.081 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.065 s            ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.063 s ± 15.025 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 16 bytes, allocs estimate: 1.
We can use CUDA.jl
to offload computations to the GPU. Here we’ll explore matrix multiplication and vectorized calculations. For this, Julia will take care, behind the scenes, of converting our Julia code to code that can run on the GPU.
There are a couple key things to remember about using a GPU: copying data between CPU and GPU memory is slow relative to the computation itself, so transfers should be minimized, and GPUs are generally most efficient when working with 32-bit (single precision) rather than 64-bit floating point numbers. Both points come up in the examples below.
Note that for this section, I’m pasting in the output when running the code separately on a machine with a GPU because this document is generated on a machine without a GPU.
Let’s first consider basic matrix multiplication. In this case since we generate the matrices on the CPU, they are 64-bit.
using BenchmarkTools
using CUDA
using LinearAlgebra
function matmult(x, y)
    z = x * y
    return z
end

n = 7000

x = randn(n, n);
y = randn(n, n);
x_gpu = CuArray(x);
y_gpu = CuArray(y);
## These use 64-bit numbers:
typeof(x)
# Matrix{Float64} (alias for Array{Float64, 2})
typeof(x_gpu)
# CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}
LinearAlgebra.BLAS.set_num_threads(1);
@benchmark z = matmult(x, y)
# BenchmarkTools.Trial: 1 sample with 1 evaluation.
# Single result which took 17.271 s (0.00% GC) to evaluate,
# with a memory estimate of 373.84 MiB, over 2 allocations.
@benchmark CUDA.@sync z_gpu = matmult(x_gpu, y_gpu)
# BenchmarkTools.Trial: 65 samples with 1 evaluation.
# Range (min … max): 53.172 ms … 90.679 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 76.419 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 77.404 ms ± 4.092 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
Clearly the GPU calculation is much faster, taking about 75 milliseconds, compared to 17 seconds on the CPU (albeit using a single thread).
Let’s compare that to the time of copying the data to the GPU:
@benchmark CUDA.@sync tmp = CuArray(x)
# BenchmarkTools.Trial: 59 samples with 1 evaluation.
# Range (min … max): 83.766 ms … 137.849 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 84.684 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 85.696 ms ± 7.011 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
This suggests that the time in copying the data is similar to that for doing the computation.
If we count the time of transferring the data to and from the GPU, that ends up being a substantial part of the time, compared to the 75 ms for simply doing the matrix multiplication.
function matmult_with_transfer(x, y)
    xc = CuArray(x)
    yc = CuArray(y)
    z = xc * yc
    return Array(z)
end
@benchmark CUDA.@sync z = matmult_with_transfer(x, y)
# BenchmarkTools.Trial: 20 samples with 1 evaluation.
# Range (min … max): 251.578 ms … 258.017 ms ┊ GC (min … max): 0.00% … 0.57%
# Time (median): 253.886 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 254.228 ms ± 1.708 ms ┊ GC (mean ± σ): 4.33% ± 6.78%
As a sidenote, we can force use of 64-bit numbers on the GPU (in this case when generating values on the GPU) like this.
x_gpu = CUDA.randn(Float64, n, n);
Finally, let’s consider whether the matrix multiplication is faster using 32-bit numbers.
x = randn(Float32, n, n);
y = randn(Float32, n, n);
x_gpu = CuArray(x);
y_gpu = CuArray(y);
typeof(x_gpu)
@benchmark z = matmult(x, y)
# BenchmarkTools.Trial: 1 sample with 1 evaluation.
# Single result which took 8.671 s (0.00% GC) to evaluate,
@benchmark CUDA.@sync z_gpu = matmult(x_gpu, y_gpu)
# BenchmarkTools.Trial: 91 samples with 1 evaluation.
# Range (min … max): 41.174 ms … 70.491 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 54.420 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 55.363 ms ± 2.912 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
So that’s faster. The CPU implementation is about twice as fast with 32-bit numbers, which makes sense in that it is working with numbers taking up half as much space, but I’m not sure why the GPU implementation does not achieve the same speedup (54 ms. with 32-bit compared to 75 ms. with 64-bit).
Here we’ll consider using the GPU for vectorized calculations, noting that earlier we talked about using .
to vectorize and @. to fuse loops in the context of CPU-based calculations.
# Scalar function to do something.
function myfun(x)
    y = tan(x) + 3 * sin(x)
    return y
end

# Vectorized version that modifies `y` in place.
function myfun_vec(x, y)
    y .= myfun.(x)
    return
end
n = 250000000;
y = Vector{Float64}(undef, n);
x = rand(n);

x_gpu = CuArray(x);
y_gpu = CuArray(y);
@benchmark myfun_vec(x, y) # 3.5 sec.
@benchmark CUDA.@sync myfun_vec(x_gpu, y_gpu) # 6 ms.
Here we have a massive 500x speedup of 6 ms. compared to 3.5 seconds.
Of course, as in the matrix multiplication example above, if you need to copy the data to and from the GPU, that will add substantial time.
Next we’ll explore writing our own GPU kernels. Kernels are functions that encode the core computational operations that are executed in parallel.
In other languages, the basic mode of operation with a GPU when you are writing your own GPU code is to write a kernel using CUDA (basically C) code and then call the kernel in parallel via C, R, or Python code. In Julia, we can write the kernel using Julia syntax (though many operations (particularly non-numerical ones) will not run on the GPU…).
Here’s a basic example in which we’ll do a calculation in place. We run 1000 scalar calculations using 1000 threads.
We use @cuda
to compile and run the kernel.
function my_kernel(x)
    idx = threadIdx().x;   # What thread am I?
    if idx <= length(x)
        x[idx] = tan(x[idx]) + 3*sin(x[idx]);
    end
    return
end
n = 1000;
x_gpu = CUDA.randn(n);
Array(x_gpu)[n]
# -1.5321726f0
@cuda threads=n my_kernel(x_gpu);
Array(x_gpu)[n] # Check the computation was done by checking last element.
# -28.875708f0
There are limits on the number of threads we can use.
n = 2000;
x_gpu = CUDA.randn(n);
@cuda threads=n my_kernel(x_gpu);
# ERROR: Number of threads in x-dimension exceeds device limit (2000 > 1024).
We need to use at least as many threads as computations, and in addition to only being able to use 1024 threads in the x dimension, we can have at most 1024 threads in a block on the A100 GPU we’re using. So we’ll need multiple blocks.
function my_kernel(x)
    i = threadIdx().x;   # What thread am I within the block?
    j = blockIdx().x;    # What block am I in?
    idx = (j-1)*blockDim().x + i;
    if idx <= length(x)
        x[idx] = tan(x[idx]) + 3*sin(x[idx]);
    end
    return
end
n = 2000;
nthreads = 1024;
x_gpu = CUDA.randn(n);
initial = Array(x_gpu)[n]
nblocks = Int(ceil(n/nthreads));

@cuda threads=nthreads blocks=nblocks my_kernel(x_gpu);
(initial, Array(x_gpu)[n])   # Check that calculation was done.
Let’s do a smaller test run in which we can check on the thread and block indexing.
function my_kernel_print(x)
    i = threadIdx().x;   # What thread am I within the block?
    j = blockIdx().x;    # What block am I in?
    idx = (j-1)*blockDim().x + i;
    if idx <= length(x)
        x[idx] = tan(x[idx]) + 3*sin(x[idx]);
        @cuprintln idx, i, j, blockDim().x, blockDim().y;
    end
    return
end
n = 200;
x_gpu = CUDA.randn(n);
nthreads = 100;
nblocks = Int(ceil(n/nthreads));
@cuda threads=nthreads blocks=nblocks my_kernel_print(x_gpu);
When we run this, notice the output seems to be grouped based on warps of 32 threads (apart from the last set since n=200
is not a multiple of 32).
In many cases we’ll have more tasks than the total number of GPU cores. As long as we don’t exceed the maximum size of a block or grid, we can just ask for as many threads as we have tasks and rely on the GPU to manage assigning the tasks to the GPU cores.
We’d want to check that the number/dimension of the block here does not exceed the maximum block size. I didn’t do that, but it ran, so it must have been ok!
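One way to do that check is to query the device limits; here is a sketch, assuming CUDA.jl's attribute interface to the CUDA driver's device attributes:
using CUDA

dev = CUDA.device()
CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)   # 1024 on an A100
CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X)         # max x-dimension of a block
CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_GRID_DIM_X)          # max x-dimension of the grid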
Here we’ll run the computation we ran earlier when we did not write our own kernel and just relied on Julia to offload to the GPU behind the scene.
n = 250000000;
x_gpu = CUDA.randn(n);
nthreads = 1024;
nblocks = Int(ceil(n/nthreads));
Array(x_gpu)[n]

# Run it once to flush out any compilation/transformation time.
y_gpu = CUDA.randn(5);
CUDA.@sync @cuda threads=nthreads blocks=nblocks my_kernel(y_gpu);

CUDA.@time CUDA.@sync @cuda threads=nthreads blocks=nblocks my_kernel(x_gpu);
# 0.002003 seconds (45 CPU allocations: 2.719 KiB)
Array(x_gpu)[n]
The 2.0 ms is reasonably comparable to the 3.7 ms when we just had Julia run the vectorized computation on the GPU (from the last time we ran it). That used 64-bit floats. When I reran the code above using 64-bit floats, the time was 5.2 ms.
We’ll explore two final topics related to efficiently accessing data in memory: first accessing global GPU memory efficiently and second making use of shared GPU memory.
If adjacent threads in a block access adjacent memory locations, a chunk of data can be obtained in a single access to global memory.
We’ll implement element-wise summing of two matrices. Obviously one can just do this directly with CuArray
s in Julia, but if we implement it ourselves, it illustrates that reading a matrix by column is much more efficient than reading by row. Here a thread block either handles part of a column (good) or part of a row (bad). The x-dimension of the blocks in the grid then handles multiple thread blocks within each column (or row; bad) and the y-dimension of the blocks in the grid handles the different columns (or rows; bad).
n = 10000;
X_gpu = CUDA.randn(n,n);
Y_gpu = CUDA.randn(n,n);
out_gpu = CUDA.zeros(n,n);

X_gpu_small = CUDA.randn(5,5);
Y_gpu_small = CUDA.randn(5,5);
out_gpu_small = CUDA.zeros(5,5);
# Good: Adjacent threads process elements in a column.
function kernel_sum_bycol!(X, Y, output)
    row_idx = threadIdx().x + (blockIdx().x - 1)*blockDim().x;
    col_idx = blockIdx().y;

    if row_idx <= size(X, 1) && col_idx <= size(Y, 2)
        output[row_idx, col_idx] = X[row_idx, col_idx] + Y[row_idx, col_idx]
    end
    return nothing
end
nthreads = 1024;
# x dim of grid is number of thread blocks in a column.
# y dim of grid is number of columns.
nblocks = (Int(ceil(n/nthreads)), n);

# Flush out any compilation time.
CUDA.@sync @cuda threads=nthreads blocks=nblocks kernel_sum_bycol!(X_gpu_small, Y_gpu_small, out_gpu_small);

@btime CUDA.@sync @cuda threads=nthreads blocks=nblocks kernel_sum_bycol!(X_gpu, Y_gpu, out_gpu);
# 2.153 ms (47 allocations: 1.30 KiB)
# Bad: Adjacent threads process elements in a row.
function kernel_sum_byrow!(X, Y, output)
    row_idx = blockIdx().y;
    col_idx = threadIdx().x + (blockIdx().x - 1)*blockDim().x;

    if row_idx <= size(X, 1) && col_idx <= size(Y, 2)
        output[row_idx, col_idx] = X[row_idx, col_idx] + Y[row_idx, col_idx]
    end
    return nothing
end
# Flush out any compilation time.
CUDA.@sync @cuda threads=nthreads blocks=nblocks kernel_sum_byrow!(X_gpu_small, Y_gpu_small, out_gpu_small);
@btime CUDA.@sync @cuda threads=nthreads blocks=nblocks kernel_sum_byrow!(X_gpu, Y_gpu, out_gpu);
# 10.500 ms (47 allocations: 1.30 KiB)
One thing we haven’t seen so far is being able to have different threads write to the same memory location (e.g., to a scalar or to an element of an array). One can easily imagine needing to do this to carry out reduction operations (e.g., calculating a sum or a max or min).
The obvious danger is that two threads might write to the memory location at the same time and somehow cause the location not to be properly updated.
Suppose we want to calculate the log-likelihood (or some other loss function) across independent observations. We’d like to do the summation on the GPU to avoid passing all the log-likelihood values from GPU to CPU and then having to do the sum on the CPU.
using BenchmarkTools
using Distributions
n = 100_000_000;   # Formatted for readability.
norm_dist = Normal(0,1)
samples = rand(norm_dist, n);
function loglik_kernel(x, result)
    i = threadIdx().x;   # What thread am I within the block?
    j = blockIdx().x;    # What block am I in?
    idx = (j-1)*blockDim().x + i;
    if idx <= length(x)
        # logpdf(norm_dist, x[idx]) # Doesn't compile.
        CUDA.@atomic result[1] += -0.5*x[idx]^2;   # Experimental, but nicer interface.
        #CUDA.atomic_add!(pointer(result), -0.5*x[idx]^2); # Stable low-level API.
    end
    return
end
nthreads = 1024;
nblocks = Int(ceil(n/nthreads));

samples_gpu = CuArray(samples);
result = CUDA.zeros(typeof(samples[1]), 1);
@btime CUDA.@sync @cuda threads=nthreads blocks=nblocks loglik_kernel(samples_gpu, result);
# 161.352 ms (34 allocations: 944 bytes)

Array(result)[1] - n*log(2*pi)/2   # Adjust for normalizing constant as scalar computation, not on GPU.
@btime sum(logpdf.(norm_dist, samples))
# 1.410 s (5 allocations: 762.94 MiB)
So we got roughly a 9-fold speedup (161 ms. versus 1.41 s.), which is less than we’ve been getting for some of our other comparisons.
I was curious how much time is spent handling the reduction operation (presumably there is some loss in efficiency from having all the threads write to the same memory location). When I changed result
to be a vector of length equal to that of samples
and just assign the individual PDF evaluations to the corresponding elements of result
without the atomic operation, the time was 3 milliseconds (compared to 161 above), so there is a performance degradation from the atomic operation.
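The debugging discussion below refers to a shared-memory version of this reduction. Here is a sketch of what such a kernel could look like (an illustrative reconstruction, not the exact code: it assumes 1024 threads per block, a power of two, and a recent CUDA.jl that provides CuStaticSharedArray): each block first combines its 1024 values in fast shared memory via a tree reduction, and only one atomic update per block touches result.
function loglik_kernel_shared(x, result)
    i = threadIdx().x;                            # thread index within the block
    idx = (blockIdx().x - 1)*blockDim().x + i;    # global index into x
    shared_data = CuStaticSharedArray(Float64, 1024)
    shared_data[i] = idx <= length(x) ? -0.5*x[idx]^2 : 0.0
    sync_threads()
    s = blockDim().x ÷ 2       # integer division; `/` would make `s` a float
    while s >= 1
        if i <= s
            shared_data[i] += shared_data[i + s]
        end
        sync_threads()
        s ÷= 2
    end
    if i == 1
        CUDA.@atomic result[1] += shared_data[1]   # one atomic update per block
    end
    return
end

result2 = CUDA.zeros(typeof(samples[1]), 1);
CUDA.@sync @cuda threads=nthreads blocks=nblocks loglik_kernel_shared(samples_gpu, result2);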
It can be much harder to debug kernel code than regular Julia code. If the syntax doesn’t produce valid compiled code that can run on the GPU, it may not be obvious from the error message what the problem is.
As an example in the code in the previous section, I originally had s = blockDim().x / 2;
and s /= 2;
. I didn’t realize that even with integer inputs, that this produced a float output type for s
and that as a result using s
for indexing in shared_data[i + s];
wouldn’t work. The error message said there was a problem with the LLVM/IR code produced from the kernel, but didn’t say where and it took a binary search on my part to figure out that shared_data[i + s];
was the problematic piece of code and that was caused by s
being a float.
On an only somewhat related point, the ChatBot originally gave me while s >= 0
, which is a bug that doesn’t prevent the code from running, but does give incorrect numerical results, so we still need to be careful with what we get from ChatBots.