Julia provides built-in support for various kinds of parallel processing on one or more machines. This material focuses on some standard approaches that are (mostly) analogous to functionality in Python and R. However there is other functionality available, including the ability to control tasks and sending data between processes in a fine-grained way.
In addition to parallelization, the second to last section discusses some issues related to efficiency with for loops, in particular fused operations. This is not directly related to parallelization but given the focus on loops in this document, it’s useful and interesting to know about.
Finally, the last section discusses offloading computation to the GPU, i.e., massive parallelization on many GPU cores.
Threaded calculations are done in parallel on software threads.
Threads share objects in memory with the parent process, which is useful for avoiding copies but raises the danger of a “race condition”, where different threads modify data that other threads are using and cause errors.
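To make the danger concrete, here is a minimal sketch (not one of the examples used later): many threads increment a shared counter without any synchronization, so some updates get lost, while an atomic counter gives the correct answer.
using Base.Threads

function race_demo(n)
    unsafe = 0                        # shared, unsynchronized counter
    safe = Threads.Atomic{Int}(0)     # atomic counter
    @threads for i in 1:n
        unsafe += 1                   # read-modify-write; updates from different threads can collide
        Threads.atomic_add!(safe, 1)  # safe under concurrent updates
    end
    return (unsafe, safe[])
end

race_demo(1_000_000)   # with multiple threads, the first count is typically less than 1000000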
As with Python and R, Julia uses BLAS, a standard library of basic linear algebra operations (written in Fortran or C), for linear algebra operations. A fast BLAS can greatly speed up linear algebra relative to the default BLAS on a machine. Julia uses a fast, open source, free BLAS library called OpenBLAS. In addition to being fast when used on a single core, the OpenBLAS library is threaded: if your computer has multiple cores and there are free resources, your linear algebra will use multiple cores.
Here’s an example.
using BenchmarkTools
using LinearAlgebra
using Distributions
n = 7000
x = rand(Uniform(0,1), n,n);

println(BLAS.get_num_threads())
4
function chol_xtx(x)
    z = x'*x   ## z is positive definite
    C = cholesky(z)
end

BLAS.set_num_threads(4)
@btime chol = chol_xtx(x);
9.413 s (5 allocations: 747.68 MiB)
BLAS.set_num_threads(1)
@btime chol = chol_xtx(x);
11.551 s (5 allocations: 747.68 MiB)
We see that using four threads is faster than one, but in this case we don’t get a four-fold speedup.
By default, Julia will set the number of threads for linear algebra equal to the number of processors on your machine.
As seen above, you can check the number of threads being used with:
BLAS.get_num_threads()
1
Other ways to control the number of threads used for linear algebra include setting the OMP_NUM_THREADS environment variable in the shell before starting Julia, and calling BLAS.set_num_threads(n).

In Julia, you can directly set up software threads to use for parallel processing.
Here we’ll see some examples of running a for loop in parallel, both acting on a single object and used as a parallel map operation.
Here we can operate on a vector in parallel:
using Base.Threads
n = 50000000;
x = rand(n);

@threads for i in 1:length(x)
    x[i] = exp(x[i]) + sin(x[i]);
end
We could also use threads to carry out a parallel map operation, implemented as a for loop.
n = 1000

function test(n)
    x = rand(Uniform(0,1), n,n)
    z = x'*x
    C = cholesky(z)
    return(C.U[1,1])
end

a = zeros(12)
@threads for i in 1:12
    a[i] = test(n)
end
You can also create (aka ‘spawn’) individual tasks on threads, with the tasks running in parallel.
Let’s see an example (taken from here) of sorting a vector in parallel, by sorting subsets of the vector in separate threads.
import Base.Threads.@spawn
# sort the elements of `v` in place, from indices `lo` to `hi` inclusive
function psort!(v, lo::Int=1, hi::Int=length(v))
    println(current_task(), ' ', lo, ' ', hi)
    if lo >= hi                      # 1 or 0 elements; nothing to do
        return v
    end
    if hi - lo < 100000              # below some cutoff, run in serial
        sort!(view(v, lo:hi), alg = MergeSort)
        return v
    end

    mid = (lo+hi)>>>1                # find the midpoint

    ### Set up parallelization here ###
    ## Sort two halves in parallel, one in current call and one in a new task
    ## in a separate thread:
    half = @spawn psort!(v, lo, mid) # task to sort the lower half
    psort!(v, mid+1, hi)             # sort the upper half in the current call
    wait(half)                       # wait for the lower half to finish

    temp = v[lo:mid]                 # workspace for merging

    i, k, j = 1, lo, mid+1           # merge the two sorted sub-arrays
    @inbounds while k < j <= hi
        if v[j] < temp[i]
            v[k] = v[j]
            j += 1
        else
            v[k] = temp[i]
            i += 1
        end
        k += 1
    end
    @inbounds while k < j
        v[k] = temp[i]
        k += 1
        i += 1
    end
    return v
end
psort! (generic function with 3 methods)
How does this work? Let’s consider an example where we sort a vector of length 250000.
The vector gets split into elements 1:125000 (run in task #1) and 125001:250000 (run in the main call). Then the elements 1:125000 are split into 1:62500 (run in task #2) and 62501:125000 (run in task #1), while the elements 125001:250000 are split into 125001:187500 (run in task #3) and 187501:250000 (run in the main call). No more splitting occurs because vectors of length less than 100000 are run in serial.
Assuming we have at least four threads (including the main process), each of the tasks will run in a separate thread, and all four sorts on the vector subsets will run in parallel.
x = rand(250000);
psort!(x);
Task (runnable) @0x00007f91333bad60 1 250000
Task (runnable) @0x00007f91333bad60 125001 250000
Task (runnable) @0x00007f91333bad60 187501 250000
Task (runnable) @0x00007f91352101a0 1 125000
Task (runnable) @0x00007f91352101a0 62501 125000
Task (runnable) @0x00007f9135210330 125001 187500
Task (runnable) @0x00007f91352104c0 1 62500
We see that the output from current_task()
shows that the task labels correspond with what I stated above.
The number of tasks running in parallel will be at most the number of threads set in the Julia session.
You can see the number of threads available:
Threads.nthreads()
1
You can control the number of threads used for threading in Julia (apart from linear algebra) either by setting the JULIA_NUM_THREADS environment variable in the shell before starting Julia, or by using the -t (or --threads) flag, e.g.: julia -t 4.

Note that we can’t use OMP_NUM_THREADS as the Julia threading is not based on OpenMP.
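For example, if Julia were started with julia -t 4 (an assumed invocation, just for illustration), we could confirm the setup like this:
using Base.Threads
Threads.nthreads()   # 4, the number of Julia threads
Threads.threadid()   # 1, the id of the current (main) thread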
We can use pmap
to run a parallel map operation across multiple Julia processes (on one or more machines). pmap
is good for cases where each task takes a non-negligible amount of time, as there is overhead (latency) in starting the tasks.
Here we’ll carry out multiple computationally expensive calculations in the map.
We need to import packages and create the function on each of the worker processes using @everywhere
.
using Distributed
if nprocs() == 1
addprocs(4)
end
nprocs()
WARNING: using Distributed.@spawn in module Main conflicts with an existing identifier.
5
@everywhere begin
    using Distributions
    using LinearAlgebra
    function test(n)
        x = rand(Uniform(0,1), n,n)
        z = transpose(x)*x
        C = cholesky(z)
        return C.U[2,3]
    end
end
result = pmap(test, repeat([5000],12))
12-element Vector{Float64}:
11.46796239567315
10.98360399939223
11.855528600147663
11.809624824425185
11.271693682820008
11.191565722035467
11.603740118960042
11.540730849990373
11.362238174937845
11.789584816924307
11.194481356957468
11.393640476610342
One can use static allocation (prescheduling) with the batch_size
argument, thereby sending tasks to the workers in batches of that size to reduce latency.
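As a small sketch (reusing the test function from above; the batch size of 4 is arbitrary), prescheduling looks like this:
# Tasks are dispatched to the workers in batches of 4 rather than one at a time.
result = pmap(test, repeat([5000], 12), batch_size = 4)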
One can execute for loops in parallel across multiple worker processes as follows. This is particularly handy for cases where one uses a reduction operator (e.g., the +
here) so that little data needs to be copied back to the main process. (And in this case we don’t copy any data to the workers either.)
Here we’ll sum over a large number of random numbers with chunks done on each of the workers, comparing the time to a basic for loop.
function forfun(n)
    sum = 0.0
    for i in 1:n
        sum += rand(1)[1]
    end
    return(sum)
end

function pforfun(n)
    out = @sync @distributed (+) for i = 1:n
        rand(1)[1]
    end
    return(out)
end
n = 50000000
@time forfun(n);
3.441998 seconds (50.01 M allocations: 2.981 GiB, 17.53% gc time, 0.59% compilation time)
@time pforfun(n);
2.179350 seconds (498.54 k allocations: 33.226 MiB, 24.30% compilation time)
The use of @sync
causes the operation to block until the result is available so we can get the correct timing.
Without a reduction operation, one would generally end up passing a lot of data back to the main process, and this could take a lot of time. For such calculations, one would generally be better off using threaded for loops in order to take advantage of shared memory.
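As an illustrative sketch of that alternative (not code used elsewhere in this document), here is a threaded version of the sum in which each task accumulates into its own slot of a preallocated vector, so nothing is copied between processes and no two tasks write to the same location:
using Base.Threads

function tforfun(n)
    nt = nthreads()
    partial = zeros(nt)
    @threads for t in 1:nt
        s = 0.0
        for i in t:nt:n      # each task handles its own strided chunk of the iterations
            s += rand()
        end
        partial[t] = s
    end
    return sum(partial)
end

@time tforfun(n);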
We’d have to look into how the random number seed is set on each worker to better understand any issues that might arise from parallel random number generation, but I believe that each worker has a different seed (but note that this does not explicitly ensure that the random number streams on the workers are distinct, as is the case if one uses the L’Ecuyer algorithm).
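If one did want to control the seeding explicitly, one option (a sketch; the offset of 1000 is arbitrary, and this does not by itself guarantee statistically distinct streams) is to seed each worker separately:
@everywhere using Random
for p in workers()
    remotecall_fetch(seed -> Random.seed!(seed), p, 1000 + p)   # distinct seed per worker
end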
With multiple workers, particularly on more than one machine, one generally wants to be careful about having to copy large data objects to each worker, as that could make up a substantial portion of the time involved in the computation.
One can explicitly copy a variable to the workers in an @everywhere
block by using Julia’s interpolation syntax:
@everywhere begin
    x = $x       # copy to workers using interpolation syntax
    println(pointer_from_objref(x), ' ', x[1])
end
Ptr{Nothing} @0x00007f91334cc2b0 3.0383289700841587e-6
From worker 2: Ptr{Nothing} @0x00007fb9f003c280 3.0383289700841587e-6
From worker 3: Ptr{Nothing} @0x00007f8345d34280 3.0383289700841587e-6
From worker 4: Ptr{Nothing} @0x00007f7cc4498280 3.0383289700841587e-6
From worker 5: Ptr{Nothing} @0x00007f607f094280 3.0383289700841587e-6
We see based on pointer_from_objref
that each copy of x
is stored at a distinct location in memory, even when processes are on the same machine.
Also note that if one creates a variable within an @everywhere
block, that variable is available to all tasks run on the worker, so it is ‘global’ with respect to those tasks. Note the repeated values in the result here.
@everywhere begin
    x = rand(5)
    function test(i)
        return sum(x)
    end
end

result = pmap(test, 1:12, batch_size = 3)
12-element Vector{Float64}:
0.9939036615065234
1.4005465832334134
1.7529749131894112
3.2801398110921394
0.9939036615065234
1.4005465832334134
1.7529749131894112
3.2801398110921394
0.9939036615065234
1.4005465832334134
1.7529749131894112
3.2801398110921394
If one wants to have multiple processes all work on the same object, without copying it, one can consider using Julia’s SharedArray (one machine) or DArray from the DistributedArrays package (multiple machines) types, which break up arrays into pieces, with different pieces stored locally on different processes.
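Here is a minimal sketch of the single-machine case (the values stored are just for illustration): a SharedArray can be filled in place by the workers without copying the whole array to each of them.
using Distributed
@everywhere using SharedArrays

S = SharedArray{Float64}(12)
@sync @distributed for i in 1:12
    S[i] = myid()   # record which worker handled each element
end
S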
One can use the Distributed.@spawnat
macro to run tasks on processes, in a fashion similar to using Threads.@spawn
. More details can be found here.
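As a quick sketch (the choice of worker 2 and the computations are arbitrary):
using Distributed

fut = @spawnat 2 sum(rand(1_000_000))          # run on worker 2
fetch(fut)                                     # block until the result is available

fut2 = @spawnat :any maximum(rand(1_000_000))  # let Julia pick a worker
fetch(fut2)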
In addition to using processes on one machine, one can use processes across multiple machines. One can either start the processes when you start the main Julia session or you can start them from within the Julia session. In both cases you’ll need to have the ability to ssh to the other machines without entering your password.
To start the processes when starting Julia, create a “machinefile” that lists the names of the machines and the number of worker processes to start on each machine.
Here’s an example machinefile:
arwen.berkeley.edu
arwen.berkeley.edu
gandalf.berkeley.edu
gandalf.berkeley.edu
Note that if you’re using Slurm on a Linux cluster, you could generate that file in the shell from within your Slurm allocation like this:
srun hostname > machines
Then start Julia like this:
julia --machine-file machines
From within Julia, you can add processes like this (first we’ll remove the existing worker processes started using addprocs()
previously):
rmprocs(workers())
addprocs([("arwen", 2), ("gandalf", 2)])
4-element Vector{Int64}:
6
7
8
9
To check on the number of processes:
nprocs()
5
Consider the following vectorized code that you might run in a variety of languages (e.g., Julia, Python, R).
x = tan(x) + 3*sin(x)
If run as vectorized code, this has downsides. First, it will use additional memory (temporary arrays will be created to store tan(x)
, sin(x)
, 3*sin(x)
). Second, multiple for loops will have to get executed when the vectorized code is run, looping over the elements of x
to calculate tan(x)
, sin(x)
, etc. (For example in R or Python/numpy, multiple for loops would get run in the underlying C code.)
Contrast that to running directly as a for loop (e.g., in Julia or in C/C++):
for i in 1:length(x)
    x[i] = tan(x[i]) + 3*sin(x[i])
end
Here temporary arrays don’t need to be allocated and there is only a single for loop.
Combining loops is called ‘fusing’ and is an important optimization that Julia can do. (It’s also a key optimization done by XLA, a compiler used with JAX and Tensorflow.)
Of course you might ask why use vectorized code at all given that Julia will JIT compile the for loop above and run it really quickly. That’s true, but reading and writing vectorized code is easier than writing for loops.
Let’s compare the speed of the following approaches. We’ll put everything into functions as generally recommended when timing Julia code to avoid global variables that incur a performance penalty because their type can change.
First, let’s find the time when directly using a for loop, as a baseline.
n = 50000000
y = Array{Float64}(undef, n);
x = rand(n);
function direct_for_loop_calc(x, y)
    for i in 1:length(x)
        y[i] = exp(x[i]) + sin(x[i])
    end
end
using BenchmarkTools
@benchmark direct_for_loop_calc(x, y)
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  646.281 ms … 693.152 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     655.665 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   664.190 ms ±  17.358 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
Notice the lack of additional memory use.
Now let’s try a basic vectorized calculation (for which we need the various periods to get vectorization), without fusing. We’ll reassign the result to the allocated y
vector for comparability to the for loop implementation above.
function basic_vectorized_calc(x, y)
    y .= exp.(x) + 3 * sin.(x)
end
using BenchmarkTools
@benchmark basic_vectorized_calc(x, y)
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.369 s … 1.631 s  ┊ GC (min … max):  4.95% … 16.17%
 Time  (median):     1.485 s            ┊ GC (median):    10.83%
 Time  (mean ± σ):   1.493 s ± 121.252 ms  ┊ GC (mean ± σ):  10.78% ± 6.77%
 Memory estimate: 1.49 GiB, allocs estimate: 8.
The original x
array is 400 MB; notice the additional memory allocation and that this takes almost twice as long as the original for loop.
Here’s a fused version of the vectorized calculation, where the @.
causes the loops to be fused.
function fused_vectorized_calc(x, y)
    y .= @. tan(x) + 3 * sin(x)
end
@benchmark fused_vectorized_calc(x, y)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.088 s … 1.144 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.105 s            ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.112 s ± 25.876 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
We see that the time and (lack of) memory allocation are essentially the same as the original basic for loop.
Finally one can achieve the same fusion by having the function just compute scalar quantities and then using the vectorized version of the function (by using scalar_calc.()
instead of scalar_calc()
), which also does the fusion.
function scalar_calc(x)
    return(tan(x) + 3 * sin(x))
end
@benchmark y .= scalar_calc.(x)
BenchmarkTools.Trial: 5 samples with 1 evaluation.
 Range (min … max):  1.040 s … 1.081 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.065 s            ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.063 s ± 15.025 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 16 bytes, allocs estimate: 1.
We can use CUDA.jl
to offload computations to the GPU. Here we’ll explore matrix multiplication and vectorized calculations. For this, Julia will take care, behind the scenes, of converting our Julia code to code that can run on the GPU.
There are a couple key things to remember about using a GPU: copying data between CPU and GPU memory is slow relative to the computation itself, so transfers should be minimized, and GPUs are generally most efficient when working with 32-bit (single precision) rather than 64-bit floating point numbers. Both points come up in the examples below.
Note that for this section, I’m pasting in the output when running the code separately on a machine with a GPU because this document is generated on a machine without a GPU.
Let’s first consider basic matrix multiplication. In this case since we generate the matrices on the CPU, they are 64-bit.
using BenchmarkTools
using CUDA
using LinearAlgebra
function matmult(x, y)
    z = x * y
    return z
end

n = 7000

x = randn(n, n);
y = randn(n, n);
x_gpu = CuArray(x);
y_gpu = CuArray(y);
## These use 64-bit numbers:
typeof(x)
# Matrix{Float64} (alias for Array{Float64, 2})
typeof(x_gpu)
# CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}
LinearAlgebra.BLAS.set_num_threads(1);
@benchmark z = matmult(x, y)
# BenchmarkTools.Trial: 1 sample with 1 evaluation.
# Single result which took 17.271 s (0.00% GC) to evaluate,
# with a memory estimate of 373.84 MiB, over 2 allocations.
@benchmark CUDA.@sync z_gpu = matmult(x_gpu, y_gpu)
# BenchmarkTools.Trial: 65 samples with 1 evaluation.
# Range (min … max): 53.172 ms … 90.679 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 76.419 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 77.404 ms ± 4.092 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
Clearly the GPU calculation is much faster, taking about 75 milliseconds, compared to 17 seconds on the CPU (albeit using a single thread).
Let’s compare that to the time of copying the data to the GPU:
@benchmark CUDA.@sync tmp = CuArray(x)
# BenchmarkTools.Trial: 59 samples with 1 evaluation.
# Range (min … max): 83.766 ms … 137.849 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 84.684 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 85.696 ms ± 7.011 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
This suggests that the time in copying the data is similar to that for doing the computation.
If we count the time of transferring the data to and from the GPU, that ends up being a substantial part of the time, compared to the 75 ms for simply doing the matrix multiplication.
function matmult_with_transfer(x, y)
    xc = CuArray(x)
    yc = CuArray(y)
    z = xc * yc
    return Array(z)
end
@benchmark CUDA.@sync z = matmult_with_transfer(x, y)
# BenchmarkTools.Trial: 20 samples with 1 evaluation.
# Range (min … max): 251.578 ms … 258.017 ms ┊ GC (min … max): 0.00% … 0.57%
# Time (median): 253.886 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 254.228 ms ± 1.708 ms ┊ GC (mean ± σ): 4.33% ± 6.78%
As a sidenote, we can force use of 64-bit numbers on the GPU (in this case when generating values on the GPU) like this.
x_gpu = CUDA.randn(Float64, n, n);
Finally, let’s consider whether the matrix multiplication is faster using 32-bit numbers.
x = randn(Float32, n, n);
y = randn(Float32, n, n);
x_gpu = CuArray(x);
y_gpu = CuArray(y);
typeof(x_gpu)
@benchmark z = matmult(x, y)
# BenchmarkTools.Trial: 1 sample with 1 evaluation.
# Single result which took 8.671 s (0.00% GC) to evaluate,
@benchmark CUDA.@sync z_gpu = matmult(x_gpu, y_gpu)
# BenchmarkTools.Trial: 91 samples with 1 evaluation.
# Range (min … max): 41.174 ms … 70.491 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 54.420 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 55.363 ms ± 2.912 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
So that’s faster. The CPU implementation is about twice as fast with 32-bit numbers, which makes sense in that it is working with numbers taking up half as much space, but I’m not sure why the GPU implementation does not achieve the same speedup (54 ms. with 32-bit compared to 75 ms. with 64-bit).
Here we’ll consider using the GPU for vectorized calculations, noting that earlier we talked about using .
to vectorize and @. to fuse loops in the context of CPU-based calculations.
# Scalar function to do something.
function myfun(x)
    y = tan(x) + 3 * sin(x)
    return y
end

# Vectorized version that modifies `y` in place.
function myfun_vec(x, y)
    y .= myfun.(x)
    return
end
n = 250000000;
y = Vector{Float64}(undef, n);
x = rand(n);

x_gpu = CuArray(x);
y_gpu = CuArray(y);
@benchmark myfun_vec(x, y) # 3.5 sec.
@benchmark CUDA.@sync myfun_vec(x_gpu, y_gpu) # 6 ms.
Here we have a massive 500x speedup of 6 ms. compared to 3.5 seconds.
Of course, as in the matrix multiplication example above, if you need to copy the data to and from the GPU, that will add substantial time.
Next we’ll explore writing our own GPU kernels. Kernels are functions that encode the core computational operations that are executed in parallel.
In other languages, the basic mode of operation with a GPU when you are writing your own GPU code is to write a kernel using CUDA (basically C) code and then call the kernel in parallel via C, R, or Python code. In Julia, we can write the kernel using Julia syntax (though many operations (particularly non-numerical ones) will not run on the GPU…).
Here’s a basic example in which we’ll do a calculation in place. We run 1000 scalar calculations using 1000 threads.
We use @cuda
to compile and run the kernel.
function my_kernel(x)
    idx = threadIdx().x;   # What thread am I?
    if idx <= length(x)
        x[idx] = tan(x[idx]) + 3*sin(x[idx]);
    end
    return
end
n = 1000;
x_gpu = CUDA.randn(n);
Array(x_gpu)[n]
# -1.5321726f0
@cuda threads=n my_kernel(x_gpu);
Array(x_gpu)[n] # Check the computation was done by checking last element.
# -28.875708f0
There are limits on the number of threads we can use.
n = 2000;
x_gpu = CUDA.randn(n);
@cuda threads=n my_kernel(x_gpu);
# ERROR: Number of threads in x-dimension exceeds device limit (2000 > 1024).
We need to use at least as many threads as computations, and in addition to only being able to use 1024 threads in the x dimension, we can have at most 1024 threads in a block on the A100 GPU we’re using. So we’ll need multiple blocks.
function my_kernel(x)
    i = threadIdx().x;   # What thread am I within the block?
    j = blockIdx().x;    # What block am I in?
    idx = (j-1)*blockDim().x + i;
    if idx <= length(x)
        x[idx] = tan(x[idx]) + 3*sin(x[idx]);
    end
    return
end
n = 2000;
nthreads = 1024;
x_gpu = CUDA.randn(n);
initial = Array(x_gpu)[n]
nblocks = Int(ceil(n/nthreads));

@cuda threads=nthreads blocks=nblocks my_kernel(x_gpu);
(initial, Array(x_gpu)[n])   # Check that calculation was done.
Let’s do a smaller test run in which we can check on the thread and block indexing.
function my_kernel_print(x)
    i = threadIdx().x;   # What thread am I within the block?
    j = blockIdx().x;    # What block am I in?
    idx = (j-1)*blockDim().x + i;
    if idx <= length(x)
        x[idx] = tan(x[idx]) + 3*sin(x[idx]);
        @cuprintln idx, i, j, blockDim().x, blockDim().y;
    end
    return
end
n = 200;
x_gpu = CUDA.randn(n);
nthreads = 100;
nblocks = Int(ceil(n/nthreads));
@cuda threads=nthreads blocks=nblocks my_kernel_print(x_gpu);
When we run this, notice the output seems to be grouped based on warps of 32 threads (apart from the last set since n=200
is not a multiple of 32).
In many cases we’ll have more tasks than the total number of GPU cores. As long as we don’t exceed the maximum size of a block or grid, we can just ask for as many threads as we have tasks and rely on the GPU to manage assigning the tasks to the GPU cores.
We’d want to check that the number/dimension of the block here does not exceed the maximum block size. I didn’t do that, but it ran, so it must have been ok!
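One way to do that check is to query the device limits; here is a sketch, assuming CUDA.jl's attribute interface to the CUDA driver's device attributes:
using CUDA

dev = CUDA.device()
CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)   # 1024 on an A100
CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X)         # max x-dimension of a block
CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_GRID_DIM_X)          # max x-dimension of the grid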
Here we’ll run the computation we ran earlier when we did not write our own kernel and just relied on Julia to offload to the GPU behind the scene.
n = 250000000;
x_gpu = CUDA.randn(n);
nthreads = 1024;
nblocks = Int(ceil(n/nthreads));
Array(x_gpu)[n]

# Run it once to flush out any compilation/transformation time.
y_gpu = CUDA.randn(5);
CUDA.@sync @cuda threads=nthreads blocks=nblocks my_kernel(y_gpu);

CUDA.@time CUDA.@sync @cuda threads=nthreads blocks=nblocks my_kernel(x_gpu);
# 0.002003 seconds (45 CPU allocations: 2.719 KiB)
Array(x_gpu)[n]
The 2.0 ms is reasonably comparable to the 3.7 ms when we just had Julia run the vectorized computation on the GPU (from the last time we ran it). That used 64-bit floats. When I reran the code above using 64-bit floats, the time was 5.2 ms.
We’ll explore two final topics related to efficiently accessing data in memory: first accessing global GPU memory efficiently and second making use of shared GPU memory.
If adjacent threads in a block access adjacent memory locations, a chunk of data can be obtained in a single access to global memory.
We’ll implement element-wise summing of two matrices. Obviously one can just do this directly with CuArray
s in Julia, but if we implement it ourselves, it illustrates that reading a matrix by column is much more efficient than reading by row. Here a thread block either handles part of a column (good) or part of a row (bad). The x-dimension of the blocks in the grid then handles multiple thread blocks within each column (or row; bad) and the y-dimension of the blocks in the grid handles the different columns (or rows; bad).
n = 10000;
X_gpu = CUDA.randn(n,n);
Y_gpu = CUDA.randn(n,n);
out_gpu = CUDA.zeros(n,n);

X_gpu_small = CUDA.randn(5,5);
Y_gpu_small = CUDA.randn(5,5);
out_gpu_small = CUDA.zeros(5,5);
# Good: Adjacent threads process elements in a column.
function kernel_sum_bycol!(X, Y, output)
    row_idx = threadIdx().x + (blockIdx().x - 1)*blockDim().x;
    col_idx = blockIdx().y;

    if row_idx <= size(X, 1) && col_idx <= size(Y, 2)
        output[row_idx, col_idx] = X[row_idx, col_idx] + Y[row_idx, col_idx]
    end
    return nothing
end
nthreads = 1024;
# x dim of grid is number of thread blocks in a column.
# y dim of grid is number of columns.
nblocks = (Int(ceil(n/nthreads)), n);

# Flush out any compilation time.
CUDA.@sync @cuda threads=nthreads blocks=nblocks kernel_sum_bycol!(X_gpu_small, Y_gpu_small, out_gpu_small);

@btime CUDA.@sync @cuda threads=nthreads blocks=nblocks kernel_sum_bycol!(X_gpu, Y_gpu, out_gpu);
# 2.153 ms (47 allocations: 1.30 KiB)
# Bad: Adjacent threads process elements in a row.
function kernel_sum_byrow!(X, Y, output)
    row_idx = blockIdx().y;
    col_idx = threadIdx().x + (blockIdx().x - 1)*blockDim().x;

    if row_idx <= size(X, 1) && col_idx <= size(Y, 2)
        output[row_idx, col_idx] = X[row_idx, col_idx] + Y[row_idx, col_idx]
    end
    return nothing
end
# Flush out any compilation time.
CUDA.@sync @cuda threads=nthreads blocks=nblocks kernel_sum_byrow!(X_gpu_small, Y_gpu_small, out_gpu_small);
@btime CUDA.@sync @cuda threads=nthreads blocks=nblocks kernel_sum_byrow!(X_gpu, Y_gpu, out_gpu);
# 10.500 ms (47 allocations: 1.30 KiB)
One thing we haven’t seen so far is being able to have different threads write to the same memory location (e.g., to a scalar or to an element of an array). One can easily imagine needing to do this to carry out reduction operations (e.g., calculating a sum or a max or min).
The obvious danger is that two threads might write to the memory location at the same time and somehow cause the location not to be properly updated.
Suppose we want to calculate the log-likelihood (or some other loss function) across independent observations. We’d like to do the summation on the GPU to avoid passing all the log-likelihood values from GPU to CPU and then having to do the sum on the CPU.
using BenchmarkTools
using Distributions
n = 100_000_000;   # Formatted for readability.
norm_dist = Normal(0,1)
samples = rand(norm_dist, n);
function loglik_kernel(x, result)
    i = threadIdx().x;   # What thread am I within the block?
    j = blockIdx().x;    # What block am I in?
    idx = (j-1)*blockDim().x + i;
    if idx <= length(x)
        # logpdf(norm_dist, x[idx]) # Doesn't compile.
        CUDA.@atomic result[1] += -0.5*x[idx]^2;   # Experimental, but nicer interface.
        #CUDA.atomic_add!(pointer(result), -0.5*x[idx]^2); # Stable low-level API.
    end
    return
end
nthreads = 1024;
nblocks = Int(ceil(n/nthreads));

samples_gpu = CuArray(samples);
result = CUDA.zeros(typeof(samples[1]), 1);
@btime CUDA.@sync @cuda threads=nthreads blocks=nblocks loglik_kernel(samples_gpu, result);
# 161.352 ms (34 allocations: 944 bytes)

Array(result)[1] - n*log(2*pi)/2   # Adjust for normalizing constant as scalar computation, not on GPU.
@btime sum(logpdf.(norm_dist, samples))
# 1.410 s (5 allocations: 762.94 MiB)
So we got roughly a 9-fold speedup (161 ms. versus 1.41 s.), which is less than we’ve been getting for some of our other comparisons.
I was curious how much time is spent handling the reduction operation (presumably there is some loss in efficiency from having all the threads write to the same memory location). When I changed result
to be a vector of length equal to that of samples
and just assign the individual PDF evaluations to the corresponding elements of result
without the atomic operation, the time was 3 milliseconds (compared to 161 above), so there is a performance degradation from the atomic operation.
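The debugging discussion below refers to a shared-memory version of this reduction. Here is a sketch of what such a kernel could look like (an illustrative reconstruction, not the exact code: it assumes 1024 threads per block, a power of two, and a recent CUDA.jl that provides CuStaticSharedArray): each block first combines its 1024 values in fast shared memory via a tree reduction, and only one atomic update per block touches result.
function loglik_kernel_shared(x, result)
    i = threadIdx().x;                            # thread index within the block
    idx = (blockIdx().x - 1)*blockDim().x + i;    # global index into x
    shared_data = CuStaticSharedArray(Float64, 1024)
    shared_data[i] = idx <= length(x) ? -0.5*x[idx]^2 : 0.0
    sync_threads()
    s = blockDim().x ÷ 2       # integer division; `/` would make `s` a float
    while s >= 1
        if i <= s
            shared_data[i] += shared_data[i + s]
        end
        sync_threads()
        s ÷= 2
    end
    if i == 1
        CUDA.@atomic result[1] += shared_data[1]   # one atomic update per block
    end
    return
end

result2 = CUDA.zeros(typeof(samples[1]), 1);
CUDA.@sync @cuda threads=nthreads blocks=nblocks loglik_kernel_shared(samples_gpu, result2);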
It can be much harder to debug kernel code than regular Julia code. If the syntax doesn’t produce valid compiled code that can run on the GPU, it may not be obvious from the error message what the problem is.
As an example in the code in the previous section, I originally had s = blockDim().x / 2;
and s /= 2;
. I didn’t realize that even with integer inputs, that this produced a float output type for s
and that as a result using s
for indexing in shared_data[i + s];
wouldn’t work. The error message said there was a problem with the LLVM/IR code produced from the kernel, but didn’t say where and it took a binary search on my part to figure out that shared_data[i + s];
was the problematic piece of code and that was caused by s
being a float.
On an only somewhat related point, the ChatBot originally gave me while s >= 0
, which is a bug that doesn’t prevent the code from running, but does give incorrect numerical results, so we still need to be careful with what we get from ChatBots.