1. Benchmarking
system.time is very handy for comparing the speed of different implementations. Here’s a basic comparison of the time to calculate the row means of a matrix using a for loop versus the built-in rowMeans function.
n <- 10000
m <- 1000
x <- matrix(rnorm(n*m), nrow = n)
system.time({
mns <- rep(NA, n)
for(i in 1:n) mns[i] <- mean(x[i, ])
})
## user system elapsed
## 0.215 0.028 0.243
system.time(rowMeans(x))
## user system elapsed
## 0.021 0.000 0.021
In general, user time gives the CPU time spent by R and system time gives the CPU time spent by the kernel (the operating system) on behalf of R. Operations that fall under system time include opening files, doing input or output, starting other processes, etc.
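As a rough illustration (the user/system split will vary by platform and filesystem), writing a large object to disk involves kernel-level I/O, much of which registers as system time:
tmp <- tempfile()
## write uncompressed so the timing reflects I/O rather than
## compression work done within the R process
system.time(saveRDS(x, tmp, compress = FALSE))
unlink(tmp)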
To time code that runs very quickly, you should use the microbenchmark package. Of course, one generally only cares about accurately timing a quick calculation when a larger operation performs it very many times. Here’s a comparison of different ways of accessing an element of a data frame.
library(microbenchmark)
df <- data.frame(vals = 1:3, labs = c('a','b','c'))
microbenchmark(
df[2,1],
df$vals[2],
df[2, 'vals']
)
## Unit: nanoseconds
## expr min lq mean median uq max neval cld
## df[2, 1] 8187 8710.5 9349.14 9049.0 9339.0 33255 100 b
## df$vals[2] 602 825.5 1031.48 889.5 1064.5 9222 100 a
## df[2, "vals"] 8143 8827.5 9498.09 9089.0 9478.5 41009 100 b
Challenge: Think about what order of operations in the first and third approaches above might lead to the timing result seen above.
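If you want a hint, note that single-bracket indexing of a data frame dispatches to the rather involved [.data.frame method, which you can inspect directly:
`[.data.frame`  ## the method that df[2, 1] and df[2, 'vals'] dispatch to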
microbenchmark is also a good way to compare slower calculations, e.g., computing a crossproduct via crossprod() compared to doing it “manually” via a transpose and matrix multiply:
library(microbenchmark)
n <- 1000
x <- matrix(rnorm(n^2), n)
microbenchmark(
t(x) %*% x,
crossprod(x),
times = 10)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## t(x) %*% x 30.69937 31.28786 35.04459 32.86499 38.01497 48.31874 10 b
## crossprod(x) 12.43864 14.12903 14.06831 14.24277 14.42822 14.46092 10 a
Challenge: What is inefficient about the manual approach above?
An alternative that also automates timings and comparisons is the benchmark function from the rbenchmark package, but there’s not really a reason to use it when microbenchmark is available.
library(rbenchmark)
# compare two ways of computing the crossproduct
benchmark(t(x) %*% x,
crossprod(x),
replications = 10,
columns = c('test', 'elapsed', 'replications'))
## test elapsed replications
## 2 crossprod(x) 0.130 10
## 1 t(x) %*% x 0.292 10
In general, it’s a good idea to repeat (replicate) your timing, as there is some stochasticity in how fast your computer will run a piece of code at any given moment.
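For example, running system.time repeatedly on the same computation will generally give somewhat different results:
## elapsed time for the same computation varies from run to run
replicate(3, system.time(rowMeans(x))["elapsed"])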
You might also check out the tictoc package for simple stopwatch-style timing.
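Here’s a minimal sketch of tictoc’s usage:
library(tictoc)
tic("row means")    ## start a named timer
mns <- rowMeans(x)  ## ...code to be timed...
toc()               ## stop the timer and report the elapsed time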
2. Profiling
The Rprof function will show you how much time is spent in different functions, which can help you pinpoint bottlenecks in your code. The output from Rprof can be hard to decipher, so you will probably want to use the proftools package functions, which make use of Rprof under the hood.
Here’s a function that does the linear algebra to find the least squares solution in a linear regression, assuming x is the matrix of predictors, including a column for the intercept.
lr_slow <- function(y, x) {
xtx <- t(x) %*% x
xty <- t(x) %*% y
inv <- solve(xtx) ## explicit matrix inverse is slow and usually a bad idea numerically
return(inv %*% xty)
}
Let’s run the function with profiling turned on.
## generate random observations and random matrix of predictors
y <- rnorm(5000)
x <- matrix(rnorm(5000*1000), nrow = 5000)
library(proftools)
pd1 <- profileExpr(lr_slow(y, x))
hotPaths(pd1)
## path total.pct self.pct
## lr_slow 100.00 0.00
## . %*% (<text>:2) 37.50 37.50
## . %*% (<text>:3) 1.56 1.56
## . solve (<text>:4) 29.69 0.00
## . . solve.default 29.69 28.12
## . . . diag 1.56 1.56
## . t (<text>:2) 1.56 0.00
## . . t.default 1.56 1.56
## . t (<text>:3) 29.69 0.00
## . . t.default 29.69 29.69
hotPaths(pd1, value = 'time')
## path total.time self.time
## lr_slow 1.28 0.00
## . %*% (<text>:2) 0.48 0.48
## . %*% (<text>:3) 0.02 0.02
## . solve (<text>:4) 0.38 0.00
## . . solve.default 0.38 0.36
## . . . diag 0.02 0.02
## . t (<text>:2) 0.02 0.00
## . . t.default 0.02 0.02
## . t (<text>:3) 0.38 0.00
## . . t.default 0.38 0.38
The first call to hotPaths shows the percentage of time spent in each call, while the second shows the actual time.
Note the nestedness of the results. For example, essentially all the time spent in solve is actually spent in solve.default. In this case solve is just an S3 generic function that immediately dispatches to the specific method solve.default.
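You can see the dispatch for yourself: printing the generic shows it is just a one-line call to UseMethod, and methods() lists the specific methods available for dispatch.
solve           ## a one-line generic: UseMethod("solve")
methods(solve)  ## the available methods, including solve.default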
We can see that a lot of time is spent in the transpose and matrix multiplication steps. So let’s try using crossprod, which computes t(x) %*% x (or t(x) %*% y) in a single step, to make those steps faster.
lr_medium <- function(y, x) {
xtx <- crossprod(x)
xty <- crossprod(x, y)
inv <- solve(xtx) ## explicit matrix inverse is slow and usually a bad idea numerically
return(inv %*% xty)
}
pd2 <- profileExpr(lr_medium(y, x))
hotPaths(pd2)
## path total.pct self.pct
## lr_medium 100.00 0.00
## . crossprod (<text>:2) 43.75 43.75
## . crossprod (<text>:3) 6.25 6.25
## . solve (<text>:4) 50.00 0.00
## . . solve.default 50.00 46.88
## . . . diag 3.12 3.12
hotPaths(pd2, value = 'time')
## path total.time self.time
## lr_medium 0.64 0.00
## . crossprod (<text>:2) 0.28 0.28
## . crossprod (<text>:3) 0.04 0.04
## . solve (<text>:4) 0.32 0.00
## . . solve.default 0.32 0.30
## . . . diag 0.02 0.02
First, note that this version takes less than half the time of the previous one. Second, note that a fair amount of time is spent computing the explicit matrix inverse using solve. (There’s not much we can do to speed up the crossprod operations, apart from making sure we are using a fast, and potentially parallelized, BLAS.) It’s well known that one should avoid computing an explicit matrix inverse when possible. Here’s a faster version that avoids it by using the Cholesky factorization of xtx and two triangular solves.
lr_fast <- function(y, x) {
xtx <- crossprod(x)
xty <- crossprod(x, y)
U <- chol(xtx)
tmp <- backsolve(U, xty, transpose = TRUE)
return(backsolve(U, tmp))
}
pd3 <- profileExpr(lr_fast(y, x))
hotPaths(pd3)
## path total.pct self.pct
## lr_fast 100.00 0.00
## . backsolve (<text>:5) 5.56 5.56
## . chol (<text>:4) 11.11 0.00
## . . chol.default 5.56 5.56
## . . get 5.56 0.00
## . . . lazyLoadDBfetch 5.56 5.56
## . crossprod (<text>:2) 77.78 77.78
## . crossprod (<text>:3) 5.56 5.56
hotPaths(pd3, value = 'time')
## path total.time self.time
## lr_fast 0.36 0.00
## . backsolve (<text>:5) 0.02 0.02
## . chol (<text>:4) 0.04 0.00
## . . chol.default 0.02 0.02
## . . get 0.02 0.00
## . . . lazyLoadDBfetch 0.02 0.02
## . crossprod (<text>:2) 0.28 0.28
## . crossprod (<text>:3) 0.02 0.02
We can see that we get another speedup with this final version of the code. (But remember my earlier caution: when comparing times between implementations, we should use replication.)
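Following that caution, here’s a sketch that compares all three implementations with replication (the exact timings will of course depend on your machine and BLAS):
library(microbenchmark)
microbenchmark(
lr_slow(y, x),
lr_medium(y, x),
lr_fast(y, x),
times = 5)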
You might also check out profvis as an alternative way to display the profiling information generated by Rprof.
Note that Rprof works by sampling: at regular intervals (set by the interval argument) during a calculation, it records which function R is executing and saves that information to the file given as the first argument to Rprof. So if you try to profile code that finishes very quickly, there’s not enough opportunity for the sampling to represent the calculation accurately, and you may get spurious results.
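As a sketch, here’s how one might call Rprof directly with a sampling interval finer than the 0.02-second default (note that very small intervals may not be honored reliably, depending on the operating system):
Rprof("lr_slow.out", interval = 0.005)  ## sample every 0.005 seconds
invisible(lr_slow(y, x))
Rprof(NULL)                             ## turn profiling off
summaryRprof("lr_slow.out")$by.self     ## self time spent in each function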
Warning: Rprof conflicts with threaded linear algebra, so you may need to set OMP_NUM_THREADS to 1 to disable threading before profiling code that involves linear algebra.
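As a sketch of how to do that: set the environment variable in the shell before launching R, or (assuming you have the RhpcBLASctl package installed) restrict the thread count from within R:
## in the shell, before starting R:
##   export OMP_NUM_THREADS=1
## or from within R, via the RhpcBLASctl package:
library(RhpcBLASctl)
blas_set_num_threads(1)  ## restrict the BLAS to a single thread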