August 2022, UC Berkeley
Chris Paciorek
Use warning() and stop() to give a warning or stop execution when something unexpected happens. Use try() with functions that may fail in cases where you don't want overall execution to fail because a single piece of the execution fails.

Some common syntax traps:

- [...] vs. [[...]]
- == vs. =
- == is dangerous for comparing numeric values. Suppose you generate x = 0.333333 in some fashion with some code and then check: x == 1/3. This will produce FALSE.
- || vs. | and && vs. & (the scalar forms || and && are for use in an if statement)

When comparing values, consider using identical() or all.equal().
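A minimal sketch of the try() pattern, with a made-up list of inputs where one element causes an error:

```r
inputs <- list(4, "oops", 9)   # hypothetical inputs; the second will fail
results <- vector("list", length(inputs))
for (i in seq_along(inputs)) {
    ## try() captures the error so the loop continues past the bad input
    results[[i]] <- try(sqrt(inputs[[i]]), silent = TRUE)
}
sapply(results, inherits, "try-error")   # FALSE TRUE FALSE
```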
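To see the floating point trap concretely (x here is just one way to produce a value that is algebraically, but not exactly, 1/3):

```r
x <- 1 - 2/3                # algebraically 1/3, but not exact in floating point
x == 1/3                    # FALSE: exact comparison of doubles is fragile
isTRUE(all.equal(x, 1/3))   # TRUE: all.equal compares within a numerical tolerance
identical(x, 1/3)           # FALSE: identical is also an exact comparison
```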
## [1] "{" "+" "<-" "print" "x"
The [ operator takes an additional optional argument, drop = FALSE, that can avoid dropping dimensions.

## NULL
## [1] 1 3
## Error in base::colSums(x, na.rm = na.rm, dims = dims, ...): 'x' must be an array of at least two dimensions
## [1] 1 3
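A small illustration of the dimension-dropping issue, with a made-up matrix:

```r
x <- matrix(1:6, nrow = 2)
x1 <- x[1, ]                  # dimensions dropped: x1 is a plain vector
dim(x1)                       # NULL
try(colSums(x1))              # errors: needs an array of at least two dimensions
x2 <- x[1, , drop = FALSE]    # x2 stays a 1-row matrix
colSums(x2)                   # works
```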
As a scripting language, R essentially gives you a debugger automatically, since you can often easily run code line by line.
But there is an official debugger and other tools that greatly help in figuring out problems, particularly for more complicated situations.
Let’s briefly see these in action. I’ll demo this in a very basic way, but hopefully this will give you an idea of the power of these tools.
buggyFun <- function(myDF) {
    print(names(myDF))
    myDF$id <- seq_len(nrow(myDF))
    sums <- rowSums(myDF)
    return(sums)
}
buggyFun(gapminder)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
## Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...): 'x' must be numeric
if(FALSE) {
    traceback()
    debug(buggyFun)
    buggyFun(gapminder)
    undebug(buggyFun)
    options(error = recover)
    buggyFun(gapminder)
}
- We can use debug() to step through a function line by line.
- After an error occurs, we can use traceback() to look at the call stack.
- More helpfully, if we set options(error = recover) before running code, we can go into the function in which the error occurred.
- We can insert browser() inside a function and R will stop there and allow us to proceed with debugging statements.
- You can temporarily insert code into a function (including built-in functions) with trace(fxnName, edit = TRUE).
Testing should be performed on multiple levels and begun as early as possible in the development process. For programs that accept input either from a user or file, it is important that the code validates the input is what it expects to receive. Tests that ensure individual code elements (e.g., functions, classes, and class methods) behave correctly are called unit tests. Writing unit tests early in the process of implementing new functionality helps you think about what you want a piece of code to do, rather than just how it does it. This practice improves code quality by focusing your attention on use cases rather than getting lost in implementation details.
The testthat package is very helpful for setting up tests. Also, RUnit is a testing framework for R that helps automate test setup, creation, execution, and reporting. For more information, see Bioconductor’s unit testing guidelines.
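As a small sketch of a testthat unit test (squish is a hypothetical function invented for this example; assumes the testthat package is installed):

```r
library(testthat)

## a hypothetical function to test: clamp values to the interval [0, 1]
squish <- function(x) pmin(pmax(x, 0), 1)

test_that("squish clamps to the unit interval", {
    expect_equal(squish(c(-1, 0.5, 2)), c(0, 0.5, 1))
    expect_identical(squish(numeric(0)), numeric(0))  # edge case: empty input
})
```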
First, a cautionary note…
premature optimization is the root of all evil
— Donald Knuth, 1974
There are a few tools in R for timing your code, such as system.time():
## user system elapsed
## 0.686 0.024 0.711
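For instance, to time a single expression (the numbers you see will vary by machine; "elapsed" is wall-clock time, while "user" and "system" are CPU time):

```r
## time a vectorized computation over ten million values
t1 <- system.time(mean(rnorm(1e7)))
t1
```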
library(rbenchmark)
x <- rnorm(1e7)
benchmark(ifelse(x < 0, x, 0),
          x[x < 0] <- 0,
          replications = 5,
          columns = c('replications', 'elapsed'))
## replications elapsed
## 1 5 1.483
## 2 5 0.175
For more advanced assessment of bottlenecks in your code, consider Rprof(). The raw output from Rprof can be hard to decipher, though, so you may want to use the proftools package functions, which make use of Rprof under the hood.
Here are some functions that do the linear algebra to implement a linear regression, assuming X is the matrix of predictors, including a column for the intercept:
lr_slow <- function(y, X) {
    XtX <- t(X) %*% X
    Xty <- t(X) %*% y
    inv <- solve(XtX)  ## explicit matrix inverse is slow and generally a bad idea numerically
    return(inv %*% Xty)
}

lr_medium <- function(y, X) {
    XtX <- crossprod(X)
    Xty <- crossprod(X, y)
    inv <- solve(XtX)  ## explicit matrix inverse is slow and generally a bad idea numerically
    return(inv %*% Xty)
}

lr_fast <- function(y, X) {
    XtX <- crossprod(X)
    Xty <- crossprod(X, y)
    U <- chol(XtX)
    tmp <- backsolve(U, Xty, transpose = TRUE)
    return(backsolve(U, tmp))
}
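Before profiling, it's worth checking that the Cholesky approach gives the same answer as the explicit inverse. A minimal self-contained check, with its own small simulated dataset:

```r
set.seed(1)
X <- cbind(1, matrix(rnorm(100 * 3), nrow = 100))  # small design matrix with intercept
y <- rnorm(100)
XtX <- crossprod(X)
Xty <- crossprod(X, y)
beta_inv  <- solve(XtX) %*% Xty                    # explicit-inverse solution
U <- chol(XtX)
beta_chol <- backsolve(U, backsolve(U, Xty, transpose = TRUE))  # Cholesky solution
max(abs(beta_inv - beta_chol))                     # essentially zero
```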
Now let’s try these functions with profiling turned on.
## generate random observations and random matrix of predictors
y <- rnorm(5000)
x <- matrix(rnorm(5000*1000), nrow = 5000)
library(proftools)
pd1 <- profileExpr(lr_slow(y, x))
hotPaths(pd1)
## path total.pct self.pct
## lazyLoadDBfetch 20.83 0.00
## . <Anonymous> 20.83 0.00
## . . lazyLoadDBfetch 20.83 20.83
## lr_slow 79.17 0.00
## . %*% (<text>:2) 17.71 17.71
## . %*% (<text>:3) 3.12 3.12
## . solve (<text>:4) 37.50 0.00
## . . standardGeneric 37.50 0.00
## . . . solve 37.50 0.00
## . . . . solve.default 37.50 37.50
## . t (<text>:2) 2.08 0.00
## . . standardGeneric 2.08 0.00
## . . . t 2.08 0.00
## . . . . t.default 2.08 2.08
## . t (<text>:3) 18.75 0.00
## . . standardGeneric 18.75 0.00
## . . . t 18.75 0.00
## . . . . t.default 18.75 18.75
## path total.time self.time
## lazyLoadDBfetch 0.40 0.00
## . <Anonymous> 0.40 0.00
## . . lazyLoadDBfetch 0.40 0.40
## lr_slow 1.52 0.00
## . %*% (<text>:2) 0.34 0.34
## . %*% (<text>:3) 0.06 0.06
## . solve (<text>:4) 0.72 0.00
## . . standardGeneric 0.72 0.00
## . . . solve 0.72 0.00
## . . . . solve.default 0.72 0.72
## . t (<text>:2) 0.04 0.00
## . . standardGeneric 0.04 0.00
## . . . t 0.04 0.00
## . . . . t.default 0.04 0.04
## . t (<text>:3) 0.36 0.00
## . . standardGeneric 0.36 0.00
## . . . t 0.36 0.00
## . . . . t.default 0.36 0.36
## path total.pct self.pct
## lazyLoadDBfetch 2.56 2.56
## lr_medium 97.44 0.00
## . %*% (<text>:12) 2.56 2.56
## . crossprod (<text>:10) 5.13 0.00
## . . crossprod 5.13 0.00
## . . . base::crossprod 5.13 5.13
## . crossprod (<text>:9) 51.28 0.00
## . . crossprod 51.28 0.00
## . . . base::crossprod 51.28 51.28
## . solve (<text>:11) 38.46 0.00
## . . standardGeneric 38.46 0.00
## . . . solve 38.46 0.00
## . . . . solve.default 38.46 35.90
## . . . . . diag 2.56 2.56
## path total.time self.time
## lazyLoadDBfetch 0.02 0.02
## lr_medium 0.76 0.00
## . %*% (<text>:12) 0.02 0.02
## . crossprod (<text>:10) 0.04 0.00
## . . crossprod 0.04 0.00
## . . . base::crossprod 0.04 0.04
## . crossprod (<text>:9) 0.40 0.00
## . . crossprod 0.40 0.00
## . . . base::crossprod 0.40 0.40
## . solve (<text>:11) 0.30 0.00
## . . standardGeneric 0.30 0.00
## . . . solve 0.30 0.00
## . . . . solve.default 0.30 0.28
## . . . . . diag 0.02 0.02
## path total.pct self.pct
## lr_fast 100.00 0.00
## . chol (<text>:18) 15.79 0.00
## . . standardGeneric 15.79 0.00
## . . . chol 15.79 0.00
## . . . . chol.default 15.79 15.79
## . crossprod (<text>:16) 73.68 0.00
## . . crossprod 73.68 0.00
## . . . base::crossprod 73.68 73.68
## . crossprod (<text>:17) 10.53 0.00
## . . crossprod 10.53 0.00
## . . . base::crossprod 10.53 10.53
## path total.time self.time
## lr_fast 0.38 0.00
## . chol (<text>:18) 0.06 0.00
## . . standardGeneric 0.06 0.00
## . . . chol 0.06 0.00
## . . . . chol.default 0.06 0.06
## . crossprod (<text>:16) 0.28 0.00
## . . crossprod 0.28 0.00
## . . . base::crossprod 0.28 0.28
## . crossprod (<text>:17) 0.04 0.00
## . . crossprod 0.04 0.00
## . . . base::crossprod 0.04 0.04
You might also check out profvis as an alternative way to display the profiling information generated by Rprof.
You should know how much memory (RAM) the computer you are using has, and keep in mind how big your objects are and how much memory your code might use. All objects in R are stored in RAM, unlike in, e.g., SAS or a database.
If, in total, the jobs on a machine approach the physical RAM, the machine may (depending on how it is set up) start to use the hard disk as 'virtual memory'. This is called paging or swapping, and once this happens you're often toast (i.e., your code may take essentially forever to finish). And if paging doesn't happen, your job will die with an out-of-memory (OOM) error.
You can assess memory use with top or ps in Linux/Mac or the Task Manager in Windows.
Often it’s a good idea to roughly estimate how much memory an object will take up even before creating it in R. You can do this with some simple arithmetic. Every real number takes 8 bytes (integers and logicals take less; character strings are complicated), so an object with, say, 1 million rows and 10 columns, all numbers, would take roughly 8 * 1000000 * 10 bytes, or about 80 Mb.
## 80000048 bytes
## [1] 80
## 76.3 Mb
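You can verify this kind of estimate with object.size(); a small illustration using one million random numbers:

```r
x <- rnorm(1e6)                       # one million doubles: roughly 8e6 bytes
object.size(x)                        # 8e6 bytes plus a small fixed overhead
print(object.size(x), units = "Mb")   # about 7.6 Mb (Mb here means 2^20 bytes)
```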
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2083510 111.3 4077525 217.8 3162653 169.0
## Vcells 113614165 866.9 159622940 1217.9 123668420 943.6
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2083522 111.3 4077525 217.8 3162653 169.0
## Vcells 13614197 103.9 127698352 974.3 123668420 943.6
POLL 9A: Suppose you have a CSV file with one million rows and one thousand columns, filled with numbers. How much memory will this take up if you read the data into R?
(respond at https://pollev.com/chrispaciorek428)
If you use a good editor that understands the syntax of the language you are using, it’s easier to write and understand your code. Some editors to consider:
To run all the code in an entire file, do source('myCodeFile.R').
RStudio projects help to organize your work. In many cases you may also want to create an R package to share your code, even if you don’t plan to make it publicly available. The devtools and usethis packages can help you create and manage projects and packages.
To run code as a background job in a Linux/Mac context, you can run something like R CMD BATCH --no-save myCodeFile.R myCodeFile.Rout & in a terminal.
Then you don’t need to leave RStudio or R or a terminal window open. Your job will run by itself. If you are logged into a remote machine, you can log out and come back later.
IMPORTANT: make sure to write any needed output to a file (e.g., .Rda files, CSV files, text output from print() or cat()).
Use functions whenever possible. In particular try to write functions rather than carry out your work using blocks of code. Why? Functions allow us to reuse blocks of code easily for later use and for recreating an analysis (reproducible research). It’s more transparent than sourcing a file of code because the inputs and outputs are specified formally, so you don’t have to read through the code to figure out what it does.
Good use of functions includes:
Functions should:
For an example that demonstrates some of these ideas, see goodCode.R. And for a counterexample, see badCode.R.
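As a small sketch of these ideas (standardize is a hypothetical function invented for this example): formal arguments make the inputs explicit, defaults document the common case, and an input check fails fast with an informative message.

```r
standardize <- function(x, center = TRUE, scale = TRUE) {
    if (!is.numeric(x)) stop("'x' must be numeric")  # validate input early
    if (center) x <- x - mean(x)                     # subtract the mean
    if (scale)  x <- x / sd(x)                       # divide by the standard deviation
    x                                                # output is explicit: the transformed vector
}

z <- standardize(c(2, 4, 6, 8))
```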
a/y*x will work but is not easy to read, and you can easily induce a bug if you forget the order of operations.

## [1] 3 5
## [1] 7
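To make the precedence issue concrete, with hypothetical values for a, x, and y:

```r
a <- 8; x <- 2; y <- 4
a/y*x     # 4: / and * have equal precedence and associate left to right
a/(y*x)   # 1: parentheses make the intended grouping explicit
```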
## function (...) .Primitive("c")
An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
— Jonathan Buckheit and David Donoho, WaveLab and Reproducible Research (1995)
Here are some useful articles talking about reproducibility.
Use save() or save.image() to save all the inputs needed to recreate a figure, with the code for making the figure in a script file.

Please fill out the feedback survey. See the Ed Discussion board for the link.