R bootcamp, Module 9: Workflows, managing projects and good practices

August 2023, UC Berkeley

Chris Paciorek

Tips for avoiding bugs

Common syntax errors and bugs

library(codetools)
f <- function() {y <- 3; print(x + y)}
findGlobals(f)
## [1] "{"     "+"     "<-"    "print" "x"
mat <- matrix(1:4, 2, 2)[1, ]
dim(mat); print(mat)
## NULL
## [1] 1 3
# colSums(mat)   # Fails!
mat <- matrix(1:4, 2, 2)[1, , drop = FALSE]
colSums(mat)     # Succeeds!
## [1] 1 3
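
Another classic source of bugs is comparing floating-point numbers with `==`; a minimal sketch:

```r
## Floating-point arithmetic is inexact, so exact equality tests can surprise you.
0.1 + 0.2 == 0.3                    # FALSE, perhaps unexpectedly
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: comparison within a tolerance
```

Use all.equal() (or an explicit tolerance) rather than == when comparing non-integer numbers.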

Debugging

Because R is an interpreted language that you can run line by line, you effectively get a basic debugger for free: just step through your code interactively to see where things go wrong.

But there is an official debugger and other tools that greatly help in figuring out problems, particularly for more complicated situations.

Let’s briefly see these in action. I’ll demo this in a very basic way, but hopefully this will give you an idea of the power of these tools.

buggyFun <- function(myDF) {
   print(names(myDF))
   myDF$id <- seq_len(nrow(myDF))
   sums <- rowSums(myDF)
   return(sums)
}

buggyFun(gapminder)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
## Error in rowSums(myDF): 'x' must be numeric
if(FALSE) {
  traceback()
  debug(buggyFun)
  buggyFun(gapminder)

  undebug(buggyFun)
  options(error = recover)
  buggyFun(gapminder)
}
  1. We can use debug() to step through a function line by line

  2. After an error occurs, we can use traceback() to look at the call stack

  3. More helpfully, if we set options(error = recover) before running code, we can go into the function in which the error occurred

  4. We can insert browser() inside a function and R will stop there and allow us to proceed with debugging statements

  5. You can temporarily insert code into a function (including built-in functions) with trace(fxnName, edit = TRUE)
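
As a sketch of item 4 (the function name here is just illustrative), one can pause execution inside a function with browser():

```r
sumNumericCols <- function(myDF) {
  browser()   # execution pauses here; 'n' steps, 'c' continues, 'Q' quits
  numericCols <- sapply(myDF, is.numeric)
  rowSums(myDF[ , numericCols])
}
## sumNumericCols(gapminder)   # run interactively to try out the browser
```
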

Testing

Testing should be performed on multiple levels and begun as early as possible in the development process. For programs that accept input either from a user or file, it is important that the code validates the input is what it expects to receive. Tests that ensure individual code elements (e.g., functions, classes, and class methods) behave correctly are called unit tests. Writing unit tests early in the process of implementing new functionality helps you think about what you want a piece of code to do, rather than just how it does it. This practice improves code quality by focusing your attention on use cases rather than getting lost in implementation details.

The testthat package is very helpful for setting up tests. Also, RUnit is a testing framework for R that helps automate test setup, creation, execution, and reporting. For more information, see Bioconductor’s unit testing guidelines.
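
As a minimal sketch (the function and test names here are made up for illustration), a testthat unit test looks like:

```r
library(testthat)

## the function under test
meanRemoveNA <- function(x) mean(x, na.rm = TRUE)

test_that("meanRemoveNA handles missing values", {
  expect_equal(meanRemoveNA(c(1, 2, NA)), 1.5)
  expect_true(is.nan(meanRemoveNA(c(NA, NA))))  # all-NA input yields NaN
})
```

Each test_that() block groups related expectations and reports which ones fail.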

Timing your code

First, a cautionary note…

premature optimization is the root of all evil

— Donald Knuth, 1974

There are a few tools in R for timing your code.

system.time(mean(rnorm(1e7)))
##    user  system elapsed 
##   0.623   0.035   0.658
library(rbenchmark)
x <- rnorm(1e7)
benchmark(ifelse(x < 0, x, 0),
          x[x < 0] <- 0,
          replications = 5,
          columns = c('replications', 'elapsed'))
##   replications elapsed
## 1            5   1.732
## 2            5   0.202
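
As an aside, ifelse() is often slow; a vectorized alternative like pmin() does the same job here. The microbenchmark package (assuming it is installed) gives finer-grained timings:

```r
x <- rnorm(1e6)
## ifelse(x < 0, x, 0) keeps negative values and zeroes out the rest,
## which is exactly what pmin(x, 0) computes
identical(ifelse(x < 0, x, 0), pmin(x, 0))   # TRUE

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  microbenchmark::microbenchmark(
    ifelse(x < 0, x, 0),
    pmin(x, 0),
    times = 5)
}
```
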

Profiling your code

For more advanced assessment of bottlenecks in your code, consider Rprof(). Since the raw output from Rprof can be hard to decipher, you may want to use the proftools package, whose functions use Rprof under the hood.

Here’s a function that does the linear algebra to implement a linear regression, assuming X is the matrix of predictors, including a column for the intercept:

\hat{\beta} = (X^{\top}X )^{-1}X^{\top}y

lr_slow <- function(y, X) {
  XtX <- t(X) %*% X
  Xty <- t(X) %*% y
  inv <- solve(XtX)   ## explicit matrix inverse is slow and generally a bad idea numerically
  return(inv %*% Xty)
}                   

lr_medium <- function(y, X) {
  XtX <- crossprod(X)
  Xty <- crossprod(X, y)
  inv <- solve(XtX)   ## explicit matrix inverse is slow and generally a bad idea numerically
  return(inv %*% Xty)
}                   

lr_fast <- function(y, X) {
  XtX <- crossprod(X)
  Xty <- crossprod(X, y)
  U <- chol(XtX)
  tmp <- backsolve(U, Xty, transpose = TRUE)
  return(backsolve(U, tmp))
}                   
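
Before timing, it's worth checking that the implementations agree with each other and with lm(); a quick sketch (redefining two of the functions compactly so the snippet stands alone):

```r
lr_slow2 <- function(y, X) solve(t(X) %*% X) %*% (t(X) %*% y)
lr_fast2 <- function(y, X) {
  U <- chol(crossprod(X))   # XtX = U'U with U upper triangular
  backsolve(U, backsolve(U, crossprod(X, y), transpose = TRUE))
}

set.seed(1)
y <- rnorm(100)
X <- cbind(1, matrix(rnorm(100 * 3), nrow = 100))  # intercept plus 3 predictors

all.equal(c(lr_slow2(y, X)), c(lr_fast2(y, X)))             # TRUE up to tolerance
all.equal(c(lr_fast2(y, X)), unname(coef(lm(y ~ X - 1))))   # matches lm()
```
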

Now let’s try these functions with profiling turned on.

## generate random observations and random matrix of predictors
y <- rnorm(5000)
x <- matrix(rnorm(5000*1000), nrow = 5000)

library(proftools)

pd1 <- profileExpr(lr_slow(y, x))
hotPaths(pd1)
##  path                total.pct self.pct
##  lazyLoadDBfetch     32.99      0.00   
##  . <Anonymous>       32.99      0.00   
##  . . lazyLoadDBfetch 32.99     32.99   
##  lr_slow             67.01      0.00   
##  . %*% (<text>:2)    19.59     19.59   
##  . %*% (<text>:3)     3.09      3.09   
##  . %*% (<text>:5)     1.03      1.03   
##  . solve (<text>:4)  23.71      0.00   
##  . . solve.default   23.71     22.68   
##  . . . diag           1.03      1.03   
##  . t (<text>:2)       1.03      0.00   
##  . . t.default        1.03      1.03   
##  . t (<text>:3)      18.56      0.00   
##  . . t.default       18.56     18.56
hotPaths(pd1, value = 'time')
##  path                total.time self.time
##  lazyLoadDBfetch     0.64       0.00     
##  . <Anonymous>       0.64       0.00     
##  . . lazyLoadDBfetch 0.64       0.64     
##  lr_slow             1.30       0.00     
##  . %*% (<text>:2)    0.38       0.38     
##  . %*% (<text>:3)    0.06       0.06     
##  . %*% (<text>:5)    0.02       0.02     
##  . solve (<text>:4)  0.46       0.00     
##  . . solve.default   0.46       0.44     
##  . . . diag          0.02       0.02     
##  . t (<text>:2)      0.02       0.00     
##  . . t.default       0.02       0.02     
##  . t (<text>:3)      0.36       0.00     
##  . . t.default       0.36       0.36
pd2 <- profileExpr(lr_medium(y, x))
hotPaths(pd2)
##  path                    total.pct self.pct
##  lr_medium               100.00     0.00   
##  . crossprod (<text>:10)   7.89     7.89   
##  . crossprod (<text>:9)   50.00    50.00   
##  . solve (<text>:11)      42.11     0.00   
##  . . solve.default        42.11    36.84   
##  . . . diag                5.26     5.26
hotPaths(pd2, value = 'time')
##  path                    total.time self.time
##  lr_medium               0.76       0.00     
##  . crossprod (<text>:10) 0.06       0.06     
##  . crossprod (<text>:9)  0.38       0.38     
##  . solve (<text>:11)     0.32       0.00     
##  . . solve.default       0.32       0.28     
##  . . . diag              0.04       0.04
pd3 <- profileExpr(lr_fast(y, x))
hotPaths(pd3)
##  path                    total.pct self.pct
##  lr_fast                 100.00     0.00   
##  . chol (<text>:18)       15.79     0.00   
##  . . chol.default         15.79    15.79   
##  . crossprod (<text>:16)  73.68    73.68   
##  . crossprod (<text>:17)  10.53    10.53
hotPaths(pd3, value = 'time')
##  path                    total.time self.time
##  lr_fast                 0.38       0.00     
##  . chol (<text>:18)      0.06       0.00     
##  . . chol.default        0.06       0.06     
##  . crossprod (<text>:16) 0.28       0.28     
##  . crossprod (<text>:17) 0.04       0.04

You might also check out profvis for an alternative to displaying profiling information generated by Rprof.
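
A minimal sketch of using profvis (assuming the package is installed; it opens an interactive flame-graph viewer, so it only makes sense in an interactive session):

```r
## a toy function whose profile is dominated by matrix inversion
slowFun <- function(n) {
  m <- matrix(rnorm(n * n), n)
  solve(m)
}
## guarded so it runs only interactively with profvis available
if (interactive() && requireNamespace("profvis", quietly = TRUE)) {
  profvis::profvis(slowFun(1000))
}
```
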

Memory use

You should know how much memory (RAM) the computer you are using has, and keep in mind how big your objects are and how much memory your code might use. Unlike, e.g., SAS or a database, R stores all objects in RAM.

If the jobs on a machine collectively approach the physical RAM, the machine may (depending on how it is set up) start to use the hard disk as ‘virtual memory’. This is called paging or swapping, and once it happens you’re often toast (i.e., your code may take essentially forever to finish). If paging is not enabled, your job will instead die with an out-of-memory (OOM) error.

You can assess memory use with top or ps in Linux/Mac or the Task Manager in Windows.

Often it’s a good idea to roughly estimate how much memory an object will take up even before creating it in R. You can do this with some simple arithmetic. Every real number takes 8 bytes (integers and logicals take less; character strings are complicated), so an object with, say, 1 million rows and 10 columns, all numbers, would take roughly 8 * 1000000 * 10 bytes, or 80 MB.

x <- rnorm(1e7)
object.size(x)
## 80000048 bytes
1e7*8/1e6  # direct calculation of Mb
## [1] 80
print(object.size(x), units = 'auto')
## 76.3 Mb
x <- rnorm(1e8)
gc()
##             used  (Mb) gc trigger   (Mb)  max used  (Mb)
## Ncells    705996  37.8    1251997   66.9   1251997  66.9
## Vcells 111305601 849.2  177271667 1352.5 121317097 925.6
rm(x)
gc()
##            used (Mb) gc trigger   (Mb)  max used  (Mb)
## Ncells   706018 37.8    1251997   66.9   1251997  66.9
## Vcells 11305646 86.3  141817334 1082.0 121317097 925.6
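
You can check this kind of arithmetic directly: doubles take 8 bytes each and integers 4, and object.size() reports sizes that match (plus a small fixed overhead):

```r
object.size(numeric(1e6))   # ~8 MB: 8 bytes per double
object.size(integer(1e6))   # ~4 MB: 4 bytes per integer
## predicted size of a 1e6-row, 10-column numeric matrix, in MB
8 * 1e6 * 10 / 1e6
```
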

Memory use: quick quiz

POLL 9A: Suppose you have a CSV file with one million rows and one thousand columns, filled with numbers. How much memory will this take up if you read the data into R?

(respond at https://pollev.com/chrispaciorek428)

  1. 800 MB
  2. 8 GB
  3. 1 GB
  4. 80 GB
  5. something else

Scripting

If you use a good editor that understands the syntax of the language you are using, it’s easier to write and understand your code. Some editors to consider: RStudio, Visual Studio Code (with the R extension), Emacs (with ESS), and vim (with Nvim-R).

To run all the code in an entire file, do source('myCodeFile.R').

RStudio projects help to organize your work. In many cases you may also want to create an R package to share your code, even if you don’t plan to make it publicly available. The devtools and usethis packages can help you create and manage projects and packages.

Batch jobs / background execution

To run code as a background job in a Linux/Mac context:

R CMD BATCH --no-save myCodeFile.R myOutput.Rout &

Then you don’t need to leave RStudio or R or a terminal window open. Your job will run by itself. If you are logged into a remote machine, you can log out and come back later.

IMPORTANT: make sure to write any needed output to a file (e.g. .Rda files, CSV files, text output from print() or cat()).
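
For example, the end of a batch script might save its results to files (the file names here are just illustrative):

```r
results <- replicate(10, mean(rnorm(1000)))
saveRDS(results, file = "results.rds")   # reload later with readRDS("results.rds")
write.csv(data.frame(mean = results), file = "results.csv", row.names = FALSE)
cat("Job finished at", format(Sys.time()), "\n")
```
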

Good coding practices: functions

Use functions whenever possible; in particular, try to write functions rather than carrying out your work in loose blocks of code. Why? Functions make it easy to reuse code and to recreate an analysis (reproducible research). A function is also more transparent than sourcing a file of code, because its inputs and outputs are specified formally, so you don’t have to read through the body to figure out what it does.
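
As a toy sketch of the contrast (the data and function names here are made up):

```r
## as a block of code, the inputs and outputs are implicit:
## dat <- dat[!is.na(dat$y), ]; fit <- lm(y ~ x, data = dat); r2 <- summary(fit)$r.squared

## as a function, they are explicit:
fitR2 <- function(dat) {
  dat <- dat[!is.na(dat$y), ]   # drop rows with a missing response
  fit <- lm(y ~ x, data = dat)
  return(summary(fit)$r.squared)
}

d <- data.frame(x = 1:10, y = c(2 * (1:9), NA))
fitR2(d)   # 1, since the non-missing points lie exactly on a line
```
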

Good use of functions includes:

  1. writing short functions that each carry out a single task

  2. reusing functions rather than copying and pasting blocks of code

  3. passing inputs in as arguments rather than relying on global variables

Functions should:

  1. be modular (do one thing well)

  2. have descriptive names

  3. have comments describing their inputs, outputs, and purpose

Good coding practices: syntax

For an example that demonstrates some of these ideas, see goodCode.R. And for a counterexample, see badCode.R.

One example: avoid using the names of built-in functions as variable names. R can often still find the function when a call is intended, but masking built-ins invites confusion:

c <- 7
c(3,5)
## [1] 3 5
c
## [1] 7
rm(c)
c
## function (...)  .Primitive("c")

Reproducible research

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

— Jonathan Buckheit and David Donoho, WaveLab and Reproducible Research (1995)

Here are some useful articles talking about reproducibility.

Some ideas for improving reproducibility

Breakout

Please fill out the feedback survey. See the Ed Discussion board for the link.