R bootcamp, Module 9: Workflows, managing projects and good practices

August 2023, UC Berkeley

Chris Paciorek

Tips for avoiding bugs

Common syntax errors and bugs

library(codetools)
f <- function() {y <- 3; print(x + y)}
findGlobals(f)
## [1] "{"     "+"     "<-"    "print" "x"
mat <- matrix(1:4, 2, 2)[1, ]
dim(mat); print(mat)
## NULL
## [1] 1 3
# colSums(mat)   # Fails!
mat <- matrix(1:4, 2, 2)[1, , drop = FALSE]
colSums(mat)     # Succeeds!
## [1] 1 3
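
Another classic source of bugs is comparing floating-point numbers with `==`; a minimal sketch:

```r
## Floating-point arithmetic is inexact, so exact equality tests can surprise you.
0.1 + 0.2 == 0.3                    # FALSE, perhaps unexpectedly
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: comparison within a tolerance
```

Use all.equal() (or an explicit tolerance) rather than == when comparing non-integer numbers.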

Debugging

Because R is an interpreted language that you can run line by line, you effectively get a basic debugger for free: just step through your code interactively to see where things go wrong.

But there is an official debugger and other tools that greatly help in figuring out problems, particularly for more complicated situations.

Let’s briefly see these in action. I’ll demo this in a very basic way, but hopefully this will give you an idea of the power of these tools.

buggyFun <- function(myDF) {
   print(names(myDF))
   myDF$id <- seq_len(nrow(myDF))
   sums <- rowSums(myDF)
   return(sums)
}

buggyFun(gapminder)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
## Error in rowSums(myDF): 'x' must be numeric
if(FALSE) {
  traceback()
  debug(buggyFun)
  buggyFun(gapminder)

  undebug(buggyFun)
  options(error = recover)
  buggyFun(gapminder)
}
  1. We can use debug() to step through a function line by line

  2. After an error occurs, we can use traceback() to look at the call stack

  3. More helpfully, if we set options(error = recover) before running code, we can go into the function in which the error occurred

  4. We can insert browser() inside a function and R will stop there and allow us to proceed with debugging statements

  5. You can temporarily insert code into a function (including built-in functions) with trace(fxnName, edit = TRUE)
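
As a sketch of item 4 (the function name here is just illustrative), one can pause execution inside a function with browser():

```r
sumNumericCols <- function(myDF) {
  browser()   # execution pauses here; 'n' steps, 'c' continues, 'Q' quits
  numericCols <- sapply(myDF, is.numeric)
  rowSums(myDF[ , numericCols])
}
## sumNumericCols(gapminder)   # run interactively to try out the browser
```
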

Testing

Testing should be performed on multiple levels and begun as early as possible in the development process. For programs that accept input either from a user or file, it is important that the code validates the input is what it expects to receive. Tests that ensure individual code elements (e.g., functions, classes, and class methods) behave correctly are called unit tests. Writing unit tests early in the process of implementing new functionality helps you think about what you want a piece of code to do, rather than just how it does it. This practice improves code quality by focusing your attention on use cases rather than getting lost in implementation details.

The testthat package is very helpful for setting up tests. Also, RUnit is a testing framework for R that helps automate test setup, creation, execution, and reporting. For more information, see Bioconductor’s unit testing guidelines.
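
As a minimal sketch (the function and test names here are made up for illustration), a testthat unit test looks like:

```r
library(testthat)

## the function under test
meanRemoveNA <- function(x) mean(x, na.rm = TRUE)

test_that("meanRemoveNA handles missing values", {
  expect_equal(meanRemoveNA(c(1, 2, NA)), 1.5)
  expect_true(is.nan(meanRemoveNA(c(NA, NA))))  # all-NA input yields NaN
})
```

Each test_that() block groups related expectations and reports which ones fail.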

Timing your code

First, a cautionary note…

premature optimization is the root of all evil

— Donald Knuth, 1974

There are a few tools in R for timing your code.

system.time(mean(rnorm(1e7)))
##    user  system elapsed 
##   0.623   0.035   0.658
library(rbenchmark)
x <- rnorm(1e7)
benchmark(ifelse(x < 0, x, 0),
          x[x < 0] <- 0,
          replications = 5,
          columns = c('replications', 'elapsed'))
##   replications elapsed
## 1            5   1.732
## 2            5   0.202
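
As an aside, ifelse() is often slow; a vectorized alternative like pmin() does the same job here. The microbenchmark package (assuming it is installed) gives finer-grained timings:

```r
x <- rnorm(1e6)
## ifelse(x < 0, x, 0) keeps negative values and zeroes out the rest,
## which is exactly what pmin(x, 0) computes
identical(ifelse(x < 0, x, 0), pmin(x, 0))   # TRUE

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  microbenchmark::microbenchmark(
    ifelse(x < 0, x, 0),
    pmin(x, 0),
    times = 5)
}
```
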

Profiling your code

For more advanced assessment of bottlenecks in your code, consider Rprof(). Since the raw output from Rprof can be hard to decipher, you may want to use the proftools package, whose functions use Rprof under the hood.

Here’s a function that does the linear algebra to implement a linear regression, assuming X is the matrix of predictors, including a column for the intercept:

\hat{\beta} = (X^{\top}X )^{-1}X^{\top}y

lr_slow <- function(y, X) {
  XtX <- t(X) %*% X
  Xty <- t(X) %*% y
  inv <- solve(XtX)   ## explicit matrix inverse is slow and generally a bad idea numerically
  return(inv %*% Xty)
}                   

lr_medium <- function(y, X) {
  XtX <- crossprod(X)
  Xty <- crossprod(X, y)
  inv <- solve(XtX)   ## explicit matrix inverse is slow and generally a bad idea numerically
  return(inv %*% Xty)
}                   

lr_fast <- function(y, X) {
  XtX <- crossprod(X)
  Xty <- crossprod(X, y)
  U <- chol(XtX)
  tmp <- backsolve(U, Xty, transpose = TRUE)
  return(backsolve(U, tmp))
}                   
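
Before timing, it's worth checking that the implementations agree with each other and with lm(); a quick sketch (redefining two of the functions compactly so the snippet stands alone):

```r
lr_slow2 <- function(y, X) solve(t(X) %*% X) %*% (t(X) %*% y)
lr_fast2 <- function(y, X) {
  U <- chol(crossprod(X))   # XtX = U'U with U upper triangular
  backsolve(U, backsolve(U, crossprod(X, y), transpose = TRUE))
}

set.seed(1)
y <- rnorm(100)
X <- cbind(1, matrix(rnorm(100 * 3), nrow = 100))  # intercept plus 3 predictors

all.equal(c(lr_slow2(y, X)), c(lr_fast2(y, X)))             # TRUE up to tolerance
all.equal(c(lr_fast2(y, X)), unname(coef(lm(y ~ X - 1))))   # matches lm()
```
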

Now let’s try these functions with profiling turned on.

## generate random observations and random matrix of predictors
y <- rnorm(5000)
x <- matrix(rnorm(5000*1000), nrow = 5000)

library(proftools)

pd1 <- profileExpr(lr_slow(y, x))
hotPaths(pd1)
##  path                total.pct self.pct
##  lazyLoadDBfetch     32.99      0.00   
##  . <Anonymous>       32.99      0.00   
##  . . lazyLoadDBfetch 32.99     32.99   
##  lr_slow             67.01      0.00   
##  . %*% (<text>:2)    19.59     19.59   
##  . %*% (<text>:3)     3.09      3.09   
##  . %*% (<text>:5)     1.03      1.03   
##  . solve (<text>:4)  23.71      0.00   
##  . . solve.default   23.71     22.68   
##  . . . diag           1.03      1.03   
##  . t (<text>:2)       1.03      0.00   
##  . . t.default        1.03      1.03   
##  . t (<text>:3)      18.56      0.00   
##  . . t.default       18.56     18.56
hotPaths(pd1, value = 'time')
##  path                total.time self.time
##  lazyLoadDBfetch     0.64       0.00     
##  . <Anonymous>       0.64       0.00     
##  . . lazyLoadDBfetch 0.64       0.64     
##  lr_slow             1.30       0.00     
##  . %*% (<text>:2)    0.38       0.38     
##  . %*% (<text>:3)    0.06       0.06     
##  . %*% (<text>:5)    0.02       0.02     
##  . solve (<text>:4)  0.46       0.00     
##  . . solve.default   0.46       0.44     
##  . . . diag          0.02       0.02     
##  . t (<text>:2)      0.02       0.00     
##  . . t.default       0.02       0.02     
##  . t (<text>:3)      0.36       0.00     
##  . . t.default       0.36       0.36
pd2 <- profileExpr(lr_medium(y, x))
hotPaths(pd2)
##  path                    total.pct self.pct
##  lr_medium               100.00     0.00   
##  . crossprod (<text>:10)   7.89     7.89   
##  . crossprod (<text>:9)   50.00    50.00   
##  . solve (<text>:11)      42.11     0.00   
##  . . solve.default        42.11    36.84   
##  . . . diag                5.26     5.26
hotPaths(pd2, value = 'time')
##  path                    total.time self.time
##  lr_medium               0.76       0.00     
##  . crossprod (<text>:10) 0.06       0.06     
##  . crossprod (<text>:9)  0.38       0.38     
##  . solve (<text>:11)     0.32       0.00     
##  . . solve.default       0.32       0.28     
##  . . . diag              0.04       0.04
pd3 <- profileExpr(lr_fast(y, x))
hotPaths(pd3)
##  path                    total.pct self.pct
##  lr_fast                 100.00     0.00   
##  . chol (<text>:18)       15.79     0.00   
##  . . chol.default         15.79    15.79   
##  . crossprod (<text>:16)  73.68    73.68   
##  . crossprod (<text>:17)  10.53    10.53
hotPaths(pd3, value = 'time')
##  path                    total.time self.time
##  lr_fast                 0.38       0.00     
##  . chol (<text>:18)      0.06       0.00     
##  . . chol.default        0.06       0.06     
##  . crossprod (<text>:16) 0.28       0.28     
##  . crossprod (<text>:17) 0.04       0.04

You might also check out profvis for an alternative to displaying profiling information generated by Rprof.
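
A minimal sketch of using profvis (assuming the package is installed; it opens an interactive flame-graph viewer, so it only makes sense in an interactive session):

```r
## a toy function whose profile is dominated by matrix inversion
slowFun <- function(n) {
  m <- matrix(rnorm(n * n), n)
  solve(m)
}
## guarded so it runs only interactively with profvis available
if (interactive() && requireNamespace("profvis", quietly = TRUE)) {
  profvis::profvis(slowFun(1000))
}
```
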

Memory use

You should know how much memory (RAM) the computer you are using has, and keep in mind how big your objects are and how much memory your code might use. Unlike, e.g., SAS or a database, R stores all objects in RAM.

If the jobs on a machine collectively approach the physical RAM, the machine may (depending on how it is set up) start to use the hard disk as ‘virtual memory’. This is called paging or swapping, and once it happens you’re often toast (i.e., your code may take essentially forever to finish). If paging is not enabled, your job will instead die with an out-of-memory (OOM) error.

You can assess memory use with top or ps in Linux/Mac or the Task Manager in Windows.

Often it’s a good idea to roughly estimate how much memory an object will take up even before creating it in R. You can do this with some simple arithmetic. Every real number takes 8 bytes (integers and logicals take less; character strings are complicated), so an object with, say, 1 million rows and 10 columns, all numbers, would take roughly 8 * 1000000 * 10 bytes, or 80 MB.

x <- rnorm(1e7)
object.size(x)
## 80000048 bytes
1e7*8/1e6  # direct calculation of Mb
## [1] 80
print(object.size(x), units = 'auto')
## 76.3 Mb
x <- rnorm(1e8)
gc()
##             used  (Mb) gc trigger   (Mb)  max used  (Mb)
## Ncells    705996  37.8    1251997   66.9   1251997  66.9
## Vcells 111305601 849.2  177271667 1352.5 121317097 925.6
rm(x)
gc()
##            used (Mb) gc trigger   (Mb)  max used  (Mb)
## Ncells   706018 37.8    1251997   66.9   1251997  66.9
## Vcells 11305646 86.3  141817334 1082.0 121317097 925.6
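
You can check this kind of arithmetic directly: doubles take 8 bytes each and integers 4, and object.size() reports sizes that match (plus a small fixed overhead):

```r
object.size(numeric(1e6))   # ~8 MB: 8 bytes per double
object.size(integer(1e6))   # ~4 MB: 4 bytes per integer
## predicted size of a 1e6-row, 10-column numeric matrix, in MB
8 * 1e6 * 10 / 1e6
```
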

Memory use: quick quiz

POLL 9A: Suppose you have a CSV file with one million rows and one thousand columns, filled with numbers. How much memory will this take up if you read the data into R?

(respond at https://pollev.com/chrispaciorek428)

  1. 800 MB
  2. 8 GB
  3. 1 GB
  4. 80 GB
  5. something else

Scripting

If you use a good editor that understands the syntax of the language you are using, it’s easier to write and understand your code. Some editors to consider: RStudio, Visual Studio Code (with the R extension), Emacs (with ESS), and vim (with Nvim-R).

To run all the code in an entire file, do source('myCodeFile.R').

RStudio projects help to organize your work. In many cases you may also want to create an R package to share your code, even if you don’t plan to make it publicly available. The devtools and usethis packages can help you create and manage projects and packages.

Batch jobs / background execution

To run code as a background job in a Linux/Mac context:

R CMD BATCH --no-save myCodeFile.R myOutput.Rout &

Then you don’t need to leave RStudio or R or a terminal window open. Your job will run by itself. If you are logged into a remote machine, you can log out and come back later.

IMPORTANT: make sure to write any needed output to a file (e.g. .Rda files, CSV files, text output from print() or cat()).
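
For example, the end of a batch script might save its results to files (the file names here are just illustrative):

```r
results <- replicate(10, mean(rnorm(1000)))
saveRDS(results, file = "results.rds")   # reload later with readRDS("results.rds")
write.csv(data.frame(mean = results), file = "results.csv", row.names = FALSE)
cat("Job finished at", format(Sys.time()), "\n")
```
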

Good coding practices: functions

Use functions whenever possible; in particular, try to write functions rather than carrying out your work in loose blocks of code. Why? Functions make it easy to reuse code and to recreate an analysis (reproducible research). A function is also more transparent than sourcing a file of code, because its inputs and outputs are specified formally, so you don’t have to read through the body to figure out what it does.
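
As a toy sketch of the contrast (the data and function names here are made up):

```r
## as a block of code, the inputs and outputs are implicit:
## dat <- dat[!is.na(dat$y), ]; fit <- lm(y ~ x, data = dat); r2 <- summary(fit)$r.squared

## as a function, they are explicit:
fitR2 <- function(dat) {
  dat <- dat[!is.na(dat$y), ]   # drop rows with a missing response
  fit <- lm(y ~ x, data = dat)
  return(summary(fit)$r.squared)
}

d <- data.frame(x = 1:10, y = c(2 * (1:9), NA))
fitR2(d)   # 1, since the non-missing points lie exactly on a line
```
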

Good use of functions includes:

  1. writing short functions that each carry out a single task

  2. reusing functions rather than copying and pasting blocks of code

  3. passing inputs in as arguments rather than relying on global variables

Functions should:

  1. be modular (do one thing well)

  2. have descriptive names

  3. have comments describing their inputs, outputs, and purpose

Good coding practices: syntax

For an example that demonstrates some of these ideas, see goodCode.R. And for a counterexample, see badCode.R.

One example: avoid using the names of built-in functions as variable names. R can often still find the function when a call is intended, but masking built-ins invites confusion:

c <- 7
c(3,5)
## [1] 3 5
c
## [1] 7
rm(c)
c
## function (...)  .Primitive("c")

Reproducible research

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

— Jonathan Buckheit and David Donoho, WaveLab and Reproducible Research (1995)

Here are some useful articles talking about reproducibility.

Some ideas for improving reproducibility

Breakout

Please fill out the feedback survey. See the Ed Discussion board for the link.