R bootcamp, Module 5: Programming

August 2022, UC Berkeley

Chris Paciorek

R is a functional language

Functions are one of the most important constructs in R (and many other languages). They allow you to modularize your code - encapsulating a set of repeatable operations as an individual function call.

median
## function (x, na.rm = FALSE, ...) 
## UseMethod("median")
## <bytecode: 0x55963eaa7408>
## <environment: namespace:stats>
args(median)
## function (x, na.rm = FALSE, ...) 
## NULL
class(median)
## [1] "function"

Using functions

You should rely heavily on functions rather than having long sets of expressions in R scripts.

Functions have many important advantages:

A basic goal is writing functions is modularity.

In general, a function should

Writing functions

In module 4, we sorted the gapminder dataframe.

ord <- order(gapminder$year, gapminder$lifeExp, decreasing = TRUE)
ord[1:5]
## [1]  804  672  696 1488   72
gm_ord <- gapminder[ord, ]

How would we encapsulate that functionality generically so that we can apply it to other dataframes (or matrices)?

The general approach is to take some code you wrote already (e.g., see above) and insert it as the body of the function.

sortByCol <- function(data, col1, col2) {
    # Sorts a matrix or dataframe based on two columns
    #
    # Args:
    #     data: a dataframe or matrix with at least two columns
    #                  and any number of rows
    #     col1: a reference to the column to sort on
    #     col2: a reference to the column to use for ties
    #
    # Returns:
    #     <data> sorted in increasing order by the values
    #     in the first column. Any ties should be broken by values
    #     in the second column. The row pairs should be maintained
    #     in this result
    
    data <- as.data.frame(data)  # to avoid issues with order on a tibble column
    ord <- order(data[, col1], data[, col2], decreasing=TRUE)
    sorted <- data[ord, ]
    return(sorted)
}

gm_ord2 <- sortByCol(gapminder, "year", "lifeExp")
head(gm_ord)
## # A tibble: 6 × 6
##   country          continent  year lifeExp       pop gdpPercap
##   <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Japan            Asia       2007    82.6 127467972    31656.
## 2 Hong Kong, China Asia       2007    82.2   6980412    39725.
## 3 Iceland          Europe     2007    81.8    301931    36181.
## 4 Switzerland      Europe     2007    81.7   7554661    37506.
## 5 Australia        Oceania    2007    81.2  20434176    34435.
## 6 Spain            Europe     2007    80.9  40448191    28821.
head(gm_ord2)
##               country continent year lifeExp       pop gdpPercap
## 804             Japan      Asia 2007  82.603 127467972  31656.07
## 672  Hong Kong, China      Asia 2007  82.208   6980412  39724.98
## 696           Iceland    Europe 2007  81.757    301931  36180.79
## 1488      Switzerland    Europe 2007  81.701   7554661  37506.42
## 72          Australia   Oceania 2007  81.235  20434176  34435.37
## 1428            Spain    Europe 2007  80.941  40448191  28821.06

Of course this is somewhat limited in that it is specific to sorting based on two columns. We’d usually want to extend this to be more general, but it’s usually good to start with something concrete and more limited in generality and then generalize once you are sure it works.

Function arguments

R can match arguments by name (when provided) or by position (the fall-back). It also allows one to specify default values so that the user doesn’t have to explicitly provide all the arguments.

log(100)
## [1] 4.60517
log(100, base = 10)
## [1] 2
log(100, 10)
## [1] 2
log(base = 10, 100)
## [1] 2
log(base = 10, x = 100)
## [1] 2

And in our running example:

sortByCol <- function(data, col1 = 1, col2 = 2) {
    data <- as.data.frame(data)  # to avoid issues with order on a tibble column
    ord <- order(data[, col1], data[, col2], decreasing=TRUE)
    sorted <- data[ord, ]
    return(sorted)
}
identical(sortByCol(gapminder, 1, 2), sortByCol(gapminder))
## [1] TRUE
identical(sortByCol(col2 = 2, data = gapminder, col1 = 1), sortByCol(gapminder, 1, 2))
## [1] TRUE

What is the “…” argument for?

Using ... as one of the arguments to a function allows a function to pass along user-provided arguments without specifying explicitly what the user might provide.

Here’s an example of tailoring some plotting specifications that I use a lot. I generally define pplot() in my .Rprofile file so it is always available to me.

pplot <- function(x, y, ...) {
      plot(x, y, pch = 16, cex = 0.6, ...)
}

pplot(gapminder$gdpPercap, gapminder$lifeExp,  xlab = 'gdpPercap (log $)',
      ylab = 'life expectancy (years)', log = 'x')

Important concepts with R functions

Functions in R return an object. In general, R functions are and should be designed such that the only effect of the function is through the return value.

Side effects are when a function affects the state of the world in addition to its return value.

Question: Can you think of any side effects that you saw an R function produce from earlier today?

Important concepts with R functions - pass-by-value

x <- rnorm(3)

myfun <- function(x) {
      x <- x + 100
      return(x)
}      

new_x <- myfun(x)
print(new_x)
## [1] 100.2904 100.2134 100.2444
print(x)
## [1] 0.2903942 0.2134274 0.2443608

Functions in R are (roughly) pass-by-value and not pass-by-reference. This means that if you modify an argument inside the function it will not change the original value outside the function. This protects you from a major potential source of side effects. (There are exceptions to this rule.)

In actuality, functions in R are call-by-value. What this means for our purposes is that you can pass an input argument in without a copy being made of it. This saves time and memory. At the time that you modify the input within the function (if ever), then a copy is made and the modified input is different than the original value outside the function.

Variable scope and global variables

In general functions should not make use of variables from outside the function. (However, for quick-and-dirty work and in some other circumstances, one may do this.) This provides modularity and reduces bugs and surprises.

If R can’t find a variable that is used in a function based on the function arguments and variables defined locally in the function, it goes and looks elsewhere following a set of rules called lexical scoping.

(This type of scoping has to do with R’s roots (and explains why R is very similar to other languages for functional programming) - we won’t go into details here but certainly worth looking into as you start using R more.)

Basically this means that it looks for variables relative to where the function is defined (not relative to where the function is called).

This can get involved, but a couple brief examples illustrate the basic idea.

x <- 2
f <- function(y) {
    return(x + y)
}
f(1)
## [1] 3
g <- function(y) {
  x <- 10
  return(f(y))
}

g(1)
## [1] 3
## This is a bit complicated... 
g <- function(y) {
  f <- function(y) {
     return(x + y)
  }
  x <- 10
  return(f(y))
}

g(1)
## [1] 11

Note that x is used as a global variable here, which in general is bad practice.

Why does scoping work that way?

Consider the lm function. It uses lm.fit for its actual computation.

Suppose scoping depended on where the function (lm in this case) is called from? What would happen now:

x <- rnorm(5)
y <- rnorm(5)
lm(y ~ x)

lm.fit <- function(...) print('Better luck next time, sucker.')

lm.fit()
lm(y~x)

R’s scoping, in combination with package namespaces protects against this kind of problem.

Loops

In many languages, looping (for loops, while loops, etc.) is one of the main constructs used to carry out computation. Loops are not emphasized as much in R, both because they can be slow and because other syntax (vectorized calls, lapply, etc.) is often cleaner.

But there are lots of times when using a loop does make sense.

Most of you are probably familiar at least with the basic idea of iterating through a series of steps. A for loop iterates through a pre-determined number of iterations, while a while loop iterates until some condition is met. For loops are more common in R, but while loops can be handy particularly for things like optimization.

Let’s take an example from Module 4 and use a for loop instead of lapply or aggregate.

out <- list()
years <- unique(gapminder$year)
length(out) <- length(years)
names(out) <- years

for (yrIdx in seq_along(years)) {
     # equivalently: for(i in 1:length(years))
     # n.b., seq_along(x) is better than 1:length(x), since it handles the case
     # where the length of an object is 0 or NULL more robustly.
     sub <- subset(gapminder, gapminder$year == years[yrIdx])
     # fit regression
     out[[yrIdx]] <- lm(lifeExp ~ log(gdpPercap), data = sub)
}
out[['2007']]
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = sub)
## 
## Coefficients:
##    (Intercept)  log(gdpPercap)  
##          4.950           7.203
summary(out[['2007']])
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.947  -2.661   1.215   4.469  13.115 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.9496     3.8577   1.283    0.202    
## log(gdpPercap)   7.2028     0.4423  16.283   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.122 on 140 degrees of freedom
## Multiple R-squared:  0.6544, Adjusted R-squared:  0.652 
## F-statistic: 265.2 on 1 and 140 DF,  p-value: < 2.2e-16
summary(out[['1952']])
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.9571  -5.7319   0.7517   6.5770  13.7361 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -17.8457     5.0668  -3.522 0.000578 ***
## log(gdpPercap)   8.8298     0.6626  13.326  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.146 on 140 degrees of freedom
## Multiple R-squared:  0.5592, Adjusted R-squared:  0.556 
## F-statistic: 177.6 on 1 and 140 DF,  p-value: < 2.2e-16

The iterations do not have to explicitly be over sequential numbers.

for (yr in years) {
     cat(yr, '\n')
}
## 1952 
## 1957 
## 1962 
## 1967 
## 1972 
## 1977 
## 1982 
## 1987 
## 1992 
## 1997 
## 2002 
## 2007

Calculation approaches: quick quiz

POLL 5A: Which of these give exactly this result: pi, 2*pi, 3*pi, …?

(respond at https://pollev.com/chrispaciorek428)

  1. (1:n)*pi
  2. out <- rep(0, n); for(x in 1:n) out <- x*pi
  3. sapply(1:n, function(x) x*pi)
  4. out <- rep(0, n); for(x in 1:n) out[i] <- x*pi
  5. lapply(1:n, function(x) x*pi)
  6. sapply(1:n, “*“, pi)
  7. 1:n*pi

Question: Which of these do you think will be fastest for a long vector x?

  1. x <- exp(x)
  2. x <- sapply(x, exp)
  3. for(i in 1:length(x)) x[i] <- exp(x[i])

While loop (optional)

It’s not a particularly interesting example, but we can see the while loop syntax in the same example. In this case

yrIdx <- 1
while (yrIdx <= length(years)) {
     sub <- subset(gapminder, gapminder$year == years[yrIdx])
     # fit regression
     out[[yrIdx]] <- lm(lifeExp ~ log(gdpPercap), data = sub)
     yrIdx = yrIdx + 1
}
summary(out[['2007']])
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.947  -2.661   1.215   4.469  13.115 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.9496     3.8577   1.283    0.202    
## log(gdpPercap)   7.2028     0.4423  16.283   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.122 on 140 degrees of freedom
## Multiple R-squared:  0.6544, Adjusted R-squared:  0.652 
## F-statistic: 265.2 on 1 and 140 DF,  p-value: < 2.2e-16

Branching (if-then-else syntax)

Often we need our code to do different things depending on whether some condition is true or false.

Here’s a simple example to illustrate the syntax. Note that the then is implicit.

val <- rnorm(1)
val
## [1] 0.786164
if (val < 0) {
  print("val is negative")
} else {
  print("val is positive")
}
## [1] "val is positive"

We can chain together if statements as follows.

val <- rnorm(1)
val
## [1] -0.1062454
if (val < -1) {
  print("val is more than one standard deviation below the mean.")
} else if (abs(val) <= 1) {
  print("val is within one standard deviation of the mean.")
} else {
  print("val is more than one standard deviation above the mean.")
}
## [1] "val is within one standard deviation of the mean."

In general, the { brackets are only needed if you have multiple R expressions, but R will complain when an else starts a line of code, so generally using the { is good practice. That said, this works just fine:

if (val < 0) print("val is negative") else print("val is positive")
## [1] "val is negative"

The condition in an if statement

The condition in the if statement cannot be NA or R will give an error. This is a very common bug.

continents <- unique(gapminder$continent)
continents  
## [1] Asia     Europe   Africa   Americas Oceania 
## Levels: Africa Americas Asia Europe Oceania
continents <- unique(as.character(gapminder$continent))
continents <- c('Antarctica', continents)

out <- rep(0, length(continents))
for (i in seq_along(continents)) {
    sub <- gapminder[gapminder$continent == continents[i], ]
    if(mean(sub$lifeExp) < 50)
       print(continents[i])
}
## Error in if (mean(sub$lifeExp) < 50) print(continents[i]): missing value where TRUE/FALSE needed
print(i)
## [1] 1
sub <- gapminder[gapminder$continent == continents[i], ]
if(mean(sub$lifeExp) < 50) print('here')
## Error in if (mean(sub$lifeExp) < 50) print("here"): missing value where TRUE/FALSE needed
mean(sub$lifeExp) < 50
## [1] NA

An NA/NaN is the main reason an if statement may fail, because R will generally convert other values to logical values.

Zero evaluates to FALSE, all other numbers evaluate to TRUE. In general strings are not converted to booleans.

A more robust alternative is to use isTRUE():

out <- rep(0, length(continents))
for (i in seq_along(continents)) {
    sub <- gapminder[gapminder$continent == continents[i], ]
    if(isTRUE(mean(sub$lifeExp) < 60))
       print(continents[i])
}
## [1] "Africa"

Here’s another example. What will this return?

if(sqrt(-1) == 0i) {
   print('i')
} else print('not i')
  1. ‘i’
  2. an error
  3. ‘not i’

Boolean arithmetic

R provides both && and & as well as || and |.

& and | are the vectorized AND and OR operators we’ve seen already.

&& and || operate only on a single element and are generally used in the conditions of if statements (or while loops).

For example:

x <- rnorm(1)
y <- rnorm(1)
if(x > 0 && y < 0) {
  # do something
}

A nice thing about && and || is that they evaluate left to right, so it can save some computation, e.g., if the first condition of && is FALSE, the second condition is not evaluated.

Flow control: next and break statements (optional)

‘next’ skips the current evaluation of the loop statements:

continents <- unique(gapminder2007$continent)
continents[2] <- "Oceania"; continents[5] <- "Europe"  # reorder to illustrate points below
continents
## [1] Asia     Oceania  Africa   Americas Europe  
## Levels: Africa Americas Asia Europe Oceania
out <- list(); length(out) <- length(continents); names(out) <- continents

for (i in seq_along(continents)) {
     # equivalently: for(i in 1:length(continents))
     sub <- subset(gapminder2007, gapminder2007$continent == continents[i])
     if(sum(!is.na(sub$lifeExp)) > 2) { # don't regress if <= 2 obs
     # fit regression
       out[[i]] <- lm(lifeExp ~ log(gdpPercap), data = sub)
     } else {
       next
     }
}
cat("We got to iteration ", i, " of ", length(continents), " items.\n", sep = "")
## We got to iteration 5 of 5 items.
out[['Oceania']]
## NULL

‘break’ immediately ends loop evaluation:

out <- list(); length(out) <- length(continents); names(out) <- continents

for (i in seq_along(continents)) {
     # equivalently: for(i in 1:length(continents))
     sub <- subset(gapminder2007, gapminder2007$continent == continents[i])
     if(sum(!is.na(sub$lifeExp)) > 2) { # don't regress if <= 2 obs
     # fit regression
       out[[i]] <- lm(lifeExp ~ log(gdpPercap), data = sub)
     } else {
       break
     }
}

cat("We got to iteration ", i, " of ", length(continents), " items.\n", sep = "")
## We got to iteration 2 of 5 items.

Effective use of next and break can make your for (and other) loops both more robust and efficient (e.g., by skipping cases where computations may fail).

When do I start programming? (optional)

“[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.”

John Chambers, “Stages in the Evolution of S”

Key Principles of R (optional)

What does the 2nd principle mean?

Are arithmetic operations really just functions?

3+2
## [1] 5
'+'(3,2)
## [1] 5

Yes!

And what about indexing?

x <- matrix(runif(100), 10)
x[ , 2]
##  [1] 0.3736735 0.2301198 0.3279895 0.9573521 0.1122490 0.1379409 0.5371405
##  [8] 0.1900901 0.2598349 0.7517959
'['(x , , 2)
##  [1] 0.3736735 0.2301198 0.3279895 0.9573521 0.1122490 0.1379409 0.5371405
##  [8] 0.1900901 0.2598349 0.7517959

Also yes!

This holds more generally - one can investigate and see the same thing in terms of a for loop.

What does the 1st principle mean?

class(1)
## [1] "numeric"
class(runif)
## [1] "function"
class(function(x) x^2)
## [1] "function"
square <- function(x) x^2
class(square)
## [1] "function"

(Advanced) Modern Functional Programming in R (optional)

Now that we’ve taken an extensive look at for loops and programming with functions, we can consider moving on from base R to look at more modern and powerful tools for programming provided by purrr, a core package in the tidyverse, a set of modern tools for “doing data science” in R. purrr provides facilities to manipulate data sets using functions in a “tidy” manner. Using purrr requires familiarity with several other core packages of the tidyverse, most notably dplyr and magrittr. We won’t have time to cover details of how to use purrr here, but you may want to look into these packages as you continuing exploring R.

library(dplyr)
library(purrr)

We’ll leave you with a simple example of how to use purrr to deal with a task we already looked at with for loops. In fact, we’ll return to the logistic regression exercise that we started with.

A simple example

Recall that we have been fitting regression models for each year. To fit our models across each of the years separately, we’ll first need to put our data in “tidy” format. It’s really easy with some helpful verbs from dplyr

# let's clean up the data set first
gm_tidy <- gapminder %>%
  split(.$year)

Now, we can fit our regression models across each of the years using purrr’s map:

gm_lms <- gm_tidy %>%
  map(~lm(lifeExp ~ log(gdpPercap), data = .))

What about protecting ourselves against situations where there wasn’t enough variation in the outcome variable (fewer than two non-missing observations in a year). (It doesn’t happen here but can easily happen in other situations.)

So, can purrr handle this? Yes - in fact, it’s really easy. We can use a verb called safely to separate situations where the GLM succeeds from those where it doesn’t. Let’s try it out

gm_lms <- gm_tidy %>%
  map(safely(~lm(lifeExp ~ log(gdpPercap), data = .)))

Remark: What we accomplish here with purrr::map is also easily done using tools from base R. In fact, using lapply, we can evaluate the very same lm formula with our gapminder dataset, albeit without the extra goodies offered by the pipe (aka %>%) syntax and the safely convenience, afforded by magrittr and purrr, respectively.

Ok. Now, we’ll look at the results for one destination, just to get a feel for the output

gm_lms$`2007`
## $result
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = .)
## 
## Coefficients:
##    (Intercept)  log(gdpPercap)  
##          4.950           7.203  
## 
## 
## $error
## NULL

Now that we’ve seen what we can do with purrr, it’s important to compare this modern approach to what is offered in base R. (It’s very important to understand and be comfortable with using tools from base R before adopting those that are offered as part of the tidyverse.) To that end, here are some points of comparison:

For a full comparison of purrr versus base R, consider checking out this quite thorough tutorial by Jenny Bryan.

Onwards: Readings and References

Breakout

Basics

  1. Write an R function that will take an input vector and set any negative values in the vector to zero.

Using the ideas

  1. Write an R function that will take an input vector and set any value below a threshold to be the value of threshold. Optionally, the function should instead set values above a threshold to the value of the threshold.

  2. Augment your function so that it checks that the input is a numeric vector and return an error if not. (See the help information for stop() (or stopifnot().)

  3. Figure out what invisible() does and why it is useful when writing functions. Use invisible() in your function from just above or in the sortByCol() function.

  4. Sort the gapminder dataset and find the shortest life expectancy value. Now consider the use of which.min() and why using that should be much quicker with large datasets.

Advanced

  1. Extend the sortByCol() function to handle an arbitrary number of columns on which to do the sorting. You may need to modify the functionality a bit to use full vectors for the sorting rather than column names.

  2. Extend the sortByCol() function so that it can take either numeric indices or character strings indicating columns of the input data or take entire vectors to use in the sorting of the input data. If information specifying columns is given, make sure your function carefully checks that the input refers to legitimate columns and returns an error or warning (see warning()) if not.

  3. Explore scoping in the following code. Explain why the result is 11 and not 3. Note that funGenerator() returns a function as the return object, consistent with the idea that functions are objects in R. This is an example of what is called a closure in R. Basically, the function contains object(s) enclosed with and accessible to the function.

funGenerator <- function(x) {
   x <- 10
   g <- function(y) {
      return(x + y)
   }
   return(g)
}

x <- 2
f <- funGenerator()
f(1)
## [1] 11