August 2022, UC Berkeley
Chris Paciorek
Functions are one of the most important constructs in R (and many other languages). They allow you to modularize your code - encapsulating a set of repeatable operations as an individual function call.
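For example, we can inspect an existing function such as median; the calls below are a likely reconstruction of the chunk that produced this output:
median
args(median)
class(median)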
## function (x, na.rm = FALSE, ...)
## UseMethod("median")
## <bytecode: 0x55963eaa7408>
## <environment: namespace:stats>
## function (x, na.rm = FALSE, ...)
## NULL
## [1] "function"
You should rely heavily on functions rather than having long sets of expressions in R scripts.
Functions have many important advantages: they reduce bugs by avoiding multiple copies of the same code, they save coding time by eliminating redundancy, and they make for cleaner and more easily-readable code.
A basic goal in writing functions is modularity.
In general, a function should be focused on a single task, be fairly short, and have clearly defined inputs and outputs.
In module 4, we sorted the gapminder dataframe.
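The code from that module looked something like this (a reconstruction; the output below shows the first few positions from the ordering):
ord <- order(gapminder$year, gapminder$lifeExp, decreasing = TRUE)
head(ord, 5)
gm_ord <- gapminder[ord, ]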
## [1] 804 672 696 1488 72
How would we encapsulate that functionality generically so that we can apply it to other dataframes (or matrices)?
The general approach is to take some code you wrote already (e.g., see above) and insert it as the body of the function.
sortByCol <- function(data, col1, col2) {
    # Sorts a matrix or dataframe based on two columns
    #
    # Args:
    #   data: a dataframe or matrix with at least two columns
    #         and any number of rows
    #   col1: a reference to the column to sort on
    #   col2: a reference to the column to use for ties
    #
    # Returns:
    #   <data> sorted in decreasing order by the values
    #   in the first column. Any ties are broken by values
    #   in the second column. Rows are kept intact, so row
    #   pairs are maintained in this result.
    data <- as.data.frame(data)  # to avoid issues with order() on a tibble column
    ord <- order(data[, col1], data[, col2], decreasing = TRUE)
    sorted <- data[ord, ]
    return(sorted)
}
gm_ord2 <- sortByCol(gapminder, "year", "lifeExp")
head(gm_ord)   # the tibble sorted in module 4
head(gm_ord2)  # the data.frame returned by sortByCol()
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Japan Asia 2007 82.6 127467972 31656.
## 2 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 3 Iceland Europe 2007 81.8 301931 36181.
## 4 Switzerland Europe 2007 81.7 7554661 37506.
## 5 Australia Oceania 2007 81.2 20434176 34435.
## 6 Spain Europe 2007 80.9 40448191 28821.
## country continent year lifeExp pop gdpPercap
## 804 Japan Asia 2007 82.603 127467972 31656.07
## 672 Hong Kong, China Asia 2007 82.208 6980412 39724.98
## 696 Iceland Europe 2007 81.757 301931 36180.79
## 1488 Switzerland Europe 2007 81.701 7554661 37506.42
## 72 Australia Oceania 2007 81.235 20434176 34435.37
## 1428 Spain Europe 2007 80.941 40448191 28821.06
Of course this is somewhat limited in that it is specific to sorting based on two columns. We’d usually want to extend this to be more general, but it’s usually good to start with something concrete and limited in generality, and then generalize once you are sure it works.
R can match arguments by name (when provided) or by position (the fall-back). It also allows one to specify default values so that the user doesn’t have to explicitly provide all the arguments.
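For example, with log() (a likely reconstruction of the chunk behind the output below):
log(100)                 # natural log
log(100, 10)             # 10 matched by position to 'base'
log(100, base = 10)      # matched by name
log(base = 10, x = 100)  # all named, so order doesn't matter
log(base = 10, 100)      # 100 falls through by position to 'x'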
## [1] 4.60517
## [1] 2
## [1] 2
## [1] 2
## [1] 2
And in our running example:
sortByCol <- function(data, col1 = 1, col2 = 2) {
    data <- as.data.frame(data)  # to avoid issues with order() on a tibble column
    ord <- order(data[, col1], data[, col2], decreasing = TRUE)
    sorted <- data[ord, ]
    return(sorted)
}
identical(sortByCol(gapminder, 1, 2), sortByCol(gapminder))
## [1] TRUE
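The check also passes when the defaults are matched by name (a reconstruction; only the output survived):
identical(sortByCol(gapminder, col1 = 1, col2 = 2), sortByCol(gapminder))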
## [1] TRUE
Using ... as one of the arguments to a function allows a function to pass along user-provided arguments without specifying explicitly what the user might provide.
Here’s an example of tailoring some plotting specifications that I use a lot. I generally define pplot() in my .Rprofile file so it is always available to me.
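The actual definition isn’t shown here; a minimal sketch of such a wrapper might look like this (the particular defaults are illustrative, not the real ones):
pplot <- function(x, y, ...) {
    # fix a few preferred defaults, passing everything else through to plot()
    plot(x, y, pch = 16, las = 1, ...)
}
pplot(gapminder$gdpPercap, gapminder$lifeExp, log = 'x', col = 'blue')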
Functions in R return an object. In general, R functions are, and should be, designed so that the only effect of the function is through the return value.
Side effects are when a function affects the state of the world in addition to its return value.
Question: Can you think of any side effects that you saw an R function produce from earlier today?
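The output below appears to come from a demonstration along these lines (a reconstruction; myfun and vals are illustrative names):
myfun <- function(x) {
    x <- x + 100
    return(x)
}
vals <- rnorm(3)
myfun(vals)  # the modified copy
vals         # the original is unchanged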
## [1] 100.2904 100.2134 100.2444
## [1] 0.2903942 0.2134274 0.2443608
Functions in R are (roughly) pass-by-value and not pass-by-reference. This means that if you modify an argument inside the function, it will not change the original value outside the function. This protects you from a major potential source of side effects. (There are exceptions to this rule.)
In actuality, functions in R are call-by-value. What this means for our purposes is that you can pass an input argument in without a copy being made of it. This saves time and memory. Only when you modify the input within the function (if ever) is a copy made, and the modified input is then distinct from the original value outside the function.
In general functions should not make use of variables from outside the function. (However, for quick-and-dirty work and in some other circumstances, one may do this.) This provides modularity and reduces bugs and surprises.
If R can’t find a variable that is used in a function based on the function arguments and variables defined locally in the function, it goes and looks elsewhere following a set of rules called lexical scoping.
(This type of scoping has to do with R’s roots in functional programming, and it explains why R is very similar to other functional languages. We won’t go into details here, but it is certainly worth looking into as you start using R more.)
Basically this means that it looks for variables relative to where the function is defined (not relative to where the function is called).
This can get involved, but a couple brief examples illustrate the basic idea.
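First, a simple case: the function uses a variable, x, that is not defined inside it (a reconstruction of the chunk behind the output below):
x <- 3
f <- function() {
    x  # not an argument or local variable, so R looks where f was defined
}
f()
x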
## [1] 3
## [1] 3
## This is a bit complicated...
g <- function(y) {
    f <- function(y) {
        return(x + y)
    }
    x <- 10
    return(f(y))
}
g(1)
## [1] 11
Note that x is used as a global variable here, which in general is bad practice.
Consider the lm function. It uses lm.fit for its actual computation.
Suppose scoping instead depended on where the function (lm in this case) is called from. What would happen in the following?
x <- rnorm(5)
y <- rnorm(5)
lm(y ~ x)
lm.fit <- function(...) print('Better luck next time, sucker.')
lm.fit()
lm(y~x)
R’s scoping, in combination with package namespaces, protects against this kind of problem.
In many languages, looping (for loops, while loops, etc.) is one of the main constructs used to carry out computation. Loops are not emphasized as much in R, both because they can be slow and because other syntax (vectorized calls, lapply, etc.) is often cleaner.
But there are lots of times when using a loop does make sense.
Most of you are probably familiar at least with the basic idea of iterating through a series of steps. A for loop iterates through a pre-determined number of iterations, while a while loop iterates until some condition is met. For loops are more common in R, but while loops can be handy particularly for things like optimization.
Let’s take an example from Module 4 and use a for loop instead of lapply or aggregate.
out <- list()
years <- unique(gapminder$year)
length(out) <- length(years)
names(out) <- years
for (yrIdx in seq_along(years)) {
    # equivalently: for (yrIdx in 1:length(years))
    # n.b., seq_along(x) is better than 1:length(x), since it handles the case
    # where the length of an object is 0 or NULL more robustly.
    sub <- subset(gapminder, gapminder$year == years[yrIdx])
    # fit regression
    out[[yrIdx]] <- lm(lifeExp ~ log(gdpPercap), data = sub)
}
out[['2007']]
##
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = sub)
##
## Coefficients:
## (Intercept) log(gdpPercap)
## 4.950 7.203
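The summaries below were presumably produced by calling summary() on individual fits, along these lines (the first matches the 2007 fit; the year behind the second summary is not recoverable from the output):
summary(out[['2007']])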
##
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.947 -2.661 1.215 4.469 13.115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9496 3.8577 1.283 0.202
## log(gdpPercap) 7.2028 0.4423 16.283 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.122 on 140 degrees of freedom
## Multiple R-squared: 0.6544, Adjusted R-squared: 0.652
## F-statistic: 265.2 on 1 and 140 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.9571 -5.7319 0.7517 6.5770 13.7361
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.8457 5.0668 -3.522 0.000578 ***
## log(gdpPercap) 8.8298 0.6626 13.326 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.146 on 140 degrees of freedom
## Multiple R-squared: 0.5592, Adjusted R-squared: 0.556
## F-statistic: 177.6 on 1 and 140 DF, p-value: < 2.2e-16
The iteration values do not have to be sequential numbers; we can iterate over any vector.
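For example, we can iterate over the years themselves (a reconstruction; cat() matches the output format below):
for (yr in years) {
    cat(yr, '\n')
}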
## 1952
## 1957
## 1962
## 1967
## 1972
## 1977
## 1982
## 1987
## 1992
## 1997
## 2002
## 2007
POLL 5A: Which of these give exactly this result: pi, 2*pi, 3*pi, …?
(respond at https://pollev.com/chrispaciorek428)
Question: Which of these do you think will be fastest for a long vector x?
It’s not a particularly interesting example, but we can see the while loop syntax in the same example. In this case we have to handle the incrementing of the index ourselves.
yrIdx <- 1
while (yrIdx <= length(years)) {
    sub <- subset(gapminder, gapminder$year == years[yrIdx])
    # fit regression
    out[[yrIdx]] <- lm(lifeExp ~ log(gdpPercap), data = sub)
    yrIdx <- yrIdx + 1
}
summary(out[['2007']])
##
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.947 -2.661 1.215 4.469 13.115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9496 3.8577 1.283 0.202
## log(gdpPercap) 7.2028 0.4423 16.283 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.122 on 140 degrees of freedom
## Multiple R-squared: 0.6544, Adjusted R-squared: 0.652
## F-statistic: 265.2 on 1 and 140 DF, p-value: < 2.2e-16
Often we need our code to do different things depending on whether some condition is true or false.
Here’s a simple example to illustrate the syntax. Note that the ‘then’ is implicit.
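Reconstructing the chunk from its output:
val <- rnorm(1)
val
if (val > 0) {
    print("val is positive")
}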
## [1] 0.786164
## [1] "val is positive"
We can chain together if statements as follows.
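First we draw a new value (again reconstructed from the output below):
val <- rnorm(1)
val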
## [1] -0.1062454
if (val < -1) {
    print("val is more than one standard deviation below the mean.")
} else if (abs(val) <= 1) {
    print("val is within one standard deviation of the mean.")
} else {
    print("val is more than one standard deviation above the mean.")
}
## [1] "val is within one standard deviation of the mean."
In general, the curly braces { } are only needed if you have multiple R expressions, but R will complain when an else starts a line of code, so routinely using the braces is good practice. That said, this works just fine:
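A reconstruction from the output (only the first branch's output survived, so the else message is a guess):
if (val < 0) print("val is negative") else print("val is non-negative")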
## [1] "val is negative"
The condition in the if statement cannot be NA or R will give an error. This is a very common bug.
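Here’s how that can arise with the gapminder data. First, the continents (a reconstructed call):
unique(gapminder$continent)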
## [1] Asia Europe Africa Americas Oceania
## Levels: Africa Americas Asia Europe Oceania
continents <- unique(as.character(gapminder$continent))
continents <- c('Antarctica', continents)
out <- rep(0, length(continents))
for (i in seq_along(continents)) {
    sub <- gapminder[gapminder$continent == continents[i], ]
    if (mean(sub$lifeExp) < 50)
        print(continents[i])
}
## Error in if (mean(sub$lifeExp) < 50) print(continents[i]): missing value where TRUE/FALSE needed
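We can diagnose what happened (a plausible reconstruction of the calls behind the output below):
i                                           # the loop failed on the first iteration
if (mean(sub$lifeExp) < 50) print("here")   # reproduces the error
mean(sub$lifeExp) < 50                      # no Antarctica rows, so the condition is NA, not TRUE/FALSE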
## [1] 1
## Error in if (mean(sub$lifeExp) < 50) print("here"): missing value where TRUE/FALSE needed
## [1] NA
An NA/NaN in the condition is the main reason an if statement may fail, because R will generally convert other values to logical values: zero evaluates to FALSE, and all other numbers evaluate to TRUE. In general, strings are not converted to booleans.
A more robust alternative is to use isTRUE():
out <- rep(0, length(continents))
for (i in seq_along(continents)) {
    sub <- gapminder[gapminder$continent == continents[i], ]
    if (isTRUE(mean(sub$lifeExp) < 60))
        print(continents[i])
}
## [1] "Africa"
Here’s another example. What will this return?
if (sqrt(-1) == 0i) {
    print('i')
} else print('not i')
R provides both && and & as well as || and |. & and | are the vectorized AND and OR operators we’ve seen already. && and || operate only on a single element and are generally used in the conditions of if statements (or while loops).
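For example, here is a minimal illustration (the values are made up):
a <- 7
a > 5 && a < 10                   # scalar AND: one TRUE/FALSE, suitable for if()
c(TRUE, FALSE) & c(TRUE, TRUE)    # vectorized AND: returns a vector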
A nice thing about && and || is that they evaluate left to right, so they can save some computation, e.g., if the first condition of && is FALSE, the second condition is not evaluated.
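Here’s a sketch of how that short-circuiting can protect an if statement:
x <- NULL
# 'x > 0' yields a length-zero result that if() cannot use,
# but && never evaluates it here because the first condition is FALSE:
if (!is.null(x) && x > 0) {
    print("x is positive")
} else {
    print("x is NULL or not positive")
}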
next and break statements (optional)
‘next’ skips the rest of the current iteration of the loop:
continents <- unique(gapminder2007$continent)
continents[2] <- "Oceania"; continents[5] <- "Europe" # reorder to illustrate points below
continents
## [1] Asia Oceania Africa Americas Europe
## Levels: Africa Americas Asia Europe Oceania
out <- list(); length(out) <- length(continents); names(out) <- continents
for (i in seq_along(continents)) {
    # equivalently: for (i in 1:length(continents))
    sub <- subset(gapminder2007, gapminder2007$continent == continents[i])
    if (sum(!is.na(sub$lifeExp)) > 2) {  # don't regress if <= 2 obs
        # fit regression
        out[[i]] <- lm(lifeExp ~ log(gdpPercap), data = sub)
    } else {
        next
    }
}
cat("We got to iteration ", i, " of ", length(continents), " items.\n", sep = "")
## We got to iteration 5 of 5 items.
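We can confirm that the skipped case was left empty (a reconstructed call; Oceania has only two countries in 2007, so the loop skipped it):
out[['Oceania']]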
## NULL
‘break’ immediately ends loop evaluation:
out <- list(); length(out) <- length(continents); names(out) <- continents
for (i in seq_along(continents)) {
    # equivalently: for (i in 1:length(continents))
    sub <- subset(gapminder2007, gapminder2007$continent == continents[i])
    if (sum(!is.na(sub$lifeExp)) > 2) {  # don't regress if <= 2 obs
        # fit regression
        out[[i]] <- lm(lifeExp ~ log(gdpPercap), data = sub)
    } else {
        break
    }
}
cat("We got to iteration ", i, " of ", length(continents), " items.\n", sep = "")
## We got to iteration 2 of 5 items.
Effective use of next and break can make your for (and other) loops both more robust and efficient (e.g., by skipping cases where computations may fail).
“[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.”
Everything that exists is an object.
Everything that happens is a function call.
Are arithmetic operations really just functions?
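The calls below are a likely reconstruction of the chunk behind this output:
3 + 2
`+`(3, 2)  # the operator is an ordinary function with a special name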
## [1] 5
## [1] 5
Yes!
And what about indexing?
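Again, a likely reconstruction:
x <- runif(10)
x[1:10]
`[`(x, 1:10)  # subsetting is a function call too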
## [1] 0.3736735 0.2301198 0.3279895 0.9573521 0.1122490 0.1379409 0.5371405
## [8] 0.1900901 0.2598349 0.7517959
## [1] 0.3736735 0.2301198 0.3279895 0.9573521 0.1122490 0.1379409 0.5371405
## [8] 0.1900901 0.2598349 0.7517959
Also yes!
This holds more generally - one can investigate and see the same thing in terms of a for loop.
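A plausible reconstruction of the chunk behind the output below:
class(x)
class(`+`)
class(`[`)
class(`for`)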
## [1] "numeric"
## [1] "function"
## [1] "function"
## [1] "function"
Now that we’ve taken an extensive look at for loops and programming with functions, we can consider moving on from base R to look at more modern and powerful tools for programming provided by purrr, a core package in the tidyverse, a set of modern tools for “doing data science” in R. purrr provides facilities to manipulate data sets using functions in a “tidy” manner.
Using purrr requires familiarity with several other core packages of the tidyverse, most notably dplyr and magrittr. We won’t have time to cover details of how to use purrr here, but you may want to look into these packages as you continue exploring R.
We’ll leave you with a simple example of how to use purrr to deal with a task we already looked at with for loops. In fact, we’ll return to the regression exercise that we started with.
Recall that we have been fitting regression models for each year. To fit our models across each of the years separately, we’ll first need to put our data in “tidy” format. It’s really easy with some helpful verbs from dplyr.
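The chunk isn’t shown here; a sketch using dplyr’s group_by() together with tidyr’s nest() (the name by_year is an illustrative assumption):
library(dplyr)
library(tidyr)

by_year <- gapminder %>%
    group_by(year) %>%
    nest()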
Now, we can fit our regression models across each of the years using purrr’s map:
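Again as a sketch, under the same assumptions:
library(purrr)

fits <- by_year %>%
    mutate(fit = map(data, ~ lm(lifeExp ~ log(gdpPercap), data = .)))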
What about protecting ourselves against situations where there isn’t enough variation in the outcome variable (fewer than two non-missing observations in a year)? (It doesn’t happen here but can easily happen in other situations.)
So, can purrr handle this? Yes - in fact, it’s really easy. We can use a verb called safely to separate situations where the model fit succeeds from those where it doesn’t. Let’s try it out.
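A sketch of what that might look like (the object names are illustrative):
safe_fits <- by_year %>%
    mutate(fit = map(data, safely(~ lm(lifeExp ~ log(gdpPercap), data = .))))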
Remark: What we accomplish here with purrr::map is also easily done using tools from base R. In fact, using lapply, we can evaluate the very same lm formula with our gapminder dataset, albeit without the extra goodies offered by the pipe (aka %>%) syntax and the safely convenience, afforded by magrittr and purrr, respectively.
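For instance, here is a rough base R sketch of the same fits:
fits_base <- lapply(split(gapminder, gapminder$year),
                    function(d) lm(lifeExp ~ log(gdpPercap), data = d))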
Ok. Now, we’ll look at the results for one year, just to get a feel for the output.
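A reconstructed call, assuming the safe_fits object from the sketch above (the element shown corresponds to 2007):
safe_fits$fit[[which(safe_fits$year == 2007)]]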
## $result
##
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = .)
##
## Coefficients:
## (Intercept) log(gdpPercap)
## 4.950 7.203
##
##
## $error
## NULL
Now that we’ve seen what we can do with purrr, it’s important to compare this modern approach to what is offered in base R. (It’s very important to understand and be comfortable with using tools from base R before adopting those that are offered as part of the tidyverse.) To that end, here are some points of comparison:
- Is purrr::map really different from lapply, sapply, and vapply?
- The case for purrr is: a consistent interface and convenient helpers for handling failures (e.g., safely).
- The case for apply and friends: they are part of base R, with no extra dependencies (no need to install the tidyverse).
For a full comparison of purrr versus base R, consider checking out this quite thorough tutorial by Jenny Bryan.
A great reference for learning both basic and advanced concepts in using the R language for data analysis is the book R for Data Science, by Garrett Grolemund and Hadley Wickham. An online version is freely available and may be accessed here. In particular, chapter 21 (“Iteration”) is a great review of much of what we have covered in this module.
Here you can explore the documentation website for the purrr package. It includes details about functionality we did not have time to discuss and many useful examples that you can use to go further with purrr.
Here you can browse the tidyverse documentation website. It includes an introduction to the core packages, the philosophy of this ecosystem of packages, and much more.
Write an R function that will take an input vector and set any value below a threshold to be the value of the threshold. Optionally, the function should instead set values above the threshold to the value of the threshold.
Augment your function so that it checks that the input is a numeric vector and returns an error if not. (See the help information for stop() or stopifnot().)
Figure out what invisible() does and why it is useful when writing functions. Use invisible() in your function from just above or in the sortByCol() function.
Sort the gapminder dataset and find the shortest life expectancy value. Now consider the use of which.min() and why using that should be much quicker with large datasets.
Extend the sortByCol() function to handle an arbitrary number of columns on which to do the sorting. You may need to modify the functionality a bit to use full vectors for the sorting rather than column names.
Extend the sortByCol() function so that it can take either numeric indices or character strings indicating columns of the input data, or take entire vectors to use in the sorting of the input data. If information specifying columns is given, make sure your function carefully checks that the input refers to legitimate columns and returns an error or warning (see warning()) if not.
Explore scoping in the following code. Explain why the result is 11 and not 3. Note that funGenerator() returns a function as the return object, consistent with the idea that functions are objects in R. This is an example of what is called a closure in R. Basically, the function contains object(s) enclosed with and accessible to the function.
funGenerator <- function(x) {
    x <- 10
    g <- function(y) {
        return(x + y)
    }
    return(g)
}
x <- 2
f <- funGenerator()
f(1)
## [1] 11