August 2022, UC Berkeley
Alan Aw
At the core of R is the idea of doing calculations on entire vectors.
## Vectorized arithmetic
gdpTotal <- gapminder$gdpPercap * gapminder$pop
## Vectorized comparisons
wealthy <- gapminder2007$gdpPercap >= 30000
vec1 <- rnorm(5)
vec2 <- rnorm(5)
vec1 > vec2
## [1] FALSE FALSE TRUE TRUE FALSE
An important related concept is that of recycling:
vec10 <- sample(1:10, 10, replace = TRUE)
vec3 <- sample(1:10, 3, replace = TRUE)
vec5 <- sample(1:10, 5, replace = TRUE)
vec10
## [1] 4 4 2 2 9 9 5 1 10 7
vec3
## [1] 8 6 9
vec5
## [1] 7 5 5 7 6
vec10 + vec5
## [1] 11 9 7 9 15 16 10 6 17 13
vec10 + vec3
## Warning in vec10 + vec3: longer object length is not a multiple of shorter
## object length
## [1] 12 10 11 10 15 18 13 7 19 15
Question: Tell me what’s going on. What choices were made by the R developers?
POLL 4A: 2^3 is raising 2 to the power 3, so you get 8.
(respond at https://pollev.com/chrispaciorek428)
What will this return?
(1:4)^(1:4)
POLL 4B: What will this return?
vec1 <- c(1,2,3,4)
vec2 <- c(1,2)
vec1 + vec2
As we’ve seen, R has many functions that allow you to operate on each element of a vector all at once. Here’s an example of a statistical calculation with simulated values.
Imagine how this code would look if written using a loop, or three separate loops.
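The original code chunk is missing from this rendering; a minimal sketch of the kind of calculation meant (the specific statistic is an assumption, not the original code):

```r
# Simulate values and compute a statistic with vectorized operations only
vals <- rnorm(1000)    # 1000 simulated standard normal values
chi2 <- sum(vals^2)    # sum of squares: a chi-squared-distributed statistic
chi2
```

Each of those one-liners would become an explicit for loop over the 1000 elements if written without vectorization.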
Advantages: the code is more concise and readable, there are fewer places for bugs to hide, and the computation is much faster than an explicit loop.
Sometimes there are surprises in terms of what is fast, as well as tricks for vectorizing things in unexpected ways:
## user system elapsed
## 0.026 0.008 0.034
## user system elapsed
## 0.004 0.000 0.004
Question: What am I doing with vals * (vals > 0)? What happens behind the scenes in R?
If you use a trick like this, having a comment in your code is a good idea.
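For reference, truncating negative values at zero can be written either with a vectorized conditional or with the multiplication trick; a sketch (the variable name vals matches the discussion above, but this is a reconstruction, not the original chunk):

```r
vals <- rnorm(20)

# Conditional version: vectorized, but slower on large vectors
trunc1 <- ifelse(vals > 0, vals, 0)

# Trick: (vals > 0) is a logical vector coerced to 0/1,
# so negative values are multiplied by 0  -- comment your trick!
trunc2 <- vals * (vals > 0)

all(trunc1 == trunc2)
## [1] TRUE
```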
Lots of functions in R are vectorized (i.e., they can take a single value or a vector as an input argument), such as some we’ve already used.
## [1] "52" "57" "62" "67" "72" "77"
Question: Note the code above runs, and the syntax is perfectly good R syntax, but in terms of what it does, there is a bug in it. See if you can spot it.
Recall that +, -, *, and / do vectorized calculations:
## [,1] [,2] [,3]
## [1,] 5 20 35
## [2,] 10 25 40
## [3,] 15 30 45
## [,1] [,2] [,3]
## [1,] 5 8 11
## [2,] 10 13 16
## [3,] 15 18 21
## [,1] [,2] [,3]
## [1,] 4 64 196
## [2,] 16 100 256
## [3,] 36 144 324
## [,1] [,2] [,3]
## [1,] 4 16 28
## [2,] 16 40 64
## [3,] 36 72 108
## [,1]
## [1,] 120
## [2,] 144
## [3,] 168
## [,1] [,2] [,3]
## [1,] 120 264 408
## [2,] 144 324 504
## [3,] 168 384 600
## [1] TRUE
R can do essentially any linear algebra you need. It uses system-level packages called BLAS (basic linear algebra subroutines) and LAPACK (linear algebra package). Note that these calculations will be essentially as fast as if you wrote C code because R just calls C and Fortran routines to do the calculations.
The BLAS that comes with R is fairly slow. It’s possible to use a faster BLAS, as well as one that uses multiple cores automatically. This can in some cases give you an order of magnitude speedup if your work involves a lot of matrix manipulations/linear algebra. More details in Module 10.
Here are some examples of common matrix decompositions: Cholesky decomposition, eigenvalues/eigenvectors, and SVD. These all use BLAS+LAPACK.
## next 3 lines generate a positive definite matrix
library(fields)
times <- seq(0, 1, length = 100)
R <- exp(-rdist(times) / 0.2) # a correlation matrix
######################################################
e <- eigen(R)
range(e$values)
## [1] 0.02525338 32.85537225
e$vectors[, 1] # presumably the first eigenvector (its sign is arbitrary)
## [1] 0.05195413 0.05448567 0.05698864 0.05946173 0.06190363 0.06431308
## [7] 0.06668879 0.06902954 0.07133409 0.07360123 0.07582978 0.07801856
## [13] 0.08016643 0.08227226 0.08433494 0.08635341 0.08832659 0.09025345
## [19] 0.09213298 0.09396420 0.09574615 0.09747789 0.09915851 0.10078713
## [25] 0.10236291 0.10388500 0.10535262 0.10676499 0.10812137 0.10942106
## [31] 0.11066337 0.11184765 0.11297327 0.11403965 0.11504623 0.11599249
## [37] 0.11687791 0.11770205 0.11846447 0.11916476 0.11980256 0.12037754
## [43] 0.12088940 0.12133786 0.12172270 0.12204370 0.12230071 0.12249358
## [49] 0.12262222 0.12268655 0.12268655 0.12262222 0.12249358 0.12230071
## [55] 0.12204370 0.12172270 0.12133786 0.12088940 0.12037754 0.11980256
## [61] 0.11916476 0.11846447 0.11770205 0.11687791 0.11599249 0.11504623
## [67] 0.11403965 0.11297327 0.11184765 0.11066337 0.10942106 0.10812137
## [73] 0.10676499 0.10535262 0.10388500 0.10236291 0.10078713 0.09915851
## [79] 0.09747789 0.09574615 0.09396420 0.09213298 0.09025345 0.08832659
## [85] 0.08635341 0.08433494 0.08227226 0.08016643 0.07801856 0.07582978
## [91] 0.07360123 0.07133409 0.06902954 0.06668879 0.06431308 0.06190363
## [97] 0.05946173 0.05698864 0.05448567 0.05195413
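The Cholesky and SVD calls mentioned above are not shown in this rendering; a self-contained sketch (a small positive definite matrix is built here in place of the fields-based one above):

```r
# Build a small positive definite correlation matrix so the example stands alone
times <- seq(0, 1, length = 10)
R <- exp(-abs(outer(times, times, "-")) / 0.2)

U <- chol(R)                 # upper-triangular Cholesky factor
max(abs(crossprod(U) - R))   # t(U) %*% U recovers R, up to rounding error

sv <- svd(R)                 # singular value decomposition
names(sv)                    # "d" (singular values), "u", "v"
```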
This is slow.
## user system elapsed
## 2.945 0.007 2.954
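The chunk that produced this timing is missing; it presumably grew the vector one element at a time, along these lines (a sketch recreating the pattern, not the exact original):

```r
n <- 50000
system.time({
    vals <- NULL
    # growing an object: R reallocates and copies 'vals' on every iteration
    for(i in 1:n)
        vals <- c(vals, i)
})
```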
The same holds for using rbind(), cbind(), or adding to a list, one element at a time.
Question: Thoughts on why this is so slow? Think about what R might be doing behind the scenes.
Note: This is one area where Python and some other languages handle the situation in a more sophisticated way.
This is not so slow. (Please ignore the for-loop hypocrisy and the fact that I could do this as vals <- 1:n.)
n <- 50000
system.time({
vals <- rep(0, n)
# alternatively: vals <- as.numeric(NA); length(vals) <- n
for(i in 1:n)
vals[i] <- i
})
## user system elapsed
## 0.027 0.001 0.027
Here’s how to pre-allocate an empty list:
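The code for this is missing from the rendering; given the output below, it was presumably:

```r
myList <- vector("list", length = 6)  # a list of 6 NULL elements, ready to fill
myList
```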
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
Some functions aren’t vectorized, or you may want to use a function on every row or column of a matrix/data frame, every element of a list, etc.
For this we use the apply() family of functions.
mat <- matrix(rnorm(100*1000), nr = 100)
row_min <- apply(mat, MARGIN = 1, FUN = min)
col_max <- apply(mat, MARGIN = 2, FUN = max)
There are actually some even faster specialized functions for common cases, such as rowSums(), colSums(), rowMeans(), and colMeans() (and the matrixStats package provides more, e.g., rowMins()).
lapply() and sapply()

These are 'map' operations that apply a function to each element of a list.
## [[1]]
## [1] -0.280493
##
## [[2]]
## [1] -0.07862098
##
## [[3]]
## [1] -1.474092
## [1] -0.28049304 -0.07862098 -1.47409165
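The code behind that output is not shown; it was presumably along these lines (the specific numbers above came from random draws, so they won't be reproduced exactly):

```r
myList <- list(rnorm(3), rnorm(3), rnorm(5))
lapply(myList, mean)   # returns a list of means
sapply(myList, mean)   # simplifies the result to a numeric vector
```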
Note that we don’t generally want to use apply()
on a
data frame.
Recall that a dataframe is a list, hence this works:
## country continent year lifeExp pop gdpPercap
## "factor" "factor" "integer" "numeric" "integer" "numeric"
## year2
## "character"
lapply() and sapply() with vectors

You can use lapply() and sapply() on regular vectors, such as vectors of indices, which can come in handy.
This is a bit silly but it illustrates the idea:
## [[1]]
## [1] 2.376375
##
## [[2]]
## [1] 2.164874
##
## [[3]]
## [1] 2.816697
##
## [[4]]
## [1] 1.717506
##
## [[5]]
## [1] 3.356898
##
## [[6]]
## [1] 3.28338
## [1] 1 4 9 16 25 36 49 64 81 100
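The last line of output above is consistent with mapping a squaring function over an index vector, e.g.:

```r
sapply(1:10, function(i) i^2)  # same result as the vectorized (1:10)^2
## [1]   1   4   9  16  25  36  49  64  81 100
```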
POLL 4C: What do you think this will return:
(respond at https://pollev.com/chrispaciorek428)
sapply(1:6, function(n) rnorm(n))
Other apply()s (optional)

There are a bunch of apply() variants, as well as parallelized versions of them:
tapply(), vapply(), mapply(), rapply(), eapply()
For parallelized versions, see ?clusterApply (in the parallel package).
table() is our friend

##
## Africa Americas Asia Europe Oceania
## Afghanistan 0 0 12 0 0
## Albania 0 0 0 12 0
## Algeria 12 0 0 0 0
## Angola 12 0 0 0 0
## Argentina 0 12 0 0 0
## Australia 0 0 0 0 12
## Afghanistan Albania Algeria
## 12 12 12
## Angola Argentina Australia
## 12 12 12
## Austria Bahrain Bangladesh
## 12 12 12
## Belgium Benin Bolivia
## 12 12 12
## Bosnia and Herzegovina Botswana Brazil
## 12 12 12
## Bulgaria Burkina Faso Burundi
## 12 12 12
## Cambodia Cameroon
## 12 12
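The calls that produced the tables above are missing; they were presumably table(gapminder$country, gapminder$continent) and table(gapminder$country). A self-contained sketch with a toy stand-in (each country deliberately has 12 observations, as in gapminder):

```r
# toy stand-in for gapminder: 3 countries, 12 yearly observations each
gm <- data.frame(country = rep(c("Albania", "Algeria", "Australia"), each = 12),
                 continent = rep(c("Europe", "Africa", "Oceania"), each = 12))
tbl <- table(gm$country, gm$continent)  # country-by-continent counts
tbl
all(rowSums(tbl) == 12)                 # each country appears 12 times
```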
## Here we do a data validity check: all countries should have 12 obs in the same continent
all(rowSums(tbl) == 12)
## [1] TRUE
You may need to discretize a continuous variable (or a discrete variable with many levels), e.g., life expectancy. We can use cut() to do it:
gapminder2007$lifeExpBin <- cut(gapminder2007$lifeExp,
breaks = c(0, 40, 50, 60, 70, 75, 80, Inf))
tbl <- table(gapminder2007$continent, gapminder2007$lifeExpBin)
round( prop.table(tbl, margin = 1), 2 )
##
## (0,40] (40,50] (50,60] (60,70] (70,75] (75,80] (80,Inf]
## Africa 0.02 0.33 0.42 0.10 0.12 0.02 0.00
## Americas 0.00 0.00 0.00 0.12 0.48 0.36 0.04
## Asia 0.00 0.03 0.06 0.24 0.39 0.18 0.09
## Europe 0.00 0.00 0.00 0.00 0.27 0.50 0.23
## Oceania 0.00 0.00 0.00 0.00 0.00 0.00 1.00
Often we want to do individual analyses within subsets or clusters of our data.
As a first step, we might want to just split our dataset by a stratifying variable (perhaps after using cut() to create a categorical variable).
## [1] 12
## [1] 142 7
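The split() call itself is missing; with the real data it was presumably subsets <- split(gapminder, gapminder$year), giving the length and dimensions shown above. A toy stand-in so the sketch runs on its own:

```r
# toy stand-in: 2 years, 3 rows each
gm <- data.frame(year = rep(c(1952, 2007), each = 3),
                 lifeExp = rnorm(6, mean = 60))
subsets <- split(gm, gm$year)  # a named list of data frames, one per year
length(subsets)
dim(subsets[['2007']])
```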
par(mfrow = c(1,2))
plot(lifeExp ~ gdpPercap, data = subsets[['1952']], main = '1952')
abline(h = 0, col = 'grey')
plot(lifeExp ~ gdpPercap, data = subsets[['2007']], main = '2007')
abline(h = 0, col = 'grey')
Obviously, we’d want to iterate to improve that plot given the outlier.
POLL 4D: What kind of object is produced by split(), e.g., subsets <- split(gap, gap$year)?
(respond at https://pollev.com/chrispaciorek428)
Once you have the stratified dataset you can use lapply() to apply a function to each item in the list.
This is known as a ‘map’ operation in other languages.
meanfun <- function(sub) {
mean(sub$lifeExp)
}
meanLifeExpByYear <- sapply(subsets, meanfun)
names(subsets)
## [1] "1952" "1957" "1962" "1967" "1972" "1977" "1982" "1987" "1992" "1997"
## [11] "2002" "2007"
## 1952 1957 1962 1967 1972 1977 1982 1987
## 49.05762 51.50740 53.60925 55.67829 57.64739 59.57016 61.53320 63.21261
## 1992 1997 2002 2007
## 64.16034 65.01468 65.69492 67.00742
We’ll talk in detail about writing functions in Module 5.
Often we want to do individual analyses within subsets or clusters of our data. R has a variety of tools for this; for now we'll look at aggregate() and by(). These are wrappers of tapply().
gmSmall <- gapminder[ , c('lifeExp', 'gdpPercap')] # reduce to only numeric columns
aggregate(gmSmall, by = list(year = gapminder$year), FUN = median, na.rm = TRUE) # na.rm not needed here but illustrates use of additional arguments to FUN
## year lifeExp gdpPercap
## 1 1952 45.1355 1968.528
## 2 1957 48.3605 2173.220
## 3 1962 50.8810 2335.440
## 4 1967 53.8250 2678.335
## 5 1972 56.5300 3339.129
## 6 1977 59.6720 3798.609
## 7 1982 62.4415 4216.228
## 8 1987 65.8340 4280.300
## 9 1992 67.7030 4386.086
## 10 1997 69.3940 4781.825
## 11 2002 70.8255 5319.805
## 12 2007 71.9355 6124.371
aggregate(lifeExp ~ year + continent, data = gapminder, FUN = median)
## year continent lifeExp
## 1 1952 Africa 38.8330
## 2 1957 Africa 40.5925
## 3 1962 Africa 42.6305
## 4 1967 Africa 44.6985
## 5 1972 Africa 47.0315
## 6 1977 Africa 49.2725
## 7 1982 Africa 50.7560
## 8 1987 Africa 51.6395
## 9 1992 Africa 52.4290
## 10 1997 Africa 52.7590
## 11 2002 Africa 51.2355
## 12 2007 Africa 52.9265
## 13 1952 Americas 54.7450
## 14 1957 Americas 56.0740
## 15 1962 Americas 58.2990
## 16 1967 Americas 60.5230
## 17 1972 Americas 63.4410
## 18 1977 Americas 66.3530
## 19 1982 Americas 67.4050
## 20 1987 Americas 69.4980
## 21 1992 Americas 69.8620
## 22 1997 Americas 72.1460
## 23 2002 Americas 72.0470
## 24 2007 Americas 72.8990
## 25 1952 Asia 44.8690
## 26 1957 Asia 48.2840
## 27 1962 Asia 49.3250
## 28 1967 Asia 53.6550
## 29 1972 Asia 56.9500
## 30 1977 Asia 60.7650
## 31 1982 Asia 63.7390
## 32 1987 Asia 66.2950
## 33 1992 Asia 68.6900
## 34 1997 Asia 70.2650
## 35 2002 Asia 71.0280
## 36 2007 Asia 72.3960
## 37 1952 Europe 65.9000
## 38 1957 Europe 67.6500
## 39 1962 Europe 69.5250
## 40 1967 Europe 70.6100
## 41 1972 Europe 70.8850
## 42 1977 Europe 72.3350
## 43 1982 Europe 73.4900
## 44 1987 Europe 74.8150
## 45 1992 Europe 75.4510
## 46 1997 Europe 76.1160
## 47 2002 Europe 77.5365
## 48 2007 Europe 78.6085
## 49 1952 Oceania 69.2550
## 50 1957 Oceania 70.2950
## 51 1962 Oceania 71.0850
## 52 1967 Oceania 71.3100
## 53 1972 Oceania 71.9100
## 54 1977 Oceania 72.8550
## 55 1982 Oceania 74.2900
## 56 1987 Oceania 75.3200
## 57 1992 Oceania 76.9450
## 58 1997 Oceania 78.1900
## 59 2002 Oceania 79.7400
## 60 2007 Oceania 80.7195
agg <- aggregate(lifeExp ~ year + continent , data = gapminder, FUN = median)
xtabs(lifeExp ~ ., data = agg)
## continent
## year Africa Americas Asia Europe Oceania
## 1952 38.8330 54.7450 44.8690 65.9000 69.2550
## 1957 40.5925 56.0740 48.2840 67.6500 70.2950
## 1962 42.6305 58.2990 49.3250 69.5250 71.0850
## 1967 44.6985 60.5230 53.6550 70.6100 71.3100
## 1972 47.0315 63.4410 56.9500 70.8850 71.9100
## 1977 49.2725 66.3530 60.7650 72.3350 72.8550
## 1982 50.7560 67.4050 63.7390 73.4900 74.2900
## 1987 51.6395 69.4980 66.2950 74.8150 75.3200
## 1992 52.4290 69.8620 68.6900 75.4510 76.9450
## 1997 52.7590 72.1460 70.2650 76.1160 78.1900
## 2002 51.2355 72.0470 71.0280 77.5365 79.7400
## 2007 52.9265 72.8990 72.3960 78.6085 80.7195
Notice the ‘long’ vs. ‘wide’ formats. You’ll see more about that sort of thing in Module 6.
While you can use aggregate() for this, many people use dplyr, as we'll discuss in Module 6.
aggregate() works fine when the output is univariate, but what about analyses more complicated than computing a median, such as fitting a set of regressions?
out <- by(gapminder, gapminder$year,
function(sub) {
lm(lifeExp ~ log(gdpPercap), data = sub)
}
)
length(out)
## [1] 12
##
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.947 -2.661 1.215 4.469 13.115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9496 3.8577 1.283 0.202
## log(gdpPercap) 7.2028 0.4423 16.283 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.122 on 140 degrees of freedom
## Multiple R-squared: 0.6544, Adjusted R-squared: 0.652
## F-statistic: 265.2 on 1 and 140 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.9571 -5.7319 0.7517 6.5770 13.7361
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.8457 5.0668 -3.522 0.000578 ***
## log(gdpPercap) 8.8298 0.6626 13.326 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.146 on 140 degrees of freedom
## Multiple R-squared: 0.5592, Adjusted R-squared: 0.556
## F-statistic: 177.6 on 1 and 140 DF, p-value: < 2.2e-16
This is handy, but it's just slightly simpler syntax than using split() followed by lapply().
sort() applied to a vector does what you expect.
Sorting a matrix or data frame based on one or more columns is a somewhat manual process, but once you get the hang of it, it's not bad.
## [1] 804 672 696 1488 72
## # A tibble: 6 × 7
## country continent year lifeExp pop gdpPercap year2
## <fct> <fct> <int> <dbl> <int> <dbl> <chr>
## 1 Japan Asia 2007 82.6 127467972 31656. 07
## 2 Hong Kong, China Asia 2007 82.2 6980412 39725. 07
## 3 Iceland Europe 2007 81.8 301931 36181. 07
## 4 Switzerland Europe 2007 81.7 7554661 37506. 07
## 5 Australia Oceania 2007 81.2 20434176 34435. 07
## 6 Spain Europe 2007 80.9 40448191 28821. 07
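The code behind the sorted output above is missing; with the real data it was presumably order() on lifeExp in decreasing order, e.g. gapminder[order(gapminder$lifeExp, decreasing = TRUE), ]. A self-contained toy version:

```r
gm <- data.frame(country = c("Chad", "Japan", "Iceland"),
                 lifeExp = c(50.7, 82.6, 81.8))
ord <- order(gm$lifeExp, decreasing = TRUE)  # row indices, largest lifeExp first
gm[ord, ]
```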
You could of course write your own sort function that uses order(). More in Module 5.
Or, most people probably use dplyr::arrange(). More in Module 6.
We often need to combine data across multiple data frames, merging on common fields (i.e., keys). In database terminology, this is a join operation.
Suppose that our dataset did not have ‘continent’ in it, but that we had a separate data frame that matches country to continent.
# ignore the 'wizard' behind the curtain...
c2c <- unique(gapminder[ , c('country', 'continent')])
gapminderSave <- gapminder
gapminder <- gapminder[ , -which(names(gapminder) == "continent")]
# Now our gapminder dataset doesn't have continent
head(gapminder)
## # A tibble: 6 × 6
## country year lifeExp pop gdpPercap year2
## <fct> <int> <dbl> <int> <dbl> <chr>
## 1 Afghanistan 1952 28.8 8425333 779. 52
## 2 Afghanistan 1957 30.3 9240934 821. 57
## 3 Afghanistan 1962 32.0 10267083 853. 62
## 4 Afghanistan 1967 34.0 11537966 836. 67
## 5 Afghanistan 1972 36.1 13079460 740. 72
## 6 Afghanistan 1977 38.4 14880372 786. 77
dplyr's join functions

The dplyr package (to be discussed in detail in Module 6) provides a variety of join operations like those used in databases.
Now let’s add the continent info in:
## # A tibble: 6 × 2
## country continent
## <fct> <fct>
## 1 Afghanistan Asia
## 2 Albania Europe
## 3 Algeria Africa
## 4 Angola Africa
## 5 Argentina Americas
## 6 Australia Oceania
## # A tibble: 6 × 6
## country year lifeExp pop gdpPercap year2
## <fct> <int> <dbl> <int> <dbl> <chr>
## 1 Afghanistan 1952 28.8 8425333 779. 52
## 2 Afghanistan 1957 30.3 9240934 821. 57
## 3 Afghanistan 1962 32.0 10267083 853. 62
## 4 Afghanistan 1967 34.0 11537966 836. 67
## 5 Afghanistan 1972 36.1 13079460 740. 72
## 6 Afghanistan 1977 38.4 14880372 786. 77
## [1] 1704 7
## [1] 1704 7
## [1] FALSE
## [1] TRUE
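The join calls themselves are missing from the rendering; with dplyr the pattern was presumably left_join(gapminder, c2c, by = "country"), sketched here with toy stand-ins for gapminder (now lacking continent) and the c2c lookup table:

```r
library(dplyr)

gm  <- data.frame(country = c("Afghanistan", "Afghanistan", "Albania"),
                  year = c(1952, 1957, 1952))
c2c <- data.frame(country = c("Afghanistan", "Albania"),
                  continent = c("Asia", "Europe"))

gm2 <- left_join(gm, c2c, by = "country")  # keep every row of gm, matched on country
gm2
```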
You can also do left, right, and full joins.
merge() (optional)

Alternatively, we could use the base R function merge().
Now let’s add the continent info in:
## # A tibble: 6 × 2
## country continent
## <fct> <fct>
## 1 Afghanistan Asia
## 2 Albania Europe
## 3 Algeria Africa
## 4 Angola Africa
## 5 Argentina Americas
## 6 Australia Oceania
## # A tibble: 6 × 6
## country year lifeExp pop gdpPercap year2
## <fct> <int> <dbl> <int> <dbl> <chr>
## 1 Afghanistan 1952 28.8 8425333 779. 52
## 2 Afghanistan 1957 30.3 9240934 821. 57
## 3 Afghanistan 1962 32.0 10267083 853. 62
## 4 Afghanistan 1967 34.0 11537966 836. 67
## 5 Afghanistan 1972 36.1 13079460 740. 72
## 6 Afghanistan 1977 38.4 14880372 786. 77
gapminder <- merge(gapminder, c2c, by.x = 'country', by.y = 'country',
all.x = TRUE, all.y = FALSE)
dim(gapminderSave)
## [1] 1704 7
## [1] 1704 7
## [1] FALSE
## [1] TRUE
What's the deal with all.x and all.y? They allow one to request the equivalent of left, right, and full joins: we can tell R whether we want to keep all of the x observations, all of the y observations, both, or neither, when there are rows in one dataset that don't match the other.
Create a vector that concatenates the country and year to create a ‘country-year’ variable in a vectorized way using the string processing functions.
Use table() to figure out the number of countries available for each continent.
Explain the steps of what this code is doing: tmp <- gapminder[ , -which(names(gapminder) == "continent")].
Compute the number of NAs in each column of the gapminder dataset using sapply() and making use of the is.na() function. It's possible to do this without writing a function (a topic we'll cover in Module 5).
Discretize gdpPercap into some bins and create a gdpPercap_binned variable. Count the number of values in each bin.
Create a boxplot of life expectancy by binned gdpPercap.
Create a dataframe that has the total (i.e., world) population across all the countries for each year.
Merge the info from problem 7 back into the original gapminder dataset. Now plot life expectancy as a function of world population.
Suppose we have two categorical variables and we conduct a hypothesis test of independence. The chi-square statistic is:

chi2 = sum_ij (y_ij - e_ij)^2 / e_ij,

where e_ij = y_i. * y_.j / y.., with y_i. the sum of the values in the i'th row, y_.j the sum of the values in the j'th column, and y.. the sum of all the values. Suppose I give you a matrix in R with the y values.
You can generate a test matrix as:
Compute the statistic without any loops, using the matrices y and e. Hint: the outer() function can be used to compute e from the row and column sums of y.
For each combination of year and continent, find the 95th percentile of life expectancy.
Here’s a cool trick to pull off a particular element of a list of lists:
params <- list(a = list(mn = 7, sd = 3), b = list(mn = 6,sd = 1),
c = list(mn = 2, sd = 1))
sapply(params, "[[", 1)
## a b c
## 7 6 2
Explain what that does and why it works.
Hint:
## [1] 7
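The hint's code is missing from the rendering; given the output, it presumably extracted one element directly, showing that "[[" is itself a function:

```r
params <- list(a = list(mn = 7, sd = 3), b = list(mn = 6, sd = 1),
               c = list(mn = 2, sd = 1))
params[['a']][[1]]       # the first element ('mn') of the 'a' sublist
## [1] 7
sapply(params, '[[', 1)  # the same extraction applied to every sublist
```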