August 2022, UC Berkeley
Chris Paciorek
Collections of disparate or complicated objects
myList <- list(stuff = 3, mat = matrix(1:4, nrow = 2),
moreStuff = c("china", "japan"), list(5, "bear"))
myList
## $stuff
## [1] 3
##
## $mat
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## $moreStuff
## [1] "china" "japan"
##
## [[4]]
## [[4]][[1]]
## [1] 5
##
## [[4]][[2]]
## [1] "bear"
## [1] "china" "japan"
## [1] TRUE
## [1] "japan"
## [1] "bear"
## $stuff
## [1] 3
##
## $mat
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## $moreStuff
## [1] "china" "japan"
## [1] "stuff" "mat" "moreStuff" "" "newOne"
Lists can be used as vectors of complicated objects. E.g., suppose you have a linear regression for each value of a stratifying variable. You could have a list of regression fits. Each regression fit will itself be a list, so you’ll have a list of lists.
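As a minimal sketch of the "list of regression fits" idea, the following uses simulated toy data. The names `dat`, `g`, `x`, `y`, `pieces` and `fits` are made up for illustration and are not part of the bootcamp materials.

```r
# Hypothetical toy data: response y, predictor x, stratifying variable g
set.seed(1)
dat <- data.frame(g = rep(c("a", "b", "c"), each = 10),
                  x = rnorm(30))
dat$y <- 2 * dat$x + rnorm(30)

# split() returns a list of data frames, one per level of g
pieces <- split(dat, dat$g)

# lapply() fits one regression per piece and returns a list of fits
fits <- lapply(pieces, function(d) lm(y ~ x, data = d))

names(fits)            # one fit per stratum
is.list(fits[["a"]])   # each lm fit is itself a list
coef(fits[["a"]])      # pull the coefficients out of one fit
```

Because each element is a full fit object, you can later loop over `fits` with `lapply()` or `sapply()` to extract whatever you need from every stratum.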
POLL 3A: How would you extract “china” from this list?
(respond at https://pollev.com/chrispaciorek428)
myList <- list(stuff = 3, mat = matrix(1:4, nrow = 2),
moreStuff = c("china", "japan"), list(5, "bear"))
A review from Module 1…
## [1] "tbl_df" "tbl" "data.frame"
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
## [1] TRUE
## [1] 6
## [1] 1952 1957 1962 1967 1972
## $country
## [1] "factor"
##
## $continent
## [1] "factor"
##
## $year
## [1] "integer"
##
## $lifeExp
## [1] "numeric"
##
## $pop
## [1] "integer"
##
## $gdpPercap
## [1] "numeric"
lapply() is a function used on lists; it works here to apply the class() function to each element of the list, which in this case is each field/column.
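A small sketch of the same idea on a hand-built data frame (the data frame `df` and its columns are hypothetical stand-ins for the gapminder data):

```r
# A data frame is a list of columns, so list tools work on it
df <- data.frame(name = c("x", "y"),
                 value = c(1.5, 2.5),
                 count = 1:2,
                 stringsAsFactors = FALSE)

is.list(df)          # a data frame is a list
lapply(df, class)    # returns a list with one class per column
sapply(df, class)    # same information, simplified to a named vector
```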
## [1] 6
## # A tibble: 6 × 2
## year pop
## <int> <int>
## 1 1952 8425333
## 2 1957 9240934
## 3 1962 10267083
## 4 1967 11537966
## 5 1972 13079460
## 6 1977 14880372
## [1] TRUE
In general the placement of commas in R is crucial, but here, two different operations give the same result because of the underlying structure of data frames.
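The point about commas can be sketched on a small hypothetical data frame `df`: with a comma you are using matrix-style indexing (rows, columns); without one you are indexing the underlying list of columns. A matrix is included for contrast, since there the comma changes the result.

```r
df <- data.frame(year = c(1952L, 1957L),
                 pop = c(10, 12),
                 lifeExp = c(28.8, 30.3))

a <- df[ , c("year", "pop")]   # matrix-style: all rows, two columns
b <- df[c("year", "pop")]      # list-style: two elements of the list
all.equal(a, b)

# For a matrix, the comma matters:
m <- matrix(1:6, nrow = 2)
m[2]       # second element of the underlying vector
m[ , 2]    # entire second column
```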
If you need to do numeric calculations on an entire non-vector object (dimension > 1), you generally want to use matrices and arrays, not data frames.
## [,1] [,2] [,3] [,4]
## [1,] -0.997563234 -1.4843480 0.67704596 -0.6704612
## [2,] 0.001128528 -0.2429397 0.01286389 -0.5274799
## [3,] 0.667443893 -0.6124777 1.34872524 0.7868177
## [,1] [,2] [,3] [,4]
## [1,] -3.990252937 -5.9373919 2.70818386 -2.681845
## [2,] 0.004514113 -0.9717588 0.05145555 -2.109920
## [3,] 2.669775572 -2.4499108 5.39490097 3.147271
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.997563234 -1.4843480 0.67704596 -0.6704612 1
## [2,] 0.001128528 -0.2429397 0.01286389 -0.5274799 2
## [3,] 0.667443893 -0.6124777 1.34872524 0.7868177 3
# Let's convert the gapminder dataframe to a matrix:
gm_mat <- as.matrix(gapminder[ , c('lifeExp', 'gdpPercap')])
head(gm_mat)
## lifeExp gdpPercap
## [1,] 28.801 779.4453
## [2,] 30.332 820.8530
## [3,] 31.997 853.1007
## [4,] 34.020 836.1971
## [5,] 36.088 739.9811
## [6,] 38.438 786.1134
POLL 3B: Recall that the gapminder data frame has some columns that are numeric and some that are character strings (or factors). What do you think will happen if we do this:
as.matrix(gapminder)
Arrays are like matrices but can have more or fewer than two dimensions.
## , , 1
##
## [,1] [,2] [,3]
## [1,] 0.8342853 1.0857439 0.2890717
## [2,] -0.4741951 0.3268233 0.5573720
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 0.4952659 -1.309363 0.8070183
## [2,] 0.8302991 -1.280746 -0.4620758
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 0.8342853 1.0857439 0.2890717
## [2,] -0.4741951 0.3268233 0.5573720
##
## , , 4
##
## [,1] [,2] [,3]
## [1,] 0.4952659 -1.309363 0.8070183
## [2,] 0.8302991 -1.280746 -0.4620758
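The code that produced the output above is not shown here. As a sketch under that caveat, here is one way to build and index a three-dimensional array. It uses integers rather than random numbers so the results are predictable, and it supplies fewer values than the array needs, so R recycles them (which would also explain repeated slices like those printed above).

```r
# 12 values recycled to fill a 2 x 3 x 4 array (24 cells)
arr <- array(1:12, dim = c(2, 3, 4))

dim(arr)
arr[2, 3, 1]     # one element: row 2, column 3, slice 1
arr[ , , 1]      # the first 2 x 3 slice
identical(arr[ , , 1], arr[ , , 3])   # recycling repeats the slices

# An array can also have a single dimension
arr1 <- array(1:3, dim = 3)
dim(arr1)
```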
Objects have attributes.
## $dim
## [1] 3 5
## [,1] [,2] [,3] [,4] [,5]
## first -0.997563234 -1.4843480 0.67704596 -0.6704612 1
## middle 0.001128528 -0.2429397 0.01286389 -0.5274799 2
## last 0.667443893 -0.6124777 1.34872524 0.7868177 3
## $dim
## [1] 3 5
##
## $dimnames
## $dimnames[[1]]
## [1] "first" "middle" "last"
##
## $dimnames[[2]]
## NULL
## [1] "names" "class" "row.names"
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
## [1] 1 2 3 4 5 6 7 8 9 10
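A minimal sketch of inspecting and setting attributes, using a small integer matrix and a hypothetical one-column data frame in place of the objects printed above:

```r
mat <- matrix(1:15, nrow = 3)
attributes(mat)                      # only 'dim' so far

rownames(mat) <- c("first", "middle", "last")
names(attributes(mat))               # now 'dim' and 'dimnames'
attr(mat, "dim")                     # fetch a single attribute by name

df <- data.frame(a = 1:2)
names(attributes(df))                # data frames carry names, class, row.names
```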
Now let’s do a bit of manipulation and see if you can infer how R represents matrices internally.
POLL 3C: Consider our matrix ‘mat’:
mat <- matrix(1:16, nrow = 4, ncol = 4)
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
Suppose I run this code: mat[4]
What do you think will be returned?
Question: What can you infer about what a matrix is in R?
Question: What kind of object are the attributes themselves? How do I check?
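R stores matrices in column-major order: the values fill the first column, then the second, and so on. This is like Fortran, MATLAB and Julia but not like C or Python (NumPy), which default to row-major order.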
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
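A short sketch of column-major storage, using a 3 x 4 matrix like the one printed above. Single-number indexing walks down the columns of the underlying vector.

```r
mat <- matrix(1:12, nrow = 3)

mat[4]       # 4th element of the underlying vector: row 1, column 2
mat[2, 3]    # row 2, column 3
mat[8]       # same element, since (3 - 1) * 3 + 2 = 8

# byrow = TRUE changes how the values are laid out when the matrix is built
matrix(1:12, nrow = 3, byrow = TRUE)
```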
You can go smoothly back and forth between a matrix (or an array) and a vector:
## [1] TRUE
## [1] FALSE
This is a common cause of bugs!
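The calls behind the TRUE/FALSE output above are not shown, so the following is only a sketch of the kind of round trip that can go wrong: rebuilding a matrix from a vector gives back the original only if the fill order matches.

```r
mat <- matrix(1:12, nrow = 3)
vec <- c(mat)                        # flatten to a plain vector (column order)

mat2 <- matrix(vec, nrow = 3, ncol = 4)
identical(mat, mat2)                 # same fill order, same matrix

mat3 <- matrix(vec, nrow = 3, ncol = 4, byrow = TRUE)
identical(mat, mat3)                 # row-wise fill scrambles the layout
```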
Since it was designed by statisticians, R handles missing values very well relative to other languages.
NA is a missing value
## [1] -1.04710949 -0.25433306 NA 1.30207420 NA 0.19756982
## [7] -1.44054992 0.76560416 0.15789745 -0.04049116 -1.03669646 -1.51556459
## [1] 12
## [1] NA
## [1] -2.911599
## [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Be careful because many R functions won’t warn you that they are ignoring the missing values.
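A small sketch of how missing values behave, using a hypothetical vector `vec`. The `table()` call at the end is one example of a base R function that drops NAs by default without any warning.

```r
vec <- c(3.2, NA, 1.5, NA, 4.1)

length(vec)               # NAs still count as elements
sum(vec)                  # NA: the result is unknown if any input is unknown
sum(vec, na.rm = TRUE)    # explicitly drop the NAs first
is.na(vec)                # logical vector marking the missing values

# table() silently leaves out the NA unless you ask for it
table(c("a", NA, "a"))
table(c("a", NA, "a"), useNA = "ifany")
```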
## [1] Inf
## [1] Inf
NaN stands for Not a Number
## Warning in sqrt(-5): NaNs produced
## [1] NaN
## [1] NaN
## [1] Inf
NULL
## [1] -1.04710949 -0.25433306 NA 1.30207420 NA 0.19756982
## [7] -1.44054992 0.76560416 0.15789745 -0.04049116 -1.03669646 -1.51556459
## [1] 12
## numeric(0)
## NULL
## [1] TRUE
## $b
## [1] 5
NA can hold a place but NULL cannot.
NULL is useful for having a function argument default to ‘nothing’. See help(crossprod), which can compute either t(X) %*% X (when the second argument is left as NULL) or t(X) %*% Y.
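The NULL-default pattern can be sketched with a hypothetical helper, `cross_sketch()`, which is not part of R. It only mimics how a function can treat a missing second argument as "use the first one again", and it is compared against the real `crossprod()`.

```r
# Hypothetical helper: y defaults to NULL, meaning "use x for both"
cross_sketch <- function(x, y = NULL) {
  if (is.null(y)) y <- x
  t(x) %*% y
}

X <- matrix(1:6, nrow = 3)
Y <- matrix(1:3, nrow = 3)

all.equal(cross_sketch(X), crossprod(X))         # one-argument form
all.equal(cross_sketch(X, Y), crossprod(X, Y))   # two-argument form
```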
POLL 3D
(just respond in your head; I won’t collect the answers online)
Question 1: Consider the following vector:
vec <- c(3, NA, 7)
What is vec[2]:
Question 2: Consider this vector:
vec <- c(3, NULL, 7)
What is vec[2]:
Question 3: Consider this list:
mylist <- list(3, NULL, 7)
What is mylist[[2]]:
Question 4: Consider this code:
mylist <- list(3, 5, 7)
mylist[[2]] <- NULL
What is length(mylist):
gapminder2007 <- gapminder[gapminder$year == 2007, ]
wealthy <- gapminder2007$gdpPercap > 35000
healthy <- gapminder2007$lifeExp > 75
head(wealthy)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## wealthy
## FALSE TRUE
## 130 12
## # A tibble: 12 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Austria Europe 2007 79.8 8199783 36126.
## 2 Canada Americas 2007 80.7 33390141 36319.
## 3 Denmark Europe 2007 78.3 5468120 35278.
## 4 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 5 Iceland Europe 2007 81.8 301931 36181.
## 6 Ireland Europe 2007 78.9 4109086 40676.
## 7 Kuwait Asia 2007 77.6 2505559 47307.
## 8 Netherlands Europe 2007 79.8 16570613 36798.
## 9 Norway Europe 2007 80.2 4627926 49357.
## 10 Singapore Asia 2007 80.0 4553009 47143.
## 11 Switzerland Europe 2007 81.7 7554661 37506.
## 12 United States Americas 2007 78.2 301139947 42952.
## # A tibble: 44 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 2007 76.4 3600523 5937.
## 2 Argentina Americas 2007 75.3 40301927 12779.
## 3 Australia Oceania 2007 81.2 20434176 34435.
## 4 Austria Europe 2007 79.8 8199783 36126.
## 5 Bahrain Asia 2007 75.6 708573 29796.
## 6 Belgium Europe 2007 79.4 10392226 33693.
## 7 Canada Americas 2007 80.7 33390141 36319.
## 8 Chile Americas 2007 78.6 16284741 13172.
## 9 Costa Rica Americas 2007 78.8 4133884 9645.
## 10 Croatia Europe 2007 75.7 4493312 14619.
## # … with 34 more rows
## # ℹ Use `print(n = ...)` to see more rows
## # A tibble: 0 × 6
## # … with 6 variables: country <fct>, continent <fct>, year <int>,
## # lifeExp <dbl>, pop <int>, gdpPercap <dbl>
## # ℹ Use `colnames()` to see all variable names
## [1] 44
## [1] 0.3098592
Question: What do you think R is doing to do arithmetic on logical vectors?
You can use the as() family of functions.
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
## [1] 3.7 4.8
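The exact calls behind the output above are not shown. A sketch of some common members of the as() family, including two cases where the conversion quietly loses information:

```r
as.character(1:3)               # numbers to strings
as.numeric(c("3.7", "4.8"))     # strings to numbers
as.integer(3.9)                 # truncates toward zero rather than rounding
as.logical(c("TRUE", "yes"))    # "yes" is not recognized and becomes NA

# Unparseable strings become NA (with a warning, suppressed here)
suppressWarnings(as.numeric("abc"))
```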
Be careful: R tries to be helpful and convert between types/classes when it thinks it’s a good idea. Sometimes it is overly optimistic.
## [1] 1 2
## integer(0)
POLL 3E:
(just respond in your head; I won’t collect the answers online)
Question 1: What do you think this will do?
ints <- 1:5
ints[0.9999]
Question 2: What does the code do when it tries to use 0.9999 to subset?
## let's read the Gapminder data from a file with a special argument:
gapminder <- read.csv(file.path('..', 'data', 'gapminder-FiveYearData.csv'),
stringsAsFactors = TRUE) # This was the default before R 4.0
class(gapminder$continent)
## [1] "factor"
## [1] Asia Asia Asia Asia Asia Asia
## Levels: Africa Americas Asia Europe Oceania
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
This example is a bit artificial as ‘continent’ doesn’t really have a natural ordering.
gapminder$continent2 <- ordered(gapminder$continent,
levels = levels(gapminder$continent)[c(2,1,3,4,5)])
head(gapminder$continent2)
## [1] Asia Asia Asia Asia Asia Asia
## Levels: Americas < Africa < Asia < Europe < Oceania
## [1] "Americas" "Africa" "Asia" "Europe" "Oceania"
students <- factor(c('basic','proficient','advanced','basic',
'advanced', 'minimal'))
levels(students)
## [1] "advanced" "basic" "minimal" "proficient"
## [1] 2 4 1 2 1 3
## attr(,"levels")
## [1] "advanced" "basic" "minimal" "proficient"
students <- factor(c('basic','proficient','advanced','basic',
'advanced', 'minimal'))
score = c(minimal = 65, basic = 75, advanced = 95, proficient = 85) # a named vector
score["advanced"] # look up by name
## advanced
## 95
## [1] advanced
## Levels: advanced basic minimal proficient
## minimal
## 65
## advanced
## 95
What went wrong and how did we fix it? Notice how easily this could be a big bug in your code.
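A sketch of the lookup that produces output like that shown above. The key fact is that a factor used as an index is treated as its integer codes, not as its labels, so converting to character first gives the lookup by name.

```r
students <- factor(c("basic", "proficient", "advanced", "basic",
                     "advanced", "minimal"))
score <- c(minimal = 65, basic = 75, advanced = 95, proficient = 85)

students[3]                        # the label is "advanced"
as.integer(students[3])            # but its integer code is 1 (levels are sorted)

score[students[3]]                 # indexes by the code, i.e. score[1]
score[as.character(students[3])]   # indexes by name, as intended
```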
R has lots of functionality for character strings. Usually these are stored as vectors of strings, each string of arbitrary length.
## [1] 5
## [1] 2 5 8 8 13
## [1] "bill clinton"
## [1] "hi hallo mother's father's He said, \"hi\""
## [[1]]
## [1] "This" "is" "the" "R" "bootcamp"
## [1] "Afg" "Alb" "Alg" "Ang" "Arg" "Aus" "Aus" "Bah" "Ban" "Bel" "Ben" "Bol"
## [13] "Bos" "Bot" "Bra" "Bul" "Bur" "Bur" "Cam" "Cam" "Can" "Cen" "Cha" "Chi"
## [25] "Chi" "Col" "Com" "Con" "Con" "Cos" "Cot" "Cro" "Cub" "Cze" "Den" "Dji"
## [37] "Dom" "Ecu" "Egy" "El " "Equ" "Eri" "Eth" "Fin" "Fra" "Gab" "Gam" "Ger"
## [49] "Gha" "Gre" "Gua" "Gui" "Gui" "Hai" "Hon" "Hon" "Hun" "Ice" "Ind" "Ind"
## [61] "Ira" "Ira" "Ire" "Isr" "Ita" "Jam" "Jap" "Jor" "Ken" "Kor" "Kor" "Kuw"
## [73] "Leb" "Les" "Lib" "Lib" "Mad" "Mal" "Mal" "Mal" "Mau" "Mau" "Mex" "Mon"
## [85] "Mon" "Mor" "Moz" "Mya" "Nam" "Nep" "Net" "New" "Nic" "Nig" "Nig" "Nor"
## [97] "Oma" "Pak" "Pan" "Par" "Per" "Phi" "Pol" "Por" "Pue" "Reu" "Rom" "Rwa"
## [109] "Sao" "Sau" "Sen" "Ser" "Sie" "Sin" "Slo" "Slo" "Som" "Sou" "Spa" "Sri"
## [121] "Sud" "Swa" "Swe" "Swi" "Syr" "Tai" "Tan" "Tha" "Tog" "Tri" "Tun" "Tur"
## [133] "Uga" "Uni" "Uni" "Uru" "Ven" "Vie" "Wes" "Yem" "Zam" "Zim"
## [1] "Afgh______n" "Alba___" "Alge___"
## [4] "Ango__" "Arge_____" "Aust_____"
## [7] "Aust___" "Bahr___" "Bang______"
## [10] "Belg___" "Beni_" "Boli___"
## [13] "Bosn______ Herzegovina" "Bots____" "Braz__"
## [16] "Bulg____" "Burk______so" "Buru___"
## [19] "Camb____" "Came____"
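The input code for the string output above is not shown. The following sketch uses a hypothetical vector `chars` and two country names chosen to be consistent with that output, and illustrates the same basic string functions.

```r
chars <- c("hi", "hallo", "mother's", "father's", "He said, \"hi\"")

length(chars)                    # number of strings in the vector
nchar(chars)                     # number of characters in each string
paste("bill", "clinton", sep = " ")
paste(chars, collapse = " ")     # glue all elements into one string
strsplit("This is the R bootcamp", split = " ")

x <- c("Afghanistan", "Albania")
substring(x, 1, 3)               # first three characters of each
substring(x, 5, 10) <- "______"  # overwrite characters 5 to 10 in place
x
```

Note that `strsplit()` returns a list (one element per input string), which is why its printed result starts with `[[1]]`.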
We can search for patterns in character vectors and replace patterns (both vectorized!)
## [1] 70 71
## [1] "Korea, Dem. Rep." "Korea, Rep."
## [1] "North Korea" "Korea, Rep."
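A minimal sketch of vectorized search and replace on a short hypothetical vector `countries`. Using `fixed = TRUE` is a choice made here so that the dots are matched literally rather than as regular-expression wildcards; the original call may not have used it.

```r
countries <- c("Japan", "Korea, Dem. Rep.", "Korea, Rep.", "Kuwait")

hits <- grep("Korea", countries)   # indices of the elements that match
hits
countries[hits]

# Replace the pattern in every element at once
countries2 <- gsub("Korea, Dem. Rep.", "North Korea", countries, fixed = TRUE)
countries2[hits]
```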
Some of you may be familiar with using regular expressions, which is functionality for doing sophisticated pattern matching and replacement with strings. Python and Perl are both used extensively for such text manipulation.
R has a full set of regular expression capabilities available through the grep(), gregexpr(), and gsub() functions (among others - many R functions will work with regular expressions). However, a particularly nice way to make use of this functionality is to use the stringr package, which is more user-friendly than directly using the core R functions.
You can basically do any regular expression/string manipulations in R.
There are many ways to select subsets in R. The syntax below is useful for vectors, matrices, data frames, arrays and lists.
## [,1] [,2] [,3] [,4] [,5]
## a 1 5 9 13 17
## b 2 6 10 14 18
## c 3 7 11 15 19
## d 4 8 12 16 20
## [1] 72.301 75.320 65.554 74.852 50.728
## [1] 43.828 76.423 42.731 81.235 79.829 75.635 64.062 79.441 56.728 65.554
## [11] 74.852 50.728 72.390 73.005 52.295 49.580 59.723 50.430 80.653 44.741
## [21] 50.651 78.553 72.961 72.889 65.152 46.462 55.322 78.782 48.328 75.748
## [31] 78.273 76.486 78.332 54.791 72.235 74.994 71.338 71.878 51.579 58.040
## [41] 52.947 79.313 80.657 56.735 59.448 79.406 60.022 79.483 70.259 56.007
## [51] 46.388 60.916 70.198 82.208 73.338 81.757 64.698 70.650 70.964 59.545
## [61] 78.885 80.745 80.546 72.567 82.603 72.535 54.110 67.297 78.623 77.588
## [71] 71.993 42.592 45.678 73.952 59.443 48.303 74.241 54.467 64.164 72.801
## [81] 76.195 66.803 74.543 71.164 42.082 62.069 52.906 63.785 79.762 80.204
## [91] 72.899 56.867 46.859 80.196 75.640 65.483 75.537 71.752 71.421 71.688
## [101] 75.563 78.098 78.746 76.442 72.476 46.242 65.528 72.777 63.062 74.002
## [111] 42.568 79.972 74.663 77.926 48.159 49.339 80.941 72.396 58.556 39.613
## [121] 80.884 81.701 74.143 78.400 52.517 70.616 58.420 69.819 73.923 71.777
## [131] 51.542 79.425 78.242 76.384 73.747 74.249 73.422 62.698 42.384 43.487
## [1] 30.332 34.020
## [1] 30.332 34.020
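The subsetting calls themselves are not shown above, so here is a sketch of the main styles on small hypothetical objects `vec` and `mat`: by position, by negative position, by logical condition, and by name.

```r
vec <- c(72.3, 75.3, 65.6, 74.9, 50.7)
mat <- matrix(1:20, nrow = 4, dimnames = list(letters[1:4], NULL))

vec[c(1, 3)]         # by position
vec[-1]              # negative index drops an element
vec[vec > 70]        # logical vector keeps the TRUE positions

mat[2:3, c(1, 5)]    # rows 2-3, columns 1 and 5
mat[c("a", "d"), ]   # rows by name, all columns
mat["d", 5]          # a single element by row name and column position
```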
## Advanced: subset using a 2-column matrix of indices:
rowInd <- c(1, 3, 4)
colInd <- c(2, 2, 1)
elemInd <- cbind(rowInd, colInd)
elemInd
## rowInd colInd
## [1,] 1 2
## [2,] 3 2
## [3,] 4 1
## [1] "1952" "1962" "Afghanistan"
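To see what the two-column index matrix does, here is a sketch on a plain integer matrix (a stand-in for the data frame): each row of the index matrix picks out one (row, column) element, which differs from passing the same numbers as separate row and column indices.

```r
mat <- matrix(1:20, nrow = 4)
elemInd <- cbind(c(1, 3, 4), c(2, 2, 1))

mat[elemInd]                    # elements (1,2), (3,2), (4,1)
mat[c(1, 3, 4), c(2, 2, 1)]     # by contrast: a 3 x 3 submatrix
```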
## [1] 108382.35 113523.13 95458.11 80894.88 109347.87 59265.48
## country year pop continent lifeExp gdpPercap continent2
## 853 Kuwait 1952 160000 Asia 55.565 108382.35 Asia
## 854 Kuwait 1957 212846 Asia 58.033 113523.13 Asia
## 855 Kuwait 1962 358266 Asia 60.470 95458.11 Asia
## 856 Kuwait 1967 575003 Asia 64.624 80894.88 Asia
## 857 Kuwait 1972 841934 Asia 67.712 109347.87 Asia
## 858 Kuwait 1977 1140357 Asia 69.343 59265.48 Asia
What happened in the last subsetting operation?
## [,1] [,2] [,3] [,4] [,5]
## a 1 5 9 13 17
## d 4 8 12 16 20
## a 1 5 9 13 17
## country year pop continent lifeExp gdpPercap continent2
## 853 Kuwait 1952 160000 Asia 55.565 108382.35 Asia
## 854 Kuwait 1957 212846 Asia 58.033 113523.13 Asia
## 855 Kuwait 1962 358266 Asia 60.470 95458.11 Asia
## 856 Kuwait 1967 575003 Asia 64.624 80894.88 Asia
## 857 Kuwait 1972 841934 Asia 67.712 109347.87 Asia
## 858 Kuwait 1977 1140357 Asia 69.343 59265.48 Asia
We can assign into subsets by using similar syntax, as we saw with vectors.
## [1] -0.54609384 0.41276992 1.00000000 -0.08751844 2.00000000 -0.04387560
## [7] -1.71994174 -1.52028202 0.30287244 -0.57162163 0.12997778 3.00000000
## [13] 4.00000000 5.00000000 0.38895028 0.91324059 -0.56111335 0.65813862
## [19] -0.86476739 1.43320661
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.1133369 1.42521557 1.146829016 3.8904220 -1.9544089
## [2,] 0.3710806 -0.28580184 -0.008796788 0.9700758 -0.9604171
## [3,] 0.6533810 -0.31371728 0.667937581 0.7597577 -1.2873426
## [4,] 2.6515761 -0.02834754 -0.971263419 -0.7769350 0.4070947
## [5,] 0.7967564 1.24422818 -1.651554999 0.8570243 -1.3529498
## [6,] 0.8221053 -0.80660057 0.728385214 0.9393530 1.2343110
## [,1] [,2] [,3] [,4] [,5]
## [1,] -Inf -Inf -Inf -Inf -Inf
## [2,] -Inf -Inf -Inf -Inf -Inf
## [3,] -Inf -Inf -Inf -Inf -Inf
## [4,] -Inf -Inf -Inf -Inf -Inf
## [5,] -Inf -Inf -Inf -Inf -Inf
## [6,] -Inf -Inf -Inf -Inf -Inf
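A sketch of assigning into subsets, with small hypothetical objects in place of the random ones printed above. The same indexing styles used for extraction work on the left-hand side of an assignment.

```r
vec <- c(-0.5, 0.4, 1.2, -0.1)
vec[c(1, 3)] <- c(10, 20)      # replace by position
vec

mat <- matrix(c(-1, 2, -3, 4, 5, -6), nrow = 2)
mat[mat < 0] <- 0              # replace by logical condition
mat
mat[2, ] <- -Inf               # replace an entire row with one recycled value
mat
```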
POLL 3F: Suppose I want to select the 3rd elements from the 2nd and 4th columns of a matrix or dataframe. Which syntax will work?
Here’s a test matrix:
mat <- matrix(1:16, nrow = 4, ncol = 4)
POLL 3F: (Advanced) One of those answers won’t work with a matrix but will work with a dataframe. Which one?
Extract the 5th row from the gapminder dataset.
Extract the last row from the gapminder dataset.
Count the number of gdpPercap values greater than 50000 in the gapminder dataset.
Set all of the gdpPercap values greater than 50000 to NA. You should probably first copy the gapminder object and work on the copy so that the dataset is unchanged (or just read the data into R again afterwards to get a clean copy).
Consider the first row of the gapminder dataset, which has Afghanistan for 1952. How do I create a string “Afghanistan-1952” using gapminder$country[1] and gapminder$year[1]?
Create a character string using paste() that tells the user how many rows there are in the data frame - do this programmatically such that it would work for any data frame regardless of how many rows it has. The result should look like this: “There are 1704 rows in the dataset”
If you didn’t do it this way already in problem #2, extract the last row from the gapminder dataset without typing the number ‘1704’.
Create a boolean vector indicating if the life expectancy is greater than 75 and the gdpPercap is less than 10000 and calculate the proportion of all the records these represent.
Use that vector to create a new data frame that is a subset of the original data frame.
Consider the attributes of the gapminder dataset. What kind of R object is the set of attributes?