August 2022, UC Berkeley
Chris Paciorek
## [1] 4
## [1] 6.283185
## [1] 7.71528
## [1] 81
## [1] 16
## [1] 2.302585
## [1] 2
## [1] 11
## [1] 1
## [1] 5e+12
## [1] 5e+12
Think of a mathematical operation you need. Can you guess how to do it in R?
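As a minimal sketch of using R as a calculator, here are a few operations to try. The particular expressions are illustrative assumptions, and the line using runif() gives a different value on each run.

```r
2 + 2                 # addition
2 * pi                # pi is a built-in constant
7 + runif(1)          # add a random number between 0 and 1
3^4                   # powers
sqrt(4^4)             # functions can wrap other expressions
log(10)               # natural log by default
log(100, base = 10)   # log with a specified base
23 %/% 2              # integer division
23 %% 2               # remainder (modulo)
5000000000 * 1000     # large results print in scientific notation
5e9 * 1e3             # scientific notation also works for input
```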
Side note to presenter: turn off R Notebook inline view via RStudio -> Preferences -> R Markdown -> Show output inline …
POLL 1A:
Question 1: How do I calculate the cosine of 2 pi?
Question 2: What happens if you do this?
A key action in R is to store values in the form of R objects, and to examine the value of R objects.
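A minimal sketch of storing and examining objects. The names val and Val are the ones this module refers to later; the values assigned here are assumptions.

```r
val <- 3        # store the value 3 in an object named val
val             # typing the name prints the value
print(val)      # print() does the same thing explicitly
Val <- 7        # R is case-sensitive: Val is a different object from val
print(val)
print(Val)
```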
## [1] 3
## [1] 3
## [1] 3
## [1] 7
We can work with (and store) sequences and repetitions.
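One way to build such sequences, as a sketch: the object names mySeq and years are the ones used elsewhere in this module, while the exact calls are assumptions.

```r
mySeq <- 1:6                        # consecutive integers
mySeq
years <- seq(1952, 2007, by = 5)    # every fifth year from 1952 to 2007
years
length(years)                       # number of elements in the vector
```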
## [1] 1 2 3 4 5 6
## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## [1] 12
## This is a comment: here is an example of non-numeric data
country <- rep("Afghanistan", 12)
country
## [1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
## [6] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
## [11] "Afghanistan" "Afghanistan"
If we don’t assign the output of a command to an object, we haven’t saved it for later use.
R gives us a lot of flexibility (within certain rules) for assigning to (parts of) objects from (parts of) other objects. We’ll see this throughout the bootcamp.
The most basic form of an R object is a vector. The various objects mySeq, years, and country are all vectors.
In fact, individual (scalar) values are vectors of length one, so val and Val are also vectors.
We can concatenate values into a vector with c().
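A short sketch of building numeric vectors. The names nums and devs and the values are assumptions, and rnorm() draws random numbers, so the printed values change from run to run.

```r
nums <- c(1.1, 3, -5.7)   # concatenate three numbers into one vector
nums
devs <- rnorm(5)          # five random draws from a standard normal
devs
```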
## [1] 0.19323788 0.08528635 -1.90114512 1.02340237 0.21099795
## integer vector
ints <- c(1L, 5L, -3L) # force storage as integer not decimal number
## 'L' is for 'long integer' (historical)
nObs <- 1000
mySample <- sample(1:1000, 100, replace = TRUE)
## character vector
chars <- c('hi', 'hallo', "mother's", 'father\'s',
"She said, 'hi'", "He said, \"hi\"" )
chars
## [1] "hi" "hallo" "mother's" "father's"
## [5] "She said, 'hi'" "He said, \"hi\""
## hi
## hallo
## mother's
## father's
## She said, 'hi'
## He said, "hi"
## [1] TRUE FALSE TRUE
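To illustrate the difference between printing a character vector and writing out its raw contents, and to show a logical vector, here is a small sketch. The name bools and the shortened chars vector are assumptions for illustration.

```r
chars <- c('hi', "mother's", "He said, \"hi\"")
chars                    # printing shows the backslash escapes
cat(chars, sep = "\n")   # cat() writes the raw text, one element per line

bools <- c(TRUE, FALSE, TRUE)   # logical (boolean) vector
bools
```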
This is not valid syntax in R. Let’s try it and see what happens.
## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## [1] 1962
## [1] 1962 1967 1972
## [1] 1952 1962 1977
## [1] 1957 1967 1972 1982 1987 1992 1997 2002 2007
## [1] 1952 1957 1962 1977
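Vectors can be subset by position, by negative position (to exclude elements), or with a logical vector. A sketch, assuming years holds the twelve survey years from 1952 to 2007:

```r
years <- seq(1952, 2007, by = 5)
years[3]              # third element
years[3:5]            # third through fifth elements
years[c(1, 3, 6)]     # elements at arbitrary positions
years[-c(1, 3, 6)]    # everything except those positions
# logical index: keep positions where the value is TRUE
years[c(rep(TRUE, 3), rep(FALSE, 2), TRUE, rep(FALSE, 6))]
```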
## create a simple vector from the Gapminder dataset
library(gapminder)
gdp <- gapminder$gdpPercap
gdp[1:10]
## [1] 779.4453 820.8530 853.1007 836.1971 739.9811 786.1134 978.0114 852.3959
## [9] 649.3414 635.3414
We can substitute values into vectors.
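A minimal sketch of substitution, using a hypothetical vector (the name vals and its values are assumptions): the indexing expression goes on the left-hand side of the assignment.

```r
vals <- c(3, 8, 15, 22)   # hypothetical values
vals[2] <- 100            # replace a single element
vals[3:4] <- c(0, -1)     # replace several elements at once
vals[vals < 0] <- NA      # replace elements selected by a condition
vals
```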
POLL 1B: Which of these will work to extract a subset of a vector? Assume the vector is created like this:
vals <- rnorm(4)
(respond at https://pollev.com/chrispaciorek428)
At the core of R is the idea of doing calculations on entire vectors.
gdpTotal <- gapminder$gdpPercap * gapminder$pop
tmp <- gdpTotal[gapminder$year == 2007] # let's pick apart what is happening here
gdpSubset <- tmp[1:20]
gdpSubset >= 1e6 # Dr. Evil's version of "a lot"
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE
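To pick apart a vectorized calculation like this one, it can help to run each step on a tiny example. The sketch below uses made-up numbers (not the real Gapminder values), and the name is2007 is an assumption.

```r
# Hypothetical miniature data
year      <- c(2002, 2007, 2002, 2007)
pop       <- c(10, 12, 200, 210)          # population, in millions
gdpPercap <- c(1000, 1100, 30000, 32000)  # dollars per person

gdpTotal <- gdpPercap * pop * 1e6   # element-by-element arithmetic
is2007 <- year == 2007              # the comparison gives a logical vector
is2007
gdpTotal[is2007]                    # keep only elements where is2007 is TRUE
gdpTotal[is2007] >= 1e12            # comparisons are vectorized too
```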
## [1] 59.47444
## [1] 59.91524
## function (formula, data, subset, weights, na.action, method = "qr",
## model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
## contrasts = NULL, offset, ...)
## {
## ret.x <- x
## ret.y <- y
## cl <- match.call()
## mf <- match.call(expand.dots = FALSE)
## m <- match(c("formula", "data", "subset", "weights", "na.action",
## "offset"), names(mf), 0L)
## mf <- mf[c(1L, m)]
## mf$drop.unused.levels <- TRUE
## mf[[1L]] <- quote(stats::model.frame)
## mf <- eval(mf, parent.frame())
## if (method == "model.frame")
## return(mf)
## else if (method != "qr")
## warning(gettextf("method = '%s' is not supported. Using 'qr'",
## method), domain = NA)
## mt <- attr(mf, "terms")
## y <- model.response(mf, "numeric")
## w <- as.vector(model.weights(mf))
## if (!is.null(w) && !is.numeric(w))
## stop("'weights' must be a numeric vector")
## offset <- model.offset(mf)
## mlm <- is.matrix(y)
## ny <- if (mlm)
## nrow(y)
## else length(y)
## if (!is.null(offset)) {
## if (!mlm)
## offset <- as.vector(offset)
## if (NROW(offset) != ny)
## stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
## NROW(offset), ny), domain = NA)
## }
## if (is.empty.model(mt)) {
## x <- NULL
## z <- list(coefficients = if (mlm) matrix(NA_real_, 0,
## ncol(y)) else numeric(), residuals = y, fitted.values = 0 *
## y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w !=
## 0) else ny)
## if (!is.null(offset)) {
## z$fitted.values <- offset
## z$residuals <- y - offset
## }
## }
## else {
## x <- model.matrix(mt, mf, contrasts)
## z <- if (is.null(w))
## lm.fit(x, y, offset = offset, singular.ok = singular.ok,
## ...)
## else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,
## ...)
## }
## class(z) <- c(if (mlm) "mlm", "lm")
## z$na.action <- attr(mf, "na.action")
## z$offset <- offset
## z$contrasts <- attr(x, "contrasts")
## z$xlevels <- .getXlevels(mt, mf)
## z$call <- cl
## z$terms <- mt
## if (model)
## z$model <- mf
## if (ret.x)
## z$x <- x
## if (ret.y)
## z$y <- y
## if (!qr)
## z$qr <- NULL
## z
## }
## <bytecode: 0x5642a19f75f0>
## <environment: namespace:stats>
## function (x, ...)
## UseMethod("mean")
## <bytecode: 0x5642a1ca3db0>
## <environment: namespace:base>
To get information about a function you know exists, use help or ?, e.g., ?lm.
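A brief sketch of ways to inspect a function. The help calls are wrapped in interactive() here only so the snippet also runs cleanly as a script; that wrapper is an assumption for illustration, not something you need at the console.

```r
mean                  # typing a function's name without parentheses prints its definition
is.function(lm)       # functions are R objects too
args(lm)              # argument names and default values

if (interactive()) {
  help(lm)            # open the help page for lm
  ?lm                 # shorthand for the same thing
}
```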
We’ve seen vectors of various types: numeric (i.e., decimal/floating point/double), integer, boolean, and character.
All items in a single vector must be of the same type.
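One consequence is that combining values of different types silently converts them to a common type. A small sketch (the name mixed is an assumption):

```r
mixed <- c(1, "a", TRUE)   # the number and the logical are converted to character
mixed
class(mixed)
class(c(1L, 2.5))          # integer combined with double gives numeric
class(c(TRUE, 2L))         # logical combined with integer gives integer
```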
But vectors are not the only kinds of R objects.
Data frames are collections of columns of potentially different types. gapminder is actually an enhanced kind of data frame called a ‘tibble’ (more in Module 6).
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## [1] 28.801 30.332 31.997 34.020 36.088 38.438 39.854 40.822 41.674 41.763
## [1] 1704 6
## [1] 1704
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
## [1] "tbl_df" "tbl" "data.frame"
## [1] FALSE
## [1] "integer"
## [1] "numeric"
## [1] "factor"
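The same kinds of inspection work on any data frame. As a self-contained sketch, here they are applied to a small made-up data frame (dat and its values are assumptions, not Gapminder data):

```r
dat <- data.frame(
  country = factor(c("A", "A", "B")),
  year    = c(1952L, 1957L, 1952L),
  lifeExp = c(28.8, 30.3, 55.2)
)
head(dat)            # first rows
dat$lifeExp          # extract one column as a vector
dim(dat)             # number of rows and columns
nrow(dat)
names(dat)           # column names
class(dat)
is.matrix(dat)       # a data frame is not a matrix
class(dat$year)      # each column has its own type
class(dat$lifeExp)
class(dat$country)
```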
Lists are collections of disparate or complicated objects.
myList <- list(stuff = 3, mat = matrix(1:4, nrow = 2),
moreStuff = c("china", "japan"), list(5, "bear"))
myList
## $stuff
## [1] 3
##
## $mat
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## $moreStuff
## [1] "china" "japan"
##
## [[4]]
## [[4]][[1]]
## [1] 5
##
## [[4]][[2]]
## [1] "bear"
## [1] "china" "japan"
## [1] TRUE
## [1] "japan"
## [1] "stuff" "mat" "moreStuff" ""
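List elements can be extracted by position with double brackets or by name with $. A sketch using the myList object defined above (the particular extraction calls are illustrative assumptions):

```r
myList <- list(stuff = 3, mat = matrix(1:4, nrow = 2),
               moreStuff = c("china", "japan"), list(5, "bear"))
myList[[3]]                                # extract by position
identical(myList[[3]], myList$moreStuff)   # same element, extracted by name
myList$moreStuff[2]                        # index within the extracted vector
myList[[4]][[2]]                           # reach into the nested list
names(myList)                              # unnamed elements get an empty name
```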
Lists can be used as vectors of complicated objects. E.g., suppose you have a linear regression for each value of a stratifying variable. You could have a list of regression fits. Each regression fit will itself be a list, so you’ll have a list of lists.
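As a sketch of that idea, the code below fits one regression per group using R's built-in mtcars data, with the number of cylinders as the stratifying variable. The choice of dataset and model is an assumption for illustration.

```r
# Split the data by cylinder count and fit a regression within each group
fits <- lapply(split(mtcars, mtcars$cyl),
               function(d) lm(mpg ~ wt, data = d))
names(fits)                               # one list element per stratum
class(fits[["4"]])                        # each element is an lm fit...
is.list(fits[["4"]])                      # ...which is itself a list
sapply(fits, function(f) coef(f)["wt"])   # pull the slope from each fit
```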
R has several different plotting systems:
We’ll see a little bit of base graphics here and then ggplot2 tomorrow in Module 7.
Check out help(par) for various graphics settings; these are set via par() or within the specific graphics command (some can be set in either place).
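A sketch of both routes with base graphics, using simulated data (the data and the specific settings are assumptions): mfrow and mar are set through par(), while pch, col, and cex are passed directly to plot().

```r
x <- rnorm(100)
y <- x + rnorm(100)

# Settings made via par(): two panels side by side, smaller margins.
# par() returns the previous settings so they can be restored later.
oldPar <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
hist(x, main = "Histogram of x")
# Settings made within the graphics command itself
plot(x, y, pch = 16, col = "red", cex = 0.8, main = "y against x")
par(oldPar)   # restore the earlier settings
```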
In general, your answers to any questions should involve writing code to manipulate objects. For example, if I ask you to find the maximum life expectancy, do not scan through all the values and find it by eye. Use R to do the calculations and print results.
1. Create a variable called ‘x’ that contains the mean life expectancy.
2. Use functions in R to round ‘x’ to two decimal places and to two significant digits.
3. Create a vector of GDP per capita in units of Euros rather than dollars.
4. Create a boolean (TRUE/FALSE) vector indicating whether total country GDP is greater than 1 trillion dollars. When entering 1 trillion, use R’s scientific notation.
5. Use the boolean vector from problem 4 to produce a new vector containing the per capita GDP only from the biggest economies.
6. How does R process the following subset operations in the first line of code? Explain the individual steps that R carries out:
7. Plot life expectancy against gdpPercap with gdpPercap values greater than 40000 set to 40000.
8. Make a histogram of the life expectancy values for the year 2007. Explore the effect of changing the number of bins in the histogram using the ‘breaks’ argument.
9. Subset the data to those for the year 2007 (there is a way to do this all at once, but using what we’ve seen already, you can pull out and subset the individual columns you need). Plot life expectancy against GDP per capita. Add a title to the plot. Now plot so that data for Asia are in one color and those for all other continents are in another, using the ‘col’ argument. Hint: ‘col’ can take a vector of colors such as “black”,“red”,“black”, …
help(par) will provide a lot of information.