R bootcamp, Module 1: Basics

August 2022, UC Berkeley

Chris Paciorek

R as a calculator

2 + 2 # add numbers
## [1] 4
2 * pi # multiply by a constant
## [1] 6.283185
7 + runif(1) # add a random number
## [1] 7.71528
3^4 # powers
## [1] 81
sqrt(4^4) # functions
## [1] 16
log(10)
## [1] 2.302585
log(100, base = 10)
## [1] 2
23 %/% 2 # integer division
## [1] 11
23 %% 2 # remainder (modulo)
## [1] 1
# scientific notation
5000000000 * 1000
## [1] 5e+12
5e9 * 1e3
## [1] 5e+12

Think of a mathematical operation you need - can you guess how to do it in R?
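As a short sketch of how guessable the names tend to be, here are a few more operations. All of these are base R functions; the particular numbers are arbitrary.

```r
exp(1)          # exponential function, e^1
abs(-3.5)       # absolute value
round(pi, 2)    # round to 2 decimal places
factorial(5)    # 5! = 120
choose(5, 2)    # number of ways to choose 2 items from 5
10 / 3          # ordinary (floating point) division
```

If a guess fails, R reports that it could not find the function, which is a cue to search the help system.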

Side note to presenter: turn off R Notebook inline view via RStudio -> Preferences -> R Markdown -> Show output inline …

R as a calculator: quick quiz (respond at https://pollev.com/chrispaciorek428)

POLL 1A:

Question 1: How do I calculate the cosine of 2 pi?

  1. cosine(2pi)
  2. cosine(2*pi)
  3. cos(2 * pi)
  4. cos(2 x pi)
  5. cos(2*pi)
  6. cos(2 * 3.14159)
  7. cos[2*pi]

Question 2: What happens if you do this?

cos(2*pi

Assigning values to R objects

A key action in R is to store values in the form of R objects, and to examine the value of R objects.

val <- 3
val
## [1] 3
print(val)
## [1] 3
Val <- 7 # case-sensitive!
print(val)
## [1] 3
print(Val)
## [1] 7

We can work with (and store) sequences and repetitions

mySeq <- 1:6
mySeq
## [1] 1 2 3 4 5 6
years <- seq(1952, 2007, by = 5)
years
##  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
length(years)
## [1] 12
## This is a comment: here is an example of non-numeric data
country <- rep("Afghanistan", 12)
country 
##  [1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
##  [6] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
## [11] "Afghanistan" "Afghanistan"

If we don’t assign the output of a command to an object, we haven’t saved it for later use.

R gives us a lot of flexibility (within certain rules) for assigning to (parts of) objects from (parts of) other objects. We’ll see this throughout the bootcamp.
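A minimal sketch of both points. The object names (root, src, dest) are made up for illustration, and the c() function and square-bracket indexing used here are covered later in this module.

```r
# Not assigned: the result is printed but not kept
sqrt(16)

# Assigned: the result is stored and can be reused
root <- sqrt(16)
root * 2

# Assign part of one object into part of another
src  <- c(10, 20, 30, 40, 50)
dest <- rep(0, 5)
dest[1:2] <- src[4:5]   # copy the last two values of src into the first two slots
dest
```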

Vectors in R

The most basic form of an R object is a vector. The various objects mySeq, years, country are all vectors.

In fact, individual (scalar) values are vectors of length one, so val and Val are also vectors.
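One way to check this for yourself, as a small sketch using only base R functions:

```r
val <- 3
length(val)       # a single number has length 1
is.vector(val)    # and it is a vector
length(1:6)       # a longer vector, for comparison
```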

We can concatenate values into a vector with c().

## numeric vector
nums <- c(1.1, 3, -5.7)
devs <- rnorm(5)
devs
## [1]  0.19323788  0.08528635 -1.90114512  1.02340237  0.21099795
## integer vector
ints <- c(1L, 5L, -3L) # force storage as integer not decimal number
## 'L' is for 'long integer' (historical)

nObs <- 1000
mySample <- sample(1:nObs, 100, replace = TRUE)  # 100 draws, with replacement, from 1 to nObs

## character vector
chars <- c('hi', 'hallo', "mother's", 'father\'s', 
   "She said, 'hi'", "He said, \"hi\"" )
chars
## [1] "hi"              "hallo"           "mother's"        "father's"       
## [5] "She said, 'hi'"  "He said, \"hi\""
cat(chars, sep = "\n")
## hi
## hallo
## mother's
## father's
## She said, 'hi'
## He said, "hi"
## logical vector
bools <- c(TRUE, FALSE, TRUE)
bools
## [1]  TRUE FALSE  TRUE

The following is not valid syntax in R. Let’s try it and see what happens.

nums <- (1.1, 3, -5.7)
nums <- [1.1, 3, -5.7]

Working with indices and subsets

years
##  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
years[3]
## [1] 1962
years[3:5]
## [1] 1962 1967 1972
years[c(1, 3, 6)]
## [1] 1952 1962 1977
years[-c(1, 3, 6)]
## [1] 1957 1967 1972 1982 1987 1992 1997 2002 2007
years[c(rep(TRUE, 3), rep(FALSE, 2), TRUE, rep(FALSE, 6))]
## [1] 1952 1957 1962 1977
## If you haven't installed the gapminder package, do this first:
install.packages('gapminder')
## Installing package into '/accounts/vis/paciorek/R/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
## create a simple vector from the Gapminder dataset
library(gapminder)
gdp <- gapminder$gdpPercap
gdp[1:10]
##  [1] 779.4453 820.8530 853.1007 836.1971 739.9811 786.1134 978.0114 852.3959
##  [9] 649.3414 635.3414

We can substitute values into vectors

gdp[4] <- 822.9711

vals <- rnorm(100)
vals[3:4] <- c(7.5, 2.4)
vals[1:2] <- 0  # this uses 'recycling' - more in Module 4
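A small sketch of what recycling does during substitution, using a made-up vector x so the result is easy to read. When the replacement is shorter than the slots being filled, R repeats it as needed.

```r
x <- c(1, 2, 3, 4, 5, 6)
x[1:4] <- c(-1, 1)   # the length-2 replacement is repeated to fill 4 slots
x
x[5:6] <- 0          # a single value is repeated for every selected slot
x
```

If the number of slots is not a multiple of the replacement length, R still recycles but issues a warning.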

Working with indices and subsets: quick quiz

POLL 1B: Which of these will work to extract a subset of a vector? Assume the vector is created like this:

vals <- rnorm(4)

(respond at https://pollev.com/chrispaciorek428)

  1. vals[3]
  2. vals[2,3]
  3. vals[c(2,3)]
  4. vals(2,3)
  5. vals[c(FALSE, TRUE, TRUE, FALSE)]
  6. vals[c(f,t,t,f)]
  7. vals(3)

Vectorized calculations and comparisons

At the core of R is the idea of doing calculations on entire vectors.

gdpTotal <- gapminder$gdpPercap * gapminder$pop

tmp <- gdpTotal[gapminder$year == 2007]  # let's pick apart what is happening here
gdpSubset <- tmp[1:20]

gdpSubset >= 1e6  # Dr. Evil's version of "a lot"
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE
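To pick the steps apart, here is a sketch on a hypothetical miniature dataset. The vectors year, gdpPercap, and pop below are made-up stand-ins for the gapminder columns, not real data.

```r
# Hypothetical miniature data standing in for the gapminder columns
year      <- c(2002, 2007, 2002, 2007)
gdpPercap <- c(1000, 1200, 30000, 35000)
pop       <- c(5e6, 6e6, 6e7, 6.2e7)

gdpTotal <- gdpPercap * pop   # element-by-element multiplication
is2007   <- year == 2007      # comparison gives a logical vector, one value per element
is2007
gdpTotal[is2007]              # keep only the elements where is2007 is TRUE
sum(is2007)                   # TRUE counts as 1, so this counts the matches
```

The same pattern applies to the full dataset: the comparison produces one TRUE/FALSE per row, and that logical vector selects the matching elements of gdpTotal.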

Using functions in R

mean(gapminder$lifeExp)
## [1] 59.47444
mean(gapminder$lifeExp, trim = 0.1)
## [1] 59.91524
hist(rnorm(1000))

lm
## function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## {
##     ret.x <- x
##     ret.y <- y
##     cl <- match.call()
##     mf <- match.call(expand.dots = FALSE)
##     m <- match(c("formula", "data", "subset", "weights", "na.action", 
##         "offset"), names(mf), 0L)
##     mf <- mf[c(1L, m)]
##     mf$drop.unused.levels <- TRUE
##     mf[[1L]] <- quote(stats::model.frame)
##     mf <- eval(mf, parent.frame())
##     if (method == "model.frame") 
##         return(mf)
##     else if (method != "qr") 
##         warning(gettextf("method = '%s' is not supported. Using 'qr'", 
##             method), domain = NA)
##     mt <- attr(mf, "terms")
##     y <- model.response(mf, "numeric")
##     w <- as.vector(model.weights(mf))
##     if (!is.null(w) && !is.numeric(w)) 
##         stop("'weights' must be a numeric vector")
##     offset <- model.offset(mf)
##     mlm <- is.matrix(y)
##     ny <- if (mlm) 
##         nrow(y)
##     else length(y)
##     if (!is.null(offset)) {
##         if (!mlm) 
##             offset <- as.vector(offset)
##         if (NROW(offset) != ny) 
##             stop(gettextf("number of offsets is %d, should equal %d (number of observations)", 
##                 NROW(offset), ny), domain = NA)
##     }
##     if (is.empty.model(mt)) {
##         x <- NULL
##         z <- list(coefficients = if (mlm) matrix(NA_real_, 0, 
##             ncol(y)) else numeric(), residuals = y, fitted.values = 0 * 
##             y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w != 
##             0) else ny)
##         if (!is.null(offset)) {
##             z$fitted.values <- offset
##             z$residuals <- y - offset
##         }
##     }
##     else {
##         x <- model.matrix(mt, mf, contrasts)
##         z <- if (is.null(w)) 
##             lm.fit(x, y, offset = offset, singular.ok = singular.ok, 
##                 ...)
##         else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok, 
##             ...)
##     }
##     class(z) <- c(if (mlm) "mlm", "lm")
##     z$na.action <- attr(mf, "na.action")
##     z$offset <- offset
##     z$contrasts <- attr(x, "contrasts")
##     z$xlevels <- .getXlevels(mt, mf)
##     z$call <- cl
##     z$terms <- mt
##     if (model) 
##         z$model <- mf
##     if (ret.x) 
##         z$x <- x
##     if (ret.y) 
##         z$y <- y
##     if (!qr) 
##         z$qr <- NULL
##     z
## }
## <bytecode: 0x5642a19f75f0>
## <environment: namespace:stats>
mean  # We'll investigate what 'UseMethod' does in Module 10
## function (x, ...) 
## UseMethod("mean")
## <bytecode: 0x5642a1ca3db0>
## <environment: namespace:base>

Getting help about a function

To get information about a function you know exists, use help or ?, e.g., ?lm.

help(lm)
?lm

?log
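When you are not sure of the exact function name, a few other base R tools can help. This is a sketch; the search strings are arbitrary, and the variable name found is made up for illustration.

```r
args(sd)                   # show a function's arguments and defaults
found <- apropos("mean")   # names of visible objects whose names contain "mean"
head(found)

# Run this interactively; it searches the help pages by keyword:
# help.search("linear model")
```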

Basic kinds of R objects

We’ve seen vectors of various types: numeric (i.e., decimal/floating point/double), integer, logical (boolean), and character.

All items in a single vector must be of the same type.
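A quick sketch of the consequence: when you combine values of different types, R converts them all to a single common type rather than raising an error. The name mixed is made up for illustration.

```r
class(c(1L, 2.5))    # "numeric": the integer is converted to a double
class(c(1, TRUE))    # "numeric": TRUE becomes 1
mixed <- c(1, "a", TRUE)
mixed                # everything becomes a character string
class(mixed)         # "character"
```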

But vectors are not the only kinds of R objects.

Data frames

Collections of columns of potentially different types. gapminder is actually an enhanced kind of data frame called a ‘tibble’ (more in Module 6).

head(gapminder)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
gapminder$lifeExp[1:10]
##  [1] 28.801 30.332 31.997 34.020 36.088 38.438 39.854 40.822 41.674 41.763
dim(gapminder)
## [1] 1704    6
nrow(gapminder)
## [1] 1704
names(gapminder)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
class(gapminder)
## [1] "tbl_df"     "tbl"        "data.frame"
is.matrix(gapminder)
## [1] FALSE
class(gapminder$year)
## [1] "integer"
class(gapminder$lifeExp)
## [1] "numeric"
class(gapminder$country)
## [1] "factor"
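You can also build a data frame yourself. The sketch below uses data.frame() from base R with made-up values; the column names simply mirror some of those in gapminder.

```r
# Hypothetical toy data; the values are made up for illustration
df <- data.frame(
  country = c("A", "B", "C"),
  year    = c(2007L, 2007L, 2007L),
  lifeExp = c(70.1, 65.4, 80.2)
)
dim(df)              # 3 rows, 3 columns
class(df$year)       # each column keeps its own type
class(df$lifeExp)
df$lifeExp[2]        # a column is a vector, so vector indexing applies
mean(df$lifeExp)
```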

Lists

Collections of disparate or complicated objects

myList <- list(stuff = 3, mat = matrix(1:4, nrow = 2), 
   moreStuff = c("china", "japan"), list(5, "bear"))
myList
## $stuff
## [1] 3
## 
## $mat
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## $moreStuff
## [1] "china" "japan"
## 
## [[4]]
## [[4]][[1]]
## [1] 5
## 
## [[4]][[2]]
## [1] "bear"
myList[[3]] # result is not (usually) a list (unless you have nested lists)
## [1] "china" "japan"
identical(myList[[3]], myList$moreStuff)
## [1] TRUE
myList$moreStuff[2]
## [1] "japan"
names(myList)
## [1] "stuff"     "mat"       "moreStuff" ""

Lists can be used as vectors of complicated objects. E.g., suppose you have a linear regression for each value of a stratifying variable. You could have a list of regression fits. Each regression fit will itself be a list, so you’ll have a list of lists.
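A minimal sketch of that idea, assuming the built-in mtcars dataset as a stand-in and cyl (number of cylinders) as the stratifying variable. It uses split(), lapply(), and an anonymous function, none of which have been introduced yet, so treat it as a preview rather than something to master now.

```r
# Fit one regression of mpg on wt for each value of cyl (the stratifying variable)
byCyl <- split(mtcars, mtcars$cyl)
fits  <- lapply(byCyl, function(d) lm(mpg ~ wt, data = d))

names(fits)             # one element per stratum
class(fits[["4"]])      # each element is a regression fit
is.list(fits[["4"]])    # and each fit is itself a list
coef(fits[["4"]])       # intercept and slope for the 4-cylinder cars
```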

A bit on plotting

R has several different plotting systems, including base graphics, lattice graphics, and ggplot2 (an add-on package).

We’ll see a little bit of base graphics here and then ggplot2 tomorrow in Module 7.

hist(gapminder$lifeExp)

plot(gapminder$lifeExp ~ gapminder$gdpPercap)

boxplot(gapminder$lifeExp ~ gapminder$year)

Graphics options

Check out help(par) for various graphics settings; these are set via par() or within the specific graphics command (some can be set in either place), e.g.,

par(pch = 16)
plot(gapminder$lifeExp ~ gapminder$gdpPercap, xlab = 'GDP per capita (dollars)',
   ylab = 'life expectancy (years)', log = 'x')
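Because par() changes settings for all later plots, it can be handy to save the old settings and restore them afterwards. The sketch below uses simulated data (x and y are made up, as is the name oldPar) and sets some options via par() and others inside plot().

```r
set.seed(1)                      # arbitrary seed so the example is reproducible
x <- rexp(200, rate = 1/5000)    # hypothetical positive, right-skewed values
y <- 40 + 5 * log(x) + rnorm(200)

oldPar <- par(pch = 16, mfrow = c(1, 2))   # par() returns the previous settings
plot(y ~ x, main = "default axes")
plot(y ~ x, log = "x", col = "blue", main = "log-scale x axis")
par(oldPar)                      # restore the previous settings
```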

Breakout

In general, your answers to any questions should involve writing code to manipulate objects. For example, if I ask you to find the maximum life expectancy, do not scan through all the values and find it by eye. Use R to do the calculations and print results.

Basics

  1. Create a variable called ‘x’ that contains the mean life expectancy.

  2. Use functions in R to round ‘x’ to two decimal places and to two significant digits.

  3. Create a vector of GDP per capita in units of Euros rather than dollars.

  4. Create a boolean (TRUE/FALSE) vector indicating whether total country GDP is greater than 1 trillion dollars. When entering 1 trillion, use R’s scientific notation.

Using the ideas

  5. Use the boolean vector from problem 4 to produce a new vector containing the per capita GDP only from the biggest economies.

  6. How does R process the subset operation in the first line of the following code? Explain the individual steps that R carries out:

vals[vals < 0] <- 0
vals[1:8]

  7. Plot life expectancy against gdpPercap with gdpPercap values greater than 40000 set to 40000.

  8. Make a histogram of the life expectancy values for the year 2007. Explore the effect of changing the number of bins in the histogram using the ‘breaks’ argument.

  9. Subset the data to those for the year 2007 (there is a way to do this all at once, but using what we’ve seen already, you can pull out and subset the individual columns you need). Plot life expectancy against GDP per capita. Add a title to the plot. Now plot so that data for Asia are in one color and those for all other continents are in another, using the ‘col’ argument. Hint: ‘col’ can take a vector of colors such as “black”,“red”,“black”, …

Advanced

  10. Consider the following regression model. Figure out how to extract the R^2 and residual standard error and store them in new R variables.

mod <- lm(lifeExp ~ log(gdpPercap), data = gapminder)
summ <- summary(mod)

  11. Take your plot from problem 9. Now modify the size of the points. Add a legend. Rotate the numbers on the y-axis so they are printed horizontally. Recall that help(par) will provide a lot of information.