August 2022, UC Berkeley
Chris Paciorek
If you’re starting to type something you’ve typed before, or the long name of an R object or function, STOP! You likely don’t need to type all of that.
One option is to put your code in a file and run it with source(). For example:
source('myRcodeFile.R')
Question: Are there other tricks that anyone knows of? Please share in the online discussion forum.
R has a number of functions for getting metadata about your objects. Some of this is built into the RStudio Environment tab/panel.
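As a minimal sketch of the kinds of calls that give output like that shown next, the block below queries a few small stand-in objects. The names v1, v2, v3, and myList follow those used later in this module, but the values here are illustrative assumptions rather than the gapminder columns themselves.

```r
## Small stand-in objects (illustrative values, not the real gapminder columns)
v1 <- c(1952L, 1957L, 1962L)                       # integer vector
v2 <- factor(c('Asia', 'Europe', 'Asia'))          # factor
v3 <- c(28.8, 30.3, 32.0)                          # numeric (double) vector
myList <- list(3, c('uganda', 'bulgaria'), matrix(1:4, nrow = 2))

length(v1)             # number of elements
str(v1)                # compact summary of the structure
class(v1)              # high-level class of the object
typeof(v1)             # low-level storage type
class(v2)              # a factor ...
typeof(v2)             # ... is stored as integer codes
class(v3)
typeof(v3)
is.vector(v1)
is.list(v1)
is.list(myList)
is.vector(myList)      # lists count as (generic) vectors
is.data.frame(myList)
```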
## [1] 1704
## int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## [1] "integer"
## [1] "integer"
## [1] "factor"
## [1] "integer"
## [1] "numeric"
## [1] "double"
## [1] TRUE
## [1] FALSE
## [1] TRUE
## [1] TRUE
## [1] FALSE
Question: What have you learned? Does it make sense?
POLL 2A: Which of these is true about the gapminder object in R?
(respond at https://pollev.com/chrispaciorek428)
gapminder is a data frame
gapminder is a matrix
gapminder is a vector
gapminder is a list
gapminder is a function
R has functions for learning about the collection of objects in your workspace. Some of this is built into RStudio.
## Let's first create a few objects
x <- rnorm(5)
y <- c(5L, 2L, 7L)
z <- list(a = 3, b = c('sam', 'yang'))
ls() # search the user workspace (global environment)
## [1] "myList" "v1" "v2" "v3" "x" "y" "z"
## [1] "myList" "v1" "v2" "v3" "y" "z"
## myList : List of 3
## $ : num 3
## $ : chr [1:2] "uganda" "bulgaria"
## $ : int [1:2, 1:2] 1 2 3 4
## v1 : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## v2 : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## v3 : num [1:1704] 28.8 30.3 32 34 36.1 ...
## y : int [1:3] 5 2 7
## z : List of 2
## $ a: num 3
## $ b: chr [1:2] "sam" "yang"
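Listings like the two above, which differ only in x, are what you would see if x were removed between two calls to ls(). A minimal sketch, assuming rm() is used for the removal and ls.str() for the structured listing:

```r
x <- rnorm(5)
y <- c(5L, 2L, 7L)
z <- list(a = 3, b = c('sam', 'yang'))

ls()         # names of the objects in the global environment
rm(x)        # remove x from the workspace
ls()         # x is no longer listed
ls.str()     # names plus a compact str() summary of each object
```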
Finally we can save the objects in our R session:
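A minimal sketch of saving and then restoring the workspace, assuming a hypothetical file name in a temporary directory:

```r
y <- c(5L, 2L, 7L)
z <- list(a = 3, b = c('sam', 'yang'))

wsFile <- file.path(tempdir(), 'workspace.Rda')  # hypothetical file name
save.image(file = wsFile)   # save all objects in the global environment
rm(y, z)                    # remove them from the session
load(wsFile)                # restore the saved objects
ls()                        # y and z are back
```

To save only selected objects rather than the whole workspace, save(y, z, file = wsFile) works the same way.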
## [1] "myList" "v1" "v2" "v3" "y" "z"
## character(0)
## [1] "a" "D2R" "deepExtract" "denslines"
## [5] "densplot" "dim2" "ellipse.default" "f.angdist"
## [9] "f.ciplot" "f.dplot" "f.ess" "f.ess.old"
## [13] "f.flushplot" "f.gm" "f.grstat" "f.identity"
## [17] "f.invlogit" "f.logit" "f.logmatern.euc" "f.lonlat2eucl"
## [21] "f.matern.ang" "f.matern.ang.cov" "f.matern.euc" "f.merge"
## [25] "f.rdist.earth" "f.sort" "f.sort2" "f.squexp"
## [29] "f.trimat" "f.vecrep" "format_bytes" "getNcdf"
## [33] "im" "indices" "ln" "lnm"
## [37] "ls_sizes" "machineName" "makePoly" "module"
## [41] "plot.ell" "pmap" "pmap2" "pointsInPoly"
## [45] "pplot" "pretty_size" "print.closeR" "q"
## [49] "R2" "R2D" "rcsv" "rotate"
## [53] "sizes" "source" "temp.colors" "thresh"
## [57] "time_chol" "tplot" "tsplot" "wcsv"
Challenge: how would I find all of my objects that have ‘x’ in their names?
Let’s check out the packages on CRAN. In particular check out the CRAN Task Views.
Essentially any well-established statistical method, and many not-so-established ones, along with much other functionality, is available in a package.
If you want to sound like an R expert, make sure to call them packages and not libraries. A library is the location in the directory structure where the packages are installed/stored.
Using a package involves two steps: installing it on your machine, and then loading it into your R session.
To install a package in RStudio, just use Packages -> Install Packages.
From the command line, you generally will just do
That should work without specifying the repository from which to download the package (though sometimes you will be given a menu of repositories from which to select). There may be some cases in which you might need to specify the repository explicitly, e.g.,
If you’re on a network and are not the administrator of the machine, you may need to explicitly tell R to install it in a directory you are able to write in:
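The three situations just described can be sketched as follows. The package name, repository URL, and library directory are illustrative assumptions. The calls are wrapped in if (FALSE) so that sourcing the code does not trigger any downloads; run the individual lines at the console as needed.

```r
## Hypothetical personal library directory; use any directory you can write to
## (create it first if it does not exist)
userLib <- file.path('~', 'Rlibs')

if (FALSE) {
  ## Default: R picks (or asks you to pick) a repository
  install.packages('gapminder')

  ## Explicitly specify the repository
  install.packages('gapminder', repos = 'https://cran.r-project.org')

  ## Install into, and later load from, a library you have write access to
  install.packages('gapminder', lib = userLib)
  library(gapminder, lib.loc = userLib)
}
```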
If you’re using R directly installed on your laptop (i.e., most of you), now (or at the break) would be a good point to install the various packages we need for the bootcamp, which can be done easily with the following command:
install.packages(c('chron','colorspace','codetools', 'DBI','devtools',
'dichromat','digest','doFuture','dplyr', 'fields',
'foreach','future.apply', 'gapminder', 'ggplot2',
'gridExtra','gtable','inline','iterators','knitr',
'labeling','lattice','lme4','mapproj','maps','munsell',
'proftools','proto','purrr','R6','rbenchmark',
'RColorBrewer','Rcpp','reshape2','rJava',
'RSQLite', 'scales','spam','stringr','tidyr','xlsx',
'xlsxjars','xtable'))
Note that packages often are dependent on other packages so these dependencies may be installed and loaded automatically. E.g., fields depends on maps and on spam.
You can also install directly from a package zip/tarball rather than from CRAN by giving a filename instead of a package name.
You can use syntax as follows to get a list of the objects in a package and a brief description:
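For instance, a sketch using the stats package, chosen here only because it ships with R and is attached by default:

```r
info <- library(help = 'stats')   # index of the package's objects with short descriptions
info                              # printing displays the listing
head(ls('package:stats'))         # names of the exported objects of an attached package
```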
If you click on a specific package on CRAN, there are often vignettes that give an overview of the package and describe its usage. The reference manual is just a single document with the help files for all of the objects/functions in a package, so it may be helpful, but it’s often hard to get the big-picture view from that.
To see the packages that are loaded and the order in which packages are searched for functions/objects:
To see what libraries (i.e., directory locations) R is retrieving packages from:
And to see where R is getting specific packages:
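One way to run the three queries above uses the following base R calls:

```r
search()        # attached packages/environments, in the order R searches them
.libPaths()     # library directories in which R looks for installed packages
searchpaths()   # file-system location of each entry on the search path
```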
Namespaces are a way to keep all the names for objects in a package together in a coherent way and allow R to look for objects in a principled way.
A few useful things to know:
## [1] "acf" "acf2AR" "add.scope"
## [4] "add1" "addmargins" "aggregate"
## [7] "aggregate.data.frame" "aggregate.ts" "AIC"
## [10] "alias" "anova" "ansari.test"
## [13] "aov" "approx" "approxfun"
## [16] "ar" "ar.burg" "ar.mle"
## [19] "ar.ols" "ar.yw"
## [1] 7
## gapminder$lifeExp ~ gapminder$gdpPercap
## <environment: 0x55b88e9f9dc8>
##
## Call:
## stats::lm(formula = gapminder$lifeExp ~ gapminder$gdpPercap)
##
## Coefficients:
## (Intercept) gapminder$gdpPercap
## 5.396e+01 7.649e-04
Can you explain what is going on? Consider the results of search().
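Output like that above can arise when an object in your workspace has the same name as a function in a package. A minimal sketch, assuming a hypothetical user-defined lm() and using the built-in cars dataset as a stand-in for gapminder:

```r
## Hypothetical user-defined function that happens to share a name with stats::lm
lm <- function(i) {
  print(i)
  return(7)
}

res1 <- lm(cars$dist ~ cars$speed)         # calls the function defined above
res2 <- stats::lm(cars$dist ~ cars$speed)  # the :: operator asks for the version in stats
rm(lm)                                     # remove our version; lm now refers to stats::lm again
```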
Packages are available as “Package source”, namely the raw code and help files, and “binaries”, where stuff is packaged up for R to use efficiently.
To look at the raw R code (and possibly C/C++/Fortran code included in some packages), download and unzip the package source tarball. From the command line of a Linux/Mac terminal (note this won’t look right in the slides version of the HTML):
curl https://cran.r-project.org/src/contrib/fields_9.6.tar.gz \
-o fields_9.6.tar.gz
tar -xvzf fields_9.6.tar.gz
cd fields
ls R
ls src
ls man
ls data
R is do-it-yourself - you can write your own package. At its most basic this is just some R scripts that are packaged together in a convenient format. And if giving it to someone else, it’s best to have some documentation in the form of function help files.
Why make a package?
See the devtools package and package.skeleton() for some useful tools to help you create a package. And there are lots of tips/tutorials online, in particular Hadley Wickham’s R packages book.
To read and write from R, you need to have a firm grasp of where in the computer’s filesystem you are reading and writing from.
## What directory does R look for files in (working directory)?
getwd()
## Changing the working directory (Linux/Mac specific)
setwd('~/Desktop/r-bootcamp-fall-2022') # change the working directory
setwd('/Users/paciorek/Desktop') # absolute path
getwd()
setwd('r-bootcamp-fall-2022/modules') # relative path
setwd('../tmp') # relative path, up and back down the tree
## Changing the working directory (Windows specific)
## Windows - use either \\ or / to indicate directories
# setwd('C:\\Users\\Your_username\\Desktop\\r-bootcamp-fall-2022')
# setwd('..\\r-bootcamp-fall-2022')
## Changing the working directory (platform-agnostic)
setwd(file.path('~', 'Desktop', 'r-bootcamp-fall-2022', 'modules')) # change the working directory
setwd(file.path('/', 'Users', 'paciorek', 'Desktop', 'r-bootcamp-fall-2022', 'modules')) # absolute path
getwd()
setwd(file.path('..', 'data')) # relative path
Many errors and much confusion result from you and R not being on the same page in terms of where in the directory structure you are.
In RStudio, you can use Session -> Set Working Directory instead of setwd().
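To make the absolute/relative distinction concrete, here is a small sketch that builds a throwaway directory tree under tempdir() (the directory names are illustrative stand-ins for the bootcamp layout) and moves around in it:

```r
oldwd <- getwd()                     # remember where we started

## Hypothetical project layout created in a temporary location
proj <- file.path(tempdir(), 'bootcamp-demo')
dir.create(file.path(proj, 'modules'), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(proj, 'data'), showWarnings = FALSE)

setwd(file.path(proj, 'modules'))    # absolute path: starts from the root of the filesystem
setwd(file.path('..', 'data'))       # relative path: up one level, then down into data
here <- basename(getwd())
here                                 # name of the directory we ended up in

setwd(oldwd)                         # restore the original working directory
```

Restoring the original working directory at the end keeps the rest of your session unaffected.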
(respond at https://pollev.com/chrispaciorek428)
POLL 2B:
Suppose I am on a Mac that has the following directories:
Users
--paciorek
----Desktop
------r-bootcamp-fall-2022
--------data
--------modules
--------schedule
----Documents
Which of the following use relative paths?
POLL 2C:
Suppose my current working directory is: /Users/paciorek/Desktop/r-bootcamp-fall-2022/modules.
Windows users, just think of this as being: C:\Users\paciorek\Desktop\r-bootcamp-fall-2022\modules.
Which of the following will allow me to change to the ‘data’ subdirectory?
The workhorse for reading into a data frame is read.table(), which allows any separator (CSV, tab-delimited, etc.). read.csv() is a special case of read.table() for CSV files.
Here’s a simple example where R is able to read the data in using the default arguments to read.csv().
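As a self-contained sketch of this pattern, the block below writes a few of the rows shown in the output that follows to a temporary file (the file name is a hypothetical stand-in for the bootcamp’s data file) and reads them back with the default arguments:

```r
## Hypothetical small CSV file built from rows shown in this module's output
csvFile <- file.path(tempdir(), 'cpds_small.csv')
writeLines(c('year,country,vturn,outlays,realgdpgr,unemp',
             '1960,Australia,95.5,NA,NA,1.42',
             '1961,Australia,95.3,NA,-0.07,2.79',
             '1962,Australia,95.3,23.17,5.71,2.63'),
           csvFile)

cpdsSmall <- read.csv(csvFile)   # header row and comma separator are the defaults
head(cpdsSmall)
str(cpdsSmall)                   # check the type R chose for each column
```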
## [1] "/accounts/vis/paciorek/staff/workshops/r-bootcamp-fall-2022/modules"
## year country vturn outlays realgdpgr unemp
## 1 1960 Australia 95.5 NA NA 1.42
## 2 1961 Australia 95.3 NA -0.07 2.79
## 3 1962 Australia 95.3 23.17 5.71 2.63
## 4 1963 Australia 95.7 23.01 6.10 2.12
## 5 1964 Australia 95.7 22.88 6.28 1.15
## 6 1965 Australia 95.7 24.90 4.97 1.15
It’s good to first look at your data in plain text format outside of R and then to check it after you’ve read it into R.
Remember that you’ll need to know the current working directory so that you know where R is looking for files.
Next let’s work through a more involved example, so you can see some of the steps and tricks involved in reading data into R.
## time X40010 X40015 X40020 X40025
## 1 2010-03-01 14:58 821 209 828 258
## 2 2010-03-01 15:01 804 209 804 248
## 3 2010-03-01 15:04 892 212 801 237
## 4 2010-03-01 15:07 857 214 821 243
## 5 2010-03-01 15:10 849 222 834 252
## [1] 120822 62
## [1] "849"
## [1] "character"
# let's delve more deeply
# unique(rta[ , 2]) # don't run when creating slides
head(sort(unique(rta[ , 2])))
## [1] "" "1000" "1001" "1002" "1003" "1004"
## [1] "995" "996" "997" "998" "999" "x"
# can we handle that with read.table?
# help(read.table)
rta2 <- read.table("../data/RTAData.csv", sep = ",", header = TRUE,
na.strings = c('NA', 'x'))
class(rta2[ , 2])
## [1] "integer"
## [1] 24507
## [1] 24507
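The same fix can be tried on a tiny made-up file. In this sketch the file name and the values are illustrative assumptions; the point is that a single stray 'x' turns a numeric column into text unless it is declared as a missing-value code via na.strings:

```r
## Hypothetical small file with one non-numeric entry ('x') in the second column
rtaFile <- file.path(tempdir(), 'rta_small.csv')
writeLines(c('time,40010,40015',
             '2010-03-01 14:58,821,209',
             '2010-03-01 15:01,x,209',
             '2010-03-01 15:04,892,212'),
           rtaFile)

raw <- read.csv(rtaFile)
names(raw)                # numeric column names get an 'X' prefix
class(raw[ , 2])          # character in R >= 4.0 (a factor in earlier versions)

clean <- read.csv(rtaFile, na.strings = c('NA', 'x'))
class(clean[ , 2])        # now read as integer
sum(is.na(clean[ , 2]))   # number of entries treated as missing
```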
The read.table() family of functions just skims the surface of things…
Here’s an example of reading data produced by another statistical package (Stata) with read.dta().
library(foreign)
vote <- read.dta(file.path('..', 'data', '2004_labeled_processed_race.dta'))
head(vote)
## state pres04 sex race age9 partyid income relign8 age60 age65 geocode
## 1 2 1 female white 25-29 <NA> <NA> <NA> 18-29 25-29 3
## 2 2 2 male white 18-24 <NA> <NA> <NA> 18-29 18-24 3
## 3 2 1 female black 30-39 <NA> <NA> <NA> 30-44 30-39 3
## 4 2 1 female black 30-39 <NA> <NA> <NA> 30-44 30-39 3
## 5 2 1 female white 40-44 <NA> <NA> <NA> 30-44 40-49 3
## 6 2 1 female white 30-39 <NA> <NA> <NA> 30-44 30-39 3
## sizeplac brnagain attend year region y
## 1 rural <NA> <NA> 2004 4 0
## 2 rural <NA> <NA> 2004 4 1
## 3 rural <NA> <NA> 2004 4 0
## 4 rural <NA> <NA> 2004 4 0
## 5 rural <NA> <NA> 2004 4 0
## 6 rural <NA> <NA> 2004 4 0
There are a number of other formats that we can handle for either reading or writing. Let’s see:
R can also read in (and write out) Excel files, netCDF files, HDF5 files, etc., in many cases through add-on packages from CRAN.
A pause for a (gentle) diatribe:
Please try to avoid using Excel files as a data storage format. It’s proprietary, complicated (can have multiple sheets), allows a limited number of rows/columns, and files are not easily readable/viewable (unlike simple text files).
Here you have a number of options.
save() and save.image() to save R objects to an R data file.
write.csv() and write.table() to write data frames/matrices to flat text files with delimiters such as comma and tab.
write() to write out matrices in a simple flat text format.
cat() to write to a file, while controlling the formatting to a fine degree.
## png
## 2
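A couple of these options can be sketched as follows, writing to a temporary directory. The data values and file names are made up for illustration, and pdf() with dev.off() is shown as one standard way to send a plot to a file.

```r
## Made-up data frame for illustration
df <- data.frame(country = c('uganda', 'bulgaria'), lifeExp = c(51.5, 73.0))

## Write a flat text file without row names or quotes around strings
outFile <- file.path(tempdir(), 'example.csv')
write.csv(df, file = outFile, row.names = FALSE, quote = FALSE)
readLines(outFile)                    # inspect what was written

## Write a plot to a PDF file
pdfFile <- file.path(tempdir(), 'example.pdf')
pdf(pdfFile, width = 5, height = 5)   # open a PDF graphics device (size in inches)
plot(df$lifeExp)
dev.off()                             # close the device so the file is written
```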
xtable() formats tables for HTML and LaTeX (the default).
## <!-- html table generated in R 4.2.0 by xtable 1.8-4 package -->
## <!-- Fri Aug 19 11:16:27 2022 -->
## <table border=1>
## <tr> <th> </th> <th> Africa </th> <th> Americas </th> <th> Asia </th> <th> Europe </th> <th> Oceania </th> </tr>
## <tr> <td align="right"> 1952 </td> <td align="right"> 52 </td> <td align="right"> 25 </td> <td align="right"> 33 </td> <td align="right"> 30 </td> <td align="right"> 2 </td> </tr>
## <tr> <td align="right"> 1957 </td> <td align="right"> 52 </td> <td align="right"> 25 </td> <td align="right"> 33 </td> <td align="right"> 30 </td> <td align="right"> 2 </td> </tr>
## <tr> <td align="right"> 1962 </td> <td align="right"> 52 </td> <td align="right"> 25 </td> <td align="right"> 33 </td> <td align="right"> 30 </td> <td align="right"> 2 </td> </tr>
## <tr> <td align="right"> 1967 </td> <td align="right"> 52 </td> <td align="right"> 25 </td> <td align="right"> 33 </td> <td align="right"> 30 </td> <td align="right"> 2 </td> </tr>
## <tr> <td align="right"> 1972 </td> <td align="right"> 52 </td> <td align="right"> 25 </td> <td align="right"> 33 </td> <td align="right"> 30 </td> <td align="right"> 2 </td> </tr>
## <tr> <td align="right"> 1977 </td> <td align="right"> 52 </td> <td align="right"> 25 </td> <td align="right"> 33 </td> <td align="right"> 30 </td> <td align="right"> 2 </td> </tr>
## <tr> <td align="right"> 1982 </td> <td align="right"> 52 </td> <td align="right"> 25 </td> <td align="right"> 33 </td> <td align="right"> 30 </td> <td align="right"> 2 </td> </tr>
## <tr> <td align="right"> 1987 </td> <td align="right"> 52 </td> <td align="right"> 25 </td> <td align="right"> 33 </td> <td align="right"> 30 </td> <td align="right"> 2 </td> </tr>
## <tr> <td align="right"> 1992 </td> <td align="right"> 52 </td> <td align="right"> 25 </td> <td align="right"> 33 </td> <td align="right"> 30 </td> <td align="right"> 2 </td> </tr>
## <tr> <td align="right"> 1997 </td> <td align="right"> 52 </td> <td align="right"> 25 </td> <td align="right"> 33 </td> <td align="right"> 30 </td> <td align="right"> 2 </td> </tr>
## <tr> <td align="right"> 2002 </td> <td align="right"> 52 </td> <td align="right"> 25 </td> <td align="right"> 33 </td> <td align="right"> 30 </td> <td align="right"> 2 </td> </tr>
## <tr> <td align="right"> 2007 </td> <td align="right"> 52 </td> <td align="right"> 25 </td> <td align="right"> 33 </td> <td align="right"> 30 </td> <td align="right"> 2 </td> </tr>
## </table>
At a basic level, a simple principle is to have version numbers for all your work: code, datasets, manuscripts. Whenever you make a change to a dataset, increment the version number. For code and manuscripts, increment when you make substantial changes or have obvious breakpoints in your workflow.
However, this is a hassle to do manually. Instead of manually trying to keep track of what changes you’ve made to code, data, and documents, you can use software to help you manage the process. This has several benefits:
Git is a popular tool for version control. Git is based around the notion of a repository, which is basically a version-controlled project directory. Many people use it with the GitHub, GitLab, or Bitbucket online hosting services for repositories.
In the introductory material, we’ve already seen how to get a copy of a GitHub repository on your local machine.
As you’re gathering by now, I’ve used Git and GitHub to manage all the content for this workshop.
We’ll go through a short example of making changes to the r-bootcamp-fall-2022 repository. In this case you don’t have permission to make changes so you’ll just have to follow along as I do it. However, you could start your own repository and then you’d be able to do similar things.
Note that there are graphical interfaces to Git that you might want to check out, but here I’m just going to do it from the command line on my Mac.
The basic notion we need is a commit. As we make changes to our files, we want to commit those changes to the repository regularly. A commit is a set of changes recorded with Git. We will often then push those changes to a remote copy of the repository, such as on GitHub.
Here’s a basic workflow:
Here’s how this would look from the command line (this won’t look right in the slides version of the HTML):
git add myfile
# make changes to mycode.R
git commit -am'added myfile and fixed bug in mycode.R'
git push
The changes are then available to anyone to pull from the remote repository, using git pull or graphical interfaces, such as RStudio’s tools for pulling the changes to your machine, discussed in the GitHub slide in module 0.
There are several mailing lists that have lots of useful postings. In general if you have an error, others have already posted about it.
If you are searching you often want to search for a specific error message. Remember to use double quotes around your error message so it is not broken into individual words by the search engine.
The main rule of thumb is to do your homework first to make sure the answer is not already available on the mailing list or in other documentation. Some of the folks who respond to mailing list questions are not the friendliest so it helps to have a thick skin, even if you have done your homework. On the plus side, they are very knowledgeable and include the world’s foremost R experts/developers.
Here are some guidelines for posting to one of the R mailing lists: https://www.r-project.org/posting-guide.html
sessionInfo() is a function that will give information about your R version, OS, etc., that you can include in your posting.
You also want to include a short, focused, reproducible example of your problem that others can run.
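As an illustration of what such a posting might contain, here is a small sketch with simulated data (the model and variable names are arbitrary assumptions) together with the session details:

```r
set.seed(1)                              # make the random numbers reproducible
dat <- data.frame(x = 1:5, y = rnorm(5))
fit <- lm(y ~ x, data = dat)             # the step you are asking about
coef(fit)

info <- sessionInfo()                    # R version, OS, and attached packages
info
```

Because the data are generated in the example itself, anyone can paste the code into R and see the same behavior.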
Make sure you are able to install packages from CRAN. E.g., try to install lmtest.
Figure out what your current working directory is.
Put the data/cpds.csv file in some other directory on your computer, such as Downloads. Use setwd() to set your working directory to be that directory. Read the file in using read.csv(). Now use setwd() to point to a different directory such as Desktop. Write the data frame out to a file without any row names and without quotes on the character strings.
Make a plot with the gapminder data. Save it as a PDF in Desktop. Now see what happens if you set the width and height arguments to be very small and see how it affects the resulting PDF. Do the same but setting width and height to be very large.
Figure out where (what directory) the graphics package is stored on your machine. Is it the same as where the fields package is stored?
Load the spam package and note the message about backsolve() being masked from package:base. Now if you enter backsolve, you’ll see the code associated with the version of backsolve() provided by the spam package. Now enter base::backsolve and you’ll see the code for the version of backsolve() provided by base R. Explain why typing backsolve shows the spam version rather than the base version.