August 2022, UC Berkeley
Chris Paciorek
For some of the topics here, my goal is not to teach you how to fish, but merely to tell you that fish exist, they are delicious, that they can be caught, and where one might go to figure out how to catch them.
Confusingly, R has three (well, actually five) different systems for OOP, but they get the job done for a lot of tasks.
S3 is used for lm(), glm(), and many other core features in R in the stats package. To understand how base R works, it's helpful to understand S3.
To use OOP in a fashion similar to Python and C++, I suggest using R6.
The basic idea is that coding is structured around objects, which belong to a class, and methods that operate on objects in the class.
Objects are like lists, but with methods that are specifically associated with particular classes, as we've seen with the lm class.
Objects have fields, analogous to the components of a list. For S4 and reference classes, the fields of the class are fixed and all objects of the class must contain those fields.
S3 objects are generally built upon lists.
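Here's a sketch of code (assuming the gapminder data frame from earlier units) that would produce the output below:

mod <- lm(gapminder$lifeExp ~ log(gapminder$gdpPercap))
class(mod)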
## [1] "lm"
## [1] TRUE
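names(mod)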
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
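mod$coefficients       # access a field with list-like syntax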
## (Intercept) log(gapminder$gdpPercap)
## -9.100889 8.405085
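mod[['coefficients']]  # another list-like way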
## (Intercept) log(gapminder$gdpPercap)
## -9.100889 8.405085
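coef(mod)              # via the coef() accessor function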
## (Intercept) log(gapminder$gdpPercap)
## -9.100889 8.405085
The magic of the S3 OOP approach here is that 'generic' methods (i.e., functions) can be tailored to work with specific kinds of objects. This has been called "functional OOP" because the generic methods look like regular functions.
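For example, summary() works on both a numeric vector and a regression object. A sketch, using the gapminder data and the mod fit from above:

summary(gapminder$lifeExp)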
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.60 48.20 60.71 59.47 70.85 82.60
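summary(mod)   # dispatches to summary.lm()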
##
## Call:
## lm(formula = gapminder$lifeExp ~ log(gapminder$gdpPercap))
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.778 -4.204 1.212 4.658 19.285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.1009 1.2277 -7.413 1.93e-13 ***
## log(gapminder$gdpPercap) 8.4051 0.1488 56.500 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.62 on 1702 degrees of freedom
## Multiple R-squared: 0.6522, Adjusted R-squared: 0.652
## F-statistic: 3192 on 1 and 1702 DF, p-value: < 2.2e-16
Question: What do you think R is doing behind the scenes?
Consider summary.lm().
library(methods)
yb <- gapminder$lifeExp > 75
yc <- gapminder$lifeExp
x <- log(gapminder$gdpPercap)
mod1 <- lm(yc ~ x)
mod2 <- glm(yb ~ x, family = binomial)
mod2$residuals[1:20] # access field with list-like syntax
## 1 2 3 4 5 6 7 8
## -1.000026 -1.000031 -1.000035 -1.000033 -1.000022 -1.000027 -1.000054 -1.000035
## 9 10 11 12 13 14 15 16
## -1.000015 -1.000014 -1.000021 -1.000053 -1.000258 -1.000477 -1.000832 -1.001461
## 17 18 19 20
## -1.002613 -1.003204 -1.003496 -1.003838
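Here's a sketch of code producing the output below, exploring the glm object's class and fields:

class(mod2)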
## [1] "glm" "lm"
## [1] TRUE
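is.list(mod2)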
## [1] TRUE
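names(mod2)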
## [1] "coefficients" "residuals" "fitted.values"
## [4] "effects" "R" "rank"
## [7] "qr" "family" "linear.predictors"
## [10] "deviance" "aic" "null.deviance"
## [13] "iter" "weights" "prior.weights"
## [16] "df.residual" "df.null" "y"
## [19] "converged" "boundary" "model"
## [22] "call" "formula" "terms"
## [25] "data" "offset" "control"
## [28] "method" "contrasts" "xlevels"
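And the methods defined for the glm class:

methods(class = "glm")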
## [1] add1 anova coerce confint cooks.distance
## [6] deviance drop1 effects extractAIC family
## [11] formula influence initialize logLik model.frame
## [16] nobs predict print profile residuals
## [21] rstandard rstudent show slotsFromS3 summary
## [26] vcov weights
## see '?methods' for accessing help and source code
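There are also many class-specific methods for the predict() generic:

methods(predict)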
## [1] predict.ar* predict.Arima*
## [3] predict.arima0* predict.bam
## [5] predict.bs* predict.bSpline*
## [7] predict.fastTps predict.gam
## [9] predict.glm predict.glmmPQL*
## [11] predict.gls* predict.gnls*
## [13] predict.HoltWinters* predict.interp.surface
## [15] predict.jam* predict.Krig
## [17] predict.lda* predict.lm
## [19] predict.lme* predict.lmList*
## [21] predict.lmList4* predict.loess*
## [23] predict.lqs* predict.mca*
## [25] predict.merMod* predict.mKrig
## [27] predict.mlm* predict.nbSpline*
## [29] predict.nlme* predict.nls*
## [31] predict.npolySpline* predict.ns*
## [33] predict.pbSpline* predict.polr*
## [35] predict.poly* predict.polySpline*
## [37] predict.ppolySpline* predict.ppr*
## [39] predict.prcomp* predict.princomp*
## [41] predict.qda* predict.qsreg
## [43] predict.rlm* predict.smooth.spline*
## [45] predict.smooth.spline.fit* predict.sreg
## [47] predict.StructTS* predict.surface
## [49] predict.surface.default predict.Tps
## see '?methods' for accessing help and source code
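The generic itself is just a stub that dispatches to the class-specific method:

predict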
## function (object, ...)
## UseMethod("predict")
## <bytecode: 0x55e1588c7140>
## <environment: namespace:stats>
When predict() is called on a GLM object, R invokes the generic predict(), which recognizes that the first argument is of class glm and dispatches to the appropriate class-specific method, predict.glm() in this case.
Making an object and class-specific methods under S3 is simple.
rboot2022 <- list(month = 'August', year = 2022,
instructor = 'Paciorek', attendance = 100)
class(rboot2022) <- "workshop"
rboot2022
## $month
## [1] "August"
##
## $year
## [1] 2022
##
## $instructor
## [1] "Paciorek"
##
## $attendance
## [1] 100
##
## attr(,"class")
## [1] "workshop"
## [1] TRUE
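rboot2022$instructor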
## [1] "Paciorek"
print.workshop <- function(x, ...) {
with(x,
cat("A workshop held in ", month, " ", year, "; taught by ", instructor, ".\nThe attendance was ", attendance, ".\n", sep = ""))
invisible(x)
}
# doesn't execute correctly in the slide creation, so comment out here:
# rboot2022
Note that we rely on the generic print() already existing in R. Otherwise we'd need to create it.
So what is happening behind the scenes here?
Unlike S3, S4 classes have a formal definition, and objects in the class must have the specified fields.
The fields of an S4 class are called 'slots'. Instead of x$field, you use x@field.
Here’s a bit of an example of an S4 class for a linear mixed effects model from lme4:
## [1] "lmerMod"
## attr(,"package")
## [1] "lme4"
## [1] getL show
## see '?methods' for accessing help and source code
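slotNames(fm1)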
## [1] "resp" "Gp" "call" "frame" "flist" "cnms" "lower"
## [8] "theta" "beta" "u" "devcomp" "pp" "optinfo"
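fm1$theta   # the $ operator doesn't work for S4 objects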
## Error in fm1$theta: $ operator not defined for this S4 class
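fm1@theta   # use @ to access a slot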
## [1] 0.96674177 0.01516906 0.23090995
R6 classes are a relatively new feature in R that provide object-oriented programming with behavior more like that in other languages.
Here’s an extended example that simulates random time series.
library(R6)
tsSimClass <- R6Class("tsSimClass",
## class for holding time series simulators
public = list(
initialize = function(times, mean = 0, corParam = 1){
library(fields)
stopifnot(is.numeric(corParam), length(corParam) == 1)
stopifnot(is.numeric(times))
private$times <- times
private$n <- length(times)
private$mean <- mean
private$corParam <- corParam
private$currentU <- FALSE
private$calcMats()
},
changeTimes = function(newTimes){
# a 'setter': modifies private member fields and recalculates
private$times <- newTimes
private$n <- length(newTimes)   # keep n in sync with the new times
private$calcMats()
},
getTimes = function(){
# a 'getter' method
return(private$times)
},
print = function(){ # 'print' method
cat("R6 Object of class 'tsSimClass' with ",
private$n, " time points.\n", sep = '')
invisible(self)
}
),
## private methods and functions not accessible externally
private = list(
calcMats = function() {
## calculates correlation matrix and Cholesky factor
lagMat <- fields::rdist(private$times) # local variable
corMat <- exp(-lagMat^2 / private$corParam^2)
private$U <- chol(corMat) # square root matrix
cat("Done updating correlation matrix and Cholesky factor.\n")
private$currentU <- TRUE
invisible(self)
},
# internal (private) fields not directly accessible from outside the object
n = NULL,
times = NULL,
mean = NULL,
corParam = NULL,
U = NULL,
currentU = FALSE
)
)
## add another method after class is already created
tsSimClass$set("public", "simulate", function() {
if(!private$currentU)
private$calcMats()
## analogous to mu+sigma*z for generating N(mu, sigma^2)
return(private$mean + crossprod(private$U, rnorm(private$n)))
})
my_ts <- tsSimClass$new(1:100, 2, 1)
## Done updating correlation matrix and Cholesky factor.
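my_ts   # invokes our 'print' method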
## R6 Object of class 'tsSimClass' with 100 time points.
set.seed(1)
y <- my_ts$simulate() # generate a time series
plot(my_ts$getTimes(), y, type = 'l', xlab = 'time',
ylab = 'process values')
## We can't directly access private fields or methods
my_ts$times
## NULL
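my_ts$calcMats()   # a private method, so this fails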
## Error in eval(expr, envir, enclos): attempt to apply non-function
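my_ts$changeTimes(seq(0, 1000, length = 100))  # hypothetical new times; the public setter recalculates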
## Done updating correlation matrix and Cholesky factor.
When you write your own functions, and particularly for distributing to others, it's a good idea to check the validity of the inputs and to provide informative error and warning messages.
We can use stop() and warning() to do this. They're the same functions that are being called when you see an error message or a warning in reaction to your own work in R.
mysqrt <- function(x) {
if(is.list(x)) {
warning("x is a list; converting to a vector")
x <- unlist(x)
}
if(!is.numeric(x)) {
stop("What is the square root of 'bob'?")
} else {
if(any(x < 0)) {
warning("mysqrt: found negative values; proceeding anyway")
x[x >= 0] <- (x[x >= 0])^(1/2)
x[x < 0] <- NaN
return(x)
} else return(x^(1/2))
}
}
mysqrt(c(1, 2, 3))
## [1] 1.000000 1.414214 1.732051
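And with problematic inputs:

mysqrt(c(5, -7))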
## Warning in mysqrt(c(5, -7)): mysqrt: found negative values; proceeding anyway
## [1] 2.236068 NaN
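mysqrt(c("asdf", "sdf"))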
## Error in mysqrt(c("asdf", "sdf")): What is the square root of 'bob'?
## Warning in mysqrt(list(5, 3, "ab")): x is a list; converting to a vector
## Error in mysqrt(list(5, 3, "ab")): What is the square root of 'bob'?
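Compare R's built-in sqrt():

sqrt(c(5, -7))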
## Warning in sqrt(c(5, -7)): NaNs produced
## [1] 2.236068 NaN
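sqrt("asdf")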
## Error in sqrt("asdf"): non-numeric argument to mathematical function
## Error in sqrt(list(5, 3, 2)): non-numeric argument to mathematical function
So we’ve done something similar to what sqrt()
actually
does in R.
When you automate analyses, sometimes an R function call will fail. But you don’t want all of your analyses to grind to a halt because one failed. Rather, you want to catch the error, record that it failed, and move on.
For me this is most critical when I’m doing stratified analyses or sequential operations.
The try() function is a powerful tool here.
Suppose we tried to do a stratified analysis of life expectancy on GDP within continents, for 2007. I’m going to do this as a for loop for pedagogical reasons, but again, it would be better to do this with dplyr/lapply/by type tools.
For the purpose of illustration, I’m going to monkey a bit with the data such that there is an error in fitting Oceania. This is artificial, but when you stratify data into smaller groups it’s not uncommon that the analysis can fail for one of the groups (often because of small sample size or missing data).
mod <- list()
fakedat <- gapminder[gapminder$year == 2007, ]
fakedat$gdpPercap[fakedat$continent == 'Oceania'] <- NA
for(cont in c('Asia', 'Oceania', 'Europe', 'Americas', 'Africa')) {
cat("Fitting model for continent ", cont, ".\n")
tmp <- subset(fakedat, continent == cont)
mod[[cont]] <- lm(lifeExp ~ log(gdpPercap), data = tmp)
}
## Fitting model for continent Asia .
## Fitting model for continent Oceania .
## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
What happened?
Now let's wrap the model fitting in try():
hardermod <- list()
for(cont in c('Asia', 'Oceania', 'Europe', 'Americas', 'Africa')) {
cat("Fitting model for continent ", cont, ".\n")
tmp <- subset(fakedat, continent == cont)
curMod <- try(lm(lifeExp ~ log(gdpPercap), data = tmp))
if(is(curMod, "try-error")) hardermod[[cont]] <- NA
else hardermod[[cont]] <- curMod
}
## Fitting model for continent Asia .
## Fitting model for continent Oceania .
## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
## 0 (non-NA) cases
## Fitting model for continent Europe .
## Fitting model for continent Americas .
## Fitting model for continent Africa .
##
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = tmp)
##
## Coefficients:
## (Intercept) log(gdpPercap)
## 25.650 5.157
## [1] NA
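As an aside, here's a sketch of the same stratified fitting using lapply() instead of a for loop:

mods <- lapply(split(fakedat, fakedat$continent),
               function(d) try(lm(lifeExp ~ log(gdpPercap), data = d)))
sapply(mods, is, "try-error")   # flag the strata that failed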
One of the powerful capabilities you have in R is the ability to use R to modify and create R code.
First we need to understand a bit about how R code is stored and manipulated when we don’t want to immediately evaluate it.
When you send some code to R to execute, it has to 'parse' the input; i.e., to process it so that it knows how to evaluate it. The parsed input can then be evaluated in the proper context (i.e., the right frame, holding the objects to be operated on and created).
We can capture parsed code before it is evaluated, manipulate it, and execute the modified result.
Here’s a teaser:
## [1] 3
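code <- quote(n <- 100)
code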
## n <- 100
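class(code)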
## [1] "<-"
## Error in eval(expr, envir, enclos): object 'n' not found
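eval(code)
n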
## [1] 100
results <- rep(0, n)
moreCode <- quote(for(i in 1:n) {
tmp <- rnorm(30)
results[i] <- min(tmp)
})
class(moreCode)
## [1] "for"
## [[1]]
## `for`
##
## [[2]]
## i
##
## [[3]]
## 1:n
##
## [[4]]
## {
## tmp <- rnorm(30)
## results[i] <- min(tmp)
## }
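We can also build code as text, then parse and evaluate it (a sketch consistent with the output below):

codeText <- "n <- 200"
codeText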
## [1] "n <- 200"
## expression(n <- 200)
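eval(parsedCode)
n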
## [1] 200
So you could use R’s string manipulation capabilities to write and then evaluate R code. Meta.
Suppose you were given a bunch of objects named “x1”, “x2”, “x3”, … and you wanted to write code to automatically do some computation on them. Here I’ll just demonstrate with three, but this is obviously more compelling if there are many of them.
# assume these objects were provided to you
x1 <- rnorm(5)
x2 <- rgamma(10, 1)
x3 <- runif(20)
# now you want to work with them
nVals <- 3
results <- rep(0, nVals)
for(i in 1:nVals) {
varName <- paste("x", i, sep = "")
tmp <- eval(as.name(varName))
# tmp <- get(varName) # an alternative
results[i] <- mean(tmp)
}
results
## [1] -0.3971457 1.4027052 0.4788078
## [1] "x3"
## [1] 0.5549006 0.4296244 0.4527201 0.3064433 0.5783539 0.9103703 0.1426041
## [8] 0.4150476 0.2109258 0.4287504 0.1326900 0.4600964 0.9429571 0.7619739
## [15] 0.9329098 0.4706785 0.6035881 0.4849897 0.1088063 0.2477268
Or suppose you needed to create “x1”, “x2”, “x3”, automatically.
nVals <- 3
results <- rep(0, nVals)
for(i in 1:nVals) {
varName <- paste("x", i, sep = "")
assign(varName, rnorm(10))
}
x2
## [1] -0.91068068 0.74127631 0.06851153 -0.32375075 -1.08650305 -1.01592895
## [7] -0.76779018 -1.11972006 -0.44817424 0.47173637
Can you think of any uses of this ability for R to self-generate?
Text (either in the form of a file with regular language in it or a data file with fields of character strings) will often contain characters that are not part of the limited ASCII set, which has 128 code points, including control codes; basically what you see on a standard US keyboard.
UTF-8 is an encoding for the Unicode characters that include more than 110,000 characters from 100 different alphabets/scripts. It’s widely used on the web.
Latin-1 encodes a small subset of Unicode and contains the characters used in many European languages (e.g., letters with accents).
To read files with other characters correctly into R, you may need to tell R what encoding the file is in; e.g., see the help on read.table() for the fileEncoding and encoding arguments.
With strings already in R, you can convert between encodings with iconv():
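text <- "Melhore sua segurança"   # Portuguese: "Improve your security"
text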
## [1] "Melhore sua segurança"
## [1] "Melhore sua seguran???a"
You can mark a string with an encoding so R can display it correctly:
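A sketch using latin1 bytes, consistent with the output below:

x <- "fa\xE7ile"   # 0xE7 is 'ç' in latin1
Encoding(x) <- "latin1"
x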
## [1] "façile"
## [1] "¡ ¢ £ ñ ò"
Windows, Mac, and Linux handle line endings in text files somewhat differently. So if you read a text file into R that was created in a different operating system you can run into difficulties.
In Windows, lines end in both a newline (\n) and a carriage return (\r). So in UNIX you might see ^M at the end of lines when you open a Windows file in a text editor; the dos2unix or fromdos commands in UNIX can do the necessary conversion.
Conversely, in Windows a UNIX text file might appear to be all one line; the unix2dos or todos commands in UNIX can do the conversion.
There may also be Windows tools to deal with this.
R has the capability to read and write from a variety of relational database management systems (DBMS). Basically a database is a collection of rectangular format datasets (tables). Some of these tables have fields in common so it makes sense to merge (i.e., join) information from multiple tables. E.g., you might have a database with a table of student information, a table of teacher information and a table of school information.
The DBI package provides a front-end for manipulating databases from a variety of DBMS (MySQL, SQLite, Oracle, among others)
Basically, you tell the package what DBMS is being used on the back-end, link to the actual database, and then you can use the standard functions in the package regardless of the back-end.
You can get an example database of information about Stack Overflow questions and answers from http://www.stat.berkeley.edu/share/paciorek/stackoverflow-2016.db. Stack Overflow is a website where programmers and software users can pose questions and get answers to technical problems.
library(RSQLite) # DBI is a dependency
db <- dbConnect(SQLite(), dbname = "../data/stackoverflow-2016.db")
## stackoverflow-2016.db is an SQLite database
## metadata
dbListTables(db)
## [1] "answers" "maxRepByQuestion" "questions" "questionsAugment"
## [5] "questions_tags" "users"
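And the fields in one of the tables:

dbListFields(db, "questions")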
## [1] "questionid" "creationdate" "score" "viewcount" "title"
## [6] "ownerid"
## simple filter operation
popular <- dbGetQuery(db, "select * from questions
where viewcount > 10000")
## a join followed by a filter operation
popularR <- dbGetQuery(db, "select * from questions join questions_tags
on questions.questionid = questions_tags.questionid
where viewcount > 10000 and
tag = 'r'")
dbDisconnect(db)
POLL 10A:
What kind of R object do you think is returned when you query a database using dbGetQuery?
(respond at https://pollev.com/chrispaciorek428)
(e.g., lm() returns a specialized object)
Note to participants: I'm having trouble with parallelization in RStudio, so we'll just run the demo code in this module in a command line R session. You can open the basic R GUI for Mac or Windows, or, on a Mac, start R in a terminal window.
We’ll focus on shared memory computation here.
To see how many cores you have: in Linux, run nproc in a terminal; on a Mac, run system_profiler | grep -i 'Cores'.
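From within R, a quick check is:

library(parallel)
detectCores()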
To see if multiple cores are being used by your job, you can use a monitoring tool such as top.
Some basic approaches, each covered below, are: using a threaded BLAS for linear algebra, parallelized versions of lapply() and its variants, and foreach.
and its variantsR comes with a default BLAS (basic linear algebra subroutines) and LAPACK (linear algebra package) that carry out the core linear algebra computations. However, you can generally improve performance (sometimes by an order of magnitude) by using a different BLAS. Furthermore a threaded BLAS will allow you to use multiple cores.
A ‘thread’ is a lightweight process, and the operating system sees multiple threads as part of a single process.
We’ll show by demonstration that my desktop in my office is using multiple cores for linear algebra operations.
# note to CJP: don't run on laptop with slow BLAS
n <- 5000
x <- matrix(rnorm(n^2), n)
U <- chol(crossprod(x))
You should see that your R process is using more than 100% of CPU. Inconceivable!
You can talk with your systems administrator about linking R to a fast BLAS or you can look into it yourself for your personal machine; see the R Installation and Administration manual.
Note that in some cases, in particular for small matrix operations, using multiple threads may actually slow down computation, so you may want to experiment, particularly with Linux. You can force the linear algebra to use only a single core by running export OMP_NUM_THREADS=1 (assuming you're using the bash shell) in the terminal window before starting R in that same terminal. Or see the RhpcBLASctl package to do it from within R.
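For example (a sketch, assuming the RhpcBLASctl package is installed):

library(RhpcBLASctl)
blas_set_num_threads(1)   # restrict the BLAS to a single thread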
Finally, note that threaded BLAS and either foreach or parallel versions of apply() can conflict and cause R to hang, so you're likely to want to set the number of threads to 1 as above if you're doing explicit parallelization.
The first question you should be asking: is your computation embarrassingly parallel (EP)?
An EP problem is one that can be solved by doing independent computations as separate processes without communication between the processes. You can get the answer by doing separate tasks and then collecting the results.
Examples in statistics / data science / machine learning include simulation studies with many independent replicates, bootstrapping, cross-validation, and stratified analyses.
Can you think of others in your work?
Some things that are not EP (at least not in a basic formulation): Markov chain Monte Carlo (each iteration depends on the previous one) and iterative optimization algorithms.
POLL 10B: Have you used R’s parallel processing capabilities?
(respond at https://pollev.com/chrispaciorek428)
The future package provides a lot of nice features for parallelization. We'll just scratch the surface here, using it to parallelize operations over the elements of a list (note that this is essentially equivalent to parLapply and mclapply).
First, make sure your computations on the elements are independent of each other and don’t involve sequential calculations!
We’ll use a new dataset here, which is a dataset of airline departure times (in particular delays) for all flights from SFO over a period of several years. We’ll do a stratified analysis, fitting a GAM (see Unit 8) for each of the destination airports.
fitFun <- function(curDest) {
library(mgcv)
tmp <- subset(air, Dest == curDest)
tmp$Hour <- tmp$CRSDepTime %/% 100
curMod <- try(gam(DepDelay ~ Year + s(Month) + s(Hour) +
as.factor(DayOfWeek), data = tmp))
if(is(curMod, "try-error")) curMod <- NA
return(curMod)
}
library(future.apply)
nCores <- 4
plan(multisession, workers = nCores)
out <- future_lapply(unique(air$Dest), fitFun)
out[[1]]
##
## Family: gaussian
## Link function: identity
##
## Formula:
## DepDelay ~ Year + s(Month) + s(Hour) + as.factor(DayOfWeek)
##
## Estimated degrees of freedom:
## 7.51 7.49 total = 23
##
## GCV score: 1051.858
## [1] "Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : \n A term has fewer unique covariate combinations than specified maximum degrees of freedom\n"
## attr(,"class")
## [1] "try-error"
## attr(,"condition")
## <simpleError in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): A term has fewer unique covariate combinations than specified maximum degrees of freedom>
Note that the plan statement determines how the parallelization is done behind the scenes. As shown here, it will start up workers locally on your computer, but if you have access to a cluster, you can modify the plan to make use of multiple compute nodes in a cluster.
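For example, a sketch with hypothetical machine names:

plan(cluster, workers = c("node1.berkeley.edu", "node2.berkeley.edu"))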
One thing to keep in mind is whether the different tasks all take about the same amount of time or widely different times. In the latter case, one wants to sequentially dispatch tasks as earlier tasks finish, rather than dispatching a block of tasks to each core. See the future.scheduling argument for user control over how the allocation is done.
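For example, this sketch adjusts the granularity of the dispatching (see ?future_lapply for the exact semantics of future.scheduling):

out <- future_lapply(unique(air$Dest), fitFun, future.scheduling = Inf)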
First, make sure your iterations are independent and don’t involve sequential calculations!
The foreach package provides a way to do a for loop using multiple cores. It can use a variety of 'back-ends' that handle the nitty-gritty of the parallelization. Happily, it integrates nicely with the future package.
library(doFuture)  # provides registerDoFuture(); loads foreach as a dependency
## Loading required package: foreach
nCores <- 4 # actually only 2 on my laptop, but appears hyperthreaded
registerDoFuture()
plan(multisession, workers = nCores)
out <- foreach(dest = unique(air$Dest)) %dopar% {
cat("Starting job for ", dest, ".\n", sep = "")
outSub <- fitFun(dest)
cat("Finishing job for ", dest, ".\n", sep = "")
outSub # this will become part of the out object
}
## Starting job for SLC.
## Finishing job for SLC.
## Starting job for PDX.
## Finishing job for PDX.
## Starting job for SNA.
## Finishing job for SNA.
## ... (similar Starting/Finishing lines for each of the remaining destinations) ...
## Starting job for LGB.
## Finishing job for LGB.
## [[1]]
##
## Family: gaussian
## Link function: identity
##
## Formula:
## DepDelay ~ Year + s(Month) + s(Hour) + as.factor(DayOfWeek)
##
## Estimated degrees of freedom:
## 7.51 7.49 total = 23
##
## GCV score: 1051.858
##
## [[2]]
##
## Family: gaussian
## Link function: identity
##
## Formula:
## DepDelay ~ Year + s(Month) + s(Hour) + as.factor(DayOfWeek)
##
## Estimated degrees of freedom:
## 7.46 8.64 total = 24.1
##
## GCV score: 1325.755
##
## [[3]]
##
## Family: gaussian
## Link function: identity
##
## Formula:
## DepDelay ~ Year + s(Month) + s(Hour) + as.factor(DayOfWeek)
##
## Estimated degrees of freedom:
## 5.93 8.70 total = 22.63
##
## GCV score: 639.5956
##
## [[4]]
##
## Family: gaussian
## Link function: identity
##
## Formula:
## DepDelay ~ Year + s(Month) + s(Hour) + as.factor(DayOfWeek)
##
## Estimated degrees of freedom:
## 7.84 8.51 total = 24.35
##
## GCV score: 880.4363
##
## [[5]]
##
## Family: gaussian
## Link function: identity
##
## Formula:
## DepDelay ~ Year + s(Month) + s(Hour) + as.factor(DayOfWeek)
##
## Estimated degrees of freedom:
## 8.00 5.68 total = 21.69
##
## GCV score: 1367.66
Question: What do you think are the advantages and disadvantages of having many small tasks vs. a few large tasks?
A tale of the good, the bad, and the ugly
Random numbers on a computer are not truly random but are generated as a sequence of pseudo-random numbers. The sequence is finite (but very, very, very, very long) and eventually repeats itself.
A random number seed determines where in the sequence one starts when generating random numbers.
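Setting the same seed reproduces the same sequence, which is exactly what you don't want to happen across parallel tasks. A sketch consistent with the output below:

set.seed(1)
rnorm(5)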
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
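set.seed(1)
rnorm(5)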
## [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
The (not so) bad: Use a different seed for each task or each process. It’s possible the subsequences will overlap but quite unlikely.
The good: Use the L’Ecuyer algorithm
(library(lecuyer)
) to ensure distinct subsequences
foreach
you can use %dorng%
from the
doRNG package in place of %dopar%
mclapply()
, use the mc.set.seed
argument (see help(mcparallel)
)The ugly but good: Generate all your random numbers in the main process and distribute them to the tasks if feasible.
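For example, a sketch using doRNG with foreach (assuming the doRNG package is installed):

library(doRNG)
registerDoFuture()
plan(multisession, workers = 2)
set.seed(1)
res <- foreach(i = 1:4) %dorng% mean(rnorm(10))  # each task gets a distinct subsequence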
The syntax for using L’Ecuyer is available in my parallel computing tutorial.
If you have access to multiple machines within a networked environment, such as the compute servers in the Statistics Department or the campus-wide Savio cluster or machines on Amazon’s EC2 service, there are a few (sometimes) straightforward ways to parallelize EP jobs across machines.
1. Use the future package with a plan that starts workers on the various machines.
2. Use foreach with doMPI as the back-end. You'll need MPI and Rmpi installed on all machines.
3. Use foreach with doSNOW as the back-end and start a cluster of type "SOCK" to avoid needing MPI/Rmpi.
4. Use the parallel package's parLapply(), parSapply(), mclapply(), etc. with a cluster of workers across machines.
, etc.For option 1, see my tutorial on the future package (and also Python’s dask package) for syntax and more details.
For options 2-4, see my general tutorial on parallel computing for syntax and more details.