library(tidyverse)
library(here)
set.seed(84625)
For the next couple of lectures we'll talk about writing fast R code.
Remember: correct code is your first priority, clear code your next, and only then think about making it fast if necessary.
Efficiency in the project rubric
Does not meet expectations: Code includes common sources of inefficiency: unnecessary iteration, not preallocating objects in loops, etc.
Exceeds expectations: Some effort has been made to identify and improve the efficiency of functions. May include: timing experiments, implementations in C++, reimplementations of existing functions, etc.
Today
Two key techniques: vectorization, and doing as little as possible.
The first example is taken from Hands-On Programming with R:
abs_loop <- function(vec){
  # examine each element in turn, flipping the sign of any negative value
  for (i in seq_along(vec)) {
    if (vec[i] < 0) {
      vec[i] <- -vec[i]
    }
  }
  vec
}
abs_sets <- function(vec){
  # find all the negative elements as a set, then flip them all at once
  negs <- vec < 0
  vec[negs] <- -vec[negs]
  vec
}
x <- rnorm(1000)
Your Turn: Do they give the same answer when applied to x? Which is faster?
bench::mark(
abs_loop = abs_loop(x),
abs_sets = abs_sets(x)
)
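For the first question, one quick check (note that bench::mark() also verifies, by default, that all expressions return the same result):
identical(abs_loop(x), abs_sets(x))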
abs_loop() relies on a loop to examine each element of x in turn. abs_sets() finds all the negative values of x as a set, and operates on them all at once.
The combination of for with if is a key signal: a more efficient implementation will use R's strengths of elementwise operations and subsetting.
ages <- c(14, 21, 33, 19, 60)
age_cats <- character(length(ages))
for (i in seq_along(ages)){
  age <- ages[i]
  if (age < 15){
    age_cats[i] <- "under 15"
  } else if (age < 21){
    age_cats[i] <- "15 to 21"
  } else if (age < 55){
    age_cats[i] <- "21 to 55"
  } else {
    age_cats[i] <- "over 55"
  }
}
Your Turn: Eliminate the for loop by using logical operators and subsetting.
age_cats2 <- character(length(ages))
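One possible solution, as a sketch (the boundaries follow the loop above):
age_cats2[ages < 15] <- "under 15"
age_cats2[ages >= 15 & ages < 21] <- "15 to 21"
age_cats2[ages >= 21 & ages < 55] <- "21 to 55"
age_cats2[ages >= 55] <- "over 55"
age_cats2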
Even better, use cut() (with right = FALSE so the intervals match the loop above; otherwise an age of exactly 21 would land in "15 to 21"):
cut(ages, breaks = c(0, 15, 21, 55, 150), right = FALSE,
    labels = c("under 15", "15 to 21", "21 to 55", "over 55"))
## [1] under 15 21 to 55 21 to 55 15 to 21 over 55
## Levels: under 15 15 to 21 21 to 55 over 55
Of course, the fastest option here is the built-in abs():
bench::mark(
abs_loop = abs_loop(x),
abs_sets = abs_sets(x),
abs_base = abs(x)
)
## # A tibble: 3 x 10
## expression min mean median max `itr/sec` mem_alloc n_gc
## <chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 abs_loop 66.76µs 73.92µs 67.84µs 911µs 13528. 7.86KB 1
## 2 abs_sets 6.52µs 10.39µs 8.43µs 838µs 96259. 27.46KB 5
## 3 abs_base 517ns 1.32µs 852ns 679µs 754532. 7.86KB 1
## # ... with 2 more variables: n_itr <int>, total_time <bch:tm>
abs
## function (x) .Primitive("abs")
In the body of a function, .Primitive, .Internal or .Call signify that the function is written in another, lower-level language. These are compiled, and are usually very fast.
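For example, printing sum shows it is another primitive:
sum
## function (x, ..., na.rm = FALSE)  .Primitive("sum")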
Reimplementing functions in R that are already implemented in C (or another low-level language) isn't usually going to get you a speed-up.
Imagine we are exploring the performance of the sample mean over samples from a specified population; for simplicity, let's say Normal(0, 1).
Our current approach for 1000 simulations and a sample size of 10 might be something like:
set.seed(187)
samples <- rerun(1000, rnorm(n = 10))
map_dbl(samples, mean)
But it may not be the fastest approach, and although that doesn't matter if we do this once, it might if we repeat it over a whole set of varying settings.
If you have many vectors, all of the same type and length, you'll often get speed-ups by working with a matrix and using vectorized functions or matrix operations.
In R, people often use the term "vectorized" for a function that takes a vector, rather than scalar (single number), approach to a problem.
We can generate all our samples in one go since they are all i.i.d.:
set.seed(187)
samples_matrix <- rnorm(n = 10*1000) %>% matrix(ncol = 1000)
Each sample sits in a column of this matrix.
Then to take the mean of each sample, we need to take the mean of each column. We could do that with a for loop, but a much better option is colMeans() (there are also rowMeans(), rowSums() and colSums() functions):
colMeans(samples_matrix)
Your Turn: I've extracted the two approaches into the functions below. How much faster is sample_means_matrix()?
sample_means_purrr <- function(n_sims = 1000, n = 10){
  samples <- rerun(n_sims, rnorm(n = n))
  map_dbl(samples, mean)
}
sample_means_matrix <- function(n_sims = 1000, n = 10){
  samples_matrix <- rnorm(n = n_sims * n) %>% matrix(ncol = n_sims)
  colMeans(samples_matrix)
}
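One way to time the two approaches (a sketch; check = FALSE because each call draws different random numbers, so the results won't be identical):
bench::mark(
  purrr = sample_means_purrr(),
  matrix = sample_means_matrix(),
  check = FALSE
)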
If your problem can be solved by matrix algebra (rather than looping over elements, columns or rows), you'll often get a speed-up, since matrix operations usually call highly optimized external libraries.
For example, imagine fitting a regression to each column of Y using the explanatories in X:
nsim <- 1000
n <- 30
p <- 5
Y <- rnorm(nsim*n) %>% matrix(ncol = nsim)
X <- rnorm(n*p) %>% matrix(ncol = p)
With a for loop:
coefs <- matrix(nrow = p, ncol = nsim)
for (i in 1:nsim){
  coefs[, i] <- lm(Y[, i] ~ X - 1)$coefficients
}
Matrix algebra:
coefs2 <- solve(t(X) %*% X) %*% t(X) %*% Y
Or, since lm() accepts a matrix response, fit all the regressions in one call:
coefs3 <- lm(Y ~ X - 1)$coefficients
all.equal(coefs, coefs2)
## [1] TRUE
all.equal(coefs2, coefs3, check.attributes = FALSE)
## [1] TRUE
Also see the functions crossprod() and outer().
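For instance, the normal equations above could be written with crossprod(), and with solve(A, b) to avoid forming the inverse explicitly. A sketch:
coefs2b <- solve(crossprod(X), crossprod(X, Y))  # crossprod(X) is t(X) %*% X
all.equal(coefs2, coefs2b)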
Don't use a for loop when you don't have to. Combinations of for() and if() are a signal that a vectorized approach will be much more efficient. Look out for existing vectorized functions like diff(), cumsum(), rowSums() and rowMeans().
(Aside: using purrr::map() instead of a for loop doesn't help with speed, just clarity. The technique of vectorization should reduce the need for purrr::map() as well.)
From Advanced R:
Vectorisation won’t solve every problem, and rather than torturing an existing algorithm into one that uses a vectorised approach, you’re often better off writing your own vectorised function in C++.
for loops
If you must use a for loop, two key things make it as fast as possible: preallocate your output object, and do as much of the work as you can outside the loop.
Your Turn: Compare the timing of these two loops:
loop_no_init <- function(n = 1000000){
  output <- NA  # starts at length one and grows on every iteration
  for (i in 1:n) {
    output[i] <- i + 1
  }
  output
}
loop_init <- function(n = 1000000){
  output <- integer(n)  # preallocated to its final length
  for (i in 1:n) {
    output[i] <- i + 1
  }
  output
}
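One way to compare them, as a sketch (using a smaller n so the non-preallocated version finishes quickly):
bench::mark(
  no_init = loop_no_init(n = 10000),
  init = loop_init(n = 10000)
)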
An example in C++ next week
In R generic functions are functions whose behavior depends on the class of the object given as the first argument.
This is part of the S3 system, one of R's approaches to object-oriented programming.
Lots of functions are generic, e.g. mean(), plot() and summary(). You can tell from the help:
?mean
Or when you look inside the function:
mean
## function (x, ...)
## UseMethod("mean")
## <bytecode: 0x7fb85349ebb0>
## <environment: namespace:base>
A method for a generic function has a name like generic.class; you can see the methods for a generic with the methods() function:
methods("mean")
## [1] mean.bench_time* mean.Date mean.default mean.difftime
## [5] mean.POSIXct mean.POSIXlt
## see '?methods' for accessing help and source code
When you call a generic, you lose a little time while R figures out which method to call (method dispatch).
z <- rnorm(100)
Your Turn: Compare the timing of mean() and mean.default() applied to z.
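One possible comparison, as a sketch:
bench::mark(
  generic = mean(z),
  method = mean.default(z)
)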
The timings tend to be similar for large vectors: the cost of method dispatch is fixed, so it matters less as the computation itself grows.
This optimisation is a little risky. While mean.default() is almost twice as fast, it'll fail in surprising ways if x is not a numeric vector. You should only use it if you know for sure what x is.
(Slightly different for S4 methods)
Imagine a setting where we are investigating, by simulation, the distribution of the Welch t-test statistic over various scenarios.
One option:
simulated_data <- data.frame(y = rnorm(20), grp = rep(1:2, 10))
t.test(y ~ grp, data = simulated_data)$statistic
## t
## 1.109222
Your Turn: Look at the source for t.test.default(). What else, beyond finding the test statistic, does the function do?
getAnywhere(t.test.default)
## A single object matching 't.test.default' was found
## It was found in the following places
## registered S3 method for t.test from namespace stats
## namespace:stats
## with value
##
## function (x, y = NULL, alternative = c("two.sided", "less", "greater"),
## mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95,
## ...)
## {
## alternative <- match.arg(alternative)
## if (!missing(mu) && (length(mu) != 1 || is.na(mu)))
## stop("'mu' must be a single number")
## if (!missing(conf.level) && (length(conf.level) != 1 || !is.finite(conf.level) ||
## conf.level < 0 || conf.level > 1))
## stop("'conf.level' must be a single number between 0 and 1")
## if (!is.null(y)) {
## dname <- paste(deparse(substitute(x)), "and", deparse(substitute(y)))
## if (paired)
## xok <- yok <- complete.cases(x, y)
## else {
## yok <- !is.na(y)
## xok <- !is.na(x)
## }
## y <- y[yok]
## }
## else {
## dname <- deparse(substitute(x))
## if (paired)
## stop("'y' is missing for paired test")
## xok <- !is.na(x)
## yok <- NULL
## }
## x <- x[xok]
## if (paired) {
## x <- x - y
## y <- NULL
## }
## nx <- length(x)
## mx <- mean(x)
## vx <- var(x)
## if (is.null(y)) {
## if (nx < 2)
## stop("not enough 'x' observations")
## df <- nx - 1
## stderr <- sqrt(vx/nx)
## if (stderr < 10 * .Machine$double.eps * abs(mx))
## stop("data are essentially constant")
## tstat <- (mx - mu)/stderr
## method <- if (paired)
## "Paired t-test"
## else "One Sample t-test"
## estimate <- setNames(mx, if (paired)
## "mean of the differences"
## else "mean of x")
## }
## else {
## ny <- length(y)
## if (nx < 1 || (!var.equal && nx < 2))
## stop("not enough 'x' observations")
## if (ny < 1 || (!var.equal && ny < 2))
## stop("not enough 'y' observations")
## if (var.equal && nx + ny < 3)
## stop("not enough observations")
## my <- mean(y)
## vy <- var(y)
## method <- paste(if (!var.equal)
## "Welch", "Two Sample t-test")
## estimate <- c(mx, my)
## names(estimate) <- c("mean of x", "mean of y")
## if (var.equal) {
## df <- nx + ny - 2
## v <- 0
## if (nx > 1)
## v <- v + (nx - 1) * vx
## if (ny > 1)
## v <- v + (ny - 1) * vy
## v <- v/df
## stderr <- sqrt(v * (1/nx + 1/ny))
## }
## else {
## stderrx <- sqrt(vx/nx)
## stderry <- sqrt(vy/ny)
## stderr <- sqrt(stderrx^2 + stderry^2)
## df <- stderr^4/(stderrx^4/(nx - 1) + stderry^4/(ny -
## 1))
## }
## if (stderr < 10 * .Machine$double.eps * max(abs(mx),
## abs(my)))
## stop("data are essentially constant")
## tstat <- (mx - my - mu)/stderr
## }
## if (alternative == "less") {
## pval <- pt(tstat, df)
## cint <- c(-Inf, tstat + qt(conf.level, df))
## }
## else if (alternative == "greater") {
## pval <- pt(tstat, df, lower.tail = FALSE)
## cint <- c(tstat - qt(conf.level, df), Inf)
## }
## else {
## pval <- 2 * pt(-abs(tstat), df)
## alpha <- 1 - conf.level
## cint <- qt(1 - alpha/2, df)
## cint <- tstat + c(-cint, cint)
## }
## cint <- mu + cint * stderr
## names(tstat) <- "t"
## names(df) <- "df"
## names(mu) <- if (paired || !is.null(y))
## "difference in means"
## else "mean"
## attr(cint, "conf.level") <- conf.level
## rval <- list(statistic = tstat, parameter = df, p.value = pval,
## conf.int = cint, estimate = estimate, null.value = mu,
## alternative = alternative, method = method, data.name = dname)
## class(rval) <- "htest"
## return(rval)
## }
## <bytecode: 0x7fb85630db60>
## <environment: namespace:stats>
Rewriting a function to do just enough work for your specific case can provide speed-ups.
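For example, if all we need is the Welch statistic for complete two-group data, a stripped-down helper might look like this (a sketch; t_stat() is a hypothetical function, and it assumes no missing values and exactly two groups coded 1 and 2):
t_stat <- function(y, grp) {
  # Welch statistic only: no argument checks, confidence interval or p-value
  x1 <- y[grp == 1]
  x2 <- y[grp == 2]
  (mean(x1) - mean(x2)) / sqrt(var(x1) / length(x1) + var(x2) / length(x2))
}
t_stat(simulated_data$y, simulated_data$grp)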
In lab you'll work through the t.test() case study. Also see the example rewrite of diff() for the difference of adjacent values in a vector, in Do as Little as Possible in Advanced R.