library(tidyverse)
library(here)
set.seed(18973)
A lower level compiled language which means a much lower overhead on for
loops.
Writing a function in C++ may bring big speed improvements for things like:
We don’t have time to learn whole new language, but I want to expose you to a little C++, so you know how to get started and can see the benefits.
Mostly you would use C++ to write small self contained functions, that are called from your R scripts. I.e. rewrite just a few functions that are called often (or are inefficient) not the whole simulation/analysis.
Resource: https://adv-r.hadley.nz/rcpp.html
An R package that makes it easy to connect a C++ definition of a function to R:
library(Rcpp)
On your own machine: You’ll need a working C++ compiler, see https://adv-r.hadley.nz/rcpp.html#prerequistes
Here are two functions that do the same thing, one implemented in R and one implemented in C++.
In R:
add <- function(x, y, z){
x + y + z
}
In C++:
int add(int x, int y, int z) {
int sum = x + y + z;
return sum;
}
Your Turn Brainstorm the ways the two function definitions are different.
Differences:
Three possibilities:
Pass the definition inside '
to cppFunction()
:
cppFunction('int add(int x, int y, int z) {
int sum = x + y + z;
return sum;
}')
Put the definition (plus some additional lines) into a special Rcpp
code chunk in your RMarkdown (look at the RMD version of these notes to see this in action):
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int add(int x, int y, int z) {
int sum = x + y + z;
return sum;
}
Put the definition (and the same additional lines) into their own file, e.g. file.cpp
and use sourceCpp()
:
sourceCpp("path/to/file.cpp")
Either way, then the function, add()
in this case, can be used just like any R function:
add
## function (x, y, z)
## .Call(<pointer: 0x112cda210>, x, y, z)
add(1, 2, 3)
## [1] 6
In C++ scalars (single numbers) are a different type to vectors.
C++ types:
integer | double | character | logical | |
---|---|---|---|---|
Scalar | int |
double |
String |
bool |
Vector | IntegerVector |
NumericVector |
CharacterVector |
LogicalVector |
Matrix | IntegerMatrix |
NumericMatrix |
CharacterMatrix |
LogicalMatrix |
signR()
defined in R:
signR <- function(x) {
if (x > 0) {
1
} else if (x == 0) {
0
} else {
-1
}
}
signC()
defined in C++:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int signC(int x) {
if (x > 0) {
return 1;
} else if (x == 0) {
return 0;
} else {
return -1;
}
}
Your Turn How are if
statements different in C++?
for
loop#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double sumC(NumericVector x) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; ++i) {
total += x[i];
}
return total;
}
In C++:
you call a method with object_name.method_name()
, e.g. x.size()
calls the size()
method associated with the object x
- in this case the length of the vector.
for
loops look like for(init; check; increment)
:
init
the initial value(s) int i = 1
, remembering to define its type.check
the value against a stopping rule i < n
increment
the value for the next iteration ++i
means i = i + 1
Vector indicies start at 0. VECTOR INDICIES START AT 0!
There are modify in place operations total += x[i]
is equivalent to total = total + x[i]
Can you figure out what this function does? What is the equivalent in base R?
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector f2(NumericVector x) {
int n = x.size();
NumericVector out(n);
out[0] = x[0];
for(int i = 1; i < n; ++i) {
out[i] = out[i - 1] + x[i];
}
return out;
}
Remember the abs()
example from last week:
abs_loop <- function(vec){
for (i in 1:length(vec)) {
if (vec[i] < 0) {
vec[i] <- -vec[i]
}
}
vec
}
abs_sets <- function(vec){
negs <- vec < 0
vec[negs] <- -vec[negs]
vec
}
Here’s a C++ version:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector absC(NumericVector x){
int n = x.size();
for(int i = 0; i < n; i++){
if(x[i] < 0){
x[i] *= -1;
}
}
return x;
}
x <- -50:50
bench::mark(
abs_loop = abs_loop(x),
abs_sets = abs_sets(x),
absC = absC(x),
abs_base = abs(x))
## # A tibble: 4 x 10
## expression min mean median max `itr/sec` mem_alloc n_gc
## <chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 abs_loop 6.64µs 8.23µs 7.07µs 67.07µs 121530. 49.12KB 0
## 2 abs_sets 1.79µs 4.5µs 4.42µs 3.03ms 222173. 2.95KB 1
## 3 absC 1.28µs 3.72µs 3.92µs 913.97µs 268934. 3.33KB 0
## 4 abs_base 157ns 454.96ns 169ns 39.44µs 2198010. 456B 0
## # ... with 2 more variables: n_itr <int>, total_time <bch:tm>
Despite operating one element at a time, the C++ version is about as fast as the set version. Still not as fast as the base version.
A part of the Rcpp package to make translating code easier.
Rcpp sugar “bring[s] a subset of the high-level R syntax in C++.”
Arithmetic and logical operators are vectorized
Access to common vectorized functions in R like: any()
, ifelse()
, is.na()
Access to common distribution functions
A software library for the C++ with common algorithms and data structures.
E.g. std::max()
, std::random_shuffle()
, std::partial_sort()
Embarrasingly parallel problems
is one where little or no effort is needed to separate the problem into a number of parallel tasks
– https://en.wikipedia.org/wiki/Embarrassingly_parallel
Simulation studies often fall into this category:
What do they share?
On my machine
parallel::detectCores()
## [1] 4
(Charlotte’s rough conceptual breakdown)
Setup specify what parallel means on your set up, i.e. use multiple cores, background sessions, remote computers, a cluster etc.
Execute changes to your code that make use of the parallel set up. (Technically, explicit parallelism as opposed to implicit parallelism).
Lot’s of R packages available: https://cran.r-project.org/web/views/HighPerformanceComputing.html
Your turn: Start reading the readme at: https://github.com/DavisVaughan/furrr
What is the goal of the package? What packages does it build on?
Your turn: Scroll down to the first example. What does it illustrate?
Your turn: Keep reading the examples, what changes to get the map operation to run in parallel? How does the author provide proof that it is running in parallel?
Your turn: Try running the example on your own, can you replicate the timings?