Good code

Pick up from Thursday


Good code

In rough order of importance (for us as statisticians/data scientists):

  1. Correct
  2. Obviously correct: the code is easy to understand
  3. Expressive: the code emphasizes the actions not the details
  4. Efficiency: speed, memory, I/O

Use existing functions when you can

E.g. rbinom() rather than implementing the inverse transform method

Inverse transform (Tue) and Rejection method (Lab) are options when nothing exists for the distribution you need.

Monte Carlo simulations

Missisippi

Consider the 11 letters in the word Mississippi.

library(tidyverse)
(mi_letters <- str_split("Mississippi", "")[[1]])
##  [1] "M" "i" "s" "s" "i" "s" "s" "i" "p" "p" "i"

What is the probability that no adjacent letters are the same, in a random reordering of the letters?

Taken from Tijms, Henk. Probability: A Lively Introduction. Cambridge University Press, 2017.

Your Turn:

  • How would you do a simulation to estimate this probability?
  • How is this simulation similar to the Russian election example?

General structure of Monte Carlo Simulations

  1. Simulate many examples
  2. Calculate something on each example
  3. Explore the many calculated things

Mississippi

  1. Generate random reorderings of the letters in Mississippi
  2. For each reordering, ask: are all adjacent letters different?
  3. Find the proportion of “Yes”s

Russian Election Fraud

  1. Generate turnout vectors based on no fraud (Binomial counts based on number on voter list and observed turnout, transformed to turnout percentages)
  2. For each turnout vector, count the number of integers.
  3. Find the mean and quantiles of the counts

One pretty flexible approach

First, figure out 1. and 2. on a single example.

Then scale up:

  1. Simulate many examples rerun()
  2. Calculate something on each example map(), map_dbl()
  3. Explore the many calculated things # depends on goal

rerun() and map() are in the purrr package.


One Mississippi

Step 1: How do we get a random reordering of the letters in Mississippi?

sample(mi_letters)
##  [1] "p" "i" "M" "s" "i" "p" "s" "s" "i" "s" "i"

sample(x, size, replace = FALSE)

sample takes a sample of the specified size from the elements of x using either with or without replacement.

?sample

(one_reordering <- sample(mi_letters))
##  [1] "s" "i" "p" "i" "s" "i" "i" "M" "s" "p" "s"

Two Mississippi

Step 2: Given a reordering, does it have letters next to each other that are the same? We want TRUE when no adjacent letters match.

The rle() function will be very useful.

What does it do?

one_reordering
##  [1] "s" "i" "p" "i" "s" "i" "i" "M" "s" "p" "s"
rle(one_reordering)
## Run Length Encoding
##   lengths: int [1:10] 1 1 1 1 1 2 1 1 1 1
##   values : chr [1:10] "s" "i" "p" "i" "s" "i" "M" "s" "p" "s"

How can we use it?

Look for any lengths greater than 2.

How do you get out the lengths? Some strategies:

rle(one_reordering)$lengths # guess

?rle # read the Value section 

rel_one <- rle(one_reordering) # save  
rel_one$lengths # then rely on RStudio completion

# use str()
str(one_reordering)

Now find out if any are greater than 1. My approach:

all(rle(one_reordering)$lengths == 1)
## [1] FALSE

Some other approaches

length(rle(one_reordering)$lengths) == length(one_reordering)
!(mean(rle(one_reordering)$lengths) > 1)
max(rle(one_reordering)$lengths) == 1

So far

Back at 9:01am

one_reordering <- sample(mi_letters) # One example

# Fill in this bit in class
all(rle(one_reordering)$lengths == 1)
## [1] FALSE

Scaling up - many examples with rerun()

The first argument is the number of times you’d like to repeat the evaluation of the second argument.

many_reorderings <- rerun(.n = 1000, sample(mi_letters))

Your turn Take a look at the object many_reorderings. What kind of object is it? Generate 1000 reorderings instead.


Scaling up - for each example, do something with map()

map() solves iteration problems, like: for each ___ do ___.

  1. First argument is the object you want to iterate over, many_reorderings

  2. Second argument describes what you want to do. One way, specify a formula (starts with ~) using . as a placeholder for a single example: ~ any(rle(.)$lengths > 1

map(many_reorderings,
  ~ all(rle(.)$lengths == 1))

map() returns a list

Use one of its friends instead: map_dbl(), map_lgl(), map_int(), map_chr() to get an atomic vector.

Your turn: Swap out map() for the appropriate function

map_lgl(many_reorderings, ~ all(rle(.x)$lengths == 1))

All together

num_sims <- 1000
many_reorderings <- rerun(num_sims, sample(mi_letters))
adj_letters_same <- map_lgl(many_reorderings, 
  ~ all(rle(.x)$lengths == 1))

# Explore 
adj_letters_same %>% table()
## .
## FALSE  TRUE 
##   940    60
adj_letters_same %>% mean()
## [1] 0.06

Your turn

A random sequence of H’s and T’s is generated by tossing a fair coin \(n = 20\) times. What’s the expected length of the longest run of consecutive heads or tails?

Taken from Tijms, Henk. Probability: A Lively Introduction. Cambridge University Press, 2017.


Iteration

Common patterns:

  1. Do this thing m times, rerun()
  2. Do this thing to each element of x, map()
  3. Do this thing until some condition is satisfied while

You can do 1. and 2. with for loops, but the purrr functions abstract away the details and let you focus on “this thing”.

(You also don’t run the risk of writing an inefficient for loop)

But remember R loves working with vectors. Don’t iterate over the elements of a vector, when a function exists to handle the whole vector.


Questions

  • How many simulations do we need? I.e. what should num_sims be?
  • What do we do when the simulation part is complicated? What do we do when the calculation part is complicated?

Reading for next week

Functions in R4DS 19.1 through and including 19.3

(If you are on a roll keep reading…)

Be prepared to answer:

  • What are the three key steps to writing a function?
  • Do exercise 19.2.1 #3