Project proposals

By the end of tomorrow you will have feedback on your proposal.

homework-3 github repo -> Issues tab -> I ask questions, you are free to answer in Issues, talk in person, or address as project progresses.

Setting

We have an i.i.d sample \(X_1, \ldots, X_n\) from some unknown population distribution.

We are interested in estimating \(\theta\), some parameter of the population distribution.

We use \(\hat{\theta}\) some function of the sample as our estimate of \(\theta\).

How do we know how uncertain our estimate is? I.e. how do we construct a confidence interval for \(\theta\)?

We need to know the sampling distribution of our estimate \(\hat{\theta}\).

Recall: the sampling distribution of an estimate is the distribution of the estimate over all possible samples from the population.

Example: Population mean, unknown population distribution

Let \(X_1, \ldots, X_n\) be an i.i.d sample from some unknown population distribution.

We are interested in \(\theta = \mu\) the population mean, and estimate it with \(\hat{\theta} = \overline{X}\) the sample mean.

Your Turn: What is the sampling distribution of \(\hat{\theta}\)?

Example: Population median, known population distribution

Let \(X_1, \ldots, X_n\) be an i.i.d sample from some known population distribution.

We are interested in \(\theta\) the population median, and estimate it with \(\hat{\theta}\) the sample median.

Your Turn: How could you use simulation to find the sampling distribution of \(\hat{\theta}\)?

Your Turn

Discuss with your neighbour

In the simulation approach we have to make some assumptions about the population distribution. What does the bootstrap replace the population distribution with? Why is this a good replacement?

Bootstrap

The bootstrap is like a simulation approach, but instead of using an assumed population distribution we use our best guess of the population distribution - the distribution of the sample.

The bootstrap distribution of estimate \(\hat{\theta}\) is the distribution of the estimate over all possible samples from the empirical distribution function.

The empirical distribution function (ecdf), is the cdf that comes from placing probability \(1/n\) at each of the observed values in our sample.

Sampling from the empirical distribution function is equivalent to resampling the original sample with replacement.

Bootstrap Algorithm

A Monte Carlo approach to generating the bootstrap distribution (From Chernick):

  1. Generate a sample with replacement from the empirical distribution (a bootstrap sample),

  2. Compute \(\hat{\theta}^*\) the value of \(\hat{\theta}\) obtained by using the bootstrap sample in place of the original sample,

  3. Repeat steps 1 and 2 \(k\) times.

Important technical detail: you should think of the distribution of \(\hat{\theta}^* - \hat{\theta}\) being an estimate of the distribution of \(\hat{\theta} - \theta\), (not the distribution of \(\theta^*\) being an estimate of the distribution of \(\hat{\theta}\)). Why? Think about a biased estimator.

Your Turn Why do we need a Monte Carlo approach?

One method for constructing a confidence interval for \(\theta\)

Once we have the bootstrap distribution, we use it to construct a bootstrap confidence interval.

Efron’s percentile method

Use the \(\alpha/2\) and \(1-\alpha/2\) quantiles of \(\hat{\theta}^*\) as the endpoints in a \((1-\alpha)100\%\) confidence interval for \(\theta\).

Many other methods…more next week.

Implementing the bootstrap

  1. Use a purpose-built package, e.g. boot Handles the sampling, and confidence interval calculations for you. You still need to define a function that takes a bootstrap sample and calculates the estimate.
  2. Something in between
  3. Implement from scratch yourself You implement sampling, calculate estimates and summarise yourself.

1. less room for error. 3. useful learning experience. We’ll go with 2. today, use rsample package which fits in nicely with our current MC approach.

https://github.com/topepo/rsample

Note that resampled data sets created by rsample are directly accessible in a resampling object but do not contain much overhead in memory.

Example project on rstudio.cloud

https://rstudio.cloud/spaces/4116/project/112636

Also on github at: https://github.com/ST541-Fall2018/bootstrap