By the end of tomorrow you will have feedback on your proposal.
In the homework-3 GitHub repo, see the Issues tab: I ask questions there, and you are free to answer in Issues, talk to me in person, or address them as the project progresses.
We have an i.i.d. sample \(X_1, \ldots, X_n\) from some unknown population distribution.
We are interested in estimating \(\theta\), some parameter of the population distribution.
We use \(\hat{\theta}\), some function of the sample, as our estimate of \(\theta\).
How do we know how uncertain our estimate is? That is, how do we construct a confidence interval for \(\theta\)?
We need to know the sampling distribution of our estimate \(\hat{\theta}\).
Recall: the sampling distribution of an estimate is the distribution of the estimate over all possible samples from the population.
Let \(X_1, \ldots, X_n\) be an i.i.d. sample from some unknown population distribution.
We are interested in \(\theta = \mu\), the population mean, and estimate it with \(\hat{\theta} = \overline{X}\), the sample mean.
Your Turn: What is the sampling distribution of \(\hat{\theta}\)?
Let \(X_1, \ldots, X_n\) be an i.i.d. sample from some known population distribution.
We are interested in \(\theta\), the population median, and estimate it with \(\hat{\theta}\), the sample median.
Your Turn: How could you use simulation to find the sampling distribution of \(\hat{\theta}\)?
Discuss with your neighbour
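One possible answer, sketched in Python with only the standard library (not the R tooling used in class). The Exponential(1) population, the sample size, the number of simulations, and the function name `simulate_sample_medians` are all assumptions made for illustration.

```python
import random
import statistics

def simulate_sample_medians(draw, n, n_sims, seed=541):
    """Approximate the sampling distribution of the sample median.

    draw: function taking a random.Random and returning one observation
          from the (assumed known) population distribution.
    """
    rng = random.Random(seed)
    medians = []
    for _ in range(n_sims):
        sample = [draw(rng) for _ in range(n)]      # a fresh sample from the population
        medians.append(statistics.median(sample))   # the estimate for that sample
    return medians

# Assumed population: Exponential with rate 1, whose true median is log(2).
medians = simulate_sample_medians(lambda rng: rng.expovariate(1.0), n=25, n_sims=5000)
print(statistics.mean(medians), statistics.stdev(medians))
```

The list `medians` is a Monte Carlo approximation to the sampling distribution of \(\hat{\theta}\); a histogram of it shows the shape, and its standard deviation approximates the standard error.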
In the simulation approach we have to make some assumptions about the population distribution. What does the bootstrap replace the population distribution with? Why is this a good replacement?
The bootstrap is like a simulation approach, but instead of using an assumed population distribution we use our best guess of the population distribution - the distribution of the sample.
The bootstrap distribution of estimate \(\hat{\theta}\) is the distribution of the estimate over all possible samples from the empirical distribution function.
The empirical distribution function (ecdf) is the cdf that comes from placing probability \(1/n\) at each of the observed values in our sample.
Sampling from the empirical distribution function is equivalent to resampling the original sample with replacement.
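A minimal sketch of both ideas in Python (standard library only). The data values and the helper names `ecdf` and `bootstrap_sample` are made up for illustration.

```python
import random

def ecdf(sample):
    """Return the empirical cdf: F_hat(x) = (number of observations <= x) / n."""
    n = len(sample)

    def F_hat(x):
        return sum(1 for obs in sample if obs <= x) / n

    return F_hat

def bootstrap_sample(sample, rng):
    """Draw n values from the ecdf, i.e. resample the data with replacement."""
    return rng.choices(sample, k=len(sample))

sample = [2.1, 3.4, 3.4, 5.0, 7.8]   # made-up data
F_hat = ecdf(sample)
rng = random.Random(541)
resample = bootstrap_sample(sample, rng)
print(F_hat(3.4), resample)
```

Each observation is picked with probability \(1/n\) on every draw, so a repeated value such as 3.4 carries probability \(2/n\), which matches the jump in the ecdf at that point.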
A Monte Carlo approach to generating the bootstrap distribution (From Chernick):
1. Generate a sample with replacement from the empirical distribution (a bootstrap sample),
2. Compute \(\hat{\theta}^*\), the value of \(\hat{\theta}\) obtained by using the bootstrap sample in place of the original sample,
3. Repeat steps 1 and 2 \(k\) times.
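The three steps above can be sketched as follows, in Python with the standard library rather than the rsample workflow used in class. The data, the choice of the median as the estimator, and the function name `bootstrap_distribution` are assumptions for illustration.

```python
import random
import statistics

def bootstrap_distribution(sample, estimator, k, seed=541):
    """Monte Carlo approximation to the bootstrap distribution of an estimator."""
    rng = random.Random(seed)
    n = len(sample)
    theta_stars = []
    for _ in range(k):                         # step 3: repeat k times
        boot = rng.choices(sample, k=n)        # step 1: resample with replacement
        theta_stars.append(estimator(boot))    # step 2: recompute the estimate
    return theta_stars

sample = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 7.3, 5.0, 4.4, 6.1]   # made-up data
theta_hat = statistics.median(sample)
theta_stars = bootstrap_distribution(sample, statistics.median, k=2000)
print(theta_hat, statistics.stdev(theta_stars))
```

The standard deviation of `theta_stars` is the usual bootstrap estimate of the standard error of \(\hat{\theta}\).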
Important technical detail: you should think of the distribution of \(\hat{\theta}^* - \hat{\theta}\) as an estimate of the distribution of \(\hat{\theta} - \theta\) (not the distribution of \(\hat{\theta}^*\) as an estimate of the distribution of \(\hat{\theta}\)). Why? Think about a biased estimator.
Your Turn: Why do we need a Monte Carlo approach?
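To make the biased-estimator hint concrete, here is a hedged sketch using the plug-in variance (divisor \(n\)), which underestimates the population variance by \(\sigma^2/n\) on average. The N(0, 1) population, \(n = 20\), and \(k = 10000\) are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(541)
n = 20
sample = [rng.gauss(0.0, 1.0) for _ in range(n)]   # assumed population: N(0, 1)

# Plug-in variance (divisor n): a biased estimator of the population variance.
theta_hat = statistics.pvariance(sample)

k = 10000
diffs = []
for _ in range(k):
    boot = rng.choices(sample, k=n)                        # bootstrap sample
    diffs.append(statistics.pvariance(boot) - theta_hat)   # theta_hat_star - theta_hat

bootstrap_bias = statistics.mean(diffs)    # Monte Carlo estimate of the bias
exact_bootstrap_bias = -theta_hat / n      # exact mean of theta_hat_star - theta_hat
print(bootstrap_bias, exact_bootstrap_bias)
```

The differences \(\hat{\theta}^* - \hat{\theta}\) are centered below zero, mirroring the fact that \(\hat{\theta}\) itself sits below \(\theta\) on average. Treating \(\hat{\theta}^*\) directly as a stand-in for \(\hat{\theta}\) would hide that shift.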
Once we have the bootstrap distribution, we use it to construct a bootstrap confidence interval.
Efron’s percentile method
Use the \(\alpha/2\) and \(1-\alpha/2\) quantiles of \(\hat{\theta}^*\) as the endpoints in a \((1-\alpha)100\%\) confidence interval for \(\theta\).
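A minimal sketch of the percentile method in Python (standard library only). The data, the choice of the sample mean as the estimator, \(k = 2000\), and the helper names `quantile` and `percentile_interval` are assumptions; the quantile rule used here is linear interpolation between order statistics, which is one of several common conventions.

```python
import random
import statistics

def quantile(values, p):
    """Sample quantile by linear interpolation between order statistics."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def percentile_interval(theta_stars, alpha=0.05):
    """Percentile interval: the alpha/2 and 1 - alpha/2 quantiles of theta_stars."""
    return quantile(theta_stars, alpha / 2), quantile(theta_stars, 1 - alpha / 2)

sample = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 7.3, 5.0, 4.4, 6.1]   # made-up data
theta_hat = statistics.mean(sample)

rng = random.Random(541)
theta_stars = [
    statistics.mean(rng.choices(sample, k=len(sample)))   # one bootstrap estimate
    for _ in range(2000)
]

lower, upper = percentile_interval(theta_stars, alpha=0.05)
print(theta_hat, (lower, upper))
```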
Many other methods…more next week.
1. less room for error. 3. useful learning experience. We’ll go with 2. today, using the rsample package, which fits in nicely with our current MC approach.
https://github.com/topepo/rsample
Note that resampled data sets created by rsample are directly accessible in a resampling object but do not contain much overhead in memory.
https://rstudio.cloud/spaces/4116/project/112636
Also on github at: https://github.com/ST541-Fall2018/bootstrap