Sampling Distribution

“One of the key concerns of statistics is the drawing of conclusions from a set of observed data. These data will usually consist of a sample of certain elements of a population, and the objective will be to use the sample to draw conclusions about the entire population.”

In this lecture, we will learn how to construct the distribution of a sample statistic (e.g., minimum, maximum, mean, median, proportion, standard deviation). The general procedure is as follows.

  1. Determine which parameter we would like to know from the population.
  2. Draw a random sample of size \(n\).
  3. Calculate the sample statistic.
  4. Repeat steps 2 and 3 a large number of times.
  5. Display all the sample statistics we obtained in step 4 on the same graph.
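The five steps can also be sketched in R. This is only a minimal illustration and not the app used in class: the population (uniform on 0 to 100), the statistic (the sample median), the sample size, and the number of repetitions are all assumptions chosen for the example.

```r
set.seed(1)                      # make the simulation reproducible

n    <- 25                       # step 2: sample size (assumed)
reps <- 5000                     # step 4: number of repetitions (assumed)

# Steps 2-4: draw a random sample, compute the statistic, and repeat
medians <- replicate(reps, {
  x <- runif(n, min = 0, max = 100)   # one random sample from the assumed population
  median(x)                           # step 3: the sample statistic
})

# Step 5: display all the sample statistics on the same graph
hist(medians, breaks = 40,
     main = "Sampling distribution of the sample median (n = 25)",
     xlab = "Sample median")
```

Swapping `median` for `min`, `max`, `mean`, or `sd` gives the sampling distribution of a different statistic from the same population.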


Example 1 Suppose in a certain population, the amount of cash people have in their pockets is uniformly distributed between 0 and 100. Use the app to build the sampling distribution of the sample minimum.



Example 2 Suppose in a certain population, the amount of cash people have in their pockets is approximately normally distributed, with nearly all values falling between 0 and 100. Use the app to build the sampling distribution of the sample mean.



Example 3 Pick another population distribution. Use the app to build the sampling distribution of the sample mean.



Central Limit Theorem

Central Limit Theorem:

For a random sample of size \(n\) from a population with mean \(\mu\) and standard deviation \(\sigma\), the sampling distribution of the sample mean, \(\overline{X}\), has a mean of \(\mu\) and a standard deviation of \(\dfrac{\sigma}{\sqrt n}\), and it is approximately normal when \(n\) is sufficiently large.

\[E(\overline{X})=\mu, \mbox{ and }SD(\overline{X}\,)=\dfrac{\sigma}{\sqrt n}.\]


“… practically speaking, no matter how nonnormal the underlying population distribution is, the sample mean of a sample size of at least 30 will be approximately normal.”
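A small simulation can make the theorem concrete. The sketch below assumes a strongly right-skewed population (exponential with mean 10, so \(\mu = \sigma = 10\)), a sample size of 30, and 10,000 repetitions; all of these values are illustrative choices, not part of the original notes.

```r
set.seed(2)

mu    <- 10                      # assumed population mean
sigma <- 10                      # an exponential population has SD equal to its mean
n     <- 30                      # sample size
reps  <- 10000                   # number of simulated samples

# Sample mean of each of the simulated samples
xbars <- replicate(reps, mean(rexp(n, rate = 1 / mu)))

# Center and spread of the simulated sampling distribution vs. the CLT values
c(mean_sim = mean(xbars), mean_theory = mu)
c(sd_sim = sd(xbars), sd_theory = sigma / sqrt(n))

# Share of sample means within one SD of mu; roughly 0.68 for a normal shape
within_one_sd <- mean(abs(xbars - mu) < sigma / sqrt(n))
within_one_sd
```

Running `hist(xbars)` afterwards shows a roughly bell-shaped histogram even though the population itself is far from normal.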

Some insight into the formulas in the CLT.

Suppose \(X\) and \(Y\) are independent random variables (they do not have to be normal). The additive properties are

\[E[X + Y] = E[X] + E[Y],\] \[Var[X + Y] = Var[X] + Var[Y].\]

The constant multiple properties are \[E[\color{red}cX] = \color{red}cE[X],\] \[Var[\color{red}cX] = \color{red}{c^2} Var[X].\]
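These properties can be checked numerically. The sketch below uses simulated normal data with assumed means and standard deviations and an assumed constant of 3; the simulated values should be close to, but not exactly equal to, the theoretical ones.

```r
set.seed(3)

N <- 200000                          # number of simulated values (assumed)
x <- rnorm(N, mean = 5,  sd = 2)     # E[X] = 5,  Var[X] = 4
y <- rnorm(N, mean = -1, sd = 3)     # E[Y] = -1, Var[Y] = 9, generated independently of x
k <- 3                               # the constant c in the formulas

# Additive properties
c(mean_of_sum = mean(x + y), sum_of_means = mean(x) + mean(y))
c(var_of_sum  = var(x + y),  sum_of_vars  = var(x) + var(y))

# Constant multiple properties
c(mean_of_cX = mean(k * x), c_times_mean  = k * mean(x))
c(var_of_cX  = var(k * x),  c2_times_var  = k^2 * var(x))
```

The variance of the sum matches the sum of the variances only approximately here, because independence holds for the random variables but the two simulated samples still have a small nonzero sample covariance.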


If the independent random variables \(X_1, X_2, \ldots, X_n\) are from the same population, whose mean is \(\mu\) and standard deviation is \(\sigma\), then

  • \(E[X_1]=E[X_2]=\cdots=E[X_n]=\mu\).
  • \(Var[X_1]=Var[X_2]=\cdots=Var[X_n]=\sigma^2\).

Therefore, \[E[\overline{X}] = E\left[\dfrac{X_1 + X_2 + \cdots + X_n}{n}\right] = \dfrac{E[X_1]+E[X_2] + \cdots + E[X_n]}{n}=\dfrac{n\mu}{n}=\mu.\]

\[Var[\overline{X}] = Var\left[\dfrac{X_1 + X_2 + \cdots + X_n}{n}\right] = \dfrac{Var[X_1]+Var[X_2] + \cdots + Var[X_n]}{n^2}=\dfrac{n\sigma^2}{n^2}=\dfrac{\sigma^2}{n}.\]

\[SD(\overline{X})=\sqrt{Var(\overline{X})}=\dfrac{\sigma}{\sqrt{n}}.\]
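To see how quickly the spread of \(\overline{X}\) shrinks, the formula can be evaluated for a few sample sizes. The sketch borrows \(\sigma = 29\) from the weight example that follows; the list of sample sizes is an assumption made for illustration.

```r
sigma <- 29                      # population SD, taken from the weight example
n     <- c(1, 16, 36, 64)        # sample sizes chosen for illustration

se <- sigma / sqrt(n)            # SD of the sample mean for each sample size
data.frame(n = n, sd_of_mean = se)
```

Quadrupling the sample size halves the standard deviation of the sample mean, since \(n\) enters the formula through \(\sqrt{n}\).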

Example 4 Men’s weight is normally distributed with \(\mu = 172\) lb and \(\sigma = 29\) lb.

  1. If 1 man is randomly selected, find the probability that his weight is less than 167 lb.

  2. If 36 men are randomly selected, find the probability that their average weight is less than 167 lb.

  3. If 1 man is randomly selected, find the probability that his weight is between 170 and 175 lb.

  4. If 64 men are randomly selected, find the probability that their mean weight is between 170 and 175 lb.

  5. You are to design an elevator to safely hold 16 people. Find the maximum allowable weight if we want a 0.95 probability that this maximum will not be exceeded in the worst case when 16 randomly selected males are on it.
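One way to check hand calculations for this example is with R's `pnorm` and `qnorm`. The sketch below covers parts 1, 2, and 5 only, and its treatment of part 5 is one possible reading of the question (the total weight of 16 men is modeled as normal with mean \(16\mu\) and standard deviation \(\sigma\sqrt{16}\)); the variable names are made up for illustration.

```r
mu    <- 172                     # population mean (lb)
sigma <- 29                      # population SD (lb)

# Part 1: one man, P(X < 167)
p_one <- pnorm(167, mean = mu, sd = sigma)

# Part 2: mean of 36 men, P(Xbar < 167); the SD shrinks to sigma / sqrt(36)
p_mean36 <- pnorm(167, mean = mu, sd = sigma / sqrt(36))

# Part 5 (one reading): 0.95 quantile of the total weight of 16 men,
# i.e. a load that is exceeded with probability 0.05
w_max <- qnorm(0.95, mean = 16 * mu, sd = sigma * sqrt(16))

c(p_one = p_one, p_mean36 = p_mean36, w_max = w_max)
```

Parts 3 and 4 follow the same pattern, using a difference of two `pnorm` values for the probability of an interval.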




Ross 7.2 An insurance company has 10,000 automobile policyholders. If the expected yearly claim per policyholder is $260 with a standard deviation of $800, approximate the probability that the total yearly claim exceeds $2.8 million.
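This problem concerns a sum rather than a mean, so the additive properties give a total with mean \(n\mu\) and standard deviation \(\sigma\sqrt{n}\). A sketch of the normal approximation, assuming the claims are independent and using illustrative variable names:

```r
n     <- 10000                   # number of policyholders
mu    <- 260                     # expected yearly claim per policyholder ($)
sigma <- 800                     # SD of the yearly claim per policyholder ($)

total_mean <- n * mu             # mean of the total claim
total_sd   <- sigma * sqrt(n)    # SD of the total claim (independence assumed)

# P(total > 2.8 million) under the normal approximation
p_exceed <- pnorm(2.8e6, mean = total_mean, sd = total_sd, lower.tail = FALSE)
p_exceed
```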




Ross 7.3 The blood cholesterol levels of a population of workers have mean 202 and standard deviation 14. If a sample of 36 workers is selected, approximate the probability that the sample mean of their blood cholesterol levels will lie between 198 and 206.




Ross 7.4 An astronomer is interested in measuring, in units of light-years, the distance from her observatory to a distant star. However, the astronomer knows that due to differing atmospheric conditions and normal errors, each time a measurement is made, it will yield not the exact distance, but an estimate of it. As a result, she is planning on making a series of 10 measurements and using the average of these measurements as her estimated value for the actual distance. If the values of the measurements constitute a sample from a population having mean d (the actual distance) and a standard deviation of 3 light-years, approximate the probability that the astronomer’s estimated value of the distance will be within 0.5 light-years of the actual distance.
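The probability that a sample mean lands within a given tolerance of the population mean does not depend on the unknown mean itself, only on \(\sigma\), \(n\), and the tolerance. The helper below, `prob_mean_within`, is a hypothetical function written for this sketch and relies on the normal approximation from the CLT.

```r
# Hypothetical helper: P(|Xbar - mu| < tol) under the normal approximation
prob_mean_within <- function(tol, sigma, n) {
  se <- sigma / sqrt(n)                  # SD of the sample mean
  pnorm(tol / se) - pnorm(-tol / se)     # area within tol of the center
}

# Astronomer: 10 measurements, sigma = 3 light-years, tolerance 0.5 light-years
p_astro <- prob_mean_within(tol = 0.5, sigma = 3, n = 10)
p_astro
```

The same helper applies to Ross 7.3, where the interval from 198 to 206 is symmetric about the mean 202: `prob_mean_within(tol = 4, sigma = 14, n = 36)`.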




Sampling Distribution of Sample Proportion

Suppose that the underlying population is large in relation to the sample size \(n\). If the proportion of individuals in the population with a certain characteristic is \(p\), then the sampling distribution of the sample proportion (\(\widehat{p}\)) is approximately normal, with \[E[\widehat{p}] = p, \quad\mbox{ and }\quad SD(\widehat{p}) = \sqrt{\dfrac{p(1-p)}{n}}.\]


A rule of thumb is that when \(np(1-p) > 10\), the binomial distribution can be approximated by a normal distribution.
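The rule of thumb can be explored by comparing exact binomial probabilities with their normal approximations. In the sketch below, `max_approx_error` is a hypothetical helper, the two \((n, p)\) pairs are assumed illustrative cases, and the half-unit adjustment inside `pnorm` is the continuity correction described at the end of these notes.

```r
# Hypothetical helper: largest gap between the exact binomial CDF and its
# normal approximation (with the half-unit continuity adjustment)
max_approx_error <- function(n, p) {
  k      <- 0:n
  exact  <- pbinom(k, size = n, prob = p)                              # exact P(X <= k)
  approx <- pnorm(k + 0.5, mean = n * p, sd = sqrt(n * p * (1 - p)))   # normal approximation
  max(abs(exact - approx))
}

# Assumed illustrative cases
err_ok  <- max_approx_error(n = 100, p = 0.30)  # n p (1 - p) = 21,   rule satisfied
err_bad <- max_approx_error(n = 20,  p = 0.05)  # n p (1 - p) = 0.95, rule violated

c(rule_satisfied = err_ok, rule_violated = err_bad)
```

The error is noticeably larger in the case where \(np(1-p)\) falls well below 10.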

Some insight into the formulas.

When the underlying population is much larger than the sample, we can treat the individuals in the sample as independent, each having the characteristic with probability \(p\). If we code each individual as

  • 1 if the individual has the characteristic;
  • 0 if the individual lacks the characteristic,

then the distribution for each individual is

\[\begin{array}{c|c}
X_i & P(X_i) \\ \hline
1 & p \\
0 & 1-p
\end{array}\]

The expected value is \(E[X_i]=(1)(p) + (0)(1-p)=p\) and it follows that the variance is \[Var(X_i) = (1-E[X_i])^2(p) + (0 - E[X_i])^2(1-p)=(1-p)^2p+(0-p)^2(1-p)=p(1-p).\]

Therefore, \[E[\widehat{p}]=E\left[\dfrac{X_1 + X_2 + \cdots + X_n}{n}\right] = \dfrac{E[X_1]+E[X_2]+\cdots+E[X_n]}{n}=\dfrac{np}{n}=p.\]

\[\small Var(\widehat{p})=Var\left(\dfrac{X_1 + X_2 + \cdots + X_n}{n}\right)=\dfrac{Var(X_1)+Var(X_2)+\cdots+Var(X_n)}{n^2}=\dfrac{np(1-p)}{n^2}=\dfrac{p(1-p)}{n}.\]

\[SD(\widehat{p})=\sqrt{Var(\widehat{p})}=\sqrt{\dfrac{p(1-p)}{n}}.\]
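The derivation can be checked by simulating the 0/1 coding directly. The values of `p`, `n`, and `reps` below are assumptions chosen for the sketch.

```r
set.seed(4)

p    <- 0.3                      # assumed population proportion
n    <- 50                       # sample size
reps <- 20000                    # number of simulated samples

# Each rbinom(n, 1, p) call is one sample of n zeros and ones;
# the mean of the zeros and ones is the sample proportion p-hat
phats <- replicate(reps, mean(rbinom(n, size = 1, prob = p)))

c(mean_sim = mean(phats), mean_theory = p)
c(sd_sim = sd(phats), sd_theory = sqrt(p * (1 - p) / n))
```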

Example 5 Construct a sampling distribution of sample proportion.

The data file ATL_Departure_Flights_2017.csv contains the status (on-time or delayed) of all domestic departure flights from Atlanta Hartsfield-Jackson Airport in 2017.

departure <- read.csv("https://albums.yuanting.lu/sta126/data/ATL_Departure_Flights_2017.csv")
  1. How large is the dataset?

  2. What percentage of departure flights were on-time in 2017?

  3. Take a random sample of 50 flights. What percentage of departure flights were on-time in the sample?

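    # pick 50 row numbers at random, then look up the status of those flights
    x <- departure$Status[sample(nrow(departure), 50)]
    table(x) / 50   # proportion of each status in the sample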
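  4. Take 30 or more samples in this way, record the sample proportion from each one, and create a distribution graph.

    # p1, p2, ..., p30 are placeholders for the sample proportions you recorded
    phats <- c(p1, p2, p3, p4, ..., p30)
    stripchart(phats, method = 'stack',
            at = 0.15, offset = 0.5, xlim = c(0, 1))
  5. Draw 2000 samples and create a simulated sampling distribution graph.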

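n <- 50                        # sample size
pile <- rep(0, 2000)           # storage for 2000 sample proportions
for (i in seq_along(pile)) {
  # draw n flights at random, without replacement
  x <- departure$Status[sample(nrow(departure), n)]
  phat <- table(x) / n         # proportion of each status in this sample
  # keep the second entry, assumed here to be the on-time category
  pile[i] <- as.numeric(phat[2])
}
stripchart(pile, method = 'stack', pch = 19,
           at = 0.15, offset = 0.02, xlim = c(0, 1),
           main = "Sampling Distribution of Sample Proportion (n=50)", 
           xlab = "Proportion of on-time departure flights")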

Example 6 Repeat the process in the previous example to build a sampling distribution of the sample proportion. This time, use a sample size of 200. Compare the sampling distribution graph with the one in the previous example. Which one has a wider spread?
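Before running the simulation, the formula for \(SD(\widehat{p})\) gives a prediction of what the comparison should look like. In the sketch below, `sd_phat` is a hypothetical helper and `p = 0.8` is only a placeholder, not the actual on-time rate in the data; the ratio of the two spreads does not depend on `p`.

```r
# Hypothetical helper: theoretical SD of the sample proportion
sd_phat <- function(p, n) sqrt(p * (1 - p) / n)

p <- 0.8                                   # placeholder value, not taken from the data

spread_50  <- sd_phat(p, n = 50)           # spread with samples of 50 flights
spread_200 <- sd_phat(p, n = 200)          # spread with samples of 200 flights

spread_ratio <- spread_50 / spread_200     # equals sqrt(200 / 50) for any p
c(n50 = spread_50, n200 = spread_200, ratio = spread_ratio)
```

The simulated graphs can then be compared against this prediction, for example by computing `sd(pile)` for each sample size.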




Example 7 Suppose we take a sample of 100 flights departing from ATL Airport in 2017.

  • What is the probability that the proportion of on-time flights is larger than 0.9?
  • What is the probability that fewer than half of the flights depart on time?
  • What is the probability that the proportion of on-time flights is between 0.8 and 0.85?



Continuity Correction

The possible proportions in a particular sample are discrete. For example, in a sample of 200 people, the possible proportions are \[\frac{1}{200}=0.005, \frac{2}{200}=0.01, \frac{3}{200}=0.015, \cdots\] There is no way to get a proportion of, for example, 0.004 or 0.006.

The normal distribution, however, is continuous. The following rules, known as the continuity correction, are therefore used when applying normal approximations to binomial distributions.

\[\begin{array}{ll}
\mbox{Binomial Probability} & \mbox{Normal Approximation} \\ \hline
P(X=a) & P(a - 0.5 < X < a + 0.5) \\
P(X>a) \mbox{ or } P(X\ge a + 1) & P(X > a + 0.5) \\
P(X<a) \mbox{ or } P(X\le a - 1) & P(X < a - 0.5)
\end{array}\]
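The effect of the correction can be seen by comparing exact binomial probabilities with normal approximations computed with and without the half-unit shift. The values \(n = 200\), \(p = 0.5\), and \(a = 110\) below are assumptions chosen for illustration.

```r
n <- 200                         # number of trials (assumed)
p <- 0.5                         # success probability (assumed)
a <- 110                         # cutoff (assumed)

mu    <- n * p                   # mean of the binomial count
sigma <- sqrt(n * p * (1 - p))   # SD of the binomial count

# P(X > a): exact binomial, then normal with and without the correction
exact_gt   <- pbinom(a, size = n, prob = p, lower.tail = FALSE)
with_cc    <- pnorm(a + 0.5, mean = mu, sd = sigma, lower.tail = FALSE)
without_cc <- pnorm(a,       mean = mu, sd = sigma, lower.tail = FALSE)
c(exact = exact_gt, with_correction = with_cc, without_correction = without_cc)

# P(X = a): exact binomial vs. the area between a - 0.5 and a + 0.5
exact_eq  <- dbinom(a, size = n, prob = p)
approx_eq <- pnorm(a + 0.5, mean = mu, sd = sigma) - pnorm(a - 0.5, mean = mu, sd = sigma)
c(exact = exact_eq, with_correction = approx_eq)
```

Without the correction, the normal approximation to \(P(X = a)\) would be zero, because a single point has no area under a continuous curve.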