In the previous two sections, we worked with categorical variables (e.g., in favor/not in favor, smoker/nonsmoker, defective component/good component), for which we learned how to

  • Estimate (i.e., build a confidence interval for) an unknown true proportion of a particular category in the population.
  • Test (i.e., run a hypothesis test on) an unknown true proportion of a particular category in the population.

Now, we turn to numerical variables. Our objectives remain the same.

  • Estimating (i.e., building a confidence interval for) an unknown population mean (this section).
  • Testing (i.e., running a hypothesis test on) an unknown population mean (next section).

Notations

Let’s start with a review of the notations.

  • \(\mu\): the population mean
  • \(\sigma\): the population standard deviation
  • \(\overline{x}\): a sample mean
  • \(s\): a sample standard deviation
  • \(n\): sample size
  • \(z_\alpha\): the \(z\)-score in a standard normal distribution whose right tail has area \(\alpha\). For example, \(z_{0.01}=2.326\). We can find it with qnorm(0.99), because it is the 99th percentile (i.e., only 1% of the data lies above this number). As we have learned before, leaving an area of \(\alpha\) in each tail keeps \(1-2\alpha\) in the middle, so \(z_{0.01}\) is the \(z\)-score we need for a 98% confidence interval, \(z_{0.025}\) is the \(z\)-score we need for a 95% confidence interval, and so on.
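The link between a confidence level and its \(z\)-score can be checked directly in R. The sketch below uses only the base function qnorm; the variable names (conf_level, alpha, z) are chosen here for illustration.

```r
# Confidence levels we want z-scores for
conf_level <- c(0.90, 0.95, 0.98, 0.99)

# Each tail gets half of what is left over: alpha = (1 - level) / 2
alpha <- (1 - conf_level) / 2

# z_alpha is the (1 - alpha) quantile of the standard normal distribution
z <- qnorm(1 - alpha)

# One row per confidence level; the z column reads 1.645, 1.960, 2.326, 2.576
data.frame(conf_level, alpha, z = round(z, 3))
```

The third row reproduces \(z_{0.01}=2.326\) for a 98% confidence interval.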

Interval Estimator

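If the population standard deviation, \(\sigma\), is known, a confidence interval for a population mean \(\mu\) is \[\left(\overline{x} - z_{\alpha}\dfrac{\sigma}{\sqrt n}, \quad\overline{x} + z_{\alpha}\dfrac{\sigma}{\sqrt n}\right).\]

If the population standard deviation, \(\sigma\), is unknown, a confidence interval for a population mean \(\mu\) is \[\left(\overline{x} - t_{\alpha}\dfrac{s}{\sqrt n}, \quad\overline{x} + t_{\alpha}\dfrac{s}{\sqrt n}\right).\] As with \(z_\alpha\), a right-tail area of \(\alpha\) gives a confidence level of \(1-2\alpha\) (for example, \(t_{0.025}\) for a 95% confidence interval).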


What differences have you noticed between the two formulas (known \(\sigma\) vs unknown \(\sigma\))?

  • When the population standard deviation \(\sigma\) is unknown, we use the sample standard deviation \(s\) to replace it in the formula.

  • As a result, we need to change the probability model from the standard normal distribution (\(z_\alpha\)) to a t-distribution (\(t_{\alpha}\)). Below is a comparison of the standard normal distribution (red curve) with a t-distribution with 10 degrees of freedom (df=10, black curve).

T-distribution

Any t-distribution has a bell-shaped curve centered at 0, similar to the standard normal distribution. However, it has fatter tails. Therefore, compared to the standard normal distribution, a t-distribution gives extreme values in the tails a greater chance of occurring. Since we do not know the population standard deviation and must estimate it with \(s\), we want to allow for more extreme values in the tails, hence the change from the standard normal distribution to a t-distribution.
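The "fatter tails" claim can be checked numerically. The sketch below compares the two curves at df=10 using the base R functions dnorm, dt, pnorm, and pt; the cutoff values 3 and 2.5 are arbitrary choices made here for illustration.

```r
df <- 10  # degrees of freedom used in the comparison

# Height of each curve at the center and out in the tail
c(normal = dnorm(0), t = dt(0, df))   # the t curve is slightly lower at 0
c(normal = dnorm(3), t = dt(3, df))   # the t curve is higher at 3

# Probability of landing more than 2.5 units above the center
tail_z <- pnorm(2.5, lower.tail = FALSE)
tail_t <- pt(2.5, df, lower.tail = FALSE)
c(normal = tail_z, t = tail_t)        # the t tail is the larger of the two
```

Because both curves enclose a total area of 1, the lower peak of the t curve is what pays for the extra area in its tails.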

A t-distribution has a parameter called the degrees of freedom, which in this setting equals the sample size minus 1. So, the degrees of freedom for a sample of size 20 are 19.
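To see how much the degrees of freedom matter, the sketch below computes \(t_{0.025}\) for a few sample sizes and sets it next to \(z_{0.025}\). The sample sizes are made up for illustration; only qt and qnorm from base R are used.

```r
n  <- c(5, 20, 100, 1000)   # illustrative sample sizes
df <- n - 1                 # degrees of freedom: sample size minus 1

t_crit <- qt(0.975, df)     # t_{0.025} for each sample size
z_crit <- qnorm(0.975)      # z_{0.025}, about 1.96

# t_crit is always above z_crit and moves toward it as n grows
data.frame(n, df, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))
```

For small samples the t-score is noticeably larger than the \(z\)-score, which makes the interval wider; for large samples the two are nearly the same.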


The table below compares the R commands for the t-distribution with those for the standard normal distribution. As you can see, the only difference is that we have to supply the degrees of freedom (df) for the t-distribution.

              T-distribution    Standard Normal Distribution
Probability   pt(t, df)         pnorm(z)
Percentile    qt(p, df)         qnorm(p)
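The four commands in the table can be tried side by side. This is a minimal sketch; the inputs 1.5 and 0.90 and the choice df=10 are arbitrary values picked here for illustration.

```r
df <- 10  # degrees of freedom, chosen only for illustration

# Probability: area to the left of a given value
pt(1.5, df)      # P(T <= 1.5) in a t-distribution
pnorm(1.5)       # P(Z <= 1.5) in the standard normal distribution

# Percentile: the value that has a given area to its left
q <- qt(0.90, df)   # 90th percentile of the t-distribution
qnorm(0.90)         # 90th percentile of the standard normal distribution

# qt undoes pt: feeding the percentile back in recovers the area 0.90
pt(q, df)
```

The probability commands take a value and return an area, while the percentile commands take an area and return a value.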


Example: Find the 99th percentile of a t-distribution with 10 degrees of freedom.

Use qt(0.99, 10) to see \(t_{0.01}=2.764\). (We denote it by \(t_{0.01}\) because 1% of the data in the distribution lies above this number.)



Ross 8.13: The Environmental Protection Agency (EPA) is concerned about the amounts of PCB, a toxic chemical, in the milk of nursing mothers. In a sample of 20 women, the amounts (in parts per million) of PCB were as follows: \[16, 0, 0, 2, 3, 6, 8, 2, 5, 0, 12, 10, 5, 7, 2, 3, 8, 17, 9, 1\] Use these data to obtain a

  (a) 95% confidence interval for the average amount of PCB in the milk of nursing mothers.
  • Plug the following numbers into the formula:
    • \(\overline{x}=5.8, s=5.085, n=20\).
    • \(t_{0.025}=2.093\), from qt(0.975, 19); the degrees of freedom are 19 (sample size minus 1).
  • The interval is \(5.8 \pm 2.093 \times \dfrac{5.085}{\sqrt{20}} = 5.8 \pm 2.38\).
  • We are 95% confident that the true average amount of PCB in the milk of nursing mothers is between 3.42 ppm and 8.18 ppm.
  (b) 99% confidence interval for the average amount of PCB in the milk of nursing mothers.
  • Same as (a), except \(t_{0.005}=2.861\), from qt(0.995, 19); the degrees of freedom are still 19.
  • The interval is \(5.8 \pm 2.861 \times \dfrac{5.085}{\sqrt{20}} = 5.8 \pm 3.25\).
  • We are 99% confident that the true average amount of PCB in the milk of nursing mothers is between 2.55 ppm and 9.05 ppm. This interval is wider than the one in (a): we lose precision as we increase the confidence level.
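Both intervals can be reproduced in R from the raw data. The sketch below applies the formula step by step and then cross-checks the 95% result against R's built-in t.test; the helper name t_interval is hypothetical and introduced only for this illustration.

```r
pcb <- c(16, 0, 0, 2, 3, 6, 8, 2, 5, 0, 12, 10, 5, 7, 2, 3, 8, 17, 9, 1)

n    <- length(pcb)   # 20
xbar <- mean(pcb)     # 5.8
s    <- sd(pcb)       # about 5.085

# Hypothetical helper: applies the unknown-sigma formula for a confidence level
t_interval <- function(conf_level) {
  alpha  <- (1 - conf_level) / 2                      # area in each tail
  margin <- qt(1 - alpha, df = n - 1) * s / sqrt(n)   # t_alpha * s / sqrt(n)
  c(lower = xbar - margin, upper = xbar + margin)
}

round(t_interval(0.95), 2)  # 3.42 8.18
round(t_interval(0.99), 2)  # 2.55 9.05

# Cross-check: t.test reports the same 95% interval
t.test(pcb, conf.level = 0.95)$conf.int
```

In practice t.test alone is enough; writing out the formula simply makes each piece of the calculation visible.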