Linear Model

We are often interested in trying to determine the relationship between a pair of numerical variables. For example,

  • Advertising spending and product sales.
  • Hours of study and test scores.

One of the two variables is called the input/explanatory/independent variable and the other is called the response/dependent variable.

Example 1. Come up with a couple of bivariate examples. Identify the explanatory and the response variables.



The relationship between the explanatory and the response variables can be depicted by a scatter diagram.

Example 2. Is there a relation between one’s credit rating and one’s credit card limit?

a <- read.csv("https://albums.yuanting.lu/sta126/data/credit.csv")
plot(a$Limit ~ a$Rating, xlab = "Credit Rating", ylab = "Credit Limits")

Example 3. Is there a relation between one’s credit rating and one’s income?
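A possible sketch in R, mirroring Example 2; this assumes the same credit.csv file also has an Income column (an assumption about the file):

```r
# Re-load the data so this snippet is self-contained
a <- read.csv("https://albums.yuanting.lu/sta126/data/credit.csv")
# Scatter diagram of income against credit rating (assumes an Income column)
plot(a$Income ~ a$Rating, xlab = "Credit Rating", ylab = "Income")
```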



Example 4. Is there a relation between one’s credit rating and one’s age?
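Again following the pattern of Example 2, and again assuming the credit.csv file carries an Age column:

```r
# Re-load the data so this snippet is self-contained
a <- read.csv("https://albums.yuanting.lu/sta126/data/credit.csv")
# Scatter diagram of age against credit rating (assumes an Age column)
plot(a$Age ~ a$Rating, xlab = "Credit Rating", ylab = "Age")
```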



Linear Correlation Coefficient

\[r = \dfrac{1}{n-1}\sum \left(\dfrac{x_i - \bar x}{s_x}\right)\left(\dfrac{y_i - \bar y}{s_y}\right)\]


Given the raw data in \(x\) and \(y\), the \(R\) command for the linear correlation coefficient is

cor(x, y)

Properties of the linear correlation coefficient \(r\):

  • \(-1\le r \le 1\).
  • When \(r\) is close to \(1\), there is strong positive linear correlation.
  • When \(r\) is close to \(-1\), there is strong negative linear correlation.
  • When \(r\) is close to 0, there is little to no linear correlation.
  • \(r\) has no unit, i.e., it is a unitless measure of association.
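As a quick sanity check of the formula for \(r\), a hand computation on a small made-up data set agrees with cor():

```r
# Small made-up data set to check the definition of r against cor()
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)
n <- length(x)
# r from the definition: sum of products of z-scores, divided by n - 1
r_manual <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)
r_manual
cor(x, y)  # same value as r_manual
```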

Example 5. Find the linear correlation coefficient in the first three examples. Which pair has the strongest linear correlation?
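A possible sketch for Example 5, assuming the credit.csv file has the Income and Age columns used in Examples 3 and 4:

```r
a <- read.csv("https://albums.yuanting.lu/sta126/data/credit.csv")
cor(a$Rating, a$Limit)    # Example 2 pair
cor(a$Rating, a$Income)   # Example 3 pair (assumes an Income column)
cor(a$Rating, a$Age)      # Example 4 pair (assumes an Age column)
```

The pair whose coefficient is closest to \(1\) or \(-1\) has the strongest linear correlation.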





Correlation vs. Causation

In observational studies, correlation between two variables does not establish a causal relationship between them. Check out some examples of spurious correlations.

Regression Line

\[\widehat{y} = b_1 x + b_0\]
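In terms of the summary statistics already introduced, the least-squares slope and intercept are

\[b_1 = r\,\dfrac{s_y}{s_x}, \qquad b_0 = \bar y - b_1 \bar x\]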

Given the raw data \(x\) and \(y\), the least-squares linear regression line can be found by the \(R\) command

lm(y ~ x)

Example 6. Find the linear regression line between credit limits and credit ratings. What is the expected limit for someone whose credit rating is 900?

lm(a$Limit ~ a$Rating)
## 
## Call:
## lm(formula = a$Limit ~ a$Rating)
## 
## Coefficients:
## (Intercept)     a$Rating  
##     -542.93        14.87
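The prediction at rating 900 can be read off the coefficients above, or obtained with predict(); a sketch, assuming the data frame from Example 2:

```r
a <- read.csv("https://albums.yuanting.lu/sta126/data/credit.csv")
fit <- lm(Limit ~ Rating, data = a)
# Expected limit at a credit rating of 900
predict(fit, newdata = data.frame(Rating = 900))
# From the rounded coefficients: -542.93 + 14.87 * 900, about 12840
```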


Example 7. One of the built-in datasets in \(R\) is trees.

  • Type trees in the \(R\) console to see the data.
  • Type ?trees to see detailed information about the data.

Questions:

  1. Is there a correlation between the volume and the diameter of the trees?
  2. Find the least-squares linear regression line.
  3. Interpret the slope and the \(y\)-intercept in the regression line.
  4. Predict the mean volume of a tree that has a diameter of 16.5 inches.
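The four questions can be sketched as follows; in the trees data the Girth column records the diameter in inches:

```r
# Q1: correlation between diameter (Girth) and volume
cor(trees$Girth, trees$Volume)

# Q2: least-squares regression line of volume on diameter
fit <- lm(Volume ~ Girth, data = trees)
coef(fit)

# Q3 (interpretation): the slope is the change in mean volume per extra
# inch of diameter; the intercept is the fitted volume at diameter 0,
# which lies outside the data and is not meaningful on its own.

# Q4: predicted mean volume at a diameter of 16.5 inches
predict(fit, newdata = data.frame(Girth = 16.5))
```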



Be aware of the extrapolation issue. Extrapolation is predicting values outside the range of the observed data, on the assumption that the fitted trend continues beyond that range.


Example 8. Least-squares explained.

x <- c(3, 5, 7, 9, 11)
y <- c(0, 2, 3, 6, 9)
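One way to see what “least squares” means with the data above: the fitted line has a smaller sum of squared residuals than any other line. For comparison we pit it against a nearby line of our own choosing, \(\widehat y = x - 3\):

```r
x <- c(3, 5, 7, 9, 11)
y <- c(0, 2, 3, 6, 9)

# Sum of squared residuals of the least-squares line
fit <- lm(y ~ x)
sse_fit <- sum(resid(fit)^2)

# Sum of squared errors of a competing line, yhat = x - 3 (chosen for comparison)
sse_other <- sum((y - (x - 3))^2)

sse_fit    # 1.6
sse_other  # 2 -- larger, as least squares guarantees
```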