Numerical Data

Load Dataset

We will continue to work with the credit dataset, a simulated dataset containing the credit information of a large number of customers¹.

credit <- read.csv('https://albums.yuanting.lu/sta126/data/credit.csv')

Dataset Overview

Some review questions from previous notes: What are the variables (columns) in this dataset? Which of the variables are categorical? Which ones are numerical? Are the numerical variables continuous or discrete?

head(credit)

Bar Graph

Let’s start with the variable Cards, which shows the number of credit cards owned by each person in the dataset. Create a frequency table for Cards. In the output table, row one displays the number of cards and row two displays the corresponding frequencies.

table(credit$Cards)

We can then visualize the frequency table in a bar graph. The ylim option sets the lower and upper limits of the y-axis (e.g., from 0 to 120).

barplot(table(credit$Cards), main = "Number of Credit Cards", ylim = c(0, 120), ylab = "Frequency")

Practice Produce bar graphs for the variables Age and Balance. Do you see any issues with these graphs?

Histogram

You probably noticed that the bar graphs for Age and Balance look cluttered. In general, bar graphs work well with categorical or discrete variables that have only a small number of distinct values. When there are many distinct values or the variable is continuous, we break up the values into disjoint classes (bins) and display them as a histogram. Unlike barplot, which works with a frequency table, the hist function works with raw data. The xlab option adds a text label to the x-axis.

hist(credit$Age, xlab = "Age", main = "Distribution of Credit Card Holders' Ages", right = FALSE)

To manually set up the partition of the bins, add a breaks option. For example, hist(credit$Age, breaks = c(0, 20, 40, 60, 80, 100), right = FALSE, xlab = "Age") will consolidate the Age data into 5 bins: [0, 20), [20, 40), [40, 60), [60, 80), and [80, 100).
To make the bins left open and right closed, i.e., (0, 20], (20, 40], (40, 60], (60, 80], and (80, 100], use the option right = TRUE. This is the default when the right option is omitted.
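To see how the right option decides which bin a boundary value falls into, here is a minimal sketch. The vector ages_demo and the break points in cuts are made up for illustration and are not part of the credit dataset. The option plot = FALSE asks hist to return the bin counts without drawing anything.

```r
# Hypothetical ages chosen so that several values sit exactly on a break
ages_demo <- c(20, 20, 35, 40, 60)
cuts <- c(0, 20, 40, 60, 80)

# right = FALSE: bins are closed on the left, so 20 is counted in [20, 40)
left_closed <- hist(ages_demo, breaks = cuts, right = FALSE, plot = FALSE)$counts

# right = TRUE (the default): bins are closed on the right, so 20 is counted in (0, 20]
right_closed <- hist(ages_demo, breaks = cuts, right = TRUE, plot = FALSE)$counts

left_closed
right_closed
```

The two count vectors differ only because of the values that sit exactly on a break.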

Practice Produce histograms for the other numerical variables. What are the shapes of the distribution?

Frequency Polygon

A frequency polygon for a continuous variable is based on a histogram. Each bin of the histogram is represented by a vertex, whose x-coordinate is the midpoint of the bin and whose y-coordinate is the frequency. Then, the polygon is completed by connecting all the representative points.

For example, the previous histogram has eight bins from 20 to 100. So, the x-coordinates of the vertices are 25, 35, 45, 55, 65, 75, 85, and 95. The y-coordinates are the frequencies in the groups [20, 30), [30, 40), [40, 50), [50, 60), [60, 70), [70, 80), [80, 90), and [90, 100). The freqpoly function is not a built-in R function. It is included in the nicer.R file written by your instructor.

The breaks option sets the partition points of the histogram bins.
The pch option (plot character) specifies the symbol for points. A value of 1 means an empty circle. Change the integer to see other types of symbols.

source("https://albums.yuanting.lu/sta126/tools/nicer.R")
freqpoly(credit$Age, breaks = c(20, 30, 40, 50, 60, 70, 80, 90, 100),
         pch = 1, xlab = 'Age', ylab = "Frequency", 
         main = "A Frequency Polygon of Credit Card Holders' Ages")
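If nicer.R is not available, the same construction can be sketched with base R only. This is not the instructor's freqpoly function, and its output may differ in details. The vector ages_demo and the break points in cuts are hypothetical.

```r
# Hypothetical ages, used only to illustrate the construction
ages_demo <- c(23, 27, 31, 34, 38, 42, 45, 47, 51, 58, 63, 66, 72)
cuts <- seq(20, 80, by = 10)

# Bin the data without drawing; right = FALSE gives bins such as [20, 30)
h <- hist(ages_demo, breaks = cuts, right = FALSE, plot = FALSE)

# h$mids holds the bin midpoints and h$counts holds the bin frequencies.
# type = "b" draws both the points and the line segments connecting them.
plot(h$mids, h$counts, type = "b", pch = 1,
     xlab = "Age", ylab = "Frequency",
     main = "Frequency Polygon (base R sketch)")
```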

Practice Produce frequency polygons for the variables Balance and Cards.

Ogive

An ogive is a graph that represents the cumulative frequency or cumulative relative frequency.
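One way to sketch an ogive in base R is to accumulate the histogram counts and plot them against the upper boundary of each bin. The vector ages_demo and the break points in cuts are hypothetical and only serve as an illustration.

```r
# Hypothetical ages, used only to illustrate the construction
ages_demo <- c(23, 27, 31, 34, 38, 42, 45, 47, 51, 58, 63, 66, 72)
cuts <- seq(20, 80, by = 10)

# Bin the data without drawing
h <- hist(ages_demo, breaks = cuts, right = FALSE, plot = FALSE)

# Cumulative frequency at each break; it starts at 0 at the first break
cum_freq <- c(0, cumsum(h$counts))

# Cumulative relative frequency: divide by the number of observations
cum_rel_freq <- cum_freq / length(ages_demo)

plot(cuts, cum_freq, type = "b", pch = 1,
     xlab = "Age", ylab = "Cumulative Frequency",
     main = "Ogive (base R sketch)")
```

Replacing cum_freq with cum_rel_freq in the plot call gives the cumulative relative frequency version, which always ends at 1.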

Stem-and-Leaf Plot

In the histogram and the frequency polygon, we see the overall shape of the distribution but not the individual data values. A stem-and-leaf plot provides both and works well with small to moderate-sized datasets. Below is a stem-and-leaf plot for the variable Age.² Does it look like a rotated histogram?

stem(credit$Age, scale = 0.5)

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   2 | 344455555556778888899999
##   3 | 000000001112222222233333334444445555556666677777777888889999
##   4 | 00000001111111112222333334444444444444555556666666666777777777778888
##   5 | 0000000000001111112222223334444555556666666667777777777888888999999
##   6 | 00000012222223333344444444455555666666666666677777777778888889999999
##   7 | 0000000111111111222222233444455555555666667777788888889999999
##   8 | 00000001111111111222223333334444456779
##   9 | 18

Descriptive Statistics

We will continue using the Age variable as the example for descriptive statistics.

Statistics	Example	Notes
Minimum	`min(credit$Age)`
Maximum	`max(credit$Age)`
Mean	`mean(credit$Age)`	Average.
Median	`median(credit$Age)`	Center.
Percentile	`quantile(credit$Age, 0.3, type = 2)`³	The 30th percentile.
Five-number summary	`fivenum(credit$Age)`	Min, Q1, median, Q3, max.
Sample variance	`var(credit$Age)`
Sample standard deviation	`sd(credit$Age)`
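The type option of quantile matters more than it may seem. The sketch below compares type = 2 with the default (type = 7) on a small made-up vector, scores_demo, which is not part of the credit dataset.

```r
# Hypothetical data: the integers 1 through 20
scores_demo <- 1:20

# type = 2: since 0.25 * 20 = 5 is a whole number,
# the result averages the 5th and 6th sorted values
p_type2 <- quantile(scores_demo, 0.25, type = 2)

# Default (type = 7): linear interpolation between neighboring sorted values
p_default <- quantile(scores_demo, 0.25)

p_type2
p_default
```

The two results are close but not equal, so the type should be stated whenever percentiles are compared with textbook answers.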

Boxplot

Another popular type of graph for displaying a continuous variable is the boxplot, also known as the box-and-whisker plot.

The option horizontal = TRUE displays the boxplot horizontally. If we change it to FALSE, we will see a vertical boxplot.
The option whisklty = 1 defines the whisker line type. The digit 1 means a solid line⁴.

boxplot(credit$Age, whisklty = 1, xlab = "Age", col = 'lightblue',
        horizontal = TRUE, main = "Cardholders' Age Distribution")

The box marks the interquartile range (IQR) of the dataset, i.e., $IQR = Q3 - Q1$.
Any data value below $Q1-1.5\times IQR$ or above $Q3 + 1.5 \times IQR$ is considered an outlier.
By default, a boxplot marks outliers as open circles on the graph. If we add the option outline = FALSE, then the outliers (if any) will not be drawn.
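The outlier rule can be checked by hand. The sketch below uses a made-up vector, x_demo, and takes Q1 and Q3 from fivenum, which is what R's boxplot.stats uses internally. These hinges can differ slightly from the quartiles returned by quantile.

```r
# Hypothetical data with one unusually large value
x_demo <- c(2, 4, 5, 5, 6, 7, 8, 30)

# fivenum returns min, lower hinge (Q1), median, upper hinge (Q3), max
five <- fivenum(x_demo)
q1 <- five[2]
q3 <- five[4]
iqr <- q3 - q1

# Fences: values outside [lower_fence, upper_fence] are flagged as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- x_demo[x_demo < lower_fence | x_demo > upper_fence]
outliers

# The same values are reported by boxplot.stats
boxplot.stats(x_demo)$out
```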

Standard Deviation

Given a sample of $n$ numbers $x_1, x_2, x_3, \cdots, x_n$ with mean $\bar{x}$, the sample standard deviation is defined by the formula

\[\displaystyle \sqrt{\frac{\sum_{i=1}^n(x_i - \bar{x})^2}{n-1}}.\]

We typically denote a sample standard deviation by $s$.
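As a quick check of the formula, the sketch below computes the standard deviation step by step for a small made-up sample, x_demo, and compares the result with R's built-in sd function.

```r
# Hypothetical sample of n = 8 numbers
x_demo <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x_demo)
x_bar <- mean(x_demo)

# Sum of squared deviations from the mean, divided by n - 1, then square root
s_manual <- sqrt(sum((x_demo - x_bar)^2) / (n - 1))

s_manual
sd(x_demo)   # the built-in function uses the same n - 1 formula
```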

Classroom Activity Consider hometown temperatures, heart rates, and semester credit hours for the entire class. Can you rank the standard deviations of the three variables?


$z$-Scores

The $z$-score represents the distance that a data value is from the mean in terms of the number of standard deviations.

\[z = \dfrac{x-\mu}{\sigma}\mbox{ or }z = \dfrac{x-\bar{x}}{s}\]

Example Which score is better, an SAT of 1380 or an ACT of 29? Assume SAT has a mean of 1050 and a standard deviation of 100, while ACT has a mean of 21 and a standard deviation of 5.
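The comparison can be worked out in R by converting both scores to $z$-scores with the means and standard deviations given in the example. The variable names below are chosen for illustration.

```r
# z-score for the SAT score: (x - mean) / standard deviation
z_sat <- (1380 - 1050) / 100

# z-score for the ACT score
z_act <- (29 - 21) / 5

z_sat
z_act

# The score with the larger z-score is further above its own mean
z_sat > z_act
```

The SAT score is 3.3 standard deviations above its mean, while the ACT score is 1.6 standard deviations above its mean, so the SAT score is the better one relative to the other test takers.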

Empirical Rule

A dataset is said to be normal if its histogram is bell-shaped and symmetric, with its highest bar at the middle interval.

If a dataset is approximately normal with sample mean $\bar{x}$ and sample standard deviation $s$, then the following are true.

Approximately 68% of the observations lie between $\bar{x} - s$ and $\bar{x} + s$.
Approximately 95% of the observations lie between $\bar{x} - 2s$ and $\bar{x} + 2s$.
Approximately 99.7% of the observations lie between $\bar{x} - 3s$ and $\bar{x} + 3s$.

Example The stem-and-leaf plot displays 25 test scores.

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   5 | 0358
##   6 | 22457
##   7 | 0035589
##   8 | 344669
##   9 | 004

Find the sample mean and sample standard deviation.
According to the empirical rule, what percentage of observations is between 60.88 and 86.48? What is the actual percentage of observations that are between those two scores?
According to the empirical rule, what percentage of observations is between 48.08 and 99.28? What is the actual percentage of observations that are between those two scores?
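One possible way to set up these calculations in R is sketched below. The scores are read off the stem-and-leaf plot above. The helper within_k is a hypothetical name introduced here, not a built-in function.

```r
# The 25 test scores read from the stem-and-leaf plot
scores <- c(50, 53, 55, 58,
            62, 62, 64, 65, 67,
            70, 70, 73, 75, 75, 78, 79,
            83, 84, 84, 86, 86, 89,
            90, 90, 94)

x_bar <- mean(scores)
s <- sd(scores)

# Proportion of scores within k standard deviations of the mean
within_k <- function(k) {
  mean(scores >= x_bar - k * s & scores <= x_bar + k * s)
}

within_k(1)   # compare with the 68% from the empirical rule
within_k(2)   # compare with the 95% from the empirical rule
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is why within_k returns a proportion rather than a count.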

Example The following dataset has the weights of 200 people from a fitness club.

club <- read.csv("https://albums.yuanting.lu/sta126/data/Ross-ex-3-21.csv")

Create a stem-and-leaf plot to show the distribution of weights for the 200 people. The distribution is bimodal.
Create stem-and-leaf plots for female and male club members separately. Run the following two lines to separate the club dataset by gender.

women <- club[which(club$Gender == "Female"), ]
men <- club[which(club$Gender == "Male"), ]

Verify the empirical rule in the female group.

Graphic Skills

Put multiple graphs in one canvas. The option mfrow = c(2, 2) in the par function divides the canvas into four subpanels, two rows and two columns. Subsequent graphs go into the subpanels row by row.⁵
Anything after the # sign in a line is a comment. Comments are not run by R. They are there to make the code more readable to the audience.
The rug function and the jitter function combined add each individual data point as a tick mark just above the x-axis. It is a nice way to display the density of the distribution.
The option xlab = NA in the histograms hides the x-axis labels.
The option las = 2 draws the tick labels perpendicular to the axes in the top right subpanel, so the y-axis tick labels are rotated to read horizontally.
The option frame = FALSE in the boxplots hides the frame of the boxplots.

par(mfrow = c(2, 2))

# Top left subpanel
hist(credit$Income, main = "Distribution of Income", xlab = NA)
rug(jitter(credit$Income))

# Top right subpanel
hist(credit$Age, main = "Distribution of Cardholders' Age", xlab = NA, las = 2)
rug(jitter(credit$Age))

# Bottom left subpanel
boxplot(credit$Income, horizontal = TRUE, xlab = "Income ($1000)", frame = FALSE, ylim = c(0, 200))

# Bottom right subpanel
boxplot(credit$Age, horizontal = TRUE, xlab = "Age", frame = FALSE)

¹ Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. (2013). An Introduction to Statistical Learning: with Applications in R. New York: Springer.
² The default scale is 1. Adding an option scale = 2 will double the length of the graph, i.e., double the number of stems.
³ There are 9 different ways in R to identify a percentile. The algorithm that agrees with our textbook definition is the second one, so we add the option type = 2.
⁴ Change the digit from 1 to 6 to see other available line types.
⁵ The mfrow option in the par function creates equal space for each subpanel. To get more control over allocating space, look up the layout function.