Load Dataset
We will continue to work with the credit dataset, a simulated
dataset containing the credit information of a large number of
customers.
credit <- read.csv('https://albums.yuanting.lu/sta126/data/credit.csv')
Dataset Overview
Some review questions from previous notes: What are the variables
(columns) in this dataset? Which of the variables are categorical? Which
ones are numerical? Are the numerical variables continuous or discrete?
Bar Graph
Let’s start with the variable Cards, which shows the
number of credit cards owned by each person in the dataset. Create a
frequency table for Cards. In the output table, row one
displays the number of cards and row two displays the corresponding
frequencies.
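As a minimal sketch of what the frequency table looks like, the code below uses a short made-up vector (the name cards and its values are hypothetical) in place of credit$Cards, so the counts are easy to check by hand.

```r
# Hypothetical toy data standing in for credit$Cards
cards <- c(2, 3, 3, 1, 2, 3, 4, 2, 3, 1)

# table() counts how many times each distinct value occurs
card_freq <- table(cards)
print(card_freq)

# The names are the distinct values; the entries are the frequencies
names(card_freq)        # "1" "2" "3" "4"
as.vector(card_freq)    # 2 3 4 1
```

On the real data, table(credit$Cards) produces the same kind of two-row output.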
We can then visualize the frequency table in a bar graph. The
ylim option sets the lower and upper limits of the y-axis
(e.g., from 0 to 120).
barplot(table(credit$Cards), main = "Number of Credit Cards", ylim = c(0, 120), ylab = "Frequency")

Practice Produce bar graphs for the
variables Age and Balance. Do you see any
issues with these graphs?
Histogram
You probably noticed that the bar graphs for both
Age and Balance are cluttered and hard to read. In general, bar graphs work
well with categorical or discrete variables that have only a small
number of distinct values. When there are many distinct values or the
variable is continuous, we break up the values into disjoint classes
(bins) and display them as a histogram. Unlike
barplot, which works with a frequency table, the
hist function works with raw data. The xlab option
adds a text label to the x-axis.
hist(credit$Age, xlab = "Age", main = "Distribution of Credit Card Holders' Ages", right = FALSE)

- To manually set up the partition of the bins, add a
breaks option. For example,
hist(credit$Age, breaks = c(0, 20, 40, 60, 80, 100), xlab = "Age", right = FALSE)
will consolidate the Age data into 5 bins: [0, 20), [20,
40), [40, 60), [60, 80), and [80, 100).
- To make the bins left open and right closed, i.e., (0, 20], (20,
40], (40, 60], (60, 80], and (80, 100], use the option right =
TRUE. This is the default for hist when right is not specified.
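The effect of the right option is easiest to see with values that fall exactly on a break. The sketch below uses a few made-up ages (hypothetical, not taken from the credit dataset) and asks hist for the bin counts only, without drawing anything.

```r
# Hypothetical toy ages, chosen so that several values sit exactly on a break
ages <- c(20, 25, 40, 40, 59, 60, 80)
bins <- c(0, 20, 40, 60, 80, 100)

# plot = FALSE returns the bin counts without drawing the histogram
left_closed  <- hist(ages, breaks = bins, right = FALSE, plot = FALSE)$counts
right_closed <- hist(ages, breaks = bins, right = TRUE,  plot = FALSE)$counts

left_closed   # 0 2 3 1 1
right_closed  # 1 3 2 1 0
```

With right = FALSE the value 20 is counted in the second bin, while with right = TRUE it is counted in the first bin.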
Practice Produce histograms for the other
numerical variables. What are the shapes of the
distributions?
Frequency Polygon
A frequency polygon for a continuous
variable is based on a histogram. Each bin of the histogram is
represented by a vertex, whose x-coordinate is the midpoint of the bin and
y-coordinate is the frequency. Then, the polygon is completed by
connecting all the representative points.
For example, the previous histogram has eight bins from 20 to 100.
So, the x-coordinates of the vertices are 25, 35, 45, 55, 65, 75, 85,
and 95. The y-coordinates are the frequencies in the groups [20, 30),
[30, 40), [40, 50), [50, 60), [60, 70), [70, 80), [80, 90), and [90,
100). The freqpoly function is not a built-in R function. It
is included in the nicer.R file written by your
instructor.
- The breaks option sets the partition points of the
histogram bins.
- The pch option (plot
character) specifies the symbol for points. A value of
1 means an empty circle. Change the integer to see other types
of symbols.
source("https://albums.yuanting.lu/sta126/tools/nicer.R")
freqpoly(credit$Age, breaks = c(20, 30, 40, 50, 60, 70, 80, 90, 100),
pch = 1, xlab = 'Age', ylab = "Frequency",
main = "A Frequency Polygon of Credit Cards Holders' Age")
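Since freqpoly comes from the instructor's file, its internals are not shown here. The following is only a rough base R sketch of the same idea, not the instructor's implementation: bin the data with hist, then plot the bin midpoints against the bin frequencies. The toy ages are hypothetical.

```r
# Hypothetical toy ages standing in for credit$Age
ages <- c(23, 28, 34, 35, 41, 47, 52, 58, 66, 71, 79, 84, 91)
bins <- seq(20, 100, by = 10)

# Bin the data without drawing; right = FALSE gives classes such as [20, 30)
h <- hist(ages, breaks = bins, right = FALSE, plot = FALSE)

# Vertices: bin midpoints (x) against bin frequencies (y), joined by lines
plot(h$mids, h$counts, type = "b", pch = 1,
     xlab = "Age", ylab = "Frequency",
     main = "Frequency Polygon (base R sketch)")

h$mids  # 25 35 45 55 65 75 85 95
```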

Practice Produce frequency polygons for the
variables Balance and Cards.
Ogive
An ogive is a graph that represents the cumulative
frequency or cumulative relative frequency.
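One possible way to draw an ogive in base R is sketched below, assuming the same kind of bins used for a histogram. The toy ages are hypothetical; the cumulative frequency is plotted at the upper boundary of each class, starting from 0 at the lowest boundary.

```r
# Hypothetical toy ages standing in for credit$Age
ages <- c(23, 28, 34, 35, 41, 47, 52, 58, 66, 71, 79, 84, 91)
bins <- seq(20, 100, by = 10)

# Class frequencies, then running totals
counts   <- hist(ages, breaks = bins, right = FALSE, plot = FALSE)$counts
cum_freq <- cumsum(counts)

# Cumulative frequency at each class boundary, starting from 0
plot(bins, c(0, cum_freq), type = "b", pch = 1,
     xlab = "Age", ylab = "Cumulative frequency", main = "Ogive (sketch)")

cum_freq                  # 2 4 6 8 9 11 12 13
cum_freq / length(ages)   # cumulative relative frequencies, ending at 1
```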

Stem-and-Leaf Plot
In the histogram and the frequency polygon, we see the overall shape
of the distribution but not the individual data. A
stem-and-leaf plot provides both and works
well with small-to-moderate-size datasets. Below is a stem-and-leaf plot
for the variable Age. Does it look like a rotated histogram?
stem(credit$Age, scale = 0.5)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 2 | 344455555556778888899999
## 3 | 000000001112222222233333334444445555556666677777777888889999
## 4 | 00000001111111112222333334444444444444555556666666666777777777778888
## 5 | 0000000000001111112222223334444555556666666667777777777888888999999
## 6 | 00000012222223333344444444455555666666666666677777777778888889999999
## 7 | 0000000111111111222222233444455555555666667777788888889999999
## 8 | 00000001111111111222223333334444456779
## 9 | 18
Descriptive Statistics
We will continue using the Age variable as an example for
descriptive statistics.
| Statistic | Example | Notes |
| Minimum | min(credit$Age) | |
| Maximum | max(credit$Age) | |
| Mean | mean(credit$Age) | Average. |
| Median | median(credit$Age) | Center. |
| Percentile | quantile(credit$Age, 0.3, type = 2) | The 30th percentile. |
| Five-number summary | fivenum(credit$Age) | Min, Q1, median, Q3, max. |
| Sample variance | var(credit$Age) | |
| Sample standard deviation | sd(credit$Age) | |
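To see what these functions return, here is a small sketch on ten made-up ages (hypothetical values, not from the credit dataset), chosen so that the results can be checked by hand.

```r
# Hypothetical toy ages standing in for credit$Age
ages <- c(23, 35, 41, 47, 52, 60, 68, 75, 81, 98)

mean(ages)      # 58
median(ages)    # 56, the average of the two middle values 52 and 60
fivenum(ages)   # 23 41 56 75 98
var(ages)       # sample variance (divides by n - 1)
sd(ages)        # square root of the sample variance
```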
Boxplot
Another popular type of graph for displaying a continuous variable is the
boxplot, also known as the box
and whisker plot.
- The option horizontal = TRUE displays the boxplot
horizontally. If we change it to FALSE, we will see a vertical
boxplot.
- The option whisklty = 1 defines the
whisker line type.
The digit 1 means a solid line.
boxplot(credit$Age, whisklty = 1, xlab = "Age", col = 'lightblue',
horizontal = TRUE, main = "Cardholders' Age Distribution")

- The box marks the interquartile range
(IQR) of the dataset, i.e., \(IQR = Q3 - Q1\).
- Any data value below \(Q1-1.5\times
IQR\) or above \(Q3 + 1.5 \times
IQR\) is considered an
outlier.
- By default, a boxplot identifies outliers as open circles on the
graph. If we add the option outline = FALSE, then the
outliers (if any) will not be singled out.
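The outlier rule above can be checked by hand. The sketch below uses a made-up sample (hypothetical) and takes Q1 and Q3 from fivenum, which is what boxplot uses internally; quantile may give slightly different quartiles depending on its type argument.

```r
# Hypothetical toy data with one unusually large value
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

# fivenum() returns min, Q1 (lower hinge), median, Q3 (upper hinge), max
q   <- fivenum(x)
iqr <- q[4] - q[2]

lower_fence <- q[2] - 1.5 * iqr
upper_fence <- q[4] + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers                 # 30

# boxplot.stats() reports the points that boxplot() would draw as circles
boxplot.stats(x)$out     # 30
```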
Standard Deviation
Given a sample of
\(n\) numbers
\(x_1, x_2, x_3, \cdots, x_n\) with a mean of
\(\bar{x}\), the sample
standard
deviation is defined by the formula
\[\displaystyle
\sqrt{\frac{\sum_{i=1}^n(x_i - \bar{x})^2}{n-1}}.\]
We typically denote a sample standard deviation by \(s\).
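As a quick check of the formula, the sketch below computes the standard deviation step by step on a small made-up sample (hypothetical values) and compares it with the built-in sd function.

```r
# Hypothetical toy sample
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n     <- length(x)
x_bar <- mean(x)                                 # 5

# Sum of squared deviations from the mean, divided by n - 1, then square root
s_manual <- sqrt(sum((x - x_bar)^2) / (n - 1))

s_manual
sd(x)
all.equal(s_manual, sd(x))                       # TRUE
```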
Classroom Activity Consider hometown
temperature, heart rates, and semester credit hours for the entire
class. Can you rank the standard deviations of the three variables?
\(z\)-Scores
The \(z\)-score
represents the distance that a data value is from the mean in terms of
the number of standard deviations.
\[z =
\dfrac{x-\mu}{\sigma}\mbox{ or }z =
\dfrac{x-\bar{x}}{s}\]
Example Which score is better, an SAT of
1380 or an ACT of 29? Assume SAT has a mean of 1050 and a standard
deviation of 100, while ACT has a mean of 21 and a standard deviation of
5.
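One way to work this example is to convert each score to a z-score using the means and standard deviations given above, then compare. The variable names below are only illustrative.

```r
# z = (x - mean) / sd, using the values stated in the example
z_sat <- (1380 - 1050) / 100   # 3.3
z_act <- (29 - 21) / 5         # 1.6

z_sat > z_act                  # TRUE
```

The SAT score is 3.3 standard deviations above its mean, while the ACT score is 1.6 standard deviations above its mean, so under these assumptions the SAT score is the better one.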
Empirical Rule
A dataset is said to be normal if its
histogram is bell-shaped and symmetric, with its highest bar at the middle
interval.
If a dataset is approximately normal with sample mean
\(\bar{x}\)
and sample standard deviation \(s\), then the following are
true.
- Approximately 68% of the observations lie between \(\bar{x} - s\) and \(\bar{x} + s\).
- Approximately 95% of the observations lie between \(\bar{x} - 2s\) and \(\bar{x} + 2s\).
- Approximately 99.7% of the observations lie between \(\bar{x} - 3s\) and \(\bar{x} + 3s\).
Example The stem-and-leaf plot displays 25
test scores.
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 0358
## 6 | 22457
## 7 | 0035589
## 8 | 344669
## 9 | 004
- Find the sample mean and sample standard deviation.
- According to the empirical rule, what percentage of observations is
between 60.88 and 86.48? What is the actual percentage of observations
that are between those two scores?
- According to the empirical rule, what percentage of observations is
between 48.08 and 99.28? What is the actual percentage of observations
that are between those two scores?
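The questions above can be checked in R by typing in the 25 scores read off the stem-and-leaf plot. This is a sketch; the helper prop_within is a hypothetical name introduced here for illustration.

```r
# The 25 test scores from the stem-and-leaf plot
scores <- c(50, 53, 55, 58,
            62, 62, 64, 65, 67,
            70, 70, 73, 75, 75, 78, 79,
            83, 84, 84, 86, 86, 89,
            90, 90, 94)

x_bar <- mean(scores)   # 73.68
s     <- sd(scores)     # about 12.8

# Proportion of scores strictly within k standard deviations of the mean
prop_within <- function(k) {
  mean(scores > x_bar - k * s & scores < x_bar + k * s)
}

prop_within(1)   # 0.68
prop_within(2)   # 1
```

Here the actual proportion within one standard deviation matches the 68% from the empirical rule, and every score falls within two standard deviations.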
Example The following dataset has the
weights of 200 people from a fitness club.
club <- read.csv("https://albums.yuanting.lu/sta126/data/Ross-ex-3-21.csv")
- Create a stem-and-leaf plot to show the distribution of weights for
the 200 people. The distribution is
bimodal.
- Create stem-and-leaf plots for female and male club members
separately. Run the following two lines to separate the club
dataset by gender.
women <- club[which(club$Gender == "Female"), ]
men <- club[which(club$Gender == "Male"), ]
- Verify the empirical rule in the female group.
[Plus] Graphic Skills
- Put multiple graphs in one canvas. The option mfrow = c(2,
2) in the par function divides the canvas into four
subpanels, two rows and two columns. Subsequent graphs go into the
subpanels row by row.
- Anything after the # sign in a line is a
comment. Comments are not run by R. They are there to make the
code more readable to the audience.
- The rug function and the jitter function
combined add each individual data point as a tick mark just above the
x-axis. It is a nice way to display the density of the
distribution.
- The option xlab = NA in the histograms hides the x-axis
labels.
- The option las = 2 draws the tick labels perpendicular to the axes in the
top right subpanel, so the y-axis labels read horizontally.
- The option frame = FALSE in the boxplots hides the frame
of the boxplots.
par(mfrow = c(2, 2))
# Top left subpanel
hist(credit$Income, main = "Distribution of Income", xlab = NA)
rug(jitter(credit$Income))
# Top right subpanel
hist(credit$Age, main = "Distribution of Cardholders' Age", xlab = NA, las = 2)
rug(jitter(credit$Age))
# Bottom left subpanel
boxplot(credit$Income, horizontal = TRUE, xlab = "Income ($1000)", frame = FALSE, ylim = c(0, 200))
# Bottom right subpanel
boxplot(credit$Age, horizontal = TRUE, xlab = "Age", frame = FALSE)
