4. DESCRIPTIVE STATISTICS

Descriptive Statistics is a body of techniques for summarizing
and presenting the essential information in a data set.
Eg: Here are daily high temperatures for Jan 16, 2009 in 30 U.S.
cities:
Albany, 35. Atlanta, 57. Austin, 68. Billings, 43. Boise, 40.
Buffalo, 45. Casper, 32. Chattanooga, 52. Cincinnati, 48.
Columbia, 72. Concord, 40. Des Moines, 28. El Paso, 55. Hartford,
38. Houston, 67. Jacksonville, 74. Key West, 84. Lexington, 49.
Louisville, 50. Miami, 84. Minn.- St. Paul, 30. New York City, 55.
Omaha, 31. Philadelphia, 57. Providence, 47. St. Louis, 36. San
Francisco, 70. Spokane, 37. Tulsa, 47. Wilmington, 55.
A better presentation is the Frequency Distribution:
Group the data into intervals called classes, and record the
frequency (i.e., the number of observations) for each class.
Histograms provide a
graphical representation of a
frequency distribution.
Eg: Daily High
Temperatures:
Class Frequency
20-29
1
30-39
7
40-49
8
50-59
7
60-69
2
70-79
3
80-89
2
8
7
6
Frequency
4. DESCRIPTIVE STATISTICS
5
4
3
2
1
0
20
30
40
50
Clearly, a long list of numbers is not very informative.
Measures of Central Tendency (Location)
Needed: An objective, concise summary of a data set.
For many purposes, just two numbers will suffice:
(1) A measure of central tendency (i.e., the typical value, or location),
(2) A measure of dispersion (fluctuation).
Here, we discuss measures of central tendency. The two most
popular measures are:
The Mean; The Median.
Of these, the mean is the most important.
60
70
80
90
temp
Sample Mean
The mean of a sample of n measurements x1,···, xn is
n
x = 1n ∑ xi = 1n ( x1 + ⋅ ⋅ ⋅ + xn )
i =1
Eg: The first five high temperatures in our data set are 35, 57,
68, 43, 40. The sample size is n=5, and the sample mean is
x = 15 (35 + 57 + 68 + 43 + 40) = 15 (243) = 48.6
• x is the average value. It is also the center of gravity (balance
point) of the data set.
Population Mean
The mean of a population of N measurements x1, ···, xN is
N
μ = 1 ∑ xi = 1 ( x +⋅⋅⋅+ x )
N
N i =1
N 1
Eg: If we view our data set of 30 high temperatures as a
population, the population mean is
30
μ = 1 ∑ x = 1 (35 + 57 + ⋅⋅⋅+ 55) = 50.87
30 i =1 i 30
• x is a statistic, μ is a parameter.
• Often, we would like to know the population mean μ, but our
information is limited to a small random sample. Then we can
use x as an estimate of μ. Using principles of statistical
inference, we can even assess the accuracy of this procedure,
and thereby draw conclusions (“make inferences”) about μ.
• The Problem With x : It is extremely sensitive to outliers
(“extreme observations”, or “wild values”).
These outliers may be due to errors in recording the data, or they
may be real (but exceptional) observations. In either case, it is
usually best to set aside the outliers (to be described separately)
before computing x . Alternatively, use the median.
Median
Given n measurements arranged in order of magnitude,
Median = The Middle Value if n is odd
Median = The average of the two middle values if n is even.
Eg: CEO Base Salary for 2009 for five companies
Merrill Lynch
750,000
Morgan Stanley
800,000
Goldman Sachs
600,000
Lehman Brothers 750,000
American Express 1,240,000
Converting to multiples of $1,000 and arranging the data in
order gives: 600, 750, 750, 800, 1240.
The median base salary is $750,000.
The mean base salary is $828,000.
The mean is substantially larger than the median due to the
outlier, American Express.
If we remove the outlier, then x becomes $725,000, while
the sample median remains at $750,000.
For the 1998 individual baseball salaries, the mean is
$1,447,690, while the median is only $500,000.
• The median divides a data set into two equal parts. Half of the
data lie below the median, and half lie above it.
• The median is resistant to outliers. Thus it can be safely used on
a raw, unexamined data set. (Of course, it is always best to look at
the data; you will usually learn something.)
• Although the median is very useful as a descriptive statistic, it is
rarely used for statistical inference. Reason: No simple
mathematical theory for the median.
Mean and Median Website (Build your own data set, examine
sample statistics):
Simple Measures of Dispersion
The mean (or median) cannot completely summarize a data set.
Once we know the typical value, the next question is:
To what extent do the data fluctuate from their typical value?
Eg: Consider the lifetimes of “GE” and “Philips” light bulbs.
Both brands are rated for 750 hours (average lifetime).
http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html
"GE" and "Philips" Lightbulb Lifetimes (in hours)
Frequency
20
Range
20
10
10
0
0
500
600
700
800
GE
900
1000
The range is the difference between the largest and smallest
values.
540 620 700 780 860 940
Philips
Eg: For the baseball salaries, the highest and lowest values were
$10,000,000 (for Albert Belle and Gary Sheffield) and $170,000
(for the 17 lowest-paid players), so the range was
$10,000,000 − $170,000 = $9,830,000
The GE bulbs exhibit better quality control:
performance is consistent, since there is not much variation.
The range is a very crude measure, containing no information
about the dispersion of the values between the extremes.
Philips’ performance is more erratic: There’s more fluctuation,
although the average is the same as for GE.
It has absolutely no resistance to outliers. (Why?)
A resistant measure of dispersion is provided by the
interquartile range.
IQR = QU − QL = 75th Percentile − 25th Percentile
Definition of Quartiles:
The first (or lower) quartile QL = 25th percentile =
Value such that 25% of the distribution is below it.
The second quartile = 50th percentile = Median.
The third (or upper) quartile QU = 75th percentile =
Value such that 75% of the distribution is below it.
Five Year Performance of Mutual Funds (1988-1992)
Based on a $2000/yr investment
Eg: For the baseball data, the 25th percentile is $210,000,
the 75th percentile is $2,000,000, so the IQR is
$2,000,000 − $210,000 = $1,790,000
• Using the extremes, quartiles and median, we can draw a
boxplot, a graphical summary which reveals basic
distributional properties (Center, spread, skewness, outliers),
and which is especially useful for comparing several data
sets, side-by-side.
For each data set, we draw a
box extending from QL(bottom)
to QU (top). We draw a
horizontal line at the median.
Then, we draw two vertical
“whiskers” from the box to the
most extreme non-outlying
observations.
Any data value beyond the whiskers is declared to be an outlier
and flagged with an asterisk or circle.
whisker
700
GMAT
25000
• IQR is the width of the “middle half” of the data set.
The IQR is resistant to outliers.
3rd quartile
650
median
1st quartile
600
FiveYR
20000
15000
whisker
The height of the box is the IQR.
10000
Internatl
Aggressive Balanced Equity Growth Growth
& Global Small
Growth
Income
Company
& Income
For symmetric distributions, median will be halfway between QL
and QU. Otherwise, the distribution is skewed.
The width of the box doesn’t mean anything!
Time Between Eruptions
How outliers are labeled:
100
Note: Boxplots can
hide bimodality.
“Highly suspect outliers”, labeled with a circle, are those
more than 3*IQR above QU or below QL.
Eg: Old Faithful
eruptions.
90
Time Between Eruptions
“Suspect outliers”, labeled with an asterisk, are those more than
1.5*IQR above QU or below QL.
80
70
60
50
40
Time Between Eruptions,
Separated by Eruption Duration
highly suspect outlier
* suspect outlier
1st pt > Q3 + 1.5 IQR
30
< 3 min
> 3 min
3rd quartile
650
20
median
1st quartile
600
Frequency
GMAT
700
10
1st pt < Q1 - 1.5 IQR
* suspect outlier
0
highly suspect outlier
40
50
60
70
80
Time Between Eruptions
Distribution Shape
A distribution may be symmetric or skewed, it may be unimodal,
bimodal or multi-modal, it may be long-tailed (lots of outliers) or
short-tailed (almost no outliers). Histograms and boxplots help us to
see the distribution shape.
1) Symmetrical: Roughly equal tails.
Eg: Bell-Shaped Distribution.
2) Positively Skewed (skewed to the right): Longer tail on right.
Eg: Income Distributions.
3) Negatively Skewed (skewed to the left): Longer tail on left.
Eg: Scores on an easy exam.
90
100
• For nearly symmetrical distributions,
mean ≈ median, (Also, Median is about halfway between
25th and 75th percentiles)
•For positively skewed distributions,
mean > median (Also, Median is closer to the 25th
percentile than to 75th)
Eg: A sample of student heights.
Eg: Salaries of the 1998 Baseball players.
Descriptive Statistics
Descriptive Statistics
Variable: height
Variable: Salary
Anderson-Darling Normality Test
Anderson-Darling Normality Test
A-Squared:
P-Value:
61
63
65
67
69
68% Conf idence Interv al f or Mu
0.201
0.874
Mean
StDev
Variance
Skewness
Kurtosis
N
65.2900
1.8994
3.60786
-1.3E-02
-5.4E-01
50
Minimum
1st Quartile
Median
3rd Quartile
Maximum
61.5000
63.9750
65.3000
66.6000
69.5000
A-Squared:
P-Value:
400000 2000000 3600000 5200000 6800000 8400000 10000000
95% Conf idence Interv al f or Mu
68% Conf idence Interv al f or Mu
65.0201
65.0
65.1
65.2
65.3
65.4
65.5
65.6
1.7344
65.0000
1000000
2.1232
65.5014
For negatively skewed distributions,
mean < median (Also, Median is closer to the 75th
percentile than to 25th)
Eg: Class scores on an individual project
Descriptive Statistics
Variable: Project
Anderson-Darling Normality Test
A-Squared:
P-Value:
73
79
85
91
Mean
StDev
Variance
Skewness
Kurtosis
N
97
Minimum
1st Quartile
Median
3rd Quartile
Maximum
95% Conf idence Interv al f or Mu
1.610
0.000
92.6071
6.7404
45.4325
-1.59939
2.41526
28
72.000
91.000
95.000
96.750
100.000
95% Conf idence Interv al f or Mu
89.994
89.5
90.5
91.5
92.5
93.5
94.5
95.5
96.5
95.221
95% Conf idence Interv al f or Sigma
5.329
9.175
95% Conf idence Interv al f or Median
95% Conf idence Interv al f or Median
92.448
96.000
1447690
1865460
3.48E+12
1.85548
3.16118
837
Minimum
1st Quartile
Median
3rd Quartile
Maximum
170000
210000
500000
2000000
10000000
1321129
500000
1500000
1574251
95% Conf idence Interv al f or Sigma
1780175
68% Conf idence Interv al f or Median
68% Conf idence Interv al f or Median
Mean
StDev
Variance
Skewness
Kurtosis
N
95% Conf idence Interv al f or Mu
65.5599
68% Conf idence Interv al f or Sigma
86.167
0.000
1959393
95% Conf idence Interv al f or Median
95% Conf idence Interv al f or Median
450000
633408