Descriptive Statistics is a body of techniques for summarizing and presenting the essential information in a data set. Eg: Here are daily high temperatures for Jan 16, 2009 in 30 U.S. cities: Albany, 35. Atlanta, 57. Austin, 68. Billings, 43. Boise, 40. Buffalo, 45. Casper, 32. Chattanooga, 52. Cincinnati, 48. Columbia, 72. Concord, 40. Des Moines, 28. El Paso, 55. Hartford, 38. Houston, 67. Jacksonville, 74. Key West, 84. Lexington, 49. Louisville, 50. Miami, 84. Minn.- St. Paul, 30. New York City, 55. Omaha, 31. Philadelphia, 57. Providence, 47. St. Louis, 36. San Francisco, 70. Spokane, 37. Tulsa, 47. Wilmington, 55. A better presentation is the Frequency Distribution: Group the data into intervals called classes, and record the frequency (i.e., the number of observations) for each class. Histograms provide a graphical representation of a frequency distribution. Eg: Daily High Temperatures: Class Frequency 20-29 1 30-39 7 40-49 8 50-59 7 60-69 2 70-79 3 80-89 2 8 7 6 Frequency 4. DESCRIPTIVE STATISTICS 5 4 3 2 1 0 20 30 40 50 Clearly, a long list of numbers is not very informative. Measures of Central Tendency (Location) Needed: An objective, concise summary of a data set. For many purposes, just two numbers will suffice: (1) A measure of central tendency (i.e., the typical value, or location), (2) A measure of dispersion (fluctuation). Here, we discuss measures of central tendency. The two most popular measures are: The Mean; The Median. Of these, the mean is the most important. 60 70 80 90 temp Sample Mean The mean of a sample of n measurements x1,···, xn is n x = 1n ∑ xi = 1n ( x1 + ⋅ ⋅ ⋅ + xn ) i =1 Eg: The first five high temperatures in our data set are 35, 57, 68, 43, 40. The sample size is n=5, and the sample mean is x = 15 (35 + 57 + 68 + 43 + 40) = 15 (243) = 48.6 • x is the average value. It is also the center of gravity (balance point) of the data set. Population Mean The mean of a population of N measurements x1, ···, xN is N μ = 1 ∑ xi = 1 ( x +⋅⋅⋅+ x ) N N i =1 N 1 Eg: If we view our data set of 30 high temperatures as a population, the population mean is 30 μ = 1 ∑ x = 1 (35 + 57 + ⋅⋅⋅+ 55) = 50.87 30 i =1 i 30 • x is a statistic, μ is a parameter. • Often, we would like to know the population mean μ, but our information is limited to a small random sample. Then we can use x as an estimate of μ. Using principles of statistical inference, we can even assess the accuracy of this procedure, and thereby draw conclusions (“make inferences”) about μ. • The Problem With x : It is extremely sensitive to outliers (“extreme observations”, or “wild values”). These outliers may be due to errors in recording the data, or they may be real (but exceptional) observations. In either case, it is usually best to set aside the outliers (to be described separately) before computing x . Alternatively, use the median. Median Given n measurements arranged in order of magnitude, Median = The Middle Value if n is odd Median = The average of the two middle values if n is even. Eg: CEO Base Salary for 2009 for five companies Merrill Lynch 750,000 Morgan Stanley 800,000 Goldman Sachs 600,000 Lehman Brothers 750,000 American Express 1,240,000 Converting to multiples of $1,000 and arranging the data in order gives: 600, 750, 750, 800, 1240. The median base salary is $750,000. The mean base salary is $828,000. The mean is substantially larger than the median due to the outlier, American Express. If we remove the outlier, then x becomes $725,000, while the sample median remains at $750,000. For the 1998 individual baseball salaries, the mean is $1,447,690, while the median is only $500,000. • The median divides a data set into two equal parts. Half of the data lie below the median, and half lie above it. • The median is resistant to outliers. Thus it can be safely used on a raw, unexamined data set. (Of course, it is always best to look at the data; you will usually learn something.) • Although the median is very useful as a descriptive statistic, it is rarely used for statistical inference. Reason: No simple mathematical theory for the median. Mean and Median Website (Build your own data set, examine sample statistics): Simple Measures of Dispersion The mean (or median) cannot completely summarize a data set. Once we know the typical value, the next question is: To what extent do the data fluctuate from their typical value? Eg: Consider the lifetimes of “GE” and “Philips” light bulbs. Both brands are rated for 750 hours (average lifetime). http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html "GE" and "Philips" Lightbulb Lifetimes (in hours) Frequency 20 Range 20 10 10 0 0 500 600 700 800 GE 900 1000 The range is the difference between the largest and smallest values. 540 620 700 780 860 940 Philips Eg: For the baseball salaries, the highest and lowest values were $10,000,000 (for Albert Belle and Gary Sheffield) and $170,000 (for the 17 lowest-paid players), so the range was $10,000,000 − $170,000 = $9,830,000 The GE bulbs exhibit better quality control: performance is consistent, since there is not much variation. The range is a very crude measure, containing no information about the dispersion of the values between the extremes. Philips’ performance is more erratic: There’s more fluctuation, although the average is the same as for GE. It has absolutely no resistance to outliers. (Why?) A resistant measure of dispersion is provided by the interquartile range. IQR = QU − QL = 75th Percentile − 25th Percentile Definition of Quartiles: The first (or lower) quartile QL = 25th percentile = Value such that 25% of the distribution is below it. The second quartile = 50th percentile = Median. The third (or upper) quartile QU = 75th percentile = Value such that 75% of the distribution is below it. Five Year Performance of Mutual Funds (1988-1992) Based on a $2000/yr investment Eg: For the baseball data, the 25th percentile is $210,000, the 75th percentile is $2,000,000, so the IQR is $2,000,000 − $210,000 = $1,790,000 • Using the extremes, quartiles and median, we can draw a boxplot, a graphical summary which reveals basic distributional properties (Center, spread, skewness, outliers), and which is especially useful for comparing several data sets, side-by-side. For each data set, we draw a box extending from QL(bottom) to QU (top). We draw a horizontal line at the median. Then, we draw two vertical “whiskers” from the box to the most extreme non-outlying observations. Any data value beyond the whiskers is declared to be an outlier and flagged with an asterisk or circle. whisker 700 GMAT 25000 • IQR is the width of the “middle half” of the data set. The IQR is resistant to outliers. 3rd quartile 650 median 1st quartile 600 FiveYR 20000 15000 whisker The height of the box is the IQR. 10000 Internatl Aggressive Balanced Equity Growth Growth & Global Small Growth Income Company & Income For symmetric distributions, median will be halfway between QL and QU. Otherwise, the distribution is skewed. The width of the box doesn’t mean anything! Time Between Eruptions How outliers are labeled: 100 Note: Boxplots can hide bimodality. “Highly suspect outliers”, labeled with a circle, are those more than 3*IQR above QU or below QL. Eg: Old Faithful eruptions. 90 Time Between Eruptions “Suspect outliers”, labeled with an asterisk, are those more than 1.5*IQR above QU or below QL. 80 70 60 50 40 Time Between Eruptions, Separated by Eruption Duration highly suspect outlier * suspect outlier 1st pt > Q3 + 1.5 IQR 30 < 3 min > 3 min 3rd quartile 650 20 median 1st quartile 600 Frequency GMAT 700 10 1st pt < Q1 - 1.5 IQR * suspect outlier 0 highly suspect outlier 40 50 60 70 80 Time Between Eruptions Distribution Shape A distribution may be symmetric or skewed, it may be unimodal, bimodal or multi-modal, it may be long-tailed (lots of outliers) or short-tailed (almost no outliers). Histograms and boxplots help us to see the distribution shape. 1) Symmetrical: Roughly equal tails. Eg: Bell-Shaped Distribution. 2) Positively Skewed (skewed to the right): Longer tail on right. Eg: Income Distributions. 3) Negatively Skewed (skewed to the left): Longer tail on left. Eg: Scores on an easy exam. 90 100 • For nearly symmetrical distributions, mean ≈ median, (Also, Median is about halfway between 25th and 75th percentiles) •For positively skewed distributions, mean > median (Also, Median is closer to the 25th percentile than to 75th) Eg: A sample of student heights. Eg: Salaries of the 1998 Baseball players. Descriptive Statistics Descriptive Statistics Variable: height Variable: Salary Anderson-Darling Normality Test Anderson-Darling Normality Test A-Squared: P-Value: 61 63 65 67 69 68% Conf idence Interv al f or Mu 0.201 0.874 Mean StDev Variance Skewness Kurtosis N 65.2900 1.8994 3.60786 -1.3E-02 -5.4E-01 50 Minimum 1st Quartile Median 3rd Quartile Maximum 61.5000 63.9750 65.3000 66.6000 69.5000 A-Squared: P-Value: 400000 2000000 3600000 5200000 6800000 8400000 10000000 95% Conf idence Interv al f or Mu 68% Conf idence Interv al f or Mu 65.0201 65.0 65.1 65.2 65.3 65.4 65.5 65.6 1.7344 65.0000 1000000 2.1232 65.5014 For negatively skewed distributions, mean < median (Also, Median is closer to the 75th percentile than to 25th) Eg: Class scores on an individual project Descriptive Statistics Variable: Project Anderson-Darling Normality Test A-Squared: P-Value: 73 79 85 91 Mean StDev Variance Skewness Kurtosis N 97 Minimum 1st Quartile Median 3rd Quartile Maximum 95% Conf idence Interv al f or Mu 1.610 0.000 92.6071 6.7404 45.4325 -1.59939 2.41526 28 72.000 91.000 95.000 96.750 100.000 95% Conf idence Interv al f or Mu 89.994 89.5 90.5 91.5 92.5 93.5 94.5 95.5 96.5 95.221 95% Conf idence Interv al f or Sigma 5.329 9.175 95% Conf idence Interv al f or Median 95% Conf idence Interv al f or Median 92.448 96.000 1447690 1865460 3.48E+12 1.85548 3.16118 837 Minimum 1st Quartile Median 3rd Quartile Maximum 170000 210000 500000 2000000 10000000 1321129 500000 1500000 1574251 95% Conf idence Interv al f or Sigma 1780175 68% Conf idence Interv al f or Median 68% Conf idence Interv al f or Median Mean StDev Variance Skewness Kurtosis N 95% Conf idence Interv al f or Mu 65.5599 68% Conf idence Interv al f or Sigma 86.167 0.000 1959393 95% Conf idence Interv al f or Median 95% Conf idence Interv al f or Median 450000 633408
© Copyright 2024