BMED2803 Fall07, BV Sample and Its Properties I Sample and Its Properties Accordingly to the fittest style of lofty, mean, or lowly. - Milton This handout is subject to constant change (added explanations, typo fixing, added exercises and examples). 1 Annotated MATLAB session on univariate descriptive statistics: Cell Area Data The file http://zoe.bme.gatech.edu/~bv20/bmestat/Bank/cellarea.dat contains measurements from the Lab of Todd McDevitt, BME Gatech, http://www.bme.gatech.edu/groups/mcdevitt/. In the same directory you can find MATLAB native mat file, cellarea.mat. The measurements are areas of grown cell [in µm2 ] after time t = 2 under the static condition. The experiment on cell growth involved several times t and two motion conditions. Here is a brief description: Embryonic stem cells (ESCs) have the ability to differentiate to all somatic cells types, making ESCs useful for studying developmental biology, in vitro drug screening and as a cell source for regenerative medicine and cell-based therapies. A common method to induce differentiation of ESCs is through the formation of multicellular spheroids termed embryoid bodies (EBs). ESCs spontaneously aggregate into EBs when cultured on a non-adherent substrate; however, under static conditions, this aggregation is uncontrolled and EBs form in various sizes and shapes which may lead to variability in cell differentiation patterns. When rotary motion is applied during EB formation, the resulting population of EBs appears more uniform in size and shape. In order to elucidate the differences between static and rotary culture, EBs were formed under both conditions with equal starting cell densities. After 2, 4 and 7 days of culture, images of EBs were acquired using phase-contrast microscopy. Image analysis software (Image J) was used to determine the area and circularity (defined as 4 π[Area/Perimeter2 ]) of each EB imaged. At least 100 EBs were analyzed from three separate plates for both static and rotary cultures at the three time points studied. The area and circularity data was used to examine differences in the size and shape of EBs formed under the two conditions as well as differences in the variability of both size and shape. >> load(’C:\Courses\bmestatu\cellarea.dat’); 1 It the file is on MATLAB path on your local computer, then >> load ’cellarea.dat’ would suffice. A glimpse in the data is provided by histogram command, hist: hist(cellarea, 100) and it seems that there is one quite unusual observation, placed way out from the other experimental measurements. 300 250 200 150 100 50 0 0 2 4 6 8 10 12 14 5 x 10 Figure 1: Histogram of raw data. Notice the outlier beyond the point 12 × 105 . We find minimum, mean, and maximum of data set and compare the extreme observation to those values. More formal rules for detecting outliers will be discussed later. >> min(cellarea) >> mean(cellarea) >> max(cellarea) We conclude that this maximal element looks like outlier and omit it from the data set, >> aa =cellarea(cellarea ~= max(cellarea)); We also rescale the data to more moderate values, so that the area is expressed in thousands of µm2 . 2 >> aa = aa/1000; >> n = length(aa); %nn is sample size First we find all location measures of our sample. The geometric mean (geomean) is 1/n (X1 × X2 × · · · × Xn ) = Ã n Y !1/n Xi , i=1 and the harmonic mean (harmmean) is n n = Pn . 1/X1 + 1/X2 + · · · 1/Xn i=1 1/Xi These two sample averages are not often used in standard statistical analysis. The sample mean (mean), n X ¯ = X1 + · · · + Xn = 1 X Xi , n n i=1 on the other hand is a fundamental summary statistic. For example, for the data set {1, 2, 3} √ the mean is 2, the geometric mean is 6 = 2.4495, and the harmonic mean is 3/(1/1 + 1/2 + 1/3) = 1.6364. The geometric and harmonic means are seldom used in everyday’s statistical practice. To illustrate the cases in which they should been used consider several simple examples. Suppose you have an investment which earns 10% the first year, 50% the second year, and 30% the third year. What is its average rate of return? It is not the arithmetic mean, because what these numbers signify is that on the first year your investment was multiplied (not added to) by 1.10, on the second year it was multiplied by 1.50, and the third year it was multiplied by 1.30. The relevant quantity is the geometric mean of these three numbers, which is about 1.28966 or about 29% annual interest. If, for example, you are averaging ratios (i.e. ratio = new method/old method) over many experiments, geometric mean should be used. This becomes evident when considering the two extremes. If one experiment yields a ratio of 10,000 and the next yields a ratio of 0.0001, an arithmetic mean would misleadingly report that the average ratio was near 5000. Taking a geometric mean will more honestly represent the fact that the average ratio was 1. Consider now two examples when the harmonic mean should be used. If for half the DISTANCE of a trip you travel at 40 miles per hour and for the other half of the distance you travel at 60 miles per hour, then your average speed for the trip is given by the harmonic mean of 40 and 60, which is 48; that is, the total amount of time for the trip is the same as if you traveled the entire trip at 48 miles per hour. Note however that if you had traveled for half the TIME at one speed and the other half at another, the arithmetic mean, in this case 50 miles per hour, would provide the correct notion of average. In finance, the harmonic mean is used to calculate the average cost of shares purchased over a period of time. For example, an investor purchases $1000 worth of stock every month for three months. If the 3 spot prices at execution time are $8, $9, and $10, then the average price the investor paid is $8.926 per share. However, if the investor purchased 1000 shares per month, the arithmetic mean would be used. Median. The median1 is the middle of the sorted sample. If the sample X1 , . . . , Xn is ordered as X(1) ≤ X(2) ≤ · · · ≤ X(n) so that X(1) is the minimum and X(n) is the maximum, the median is ( Me = X((n+1)/2) , (X(n/2) + X(n/2+1) )/2, if n is odd if n is even In simple terms, if the sample size is odd, there is a single observation in the middle of the ordered sample, at the position (n + 1)/2, while for the even sample sizes the ordered sample has two elements in the middle, at positions n/2 and n/2 + 1. For instance, in data set {−1, 0, 4, 7, 20} the median is 4, and in the data set {−1, 0, 4, 7, 20, 100} the median is (4 + 7)/2 = 5.5 Mode. The most frequent (fashionable2 ) observation in the sample (if such exists) is the mode of the sample. If the sample is composite then the observation xi corresponding to the largest frequency fi is the mode. Mode may not be unique. If there are two modes, the sample is bimodal, three modes - trimodal, etc. Trimmed mean. The mean is a location measure sensitive to extreme observations and possible outliers. To make this measure more robust, one may trim 100 ∗ p% of the data symmetrically from both sides of ordered sample (trim 100p/2% smallest and 100p/2% largest observations). If your sample, for instance, is {1, 2, 3, 4, 5, 6, 7, 8, 9, 100} then 20% trimmed mean is the mean of {2, 3, 4, 5, 6, 7, 8, 9}. Here is the command in MATLAB that determines the discussed locations for the cells data. >> location = [geomean(aa) harmmean(aa) mean(aa) ... median(aa) mode(aa) trimmean(aa,20)] Another robust location measure is winsorized mean, and it is similar to trimmed mean. In the p100% trimmed mean, we end up with the mean from the sample of reduced size ((1 − p)100%), To preserve the sample size, all the observations that would be trimmed are replaced with minimum or maximum of the trimmed sample. and it is not implemented in MATLAB. >> >> >> >> >> 1 2 p=20; sa = sort(aa); sa(1:floor( n*p/200 )) = sa(floor( n*p/200 ) + 1); sa(end-floor( n*p/200 ):end) = sa(end-floor( n*p/200 ) - 1); winsmean = mean(sa) Latin: medianus = middle) mode (fr) = fashion 4 1.1 Variability Measures The variance of the sample, i.e., sample variance is defined as s2 = n 1 X ¯ 2, (Xi − X) n − 1 i=1 1 is taken for good properties of this statistics which will be discussed later. In where n−1 MATLAB, the sample variance of data vector x is var(x) or var(x, 0) Flag 0 in the argument list means that the ratio 1/(n − 1) is used in calculating the sample variance, if the flag is 1, i.e., then var(x, 1) means, s2∗ = n 1X ¯ 2, (Xi − X) n i=1 which is sometimes used instead of s2 . We will see later that s2∗ is an optimal estimator in the Maximum Likelihood context. The square root of sample variance is sample standard deviation, v u u s=t n 1 X ¯ 2. (Xi − X) n − 1 i=1 In MATLAB the standard deviation can be calculated by std(x)=std(x,0) or std(x,1), depending if we divided the sum of squares by n − 1 or by n. %Variability Measures >> var(aa) % standard sample variance, also var(aa,0) >> var(aa,1) % sample variance with sum of squares >> % divided by n >> std(aa) % sample standard deviation, sum of squares >> % divided by (n-1), also std(aa,0) >> std(aa,1) % sample standard deviation, sum of squares >> % divided by n >> sqrt(var(aa)) %should be equal to std(aa) >> sqrt(var(aa,1)) %should be equal to std(aa,1) Another estimator of variability in the sample is MAD estimator. The MAD stands for either Mean Absolute Difference from the mean or Median Absolute Difference from the Median, and this is more common interpretation. MATLAB uses mad(aa), mad(a,0) for the first and mad(aa,1) for the second. M AD0 = 1/n n X ¯ M AD1 = 1/n median{|Xi − median{Xi }|}. |Xi − X|, i=1 Often used convention is to divide either of MAD estimators with 0.6745, to make the estimator comparable with the sample standard deviation. 5 >> >> >> >> >> mad(aa) % mean absolute deviation from the mean; % MAD is usually conotating to % median absolute deviation from the median realmad = 1/0.6745 * median( abs(aa - median(aa))) % real mad in MATLAB is 1/0.6745 * mad(aa,1) Two simple measures of variability, or better, the spread of the samnple are the range and interquartile range,range and iqr. We will define iqr in a moment. >> range(aa) %Range, span of data, Max - Min >> iqr(aa) % inter-quartile range, Q3-Q1 Sample quantiles or sample percentiles are very important statistics revealing both location and spread of the sample. For example, we may be interested in a point xp that divides the ordered sample in two parts, one part with p × 100 percent of observations smaller than xp and part two with (1 − p) × 100% observations greater than xp . In MATLAB if we are interested in 5, 10, 25, 50, 75, 90, and 95 percentiles of our sample, we use command prctile. >> % For 5%, 10%, 25%, 50%, 75%, 90%, 95% percentiles >> % are: >> prctile(aa, 100*[0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95] ) Some percentiles/quantiles are special. We already mentioned the median of the sample which is 50th percentile, 25th percentile is known as the first quartile, Q1 and 75th percentile is known as the third quartile, Q3 . In MATLAB, Q1 = prctile(aa, 25); Q3 = prctile(aa, 75). Now we can define interquartile range iqr as Q3 − Q1 , >> prctile(aa, 75)- prctile(aa, 25) %should be equal to iqr(aa). Sample moments draw their name from mechanical interpretation. If the observations are interpreted as unit masses with positions X1 , . . . , Xn then the sample mean is the first moment – it represents the balance-point for the system of all points. The moments of higher order have analogous mechanical interpretation. The formula for k-th moment is n 1 k 1X k mk = (X1 + . . . + Xn ) = Xik . n n i=1 The moments mk are sometimes called raw sample moments. As we mentioned, the sample ¯ The central moments of order k are defined as, mean is the first moment, m1 = X. µk = n 1X (Xi − m1 )k . n i=1 Notice that µ1 = 0, and that µ2 is the sample variance (calculated by var( ., 1) where the sum of squares is divided by n) 6 >> >> >> >> >> % Moments of Higher Orders %kth (row) moment: mean(aa.^k) mean(aa.^3) % third moment %kth central moment mean( (aa-mean(aa)).^k ) mean( (aa-mean(aa)).^3 ) There are many uses of higher moments in describing the sample. Two important sample measures are skewness and kurtosis. The skewness is defined as sk(x) = µ3 /s3∗ , and the kurtosis as kur(x) = µ4 /s4∗ . >> >> >> >> >> >> % sample skewness mean(aa.^3)/std(aa,1)^3 mean( (aa-mean(aa)).^3 )/std(aa,1)^3 skewness(aa) % sample kurtosis mean( (aa-mean(aa)).^4 )/std(aa,1)^4 kurtosis(aa) Coefficient of variation is the quantity s CV = ¯ 100% X and expresses the variability in units of the mean. The assumption is that the mean is positive. >> % sample CV [coefficient of variation] >> std(aa)/mean(aa) 1.2 Displaying Data The following are graphical summaries of the sample. The first is box-plot. It consists of a box determined by quartiles Q1 and Q3 with the bar in it representing the median. The “whiskers” connect box and the maximum and minimum in sample, respectively. >> %----------- Some Graphical Summaries of the Sample >> figure(1) >> boxplot(aa) 7 90 200 80 70 Values 150 60 50 100 40 30 50 20 10 0 1 Column Number 0 0 (a) 50 100 (b) 150 200 250 Figure 2: (a) Box plot; (b) Histogram >> figure(2) >> hist(aa, 80) >> >> >> >> figure(3) [f,x] = ksdensity(aa); plot(x,f); title(’Density estimate for Cell Areas’) >> >> >> >> figure(4) [f,xf] = ecdf(aa); stairs(xf,f) title(’Empirical Cumulative Distribution Function (CDF) for Cell Areas’) 2 Less Annotated MATLAB session on univariate descriptive statistics: Cell Circularity Data From Todd McDevitt’s Lab: Cell Circularity Data 8 0.05 1 0.045 0.9 0.04 0.8 0.035 0.7 0.03 0.6 0.025 0.5 0.02 0.4 0.015 0.3 0.01 0.2 0.005 0.1 0 −50 0 50 100 (a) 150 200 0 0 250 50 100 (b) 150 200 250 Figure 3: (a) Smoothed histogram (density estimator); (b) Empirical CDF In order to elucidate the differences between static and rotary culture, embrionic bodies (EBs) were formed under both conditions with equal starting cell densities. After 2, 4 and 7 days of culture, images of EBs were acquired using phase-contrast microscopy. Image analysis software was used to determine the circularity (defined as 4 pi [Area/Perimeter2]) of each EB imaged. At least 100 EBs were analyzed from three separate plates for both static and rotary cultures at the three time points studied. The area and circularity data was used to examine differences in the size and shape of EBs formed under the two conditions as well as differences in the variability of both size and shape. The file can be downloaded from http://zoe.bme.gatech.edu/~bv20/bmestat/Bank/ as circ.dat or circ.mat. The size of data set is 325 × 6, 325 rows corresponding to 325 different cells, and 6 columns corresponding to circular-2,4,7days and static-2,4,7 days. Cells in different columns are different (and assumed independent; this is not a repeated measures design). In the analysis below only 2nd and 4th column (rotation/2days and static/2 days) are used. close all clear all % preamble disp(’Univariate Data: Circularity of Cells from the Lab of Todd McDevitt’) lw = 3; set(0, ’DefaultAxesFontSize’, 16); fs = 15; msize = 10; load circ.mat 9 circc = 1 - circ; %deviation from perfect circularity (perfect circularity % is 1 for a circle) % size(circc) % r2 = circc(:,1); %rotation treatment, 2 days s2 = circc(:,4); %static treatment, 2 days %--------% measures of location %--------meanr2 = mean(r2) means2 = mean(s2) gmeanr2 = geomean(r2) gmeans2 = geomean(s2) hmeanr2 = harmmean(r2) med = median(r2) % trimmean(r2, 100) gives error, but tmean1 = trimmean(r2, 99.99) % gives the median tmean2 = trimmean(r2, 0) %gives the ordinary mean -- no trimming mean(r2) %check [mo fre] = mode(r2) %-----------------% measures of variation %-----------------stdr2 = std(r2) %by default flag=0, i.e., std(r2,0) srd1r2 = std(r2, 1) %flag 1 stands for division by n format long std(r2) std(r2, 1) format short varr2 = var(r2, 1) ranr2 = range(r2) mad0 = mad(r2) mad1 = mad(r2,1) realmad = 1/0.6745 * mad(r2,1) % compare to std(r2) % missing values NaN r2nan=r2; r2nan([2 32 200 321])= NaN; m1 = mean(r2nan) m2 = nanmean(r2nan) nmax = nanmax(r2nan) nsum = nansum(r2nan) 10 %------------------qts = quantile(r2,[0.1 0.25 0.5 0.75 0.9]) prts = prctile(r2,[10 25 50 75 90]) % figure(1) % boxplot of r2 and s2 boxplot([r2 s2],’notch’,’on’,’labels’,{’rotation 2 days’,’static 2 days’}) figure(2) % histogram with 30 bins hist(r2, 30) 0.9 0.8 0.7 Values 0.6 0.5 0.4 0.3 0.2 0.1 rotation 2 days static 2 days Figure 4: Boxplots of Rotation/2 days and Static/2 days. figure(3) [f,x,u] = ksdensity(r2); plot(x,f) title(’Density estimate for r2’) hold on [f,x] = ksdensity(r2,’width’,u/4); plot(x,f,’r’); 11 50 Density estimate for r2 35 default width 1/4 default 4*default 45 30 40 35 25 30 20 25 15 20 15 10 10 5 5 0 0.05 0.1 0.15 (a) 0.2 0.25 0 0 0.3 0.05 0.1 0.15 (b) 0.2 0.25 0.3 0.35 Figure 5: (a) Histogram (preliminary density estimator) of circularity defects for Rotation/2 days; (b) Kernel density estimators for Rotation/2 days. Note how the kernel width affects the smoothness of the estimator. [f,x] = ksdensity(r2,’width’,u*4); plot(x,f,’g’); legend(’default width’,’1/4 default’,’4*default’) hold off 3 Examples and Exercises A need for randomness. In 1936 the Literary Digest took the largest poll ever taken: they had 2.4M respondents to their opinion poll about the Franklin Roosevelt/Alf Landon election that year, of which about 57% indicated a preference for Landon. With such a large sample size the sampling error becomes very small, 0.0006. That means, the preference for Landon in the population of US voters is predicted to be 57 ± 0.06%. In fact 38% voted for Landon. What went wrong? The Literary Digest failed to choose a random sample. Instead, they used a “sample of convenience” consisting of those who responded to their questionnaire from among the telephone and magazine subscribers whose addresses they could find. Randomizing is not just something that statisticians like to complain about - it is something that, if ignored, can lead to wrong answers to scientific questions. Survey. Ann Landers once asked her readers, “If you had it to do over again, would you 12 have children?” She received nearly 10,000 responses, of which 70% said “No.” a. Why is this not a representative sample of all American parents? b. For the population of American parents, is the true proportion of “No” responses likely to be higher or lower than 70%? On Average. • An average Australian has less that two legs! True, there are some Australians that lost their leg, therefore the number of legs is less than twice the number of people. • Small company salaries: 4 employees 20K, 3 employees 30K, vice-president 200K, president 400K. Averages: Mean=85.5K, GeoMean=41.2K, Median = 30K, HarMean=29.3K, Mode=20K. Mushrooms. The unhappy outcome of uninformed mushroom-picking is poisoning. In many cases such poisoning is due to ignorance or a superficial approach to identification. The most dangerous fungi are Death Cap (Amanita Phaloides) and two species akin to it Amanita Verna and Destroying Angel (Amanita Virosa). These three toadstools cause the majority of fatal poisoning. One of the keys for mushroom identification is the spore deposit. Spores of Amanita Phaloides are colorless, almost spherical, and smooth. Measurements in mµ of 28 spores are given below: 9.2 8.5 9.7 8.0 8.3 9.0 7.9 8.8 8.4 9.9 9.5 9.0 8.7 8.6 9.1 9.3 8.4 8.8 8.2 9.1 9.0 10.1 8.7 8.6 8.1 8.6 9.2 9.1 Find the five number summary (M in, Q1 , M e, Q3 , M ax) for the spore measurements data. Find the mean and the mode. ¯ Find the z-scores, zi = (Xi − X)/s. >> amanita =[9.2, 8.5, 9.7, 8.0, 8.3, 9.0, 7.9, 8.8, 8.4, 9.9, 9.5, 9.0, 8.7, 8.6, 9.1, 10.1,... 9.3, 8.7,... 8.4, 8.6,... 8.8, 8.1,... 8.2, 8.6,... 9.1, 9.2,... 9.0, 9.1,... 13 9.2, 8.8, 9.1, 10.1]; >> zis = zscore(amanita); >> hist(zis,15) Favorite. Known: y¯ = 6 and sy = 3.32, and n = 11. Observation y11 = 7 removed. Find y¯new and sy(new) . Solution: y¯new = 11·6−7 = 59 = 5.9. 10 10 1 P11 2 sy(old) = 11.0224 = 10 ( i=1 yi2 − 11 · 62 ). P11 2 Thus, i=1 yi = 506.224; P10 2 2 i=1 yi = 506.224 − 7 = 457.224. s2y(new) = 19 (457.224 − 10 · 5.92 ) = 12.12489. sy(new) = 3.48. CV Examples. (1) It might be nice, for example, to determine whether UK voters (whose parties have somewhat more distinct policy positions than in the US) have as a result wider variation in their evaluations of the parties than in the US. Problem is the British election survey takes evaluations scored 0-10, while the US National Election Survey gets evaluations scored 0-100. Using CV would allow easy comparisons of the amount of variation without worrying about the different scales. (2) You and I both teach sections of the same class, but we make up our own finals. To compare the effectiveness of our final exam designs at creating maximum variance in exam scores (an oft-unmentioned goal of exam design) we can compare CV, which is not related to how many points we have on the exam, to the mean performance of our students, or to the mean difficulty of the exam. It might be related to the homogeneity of our classes, but to test exams like this we would need CV. You the Stat Gury. Graduate student Rosa Juliusdottir reported results of an experiment to her advisor who wanted to include them in the grant proposal. Before leaving to Reykjavik for a short vacation, she left the following data in advisor’s mailbox: Sample size n = 9, ¯ = 17.11111 and sample variance s2 = 13.86111. sample mean X The advisor noted with horror that the last measurement X9 was wrongly recorded. It should have been 20 instead of 21. ¯ and s2 , but the advisor did not have the previous 8 measureIt would be easy to fix X ments and neither statistics training. Rosa was in Island, and grant proposal due tomorrow. Advisor is desperate, but suddenly you came along... 14