Sample and Its Properties BMED2803 Fall07, BV Sample and Its Properties I

BMED2803 Fall07, BV
Sample and Its Properties I
Sample and Its Properties
Accordingly to the fittest style of lofty, mean, or lowly. - Milton
This handout is subject to constant change (added explanations, typo fixing, added
exercises and examples).
1
Annotated MATLAB session on univariate descriptive statistics: Cell Area Data
The file http://zoe.bme.gatech.edu/~bv20/bmestat/Bank/cellarea.dat contains measurements from the Lab of Todd McDevitt, BME Gatech,
http://www.bme.gatech.edu/groups/mcdevitt/. In the same directory you can find
MATLAB native mat file, cellarea.mat.
The measurements are areas of grown cell [in µm2 ] after time t = 2 under the static
condition. The experiment on cell growth involved several times t and two motion conditions.
Here is a brief description:
Embryonic stem cells (ESCs) have the ability to differentiate to all somatic cells
types, making ESCs useful for studying developmental biology, in vitro drug
screening and as a cell source for regenerative medicine and cell-based therapies.
A common method to induce differentiation of ESCs is through the formation
of multicellular spheroids termed embryoid bodies (EBs). ESCs spontaneously
aggregate into EBs when cultured on a non-adherent substrate; however, under
static conditions, this aggregation is uncontrolled and EBs form in various sizes
and shapes which may lead to variability in cell differentiation patterns. When
rotary motion is applied during EB formation, the resulting population of EBs
appears more uniform in size and shape.
In order to elucidate the differences between static and rotary culture, EBs were
formed under both conditions with equal starting cell densities. After 2, 4 and
7 days of culture, images of EBs were acquired using phase-contrast microscopy.
Image analysis software (Image J) was used to determine the area and circularity
(defined as 4 π[Area/Perimeter2 ]) of each EB imaged. At least 100 EBs were
analyzed from three separate plates for both static and rotary cultures at the
three time points studied. The area and circularity data was used to examine
differences in the size and shape of EBs formed under the two conditions as well
as differences in the variability of both size and shape.
>> load(’C:\Courses\bmestatu\cellarea.dat’);
1
It the file is on MATLAB path on your local computer, then >> load ’cellarea.dat’
would suffice.
A glimpse in the data is provided by histogram command, hist:
hist(cellarea, 100)
and it seems that there is one quite unusual observation, placed way out from the other
experimental measurements.
300
250
200
150
100
50
0
0
2
4
6
8
10
12
14
5
x 10
Figure 1: Histogram of raw data. Notice the outlier beyond the point 12 × 105 .
We find minimum, mean, and maximum of data set and compare the extreme observation
to those values. More formal rules for detecting outliers will be discussed later.
>> min(cellarea)
>> mean(cellarea)
>> max(cellarea)
We conclude that this maximal element looks like outlier and omit it from the data set,
>> aa =cellarea(cellarea ~= max(cellarea));
We also rescale the data to more moderate values, so that the area is expressed in thousands
of µm2 .
2
>> aa = aa/1000;
>> n = length(aa); %nn is sample size
First we find all location measures of our sample. The geometric mean (geomean) is
1/n
(X1 × X2 × · · · × Xn )
=
à n
Y
!1/n
Xi
,
i=1
and the harmonic mean (harmmean) is
n
n
= Pn
.
1/X1 + 1/X2 + · · · 1/Xn
i=1 1/Xi
These two sample averages are not often used in standard statistical analysis. The sample
mean (mean),
n
X
¯ = X1 + · · · + Xn = 1
X
Xi ,
n
n i=1
on the other hand is a fundamental summary
statistic. For example, for the data set {1, 2, 3}
√
the mean is 2, the geometric mean is 6 = 2.4495, and the harmonic mean is 3/(1/1 + 1/2
+ 1/3) = 1.6364.
The geometric and harmonic means are seldom used in everyday’s statistical practice. To
illustrate the cases in which they should been used consider several simple examples. Suppose
you have an investment which earns 10% the first year, 50% the second year, and 30% the
third year. What is its average rate of return? It is not the arithmetic mean, because what
these numbers signify is that on the first year your investment was multiplied (not added to)
by 1.10, on the second year it was multiplied by 1.50, and the third year it was multiplied
by 1.30. The relevant quantity is the geometric mean of these three numbers, which is about
1.28966 or about 29% annual interest. If, for example, you are averaging ratios (i.e. ratio
= new method/old method) over many experiments, geometric mean should be used. This
becomes evident when considering the two extremes. If one experiment yields a ratio of
10,000 and the next yields a ratio of 0.0001, an arithmetic mean would misleadingly report
that the average ratio was near 5000. Taking a geometric mean will more honestly represent
the fact that the average ratio was 1.
Consider now two examples when the harmonic mean should be used. If for half the
DISTANCE of a trip you travel at 40 miles per hour and for the other half of the distance
you travel at 60 miles per hour, then your average speed for the trip is given by the harmonic
mean of 40 and 60, which is 48; that is, the total amount of time for the trip is the same as if
you traveled the entire trip at 48 miles per hour. Note however that if you had traveled for
half the TIME at one speed and the other half at another, the arithmetic mean, in this case
50 miles per hour, would provide the correct notion of average. In finance, the harmonic
mean is used to calculate the average cost of shares purchased over a period of time. For
example, an investor purchases $1000 worth of stock every month for three months. If the
3
spot prices at execution time are $8, $9, and $10, then the average price the investor paid is
$8.926 per share. However, if the investor purchased 1000 shares per month, the arithmetic
mean would be used.
Median. The median1 is the middle of the sorted sample. If the sample X1 , . . . , Xn is
ordered as X(1) ≤ X(2) ≤ · · · ≤ X(n) so that X(1) is the minimum and X(n) is the maximum,
the median is
(
Me =
X((n+1)/2) ,
(X(n/2) + X(n/2+1) )/2,
if n is odd
if n is even
In simple terms, if the sample size is odd, there is a single observation in the middle of
the ordered sample, at the position (n + 1)/2, while for the even sample sizes the ordered
sample has two elements in the middle, at positions n/2 and n/2 + 1. For instance, in data
set {−1, 0, 4, 7, 20} the median is 4, and in the data set {−1, 0, 4, 7, 20, 100} the median is
(4 + 7)/2 = 5.5
Mode. The most frequent (fashionable2 ) observation in the sample (if such exists) is
the mode of the sample. If the sample is composite then the observation xi corresponding
to the largest frequency fi is the mode. Mode may not be unique. If there are two modes,
the sample is bimodal, three modes - trimodal, etc.
Trimmed mean. The mean is a location measure sensitive to extreme observations
and possible outliers. To make this measure more robust, one may trim 100 ∗ p% of the
data symmetrically from both sides of ordered sample (trim 100p/2% smallest and 100p/2%
largest observations). If your sample, for instance, is {1, 2, 3, 4, 5, 6, 7, 8, 9, 100} then 20%
trimmed mean is the mean of {2, 3, 4, 5, 6, 7, 8, 9}.
Here is the command in MATLAB that determines the discussed locations for the cells
data.
>> location = [geomean(aa) harmmean(aa) mean(aa) ...
median(aa) mode(aa) trimmean(aa,20)]
Another robust location measure is winsorized mean, and it is similar to trimmed mean.
In the p100% trimmed mean, we end up with the mean from the sample of reduced size
((1 − p)100%), To preserve the sample size, all the observations that would be trimmed are
replaced with minimum or maximum of the trimmed sample.
and it is not implemented in MATLAB.
>>
>>
>>
>>
>>
1
2
p=20;
sa = sort(aa);
sa(1:floor( n*p/200 )) = sa(floor( n*p/200 ) + 1);
sa(end-floor( n*p/200 ):end) = sa(end-floor( n*p/200 ) - 1);
winsmean = mean(sa)
Latin: medianus = middle)
mode (fr) = fashion
4
1.1
Variability Measures
The variance of the sample, i.e., sample variance is defined as
s2 =
n
1 X
¯ 2,
(Xi − X)
n − 1 i=1
1
is taken for good properties of this statistics which will be discussed later. In
where n−1
MATLAB, the sample variance of data vector x is var(x) or var(x, 0) Flag 0 in the
argument list means that the ratio 1/(n − 1) is used in calculating the sample variance, if
the flag is 1, i.e., then var(x, 1) means,
s2∗ =
n
1X
¯ 2,
(Xi − X)
n i=1
which is sometimes used instead of s2 . We will see later that s2∗ is an optimal estimator in
the Maximum Likelihood context. The square root of sample variance is sample standard
deviation,
v
u
u
s=t
n
1 X
¯ 2.
(Xi − X)
n − 1 i=1
In MATLAB the standard deviation can be calculated by std(x)=std(x,0) or std(x,1),
depending if we divided the sum of squares by n − 1 or by n.
%Variability Measures
>> var(aa)
% standard sample variance, also var(aa,0)
>> var(aa,1) % sample variance with sum of squares
>>
% divided by n
>> std(aa)
% sample standard deviation, sum of squares
>>
% divided by (n-1), also std(aa,0)
>> std(aa,1) % sample standard deviation, sum of squares
>>
% divided by n
>> sqrt(var(aa)) %should be equal to std(aa)
>> sqrt(var(aa,1)) %should be equal to std(aa,1)
Another estimator of variability in the sample is MAD estimator. The MAD stands for
either Mean Absolute Difference from the mean or Median Absolute Difference from the
Median, and this is more common interpretation. MATLAB uses mad(aa), mad(a,0) for
the first and mad(aa,1) for the second.
M AD0 = 1/n
n
X
¯ M AD1 = 1/n median{|Xi − median{Xi }|}.
|Xi − X|,
i=1
Often used convention is to divide either of MAD estimators with 0.6745, to make the
estimator comparable with the sample standard deviation.
5
>>
>>
>>
>>
>>
mad(aa) % mean absolute deviation from the mean;
% MAD is usually conotating to
% median absolute deviation from the median
realmad = 1/0.6745 * median( abs(aa - median(aa)))
% real mad in MATLAB is 1/0.6745 * mad(aa,1)
Two simple measures of variability, or better, the spread of the samnple are the range and
interquartile range,range and iqr. We will define iqr in a moment.
>> range(aa) %Range, span of data, Max - Min
>> iqr(aa) % inter-quartile range, Q3-Q1
Sample quantiles or sample percentiles are very important statistics revealing both location
and spread of the sample. For example, we may be interested in a point xp that divides
the ordered sample in two parts, one part with p × 100 percent of observations smaller
than xp and part two with (1 − p) × 100% observations greater than xp . In MATLAB if we
are interested in 5, 10, 25, 50, 75, 90, and 95 percentiles of our sample, we use command
prctile.
>> % For 5%, 10%, 25%, 50%, 75%, 90%, 95% percentiles
>> % are:
>> prctile(aa, 100*[0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95] )
Some percentiles/quantiles are special. We already mentioned the median of the sample
which is 50th percentile, 25th percentile is known as the first quartile, Q1 and 75th percentile is known as the third quartile, Q3 . In MATLAB, Q1 = prctile(aa, 25); Q3 =
prctile(aa, 75). Now we can define interquartile range iqr as Q3 − Q1 ,
>> prctile(aa, 75)- prctile(aa, 25) %should be equal to iqr(aa).
Sample moments draw their name from mechanical interpretation. If the observations are
interpreted as unit masses with positions X1 , . . . , Xn then the sample mean is the first
moment – it represents the balance-point for the system of all points. The moments of
higher order have analogous mechanical interpretation. The formula for k-th moment is
n
1 k
1X
k
mk = (X1 + . . . + Xn ) =
Xik .
n
n i=1
The moments mk are sometimes called raw sample moments. As we mentioned, the sample
¯ The central moments of order k are defined as,
mean is the first moment, m1 = X.
µk =
n
1X
(Xi − m1 )k .
n i=1
Notice that µ1 = 0, and that µ2 is the sample variance (calculated by var( ., 1) where the
sum of squares is divided by n)
6
>>
>>
>>
>>
>>
% Moments of Higher Orders
%kth (row) moment: mean(aa.^k)
mean(aa.^3) % third moment
%kth central moment mean( (aa-mean(aa)).^k )
mean( (aa-mean(aa)).^3 )
There are many uses of higher moments in describing the sample. Two important sample
measures are skewness and kurtosis. The skewness is defined as
sk(x) = µ3 /s3∗ ,
and the kurtosis as
kur(x) = µ4 /s4∗ .
>>
>>
>>
>>
>>
>>
% sample skewness mean(aa.^3)/std(aa,1)^3
mean( (aa-mean(aa)).^3 )/std(aa,1)^3
skewness(aa)
% sample kurtosis
mean( (aa-mean(aa)).^4 )/std(aa,1)^4
kurtosis(aa)
Coefficient of variation is the quantity
s
CV = ¯ 100%
X
and expresses the variability in units of the mean. The assumption is that the mean is
positive.
>> % sample CV [coefficient of variation]
>> std(aa)/mean(aa)
1.2
Displaying Data
The following are graphical summaries of the sample. The first is box-plot. It consists of
a box determined by quartiles Q1 and Q3 with the bar in it representing the median. The
“whiskers” connect box and the maximum and minimum in sample, respectively.
>> %----------- Some Graphical Summaries of the Sample
>> figure(1)
>> boxplot(aa)
7
90
200
80
70
Values
150
60
50
100
40
30
50
20
10
0
1
Column Number
0
0
(a)
50
100
(b)
150
200
250
Figure 2: (a) Box plot; (b) Histogram
>> figure(2)
>> hist(aa, 80)
>>
>>
>>
>>
figure(3)
[f,x] = ksdensity(aa);
plot(x,f);
title(’Density estimate for Cell Areas’)
>>
>>
>>
>>
figure(4)
[f,xf] = ecdf(aa);
stairs(xf,f)
title(’Empirical Cumulative Distribution Function (CDF) for Cell Areas’)
2
Less Annotated MATLAB session on univariate descriptive statistics: Cell Circularity Data
From Todd McDevitt’s Lab: Cell Circularity Data
8
0.05
1
0.045
0.9
0.04
0.8
0.035
0.7
0.03
0.6
0.025
0.5
0.02
0.4
0.015
0.3
0.01
0.2
0.005
0.1
0
−50
0
50
100
(a)
150
200
0
0
250
50
100
(b)
150
200
250
Figure 3: (a) Smoothed histogram (density estimator); (b) Empirical CDF
In order to elucidate the differences between static and rotary culture, embrionic bodies
(EBs) were formed under both conditions with equal starting cell densities. After 2, 4 and
7 days of culture, images of EBs were acquired using phase-contrast microscopy. Image
analysis software was used to determine the circularity (defined as 4 pi [Area/Perimeter2])
of each EB imaged. At least 100 EBs were analyzed from three separate plates for both static
and rotary cultures at the three time points studied. The area and circularity data was used
to examine differences in the size and shape of EBs formed under the two conditions as well
as differences in the variability of both size and shape.
The file can be downloaded from http://zoe.bme.gatech.edu/~bv20/bmestat/Bank/
as circ.dat or circ.mat. The size of data set is 325 × 6, 325 rows corresponding to 325
different cells, and 6 columns corresponding to circular-2,4,7days and static-2,4,7 days. Cells
in different columns are different (and assumed independent; this is not a repeated measures
design). In the analysis below only 2nd and 4th column (rotation/2days and static/2 days)
are used.
close all
clear all
% preamble
disp(’Univariate Data: Circularity of Cells from the Lab of Todd McDevitt’)
lw = 3;
set(0, ’DefaultAxesFontSize’, 16);
fs = 15;
msize = 10;
load circ.mat
9
circc = 1 - circ; %deviation from perfect circularity (perfect circularity
%
is 1 for a circle)
%
size(circc)
%
r2 = circc(:,1); %rotation treatment, 2 days
s2 = circc(:,4); %static treatment, 2 days
%--------% measures of location
%--------meanr2 = mean(r2)
means2 = mean(s2)
gmeanr2 = geomean(r2)
gmeans2 = geomean(s2)
hmeanr2 = harmmean(r2)
med = median(r2)
% trimmean(r2, 100) gives error, but
tmean1 = trimmean(r2, 99.99) % gives the median
tmean2 = trimmean(r2, 0) %gives the ordinary mean -- no trimming
mean(r2) %check
[mo fre] = mode(r2)
%-----------------% measures of variation
%-----------------stdr2 = std(r2)
%by default flag=0, i.e., std(r2,0)
srd1r2 = std(r2, 1) %flag 1 stands for division by n
format long
std(r2)
std(r2, 1)
format short
varr2 = var(r2, 1)
ranr2 = range(r2)
mad0 = mad(r2)
mad1 = mad(r2,1)
realmad = 1/0.6745 * mad(r2,1) % compare to std(r2)
% missing values NaN
r2nan=r2;
r2nan([2 32 200 321])= NaN;
m1 = mean(r2nan)
m2 = nanmean(r2nan)
nmax = nanmax(r2nan)
nsum = nansum(r2nan)
10
%------------------qts = quantile(r2,[0.1 0.25 0.5 0.75 0.9])
prts = prctile(r2,[10 25 50 75 90])
%
figure(1)
% boxplot of r2 and s2
boxplot([r2 s2],’notch’,’on’,’labels’,{’rotation 2 days’,’static 2 days’})
figure(2)
% histogram with 30 bins
hist(r2, 30)
0.9
0.8
0.7
Values
0.6
0.5
0.4
0.3
0.2
0.1
rotation 2 days
static 2 days
Figure 4: Boxplots of Rotation/2 days and Static/2 days.
figure(3)
[f,x,u] = ksdensity(r2);
plot(x,f)
title(’Density estimate for r2’)
hold on
[f,x] = ksdensity(r2,’width’,u/4);
plot(x,f,’r’);
11
50
Density estimate for r2
35
default width
1/4 default
4*default
45
30
40
35
25
30
20
25
15
20
15
10
10
5
5
0
0.05
0.1
0.15
(a)
0.2
0.25
0
0
0.3
0.05
0.1
0.15
(b)
0.2
0.25
0.3
0.35
Figure 5: (a) Histogram (preliminary density estimator) of circularity defects for Rotation/2
days; (b) Kernel density estimators for Rotation/2 days. Note how the kernel width affects
the smoothness of the estimator.
[f,x] = ksdensity(r2,’width’,u*4);
plot(x,f,’g’);
legend(’default width’,’1/4 default’,’4*default’)
hold off
3
Examples and Exercises
A need for randomness. In 1936 the Literary Digest took the largest poll ever taken:
they had 2.4M respondents to their opinion poll about the Franklin Roosevelt/Alf Landon
election that year, of which about 57% indicated a preference for Landon. With such a large
sample size the sampling error becomes very small, 0.0006. That means, the preference for
Landon in the population of US voters is predicted to be 57 ± 0.06%. In fact 38% voted
for Landon. What went wrong? The Literary Digest failed to choose a random sample.
Instead, they used a “sample of convenience” consisting of those who responded to their
questionnaire from among the telephone and magazine subscribers whose addresses they
could find. Randomizing is not just something that statisticians like to complain about - it
is something that, if ignored, can lead to wrong answers to scientific questions.
Survey. Ann Landers once asked her readers, “If you had it to do over again, would you
12
have children?” She received nearly 10,000 responses, of which 70% said “No.”
a. Why is this not a representative sample of all American parents?
b. For the population of American parents, is the true proportion of “No” responses likely
to be higher or lower than 70%?
On Average.
• An average Australian has less that two legs! True, there are some Australians that
lost their leg, therefore the number of legs is less than twice the number of people.
• Small company salaries: 4 employees 20K, 3 employees 30K, vice-president 200K,
president 400K.
Averages: Mean=85.5K, GeoMean=41.2K, Median = 30K, HarMean=29.3K, Mode=20K.
Mushrooms. The unhappy outcome of uninformed mushroom-picking is poisoning. In
many cases such poisoning is due to ignorance or a superficial approach to identification.
The most dangerous fungi are Death Cap (Amanita Phaloides) and two species akin to it
Amanita Verna and Destroying Angel (Amanita Virosa). These three toadstools cause the
majority of fatal poisoning.
One of the keys for mushroom identification is the spore deposit. Spores of Amanita
Phaloides are colorless, almost spherical, and smooth. Measurements in mµ of 28 spores are
given below:
9.2
8.5
9.7
8.0
8.3
9.0
7.9
8.8
8.4
9.9
9.5
9.0
8.7
8.6
9.1
9.3
8.4
8.8
8.2
9.1
9.0
10.1
8.7
8.6
8.1
8.6
9.2
9.1
Find the five number summary (M in, Q1 , M e, Q3 , M ax) for the spore measurements
data.
Find the mean and the mode.
¯
Find the z-scores, zi = (Xi − X)/s.
>> amanita =[9.2,
8.5,
9.7,
8.0,
8.3,
9.0,
7.9,
8.8,
8.4,
9.9,
9.5,
9.0,
8.7,
8.6,
9.1, 10.1,...
9.3, 8.7,...
8.4, 8.6,...
8.8, 8.1,...
8.2, 8.6,...
9.1, 9.2,...
9.0, 9.1,...
13
9.2, 8.8, 9.1, 10.1];
>> zis = zscore(amanita);
>> hist(zis,15)
Favorite. Known: y¯ = 6 and sy = 3.32, and n = 11.
Observation y11 = 7 removed.
Find y¯new and sy(new) .
Solution:
y¯new = 11·6−7
= 59
= 5.9.
10
10
1 P11
2
sy(old) = 11.0224 = 10
( i=1 yi2 − 11 · 62 ).
P11 2
Thus, i=1 yi = 506.224;
P10 2
2
i=1 yi = 506.224 − 7 = 457.224.
s2y(new) = 19 (457.224 − 10 · 5.92 ) = 12.12489.
sy(new) = 3.48.
CV Examples. (1) It might be nice, for example, to determine whether UK voters (whose
parties have somewhat more distinct policy positions than in the US) have as a result wider
variation in their evaluations of the parties than in the US. Problem is the British election
survey takes evaluations scored 0-10, while the US National Election Survey gets evaluations
scored 0-100. Using CV would allow easy comparisons of the amount of variation without
worrying about the different scales.
(2) You and I both teach sections of the same class, but we make up our own finals. To
compare the effectiveness of our final exam designs at creating maximum variance in exam
scores (an oft-unmentioned goal of exam design) we can compare CV, which is not related
to how many points we have on the exam, to the mean performance of our students, or to
the mean difficulty of the exam. It might be related to the homogeneity of our classes, but
to test exams like this we would need CV.
You the Stat Gury. Graduate student Rosa Juliusdottir reported results of an experiment
to her advisor who wanted to include them in the grant proposal. Before leaving to Reykjavik
for a short vacation, she left the following data in advisor’s mailbox: Sample size n = 9,
¯ = 17.11111 and sample variance s2 = 13.86111.
sample mean X
The advisor noted with horror that the last measurement X9 was wrongly recorded. It
should have been 20 instead of 21.
¯ and s2 , but the advisor did not have the previous 8 measureIt would be easy to fix X
ments and neither statistics training. Rosa was in Island, and grant proposal due tomorrow.
Advisor is desperate, but suddenly you came along...
14