Descriptive and Inferential Statistics
Statistics is often said to consist of two parts: Descriptive Statistics and Inferential Statistics.
Descriptive Statistics describes various characteristics of data samples. A data sample could be
a poll of n = 500 randomly selected people from a total population of N = 60 million people, or
a sample of n = 1000 trials from an experiment that could be repeated indefinitely (N → ∞).
Typically the population size N is very large (e.g. N = 60 million), or infinite, and the sample size
could be small or large. The sample has numerical characteristics, such as the sample mean, median, quartiles, sample variance, and sample correlation coefficient. Each such characteristic describing the data sample is referred to as a statistic.
Inferential Statistics tries to make inferences from the data – and, in the process, goes beyond the
properties of the sample itself. For example, if the data are collected from a random sample of
the population, one attempts to make inferences about the characteristics of the population as a
whole. Such characteristics of the population are referred to as parameters (they are not random).
1 Estimators
Recall the notion of the sample mean,
$$ \bar{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i , \qquad (1.1) $$
and that of the sample variance,
$$ S_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{Y})^2 . \qquad (1.2) $$
The sample mean and sample variance are examples of so-called point estimators.
Theorem 1.1. The sample mean is an unbiased estimator of the population mean; that is, the sample mean as a random variable satisfies $E(\bar{Y}) = \mu$, where $\mu$ is the population mean, $\mu = E(Y)$.
Proof. We have
$$ \bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i . $$
Applying the expectation to $\bar{Y}$,
$$ E(\bar{Y}) = \frac{1}{N}\sum_{i=1}^{N} E(Y_i) = \frac{1}{N}\sum_{i=1}^{N} \mu = \mu . $$
The theorem can be stated succinctly as $E(\bar{Y}) = E(Y)$.
Theorem 1.2. The sample variance is an unbiased estimator of the population variance, i.e., the sample variance satisfies $E(S_Y^2) = V(Y) = \sigma^2$.
Proof. We have
$$ S_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\Bigl(Y_i - \frac{1}{N}\sum_{j=1}^{N} Y_j\Bigr)^{2} = \frac{1}{N-1}\sum_{i=1}^{N} Y_i^2 - \frac{N}{N-1}\Bigl(\frac{1}{N}\sum_{i=1}^{N} Y_i\Bigr)^{2} . $$
And hence, using the independence of the $Y_i$ (so that $E(Y_i Y_j) = E(Y)^2$ for $i \neq j$),
$$ E(S_Y^2) = \frac{N}{N-1}\,E(Y^2) - \frac{N}{(N-1)N^2}\Bigl(\sum_i E(Y_i^2) + \sum_{i\neq j} E(Y_i Y_j)\Bigr) = \Bigl(\frac{N}{N-1} - \frac{1}{N-1}\Bigr)E(Y^2) - E(Y)^2 = V(Y) = \sigma^2 . $$
From this, we have the following definition:
Definition. The Sample Standard Deviation
The sample standard deviation of $Y$ is given by $S_Y = \sqrt{S_Y^2}$.
The above considerations also show why we use a normalizing factor of N − 1 instead of N. The random variable
$$ S_Y'^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \bar{Y})^2 $$
is a biased estimator for $\sigma^2$. Indeed, from the calculations used in the proof above, we see that
$$ E(S_Y'^2) = \Bigl(1 - \frac{1}{N}\Bigr)\sigma^2 . $$
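To make the effect of the normalization concrete, the following is a minimal simulation sketch in Python (the population parameters, the sample size N, and the number of replications are illustrative choices, not taken from the text). It compares the average of $S_Y^2$ and of $S_Y'^2$ over many samples with $\sigma^2$ and $(1 - 1/N)\,\sigma^2$.

    import random

    random.seed(0)
    N = 10            # sample size (illustrative)
    reps = 100_000    # number of simulated samples
    mu, sigma = 3.0, 2.0

    sum_unbiased = 0.0   # accumulates S_Y^2 (divides by N - 1)
    sum_biased = 0.0     # accumulates S_Y'^2 (divides by N)
    for _ in range(reps):
        ys = [random.gauss(mu, sigma) for _ in range(N)]
        ybar = sum(ys) / N
        ss = sum((y - ybar) ** 2 for y in ys)
        sum_unbiased += ss / (N - 1)
        sum_biased += ss / N

    print("average S_Y^2 :", sum_unbiased / reps, " target:", sigma ** 2)
    print("average S_Y'^2:", sum_biased / reps, " target:", (1 - 1 / N) * sigma ** 2)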
2 Parameter Estimation
We have already come across parameter estimation in an elementary way by noting, for instance, that the sample mean, $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i$, as a random variable, is a good (unbiased) estimator of the population mean $\mu = E(Y)$. We have also seen that the strength of the sample mean as an estimator of the population mean is further enhanced by the fact that $\mathrm{Var}(\bar{Y}) = \frac{\sigma^2}{N}$, so that the variance of $\bar{Y}$ decreases with N.
For large N , we can further invoke the central limit theorem to conclude that:
Z =
Y −µ
√ ∼ N (0, 1) ,
σ/ N
allowing to produce quantitative estimates of probabilities of deviations of a sample mean from the
population mean. Using the CDF φ(x) = P (Z ≤ x) of the distribution Z ∼ N (0, 1), we obtain
!
√ !
√N Y −µ N
≥
= 2 1−φ
.
P √
σ
σ
σ/N 3
Thus we have:
P Y − µ ≥ = P
!
√ !
√N Y −µ N
√ ≥
= 2 1−φ
.
σ/ N σ
σ
Since φ(x) → 1 for x → + ∞, we can conclude that P (|Y − µ| ≥ ) can be very small, even for
√
very small , provided N is large enough such that
N 1. This result allows to put confidence
σ
limits on parameter ranges for population means from a sample estimate, which we turn to next.
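As a quick numerical check, the sketch below (a minimal illustration; the values of µ, σ, N, and ε are invented for the example) evaluates $2\bigl(1 - \phi(\varepsilon\sqrt{N}/\sigma)\bigr)$ with the standard normal CDF and compares it with a Monte Carlo estimate of $P(|\bar{Y} - \mu| \ge \varepsilon)$.

    import random
    from statistics import NormalDist

    random.seed(1)
    mu, sigma = 0.0, 1.0
    N = 100
    eps = 0.2

    # Normal-approximation value: 2 * (1 - phi(eps * sqrt(N) / sigma))
    p_clt = 2 * (1 - NormalDist().cdf(eps * N ** 0.5 / sigma))

    # Monte Carlo estimate of P(|Ybar - mu| >= eps)
    reps = 20_000
    hits = 0
    for _ in range(reps):
        ybar = sum(random.gauss(mu, sigma) for _ in range(N)) / N
        if abs(ybar - mu) >= eps:
            hits += 1

    print("normal approximation:", p_clt)
    print("simulation estimate :", hits / reps)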
3 Parameter Estimation: Confidence Intervals
We would like to estimate an unknown parameter of a population from measurements of its sample mean $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$. By the law of large numbers and the central limit theorem, we know that for large N the random variable $Z = \frac{\bar{Y} - \mu}{\sigma/\sqrt{N}}$ is a normally distributed random variable, $Z \sim N(0,1)$. Hence we know that
$$ P\bigl(|\bar{Y} - \mu| \ge \varepsilon\bigr) = P\left(\frac{|\bar{Y} - \mu|}{\sigma/\sqrt{N}} \ge \frac{\varepsilon\sqrt{N}}{\sigma}\right) = 2\left(1 - \phi\left(\frac{\varepsilon\sqrt{N}}{\sigma}\right)\right) . $$
A confidence interval is established by demanding that $P(|\bar{Y} - \mu| \ge \varepsilon) = \alpha$, where α is small. This translates to
$$ P\bigl(|\bar{Y} - \mu| \le \varepsilon\bigr) = 1 - \alpha . $$
From the table for the CDF $\phi(x)$ of the standard normal distribution, we can read off the values:
$$ P(|Z| \ge z_\alpha) = \alpha \iff 2\bigl(1 - \phi(z_\alpha)\bigr) = \alpha \iff \phi(z_\alpha) = 1 - \frac{\alpha}{2} . $$
Typical values for α are 0.01 (1%) or 0.05 (5%):
$$ \alpha = 0.01 \;\rightarrow\; 1 - \frac{\alpha}{2} = 0.995 \;\rightarrow\; z_\alpha \simeq 2.576 , $$
$$ \alpha = 0.05 \;\rightarrow\; 1 - \frac{\alpha}{2} = 0.975 \;\rightarrow\; z_\alpha \simeq 1.960 . $$
This allows us to conclude with a given confidence level 1 − α that
$$ \bar{Y} - z_\alpha \frac{\sigma}{\sqrt{N}} \le \mu \le \bar{Y} + z_\alpha \frac{\sigma}{\sqrt{N}} , \qquad (3.1) $$
a result that we obtain from a measurement of $\bar{Y}$ of the sample.
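The interval (3.1) is straightforward to compute. Below is a minimal sketch in Python, assuming the population standard deviation σ is known; the data values and σ are invented for illustration.

    from statistics import NormalDist

    data = [4.8, 5.1, 5.0, 4.7, 5.3, 4.9, 5.2, 5.0, 4.6, 5.4]   # illustrative sample
    sigma = 0.3                                                  # assumed known
    alpha = 0.05

    N = len(data)
    ybar = sum(data) / N
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # phi(z_alpha) = 1 - alpha/2
    half_width = z_alpha * sigma / N ** 0.5

    print(f"{1 - alpha:.0%} confidence interval for mu:"
          f" [{ybar - half_width:.3f}, {ybar + half_width:.3f}]")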
Oftentimes, we do not know the population standard deviation σ! In such situations, we need to use estimates for σ. There are a number of situations where we have good estimates. Some examples are:
1. If we know that $Y \in [a, b]$, i.e. the random variable falls in a finite interval, then clearly $\sigma \le \frac{b-a}{2}$, and we can replace σ in (3.1) by this estimate.
2. If we know Y to be a Poisson random variable, i.e. $Y \sim \mathrm{Poisson}(\mu)$, we know that $\sigma = \sqrt{\mu}$, and we can use this relation to determine the bounds of the confidence interval. For example, the upper limit is given as
$$ \mu \le \bar{Y} + z_\alpha \sqrt{\frac{\mu}{N}} \;\iff\; \mu - z_\alpha \sqrt{\frac{\mu}{N}} \le \bar{Y} $$
$$ \iff\; \Bigl(\sqrt{\mu} - \frac{z_\alpha}{2\sqrt{N}}\Bigr)^{2} \le \bar{Y} + \frac{z_\alpha^2}{4N} $$
$$ \iff\; \sqrt{\mu} \le \frac{z_\alpha}{2\sqrt{N}} + \sqrt{\bar{Y} + \frac{z_\alpha^2}{4N}} $$
$$ \iff\; \mu \le \Biggl(\frac{z_\alpha}{2\sqrt{N}} + \sqrt{\bar{Y} + \frac{z_\alpha^2}{4N}}\Biggr)^{2} . $$
A similar argument works for the lower bound of the confidence interval (see the numerical sketch after this list).
3. If we know Y to be a Bernoulli random variable, i.e. $Y \sim B(p)$, then we know $\mu = p$ and $\sigma = \sqrt{p(1-p)} = \sqrt{\mu(1-\mu)}$, and we can use this in (3.1) to find analogous equations for the boundaries of the confidence interval.
4. If the random variable Y is known to be normal or Gaussian, we can use the sample standard deviation $S_Y$ as an estimator for σ in (3.1). Note that $S_Y$ can be measured from the sample and its probability density function is known, so it can be used to establish the boundaries of confidence intervals. We provide the necessary results required in this line of reasoning in the next section.
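As a numerical companion to item 2, here is a minimal sketch of the Poisson bounds; the function name poisson_mean_bounds and the example numbers are purely illustrative, and the lower bound follows the "similar argument" mentioned in the text.

    from statistics import NormalDist

    def poisson_mean_bounds(ybar, n, alpha=0.05):
        """Confidence bounds for a Poisson mean, using sigma = sqrt(mu) and
        completing the square as in item 2 above."""
        z = NormalDist().inv_cdf(1 - alpha / 2)
        shift = z / (2 * n ** 0.5)
        radius = (ybar + z ** 2 / (4 * n)) ** 0.5
        upper = (shift + radius) ** 2
        lower = max(0.0, radius - shift) ** 2   # analogous lower bound
        return lower, upper

    # Example: 40 observed counts with average 2.5 events per observation
    print(poisson_mean_bounds(ybar=2.5, n=40))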
4 Parameter Estimation Using the t-Distribution
The introduction of the t-distribution finally allows us to provide confidence intervals for small samples of Gaussian random data, which circumvents the problem of the unknown population variance $\sigma^2$ that was needed for Z-statistics. The steps are as follows:
1. Given n independent, identically distributed normal random variables, $Y_i \sim N(\mu, \sigma)$, $i = 1, \cdots, n$, we know that
$$ Z = \sqrt{n}\,\frac{\bar{Y} - \mu}{\sigma} \sim N(0, 1), $$
where $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ is the sample mean.
2. Next, we know that
$$ W = \frac{n-1}{\sigma^2}\, S_Y^2 \sim \chi^2_{n-1} , $$
where $S_Y^2$ is the sample variance.
3. Further, we have that
$$ T = \frac{Z}{\sqrt{W/(n-1)}} = \sqrt{n}\,\frac{\bar{Y} - \mu}{S_Y} \sim T_{n-1} $$
is t-distributed with n − 1 degrees of freedom.
These identities then allow us to put confidence intervals on estimates of µ as follows. At significance level α, define $t_{\alpha/2}$ via
$$ P(|T| \ge t_{\alpha/2}) = \alpha \;\iff\; P\Bigl(|\bar{Y} - \mu| \ge t_{\alpha/2}\frac{S_Y}{\sqrt{n}}\Bigr) = \alpha , $$
which is equivalent to
$$ -t_{\alpha/2}\frac{S_Y}{\sqrt{n}} \le \bar{Y} - \mu \le t_{\alpha/2}\frac{S_Y}{\sqrt{n}} \quad \text{with probability } 1 - \alpha , $$
which is equivalent to
$$ \bar{Y} - t_{\alpha/2}\frac{S_Y}{\sqrt{n}} \le \mu \le \bar{Y} + t_{\alpha/2}\frac{S_Y}{\sqrt{n}} \quad \text{with probability } 1 - \alpha . $$
We have established that the confidence interval for confidence level 1 − α is
$$ \Bigl[\,\bar{Y} - t_{\alpha/2}\frac{S_Y}{\sqrt{n}} \,,\; \bar{Y} + t_{\alpha/2}\frac{S_Y}{\sqrt{n}}\,\Bigr] . $$
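A minimal sketch of this t-based interval is given below; it assumes SciPy is available for the t quantile, and the small Gaussian-looking sample is invented for illustration.

    from statistics import mean, stdev
    from scipy.stats import t as t_dist   # assumes SciPy is installed

    data = [10.2, 9.8, 10.5, 10.1, 9.6, 10.4, 10.0, 9.9]   # illustrative small sample
    alpha = 0.05

    n = len(data)
    ybar = mean(data)
    s = stdev(data)                                # sample standard deviation S_Y (n - 1 in the denominator)
    t_crit = t_dist.ppf(1 - alpha / 2, df=n - 1)   # P(|T| >= t_crit) = alpha
    half_width = t_crit * s / n ** 0.5

    print(f"{1 - alpha:.0%} confidence interval for mu:"
          f" [{ybar - half_width:.3f}, {ybar + half_width:.3f}]")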
5 The difference of two means
Consider two independent samples,
$$ Y_{i1}:\; i = 1, \cdots, n_1, \quad Y_{i1} \sim N(\mu_1, \sigma_1), \qquad \text{and} \qquad Y_{i2}:\; i = 1, \cdots, n_2, \quad Y_{i2} \sim N(\mu_2, \sigma_2), $$
with their sample means $\bar{Y}_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} Y_{i1}$ and $\bar{Y}_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} Y_{i2}$. We have $E(\bar{Y}_1 - \bar{Y}_2) = \mu_1 - \mu_2$, and
$$ \mathrm{Var}(\bar{Y}_1 - \bar{Y}_2) = \mathrm{Var}(\bar{Y}_1) + \mathrm{Var}(\bar{Y}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \;\Rightarrow\; \sigma_{\bar{Y}_1 - \bar{Y}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} . $$
For large samples, $n_1, n_2 \gg 1$, we have
$$ \sigma_1^2 \approx S_{Y_1}^2 = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1} (Y_{i1} - \bar{Y}_1)^2 , \qquad \sigma_2^2 \approx S_{Y_2}^2 = \frac{1}{n_2 - 1}\sum_{i=1}^{n_2} (Y_{i2} - \bar{Y}_2)^2 . $$
Also, since we are given that $Y_1$ and $Y_2$ are normal, we have
$$ Z = \frac{\bar{Y}_1 - \bar{Y}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1) . $$
For small samples, we can also develop an appropriate estimator using t-distributions as before, but we shall not develop this in this course.
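The following minimal sketch computes the large-sample interval for $\mu_1 - \mu_2$ implied by the Z statistic above, replacing the population variances by the sample variances; the two (small) samples are invented for illustration.

    from statistics import NormalDist, mean, variance

    sample1 = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0, 4.9, 5.2]   # illustrative
    sample2 = [4.7, 4.9, 4.6, 4.8, 5.0, 4.7, 4.8, 4.9, 4.6, 4.8]   # illustrative
    alpha = 0.05

    n1, n2 = len(sample1), len(sample2)
    diff = mean(sample1) - mean(sample2)
    # statistics.variance uses the (n - 1) normalization, i.e. S_{Y_1}^2 and S_{Y_2}^2
    se = (variance(sample1) / n1 + variance(sample2) / n2) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)

    print("estimate of mu1 - mu2:", round(diff, 3))
    print(f"{1 - alpha:.0%} confidence interval:"
          f" [{diff - z * se:.3f}, {diff + z * se:.3f}]")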
6 Characteristics of bivariate samples
We have discussed sample characteristics like the sample mean, sample variance, etc., for a sample of one random variable. We now briefly consider a sample of 2-dimensional data $(x_i, y_i)$, $i = 1, 2, \cdots, N$.
Definition. Sample Covariance
The sample covariance is defined as
$$ S_{XY} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{X})(y_i - \bar{Y}) = \frac{1}{N-1}\sum_i x_i y_i - \frac{N}{N-1}\,\bar{X}\,\bar{Y} , \qquad (6.1) $$
where $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} x_i$ and $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i$ are the sample means of the two variables.
Definition. Sample Correlation Coefficient
The sample correlation coefficient is defined as
$$ r_{XY} = \frac{S_{XY}}{S_X S_Y} , \qquad (6.2) $$
where $S_{XY}$ is the sample covariance and $S_X$ and $S_Y$ are the sample standard deviations.
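A minimal sketch implementing (6.1) and (6.2) directly; the helper names and the bivariate data are illustrative.

    def sample_covariance(xs, ys):
        """S_XY = 1/(N-1) * sum_i (x_i - Xbar)(y_i - Ybar), as in (6.1)."""
        n = len(xs)
        xbar, ybar = sum(xs) / n, sum(ys) / n
        return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

    def sample_correlation(xs, ys):
        """r_XY = S_XY / (S_X * S_Y), as in (6.2)."""
        sx = sample_covariance(xs, xs) ** 0.5   # S_X
        sy = sample_covariance(ys, ys) ** 0.5   # S_Y
        return sample_covariance(xs, ys) / (sx * sy)

    xs = [1.0, 2.0, 3.0, 4.0, 5.0]        # illustrative data
    ys = [2.1, 3.9, 6.2, 8.1, 9.8]
    print("S_XY =", sample_covariance(xs, ys))
    print("r_XY =", sample_correlation(xs, ys))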
7 Linear Regression
Suppose we have a bivariate data set $(x_i, y_i)$, $i = 1, 2, \cdots, N$, that looks like the following graph:

Figure 1: Linear Regression

We can think of this as manifestations of two random variables $(X_i, Y_i)$ that are correlated. From the figure, we deduce that the two variables are approximately linearly related. One can attempt to find the "best linear relation" fitting the data. To do so, one tries to find the equation of the straight line y = ax + b which minimizes the squared error:
$$ \varepsilon^2 = \frac{1}{2(N-1)}\sum_{i=1}^{N} (y_i - a x_i - b)^2 = \frac{1}{2(N-1)}\sum_{i=1}^{N} \varepsilon_i^2 \qquad (7.1) $$
with respect to the parameters a and b. This procedure is called the method of least squared error.
We rewrite the squared error as
$$ \varepsilon^2 = \frac{1}{2(N-1)}\sum_{i=1}^{N} \bigl(y_i - \bar{Y} - a(x_i - \bar{X}) + \bar{Y} - a\bar{X} - b\bigr)^2 , \qquad (7.2) $$
with $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} x_i$ and $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i$. Expanding the $(\cdots)^2$ part and summing over i, we obtain:
$$ \varepsilon^2 = \frac{1}{2}\left( S_Y^2 + a^2 S_X^2 - 2\,a\, S_{XY} + \frac{N}{N-1}\,(\bar{Y} - a\bar{X} - b)^2 \right) . \qquad (7.3) $$
To find the minimum value of $\varepsilon^2$ as a function of a and b, we set the two partial derivatives to zero:
$$ \frac{\partial \varepsilon^2}{\partial a} = a S_X^2 - S_{XY} - \frac{N}{N-1}\,(\bar{Y} - a\bar{X} - b)\,\bar{X} = 0 , $$
$$ \frac{\partial \varepsilon^2}{\partial b} = -\frac{N}{N-1}\,(\bar{Y} - a\bar{X} - b) = 0 . \qquad (7.4) $$
The above equations can be solved by
$$ a = \frac{S_{XY}}{S_X^2} = \frac{S_{XY}}{S_X S_Y}\,\frac{S_Y}{S_X} = r_{XY}\,\frac{S_Y}{S_X} , \qquad b = \bar{Y} - a\bar{X} . \qquad (7.5) $$
So the slope is given in terms of the sample correlation coefficient and the sample standard deviations of the $x_i$ and the $y_i$. We can also check by computing the second derivative that the stationary point is a minimum.
The resulting straight line
$$ y = \Bigl(r_{XY}\,\frac{S_Y}{S_X}\Bigr) x + \bar{Y} - r_{XY}\,\frac{S_Y}{S_X}\,\bar{X} = r_{XY}\,\frac{S_Y}{S_X}\,(x - \bar{X}) + \bar{Y} \qquad (7.6) $$
is called the regression line.
The value of $\varepsilon^2$ for the minimizing values of a and b is
$$ \varepsilon^2_{\min} = \frac{1}{2}\bigl(S_Y^2 + r_{XY}^2 S_Y^2 - 2\,r_{XY}^2 S_Y^2\bigr) = \frac{1}{2}\bigl(1 - r_{XY}^2\bigr) S_Y^2 . \qquad (7.7) $$
Since $|r_{XY}| \le 1$, we have $\varepsilon^2_{\min} = 0 \iff |r_{XY}| = 1$, which means $\varepsilon^2_{\min} = 0$ is realized if and only if the points $(x_i, y_i)$ lie exactly on the straight line.
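A minimal sketch of the least-squares fit using (7.5) and (7.7); the function name and the data (scattered around a straight line) are illustrative.

    def linear_regression(xs, ys):
        """Return slope a, intercept b, r_XY, and eps^2_min via (7.5) and (7.7)."""
        n = len(xs)
        xbar, ybar = sum(xs) / n, sum(ys) / n
        sxx = sum((x - xbar) ** 2 for x in xs) / (n - 1)              # S_X^2
        syy = sum((y - ybar) ** 2 for y in ys) / (n - 1)              # S_Y^2
        sxy = sum((x - xbar) * (y - ybar)
                  for x, y in zip(xs, ys)) / (n - 1)                  # S_XY
        a = sxy / sxx                                                 # = r_XY * S_Y / S_X
        b = ybar - a * xbar
        r = sxy / (sxx * syy) ** 0.5
        eps2_min = 0.5 * (1 - r ** 2) * syy                           # cf. (7.7)
        return a, b, r, eps2_min

    xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]             # illustrative data near y = 2x + 1
    ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]
    a, b, r, eps2_min = linear_regression(xs, ys)
    print(f"y = {a:.3f} x + {b:.3f}, r_XY = {r:.3f}, eps^2_min = {eps2_min:.4f}")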
Note that the idea of least squares fit is not restricted to a linear relation. We could, for instance, find that $(x_i, y_i)$ are scattered around some parabolic shape, in which case we would seek to minimize
$$ \varepsilon^2 = \frac{1}{2(N-1)}\sum_i \bigl(y_i - a_2 x_i^2 - a_1 x_i - a_0\bigr)^2 , \qquad (7.8) $$
with respect to $a_0$, $a_1$, $a_2$.
In the above least-squares fit we attempted to find the best values of a and b that would minimize
$$ \frac{1}{2(N-1)}\sum_i (y_i - a x_i - b)^2 . $$
This would be interpreted as using the xi to predict the yi .
One could turn this around and attempt a least squares fit using
$$ \hat{\varepsilon}^2 = \frac{1}{2(N-1)}\sum_i (x_i - \alpha y_i - \beta)^2 . \qquad (7.9) $$
Here, we use the $y_i$ as predictors for the $x_i$. Using the same reorganization of the terms as above, we have
$$ \hat{\varepsilon}^2 = \frac{1}{2}\left( S_X^2 + \alpha^2 S_Y^2 - 2\,\alpha\, S_{XY} + \frac{N}{N-1}\,(\bar{X} - \alpha\bar{Y} - \beta)^2 \right) . $$
We calculate the derivatives,
$$ \frac{\partial \hat{\varepsilon}^2}{\partial \alpha} = \alpha S_Y^2 - S_{XY} - \frac{N}{N-1}\,(\bar{X} - \alpha\bar{Y} - \beta)\,\bar{Y} , \qquad (7.10) $$
$$ \frac{\partial \hat{\varepsilon}^2}{\partial \beta} = -\frac{N}{N-1}\,(\bar{X} - \alpha\bar{Y} - \beta) . \qquad (7.11) $$
The solution is
$$ \alpha = r_{XY}\,\frac{S_X}{S_Y} , \qquad \beta = \bar{X} - \alpha\bar{Y} , \qquad (7.12) $$
and the resulting regression line is:
$$ y = \frac{1}{r_{XY}}\,\frac{S_Y}{S_X}\,(x - \bar{X}) + \bar{Y} . \qquad (7.13) $$
Note that the two regression lines (7.6) and (7.13) are not identical unless |rXY | = 1. The reason
is that we are minimizing different error measures – the mean square deviations in the Y direction
in the first case, and the mean square deviations in the X direction in the second case.
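The following minimal sketch illustrates the point numerically: it computes the slope of the y-on-x regression line (7.6) and the slope, in the (x, y) plane, of the x-on-y regression line (7.13) for illustrative noisy data with $|r_{XY}| < 1$, and the two slopes differ.

    def slope_y_on_x(xs, ys):
        """Slope of the regression line (7.6): r_XY * S_Y / S_X = S_XY / S_X^2."""
        n = len(xs)
        xbar, ybar = sum(xs) / n, sum(ys) / n
        sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        sxx = sum((x - xbar) ** 2 for x in xs)
        return sxy / sxx

    def slope_x_on_y_line(xs, ys):
        """Slope in the (x, y) plane of the line (7.13): (1 / r_XY) * S_Y / S_X."""
        return 1.0 / slope_y_on_x(ys, xs)

    xs = [0.0, 1.0, 2.0, 3.0, 4.0]        # illustrative noisy data
    ys = [0.2, 0.9, 2.3, 2.8, 4.1]
    print("slope of the y-on-x line (7.6) :", slope_y_on_x(xs, ys))
    print("slope of the x-on-y line (7.13):", slope_x_on_y_line(xs, ys))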