Descriptive and Inferential Statistics

Statistics is often said to consist of two parts: Descriptive Statistics and Inferential Statistics. Descriptive Statistics describes various characteristics of data samples. A data sample could be a poll of n = 500 randomly selected people from a total population of N = 60 million people, or a sample of n = 1000 trials from an experiment that could be repeated indefinitely (N → ∞). Typically the population size N is very large (e.g. N = 60 million) or infinite, and the sample size could be small or large. The sample has numerical characteristics, such as the sample mean, median, quartiles, sample variance, and sample correlation coefficient. Each such characteristic describing the data sample is referred to as a statistic.

Inferential Statistics tries to make inferences from the data and, in the process, goes beyond the properties of the sample itself. For example, if the data are collected from a random sample of the population, one attempts to make inferences about the characteristics of the population as a whole. Such characteristics of the population are referred to as parameters (they are not random).

1 Estimators

Recall the notions of the sample mean
\[
\overline{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i , \tag{1.1}
\]
and of the sample variance
\[
S_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N} \left(y_i - \overline{Y}\right)^2 . \tag{1.2}
\]
(Here, N denotes the number of observations in the sample.) The sample mean and sample variance are examples of so-called point estimators.

Theorem 1.1. The sample mean is an unbiased estimator; that is, the sample mean as a random variable satisfies $E(\overline{Y}) = \mu$, where $\mu = E(Y)$ is the population mean.

Proof. With
\[
\overline{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i ,
\]
we apply the expectation to $\overline{Y}$:
\[
E(\overline{Y}) = \frac{1}{N}\sum_{i=1}^{N} E(Y_i) = \frac{1}{N}\sum_{i=1}^{N} \mu = \mu .
\]

The theorem can be stated succinctly as $E(\overline{Y}) = E(Y)$.

Theorem 1.2. The sample variance is an unbiased estimator of the population variance, i.e., the sample variance satisfies $E(S_Y^2) = V(Y) = \sigma^2$.

Proof. We have
\[
S_Y^2 = \frac{1}{N-1}\sum_{i=1}^{N} \left(Y_i - \overline{Y}\right)^2
      = \frac{1}{N-1}\sum_{i=1}^{N} Y_i^2 - \frac{N}{N-1}\left(\frac{1}{N}\sum_{i=1}^{N} Y_i\right)^2 .
\]
Hence, using the independence of the $Y_i$ (so that $E(Y_i Y_j) = E(Y)^2$ for $i \neq j$),
\[
E(S_Y^2) = \frac{N}{N-1}\,E(Y^2) - \frac{N}{(N-1)N^2}\left(\sum_{i} E(Y_i^2) + \sum_{i \neq j} E(Y_i Y_j)\right)
         = \left(\frac{N}{N-1} - \frac{1}{N-1}\right)E(Y^2) - E(Y)^2 = V(Y) = \sigma^2 .
\]

From this, we have the following definition.

Definition (Sample Standard Deviation). The sample standard deviation of $Y$ is given by $S_Y = \sqrt{S_Y^2}$.

The above considerations also show why we use a normalizing factor of $N-1$ instead of $N$. The random variable
\[
S_Y'^2 = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \overline{Y}\right)^2
\]
is a biased estimator of $\sigma^2$. Indeed, from the calculations used in the proof above, we see that
\[
E(S_Y'^2) = \left(1 - \frac{1}{N}\right)\sigma^2 .
\]

2 Parameter Estimation

We have already come across parameter estimation in an elementary way by noting, for instance, that the sample mean $\overline{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i$, as a random variable, is a good (unbiased) estimator of the population mean $\mu = E(Y)$. We have also seen that the strength of the sample mean as an estimator of the population mean is further enhanced by the fact that $\mathrm{Var}(\overline{Y}) = \sigma^2/N$, so that the variance of $\overline{Y}$ decreases with $N$. For large $N$, we can further invoke the central limit theorem to conclude that
\[
Z = \frac{\overline{Y} - \mu}{\sigma/\sqrt{N}} \sim N(0,1) ,
\]
allowing us to produce quantitative estimates of the probabilities of deviations of a sample mean from the population mean. Using the CDF $\Phi(x) = P(Z \le x)$ of the distribution $Z \sim N(0,1)$, we obtain
\[
P\left(\left|\overline{Y} - \mu\right| \ge \epsilon\right)
= P\left(\frac{\left|\overline{Y} - \mu\right|}{\sigma/\sqrt{N}} \ge \frac{\sqrt{N}\,\epsilon}{\sigma}\right)
= 2\left(1 - \Phi\!\left(\frac{\sqrt{N}\,\epsilon}{\sigma}\right)\right) .
\]
Since $\Phi(x) \to 1$ as $x \to +\infty$, we can conclude that $P(|\overline{Y} - \mu| \ge \epsilon)$ can be very small, even for very small $\epsilon$, provided $N$ is large enough that $\sqrt{N}\,\epsilon/\sigma \gg 1$. This result allows us to put confidence limits on parameter ranges for population means from a sample estimate, which we turn to next.
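Before doing so, here is a quick numerical illustration of the tail bound above (a minimal Python sketch; the values of $\sigma$, $\epsilon$ and the sample sizes are made up for the example):

```python
import math

def phi(x: float) -> float:
    """Standard normal CDF Phi(x) = P(Z <= x), computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def deviation_probability(eps: float, sigma: float, n: int) -> float:
    """Large-sample (CLT) estimate of P(|Ybar - mu| >= eps) = 2 * (1 - Phi(sqrt(n) * eps / sigma))."""
    return 2.0 * (1.0 - phi(math.sqrt(n) * eps / sigma))

# Illustrative values only: sigma = 2, eps = 0.1, increasing sample sizes.
for n in (100, 1_000, 10_000):
    print(n, deviation_probability(eps=0.1, sigma=2.0, n=n))
```

The printed probabilities shrink rapidly once $\sqrt{N}\,\epsilon/\sigma \gg 1$, which is the quantitative content of the statement above.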
3 Parameter Estimation: Confidence Intervals

We would like to estimate an unknown parameter of a population from measurements of its sample mean $\overline{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i$. By the law of large numbers and the central limit theorem, we know that for large $N$ the random variable $Z = \frac{\overline{Y}-\mu}{\sigma/\sqrt{N}}$ is approximately a standard normal random variable, $Z \sim N(0,1)$. Hence we know that
\[
P\left(\left|\overline{Y}-\mu\right| \ge \epsilon\right)
= P\left(\frac{\left|\overline{Y}-\mu\right|}{\sigma/\sqrt{N}} \ge \frac{\sqrt{N}\,\epsilon}{\sigma}\right)
= 2\left(1 - \Phi\!\left(\frac{\sqrt{N}\,\epsilon}{\sigma}\right)\right) .
\]
A confidence interval is established by demanding that $P(|\overline{Y}-\mu| \ge \epsilon) = \alpha$, where $\alpha$ is small. This translates to $P(|\overline{Y}-\mu| < \epsilon) = 1-\alpha$. From the table for the CDF $\Phi(x)$ of the standard normal distribution, we can read off the value $z_\alpha$ defined by
\[
P(|Z| \ge z_\alpha) = \alpha \iff 2\left(1-\Phi(z_\alpha)\right) = \alpha \iff \Phi(z_\alpha) = 1 - \frac{\alpha}{2} .
\]
Typical values for $\alpha$ are 0.01 (1%) or 0.05 (5%):
\[
\alpha = 0.01 \;\rightarrow\; 1 - \frac{\alpha}{2} = 0.995 \;\rightarrow\; z_\alpha \simeq 2.576 ,
\qquad
\alpha = 0.05 \;\rightarrow\; 1 - \frac{\alpha}{2} = 0.975 \;\rightarrow\; z_\alpha \simeq 1.960 .
\]
This allows us to conclude, with a given confidence level $1-\alpha$, that
\[
\overline{Y} - z_\alpha \frac{\sigma}{\sqrt{N}} \le \mu \le \overline{Y} + z_\alpha \frac{\sigma}{\sqrt{N}} , \tag{3.1}
\]
a result that we obtain from a measurement of $\overline{Y}$ from the sample.

Oftentimes we do not know the population standard deviation $\sigma$! In such situations, we need to use estimates for $\sigma$. There are a number of situations where we have good estimates. Some examples are:

1. If we know that $Y \in [a, b]$, i.e. the random variable falls in a finite interval, then clearly $\sigma \le \frac{b-a}{2}$, and we can replace $\sigma$ in (3.1) by this estimate.

2. If we know $Y$ to be a Poisson random variable, i.e. $Y \sim \mathrm{Poisson}(\mu)$, we know that $\sigma = \sqrt{\mu}$, and we can use this relation to determine the bounds of the confidence interval. For example, the upper limit is obtained from
\[
\mu \le \overline{Y} + z_\alpha \sqrt{\frac{\mu}{N}}
\iff \mu - z_\alpha \sqrt{\frac{\mu}{N}} \le \overline{Y}
\iff \left(\sqrt{\mu} - \frac{z_\alpha}{2\sqrt{N}}\right)^2 \le \overline{Y} + \frac{z_\alpha^2}{4N}
\]
\[
\iff \sqrt{\mu} \le \frac{z_\alpha}{2\sqrt{N}} + \sqrt{\overline{Y} + \frac{z_\alpha^2}{4N}}
\iff \mu \le \left(\frac{z_\alpha}{2\sqrt{N}} + \sqrt{\overline{Y} + \frac{z_\alpha^2}{4N}}\right)^2 .
\]
A similar argument works for the lower bound of the confidence interval.

3. If we know $Y$ to be a Bernoulli random variable, i.e. $Y \sim B(p)$, then we know $\mu = p$ and $\sigma = \sqrt{p(1-p)} = \sqrt{\mu(1-\mu)}$, and we can use this in (3.1) to find analogous equations for the boundaries of the confidence interval (a worked sketch for this case is given after this list).

4. If the random variable $Y$ is known to be normal (Gaussian), we can use the sample standard deviation $S_Y$ as an estimator for $\sigma$ in (3.1). Note that $S_Y$ can be measured from the sample and its probability density function is known, so it can be used to establish the boundaries of confidence intervals. We provide the necessary results required in this line of reasoning in the next section.
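To make item 3 above concrete, here is a minimal sketch (in Python, with made-up numbers) of a large-sample confidence interval for a Bernoulli mean, obtained by plugging the estimate $\sigma \approx \sqrt{\overline{Y}(1-\overline{Y})}$ into (3.1):

```python
from statistics import NormalDist

def proportion_confidence_interval(successes: int, n: int, alpha: float = 0.05):
    """Large-sample z-interval for a Bernoulli mean p, using sigma ~ sqrt(p_hat * (1 - p_hat))."""
    p_hat = successes / n                      # sample mean Ybar of the 0/1 data
    z = NormalDist().inv_cdf(1 - alpha / 2)    # z_alpha defined by Phi(z_alpha) = 1 - alpha/2
    half_width = z * (p_hat * (1 - p_hat)) ** 0.5 / n ** 0.5
    return p_hat - half_width, p_hat + half_width

# Illustrative numbers only: 270 successes in n = 500 trials, 95% confidence level.
low, high = proportion_confidence_interval(successes=270, n=500, alpha=0.05)
print(f"p lies in [{low:.3f}, {high:.3f}] with approximately 95% confidence")
```

The same pattern applies to item 1, with the interval-based bound on $\sigma$; for the Poisson case of item 2 the limits follow instead from the quadratic relation derived above.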
4 Parameter Estimation Using the t-Distribution

The introduction of the t-distribution finally allows us to provide confidence intervals for small samples of Gaussian random data, circumventing the problem that the population variance $\sigma^2$, which was needed for the Z-statistic, is unknown. The steps are as follows:

1. Given $n$ independent, identically distributed normal random variables $Y_i \sim N(\mu, \sigma)$, $i = 1, \dots, n$, we know that
\[
Z = \sqrt{n}\,\frac{\overline{Y}-\mu}{\sigma} \sim N(0,1) ,
\]
where $\overline{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ is the sample mean.

2. Next, we know that
\[
W = \frac{n-1}{\sigma^2}\, S_Y^2 \sim \chi^2_{n-1} ,
\]
where $S_Y^2$ is the sample variance.

3. Further, we have that
\[
T = \frac{Z}{\sqrt{W/(n-1)}} = \sqrt{n}\,\frac{\overline{Y}-\mu}{S_Y} \sim T_{n-1}
\]
is t-distributed with $n-1$ degrees of freedom.

These identities then allow us to put confidence intervals on estimates of $\mu$ as follows. At significance level $\alpha$, define $t_{\alpha/2}$ via
\[
P\left(|T| \ge t_{\alpha/2}\right) = \alpha
\iff P\left(\left|\overline{Y}-\mu\right| \ge t_{\alpha/2}\,\frac{S_Y}{\sqrt{n}}\right) = \alpha ,
\]
which is equivalent to
\[
-t_{\alpha/2}\,\frac{S_Y}{\sqrt{n}} \le \overline{Y}-\mu \le t_{\alpha/2}\,\frac{S_Y}{\sqrt{n}}
\]
with probability $1-\alpha$, which in turn is equivalent to
\[
\overline{Y} - t_{\alpha/2}\,\frac{S_Y}{\sqrt{n}} \le \mu \le \overline{Y} + t_{\alpha/2}\,\frac{S_Y}{\sqrt{n}}
\]
with probability $1-\alpha$. We have thus established that the confidence interval at confidence level $1-\alpha$ is
\[
\left[\,\overline{Y} - t_{\alpha/2}\,\frac{S_Y}{\sqrt{n}}\, ,\; \overline{Y} + t_{\alpha/2}\,\frac{S_Y}{\sqrt{n}}\,\right] .
\]
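A minimal sketch of this small-sample interval (in Python; scipy is assumed to be available for the t quantile, and the data values are invented for illustration):

```python
from math import sqrt
from statistics import mean, stdev   # stdev uses the (n - 1) normalization, i.e. S_Y
from scipy import stats

def t_confidence_interval(sample, alpha=0.05):
    """Confidence interval for mu based on T = sqrt(n) * (Ybar - mu) / S_Y ~ t_{n-1}."""
    n = len(sample)
    ybar, s = mean(sample), stdev(sample)
    t_half = stats.t.ppf(1 - alpha / 2, df=n - 1)   # t_{alpha/2}: P(|T| >= t_{alpha/2}) = alpha
    half_width = t_half * s / sqrt(n)
    return ybar - half_width, ybar + half_width

# Small, made-up sample assumed to come from a normal distribution:
data = [4.8, 5.1, 5.4, 4.9, 5.0, 5.3, 4.7, 5.2]
print(t_confidence_interval(data, alpha=0.05))
```

For large $n$ the interval approaches the z-interval (3.1), since the t-distribution tends to the standard normal distribution as the number of degrees of freedom grows.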
5 The difference of two means

Consider two independent samples, $Y_{i1}$, $i = 1, \dots, n_1$, with $Y_{i1} \sim N(\mu_1, \sigma_1)$, and $Y_{i2}$, $i = 1, \dots, n_2$, with $Y_{i2} \sim N(\mu_2, \sigma_2)$, with sample means
\[
\overline{Y}_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} Y_{i1} , \qquad \overline{Y}_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} Y_{i2} .
\]
We have
\[
E\left(\overline{Y}_1 - \overline{Y}_2\right) = \mu_1 - \mu_2 , \qquad
\mathrm{Var}\left(\overline{Y}_1 - \overline{Y}_2\right) = \mathrm{Var}\left(\overline{Y}_1\right) + \mathrm{Var}\left(\overline{Y}_2\right) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} ,
\quad\Rightarrow\quad
\sigma_{\overline{Y}_1 - \overline{Y}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} .
\]
For large samples, $n_1, n_2 \gg 1$, we have
\[
\sigma_1^2 \approx S_{Y_1}^2 = \frac{1}{n_1-1}\sum_{i=1}^{n_1} \left(Y_{i1} - \overline{Y}_1\right)^2 , \qquad
\sigma_2^2 \approx S_{Y_2}^2 = \frac{1}{n_2-1}\sum_{i=1}^{n_2} \left(Y_{i2} - \overline{Y}_2\right)^2 .
\]
Also, since we are given that $Y_1$ and $Y_2$ are normal, we have
\[
Z = \frac{\overline{Y}_1 - \overline{Y}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0,1) .
\]
For small samples we can also develop an appropriate estimator using t-distributions, as before; we shall not develop this in this course.

6 Characteristics of bivariate samples

We have discussed sample characteristics like the sample mean, sample variance, etc., for a sample of one random variable. We now briefly consider a sample of two-dimensional data $(x_i, y_i)$, $i = 1, 2, \dots, N$.

Definition (Sample Covariance). The sample covariance is defined as
\[
S_{XY} = \frac{1}{N-1}\sum_{i=1}^{N} \left(x_i - \overline{X}\right)\left(y_i - \overline{Y}\right)
       = \frac{1}{N-1}\sum_{i} x_i y_i - \frac{N}{N-1}\,\overline{X}\,\overline{Y} , \tag{6.1}
\]
where $\overline{X} = \frac{1}{N}\sum_{i=1}^{N} x_i$ and $\overline{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i$ are the sample means of the two variables.

Definition (Sample Correlation Coefficient). The sample correlation coefficient is defined as
\[
r_{XY} = \frac{S_{XY}}{S_X S_Y} , \tag{6.2}
\]
where $S_{XY}$ is the sample covariance and $S_X$, $S_Y$ are the sample standard deviations.

7 Linear Regression

Suppose we have a bivariate data set $(x_i, y_i)$, $i = 1, 2, \dots, N$, that looks like the scatter plot in Figure 1.

Figure 1: Linear Regression

We can think of this as manifestations of two random variables $(X_i, Y_i)$ that are correlated. From the figure, we deduce that the two variables are approximately linearly related. One can attempt to find the "best linear relation" fitting the data. To do so, one tries to find the equation of the straight line $y = ax + b$ which minimizes the squared error
\[
\epsilon^2 = \frac{1}{2(N-1)}\sum_{i=1}^{N} \left(y_i - a x_i - b\right)^2 = \frac{1}{2(N-1)}\sum_{i=1}^{N} \epsilon_i^2 \tag{7.1}
\]
with respect to the parameters $a$ and $b$. This procedure is called the method of least squared error. We rewrite the squared error as
\[
\epsilon^2 = \frac{1}{2(N-1)}\sum_{i=1}^{N} \left(y_i - \overline{Y} - a\left(x_i - \overline{X}\right) + \overline{Y} - a\overline{X} - b\right)^2 , \tag{7.2}
\]
with $\overline{X} = \frac{1}{N}\sum_{i=1}^{N} x_i$ and $\overline{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i$. Expanding the $(\cdots)^2$ part and summing over $i$, we obtain
\[
\epsilon^2 = \frac{1}{2}\left(S_Y^2 + a^2 S_X^2 - 2 a S_{XY} + \frac{N}{N-1}\left(\overline{Y} - a\overline{X} - b\right)^2\right) . \tag{7.3}
\]
To find the minimum value of $\epsilon^2$ as a function of $a$ and $b$, we set the two partial derivatives to zero:
\[
\frac{\partial \epsilon^2}{\partial a} = a S_X^2 - S_{XY} - \frac{N}{N-1}\left(\overline{Y} - a\overline{X} - b\right)\overline{X} = 0 , \qquad
\frac{\partial \epsilon^2}{\partial b} = -\frac{N}{N-1}\left(\overline{Y} - a\overline{X} - b\right) = 0 . \tag{7.4}
\]
The above equations are solved by
\[
a = \frac{S_{XY}}{S_X^2} = r_{XY}\,\frac{S_Y}{S_X} , \qquad b = \overline{Y} - a\overline{X} . \tag{7.5}
\]
So the slope is given in terms of the sample correlation coefficient and the sample standard deviations of the $x_i$ and the $y_i$. We can also check, by computing the second derivatives, that the stationary point is a minimum. The resulting straight line
\[
y = r_{XY}\,\frac{S_Y}{S_X}\left(x - \overline{X}\right) + \overline{Y} \tag{7.6}
\]
is called the regression line. The value of $\epsilon^2$ for the minimizing values of $a$ and $b$ is
\[
\epsilon^2_{\mathrm{min}} = \frac{1}{2}\left(S_Y^2 + r_{XY}^2 S_Y^2 - 2 r_{XY}^2 S_Y^2\right) = \frac{1}{2}\,S_Y^2\left(1 - r_{XY}^2\right) . \tag{7.7}
\]
Since $|r_{XY}| \le 1$, we have $\epsilon^2_{\mathrm{min}} = 0 \iff |r_{XY}| = 1$, which means $\epsilon^2_{\mathrm{min}} = 0$ is realized if and only if the points $(x_i, y_i)$ lie exactly on a straight line.

Note that the idea of a least squares fit is not restricted to a linear relation. We could, for instance, find that the $(x_i, y_i)$ are scattered around some parabolic shape, in which case we would seek to minimize
\[
\epsilon^2 = \frac{1}{2(N-1)}\sum_{i} \left(y_i - a_2 x_i^2 - a_1 x_i - a_0\right)^2 \tag{7.8}
\]
with respect to $a_0$, $a_1$, $a_2$.

In the above least-squares fit we attempted to find the best values of $a$ and $b$ that would minimize
\[
\frac{1}{2(N-1)}\sum_{i} \left(y_i - a x_i - b\right)^2 .
\]
This can be interpreted as using the $x_i$ to predict the $y_i$. One could turn this around and attempt a least squares fit using
\[
\hat{\epsilon}^2 = \frac{1}{2(N-1)}\sum_{i} \left(x_i - \alpha y_i - \beta\right)^2 . \tag{7.9}
\]
Here, we use the $y_i$ as predictors for the $x_i$. Using the same reorganization of the terms as above, we have
\[
\hat{\epsilon}^2 = \frac{1}{2}\left(S_X^2 + \alpha^2 S_Y^2 - 2 \alpha S_{XY} + \frac{N}{N-1}\left(\overline{X} - \alpha\overline{Y} - \beta\right)^2\right) .
\]
We calculate the derivatives,
\[
\frac{\partial \hat{\epsilon}^2}{\partial \alpha} = \alpha S_Y^2 - S_{XY} - \frac{N}{N-1}\left(\overline{X} - \alpha\overline{Y} - \beta\right)\overline{Y} , \tag{7.10}
\]
\[
\frac{\partial \hat{\epsilon}^2}{\partial \beta} = -\frac{N}{N-1}\left(\overline{X} - \alpha\overline{Y} - \beta\right) . \tag{7.11}
\]
Setting these to zero, the solution is
\[
\alpha = r_{XY}\,\frac{S_X}{S_Y} , \qquad \beta = \overline{X} - \alpha\overline{Y} , \tag{7.12}
\]
and the resulting regression line is
\[
y = \frac{1}{r_{XY}}\,\frac{S_Y}{S_X}\left(x - \overline{X}\right) + \overline{Y} . \tag{7.13}
\]
Note that the two regression lines (7.6) and (7.13) are not identical unless $|r_{XY}| = 1$. The reason is that we are minimizing different error measures: the mean square deviations in the $y$ direction in the first case, and the mean square deviations in the $x$ direction in the second case.
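As a small illustration of the two fits (a Python sketch; the data values and helper name are invented for the example), both regression lines can be computed directly from the sample statistics (6.1), (6.2) and (7.5):

```python
from statistics import mean, stdev

def regression_lines(xs, ys):
    """Slope and intercept of the y-on-x line (7.6) and the x-on-y line (7.13), each as y = slope * x + intercept."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)   # sample covariance (6.1)
    r = sxy / (sx * sy)                                                    # sample correlation (6.2)
    a1 = r * sy / sx                      # slope of (7.6)
    a2 = sy / (r * sx)                    # slope of (7.13); requires r != 0
    return (a1, ybar - a1 * xbar), (a2, ybar - a2 * xbar)

# Made-up, roughly linear data; the two lines coincide only when |r_XY| = 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
print(regression_lines(xs, ys))
```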