Estimation with Large Sample of Data: Part I

Ba Chu
Web: http://www.carleton.ca/~bchu

(Note that this is a lecture note. Please refer to the textbooks suggested in the course outline for details. Examples will be given and explained in class.)

1 Objectives

At the end of Lecture 04, I emphasized that, because of sampling variation and small sample sizes, sample estimates of population quantities are rarely precise. The present lecture is an early step in exploring this phenomenon in greater depth. To ease students into this rather complicated topic, I will first introduce two important theorems that govern the sampling distribution of the average, $\bar{X}_n$: 1) the weak law of large numbers (WLLN) and 2) the central limit theorem (CLT). The main assumption that I impose throughout this lecture is that the data in our sample are IID.

2 (Weak) Law of Large Numbers (WLLN)

We know from classical probability that if a coin is tossed once, we cannot predict the outcome, but if the coin is fair, the probability of getting a head is 1/2 and the probability of getting a tail is 1/2. But what happens if we toss the coin 100 times? Will we get 50 heads? Common sense tells us that most of the time we will not get exactly 50 heads, but we should get close to 50 heads. What will happen if we toss a coin 1000 times? Will we get exactly 500 heads? Probably not. However, as the number of tosses increases, the ratio of the number of heads to the total number of tosses will get closer to 1/2. This phenomenon is known as the law of large numbers. This law holds for any type of gambling game such as rolling dice, playing roulette, etc.

Suppose that a random variable, $X$, follows a distribution, $P$, and that a scientist wants to estimate the population mean, $\mu = E[X]$. To do so, the scientist draws $n$ independent copies of $X$, say $X_1, \ldots, X_n$, from $P$, then computes
\[
\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i .
\]
The random variable $\bar{X}_n$ is called the sample mean.
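The coin-tossing behaviour described above can be illustrated with a short simulation. The following is only a minimal sketch: the function name `running_head_ratio`, the seed, and the number of tosses are my own illustrative choices and are not part of the lecture. It records the sample mean (the ratio of heads to tosses) after each toss of a simulated fair coin.

```python
import random


def running_head_ratio(n_tosses, p=0.5, seed=0):
    """Return the ratio of heads to tosses after each of n_tosses tosses.

    Each toss is a Bernoulli(p) draw; the seed only makes the run repeatable.
    """
    rng = random.Random(seed)
    heads = 0
    ratios = []
    for i in range(1, n_tosses + 1):
        heads += rng.random() < p   # counts 1 for a head, 0 for a tail
        ratios.append(heads / i)    # sample mean of the first i tosses
    return ratios


ratios = running_head_ratio(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, ratios[n - 1])
```

In a typical run, the printed ratios wander for small n and settle near 1/2 as n grows, although the number of heads is rarely exactly n/2.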
The question is how large $n$ must be for the sample mean to be sufficiently close to the true mean, $\mu$. The short answer is that large samples are more reliable than small or moderate samples. Suppose that $X_1, \ldots, X_n$ have a finite mean $E[X_i] = \mu$ and a finite variance $Var(X_i) = \sigma^2$; we shall study the behaviour of $\bar{X}_n$ as $n$ increases. We can notice at the outset that, although the expectations of $X$ and $\bar{X}_n$ are equal, i.e., $E[\bar{X}_n] = \mu$, the variances are different:
\[
Var(\bar{X}_n) = \frac{1}{n^2} \sum_{i=1}^{n} Var(X_i) = \frac{\sigma^2}{n} .
\]
Hence, the sample mean has less variability than any of the individual random variables that are being averaged. Averaging decreases variation, i.e., as $n \to \infty$, $Var(\bar{X}_n) \to 0$. This suggests that, if the population mean, $\mu$, is unknown, we can draw inferences about it by observing the behaviour of the sample mean, $\bar{X}_n$. To do this, we need to study the WLLN and the CLT. We shall begin with a definition:

Definition 1. A sequence of random variables $X_1, X_2, \ldots$ converges in probability to a constant, $c$, written as $X_n \xrightarrow{p} c$, if and only if, for every $\epsilon > 0$,
\[
\lim_{n \to \infty} P\big(X_n \in (c - \epsilon, c + \epsilon)\big) = 1 .
\]

[Figure 1 is about here.]

The above concept allows us to state an important result.

Theorem 1 (WLLN). Let $X_1, X_2, \ldots$ be any sequence of IID random variables having finite mean $\mu$ and finite variance $\sigma^2$. Then $\bar{X}_n \xrightarrow{p} \mu$.

Corollary 2 (Law of Averages). Let $A$ be any event and consider a sequence of IID experiments in which we observe whether or not $A$ occurs. Let $p = P(A)$ and define IID random variables by
\[
X_i =
\begin{cases}
1, & A \text{ occurs}, \\
0, & A^c \text{ occurs}.
\end{cases}
\]
Then $X_i \sim \text{Bernoulli}(p)$, $\bar{X}_n$ is the observed frequency with which $A$ occurs in $n$ trials, and $\mu = E[X_i] = p$ is the theoretical probability of $A$. The WLLN states that the former tends to the latter as the number of trials increases.

Remark 2.1.
The WLLN formalizes our common experience that “things tend to average out in the long run.” For instance, we might be surprised if we tossed a fair coin $n = 10$ times and observed $\bar{X}_{10} = 0.9$; however, if we knew that the coin was indeed fair ($p = 0.5$), then we would remain confident that, as $n$ increased, $\bar{X}_n$ would eventually tend to 0.5.

3 The Central Limit Theorem (CLT)

The WLLN states the precise fact that the distribution of the sample mean collapses to the population mean as the sample size increases. However, it leaves several obvious questions unanswered:

1. How rapidly does the sample mean tend toward the population mean?
2. How does the shape of the sample mean's distribution change as the sample mean tends toward the population mean?

To answer the above questions, we need to convert the random variables to standardized random variables. This can be done as in the following table:

[Table 1 is about here.]

Notice that standardizing a random variable does not change the shape of its distribution. Let
\[
Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}
\]
denote the standardized sample mean, on which we shall focus our attention.

We begin by observing that $Var(\bar{X}_n - \mu) = (\sigma / \sqrt{n})^2$. The WLLN states that $\bar{X}_n - \mu \xrightarrow{p} 0$, so the factor $1/\sqrt{n}$ measures how rapidly the sample mean tends toward the population mean. Now, we can answer the second question above by studying the behaviour of $Z_n$ as $n$ becomes large. The following theorem is one of the most remarkable and useful results in all of mathematics. It is fundamental to the study of statistics.

Theorem 3 (CLT). Let $X_1, X_2, \ldots$ be any sequence of IID random variables having finite mean $\mu$ and finite variance $\sigma^2$. Let $F_n$ denote the cdf of $Z_n$, and let $\Phi$ denote the cdf of the standard normal distribution. Then, for any fixed value $z \in \mathbb{R}$, we have
\[
F_n(z) = P(Z_n \le z) \to \Phi(z), \quad \text{as } n \to \infty .
\]
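The statement of Theorem 3 can be explored numerically. The sketch below is an illustration under my own assumptions and is not the lecture's simulation code: it takes the $X_i$ to be IID $\chi^2(2)$ (mean 2, variance 4), uses 4,000 replications, and compares the empirical cdf of $Z_n$ with $\Phi$ at the three points $z = -1, 0, 1$. All function names, the seed, and the grid are hypothetical choices.

```python
import math
import random


def standard_normal_cdf(z):
    """Phi(z), written with the error function from the standard library."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulate_standardized_means(n, reps, seed=0):
    """Return `reps` simulated values of Z_n for IID chi-square(2) data."""
    rng = random.Random(seed)
    mu, sigma = 2.0, 2.0  # mean and standard deviation of chi-square(2)
    z_values = []
    for _ in range(reps):
        # chi-square(2) coincides with the exponential distribution with
        # mean 2 (rate 1/2), so expovariate(0.5) draws from it.
        sample_mean = sum(rng.expovariate(0.5) for _ in range(n)) / n
        z_values.append((sample_mean - mu) / (sigma / math.sqrt(n)))
    return z_values


def empirical_cdf(values, z):
    """Fraction of simulated values that are less than or equal to z."""
    return sum(v <= z for v in values) / len(values)


grid = (-1.0, 0.0, 1.0)
max_gap = {}
for n in (2, 30, 200):
    z_values = simulate_standardized_means(n, reps=4000)
    # Largest distance between the empirical cdf F_n and Phi on the grid.
    max_gap[n] = max(
        abs(empirical_cdf(z_values, z) - standard_normal_cdf(z)) for z in grid
    )
    print(n, round(max_gap[n], 3))
```

The gap should shrink as n grows, up to simulation noise. A three-point grid is a crude check, and a finer grid or a histogram would give a fuller picture.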
The CLT states that the behaviour of the standardized average of a large number of IID random variables will resemble the behaviour of a standard normal random variable. This is true regardless of the distribution of the random variables that are being averaged, provided that it has a finite mean and a finite variance. Thus, the CLT allows us to approximate a variety of probability distributions with the standard normal distribution. As a common rule of thumb, this approximation is regarded as sufficiently precise if $n > 30$, although strongly skewed distributions may require a larger $n$.

Now, I present some simulation studies to assess the accuracy of the CLT. I assume that $X_1, \ldots, X_n$ are IID $\chi^2(2)$ with mean $\mu = 2$ and variance $\sigma^2 = 4$. If I draw a sample of $n$ values from this distribution 25,000 times, compute the mean and the standard deviation of each of these 25,000 samples, and plot a histogram of the results, I obtain the following graphs:

[Figure 2 is about here.]

In the case where $X_1, \ldots, X_n$ are the outcomes of Bernoulli trials with success probability $p$, so that $E[X_i] = p$ and $Var(X_i) = p(1 - p)$, the number of successes $S_n = \sum_{i=1}^{n} X_i = n \bar{X}_n$ satisfies $E[S_n] = np$ and $Var(S_n) = np(1 - p)$, and
\[
\frac{S_n - np}{\sqrt{np(1 - p)}} = \frac{\bar{X}_n - p}{\sqrt{p(1 - p)/n}}
\]
is approximately distributed as $N(0, 1)$ as $n \to \infty$. This result was first established by De Moivre in the 1730s.

Example 1. My friend, John, is attempting to have formal dates. He is very confident that his success probability is 50%. If he replicates his dating experience 36 times, then what is the probability that his sample success probability will fall within 0.1 of the true success probability?

Answer: Let $X_i$ denote the dating result obtained from replication $i$, for $i = 1, \ldots, 36$. This is a Bernoulli random variable with expectation $E[X_i] = \mu = 0.5$ and variance $Var(X_i) = 0.5(1 - 0.5) = 0.5^2$, so that $\sigma/\sqrt{36} = 0.5/6$. Let $Z \sim N(0, 1)$. Then, applying the CLT,
\[
\begin{aligned}
P(\mu - 0.1 < \bar{X}_{36} < \mu + 0.1)
&= P\left( -\frac{0.1}{0.5/6} < \frac{\bar{X}_{36} - \mu}{0.5/6} < \frac{0.1}{0.5/6} \right) \\
&\approx P(-1.2 < Z < 1.2) = \Phi(1.2) - \Phi(-1.2) = 0.7698 .
\end{aligned}
\]

Example 2. Suppose that John will try to replicate his dating experience an additional 10 times. What is the probability that his sample success probability
with respect to 36 dates will fall within 0.1 of that with respect to 46 dates?

Answer: Note that, from Theorem 3, we have $\bar{X}_n \approx N(\mu, \sigma^2/n)$. Treating $\bar{X}_{36}$ and $\bar{X}_{46}$ as independent, it is straightforward to obtain
\[
\bar{X}_{36} - \bar{X}_{46} \approx N\left(0, \frac{0.25}{36} + \frac{0.25}{46}\right) = N(0, 0.111262^2) .
\]
Standardizing, it follows that
\[
\begin{aligned}
P(-0.1 < \bar{X}_{36} - \bar{X}_{46} < 0.1)
&= P\left( -\frac{0.1}{0.111262} < \frac{\bar{X}_{36} - \bar{X}_{46}}{0.111262} < \frac{0.1}{0.111262} \right) \\
&\approx P(-0.89878 < Z < 0.89878) = \Phi(0.89878) - \Phi(-0.89878) \approx 0.631 .
\end{aligned}
\]
(Strictly speaking, the 46 dates include the first 36, so the two sample means are positively correlated; accounting for this overlap gives the smaller variance $0.25(1/36 - 1/46)$ and hence a larger probability.)

I conclude this section with a warning. Statisticians usually apply the CLT in order to approximate the distribution of a sum or an average of random variables, $X_i$, that are observed in an experiment. These random variables need not be normally distributed themselves; indeed, the glamour of the CLT is that it does not assume the normality of $X_i$.

4 Exercises

1. Suppose that I toss a fair coin 100 times and observe 60 Heads. Now, I decide to toss the same coin another 100 times. Does the WLLN or the Law of Averages imply that I should expect to observe another 40 Heads?

2. Suppose that a die has the following probabilities of producing the 4 possible uppermost faces: $P(1) = P(6) = 0.1$, $P(3) = P(4) = 0.4$. This die is to be thrown 100 times. Let $X_i$ denote the value of the uppermost face that results from throw $i$.
(a) Compute the expected value and the variance of $X_i$.
(b) Compute the probability that the average value of the 100 throws will exceed 3.6.

3. It has been found that 2% of the tools produced by a certain machine are defective. What is the probability that in a shipment of 400 such tools (a) 4% or more and (b) 2% or less will be defective?

4. A financial theory posits that daily fluctuations in stock prices are independent random variables. Suppose that the daily price fluctuations of a certain blue-chip stock [1] are IID random variables $X_1, X_2, \ldots$, with $E[X_i] = 0.01$ and $Var(X_i) = 0.01$. (Thus, if today's price of this stock is \$50, then tomorrow's price is \$50 + $X_1$, etc.)
Suppose that the daily price fluctuations of a certain internet stock are IID random variables $Y_1, Y_2, \ldots$, with $E[Y_j] = 0$ and $Var(Y_j) = 0.25$. Now suppose that both stocks are currently selling for \$50/share and you wish to invest \$50 in one of these two stocks for a period of 400 days. Assume that the costs of purchasing and selling a share of either stock are zero.
(a) Approximate the probability that you will make a profit on your investment if you purchase a share of the blue-chip stock.
(b) Approximate the probability that you will make a profit on your investment if you purchase a share of the internet stock.
(c) Approximate the probability that you will make a profit of at least \$20 if you purchase a share of the blue-chip stock.
(d) Approximate the probability that you will make a profit of at least \$20 if you purchase a share of the internet stock.
(e) Assuming that the internet stock fluctuations and the blue-chip stock fluctuations are independent, approximate the probability that, after 400 days, the price of the internet stock will exceed the price of the blue-chip stock.

[1] Please check the link: http://en.wikipedia.org/wiki/Blue_chip_(stock_market) if you do not know about blue-chip stocks.

Table 1: Standardizing random variables

| Random variable | Expected value | Standard deviation | Standardized random variable |
|---|---|---|---|
| $X_i$ | $\mu$ | $\sigma$ | $\dfrac{X_i - \mu}{\sigma}$ |
| $\sum_{i=1}^{n} X_i$ | $n\mu$ | $\sqrt{n}\,\sigma$ | $\dfrac{\sum_{i=1}^{n} X_i - n\mu}{\sqrt{n}\,\sigma}$ |
| $\bar{X}_n$ | $\mu$ | $\sigma/\sqrt{n}$ | $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$ |

Figure 1: An example of convergence in probability.

Figure 2: An illustration of the CLT. Note that the shapes approach the bell shape.
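As a numerical check on the normal probabilities in Examples 1 and 2 of Section 3, the following sketch evaluates $\Phi$ through the error function. The helper names `standard_normal_cdf` and `prob_within` are hypothetical and not part of the lecture. For Example 2 the code simply reuses the variance stated in the text, $0.25/36 + 0.25/46$, which treats the two sample means as independent.

```python
import math


def standard_normal_cdf(z):
    """Phi(z) for a standard normal random variable."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_within(delta, sd):
    """P(-delta < W < delta) for W ~ N(0, sd^2)."""
    z = delta / sd
    return standard_normal_cdf(z) - standard_normal_cdf(-z)


p = 0.5  # success probability assumed in both examples

# Example 1: standard deviation of the sample mean of 36 Bernoulli(p) trials.
sd_36 = math.sqrt(p * (1 - p) / 36)       # equals 0.5 / 6
example_1 = prob_within(0.1, sd_36)

# Example 2: variance as stated in the text (independence assumed there).
sd_diff = math.sqrt(p * (1 - p) / 36 + p * (1 - p) / 46)
example_2 = prob_within(0.1, sd_diff)

print(round(example_1, 4), round(example_2, 4))
```

The same two helpers can be reused for the exercises in Section 4 once the relevant mean and standard deviation have been worked out by hand.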