One-sample Tests of Hypothesis
Ka-fu WONG
23 June 2007

Abstract
Like a confidence interval, a hypothesis test is a probability statement. Often, we have to decide whether to accept or reject a claim or statement made by others about population parameters, based on information from a random sample. The concepts covered in this chapter are very important and can be extended to more complicated situations, so a thorough understanding of this chapter is essential. Although it might not appear obvious, there is in fact a close connection between constructing confidence intervals and testing hypotheses. Thus, this chapter will also serve as a test of our understanding of confidence intervals.

We all have some beliefs about things around us. Most of the time, a belief is correct; sometimes it turns out to be incorrect. When do we reject our original belief about these things? We do so only if we see strong evidence against it. In the early days it was believed that the earth was a flat plane. Some pioneering astronomers proposed that the earth was shaped like a ball. Although it is now obvious to us, it actually took very strong evidence and decades to convince people that the earth was indeed ball-shaped.

1 Hypothesis
Definition 1 (Hypothesis): A hypothesis is a statement about the value of a population parameter, developed for the purpose of testing.

Example 1 (Hypothesis):
1. The current unemployment rate is 5%.
2. The current unemployment rate is larger than 5%.
3. At least 20% of people in the local economy earn less than 2000 dollars a month.
4. The mean monthly income for systems analysts is $3,625.
5. Twenty percent of all customers at Spaghetti House return for another meal within a month.

To illustrate the idea of testing a hypothesis, consider the following example.
Example 2 (Extreme case: rejecting a hypothesis based on only one observation): After a midterm exam, the teaching assistant of a class of 127 students announced that the average mark was 70. David got only 65 and suspected that the true average mark was lower than 70. Based on his own result of 65, which is less than the announced average of 70, should David reject the teaching assistant's statement about the population average mark in favor of his own hypothesis? Probably not, because in most cases (with a reasonable standard deviation), the chance of seeing a random draw of 65 or lower when the true average is 70 is quite high. For instance, suppose the marks of the 127 students are normally distributed with mean 70 and standard deviation 8. The chance of seeing a random draw of 65 or lower is

$$ \mathrm{Prob}\left(x < 65 \mid x \sim N(70, 8^2)\right) = \mathrm{Prob}\left(\frac{x - 70}{8} < \frac{65 - 70}{8}\right) = \mathrm{Prob}(z < -0.625) = 0.266, $$

which is usually not viewed as small. For the sake of illustration, suppose instead that the marks are normally distributed with mean 70 and standard deviation 2. The chance of seeing a random draw of 65 or lower is

$$ \mathrm{Prob}\left(x < 65 \mid x \sim N(70, 2^2)\right) = \mathrm{Prob}\left(\frac{x - 70}{2} < \frac{65 - 70}{2}\right) = \mathrm{Prob}(z < -2.5) = 0.00621, $$

which is usually viewed as small. Thus, depending on the dispersion (variance) of the population, one observation often cannot be considered hard evidence against a hypothesis. As we learned in previous chapters, we can improve the precision of our estimator of a population parameter with a larger sample. Thus, if the standard deviation of the marks were 8, David might want to collect more information before deciding whether to reject the statement.

Example 3 (Extreme case: rejecting a hypothesis based on the population): After a midterm exam, the teaching assistant of a class of 127 students announced that the average mark was 70. David, who believed he had done a very good job in the exam, got only 65.
He was shocked to learn that the average mark was 70 and suspected that the true average mark was lower. To verify his hypothesis, he sent an email to all students taking the course to gather their marks. Suppose all students told him their marks truthfully. He was then able to compute the population average mark, which turned out to be 60. Therefore, he rejected the teaching assistant's statement about the population average mark in favor of his own hypothesis.

In this extreme example, David obtained the population average mark to check against the teaching assistant's statement. In this case, rejecting the statement about the average midterm mark is not a probability statement. However, in many situations it is impossible to obtain information about the whole population. Instead, we have only a random sample (i.e., a subset) of the population. For instance, in the last example, it could be that only 30 students replied to David's email. Suppose the responses were random and the sample average was 60. Can we reject the teaching assistant's statement about the population average mark? Yes, we might, but with a chance (probability) of making a mistake. When we reject the statement, we want to tell readers how likely it is that we are making a mistake (rejecting the statement when it is actually correct). Generally, we want to keep this chance small. That is, the hypothesis that "the average mark is 70" is maintained until we observe very strong evidence against it. In a sense, we give the benefit of the doubt to this hypothesis, the so-called null hypothesis.1 The null hypothesis is presumed true until we prove beyond reasonable doubt that it is false. "Beyond reasonable doubt" means that "the probability of rejecting the maintained hypothesis when the null hypothesis is true" is less than an a priori level of significance (usually 10%, 5% or 1%).
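The two tail probabilities in Example 2 can be checked numerically. The sketch below is a minimal illustration, assuming Python 3.8+ so that the standard library's `statistics.NormalDist` is available; the helper name `prob_below` is hypothetical and not part of the original material.

```python
from statistics import NormalDist


def prob_below(mark: float, mean: float, sd: float) -> float:
    """Probability that one random draw from N(mean, sd^2) falls below `mark`."""
    return NormalDist(mu=mean, sigma=sd).cdf(mark)


# Example 2: announced average 70, David's mark 65.
p_sd8 = prob_below(65, mean=70, sd=8)  # standardized: z = (65 - 70) / 8 = -0.625
p_sd2 = prob_below(65, mean=70, sd=2)  # standardized: z = (65 - 70) / 2 = -2.5

print(f"sd = 8: Prob(x < 65) = {p_sd8:.3f}")  # about 0.266
print(f"sd = 2: Prob(x < 65) = {p_sd2:.5f}")  # about 0.00621
```

The same single mark of 65 is unremarkable when the standard deviation is 8 but rare when it is 2, which is exactly the contrast Example 2 draws.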
Definition 2 (Null and alternative hypothesis): The null hypothesis (often written as H0) is a maintained hypothesis. The alternative hypothesis (often written as Ha or H1) is the hypothesis we accept when the maintained hypothesis is rejected.

Example 4 (Null and alternative hypothesis): In the example, the null and alternative hypotheses can be stated as either of the following:
1. H0: The average mark is 70. H1: The average mark is not 70.
2. H0: The average mark is 70. H1: The average mark is less than 70.

Note that the second pair of hypotheses differs from the first in that the alternative is one-sided (or one-tail): "less than 70" instead of "not equal to 70". In essence, "the average mark is not 70" means that the average mark is either less than or greater than 70. Thus, a test of the first pair of hypotheses may be called a two-sided test or a two-tail test.

1 In court, the defendant is presumed innocent until proven guilty of the stated charges beyond reasonable doubt.

In the examples above, David's claim that the average mark is lower than 70 is like charging a person (the teaching assistant) with being guilty. Like the court, we give the benefit of the doubt to the opposite statement that the average mark is 70.

2 Type I and Type II Errors
Our acceptance or rejection of the null hypothesis is based on sample information. Sampling error prevents us from knowing the truth exactly. Thus, there is some probability that we will make a mistake in our acceptance or rejection decision. There are two types of mistakes (errors):
1. Type I error: rejecting the null hypothesis when the null hypothesis is actually correct.
2. Type II error: accepting the null hypothesis when the null hypothesis is actually false.

| Decision | Truth: null true | Truth: null false |
|---|---|---|
| Accept null | Correct decision | Type II error |
| Reject null | Type I error | Correct decision |

Theoretically, we would like to minimize both the probability of a Type I error and the probability of a Type II error.
However, there is a trade-off between the two errors. In one extreme, we can avoid Type I errors completely by never rejecting the null, i.e., Prob(reject null | null true) = 0. But then we always commit a Type II error whenever the null is false, i.e., Prob(accept null | null false) = 1. In the other extreme, to avoid Type II errors completely, we always reject the null, i.e., Prob(reject null | null false) = 1. But then we always commit a Type I error whenever the null is true, i.e., Prob(reject null | null true) = 1. In practice, for practitioners at least, we often focus on reducing the probability of committing a Type I error to some tolerable level. This tolerable probability of committing a Type I error is known as the level of significance.

Definition 3 (Level of significance): The level of significance (often denoted as α) is the probability of rejecting the null hypothesis when it is actually true. That is, Prob(reject null | null true) = α.

A test of hypothesis is based on sample information. A summary of the sample information is called a statistic. A test statistic is a statistic developed for the purpose of testing a hypothesis.

Definition 4 (Test statistic): A test statistic is a value, determined from sample information, used to decide whether or not to reject the null hypothesis.

If, under the null hypothesis, the probability of observing a test statistic at least as extreme as the one computed from the sample is less than α, the null is rejected. From a sample, we can derive many statistics, and some are better than others for testing our hypothesis. Which statistic to use depends on the type of hypothesis we have. If we are testing whether the population mean is equal to zero, we naturally use the sample mean as a test statistic. In order to talk about the probability of observing the sample statistic, we have to know the probability distribution of the statistic, which is generally a random variable.
Often, we standardize the test statistic so that it follows a common distribution, such as the standard normal distribution or the Student-t distribution. From the distribution of the (random) test statistic, we can find the critical value for our decision.

Definition 5 (Critical value): The critical value is the dividing point between the region where the null hypothesis is rejected and the region where it is not rejected.

3 Testing population mean
Often, we are asked to test whether the population mean equals some number. Suppose we have a population with mean µ and variance σ², and suppose that σ² is known. We are asked to test whether the population mean equals k (H0: µ = k versus H1: µ ≠ k) based on a sample of n observations (say, n > 30) drawn from the population.

First, the logical test statistic is the sample mean. Let m denote the sample mean, viewed as a random variable. If the observed sample mean differs very much from k (the value the population mean takes under H0), i.e., when m is too small or too big, we reject the null hypothesis.

Second, by the Central Limit Theorem, the sample mean of n i.i.d. observations from the population is approximately normal, with mean k and variance σ²/n under the null. Hence, under the null,

$$ z = \frac{m - k}{\sqrt{\sigma^2/n}} \overset{A}{\sim} N(0, 1). $$

Thus, if the observed z (denoted as ẑ) differs very much from 0 (the mean of z under H0), i.e., when ẑ is much smaller or much bigger than 0, we reject the null hypothesis. Suppose we want to keep the probability of a Type I error at 0.05 (i.e., α = 0.05). Then we look for a c1 (much larger than zero) and a c2 (much smaller than zero) such that Prob(z > c1) + Prob(z < c2) = 0.05, and we reject the null when ẑ > c1 or ẑ < c2. Because the normal distribution is symmetric, and for convenience, we usually set c2 = −c1.
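The two-sided decision rule just described can be written as a small function. This is a sketch under stated assumptions: σ² is known, the sample mean has already been computed, and the function name `two_sided_z_test` and the numbers in the usage line are hypothetical. Besides c1 and c2, it returns the matching cut-offs for the sample mean, d1 = k + c1·√(σ²/n) and d2 = k + c2·√(σ²/n), which the text derives next.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(m: float, k: float, sigma2: float, n: int, alpha: float = 0.05):
    """Test H0: mu = k against H1: mu != k when the population variance sigma2 is known.

    Returns the observed z, the critical values (c2, c1) for z, the matching
    critical values (d2, d1) for the sample mean, and the reject decision.
    """
    se = sqrt(sigma2 / n)                      # standard deviation of m under H0
    z_hat = (m - k) / se                       # standardized test statistic
    c1 = NormalDist().inv_cdf(1 - alpha / 2)   # Prob(z > c1) = alpha / 2
    c2 = -c1                                   # symmetric choice, Prob(z < c2) = alpha / 2
    d1, d2 = k + c1 * se, k + c2 * se          # the same cut-offs on the scale of m
    reject = z_hat > c1 or z_hat < c2          # equivalent to m > d1 or m < d2
    return z_hat, (c2, c1), (d2, d1), reject


# Hypothetical numbers: k = 70, sigma^2 = 16, n = 36, observed sample mean 71.5.
z_hat, (c2, c1), (d2, d1), reject = two_sided_z_test(m=71.5, k=70, sigma2=16, n=36)
print(f"z = {z_hat:.2f}, c1 = {c1:.3f}, d1 = {d1:.3f}, reject H0: {reject}")
```

With these made-up numbers, ẑ = 2.25 exceeds c1 ≈ 1.96 (equivalently, 71.5 exceeds d1 ≈ 71.31), so H0 is rejected at α = 0.05; a sample mean of 70.9 would give ẑ = 1.35 and no rejection.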
From the standard normal table, c1 must be 1.96 for Prob(z > c1) + Prob(z < −c1) = 0.05, i.e., Prob(|z| > 1.96) = 0.05. The critical values c1 and c2 for z can be converted into critical values for m. Let d1 and d2 be these critical values for m. They can be computed from the following relationships:

$$ c_1 = \frac{d_1 - k}{\sqrt{\sigma^2/n}} \;\Rightarrow\; d_1 = k + c_1\sqrt{\sigma^2/n}, \qquad c_2 = \frac{d_2 - k}{\sqrt{\sigma^2/n}} \;\Rightarrow\; d_2 = k + c_2\sqrt{\sigma^2/n}. $$

We reject the null if m > d1 or m < d2.

Example 5 (Significance level, critical value and the null): Suppose we want to test a hypothesis about the population mean, µ, using a sample of n (large enough) observations from a huge population with variance σ². The null and alternative are H0: µ = k versus H1: µ ≠ k. The Central Limit Theorem gives us the following picture of the distribution of the sample mean, m:

[Figure: sampling distribution of m under H0, centred at k, with critical values d2 < k < d1; sample means m2 (below d2) and m1 (above d1) fall in the rejection region, while m3 lies between d2 and d1.]

As mentioned above, whenever our sample mean satisfies m < d2 or m > d1 (like m2 and m1 in the figure), we reject H0. But we do not reject H0 for a sample mean (like m3) falling between d2 and d1.

Simulation 1 (Testing hypothesis): Let's simulate a test of the hypothesis H0: µ = 0 versus H1: µ ≠ 0 when the true µ takes different values.
1. Fix a µ, say, µ = 0.
2. Generate n (= 50) observations from x = µ + ε, ε ∼ N(0, 1), so that x ∼ N(µ, 1).
3. For this sample, compute
$$ z = \frac{m - 0}{\sqrt{\sigma^2/n}}. $$
Reject H0 in favor of H1 if |z| > z_{α/2}. Do this for different values of α.
4. Repeat the last two steps 1000 times and compute the percentage of samples that reject H0.

The following table reports the percentage of the simulated samples in which the null (H0: µ = 0) is rejected.
| α | Rejection rule | µ = 0 | µ = 0.1 | µ = 0.3 | µ = 0.5 | µ = 1 | µ = 2 | µ = 5 |
|---|---|---|---|---|---|---|---|---|
| 0.01 | \|z\| > 2.576 | 1.20% | 2.80% | 32.90% | 85.90% | 100.00% | 100.00% | 100.00% |
| 0.05 | \|z\| > 1.960 | 4.40% | 10.80% | 60.80% | 94.80% | 100.00% | 100.00% | 100.00% |
| 0.10 | \|z\| > 1.645 | 8.70% | 18.40% | 70.80% | 97.40% | 100.00% | 100.00% | 100.00% |

We can see that the percentage of simulated samples rejecting the null is close to α when the null is true, i.e., when µ is in fact zero. When µ is not zero, the rejection rate is higher than α because m is more likely to be far from 0, the value hypothesized under the null. This higher rejection rate is more apparent the more the true mean differs from the hypothesized value. [Reference: Sim1.xls]

4 Testing hypothesis when the variance is unknown
4.1 Variance unknown but sample size reasonably large
In the discussion above, we have assumed that the population variance is known. However, in real situations the population variance is usually unknown and has to be estimated. It turns out, as the following simulation shows, that we can still treat the test statistic as normal if the sample size is reasonably large (n > 30).

Simulation 2 (Does it matter if we know the population variance?): Let's simulate a test of the hypothesis H0: µ = 0 versus H1: µ ≠ 0 when the true µ takes different values and the population variance has to be estimated.
1. Fix a µ, say, µ = 0.
2. Generate n (= 50) observations from x = µ + ε, ε ∼ N(0, 1), so that x ∼ N(µ, 1).
3. For this sample, compute
$$ z = \frac{m - 0}{\sqrt{s^2/n}}, $$
where s² is the sample variance. Reject H0 in favor of H1 if |z| > z_{α/2}. Do this for different values of α.
4. Repeat the last two steps 1000 times and compute the percentage of samples that reject H0.

The following table summarizes our simulation results.
| α | Rejection rule | µ = 0 | µ = 0.1 | µ = 0.3 | µ = 0.5 | µ = 1 | µ = 2 | µ = 5 |
|---|---|---|---|---|---|---|---|---|
| 0.01 | \|z\| > 2.576 | 1.20% | 3.30% | 35.00% | 85.00% | 100.00% | 100.00% | 100.00% |
| 0.05 | \|z\| > 1.960 | 4.80% | 11.30% | 59.90% | 94.30% | 100.00% | 100.00% | 100.00% |
| 0.10 | \|z\| > 1.645 | 8.80% | 19.50% | 71.20% | 96.80% | 100.00% | 100.00% | 100.00% |

Thus, using the estimated variance in place of the population variance is fine when the sample size is large. [Reference: Sim2.xls]

4.2 Variance unknown and sample size small
In the example above, n > 30. What if the number of observations is less than 30? We discussed similar situations in the last chapter; in fact, the discussion here is almost identical to the discussion there of constructing confidence intervals. This case is more complicated. When n < 30, we cannot rely on the Central Limit Theorem to deliver normality. However, if we are willing to impose some additional assumptions, we can still conduct hypothesis tests, in a slightly different manner. Basically, we have to assume that the population distribution is normal. If the population distribution is normal, the random variable m is exactly normal with mean k and variance σ²/n under the null.
1. If σ² is known, we have
$$ z = \frac{m - k}{\sqrt{\sigma^2/n}} \sim N(0, 1). $$
2. If σ² is unknown, we have to use its sample estimate, denoted s², as a substitute. We then have
$$ t = \frac{m - k}{\sqrt{s^2/n}} \sim t(df = n - 1). $$

Can we check the normality assumption (of the underlying population)? In theory, it might be checked. Note, however, that this assumption is needed only when the sample size is small, and with a small sample most tests of normality are likely to be unreliable. Because of this technical difficulty, the normality assumption is often made without any check.2

Simulation 3 (Standard normal or Student-t): We would like to investigate, via simulations, whether a test of the population mean depends on knowing σ. We test the hypothesis H0: µ = 0 versus H1: µ ≠ 0.
1. Generate one sample of n observations drawn from a probability distribution, N(0, 1) or U(−2, 2). The mean and variance of the distribution (denoted as µ and σ², respectively) can be computed; for U(−2, 2), µ = 0 and σ² = 4²/12 = 4/3.
2. Compute the sample mean and the sample variance,
$$ m = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - m)^2. $$
Test the hypothesis using three different test statistics:
(a) $$ t_1 = \frac{m - 0}{\sqrt{\sigma^2/n}}, $$ where σ² is assumed known. Reject H0 if |t1| > z_{α/2}, where z_{α/2} is the value such that the probability of the standard normal random variable z exceeding it equals α/2, i.e., Prob(z > z_{α/2}) = α/2. In essence, we assume t1 = (m − µ)/σ_m, where σ_m = σ/√n, to be standard normal.
(b) $$ t_2 = \frac{m - 0}{\sqrt{s^2/n}}, $$ where s² is the estimated variance. Reject H0 if |t2| > z_{α/2}, with z_{α/2} defined as in (a). In essence, we assume t2 = (m − µ)/s_m, where s_m = s/√n, to be standard normal.
(c) $$ t_3 = \frac{m - 0}{\sqrt{s^2/n}}, $$ where σ² is assumed unknown and is replaced by s². Reject H0 if |t3| > t_{n−1,α/2}, where t_{n−1,α/2} is the value such that the probability of a Student-t random variable t with n − 1 degrees of freedom exceeding it equals α/2, i.e., Prob(t > t_{n−1,α/2}) = α/2. In essence, we assume t3 = (m − µ)/s_m, where s_m = s/√n, to be Student-t with n − 1 degrees of freedom.
3. Repeat the last two steps 1000 times and compute the percentage of rejections in the simulated samples.

2 There are advanced statistical procedures for testing hypotheses without the assumption of normality. Examples include the jackknife and the bootstrap. Interested readers may take a look at Efron and Tibshirani (1993).
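The steps of Simulation 3 can be sketched as follows for n = 16 and α = 0.05. This is a minimal illustration, not the spreadsheet (Sim3.xls) behind the reported table, so its rejection rates will differ somewhat from the table because the random draws differ. Python's standard library has no Student-t quantile function, so the critical value t_{15,0.025} ≈ 2.131 is hard-coded from a t table (an assumption; a statistics library such as SciPy's `scipy.stats.t.ppf` could supply it instead). The seed and the function name `rejection_rates` are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance

ALPHA = 0.05
N = 16          # small sample, as in the n = 16 rows of the table
REPS = 1000     # number of simulated samples
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # z_{alpha/2}, about 1.960
T_CRIT = 2.131  # t_{15, alpha/2} read from a Student-t table (hard-coded assumption)


def rejection_rates(draw, sigma2, rng):
    """Share of REPS samples in which t1, t2 and t3 reject H0: mu = 0."""
    rejections = [0, 0, 0]
    for _ in range(REPS):
        x = [draw(rng) for _ in range(N)]
        m = fmean(x)                    # sample mean
        s2 = variance(x)                # sample variance with divisor n - 1
        t1 = m / sqrt(sigma2 / N)       # population variance treated as known
        t_est = m / sqrt(s2 / N)        # t2 and t3 share this statistic
        rejections[0] += abs(t1) > Z_CRIT
        rejections[1] += abs(t_est) > Z_CRIT   # t2: normal critical value
        rejections[2] += abs(t_est) > T_CRIT   # t3: Student-t critical value
    return [r / REPS for r in rejections]


rng = random.Random(2007)
normal_rates = rejection_rates(lambda r: r.gauss(0.0, 1.0), 1.0, rng)
# U(-2, 2) has mean 0 and variance (2 - (-2))**2 / 12 = 4/3.
uniform_rates = rejection_rates(lambda r: r.uniform(-2.0, 2.0), 4.0 / 3.0, rng)

print("N(0,1),  n = 16, [t1, t2, t3]:", normal_rates)
print("U(-2,2), n = 16, [t1, t2, t3]:", uniform_rates)
```

Because t2 and t3 use the same statistic and 1.960 < 2.131, t3 can only reject when t2 also rejects; t2's extra rejections are what push its rate above α in small samples.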
Simulation 3 results (percentage of simulated samples rejecting H0: µ = 0):

| Distribution | n | α | t1 | t2 | t3 |
|---|---|---|---|---|---|
| xi ∼ U(−2, 2) | 16 | 0.01 | 1.10% | 3.40% | 1.60% |
| xi ∼ U(−2, 2) | 40 | 0.01 | 0.90% | 1.40% | 1.10% |
| xi ∼ U(−2, 2) | 16 | 0.05 | 6.60% | 8.50% | 6.30% |
| xi ∼ U(−2, 2) | 40 | 0.05 | 5.20% | 6.00% | 5.20% |
| xi ∼ U(−2, 2) | 16 | 0.10 | 11.20% | 12.70% | 11.10% |
| xi ∼ U(−2, 2) | 40 | 0.10 | 9.60% | 10.30% | 9.80% |
| xi ∼ N(0, 1) | 16 | 0.01 | 0.90% | 1.40% | 0.60% |
| xi ∼ N(0, 1) | 40 | 0.01 | 1.40% | 1.40% | 1.30% |
| xi ∼ N(0, 1) | 16 | 0.05 | 5.10% | 6.90% | 4.20% |
| xi ∼ N(0, 1) | 40 | 0.05 | 5.00% | 5.40% | 4.90% |
| xi ∼ N(0, 1) | 16 | 0.10 | 10.30% | 12.40% | 9.80% |
| xi ∼ N(0, 1) | 40 | 0.10 | 9.30% | 10.60% | 9.50% |

We note that, when the sample size is small (n = 16), the simulated rejection rates of t1 and t3 are closer to the theoretical rejection rate α than that of t2. When the sample size is large (n = 40), the simulated rejection rates of t1, t2 and t3 are about equally close to the theoretical rate. Thus, when the sample size is large (say, n > 30), it makes little difference which of the three ways we use to test the hypothesis. When the sample size is small, it is better to use t3. [Reference: Sim3.xls]

When should we use the normal approximation, and when the Student-t? The rule is: when σ² has to be estimated, always use the Student-t if a computer is readily available to compute the t values or probabilities. When only statistical tables are available, use the normal or the Student-t according to the discussion above.

5 Testing proportions
When the population mean in question is really a population proportion, the discussion above goes through, but the variance takes a special form. If π is the population proportion, the variance of a randomly drawn observation is π(1 − π). Thus, under the null that π = k, we have

$$ z = \frac{p - k}{\sqrt{k(1-k)/n}} \overset{A}{\sim} N(0, 1), $$

where p is the sample proportion. What if the variance is unknown for the proportion under the null? This simply cannot happen: under the null that π = k, the variance of a randomly drawn observation is k(1 − k), and hence known. Some might argue that we can also use p(1 − p)/n instead of k(1 − k)/n as an estimate of the variance of p. To convince ourselves which choice is better, we can do a simulation as follows.
Simulation 4 (Using the sample variance or the variance under the null?): We would like to test H0: π = 0.3 versus H1: π ≠ 0.3 using different estimates of the variance of p.
1. Draw n (= 50) observations from a Bernoulli distribution with parameter π (i.e., Prob(success) = π), for π = 0.1, 0.3, 0.5, 0.7, 0.9.
2. For the sample, obtain the sample proportion of successes, p. Test the hypothesis at various levels of significance (α) using two different statistics, with k = 0.3:
$$ t_1 = \frac{p - k}{\sqrt{k(1-k)/n}}, \qquad t_2 = \frac{p - k}{\sqrt{p(1-p)/n}}. $$
3. Repeat the last two steps 1000 times and compute the percentage of rejections in the simulated samples.

Results for t1:

| α | Rejection rule | π = 0.1 | π = 0.3 | π = 0.5 | π = 0.7 | π = 0.9 |
|---|---|---|---|---|---|---|
| 0.01 | \|t1\| > 2.576 | 79.00% | 1.20% | 68.30% | 100.00% | 100.00% |
| 0.05 | \|t1\| > 1.960 | 94.30% | 5.40% | 84.50% | 100.00% | 100.00% |
| 0.10 | \|t1\| > 1.645 | 97.60% | 9.40% | 89.50% | 100.00% | 100.00% |

Results for t2:

| α | Rejection rule | π = 0.1 | π = 0.3 | π = 0.5 | π = 0.7 | π = 0.9 |
|---|---|---|---|---|---|---|
| 0.01 | \|t2\| > 2.576 | 93.80% | 2.60% | 57.60% | 100.00% | 99.50% |
| 0.05 | \|t2\| > 1.960 | 97.10% | 7.30% | 84.50% | 100.00% | 99.50% |
| 0.10 | \|t2\| > 1.645 | 98.50% | 13.10% | 89.50% | 100.00% | 99.50% |

The simulation results show that the test statistic t1, which imposes the null value in computing the variance, performs better than t2, which does not. Specifically, t1 yields a simulated rejection rate closer to the theoretical rejection rate when the null is correct (column π = 0.3), and t1's ability to reject the null (π = 0.3) when the null is false (i.e., π ≠ 0.3) is comparable to that of t2, if not better. [Reference: Sim4.xls] Thus, the simulation suggests that, in testing hypotheses about proportions, it is better to use t1, with the null value imposed in computing the variance.

6 One-sided tests
6.1 One-sided test with simple null, H0: µ = k versus H1: µ > k
The example above deals with hypotheses of the form H0: µ = k versus H1: µ ≠ k. What if we have an inequality structure like H0: µ = k versus H1: µ > k?
Basically the same analysis goes through. However, we now look for a c1 such that Prob(z > c1) = 0.05, and we reject the null when the test statistic z is larger than c1. From the standard normal table, c1 must be about 1.645 (often rounded to 1.64 or 1.65) for Prob(z > c1) = 0.05.

Simulation 5 (One-sided tests with simple null): We would like to convince ourselves that the one-sided test may be conducted as described. Consider the null H0: µ = k versus H1: µ > k, where k is set to 0 for convenience.
1. Fix a µ, say, µ = 0.
2. Generate n (= 50) observations from x = µ + ε, ε ∼ N(0, 1), so that x ∼ N(µ, 1).
3. For this sample, compute
$$ z = \frac{m - 0}{\sqrt{s^2/n}}. $$
At the α level of significance, reject H0 in favor of H1 if z > z_α. Do this for different values of α.
4. Repeat the last two steps 1000 times and compute the percentage of samples that reject H0.

The following table summarizes our simulation results. The left panel tests H0: µ = 0 versus H1: µ > 0; the right panel tests H0: µ = 0 versus H1: µ < 0.

| α | Rejection rule (H1: µ > 0) | µ = 0 | µ = 1 | Rejection rule (H1: µ < 0) | µ = 0 | µ = −1 |
|---|---|---|---|---|---|---|
| 0.01 | z > 2.326 | 1.20% | 100.00% | z < −2.326 | 1.20% | 100.00% |
| 0.05 | z > 1.645 | 4.80% | 100.00% | z < −1.645 | 4.80% | 100.00% |
| 0.10 | z > 1.282 | 10.00% | 100.00% | z < −1.282 | 10.00% | 100.00% |

The simulated rejection rates are clearly close to the theoretical rejection rates. Thus, the simulation confirms that our suggested procedure for testing the one-sided hypothesis is valid. [Reference: Sim5.xls]

Note that in both panels of the simulation above, we reported only tests in which the true parameter equals the null value or lies in the alternative. For instance, in the left panel, where we test H0: µ = 0 versus H1: µ > 0, we consider a true parameter of 0 (µ = 0) and a true parameter larger than 0 (µ = 1 in the example above), but not the case µ < 0, because µ < 0 is not feasible under either the null or the alternative. Situations with this hypothesis structure do happen in real life. For instance, we might have:
1. H0: The mean income of students at a local primary school is zero. H1: The mean income of students at a local primary school is larger than zero. (A mean income below zero is impossible, theoretically.)
2. H0: The mean number of spouses of a married person in Hong Kong is one. H1: The mean number of spouses of a married person in Hong Kong is larger than one. (The mean number of spouses of a married person in Hong Kong cannot be less than one, theoretically.)

Example 6 (One-sided test): In the past, 15% of the mail order solicitations for a political party resulted in a financial contribution. A newly drafted solicitation letter is sent to a sample of 200 people, and 40 respond with a contribution. At the 0.05 significance level, can it be concluded that the new letter is more effective? The problem is solved in the following steps:
1. State the null and the alternative hypotheses: H0: π = 0.15 (the new letter is as effective as the old letter); H1: π > 0.15 (the new letter is more effective).
2. Identify the test statistic and its distribution: z = (p − 0.15)/std(p), approximately N(0, 1), where p is the sample proportion.
3. State the decision rule: the null hypothesis is rejected if z is greater than 1.645, since Prob(z > 1.645) = 0.05.
4. Make a decision and interpret the results:
$$ z = \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}} = \frac{40/200 - 0.15}{\sqrt{0.15(1-0.15)/200}} = 1.98. $$
Since 1.98 > 1.645, the null hypothesis is rejected: at the 0.05 level, we conclude that the new letter is more effective.

[Figure: sampling distribution of p, centred at π = 0.15, standardized to z = (p − π)/std(p), centred at 0; the rejection region (α = 0.05) lies to the right of z = 1.65, and the observed p = 0.2 corresponds to z = 1.98.]

Can you give other examples that fit into the one-sided hypothesis structure?

6.2 One-sided test with composite null, H0: µ ≤ k versus H1: µ > k
There are many situations in which it appears logical to consider hypotheses with an inequality structure like H0: µ ≤ k versus H1: µ > k. For instance:
1. H0: The unemployment rate is lower than or equal to 5 percent. H1: The unemployment rate is higher than 5 percent.
2. H0: The average mark of students in the midterm is lower than or equal to 70. H1: The average mark of students in the midterm is higher than 70.

Note that under the null, µ can take a range of values, e.g., µ = k and µ = k − 0.001. We call this kind of null hypothesis, under which the population parameter can take a set of values, a "composite null". A composite null makes hypothesis testing very different from what we have discussed so far.

Simulation 6 (One-sided tests with composite null): We would like to explore the properties of the one-sided test with a composite null. Consider the null H0: µ ≤ k versus H1: µ > k, where k is set to 0 for convenience.
1. Fix a µ, say, µ = 0.
2. Generate n (= 50) observations from x = µ + ε, ε ∼ N(0, 1), so that x ∼ N(µ, 1).
3. For this sample, compute
$$ z = \frac{m - 0}{\sqrt{s^2/n}}. $$
At the α level of significance, reject H0 in favor of H1 if z > z_α. Do this for different values of α.
4. Repeat the last two steps 1000 times and compute the percentage of samples that reject H0.
5. Repeat the simulation with different values of µ.

The following table summarizes our simulation results for H0: µ ≤ 0 versus H1: µ > 0.

| α | Rejection rule | µ = −1 | µ = −0.5 | µ = 0 | µ = 0.5 | µ = 1 |
|---|---|---|---|---|---|---|
| 0.01 | z > 2.326 | 0.00% | 0.00% | 1.20% | 89.90% | 100.00% |
| 0.05 | z > 1.645 | 0.00% | 0.00% | 4.80% | 96.80% | 100.00% |
| 0.10 | z > 1.282 | 0.00% | 0.00% | 10.00% | 98.80% | 100.00% |

[Reference: Sim6.xls]

From the simulation results, we see that when the true parameter is less than zero, the probability of rejecting the null is much less than α, whereas when the true parameter equals zero, the probability of rejecting the null is close to α. Based on the simulation, we conclude that mechanically treating the composite null (hypothesized parameter taking a range of values) as a simple null (taking only one value) yields
1. a rejection rate less than the intended rejection rate (i.e., α) when the true parameter differs from the boundary value of the composite null;
2. a rejection rate equal to the intended rejection rate (i.e., α) only when the true parameter lies at the boundary value of the composite null.

Thus, if we mechanically treat the composite null as a simple null, the level of significance (α) is really the maximum level of significance, which is correct only when the true parameter is at the boundary. Should we be concerned? Yes! When we report our test, we typically say "We reject H0 at the 0.05 level of significance", which means "The probability of rejecting the null when the null is true is 0.05". When the probability of rejecting the null when the null is true (i.e., the true rejection rate) is less than 0.05 and we state that it is 0.05, we are making a false statement.

It is not difficult to understand why we have trouble finding a correct level of significance for a composite null. Hypothesis testing builds on the probability distribution of a sample statistic, which is often characterized by population parameters (such as the mean and variance). Under a composite null, we have infinitely many possible null parameter values, and hence infinitely many probability distributions and infinitely many levels of significance (one for each parameter value). In most empirical studies, researchers avoid this kind of composite null because the interpretation of the level of significance is unclear. We suggest following this conventional practice and avoiding the composite null in our own work.3 However, at times it is more intuitive to consider the composite null. In that case, we had better remember what α means.

7 Level of significance versus p-value
Suppose that, given a sample statistic, we want to claim a rejection of the null but also want to make an honest probability statement.
That is, we would like to find the smallest possible level of significance (p) at which we can reject the null.

Definition 6 (p-value): A p-value is the probability, assuming that the null hypothesis is true, of finding a value of the test statistic at least as extreme as the value computed from the sample (denoted ẑ). For a test that rejects for large values of the statistic,
$$ \text{p-value} = \mathrm{Prob}(z > \hat{z}), $$
where z is the corresponding random variable. (For a two-sided test, as in Example 8 below, the p-value is Prob(|z| > |ẑ|).)

The p-value can be used to make acceptance and rejection decisions:
1. If the p-value is smaller than the level of significance, H0 is rejected.
2. If the p-value is larger than the level of significance, H0 is not rejected.

3 Most textbooks in statistics either never mention the composite null or give the wrong interpretation when they mention it. See Liu and Stone (1999) for a discussion.

To understand the relation between the level of significance and the p-value, consider the following algorithm for finding the p-value. Consider H0: µ = k versus H1: µ ≠ k. Suppose the observed statistic is denoted ẑ (= (m̂ − k)/s_m) and its corresponding random variable is denoted z (= (m − k)/s_m), where m̂ denotes a realized sample mean and m denotes a random (not yet realized) sample mean.
1. Set the level of significance to its largest possible value, i.e., α = 1.
2. Test the hypothesis at the α level of significance.
3. Update α according to the following rules:
(a) If the hypothesis is not rejected at the α level of significance, stop and set the p-value to α.
(b) If the hypothesis is rejected at the α level of significance, replace α with α − ∆ (where ∆ is a small number, say, 0.0001) and repeat the last two steps.

Simulation 7 (Relating the p-value to the level of significance): We generate one sample of 50 observations from a N(0, 1) population, i.e., the true population mean is 0. We are interested in the hypothesis H0: µ = k versus H1: µ ≠ k for different values of k. We use the above algorithm to find the p-value.
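The search algorithm above can be sketched in Python for the two-sided case, alongside the closed-form p-value 2·Prob(z > |ẑ|) for comparison. This is an illustrative sketch assuming Python 3.8+ (`statistics.NormalDist`); the function names are hypothetical, and the step ∆ = 0.0001 follows the text. Looping over an integer counter rather than repeatedly subtracting ∆ avoids accumulating floating-point error.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def rejects(z_hat: float, alpha: float) -> bool:
    """Two-sided test of H0: mu = k at level alpha: reject when |z_hat| > z_{alpha/2}."""
    return abs(z_hat) > STD_NORMAL.inv_cdf(1 - alpha / 2)


def p_value_by_search(z_hat: float, delta: float = 1e-4) -> float:
    """Steps 1-3 of the algorithm: lower alpha from 1 in steps of delta until H0 survives."""
    steps = round(1 / delta)
    for i in range(steps, 0, -1):     # alpha = 1, 1 - delta, ..., delta
        alpha = i * delta
        if not rejects(z_hat, alpha):
            return alpha              # step 3(a): first level at which H0 is not rejected
    return 0.0                        # rejected even at alpha = delta: p-value is below delta


def p_value_exact(z_hat: float) -> float:
    """Closed form for the two-sided test: Prob(|z| > |z_hat|) = 2 * Prob(z > |z_hat|)."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z_hat)))


for z_hat in (0.0, 1.0, 2.29, 2.86):
    print(f"z_hat = {z_hat:4.2f}: search {p_value_by_search(z_hat):.4f}, "
          f"exact {p_value_exact(z_hat):.4f}")
```

The search result always lands within ∆ below the closed-form value, because H0 is rejected at level α exactly when α exceeds the p-value. For ẑ = 2.86, the test statistic of Example 8 below, both give a p-value of about 0.0042.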
The following table reports whether the hypothesis is rejected at different levels of significance α.

α      k = 0   k = 0.1   k = 0.3   k = 0.5   k = 0.7   k = 0.9
1      Yes     Yes       Yes       Yes       Yes       Yes
0.9    No      Yes       Yes       Yes       Yes       Yes
0.8    No      Yes       Yes       Yes       Yes       Yes
0.7    No      Yes       Yes       Yes       Yes       Yes
0.6    No      Yes       Yes       Yes       Yes       Yes
0.5    No      Yes       Yes       Yes       Yes       Yes
0.4    No      No        Yes       Yes       Yes       Yes
0.3    No      No        Yes       Yes       Yes       Yes
0.2    No      No        Yes       Yes       Yes       Yes
0.1    No      No        Yes       Yes       Yes       Yes
0.05   No      No        Yes       Yes       Yes       Yes
0.01   No      No        No        Yes       Yes       Yes

We can reject the nulls with k = 0.5, k = 0.7 and k = 0.9 at the 0.01 level. That is, if any of these nulls is correct, the chance of observing a sample statistic at least as extreme as ours (which is in fact generated from N(0, 1)) is extremely small (i.e., less than 0.01). If the null with k = 0.3 is correct, this chance is small, but definitely not as small as when k = 0.9. When the null with k = 0 is correct, this chance is large, and we are not able to reject the null at any α ≤ 0.9. In fact, the p-values of the hypotheses are:

k         0        0.1      0.3      0.5      0.7      0.9
p-value   0.9642   0.4632   0.0220   0.0001   0.0000   0.0000

It is easy to verify that the null is rejected at the α significance level if p-value < α. [Reference: Sim7.xls]

Example 7 (p-value of a one-sided test): In the past, 15% of the mail order solicitations for a political party resulted in a financial contribution. A newly drafted solicitation letter is sent to a sample of 200 people, and 40 respond with a contribution. We would like to test H0: π = 0.15 (the new letter is as effective as the old letter) versus H1: π > 0.15 (the new letter is more effective). What is the p-value of rejecting the null?
1. The relevant test statistic and its distribution are z = (p − 0.15)/std(p) ∼ N(0, 1), where p is the sample proportion.
2.
The sample statistic is:
$$z = \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}} = \frac{40/200 - 0.15}{\sqrt{0.15(1-0.15)/200}} = 1.98$$
3. The p-value is Prob(z > 1.98) ≈ 0.0238 (using the unrounded statistic, 1.9803, gives 0.023835). The table below confirms that the null is rejected at the listed α values above the p-value and not rejected at those below it.

α      Critical value   Test z-stat.   Decision
0.05   1.644854         1.98           Reject
0.04   1.750686         1.98           Reject
0.03   1.880794         1.98           Reject
0.02   2.053749         1.98           Not reject
0.01   2.326348         1.98           Not reject

[Figure: sampling distribution of the test statistic; the p-value is the area to the right of the test statistic from the sample.]

Example 8 (p-value of a two-sided test): The supervisor of a production line believes that the average time to assemble an electronic component is 14 minutes. Assume that assembly time is normally distributed with a standard deviation of 3.4 minutes. The supervisor times the assembly of 14 components and finds that the average completion time is 16.6 minutes. What is the smallest significance level at which the null hypothesis H0: µ = µ0 = 14 could be rejected in favor of H1: µ ≠ 14?
$$\text{Test statistic} = \frac{m^* - \mu_0}{\text{std}(m)} = \frac{16.6 - 14}{3.4/\sqrt{14}} = 2.86 > 0$$
$$\text{p-value} = 2 \times P(Z > 2.86) = 2 \times 0.0021 = 0.0042$$
Note that
$$\text{p-value} = \text{Prob}\big(|z| > |\text{test statistic}| \mid H_0\big) = \begin{cases} 2 \times \text{Prob}(z < \text{test statistic} \mid H_0) & \text{if the test statistic} < 0,\\ 2 \times \text{Prob}(z > \text{test statistic} \mid H_0) & \text{if the test statistic} > 0. \end{cases}$$

[Figure: two-sided p-value = A + B, the two tail areas beyond the test statistic from the sample and beyond its mirror image.]

8 Power

On a starry night, we noticed a bright spot in the sky. From the star map, we knew that it was a binary star system, but our naked eyes could not tell the two stars apart. So we borrowed a telescope and looked at it again. Given the same telescope (or the same naked eyes), the larger the distance between the two stars, the more likely we can distinguish between them. Given the same two stars, the larger the magnification coefficient (MC) of our telescope, the more likely we can distinguish between them. While star-gazing might appear totally irrelevant to statistics, it is actually analogous to the power of a hypothesis test.
In the context of hypothesis testing, the power of a test is its ability to tell us that the null is wrong when the true parameter differs from the null value. Given a test, this ability depends on how far the true parameter is from the null. The larger the difference between the truth and the null, the larger the power of a given test. In this sense, statistical power resembles the power of a telescope.

Statistical test | Telescope (star gazing)
Power of a test in distinguishing between two values of the parameter. | Magnification coefficient (MC) of the telescope.
The larger the difference between the truth and the null, the larger the power. | The larger the distance between the two stars, the more likely we can distinguish between them using the same telescope.
Given the same truth and null, the larger the power, the more accurate the test. | Given the same two stars, the larger the MC of our telescope, the more likely we can distinguish the two stars.

Definition 7 (Power): The power of a test is a measure of the ability of a test to distinguish between two possible values of the parameter of interest. The power of a test against an alternative value of the parameter (different from the null value) is the probability of rejecting the null hypothesis when the true parameter equals the alternative value.

Example 9 (Power): A random sample of 802 supermarket shoppers had 378 shoppers who preferred generic brand items if the price was lower. Test at the 10% level the null hypothesis that at least one-half of all shoppers prefer generic brand items against the alternative that the population proportion is less than one-half. Find the power of a 10% level test if, in fact, 45% of all supermarket shoppers prefer generic brand items.
1. The hypotheses are H0: π = 0.5 versus H1: π < 0.5.
2.
The variance of the sample proportion p under H0 is 0.5 × (1 − 0.5)/802 = 0.000312.
3. The level of significance is 0.1. Reject H0 if the sample proportion p is too small. At the 0.1 level of significance, the critical value is z = −1.28. The upper limit of the rejection region is 0.5 + z × (std. dev. under H0) = 0.5 − 1.28 × √0.000312 = 0.4774. Therefore, H0 is rejected when the sample proportion is less than 0.4774. (Here the observed proportion is 378/802 ≈ 0.471 < 0.4774, so H0 is rejected.)
4. If the true proportion is 0.45:
(a) H0 is false since π = 0.45 < 0.5.
(b) The power is Prob(rejecting H0 | π = 0.45) = Prob(p < 0.4774 | π = 0.45).
(c) The variance under π = 0.45 is 0.45 × (1 − 0.45)/802 = 0.000309.
(d) $\text{Prob}(p < 0.4774 \mid \pi = 0.45) = \text{Prob}\big(z < (0.4774 - 0.45)/\sqrt{0.000309}\big) = 0.9404$.
5. Thus, 0.9404 is the probability that H0 (π = 0.5) is correctly rejected when the truth is π = 0.45.

Simulation 8 (Power): We would like to simulate the power of a simple test of the population mean, H0: µ = 0 versus H1: µ ≠ 0.
1. Fix a population mean µ = 0 + 0.1 × i, i = −10, −9, ..., −1, 0, 1, ..., 10.
2. Generate a sample of 30 observations from a N(µ, 1) population.
3. Perform the test of the hypothesis H0: µ = 0 versus H1: µ ≠ 0 at the α level of significance.
4. Repeat the last two steps 1000 times. Compute the percentage of rejections in the simulations for different α.
5. Repeat with a different µ.

The simulated rejection rate is plotted against the true µ in the following chart.

[Figure: simulated rejection rate (0% to 100%) against the true mean (−1 to 1).] [Reference: Sim8.xls]

Note that if the test had perfect power, it would reject the null whenever the true µ is not zero. This ideal scenario does not happen. Typically, a test has much power against the null when the true µ is far from the null value, and very little power when the true µ is close to the null value.

9 Switching null and alternative hypotheses

Why must we set one hypothesis as the null and the other as the alternative? We discussed the argument earlier: “giving the benefit of the doubt to the defendant”.
However, in that section we were not able to see the impact of switching the null and alternative hypotheses on the conclusion, because we had not yet discussed how to carry out a hypothesis test. Now we are ready. To see the impact, consider the following example.

Example 10 (Switching null and alternative): In the past, 15% of the mail order solicitations for a certain charity resulted in a financial contribution. In past years, the letters were drafted by a staff member, Mr A. A new solicitation letter has been drafted by a job applicant, Mr B. The letter is sent to a sample of 200 people, and 30 respond with a contribution. At the 0.05 significance level, can it be concluded that the new letter is more effective? Can we conclude that the job applicant Mr B is better than Mr A?

Suppose we give the benefit of the doubt to the old letter (or Mr A). That is, unless the new letter performs much better than the old one, we will use the old one.
1. Let π be the rate at which mail order solicitations result in a financial contribution. State the null and the alternative hypotheses:
H0: π ≤ 0.15 (the new letter is no more effective than the old letter)
H1: π > 0.15 (the new letter is more effective)
2. Identify the test statistic and its distribution: z = (p − 0.15)/std(p) ∼ N(0, 1), where p is the sample proportion.
3. State the decision rule: the null hypothesis is rejected if z is greater than 1.65, i.e., Prob(z > 1.65) ≈ 0.05.
4. Make a decision and interpret the results:
$$z = \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}} = \frac{30/200 - 0.15}{\sqrt{0.15(1-0.15)/200}} = 0 < 1.65$$
The null is not rejected.

What if we switch the null and the alternative, and give the benefit of the doubt to the new letter (or Mr B)? That is, unless the new letter performs much worse than the old one, we will use the new one.
1. State the null and the alternative hypotheses:
H0: π > 0.15 (the new letter is more effective)
H1: π ≤ 0.15 (the new letter is no more effective than the old letter)
2.
Identify the test statistic and its distribution: z = (p − 0.15)/std(p) ∼ N(0, 1), where p is the sample proportion.
3. State the decision rule: the null hypothesis is rejected if z is smaller than −1.65, i.e., Prob(z < −1.65) ≈ 0.05.
4. Make a decision and interpret the results:
$$z = \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}} = \frac{30/200 - 0.15}{\sqrt{0.15(1-0.15)/200}} = 0 > -1.65$$
Hence the null that the new letter is more effective than the old letter is not rejected.

In this example, if we give the benefit of the doubt to Mr A, we will keep Mr A. If we give the benefit of the doubt to Mr B, we will fire Mr A and hire Mr B. In most organizations, we tend to keep the existing staff unless the contender is much better. Note that this example is very extreme: the sample proportion equals the hypothesized value (p = 0.15), yet we reach opposite conclusions when we switch the null and the alternative. In this example, the logical choice of null and alternative is rather obvious. Unfortunately, in many examples the choice is not as easy. It takes a lot of practice to acquire the skill of formulating the null and the alternative.

10 Relation between testing hypothesis and confidence intervals

Suppose we are interested in inference about the population mean µ. Using a sample, we may compute the sample mean (m) and use it as an estimate of the population mean. We may further construct a (1 − α) × 100% confidence interval for the population mean, $(m - z_{\alpha/2}\sigma_m,\; m + z_{\alpha/2}\sigma_m)$. Recall that a (1 − α) × 100% confidence interval for the population mean means that (1 − α) × 100% of the random intervals constructed in this way are expected to cover the population mean, and α × 100% of them are expected not to cover it.

With the sample mean, we can also test, at the α level of significance, the hypothesis that the population mean equals a hypothesized value µ0, H0: µ = µ0 versus H1: µ ≠ µ0. We reject H0 if $(m - \mu_0)/\sigma_m > z_{\alpha/2}$ or $(m - \mu_0)/\sigma_m < -z_{\alpha/2}$.
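Before the algebra, a quick numerical check may help. The Python sketch below is a hypothetical re-creation in the spirit of Simulation 9, not the author's Sim9.xls: it assumes σ = 1 is known (so σm = 1/√50), reuses the same simulated samples for every k and α via a fixed seed, and counts how often the two-sided z-test rejects H0: µ = k and how often the (1 − α) × 100% interval covers k. Under these assumptions the two rates should add up to one for every k and α.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def z_test_rejects(m: float, mu0: float, sigma_m: float, alpha: float) -> bool:
    """Two-sided z-test: reject H0: mu = mu0 if |(m - mu0)/sigma_m| > z_{alpha/2}."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return abs((m - mu0) / sigma_m) > z_crit


def ci_covers(m: float, mu0: float, sigma_m: float, alpha: float) -> bool:
    """True if mu0 lies in [m - z_{alpha/2} * sigma_m, m + z_{alpha/2} * sigma_m]."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return m - z_crit * sigma_m <= mu0 <= m + z_crit * sigma_m


def simulate(k: float, alpha: float, n: int = 50, reps: int = 1000, seed: int = 0):
    """Rejection rate of H0: mu = k and coverage rate of k when the true mean is 0."""
    rng = random.Random(seed)  # same seed, hence same samples, for every k and alpha
    sigma_m = 1 / math.sqrt(n)  # sigma = 1 treated as known (an assumption)
    rejections = covers = 0
    for _ in range(reps):
        m = math.fsum(rng.gauss(0, 1) for _ in range(n)) / n  # sample mean
        rejections += z_test_rejects(m, k, sigma_m, alpha)
        covers += ci_covers(m, k, sigma_m, alpha)
    return rejections / reps, covers / reps


for alpha in (0.10, 0.05, 0.01):
    for k in (0, 0.1, 0.3, 0.5, 0.7):
        rej, cov = simulate(k, alpha)
        print(f"alpha={alpha:.2f} k={k}: rejection {rej:.1%}, coverage {cov:.1%}")
```

The individual percentages depend on the seed, so they will differ somewhat from the reported tables; the identity rejection rate = 1 − coverage rate should not.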
Equivalently, we reject H0 if $m > \mu_0 + z_{\alpha/2}\sigma_m$ or $m < \mu_0 - z_{\alpha/2}\sigma_m$. Rearranging once more, we reject H0 if $m - z_{\alpha/2}\sigma_m > \mu_0$ or $\mu_0 > m + z_{\alpha/2}\sigma_m$. That is, we do not reject the null if µ0 lies within $(m - z_{\alpha/2}\sigma_m,\; m + z_{\alpha/2}\sigma_m)$.

Theorem 1 (Equivalence of hypothesis test and confidence intervals): Consider the hypothesis test H0: µ = µ0 versus H1: µ ≠ µ0 at the α level of significance. We reject H0 if and only if µ0 does not lie in the (1 − α) × 100% confidence interval of µ.

Simulation 9 (Relation between testing hypothesis and confidence intervals): We illustrate that testing a hypothesis and constructing a CI are equivalent, as stated in the theorem above. We test the hypothesis H0: µ = k versus H1: µ ≠ k at the α level of significance.
1. Fix µ = 0.
2. Generate a sample of 50 observations from N(µ, σ²), σ² = 1.
3. Construct a (1 − α) × 100% confidence interval for µ. Test the hypothesis using the CI.
4. Test the hypothesis using the conventional method at the α level of significance, with α = 0.01, 0.05, 0.10.
5. Repeat the above steps 1000 times. Compute the rejection rate for different k using both procedures.

Simulated rejection rate of testing H0: µ = k versus H1: µ ≠ k at the α level of significance:
α      k = 0    k = 0.1   k = 0.3   k = 0.5   k = 0.7
0.10   9.20%    15.20%    67.50%    97.10%    100.00%
0.05   4.70%    8.70%     52.50%    94.00%    99.80%
0.01   1.20%    2.90%     28.80%    83.40%    99.00%

Simulated coverage rate of the (1 − α) × 100% confidence interval covering µ = k:
α      k = 0    k = 0.1   k = 0.3   k = 0.5   k = 0.7
0.10   90.80%   84.80%    32.50%    2.90%     0.00%
0.05   95.30%   91.30%    47.50%    6.00%     0.20%
0.01   98.80%   97.10%    71.20%    16.60%    1.00%

Thus, the rejection rate and the coverage rate for the same null value of µ are clearly related: rejection rate = 1 − coverage rate. [Reference: Sim9.xls]

11 A general procedure of testing hypothesis

Suppose we are interested in testing whether the population parameter θ is equal to k:
H0: θ = k
H1: θ ≠ k
1.
We need a sample estimate (q) of the population parameter θ.
2. In most cases, the test statistic takes the form
$$t = \frac{q - k}{\sigma_q},$$
where σq is the standard deviation of q under the null. The form of σq depends on what q is.
3. The sample size and the null at hand determine the distribution of the statistic. If θ is the population mean and the sample size is larger than 30, t is approximately standard normal.

12 Testing population variance

The general procedure of testing hypothesis suggests that it is essential to know the distribution of the sample analog of the population parameter of concern. If that distribution is not directly available, we may try to obtain the distribution of a transformation of the sample analog. This is the case when we test the population variance.

Consider testing H0: σ² = σ0² versus H1: σ² > σ0². To test the hypothesis, we rely on the sample variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - m)^2}{n - 1}.$$
For a normal population, we can show that E(s²) = σ² and Var(s²) = 2σ⁴/(n − 1). Because neither s² nor s is normally distributed, we cannot rely on the procedure discussed earlier. Fortunately, when the population is normal, we can show that the statistic (see footnote 4)
$$\frac{(n-1)s^2}{\sigma^2}$$
has a χ² (chi-square) distribution with n − 1 degrees of freedom. Thus we can formulate our testing procedure as follows:
1. Compute the sample variance, s².
2. Compute the statistic χ² = (n − 1)s²/σ0².
3. Reject the null at the α level of significance if $\chi^2 > \chi^2_{n-1,\alpha}$, where $\chi^2_{n-1,\alpha}$ is the critical value such that $\text{Prob}(\chi^2 > \chi^2_{n-1,\alpha}) = \alpha$.

Example 11: The time it takes to complete the assembly of an electronic component is normally distributed with a standard deviation of 4.38 minutes. If we select 20 components at random, what is the probability that the standard deviation of the assembly time of these units is less than 3.0 minutes? Let s² denote the sample variance, and hence s the sample standard deviation.
We are trying to compute Prob(s < 3.0). But
$$\text{Prob}(s < 3.0) = \text{Prob}(s^2 < 9.0) = \text{Prob}\left(\frac{(n-1)s^2}{\sigma^2} < \frac{(n-1) \times 9.0}{\sigma^2}\right).$$
We know that (n − 1)s²/σ² has a chi-square distribution with n − 1 = 19 degrees of freedom, and σ² = 4.38² = 19.18. So we just have to find
$$\text{Prob}(\chi^2_{19} < 19 \times 9.0/19.18) = \text{Prob}(\chi^2_{19} < 8.91) = 1 - 0.975 = 0.025.$$

[Footnote 4: Suppose x1, x2, ..., xn are independent standard normal random variables. Then $\sum_{i=1}^{n} x_i^2 \sim \chi^2(n)$, a chi-square distribution with n degrees of freedom.]

References

[1] Efron, Bradley, and Robert J. Tibshirani (1993): An Introduction to the Bootstrap, Chapman & Hall.
[2] Liu, Tung, and Courtenay C. Stone (1999): “A Critique of One-Tailed Hypothesis Test Procedures in Business and Economics Statistics Textbooks,” Journal of Economic Education, 30(1): 59–63.

Problem sets

We have tried to include some of the most important examples in the text. To get a good understanding of the concepts, it is most useful to redo the examples and simulations in the text. Work on the following problems only if you need extra practice or if your instructor assigns them. Of course, the more you work on these problems, the more you learn.

[To be written]