Sect. 8.1: Inference for a Population Proportion

Previously, we have been making inferences about the population mean $\mu$. Now, we will be concerned with estimating the proportion, p, for some population that consists of "successes" and "failures" as in Ch. 5.

The _______________________ of interest is the population proportion, p, of "successes".

The ___________________ we will be using to estimate p is the sample proportion, $\hat{p}$, where

$$\hat{p} = \frac{X}{n} = \frac{\text{number of successes (1's) in the sample}}{n}$$

For _____________ sample sizes, the Binomial distribution must be used for inference about p. We will assume ___________ sample size and use the _______________________________.

Recall from Chapter 5: If an SRS of size n is chosen from a population with proportion p of "successes", then:

As the sample size n increases, the sampling distribution of $\hat{p}$ becomes approximately _________.
1) The mean of the sampling distribution is ____, therefore it is ____________________.
2) The standard deviation of the sampling distribution is ____________.

Confidence Intervals:

Note that the standard deviation of $\hat{p}$ involves the _____________ we are trying to estimate, so we use instead the _____________________ of $\hat{p}$ (i.e. the estimate of this standard deviation), which has the form:

$\mathrm{S.E.}_{\hat{p}} =$ ____________

Confidence intervals for p.

A C = $1 - \alpha$ confidence interval for p has the form

____________

where _____ is the upper ________ critical value for the standard normal distribution. Use Table ___ to get this value.

Use this interval for confidence level above _______%, only when both ____ and ___________ are __________________.

For somewhat ____________ Moore and McCabe point out that this method can be ___________. They propose an alternative method that calculates an estimate of p as though ____ additional observations had been obtained and _______ of them were "successes." This method, the Plus Four estimate, moves the estimate closer to ______.

Sta 245 Sec.8.1 SB

Plus Four Confidence intervals for p.
Estimate p by

$$\tilde{p} = \frac{X + 2}{n + 4}$$

A C = $1 - \alpha$ confidence interval for p has the form

$$\tilde{p} \pm z^* \sqrt{\frac{\tilde{p}(1 - \tilde{p})}{n + 4}}$$

where z* is the upper ___________ critical value for the standard normal distribution. Use Table D to get this value.

Use this interval for confidence level above _______%, only when _____________.

Hypothesis Testing and Confidence Intervals

The hypothesis testing involving sample proportions is another version of the ________-sample z test. When we are testing the null hypothesis $H_0: p = p_0$, we use ____ in place of ________ when calculating the standard deviation in our test statistic, i.e. we standardize assuming $p_0$ is the true mean.

Z = ____________

The P-value is calculated based on the form of $H_a$:

$H_a: p > p_0$      P is P(Z > z)
$H_a: p < p_0$      P is P(Z < z)
$H_a: p \neq p_0$   P is 2P(Z > |z|)

Replace z with the observed value of the test statistic.

This method should be used only when the ____________ is large enough such that the ______ number of successes and the ______________ number of failures are _______________ or more.

Note that the close connection between confidence intervals and two-sided hypothesis tests that we saw in Chapter 6 does ___________ here. That is due to ____________ the same estimate of the standard deviation of $\hat{p}$ in the two procedures. It is approximately true that the confidence interval gives the range of p values that would be accepted if we made them the $p_0$ in a hypothesis test.

Checklist for Inference About a Proportion

1) The data are a ______________ from the population of interest.
2) The population is at least ____________ times larger than the sample. Otherwise, the formula for the standard deviation is not very accurate.
3) The sample size n is large enough such that:
   Traditional Confidence Interval: ____________
   Plus Four Confidence interval: n ____________
   Hypothesis Test of $H_0: p = p_0$: $n p_0 \geq 10$ and $n(1 - p_0) \geq 10$

If these assumptions are not satisfied then the conclusion about the parameter p is __________ _____________ reliable.
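The interval and test procedures described above can be sketched in Python. This is only an illustration under stated assumptions: the function names (`traditional_ci`, `plus_four_ci`, `z_test`) and the counts in the usage lines (60 successes in 100 trials, $p_0$ = 0.5) are hypothetical and do not come from the notes. The sketch uses the standard large-sample formulas and does not check the sample-size conditions from the checklist.

```python
import math
from statistics import NormalDist


def z_star(conf):
    """Upper (1 - conf)/2 critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)


def traditional_ci(x, n, conf=0.95):
    """Large-sample interval: p_hat +/- z* sqrt(p_hat (1 - p_hat) / n)."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    m = z_star(conf) * se
    return p_hat - m, p_hat + m


def plus_four_ci(x, n, conf=0.95):
    """Plus Four interval: act as if 4 extra observations, 2 of them successes."""
    p_tilde = (x + 2) / (n + 4)
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    m = z_star(conf) * se
    return p_tilde - m, p_tilde + m


def z_test(x, n, p0, alternative="two-sided"):
    """z test of H0: p = p0; the standard deviation is computed from p0."""
    p_hat = x / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    std = NormalDist()
    if alternative == "greater":
        p_value = 1 - std.cdf(z)
    elif alternative == "less":
        p_value = std.cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * (1 - std.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, p_value


# Hypothetical data: 60 successes in a sample of 100, testing p0 = 0.5.
print(traditional_ci(60, 100))
print(plus_four_ci(60, 100))
print(z_test(60, 100, 0.5))
```

Note that the two intervals estimate the standard deviation from the data, while the test uses $p_0$, which is why the interval and the two-sided test need not agree exactly.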
Example

In recent years, 70% of first-year college students responding to a national survey identified "being very well-off financially" as an important personal goal. Suppose OSU finds that 103 of a SRS of 150 of its first-year students agree that this goal is important.

a) Verify that the assumptions needed to make reliable conclusions from hypothesis testing and confidence intervals are met.
b) Give a 95% confidence interval for the proportion of all first-year students at OSU who would identify being financially well-off as an important goal.
c) Is there sufficient evidence that the proportion of first-year students at this university who think this goal is important differs from the national value of 70%? State your hypotheses, calculate the P-value and state your conclusions at the $\alpha$ = 5% level.

Necessary Sample Size

Recall the margin of error for the large sample confidence interval is

$$m = z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

A certain sample size is necessary to guarantee a particular margin of error. Since this is decided prior to data collection, we need to guess the value of $\hat{p}$. Call this guess p*. Two ways of guessing are
1) using information from similar studies or pilot studies or past experience.
2) using p* = _____. The margin of error is ____________ when $\hat{p}$ = _________ so that our guess will be conservative. That is, any other value for the sample proportion $\hat{p}$ will yield a __________________ margin of error than planned.

A level C confidence interval for a population proportion p will have margin of error approximately equal to m when the sample size is

$$n = \left(\frac{z^*}{m}\right)^2 p^*(1 - p^*)$$

where p* is some guessed value for the sample proportion $\hat{p}$.

Example

PTC is a substance that has a strong bitter taste for some people and is tasteless for others. The ability to taste PTC is inherited. About 75% of Italians can taste PTC. You want to estimate the proportion of Americans with at least one Italian grandparent who can taste PTC.
How large a sample is necessary to test in order to estimate the proportion of PTC tasters within 0.04 with 95% confidence?

Example: Last year's survey of the sta135 class had 502 responses to a question "did you eat breakfast today?". Let's take that group of 502 as the population. I asked a stat program to draw 10 random samples, each of size 25. Here is a list of the individuals included in each sample:

 No.1  No.2  No.3  No.4  No.5  No.6  No.7  No.8  No.9 No.10
   24    19    21     9    22    16    14    79     1    27
   27    32    71    26    33    35    19   112    10    51
   52    51    72    62    77    54    20   135    24    97
   61   114    95    99    79    78    23   159    30    98
   85   115    98   108   125    81    24   170    34   106
  123   122   110   139   130   150    39   206    92   131
  138   126   191   173   139   154    52   216   103   141
  169   134   201   177   173   160    70   231   114   172
  176   141   225   184   174   200    92   232   119   182
  230   198   227   230   196   235   103   258   163   197
  231   219   232   249   210   246   133   272   167   217
  244   242   253   250   230   261   144   286   201   219
  295   249   265   253   236   297   152   291   204   236
  324   309   286   262   256   303   161   301   249   283
  349   315   310   267   269   313   166   325   332   288
  381   331   313   302   319   340   231   336   340   294
  445   339   349   311   323   357   266   353   353   316
  453   351   360   321   353   361   288   360   375   325
  455   384   395   338   355   383   311   383   429   326
  459   391   404   362   372   396   328   387   431   339
  474   417   443   385   384   399   333   424   432   363
  480   434   467   439   396   423   349   466   437   404
  488   439   477   441   397   451   377   484   449   456
  491   448   479   444   420   457   432   498   455   462
  496   492   482   488   473   476   456   500   480   478

I have marked some of the population members who appeared in more than one sample. With half the population covered in the 10 samples, it is not surprising that there would be some repetitions and some members never hit. No two samples are identical.

Since ______% of the students answered yes, the parameter p is known to be ______.

Recall: For a 90% interval, z* = 1.64.

Since p is _________, a sample size of ________ is large enough to use the normal distribution as the sampling distribution of $\hat{p}$.
However, n = ____ is __________ large enough to use the traditional CI, according to M&M's rules. The estimated standard deviations and MOE's are shown in the following table:

  X   $\hat{p}$   $\tilde{p}$   SE(tr)   M(90%)   SE(+4)   M(90% +4)
 12   0.48   0.48   0.10     0.16     0.09     0.15
  6   0.24   0.28   0.09     0.14     0.08     0.14
 14   0.56   0.55   0.10     0.16     0.09     0.15
 12   0.48   0.48   0.10     0.16     0.09     0.15
 15   0.60   0.59   0.10     0.16     0.09     0.15
 17   0.68   0.66   0.09     0.15     0.09     0.15
 16   0.64   0.62   0.10     0.16     0.09     0.15
 10   0.40   0.41   0.10     0.16     0.09     0.15
 16   0.64   0.62   0.10     0.16     0.09     0.15
 12   0.48   0.48   0.10     0.16     0.09     0.15

Next are tables showing the $\hat{p}$ and $\tilde{p}$ values for the 10 samples with the upper and lower 90% confidence bounds. The tables show whether ____ was inside the interval or not. We expect to miss p in one of ten 90% intervals and in this simulation we missed _________________.

Traditional:
$\hat{p}$   0.48  0.24  0.56  0.48  0.60  0.68  0.64  0.40  0.64  0.48
Up     0.64  0.38  0.72  0.64  0.76  0.83  0.80  0.56  0.80  0.64
Low    0.32  0.10  0.40  0.32  0.44  0.53  0.48  0.24  0.48  0.32
p in?

PLUS FOUR
$\tilde{p}$   0.48  0.28  0.55  0.48  0.59  0.66  0.62  0.41  0.62  0.48
Up     0.64  0.41  0.70  0.64  0.74  0.80  0.77  0.56  0.77  0.64
Low    0.33  0.14  0.40  0.33  0.44  0.51  0.47  0.26  0.47  0.33
p in?

Non-standard use of test about p.

In trials of the Salk polio vaccine, 200,000 children were assigned to the control group and the same number to the treatment group. They observed 142 cases of polio in the control group and 56 in the treatment group. Is the vaccine effective?

The devil's advocate hypothesis says the vaccine has no effect. Thus, the skeptics say that a few children are fated to contract polio; assignment to treatment or control group has nothing to do with it. Each child has a 50-50 chance to be in treatment or control, just depending on the toss of a coin. Each polio case has a 50-50 chance to turn up in the treatment group or the control group. Therefore, the number of polio cases in the two groups must be about the same.
Any difference is due to the chance variability in coin tossing.

Let's examine this: n = 198 cases. p = probability of ending up in the placebo group = 0.5 under $H_0$. The estimate is

$$\hat{p} = \frac{142}{198} = 0.72$$

n is large and under $H_0$, p is 0.5, so the normal distribution can be used.

$$z = \frac{0.72 - 0.5}{\sqrt{\frac{(0.5)(0.5)}{198}}} = \frac{0.22}{0.0355} = 6.19$$

Although the devil's advocate hypothesis does not provide an alternative value for p, clearly we are interested in values of $\hat{p}$ ____________ than 0.5. Use as the P-value:

P(Z ______) = ________________

About ____________ in ______________ or less than _____ in a _____________.

Why can't we use the formula $\hat{p} \pm z^* \sqrt{\hat{p}(1 - \hat{p})/n}$ to make a confidence interval in the following case?

We take a random sample of 50 households in order to estimate the percentage of all homes in the United States that have a refrigerator. It turns out that 49 of the 50 homes have a refrigerator.

$\hat{p}$ is very close to 1 ($\hat{p} = \frac{49}{50} = 0.98$). The sample size is ____________________ to compensate for this ($n(1 - \hat{p}) = n - X = 1$, which is less than 15), so the _______________________ would be very bad.

Since n ___________, we can use the Plus Four approach:

$$\tilde{p} = \frac{51}{54} = 0.9444$$

So the CI is

$$0.944 \pm z^* \sqrt{\frac{0.944(0.056)}{54}} = 0.944 \pm z^*(0.0312)$$

A 95% interval (z* = ____) is (0.883, ________). We would use (0.883, ______).

The traditional method gives (0.941, _____), or (0.941, ____), shorter, but ________________.

Example - It's Wednesday November 3, 2004 - do you know who your president is?

Last surveys before the 2004 election for some major national polls

Poll           Date ended   Bush     Kerry    Nader   Other
Zogby          11/2/04      49.4%    49.1%    --      --
Gallup         10/31/04     49%      49%      0.5%    0.5%
Gallup         10/24/04     51%      46%      1%      --
Pew            10/30/04     51%      48%      1%      --
Harris         11/1/04      49%      48%      2%      1%
TIPP           11/1/04      50.1%    48.0%    1.1%    0.8%
Tarrance (R)   11/1/04      51.2%    47.8%    0.5%    0.5%
Lake (D)       11/1/04      48.6%    50.7%    --      --
Actual Popular Vote         50.75%   48.30%   0.36%   0.59%

Why did the polls differ from each other?

Why did they differ from themselves a few days before?
Gallup had about 1600 "likely voters" in their final poll but only about 700 in their polls done in early October. Why did they increase the numbers?

Gallup reported a ±3% margin of error for their final pre-election poll. Thus for Bush they made the confidence statement 49% ± 3%.

How did they get it?

For 95% confidence, MOE is:

$$2 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

For the Gallup poll, this is:

$$2 \sqrt{\frac{.49(.51)}{1600}} \approx 0.025, \text{ reported as } 0.03$$

Would a Plus Four C.I. be different?

What does this mean?

The sample size is _________ and p is _______________. They use the ________________ approximation to get the MOE.

________________ of the time $\hat{p}$ must be ____________ the interval that goes from p __________ to p ____________, i.e. the probability is __________ that the interval that goes from $\hat{p}$ _________ to $\hat{p}$ ________________ will contain p.
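The margin-of-error arithmetic for the Gallup poll, and the sample-size formula from the Necessary Sample Size part of these notes, can be checked with a short Python sketch. The function names `approx_moe_95` and `required_n` are hypothetical, and reusing $\hat{p}$ = 0.49 for the 700-person early poll is an assumption made only to show the effect of n; the notes do not give that poll's result.

```python
import math


def approx_moe_95(p_hat, n):
    """Approximate 95% margin of error, using 2 in place of z* = 1.96."""
    return 2 * math.sqrt(p_hat * (1 - p_hat) / n)


def required_n(m, p_star, z_star=1.96):
    """Sample size n = (z*/m)^2 p*(1 - p*), rounded up to a whole number."""
    return math.ceil((z_star / m) ** 2 * p_star * (1 - p_star))


# Final Gallup poll from the notes: p_hat = 0.49, n = 1600.
print(round(approx_moe_95(0.49, 1600), 4))

# Assumed illustration: the same p_hat with only 700 respondents.
print(round(approx_moe_95(0.49, 700), 4))

# Sample size for a 0.03 margin with the guess p* = 0.5 (hypothetical inputs).
print(required_n(0.03, 0.5))
```

Because n sits under a square root, the margin shrinks slowly as the sample grows, which is one way to think about the question of why the final poll used more respondents.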