Document 284733

Sect. 8.1: Inference for a Population Proportion
Previously, we have been making inferences about the population mean $\mu$.
Now, we will be concerned about estimating the proportion, p, for some population that consists
of "successes" and "failures" as in Ch. 5.
The_______________________ of interest is the population proportion, p, of “successes”.
The ___________________ we will be using to estimate p is the sample proportion,
$\hat{p}$, where
$$\hat{p} = \frac{\text{number of successes (1's) in the sample}}{n} = \frac{X}{n}.$$
For _____________sample sizes, the Binomial distribution must be used for inference about p.
pˆ
We will assume___________ sample size and use the _______________________________.
Recall from Chapter 5:
If a SRS of size n is chosen from a population with proportion p of “successes”, then
As the sample size n increases, the sampling distribution becomes approximately _________
1) The mean of the sampling distribution is ____, therefore it is ____________________.
2) The standard deviation of the sampling distribution is $\sqrt{p(1 - p)/n}$.
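The facts recalled here can be checked numerically. The following is a minimal simulation sketch, not part of the original notes: the values n = 100, p = 0.3, the number of repetitions, and the helper name `simulate_p_hat` are all illustrative assumptions. It compares the simulated standard deviation of the sample proportion with $\sqrt{p(1 - p)/n}$.

```python
import math
import random


def simulate_p_hat(n, p, reps, seed=0):
    """Return `reps` simulated sample proportions, each from n independent trials."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    p_hats = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        p_hats.append(successes / n)
    return p_hats


# Illustrative values only (assumptions, not taken from the notes).
n, p, reps = 100, 0.3, 5000
p_hats = simulate_p_hat(n, p, reps)

sim_mean = sum(p_hats) / reps
sim_sd = math.sqrt(sum((x - sim_mean) ** 2 for x in p_hats) / (reps - 1))
theory_sd = math.sqrt(p * (1 - p) / n)

print(f"simulated mean of p-hat: {sim_mean:.4f}")
print(f"simulated sd of p-hat:   {sim_sd:.4f}")
print(f"sqrt(p(1-p)/n):          {theory_sd:.4f}")
```

A histogram of `p_hats` would also show the roughly bell-shaped sampling distribution that the large-sample methods in this section rely on.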
Confidence Intervals:
Note that the standard deviation of $\hat{p}$ involves the _____________ we are trying to estimate, so
we use instead the _____________________ of $\hat{p}$ (i.e., the estimate of this standard deviation),
which has the form:
$$SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
Confidence intervals for p.
A level C = 1 - $\alpha$ confidence interval for p has the form
$$\hat{p} \pm z^* \, SE_{\hat{p}},$$
where_____ is the upper critical value for the standard normal distribution. Use Table___
to get this value. Use this interval for confidence level above _______%, only when both ____
and ___________ are __________________ .
For somewhat ____________ Moore and McCabe point out that this method can be
___________ . They propose an alternative method that calculates an estimate of p as though
____ additional observations had been obtained and _______ of them were "successes." This
method, the Plus Four estimate, moves the estimate closer to ______.
Sta 245 Sec.8.1 SB
Plus Four Confidence intervals for p.
Estimate p by
$$\tilde{p} = \frac{X + 2}{n + 4}.$$
A level C = 1 - $\alpha$ confidence interval for p has the form
$$\tilde{p} \pm z^* \sqrt{\frac{\tilde{p}(1 - \tilde{p})}{n + 4}},$$
where z* is the upper ___________ critical value for the standard normal distribution. Use
Table D to get this value.
Use this interval for confidence level above _______%, only when _____________.
Hypothesis Testing and Confidence Intervals
The hypothesis testing involving sample proportions is another version of the ________-sample
z test.
When we are testing the null hypothesis H0: p = p0, we use____ in place of________ when
calculating the standard deviation in our test statistic, i.e. we standardize assuming $p_0$ is the
true mean.
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$
The P-value is calculated based on the form of Ha:
Ha: p > p0    P is P(Z > z)
Ha: p < p0    P is P(Z < z)
Ha: p ≠ p0    P is 2P(Z > |z|)
Replace z with the observed value of the test statistic.
This method should be used only when the ____________ is large enough such that the
______ number of successes and the ______________ number of failures are _______________
or more.
Note that the close connection between confidence intervals and two-sided hypothesis tests that
we saw in Chapter 6 does___________ here. That is due to ____________ the same estimate
of the standard deviation of pˆ in the two procedures. It is approximately true that the
confidence interval gives the range of p values that would be accepted if we made them the p0
in a hypothesis test.
Checklist for Inference About a Proportion
1) The data are a ______________ from the population of interest.
2) The population is at least ____________ times larger than the sample.
   Otherwise, the formula for the standard deviation is not very accurate.
3) The sample size n is large enough such that:
   Traditional Confidence Interval:
   Plus Four Confidence interval: n ____________
   Hypothesis Test of H0: p = p0: $np_0 \ge 10$ and $n(1 - p_0) \ge 10$
If these assumptions are not satisfied then the conclusion about the parameter p is __________
_____________ reliable.
Example
In recent years, 70% of first-year college students responding to a national survey
identified “being very well-off financially” as an important personal goal. Suppose OSU finds
that 103 of a SRS of 150 of its first-year students agree that this goal is important.
a) Verify that the assumptions needed to make reliable conclusions from hypothesis testing
   and confidence intervals are met.
b) Give a 95% confidence interval for the proportion of all first-year students at OSU who
   would identify being financially well-off as an important goal.
c) Is there sufficient evidence that the proportion of first-year students at this university who
   think this goal is important differs from the national value of 70%?
   State your hypotheses, calculate the P-value and state your conclusions at the $\alpha$ = 5%
   level.
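One way to check hand calculations for this example is a short script. The sketch below is not part of the original notes. It assumes the traditional large-sample interval and the one-sample z test described above, and it assumes a threshold of 15 successes and 15 failures for the traditional interval (the notes leave that value as a blank). The variable names are illustrative.

```python
import math
from statistics import NormalDist

x, n, p0 = 103, 150, 0.70          # data and null value from the example
p_hat = x / n

# a) Sample-size checks. The 15 used for the interval is an assumed threshold;
#    the 10 used for the test is the rule stated in the checklist.
ci_condition_ok = x >= 15 and (n - x) >= 15
test_condition_ok = n * p0 >= 10 and n * (1 - p0) >= 10

# b) Traditional 95% confidence interval: p_hat +/- z* SE.
z_star = NormalDist().inv_cdf(0.975)
se = math.sqrt(p_hat * (1 - p_hat) / n)
ci_low, ci_high = p_hat - z_star * se, p_hat + z_star * se

# c) Two-sided z test of H0: p = p0, standardizing with p0.
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"p-hat = {p_hat:.4f}, conditions ok: {ci_condition_ok and test_condition_ok}")
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
print(f"z = {z:.3f}, two-sided P-value = {p_value:.3f}")
```

Note that the interval uses $\hat{p}$ in its standard error while the test uses $p_0$, which is the reason the two procedures do not match exactly.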
Necessary Sample Size
Recall the margin of error for the large sample confidence interval is
$$m = z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.$$
A certain sample size is necessary to guarantee a particular margin of error. Since this is
decided prior to data collection, we need to guess the value of $\hat{p}$. Call this guess p*. Two ways
of guessing are
1) using information from similar studies or pilot studies or past experience.
2) using p* = _____. The margin of error is ____________ when
   $\hat{p}$ = _________ so that our guess will be conservative. That is, any other value for the
   sample proportion $\hat{p}$ will yield a __________________ margin of error than planned.
A level C confidence interval for a population proportion p will have margin of error
approximately equal to m when the sample size is
$$n = \left(\frac{z^*}{m}\right)^2 p^*(1 - p^*)$$
where p* is some guessed value for the sample proportion $\hat{p}$.
Example
PTC is a substance that has a strong bitter taste for some people and is tasteless for others.
The ability to taste PTC is inherited. About 75% of Italians can taste PTC. You want to estimate
the proportion of Americans with at least one Italian grandparent who can taste PTC. How large
a sample is necessary to test in order to estimate the proportion of PTC tasters within 0.04
with 95% confidence?
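The sample-size formula is easy to apply in code. The sketch below is not from the original notes. It assumes the formula $n = (z^*/m)^2 p^*(1 - p^*)$ with the result rounded up to a whole number, and the function name `sample_size` and its parameters are illustrative. It evaluates the PTC example with the guess p* = 0.75 and, for comparison, with p* = 0.5.

```python
import math
from statistics import NormalDist


def sample_size(m, conf=0.95, p_star=0.5):
    """Smallest n giving margin of error about m at confidence level `conf`.

    p_star is the guessed value of the sample proportion (hypothetical helper).
    """
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil((z_star / m) ** 2 * p_star * (1 - p_star))


# PTC example: margin 0.04, 95% confidence.
n_guess = sample_size(0.04, 0.95, p_star=0.75)  # guess from the Italian figure
n_half = sample_size(0.04, 0.95, p_star=0.5)    # comparison with p* = 0.5

print(n_guess, n_half)
```

Rounding up rather than to the nearest integer keeps the margin of error at or below the target.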
Example: Last year's survey of the sta135 class had 502 responses to a question "did you eat
breakfast today?". Let's take that group of 502 as the population. I asked a stat program to
draw 10 random samples, each of size 25. Here is a list of the individuals included in each
sample:
  No.1  No.2  No.3  No.4  No.5  No.6  No.7  No.8  No.9 No.10
    24    19    21     9    22    16    14    79     1    27
    27    32    71    26    33    35    19   112    10    51
    52    51    72    62    77    54    20   135    24    97
    61   114    95    99    79    78    23   159    30    98
    85   115    98   108   125    81    24   170    34   106
   123   122   110   139   130   150    39   206    92   131
   138   126   191   173   139   154    52   216   103   141
   169   134   201   177   173   160    70   231   114   172
   176   141   225   184   174   200    92   232   119   182
   230   198   227   230   196   235   103   258   163   197
   231   219   232   249   210   246   133   272   167   217
   244   242   253   250   230   261   144   286   201   219
   295   249   265   253   236   297   152   291   204   236
   324   309   286   262   256   303   161   301   249   283
   349   315   310   267   269   313   166   325   332   288
   381   331   313   302   319   340   231   336   340   294
   445   339   349   311   323   357   266   353   353   316
   453   351   360   321   353   361   288   360   375   325
   455   384   395   338   355   383   311   383   429   326
   459   391   404   362   372   396   328   387   431   339
   474   417   443   385   384   399   333   424   432   363
   480   434   467   439   396   423   349   466   437   404
   488   439   477   441   397   451   377   484   449   456
   491   448   479   444   420   457   432   498   455   462
   496   492   482   488   473   476   456   500   480   478
I have marked some of the population members who appeared in more than one sample.
With half the population covered in the 10 samples, it is not surprising that there would be
some repetitions and some members never hit. No two samples are identical. Since
______% of the students answered yes, the parameter p is known to be ______.
Recall:
For a 90% interval, z* = 1.64. Since p is _________, a sample size of________ is large
enough to use the normal distribution as the sampling distribution of pˆ . However, n = ____ is
__________ large enough to use the traditional CI, according to M&M's rules.
The estimated standard deviations and MOE's are shown in the following table:
   X   $\hat{p}$   $\tilde{p}$   SE(tr)   M(90%)   SE(+4)   M(90% +4)
  12   0.48    0.48    0.10     0.16     0.09     0.15
   6   0.24    0.28    0.09     0.14     0.08     0.14
  14   0.56    0.55    0.10     0.16     0.09     0.15
  12   0.48    0.48    0.10     0.16     0.09     0.15
  15   0.60    0.59    0.10     0.16     0.09     0.15
  17   0.68    0.66    0.09     0.15     0.09     0.15
  16   0.64    0.62    0.10     0.16     0.09     0.15
  10   0.40    0.41    0.10     0.16     0.09     0.15
  16   0.64    0.62    0.10     0.16     0.09     0.15
  12   0.48    0.48    0.10     0.16     0.09     0.15
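The table entries can be recomputed from the success counts X alone. The sketch below is not part of the original notes. It assumes n = 25, z* = 1.64 as stated above, the traditional standard error $\sqrt{\hat{p}(1 - \hat{p})/n}$, and the Plus Four standard error $\sqrt{\tilde{p}(1 - \tilde{p})/(n + 4)}$ with $\tilde{p} = (X + 2)/(n + 4)$. The function name `table_row` is illustrative. Because the printed table was rounded by other software, an occasional last-digit difference would not be surprising.

```python
import math

xs = [12, 6, 14, 12, 15, 17, 16, 10, 16, 12]  # successes in the 10 samples
n = 25
z_star = 1.64  # 90% critical value as used in the notes


def table_row(x, n, z_star):
    """Return (p_hat, p_tilde, SE(tr), M(90%), SE(+4), M(90% +4)) for one sample."""
    p_hat = x / n
    p_tilde = (x + 2) / (n + 4)
    se_tr = math.sqrt(p_hat * (1 - p_hat) / n)
    se_p4 = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return (p_hat, p_tilde, se_tr, z_star * se_tr, se_p4, z_star * se_p4)


rows = [table_row(x, n, z_star) for x in xs]
for x, row in zip(xs, rows):
    print(f"{x:3d}  " + "  ".join(f"{v:.2f}" for v in row))
```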
Next are tables showing the $\hat{p}$ and $\tilde{p}$ values for the 10 samples with the upper and lower 90%
confidence bounds. The tables show whether ____ was inside the interval or not. We expect to
miss p in one of ten 90% intervals and in this simulation we missed_________________.
Traditional:
$\hat{p}$     0.48  0.24  0.56  0.48  0.60  0.68  0.64  0.40  0.64  0.48
Up      0.64  0.38  0.72  0.64  0.76  0.83  0.80  0.56  0.80  0.64
Low     0.32  0.10  0.40  0.32  0.44  0.53  0.48  0.24  0.48  0.32
p in?
PLUS FOUR
$\tilde{p}$     0.48  0.28  0.55  0.48  0.59  0.66  0.62  0.41  0.62  0.48
up      0.64  0.41  0.70  0.64  0.74  0.80  0.77  0.56  0.77  0.64
low     0.33  0.14  0.40  0.33  0.44  0.51  0.47  0.26  0.47  0.33
p in?
Non-standard use of test about p.
In trials of the Salk polio vaccine, 200,000 children were assigned to the control group and the same
number to the treatment group. They observed 142 cases of polio in the control group and 56 in the
treatment group. Is the vaccine effective?
The devil's advocate hypothesis says the vaccine has no effect. Thus, the skeptics say that a few children
are fated to contract polio; assignment to treatment or control group has nothing to do with it. Each child
has a 50-50 chance to be in treatment or control, just depending on the toss of a coin. Each polio case has
a 50-50 chance to turn up in the treatment group or the control group.
Therefore, the number of polio cases in the two groups must be about the same. Any difference is due to
the chance variability in coin tossing.
Let's examine this: n = 198 cases. p = probability of ending up in the placebo group = 0.5 under H0. The
estimate is $\hat{p} = \frac{142}{198} = 0.72$.
n is large and under H0, p is 0.5, so the normal distribution can be used.
$$z = \frac{0.72 - 0.5}{\sqrt{\frac{(0.5)(0.5)}{198}}} = \frac{0.22}{0.0355} = 6.19$$
Although the devil's advocate hypothesis does not provide an alternative value for p, clearly we are
interested in values of $\hat{p}$ ____________ than 0.5. Use as the P-value: P(Z ≥ 6.19) = ________________
About ____________in ______________ or less than _____ in a _____________.
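The size of this tail probability is easier to appreciate when computed directly. The sketch below is not from the original notes. It assumes the one-sided z test of H0: p = 0.5 described above and uses the complementary error function, since a normal table stops far short of z = 6. It also uses the unrounded $\hat{p} = 142/198$, so its z value comes out slightly below the 6.19 obtained from the rounded 0.72.

```python
import math

cases_control, cases_treatment = 142, 56
n = cases_control + cases_treatment       # 198 polio cases in total
p0 = 0.5                                   # chance a case lands in control under H0

p_hat = cases_control / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Upper-tail probability P(Z >= z) for a standard normal variable.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(f"p-hat = {p_hat:.4f}, z = {z:.2f}, one-sided P-value = {p_value:.2e}")
```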
Why can't we use the formula $\hat{p} \pm z^* \sqrt{\hat{p}(1 - \hat{p})/n}$ to make a confidence interval in the following
case?
We take a random sample of 50 households in order to estimate the percentage of all homes
in the United States that have a refrigerator. It turns out that 49 of the 50 homes have a
refrigerator.
$\hat{p}$ is very close to 1 ($\hat{p} = \frac{49}{50} = 0.98$). The sample size is____________________ to
compensate for this ($n(1 - \hat{p}) = n - X = 1 < 15$), so the _______________________ would be
very bad.
Since n ___________, we can use the Plus Four approach: $\tilde{p} = \frac{51}{54} = 0.9444$.
So the CI is
$$0.944 \pm z^* \sqrt{\frac{0.944(0.056)}{54}} = 0.944 \pm z^*(0.0312).$$
A 95% interval (z* = ____) is (0.883, ________).
We would use (0.883, ______). The traditional method gives (0.941, _____), or (0.941, ____),
shorter, but ________________.
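The two intervals for the refrigerator example can be compared side by side. The sketch below is not from the original notes. It assumes a 95% level, the Plus Four estimate $(X + 2)/(n + 4)$, and the convention of cutting an interval off at 1 because a proportion cannot exceed 1. The helper name `interval` is illustrative.

```python
import math
from statistics import NormalDist

x, n = 49, 50
z_star = NormalDist().inv_cdf(0.975)   # 95% critical value


def interval(center, n_eff, z_star):
    """Return the raw (unclipped) interval center +/- z* sqrt(center(1-center)/n_eff)."""
    margin = z_star * math.sqrt(center * (1 - center) / n_eff)
    return (center - margin, center + margin)


traditional = interval(x / n, n, z_star)
plus_four = interval((x + 2) / (n + 4), n + 4, z_star)

for name, (low, high) in [("traditional", traditional), ("plus four", plus_four)]:
    # Report both the raw upper end and the upper end cut off at 1.
    print(f"{name}: ({low:.3f}, {high:.3f}) -> ({low:.3f}, {min(high, 1.0):.3f})")
```

Both raw upper ends land above 1, which is itself a sign that the normal approximation is under strain this close to the boundary.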
Example - It's Wednesday November 3, 2004 - do you know who your president is?
Last surveys before the 2004 election for some major national polls
Poll           Date ended   Bush     Kerry    Nader    Other
Zogby          11/2/04      49.4%    49.1%    --       --
Gallup         10/31/04     49%      49%      0.5%     0.5%
Gallup         10/24/04     51%      46%      1%       --
Pew            10/30/04     51%      48%      1%       --
Harris         11/1/04      49%      48%      2%       1%
TIPP           11/1/04      50.1%    48.0%    1.1%     0.8%
Tarrance (R)   11/1/04      51.2%    47.8%    0.5%     0.5%
Lake (D)       11/1/04      48.6%    50.7%    --       --
Actual Popular Vote:        50.75%   48.30%   0.36%    0.59%
Why did the polls differ from each other?
Why did they differ from themselves a few days before?
Gallup had about 1600 "likely voters" in their final poll but only about 700 in their polls done in early
October. Why did they increase the numbers?
Gallup reported a ±3% margin of error for their final pre-election poll.
Thus for Bush they made the confidence statement
49% ± 3%
How did they get it?
For 95% confidence, MOE is: $2\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
For the Gallup poll, this is: $2\sqrt{\frac{.49(.51)}{1600}} \approx 0.03$
Would a Plus Four C.I. be different?
What does this mean?
The sample size is_________ and p is_______________. They use the ________________
approximation to get the MOE.
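The questions about Gallup's sample sizes and the Plus Four interval can be explored numerically. The sketch below is not from the original notes. It assumes the rule-of-thumb multiplier 2 for 95% confidence used above, it assumes $\hat{p} = 0.49$ for the earlier 700-person poll purely for comparison, and it reconstructs the count X as 49% of 1600. The function name `margin_of_error` is illustrative.

```python
import math


def margin_of_error(p, n, z_star=2.0):
    """Approximate 95% margin of error z* sqrt(p(1-p)/n), with z* = 2 as in the notes."""
    return z_star * math.sqrt(p * (1 - p) / n)


p_hat = 0.49
moe_final = margin_of_error(p_hat, 1600)   # final pre-election poll
moe_early = margin_of_error(p_hat, 700)    # early October size; p_hat assumed equal

# Plus Four version for the final poll, with X reconstructed from the percentage.
x = round(p_hat * 1600)
p_tilde = (x + 2) / (1600 + 4)
moe_plus_four = margin_of_error(p_tilde, 1600 + 4)

print(f"n = 1600: +/- {moe_final:.4f}")
print(f"n = 700:  +/- {moe_early:.4f}")
print(f"plus four, n = 1600: p-tilde = {p_tilde:.4f}, +/- {moe_plus_four:.4f}")
```

With n this large and the proportion near one half, adding four imaginary observations barely moves either the estimate or the margin.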
________________ of the time pˆ must be____________ the interval that goes from p __________ to
p ____________, i.e. the probability is __________ that the interval that goes from pˆ _________ to
pˆ ________________ will contain p.