Worksheet

Hypothesis Tests – Means – 2 samples
1. More eggs? Can a food additive increase egg
production? Agricultural researchers want to design an
experiment to find out. They have 100 hens available.
They have two kinds of feed—the regular feed and the
new feed with the additive. They plan to run their
experiment for a month, recording the number of eggs
each hen produces.
a) Design an experiment that will require a two-sample t-procedure to analyze the results.
b) Design an experiment that will require a matched-pairs t-procedure to analyze the results.
c) Which experiment would you consider the stronger
design? Why?
2. MTV. Some students do homework with the TV on.
(Anyone come to mind?) Some researchers want to see
if people can work as effectively with as without
distraction. The researchers will time some volunteers
to see how long it takes them to complete some
relatively easy crossword puzzles. During some of the
trials, the room will be quiet, during other trials in the
same room, a TV will be on, tuned to MTV.
a) Design an experiment that will require a two-sample t-procedure to analyze the results.
b) Design an experiment that will require a matched-pairs t-procedure to analyze the results.
c) Which experiment would you consider the stronger
design? Why?
3. Women. Values for the labor force participation rate of
women (LFPR) are published by the U.S. Bureau of
Labor Statistics. We are interested in whether there
was a difference between female participation in 1968
and 1972, a time of rapid change for women. We check
LFPR values for 19 randomly selected cities for 1968
and 1972. Shown below is software output for two
possible tests.
Paired t-Test of μ(1 - 2)
Test Ho: μ(1972-1968) = 0 vs Ha: μ(1972-1968) ≠ 0
Mean of Paired Differences = 0.0337
t-Statistic = 2.458 w/ 18 df
p = 0.0244
2-Sample t-Test of μ1 - μ2
Ho: μ1 - μ2 = 0 Ha: μ1 - μ2 ≠ 0
Test Ho: μ(1972) - μ(1968) = 0
vs Ha: μ(1972) - μ(1968) ≠ 0
Difference Between Means = 0.0337
t-Statistic = 1.496 w/ 35 df
p = 0.1434
a) Which of these tests is appropriate for these data?
Explain.
b) Using the test you selected, state your conclusion.
4. Learning math. The Core Plus Mathematics Project
(CPMP) is an innovative approach to teaching
mathematics that engages students in group
investigations and mathematical modeling. After field
tests in 36 high schools over a three-year period,
researchers compared the performances of CPMP
students with those taught using a traditional
curriculum. In one test, students had to solve applied
algebra problems that did not allow them to use
calculators. The table below shows the results. Are the
mean scores of the two groups significantly different?
Test an appropriate hypothesis and state your
conclusion.
Performance on Algebraic Symbolic Manipulation Without Use of Calculators
Math program    n     Mean   SD
CPMP            312   29.0   18.8
Traditional     265   38.4   16.2
a) Write an appropriate hypothesis.
b) Do you think the assumptions for inference are
satisfied? Explain.
c) Here is computer output for this hypothesis test.
Explain what the P-value means in this context.
2-Sample t-Test of 1 - 2  0 t-Statistic = -6.496
w/ 583 df P < 0.0001
e) State a conclusion about the CPMP program.
5. Rain. Simpson, Alsen, and Eden (Technometrics 1975)
report the results of trials in which clouds were seeded
and the amount of rainfall recorded. The authors report
on 26 seeded and 26 unseeded clouds in order of the
amount of rainfall, largest amount first. Here are two
possible tests to study the question of whether cloud
seeding works. Which test is appropriate for these
data? Explain your choice. Using the test you select,
state your conclusion.
Paired t-Test of μ(1 - 2)
Mean of Paired Differences = -277.39615
t-Statistic = -3.641 w/ 25 df
p = 0.0012
2-Sample t-Test of 1 - 2
Difference Between Means = -277.4
t-Statistic = -1.998 w/ 33 df
p = 0.0538
a) Which of these tests is appropriate for these data?
Explain.
b) Using the test you selected, state your conclusion.
7. CPMP and word problems. The study of the new
CPMP mathematics methodology described in
Exercise 4 also tested students' abilities to solve word
problems. This table shows how the CPMP and
traditional groups performed. What do you conclude?
Math program    n     Mean   SD
CPMP            320   57.4   32.1
Traditional     273   53.9   28.5
8. Streams. Researchers collected samples of water from
streams in the Adirondack Mountains to investigate the
effects of acid rain. They measured the pH (acidity) of
the water and classified the streams with respect to the
kind of substrate (type of rock over which they flow).
A lower pH means the water is more acidic. Here is a
plot of the pH of the streams by substrate (limestone,
mixed, or shale):
Here are selected parts of a software analysis
comparing the pH of streams with limestone and shale
substrates:
2-Sample t-Test of μ1 - μ2
Difference Between Means = 0.735
t-Statistic = 16.30 w/ 133 df
p ≤ 0.0001
a) State the null and alternative hypotheses for this
test.
b) From the information you have, do the
assumptions and conditions appear to be met?
c) What conclusion would you draw?
9. Hurricanes. The data below show the number of
hurricanes recorded annually before and after 1970.
Create an appropriate visual display and determine
whether these data are appropriate for testing whether
there has been a change in the frequency of hurricanes.
1944-1969
3, 2, 1, 2, 4, 3, 7, 2, 3,
3, 2, 5, 2, 2, 4, 2, 2, 6,
0, 2, 5, 1, 3, 1, 0, 3
1970-2000
2, 1, 0, 1, 2, 3, 2, 1, 2, 2,
2, 3, 1, 1, 1, 3, 0, 1, 3, 2,
1, 2, 1, 1, 0, 5, 6, 1, 3, 5, 3
10. Memory. Does ginkgo biloba enhance memory? In an
experiment to find out, subjects were assigned
randomly to take ginkgo biloba supplements or a
placebo. Their memory was tested to see whether it
improved. Here are boxplots comparing the two groups
and some computer output from a two-sample t-test
computed for the data.
2-Sample t-Test of μG - μP > 0
Difference Between Means = -0.9914
t-Statistic = -1.540 w/ 196 df
P = 0.9374
a) Explain in this context what the P-value means.
b) State your conclusion about the effectiveness of
ginkgo biloba.
c) Proponents of ginkgo biloba continue to insist that
it works. What type of error do they claim your
conclusion makes? Explain.
11. Job satisfaction. A company institutes an exercise
break for its workers to see if this will improve job
satisfaction, as measured by a questionnaire that
assesses workers' satisfaction. Scores for 10 randomly
selected workers before and after the implementation
of the exercise program are shown.
a) Identify the procedure you would use to assess the
effectiveness of the exercise program, and check to see
if the conditions allow use of that procedure.
b) Test an appropriate hypothesis and state your
conclusion.
Job satisfaction index
Worker number   Before   After
1               34       33
2               28       36
3               29       50
4               45       41
5               26       37
6               27       41
7               24       39
8               15       21
9               15       20
10              27       37
13. Sleep. W. S. Gosset (Student) refers to data recording
the number of hours of additional sleep gained by 10
patients from the use of laevohyoscyamine
hydrobromide. We want to see if there is strong
evidence that the herb can help people get more sleep.
a) State the null and alternative hypotheses clearly.
b) A t-test of the null hypothesis of no gain has a t-statistic of 3.680 with 9 degrees of freedom. Find
the P-value.
c) Interpret this result by explaining the meaning of
the P-value.
d) State your conclusion regarding the hypotheses.
e) This conclusion, of course, may be incorrect. If so,
which type of error was made?
14. Gasoline. Many drivers of cars that can run on regular
gas actually buy premium in the belief that they will
get better gas mileage. To test that belief, we use 10
cars in a company fleet in which all the cars run on
regular gas. Each car is filled first with either regular or
premium gasoline, decided by a coin toss, and the
mileage for that tank recorded. Then the mileage is
recorded again for the same cars for a tank of the other
kind of gasoline. We don't let the drivers know about
this experiment. Here are the results (miles per gallon):
Car #     1   2   3   4   5   6   7   8   9   10
Regular   16  20  21  22  23  22  27  25  27  28
Premium   19  22  24  24  25  25  26  26  28  32
a) Is there evidence that cars get significantly better
fuel economy with premium gasoline?
b) How big might that difference be? Check a 90%
confidence interval.
c) Even if the difference is significant, why might the
company choose to stick with regular gasoline?
d) Suppose you had done a Bad Thing. (We're sure
you didn't.) Suppose you had mistakenly treated
these data as two independent samples instead of
matched pairs. What would the significance test
have found? Carefully explain why the results are
so different.
15. Yogurt. Do these data suggest that there is a significant
difference in calories between servings of strawberry
and vanilla yogurt? Test an appropriate hypothesis and
state your conclusion. Don't forget to check
assumptions and conditions!
Brand                  Strawberry   Vanilla
America's Choice       210          200
Breyer's Lowfat        220          220
Columbo                220          180
Dannon Light 'n Fit    120          120
Dannon Lowfat          210          230
Dannon laCreme         140          140
Great Value            180          80
La Yogurt              170          160
Mountain High          200          170
Stonyfield Farm        100          120
Yoplait Custard        190          190
Yoplait Light          100          100
16. Caffeine. A student experiment investigating the
potential impact of caffeine on studying for a test
involved 30 subjects, randomly divided into two
groups. Each group took a memory test. The subjects
then each drank two cups of regular (caffeinated) cola
or caffeine-free cola. Thirty minutes later they each
took another version of the memory test, and the
changes in their scores were noted. Among the 15
subjects who drank caffeine, scores fell an average of 0.933 points, with a standard deviation of 2.988 points.
Among the no-caffeine group, scores went up an
average of 1.429 points with a standard deviation of
2.441 points. Assumptions of Normality were deemed
reasonable based on histograms of differences.
a) Did scores change significantly for the group who
drank caffeine? Test an appropriate hypothesis and
state your conclusion.
b) Did scores change significantly for the no-caffeine
group? Test an appropriate hypothesis and state
your conclusion.
c) Does this indicate that some mystery substance in
noncaffeinated soda may aid memory? What other
explanation is plausible?
17. Hard water. In an investigation of environmental
causes of disease, data were collected on the annual
mortality rate (deaths per 100,000) for males in 61
large towns in England and Wales. In addition, the
water hardness was recorded as the calcium
concentration (parts per million, ppm) in the drinking
water. The data set also notes for each town whether it
was south or north of Derby. Is there a significant
difference in mortality rates in the two regions? Here
are the summary statistics.
Summary of: mortality
For categories in: Derby
Group   Count   Mean      Median   StdDev
North   34      1631.59   1631     138.470
South   27      1388.85   1369     151.114
a) Test appropriate hypotheses and state your
conclusion.
b) The boxplots of the two distributions show an
outlier among the data north of Derby. What effect
might that have had on your test?
18. Brain waves. An experiment was performed to see
whether sensory deprivation over an extended period
of time has any effect on the alpha-wave patterns
produced by the brain. To determine this, 20 subjects,
inmates in a Canadian prison, were randomly split into
two groups. Members of one group were placed in
solitary confinement. Those in the other group were
allowed to remain in their own cells. Seven days later,
alpha-wave frequencies were measured for all subjects,
as shown in the following table (P. Gendreau et al.,
"Changes in EEG Alpha Frequency and Evoked
Response Latency During Solitary Confinement,"
Journal of Abnormal Psychology 79 [1972]: 54-59):
Nonconfined   Confined
10.7          9.6
10.7          10.4
10.4          9.7
10.9          10.3
10.5          9.2
10.3          9.3
9.6           9.9
11.1          9.5
11.2          9.0
10.4          10.9
a) What are the null and alternative hypotheses? Be
sure to define all the terms and symbols you use.
b) Are the assumptions necessary for inference met?
c) Perform the appropriate test, indicating the formula
you used, the calculated value of the test statistic,
and the P-value.
d) State your conclusion.
19. Summer school. Having done poorly on their math
final exams in June, six students repeat the course in
summer school, then take another exam in August. If
we consider these students representative of all
students who might attend this summer school in other
years, do these results provide evidence that the
program is worthwhile?
June   54  49  68  66  62  62
Aug.   50  65  74  64  68  72
20. Lower scores? Newspaper headlines recently
announced a decline in science scores among high
school seniors. In 2000, 15,109 seniors tested by the
National Assessment of Educational Progress (NAEP)
scored a mean of 147 points. Four years earlier, 7537
seniors had averaged 150 points. The standard error of
the difference in the mean scores for the two groups
was 1.22.
a) Have the science scores declined significantly?
Cite appropriate statistical evidence to support
your conclusion.
b) The sample size in 2000 was almost double that in
1996. Does this make the results more convincing,
or less? Explain.
21. The Internet. The NAEP report described in Exercise
20 compared science scores for students who had home
Internet access with the scores of those who did not, as
shown in the graph. They report that the differences are
statistically significant.
a) Explain what "statistically significant" means in this context.
b) If their conclusion is incorrect, which type of error
did the researchers commit?
c) Does this prove that using the Internet at home
improves a student's performance in science?
22. Music and memory. Is it a good idea to listen to music
when studying for a big test? In a study conducted by
some Statistics students, 62 people were randomly
assigned to listen to rap music, music by Mozart, or no
music while attempting to memorize objects pictured
on a page. They were then asked to list all the objects
they could remember. Here are summary statistics for
each group:
        Rap     Mozart   No Music
Count   29      20       13
Mean    10.72   10.0     12.77
StDev   3.99    3.19     4.73
a) Does it appear that it is better to study while
listening to Mozart than to rap music? Test an
appropriate hypothesis and state your conclusion.
b) Create a 90% confidence interval for the mean
difference in memory score between students who
study to Mozart and those who listen to no music
at all. Interpret your interval.
23. Rap. Using the results of the experiment described in
Exercise 22, does it matter whether one listens to rap
music while studying, or is it better to study without
music at all?
a) Test an appropriate hypothesis and state your
conclusion.
b) If you concluded there is a difference, estimate the
size of that difference with a confidence interval
and explain what your interval means.
Answers:
1. a) Randomly assign 50 hens to each of the two kinds of
feed. Compare production at the end of the month.
b) Give all 100 hens the new feed for 2 weeks and the
regular feed for 2 weeks, randomly selecting which feed
the hens get first. Analyze the differences in production
for all 100 hens.
c) Matched pairs. Because hens vary in egg production,
the matched-pairs design will control for that.
2. a) Randomly assign half the volunteers to do the puzzles
in a quiet room, half to do them with MTV on.
Compare the times.
b) Randomly assign half the volunteers to do a puzzle
in a quiet room, half to do a puzzle with MTV on.
Then have each do a puzzle under the other condition.
Look at the differences in completion times.
c) Matched pairs. People vary in their ability to do
crossword puzzles.
3. a) Matched pairs — same cities in different time periods.
b) There is a significant difference (P-value = 0.0244)
in the labor force participation rate for women in these
cities; women's participation increased between 1968
and 1972.
4. a) H0: μ_C - μ_T = 0 vs. HA: μ_C - μ_T ≠ 0
b) Yes. Groups are independent, though we don't know
if students were randomly assigned to the programs.
Sample sizes are large, so CLT applies.
c) If the means for the two programs are really equal,
there is less than a 1 in 10,000 chance of seeing a
difference as large as or larger than the observed
difference just from natural sampling variation.
d) On average, students who learn with the CPMP
method do significantly worse on algebra tests that do
not allow them to use calculators than students who
learn by traditional methods.
5. a) 2-sample. Clouds are independent of one another.
b) Based on these data, there is some evidence of a
difference (P-value = 0.0538) in the amount of rain
between seeded and unseeded clouds.
7. H0: μ_C - μ_T = 0 vs. HA: μ_C - μ_T ≠ 0. t = 1.406, df = 590.05,
P-value = 0.1602. Because of the large P-value, we fail
to reject H0. Based on this sample, there is no evidence
of a difference in mean scores on a test of word
problems, whether students learned with CPMP or
traditional methods.
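The t-statistic and degrees of freedom in answer 7 can be checked from the summary table alone. This is a minimal Python sketch, assuming the unpooled (Welch) two-sample procedure with the Welch-Satterthwaite degrees of freedom. The function name `welch_from_summary` is illustrative and does not come from the worksheet.

```python
import math

def welch_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Return (t, df) for an unpooled two-sample t-test from summary statistics."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first sample mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second sample mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# CPMP vs. traditional scores on word problems (Exercise 7 table)
t, df = welch_from_summary(320, 57.4, 32.1, 273, 53.9, 28.5)
print(f"t = {t:.3f}, df = {df:.2f}")
```

The same function applies to any exercise that reports only n, mean, and SD for two independent groups.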
8. a) H0: μ_L - μ_S = 0 vs. HA: μ_L - μ_S ≠ 0
b) Don't know if the streams were a random sample, or
whether they are less than 10% of all Adirondack
streams. Boxplots show outliers and Shale may be
skewed (median is equal to Q1 or Q3), but samples are
large.
c) Based on these data, it appears that water flowing
over limestone is less acidic, on average, than water
flowing over shale.
9. There are several concerns here. First, we don't have a
random sample. We have to assume that the actual
number of hurricanes in a given year is a random
sample of the hurricanes that might occur under similar
weather conditions. Also, the data for 1944-1969 are
not symmetric and have three outliers. The outliers will
tend to make the average for the period 1944-1969
larger. These data are not appropriate for inference.
The boxplots provide little evidence of a change in the
mean number of hurricanes in the two periods.
10. a) If there is no difference between ginkgo and the
placebo, there is a 93.74% chance of seeing a
difference as large as or larger than that observed, just from
natural sampling variation.
b) There is no evidence based on this study that ginkgo
biloba improves memory, as the difference in mean
memory score was not significant.
c) Type II
11. a) Paired sample test. Data are before/after for the same
workers; workers randomly selected; assume less than
10% of all this company's workers; boxplot of
differences shows them to be symmetric with no
outliers.
b) H0: μ_D = 0 vs. HA: μ_D > 0. t = 3.60, P-value =
0.0029. Because P < 0.01, reject H0. These data show
that average job satisfaction has increased after
implementation of the exercise program.
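The t-statistic in answer 11 comes from a one-sample t computation on the per-worker differences. This is a minimal Python sketch using only the standard library, with differences taken as after minus before. The variable names are illustrative.

```python
import math
import statistics

before = [34, 28, 29, 45, 26, 27, 24, 15, 15, 27]
after = [33, 36, 50, 41, 37, 41, 39, 21, 20, 37]

# Matched pairs: reduce the two columns to one difference per worker.
diffs = [a - b for a, b in zip(after, before)]

mean_diff = statistics.mean(diffs)
se = statistics.stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean difference
t = mean_diff / se

print(f"mean difference = {mean_diff}, t = {t:.2f}, df = {len(diffs) - 1}")
```

The P-value then comes from a t-distribution with 9 degrees of freedom, using a table or software.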
13. a) H0: μ_D = 0 vs. HA: μ_D > 0
b) 0.0025
c) If there is no gain of additional hours of sleep with
the herb, the chance of seeing a mean difference as
large as or larger than the one observed is about one-quarter of one percent.
d) The data provide evidence that the herb is helpful in
gaining additional sleep.
e) Type I
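The P-value in answer 13 b) is the area to the right of t = 3.680 under a t-distribution with 9 degrees of freedom. This sketch approximates that area without a statistics package by integrating the t density numerically with Simpson's rule. The helper names `t_pdf` and `t_upper_tail` are hypothetical, and a table or statistical software would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0; steps must be even for Simpson's rule."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    # The density is symmetric about 0, so half the total area lies above 0.
    return 0.5 - area_0_to_t

p = t_upper_tail(3.680, 9)
print(f"one-sided P-value = {p:.4f}")
```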
14. a) H0: μ_D = 0 vs. HA: μ_D > 0. t = 4.47, P-value = 0.0008.
Because of the very small P-value, we reject H0. These
data provide strong evidence that cars get significantly
better mileage, on average, with premium than with
regular gasoline.
b) (1.18, 2.82) miles per gallon
c) Premium gasoline costs more than regular.
d) t = 1.25, P-value is 0.1144. Would have decided no
difference. The variation in the cars' performances is
larger than the differences.
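The contrast between parts a), b), and d) of answer 14 can be reproduced directly from the mileage data. This is a minimal Python sketch using only the standard library. The critical value 1.833 is the standard t-table value for 90% confidence with 9 degrees of freedom, and the variable names are illustrative.

```python
import math
import statistics

regular = [16, 20, 21, 22, 23, 22, 27, 25, 27, 28]
premium = [19, 22, 24, 24, 25, 25, 26, 26, 28, 32]
n = len(regular)

# Matched pairs: one difference per car (premium - regular).
diffs = [p - r for p, r in zip(premium, regular)]
d_bar = statistics.mean(diffs)
se_paired = statistics.stdev(diffs) / math.sqrt(n)
t_paired = d_bar / se_paired

# 90% confidence interval for the mean difference (t* from a t-table, 9 df).
T_STAR_90_DF9 = 1.833
ci = (d_bar - T_STAR_90_DF9 * se_paired, d_bar + T_STAR_90_DF9 * se_paired)

# The mistaken analysis: treat the two rows as independent samples.
se_indep = math.sqrt(statistics.variance(regular) / n + statistics.variance(premium) / n)
t_indep = (statistics.mean(premium) - statistics.mean(regular)) / se_indep

print(f"paired t = {t_paired:.2f}, 90% CI = ({ci[0]:.2f}, {ci[1]:.2f}) mpg")
print(f"two-sample t = {t_indep:.2f}")
```

Both analyses use the same 2 mpg mean difference. Only the standard error changes: pairing removes the car-to-car variation that dominates the independent-samples standard error.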
15. H0: μ_D = 0 vs. HA: μ_D ≠ 0. Data are paired by brand;
brands are independent of each other; less than 10% of
all yogurts (questionable); boxplot of differences
shows an outlier (100) for Great Value.
With the outlier included, the mean difference
(Strawberry-Vanilla) is 12.5 calories with a t-stat of
1.332 with 11 df, for a P-value of 0.2098. Deleting the
outlier, the difference is even smaller, 4.55 calories
with a t-stat of only 0.833 and a P-value of 0.4241.
With P-values so large, we do not reject H0. We
conclude that the data do not provide evidence of a
difference in mean calories.
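The effect of the Great Value outlier in answer 15 can be seen by running the paired computation twice. This is a minimal Python sketch using only the standard library. The helper `paired_t` is a hypothetical name, and dropping the largest difference stands in for "deleting the outlier".

```python
import math
import statistics

# Calories per serving, in the brand order of the Exercise 15 table.
strawberry = [210, 220, 220, 120, 210, 140, 180, 170, 200, 100, 190, 100]
vanilla = [200, 220, 180, 120, 230, 140, 80, 160, 170, 120, 190, 100]

def paired_t(diffs):
    """t-statistic for H0: mean difference = 0."""
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se

diffs = [s - v for s, v in zip(strawberry, vanilla)]
t_all = paired_t(diffs)

# Remove the single largest difference (Great Value, 100 calories).
trimmed = [d for d in diffs if d != max(diffs)]
t_trimmed = paired_t(trimmed)

print(f"with outlier:    mean = {statistics.mean(diffs):.2f}, t = {t_all:.3f}")
print(f"without outlier: mean = {statistics.mean(trimmed):.2f}, t = {t_trimmed:.3f}")
```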
16. a) H0: μ_D = 0 vs. HA: μ_D ≠ 0. t = -1.21, P-value =
0.2466. Since P > 0.05, fail to reject H0. There is no
evidence that the mean score will change after using
caffeine.
b) H0: μ_D = 0 vs. HA: μ_D ≠ 0. t = 2.27, P-value =
0.0397. Since P < 0.05, reject H0. There is evidence
that the mean score will increase when no caffeine is
used.
c) No. Might be variation due to the people in the two
groups that produced a Type I error. (Answers will
vary.)
17. a) H0: μ_N - μ_S = 0 vs. HA: μ_N - μ_S ≠ 0. t = 6.47, df =
53.49, P-value = 3.2 × 10^-8. Because the P-value is
low, we reject H0. On the basis of these data, there is
clear evidence that mortality rates are different. The
mean rate in the north is significantly higher.
b) It will raise the sample mean (ȳ) for the north, but from looking at the
boxplots and the fact that the mean and median are
nearly the same, it probably will not change the
conclusion of the test.
18. a) H0: μ_NC - μ_C = 0 vs. HA: μ_NC - μ_C ≠ 0; μ_NC is the mean
for nonconfined inmates, μ_C is the mean for inmates
confined to solitary.
b) Groups are independent of each other, not paired;
random assignment to groups, less than 10% of all
inmates, boxplot shows no outliers in either group.
c) 2-sample t-test statistic: 3.357, P-value = 0.0038
d) Because the P-value is so small, we reject H0.
Solitary confinement makes a difference in mean
alpha-wave frequencies; those subjected to
confinement have lower frequencies.
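Part c) of answer 18 asks for the formula as well as the value. With raw data, the unpooled two-sample statistic is t = (ȳ_NC - ȳ_C) / sqrt(s_NC²/n_NC + s_C²/n_C). This is a minimal Python sketch of that computation, assuming the Welch (unpooled) form with Welch-Satterthwaite degrees of freedom. The variable names are illustrative.

```python
import math
import statistics

nonconfined = [10.7, 10.7, 10.4, 10.9, 10.5, 10.3, 9.6, 11.1, 11.2, 10.4]
confined = [9.6, 10.4, 9.7, 10.3, 9.2, 9.3, 9.9, 9.5, 9.0, 10.9]

# Squared standard errors of the two sample means.
v_nc = statistics.variance(nonconfined) / len(nonconfined)
v_c = statistics.variance(confined) / len(confined)

diff = statistics.mean(nonconfined) - statistics.mean(confined)
t = diff / math.sqrt(v_nc + v_c)

# Welch-Satterthwaite approximation for the degrees of freedom.
df = (v_nc + v_c) ** 2 / (
    v_nc ** 2 / (len(nonconfined) - 1) + v_c ** 2 / (len(confined) - 1)
)

print(f"difference = {diff:.2f}, t = {t:.3f}, df = {df:.1f}")
```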
19. These are before and after scores for the same
individuals, not independent samples.
20. a) The 95% confidence interval for the difference is
(0.61, 5.39). 0 is not in the interval, so scores in 1996
were significantly higher. (Or the t-statistic, with more than
7500 df, is 2.459, for a one-sided P-value of 0.0069.)
b) Since both samples were very large, there shouldn't
be a difference.
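The interval and test statistic in answer 20 a) follow directly from the reported standard error. This sketch uses the Normal model from Python's standard library, an assumption justified here because a t-distribution with more than 7500 degrees of freedom is practically indistinguishable from the Normal.

```python
from statistics import NormalDist

diff = 150 - 147  # 1996 mean minus 2000 mean
se = 1.22         # standard error of the difference, given in the exercise

z = diff / se
z_star = NormalDist().inv_cdf(0.975)  # critical value for 95% confidence, about 1.96
ci = (diff - z_star * se, diff + z_star * se)
p_one_sided = 1 - NormalDist().cdf(z)  # upper-tail area beyond the observed statistic

print(f"test statistic = {z:.3f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"one-sided P-value = {p_one_sided:.4f}")
```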
21. a) The observed differences are too large to attribute to
chance or natural sampling variation.
b) Type I
c) No. There may be many other factors.
22. a) H0: μ_M - μ_R = 0 vs. HA: μ_M - μ_R > 0. t = -0.70, df =
45.88, P-value = 0.7563. Because the P-value is so
large, we do not reject H0. These data provide no
evidence that listening to Mozart while studying is
better than listening to rap.
b) With 90% confidence, the mean score is between
0.189 and 5.357 objects higher for those who listen to
no music while studying than for those who listen to
Mozart, based on these samples.
23. a) H0: μ_R - μ_N = 0 vs. HA: μ_R - μ_N < 0. t = -1.36, df =
20.00, P-value = 0.0944. Because the P-value is large,
we fail to reject H0. These data show no evidence of a
difference in mean number of objects recalled between
listening to rap and listening to no music at all.
b) Didn't conclude a difference.