Chapter 3: The analysis of count data

3.0 Introduction

In this chapter we turn our attention to situations where our data come in the form of counts, rather than continuous measurements. Perhaps the simplest example, to be discussed in Section 3.1, is a political poll, in which n voters are randomly selected from a population of voters and asked a question to which they must answer either yes or no. What is of interest is the proportion of voters in the population who would say yes to the question. The analysis will involve calculating a confidence interval for the population proportion; this is directly equivalent to calculating a confidence interval for a population mean based on a sample mean, as discussed in Chapter 2 (Section 2.1). In many comparative studies the outcome of interest is a count rather than a measured quantity; in Section 3.2 we will consider such studies. One example involves drug trials where two groups of patients, who are treated differently, are classified as improved/not improved, rather than measured on an improvement scale. In such cases the question of interest is whether there would be a difference in the population proportions that improve, if the treatments were applied to whole populations, rather than to relatively small samples. Again, this is directly equivalent to the simple comparative studies (involving two-sample and paired t-tests and the corresponding confidence intervals) of Chapter 2, where comparisons of population means were of interest. In Section 3.3 we generalise from a two-group comparison to comparing proportions arising in more than two groups (in Chapter 5, Analysis of Variance (ANOVA) will be used to generalise from comparing two sample means to comparing several means for continuous data).
We also consider situations where more than two outcomes are observed: for example, we examine the results of a market research study where respondents were classified as having bought a product, having heard of it but not bought it, or never having heard of it. The chi-square distribution is introduced for analysing such data. Section 3.4 is concerned with ‘goodness-of-fit’ tests: we ask if an observed set of counts corresponds to an a priori set of relative frequencies; for example, does the observed distribution of seed colours in a plant study correspond to what would be expected under a Mendelian inheritance scheme? To answer the question, we carry out a statistical test which compares the divergences between the observed counts and those that would be expected given the validity of the a priori relative frequencies. The test assesses whether or not the overall divergence would have been likely under chance variation. Again, the chi-square distribution will be used as the reference distribution for the divergence measure.

Course Manual: PG Diploma in Statistics, TCD, Base Module 1

3.1 Single proportions

We are often interested in the relative frequency of occurrence of particular characteristics among units sampled from a population or process, with a view to making inferences about the population or process characteristics. For example, political polls are carried out to estimate support for individual politicians or parties in a defined population of electors; market research studies are conducted to elicit information on purchasing behaviour in a defined population of consumers; quality control samples are taken to monitor conformance to specifications of industrial products. Here the practical question is how to estimate the relative frequency of a given characteristic in a population or process from that of the sample.
The Statistical Model

Suppose a random sample of n units is selected from a population or process in which the proportion of units that has a particular characteristic is π, and that r members of the sample have this characteristic. For example, in a political poll r=700 voters from a sample of n=1000 respond “yes” to a question; what does the sample proportion p=r/n tell us about the population proportion π who would say “yes”? It can be shown that under repeated sampling the sample proportion p=r/n behaves as if selected from a Normal distribution with mean π and standard error:

SE(p) = √(π(1−π)/n)

This means that repeated samples of size n will give sample proportions that vary around π as shown in Figure 3.1.1 below. Note that this result is approximate, and that the larger the sample size n is, the better the approximation becomes.

Figure 3.1.1: The distribution of sample proportions

The fact that p is approximately Normally distributed allows the use of the methods previously encountered for the analysis of measured variables, even though we are now dealing with counts, where only integer values are possible.

A confidence interval for a population proportion

The statistical question is how to estimate π, the population proportion, from the sample value, p. Recall that when dealing with measured quantities we used the fact that if some variable x is at least approximately Normally distributed (say, with mean μ and standard deviation σ) then the average of n randomly selected values, x̄, will also be Normally distributed, with the same mean, μ, and with standard error σ/√n. This is illustrated in Figure 3.1.2 below.

Figure 3.1.2: Distributions of individual measurements (a) and of the averages of n measurements (b).

This gave us a method for estimating the population or process mean μ, using the sample information.
A 95% confidence interval for μ was given by:

x̄ ± 1.96 σ/√n

Where σ was unknown it was replaced by the sample value s and, if the sample size was small, the standard Normal value of 1.96 was replaced by the corresponding Student’s t-value. The sample proportion p in the current case corresponds directly to x̄ above. Similarly, we estimate the standard error of p by replacing π by the sample estimate p in the formula for the standard error:

SÊ(p) = √(p(1−p)/n)

This allows us to calculate an approximate 95% confidence interval (1) for π as:

p ± 1.96 √(p(1−p)/n)

Note the structure of the confidence interval formula: the confidence interval is given by a point estimate of the parameter of interest, plus or minus the estimated standard error of the point estimate multiplied by a critical value. Exactly the same structure applies to all confidence intervals, irrespective of the parameter being estimated.

Example 1: French 2007 presidential election

On May 4th, 2007 the BBC website reported that an opinion poll (Ifop for the Le Monde newspaper on May 3rd) ‘put Mr. Sarkozy at 53%, with Ms. Royal trailing with 47%’ in the election campaign for the French presidency (voting took place on the following Sunday). The sample size was 858 people and all had watched the head-to-head television debate between the candidates. What can we infer about the voting intentions of the French electorate from these sample findings, if we can assume that this is a random sample of all French voters? A 95% confidence interval for the population proportion of voters who favoured Sarkozy is given by:

p ± 1.96 √(p(1−p)/n) = 0.530 ± 1.96 √(0.530(1−0.530)/858) = 0.530 ± 0.033

i.e., between 49.7% and 56.3% are estimated to favour Sarkozy. A confidence interval allows for the sampling variability involved in basing our conclusions about the population on the results of a random sample.
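The interval above is easy to reproduce and check. The short script below is a sketch (not part of the manual, which works with Minitab): it computes the Sarkozy interval and then uses a small simulation to confirm that the spread of repeated sample proportions is close to the theoretical standard error.

```python
import math
import random

# Sarkozy poll: n = 858 respondents, sample proportion p = 0.530
n, p = 858, 0.530

# Approximate 95% confidence interval: p +/- 1.96*sqrt(p(1-p)/n)
se = math.sqrt(p * (1 - p) / n)
lower, upper = p - 1.96 * se, p + 1.96 * se
print(f"95% CI: ({lower:.3f}, {upper:.3f})")  # (0.497, 0.563)

# Simulation check of the Normal approximation: the standard deviation of
# many simulated sample proportions should be close to SE(p).
random.seed(1)
props = [sum(random.random() < p for _ in range(n)) / n for _ in range(2000)]
mean_p = sum(props) / len(props)
sd_p = math.sqrt(sum((x - mean_p) ** 2 for x in props) / (len(props) - 1))
print(f"theoretical SE = {se:.4f}, simulated SD = {sd_p:.4f}")
```

The simulated standard deviation agrees with the formula to two significant figures, which is the sense in which p "behaves as if selected from a Normal distribution".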
Taken at face value the confidence interval indicates that, although Sarkozy was six percentage points above Royal in the sample, sampling variability is such that Royal could have been very slightly ahead in the population of voters [50.3%, corresponding to the lower bound of 49.7% for Sarkozy]. In other words, the sample size was not large enough to distinguish clearly between the two candidates. Note that the formula implies that the larger n is, the narrower will be the error bound on the sample result.

(1) Note that in this chapter the relevant reference distribution for calculating confidence intervals is the standard Normal (never the t-distribution). Obviously, confidence levels other than 95% may be obtained by replacing 1.96 by the corresponding Z value; thus, 2.58 gives 99% confidence.

The confidence interval is based on the assumption of a simple random sample from the population of interest; this means that all voters would be equally likely to be included in the sample. In practice, this is unlikely to be true, for several reasons. The poll was very likely a telephone survey, which raises several questions: do all voters have telephones? If several voters reside in the same house, are they equally likely to answer the telephone? Will they answer truthfully if they do respond to the survey? The poll was aimed at people who had watched the TV debate: does this subset of the French electorate accurately reflect the voting intentions of the overall electorate? What exactly was the question asked: did they ask who had won the debate, or who the respondent would support? The questions may appear the same, but it is dangerous to assume that the respondent interprets a question as you do, unless it is made very explicit.
[I had a colleague once who, when asked by a TV licence inspector if there was a television in the house, replied that there was, and when asked if he had a TV licence, replied that he did not. When summoned to court he was asked by the judge why he did not have a TV licence; he explained that he did not own the TV, which was his wife’s, and that she did have a TV licence!] Many people are likely to refuse to answer survey questions (whether on the telephone or otherwise). Is there a difference between the subset of the population who are willing to respond and the subset which is not? Any such difference leads to a non-response bias. Many surveys have response rates of less than 20%. In such cases it is virtually impossible to assess the likely bias in the sample of respondents. In short, surveying is something of a black art! The use of a mathematical formula suggests, for most people, an exactness which is clearly inappropriate in very many survey situations. Judgements need to be made on how representative the sample is likely to be, and the uncertainties inherent in these judgements are not included in the error bounds generated by the confidence interval. The real uncertainty is likely to be quite a lot higher (note the precision in this final sentence!).

Exercise 3.1.1

On Saturday May 17th, 2008, the Irish Times published the results of a poll on attitudes to the planned constitutional referendum on the Lisbon Treaty. The size of the sample was 1000; 35% of respondents said they intended to vote ‘yes’. Estimate the population proportion who intended to vote ‘yes’ at this point in the campaign.

Sample Size

In any empirical study, one of the first questions that needs to be answered is: how many observations are required?
One way of addressing this question in the current context is to ask how wide the confidence interval bounds on the sample proportion could be while still giving an interval that answers the practical question, i.e., an interval that is ‘fit for purpose’. The standard error of the sample proportion is:

SE(p) = √(π(1−π)/n)

This depends not only on the sample size (n) but also on the population or process proportion (π), or rather on the product π(1−π). Table 3.1.1 shows how this product varies with π; it assumes its worst value (in the sense of giving the widest interval) when π is 0.5.

π        0.10  0.20  0.30  0.40  0.50  0.60  0.70  0.80  0.90
1−π      0.90  0.80  0.70  0.60  0.50  0.40  0.30  0.20  0.10
π(1−π)   0.09  0.16  0.21  0.24  0.25  0.24  0.21  0.16  0.09

Table 3.1.1: The impact of π on the standard error

Suppose we require that a 95% confidence interval is of the form:

p ± 0.01

Then, since the confidence interval is given by p ± 1.96 √(p(1−p)/n), and the worst case for p(1−p) is 0.25, we require:

1.96 √(0.25/n) = 0.01

which gives a sample size of

n = (1.96)²(0.25)/(0.01)² = 9604

If n is around 1000 the error bounds are ±0.03. This is a typical sample size for political polls, which are usually reported in the press as subject to an error of approximately 3%.

Significance Tests

When only a single proportion is of interest it is much more likely that the desired analysis will be to calculate a confidence interval than to carry out a significance test. Confidence intervals are more useful in that they not only tell us whether the data are consistent with a particular hypothesised population or process value, but also show the range of values with which the results are consistent. Tests may be appropriate, however, in certain contexts, especially where a decision has to be made.

Example 2: An Acceptance Sampling test

In Chapter 2 (Section 2.1) the use of a one-sided test was illustrated by an acceptance sampling example.
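The sample-size calculation above can be sketched in a few lines of code. The helper function below is hypothetical (it is not part of the manual), but it implements the same worst-case formula, n = z²·π(1−π)/w², with π = 0.5:

```python
import math

def sample_size(half_width, p=0.5, z=1.96):
    """Smallest n for which the CI half-width z*sqrt(p(1-p)/n) is at most
    half_width. The default p = 0.5 is the worst case, since p(1-p) then
    takes its maximum value of 0.25 (see Table 3.1.1)."""
    return math.ceil(z**2 * p * (1 - p) / half_width**2)

print(sample_size(0.01))  # 9604, as derived above
print(sample_size(0.03))  # 1068, close to the typical poll size of about 1000
```

Note how halving the interval width requires four times the sample size, a direct consequence of the √n in the standard error.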
Acceptance sampling in an industrial context often involves counted data. For example, a company that routinely buys batches of 1000 components for use in one of its own products might take a sample of 80 components on delivery and accept the consignment if no more than 2 components in the sample are “non-conforming”, i.e., do not meet the required specifications. The sample size of 80 and the acceptance number of 2 are suggested by the American National Standard: Sampling Procedures and Tables for Inspection by Attributes (ANSI/ASQC Z1.4-1993) for batches of between 501 and 1200 units, where the maximum acceptable non-conformance rate is 1%. This decision rule is a one-sided test with a rejection region defined by the numbers 3, 4, 5, …, 80. Equivalently, a sample proportion greater than p=2/80=0.025 leads to rejection of the null hypothesis that π ≤ 0.01. The sampling tables are often referred to as the ‘Military Standards’ as they were first developed during the Second World War for monitoring the purchase of military equipment. Further information on the use of these tables can be found in the guidebook American National Standard: Introduction to Attribute Sampling (ANSI/ASQC S2-1995).

3.2 Simple Comparative Studies

In Chapter 2, we discussed comparative studies where the outcomes of interest could be measured and the questions to be addressed referred to population or process means. We now turn our attention to studies where the outcome involves counting, rather than measuring, and where the questions of interest refer to population or process proportions.
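Returning briefly to the acceptance sampling plan of Example 2: the standard's tables do not need to be quoted to see how such a plan behaves. As an illustrative sketch (these acceptance probabilities are computed here, not taken from the text), the probability of accepting a consignment is P(X ≤ 2) for X ~ Binomial(80, π):

```python
from math import comb

def p_accept(pi, n=80, c=2):
    """Probability that a Binomial(n, pi) count of non-conforming items
    is at most c, i.e. the probability the consignment is accepted."""
    return sum(comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(c + 1))

# How the n=80, c=2 plan responds to different non-conformance rates:
for pi in (0.01, 0.05, 0.10):
    print(f"pi = {pi:.2f}: P(accept) = {p_accept(pi):.3f}")
```

At the maximum acceptable rate of 1% the plan accepts roughly 95% of consignments, while a batch with a 10% non-conformance rate is almost certain to be rejected; the curve traced out by p_accept is what quality engineers call the operating characteristic of the plan.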
Typical examples would be medical studies where the two groups are either inherently different (e.g., smokers versus non-smokers) or the subjects are initially similar and are broken into two groups to be allocated to different treatments; the number in each group who respond in a defined way (e.g., develop lung cancer or become free of a medical condition) after a period of time is then counted and expressed as a proportion of the group size. The statistical problem is to assess the extent to which observed sample differences are likely to be systematic and not simply due to chance variation.

Example 3: Use of Timolol for treating angina

Angina pectoris is a chronic heart condition in which the sufferer has periodic attacks of chest pain. Samuels [1] cites the data in Table 3.2.1 from a study [2] to evaluate the effectiveness of the drug Timolol in preventing angina attacks; patients were randomly assigned to receive a daily dosage of either Timolol or a placebo for 28 weeks. The numbers of patients who did or did not become completely free of angina attacks (Angina-Free) are shown in Table 3.2.1. Clearly a higher proportion of the Timolol patients became Angina-Free in the study. The statistical question is: “what are the implications for use of Timolol in a population of such patients?”

                   Timolol   Placebo   Totals
Angina-Free             44        19       63
Not Angina-Free        116       128      244
Totals                 160       147      307
Proportions          0.275     0.129    0.205

Table 3.2.1: The Timolol study results

The Statistical Model

In Section 3.1, we saw that if a single random sample of n units is selected from a population or process in which the proportion of units that has a particular characteristic is π, then the sample proportion p=r/n behaves as if selected from a Normal distribution with mean π and standard error:

SE(p) = √(π(1−π)/n)
Assume now that we have two groups which are independent of each other, in the sense that the chance that any unit in one group does or does not have the characteristic of interest is not in any way affected by the responses of the units in the second group. If this is true then we can extend the result used above in a very simple way [see Appendix 3.1, page 60 for details]. If p1=r1/n1 is the sample proportion and π1 is the population or process proportion for the first group, and p2=r2/n2 and π2 are the corresponding values for the second group, then the difference between the sample proportions, (p1−p2), has a sampling distribution which is approximately Normal with mean (π1−π2) and standard error:

SE(p1−p2) = √(π1(1−π1)/n1 + π2(1−π2)/n2)

This is illustrated in Figure 3.2.1.

Figure 3.2.1: The sampling distribution of the difference between sample proportions

We are now back on familiar territory and can work with this sampling distribution, as it is merely a single Normal distribution (with which we have worked several times already), even though the standard error formula looks a little complicated. The statistical questions that we typically wish to address are whether or not the population or process proportions are the same and, if there is a difference, how big that difference, π1−π2, is. As we have seen in several contexts, questions about statistical significance and the size of effects can be answered by calculating confidence intervals. This will be our approach here too. We will, however, also examine the use of a statistical test for comparing sample proportions, since such tests are commonly used in the research literature. We will use the Timolol study to illustrate the methods.
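The standard-error formula for a difference of proportions can also be checked by simulation. The sketch below is illustrative only (it treats the observed Timolol and Placebo proportions as if they were the true values π1 and π2) and compares the formula with the standard deviation of many simulated differences:

```python
import math
import random

# Hypothetical "true" values, taken from the Timolol study for illustration
pi1, n1 = 0.275, 160
pi2, n2 = 0.129, 147

# Theoretical standard error of p1 - p2
se_theory = math.sqrt(pi1 * (1 - pi1) / n1 + pi2 * (1 - pi2) / n2)

# Simulate many pairs of independent samples and record p1 - p2
random.seed(2)
diffs = []
for _ in range(5000):
    p1 = sum(random.random() < pi1 for _ in range(n1)) / n1
    p2 = sum(random.random() < pi2 for _ in range(n2)) / n2
    diffs.append(p1 - p2)

mean_d = sum(diffs) / len(diffs)
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (len(diffs) - 1))
print(f"theoretical SE = {se_theory:.4f}, simulated SD = {sd_d:.4f}")
```

The two numbers agree closely, which is why the difference between two independent sample proportions can be handled as a single Normal quantity.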
Confidence Interval

We have seen that a confidence interval for a population parameter (or the difference between two such parameters) is given by a point estimate plus or minus the standard error multiplied by a critical value from the relevant reference distribution. Accordingly, since the difference between two sample proportions follows a Normal distribution (approximately), a 95% confidence interval for π1−π2 is given by:

(p1−p2) ± 1.96 √(p1(1−p1)/n1 + p2(1−p2)/n2)

where 1.96 is the critical value from a standard Normal curve and p1 (Timolol) and p2 (Placebo) estimate the corresponding population values π1 and π2, respectively. For the Timolol data this gives:

(0.275−0.129) ± 1.96 √(0.275(1−0.275)/160 + 0.129(1−0.129)/147) = 0.146 ± 0.088

The interval suggests that if patients were routinely administered Timolol (according to the study protocol) then an excess of between 6% and 23% of the population would become Angina-Free, over and above the percentage that would be expected if the patients were routinely prescribed a placebo. Thus, the interval answers the research questions: it tells us that Timolol works and it measures by how much it is better than a placebo. The results of studies of this type are very often analysed and reported by way of statistical tests rather than confidence intervals. Accordingly, one such test will be introduced below. However, I strongly recommend that a confidence interval should be obtained, even if a statistical test is the more commonly used method of analysis in a particular discipline. With very large sample sizes even tiny differences (of no practical interest) may be statistically significant; so simply reporting results as being statistically significant (especially if the ‘statistically’ qualifier is dropped, as it often is!) can be misleading.
On the other hand, small sample sizes result in large standard errors, which will mean non-statistically-significant results unless large sample differences are observed. Calculating a confidence interval makes explicit the uncertainty in the observed sample difference and may lead readers to ask (correctly) whether a non-significant result reflects the absence of an effect or simply the small size of the study.

Significance Test

Obviously the sample proportion of patients who became Angina-Free is greater in the Timolol group (p1=44/160=0.275) than in the Placebo group (p2=19/147=0.129). If we imagine the two treatments being administered to a large population of such patients (from which the sample may be considered to be randomly selected) then we can ask if the sample results are consistent with a difference in the population proportions. We frame our question in terms of a null hypothesis which asserts that the long-run proportions are the same and an alternative hypothesis which denies their equality.

H0: π1 − π2 = 0
H1: π1 − π2 ≠ 0

When the null hypothesis is true the sample difference (p1−p2) has a sampling distribution which is approximately Normal with mean π1−π2 = 0 and standard error:

SE(p1−p2) = √(π(1−π)/n1 + π(1−π)/n2)

where π is the common population value. Since the test statistic is calculated on the assumption that the null hypothesis is true, the Z statistic is given by:

Z = [(p1−p2) − 0] / √(π(1−π)/n1 + π(1−π)/n2)

This has a standard Normal distribution. Because π is unknown it must be estimated from the sample data. If there is no group difference then a natural estimator is the overall sample proportion of patients who become Angina-Free, p=63/307=0.205. This gives our test statistic (2) as:

Z = [(p1−p2) − 0] / √(p(1−p)/n1 + p(1−p)/n2)
  = [(0.275−0.129) − 0] / √(0.205(1−0.205)/160 + 0.205(1−0.205)/147) = 3.16

If we choose a significance level of α=0.05 then the critical values are ±1.96, as shown in Figure 3.2.2 below.
Figure 3.2.2: The standard Normal distribution, with critical values ±1.96 (central area 0.95)

Our test statistic Z=3.16 is greater than the upper critical value and the sample difference is said to be ‘statistically significant’: we reject H0 and conclude that, in the long run, the proportion of patients who would become Angina-Free is greater if Timolol is prescribed than it would be if a placebo were administered routinely. Note the assumption implicit in the analysis, that the patients are a random sample from the corresponding population. In practice, of course, this is unlikely to be true (especially as it would be the intention to administer the treatment to future patients also). It is important, therefore, in considering the scientific value of the test result, to assess the extent to which the sample may be treated as if it were a random sample from the target population. In reporting on the findings it would also be important to specify the target population carefully; for example, the sample may be restricted in its coverage: it may include patients of a narrow age profile, of only one sex, of only one nationality or race, etc.

(2) Note: rounding affects such calculations; the ratio is 0.1457/0.0461.

Typical Computer Output

Table 3.2.2 shows the output obtained from Minitab when the Timolol study sample results are entered.
Test and CI for Two Proportions

Sample    Angina-Free     N   Sample p
Timolol            44   160   0.275000
Placebo            19   147   0.129252

Difference = p (1) - p (2)
Estimate for difference: 0.145748
95% CI for difference: (0.0578398, 0.233657)
Test for difference = 0 (vs not = 0): Z = 3.16  P-Value = 0.002

Table 3.2.2: Minitab analysis of Timolol data

The p-value associated with the test indicates that if the null hypothesis were true, the probability would be 0.002 that we would have obtained, just by chance, a Z test statistic as far out, or further, into the tails of the standard Normal distribution than the value Z=3.16 obtained here. Accordingly, this is an unusually large value and the null hypothesis of no long-run group difference should be rejected. The computer output also gives the confidence interval (to an impressive number of decimal places!).

Continuity Correction

In carrying out the Z-test (or the corresponding chi-square test which we will meet in Section 3.3), it is often advised that a ‘continuity correction’ (or ‘Yates’ correction’) be applied to the formula for the test statistic. This arises because, although we are counting discrete events, our calculations are based on the Normal distribution, which is continuous; there is, therefore, an approximation involved. This is a technical detail which I have chosen to ignore in the course of this chapter. The continuity correction is only relevant when the sample size is relatively small. In the current case applying the correction changes the test statistic Z from 3.16 to 3.02; the p-value changes from 0.0016 to 0.0025.

Exercise 3.2.1

On Saturday May 17th, 2008, the Irish Times published the results of a poll on attitudes to the planned constitutional referendum on the Lisbon Treaty. The size of the sample was 1000; 35% of respondents said they intended to vote ‘yes’. A similar poll, carried out at the end of January, 2008, gave 26% as the percentage who intended to vote ‘yes’.
Carry out a test to determine if the data suggest that the population proportions intending to vote ‘yes’ were the same or different on the two sampling dates. Estimate the change in the population proportion of those who intended to vote ‘yes’.

Other Sampling Designs

The data of Example 3 resulted from a designed experiment (a clinical trial) in which patients were assigned randomly to the two treatment groups. Where random assignment is possible it strengthens the conclusions that may be drawn from the study, as randomisation is the best protection against unknown sources of bias, which may be present in purely observational studies (see further discussion in Chapter 4). Our next two examples give data that are analysed in exactly the same way as the Timolol data, even though the data-generating mechanisms are different. Example 4 is a typical cohort or longitudinal study. Such studies are said to be prospective: they identify the study groups first, follow them over time, and subsequently classify the subjects or patients according to the outcomes being studied. Example 5 is a case-control study; this type of data collection design is retrospective, as we shall see.

Example 4: Cohort Study of Slate Workers

Machin et al. [3] cite a study by Campbell et al. [4] in which a group of Welsh men who had been exposed to slate dust through their work were compared to a similar group who had not been so exposed; the control group was similar in age and smoking status. The groups were followed over 24 years and the numbers of deaths in the two groups were recorded, as shown in Table 3.2.3.
              Slate Workers   Controls   Totals
Died                    379        230      609
Survived                347        299      646
Totals                  726        529     1255
Proportions           0.522      0.435

Relative risk: 1.201   Odds ratio: 1.420

Table 3.2.3: A cohort study of slate workers and controls

Table 3.2.4 shows a statistical comparison of the proportions who died in the two groups: the difference p1−p2 is highly statistically significant, with a p-value of 0.002.

Test and CI for Two Proportions

Sample          Died     N   Sample p
Slate Workers    379   726   0.522039
Controls         230   529   0.434783

Difference = p (1) - p (2)
Estimate for difference: 0.0872560
95% CI for difference: (0.0315353, 0.142977)
Test for difference = 0 (vs not = 0): Z = 3.05  P-Value = 0.002

Table 3.2.4: A statistical comparison of the death rates

The confidence interval shows an increased mortality rate of between 3 and 14 percentage points for the slate workers over the control group. The relative risk and odds ratio shown in Table 3.2.3 will be discussed later.

Example 5: Oesophageal Cancer Case-Control Study

Case-control studies involve selecting a group of patients with some condition (a disease, say), together with a control group, and then looking backwards in time (they are retrospective studies) to see if some antecedent or risk factor occurs more frequently in the study group than it does in the control group. If it does, it suggests a possible link between that factor and the outcome of interest. Obviously, case-control studies involve an inversion of the preferred point of view: we would like to be able to say that “people who do X are more likely to get disease Y”, whereas what we can say is that “people who have disease Y are more likely to have done X”. Such studies are particularly useful for studying rare conditions, for which enormous sample sizes would be required for prospective studies (in order to be sure of recruiting into the study sufficient numbers of people who will get the disease).
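The Minitab output shown in Tables 3.2.2 and 3.2.4 can be reproduced with the standard library alone. The helper below is a sketch (its name and structure are mine, not Minitab's): it computes the pooled Z statistic, a two-sided p-value via the Normal CDF, and the unpooled 95% confidence interval.

```python
import math

def two_prop_z(r1, n1, r2, n2):
    """Two-proportion Z-test with pooled SE, plus an unpooled 95% CI."""
    p1, p2 = r1 / n1, r2 / n2
    p = (r1 + r2) / (n1 + n2)                       # pooled proportion
    se0 = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se0
    # Two-sided p-value from the standard Normal CDF (via math.erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Unpooled standard error for the confidence interval
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (p1 - p2 - 1.96 * se, p1 - p2 + 1.96 * se)
    return z, p_value, ci

# Slate workers study (Table 3.2.3): 379/726 died vs 230/529 among controls
z, p_value, ci = two_prop_z(379, 726, 230, 529)
print(f"Z = {z:.2f}, p = {p_value:.3f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Running the same function on the Timolol counts, two_prop_z(44, 160, 19, 147), reproduces the Z = 3.16 of Table 3.2.2.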
Breslow and Day [5] cite a case-control study by Tuyns et al. [6] involving 200 males who were diagnosed with oesophageal cancer in a Brittany hospital. A control sample of 775 adult males was selected from electoral lists in the nearby communes. Both cases and controls were administered a detailed dietary interview, which contained questions about their consumption of alcohol and tobacco as well as food. Only a small part of the dataset is examined below; Breslow and Day discuss the data in a much more sophisticated way than is attempted here.

Alcohol consumption (g/day)   Cases   Controls   Totals
80+                              96        109      205
<80                             104        666      770
Totals                          200        775      975
Proportions (80+)             0.480      0.141

Odds ratio: 5.640

Table 3.2.5: Oesophageal cancer case-control study

A Z-test (see Table 3.2.6, p<0.001) provides strong statistical evidence of an association between alcohol consumption and disease status. The sample proportions P(high alcohol consumption for Cases)=0.48 and P(high alcohol consumption for Controls)=0.14 are, therefore, indicative of underlying differences.

Test and CI for Two Proportions

Sample      80+     N   Sample p
Cases        96   200   0.480000
Controls    109   775   0.140645

Difference = p (1) - p (2)
Estimate for difference: 0.339355
95% CI for difference: (0.265916, 0.412793)
Test for difference = 0 (vs not = 0): Z = 10.50  P-Value = 0.000

Table 3.2.6: Minitab analysis of oesophageal cancer case-control study

The confidence interval suggests a higher prevalence of heavy drinking among the cases, over and above the level for the controls, of between 27 and 41 percentage points.

Measuring Association

The Timolol study was a clinical trial: an experimental study in which patients were randomly assigned to the treatment and control groups. In such cases we have strong evidence for a causal relationship between the response and the treatment administered.
The oesophageal cancer study, on the other hand, was a purely observational study: in such cases the assumption or deduction of a causal relationship between the risk factor (heavy drinking) and the response (cancer) is always subject to the caveat that there could be other (unknown) factors, related to both of the observed factors, which are responsible for the apparent relationship. This is described as ‘confounding’. For this reason, the term ‘association’ tends to be used, in order to make clear that the data collection method does not allow us to deduce, with any certainty, the existence of a causal relationship. For the Timolol study a natural measure of association is:

p1−p2 = P(Angina-Free | Timolol) − P(Angina-Free | Control)

as it makes a direct comparison between the two proportions of interest. We have already discussed such comparisons extensively. An alternative way of viewing the same information is to calculate an estimate of the relative risk:

Relative Risk = p1/p2 = 0.275/0.129 = 2.13

This lends itself to a simple description of the outcome of the study: “the chances of becoming angina-free are more than doubled for the Timolol group”. Similarly, for the slate workers study the mortality rates after 24 years were p1=0.522 for the slate workers and p2=0.435 for the controls. The relative risk of dying was 1.2: slate workers had a 20% higher chance of dying within the study period. Neither of these measures (p1−p2 or p1/p2) is available if the study is retrospective, since the relevant proportions are not available from the data. For example, in the oesophageal cancer study 96/205=0.47 appears to estimate the probability of being a case, i.e., of having this cancer. However, if the researchers had identified 400 rather than 200 cases, the estimate would be 192/301=0.64, assuming the number of heavy drinkers also doubled.
Thus, the estimate would be determined by the design of the study rather than the characteristics of the population being studied. An alternative measure of association, the odds ratio, has the useful property that it provides a direct measure of association, even when the data collection is retrospective.

Odds Ratio

Table 3.2.7 reproduces the Timolol data, together with a second table where the frequencies are represented by letters.

              Timolol   Control   Totals              Timolol   Control   Totals
Angina free       44        19       63      AF          a         c       a+c
Not A-free       116       128      244      not-AF      b         d       b+d
Totals           160       147      307      Totals     a+b       c+d       n

Table 3.2.7: Timolol study results

Out of the 160 people in the Timolol group 44 become Angina-free (AF) while 116 do not. Instead of expressing the chance of becoming Angina-free as a proportion of the total, p1=44/160, an alternative is to say that the odds of becoming AF as opposed to not-AF are 44:116, which we may write as 44/116. Similarly, the odds of becoming AF for the control group are 19/128. If we take the ratio of these two quantities we get the odds ratio:

OR = (44/116)/(19/128) = (a/b)/(c/d) = ad/bc = 2.6

The odds of becoming AF are more than two and a half times higher for patients who received Timolol. Table 3.2.8 reproduces the oesophageal cancer study results in a similar format to that just considered for the Timolol study. Note, however, that it is the row totals that are fixed by the study design here, whereas it was the column totals for the Timolol study.
              Alcohol consumption                        Alcohol consumption
               80+     <80    Totals                       80+    <80   Totals
Cases           96     104      200      Cases              a      c     a+c
Controls       109     666      775      Controls           b      d     b+d
Totals         205     770      975      Totals            a+b    c+d     n

Table 3.2.8: Oesophageal cancer study results

By design, we cannot calculate directly the odds of being a case for the high alcohol group (just as we could not calculate the corresponding probability, as discussed earlier), because the number would be meaningless – to see this you need only realise that the odds of being a case would double if the study had included 400 instead of 200 cases (assuming the same proportion of heavy drinkers appeared in the study, the odds would be 192/109=1.76 instead of 96/109=0.88). We can, however, find the odds of being a heavy drinker versus a moderate drinker for each of the two groups. For the cases, this is 96/104 and for the controls, it is 109/666. The odds ratio is then:

OR = (96/104)/(109/666) = (a/c)/(b/d) = ad/bc = 5.6

The odds are more than five times higher that an oesophageal cancer patient will be a heavy drinker, than that a control will be one. Note, however, that the odds ratio formula (in letters) for the retrospective case-control study turns out to be exactly the same as for the prospective experimental study:

OR = ad/bc

and so, although the data were collected retrospectively, we can say that the odds of getting oesophageal cancer are more than five times higher for heavy drinkers than they are for more moderate drinkers3. The ratio is often called the ‘cross-product’ ratio because of the ‘shape’ of the resulting formula: compare it to the layout of the letters in the tables. This property, that it produces a forward-oriented index even when the data collection is retrospective, makes the odds ratio an attractive measure of association.
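The cross-product calculation is easy to check numerically. The sketch below (function and variable names are my own) computes ad/bc for each 2x2 table as displayed, and also verifies the key property discussed above: doubling the number of cases scales a whole row of the case-control table and leaves the odds ratio unchanged.

```python
def cross_product_ratio(table):
    """Odds ratio ad/bc for a 2x2 table [[a, b], [c, d]] as displayed."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Timolol trial (Table 3.2.7): rows AF / not-AF, columns Timolol / Control
or_timolol = cross_product_ratio([[44, 19], [116, 128]])   # about 2.6

# Oesophageal cancer study (Table 3.2.8): rows Cases / Controls, columns 80+ / <80
or_cancer = cross_product_ratio([[96, 104], [109, 666]])   # about 5.6

# Doubling the number of cases doubles the whole Cases row; ad/bc is unchanged
or_doubled = cross_product_ratio([[192, 208], [109, 666]])
assert abs(or_doubled - or_cancer) < 1e-12
```

This invariance is exactly the argument made in footnote 3: the doubling cancels in the ratio of the two odds.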
The odds ratio also allows the tools of multiple regression to be applied to frequency data, when we want to analyse the influence of several risk factors simultaneously, e.g., the effects not just of alcohol consumption, but also of age, smoking habit, nutritional status etc. There is another reason why the odds ratio is considered a useful index: it is frequently applied to diseases with low rates4, and so it gives a close approximation to the relative risk for the groups being compared, even when the sampling is retrospective and the relative risk cannot be calculated.

3 Thus, although the odds a/b and c/d of being a case are dependent on the design (total number of cases identified) and hence are not interpretable, their ratio is, since, while doubling the number of cases will double both a and c, the ratio of the two odds will be unchanged, as the doubling will cancel. Note that this is not true, in general, for the relative risk p1/p2.

4 The odds are p/(1–p), which is approximately p(1+p+p²), so, as long as p is small (say no more than 0.05), p² and higher powers of p are negligible and p/(1–p) is effectively equal to p. Thus, the relative risk is effectively the same as the odds ratio when p is very small. Note that this approximate formula is based on a power series expansion of the function p/(1–p), so unless you have studied this kind of mathematics, do not expect to be able to establish the relationship algebraically.

Exercises

3.2.2. 460 adults took part in an influenza vaccination study [7]. Of these, 240 received the influenza vaccine and 220 received a placebo vaccination. At the end of the trial 100 people contracted influenza, of whom 20 were in the treatment group and 80 in the control group. Carry out a statistical test to determine if the vaccine is effective. Calculate and interpret a confidence interval which will estimate the potential effect of the vaccine (beyond that of a placebo).
Calculate the odds ratio. If the entire population of adults had been vaccinated, estimate the proportion of the population that would be expected to have contracted influenza during the test period.

3.2.3. A study of respiratory disease in childhood [7] examined the question of whether children with bronchitis in infancy get more respiratory symptoms in later life than others. Of 273 children with a history of bronchitis before age 5 years, 26 were reported to have a persistent cough at age 14, whereas in the group of 1046 children in the study who did not have bronchitis before the age of 5 years, 44 were reported to have a persistent cough at 14 years. Are the proportions having a persistent cough statistically significantly different? Calculate an interval estimate of the long-run average difference between the two rates of occurrence of persistent cough at age 14 years. Calculate the odds ratio.

3.2.4. Hauer, Ansorge and Zinke [8] report data on otters that were either killed on the road or shot by hunters in eastern Germany over a 40-year time period. Note that the numbers were recovered from summaries in the original paper and the total sample size here is 1023, but the original paper refers to a total sample size of 1027. Table 3.2.9 shows the N=1023 otters cross-classified into a 2x2 table with age-group and sex as the row and column variables, respectively. The authors of the paper described otters of age four or more years as ‘grown adults’; I have described otters younger than this as ‘young’.
          Male   Female   Totals
Young      286      153      439
Adults     312      272      584
Totals     598      425     1023

Table 3.2.9: Otters cross-classified by age and sex

Note that the sampling scheme here is different from our earlier 2x2 tables. The Timolol study was a clinical trial in which patients were randomly assigned to treatment and control groups and the outcome (becoming angina-free) was recorded later; the Slate Workers example was a cohort study in which two groups of subjects were identified and then followed over many years; the Oesophageal Cancer data came from a case-control study in which cancer patients and a control group were identified and interviewed (retrospectively) about their drinking habits (a potential risk-factor for the cancer). In the Otter study the researchers collected data for a given time period, and then cross-classified the results, without pre-specifying any sample size – the total sample size was itself a random quantity. A natural practical question is ‘by how much does the long-run proportion of young male deaths exceed that for young females?’ This question can be answered by calculating a confidence interval for the difference between the relevant proportions. Carry out the corresponding Z-test, also.

Example 6: Collapsing Multi-dimensional Contingency Tables: Simpson’s Paradox – a statistical health warning!

Consider a hypothetical Graduate School with two programmes: Mathematics and English5. Suppose Table 3.2.10 represents the admission/rejection decisions for applications in a particular year, classified by sex. The table strongly suggests bias against females (12% acceptance versus 29% for males) on the part of those making the admission decisions. This difference is highly statistically significant, as shown by the Z-test result (Z=4.8).
Overall                    Males   Females   Totals
Reject                       107       370      477
Accept                        43        50       93
Number of Applicants         150       420      570
Percentage Accepted           29        12       16
Z-value                      4.8
p-value                    0.000

Table 3.2.10: Overall Graduate School admission decisions

Consider now the results from the two Schools individually, as shown in Table 3.2.11. Note that the data in Table 3.2.11 sum to give the corresponding totals in Table 3.2.10 – 150 males and 420 females. In both cases the sample admission rate is higher for females (though not statistically significantly so) – 50 versus 40% for Mathematics and 10 versus 6% for English. This is in stark contrast to the overall rates for the Graduate School. This phenomenon (aggregated results contradicting the disaggregated results) is called Simpson’s Paradox. Why does it happen? In this example females apply (in greater numbers) for the programme into which it is more difficult to gain admission. Thus, only 10% of applicants are accepted into the English programme while 42% are accepted into Mathematics, but the females overwhelmingly apply for English. The conclusion must be that if there is sex bias, the source needs to be sought, not in the admissions procedures of the Graduate School, but in the conditioning and social influences that lead females to apply disproportionately for the graduate programmes into which it is more difficult to gain entry.

5 The Graduate School example is inspired by Bickel et al. [9]

Mathematics                Males   Females   Totals
Reject                        60        10       70
Accept                        40        10       50
Number of Applicants         100        20      120
Percentage Accepted           40        50       42
Z-value                    0.828
p-value                    0.408

English                    Males   Females   Totals
Reject                        47       360      407
Accept                         3        40       43
Number of Applicants          50       400      450
Percentage Accepted            6        10       10
Z-value                    0.907
p-value                    0.364

Table 3.2.11: Graduate School admission decisions by School

This should be a sobering example when we consider how often we collect multiply cross-classified data and, typically, collapse across categories to form a multiplicity of two-dimensional tables for analytical purposes. For example, Social/Political surveys will often collect information on Voting Intention, Party Affiliation, Sex, Age, Income Group, Education, Occupational Status, Geographical Location etc. Data collected on expectant mothers will include most of the above and also variables such as number of previous pregnancies, smoking, drinking, dietary, and exercise habits, as well as medical history information. There is an obvious need for care in the analysis of such information, to ensure that apparent effects are not the result of collapsing over influential variables.

Matched Studies

The analysis of two-sample studies where proportions are the appropriate outcome measure was introduced using the results of a clinical trial of Timolol for preventing angina attacks. The Timolol study design is similar to that which underlies studies where two-sample t-tests are used for the analysis (Chapter 2, Section 2.4). If, instead of simply classifying each patient as Angina-Free or not, we could assign an (approximately Normally distributed) improvement score to each individual in the study, we could have used a two-sample t-test (or a Z-test, given the sample sizes) and the corresponding confidence interval to analyse the resulting data.
This would be preferable statistically (more powerful; statistical power is discussed in Chapter 4 – a more powerful test is more likely to lead to a correct result), as there is very little information embodied in a simple yes/no classification. When we studied two-sample t-tests, we also discussed paired t-tests: these were appropriate when each person or experimental unit was individually matched with another, or where, in measurement studies, the same material/experimental unit was measured twice, by two different instruments. Matching increases the statistical power of the test by eliminating much of the chance variation affecting the comparison – the ‘treatments’ are compared on more uniform material than is the case for two-sample t-tests. Matching can also be used where the response is binary. The analysis of matched studies will be illustrated using data from a case-control study. We previously encountered a case-control study in Example 5 (oesophageal cancer study); there, the control group was selected from the same community of adult males as the cases, but there was no attempt to match the characteristics of individual cases to those of individual controls. In our next example, patients with other forms of cancer were individually matched to the cases, all of whom had a particular cancer.

Example 7: Case-control study – Endometrial cancer

Rice [10] cites a study [11] in which 317 patients, who were diagnosed as having endometrial carcinoma, were matched with other cancer patients. The controls all had one of cervical cancer, ovarian cancer or cancer of the vulva. Each control was matched by age at diagnosis (within 4 years) and year of diagnosis (within 2 years) to a corresponding case. Table 3.2.12 shows the 317 pairs cross-classified by use or not of oestrogen for at least six months prior to the cancer diagnosis.
The question of interest is whether or not there is an association between exposure to oestrogen and the presence of endometrial cancer.

                         Controls
                   Exposed   Not-expos.   Totals               Exposed  Not-expos.  Totals
Cases  Exposed         39          113       152    Exposed       a         b        a+b
       Not-expos.      15          150       165    Not-expos.    c         d        c+d
Totals                 54          263       317    Totals       a+c       b+d        n

Table 3.2.12: A matched case-control study of endometrial cancer

We address the question of interest by asking if the proportion of cases exposed to oestrogen [p1=(a+b)/n = 152/317] is statistically significantly different from the proportion of the controls who were exposed [p2=(a+c)/n = 54/317]. This means that we want to assess whether or not the difference, (p1–p2) = [(a+b)/n – (a+c)/n] = (b–c)/n, is statistically significantly different from zero, and if so, to estimate the long-run difference. A confidence interval provides both the answer to the statistical significance question and also the interval estimate we require.

Confidence Interval

The two sample proportions are not statistically independent of each other, so the standard error formula used in comparing independent proportions (e.g., the Timolol study) is not applicable here. The appropriate standard error estimate is:

SE^(p1–p2) = (1/n)√[(b+c) – (b–c)²/n]

We can use this to obtain an approximate 95% confidence interval for the difference between the corresponding long-run (population) proportions, (π1–π2). Thus, when we fill the relevant numbers for the endometrial cancer study into the formula:

(p1–p2) ± 1.96 (1/n)√[(b+c) – (b–c)²/n]

we obtain:

(0.48 – 0.17) ± 1.96 (1/317)√[(113+15) – (113–15)²/317]

0.31 ± 0.06

The interval is quite far from zero, suggesting that the proportion exposed to oestrogen in the population of endometrial cancer patients is considerably higher than that for the other cancer groups – clear evidence of an association between the use of oestrogen and the development of endometrial cancer.
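The matched-pairs interval can be checked numerically. This is a minimal sketch (variable names are my own) using the correlated-proportions standard error; it also evaluates the statistic (b–c)/√(b+c) used for the corresponding test of no association.

```python
from math import sqrt

# Endometrial cancer matched pairs (Table 3.2.12)
# a: both exposed, b: case only, c: control only, d: neither
a, b, c, d = 39, 113, 15, 150
n = a + b + c + d                 # 317 pairs

diff = (b - c) / n                               # p1 - p2 = (b - c)/n
se = sqrt((b + c) - (b - c) ** 2 / n) / n        # SE for correlated proportions
lo, hi = diff - 1.96 * se, diff + 1.96 * se

z = (b - c) / sqrt(b + c)                        # statistic for testing no association

print(f"difference = {diff:.3f} +/- {1.96 * se:.3f}, Z = {z:.1f}")
```

Only the discordant pairs (b and c) drive both the interval width and the test statistic; the concordant pairs a and d enter only through n.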
A Statistical Test

An alternative approach to determining if an association exists is through the use of a statistical test. If we wish to test the null hypothesis that π1=π2 then, when this hypothesis is true, the estimated standard error formula simplifies6 to:

SE^(p1–p2) = (1/n)√(b+c)

Using this we calculate a Z-statistic:

Z = [(p1–p2) – 0] / SE^(p1–p2) = [(b–c)/n] / [(1/n)√(b+c)] = (b–c)/√(b+c)

which follows a standard Normal distribution (again, when the null hypothesis is true). For the oestrogen data we obtain:

Z = (113–15)/√(113+15) = 8.7

The critical values are ±1.96 for a two-tail test using a significance level of 0.05. Our observed value of Z=8.7 is far out in the right-hand tail. There is, therefore, strong evidence of an association between oestrogen use and endometrial cancer.

6 If the null hypothesis is true, the expectation (long-run value) of (p1–p2) is zero, i.e., the expectation of (b–c)/n is zero, so the (b–c)²/n part of the standard error formula drops out.

Exercises

3.2.5 Samuels [1] reports data from a 1973 study of risk factors for stroke in which 155 women who had experienced a haemorrhagic stroke (cases) were interviewed. Controls (who had not had a stroke) were matched to each case by where they lived, age and race. The subjects were asked if they had used oral contraceptives. The table below displays oral contraceptive use (Yes/No) for all pairs.

                Controls
                Yes    No
Cases   Yes       5    30
        No       13   107

Table 3.2.13: Pairs classified by use of oral contraceptives

Do the data of Table 3.2.13 suggest an association between oral contraceptive use and strokes?

3.2.6 Altman [12] reports data from a 1975 study carried out to see if patients whose skin did not respond to dinitrochlorobenzene (DNCB), a contact allergen, would show an equally negative response to croton oil, a skin irritant.
The table below shows the results of simultaneous tests on the two chemicals for 173 patients with skin cancer; thus, this is an example of a “self-paired” study.

                 DNCB
                 +     –
Croton    +     81    48
Oil       –     23    21

Table 3.2.14: Pairs classified by response to DNCB and croton oil

Is there evidence of a differential response to the two chemicals?

Comparing two proportions that sum to 1

The matched samples design leads to proportions that are correlated with each other. A special case of correlated proportions occurs when there are only two proportions to be compared and the proportions are based on the same total. The proportions will then be perfectly correlated, in that knowing one immediately gives the value for the other, and the standard error needs to take account of this fact. The analysis is a simple modification of our earlier analysis for two independent proportions.

Example 8: French 2007 presidential election poll - revisited

Recall the political poll, of Section 3.1, that estimated the support for Sarkozy (π1) and Royal (π2) in the French 2007 presidential election. Note that in this case the two sample proportions p1 and p2 are not independent since (p1+p2)=1 (assuming respondents chose one or the other, i.e., no ‘neithers’ or ‘don’t knows’ in the sample of 858). If we wish to focus on the difference in support for the two candidates then the standard error of the sample difference (p1–p2) is:

SE(p1–p2) = 2√[π1(1–π1)/n]

where n is the size of the poll and π2=(1–π1); see Appendix 1, page 61 for details. In practice, we replace π1 by p1 in using this formula for estimation. Note the difference between this formula and that for the standard error of the difference between independent proportions (page 9).
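This standard error formula can be sketched in code. The figures below (p1 = 0.530, n = 858) are those of the poll; the variable names are my own.

```python
from math import sqrt

# French 2007 poll: p1 + p2 = 1, so the two proportions are perfectly correlated
n = 858
p1 = 0.530        # Sarkozy
p2 = 1 - p1       # Royal
diff = p1 - p2

# SE(p1 - p2) = 2*sqrt(p1*(1 - p1)/n): twice the SE of a single proportion
se = 2 * sqrt(p1 * (1 - p1) / n)
lo, hi = diff - 1.96 * se, diff + 1.96 * se

# The interval covers zero: Sarkozy's sample lead is not statistically significant
assert lo < 0 < hi
```

Because p2 is determined by p1, the whole calculation reduces to the sampling variability of a single proportion, scaled by two.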
A 95% confidence interval for Sarkozy’s lead at the time the poll was taken is:

(p1–p2) ± 1.96 (2)√[p1(1–p1)/n]

(0.530 – 0.470) ± 3.92√[0.530(0.470)/858]

0.060 ± 0.067

This interval covers zero and so the sample difference is not statistically significant – the corresponding significance test would not reject the hypothesis of no difference in the population of voters. Despite the size of the sample difference in Sarkozy’s favour, the sampling variability is such that Royal could, in fact, have had a slight lead in the population of electors from which the sample was selected. The conclusion is exactly the same as that based on an analysis of either Sarkozy’s or Royal’s support separately, since (because p1+p2=1) no extra information is used in analysing the difference. The modification to the standard error formula would be important, though, where one proportion was statistically significantly bigger than the other and an interval estimate of the difference was required.

Exercise

3.2.7 Exercise 3.2.4 (page 20) analysed data on otters that were either killed on the road or shot by hunters in eastern Germany over a 40-year time period. The total sample size of 1023 was made up of 598 males and 425 females. The authors state that there is an “even sex ratio in living otters”. Do the study results support the proposition that male otters are more likely to be killed, or could the sample difference (58.5 versus 41.5%) be attributable to chance sampling variability in the reported data? Calculate a 95% confidence interval for the difference in the long-run proportions of male and female otters killed in eastern Germany in the study period to answer this question.

3.3 Analysis of Contingency Tables

In Section 3.2 we discussed the analysis of comparative studies involving two groups where the responses were classified into one of two categories.
We now turn our attention to studies which may involve more than two groups and where the responses may be classified into more than two categories. The data resulting from such studies are usually presented in tabular form, with the columns (say) representing the different groups and the rows the different response categories. Such tables are often referred to as ‘contingency tables’: one characteristic is considered as ‘contingent on’ or ‘dependent on’ a second.

Chi-square Test

The analysis of the resulting data involves the use of (for us) a new statistical test – the chi-square test. Before plunging into the analysis of larger tables, the chi-square test will be introduced, by using it to re-analyse an old friend – the Timolol data!

Example 9: The Timolol Study - revisited

Patients were assigned to receive a daily dosage of either Timolol or a Placebo for 28 weeks. The numbers of patients who did or did not become completely free of angina attacks (Angina-Free) are shown again in Table 3.3.1.

                   Timolol   Placebo   Totals
Angina-Free            44        19       63
Not Angina-Free       116       128      244
Totals                160       147      307

Table 3.3.1: The Timolol Study Results

Suppose we ask the question: “If Timolol has no greater effect than the Placebo, how many of the treatment group would be expected to become Angina-Free?”. Since 63/307 in all become Angina-Free, and this fraction should apply equally to both the treatment and placebo groups if there is no differential effect, we estimate the expected7 number of Angina-Free patients in the Timolol group as shown below.

7 If 100 fair coins are tossed, we expect NxP[head] = 100[0.5] = 50 heads. By this we do not mean that if the experiment were carried out once, the result would be 50 heads. Rather, that in a very long sequence of such experiments the long-run average number of heads would be 50. Thus, the term ‘expected value’ is a technical statistical term meaning ‘long-run average’.
Since an expected value is an average it may assume non-integer values even though the raw data are counts. Accordingly, the expected values should not be rounded to integer values when carrying out the calculations.

E11 = (63/307)(160) = 32.83

Following exactly the same logic, we estimate the expected numbers in the other cells of the 2x2 table as shown in Table 3.3.2, where i=1, 2 labels the rows and j=1, 2 labels the columns.

E11 = (63/307)(160) = 32.83      E12 = (63/307)(147) = 30.17        63
E21 = (244/307)(160) = 127.17    E22 = (244/307)(147) = 116.83     244
            160                              147                   307

Table 3.3.2: Calculating the expected values

Note that the expected value in cell (i, j) is generated by:

Eij = (Row Total x Column Total) / Grand Total

We now have a set of ‘observed values’ and a set of ‘expected values’. If the assumption of no differential treatment effect is valid, then the two sets of values should differ only by chance variation. A measure of the divergence is:

X² = Σ (Oij – Eij)²/Eij

where the sum is taken over all cells of the table. Since a difference of (Oij–Eij)=5 would be much more important if we expected Eij=10 than if we expected Eij=100, the divisor scales each squared deviation to allow for the size of the expected value. The deviations are squared, as otherwise they would sum to zero, since each row and column of the 2x2 table contains a positive and negative deviation (Oij–Eij) of equal magnitude (see Table 3.3.3). Our divergence measure is a test statistic which has a (chi-square) sampling distribution when the assumption under which the expected values were calculated is true. The shape of the sampling distribution (examples will be shown later) depends on a parameter called the ‘degrees of freedom’. This is often labelled ν (the Greek letter nu, corresponding to n); here ν=1.
This value follows from the fact that for 2x2 tables once one expected value is calculated, all the others can be found by subtracting this value from the marginal totals of the table. Thus, the expected value in cell (1,2) is 63–32.83 = 30.17 and so on. Equivalently, once one deviation (Oij–Eij) is found, the rest follow from the fact that the deviations sum to zero across the rows and down the columns, see Table 3.3.3.

(O11–E11)²/E11 = (44–32.83)²/32.83 = (11.17)²/32.83 = 3.797
(O12–E12)²/E12 = (19–30.17)²/30.17 = (–11.17)²/30.17 = 4.133
(O21–E21)²/E21 = (116–127.17)²/127.17 = (–11.17)²/127.17 = 0.980
(O22–E22)²/E22 = (128–116.83)²/116.83 = (11.17)²/116.83 = 1.067

Note the pattern of deviations (Oij–Eij):

  11.17   –11.17    0
 –11.17    11.17    0
    0        0

Table 3.3.3: Calculating the contributions to X²

The Statistical Test

If the proportion becoming Angina-Free would be π1 and the proportion becoming not-Angina-Free would be π2 (note π1+π2=1) in a population of patients receiving the placebo, then the null hypothesis is that the pair [π1, π2] applies also to the population of patients receiving Timolol. The pair [π1, π2] is the ‘relative frequency distribution’: the null hypothesis is that this applies to both groups, equally. The individual cell calculations are shown in Table 3.3.3. These give the test statistic as:

X² = Σ (Oij – Eij)²/Eij = 9.98

If we choose a significance level α=0.05 then the critical value is χ²c = 3.84, as shown in Figure 3.3.1 (see Statistical Table 4: ST-4). Note that the critical region for the test statistic is inherently one-tailed, as large values of X² point to large divergences between observed and expected values, thus requiring rejection of the null hypothesis under which the expected values were calculated.
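The whole calculation above can be verified with a short script. This is a sketch (not the Minitab analysis shown later in the text; names are my own): it builds the expected counts from the margins, sums the cell contributions, and, for one degree of freedom, obtains the p-value from the standard Normal distribution via the identity P(X² > x) = erfc(√(x/2)).

```python
from math import erfc, sqrt

# Timolol 2x2 table: rows = outcome (AF, not AF), columns = group (Timolol, Placebo)
observed = [[44, 19], [116, 128]]

row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
n = sum(row_tot)

# Expected counts: (row total x column total) / grand total
expected = [[r * c / n for c in col_tot] for r in row_tot]

# X2 = sum over cells of (O - E)^2 / E
x2 = sum((o - e) ** 2 / e
         for orow, erow in zip(observed, expected)
         for o, e in zip(orow, erow))

# For 1 degree of freedom, P(X2 > x) = erfc(sqrt(x/2))
p_value = erfc(sqrt(x2 / 2))

print(f"X2 = {x2:.3f}, p = {p_value:.3f}")
```

Note that the expected counts are kept as fractions (32.83, not 33), exactly as the text recommends.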
Figure 3.3.1: The chi-square distribution with degrees of freedom ν=1

Since the test statistic X²=9.98 greatly exceeds the critical value of χ²c = 3.84, the null hypothesis is rejected. The test result suggests a systematic difference between the observed and expected values. This points to a difference between the two patient groups in terms of the long-run proportions becoming Angina-Free. Of course, we already know this from our previous analysis of the sample proportions.

Typical Computer Output

Table 3.3.4 shows a chi-square analysis of the Timolol data, produced by Minitab.

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

             Timolol   Placebo   Total
A-Free            44        19      63
               32.83     30.17
               3.797     4.133
NotA-Free        116       128     244
              127.17    116.83
               0.980     1.067
Total            160       147     307

Chi-Sq = 9.978, DF = 1, P-Value = 0.002

Table 3.3.4: Minitab chi-square output for the Timolol data

The computer output gives the observed and expected values and the contribution to X² for each of the four cells of the table. The individual contributions are summed to give the test statistic, X²=9.978. This is stated to have a p-value of 0.002, which means that if the null hypothesis were true, the probability would be only 0.002 that we would have obtained, by chance, an X² statistic equal to or greater than X²=9.978. This means that our test statistic is unusually large (see Figure 3.3.2), which in turn means that the divergence between the observed and expected values is also large. Such a p-value would lead to rejection of the null hypothesis.

Figure 3.3.2: The p-value is the area in the right-hand tail beyond the calculated test statistic X²=9.978

Chi-square or Z-test?

Table 3.3.5 shows the Minitab analysis for the comparison of the two proportions of patients who become Angina-Free.
Test and CI for Two Proportions

Sample     A-Free     N    Sample p
Timolol        44   160    0.275000
Placebo        19   147    0.129252

Difference = p (1) - p (2)
Estimate for difference: 0.145748
95% CI for difference: (0.0578398, 0.233657)
Test for difference = 0 (vs not = 0): Z = 3.16  P-Value = 0.002

Table 3.3.5: Comparing the proportions becoming angina-free using Minitab

The Z-test for the difference between the population proportions gives a test statistic of Z=3.16; when this is squared it gives (apart from the effects of rounding) a value (9.98) which is the same as the X² statistic for the chi-square test: the two tests are, therefore, equivalent. The Z value has a standard Normal distribution; when the standard Normal is squared it gives a chi-square distribution with one degree of freedom, which is the sampling distribution of the X² statistic. The approach using proportions is (to me at least!) more appealing in that it focuses attention on the quantities of direct interest (the proportions becoming Angina-Free). It also leads naturally to calculating a confidence interval to quantify the likely population difference.

Exercises

3.3.1 Re-analyse the Slate Workers Cohort Study data of Example 4, using a chi-square test.

3.3.2 Re-analyse the Oesophageal Cancer Case-Control Study of Example 5, using a chi-square test.

McNemar’s Test

In Example 7 we analysed a matched case-control study (endometrial cancer) using both a confidence interval for the difference between two correlated proportions and the corresponding Z-test. If the Z-statistic is squared it gives a chi-square test statistic with one degree of freedom, just as is the case for the test on two independent proportions. This test gives identical results to the corresponding Z-test. The chi-square version is usually referred to as McNemar’s test.
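The equivalence of the two tests can be demonstrated numerically. The sketch below (my own variable names) computes the pooled-SE Z-statistic and the X² statistic for the Timolol table and checks that Z² equals X², an exact algebraic identity for 2x2 tables, not merely an approximation.

```python
from math import sqrt

# Timolol data: proportions becoming angina-free
x1, n1 = 44, 160     # Timolol
x2, n2 = 19, 147     # Placebo
p1, p2 = x1 / n1, x2 / n2

# Z-test with the pooled standard error
p = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

# Chi-square statistic from the 2x2 table
observed = [[44, 19], [116, 128]]
row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
n = sum(row_tot)
x2_stat = sum((observed[i][j] - row_tot[i] * col_tot[j] / n) ** 2
              / (row_tot[i] * col_tot[j] / n)
              for i in range(2) for j in range(2))

# The squared Z-statistic equals the chi-square statistic
assert abs(z ** 2 - x2_stat) < 1e-6
```

The apparent discrepancy in the text (3.16² = 9.99 versus 9.98) is purely an artefact of rounding Z to two decimal places.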
Larger Contingency Tables

No new ideas arise when we carry out chi-square tests on contingency tables that are larger than the 2x2 tables just discussed – the amount of calculation simply increases. The initial chi-square test tells us if all groups behave in the same way or not. If they do not, it will be necessary to follow up with further analyses to identify where the differences arise. We consider three examples from different areas: market research, literary style analysis and genetics. Exercises then illustrate applications in physics, hospital management and ecology.

Example 10: Market Research Study

Stuart [13, 14] reports on a market research study in which interviewees were asked whether they had heard of a product, had heard of it but not bought it, or had bought it at least once; these classifications measure ‘the degree of market penetration’ of the product. The study results, for three regions, are shown in Table 3.3.6.

                       Region
                   A      B      C    Totals
Bought            109     49    168      326
Heard-no-buy       55     56     78      189
Never heard        36     45     54      135
Totals            200    150    300      650

Table 3.3.6: Market penetration study

The sample sizes were clearly fixed for the three regions; why they were different is not reported – the regions may, for example, be of different geographical sizes. Before we begin our formal statistical analysis of the data, a few comments on presenting data in tabular form are worth considering. Table 3.3.7 shows the relative frequencies for the market penetration categories in the three regions; the proportions are expressed to three decimal places.

                       Region
                    A        B        C
Bought          0.545    0.327    0.560
Heard-no-buy    0.275    0.373    0.260
Never heard     0.180    0.300    0.180

Table 3.3.7: Relative frequency distributions of response by region

The same information is shown in Table 3.3.8, but this time the numbers are expressed as column percentages and rounded to whole numbers (note that, because of the rounding, region A sums to 101).
The columns are also rearranged to put A beside C to emphasise their similarity.

                Region A   Region C   Region B
Bought                55         56         33
Heard-no-buy          28         26         37
Never heard           18         18         30

Table 3.3.8: The percentage responses by region (rounded)

I think you will agree that Table 3.3.8 is much easier to read and interpret: regions A and C are similar in response characteristics and different from B. Clearly organised tables are important when presenting information to others; see Ehrenberg [15] or Stuart [14] for a discussion of the principles of good table making.

The Statistical Model

For the chi-square analysis of the Timolol study our model was that a fraction (π) of the population of patients would become free of angina attacks, and a fraction (1 − π) would not. The question of interest was whether the same fractions [π, 1 − π] applied to both the Timolol and the placebo groups. Here, we have three groups and three response categories. Our model is that the observed sets of three response frequencies are generated by the same underlying population response relative frequency distribution [π1, π2, π3].

Analysis

The observed relative frequencies from the three regions are clearly not the same. The first statistical question is whether or not the differences between regions could be attributed to chance sampling variation. If the answer is yes, there is little point in making further between-region comparisons. This question will be addressed using a chi-square test which is a straightforward extension of the test carried out on the 2x2 table earlier. Differences between regions will then be quantified using confidence intervals for differences between proportions – these are calculated exactly as discussed earlier for the Timolol data. Our null hypothesis is that there is a common set of population relative frequencies [π1, π2, π3].
On this assumption, Table 3.3.6 suggests that a natural estimate of the proportion who have bought the product is 326/650, and this should apply equally to all three regions. This implies that the expected number of respondents in region A who have bought the product is:

E11 = (326/650)(200) = 100.31

As before, this estimate takes the form:

Eij = (Row Total x Column Total) / Grand Total

Following precisely the same logic, Table 3.3.9 shows the calculation of the expected values for all the cells. Note that the row and column totals are identical to those for the observed values, as shown in Table 3.3.6.

                   A                            B                           C                            Totals
Bought             E11 = 326(200)/650 = 100.31  E12 = 326(150)/650 = 75.23  E13 = 326(300)/650 = 150.46    326
Heard-not-bought   E21 = 189(200)/650 = 58.15   E22 = 189(150)/650 = 43.62  E23 = 189(300)/650 = 87.23     189
Never heard        E31 = 135(200)/650 = 41.54   E32 = 135(150)/650 = 31.15  E33 = 135(300)/650 = 62.31     135
Totals             200                          150                         300                            650

Table 3.3.9: Calculating the expected values

Table 3.3.10 shows the divergences between the observed and expected values [Oij – Eij]. Note that the divergences sum to zero across all rows and down all columns (our method of calculating the expected values forces the expected table margins to be the same as the observed table margins, which means the differences between the margins are zero).

                   A                      B                      C                      Sum
Bought             109 - 100.31 = 8.69    49 - 75.23 = -26.23    168 - 150.46 = 17.54     0
Heard-not-bought   55 - 58.15 = -3.15     56 - 43.62 = 12.38     78 - 87.23 = -9.23       0
Never heard        36 - 41.54 = -5.54     45 - 31.15 = 13.85     54 - 62.31 = -8.31       0
Sum                0                      0                      0

Table 3.3.10: Differences between observed and expected values

Because of these restrictions on the table margins, once four cells are filled (either expected values or differences between observed and expected) the other five are automatically determined – the deviations are said, therefore, to have four degrees of freedom. This means that when we calculate our X2 test statistic, we will compare it to a chi-square distribution with four degrees of freedom.
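The expected-value calculation of Table 3.3.9 can be done for every cell at once from the table margins. A minimal sketch (numpy assumed; the array layout mirrors Table 3.3.6):

```python
import numpy as np

# Observed counts from Table 3.3.6; rows are the response categories
# (Bought, Heard-no-buy, Never heard), columns the regions (A, B, C)
observed = np.array([[109, 49, 168],
                     [ 55, 56,  78],
                     [ 36, 45,  54]])

row_totals = observed.sum(axis=1)   # 326, 189, 135
col_totals = observed.sum(axis=0)   # 200, 150, 300
grand_total = observed.sum()        # 650

# E_ij = (row total x column total) / grand total, for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total

# The divergences O - E sum to zero along every row and every column,
# which is why this 3x3 table has (3-1)x(3-1) = 4 degrees of freedom
divergence = observed - expected
```

Printing `expected` reproduces Table 3.3.9 (e.g. the top-left cell is 100.31), and the zero row and column sums of `divergence` reproduce the margins of Table 3.3.10.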
In general, the degrees of freedom for contingency tables are given by:

degrees of freedom = (Number of rows – 1) x (Number of columns – 1)

which, for the current dataset, gives:

degrees of freedom = (3 – 1) x (3 – 1) = 4.

Figure 3.3.3: Chi-square distribution with 4 degrees of freedom

The Minitab analysis of Table 3.3.11 shows that the individual cell contributions sum to a test statistic of:

X2 = Σ (Oij – Eij)2 / Eij = 24.6    (summed over all cells)

which greatly exceeds the critical value of 9.49 (shown in Figure 3.3.3). Clearly, the assumption of a common underlying relative frequency distribution of responses must be rejected.

Chi-Square Test: A, B, C

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                    A        B        C   Total
Bought            109       49      168     326
               100.31    75.23   150.46
                0.753    9.146    2.044

Heard-no-buy       55       56       78     189
                58.15    43.62    87.23
                0.171    3.517    0.977

Never heard        36       45       54     135
                41.54    31.15    62.31
                0.738    6.154    1.108

Total             200      150      300     650

Chi-Sq = 24.608, DF = 4, P-Value = 0.000

Table 3.3.11: Minitab chi-square analysis of market research data

A chi-square analysis of regions A and C supports the conclusion of common underlying rates, which was suggested by Table 3.3.8. The p-value of 0.928 in Table 3.3.12 shows the observed and expected values to be very close to each other.8

Chi-Square Test: A, C

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                    A        C   Total
Bought            109      168     277
               110.80   166.20
                0.029    0.019

Heard-no-buy       55       78     133
                53.20    79.80
                0.061    0.041

Never heard        36       54      90
                36.00    54.00
                0.000    0.000

Total             200      300     500

Chi-Sq = 0.150, DF = 2, P-Value = 0.928

Table 3.3.12: Chi-square comparison of responses for regions A and C

Table 3.3.13 shows a chi-square comparison of the pooled values for regions A and C versus those of region B: region B clearly behaves differently.
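The Minitab analysis of Table 3.3.11 can be reproduced with any statistics library; a minimal sketch using scipy (assumed available) is:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Market research counts from Table 3.3.6 (rows: Bought, Heard-no-buy,
# Never heard; columns: regions A, B, C)
observed = np.array([[109, 49, 168],
                     [ 55, 56,  78],
                     [ 36, 45,  54]])

# Returns the X2 statistic, p-value, degrees of freedom and expected counts
stat, p_value, dof, expected = chi2_contingency(observed)
print(round(stat, 3), dof)   # 24.608 on 4 degrees of freedom, as in Table 3.3.11
```

The returned `expected` array matches Table 3.3.9, and the very small p-value corresponds to Minitab's "P-Value = 0.000".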
Note that the statistical tests shown in Tables 3.3.12 and 3.3.13 have a different status from that in Table 3.3.11. The tests carried out in Tables 3.3.12 and 3.3.13 were prompted by examining the data. It is axiomatic in scientific investigation that you cannot form a hypothesis on the basis of a set of data and then test that hypothesis using the same dataset. These tests may be regarded, therefore, as crude measures in an exploratory analysis; we would treat the results with a great deal of caution if the tests were only marginally statistically significant. Given the marked differences in responses between B and the combined A and C, it is possible (even likely?) that the market researchers would have expected such results and might have been in a position to formulate the hypotheses underlying the tests of Tables 3.3.12 and 3.3.13 a priori, in which case the tests would have their usual status. It might, for example, be the case that B is a relatively new market while A and C are mature markets; Stuart does not provide further information which would throw light on such questions.

8 My colleague Michael Stuart found the data in a textbook (but has lost the reference). The results in Table 3.3.12 are so closely matched for regions A and C as to suggest to me that they may have been invented for illustrative purposes! However, I am taking them at face value until he retrieves the reference.

Chi-Square Test: A+C, B

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                  A+C        B   Total
Bought            277       49     326
               250.77    75.23
                2.744    9.146

Heard-no-buy      133       56     189
               145.38    43.62
                1.055    3.517

Never heard        90       45     135
               103.85    31.15
                1.846    6.154

Total             500      150     650

Chi-Sq = 24.461, DF = 2, P-Value = 0.000

Table 3.3.13: Chi-square comparison of pooled A and C versus B

Quantitative statements regarding differences can be supported by confidence intervals for the population differences.
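Such an interval can be computed directly from the counts. A sketch for the 'Bought' proportions in the pooled region A+C versus region B (scipy assumed; this uses the usual unpooled standard error for the interval):

```python
from math import sqrt
from scipy.stats import norm

# 'Bought' counts for the pooled region A+C versus region B (Table 3.3.13)
x1, n1 = 277, 500   # A+C
x2, n2 = 49, 150    # B

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Standard error of the difference between two independent sample proportions
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = norm.ppf(0.975)   # approximately 1.96 for a 95% interval

lower, upper = diff - z * se, diff + z * se
print(round(diff, 4), (round(lower, 4), round(upper, 4)))
```

The result (a difference of about 0.227, with a 95% interval of roughly 0.14 to 0.31) matches the Minitab output discussed in the text.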
For example, Table 3.3.14 shows a comparison of the proportions of consumers who have bought the product in region B versus the combined region A+C. Expressed in terms of percentages, the point estimate for the difference is 23%, and the confidence interval suggests that the percentage of consumers in regions A and C who bought the product is between 14 and 31 percentage points higher than the corresponding value for region B.

Test and CI for Two Proportions

Sample   Bought     N   Sample p
A+C         277   500      0.554
B            49   150      0.327

Difference = p (1) - p (2)
Estimate for difference:  0.227333
95% CI for difference:  (0.140550, 0.314117)
Test for difference = 0 (vs not = 0):  Z = 4.88  P-Value = 0.000

Table 3.3.14: Difference in buying proportions; regions A+C versus B

Figure 3.3.4: A graphical representation of the study results (percentages)

Figure 3.3.4 presents the results of the study in graphical form. Note that the lines connecting the dots have no intrinsic meaning – they simply help us visualise the different market penetration 'profiles' corresponding to the two market segments. The picture shows regions A and C to be higher by 22 percentage points than region B in relation to those who have bought the product. On the other hand, B is higher by 10-12 percentage points for the other two market penetration categories. Contrast this picture with Table 3.3.6 as a way of reporting the study results. Tables of technical statistical analyses, such as Table 3.3.11, are inappropriate for presenting study results – they contain too many unnecessary technical details. If included at all, they should appear in a technical appendix to a report on the market research study.

Example 11: Literary Style

Jane Austen left an unfinished novel (Sanditon) when she died. She also left a summary of the remaining chapters. The novel was finished by an imitator, who tried to match her style as closely as possible.
The data below are cited by Rice [10], based on an analysis by Morton [16]. Six words were selected for analysis and their frequencies were counted in Chapters 1 and 3 of Sense and Sensibility, Chapters 1, 2 and 3 of Emma, Chapters 1 and 6 of Sanditon (described below as Sanditon-A, written by Austen) and Chapters 12 and 24 of Sanditon (described below as Sanditon-B, written by the imitator). Can we distinguish between the authors on the basis of the word counts?

           Sense+Sensibility   Emma   Sanditon-A   Sanditon-B   Totals
a                        147    186          101           83      517
an                        25     26           11           29       91
this                      32     39           15           15      101
that                      94    105           37           22      258
with                      59     74           28           43      204
without                   18     10           10            4       42
Totals                   375    440          202          196     1213

Table 3.3.15: Word counts for the literary works

Statistical Model

Before beginning our analysis it is worthwhile considering the nature of the data given in the table. Note that the data generation mechanism is quite different from that of our previous market research example. There, there were three categories of response (bought, heard but not bought, and never heard of the product). The three relative frequencies or percentages for any region provided us with estimates of consumer behaviour. Here the row categories are arbitrarily selected words used by the author. The relative frequencies for any one novel do not have any intrinsic meaning (unlike the market research study); they are simply the relative frequencies of these particular words (relative to the full set of words included in the list) and would change if any of the words were dropped from the list or if others were added. The statistical model that underlies the chi-square analysis that follows is that there is a set of relative frequencies for these arbitrarily selected words [π1, π2, ..., π6] that characterises Austen's writing style; it underlies all her writing and should apply also to the imitator, if he has been successful in matching her style.
If this is true, then the observed frequencies (considered as a random sample from her works) should only reflect chance variation away from this stable underlying structure.

Analysis

Before comparing Austen to her imitator, it makes sense to check if the word distributions are stable within her own work. If they are not, then the basis for the comparison is undermined; we might have to ask ourselves to which Jane Austen we want to compare the imitator! The frequency distributions for the three known Austen works are given in Table 3.3.16 and a chi-square test is carried out. The critical value for a test using a significance level of 0.05, where there are 10 degrees of freedom, is 18.31. The test statistic of 12.3 is much less than this (with a correspondingly large p-value) and so we do not reject the null hypothesis of a consistent underlying structure for the three works. It makes sense, therefore, to combine the counts for the three works before comparing them to the imitator. One set of frequencies, based on a total sample size of 1017, will be more stable than three sets based on smaller sample sizes. The observed values in the first column of Table 3.3.17 are the combined word counts, that is, the totals from the right-hand margin of Table 3.3.16.
Chi-Square Test: Sensibility, Emma, Sanditon-A

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

          Sensibility     Emma   Sanditon-A   Total
a                 147      186          101     434
               160.03   187.77        86.20
                1.061    0.017        2.540

an                 25       26           11      62
                22.86    26.82        12.31
                0.200    0.025        0.140

this               32       39           15      86
                31.71    37.21        17.08
                0.003    0.086        0.254

that               94      105           37     236
                87.02   102.10        46.88
                0.560    0.082        2.080

with               59       74           28     161
                59.37    69.66        31.98
                0.002    0.271        0.495

without            18       10           10      38
                14.01    16.44         7.55
                1.135    2.523        0.797

Total             375      440          202    1017

Chi-Sq = 12.271, DF = 10, P-Value = 0.267

Table 3.3.16: Consistency check on Austen's style

In Table 3.3.17 the test statistic for the between-author comparison (X2 = 32.8) is highly statistically significant (the critical value for a significance level of 0.05 with 5 degrees of freedom is 11.07). This indicates a divergence between the authors in the relative frequency of use of one or more of the selected words. Note the very large contributions to the overall chi-square statistic given by 'an' (13.899 for the imitator) and 'that' (9.298 for the imitator). Table 3.3.18 shows the relative frequencies (expressed as percentages) of the six words for the two authors.

Chi-Square Test: Austen, Imitator

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

           Austen   Imitator   Total
a             434         83     517
           433.46      83.54
            0.001      0.003

an             62         29      91
            76.30      14.70
            2.679     13.899

this           86         15     101
            84.68      16.32
            0.021      0.107

that          236         22     258
           216.31      41.69
            1.792      9.298

with          161         43     204
           171.04      32.96
            0.589      3.056

without        38          4      42
            35.21       6.79
            0.220      1.144

Total        1017        196    1213

Chi-Sq = 32.810, DF = 5, P-Value = 0.000

Table 3.3.17: Between-author comparison

Relative to his use of the other words, the imitator uses 'an' much more frequently than does Austen (15% versus 6%).
Austen, on the other hand, uses 'that' more frequently (relative to her use of the other words) than does the imitator (23% versus 11%). The impact of these divergences can be seen in Table 3.3.17. For Austen, 62 instances of 'an' were observed; on the equal relative frequency assumption we would have expected 76.3. For the imitator, the corresponding values are 29 and 14.7. Note that the divergences are equal in magnitude but opposite in sign, ±14.3 (as they must be, since the row sums of both the observed and expected values equal 91 and, hence, the sum of the cell divergences is zero). The contribution to X2 is, however, greater for the imitator since the contribution is:

(Oij – Eij)2 / Eij

and Eij is only 14.7 for the imitator, whereas it is 76.3 for Austen. Similar comments apply to the use of 'that'.

Word       Austen   Imitator
a              43         42
an              6         15
this            8          8
that           23         11
with           16         22
without         4          2

Table 3.3.18: Relative frequencies (column percentages) of word use

When 'an' and 'that' are excluded from the word list, the chi-square test that compares the distributions for the remaining words is not statistically significant, as shown in Table 3.3.19. This highlights the potential influence of the arbitrary nature of the word selection in stylistic comparisons.

Chi-Square Test: Austen, Imitator

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

           Austen   Imitator   Total
a             434         83     517
           430.23      86.77
            0.033      0.163

this           86         15     101
            84.05      16.95
            0.045      0.224

with          161         43     204
           169.76      34.24
            0.452      2.243

without        38          4      42
            34.95       7.05
            0.266      1.319

Total         719        145     864

Chi-Sq = 4.746, DF = 3, P-Value = 0.191

Table 3.3.19: The reduced word list comparison

In summary, the imitator appears to have matched the relative frequencies of Austen's use of 'a', 'this', 'with', and 'without', but failed to do so with respect to 'an' and 'that'.
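Both the full and the reduced word-list comparisons can be reproduced in a few lines; a sketch using scipy (assumed available):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Word counts for the combined Austen works versus the imitator (Table 3.3.17);
# rows: a, an, this, that, with, without
full_list = np.array([[434, 83],
                      [ 62, 29],
                      [ 86, 15],
                      [236, 22],
                      [161, 43],
                      [ 38,  4]])

stat, p, dof, _ = chi2_contingency(full_list)
print(round(stat, 2), dof)   # 32.81 on 5 df: highly significant

# Dropping 'an' and 'that' (rows 1 and 3) reproduces Table 3.3.19
reduced = np.delete(full_list, [1, 3], axis=0)
stat_r, p_r, dof_r, _ = chi2_contingency(reduced)
print(round(stat_r, 2))      # 4.75 on 3 df, p about 0.19: not significant
```

The contrast between the two tests makes the summary above concrete: the significance of the between-author comparison is driven almost entirely by 'an' and 'that'.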
Does this mean we will never see a BBC Sanditon series? Will Hollywood be quite so fastidious?

Example 12: Mendel's Pea Plant Data

The data consist of 529 pea plants cross-classified into a two-way table according to the shape [round, round and wrinkled, wrinkled] and colour [yellow, yellow and green, green] of their seeds. They are typical of the type of data presented in discussions of simple Mendelian theories of inheritance. What is of interest here is whether the row and column classifications can or cannot be considered independent of each other: this carries implications for the underlying genetic mechanisms at work.

                          Colour
                 Yellow    Y+G   Green   Totals
        Round        38     65      35      138
Shape   R+W          60    138      67      265
        Wrinkled     28     68      30      126
        Totals      126    271     132      529

Table 3.3.20: Mendel's pea plant data

Statistical Question

In the previous two examples, there were natural response characteristics (e.g., bought, heard but not bought, and never heard) in each of several defined groups (regions for the market research data). An analysis expressed in terms of the equality of the proportions having these characteristics in each of the groups was a natural approach to addressing the question of interest. In the current example, however, neither classification factor (shape or colour) can be thought of as a response while the other drives that response in some way – here, the row and column classifications have equal status. Accordingly, the research question is expressed in the form: "is the row classification independent of the column classification?" For the data of Table 3.3.20 this would mean, for example, that the probability that a randomly selected seed is yellow does not depend on whether or not the seed is round.

Analysis

Seed shape and colour are genetically determined.
In asking if the row and column classifications of the 529 plants are independent of each other, what we are really concerned with are the underlying genetic mechanisms determining seed characteristics. Rejection of the null hypothesis of independence would suggest that the genetic mechanisms are linked in some way.

Chi-Square Test: Yellow, Y+G, Green

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

            Yellow      Y+G    Green   Total
Round           38       65       35     138
             32.87    70.70    34.43
             0.801    0.459    0.009

R+W             60      138       67     265
             63.12   135.76    66.12
             0.154    0.037    0.012

Wrinkled        28       68       30     126
             30.01    64.55    31.44
             0.135    0.185    0.066

Total          126      271      132     529

Chi-Sq = 1.857, DF = 4, P-Value = 0.762

Table 3.3.21: Chi-square test of row-column independence

We estimate the probability that a randomly selected seed will be Round as 138/529, while the probability that it will be Yellow is estimated as 126/529, based on the data in the margins of Table 3.3.20. If the two characteristics are independent of each other then the product of these two sample estimates gives us an estimate of the probability that both characteristics are present simultaneously:

P[Round and Yellow] = P[Round] x P[Yellow] = (138/529) x (126/529)

The estimate of the expected number of round and yellow seeds is given by:

N x P[Round and Yellow] = N x P[Round] x P[Yellow] = 529 x (138/529) x (126/529) = (138 x 126)/529 = 32.87

Note that the expected value is given by:

E11 = (Row Total x Column Total) / Grand Total

as before. Precisely the same logic, applied to the other cells of the table, gives a complete set of expected values under the independence assumption (these are shown in Table 3.3.21). If the sample size were infinite then the observed and expected values would coincide (assuming independence).
Because of the small sample size and the inevitable chance variation in biological systems, we do not expect exact correspondence, even if the two characteristics are independent. However, if they are independent, the X2 measure of divergence should be smaller than the critical value on a chi-square distribution with 4 degrees of freedom. If we choose a significance level of α = 0.05, then our critical value is 9.49. Since the value of X2 in Table 3.3.21 is much smaller than this, there is no basis for rejecting the null hypothesis of independence of the row and column classifications. The observed data are consistent with the independence assumption. It would appear that the genetic mechanisms that determine seed shape and colour do not affect each other.

Exercises

3.3.3 In a study involving high-altitude balloon flights near the north magnetic pole, the numbers of positive and negative electrons in cosmic ray showers were counted. Each particle was classified by charge and into one of three energy intervals (MeV stands for millions of electron volts) as shown in Table 3.3.22. I have lost the source of these data, so no further information is available. Studies like this were the most important source of information on high-energy sub-atomic particles before the construction of high-energy particle accelerators. Cosmic rays are still an important and active research area (see Wikipedia for a long account of the history and current research).

                         Energy Interval (MeV)
Electron Charge   50-100   101-300   301-1000   Totals
Positive               9        32         23       64
Negative              20        51        117      188

Table 3.3.22: Cosmic ray shower data

(a) What can we learn about the charge-energy characteristics of the two particle types from the data? Carry out a chi-square test. What is the null hypothesis? If you get a statistically significant result, carry out further analyses (confidence intervals) to tease out the differences.
(b) Suppose it was realised, subsequent to your analysis, that the energy measurements had been faulty in such a way that electrons of up to 200 MeV could sometimes have been assigned to the lowest energy interval. Re-analyse the data in the light of this information. Do your initial conclusions still hold?

3.3.4 Box, Hunter and Hunter [17] report the data shown below for five hospitals. The frequencies represent patient status after a surgical procedure designed to improve the functioning of certain joints which have become impaired by disease. Patients were classified by post-operative status: no improvement, partial functional restoration and complete functional restoration.

Hospital                   A    B    C    D    E   Totals
No improvement            13    5    8   21   43       90
Partial funct. rest.      18   10   36   56   29      149
Complete funct. rest.     16   16   35   51   10      128
Totals                    47   31   79  128   82      367

Table 3.3.23: Post-operative improvement status of patients at five hospitals

The authors describe hospital E as a 'referral hospital', which presumably means that it is a specialised centre where more difficult procedures are carried out. It is to be expected, therefore, that the post-operative patient characteristics could be different from those of the other hospitals. Analyse the data and present your results in both tabular and graphical form.

3.3.5 Hauer, Ansorge and Zinke [8] report data on otters that were either killed on the road or shot by hunters in eastern Germany over a 40-year time period. Note that the numbers were recovered from summaries in the original paper: the total sample size here is 1023, but the original paper refers to a total sample size of 1027. Table 3.3.24 shows the N=1023 otters cross-classified into a 2x2 table with age-group and sex as the row and column variables, respectively. The authors of the paper described otters of age four or more years as 'grown adults'; I have described otters younger than this as 'young'.
          Male   Female   Totals
Young      286      153      439
Adults     312      272      584
Totals     598      425     1023

Table 3.3.24: Otters cross-classified by age and sex

These data were analysed in Exercise 3.2.4 as a comparison of two independent proportions (the proportion of 'young' animals for each sex). Now investigate whether the row and column classifications can be considered independent (as for the Mendelian data), using a chi-square test to answer the question.

3.4 Goodness-of-Fit Tests

We have used the chi-square test to compare two or more empirical frequency distributions (for example, in the Market Research Study we compared the numbers of consumers falling into each of three 'market penetration' categories, for three regions), to determine whether or not the underlying distributions could be considered the same. Sometimes it is desired to assess whether or not a single frequency distribution corresponds to an expected theoretical distribution; the test is then called a 'goodness-of-fit' test. For example, we might ask if the numbers on a roulette wheel are equally likely to turn up – if not, we might conclude that the wheel is inherently biased or that the wheel operator can influence the outcome of the game.

Example 13: Random Number Generation

Random numbers are used in a variety of important problems, from solving mathematical equations that are too difficult to analyse analytically, to creating simulation models of physical systems (or computer games), to selecting random samples from lists of people for inclusion in a survey. In practice, the numbers used are very often 'pseudo-random', that is, they are generated by a deterministic algorithm with a view to creating a list of numbers that have the characteristics (suitably defined) of random numbers. I generated 10,000 digits, 0–9, using the random number generator in Minitab and obtained the results shown in Table 3.4.1.
Digits              0      1      2      3      4      5      6      7      8      9   Total
Observed freq.   1036    960   1059    946   1029    995    974    959   1045    997  10000

Table 3.4.1: A set of pseudo-random digits

Perhaps the simplest characteristic we might expect in a set of random digits is that the ten values are equally likely to appear. Does the set of frequencies I have generated suggest that the random number generator is well-behaved in this regard?

Our null hypothesis is that the probability of occurrence is the same for each digit, π = 0.10. The alternative hypothesis is that not all probabilities are equal to 0.10. If the null hypothesis is true, then the expected frequency is 1000 [N x π = 10,000(0.10)] for each digit, as shown in Table 3.4.2. The chi-square test statistic is calculated as before:

X2 = Σ (Observed – Expected)2 / Expected = 14.55    (summed over the ten digits)

Digits              0      1      2      3      4      5      6      7      8      9   Totals
Observed freq.   1036    960   1059    946   1029    995    974    959   1045    997  10000
Expected freq.   1000   1000   1000   1000   1000   1000   1000   1000   1000   1000  10000
Chi-square      1.296  1.600  3.481  2.916  0.841  0.025  0.676  1.681  2.025  0.009  14.55

Table 3.4.2: A chi-square goodness-of-fit test for the random digits

The ten observed values and the ten expected values in Table 3.4.2 each add to 10,000 and, therefore, the set of differences [Oi – Ei] sums to zero. Accordingly, once nine expected values (or differences) are calculated, the last one is automatically determined. This means that, when the null hypothesis is true, the X2 statistic follows a chi-square distribution with 9 degrees of freedom. The critical value for a test having a significance level of 0.05 is Xc2 = 16.9. Since our test statistic of X2 = 14.55 is less than this, we cannot reject the null hypothesis that the digits are equally likely to occur. The area beyond X2 = 14.55 in the right-hand tail of a chi-square distribution with 9 degrees of freedom is 0.10; this is the p-value for our test.
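This goodness-of-fit calculation can be reproduced directly; a minimal sketch (scipy assumed):

```python
from scipy.stats import chisquare

# Observed digit frequencies from Table 3.4.1
observed = [1036, 960, 1059, 946, 1029, 995, 974, 959, 1045, 997]

# With no expected frequencies supplied, chisquare() tests against equal
# expected counts (here 1000 per digit), i.e. the null hypothesis above
stat, p_value = chisquare(observed)
print(stat, p_value)   # X2 = 14.55, p about 0.10, matching the text
```

The default degrees of freedom are the number of categories minus 1, which gives the 9 degrees of freedom used above.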
The fact that it exceeds 0.05 means that we cannot reject the null hypothesis, if we have previously decided on a significance level of 0.05. The observed frequency distribution is consistent with the pseudo-random number generator producing equal relative frequencies for the ten digits.

Example 14: Lotto Numbers

Table 3.4.3 shows information on the frequencies of occurrence of the different numbers in the Irish National Lottery (downloaded from the Lotto website on 30/7/2007). The mechanism used to select the numbers is physical rather than algorithmic (balls are picked from a large glass bubble), and it is obviously important that the numbers are equally likely to be picked. I decided to use only the first 32 (the original number of balls when the Lotto was established) as I was unclear on whether the numbers after 32 had been included to the same extent in all draws (clearly 43-45 had not).

 1=248    2=235    3=236    4=242    5=214    6=219    7=226    8=205    9=255
10=233   11=255   12=220   13=238   14=259   15=227   16=235   17=200   18=242
19=217   20=230   21=236   22=222   23=217   24=224   25=213   26=192   27=210
28=233   29=198   30=217   31=217   32=253   33=194   34=236   35=216   36=228
37=213   38=213   39=226   40=214   41=229   42=220   43=10    44=17    45=8

NOTE: The numbers 43, 44 and 45 were introduced on the 28th October, 2006

Table 3.4.3: Frequencies for the Lotto numbers

The total of the 32 observed frequencies is N=7268. On the assumption that all numbers are equally likely (the null hypothesis), the expected value for each of the 32 numbers is E = 7268/32 = 227.125. We calculate the X2 statistic as above and obtain X2 = 40.2. As before, the degrees of freedom for the sampling distribution of X2 (when the null hypothesis is true) will be the number of categories (K) minus 1, (K – 1) = 31. For a test with a significance level of 0.05, the critical value is Xc2 = 45.0; our test statistic is not statistically significant – the corresponding p-value is 0.12.
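The Lotto calculation follows the same pattern as the digit test; a sketch (scipy assumed, frequencies for numbers 1-32 from Table 3.4.3):

```python
from scipy.stats import chisquare

# Frequencies of Lotto numbers 1-32, from Table 3.4.3
observed = [248, 235, 236, 242, 214, 219, 226, 205, 255, 233,
            255, 220, 238, 259, 227, 235, 200, 242, 217, 230,
            236, 222, 217, 224, 213, 192, 210, 233, 198, 217,
            217, 253]

# Null hypothesis: all 32 numbers equally likely, so E = 7268/32 = 227.125
stat, p_value = chisquare(observed)
print(stat, p_value)   # X2 about 40.2 on 31 df, p about 0.12
```

This reproduces the X2 = 40.2 and p-value of 0.12 quoted above.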
We cannot reject the null hypothesis of equally likely outcomes – the test does not provide any evidence of bias in the selection of Lotto numbers.

Example 15: Mendelian Ratios

Samuels [1] reports data from a breeding study [18] in which white chickens with small combs were mated and produced 190 offspring, which were classified as shown in Table 3.4.4.

Feathers/Comb size   Observed Freq.
White, small                    111
White, large                     37
Dark, small                      34
Dark, large                       8
Total                           190

Table 3.4.4: Breeding study results

This table is a typical example of Mendelian genetics data, where there are two characteristics, here Feather Colour (white is dominant) and Comb Size (small is dominant), each determined by a single pair of genes; the members of the pair may appear in either of two versions. Dominant means that if even one copy of this version of the gene is present, the organism will have the appearance corresponding to the dominant gene. The parents in this study had one dominant and one recessive gene for each characteristic; according to Mendelian theory their offspring would be expected to be born in the ratios ¾ showing the appearance (phenotype) of the dominant characteristic and ¼ showing the recessive characteristic. How do these ratios arise? Mendelian theory suggests the following mechanism. The process begins with parents with pure genetic characteristics: both copies of the gene for the characteristic of interest are the same – either both dominant, D, or both recessive, r. When two such parents produce offspring, the genetic structure of the offspring is necessarily hybrid, D-r, as each parent contributes one copy of the gene. When two first-generation hybrid offspring breed, the second generation appear as shown in Figure 3.4.1.
Figure 3.4.1: Mendelian mechanism for second generation ratios

If the two characteristics are independently determined, then we expect the combined characteristics to appear with probabilities 9/16, 3/16, 3/16, 1/16, that is, in the ratios 9:3:3:1, as shown in Table 3.4.5 (see footnote 9).

              White: 3/4   Dark: 1/4
Small: 3/4       9/16         3/16
Large: 1/4       3/16         1/16

Table 3.4.5: The expected ratios, given independence

9 Recall, if two characteristics A and B are independent, then the probability that they both occur together is the product of their separate probabilities of occurring: P(A and B) = P(A).P(B). Hence, P(small and white) = P(small).P(white) = 3/4 x 3/4 = 9/16.

Our null hypothesis is that the Mendelian ratios 9:3:3:1 describe the underlying frequency structure, i.e., the two characteristics are independently determined. A statistical test essentially asks if the observed frequencies differ from those predicted by the probabilities 9/16:3/16:3/16:1/16 by more than would be expected from chance biological variation (the expected frequencies are 190 times the corresponding probabilities). Table 3.4.6 shows the observed and expected values and the contributions to the overall chi-square test statistic X²=1.55 for the four categories. The critical value for a test with a significance level of 0.05 is X²c=7.8, for a chi-square distribution with 3 degrees of freedom [DF = No. of categories – 1]. The test is non-significant; the corresponding p-value is 0.67. The empirical data are consistent with the classic Mendelian theory.

Feathers/Comb size   Ratios   Obs. Freq.   Expect. Freq.   Chi-square
White, small          9/16       111          106.875         0.159
White, large          3/16        37           35.625         0.053
Dark, small           3/16        34           35.625         0.074
Dark, large           1/16         8           11.875         1.264
Total                 1.00       190          190.000         1.551

Table 3.4.6: A chi-square analysis of the chicken breeding data

Example 16: Poisson Distribution – α-particles

The Poisson distribution is a probability model for rare events, where the rate of occurrence (λ) is constant in either space or time and events occur independently. For example, suppose a particular complication in the delivery of babies occurs at a rate of λ=3 per month in a large maternity hospital. We can calculate the probabilities for given numbers of such events (x) in any one month (see footnote 10) using the formula:

P(x) = λ^x e^(–λ)/x!   where λ=3 and x = 0, 1, 2, 3, 4, …

thus

P(0) = e^(–3) = 0.0498
P(1) = 3e^(–3) = 0.1494
P(2) = 3²e^(–3)/2! = 0.2240
P(3) = 3³e^(–3)/3! = 0.2240
P(4) = 3⁴e^(–3)/4! = 0.1680

10 Note that the requirement that events occur independently means that the model could be applied to numbers of deliveries, but not to numbers of births, if twin births were a possibility for the event being modelled.

The value of such calculations lies in the fact that if special teams/equipment are required for these deliveries, then we can evaluate the probabilities for different numbers of events and plan resources (contingency plans) accordingly. The Poisson model is a fundamental tool in analysing the likelihood of events such as the occurrence of equipment failure, overloads on computer systems (e.g., internet switching devices), blackouts on power transmission networks, etc. Simulation of such systems will require generation of Poisson random numbers (or of a closely related distribution). Accordingly, we might want to test random numbers to determine if they behave as if they come from a Poisson distribution, rather than being simply equally likely, as in Example 13. The example I have chosen to illustrate testing such a distribution refers to a physics experiment.
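As a minimal sketch (plain Python, not part of the manual's Minitab-based workflow), the monthly probabilities P(0) to P(4) follow directly from the Poisson formula:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    # P(x) = lam^x * e^(-lam) / x!
    return lam ** x * exp(-lam) / factorial(x)

# Delivery complications at a rate of lam = 3 per month, as in the example.
probs = [poisson_pmf(x, 3) for x in range(5)]
print([round(p, 4) for p in probs])   # [0.0498, 0.1494, 0.224, 0.224, 0.168]
```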
Rice [10] reports an analysis by Berkson [19] of radioactivity data observed at the American Bureau of Standards. The data are summarised in Table 3.4.7: the observed frequencies refer to the number of 10-second intervals in which different numbers (x) of α-particles were emitted from an Americium-241 source. The total number of particles observed divided by the total observation time gave 0.8392 particles per second, which corresponds to λ=8.392 particles per 10-second interval. As this value is calculated from the data, it is only an estimate of the 'true' underlying rate (this is relevant to the analysis).

The question of interest is whether or not the data can be modelled by a Poisson distribution. If the answer is yes, it suggests that the individual emissions are independent of each other (independence is a key assumption underlying the Poisson model) – which carries implications for the underlying physical processes.

To calculate the set of expected frequencies (see footnote 11), we first use the Poisson model with parameter λ=8.392 to get a set of probabilities P(x) for different numbers of particles (x) in any one interval, and then multiply these by N=1207, the total number of intervals, to get the expected number of intervals with x particles [Expected value, E(x) = NP(x)]. For example:

P(x) = λ^x e^(–λ)/x!   x = 0, 1, 2, 3, 4, …

P(3) = 8.392³ e^(–8.392)/3! = 0.022328

and the expected number of intervals with three particles is:

E[No. of intervals with 3 particles] = NP(3) = 1207 × 0.022328 = 26.95

as shown in Table 3.4.8.

11 In calculating expected frequencies it is recommended that the cells should have an expectation of at least 1 (some say 5) to ensure that the approximation to the chi-square distribution is good; cells should be combined to ensure that this happens. This is the reason the first row of the table accumulates x=0, 1, 2; otherwise the expected values corresponding to x=0 and 1 would both be less than 5.
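The expected-frequency calculation for x=3 can be checked with a few lines of Python (a sketch; `poisson_pmf` is simply the formula above written out, not a library routine):

```python
from math import exp, factorial

lam = 8.392   # estimated rate per 10-second interval
N = 1207      # total number of intervals observed

def poisson_pmf(x, lam):
    # P(x) = lam^x * e^(-lam) / x!
    return lam ** x * exp(-lam) / factorial(x)

p3 = poisson_pmf(3, lam)                # probability of exactly 3 particles
print(round(p3, 5), round(N * p3, 2))   # 0.02233 26.95
```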
The probability for the first row of the table is the sum P(0)+P(1)+P(2) = P(x<3) = 0.010111. Multiplying this by 1207, we get the expected frequency of 12.20 shown in the first row of Table 3.4.8. Since there is no upper bound on the number of particles in a time interval, we calculate P(x>16) = 1 – [P(0)+P(1)+P(2)+…+P(16)] = 0.00587, which, when multiplied by 1207, gives the last expected frequency of 7.09 in Table 3.4.8.

No. particles (x)   Frequency
0-2                     18
3                       28
4                       56
5                      105
6                      126
7                      146
8                      164
9                      161
10                     123
11                     101
12                      74
13                      53
14                      23
15                      15
16                       9
17+                      5
Total                 1207

Table 3.4.7: Numbers of intervals in which (x) α-particles were observed

The last column of Table 3.4.8 is calculated using:

(Oi – Ei)²/Ei

where (i) labels the rows.

No. particles (x)   Observed Freq.   Expected Freq.   Chi-square
0-2                       18              12.20           2.75
3                         28              26.95           0.04
4                         56              56.54           0.01
5                        105              94.90           1.08
6                        126             132.73           0.34
7                        146             159.12           1.08
8                        164             166.92           0.05
9                        161             155.65           0.18
10                       123             130.62           0.44
11                       101              99.65           0.02
12                        74              69.69           0.27
13                        53              44.99           1.43
14                        23              26.97           0.58
15                        15              15.09           0.00
16                         9               7.91           0.15
17+                        5               7.09           0.61
Total                   1207            1207              9.04

Table 3.4.8: A chi-square analysis of the α-particle data

The overall X² statistic is 9.04, as shown in Table 3.4.8. To calculate the degrees of freedom, we subtract 1 from the number of categories, K, as before; we also subtract 1 degree of freedom for every parameter estimated from the data (see footnote 12) – here we estimated one parameter, λ, so we have:

Degrees of Freedom = No. of categories – 1 – No. of parameters estimated = K – 1 – 1 = 16 – 1 – 1 = 14

12 If we were fitting a Normal distribution to a set of data, we would need to estimate the mean, μ, and the standard deviation, σ, and so we would lose 2 degrees of freedom for parameter estimation (3 in total).
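The whole goodness-of-fit computation in Tables 3.4.7-3.4.8 can be sketched in Python (illustrative only; the pooling of 0-2 and the 17+ tail follows footnote 11, and the critical value would still come from chi-square tables):

```python
from math import exp, factorial

lam, N = 8.392, 1207
obs = [18, 28, 56, 105, 126, 146, 164, 161, 123, 101, 74, 53, 23, 15, 9, 5]

def pmf(x):
    # Poisson probability P(x) = lam^x * e^(-lam) / x!
    return lam ** x * exp(-lam) / factorial(x)

# Cell probabilities: {0,1,2} pooled, x = 3..16 singly, then the 17+ tail.
probs = [pmf(0) + pmf(1) + pmf(2)] + [pmf(x) for x in range(3, 17)]
probs.append(1 - sum(probs))              # P(x >= 17)
expected = [N * p for p in probs]

x2 = sum((o - e) ** 2 / e for o, e in zip(obs, expected))
df = len(obs) - 1 - 1                     # K - 1, less 1 for estimating lam
print(round(x2, 2), df)
```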
The critical value is X²c=23.7 for a test with a significance level of 0.05; our test statistic of 9.04 is very much less than this (the corresponding p-value is 0.83), so we cannot reject the null hypothesis – a Poisson model fits the data. This suggests that the α-particles are emitted independently of each other.

Exercise

3.4.1. Example 12 analysed 529 pea plants cross-classified into a two-way table according to the shape [round, round and wrinkled, wrinkled] and colour [yellow, yellow and green, green] of their seeds. The question of interest was whether the row and column classifications were independent. The data are reproduced below.

                    Shape
Colour     Round   R+W   Wrinkled   Totals
Yellow       38     60      28        126
Y+G          65    138      68        271
Green        35     67      30        132
Totals      138    265     126        529

Table 3.4.9: Mendel's pea plant data

Mendelian theory predicts that the frequencies for the three colours (Y: Y+G: G) will appear in the ratios 1:2:1 [or equivalently, the probability distribution is: ¼, ½, ¼]. Similarly, the shapes (R: R+W: W) are also predicted to appear in the same ratios. The mechanism is illustrated in Figure 3.4.1, page 52. Carry out a goodness-of-fit test to check that the observed data depart from the 1:2:1 ratios by no more than would be expected from chance variation. Do this for both the Colour and Shape distributions.

References

[1] Samuels, M.L., Statistics for the Life Sciences, Dellen Publishing Company, 1989.
[2] Brailovsky, D., Timolol maleate (MK-950): A new beta-blocking agent for prophylactic management of angina pectoris. A multicenter, multinational, cooperative trial. In Beta-Adrenergic Blocking Agents in the Management of Hypertension and Angina Pectoris, B. Magnani (ed.), Raven Press, 1974.
[3] Machin, D., Campbell, M.J., and Walters, S.J., Medical Statistics, Wiley, 2007.
[4] Campbell et al., A 24 year cohort study of mortality in slate workers in North Wales, Journal of Occupational Medicine, 55, 448-453, 2005.
[5] Breslow, N.E. and Day, N.E., Statistical Methods in Cancer Research: Vol. 1 – The Analysis of Case-Control Studies, International Agency for Research on Cancer, 1980.
[6] Tuyns, A.J., Pequignot, G. and Jensen, O.M., Le cancer de l'oesophage en Ille-et-Vilaine en fonction des niveaux de consommation d'alcool et de tabac, Bull. Cancer, 64, 45-60, 1977.
[7] Bland, M., An Introduction to Medical Statistics, Oxford University Press, 2000.
[8] Hauer, S., Ansorge, H. and Zinke, O., A long-term analysis of the age structure of otters (Lutra lutra) from eastern Germany, International Journal of Mammalian Biology, 65, 360-368, 2000.
[9] Bickel, P.J., Hammel, E.A. and O'Connell, J.W., Sex Bias in Graduate Admissions: Data from Berkeley, Science, Vol. 187, 398-404, 1975.
[10] Rice, J.A., Mathematical Statistics and Data Analysis, Duxbury Press, 1995.
[11] Smith, D.G., Prentice, R., Thompson, D.J., and Hermann, W.L., Association of exogenous estrogen and endometrial carcinoma, N. Eng. J. Med., 293, 1164-1167, 1975.
[12] Altman, D.G., Practical Statistics for Medical Research, Chapman and Hall, 1991.
[13] Stuart, M., Changing the teaching of statistics, The Statistician, Vol. 44, No. 1, 45-54, 1995.
[14] Stuart, M., An Introduction to Statistical Analysis for Business and Industry, Hodder, 2003.
[15] Ehrenberg, A.S.C., Data Reduction, Wiley, 1975.
[16] Morton, A.Q., Literary Detection, Scribner's, 1978.
[17] Box, G.E.P., Hunter, S., and Hunter, W.G., Statistics for Experimenters, Wiley, 1978 (note: a second edition appeared in 2005).
[18] Bateson, W. and Saunders, E.R., Reports for the Evolution Committee of the Royal Society, 1, 1-160, 1902.
[19] Berkson, J., Examination of the randomness of alpha particle emissions, in Research Papers in Statistics, F.N. David (ed.), Wiley, 1966.
Appendix 3.1

Let X and Y be two random quantities, each with its own long-run mean (expectation) and standard deviation. The square of the standard deviation is called the variance. The variance is of interest because when we combine random quantities the variances (but not the standard deviations) are additive, as discussed in Chapter 1, Section 1.4. Thus, it can be shown that if X and Y are independent (the value of X does not affect the probable value of Y) then:

VAR(X + Y) = VAR(X) + VAR(Y)   and   VAR(X – Y) = VAR(X) + VAR(Y)

This latter result is used in finding the standard error of (p1 – p2). We have seen that:

SE(p) = √[π(1 – π)/n]   and, hence,   VAR(p) = π(1 – π)/n

where p is a sample proportion, n the sample size, and π is the corresponding population proportion. It follows that:

VAR(p1 – p2) = π1(1 – π1)/n1 + π2(1 – π2)/n2

which gives

SE(p1 – p2) = √[π1(1 – π1)/n1 + π2(1 – π2)/n2].

It is intuitively obvious (and can be shown mathematically) that the long-run mean of (p1 – p2) is (π1 – π2). These results give us the sampling distribution of (p1 – p2) as shown in Figure 3.2.1, page 9.

The Special Case of two proportions that sum to 1

If (p1 + p2) = 1 then (p1 – p2) = [p1 – (1 – p1)] = (2p1 – 1). It follows that:

SE(p1 – p2) = SE(2p1 – 1) = SE(2p1) = 2SE(p1)

since '1' is a constant. Therefore,

SE(p1 – p2) = 2SE(p1) = 2√[π1(1 – π1)/n]

which is the result used in our analysis of the French presidential poll in Example 8, page 26.

Text: © Eamonn Mullins, 2013, Course Manual: PG Diploma in Statistics, TCD, Base Module. Data: see references for copyright owners.

Outline Solutions

3.1.1: Referendum poll

A 95% confidence interval for the population proportion who intended to vote yes in the referendum is given by:

p ± 1.96√[p(1 – p)/n] = 0.35 ± 1.96√[0.35(1 – 0.35)/1000] = 0.35 ± 0.03

i.e., we estimate that between 32% and 38% of the population of voters intended to vote for the Lisbon treaty.
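A quick Python check on this interval (a sketch alongside the manual's Minitab output; 1.96 is the usual Normal-approximation multiplier):

```python
from math import sqrt

p, n = 0.35, 1000          # 350 'yes' answers out of 1000 voters sampled

se = sqrt(p * (1 - p) / n)             # standard error of the sample proportion
lo, hi = p - 1.96 * se, p + 1.96 * se  # 95% confidence limits
print(round(lo, 4), round(hi, 4))      # 0.3204 0.3796
```

The limits agree with the Minitab interval (0.320438, 0.379562) to rounding.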
A Minitab analysis is shown below.

Test and CI for One Proportion

Sample   Yes   N      Sample p   95% CI
1        350   1000   0.350000   (0.320438, 0.379562)

Using the normal approximation.

Table 3.1.1.1: Minitab analysis of referendum data

3.2.1: Referendum polls

First, we will carry out a Z-test of the equality of the population proportions. As for the Timolol study, we frame our question in terms of a null hypothesis which considers the long-run proportions as being the same and an alternative hypothesis which denies their equality:

Ho: π1 – π2 = 0
H1: π1 – π2 ≠ 0

where π1 is the proportion of the whole electorate who intended to vote "yes" in May, while π2 is the corresponding proportion in January. We calculate a Z statistic:

Z = [(p1 – p2) – 0] / √[π(1 – π)/n1 + π(1 – π)/n2]

– this will have a standard Normal distribution. Our estimate of the common π (when the null hypothesis is true) is p = 610/2000 = 0.305, giving:

Z = [(0.35 – 0.26) – 0] / √[0.305(1 – 0.305)/1000 + 0.305(1 – 0.305)/1000] = 4.37

When compared to the critical values of ±1.96 corresponding to a significance level of 0.05, the Z-statistic is highly statistically significant, leading us to conclude that the population proportions are not the same.

A 95% confidence interval for the difference between the population proportions is given by:

(p1 – p2) ± 1.96√[p1(1 – p1)/n1 + p2(1 – p2)/n2]

For the referendum poll results this gives:

(0.35 – 0.26) ± 1.96√[0.35(1 – 0.35)/1000 + 0.26(1 – 0.26)/1000] = 0.09 ± 0.04

We are 95% confident that the difference between the population proportions is somewhere between 5 and 13 percentage points. A Minitab analysis is shown below.
Test and CI for Two Proportions

Sample   'Yes'   N      Sample p
1        350     1000   0.350000
2        260     1000   0.260000

Difference = p(1) - p(2)
Estimate for difference: 0.09
95% CI for difference: (0.0498375, 0.130163)
Test for difference = 0 (vs not = 0): Z = 4.37  P-Value = 0.000

Table 3.2.1.1: Minitab comparison of survey results

3.2.2: Influenza vaccination study

This exercise requires the same analyses for the test and confidence interval as did Exercise 3.2.1 – the Minitab output given below shows the required results.

Test and CI for Two Proportions

Sample   Influenza   N     Sample p
1        80          220   0.363636
2        20          240   0.083333

Difference = p(1) - p(2)
Estimate for difference: 0.280303
95% CI for difference: (0.207754, 0.352852)
Test for difference = 0 (vs not = 0): Z = 7.28  P-Value = 0.000

Table 3.2.2.1: Minitab analysis of vaccine data

The influenza data are set out in tabular form in Table 3.2.2.2, to facilitate the calculation of the odds ratio.

           Influenza   No influenza   Totals               Inf.   No Inf.   Totals
Control        80           140         220      Control     a        b       a+b
Vaccine        20           220         240      Vaccine     c        d       c+d
Totals        100           360         460      Totals     a+c      b+d       n

Table 3.2.2.2: The influenza vaccine data

OR = (a/b) / (c/d) = ad/bc = (80/140) / (20/220) = 6.3

The study suggests that the odds of contracting influenza are more than six times higher for the control group. A one-sample confidence interval for the population incidence of influenza, if the entire population (of adults) had been vaccinated, is shown in the Minitab output below. Between 5 and 12% of adults would be expected to contract influenza, even though they had been vaccinated.

Test and CI for One Proportion

Sample   Influenza   N     Sample p   95% CI
1        20          240   0.083333   (0.048366, 0.118300)

Using the normal approximation.
Table 3.2.2.3: Estimating the incidence of influenza in the population

3.2.3: Childhood respiratory disease study

The statistical test and confidence interval for comparing the sample proportions are presented below and show a highly statistically significant difference.

Test and CI for Two Proportions

Sample   Cough   N      Sample p
1        26      273    0.095238
2        44      1046   0.042065

Difference = p(1) - p(2)
Estimate for difference: 0.0531731
95% CI for difference: (0.0162884, 0.0900577)
Test for difference = 0 (vs not = 0): Z = 3.49  P-Value = 0.000

Table 3.2.3.1: Minitab analysis of bronchitis data

             Coughs   No coughs   Totals                  Coughs   No coughs   Totals
Bronchitis     26        247        273      Bronchitis      a         b        a+b
Control        44       1002       1046      Control         c         d        c+d
Totals         70       1249       1319      Totals         a+c       b+d        n

Table 3.2.3.2: The bronchitis data

OR = (a/b) / (c/d) = ad/bc = (26/247) / (44/1002) = 2.4

The study suggests that the odds of having a persistent cough in later life are more than twice as high for the children who suffered from bronchitis in infancy as they are for the control group.

3.2.4: Otter deaths in eastern Germany

For the otter data a natural practical question is 'by how much does the long-run proportion of young male deaths exceed that for young females?' In addressing this question we compare the sample proportions; in doing so, we regard the column totals as fixed in making our comparison.

Test and CI for Two Proportions

Sample    Young   N     Sample p
Males     286     598   0.478261
Females   153     425   0.360000

Difference = p(1) - p(2)
Estimate for difference: 0.118261
95% CI for difference: (0.0575530, 0.178969)
Test for difference = 0 (vs not = 0): Z = 3.77  P-Value = 0.000

Table 3.2.4.1: Minitab comparison of sample proportions for otters

The confidence interval suggests that the long-run percentage of young male deaths is between 6 and 18 percentage points higher than the corresponding percentage for female otters.
The odds-ratio is 1.63 – the odds of a male being young when killed are about 60% higher than those for a female. The original paper suggests reasons for such a difference. These include the larger territorial size and higher activity levels, particularly of younger males searching for a free territory, leading to increased exposure to road traffic (69% of the deaths were attributed to road kills).

3.2.5: Case-control contraception study

Table 3.2.5.1 shows the 155 pairs cross-classified by exposure to oral contraceptives. The question of interest is whether or not there is an association between exposure to oral contraception and the occurrence of a stroke.

                     Controls
Cases         Exposed   Not-expos.   Totals                Exposed   Not-expos.   Totals
Exposed          5          30          35      Exposed       a           b        a+b
Not-expos.      13         107         120      Not-expos.    c           d        c+d
Totals          18         137         155      Totals       a+c         b+d        n

Table 3.2.5.1: A matched case-control study of strokes and oral contraceptive use

To test the null hypothesis that the proportion of women (in the relevant population) who used oral contraceptives was the same for the population of women who experienced strokes [cases: p1=(a+b)/n] as for the population who did not [controls: p2=(a+c)/n], we calculate a Z-statistic:

Z = [(p1 – p2) – 0] / SE(p1 – p2) = (b – c)/√(b + c) = (30 – 13)/√(30 + 13) = 2.59

When compared to critical values of ±1.96 given by the conventional significance level of 0.05, the Z-value is clearly statistically significant: cases – those who had strokes – were more likely to have been exposed to oral contraception.

A confidence interval for the increased probability of exposure to oral contraception for cases is given by:

(p1 – p2) ± 1.96 (1/n)√[(b + c) – (b – c)²/n]

where p1 = 35/155 = 0.226 and p2 = 18/155 = 0.116, giving (p1 – p2) = 0.110.

The calculated confidence interval is 0.110 ± 0.085, or [0.025 to 0.195]; that is, a difference of between 3 and 20 percentage points.
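Because the pairs are matched, only the discordant counts b and c enter the test. A sketch of the calculation (plain Python, using the cell labels of Table 3.2.5.1):

```python
from math import sqrt

# Discordant pairs: b = case exposed / control not exposed,
#                   c = case not exposed / control exposed.
b, c, n = 30, 13, 155

z = (b - c) / sqrt(b + c)            # Z = (p1 - p2) / SE(p1 - p2)
diff = (b - c) / n                   # p1 - p2 = (a+b)/n - (a+c)/n
print(round(z, 2), round(diff, 3))   # 2.59 0.11
```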
Note that this is an old study and the formulations of modern contraceptives are quite different, so the results are of purely historical interest.

3.2.6: Self-paired allergy study

Table 3.2.6.1 shows the 173 self-pairs classified by response to the two chemicals. The question of interest is whether or not there is a differential response to the two chemicals.

              Croton Oil
DNCB          +      –     Totals                +      –     Totals
+            81     48      129           +      a      b      a+b
–            23     21       44           –      c      d      c+d
Totals      104     69      173         Totals  a+c    b+d      n

Table 3.2.6.1: 173 patients classified according to response to two chemicals

These data may be analysed in exactly the same way as the strokes/contraceptives data of Exercise 3.2.5. To test the null hypothesis that the probability of a negative response to DNCB was the same as that for Croton Oil, we can calculate a Z-statistic:

Z = [(p1 – p2) – 0] / SE(p1 – p2) = (b – c)/√(b + c) = (48 – 23)/√(48 + 23) = 2.97

When compared to critical values of ±1.96 given by the conventional significance level of 0.05, the Z-value is clearly statistically significant – patients (in the relevant population) were more likely to react negatively to DNCB than they were to Croton Oil.

A confidence interval for the increased probability of reacting negatively to DNCB is given by:

(p1 – p2) ± 1.96 (1/n)√[(b + c) – (b – c)²/n]

where p1 = 69/173 = 0.399 and p2 = 44/173 = 0.254, giving (p1 – p2) = 0.145.

The calculated confidence interval is 0.145 ± 0.098, which ranges from 0.05 to 0.24; that is, a difference of between 5 and 24 percentage points.

3.2.7: Otter deaths in eastern Germany

A 95% confidence interval for the difference between the long-run proportions of such deaths for males (π1) and females (π2) is:

(p1 – p2) ± 1.96(2)√[p1(1 – p1)/n] = (0.585 – 0.415) ± 3.92√[0.585(0.415)/1023] = 0.17 ± 0.06

This suggests that there is an excess of such violent deaths among males, of between 11 and 23 percentage points, over the percentage for female otters.

3.3.1: Slate workers study
A Minitab chi-square analysis of the slate workers' study is shown below. The X² value of 9.328 is simply the square of the Z-value shown on page 15 (apart from rounding). The conclusions from the chi-square analysis are the same, therefore, as those from the Z-test for comparing two proportions, as discussed on pages 14-15.

Chi-Square Test: C1, C2

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                  Died    Surv.   Total
Slate workers      379      347     726
                352.30   373.70
                 2.024    1.908

Controls           230      299     529
                256.70   272.30
                 2.778    2.618

Total              609      646    1255

Chi-Sq = 9.328, DF = 1, P-Value = 0.002

Table 3.3.1.1: Chi-square analysis of the slate workers' data

3.3.2: Oesophageal cancer case-control study

A chi-square analysis of the oesophageal cancer case-control data is shown in Table 3.3.2.1, below – again there is a one-to-one correspondence with the Z-test of Table 3.2.6 (page 16).

Chi-Square Test: C3, C4

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

              80+      <80    Total
Cases          96      104      200
            42.05   157.95
           69.212   18.427

Controls      109      666      775
           162.95   612.05
           17.861    4.755

Total         205      770      975

Chi-Sq = 110.255, DF = 1, P-Value = 0.000

Table 3.3.2.1: Chi-square analysis of the oesophageal cancer data

3.3.3: Cosmic Rays

It is not obvious to me whether the physicists would ask if the energy distributions are different for the two particle types or if the charge distributions are different at different energy ranges. In analysing these data, therefore, I would frame the null hypothesis as a test of independence – are the energy and charge classifications independent? I would treat the two classifications as if they have equal status, in other words. A chi-square test on the data gives a test statistic of X²=14.01, as shown in Table 3.3.3.1.
This should be compared to a chi-square distribution with (3–1)(2–1)=2 degrees of freedom; for a significance level of α=0.05, the critical value is 5.99 and the null hypothesis of energy-charge independence is rejected.

Chi-Square Test: Positive, Negative

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

Energy Interval   Positive   Negative   Total
50-100                   9         20      29
                      7.37      21.63
                     0.363      0.124

101-300                 32         51      83
                     21.08      61.92
                     5.658      1.926

301-1000                23        117     140
                     35.56     104.44
                     4.434      1.509

Total                   64        188     252

Chi-Sq = 14.013, DF = 2, P-Value = 0.001

Table 3.3.3.1: Chi-square analysis of the cosmic ray data

Suppose we decide to view the data in terms of the energy distributions for the two particle types; positive electrons are called 'positrons' (as in Positron Emission Tomography (PET) scanners). Table 3.3.3.2 shows the energy distributions expressed as percentages, for the two particle types.

                        Electron Charge
Energy Interval (MeV)   Positive   Negative
50-100                      14         11
101-300                     50         27
301-1000                    36         62
                           100        100

Table 3.3.3.2: The energy distributions (percentage)

It is obvious that there is a much higher percentage of highest-energy particles in the negative charge group; there is a difference of 26 percentage points, i.e., 62 versus 36%. The 95% confidence interval for the difference is from 13 to 40 percentage points, as shown in Table 3.3.3.3.

Test and CI for Two Proportions

Sample   301+   N     Sample p
e-       117    188   0.622340
e+       23     64    0.359375

Difference = p(1) - p(2)
Estimate for difference: 0.262965
95% CI for difference: (0.126506, 0.399425)
Test for difference = 0 (vs not = 0): Z = 3.66  P-Value = 0.000

Table 3.3.3.3: Comparing proportions in the highest energy interval

Presumably, differences in the energy distributions would carry implications for the mechanisms generating the particles.
Thus, the non-symmetrical energy distributions observed here may rule in or rule out certain sources of these particles.

Faulty data collection

If the energy measurements cannot be trusted to differentiate between the two lower energy intervals, then it makes sense to combine them into a single 50-300 MeV band; the revised distributions are shown in Table 3.3.3.4.

                        Electron Charge
Energy Interval (MeV)   Positive   Negative
50-300                      41         71
301-1000                    23        117
Totals                      64        188

Table 3.3.3.4: Revised energy distributions

A chi-square analysis of this 2x2 table (shown in Table 3.3.3.5) is effectively equivalent to the comparison of two proportions in Table 3.3.3.3, since we focused there on the proportions in the highest energy interval, which remains unchanged. However, the null hypothesis underlying the Z-test for proportions in Table 3.3.3.3 is a 'homogeneity of proportions' hypothesis (i.e., the same set of long-run relative frequencies (π) applies to both types of electron), while that underlying Table 3.3.3.5 could be framed, alternatively, as an energy level-charge independence hypothesis. As we have seen, these are essentially the same hypothesis and the tests are equivalent: (Z = 3.66)² = 13.4 ≈ X² = 13.37, apart from rounding.

Chi-Square Test: Positron, Electron

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

Energy Interval   Positron   Electron   Total
Low                    41         71      112
                    28.44      83.56
                    5.542      1.887

High                   23        117      140
                    35.56     104.44
                    4.434      1.509

Total                  64        188      252

Chi-Sq = 13.372, DF = 1, P-Value = 0.000

Table 3.3.3.5: A chi-square analysis of the revised data

The overall conclusion of our analysis is that energy level and particle charge are not independent classifications and that there is a disproportionately higher fraction of (negatively charged) electrons in the highest energy band.
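The 2x2 calculation in Table 3.3.3.5 can be sketched directly (plain Python; as in the Minitab output, each expected count is row total × column total / N):

```python
# Revised cosmic-ray table (Table 3.3.3.4): rows are energy bands,
# columns are particle charge (positive, negative).
obs = [[41, 71],
       [23, 117]]

row_totals = [sum(row) for row in obs]
col_totals = [sum(col) for col in zip(*obs)]
n = sum(row_totals)

x2 = 0.0
for i, row in enumerate(obs):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected count under independence
        x2 += (o - e) ** 2 / e

print(round(x2, 2))   # 13.37
```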
3.3.4: Hospitals

We begin our analysis by excluding the referral hospital, E, and by examining the other four hospitals A-D to determine if their post-operative patient status distributions can be considered as being generated by the same set of underlying relative frequencies.

Chi-Square Test: A, B, C, D

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

               A        B        C        D    Total
No-imp        13        5        8       21       47
            7.75     5.11    13.03    21.11
           3.555    0.002    1.941    0.001

Partial       18       10       36       56      120
           19.79    13.05    33.26    53.89
           0.162    0.714    0.225    0.082

Complete      16       16       35       51      118
           19.46    12.84    32.71    53.00
           0.615    0.780    0.160    0.075

Total         47       31       79      128      285

Chi-Sq = 8.313, DF = 6, P-Value = 0.216

Table 3.3.4.1: Chi-square comparison of hospitals A-D

Table 3.3.4.1 shows a chi-square analysis of the 3x4 table. The X²=8.3 test statistic is non-significant when compared to the α=0.05 critical value of X²c=12.6 of the chi-square distribution with (3–1)(4–1)=6 degrees of freedom. The large p-value of 0.216 indicates that the test statistic would be expected to be bigger than 8.3 more than 20% of the time, even if the null hypothesis of equal underlying frequency distributions were true. Accordingly, we cannot reject the null hypothesis of equal underlying post-operative result performances for the four hospitals. It makes sense, therefore, to combine the data into one 'general hospital' distribution.

Table 3.3.4.2 shows the (rounded) percentages that fall into the three outcome categories for the general and referral hospitals.

Hospital   No improvement   Partial funct. rest.   Complete funct. rest.
General          16                  42                     41
Referral         52                  35                     12

Table 3.3.4.2: Percentages (rounded) in the three outcome categories

There are quite different response patterns, as might be expected. The referral hospital has a much lower 'complete recovery' rate
(12% versus 41%) and a much higher 'no improvement' rate (52% versus 16%). Table 3.3.4.3 is a chi-square analysis of this table.

Chi-Square Test: General, Referral

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

            General   Referral   Total
No-imp           47         43      90
              69.89      20.11
              7.497     26.058

Partial         120         29     149
             115.71      33.29
              0.159      0.553

Complete        118         10     128
              99.40      28.60
              3.480     12.096

Total           285         82     367

Chi-Sq = 49.844, DF = 2, P-Value = 0.000

Table 3.3.4.3: Chi-square comparison of general and referral hospitals

The chi-square analysis (highly significant test statistic) provides statistical justification for interpreting the two different response distributions of Table 3.3.4.2: the chi-square result tells us that the differences are unlikely to be due to chance fluctuations.

Figure 3.3.4.1: Response 'profiles' for hospital types

Figure 3.3.4.1 shows one possible graphical representation of the results. It illustrates the difference in the 'response profiles' of the two types of hospital. The profiles do not have any intrinsic meaning, of course, as the horizontal axis is not a continuous scale. The profiles do, however, illustrate the fundamentally different response patterns and would (I think!) help a report reader who might be overwhelmed by the detail shown in Tables 3.3.4.1 and 3.3.4.3, in particular, to interpret the results of the study.

3.3.5: Otter deaths in eastern Germany

We address the question: 'is the row classification independent of the column classification?' In the current case this would mean that the probability that an otter is young when it is killed does not depend on its sex. If an otter is selected at random from the total sample of N=1023, then the probability that it will be male is P[male]=598/1023 and the probability that it will be young is P[young]=439/1023.
If two characteristics are independent of each other (in the sense that the occurrence of one does not affect the probable occurrence of the other) then the probability that they both occur simultaneously is given by the product of their separate probabilities. So, if independence holds:

P[young and male] = P[young] x P[male]

These sample probabilities are, of course, only estimates of corresponding population values. Under independence the estimate of the expected number of young males is given by:

N x P[young and male] = N x P[young] x P[male]
                      = 1023 x (439/1023) x (598/1023)
                      = (439 x 598)/1023
                      = 256.6

The estimated expected values and the contributions to the X2 statistic for all cells are shown in the Minitab analysis in Table 3.3.5.1. The sum of all the cell contributions gives an overall test statistic X2 = 14.183, which is highly statistically significant when compared to a critical value of 3.84 (again ‘degrees of freedom’ = 1). The test statistic and the corresponding p-value (p < 0.0005) suggest that the two classifications are not independent of each other: the observed frequencies deviate too much from those calculated on the independence assumption.
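The expected count calculation and the overall test can be checked with a few lines of code. A sketch, assuming Python/scipy (note that scipy applies Yates' continuity correction to 2x2 tables by default, so `correction=False` is needed to match the plain Pearson statistic reported by Minitab):

```python
from scipy.stats import chi2_contingency

N = 1023
# Observed otter counts (rows: young, adult; columns: male, female).
observed = [[286, 153],
            [312, 272]]

# Expected count for young males under independence:
# N * P[young] * P[male] = (row total * column total) / N.
expected_young_male = 439 * 598 / N
print(round(expected_young_male, 2))   # 256.62

# correction=False gives the uncorrected Pearson X2 of the text.
stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(stat, 3), dof)             # ≈ 14.183 on 1 df
```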
Chi-Square Test: Male, Female

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

             Male    Female    Total
Young         286       153      439
           256.62    182.38
            3.364     4.733

Adult         312       272      584
           341.38    242.62
            2.529     3.558

Total         598       425     1023

Chi-Sq = 14.183, DF = 1, P-Value = 0.000

Table 3.3.5.1: Minitab chi-square analysis of the otter data

Chi-square tests where the null hypothesis is expressed in terms of the equality of two or more proportions (as in the Timolol example) are often referred to as ‘tests for homogeneity of proportions’, while those for which the null hypothesis is posed in the manner just discussed are called ‘tests of independence’ or ‘tests of association’, i.e., they address the question: are frequencies for one classification ‘independent of’ or ‘associated with’ those of the other?

3.4.1: Mendel’s pea plant data

The observed colour distribution for the seeds is compared to the theoretical ratios of 1:2:1 in the Minitab output below. This gives a chi-square value of 0.46, with a corresponding p-value of 0.796. There is no reason to doubt the applicability of the Mendelian theory.

Chi-Square Goodness-of-Fit Test for Observed Counts in Variable: C5

                  Test                        Contribution
Category         Propn.  Observed  Expected    to Chi-Sq
Yellow             0.25       126    132.25     0.295369
Yellow+Green       0.50       271    264.50     0.159735
Green              0.25       132    132.25     0.000473

N = 529, DF = 2, Chi-Sq = 0.455577, P-Value = 0.796

Table 3.4.1.1: Minitab Goodness of Fit analysis of seed colours

Similarly, the observed shape distribution for the seeds is compared to the theoretical ratios of 1:2:1 in the Minitab output in Table 3.4.1.2. This gives a chi-square value of 0.55, with a corresponding p-value of 0.76. There is no reason to doubt the applicability of the Mendelian theory with respect to the shape distribution either.
Chi-Square Goodness-of-Fit Test for Observed Counts in Variable: C7

                   Test                        Contribution
Category          Propn.  Observed  Expected    to Chi-Sq
Round               0.25       138    132.25     0.250000
Round+Wrinkled      0.50       265    264.50     0.000945
Wrinkled            0.25       126    132.25     0.295369

N = 529, DF = 2, Chi-Sq = 0.546314, P-Value = 0.761

Table 3.4.1.2: Minitab Goodness of Fit analysis of seed shapes

© Text, Eamonn Mullins, 2013; data, see references.
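Both goodness-of-fit tests above can be reproduced from the observed counts and the 1:2:1 Mendelian proportions. A minimal sketch, assuming Python/scipy rather than Minitab:

```python
from scipy.stats import chisquare

n = 529
props = [0.25, 0.50, 0.25]          # Mendelian 1:2:1 ratios
expected = [p * n for p in props]   # 132.25, 264.50, 132.25

# Seed colours (yellow, yellow+green, green), Table 3.4.1.1.
stat_c, p_c = chisquare([126, 271, 132], f_exp=expected)
print(round(stat_c, 3), round(p_c, 3))   # 0.456 0.796

# Seed shapes (round, round+wrinkled, wrinkled), Table 3.4.1.2.
stat_s, p_s = chisquare([138, 265, 126], f_exp=expected)
print(round(stat_s, 3), round(p_s, 3))   # 0.546 0.761
```

Note that `chisquare` requires the expected counts to sum to the same total as the observed counts (529 here), and uses k − 1 = 2 degrees of freedom by default.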