Chapter 3: The analysis of count data

3.0 Introduction

In this chapter we turn our attention to situations where our data come in the form of counts, rather than continuous measurements. Perhaps the simplest example, to be discussed in Section 3.1, is a political poll, in which n voters are randomly selected from a population of voters and asked a question to which they must answer either yes or no. What is of interest is the proportion of voters in the population who would say yes to the question. The analysis will involve calculating a confidence interval for the population proportion; this is directly equivalent to calculating a confidence interval for a population mean based on a sample mean, as discussed in Chapter 2 (Section 2.1). In many comparative studies the outcome of interest is a count rather than a measured quantity; in Section 3.2 we will consider such studies. One example involves drug trials where two groups of patients, who are treated differently, are classified as improved/not improved, rather than measured on an improvement scale. In such cases the question of interest is whether there would be a difference in the population proportions that improve, if the treatments were applied to whole populations, rather than to relatively small samples. Again, this is directly equivalent to the simple comparative studies (involving two-sample and paired t-tests and the corresponding confidence intervals) of Chapter 2, where comparisons of population means were of interest. In Section 3.3 we generalise from a two-group comparison to comparing proportions arising in more than two groups (in Chapter 5, Analysis of Variance (ANOVA) will be used to generalise from comparing two sample means to comparing several means for continuous data).
We also consider situations where more than two outcomes are observed: for example, we examine the results of a market research study where respondents were classified as having bought a product, having heard of it but not bought it, or never having heard of it. The chi-square distribution is introduced for analysing such data. Section 3.4 is concerned with ‘goodness-of-fit’ tests: we ask if an observed set of counts corresponds to an a priori set of relative frequencies; for example, does the observed distribution of seed colours in a plant study correspond to what would be expected under a Mendelian inheritance scheme? To answer the question, we carry out a statistical test which compares the divergences between the observed counts and those that would be expected given the validity of the a priori relative frequencies. The test assesses whether or not the overall divergence would have been likely under chance variation. Again, the chi-square distribution will be used as the reference distribution for the divergence measure.

Course Manual: PG Diploma in Statistics, TCD, Base Module 1

3.1 Single proportions

We are often interested in the relative frequency of occurrence of particular characteristics among units sampled from a population or process, with a view to making inferences about the population or process characteristics. For example, political polls are carried out to estimate support for individual politicians or parties in a defined population of electors; market research studies are conducted to elicit information on purchasing behaviour in a defined population of consumers; quality control samples are taken to monitor conformance to specifications of industrial products. Here the practical question is how to estimate the relative frequency of a given characteristic in a population or process from that of the sample.
The Statistical Model

Suppose a random sample of n units is selected from a population or process in which the proportion of units that has a particular characteristic is π, and that r members of the sample have this characteristic. For example, in a political poll r=700 voters from a sample of n=1000 respond “yes” to a question; what does the sample proportion p=r/n tell us about the population proportion π who would say “yes”? It can be shown that under repeated sampling the sample proportion p=r/n behaves as if selected from a Normal distribution with mean π and standard error:

SE(p) = √(π(1−π)/n)

This means that repeated samples of size n will give sample proportions that vary around π as shown in Figure 3.1.1 below. Note that this result is approximate, and that the larger the sample size n is, the better the approximation becomes.

Figure 3.1.1: The distribution of sample proportions

The fact that p is approximately Normally distributed allows the use of the methods previously encountered for the analysis of measured variables, even though we are now dealing with counts, where only integer values are possible.

A confidence interval for a population proportion

The statistical question is how to estimate π, the population proportion, from the sample value, p. Recall that when dealing with measured quantities we used the fact that if some variable x is at least approximately Normally distributed (say, with mean μ and standard deviation σ) then the average of n randomly selected values, x̄, will also be Normally distributed, with the same mean, μ, and with standard error σ/√n. This is illustrated in Figure 3.1.2 below.

Figure 3.1.2: Distributions of individual measurements (a) and of the averages of n measurements (b).

This gave us a method for estimating the population or process mean μ, using the sample information.
A 95% confidence interval for μ was given by:

x̄ ± 1.96 σ/√n

Where σ was unknown it was replaced by the sample value s and, if the sample size was small, the standard Normal value of 1.96 was replaced by the corresponding Student’s t-value. The sample proportion p in the current case corresponds directly to x̄ above. Similarly, we estimate the standard error of p by replacing π by the sample estimate p in the formula for the standard error:

SÊ(p) = √(p(1−p)/n)

This allows us to calculate an approximate 95% confidence interval (1) for π as:

p ± 1.96 √(p(1−p)/n)

Note the structure of the confidence interval formula: the confidence interval is given by a point estimate of the parameter of interest, plus or minus the estimated standard error of the point estimate multiplied by a critical value. Exactly the same structure applies to all confidence intervals, irrespective of the parameter being estimated.

Example 1: French 2007 presidential election

On May 4th, 2007 the BBC website reported that an opinion poll (Ifop for the Le Monde newspaper on May 3rd) ‘put Mr. Sarkozy at 53%, with Ms. Royal trailing with 47%’ in the election campaign for the French presidency (voting took place on the following Sunday). The sample size was 858 people and all had watched the head-to-head television debate between the candidates. What can we infer about the voting intentions of the French electorate from these sample findings, if we can assume that this is a random sample of all French voters? A 95% confidence interval for the population proportion of voters who favoured Sarkozy is given by:

p ± 1.96 √(p(1−p)/n) = 0.530 ± 1.96 √(0.530(1−0.530)/858) = 0.530 ± 0.033

i.e., between 49.7% and 56.3% are estimated to favour Sarkozy. A confidence interval allows for the sampling variability involved in basing our conclusions about the population on the results of a random sample.
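The interval above is easy to reproduce and check. The short script below is a sketch (not part of the manual, which works with Minitab): it computes the Sarkozy interval and then uses a small simulation to confirm that the spread of repeated sample proportions is close to the theoretical standard error.

```python
import math
import random

# Sarkozy poll: n = 858 respondents, sample proportion p = 0.530
n, p = 858, 0.530

# Approximate 95% confidence interval: p +/- 1.96*sqrt(p(1-p)/n)
se = math.sqrt(p * (1 - p) / n)
lower, upper = p - 1.96 * se, p + 1.96 * se
print(f"95% CI: ({lower:.3f}, {upper:.3f})")  # (0.497, 0.563)

# Simulation check of the Normal approximation: the standard deviation of
# many simulated sample proportions should be close to SE(p).
random.seed(1)
props = [sum(random.random() < p for _ in range(n)) / n for _ in range(2000)]
mean_p = sum(props) / len(props)
sd_p = math.sqrt(sum((x - mean_p) ** 2 for x in props) / (len(props) - 1))
print(f"theoretical SE = {se:.4f}, simulated SD = {sd_p:.4f}")
```

The simulated standard deviation agrees with the formula to two significant figures, which is the sense in which p "behaves as if selected from a Normal distribution".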
Taken at face value the confidence interval indicates that, although Sarkozy was six percentage points above Royal in the sample, sampling variability is such that Royal could have been very slightly ahead in the population of voters [50.3%, corresponding to the lower bound of 49.7% for Sarkozy]. In other words, the sample size was not large enough to distinguish clearly between the two candidates. Note that the formula implies that the larger n is, the narrower will be the error bound on the sample result.

(1) Note that in this chapter the relevant reference distribution for calculating confidence intervals is the standard Normal (never the t-distribution). Obviously, confidence levels other than 95% may be obtained by replacing 1.96 by the corresponding Z value; thus, 2.58 gives 99% confidence.

The confidence interval is based on the assumption of a simple random sample from the population of interest; this means that all voters would be equally likely to be included in the sample. In practice, this is unlikely to be true, for several reasons. The poll was very likely a telephone survey, which raises several questions: do all voters have telephones? If several voters reside in the same house, are they equally likely to answer the telephone? Will they answer truthfully if they do respond to the survey? The poll was aimed at people who had watched the TV debate: does this subset of the French electorate accurately reflect the voting intentions of the overall electorate? What exactly was the question asked: did they ask who had won the debate, or who the respondent would support? The questions may appear the same, but it is dangerous to assume that the respondent interprets a question as you do, unless it is made very explicit.
[I had a colleague once who, when asked by a TV licence inspector if there was a television in the house, replied that there was, and when asked if he had a TV licence, replied that he did not. When summoned to court he was asked by the judge why he did not have a TV licence; he explained that he did not own the TV, which was his wife’s, and that she did have a TV licence!] Many people are likely to refuse to answer survey questions (whether on the telephone or otherwise). Is there a difference between the subset of the population who are willing to respond and the subset which is not? Any such difference leads to a non-response bias. Many surveys have response rates of less than 20%. In such cases it is virtually impossible to assess the likely bias in the sample of respondents. In short, surveying is something of a black art! The use of a mathematical formula suggests, for most people, an exactness which is clearly inappropriate in very many survey situations. Judgements need to be made on how representative the sample is likely to be, and the uncertainties inherent in these judgements are not included in the error bounds generated by the confidence interval. The real uncertainty is likely to be quite a lot higher (note the precision in this final sentence!).

Exercise 3.1.1

On Saturday May 17th, 2008, the Irish Times published the results of a poll on attitudes to the planned constitutional referendum on the Lisbon Treaty. The size of the sample was 1000; 35% of respondents said they intended to vote ‘yes’. Estimate the population proportion who intended to vote ‘yes’ at this point in the campaign.

Sample Size

In any empirical study, one of the first questions that needs to be answered is: how many observations are required?
One way of addressing this question in the current context is to ask how wide the confidence interval bounds on the sample proportion could be while still giving an interval that answers the practical question, i.e., an interval that is ‘fit for purpose’. The standard error of the sample proportion is:

SE(p) = √(π(1−π)/n)

This depends not only on the sample size (n) but also on the population or process proportion (π), or rather on the product π(1−π). Table 3.1.1 shows how this product varies with π; it assumes its worst value (in the sense of giving the widest interval) when π is 0.5.

π        0.10  0.20  0.30  0.40  0.50  0.60  0.70  0.80  0.90
1−π      0.90  0.80  0.70  0.60  0.50  0.40  0.30  0.20  0.10
π(1−π)   0.09  0.16  0.21  0.24  0.25  0.24  0.21  0.16  0.09

Table 3.1.1: The impact of π on the standard error

Suppose we require that a 95% confidence interval is of the form:

p ± 0.01

Then, since the confidence interval is given by p ± 1.96 √(p(1−p)/n), and the worst case for p(1−p) is 0.25, we require:

1.96 √(0.25/n) = 0.01

which gives a sample size of

n = (1.96)²(0.25)/(0.01)² = 9604

If n is around 1000 the error bounds are ±0.03. This is a typical sample size for political polls, which are usually reported in the press as subject to an error of approximately 3%.

Significance Tests

When only a single proportion is of interest it is much more likely that the desired analysis will be to calculate a confidence interval than to carry out a significance test. Confidence intervals are more useful in that they not only tell us whether the data are consistent with a particular hypothesised population or process value, but also show the range of values with which the results are consistent. Tests may be appropriate, however, in certain contexts, especially where a decision has to be made.

Example 2: An Acceptance Sampling test

In Chapter 2 (Section 2.1) the use of a one-sided test was illustrated by an acceptance sampling example.
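The sample-size calculation above can be sketched in a few lines of code. The helper function below is hypothetical (it is not part of the manual), but it implements the same worst-case formula, n = z²·π(1−π)/w², with π = 0.5:

```python
import math

def sample_size(half_width, p=0.5, z=1.96):
    """Smallest n for which the CI half-width z*sqrt(p(1-p)/n) is at most
    half_width. The default p = 0.5 is the worst case, since p(1-p) then
    takes its maximum value of 0.25 (see Table 3.1.1)."""
    return math.ceil(z**2 * p * (1 - p) / half_width**2)

print(sample_size(0.01))  # 9604, as derived above
print(sample_size(0.03))  # 1068, close to the typical poll size of about 1000
```

Note how halving the interval width requires four times the sample size, a direct consequence of the √n in the standard error.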
Acceptance sampling in an industrial context often involves counted data. For example, a company that routinely buys batches of 1000 components for use in one of its own products might take a sample of 80 components on delivery and accept the consignment if no more than 2 components in the sample are “non-conforming”, i.e., do not meet the required specifications. The sample size of 80 and the acceptance number of 2 are suggested by the American National Standard: Sampling Procedures and Tables for Inspection by Attributes (ANSI/ASQC Z1.4-1993) for batches of between 501 and 1200 units, where the maximum acceptable non-conformance rate is 1%. This decision rule is a one-sided test with a rejection region defined by the numbers 3, 4, 5, …, 80. Equivalently, a sample proportion greater than p=2/80=0.025 leads to rejection of the null hypothesis that π ≤ 0.01. The sampling tables are often referred to as the ‘Military Standards’ as they were first developed during the Second World War for monitoring the purchase of military equipment. Further information on the use of these tables can be found in the guidebook American National Standard: Introduction to Attribute Sampling (ANSI/ASQC S2-1995).

3.2 Simple Comparative Studies

In Chapter 2, we discussed comparative studies where the outcomes of interest could be measured and the questions to be addressed referred to population or process means. We now turn our attention to studies where the outcome involves counting, rather than measuring, and where the questions of interest refer to population or process proportions.
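Returning briefly to the acceptance sampling plan of Example 2: the standard's tables do not need to be quoted to see how such a plan behaves. As an illustrative sketch (these acceptance probabilities are computed here, not taken from the text), the probability of accepting a consignment is P(X ≤ 2) for X ~ Binomial(80, π):

```python
from math import comb

def p_accept(pi, n=80, c=2):
    """Probability that a Binomial(n, pi) count of non-conforming items
    is at most c, i.e. the probability the consignment is accepted."""
    return sum(comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(c + 1))

# How the n=80, c=2 plan responds to different non-conformance rates:
for pi in (0.01, 0.05, 0.10):
    print(f"pi = {pi:.2f}: P(accept) = {p_accept(pi):.3f}")
```

At the maximum acceptable rate of 1% the plan accepts roughly 95% of consignments, while a batch with a 10% non-conformance rate is almost certain to be rejected; the curve traced out by p_accept is what quality engineers call the operating characteristic of the plan.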
Typical examples would be medical studies where the two groups are either inherently different (e.g., smokers versus non-smokers) or the subjects are initially similar and are broken into two groups to be allocated to different treatments; the number in each group who respond in a defined way (e.g., develop lung cancer or become free of a medical condition) after a period of time is then counted and expressed as a proportion of the group size. The statistical problem is to assess the extent to which observed sample differences are likely to be systematic and not simply due to chance variation.

Example 3: Use of Timolol for treating angina

Angina pectoris is a chronic heart condition in which the sufferer has periodic attacks of chest pain. Samuels [1] cites the data in Table 3.2.1 from a study [2] to evaluate the effectiveness of the drug Timolol in preventing angina attacks; patients were randomly assigned to receive a daily dosage of either Timolol or a placebo for 28 weeks. The numbers of patients who did or did not become completely free of angina attacks (Angina-Free) are shown in Table 3.2.1. Clearly a higher proportion of the Timolol patients became Angina-Free in the study. The statistical question is: “what are the implications for use of Timolol in a population of such patients?”

                   Timolol   Placebo   Totals
Angina-Free             44        19       63
Not Angina-Free        116       128      244
Totals                 160       147      307
Proportions          0.275     0.129    0.205

Table 3.2.1: The Timolol study results

The Statistical Model

In Section 3.1, we saw that if a single random sample of n units is selected from a population or process in which the proportion of units that has a particular characteristic is π, then the sample proportion p=r/n behaves as if selected from a Normal distribution with mean π and standard error:

SE(p) = √(π(1−π)/n)
Assume now that we have two groups which are independent of each other, in the sense that the chance that any unit in one group does or does not have the characteristic of interest is not in any way affected by the responses of the units in the second group. If this is true then we can extend the result used above in a very simple way [see Appendix 3.1, page 60 for details]. If p1=r1/n1 is the sample proportion and π1 is the population or process proportion for the first group, and p2=r2/n2 and π2 are the corresponding values for the second group, then the difference between the sample proportions, (p1−p2), has a sampling distribution which is approximately Normal with mean (π1−π2) and standard error:

SE(p1−p2) = √(π1(1−π1)/n1 + π2(1−π2)/n2)

This is illustrated in Figure 3.2.1.

Figure 3.2.1: The sampling distribution of the difference between sample proportions

We are now back on familiar territory and can work with this sampling distribution, as it is merely a single Normal distribution (with which we have worked several times already), even though the standard error formula looks a little complicated. The statistical questions that we typically wish to address are whether or not the population or process proportions are the same and, if there is a difference, how big that difference, π1−π2, is. As we have seen in several contexts, questions about statistical significance and the size of effects can be answered by calculating confidence intervals. This will be our approach here too. We will, however, also examine the use of a statistical test for comparing sample proportions, since such tests are commonly used in the research literature. We will use the Timolol study to illustrate the methods.
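The standard-error formula for a difference of proportions can also be checked by simulation. The sketch below is illustrative only (it treats the observed Timolol and Placebo proportions as if they were the true values π1 and π2) and compares the formula with the standard deviation of many simulated differences:

```python
import math
import random

# Hypothetical "true" values, taken from the Timolol study for illustration
pi1, n1 = 0.275, 160
pi2, n2 = 0.129, 147

# Theoretical standard error of p1 - p2
se_theory = math.sqrt(pi1 * (1 - pi1) / n1 + pi2 * (1 - pi2) / n2)

# Simulate many pairs of independent samples and record p1 - p2
random.seed(2)
diffs = []
for _ in range(5000):
    p1 = sum(random.random() < pi1 for _ in range(n1)) / n1
    p2 = sum(random.random() < pi2 for _ in range(n2)) / n2
    diffs.append(p1 - p2)

mean_d = sum(diffs) / len(diffs)
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (len(diffs) - 1))
print(f"theoretical SE = {se_theory:.4f}, simulated SD = {sd_d:.4f}")
```

The two numbers agree closely, which is why the difference between two independent sample proportions can be handled as a single Normal quantity.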
Confidence Interval

We have seen that a confidence interval for a population parameter (or the difference between two such parameters) is given by a point estimate plus or minus the standard error multiplied by a critical value from the relevant reference distribution. Accordingly, since the difference between two sample proportions follows a Normal distribution (approximately), a 95% confidence interval for π1−π2 is given by:

(p1−p2) ± 1.96 √(p1(1−p1)/n1 + p2(1−p2)/n2)

where 1.96 is the critical value from a standard Normal curve and p1 (Timolol) and p2 (Placebo) estimate the corresponding population values π1 and π2, respectively. For the Timolol data this gives:

(0.275−0.129) ± 1.96 √(0.275(1−0.275)/160 + 0.129(1−0.129)/147) = 0.146 ± 0.088

The interval suggests that if patients were routinely administered Timolol (according to the study protocol) then an excess of between 6% and 23% of the population would become Angina-Free, over and above the percentage that would be expected if the patients were routinely prescribed a placebo. Thus, the interval answers the research questions: it tells us that Timolol works and it measures by how much it is better than a placebo. The results of studies of this type are very often analysed and reported by way of statistical tests rather than confidence intervals. Accordingly, one such test will be introduced below. However, I strongly recommend that a confidence interval should be obtained, even if a statistical test is the more commonly used method of analysis in a particular discipline. With very large sample sizes even tiny differences (of no practical interest) may be statistically significant; so simply reporting results as being statistically significant (especially if the ‘statistically’ qualifier is dropped, as it often is!) can be misleading.
On the other hand, small sample sizes result in large standard errors, which will mean non-statistically-significant results unless large sample differences are observed. Calculating a confidence interval makes explicit the uncertainty in the observed sample difference and may lead readers to ask (correctly) whether a non-significant result reflects the absence of an effect or simply the small size of the study.

Significance Test

Obviously the sample proportion of patients who became Angina-Free is greater in the Timolol group (p1=44/160=0.275) than in the Placebo group (p2=19/147=0.129). If we imagine the two treatments being administered to a large population of such patients (from which the sample may be considered to be randomly selected) then we can ask if the sample results are consistent with a difference in the population proportions. We frame our question in terms of a null hypothesis which asserts that the long-run proportions are the same and an alternative hypothesis which denies their equality.

H0: π1 − π2 = 0
H1: π1 − π2 ≠ 0

When the null hypothesis is true the sample difference (p1−p2) has a sampling distribution which is approximately Normal with mean π1−π2 = 0 and standard error:

SE(p1−p2) = √(π(1−π)/n1 + π(1−π)/n2)

where π is the common population value. Since the test statistic is calculated on the assumption that the null hypothesis is true, the Z statistic is given by:

Z = [(p1−p2) − 0] / √(π(1−π)/n1 + π(1−π)/n2)

This has a standard Normal distribution. Because π is unknown it must be estimated from the sample data. If there is no group difference then a natural estimator is the overall sample proportion of patients who become Angina-Free, p=63/307=0.205. This gives our test statistic (2) as:

Z = [(p1−p2) − 0] / √(p(1−p)/n1 + p(1−p)/n2)
  = [(0.275−0.129) − 0] / √(0.205(1−0.205)/160 + 0.205(1−0.205)/147) = 3.16

If we choose a significance level of α=0.05 then the critical values are ±1.96, as shown in Figure 3.2.2 below.
Figure 3.2.2: The standard Normal distribution, with critical values ±1.96 (central area 0.95)

Our test statistic Z=3.16 is greater than the upper critical value and the sample difference is said to be ‘statistically significant’: we reject H0 and conclude that, in the long run, the proportion of patients who would become Angina-Free is greater if Timolol is prescribed than it would be if a placebo were administered routinely. Note the assumption implicit in the analysis, that the patients are a random sample from the corresponding population. In practice, of course, this is unlikely to be true (especially as it would be the intention to administer the treatment to future patients also). It is important, therefore, in considering the scientific value of the test result, to assess the extent to which the sample may be treated as if it were a random sample from the target population. In reporting on the findings it would also be important to specify the target population carefully; for example, the sample may be restricted in its coverage: it may include patients of a narrow age profile, of only one sex, of only one nationality or race, etc.

(2) Note: rounding affects such calculations; the ratio is 0.1457/0.0461.

Typical Computer Output

Table 3.2.2 shows the output obtained from Minitab when the Timolol study sample results are entered.
Test and CI for Two Proportions

Sample    Angina-Free     N   Sample p
Timolol            44   160   0.275000
Placebo            19   147   0.129252

Difference = p (1) - p (2)
Estimate for difference: 0.145748
95% CI for difference: (0.0578398, 0.233657)
Test for difference = 0 (vs not = 0): Z = 3.16  P-Value = 0.002

Table 3.2.2: Minitab analysis of Timolol data

The p-value associated with the test indicates that if the null hypothesis were true, the probability would be 0.002 that we would have obtained, just by chance, a Z test statistic as far out, or further, into the tails of the standard Normal distribution than the value Z=3.16 obtained here. Accordingly, this is an unusually large value and the null hypothesis of no long-run group difference should be rejected. The computer output also gives the confidence interval (to an impressive number of decimal places!).

Continuity Correction

In carrying out the Z-test (or the corresponding chi-square test which we will meet in Section 3.3), it is often advised that a ‘continuity correction’ (or ‘Yates’ correction’) be applied to the formula for the test statistic. This arises because, although we are counting discrete events, our calculations are based on the Normal distribution, which is continuous; there is, therefore, an approximation involved. This is a technical detail which I have chosen to ignore in the course of this chapter. The continuity correction is only relevant when the sample size is relatively small. In the current case applying the correction changes the test statistic Z from 3.16 to 3.02; the p-value changes from 0.0016 to 0.0025.

Exercise 3.2.1

On Saturday May 17th, 2008, the Irish Times published the results of a poll on attitudes to the planned constitutional referendum on the Lisbon Treaty. The size of the sample was 1000; 35% of respondents said they intended to vote ‘yes’. A similar poll, carried out at the end of January, 2008, gave 26% as the percentage who intended to vote ‘yes’.
Carry out a test to determine if the data suggest that the population proportions intending to vote ‘yes’ were the same or different on the two sampling dates. Estimate the change in the population proportion of those who intended to vote ‘yes’.

Other Sampling Designs

The data of Example 3 resulted from a designed experiment (a clinical trial) in which patients were assigned randomly to the two treatment groups. Where random assignment is possible it strengthens the conclusions that may be drawn from the study, as randomisation is the best protection against unknown sources of bias, which may be present in purely observational studies (see further discussion in Chapter 4). Our next two examples give data that are analysed in exactly the same way as the Timolol data, even though the data-generating mechanisms are different. Example 4 is a typical cohort or longitudinal study. Such studies are said to be prospective: they identify the study groups first, follow them over time, and subsequently classify the subjects or patients according to the outcomes being studied. Example 5 is a case-control study; this type of data collection design is retrospective, as we shall see.

Example 4: Cohort Study of Slate Workers

Machin et al. [3] cite a study by Campbell et al. [4] in which a group of Welsh men who had been exposed to slate dust through their work were compared to a similar group who had not been so exposed; the control group was similar in age and smoking status. The groups were followed over 24 years and the numbers of deaths in the two groups were recorded, as shown in Table 3.2.3.
              Slate Workers   Controls   Totals
Died                    379        230      609
Survived                347        299      646
Totals                  726        529     1255
Proportions           0.522      0.435

Relative risk: 1.201   Odds ratio: 1.420

Table 3.2.3: A cohort study of slate workers and controls

Table 3.2.4 shows a statistical comparison of the proportions who died in the two groups: the difference p1−p2 is highly statistically significant, with a p-value of 0.002.

Test and CI for Two Proportions

Sample          Died     N   Sample p
Slate Workers    379   726   0.522039
Controls         230   529   0.434783

Difference = p (1) - p (2)
Estimate for difference: 0.0872560
95% CI for difference: (0.0315353, 0.142977)
Test for difference = 0 (vs not = 0): Z = 3.05  P-Value = 0.002

Table 3.2.4: A statistical comparison of the death rates

The confidence interval shows an increased mortality rate of between 3 and 14 percentage points for the slate workers over the control group. The relative risk and odds ratio shown in Table 3.2.3 will be discussed later.

Example 5: Oesophageal Cancer Case-Control Study

Case-control studies involve selecting a group of patients with some condition (a disease, say), together with a control group, and then looking backwards in time (they are retrospective studies) to see if some antecedent or risk factor occurs more frequently in the study group than it does in the control group. If it does, it suggests a possible link between that factor and the outcome of interest. Obviously, case-control studies involve an inversion of the preferred point of view: we would like to be able to say that “people who do X are more likely to get disease Y”, whereas what we can say is that “people who have disease Y are more likely to have done X”. Such studies are particularly useful for studying rare conditions, for which enormous sample sizes would be required for prospective studies (in order to be sure of recruiting into the study sufficient numbers of people who will get the disease).
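The Minitab output shown in Tables 3.2.2 and 3.2.4 can be reproduced with the standard library alone. The helper below is a sketch (its name and structure are mine, not Minitab's): it computes the pooled Z statistic, a two-sided p-value via the Normal CDF, and the unpooled 95% confidence interval.

```python
import math

def two_prop_z(r1, n1, r2, n2):
    """Two-proportion Z-test with pooled SE, plus an unpooled 95% CI."""
    p1, p2 = r1 / n1, r2 / n2
    p = (r1 + r2) / (n1 + n2)                       # pooled proportion
    se0 = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se0
    # Two-sided p-value from the standard Normal CDF (via math.erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Unpooled standard error for the confidence interval
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (p1 - p2 - 1.96 * se, p1 - p2 + 1.96 * se)
    return z, p_value, ci

# Slate workers study (Table 3.2.3): 379/726 died vs 230/529 among controls
z, p_value, ci = two_prop_z(379, 726, 230, 529)
print(f"Z = {z:.2f}, p = {p_value:.3f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Running the same function on the Timolol counts, two_prop_z(44, 160, 19, 147), reproduces the Z = 3.16 of Table 3.2.2.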
Breslow and Day [5] cite a case-control study by Tuyns et al. [6] involving 200 males who were diagnosed with oesophageal cancer in a Brittany hospital. A control sample of 775 adult males was selected from electoral lists in the nearby communes. Both cases and controls were administered a detailed dietary interview, which contained questions about their consumption of alcohol and tobacco as well as food. Only a small part of the dataset is examined below; Breslow and Day discuss the data in a much more sophisticated way than is attempted here.

Alcohol consumption (g/day)   Cases   Controls   Totals
80+                              96        109      205
<80                             104        666      770
Totals                          200        775      975
Proportions (80+)             0.480      0.141

Odds ratio: 5.640

Table 3.2.5: Oesophageal cancer case-control study

A Z-test (see Table 3.2.6, p<0.001) provides strong statistical evidence of an association between alcohol consumption and disease status. The sample proportions P(high alcohol consumption for Cases)=0.48 and P(high alcohol consumption for Controls)=0.14 are, therefore, indicative of underlying differences.

Test and CI for Two Proportions

Sample      80+     N   Sample p
Cases        96   200   0.480000
Controls    109   775   0.140645

Difference = p (1) - p (2)
Estimate for difference: 0.339355
95% CI for difference: (0.265916, 0.412793)
Test for difference = 0 (vs not = 0): Z = 10.50  P-Value = 0.000

Table 3.2.6: Minitab analysis of oesophageal cancer case-control study

The confidence interval suggests a higher prevalence of heavy drinking among the cases, over and above the level for the controls, of between 27 and 41 percentage points.

Measuring Association

The Timolol study was a clinical trial: an experimental study in which patients were randomly assigned to the treatment and control groups. In such cases we have strong evidence for a causal relationship between the response and the treatment administered.
The oesophageal cancer study, on the other hand, was a purely observational study: in such cases the assumption or deduction of a causal relationship between the risk factor (heavy drinking) and the response (cancer) is always subject to the caveat that there could be other (unknown) factors, related to both of the observed factors, which are responsible for the apparent relationship. This is described as ‘confounding’. For this reason, the term ‘association’ tends to be used, in order to make clear that the data collection method does not allow us to deduce, with any certainty, the existence of a causal relationship. For the Timolol study a natural measure of association is:

p1−p2 = P(Angina-Free | Timolol) − P(Angina-Free | Control)

as it makes a direct comparison between the two proportions of interest. We have already discussed such comparisons extensively. An alternative way of viewing the same information is to calculate an estimate of the relative risk:

Relative Risk = p1/p2 = 0.275/0.129 = 2.13

This lends itself to a simple description of the outcome of the study: “the chances of becoming angina-free are more than doubled for the Timolol group”. Similarly, for the slate workers study the mortality rates after 24 years were p1=0.522 for the slate workers and p2=0.435 for the controls. The relative risk of dying was 1.2: slate workers had a 20% higher chance of dying within the study period. Neither of these measures (p1−p2 or p1/p2) is available if the study is retrospective, since the relevant proportions are not available from the data. For example, in the oesophageal cancer study 96/205=0.47 appears to estimate the probability of being a case, i.e., of having this cancer. However, if the researchers had identified 400 rather than 200 cases, the estimate would be 192/301=0.64, assuming the number of heavy drinkers also doubled.
Thus, the estimate would be determined by the design of the study rather than the characteristics of the population being studied. An alternative measure of association, the odds ratio, has the useful property that it provides a direct measure of association, even when the data collection is retrospective.

Odds Ratio

Table 3.2.7 reproduces the Timolol data, together with a second table where the frequencies are represented by letters.

              Timolol   Control   Totals              Timolol   Control   Totals
Angina free       44        19       63      AF          a         c       a+c
Not A-free       116       128      244      not-AF      b         d       b+d
Totals           160       147      307      Totals     a+b       c+d       n

Table 3.2.7: Timolol study results

Out of the 160 people in the Timolol group 44 become Angina-free (AF) while 116 do not. Instead of expressing the chance of becoming Angina-free as a proportion of the total, p1=44/160, an alternative is to say that the odds of becoming AF as opposed to not-AF are 44:116, which we may write as 44/116. Similarly, the odds of becoming AF for the control group are 19/128. If we take the ratio of these two quantities we get the odds ratio:

OR = (44/116)/(19/128) = (a/b)/(c/d) = ad/bc = 2.6

The odds of becoming AF are more than two and a half times higher for patients who received Timolol. Table 3.2.8 reproduces the oesophageal cancer study results in a similar format to that just considered for the Timolol study. Note, however, that it is the row totals that are fixed by the study design here, whereas it was the column totals for the Timolol study.
              Alcohol consumption                        Alcohol consumption
               80+     <80    Totals                       80+    <80   Totals
Cases           96     104      200      Cases              a      c     a+c
Controls       109     666      775      Controls           b      d     b+d
Totals         205     770      975      Totals            a+b    c+d     n

Table 3.2.8: Oesophageal cancer study results

By design, we cannot calculate directly the odds of being a case for the high alcohol group (just as we could not calculate the corresponding probability, as discussed earlier), because the number would be meaningless – to see this you need only realise that the odds of being a case would double if the study had included 400 instead of 200 cases (assuming the same proportion of heavy drinkers appeared in the study, the odds would be 192/109=1.76 instead of 96/109=0.88). We can, however, find the odds of being a heavy drinker versus a moderate drinker for each of the two groups. For the cases, this is 96/104 and for the controls, it is 109/666. The odds ratio is then:

OR = (96/104)/(109/666) = (a/c)/(b/d) = ad/bc = 5.6

The odds are more than five times higher that an oesophageal cancer patient will be a heavy drinker, than that a control will be one. Note, however, that the odds ratio formula (in letters) for the retrospective case-control study turns out to be exactly the same as for the prospective experimental study:

OR = ad/bc

and so, although the data were collected retrospectively, we can say that the odds of getting oesophageal cancer are more than five times higher for heavy drinkers than they are for more moderate drinkers3. The ratio is often called the ‘cross-product’ ratio because of the ‘shape’ of the resulting formula: compare it to the layout of the letters in the tables. This property, that it produces a forward-oriented index even when the data collection is retrospective, makes the odds ratio an attractive measure of association.
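The cross-product calculation is easy to check numerically. The sketch below (function and variable names are my own) computes ad/bc for each 2x2 table as displayed, and also verifies the key property discussed above: doubling the number of cases scales a whole row of the case-control table and leaves the odds ratio unchanged.

```python
def cross_product_ratio(table):
    """Odds ratio ad/bc for a 2x2 table [[a, b], [c, d]] as displayed."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Timolol trial (Table 3.2.7): rows AF / not-AF, columns Timolol / Control
or_timolol = cross_product_ratio([[44, 19], [116, 128]])   # about 2.6

# Oesophageal cancer study (Table 3.2.8): rows Cases / Controls, columns 80+ / <80
or_cancer = cross_product_ratio([[96, 104], [109, 666]])   # about 5.6

# Doubling the number of cases doubles the whole Cases row; ad/bc is unchanged
or_doubled = cross_product_ratio([[192, 208], [109, 666]])
assert abs(or_doubled - or_cancer) < 1e-12
```

This invariance is exactly the argument made in footnote 3: the doubling cancels in the ratio of the two odds.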
The odds ratio also allows the tools of multiple regression to be applied to frequency data, when we want to analyse the influence of several risk factors simultaneously, e.g., the effects not just of alcohol consumption, but also of age, smoking habit, nutritional status etc. There is another reason why the odds ratio is considered a useful index: it is frequently applied to diseases with low rates4, and so it gives a close approximation to the relative risk for the groups being compared, even when the sampling is retrospective and the relative risk cannot be calculated.

3 Thus, although the odds a/b and c/d of being a case are dependent on the design (total number of cases identified) and hence are not interpretable, their ratio is, since, while doubling the number of cases will double both a and c, the ratio of the two odds will be unchanged, as the doubling will cancel. Note that this is not true, in general, for the relative risk p1/p2.

4 The odds are p/(1–p), which is approximately p(1+p+p²), so, as long as p is small (say no more than 0.05), p² and higher powers of p are negligible and p/(1–p) is effectively equal to p. Thus, the relative risk is effectively the same as the odds ratio when p is very small. Note that this approximate formula is based on a power series expansion of the function p/(1–p), so unless you have studied this kind of mathematics, do not expect to be able to establish the relationship algebraically.

Exercises

3.2.2. 460 adults took part in an influenza vaccination study [7]. Of these, 240 received the influenza vaccine and 220 received a placebo vaccination. At the end of the trial 100 people contracted influenza, of whom 20 were in the treatment group and 80 in the control group. Carry out a statistical test to determine if the vaccine is effective. Calculate and interpret a confidence interval which will estimate the potential effect of the vaccine (beyond that of a placebo).
Calculate the odds ratio. If the entire population of adults had been vaccinated, estimate the proportion of the population that would be expected to have contracted influenza during the test period.

3.2.3. A study of respiratory disease in childhood [7] examined the question of whether children with bronchitis in infancy get more respiratory symptoms in later life than others. Of 273 children with a history of bronchitis before age 5 years, 26 were reported to have a persistent cough at age 14, whereas in the group of 1046 children in the study who did not have bronchitis before the age of 5 years, 44 were reported to have a persistent cough at 14 years. Are the proportions having a persistent cough statistically significantly different? Calculate an interval estimate of the long-run average difference between the two rates of occurrence of persistent cough at age 14 years. Calculate the odds ratio.

3.2.4. Hauer, Ansorge and Zinke [8] report data on otters that were either killed on the road or shot by hunters in eastern Germany over a 40-year time period. Note that the numbers were recovered from summaries in the original paper and the total sample size here is 1023, but the original paper refers to a total sample size of 1027. Table 3.2.9 shows the N=1023 otters cross-classified into a 2x2 table with age-group and sex as the row and column variables, respectively. The authors of the paper described otters of age four or more years as ‘grown adults’; I have described otters younger than this as ‘young’.
          Male   Female   Totals
Young      286      153      439
Adults     312      272      584
Totals     598      425     1023

Table 3.2.9: Otters cross-classified by age and sex

Note that the sampling scheme here is different from our earlier 2x2 tables. The Timolol study was a clinical trial in which patients were randomly assigned to treatment and control groups and the outcome (becoming angina-free) was recorded later; the Slate Workers example was a cohort study in which two groups of subjects were identified and then followed over many years; the Oesophageal Cancer data came from a case-control study in which cancer patients and a control group were identified and interviewed (retrospectively) about their drinking habits (a potential risk-factor for the cancer). In the Otter study the researchers collected data for a given time period, and then cross-classified the results, without pre-specifying any sample size – the total sample size was itself a random quantity. A natural practical question is ‘by how much does the long-run proportion of young male deaths exceed that for young females?’ This question can be answered by calculating a confidence interval for the difference between the relevant proportions. Carry out the corresponding Z-test, also.

Example 6: Collapsing Multi-dimensional Contingency Tables: Simpson’s Paradox – a statistical health warning!

Consider a hypothetical Graduate School with two programmes: Mathematics and English5. Suppose Table 3.2.10 represents the admission/rejection decisions for applications in a particular year, classified by sex. The table strongly suggests bias against females (12% acceptance versus 29% for males) on the part of those making the admission decisions. This difference is highly statistically significant, as shown by the Z-test result (Z=4.8).
Overall                    Males   Females   Totals
Reject                       107       370      477
Accept                        43        50       93
Number of Applicants         150       420      570
Percentage Accepted           29        12       16
Z-value                      4.8
p-value                    0.000

Table 3.2.10: Overall Graduate School admission decisions

Consider now the results from the two Schools individually, as shown in Table 3.2.11. Note that the data in Table 3.2.11 sum to give the corresponding totals in Table 3.2.10 – 150 males and 420 females. In both cases the sample admission rate is higher for females (though not statistically significantly so) – 50 versus 40% for Mathematics and 10 versus 6% for English. This is in stark contrast to the overall rates for the Graduate School. This phenomenon (aggregated results contradicting the disaggregated results) is called Simpson’s Paradox. Why does it happen? In this example females apply (in greater numbers) for the programme into which it is more difficult to gain admission. Thus, only 10% of applicants are accepted into the English programme while 42% are accepted into Mathematics, but the females overwhelmingly apply for English. The conclusion must be that if there is sex bias, the source needs to be sought, not in the admissions procedures of the Graduate School, but in the conditioning and social influences that lead females to apply disproportionately for the graduate programmes into which it is more difficult to gain entry.

5 The Graduate School example is inspired by Bickel et al. [9]

Mathematics                Males   Females   Totals
Reject                        60        10       70
Accept                        40        10       50
Number of Applicants         100        20      120
Percentage Accepted           40        50       42
Z-value                    0.828
p-value                    0.408

English                    Males   Females   Totals
Reject                        47       360      407
Accept                         3        40       43
Number of Applicants          50       400      450
Percentage Accepted            6        10       10
Z-value                    0.907
p-value                    0.364

Table 3.2.11: Graduate School admission decisions by School

This should be a sobering example when we consider how often we collect multiply cross-classified data and, typically, collapse across categories to form a multiplicity of two-dimensional tables for analytical purposes. For example, Social/Political surveys will often collect information on Voting Intention, Party Affiliation, Sex, Age, Income Group, Education, Occupational Status, Geographical Location etc. Data collected on expectant mothers will include most of the above and also variables such as number of previous pregnancies, smoking, drinking, dietary, and exercise habits, as well as medical history information. There is an obvious need for care in the analysis of such information, to ensure that apparent effects are not the result of collapsing over influential variables.

Matched Studies

The analysis of two-sample studies where proportions are the appropriate outcome measure was introduced using the results of a clinical trial of Timolol for preventing angina attacks. The Timolol study design is similar to that which underlies studies where two-sample t-tests are used for the analysis (Chapter 2, Section 2.4). If, instead of simply classifying each patient as Angina-Free or not, we could assign an (approximately Normally distributed) improvement score to each individual in the study, we could have used a two-sample t-test (or a Z-test, given the sample sizes) and the corresponding confidence interval to analyse the resulting data.
This would be preferable statistically (more powerful; statistical power is discussed in Chapter 4 – a more powerful test is more likely to lead to a correct result), as there is very little information embodied in a simple yes/no classification. When we studied two-sample t-tests, we also discussed paired t-tests: these were appropriate when each person or experimental unit was individually matched with another, or where, in measurement studies, the same material/experimental unit was measured twice, by two different instruments. Matching increases the statistical power of the test by eliminating much of the chance variation affecting the comparison – the ‘treatments’ are compared on more uniform material than is the case for two-sample t-tests. Matching can also be used where the response is binary. The analysis of matched studies will be illustrated using data from a case-control study. We previously encountered a case-control study in Example 5 (oesophageal cancer study); there, the control group was selected from the same community of adult males as the cases, but there was no attempt to match the characteristics of individual cases to those of individual controls. In our next example, patients with other forms of cancer were individually matched to the cases, all of whom had a particular cancer.

Example 7: Case-control study – Endometrial cancer

Rice [10] cites a study [11] in which 317 patients, who were diagnosed as having endometrial carcinoma, were matched with other cancer patients. The controls all had one of cervical cancer, ovarian cancer or cancer of the vulva. Each control was matched by age at diagnosis (within 4 years) and year of diagnosis (within 2 years) to a corresponding case. Table 3.2.12 shows the 317 pairs cross-classified by use or not of oestrogen for at least six months prior to the cancer diagnosis.
The question of interest is whether or not there is an association between exposure to oestrogen and the presence of endometrial cancer.

                         Controls
                   Exposed   Not-expos.   Totals               Exposed  Not-expos.  Totals
Cases  Exposed         39          113       152    Exposed       a         b        a+b
       Not-expos.      15          150       165    Not-expos.    c         d        c+d
Totals                 54          263       317    Totals       a+c       b+d        n

Table 3.2.12: A matched case-control study of endometrial cancer

We address the question of interest by asking if the proportion of cases exposed to oestrogen [p1=(a+b)/n = 152/317] is statistically significantly different from the proportion of the controls who were exposed [p2=(a+c)/n = 54/317]. This means that we want to assess whether or not the difference, (p1–p2) = [(a+b)/n – (a+c)/n] = (b–c)/n, is statistically significantly different from zero, and if so, to estimate the long-run difference. A confidence interval provides both the answer to the statistical significance question and also the interval estimate we require.

Confidence Interval

The two sample proportions are not statistically independent of each other, so the standard error formula used in comparing independent proportions (e.g., the Timolol study) is not applicable here. The appropriate standard error estimate is:

SE^(p1–p2) = (1/n)√[(b+c) – (b–c)²/n]

We can use this to obtain an approximate 95% confidence interval for the difference between the corresponding long-run (population) proportions, (π1–π2). Thus, when we fill the relevant numbers for the endometrial cancer study into the formula:

(p1–p2) ± 1.96 (1/n)√[(b+c) – (b–c)²/n]

we obtain:

(0.48 – 0.17) ± 1.96 (1/317)√[(113+15) – (113–15)²/317]

0.31 ± 0.06

The interval is quite far from zero, suggesting that the proportion exposed to oestrogen in the population of endometrial cancer patients is considerably higher than that for the other cancer groups – clear evidence of an association between the use of oestrogen and the development of endometrial cancer.
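The matched-pairs interval can be checked numerically. This is a minimal sketch (variable names are my own) using the correlated-proportions standard error; it also evaluates the statistic (b–c)/√(b+c) used for the corresponding test of no association.

```python
from math import sqrt

# Endometrial cancer matched pairs (Table 3.2.12)
# a: both exposed, b: case only, c: control only, d: neither
a, b, c, d = 39, 113, 15, 150
n = a + b + c + d                 # 317 pairs

diff = (b - c) / n                               # p1 - p2 = (b - c)/n
se = sqrt((b + c) - (b - c) ** 2 / n) / n        # SE for correlated proportions
lo, hi = diff - 1.96 * se, diff + 1.96 * se

z = (b - c) / sqrt(b + c)                        # statistic for testing no association

print(f"difference = {diff:.3f} +/- {1.96 * se:.3f}, Z = {z:.1f}")
```

Only the discordant pairs (b and c) drive both the interval width and the test statistic; the concordant pairs a and d enter only through n.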
A Statistical Test

An alternative approach to determining if an association exists is through the use of a statistical test. If we wish to test the null hypothesis that π1=π2 then, when this hypothesis is true, the estimated standard error formula simplifies6 to:

SE^(p1–p2) = (1/n)√(b+c)

Using this we calculate a Z-statistic:

Z = [(p1–p2) – 0] / SE^(p1–p2) = [(b–c)/n] / [(1/n)√(b+c)] = (b–c)/√(b+c)

which follows a standard Normal distribution (again, when the null hypothesis is true). For the oestrogen data we obtain:

Z = (113–15)/√(113+15) = 8.7

The critical values are ±1.96 for a two-tail test using a significance level of 0.05. Our observed value of Z=8.7 is far out in the right-hand tail. There is, therefore, strong evidence of an association between oestrogen use and endometrial cancer.

6 If the null hypothesis is true, the expectation (long-run value) of (p1–p2) is zero, i.e., the expectation of (b–c)/n is zero, so the (b–c)²/n part of the standard error formula drops out.

Exercises

3.2.5 Samuels [1] reports data from a 1973 study of risk factors for stroke in which 155 women who had experienced a haemorrhagic stroke (cases) were interviewed. Controls (who had not had a stroke) were matched to each case by where they lived, age and race. The subjects were asked if they had used oral contraceptives. The table below displays oral contraceptive use (Yes/No) for all pairs.

                Controls
                Yes    No
Cases   Yes       5    30
        No       13   107

Table 3.2.13: Pairs classified by use of oral contraceptives

Do the data of Table 3.2.13 suggest an association between oral contraceptive use and strokes?

3.2.6 Altman [12] reports data from a 1975 study carried out to see if patients whose skin did not respond to dinitrochlorobenzene (DNCB), a contact allergen, would show an equally negative response to croton oil, a skin irritant.
The table below shows the results of simultaneous tests on the two chemicals for 173 patients with skin cancer; thus, this is an example of a “self-paired” study.

                 DNCB
                 +     –
Croton    +     81    48
Oil       –     23    21

Table 3.2.14: Pairs classified by response to DNCB and croton oil

Is there evidence of a differential response to the two chemicals?

Comparing two proportions that sum to 1

The matched samples design leads to proportions that are correlated with each other. A special case of correlated proportions occurs when there are only two proportions to be compared and the proportions are based on the same total. The proportions will then be perfectly correlated, in that knowing one immediately gives the value for the other, and the standard error needs to take account of this fact. The analysis is a simple modification of our earlier analysis for two independent proportions.

Example 8: French 2007 presidential election poll - revisited

Recall the political poll, of Section 3.1, that estimated the support for Sarkozy (π1) and Royal (π2) in the French 2007 presidential election. Note that in this case the two sample proportions p1 and p2 are not independent since (p1+p2)=1 (assuming respondents chose one or the other, i.e., no ‘neithers’ or ‘don’t knows’ in the sample of 858). If we wish to focus on the difference in support for the two candidates then the standard error of the sample difference (p1–p2) is:

SE(p1–p2) = 2√[π1(1–π1)/n]

where n is the size of the poll and π2=(1–π1); see Appendix 1, page 61 for details. In practice, we replace π1 by p1 in using this formula for estimation. Note the difference between this formula and that for the standard error of the difference between independent proportions (page 9).
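This standard error formula can be sketched in code. The figures below (p1 = 0.530, n = 858) are those of the poll; the variable names are my own.

```python
from math import sqrt

# French 2007 poll: p1 + p2 = 1, so the two proportions are perfectly correlated
n = 858
p1 = 0.530        # Sarkozy
p2 = 1 - p1       # Royal
diff = p1 - p2

# SE(p1 - p2) = 2*sqrt(p1*(1 - p1)/n): twice the SE of a single proportion
se = 2 * sqrt(p1 * (1 - p1) / n)
lo, hi = diff - 1.96 * se, diff + 1.96 * se

# The interval covers zero: Sarkozy's sample lead is not statistically significant
assert lo < 0 < hi
```

Because p2 is determined by p1, the whole calculation reduces to the sampling variability of a single proportion, scaled by two.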
A 95% confidence interval for Sarkozy’s lead at the time the poll was taken is:

(p1–p2) ± 1.96 (2)√[p1(1–p1)/n]

(0.530 – 0.470) ± 3.92√[0.530(0.470)/858]

0.060 ± 0.067

This interval covers zero and so the sample difference is not statistically significant – the corresponding significance test would not reject the hypothesis of no difference in the population of voters. Despite the size of the sample difference in Sarkozy’s favour, the sampling variability is such that Royal could, in fact, have had a slight lead in the population of electors from which the sample was selected. The conclusion is exactly the same as that based on an analysis of either Sarkozy’s or Royal’s support separately, since (because p1+p2=1) no extra information is used in analysing the difference. The modification to the standard error formula would be important, though, where one proportion was statistically significantly bigger than the other and an interval estimate of the difference was required.

Exercise

3.2.7 Exercise 3.2.4 (page 20) analysed data on otters that were either killed on the road or shot by hunters in eastern Germany over a 40-year time period. The total sample size of 1023 was made up of 598 males and 425 females. The authors state that there is an “even sex ratio in living otters”. Do the study results support the proposition that male otters are more likely to be killed, or could the sample difference (58.5 versus 41.5%) be attributable to chance sampling variability in the reported data? Calculate a 95% confidence interval for the difference in the long-run proportions of male and female otters killed in eastern Germany in the study period to answer this question.

3.3 Analysis of Contingency Tables

In Section 3.2 we discussed the analysis of comparative studies involving two groups where the responses were classified into one of two categories.
We now turn our attention to studies which may involve more than two groups and where the responses may be classified into more than two categories. The data resulting from such studies are usually presented in tabular form, with the columns (say) representing the different groups and the rows the different response categories. Such tables are often referred to as ‘contingency tables’: one characteristic is considered as ‘contingent on’ or ‘dependent on’ a second.

Chi-square Test

The analysis of the resulting data involves the use of (for us) a new statistical test – the chi-square test. Before plunging into the analysis of larger tables, the chi-square test will be introduced, by using it to re-analyse an old friend – the Timolol data!

Example 9: The Timolol Study - revisited

Patients were assigned to receive a daily dosage of either Timolol or a Placebo for 28 weeks. The numbers of patients who did or did not become completely free of angina attacks (Angina-Free) are shown again in Table 3.3.1.

                   Timolol   Placebo   Totals
Angina-Free            44        19       63
Not Angina-Free       116       128      244
Totals                160       147      307

Table 3.3.1: The Timolol Study Results

Suppose we ask the question: “If Timolol has no greater effect than the Placebo, how many of the treatment group would be expected to become Angina-Free?”. Since 63/307 in all become Angina-Free, and this fraction should apply equally to both the treatment and placebo groups if there is no differential effect, we estimate the expected7 number of Angina-Free patients in the Timolol group as shown below.

7 If 100 fair coins are tossed, we expect NxP[head] = 100[0.5] = 50 heads. By this we do not mean that if the experiment were carried out once, the result would be 50 heads. Rather, that in a very long sequence of such experiments the long-run average number of heads would be 50. Thus, the term ‘expected value’ is a technical statistical term meaning ‘long-run average’.
Since an expected value is an average it may assume non-integer values even though the raw data are counts. Accordingly, the expected values should not be rounded to integer values when carrying out the calculations.

E11 = (63/307)(160) = 32.83

Following exactly the same logic, we estimate the expected numbers in the other cells of the 2x2 table as shown in Table 3.3.2, where i=1, 2 labels the rows and j=1, 2 labels the columns.

E11 = (63/307)(160) = 32.83      E12 = (63/307)(147) = 30.17        63
E21 = (244/307)(160) = 127.17    E22 = (244/307)(147) = 116.83     244
            160                              147                   307

Table 3.3.2: Calculating the expected values

Note that the expected value in cell (i, j) is generated by:

Eij = (Row Total x Column Total) / Grand Total

We now have a set of ‘observed values’ and a set of ‘expected values’. If the assumption of no differential treatment effect is valid, then the two sets of values should differ only by chance variation. A measure of the divergence is:

X² = Σ (Oij – Eij)²/Eij

where the sum is taken over all cells of the table. Since a difference of (Oij–Eij)=5 would be much more important if we expected Eij=10 than if we expected Eij=100, the divisor scales each squared deviation to allow for the size of the expected value. The deviations are squared, as otherwise they would sum to zero, since each row and column of the 2x2 table contains a positive and negative deviation (Oij–Eij) of equal magnitude (see Table 3.3.3). Our divergence measure is a test statistic which has a (chi-square) sampling distribution when the assumption under which the expected values were calculated is true. The shape of the sampling distribution (examples will be shown later) depends on a parameter called the ‘degrees of freedom’. This is often labelled ν (the Greek letter nu, corresponding to n); here ν=1.
This value follows from the fact that for 2x2 tables once one expected value is calculated, all the others can be found by subtracting this value from the marginal totals of the table. Thus, the expected value in cell (1,2) is 63–32.83 = 30.17 and so on. Equivalently, once one deviation (Oij–Eij) is found, the rest follow from the fact that the deviations sum to zero across the rows and down the columns, see Table 3.3.3.

(O11–E11)²/E11 = (44–32.83)²/32.83 = (11.17)²/32.83 = 3.797
(O12–E12)²/E12 = (19–30.17)²/30.17 = (–11.17)²/30.17 = 4.133
(O21–E21)²/E21 = (116–127.17)²/127.17 = (–11.17)²/127.17 = 0.980
(O22–E22)²/E22 = (128–116.83)²/116.83 = (11.17)²/116.83 = 1.067

Note the pattern of deviations (Oij–Eij):

  11.17   –11.17    0
 –11.17    11.17    0
    0        0

Table 3.3.3: Calculating the contributions to X²

The Statistical Test

If the proportion becoming Angina-Free would be π1 and the proportion becoming not-Angina-Free would be π2 (note π1+π2=1) in a population of patients receiving the placebo, then the null hypothesis is that the pair [π1, π2] applies also to the population of patients receiving Timolol. The pair [π1, π2] is the ‘relative frequency distribution’: the null hypothesis is that this applies to both groups, equally. The individual cell calculations are shown in Table 3.3.3. These give the test statistic as:

X² = Σ (Oij – Eij)²/Eij = 9.98

If we choose a significance level α=0.05 then the critical value is χ²c = 3.84, as shown in Figure 3.3.1 (see Statistical Table 4: ST-4). Note that the critical region for the test statistic is inherently one-tailed, as large values of X² point to large divergences between observed and expected values, thus requiring rejection of the null hypothesis under which the expected values were calculated.
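The whole calculation above can be verified with a short script. This is a sketch (not the Minitab analysis shown later in the text; names are my own): it builds the expected counts from the margins, sums the cell contributions, and, for one degree of freedom, obtains the p-value from the standard Normal distribution via the identity P(X² > x) = erfc(√(x/2)).

```python
from math import erfc, sqrt

# Timolol 2x2 table: rows = outcome (AF, not AF), columns = group (Timolol, Placebo)
observed = [[44, 19], [116, 128]]

row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
n = sum(row_tot)

# Expected counts: (row total x column total) / grand total
expected = [[r * c / n for c in col_tot] for r in row_tot]

# X2 = sum over cells of (O - E)^2 / E
x2 = sum((o - e) ** 2 / e
         for orow, erow in zip(observed, expected)
         for o, e in zip(orow, erow))

# For 1 degree of freedom, P(X2 > x) = erfc(sqrt(x/2))
p_value = erfc(sqrt(x2 / 2))

print(f"X2 = {x2:.3f}, p = {p_value:.3f}")
```

Note that the expected counts are kept as fractions (32.83, not 33), exactly as the text recommends.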
Figure 3.3.1: The chi-square distribution with degrees of freedom ν=1

Since the test statistic X²=9.98 greatly exceeds the critical value of χ²c = 3.84, the null hypothesis is rejected. The test result suggests a systematic difference between the observed and expected values. This points to a difference between the two patient groups in terms of the long-run proportions becoming Angina-Free. Of course, we already know this from our previous analysis of the sample proportions.

Typical Computer Output

Table 3.3.4 shows a chi-square analysis of the Timolol data, produced by Minitab.

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

             Timolol   Placebo   Total
A-Free            44        19      63
               32.83     30.17
               3.797     4.133
NotA-Free        116       128     244
              127.17    116.83
               0.980     1.067
Total            160       147     307

Chi-Sq = 9.978, DF = 1, P-Value = 0.002

Table 3.3.4: Minitab chi-square output for the Timolol data

The computer output gives the observed and expected values and the contribution to X² for each of the four cells of the table. The individual contributions are summed to give the test statistic, X²=9.978. This is stated to have a p-value of 0.002, which means that if the null hypothesis were true, the probability would be only 0.002 that we would have obtained, by chance, an X² statistic equal to or greater than X²=9.978. This means that our test statistic is unusually large (see Figure 3.3.2), which in turn means that the divergence between the observed and expected values is also large. Such a p-value would lead to rejection of the null hypothesis.

Figure 3.3.2: The p-value is the area in the right-hand tail beyond the calculated test statistic X²=9.978

Chi-square or Z-test?

Table 3.3.5 shows the Minitab analysis for the comparison of the two proportions of patients who become Angina-Free.
Test and CI for Two Proportions

Sample     A-Free     N    Sample p
Timolol        44   160    0.275000
Placebo        19   147    0.129252

Difference = p (1) - p (2)
Estimate for difference: 0.145748
95% CI for difference: (0.0578398, 0.233657)
Test for difference = 0 (vs not = 0): Z = 3.16  P-Value = 0.002

Table 3.3.5: Comparing the proportions becoming angina-free using Minitab

The Z-test for the difference between the population proportions gives a test statistic of Z=3.16; when this is squared it gives (apart from the effects of rounding) a value (9.98) which is the same as the X² statistic for the chi-square test: the two tests are, therefore, equivalent. The Z value has a standard Normal distribution; when the standard Normal is squared it gives a chi-square distribution with one degree of freedom, which is the sampling distribution of the X² statistic. The approach using proportions is (to me at least!) more appealing in that it focuses attention on the quantities of direct interest (the proportions becoming Angina-Free). It also leads naturally to calculating a confidence interval to quantify the likely population difference.

Exercises

3.3.1 Re-analyse the Slate Workers Cohort Study data of Example 4, using a chi-square test.

3.3.2 Re-analyse the Oesophageal Cancer Case-Control Study of Example 5, using a chi-square test.

McNemar’s Test

In Example 7 we analysed a matched case-control study (endometrial cancer) using both a confidence interval for the difference between two correlated proportions and the corresponding Z-test. If the Z-statistic is squared it gives a chi-square test statistic with one degree of freedom, just as is the case for the test on two independent proportions. This test gives identical results to the corresponding Z-test. The chi-square version is usually referred to as McNemar’s test.
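The equivalence of the two tests can be demonstrated numerically. The sketch below (my own variable names) computes the pooled-SE Z-statistic and the X² statistic for the Timolol table and checks that Z² equals X², an exact algebraic identity for 2x2 tables, not merely an approximation.

```python
from math import sqrt

# Timolol data: proportions becoming angina-free
x1, n1 = 44, 160     # Timolol
x2, n2 = 19, 147     # Placebo
p1, p2 = x1 / n1, x2 / n2

# Z-test with the pooled standard error
p = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

# Chi-square statistic from the 2x2 table
observed = [[44, 19], [116, 128]]
row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
n = sum(row_tot)
x2_stat = sum((observed[i][j] - row_tot[i] * col_tot[j] / n) ** 2
              / (row_tot[i] * col_tot[j] / n)
              for i in range(2) for j in range(2))

# The squared Z-statistic equals the chi-square statistic
assert abs(z ** 2 - x2_stat) < 1e-6
```

The apparent discrepancy in the text (3.16² = 9.99 versus 9.98) is purely an artefact of rounding Z to two decimal places.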
Larger Contingency Tables

No new ideas arise when we carry out chi-square tests on contingency tables that are larger than the 2x2 tables just discussed – the amount of calculation simply increases. The initial chi-square test tells us if all groups behave in the same way or not. If they do not, it will be necessary to follow up with further analyses to identify where the differences arise. We consider three examples from different areas: market research, literary style analysis and genetics. Exercises then illustrate applications in physics, hospital management and ecology.

Example 10: Market Research Study

Stuart [13, 14] reports on a market research study in which interviewees were asked whether they had heard of a product, had heard of it but not bought it, or had bought it at least once; these classifications measure ‘the degree of market penetration’ of the product. The study results, for three regions, are shown in Table 3.3.6.

                       Region
                   A      B      C    Totals
Bought            109     49    168      326
Heard-no-buy       55     56     78      189
Never heard        36     45     54      135
Totals            200    150    300      650

Table 3.3.6: Market penetration study

The sample sizes were clearly fixed for the three regions; why they were different is not reported – the regions may, for example, be of different geographical sizes. Before we begin our formal statistical analysis of the data, a few comments on presenting data in tabular form are worth considering. Table 3.3.7 shows the relative frequencies for the market penetration categories in the three regions; the proportions are expressed to three decimal places.

                       Region
                    A        B        C
Bought          0.545    0.327    0.560
Heard-no-buy    0.275    0.373    0.260
Never heard     0.180    0.300    0.180

Table 3.3.7: Relative frequency distributions of response by region

The same information is shown in Table 3.3.8, but this time the numbers are expressed as column percentages and rounded to whole numbers (note that, because of the rounding, region A sums to 101).
The columns are also rearranged to put A beside C to emphasise their similarity.

                Region A   Region C   Region B
Bought                55         56         33
Heard-no-buy          28         26         37
Never heard           18         18         30

Table 3.3.8: The percentage responses by region (rounded)

I think you will agree that Table 3.3.8 is much easier to read and interpret: regions A and C are similar in response characteristics and different from B. Clearly organised tables are important when presenting information to others; see Ehrenberg [15] or Stuart [14] for a discussion of the principles of good table making.

The Statistical Model

For the chi-square analysis of the Timolol study our model was that a fraction (π) of the population of patients would become free of angina attacks, and a fraction (1 − π) would not. The question of interest was whether the same fractions [π, 1 − π] applied to both the Timolol and the placebo groups. Here, we have three groups and three response categories. Our model is that the observed sets of three response frequencies are generated by the same underlying population response relative frequency distribution [π1, π2, π3].

Analysis

The observed relative frequencies from the three regions are clearly not the same. The first statistical question is whether or not the differences between regions could be attributed to chance sampling variation. If the answer is yes, there is little point in making further between-region comparisons. This question will be addressed using a chi-square test which is a straightforward extension of the test carried out on the 2x2 table earlier. Differences between regions will then be quantified using confidence intervals for differences between proportions – these are calculated exactly as discussed earlier for the Timolol data. Our null hypothesis is that there is a common set of population relative frequencies [π1, π2, π3].
On this assumption, Table 3.3.6 suggests that a natural estimate of the proportion who have bought the product is 326/650, and this should apply equally to all three regions. This implies that the expected number of respondents in region A who have bought the product is:

E11 = (326/650)(200) = 100.31

As before, this estimate takes the form:

Eij = (Row Total x Column Total) / Grand Total

Following precisely the same logic, Table 3.3.9 shows the calculation of the expected values for all the cells. Note that the row and column totals are identical to those for the observed values, as shown in Table 3.3.6.

                   A                            B                           C                            Totals
Bought             E11 = 326(200)/650 = 100.31  E12 = 326(150)/650 = 75.23  E13 = 326(300)/650 = 150.46    326
Heard-not-bought   E21 = 189(200)/650 = 58.15   E22 = 189(150)/650 = 43.62  E23 = 189(300)/650 = 87.23     189
Never heard        E31 = 135(200)/650 = 41.54   E32 = 135(150)/650 = 31.15  E33 = 135(300)/650 = 62.31     135
Totals             200                          150                         300                            650

Table 3.3.9: Calculating the expected values

Table 3.3.10 shows the divergences between the observed and expected values [Oij – Eij]. Note that the divergences sum to zero across all rows and down all columns (our method of calculating the expected values forces the expected table margins to be the same as the observed table margins, which means the differences between the margins are zero).

                   A                      B                      C                      Sum
Bought             109 - 100.31 = 8.69    49 - 75.23 = -26.23    168 - 150.46 = 17.54     0
Heard-not-bought   55 - 58.15 = -3.15     56 - 43.62 = 12.38     78 - 87.23 = -9.23       0
Never heard        36 - 41.54 = -5.54     45 - 31.15 = 13.85     54 - 62.31 = -8.31       0
Sum                0                      0                      0

Table 3.3.10: Differences between observed and expected values

Because of these restrictions on the table margins, once four cells are filled (either expected values or differences between observed and expected) the other five are automatically determined – the deviations are said, therefore, to have four degrees of freedom. This means that when we calculate our X2 test statistic, we will compare it to a chi-square distribution with four degrees of freedom.
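The expected-value calculation of Table 3.3.9 can be done for every cell at once from the table margins. A minimal sketch (numpy assumed; the array layout mirrors Table 3.3.6):

```python
import numpy as np

# Observed counts from Table 3.3.6; rows are the response categories
# (Bought, Heard-no-buy, Never heard), columns the regions (A, B, C)
observed = np.array([[109, 49, 168],
                     [ 55, 56,  78],
                     [ 36, 45,  54]])

row_totals = observed.sum(axis=1)   # 326, 189, 135
col_totals = observed.sum(axis=0)   # 200, 150, 300
grand_total = observed.sum()        # 650

# E_ij = (row total x column total) / grand total, for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total

# The divergences O - E sum to zero along every row and every column,
# which is why this 3x3 table has (3-1)x(3-1) = 4 degrees of freedom
divergence = observed - expected
```

Printing `expected` reproduces Table 3.3.9 (e.g. the top-left cell is 100.31), and the zero row and column sums of `divergence` reproduce the margins of Table 3.3.10.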
In general, the degrees of freedom for contingency tables are given by:

degrees of freedom = (Number of rows – 1) x (Number of columns – 1)

which, for the current dataset, gives:

degrees of freedom = (3 – 1) x (3 – 1) = 4.

Figure 3.3.3: Chi-square distribution with 4 degrees of freedom

The Minitab analysis of Table 3.3.11 shows that the individual cell contributions sum to a test statistic of:

X2 = Σ (Oij – Eij)2 / Eij = 24.6    (summed over all cells)

which greatly exceeds the critical value of 9.49 (shown in Figure 3.3.3). Clearly, the assumption of a common underlying relative frequency distribution of responses must be rejected.

Chi-Square Test: A, B, C

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                    A        B        C   Total
Bought            109       49      168     326
               100.31    75.23   150.46
                0.753    9.146    2.044

Heard-no-buy       55       56       78     189
                58.15    43.62    87.23
                0.171    3.517    0.977

Never heard        36       45       54     135
                41.54    31.15    62.31
                0.738    6.154    1.108

Total             200      150      300     650

Chi-Sq = 24.608, DF = 4, P-Value = 0.000

Table 3.3.11: Minitab chi-square analysis of market research data

A chi-square analysis of regions A and C supports the conclusion of common underlying rates, which was suggested by Table 3.3.8. The p-value of 0.928 in Table 3.3.12 shows the observed and expected values to be very close to each other.8

Chi-Square Test: A, C

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                    A        C   Total
Bought            109      168     277
               110.80   166.20
                0.029    0.019

Heard-no-buy       55       78     133
                53.20    79.80
                0.061    0.041

Never heard        36       54      90
                36.00    54.00
                0.000    0.000

Total             200      300     500

Chi-Sq = 0.150, DF = 2, P-Value = 0.928

Table 3.3.12: Chi-square comparison of responses for regions A and C

Table 3.3.13 shows a chi-square comparison of the pooled values for regions A and C versus those of region B: region B clearly behaves differently.
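The Minitab analysis of Table 3.3.11 can be reproduced with any statistics library; a minimal sketch using scipy (assumed available) is:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Market research counts from Table 3.3.6 (rows: Bought, Heard-no-buy,
# Never heard; columns: regions A, B, C)
observed = np.array([[109, 49, 168],
                     [ 55, 56,  78],
                     [ 36, 45,  54]])

# Returns the X2 statistic, p-value, degrees of freedom and expected counts
stat, p_value, dof, expected = chi2_contingency(observed)
print(round(stat, 3), dof)   # 24.608 on 4 degrees of freedom, as in Table 3.3.11
```

The returned `expected` array matches Table 3.3.9, and the very small p-value corresponds to Minitab's "P-Value = 0.000".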
Note that the statistical tests shown in Tables 3.3.12 and 3.3.13 have a different status from that in Table 3.3.11. The tests carried out in Tables 3.3.12 and 3.3.13 were prompted by examining the data. It is axiomatic in scientific investigation that you cannot form a hypothesis on the basis of a set of data and then test that hypothesis using the same dataset. These tests may be regarded, therefore, as crude measures in an exploratory analysis; we would treat the results with a great deal of caution if the tests were only marginally statistically significant. Given the marked differences in responses between B and the combined A and C, it is possible (even likely?) that the market researchers would have expected such results and might have been in a position to formulate the hypotheses underlying the tests of Tables 3.3.12 and 3.3.13 a priori, in which case the tests would have their usual status. It might, for example, be the case that B is a relatively new market while A and C are mature markets; Stuart does not provide further information which would throw light on such questions.

8 My colleague Michael Stuart found the data in a textbook (but has lost the reference). The results in Table 3.3.12 are so closely matched for regions A and C as to suggest to me that they may have been invented for illustrative purposes! However, I am taking them at face value until he retrieves the reference.

Chi-Square Test: A+C, B

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                  A+C        B   Total
Bought            277       49     326
               250.77    75.23
                2.744    9.146

Heard-no-buy      133       56     189
               145.38    43.62
                1.055    3.517

Never heard        90       45     135
               103.85    31.15
                1.846    6.154

Total             500      150     650

Chi-Sq = 24.461, DF = 2, P-Value = 0.000

Table 3.3.13: Chi-square comparison of pooled A and C versus B

Quantitative statements regarding differences can be supported by confidence intervals for the population differences.
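Such an interval can be computed directly from the counts. A sketch for the 'Bought' proportions in the pooled region A+C versus region B (scipy assumed; this uses the usual unpooled standard error for the interval):

```python
from math import sqrt
from scipy.stats import norm

# 'Bought' counts for the pooled region A+C versus region B (Table 3.3.13)
x1, n1 = 277, 500   # A+C
x2, n2 = 49, 150    # B

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Standard error of the difference between two independent sample proportions
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = norm.ppf(0.975)   # approximately 1.96 for a 95% interval

lower, upper = diff - z * se, diff + z * se
print(round(diff, 4), (round(lower, 4), round(upper, 4)))
```

The result (a difference of about 0.227, with a 95% interval of roughly 0.14 to 0.31) matches the Minitab output discussed in the text.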
For example, Table 3.3.14 shows a comparison of the proportions of consumers who have bought the product in region B versus the combined region A+C. Expressed in terms of percentages, the point estimate for the difference is 23%, and the confidence interval suggests that the percentage of consumers in regions A and C who bought the product is between 14 and 31 percentage points higher than the corresponding value for region B.

Test and CI for Two Proportions

Sample   Bought     N   Sample p
A+C         277   500      0.554
B            49   150      0.327

Difference = p (1) - p (2)
Estimate for difference:  0.227333
95% CI for difference:  (0.140550, 0.314117)
Test for difference = 0 (vs not = 0):  Z = 4.88  P-Value = 0.000

Table 3.3.14: Difference in buying proportions; regions A+C versus B

Figure 3.3.4: A graphical representation of the study results (percentages)

Figure 3.3.4 presents the results of the study in graphical form. Note that the lines connecting the dots have no intrinsic meaning – they simply help us visualise the different market penetration 'profiles' corresponding to the two market segments. The picture shows regions A and C to be higher by 22 percentage points than region B in relation to those who have bought the product. On the other hand, B is higher by 10-12 percentage points for the other two market penetration categories. Contrast this picture with Table 3.3.6 as a way of reporting the study results. Tables of technical statistical analyses, such as Table 3.3.11, are inappropriate for presenting study results – they contain too many unnecessary technical details. If included at all, they should appear in a technical appendix to a report on the market research study.

Example 11: Literary Style

Jane Austen left an unfinished novel (Sanditon) when she died. She also left a summary of the remaining chapters. The novel was finished by an imitator, who tried to match her style as closely as possible.
The data below are cited by Rice [10], based on an analysis by Morton [16]. Six words were selected for analysis and their frequencies were counted in Chapters 1 and 3 of Sense and Sensibility, Chapters 1, 2 and 3 of Emma, Chapters 1 and 6 of Sanditon (described below as Sanditon-A, written by Austen) and Chapters 12 and 24 of Sanditon (described below as Sanditon-B, written by the imitator). Can we distinguish between the authors on the basis of the word counts?

           Sense+Sensibility   Emma   Sanditon-A   Sanditon-B   Totals
a                        147    186          101           83      517
an                        25     26           11           29       91
this                      32     39           15           15      101
that                      94    105           37           22      258
with                      59     74           28           43      204
without                   18     10           10            4       42
Totals                   375    440          202          196     1213

Table 3.3.15: Word counts for the literary works

Statistical Model

Before beginning our analysis it is worthwhile considering the nature of the data given in the table. Note that the data generation mechanism is quite different from that of our previous market research example. There, there were three categories of response (bought, heard but not bought, and never heard of the product). The three relative frequencies or percentages for any region provided us with estimates of consumer behaviour. Here the row categories are arbitrarily selected words used by the author. The relative frequencies for any one novel do not have any intrinsic meaning (unlike the market research study); they are simply the relative frequencies of these particular words (relative to the full set of words included in the list) and would change if any of the words were dropped from the list or if others were added. The statistical model that underlies the chi-square analysis that follows is that there is a set of relative frequencies for these arbitrarily selected words [π1, π2, ..., π6] that characterises Austen's writing style; it underlies all her writing and should apply also to the imitator, if he has been successful in matching her style.
If this is true, then the observed frequencies (considered as a random sample from her works) should only reflect chance variation away from this stable underlying structure.

Analysis

Before comparing Austen to her imitator, it makes sense to check if the word distributions are stable within her own work. If they are not, then the basis for the comparison is undermined; we might have to ask ourselves to which Jane Austen we want to compare the imitator! The frequency distributions for the three known Austen works are given in Table 3.3.16 and a chi-square test is carried out. The critical value for a test using a significance level of 0.05, where there are 10 degrees of freedom, is 18.31. The test statistic of 12.3 is much less than this (with a correspondingly large p-value) and so we do not reject the null hypothesis of a consistent underlying structure for the three works. It makes sense, therefore, to combine the counts for the three works before comparing them to the imitator. One set of frequencies, based on a total sample size of 1017, will be more stable than three sets based on smaller sample sizes. The observed values in the first column of Table 3.3.17 are the combined word counts, that is, the totals from the right-hand margin of Table 3.3.16.
Chi-Square Test: Sensibility, Emma, Sanditon-A

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

          Sensibility     Emma   Sanditon-A   Total
a                 147      186          101     434
               160.03   187.77        86.20
                1.061    0.017        2.540

an                 25       26           11      62
                22.86    26.82        12.31
                0.200    0.025        0.140

this               32       39           15      86
                31.71    37.21        17.08
                0.003    0.086        0.254

that               94      105           37     236
                87.02   102.10        46.88
                0.560    0.082        2.080

with               59       74           28     161
                59.37    69.66        31.98
                0.002    0.271        0.495

without            18       10           10      38
                14.01    16.44         7.55
                1.135    2.523        0.797

Total             375      440          202    1017

Chi-Sq = 12.271, DF = 10, P-Value = 0.267

Table 3.3.16: Consistency check on Austen's style

In Table 3.3.17 the test statistic for the between-author comparison (X2 = 32.8) is highly statistically significant (the critical value for a significance level of 0.05 with 5 degrees of freedom is 11.07). This indicates a divergence between the authors in the relative frequency of use of one or more of the selected words. Note the very large contributions to the overall chi-square statistic given by 'an' (13.899 for the imitator) and 'that' (9.298 for the imitator). Table 3.3.18 shows the relative frequencies (expressed as percentages) of the six words for the two authors.

Chi-Square Test: Austen, Imitator

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

           Austen   Imitator   Total
a             434         83     517
           433.46      83.54
            0.001      0.003

an             62         29      91
            76.30      14.70
            2.679     13.899

this           86         15     101
            84.68      16.32
            0.021      0.107

that          236         22     258
           216.31      41.69
            1.792      9.298

with          161         43     204
           171.04      32.96
            0.589      3.056

without        38          4      42
            35.21       6.79
            0.220      1.144

Total        1017        196    1213

Chi-Sq = 32.810, DF = 5, P-Value = 0.000

Table 3.3.17: Between-author comparison

Relative to his use of the other words, the imitator uses 'an' much more frequently than does Austen (15% versus 6%).
Austen, on the other hand, uses 'that' more frequently (relative to her use of the other words) than does the imitator (23% versus 11%). The impact of these divergences can be seen in Table 3.3.17. For Austen, 62 instances of 'an' were observed; on the equal relative frequency assumption we would have expected 76.3. For the imitator, the corresponding values are 29 and 14.7. Note that the divergences are equal in magnitude but opposite in sign, ±14.3 (as they must be, since the row sums of both the observed and expected values equal 91 and, hence, the sum of the cell divergences is zero). The contribution to X2 is, however, greater for the imitator since the contribution is:

(Oij – Eij)2 / Eij

and Eij is only 14.7 for the imitator, whereas it is 76.3 for Austen. Similar comments apply to the use of 'that'.

Word       Austen   Imitator
a              43         42
an              6         15
this            8          8
that           23         11
with           16         22
without         4          2

Table 3.3.18: Relative frequencies (column percentages) of word use

When 'an' and 'that' are excluded from the word list, the chi-square test that compares the distributions for the remaining words is not statistically significant, as shown in Table 3.3.19. This highlights the potential influence of the arbitrary nature of the word selection in stylistic comparisons.

Chi-Square Test: Austen, Imitator

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

           Austen   Imitator   Total
a             434         83     517
           430.23      86.77
            0.033      0.163

this           86         15     101
            84.05      16.95
            0.045      0.224

with          161         43     204
           169.76      34.24
            0.452      2.243

without        38          4      42
            34.95       7.05
            0.266      1.319

Total         719        145     864

Chi-Sq = 4.746, DF = 3, P-Value = 0.191

Table 3.3.19: The reduced word list comparison

In summary, the imitator appears to have matched the relative frequencies of Austen's use of 'a', 'this', 'with', and 'without', but failed to do so with respect to 'an' and 'that'.
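Both the full and the reduced word-list comparisons can be reproduced in a few lines; a sketch using scipy (assumed available):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Word counts for the combined Austen works versus the imitator (Table 3.3.17);
# rows: a, an, this, that, with, without
full_list = np.array([[434, 83],
                      [ 62, 29],
                      [ 86, 15],
                      [236, 22],
                      [161, 43],
                      [ 38,  4]])

stat, p, dof, _ = chi2_contingency(full_list)
print(round(stat, 2), dof)   # 32.81 on 5 df: highly significant

# Dropping 'an' and 'that' (rows 1 and 3) reproduces Table 3.3.19
reduced = np.delete(full_list, [1, 3], axis=0)
stat_r, p_r, dof_r, _ = chi2_contingency(reduced)
print(round(stat_r, 2))      # 4.75 on 3 df, p about 0.19: not significant
```

The contrast between the two tests makes the summary above concrete: the significance of the between-author comparison is driven almost entirely by 'an' and 'that'.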
Does this mean we will never see a BBC Sanditon series? Will Hollywood be quite so fastidious?

Example 12: Mendel's Pea Plant Data

The data consist of 529 pea plants cross-classified into a two-way table according to the shape [round, round and wrinkled, wrinkled] and colour [yellow, yellow and green, green] of their seeds. They are typical of the type of data presented in discussions of simple Mendelian theories of inheritance. What is of interest here is whether the row and column classifications can or cannot be considered independent of each other: this carries implications for the underlying genetic mechanisms at work.

                          Colour
                 Yellow    Y+G   Green   Totals
        Round        38     65      35      138
Shape   R+W          60    138      67      265
        Wrinkled     28     68      30      126
        Totals      126    271     132      529

Table 3.3.20: Mendel's pea plant data

Statistical Question

In the previous two examples, there were natural response characteristics (e.g., bought, heard but not bought, and never heard) in each of several defined groups (regions for the market research data). An analysis expressed in terms of the equality of the proportions having these characteristics in each of the groups was a natural approach to addressing the question of interest. In the current example, however, neither classification factor (shape or colour) can be thought of as a response while the other drives that response in some way – here, the row and column classifications have equal status. Accordingly, the research question is expressed in the form: "is the row classification independent of the column classification?" For the data of Table 3.3.20 this would mean, for example, that the probability that a randomly selected seed is yellow does not depend on whether or not the seed is round.

Analysis

Seed shape and colour are genetically determined.
In asking if the row and column classifications of the 529 plants are independent of each other, what we are really concerned with are the underlying genetic mechanisms determining seed characteristics. Rejection of the null hypothesis of independence would suggest that the genetic mechanisms are linked in some way.

Chi-Square Test: Yellow, Y+G, Green

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

            Yellow      Y+G    Green   Total
Round           38       65       35     138
             32.87    70.70    34.43
             0.801    0.459    0.009

R+W             60      138       67     265
             63.12   135.76    66.12
             0.154    0.037    0.012

Wrinkled        28       68       30     126
             30.01    64.55    31.44
             0.135    0.185    0.066

Total          126      271      132     529

Chi-Sq = 1.857, DF = 4, P-Value = 0.762

Table 3.3.21: Chi-square test of row-column independence

We estimate the probability that a randomly selected seed will be Round as 138/529, while the probability that it will be Yellow is estimated as 126/529, based on the data in the margins of Table 3.3.20. If the two characteristics are independent of each other then the product of these two sample estimates gives us an estimate of the probability that both characteristics are present simultaneously:

P[Round and Yellow] = P[Round] x P[Yellow] = (138/529) x (126/529)

The estimate of the expected number of round and yellow seeds is given by:

N x P[Round and Yellow] = N x P[Round] x P[Yellow] = 529 x (138/529) x (126/529) = (138 x 126)/529 = 32.87

Note that the expected value is given by:

E11 = (Row Total x Column Total) / Grand Total

as before. Precisely the same logic, applied to the other cells of the table, gives a complete set of expected values under the independence assumption (these are shown in Table 3.3.21). If the sample size were infinite then the observed and expected values would coincide (assuming independence).
Because of the small sample size and the inevitable chance variation in biological systems, we do not expect exact correspondence, even if the two characteristics are independent. However, if they are independent, the X2 measure of divergence should be smaller than the critical value on a chi-square distribution with 4 degrees of freedom. If we choose a significance level of α = 0.05, then our critical value is 9.49. Since the value of X2 in Table 3.3.21 is much smaller than this, there is no basis for rejecting the null hypothesis of independence of the row and column classifications. The observed data are consistent with the independence assumption. It would appear that the genetic mechanisms that determine seed shape and colour do not affect each other.

Exercises

3.3.3 In a study involving high-altitude balloon flights near the north magnetic pole, the numbers of positive and negative electrons in cosmic ray showers were counted. Each particle was classified by charge and into one of three energy intervals (MeV stands for millions of electron volts) as shown in Table 3.3.22. I have lost the source of these data, so no further information is available. Studies like this were the most important source of information on high-energy sub-atomic particles before the construction of high-energy particle accelerators. Cosmic rays are still an important and active research area (see Wikipedia for a long account of the history and current research).

                         Energy Interval (MeV)
Electron Charge   50-100   101-300   301-1000   Totals
Positive               9        32         23       64
Negative              20        51        117      188

Table 3.3.22: Cosmic ray shower data

(a) What can we learn about the charge-energy characteristics of the two particle types from the data? Carry out a chi-square test. What is the null hypothesis? If you get a statistically significant result, carry out further analyses (confidence intervals) to tease out the differences.
(b) Suppose it was realised, subsequent to your analysis, that the energy measurements had been faulty in such a way that electrons of up to 200 MeV could sometimes have been assigned to the lowest energy interval. Re-analyse the data in the light of this information. Do your initial conclusions still hold?

3.3.4 Box, Hunter and Hunter [17] report the data shown below for five hospitals. The frequencies represent patient status after a surgical procedure designed to improve the functioning of certain joints which have become impaired by disease. Patients were classified by post-operative status: no improvement, partial functional restoration and complete functional restoration.

Hospital                   A    B    C    D    E   Totals
No improvement            13    5    8   21   43       90
Partial funct. rest.      18   10   36   56   29      149
Complete funct. rest.     16   16   35   51   10      128
Totals                    47   31   79  128   82      367

Table 3.3.23: Post-operative improvement status of patients at five hospitals

The authors describe hospital E as a 'referral hospital', which presumably means that it is a specialised centre where more difficult procedures are carried out. It is to be expected, therefore, that the post-operative patient characteristics could be different from those of the other hospitals. Analyse the data and present your results in both tabular and graphical form.

3.3.5 Hauer, Ansorge and Zinke [8] report data on otters that were either killed on the road or shot by hunters in eastern Germany over a 40-year time period. Note that the numbers were recovered from summaries in the original paper: the total sample size here is 1023, but the original paper refers to a total sample size of 1027. Table 3.3.24 shows the N=1023 otters cross-classified into a 2x2 table with age-group and sex as the row and column variables, respectively. The authors of the paper described otters of age four or more years as 'grown adults'; I have described otters younger than this as 'young'.
          Male   Female   Totals
Young      286      153      439
Adults     312      272      584
Totals     598      425     1023

Table 3.3.24: Otters cross-classified by age and sex

These data were analysed in Exercise 3.2.4 as a comparison of two independent proportions (the proportion of 'young' animals for each sex). Now investigate whether the row and column classifications can be considered independent (as for the Mendelian data), using a chi-square test to answer the question.

3.4 Goodness-of-Fit Tests

We have used the chi-square test to compare two or more empirical frequency distributions (for example, in the Market Research Study we compared the numbers of consumers falling into each of three 'market penetration' categories, for three regions), to determine whether or not the underlying distributions could be considered the same. Sometimes it is desired to assess whether or not a single frequency distribution corresponds to an expected theoretical distribution; the test is then called a 'goodness-of-fit' test. For example, we might ask if the numbers on a roulette wheel are equally likely to turn up – if not, we might conclude that the wheel is inherently biased or that the wheel operator can influence the outcome of the game.

Example 13: Random Number Generation

Random numbers are used in a variety of important problems, from solving mathematical equations that are too difficult to analyse analytically, to creating simulation models of physical systems (or computer games), to selecting random samples from lists of people for inclusion in a survey. In practice, the numbers used are very often 'pseudo-random', that is, they are generated by a deterministic algorithm with a view to creating a list of numbers that have the characteristics (suitably defined) of random numbers. I generated 10,000 digits, 0–9, using the random number generator in Minitab and obtained the results shown in Table 3.4.1.
Digits              0      1      2      3      4      5      6      7      8      9   Total
Observed freq.   1036    960   1059    946   1029    995    974    959   1045    997  10000

Table 3.4.1: A set of pseudo-random digits

Perhaps the simplest characteristic we might expect in a set of random digits is that the ten values are equally likely to appear. Does the set of frequencies I have generated suggest that the random number generator is well-behaved in this regard?

Our null hypothesis is that the probability of occurrence is the same for each digit, π = 0.10. The alternative hypothesis is that not all probabilities are equal to 0.10. If the null hypothesis is true, then the expected frequency is 1000 [N x π = 10,000(0.10)] for each digit, as shown in Table 3.4.2. The chi-square test statistic is calculated as before:

X2 = Σ (Observed – Expected)2 / Expected = 14.55    (summed over the ten digits)

Digits              0      1      2      3      4      5      6      7      8      9   Totals
Observed freq.   1036    960   1059    946   1029    995    974    959   1045    997  10000
Expected freq.   1000   1000   1000   1000   1000   1000   1000   1000   1000   1000  10000
Chi-square      1.296  1.600  3.481  2.916  0.841  0.025  0.676  1.681  2.025  0.009  14.55

Table 3.4.2: A chi-square goodness-of-fit test for the random digits

The ten observed values and the ten expected values in Table 3.4.2 each add to 10,000 and, therefore, the set of differences [Oi – Ei] sums to zero. Accordingly, once nine expected values (or differences) are calculated, the last one is automatically determined. This means that, when the null hypothesis is true, the X2 statistic follows a chi-square distribution with 9 degrees of freedom. The critical value for a test having a significance level of 0.05 is Xc2 = 16.9. Since our test statistic of X2 = 14.55 is less than this, we cannot reject the null hypothesis that the digits are equally likely to occur. The area beyond X2 = 14.55 in the right-hand tail of a chi-square distribution with 9 degrees of freedom is 0.10; this is the p-value for our test.
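This goodness-of-fit calculation can be reproduced directly; a minimal sketch (scipy assumed):

```python
from scipy.stats import chisquare

# Observed digit frequencies from Table 3.4.1
observed = [1036, 960, 1059, 946, 1029, 995, 974, 959, 1045, 997]

# With no expected frequencies supplied, chisquare() tests against equal
# expected counts (here 1000 per digit), i.e. the null hypothesis above
stat, p_value = chisquare(observed)
print(stat, p_value)   # X2 = 14.55, p about 0.10, matching the text
```

The default degrees of freedom are the number of categories minus 1, which gives the 9 degrees of freedom used above.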
The fact that it exceeds 0.05 means that we cannot reject the null hypothesis, if we have previously decided on a significance level of 0.05. The observed frequency distribution is consistent with the pseudo-random number generator producing equal relative frequencies for the ten digits.

Example 14: Lotto Numbers

Table 3.4.3 shows information on the frequencies of occurrence of the different numbers in the Irish National Lottery (downloaded from the Lotto website on 30/7/2007). The mechanism used to select the numbers is physical rather than algorithmic (balls are picked from a large glass bubble), and it is obviously important that the numbers are equally likely to be picked. I decided to use only the first 32 (the original number of balls when the Lotto was established) as I was unclear on whether the numbers after 32 had been included to the same extent in all draws (clearly 43-45 had not).

 1=248    2=235    3=236    4=242    5=214    6=219    7=226    8=205    9=255
10=233   11=255   12=220   13=238   14=259   15=227   16=235   17=200   18=242
19=217   20=230   21=236   22=222   23=217   24=224   25=213   26=192   27=210
28=233   29=198   30=217   31=217   32=253   33=194   34=236   35=216   36=228
37=213   38=213   39=226   40=214   41=229   42=220   43=10    44=17    45=8

NOTE: The numbers 43, 44 and 45 were introduced on the 28th October, 2006

Table 3.4.3: Frequencies for the Lotto numbers

The total of the 32 observed frequencies is N=7268. On the assumption that all numbers are equally likely (the null hypothesis), the expected value for each of the 32 numbers is E = 7268/32 = 227.125. We calculate the X2 statistic as above and obtain X2 = 40.2. As before, the degrees of freedom for the sampling distribution of X2 (when the null hypothesis is true) will be the number of categories (K) minus 1, (K – 1) = 31. For a test with a significance level of 0.05, the critical value is Xc2 = 45.0; our test statistic is not statistically significant – the corresponding p-value is 0.12.
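The Lotto calculation follows the same pattern as the digit test; a sketch (scipy assumed, frequencies for numbers 1-32 from Table 3.4.3):

```python
from scipy.stats import chisquare

# Frequencies of Lotto numbers 1-32, from Table 3.4.3
observed = [248, 235, 236, 242, 214, 219, 226, 205, 255, 233,
            255, 220, 238, 259, 227, 235, 200, 242, 217, 230,
            236, 222, 217, 224, 213, 192, 210, 233, 198, 217,
            217, 253]

# Null hypothesis: all 32 numbers equally likely, so E = 7268/32 = 227.125
stat, p_value = chisquare(observed)
print(stat, p_value)   # X2 about 40.2 on 31 df, p about 0.12
```

This reproduces the X2 = 40.2 and p-value of 0.12 quoted above.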
We cannot reject the null hypothesis of equally likely outcomes – the test does not provide any evidence of bias in the selection of Lotto numbers.

Example 15: Mendelian Ratios

Samuels [1] reports data from a breeding study [18] in which white chickens with small combs were mated and produced 190 offspring, which were classified as shown in Table 3.4.4.

Feathers/Comb size   Observed Freq.
White, small                    111
White, large                     37
Dark, small                      34
Dark, large                       8
Total                           190

Table 3.4.4: Breeding study results

This table is a typical example of Mendelian genetics data, where there are two characteristics, here Feather Colour (white is dominant) and Comb Size (small is dominant), each determined by a single pair of genes; the members of the pair may appear in either of two versions. Dominant means that if even one copy of this version of the gene is present, the organism will have the appearance corresponding to the dominant gene. The parents in this study had one dominant and one recessive gene for each characteristic; according to Mendelian theory their offspring would be expected to be born in the ratios ¾ showing the appearance (phenotype) of the dominant characteristic and ¼ showing the recessive characteristic. How do these ratios arise? Mendelian theory suggests the following mechanism. The process begins with parents with pure genetic characteristics: both copies of the gene for the characteristic of interest are the same – either both dominant, D, or both recessive, r. When two such parents produce offspring, the genetic structure of the offspring is necessarily hybrid, D-r, as each parent contributes one copy of the gene. When two first-generation hybrid offspring breed, the second generation appear as shown in Figure 3.4.1.
Figure 3.4.1: Mendelian mechanism for second generation ratios

If the two characteristics are independently determined, then we expect the combined characteristics to appear with probabilities 9/16, 3/16, 3/16, 1/16, that is, in the ratios 9:3:3:1, as shown in Table 3.4.5 (see footnote 9).

              White: 3/4   Dark: 1/4
Small: 3/4       9/16         3/16
Large: 1/4       3/16         1/16

Table 3.4.5: The expected ratios, given independence

9 Recall, if two characteristics A and B are independent, then the probability that they both occur together is the product of their separate probabilities of occurring: P(A and B) = P(A).P(B). Hence, P(small and white) = P(small).P(white) = 3/4 x 3/4 = 9/16.

Our null hypothesis is that the Mendelian ratios 9:3:3:1 describe the underlying frequency structure, i.e., the two characteristics are independently determined. A statistical test essentially asks if the observed frequencies differ from those predicted by the probabilities 9/16:3/16:3/16:1/16 by more than would be expected from chance biological variation (the expected frequencies are 190 times the corresponding probabilities). Table 3.4.6 shows the observed and expected values and the contributions to the overall chi-square test statistic X²=1.55 for the four categories. The critical value for a test with a significance level of 0.05 is X²c=7.8, for a chi-square distribution with 3 degrees of freedom [DF = No. of categories – 1]. The test is non-significant; the corresponding p-value is 0.67. The empirical data are consistent with the classic Mendelian theory.

Feathers/Comb size   Ratios   Obs. Freq.   Expect. Freq.   Chi-square
White, small          9/16       111          106.875         0.159
White, large          3/16        37           35.625         0.053
Dark, small           3/16        34           35.625         0.074
Dark, large           1/16         8           11.875         1.264
Total                 1.00       190          190.000         1.551

Table 3.4.6: A chi-square analysis of the chicken breeding data

Example 16: Poisson Distribution – α-particles

The Poisson distribution is a probability model for rare events, where the rate of occurrence (λ) is constant in either space or time and events occur independently. For example, suppose a particular complication in the delivery of babies occurs at a rate of λ=3 per month in a large maternity hospital. We can calculate the probabilities for given numbers of such events (x) in any one month (see footnote 10) using the formula:

P(x) = λ^x e^(–λ)/x!   where λ=3 and x = 0, 1, 2, 3, 4, …

thus

P(0) = e^(–3) = 0.0498
P(1) = 3e^(–3) = 0.1494
P(2) = 3²e^(–3)/2! = 0.2240
P(3) = 3³e^(–3)/3! = 0.2240
P(4) = 3⁴e^(–3)/4! = 0.1680

10 Note that the requirement that events occur independently means that the model could be applied to numbers of deliveries, but not to numbers of births, if twin births were a possibility for the event being modelled.

The value of such calculations lies in the fact that if special teams/equipment are required for these deliveries, then we can evaluate the probabilities for different numbers of events and plan resources (contingency plans) accordingly. The Poisson model is a fundamental tool in analysing the likelihood of events such as the occurrence of equipment failure, overloads on computer systems (e.g., internet switching devices), blackouts on power transmission networks, etc. Simulation of such systems will require generation of Poisson random numbers (or of a closely related distribution). Accordingly, we might want to test random numbers to determine if they behave as if they come from a Poisson distribution, rather than being simply equally likely, as in Example 13. The example I have chosen to illustrate testing such a distribution refers to a physics experiment.
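As a minimal sketch (plain Python, not part of the manual's Minitab-based workflow), the monthly probabilities P(0) to P(4) follow directly from the Poisson formula:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    # P(x) = lam^x * e^(-lam) / x!
    return lam ** x * exp(-lam) / factorial(x)

# Delivery complications at a rate of lam = 3 per month, as in the example.
probs = [poisson_pmf(x, 3) for x in range(5)]
print([round(p, 4) for p in probs])   # [0.0498, 0.1494, 0.224, 0.224, 0.168]
```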
Rice [10] reports an analysis by Berkson [19] of radioactivity data observed at the American Bureau of Standards. The data are summarised in Table 3.4.7: the observed frequencies refer to the number of 10-second intervals in which different numbers (x) of α-particles were emitted from an Americium-241 source. The total number of particles observed divided by the total observation time gave 0.8392 particles per second, which corresponds to λ=8.392 particles per 10-second interval. As this value is calculated from the data, it is only an estimate of the 'true' underlying rate (this is relevant to the analysis).

The question of interest is whether or not the data can be modelled by a Poisson distribution. If the answer is yes, it suggests that the individual emissions are independent of each other (independence is a key assumption underlying the Poisson model) – which carries implications for the underlying physical processes.

To calculate the set of expected frequencies (see footnote 11), we first use the Poisson model with parameter λ=8.392 to get a set of probabilities P(x) for different numbers of particles (x) in any one interval, and then multiply these by N=1207, the total number of intervals, to get the expected number of intervals with x particles [Expected value, E(x) = NP(x)]. For example:

P(x) = λ^x e^(–λ)/x!   x = 0, 1, 2, 3, 4, …

P(3) = 8.392³ e^(–8.392)/3! = 0.022328

and the expected number of intervals with three particles is:

E[No. of intervals with 3 particles] = NP(3) = 1207 × 0.022328 = 26.95

as shown in Table 3.4.8.

11 In calculating expected frequencies it is recommended that the cells should have an expectation of at least 1 (some say 5) to ensure that the approximation to the chi-square distribution is good; cells should be combined to ensure that this happens. This is the reason the first row of the table accumulates x=0, 1, 2; otherwise the expected values corresponding to x=0 and 1 would both be less than 5.
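The expected-frequency calculation for x=3 can be checked with a few lines of Python (a sketch; `poisson_pmf` is simply the formula above written out, not a library routine):

```python
from math import exp, factorial

lam = 8.392   # estimated rate per 10-second interval
N = 1207      # total number of intervals observed

def poisson_pmf(x, lam):
    # P(x) = lam^x * e^(-lam) / x!
    return lam ** x * exp(-lam) / factorial(x)

p3 = poisson_pmf(3, lam)                # probability of exactly 3 particles
print(round(p3, 5), round(N * p3, 2))   # 0.02233 26.95
```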
The probability for the first row of the table is the sum P(0)+P(1)+P(2) = P(x<3) = 0.010111. Multiplying this by 1207, we get the expected frequency of 12.20 shown in the first row of Table 3.4.8. Since there is no upper bound on the number of particles in a time interval, we calculate P(x>16) = 1 – [P(0)+P(1)+P(2)+…+P(16)] = 0.00587, which, when multiplied by 1207, gives the last expected frequency of 7.09 in Table 3.4.8.

No. particles (x)   Frequency
0-2                     18
3                       28
4                       56
5                      105
6                      126
7                      146
8                      164
9                      161
10                     123
11                     101
12                      74
13                      53
14                      23
15                      15
16                       9
17+                      5
Total                 1207

Table 3.4.7: Numbers of intervals in which (x) α-particles were observed

The last column of Table 3.4.8 is calculated using:

(Oi – Ei)²/Ei

where (i) labels the rows.

No. particles (x)   Observed Freq.   Expected Freq.   Chi-square
0-2                       18              12.20           2.75
3                         28              26.95           0.04
4                         56              56.54           0.01
5                        105              94.90           1.08
6                        126             132.73           0.34
7                        146             159.12           1.08
8                        164             166.92           0.05
9                        161             155.65           0.18
10                       123             130.62           0.44
11                       101              99.65           0.02
12                        74              69.69           0.27
13                        53              44.99           1.43
14                        23              26.97           0.58
15                        15              15.09           0.00
16                         9               7.91           0.15
17+                        5               7.09           0.61
Total                   1207            1207              9.04

Table 3.4.8: A chi-square analysis of the α-particle data

The overall X² statistic is 9.04, as shown in Table 3.4.8. To calculate the degrees of freedom, we subtract 1 from the number of categories, K, as before; we also subtract 1 degree of freedom for every parameter estimated from the data (see footnote 12) – here we estimated one parameter, λ, so we have:

Degrees of Freedom = No. of categories – 1 – No. of parameters estimated = K – 1 – 1 = 16 – 1 – 1 = 14

12 If we were fitting a Normal distribution to a set of data, we would need to estimate the mean, μ, and the standard deviation, σ, and so we would lose 2 degrees of freedom for parameter estimation (3 in total).
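The whole goodness-of-fit computation in Tables 3.4.7-3.4.8 can be sketched in Python (illustrative only; the pooling of 0-2 and the 17+ tail follows footnote 11, and the critical value would still come from chi-square tables):

```python
from math import exp, factorial

lam, N = 8.392, 1207
obs = [18, 28, 56, 105, 126, 146, 164, 161, 123, 101, 74, 53, 23, 15, 9, 5]

def pmf(x):
    # Poisson probability P(x) = lam^x * e^(-lam) / x!
    return lam ** x * exp(-lam) / factorial(x)

# Cell probabilities: {0,1,2} pooled, x = 3..16 singly, then the 17+ tail.
probs = [pmf(0) + pmf(1) + pmf(2)] + [pmf(x) for x in range(3, 17)]
probs.append(1 - sum(probs))              # P(x >= 17)
expected = [N * p for p in probs]

x2 = sum((o - e) ** 2 / e for o, e in zip(obs, expected))
df = len(obs) - 1 - 1                     # K - 1, less 1 for estimating lam
print(round(x2, 2), df)
```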
The critical value is X²c=23.7 for a test with a significance level of 0.05; our test statistic of 9.04 is very much less than this (the corresponding p-value is 0.83), so we cannot reject the null hypothesis – a Poisson model fits the data. This suggests that the α-particles are emitted independently of each other.

Exercise

3.4.1. Example 12 analysed 529 pea plants cross-classified into a two-way table according to the shape [round, round and wrinkled, wrinkled] and colour [yellow, yellow and green, green] of their seeds. The question of interest was whether the row and column classifications were independent. The data are reproduced below.

                    Shape
Colour     Round   R+W   Wrinkled   Totals
Yellow       38     60      28        126
Y+G          65    138      68        271
Green        35     67      30        132
Totals      138    265     126        529

Table 3.4.9: Mendel's pea plant data

Mendelian theory predicts that the frequencies for the three colours (Y: Y+G: G) will appear in the ratios 1:2:1 [or equivalently, the probability distribution is: ¼, ½, ¼]. Similarly, the shapes (R: R+W: W) are also predicted to appear in the same ratios. The mechanism is illustrated in Figure 3.4.1, page 52. Carry out a goodness-of-fit test to check that the observed data depart from the 1:2:1 ratios by no more than would be expected from chance variation. Do this for both the Colour and Shape distributions.

References

[1] Samuels, M.L., Statistics for the Life Sciences, Dellen Publishing Company, 1989.
[2] Brailovsky, D., Timolol maleate (MK-950): A new beta-blocking agent for prophylactic management of angina pectoris. A multicenter, multinational, cooperative trial. In Beta-Adrenergic Blocking Agents in the Management of Hypertension and Angina Pectoris, B. Magnani (ed.), Raven Press, 1974.
[3] Machin, D., Campbell, M.J., and Walters, S.J., Medical Statistics, Wiley, 2007.
[4] Campbell et al., A 24 year cohort study of mortality in slate workers in North Wales, Journal of Occupational Medicine, 55, 448-453, 2005.
[5] Breslow, N.E. and Day, N.E., Statistical Methods in Cancer Research: Vol. 1 – The Analysis of Case-Control Studies, International Agency for Research on Cancer, 1980.
[6] Tuyns, A.J., Pequignot, G. and Jensen, O.M., Le cancer de l'oesophage en Ille-et-Vilaine en fonction des niveaux de consommation d'alcool et de tabac, Bull. Cancer, 64, 45-60, 1977.
[7] Bland, M., An Introduction to Medical Statistics, Oxford University Press, 2000.
[8] Hauer, S., Ansorge, H. and Zinke, O., A long-term analysis of the age structure of otters (Lutra lutra) from eastern Germany, International Journal of Mammalian Biology, 65, 360-368, 2000.
[9] Bickel, P.J., Hammel, E.A. and O'Connell, J.W., Sex Bias in Graduate Admissions: Data from Berkeley, Science, Vol. 187, 398-404, 1975.
[10] Rice, J.A., Mathematical Statistics and Data Analysis, Duxbury Press, 1995.
[11] Smith, D.G., Prentice, R., Thompson, D.J., and Hermann, W.L., Association of exogenous estrogen and endometrial carcinoma, N. Eng. J. Med., 293, 1164-1167, 1975.
[12] Altman, D.G., Practical Statistics for Medical Research, Chapman and Hall, 1991.
[13] Stuart, M., Changing the teaching of statistics, The Statistician, Vol. 44, No. 1, 45-54, 1995.
[14] Stuart, M., An Introduction to Statistical Analysis for Business and Industry, Hodder, 2003.
[15] Ehrenberg, A.S.C., Data Reduction, Wiley, 1975.
[16] Morton, A.Q., Literary Detection, Scribner's, 1978.
[17] Box, G.E.P., Hunter, S., and Hunter, W.G., Statistics for Experimenters, Wiley, 1978 (note: a second edition appeared in 2005).
[18] Bateson, W. and Saunders, E.R., Reports for the Evolution Committee of the Royal Society, 1, 1-160, 1902.
[19] Berkson, J., Examination of the randomness of alpha particle emissions, in Research Papers in Statistics, F.N. David (ed.), Wiley, 1966.
Appendix 3.1

Let X and Y be two random quantities, each with its own long-run mean (expectation) and standard deviation. The square of the standard deviation is called the variance. The variance is of interest because when we combine random quantities the variances (but not the standard deviations) are additive, as discussed in Chapter 1, Section 1.4. Thus, it can be shown that if X and Y are independent (the value of X does not affect the probable value of Y) then:

VAR(X + Y) = VAR(X) + VAR(Y)   and   VAR(X – Y) = VAR(X) + VAR(Y)

This latter result is used in finding the standard error of (p1 – p2). We have seen that:

SE(p) = √[π(1 – π)/n]   and, hence,   VAR(p) = π(1 – π)/n

where p is a sample proportion, n the sample size, and π is the corresponding population proportion. It follows that:

VAR(p1 – p2) = π1(1 – π1)/n1 + π2(1 – π2)/n2

which gives

SE(p1 – p2) = √[π1(1 – π1)/n1 + π2(1 – π2)/n2].

It is intuitively obvious (and can be shown mathematically) that the long-run mean of (p1 – p2) is (π1 – π2). These results give us the sampling distribution of (p1 – p2) as shown in Figure 3.2.1, page 9.

The Special Case of two proportions that sum to 1

If (p1 + p2) = 1 then (p1 – p2) = [p1 – (1 – p1)] = (2p1 – 1). It follows that:

SE(p1 – p2) = SE(2p1 – 1) = SE(2p1) = 2SE(p1)

since '1' is a constant. Therefore,

SE(p1 – p2) = 2SE(p1) = 2√[π1(1 – π1)/n]

which is the result used in our analysis of the French presidential poll in Example 8, page 26.

Text: © Eamonn Mullins, 2013, Course Manual: PG Diploma in Statistics, TCD, Base Module. Data: see references for copyright owners.

Outline Solutions

3.1.1: Referendum poll

A 95% confidence interval for the population proportion who intended to vote yes in the referendum is given by:

p ± 1.96√[p(1 – p)/n] = 0.35 ± 1.96√[0.35(1 – 0.35)/1000] = 0.35 ± 0.03

i.e., we estimate that between 32% and 38% of the population of voters intended to vote for the Lisbon treaty.
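A quick Python check on this interval (a sketch alongside the manual's Minitab output; 1.96 is the usual Normal-approximation multiplier):

```python
from math import sqrt

p, n = 0.35, 1000          # 350 'yes' answers out of 1000 voters sampled

se = sqrt(p * (1 - p) / n)             # standard error of the sample proportion
lo, hi = p - 1.96 * se, p + 1.96 * se  # 95% confidence limits
print(round(lo, 4), round(hi, 4))      # 0.3204 0.3796
```

The limits agree with the Minitab interval (0.320438, 0.379562) to rounding.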
A Minitab analysis is shown below.

Test and CI for One Proportion

Sample   Yes   N      Sample p   95% CI
1        350   1000   0.350000   (0.320438, 0.379562)

Using the normal approximation.

Table 3.1.1.1: Minitab analysis of referendum data

3.2.1: Referendum polls

First, we will carry out a Z-test of the equality of the population proportions. As for the Timolol study, we frame our question in terms of a null hypothesis which considers the long-run proportions as being the same and an alternative hypothesis which denies their equality:

Ho: π1 – π2 = 0
H1: π1 – π2 ≠ 0

where π1 is the proportion of the whole electorate who intended to vote "yes" in May, while π2 is the corresponding proportion in January. We calculate a Z statistic:

Z = [(p1 – p2) – 0] / √[π(1 – π)/n1 + π(1 – π)/n2]

– this will have a standard Normal distribution. Our estimate of the common π (when the null hypothesis is true) is p = 610/2000 = 0.305, giving:

Z = [(0.35 – 0.26) – 0] / √[0.305(1 – 0.305)/1000 + 0.305(1 – 0.305)/1000] = 4.37

When compared to the critical values of ±1.96 corresponding to a significance level of 0.05, the Z-statistic is highly statistically significant, leading us to conclude that the population proportions are not the same.

A 95% confidence interval for the difference between the population proportions is given by:

(p1 – p2) ± 1.96√[p1(1 – p1)/n1 + p2(1 – p2)/n2]

For the referendum poll results this gives:

(0.35 – 0.26) ± 1.96√[0.35(1 – 0.35)/1000 + 0.26(1 – 0.26)/1000] = 0.09 ± 0.04

We are 95% confident that the difference between the population proportions is somewhere between 5 and 13 percentage points. A Minitab analysis is shown below.
Test and CI for Two Proportions

Sample   'Yes'   N      Sample p
1        350     1000   0.350000
2        260     1000   0.260000

Difference = p(1) - p(2)
Estimate for difference: 0.09
95% CI for difference: (0.0498375, 0.130163)
Test for difference = 0 (vs not = 0): Z = 4.37  P-Value = 0.000

Table 3.2.1.1: Minitab comparison of survey results

3.2.2: Influenza vaccination study

This exercise requires the same analyses for the test and confidence interval as did Exercise 3.2.1 – the Minitab output given below shows the required results.

Test and CI for Two Proportions

Sample   Influenza   N     Sample p
1        80          220   0.363636
2        20          240   0.083333

Difference = p(1) - p(2)
Estimate for difference: 0.280303
95% CI for difference: (0.207754, 0.352852)
Test for difference = 0 (vs not = 0): Z = 7.28  P-Value = 0.000

Table 3.2.2.1: Minitab analysis of vaccine data

The influenza data are set out in tabular form in Table 3.2.2.2, to facilitate the calculation of the odds ratio.

           Influenza   No influenza   Totals               Inf.   No Inf.   Totals
Control        80           140         220      Control     a        b       a+b
Vaccine        20           220         240      Vaccine     c        d       c+d
Totals        100           360         460      Totals     a+c      b+d       n

Table 3.2.2.2: The influenza vaccine data

OR = (a/b) / (c/d) = ad/bc = (80/140) / (20/220) = 6.3

The study suggests that the odds of contracting influenza are more than six times higher for the control group. A one-sample confidence interval for the population incidence of influenza, if the entire population (of adults) had been vaccinated, is shown in the Minitab output below. Between 5 and 12% of adults would be expected to contract influenza, even though they had been vaccinated.

Test and CI for One Proportion

Sample   Influenza   N     Sample p   95% CI
1        20          240   0.083333   (0.048366, 0.118300)

Using the normal approximation.
Table 3.2.2.3: Estimating the incidence of influenza in the population

3.2.3: Childhood respiratory disease study

The statistical test and confidence interval for comparing the sample proportions are presented below and show a highly statistically significant difference.

Test and CI for Two Proportions

Sample   Cough   N      Sample p
1        26      273    0.095238
2        44      1046   0.042065

Difference = p(1) - p(2)
Estimate for difference: 0.0531731
95% CI for difference: (0.0162884, 0.0900577)
Test for difference = 0 (vs not = 0): Z = 3.49  P-Value = 0.000

Table 3.2.3.1: Minitab analysis of bronchitis data

             Coughs   No coughs   Totals                  Coughs   No coughs   Totals
Bronchitis     26        247        273      Bronchitis      a         b        a+b
Control        44       1002       1046      Control         c         d        c+d
Totals         70       1249       1319      Totals         a+c       b+d        n

Table 3.2.3.2: The bronchitis data

OR = (a/b) / (c/d) = ad/bc = (26/247) / (44/1002) = 2.4

The study suggests that the odds of having a persistent cough in later life are more than twice as high for the children who suffered from bronchitis in infancy as they are for the control group.

3.2.4: Otter deaths in eastern Germany

For the otter data a natural practical question is 'by how much does the long-run proportion of young male deaths exceed that for young females?' In addressing this question we compare the sample proportions; in doing so, we regard the column totals as fixed in making our comparison.

Test and CI for Two Proportions

Sample    Young   N     Sample p
Males     286     598   0.478261
Females   153     425   0.360000

Difference = p(1) - p(2)
Estimate for difference: 0.118261
95% CI for difference: (0.0575530, 0.178969)
Test for difference = 0 (vs not = 0): Z = 3.77  P-Value = 0.000

Table 3.2.4.1: Minitab comparison of sample proportions for otters

The confidence interval suggests that the long-run percentage of young male deaths is between 6 and 18 percentage points higher than the corresponding percentage for female otters.
The odds-ratio is 1.63 – the odds of a male being young when killed are about 60% higher than those for a female. The original paper suggests reasons for such a difference. These include the larger territorial size and higher activity levels, particularly of younger males searching for a free territory, leading to increased exposure to road traffic (69% of the deaths were attributed to road kills).

3.2.5: Case-control contraception study

Table 3.2.5.1 shows the 155 pairs cross-classified by exposure to oral contraceptives. The question of interest is whether or not there is an association between exposure to oral contraception and the occurrence of a stroke.

                     Controls
Cases         Exposed   Not-expos.   Totals                Exposed   Not-expos.   Totals
Exposed          5          30          35      Exposed       a           b        a+b
Not-expos.      13         107         120      Not-expos.    c           d        c+d
Totals          18         137         155      Totals       a+c         b+d        n

Table 3.2.5.1: A matched case-control study of strokes and oral contraceptive use

To test the null hypothesis that the proportion of women (in the relevant population) who used oral contraceptives was the same for the population of women who experienced strokes [cases: p1=(a+b)/n] as for the population who did not [controls: p2=(a+c)/n], we calculate a Z-statistic:

Z = [(p1 – p2) – 0] / SE(p1 – p2) = (b – c)/√(b + c) = (30 – 13)/√(30 + 13) = 2.59

When compared to critical values of ±1.96 given by the conventional significance level of 0.05, the Z-value is clearly statistically significant: cases – those who had strokes – were more likely to have been exposed to oral contraception.

A confidence interval for the increased probability of exposure to oral contraception for cases is given by:

(p1 – p2) ± 1.96 (1/n)√[(b + c) – (b – c)²/n]

where p1 = 35/155 = 0.226 and p2 = 18/155 = 0.116, giving (p1 – p2) = 0.110.

The calculated confidence interval is 0.110 ± 0.085, or [0.025 to 0.195]; that is, a difference of between 3 and 20 percentage points.
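Because the pairs are matched, only the discordant counts b and c enter the test. A sketch of the calculation (plain Python, using the cell labels of Table 3.2.5.1):

```python
from math import sqrt

# Discordant pairs: b = case exposed / control not exposed,
#                   c = case not exposed / control exposed.
b, c, n = 30, 13, 155

z = (b - c) / sqrt(b + c)            # Z = (p1 - p2) / SE(p1 - p2)
diff = (b - c) / n                   # p1 - p2 = (a+b)/n - (a+c)/n
print(round(z, 2), round(diff, 3))   # 2.59 0.11
```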
Note that this is an old study and the formulations of modern contraceptives are quite different, so the results are of purely historical interest.

3.2.6: Self-paired allergy study

Table 3.2.6.1 shows the 173 self-pairs classified by response to the two chemicals. The question of interest is whether or not there is a differential response to the two chemicals.

              Croton Oil
DNCB          +      –     Totals                +      –     Totals
+            81     48      129           +      a      b      a+b
–            23     21       44           –      c      d      c+d
Totals      104     69      173         Totals  a+c    b+d      n

Table 3.2.6.1: 173 patients classified according to response to two chemicals

These data may be analysed in exactly the same way as the strokes/contraceptives data of Exercise 3.2.5. To test the null hypothesis that the probability of a negative response to DNCB was the same as that for Croton Oil, we can calculate a Z-statistic:

Z = [(p1 – p2) – 0] / SE(p1 – p2) = (b – c)/√(b + c) = (48 – 23)/√(48 + 23) = 2.97

When compared to critical values of ±1.96 given by the conventional significance level of 0.05, the Z-value is clearly statistically significant – patients (in the relevant population) were more likely to react negatively to DNCB than they were to Croton Oil.

A confidence interval for the increased probability of reacting negatively to DNCB is given by:

(p1 – p2) ± 1.96 (1/n)√[(b + c) – (b – c)²/n]

where p1 = 69/173 = 0.399 and p2 = 44/173 = 0.254, giving (p1 – p2) = 0.145.

The calculated confidence interval is 0.145 ± 0.098, which ranges from 0.05 to 0.24; that is, a difference of between 5 and 24 percentage points.

3.2.7: Otter deaths in eastern Germany

A 95% confidence interval for the difference between the long-run proportions of such deaths for males (π1) and females (π2) is:

(p1 – p2) ± 1.96(2)√[p1(1 – p1)/n] = (0.585 – 0.415) ± 3.92√[0.585(0.415)/1023] = 0.17 ± 0.06

This suggests that there is an excess of such violent deaths among males, of between 11 and 23 percentage points, over the percentage for female otters.

3.3.1: Slate workers study
A Minitab chi-square analysis of the slate workers' study is shown below. The X² value of 9.328 is simply the square of the Z-value shown on page 15 (apart from rounding). The conclusions from the chi-square analysis are the same, therefore, as those from the Z-test for comparing two proportions, as discussed on pages 14-15.

Chi-Square Test: C1, C2

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

                  Died    Surv.   Total
Slate workers      379      347     726
                352.30   373.70
                 2.024    1.908

Controls           230      299     529
                256.70   272.30
                 2.778    2.618

Total              609      646    1255

Chi-Sq = 9.328, DF = 1, P-Value = 0.002

Table 3.3.1.1: Chi-square analysis of the slate workers' data

3.3.2: Oesophageal cancer case-control study

A chi-square analysis of the oesophageal cancer case-control data is shown in Table 3.3.2.1, below – again there is a one-to-one correspondence with the Z-test of Table 3.2.6 (page 16).

Chi-Square Test: C3, C4

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

              80+      <80    Total
Cases          96      104      200
            42.05   157.95
           69.212   18.427

Controls      109      666      775
           162.95   612.05
           17.861    4.755

Total         205      770      975

Chi-Sq = 110.255, DF = 1, P-Value = 0.000

Table 3.3.2.1: Chi-square analysis of the oesophageal cancer data

3.3.3: Cosmic Rays

It is not obvious to me whether the physicists would ask if the energy distributions are different for the two particle types or if the charge distributions are different at different energy ranges. In analysing these data, therefore, I would frame the null hypothesis as a test of independence – are the energy and charge classifications independent? I would treat the two classifications as if they have equal status, in other words. A chi-square test on the data gives a test statistic of X²=14.01, as shown in Table 3.3.3.1.
This should be compared to a chi-square distribution with (3–1)(2–1)=2 degrees of freedom; for a significance level of α=0.05, the critical value is 5.99 and the null hypothesis of energy-charge independence is rejected.

Chi-Square Test: Positive, Negative

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

Energy Interval   Positive   Negative   Total
50-100                   9         20      29
                      7.37      21.63
                     0.363      0.124

101-300                 32         51      83
                     21.08      61.92
                     5.658      1.926

301-1000                23        117     140
                     35.56     104.44
                     4.434      1.509

Total                   64        188     252

Chi-Sq = 14.013, DF = 2, P-Value = 0.001

Table 3.3.3.1: Chi-square analysis of the cosmic ray data

Suppose we decide to view the data in terms of the energy distributions for the two particle types; positive electrons are called 'positrons' (as in Positron Emission Tomography (PET) scanners). Table 3.3.3.2 shows the energy distributions expressed as percentages, for the two particle types.

                        Electron Charge
Energy Interval (MeV)   Positive   Negative
50-100                      14         11
101-300                     50         27
301-1000                    36         62
                           100        100

Table 3.3.3.2: The energy distributions (percentage)

It is obvious that there is a much higher percentage of highest-energy particles in the negative charge group; there is a difference of 26 percentage points, i.e., 62 versus 36%. The 95% confidence interval for the difference is from 13 to 40 percentage points, as shown in Table 3.3.3.3.

Test and CI for Two Proportions

Sample   301+   N     Sample p
e-       117    188   0.622340
e+       23     64    0.359375

Difference = p(1) - p(2)
Estimate for difference: 0.262965
95% CI for difference: (0.126506, 0.399425)
Test for difference = 0 (vs not = 0): Z = 3.66  P-Value = 0.000

Table 3.3.3.3: Comparing proportions in the highest energy interval

Presumably, differences in the energy distributions would carry implications for the mechanisms generating the particles.
Thus, the non-symmetrical energy distributions observed here may rule in or rule out certain sources of these particles.

Faulty data collection

If the energy measurements cannot be trusted to differentiate between the two lower energy intervals, then it makes sense to combine them into a single 50-300 MeV band; the revised distributions are shown in Table 3.3.3.4.

                        Electron Charge
Energy Interval (MeV)   Positive   Negative
50-300                      41         71
301-1000                    23        117
Totals                      64        188

Table 3.3.3.4: Revised energy distributions

A chi-square analysis of this 2x2 table (shown in Table 3.3.3.5) is effectively equivalent to the comparison of two proportions in Table 3.3.3.3, since we focused there on the proportions in the highest energy interval, which remains unchanged. However, the null hypothesis underlying the Z-test for proportions in Table 3.3.3.3 is a 'homogeneity of proportions' hypothesis (i.e., the same set of long-run relative frequencies (π) applies to both types of electron), while that underlying Table 3.3.3.5 could be framed, alternatively, as an energy level-charge independence hypothesis. As we have seen, these are essentially the same hypothesis and the tests are equivalent: (Z = 3.66)² = 13.4 ≈ X² = 13.37, apart from rounding.

Chi-Square Test: Positron, Electron

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

Energy Interval   Positron   Electron   Total
Low                    41         71      112
                    28.44      83.56
                    5.542      1.887

High                   23        117      140
                    35.56     104.44
                    4.434      1.509

Total                  64        188      252

Chi-Sq = 13.372, DF = 1, P-Value = 0.000

Table 3.3.3.5: A chi-square analysis of the revised data

The overall conclusion of our analysis is that energy level and particle charge are not independent classifications and that there is a disproportionately higher fraction of (negatively charged) electrons in the highest energy band.
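The 2x2 calculation in Table 3.3.3.5 can be sketched directly (plain Python; as in the Minitab output, each expected count is row total × column total / N):

```python
# Revised cosmic-ray table (Table 3.3.3.4): rows are energy bands,
# columns are particle charge (positive, negative).
obs = [[41, 71],
       [23, 117]]

row_totals = [sum(row) for row in obs]
col_totals = [sum(col) for col in zip(*obs)]
n = sum(row_totals)

x2 = 0.0
for i, row in enumerate(obs):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected count under independence
        x2 += (o - e) ** 2 / e

print(round(x2, 2))   # 13.37
```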
3.3.4: Hospitals

We begin our analysis by excluding the referral hospital, E, and by examining the other four hospitals A-D to determine if their post-operative patient status distributions can be considered as being generated by the same set of underlying relative frequencies.

Chi-Square Test: A, B, C, D

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

               A        B        C        D    Total
No-imp        13        5        8       21       47
            7.75     5.11    13.03    21.11
           3.555    0.002    1.941    0.001

Partial       18       10       36       56      120
           19.79    13.05    33.26    53.89
           0.162    0.714    0.225    0.082

Complete      16       16       35       51      118
           19.46    12.84    32.71    53.00
           0.615    0.780    0.160    0.075

Total         47       31       79      128      285

Chi-Sq = 8.313, DF = 6, P-Value = 0.216

Table 3.3.4.1: Chi-square comparison of hospitals A-D

Table 3.3.4.1 shows a chi-square analysis of the 3x4 table. The X²=8.3 test statistic is non-significant when compared to the α=0.05 critical value of X²c=12.6 of the chi-square distribution with (3–1)(4–1)=6 degrees of freedom. The large p-value of 0.216 indicates that the test statistic would be expected to be bigger than 8.3 more than 20% of the time, even if the null hypothesis of equal underlying frequency distributions were true. Accordingly, we cannot reject the null hypothesis of equal underlying post-operative result performances for the four hospitals. It makes sense, therefore, to combine the data into one 'general hospital' distribution.

Table 3.3.4.2 shows the (rounded) percentages that fall into the three outcome categories for the general and referral hospitals.

Hospital   No improvement   Partial funct. rest.   Complete funct. rest.
General          16                  42                     41
Referral         52                  35                     12

Table 3.3.4.2: Percentages (rounded) in the three outcome categories

There are quite different response patterns, as might be expected. The referral hospital has a much lower 'complete recovery' rate
(12% versus 41%) and a much higher 'no improvement' rate (52% versus 16%). Table 3.3.4.3 is a chi-square analysis of this table.

Chi-Square Test: General, Referral

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

            General   Referral   Total
No-imp           47         43      90
              69.89      20.11
              7.497     26.058

Partial         120         29     149
             115.71      33.29
              0.159      0.553

Complete        118         10     128
              99.40      28.60
              3.480     12.096

Total           285         82     367

Chi-Sq = 49.844, DF = 2, P-Value = 0.000

Table 3.3.4.3: Chi-square comparison of general and referral hospitals

The chi-square analysis (highly significant test statistic) provides statistical justification for interpreting the two different response distributions of Table 3.3.4.2: the chi-square result tells us that the differences are unlikely to be due to chance fluctuations.

Figure 3.3.4.1: Response 'profiles' for hospital types

Figure 3.3.4.1 shows one possible graphical representation of the results. It illustrates the difference in the 'response profiles' of the two types of hospital. The profiles do not have any intrinsic meaning, of course, as the horizontal axis is not a continuous scale. The profiles do, however, illustrate the fundamentally different response patterns and would (I think!) help a report reader who might be overwhelmed by the detail shown in Tables 3.3.4.1 and 3.3.4.3, in particular, to interpret the results of the study.

3.3.5: Otter deaths in eastern Germany

We address the question: 'is the row classification independent of the column classification?' In the current case this would mean that the probability that an otter is young when it is killed does not depend on its sex. If an otter is selected at random from the total sample of N=1023, then the probability that it will be male is P[male]=598/1023 and the probability that it will be young is P[young]=439/1023.
If two characteristics are independent of each other (in the sense that the occurrence of one does not affect the probable occurrence of the other) then the probability that they both occur simultaneously is given by the product of their separate probabilities. So, if independence holds:

P[young and male] = P[young] x P[male]

These sample probabilities are, of course, only estimates of corresponding population values. Under independence the estimate of the expected number of young males is given by:

N x P[young and male] = N x P[young] x P[male]
                      = 1023 x (439/1023) x (598/1023)
                      = (439 x 598)/1023
                      = 256.6

The estimated expected values and the contributions to the X2 statistic for all cells are shown in the Minitab analysis in Table 3.3.5.1. The sum of all the cell contributions gives an overall test statistic X2 = 14.183, which is highly statistically significant when compared to a critical value of 3.84 (again ‘degrees of freedom’ = 1). The test statistic and the corresponding p-value (p < 0.0005) suggest that the two classifications are not independent of each other: the observed frequencies deviate too much from those calculated on the independence assumption.
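The expected count calculation and the overall test can be checked with a few lines of code. A sketch, assuming Python/scipy (note that scipy applies Yates' continuity correction to 2x2 tables by default, so `correction=False` is needed to match the plain Pearson statistic reported by Minitab):

```python
from scipy.stats import chi2_contingency

N = 1023
# Observed otter counts (rows: young, adult; columns: male, female).
observed = [[286, 153],
            [312, 272]]

# Expected count for young males under independence:
# N * P[young] * P[male] = (row total * column total) / N.
expected_young_male = 439 * 598 / N
print(round(expected_young_male, 2))   # 256.62

# correction=False gives the uncorrected Pearson X2 of the text.
stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(stat, 3), dof)             # ≈ 14.183 on 1 df
```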
Chi-Square Test: Male, Female

Expected counts are printed below observed counts
Chi-Square contributions are printed below expected counts

             Male    Female    Total
Young         286       153      439
           256.62    182.38
            3.364     4.733

Adult         312       272      584
           341.38    242.62
            2.529     3.558

Total         598       425     1023

Chi-Sq = 14.183, DF = 1, P-Value = 0.000

Table 3.3.5.1: Minitab chi-square analysis of the otter data

Chi-square tests where the null hypothesis is expressed in terms of the equality of two or more proportions (as in the Timolol example) are often referred to as ‘tests for homogeneity of proportions’, while those for which the null hypothesis is posed in the manner just discussed are called ‘tests of independence’ or ‘tests of association’, i.e., they address the question: are frequencies for one classification ‘independent of’ or ‘associated with’ those of the other?

3.4.1: Mendel’s pea plant data

The observed colour distribution for the seeds is compared to the theoretical ratios of 1:2:1 in the Minitab output below. This gives a chi-square value of 0.46, with a corresponding p-value of 0.796. There is no reason to doubt the applicability of the Mendelian theory.

Chi-Square Goodness-of-Fit Test for Observed Counts in Variable: C5

                  Test                        Contribution
Category         Propn.  Observed  Expected    to Chi-Sq
Yellow             0.25       126    132.25     0.295369
Yellow+Green       0.50       271    264.50     0.159735
Green              0.25       132    132.25     0.000473

N = 529, DF = 2, Chi-Sq = 0.455577, P-Value = 0.796

Table 3.4.1.1: Minitab Goodness of Fit analysis of seed colours

Similarly, the observed shape distribution for the seeds is compared to the theoretical ratios of 1:2:1 in the Minitab output in Table 3.4.1.2. This gives a chi-square value of 0.55, with a corresponding p-value of 0.76. There is no reason to doubt the applicability of the Mendelian theory with respect to the shape distribution either.
Chi-Square Goodness-of-Fit Test for Observed Counts in Variable: C7

                   Test                        Contribution
Category          Propn.  Observed  Expected    to Chi-Sq
Round               0.25       138    132.25     0.250000
Round+Wrinkled      0.50       265    264.50     0.000945
Wrinkled            0.25       126    132.25     0.295369

N = 529, DF = 2, Chi-Sq = 0.546314, P-Value = 0.761

Table 3.4.1.2: Minitab Goodness of Fit analysis of seed shapes

© Text, Eamonn Mullins, 2013; data, see references.
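Both goodness-of-fit tests above can be reproduced from the observed counts and the 1:2:1 Mendelian proportions. A minimal sketch, assuming Python/scipy rather than Minitab:

```python
from scipy.stats import chisquare

n = 529
props = [0.25, 0.50, 0.25]          # Mendelian 1:2:1 ratios
expected = [p * n for p in props]   # 132.25, 264.50, 132.25

# Seed colours (yellow, yellow+green, green), Table 3.4.1.1.
stat_c, p_c = chisquare([126, 271, 132], f_exp=expected)
print(round(stat_c, 3), round(p_c, 3))   # 0.456 0.796

# Seed shapes (round, round+wrinkled, wrinkled), Table 3.4.1.2.
stat_s, p_s = chisquare([138, 265, 126], f_exp=expected)
print(round(stat_s, 3), round(p_s, 3))   # 0.546 0.761
```

Note that `chisquare` requires the expected counts to sum to the same total as the observed counts (529 here), and uses k − 1 = 2 degrees of freedom by default.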