Methods of Psychological Research Online 2000, Vol. 5, No. 2
© 2000 Institute for Science Education (IPN Kiel)
Internet: http://www.mpr-online.de

What sample sizes are needed to get correct significance levels for log-linear models? A Monte Carlo study using the SPSS procedure "Hiloglinear"

Ingeborg Stelzl[1]

Abstract

Pearson's χ² and the likelihood-ratio statistic G² are the most common and widely used test statistics for log-linear models. Both are asymptotically distributed as chi-squared variables. The present article reports the results of a Monte Carlo study which compares the two test statistics for two-, three- and four-dimensional contingency tables, employing conditions which may be judged reasonable for psychological research and using one of the most prominent computer programs (SPSS "Hiloglinear"). Our results are consistent with previous research in that, on the whole, Pearson's χ² behaves better than G². As a rule of thumb, Pearson's χ² will not result in severely inflated alpha values (empirical values of .075 or larger for a nominal level of .05) if the total sample size equals at least five times the number of cells and the smallest expected cell frequency is larger than 0.5. In contrast, the likelihood-ratio statistic G² yields severely inflated empirical alpha values for the higher interactions in some cases even if the total sample size equals ten times the number of cells and the smallest expected cell frequency is larger than one. In those cases where the sample size is large enough to use Pearson's χ², it is preferable to G², as it is generally closer to the nominal alpha. For cases not covered by this rule, parametric bootstrapping is recommended.

Key words: contingency tables, log-linear models, significance tests, Pearson's χ², likelihood ratio G², Monte Carlo study, simulation, SPSS Hiloglinear

[1] Author's address: Prof. Dr. Ingeborg Stelzl, Philipps-University, Fachbereich Psychologie, Gutenbergstr.
18, D-35032 Marburg, Germany, Tel.: +49-6421-2823669, Fax: +49-6421-2828929, e-mail: [email protected]

1. Introduction

Log-linear models may be used for the analysis of contingency tables to test hypotheses about main effects, two-way, and higher order interactions. Although several other procedures have been suggested (for an overview see Read & Cressie, 1988; for two-way tables also Goodman, 1996), Pearson's χ² and the likelihood-ratio statistic G² are still the best known and most widely used test statistics for significance testing. As both procedures are based on asymptotic theory, i. e., derived for large sample sizes only, we are left with the question which of the two statistics should be preferred with moderate sample sizes. The present paper reports the results of a Monte Carlo study from which some guidelines are derived to answer this question. During the last years, however, the role of significance testing in the behavioral and social sciences has been questioned on epistemological grounds, the reasons for significance testing have become controversial, and it has been asked whether significance testing should not be abandoned altogether (see Cohen, 1994; Gigerenzer, 1993; Harlow, Mulaik & Steiger, 1997; Sedlmeier, 1996; Iseler, 1997; Sedlmeier, 1998; Brandstätter, 1999). Therefore, we will first comment on how the specific topic of the present study may be embedded in this debate.
Some of the main objections raised against the practice of significance tests were: (1) that significance tests are often misunderstood and misinterpreted (for an overview and critical discussion of common misunderstandings see Mulaik, Raju and Harshman, 1997; for an attempt at a reformulation of the logical basis of significance testing see Harris, 1997; Jones, 1999); (2) that significance tests are superfluous, as the null hypothesis is always known to be false, and that they are prohibitive to scientific progress if non-significant results are not published (see Schmidt & Hunter, 1997); (3) that significance tests do not contain the relevant information concerning effect sizes and accuracy of estimation. (For alternatives, which may be used instead of significance tests or as supplements, see Brandstätter, 1999; Meehl, 1997; Reichardt & Gollob, 1997; Sedlmeier, 1996.) This debate may be expected to continue for a long time, as it reaches far into the philosophy of science. The weight of some arguments (e. g., that the null hypothesis is always known to be false) will also depend on substantive grounds and vary with the field of application. Nevertheless, presumably most researchers engaged in this discussion would agree on the following statements: (1) Significance tests must not be the main criterion for judging the scientific impact of an empirical study, e. g., in deciding whether or not it should be published. The results of an empirical study are reported insufficiently and inappropriately if only test statistics and p-values from significance tests are reported. At least for the main hypotheses, effect sizes (parameter estimates with confidence intervals and/or global effect size measures, e. g., proportion of predicted variance) should be given. Furthermore, there should be a descriptive summary of the data including as many details as possible (e.
g., all cell frequencies of a multidimensional contingency table if a log-linear model is employed, the complete correlation matrix if a structural equation model is to be fitted to the data). This will enable other authors to reanalyze the data with their own models. (2) Significance testing requires sample sizes that are sufficiently large to provide adequate statistical power (Cohen, 1988, suggests 0.8 as a minimum for standard cases, but higher values, e. g. 0.95, may be required depending on the specific hypothesis under question) for effect sizes which are judged to be relevant for the specific matter of research (that may be small, medium or large effects as defined by Cohen, 1988, or Erdfelder, Faul & Buchner, 1996). (3) If the results of significance tests are given as tail probabilities (p-values), these p-values should be computed correctly. If only asymptotic distributions are known for a test statistic, this raises the question what minimal sample sizes are required to get correct p-values from those asymptotic distributions. The present study contributes to this field with reference to a special class of statistical models, i. e., log-linear models for the analysis of contingency tables, for which two competing asymptotic test procedures, Pearson's χ² and G², are compared.

The test statistics in the study: Pearson's χ² and the likelihood-ratio statistic G²

Both test procedures require that the expected cell frequencies under the null hypothesis be estimated first. E. g., if a 4-factor interaction is to be tested, the model for the null hypothesis contains free parameters for all main effects, 2-factor and 3-factor interactions, whereas the parameters for the 4-factor interaction are fixed to zero. The two test statistics, Pearson's χ² and G², are defined as follows:

Pearson's χ²:

    χ² = Σ_{i=1}^{k} (x_i - m̂_i)² / m̂_i                    (1)

Likelihood-ratio statistic G²:

    G² = 2 Σ_{i=1}^{k} x_i ln(x_i / m̂_i)                    (2)

where i = 1, ...,
k indexes the cells, with

    k   = total number of cells,
    m̂_i = estimate of the expected cell frequency in cell i under the null hypothesis,
    x_i = observed cell frequency in cell i.

Both test statistics are asymptotically distributed as chi-squared variables, with the number of degrees of freedom equal to the total number of cells minus the number of estimated free parameters. Several Monte Carlo studies have been conducted to investigate the behavior of Pearson's χ² and/or G² for small or moderate sample sizes (Agresti & Yang, 1987; Berry & Mielke, 1988; Chapman, 1976; Haber, 1984; Hosmane, 1986, 1987; Koehler, 1986; Koehler & Larntz, 1980; Larntz, 1978; Lawal, 1984; Milligan, 1980; Rudas, 1986; Upton, 1982). Most of them concentrate on two-way tables, whereas less is known about higher order interactions, which require iterative algorithms. When Pearson's χ² and G² are compared within the same study, results are heterogeneous for one-way tables, whereas most of the studies investigating two-way and higher order tables find a tendency for Pearson's χ² to perform somewhat better than G². One-way tables were investigated by Chapman (1976) and Koehler and Larntz (1980) with heterogeneous results. Lawal (1984), who compared four test statistics including Pearson's χ² and G², found Pearson's χ² to be closest to the nominal alpha. Two-way tables with 2x2 cells were investigated by Upton (1982) with sample sizes ranging from n = 14 to 96. He compared twelve test statistics, including Pearson's χ² and G². He obtained the best results when Pearson's χ² was modified by a multiplication factor of (n - 1)/n. Yet Pearson's χ² without a correction factor was also found to perform well. Hosmane (1986) investigated two-way tables with 2x2 up to 9x9 cells and sample sizes ranging from 10 to 190. He compared Pearson's χ² and G² to some modifications and concluded that χ² without any modification yields the best results.
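For concreteness, the two statistics defined in equations (1) and (2) can be computed directly from the observed and the estimated expected cell frequencies. The following Python sketch is not part of the original study; it assumes the expected frequencies m̂_i have already been estimated:

```python
import math

def pearson_chi2(observed, expected):
    # Pearson's chi-squared, equation (1): sum over cells of (x_i - m_i)^2 / m_i.
    return sum((x - m) ** 2 / m for x, m in zip(observed, expected))

def likelihood_ratio_g2(observed, expected):
    # Likelihood-ratio statistic, equation (2): 2 * sum over cells of x_i * ln(x_i / m_i).
    # Cells with x_i = 0 contribute 0 (the limit of x ln x as x approaches 0).
    return 2.0 * sum(x * math.log(x / m) for x, m in zip(observed, expected) if x > 0)
```

Both statistics are then referred to a chi-squared distribution with degrees of freedom equal to the number of cells minus the number of estimated free parameters.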
Agresti and Yang (1987) investigated contingency tables with 2x3 up to 10x10 cells and sample sizes of N = 50 or 100. They found acceptable results for Pearson's χ² if the expected average cell frequency was not smaller than one. For direct testing of a log-linear model the distribution of Pearson's χ² was closer to the asymptotic chi-squared distribution than the distribution of G². On the other hand, for comparing two unsaturated models G² outperformed Pearson's χ² in many cases. Berry and Mielke (1988) investigated 2x2 up to 3x4 tables with sample sizes ranging from 20 to 80. They compared five test statistics and found a nonasymptotic chi-squared test to be superior in overall performance to the other four tests, including Pearson's χ² and G². Three-way tables with 2x2x2 cells and sample sizes from 10 to 90 were studied by Milligan (1980). Both Pearson's χ² and G² showed a tendency toward depressed α-levels for main effects and 2-factor interactions, whereas considerably inflated α-levels occurred in some cases for the 3-factor interactions with Pearson's χ². Three-way tables with 2x2x2 cells were also studied by Haber (1984) using larger sample sizes ranging from 40 to 400. He compared six test statistics, including Pearson's χ² and G². Among the tests which do not inflate the nominal significance level, Pearson's χ² was found to be the most powerful. Besides several one- and two-way contingency tables, Larntz (1978) also investigated a 3x3x3 design. He compared Pearson's χ² to G² and a further statistic and concluded that Pearson's χ² should be preferred because its Type I error rates were closest to the nominal level. Rudas (1986) studied two- and three-way tables with 2x2 up to 3x3x5 cells and sample sizes ranging from 15 to 150. He compared Pearson's χ², G² and a further statistic suggested by Cressie and Read (1985).
The results for Pearson's χ² and the Cressie and Read statistic were very similar, and these two statistics were found more appropriate than G² for small sample sizes. Koehler (1986) also studied two- and three-way tables and 2^k tables. He found acceptable results for Pearson's χ² except for some cases of sparse tables containing both very small and moderately large expected frequencies. The accuracy of G², on the other hand, was judged generally unacceptable, producing greatly inflated Type I error levels in some cases and deflated levels in others. For very large and sparse tables other asymptotic procedures were preferable to both Pearson's χ² and G². Hosmane (1987), who investigated tables with 2x2x2 to 4x4x3 cells and sample sizes ranging from 10 to 200, compared five test statistics including Pearson's χ² and G². He recommends Pearson's χ² as the test statistic closest to the nominal level α. The present paper contributes to this field by focusing on three-way and four-way designs. Whereas many of the previous studies used very small sample sizes in order to examine the behavior of the test statistics at the lower end, the present study focuses on sample sizes which may be judged realistic and reasonable for psychological research. That means that sample sizes should be sufficiently large to provide acceptable statistical power at least for large effects. As the results of an iterative estimation procedure may depend on the quality of the mathematical algorithm, it was decided to employ for our simulation study a widely used procedure from a well-known statistical package, with the program options left in their default modes. The procedure "Hiloglinear" from SPSS was chosen with the aim of deriving guidelines for when the results from this procedure can be used without a risk of severely inflated or deflated Type I error rates, and which of the two test statistics provided by the program should be preferred.

2.
The Monte Carlo study

Our simulation study includes designs with two categories per variable (2x2, 2x2x2, 2x2x2x2) and designs with up to four categories per variable (4x3, 4x3x2, 4x3x2x2). The first series of simulations was run with uniform marginals, sampling from a null model with zero main effects and zero interactions. The sample sizes chosen were 2.5, 5 or 10 times the number of cells. Smaller sample sizes were not considered because of the lack of statistical power that would result. According to Erdfelder (1992) and Erdfelder, Faul and Buchner (1996), a sample size of N = 32 is required for a chi-squared test with one degree of freedom to reach a power value of 0.8 at alpha = .05 for large effects, and N = 88 for medium effects; for a chi-squared test with six degrees of freedom the required sample sizes are N = 55 and N = 152, respectively (though these values are also based on asymptotic theory, they may nevertheless serve as rough estimates). A further series of simulations was run with the same designs and the same sample sizes, but differing marginal distributions. For designs with two categories per variable the marginals were .7, .3 or .8, .2; for designs with three- and four-category variables the marginals were .1, .2, .3, .4 for variable A, .1, .4, .5 for variable B, and .5, .5 or .8, .2 for variables C and D. The population values for all interactions were zero. Due to the differences in the marginal probabilities there was considerable variation in the expected cell frequencies within the designs, and in many cases some of the expected frequencies were smaller than one.

3. The simulation program[2]

The data were generated by an SPSS input program as follows: The SPSS procedure "uniform" was used to generate for each person a value for each variable, drawn from a uniform distribution on the interval [0, 1]; e. g., for a 4x3x2x2 design four variables A, B, C, D.
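The generation scheme, together with the categorization by cumulative marginal probabilities that follows, can be sketched in Python. The function names are hypothetical, and this is of course not the SPSS input program used in the study:

```python
import random
from itertools import accumulate

def draw_person(marginals_per_var, rng):
    # For each variable draw u ~ Uniform[0, 1] and assign the category whose
    # cumulative marginal probability first reaches u; e.g. marginals
    # (.1, .2, .3, .4) give cut points .1, .3, .6, 1.0.
    person = []
    for marginals in marginals_per_var:
        u = rng.random()
        cutpoints = list(accumulate(marginals))
        # fallback to the last category guards against floating-point round-off
        person.append(next((i for i, c in enumerate(cutpoints) if u <= c),
                           len(cutpoints) - 1))
    return tuple(person)

def draw_sample(n, marginals_per_var, seed=1):
    # Draw and categorize n simulated persons.
    rng = random.Random(seed)
    return [draw_person(marginals_per_var, rng) for _ in range(n)]
```

For the 4x3x2x2 design of the second simulation series one would call, e.g., draw_sample(480, [(.1, .2, .3, .4), (.1, .4, .5), (.5, .5), (.5, .5)]).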
Next the variables were divided into categories: e. g., when variable A was designed to have four categories a1 to a4 with marginal probabilities .1, .2, .3, .4, a person was assigned to category a1 if his/her value for A was in the range 0 to .1, to category a2 if it was between .1 and .3, etc. The procedure continued with categorizing the next variable until the person was categorized completely. Then the values for the next person were drawn and categorized, until a sample of the required size was completed. Then the input program was closed and the procedure "Hiloglinear" was started. This input program, including the line calling the procedure "Hiloglinear", was written into an SPSS macro which was executed 10 000 times. This resulted in a very long SPSS output, which was filtered by a program written in UNIX awk to find the lines with the results of the significance tests. The fields containing the p-values were transferred to a new file, which was then used as an input file for an SPSS program to assess the empirical distribution of the p-values. The p-values ≤ .05 were counted to get empirical rejection rates. The procedure "Hiloglinear" from SPSS for Unix Release 5.0 is described in the manual "SPSS Statistical Algorithms" as follows: First, maximum likelihood estimates of the expected cell frequencies under the specific model under investigation are computed using the iterative proportional fit algorithm as described by Fienberg (1977). The following default values for the program options were left unchanged: To avoid problems with zero frequencies, the program adds a constant delta = .5 to all empirical cell frequencies before the fit algorithm is started. The fit algorithm stops if the largest change

[2] The students Mrs. Sigrid Kühl, Mrs. Verena Polz and Mrs. Karina Wahl were engaged in the development and running of the simulation program.
of an expected cell frequency in two consecutive iterations is less than .25, or if 20 iterations have been executed. Based on these expected frequencies, as compared to the empirical cell counts, the two test statistics Pearson's χ² and G² are computed, and significance tests for the fit of the model are performed using either of the two statistics. The resulting p-values are given. The output provides significance tests for the fit of the following models:

M0: Global null hypothesis. All main effects and interactions are zero.
M1: Model with free parameters for the main effects. 2-factor and higher interactions are zero.
M2: Model with free parameters for main effects and 2-factor interactions. 3-factor and higher interactions are zero.
M3: Model with free parameters for main effects, 2-factor and 3-factor interactions. The 4-factor interaction is zero.

Next, hypotheses about main effects, 2-factor, 3-factor and 4-factor interactions are tested separately by hierarchical model comparisons (chi-square difference tests for nested models):

Test whether the marginals differ significantly from uniform distributions (M1 vs. M0)
Test of the 2-factor interactions (M2 vs. M1)
Test of the 3-factor interactions (M3 vs. M2)
Test of the 4-factor interaction (saturated model with main effects and all interactions including the 4-factor interaction vs. M3)

Furthermore, significance tests using G² are performed for each of the main effects and each of the interactions separately ("partial associations"). These tests are also based on hierarchical comparisons: e. g., when in a model with four variables the 3-factor interaction ABC is to be tested, two models with the following higher-order marginals fitted to the data are compared:

H0: model with all main effects, all 2-factor interactions and the 3-factor interactions ABD, BCD and ACD (i.
e., all 3-factor interactions except ABC)
H1: model with all main effects, all 2-factor interactions and all 3-factor interactions including ABC

In the next section, simulation results will be reported for the significance tests of the global null hypothesis, for groups of hypotheses (all main effects, all 2-factor interactions, etc.) and in some cases also for partial associations.

4. Results

Table 1 gives the empirical rejection rates corresponding to a nominal level of alpha = .05 under the condition of the complete null hypothesis, i. e., with uniform marginal distributions. As each entry in the table is based on 10 000 samples, 95 percent of the values should fall into the interval .046 to .054 if the true alphas were .05. A rough inspection of Table 1 already shows that the majority of the values (64%) lie outside this range and thus depart significantly from the nominal value. However, when applying asymptotic results to finite sample sizes one cannot expect to get exact error rates. In what follows we will call increased empirical values up to .075 "moderately" inflated; higher values will be called "severely" inflated. Severely inflated values are marked in the tables. On the other hand, empirical error rates below the nominal level indicate a conservative significance test, presumably with increased Type II error rates. Therefore, values equal to or below .025 are marked as well. Table 1 shows that there is a general tendency for G² to yield higher empirical alphas than Pearson's χ² (in 60 of 72 cases the value for G² is higher than that for Pearson's χ², in 1 case equal, and in 11 cases lower). Most of the cases of severely inflated alpha values occur for G², whereas severely depressed alphas occur for neither statistic.
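The interval .046 to .054 quoted above is simply the normal-approximation band for a binomial proportion with p = .05 and 10 000 replications. A quick check (my own calculation, not taken from the original paper):

```python
import math

def mc_interval(alpha=0.05, n_samples=10_000, z=1.96):
    # 95% band for an empirical rejection rate when the true rate is alpha
    # and each table entry is based on n_samples Monte Carlo replications.
    half_width = z * math.sqrt(alpha * (1 - alpha) / n_samples)
    return alpha - half_width, alpha + half_width

low, high = mc_interval()  # approximately (.0457, .0543), i.e. .046 to .054 after rounding
```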
Giving equal weight to discrepancies in either direction (increased or depressed values), Pearson's χ² is found to be closer to the nominal alpha than G² in 52 cases, equal in 2 cases, and more discrepant in 18 cases. Thus one may state that Pearson's χ² is, on the whole, closer to the nominal level than G². Next we will look at the results for the two statistics in more detail.

Pearson's χ²: There are only two cases of severely inflated alpha levels, both occurring when 4-factor interactions are tested with the smallest sample sizes (2.5 times the number of cells). When the sample sizes are increased, the empirical error rates improve and differ only "moderately" from the nominal level.

Table 1: Empirical alpha values for a nominal alpha = .05 when the complete null hypothesis is true (uniform marginals, no interactions). For each significance test the first value is for G², the second for Pearson's χ². Values marked * are severely inflated (above .075).

Design     N    n     global        marginals     2-way         3-way         4-way
                      G²    χ²      G²    χ²     G²    χ²     G²    χ²     G²    χ²
2x2        10   2.5   .077* .036    .047  .059   .100* .052
           20   5     .048  .038    .052  .046   .058  .051
           40   10    .053  .042    .048  .048   .050  .050
4x3        30   2.5   .091* .042    .057  .053   .102* .045
           60   5     .067  .047    .056  .053   .071  .046
           120  10    .057  .048    .056  .057   .056  .047
2x2x2      20   2.5   .083* .043    .059  .049   .076* .050   .095* .069
           40   5     .067  .045    .048  .050   .062  .047   .068  .060
           80   10    .056  .047    .052  .055   .055  .049   .060  .058
4x3x2      60   2.5   .106* .056    .053  .058   .075  .045   .129* .063
           120  5     .078* .054    .059  .060   .062  .050   .082* .057
           240  10    .060  .053    .054  .055   .055  .050   .060  .053
2x2x2x2    40   2.5   .095* .045    .052  .055   .065  .043   .107* .065   .111* .078*
           80   5     .068  .047    .053  .059   .055  .043   .067  .050   .077* .069
           160  10    .060  .053    .054  .053   .054  .047   .061  .054   .058  .057
4x3x2x2    120  2.5   .141* .054    .057  .068   .063  .044   .122* .052   .160* .091*
           240  5     .086* .056    .052  .061   .055  .047   .078* .048   .098* .073
           480  10    .065  .054    .048  .055   .054  .048   .064  .051   .068  .063

N = total sample size; n = N divided by the number of cells.
Apart from a conservative tendency for 2x2 tables with small sample sizes, the tests for the global null hypothesis lead to acceptable Type I error rates (.042 to .056). The tests for the main effects, 2-factor and 3-factor interactions lead to somewhat larger but still "moderate" departures from the nominal level (.043 to .069). The tests for the 4-factor interactions lead to the two severely inflated values mentioned above, occurring with sample sizes of 2.5 times the cell number. For sample sizes equalling 5 or 10 times the cell number, these error rates were only moderately inflated.

The likelihood-ratio G²: Whereas the results for the main effects are satisfactory (.047 to .059), the 2-factor and higher interactions show an increasing tendency to yield inflated Type I error rates. Severely inflated empirical alpha levels occur for all designs with sample sizes only 2.5 times the number of cells, and for many cases with sample sizes 5 times the cell number. Only with sample sizes equal to 10 times the cell number do all departures stay in the range defined as only "moderately" discrepant.

Tables 2a and 2b give the empirical rejection rates for all simulations with main effects present, i. e., with marginals differing from the uniform distribution. The sample sizes used are the same as in Table 1, equalling 2.5, 5, or 10 times the cell number. Yet, due to the variation in the marginal probabilities, there was considerable variation in the expected cell frequencies within a design. The smallest expected cell frequency within a design is given in the column headed "min E(ni)" in Tables 2a and 2b. The alternative hypothesis was true for the global tests and the tests of the marginals. Except for the smallest sample sizes (N = 10, N = 20), statistical power exceeded 0.90 in all cases.
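The "min E(ni)" column can be reproduced from the design's marginal probabilities: under the simulated null models (independent variables), the expected frequency of a cell is N times the product of the marginal probabilities of its categories. A small sketch with a hypothetical helper name:

```python
from itertools import product
from math import prod

def min_expected_frequency(n_total, marginals_per_var):
    # Smallest expected cell frequency under independence: N times the product
    # of one marginal probability per variable, minimized over all cells.
    return min(n_total * prod(ps) for ps in product(*marginals_per_var))
```

E.g., for the 2x2x2x2 design with marginals .2, .8 on every variable and N = 160 this gives 160 * .2**4 = .256, i.e. the .26 reported in Table 2a.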
As the results for the global tests and the tests for the marginal distributions were very much alike, only the results for the marginal distributions are reported. Table 2a refers to designs with two categories per variable (2x2 to 2x2x2x2), whereas Table 2b contains the results for designs with up to four categories per variable (4x3 to 4x3x2x2). Taking both groups of designs together, the behavior of the two test statistics may be summarized as follows: Pearson's χ² leads to only moderate discrepancies from the nominal level (empirical levels .043 to .071) when the total sample size equals at least 5 times the cell number and the smallest expected cell frequency is larger than 0.5. With expected cell frequencies smaller than 0.5, serious departures occur in either direction (severely inflated or depressed empirical levels) without following a simple pattern.

Table 2a: Empirical rejection rates for a nominal level of alpha = .05 when main effects are present but all interactions are zero. Designs with only two-category variables.
For each test the first value is for G², the second for Pearson's χ². As the alternative hypothesis holds for the main effects, the values in the "marginals" columns are empirical power values. Values marked * are severely inflated (above .075); values marked † are .025 or below.

Design     N    n    min E(ni)  marginals  marginals (power)  2-way          3-way          4-way
                                           G²    χ²           G²    χ²      G²    χ²      G²    χ²
2x2        10   2.5  .9         .3 .7      .331  .361          .071  .046
           20   5    1.8                   .647  .607          .074  .045
           40   10   3.6                   .922  .913          .065  .052
2x2        10   2.5  .4         .2 .8      .697  .744          .034  .054
           20   5    .8                    .970  .957          .055  .049
           40   10   1.6                   1.00  1.00          .070  .046
2x2x2      20   2.5  .54        .3 .7      .768  .725          .088* .065    .070  .044
           40   5    1.08                  .977  .972          .069  .051    .072  .057
           80   10   2.16                  1.00  1.00          .053  .047    .068  .055
2x2x2      20   2.5  .16        .2 .8      .990  .990          .059  .086*   .036  .016†
           40   5    .32                   1.00  1.00          .068  .065    .056  .039
           80   10   .64                   1.00  1.00          .069  .058    .058  .052
2x2x2x2    80   5    .65        .3 .7      1.00  1.00          .062  .052    .096* .071    .081* .051
           160  10   1.30                  1.00  1.00          .053  .050    .069  .053    .079* .061
2x2x2x2    80   5    .13        .2 .8      1.00  1.00          .081* .101*   .071  .090*   .022† .011†
           160  10   .26                   1.00  1.00          .058  .069    .078* .081*   .050  .032

N = total sample size; n = N divided by the number of cells; min E(ni) = smallest expected cell frequency.

Table 2b: Empirical rejection rates for a nominal level of alpha = .05 when main effects are present but all interactions are zero. Designs with more than two categories per variable.
For each test the first value is for G², the second for Pearson's χ². As the alternative hypothesis holds for the main effects, the values in the "marginals" columns are empirical power values. Values marked * are severely inflated (above .075); values marked † are .025 or below.

Design     N    n    min E(ni)  marginals (A; B; C; D)           marginals (power)  2-way          3-way          4-way
                                                                 G²    χ²           G²    χ²      G²    χ²      G²    χ²
4x3        60   5    .6    .1 .2 .3 .4; .1 .4 .5                 1.00  1.00          .065  .046
           120  10   1.2                                         1.00  1.00          .065  .048
4x3x2      120  5    .6    .1 .2 .3 .4; .1 .4 .5; .5 .5          1.00  1.00          .071  .060    .081* .043
           240  10   1.2                                         1.00  1.00          .063  .056    .081* .052
4x3x2      120  5    .24   .1 .2 .3 .4; .1 .4 .5; .2 .8          1.00  1.00          .074  .076*   .067  .042
           240  10   .48                                         1.00  1.00          .066  .064    .072  .049
4x3x2x2    240  5    .6    .1 .2 .3 .4; .1 .4 .5; .5 .5; .5 .5   1.00  1.00          .061  .060    .102* .064    .089* .056
           480  10   1.2                                         1.00  1.00          .054  .052    .081* .053    .091* .065
4x3x2x2    240  5    .24   .1 .2 .3 .4; .1 .4 .5; .5 .5; .2 .8   1.00  1.00          .062  .072    .112* .093*   .060  .033
           480  10   .48                                         1.00  1.00          .050  .056    .088* .064    .082* .053
4x3x2x2    240  5    .10   .1 .2 .3 .4; .1 .4 .5; .2 .8; .2 .8   1.00  1.00          .067  .096*   .117* .125*   .028  .019†
           480  10   .19                                         1.00  1.00          .066  .071    .010† .085*   .060  .034

N = total sample size; n = N divided by the number of cells; min E(ni) = smallest expected cell frequency.

Table 3: Empirical alpha values for a nominal alpha = .05 when testing partial associations using G². Two design conditions selected from Table 2b. Values marked * are severely inflated (above .075).

a) Design 4x3x2 (marginals .1 .2 .3 .4; .1 .4 .5; .5 .5):

N    n    min E(ni)   AB     AC     BC     ABC
120  5    .6          .072   .061   .062   .081*
240  10   1.2         .063   .056   .054   .081*

b) Design 4x3x2x2 (marginals .1 .2 .3 .4; .1 .4 .5; .5 .5; .5 .5):

N    n    min E(ni)   AB     AC     BC     AD     BD     CD
240  5    .6          .063   .056   .054   .055   .051   .051
480  10   1.2         .052   .052   .052   .056   .051   .050

N    n    min E(ni)   ABC    ABD    ACD    BCD    ABCD
240  5    .6          .097*  .096*  .076*  .077*  .089*
480  10   1.2         .078*  .071   .059   .061   .091*

N = total sample size; n = N divided by the number of cells; min E(ni) = smallest expected cell frequency.
The likelihood-ratio G² yields only moderate discrepancies from the nominal level for the 2-factor interactions (empirical levels .053 to .070) when the total sample size equals 10 times the cell number and the smallest expected cell frequency is larger than 1. However, even when these conditions are satisfied, the significance tests of the 3- and 4-factor interactions lead in many cases to seriously increased alpha levels. Therefore, G² cannot be recommended for these tests. Comparing Pearson's χ² to G², we find that in all cases which satisfy the above rule for Pearson's χ² (total sample size at least 5 times the cell number, smallest expected cell frequency larger than .5), Pearson's χ² is closer to the nominal level than G². Table 3 shows the results for the tests of the partial associations in some of the larger designs (4x3x2 and 4x3x2x2) using G². These results also indicate that severely inflated alpha levels occur for 3- and 4-factor interactions even under the conditions of large sample sizes (10 times the cell number) and smallest expected cell frequencies larger than 1.

4.1. Supplementary results

Since the goodness-of-fit values obtained for a model, and the resulting p-values, may depend to some extent also on the quality of the numerical optimization procedure, we wanted to check whether an increase in numerical accuracy would affect the obtained error rates. Two designs were chosen: the 2x2x2 design with unequal marginals (.8, .2) and sample size N = 80, and the 2x2x2x2 design with unequal marginals (.8, .2) and sample size N = 160. The latter had produced severely inflated error rates for both Pearson's χ² and G². The program options for numerical accuracy were changed from the default values of a maximum of 20 iterations and a convergence criterion of 0.25 to a maximum of 50 iterations and a convergence criterion of .05.
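For readers unfamiliar with the algorithm these options control: iterative proportional fitting alternately rescales the table to match the fitted marginals until the largest change falls below the convergence criterion or the iteration limit is reached. A generic two-way sketch of the principle (not SPSS's implementation; the default arguments merely mirror Hiloglinear's documented stopping rule):

```python
def ipf_2d(table, row_margins, col_margins, max_iter=20, tol=0.25):
    # Alternately rescale rows and columns of `table` to match the target
    # margins; stop when the largest cell change between two consecutive
    # iterations is below `tol` or after `max_iter` iterations.
    cur = [row[:] for row in table]
    for _ in range(max_iter):
        prev = [row[:] for row in cur]
        for i, target in enumerate(row_margins):          # match row margins
            s = sum(cur[i])
            cur[i] = [x * target / s for x in cur[i]]
        for j, target in enumerate(col_margins):          # match column margins
            s = sum(row[j] for row in cur)
            for row in cur:
                row[j] *= target / s
        change = max(abs(cur[i][j] - prev[i][j])
                     for i in range(len(cur)) for j in range(len(cur[0])))
        if change < tol:
            break
    return cur
```

Fitting a uniform 2x2 start table to margins (30, 70) and (40, 60), for instance, converges to the independence table [[12, 18], [28, 42]].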
For each of the two designs a simulation with 10 000 data sets was run under the improved accuracy conditions. The results were very close to those under the default conditions. In particular, the severely increased Type I error rates for the 3-factor interactions in the 2x2x2x2 design did not improve. Next, the value of the constant delta, which is added to all observed cell frequencies to avoid problems with empty cells, was changed from its default value 0.5 to 0.1. The same two design conditions as before (2x2x2 with marginals .8, .2, N = 80; 2x2x2x2 with marginals .8, .2, N = 160) and two further conditions (2x2x2x2 with marginals .7, .3, N = 160; 4x3x2x2 with uniform marginals, N = 480) were chosen. The results are heterogeneous: There were substantial improvements in many cases, especially for G², but also deteriorations. For those cases which satisfy the conditions given above for Pearson's χ² (sample size at least 5 times the number of cells, smallest expected cell frequency larger than .5), no severe departures from the nominal level occurred, neither for Pearson's χ² nor for G², and Pearson's χ² was always closer to the nominal level than G². Furthermore, we wanted to check whether the procedure "Hiloglinear" in SPSS for Windows 8.0 differs in any respect from the procedure in SPSS for UNIX Release 5.0, which was used in our simulations. We chose the last six design conditions from Table 2b (4x3x2x2 designs with unequal marginals) and generated one sample for each condition. These samples were analyzed using "Hiloglinear" from both SPSS for Windows 8.0 and SPSS for UNIX 5.0 with the program options left at their default values (delta = .5, convergence = .25, iterate = 20). The output from the two versions of SPSS was identical. Then the options were changed to delta = 0, convergence = .01 and iterate = 50. Again, the output was the same for both SPSS versions.

5.
5. Discussion

The results of the present study are in accordance with the majority of previous findings in concluding that Pearson's χ² is generally closer to the nominal level than G². Pearson's χ² did not lead to seriously inflated Type I error rates when the smallest expected cell frequency was larger than .5 and the total sample size was at least 5 times the number of cells. This rule was found to hold for main effects and 2-factor interactions as well as for higher-order interactions. For G², on the other hand, a rule was found only for main effects and 2-factor interactions (to avoid seriously inflated Type I error rates, the smallest expected cell frequency should be larger than 1 and the total sample size at least 10 times the number of cells), whereas in many cases significance tests of higher-order interactions led to seriously inflated alpha levels even when this rule was satisfied. However, a Monte Carlo study can yield only limited information on the behavior of a test statistic, as only a limited number of cases can be simulated, and the cases included will usually differ in some respects from the conditions a researcher faces when analyzing his/her data. In the present study we aimed at choosing the conditions so as to cover the most typical cases of psychological research, and as a result we defined an area of conditions where one may feel on the safe side using SPSS standard procedures. Nevertheless, important questions are left open: What should one do when the above rules for applying asymptotic procedures are not satisfied, e.g., when higher-order interactions are to be tested and the smallest expected cell frequencies are smaller than .5, i.e., too small for either G² or Pearson's χ²?
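The sample-size rules above lend themselves to a quick pre-analysis check. The sketch below is a minimal illustration, assuming the expected cell probabilities are the product of independent marginals, as in the simulation designs of this study; the function name and the example design are hypothetical.

```python
import numpy as np

def rule_of_thumb_ok(n, marginals):
    """Check the rule for Pearson's chi-squared: total n at least 5 times the
    number of cells, and smallest expected cell frequency larger than .5.
    marginals: one 1-D probability vector per factor (assumed independent)."""
    # Expected cell probabilities as the outer product of the marginals.
    probs = np.array([1.0])
    for m in marginals:
        probs = np.outer(probs, m).ravel()
    n_cells = probs.size
    min_expected = n * probs.min()
    return n >= 5 * n_cells and min_expected > 0.5

# 2x2x2x2 design with unequal marginals (.8, .2) on every factor:
marginals = [np.array([0.8, 0.2])] * 4
print(rule_of_thumb_ok(160, marginals))  # smallest expected = 160 * .2**4 = 0.256 -> False
```

Note that the sample-size criterion alone is not enough: here n = 160 is ten times the number of cells, yet the smallest expected frequency falls below .5, so the rule is violated.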
In such situations it would be desirable to have bootstrapping procedures which enable researchers to run their own simulation study with their specific design conditions and parameter values, and with the actual settings for computational accuracy and the constant delta. Based on this simulation, the distribution of the test statistic could be assessed empirically and used instead of the asymptotic distribution to compute the tail probability needed for the significance test. For example, when a 4-factor interaction is to be tested, one might proceed as follows: First, the empirical data are used to estimate the parameters under the model of the null hypothesis. As the 4-factor interaction is in question, this is a model containing main effects, 2-factor and 3-factor interactions, but no 4-factor interaction. Next, a large number of data sets with the actual sample size is generated from this model. To each data set, the null hypothesis model with all parameters free except the 4-factor interaction, and the alternative model with all parameters free including the 4-factor interaction (i.e. the saturated model), are fitted. Comparing these two models, the test statistic for the 4-factor interaction is computed, and its distribution over the samples is assessed. To perform the significance test for the 4-factor interaction, one may use the empirical distribution of the test statistic to estimate the 95th percentile and use it as the critical value, or, alternatively, one may estimate the tail probability (p-value) from the proportion of samples which led to a value exceeding that computed from the real data set. Similarly, one may also obtain estimated power values: a further simulation study might be run, generating Monte Carlo samples from a model with parameter values specified according to the respective alternative hypothesis.
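The procedure just described can be sketched in a few lines. The example below is deliberately reduced to the simplest case: a parametric bootstrap test of independence in a two-way table using Pearson's χ², rather than a 4-factor interaction in a log-linear model. The null model is fitted to the data, Monte Carlo samples are drawn from it, the statistic is recomputed for each sample, and the p-value is the proportion of samples exceeding the observed statistic. The data, the function names and the number of bootstrap samples are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def pearson_chi2(obs):
    """Pearson chi-squared for a two-way table against the independence fit."""
    e = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    return ((obs - e) ** 2 / e).sum()

def bootstrap_p(obs, n_boot=2000):
    n = int(obs.sum())
    # Null-model cell probabilities: product of the estimated margins.
    p_null = np.outer(obs.sum(axis=1), obs.sum(axis=0)).ravel() / n ** 2
    observed_stat = pearson_chi2(obs)
    exceed = 0
    for _ in range(n_boot):
        sample = rng.multinomial(n, p_null).reshape(obs.shape).astype(float)
        # Skip degenerate samples with an empty margin (possible in tiny tables).
        if (sample.sum(axis=0) == 0).any() or (sample.sum(axis=1) == 0).any():
            continue
        if pearson_chi2(sample) >= observed_stat:
            exceed += 1
    return exceed / n_boot

obs = np.array([[6.0, 1.0], [2.0, 7.0]])  # hypothetical small-sample table
print(bootstrap_p(obs))
```

For a power estimate, as described above, one would instead generate the samples from a model specified under the alternative hypothesis and count how often the statistic exceeds the bootstrapped critical value.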
The proportion of cases leading to a test statistic exceeding the 95th percentile obtained under the null hypothesis yields an estimate of statistical power for this alternative hypothesis. This approach, which has been implemented in the program PANMARK by Van de Pol, Langeheine and De Jong (1991), is called "sophisticated bootstrapping" or "parametric bootstrapping" and has also been presented and demonstrated with a variety of log-linear models and latent class models by Langeheine, Pannekoek and Van de Pol (1996) and Langeheine, Van de Pol and Pannekoek (1997). Von Davier (1997) developed parametric bootstrapping procedures for various item response models, as for these models the number of cells (= possible response patterns) becomes large even for moderate numbers of items, and hence the sample sizes available in practice are nearly always too small to use asymptotic tests such as Pearson's χ² or G². So far, parametric bootstrapping is the only alternative when the sample size is known to be too small for asymptotic procedures. Furthermore, it may also be helpful when the actual design conditions differ substantially from those covered by our tables and generalizations from our Monte Carlo study become hazardous. As similar considerations apply in principle to all other asymptotically derived significance tests, it would be desirable if parametric bootstrapping procedures were included in all major statistical packages wherever asymptotic tests are applied and only rough knowledge is available about their behavior with finite sample sizes.

References

[1] Agresti, A. & Yang, M.C. (1987). An empirical investigation of some effects of sparseness in contingency tables. Computational Statistics and Data Analysis, 5, 9-21.
[2] Berry, K.J. & Mielke, P.W. Jr. (1988).
Monte Carlo comparisons of the asymptotic chi-square and likelihood-ratio tests with the nonasymptotic chi-square test for sparse r x c tables. Psychological Bulletin, 103, 256-264.
[3] Brandstätter, E. (1999). Confidence intervals as an alternative to significance testing. Methods of Psychological Research Online, 4(2). http://www.mpr-online.de
[4] Chapman, J.W. (1976). A comparison of the X², −2 log R, and multinomial probability criteria for significance tests when expected frequencies are small. Journal of the American Statistical Association, 71, 854-863.
[5] Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, N.J.: Lawrence Erlbaum.
[6] Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
[7] Von Davier, M. (1997). Methoden zur Prüfung probabilistischer Testmodelle [Methods for testing probabilistic test models]. Kiel: Institut für die Pädagogik der Naturwissenschaften (IPN), Olshausenstraße 62, D-24098 Kiel.
[8] Erdfelder, E., Faul, F. & Buchner, A. (1996). GPOWER: A general power analysis program. Behavior Research Methods, Instruments, & Computers, 28, 1-11.
[9] Faul, F. & Erdfelder, E. (1992). GPOWER: A priori, post hoc, and compromise power analysis for MS-DOS (computer program). Bonn: Bonn University.
[10] Fienberg, S.E. (1977). The analysis of cross-classified categorical data. Cambridge, MA: The MIT Press.
[11] Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In Keren, G. & Lewis, C. (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale, N.J.: Lawrence Erlbaum.
[12] Goodman, L.A. (1996). A single general method for the analysis of cross-classified data: Reconciliation and synthesis of some methods of Pearson, Yule, and Fisher, and also some methods of correspondence analysis and association analysis. Journal of the American Statistical Association, 91, 408-428.
[13] Haber, M.
(1984). A comparison of tests for the hypothesis of no three-factor interaction in 2 x 2 x 2 contingency tables. Journal of Statistical Computation and Simulation, 20, 205-215.
[14] Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.) (1997). What if there were no significance tests? Mahwah, N.J.: Lawrence Erlbaum.
[15] Harris, R.J. (1997). Reforming significance testing via three-valued logic. In Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.), What if there were no significance tests? Mahwah, N.J.: Lawrence Erlbaum.
[16] Hosmane, B. (1986). Improved likelihood ratio tests and Pearson chi-square tests for independence in two dimensional contingency tables. Communications in Statistics - Theory and Methods, 15, 1875-1888.
[17] Hosmane, B. (1987). An empirical investigation of chi-square tests for the hypothesis of no three-factor interaction in I x J x K contingency tables. Journal of Statistical Computation and Simulation, 28, 167-178.
[18] Iseler, A. (1997). Signifikanztests: Ritual, guter Brauch und gute Gründe [Significance tests: ritual, good custom, and good reasons]. Methods of Psychological Research Online, Diskussionsforum. http://www.pabstpublishers.de/impr/forum_e.html
[19] Jones, L.V. (1999). A sensible reformulation of the significance test. ViSta: The Visual Statistics System. http://forrest.psych.unc.edu/jones-tukey 112399.html
[20] Koehler, K.J. (1986). Goodness-of-fit tests for log-linear models in sparse contingency tables. Journal of the American Statistical Association, 81, 483-492.
[21] Koehler, K.J. & Larntz, K. (1980). An empirical investigation of goodness-of-fit statistics for sparse multinomials. Journal of the American Statistical Association, 75, 336-344.
[22] Langeheine, R., Pannekoek, J. & Van de Pol, F. (1996). Bootstrapping goodness-of-fit measures in categorical data analysis. Sociological Methods & Research, 24, 492-516.
[23] Langeheine, R., Van de Pol, F. & Pannekoek, J. (1997).
Kontingenztabellen-Analyse bei kleinen Stichproben: Probleme bei der Prüfung der Modellgültigkeit mittels Chi-Quadrat-Statistiken [Contingency table analysis with small samples: problems in testing model validity by means of chi-square statistics]. Empirische Pädagogik, 11, 63-77.
[24] Larntz, K. (1978). Small-sample comparisons of exact levels for chi-squared goodness-of-fit statistics. Journal of the American Statistical Association, 73, 253-263.
[25] Lawal, H.B. (1984). Comparisons of X², Y², Freeman-Tukey and Williams' improved G² test statistics in small samples of one-way multinomials. Biometrika, 71, 415-458.
[26] Meehl, P.E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (pp. 393-426). Mahwah, N.J.: Lawrence Erlbaum.
[27] Milligan, G.W. (1980). Factors that affect Type I and Type II error rates in the analysis of multidimensional contingency tables. Psychological Bulletin, 87, 238-244.
[28] Mulaik, S.A., Raju, N.S. & Harshman, R.A. (1997). There is a time and a place for significance testing. In Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (pp. 65-115). Mahwah, N.J.: Lawrence Erlbaum.
[29] Read, T.R.C. & Cressie, N.A.C. (1988). Goodness-of-fit statistics for discrete multivariate data. New York: Springer.
[30] Reichardt, Ch.S. & Gollob, H.F. (1997). When confidence intervals should be used instead of statistical significance tests, and vice versa. In Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (pp. 259-284). Mahwah, N.J.: Lawrence Erlbaum.
[31] Rudas, T. (1986). A Monte Carlo comparison of the small sample behaviour of the Pearson, the likelihood ratio and the Cressie-Read statistics. Journal of Statistical Computation and Simulation, 24, 107-120.
[32] Schmidt, F.L. & Hunter, J.E. (1997).
Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (pp. 37-64). Mahwah, N.J.: Lawrence Erlbaum.
[33] Sedlmeier, P. (1996). Jenseits des Signifikanztest-Rituals: Ergänzungen und Alternativen [Beyond the significance test ritual: supplements and alternatives]. Methods of Psychological Research Online, 1, 41-63.
[34] Sedlmeier, P. (1998). Was sind die guten Gründe für Signifikanztests? Diskussionsbeitrag zu Sedlmeier (1996) und Iseler (1997) [What are the good reasons for significance tests? Comment on Sedlmeier (1996) and Iseler (1997)]. Methods of Psychological Research Online, 3, 39-42.
[35] SPSS Statistical Algorithms (no year). Chicago: SPSS Inc.
[36] Upton, G.J.G. (1982). A comparison of alternative tests for the 2 x 2 comparative trial. Journal of the Royal Statistical Society, Series A, 145, 86-105.
[37] Van de Pol, F., Langeheine, R. & De Jong, W. (1991). PANMARK user manual: PANel analysis using MARKov chains. Voorburg: Netherlands Central Bureau of Statistics.