
Methods of Psychological Research Online 2000, Vol.5, No.2
Institute for Science Education
Internet: http://www.mpr-online.de
© 2000 IPN Kiel
What sample sizes are needed to get correct significance
levels for log-linear models? - A Monte Carlo Study using
the SPSS-procedure "Hiloglinear"
Ingeborg STELZL1
Abstract
Pearson's χ² and the likelihood-ratio statistic G² are the most common and widely used test statistics for log-linear models. Both are asymptotically distributed as chi-squared variables. The present article reports the results of a Monte Carlo study which compares the two test statistics for two-, three- and four-dimensional contingency tables, employing conditions which may be judged reasonable for psychological research and using one of the most prominent computer programs (SPSS "Hiloglinear"). Our results are consistent with previous research in that, on the whole, Pearson's χ² behaves better than G². As a rule of thumb, Pearson's χ² will not produce severely inflated alpha values (empirical values of .075 or larger for a nominal level of .05) if the total sample size equals five times the number of cells and the smallest expected cell frequency is larger than 0.50. In contrast, the likelihood-ratio statistic G² yields in some cases severely inflated empirical alpha values for the higher interactions even if the total sample size equals ten times the number of cells and the smallest expected cell frequency is larger than one. In those cases where the sample size is large enough to use Pearson's χ², it is preferable to G², as it is generally closer to the nominal alpha. For cases not covered by this rule, parametric bootstrapping is recommended.
Key words: contingency tables, log-linear models, significance tests, Pearson's χ², likelihood ratio G², Monte Carlo study, simulation, SPSS Hiloglinear
1 Author's address: Prof. Dr. Ingeborg Stelzl, Philipps-University, Fachbereich Psychologie, Gutenbergstr. 18, D-35032 Marburg, Germany. Tel.: +49-6421-2823669, Fax: +49-6421-2828929, e-mail: [email protected]
1. Introduction
Log-linear models may be used in the analysis of contingency tables to test hypotheses about main effects, two-way and higher-order interactions. Although several other procedures have been suggested (for an overview see Read & Cressie, 1988; for two-way tables also Goodman, 1996), Pearson's χ² and the likelihood-ratio statistic G² are still the best known and most widely used statistics for significance testing. As both procedures are based on asymptotic theory, i.e., derived for large sample sizes only, the question remains which of the two statistics should be preferred with moderate sample sizes. The present paper reports the results of a Monte Carlo study from which some guidelines are derived to answer this question.
In recent years, however, the role of significance testing in the behavioral and social sciences has been questioned on epistemological grounds, the rationale for significance testing has become controversial, and it has been asked whether significance testing should be abandoned altogether (see Cohen, 1994; Gigerenzer, 1993; Harlow, Mulaik & Steiger, 1997; Sedlmeier, 1996; Iseler, 1997; Sedlmeier, 1998; Brandstätter, 1999). We will therefore first comment on how the specific topic of the present study may be embedded in this debate.
Some of the main objections raised against the practice of significance testing are: (1) that significance tests are often misunderstood and misinterpreted (for an overview and critical discussion of common misunderstandings see Mulaik, Raju and Harshman, 1997; for an attempt at a reformulation of the logical basis of significance testing see Harris, 1997; Jones, 1999); (2) that significance tests are superfluous, as the null hypothesis is always known to be false, and that they impede scientific progress if non-significant results are not published (see Schmidt & Hunter, 1997); and (3) that significance tests do not convey the relevant information concerning effect sizes and accuracy of estimation. (For alternatives which may be used instead of significance tests, or as supplements to them, see Brandstätter, 1999; Meehl, 1997; Reichardt & Gollob, 1997; Sedlmeier, 1996.)
This debate may be expected to continue for a long time, as it reaches far into the philosophy of science. The force of some arguments (e.g., that the null hypothesis is always known to be false) will also depend on substantive grounds and vary with the field of application. Nevertheless, most researchers engaged in this discussion would presumably agree to the following statements:
(1) Significance tests must not be the main criterion for judging the scientific impact of an empirical study, e.g., in deciding whether or not it should be published. The results of an empirical study are reported insufficiently and inappropriately if only test statistics and p-values from significance tests are given. At least for the main hypotheses, effect sizes (parameter estimates with confidence intervals and/or global effect size measures, e.g., proportion of predicted variance) should be reported. Furthermore, there should be a descriptive summary of the data including as many details as possible (e.g., all cell frequencies of a multidimensional contingency table if a log-linear model is employed, or the complete correlation matrix if a structural equation model is to be fitted to the data). This will enable other authors to reanalyze the data with their own models.
(2) Significance testing requires sample sizes that are sufficiently large to provide adequate statistical power (Cohen, 1988, suggests 0.8 as a minimum for standard cases, but higher values, e.g., 0.95, may be required depending on the specific hypothesis in question) for effect sizes which are judged to be relevant for the specific matter of research (these may be small, medium or large effects as defined by Cohen, 1988, or Erdfelder, Faul & Buchner, 1996).
(3) If the results of significance tests are given by tail probabilities (p-values), these p-values should be computed correctly. If only asymptotic distributions are known for a test statistic, this raises the question of what minimal sample sizes are required to get correct p-values from those asymptotic distributions.
The present study contributes to this field with reference to a special class of statistical models, i.e., log-linear models for the analysis of contingency tables, for which two competing asymptotic test procedures, Pearson's χ² and G², are compared.
The test statistics in the study: Pearson's χ² and the likelihood-ratio statistic G²
Both test procedures require that the expected cell frequencies under the null hypothesis be estimated first. E.g., if a 4-factor interaction is to be tested, the model for the null hypothesis contains free parameters for all main effects, 2-factor and 3-factor interactions, whereas the parameters for the 4-factor interaction are fixed to zero.
The two test statistics, Pearson's χ² and G², are defined as follows:

Pearson's χ²:

    χ² = Σ_{i=1}^{k} (x_i − m̂_i)² / m̂_i                    (1)

Likelihood-ratio statistic G²:

    G² = −2 Σ_{i=1}^{k} x_i ln(m̂_i / x_i)                   (2)

where

    i = 1 ... k   index for the cells, with k = total number of cells
    m̂_i           estimate of the expected cell frequency in cell i under the null hypothesis
    x_i           observed cell frequency in cell i

Both test statistics are asymptotically distributed as chi-squared variables, with the number of degrees of freedom equal to the total number of cells minus the number of estimated free parameters.
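For concreteness, the two statistics can be computed from the observed and the fitted expected cell frequencies as follows (a minimal sketch in Python; the function names are our own, and zero observed cells are handled by the usual convention x ln x → 0):

```python
import numpy as np

def pearson_chi2(x, m_hat):
    """Pearson's chi-squared, equation (1): sum of (x_i - m_i)^2 / m_i."""
    x, m_hat = np.asarray(x, float), np.asarray(m_hat, float)
    return float(np.sum((x - m_hat) ** 2 / m_hat))

def likelihood_ratio_g2(x, m_hat):
    """Likelihood-ratio statistic, equation (2): -2 * sum of x_i * ln(m_i / x_i).

    Cells with x_i = 0 contribute nothing to the sum (x ln x -> 0).
    """
    x, m_hat = np.asarray(x, float), np.asarray(m_hat, float)
    nz = x > 0
    return float(-2.0 * np.sum(x[nz] * np.log(m_hat[nz] / x[nz])))
```

Both values are then referred to the chi-squared distribution with the degrees of freedom given above.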
Several Monte Carlo studies have investigated the behavior of Pearson's χ² and/or G² for small or moderate sample sizes (Agresti & Yang, 1987; Berry & Mielke, 1988; Chapman, 1976; Haber, 1984; Hosmane, 1986, 1987; Koehler, 1986; Koehler & Larntz, 1980; Larntz, 1978; Lawal, 1984; Milligan, 1980; Rudas, 1986; Upton, 1982). Most of them concentrate on two-way tables, whereas less is known about higher-order interactions, which require iterative algorithms. When Pearson's χ² and G² are compared within the same study, results are heterogeneous for one-way tables, whereas most of the studies investigating two-way and higher-order tables find a tendency for Pearson's χ² to perform somewhat better than G².
One-way tables were investigated by Chapman (1976) and Koehler and Larntz (1980) with heterogeneous results. Lawal (1984), who compared four test statistics including Pearson's χ² and G², found Pearson's χ² to be closest to the nominal alpha.
Two-way tables with 2x2 cells were investigated by Upton (1982) with sample sizes ranging from n = 14 to 96. He compared twelve test statistics, including Pearson's χ² and G². He obtained the best results when Pearson's χ² was modified by a multiplication factor of (n − 1)/n. However, Pearson's χ² also performed well without the correction factor.
Hosmane (1986) investigated two-way tables with 2x2 up to 9x9 cells and sample sizes ranging from 10 to 190. He compared Pearson's χ² and G² to some modifications and concluded that χ² without any modification yields the best results.
Agresti and Yang (1987) investigated contingency tables with 2x3 up to 10x10 cells and sample sizes of N = 50 or 100. They found acceptable results for Pearson's χ² if the expected average cell frequency was not smaller than one. For direct testing of a log-linear model the distribution of Pearson's χ² was closer to the asymptotic chi-squared distribution than that of G². On the other hand, for comparing two unsaturated models G² outperformed Pearson's χ² in many cases.
Berry and Mielke (1988) investigated 2x2 up to 3x4 tables with sample sizes ranging from 20 to 80. They compared five test statistics and found a nonasymptotic chi-squared test to be superior in overall performance to the other four, including Pearson's χ² and G².
Three-way tables with 2x2x2 cells and sample sizes from 10 to 90 were studied by Milligan (1980). Both Pearson's χ² and G² showed a tendency toward depressed α-levels for main effects and 2-factor interactions, whereas considerably inflated α-levels occurred in some cases for the 3-factor interactions with Pearson's χ². Three-way tables with 2x2x2 cells were also studied by Haber (1984) using larger sample sizes, ranging from 40 to 400. He compared six test statistics, including Pearson's χ² and G². Among the tests which do not inflate the nominal significance level, Pearson's χ² was found to be the most powerful.
Besides several one- and two-way contingency tables, Larntz (1978) also investigated a 3x3x3 design. He compared Pearson's χ² to G² and a further statistic and concluded that Pearson's χ² should be preferred because its Type I error rates were closest to the nominal level.
Rudas (1986) studied two- and three-way tables with 2x2 up to 3x3x5 cells and sample sizes ranging from 15 to 150. He compared Pearson's χ², G² and a further statistic suggested by Cressie and Read (1985). The results for Pearson's χ² and the Cressie and Read statistic were very similar, and these two statistics were found more appropriate than G² for small sample sizes. Koehler (1986) also studied two- and three-way tables and 2^k tables. He found acceptable results for Pearson's χ² except for some cases of sparse tables containing both very small and moderately large expected frequencies. The accuracy of G², on the other hand, was judged generally unacceptable, producing greatly inflated Type I error levels in some cases and deflated levels in others. For very large and sparse tables other asymptotic procedures were preferable to Pearson's χ² and G².
Hosmane (1987), who investigated tables with 2x2x2 to 4x4x3 cells and sample sizes ranging from 10 to 200, compared five test statistics including Pearson's χ² and G². He recommends Pearson's χ² as the test statistic closest to the nominal level α.
The present paper contributes to this field by focusing on three-way and four-way designs. Whereas many of the previous studies used very small sample sizes in order to examine the behavior of the test statistics at the lower end, the present study will focus on sample sizes which may be judged realistic and reasonable for psychological research. That means that sample sizes should be sufficiently large to provide acceptable statistical power at least for large effects.
As the results of an iterative estimation procedure may depend on the quality of the numerical algorithm, it was decided to employ for our simulation study a widely used procedure from a well-known statistical package, with the program options left at their default settings. The procedure "Hiloglinear" from SPSS was chosen, with the aim of deriving guidelines as to when the results from this procedure can be used without a risk of severely inflated or deflated Type I error rates, and which of the two test statistics provided by the program should be preferred.
2. The Monte Carlo study
Our simulation study includes designs with two categories per variable (2x2, 2x2x2, 2x2x2x2) and designs with up to four categories per variable (4x3, 4x3x2, 4x3x2x2). The first series of simulations was run with uniform marginals, sampling from a null model with zero main effects and zero interactions. The sample sizes chosen were 2.5, 5 or 10 times the number of cells. Smaller sample sizes were not considered because of the lack of statistical power that would result. According to Erdfelder (1992) and Erdfelder, Faul and Buchner (1996), a sample size of N = 32 is required for a chi-squared test with one degree of freedom to reach a power value of 0.8 at alpha = .05 for large effects, and N = 88 for medium effects; for a chi-squared test with six degrees of freedom the required sample sizes are N = 55 and N = 152, respectively (though these values are also based on asymptotic theory, they may nevertheless be used as rough estimates).
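These power figures can be reproduced approximately from the noncentral chi-squared distribution, with noncentrality parameter λ = N·w² for Cohen's (1988) effect size w (w = 0.5 large, w = 0.3 medium). A sketch, assuming SciPy is available (the function name is our own):

```python
from scipy.stats import chi2, ncx2

def required_n(effect_w, df, alpha=0.05, power=0.80):
    """Smallest total N for which a chi-squared test with `df` degrees of
    freedom reaches the target power against an effect of size w, using
    the noncentrality parameter lambda = N * w**2."""
    crit = chi2.ppf(1 - alpha, df)  # critical value of the central chi-squared
    n = 1
    while ncx2.sf(crit, df, n * effect_w ** 2) < power:
        n += 1
    return n
```

For example, `required_n(0.5, 1)` and `required_n(0.3, 1)` reproduce the df = 1 values quoted above, and `required_n(0.5, 6)` and `required_n(0.3, 6)` the df = 6 values.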
A further series of simulations was run with the same designs and the same sample sizes, but differing marginal distributions. For designs with two categories the marginals were .7, .3 or .8, .2; for designs with three- and four-category variables the marginals were .1, .2, .3, .4 for variable A, .1, .4, .5 for variable B, and .5, .5 or .8, .2 for variables C and D. The population values for all interactions were zero. Due to the differences in the marginal probabilities there was considerable variation in the expected cell frequencies within the designs, and in many cases some of the expected frequencies were smaller than one.
3. The simulation program²
The data were generated by an SPSS input program as follows: The SPSS procedure "uniform" was used to generate for each person a value for each variable, drawn from a uniform distribution on the interval [0, 1]; e.g., for a 4x3x2x2 design, four variables A, B, C, D. Next, the variables were divided into categories: e.g., when variable A was designed to have four categories a1 to a4 with marginal probabilities .1, .2, .3, .4, a person was assigned to category a1 if his/her value for A was in the range 0 to .1, to category a2 if it was between .1 and .3, etc. The procedure continued with categorizing the next variable until the person was categorized completely. Then the values for the next person were drawn and categorized, until a sample of the required size was completed.
Then the input program was closed and the procedure "Hiloglinear" was started. This input program, including the line calling the procedure "Hiloglinear", was written into an SPSS macro which was executed 10 000 times. This resulted in a very long SPSS output, which was filtered by a program written in UNIX awk to find the lines with the results of the significance tests. The fields containing the p-values were transferred to a new file, which was then used as an input file for an SPSS program to assess the empirical distribution of the p-values. The p-values ≤ .05 were counted to obtain empirical rejection rates.
The procedure "Hiloglinear" from SPSS for Unix Release 5.0 is described in the manual "SPSS Statistical Algorithms" as follows: First, maximum likelihood estimates of the expected cell frequencies under the specific model under investigation are computed using the iterative proportional fitting algorithm as described by Fienberg (1977). The following default values for the program options were left unchanged: To avoid problems with zero frequencies, the program adds a constant delta = .5 to all empirical cell frequencies before the fitting algorithm is started. The fitting algorithm stops if the largest change of an expected cell frequency in two consecutive iterations is less than .25, or after 20 iterations have been executed.

² The students Mrs. Sigrid Kühl, Mrs. Verena Polz and Mrs. Karina Wahl were engaged in the development and running of the simulation program.
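The estimation step, iterative proportional fitting, can be sketched as follows (an illustrative Python version, not the SPSS implementation; the delta adjustment is left out, and `margins` lists the highest-order margins the null model fits):

```python
import numpy as np

def ipf(observed, margins, max_iter=20, tol=0.25):
    """Iterative proportional fitting: start from a flat table and rescale
    it so that each fitted margin matches the corresponding observed
    margin.  The stopping rule mirrors the SPSS defaults quoted above:
    stop when the largest change between consecutive iterations is below
    `tol`, or after `max_iter` iterations."""
    obs = np.asarray(observed, float)
    fit = np.full(obs.shape, obs.sum() / obs.size)
    for _ in range(max_iter):
        old = fit.copy()
        for axes in margins:
            other = tuple(ax for ax in range(obs.ndim) if ax not in axes)
            target = obs.sum(axis=other, keepdims=True)   # observed margin
            current = fit.sum(axis=other, keepdims=True)  # fitted margin
            ratio = np.divide(target, current,
                              out=np.zeros_like(current), where=current > 0)
            fit = fit * ratio
        if np.max(np.abs(fit - old)) < tol:
            break
    return fit
```

For a two-way independence model, `margins=[(0,), (1,)]` reproduces the familiar row total × column total / N fitted counts.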
From these expected frequencies and the empirical cell counts, the two test statistics Pearson's χ² and G² are computed, and significance tests for the fit of the model are performed using either of the two statistics. The resulting p-values are given.
The output provides significance tests for the fit of the following models:
M0: Global null hypothesis. All main effects and interactions are zero.
M1: Model with free parameters for the main effects. 2-factor and higher interactions are zero.
M2: Model with free parameters for main effects and 2-factor interactions. 3-factor and higher interactions are zero.
M3: Model with free parameters for main effects, 2-factor and 3-factor interactions. The 4-factor interaction is zero.
Next, hypotheses about main effects, 2-factor, 3-factor and 4-factor interactions are tested separately by hierarchical model comparisons (chi-square difference tests for nested models):

Test whether the marginals differ significantly from uniform distributions (M1 vs. M0)
Test of the 2-factor interactions (M2 vs. M1)
Test of the 3-factor interactions (M3 vs. M2)
Test of the 4-factor interaction (saturated model with main effects and all interactions, including the 4-factor interaction, vs. M3)
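Each comparison in this list is a chi-squared difference test: the difference of the two G² values is referred to a chi-squared distribution with df equal to the difference of the two models' degrees of freedom. A sketch assuming SciPy (the function name is our own):

```python
from scipy.stats import chi2

def g2_difference_test(g2_restricted, df_restricted, g2_full, df_full):
    """Chi-squared difference test for nested log-linear models.

    Returns the G^2 difference, its degrees of freedom, and the p-value;
    the restricted model (e.g. M1) must be nested in the full model
    (e.g. M2)."""
    diff = g2_restricted - g2_full
    df = df_restricted - df_full
    return diff, df, chi2.sf(diff, df)
```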
Furthermore, significance tests using G² are performed for each of the main effects and each of the interactions separately ("partial associations"). These tests, too, are based on hierarchical comparisons: e.g., when in a model with four variables the 3-factor interaction ABC is to be tested, two models with the following higher-order marginals fitted to the data are compared:

H0: model with all main effects, all 2-factor interactions and the 3-factor interactions ABD, BCD and ACD (i.e., all 3-factor interactions except ABC)
H1: model with all main effects, all 2-factor interactions and all 3-factor interactions, including ABC
In the next section, simulation results will be reported for the significance tests of the global null hypothesis, for groups of hypotheses (all main effects, all 2-factor interactions, etc.) and, in some cases, also for partial associations.
4. Results
Table 1 gives the empirical rejection rates corresponding to a nominal level of alpha = .05 under the condition of the complete null hypothesis, i.e., with uniform marginal distributions. As each entry in the table is based on 10 000 samples, 95 percent of the values should fall into the interval .046 to .054 if the true alphas were .05. Even a rough inspection of Table 1 shows that the majority of the values (64%) lie outside this range and thus depart significantly from the nominal value. However, when applying asymptotic results to finite sample sizes one cannot expect exact error rates. In what follows we will call increased empirical values up to .075 "moderately" inflated; higher values will be called "severely" inflated. Severely inflated values are underlined in the tables. On the other hand, empirical error rates below the nominal level indicate a conservative significance test, presumably with increased Type II error rates. Therefore, values equal to or below .025 are printed in italics.
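The interval .046 to .054 quoted above follows from the binomial standard error of an empirical rejection rate over 10 000 replications:

```python
import math

def mc_interval(p, n_samples, z=1.96):
    """95% interval for an empirical rejection rate estimated from
    n_samples Monte Carlo replications: p +/- z * sqrt(p(1-p)/n)."""
    half = z * math.sqrt(p * (1 - p) / n_samples)
    return p - half, p + half

lo, hi = mc_interval(0.05, 10_000)  # approximately (.046, .054)
```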
Table 1 shows a general tendency for G² to yield higher empirical alphas than Pearson's χ² (in 60 cases out of 72 the value for G² is higher than that for Pearson's χ², in 1 case equal, and in 11 cases lower). Most of the cases of severely increased alpha values occur for G², whereas severely depressed alphas occur for neither statistic. Giving equal weight to discrepancies in either direction (increased or depressed values), Pearson's χ² is found to be closer to the nominal alpha than G² in 52 cases, equal in 2 cases, and more discrepant than G² in 18 cases. Thus one may state that Pearson's χ² is on the whole closer to the nominal level than G².
Next we will look at the results for the two statistics in more detail:
Pearson's χ²: There are only two cases of severely inflated alpha levels, both occurring when 4-factor interactions are tested with the smallest sample sizes (2.5 times the number of cells). When the sample sizes are increased, the empirical error rates improve and differ only "moderately" from the nominal level.
Table 1: Empirical alpha-values for a nominal alpha = .05 when the complete null hypothesis is true (uniform marginals, no interactions)

                          significance tests
Design     N    n    global       marginals    2-way        3-way        4-way
                                               interactions interactions interactions
                     G²    χ²     G²    χ²     G²    χ²     G²    χ²     G²    χ²
2x2        10   2.5  .077  .036   .047  .059   .100  .052
2x2        20   5    .048  .038   .052  .046   .058  .051
2x2        40   10   .053  .042   .048  .048   .050  .050
4x3        30   2.5  .091  .042   .057  .053   .102  .045
4x3        60   5    .067  .047   .056  .053   .071  .046
4x3        120  10   .057  .048   .056  .057   .056  .047
2x2x2      20   2.5  .083  .043   .059  .049   .076  .050   .095  .069
2x2x2      40   5    .067  .045   .048  .050   .062  .047   .068  .060
2x2x2      80   10   .056  .047   .052  .055   .055  .049   .060  .058
4x3x2      60   2.5  .106  .056   .053  .058   .075  .045   .129  .063
4x3x2      120  5    .078  .054   .059  .060   .062  .050   .082  .057
4x3x2      240  10   .060  .053   .054  .055   .055  .050   .060  .053
2x2x2x2    40   2.5  .095  .045   .052  .055   .065  .043   .107  .065   .111  .078
2x2x2x2    80   5    .068  .047   .053  .059   .055  .043   .067  .050   .077  .069
2x2x2x2    160  10   .060  .053   .054  .053   .054  .047   .061  .054   .058  .057
4x3x2x2    120  2.5  .141  .054   .057  .068   .063  .044   .122  .052   .160  .091
4x3x2x2    240  5    .086  .056   .052  .061   .055  .047   .078  .048   .098  .073
4x3x2x2    480  10   .065  .054   .048  .055   .054  .048   .064  .051   .068  .063

N = total sample size
n = N divided by the number of cells
Apart from a conservative tendency for 2x2 tables with small sample sizes, the tests for the global null hypothesis lead to acceptable Type I error rates (.042 to .056). The tests for the main effects, 2-factor and 3-factor interactions lead to somewhat larger but still "moderate" departures from the nominal level (.043 to .069). The tests for the 4-factor interactions lead to the two severely inflated values mentioned above, occurring with sample sizes of 2.5 times the cell number. For sample sizes equaling 5 or 10 times the cell number, these error rates were only moderately inflated.
The likelihood-ratio G²: Whereas the results for the main effects are satisfactory (.047 to .059), 2-factor and higher interactions show an increasing tendency to yield inflated Type I error rates. Severely inflated empirical alpha levels occur for all designs with sample sizes of only 2.5 times the number of cells, and for many cases with sample sizes of 5 times the cell number. Only with sample sizes equal to 10 times the cell number do all departures stay in the range defined as only "moderately" discrepant.
Tables 2a and 2b give the empirical rejection rates for all simulations with main effects present, i.e., with marginals differing from the uniform distribution. The sample sizes used are the same as in Table 1, equaling 2.5, 5 or 10 times the cell number. Yet, due to the variation in the marginal probabilities, there was considerable variation in the expected cell frequencies within a design. The smallest expected cell frequency within a design is given in the column headed "min E(ni)" in Tables 2a and 2b.

The alternative hypothesis was true for the global tests and the tests of the marginals. Except for the smallest sample sizes (N = 10, N = 20), statistical power exceeded 0.90 in all cases. As the results for the global tests and the tests for the marginal distributions were very much alike, only the results for the marginal distributions are reported.
Table 2a refers to designs with two categories per variable (2x2 to 2x2x2x2), whereas Table 2b contains the results for designs with up to four categories per variable (4x3 to 4x3x2x2). Taking the two groups of designs together, the behavior of the two test statistics may be summarized as follows:

Pearson's χ² leads to only moderate discrepancies from the nominal level (empirical levels .043 to .071) when the total sample size equals at least 5 times the cell number and the smallest expected cell frequency is larger than 0.5. With expected cell frequencies smaller than 0.5, serious departures occur in either direction (severely inflated or depressed empirical levels) without following a simple pattern.
Table 2a: Empirical rejection rates for a nominal level of alpha = .05, when main effects are present but all interactions are zero. Designs with only two-categorial variables.

                                        significance tests
Design     N    n    min    marginals  marginals#   2-way        3-way        4-way
                     E(ni)             G²    χ²     G²    χ²     G²    χ²     G²    χ²
2x2        10   2.5  .9     .3 .7      .331  .361   .071  .046
2x2        20   5    1.8    ''         .647  .607   .074  .045
2x2        40   10   3.6    ''         .922  .913   .065  .052
2x2        10   2.5  .4     .2 .8      .697  .744   .034  .054
2x2        20   5    .8     ''         .970  .957   .055  .049
2x2        40   10   1.6    ''         1.00  1.00   .070  .046
2x2x2      20   2.5  .54    .3 .7      .768  .725   .088  .065   .070  .044
2x2x2      40   5    1.08   ''         .977  .972   .069  .051   .072  .057
2x2x2      80   10   2.16   ''         1.00  1.00   .053  .047   .068  .055
2x2x2      20   2.5  .16    .2 .8      .990  .990   .059  .086   .036  .016
2x2x2      40   5    .32    ''         1.00  1.00   .068  .065   .056  .039
2x2x2      80   10   .64    ''         1.00  1.00   .069  .058   .058  .052
2x2x2x2    80   5    .65    .3 .7      1.00  1.00   .062  .052   .096  .071   .081  .051
2x2x2x2    160  10   1.30   ''         1.00  1.00   .053  .050   .069  .053   .079  .061
2x2x2x2    80   5    .13    .2 .8      1.00  1.00   .081  .101   .071  .090   .022  .011
2x2x2x2    160  10   .26    ''         1.00  1.00   .058  .069   .078  .081   .050  .032

# As the alternative hypothesis holds for the main effects, the values in these columns are empirical power-values.
N = total sample size
n = N divided by the number of cells
min E(ni) = smallest expected cell frequency
Table 2b: Empirical rejection rates for a nominal level of alpha = .05, when main effects are present but all interactions are zero. Designs with more than two categories per variable.

                                                                       significance tests
Design    N    n    min    marginals                                  marginals#   2-way        3-way        4-way
                    E(ni)                                             G²    χ²     G²    χ²     G²    χ²     G²    χ²
4x3       60   5    .6     .1 .2 .3 .4 / .1 .4 .5                     1.00  1.00   .065  .046
4x3       120  10   1.2    ''                                         1.00  1.00   .065  .048
4x3x2     120  5    .6     .1 .2 .3 .4 / .1 .4 .5 / .5 .5             1.00  1.00   .071  .060   .081  .043
4x3x2     240  10   1.2    ''                                         1.00  1.00   .063  .056   .081  .052
4x3x2     120  5    .24    .1 .2 .3 .4 / .1 .4 .5 / .2 .8             1.00  1.00   .074  .076   .067  .042
4x3x2     240  10   .48    ''                                         1.00  1.00   .066  .064   .072  .049
4x3x2x2   240  5    .6     .1 .2 .3 .4 / .1 .4 .5 / .5 .5 / .5 .5     1.00  1.00   .061  .060   .102  .064   .089  .056
4x3x2x2   480  10   1.2    ''                                         1.00  1.00   .054  .052   .081  .053   .091  .065
4x3x2x2   240  5    .24    .1 .2 .3 .4 / .1 .4 .5 / .5 .5 / .2 .8     1.00  1.00   .062  .072   .112  .093   .060  .033
4x3x2x2   480  10   .48    ''                                         1.00  1.00   .050  .056   .088  .064   .082  .053
4x3x2x2   240  5    .10    .1 .2 .3 .4 / .1 .4 .5 / .2 .8 / .2 .8     1.00  1.00   .067  .096   .117  .125   .028  .019
4x3x2x2   480  10   .19    ''                                         1.00  1.00   .066  .071   .010  .085   .060  .034

# As the alternative hypothesis holds for the main effects, the values in these columns are empirical power-values.
N = total sample size
n = N divided by the number of cells
min E(ni) = smallest expected cell frequency
Table 3: Empirical alpha-values for a nominal alpha = .05 when testing partial associations using G². Two design conditions selected from Table 2b.

a) Design 4x3x2:

N    n    min E(ni)  marginals                           AB    AC    BC    ABC
120  5    .6         .1 .2 .3 .4 / .1 .4 .5 / .5 .5      .072  .061  .062  .081
240  10   1.2        ''                                  .063  .056  .054  .081

b) Design 4x3x2x2:

N    n    min E(ni)  marginals                                AB    AC    BC    AD    BD    CD
240  5    .6         .1 .2 .3 .4 / .1 .4 .5 / .5 .5 / .5 .5   .063  .056  .054  .055  .051  .051
480  10   1.2        ''                                       .052  .052  .052  .056  .051  .050

N    n    min E(ni)  marginals                                ABC   ABD   ACD   BCD   ABCD
240  5    .6         .1 .2 .3 .4 / .1 .4 .5 / .5 .5 / .5 .5   .097  .096  .076  .077  .089
480  10   1.2        ''                                       .078  .071  .059  .061  .091

N = total sample size
n = N divided by the number of cells
min E(ni) = smallest expected cell frequency
The likelihood-ratio G² yields only moderate discrepancies from the nominal level for the 2-factor interactions (empirical levels .053 to .070) when the total sample size equals 10 times the cell number and the smallest expected cell frequency is larger than 1. However, even when these conditions are satisfied, the significance tests of the 3- and 4-factor interactions lead in many cases to seriously increased alpha levels. Therefore, G² cannot be recommended for these tests.
Comparing Pearson's χ² to G², we find that in all cases which satisfy the above rule for Pearson's χ² (total sample size 5 times the cell number, smallest expected cell frequency > .5), Pearson's χ² is closer to the nominal level than G².
Table 3 shows the results for the tests of the partial associations in some of the larger designs (4x3x2 and 4x3x2x2) using G². These results also indicate that severely inflated alpha levels occur for 3- and 4-factor interactions even under the conditions of large sample sizes (10 times the cell number) and smallest expected cell frequencies larger than 1.
4.1. Supplementary results
Since the goodness-of-fit values obtained for a model, and the resulting p-values, may depend to some extent on the quality of the numerical optimization procedure, we wanted to check whether an increase in numerical accuracy would affect the obtained error rates.
Two designs were chosen: the 2x2x2 design with unequal marginals (.8, .2) and sample size N = 80, and the 2x2x2x2 design with unequal marginals (.8, .2) and sample size N = 160. The latter had produced severely inflated error rates for both Pearson's χ² and G². The program options for numerical accuracy were changed from the default values of a maximum of 20 iterations and a convergence criterion of 0.25 to a maximum of 50 iterations and a convergence criterion of .05. For each of the two designs, a simulation with 10 000 data sets was run under the improved accuracy conditions. The results were very close to those under the default conditions. In particular, the severely increased Type I error rates for the 3-factor interactions in the 2x2x2x2 design did not improve.
Next, the value of the constant delta, which is added to all observed cell frequencies to avoid problems with empty cells, was changed from its default value of 0.5 to 0.1. The same two design conditions as before (2x2x2 with marginals .8, .2, N = 80; 2x2x2x2 with marginals .8, .2, N = 160) and two further conditions (2x2x2x2 with marginals .7, .3, N = 160; 4x3x2x2 with uniform marginals, N = 480) were chosen. The results are heterogeneous:
There were substantial improvements in many cases, especially for G², but also deteriorations.
For those cases which satisfy the conditions given above for Pearson's χ² (sample size at least 5 times the number of cells, smallest expected cell frequency larger than .5), no severe departures from the nominal level occurred for either Pearson's χ² or G², and Pearson's χ² was always closer to the nominal level than G².
Furthermore, we wanted to check whether the procedure "Hiloglinear" in "SPSS for Windows 8.0" differs in any respect from the procedure in "SPSS for UNIX Release 5.0" which was used in our simulations. We chose the last six design conditions from Table 2b (4x3x2x2 designs with unequal marginals) and generated one sample for each condition. These samples were analyzed using both "Hiloglinear" from "SPSS for Windows 8.0" and from "SPSS for UNIX 5.0", with the program options left at their default values (delta = .5, convergence = .25, iterate = 20). The output from the two versions of SPSS was identical. Then the options were changed to delta = 0, convergence = .01 and iterate = 50. Again, the output was the same for both SPSS versions.
5. Discussion
The results of the present study are in accordance with the majority of previous findings in concluding that Pearson's χ² is generally closer to the nominal level than G². Pearson's χ² did not lead to seriously increased type I error rates when the smallest expected cell frequency was larger than .5 and the total sample size was at least 5 times the number of cells. This rule was found to hold for main effects and 2-factor interactions as well as for higher order interactions. For G², on the other hand, a rule was found only for main effects and 2-factor interactions (to avoid seriously increased type I error rates the smallest expected cell frequency should be larger than 1 and the total sample size at least 10 times the cell number), whereas in many cases significance tests of higher order interactions led to seriously increased alpha levels even when this rule was satisfied.
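For reference, the two statistics governed by these rules have the standard definitions (not restated elsewhere in this section), with observed cell frequencies $n_i$ and expected frequencies $\hat{m}_i$ estimated under the model in question:

```latex
X^2 = \sum_i \frac{(n_i - \hat{m}_i)^2}{\hat{m}_i},
\qquad
G^2 = 2 \sum_i n_i \,\ln\frac{n_i}{\hat{m}_i}.
```

Both are referred to the same asymptotic chi-squared distribution; the rules above describe when that approximation can be trusted for each statistic.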
However, a Monte Carlo study can only yield limited information on the behavior of a test statistic, as only a limited number of cases can be simulated, and the cases included will usually differ in some respects from the conditions that a researcher is faced with when analyzing his/her data. In the present study we aimed at choosing the conditions in a way that covers the most typical cases of psychological research, and as a result
of our study we defined an area of conditions where one may feel on the safe side using SPSS standard procedures.
Nevertheless, important questions are left open: What should one do when the above rules for applying asymptotic procedures are not satisfied? For example, when higher order interactions are to be tested and the smallest expected cell frequencies are smaller than .5, i.e., too small for either G² or Pearson's χ²?
In such situations it would be desirable to provide bootstrapping procedures which
enable the researcher to start his/her own simulation study with his/her specific design
conditions and parameter values and with the actual values for computational accuracy
and the constant delta. Based on this simulation the distribution of the test statistic
could be assessed empirically and used instead of the asymptotic distribution to compute the tail probability needed for the significance test. For example, when a 4-factor interaction is to be tested one might proceed as follows: First, the empirical data are used to estimate the parameters under the null hypothesis model. As the 4-factor interaction is in question, this is a model containing main effects, 2-factor and 3-factor interactions, but no 4-factor interaction. Next, a large number of data sets with the actual sample size is generated from this model. To each data set the null hypothesis model, with all parameters free except the 4-factor interaction, and the alternative model, with all parameters free including the 4-factor interaction (i.e., the saturated model), are fitted. By comparing these two models the test statistic for the 4-factor interaction is computed, and the distribution of this statistic over the samples is assessed.
To perform the significance test for the 4-factor interaction one may use the empirical distribution of the test statistic to estimate the 95th percentile point and use it as the critical value, or, alternatively, one may estimate the tail probability (p-value) from the proportion of samples which have led to a value exceeding that computed from the real data set.
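As an illustration of the procedure just described, the following is a minimal sketch in Python/NumPy (not the PANMARK implementation discussed below). It fits the model without the highest-order interaction by iterative proportional fitting over all lower-dimensional margins, and, since the saturated alternative fits perfectly, the model comparison reduces to the goodness-of-fit G² of the null model. The example table and all tuning constants are hypothetical:

```python
import numpy as np

def ipf_fit(table, n_iter=100, tol=1e-8):
    """Fit, by iterative proportional fitting, the hierarchical log-linear
    model containing every effect EXCEPT the highest-order interaction,
    i.e. the model that matches all (d-1)-way margins of a d-way table."""
    fit = np.full(table.shape, table.sum() / table.size)
    for _ in range(n_iter):
        old = fit.copy()
        for ax in range(table.ndim):
            obs_m = table.sum(axis=ax, keepdims=True)
            fit_m = fit.sum(axis=ax, keepdims=True)
            ratio = np.divide(obs_m, fit_m,
                              out=np.zeros_like(obs_m, dtype=float),
                              where=fit_m > 0)
            fit = fit * ratio
        if np.abs(fit - old).max() < tol:
            break
    return fit

def g2(obs, exp):
    # Likelihood-ratio statistic; zero observed cells contribute 0
    m = (obs > 0) & (exp > 0)
    return 2.0 * np.sum(obs[m] * np.log(obs[m] / exp[m]))

def bootstrap_p(table, n_boot=1000, rng=None):
    """Parametric-bootstrap p-value for the highest-order interaction:
    generate tables from the fitted null model and count how often the
    simulated statistic exceeds the observed one."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(table.sum())
    null_fit = ipf_fit(table)
    g2_obs = g2(table, null_fit)
    probs = (null_fit / null_fit.sum()).ravel()
    exceed = 0
    for _ in range(n_boot):
        boot = rng.multinomial(n, probs).reshape(table.shape).astype(float)
        exceed += g2(boot, ipf_fit(boot)) >= g2_obs
    return exceed / n_boot

# Hypothetical 2x2x2 data; here the null hypothesis is "no 3-factor interaction"
table = np.array([[[20., 10.], [10., 5.]],
                  [[10., 5.], [5., 15.]]])
p_value = bootstrap_p(table, n_boot=500, rng=np.random.default_rng(7))
```

The same code applies unchanged to a 4-way table, where the fitted null model contains all 3-factor interactions but no 4-factor interaction.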
Similarly, one may also obtain estimated power values: a further simulation study might be run, generating Monte Carlo samples from a model with parameter values specified according to the respective alternative hypothesis. The proportion of cases leading to a test statistic exceeding the 95th percentile point obtained under the null hypothesis yields an estimate of statistical power for this alternative hypothesis.
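This power simulation can be sketched the same way. To keep the example short and self-contained it uses Pearson's χ² for independence in a 2x2 table; the cell probabilities of the alternative are hypothetical, and the critical value is the simulated 95th percentile under the null hypothesis, exactly as described above:

```python
import numpy as np

rng = np.random.default_rng(1)

def x2_indep(t):
    # Pearson X^2 against independence in a two-way table
    exp = np.outer(t.sum(axis=1), t.sum(axis=0)) / t.sum()
    ok = exp > 0
    return ((t[ok] - exp[ok]) ** 2 / exp[ok]).sum()

n, reps = 100, 4000

# Step 1: simulate under H0 (independence, all marginals .5) and take the
# 95th percentile of the statistic as the critical value.
p0 = np.outer([0.5, 0.5], [0.5, 0.5]).ravel()
null = [x2_indep(rng.multinomial(n, p0).reshape(2, 2).astype(float))
        for _ in range(reps)]
crit = np.quantile(null, 0.95)

# Step 2: simulate under a hypothetical alternative with association and
# count how often the statistic exceeds the critical value.
p1 = np.array([0.35, 0.15, 0.15, 0.35])
alt = [x2_indep(rng.multinomial(n, p1).reshape(2, 2).astype(float))
       for _ in range(reps)]
power = float(np.mean(np.array(alt) >= crit))
```

With this alternative the noncentrality is sizeable, so the estimated power comes out high; weaker alternatives can be explored simply by moving p1 closer to p0.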
This approach, which has been implemented in the program PANMARK by Van de
Pol, Langeheine and De Jong (1991), is called "sophisticated bootstrapping" or "parametric bootstrapping" and has been presented and demonstrated with a variety of log-
linear models and latent class models also by Langeheine, Pannekoek and Van de Pol (1996) and Langeheine, Van de Pol and Pannekoek (1997). Von Davier (1997) developed parametric bootstrapping procedures for various item response models, as for these models the number of cells (= possible response patterns) becomes large even for moderate item numbers, and hence the sample sizes available are in practice nearly always too small to use asymptotic tests such as Pearson's χ² or G².
So far, parametric bootstrapping is the only alternative when the sample size is known to be too small for asymptotic procedures. Furthermore, it may also be helpful when the actual design conditions differ substantially from those covered by our tables, so that generalizations from our Monte Carlo study become hazardous.
As similar considerations apply in principle to all other asymptotically derived significance tests, it would be desirable if parametric bootstrapping procedures were included in all major statistical packages whenever asymptotic tests are applied and only rough knowledge is available about their behavior with finite sample sizes.
References
[1] Agresti, A. & Yang, M.C. (1987). An empirical investigation of some effects of sparseness in contingency tables. Computational Statistics and Data Analysis, 5, 9-21.
[2] Berry, K.J. & Mielke, P.W. Jr. (1988). Monte Carlo comparisons of the asymptotic
chi-square and Likelihood-ratio tests with the nonasymptotic chi-square test for
sparse r x c tables. Psychological Bulletin, 103, 256-264.
[3] Brandstätter, E. (1999). Confidence interval as an alternative to significance testing.
Methods of Psychological Research Online, Vol. 4, No. 2. http://www.mpr-online.de
[4] Chapman, J.W. (1976). A comparison of the X², −2 log R, and multinomial probability criteria for significance tests when expected frequencies are small. Journal of the American Statistical Association, 71, 854-863.
[5] Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale, N.J.: Lawrence Erlbaum.
[6] Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
[7] von Davier, M. (1997). Methoden zur Prüfung probabilistischer Testmodelle. Kiel: Institut für die Pädagogik der Naturwissenschaften (IPN), Olshausenstraße 62, D-24098 Kiel.
[8] Erdfelder, E., Faul, F. & Buchner, A. (1996). GPOWER: A general power analysis
program. Behavior Research Methods, Instruments, & Computers, 28, 1-11.
[9] Faul, F. & Erdfelder, E. (1992). GPOWER: A priori-, post hoc-, and compromise
power analysis for MS-DOS (Computer program). Bonn: Bonn University.
[10] Fienberg, S.E. (1977). The analysis of cross-classified categorical data. Cambridge, MA: The MIT Press.
[11] Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In
Keren, G. & Lewis, C. (Eds.), A handbook for data analysis in the behavioral sciences. Methodological issues, 311-339. Hillsdale: Lawrence Erlbaum.
[12] Goodman, L.A. (1996). A single general method for the analysis of cross-classified
data: Reconciliation and synthesis of some methods of Pearson, Yule, and Fisher,
and also some methods of correspondence analysis and association analysis. Journal
of the American Statistical Association, 91, 408-428.
[13] Haber, M. (1984). A comparison of tests for the hypothesis of no three-factor interaction in 2 x 2 x 2 contingency tables. Journal of Statistical Computation and Simulation, 20, 205-215.
[14] Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.) (1997). What if there were no significance tests? Mahwah, N.J.: Lawrence Erlbaum.
[15] Harris, R.J. (1997). Reforming significance testing via three-valued logic. In Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.), What if there were no significance tests? Mahwah, N.J.: Lawrence Erlbaum.
[16] Hosmane, B. (1986). Improved likelihood ratio tests and Pearson chi-square tests
for independence in two dimensional contingency tables. Communications in Statistics - Theory and Methods, 15, 1875-1888.
[17] Hosmane, B. (1987). An empirical investigation of chi-square tests for the hypothesis of no three-factor interaction in I x J x K contingency tables. Journal of Statistical Computation and Simulation, 28, 167-178.
[18] Iseler, A. (1997). Signifikanztests: Ritual, guter Brauch und gute Gründe. Methods
of Psychological Research-Online, Diskussionsforum. URL http://www.pabstpublishers.de/impr/forum_e.html.
[19] Jones, L.V. (1999). A sensible reformulation of the significance test. ViSta: The Visual Statistics System. http://forrest.psych.unc.edu/jones-tukey112399.html
[20] Koehler, K.J. (1986). Goodness-of-fit tests for log-linear models in sparse contingency tables. Journal of the American Statistical Association, 81, 483-492.
[21] Koehler, K.J. & Larntz, K. (1980). An empirical investigation of goodness-of-fit
statistics for sparse multinomials. Journal of the American Statistical Association,
75, 336-344.
[22] Langeheine, R., Pannekoek, J. & Van de Pol, F. (1996). Bootstrapping goodness-of-fit measures in categorical data analysis. Sociological Methods & Research, 24, 492-516.
[23] Langeheine, R., Van de Pol, F. & Pannekoek, J. (1997). Kontingenztabellen-Analyse bei kleinen Stichproben: Probleme bei der Prüfung der Modellgültigkeit mittels Chi-Quadrat Statistiken. Empirische Pädagogik, 11, 63-77.
[24] Larntz, K. (1978). Small-sample comparisons of exact levels for chi-squared goodness-of-fit statistics. Journal of the American Statistical Association, 73, 253-263.
[25] Lawal, H.B. (1984). Comparisons of the X², Y², Freeman-Tukey and Williams's improved G² test statistics in small samples of one-way multinomials. Biometrika, 71, 415-458.
[26] Meehl, P.E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (p. 393-426). Mahwah, N.J.: Lawrence Erlbaum.
[27] Milligan, G.W. (1980). Factors that affect Type I and Type II error rates in the
analysis of multidimensional contingency tables. Psychological Bulletin, 87, 238-244.
[28] Mulaik, S.A., Raju, N.S. & Harshman, R.A. (1997). There is a time and a place for significance testing. In Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (p. 65-115). Mahwah, N.J.: Lawrence Erlbaum.
[29] Read, T.R.C. & Cressie, N.A.C. (1988). Goodness-of-fit statistics for discrete multivariate data. New York: Springer.
[30] Reichardt, Ch.S. & Gollob, H.F. (1997). When confidence intervals should be used instead of statistical significance tests, and vice versa. In Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (p. 259-284). Mahwah, N.J.: Lawrence Erlbaum.
[31] Rudas, T. (1986). A Monte Carlo comparison of the small sample behaviour of the
Pearson, the likelihood ratio and the Cressie-Read statistics. Journal of Statistical
Computation and Simulation, 24, 107-120.
[32] Schmidt, F.L. & Hunter, J.E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In Harlow, L.L., Mulaik, S.A. & Steiger, J.H. (Eds.), What if there were no significance tests? (p. 37-64). Mahwah, N.J.: Lawrence Erlbaum.
[33] Sedlmeier, P. (1996). Jenseits des Signifikanztest-Rituals: Ergänzungen und Alternativen. Methods of Psychological Research Online, 1, 41-63.
[34] Sedlmeier, P. (1998). Was sind die guten Gründe für Signifikanztests? Diskussionsbeitrag zu Sedlmeier (1996) und Iseler (1997). Methods of Psychological Research Online, 3, 39-42.
[35] SPSS Statistical Algorithms (n.d.). Chicago: SPSS Inc.
[36] Upton, G.J.G. (1982). A comparison of alternative tests for the 2 x 2 comparative
trial. Journal of the Royal Statistical Society Series A, 145, 86-105.
[37] Van de Pol, F., Langeheine, R. & De Jong, W. (1991). PANMARK user manual:
PANel analysis using MARKov chains. Voorburg: Netherlands Central Bureau of
Statistics.