Performance of five two-sample location tests for skewed distributions with unequal variances

Contemporary Clinical Trials 30 (2009) 490–496
Morten W. Fagerland ⁎, Leiv Sandvik
Ullevål Department of Research Administration, Oslo University Hospital, N-0407 Oslo, Norway
Article info
Article history:
Received 16 March 2009
Accepted 18 June 2009
Keywords:
Two-sample location problem
T test
Welch test
Wilcoxon–Mann–Whitney test
Yuen–Welch test
Brunner–Munzel test
Robustness
Skewness
Heteroscedasticity
Abstract
Tests for comparing the locations of two independent populations are associated with different
null hypotheses, but results are often interpreted as evidence for or against equality of means or
medians. We examine the appropriateness of this practice by investigating the performance of
five frequently used tests: the two-sample T test, the Welch U test, the Yuen–Welch test, the
Wilcoxon–Mann–Whitney test, and the Brunner–Munzel test. Under combined violations of
normality and variance homogeneity, the true significance level and power of the tests depend
on a complex interplay of several factors. In a wide-ranging simulation study, we consider
scenarios differing in skewness, skewness heterogeneity, variance heterogeneity, sample size,
and sample size ratio. We find that small differences in distribution properties can alter test
performance markedly, thus confounding the effort to present simple test recommendations.
Instead, we provide detailed recommendations in Appendix A. The Welch U test is
recommended most frequently, but cannot be considered an omnibus test for this problem.
© 2009 Elsevier Inc. All rights reserved.
1. Introduction
Comparison of locations, or central tendency, of two
independent populations is common in medical research. A
plethora of tests exists, and their suitability depends on the
distribution of the data at hand. The choice of test decides
what can be inferred from the results. This is due to the
different null hypotheses these methods are designed to test.
The two-sample T test is the most common approach. This
is a test of equality of means, but it is derived under the
assumptions that the two distributions are normal with equal
variances. A modification of this test, the Welch U test [1], is
designed for unequal variances, but the assumption of
normality is maintained.
When distributions deviate from normality, several
approaches are available. The most common non-parametric
alternative is the Wilcoxon–Mann–Whitney (WMW) test.
This test is often regarded as a test of equal medians, but this
is not true in general. The correct null hypothesis for this test
is P(X < Y) = 0.5, where X and Y are random samples from the
two populations. The results from the WMW test can be
interpreted as a test of equality of medians only when the two
distributions are identical except for a possible shift in
location [2]. Many attempts have been made to improve the
WMW test. The most prominent of these is the Brunner–
Munzel test [3], which allows for tied values and unequal
variances.
For markedly skewed distributions, the mean can be a
poor measure of central tendency because outliers inflate its
value. This can be ameliorated by removing the smallest and
the largest values in the sample. If an equal number of values
are removed from each tail, the mean of the resulting sample
is called the trimmed mean. Comparing trimmed means can
be done with the Yuen–Welch test [4], which is identical to
the Welch U test for zero amount of trimming.
When using these tests, one must be aware that the results
pertain to the tests' specific null hypotheses. A significant p-value from the WMW test or the Brunner–Munzel test, for
example, can be difficult to interpret beyond noting that the
observations from one of the populations tend to be smaller
than the observations from the other population. According to
Cliff [5], this interpretation has merit in its own right, and he
suggests making inference about P(X > Y) − P(X < Y) as an
alternative to means or other measures of location. In
practice, however, researchers often like to make inference
about the two common measures of central tendency, the
mean and the median, which offer intuitive interpretations.
In medical research, the assumptions of normality and
variance homogeneity are often violated [6,7]. Skewed data
are common in medical research [8], and several well known
variables are known to be markedly skewed, for example
triglyceride level and sedimentation rate. If two skewed
distributions have unequal locations, the variances can be
expected to differ as well. Hence, medical data often exhibit a
combination of skewness and unequal variances.
The purpose of this paper is to investigate to what extent
the five mentioned tests can be appropriately used to
compare means and medians for a wide range of skewed
distributions with varying degrees of unequal variances. Even
though the body of literature on two-sample location tests is
considerable [9,10], a consistent and comprehensive examination of this issue has not been previously presented. For
example, situations where the two distributions have unequal
skewness have not been thoroughly studied, although it has
been shown that both type I errors and power can be affected
[7,11].
The tests will be subjected to quantified robustness
criteria. For each situation, the test or tests with highest
power that maintain true significance levels (p) sufficiently
close to the nominal level (α) will be identified. Bradley [12]
defines criteria for α-robustness as conservative with 0.9α ≤
p ≤ 1.1α and liberal with 0.5α ≤ p ≤ 1.5α. This implies that
closeness be considered sufficient if the true significance
levels are within plus or minus 10% or 50% of the nominal
significance levels. We consider 50% to be too liberal for most
situations, but 10%, 20%, and 40% limits will be studied. We
refer to this as the 10%-, 20%-, and 40%-robustness of the tests.
For a nominal significance level of 5%, this implies that we
accept true significance levels that are in the intervals [4.5,
5.5], [4.0, 6.0], and [3.0, 7.0], respectively.
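The robustness limits above can be illustrated with a short Python sketch. It is not part of the original study (which used Matlab); the function names `robustness_interval` and `is_robust` are hypothetical, and the small numerical tolerance is an assumption added to avoid floating-point edge effects.

```python
def robustness_interval(alpha, level):
    """Bounds on the true significance level p for a given robustness level.

    alpha: nominal significance level (e.g. 0.05)
    level: allowed relative deviation (0.10, 0.20 or 0.40 in the text)
    """
    return alpha * (1.0 - level), alpha * (1.0 + level)


def is_robust(p, alpha, level, tol=1e-12):
    """True if p lies within plus or minus `level` of alpha."""
    lower, upper = robustness_interval(alpha, level)
    return lower - tol <= p <= upper + tol


if __name__ == "__main__":
    for level in (0.10, 0.20, 0.40):
        lower, upper = robustness_interval(0.05, level)
        # Printed in percent, matching the intervals quoted in the text.
        print(f"{round(level * 100)}%: [{lower * 100:.1f}, {upper * 100:.1f}]")
```

For α = 0.05 this prints the intervals [4.5, 5.5], [4.0, 6.0], and [3.0, 7.0].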
2. Clinical example
Hormone therapy (HT) is associated with adverse effects
such as increased risk of arterial and venous thromboembolism, and breast cancer. Eilertsen et al. [13] examined whether
different HT regimens have different effects on blood
coagulation by randomizing 202 healthy women to either
low-dose HT, conventional-dose (high-dose) HT, tibolone, or
raloxifene. The primary outcome measure was D-dimer, a
marker of fibrin production and degradation which can be
used to assess the effect of HT on coagulation.
After six weeks of therapy, the distribution of D-dimer was
considerably skewed in the low-dose HT group and moderately skewed in the high-dose HT group (Fig. 1). Summary
statistics show that the difference in means is 87, the
difference in medians is 103, and the difference in 20%
trimmed means is 89:

                     Low-dose HT   High-dose HT
n                    47            48
Mean                 398           485
Median               307           410
20% trimmed mean     336           425
Std                  284           260
Skewness             3.1           1.8

How strong is the evidence for a difference in location
between the two groups? We calculated the two-sample T
test (p = 0.13), the Welch U test (p = 0.13), the Wilcoxon–
Mann–Whitney test (p = 0.011), the Brunner–Munzel test
(p = 0.010), and the Yuen–Welch test (p = 0.027). The highest p-value is more than ten times the smallest p-value.
Which test should we trust? We return to this example in
section 5.4.
3. Notation and test statistics
Consider two populations A and B. Assume that we have
two independent samples: X with m observations from A, and
Y with n observations from B. The estimated means and
sample variances are:
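$$\bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i, \qquad \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i,$$
and
$$S_X^2 = \frac{1}{m-1}\sum_{i=1}^{m} (X_i - \bar{X})^2, \qquad S_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$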
Fig. 1. Histogram showing the distribution of D-dimer in the low-dose HT (left) and high-dose HT (right) treatment arms after six weeks of the Eilertsen et al. trial
[13]. One outlier in each group was removed.
The two-sample T test is based on the test statistic
$$T = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{1/m + 1/n}},$$
where Sp is the pooled sample standard deviation:
$$S_p^2 = \frac{(m-1)S_X^2 + (n-1)S_Y^2}{m + n - 2}.$$
Under the null hypothesis of equal means, the T statistic
has a t-distribution with m + n − 2 degrees of freedom. It is
assumed that the distributions of A and B are normal with
equal variances.
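As a minimal sketch of the formulas above, the T statistic and its degrees of freedom can be computed with the Python standard library. The function name `two_sample_t` and the example data are hypothetical and serve only to illustrate the arithmetic.

```python
import math
import statistics


def two_sample_t(x, y):
    """Return (T, degrees of freedom) for the pooled-variance T test."""
    m, n = len(x), len(y)
    var_x = statistics.variance(x)  # sample variance S_X^2 (divisor m - 1)
    var_y = statistics.variance(y)  # sample variance S_Y^2 (divisor n - 1)
    pooled_var = ((m - 1) * var_x + (n - 1) * var_y) / (m + n - 2)
    se = math.sqrt(pooled_var) * math.sqrt(1.0 / m + 1.0 / n)
    t = (statistics.mean(x) - statistics.mean(y)) / se
    return t, m + n - 2


if __name__ == "__main__":
    # Illustrative data only, not taken from the paper.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.0, 4.0, 6.0, 8.0, 10.0]
    print(two_sample_t(x, y))
```

The p-value would then be obtained from a t-distribution with m + n − 2 degrees of freedom, which requires a statistics library and is omitted here.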
Welch [1] proposed several modifications of the two-sample T test suitable for situations with unequal variances.
One of these tests, the Welch U test, is available in most
software packages. The appropriate test statistic is
$$U = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{S_X^2}{m} + \dfrac{S_Y^2}{n}}}.$$
U is approximately t-distributed with fU degrees of freedom:
$$f_U = \frac{\left(\dfrac{S_X^2}{m} + \dfrac{S_Y^2}{n}\right)^2}{\dfrac{S_X^4}{m^3 - m^2} + \dfrac{S_Y^4}{n^3 - n^2}}.$$
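Because U and f_U depend on the data only through the sample means, standard deviations, and sample sizes, they can be sketched directly from summary statistics. The example below applies the formulas to the rounded D-dimer summary statistics from section 2; the function name `welch_u_from_summary` is hypothetical, and because the inputs are rounded the result need not reproduce the reported p-value exactly.

```python
import math


def welch_u_from_summary(mean_x, sd_x, m, mean_y, sd_y, n):
    """Return (U, f_U) for the Welch U test from summary statistics."""
    vx = sd_x ** 2 / m  # S_X^2 / m
    vy = sd_y ** 2 / n  # S_Y^2 / n
    u = (mean_x - mean_y) / math.sqrt(vx + vy)
    f_u = (vx + vy) ** 2 / (
        sd_x ** 4 / (m ** 3 - m ** 2) + sd_y ** 4 / (n ** 3 - n ** 2)
    )
    return u, f_u


if __name__ == "__main__":
    # Low-dose HT (X) versus high-dose HT (Y), rounded values from section 2.
    u, f_u = welch_u_from_summary(398, 284, 47, 485, 260, 48)
    print(f"U = {u:.2f}, f_U = {f_u:.1f}")
```

With these inputs U is roughly −1.56 on about 92 degrees of freedom, which is in line with the non-significant result reported for the Welch U test.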
To obtain the sample trimmed means, the amount of
trimming (γ) must be chosen. For general use, γ = 0.2 is a
good choice [11,14]. This corresponds to removing the 20%
smallest and the 20% largest observations in each sample. Let
$\bar{X}_\gamma$ and $\bar{Y}_\gamma$ denote the trimmed means (the mean of the
samples after trimming). The Yuen–Welch test [4] statistic is
given by
$$Y = \frac{\bar{X}_\gamma - \bar{Y}_\gamma}{\sqrt{d_X + d_Y}},$$
where dX and dY are estimates of the squared standard errors.
Calculation of dX and dY is shown in Appendix B. Under the
null hypothesis of equal trimmed means, Y follows a t-distribution with $f_Y$ degrees of freedom,
$$f_Y = \frac{(d_X + d_Y)^2}{\dfrac{d_X^2}{h_X - 1} + \dfrac{d_Y^2}{h_Y - 1}},$$
where hX and hY are the number of observations left in
samples X and Y after trimming.
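The trimming step can be sketched as follows. This is an illustration only: the function name `trimmed_mean` and the example data are hypothetical, and the number of observations removed from each tail is taken as γ times the sample size, rounded down, as stated in Appendix B.

```python
import math


def trimmed_mean(sample, gamma=0.2):
    """Return (trimmed mean, h), where h is the number of observations kept."""
    size = len(sample)
    g = math.floor(gamma * size)       # observations removed from each tail
    kept = sorted(sample)[g:size - g]  # drop the g smallest and g largest
    return sum(kept) / len(kept), len(kept)


if __name__ == "__main__":
    # Illustrative right-skewed sample with one extreme value.
    data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
    print(sum(data) / len(data))  # ordinary mean: 15.4
    print(trimmed_mean(data))     # (6.5, 6)
```

The extreme value 100 pulls the ordinary mean to 15.4, whereas the 20% trimmed mean of 6.5 stays near the bulk of the data.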
The WMW test statistic is based on ranks and involves
calculating
$$W_X = mn + m(m+1)/2 - R_X,$$
where $R_X$ is the sum of the ranks in sample X. Under the null
hypothesis that P(X < Y) = 0.5, $W_X$ is approximately normally
distributed with mean mn/2 and variance mn(m + n + 1)/12.
The statistic
$$W = \frac{W_X - mn/2}{\sqrt{mn(m+n+1)/12}}$$
can be approximated by the standard normal distribution. By
using the exact permutation distribution of ranks, an exact
version of the WMW test can be constructed. Since the exact
test is only practicable for small samples, we do not consider
it. Throughout this paper, references to the WMW test are to
the approximate version of the test.
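A minimal sketch of the approximate WMW test as described above is given below. The function name `wmw_normal_approx` and the example data are hypothetical. Tied values are given average ranks, but the variance formula is the one stated in the text, which carries no tie correction; this is an assumption of the sketch, not a statement about the software used in the study.

```python
import math


def wmw_normal_approx(x, y):
    """Return (W_X, W, two-sided p-value) using the normal approximation."""
    m, n = len(x), len(y)
    pooled = sorted(list(x) + list(y))

    def rank(value):
        # Average 1-based position of `value` in the pooled, sorted sample.
        first = pooled.index(value) + 1
        count = pooled.count(value)
        return first + (count - 1) / 2.0

    r_x = sum(rank(v) for v in x)        # sum of the ranks in sample X
    w_x = m * n + m * (m + 1) / 2.0 - r_x
    w = (w_x - m * n / 2.0) / math.sqrt(m * n * (m + n + 1) / 12.0)
    p = math.erfc(abs(w) / math.sqrt(2.0))  # 2 * (1 - Phi(|w|))
    return w_x, w, p


if __name__ == "__main__":
    # Illustrative data only.
    x = [1.1, 2.3, 3.8, 4.0]
    y = [3.1, 5.2, 6.7, 7.4, 8.9]
    print(wmw_normal_approx(x, y))
```

In this example W_X = 18, which equals the number of (X, Y) pairs with X < Y.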
The Brunner–Munzel test [3] is a modification of the
WMW test designed to handle ties and unequal variances.
Instead of associating ranks with the sample observations,
midranks are computed. Midranks are equal to ranks when
there are no tied values. For tied values, the midranks are the
average of their ranks. The midranks of 2, 5, 5, 6, 9, 9, 9, 10, for
example, are 1, 2.5, 2.5, 4, 6, 6, 6, and 8. Let $\bar{M}_X$ and $\bar{M}_Y$ be the
means of the midranks associated with the samples X and Y
when the data are pooled. The Brunner–Munzel test statistic
is
$$B = \frac{\bar{M}_Y - \bar{M}_X}{(m+n)\sqrt{\dfrac{SB_X^2}{mn^2} + \dfrac{SB_Y^2}{m^2 n}}},$$
where the expressions for SB2X and SB2Y are given in
Appendix B. The distribution of B can be approximated by
a t-distribution with fB degrees of freedom:
$$f_B = \frac{\left(\dfrac{SB_X^2}{n} + \dfrac{SB_Y^2}{m}\right)^2}{\dfrac{SB_X^4}{n^2 (m-1)} + \dfrac{SB_Y^4}{m^2 (n-1)}}.$$
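The midrank computation used by the Brunner–Munzel test can be sketched in a few lines. The function name `midranks` is hypothetical; the input is the midrank example given in the text above.

```python
def midranks(values):
    """Return the midrank of each value, in the original order of `values`."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        # Advance j past the run of values tied with ordered[i].
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # The tied values occupy 1-based positions i+1 .. j; use their average.
        rank_of[ordered[i]] = (i + 1 + j) / 2.0
        i = j
    return [rank_of[v] for v in values]


if __name__ == "__main__":
    print(midranks([2, 5, 5, 6, 9, 9, 9, 10]))
    # [1.0, 2.5, 2.5, 4.0, 6.0, 6.0, 6.0, 8.0]
```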
4. Simulation setup
We examined the significance level and power of the tests
by using computer simulations. Table 1 defines the relevant
parameters of the simulation setup. The choices of these
parameters are discussed below.
Two criteria were used to select sample sizes: the total
sample size had to range from small to large, and the ratio of
the two sample sizes had to correspond to balanced designs
(m = n), and unbalanced designs (m/n > 1 and m/n < 1).
The impact of unequal variances was studied by specifying
the ratio of the standard deviations (θ). The largest standard
deviation was associated with the m size sample X, and the
smallest standard deviation was associated with the n size
sample Y. Values of θ = 1.0, 1.25, 1.5, 2.0, 4.0 were used. When
m > n, the distribution of the largest sample had the largest
variance, and when m < n, the distribution of the largest
sample had the smallest variance.
Different degrees of skewness (β) were introduced by
using gamma and lognormal distributions. When the two
distributions were given different degrees of skewness, the
distribution with the largest variance had the largest skewness. The normal distribution was used to generate symmetric
distributions (β = 0).
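One way to generate gamma-distributed data with a prescribed skewness and standard deviation is sketched below. The text does not state how the distribution parameters were chosen, so the parameterization is an assumption: it relies on the standard facts that a gamma distribution with shape k and scale s has skewness 2/√k, variance ks², and mean ks. The function names and the `mean` argument (implemented as a location shift) are hypothetical.

```python
import math
import random


def gamma_parameters(skewness, sd):
    """Shape and scale of a gamma distribution with the given skewness and sd."""
    shape = 4.0 / skewness ** 2   # from skewness = 2 / sqrt(shape)
    scale = sd * skewness / 2.0   # from sd = sqrt(shape) * scale
    return shape, scale


def draw_sample(size, skewness, sd, mean=0.0, rng=random):
    """Draw `size` values; a normal distribution is used when skewness is 0."""
    if skewness == 0:
        return [rng.gauss(mean, sd) for _ in range(size)]
    shape, scale = gamma_parameters(skewness, sd)
    offset = mean - shape * scale  # shift so the theoretical mean equals `mean`
    return [rng.gammavariate(shape, scale) + offset for _ in range(size)]


if __name__ == "__main__":
    rng = random.Random(2009)  # fixed seed for reproducibility
    sample = draw_sample(25, skewness=2.0, sd=1.5, mean=10.0, rng=rng)
    print(len(sample), min(sample), max(sample))
```

Lognormal data with a given skewness would require solving for the log-scale variance numerically and are not covered by this sketch.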
In the power simulations, a difference in location (D)
between the two distributions was introduced and standardized to make it comparable across distributions and sample
sizes:
$$D = \delta \cdot \sqrt{\sigma_A^2 / m + \sigma_B^2 / n}, \qquad \delta = 1, 2, 3,$$
where $\sigma_A^2$ is the variance of distribution A, and $\sigma_B^2$ is the
variance of distribution B.

Table 1
Summary of the simulation setup.

Tests                         T: the two-sample T test
                              U: the Welch U test
                              Y: the Yuen–Welch test
                              W: the Wilcoxon–Mann–Whitney test
                              B: the Brunner–Munzel test
Null hypotheses               Equal means; equal medians
Difference in location        δ = 0, 1, 2, 3
Nominal significance levels   α = 0.05; 0.01
Sampling distributions        Gamma (a); lognormal (a)
Sample sizes                  (m, n) = (10, 10), (10, 25), (25, 10), (25, 25), (50, 50), (25, 100), (100, 25), (100, 100)
Standard deviation ratios     θ = 1.0, 1.25, 1.5, 2.0, 4.0
Equal skewness values         βA = βB = 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0
Unequal skewness values       (βA, βB) = (1.0, 0.5), (2.0, 0.5), (3.0, 2.5), (3.0, 2.0), (3.0, 1.0)
Replications                  10,000
Programming language          Matlab [15] with the Statistics Toolbox

(a) Normal distribution for β = 0.
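The standardized location difference D can be computed as in the following sketch. The function name `location_shift` is hypothetical, and the example values (θ = 2 with m = n = 25) are chosen only for illustration.

```python
import math


def location_shift(delta, sd_a, sd_b, m, n):
    """Location difference D = delta * sqrt(sd_a^2 / m + sd_b^2 / n)."""
    return delta * math.sqrt(sd_a ** 2 / m + sd_b ** 2 / n)


if __name__ == "__main__":
    # Standard deviation ratio 2.0 (sd_a = 2, sd_b = 1) with m = n = 25.
    for delta in (1, 2, 3):
        print(delta, round(location_shift(delta, 2.0, 1.0, 25, 25), 4))
```

Scaling by the standard error of the difference in means keeps a given δ roughly comparable across sample sizes and variance ratios, which is the purpose stated in the text.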
5. Results and recommendations
5.1. Gamma distribution versus lognormal distribution
We generated data from two types of distributions, the
gamma distribution and the lognormal distribution. Test
recommendations were based on each distribution individually. In general, the robustness criteria were satisfied slightly
more often when data were generated from the lognormal
distribution than when data were generated from
the gamma distribution. The general behavior of the tests was
very similar for the two distributions, both when significance
level and power were considered. We have restricted further
attention to the results and recommendations based on the
gamma distribution. This makes the recommendations
slightly more cautious than they would have been had they been
based on the lognormal distribution.
5.2. Nominal significance level of 5% versus 1%
The qualitative behavior of the tests was the same for a
nominal significance level of 5% and 1%. However, the
significance levels of the 1% tests were more sensitive to the
effects of skewness, unequal variances, and unequal sample
sizes than the significance levels of the 5% tests, thus making
the 1% tests a little less type I error robust. As for power, the
1% and 5% power curves had similar shapes.
5.3. Test recommendations
For each studied situation, two criteria were used to
decide if a test could be recommended. First, the true
significance level (p) of the test had to be close to the
nominal significance level (α). Closeness was defined in
three levels: p within 10% of α, p within 20% of α, and p
within 40% of α. As considerably less robustness was
observed for α = 0.01, we felt that demanding that both the
α = 0.01 and the α = 0.05 tests had to satisfy the robustness criteria was too strict, especially since α = 0.05 is by
far the most used in medical publications. Therefore, the
robustness criteria were based on α = 0.05 only. Second,
the power of the test had to be higher than the power of
the other tests. To allow for the inaccuracy of results from
simulation, a definition of power equivalence was devised.
For each test with each distribution and sample size
combination, the three power values corresponding to the
three introduced differences in distribution location
(δ = 1,2,3) were summed. Two tests were considered
power equivalent if the smallest power sum deviated less
than 2.5% from the largest power sum.
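The power-equivalence rule can be sketched as below. The text does not say whether the 2.5% deviation is measured relative to the largest power sum or in absolute terms; the sketch assumes the former. The function name `power_equivalent` and the power values are hypothetical.

```python
def power_equivalent(power_a, power_b, tolerance=0.025):
    """Compare two tests by the sum of their power values for delta = 1, 2, 3.

    Assumption: the deviation is taken relative to the largest power sum.
    """
    sum_a, sum_b = sum(power_a), sum(power_b)
    largest, smallest = max(sum_a, sum_b), min(sum_a, sum_b)
    return (largest - smallest) < tolerance * largest


if __name__ == "__main__":
    # Hypothetical simulated power values for delta = 1, 2, 3.
    test_1 = [0.17, 0.51, 0.84]
    test_2 = [0.16, 0.50, 0.83]
    test_3 = [0.12, 0.40, 0.75]
    print(power_equivalent(test_1, test_2))  # True
    print(power_equivalent(test_1, test_3))  # False
```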
Due to the large number of simulated situations, a
comprehensive display of the recommendations is given in
Appendix A. Two examples are given in Table 2: m = 100,
n = 25 with equal distribution skewness, and m = n = 50
with unequal distribution skewness. In both cases, the
robustness level is 20%.
It is clear from the recommendations that simple rules
about which test should be used in which situation cannot be
accurately stated. Each of the factors under consideration in
this study—the total sample size, the sample size ratio, the
standard deviation ratio, skewness, and skewness heterogeneity—has an effect on type I errors or power or both of
some or all the tests. The net effect of these factors is often
difficult to predict. We strongly recommend that the relevant
tables in Appendix A be consulted before the choice of test is
made. Nonetheless, a superficial summary of the recommendations is shown in Table 3.
There are situations where none of the tests can be
recommended. Transformation of the data by taking logarithms or square roots may reduce skewness and variance
heterogeneity, but there are some problems with this
approach [16–18]. First, the exact effect of the transformation on skewness and variance is somewhat unpredictable.
Two samples of similar shape may have skewness and
variance altered differently, and differences that did not exist
between the original samples may be introduced between
the transformed samples. Second, the results from tests on
transformed data are valid only on the transformed scale,
and interpreting the results back onto the original scale can
be troublesome. As a general rule, when using transformations of any kind, the transformed samples should be
examined with the same scrutiny as the original samples.
Specifically, signs of unequal variances and skewness
distributed unevenly between the two samples should be
given particular attention.
Table 2
Tests with highest power that satisfy 4.0 ≤ p ≤ 6.0 for α = 0.05.

m = 100, n = 25
Robustness level: 20%

H0: equal means
Std. ratio   βA = βB:  0.0    0.5    1.0    1.5    2.0    2.5    3.0
4.00                   U      U      U      U      U      U      U
2.00                   U      U      U      U      U      U      U
1.50                   UB     U      U      U      U      U      U
1.25                   UB     U      U      U      U      Y      Y
1.00                   TU     TUW    W      W      W      W      W

H0: equal medians
Std. ratio   βA = βB:  0.0    0.5    1.0    1.5    2.0    2.5    3.0
4.00                   U      Y      Y      Y      –      –      –
2.00                   U      Y      Y      –      –      –      W
1.50                   UB     U      Y      Y      Y      –      W
1.25                   UB     U      YB     B      B      B      W
1.00                   TU     TUW    W      W      W      B      T

m = n = 50
Robustness level: 20%

H0: equal means
Skewness dist. A:      1.0    2.0    3.0    3.0    3.0
Skewness dist. B:      0.5    0.5    2.5    2.0    1.0
Std. ratio
4.00                   TU     –      –      –      –
2.00                   TU     TU     –      U      –
1.50                   TU     TU     TU     TU     TU
1.25                   TU     TU     TU     TU     TU
1.00                   TU     TU     Y      TU     B

H0: equal medians
Skewness dist. A:      1.0    2.0    3.0    3.0    3.0
Skewness dist. B:      0.5    0.5    2.5    2.0    1.0
Std. ratio
4.00                   Y      YB     Y      Y      –
2.00                   WB     Y      Y      Y      –
1.50                   WB     Y      WB     Y      Y
1.25                   WB     Y      Y      Y      Y
1.00                   WB     Y      Y      Y      Y

p is the true significance level and α is the nominal significance level. βA is the skewness of distribution A and βB is the skewness of distribution B. An entry of “–”
means that no test satisfies the robustness criterion. The data were generated from normal distributions (skewness = 0) and gamma distributions (skewness > 0).
T = the two-sample T test.
U = the Welch U test.
Y = the Yuen–Welch test.
W = the Wilcoxon–Mann–Whitney test.
B = the Brunner–Munzel test.
5.4. The clinical example revisited
In section 2, we compared the locations of D-dimer in the
low-dose HT and the high-dose HT treatment arms after six
weeks of the Eilertsen et al. trial [13]. We obtained widely
different p-values with our five tests. The sample sizes in the
two groups were 47 and 48, the standard deviation ratio was
284/260 = 1.1, and the sample skewness was 3.1 and 1.8. For
distributions with unequal skewness, Table 13 in Appendix A
details recommendations for a sample size of 50 in each
group. An excerpt is given in the lower part of Table 2. For
distributions similar to the ones in our example, the two-sample T test and the Welch U test are the most powerful
tests of means, and the Yuen–Welch test is the most powerful
test of medians. All three tests are type I error robust at the
10% level. As the differences in means and trimmed means are
similar (87 and 89), the smaller p-value for the Yuen–Welch
test reflects the smaller variance estimate this test uses due to
trimming of the largest observations.
To conclude this example, there is some evidence (Yuen–
Welch test: p = 0.027) that there is a difference in 20%
trimmed means, but no evidence of a difference in means (T
test/Welch: p = 0.13). The Wilcoxon–Mann–Whitney and the
Brunner–Munzel tests are not recommended in this situation.
Because the Yuen–Welch test is robust for testing medians,
and because the trimmed means are close to the medians, any
inference drawn about the trimmed means can be applied to
the medians as well.
6. Discussion
Table 3
Brief summary of the recommendations.

          θ = 1.0    θ > 1.0
                     Comparing means      Comparing medians
m = n     W,B        T,U                  T,U,W,B, sometimes Y (a)
m < n     B          U or no test (b)     U,B, sometimes Y (c)
m > n     W          U                    U,Y,W,B

θ is the standard deviation ratio and β is the skewness. m and n are the
sample sizes. When m < n, the smallest sample has the largest variance.
When m > n, the largest sample has the largest variance. T = the two-sample
T test; U = the Welch U test; Y = the Yuen–Welch test; W = the Wilcoxon–
Mann–Whitney test; B = the Brunner–Munzel test.
(a) Y for combinations of large θs or large βs or both.
(b) U when β ≤ 1.0, else no test.
(c) Y for large sample sizes.

Comparing the locations of two skewed populations is
fraught with difficulties. Unless the degree of skewness is
small, different measures of central tendency—for example
the mean, the median, and the 20% trimmed mean—can differ
markedly in numeric value. If the variances are unequal as
well, making inferences about equality of two different
measures can lead to opposite conclusions. In such cases, it
is important to accurately define the population differences of
interest, and to interpret test results in strict adherence to the
tests' null hypotheses.
The aim of this paper was to assess the ability of some
much used tests to compare means and medians for a wide
range of skewed distributions with unequal variances. Our
recommendations are detailed in Appendix A. We briefly
review the most important results:
• The performance of the tests depends on many factors, most
notably variance heterogeneity, skewness and skewness
heterogeneity, the sample size ratio, and the total sample
size.
• Small distribution changes can lead to large changes in test
performance.
• Skewness heterogeneity had a slight negative effect on the
rank-based tests, but almost no effect on the parametric
tests.
• For the simulated settings, the Welch U test is recommended most frequently.
• The rank-based methods are sensitive to departures from
the pure shift model.
• For variables with skewed distributions, the 20% trimmed
mean is closer to the median than to the mean.
• The five examined tests performed similarly on samples
drawn from gamma distributions as compared to samples
drawn from lognormal distributions.
The advantage of the Welch test demonstrated in our
study is in agreement with previous studies and several
authors recommend the Welch test for almost all situations
[19–22]. We agree that the Welch test is the best test in
general, but to select the most powerful robust test, a careful
consideration of the properties of the data is recommended.
As an aid in this endeavor, Appendix A should be helpful.
The five tests examined in this paper are but a small part of
the large set of tests available for the two-sample location
problem. However, because of their widespread use, these
five tests merit special attention. Several alternative methods
are presented in the two books by Wilcox [11,23], including
methods using robust measures of location, rank-based
methods, permutation tests, and bootstrap methods. One of
the main obstacles to contemporary methods is availability in
commercial software. This problem is easily overcome by
using the free software R [24] for which a large number of
functions exist to perform modern methods [11,23].
Our simulation study is limited in scope by two main
factors. First, we have employed two families of distributions,
the gamma and the lognormal. Although very similar results
were observed for the two distributions, we cannot rule out
the possibility that other types of distributions may produce
conspicuously different results. Also, extreme observations
have a large impact on the T test and the Welch test. A realistic
modeling of extreme observations is difficult, and other
distributions than the gamma and the lognormal are perhaps
better suited. Second, the effect of kurtosis has not been
assessed. There is some evidence that kurtosis has only a
minor effect on type I error rates [9,16,25], but that power
may be affected [26]. For gamma and lognormal distributions,
skewness and kurtosis are not independent parameters [27].
Thus, for the skewed distributions studied in this paper, the
effect of kurtosis cannot be separated from the effect of
skewness.
We have quantified robustness by defining 10%, 20%, and
40% limits to the deviation of the true significance level from
the nominal level. We consider a 10% deviation to be
acceptable in almost any practical application and that a
20% deviation is sufficiently precise for most situations.
However, if a test is robust at the 40% level only, obtained p-values should be interpreted with due caution.
Appendix A. Supplementary data
Supplementary data associated with this article can be
found, in the online version, at doi:10.1016/j.cct.2009.06.007.
Appendix B
Appendix B.1. Estimates of the squared standard errors in the
Yuen–Welch test
Let $g_X = \lfloor \gamma m \rfloor$ and $g_Y = \lfloor \gamma n \rfloor$ be the number of observations
(rounded down) trimmed from each tail in X and Y. Denote the
number of remaining observations in the trimmed samples by
$h_X = m - 2g_X$ and $h_Y = n - 2g_Y$. The squared standard errors are
based on the sample Winsorized variances. Denote the sorted
observations in X by $X_{(1)} \le X_{(2)} \le \cdots \le X_{(m)}$. The Winsorized
sample of X,
$$W_X = W_X^1, W_X^2, \ldots, W_X^m,$$
is found by setting $W_X = X$ and replacing each of the $g_X$ smallest
observations, $X_{(1)}, \ldots, X_{(g_X)}$, with $X_{(g_X + 1)}$, and replacing each of
the $g_X$ largest observations, $X_{(m - g_X + 1)}, \ldots, X_{(m)}$, with $X_{(m - g_X)}$.
The Winsorized sample of Y ($W_Y$) is found in the same way.
Denote the Winsorized sample means by $\bar{W}_X$ and $\bar{W}_Y$. The
sample Winsorized variances are
$$sw_X^2 = \frac{1}{m-1}\sum_{i=1}^{m} \left(W_X^i - \bar{W}_X\right)^2 \quad \text{and} \quad sw_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(W_Y^i - \bar{W}_Y\right)^2.$$
The squared standard errors in the Yuen–Welch test are
$$d_X = \frac{sw_X^2 (m-1)}{h_X (h_X - 1)} \quad \text{and} \quad d_Y = \frac{sw_Y^2 (n-1)}{h_Y (h_Y - 1)}.$$
Further details can be found in [4,11].
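The steps of Appendix B.1, combined with the statistic and degrees of freedom from section 3, can be sketched as follows. The function names are hypothetical and the code is an illustration of the stated formulas, not the authors' Matlab implementation. With γ = 0 nothing is trimmed or Winsorized and the result reduces to the Welch U test.

```python
import math


def trimmed_and_winsorized(sample, gamma=0.2):
    """Return (trimmed mean, Winsorized variance, h) for one sample."""
    size = len(sample)
    g = math.floor(gamma * size)  # observations affected in each tail
    ordered = sorted(sample)
    middle = ordered[g:size - g]
    # Replace the g smallest values by X_(g+1) and the g largest by X_(m-g).
    winsorized = [ordered[g]] * g + middle + [ordered[size - g - 1]] * g
    w_mean = sum(winsorized) / size
    sw2 = sum((w - w_mean) ** 2 for w in winsorized) / (size - 1)
    h = size - 2 * g
    return sum(middle) / h, sw2, h


def yuen_welch(x, y, gamma=0.2):
    """Return (Yuen-Welch statistic, degrees of freedom f_Y)."""
    tm_x, sw2_x, h_x = trimmed_and_winsorized(x, gamma)
    tm_y, sw2_y, h_y = trimmed_and_winsorized(y, gamma)
    d_x = sw2_x * (len(x) - 1) / (h_x * (h_x - 1))  # squared standard errors
    d_y = sw2_y * (len(y) - 1) / (h_y * (h_y - 1))
    stat = (tm_x - tm_y) / math.sqrt(d_x + d_y)
    df = (d_x + d_y) ** 2 / (d_x ** 2 / (h_x - 1) + d_y ** 2 / (h_y - 1))
    return stat, df


if __name__ == "__main__":
    # Illustrative data only.
    x = [2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
    y = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
    print(yuen_welch(x, y))
```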
Appendix B.2. Variance estimates in the Brunner–Munzel test
Following the notation in section 3, $M_X = M_X^1, M_X^2, \ldots, M_X^m$
and $M_Y = M_Y^1, M_Y^2, \ldots, M_Y^n$ are the midranks of X and Y based on
pooling all observations. $\bar{M}_X$ and $\bar{M}_Y$ are the means of the
pooled midranks. Midranks can also be computed within each
sample. Denote these by $V_X = V_X^1, V_X^2, \ldots, V_X^m$ and $V_Y = V_Y^1, V_Y^2, \ldots, V_Y^n$.
The variance estimates in the Brunner–Munzel test are
$$SB_X^2 = \frac{1}{m-1}\sum_{i=1}^{m} \left(M_X^i - V_X^i - \bar{M}_X + \frac{m+1}{2}\right)^2$$
and
$$SB_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(M_Y^i - V_Y^i - \bar{M}_Y + \frac{n+1}{2}\right)^2.$$
For further details, see [3,11].
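Putting the variance estimates together with the statistic B and the degrees of freedom f_B from section 3 gives the following sketch. The function names are hypothetical and the example data are illustrative. If the two samples do not overlap at all, both variance estimates are zero and the statistic is undefined; the sketch does not handle that case.

```python
import math


def midranks(values):
    """Midranks of `values` (tied values receive the average of their ranks)."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2.0
        i = j
    return [rank_of[v] for v in values]


def brunner_munzel(x, y):
    """Return (B, f_B); undefined when the samples are completely separated."""
    m, n = len(x), len(y)
    pooled = midranks(list(x) + list(y))
    m_x, m_y = pooled[:m], pooled[m:]                # pooled midranks
    v_x, v_y = midranks(list(x)), midranks(list(y))  # within-sample midranks
    mean_mx, mean_my = sum(m_x) / m, sum(m_y) / n
    sb2_x = sum((a - b - mean_mx + (m + 1) / 2.0) ** 2
                for a, b in zip(m_x, v_x)) / (m - 1)
    sb2_y = sum((a - b - mean_my + (n + 1) / 2.0) ** 2
                for a, b in zip(m_y, v_y)) / (n - 1)
    stat = (mean_my - mean_mx) / (
        (m + n) * math.sqrt(sb2_x / (m * n ** 2) + sb2_y / (m ** 2 * n))
    )
    df = (sb2_x / n + sb2_y / m) ** 2 / (
        sb2_x ** 2 / (n ** 2 * (m - 1)) + sb2_y ** 2 / (m ** 2 * (n - 1))
    )
    return stat, df


if __name__ == "__main__":
    print(brunner_munzel([1, 3, 5], [2, 4, 6, 7]))
```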
References
[1] Welch BL. The significance of the difference between two means when
the population variances are unequal. Biometrika 1937;29:350–62.
[2] Lehmann EL. Nonparametrics—statistical methods based on ranks.
Upper Saddle River, NJ: Prentice-Hall, Inc.; 1975.
[3] Brunner E, Munzel U. The nonparametric Behrens–Fisher problem:
asymptotic theory and a small-sample approximation. Biom J 2000;42:
17–25.
[4] Yuen KK. The two-sample trimmed t for unequal population variances.
Biometrika 1974;61:165–70.
[5] Cliff N. Dominance statistics: ordinal analyses to answer ordinal
questions. Psychol Bull 1993;114:494–509.
[6] Wilcox RR. Comparing the means of two independent groups. Biom J
1990;32:771–80.
[7] Wilcox RR, Keselman HJ. Modern robust data analysis methods:
measures of central tendency. Psychol Methods 2003;8:254–74.
[8] Bridge PD, Sawilowsky SS. Increasing physicians' awareness of the
impact of statistics on research outcomes: comparative power of the t-test and Wilcoxon rank-sum test in small samples applied research.
J Clin Epidemiol 1999;52:229–35.
[9] Penfield DA. Choosing a two-sample location test. J Exp Educ
1994;62:343–60.
[10] Stonehouse JM, Forrester GJ. Robustness of the t and U tests under
combined assumption violations. J Appl Stat 1998;25:63–74.
[11] Wilcox RR. Introduction to robust estimation and hypothesis testing.
2nd ed. San Diego, CA: Academic Press; 2005.
[12] Bradley JV. Robustness? Br J Math Stat Psychol 1978;31:144–52.
[13] Eilertsen AL, Qvigstad E, Andersen TO, Sandvik L, Sandset PM.
Conventional-dose hormone therapy (HT) and tibolone, but not low-dose HT and raloxifene, increase markers of activated coagulation.
Maturitas 2006;55:278–87.
[14] Wilcox RR. Some results on the Tukey–McLaughlin and Yuen methods
for trimmed means when distributions are skewed. Biom J 1994;3:
259–73.
[15] Matlab 7. Natick, MA: The MathWorks, Inc.; 2005.
[16] Pearson ES, Please NW. Relation between the shape of population
distribution and the robustness of four simple test statistics. Biometrika
1975;62:223–41.
[17] Sutton CD. Computer-intensive methods for tests about the mean of an
asymmetrical distribution. J Am Stat Assoc 1993;88:802–10.
[18] Grissom RJ. Heterogeneity of variance in clinical data. J Consult Clin
Psychol 2000;68:155–65.
[19] Best DJ, Rayner JCW. Welch's approximate solution for the Behrens–
Fisher problem. Technometrics 1987;29(2):205–10.
[20] Gans DJ. Use of a preliminary test in comparing two sample means.
Commun Stat Simul C 1981;B10(2):163–74.
[21] Zimmerman DW. A note on preliminary tests of equality of variances. Br
J Math Stat Psychol 2004;57:173–81.
[22] Ruxton GD. The unequal variance t-test is an underused alternative to
Student's t-test and the Mann–Whitney U test. Behav Ecol 2006;17(4):
688–90.
[23] Wilcox RR. Applying contemporary statistical techniques. San Diego,
CA: Academic Press; 2003.
[24] The R project for statistical computing. [http://www.r-project.org/].
[25] Cressie NAC, Whitford HJ. How to use the two sample t-test. Biom J
1986;28(2):131–48.
[26] Wilcox RR. ANOVA: the practical importance of heteroscedastic
methods, using trimmed means versus means, and designing simulation studies. Br J Math Stat Psychol 1995;48:99–114.
[27] Evans M, Hastings N, Peacock B. Statistical distributions. 3rd ed. New
York, NY: John Wiley & Sons, Inc.; 2000.