Contemporary Clinical Trials 30 (2009) 490–496
Journal homepage: www.elsevier.com/locate/conclintrial

Performance of five two-sample location tests for skewed distributions with unequal variances

Morten W. Fagerland ⁎, Leiv Sandvik
Ullevål Department of Research Administration, Oslo University Hospital, N-0407 Oslo, Norway

Article history: Received 16 March 2009; accepted 18 June 2009

Keywords: Two-sample location problem; T test; Welch test; Wilcoxon–Mann–Whitney test; Yuen–Welch test; Brunner–Munzel test; Robustness; Skewness; Heteroscedasticity

Abstract

Tests for comparing the locations of two independent populations are associated with different null hypotheses, but results are often interpreted as evidence for or against equality of means or medians. We examine the appropriateness of this practice by investigating the performance of five frequently used tests: the two-sample T test, the Welch U test, the Yuen–Welch test, the Wilcoxon–Mann–Whitney test, and the Brunner–Munzel test. Under combined violations of normality and variance homogeneity, the true significance level and power of the tests depend on a complex interplay of several factors. In a wide-ranging simulation study, we consider scenarios differing in skewness, skewness heterogeneity, variance heterogeneity, sample size, and sample size ratio. We find that small differences in distribution properties can alter test performance markedly, thus confounding the effort to present simple test recommendations. Instead, we provide detailed recommendations in Appendix A. The Welch U test is recommended most frequently, but cannot be considered an omnibus test for this problem.

© 2009 Elsevier Inc. All rights reserved.

1. Introduction

Comparison of locations, or central tendency, of two independent populations is common in medical research.
A plethora of tests exists, and the suitability of each depends on the distribution of the data at hand. The choice of test decides what can be inferred from the results, because the methods are designed to test different null hypotheses.

The two-sample T test is the most common approach. It is a test of equality of means, but it is derived under the assumptions that the two distributions are normal with equal variances. A modification of this test, the Welch U test [1], is designed for unequal variances, but the assumption of normality is maintained.

When distributions deviate from normality, several approaches are available. The most common non-parametric alternative is the Wilcoxon–Mann–Whitney (WMW) test. This test is often regarded as a test of equal medians, but this is not true in general. The correct null hypothesis for this test is P(X < Y) = 0.5, where X and Y are random samples from the two populations. The results from the WMW test can be interpreted as a test of equality of medians only when the two distributions are identical except for a possible shift in location [2]. Many attempts have been made to improve the WMW test. The most prominent of these is the Brunner–Munzel test [3], which allows for tied values and unequal variances.

⁎ Corresponding author. Tel.: +47 41 50 46 14; fax: +47 22 11 84 79. E-mail address: [email protected] (M.W. Fagerland).
1551-7144/$ – see front matter. doi:10.1016/j.cct.2009.06.007

For markedly skewed distributions, the mean can be a poor measure of central tendency because outliers inflate its value. This can be ameliorated by removing the smallest and the largest values in the sample. If an equal number of values is removed from each tail, the mean of the resulting sample is called the trimmed mean. Comparing trimmed means can be done with the Yuen–Welch test [4], which is identical to the Welch U test when no trimming is applied.
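As an illustration of the trimmed mean described above, the following minimal Python sketch removes the same number of observations from each tail before averaging. The function name `trimmed_mean` and the data are hypothetical and are not taken from the study; the number trimmed from each tail is rounded down, as in Appendix B.

```python
import math

def trimmed_mean(values, gamma=0.2):
    """Mean of the sample after removing floor(gamma * n) values from each tail."""
    if not 0 <= gamma < 0.5:
        raise ValueError("gamma must lie in [0, 0.5)")
    ordered = sorted(values)
    g = math.floor(gamma * len(ordered))      # number trimmed from each tail
    kept = ordered[g:len(ordered) - g]        # observations left after trimming
    return sum(kept) / len(kept)

# Hypothetical right-skewed sample with one extreme observation.
sample = [2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
plain_mean = trimmed_mean(sample, gamma=0.0)   # no trimming: the ordinary mean
robust_mean = trimmed_mean(sample, gamma=0.2)  # 20% trimmed mean
print(plain_mean, robust_mean)
```

With no trimming the function returns the ordinary mean, which is pulled upwards by the single extreme value; with 20% trimming the two smallest and the two largest observations are discarded before averaging.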
When using these tests, one must be aware that the results pertain to the tests' specific null hypotheses. A significant p-value from the WMW test or the Brunner–Munzel test, for example, can be difficult to interpret beyond noting that the observations from one of the populations tend to be smaller than the observations from the other population. According to Cliff [5], this interpretation has merit in its own right, and he suggests making inference about P(X > Y) − P(X < Y) as an alternative to means or other measures of location. In practice, however, researchers often like to make inference about the two common measures of central tendency, the mean and the median, which offer intuitive interpretations.

In medical research, the assumptions of normality and variance homogeneity are often violated [6,7]. Skewed data are common [8], and several well-known variables are markedly skewed, for example triglyceride level and sedimentation rate. If two skewed distributions have unequal locations, the variances can be expected to differ as well. Hence, medical data often exhibit a combination of skewness and unequal variances.

The purpose of this paper is to investigate to what extent the five mentioned tests can be appropriately used to compare means and medians for a wide range of skewed distributions with varying degrees of unequal variances. Even though the body of literature on two-sample location tests is considerable [9,10], a consistent and comprehensive examination of this issue has not been previously presented. For example, situations where the two distributions have unequal skewness have not been thoroughly studied, although it has been shown that both type I errors and power can be affected [7,11].

The tests will be subjected to quantified robustness criteria.
For each situation, the test or tests with highest power that maintain true significance levels (p) sufficiently close to the nominal level (α) will be identified. Bradley [12] defines criteria for α-robustness as conservative with 0.9α ≤ p ≤ 1.1α and liberal with 0.5α ≤ p ≤ 1.5α. This implies that closeness is considered sufficient if the true significance levels are within plus or minus 10% or 50% of the nominal significance levels. We consider 50% to be too liberal for most situations, but 10%, 20%, and 40% limits will be studied. We refer to this as the 10%-, 20%-, and 40%-robustness of the tests. For a nominal significance level of 5%, this implies that we accept true significance levels (in percent) that are in the intervals [4.5, 5.5], [4.0, 6.0], and [3.0, 7.0], respectively.

2. Clinical example

Hormone therapy (HT) is associated with adverse effects such as increased risk of arterial and venous thromboembolism, and breast cancer. Eilertsen et al. [13] examined whether different HT regimens have different effects on blood coagulation by randomizing 202 healthy women to either low-dose HT, conventional-dose (high-dose) HT, tibolone, or raloxifene. The primary outcome measure was D-dimer, a marker of fibrin production and degradation which can be used to assess the effect of HT on coagulation. After six weeks of therapy, the distribution of D-dimer was considerably skewed in the low-dose HT group and moderately skewed in the high-dose HT group (Fig. 1). Summary statistics show that the difference in means is 87, the difference in medians is 103, and the difference in 20% trimmed means is 89:

               n    Mean   Median   20% trimmed mean   Std   Skewness
Low-dose HT    47   398    307      336                284   3.1
High-dose HT   48   485    410      425                260   1.8

How strong is the evidence for a difference in location between the two groups? We calculated the two-sample T test (p = 0.13), the Welch U test (p = 0.13), the Wilcoxon–Mann–Whitney test (p = 0.011), the Brunner–Munzel test (p = 0.010), and the Yuen–Welch test (p = 0.027). The highest p-value is more than ten times the smallest p-value. Which test should we trust? We return to this example in section 5.4.

Fig. 1. Histogram showing the distribution of D-dimer in the low-dose HT (left) and high-dose HT (right) treatment arms after six weeks of the Eilertsen et al. trial [13]. One outlier in each group was removed.

3. Notation and test statistics

Consider two populations A and B. Assume that we have two independent samples: X with m observations from A, and Y with n observations from B. The estimated means and sample variances are:

\[ \bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i, \qquad \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i, \]

\[ S_X^2 = \frac{1}{m-1}\sum_{i=1}^{m} (X_i - \bar{X})^2, \qquad S_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2. \]

The two-sample T test is based on the test statistic

\[ T = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{1/m + 1/n}}, \]

where $S_p$ is the pooled sample standard deviation:

\[ S_p^2 = \frac{(m-1) S_X^2 + (n-1) S_Y^2}{m + n - 2}. \]

Under the null hypothesis of equal means, the T statistic has a t-distribution with m + n − 2 degrees of freedom. It is assumed that the distributions of A and B are normal with equal variances.

Welch [1] proposed several modifications of the two-sample T test suitable for situations with unequal variances. One of these tests, the Welch U test, is available in most software packages. The appropriate test statistic is

\[ U = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{S_X^2}{m} + \dfrac{S_Y^2}{n}}}. \]

U is approximately t-distributed with $f_U$ degrees of freedom:

\[ f_U = \frac{\left( \dfrac{S_X^2}{m} + \dfrac{S_Y^2}{n} \right)^2}{\dfrac{S_X^4}{m^3 - m^2} + \dfrac{S_Y^4}{n^3 - n^2}}. \]

To obtain the sample trimmed means, the amount of trimming (γ) must be chosen. For general use, γ = 0.2 is a good choice [11,14]. This corresponds to removing the 20% smallest and the 20% largest observations in each sample.
Let $\bar{X}_\gamma$ and $\bar{Y}_\gamma$ denote the trimmed means (the means of the samples after trimming). The Yuen–Welch test [4] statistic is given by

\[ Y = \frac{\bar{X}_\gamma - \bar{Y}_\gamma}{\sqrt{d_X + d_Y}}, \]

where $d_X$ and $d_Y$ are estimates of the squared standard errors. Calculation of $d_X$ and $d_Y$ is shown in Appendix B. Under the null hypothesis of equal trimmed means, Y follows a t-distribution with $f_Y$ degrees of freedom,

\[ f_Y = \frac{(d_X + d_Y)^2}{\dfrac{d_X^2}{h_X - 1} + \dfrac{d_Y^2}{h_Y - 1}}, \]

where $h_X$ and $h_Y$ are the number of observations left in samples X and Y after trimming.

The WMW test statistic is based on ranks and involves calculating

\[ W_X = mn + \frac{m(m+1)}{2} - R_X, \]

where $R_X$ is the sum of the ranks in sample X. Under the null hypothesis that P(X < Y) = 0.5, $W_X$ is approximately normally distributed with mean mn/2 and variance mn(m + n + 1)/12. The statistic

\[ W = \frac{W_X - mn/2}{\sqrt{mn(m + n + 1)/12}} \]

can be approximated by the standard normal distribution. By using the exact permutation distribution of ranks, an exact version of the WMW test can be constructed. Since the exact test is only practicable for small samples, we do not consider it. Throughout this paper, references to the WMW test are to the approximate version of the test.

The Brunner–Munzel test [3] is a modification of the WMW test designed to handle ties and unequal variances. Instead of associating ranks with the sample observations, midranks are computed. Midranks are equal to ranks when there are no tied values. For tied values, the midranks are the average of their ranks. The midranks of 2, 5, 5, 6, 9, 9, 9, 10, for example, are 1, 2.5, 2.5, 4, 6, 6, 6, and 8. Let $\bar{M}_X$ and $\bar{M}_Y$ be the means of the midranks associated with the samples X and Y when the data are pooled. The Brunner–Munzel test statistic is

\[ B = \frac{\bar{M}_Y - \bar{M}_X}{(m + n) \sqrt{\dfrac{SB_X^2}{m n^2} + \dfrac{SB_Y^2}{m^2 n}}}, \]

where the expressions for $SB_X^2$ and $SB_Y^2$ are given in Appendix B.
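The midrank rule and the approximate WMW statistic W defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, pooled midranks are used in place of ranks so that tied values can be handled, and no tie correction is applied to the variance, in line with the formula given above.

```python
import math

def midranks(values):
    """Midranks: tied values all receive the average of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1   # mean of 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks

def wmw_statistic(x, y):
    """W = (W_X - mn/2) / sqrt(mn(m+n+1)/12), with R_X taken from pooled midranks."""
    m, n = len(x), len(y)
    pooled = midranks(list(x) + list(y))
    r_x = sum(pooled[:m])                      # rank sum of sample X
    w_x = m * n + m * (m + 1) / 2 - r_x
    return (w_x - m * n / 2) / math.sqrt(m * n * (m + n + 1) / 12)

def two_sided_normal_p(z):
    """Two-sided p-value from the standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

example_midranks = midranks([2, 5, 5, 6, 9, 9, 9, 10])
w = wmw_statistic([1, 2, 3], [4, 5, 6])        # hypothetical toy samples
print(example_midranks, w, two_sided_normal_p(w))
```

The first call reproduces the midrank example given in the text. Swapping the two samples changes only the sign of W.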
The distribution of B can be approximated by a t-distribution with $f_B$ degrees of freedom:

\[ f_B = \frac{\left( \dfrac{SB_X^2}{n} + \dfrac{SB_Y^2}{m} \right)^2}{\dfrac{SB_X^4}{n^2 (m-1)} + \dfrac{SB_Y^4}{m^2 (n-1)}}. \]

4. Simulation setup

We examined the significance level and power of the tests by using computer simulations. Table 1 defines the relevant parameters of the simulation setup. The choices of these parameters are discussed below.

Two criteria were used to select sample sizes: the total sample size had to range from small to large, and the ratio of the two sample sizes had to correspond to balanced designs (m = n) and unbalanced designs (m/n > 1 and m/n < 1).

The impact of unequal variances was studied by specifying the ratio of the standard deviations (θ). The largest standard deviation was associated with sample X of size m, and the smallest standard deviation was associated with sample Y of size n. Values of θ = 1.0, 1.25, 1.5, 2.0, 4.0 were used. When m > n, the distribution of the largest sample had the largest variance, and when m < n, the distribution of the largest sample had the smallest variance.

Different degrees of skewness (β) were introduced by using gamma and lognormal distributions. When the two distributions were given different degrees of skewness, the distribution with the largest variance had the largest skewness. The normal distribution was used to generate symmetric distributions (β = 0).

In the power simulations, a difference in location (D) between the two distributions was introduced and standardized to make it comparable across distributions and sample sizes:

\[ D = \delta \cdot \sqrt{\frac{\sigma_A^2}{m} + \frac{\sigma_B^2}{n}}, \qquad \delta = 1, 2, 3, \]

where $\sigma_A^2$ is the variance of distribution A, and $\sigma_B^2$ is the variance of distribution B.

Table 1
Summary of the simulation setup.

Tests                         T: the two-sample T test
                              U: the Welch U test
                              Y: the Yuen–Welch test
                              W: the Wilcoxon–Mann–Whitney test
                              B: the Brunner–Munzel test
Null hypotheses               Equal means; equal medians
Difference in location        δ = 0, 1, 2, 3
Nominal significance levels   α = 0.05; 0.01
Sampling distributions        Gamma a; lognormal a
Sample sizes                  (m, n) = (10, 10), (10, 25), (25, 10), (25, 25),
                              (50, 50), (25, 100), (100, 25), (100, 100)
Standard deviation ratios     θ = 1.0, 1.25, 1.5, 2.0, 4.0
Equal skewness values         βA = βB = 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0
Unequal skewness values       (βA, βB) = (1.0, 0.5), (2.0, 0.5), (3.0, 2.5),
                              (3.0, 2.0), (3.0, 1.0)
Replications                  10,000
Programming language          Matlab [15] with the Statistics Toolbox

a Normal distribution for β = 0.

5. Results and recommendations

5.1. Gamma distribution versus lognormal distribution

We generated data from two types of distributions, the gamma distribution and the lognormal distribution. Test recommendations were based on each distribution individually. In general, the robustness criteria were satisfied slightly more often when data were generated from the lognormal distribution than when data were generated from the gamma distribution. The general behavior of the tests was very similar for the two distributions, both when significance level and power were considered. We have restricted further attention to the results and recommendations based on the gamma distribution. This makes the recommendations slightly more cautious than they would have been had they been based on the lognormal distribution.

5.2. Nominal significance level of 5% versus 1%

The qualitative behavior of the tests was the same for a nominal significance level of 5% and 1%.
However, the significance levels of the 1% tests were more sensitive to the effects of skewness, unequal variances, and unequal sample sizes than the significance levels of the 5% tests, thus making the 1% tests a little less type I error robust. As for power, the 1% and 5% power curves had similar shapes.

5.3. Test recommendations

For each studied situation, two criteria were used to decide if a test could be recommended. First, the true significance level (p) of the test had to be close to the nominal significance level (α). Closeness was defined in three levels: p within 10% of α, p within 20% of α, and p within 40% of α. As considerably less robustness was observed for α = 0.01, we felt that demanding that both the α = 0.01 and the α = 0.05 tests had to satisfy the robustness criteria was too strict, especially since α = 0.05 is by far the most used in medical publications. Therefore, the robustness criteria were based on α = 0.05 only.

Second, the power of the test had to be higher than the power of the other tests. To allow for the inaccuracy of results from simulation, a definition of power equivalence was devised. For each test with each distribution and sample size combination, the three power values corresponding to the three introduced differences in distribution location (δ = 1, 2, 3) were summed. Two tests were considered power equivalent if the smallest power sum deviated less than 2.5% from the largest power sum.

Due to the large number of simulated situations, a comprehensive display of the recommendations is given in Appendix A. Two examples are given in Table 2: m = 100, n = 25 with equal distribution skewness, and m = n = 50 with unequal distribution skewness. In both cases, the robustness level is 20%. It is clear from the recommendations that simple rules about which test should be used in which situation cannot be accurately stated.
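The two selection criteria of this section can be written out as a short Python sketch. It is an illustration only: the function names and the power sums are hypothetical, and measuring the 2.5% deviation relative to the largest power sum is an assumption about how the criterion was applied.

```python
def robustness_level(p_true, alpha=0.05, limits=(0.10, 0.20, 0.40)):
    """Smallest limit (as a fraction of alpha) within which p_true lies, else None."""
    for limit in limits:
        # Small tolerance so that interval endpoints such as 0.06 are included.
        if abs(p_true - alpha) <= limit * alpha + 1e-12:
            return limit
    return None

def power_equivalent(power_sums, tolerance=0.025):
    """Tests whose power sum deviates less than `tolerance` from the largest sum.

    Assumption: the deviation is measured relative to the largest power sum.
    """
    best = max(power_sums.values())
    return sorted(t for t, s in power_sums.items() if (best - s) / best < tolerance)

# Hypothetical simulated significance levels for a nominal level of 0.05.
levels = [robustness_level(p) for p in (0.0545, 0.058, 0.031, 0.08)]

# Hypothetical power sums over delta = 1, 2, 3 for three tests.
candidates = power_equivalent({"T": 1.80, "U": 1.78, "W": 1.60})
print(levels, candidates)
```

A test would then be recommended for a given situation if it meets the chosen robustness level and belongs to the set of power-equivalent tests with the largest power sum.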
Each of the factors under consideration in this study (the total sample size, the sample size ratio, the standard deviation ratio, skewness, and skewness heterogeneity) has an effect on type I errors or power or both of some or all the tests. The net effect of these factors is often difficult to predict. We strongly recommend that the relevant tables in Appendix A be consulted before the choice of test is made. Nonetheless, a superficial summary of the recommendations is shown in Table 3.

There are situations where none of the tests can be recommended. Transformation of the data by taking logarithms or square roots may reduce skewness and variance heterogeneity, but there are some problems with this approach [16–18]. First, the exact effect of the transformation on skewness and variance is somewhat unpredictable. Two samples of similar shape may have skewness and variance altered differently, and differences that did not exist between the original samples may be introduced between the transformed samples. Second, the results from tests on transformed data are valid only on the transformed scale, and interpreting the results back onto the original scale can be troublesome. As a general rule, when using transformations of any kind, the transformed samples should be examined with the same scrutiny as the original samples. Specifically, signs of unequal variances and skewness distributed unevenly between the two samples should be given particular attention.

Table 2
Tests with highest power that satisfy 4.0 ≤ p ≤ 6.0 for α = 0.05.

m = 100, n = 25; robustness level: 20%

H0: equal means
                      βA = βB
Std. ratio    0.0    0.5    1.0    1.5    2.0    2.5    3.0
4.00          U      U      U      U      U      U      U
2.00          U      U      U      U      U      U      U
1.50          UB     U      U      U      U      U      U
1.25          UB     U      U      U      U      Y      Y
1.00          TU     TUW    W      W      W      W      W

H0: equal medians
                      βA = βB
Std. ratio    0.0    0.5    1.0    1.5    2.0    2.5    3.0
4.00          U      Y      Y      Y      –      –      –
2.00          U      Y      Y      –      –      –      W
1.50          UB     U      Y      Y      Y      –      W
1.25          UB     U      YB     B      B      B      W
1.00          TU     TUW    W      W      W      B      T

m = n = 50; robustness level: 20%

H0: equal means
Skewness dist. A    1.0    2.0    3.0    3.0    3.0
Skewness dist. B    0.5    0.5    2.5    2.0    1.0
Std. ratio
4.00                TU     –      –      –      –
2.00                TU     TU     –      U      –
1.50                TU     TU     TU     TU     TU
1.25                TU     TU     TU     TU     TU
1.00                TU     TU     Y      TU     B

H0: equal medians
Skewness dist. A    1.0    2.0    3.0    3.0    3.0
Skewness dist. B    0.5    0.5    2.5    2.0    1.0
Std. ratio
4.00                Y      YB     Y      Y      –
2.00                WB     Y      Y      Y      –
1.50                WB     Y      WB     Y      Y
1.25                WB     Y      Y      Y      Y
1.00                WB     Y      Y      Y      Y

p is the true significance level and α is the nominal significance level. βA is the skewness of distribution A and βB is the skewness of distribution B. An entry of “–” means that no test satisfies the robustness criterion. The data were generated from normal distributions (skewness = 0) and gamma distributions (skewness > 0). T = the two-sample T test. U = the Welch U test. Y = the Yuen–Welch test. W = the Wilcoxon–Mann–Whitney test. B = the Brunner–Munzel test.

5.4. The clinical example revisited

In section 2, we compared the locations of D-dimer in the low-dose HT and the high-dose HT treatment arms after six weeks of the Eilertsen et al. trial [13]. We obtained widely different p-values with our five tests. The sample sizes in the two groups were 47 and 48, the standard deviation ratio was 284/260 = 1.1, and the sample skewness was 3.1 and 1.8. For distributions with unequal skewness, Table 13 in Appendix A details recommendations for a sample size of 50 in each group. An excerpt is given in the lower part of Table 2. For distributions similar to the ones in our example, the two-sample T test and the Welch U test are the most powerful tests of means, and the Yuen–Welch test is the most powerful test of medians. All three tests are type I error robust at the 10% level.

As the differences in means and trimmed means are similar (87 and 89), the smaller p-value for the Yuen–Welch test reflects the smaller variance estimate this test uses due to trimming of the largest observations.
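To make the comparison of means in this example concrete, the Welch U statistic and its degrees of freedom from section 3 can be evaluated directly from the summary statistics in section 2. The sketch below is illustrative: the function name is hypothetical, the inputs are the rounded published values, and the p-value uses a standard normal reference rather than the t-distribution (an assumption made only to stay within the Python standard library), so it need not reproduce the reported p = 0.13 exactly.

```python
import math

def welch_from_summary(mean_x, sd_x, m, mean_y, sd_y, n):
    """Welch U statistic and approximate degrees of freedom f_U."""
    term_x = sd_x ** 2 / m          # S_X^2 / m
    term_y = sd_y ** 2 / n          # S_Y^2 / n
    u = (mean_x - mean_y) / math.sqrt(term_x + term_y)
    f_u = (term_x + term_y) ** 2 / (
        sd_x ** 4 / (m ** 3 - m ** 2) + sd_y ** 4 / (n ** 3 - n ** 2)
    )
    return u, f_u

# Rounded summary statistics: low-dose HT (X) versus high-dose HT (Y).
u, f_u = welch_from_summary(398, 284, 47, 485, 260, 48)

# Assumption: normal approximation in place of the t-distribution with f_U df.
p_normal = math.erfc(abs(u) / math.sqrt(2))
print(f"U = {u:.3f}, f_U = {f_u:.1f}, normal-approximation p = {p_normal:.3f}")
```

With these inputs, U is about −1.56 on roughly 92 degrees of freedom, which is consistent with the weak evidence for a difference in means reported for the T and Welch U tests.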
To conclude this example, there is some evidence (Yuen–Welch test: p = 0.027) that there is a difference in 20% trimmed means, but no evidence of a difference in means (T test/Welch: p = 0.13). The Wilcoxon–Mann–Whitney and the Brunner–Munzel tests are not recommended in this situation. Because the Yuen–Welch test is robust for testing medians, and because the trimmed means are close to the medians, any inference drawn about the trimmed means can be applied to the medians as well.

Table 3
Brief summary of the recommendations.

          Comparing means                 Comparing medians
          θ = 1.0    θ > 1.0
m = n     W,B        T,U                  T,U,W,B, sometimes Y a
m < n     B          U or no test b       U,B, sometimes Y c
m > n     W          U                    U,Y,W,B

θ is the standard deviation ratio and β is the skewness. m and n are the sample sizes. When m < n, the smallest sample has the largest variance. When m > n, the largest sample has the largest variance. T = the two-sample T test; U = the Welch U test; Y = the Yuen–Welch test; W = the Wilcoxon–Mann–Whitney test; B = the Brunner–Munzel test.
a Y for combinations of large θs or large βs or both.
b U when β ≤ 1.0, else no test.
c Y for large sample sizes.

6. Discussion

Comparing the locations of two skewed populations is fraught with difficulties. Unless the degree of skewness is small, different measures of central tendency (for example the mean, the median, and the 20% trimmed mean) can differ markedly in numeric value. If the variances are unequal as well, making inferences about equality of two different measures can lead to opposite conclusions. In such cases, it is important to accurately define the population differences of interest, and to interpret test results in strict adherence to the tests' null hypotheses.

The aim of this paper was to assess the ability of some widely used tests to compare means and medians for a wide range of skewed distributions with unequal variances. Our recommendations are detailed in Appendix A.
We briefly review the most important results:

• The performance of the tests depends on many factors, most notably variance heterogeneity, skewness and skewness heterogeneity, the sample size ratio, and the total sample size.
• Small distribution changes can lead to large changes in test performance.
• Skewness heterogeneity had a slight negative effect on the rank-based tests, but almost no effect on the parametric tests.
• For the simulated settings, the Welch U test is recommended most frequently.
• The rank-based methods are sensitive to departures from the pure shift model.
• For variables with skewed distributions, the 20% trimmed mean is closer to the median than to the mean.
• The five examined tests performed similarly on samples drawn from gamma distributions and on samples drawn from lognormal distributions.

The advantage of the Welch test demonstrated in our study is in agreement with previous studies, and several authors recommend the Welch test for almost all situations [19–22]. We agree that the Welch test is the best test in general, but to select the most powerful robust test, a careful consideration of the properties of the data is recommended. As an aid in this endeavor, Appendix A should be helpful.

The five tests examined in this paper are but a small part of the large set of tests available for the two-sample location problem. However, because of their widespread use, these five tests merit special attention. Several alternative methods are presented in the two books by Wilcox [11,23], including methods using robust measures of location, rank-based methods, permutation tests, and bootstrap methods. One of the main obstacles to contemporary methods is availability in commercial software. This problem is easily overcome by using the free software R [24], for which a large number of functions exist to perform modern methods [11,23].

Our simulation study is limited in scope by two main factors.
First, we have employed two families of distributions, the gamma and the lognormal. Although very similar results were observed for the two distributions, we cannot rule out the possibility that other types of distributions may produce conspicuously different results. Also, extreme observations have a large impact on the T test and the Welch test. A realistic modeling of extreme observations is difficult, and distributions other than the gamma and the lognormal are perhaps better suited. Second, the effect of kurtosis has not been assessed. There is some evidence that kurtosis has only a minor effect on type I error rates [9,16,25], but that power may be affected [26]. For gamma and lognormal distributions, skewness and kurtosis are not independent parameters [27]. Thus, for the skewed distributions studied in this paper, the effect of kurtosis cannot be separated from the effect of skewness.

We have quantified robustness by defining 10%, 20%, and 40% limits to the deviation of the true significance level from the nominal level. We consider a 10% deviation to be acceptable in almost any practical application and a 20% deviation to be sufficiently precise for most situations. However, if a test is robust at the 40% level only, obtained p-values should be interpreted with due caution.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.cct.2009.06.007.

Appendix B

Appendix B.1. Estimates of the squared standard errors in the Yuen–Welch test

Let $g_X = \lfloor \gamma m \rfloor$ and $g_Y = \lfloor \gamma n \rfloor$ be the numbers of observations (rounded down) trimmed from each tail in X and Y. Denote the number of remaining observations in the trimmed samples by $h_X = m - 2 g_X$ and $h_Y = n - 2 g_Y$. The squared standard errors are based on the sample Winsorized variances. Denote the sorted observations in X by $X_{(1)} \le X_{(2)} \le \cdots \le X_{(m)}$.
The Winsorized sample of X,

\[ W_X = (W_X^1, W_X^2, \ldots, W_X^m), \]

is found by setting $W_X = X$ and replacing each of the $g_X$ smallest observations, $X_{(1)}, \ldots, X_{(g_X)}$, with $X_{(g_X + 1)}$, and replacing each of the $g_X$ largest observations, $X_{(m - g_X + 1)}, \ldots, X_{(m)}$, with $X_{(m - g_X)}$. The Winsorized sample of Y ($W_Y$) is found in the same way. Denote the Winsorized sample means by $\bar{W}_X$ and $\bar{W}_Y$. The sample Winsorized variances are

\[ sw_X^2 = \frac{1}{m-1} \sum_{i=1}^{m} (W_X^i - \bar{W}_X)^2 \quad \text{and} \quad sw_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (W_Y^i - \bar{W}_Y)^2. \]

The squared standard errors in the Yuen–Welch test are

\[ d_X = \frac{sw_X^2 (m-1)}{h_X (h_X - 1)} \quad \text{and} \quad d_Y = \frac{sw_Y^2 (n-1)}{h_Y (h_Y - 1)}. \]

Further details can be found in [4,11].

Appendix B.2. Variance estimates in the Brunner–Munzel test

Following the notation in section 3, $M_X = (M_X^1, M_X^2, \ldots, M_X^m)$ and $M_Y = (M_Y^1, M_Y^2, \ldots, M_Y^n)$ are the midranks of X and Y based on pooling all observations. $\bar{M}_X$ and $\bar{M}_Y$ are the means of the pooled midranks. Midranks can also be computed within each sample. Denote these by $V_X = (V_X^1, V_X^2, \ldots, V_X^m)$ and $V_Y = (V_Y^1, V_Y^2, \ldots, V_Y^n)$. The variance estimates in the Brunner–Munzel test are

\[ SB_X^2 = \frac{1}{m-1} \sum_{i=1}^{m} \left( M_X^i - V_X^i - \bar{M}_X + \frac{m+1}{2} \right)^2 \]

and

\[ SB_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( M_Y^i - V_Y^i - \bar{M}_Y + \frac{n+1}{2} \right)^2. \]

For further details, see [3,11].

References

[1] Welch BL. The significance of the difference between two means when the population variances are unequal. Biometrika 1937;29:350–62.
[2] Lehmann EL. Nonparametrics: statistical methods based on ranks. Upper Saddle River, NJ: Prentice-Hall, Inc.; 1975.
[3] Brunner E, Munzel U. The nonparametric Behrens–Fisher problem: asymptotic theory and a small-sample approximation. Biom J 2000;42:17–25.
[4] Yuen KK. The two-sample trimmed t for unequal population variances. Biometrika 1974;61:165–70.
[5] Cliff N. Dominance statistics: ordinal analyses to answer ordinal questions. Psychol Bull 1993;114:494–509.
[6] Wilcox RR. Comparing the means of two independent groups. Biom J 1990;32:771–80.
[7] Wilcox RR, Keselman HJ. Modern robust data analysis methods: measures of central tendency. Psychol Methods 2003;8:254–74.
[8] Bridge PD, Sawilowsky SS. Increasing physicians' awareness of the impact of statistics on research outcomes: comparative power of the t-test and Wilcoxon rank-sum test in small samples applied research. J Clin Epidemiol 1999;52:229–35.
[9] Penfield DA. Choosing a two-sample location test. J Exp Educ 1994;62:343–60.
[10] Stonehouse JM, Forrester GJ. Robustness of the t and U tests under combined assumption violations. J Appl Stat 1998;25:63–74.
[11] Wilcox RR. Introduction to robust estimation and hypothesis testing. 2nd ed. San Diego, CA: Academic Press; 2005.
[12] Bradley JV. Robustness? Br J Math Stat Psychol 1978;31:144–52.
[13] Eilertsen AL, Qvigstad E, Andersen TO, Sandvik L, Sandset PM. Conventional-dose hormone therapy (HT) and tibolone, but not low-dose HT and raloxifene, increase markers of activated coagulation. Maturitas 2006;55:278–87.
[14] Wilcox RR. Some results on the Tukey–McLaughlin and Yuen methods for trimmed means when distributions are skewed. Biom J 1994;3:259–73.
[15] Matlab 7. Natick, MA: The MathWorks, Inc.; 2005.
[16] Pearson ES, Please NW. Relation between the shape of population distribution and the robustness of four simple test statistics. Biometrika 1975;62:223–41.
[17] Sutton CD. Computer-intensive methods for tests about the mean of an asymmetrical distribution. J Am Stat Assoc 1993;88:802–10.
[18] Grissom RJ. Heterogeneity of variance in clinical data. J Consult Clin Psychol 2000;68:155–65.
[19] Best DJ, Rayner JCW. Welch's approximate solution for the Behrens–Fisher problem. Technometrics 1987;29(2):205–10.
[20] Gans DJ. Use of a preliminary test in comparing two sample means. Commun Stat Simul C 1981;B10(2):163–74.
[21] Zimmerman DW. A note on preliminary tests of equality of variances. Br J Math Stat Psychol 2004;57:173–81.
[22] Ruxton GD. The unequal variance t-test is an underused alternative to Student's t-test and the Mann–Whitney U test. Behav Ecol 2006;17(4):688–90.
[23] Wilcox RR. Applying contemporary statistical techniques. San Diego, CA: Academic Press; 2003.
[24] The R project for statistical computing. [http://www.r-project.org/].
[25] Cressie NAC, Whitford HJ. How to use the two sample t-test. Biom J 1986;28(2):131–48.
[26] Wilcox RR. ANOVA: the practical importance of heteroscedastic methods, using trimmed means versus means, and designing simulation studies. Br J Math Stat Psychol 1995;48:99–114.
[27] Evans M, Hastings N, Peacock B. Statistical distributions. 3rd ed. New York, NY: John Wiley & Sons, Inc.; 2000.