Chemometrics and Intelligent Laboratory Systems 57 (2001) 1–14
www.elsevier.com/locate/chemometrics

Conditional Fisher's exact test as a selection criterion for pair-correlation method. Type I and Type II errors

Róbert Rajkó a,*, Károly Héberger b,*

a Department of Unit Operations and Environmental Engineering, Institute of Food Industry College, University of Szeged, P.O. Box 433, H-6701 Szeged, Hungary
b Institute of Chemistry, Chemical Research Center, Hungarian Academy of Sciences, P.O. Box 17, H-1525 Budapest, Hungary

Received 1 February 2000; accepted 20 December 2000

* Corresponding authors. E-mail addresses: [email protected] (R. Rajkó), [email protected] (K. Héberger).

0169-7439/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0169-7439(01)00101-0

Abstract

The pair-correlation method (PCM) has been developed recently for discrimination between two variables. PCM can be used to identify the decisive (fundamental, basic) factor from among correlated variables, even in cases when all other statistical criteria fail to indicate a significant difference. Such decisions are needed frequently in QSAR studies and/or chemical model building. The conditional Fisher's exact test, based on testing significance in 2 × 2 contingency tables, is a suitable selection criterion for PCM. The test statistic provides a probabilistic aid for accepting the hypothesis of a significant difference between two factors that are almost equally correlated with the response (dependent variable). Differentiating between factors can lead to alternative models at any arbitrary significance level. The power function of the test statistic has also been deduced theoretically. A similar derivation was undertaken to describe the influence of Type I (false-positive conclusion, error of the first kind) and Type II (false-negative conclusion, error of the second kind) errors. The appropriate decision is indicated by low probability levels of both false conclusions. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Variable (or feature) selection; Pair-correlation method (PCM)

1. Introduction

Variable selection (subset selection, feature selection) is one of the key issues in chemometrics. The selection process is more or less solved for linear relationships. Unfortunately, the algorithm for variable selection and the selection criteria are often indistinguishable: the same algorithm can lead to the selection of different variables using different criteria, and vice versa. Construction of data sets for which even the accepted criteria (forward selection, backward elimination, stepwise) lead to different conclusions is relatively easy [1a,1b]. Other algorithms based on principal component analysis (PCA), partial least squares (PLS), genetic algorithms and artificial neural networks (ANN) increase the uncertainty concerning the selection of the best subset. The increasing usage of the ANN technique has forced chemometricians to develop non-linear variable selection methods. The approaches are often heuristic and lack any firm theoretical basis. Centner et al. [2] emphasised that "... [a] weakness of all these methods is the estimation of a suitable number of variables (cut-off level). No explicit rule exists up to now. As a result all approaches work with a user-defined number of variables or with a user-defined critical value for the considered selection criterion". All the above-mentioned methods build models for prediction. However, prediction is not necessarily a goal to be achieved by the model building process. Models having a theoretical basis and physical relevance are superior to empirical ones. However, there are no algorithmic ways to select important or basic factors.
The connection to physical significance has to be examined individually, again and again. The present paper introduces a new technique, one that uses a portion of the information present in the data other than that usually exploited. The technique is able to select "superior" factors if such superiority exists. Consider the following example based on the correlation coefficient. Two independent variables (X1 and X2) can be discriminated according to the magnitude of the correlation coefficients r_{Y vs. X1} and r_{Y vs. X2}. The discrimination can be formulated as an F-test to identify significant differences at a given probability level. The classical Pearson product-moment correlation coefficient is not the only measure of correlation; there are also non-parametric measures, e.g., Spearman's rho and Kendall's tau [3]. They are, however, not yet used for variable selection. The pair-correlation method (PCM) provides an alternative possibility to characterise different correlations without using the correlation coefficient. PCM [4–6] has been developed recently for the discrimination of variables as a non-parametric method, in contrast with methods that require the assumption of normality. PCM can be used to choose the decisive (fundamental, basic) factor from among correlated (collinear) variables, even if all classical statistical criteria cannot indicate any significant difference. PCM, however, needs a test statistic as a selection criterion, i.e., a probabilistic aid for accepting the hypothesis that a significant difference exists between the two factors at any arbitrary significance level. There are two hypotheses that must be specified in any statistical testing procedure [7–9]: the null hypothesis, denoted H_0, and the alternative hypothesis, denoted H_A. Accepting or rejecting the null hypothesis is the task to be solved. However, statistical hypothesis testing is based on sample information: nobody can be sure that the decision is correct.
When H_0 is true but, by chance, the sample data incorrectly suggest that it is false, this is referred to as a Type I error, or the error of the first kind (the probability of this event is ε). When H_0 is false but, by bad luck, the sample data mistakenly suggest that it could be true, this is called a Type II error, or the error of the second kind (the probability of this event is β). The power of a test (equal to 1 − β) is a measure of how good the test is at rejecting a false null hypothesis. PCM is used to choose between two factors X1 and X2 that are approximately equally correlated with the dependent variable Y. Hence, determination of β is of crucial importance (PCM can only discriminate between X1 and X2 if the null hypothesis can be rejected). Low levels of both ε and β indicate that the correct decision has been made.

Our aim in this paper was to develop a selection criterion for PCM. The theoretical deduction of the Type I and Type II errors will justify the usage of the method. Moreover, we would like to communicate an improvement of the algorithm for PCM; the improvement is summarised in Appendix A. Finally, we present some examples to validate the method and to aid understanding of how it works.

2. Theoretical principles of PCM

PCM is based on non-parametric, i.e., distribution-free (combinatorial) analysis. The formulation of the initial task is given below. Let us define three vectors as dependent (Y) and independent variables (X1 and X2). The task is to choose the superior one from the coequal X1 and X2. Both independent variables correlate positively with the dependent variable. The case when one of them, or both, does not correlate with Y does not cause a serious limitation; this will be discussed in the validation part of the paper. Likewise, a negative correlation does not limit the usage of the method.

Consider all the possible element pairs of the Y vector that can occur when the differences ΔX1 for Y vs. X1 and ΔX2 for Y vs. X2 are determined. Only the signs of the differences are important:

   ΔX1 = (X_{1i} − X_{1j}) sgn(Y_i − Y_j),
   ΔX2 = (X_{2i} − X_{2j}) sgn(Y_i − Y_j),   1 ≤ i < j ≤ m,   (1)

where

   sgn(Y_i − Y_j) = 0 if Y_i = Y_j, and (Y_i − Y_j)/|Y_i − Y_j| otherwise,

and m is the number of measurements. There will be C(m, 2) = m(m − 1)/2 = n point pairs and differences ΔX1 and ΔX2. There are only four possible sign combinations of ΔX1 and ΔX2; they are termed A, B, C and D. Table 1 summarises the four possibilities (events). The frequencies of the events A, B, C and D (k_A, k_B, k_C and k_D, respectively) are counted and ordered (see Table 1). Fig. 1 represents the fundamental nature of the four events as the basis of PCM.

Table 1
Distribution of events A, B, C and D; frequencies obtained using PCM

             ΔX2 > 0     ΔX2 < 0
ΔX1 > 0      A: k_A      B: k_B
ΔX1 < 0      C: k_C      D: k_D

The cases in which Y_i = Y_j are ignored. This cannot cause any limitation, since such cases hold no information on the differences in the independent variables. Because of the initial assumption of positive (or negative) correlations for both Y vs. X1 and Y vs. X2, the frequency of event A should be the largest; that is, both X1 and X2 must change in the same direction as Y. Event D shows how the correlation tends to be reduced by chance; its frequency is therefore expected to be the lowest. If the frequency of event A is not the highest, then either one or both of the X variables correlate negatively with Y. A rearrangement of the boxes is then equivalent to multiplying X1 or X2, or both, by minus one so as to obtain positive correlations between Y and X1 as well as between Y and X2. This can be seen from the formulas in brackets in Appendix A, where the rearrangement procedure is given in detail. Events A and D carry no direct information for choosing between X1 and X2.
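The counting procedure of Eq. (1) and Table 1 can be written in a few lines of Python (a minimal sketch of the procedure; the function name is ours, not from the authors' program):

```python
from itertools import combinations

def pcm_counts(Y, X1, X2):
    """Count the PCM events A, B, C, D of Table 1 over all point pairs,
    using only the signs of the differences defined in Eq. (1)."""
    kA = kB = kC = kD = 0
    for i, j in combinations(range(len(Y)), 2):
        if Y[i] == Y[j]:              # pairs with Y_i = Y_j are ignored
            continue
        s = 1 if Y[i] > Y[j] else -1  # sgn(Y_i - Y_j)
        d1 = (X1[i] - X1[j]) * s
        d2 = (X2[i] - X2[j]) * s
        if   d1 > 0 and d2 > 0: kA += 1   # A: both X1 and X2 follow Y
        elif d1 > 0 and d2 < 0: kB += 1   # B: only X1 follows Y
        elif d1 < 0 and d2 > 0: kC += 1   # C: only X2 follows Y
        elif d1 < 0 and d2 < 0: kD += 1   # D: neither follows Y
    return kA, kB, kC, kD
```

With Y = [1, 2, 3], X1 = [1, 2, 3] and X2 = [3, 2, 1], every pair is an event B, so the function returns (0, 3, 0, 0). Pairs with a zero X-difference are simply left uncounted in this sketch, a simplifying assumption of our own.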
If the frequency k_B belonging to event B is larger than k_C belonging to event C, then X1 overrides X2, and vice versa. Further details of the properties of PCM are given in [6]. The word 'larger' has to be interpreted statistically; thus, a test statistic is required to determine whether the frequency associated with event B is significantly larger than that for event C (or vice versa). This paper describes a test statistic based on testing the significance of a 2 × 2 contingency table. The power function of this test statistic and the influence of the Type I and Type II errors are also investigated and described.

Fig. 1. Graphical representation of four possible events as the basis of PCM.

3. Conditional Fisher's exact test

3.1. Type I error

Consider Table 1 as an example of a 2 × 2 contingency table. Similar contingency tables are frequently used, e.g., in the medical sciences, so several tests have been developed and investigated for them. The most important one is Fisher's exact test [3,9–14]. The contingency table shown in Table 2 can be created by applying PCM to the data. If k_B is significantly larger than k_C, then variable X1 correlates more strongly with Y than variable X2, and vice versa. The null hypothesis assumes that X1 and X2 are equally correlated with Y:

   H_0: k_B = k_C.   (2)

Consider the following alternative hypothesis:

   H_A: k_B ≠ k_C.   (3)

If H_0 is rejected, then X1 (or X2) correlates more strongly with Y than the other X variable. If H_0 is not rejected, then it can be supposed that the probabilities of events B and C are equal, i.e., the two variables have, in addition to the same predictive property, the same correlation. It can further be supposed that event B can appear in only one half of the n pairs and event C only in the other half (cf. the marginal sums n/2 in Table 2).

Table 2
The 2 × 2 contingency table helping to test the discrimination between variables X1 and X2, based on calculations by PCM

                                    Frequencies with       Frequencies without    Marginal
                                    information to         information to         sum
                                    discriminate           discriminate
X1 may have stronger correlation    k_B                    (n/2) − k_B            n/2
X2 may have stronger correlation    k_C                    (n/2) − k_C            n/2
Marginal sum                        k_B + k_C              k_A + k_D              n

The test statistic is based on the probability of the 2 × 2 contingency table being realised with the factual values k_A, k_B, k_C, k_D [3,9,13]:

   P(k_B | (k_B + k_C), n/2, n/2) = C(n/2, k_B) C(n/2, k_C) / C(n, k_B + k_C),   (4)

where C(·,·) denotes the binomial coefficient. Alternative developments of this hypergeometric formula can be found in Ref. [15]; use of this formula for determining the optimal sample size in forensic casework has recently appeared in [16]. Thus, the cumulative distribution function of the test statistic will be hypergeometric:

   F(t, K) = Σ_{k=0}^{t} P(k | K, n/2, n/2) = Σ_{k=0}^{t} C(n/2, k) C(n/2, K − k) / C(n, K),   (5)

where K = k_B + k_C. The following equation must hold:

   Σ_{k=0}^{k'_ε} C(n/2, k) C(n/2, K − k)/C(n, K) + Σ_{k=k''_ε}^{K} C(n/2, k) C(n/2, K − k)/C(n, K) = ε/2 + ε/2 = ε,   (6)

where k'_ε and k''_ε are chosen according to ε, the probability of the Type I error, so that k''_ε = K − k'_ε. Because of the symmetry, Eq. (6) can be reduced to

   2 F(k'_ε, K) = ε.   (7)

The above part of the paper describes Fisher's test. As stated by Massart et al. [14], this is the best choice for testing hypotheses on 2 × 2 contingency tables. Now all the statistical tools are at our disposal to make a decision discriminating between the two variables X1 and X2. If k_B < k'_ε or k_B > k''_ε, then the null hypothesis H_0 should be rejected at the confidence level (1 − ε); and again, if k_B is larger than k_C, then X1 correlates with Y more strongly than X2, and vice versa. The procedure is visualised in Fig. 2. To apply Eq. (6), the binomial coefficients C(n, k) = n!/(k!(n − k)!) must be calculated via the factorials n!, k! and (n − k)!. This can conveniently be done using an approximation, the Stirling formula; see Appendix B.

Fig. 2. Hypergeometric distribution function helping the acceptance or rejection of the null hypothesis H_0 based on the test statistic described in the text (k_A = 40, K = k_B + k_C = 20, k_D = 2, ε = 0.02, k'_ε = 6, k''_ε = 14).

3.2. Type II error, a strictly conditional approximation

The power function Pow(·) of the previously described test can be deduced by taking into account the wrong acceptance of the null hypothesis H_0: k_B = k_C. First, an approximation is investigated, in which the alternative hypothesis is considered to be true in the specific form H_A: k_C = k_3. Let

   K_ε = k_B + k_C ≠ K_β = k_B + k_3   (k_C ≠ k_3),   (8)

and the probability under H_A becomes

   P(k_B | (k_B + k_3), n'/2, n'/2) = C(n'/2, k_B) C(n'/2, k_3) / C(n', k_B + k_3),   (9)

where n' = k_A + k_B + k_3 + k_D. Fig. 3 shows an example of making a Type II error and its probability. The cross-hatched area in Fig. 3 can be calculated with the help of Eq. (5):

   β = F(k''_ε, K_β) − F(k'_ε, K_β) = F(K_ε − k'_ε, K_β) − F(k'_ε, K_β),   (10)

and the power function will be

   Pow(K_β/2 = (k_B + k_3)/2) = 1 − β
      = 1 + Σ_{k=0}^{k'_ε} C(n'/2, k) C(n'/2, K_β − k)/C(n', K_β) − Σ_{k=0}^{K_ε − k'_ε} C(n'/2, k) C(n'/2, K_β − k)/C(n', K_β).   (11)

Fig. 3. Representation of the probability of the Type II error (β) by the two possible distribution functions according to H_0 and H_A.

The value of k'_ε can be calculated from

   Σ_{k=0}^{k'_ε} C(n/2, k) C(n/2, K_ε − k)/C(n, K_ε) ≤ ε/2   (12)

for a given ε. For the difference z, the power function will be

   Pow(z = K_ε/2 − K_β/2) = 1 − β
      = 1 + Σ_{k=0}^{k'_ε} C(n'/2, k) C(n'/2, k_B + k_C − 2z − k)/C(n', k_B + k_C − 2z)
          − Σ_{k=0}^{K_ε − k'_ε} C(n'/2, k) C(n'/2, k_B + k_C − 2z − k)/C(n', k_B + k_C − 2z),   (13)

and its graph can be seen in Fig. 4. The steeper the graph is around z = 0, and the closer Pow(z) is to 1, the smaller the probability with which the false null hypothesis can be accepted.

Fig. 4. Power function of the test statistic for PCM.

3.3. Type II error, a complex unconditional deduction

In Eq. (8) of the previous section a specific K_β was considered as the real value of K instead of K_ε; the latter could appear only by some random effect. If the value of K_β is not known in advance, all of its possible values have to be regarded as prerequisites. Two independent random binomial samples of size n/2 are assumed, in conformity with Table 2; i.e., the binomial events are whether X1 or X2 has the stronger correlation with Y. Consider the odds ratio defined as

   ψ = p_B (1 − p_C) / [p_C (1 − p_B)],   (14)

where p_B = k_B/(n/2) and p_C = k_C/(n/2) are the probabilities of the two events according to the rows of Table 2. The conditional distribution of k_B, at a given K = k_B + k_C and ψ, according to Ref. [17] is

   f(k_B | K; ψ) = C(n/2, k_B) C(n/2, K − k_B) ψ^{k_B} / Σ_{i=max(0, K−n/2)}^{min(K, n/2)} C(n/2, i) C(n/2, K − i) ψ^{i}.   (15)

The conditional (W_cond) and unconditional (W_uncond) critical regions are defined as the ones that appeared in Ref. [18], but modified according to the two-sided test:

   W_cond(ε, K) = {(k_b, k_c): k_b ≠ k_c;
       Σ_{x=max(0, K−n/2)}^{min(k_b, k_c)} f(x | K; ψ = 1) ≤ ε/2;
       Σ_{x=max(k_b, k_c)}^{min(K, n/2)} f(x | K; ψ = 1) ≤ ε/2},   (16)

   W_uncond(ε) = ∪_{K=0}^{n} W_cond(ε, K),   (17)

where k_b and k_c are variables whose values can vary from 0 to n/2. W_uncond(ε) is the union of n + 1 mutually exclusive conditional critical regions, some of which might be empty. Let

   f(K) = Σ_{i=max(0, K−n/2)}^{min(K, n/2)} C(n/2, i) p_B^{i} (1 − p_B)^{n/2−i} C(n/2, K − i) p_C^{K−i} (1 − p_C)^{n/2−K+i}   (18)

denote the marginal distribution of K. The unconditional power can be given, based on W_uncond(ε), as

   Pow(p_B, p_C) = P{(k_b, k_c) ∈ W_uncond(ε)} = Σ_{K=0}^{n} P{(k_b, k_c) ∈ W_cond(ε, K)}
                 = Σ_{K=0}^{n} f(K) Σ_{W_cond(ε, K)} f(k_b | K; ψ).   (19)

The sets W_cond(ε, K) and W_uncond(ε) can be constructed using the tables introduced by Finney et al. [13].
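The conditional test of Eqs. (5)–(7) and the conditional critical regions of Eq. (16) for ψ = 1 can be computed exactly with integer arithmetic; a minimal Python sketch follows (the function names are ours):

```python
from math import comb

def hyper_cdf(t, K, n):
    """F(t, K) of Eq. (5): hypergeometric cumulative distribution function
    with both margins equal to n/2 (n = total number of point pairs)."""
    half = n // 2
    num = sum(comb(half, k) * comb(half, K - k)
              for k in range(max(0, K - half), t + 1))
    return num / comb(n, K)

def two_sided_p(kB, kC, n):
    """Two-sided probability 2*F(min(kB, kC), K), cf. Eq. (7); H0 is
    rejected when this value does not exceed the significance level."""
    return 2.0 * hyper_cdf(min(kB, kC), kB + kC, n)

def cond_critical_region(eps, K, n):
    """W_cond(eps, K) of Eq. (16), built directly from the two-sided test."""
    half = n // 2
    return {(kb, K - kb)
            for kb in range(max(0, K - half), min(K, half) + 1)
            if kb != K - kb and two_sided_p(kb, K - kb, n) <= eps}

# the simulated table discussed in Section 5 (k_A = 50, k_B = 6, k_C = 2,
# k_D = 0, hence n = 58):
print(two_sided_p(6, 2, 58))              # ≈ 0.253, cf. Eq. (27)
print(cond_critical_region(0.05, 5, 20))  # the pairs (5, 0) and (0, 5)
```

For n = 20 and ε = 0.05 this reproduces the conditional regions enumerated below, e.g., W_cond(ε, 5) = {(5,0), (0,5)}, and empty regions for K ≤ 4 and K ≥ 16.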
For example, for n = 20 and ε = 0.05 the rejection region is shown in Table 3. The values in this table are calculated, following the form of Eq. (5), using Eq. (20) if k_b < k_c and Eq. (21) if k_b > k_c:

   Σ_{x=max(0, k_b+k_c−n/2)}^{k_b} f(x | (k_b + k_c); ψ = 1),   (20)

   Σ_{x=k_b}^{min(k_b+k_c, n/2)} f(x | (k_b + k_c); ψ = 1).   (21)

If the value is less than ε/2 = 0.025, then the (k_b, k_c) pair belongs to the conditional critical region W_cond(ε, K = k_b + k_c).

Table 3
Critical region (bolded values) for n = 20 and ε = 0.05.

For easier interpretation of Table 3, the numbers belonging to one cumulative distribution, e.g., at K = 5 (i.e., k_b = 5 − k_c), are displayed with a grey background. Only half of each cumulative distribution was calculated, because of the symmetry of the distribution. The two-sided approach means a summation from 0 up to the lower of k_b and k_c, and likewise a summation from the higher of k_b and k_c up to n/2. The other cumulative distributions are situated perpendicularly (and diagonally) to the designated one (Table 3). According to the above example, the unconditional critical region is given using Eq. (17), i.e., by unifying the (k_b, k_c) pairs for which the value of the cumulative distribution function is not larger than ε/2 = 0.025. Thus, W_uncond will be the union (∪) of the appropriate sets W_cond (e.g., W_cond(ε, 5) = {(5,0), (0,5)}) according to Eq. (17), and the detailed enumeration is the following:

   W_uncond(ε) = W_cond(ε,5) ∪ W_cond(ε,6) ∪ W_cond(ε,7) ∪ W_cond(ε,8) ∪ W_cond(ε,9) ∪ W_cond(ε,10) ∪ W_cond(ε,11) ∪ W_cond(ε,12) ∪ W_cond(ε,13) ∪ W_cond(ε,14) ∪ W_cond(ε,15),

because W_cond(ε, r) = ∅ for r ∈ {0, 1, 2, 3, 4, 16, 17, 18, 19, 20}. Thus,

   W_uncond(ε) = {(5,0), (0,5)} ∪ {(6,0), (0,6)} ∪ {(7,0), (0,7)} ∪ {(8,0), (7,1), (1,7), (0,8)} ∪ {(9,0), (8,1), (1,8), (0,9)} ∪ {(10,0), (9,1), (8,2), (2,8), (1,9), (0,10)} ∪ {(10,1), (9,2), (2,9), (1,10)} ∪ {(10,2), (9,3), (3,9), (2,10)} ∪ {(10,3), (3,10)} ∪ {(10,4), (4,10)} ∪ {(10,5), (5,10)}.

Several algorithms are available for the fast and reliable calculation of the power function [19–21]. A new, easily programmable algorithm, based on the Stirling formula described in Appendix B, was nevertheless developed and used for PCM.

4. Discrimination between two variables by the well-known parametric way

This section summarises the methods compared with PCM. Correlation means a stochastic relationship between two random variables η and ξ, i.e., P(η < y | ξ = x) ≠ P(η < y), where P(·) denotes the probability of the event in brackets. The numerical measure of the correlation may be the Pearson product-moment correlation coefficient, ρ:

   ρ = ρ(ξ, η) = M[(ξ − M(ξ))(η − M(η))] / [D(ξ) D(η)].   (22)

It can only be calculated for distribution functions with finite means M[η], M[ξ] and finite non-zero variances V[η] = D²[η], V[ξ]. To know more about the correlation coefficient see, e.g., Falk and Well [22]. If the assumptions are not fulfilled exactly, it is expedient to use robust and fuzzy procedures (see Ref. [23] and the references therein). In correlation analysis the observations are drawn from the joint distribution of X and Y, which is assumed to be bivariate normal, and inferences concerning the correlation between X and Y can be made. The sample multiple correlation coefficient of Y with X1 and X2 is defined as the simple correlation coefficient between Y and its predicted value Ŷ (Y and Ŷ have a bivariate normal distribution in most of the practically relevant cases) [24]. It is denoted by r:

   r = r_{y,ŷ} = Σ_{i=1}^{m} y_i ŷ_i / √(Σ_{i=1}^{m} y_i² · Σ_{i=1}^{m} ŷ_i²),   (23)

where y_i = Y_i − Ȳ, ŷ_i = Ŷ_i − Ȳ, Ȳ = (1/m) Σ_{i=1}^{m} Y_i and Ŷ_i = a + b_1 X_{1i} + b_2 X_{2i}. It is possible to test whether the multiple correlation coefficient r_{Y vs. X1,X2} of Y on X1 and X2 equals the simple correlation coefficient r_{Y vs. X1} of Y on only one variable, say X1. Testing the null hypothesis H_0: ρ_{Y vs. X1,X2} = ρ_{Y vs. X1} (the Greek letter ρ denotes the expected value of r, i.e., ρ is the population correlation), the test statistic will be

   F = [(r²_{Y vs. X1,X2} − r²_{Y vs. X1}) / ((m − 1) − (m − 2))] / [(1 − r²_{Y vs. X1,X2}) / (m − 2 − 1)],   (24)

and this F value should be compared with the critical value F[1 − ε, 1, m − 3] from the table of the F distribution at the significance level ε and the given degrees of freedom. This is the same test as that for H_0: β_2 = 0 (the Greek letter β_2 here denotes the expected value of b_2); thus, one can decide whether the variable X2 is significant or not. This procedure gives a simple selection criterion for choosing between the two variables X1 and X2. The selection criterion based on Eq. (24) can only be used if one of the two variables is non-significant compared with the other. If X1 and X2 correlate approximately equally with Y, another test statistic is required, based on Fisher's z statistic [24,25]:

   z = (1/2) ln[(1 + r)/(1 − r)].   (25)

It has been shown [25] that z is approximately normally distributed with mean (1/2) ln[(1 + ρ)/(1 − ρ)] and variance 1/(m − 3). The null hypothesis is H_0: ρ_{Y vs. X1} = ρ_{Y vs. X2}. The test statistic (z_{Y vs. X1} − z_{Y vs. X2}) has a normal distribution with mean 0 and variance 1/(m_1 − 3) + 1/(m_2 − 3) when the sample correlations r_{Y vs. X1} and r_{Y vs. X2} are calculated from independent samples of sizes m_1 and m_2, respectively. One can use normal tables for testing whether

   (1/2) ln{[(1 + r_{Y vs. X1})(1 − r_{Y vs. X2})] / [(1 − r_{Y vs. X1})(1 + r_{Y vs. X2})]}   (26)

equals zero.

5. Discussion

It is always a difficult task to validate a new method; especially if the new method is more exact and precise, validation against accepted methods of lower precision is impossible. However, there are some possibilities for testing and justifying new methods. First, a theoretical deduction can be given: the derivation of the Type I and Type II errors in the preceding section is a clear indication of the correctness of the method. Secondly, Monte Carlo simulations can be carried out. They were planned as an opposite approach: if there is no correlation between Y and X1, nor between Y and X2, then PCM should not produce any artefacts. Extended Monte Carlo simulations were made using various numbers of vector elements, different distributions and many thousands of repetitions. The results are under evaluation, and we will publish them in due course. Here we mention only that (i) the results of the Monte Carlo simulations are not in contradiction with the theoretical deduction presented here, and (ii) the artefact rate was 4.6% (N = 20, ε = 0.05) using random numbers of a uniform distribution, i.e., the method found differences between the factors X1 and X2 with a probability similar to that at which differences were generated in a random manner. Thirdly, the empirical experience should be mentioned: many hundreds of applications of PCM and comparisons of their results with classical methods suggest that PCM is able to find significant differences between factors more frequently than any of the classical methods. On the other hand, whenever the classical methods find a factor to be superior, PCM finds the same superiority.
5.1. Case study 1 (simulated data)

First, results derived from simulated data were investigated. Two vectors (X1 and X2) were created by a random number generator. The dependent variable (Y) was defined as the sum of X1 and X2. Thus, the correlations between Y and X1 as well as between Y and X2 were ensured, whereas there was no correlation between X1 and X2. The results below were chosen from hundreds of simulations to show that a seemingly large ratio is not necessarily accompanied by a significant difference between the variables. Using PCM, the frequencies summarised in Table 4 are obtained.

Table 4
Frequencies obtained by applying PCM for the simulated data

             ΔX2 > 0       ΔX2 < 0
ΔX1 > 0      k_A = 50      k_B = 6
ΔX1 < 0      k_C = 2       k_D = 0

To make a decision, the following values of the distribution function were calculated from Eq. (5):

   2 F(k = k_C = 2, K = 2 + 6) = 0.253 (> 0.1 > 0.05 > 0.01),   (27)
   2 F(k = 1, K = 1 + 7) = 0.0517 (< 0.1, but > 0.05 > 0.01),   (28)
   2 F(k = 0, K = 0 + 8) = 0.00448 (< 0.01 < 0.05 < 0.1).   (29)

According to Eq. (27), the null hypothesis that the two variables are equally correlated cannot be rejected at the 1%, 5% or even the 10% level. This is because the data for both variables X1 and X2 were generated by the same uniform random number generator. The classical F-test of correlation coefficients provides the same result: the two variables X1 and X2 are indistinguishable. If the null hypothesis is accepted, then the alternative hypothesis has to be rejected; this means that a Type II error can occur. The question remains: what is the probability of this event? The probability of the Type II error, β, will be 0.82 at a significance level of ε = 0.05. This value of β is very high. The ε = 0.05 is only a nominal value, however; the decision could also be made at a level as high as ε = 0.253. Then the Type II error will be 0.46, which is much smaller than previously. Increasing the number of samples can reduce the false-negative error too and increase the power of the test, as seen in Table 5.

Table 5
Reducing the false-negative error at the nominal value ε = 0.05

k_A     k_B     k_C     k_D     β        Power
50      6       2       0       0.822    0.178
100     12      4       0       0.518    0.482
125     15      5       0       0.390    0.610
150     18      6       0       0.292    0.708
175     21      7       0       0.213    0.787
200     24      8       0       0.163    0.837
225     27      9       0       0.118    0.882
250     30      10      0       0.082    0.918

The conclusions from this case study are the following: (i) PCM does not indicate a difference between factors if no difference was generated in the data, even if a seemingly large ratio exists between the frequencies k_B and k_C; (ii) the method is conservative enough not to signal a difference (cf. the probability of the Type II error); (iii) to keep both types of error low, the number of vector elements (degrees of freedom) should be increased.

5.2. Case study 2 (real data)

It is a difficult task to find an example for which the classical tests do not indicate a significant difference between factors while such a difference is known for sure to exist. As an experimental examination, mentioned before in [4], the 2-cyano-2-propyl radical addition to vinyl-type alkenes is considered, based on preliminary results in Ref. [6]. The logarithms of the addition rate constants of some alkenes were investigated as functions of the reaction enthalpies (ΔHr) and the electron affinities of the alkenes (EA). An F-test according to Eq. (24) could not help to choose between the variables EA and ΔHr (F = 4.054 at p = 0.0612 > 0.05). The absolute values of the correlation coefficients were very close to each other (r_{log k vs. EA} = 0.8427 and r_{log k vs. ΔHr} = −0.8554), and the z-statistic according to Eq. (26) could not differentiate between them (Δz = z_{log k vs. ΔHr} − z_{log k vs. EA} = −0.04582, Var[Δz] = 0.125, p = 0.2758 ≫ 0.05). Heuristically, a possible dominance of the reaction enthalpy was predicted by the pair-correlation method, but it was not proved statistically. Table 6 shows the results of PCM calculated with a realistic additive error level σ = 0.2 [6].

Table 6
Frequencies obtained by applying PCM to discriminate between EA and ΔHr

              ΔEA > 0        ΔEA < 0
ΔΔHr > 0      k_A = 109      k_B = 22
ΔΔHr < 0      k_C = 6        k_D = 3

   2 F(k = k_C = 6, K = 6 + 22) = 0.00123 (< 0.01 < 0.05 < 0.1),   (30)
   2 F(k = 7, K = 7 + 21) = 0.00537 (< 0.01 < 0.05 < 0.1),   (31)
   2 F(k = 8, K = 8 + 20) = 0.0190 (< 0.05 < 0.1, but > 0.01),   (32)
   2 F(k = 9, K = 9 + 19) = 0.0560 (< 0.1, but > 0.05),   (33)
   2 F(k = 10, K = 10 + 18) = 0.138 (> 0.1 > 0.05 > 0.01).   (34)

It is obvious that the reaction enthalpy ΔHr correlates more strongly with log k than the electron affinity EA does. Thus, the dominance of the variable ΔHr is proved statistically at a confidence level of 99.9% according to Eq. (30). If the null hypothesis is rejected at the nominal value of ε = 0.05, one can make a false-positive conclusion with probability 0.05, i.e., one fails in 5% of the occurrences. On the other hand, H_0 could be rejected even at a level below 0.00123; in that case the probability of a false-positive decision is only 1.23 per thousand events, but the false-negative error is β = 0.49, i.e., a mistake is made in almost half of the occurrences. At the nominal value of ε = 0.05 the power of the test is 0.914, which is rather reassuring; so in this example it is highly recommended to reject H_0 and to accept H_A.
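The sample-size trend of Table 5 can be sketched from the unconditional deduction: the code below (our own reconstruction, not the authors' program) builds the two-sided conditional critical regions of Eq. (16) with ψ = 1 and accumulates the probability mass of two independent binomial samples, as in Eqs. (18) and (19). It assumes that n = k_A + k_B + k_C + k_D is even and that p_B and p_C are taken from the corresponding table row:

```python
from math import comb

def unconditional_power(kA, kB, kC, kD, eps=0.05):
    """Unconditional power, Eq. (19): the probability that (k_b, k_c),
    drawn from independent Bin(n/2, p_B) and Bin(n/2, p_C) samples,
    falls into W_uncond(eps). Assumes an even n."""
    n = kA + kB + kC + kD
    half = n // 2
    pB, pC = kB / half, kC / half
    cb = [comb(half, k) for k in range(half + 1)]   # C(n/2, k)
    cbn = [comb(n, K) for K in range(n + 1)]        # C(n, K)
    fB = [cb[k] * pB ** k * (1 - pB) ** (half - k) for k in range(half + 1)]
    fC = [cb[k] * pC ** k * (1 - pC) ** (half - k) for k in range(half + 1)]
    total = 0.0
    for kb in range(half + 1):
        for kc in range(half + 1):
            if kb == kc:
                continue                            # Eq. (16): k_b != k_c
            K, t = kb + kc, min(kb, kc)
            tail = sum(cb[x] * cb[K - x] for x in range(max(0, K - half), t + 1))
            if 2.0 * tail / cbn[K] <= eps:          # (k_b, k_c) in W_cond(eps, K)
                total += fB[kb] * fC[kc]
    return total

p1 = unconditional_power(50, 6, 2, 0)     # first row of Table 5
p2 = unconditional_power(100, 12, 4, 0)   # doubled sample size
p3 = unconditional_power(150, 18, 6, 0)   # tripled sample size
print(p1, p2, p3)
```

Run on the even-n rows of Table 5, the computed power increases with the sample size, in line with the trend of the tabulated values (the odd-n rows of Table 5 would need the Stirling/gamma generalisation of the binomial coefficients mentioned in Appendix B).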
Conclusions Some improvements over the earlier version of PCM w6x were made to the algorithm in order to avoid 11 the use of correlation coefficients as parametric characteristics. The discrimination of variables is based on a 2 = 2 contingency table. The conditional Fisher’s exact test was introduced to test the hypothesis. The power function of the test statistic and description of the influence of Type I Žfalse-positive conclusion, error of the first kind. and Type II Žfalse-negative conclusion, error of the second kind. errors have been given by theoretical deductions. To our best knowledge, this is the first report on using the concept of the Type I and Type II errors to validate a variable selection method in the literature. The consequences of the Type I and II errors were detailed showing on two case studies. They can help to understand the principles of the algorithm and to avoid the pitfalls of hypothesis testing. Example 1 stressed the importance of increasing the sample size when b is very high to accept H 0 at low risk. Example 2 showed the situation when one can reject H 0 and the hazard of choosing a very low value for ´ . Results shown in this paper are not only related to PCM, but they can be generally used for every 2 = 2 tables and the Fisher’s exact test appearing in chemometric problems. So much the more, as the Type II error of Fisher’s exact test is not mentioned at all in chemometric papers and handbooks. Summarising the advantages of the pair correlation method with Fisher’s exact test as a selection criterion: The method is able to find significant differences between factors Žmodels., even if other statistical criteria cannot indicate it. PCM does not need the assumption of Gaussian or other fixed type distribution. In contrast to the classical methods, PCM can work with correlated variables. PCM can easily be generalised for variable selections for more than two variables. 
The comparison of factors can be made pair-wise in all possible combinations. Every comparison can mark a factor as superior or inferior, or no decision can be made. The factors are then ordered according to the number of their superiorities. Moreover, PCM can be generalised to any fixed non-linear model; in such cases, Ŷ_1 and Ŷ_2 should be used instead of X_1 and X_2. Consistency is also important in developing the first non-parametric variable selection method. Hence, the rearrangement of boxes (summarised in Appendix A) is introduced, which has the advantage of using consistently non-parametric methods.

As a disadvantage, we can only mention the insensitivity of PCM to prediction purposes; prediction lies outside the scope of the method. A data set can easily be constructed for which the more correlated (superior) factor has the lower predictive ability as measured by classical tests (residual error). However, this happens rarely during normal usage of PCM. Generally, if a factor correlates better with the dependent variable than another factor (i.e., it is superior by PCM), it also has the better prediction performance. In the vast majority of cases, PCM selects the better variable for prediction as well.

The novelty of the paper can be summarised as follows: (i) this is the first approach to discriminate between seemingly equivalent factors, i.e., to select variables in a non-parametric way; (ii) this is the first application of a statistically correct selection criterion to a non-parametric variable selection method; and (iii) this is the first appearance in the chemometric literature of a derivation of the Type I and Type II errors for the two-sided Fisher's exact test. A user-friendly program [28] is available from the authors upon request.

Acknowledgements

The Academic Research Project (No. AKP 98-51 2,4/19) and the Hungarian Science Foundation (No.
OTKA F-025287) supported this scientific research. The authors would like to acknowledge the helpful critical comments of an anonymous referee.

Appendix A

There are 75 different outputs from the raw use of PCM. The following summarises all possible outcomes:

(1) k_A = k_D > k_B > k_C: Situation −1
(2) k_A = k_D > k_C > k_B: Situation −1
(3) k_B = k_C > k_A > k_D: Situation −1
(4) k_B = k_C > k_D > k_A: Situation −1
(5) k_A = k_B > k_C > k_D: Situation 0
(6) k_A = k_B > k_C = k_D: Situation 0
(7) k_A = k_B = k_C > k_D: Situation 0
(8) k_A = k_B = k_C = k_D: Situation 0
(9) k_A > k_B > k_C > k_D: Situation 0
(10) k_A > k_B > k_C = k_D: Situation 0
(11) k_A > k_B = k_C > k_D: Situation 0
(12) k_A > k_B = k_C = k_D: Situation 0
(13) k_A > k_B > k_D > k_C: Situation 0
(14) k_A > k_B = k_D > k_C: Situation 0
(15) k_A = k_C > k_B > k_D: Situation 0
(16) k_A = k_C > k_B = k_D: Situation 0
(17) k_A > k_C > k_B > k_D: Situation 0
(18) k_A > k_C > k_B = k_D: Situation 0
(19) k_A > k_C > k_D > k_B: Situation 0
(20) k_A > k_C = k_D > k_B: Situation 0
(21) k_A = k_D > k_B = k_C: Situation 0
(22) k_A > k_D > k_B > k_C: Situation 0
(23) k_A > k_D > k_B = k_C: Situation 0
(24) k_A > k_D > k_C > k_B: Situation 0
(25) k_A = k_B > k_D > k_C: Situation 1
(26) k_A = k_B = k_D > k_C: Situation 1
(27) k_B > k_A > k_C > k_D: Situation 1
(28) k_B > k_A > k_C = k_D: Situation 1
(29) k_B > k_A = k_C > k_D: Situation 1
(30) k_B > k_A = k_C = k_D: Situation 1
(31) k_B > k_A > k_D > k_C: Situation 1
(32) k_B > k_A = k_D > k_C: Situation 1
(33) k_B = k_C > k_A = k_D: Situation 1
(34) k_B > k_C > k_A > k_D: Situation 1
(35) k_B > k_C > k_A = k_D: Situation 1
(36) k_B > k_C > k_D > k_A: Situation 1
(37) k_B > k_C = k_D > k_A: Situation 1
(38) k_B = k_D > k_A > k_C: Situation 1
(39) k_B = k_D > k_A = k_C: Situation 1
(40) k_B > k_D > k_A > k_C: Situation 1
(41) k_B > k_D > k_A = k_C: Situation 1
(42) k_B > k_D > k_C > k_A: Situation 1
(43) k_A = k_C > k_D > k_B: Situation 2
(44) k_A = k_C = k_D > k_B: Situation 2
(45) k_C > k_A > k_B > k_D: Situation 2
(46) k_C > k_A > k_B = k_D: Situation 2
(47) k_C > k_A = k_B > k_D: Situation 2
(48) k_C > k_A = k_B = k_D: Situation 2
(49) k_C > k_A > k_D > k_B: Situation 2
(50) k_C > k_A = k_D > k_B: Situation 2
(51) k_C > k_B > k_A > k_D: Situation 2
(52) k_C > k_B > k_A = k_D: Situation 2
(53) k_C > k_B > k_D > k_A: Situation 2
(54) k_C > k_B = k_D > k_A: Situation 2
(55) k_C = k_D > k_A > k_B: Situation 2
(56) k_C = k_D > k_A = k_B: Situation 2
(57) k_C > k_D > k_A > k_B: Situation 2
(58) k_C > k_D > k_A = k_B: Situation 2
(59) k_C > k_D > k_B > k_A: Situation 2
(60) k_B = k_C = k_D > k_A: Situation 3
(61) k_B = k_D > k_C > k_A: Situation 3
(62) k_C = k_D > k_B > k_A: Situation 3
(63) k_D > k_A > k_B > k_C: Situation 3
(64) k_D > k_A > k_B = k_C: Situation 3
(65) k_D > k_A = k_B > k_C: Situation 3
(66) k_D > k_A = k_B = k_C: Situation 3
(67) k_D > k_A > k_C > k_B: Situation 3
(68) k_D > k_A = k_C > k_B: Situation 3
(69) k_D > k_B > k_A > k_C: Situation 3
(70) k_D > k_B > k_A = k_C: Situation 3
(71) k_D > k_B > k_C > k_A: Situation 3
(72) k_D > k_B = k_C > k_A: Situation 3
(73) k_D > k_C > k_A > k_B: Situation 3
(74) k_D > k_C > k_A = k_B: Situation 3
(75) k_D > k_C > k_B > k_A: Situation 3

If k_A is not the largest count, the boxes A, B, C and D must be rearranged. The situations above are interpreted as follows:

Situation −1: ambiguous situation; PCM cannot make a distinction.
Situation 0: no change needed.
Situation 1: changes needed, k_A ↔ k_B and k_C ↔ k_D (Y vs. X_1 and Y vs. −X_2).
Situation 2: changes needed, k_A ↔ k_C and k_B ↔ k_D (Y vs. −X_1 and Y vs. X_2).
Situation 3: changes needed, k_A ↔ k_D and k_B ↔ k_C (Y vs. −X_1 and Y vs. −X_2).

The ambiguous situation does not occur frequently, and even when it does, it causes no problem. If the output is number 1 or 2 of the situation table, then k_A = k_D, which means that the correlation is enhanced and weakened equally. For outputs 3 and 4, the realignment of boxes can be done as prescribed for either Situation 1 or 2; in those cases, PCM indicates that X_1 and X_2 correlate equally with Y, as k_B and k_C would be statistically indistinguishable. It can happen that k_D is not the smallest count after reshuffling, because the main rule is only that k_A has to be the largest. The modification is based on the properties of the criterion function CF(i, j) and avoids the use of correlation coefficients, the parametric characteristics used in the previously presented algorithm of PCM [6]. The criterion function for rearrangement is given by:

CF(i, j) = { 0, if D_1·D_2 = 0; (−2D_1 − D_2 + 5)/2, otherwise },   (35)
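Eq. (35) is straightforward to implement. In the minimal sketch below (our illustration, not the authors' program [28]), D_k = sgn(X_{ki} − X_{kj})·sgn(Y_i − Y_j) for k = 1, 2, and the counts k_A, k_B, k_C, k_D are obtained by tallying CF(i, j) over all pairs i < j; tied pairs (D_1·D_2 = 0) are ignored.

```python
def sgn(x):
    # Sign function: -1, 0 or +1.
    return (x > 0) - (x < 0)

def criterion_function(d1, d2):
    """Eq. (35): 0 for ties, otherwise maps (D1, D2) to box number 1..4."""
    if d1 * d2 == 0:
        return 0
    return (-2 * d1 - d2 + 5) // 2

def pcm_boxes(x1, x2, y):
    """Count the boxes over all pairs i < j.
    Returns counts[1] = k_A, counts[2] = k_B, counts[3] = k_C,
    counts[4] = k_D and counts[0] = number of ignored (tied) pairs."""
    counts = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            d1 = sgn(x1[i] - x1[j]) * sgn(y[i] - y[j])
            d2 = sgn(x2[i] - x2[j]) * sgn(y[i] - y[j])
            counts[criterion_function(d1, d2)] += 1
    return counts
```

With the counts in hand, the box labelled k_A is made the largest by the rearrangement rules of the situation table above before the selection criterion is applied.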
where D_k = sgn(X_{ki} − X_{kj})·sgn(Y_i − Y_j), k = 1 or 2. Thus, the number of cases in box A is #{CF(i, j) = 1; 1 ≤ i < j ≤ n}, in box B #{CF(i, j) = 2; 1 ≤ i < j ≤ n}, in box C #{CF(i, j) = 3; 1 ≤ i < j ≤ n}, in box D #{CF(i, j) = 4; 1 ≤ i < j ≤ n}, and there will be #{CF(i, j) = 0; 1 ≤ i < j ≤ n} ignored cases. An easy derivation of CF(i, j) is based on finding the simplest function depending only on D_1 and D_2:

CF(i, j) = aD_1 + bD_2 + cD_1D_2 + d.   (36)

Four equations can be written to determine the four unknown coefficients a, b, c and d (D_k takes only the values 1 and −1):

CF(i, j) = 1 = a·1 + b·1 + c·1 + d   (37)
CF(i, j) = 2 = a·1 + b·(−1) + c·(−1) + d
CF(i, j) = 3 = a·(−1) + b·1 + c·(−1) + d
CF(i, j) = 4 = a·(−1) + b·(−1) + c·1 + d.

The solution of this linear equation system is a = −1, b = −1/2, c = 0 and d = 5/2. Substituting these results into Eq. (36) yields Eq. (35).

Appendix B

To calculate factorials, the following Stirling approximation can be used [26,27]:

n! = (n/e)^n √(2πn) e^{B*_{n,h}},  B*_{n,h} = Σ_{j=1}^{h} B_{2j} / [(2j − 1)(2j) n^{2j−1}],   (38)

where the Bernoulli numbers B_j are defined by B_0 = 1 and

B_j = −[1/(j + 1)] Σ_{i=0}^{j−1} C(j + 1, i) B_i,  j = 1, 2, 4, 6, 8, 10, . . . ,   (39)

because B_3 = B_5 = B_7 = B_9 = . . . = 0. The binomial coefficient C(n, k) can be calculated using Eq. (38):

C(n, k) = [1/√(2π)] · n^{n+1/2} / [k^{k+1/2} (n − k)^{n−k+1/2}] · e^{B*_{n,h} − B*_{k,h} − B*_{n−k,h}}.   (40)

The value of h may be chosen as any integer from 0 upwards; sufficiently precise results were, however, obtained already at h = 3.

References

[1] (a) M.L. Thompson, Int. Stat. Rev. 46 (1978) 1–19; (b) M.L. Thompson, Int. Stat. Rev. 46 (1978) 129–146.
[2] V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C. Sterna, Anal. Chem. 68 (1996) 3851–3858.
[3] W.J.
Conover, Practical Nonparametric Statistics, 2nd edn., Wiley, New York, 1980, Chap. 4.
[4] K. Héberger, H. Fischer, Int. J. Chem. Kinet. 25 (1993) 249–263.
[5] K. Héberger, H. Fischer, Int. J. Chem. Kinet. 25 (1993) 913–920.
[6] K. Héberger, R. Rajkó, Discrimination of statistically equivalent variables in quantitative structure–activity relationships, in: F. Chen, G. Schüürmann (Eds.), Quantitative Structure–Activity Relationships (QSAR) in Environmental Sciences–VII, SETAC Press, Pensacola, FL, 1997, pp. 423–431, Chap. 29.
[7] I. Vincze, Mathematische Statistik mit industriellen Anwendungen, Akadémiai Kiadó, Budapest, 1971 (in German).
[8] R.L. Mason, R.F. Gunst, J.L. Hess, Statistical Design and Analysis of Experiments with Applications to Engineering and Science, Wiley, New York, 1989.
[9] E.L. Lehmann, Testing Statistical Hypotheses, 2nd edn., Chapman and Hall, New York, 1993.
[10] W.H. Robertson, Technometrics 2 (1960) 103–107.
[11] G.J.G. Upton, J. R. Stat. Soc. A 145 (1982) 86–105.
[12] F. Yates, J. R. Stat. Soc. A 145 (1984) 426–463.
[13] D.J. Finney, R. Latscha, B.M. Bennett, P. Hsu, Tables for Testing Significance in a 2×2 Contingency Table, Cambridge Univ. Press, Cambridge, 1963.
[14] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. de Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics: Part A and Part B, Elsevier, Amsterdam, 1998.
[15] R.C. Serlin, J.R. Levin, J. Stat. Edu. 4 (2) (1996) http://www.stat.unipg.it/ncsu/info/jse/v4n2/serlin.html
[16] N.M. Faber, M. Sjerps, H.A.L. Leijenhorst, S.E. Maljaars, Sci. Justice 39 (2) (1999) 113–122.
[17] R.A. Fisher, J. R. Stat. Soc. A 98 (1935) 39–54.
[18] M. Gail, J.J. Gart, Biometrics 29 (1973) 441–448.
[19] J.T. Casagrande, M.C. Pike, P.G. Smith, Appl. Stat. 27 (1978) 212–219.
[20] R.G. Thomas, M. Conlon, Technical Report 382, University of Florida, Gainesville, 1991.
[21] M. Conlon, R.G. Thomas, Appl. Stat. 42 (1993) 258–260.
[22] R. Falk, A.D. Well, J.
Stat. Edu. 5 (3) (1997) http://www.stat.unipg.it/ncsu/info/jse/v5n3/falk.html
[23] R. Rajkó, Anal. Lett. 27 (1994) 215–228.
[24] O.J. Dunn, V.A. Clark, Applied Statistics: Analysis of Variance and Regression, 2nd edn., Wiley, New York, 1987.
[25] G.S. Mudholkar, Fisher's z-distribution, in: S. Kotz, N.L. Johnson (Eds.), Encyclopedia of Statistical Sciences, vol. 3, Wiley, New York, 1983.
[26] M. Abramowitz, I.A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing, Dover, New York, 1972.
[27] P. Szász, Elements of Differential and Integral Calculus, Közoktatásügyi Kiadó, Budapest, 1951 (in Hungarian).
[28] R. Rajkó, K. Héberger, Program for Pair-Correlation Method (PCM) V1.0a written in Visual Basic for Applications of MS Excel V7.0, 1998.
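As a supplementary numerical check of the approximation in Appendix B, the sketch below implements our transcription of Eqs. (38)–(40): the Bernoulli numbers are generated exactly from the recurrence of Eq. (39), and the approximate binomial coefficient of Eq. (40) at h = 3 is compared with the exact value.

```python
from fractions import Fraction
from math import comb, exp, log, pi

def bernoulli_numbers(m):
    """Exact Bernoulli numbers B_0..B_m via the recurrence of Eq. (39)."""
    B = [Fraction(1)]
    for j in range(1, m + 1):
        B.append(-Fraction(1, j + 1)
                 * sum(comb(j + 1, i) * B[i] for i in range(j)))
    return B

def b_star(n, h, B):
    """B*_{n,h} of Eq. (38): sum_{j=1}^{h} B_{2j} / ((2j-1)(2j) n^(2j-1))."""
    return sum(float(B[2 * j]) / ((2 * j - 1) * (2 * j) * n ** (2 * j - 1))
               for j in range(1, h + 1))

def binom_approx(n, k, h=3):
    """Eq. (40): binomial coefficient C(n, k) from the Stirling series,
    evaluated in log space for numerical stability (requires 0 < k < n)."""
    B = bernoulli_numbers(2 * h)
    log_c = ((n + 0.5) * log(n) - (k + 0.5) * log(k)
             - (n - k + 0.5) * log(n - k) - 0.5 * log(2 * pi)
             + b_star(n, h, B) - b_star(k, h, B) - b_star(n - k, h, B))
    return exp(log_c)
```

For example, binom_approx(28, 6) agrees with the exact comb(28, 6) = 376740 to better than one part in 10^6, which supports the remark that h = 3 is already sufficiently precise.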