Canadian Journal on Computing in Mathematics, Natural Sciences, Engineering and Medicine, Vol. 4, No. 1, February 2013

Comparison of Seven Asymptotic Error Rate Expansions for the Sample Linear Discriminant Function

D. D. Ekezie, S. I. Onyeagu

Abstract — Seven asymptotic error rate expansions for the sample linear discriminant function were considered and compared using binary variables. A simulation experiment was carried out to compare the performance of these rules. In all, 22 population pairs, giving rise to 225 configurations, were formed. At each of the 225 configurations, the asymptotic expansion whose error estimate had the minimum variance after 1000 repeated trials was declared the best. Over the 225 configurations of the simulation experiments, Anderson's asymptotic expansion was the best in terms of minimum variance.

Key Words: Fisher's Linear Discriminant Function, Mahalanobis Distance, Asymptotic Error Rate Expansions.

1. Introduction

Discrimination and classification deal with problems of differentiating between two or more populations on the basis of multivariate measurements. In discrimination, we are given the existence of two populations and a sample of individuals from each. The problem is to set up a rule, based on measurements from these individuals, which will enable us to assign a new individual to the correct population when we do not know from which of the two it emanates. In classification, we are given a sample of individuals, or the whole population, and the problem is to sort them into groups that are as distinct as possible. For example, given a population of unknown origin, we may wish to see whether its members fall into natural classes, "natural" in the sense that the members of a group resemble one another closely while the members of one group differ considerably from those of another.

The classical linear statistical discrimination problem may be described as follows. Suppose that each member of the union of two populations possesses a finite set of common characteristics or features, denoted by F = (f₁, …, f_p), whose observed values are X = (x₁, x₂, …, x_p), where x_j is the observed value of the characteristic f_j, j = 1, 2, …, p. Let π₁ and π₂ denote two distinct populations whose known multivariate probability mass functions are multivariate Bernoulli with mean vectors P₁ and P₂ such that P₁ ≠ P₂, and with common covariance matrix Σ₁ = Σ₂ = Σ. Also, let q₁ and q₂ be the known a priori probabilities that an individual is selected from π₁ or π₂, respectively. Let c(i|j) be the cost of misclassifying an individual from π_j into π_i, where

c(i|j) > 0 for i ≠ j, c(i|j) = 0 for i = j, i, j = 1, 2.

Then, given a p × 1 observation vector X on an individual selected at random from the union of the populations π₁ and π₂, the statistical discrimination problem is to formulate a decision rule which classifies the individual into one of the populations and optimizes some criterion which measures performance accuracy. A particular value of X, denoted by x = (x₁, x₂, …, x_p), is called a response pattern. A response pattern is a series of zeros and ones, and the probability that response pattern x is observed in the i-th population (i = 1, 2) will be denoted by π_i(x). The linear discriminant function was developed by Fisher (1936), who applied the criterion of finding the linear transformation which maximizes the difference between the group means of the transformed variable relative to its common dispersion.
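As a concrete illustration of this setup, the following minimal Python sketch (not from the paper's own programs) draws response patterns from two multivariate Bernoulli populations and estimates the expected misclassification cost of an arbitrary rule. For brevity the binary features are taken to be independent within each population, so the mean vectors P₁ and P₂ fully specify the populations; the independence, the numeric values and all function names here are illustrative assumptions.

```python
# Illustrative sketch of the two-population setup; independent Bernoulli
# components are an assumption made here only to keep the example short.
import numpy as np

rng = np.random.default_rng(0)
P1 = np.array([0.3, 0.4, 0.5])        # mean vector of population pi_1
P2 = np.array([0.5, 0.6, 0.7])        # mean vector of population pi_2
q1, q2 = 0.5, 0.5                     # a priori probabilities
cost = {(1, 2): 1.0, (2, 1): 1.0}     # c(i|j): cost of putting a pi_j item into pi_i

def draw(n, P):
    """Draw n response patterns (rows of zeros and ones) from a population."""
    return (rng.random((n, len(P))) < P).astype(int)

def expected_cost(classify, trials=10000):
    """Monte Carlo estimate of the expected cost of a rule `classify`
    (a function returning 1 or 2) under the mixture q1*pi_1 + q2*pi_2."""
    total = 0.0
    for _ in range(trials):
        j = 1 if rng.random() < q1 else 2
        x = draw(1, P1 if j == 1 else P2)[0]
        i = classify(x)
        if i != j:
            total += cost[(i, j)]
    return total / trials
```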
2. Fisher's Linear Discriminant Function (LDF)

Fisher's Linear Discriminant Function (LDF) for binary variables is given by

L(x) = Σ_j Σ_k (P₂ⱼ − P₁ⱼ) σ^{kj} x_k − ½ Σ_j Σ_k (P₂ⱼ − P₁ⱼ) σ^{kj} (P₂ₖ + P₁ₖ),   … (2.1)

where the σ^{kj} are the elements of the inverse of the pooled covariance matrix for the two populations. Typically, the parameters of the two underlying distributions are unknown and must be estimated using samples of sizes n₁ and n₂ from π₁ and π₂, respectively. Unbiased estimates of the P_{ij} are given by

P̂_{ij} = Σ_{x ∈ S_j} n_i(x) / n_i,   … (2.2)

where S_j is the set of all patterns x with x_j = 1 and n_i(x) is the number of occurrences of pattern x in the sample from π_i. Therefore, the sample-based linear discriminant function for binary variables is given by

L̂(x) = Σ_j Σ_k (p₂ⱼ − p₁ⱼ) S^{kj} x_k − ½ Σ_j Σ_k (p₂ⱼ − p₁ⱼ) S^{kj} (p₂ₖ + p₁ₖ),   … (2.3)

where the S^{kj} are the elements of the inverse of the pooled sample variance-covariance matrix. The allocation rule for Fisher's LDF is the following: classify a new item with response pattern x into π₂ if

L̂(x) ≥ c,   … (2.4)

and into π₁ otherwise, where c is the constant defined as

c = log_e [ q₁ c(2|1) / (q₂ c(1|2)) ].   … (2.5)

For any classification rule, its associated error rates are often the criteria by which the classification performance is evaluated. In the two-population discrimination problem there are two possible misclassifications: a rule may classify an observation actually from π₂ into π₁, or it may classify an observation from π₁ into π₂. The respective probabilities of misclassification are denoted by P(1|2) and P(2|1). The overall probability of misclassification, or total error rate, is expressed as

q₁ P(2|1) + q₂ P(1|2).   … (2.6)

The error rate is easily calculated when the populations are characterized by multivariate normal densities with known parameters; in the case c = 0 it is equal to Φ(−D/2), where Φ is the standard normal cumulative distribution function and D is the Mahalanobis distance, defined by

D² = Σ_j Σ_k (P₂ⱼ − P₁ⱼ) σ^{kj} (P₂ₖ − P₁ₖ).   … (2.7)

However, when the population parameters are unknown and must be estimated, calculation of the exact overall expected error rate of Fisher's LDF becomes virtually intractable. In an attempt to remedy this problem, investigators have derived asymptotic expansions for the overall expected error rate of the sample LDF. In this paper we generate data from the multivariate Bernoulli distribution, use the data to compute the Mahalanobis squared distance, and plug the values into the standard normal cumulative distribution function Φ. We consider sample sizes n₁ = n₂ = 30 and above, since the Bernoulli distribution can be approximated by the normal for large samples.
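A minimal sketch of the sample LDF of (2.3) and the allocation rule (2.4)-(2.5) is given below. The use of a pseudo-inverse is a practical safeguard assumed here (with binary data the pooled matrix can be singular in small samples), not part of the paper's formulas, and the function names are illustrative.

```python
# Sketch: fit the sample LDF (2.3) from two training samples and return
# the allocation rule (2.4) with the cut-off constant c of (2.5).
import numpy as np

def fit_sample_ldf(X1, X2, q1=0.5, q2=0.5, c12=1.0, c21=1.0):
    n1, n2 = len(X1), len(X2)
    p1, p2 = X1.mean(axis=0), X2.mean(axis=0)          # estimates of P1j, P2j as in (2.2)
    Spooled = ((n1 - 1) * np.cov(X1, rowvar=False)
               + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    Sinv = np.linalg.pinv(Spooled)                     # elements S^{kj} of (2.3)
    w = Sinv @ (p2 - p1)                               # discriminant coefficients
    const = 0.5 * (p2 - p1) @ Sinv @ (p2 + p1)         # second term of (2.3)
    cutoff = np.log((q1 * c21) / (q2 * c12))           # constant c of (2.5)

    def classify(x):
        L = w @ x - const
        return 2 if L >= cutoff else 1                 # rule (2.4)
    return classify
```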
3. Statement of the Problem

For any classification rule, its associated error rates are often the criteria by which the classification performance is evaluated. In the two-population discrimination problem there are two possible misclassifications, with probabilities P(1|2) and P(2|1) and total error rate q₁P(2|1) + q₂P(1|2), as in (2.6). When the populations are characterized by multivariate normal densities with known parameters and the cut-off constant is c = log_e[q₁c(2|1)/(q₂c(1|2))], the two error rates are

P(1|2) = Φ(−D/2 − c/D) and P(2|1) = Φ(−D/2 + c/D),

where Φ is the standard normal cumulative distribution function and D is the Mahalanobis distance of (2.7). However, when the population parameters are unknown and must be estimated, calculation of the exact overall expected error rate of Fisher's linear discriminant function becomes virtually intractable, and investigators have instead derived asymptotic expansions for the overall expected error rate of the sample linear discriminant function.

Most statistical methods developed for estimation, hypothesis testing and confidence statements are based upon an exact specification of the populations of the response variates. In the applied sciences another kind of multivariate problem often occurs, in which an observation must be assigned in some optimum way to one of several populations. Suppose our population consists of two groups, π₁ and π₂. We observe a p × 1 vector X and must assign the individual whose measurements are given by X to π₁ or π₂; we need a rule for doing so. If the parameters of the distribution of X in π₁ and π₂ are known, we may use this knowledge in the construction of an assignment rule. If not, we use samples of sizes n₁ from π₁ and n₂ from π₂ to estimate the parameters. We also need a criterion of goodness of classification. Fisher (1936) suggested using a linear combination of the observations, choosing the coefficients so that the ratio of the difference of the means of the linear combination in the two groups to its variance is maximized. Welch (1939) suggested that minimizing the total probability of misclassification would be a sensible idea. Von Mises (1945) suggested minimizing the maximum probability of misclassification in the two groups. Therefore, when confronted with the problem of classifying an object of unknown origin with measurement vector X = (X₁, X₂, …, X_p)′ into π₁ or π₂, how do we choose the "best" rule so that the expected cost associated with misclassification is minimized?

In this work, we generate data from the multivariate Bernoulli distribution, use the data to compute the Mahalanobis squared distance, and plug the values into the standard normal cumulative distribution function Φ. We consider sample sizes n₁ = n₂ = 30 and above, since the Bernoulli distribution can be approximated by the normal for large samples.

Many asymptotic expansions for the expected error rate have been formulated, including those of Okamoto (1963, 1968), Anderson (1973), Efron (1975), Sayre (1980), Schervish (1981), Raudys (1972), Deev (1972) and Kharin (1984). These asymptotic expansions are typically functions of the training sample sizes n₁ and n₂, the dimension p of the observation vector X, and the Mahalanobis distance between the two populations. Wyman, Young and Turner (1990) made a comprehensive investigation into the relative accuracy of these asymptotic error rate expansions using data generated from the multivariate normal distribution; they concluded that the best asymptotic expansion was that of Raudys. In this paper we carry out the same comparison using data generated from the multivariate Bernoulli distribution. We assess seven asymptotic expansions in terms of their ability to approximate the expected probability of misclassification (EPMC), or unconditional probability of misclassification.
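The plug-in device used throughout the paper (compute D, evaluate Φ) is simple enough to state in a few lines; the following sketch makes it explicit, with the function names assumed for illustration.

```python
# Sketch: the plug-in error approximation of Section 3. D is the
# Mahalanobis distance of (2.7); Phi(-D/2 -/+ c/D) gives the two
# misclassification probabilities for a cut-off c (c = 0 for equal
# priors and equal costs).
import numpy as np
from scipy.stats import norm

def mahalanobis_D(P1, P2, Sigma):
    diff = P2 - P1
    return float(np.sqrt(diff @ np.linalg.inv(Sigma) @ diff))

def error_rates(D, c=0.0):
    p12 = norm.cdf(-D / 2 - c / D)   # P(1|2)
    p21 = norm.cdf(-D / 2 + c / D)   # P(2|1)
    return p21, p12
```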
The main objective of this paper is to compare seven asymptotic error rate expansions for the sample linear discriminant function, with the aim of determining how well the assignment rules perform. The asymptotic expansions being compared are as follows:

(1) Anderson's asymptotic expansion
(2) Deev's asymptotic expansion
(3) Efron's asymptotic expansion
(4) Raudys' asymptotic expansion
(5) Okamoto's asymptotic expansion
(6) Sayre's asymptotic expansion
(7) Kharin's asymptotic expansion

We present the seven asymptotic expansions of the expected probability of misclassification using the following notation throughout: Φ denotes the standard normal cumulative distribution function; φ denotes the standard normal density function; n₁ and n₂ denote the sizes of the samples drawn from populations π₁ and π₂, respectively; n represents the quantity n₁ + n₂, and N represents n − 2.

There has been much vigorous research in this area since the pioneering work of Fisher (1936), and it is not possible to cover this work exhaustively in this study. Our interest is to provide an overview of the main ideas, and to supply sufficient working detail that the user can put the ideas into practice in order to minimize the expected cost of misclassification and to choose an appropriate asymptotic expansion for any given population structure.

4. Seven Asymptotic Expansions for the Expected Probability of Misclassification

4.1 Okamoto's Asymptotic Expansion

Probably the best-known asymptotic expansion of the expected total error rate of the sample linear discriminant function is due to Okamoto (1963). The derivation of this expansion utilizes the studentization method (Hartley 1938; Welch 1947), in which Okamoto applies a Taylor series expansion to the characteristic function of the studentized sample linear discriminant function. The Mahalanobis squared distance D² between the two populations π₁ and π₂ is defined by

D² = (μ₁ − μ₂)′ Σ⁻¹ (μ₁ − μ₂).   … (3.7)

As n₁, n₂ and n tend to infinity, X̄₁ → μ₁, X̄₂ → μ₂ and S → Σ in probability, and hence the limiting distribution of the classification statistic W = [X − ½(X̄₁ + X̄₂)]′ S⁻¹ (X̄₁ − X̄₂) (studied further in Section 4.3) is N(½D², D²) or N(−½D², D²) according as X comes from π₁ or from π₂. That is, for any real constant c,

p₁(c, D) = Pr(W ≤ ½D² + cD | π₁)   … (4.1)

is the probability that W ≤ ½D² + cD when X comes from π₁, and similarly

p₂(c, D) = Pr(W ≥ −½D² − cD | π₂);   … (4.2)

both p₁(c; D) and p₂(c; D) tend to Φ(c), the cumulative distribution function of N(0,1). Okamoto (1963, 1968) evaluated p₁(c; D) and p₂(c; D) in an asymptotic expansion with respect to n₁⁻¹, n₂⁻¹ and n⁻¹, where n = n₁ + n₂ − 2, and showed that p₂(c; D) can be derived from p₁(c; D). He stated his main result in the following theorem and corollaries.

Theorem. If D > 0, then

p₁(c; D) = [1 + L(∂; D) + Q(∂; D)] Φ(c) + O₃,   … (4.3)

where ∂ stands for the differential operator d/dc, Φ(c) for the c.d.f. of N(0,1), and

L(∂; D) = Σ_{i=1}^{3} L_i(∂; D).   … (4.4)
Q(∂; D) = ½ L²(∂; D) + Σ_{1≤i≤j≤3} Q_{ij}(∂; D),   … (4.5)

L₁(∂; D) = (2n₁D²)⁻¹ {∂⁴ + r(∂² + D∂)},   … (4.6)

L₂(∂; D) = (2n₂D²)⁻¹ {(∂² + D∂)² + r(∂² + D∂)},   … (4.7)

L₃(∂; D) = (4n)⁻¹ {2(∂² + D∂)² + (r − 1)(3∂² + D∂)},   … (4.8)

Q₁₁(∂; D) = (4n₁²D⁴)⁻¹ {2∂⁴(∂² + D∂)² + r(∂² + D∂)³},   … (4.9)

Q₂₂(∂; D) = (4n₂²D⁴)⁻¹ {2(∂² + D∂)³ + r(∂² + D∂)²},   … (4.10)

Q₁₂(∂; D) = (2n₁n₂D⁴)⁻¹ {2∂⁴(∂² + D∂)² + r∂⁴},   … (4.11)

Q₁₃(∂; D) = (2n₁nD²)⁻¹ {2∂²(∂² + D∂)² + (5r + 7)∂⁴ + D²∂² + r(2r + 3)(∂² + D∂)},   … (4.12)

Q₂₃(∂; D) = (2n₂nD²)⁻¹ {2(∂² + D∂)³ + (5r + 7)∂⁴ + (3r + 4)D²∂² + r(2r + 3)(∂² + D∂)},   … (4.13)

Q₃₃(∂; D) = (12n²)⁻¹ {2(∂² + D∂)²(7∂² + 2D∂) + 3(29r + 55)∂⁴ + 12(5r + 9)D∂³ + 3(3r + 5)D²∂² + 6(6r² + 13r + 9)∂² + 6(r + 12)D∂},   … (4.14)

and finally O₃ stands for terms of the third order with respect to n₁⁻¹, n₂⁻¹ and n⁻¹.

Corollary 1. Let φ(c) be the density of N(0,1); then

p₁(c; D) = Φ(c) + φ(c) [ (2n₁D²)⁻¹ {3c − c³ + r(D − c)} + (2n₂D²)⁻¹ {3c − 2D − c³ + 2c²D − cD² + r(D − c)} + (4n)⁻¹ {2(3c − 2D − c³ + 2c²D − cD²) + (r − 1)(D − 3c)} ] + O₂.   … (4.15)

This corollary is proved by substituting the identities

∂Φ(c) = φ(c), ∂²Φ(c) = −c φ(c), ∂³Φ(c) = (c² − 1) φ(c), ∂⁴Φ(c) = (3c − c³) φ(c)

into the term L(∂; D)Φ(c) of the theorem.

In many situations the discrimination is performed in the following way: we regard an observed value of X as coming from π₁ or π₂ according as the observed value of W is positive or negative. For this procedure the error probabilities of the two kinds are given by

Corollary 2.

Pr(W < 0 | π₁) = Φ(−D/2) + a₁/n₁ + a₂/n₂ + a₃/n + b₁₁/n₁² + b₂₂/n₂² + b₁₂/(n₁n₂) + b₁₃/(n₁n) + b₂₃/(n₂n) + b₃₃/n² + O₃,   … (4.16)

Pr(W > 0 | π₂) = Φ(−D/2) + a₂/n₁ + a₁/n₂ + a₃/n + b₂₂/n₁² + b₁₁/n₂² + b₁₂/(n₁n₂) + b₂₃/(n₁n) + b₁₃/(n₂n) + b₃₃/n² + O₃,   … (4.17)

where

a₁ = (2D²)⁻¹ {d₀⁴ + 3r d₀²},   … (4.18)

a₂ = (2D²)⁻¹ {d₀⁴ + (r + 4) d₀²},   … (4.19)

a₃ = ½ (r − 1) d₀²,   … (4.20)

b₁₁ = (8D⁴)⁻¹ {d₀⁸ + 6(r + 2)d₀⁶ + (r² + 9r + 16)d₀⁴ + 20r(r + 2)d₀²},   … (4.21)

b₂₂ = (8D⁴)⁻¹ {d₀⁸ + 2(r + 10)d₀⁶ + (r² + 6r + 16)d₀⁴ + 4(r + 4)(r + 6)d₀²},   … (4.22)

b₁₂ = (4D⁴)⁻¹ {d₀⁸ + 2(r + 8)d₀⁶ + (3r² + 10r + 16)d₀⁴ + 12r(r + 6)d₀²},   … (4.23)

b₁₃ = (4D²)⁻¹ {(r + 1)d₀⁶ + 3(r + 4)d₀⁴ + 6(r + 4)d₀²},   … (4.24)

b₂₃ = (4D²)⁻¹ {(r + 1)d₀⁶ + (r + 8)d₀⁴ + 2(r + 4)d₀²},   … (4.25)

b₃₃ = ⅛ {(r − 1)(r + 1)d₀⁴ + 4r d₀²},   … (4.26)

and the d₀ⁱ are constants defined by

d₀ⁱ = [dⁱΦ(c)/dcⁱ]_{c = −D/2}, i = 2, 4, 6, 8.   … (4.27)

4.2 Sayre's Asymptotic Expansion

When the populations from which the samples are drawn are multivariate normal with means μ₁, μ₂ and common covariance matrix Σ, then for the fixed procedure δ* which assigns an observation X to π₁ when

Ŵ(X) = [X − ½(X̄₁ + X̄₂)]′ S⁻¹ (X̄₁ − X̄₂) > 0

and to π₂ otherwise, the conditional probabilities of misclassification, otherwise known as the actual error rates, are P₁(δ*) and P₂(δ*), where for i = 1, 2

Pᵢ(δ*) = Φ( (−1)ⁱ [μᵢ − ½(X̄₁ + X̄₂)]′ S⁻¹ (X̄₁ − X̄₂) / [(X̄₁ − X̄₂)′ S⁻¹ Σ S⁻¹ (X̄₁ − X̄₂)]^{1/2} ),   … (4.28)

where Φ(y) = ∫_{−∞}^{y} (2π)^{−1/2} exp(−t²/2) dt. For equal prior probabilities of an observation belonging to either π₁ or π₂, the average actual error rate is given by

R = ½ [P₁(δ*) + P₂(δ*)].   … (4.29)

McLachlan (1974) approximated the distributions of the actual error rates Pᵢ(δ*), i = 1, 2, and of the average actual error rate R by asymptotic expansions, and showed that Pᵢ(δ*) has a normal distribution if terms of the second order with respect to the reciprocals of the initial sample sizes are ignored, and that R has a normal distribution on ignoring only terms of the third order. The normality of R based on ignoring these third-order terms may be only approximate for moderate sample sizes, and the distribution may be better approximated by a linear combination of chi-squared variates.
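A short sketch of the first-order part of Okamoto's Corollary 2 follows, using the d₀ⁱ constants of (4.27), which can be computed exactly from derivatives of Φ via probabilists' Hermite polynomials. The a-coefficients are taken from (4.18)-(4.20) as reconstructed above, so a₂ and a₃ in particular should be treated as provisional; the function names are assumptions of this sketch.

```python
# Sketch: first-order Okamoto approximation
#   Pr(W < 0 | pi_1) ~ Phi(-D/2) + a1/n1 + a2/n2 + a3/n,  n = n1 + n2 - 2,
# with a1, a2, a3 per (4.18)-(4.20) above (a2, a3 provisional readings).
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from scipy.stats import norm

def d0(i, D):
    """d0^i = (d^i/dc^i) Phi(c) at c = -D/2, using
    d^i Phi/dc^i = (-1)^(i-1) He_(i-1)(c) phi(c)."""
    c = -D / 2
    He = hermeval(c, [0.0] * (i - 1) + [1.0])   # He_(i-1)(c)
    return (-1) ** (i - 1) * He * norm.pdf(c)

def okamoto_first_order(D, r, n1, n2):
    n = n1 + n2 - 2
    a1 = (d0(4, D) + 3 * r * d0(2, D)) / (2 * D**2)        # (4.18)
    a2 = (d0(4, D) + (r + 4) * d0(2, D)) / (2 * D**2)      # (4.19), provisional
    a3 = (r - 1) * d0(2, D) / 2                            # (4.20), provisional
    return norm.cdf(-D / 2) + a1 / n1 + a2 / n2 + a3 / n
```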
Sayre (1980) approximated the distribution and the moments of the actual error rates for univariate and multivariate models. In the multivariate normal situation, the limiting distribution of n(R − Φ(−D/2)), where D is the Mahalanobis distance between the populations, may be found by using the results of Efron (1975). Efron considered a mixture sampling scheme in which nᵢ/n has the limiting value qᵢ (where qᵢ, i = 1, 2, are the prior probabilities of an observation belonging to population πᵢ). Efron also alluded to the asymptotic formula for the distribution of R when n₁ and n₂ are fixed. Sayre (1980) presented the following explicit formulas for fixed sample sizes. The limiting distribution of n[R − ½Φ(−D/2 − k) − ½Φ(−D/2 + k)] is that of

½ φ(D/2 + k) { (n/n₁ + n/n₂) D⁻¹ χ²_{r−1} + (n/n₁ + n/n₂ + n²D²/(4n₁n₂)) D⁻¹ χ²₁ },   … (4.30)

where k = D⁻¹ ln(q₁/q₂) and χ²_{r−1} and χ²₁ are independent chi-squared variates with r − 1 and 1 degrees of freedom. An asymptotic mean and variance formula for R can therefore be written as follows:

E(R) = ½Φ(−D/2 − k) + ½Φ(−D/2 + k) + [φ(D/2 + k)/(2nD)] { (n/n₁ + n/n₂)(r − 1) + n/n₁ + n/n₂ + n²D²/(4n₁n₂) },   … (4.31)

V(R) = [φ²(D/2 + k)/(2n²D²)] { (n/n₁ + n/n₂)²(r − 1) + (n/n₁ + n/n₂ + n²D²/(4n₁n₂))² }.   … (4.32)

For the special case in which q₁ = q₂ = ½ and n₁ = n₂, the general results reduce to

½ φ(D/2) { (4/D) χ²_{r−1} + ((4 + D²)/D) χ²₁ }   … (4.33)

for the limiting distribution, and

E(R) = Φ(−D/2) + [φ(D/2)/(2nD)] {4(r − 1) + 4 + D²}   … (4.34)

and

V(R) = [φ²(D/2)/(2n²D²)] {16(r − 1) + (4 + D²)²}.   … (4.35)
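The special-case formulas (4.34)-(4.35), as reconstructed above, follow from (4.33) by applying E(χ²_k) = k and Var(χ²_k) = 2k, and are easy to evaluate; the following sketch (function name assumed) does so.

```python
# Sketch: Sayre's special-case mean and variance (4.34)-(4.35) for equal
# priors and n1 = n2 = n/2, as reconstructed above from the chi-square
# mixture (4.33). Here n is the total sample size.
from scipy.stats import norm

def sayre_mean_variance(D, r, n):
    phi = norm.pdf(D / 2)
    ER = norm.cdf(-D / 2) + phi * (4 * (r - 1) + 4 + D**2) / (2 * n * D)
    VR = phi**2 * (16 * (r - 1) + (4 + D**2) ** 2) / (2 * n**2 * D**2)
    return ER, VR
```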
4.3 Anderson's Asymptotic Expansion

The statistics X, X̄₁, X̄₂ and S are independently distributed according to N(μ, Σ), N(μ₁, (1/n₁)Σ), N(μ₂, (1/n₂)Σ) and W((1/n)Σ, n), respectively; here μ = E(X) and W((1/n)Σ, n) denotes the Wishart distribution with n degrees of freedom. An observation X may be classified by means of the classification statistic

W = [X − ½(X̄₁ + X̄₂)]′ S⁻¹ (X̄₁ − X̄₂),

where

X̄₁ = (1/n₁) Σ_{j=1}^{n₁} X₁ⱼ, X̄₂ = (1/n₂) Σ_{j=1}^{n₂} X₂ⱼ,

nS = Σ_{j=1}^{n₁} (X₁ⱼ − X̄₁)(X₁ⱼ − X̄₁)′ + Σ_{j=1}^{n₂} (X₂ⱼ − X̄₂)(X₂ⱼ − X̄₂)′, n = n₁ + n₂ − 2.

The distribution of W depends on the parameters μ₁, μ₂ and Σ only through the squared Mahalanobis distance Δ² = (μ₁ − μ₂)′ Σ⁻¹ (μ₁ − μ₂), which can be estimated by D² = (X̄₁ − X̄₂)′ S⁻¹ (X̄₁ − X̄₂); let a = D². The limiting distribution of W as n₁, n₂ → ∞ is normal with variance Δ² and mean ½Δ² if X is from N(μ₁, Σ), and mean −½Δ² if X is from N(μ₂, Σ).

Bowker and Sitgreaves (1961), for n₁ = n₂, and Okamoto (1963, 1968) gave asymptotic expansions of the distribution of (W − ½Δ²)/Δ for X coming from N(μ₁, Σ) and of (W + ½Δ²)/Δ for X coming from N(μ₂, Σ), to terms of order n⁻², when n₁, n₂ → ∞ and n₂/n₁ → k, a finite positive constant. In particular, Pr(W ≤ 0) was evaluated. The statistician who wants to classify X may take c to be a constant, perhaps 0, and accept the pair of misclassification probabilities that result. The asymptotic expansion of the distribution of (W ∓ ½Δ²)/Δ gives approximate evaluations of these probabilities, which are functions of the unknown parameters as well as of c. On the other hand, the statistician may want to determine the cut-off point c so as to adjust the probabilities of misclassification.

Since the limiting distributions of (W − ½Δ²)/Δ and (W + ½Δ²)/Δ are N(0, 1) when E(X) = μ₁ and E(X) = μ₂, respectively, a first approximation to the pair of misclassification probabilities is Φ(c/Δ − ½Δ) and Φ(−c/Δ − ½Δ), where Φ is the cumulative distribution function of the standard normal variate. Since a is an estimate of Δ², one might base the choice of c on the fact that the limiting distributions of (W − ½a)/a^{1/2} and (W + ½a)/a^{1/2} are N(0, 1) when E(X) = μ₁ and E(X) = μ₂, respectively.

Anderson (1973) derived asymptotic expansions of the distributions of (W − ½a)/a^{1/2} and (W + ½a)/a^{1/2} in these two cases, respectively. We write

(W − ½a)/a^{1/2} = (X − X̄₁)′ S⁻¹ (X̄₁ − X̄₂) / [(X̄₁ − X̄₂)′ S⁻¹ (X̄₁ − X̄₂)]^{1/2}.   … (4.36)

Then

Pr[(W − ½a)/a^{1/2} ≤ u] = Pr{ (X − X̄₁)′ S⁻¹ (X̄₁ − X̄₂) ≤ u [(X̄₁ − X̄₂)′ S⁻¹ (X̄₁ − X̄₂)]^{1/2} }.   … (4.37)

Since X has the distribution N(μ, Σ) independently of X̄₁, X̄₂ and S, the conditional distribution of (X̄₁ − X̄₂)′ S⁻¹ (X − μ) given X̄₁, X̄₂ and S is N(0, (X̄₁ − X̄₂)′ S⁻¹ Σ S⁻¹ (X̄₁ − X̄₂)), and

K = (X̄₁ − X̄₂)′ S⁻¹ (X − μ) / [(X̄₁ − X̄₂)′ S⁻¹ Σ S⁻¹ (X̄₁ − X̄₂)]^{1/2}   … (4.38)

has the distribution N(0, 1). Then (4.37) is

Pr[(W − ½a)/a^{1/2} ≤ u] = E Φ( { u [(X̄₁ − X̄₂)′ S⁻¹ (X̄₁ − X̄₂)]^{1/2} − (μ − X̄₁)′ S⁻¹ (X̄₁ − X̄₂) } / [(X̄₁ − X̄₂)′ S⁻¹ Σ S⁻¹ (X̄₁ − X̄₂)]^{1/2} ),   … (4.39)

where the expectation is with respect to X̄₁, X̄₂ and S. The distributions of W and a are invariant with respect to the transformations X* = AX + b, X*₁ⱼ = AX₁ⱼ + b (j = 1, …, n₁), X*₂ⱼ = AX₂ⱼ + b (j = 1, …, n₂), where A is nonsingular; the maximal invariant of the parameters under these transformations is the distance Δ. Anderson chose A and b to transform Σ to I, μ₁ − μ₂ to (Δ, 0, …, 0)′ = Δε₁, say, and μ₁ to 0; the case treated first is μ = μ₁ (the case μ = μ₂ is treated similarly). Let Y, Z and V be defined by

X̄₁ − X̄₂ = Δε₁ + n^{−1/2} Y,   … (4.40)

X̄₁ = n^{−1/2} Z, S = I + n^{−1/2} V.   … (4.41)

The joint distribution of (Y, Z) is normal with means zero and covariance matrix

[ (n/n₁ + n/n₂) I (n/n₁) I ]
[ (n/n₁) I (n/n₁) I ],   … (4.42)

and V is independent of (Y, Z). Then (4.39) becomes

Pr[(W − ½a)/a^{1/2} ≤ u] = E Φ( { u [(Δε₁ + n^{−1/2}Y)′ (I + n^{−1/2}V)⁻¹ (Δε₁ + n^{−1/2}Y)]^{1/2} + n^{−1/2} Z′ (I + n^{−1/2}V)⁻¹ (Δε₁ + n^{−1/2}Y) } / [(Δε₁ + n^{−1/2}Y)′ (I + n^{−1/2}V)⁻² (Δε₁ + n^{−1/2}Y)]^{1/2} ).   … (4.43)

We can write

(I + n^{−1/2}V)⁻¹ = I − n^{−1/2}V + n⁻¹V² − n^{−3/2}V³ + n⁻²V⁴ − n^{−5/2}V⁵ + …,
(I + n^{−1/2}V)⁻² = I − 2n^{−1/2}V + 3n⁻¹V² − 4n^{−3/2}V³ + 5n⁻²V⁴ − 6n^{−5/2}V⁵ + ….   … (4.44)

Then (as Taylor series expansions) we have

(Δε₁ + n^{−1/2}Y)′ (I + n^{−1/2}V)⁻¹ (Δε₁ + n^{−1/2}Y) = Δ² + n^{−1/2} (2Δ ε₁′Y − Δ² ε₁′Vε₁) + n⁻¹ (Y′Y − 2Δ ε₁′VY + Δ² ε₁′V²ε₁) + K₁ₙ(Y, Z, V),   … (4.45)

[(Δε₁ + n^{−1/2}Y)′ (I + n^{−1/2}V)⁻¹ (Δε₁ + n^{−1/2}Y)]^{1/2} = Δ + (2Δ)⁻¹ [n^{−1/2} (2Δ ε₁′Y − Δ² ε₁′Vε₁) + n⁻¹ (Y′Y − 2Δ ε₁′VY + Δ² ε₁′V²ε₁)] − (8Δ³)⁻¹ n⁻¹ (2Δ ε₁′Y − Δ² ε₁′Vε₁)² + K₂ₙ(Y, Z, V),   … (4.46)

n^{−1/2} Z′ (I + n^{−1/2}V)⁻¹ (Δε₁ + n^{−1/2}Y) = n^{−1/2} Δ ε₁′Z + n⁻¹ (Z′Y − Δ Z′Vε₁) + K₃ₙ(Y, Z, V),   … (4.47)

(Δε₁ + n^{−1/2}Y)′ (I + n^{−1/2}V)⁻² (Δε₁ + n^{−1/2}Y) = Δ² + n^{−1/2} (2Δ ε₁′Y − 2Δ² ε₁′Vε₁) + n⁻¹ (Y′Y − 4Δ ε₁′VY + 3Δ² ε₁′V²ε₁) + K₄ₙ(Y, Z, V),   … (4.48)

and the reciprocal of the square root of (4.48) has a corresponding expansion with remainder K₅ₙ(Y, Z, V). Here K_{jn}(Y, Z, V), j = 1, …, 5, is a remainder term consisting of n^{−3/2} times a homogeneous polynomial (not depending on n) of degree 3 in the elements of Y, Z and V, plus n⁻² times a homogeneous polynomial of degree 4, plus a remainder term which is O(n^{−5/2}) for fixed Y, Z and V. The argument of Φ in (4.43) is the product of the numerator expansion formed from (4.46) and (4.47) and the reciprocal square-root expansion of (4.48).
Collecting terms, the argument of Φ in (4.43) can be written as

u + n^{−1/2} C(Z, V) + n⁻¹ D(Y, Z, V) + K₇ₙ(Y, Z, V),

say, where, for the case μ = μ₁,

C(Z, V) = ε₁′Z + ½ u ε₁′Vε₁,

D(Y, Z, V) = u { Δ⁻¹ ε₁′VY − ε₁′V²ε₁ − Δ⁻¹ (ε₁′Vε₁)(ε₁′Y) + (7/8)(ε₁′Vε₁)² } + Δ⁻¹ Z′Y − Z′Vε₁ − Δ⁻¹ (ε₁′Z)(ε₁′Y) + (ε₁′Z)(ε₁′Vε₁),

and K₆ₙ(Y, Z, V) and K₇ₙ(Y, Z, V) have the same properties as the K_{jn}(Y, Z, V), j = 1, …, 5. A Taylor series expansion of Φ in (4.43) gives

Φ(u + n^{−1/2}C + n⁻¹D + K₇ₙ) = Φ(u) + φ(u) { n^{−1/2} C(Z, V) + n⁻¹ [D(Y, Z, V) − ½ u C²(Z, V)] } + n^{−3/2} K₈(Y, Z, V) + n⁻² K₉(Y, Z, V) + K₁₀ₙ(Y, Z, V),

where K₈(Y, Z, V) is a homogeneous polynomial (not depending on n, but depending on u) of degree 3 in the elements of Y, Z and V, K₉(Y, Z, V) is a polynomial of degree 4, and K₁₀ₙ(Y, Z, V) is a remainder term which is O(n^{−5/2}) for fixed Y, Z and V (and u).

Let Jₙ be the set of Y, Z and V such that |yᵢ| ≤ g(log n)^{1/2}, |zᵢ| ≤ g(log n)^{1/2}, i = 1, …, p, and |vᵢⱼ| ≤ g(2 log n)^{1/2}, i, j = 1, …, p, where g > 2(1 + k)/k^{1/2}. Anderson showed that Pr(Jₙ) = 1 − O(n⁻²). The difference between E Φ(·) and the integral of Φ(·) times the density of Y, Z and V over Jₙ is therefore O(n⁻²). In Jₙ, each element of Y, Z and V divided by n^{1/2} is less than a constant times (log n / n)^{1/2}. The part of each remainder K_{jn}(Y, Z, V), j = 1, …, 7, that is O(n^{−5/2}) for fixed Y, Z and V can be written as a homogeneous polynomial of degree 5 in the elements of Y, Z and V with coefficients possibly depending on Y, Z and V (by use of Taylor series with remainder), and each coefficient is bounded in Jₙ for sufficiently large n. The same holds for K₁₀ₙ(Y, Z, V); hence, in Jₙ,

|K₁₀ₙ(Y, Z, V)| < constant × (log n)^{5/2} n^{−5/2},

and the integral of this times the density of Y, Z and V over Jₙ is O(n⁻²). Since fourth-order absolute moments of Y, Z and V exist and are bounded, the integral of |K₉(Y, Z, V)| over Jₙ is bounded; hence the contribution of this term (with the factor n⁻²) is O(n⁻²). The differences between n^{−1/2} E C(Z, V), n⁻¹ E[D(Y, Z, V) − ½ u C²(Z, V)] and n^{−3/2} E K₈(Y, Z, V) and the corresponding integrals over Jₙ are likewise O(n⁻²). Thus

Pr[(W − ½a)/a^{1/2} ≤ u] = Φ(u) + φ(u) { n^{−1/2} E C(Z, V) + n⁻¹ E[D(Y, Z, V) − ½ u C²(Z, V)] } + O(n⁻²),

because the third-order moments of Y, Z and V are either 0 or O(n^{−1/2}). Since C(Z, V) is linear and homogeneous, E C(Z, V) = 0. Since (Y, Z) and V are independent,

E D(Y, Z, V) = −u E(ε₁′V²ε₁) + (7u/8) E(ε₁′Vε₁)² + Δ⁻¹ E(Z′Y) − Δ⁻¹ E(ε₁′Z ε₁′Y) = −¼(4r − 3) u + (n/n₁)(r − 1)/Δ,

since

E(ε₁′V²ε₁) = E(ε₁′VV′ε₁) = Σᵢ E v₁ᵢ² = E v₁₁² + Σ_{i=2}^{r} E v₁ᵢ² = 2 + (r − 1) = r + 1,
E(ε₁′Vε₁)² = E v₁₁² = 2, E(Z′Y) = r(n/n₁), E(ε₁′Z ε₁′Y) = n/n₁,

and

E C²(Z, V) = E(ε₁′Z)² + (u²/4) E(ε₁′Vε₁)² = n/n₁ + u²/2.

Replacing n/n₁ by its limit 1 + k and substituting, we have

Pr[(W − ½a)/a^{1/2} ≤ u] = Φ(u) + (φ(u)/n) { (1 + k)(r − 1)/Δ − [¼(4r − 3) + ½(1 + k)] u − ¼ u³ } + O(n⁻²)

when E(X) = μ₁. Interchanging n₁ and n₂ (so that n/n₂ → 1 + 1/k) gives

Pr[(W + ½a)/a^{1/2} ≤ v] = Φ(v) − (φ(v)/n) { (1 + 1/k)(r − 1)/Δ + [¼(4r − 3) + ½(1 + 1/k)] v + ¼ v³ } + O(n⁻²)

when E(X) = μ₂.
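Anderson's expansion, as reconstructed above, is simple to evaluate; a sketch follows. Treating the random cut-off −a^{1/2}/2 as −Δ/2 when approximating Pr(W < 0 | π₁) is an additional approximation made here, and the function name is assumed.

```python
# Sketch: Anderson's expansion as reconstructed above,
#   Pr[(W - a/2)/sqrt(a) <= u] ~ Phi(u) + (phi(u)/n) h(u),
# with h(u) = (1+k)(r-1)/Delta - [(4r-3)/4 + (1+k)/2] u - u**3/4,
# k = n2/n1 and n = n1 + n2 - 2; u = -Delta/2 approximates Pr(W < 0 | pi_1).
from scipy.stats import norm

def anderson_error(Delta, r, n1, n2):
    n, k = n1 + n2 - 2, n2 / n1
    u = -Delta / 2
    h = ((1 + k) * (r - 1) / Delta
         - ((4 * r - 3) / 4 + (1 + k) / 2) * u
         - u**3 / 4)
    return norm.cdf(u) + norm.pdf(u) * h / n
```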
4.4 Efron's Asymptotic Expansion

Efron (1975) derived an asymptotic expansion of the expected probability of misclassification using a geometric argument which utilizes differential gradients and tangent lines. His derivation assumes mixture sampling, so that the sample sizes n₁ and n₂ are stochastic. Efron's expansion may be written as

EPMC_Efr = q₁Φ(b₁) + q₂Φ(b₂) + (2n)⁻¹ { [q₁φ(b₁)b₁ + q₂φ(b₂)b₂] [½(1 + V²) + ¼D²] + [q₁φ(b₁) + q₂φ(b₂)] (r − 1)(q₁⁻¹ + q₂⁻¹)/(2D²) + [q₁φ(b₁) − q₂φ(b₂)] V(q₁⁻¹ − q₂⁻¹)/(2D) },

where

V = D⁻¹ log(q₁/q₂), b₁ = −½D − V, b₂ = −½D + V.

A unique facet of this study is the inclusion of three asymptotic expansions of the expected probability of misclassification (EPMC) derived by Russian investigators: Raudys (1972), Deev (1972) and Kharin (1984). Little mention is made of these investigators and their expansions in the statistical discrimination literature authored in the western hemisphere. All three derivations depend on the central limit theorem for dependent variables.

4.5 Raudys' Asymptotic Expansion

The asymptotic expansion of Raudys is given by

EPMC_Rau = q₁ Φ(−D/(2m)) + q₂ Φ(−D/(2m)), where m = [ (1 + 4r/(nD²)) n/(n − r) ]^{1/2}.

4.6 Deev's Asymptotic Expansion

Deev's asymptotic expansion is expressed as

EPMC_Deev = Φ(−½ d), where d = D (1 − r/n)^{1/2} / [1 + (1/n₁ + 1/n₂)(r − 1)/D²]^{1/2}.

4.7 Kharin's Asymptotic Expansion

The asymptotic expansion of the EPMC derived by Kharin is

EPMC_Khr = Φ(−D/2) + (2√(2π))⁻¹ exp(−D²/8) (r − 1)(1/n₁ + 1/n₂)(1/D + D/4).

5. Sampling Experiments and Results

We generated data from the Bernoulli distribution, with populations characterized by three to five variables. We used a minimum of three variables because research has shown that three variables are enough to allow classification procedures to differ from one another; it is for this reason that most researchers use populations with more than two variables in sampling experiments. It has also been observed that both the time and the cost of sampling increase disproportionately with additional variables, so that simulation with more than six variables becomes increasingly difficult.

We define a simulation experiment by the values assigned to the input means P₁ⱼ and P₂ⱼ. In this study we considered only mean structures characterized by a non-negative difference between P₂ⱼ and P₁ⱼ, and we selected this difference to be not more than 0.4. We used 22 population pairs giving rise to 225 configurations formed by specifying the values of the means P₁ⱼ and P₂ⱼ. The seven asymptotic expansions were evaluated at each of the 225 configurations of n, r and d, where n is the sample size, r is the number of variables, and d = P₂ⱼ − P₁ⱼ > 0, j = 1, 2, …, r. The configurations of n, r and d are all possible combinations of n = 40, 60, 100, 140, 200, 300, 400, 600, 700, 800, 900, 1000; r = 3, 4, 5; and d = 0.1, 0.2, 0.3, 0.4. The simulation experiments were implemented using the International Mathematics and Statistics Library (IMSL); the simulation program was written in Fortran 77 and converted to present-day Fortran 2008. The number of iterations used for each configuration of n, r and d was 1000. Seven population pairs are based on three variables, nine on four variables and six on five variables. At each of the 225 configurations, the asymptotic expansion that has the minimum variance is declared the "best" asymptotic expansion (see Tables 1, 2 and 3, respectively; a sketch of the protocol is given below).
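The paper's Fortran/IMSL code is not reproduced here; the following Python sketch of the Monte Carlo protocol of Section 5 makes the stated procedure (draw Bernoulli training samples, compute the sample Mahalanobis squared distance, plug into Φ) concrete. Independent Bernoulli components and the baseline mean 0.3 are assumptions of this sketch only.

```python
# Sketch: one configuration (n, r, d) of the Section 5 experiment.
# Repeat 1000 trials: draw training samples, compute the sample
# Mahalanobis squared distance, plug into Phi; report the mean error
# rate and its variance over the trials.
import numpy as np
from scipy.stats import norm

def run_configuration(n, r, d, trials=1000, seed=0):
    rng = np.random.default_rng(seed)
    P1 = np.full(r, 0.3)              # illustrative means; the paper varies these
    P2 = P1 + d                       # d = P2j - P1j for every j
    errors = []
    for _ in range(trials):
        X1 = (rng.random((n, r)) < P1).astype(float)
        X2 = (rng.random((n, r)) < P2).astype(float)
        p1, p2 = X1.mean(0), X2.mean(0)
        S = ((n - 1) * np.cov(X1, rowvar=False)
             + (n - 1) * np.cov(X2, rowvar=False)) / (2 * n - 2)
        diff = p2 - p1
        D2 = diff @ np.linalg.pinv(S) @ diff           # Mahalanobis squared distance
        errors.append(norm.cdf(-np.sqrt(max(D2, 1e-12)) / 2))
    errors = np.asarray(errors)
    return errors.mean(), errors.var(ddof=1)
```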
Table 1: Simulation results used to evaluate the asymptotic expansions for sample sizes nᵢ using three variables (optimum error rate 0.358).

nᵢ    Asymptotic Expansion   Mean Error Rate   Variance
70    Anderson               0.353064          0.001100
      Deev                   0.356192          0.001130
      Efron                  0.350147          0.001145
      Raudys                 0.360767          0.001272
      Okamoto                0.358585          0.001282
      Sayre                  0.358570          0.001282
      Kharin                 0.357558          0.001297
100   Anderson               0.353850          0.000790
      Deev                   0.356137          0.000806
      Efron                  0.351838          0.000812
      Okamoto                0.357690          0.000874
      Sayre                  0.357683          0.000874
      Kharin                 0.356981          0.000881
      Raudys                 0.359542          0.000884
150   Anderson               0.355908          0.000532
      Deev                   0.357493          0.000540
      Efron                  0.354595          0.000543
      Okamoto                0.358514          0.000570
      Sayre                  0.358511          0.000570
      Kharin                 0.358050          0.000573
      Raudys                 0.359914          0.000578

Table 2: Simulation results used to evaluate the asymptotic expansions for sample sizes nᵢ using four variables (optimum error rate 0.4117).

nᵢ    Asymptotic Expansion   Mean Error Rate   Variance
30    Anderson               0.376443          0.001918
      Deev                   0.383130          0.002013
      Efron                  0.370495          0.002116
      Raudys                 0.393423          0.002374
      Okamoto                0.394742          0.003008
      Sayre                  0.394671          0.003011
      Kharin                 0.392610          0.003082
50    Anderson               0.389262          0.001274
      Deev                   0.394024          0.001313
      Efron                  0.386203          0.001354
      Raudys                 0.402851          0.001536
      Okamoto                0.402318          0.001811
      Sayre                  0.402296          0.001812
      Kharin                 0.401194          0.001840
70    Anderson               0.396501          0.000939
      Deev                   0.400181          0.000958
      Efron                  0.394507          0.000981
      Raudys                 0.407831          0.001150
      Okamoto                0.406758          0.001282
      Sayre                  0.406747          0.001283
      Kharin                 0.406014          0.001298
100   Anderson               0.403963          0.000684
      Deev                   0.406717          0.000695
      Efron                  0.402699          0.000705
      Raudys                 0.413209          0.000803
      Okamoto                0.411713          0.000859
      Sayre                  0.411708          0.000859
      Kharin                 0.411232          0.000866

Table 3: Simulation results used to evaluate the asymptotic expansions for sample sizes nᵢ using five variables (optimum error rate 0.40437).

nᵢ    Asymptotic Expansion   Mean Error Rate   Variance
50    Anderson               0.376967          0.001178
      Deev                   0.381547          0.001217
      Efron                  0.373528          0.001250
      Raudys                 0.389332          0.001426
      Okamoto                0.387660          0.001558
      Sayre                  0.387635          0.001559
      Kharin                 0.386420          0.001583
70    Anderson               0.385002          0.000920
      Deev                   0.388535          0.000943
      Efron                  0.382755          0.000961
      Raudys                 0.395331          0.001092
      Okamoto                0.393439          0.001144
      Sayre                  0.393427          0.001144
      Kharin                 0.392617          0.001157
100   Anderson               0.390603          0.000703
      Deev                   0.393244          0.000716
      Efron                  0.389132          0.000725
      Raudys                 0.398774          0.000817
      Okamoto                0.396923          0.000833
      Sayre                  0.396923          0.000834
      Kharin                 0.396385          0.000841
150   Anderson               0.396781          0.000482
      Deev                   0.398640          0.000488
      Efron                  0.395872          0.000492
      Okamoto                0.401287          0.000543
      Sayre                  0.401285          0.000543
      Raudys                 0.402917          0.000545
      Kharin                 0.400946          0.000546

6. Discussion of Results, Conclusion and Recommendation

Two methods can be used to determine the best asymptotic expansion:

1. the asymptotic expansion that has the minimum variance for each configuration;
2. the difference between the optimum error rate and the expected error rate, the expansion with the minimum difference being declared the best for that configuration.

For the purposes of this paper we used the first option (both criteria are illustrated in the sketch below). Over the 225 configurations of the simulation experiments, Anderson's expansion was the best in terms of minimum variance.
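The two selection criteria just described amount to two different minimizations over the same per-configuration results; a small sketch (function names assumed) applied to the nᵢ = 70 rows of Table 1:

```python
# Sketch: the two selection criteria of Section 6.
def best_by_variance(results):
    # results: dict name -> (mean_error, variance); method 1
    return min(results, key=lambda name: results[name][1])

def best_by_gap(results, optimum):
    # method 2: smallest gap between mean error rate and optimum rate
    return min(results, key=lambda name: abs(results[name][0] - optimum))

table1_70 = {"Anderson": (0.353064, 0.001100), "Deev": (0.356192, 0.001130),
             "Efron": (0.350147, 0.001145), "Raudys": (0.360767, 0.001272),
             "Okamoto": (0.358585, 0.001282), "Sayre": (0.358570, 0.001282),
             "Kharin": (0.357558, 0.001297)}
print(best_by_variance(table1_70))            # -> Anderson
print(best_by_gap(table1_70, optimum=0.358))  # -> Kharin
```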
Efron's and Deev's expansions also performed better than the remaining asymptotic expansions; the worst expansion was that of Raudys. The simulations indicate that the expansions of the expected error rate were reasonably good for small to moderate sample sizes and excellent for large samples. For sample size 500, and for particular values of r and d, some asymptotic expansions produced the same variance. Future researchers should look at the second method of obtaining the best asymptotic expansion.

REFERENCES

Anderson, T.W. (1951). "Classification by multivariate analysis". Psychometrika 16, 31-50.

Anderson, T.W. (1973). "An asymptotic expansion of the distribution of the studentized classification statistic W". Ann. Statist. 1, 964-972.

Deev, A.D. (1972). "Asymptotic expansions of distributions of statistics of discriminant analysis W, M, W*". Stat. Metody Klassif., MGU, Moscow, 6-51.

Efron, B. (1975). "The efficiency of logistic regression compared to normal discriminant analysis". Journal of the American Statistical Association 70, 892-898.

Fisher, R.A. (1936). "The use of multiple measurements in taxonomic problems". Ann. Eugenics 7, 179-188.

John, S. (1961). "Errors in discrimination". Ann. Math. Statist. 32, 1125-1144.

Kharin, Y.S. (1984). "The investigation of risk for statistical classifiers using minimum contrast estimators". Theory Probab. Appl. 28, 623-630.

Lachenbruch, P. (1975). Discriminant Analysis. Hafner Press.

Okamoto, M. (1963). "An asymptotic expansion for the distribution of the linear discriminant function". Ann. Math. Statist. 34, 1286-1301.

Okamoto, M. (1968). "Correction to 'An asymptotic expansion for the distribution of the linear discriminant function'". Ann. Math. Statist. 39, 1358-1359.

Raudys, S. (1972). "On the amount of a priori information in designing the classification algorithm". Tech. Cybern. 4, 168-174.

Sayre, J.W. (1980). "The distribution of the actual error rate in linear discriminant analysis". J. Amer. Statist. Assoc. 75, 201-205.

Schervish, M.J. (1981). "Asymptotic expansions of the means and variances of error rates". Biometrika 68, 295-299.

Wyman, F.J., Young, D.M. and Turner, D.W. (1990). "A comparison of asymptotic error rate expansions for the sample linear discriminant function". Pattern Recognition 23(7), 775-783.