Canadian Journal on Computing in Mathematics, Natural Sciences, Engineering and Medicine, Vol. 4, No. 1, February 2013

Comparison of Seven Asymptotic Error Rate Expansions for the Sample Linear Discriminant Function

D. D. Ekezie, S. I. Onyeagu

Abstract — Seven asymptotic error rate expansions for the sample linear discriminant function were considered and compared using binary variables. A simulation experiment was carried out to compare the performance of these rules. In all, 22 population pairs, giving rise to 225 configurations, were formed. At each of the 225 configurations, the asymptotic expansion whose error estimate had the minimum variance after 1000 repeated trials was declared the best. Over the 225 configurations of the simulation experiments, Anderson's asymptotic expansion was the best in terms of minimum variance.

Key Words: Fisher's Linear Discriminant Function, Mahalanobis Distance, Asymptotic Error Rate Expansions.

1. Introduction

Discrimination and classification deal with problems of differentiating between two or more populations on the basis of multivariate measurements. In discrimination, we are given the existence of two populations and a sample of individuals from each. The problem is to set up a rule, based on measurements from these individuals, which will enable us to assign a new individual to the correct population when we do not know from which of the two it emanates. In classification, we are given a sample of individuals, or the whole population, and the problem is to sort them into groups that are as distinct as possible. For example, given a population of unknown origin, we may wish to see whether its members fall into natural classes, "natural" in the sense that the members of a group resemble one another closely while the members of one group differ considerably from those of another.

The classical linear statistical discrimination problem may be described as follows. Suppose that each member of the union of two populations possesses a finite set of common characteristics or features, denoted by F = (f₁, …, f_p), whose observed values are X = (x₁, x₂, …, x_p), where x_j is the observed value of the characteristic f_j, j = 1, 2, …, p. Let π₁ and π₂ denote two distinct populations whose known multivariate probability mass functions are multivariate Bernoulli with mean vectors P₁ and P₂ such that P₁ ≠ P₂, and with common covariance matrix Σ₁ = Σ₂ = Σ. Also, let q₁ and q₂ be the known a priori probabilities that an individual is selected from π₁ or π₂, respectively. Let c(i|j) be the cost of misclassifying an individual from π_j into π_i, where

c(i|j) > 0 for i ≠ j, c(i|j) = 0 for i = j, i, j = 1, 2.

Then, given a p × 1 observation vector X on an individual selected at random from the union of the populations π₁ and π₂, the statistical discrimination problem is to formulate a decision rule which classifies the individual into one of the populations and optimizes some criterion which measures performance accuracy. A particular value of X, denoted by x = (x₁, x₂, …, x_p), is called a response pattern. A response pattern is a series of zeros and ones, and the probability that response pattern x is observed in the i-th population (i = 1, 2) will be denoted by π_i(x). The linear discriminant function was developed by Fisher (1936), who applied the criterion of finding the linear transformation which maximizes the difference between the group means of the transformed variable relative to its common dispersion.
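As a concrete illustration of this setup, the following minimal Python sketch (not from the paper's own programs) draws response patterns from two multivariate Bernoulli populations and estimates the expected misclassification cost of an arbitrary rule. For brevity the binary features are taken to be independent within each population, so the mean vectors P₁ and P₂ fully specify the populations; the independence, the numeric values and all function names here are illustrative assumptions.

```python
# Illustrative sketch of the two-population setup; independent Bernoulli
# components are an assumption made here only to keep the example short.
import numpy as np

rng = np.random.default_rng(0)
P1 = np.array([0.3, 0.4, 0.5])        # mean vector of population pi_1
P2 = np.array([0.5, 0.6, 0.7])        # mean vector of population pi_2
q1, q2 = 0.5, 0.5                     # a priori probabilities
cost = {(1, 2): 1.0, (2, 1): 1.0}     # c(i|j): cost of putting a pi_j item into pi_i

def draw(n, P):
    """Draw n response patterns (rows of zeros and ones) from a population."""
    return (rng.random((n, len(P))) < P).astype(int)

def expected_cost(classify, trials=10000):
    """Monte Carlo estimate of the expected cost of a rule `classify`
    (a function returning 1 or 2) under the mixture q1*pi_1 + q2*pi_2."""
    total = 0.0
    for _ in range(trials):
        j = 1 if rng.random() < q1 else 2
        x = draw(1, P1 if j == 1 else P2)[0]
        i = classify(x)
        if i != j:
            total += cost[(i, j)]
    return total / trials
```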
2. Fisher's Linear Discriminant Function (LDF)

Fisher's Linear Discriminant Function (LDF) for binary variables is given by

L(x) = Σ_j Σ_k (P₂ⱼ − P₁ⱼ) σ^{kj} x_k − ½ Σ_j Σ_k (P₂ⱼ − P₁ⱼ) σ^{kj} (P₂ₖ + P₁ₖ),   … (2.1)

where the σ^{kj} are the elements of the inverse of the pooled covariance matrix for the two populations. Typically, the parameters of the two underlying distributions are unknown and must be estimated using samples of sizes n₁ and n₂ from π₁ and π₂, respectively. Unbiased estimates of the P_{ij} are given by

P̂_{ij} = Σ_{x ∈ S_j} n_i(x) / n_i,   … (2.2)

where S_j is the set of all patterns x with x_j = 1 and n_i(x) is the number of occurrences of pattern x in the sample from π_i. Therefore, the sample-based linear discriminant function for binary variables is given by

L̂(x) = Σ_j Σ_k (p₂ⱼ − p₁ⱼ) S^{kj} x_k − ½ Σ_j Σ_k (p₂ⱼ − p₁ⱼ) S^{kj} (p₂ₖ + p₁ₖ),   … (2.3)

where the S^{kj} are the elements of the inverse of the pooled sample variance-covariance matrix. The allocation rule for Fisher's LDF is the following: classify a new item with response pattern x into π₂ if

L̂(x) ≥ c,   … (2.4)

and into π₁ otherwise, where c is the constant defined as

c = log_e [ q₁ c(2|1) / (q₂ c(1|2)) ].   … (2.5)

For any classification rule, its associated error rates are often the criteria by which the classification performance is evaluated. In the two-population discrimination problem there are two possible misclassifications: a rule may classify an observation actually from π₂ into π₁, or it may classify an observation from π₁ into π₂. The respective probabilities of misclassification are denoted by P(1|2) and P(2|1). The overall probability of misclassification, or total error rate, is expressed as

q₁ P(2|1) + q₂ P(1|2).   … (2.6)

The error rate is easily calculated when the populations are characterized by multivariate normal densities with known parameters; in the case c = 0 it is equal to Φ(−D/2), where Φ is the standard normal cumulative distribution function and D is the Mahalanobis distance, defined by

D² = Σ_j Σ_k (P₂ⱼ − P₁ⱼ) σ^{kj} (P₂ₖ − P₁ₖ).   … (2.7)

However, when the population parameters are unknown and must be estimated, calculation of the exact overall expected error rate of Fisher's LDF becomes virtually intractable. In an attempt to remedy this problem, investigators have derived asymptotic expansions for the overall expected error rate of the sample LDF. In this paper we generate data from the multivariate Bernoulli distribution, use the data to compute the Mahalanobis squared distance, and plug the values into the standard normal cumulative distribution function Φ. We consider sample sizes n₁ = n₂ = 30 and above, since the Bernoulli distribution can be approximated by the normal for large samples.
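A minimal sketch of the sample LDF of (2.3) and the allocation rule (2.4)-(2.5) is given below. The use of a pseudo-inverse is a practical safeguard assumed here (with binary data the pooled matrix can be singular in small samples), not part of the paper's formulas, and the function names are illustrative.

```python
# Sketch: fit the sample LDF (2.3) from two training samples and return
# the allocation rule (2.4) with the cut-off constant c of (2.5).
import numpy as np

def fit_sample_ldf(X1, X2, q1=0.5, q2=0.5, c12=1.0, c21=1.0):
    n1, n2 = len(X1), len(X2)
    p1, p2 = X1.mean(axis=0), X2.mean(axis=0)          # estimates of P1j, P2j as in (2.2)
    Spooled = ((n1 - 1) * np.cov(X1, rowvar=False)
               + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    Sinv = np.linalg.pinv(Spooled)                     # elements S^{kj} of (2.3)
    w = Sinv @ (p2 - p1)                               # discriminant coefficients
    const = 0.5 * (p2 - p1) @ Sinv @ (p2 + p1)         # second term of (2.3)
    cutoff = np.log((q1 * c21) / (q2 * c12))           # constant c of (2.5)

    def classify(x):
        L = w @ x - const
        return 2 if L >= cutoff else 1                 # rule (2.4)
    return classify
```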
3. Statement of the Problem

For any classification rule, its associated error rates are often the criteria by which the classification performance is evaluated. In the two-population discrimination problem there are two possible misclassifications, with probabilities P(1|2) and P(2|1) and total error rate q₁P(2|1) + q₂P(1|2), as in (2.6). When the populations are characterized by multivariate normal densities with known parameters and the cut-off constant is c = log_e[q₁c(2|1)/(q₂c(1|2))], the two error rates are

P(1|2) = Φ(−D/2 − c/D) and P(2|1) = Φ(−D/2 + c/D),

where Φ is the standard normal cumulative distribution function and D is the Mahalanobis distance of (2.7). However, when the population parameters are unknown and must be estimated, calculation of the exact overall expected error rate of Fisher's linear discriminant function becomes virtually intractable, and investigators have instead derived asymptotic expansions for the overall expected error rate of the sample linear discriminant function.

Most statistical methods developed for estimation, hypothesis testing and confidence statements are based upon an exact specification of the populations of the response variates. In the applied sciences another kind of multivariate problem often occurs, in which an observation must be assigned in some optimum way to one of several populations. Suppose our population consists of two groups, π₁ and π₂. We observe a p × 1 vector X and must assign the individual whose measurements are given by X to π₁ or π₂; we need a rule for doing so. If the parameters of the distribution of X in π₁ and π₂ are known, we may use this knowledge in the construction of an assignment rule. If not, we use samples of sizes n₁ from π₁ and n₂ from π₂ to estimate the parameters. We also need a criterion of goodness of classification. Fisher (1936) suggested using a linear combination of the observations, choosing the coefficients so that the ratio of the difference of the means of the linear combination in the two groups to its variance is maximized. Welch (1939) suggested that minimizing the total probability of misclassification would be a sensible idea. Von Mises (1945) suggested minimizing the maximum probability of misclassification in the two groups. Therefore, when confronted with the problem of classifying an object of unknown origin with measurement vector X = (X₁, X₂, …, X_p)′ into π₁ or π₂, how do we choose the "best" rule so that the expected cost associated with misclassification is minimized?

In this work, we generate data from the multivariate Bernoulli distribution, use the data to compute the Mahalanobis squared distance, and plug the values into the standard normal cumulative distribution function Φ. We consider sample sizes n₁ = n₂ = 30 and above, since the Bernoulli distribution can be approximated by the normal for large samples.

Many asymptotic expansions for the expected error rate have been formulated, including those of Okamoto (1963, 1968), Anderson (1973), Efron (1975), Sayre (1980), Schervish (1981), Raudys (1972), Deev (1972) and Kharin (1984). These asymptotic expansions are typically functions of the training sample sizes n₁ and n₂, the dimension p of the observation vector X, and the Mahalanobis distance between the two populations. Wyman, Young and Turner (1990) made a comprehensive investigation into the relative accuracy of these asymptotic error rate expansions using data generated from the multivariate normal distribution; they concluded that the best asymptotic expansion was that of Raudys. In this paper we carry out the same comparison using data generated from the multivariate Bernoulli distribution. We assess seven asymptotic expansions in terms of their ability to approximate the expected probability of misclassification (EPMC), or unconditional probability of misclassification.
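The plug-in device used throughout the paper (compute D, evaluate Φ) is simple enough to state in a few lines; the following sketch makes it explicit, with the function names assumed for illustration.

```python
# Sketch: the plug-in error approximation of Section 3. D is the
# Mahalanobis distance of (2.7); Phi(-D/2 -/+ c/D) gives the two
# misclassification probabilities for a cut-off c (c = 0 for equal
# priors and equal costs).
import numpy as np
from scipy.stats import norm

def mahalanobis_D(P1, P2, Sigma):
    diff = P2 - P1
    return float(np.sqrt(diff @ np.linalg.inv(Sigma) @ diff))

def error_rates(D, c=0.0):
    p12 = norm.cdf(-D / 2 - c / D)   # P(1|2)
    p21 = norm.cdf(-D / 2 + c / D)   # P(2|1)
    return p21, p12
```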
The main objective of this paper is to compare seven asymptotic error rate expansions for the sample linear discriminant function, with the aim of determining how well the assignment rules perform. The asymptotic expansions being compared are as follows:

(1) Anderson's asymptotic expansion
(2) Deev's asymptotic expansion
(3) Efron's asymptotic expansion
(4) Raudys' asymptotic expansion
(5) Okamoto's asymptotic expansion
(6) Sayre's asymptotic expansion
(7) Kharin's asymptotic expansion

We present the seven asymptotic expansions of the expected probability of misclassification using the following notation throughout: Φ denotes the standard normal cumulative distribution function; φ denotes the standard normal density function; n₁ and n₂ denote the sizes of the samples drawn from populations π₁ and π₂, respectively; n represents the quantity n₁ + n₂, and N represents n − 2.

There has been much vigorous research in this area since the pioneering work of Fisher (1936), and it is not possible to cover this work exhaustively in this study. Our interest is to provide an overview of the main ideas, and to supply sufficient working detail that the user can put the ideas into practice in order to minimize the expected cost of misclassification and to choose an appropriate asymptotic expansion for any given population structure.

4. Seven Asymptotic Expansions for the Expected Probability of Misclassification

4.1 Okamoto's Asymptotic Expansion

Probably the best-known asymptotic expansion of the expected total error rate of the sample linear discriminant function is due to Okamoto (1963). The derivation of this expansion utilizes the studentization method (Hartley 1938; Welch 1947), in which Okamoto applies a Taylor series expansion to the characteristic function of the studentized sample linear discriminant function. The Mahalanobis squared distance D² between the two populations π₁ and π₂ is defined by

D² = (μ₁ − μ₂)′ Σ⁻¹ (μ₁ − μ₂).   … (3.7)

As n₁, n₂ and n tend to infinity, X̄₁ → μ₁, X̄₂ → μ₂ and S → Σ in probability, and hence the limiting distribution of the classification statistic W = [X − ½(X̄₁ + X̄₂)]′ S⁻¹ (X̄₁ − X̄₂) (studied further in Section 4.3) is N(½D², D²) or N(−½D², D²) according as X comes from π₁ or from π₂. That is, for any real constant c,

p₁(c, D) = Pr(W ≤ ½D² + cD | π₁)   … (4.1)

is the probability that W ≤ ½D² + cD when X comes from π₁, and similarly

p₂(c, D) = Pr(W ≥ −½D² − cD | π₂);   … (4.2)

both p₁(c; D) and p₂(c; D) tend to Φ(c), the cumulative distribution function of N(0,1). Okamoto (1963, 1968) evaluated p₁(c; D) and p₂(c; D) in an asymptotic expansion with respect to n₁⁻¹, n₂⁻¹ and n⁻¹, where n = n₁ + n₂ − 2, and showed that p₂(c; D) can be derived from p₁(c; D). He stated his main result in the following theorem and corollaries.

Theorem. If D > 0, then

p₁(c; D) = [1 + L(∂; D) + Q(∂; D)] Φ(c) + O₃,   … (4.3)

where ∂ stands for the differential operator d/dc, Φ(c) for the c.d.f. of N(0,1), and

L(∂; D) = Σ_{i=1}^{3} L_i(∂; D).   … (4.4)
Q(∂; D) = ½ L²(∂; D) + Σ_{1≤i≤j≤3} Q_{ij}(∂; D),   … (4.5)

L₁(∂; D) = (2n₁D²)⁻¹ {∂⁴ + r(∂² + D∂)},   … (4.6)

L₂(∂; D) = (2n₂D²)⁻¹ {(∂² + D∂)² + r(∂² + D∂)},   … (4.7)

L₃(∂; D) = (4n)⁻¹ {2(∂² + D∂)² + (r − 1)(3∂² + D∂)},   … (4.8)

Q₁₁(∂; D) = (4n₁²D⁴)⁻¹ {2∂⁴(∂² + D∂)² + r(∂² + D∂)³},   … (4.9)

Q₂₂(∂; D) = (4n₂²D⁴)⁻¹ {2(∂² + D∂)³ + r(∂² + D∂)²},   … (4.10)

Q₁₂(∂; D) = (2n₁n₂D⁴)⁻¹ {2∂⁴(∂² + D∂)² + r∂⁴},   … (4.11)

Q₁₃(∂; D) = (2n₁nD²)⁻¹ {2∂²(∂² + D∂)² + (5r + 7)∂⁴ + D²∂² + r(2r + 3)(∂² + D∂)},   … (4.12)

Q₂₃(∂; D) = (2n₂nD²)⁻¹ {2(∂² + D∂)³ + (5r + 7)∂⁴ + (3r + 4)D²∂² + r(2r + 3)(∂² + D∂)},   … (4.13)

Q₃₃(∂; D) = (12n²)⁻¹ {2(∂² + D∂)²(7∂² + 2D∂) + 3(29r + 55)∂⁴ + 12(5r + 9)D∂³ + 3(3r + 5)D²∂² + 6(6r² + 13r + 9)∂² + 6(r + 12)D∂},   … (4.14)

and finally O₃ stands for terms of the third order with respect to n₁⁻¹, n₂⁻¹ and n⁻¹.

Corollary 1. Let φ(c) be the density of N(0,1); then

p₁(c; D) = Φ(c) + φ(c) [ (2n₁D²)⁻¹ {3c − c³ + r(D − c)} + (2n₂D²)⁻¹ {3c − 2D − c³ + 2c²D − cD² + r(D − c)} + (4n)⁻¹ {2(3c − 2D − c³ + 2c²D − cD²) + (r − 1)(D − 3c)} ] + O₂.   … (4.15)

This corollary is proved by substituting the identities

∂Φ(c) = φ(c), ∂²Φ(c) = −c φ(c), ∂³Φ(c) = (c² − 1) φ(c), ∂⁴Φ(c) = (3c − c³) φ(c)

into the term L(∂; D)Φ(c) of the theorem.

In many situations the discrimination is performed in the following way: we regard an observed value of X as coming from π₁ or π₂ according as the observed value of W is positive or negative. For this procedure the error probabilities of the two kinds are given by

Corollary 2.

Pr(W < 0 | π₁) = Φ(−D/2) + a₁/n₁ + a₂/n₂ + a₃/n + b₁₁/n₁² + b₂₂/n₂² + b₁₂/(n₁n₂) + b₁₃/(n₁n) + b₂₃/(n₂n) + b₃₃/n² + O₃,   … (4.16)

Pr(W > 0 | π₂) = Φ(−D/2) + a₂/n₁ + a₁/n₂ + a₃/n + b₂₂/n₁² + b₁₁/n₂² + b₁₂/(n₁n₂) + b₂₃/(n₁n) + b₁₃/(n₂n) + b₃₃/n² + O₃,   … (4.17)

where

a₁ = (2D²)⁻¹ {d₀⁴ + 3r d₀²},   … (4.18)

a₂ = (2D²)⁻¹ {d₀⁴ + (r + 4) d₀²},   … (4.19)

a₃ = ½ (r − 1) d₀²,   … (4.20)

b₁₁ = (8D⁴)⁻¹ {d₀⁸ + 6(r + 2)d₀⁶ + (r² + 9r + 16)d₀⁴ + 20r(r + 2)d₀²},   … (4.21)

b₂₂ = (8D⁴)⁻¹ {d₀⁸ + 2(r + 10)d₀⁶ + (r² + 6r + 16)d₀⁴ + 4(r + 4)(r + 6)d₀²},   … (4.22)

b₁₂ = (4D⁴)⁻¹ {d₀⁸ + 2(r + 8)d₀⁶ + (3r² + 10r + 16)d₀⁴ + 12r(r + 6)d₀²},   … (4.23)

b₁₃ = (4D²)⁻¹ {(r + 1)d₀⁶ + 3(r + 4)d₀⁴ + 6(r + 4)d₀²},   … (4.24)

b₂₃ = (4D²)⁻¹ {(r + 1)d₀⁶ + (r + 8)d₀⁴ + 2(r + 4)d₀²},   … (4.25)

b₃₃ = ⅛ {(r − 1)(r + 1)d₀⁴ + 4r d₀²},   … (4.26)

and the d₀ⁱ are constants defined by

d₀ⁱ = [dⁱΦ(c)/dcⁱ]_{c = −D/2}, i = 2, 4, 6, 8.   … (4.27)

4.2 Sayre's Asymptotic Expansion

When the populations from which the samples are drawn are multivariate normal with means μ₁, μ₂ and common covariance matrix Σ, then for the fixed procedure δ* which assigns an observation X to π₁ when

Ŵ(X) = [X − ½(X̄₁ + X̄₂)]′ S⁻¹ (X̄₁ − X̄₂) > 0

and to π₂ otherwise, the conditional probabilities of misclassification, otherwise known as the actual error rates, are P₁(δ*) and P₂(δ*), where for i = 1, 2

Pᵢ(δ*) = Φ( (−1)ⁱ [μᵢ − ½(X̄₁ + X̄₂)]′ S⁻¹ (X̄₁ − X̄₂) / [(X̄₁ − X̄₂)′ S⁻¹ Σ S⁻¹ (X̄₁ − X̄₂)]^{1/2} ),   … (4.28)

where Φ(y) = ∫_{−∞}^{y} (2π)^{−1/2} exp(−t²/2) dt. For equal prior probabilities of an observation belonging to either π₁ or π₂, the average actual error rate is given by

R = ½ [P₁(δ*) + P₂(δ*)].   … (4.29)

McLachlan (1974) approximated the distributions of the actual error rates Pᵢ(δ*), i = 1, 2, and of the average actual error rate R by asymptotic expansions, and showed that Pᵢ(δ*) has a normal distribution if terms of the second order with respect to the reciprocals of the initial sample sizes are ignored, and that R has a normal distribution on ignoring only terms of the third order. The normality of R based on ignoring these third-order terms may be only approximate for moderate sample sizes, and the distribution may be better approximated by a linear combination of chi-squared variates.
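A short sketch of the first-order part of Okamoto's Corollary 2 follows, using the d₀ⁱ constants of (4.27), which can be computed exactly from derivatives of Φ via probabilists' Hermite polynomials. The a-coefficients are taken from (4.18)-(4.20) as reconstructed above, so a₂ and a₃ in particular should be treated as provisional; the function names are assumptions of this sketch.

```python
# Sketch: first-order Okamoto approximation
#   Pr(W < 0 | pi_1) ~ Phi(-D/2) + a1/n1 + a2/n2 + a3/n,  n = n1 + n2 - 2,
# with a1, a2, a3 per (4.18)-(4.20) above (a2, a3 provisional readings).
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from scipy.stats import norm

def d0(i, D):
    """d0^i = (d^i/dc^i) Phi(c) at c = -D/2, using
    d^i Phi/dc^i = (-1)^(i-1) He_(i-1)(c) phi(c)."""
    c = -D / 2
    He = hermeval(c, [0.0] * (i - 1) + [1.0])   # He_(i-1)(c)
    return (-1) ** (i - 1) * He * norm.pdf(c)

def okamoto_first_order(D, r, n1, n2):
    n = n1 + n2 - 2
    a1 = (d0(4, D) + 3 * r * d0(2, D)) / (2 * D**2)        # (4.18)
    a2 = (d0(4, D) + (r + 4) * d0(2, D)) / (2 * D**2)      # (4.19), provisional
    a3 = (r - 1) * d0(2, D) / 2                            # (4.20), provisional
    return norm.cdf(-D / 2) + a1 / n1 + a2 / n2 + a3 / n
```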
Sayre (1980) approximated the distribution and the moments of the actual error rates for univariate and multivariate models. In the multivariate normal situation, the limiting distribution of n(R − Φ(−D/2)), where D is the Mahalanobis distance between the populations, may be found by using the results of Efron (1975). Efron considered a mixture sampling scheme in which nᵢ/n has the limiting value qᵢ (where qᵢ, i = 1, 2, are the prior probabilities of an observation belonging to population πᵢ). Efron also alluded to the asymptotic formula for the distribution of R when n₁ and n₂ are fixed. Sayre (1980) presented the following explicit formulas for fixed sample sizes. The limiting distribution of n[R − ½Φ(−D/2 − k) − ½Φ(−D/2 + k)] is that of

½ φ(D/2 + k) { (n/n₁ + n/n₂) D⁻¹ χ²_{r−1} + (n/n₁ + n/n₂ + n²D²/(4n₁n₂)) D⁻¹ χ²₁ },   … (4.30)

where k = D⁻¹ ln(q₁/q₂) and χ²_{r−1} and χ²₁ are independent chi-squared variates with r − 1 and 1 degrees of freedom. An asymptotic mean and variance formula for R can therefore be written as follows:

E(R) = ½Φ(−D/2 − k) + ½Φ(−D/2 + k) + [φ(D/2 + k)/(2nD)] { (n/n₁ + n/n₂)(r − 1) + n/n₁ + n/n₂ + n²D²/(4n₁n₂) },   … (4.31)

V(R) = [φ²(D/2 + k)/(2n²D²)] { (n/n₁ + n/n₂)²(r − 1) + (n/n₁ + n/n₂ + n²D²/(4n₁n₂))² }.   … (4.32)

For the special case in which q₁ = q₂ = ½ and n₁ = n₂, the general results reduce to

½ φ(D/2) { (4/D) χ²_{r−1} + ((4 + D²)/D) χ²₁ }   … (4.33)

for the limiting distribution, and

E(R) = Φ(−D/2) + [φ(D/2)/(2nD)] {4(r − 1) + 4 + D²}   … (4.34)

and

V(R) = [φ²(D/2)/(2n²D²)] {16(r − 1) + (4 + D²)²}.   … (4.35)
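The special-case formulas (4.34)-(4.35), as reconstructed above, follow from (4.33) by applying E(χ²_k) = k and Var(χ²_k) = 2k, and are easy to evaluate; the following sketch (function name assumed) does so.

```python
# Sketch: Sayre's special-case mean and variance (4.34)-(4.35) for equal
# priors and n1 = n2 = n/2, as reconstructed above from the chi-square
# mixture (4.33). Here n is the total sample size.
from scipy.stats import norm

def sayre_mean_variance(D, r, n):
    phi = norm.pdf(D / 2)
    ER = norm.cdf(-D / 2) + phi * (4 * (r - 1) + 4 + D**2) / (2 * n * D)
    VR = phi**2 * (16 * (r - 1) + (4 + D**2) ** 2) / (2 * n**2 * D**2)
    return ER, VR
```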
4.3 Anderson's Asymptotic Expansion

The statistics X, X̄₁, X̄₂ and S are independently distributed according to N(μ, Σ), N(μ₁, (1/n₁)Σ), N(μ₂, (1/n₂)Σ) and W((1/n)Σ, n), respectively; here μ = E(X) and W((1/n)Σ, n) denotes the Wishart distribution with n degrees of freedom. An observation X may be classified by means of the classification statistic

W = [X − ½(X̄₁ + X̄₂)]′ S⁻¹ (X̄₁ − X̄₂),

where

X̄₁ = (1/n₁) Σ_{j=1}^{n₁} X₁ⱼ, X̄₂ = (1/n₂) Σ_{j=1}^{n₂} X₂ⱼ,

nS = Σ_{j=1}^{n₁} (X₁ⱼ − X̄₁)(X₁ⱼ − X̄₁)′ + Σ_{j=1}^{n₂} (X₂ⱼ − X̄₂)(X₂ⱼ − X̄₂)′, n = n₁ + n₂ − 2.

The distribution of W depends on the parameters μ₁, μ₂ and Σ only through the squared Mahalanobis distance Δ² = (μ₁ − μ₂)′ Σ⁻¹ (μ₁ − μ₂), which can be estimated by D² = (X̄₁ − X̄₂)′ S⁻¹ (X̄₁ − X̄₂); let a = D². The limiting distribution of W as n₁, n₂ → ∞ is normal with variance Δ² and mean ½Δ² if X is from N(μ₁, Σ), and mean −½Δ² if X is from N(μ₂, Σ).

Bowker and Sitgreaves (1961), for n₁ = n₂, and Okamoto (1963, 1968) gave asymptotic expansions of the distribution of (W − ½Δ²)/Δ for X coming from N(μ₁, Σ) and of (W + ½Δ²)/Δ for X coming from N(μ₂, Σ), to terms of order n⁻², when n₁, n₂ → ∞ and n₂/n₁ → k, a finite positive constant. In particular, Pr(W ≤ 0) was evaluated. The statistician who wants to classify X may take c to be a constant, perhaps 0, and accept the pair of misclassification probabilities that result. The asymptotic expansion of the distribution of (W ∓ ½Δ²)/Δ gives approximate evaluations of these probabilities, which are functions of the unknown parameters as well as of c. On the other hand, the statistician may want to determine the cut-off point c so as to adjust the probabilities of misclassification.

Since the limiting distributions of (W − ½Δ²)/Δ and (W + ½Δ²)/Δ are N(0, 1) when E(X) = μ₁ and E(X) = μ₂, respectively, a first approximation to the pair of misclassification probabilities is Φ(c/Δ − ½Δ) and Φ(−c/Δ − ½Δ), where Φ is the cumulative distribution function of the standard normal variate. Since a is an estimate of Δ², one might base the choice of c on the fact that the limiting distributions of (W − ½a)/a^{1/2} and (W + ½a)/a^{1/2} are N(0, 1) when E(X) = μ₁ and E(X) = μ₂, respectively.

Anderson (1973) derived asymptotic expansions of the distributions of (W − ½a)/a^{1/2} and (W + ½a)/a^{1/2} in these two cases, respectively. We write

(W − ½a)/a^{1/2} = (X − X̄₁)′ S⁻¹ (X̄₁ − X̄₂) / [(X̄₁ − X̄₂)′ S⁻¹ (X̄₁ − X̄₂)]^{1/2}.   … (4.36)

Then

Pr[(W − ½a)/a^{1/2} ≤ u] = Pr{ (X − X̄₁)′ S⁻¹ (X̄₁ − X̄₂) ≤ u [(X̄₁ − X̄₂)′ S⁻¹ (X̄₁ − X̄₂)]^{1/2} }.   … (4.37)

Since X has the distribution N(μ, Σ) independently of X̄₁, X̄₂ and S, the conditional distribution of (X̄₁ − X̄₂)′ S⁻¹ (X − μ) given X̄₁, X̄₂ and S is N(0, (X̄₁ − X̄₂)′ S⁻¹ Σ S⁻¹ (X̄₁ − X̄₂)), and

K = (X̄₁ − X̄₂)′ S⁻¹ (X − μ) / [(X̄₁ − X̄₂)′ S⁻¹ Σ S⁻¹ (X̄₁ − X̄₂)]^{1/2}   … (4.38)

has the distribution N(0, 1). Then (4.37) is

Pr[(W − ½a)/a^{1/2} ≤ u] = E Φ( { u [(X̄₁ − X̄₂)′ S⁻¹ (X̄₁ − X̄₂)]^{1/2} − (μ − X̄₁)′ S⁻¹ (X̄₁ − X̄₂) } / [(X̄₁ − X̄₂)′ S⁻¹ Σ S⁻¹ (X̄₁ − X̄₂)]^{1/2} ),   … (4.39)

where the expectation is with respect to X̄₁, X̄₂ and S. The distributions of W and a are invariant with respect to the transformations X* = AX + b, X*₁ⱼ = AX₁ⱼ + b (j = 1, …, n₁), X*₂ⱼ = AX₂ⱼ + b (j = 1, …, n₂), where A is nonsingular; the maximal invariant of the parameters under these transformations is the distance Δ. Anderson chose A and b to transform Σ to I, μ₁ − μ₂ to (Δ, 0, …, 0)′ = Δε₁, say, and μ₁ to 0; the case treated first is μ = μ₁ (the case μ = μ₂ is treated similarly). Let Y, Z and V be defined by

X̄₁ − X̄₂ = Δε₁ + n^{−1/2} Y,   … (4.40)

X̄₁ = n^{−1/2} Z, S = I + n^{−1/2} V.   … (4.41)

The joint distribution of (Y, Z) is normal with means zero and covariance matrix

[ (n/n₁ + n/n₂) I (n/n₁) I ]
[ (n/n₁) I (n/n₁) I ],   … (4.42)

and V is independent of (Y, Z). Then (4.39) becomes

Pr[(W − ½a)/a^{1/2} ≤ u] = E Φ( { u [(Δε₁ + n^{−1/2}Y)′ (I + n^{−1/2}V)⁻¹ (Δε₁ + n^{−1/2}Y)]^{1/2} + n^{−1/2} Z′ (I + n^{−1/2}V)⁻¹ (Δε₁ + n^{−1/2}Y) } / [(Δε₁ + n^{−1/2}Y)′ (I + n^{−1/2}V)⁻² (Δε₁ + n^{−1/2}Y)]^{1/2} ).   … (4.43)

We can write

(I + n^{−1/2}V)⁻¹ = I − n^{−1/2}V + n⁻¹V² − n^{−3/2}V³ + n⁻²V⁴ − n^{−5/2}V⁵ + …,
(I + n^{−1/2}V)⁻² = I − 2n^{−1/2}V + 3n⁻¹V² − 4n^{−3/2}V³ + 5n⁻²V⁴ − 6n^{−5/2}V⁵ + ….   … (4.44)

Then (as Taylor series expansions) we have

(Δε₁ + n^{−1/2}Y)′ (I + n^{−1/2}V)⁻¹ (Δε₁ + n^{−1/2}Y) = Δ² + n^{−1/2} (2Δ ε₁′Y − Δ² ε₁′Vε₁) + n⁻¹ (Y′Y − 2Δ ε₁′VY + Δ² ε₁′V²ε₁) + K₁ₙ(Y, Z, V),   … (4.45)

[(Δε₁ + n^{−1/2}Y)′ (I + n^{−1/2}V)⁻¹ (Δε₁ + n^{−1/2}Y)]^{1/2} = Δ + (2Δ)⁻¹ [n^{−1/2} (2Δ ε₁′Y − Δ² ε₁′Vε₁) + n⁻¹ (Y′Y − 2Δ ε₁′VY + Δ² ε₁′V²ε₁)] − (8Δ³)⁻¹ n⁻¹ (2Δ ε₁′Y − Δ² ε₁′Vε₁)² + K₂ₙ(Y, Z, V),   … (4.46)

n^{−1/2} Z′ (I + n^{−1/2}V)⁻¹ (Δε₁ + n^{−1/2}Y) = n^{−1/2} Δ ε₁′Z + n⁻¹ (Z′Y − Δ Z′Vε₁) + K₃ₙ(Y, Z, V),   … (4.47)

(Δε₁ + n^{−1/2}Y)′ (I + n^{−1/2}V)⁻² (Δε₁ + n^{−1/2}Y) = Δ² + n^{−1/2} (2Δ ε₁′Y − 2Δ² ε₁′Vε₁) + n⁻¹ (Y′Y − 4Δ ε₁′VY + 3Δ² ε₁′V²ε₁) + K₄ₙ(Y, Z, V),   … (4.48)

and the reciprocal of the square root of (4.48) has a corresponding expansion with remainder K₅ₙ(Y, Z, V). Here K_{jn}(Y, Z, V), j = 1, …, 5, is a remainder term consisting of n^{−3/2} times a homogeneous polynomial (not depending on n) of degree 3 in the elements of Y, Z and V, plus n⁻² times a homogeneous polynomial of degree 4, plus a remainder term which is O(n^{−5/2}) for fixed Y, Z and V. The argument of Φ in (4.43) is the product of the numerator expansion formed from (4.46) and (4.47) and the reciprocal square-root expansion of (4.48).
Collecting terms, the argument of Φ in (4.43) can be written as

u + n^{−1/2} C(Z, V) + n⁻¹ D(Y, Z, V) + K₇ₙ(Y, Z, V),

say, where, for the case μ = μ₁,

C(Z, V) = ε₁′Z + ½ u ε₁′Vε₁,

D(Y, Z, V) = u { Δ⁻¹ ε₁′VY − ε₁′V²ε₁ − Δ⁻¹ (ε₁′Vε₁)(ε₁′Y) + (7/8)(ε₁′Vε₁)² } + Δ⁻¹ Z′Y − Z′Vε₁ − Δ⁻¹ (ε₁′Z)(ε₁′Y) + (ε₁′Z)(ε₁′Vε₁),

and K₆ₙ(Y, Z, V) and K₇ₙ(Y, Z, V) have the same properties as the K_{jn}(Y, Z, V), j = 1, …, 5. A Taylor series expansion of Φ in (4.43) gives

Φ(u + n^{−1/2}C + n⁻¹D + K₇ₙ) = Φ(u) + φ(u) { n^{−1/2} C(Z, V) + n⁻¹ [D(Y, Z, V) − ½ u C²(Z, V)] } + n^{−3/2} K₈(Y, Z, V) + n⁻² K₉(Y, Z, V) + K₁₀ₙ(Y, Z, V),

where K₈(Y, Z, V) is a homogeneous polynomial (not depending on n, but depending on u) of degree 3 in the elements of Y, Z and V, K₉(Y, Z, V) is a polynomial of degree 4, and K₁₀ₙ(Y, Z, V) is a remainder term which is O(n^{−5/2}) for fixed Y, Z and V (and u).

Let Jₙ be the set of Y, Z and V such that |yᵢ| ≤ g(log n)^{1/2}, |zᵢ| ≤ g(log n)^{1/2}, i = 1, …, p, and |vᵢⱼ| ≤ g(2 log n)^{1/2}, i, j = 1, …, p, where g > 2(1 + k)/k^{1/2}. Anderson showed that Pr(Jₙ) = 1 − O(n⁻²). The difference between E Φ(·) and the integral of Φ(·) times the density of Y, Z and V over Jₙ is therefore O(n⁻²). In Jₙ, each element of Y, Z and V divided by n^{1/2} is less than a constant times (log n / n)^{1/2}. The part of each remainder K_{jn}(Y, Z, V), j = 1, …, 7, that is O(n^{−5/2}) for fixed Y, Z and V can be written as a homogeneous polynomial of degree 5 in the elements of Y, Z and V with coefficients possibly depending on Y, Z and V (by use of Taylor series with remainder), and each coefficient is bounded in Jₙ for sufficiently large n. The same holds for K₁₀ₙ(Y, Z, V); hence, in Jₙ,

|K₁₀ₙ(Y, Z, V)| < constant × (log n)^{5/2} n^{−5/2},

and the integral of this times the density of Y, Z and V over Jₙ is O(n⁻²). Since fourth-order absolute moments of Y, Z and V exist and are bounded, the integral of |K₉(Y, Z, V)| over Jₙ is bounded; hence the contribution of this term (with the factor n⁻²) is O(n⁻²). The differences between n^{−1/2} E C(Z, V), n⁻¹ E[D(Y, Z, V) − ½ u C²(Z, V)] and n^{−3/2} E K₈(Y, Z, V) and the corresponding integrals over Jₙ are likewise O(n⁻²). Thus

Pr[(W − ½a)/a^{1/2} ≤ u] = Φ(u) + φ(u) { n^{−1/2} E C(Z, V) + n⁻¹ E[D(Y, Z, V) − ½ u C²(Z, V)] } + O(n⁻²),

because the third-order moments of Y, Z and V are either 0 or O(n^{−1/2}). Since C(Z, V) is linear and homogeneous, E C(Z, V) = 0. Since (Y, Z) and V are independent,

E D(Y, Z, V) = −u E(ε₁′V²ε₁) + (7u/8) E(ε₁′Vε₁)² + Δ⁻¹ E(Z′Y) − Δ⁻¹ E(ε₁′Z ε₁′Y) = −¼(4r − 3) u + (n/n₁)(r − 1)/Δ,

since

E(ε₁′V²ε₁) = E(ε₁′VV′ε₁) = Σᵢ E v₁ᵢ² = E v₁₁² + Σ_{i=2}^{r} E v₁ᵢ² = 2 + (r − 1) = r + 1,
E(ε₁′Vε₁)² = E v₁₁² = 2, E(Z′Y) = r(n/n₁), E(ε₁′Z ε₁′Y) = n/n₁,

and

E C²(Z, V) = E(ε₁′Z)² + (u²/4) E(ε₁′Vε₁)² = n/n₁ + u²/2.

Replacing n/n₁ by its limit 1 + k and substituting, we have

Pr[(W − ½a)/a^{1/2} ≤ u] = Φ(u) + (φ(u)/n) { (1 + k)(r − 1)/Δ − [¼(4r − 3) + ½(1 + k)] u − ¼ u³ } + O(n⁻²)

when E(X) = μ₁. Interchanging n₁ and n₂ (so that n/n₂ → 1 + 1/k) gives

Pr[(W + ½a)/a^{1/2} ≤ v] = Φ(v) − (φ(v)/n) { (1 + 1/k)(r − 1)/Δ + [¼(4r − 3) + ½(1 + 1/k)] v + ¼ v³ } + O(n⁻²)

when E(X) = μ₂.
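Anderson's expansion, as reconstructed above, is simple to evaluate; a sketch follows. Treating the random cut-off −a^{1/2}/2 as −Δ/2 when approximating Pr(W < 0 | π₁) is an additional approximation made here, and the function name is assumed.

```python
# Sketch: Anderson's expansion as reconstructed above,
#   Pr[(W - a/2)/sqrt(a) <= u] ~ Phi(u) + (phi(u)/n) h(u),
# with h(u) = (1+k)(r-1)/Delta - [(4r-3)/4 + (1+k)/2] u - u**3/4,
# k = n2/n1 and n = n1 + n2 - 2; u = -Delta/2 approximates Pr(W < 0 | pi_1).
from scipy.stats import norm

def anderson_error(Delta, r, n1, n2):
    n, k = n1 + n2 - 2, n2 / n1
    u = -Delta / 2
    h = ((1 + k) * (r - 1) / Delta
         - ((4 * r - 3) / 4 + (1 + k) / 2) * u
         - u**3 / 4)
    return norm.cdf(u) + norm.pdf(u) * h / n
```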
4.4 Efron's Asymptotic Expansion

Efron (1975) derived an asymptotic expansion of the expected probability of misclassification using a geometric argument which utilizes differential gradients and tangent lines. His derivation assumes mixture sampling, so that the sample sizes n₁ and n₂ are stochastic. Efron's expansion may be written as

EPMC_Efr = q₁Φ(b₁) + q₂Φ(b₂) + (2n)⁻¹ { [q₁φ(b₁)b₁ + q₂φ(b₂)b₂] [½(1 + V²) + ¼D²] + [q₁φ(b₁) + q₂φ(b₂)] (r − 1)(q₁⁻¹ + q₂⁻¹)/(2D²) + [q₁φ(b₁) − q₂φ(b₂)] V(q₁⁻¹ − q₂⁻¹)/(2D) },

where

V = D⁻¹ log(q₁/q₂), b₁ = −½D − V, b₂ = −½D + V.

A unique facet of this study is the inclusion of three asymptotic expansions of the expected probability of misclassification (EPMC) derived by Russian investigators: Raudys (1972), Deev (1972) and Kharin (1984). Little mention is made of these investigators and their expansions in the statistical discrimination literature authored in the western hemisphere. All three derivations depend on the central limit theorem for dependent variables.

4.5 Raudys' Asymptotic Expansion

The asymptotic expansion of Raudys is given by

EPMC_Rau = q₁ Φ(−D/(2m)) + q₂ Φ(−D/(2m)), where m = [ (1 + 4r/(nD²)) n/(n − r) ]^{1/2}.

4.6 Deev's Asymptotic Expansion

Deev's asymptotic expansion is expressed as

EPMC_Deev = Φ(−½ d), where d = D (1 − r/n)^{1/2} / [1 + (1/n₁ + 1/n₂)(r − 1)/D²]^{1/2}.

4.7 Kharin's Asymptotic Expansion

The asymptotic expansion of the EPMC derived by Kharin is

EPMC_Khr = Φ(−D/2) + (2√(2π))⁻¹ exp(−D²/8) (r − 1)(1/n₁ + 1/n₂)(1/D + D/4).

5. Sampling Experiments and Results

We generated data from the Bernoulli distribution, with populations characterized by three to five variables. We used a minimum of three variables because research has shown that three variables are enough to allow classification procedures to differ from one another; it is for this reason that most researchers use populations with more than two variables in sampling experiments. It has also been observed that both the time and the cost of sampling increase disproportionately with additional variables, so that simulation with more than six variables becomes increasingly difficult.

We define a simulation experiment by the values assigned to the input means P₁ⱼ and P₂ⱼ. In this study we considered only mean structures characterized by a non-negative difference between P₂ⱼ and P₁ⱼ, and we selected this difference to be not more than 0.4. We used 22 population pairs giving rise to 225 configurations formed by specifying the values of the means P₁ⱼ and P₂ⱼ. The seven asymptotic expansions were evaluated at each of the 225 configurations of n, r and d, where n is the sample size, r is the number of variables, and d = P₂ⱼ − P₁ⱼ > 0, j = 1, 2, …, r. The configurations of n, r and d are all possible combinations of n = 40, 60, 100, 140, 200, 300, 400, 600, 700, 800, 900, 1000; r = 3, 4, 5; and d = 0.1, 0.2, 0.3, 0.4. The simulation experiments were implemented using the International Mathematics and Statistics Library (IMSL); the simulation program was written in Fortran 77 and converted to present-day Fortran 2008. The number of iterations used for each configuration of n, r and d was 1000. Seven population pairs are based on three variables, nine on four variables and six on five variables. At each of the 225 configurations, the asymptotic expansion that has the minimum variance is declared the "best" asymptotic expansion (see Tables 1, 2 and 3, respectively; a sketch of the protocol is given below).
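The paper's Fortran/IMSL code is not reproduced here; the following Python sketch of the Monte Carlo protocol of Section 5 makes the stated procedure (draw Bernoulli training samples, compute the sample Mahalanobis squared distance, plug into Φ) concrete. Independent Bernoulli components and the baseline mean 0.3 are assumptions of this sketch only.

```python
# Sketch: one configuration (n, r, d) of the Section 5 experiment.
# Repeat 1000 trials: draw training samples, compute the sample
# Mahalanobis squared distance, plug into Phi; report the mean error
# rate and its variance over the trials.
import numpy as np
from scipy.stats import norm

def run_configuration(n, r, d, trials=1000, seed=0):
    rng = np.random.default_rng(seed)
    P1 = np.full(r, 0.3)              # illustrative means; the paper varies these
    P2 = P1 + d                       # d = P2j - P1j for every j
    errors = []
    for _ in range(trials):
        X1 = (rng.random((n, r)) < P1).astype(float)
        X2 = (rng.random((n, r)) < P2).astype(float)
        p1, p2 = X1.mean(0), X2.mean(0)
        S = ((n - 1) * np.cov(X1, rowvar=False)
             + (n - 1) * np.cov(X2, rowvar=False)) / (2 * n - 2)
        diff = p2 - p1
        D2 = diff @ np.linalg.pinv(S) @ diff           # Mahalanobis squared distance
        errors.append(norm.cdf(-np.sqrt(max(D2, 1e-12)) / 2))
    errors = np.asarray(errors)
    return errors.mean(), errors.var(ddof=1)
```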
Table 1: Simulation results used to evaluate the asymptotic expansions for sample sizes nᵢ using three variables (optimum error rate 0.358).

nᵢ    Asymptotic Expansion   Mean Error Rate   Variance
70    Anderson               0.353064          0.001100
      Deev                   0.356192          0.001130
      Efron                  0.350147          0.001145
      Raudys                 0.360767          0.001272
      Okamoto                0.358585          0.001282
      Sayre                  0.358570          0.001282
      Kharin                 0.357558          0.001297
100   Anderson               0.353850          0.000790
      Deev                   0.356137          0.000806
      Efron                  0.351838          0.000812
      Okamoto                0.357690          0.000874
      Sayre                  0.357683          0.000874
      Kharin                 0.356981          0.000881
      Raudys                 0.359542          0.000884
150   Anderson               0.355908          0.000532
      Deev                   0.357493          0.000540
      Efron                  0.354595          0.000543
      Okamoto                0.358514          0.000570
      Sayre                  0.358511          0.000570
      Kharin                 0.358050          0.000573
      Raudys                 0.359914          0.000578

Table 2: Simulation results used to evaluate the asymptotic expansions for sample sizes nᵢ using four variables (optimum error rate 0.4117).

nᵢ    Asymptotic Expansion   Mean Error Rate   Variance
30    Anderson               0.376443          0.001918
      Deev                   0.383130          0.002013
      Efron                  0.370495          0.002116
      Raudys                 0.393423          0.002374
      Okamoto                0.394742          0.003008
      Sayre                  0.394671          0.003011
      Kharin                 0.392610          0.003082
50    Anderson               0.389262          0.001274
      Deev                   0.394024          0.001313
      Efron                  0.386203          0.001354
      Raudys                 0.402851          0.001536
      Okamoto                0.402318          0.001811
      Sayre                  0.402296          0.001812
      Kharin                 0.401194          0.001840
70    Anderson               0.396501          0.000939
      Deev                   0.400181          0.000958
      Efron                  0.394507          0.000981
      Raudys                 0.407831          0.001150
      Okamoto                0.406758          0.001282
      Sayre                  0.406747          0.001283
      Kharin                 0.406014          0.001298
100   Anderson               0.403963          0.000684
      Deev                   0.406717          0.000695
      Efron                  0.402699          0.000705
      Raudys                 0.413209          0.000803
      Okamoto                0.411713          0.000859
      Sayre                  0.411708          0.000859
      Kharin                 0.411232          0.000866

Table 3: Simulation results used to evaluate the asymptotic expansions for sample sizes nᵢ using five variables (optimum error rate 0.40437).

nᵢ    Asymptotic Expansion   Mean Error Rate   Variance
50    Anderson               0.376967          0.001178
      Deev                   0.381547          0.001217
      Efron                  0.373528          0.001250
      Raudys                 0.389332          0.001426
      Okamoto                0.387660          0.001558
      Sayre                  0.387635          0.001559
      Kharin                 0.386420          0.001583
70    Anderson               0.385002          0.000920
      Deev                   0.388535          0.000943
      Efron                  0.382755          0.000961
      Raudys                 0.395331          0.001092
      Okamoto                0.393439          0.001144
      Sayre                  0.393427          0.001144
      Kharin                 0.392617          0.001157
100   Anderson               0.390603          0.000703
      Deev                   0.393244          0.000716
      Efron                  0.389132          0.000725
      Raudys                 0.398774          0.000817
      Okamoto                0.396923          0.000833
      Sayre                  0.396923          0.000834
      Kharin                 0.396385          0.000841
150   Anderson               0.396781          0.000482
      Deev                   0.398640          0.000488
      Efron                  0.395872          0.000492
      Okamoto                0.401287          0.000543
      Sayre                  0.401285          0.000543
      Raudys                 0.402917          0.000545
      Kharin                 0.400946          0.000546

6. Discussion of Results, Conclusion and Recommendation

Two methods can be used to determine the best asymptotic expansion:

1. the asymptotic expansion that has the minimum variance for each configuration;
2. the difference between the optimum error rate and the expected error rate, the expansion with the minimum difference being declared the best for that configuration.

For the purposes of this paper we used the first option (both criteria are illustrated in the sketch below). Over the 225 configurations of the simulation experiments, Anderson's expansion was the best in terms of minimum variance.
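The two selection criteria just described amount to two different minimizations over the same per-configuration results; a small sketch (function names assumed) applied to the nᵢ = 70 rows of Table 1:

```python
# Sketch: the two selection criteria of Section 6.
def best_by_variance(results):
    # results: dict name -> (mean_error, variance); method 1
    return min(results, key=lambda name: results[name][1])

def best_by_gap(results, optimum):
    # method 2: smallest gap between mean error rate and optimum rate
    return min(results, key=lambda name: abs(results[name][0] - optimum))

table1_70 = {"Anderson": (0.353064, 0.001100), "Deev": (0.356192, 0.001130),
             "Efron": (0.350147, 0.001145), "Raudys": (0.360767, 0.001272),
             "Okamoto": (0.358585, 0.001282), "Sayre": (0.358570, 0.001282),
             "Kharin": (0.357558, 0.001297)}
print(best_by_variance(table1_70))            # -> Anderson
print(best_by_gap(table1_70, optimum=0.358))  # -> Kharin
```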
Efron's and Deev's expansions also performed better than the remaining asymptotic expansions; the worst expansion was that of Raudys. The simulations indicate that the expansions of the expected error rate were reasonably good for small to moderate sample sizes and excellent for large samples. For sample size 500, and for particular values of r and d, some asymptotic expansions produced the same variance. Future researchers should look at the second method of obtaining the best asymptotic expansion.

REFERENCES

Anderson, T.W. (1951). "Classification by multivariate analysis". Psychometrika 16, 31-50.

Anderson, T.W. (1973). "An asymptotic expansion of the distribution of the studentized classification statistic W". Ann. Statist. 1, 964-972.

Deev, A.D. (1972). "Asymptotic expansions of distributions of statistics of discriminant analysis W, M, W*". Stat. Metody Klassif., MGU, Moscow, 6-51.

Efron, B. (1975). "The efficiency of logistic regression compared to normal discriminant analysis". Journal of the American Statistical Association 70, 892-898.

Fisher, R.A. (1936). "The use of multiple measurements in taxonomic problems". Ann. Eugenics 7, 179-188.

John, S. (1961). "Errors in discrimination". Ann. Math. Statist. 32, 1125-1144.

Kharin, Y.S. (1984). "The investigation of risk for statistical classifiers using minimum contrast estimators". Theory Probab. Appl. 28, 623-630.

Lachenbruch, P. (1975). Discriminant Analysis. Hafner Press.

Okamoto, M. (1963). "An asymptotic expansion for the distribution of the linear discriminant function". Ann. Math. Statist. 34, 1286-1301.

Okamoto, M. (1968). "Correction to 'An asymptotic expansion for the distribution of the linear discriminant function'". Ann. Math. Statist. 39, 1358-1359.

Raudys, S. (1972). "On the amount of a priori information in designing the classification algorithm". Tech. Cybern. 4, 168-174.

Sayre, J.W. (1980). "The distribution of the actual error rate in linear discriminant analysis". J. Amer. Statist. Assoc. 75, 201-205.

Schervish, M.J. (1981). "Asymptotic expansions of the means and variances of error rates". Biometrika 68, 295-299.

Wyman, F.J., Young, D.M. and Turner, D.W. (1990). "A comparison of asymptotic error rate expansions for the sample linear discriminant function". Pattern Recognition 23(7), 775-783.