Outlier Detection for High Dimension, Low Sample Size Data

Myung Hee Lee, Colorado State University
(Joint work with Jeongyoun Ahn and Jung Ae Lee)

Outlier Detection
• What is an outlier?
• Why is outlier detection important?

High Dimension, Low Sample Size (HDLSS) Data
• Data with more variables than the sample size (d > n).
• Also called "large p, small n".
• Microarray data, medical images, signal processing, etc.
• Classical multivariate techniques fail.

Outlier detection for HDLSS data
• Visualization of HDLSS data?
• Distance measure? Methods based on the Euclidean distance do not work well in high dimension.
• We propose to use
  – a new type of distance measure for HDLSS data, and
  – a graphical tool to detect outliers.

Distance measure
• Euclidean (centroid) distance
  $D_{CD}(j) = \sqrt{(x_j - \bar{x}_{-j})^T (x_j - \bar{x}_{-j})}$
• Mahalanobis distance
  $D_{MH}(j) = \sqrt{(x_j - \bar{x}_{-j})^T (S_{-j} + \alpha I)^{-1} (x_j - \bar{x}_{-j})}$
• Maximal Data Piling (MDP) distance
  $D_{MDP}(j) = \sqrt{(x_j - \bar{x}_{(-j)})^T (I_d - P_{(-j)}) (x_j - \bar{x}_{(-j)})}$
Here $\bar{x}_{-j}$ and $S_{-j}$ are the sample mean and variance-covariance matrix of the data with the j-th sample removed, and $P_{(-j)}$ is the orthogonal projection onto the span of the centered remaining samples.

The MDP distance
• The MDP distance is the distance between the affine subspaces generated by each class.
[Figure: two clusters (Cluster 1, Cluster 2) and the affine subspaces they generate.]
• It is the distance between the data projections onto the MDP direction vector v_MDP, which yields dichotomous (piled) projections.

The MDP direction
[Figure: projections of leukemia data (ALL vs. AML) onto the support vector machine (left) and the maximal data piling (right) direction vectors.]

Empirical QQ plot

Empirical QQ plot - Algorithm
Start with q = 1.
1) Calculate the leave-one-out distances (say d_j, j = 1, ..., n − q + 1) and sort them in ascending order: d_(1) < ... < d_(n−q+1).
2) Remove the sample with the largest distance and compute the sample mean x̄* and variance-covariance matrix S* from the remaining samples.
3) Generate n − q + 1 samples from N_p(x̄*, S* + αI).
4) Compute the leave-one-out distances d*_j (j = 1, ..., n − q + 1) from the simulated data and sort them: d*_(1) < d*_(2) < ... < d*_(n−q+1).
5) Repeat steps 3) and 4) 100 times, and average the distances to obtain d̄*_(1) < d̄*_(2) < ... < d̄*_(n−q+1).
6) Plot the simulated distances d̄*_(j) on the X-axis against the distances from the data d_(j) on the Y-axis.
7) Deviation from the straight line indicates an outlier.
8) Increase q until the plot shows no further evidence of an outlier.
A code sketch of one round of this algorithm follows.
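The sketch below is a minimal NumPy rendering of one round of the algorithm (steps 1-6), assuming the leave-one-out MDP distance from the earlier slide. The function names and the least-squares realization of the projection $P_{(-j)}$ are illustrative choices, not taken from the authors' software.

```python
import numpy as np

def loo_mdp_distance(X):
    """Leave-one-out MDP distances for a d x m data matrix X (columns are
    samples): the distance from each x_j to the affine hull of the rest,
    i.e. sqrt((x_j - xbar_{-j})^T (I - P_{-j}) (x_j - xbar_{-j}))."""
    d, m = X.shape
    dist = np.empty(m)
    for j in range(m):
        rest = np.delete(X, j, axis=1)
        xbar = rest.mean(axis=1)
        centered = rest - xbar[:, None]       # directions spanning the affine hull
        r = X[:, j] - xbar
        coef, *_ = np.linalg.lstsq(centered, r, rcond=None)
        dist[j] = np.linalg.norm(r - centered @ coef)  # residual after projecting
    return dist

def qq_round(X, alpha=1.0, n_rep=100, seed=0):
    """One round (fixed q) of the empirical QQ-plot algorithm, steps 1)-6).
    X holds the n - q + 1 samples still in play; returns the (reference,
    observed) order statistics for plotting."""
    rng = np.random.default_rng(seed)
    d_loo = loo_mdp_distance(X)                          # step 1)
    d_obs = np.sort(d_loo)
    rest = np.delete(X, int(np.argmax(d_loo)), axis=1)   # step 2)
    xbar, S = rest.mean(axis=1), np.cov(rest)
    cov = S + alpha * np.eye(X.shape[0])                 # ridge keeps cov usable
    m = X.shape[1]
    sims = np.empty((n_rep, m))
    for r in range(n_rep):                               # steps 3)-5)
        Xsim = rng.multivariate_normal(xbar, cov, size=m).T
        sims[r] = np.sort(loo_mdp_distance(Xsim))
    return sims.mean(axis=0), d_obs                      # step 6): x = ref, y = obs
```

Scattering the observed order statistics against the returned reference quantiles, with the 45-degree line overlaid, gives the QQ plot of steps 6)-7); removing the flagged sample and rerunning with q + 1 implements step 8).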
Empirical QQ plot - Multiple Outliers

Will this work? - Theoretical properties
• Distance choice: the MDP distance will be presented.
• Single outlier vs. multiple outliers (masking effect).
• High dimensional data? HDLSS asymptotics (d → ∞, n fixed).

HDLSS geometric representation - Notation
• X_{d×n} = [X_1, ..., X_n]: non-outlier data matrix with d > n.
• X_j = (X_{1j}, ..., X_{dj})^t are i.i.d. d-variate random vectors from a population.
• X⁰_{d×n₀} = [X⁰_1, ..., X⁰_{n₀}]: outlier data matrix.
• X⁰_j = (X⁰_{1j}, ..., X⁰_{dj})^t are d-variate random vectors from another population.

HDLSS geometric representation - Assumptions
• X_(d) = (X_1, ..., X_d)′ and X⁰_(d) = (X⁰_1, ..., X⁰_d)′: the d-variate random vectors for the two clusters.
• Assume the following conditions as d → ∞:
(a) The fourth moments of the entries of the data vectors are uniformly bounded.
(b) $d^{-1} \sum_{j=1}^{d} \mathrm{Var}(X_j) \to \sigma^2$
(c) $d^{-1} \sum_{j=1}^{d} \mathrm{Var}(X^0_j) \to \tau^2$
(d) $d^{-1} \sum_{j=1}^{d} \{E(X_j) - E(X^0_j)\}^2 \to \mu^2$
(e) There exists a permutation of the entries of the data vectors such that the sequence of variables is ρ-mixing for functions that are dominated by quadratics.

HDLSS geometric representation
As d → ∞,
• the data vectors approximately form an N-polyhedron (N = n + n₀);
• each cluster forms a regular simplex with n and n₀ vertices (X and X⁰, respectively);
• the length of an edge connecting data vectors within X (or X⁰) is approximately √2 σ (or √2 τ) after scaling by √d;
• the length of an edge connecting data vectors in different clusters is approximately √(σ² + τ² + µ²) after scaling by √d.

Theorem: Single outlier
• Assume there is a single outlier, i.e., n₀ = 1.
• Assumptions (a)-(e) are satisfied.
• Assume that µ² + τ² > σ².
Then the leave-one-out MDP distance for the outlier is bigger than the distances for the non-outliers with probability 1.

Theorem: Multiple outliers
• Assume n > n₀.
• Assumptions (a)-(e) are satisfied.
• Let τ² = cσ² for some constant c > 0.
Under any one of the following further assumptions, the leave-one-out MDP outlier detection method detects the outliers in the large-d limit, in the sense that, with probability 1, the MDP distance is bigger when an outlier is split from the rest than when a non-outlier is split from the rest:
(i) c > 1 and µ ≥ 0; or
(ii) c = 1 and µ > 0; or
(iii) $\frac{n(n_0 - 1)}{n_0(n - 1)} < c < 1$ and µ² > f_{n,n₀}(c)σ², where
$f_{n,n_0}(c) = \frac{c^2(n-1) - c(n - n_0) - (n_0 - 1)}{n(n_0 - 1) - c\,n_0(n-1)}.$

How many outliers can MDP handle?
• Assume n > n₀.
• Assumptions (a)-(e) are satisfied.
• Let τ² = cσ² for 0 < c < 1, i.e., we have "moderately" isolated outliers.
Then the MDP outlier method detects the outliers successfully w.p. 1 if
$n_0 < \frac{n}{n(1-c) + c}$ and µ² > f_{n,n₀}(c)σ².

Example: minimum distance for successful outlier detection
[Figure: minimum µ² for successful outlier detection against the number of outliers n₀ = 1, ..., 9, for n = 100 and σ² = 1, with c = 0.9 (top panel) and c = 0.7 (bottom panel). X-axis: number of outliers n₀; Y-axis: µ².]

Detectable areas by MDP and Euclidean(CD)?

MDP vs Euclidean(CD)?

Will this work? - Empirical Study
• X_j ~ N_d(0, Σ)
• Σ := UΛU^T
• U is an orthonormal matrix.
• Λ = diag{λ_1, ..., λ_p} with λ_j = p³ j⁻¹, j = 1, ..., p.
A code sketch of this simulation design follows.
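Below is a minimal sketch of this simulation design, assuming the spectrum λ_j = p³/j as reconstructed from the slide, and assuming a random orthonormal U obtained from the QR factorization of a Gaussian matrix (the slides do not say how U is chosen); the function name is illustrative.

```python
import numpy as np

def simulate_hdlss(n, p, seed=0):
    """Draw n observations from N_p(0, Sigma) with Sigma = U Lambda U^T,
    lambda_j = p^3 / j (a polynomially decaying spectrum, as read from
    the slide)."""
    rng = np.random.default_rng(seed)
    # Random orthonormal U: QR of a Gaussian matrix is one common choice;
    # the slides only require U to be orthonormal.
    U, _ = np.linalg.qr(rng.standard_normal((p, p)))
    lam = p**3 / np.arange(1, p + 1)
    # X = U diag(sqrt(lambda)) Z with Z ~ N(0, I) has covariance U diag(lam) U^T.
    Z = rng.standard_normal((p, n))
    return U @ (np.sqrt(lam)[:, None] * Z)   # p x n matrix, columns = observations
```

Outlier columns for the n₀ > 0 settings can then be appended by drawing from the same model with a mean shift of size µ and a variance scale c, matching the (µ², c) parametrization of the theorems.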
Single Outlier

Multiple Outliers: n₀ = ?

Lung Cancer Data: n = 19 and d = 1000

Summary
• A new outlier detection procedure for HDLSS data.
• Graphical tool? The empirical QQ plot.
• Distance? The Euclidean, Mahalanobis, and MDP distances.
• High dimensional asymptotic properties of the distances.
• Simulation and real data examples demonstrate the performance of the proposed method.