
Outlier Detection for High Dimension, Low
Sample Size Data
Myung Hee Lee
Colorado State University
(Joint work with Jeongyoun Ahn and Jung Ae Lee)
Outlier Detection
• What is an outlier?
• Why is outlier detection important?
High Dimension, Low Sample Size (HDLSS) Data
• Data with more variables than the sample size (d > n).
• Also called “Large p, small n”.
• Microarray data, medical images, signal processing, etc.
• Classical multivariate techniques fail.
Outlier detection for HDLSS data
• Visualization of HDLSS?
• Distance measure? Methods based on the Euclidean distance do not work well in high dimensions.
• We propose to use
– a new type of distance measure for HDLSS data and
– a graphical tool to detect outliers.
Distance measure
• Euclidean distance
$D_{C}(j) = \sqrt{(x_j - \bar{x}_{(-j)})^T (x_j - \bar{x}_{(-j)})}$
• Mahalanobis distance
$D_{MH}(j) = \sqrt{(x_j - \bar{x}_{(-j)})^T (S_{(-j)} + \alpha I)^{-1} (x_j - \bar{x}_{(-j)})}$
• Maximal Data Piling (MDP) distance
$D_{MDP}(j) = \sqrt{(x_j - \bar{x}_{(-j)})^T (I_d - P_{(-j)}) (x_j - \bar{x}_{(-j)})}$
Here $\bar{x}_{(-j)}$ and $S_{(-j)}$ are the sample mean and covariance of the data with the $j$th observation removed, and $P_{(-j)}$ is the corresponding projection matrix.
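A minimal NumPy sketch (not from the slides) of these three leave-one-out distances. The projection $P_{(-j)}$ is taken here to be the orthogonal projection onto the column space of the centered leave-one-out data, which is one natural reading of the MDP construction; the function name and the `alpha` default are illustrative.

```python
import numpy as np

def loo_distances(X, alpha=1.0):
    """Leave-one-out distances D_C, D_MH, D_MDP for a d x n matrix X (d > n)."""
    d, n = X.shape
    d_c, d_mh, d_mdp = np.empty(n), np.empty(n), np.empty(n)
    for j in range(n):
        rest = np.delete(X, j, axis=1)          # the n-1 remaining samples
        diff = X[:, j] - rest.mean(axis=1)      # x_j - xbar_(-j)
        d_c[j] = np.sqrt(diff @ diff)           # Euclidean
        S = np.cov(rest)                        # d x d covariance, singular when d > n
        d_mh[j] = np.sqrt(diff @ np.linalg.solve(S + alpha * np.eye(d), diff))
        # MDP: residual of diff after projecting onto the span of the
        # centered remaining samples, i.e. (I_d - P_(-j)) diff
        C = rest - rest.mean(axis=1, keepdims=True)
        U, s, _ = np.linalg.svd(C, full_matrices=False)
        U = U[:, s > 1e-8 * s[0]]               # numerical rank truncation
        r = diff - U @ (U.T @ diff)
        d_mdp[j] = np.sqrt(r @ r)
    return d_c, d_mh, d_mdp
```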
The MDP distance
• The MDP distance is the distance between the affine subspaces generated by each class.
[Figure: schematic of two clusters (Cluster 1 and Cluster 2) and the distance between their affine spans.]
• It is the distance between the data projections onto the MDP direction vector, $v_{MDP}$, which yields dichotomous projections.
The MDP direction
[Figure: projections of the Leukemia data (ALL and AML samples) onto the support vector machine (left) and the maximal data piling (right) direction vectors.]
Empirical QQ plot
Empirical QQ plot - Algorithm
Start with q = 1.
1) Calculate the leave-one-out distances (say $d_j$, $j = 1, \ldots, n - q + 1$) and sort them in ascending order: $d_{(1)} < \cdots < d_{(n-q+1)}$.
2) Remove the sample with the largest distance and compute the sample mean $\bar{x}^*$ and covariance matrix $S^*$ from the remaining samples.
3) Generate $n - q + 1$ samples from $N_p(\bar{x}^*, S^* + \alpha I)$.
4) Compute the leave-one-out distances $d^*_j$ ($j = 1, \ldots, n - q + 1$) from the simulated data and sort them: $d^*_{(1)} < d^*_{(2)} < \cdots < d^*_{(n-q+1)}$.
5) Repeat steps 3) and 4) 100 times, and average the sorted distances to obtain $\bar{d}^*_{(1)} < \bar{d}^*_{(2)} < \cdots < \bar{d}^*_{(n-q+1)}$.
Empirical QQ plot - Algorithm (cont.)
6) Plot the simulated distances $\bar{d}^*_{(j)}$ on the X-axis against the observed distances $d_{(j)}$ on the Y-axis.
7) Deviation from the straight line indicates an outlier.
8) Increase q until there is no evidence of an outlier in the plot. (A Python sketch of steps 1)-6) follows below.)
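A sketch of one pass of steps 1)-6) for a fixed q, assuming `dist_fn` returns leave-one-out distances for a d × m data matrix; the function name and defaults are illustrative, not from the slides.

```python
import numpy as np
import matplotlib.pyplot as plt

def empirical_qq_step(X, dist_fn, alpha=1.0, n_rep=100, seed=None):
    """One empirical QQ comparison for the current m = n - q + 1 samples."""
    rng = np.random.default_rng(seed)
    d, m = X.shape
    d_all = dist_fn(X)
    d_obs = np.sort(d_all)                          # step 1)
    rest = np.delete(X, np.argmax(d_all), axis=1)   # step 2)
    mean = rest.mean(axis=1)
    cov = np.cov(rest) + alpha * np.eye(d)
    d_sim = np.zeros(m)
    for _ in range(n_rep):                          # steps 3)-5)
        Xs = rng.multivariate_normal(mean, cov, size=m).T
        d_sim += np.sort(dist_fn(Xs))
    d_sim /= n_rep
    plt.plot(d_sim, d_obs, "o")                     # step 6)
    plt.plot([d_sim.min(), d_sim.max()], [d_sim.min(), d_sim.max()], "--")
    plt.xlabel("averaged simulated distances")
    plt.ylabel("observed distances")
    plt.show()
```

For example, `empirical_qq_step(X, lambda Z: loo_distances(Z)[2])` uses the MDP distance from the earlier sketch.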
Empirical QQ plot - Multiple Outliers
Will this work? - Theoretical properties
• Distance choice: results for the MDP distance will be presented.
• Single outlier vs. multiple outliers (masking effect)
• High-dimensional data? HDLSS asymptotics ($d \to \infty$, $n$ fixed)
HDLSS geometric representation - Notation
• $X_{d \times n} = [X_1, \ldots, X_n]$: non-outlier data matrix with $d > n$.
• $X_j = (X_{1j}, \ldots, X_{dj})^t$ are i.i.d. $d$-variate random vectors from a population.
• $X^0_{d \times n_0} = [X^0_1, \ldots, X^0_{n_0}]$: outlier data matrix.
• $X^0_j = (X^0_{1j}, \ldots, X^0_{dj})^t$ are $d$-variate random vectors from another population.
HDLSS geometric representation - Assumptions
• $X_{(d)} = (X_1, \ldots, X_d)^t$ and $X^0_{(d)} = (X^0_1, \ldots, X^0_d)^t$: the $d$-variate random vectors for the two clusters.
• Assume the following conditions as $d \to \infty$.
(a) The fourth moments of the entries of the data vectors are uniformly bounded.
(b) $d^{-1} \sum_{j=1}^{d} \mathrm{Var}(X_j) \longrightarrow \sigma^2$
(c) $d^{-1} \sum_{j=1}^{d} \mathrm{Var}(X^0_j) \longrightarrow \tau^2$
(d) $d^{-1} \sum_{j=1}^{d} \{E(X_j) - E(X^0_j)\}^2 \longrightarrow \mu^2$
(e) There exists a permutation of the entries of the data vectors such that the sequence of variables is $\rho$-mixing for functions that are dominated by quadratics.
HDLSS geometric representation
As $d \to \infty$:
• The data vectors approximately form an $N$-polyhedron (with $N = n + n_0$).
• Each cluster forms a regular simplex with $n$ and $n_0$ vertices ($X$ and $X^0$, respectively).
• The length of an edge connecting data vectors within $X$ (or $X^0$) is approximately $\sqrt{2}\,\sigma$ (or $\sqrt{2}\,\tau$) after scaling by $\sqrt{d}$.
• The length of an edge connecting data vectors in different clusters is $\sqrt{\sigma^2 + \tau^2 + \mu^2}$ after scaling by $\sqrt{d}$.
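A quick numerical illustration (not from the slides) with independent Gaussian entries, for which conditions (a)-(e) hold with the stated $\sigma^2$, $\tau^2$, and $\mu^2$; the scaled pairwise distances should be close to the three limiting edge lengths above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, tau, mu = 200_000, 1.0, 1.5, 0.5     # illustrative values
x1, x2 = rng.normal(0, sigma, (2, d))          # two vectors from cluster X
y1, y2 = rng.normal(mu, tau, (2, d))           # two vectors from cluster X^0
s = np.sqrt(d)
print(np.linalg.norm(x1 - x2) / s)             # ~ sqrt(2)*sigma           = 1.414
print(np.linalg.norm(y1 - y2) / s)             # ~ sqrt(2)*tau             = 2.121
print(np.linalg.norm(x1 - y1) / s)             # ~ sqrt(sig^2+tau^2+mu^2)  = 1.871
```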
Theorem: Single outlier
• Assume there is a single outlier, i.e., $n_0 = 1$.
• Assumptions (a)-(e) are satisfied.
• Assume that $\mu^2 + \tau^2 > \sigma^2$.
Then the leave-one-out MDP distance for the outlier is bigger than the distances for the non-outliers with probability 1.
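A small check of this statement with hypothetical parameters satisfying $\mu^2 + \tau^2 > \sigma^2$ (here $1 + 1 > 1$), reusing `loo_distances` from the earlier sketch: the outlier column should attain the largest MDP distance.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 20_000, 10
X = np.hstack([rng.normal(0.0, 1.0, (d, n)),    # non-outliers: sigma = 1
               rng.normal(1.0, 1.0, (d, 1))])   # outlier: mu = 1, tau = 1
_, _, d_mdp = loo_distances(X)                  # defined in the earlier sketch
print(np.argmax(d_mdp))                         # expect n, the outlier's column index
```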
Theorem: Multiple Outliers
• Assume $n > n_0$.
• Assumptions (a)-(e) are satisfied.
• Let $\tau^2 = c\sigma^2$ for some constant $c > 0$.
• Under any of the following further assumptions, the leave-one-out MDP outlier detection method detects outliers in the large-$d$ limit, in the sense that the MDP distance is bigger when an outlier is split from the rest than when a non-outlier is split from the rest, with probability 1:
(i) $c > 1$ and $\mu \ge 0$, or
(ii) $c = 1$ and $\mu > 0$, or
(iii) $\frac{n(n_0 - 1)}{n_0(n - 1)} < c < 1$ and $\mu^2 > f_{n,n_0}(c)\,\sigma^2$, where
$$f_{n,n_0}(c) = \frac{c^2(n-1) - c(n - n_0) - (n_0 - 1)}{n(n_0 - 1) - c\,n_0(n - 1)}.$$
How many outliers can MDP handle?
• Assume $n > n_0$.
• Assumptions (a)-(e) are satisfied.
• Let $\tau^2 = c\sigma^2$ for $0 < c < 1$, i.e., we have "moderately" isolated outliers.
Then the MDP outlier method detects outliers successfully with probability 1 if
$$n_0 < \frac{n}{n(1-c) + c} \quad\text{and}\quad \mu^2 > f_{n,n_0}(c)\,\sigma^2.$$
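A short sketch evaluating these bounds: for $n = 100$ and $c = 0.9$ the first condition gives $n_0 < 100/10.9 \approx 9.2$, i.e. at most 9 outliers, matching the range plotted on the next slide.

```python
def f(n, n0, c):
    """f_{n,n0}(c) from the multiple-outlier theorem."""
    return (c**2 * (n - 1) - c * (n - n0) - (n0 - 1)) / (n * (n0 - 1) - c * n0 * (n - 1))

n, c, sigma2 = 100, 0.9, 1.0
print(n / (n * (1 - c) + c))            # upper bound on n0 (exclusive): ~9.17
for n0 in range(2, 10):
    print(n0, f(n, n0, c) * sigma2)     # minimum mu^2 for successful detection
```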
Example: minimum distance for successful outlier detection
[Figure: two panels plotting the minimum $\mu^2$ for successful detection against the number of outliers $n_0$ (1 to 9), for $n = 100$, $\sigma^2 = 1$, with $c = 0.9$ (top) and $c = 0.7$ (bottom); the $\mu^2$ axis runs from 0 to 6.]
X-axis: number of outliers $n_0$. Y-axis: distance ($\mu^2$).
Detectable areas by MDP and Euclidean (CD)?
MDP vs. Euclidean (CD)?
Will this work? - Empirical Study
• $X_j \sim N_d(0, \Sigma)$
• $\Sigma := U \Lambda U^T$
• $U$ is an orthonormal matrix.
• $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_p\}$ with $\lambda_j = p^3 j^{-1}$, $j = 1, \ldots, p$.
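A sketch of this simulation design (the sizes are illustrative, not from the slides); here U is drawn via the QR decomposition of a Gaussian matrix, a standard way to obtain a random orthonormal matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 300, 25                                      # illustrative sizes with p > n
U, _ = np.linalg.qr(rng.standard_normal((p, p)))    # random orthonormal U
lam = p**3 / np.arange(1, p + 1)                    # lambda_j = p^3 * j^(-1)
Sigma = (U * lam) @ U.T                             # Sigma = U Lambda U^T
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n).T  # p x n data matrix
```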
Single Outlier
Multiple Outliers: $n_0 = ?$
Lung Cancer Data: n = 19 and d = 1000
Summary
• A new outlier detection procedure for HDLSS data.
• Graphical tool? The empirical QQ plot.
• Distance? Euclidean, Mahalanobis, and MDP distances.
• High-dimensional asymptotic properties of the distances.
• Simulation and real data examples demonstrate the performance of the proposed method.