Outlier Detection for High Dimension, Low
Sample Size Data
Myung Hee Lee
Colorado State University
(Joint work with Jeongyoun Ahn and Jung Ae Lee)
Outlier Detection
• What is an outlier?
• Why is outlier detection important?
High Dimension, Low Sample Size (HDLSS) Data
• Data with more variables than the sample size (d > n).
• Also called “Large p, small n”.
• Microarray data, medical images, signal processing, etc.
• Classical multivariate techniques fail.
Outlier detection for HDLSS data
• Visualization of HDLSS?
• Distance measure? Methods based on Euclidean distance do
not work well in high dimension.
• We propose to use
– a new type of distance measure for HDLSS and
– a graphical tool to detect outliers.
Distance measure
• Euclidean distance
$D_C(j) = \sqrt{(x_j - \bar{x}_{-j})^T (x_j - \bar{x}_{-j})}$,
where $\bar{x}_{-j}$ is the sample mean of the data with the $j$th observation removed.
• Mahalanobis distance
$D_{MH}(j) = \sqrt{(x_j - \bar{x}_{-j})^T (S_{-j} + \alpha I)^{-1} (x_j - \bar{x}_{-j})}$,
where $S_{-j}$ is the corresponding sample covariance matrix.
• Maximal Data Piling (MDP) distance
$D_{MDP}(j) = \sqrt{(x_j - \bar{x}_{-j})^T (I_d - P_{-j}) (x_j - \bar{x}_{-j})}$,
where $P_{-j}$ is the projection onto the span of the centered remaining data. (A code sketch of all three follows below.)
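To make the three definitions concrete, here is a minimal NumPy sketch; it is not the authors' code, and the function name and the reading of $P_{-j}$ as the projection onto the span of the mean-centered remaining columns are assumptions.

```python
import numpy as np

def loo_distances(X, alpha=1.0):
    """Leave-one-out distances for each column of the d x n matrix X.
    Returns centroid (Euclidean), ridged Mahalanobis, and MDP distances."""
    d, n = X.shape
    D_C, D_MH, D_MDP = np.zeros(n), np.zeros(n), np.zeros(n)
    for j in range(n):
        Xr = np.delete(X, j, axis=1)              # the other n-1 samples
        xbar = Xr.mean(axis=1)
        diff = X[:, j] - xbar
        D_C[j] = np.linalg.norm(diff)             # Euclidean / centroid
        C = Xr - xbar[:, None]                    # centered remaining data
        S = C @ C.T / (n - 2)                     # sample covariance, d x d
        # direct d x d ridged solve; fine for moderate d
        D_MH[j] = np.sqrt(diff @ np.linalg.solve(S + alpha * np.eye(d), diff))
        # MDP: remove the component of diff lying in the span of C
        U, s, _ = np.linalg.svd(C, full_matrices=False)
        Q = U[:, s > 1e-10 * s[0]]                # basis of col(C), rank n-2
        D_MDP[j] = np.linalg.norm(diff - Q @ (Q.T @ diff))
    return D_C, D_MH, D_MDP
```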
The MDP distance
• The MDP distance is the distance between affine subspaces generated by each class.
[Figure: two clusters and the affine subspaces each generates.]
• It is the distance between the data projections onto the MDP direction vector, $v_{MDP}$, which yields dichotomous projections (a sketch for computing $v_{MDP}$ is given below).
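For intuition, a minimal sketch of one standard characterization from the maximal data piling literature, $v_{MDP} \propto S^{+}(\bar{x}_1 - \bar{x}_2)$ with $S^{+}$ the pseudo-inverse of the globally centered total covariance; this is an assumed form for illustration, not the authors' implementation.

```python
import numpy as np

def mdp_direction(X1, X2):
    """Sketch of the MDP direction: v proportional to S^+ (xbar1 - xbar2),
    where S^+ is the pseudo-inverse of the (unscaled) covariance of the
    globally centered combined data. X1, X2 are d x n1 and d x n2."""
    X = np.hstack([X1, X2])
    C = X - X.mean(axis=1, keepdims=True)   # center at the overall mean
    S = C @ C.T                             # d x d; rank at most n1+n2-1
    delta = X1.mean(axis=1) - X2.mean(axis=1)
    v = np.linalg.pinv(S) @ delta           # pinv is heavy for very large d
    return v / np.linalg.norm(v)
```

When $d > n$, projecting the data onto this direction (`X.T @ v`) piles each class at a single point, which is the dichotomous behavior referred to above.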
The MDP direction
[Figure: Projections of the Leukemia data (ALL vs. AML) onto the support vector machine (left) and the maximal data piling (right) direction vectors.]
Empirical QQ plot
Empirical QQ plot - Algorithm
Start with q = 1.
1) Calculate the leave-one-out distances (say $d_j$, $j = 1, \ldots, n-q+1$) and sort them in ascending order to get $d_{(1)} < \cdots < d_{(n-q+1)}$.
2) Remove the sample with the largest distance and compute the sample mean $\bar{x}^*$ and variance-covariance matrix $S^*$ from the remaining samples.
3) Generate $n - q + 1$ samples from $N_p(\bar{x}^*, S^* + \alpha I)$.
4) Compute the leave-one-out distances $d^*_j$ ($j = 1, \ldots, n-q+1$) from the simulated data and sort the $d^*_j$'s to get $d^*_{(1)} < d^*_{(2)} < \cdots < d^*_{(n-q+1)}$.
5) Repeat steps 3) and 4) 100 times, and average the distances to obtain $\bar{d}^*_{(1)} < \bar{d}^*_{(2)} < \cdots < \bar{d}^*_{(n-q+1)}$.
Empirical QQ plot - Algorithm
6) Plot the simulated distances $\bar{d}^*_{(j)}$ on the X-axis against the distances from the data $d_{(j)}$ on the Y-axis.
7) Deviation from the straight line indicates an outlier.
8) Increase q until we see no evidence of an outlier in the plot. (A sketch of one round in code follows below.)
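A minimal NumPy sketch of one round of steps 1)-6), written against a generic leave-one-out distance function (e.g., the `loo_distances` sketch above); the function and argument names are hypothetical, not the authors' code.

```python
import numpy as np

def qq_round(loo_dist, X, alpha=1.0, n_rep=100, rng=None):
    """One round (fixed q) of the empirical QQ plot -- a sketch.
    loo_dist(X) returns the leave-one-out distances of the columns of X;
    X holds the n - q + 1 samples remaining at the current round."""
    rng = np.random.default_rng(rng)
    d_raw = loo_dist(X)
    d_obs = np.sort(d_raw)                               # step 1
    X_trim = np.delete(X, np.argmax(d_raw), axis=1)      # step 2
    xbar, S = X_trim.mean(axis=1), np.cov(X_trim)        # mean, d x d cov
    ref = np.zeros_like(d_obs)
    for _ in range(n_rep):                               # steps 3-5
        Z = rng.multivariate_normal(
            xbar, S + alpha * np.eye(len(xbar)), size=X.shape[1]).T
        ref += np.sort(loo_dist(Z))
    return ref / n_rep, d_obs     # X- and Y-coordinates for step 6
```

Plotting `ref` against `d_obs` and flagging points above the straight line implements steps 6)-7); removing the flagged sample and rerunning with $q+1$ implements step 8).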
Empirical QQ plot - Multiple Outliers
Will this work? - Theoretical properties
• Distance choice: results for the MDP distance will be presented.
• Single outlier vs. multiple outliers (masking effect)
• High dimensional data? HDLSS asymptotics (d → ∞, n: fixed)
HDLSS geometric representation - Notation
• $X_{d \times n} = [X_1, \ldots, X_n]$: non-outlier data matrix with $d > n$
• $X_j = (X_{1j}, \ldots, X_{dj})^t$ are i.i.d. $d$-variate random vectors from a population.
• $X'_{d \times n_0} = [X'_1, \ldots, X'_{n_0}]$: outlier data matrix
• $X'_j = (X'_{1j}, \ldots, X'_{dj})^t$ are $d$-variate random vectors from another population.
HDLSS geometric representation - Assumptions
• $X_{(d)} = (X_1, \ldots, X_d)'$ and $X'_{(d)} = (X'_1, \ldots, X'_d)'$: the $d$-variate random vectors for the two clusters.
• Assume the following conditions as $d \to \infty$.
(a) The fourth moments of the entries of the data vectors are uniformly bounded.
(b) $d^{-1} \sum_{j=1}^{d} \mathrm{Var}(X_j) \longrightarrow \sigma^2$
(c) $d^{-1} \sum_{j=1}^{d} \mathrm{Var}(X'_j) \longrightarrow \tau^2$
(d) $d^{-1} \sum_{j=1}^{d} \{E(X_j) - E(X'_j)\}^2 \longrightarrow \mu^2$
(e) There exists a permutation of the entries of the data vectors such that the sequence of variables is $\rho$-mixing for functions that are dominated by quadratics.
HDLSS geometric representation
As $d \to \infty$:
• The data vectors approximately form an $N$-polyhedron.
• Each cluster forms a regular simplex with $n$ and $n_0$ vertices ($X$ and $X'$, respectively).
• The length of an edge connecting data vectors in $X$ (or $X'$) is approximately $\sqrt{2}\,\sigma$ (or $\sqrt{2}\,\tau$) after scaling by $\sqrt{d}$.
• The length of an edge connecting data vectors in different clusters is approximately $\sqrt{\sigma^2 + \tau^2 + \mu^2}$ after scaling by $\sqrt{d}$. (A numeric sketch follows below.)
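These limits are easy to see numerically. A sketch with i.i.d. Gaussian coordinates, a special case satisfying (a)-(e) in which $\sigma^2$, $\tau^2$, $\mu^2$ below equal the limits in (b)-(d):

```python
import numpy as np

# Numeric sketch of the geometric representation with Gaussian entries.
rng = np.random.default_rng(0)
d, sigma, tau, mu = 100_000, 1.0, 1.5, 0.5
X = rng.normal(0.0, sigma, (d, 5))       # cluster X
Xo = rng.normal(mu, tau, (d, 3))         # cluster X' (mean shift mu)

def edge(a, b):                          # pairwise distance / sqrt(d)
    return np.linalg.norm(a - b) / np.sqrt(d)

print(edge(X[:, 0], X[:, 1]))    # ~ sqrt(2)*sigma            = 1.414
print(edge(Xo[:, 0], Xo[:, 1]))  # ~ sqrt(2)*tau              = 2.121
print(edge(X[:, 0], Xo[:, 0]))   # ~ sqrt(sig^2+tau^2+mu^2)   = 1.871
```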
Theorem: Single outlier
• Assume there is a single outlier, i.e., n0 = 1.
• Assumptions (a) - (e) are satisfied.
• Assume that $\mu^2 + \tau^2 > \sigma^2$.
Then the leave-one-out MDP distance for the outlier is larger than the distances for the non-outliers, with probability 1. (A simulation sketch follows below.)
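A hypothetical numeric check of this statement, reusing the `loo_distances` sketch defined earlier; the parameter values are illustrative only.

```python
import numpy as np

# Single-outlier check: here mu^2 + tau^2 = 2 > sigma^2 = 1,
# so the theorem predicts the outlier has the largest MDP distance.
rng = np.random.default_rng(1)
d, n = 2000, 10
X = rng.normal(0.0, 1.0, (d, n))        # non-outliers: sigma^2 = 1
x_out = rng.normal(1.0, 1.0, (d, 1))    # outlier: mu^2 = 1, tau^2 = 1
_, _, d_mdp = loo_distances(np.hstack([X, x_out]))
print(int(np.argmax(d_mdp)) == n)       # expect True: the outlier wins
```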
Theorem: Multiple Outliers
• Assume n > n0.
• Assumptions (a) - (e) are satisfied.
• Let $\tau^2 = c\sigma^2$ for some constant $c > 0$.
• Under any one of the following further assumptions, the leave-one-out MDP outlier detection method detects outliers in the large-$d$ limit, in the sense that, with probability 1, the MDP distance is larger when an outlier is split from the rest than when a non-outlier is split from the rest:
(i) if $c > 1$ and $\mu \geq 0$, or
(ii) if $c = 1$ and $\mu > 0$, or
(iii) if $\frac{n(n_0 - 1)}{n_0(n - 1)} < c < 1$ and $\mu^2 > f_{n,n_0}(c)\,\sigma^2$, where
$f_{n,n_0}(c) = \dfrac{c^2(n-1) - c(n - n_0) - (n_0 - 1)}{n(n_0 - 1) - c\,n_0(n - 1)}.$
How many outliers can MDP handle?
• Assume n > n0.
• Assumptions (a) - (e) are satisfied.
• Let $\tau^2 = c\sigma^2$ for $0 < c < 1$; that is, we have “moderately” isolated outliers.
Then the MDP outlier method detects outliers successfully w.p. 1 if
$n_0 < \dfrac{n}{n(1-c) + c}$ and $\mu^2 > f_{n,n_0}(c)\,\sigma^2.$
(A worked check in code follows below.)
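A quick worked check of the bound and of $f_{n,n_0}(c)$; the helper names are hypothetical.

```python
import math

def f(n, n0, c):
    """f_{n,n0}(c) from the multiple-outlier theorem."""
    num = c**2 * (n - 1) - c * (n - n0) - (n0 - 1)
    den = n * (n0 - 1) - c * n0 * (n - 1)
    return num / den

def max_outliers(n, c):
    """Largest n0 satisfying n0 < n / (n*(1 - c) + c)."""
    return math.ceil(n / (n * (1 - c) + c)) - 1

print(max_outliers(100, 0.9))    # -> 9: at most 9 detectable outliers
print(max_outliers(100, 0.7))    # -> 3
print(round(f(100, 5, 0.9), 3))  # minimum mu^2 / sigma^2 when n0 = 5
```

For $n = 100$ and $c = 0.9$ the bound gives $n_0 < 100/10.9 \approx 9.2$, i.e., up to 9 outliers, which matches the range of $n_0$ in the next figure.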
Example: minimum distance for successful outlier detection
[Figure: two panels, $n = 100$, $\sigma^2 = 1$, with $c = 0.9$ (left) and $c = 0.7$ (right). X-axis: number of outliers $n_0$ (1 to 9); Y-axis: minimum $\mu^2$ for successful detection (0 to 6).]
Detectable areas by MDP and Euclidean(CD)?
MDP vs Euclidean (CD)?
Will this work? - Empirical Study
• $X_j \sim N_d(0, \Sigma)$
• $\Sigma := U \Lambda U^T$
• $U$ is an orthonormal matrix
• $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_p\}$ with $\lambda_j = p^3 j^{-1}$, $j = 1, \ldots, p$. (A sketch of this design follows below.)
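A minimal sketch of this covariance design, assuming the eigenvalue formula $\lambda_j = p^3 j^{-1}$ as read off the slide; the function name is hypothetical.

```python
import numpy as np

def simulate_null(n, p, rng=None):
    """Sketch: draw columns X_j ~ N_p(0, U Lam U^T) with a random
    orthonormal U and lambda_j = p^3 / j (assumed spectrum)."""
    rng = np.random.default_rng(rng)
    U, _ = np.linalg.qr(rng.standard_normal((p, p)))   # orthonormal U
    lam = p**3 / np.arange(1.0, p + 1)                 # decaying spectrum
    Z = rng.standard_normal((p, n))
    return U @ (np.sqrt(lam)[:, None] * Z)             # cov = U Lam U^T
```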
Single Outlier
Multiple Outliers: $n_0 = ?$
Lung Cancer Data: n = 19 and d = 1000
Summary
• A new outlier detection procedure for HDLSS data.
• Graphical tool? The empirical QQ plot.
• Distance? Euclidean, Mahalanobis, and MDP distances.
• High dimensional asymptotic properties of distances.
• Simulation and real data examples are presented to demonstrate
the performance of the proposed method.