Distance Based Outlier Detection for Data Streams

Luan Tran, Liyue Fan, Cyrus Shahabi
Integrated Media Systems Center
University of Southern California
Introduction
Motivation: Discovering abnormal phenomena in data streams has
become increasingly critical. Applications include:
§  Credit card fraud detection
Based on abnormal transactions
§  Network intrusion prevention
Based on abnormal log activities
§  Stock investment tactical planning
Based on abnormal changes in price
§  Event detection in sensor networks
Based on abnormal measured values
In this poster, we study the problem of distance-based outlier
detection for data streams and some of the most recent algorithms
proposed by the scientific community.
Algorithms (cont.)
§  Micro-Cluster Based Algorithm [3]
A data point in a micro-cluster is certainly an inlier.
[Figure: each data point from the stream source is assigned to a micro-cluster (radius R/2) if possible; if not, the event-based approach is used.]
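The micro-cluster idea [3] can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `MicroCluster` and `assign` are hypothetical, points are coordinate tuples, and a point is assigned to the nearest cluster whose center lies within R/2. Because every member is within R/2 of the center, any two members are within R of each other, which is why membership can certify an inlier once the cluster holds enough points (assumed here to be at least k+1).

```python
import math

class MicroCluster:
    """Hypothetical container: a center plus the points assigned to it."""
    def __init__(self, center):
        self.center = center
        self.members = [center]

def assign(point, clusters, R):
    """Add point to the nearest cluster whose center is within R/2.

    Returns the chosen cluster, or None when no cluster qualifies; the
    caller would then fall back to the event-based approach.
    """
    candidates = [c for c in clusters if math.dist(point, c.center) <= R / 2]
    if not candidates:
        return None
    best = min(candidates, key=lambda c: math.dist(point, c.center))
    best.members.append(point)
    return best

clusters = [MicroCluster((0.0, 0.0))]
assigned = assign((0.5, 0.5), clusters, R=2.0)    # inside radius R/2 = 1
unassigned = assign((3.0, 3.0), clusters, R=2.0)  # too far from every center
```

Any two members of one cluster are at most R apart by the triangle inequality, so no range query is needed for points that land in a sufficiently large cluster.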
§  Lifespan-Aware Minimal Probing Algorithm (MESI) [4]
For each data point, find just enough neighbors (k), instead of all
neighbors, to decide whether it is an inlier.
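Minimal probing can be illustrated with a short sketch. It assumes one-dimensional values held in arrival order and a hypothetical function `probe`; scanning the newest points first is also an assumption made here, chosen because the newest neighbors stay in the window the longest. The scan stops as soon as k neighbors have been found.

```python
def probe(window, idx, R, k):
    """Scan the window newest-first and stop once k neighbors are found.

    Returns (is_inlier, evidence, probes): evidence holds the indices of
    the neighbors found, probes counts the distance computations made.
    """
    o = window[idx]
    evidence = []
    probes = 0
    for j in range(len(window) - 1, -1, -1):
        if j == idx:
            continue
        probes += 1
        if abs(window[j] - o) <= R:
            evidence.append(j)
            if len(evidence) == k:
                return True, evidence, probes  # enough evidence, stop early
    return False, evidence, probes             # full scan, fewer than k found

window = [1.0, 1.2, 5.0, 1.1, 0.9, 1.3]
inlier_result = probe(window, idx=1, R=0.5, k=2)
outlier_result = probe(window, idx=2, R=0.5, k=2)
```

In this example the point at index 1 is confirmed as an inlier after two probes, while the point at index 2 needs a full scan before it can be declared an outlier.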
Distance Based Outlier
[Figure: a window of size W sliding over the stream, Slide 1 to Slide 6.]
§  Data point o is a distance-based outlier if fewer than k points in the
window lie within distance R of o. Such points are the neighbors of o.
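The definition can be checked directly with a brute-force scan of one window. This is only a sketch: the function names are hypothetical, points are coordinate tuples, Euclidean distance is assumed, and "within distance R" is taken to include distance exactly R.

```python
import math

def count_neighbors(window, idx, R):
    """Count the other points in the window within distance R of window[idx]."""
    o = window[idx]
    return sum(1 for j, p in enumerate(window)
               if j != idx and math.dist(o, p) <= R)

def is_outlier(window, idx, R, k):
    """A point is an outlier when it has fewer than k neighbors."""
    return count_neighbors(window, idx, R) < k

window = [(0.0,), (0.5,), (1.0,), (10.0,)]
flags = [is_outlier(window, i, R=1.0, k=2) for i in range(len(window))]
```

Each check costs one pass over the window, which is the cost that the algorithms in this poster try to avoid repeating on every slide.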
Example with K=3, W=16 (window size):
- Current window: t18 (dashed lines)
o9 has 4 neighbors (o5, o10, o14, o15) → inlier
o11 has 4 neighbors → inlier
- Consider window t22
o9 is an inlier, neighbor set = (o10, o14, o15)
o11 is an outlier, neighbor set = (o13): o3, o4, o6 expired
[Figure labels: K=5; succeeding first, preceding next.]
Results
§ Windows PC with Intel Core i5, 1.8 GHz, 4 GB RAM
§ Datasets:
- TAO (Tropical Atmosphere Ocean), real-world, outlier rate = 0.01%
- Gaussian, synthetic, outlier rate = 0.1%
§ CPU time (s)
§ Peak memory consumption (MB)
§  A succeeding neighbor of point o is a neighbor that arrives after o.
§  A preceding neighbor of point o is a neighbor that arrives before o.
§  A safe inlier is a point that can never become an outlier: it has at least k
succeeding neighbors, and none of them expires before the point itself does.
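These three definitions translate into a few lines of code. The sketch below assumes one-dimensional values stored in arrival order, so a smaller index means an earlier arrival; `split_neighbors` and `is_safe_inlier` are hypothetical names.

```python
def split_neighbors(stream, idx, R):
    """Return (preceding, succeeding) neighbor indices of stream[idx]."""
    o = stream[idx]
    preceding = [j for j in range(idx) if abs(stream[j] - o) <= R]
    succeeding = [j for j in range(idx + 1, len(stream))
                  if abs(stream[j] - o) <= R]
    return preceding, succeeding

def is_safe_inlier(stream, idx, R, k):
    """Safe inlier: at least k neighbors arrived after the point."""
    _, succeeding = split_neighbors(stream, idx, R)
    return len(succeeding) >= k

stream = [1.0, 1.2, 5.0, 1.1, 0.9, 1.3]
neighbors_of_1 = split_neighbors(stream, 1, R=0.5)
```

A safe inlier needs no further bookkeeping, since its k succeeding neighbors remain in every window that still contains the point.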
§  Naïve approach: For each data point, store a list of neighbors.
[Figure: the stored list, from preceding neighbor k to preceding neighbor 1 and from succeeding neighbor 1 to succeeding neighbor K.]
Algorithms
§  Exact Storm [1]
For each data point, store the number of succeeding neighbors and the list of
preceding neighbors.
[Figure: per-point record holding preceding neighbors 1 to k and the number of succeeding neighbors.]
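A rough sketch of this bookkeeping, under stated assumptions, is given below. It is not the published Exact-Storm code: the class names are hypothetical, values are one-dimensional, one point arrives per time unit, and the window at time `now` holds the points that arrived in (now - W, now]. Preceding neighbors are stored by arrival time so that expired ones can be ignored, while succeeding neighbors only need a counter because they never expire before the point itself.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    value: float
    time: int
    succ_count: int = 0                            # succeeding neighbors
    preceding: list = field(default_factory=list)  # arrival times of preceding neighbors

class StormSketch:
    def __init__(self, R, k, W):
        self.R, self.k, self.W = R, k, W
        self.window = []  # nodes in arrival order

    def insert(self, value, time):
        # Drop expired nodes, then run one range query for the new point.
        self.window = [n for n in self.window if n.time > time - self.W]
        node = Node(value, time)
        for q in self.window:
            if abs(q.value - value) <= self.R:
                q.succ_count += 1
                node.preceding.append(q.time)
        self.window.append(node)

    def outliers(self, now):
        """Arrival times of the outliers in the window ending at `now`."""
        start = now - self.W
        result = []
        for n in self.window:
            alive = sum(1 for t in n.preceding if t > start)
            if alive + n.succ_count < self.k:
                result.append(n.time)
        return result

s = StormSketch(R=1.0, k=2, W=4)
for t, v in enumerate([0.0, 0.5, 10.0, 0.2, 0.4], start=1):
    s.insert(v, t)
after_t5 = s.outliers(5)
s.insert(20.0, 6)
after_t6 = s.outliers(6)
```

In the run above, the points that arrived at times 4 and 5 turn into outliers at time 6 because the preceding neighbor from time 2 has expired.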
§  Abstract-C [2]
For each point, store the total number of neighbors in every window that it
participates in. For the red point: <W1: 3, W2: 3>
[Figure: the red point in two overlapping windows, W1 and W2.]
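The per-window counters can be sketched as below. This is an illustration under assumptions rather than the algorithm from [2] as published: `AbstractCSketch` is a hypothetical name, values are one-dimensional, the window slides by one point at a time, and a window is identified by its end time. When a new point finds a neighbor, both counters are incremented for every window the two points share, so no neighbor lists are kept.

```python
class AbstractCSketch:
    def __init__(self, R, k, W):
        self.R, self.k, self.W = R, k, W
        self.points = []  # (arrival time, value, {window end time: neighbor count})

    def insert(self, value, t):
        # Keep only points that are still inside the window ending at t.
        self.points = [p for p in self.points if p[0] > t - self.W]
        # The new point participates in the windows ending at t .. t+W-1.
        counts = {w: 0 for w in range(t, t + self.W)}
        for tq, vq, cq in self.points:
            if abs(vq - value) <= self.R:
                # Both points are alive in the windows ending at t .. tq+W-1.
                for w in range(t, tq + self.W):
                    cq[w] += 1
                    counts[w] += 1
        self.points.append((t, value, counts))

    def outliers(self, now):
        """Outliers of the window ending at `now` (call right after insert)."""
        return [t for t, _, c in self.points if c[now] < self.k]

a = AbstractCSketch(R=1.0, k=2, W=4)
for t, v in enumerate([0.0, 0.5, 10.0, 0.2, 0.4], start=1):
    a.insert(v, t)
after_t5 = a.outliers(5)
future_count = a.points[-2][2][6]  # neighbors of the time-4 point in window 6
a.insert(20.0, 6)
after_t6 = a.outliers(6)
```

The counter for a future window is already final with respect to the points seen so far, so expiration requires no extra work: the detector simply reads the entry for the current window.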
Conclusion and Future Work
§  Observations:
- The micro-cluster based algorithm is the best in CPU time and memory consumption.
- MESI and Abstract-C are comparable when the window size is small.
§  Future work:
- Complete the evaluation with more real-world datasets and larger parameter ranges
- Identify the characteristics of real-world outliers and propose efficient methods
§  Event Based Approach (LUE) [3]
Use a priority queue to store unsafe inliers. Poll from the event
queue when a data object expires.
[Figure: each data point from the stream source goes to the outlier list if it is an outlier, or to the priority queue if it is an unsafe inlier.]
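The event-based approach (LUE) [3] can be sketched with a binary heap as the priority queue. Everything named here is an assumption for illustration: the classes `Point` and `EventSketch`, one-dimensional values, one arrival per time unit, and a window at time `now` that holds arrivals in (now - W, now]. An unsafe inlier is queued with the time at which its oldest preceding neighbor expires, and it is re-checked only when that time is reached.

```python
import heapq

class Point:
    def __init__(self, value, time):
        self.value, self.time = value, time
        self.succ = 0   # number of succeeding neighbors
        self.prec = []  # arrival times of preceding neighbors, oldest first

class EventSketch:
    def __init__(self, R, k, W):
        self.R, self.k, self.W = R, k, W
        self.window = {}       # arrival time -> Point, for points still alive
        self.events = []       # min-heap of (check time, arrival time)
        self.outliers = set()  # arrival times of the current outliers

    def _classify(self, p, now):
        p.prec = [t for t in p.prec if t > now - self.W]  # drop expired neighbors
        if p.succ + len(p.prec) < self.k:
            self.outliers.add(p.time)
        else:
            self.outliers.discard(p.time)
            if p.succ < self.k:  # unsafe inlier: re-check at the next expiry
                heapq.heappush(self.events, (p.prec[0] + self.W, p.time))

    def insert(self, value, now):
        for t in [t for t in self.window if t <= now - self.W]:
            del self.window[t]            # expire old points
            self.outliers.discard(t)
        while self.events and self.events[0][0] <= now:
            _, t = heapq.heappop(self.events)  # poll the events that are due
            if t in self.window:
                self._classify(self.window[t], now)
        new = Point(value, now)
        for q in self.window.values():
            if abs(q.value - value) <= self.R:
                q.succ += 1
                new.prec.append(q.time)
                if q.time in self.outliers:    # an outlier may turn into an inlier
                    self._classify(q, now)
        self.window[now] = new
        self._classify(new, now)

e = EventSketch(R=1.0, k=2, W=4)
history = []
for t, v in enumerate([0.0, 0.5, 10.0, 0.2, 0.4, 20.0], start=1):
    e.insert(v, t)
    history.append(set(e.outliers))
```

Safe inliers never enter the queue, and outliers are revisited only when a new neighbor arrives, so most points are left untouched on each slide.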
References
[1] Fabrizio Angiulli and Fabio Fassetti. Detecting distance-based outliers in streams of data. CIKM '07.
[2] Di Yang, Elke A. Rundensteiner, and Matthew O. Ward. Neighbor-based pattern detection for windows
over streaming data. EDBT '09.
[3] Maria Kontaki, Anastasios Gounaris, Apostolos N. Papadopoulos, Kostas Tsichlas, and Yannis
Manolopoulos. Continuous monitoring of distance-based outliers over data streams. ICDE '11.
[4] Lei Cao, Di Yang, Qingyang Wang, Yanwei Yu, Jiayuan Wang, and Elke A. Rundensteiner. Scalable
distance-based outlier detection over high-volume data streams. ICDE '14.
IMSC Retreat 2015