Distance Based Outlier Detection for Data Streams Luan Tran, Liyue Fan, Cyrus Shahabi Integrated Media Systems Center University of Southern California Algorithms (cont.) Introduction Motivation: Discovering abnormal phenomena in data streams has become more critical. There are many applications such as: § Micro-Cluster Based Algorithm [3] A data point in each cluster is certainly an inlier § Credit card fraud detection Stream source Based on abnormal transactions Data point Can be assigned to a cluster § Network intrusion prevention Micro clusters If not Based on abnormal log activities § Stock investment tactical planning Based on abnormal changes in price § Event detection in sensor network Micro cluster Micro cluster Use event based approach R/2 Based on abnormal measuring values In this poster, we study the problem of distance-based outlier detection for data streams and some of the most recent algorithms proposed by the scientific community. § Lifespan - Aware Minimal Probing Algorithm (MESI) [4] Only find enough k neighbors, instead of all neighbors, for each data point, to prove it is an inlier or not. Slide 1 Distance Based Outlier Slide 2 Slide 3 Slide 4 Slide 5 Slide 6 W § Data point o is a distance-based outlier if less than k neighbors in a window lie within distance R from o. Preceding next K=3, W=16 (window size) K= 5 Succeeding first Results - Current window: t18 (dash lines) o9 has 4 neighbors (o5, o10, o14, o15) à inlier o11 has 4 neighbors à inlier § Windows PC with Intel core i5, 1.8GHz, 4GB RAM § Datasets: - TAO (Tropical Atmosphere Ocean), real-world, outlier rate = 0.01% - Gaussian, synthetic, outlier rate = 0.1% - Consider window t22 o9 is an inlier, neighbors set = (o10, o14, o15) o11 is an outlier neighbors set = (o13) : o3, o4, o6 expired § CPU time (s) § Peak memory consumption (MB) § Succeeding neighbor of point o is a neighbor that comes after o. § Preceding neighbor of point o is a neighbor that comes before o. § Safe inlier is a point that never becomes an outlier: has at least k succeeding neighbors § Naïve approach: For each data point, store a list of neighbors. Preceding neighbor k Preceding neighbor1 Succeeding neighbor 1 Succeeding neighbor K Algorithms § Exact Storm [1] For each data point, store number of succeeding neighbors and list of preceding neighbors Preceding neighbor k Preceding neighbor2 Preceding neighbor 1 Number of succeeding neighbors § Abstract-C [2] For each point, store total number of neighbors in every window that it participates in. For red point : <W1: 3, W2: 3> W2 Conclusion and Future Work § Observations: - Micro-cluster based algorithm is the best at CPU time and memory consumption. - MESI, Abstract-C are comparable when window size is small. § Future work: W1 § Event Based Approach (LUE) [3] Use a priority queue to store un-safe inliers. Poll from event queue when a data object expires. Outlier list If it is an outlier Stream source Data point - Complete evaluation with more real-world datasets and larger parameter ranges - Identify the characteristics of real-world outliers and propose efficient methods References [1] Fabrizio Angiulli and Fabio Fassetti. Detecting distance-based outliers in streams of data. CIKM '07. If it is a unsafe inlier [2] Di Yang, Elke A. Rundensteiner, and Matthew O. Ward. Neighbor-based pattern detection for windows over streaming data. EDBT '09. [3] Maria Kontaki, Anastasios Gounaris, Apostolos N. Papadopoulos, Kostas Tsichlas, and Yannis Manolopoulos. Continuous monitoring of distance-based outliers over data streams. ICDE’11. Priority queue [4] Lei Cao, Di Yang, Qingyang Wang, Yanwei Yu, Jiayuan Wang, and Elke A. Rundensteiner. Scalable distance-based outlier detection over high-volume data streams. ICDE’14. IMSC Retreat 2015
© Copyright 2024