Ranking Outliers Using Symmetric Neighborhood Relationship
2006, Advances in Knowledge Discovery and Data Mining
https://doi.org/10.1007/11731139_68
Abstract
Mining outliers in a database means finding exceptional objects that deviate from the rest of the data set. Besides classical outlier analysis algorithms, recent studies have focused on mining local outliers, i.e., objects whose density distribution differs significantly from that of their neighborhood. The density distribution at the location of an object has so far been estimated from the density distribution of its k-nearest neighbors [2, 11]. However, when an outlier lies where the density distributions in the neighborhood differ significantly, for example when an object from a sparse cluster is close to a denser cluster, this may result in a wrong estimate. To avoid this problem, we propose a simple but effective measure of local outliers based on a symmetric neighborhood relationship. The proposed measure considers both the neighbors and the reverse neighbors of an object when estimating its density distribution. As a result, the outliers discovered are more meaningful. To compute such local outliers efficiently, several mining algorithms are developed that detect top-n outliers based on our definition. A comprehensive performance evaluation and analysis shows that our methods are not only efficient in computation but also more effective in ranking outliers.
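The idea of scoring an object against both its neighbors and its reverse neighbors can be sketched in a few lines. The sketch below is illustrative only and rests on assumptions not stated in the abstract: the density of an object is taken as the inverse of its k-distance, the "influence space" is the union of its k-nearest neighbors and its reverse k-nearest neighbors, and the score is the average density over that space divided by the object's own density. The function names (`knn_with_kdist`, `inflo_scores`, `top_n_outliers`) are hypothetical, the search is brute force, ties at the k-distance are cut to exactly k neighbors, and all points are assumed distinct.

```python
import math

def knn_with_kdist(points, k):
    """Brute-force k-nearest neighbors: for each point, return
    (list of neighbor indices, distance to the k-th neighbor)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        result.append(([j for _, j in nearest], nearest[-1][0]))
    return result

def inflo_scores(points, k):
    """Score each point by (mean density of neighbors and reverse neighbors)
    divided by its own density. Assumes density = 1 / k-distance."""
    knn = knn_with_kdist(points, k)
    density = [1.0 / kdist for _, kdist in knn]  # assumes no duplicate points

    # Reverse neighbors of j: every i that lists j among its k nearest neighbors.
    reverse = [set() for _ in points]
    for i, (neighbors, _) in enumerate(knn):
        for j in neighbors:
            reverse[j].add(i)

    scores = []
    for i, (neighbors, _) in enumerate(knn):
        influence = set(neighbors) | reverse[i]  # symmetric neighborhood
        mean_density = sum(density[j] for j in influence) / len(influence)
        scores.append(mean_density / density[i])
    return scores

def top_n_outliers(points, k, n):
    """Indices of the n points with the highest scores."""
    scores = inflo_scores(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]
```

A point deep inside a uniform cluster scores close to 1 under these assumptions, while an isolated point, which has distant neighbors and few or no reverse neighbors, scores much higher.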
References (40)
- 1. Let $\{d_{Min}(p, MC_1), \ldots, d_{Min}(p, MC_{l+n-1})\}$ be sorted in increasing order. Then a lower bound on the k-distance of $p$, denoted $min_{kdist}(p)$, is $d_{Min}(p, MC_i)$ such that $n_1 + \cdots + n_i \ge k$ and $n_1 + \cdots + n_{i-1} < k$.
- 2. Let $\{d_{Max}(p, MC_1), \ldots, d_{Max}(p, MC_{l+n-1})\}$ be sorted in increasing order. Then an upper bound on the k-distance of $p$, denoted $max_{kdist}(p)$, is $d_{Max}(p, MC_i)$ such that $n_1 + \cdots + n_i \ge k$ and $n_1 + \cdots + n_{i-1} < k$.
- The following is the micro-cluster based algorithm for mining top-n local outliers.
- Algorithm 3: Micro-cluster method.
- Input: A set of micro-clusters MC_1, ..., MC_l, M.
- Output: Top-n INFLO of D.
- Method:
- FOR each micro-cluster MC_i DO
- Get Max/Min of kdist(p); // based on Theorem 2
- IF Min_kdist(p) < Min_kdist(MC_i) THEN
- Min_kdist(MC_i) = Min_kdist(p);
- IF Max_kdist(p) > Max_kdist(MC_i) THEN
- Max_kdist(MC_i) = Max_kdist(p);
- FOR each micro-cluster MC_i DO
- count = |RNN_k(MC_i)|;
- IF unvisited(MC_i) THEN
- S = getKNN(MC_i);
- unvisited(MC_i) = FALSE;
- ELSE
- S = KNN(MC_i);
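The two k-distance bounds stated before the algorithm can be illustrated with a short sketch. It assumes that n_i denotes the number of points in micro-cluster MC_i, and, purely for illustration, that each micro-cluster is a sphere with a center and a radius, so that the smallest and largest possible distances from p to any of its points are the center distance minus and plus the radius. The excerpt does not say how these distances are obtained, so the spherical model and the names `k_distance_bound` and `k_distance_bounds` are hypothetical.

```python
import math

def k_distance_bound(bounds_and_counts, k):
    """Given (distance bound, point count) pairs, one per micro-cluster,
    sort by bound and return the first bound at which the cumulative
    point count reaches k."""
    covered = 0
    for bound, count in sorted(bounds_and_counts):
        covered += count
        if covered >= k:
            return bound
    raise ValueError("micro-clusters contain fewer than k points in total")

def k_distance_bounds(p, micro_clusters, k):
    """Lower and upper bounds on the k-distance of p.

    micro_clusters: list of (center, radius, n_points) triples.
    Assumption: micro-clusters are spheres, so the distance from p to any
    member lies between dist(p, center) - radius and dist(p, center) + radius.
    """
    lower_pairs, upper_pairs = [], []
    for center, radius, n_points in micro_clusters:
        d = math.dist(p, center)
        lower_pairs.append((max(0.0, d - radius), n_points))
        upper_pairs.append((d + radius, n_points))
    return k_distance_bound(lower_pairs, k), k_distance_bound(upper_pairs, k)

# Usage: three micro-clusters on a line, query point p = (2, 0), k = 5.
clusters = [((0.0, 0.0), 1.0, 3), ((5.0, 0.0), 1.0, 4), ((10.0, 0.0), 2.0, 10)]
print(k_distance_bounds((2.0, 0.0), clusters, 5))  # (2.0, 4.0)
```

In the usage example the nearest micro-cluster holds only 3 points, so the fifth neighbor cannot be found before the second micro-cluster is reached. Its minimum and maximum distances, 2.0 and 4.0, therefore bracket the true k-distance without any individual point being examined.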
- C. Aggarwal and P. Yu: Outlier Detection for High Dimensional Data. SIGMOD 2001
- M. M. Breunig, H. P. Kriegel, R. T. Ng, and J. Sander: LOF: Identifying Density-based Local Outliers. SIGMOD 2000
- D. Chakrabarti: AutoPart: Parameter-Free Graph Partitioning and Outlier Detection. PKDD 2004
- Z. X. Chen, A. W. Fu, J. Tang: On Complementarity of Cluster and Outlier Detection Schemes. DaWaK 2003
- A. L. Chiu, A. W. Fu: Enhancements on Local Outlier Detection. IDEAS 2003
- M. Ester, H. P. Kriegel et al.: A Density-based Algorithm for Discovering Clusters in Large Spatial Databases. KDD 1996
- S. Guha, R. Rastogi, and K. Shim: CURE: An Efficient Clustering Algorithm for Large Databases. SIGMOD 1998
- V. Hautamäki, I. Kärkkäinen and P. Fränti: Outlier Detection Using k-nearest Neighbour Graph. ICPR 2004
- J. W. Han, M. Kamber: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers
- H. Jagadish, N. Koudas, and S. Muthukrishnan: Mining Deviants in a Time Series Database. VLDB 1999
- W. Jin, A. K. H. Tung and J. W. Han: Mining Top-n Local Outliers in Large Databases. KDD 2001
- E. Knorr, R. Ng: Algorithms for Mining Distance-Based Outliers in Large Datasets. VLDB 1998
- E. Knorr and R. Ng: Finding Intensional Knowledge of Distance-Based Outliers. VLDB 1999
- F. Korn and S. Muthukrishnan: Influence Sets Based on Reverse Nearest Neighbor Queries. SIGMOD 2000
- S. Muthukrishnan, R. Shah, J. S. Vitter: Mining Deviants in Time Series Data Streams. SSDBM 2004
- R. Ng and J. W. Han: Efficient and Effective Clustering Method for Spatial Data Mining. VLDB 1994
- S. Papadimitriou, H. Kitagawa et al.: LOCI: Fast Outlier Detection Using the Local Correlation Integral. ICDE 2003
- S. Papadimitriou, C. Faloutsos: Cross-Outlier Detection. SSTD 2003
- N. Roussopoulos, S. Kelley and F. Vincent: Nearest neighbor queries. SIGMOD 1995
- S. Ramaswamy, R. Rastogi, K. Shim: Efficient Algorithms for Mining Outliers from Large Data Sets. SIGMOD 2000
- S. Shekhar, C. T. Lu, P. S. Zhang: Detecting Graph-based Spatial Outliers. KDD 2001
- J. Tang, Z. X. Chen et al.: Enhancing Effectiveness of Outlier Detections for Low Density Patterns. PAKDD 2002
- W. K. Wong, A. W. Moore et al.: Rule-Based Anomaly Pattern Detection for Detecting Disease Outbreaks. AAAI 2002
- M. L. Yiu, N. Mamoulis: Clustering Objects on a Spatial Network. SIGMOD 2004
- M. L. Yiu et al.: Aggregate Nearest Neighbor Queries in Road Networks. IEEE Trans. Knowl. Data Eng. 17(6), 2005
- T. Zhang et al.: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD 1996