Ranking Outliers Using Symmetric Neighborhood Relationship

2006, Advances in Knowledge Discovery and Data Mining

https://doi.org/10.1007/11731139_68

Abstract

Outlier mining in databases aims to find exceptional objects that deviate from the rest of the data set. Besides classical outlier analysis algorithms, recent studies have focused on mining local outliers, i.e., outliers whose density distribution differs significantly from that of their neighborhood. The density distribution at the location of an object has so far been estimated from the density distribution of its k-nearest neighbors [2, 11]. However, when outliers lie where the density distributions in the neighborhood differ significantly, for example, objects from a sparse cluster close to a denser cluster, this may result in a wrong estimate. To avoid this problem, we propose a simple but effective measure of local outliers based on a symmetric neighborhood relationship. The proposed measure considers both the neighbors and the reverse neighbors of an object when estimating its density distribution. As a result, the outliers so discovered are more meaningful. To compute such local outliers efficiently, several mining algorithms are developed that detect the top-n outliers based on our definition. A comprehensive performance evaluation and analysis shows that our methods are not only efficient in computation but also more effective in ranking outliers.
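
The abstract describes the measure only in words, so the following is a minimal sketch rather than the authors' implementation. It assumes the usual formulation of this score: the density of an object is the inverse of its k-distance, its influence space is the union of its k-nearest neighbors and its reverse k-nearest neighbors, and the score (INFLO) is the average density over the influence space divided by the object's own density. These definitions, the brute-force neighbor search, and the names `inflo_scores` and `top_n_inflo` are assumptions made for illustration; none of them appears in the text above.

```python
import math


def inflo_scores(points, k):
    """Return one INFLO-style score per point (higher means more outlying).

    Assumed definitions (not given in the abstract):
      density(p)   = 1 / k-distance(p)
      influence(p) = kNN(p) union reverse-kNN(p)
      score(p)     = mean density over influence(p) / density(p)
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    kdist = []  # k-distance of each point
    knn = []    # indices within the k-distance (ties included)
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        kd = dists[k - 1][0]
        kdist.append(kd)
        knn.append([j for d, j in dists if d <= kd])

    # Assumes no point has k exact duplicates, so every k-distance is > 0.
    density = [1.0 / kd for kd in kdist]

    # q is a reverse neighbor of p when p is among q's k-nearest neighbors.
    rnn = [set() for _ in range(n)]
    for i, neighbors in enumerate(knn):
        for j in neighbors:
            rnn[j].add(i)

    scores = []
    for i in range(n):
        influence = set(knn[i]) | rnn[i]
        avg_density = sum(density[j] for j in influence) / len(influence)
        scores.append(avg_density / density[i])
    return scores


def top_n_inflo(points, k, n):
    """Return the n (index, score) pairs with the highest scores."""
    scores = inflo_scores(points, k)
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    # A 3x3 grid of points plus one isolated point far from the grid.
    grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
    data = grid + [(10.0, 10.0)]
    print(top_n_inflo(data, k=3, n=1))
```

The brute-force search costs quadratic time in the number of points and is meant only to make the neighbor and reverse-neighbor bookkeeping explicit; an index-based or micro-cluster-based search would replace it for large data sets.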

  1. Let $\{d_{Min}(p, MC_1), \ldots, d_{Min}(p, MC_{l+n-1})\}$ be sorted in increasing order. Then a lower bound on the k-distance of $p$, denoted $Min_{kdist}(p)$, is $d_{Min}(p, MC_i)$ such that $n_1 + \cdots + n_i \ge k$ and $n_1 + \cdots + n_{i-1} < k$.
  2. Let $\{d_{Max}(p, MC_1), \ldots, d_{Max}(p, MC_{l+n-1})\}$ be sorted in increasing order. Then an upper bound on the k-distance of $p$, denoted $Max_{kdist}(p)$, is $d_{Max}(p, MC_i)$ such that $n_1 + \cdots + n_i \ge k$ and $n_1 + \cdots + n_{i-1} < k$.

The following is the micro-cluster based algorithm for mining top-n local outliers.

Algorithm 3 Micro-cluster method.
Input: A set of micro-clusters MC_1, ..., MC_l, M.
Output: Top-n INFLO of D.
Method:
    FOR each micro-cluster MC_i DO
        Get Max/Min of kdist(p); // based on Theorem 2
        IF Min_kdist(p) < Min_kdist(MC_i) THEN
            Min_kdist(MC_i) = Min_kdist(p);
        IF Max_kdist(p) > Max_kdist(MC_i) THEN
            Max_kdist(MC_i) = Max_kdist(p);
    FOR each micro-cluster MC_i DO
        count = |RNN_k(MC_i)|;
        IF unvisited(MC_i) THEN
            S = getKNN(MC_i);
            unvisited(MC_i) = FALSE;
        ELSE
            S = KNN(MC_i);
        ...

References

  1. C. Aggarwal and P. Yu: Outlier Detection for High Dimensional Data. SIGMOD 2001
  2. M. M. Breunig, H. P. Kriegel, R. T. Ng, and J. Sander: LOF: Identifying Density-based Local Outliers. SIGMOD 2000
  3. D. Chakrabarti: AutoPart: Parameter-Free Graph Partitioning and Outlier Detection. PKDD 2004
  4. Z. X. Chen, A. W. Fu, J. Tang: On Complementarity of Cluster and Outlier Detection Schemes. DaWaK 2003
  5. A. L. Chiu, A. W. Fu: Enhancements on Local Outlier Detection. IDEAS 2003
  6. M. Ester, H. P. Kriegel et al.: A Density-based Algorithm for Discovering Clusters in Large Spatial Databases. KDD 1996
  7. S. Guha, R. Rastogi, and K. Shim: CURE: An Efficient Clustering Algorithm for Large Databases. SIGMOD 1998
  8. V. Hautamäki, I. Kärkkäinen and P. Fränti: Outlier Detection Using k-nearest Neighbour Graph. ICPR 2004
  9. J. W. Han, M. Kamber: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers
  10. H. Jagadish, N. Koudas, and S. Muthukrishnan: Mining Deviants in a Time Series Database. VLDB 1999
  11. W. Jin, A. K. H. Tung and J. W. Han: Mining Top-n Local Outliers in Large Databases. KDD 2001
  12. E. Knorr, R. Ng: Algorithms for Mining Distance-Based Outliers in Large Datasets. VLDB 1998
  13. E. Knorr and R. Ng: Finding Intensional Knowledge of Distance-Based Outliers. VLDB 1999
  14. F. Korn and S. Muthukrishnan: Influence Sets Based on Reverse Nearest Neighbor Queries. SIGMOD 2000
  15. S. Muthukrishnan, R. Shah, J. S. Vitter: Mining Deviants in Time Series Data Streams. SSDBM 2004
  16. R. Ng and J. W. Han: Efficient and Effective Clustering Method for Spatial Data Mining. VLDB 1994
  17. S. Papadimitriou, H. Kitagawa et al.: LOCI: Fast Outlier Detection Using the Local Correlation Integral. ICDE 2003
  18. S. Papadimitriou, C. Faloutsos: Cross-Outlier Detection. SSTD 2003
  19. N. Roussopoulos, S. Kelley and F. Vincent: Nearest Neighbor Queries. SIGMOD 1995
  20. S. Ramaswamy, R. Rastogi, K. Shim: Efficient Algorithms for Mining Outliers from Large Data Sets. SIGMOD 2000
  21. S. Shekhar, C. T. Lu, P. S. Zhang: Detecting Graph-based Spatial Outliers. KDD 2001
  22. J. Tang, Z. X. Chen et al.: Enhancing Effectiveness of Outlier Detections for Low Density Patterns. PAKDD 2002
  23. W. K. Wong, A. W. Moore et al.: Rule-Based Anomaly Pattern Detection for Detecting Disease Outbreaks. AAAI 2002
  24. M. L. Yiu, N. Mamoulis: Clustering Objects on a Spatial Network. SIGMOD 2004
  25. M. L. Yiu et al.: Aggregate Nearest Neighbor Queries in Road Networks. IEEE Trans. Knowl. Data Eng. 17(6), 2005
  26. T. Zhang et al.: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD 1996