Distance-Based Outlier Detection: Consolidation and Renewed Bearing

2010, Proceedings of the VLDB Endowment

https://doi.org/10.14778/1920841.1921021

Abstract

Detecting outliers in data is an important problem with interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Over the last decade of research, distance-based outlier detection algorithms have emerged as a viable, scalable, parameter-free alternative to the more traditional statistical approaches.
