Fuzzy clustering-based approach for outlier detection
2010
Abstract
Outlier detection is an important task in a wide variety of application areas. This paper presents an outlier detection method based on fuzzy clustering. We first run the fuzzy c-means (FCM) clustering algorithm. Small clusters are then identified and treated as outlier clusters. Any remaining outliers are detected in the other clusters by temporarily removing one point at a time from the data set and recalculating the objective function (OF); if removing a point causes a noticeable change in the OF, that point is considered an outlier. Experimental results show that the proposed approach performs well when applied to different data sets.
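The removal test described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard fuzzy c-means objective, OF = sum over points i and clusters j of u_ij^m * ||x_i - c_j||^2, with fuzzifier m = 2, and it re-runs the clustering after each removal. The function names, the toy data, and the choice to measure the change as the drop in OF are all assumptions made for illustration; the abstract does not say how large a change counts as "noticeable", so no threshold is applied here.

```python
import numpy as np


def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means. Returns (centers, U) with U of shape (n, c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)            # memberships of a point sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)               # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


def objective(X, centers, U, m=2.0):
    """Standard FCM objective: sum_ij u_ij^m * squared distance."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(((U ** m) * d2).sum())


def of_change_on_removal(X, c, m=2.0):
    """Drop in OF when each point is temporarily removed and FCM is re-run."""
    centers, U = fuzzy_c_means(X, c, m)
    base = objective(X, centers, U, m)
    changes = np.empty(len(X))
    for i in range(len(X)):
        X_wo = np.delete(X, i, axis=0)           # temporary removal of point i
        c_wo, U_wo = fuzzy_c_means(X_wo, c, m)
        changes[i] = base - objective(X_wo, c_wo, U_wo, m)
    return changes


# Toy data (hypothetical): two tight groups plus one isolated point at index 20.
rng = np.random.default_rng(1)
A = rng.normal([0.0, 0.0], 0.3, size=(10, 2))
B = rng.normal([10.0, 0.0], 0.3, size=(10, 2))
X = np.vstack([A, B, [[5.0, 8.0]]])

changes = of_change_on_removal(X, c=2)
suspect = int(np.argmax(changes))                # point whose removal changes OF most
```

Re-running the clustering for every candidate point costs one full FCM run per point, which is acceptable for a sketch but would likely need a cheaper approximation on large data sets.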
Key takeaways
- The proposed fuzzy clustering method effectively identifies outliers by analyzing small clusters.
- Outliers are defined as points that significantly affect the objective function upon removal.
- The method shows effectiveness on data sets including Bupa, detecting 40 outliers versus 48 previously identified.
- A cluster is considered small if it contains fewer points than half the average number of points per cluster across the k clusters.
- The Fuzzy C-Means algorithm is central to the proposed outlier detection approach.
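The small-cluster rule in the takeaways above can be sketched as follows. This is an illustrative reading, not the paper's code: it assumes each point is counted in the cluster where its fuzzy membership is highest (the text here does not say how cluster sizes are counted), and the function name and the membership matrix in the example are hypothetical.

```python
import numpy as np


def small_cluster_points(U):
    """Return (indices of points in small clusters, cluster sizes).

    U is an (n, k) fuzzy membership matrix. Assumption: a point belongs to
    the cluster with its largest membership.
    """
    n, k = U.shape
    labels = U.argmax(axis=1)                    # hard assignment (assumed)
    sizes = np.bincount(labels, minlength=k)
    threshold = 0.5 * n / k                      # half the average cluster size
    small = np.flatnonzero(sizes < threshold)
    return np.flatnonzero(np.isin(labels, small)), sizes


# Hypothetical memberships: 10 points, 3 clusters with 5, 4 and 1 members.
labels = np.array([0] * 5 + [1] * 4 + [2])
U = np.full((10, 3), 0.1)
U[np.arange(10), labels] = 0.8

outlier_idx, sizes = small_cluster_points(U)
```

In this example the average cluster size is 10/3, so the threshold is about 1.67 and only the single-member cluster falls below it.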