A Clustering-Based Algorithm for Data Reduction
2009
Abstract
Finding an efficient data reduction method for large-scale problems is an important task. In this paper, we propose a similarity-based self-constructing fuzzy clustering algorithm that samples instances for the classification task. Instances that are similar to each other are grouped into the same cluster. Once all the instances have been fed in, a number of clusters have been formed automatically. The statistical mean of each cluster is then taken to represent all the instances covered by that cluster. This approach has two advantages. First, it can be faster and use less memory. Second, the number of representative instances need not be specified in advance by the user. Experiments on real-world datasets show that our method runs faster and achieves a better reduction rate than other methods.
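The procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the membership function or the cluster-creation rule, so the Gaussian similarity, the fixed width `sigma`, the threshold `rho`, the per-class clustering, and the function name `reduce_instances` are all assumptions made for illustration.

```python
import math


def reduce_instances(instances, labels, rho=0.5, sigma=1.0):
    """One-pass incremental clustering (illustrative sketch only).

    Assumptions not taken from the paper: Gaussian similarity with a
    fixed width `sigma`, a similarity threshold `rho`, and clusters that
    only ever contain instances of a single class.
    Returns a list of (cluster_mean, class_label) representatives.
    """
    clusters = []  # each cluster: {"label": ..., "n": count, "mean": [...]}
    for x, y in zip(instances, labels):
        best, best_sim = None, 0.0
        for c in clusters:
            if c["label"] != y:
                continue  # assumed: compare only with clusters of the same class
            d2 = sum((a - m) ** 2 for a, m in zip(x, c["mean"]))
            sim = math.exp(-d2 / (2 * sigma ** 2))
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= rho:
            # Similar enough: absorb the instance and update the running mean.
            best["n"] += 1
            n = best["n"]
            best["mean"] = [m + (a - m) / n for m, a in zip(best["mean"], x)]
        else:
            # No sufficiently similar cluster: a new one is created automatically.
            clusters.append({"label": y, "n": 1, "mean": [float(a) for a in x]})
    return [(c["mean"], c["label"]) for c in clusters]


if __name__ == "__main__":
    X = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0), (0.1, 0.0)]
    y = [0, 0, 1, 1, 0]
    for mean, label in reduce_instances(X, y):
        print(label, mean)
```

In this sketch the number of representatives is never passed in; it follows from `rho` and the data, which mirrors the second advantage claimed above. Only a running mean and a count are stored per cluster, so the instances themselves need not be kept in memory.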