Semantic Smoothing for Model-based Document Clustering
2006, Sixth International Conference on Data Mining (ICDM'06)
https://doi.org/10.1109/ICDM.2006.142
Abstract
A document is often full of class-independent "general" words and short of class-specific "core" words, which makes document clustering difficult. We argue that both problems can be relieved by suitable smoothing of document models in agglomerative approaches and of cluster models in partitional approaches, which in turn improves clustering quality. To the best of our knowledge, most model-based clustering approaches use Laplacian smoothing to prevent zero probabilities, while most similarity-based approaches employ the heuristic TF*IDF scheme to discount the effect of "general" words. Inspired by a series of statistical translation language models for text retrieval, we propose in this paper a novel smoothing method, referred to as context-sensitive semantic smoothing, for document clustering. Comparative experiments on three datasets show that semantic smoothing in conjunction with model-based clustering approaches is effective in improving cluster quality.
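As a point of reference for the baseline mentioned above, the following is a minimal sketch of Laplacian (add-alpha) smoothing for a unigram cluster model. It is not the semantic smoothing method proposed in the paper; the function names, the toy vocabulary, and the choice of `alpha=1.0` are illustrative assumptions.

```python
from collections import Counter
import math


def laplace_unigram(tokens, vocab, alpha=1.0):
    """Estimate p(w | cluster) with add-alpha (Laplacian) smoothing.

    Hypothetical helper: every vocabulary word receives a pseudo-count of
    `alpha`, so no word ends up with zero probability.
    """
    vocab_set = set(vocab)
    # Ignore tokens outside the fixed vocabulary (an assumption of this sketch).
    counts = Counter(t for t in tokens if t in vocab_set)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def log_likelihood(tokens, model):
    """Log-probability of a token sequence under a unigram model."""
    return sum(math.log(model[w]) for w in tokens)


# Toy data (invented for illustration only).
vocab = ["gene", "protein", "cell", "the", "of"]
cluster_tokens = ["the", "gene", "of", "the", "protein"]

model = laplace_unigram(cluster_tokens, vocab)

# "cell" never occurs in the cluster, yet it keeps a non-zero probability,
# so a document containing it still has a finite log-likelihood.
doc = ["the", "cell"]
score = log_likelihood(doc, model)
```

Note that this baseline spreads the pseudo-count uniformly: an unseen "core" word and an unseen "general" word receive the same probability, which is the limitation the abstract's argument for semantic smoothing addresses.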