Contextual Document Clustering

2004

https://doi.org/10.1007/978-3-540-24752-4_13

Abstract

In this paper we present a novel algorithm for document clustering. The approach is based on distributional clustering: subject-related words, which have a narrow context, are identified and serve as meta-tags for that subject. These contextual words form the basis for creating thematic clusters of documents. As in other research on document clustering, we evaluate the quality of the approach on document categorization problems and show that it outperforms the information-theoretic sequential information bottleneck method.
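The idea of a word with a "narrow context" can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes that a word's context is the pooled word counts of all documents containing that word, and that "narrow" means low Shannon entropy of that distribution. The function names `context_entropy` and `narrow_context_words` are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter


def context_entropy(docs):
    """Return the Shannon entropy (in bits) of each word's context distribution.

    Assumption: the context of a word w is the pooled word counts of every
    document that contains w. The abstract does not define this.
    """
    contexts = {}
    for doc in docs:
        counts = Counter(doc)
        # Every distinct word in the document receives the document's counts.
        for word in counts:
            contexts.setdefault(word, Counter()).update(counts)

    entropies = {}
    for word, ctx in contexts.items():
        total = sum(ctx.values())
        entropies[word] = -sum(
            (c / total) * math.log2(c / total) for c in ctx.values()
        )
    return entropies


def narrow_context_words(docs, k):
    """Return the k words with the lowest context entropy (ties broken by word)."""
    ent = context_entropy(docs)
    return sorted(ent, key=lambda w: (ent[w], w))[:k]


if __name__ == "__main__":
    # Toy corpus: each document is a list of tokens.
    docs = [
        ["the", "goal", "goal", "goal"],
        ["the", "vote", "tax", "law"],
    ]
    print(narrow_context_words(docs, 1))
```

In this sketch a word that appears across unrelated documents (such as "the") accumulates a broad, high-entropy context, while a topical word keeps a concentrated one. A raw entropy ranking also favors very rare words, so a practical version would need some frequency filtering, which is left out here.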
