A Hybrid Chinese Information Retrieval Model

2010, Lecture Notes in Computer Science

https://doi.org/10.1007/978-3-642-15470-6_28

Abstract

A distinctive feature of Chinese text is that a Chinese document is a sequence of Chinese characters with no spaces or other boundaries between words. This feature makes Chinese information retrieval more difficult: a retrieved document that contains the query term as a sequence of Chinese characters may not actually be relevant to the query, because that character sequence may not form a valid Chinese word in that document. Conversely, a document that is actually relevant may not be retrieved because it does not contain the query sequence but contains other relevant words. In this research, we propose a hybrid Chinese information retrieval model that incorporates word-based techniques into the traditional character-based techniques. The aim of this approach is to investigate the influence of Chinese segmentation on the performance of Chinese information retrieval. Two ranking methods are proposed to rank retrieved documents by their relevance to the query, calculated by combining character-based ranking and word-based ranking. Our experimental results show that Chinese segmentation can improve the performance of Chinese information retrieval, but the improvement is not significant when only Chinese segmentation is incorporated into the traditional character-based approach.
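To make the idea of combining character-based and word-based ranking concrete, the following is a minimal sketch and not the ranking methods proposed in the paper, which the abstract does not specify. It assumes that each document has already been split into words by some external segmenter, and it assumes a simple linear interpolation with a hypothetical weight `alpha`. The function names `char_score`, `word_score`, `hybrid_score`, and `rank` are illustrative only.

```python
from typing import Dict, List, Tuple


def char_score(query: str, words: List[str]) -> int:
    """Character-based evidence: count non-overlapping occurrences of the
    query as a raw character sequence, ignoring word boundaries."""
    text = "".join(words)  # the unsegmented character sequence
    return text.count(query)


def word_score(query: str, words: List[str]) -> int:
    """Word-based evidence: count segmented words that equal the query."""
    return sum(1 for w in words if w == query)


def hybrid_score(query: str, words: List[str], alpha: float = 0.5) -> float:
    """Assumed combination: linear interpolation of the two scores.
    alpha is a hypothetical weight, not a value taken from the paper."""
    return alpha * char_score(query, words) + (1 - alpha) * word_score(query, words)


def rank(query: str, docs: Dict[str, List[str]], alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs sorted by descending hybrid score."""
    scored = [(doc_id, hybrid_score(query, words, alpha)) for doc_id, words in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, with the query 和服 ("kimono"), a document segmented as 产品 / 和 / 服务 ("products and services") contains the query as a character sequence but not as a word, so it matches only on the character-based score. A document segmented as 她 / 穿 / 和服 ("she wears a kimono") matches on both scores and is ranked higher under this sketch.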