Chinese text retrieval without using a dictionary
1997, ACM SIGIR Forum
https://doi.org/10.1145/278459.258532
Abstract
It is generally believed that words, rather than characters, should be the smallest indexing unit for Chinese text retrieval systems, and that a comprehensive Chinese dictionary or lexicon is essential for such systems to do well. Chinese text has no delimiters to mark word boundaries. As a result, any text retrieval system that builds a word-based index needs to segment text into words. We implemented several statistical and dictionary-based word segmentation methods and studied their effect on retrieval effectiveness using the TREC-5 Chinese test collection and topics. The results show that, for all three sets of queries, simple bigram indexing and purely statistical word segmentation perform better than the popular dictionary-based maximum matching method with a dictionary of 138,955 entries.
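To make the two contrasted indexing approaches concrete, here is a minimal Python sketch of overlapping character bigram indexing and of greedy forward maximum matching against a word list. The abstract does not give the details of either implementation, so the function names `bigrams` and `max_match`, the `max_len` limit, and the four-entry toy dictionary are all assumptions made for illustration; this is not the authors' code and says nothing about their 138,955-entry dictionary.

```python
def bigrams(text):
    """Return overlapping character bigrams; a lone character is kept as-is."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching (max_len is an assumed cap on word length)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no dictionary entry starts at position i.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"中国", "文本", "检索", "系统"}
sentence = "中国文本检索系统"

print(bigrams(sentence))
# ['中国', '国文', '文本', '本检', '检索', '索系', '系统']
print(max_match(sentence, toy_dictionary))
# ['中国', '文本', '检索', '系统']
```

The bigram function needs no lexicon at all, which is the sense of "without using a dictionary" in the title, while the output of `max_match` depends entirely on which entries the dictionary happens to contain.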