Chinese information extraction and retrieval
1996. Proceedings of a workshop held at Vienna, Virginia, May 6-8, 1996.
https://doi.org/10.3115/1119018.1119047
11 pages
Abstract
This paper provides a summary of the following topics: 1. what was learned from porting the INQUERY information retrieval engine and the INFINDER term finder to Chinese; 2. experiments at the University of Massachusetts evaluating INQUERY performance on Chinese newswire (Xinhua); 3. what was learned from porting selected components of PLUM to Chinese; 4. experiments evaluating the POST part-of-speech tagger and named entity recognition on Chinese; 5. program issues in technology development.
Related papers
Web Information Systems: WISE 2004, 2004
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, 2009
This paper presents a new term extraction approach using relevance between term candidates calculated by a link analysis based method. Different types of relevance are used separately or jointly for term verification. The proposed approach requires no prior domain knowledge and no adaptation for new domains. Consequently, the method can be used in any domain corpus and it is especially useful for resource-limited domains. Evaluations conducted on two different domains for Chinese term extraction show significant improvements over existing techniques and also verify the efficiency and relative domain independent nature of the approach.
2000
In this paper, we propose and evaluate approaches to categorizing Chinese texts, consisting of term extraction, term selection, term clustering, and text classification. We propose a scalable approach that uses frequency counts to identify the left and right boundaries of possibly significant terms. We used the combination of term selection and term clustering to reduce the dimension of the vector space to a practical level. While the huge number of possible Chinese terms makes most machine learning algorithms impractical, results obtained in an experiment on a CAN news collection show that the dimension could be dramatically reduced to 1,200 while approximately the same level of classification accuracy was maintained using our approach. We also studied and compared the performance of three well-known classifiers, the Rocchio linear classifier, the naive Bayes probabilistic classifier, and the k-nearest neighbors (kNN) classifier, when applied to categorizing Chinese texts. Overall, kNN achieved the best accuracy, about 78.3%, but required large amounts of computation time and memory when classifying new texts. Rocchio was very time- and memory-efficient and achieved a high level of accuracy, about 75.4%. In practical implementations, Rocchio may be a good choice.
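The frequency-count boundary idea above can be sketched roughly as follows (a minimal, hypothetical simplification — the paper's exact counting scheme, thresholds, and `drop_ratio` parameter are assumptions, not its published method): a character n-gram is kept as a candidate term when its own frequency drops sharply for every one-character extension on the left or right, suggesting the n-gram's boundaries are complete.

```python
from collections import Counter

def candidate_terms(texts, max_len=4, min_count=3, drop_ratio=0.5):
    """Keep character n-grams whose frequency drops sharply at both
    boundaries -- a crude frequency-count boundary test for terms."""
    counts = Counter()
    for text in texts:
        # count all n-grams up to one character longer than max_len,
        # so every candidate has its extensions counted too
        for n in range(1, max_len + 2):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    # highest frequency of any one-character extension of each string
    ext_max = {}
    for g, c in counts.items():
        if len(g) >= 2:
            for sub in (g[:-1], g[1:]):
                if c > ext_max.get(sub, 0):
                    ext_max[sub] = c
    terms = []
    for gram, c in counts.items():
        if 2 <= len(gram) <= max_len and c >= min_count:
            # extensions are much rarer than the gram itself -> likely a term
            if ext_max.get(gram, 0) <= c * drop_ratio:
                terms.append((gram, c))
    return sorted(terms, key=lambda t: -t[1])
```

For example, in a toy corpus where 北京 recurs in varying contexts, it surfaces as a candidate while its one-off extensions do not.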
Network technology and the Internet are creating a completely new information era. It is believed that in the near future numerous digital libraries and a great variety of multimedia databases, consisting of heterogeneous types of information including text, audio, image, and video, will be available worldwide via the Internet. This paper deals with the problem of Chinese text and Mandarin speech information retrieval with Mandarin speech queries. Instead of using syllable-based information alone, word-based information was also successfully incorporated to further improve retrieval performance. A prototype system with an interface supporting user-friendly functions was successfully implemented, and initial test results verified the feasibility of our approaches.
Computational Intelligence, 2000
We discuss the use of probability-based natural language processing for Chinese text retrieval. We focus on comparing different text extraction methods and probabilistic weighting methods. Several document processing methods and probabilistic weighting functions are presented. A number of experiments have been conducted on large standard text collections. We present the experimental results that compare a word-based text processing method with a character-based method. The experimental results also compare a number of term-weighting functions including both single-unit weighting and compound-unit weighting functions.
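As a concrete illustration of a probabilistic term-weighting function of the kind compared in such studies, here is the classic INQUERY-style tf·idf belief score (the constants 0.4/0.6 and the normalizations are as commonly published for INQUERY; whether this exact function was among those evaluated in the paper is an assumption):

```python
import math

def inquery_weight(tf, df, doc_len, avg_len, n_docs):
    """INQUERY-style belief score for one term in one document.

    tf      -- term frequency in the document
    df      -- number of documents containing the term
    doc_len -- length of the document (in indexing units)
    avg_len -- average document length in the collection
    n_docs  -- total number of documents in the collection
    """
    # tf component, dampened and normalized by document length
    tf_part = tf / (tf + 0.5 + 1.5 * doc_len / avg_len)
    # idf component, scaled into [0, 1]
    idf_part = math.log((n_docs + 0.5) / df) / math.log(n_docs + 1.0)
    return 0.4 + 0.6 * tf_part * idf_part
```

The same formula applies whether the indexing unit is a word or a single character, which is what makes it a natural yardstick for comparing word-based and character-based processing.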
2002
There are no blanks to mark word boundaries in Chinese text. As a result, identifying words is difficult because of segmentation ambiguities and occurrences of unknown words. Conventionally, unknown words were extracted by statistical methods, which are simple and efficient. However, statistical methods that use no linguistic knowledge suffer from low precision and low recall, since character strings with statistical significance may be phrases or partial phrases rather than words, and low-frequency new words are hardly identifiable by statistical methods. In addition to statistical information, we try to use as much information as possible, such as morphology, syntax, semantics, and world knowledge. The identification system fully utilizes the context and content information of unknown words in its detection, extraction, and verification processes. A practical unknown word extraction system was implemented that identifies new words online, including low-frequency new words, with high precision and recall rates.
Computational Linguistics, 2004
We are interested in the problem of word extraction from Chinese text collections. We define a word to be a meaningful string composed of several Chinese characters. For example, the strings meaning 'percent' and 'more and more' are not recognized as traditional Chinese words from the viewpoint of some people. However, in our work they are words, because they are very widely used and have specific meanings. We start from the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that appear directly before a string (predecessors) in TREC 5 and TREC 6 documents, and use them as a measurement of the context independency of the string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction and is comparable to, and for long words outperforms, other iterative methods.
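The predecessor-based context-independency measure can be sketched as follows (a minimal illustration of the idea, not the paper's exact statistic): count how many distinct characters immediately precede a candidate string across the corpus. A string preceded by many different characters is more likely to stand alone as a word, while one locked to a single predecessor is probably part of a longer unit.

```python
def predecessor_variety(texts, candidate):
    """Count the distinct characters appearing immediately before
    occurrences of `candidate` -- a rough context-independency score."""
    preds = set()
    for text in texts:
        start = text.find(candidate)
        while start != -1:
            if start > 0:
                preds.add(text[start - 1])
            start = text.find(candidate, start + 1)
    return len(preds)
```

In a toy corpus like ["我去北京", "你去北京", "他去北京"], the string 去北京 has three distinct predecessors while 北京 has only one, so by this measure alone 去北京 looks more context-independent — which is exactly why such counts are combined with frequency and other evidence in practice.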
2016
This project was designed as an international and interdisciplinary collaboration (between the UK, the US, and the Netherlands) to facilitate and promote research techniques for large-scale structured datasets derived from unstructured corpora of Chinese texts. It aimed to reduce ambiguity within and across entity types such as place and personal names, and to improve recall and accuracy by developing machine learning approaches.
2007
The extraction of Multiword Lexical Units (MLUs) in lexica is important to language-related methods such as Natural Language Processing (NLP) and machine translation. As one word in one language may be translated into an MLU in another language, the extraction of MLUs plays an important role in Cross-Language Information Retrieval (CLIR), especially in finding translations for words that are not in a dictionary. Web mining has been used for translating query terms that are missing from dictionaries, and MLU extraction is one of the key parts of search-engine-based translation: the MLU extraction result ultimately affects the translation quality. Most statistical approaches to MLU extraction rely on large amounts of statistical information from huge corpora. In the case of search-engine-based translation, those approaches do not perform well because the corpus returned from a search engine is usually small. In this paper, we present a new string measurement and a new Chinese MLU extraction process that works well on small corpora.
This paper describes our participation in NTCIR-5 Chinese Information Retrieval (IR) evaluation. The main purpose is to evaluate Lemur, a freely available information retrieval toolkit. Our results showed that Lemur could provide above average performance on most of the runs. We also compared manual queries vs. automatic queries for Chinese IR. The results show that manually generated queries did not have much effect on IR performance. More analysis will be carried out to discover causes behind hard topics and ways to improve the overall retrieval performance.
