Chinese information extraction and retrieval
1996. Proceedings of a workshop held at Vienna, Virginia, May 6-8, 1996.
https://doi.org/10.3115/1119018.1119047
11 pages
Abstract
This paper provides a summary of the following topics: 1. what was learned from porting the INQUERY information retrieval engine and the INFINDER term finder to Chinese; 2. experiments at the University of Massachusetts evaluating INQUERY performance on Chinese newswire (Xinhua); 3. what was learned from porting selected components of PLUM to Chinese; 4. experiments evaluating the POST part-of-speech tagger and named entity recognition on Chinese; 5. program issues in technology development.
Related papers
Web Information Systems: WISE 2004, 2004
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, 2009
This paper presents a new term extraction approach using relevance between term candidates calculated by a link analysis based method. Different types of relevance are used separately or jointly for term verification. The proposed approach requires no prior domain knowledge and no adaptation for new domains. Consequently, the method can be used in any domain corpus and it is especially useful for resource-limited domains. Evaluations conducted on two different domains for Chinese term extraction show significant improvements over existing techniques and also verify the efficiency and relative domain independent nature of the approach.
2000
In this paper, we propose and evaluate approaches to categorizing Chinese texts, consisting of term extraction, term selection, term clustering, and text classification. We propose a scalable approach that uses frequency counts to identify the left and right boundaries of possibly significant terms. We used the combination of term selection and term clustering to reduce the dimension of the vector space to a practical level. While the huge number of possible Chinese terms makes most machine learning algorithms impractical, results obtained in an experiment on a CAN news collection show that the dimension could be dramatically reduced to 1,200 while approximately the same level of classification accuracy was maintained using our approach. We also studied and compared the performance of three well-known classifiers, the Rocchio linear classifier, the naive Bayes probabilistic classifier, and the k-nearest neighbors (kNN) classifier, when applied to categorizing Chinese texts. Overall, kNN achieved the best accuracy, about 78.3%, but required large amounts of computation time and memory when classifying new texts. Rocchio was very time- and memory-efficient and achieved a high level of accuracy, about 75.4%. In practical implementations, Rocchio may be a good choice.
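The frequency-count boundary idea above can be sketched roughly as follows (a minimal, hypothetical simplification — the paper's exact counting scheme, thresholds, and `drop_ratio` parameter are assumptions, not its published method): a character n-gram is kept as a candidate term when its own frequency drops sharply for every one-character extension on the left or right, suggesting the n-gram's boundaries are complete.

```python
from collections import Counter

def candidate_terms(texts, max_len=4, min_count=3, drop_ratio=0.5):
    """Keep character n-grams whose frequency drops sharply at both
    boundaries -- a crude frequency-count boundary test for terms."""
    counts = Counter()
    for text in texts:
        # count all n-grams up to one character longer than max_len,
        # so every candidate has its extensions counted too
        for n in range(1, max_len + 2):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    # highest frequency of any one-character extension of each string
    ext_max = {}
    for g, c in counts.items():
        if len(g) >= 2:
            for sub in (g[:-1], g[1:]):
                if c > ext_max.get(sub, 0):
                    ext_max[sub] = c
    terms = []
    for gram, c in counts.items():
        if 2 <= len(gram) <= max_len and c >= min_count:
            # extensions are much rarer than the gram itself -> likely a term
            if ext_max.get(gram, 0) <= c * drop_ratio:
                terms.append((gram, c))
    return sorted(terms, key=lambda t: -t[1])
```

For example, in a toy corpus where 北京 recurs in varying contexts, it surfaces as a candidate while its one-off extensions do not.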
Network technology and the Internet are creating a completely new information era. It is believed that in the near future numerous digital libraries and a great variety of multimedia databases, consisting of heterogeneous types of information including text, audio, image, and video, will be available worldwide via the Internet. This paper deals with the problem of Chinese text and Mandarin speech information retrieval with Mandarin speech queries. Instead of using syllable-based information alone, word-based information was also successfully incorporated to further improve retrieval performance. A prototype system with an interface supporting user-friendly functions was successfully implemented, and initial test results verified the feasibility of our approaches.
Computational Intelligence, 2000
We discuss the use of probability-based natural language processing for Chinese text retrieval. We focus on comparing different text extraction methods and probabilistic weighting methods. Several document processing methods and probabilistic weighting functions are presented. A number of experiments have been conducted on large standard text collections. We present the experimental results that compare a word-based text processing method with a character-based method. The experimental results also compare a number of term-weighting functions including both single-unit weighting and compound-unit weighting functions.
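As a concrete illustration of a probabilistic term-weighting function of the kind compared in such studies, here is the classic INQUERY-style tf·idf belief score (the constants 0.4/0.6 and the normalizations are as commonly published for INQUERY; whether this exact function was among those evaluated in the paper is an assumption):

```python
import math

def inquery_weight(tf, df, doc_len, avg_len, n_docs):
    """INQUERY-style belief score for one term in one document.

    tf      -- term frequency in the document
    df      -- number of documents containing the term
    doc_len -- length of the document (in indexing units)
    avg_len -- average document length in the collection
    n_docs  -- total number of documents in the collection
    """
    # tf component, dampened and normalized by document length
    tf_part = tf / (tf + 0.5 + 1.5 * doc_len / avg_len)
    # idf component, scaled into [0, 1]
    idf_part = math.log((n_docs + 0.5) / df) / math.log(n_docs + 1.0)
    return 0.4 + 0.6 * tf_part * idf_part
```

The same formula applies whether the indexing unit is a word or a single character, which is what makes it a natural yardstick for comparing word-based and character-based processing.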
2002
There are no blanks to mark word boundaries in Chinese text. As a result, identifying words is difficult because of segmentation ambiguities and occurrences of unknown words. Conventionally, unknown words were extracted by statistical methods, which are simple and efficient. However, statistical methods that use no linguistic knowledge suffer from low precision and low recall, since character strings with statistical significance may be phrases or partial phrases rather than words, and low-frequency new words are hardly identifiable by statistical methods. In addition to statistical information, we try to use as much information as possible, such as morphology, syntax, semantics, and world knowledge. The identification system fully utilizes the context and content information of unknown words in its detection, extraction, and verification processes. A practical unknown word extraction system was implemented that identifies new words online, including low-frequency new words, with high precision and recall rates.
Computational Linguistics, 2004
We are interested in the problem of word extraction from Chinese text collections. We define a word to be a meaningful string composed of several Chinese characters. For example, the strings meaning 'percent' and 'more and more' are not recognized as traditional Chinese words from the viewpoint of some people. However, in our work they are words, because they are very widely used and have specific meanings. We start from the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that appear directly before a string (predecessors) in TREC 5 and TREC 6 documents, and use them as a measurement of the context independency of the string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction and is comparable to, and for long words outperforms, other iterative methods.
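The predecessor-based context-independency measure can be sketched as follows (a minimal illustration of the idea, not the paper's exact statistic): count how many distinct characters immediately precede a candidate string across the corpus. A string preceded by many different characters is more likely to stand alone as a word, while one locked to a single predecessor is probably part of a longer unit.

```python
def predecessor_variety(texts, candidate):
    """Count the distinct characters appearing immediately before
    occurrences of `candidate` -- a rough context-independency score."""
    preds = set()
    for text in texts:
        start = text.find(candidate)
        while start != -1:
            if start > 0:
                preds.add(text[start - 1])
            start = text.find(candidate, start + 1)
    return len(preds)
```

In a toy corpus like ["我去北京", "你去北京", "他去北京"], the string 去北京 has three distinct predecessors while 北京 has only one, so by this measure alone 去北京 looks more context-independent — which is exactly why such counts are combined with frequency and other evidence in practice.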
2016
This project was designed as an international and interdisciplinary collaboration (between the UK, the US, and the Netherlands) to facilitate and promote research techniques for large-scale structured datasets derived from unstructured corpora of Chinese texts. It aimed to reduce ambiguity within and across entity types such as place and personal names, and to improve recall and accuracy by developing machine learning approaches.
2007
The extraction of Multiword Lexical Units (MLUs) in lexica is important to language-related methods such as Natural Language Processing (NLP) and machine translation. As one word in one language may be translated into an MLU in another language, the extraction of MLUs plays an important role in Cross-Language Information Retrieval (CLIR), especially in finding translations for words that are not in a dictionary. Web mining has been used for translating query terms that are missing from dictionaries, and MLU extraction is one of the key parts of search-engine-based translation: the MLU extraction result ultimately affects the translation quality. Most statistical approaches to MLU extraction rely on large amounts of statistical information from huge corpora. In the case of search-engine-based translation, those approaches do not perform well because the corpus returned from a search engine is usually small. In this paper, we present a new string measurement and a new Chinese MLU extraction process that works well on small corpora.
This paper describes our participation in NTCIR-5 Chinese Information Retrieval (IR) evaluation. The main purpose is to evaluate Lemur, a freely available information retrieval toolkit. Our results showed that Lemur could provide above average performance on most of the runs. We also compared manual queries vs. automatic queries for Chinese IR. The results show that manually generated queries did not have much effect on IR performance. More analysis will be carried out to discover causes behind hard topics and ways to improve the overall retrieval performance.
