A Lemmatizer for Low-resource Languages: WSD and Its Role in the Assamese Language
ACM Transactions on Asian and Low-Resource Language Information Processing
The morphological variations of highly inflected languages that appear in a text impede computer processing and root-word determination when extracting an abstract. As a remedy, a lemmatization algorithm is developed, and its effectiveness is evaluated for Word Sense Disambiguation (WSD). Having proved useful there, the lemmatizer is considered for developing Natural Language Processing tools for languages rich in morphological variation. Assamese, spoken by over 14 million people in the North-Eastern region of India, is one of several highly inflected Indian languages. In the present work, after a detailed study of the possible transformations through which surface words are created from lemmas, we have designed an Assamese lemmatizer so that suitable reverse transformations can be applied to a surface word to derive the corresponding lemma. It has been observed that the lemmatizer is compe...
In today’s world, English is considered an important language across the globe. Many resources are available in English on the internet, but they are not easily understood by everyone, so English text needs to be translated into the local languages of India so that people there can make use of this enormous amount of material. Because the volume of information is so large, translating it manually from one language to another is not feasible. It is therefore important to translate text from one language to another automatically and effectively. This paper discusses Neural Machine Translation (NMT) for converting English text to Hindi text. NMT is one of the most recent and effective translation techniques among existing machine translation systems. In our experiment we tested 4 different models on OPUS and IITBombay English-Hindi parallel corpora containing nearly 1084157 sent...
Assamese Word Sense Disambiguation using Cuckoo Search Algorithm
Procedia Computer Science, 2021
Natural language processing is associated with human-computer interaction, where several challenges require natural language understanding. The word sense disambiguation problem is the computational assignment of meaning to a word according to the specific context in which it occurs. Numerous natural language processing applications, such as machine translation, information retrieval, and information extraction, require this task, which takes place at the semantic level. Unsupervised computational approaches can be effective for this problem, since they have been used successfully for many real-world optimization problems. In this paper, we propose to solve the word sense disambiguation problem in the Assamese language using the cuckoo search algorithm. We illustrate the performance of our algorithm by carrying out experiments on an Assamese corpus and comparing the results against an unsupervised genetic algorithm implemented for the Assamese language. The results show that the cuckoo search algorithm achieves higher precision, recall, and F-measure, attaining 87.5%, 84%, and 85.71% respectively.
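The three reported figures can be cross-checked against each other. The sketch below assumes the F-measure is the balanced harmonic mean of precision and recall (F1); the abstract does not state which variant was used, and the function name `f_measure` is introduced here only for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
reported_precision = 87.5
reported_recall = 84.0

f1 = f_measure(reported_precision, reported_recall)
print(round(f1, 2))  # 85.71
```

Under that assumption, the reported F-measure of 85.71 is consistent with the reported precision and recall.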
Word sense disambiguation (WSD) is the problem of determining the sense of a word according to the context in which it occurs. A great deal of work has been done on WSD for some languages, such as English, but research on Assamese WSD remains limited. It is a more demanding task because Assamese has intrinsic complexity in its writing structure and ambiguity at the syntactic, semantic, and anaphoric levels. A novel unsupervised genetic word sense disambiguation algorithm is proposed in this paper. The algorithm first uses WordNet to extract all possible senses for a given ambiguous word; a genetic algorithm then takes Wu-Palmer’s similarity measure as the fitness function and calculates the similarity for all extracted senses. The sense with the highest score is declared the winner sense.
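To make the fitness function concrete, here is a minimal sketch of Wu-Palmer similarity, 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest common ancestor of the two senses. It is not the paper's implementation: the tiny English taxonomy in `PARENT`, the sense names, and the helper `best_sense` are all hypothetical, and the genetic search is replaced by plain exhaustive scoring to keep the example short.

```python
# Hypothetical toy hypernym tree: child -> parent (root has parent None).
PARENT = {
    "entity": None,
    "organization": "entity",
    "financial_institution": "organization",
    "bank_institution": "financial_institution",
    "credit_union": "financial_institution",
    "object": "entity",
    "landform": "object",
    "slope": "landform",
    "bank_river": "slope",
}


def path_to_root(node):
    """Return the chain [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu-Palmer similarity between two senses in the toy tree."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def best_sense(candidates, context_senses):
    """Pick the candidate with the highest total similarity to the context."""
    return max(candidates, key=lambda s: sum(wu_palmer(s, c) for c in context_senses))


winner = best_sense(["bank_institution", "bank_river"], ["credit_union"])
print(winner)  # bank_institution
```

With a real WordNet the number of sense combinations across a sentence grows quickly, which is where a search heuristic such as the genetic algorithm described above becomes useful.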
Improving stemming for Assamese information retrieval
International Journal of Information Technology, 2021
Researchers have proposed several approaches and solutions to enhance Assamese stemmers. Such stemmers are important because their features are often applied in application-oriented projects, especially in developing information retrieval (IR) systems. Assamese stemming can be defined as a process that strips a set of suffixes from words. This process has certain drawbacks, however, such as vocalization ambiguity, incorrect removal, and single solution. In this paper, we propose an Assamese stemmer that addresses various drawbacks of earlier proposals and makes efficient use of the features mentioned above. We tested it on 20,000 words from 16 different articles; all possible suffixes in the Assamese language were collected manually with the help of an Assamese linguistic expert. The stemmer achieved an accuracy of 86.16%. The accuracy of the system was also compared with other existing approaches, and our system outperforms all of them. In addition, we propose an automatic approach for evaluating and comparing Assamese stemmers that takes into account metrics related to the accuracy of results.
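The basic idea of suffix stripping can be sketched as follows. This is an illustration only, not the stemmer proposed in the paper: the suffix list is a small assumed sample rather than the expert-collected list, and `MIN_STEM_LENGTH` is a hypothetical guard against the kind of incorrect removal the abstract mentions.

```python
# Small illustrative sample of Assamese suffixes (assumed, not the paper's list).
SUFFIXES = ["বোৰ", "ত", "ৰ", "ক"]

# Hypothetical guard: never leave a stem shorter than this many code points.
MIN_STEM_LENGTH = 2


def stem(word: str) -> str:
    """Strip at most one suffix, trying longer suffixes first."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if len(candidate) >= MIN_STEM_LENGTH:
                return candidate
    return word


for surface in ["মানুহবোৰ", "ঘৰত", "ঘৰ"]:
    print(surface, "->", stem(surface))
```

Trying longer suffixes first matters because "বোৰ" itself ends in "ৰ"; stripping the shorter suffix first would leave a wrong stem. The length guard keeps "ঘৰ" intact even though it ends in a character that is also a suffix.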
Papers by Arjun Gogoi