Lecture Notes in Networks and Systems, Jul 6, 2022
Gujarat is a state in the western part of India, and Gujarati is its official language. The Gujarati language is more than 700 years old and is spoken by more than 55 million people around the world. A machine translation system (MTS) is needed for communication between people who know different languages. Idioms are used in almost all languages. An idiom is a phrase or expression whose meaning does not necessarily relate to the literal meaning of its individual words; it generally means something different from what those words directly convey. Translating idioms from any language, Gujarati included, into another language is a challenging task. All existing MTS, including Google Translate and Microsoft Bing, fail to translate Gujarati idioms suitably. In this paper, a particular category of Gujarati idioms, called personage idioms, is detected in the input text and translated into English. We deployed a dictionary-based algorithm for idioms with one meaning and a context-based search for idioms with multiple meanings. This is a first-of-its-kind work. From a broad technical viewpoint, it is an application of Gujarati-to-interlingua translation within the MTS sub-domain of natural language processing (NLP). English is taken as the interlingua for the present research work.
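The split between single-meaning and multi-meaning idioms described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary layout, the `translate_idiom` helper, the cue words, and the romanized placeholder keys are all assumptions made for illustration.

```python
from typing import Dict, List, Set

# Hypothetical idiom dictionary with placeholder keys (not real Gujarati idioms).
# Each meaning lists the context cue words that would select it.
IDIOMS: Dict[str, List[dict]] = {
    "idiom_a": [{"english": "a very generous person", "cues": set()}],
    "idiom_b": [
        {"english": "a very strong person", "cues": {"fight", "lift"}},
        {"english": "a very stubborn person", "cues": {"argue", "refuse"}},
    ],
}


def translate_idiom(idiom: str, context: Set[str]) -> str:
    """Return the English meaning of a detected idiom."""
    meanings = IDIOMS[idiom]
    if len(meanings) == 1:
        # Single meaning: a plain dictionary lookup is enough.
        return meanings[0]["english"]
    # Multiple meanings: pick the one whose cue words overlap the context most.
    # On a tie, max() keeps the first meaning listed.
    best = max(meanings, key=lambda m: len(m["cues"] & context))
    return best["english"]
```

For example, `translate_idiom("idiom_b", {"refuse", "today"})` selects the second meaning because only its cue set overlaps the context.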
International Journal of Advanced Computer Science and Applications
The authors present a mechanism for handling loanwords, missing words, and newly coined terms in WordNets. WordNet has evolved into one of the most prominent Natural Language Processing (NLP) resources. The mechanism can be used to improve the WordNet of any language. The authors chose Hindi and Gujarati for this work because both are languages with major dialects. The work used a verse-based Hindi data corpus of more than 5000 entries instead of a prose-based corpus. As a result, nearly 14000 Hindi words were discovered that were not present in the popular Hindi IndoWordNet, accounting for 13.23 percent of the total existing word count of 105000+. For Gujarati, the method worked with idioms instead. Around 3500 idioms were used, and nearly 900 Gujarati terms were discovered that did not exist in the IndoWordNet, accounting for nearly 1.4 percent of its 64000+ Gujarati words. The work contributes almost 14000 Hindi words and around 900 Gujarati words to the IndoWordNet project.
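The core bookkeeping behind such a study (find the corpus words absent from a WordNet, then express them as a share of the existing vocabulary) can be sketched as follows. The function names are hypothetical, and a real pipeline would also need tokenization and lemmatization, which are omitted here.

```python
from typing import Iterable, List


def missing_words(corpus_tokens: Iterable[str], wordnet_lemmas: Iterable[str]) -> List[str]:
    """Distinct corpus tokens that do not occur among the WordNet lemmas."""
    return sorted(set(corpus_tokens) - set(wordnet_lemmas))


def share_of_existing(n_missing: int, n_existing: int) -> float:
    """Missing words as a percentage of the existing WordNet word count."""
    return 100.0 * n_missing / n_existing


# Rounded Gujarati figures from the abstract: about 900 new terms against 64000+ words.
gujarati_share = share_of_existing(900, 64000)
```

With the rounded Gujarati figures, `gujarati_share` is about 1.4, in line with the percentage quoted in the abstract.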
2020 International Conference for Emerging Technology (INCET), 2020
Gujarati is the official language of the state of Gujarat, located in the western region of India. A Machine Translation System (MTS) translates text from one language to another. Based on our review, very few machine translation systems are available that convert Gujarati text into English. This paper focuses on the translation of Gujarati trigram and bigram idioms. An idiom is defined as a token sequence whose meaning differs from the literal meaning of the individual tokens. The proposed Gujarati-to-English idiom translator accurately translates trigram and bigram idioms. We created a corpus of nearly 3000 n-gram idioms, in which we found nearly 890 trigram idioms and 1735 bigram idioms.
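One way to detect trigram and bigram idioms in a token sequence is a left-to-right scan that tries the longer n-gram first. The sketch below assumes that strategy; the abstract does not state how the authors match idioms, and `find_idioms` and the placeholder tokens are hypothetical.

```python
from typing import List, Set, Tuple


def find_idioms(
    tokens: List[str],
    idioms: Set[Tuple[str, ...]],
    max_n: int = 3,
) -> List[Tuple[int, Tuple[str, ...]]]:
    """Return (start_index, idiom) pairs, preferring trigrams over bigrams."""
    hits = []
    i = 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):  # n = 3, then n = 2
            gram = tuple(tokens[i:i + n])
            if len(gram) == n and gram in idioms:
                hits.append((i, gram))
                i += n  # skip past the matched idiom
                break
        else:
            i += 1  # no idiom starts at this token
    return hits
```

Trying trigrams first keeps a bigram idiom that is a prefix of a trigram idiom from shadowing the longer match.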
International Journal of Advanced Computer Science and Applications
Gujarati is spoken by more than 55 million people worldwide and is more than 1000 years old. It is the chief language of the Indian state of Gujarat. Gujarati has many dialects, such as Standard Gujarati, Amdawadi Gujarati, Kathiawadi Gujarati, and Kutchi Gujarati. Like Hindi and other Indo-Aryan languages, Gujarati is morphologically rich. Many readability tests are available for English, but no readability complexity test is available for Gujarati idiomatic text. The complexity score is a sub-concept of the readability test; it is calculated to define the complexity level of a Gujarati text. We deployed a novel method for calculating the readability complexity score that considers the number of letters in each word, the number of diacritics in each word, n-gram Gujarati idioms where n = 1 to 9, and m-meaning Gujarati idioms where m = 1 to 7. The complexity score is the sum of the word complexity score, the diacritics complexity score, the n-gram complexity score of Gujarati idioms, and the m-meaning complexity score of Gujarati idioms. We emphasized idiomatic text in the calculation because idioms make text harder to understand. This is an innovative, first-of-its-kind work in the Gujarati language research community. The results are promising enough to use the proposed complexity score as a basis for developing a readability test for natural language processing tasks in Gujarati.
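The four-part sum can be sketched as below. The abstract names the components but not their formulas, so this sketch assumes the simplest reading: each component is a raw count. Treating every Unicode combining mark (including the virama) as a diacritic is a further assumption, and the function names are hypothetical.

```python
import unicodedata
from typing import Iterable, List, Tuple


def letters_and_diacritics(word: str) -> Tuple[int, int]:
    """Count base letters (category L*) and combining marks (category M*)."""
    letters = sum(1 for ch in word if unicodedata.category(ch).startswith("L"))
    marks = sum(1 for ch in word if unicodedata.category(ch).startswith("M"))
    return letters, marks


def complexity_score(
    words: List[str],
    idiom_ngram_sizes: Iterable[int],
    idiom_meaning_counts: Iterable[int],
) -> int:
    """Sum of word, diacritic, n-gram, and m-meaning components (unweighted)."""
    word_part = diacritic_part = 0
    for w in words:
        letters, marks = letters_and_diacritics(w)
        word_part += letters
        diacritic_part += marks
    return word_part + diacritic_part + sum(idiom_ngram_sizes) + sum(idiom_meaning_counts)


# The word "Gujarati" in Gujarati script, written with escapes to avoid encoding issues.
GUJARATI = "\u0a97\u0ac1\u0a9c\u0ab0\u0abe\u0aa4\u0ac0"
```

Here `letters_and_diacritics(GUJARATI)` yields 4 base letters and 3 vowel signs; a text containing that word plus one bigram idiom with three meanings would score 4 + 3 + 2 + 3 = 12 under these assumptions.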
International Journal of Advanced Computer Science and Applications, 2021
Gujarati is an Indo-Aryan language spoken by the Gujaratis, the people of the state of Gujarat in India. It is one of the 22 official languages recognized by the Indian government. The Gujarati script was adapted from the Devanagari script. Approximately 3000 idioms are available in the Gujarati language. Machine translation of an idiom is challenging because contextual information is important for translating it. When translating Gujarati idioms into English or any other language, the surrounding context words are considered when the meaning of the idiom is ambiguous. This paper experiments with the IndoWordNet for Gujarati to obtain synonyms of the surrounding context words. It uses an n-gram model and experiments with various window sizes around the idiom, as well as the role of stop-words, for correct context identification. The paper demonstrates the usefulness of a context window for resolving ambiguity in idioms with multiple meanings. The results of this research could be used by any destination-independent machine translation system for Gujarati.
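Building the context set for a given window size might look like the sketch below. A plain dictionary stands in for the IndoWordNet synonym lookup, and the `context_words` helper, its parameters, and the placeholder tokens are assumptions rather than the paper's actual interface.

```python
from typing import Dict, List, Set


def context_words(
    tokens: List[str],
    start: int,
    length: int,
    window: int,
    stop_words: Set[str],
    synonyms: Dict[str, Set[str]],
) -> Set[str]:
    """Collect non-stop-words within `window` tokens of the idiom, plus synonyms."""
    left = tokens[max(0, start - window):start]
    right = tokens[start + length:start + length + window]
    context: Set[str] = set()
    for tok in left + right:
        if tok in stop_words:
            continue  # stop-words still occupy a window slot in this sketch
        context.add(tok)
        context |= synonyms.get(tok, set())  # stand-in for a WordNet lookup
    return context
```

Because stop-words are filtered after the window is cut, a stop-word-heavy neighborhood yields fewer usable context words at small window sizes; filtering before cutting is the alternative design, and comparing the two is the kind of experiment the abstract describes.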
International Journal of Advanced Research in Computer Science, 2018
India is a multilingual country with 22 official languages at present. Gujarat is a state located in the western region of India. The Gujarati language is spoken by nearly 60 million people worldwide, making it the 26th most-spoken native language in the world. In a Machine Translation System (MTS), one natural language is translated into another by computational means, with minimal human effort or without a real-time human interface. Many attempts have been made at machine translation for Indian languages. Unfortunately, we still do not have an efficient Machine Translation System today. This paper briefly describes the approaches to machine translation and the work done for the Gujarati language.
International Journal of Advanced Computer Science and Applications, 2021
Gujarati is the language used for everyday communication in the state of Gujarat, India. It is also officially recognized by the constitution and the government of India. The Gujarati script is based on the Devanagari script. An idiom is an expression, phrase, or word whose meaning differs from the literal meaning of the words in it. Idioms represent the cultural heritage of the Gujarati language and are used for effective communication and to convey an accurate message. No Machine Translation System accurately translates Gujarati idioms into English or any other language. Different idiom phrases can be generated by adding diacritics as well as suffixes to the root or base form of an idiom. The many forms of a single idiom make automatic idiom identification and machine translation more challenging. This paper focuses on the design and implementation of diacritic- and suffix-based rules for dynamic phrase generation and detection of Gujarati idioms. The implementation helps identify a Gujarati idiom in any of its possible forms in Gujarati text. Executed on 7050 different Gujarati idiom phrases, it yields an accuracy of 99.73%. The results are encouraging enough to make the proposed implementation useful for natural language processing tasks related to Gujarati idioms.
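Rule-based phrase generation of this kind can be sketched as expanding each base idiom into its surface forms and indexing every form back to the base. The sketch assumes that only the final word inflects and uses romanized placeholder tokens and endings; the paper's actual diacritic and suffix rules are not given in the abstract, and `build_variant_index` is a hypothetical name.

```python
from typing import Dict, List


def build_variant_index(base_idioms: List[str], endings: List[str]) -> Dict[str, str]:
    """Map every generated surface form of an idiom back to its base form."""
    index: Dict[str, str] = {}
    for base in base_idioms:
        *head, root = base.split()
        for ending in endings:
            # An ending may be a diacritic, a suffix, or "" for the bare root.
            surface = " ".join(head + [root + ending])
            index[surface] = base
    return index


# Placeholder idiom and endings, for illustration only.
index = build_variant_index(["w1 kar"], ["", "vu", "yo"])
```

Detection then reduces to a dictionary lookup: any phrase found in `index` is an occurrence of the idiom stored as its value.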
Papers by Dr. Jatin Modh