Papers by Pushpak Bhattacharyya

In this paper, we describe our English-Hindi and Hindi-English statistical systems submitted to t... more In this paper, we describe our English-Hindi and Hindi-English statistical systems submitted to the WMT14 shared task. The core components of our translation systems are phrase based (Hindi-English) and factored (English-Hindi) SMT systems. We show that the use of number, case and Tree Adjoining Grammar information as factors helps to improve English-Hindi translation, primarily by generating morphological inflections correctly. We show improvements to the translation systems using pre-procesing and post-processing components. To overcome the structural divergence between English and Hindi, we preorder the source side sentence to conform to the target language word order. Since parallel corpus is limited, many words are not translated. We translate out-of-vocabulary words and transliterate named entities in a post-processing stage. We also investigate ranking of translations from multiple systems to select the best translation.

In this paper, we describe our English-Hindi and Hindi-English statistical systems submitted to t... more In this paper, we describe our English-Hindi and Hindi-English statistical systems submitted to the WMT14 shared task. The core components of our translation systems are phrase based (Hindi-English) and factored (English-Hindi) SMT systems. We show that the use of number, case and Tree Adjoining Grammar information as factors helps to improve English-Hindi translation, primarily by generating morphological inflections correctly. We show improvements to the translation systems using pre-procesing and post-processing components. To overcome the structural divergence between English and Hindi, we preorder the source side sentence to conform to the target language word order. Since parallel corpus is limited, many words are not translated. We translate out-of-vocabulary words and transliterate named entities in a post-processing stage. We also investigate ranking of translations from multiple systems to select the best translation.

A Connectionist Model for Predicate Logic Reasoning Using Coarse-Coded Distributed Representations
In this paper, we describe a model for reasoning using forward chaining for predicate logic rules... more In this paper, we describe a model for reasoning using forward chaining for predicate logic rules and facts with coarse-coded distributed representations for instantiated predicates in a connectionist frame work. Distributed representations are known to give advantages of good generalization, error correction and graceful degradation of performance under noise conditions. The system supports usage of complex rules which involve multiple conjunctions and disjunctions. The system solves the variable binding problem in a new way using coarse-coded distributed representations of instantiated predicates without the need to decode them into localist representations. Its performance with regard to generalization on unseen inputs and its ability to exhibit fault tolerance under noise conditions is studied and has been found to give good results.
In this paper we report our work on building a POS tagger for a morphologically rich language-Hin... more In this paper we report our work on building a POS tagger for a morphologically rich language-Hindi. The theme of the research is to vindicate the stand that-if morphology is strong and harnessable, then lack of training corpora is not debilitating. We establish a methodology of POS tagging which the resource disadvantaged (lacking annotated corpora) languages can make use of. The methodology makes use of locally annotated modestly-sized corpora (15,562 words), exhaustive morpohological analysis backed by high-coverage lexicon and a decision tree based learning algorithm (CN2). The evaluation of the system was done with 4-fold cross validation of the corpora in the news domain (www.bbc.co.uk/hindi). The current accuracy of POS tagging is 93.45% and can be further improved.
Predicate Preserving Parsing
ABSTRACT
Knowledge Extraction from Hindi Text
... The larger number indicates higher priority. The rules without priority values are deemed to ... more ... The larger number indicates higher priority. The rules without priority values are deemed to have 0 priority. 3 Lexicon The lexicon used for the system consists of mapping of Hindi words to Universal Words and lexical-semantic attributes describing the words. ...

Machine Translation, 2001
Interlingua and transfer-based approaches tomachine translation have long been in use in competin... more Interlingua and transfer-based approaches tomachine translation have long been in use in competing and complementary ways. The former proves economical in situations where translation among multiple languages is involved, and can be used as a knowledge-representation scheme. But given a particular interlingua, its adoption depends on its ability (a) to capture the knowledge in texts precisely and accurately and (b) to handle cross-language divergences. This paper studies the language divergence between English and Hindi and its implication to machine translation between these languages using the Universal Networking Language (UNL). UNL has been introduced by the United Nations University, Tokyo, to facilitate the transfer and exchange of information over the internet. The representation works at the level of single sentences and defines a semantic net-like structure in which nodes are word concepts and arcs are semantic relations between these concepts. The language divergences between Hindi, an Indo-European language, and English can be considered as representing the divergences between the SOV and SVO classes of languages. The work presented here is the only one to our knowledge that describes language divergence phenomena in the framework of computational linguistics through a South Asian language.
Predicate Preserving Parsing
ABSTRACT
Knowledge Extraction from Hindi Text
... The larger number indicates higher priority. The rules without priority values are deemed to ... more ... The larger number indicates higher priority. The rules without priority values are deemed to have 0 priority. 3 Lexicon The lexicon used for the system consists of mapping of Hindi words to Universal Words and lexical-semantic attributes describing the words. ...

Machine Translation, 2001
Interlingua and transfer-based approaches tomachine translation have long been in use in competin... more Interlingua and transfer-based approaches tomachine translation have long been in use in competing and complementary ways. The former proves economical in situations where translation among multiple languages is involved, and can be used as a knowledge-representation scheme. But given a particular interlingua, its adoption depends on its ability (a) to capture the knowledge in texts precisely and accurately and (b) to handle cross-language divergences. This paper studies the language divergence between English and Hindi and its implication to machine translation between these languages using the Universal Networking Language (UNL). UNL has been introduced by the United Nations University, Tokyo, to facilitate the transfer and exchange of information over the internet. The representation works at the level of single sentences and defines a semantic net-like structure in which nodes are word concepts and arcs are semantic relations between these concepts. The language divergences between Hindi, an Indo-European language, and English can be considered as representing the divergences between the SOV and SVO classes of languages. The work presented here is the only one to our knowledge that describes language divergence phenomena in the framework of computational linguistics through a South Asian language.
Predicate Preserving Parsing
ABSTRACT
Knowledge Extraction from Hindi Text
... The larger number indicates higher priority. The rules without priority values are deemed to ... more ... The larger number indicates higher priority. The rules without priority values are deemed to have 0 priority. 3 Lexicon The lexicon used for the system consists of mapping of Hindi words to Universal Words and lexical-semantic attributes describing the words. ...

Machine Translation, 2001
Interlingua and transfer-based approaches tomachine translation have long been in use in competin... more Interlingua and transfer-based approaches tomachine translation have long been in use in competing and complementary ways. The former proves economical in situations where translation among multiple languages is involved, and can be used as a knowledge-representation scheme. But given a particular interlingua, its adoption depends on its ability (a) to capture the knowledge in texts precisely and accurately and (b) to handle cross-language divergences. This paper studies the language divergence between English and Hindi and its implication to machine translation between these languages using the Universal Networking Language (UNL). UNL has been introduced by the United Nations University, Tokyo, to facilitate the transfer and exchange of information over the internet. The representation works at the level of single sentences and defines a semantic net-like structure in which nodes are word concepts and arcs are semantic relations between these concepts. The language divergences between Hindi, an Indo-European language, and English can be considered as representing the divergences between the SOV and SVO classes of languages. The work presented here is the only one to our knowledge that describes language divergence phenomena in the framework of computational linguistics through a South Asian language.
Predicate Preserving Parsing
ABSTRACT
Knowledge Extraction from Hindi Text
... The larger number indicates higher priority. The rules without priority values are deemed to ... more ... The larger number indicates higher priority. The rules without priority values are deemed to have 0 priority. 3 Lexicon The lexicon used for the system consists of mapping of Hindi words to Universal Words and lexical-semantic attributes describing the words. ...

Machine Translation, 2001
Interlingua and transfer-based approaches tomachine translation have long been in use in competin... more Interlingua and transfer-based approaches tomachine translation have long been in use in competing and complementary ways. The former proves economical in situations where translation among multiple languages is involved, and can be used as a knowledge-representation scheme. But given a particular interlingua, its adoption depends on its ability (a) to capture the knowledge in texts precisely and accurately and (b) to handle cross-language divergences. This paper studies the language divergence between English and Hindi and its implication to machine translation between these languages using the Universal Networking Language (UNL). UNL has been introduced by the United Nations University, Tokyo, to facilitate the transfer and exchange of information over the internet. The representation works at the level of single sentences and defines a semantic net-like structure in which nodes are word concepts and arcs are semantic relations between these concepts. The language divergences between Hindi, an Indo-European language, and English can be considered as representing the divergences between the SOV and SVO classes of languages. The work presented here is the only one to our knowledge that describes language divergence phenomena in the framework of computational linguistics through a South Asian language.

We present a universal Parts-of-Speech (POS) tagset framework covering most of the Indian languag... more We present a universal Parts-of-Speech (POS) tagset framework covering most of the Indian languages (ILs) following the hierarchical and decomposable tagset schema. In spite of significant number of speakers, there is no workable POS tagset and tagger for most ILs, which serve as fundamental building blocks for NLP research. Existing IL POS tagsets are often designed for a specific language; the few that have been designed for multiple languages cover only shallow linguistic features ignoring linguistic richness and the idiosyncrasies. The new framework that is proposed here addresses these deficiencies in an efficient and principled manner. We follow a hierarchical schema similar to that of EAGLES and this enables the framework to be flexible enough to capture rich features of a language/ language family, even while capturing the shared linguistic structures in a methodical way. The proposed common framework further facilitates the sharing and reusability of scarce resources in these languages and ensures cross-linguistic compatibility.

We present a universal Parts-of-Speech (POS) tagset framework covering most of the Indian languag... more We present a universal Parts-of-Speech (POS) tagset framework covering most of the Indian languages (ILs) following the hierarchical and decomposable tagset schema. In spite of significant number of speakers, there is no workable POS tagset and tagger for most ILs, which serve as fundamental building blocks for NLP research. Existing IL POS tagsets are often designed for a specific language; the few that have been designed for multiple languages cover only shallow linguistic features ignoring linguistic richness and the idiosyncrasies. The new framework that is proposed here addresses these deficiencies in an efficient and principled manner. We follow a hierarchical schema similar to that of EAGLES and this enables the framework to be flexible enough to capture rich features of a language/ language family, even while capturing the shared linguistic structures in a methodical way. The proposed common framework further facilitates the sharing and reusability of scarce resources in these languages and ensures cross-linguistic compatibility.

The effort required for a human annotator to detect sentiment is not uniform for all texts, irres... more The effort required for a human annotator to detect sentiment is not uniform for all texts, irrespective of his/her expertise. We aim to predict a score that quantifies this effort, using linguistic properties of the text. Our proposed metric is called Sentiment Annotation Complexity (SAC). As for training data, since any direct judgment of complexity by a human annotator is fraught with subjectivity, we rely on cognitive evidence from eye-tracking. The sentences in our dataset are labeled with SAC scores derived from eye-fixation duration. Using linguistic features and annotated SACs, we train a regressor that predicts the SAC with a best mean error rate of 22.02% for five-fold cross-validation. We also study the correlation between a human annotator's perception of complexity and a machine's confidence in polarity determination. The merit of our work lies in (a) deciding the sentiment annotation cost in, for example, a crowdsourcing setting, (b) choosing the right classifier for sentiment prediction. * -Aditya is funded by the TCS Research Fellowship Program.
Uploads
Papers by Pushpak Bhattacharyya