Papers by Prof. Niladri Sekhar Dash

Aphasiology, 2025
Background: Languages vary in their syntactic (e.g. word order in
sentences), lexical (e.g. presence of specific word classes), and
morphological (e.g. inflectional or derivational forms of words)
properties. This cross-linguistic variation in language typology influences
the manifestation of agrammatic symptoms. However, most
theoretical models of agrammatism are based on English and a few
European languages, limiting their generalizability. This narrow
focus neglects the rich syntactic, lexical and morphological variations
available in languages globally. Clinically, the lack of language-
specific characteristics of agrammatic impairments limits
translational potential for precise and improved diagnosis. The
Aphasia in South Asian Languages (ASAL) project addresses this
gap by leveraging interdisciplinary expertise and advances in connected
speech methodologies and analyses. It aims to identify
cross-linguistic features of agrammatic production in post-stroke
aphasia across under-researched languages from two major language
families: Indo-Aryan (Hindi-Urdu, Bengali) and Dravidian
(Tamil, Kannada, Malayalam).
Aim: This paper presents the protocol developed as part of the
ASAL project for eliciting connected speech data to examine cross-linguistic grammatical profiles in aphasia. It outlines procedures for
data collection across five connected speech genres and offers
detailed guidelines for transcription, segmentation and data extraction.
Additionally, it provides recommendations for linguistic analyses
aimed at characterising agrammatism and grammatical
deficits, while also accounting for cross-linguistic variation.
Methods & Procedures: The protocol was designed for cross-sectional
data acquisition from cohorts of people with post-stroke
aphasia meeting the criteria of “agrammatic by clinical standard”
and neurologically unimpaired speakers in each of the five languages.
Data collection procedures are detailed for five connected
speech genres – personal narrative, procedural task, image
sequence, novel story narrative and picture description – with
multiple exemplars in each. Additional data include demographics,
aphasia type and severity, and cognitive assessments (e.g. verbal
fluency, inhibition, memory span, shifting, cognitive screen). The
protocol provides guidance for transcription, data extraction and
recommendations for cross-linguistic analyses, along with results of
preliminary analyses.
Conclusion: The ASAL project is a pioneering initiative investigating
agrammatism in linguistically diverse, under-studied South
Asian clinical populations. This protocol enables researchers to
conduct cross-linguistic studies and develop culturally and linguistically
tailored clinical tools. Specifically, it supports: 1) the identification
of agrammatic features in narrative speech across languages;
2) the development of clinical checklists for identifying grammatical
impairments. This protocol is uniquely positioned to facilitate effective
comparisons between universal and language-specific grammatical
patterns across a broad spectrum of languages and
language families, including multilingual populations and diverse
clinical conditions.

Handbook on Endangered South Asian and Southeast Asian Languages. Springer Nature Switzerland AG, 2025
In this chapter, we address the challenges and the problems that we have faced while trying to develop a dictionary for the Kharia Sabar speech community, an indigenous endangered tribal community living in the district of Purulia in the state of West Bengal, India. The challenges that we have faced may be classified into two broad types, namely, extralinguistic challenges and linguistic challenges (Ivanishcheva, 2016). The extralinguistic challenges are primarily related to awareness about the importance of such a knowledge text among the members of the community; careful investigation of the attitude of the community members relating to the procurement of data and information from their life, living, culture, history, heritage, and ecology (Littell et al., 2017); logistic issues in data collection from the Urheimat (i.e., the primaeval habitation) of the community through on-the-spot interviews; demographic and ethical issues in the selection of appropriate respondents; availability of funds for conducting elaborate linguistic surveys; collection of lexical data covering all major aspects of community life; availability of trained human resources for lexical processing, analysis and dictionary compilation; and availability of agencies willing to publish the dictionary as a commercial product.
The linguistic challenges, on the other hand, are largely associated with the collection of lexical data from community members; the sufficiency, diversity and variety of lexical data types (Lam et al., 2014); the paucity of lexicographic details for entry words; the citation of example sentences for determining usage-based sense variations of polysemous entries; the inadequacy of linguistic description of lexical items for addressing referential and pedagogical requirements (Rehg, 2018); the utilization of pictures, images and diagrams for a visual representation of complex concepts and ideas of the community; and similar other issues. Many of these challenges are linked with several theoretical and ethical issues, all of which combine to make the process of dictionary-making for the Kharia Sabar community an uphill task fraught with many caveats and shortcomings, particularly in those contexts where their folk texts, verbal narratives, written materials and historical records are not available for reference and utilization (Mosel, 2004). Keeping these challenges in view, in this paper, we discuss the strategies that we have adopted to overcome the hurdles we have faced during the stages of compilation of the dictionary, which is being built primarily to help preserve and promote the endangered indigenous language against a backdrop of camouflaged aggression by more powerful neighbouring languages. The dictionary that we are developing will be used by the Kharia Sabar speakers for general reference and pedagogic purposes, while outsiders will use it for academic, commercial and localization purposes.

Developing an Online Platform for Multimodal Lexical Learning for the Bengali Learners
2023 Third International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT)
The usability of a lexical database of a natural language on an online platform is in great demand in this new era of online language education. An interface of this kind can primarily serve three different purposes: first, it plays an operational role in the new frame of the multimodal education system; second, it partly fulfills the need for online lexical resources, which are increasingly required in research and application in the field of language technology; and third, it becomes a useful and omnipresent resource for native and foreign language users who want to learn the meanings and usages of words of a natural language for different personal, academic, philanthropic and commercial purposes. Keeping these applications in view, in this paper, we describe the process that we apply to develop an online platform for the Bengali WordNet, which offers a unique opportunity for multimodal lexical learning to Bengali learners. The proposed work has the potential to fulfill all three needs stated above. While the WordNet provides different kinds of information directly associated with words, the digital platform has a highly responsive, user-friendly interface for accessing the lexical resources stored in the WordNet.

Corpus-based Analysis of the Bengali Language
The book analyses linguistic data and information of modern Bengali as found in a text corpus. It presents methods of Bengali corpus generation and processing (e.g., frequency counting, concordance, lemmatization, key-word-in-context, collocation, annotation, parsing, etc.); analyses the form and function of characters; and explores the structural intricacies of words belonging to different parts of speech. For the first time, modern Bengali is analysed and interpreted here with close reference to a corpus to understand the language from a new perspective. It can be used as a text-cum-reference book in colleges and universities that teach Bengali. Learners will be equipped with new information collected from the corpus and analysed empirically to enhance their linguistic knowledge base. They will also learn how different corpus-processing techniques are applied to the corpus and what kinds of linguistic information are extracted to be used in language description, analysis, and application. Common readers, on the other hand, will get a close look at the modern Bengali language to understand how Bengali people use the language when they compose texts in written form.
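Two of the corpus-processing techniques named above, frequency counting and key-word-in-context (KWIC) display, can be sketched in a few lines of Python. This is only an illustrative sketch: the romanized Bengali sentence and the `kwic` helper are invented for this example, not taken from the book's corpus or tools.

```python
from collections import Counter

# Invented romanized Bengali sentence, tokenized on whitespace
tokens = "ami boi pori ami gan kori ami boi bhalobasi".split()

# Frequency counting: how often each word form occurs
freq = Counter(tokens)

def kwic(tokens, keyword, n=2):
    """Key-word-in-context: each occurrence of `keyword` shown with
    up to n words of left and right context."""
    lines = []
    for i, w in enumerate(tokens):
        if w == keyword:
            left = " ".join(tokens[max(0, i - n):i])
            right = " ".join(tokens[i + 1:i + 1 + n])
            lines.append(f"{left} [{w}] {right}")
    return lines
```

Running `kwic(tokens, "boi")` lines up both occurrences of "boi" with their contexts, which is the basic concordance view the book describes for larger corpora.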
Decomposition of Inflected Verbs
Springer eBooks, 2021
Etymological Annotation
Springer eBooks, 2021
Extratextual Annotation
Springer eBooks, 2021
Language-teaching abstracts, Oct 1, 1973
Language and linguistics: LINGUISTIC DESCRIPTION AND ANALYSIS. 73-225 Desherieva, T. I. [On the question of the relationship of the ergative construction of a sentence to nominative, genitive and dative constructions.] Voprosy Yazykoznaniya (Moscow), 5 (1972), 42-8.

International Journal of Communication, Vol. 34, No. 1-2, pp. 7-39, 2024
In this paper, we make an attempt to identify, mark and analyse those linguistic expressions (including both single- and multiword units) that have temporal implications in Bengali. We also try to investigate how these expressions are used in Bengali to refer to time and time-induced aspects of events, situations, contexts and mental states in various intralinguistic and extralinguistic environments. Understanding these aspects, we believe, can help us develop a socio-cognitive interface for interpreting how these expressions often transgress the spheres of their referential (i.e., denotative) meanings to enter into the spheres of figurative senses. Another notable purpose of this study is to understand how temporal expressions, through semantic gradience (Leech, 1993), acquire a polysemic identity to weave a different conceptual network rarely represented by their referential meanings. For this study, we analyse a large number of temporal expressions collected from some modern Bengali text corpora, and we observe that the primary senses of these temporal expressions often percolate into a discourse to generate different senses and design a better scheme of communication among the interlocutors. We also investigate how speakers use temporal expressions as a productive time-denoting strategy in their discourse; what kinds of goals language users achieve through the use of such expressions; and how such expressions act as strategic devices in understanding word meanings. This empirical study also allows us to address some questions that relate to how spatial expressions denote temporal information; how the concept of time differs between speech communities based on ecolinguistic factors; and how the symbolic use of temporal expressions denotes the movement of time and event in discourse.
The theoretical importance of this study lies in understanding the cognitive strategies used in the conceptualization of time through temporal expressions in natural languages.

International Journal of Translation, 2024
The analysis of some sample sentences of Hindi-Bengali parallel translation corpora shows that modern Hindi has three most frequently used emphatic particles (i.e., bhī "also, too", hī "only, just, alone", and to "indeed"), which need proper attention for their translation into Bengali. Similarly, modern Bengali also possesses three highly frequent emphatic particles (i.e., -o "also, too", -i "only, just, alone", and to "indeed"), which require adequate attention for their appropriate usage in translation. As the name implies, these particles, in some way or other, lend emphasis to a word or a larger part of a sentence (e.g., a phrase or clause) to add some extra shade of meaning to the original meaning represented by the sentence. In most cases, they operate at the lexical level to emphasise the sense tagged to the words and terms they are attached to. At the syntactic level, they primarily express the sense of "even" in non-conditional clauses (or sentences) or a sense of "only" in conditional clauses. Although the emphatic particles of the two sister languages (i.e., Hindi and Bengali) exhibit semantic and functional proximities, due to various linguistic and extralinguistic factors underlying a piece of text, they often deviate from their primary senses when they are used in different sentential contexts. Moreover, their translation from Hindi to Bengali is largely affected by the syntactic, semantic, contextual and pragmatic symmetries and asymmetries noted between the two languages. This paper highlights the process and the strategies that are adopted and applied to translate Hindi emphatic particles into Bengali.
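The default lexical-level correspondence described above (bhī → -o, hī → -i, to → to) can be sketched as a toy substitution rule. The function and word forms below are hypothetical illustrations, not the paper's actual translation procedure: real translation is context-sensitive, as the abstract itself stresses. The sketch assumes only that the Bengali clitics -o and -i attach to their host word while to stands alone.

```python
# Toy default mapping of Hindi emphatic particles to Bengali
# counterparts, per the abstract. Clitics are marked with a
# leading "-" and fuse with the host word; free particles stay
# separate. Invented illustration; context can override this.
HI_TO_BN = {"bhī": "-o", "hī": "-i", "to": "to"}

def render(host_bn, hindi_particle):
    """Attach the Bengali counterpart of a Hindi emphatic particle
    to a Bengali host word (romanized, hypothetical examples)."""
    bn = HI_TO_BN[hindi_particle]
    if bn.startswith("-"):           # clitic: fuse with the host
        return host_bn + bn[1:]
    return host_bn + " " + bn        # free particle: separate word
```

So a Hindi phrase with bhī after a word translated as "ami" would, by this default rule alone, yield "amio"; the paper's point is precisely that such defaults often fail in context.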

Corpus as a Primary Resource for ELT
In this chapter, we argue in favor of teaching English as a second language to non-native learners with the direct utilization of an English Language Corpus (ELC). Keeping the various advantages of the ELC in view, we address here some of the issues relating to the application of the ELC as a primary resource of language data and information to be used in English Language Teaching (ELT) courses for students who are learning English as a second language. We also discuss here how learners can access and refer to both speech and text data of the ELC in a classroom situation or in a language laboratory for their academic activities. The proposed strategy is meant to be assisted by a computer and based on data, information, and examples retrieved from a present-day ELC developed with various text samples composed by native English speakers. The method will be beneficial to learners if it is used with careful manipulation of the tools and techniques of advanced ELT, which advocates the utilization of empirical linguistic resources to empower learners. Finally, we argue that the utilization of relevant linguistic data, information, and examples from the ELC will enhance the linguistic skills and efficiency of English learners in far better ways than our traditional ELT courses do.

Corpus as a Primary Resource for ELT
Utility and Application of Language Corpora, 2018
Compounding is a highly fertile process. It is quite often used in various innovative ways for generating new words in most languages. At the time of compounding, the participating members often undergo a process of morphosyntactic change that forces them to lose much of their lexicosemantic information. In this paper, we make an attempt to capture the lexicosemantic properties which are lost in this process, and try to identify the factors that play active roles behind such metamorphosis of compounds. Our investigation is based on Bengali compounds as the central area of study, with occasional references to English compounds for understanding the phenomenon in a systematic way. The present study has direct applicational relevance in the areas of applied linguistics, mainstream linguistics and language technology.

Application of Expectation–Maximization Algorithm to Solve Lexical Divergence in Bangla–Odia Machine Translation
Smart Innovation, Systems and Technologies, 2022
This paper shows the word alignment between Odia and Bangla using the Expectation-Maximization (EM) algorithm with high-accuracy output. The entire mathematical calculation is worked out and shown here by taking some Bangla-Odia sentences as a set of examples. The EM algorithm helps to find the maximum likelihood probability value with the help of the argmax function, which follows the mapping between two or more words of the source and target language sentences. The lexical relationship among the words of two parallel sentences is known after calculating some mathematical values, and those values indicate which word of the target language is aligned with which word of the source language. As the EM algorithm is an iterative process, the word relationship between the source and target languages is found by calculating probability values in terms of maximum likelihood estimation (MLE) in an iterative way. To find the MLE or maximum a posteriori (MAP) estimate of the parameters in the probability model, the model depends on unobserved latent variable(s). For years, this has been one of the toughest challenges, because the process of lexical alignment for translation involves several machine learning algorithms and mathematical models. Keeping all these issues in mind, we have attempted to describe the nature of the lexical problems that arise at the time of analysing bilingual translated texts between Bangla (as the source language) and Odia (as the target language). In word alignment, handling the 'word divergence' or 'lexical divergence' problem is the main issue and a challenging task; it is not solved by the EM algorithm alone but requires a bilingual dictionary (i.e., a lexical database), which we have examined and tested mathematically.
Problems of word divergence are normally addressed at the phrase level using bilingual dictionaries or lexical databases. The basic challenge lies in the identification of the single-word units of the source text which are converted into multiword units in the target text.
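The iterative EM estimation described above can be illustrated with a minimal IBM-Model-1-style sketch. This is not the paper's implementation: the two "parallel" sentence pairs below are invented toy data with romanized word forms, used only to show how the E-step (expected alignment counts) and M-step (re-estimated translation probabilities) converge so that each target word can be mapped to its most likely source word via argmax.

```python
from collections import defaultdict

# Toy parallel corpus: (source, target) token lists, invented
# romanized Bangla/Odia-like forms for illustration only.
corpus = [
    (["ami", "bhat", "khai"], ["mu", "bhata", "khae"]),
    (["ami", "jal", "khai"], ["mu", "pani", "khae"]),
]

src_vocab = {w for src, _ in corpus for w in src}

# t[(f, e)] = P(target word f | source word e), initialized uniformly
t = defaultdict(lambda: 1.0 / len(src_vocab))

for _ in range(20):                      # EM iterations
    count = defaultdict(float)           # expected alignment counts
    total = defaultdict(float)
    for src, tgt in corpus:
        for f in tgt:
            # E-step: distribute one count for f over all source words
            z = sum(t[(f, e)] for e in src)
            for e in src:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    # M-step: re-estimate translation probabilities from the counts
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

# argmax alignment: each target word maps to its most likely source word
align = {f: max(src_vocab, key=lambda e: t[(f, e)])
         for _, tgt in corpus for f in tgt}
```

After a few iterations the distinctive pairs separate cleanly (e.g. "bhata" aligns to "bhat" and "pani" to "jal"), while words that co-occur with everything stay ambiguous, which is exactly the kind of divergence the paper says EM alone cannot resolve without a bilingual lexical database.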
Role of Artificial Intelligence in Preservation of Culture and Heritage
Routledge eBooks, Sep 1, 2022

The history of the use of language corpora before the digital corpus was generated is shrouded in darkness. In this chapter, we have attempted to shed some light on this dark history. We have tried to study the unmarked history of the processes of generating handmade language corpora over the past 200 years. Tracing through the past, we have described how, in the earlier years, people designed, developed and utilized language corpora in various linguistic studies. First, we have tried to justify the relevance of the survey in the present context of corpus-based linguistic studies; then we have shown how language corpora were used to collect words and other lexical items for compiling general and special dictionaries, such as Johnson's Dictionary (1755), The Oxford English Dictionary (1882), the Supplementary Volumes of the Oxford English Dictionary and the Dictionary of American English. In addition, we have described how good quotations are collected from handmade language corpora to substantiate the definitions of words provided in reference dictionaries; how handmade corpora are used in the lexical study of a language; and how data and information are extracted from handmade corpora for writing grammar books for primary and advanced language learners. Thus we have provided some rudimentary descriptions of the works of earlier scholars who manually designed and developed language corpora based on their personal design principles and utilized these in various ways to address several linguistic requirements.

Digital Text Corpora (Part 1)
The history of digital text corpus generation and usage presents an interesting narrative. It shows how technology has brought about a resurgence in the discipline of linguistics, which was otherwise turning its attention towards a direction of no return. In this chapter, we have briefly described the formation and content of some of the most widely known digital text corpora so far developed in English and some other languages. The goal is to refer to some of the big digital corpora available today with a focus on the patterns of their formation, the type of content included in them, and the way these corpora are being used in various linguistic works. In a step-by-step manner, we have discussed in brief the story of developing the Brown Corpus; described the formation and content of the Lancaster-Oslo-Bergen (LOB) Corpus; presented a short overview of the content and structure of the Australian Corpus of English; briefly reported on the process of generating the Corpus of New Zealand English; described the method of developing the FLOB (Freiburg–LOB) Corpus in parallel to the LOB Corpus with a special goal; and finally reported on the formation of the International Corpus of English as a mission for generating a corpus with different varieties of English used across the world.

Nature of Data
Springer eBooks, 2018
It is always difficult to define the nature of language data since language texts often possess multiple properties, due to which the nature of a particular text may overlap with that of another. However, since it is assumed that a corpus should be marked with the nature of a text, it is necessary to understand how a corpus can differ based on the nature of the text, although mutual interpolation across texts is a common feature in every natural language. Based on the nature of the text, in this chapter, we have argued that a ‘general corpus’ is meant to include all kinds of text available in a language; a ‘special corpus’ is meant to collect data of a special type to be used in special situations; a ‘sample corpus’ should contain a sufficient amount of data from the major text types to be used as a representative sample of these text types; a ‘literary corpus’ should contain only samples from imaginative literary texts; a ‘monitor corpus’, by virtue of its name and nature, must be very large in size, with data taken from all kinds of context and composition and an open possibility of being regularly upgraded and augmented; a ‘multimodal corpus’ is meant to contain texts in all forms (audio, video, textual, sign language, etc.); a ‘sublanguage corpus’ should contain a variety of language data compiled from the ‘subsets’ of the general language; and a ‘controlled language corpus’ should be exclusive in nature since it is meant to put a strong restriction on the grammar, style and vocabulary of a language for the writers of documents belonging to special domains.

Pre-digital Corpora (Part 2)
Springer eBooks, 2018
Following in the footsteps of the previous chapter (Chap. 9), in this chapter, we have presented a short description of the process of corpus generation and utilization in some other domains of linguistic studies before the computer was introduced to the task of digital corpus generation. We have primarily concentrated on some of the core domains of linguistics besides lexicography, which was already addressed in the previous chapter. Here we have discussed the use of language corpora in the study of dialects; described the use of corpora in the analysis of speech patterns and habits; discussed how corpora are used in language pedagogy; presented how corpora are utilized in the second language education of children; provided information on the use of corpora in the study of the stylistic aspects of writers of various periods; and finally, we have discussed how corpora are used in various other fields of linguistics. Through this short presentation, we aim to give the new generation of scholars some idea of the functional relevance of pre-digital handmade language corpora in mainstream linguistic activities that flourished and spread across languages over the last two centuries.