Word Segmentation

1,191 papers

150 followers

About this topic

Word segmentation is the process of identifying and separating individual words within a continuous stream of speech or text. It is a critical task in natural language processing and linguistics, enabling the analysis and understanding of language structure and meaning.

Key research themes

1. How can linguistic features and statistical models improve word segmentation in morphologically complex languages?

This research area investigates the integration of linguistically motivated features—such as morphological structures, reduplication, phonotactics, and affixation patterns—into statistical frameworks (e.g., Conditional Random Fields, Hidden Markov Models) for effective word segmentation. Morphologically rich and complex languages like Chinese, Vietnamese, and Turkish pose segmentation challenges due to issues like the absence of explicit word boundaries, dynamic morphological processes, and productive compounding. Leveraging linguistic insights alongside statistical learning facilitates more precise segmentation that generalizes across corpora and dialectal variations, which is crucial for downstream NLP applications in these languages.
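As a rough illustration of how segmentation is commonly cast as a sequence-labelling problem for models such as CRFs or HMMs, the sketch below converts a segmented sentence into per-character B/M/E/S tags (begin, middle, end, single) and back, and builds simple character n-gram features. The function names and feature keys are assumptions made for this example; they are not taken from any of the papers listed here.

```python
def words_to_tags(words):
    """Convert a list of words into one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the word list from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that stops mid-word
        words.append(current)
    return words


def char_features(chars, i):
    """Character identity unigram and bigram features around position i.

    The feature keys ("c0", "c-1", ...) are hypothetical names.
    """
    prev_ch = chars[i - 1] if i > 0 else "<BOS>"
    next_ch = chars[i + 1] if i + 1 < len(chars) else "<EOS>"
    return {
        "c0": chars[i],
        "c-1": prev_ch,
        "c+1": next_ch,
        "c-1c0": prev_ch + chars[i],
        "c0c+1": chars[i] + next_ch,
    }


gold_words = ["我", "喜欢", "自然语言"]
gold_tags = words_to_tags(gold_words)
```

A sequence model would be trained to predict `gold_tags` from the per-character feature dictionaries; linguistically motivated features such as affix lists would be added to the same dictionaries.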

A conditional random field word segmenter for sighan bakeoff 2005

by Chris Manning

2023, Proceedings of the …

Key finding: Introduces a Conditional Random Field (CRF) word segmentation model for Chinese that effectively incorporates character identity n-grams, morphological features derived from automatically extracted affix and singleton... Read more

Customizable segmentation of morphologically derived words in Chinese

by Andi Wu

2016

Key finding: Presents a customizable segmentation system targeting morphologically derived words (MDWs) in Chinese, modeling words through hierarchical word trees with resolution parameters that control segmentation granularity. By... Read more

A Hybrid Approach to Word Segmentation of Vietnamese Texts

by le phuong

2022, Lecture Notes in Computer Science

Key finding: Develops a hybrid Vietnamese word segmentation system combining minimal deterministic finite-state automata representing a large Vietnamese lexicon, regular expression parsing for lexical phrase identification, and maximal... Read more

Turkish word segmentation using morphological analyzer

by Mehmed Ozkan

2025, 7th European Conference on Speech Communication and Technology (Eurospeech 2001)

Key finding: Proposes a Turkish word segmentation approach that segments continuous character streams lacking spaces, typical of speech-to-text outputs, by utilizing a morphological analyzer designed with grouped morpheme recognition... Read more

A Morpheme-based Part-of-Speech Tagger for Chinese

by Jonathan Webster

2023

Key finding: Develops a morpheme-based POS tagger that segments words into morphemes via forward maximum matching and applies lexicalized HMMs for POS tagging on morpheme sequences. The system highlights the productive morphological... Read more

2. What statistical and computational techniques enhance robust and unsupervised word segmentation across diverse languages and data types?

This theme explores unsupervised and statistical algorithms for word segmentation that utilize probabilistic models, phonotactic cues, and robust classification approaches applied to various types of speech and text data, including phonetic transcriptions, noisy handwritten inputs, and low-resource language settings. It emphasizes methods that are language-agnostic or adaptable by exploiting distributional and phonotactic regularities, such as phone n-grams and transition probabilities, as well as nonparametric statistical distributions to handle real data variability. Such techniques aim to improve segmentation without heavy reliance on large annotated lexicons, thus scalable across languages and domains.
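To make the idea of segmenting by transition probabilities concrete, here is a minimal sketch that estimates syllable-to-syllable transition probabilities from unsegmented utterances and places a boundary wherever the probability falls below a threshold. The toy syllable language, the function names, and the threshold value are all assumptions for illustration; published systems use richer statistics and evaluation.

```python
from collections import Counter


def transition_probs(utterances):
    """Estimate P(next syllable | current syllable) from syllable sequences."""
    first_counts, pair_counts = Counter(), Counter()
    for utt in utterances:
        for a, b in zip(utt, utt[1:]):
            first_counts[a] += 1  # count a only when it has a successor
            pair_counts[(a, b)] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def segment(utt, tp, threshold):
    """Split utt wherever the transition probability drops below threshold."""
    if not utt:
        return []
    words, current = [], [utt[0]]
    for a, b in zip(utt, utt[1:]):
        if tp.get((a, b), 0.0) < threshold:
            words.append(current)  # low probability: assume a word boundary
            current = []
        current.append(b)
    words.append(current)
    return words


# Toy corpus built from three made-up words: tu-pi-ro, go-la-bu, bi-da-ku.
corpus = [
    "tu pi ro go la bu bi da ku".split(),
    "go la bu tu pi ro bi da ku".split(),
    "bi da ku go la bu tu pi ro".split(),
    "bi da ku tu pi ro go la bu".split(),
]
tp = transition_probs(corpus)
result = segment(corpus[0], tp, threshold=0.9)
```

In this toy corpus every within-word transition has probability 1.0 and every cross-word transition is at most 2/3, so a threshold of 0.9 recovers the three words. Real speech data is far noisier, which is why a fixed threshold is only a starting point.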

Lexicalized Phonotactic Word Segmentation

by Margaret Fleck

2022

Key finding: Introduces 'WordEnds', an unsupervised boundary inference algorithm that uses phone n-gram statistics before and after pauses to bootstrap a discriminative boundary detection model. Tested on adult speech corpora in English... Read more

WordSeg: Standardizing unsupervised word form segmentation from text

by Georgia Loukatou

2021

Key finding: Develops WordSeg, an open-source, modular software framework that standardizes benchmarking and evaluation of unsupervised word segmentation algorithms on child-directed speech corpora. By integrating multiple algorithms and... Read more

Word Segmentation using the Student’s-t Distribution

by Giorgos Sfikas

2016

Key finding: Proposes a novel word segmentation method that employs the Student’s-t distribution for gap classification between word candidates in text line images, exploiting the distribution’s robustness to outliers. Experiments across... Read more

Word segmentation of Vietnamese texts: a comparison of approaches

by Luong The Anh

2017

Key finding: Compares three Vietnamese segmentation systems: two maximum matching-based methods (one augmented with statistical word and bigram frequencies, another employing pattern matching for pre/post-processing) and a third system... Read more

An Efficient Language-Independent Multi-Font OCR for Arabic Script

by Hussein Osman

2023, Computer Science & Information Technology (CS & IT)

Key finding: Presents an Arabic OCR system tackling cursive script segmentation through an improved font-independent character segmentation algorithm with rigorous cut filtration techniques robust to letter overlapping and shape... Read more

3. How does early language exposure and linguistic structure influence infant word segmentation and vocabulary development?

This theme focuses on psycholinguistic and developmental studies that examine how infants acquire word segmentation abilities and how language-specific rhythmic, prosodic, and statistical cues influence their segmentation of continuous speech. Research evaluates the timing and mechanisms by which infants discern word boundaries, how these skills vary between monolingual and bilingual learners, and how segmentation ability correlates with subsequent vocabulary growth. Understanding these cognitive and linguistic foundations informs models of early language acquisition and supports interventions targeting infant language development.

Word segmentation in monolingual and bilingual infant learners of English and French

by Megha Sundara

2024

Key finding: Finds that 7.5-month-old monolingual infants segment bi-syllabic words with the dominant stress pattern of their native language (trochaic for English, iambic for French), and fail in cross-language tests between English and... Read more

WORD SEGMENTATION SKILL IN INFANTS AND ITS INFLUENCE ON VOCABULARY DEVELOPMENT: A REVIEW

by Orhan Hanbay

2024, RESS Journal

Key finding: Reviews literature establishing that infants begin word segmentation ability around 6 months using prosodic cues, phoneme distributions, and syllable transition probabilities, and that exposure to high-quality linguistic... Read more

Speech Segmentation and Cross-Situational Word Learning in Parallel

by Isabella Prequero

2024, Open Mind

Key finding: Demonstrates that adults performing joint speech segmentation and word-object mapping tasks achieve better performance than when each task is isolated, revealing synergistic effects when statistical learning of syllable... Read more

All papers in Word Segmentation

CLE Urdu Books N-grams

by Qurat ul ain Akram

2025

The paper presents the development of the first publicly available Urdu N-grams, extracted from different books. To represent N-grams well, a large Urdu corpus was collected from books covering different domains. The... more

Enriching Notations in Mainland Southeast Asian Indic Scripts (handout)

by Hideo SAWADA

2025

Writing systems used in Southeast Asia, such as the Mon, Burmese, Khmer, and Thai scripts, are phonogramic systems of Indic origin, traced back to the Brāhmī script of the 3rd century BC, which was designed for an Indo‐Aryan language, Prākrit. It is... more

Ambiguity Resolution in Chinese Word Segmentation

by Benjamin Tsou

2025, Pacific Asia Conference on Language, Information, and Computation

A new method for Chinese word segmentation named Conditional F&BMM (Forward and Backward Maximal Matching) which incorporates both bigram statistics (i.e., mutual information and difference of t-test between Chinese characters) and... more
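For readers unfamiliar with the underlying technique, the following is a minimal sketch of plain forward and backward maximal matching against a lexicon. It is not the Conditional F&BMM method described above (which adds bigram statistics and further information); the lexicon, the `max_len` parameter, and the function names are assumptions for this example.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation, taking the longest lexicon match."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation, taking the longest lexicon match."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


# Hypothetical toy lexicon for illustration only.
lexicon = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"
fmm = forward_max_match(sentence, lexicon)
bmm = backward_max_match(sentence, lexicon)
is_ambiguous = fmm != bmm
```

When the two directions disagree, as they do for this sentence, the span is a candidate segmentation ambiguity, which is the kind of case that statistical evidence is then used to resolve.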

Morphological analysis for rule based machine translation

by Arwa Hatem Alqudsi

2025, 2011 International Conference on Semantic Technology and Information Retrieval

Arabic is a Semitic language that exhibits a systematic but complex morphological structure based on root-pattern design. The aim of the present paper is to propose a transfer-based approach using morphological analysis which... more

From signal to grammar: Rhythm and the acquisition of syllable structure

by Marina Vigário

2025, Proceedings of the 27th Annual Boston …

The coexistence of a set of these properties in a given language may promote the perception of stressed syllables in relation to other syllables - yielding stress-timing -, or make all syllables equally salient - yielding syllable-timing.... more

Is the word the basic processing unit in Chinese sentence reading: An eye movement study

by dahai yu

2025, Lingua

This study sets out to explore the basic processing unit in Chinese sentence reading in an eye movement experiment. In the experiment, four analogous conditions were created: normal Chinese sentence with no highlighting, highlighting that... more

A morpheme-based lexical chunking system for Chinese

by Jonathan Webster

2025, 2008 International Conference on Machine Learning and Cybernetics

Chinese lexical analysis consists of word segmentation and part-of-speech tagging. Most previous studies consider them as two separate tasks. In this paper we formalize the two processes as a unique chunking task on a sequence of... more

PARSER: A Model for Word Segmentation

by Annie Vinter

2025, Journal of Memory and Language

Saffran, Newport, and Aslin (1996) showed that adults were able to segment into words an artificial language that included no pauses or other prosodic cues for word boundaries. We propose an account of their results that requires only limited computational abilities and... more

On Closed Task of Chinese Word Segmentation: An Improved CRF Model Coupled with Character Clustering and Automatically Generated Template Matching

by Hong-jie Dai

2025, Meeting of the Association for Computational Linguistics

This paper addresses two major problems in the closed task of Chinese word segmentation (CWS): tagging sentences interspersed with non-Chinese words, and long named entity (NE) identification. To resolve the former, we apply K-means clustering... more

ByteSpan: Information-Driven Subword Tokenisation

by Suchir Salhan

2025, ICML Tokenization Workshop (TokShop)

Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an... more

The Distribution of Phonemes across Languages: Chance, costs, and integration across linguistic tiers

by Suchir Salhan

2025, The 13th International Conference on the Mental Lexicon (Montréal, Québec, Canada)

Invited Keynote Delivered by Dr Fermín Moscoso del Prado Martín In contrast with a broad corpus of research on the distribution of the frequencies of words, the distribution of the frequencies of phonemic contrasts is remarkably... more

Dzongkha Word Segmentation

by Sithar Norbu

2025

Dzongkha, the national language of Bhutan, is continuous in written form and it fails to mark the word boundary. Dzongkha word segmentation is one of the fundamental problems and a prerequisite that needs to be solved before more advanced... more

Multiresolution Morphological Approach to Document Image Analysis

by Dan Bloomberg

2025

An image-based approach to document image analysis is presented. The methods are motivated by a merged view of shape and textural image properties at multiple scales. The principal binary image operations are morphological and... more

A multilevel text line segmentation framework for handwritten historical documents

by Volker Märgner

2025, Proceedings - International Workshop on Frontiers in Handwriting Recognition, IWFHR

Text-line segmentation is a crucial step in document analysis and recognition systems because its output serves as the input to recognition. Because the same handwritten image page has different... more

Partitioning of Posteriorgrams Using Siamese Models for Unsupervised Acoustic Modelling

by Giampiero Salvi

2025

Unsupervised methods tend to discover highly speaker-specific representations of speech. We propose a method for improving the quality of posteriorgrams generated from an unsupervised model through partitioning of the latent classes. We... more

Chinese Named Entity Recognition with Conditional Random Fields

by Hitoshi Isahara

2025, Meeting of the Association for Computational Linguistics

We present a Chinese Named Entity Recognition (NER) system submitted to the closed track of SIGHAN Bakeoff 2006. We define some additional features from statistics computed on the training corpus. Our system incorporates basic features and... more

Segmentation of Rhythmic Units in Word Speech by Japanese Infants and Toddlers

by Izumi Uehara

2025, Frontiers in Psychology

When infants and toddlers are confronted with sequences of sounds, they are required to segment the sounds into meaningful units to achieve sufficient understanding. Rhythm has been regarded as a crucial cue for segmentation of speech... more

The structure of creole words. Segmental, syllabic and morphological aspects, edited by Parth Bhatt and Ingo Plag

by Parth Bhatt

2025, Journal of Pidgin and Creole Languages

This interesting volume represents a collection of high quality essays that touch on some of the more recent and "controversial" topics in creole studies in the last 10 years. In some ways, this volume represents the shift away from dominant debates or dichotomies that had previously defined creole studies, e.g., superstrata versus substrata, substrata versus universals (or the language bio-program hypothesis [LBH]). Bickerton's work on the LBH dominated much of the creolistics research of the 1980s-1990s, even if most of the research in print reflected significantly more detractors arguing against the hypothesis than supporters of it. In the same way, McWhorter's 1998 article has come to dominate much of the research in the early 21st century. Most work seems not to support his Prototype Hypothesis (PH), but to argue against it. Several of the articles in the present volume are reactions to/against McWhorter's work. The book is dedicated to the memory of Jacques Arends. In fact, the last article (pp. 223-241) represents one of Jacques's last published works (written in collaboration with his students) since his death in 2005. As many have already written, he and his excellent scholarship will be missed. This volume of 11 papers contains a subset of those that were presented at the "Second International Workshop on the Phonology and Morphology of Creole Languages", held at the University of Siegen in October 2003. As the (sub)title suggests, the book is divided into three parts: "Section 1: segmental aspects" (pp. 3-82); "Section 2: Syllabic aspects" (pp. 85-150); and "Section 3: Morphological aspects" (pp. 153-241). Instead of giving an inventory of all the articles, I will touch upon the chapters that I found the most illuminating or provocative. The first article, by Thomas B. Klein, presents an inventory of phonemes in so-called creole languages. Klein concludes that, as regards vowels and consonants, creole languages are fairly typical (i.e., the typological middle ground) in terms of phonemic inventory among the world's languages. That is, there is nothing reduced or simplified about the phonemes in creoles. He writes: "The vast majority of Creole languages exhibit the typical non-Creole inventory size of 20-37 phonemes" (p. 8). Klein reaches this conclusion by comparing creole

Heuristic-based text segmentation of bilingual handwritten documents for Gurumukhi-Latin scripts

by Seema Bawa

2025

This paper focuses on the segmentation of unconstrained handwritten documents containing bilingual data at the word level. Most of the official documents available in India are bilingual, i.e., in the regional language as well as English... more

"Teaching AI to Understand Punctuation Systems Across Languages: A Case Study on Burmese and Multilingual Models"

by Htein Win

2025, Artificial Intelligence

Punctuation plays a vital role in written language, enhancing clarity, structure, and communicative intent. While modern artificial intelligence (AI) models have demonstrated advanced capabilities in natural language processing, their... more

The Perception of Word Juncture in English and Arabic

by Anmar Hammoodi Saeed

2025, College Of Basic Education Researches Journal

In this study the perception of word juncture in English and Arabic is investigated. Word juncture is taken as the allophonic, or phonetic, variation at word boundary that is contrastive. It is hypothesized that minimal pairs... more

Morpheme concatenation approach in language modeling for large-vocabulary Uyghur speech recognition

by Mijit Ablimit

2025, 2011 International Conference on Speech Database and Assessments (Oriental COCOSDA)

For large-vocabulary continuous speech recognition (LVCSR) of highly-inflected languages, selection of an appropriate recognition unit is the first important step. The morpheme-based approach is often adopted because of its high coverage... more

Three systems and one verifier-HOKUM’s participation in QAC3 of NTCIR-5

by Fumito Masui

2025, Proceedings of the Fifth NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access

This paper is a report from collective participation in NTCIR-5 Question Answering Challenge between researchers from Mie University, Hokkaido University and Otaru University of Commerce. Although our results were not impressive, we would... more

A Word Segmentation System for Handling Space Omission Problem in Urdu Script

by Gurpreet Singh Lehal

2025

Word Segmentation is the foremost obligatory task in almost all the NLP applications, where the initial phase requires tokenization of input into words. Like other Asian languages such as Chinese, Thai and Myanmar, Urdu also faces word... more

Shahmukhi to Gurmukhi Transliteration System: A Corpus based Approach

by Gurpreet Singh Lehal

2025, Research on computing science

This research paper describes a corpus based transliteration system for Punjabi language. The existence of two scripts for Punjabi language has created a script barrier between the Punjabi literature written in India and in Pakistan. This... more

A Transliteration Based Word Segmentation System for Shahmukhi Script

by Gurpreet Singh Lehal

2025, Communications in computer and information science

Word Segmentation is an important prerequisite for almost all Natural Language Processing (NLP) applications. Since word is a fundamental unit of any language, almost every NLP system first needs to segment input text into a sequence of... more

Turkish word segmentation using morphological analyzer

by Mehmed Ozkan

2025, 7th European Conference on Speech Communication and Technology (Eurospeech 2001)

This paper describes an algorithm that uses a morphological analyzer to segment into words an input Turkish string without any spaces, such as the output of a speech-to-text application. It is quite possible to use the algorithm on... more

A computer readability formula of Japanese texts for machine scoring

by Yuka Tateisi

2025, Proceedings of the 12th conference on Computational linguistics -

A readability formula is obtained that can be used by computer programs for style checking of Japanese texts and does not need syntactic or semantic information. The formula is derived as a linear combination of the surface characteristics of... more

A Comparative Study of the Effect of Word Segmentation On Chinese Terminology Extraction

by Qin Lu

2025

Automatic term extraction is the first step towards automatic or semi-automatic updating of an existing domain knowledge base. Most research applies word segmentation as a preprocessing step to Chinese term extraction. However,... more

Vietnamese Word Segmentation

by Nguyen Van Toan

2025, Proceedings of NLPRS'01

Word segmentation is the first and obligatory task for every NLP system. For inflectional languages like English, French, and Dutch, word boundaries are simply assumed to be whitespace or punctuation, whilst in various Asian languages,... more

OTAP ottoman archives internet interface

by Emre Erdem Sahin

2025, 2012 20th Signal Processing and Communications Applications Conference (SIU)

Within the scope of the Ottoman Text Archive Project, a web interface has been developed for uploading, binarizing, line and word segmenting, labeling, recognizing, and testing Ottoman Turkish texts. This... more

Automatic Corpus-Based Extraction of Chinese Legal Terms

by Benjamin Tsou

2025, NLPRS

This paper reports on a study involving the automatic extraction of Chinese legal terms. We used a word segmented corpus of Chinese court judgments to extract salient legal expressions with standard collocation learning techniques. Our... more

A realistic and robust model for Chinese word segmentation

by Chu-Ren Huang

2025, arXiv (Cornell University)

A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input and yet be robust enough to yield reliable segmentation results for all variants. Various lexicon-driven approaches to Chinese... more

Reducing the False Alarm Rate of Chinese Character Error Detection and Correction

by Chao-Lin Liu

2025

The main drawback of previous Chinese character error detection systems is the high false alarm rate. To solve this problem, we propose a system that combines a statistical method and template matching to detect Chinese character errors.... more

Using Parallel Corpora to Automatically Generate Training Data for Chinese Segmenters in NTCIR PatentMT Tasks

by Chao-Lin Liu

2025

Chinese texts do not contain spaces as word separators like English and many alphabetic languages. To use Moses to train translation models, we must segment Chinese texts into sequences of Chinese words. Increasingly more software tools... more

Multi-Scale Spoken Document Retrieval for Cantonese Broadcast News

by Helen Meng

2025, International Journal of Speech Technology

This paper presents the application of a multi-scale paradigm to Chinese spoken document retrieval (SDR) for improving retrieval performance. Multi-scale refers to the use of both words and subwords for retrieval. Words are basic units in... more

Early parafoveal processing in reading Chinese sentences

by Ralph Radach

2025, Acta Psychologica

The possibility that during Chinese reading information is extracted at the beginning of the current fixation was examined in this study. Twenty-four participants read for comprehension while their eye movements were being recorded. A... more

A Hybrid Approach to Vietnamese Word Segmentation Using Part of Speech Tags

by Sơn Phạm

2025, 2009 International Conference on Knowledge and Systems Engineering

Word segmentation is one of the most important tasks in NLP. For Vietnamese, with its own particular features, this task faces some challenges, especially in word boundary determination. To tackle the task of Vietnamese word... more

Automatic recognition of printed Oriya script

by Umapada Pal

2024, Sadhana-academy Proceedings in Engineering Sciences

This paper deals with an Optical Character Recognition (OCR) system for printed Oriya script. The development of OCR for this script is difficult because a large number of character shapes in the script have to be recognized. In the... more

COMUNICA: a question answering system for Brazilian Portuguese

by Leandro Krug Wives

2024

COMUNICA is a voice QA system for Brazilian Portuguese with search capabilities for consulting both structured and unstructured datasets. One of the goals of this work is to help address digital inclusion by providing an alternative way... more

Multilingual OCR'MOCR': An Approach to Classify Words to Languages

by Shahin Alam

2024, International Journal

Abstract: There are immense efforts to design a complete OCR for most of the world's leading languages, however, multilingual documents either of handwritten or of printed form. As a united attempt, Unicode based OCRs were studied... more

Progress of combining trigram and Winnow in Thai OCR error correction

by Boonserm Kijsirikul

2024

For languages that have no explicit word boundary such as Thai, Chinese and Japanese, correcting words in text is harder than in English because of additional ambiguities in locating error words. The traditional method handles this by... more

Forst: Question Answering System Using Basic Element at NTCIR-11 QA-Lab Task

by Hideyuki Shibuki

2024, NTCIR

This paper describes Forst's approach to university entrance examinations at NTCIR-11 QA-Lab Task. Our system consists of two types of modules: dedicated modules for each question format and common modules called by the dedicated modules... more

Competing models of liaison acquisition: Evidence from corpus and experimental data

by geraldine legendre

2024, Language

Given that nouns rarely appear in isolation in French, infants acquiring the language must often retrieve the underlying representation of vowel-initial lexical forms from liaison contexts that provide conflicting information about the... more

Skew Detection and Correction of Handwritten Document Images using Multiple Projection Profiles and Skew Angles

by hasan mithun

2024

Numerous experiments have been performed in the area of optical character recognition. The overall process of optical character recognition can succeed only when a skew-free document is provided as input at the time of... more

Three principled methods of automatic word form recognition

by Roland R . Hausser

2024

For the computer, word forms in an online text are simply letter sequences between blanks. A rule-based automatic language analysis presupposes, however, that the computer can recognize the individual word forms. This includes assigning... more

Word segmentation in monolingual and bilingual infant learners of English and French

by Megha Sundara

2024

Word segmentation skills emerge during infancy, but it is unclear to what extent this ability is shaped by experience listening to a specific language or language type. This issue was explored by comparing segmentation of bi-syllabic... more

One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks

by Sebastian Nehrdich

2024, One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks

Morphologically rich languages are notoriously challenging to process for downstream NLP applications. This paper presents a new pretrained language model, ByT5-Sanskrit, designed for NLP applications involving the morphologically rich... more
