Academia.edu

Word Segmentation

1,191 papers
150 followers
About this topic
Word segmentation is the process of identifying and separating individual words within a continuous stream of speech or text. It is a critical task in natural language processing and linguistics, enabling the analysis and understanding of language structure and meaning.

Key research themes

1. How can linguistic features and statistical models improve word segmentation in morphologically complex languages?

This research area investigates the integration of linguistically motivated features, such as morphological structures, reduplication, phonotactics, and affixation patterns, into statistical frameworks (e.g., Conditional Random Fields, Hidden Markov Models) for effective word segmentation. Languages like Chinese, Vietnamese, and Turkish pose segmentation challenges for reasons that include the absence of explicit word boundaries, dynamic morphological processes, and productive compounding. Combining linguistic insight with statistical learning yields more precise segmentation that generalizes across corpora and dialectal variation, which is crucial for downstream NLP applications in these languages.

Key finding: Introduces a Conditional Random Field (CRF) word segmentation model for Chinese that effectively incorporates character identity n-grams, morphological features derived from automatically extracted affix and singleton...
Key finding: Presents a customizable segmentation system targeting morphologically derived words (MDWs) in Chinese, modeling words through hierarchical word trees with resolution parameters that control segmentation granularity. By...
Key finding: Develops a hybrid Vietnamese word segmentation system combining minimal deterministic finite-state automata representing a large Vietnamese lexicon, regular expression parsing for lexical phrase identification, and maximal...
Key finding: Proposes a Turkish word segmentation approach that segments continuous character streams lacking spaces, typical of speech-to-text outputs, by utilizing a morphological analyzer designed with grouped morpheme recognition...
Key finding: Develops a morpheme-based POS tagger that segments words into morphemes via forward maximum matching and applies lexicalized HMMs for POS tagging on morpheme sequences. The system highlights the productive morphological...
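
Forward maximum matching, mentioned in the last finding above, can be illustrated with a short sketch. This is a generic textbook version, not the implementation of any paper listed here. The function name `forward_maximum_match`, the `max_len` parameter, and the toy Latin-alphabet lexicon are assumptions made for illustration only.

```python
def forward_maximum_match(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation against a word list.

    At each position, take the longest lexicon entry that starts there.
    If nothing matches, emit a single character and move on.
    """
    if max_len is None:
        # Longest entry in the lexicon bounds how far ahead we need to look.
        max_len = max((len(word) for word in lexicon), default=1)
    max_len = max(1, max_len)

    tokens = []
    i = 0
    while i < len(text):
        end = i + 1  # fallback: a single unmatched character
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                end = j
                break
        tokens.append(text[i:end])
        i = end
    return tokens


# Hypothetical toy lexicon; a real system would use a large dictionary.
toy_lexicon = {"the", "them", "men", "eat"}
print(forward_maximum_match("themeneat", toy_lexicon))
# prints ['them', 'e', 'n', 'eat']
```

The toy run shows the known weakness of the greedy strategy: it commits to "them" and can no longer recover "the men eat". This is one reason the systems above pair matching with statistical or morphological information.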

2. What statistical and computational techniques enhance robust and unsupervised word segmentation across diverse languages and data types?

This theme explores unsupervised and statistical algorithms for word segmentation that utilize probabilistic models, phonotactic cues, and robust classification approaches applied to various types of speech and text data, including phonetic transcriptions, noisy handwritten inputs, and low-resource language settings. It emphasizes methods that are language-agnostic or adaptable by exploiting distributional and phonotactic regularities, such as phone n-grams and transition probabilities, as well as nonparametric statistical distributions to handle real data variability. Such techniques aim to improve segmentation without heavy reliance on large annotated lexicons, which makes them scalable across languages and domains.

Key finding: Introduces 'WordEnds', an unsupervised boundary inference algorithm that uses phone n-gram statistics before and after pauses to bootstrap a discriminative boundary detection model. Tested on adult speech corpora in English...
Key finding: Develops WordSeg, an open-source, modular software framework that standardizes benchmarking and evaluation of unsupervised word segmentation algorithms on child-directed speech corpora. By integrating multiple algorithms and...
Key finding: Proposes a novel word segmentation method that employs the Student's t-distribution for gap classification between word candidates in text line images, exploiting the distribution's robustness to outliers. Experiments across...
Key finding: Compares three Vietnamese segmentation systems: two maximum matching-based methods (one augmented with statistical word and bigram frequencies, another employing pattern matching for pre/post-processing) and a third system...
Key finding: Presents an Arabic OCR system tackling cursive script segmentation through an improved font-independent character segmentation algorithm with rigorous cut filtration techniques robust to letter overlapping and shape...
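
The transition-probability cue named in this theme can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of any paper listed here: the function names, the fixed `threshold`, and the nine-syllable artificial stream are all hypothetical. The forward transition probability of a pair (a, b) is estimated as count(a followed by b) / count(a), and a boundary is placed wherever that value falls below the threshold.

```python
from collections import Counter


def transition_probabilities(units):
    """Estimate P(next = b | current = a) for each adjacent pair in the stream."""
    pair_counts = Counter(zip(units, units[1:]))
    # Count each unit only where it has a successor.
    first_counts = Counter(units[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}


def segment_by_transition_probability(units, threshold):
    """Split a stream of units wherever the transition probability is low."""
    if not units:
        return []
    tp = transition_probabilities(units)
    words = []
    current = [units[0]]
    for a, b in zip(units, units[1:]):
        if tp[(a, b)] < threshold:
            # Low predictability suggests a word boundary between a and b.
            words.append(current)
            current = []
        current.append(b)
    words.append(current)
    return words


# Hypothetical artificial language with three trisyllabic "words".
stream = "tu pi ro go la bu bi da ku go la bu tu pi ro bi da ku tu pi ro".split()
segments = segment_by_transition_probability(stream, threshold=0.75)
print(["".join(word) for word in segments])
# prints ['tupiro', 'golabu', 'bidaku', 'golabu', 'tupiro', 'bidaku', 'tupiro']
```

In this toy stream every within-word transition has probability 1.0 and every cross-word transition has probability 0.5, so any threshold between the two recovers the words. Real speech data is far noisier, and a fixed threshold is a simplification. Published models typically use local minima or richer probabilistic formulations.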

3. How does early language exposure and linguistic structure influence infant word segmentation and vocabulary development?

This theme focuses on psycholinguistic and developmental studies that examine how infants acquire word segmentation abilities and how language-specific rhythmic, prosodic, and statistical cues influence their segmentation of continuous speech. Research evaluates the timing and mechanisms by which infants discern word boundaries, how these skills vary between monolingual and bilingual learners, and how segmentation ability correlates with subsequent vocabulary growth. Understanding these cognitive and linguistic foundations informs models of early language acquisition and supports interventions targeting infant language development.

Key finding: Finds that 7.5-month-old monolingual infants segment bi-syllabic words with the dominant stress pattern of their native language (trochaic for English, iambic for French), and fail in cross-language tests between English and...
Key finding: Reviews literature establishing that infants begin word segmentation ability around 6 months using prosodic cues, phoneme distributions, and syllable transition probabilities, and that exposure to high-quality linguistic...
Key finding: Demonstrates that adults performing joint speech segmentation and word-object mapping tasks achieve better performance than when each task is isolated, revealing synergistic effects when statistical learning of syllable...

All papers in Word Segmentation

The paper presents the development of the first publicly available Urdu n-grams extracted from different books. For the best representation of n-grams, a large Urdu corpus was collected from books covering different domains. The...
Writing systems used in Southeast Asia, such as the Mon, Burmese, Khmer, and Thai scripts, are phonographic systems of Indic origin, traced back to the Brāhmī script of the 3rd century BC, which was designed for an Indo-Aryan language, Prākrit. It is...
A new method for Chinese word segmentation named Conditional F&BMM (Forward and Backward Maximal Matching) which incorporates both bigram statistics (i.e., mutual information and difference of t-test between Chinese characters) and...
The Arabic language is a Semitic language and it exhibits systematic but complex morphological structure based on root-pattern design. The aim of the present paper is to propose a transfer-based approach using morphological analysis which...
The coexistence of a set of these properties in a given language may promote the perception of stressed syllables in relation to other syllables, yielding stress-timing, or make all syllables equally salient, yielding syllable-timing....
This study sets out to explore the basic processing unit in Chinese sentence reading in an eye movement experiment. In the experiment, four analogous conditions were created: normal Chinese sentence with no highlighting, highlighting that...
Chinese lexical analysis consists of word segmentation and part-of-speech tagging. Most previous studies consider them as two separate tasks. In this paper we formalize the two processes as a unique chunking task on a sequence of...
showed that adults were able to segment into words an artificial language that included no pauses or other prosodic cues for word boundaries. We propose an account of their results that requires only limited computational abilities and...
This paper addresses two major problems in the closed task of Chinese word segmentation (CWS): tagging sentences interspersed with non-Chinese words, and long named entity (NE) identification. To resolve the former, we apply K-means clustering...
Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an...
Invited keynote delivered by Dr Fermín Moscoso del Prado Martín. In contrast with a broad corpus of research on the distribution of the frequencies of words, the distribution of the frequencies of phonemic contrasts is remarkably...
Dzongkha, the national language of Bhutan, is continuous in written form and it fails to mark the word boundary. Dzongkha word segmentation is one of the fundamental problems and a prerequisite that needs to be solved before more advanced...
An image-based approach to document image analysis is presented. The methods are motivated by a merged view of shape and textural image properties at multiple scales. The principal binary image operations are morphological and...
Text-line segmentation is considered a crucial step of document analysis and recognition systems because its output serves as the input of recognition systems. Because the same handwritten image page has different...
Unsupervised methods tend to discover highly speaker-specific representations of speech. We propose a method for improving the quality of posteriorgrams generated from an unsupervised model through partitioning of the latent classes. We...
We present a Chinese Named Entity Recognition (NER) system submitted to the closed track of the SIGHAN Bakeoff 2006. We define some additional features by computing statistics over the training corpus. Our system incorporates basic features and...
When infants and toddlers are confronted with sequences of sounds, they are required to segment the sounds into meaningful units to achieve sufficient understanding. Rhythm has been regarded as a crucial cue for segmentation of speech...
This interesting volume represents a collection of high quality essays that touch on some of the more recent and "controversial" topics in creole studies in the last 10 years. In some ways, this volume represents the shift away from...
This paper focuses on the segmentation of unconstrained handwritten documents containing bilingual data at the word level. Most of the official documents available in India are bilingual, i.e., in the regional language as well as English...
Punctuation plays a vital role in written language, enhancing clarity, structure, and communicative intent. While modern artificial intelligence (AI) models have demonstrated advanced capabilities in natural language processing, their...
In this study the perception of word juncture in English and Arabic is investigated. Word juncture is taken as the allophonic, or phonetic, variation at word boundary that is contrastive. It is hypothesized that minimal pairs...
For large-vocabulary continuous speech recognition (LVCSR) of highly-inflected languages, selection of an appropriate recognition unit is the first important step. The morpheme-based approach is often adopted because of its high coverage...
This paper is a report from collective participation in NTCIR-5 Question Answering Challenge between researchers from Mie University, Hokkaido University and Otaru University of Commerce. Although our results were not impressive, we would...
Word segmentation is the first obligatory task in almost all NLP applications, where the initial phase requires tokenization of the input into words. Like other Asian languages such as Chinese, Thai and Myanmar, Urdu also faces word...
This research paper describes a corpus-based transliteration system for the Punjabi language. The existence of two scripts for the Punjabi language has created a script barrier between the Punjabi literature written in India and in Pakistan. This...
Word segmentation is an important prerequisite for almost all Natural Language Processing (NLP) applications. Since the word is a fundamental unit of any language, almost every NLP system first needs to segment input text into a sequence of...
This paper describes an algorithm that uses a morphological analyzer to segment an input Turkish string without any spaces, which may be the output of a speech-to-text application, into words. It is quite possible to use the algorithm on...
A readability formula is obtained that can be used by computer programs for style checking of Japanese texts and does not need syntactic or semantic information. The formula is derived as a linear combination of the surface characteristics of...
by Qin Lu
Automatic term extraction is the first step towards automatic or semi-automatic updating of an existing domain knowledge base. Most previous research applies word segmentation as a preprocessing step to Chinese term extraction. However,...
Word segmentation is the first and obligatory task for every NLP system. For inflectional languages like English, French, and Dutch, word boundaries are simply assumed to be whitespace or punctuation. Whilst in various Asian languages,...
As part of the Ottoman Text Archive Project (Osmanlı Metin Arşivi Projesi), a web interface has been developed for uploading Ottoman Turkish texts, binarizing them, segmenting them into lines and words, labeling and recognizing them, and running tests. This...
This paper reports on a study involving the automatic extraction of Chinese legal terms. We used a word segmented corpus of Chinese court judgments to extract salient legal expressions with standard collocation learning techniques. Our...
A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input and yet be robust enough to yield reliable segmentation results for all variants. Various lexicon-driven approaches to Chinese...
The main drawback of previous Chinese character error detection systems is the high false alarm rate. To solve this problem, we propose a system that combines a statistical method and template matching to detect Chinese character errors....
Chinese texts do not contain spaces as word separators like English and many alphabetic languages. To use Moses to train translation models, we must segment Chinese texts into sequences of Chinese words. Increasingly more software tools...
This paper presents the application of a multi-scale paradigm to Chinese spoken document retrieval (SDR) for improving retrieval performance. Multi-scale refers to the use of both words and subwords for retrieval. Words are basic units in...
The possibility that during Chinese reading information is extracted at the beginning of the current fixation was examined in this study. Twenty-four participants read for comprehension while their eye movements were being recorded. A...
Word segmentation is one of the most important tasks in NLP. For Vietnamese, with its own particular features, this task faces some challenges, especially in word boundary determination. To tackle the task of Vietnamese word...
This paper deals with an Optical Character Recognition (OCR) system for printed Oriya script. The development of OCR for this script is difficult because a large number of character shapes in the script have to be recognized. In the...
COMUNICA is a voice QA system for Brazilian Portuguese with search capabilities for consulting both structured and unstructured datasets. One of the goals of this work is to help address digital inclusion by providing an alternative way...
Abstract: There are immense efforts to design a complete OCR for most of the world's leading languages, however, multilingual documents either of handwritten or of printed form. As a united attempt, Unicode based OCRs were studied...
For languages that have no explicit word boundary such as Thai, Chinese and Japanese, correcting words in text is harder than in English because of additional ambiguities in locating error words. The traditional method handles this by...
This paper describes Forst's approach to university entrance examinations at the NTCIR-11 QA-Lab Task. Our system consists of two types of modules: dedicated modules for each question format and common modules called by the dedicated modules...
Given that nouns rarely appear in isolation in French, infants acquiring the language must often retrieve the underlying representation of vowel-initial lexical forms from liaison contexts that provide conflicting information about the...
Numerous experiments have been performed in the area of optical character recognition. The overall process of optical character recognition can succeed only when a skew-free document input is considered at the time of...
For the computer, word forms in an online text are simply letter sequences between blanks. A rule-based automatic language analysis presupposes, however, that the computer can recognize the individual word forms. This includes assigning...
Word segmentation skills emerge during infancy, but it is unclear to what extent this ability is shaped by experience listening to a specific language or language type. This issue was explored by comparing segmentation of bi-syllabic...
Morphologically rich languages are notoriously challenging to process for downstream NLP applications. This paper presents a new pretrained language model, ByT5-Sanskrit, designed for NLP applications involving the morphologically rich...