Memory-based context-sensitive spelling correction at web scale

We study the problem of correcting spelling mistakes in text using memory-based learning techniques and a very large database of token n-gram occurrences in web text as training data. Our approach uses the context in which an error appears to select the most likely candidate from the words that might have been intended in its place. Using a novel correction algorithm and a massive database of training data, we demonstrate higher accuracy on correcting real-word errors than previous work, and very high accuracy at a new task of ranking corrections to non-word errors given by a standard spelling correction package.
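
The idea of using surrounding context to choose among candidate words can be illustrated with a minimal sketch. It is not the paper's algorithm, which the abstract does not specify: the count table, its values, and the function names `context_score` and `rank_candidates` are all hypothetical, and a plain sum of trigram counts stands in for whatever scoring the authors use.

```python
from typing import Dict, List, Tuple

# Toy n-gram counts standing in for a web-scale database (invented values).
NGRAM_COUNTS: Dict[Tuple[str, ...], int] = {
    ("a", "piece", "of"): 900,
    ("a", "peace", "of"): 40,
    ("piece", "of", "cake"): 500,
    ("peace", "of", "cake"): 1,
}


def context_score(tokens: List[str], index: int, candidate: str, n: int = 3) -> int:
    """Sum the counts of every n-gram window that covers position `index`
    after substituting `candidate` at that position."""
    trial = tokens[:index] + [candidate] + tokens[index + 1:]
    total = 0
    for start in range(index - n + 1, index + 1):
        # Skip windows that would run off either end of the sentence.
        if start < 0 or start + n > len(trial):
            continue
        total += NGRAM_COUNTS.get(tuple(trial[start:start + n]), 0)
    return total


def rank_candidates(tokens: List[str], index: int, candidates: List[str]) -> List[str]:
    """Order candidates from best to worst by their context score."""
    return sorted(candidates, key=lambda c: context_score(tokens, index, c), reverse=True)


if __name__ == "__main__":
    sentence = ["a", "peace", "of", "cake"]
    print(rank_candidates(sentence, 1, ["peace", "piece"]))
```

With the toy counts above, "piece" outranks "peace" in this sentence because both trigram windows covering the position are far more frequent with "piece".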

Michael Flor

Traitement Automatique des Langues (TAL), 53:3, 61-99, 2012

Flor M. (2012). Four types of context for automatic spelling correction. Traitement Automatique des Langues (TAL), 53:3, 61-99. This paper presents an investigation of four types of contextual information for improving the accuracy of automatic correction of single-token non-word misspellings. The task is framed as contextually-informed re-ranking of correction candidates. Immediate local context is captured by word n-gram statistics from a Web-scale language model. The second approach measures how well a candidate correction fits in the semantic fabric of the local lexical neighborhood, using a very large Distributional Semantic Model. In the third approach, recognizing a misspelling as an instance of a recurring word can be useful for re-ranking. The fourth approach looks at context beyond the text itself. If the approximate topic can be known in advance, spelling correction can be biased towards the topic. Effectiveness of the proposed methods is demonstrated with an annotated corpus of 3,000 student essays from international high-stakes English language assessments. The paper also describes an implemented system that achieves high accuracy on this task.

A large scale ranker-based system for search query spelling correction

Daniel Micol

2010

This paper makes three significant extensions to a noisy channel speller designed for standard written text to target the challenging domain of search queries. First, the noisy channel model is subsumed by a more general ranker, which allows a variety of features to be easily incorporated. Second, a distributed infrastructure is proposed for training and applying Web scale n-gram language models. Third, a new phrase-based error model is presented. This model places a probability distribution over transformations between multi-word phrases, and is estimated using large amounts of query-correction pairs derived from search logs. Experiments show that each of these extensions leads to significant improvements over the state-of-the-art baseline methods.

Exploiting syntactic and distributional information for spelling correction with web-scale n-gram models

Joel Tetreault

Proceedings of the …, 2011

We propose a novel way of incorporating dependency parse and word co-occurrence information into a state-of-the-art web-scale n-gram model for spelling correction. The syntactic and distributional information provides extra evidence in addition to that provided by a web-scale n-gram corpus and especially helps with data sparsity problems. Experimental results show that introducing syntactic features into n-gram based models significantly reduces errors by up to 12.4% over the current state-of-the-art. The ...

Real-word spelling correction using Google Web 1T 3-grams

Tarikul Islam

Proceedings of the 2009 Conference on …, 2009

We present a method for detecting and correcting multiple real-word spelling errors using the Google Web 1T 3-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method is focused mainly on how to improve the detection recall (the fraction of errors correctly detected) and the correction recall (the fraction of errors correctly amended), while keeping the respective precisions (the fraction of detections or amendments that are correct) as high as possible. Evaluation results on a standard data set show that our method outperforms two other methods on the same task.
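
The string-matching component mentioned above can be sketched as follows. The abstract does not give the exact normalization or the modifications the authors apply, so this is only an assumed variant: the LCS length is squared and divided by the product of the two string lengths, which keeps the score in [0, 1]. The function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a: str, b: str) -> float:
    """Assumed normalization: LCS length squared over the product of the lengths."""
    if not a or not b:
        return 0.0
    length = lcs_length(a, b)
    return (length * length) / (len(a) * len(b))


if __name__ == "__main__":
    print(lcs_length("casual", "causal"), normalized_lcs("casual", "causal"))
```

A score like this can serve as the string-similarity part of a candidate's ranking, alongside the n-gram frequency evidence that the abstract describes.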

A Benchmark Corpus of English Misspellings and a Minimally-supervised Model for Spelling Correction

Michael Flor

Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, 2019

Spelling correction has attracted a lot of attention in the NLP community. However, models have usually been evaluated on artificially-created or proprietary corpora. A publicly-available corpus of authentic misspellings, annotated in context, is still lacking. To address this, we present and release an annotated data set of 6,121 spelling errors in context, based on a corpus of essays written by English language learners. We also develop a minimally-supervised context-aware approach to spelling correction. It achieves strong results on our data: 88.12% accuracy. This approach can also train with a minimal amount of annotated data (performance reduced by less than 1%). Furthermore, this approach allows easy portability to new domains. We evaluate our model on data from a medical domain and demonstrate that it rivals the performance of a model trained and tuned on in-domain data.

Learning phrase-based spelling error models from clickthrough data

Daniel Micol

2010

This paper explores the use of clickthrough data for query spelling correction. First, large amounts of query-correction pairs are derived by analyzing users' query reformulation behavior encoded in the clickthrough data. Then, a phrase-based error model that accounts for the transformation probability between multi-term phrases is trained and integrated into a query speller system. Experiments are carried out on a human-labeled data set. Results show that the system using the phrase-based error model significantly outperforms its baseline systems.

Real-Word Spelling Correction using Google Web 1T 3-grams

Diana Inkpen

Empirical Methods in Natural Language Processing, 2009

We present a method for detecting and correcting multiple real-word spelling errors using the Google Web 1T 3-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method is focused mainly on how to improve the detection recall (the fraction of errors correctly detected) and the correction

Scaling up context-sensitive text correction

Dan Roth

2001

The main challenge in an effort to build a realistic system with context-sensitive inference capabilities, beyond accuracy, is scalability. This paper studies this problem in the context of a learning-based approach to context-sensitive text correction, the task of fixing spelling errors that result in valid words, such as substituting to for too, casual for causal, and so on. Research papers on this problem have developed algorithms that can achieve fairly high accuracy, in many cases over 90%.

An In-depth Study of the Automatic Detection and Correction of Spelling Mistakes

Iadh Ounis

2005

This paper discusses the issues involved in an information retrieval system when spelling errors are encountered in a query. We look at two classical algorithms that may be used to correct these errors and their consequent effect on the system's performance. The algorithms are the Levenshtein Distance and the Longest Common Subsequence. We experiment on a variety of test data and explore the impact of certain errors on an information retrieval system.
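
For reference, the first of the two classical algorithms named above can be written in a few lines. This is the standard textbook dynamic program for Levenshtein distance with unit costs, not code from the paper; the function name is our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Note that a transposition such as "casual" for "causal" costs two edits under this definition, which is one reason some spelling correctors prefer variants that count an adjacent swap as a single operation.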

Large scale experiments on correction of confused words

David Powers

Proceedings 24th Australian Computer Science Conference. ACSC 2001

This paper describes a new approach to automatically learn contextual knowledge for spelling and grammar correction. We aim particularly to deal with cases where the words are all in the dictionary and so it is not obvious that there is an error. Traditional approaches are dictionary based, or use elementary tagging or partial parsing of the sentence to obtain context knowledge. Our approach uses affix information and only the most frequent words to reduce the complexity in terms of training time and running time for context-sensitive spelling correction. We build large scale confused word sets based on keyboard adjacency and apply our new approach to learn the contextual knowledge to detect and correct them. We explore the performance of autocorrection under conditions where significance and probability are set by the user.

Tarikul Islam

International Conference on Information and Knowledge Management, 2009

Misspelling Correction with Pre-trained Contextual Language Model

Xiaonan Jing

2020 IEEE 19th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC)

Spelling irregularities, known now as spelling mistakes, have been found for several centuries. As humans, we are able to understand most misspelled words based on their location in the sentence, perceived pronunciation, and context. Unlike humans, computer systems do not possess the convenient auto complete functionality of which human brains are capable. While many programs provide spelling correction functionality, many systems do not take context into account. Moreover, Artificial Intelligence systems function in the way they are trained. With many current Natural Language Processing (NLP) systems trained on grammatically correct text data, many are vulnerable to adversarial examples, yet correctly spelled text processing is crucial for learning. In this paper, we investigate how spelling errors can be corrected in context with the pretrained language model BERT. We present two experiments, based on BERT and the edit distance algorithm, for ranking and selecting candidate corrections. The results of our experiments demonstrate that, when combined properly, contextual word embeddings of BERT and edit distance are capable of effectively correcting spelling errors.

Correcting real-word spelling errors by restoring lexical cohesion

Nestor Andres

Spelling errors that happen to result in a real word in the lexicon cannot be detected by a conventional spelling checker. We present a method for detecting and correcting many such errors by identifying tokens that are semantically unrelated to their context and are spelling variations of words that would be related to the context. Relatedness to context is determined by a measure of semantic distance initially proposed by . We tested the method on an artificial corpus of errors; it achieved recall of 23 to 50% and precision of 18 to 25%.

A Winnow-Based Approach to Context-Sensitive Spelling Correction

Dan Roth

Machine Learning, 1999

A large class of machine-learning problems in natural language require the characterization of linguistic context. Two characteristic properties of such problems are that their feature space is of very high dimensionality, and their target concepts refer to only a small subset of the features in the space. Under such conditions, multiplicative weight-update algorithms such as Winnow have been shown to have exceptionally good theoretical properties. We present an algorithm combining variants of Winnow and weighted-majority voting, and apply it to a problem in the aforementioned class: context-sensitive spelling correction. This is the task of fixing spelling errors that happen to result in valid words, such as substituting "to" for "too", "casual" for "causal", etc. We evaluate our algorithm, WinSpell, by comparing it against BaySpell, a statistics-based method representing the state of the art for this task. We find: (1) When run with a full (unpruned) set of features, WinSpell achieves accuracies significantly higher than BaySpell was able to achieve in either the pruned or unpruned condition; (2) When compared with other systems in the literature, WinSpell exhibits the highest performance; (3) The primary reason that WinSpell outperforms BaySpell is that WinSpell learns a better linear separator; (4) When run on a test set drawn from a different corpus than the training set was drawn from, WinSpell is better able than BaySpell to adapt, using a strategy we will present that combines supervised learning on the training set with unsupervised learning on the (noisy) test set.

Real-word spelling correction using Google Web 1T n-gram with backoff

AMINUL ISLAM

2009 International Conference on Natural Language Processing and Knowledge Engineering, 2009

New Language Models for Spelling Correction

Si Aouragh

The International Arab Journal of Information Technology

Correcting spelling errors based on the context is a fairly significant problem in Natural Language Processing (NLP) applications. The majority of the work carried out to introduce the context into the process of spelling correction uses the n-gram language models. However, these models fail in several cases to give adequate probabilities for the suggested solutions of a misspelled word in a given context. To resolve this issue, we propose two new language models inspired by stochastic language models combined with edit distance. A first phase consists in finding the words of the lexicon orthographically close to the erroneous word and a second phase consists in ranking and limiting these suggestions. We have applied the new approach to Arabic language taking into account its specificity of having strong contextual connections between distant words in a sentence. To evaluate our approach, we have developed textual data processing applications, namely the extraction of distant transi...

Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users

ion silviu

2004

Logs of user queries to an internet search engine provide a large amount of implicit and explicit information about language. In this paper, we investigate their use in spelling correction of search queries, a task which poses many additional challenges beyond the traditional spelling correction problem. We present an approach that uses an iterative transformation of the input query strings into other strings that correspond to more and more likely queries according to statistics extracted from internet search query logs.

Towards a single proposal in spelling correction

Koldo Gojenola, Atro Voutilainen

Proceedings of the 17th …, 1998

The study presented here relies on the integrated use of different kinds of knowledge in order to improve first-guess accuracy in non-word context-sensitive correction for general unrestricted texts. State of the art spelling correction systems, e.g.

Toward filling the gap between interactive and fully-automatic spelling correction using the linguistic context

Patrick Ruch

2001

We report on the comparison of different strategies for correcting spelling errors resulting in non-existent words. Unlike interactive spelling checkers, where usually only the left context is available, the system we developed takes advantage of the entire context surrounding misspelling. Moreover, unlike traditional systems, based exclusively on a string-to-string edit distance and a word language model, we explore the use of the part-of-speech for selecting candidates. In conclusion, we show that spelling correction improves by extending the context. The best results are obtained when combining a part-of-speech filter with a word language model, and using both the left and right adjacent contexts.

Search Query Spell Correction with Weak Supervision in E-commerce

Madhura Pande

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)

Misspelled search queries in e-commerce can lead to empty or irrelevant product results. Besides inadvertent typing mistakes, most spelling mistakes occur because the user does not know the correct spelling and therefore types the word as it is pronounced colloquially. This colloquial typing creates countless misspelling patterns for a single correct query. In this paper, we first systematically analyze and group different spelling errors into error classes and then leverage the state-of-the-art Transformer model for contextual spell correction. We overcome the constraint of limited human-labelled data by proposing novel synthetic data generation techniques for voluminous generation of the training pairs needed by data-hungry Transformers, without any human intervention. We further utilize weakly supervised data coupled with curriculum learning strategies to improve on tough spelling mistakes without regressing on the easier ones. We show significant improvements from our model on human-labeled data and online A/B experiments against multiple state-of-the-art models.

Cited by

Misspelling Correction with Pre-trained Contextual Language Model

Xiaonan Jing

2020 IEEE 19th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC)
