
Using suffix arrays as language models: Scaling the n-gram

2010

Abstract

In this article, we propose the use of suffix arrays to implement n-gram language models with a practically unlimited size of n. These unbounded n-grams are called ∞-grams. This approach allows us to use large contexts efficiently to distinguish between alternative sequences while applying synchronous back-off.
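
The article itself contains no code, but the central idea can be illustrated with a small sketch. Assuming a word-level Python implementation (the function names and toy corpus below are invented for the example, not taken from the paper), a suffix array over the training tokens lets the model count an n-gram of any length with two binary searches, which is what makes unbounded ∞-grams practical:

def build_suffix_array(tokens):
    # Sort suffix start positions lexicographically (quadratic toy version;
    # the paper relies on efficient suffix-array construction instead).
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def ngram_count(tokens, sa, ngram):
    # Count occurrences of an n-gram of any length with two binary searches
    # over the sorted suffixes, comparing only their first len(ngram) tokens.
    query = tuple(ngram)
    n = len(query)

    def prefix(k):
        start = sa[k]
        return tuple(tokens[start:start + n])

    lo, hi = 0, len(sa)
    while lo < hi:                     # first suffix whose prefix >= query
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    lo, hi = first, len(sa)
    while lo < hi:                     # first suffix whose prefix > query
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return lo - first

corpus = "the cat sat on the mat and the cat slept".split()
sa = build_suffix_array(corpus)
print(ngram_count(corpus, sa, ["the", "cat"]))           # 2
print(ngram_count(corpus, sa, ["the", "cat", "slept"]))  # 1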

FAQs

What evidence supports the effectiveness of ∞-grams over traditional n-grams?

The study finds that larger n-grams consistently yield better precision, with an average n-gram size of 3.9 used for the confusibles task. Moreover, using ∞-grams did not decrease performance, which highlights the robustness of the synchronous back-off method.

How does the synchronous back-off method mitigate data sparseness?

Synchronous back-off approximates probabilities with lower-order n-grams when the higher-order n-grams of all alternatives yield zero probability. The model keeps backing off, for all alternatives at the same time, until at least one alternative receives a non-zero probability, as sketched below.
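
As an illustration only, a minimal Python sketch of this mechanism follows; the count function is a naive stand-in for the suffix-array lookup described above, and the names and example sentence are assumptions made for the sketch:

def make_count(tokens):
    # Naive n-gram counter standing in for the suffix-array lookup.
    def count(ngram):
        n = len(ngram)
        return sum(1 for i in range(len(tokens) - n + 1)
                   if tokens[i:i + n] == list(ngram))
    return count

def synchronous_backoff(context, alternatives, count):
    # All alternatives are evaluated with the same context length; only when
    # every alternative has a zero count is the context shortened, for all of
    # them at once, until at least one alternative scores non-zero.
    for start in range(len(context) + 1):
        ctx = list(context[start:])
        counts = {alt: count(ctx + [alt]) for alt in alternatives}
        total = sum(counts.values())
        if total > 0:
            return {alt: c / total for alt, c in counts.items()}
    return {alt: 0.0 for alt in alternatives}    # no evidence at any order

corpus = "he is much taller than his brother and then he left".split()
count = make_count(corpus)
print(synchronous_backoff(["she", "was", "taller"], ["than", "then"], count))
# backs off to the bigram level, where "taller than" gives a non-zero count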

How are contextual errors identified in the proposed model?

At each potential error position, the model generates all possible alternatives and disambiguates them with the language model: the alternative that receives the highest probability given the surrounding large n-gram context is selected as the correction (see the sketch after this answer).
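
The following is a small, illustrative Python sketch of this generate-and-score step for the confusible case; the confusible sets, names, and toy corpus are invented for the example, and the paper's actual scoring is more elaborate:

CONFUSIBLE_SETS = [{"then", "than"}, {"their", "there"}]

def make_count(tokens):
    # Same naive stand-in for the suffix-array lookup as in the previous sketch.
    def count(ngram):
        n = len(ngram)
        return sum(1 for i in range(len(tokens) - n + 1)
                   if tokens[i:i + n] == list(ngram))
    return count

def correct(sentence, position, count, max_context=4):
    # Generate every alternative for the word at this position, then pick the
    # one with the highest count for the longest left context that provides
    # any evidence at all (backing off over all alternatives at once).
    word = sentence[position]
    alternatives = next((s for s in CONFUSIBLE_SETS if word in s), {word})
    for width in range(min(max_context, position), -1, -1):
        context = sentence[position - width:position]
        scores = {alt: count(context + [alt]) for alt in alternatives}
        if any(scores.values()):
            return max(scores, key=scores.get)
    return word                                  # no evidence: keep the word

corpus = "she was taller than her sister and then she smiled".split()
count = make_count(corpus)
print(correct("he is taller then his brother".split(), 3, count))  # -> than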

What categories of contextual errors were evaluated in the research?

The research evaluates three types of contextual errors: confusibles, verb and noun agreement, and prenominal adjective ordering. In each task the erroneous word is itself a valid word, so it can only be detected and corrected from its context.

What role does the British National Corpus play in this research?

The British National Corpus, containing approximately 100 million words, provides the training and testing data for evaluating the ∞-gram model. A consecutive 10% chunk of the corpus was withheld as test material.
