Papers by Menno van Zaanen
In this article we investigate how (computational) grammar inference systems are evaluated and how the evaluation procedure can be improved. First, we describe the currently used evaluation methods and look at the advantages and disadvantages of each method. The main problems of the methods are: the dependency on language experts, the influence of the annotation scheme of language data, and the language dependency of the evaluation. We then propose a new method that allows for an evaluation that is independent of language and annotation scheme. This method requires (syntactically) structured corpora in multiple languages to test for language independence of the grammatical inference system, and corpora structured using different annotation schemes to diminish the influence the annotation has on the evaluation.
Proceedings of the Workshop and Tutorial on Learning Context-Free Grammars, ECML 2003, Cavtat-Dubrovnik, Croatia, September 22-26, 2003
ECML, 2003
Computational Language Learning
Musical parameters in the playlist of a Dutch Crematorium
Mortality, 2016
Evaluation of selection in context-free grammar learning systems

To create a robust parser, the parser must have well-defined behaviour when it is fed incorrect input. There are several ways to cope with incorrect input. The method described here starts by extending an Earley parser so that it can correct erroneous input. The corrections consist of inserting tokens into or deleting tokens from the input, but other corrections can be simulated by combining these two operations. When parsing an input string, typically more than one derivation can be generated. This is especially the case if the input has been corrected. The Data Oriented Parsing (DOP) model is used to disambiguate among the possible derivations. DOP selects the most probable derivation based on a corpus of derivations that were previously generated by the Earley parser. The complete system, i.e. the extended Earley parser in combination with DOP, has been successfully tested on correcting corrupted C programs. The next step is to use the system for error correction in natural language processing.
In this article we will introduce a new approach (and several implementations) to the task of sentence classification, where pre-defined classes are assigned to sentences. This approach concentrates on structural information that is present in the sentences. This information is extracted using machine learning techniques and the patterns found are used to classify the sentences. The approach fits in between the existing machine learning and hand-crafting of regular expressions approaches, and it combines the best of both.
Grammatical inference and computational linguistics
Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference - CLAGI '09, 2009
... The learning concerns finite-state transducers from parallel corpora. Context-free grammars of different types were used for very different tasks: • Alexander Clark, Remi Eyraud and Amaury Habrard (A note ... Springer-Verlag. A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen. 1998. ...
Computational Language Learning (Update of )
Handbook of Logic and Language, 2011
Grammatical Inference: Algorithms and Applications
Lecture Notes in Computer Science, 2000
Arlindo L. Oliveira (Ed.), Grammatical Inference: Algorithms and Applications. 5th International Colloquium, ICGI 2000, Lisbon, Portugal, September 11-13, 2000, Proceedings. Springer.
Proceedings of the Workshop and Tutorial on Learning Context-Free Grammars. Cavtat, Croatia
In this article, we propose the use of suffix arrays to implement n-gram language models with practically unlimited size n. These unbounded n-grams are called ∞-grams. This approach allows us to use large contexts efficiently to distinguish between different alternative sequences while applying synchronous back-off.
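As a rough illustration of why a suffix array removes the need to fix n in advance, the following minimal sketch counts word sequences of any length by binary search over sorted suffixes. It is not the implementation from the article: the function names (`build_suffix_array`, `count_ngram`, `longest_seen_context`) are hypothetical, the construction is a naive sort rather than an efficient algorithm, and the back-off shown is a plain "shorten the context until it has been seen" loop, not the synchronous back-off the abstract mentions.

```python
def build_suffix_array(tokens):
    # Naive construction (assumption: small corpus): sort the start
    # positions of all suffixes by the token sequence they begin.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, ngram, upper):
    # Binary search over suffixes, comparing only their first len(ngram) tokens.
    # upper=False: first position whose prefix is >= ngram.
    # upper=True:  first position whose prefix is >  ngram.
    n = len(ngram)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + n]
        if prefix < ngram or (upper and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_ngram(tokens, sa, ngram):
    # Number of occurrences of ngram (any length) in the corpus.
    ngram = list(ngram)
    return _bound(tokens, sa, ngram, True) - _bound(tokens, sa, ngram, False)


def longest_seen_context(tokens, sa, context):
    # Simplified back-off: drop leading words until the context occurs.
    context = list(context)
    for start in range(len(context)):
        if count_ngram(tokens, sa, context[start:]) > 0:
            return context[start:]
    return []


corpus = "the cat sat on the mat and the cat ran".split()
sa = build_suffix_array(corpus)
print(count_ngram(corpus, sa, ["the", "cat"]))
print(longest_seen_context(corpus, sa, ["a", "dog", "the", "cat"]))
```

The same index answers queries for unigrams, bigrams, or much longer sequences, so the context length is limited only by what actually occurs in the corpus.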
Proceedings of the Australasian Language Technology Workshop, 2007
Question answering on speech transcripts (QAst) is a pilot track of the CLEF competition. In this paper we present our contribution to QAst, which is centred on a study of Named Entity (NE) recognition on speech transcripts, and how it impacts on the accuracy of the final question answering system. We have ported AFNER, the NE recogniser of the AnswerFinder question-answering project, to the set of answer types expected in the QAst track. AFNER uses a combination of regular expressions, lists of names (gazetteers) and ...
Working Notes for the CLEF 2007 Workshop, 2007
Macquarie University's contribution to the QAst track of CLEF is centered on a study of Named Entity (NE) recognition on speech transcripts, and how such NE recognition impacts on the accuracy of the final question answering system. We have ported AFNER, the NE recogniser of the AnswerFinder question-answering project, to the answer types expected in the QAst track. AFNER uses a combination of regular expressions, lists of names (gazetteers) and machine learning. The machine learning component is a Maximum Entropy classifier and was trained on a development set of the AMI corpus. Problems with the scalability of the system and errors in the extracted annotation led to relatively poor performance in general, though the system was second (out of three participants) in one of the QAst subtasks.

In this article we describe our submission to the Dutch-English QA@CLEF task. We took OpenEphyra, a publicly available open-source English question answering system, and turned it into a multi-lingual variant by translating questions from Dutch to English using Systran's online translation system. The current approach has some known problems. For example, we do not distinguish between factoid, list, and definition questions (all questions are treated as factoid questions); OpenEphyra does not provide support text for answers (text in the document surrounding the answer is used as support text); and temporal restrictions and anaphora are not handled at all. The number of modifications to OpenEphyra required to run the experiment was such that, due to time constraints, only one experiment could be submitted. The original idea behind this research was to investigate the impact of the quality of the question analysis. In particular, we are ...
Learning structure using alignment based learning
In this paper we will introduce a new algorithm that learns structure in the form of bracketed sentences using unstructured, untagged sentences. The algorithm is based on the idea of Harris [Har51] stating that constituents of the same type can be replaced within a sentence. The algorithm consists of two phases. In the first phase, called alignment learning, the algorithm
In this paper a new similarity-based learning algorithm, inspired by string edit distance, is applied to the problem of bootstrapping structure from scratch. The algorithm takes a corpus of unannotated sentences as input and returns a corpus of bracketed sentences. The method works on pairs of unstructured sentences or sentences partially bracketed by the algorithm that have one or more words in common. It finds parts of sentences that are interchangeable (i.e. the parts of the sentences that are different in both sentences). These parts are taken as possible constituents of the same type. While this corresponds to the basic bootstrapping step of the algorithm, further structure may be learned from comparison with other (similar) sentences.
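The basic bootstrapping step described above can be sketched roughly as follows. This is an assumption-laden illustration, not the algorithm from the paper: it substitutes Python's `difflib.SequenceMatcher` (a longest-matching-block aligner) for the edit-distance-inspired alignment, handles only a single pair of unbracketed sentences, and the function name `align_and_bracket` is hypothetical.

```python
from difflib import SequenceMatcher


def align_and_bracket(sent_a, sent_b):
    # Align two sentences word by word and bracket the parts that differ,
    # treating them as hypothesised constituents of the same type.
    a, b = sent_a.split(), sent_b.split()
    opcodes = SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()

    # The method only applies to pairs sharing at least one word.
    if not any(tag == "equal" for tag, *_ in opcodes):
        return " ".join(a), " ".join(b)

    out_a, out_b = [], []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            out_a += a[i1:i2]
            out_b += b[j1:j2]
        else:
            # Unequal parts are interchangeable in the shared context.
            # Simplification: empty parts are left unbracketed.
            if a[i1:i2]:
                out_a += ["["] + a[i1:i2] + ["]"]
            if b[j1:j2]:
                out_b += ["["] + b[j1:j2] + ["]"]
    return " ".join(out_a), " ".join(out_b)


print(align_and_bracket("she sees the man", "she sees a big dog"))
```

Here "the man" and "a big dog" occur in the same context ("she sees"), so both are bracketed as candidate constituents; comparing against further sentences would add more brackets.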
Proceedings of the International Multiconference on Computer Science and Information Technology, 2010
Finding regularities in large data sets requires implementations of systems that are efficient in both time and space requirements. Here, we describe a newly developed system that exploits the internal structure of the enhanced suffix array to find significant patterns in a large collection of sequences. The system searches exhaustively for all significantly compressing patterns where patterns may consist of symbols and skips or wildcards. We demonstrate a possible application of the system by detecting interesting patterns in a Dutch and an English corpus.