Mention detection is an important preprocessing step for annotation and interpretation in applications such as NER and coreference resolution, but few stand-alone neural models have been proposed that are able to handle the full range of mentions. In this work, we propose and compare three neural network-based approaches to mention detection. The first approach is based on the mention detection part of a state-of-the-art coreference resolution system; the second uses ELMo embeddings together with a bidirectional LSTM and a biaffine classifier; the third uses the recently introduced BERT model. Our best model (using a biaffine classifier) achieves gains of up to 1.8 percentage points on mention recall when compared with a strong baseline in a high-recall coreference annotation setting. The same model achieves improvements of up to 5.3 and 6.2 p.p. when compared with the best-reported mention detection F1 on the CoNLL and CRAC coreference data sets respectively in a high-F1 annotation setting. We then evaluate our models for coreference resolution by using mentions predicted by our best model in state-of-the-art coreference systems. The enhanced models achieve absolute improvements of up to 1.7 and 0.7 p.p. when compared with our strong baseline systems (a pipeline system and an end-to-end system) respectively. For nested NER, the evaluation of our model on the GENIA corpus shows that it matches or outperforms state-of-the-art models despite not being specifically designed for this task.
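A minimal sketch of the kind of biaffine span scorer the second approach describes, written in PyTorch. The class name, layer sizes, and the plain randomly initialised BiLSTM standing in for ELMo-based encodings are illustrative assumptions, not the paper's implementation; the point is how a biaffine product turns per-token start/end representations into a score for every candidate mention span.

```python
# Hedged sketch: a minimal biaffine span (mention) scorer, not the authors' code.
import torch
import torch.nn as nn

class BiaffineMentionScorer(nn.Module):
    """Scores every (start, end) token pair as a candidate mention span."""
    def __init__(self, enc_dim: int, head_dim: int = 150):
        super().__init__()
        self.start_mlp = nn.Sequential(nn.Linear(enc_dim, head_dim), nn.ReLU())
        self.end_mlp = nn.Sequential(nn.Linear(enc_dim, head_dim), nn.ReLU())
        # Biaffine weight: one extra dimension on each side acts as a bias term.
        self.U = nn.Parameter(torch.zeros(head_dim + 1, head_dim + 1))
        nn.init.xavier_uniform_(self.U)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (seq_len, enc_dim), e.g. BiLSTM outputs over contextual
        # embeddings; returns (seq_len, seq_len) scores where entry [i, j]
        # scores the span starting at token i and ending at token j.
        s = self.start_mlp(token_states)
        e = self.end_mlp(token_states)
        ones = torch.ones(token_states.size(0), 1)
        s = torch.cat([s, ones], dim=-1)
        e = torch.cat([e, ones], dim=-1)
        return s @ self.U @ e.t()

# Usage with stand-in encoder states (random tensors instead of real embeddings).
encoder = nn.LSTM(input_size=300, hidden_size=200, bidirectional=True, batch_first=True)
embeddings = torch.randn(1, 12, 300)            # one sentence, 12 tokens
states, _ = encoder(embeddings)                 # (1, 12, 400)
scores = BiaffineMentionScorer(enc_dim=400)(states[0])
print(scores.shape)                             # torch.Size([12, 12])
```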
Large language models (LLMs) have shown impressive results while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial in this setting. We formulate and study Attributed QA as a key first step in the development of attributed LLMs. We propose a reproducible evaluation framework for the task and benchmark a broad set of architectures. We take human annotations as a gold standard and show that a correlated automatic metric is suitable for development. Our experimental work gives concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and gives some hints as to how to address a third (How to build LLMs with attribution?). We publicly release all system responses and their human and automatic ratings at https://github.com/google-research-datasets/Attributed-QA. By "direct supervision" we refer to labeled examples for the specific task in mind, for example datasets such as the Natural Questions corpus (Kwiatkowski et al., 2019) for question answering; we use the term to distinguish this form of supervision from the "self-supervision" sometimes used in the context of LLMs.
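A hedged sketch of the evaluation loop such an Attributed QA framework implies: each system output is a (question, answer, cited passage) triple, and a judge decides whether the passage supports the answer. The real automatic metric in this line of work is typically model-based (e.g. an entailment model); the lexical-overlap judge below is only a crude, dependency-free stand-in so the harness runs end to end, and all names are illustrative.

```python
# Hedged sketch of an attribution-style check, NOT the paper's metric.
from dataclasses import dataclass

@dataclass
class AttributedAnswer:
    question: str
    answer: str
    attribution: str   # the passage the system cites as evidence

def _tokens(text: str) -> set:
    return {t.strip(".,!?") for t in text.lower().split()}

def overlap_judge(example: AttributedAnswer, threshold: float = 0.6) -> bool:
    """Stand-in judge: is most of the answer's vocabulary in the cited passage?"""
    answer, passage = _tokens(example.answer), _tokens(example.attribution)
    return bool(answer) and len(answer & passage) / len(answer) >= threshold

def attribution_score(examples, judge=overlap_judge) -> float:
    """Fraction of answers judged to be supported by their cited passage."""
    return sum(judge(ex) for ex in examples) / max(len(examples), 1)

examples = [
    AttributedAnswer("Who wrote Hamlet?", "William Shakespeare",
                     "Hamlet is a tragedy written by William Shakespeare."),
    AttributedAnswer("Who wrote Hamlet?", "Christopher Marlowe",
                     "Hamlet is a tragedy written by William Shakespeare."),
]
print(attribution_score(examples))  # 0.5 with the crude overlap judge
```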
Transactions of the Association for Computational Linguistics, 2023
Most recent coreference resolution systems use search algorithms over possible spans to identify mentions and resolve coreference. We instead present a coreference resolution system that uses a text-to-text (seq2seq) paradigm to predict mentions and links jointly. We implement the coreference system as a transition system and use multilingual T5 as the underlying language model. We obtain state-of-the-art accuracy on the CoNLL-2012 data sets with an 83.3 F1-score for English (2.3 points higher than previous work (Dobrovolskii, 2021)) using only CoNLL data for training, a 68.5 F1-score for Arabic (+4.1 over previous work), and a 74.3 F1-score for Chinese (+5.3). In addition, we use the SemEval-2010 data sets for experiments in a zero-shot setting, a few-shot setting, and a supervised setting using all available training data. We obtain substantially higher zero-shot F1-scores than previous approaches for three out of four languages and significantly exceed previous supervised state-of-the-art results for all five tested languages. We provide the code and models as open source.
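A toy illustration of casting coreference as a transition sequence that a text-to-text model could emit: one action per mention, either opening a new cluster or linking back to an earlier mention. The action strings ("NEW", "LINK i") are an illustrative assumption rather than the paper's actual target vocabulary, and the language-model component is omitted; the sketch only shows that decoding such a sequence recovers the full clustering.

```python
# Hedged sketch of a link-append style transition system for coreference.
def apply_transitions(mentions, actions):
    """mentions: spans in document order; actions: 'NEW' or 'LINK i' per mention."""
    clusters = []              # each cluster is a list of mention indices
    cluster_of = {}            # mention index -> cluster id
    for i, action in enumerate(actions):
        if action == "NEW":
            cluster_of[i] = len(clusters)
            clusters.append([i])
        else:                  # "LINK j": attach mention i to mention j's cluster
            j = int(action.split()[1])
            cid = cluster_of[j]
            cluster_of[i] = cid
            clusters[cid].append(i)
    return [[mentions[i] for i in c] for c in clusters]

mentions = ["Speaker A", "her", "the report", "it"]
actions  = ["NEW", "LINK 0", "NEW", "LINK 2"]
print(apply_transitions(mentions, actions))
# [['Speaker A', 'her'], ['the report', 'it']]
```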
The rise of neural networks, and particularly recurrent neural networks, has produced significant advances in part-of-speech tagging accuracy (Zeman et al., 2017). One characteristic common among these models is the presence of rich initial word encodings. These encodings are typically composed of a recurrent character-based representation together with learned and pre-trained word embeddings. However, these encodings do not consider a context wider than a single word, and it is only through subsequent recurrent layers that word or sub-word information interacts. In this paper, we investigate models that use recurrent neural networks with sentence-level context for the initial character- and word-based representations. In particular, we show that optimal results are obtained by integrating these context-sensitive representations through synchronized training with a meta-model that learns to combine their states. We present results on part-of-speech and morphological tagging with state-of-the-art performance on a number of languages.
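A hedged sketch of the sentence-level character and word views with a meta layer that combines their states, in PyTorch. Sizes, names, and the choice to read one character state per word at its final character are illustrative assumptions; the paper's synchronized training procedure is not shown, only the data flow.

```python
# Hedged sketch: two sentence-level encoders plus a meta layer over their states.
import torch
import torch.nn as nn

class TaggerWithMeta(nn.Module):
    def __init__(self, n_chars, n_words, n_tags, dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.word_emb = nn.Embedding(n_words, dim)
        # Character view: BiLSTM over all characters of the sentence, from which
        # one state per word is read (here: the state at the word's last character).
        self.char_lstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        # Word view: BiLSTM over word embeddings.
        self.word_lstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        # Meta layer: BiLSTM over the concatenated states of both views.
        self.meta_lstm = nn.LSTM(4 * dim, dim, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * dim, n_tags)

    def forward(self, char_ids, last_char_index, word_ids):
        # char_ids: (1, n_chars_in_sentence); last_char_index: positions of each
        # word's final character; word_ids: (1, n_words)
        char_states, _ = self.char_lstm(self.char_emb(char_ids))
        char_word_states = char_states[:, last_char_index, :]   # one state per word
        word_states, _ = self.word_lstm(self.word_emb(word_ids))
        meta_states, _ = self.meta_lstm(
            torch.cat([char_word_states, word_states], dim=-1))
        return self.out(meta_states)                             # (1, n_words, n_tags)

model = TaggerWithMeta(n_chars=100, n_words=1000, n_tags=17)
chars = torch.randint(0, 100, (1, 23))
last_char = torch.tensor([3, 8, 14, 22])        # 4 words ending at these characters
words = torch.randint(0, 1000, (1, 4))
print(model(chars, last_char, words).shape)     # torch.Size([1, 4, 17])
```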
This paper presents results from the Third Shared Task on Multilingual Surface Realisation (SR'20), which was organised as part of the COLING'20 Workshop on Multilingual Surface Realisation. As in SR'18 and SR'19, the shared task comprised two tracks: (1) a Shallow Track, where the inputs were full UD structures with word order information removed and tokens lemmatised; and (2) a Deep Track, where, in addition, functional words and morphological information were removed. Moreover, each track had two subtracks: (a) restricted-resource, where only the data provided or approved as part of a track could be used for training models, and (b) open-resource, where any data could be used. The Shallow Track was offered in 11 languages and the Deep Track in three. Systems were evaluated using both automatic metrics and direct assessment by human evaluators in terms of Readability and Meaning Similarity to reference outputs. We present the evaluation results, along with descriptions of the SR'20 tracks, data and evaluation methods, as well as brief summaries of the participating systems.
Corpora are very useful for many tasks in natural language processing. Syntactically annotated corpora have become an important resource in NLP. They are commonly used, for example, as a test bed for generation, parsing and word sense disambiguation, and as a source for the acquisition of resources (collocations, subcategorisation information, grammar extraction). When dependency structures are used for NLP, the lack of corpora annotated with dependency structures is a handicap. We present an approach based on a graph grammar for converting corpora annotated with phrase structures into corpora annotated with dependencies. The approach works for languages with both (partially) free and fixed word order.
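For illustration only: the paper's conversion is driven by a graph grammar, which is not reproduced here. The sketch below instead shows the simplest head-table-style constituency-to-dependency conversion, just to make the general task concrete; the categories and head table are made up.

```python
# Hedged sketch: NOT the graph-grammar method, only the simplest illustration of
# turning a phrase-structure tree into dependency arcs (pick a head child per
# constituent, attach the other children's lexical heads to it).
HEAD_TABLE = {"S": "VP", "VP": "V", "NP": "N"}   # which child category heads a phrase

def to_dependencies(tree, deps=None):
    """tree: (category, children) with leaves of the form (POS, word).
    Returns (lexical head word, list of (head, dependent) arcs)."""
    if deps is None:
        deps = []
    category, children = tree
    if isinstance(children, str):                 # leaf: (POS, word)
        return children, deps
    heads = [to_dependencies(c, deps)[0] for c in children]
    head_cat = HEAD_TABLE.get(category)
    head_index = next((i for i, c in enumerate(children) if c[0] == head_cat), 0)
    for i, h in enumerate(heads):
        if i != head_index:
            deps.append((heads[head_index], h))
    return heads[head_index], deps

tree = ("S", [("NP", [("N", "John")]),
              ("VP", [("V", "sleeps")])])
root, arcs = to_dependencies(tree)
print(root, arcs)   # sleeps [('sleeps', 'John')]
```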
Proceedings of the 10th International Conference on Natural Language Generation, 2017
We propose a shared task on multilingual Surface Realization, i.e., on mapping unordered and uninflected universal dependency trees to correctly ordered and inflected sentences in a number of languages. A second, deeper input will be available in which, in addition, functional words, fine-grained PoS and morphological information will be removed from the input trees. The first shared task on Surface Realization was carried out in 2011 with a similar setup, with a focus on English. We think it is time to relaunch such a shared task effort in view of the arrival of Universal Dependencies treebanks for a large number of languages on the one hand, and the increasing dominance of deep learning, which has proved to be a game changer for NLP, on the other.
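A small sketch of what constructing such an input could look like: starting from a dependency-annotated sentence, surface order is discarded and word forms are replaced by lemmas, so a realiser has to restore both linearisation and inflection. The Token structure and field names are illustrative, not the task's official format.

```python
# Hedged sketch of building a shallow-track-style input from a dependency tree.
import random
from dataclasses import dataclass

@dataclass
class Token:
    tok_id: int
    form: str
    lemma: str
    upos: str
    head: int
    deprel: str

sentence = [
    Token(1, "The", "the", "DET", 2, "det"),
    Token(2, "results", "result", "NOUN", 4, "nsubj"),
    Token(3, "were", "be", "AUX", 4, "cop"),
    Token(4, "encouraging", "encouraging", "ADJ", 0, "root"),
]

def shallow_input(tokens, seed=0):
    """Drop surface order and inflection: shuffled lemmas with their tree links."""
    unordered = list(tokens)
    random.Random(seed).shuffle(unordered)
    return [(t.lemma, t.upos, t.head, t.deprel) for t in unordered]

print(shallow_input(sentence))
# lemmas with PoS, head and relation, in arbitrary order
```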
In this paper, we present an approach to improve the accuracy of a strong transition-based dependency parser by exploiting dependency language models extracted from a large parsed corpus. We integrate a small number of features based on the dependency language models into the parser. To demonstrate the effectiveness of the proposed approach, we evaluate our parser on standard English and Chinese data, where the base parser already achieves competitive accuracy scores. Our enhanced parser achieves state-of-the-art accuracy on the Chinese data and competitive results on the English data, with a large absolute improvement of one UAS point on Chinese and 0.5 points on English.
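A hedged sketch of what a dependency language model of this kind might look like: head-modifier relative frequencies collected from an automatically parsed corpus, which the parser could then bucket into features when scoring a candidate attachment. The smoothing and the feature form are illustrative simplifications, not the paper's exact templates.

```python
# Hedged sketch of a first-order dependency language model over word pairs.
from collections import Counter

def build_dependency_lm(parsed_corpus):
    """parsed_corpus: iterable of sentences, each a list of (word, head_index)
    pairs with head_index == -1 for the root."""
    pair_counts, head_counts = Counter(), Counter()
    for sentence in parsed_corpus:
        words = [w for w, _ in sentence]
        for word, head in sentence:
            if head >= 0:
                pair_counts[(words[head], word)] += 1
                head_counts[words[head]] += 1
    return pair_counts, head_counts

def attachment_probability(pair_counts, head_counts, head, modifier, alpha=0.1):
    """P(modifier | head) with add-alpha smoothing; high/mid/low buckets of this
    value would be the actual parser features."""
    return (pair_counts[(head, modifier)] + alpha) / (head_counts[head] + alpha * 1000)

corpus = [
    [("ate", -1), ("John", 0), ("apples", 0)],
    [("ate", -1), ("Mary", 0), ("apples", 0), ("red", 2)],
]
pairs, heads = build_dependency_lm(corpus)
print(round(attachment_probability(pairs, heads, "ate", "apples"), 3))
```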
We present a simple LSTM-based transition-based dependency parser. Our model replaces the hidden layer of the usual feed-forward network architecture with a single LSTM hidden layer. We propose a new initialization method that uses the pre-trained weights of a feed-forward neural network to initialize our LSTM-based model, and we show that applying dropout to the input layer has a positive effect on performance. Our final parser achieves a 93.06% unlabeled and 91.01% labeled attachment score on the Penn Treebank. We additionally replace LSTMs with GRUs and Elman units in our model and explore the effectiveness of our initialization method on the individual gates of all three types of RNN units.
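A hedged sketch of one parsing step with the hidden layer replaced by an LSTM cell carried across transitions, plus dropout on the input features. The weight copy at the end shows one plausible reading of initialising the LSTM from a pre-trained feed-forward layer (copying it into the cell-candidate block); the paper's exact initialisation scheme may differ, and all sizes are illustrative.

```python
# Hedged sketch: LSTMCell in place of the feed-forward hidden layer.
import torch
import torch.nn as nn

FEATURE_DIM, HIDDEN, N_ACTIONS = 200, 100, 3   # shift, left-arc, right-arc

input_dropout = nn.Dropout(p=0.3)
cell = nn.LSTMCell(FEATURE_DIM, HIDDEN)
action_layer = nn.Linear(HIDDEN, N_ACTIONS)

def parse_step(features, state):
    """features: (1, FEATURE_DIM) embedded parser-state features."""
    h, c = cell(input_dropout(features), state)
    return action_layer(h), (h, c)

# A pre-trained feed-forward hidden layer of the same shape...
ff_hidden = nn.Linear(FEATURE_DIM, HIDDEN)
# ...copied into the cell-candidate ("g") slice of the LSTM input weights
# (PyTorch gate order is input, forget, cell, output).
with torch.no_grad():
    cell.weight_ih[2 * HIDDEN:3 * HIDDEN] = ff_hidden.weight
    cell.bias_ih[2 * HIDDEN:3 * HIDDEN] = ff_hidden.bias

state = (torch.zeros(1, HIDDEN), torch.zeros(1, HIDDEN))
scores, state = parse_step(torch.randn(1, FEATURE_DIM), state)
print(scores.shape)   # torch.Size([1, 3])
```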
This paper presents a novel self-training approach that we use to explore a scenario typical of under-resourced languages. We apply self-training to small multilingual dependency corpora of nine languages. Our approach employs a confidence-based method to gain additional training data from large unlabeled datasets. The method proves effective for five of the nine languages in the SPMRL Shared Task 2014 datasets, with the largest absolute improvement of two percentage points obtained on Korean data. Our self-training experiments show improvements upon the best state-of-the-art systems of the SPMRL shared task that employ a single parser.
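A minimal sketch of the confidence-based self-training loop the abstract describes: parse an unlabelled pool, keep only parses above a confidence threshold, add them to the training data, and retrain. The Parser interface (train / parse returning a tree and a confidence) is an assumed, illustrative API, with a trivial stub so the example runs.

```python
# Hedged sketch of confidence-based self-training; the parser API is assumed.
def self_train(parser, labelled, unlabelled, threshold=0.9, rounds=1):
    training_data = list(labelled)
    for _ in range(rounds):
        parser.train(training_data)
        for sentence in unlabelled:
            tree, confidence = parser.parse(sentence)   # confidence in [0, 1]
            if confidence >= threshold:
                training_data.append((sentence, tree))
    parser.train(training_data)
    return parser

class ToyParser:
    """Stub standing in for a real dependency parser."""
    def train(self, data): self.size = len(data)
    def parse(self, sentence): return ("flat-tree", min(1.0, len(sentence) / 10))

pool = [["a", "short", "one"],
        ["a", "much", "longer", "unlabelled", "sentence", "that", "the",
         "parser", "is", "confident", "about"]]
parser = self_train(ToyParser(), labelled=[(["gold"], "tree")],
                    unlabelled=pool, threshold=0.9)
print(parser.size)   # training data grew by the confidently parsed sentences
```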
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019), 2019
We propose a method to represent dependency trees as dense vectors through the recursive application of Long Short-Term Memory networks, building Recursive LSTM Trees (RLTs). We show that the dense vectors produced by Recursive LSTM Trees remove the need for structural features when used as feature vectors in a greedy Arc-Standard transition-based dependency parser. We also show that RLTs can incorporate useful information from the bi-LSTM contextualized representations used by Cross and Huang (2016) and Kiperwasser and Goldberg (2016b). The resulting dense vectors are able to express both structural information relating to the dependency tree and sequential information relating to the position in the sentence. The resulting parser only requires the vector representations of the top two items on the parser stack, which is, to the best of our knowledge, the smallest feature set published for Arc-Standard parsers to date, while still achieving competitive results.
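One plausible reading of the recursive composition, sketched in PyTorch: each node's dense vector is the final state of an LSTM run over the node's own embedding followed by its children's already-composed vectors, applied bottom-up over the tree. Sizes and the exact input ordering are illustrative assumptions, not the paper's specification.

```python
# Hedged sketch of composing a dependency tree into a dense vector with an LSTM.
import torch
import torch.nn as nn

DIM = 50
word_emb = nn.Embedding(1000, DIM)
compose = nn.LSTM(DIM, DIM, batch_first=True)

def node_vector(word_id, children_vectors):
    """Compose a node from its word embedding and already-composed children."""
    own = word_emb(torch.tensor([word_id]))                   # (1, DIM)
    seq = torch.stack([own] + children_vectors, dim=1)        # (1, 1 + n_children, DIM)
    _, (h, _) = compose(seq)
    return h[0]                                               # (1, DIM)

def tree_vector(tree):
    """tree: (word_id, [subtrees]), recursively composed into a single vector."""
    word_id, children = tree
    return node_vector(word_id, [tree_vector(c) for c in children])

# "saw ( John, dog ( the ) )" as (word_id, children) with made-up ids
tree = (1, [(2, []), (3, [(4, [])])])
print(tree_vector(tree).shape)   # torch.Size([1, 50])
```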
We gratefully acknowledge the hard work put in by the SR'19 participating teams, reviewers and local organisers, and, more generally, the creativity and enthusiasm generated by participants in the MSR workshops and SR tasks, which is of course what keeps them both going.
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019), 2019
We report results from the SR'19 Shared Task, the second edition of a multilingual surface realisation task organised as part of the EMNLP'19 Workshop on Multilingual Surface Realisation. As in SR'18, the shared task comprised two different tracks: (a) a Shallow Track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (b) a Deep Track where additionally, functional words and morphological information were removed. The Shallow Track was offered in 11, and the Deep Track in three languages. Systems were evaluated (a) automatically, using a range of intrinsic metrics, and (b) by human judges in terms of readability and meaning similarity to a reference. This report presents the evaluation results, along with descriptions of the SR'19 tracks, data and evaluation methods, as well as brief summaries of the participating systems. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.
Proceedings of the 11th International Conference on Natural Language Generation, 2018
In this paper, we present the datasets used in the Shallow and Deep Tracks of the First Multilingual Surface Realisation Shared Task (SR'18). For the Shallow Track, data in ten languages has been released: Arabic, Czech, Dutch, English, Finnish, French, Italian, Portuguese, Russian and Spanish. For the Deep Track, data in three languages is made available: English, French and Spanish. We describe in detail how the datasets were derived from Universal Dependencies V2.0, and report on an evaluation of the Deep Track input quality. In addition, we examine the motivation for, and likely usefulness of, deriving NLG inputs from annotations in resources originally developed for Natural Language Understanding (NLU), and assess whether the resulting inputs supply enough information of the right kind for the final stage in the NLG process.
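A hedged sketch of the extra pruning that distinguishes a Deep-Track-style input from a Shallow-Track one: functional words and morphological features are removed, leaving content-word lemmas and their relations. The real derivation from UD is considerably more involved (e.g. re-attaching dependents of removed words); this only illustrates the kind of information taken out.

```python
# Hedged sketch: pruning a shallow-track-style input into a deep-track-style one.
FUNCTIONAL_UPOS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PUNCT", "SCONJ"}

def deep_input(tokens):
    """tokens: dicts with 'lemma', 'upos', 'deprel', 'feats'."""
    return [{"lemma": t["lemma"], "deprel": t["deprel"]}   # feats deliberately dropped
            for t in tokens if t["upos"] not in FUNCTIONAL_UPOS]

shallow = [
    {"lemma": "the", "upos": "DET", "deprel": "det", "feats": "Definite=Def"},
    {"lemma": "result", "upos": "NOUN", "deprel": "nsubj", "feats": "Number=Plur"},
    {"lemma": "be", "upos": "AUX", "deprel": "cop", "feats": "Tense=Past"},
    {"lemma": "encouraging", "upos": "ADJ", "deprel": "root", "feats": "Degree=Pos"},
]
print(deep_input(shallow))
# [{'lemma': 'result', 'deprel': 'nsubj'}, {'lemma': 'encouraging', 'deprel': 'root'}]
```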