Event information extraction

25 papers

2 followers

About this topic

Event information extraction is a subfield of natural language processing that focuses on identifying and extracting structured information about events from unstructured text. This includes recognizing event triggers, participants, time, location, and other relevant attributes to facilitate the organization and analysis of event-related data.

Key research themes

1. How can ontologies and structured semantic frameworks enhance the effectiveness of event extraction from unstructured text?

This research area focuses on developing comprehensive and flexible ontologies and semantic resources that define event types, argument roles, and analytic dimensions for event extraction (EE). Such ontological frameworks are vital because they provide structured guidance to automate the identification and classification of events and their participants in text, improving accuracy and domain adaptability. Addressing limitations in previous ontologies—such as narrow topical coverage, inflexible argument role definitions, and lack of analytical granularity—can yield better event extraction systems that serve diverse applications including knowledge base construction, summarization, and crisis monitoring.

COfEE: A Comprehensive Ontology for Event Extraction from text, with an online annotation tool

by Ali Balali

2022, ArXiv

Key finding: This paper proposes the COfEE event ontology addressing shortcomings of popular ontologies like ACE, CAMEO, and ICEWS that suffer from limited topical coverage (mainly political events), rigid argument role definitions, and... Read more

Knowledge-Driven Event Extraction in Russian: Corpus-Based Linguistic Resources

by Valery Solovyev

2021, Computational intelligence and neuroscience

Key finding: This work highlights a hybrid approach combining knowledge-driven (ontology and pattern-based) and data-driven techniques to improve EE system performance in Russian, a less-resourced language for EE. It develops linguistic... Read more

Event detection based on open information extraction and ontology

by Sadok Ben Yahia

2023, Journal of Information and Telecommunication

Key finding: The paper applies open information extraction (OIE) combined with ontological reasoning to reduce expert intervention in EE for the domain of management change events. Unlike earlier approaches relying heavily on manual... Read more

2. What modeling and joint learning strategies effectively handle complex phenomena such as role overlaps and ambiguity in multilingual event extraction?

This theme explores approaches addressing key challenges in event extraction, especially in languages like Chinese, where word segmentation ambiguities and overlapping semantic roles frequently occur. Researchers investigate methods that model interdependencies among event triggers, arguments, and roles jointly rather than in pipelined stages, allowing for simultaneous resolution of ambiguous and overlapping event elements. Such joint frameworks utilize pre-trained language models and reformulate argument extraction as a relation triple extraction problem to improve robustness in multilingual settings and complex event structures.

A Novel Joint Framework for Multiple Chinese Events Extraction

by Nuo Xu

2023

Key finding: This paper defines an event relation triple representation capturing interdependencies among event triggers, arguments, and roles explicitly, converting argument extraction into relation triple extraction. Employing a... Read more

Tunable domain-independent event extraction in the MIRA framework

by Deyan Peychev

2023

Key finding: Applying a three-stage classification process (trigger word tagging, simple event extraction, and complex event extraction) using the MIRA online learning framework, this paper demonstrates tunable precision-recall trade-offs... Read more

Exploring a Probabilistic Earley Parser for Event Composition in Biomedical Texts

by Trần Lê Ngọc Mai

2024

Key finding: This study presents a high-precision event extraction system focused on biomedical texts, leveraging a probabilistic Earley chart parsing algorithm for event composition. The approach treats event structures analogously to... Read more

3. How can computational frameworks leverage high-level event representations and distributed semantic processing to improve extraction and reasoning over complex and large-scale event streams?

This theme investigates methods for modeling and processing events at levels above isolated event occurrences, incorporating temporal, spatial, and semantic dimensions for complex event processing (CEP). It includes approaches that extend traditional event models by integrating RDF semantics with temporal reasoning capabilities and distributed architectures to achieve scalability. Further, it addresses techniques for mining holistic or object-centric event logs, capturing interrelated behavior in event data streams, thus facilitating predictive analytics, pattern detection, and knowledge population in dynamic and heterogeneous data environments.

Towards Efficient Semantically Enriched Complex Event Processing and Pattern Matching

by Syed Aqeel Haider Gillani

2025

Key finding: This paper proposes an extended RDF-based event data model that incorporates temporal reasoning directly at the RDF level, addressing limitations of existing SCEP systems that lack the notion of time and rely on centralized... Read more

High-Level Event Mining: A Framework

by Wil van der Aalst et al.

2023

Key finding: The authors introduce a framework for detecting high-level events that capture holistic and system-wide process states emerging from clusters of temporally proximate events across multiple process instances. By segmenting... Read more

A Framework for Extracting and Encoding Features from Object-Centric Event Data

by Wil van der Aalst

2023

Key finding: This study develops a framework to extract and encode features from object-centric event logs where events relate to multiple objects of various types, reflecting interactions between concurrent processes. It critiques the... Read more

All papers in Event information extraction

Building and Evaluating an Annotated Corpus for Automated Recognition of Chat-Based Social Engineering Attacks

by Ioannis Mavridis

2025, Applied Sciences

Chat-based Social Engineering (CSE) is widely recognized as a key factor in successful cyber-attacks, especially in small and medium-sized enterprise (SME) environments. Despite the interest in preventing CSE attacks, few studies have... more

Automated metadata annotation: What is and is not possible with machine learning

by Hans Brandhorst

2024, Data Intelligence

Automated metadata annotation is only as good as the training dataset or rules that are available for the domain. It's important to learn what type of data content a pre-trained machine learning algorithm has been trained on to... more

Character-based Neural Embeddings for Tweet Clustering

by Svitlana Vakulenko

2024, Zenodo (CERN European Organization for Nuclear Research)

In this paper we show how the performance of tweet clustering can be improved by leveraging character-based neural networks. The proposed approach overcomes the limitations related to the vocabulary explosion in the word-based models and... more

Extended Multilingual Protest News Detection - Shared Task 1, CASE 2021 and 2022

by Tadashi Nomoto

2023, Proceedings of the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE)

We report results of the CASE 2022 Shared Task 1 on Multilingual Protest Event Detection. This task is a continuation of CASE 2021 that consists of four subtasks that are i) document classification, ii) sentence classification, iii) event... more

Automated metadata annotation: What is and is not possible with machine learning

by Joaquim López

2023, Data Intelligence

An Automatic Participant Detection Framework for Event Tracking on Twitter

by Colin Layfield

2023, Algorithms

Automated metadata annotation: What is and is not possible with machine learning

by More Lopez

2023, Data Intelligence

Multilingual Protest News Detection - Shared Task 1, CASE 2021

by SHYAM RATAN

2023, Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)

Benchmarking state-of-the-art text classification and information extraction systems in multilingual, cross-lingual, few-shot, and zero-shot settings for socio-political event information collection is achieved in the scope of the shared... more

Multilingual Protest News Detection - Shared Task 1, CASE 2021

by Farhana Liza

2023, Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)

Identifying Incorrect Labels in the CoNLL-2003 Corpus

by Bryan Cutler

2023, Proceedings of the 24th Conference on Computational Natural Language Learning

The CoNLL-2003 corpus for English-language named entity recognition (NER) is one of the most influential corpora for NER model research. A large number of publications, including many landmark works, have used this corpus as a source of... more

Automated metadata annotation: What is and is not possible with machine learning

by Joseph Busch

2023, Data Intelligence

Joint event extraction along shortest dependency paths using graph convolutional networks

by Masoud Asadpour

2023, Knowledge-Based Systems

Event extraction (EE) is one of the core information extraction tasks, whose purpose is to automatically identify and extract information about incidents and their actors from texts. This may be beneficial to several domains such as... more

Figure 1: An example of the Event Extraction task (lower part of the figure) and its dependency parsing result (upper part). Source: the dependency parsing result is produced by the Stanford CoreNLP toolkit (http://nlp.stanford.edu:8080/corenlp/) on a sentence of the ACE 2005 dataset. In the figure, one can identify three event triggers: “leaved” (rectangular green box), “talks” (yellow box) and “summit” (blue box), and their corresponding subtypes: “Transport” (which belongs to the movement type), “Meet” (contact type) and “Meet”, respectively. The arguments and corresponding roles associated with each event trigger can also be found in the sentence. For instance, “Bush”, “Putin” and “France” (see green row) are arguments of the event trigger “leaved” with the roles “Entity”, “Entity” and “Destination”, respectively. Instead, “their” (see yellow row) is the argument of the event trigger “talks” with the role “Entity”. Finally, “Bush”, “Putin”, “the largest

Figure 2: Architecture of JEE-SDP for the event extraction task depicted on our running example sentence taken from the ACE 2005 dataset. In the SDP-DMCNN module, the processing of the event trigger “summit” and the argument candidate “Putin” is illustrated.

Figure 3: An example of the preparation of SDP-L adjacency matrices between argument candidates.

Figure 4: Illustration of the GCN-SDP architecture. The SDP-L adjacency matrix ($M_d$) is calculated for d=1.

The GCN-SDP is customized in comparison to the GCN layers in previous works [29, 40]. The GCN-SDP uses fewer learning parameters because of the small amount of training data in EE. Moreover, instead of using a sum layer to aggregate the $n$ different SDP-L vectors, a self-attention layer is used. The self-attention layer weighs the importance of the different SDP-L vectors through parameters learned in the aggregation phase. Figure 4 shows the GCN-SDP architecture.
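The contrast between a sum layer and an attention-weighted aggregation can be sketched as follows. This is a minimal illustration under stated assumptions, not the GCN-SDP layer itself: the function names, the toy vectors, and the use of one scalar score per SDP-L vector (standing in for learned parameters) are all hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sum_aggregate(vectors):
    """Baseline: every vector contributes equally and unnormalised."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) for k in range(dim)]


def attention_aggregate(vectors, scores):
    """Weighted combination; `scores` are assumed to be learned, one per vector."""
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


# Three toy vectors, one per SDP length (hypothetical values).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

summed = sum_aggregate(vectors)
equal = attention_aggregate(vectors, [0.0, 0.0, 0.0])
skewed = attention_aggregate(vectors, [50.0, 0.0, 0.0])
```

With equal scores the weights are uniform and the result is the mean of the vectors; raising one score moves the output toward the corresponding vector, which is how a learned score can express that one SDP length matters more than another.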

the test set. According to this figure, the SDP extraction layer successfully eliminates irrelevant words, which helps in extracting valuable features from both short- and long-range dependencies between event triggers and their arguments in the sentences. JEE-SDP is most effective in bin 1-3 (the sequential distance between event trigger and argument candidate is between 1 and 3), where it achieves an F1-score of 86% in the argument role classification task.

Figure 5: The effectiveness of the two models JEE and JEE-SDP based on the sequential distance between event triggers and their arguments. The sequential distance is divided into bins of width 3.

Figure 6: The visualization of trained weights for different SDP lengths ($z_d$).

Using the shortest path between two entities in the dependency graph has proven to be a valid and helpful solution within the context of the relation extraction task [35-37]. In this research work, we apply the SDP to predict the role of the argument candidates. To accomplish this in the SDP-DMCNN module, we only consider the words which are along the SDP between the trigger candidate $t_i$ and the argument candidate $e_j$. For instance, we would consider the words “Putin”, “Bush”, “leaved”, “talks”, “Group” and “summit” when aiming to determine the SDP between “Putin” and “summit” in Figure 1. Also, the length of the SDP (SDP-L) between the trigger candidate $t_i$ and the argument candidate $e_j$ is important as it presents significant information about relationships. To calculate the SDP and the SDP-L between trigger and argument candidates we resort to Algorithm 1. In this algorithm, $A \in \mathbb{R}^{n_w \times n_w}$ is an adjacency matrix based on dependency arcs which are extracted from the dependency parser. We assume that arcs in the dependency graph are undirected. $T \in \mathbb{R}^{n_t \times n_w}$ and $E \in \mathbb{R}^{n_e \times n_w}$ are the matrices which indicate positions of trigger and argument candidates in a sentence. For example, the vector $e_j \in \mathbb{R}^{n_w}$ in Figure 2 shows the position of the argument candidate “The largest Nations” in the sentence. The values of this vector are equal to zero, except at indexes 15, 16 and 17, which are equal to 1. At the first step (Lines 1-3), the path and distance between words in the dependency graph are calculated by the breadth-first search (BFS) algorithm [67]. Given a token $w_i$, BFS starts at token $w_i$ and then explores all of the neighbor tokens at the present depth prior to moving on to the tokens at the next depth level. The shortest paths between all words and their lengths are calculated by running the BFS on all tokens in the sentence. Since trigger and argument candidates can include
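The BFS step described in this excerpt can be sketched for a single pair of tokens as follows. This is a simplified illustration, not the paper's Algorithm 1: it uses an adjacency list instead of the matrices A, T and E, and the function name, the toy sentence and its dependency arcs are hypothetical (they are not CoreNLP output and not the ACE 2005 example).

```python
from collections import deque


def shortest_dependency_path(edges, n_tokens, source, target):
    """Return token indices on a shortest path from source to target, or []."""
    # Treat dependency arcs as undirected, as assumed in the text.
    neighbours = [[] for _ in range(n_tokens)]
    for head, dependent in edges:
        neighbours[head].append(dependent)
        neighbours[dependent].append(head)

    # Breadth-first search: visit all tokens at the current depth
    # before moving on to the next depth level.
    parent = {source: None}
    queue = deque([source])
    while queue:
        token = queue.popleft()
        if token == target:
            break
        for nxt in neighbours[token]:
            if nxt not in parent:
                parent[nxt] = token
                queue.append(nxt)

    if target not in parent:
        return []  # no path between the two tokens

    # Walk the parent links back from target to source.
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


# Hypothetical toy sentence with hand-written (head, dependent) arcs.
tokens = ["Putin", "met", "Bush", "at", "the", "summit"]
edges = [(1, 0), (1, 2), (1, 5), (5, 3), (5, 4)]

path = shortest_dependency_path(edges, len(tokens), 0, 5)
sdp_words = [tokens[i] for i in path]  # words along the SDP
sdp_length = len(path) - 1             # SDP-L, counted in arcs
```

Running the search from every token, as the excerpt describes, would give the paths and lengths for all pairs; the single-pair version above keeps the sketch short.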

The network parameters are $\theta = [\overrightarrow{\mathrm{LSTM}}, \overleftarrow{\mathrm{LSTM}}, W_1, W_2, b_1, b_2, W_j, b_j, z_d, W_d, b_d, W_3, W_4, W_5, b_3, b_4, b_5]$ in JEE-SDP, where $0 \le d \le n$ in the GCN-SDP and $1 \le j \le m$ in DMCNN. We used ReLU [69] as the nonlinear activation function in JEE-SDP. To train these parameters, cross entropy is selected as the loss function, with equal weight for the two output models. We use the stochastic gradient descent algorithm with shuffled mini-batches and the update rule of [70] to minimize the loss functions.

Table 1: Hyperparameters in JEE-SDP

Table 2: Overall effectiveness with gold-standard entities. Bold denotes the best result. The results of the other methods are quoted from the corresponding papers, since they use the same test set. * marks the methods of our proposed framework.

Table 3: Effect of the different word representation methods in JEE-SDP

We concluded that using BERT does not provide significant improvement in the argument role classification task. The effectiveness of the argument role classification task could be affected by layers which consider the relations between event trigger and argument candidates in JEE-SDP. However, concatenating BERT and GloVe improves the F1-score by 2%, which means GloVe and BERT may be complementary in the trigger classification task. This demonstrates the effectiveness of the embedding methods in solving the ambiguity in classifying event triggers.

4.4 Effect of Extracting Multiple Events

To evaluate the effect of the JEE-SDP framework in sentences which include multiple events, the test set is divided into two parts (1/1 and 1/N) according to the number of event triggers in the sentences and the effectiveness is

JEE-SDP improves the F1-score by at least 6.1% and 1.9% for the trigger classification task in (1/1) and (1/N) sentences, respectively. Also, the most important observation from the table is that “JEE-SDP +GCN-SDP +ATT” significantly outperforms all the other methods by large margins, especially in (1/N) sentences for the argument role classification task, where it yields a 12.1% improvement compared to the state-of-the-art method (JMEE). This demonstrates that the proposed framework can effectively capture more valuable clues when a sentence contains more than one event. Table 4 illustrates the effectiveness of the EE methods on sentences with single and multiple event triggers. To consider the effect of each layer, we present the results of the JEE-SDP framework with and without some layers separately. According to Table 4, since joint approaches such as JRNN, JMEE and JEE-SDP capture co-occurrence relationships between event trigger subtypes in sentences, they present a significant improvement in EE, especially in sentences with multiple events (1/N). For example, where the model predicted an Attack event, the other event trigger subtypes are most likely to be Die, Transport and Injure [29].

Automatic Labeling for Entity Extraction in Cyber Security (arXiv: http://arxiv.org/abs/1308.4941)

by John Goodall

2023

Timely analysis of cyber-security information necessitates automated information extraction from unstructured text. While state-of-the-art extraction methods produce extremely accurate results, they require ample training data, which is... more

Global Contentious Politics Database (GLOCON) Annotation Manuals

by Ali Hürriyetoğlu

2023, Cornell University - arXiv

This work is funded by the European Research Council (ERC) Starting Grant awarded to Dr. Erdem Yörük for the project Emerging Markets Welfare (project ID 714868). The research project is hosted by the Koç University and has benefited from... more

PESE: Event Structure Extraction using Pointer Network based Encoder-Decoder Architecture

by Alapan Kuila

2023, Cornell University - arXiv

The task of event extraction (EE) aims to find the events and event-related argument information from the text and represent them in a structured format. Most previous works try to solve the problem by separately identifying multiple... more

A Novel Joint Framework for Multiple Chinese Events Extraction

by Nuo Xu

2023

Event extraction is an essential yet challenging task in information extraction. Previous approaches have paid little attention to the problem of roles overlap which is a common phenomenon in practice. To solve this problem, this paper... more

Global Contentious Politics Database (GLOCON) Annotation Manuals

by Osman Mutlu

2022

Cross-Context News Corpus for Protest Event-Related Knowledge Base Construction

by Osman Mutlu

2022, Data Intelligence

We describe a gold standard corpus of protest events that comprise various local and international English language sources from various countries. The corpus contains document-, sentence-, and token-level annotations. This corpus... more

Machine Learning Approaches for Catchphrase Extraction in Legal Documents

by Lakshmi Simhan

2022

The purpose of this research was to automatically extract catchphrases given a set of legal documents. For this task, our focus was mainly on machine learning approaches: a comparative approach was used between the unsupervised and... more

ESGBERT: Language Model to Help with Classification Tasks Related to Companies’ Environmental, Social, and Governance Practices

by Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP)

2022

Environmental, Social, and Governance (ESG) are non-financial factors that are garnering attention from investors as they increasingly look to apply these as part of their analysis to identify material risks and growth opportunities. Some... more

Associating Events with People on Social Networks Using A-PRIORI

by Vyankatesh Agrawal

2022, Computer Science & Information Technology (CS & IT)

In social media, the same news or events are associated with two or more people, sometimes with different perspectives. The representation of the news or events varies from person to person, perspective to perspective or time to time. In this... more

The Complementary Nature of Different NLP Toolkits for Named Entity Recognition in Social Media

by Álvaro Figueira, PhD

2022, Lecture Notes in Computer Science

In this paper we study the combined use of four different NLP toolkits-Stanford CoreNLP, GATE, OpenNLP and Twitter NLP tools-in the context of social media posts. Previous studies have shown performance comparisons between these tools,... more

Pseudo Zero Pronoun Resolution Improves Zero Anaphora Resolution

by Kentaro Inui

2022, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Masked language models (MLMs) have contributed to drastic performance improvements with regard to zero anaphora resolution (ZAR). To further improve this approach, in this study, we made two proposals. The first is a new pretraining task... more

Cross-Context News Corpus for Protest Event-Related Knowledge Base Construction

by Osman Mutlu

2022, Data Intelligence

Multilingual Protest News Detection - Shared Task 1, CASE 2021

by Osman Mutlu

2022

Benchmarking state-of-the-art text classification and information extraction systems in multilingual, cross-lingual, few-shot, and zero-shot settings for socio-political event information collection is achieved in the scope of the shared... more

Global Contentious Politics Database (GLOCON) Annotation Manuals

by Ali Hürriyetoğlu et al.

2022, Global Contentious Politics Database (GLOCON) Annotation Manuals

Emerging Markets Welfare project investigates the effects of contentious politics on welfare state programs in countries of the Global South. It hypothesizes that government response to social contention is a significant factor that shapes welfare policies. It is in this respect that mapping the dynamics of social contention in a given country becomes crucial, and duly constitutes a fundamental component of the entire project. Investigating the causal relationship between social contention and government policy involves more than a simple correlation, particularly if the focus is on specific government action, namely welfare policies. The map of social contention adequate for such an understanding should thus go beyond laying out basic trends of ebbing and flowing of social contention over space and time and provide insight into particularities such as the types of action repertoires, levels of violence, characteristics of actors or social groups that engage in contentious politics, the characteristics of the demands that they raise.
The purpose of the second work package of the EMW Project is to draw the aforementioned map of social contention. For achieving this purpose, we created a database of contentious politics events through the extraction of information from the news reports that are featured in the most prominent online sources each focus country has to offer. The Global Contentious Politics Database (GLOCON) records contentious politics events (referred to as protest events for the sake of brevity) that take place within the borders of our focus countries with all the information available in the source about the events’ time and place, actor, type, demands raised, violence level. As of the moment, the GLOCON database contains protest event data from India, China, South Africa, Argentina, and Brazil. It features data in three languages: English for India, China, and South Africa data, Spanish for Argentina data, and Portuguese for Brazil data. The database was created in a way that is able to accommodate additions of other focus countries and/or news sources in the future.
The database creation utilized automated text processing tools that detect if a news article contains a protest event, locate protest information within the article, and extract pieces of information regarding the detected protest events. The basis of training and testing the automated tools is the GLOCON Gold Standard Corpus (GSC), which contains news articles from multiple sources from each focus country. The articles in the GSC were manually coded by skilled annotators in both classification and extraction tasks with the utmost accuracy and consistency that automated tool development demands. In order to assure these, the annotation manuals in this document lay out the rules according to which annotators code the news articles. Annotators refer to the manuals at all times for all annotation tasks and apply the rules that they contain.
Despite the EMW Project's focus on the countries of the Global South, and the initial choice of a limited number of countries to be featured in the GSC, none of the rules or principles contained in this manual is more or less applicable to certain countries, sources or periods than others. The GLOCON database aims to be inclusive and capable of expanding. Securing consistency, reliability, and validity of data in the face of temporal and spatial expansion requires that annotation principles are generally applicable and that they are applied consistently.
The annotation process is composed of three main levels for each news report document. The document-level annotation determines the news articles that contain information on actual (past or ongoing) protest events. The sentence-level annotation aims to locate sentences that contain protest event-related information. In the final phase, words or phrases that give concrete information about protest events are detected.
Corresponding to the document and sentence classification, and information extraction tasks, there are three main and two supplemental manuals which together cover the entire annotation process from the document, through the sentence, to the token level. The first manual is the Document Level Protest Annotation Manual (DOLPAM) which establishes the rules for determining news articles that contain protest events; in other words, classifying news articles into those which contain protest events and those which do not. It lays out the protest event ontology, that is, the protest event definition which specifies the range of contentious politics events that are included in the scope of the project. It introduces and exemplifies different types of protest events, and defines the criteria to which a news report must conform to be labeled as a protest event article. The following Sentence Level Protest Annotation Manual (SELPAM) carries on with classifying the sentences in the documents that have already been classified as protest event articles. Similarly, it defines and exemplifies event sentences and enumerates the rules by which sentences are labeled as event sentences and non-event sentences. The third and final main manual is the Token Level Protest Annotation Manual (TOLPAM) which is the longest and most detailed of the three main level manuals. It defines the types of event-related information that the project aims at collecting from news articles and explains how expressions within the event sentences which contain these pieces of information are tagged. The remaining two manuals are supplemental manuals that label further information about the events that are already extracted in the three main levels of annotation. Both define annotation tasks that are performed on the document level. 
The first is the Violent Protest Events Annotation Manual which lays out the rules for classifying news reports that contain protest events into categories of violent and non-violent. The following, Protest Event Demands Annotation Manual aims at setting the rules for labeling the demands and/or grievances associated with the protest events that are extracted in the news articles. More detailed information about each manual can be found under their respective headings.
Even though every particular level of annotation has its respective annotation manual, the whole process must be thought of as an integrated whole as each level of annotation is premised on the results of the previous level. Hence, familiarizing oneself with all the manuals before starting annotation on any single level is recommended. Knowing in advance what the sentence and token level annotation tasks entail would help an annotator working on the document level considerably. That said, it is neither practical nor advisable to try to learn all annotation procedures by heart. Memory is prone to mislead, and recurrent reference to the manuals is the preferred way of utilizing them. Thus, annotators must read the entire manual before starting annotation, and remember to refer to it when there is the slightest doubt about a rule or a difficult case.
The content of the annotation manual is built on the general principles and standards of linguistic annotation laid out in other prominent annotation manuals such as ACE, CAMEO, and TimeML. These principles, however, have been adapted or rather modified heavily to accommodate the social scientific concepts and variables employed in the EMW project. The manual has been molded throughout a long trial and error process that accompanied the annotation of the GSC. It owes much of its current shape to the meticulous work and invaluable feedback provided by highly specialized teams of annotators, whose diligence and expertise greatly increased the quality of the corpus.

Towards Open Domain Event Extraction from Twitter: REVEALing Entity Relations

by Svitlana Vakulenko

2022, Extended Semantic Web Conference

In the past years social media services received content contributions from millions of users, making them a fruitful source for data analysis. In this paper we present a novel approach for mining Twitter data in order to extract factual... more

Cross-Context News Corpus for Protest Event-Related Knowledge Base Construction

by Ali Hürriyetoğlu

2022, Data Intelligence

Character-based Neural Embeddings for Tweet Clustering

by Lyndon Nixon

2022, Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media

EvenTweet: online localized event detection from twitter

by Hamed Abdelhaq

2022

Microblogging services such as Twitter, Facebook, and Foursquare have become major sources for information about real-world events. Most approaches that aim at extracting event information from such sources typically use the temporal... more

SEDTWik: Segmentation-based Event Detection from Tweets Using Wikipedia

by Aruna Malapati

2022

Event Detection has been one of the research areas in Text Mining that has attracted attention during this decade due to the widespread availability of social media data, specifically Twitter data. Twitter has become a major source for... more

Towards Russian Text Generation Problem Using OpenAI's GPT-2

by Nataliya Ryabova

2022

This work is devoted to the Natural Language Generation (NLG) problem. Modern approaches in this area based on deep neural networks are considered. The most famous and promising deep neural network architectures that are related to this... more

Novelty Detection: A Perspective from Natural Language Processing

by Tanik Saikh

2022, Computational Linguistics

The quest for new information is an inborn human trait and has always been quintessential for human survival and progress. Novelty drives curiosity, which in turn drives innovation. In Natural Language Processing (NLP), Novelty Detection... more

SEDTWik: Segmentation-based Event Detection from Tweets Using Wikipedia

by Surender Samant

2022

Neural Media Bias Detection Using Distant Supervision With BABE - Bias Annotations By Experts

by Bela Gipp

2022

Media coverage has a substantial effect on the public perception of events. Nevertheless, media outlets are often biased. One way to bias news articles is by altering the word choice. The automatic identification of bias by word choice is... more

Figure 1: Data collection and annotation pipeline. The general data collection and annotation pipeline is outlined in Figure 1. Similar to the filtering strategy proposed by Spinde et al. (2021b), the sentences should contain more biased than neutral sentences. BABE contains 3,700 sentences, 1,700 from MBIC (Spinde et al., 2021c) and an additional 2,000. Like Spinde et al. (2021c), we extracted our sentences from news articles covering 12 pre-defined controversial topics. The articles were published on 14 US news platforms from January 2017 until June 2020. We focused on the US media since their political scenario became increasingly polarizing over the last years (Atkins, 2016).

Table 3: Data set annotation results for the expert-based approaches (left: eight annotators labeling 1,700 sentences (SG1); right: five annotators labeling 3,700 sentences (SG2)).

Standard errors across folds in parentheses. The first model block shows the best results of feature-based models. The second block of models consists of BERT and optimized variants. The models in the third block use new architectural or training approaches. The fourth block refers to models having learned bias-specific embeddings from the distantly supervised corpora. The best results are printed in bold.

SEDTWik: Segmentation-based Event Detection from Tweets Using Wikipedia

by Surender Singh Samant

2021

Neural relation extraction: a review

by Furkan Özbay

2021, Turkish J. Electr. Eng. Comput. Sci.

Neural relation extraction discovers semantic relations between entities from unstructured text using deep learning methods. In this study, we make a clear categorization of the existing relation extraction methods in terms of data... more

Fine-tuning Pretrained Multilingual BERT Model for Indonesian Aspect-based Sentiment Analysis

by Annisa Nurul Azhar

2021

Although previous research on Aspect-based Sentiment Analysis (ABSA) for Indonesian reviews in hotel domain has been conducted using CNN and XGBoost, its model did not generalize well in test data and high number of OOV words contributed... more

Open Aspect Target Sentiment Classification with Natural Language Prompts

by Brian Pinette

2021, ArXiv

For many business applications, we often seek to analyze sentiments associated with any arbitrary aspects of commercial products, despite having a very limited amount of labels or even without any labels at all. However, existing aspect... more

Multilingual Protest News Detection - Shared Task 1, CASE 2021

by Erdem Yörük

2021

Cross-Context News Corpus for Protest Event-Related Knowledge Base Construction

by Erdem Yörük

2021, Data Intelligence

An Automatic Participant Detection Framework for Event Tracking on Twitter

by Colin Layfield

2021, Algorithms

Topic Detection and Tracking (TDT) on Twitter emulates human identifying developments in events from a stream of tweets, but while event participants are important for humans to understand what happens during events, machines have no... more

PROTEST-ER: Retraining BERT for Protest Event Extraction

by Ali Hürriyetoğlu

2021

We analyze the effect of further retraining BERT with different domain specific data as an unsupervised domain adaptation strategy for event extraction. Portability of event extraction models is particularly challenging, with large... more

CDLM: Cross-Document Language Modeling

by Arie Cattan

2021

We introduce a new pretraining approach geared for multi-document language modeling, incorporating two key ideas into the masked language modeling self-supervised objective. First, instead of considering documents in isolation, we... more

Figure 2: CDLM pretraining: The input consists of concatenated documents, separated by special document separator tokens. The masked (unmasked) token colored in yellow (blue) represents global (local) attention. The goal is to predict the masked token "alleges", based on the global context, i.e., the entire set of documents.

Figure 4: CD-coreference resolution pairwise mention representation, using the new setup, for our CDLM models. $m_i^t$, $m_j^t$ and $s_t$ are the cross-document contextualized representation vectors for mentions $i$ and $j$, and of the [CLS] token, respectively. $m_i^t \circ m_j^t$ is the element-wise product between $m_i^t$ and $m_j^t$. $m_t(i,j)$ is the final produced pairwise-mention representation. The tokens colored in yellow represent global attention, and tokens colored in blue represent local attention.

Table 2: Results on event and entity cross-document coreference resolution on the ECB+ test set. AAN is composed of computational linguistics papers which were published on the ACL Anthology from 2001 to 2014, OC is composed of computer science and neuroscience papers, S2ORC is composed of open access papers across broad domains of science, and PAN is composed of web documents that contain several kinds of plagiarism phenomena. For further dataset preprocessing details and statistics, see Appendix A.3.

Table 3: Ablation results (CoNLL F1) on our model on the test set of ECB+ event coreference.

Table 4: F1 scores over the document matching benchmarks' test sets.

Table 5: HotpotQA-distractor results (F1) for the dev set. We use the “base” model size results from prior work for direct comparison. Ans: answer span, Sup: Supporting facts.

23, 34, 35; For test, we consider the topics: 36-45. Table 8: ECB+ dataset statistics. The slash numbers for Mentions and Clusters represent event/entity statistics.

Table 7: MultiNews training set statistics. We used the preprocessed, not truncated version of Multi-News, which totals 322MB of uncompressed text. Each one of the preprocessed documents contains up to 500 tokens. The average and 90th percentile of input length is 2.5k and 3.8k tokens, respectively. In Table 7 we list the number of related documents per cluster. This follows the original dataset construction suggested in Fabbri et al. (2019).

Table 10: Document-to-Document benchmarks statistics: The reported numbers are the count of document pairs and the count of unique documents.

Building and Evaluating an Annotated Corpus for Automated Recognition of Chat-Based Social Engineering Attacks

by Nikolaos Tsinganos

2021, Applied Sciences

Figure 1. The concept map of CSE Attack.

The linguistic analysis was performed according to the following levels of observation (see Figure 4). A sample of the CSE dialogues was analysed from a linguistic perspective and served as a baseline to ensure that the software tools and libraries were able to achieve the desired level of quality. Initially, as seen in Figure 4, a lexical analysis was performed by breaking down a sentence into words, phrases, or other meaningful parts, a task known as chunking. Afterwards, a morphological analysis was performed where the structure and the form of each word was the main concern. Part-of-speech (POS) tags, stems and lemmas were identified, and a syntactic analysis followed that focused on grammar and syntax patterns. Subsequently, the meaning of the words and phrases was examined, where the semantics of words and phrases were investigated. The most crucial step, though, was the pragmatic analysis that took place to identify the actual meaning of the utterances. This is reasonable because an automation tool cannot understand the hidden intent of a speaker or a writer.

Figure 5. The CSE dialogues processing workflow. The standard Penn Treebank [30] tokenisation rules were utilised for sentence tokenisation, and finally, standardisation processes using regular expressions and lookup tables were applied to tune the CSE dialogues. Figure 5 depicts the dialogues processing workflow where each stage, along with the individual tasks below, is shown.

Figure 7. Excerpt of the proposed CSE ontology. An excerpt of the CSE ontology created in Protégé is illustrated in Figure 7 along with the arc types. In the following Figure 8, the core concepts of the proposed CSE ontology are presented. The proposed CSE ontology extends the ontology presented in [32], which connects the social engineering concepts with cybersecurity concepts. The reason is that we are interested in protecting sensitive data leakage in personal, IT and enterprise context. The implemented ontology was finalised using Protégé [33] software.

Figure 11. Annotation targets. The preferred encoding scheme to tag the existing chunks of text (words or text spans) was based on the IOB format [39]. In the IOB encoding scheme, the “I-” prefix of a tag indicates that the tag is inside a chunk, “O” indicates that the tag does not belong to a chunk and the “B-” prefix of a tag indicates that a token is the beginning of a chunk.
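The IOB scheme described in this caption can be illustrated with a minimal Python sketch. The helper `iob_encode`, the example utterance, and the span indices are hypothetical and not taken from the paper; only the "IT" label name echoes one of the corpus categories mentioned in the entry.

```python
from typing import List, Tuple

def iob_encode(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn labelled token spans into IOB tags.

    spans holds (start, end_exclusive, label) token-index ranges and is
    assumed to be non-overlapping (an assumption of this sketch).
    """
    tags = ["O"] * len(tokens)            # "O": token is outside any chunk
    for start, end, label in spans:
        tags[start] = f"B-{label}"        # "B-": first token of a chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # "I-": token inside the chunk
    return tags

# Hypothetical utterance; "VPN password" is marked as one IT chunk.
tokens = ["Please", "send", "me", "your", "VPN", "password", "now"]
spans = [(4, 6, "IT")]
tags = iob_encode(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Storing spans as token-index ranges keeps the encoder independent of any particular tokeniser; a character-offset variant would need an extra alignment step.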

Figure 13. (a) Distribution of sentence categories, (b) Average word count per category. As depicted in Figure 13a, the distribution of the sentence categories based on the aforementioned tags shows that the produced CSE corpus is well balanced. Valuable information can also be extracted by observing the average word count per category, as seen in Figure 13b.

Figure 14. Five most frequent words per category. (a) Neutral, (b) Personal, (c) IT, (d) Enterprise

Figure 15. Five most frequent bi-grams per category. (a) Neutral, (b) Personal, (c) IT, (d) Enterprise

Figure 16. (a) Loss metrics, (b) Accuracy metrics. Figure 16a,b below illustrate the training history that contains the loss and the accuracy achieved on training and validation after each epoch.

Table 2. CSE corpus summary. The CSE corpus is composed of realized and fictional CSE attack dialogues. The existence of fictional attack dialogues does not affect its quality and capability because, similarly, a social engineer does not always act spontaneously but frequently prepares a fictional scenario (pretext attack) to guide the conversation and unleash his attack. The same applies to the CSE corpus, which incorporates a combination of confirmed and fictional CSE attacks.

Table 5. IOB encoding example. During the annotation task, the words were labelled based on their semantic and syntactic characteristics. This way, relationships were extracted between words or text spans belonging to different branches of the ontology. Moreover, hidden patterns that attackers use in a conversation were discovered and valuable information about how different CSE concepts and ontology entities interact was extracted. For example, in Figure 12, where the semantic categories are labelled with tags in brackets, we can discover that

Table 6. Contingency matrix. The produced CSE corpus has N = 4500 terms and m = 3 categories, and both annotators (A and B) agreed for the personal category 1665 times, for the enterprise category 1442 times and for the IT category 1194 times. Table 6 contains a contingency matrix where each x_ij represents the number of terms that annotator A classified in category i but annotator B classified in category j, with i, j = 1, 2, 3. The proportions on the diagonal, x_ii, represent the proportion of terms in each category for which the two annotators agreed on the assignment.
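As a rough worked example, the overall observed agreement can be derived from the three diagonal counts quoted above. The sketch below uses only those figures; the variable names are illustrative. A chance-corrected statistic such as Cohen's kappa would also need the off-diagonal cells of Table 6, which are not reproduced here.

```python
# Figures quoted in the caption: total terms and per-category agreements.
N = 4500
agreed = {"personal": 1665, "enterprise": 1442, "IT": 1194}

# Sum of the diagonal cells of the contingency matrix.
total_agreed = sum(agreed.values())

# Observed agreement: share of terms both annotators put in the same category.
observed_agreement = total_agreed / N

# Terms on which the two annotators disagreed (off-diagonal mass).
disagreed = N - total_agreed

print(f"agreed on {total_agreed} of {N} terms")
print(f"observed agreement = {observed_agreement:.4f}")
print(f"disagreed on {disagreed} terms")
```

Under these figures the annotators agree on 4301 of 4500 terms, an observed agreement of about 0.956.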

Table 7. Ten random utterances from CSE corpus.

Mining Newsworthy Topics from Social Media

by Ayse Goker

2021, Studies in Computational Intelligence

Newsworthy stories are increasingly being shared through social networking platforms such as Twitter and Reddit, and journalists now use them to rapidly discover stories and eye-witness accounts. We present a technique that detects... more

Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach

by Benny Saret

2021, EMNLP

We present models which complete missing text given transliterations of ancient Mesopotamian documents, originally written on cuneiform clay tablets (2500 BCE-100 CE). Due to the tablets' deterioration, scholars often rely on contextual... more

Cross-Context News Corpus for Protest Event-Related Knowledge Base Construction

by ali hürriyetoğlu and

2021, Data Intelligence

Financial Event Extraction Using Wikipedia-Based Weak Supervision

by Liat Ein-Dor

2021, Proceedings of the Second Workshop on Economics and Natural Language Processing

Extraction of financial and economic events from text has previously been done mostly using rule-based methods, with more recent works employing machine learning techniques. This work is in line with this latter approach, leveraging... more

Event information extraction

Key research themes

1. How can ontologies and structured semantic frameworks enhance the effectiveness of event extraction from unstructured text?

2. What modeling and joint learning strategies effectively handle complex phenomena such as role overlaps and ambiguity in multilingual event extraction?

3. How can computational frameworks leverage high-level event representations and distributed semantic processing to improve extraction and reasoning over complex and large-scale event streams?

All papers in Event information extraction