Papers by Despoina Antonakaki

arXiv (Cornell University), Jun 6, 2023
On 24 February 2022, Russia invaded Ukraine, starting what is now known as the Russo-Ukrainian War and initiating an online discourse on social media. Twitter, as one of the most popular social networks, with an open and democratic character, enables a transparent discussion among its large user base. Unfortunately, this often leads to violations of Twitter's policies, propaganda, abusive actions, civic integrity violations, and consequently to the suspension and deletion of user accounts. This study focuses on the Twitter suspension mechanism and on the analysis of the shared content and user account features that may lead to suspension. Toward this goal, we have obtained a dataset containing 107.7M tweets, originating from 9.8 million users, using the Twitter API. We extract the categories of content shared by the suspended accounts and explain their characteristics through the extraction of text embeddings in conjunction with cosine similarity clustering. Our results reveal scam campaigns taking advantage of trending topics regarding the Russo-Ukrainian conflict for Bitcoin and Ethereum fraud, spam, and advertisement campaigns. Additionally, we apply a machine learning methodology including a SHapley Additive exPlanations (SHAP) explainability model to understand and explain how user accounts get suspended.
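As a rough illustration of grouping text embeddings by cosine similarity, the sketch below uses hand-written toy vectors and a simple greedy threshold rule. The vectors, the 0.9 threshold, and the function names are assumptions made for this example only; the abstract does not specify the embedding model or the clustering procedure the authors used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_similarity(embeddings, threshold=0.9):
    """Greedy grouping (illustrative, not the paper's method): each vector
    joins the first cluster whose first member is at least `threshold`
    similar to it; otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of indices into `embeddings`
    for i, vec in enumerate(embeddings):
        for cluster in clusters:
            if cosine_similarity(embeddings[cluster[0]], vec) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy 3-dimensional "embeddings"; real text embeddings have far more dimensions.
toy_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]
clusters = group_by_similarity(toy_embeddings, threshold=0.9)
print(clusters)
```

With these toy vectors, the first two and the next two tweets end up in the same groups, and the last one stays on its own.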

Analysis of evolution, dynamics and vulnerabilities of online social networks
Modern social media offer an experience that goes beyond simple communication, information and entertainment. With an average daily usage time that can reach 3 hours, a penetration that exceeds one third of the world's population, and a steady growth rate over the last 30 years, social media now influence the way a society interacts, reacts to various events, and diffuses information among its members. It is natural that the enormous social impact and expansion of social media raise various questions. Some of them concern the rate at which the graph representing the users of a social network changes and evolves, and address issues such as what grows more over time: the users or the connections they make with each other. Another issue is the timely and effective protection of users from threats such as spam. A third question is how we can assess the overall impression, positive or negative, that users have of various sensitive entities, such as political parties and ideologies, during a pre-election period. This doctoral dissertation focuses on the popular social network Twitter and attempts to answer these questions by applying and advancing methods from graph analysis, machine learning and natural language processing. First, a model of the temporal evolution of the social graph is presented. For this purpose, two representative samples of Twitter are collected, one from the early and one from the most recent time period. Using a well-known model that has so far been applied only to small graphs, we study the evolution of Twitter over a period of 8 years. In addition, we contrast the observed fluctuations of this growth with real events and demonstrate the extent to which the application of anti-spam policies and the influx of new users can affect the growth of a social network. We then proceed to study a new strategy for spreading spam on social media. This method of propagation exploits the combination of popular topics (trending topics) on Twitter with spam. Using machine learning methods, we show that the use of these popular topics provides the best way to distinguish spam and the users who send it. Furthermore, we reveal a technique for concealing spam that evades Twitter's detection mechanisms (spam masquerading) and show how spam can be mitigated with simple graph analysis as well as machine learning techniques. The last aspect of this thesis studies content analysis on Twitter. Specifically, we apply a combination of natural language processing (NLP) techniques to study the way users, and by extension voters, express themselves during a real and turbulent electoral event. To do so, we apply techniques to extract the most important entities contained in the dataset, study the volume of messages around these entities, and detect the levels of sarcasm and sentiment around them. With these techniques we arrive at the extraction of semantic relations among the most important entities, as well as the variation of sentiment over time for the different groups of voters.

arXiv (Cornell University), May 31, 2023
Twitter, as one of the most popular social networks, offers a means for communication and online discourse, which unfortunately has been the target of bots and fake accounts, leading to the manipulation and spreading of false information. To this end, we gather a challenging, multilingual dataset of social discourse on Twitter, originating from 9M users and concerning the recent Russo-Ukrainian war, in order to detect the bot accounts and the conversations involving them. We collect the ground truth for our dataset through the Twitter API suspended accounts collection, containing approximately 343K bot accounts and 8M normal users. Additionally, we use a dataset provided by Botometer-V3 with 1,777 Varol, 483 German, and 1,321 US accounts. Besides the publicly available datasets, we also collect 2 independent datasets around popular discussion topics: the 2022 energy crisis and the 2022 conspiracy discussions. Both datasets were labeled according to the Twitter suspension mechanism. We build a novel ML model for bot detection using the state-of-the-art XGBoost model. We combine the model with a high volume of tweets labeled according to the Twitter suspension mechanism ground truth. This requires a limited set of profile features, allowing labeling of the dataset in time periods different from the collection period, as it is independent of the Twitter API. In comparison with Botometer, our methodology achieves an average 11% higher ROC-AUC score over two real-case scenario datasets.
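For readers unfamiliar with the ROC-AUC score used in this comparison, the following sketch computes it from its pairwise definition: the probability that a randomly chosen positive (bot) account receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and the two score lists are made-up toy values and the detector names are hypothetical; they are not results from the paper.

```python
def roc_auc(labels, scores):
    """ROC-AUC via its pairwise definition (quadratic time, fine for toy data)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy ground truth (1 = bot, 0 = normal user) and scores from two hypothetical detectors.
labels = [1, 1, 1, 0, 0, 0]
detector_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
detector_b = [0.9, 0.3, 0.4, 0.5, 0.6, 0.1]

auc_a = roc_auc(labels, detector_a)
auc_b = roc_auc(labels, detector_b)
print(f"detector A: {auc_a:.3f}, detector B: {auc_b:.3f}")
```

Because the score depends only on how accounts are ranked, it can be compared across detectors whose raw outputs are on different scales.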

Social media is a social space for communication between people who share common activities, hobbies, interests and lifestyles. Activity tracking apps allow the sharing of user activity summaries or of individual activity events and milestones, both as a motivation factor and as a collaborative approach to activity, i.e. a team effect. Social media posts are the major channel through which such information, in several contextual forms, as well as other types of related posts, is made available to users. This work reports on the extent to which such social information affects the interest in, and motivation towards, healthy living among casual social media users. The type of input enrichment is investigated to that effect.
CCS Concepts • Human-centered computing~Social media • Human-centered computing~User studies • Human-centered computing~Empirical studies in HCI • Human-centered computing~Interaction design theory, concepts and paradigms

PLOS ONE, Oct 31, 2017
Today, a considerable proportion of the public political discourse on nationwide elections proceeds in Online Social Networks. By analyzing this content, we can discover the major themes that prevailed during the discussion, investigate the temporal variation of positive and negative sentiment, and examine the semantic proximity of these themes. According to existing studies, the results of similar tasks are heavily dependent on the quality and completeness of dictionaries for linguistic preprocessing, entity discovery and sentiment analysis. Additionally, noise reduction is achieved with methods for sarcasm detection and correction. Here we report on the application of these methods to the complete corpus of tweets regarding two local electoral events of worldwide impact: the Greek referendum of 2015 and the subsequent legislative elections. To this end, we compiled novel dictionaries for sentiment and entity detection for the Greek language, tailored to these events. We subsequently performed volume analysis, sentiment analysis, sarcasm correction and topic modeling. Results showed that there was a strong anti-austerity sentiment accompanied by a critical view of European and Greek political actions.

arXiv (Cornell University), Oct 16, 2020
The presidential elections in the United States on 3 November 2020 have caused extensive discussions on social media. A part of the content on the US elections is organic, coming from users discussing their opinions of the candidates, political positions, or relevant content presented on television. Another significant part of the content originates from organized campaigns, both official and astroturfed. In this study, we obtain approximately 17.5M tweets from 3M users, based on prevalent hashtags related to the 2020 US election, as well as the related YouTube links contained in the Twitter dataset and the likes, dislikes and comments of the videos, and conduct volume, sentiment and graph analysis on the communities formed. In particular, we study the daily traffic per prevalent hashtag, plot the retweet graph from July to September 2020, show how its main connected component becomes denser in the period closer to the elections, and highlight the two main entities ('Biden' and 'Trump'). Additionally, we gather the related YouTube links contained in the previous dataset and perform sentiment analysis. The results of sentiment analysis on the Twitter corpus and the gathered YouTube metadata show the positive and negative sentiment for the two entities throughout this period. They indicate that 45.7% of users express positive sentiment towards Trump on Twitter and 33.8% positive sentiment towards Biden, while in the gathered YouTube metadata 14.55% of users express positive sentiment towards Trump and 8.7% positive sentiment towards Biden. Our analysis fills the gap between offline events and their consequences in social media by monitoring important real-world events and measuring public volume and sentiment in social media before and after each event.
CCS CONCEPTS • Networks → Online social networks; • Information systems → Sentiment analysis.
arXiv (Cornell University), Apr 7, 2022
On 24 February 2022, Russia invaded Ukraine, starting what is now known as the Russo-Ukrainian War. We have initiated an ongoing dataset acquisition from the Twitter API. As of the day this paper was written, the dataset has reached 57.3 million tweets, originating from 7.7 million users. We apply an initial volume and sentiment analysis, while the dataset can be used for further exploratory investigation of topic analysis, hate speech, and propaganda recognition, or even to reveal potentially malicious entities such as botnets.

arXiv (Cornell University), Oct 5, 2015
Twitter is one of the most prominent Online Social Networks. It covers a significant part of the worldwide online population (~20%) and has impressive growth rates. The social graph of Twitter has been the subject of numerous studies, since it can reveal the intrinsic properties of large and complex online communities. Despite the plethora of these studies, there is limited coverage of how the properties of the social graph evolve over time. Moreover, due to the extreme size of this social network (millions of nodes, billions of edges), only a small subset of possible graph properties can be efficiently measured in a reasonable timescale. In this paper we propose a sampling framework that allows the estimation of graph properties on large social networks. We apply this framework to a subset of Twitter's social network that has 13.2 million users and 8.3 billion edges, and covers the complete Twitter timeline (from April 2006 to January 2015). We derive estimates of the time evolution of 24 graph properties, many of which have never been measured on large social networks. We further discuss how these estimates shed more light on the inner structure and growth dynamics of Twitter's social network.
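The general idea of estimating a graph property from a sample rather than from the full graph can be sketched as follows. This example uses plain uniform node sampling to estimate mean out-degree on a small synthetic graph; the graph, the sample size, and the function names are assumptions for illustration only and do not reproduce the sampling framework proposed in the paper.

```python
import random


def mean_degree(adjacency, nodes):
    """Average out-degree over the given nodes."""
    return sum(len(adjacency[n]) for n in nodes) / len(nodes)


def estimate_mean_degree(adjacency, sample_size, seed=0):
    """Estimate mean out-degree from a uniform random sample of nodes
    (sampling without replacement, seeded for reproducibility)."""
    rng = random.Random(seed)
    sample = rng.sample(sorted(adjacency), sample_size)
    return mean_degree(adjacency, sample)


# Synthetic directed graph: node i "follows" the next (i % 5) + 1 nodes on a ring,
# so out-degrees cycle through 1, 2, 3, 4, 5.
n = 1000
adjacency = {i: [(i + k) % n for k in range(1, (i % 5) + 2)] for i in range(n)}

exact = mean_degree(adjacency, list(adjacency))
estimate = estimate_mean_degree(adjacency, sample_size=100)
print(f"exact mean out-degree: {exact}, estimate from 100 nodes: {estimate}")
```

On graphs with billions of edges, the appeal of this kind of approach is that only the sampled nodes need to be inspected; properties that depend on global structure generally need more careful sampling designs than the uniform one shown here.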

Advances in Social Networks Analysis and Mining, Aug 18, 2016
Today, a considerable proportion of the public political discourse that precedes nationwide elections takes place in Online Social Networks. By analyzing this content, we can discover the major themes that prevailed during the discussion, investigate the temporal variation of positive and negative sentiment, and examine the semantic proximity of these themes. According to existing studies, the results of similar tasks are heavily dependent on the quality and completeness of dictionaries for linguistic preprocessing, entity discovery and sentiment analysis. Additionally, noise reduction is achieved with methods for sarcasm detection and correction. Here we report on the application of these methods to the complete corpus of tweets regarding two local electoral events of worldwide impact: the Greek referendum of 2015 and the subsequent legislative elections. To this end, we compiled novel dictionaries for sentiment and entity detection for the Greek language, tailored to these events. We subsequently performed volume analysis, sentiment analysis and sarcasm correction. Results showed that there was a strong anti-austerity sentiment accompanied by a critical view of European and Greek political actions.
Zenodo (CERN European Organization for Nuclear Research), Jun 16, 2022
This study introduces a novel, reproducible and reusable Twitter bot identification system. The system uses a machine learning (ML) pipeline, fed with hundreds of features extracted from a Twitter corpus. The main objective of the proposed ML pipeline is to train and validate different state-of-the-art machine learning models, among which the eXtreme Gradient Boosting (XGBoost) model is selected since it achieves the highest detection performance. The Twitter dataset was collected during the 2020 US Presidential Elections, and additional experimental evaluation on distinct Twitter datasets demonstrates the superiority of our approach in terms of bot detection accuracy.
Lecture Notes in Computer Science, 2016
This work explores the use of speech-enabled complex graphs that are designed to enable non-technical users to edit and appraise visually complex semantic structures. The standard usability evaluation that was performed previously employed young, computer-literate participants who were familiar with such concepts and tools. We report findings on how technically savvy and technically challenged users experience the different modalities, make choices, and identify each modality's advantages and shortcomings, as well as on the ability of each user group to optimally exploit modality combination paths.
Explainable machine learning pipeline for Twitter bot detection during the 2020 US Presidential Elections
Software impacts, Aug 1, 2022

Proceedings of the International AAAI Conference on Web and Social Media, May 31, 2022
Twitter is one of the most popular social networks, attracting millions of users and capturing a considerable proportion of online discourse. It provides a simple usage framework with short messages and an efficient application programming interface (API), enabling the research community to study and analyze several aspects of this social network. However, Twitter's simplicity of use can lead to malicious handling by various bots. This phenomenon expands in online discourse, especially during electoral periods, where, apart from the legitimate bots used for dissemination and communication purposes, the goal is to manipulate public opinion and the electorate towards a certain direction, specific ideology, or political party. This paper focuses on the design of a novel system for identifying Twitter bots based on labeled Twitter data. To this end, a supervised machine learning (ML) framework is adopted using an Extreme Gradient Boosting (XGBoost) algorithm, where the hyper-parameters are tuned via cross-validation. Our study also deploys Shapley Additive Explanations (SHAP) for explaining the ML model predictions by calculating feature importance, using the game-theoretic Shapley values. Experimental evaluation on distinct Twitter datasets demonstrates the superiority of our approach, in terms of bot detection accuracy, when compared against a recent state-of-the-art Twitter bot detection method.
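To make the game-theoretic notion of Shapley values concrete, the sketch below computes them exactly for a three-feature toy "model" by averaging each feature's marginal contribution over all orderings. The feature names and the value function are hypothetical and chosen only for illustration; SHAP libraries use faster approximations or model-specific algorithms rather than this brute-force enumeration, which is only feasible for a handful of features.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution of each player
    over every ordering in which players can join the coalition."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_value(coalition):
    """Hypothetical model output when only the features in `coalition` are known."""
    score = 0.0
    if "followers" in coalition:
        score += 1.0
    if "account_age" in coalition:
        score += 2.0
    if "followers" in coalition and "tweet_rate" in coalition:
        score += 3.0  # interaction effect, shared between the two features
    return score


features = ["followers", "account_age", "tweet_rate"]
phi = shapley_values(features, toy_value)
print(phi)
```

The attributions sum to the model output for the full feature set minus the output for the empty set, and the interaction term is split evenly between the two features that produce it.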
International Journal of Molecular Sciences, Jun 27, 2022
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY

Venous thromboembolism (VTE) is the third most common cardiovascular condition. Some high-risk patients diagnosed with VTE need immediate treatment and monitoring in intensive care units (ICU), as the mortality rate is high. Most of the published predictive models for ICU mortality give information on in-hospital mortality using data recorded on the first day of ICU admission. The purpose of the current study is to predict in-hospital and after-discharge mortality in patients with VTE admitted to the ICU using a machine learning (ML) framework. We studied 2,468 patients from the Medical Information Mart for Intensive Care (MIMIC-III) database who were admitted to the ICU with a diagnosis of VTE. We formed ML classification tasks for early and late mortality prediction. In total, 1,471 features were extracted for each patient, grouped in seven categories, each representing a different type of medical assessment. We used an automated ML platform, JADBIO, as well as class balancing combined with a Random Forest classifier, in order to evaluate the importance of class imbalance. Both methods showed significant ability in the prediction of early mortality (AUC=0.92). Nevertheless, the prediction of late mortality was less accurate (AUC=0.82). To the best of our knowledge, this is the first study in which ML is used to predict short-term and long-term mortality for ICU patients with VTE based on a multitude of clinical features collected over time.

International Journal of Molecular Sciences
Intensive care unit (ICU) patients with venous thromboembolism (VTE) and/or cancer suffer from high mortality rates. Mortality prediction in the ICU has been a major medical challenge for which several scoring systems exist but lack specificity. This study focuses on two target groups, namely patients with thrombosis or cancer. The main goal is to develop and validate interpretable machine learning (ML) models to predict early and late mortality, while exploiting all available data stored in the medical record. To this end, retrospective data from two freely accessible databases, MIMIC-III and eICU, were used. Well-established ML algorithms were implemented utilizing automated and purposely built ML frameworks for addressing class imbalance. Prediction of early mortality showed excellent performance in both disease categories, in terms of the area under the receiver operating characteristic curve (AUC–ROC): VTE-MIMIC-III 0.93, eICU 0.87, cancer-MIMIC-III 0.94. On the other hand, ...

A survey of Twitter research: Data model, graph structure, sentiment analysis and attacks
Expert Systems With Applications, Feb 1, 2021
Twitter is the third most popular worldwide Online Social Network (OSN) after Facebook and Instagram. Compared to other OSNs, it has a simple data model and a straightforward data access API. This makes it ideal for social network studies attempting to analyze the patterns of online behavior, the structure of the social graph, the sentiment towards various entities and the nature of malicious attacks in a vivid network with hundreds of millions of users. Indeed, Twitter has been established as a major research platform, utilized in more than ten thousand research articles over the last ten years. Although there are excellent review and comparison studies for most of the research that utilizes Twitter, there are limited efforts to map this research terrain as a whole. Here we present an effort to map the current research topics in Twitter, focusing on three major areas: the structure and properties of the social graph, sentiment analysis, and threats such as spam, bots, fake news and hate speech. We also present Twitter’s basic data model and best practices for sampling and data access. This survey also lays the groundwork for the computational techniques used in these areas, such as Graph Sampling, Natural Language Processing and Machine Learning. Along with existing reviews and comparison studies, we also discuss the key findings and the state of the art in these methods. Overall, we hope that this survey will help researchers create a clear conceptual model of Twitter and act as a guide to further expand the topics presented.

PLOS ONE
Most studies analyzing political traffic on Social Networks focus on a single platform, while campaigns and reactions to political events produce interactions across different social media. Ignoring such cross-platform traffic may lead to analytical errors, missing important interactions across social media that, for example, explain the cause of trending or viral discussions. This work links the Twitter and YouTube social networks using cross-postings of video URLs on Twitter to discover the main tendencies and preferences of the electorate, distinguish the favouritism of users and communities towards an ideology or candidate, study the sentiment towards candidates and political events, and measure political homophily. This study shows that Twitter communities correlate with YouTube comment communities: that is, Twitter users belonging to the same community in the Retweet graph tend to post YouTube video links with comments from YouTube users belonging to the same community in the YouTube Comment gra...

Complete COVID-19 Twitter dataset for two main hashtags
The COVID-19 pandemic, which began over a year ago, continues to spread around the globe, and the ongoing research regarding COVID-19 continues to grow as well. The online discourse on social media regarding COVID-19 has been growing along with the timeline of the pandemic. Open data on Twitter have been released and offer the research community the opportunity for new findings and for resolving this new threat. In this dataset, we release a corpus of Twitter data from March 2020 until today, which is updated every day based on the two most important hashtags regarding COVID-19. This dataset will offer the research community the opportunity to explore the social extensions of this pandemic, including topic analysis, hate speech, and sentiment analysis, regarding either the opinion of the users on the pandemic, the comments on the public discourse, or the vaccination releases. The dataset has been collected by retrieving all the tweets that contain the hashtags #coronavirus and #COVID19, including approximately 208M tweets for #coronavirus and 392M tweets for #COVID19, resulting in a total of 600M tweets.