In a data stream environment, classification models must effectively and efficiently handle concept drift. Ensemble methods are widely used for this purpose; however, the ones available in the literature either use a large data chunk to update the model or learn the data one by one. In the former, the model may miss the changes in the data distribution, while in the latter, the model may suffer from inefficiency and instability. To address these issues, we introduce a novel ensemble approach based on the Broad Learning System (BLS), where mini chunks are used at each update. BLS is an effective lightweight neural architecture recently developed for incremental learning. Although it is fast, it requires huge data chunks for effective updates and is unable to handle dynamic changes observed in data streams. Our proposed approach, named Broad Ensemble Learning System (BELS), uses a novel updating method that significantly improves best-in-class model accuracy. It employs an ensemble of output layers to address the limitations of BLS and handle drifts. Our model tracks the changes in the accuracy of the ensemble components and reacts to these changes. We present our mathematical derivation of BELS, perform comprehensive experiments with 35 datasets that demonstrate the adaptability of our model to various drift types, and provide its hyperparameter, ablation, and imbalanced dataset performance analysis. The experimental results show that the proposed approach outperforms 10 state-of-the-art baselines and supplies an overall improvement of 18.59% in terms of average prequential accuracy. Index Terms: Data stream mining, concept drift, ensemble learning, neural networks, big data.
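The abstract above reports results as prequential accuracy. As a rough illustration of that evaluation scheme in general (not of BELS itself), the sketch below tests on each instance before training on it. The `MajorityClassLearner` class and the `prequential_accuracy` function are hypothetical names introduced here, and the learner is a deliberately trivial stand-in for a real stream classifier.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # No prediction is possible before the first label arrives.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test-then-train: score each instance before the learner sees its label."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:
            correct += 1
        learner.learn(x, y)
        total += 1
    return correct / total if total else 0.0


stream = [(0, "a"), (1, "a"), (2, "b"), (3, "a")]
print(prequential_accuracy(stream, MajorityClassLearner()))
```

Because every instance is scored before it is used for training, the measure reflects how the model performs on unseen data as the stream evolves, which is why it is commonly used under concept drift.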
Journal of the Association for Information Science and Technology, Sep 2, 2017
A story chain is a set of related news articles that reveal how different events are connected. This study presents a framework for discovering story chains, given an input document, in a text collection. The framework has 3 complementary parts that i) scan the collection, ii) measure the similarity between chain-member candidates and the chain, and iii) measure similarity among news articles. For scanning, we apply a novel text-mining method that uses a zigzagged search that reinvestigates past documents based on the updated chain. We also utilize social networks of news actors to reveal connections among news articles. We conduct 2 user studies in terms of 4 effectiveness measures: relevance, coverage, coherence, and ability to disclose relations. The first user study compares several versions of the framework, by varying parameters, to set a guideline for use. The second compares the framework with 3 baselines. The results show that our method provides statistically significant improvement in effectiveness in 61% of pairwise comparisons, with medium or large effect size; in the remainder, none of the baselines significantly outperforms our method.
Front-page news selection is the task of finding important news articles in news aggregators. In this study, we examine news selection for public front pages using raw text, without any meta-attributes such as click counts. A novel algorithm is introduced by jointly considering the importance and diversity of selected news articles and the length of front pages. We estimate the importance of news, based on topic modelling, to provide the required diversity. Then we select important documents from important topics using a priority-based method that helps in fitting news content into the length of the front page. A user study is subsequently conducted to measure effectiveness and diversity, using our newly generated annotation program. Annotation results show that up to seven of 10 news articles are important and up to nine of them are from different topics. Challenges in selecting public front-page news are addressed with an emphasis on future research.
This tutorial aims to cover the state of the art in stance detection and address open research avenues for interested researchers and practitioners. Stance detection is a recent research topic in which the stance towards a given target or target set is determined based on the given content. It offers significant application opportunities in various domains. The tutorial comprises two parts: the first outlines the fundamental concepts, problems, approaches, and resources of stance detection, while the second covers open research avenues and application areas. The tutorial will be a useful guide for researchers and practitioners of stance detection, social media analysis, information retrieval, and natural language processing.
Proceedings of the 31st ACM International Conference on Information & Knowledge Management
Data streams produce extensive data with high throughput from various domains and require copious amounts of computational resources and energy. Many data streams are generated as multi-label data, and classifying this data is computationally demanding. Some of the most well-known methods for Multi-Label Stream Classification are Problem Transformation schemes; however, previous work in this area does not satisfy the efficiency demands of multi-label data streams. In this study, we propose a novel Problem Transformation method for Multi-Label Stream Classification called Binary Transformation, which utilizes regression algorithms by transforming the labels into a continuous value. We compare our method against three of the leading problem transformation methods using eight datasets. Our results show that Binary Transformation achieves statistically similar effectiveness and provides a much higher level of efficiency.
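The abstract above says only that the labels are transformed into a continuous value for regression; it does not give the exact mapping. The sketch below shows one possible reading, assumed here purely for illustration: a binary label vector is read as a base-2 number, and a regression output is decoded by rounding and unpacking the bits. The function names `labels_to_value` and `value_to_labels` are hypothetical.

```python
def labels_to_value(labels):
    """Encode a binary label vector as one number (assumed mapping: base-2, first label is the most significant bit)."""
    value = 0
    for bit in labels:
        value = value * 2 + int(bit)
    return float(value)


def value_to_labels(value, n_labels):
    """Decode a regression output back into a binary label vector."""
    code = int(round(value))
    # Clamp to the valid range so out-of-range predictions still decode.
    code = max(0, min(code, 2 ** n_labels - 1))
    return [(code >> i) & 1 for i in reversed(range(n_labels))]


target = labels_to_value([1, 0, 1])   # regression target for labels 1 and 3
print(target, value_to_labels(5.2, 3))
```

Under this assumed encoding, a single regressor replaces one classifier per label, which is one way such a transformation could reduce the cost of multi-label stream classification.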
We present the first quantitative analysis of spoken discourse for the Turkish language using memoirs of a group of old-time moviegoers of varying age groups whose birth years span four decades, from the 1930s to the 1960s. They tell their experiences by answering a set of questions. Their responses are evaluated comprehensively with the expectation that various attributes of the participants are reflected in their everyday spoken language. We also investigate their language characteristics in terms of their vocabulary richness and word usage. The results show that the age and gender of the participants can be inferred to some extent from their speech, as is the case for written text. However, the difference is not significant in the language use of younger and older respondents in terms of vocabulary richness and archaic word usage. With additional data obtained for some participants, it is shown that text can be accurately identified as being either spoken or written; however, the spoken text of a person can only be differentiated from their written text with the accuracy level of a random guess.
Annotated datasets in different domains are critical for many supervised learning-based solutions to related problems and for the evaluation of the proposed solutions. Topics in natural language processing (NLP) similarly require annotated datasets to be used for such purposes. In this paper, we target two NLP problems, named entity recognition and stance detection, and present the details of a tweet dataset in Turkish annotated for named entity and stance information. Within the course of the current study, both the named entity and stance annotations of the included tweets are made publicly available, although previously the dataset has been publicly shared with stance annotations only. We believe that this dataset will be useful for uncovering the possible relationships between named entity recognition and stance detection in tweets.
Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 2022
Stance detection (also known as stance classification, stance prediction, and stance analysis) is a problem related to social media analysis, natural language processing, and information retrieval, which aims to determine the position of a person from a piece of text they produce, towards a target (a concept, idea, event, etc.) either explicitly specified in the text, or implied only. Common stance classes include Favor, Against, and None. In this tutorial, we will define the core concepts and other related research problems, present historical and contemporary approaches to stance detection (including shared tasks and tools employed), provide pointers to related datasets, and cover open research directions and application areas of stance detection. As solutions to stance detection can contribute to diverse applications including trend analysis, opinion surveys, user reviews, personalization, and predictions for referendums and elections, it will continue to stand as an important research problem, mostly on textual content currently, and particularly on Web content including social media.
Stance detection is a subproblem of sentiment analysis where the stance of the author of a piece of natural language text for a particular target (either explicitly stated in the text or not) is explored. The stance output is usually given as Favor, Against, or Neither. In this paper, we target stance detection on sports-related tweets and present the performance results of our SVM-based stance classifiers on such tweets. First, we describe three versions of our proprietary tweet data set annotated with stance information, all of which are made publicly available for research purposes. Next, we evaluate SVM classifiers using different feature sets for stance detection on this data set. The employed features are based on unigrams, bigrams, hashtags, external links, emoticons, and lastly, named entities. The results indicate that joint use of the features based on unigrams, hashtags, and named entities by SVM classifiers is a plausible approach for stance detection problem on sport...
Authorship attribution and identifying time period of literary works are fundamental problems in quantitative analysis of languages. We investigate two fundamentally different machine learning text categorization methods, Support Vector Machines (SVM) and Naïve Bayes (NB), and several style markers in the categorization of Ottoman poems according to their poets and time periods. We use the collected works (divans) of ten different Ottoman poets: two poets from each of the five different hundred-year periods ranging from the 15th to the 19th century. Our experimental evaluation and statistical assessments show that it is possible to obtain highly accurate and reliable classifications and to distinguish the methods and style markers in terms of their effectiveness.
Sentiment analysis is a widely studied NLP task where the goal is to determine opinions, emotions, and evaluations of users towards a product, an entity or a service that they are reviewing. One of the biggest challenges for sentiment analysis is that it is highly language dependent. Word embeddings, sentiment lexicons, and even annotated data are language specific. Further, optimizing models for each language is very time consuming and labor intensive, especially for recurrent neural network models. From a resource perspective, it is very challenging to collect data for different languages. In this paper, we look for an answer to the following research question: can a sentiment analysis model trained on a language be reused for sentiment analysis in other languages, Russian, Spanish, Turkish, and Dutch, where the data is more limited? Our goal is to build a single model in the language with the largest dataset available for the task, and reuse it for languages that have limited reso...
Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021
Stance detection (also known as stance classification and stance prediction) is a problem related to social media analysis, natural language processing, and information retrieval, which aims to determine the position of a person from a piece of text they produce, towards a target (a concept, idea, event, etc.) either explicitly specified in the text, or implied only. The output of the stance detection procedure is usually from this set: {Favor, Against, None}. In this tutorial, we will define the core concepts and research problems related to stance detection, present historical and contemporary approaches to stance detection, provide pointers to related resources (datasets and tools), and cover outstanding issues and application areas of stance detection. As solutions to stance detection can contribute to significant tasks including trend analysis, opinion surveys, user reviews, personalization, and predictions for referendums and elections, it will continue to stand as an important research problem, mostly on textual content currently, and particularly on social media. Finally, we believe that image and video content will commonly be the subject of stance detection research soon.
Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018
As data streams become more prevalent, the necessity for online algorithms that mine this transient and dynamic data becomes clearer. Multi-label data stream classification is a supervised learning problem where each instance in the data stream is classified into one or more pre-defined sets of labels. Many methods have been proposed to tackle this problem, including but not limited to ensemble-based methods. Some of these ensemble-based methods are specifically designed to work with certain multi-label base classifiers; some others employ online bagging schemes to build their ensembles. In this study, we introduce a novel online and dynamically weighted stacked ensemble for multi-label classification, called GOOWE-ML, that utilizes spatial modeling to assign optimal weights to its component classifiers. Our model can be used with any existing incremental multi-label classification algorithm as its base classifier. We conduct experiments with 4 GOOWE-ML-based multi-label ensembles and 7 baseline models on 7 real-world datasets from diverse areas of interest. Our experiments show that GOOWE-ML ensembles yield consistently better results in terms of predictive performance in almost all of the datasets, with respect to the other prominent ensemble models.
Automatic elicitation of semantic information from natural language texts is an important research problem with many practical application areas. Especially after the recent proliferation of online content through channels such as social media sites, news portals, and forums, solutions to problems such as sentiment analysis, sarcasm/controversy/veracity/rumour/fake news detection, and argument mining have gained increasing impact and significance, as revealed by large volumes of related scientific publications. In this article, we tackle an important problem from the same family and present a survey of stance detection in social media posts and (online) regular texts. Although stance detection is defined in different ways in different application settings, the most common definition is “automatic classification of the stance of the producer of a piece of text, towards a target, into one of these three classes: {Favor, Against, Neither}.” Our survey includes definitions of related prob...
Turk Kutuphaneciligi - Turkish Librarianship, 2018
This article aims to introduce quantitative style analysis, an important application of computing in the humanities, and presents an original study that measures the fidelity of translations to their source. Quantitative style analysis consists of approaches that perform various classification tasks in information and records management and that provide observations in literary studies which cannot be made during close reading. The article first explains, for researchers who wish to work on Turkish texts, how style analysis can be adapted to Turkish, and presents a comprehensive literature review of studies conducted on Turkish texts. The application purposes of style analysis are examined with examples, and the topics of preprocessing and feature extraction, classification approaches, evaluation of success levels, and supporting software tools are covered. The original study on stylistic consistency between Orhan Pamuk's novel Benim Adım Kırmızı (My Name Is Red) and its translations uses a new approach that examines the distribution of the novel's characters in the principal components plane. The statistically significant observations indicate that the author's style is preserved in the translations. Keywords: Style analysis; text mining; authorship verification; authorship attribution; text classification.
Papers by Fazli Can