Recently, continual learning has received a lot of attention. One of the significant problems is ... more Recently, continual learning has received a lot of attention. One of the significant problems is the occurrence of concept drift, which consists of changing probabilistic characteristics of the incoming data. In the case of the classification task, this phenomenon destabilizes the model's performance and negatively affects the achieved prediction quality. Most current methods apply statistical learning and similarity analysis over the raw data. However, similarity analysis in streaming data remains a complex problem due to time limitation, non-precise values, fast decision speed, scalability, etc. This article introduces a novel method for monitoring changes in the probabilistic distribution of multi-dimensional data streams. As a measure of the rapidity of changes, we analyze the popular Kullback-Leibler divergence. During the experimental study, we show how to use this metric to predict the concept drift occurrence and understand its nature. The obtained results encourage further work on the proposed methods and its application in the real tasks where the prediction of the future appearance of concept drift plays a crucial role, such as predictive maintenance.
JUCS - Journal of Universal Computer Science, 2022
Modern analytical systems must process streaming data and correctly respond to data distribution ... more Modern analytical systems must process streaming data and correctly respond to data distribution changes. The phenomenon of changes in data distributions is called concept drift, and it may harm the quality of the used models. Additionally, the possibility of concept drift appearance causes that the used algorithms must be ready for the continuous adaptation of the model to the changing data distributions. This work focuses on non-stationary data stream classification, where a classifier ensemble is used. To keep the ensemble model up to date, the new base classifiers are trained on the incoming data blocks and added to the ensemble while, at the same time, outdated models are removed from the ensemble. One of the problems with this type of model is the fast reaction to changes in data distributions. We propose the new Chunk Adaptive Restoration framework that can be adapted to any block-based data stream classification algorithm. The proposed algorithm adjusts the data chunk size i...
Recently, the problem of imbalanced data is the focus of intense research of machine learning com... more Recently, the problem of imbalanced data is the focus of intense research of machine learning community. Following work tries to utilize an approach of transforming the data space into another where classification task may become easier. Paper contains a proposition of a tool, based on a photographic metaphor to build a classifier ensemble, combined with a random subspace approach. Developed solution is insensitive to a sample size and robust to dimension increase, which allows a regularization of feature space, reducing the impact of biased classes. The proposed approach was evaluated on the basis of the computer experiments carried out on the benchmark and synthetic datasets.
For the contemporary business, the crucial factor is making smart decisions on the basis of the k... more For the contemporary business, the crucial factor is making smart decisions on the basis of the knowledge hidden in stored data. Unfortunately,m traditional simple methods of data analysis are not sufficient for efficient management of modern enterprizes, because they are not appropriate for the huge and growing amount of the data stored by them. Additionally data usually comes continuously in the form of so-called data stream. The great disadvantage of traditional classification methods is that they assume that statistical properties of the discovered concept are being unchanged, while in real situation, we could observe so-called concept drift, which could be caused by changes in the probabilities of classes or/and conditional probability distributions of classes. The potential for considering new training data is an important feature of machine learning methods used in security applications (spam filtering or intrusion detection) or decision support systems for marketing departme...
Fake news has now grown into a big problem for societies and also a major challenge for people fi... more Fake news has now grown into a big problem for societies and also a major challenge for people fighting disinformation. This phenomenon plagues democratic elections, reputations of individual persons or organizations, and has negatively impacted citizens, (e.g., during the COVID-19 pandemic in the US or Brazil). Hence, developing effective tools to fight this phenomenon by employing advanced Machine Learning (ML) methods poses a significant challenge. The following paper displays the present body of knowledge on the application of such intelligent tools in the fight against disinformation. It starts by showing the historical perspective and the current role of fake news in the information war. Proposed solutions based solely on the work of experts are analysed and the most important directions of the application of intelligent systems in the detection of misinformation sources are pointed out. Additionally, the paper presents some useful resources (mainly datasets useful when assessing ML solutions for fake news detection) and provides a short overview of the most important R&D projects related to this subject. The main purpose of this work is to analyse the current state of knowledge in detecting fake news; on the one hand to show possible solutions, and on the other hand to identify the main challenges and methodological gaps to motivate future research.
The imbalanced data classification is one of the most crucial tasks facing modern data analysis. ... more The imbalanced data classification is one of the most crucial tasks facing modern data analysis. Especially when combined with other difficulty factors, such as the presence of noise, overlapping class distributions, and small disjuncts, data imbalance can significantly impact the classification performance. Furthermore, some of the data difficulty factors are known to affect the performance of the existing oversampling strategies, in particular SMOTE and its derivatives. This effect is especially pronounced in the multi-class setting, in which the mutual imbalance relationships between the classes complicate even further. Despite that, most of the contemporary research in the area of data imbalance focuses on the binary classification problems, while their more difficult multi-class counterparts are relatively unexplored. In this paper, we propose a novel oversampling technique, a Multi-Class Combined Cleaning and Resampling (MC-CCR) algorithm. The proposed method utilizes an energy-based approach to modeling the regions suitable for oversampling, less affected by small disjuncts and outliers than SMOTE. It combines it with a simultaneous cleaning operation, the aim of which is to reduce the effect of overlapping class distributions on the performance of the learning algorithms. Finally, by incorporating a dedicated strategy of handling the multi-class problems, MC-CCR is less affected by the loss of information about the inter-class relationships than the traditional multi-class decomposition strategies. Based on the results of experimental research carried out for many multi-class imbalanced benchmark datasets, the high robust of the proposed approach to noise was shown, as well as its high quality compared to the state-of-art methods.
One of the crucial problems of designing a classifier ensemble is the proper choice of the base c... more One of the crucial problems of designing a classifier ensemble is the proper choice of the base classifier line-up. Basically, such an ensemble is formed on the basis of individual classifiers, which are trained in such a way to ensure their high diversity or they are chosen on the basis of pruning which reduces the number of predictive models in order to improve efficiency and predictive performance of the ensemble. This work is focusing on clustering-based ensemble pruning, which looks for the group of similar classifiers which are replaced by their representatives. We propose a novel pruning criterion based on well-known diversity measures and describe three algorithms using classifier clustering. The first method selects the model with the best predictive performance from each cluster to form the final ensemble, the second one employs the multistage organization, where instead of removing the classifiers from the ensemble each classifier cluster makes the decision independently,...
Instance reduction techniques are data preprocessing methods originally developed to enhance the ... more Instance reduction techniques are data preprocessing methods originally developed to enhance the nearest neighbor rule for standard classification. They reduce the training data by selecting or generating representative examples of a given problem. These algorithms have been designed and widely analyzed in multi-class problems providing very competitive results. However, this issue was rarely addressed in the context of one-class classification. In
Currently, knowledge discovery in databases is an essential step to identify valid, novel and use... more Currently, knowledge discovery in databases is an essential step to identify valid, novel and useful patterns for decision making. There are many real-world scenarios, such as bankruptcy prediction, option pricing or medical diagnosis, where the classification models to be learned need to fulfill restrictions of monotonicity (i.e. the target class label should not decrease when input attributes values increase). For instance, it is rational to assume that a higher debt ratio of a company should never result in a lower level of bankruptcy risk. Consequently,
In this paper new methods for fast computation of the chordal kernels are proposed. Two versions ... more In this paper new methods for fast computation of the chordal kernels are proposed. Two versions of the chordal kernels for tensor data are discussed. These are based on different projectors of the flattened matrices obtained from the input tensors. A direct transformation of multidimensional objects into the kernel feature space leads to better data separation which can result with a higher classification accuracy. Our approach to more efficient computation of the chordal distances between tensors is based on an analysis of the tensor projectors which exhibit different properties. Thanks to this an efficient eigendecomposition becomes possible which is done with a version of the fixed-point algorithm. Experimental results show that our method allows significant speed-up factors, depending mostly on tensor dimensions.
Data preprocessing and reduction have become essential techniques in current knowledge discovery ... more Data preprocessing and reduction have become essential techniques in current knowledge discovery scenarios, dominated by increasingly large datasets. These methods aim at reducing the complexity inherent to real-world datasets, so that they can be easily processed by current data mining solutions. Advantages of such approaches include, among others, a faster and more precise learning process, and more understandable structure of raw data. However, in the context of data preprocessing techniques for data streams have a long road ahead of them, despite online learning is growing in importance thanks to the development of Internet and technologies for massive data collection. Throughout this survey, we summarize, categorize and analyze those contributions on data preprocessing that cope with streaming data. This work also takes into account the existing relationships between the different families of methods (feature and instance selection, and discretization). To enrich our study, we conduct thorough experiments using the most relevant contributions and present an analysis of their predictive performance, reduction rates, computational time, and memory usage. Finally, we offer general advices about existing data stream preprocessing algorithms, as well as discuss emerging future challenges to be faced in the domain of data stream preprocessing.
This paper presents a method for content change detection in multidimensional video signals. Vide... more This paper presents a method for content change detection in multidimensional video signals. Video frames are represented as tensors of order consistent with signal dimensions. The method operates on unprocessed signals and no special feature extraction is assumed. The dynamic tensor analysis method is used to build a tensor model from the stream. Each new datum in the stream is then compared to the model with the proposed concept drift detector. If it fits, then a model is updated. Otherwise, a model is rebuilt, starting from that datum, and the signal shot is recorded. The proposed fast tensor decomposition algorithm allows efficient operation compared to the standard tensor decomposition method. Experimental results show many useful properties of the method, as well as its potential further extensions and applications.
In many applications of information systems learning algorithms have to act in dynamic environmen... more In many applications of information systems learning algorithms have to act in dynamic environments where data are collected in the form of transient data streams. Compared to static data mining, processing streams imposes new computational requirements for algorithms to incrementally process incoming examples while using limited memory and time. Furthermore, due to the non-stationary characteristics of streaming data, prediction models are often also required to adapt to concept drifts. Out of several new proposed stream algorithms, ensembles play an important role, in particular for nonstationary environments. This paper surveys research on ensembles for data stream classification as well as regression tasks. Besides presenting a comprehensive spectrum of ensemble approaches for data streams, we also discuss advanced learning concepts such as imbalanced data streams, novelty detection, active and semi-supervised learning, complex data representations and structured outputs. The paper concludes with a discussion of open research problems and lines of future research.
One of the most important challenges for machine learning community is to develop efficient class... more One of the most important challenges for machine learning community is to develop efficient classifiers which are able to cope with data streams, especially with the presence of the so-called concept drift. This phenomenon is responsible for the change of classification task characteristics, and poses a challenge for the learning model to adapt itself to the current state of the environment. So there is a strong belief that one-class classification is a promising research direction for data stream analysis-it can be used for binary classification without an access to counterexamples, decomposing a multi-class data stream, outlier detection or novel class recognition. This paper reports a novel modification of weighted one-class support vector machine, adapted to the non-stationary streaming data analysis. Our proposition can deal with the gradual concept drift, as the introduced one-class classifier model can adapt its decision boundary to new, incoming data and additionally employs a forgetting mechanism which boosts the ability of the classifier to follow the model changes. In this work, we propose several different strategies for incremental learning and forgetting, and additionally we evaluate them on the basis of several real data streams. Obtained results confirmed the usability of proposed classifier to the problem of data stream classification with the presence of concept drift. Additionally, implemented Communicated by E. Lughofer.
International Journal of Applied Mathematics and Computer Science, 2012
This paper presents a significant modification to the AdaSS (Adaptive Splitting and Selection) al... more This paper presents a significant modification to the AdaSS (Adaptive Splitting and Selection) algorithm, which was developed several years ago. The method is based on the simultaneous partitioning of the feature space and an assignment of a compound classifier to each of the subsets. The original version of the algorithm uses a classifier committee and a majority voting rule to arrive at a decision. The proposed modification replaces the fairly simple fusion method with a combined classifier, which makes a decision based on a weighted combination of the discriminant functions of the individual classifiers selected for the committee. The weights mentioned above are dependent not only on the classifier identifier, but also on the class number. The proposed approach is based on the results of previous works, where it was proven that such a combined classifier method could achieve significantly better results than simple voting systems. The proposed modification was evaluated through c...
Primary headaches are common disease of the modern society and it has high negative impact on the... more Primary headaches are common disease of the modern society and it has high negative impact on the productivity and the life quality of the affected person. Unfortunately, the precise diagnosis of the headache type is hard and usually imprecise, thus methods of headache diagnosis are still the focus of intense research. The paper introduces the problem of the primary headache diagnosis and presents its current taxonomy. The considered problem is simplified into the three class classification task which is solved using advanced machine learning techniques. Experiments, carried out on the large dataset collected by authors, confirmed that computer decision support systems can achieve high recognition accuracy and therefore be a useful tool in an everyday physician practice. This is the starting point for the future research on automation of the primary headache diagnosis.
Uploads
Papers by Michal Wozniak