Papers by Ana Carolina Lorena
AAAI Press eBooks, Jul 1, 2007
Criteria for evaluating the performance of a classifier are an important part of its design. They make it possible to estimate the behavior of the generated classifier on unseen data and can also be used to compare its performance against that of classifiers generated by other classification algorithms. There are currently several performance measures for binary and flat classification problems. For hierarchical classification problems, where there are multiple classes that are hierarchically related, the evaluation step is more complex. This paper reviews the main evaluation metrics proposed in the literature to evaluate hierarchical classification models.
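As a concrete illustration of the kind of metric surveyed in this line of work (an example chosen for this summary, not a claim about the paper's full coverage), one widely used family of hierarchical measures augments the predicted and true class sets of each instance with all of their ancestor classes in the hierarchy and then computes precision and recall over the augmented sets:

```latex
% Hierarchical precision (hP), recall (hR) and F-measure (hF).
% \hat{P}_i and \hat{T}_i denote the predicted and true label sets of
% instance i, each augmented with all ancestor classes in the hierarchy.
\[
hP = \frac{\sum_i |\hat{P}_i \cap \hat{T}_i|}{\sum_i |\hat{P}_i|}, \qquad
hR = \frac{\sum_i |\hat{P}_i \cap \hat{T}_i|}{\sum_i |\hat{T}_i|}, \qquad
hF = \frac{2 \cdot hP \cdot hR}{hP + hR}
\]
```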
Neurocomputing, Mar 1, 2014
Springer eBooks, Aug 13, 2007
The version in the Kent Academic Repository may differ from the final published version. Users are advised to check http://kar.kent.ac.uk for the status of the paper. Users should always cite the published version of record.

Springer eBooks, Aug 27, 2008
Despite recent advances in Molecular Biology, the function of a large number of proteins is still unknown. An approach that can be used in the prediction of a protein's function consists of searching against secondary databases, also known as signature databases. Different strategies can be applied to use protein signatures in the prediction of protein function. A sophisticated approach consists of inducing a classification model for this prediction. This paper applies five hierarchical classification methods based on the standard Top-Down approach and one hierarchical classification method based on a new approach named Top-Down Ensembles, based on the hierarchical combination of classifiers, to three different protein functional classification datasets that employ protein signatures. The algorithm based on the Top-Down Ensembles approach presented slightly better results than the other algorithms, indicating that combinations of classifiers can improve the performance of hierarchical classification models.
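For readers unfamiliar with the standard Top-Down scheme mentioned above, the sketch below shows its core idea: one local classifier is trained per parent node of the class hierarchy, and prediction descends from the root towards a leaf. The hierarchy encoding, base learner and data layout are illustrative assumptions, not the datasets or signature features used in the paper.

```python
# Minimal sketch of a Top-Down hierarchical classifier.
# Assumptions: X is a NumPy feature matrix and each label is a path from the
# root, e.g. ["root", "A", "A.1"]; these are placeholders for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

class TopDownClassifier:
    def __init__(self, hierarchy, base_learner=LogisticRegression):
        # hierarchy: dict mapping a parent class to the list of its children,
        # e.g. {"root": ["A", "B"], "A": ["A.1", "A.2"]}
        self.hierarchy = hierarchy
        self.base_learner = base_learner
        self.models = {}

    def fit(self, X, y_paths):
        X = np.asarray(X)
        for parent in self.hierarchy:
            # keep instances whose label path passes through `parent`
            # and continues to one of its children
            idx = [i for i, p in enumerate(y_paths)
                   if parent in p and p.index(parent) + 1 < len(p)]
            if not idx:
                continue
            y_node = [y_paths[i][y_paths[i].index(parent) + 1] for i in idx]
            if len(set(y_node)) > 1:
                self.models[parent] = self.base_learner().fit(X[idx], y_node)
        return self

    def predict(self, x):
        # descend the hierarchy, one local decision per level
        node, path = "root", []
        while node in self.models:
            node = self.models[node].predict(np.asarray(x).reshape(1, -1))[0]
            path.append(node)
        return path
```

The Top-Down Ensembles variant studied in the paper combines several classifiers at each node instead of a single one; that combination step is omitted here.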
International Joint Conference on Industrial Engineering and Operations Management Proceedings
This research assembles a dataset of COVID-19 positive patients suitable for the application of Machine Learning (ML) techniques, making it possible to obtain prognosis models that distinguish severe from non-severe patients by taking as input hematological exams performed upon hospital attendance. Six ML techniques were applied to analyze data from 4,320 COVID-19 positive patients, 394 of whom evolved to a severe health state requiring intensive care. The Random Forest classifier showed the best predictive performance among the algorithms and settings used, with an AUC score of up to 0.94 ± 0.02. In addition, ten clinical variables were revealed to be the most correlated with the prognosis according to a mutual information score, although some of them had a high fraction of missing values.
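A minimal sketch of this kind of pipeline in scikit-learn is shown below: a Random Forest evaluated by cross-validated AUC plus a mutual-information ranking of the input variables. The file name, column names and preprocessing choices are assumptions for illustration only, not the study's actual data or protocol.

```python
# Hypothetical prognosis pipeline: Random Forest + AUC + mutual information.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score

df = pd.read_csv("covid_hemogram.csv")      # placeholder file name
X = df.drop(columns=["severe"])             # hematological exam features
y = df["severe"]                            # 1 = evolved to severe state

# Hematological exams often have missing values, as noted in the abstract.
X_imp = SimpleImputer(strategy="median").fit_transform(X)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc = cross_val_score(clf, X_imp, y, cv=cv, scoring="roc_auc")
print(f"AUC: {auc.mean():.2f} +/- {auc.std():.2f}")

# Rank clinical variables by mutual information with the outcome.
mi = mutual_info_classif(X_imp, y, random_state=0)
print(pd.Series(mi, index=X.columns).sort_values(ascending=False).head(10))
```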

Classifier Recommendation Using Data Complexity Measures
2018 24th International Conference on Pattern Recognition (ICPR), 2018
Application of machine learning to new and unfamiliar domains calls for increasing automation in choosing a learning algorithm suitable for the data arising from each domain. Meta-learning can address this need, since it has been widely used in recent years to support the recommendation of the most suitable algorithms for a new dataset. The use of complexity measures can increase the systematic comprehension of the meta-models and also make it possible to differentiate the performance of a set of techniques by taking into account the overlap between classes imposed by feature values and the separability and distribution of the data points. In this paper we compare the effectiveness of several standard regression models in predicting the accuracies of classifiers for classification problems from the OpenML repository. We show that the models can predict the classifiers' accuracies with low mean squared error and identify the best classifier for a problem, resulting in statistically significant improvements over a randomly chosen classifier or a fixed classifier believed to be good on average.
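The core meta-learning setup described above can be sketched as a standard regression problem over dataset-level meta-features. The meta-dataset file and column names below are assumptions for illustration, not the paper's actual meta-data.

```python
# Meta-regression sketch: predict a classifier's accuracy from complexity
# measures computed on each dataset (one row per dataset).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

meta = pd.read_csv("meta_dataset.csv")           # hypothetical meta-dataset
X_meta = meta.drop(columns=["svm_accuracy"])     # complexity meta-features
y_meta = meta["svm_accuracy"]                    # observed accuracy of one classifier

reg = RandomForestRegressor(n_estimators=300, random_state=0)
mse = -cross_val_score(reg, X_meta, y_meta, cv=10,
                       scoring="neg_mean_squared_error")
print(f"Mean squared error: {mse.mean():.4f}")
```

Repeating this for each candidate classifier yields predicted accuracies that can be compared to recommend an algorithm for a new dataset.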
Extração de conhecimento de dados: data mining
Márcia Oliveira is a researcher at LIAAD-INESC TEC and an Associate Professor at the Faculty of Economics of the University of Porto. Her research area is machine learning, mainly on continuous data streams. She has published more than 120 scientific papers in international conferences and journals. She has organized several international conferences and event series. She is a member of the editorial boards of several international journals in the areas of machine learning and knowledge extraction from databases. She is the author of a recent book in the area.

Measuring Instance Hardness Using Data Complexity Measures
Intelligent Systems, 2020
Assessing the hardness of each instance in a problem provides important meta-knowledge that may leverage advances in Machine Learning. In classification problems, an instance can be regarded as difficult if it is systematically misclassified by a diverse set of classification techniques with different biases. Instance hardness measures were proposed with the aim of relating data characteristics to this notion of the intrinsic difficulty of the instances. The literature also contains a large set of measures dedicated to describing the difficulty of a classification problem from a dataset-level perspective. In this paper these measures are decomposed at the instance level, giving a perspective of how each individual example in a dataset contributes to its overall complexity. Experiments on synthetic and benchmark datasets demonstrate that the proposed measures can provide a complementary instance hardness perspective when compared to those from the related literature.
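The notion of instance hardness used as a reference above can be sketched as the fraction of a diverse classifier pool that misclassifies each instance under cross-validation. The pool and dataset below are illustrative choices, not the exact algorithms used in the paper.

```python
# Instance hardness sketch: out-of-fold error rate per instance over a
# pool of classifiers with different biases.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
pool = [LogisticRegression(max_iter=5000), DecisionTreeClassifier(random_state=0),
        KNeighborsClassifier(), GaussianNB(), SVC()]

errors = np.zeros(len(y))
for clf in pool:
    pred = cross_val_predict(clf, X, y, cv=10)   # out-of-fold predictions
    errors += (pred != y).astype(float)

hardness = errors / len(pool)    # 1.0 = misclassified by every classifier
print("hardest instances:", np.argsort(hardness)[-5:])
```

The paper's contribution is to explain such hardness values through instance-level decompositions of data complexity measures, which this sketch does not cover.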

Automatic recovering the number k of clusters in the data by active query selection
Proceedings of the 36th Annual ACM Symposium on Applied Computing, 2021
One common parameter of many clustering algorithms is the number k of clusters required to partition the data. This is the case for k-means, one of the most popular clustering algorithms in the Machine Learning literature, and its variants. Indeed, when clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this context, one popular procedure used to estimate the number of clusters present in a dataset is to run the clustering algorithm multiple times while varying the number of clusters and then choose one of the solutions obtained based on a given internal clustering validation measure (e.g., the silhouette coefficient). This process can be very time consuming, as the clustering algorithm must be run several times. In this paper we present strategies that can be integrated into constrained clustering methods so as to automatically recover the number k of clusters. The idea is that constrained clustering algorithms allow one to incorporate prior information, such as whether some pairs of instances from the dataset must be placed in the same cluster or not. Still in the context of constrained clustering, there are approaches that use active methods for pairwise constraint selection in order to improve the quality of the pairwise constraints given as input to the algorithm. In our proposed strategies we make use of the prior information provided by the pairwise constraints and the concept of neighborhood from active methods not only to build a partition, but also to automatically identify the number k of clusters in the data. Based on nine datasets, we show experimentally that our strategies, besides automatically recovering the number of clusters in the data, lead to the generation of high-quality partitions when evaluated by clustering performance indicators such as the adjusted Rand index.
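For context, the costly baseline procedure referred to above (repeated runs scored by an internal validation measure) looks roughly like the sketch below; the paper's constrained, active-query strategies are designed to avoid exactly this repeated search. Dataset and candidate range are illustrative.

```python
# Baseline sketch: pick k by maximizing the silhouette coefficient over
# repeated k-means runs with different numbers of clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

best_k, best_score = None, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"estimated number of clusters: {best_k} (silhouette = {best_score:.2f})")
```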

Data Complexity Measures for Imbalanced Classification Tasks
2018 International Joint Conference on Neural Networks (IJCNN), 2018
In imbalanced classification tasks, the training datasets may show class overlap and classes of low density. In these scenarios, the predictions for the minority class are impaired. Although assessing the imbalance level of a training set is straightforward, it is hard to measure other aspects that may affect the predictive performance of classification algorithms in imbalanced tasks. This paper presents a set of measures designed to understand the difficulty of imbalanced classification tasks by regarding each class individually. They are adapted from popular data complexity measures for classification problems, which are shown to perform poorly in imbalanced scenarios. Experiments on synthetic datasets with different levels of imbalance, class overlap and class density show that the proposed adaptations can better explain the difficulty of imbalanced classification tasks.
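To make the idea of a per-class decomposition concrete, the sketch below computes a classical complexity measure, the maximum Fisher's discriminant ratio (often called F1), for each class against the remaining ones rather than over the whole dataset. This is a simplified illustration in the spirit of the abstract, not the exact set of measures proposed in the paper.

```python
# Per-class variant of the maximum Fisher's discriminant ratio (F1).
import numpy as np
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

def per_class_f1(X, y, c):
    """Max Fisher's discriminant ratio of class c versus the rest."""
    Xc, Xr = X[y == c], X[y != c]
    num = (Xc.mean(axis=0) - Xr.mean(axis=0)) ** 2
    den = Xc.var(axis=0) + Xr.var(axis=0) + 1e-12
    return (num / den).max()     # higher = class c is easier to separate

for c in np.unique(y):
    print(f"class {c}: per-class F1 = {per_class_f1(X, y, c):.3f}")
```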
Information Sciences, 2021
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Applied Artificial Intelligence, 2017
Feature selection, an important combinatorial optimization problem in data mining, aims to find a reduced subset of high-quality features in a dataset. Different categories of importance measures can be used to estimate the quality of a feature subset. Since each measure provides a distinct perspective of the data and of which features are important, in this article we investigate the simultaneous optimization of importance measures from different categories using multi-objective genetic algorithms grounded in Pareto theory. An extensive experimental evaluation of the proposed method is presented, including an analysis of the performance of predictive models built using the selected subsets of features. The results show the competitiveness of the method in comparison with six feature selection algorithms. As an additional contribution, we conducted a pioneering, rigorous, and replicable systematic review of related work; the resulting summary of 93 related papers reinforces the strengths of our method.
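The Pareto-based selection idea can be illustrated without the full genetic algorithm: score candidate feature subsets with two importance measures from different categories and keep only the non-dominated ones. The measures, dataset and random sampling below are illustrative stand-ins for the paper's evolutionary search.

```python
# Pareto-front sketch over randomly sampled feature subsets, scored by
# average mutual information (relevance) and average absolute correlation
# (redundancy, negated so both objectives are maximized).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

def evaluate(mask):
    Xs = X[:, mask]
    relevance = mutual_info_classif(Xs, y, random_state=0).mean()
    redundancy = np.abs(np.corrcoef(Xs, rowvar=False)).mean()
    return relevance, -redundancy          # two objectives to maximize

candidates = []
for _ in range(50):                        # a GA would evolve these instead
    mask = rng.random(X.shape[1]) < 0.3
    if mask.sum() >= 2:
        candidates.append((mask, evaluate(mask)))

def dominated(a, b):
    """True if objective vector a is dominated by b."""
    return all(bj >= aj for aj, bj in zip(a, b)) and any(bj > aj for aj, bj in zip(a, b))

pareto = [c for c in candidates
          if not any(dominated(c[1], other[1]) for other in candidates)]
print(f"{len(pareto)} non-dominated subsets out of {len(candidates)}")
```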

Intelligent Data Analysis, 2017
The use of smartphone devices has increased in recent years, as illustrated by the growth in smartphone sales. These devices are currently used for several services, such as bank account access, social networks and storage of personal information. In view of this scenario, an important question arises: do the authentication mechanisms already present in these devices provide enough security? Recently, a new authentication method, named accelerometer biometrics, has been proposed. This method allows users to be authenticated using accelerometer data, which can be obtained from the accelerometers usually present in modern smartphones. This is a clear advantage of this biometric modality, as there would be no additional hardware cost. However, as a behavioral biometric technology, user models induced from accelerometer data may become outdated over time. This paper investigates the use of adaptation mechanisms to update user models in accelerometer biometrics in a data stream context. Practical issues regarding the usage of accelerometer data are also discussed.
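One simple adaptation mechanism of the general kind discussed above is a sliding window: the user model is periodically retrained on the most recent samples accepted as genuine, so it follows gradual changes in behaviour. The one-class model, window policy and synthetic feature vectors below are illustrative assumptions, not the exact mechanisms evaluated in the paper.

```python
# Sliding-window adaptation sketch for a behavioural biometric user model.
from collections import deque
import numpy as np
from sklearn.svm import OneClassSVM

class AdaptiveUserModel:
    def __init__(self, window_size=200):
        self.window = deque(maxlen=window_size)   # most recent genuine samples
        self.model = None

    def fit_initial(self, X_enroll):
        self.window.extend(X_enroll)
        self._retrain()

    def _retrain(self):
        self.model = OneClassSVM(gamma="scale", nu=0.1).fit(np.array(self.window))

    def authenticate(self, x):
        accepted = self.model.predict(np.asarray(x).reshape(1, -1))[0] == 1
        if accepted:
            # adaptation step: accepted samples update the user model
            self.window.append(x)
            self._retrain()
        return accepted

# usage on synthetic "accelerometer feature" vectors
rng = np.random.default_rng(0)
model = AdaptiveUserModel()
model.fit_initial(rng.normal(0, 1, size=(100, 6)))
print(model.authenticate(rng.normal(0.1, 1, size=6)))
```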

2015 Brazilian Conference on Intelligent Systems (BRACIS), 2015
Biometric systems have been applied to improve the security of several computational systems. These systems analyse physiological or behavioural features obtained from users in order to perform authentication. Biometric features should ideally meet a number of requirements, including permanence. In biometrics, permanence means that the analysed biometric feature will not change over time. However, recent studies have shown that this is not the case for several biometric modalities. Adaptive biometric systems deal with this issue by adapting the user model over time. Some algorithms for adaptive biometrics have been investigated and compared in the literature. In machine learning, several studies show that the combination of individual techniques in ensembles may lead to more accurate and stable decision models. This paper investigates the use of ensemble approaches to combine the output of current adaptive algorithms for biometrics. The experiments are carried out on keystroke dynamics, a biometric modality known to be subject to change over time.
Lecture Notes in Computer Science, 2003
The complete identification of human genes involves determining the parts that generate proteins, named exons, and those that do not code for proteins, known as introns. The splice site identification problem is concerned with the recognition of the boundaries between these regions. This work investigates the use of Support Vector Machines (SVMs) in human splice site identification. Two methods employed for building multiclass SVMs, one-against-all and all-against-all, were compared. For this application, the all-against-all method obtained lower classification error rates. Ensembles of multiclass SVMs with Bagging were also evaluated. Contrary to expectations, the use of ensembles did not improve the performance obtained.
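The two multiclass decompositions compared above, plus a Bagging ensemble of multiclass SVMs, can be sketched directly in scikit-learn. The toy dataset stands in for the splice-site data used in the paper.

```python
# One-against-all vs. all-against-all (one-vs-one) SVMs, plus Bagging.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)            # placeholder for splice-site data

ova = OneVsRestClassifier(SVC(kernel="rbf"))     # one-against-all
ava = OneVsOneClassifier(SVC(kernel="rbf"))      # all-against-all
bag = BaggingClassifier(OneVsOneClassifier(SVC()), n_estimators=10, random_state=0)

for name, clf in [("one-against-all", ova), ("all-against-all", ava), ("bagging", bag)]:
    acc = cross_val_score(clf, X, y, cv=10).mean()
    print(f"{name}: accuracy = {acc:.3f}")
```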
Algoritmos genéticos em problemas de classificação
Manual de computação evolutiva e metaheurística, 2012

2014 Brazilian Conference on Intelligent Systems, 2014
Nowadays, many services are available from mobile devices, like smartphones. A growing number of people are using these devices to access bank accounts and social networks and to store personal information. However, common authentication mechanisms already present in these devices may not provide enough security. Recently, a new authentication method, named accelerometer biometrics, has been proposed. This method allows the identification of users using accelerometer data. Accelerometers, usually present in modern smartphones, are devices that measure acceleration forces. In accelerometer biometrics, a model is induced for the user of the smartphone. However, as a behavioral biometric technology, user models may become outdated over time. This paper investigates the use of adaptation mechanisms to update biometric user models induced from accelerometer data over time. The paper also proposes and evaluates a new adaptation mechanism with promising experimental results.

II. ACCELEROMETER BIOMETRICS. Accelerometer biometrics has the goal of recognizing users by their accelerometer data. The term accelerometer biometrics was used in a recent competition by Kaggle [4]; this work adopts the same term to define this technology. In the literature, accelerometer biometrics may also be referred to as cell phone-based biometrics [7]. This work focuses on accelerometer biometrics using data from mobile devices, like smartphones. One of the first studies to investigate the use of smartphone accelerometer data was [3], which considered users walking at three different speeds. Afterwards, [7] showed that other activities, like walking, jogging, and ascending and descending stairs, can be used to recognize users by their smartphone accelerometer data. In [8], the authors evaluated three classification algorithms to perform this task: Hidden Markov Model (HMM), Support Vector Machines (SVMs)
Anais do XX Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2023)
In Machine Learning (ML), selecting the most suitable algorithm for a problem is a challenge. Meta-Learning (MtL) offers an alternative approach by exploring the relationships between dataset characteristics and ML algorithm performance. To conduct an MtL study, it is necessary to create a meta-dataset comprising datasets of varying characteristics that challenge the ML algorithms at different levels. This study analyzes the information available for building such meta-datasets in the OpenML public repository, which provides a Python API for easy data importation. Assessing the content currently available on the platform, there is still no extensive meta-feature characterization available for all datasets, which limits their complete characterization.
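A small sketch of using the OpenML Python API mentioned above is given below: listing the available datasets and inspecting which dataset-level qualities (meta-features) are already stored for one of them. Exact column and quality names may differ across API versions; the dataset id is just an example.

```python
# Exploring OpenML content for meta-dataset construction.
import openml

# List datasets with their pre-computed qualities as a pandas DataFrame.
datasets = openml.datasets.list_datasets(output_format="dataframe")
print(datasets[["did", "name", "NumberOfInstances", "NumberOfFeatures"]].head())

# Download one dataset and check its available meta-feature characterization.
dataset = openml.datasets.get_dataset(61)    # e.g., the iris dataset
X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)
print(sorted(dataset.qualities.keys())[:10]) # qualities currently stored by OpenML
```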
Anais do XVIII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2021)
This editorial paper describes the Brazilian Competition on Knowledge Discovery in Databases (KDD-BR 2021) and summarizes the contributions of the three best solutions obtained in its fifth edition. The 2021 competition involved solving instances of the Traveling Salesman Problem of different sizes using an edge prediction approach.
IEEE Access
This work involved human subjects or animals in its research. The authors confirm that all human/animal subject research procedures and protocols are exempt from review board approval.