Papers by Claudia Antunes

Interest in discovering information hidden in large amounts of data has exploded in the last decade, bringing to light the need for efficient and effective tools to access all sources and kinds of data. At the same time, the need to secure and share valuable data led to the development of new technologies, like blockchain, that guarantee data integrity and transparency. Combining both is a natural demand, but several issues become clear, such as inefficient access and the need for data replication in common solutions. Indeed, the only existing approach emulates queries, mostly through Smart Contracts, and applies traditional machine learning algorithms over the resulting data, which is stored externally to allow multiple accesses. In this paper, we perform a systematic literature review that supports the above conclusions. We then discuss a new system architecture for the analysis of data stored in a blockchain, exploiting the scalability and high performance of data access in distributed file systems and the fast, up-to-date predictions of a streaming analysis approach.
Automatic Exploration of Domain Knowledge in Healthcare
Lecture Notes in Computer Science, 2022

The growing number of deployed data mining systems drives the interest in temporal data anomaly detection. From cyber-security or finance to heart-disease detection, unexpected data often carry critical information that must be analysed. Data anomalies have long been studied from a univariate perspective, where only one data dimension changes over time. Few works have been dedicated to multivariate anomaly detection. In this work we provide a comprehensive and structured analysis of the main definitions, state-of-the-art methods and approaches focusing on multivariate temporal data anomaly detection. Our research focuses on dealing with variable-length data series with millions of samples and multiple feature categories, either static or dynamic, real or categorical valued. We describe a case study in the maritime domain investigating the unusual spatio-temporal behaviour of commercial vessels, and experiment over two open datasets and one obtained from the MARISA H2020 Project.

Proceedings of the 11th International Conference on Enterprise Information, 2009
One of the main difficulties of pattern mining is to deal with items of different nature in the same itemset, which can occur in any domain except basket analysis. Indeed, if we consider the analysis of any transactional database composed of several entities and relationships, it is easy to understand that the equality function may be different for each element, which hinders the identification of frequent patterns. This situation is just one example of the need for using domain knowledge to manage the discovery process, but several others, no less important, can be enumerated, such as the need to consider patterns at higher levels of abstraction or the ability to deal with structured data. In this paper, we show how the Onto4AR framework can be exploited to overcome these situations in a natural way, illustrating its use in the analysis of two distinct case studies. In the first one, exploring a cinematographic dataset, we capture patterns that characterize kinds of movies according to the actors present in their casts and their roles. In the second one, identifying molecular fragments, we find structured patterns, including chains, rings and stars. Pattern mining is a subtask of mining association rules, a problem that was formulated in 1993 in the context of basket analysis. Formally, let $I=\{i_1, i_2, \ldots, i_m\}$ be a set of $m$ distinct literals, called items, and $X \subseteq I$ a subset of items, known as an itemset. Let $D$ be a set of transactions, i.e., itemsets transacted in the same conditions, under a unique …
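The formal setting above (items, itemsets, transactions) can be made concrete with a small sketch. It assumes the standard notion of support, the fraction of transactions containing an itemset, which the truncated abstract does not spell out; the function names and the toy data are illustrative only, and the brute-force enumeration is not the Onto4AR approach.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of frequent itemsets (toy data only)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


# Hypothetical basket-style database D: each transaction is an itemset.
D = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

frequent = frequent_itemsets(D, min_support=0.5)
```

Here the equality test between items is plain Python equality; the abstract's point is that in richer domains this test may differ per element, which is where domain knowledge comes in.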
The parallelization of mining algorithms under MapReduce (MR) has become a reality in recent years, but algorithms for training single decision trees, like ID3 [1] or C4.5 [2], remain unexplored. Decision trees continue to play an important role in data mining, mainly in cases where the model itself is used to understand or just validate existing domain knowledge. In this paper, we discuss the different issues to deal with when trying to parallelize ID3 under MR, and propose a new algorithm, MRID4, for training a single decision tree, based on ID3 but able to deal with numerical attributes, as in C4.5. Experimental results show that our algorithm can scale to very high values.
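For readers unfamiliar with the split criterion behind ID3, the following single-machine sketch computes entropy and information gain, plus a C4.5-style binary threshold for a numeric attribute. It is not the MRID4 algorithm and involves no MapReduce; it uses plain information gain rather than C4.5's gain ratio, and all function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """ID3 criterion: entropy reduction from splitting on a categorical attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_threshold(values, labels):
    """Binary split on a numeric attribute: try midpoints between
    consecutive distinct values and keep the one with the highest gain."""
    distinct = sorted(set(values))
    best = (None, -1.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        side = [v <= t for v in values]  # True = left branch
        gain = information_gain(side, labels)
        if gain > best[1]:
            best = (t, gain)
    return best
```

For example, `best_threshold([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"])` picks the midpoint 2.5, which separates the two classes perfectly.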
ESTHER: A Recommendation System for Higher Education Programs
Progress in Artificial Intelligence, 2021

Proceedings of the 2014 International C* Conference on Computer Science & Software Engineering - C3S2E '14, 2008
Temporal information has become one of the most important features when it comes to data analytics. The need to understand the dynamics and evolutionary behaviors of different domains has driven data analytics processes to make use of the temporal information associated with the data. Therefore, several approaches have been proposed in the field of Temporal Pattern Mining to use this temporal information to disclose temporal trends that could help in the decision-making process. However, there are still significant limitations regarding both the quality of the disclosed information and the efficiency of the processes. In this work we propose a new constraint-based sequential mining method, called ConstraintPrefixSpan, for mining three types of periodic regularities: Cyclic, Converging and Diverging. Our experiments on two different datasets show both the quality of the patterns found and the efficiency and flexibility of our algorithm in dealing with multiple types of regularities.
2009 Ninth International Conference on Intelligent Systems Design and Applications, 2009
Understanding the causes of failure is one of the bottlenecks in the educational process. Although failure prediction has been pursued, the models behind that prediction, most of the time, do not give deep insight into the causes of failure. In this paper, we introduce a new method for mining fault trees automatically, and show that these models are a valuable help in identifying direct and indirect causes of failure. An experimental study is presented in order to assess the drawbacks of the proposed method.

Lecture Notes in Computer Science, 2009
Pattern mining derives from the need to discover hidden knowledge in very large amounts of data, regardless of the form in which it is presented. Natural Language Processing (NLP), in turn, arose from the human need to be understood by computers. In this paper we present an exploratory approach that aims at bringing together the best of both worlds. Our goal is to discover patterns in linguistically processed texts, through the use of state-of-the-art NLP tools and traditional pattern mining algorithms. Articles from a Portuguese newspaper are the input of a series of tests described in this paper. First, they are processed by an NLP chain, which performs a deep linguistic analysis of the text; afterwards, the pattern mining algorithms Apriori and GenPrefixSpan are used. Results showed the applicability of pattern mining techniques to structured textual data, and also provided evidence about the structure of the language.

Lecture Notes in Computer Science, 2014
The need to study dynamic and evolutionary settings has made time a major dimension when it comes to data analytics. From business to health applications, being able to understand temporal patterns of customers or patients can determine the ability to adapt to future changes, optimize processes and support other decisions. In this context, different approaches to Temporal Pattern Mining have been proposed in order to capture different types of patterns able to represent evolutionary behaviors, such as regular or emerging patterns. However, these solutions still fall short on quality patterns with relevant information and on efficient mining methods. In this paper we propose a new efficient sequential mining algorithm, named PrefixSpan4Cycles, for mining cyclic sequential patterns. Our experiments show that our approach is able to mine these patterns efficiently when compared to other sequential pattern mining methods such as GenPrefixSpan and PrefixSpan. Also, for datasets with a significant number of regularities, our algorithm performs efficiently, even when dealing with significant constraints regarding the nature of cyclic patterns.
A careful analysis of educational data reveals their multidimensional nature, with several orthogonal dimensions from students to teachers, courses, evaluation items, topics, etc. In addition, their historical nature translates into large data warehouses, which are modeled through huge interconnected tables that encompass data from several distinct perspectives. Despite the recent advances in big data research for the educational domain, the ability to consider these very large multi-dimensional datasets remains unexplored. In this paper, we explore a multi-dimensional algorithm in order to find multi-dimensional patterns in education, which in turn are used to model student behaviors. Experimental results in a real case study show a significant improvement in the prediction of student results, when compared with the same classifiers trained without those patterns.

With the expansion of information systems and the increased interest in the education field, the quantity of data about education has exploded, along with a new field: Educational Data Mining (EDM). The focus of EDM is the development of methods for exploring the types of data that come from an educational context. Predicting students' performance has been approached by several techniques, but the combination of supervised and unsupervised techniques has appeared as a new tool for improving the results. In this dissertation, we studied the inclusion of an unsupervised technique, Biclustering, that has been successfully applied in areas such as gene expression and information retrieval, but not used in the educational context. We presented a methodology that allows us to use Biclustering algorithms on educational data to get new patterns and use these results as a complement to classification. In particular, using matrices with grades of graduate Computer Science students (LEIC) of Instituto Superior Técnico, we are able to anticipate the average grade of those students in the master's program (MEIC). By applying this new technique we can improve the accuracy of the classifiers, similarly to other techniques previously used, finding new types of patterns which until now had never been discovered.

After completing a bachelor's degree, many students strive to select the master's courses that are most likely to meet their interests. Although this decision may have a big impact on students' motivation and future achievements, usually no support is offered to address this problem. The use of recommendation systems to suggest items to users has well-known success in several domains, and some of the most successful techniques use Singular Value Decomposition (SVD) to capture hidden latent factors in reduced dimensionality and produce high-quality recommendations. In this paper, we propose to use SVD, with a contextual mapping to the educational paradigm, to capture relationships between course grades and recommend master's courses that are suitable to students' skills given their bachelor achievements. Our results show that using SVD to predict master's course marks has the potential to serve as a basis for producing recommendations.

Lecture Notes in Computer Science, 2012
The concept of Pattern Mining has received significant attention in Telecommunications Network Management Systems (NMS). A large volume of literature has been dedicated to this field, and valuable progress has been observed. Both sequential and structured pattern mining techniques have been considered in NMS. In particular, NMS logs (Performance and Alarm) pose several interesting issues for pattern mining. Pattern mining can help in various NMS activities such as alarm correlation, alarm associations, self-healing or pro-active fault management. In this review article, we present an overview of the different pattern mining techniques used in NMSs, compare them and finally select the best that can be beneficial to NMS for convergent networks such as Radio over Fiber (RoF). The pattern mining technique will be one of the basic steps needed to implement various data processing functionalities of an intelligent NMS (such as using sequential pattern mining to extract episode rules from network system alarms).
Lecture Notes in Computer Science
The problem of sequential pattern mining is one of several that have deserved particular attention in the general area of data mining. Despite the important developments in recent years, the best algorithm in the area (PrefixSpan) does not deal with gap constraints and consequently doesn't allow for the introduction of background knowledge into the process. In this paper we present the generalization of the PrefixSpan algorithm to deal with gap constraints, using a new method to generate projected databases. Studies on performance and scalability were conducted on synthetic and real-life datasets, and the respective results are presented.
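To illustrate what a gap constraint means, the sketch below checks whether a pattern occurs in a sequence when at most `max_gap` elements may be skipped between consecutive matched items, and counts how many sequences support it. This is only a naive checker under simplifying assumptions (single items rather than itemsets, hypothetical function names); it does not use projected databases and is not the generalized PrefixSpan algorithm described above.

```python
def occurs(pattern, sequence, max_gap):
    """True if `pattern` is a subsequence of `sequence` with at most
    `max_gap` skipped elements between consecutive matched items."""
    def match(p_idx, prev_pos):
        if p_idx == len(pattern):
            return True
        if prev_pos is None:
            # The first item may match anywhere in the sequence.
            candidates = range(len(sequence))
        else:
            # Later items must fall within the allowed gap window.
            end = min(len(sequence), prev_pos + max_gap + 2)
            candidates = range(prev_pos + 1, end)
        # Backtracking: try every admissible position, not just the first.
        return any(sequence[pos] == pattern[p_idx] and match(p_idx + 1, pos)
                   for pos in candidates)
    return match(0, None)


def support_count(pattern, database, max_gap):
    """Number of sequences in `database` containing `pattern` under the gap constraint."""
    return sum(1 for seq in database if occurs(pattern, seq, max_gap))


# Hypothetical toy database: each string is a sequence of single-character items.
db = ["abc", "acb", "aacab", "bca"]
```

With this data, the pattern `"ab"` is supported by two sequences when no gap is allowed and by three when one element may be skipped, which shows how the constraint prunes occurrences.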

2014 47th Hawaii International Conference on System Sciences, 2014
Modeling the dependencies among multiple temporal attributes derived from integrated healthcare databases represents an unprecedented opportunity to support medical and administrative decisions. However, existing predictive models are not yet able to successfully anticipate health conditions based on multiple (sparse) time sequences derived from repositories of health records. To tackle this problem, we propose new predictive models able to learn from an expressive temporal structure, a time-enriched itemset sequence, which captures both temporal and cross-attribute dependencies. Revised pattern-based models and hidden Markov models are proposed to address the properties of the target integrative temporal structures. The conducted experiments provide evidence for the utility and accuracy of the proposed predictive models in anticipating health conditions, such as the need for surgeries.

Lecture Notes in Computer Science, 2005
The main drawbacks of sequential pattern mining have been its lack of focus on user expectations and the high number of discovered patterns. However, the commonly accepted solution, the use of constraints, reduces the mining process to a verification of which of the specified patterns are frequent, instead of the discovery of unknown and unexpected patterns. In this paper, we propose a new methodology to mine sequential patterns, keeping the focus on user expectations, without compromising the discovery of unknown patterns. Our methodology is based on the use of constraint relaxations, and it consists in using them to filter accepted patterns during the mining process. We propose a hierarchy of relaxations, applied to constraints expressed as context-free languages, classifying the existing relaxations (legal, valid and naïve, previously proposed), and proposing several new classes of relaxations. The new classes range from the approx and non-accepted to the composition of different types of relaxations, like the approx-legal or the nonprefix-valid relaxations. Finally, we present a case study that shows the results achieved with the application of this methodology to the analysis of the curricular sequences of computer science students.

Proceedings of the 18th International Database Engineering & Applications Symposium on - IDEAS '14, 2014
One of the major problems in pattern mining is still pattern explosion, i.e., the large number of patterns produced by mining algorithms when analyzing a database with a predefined minimum support threshold. The approach we take to overcome this problem aims at automatically inferring variables from the patterns found, in order to generalize those patterns by representing them in a compact way. We introduce the novel concept of meta-patterns and present the RECAP algorithm. Meta-patterns can take several forms, and the sets of patterns can be grouped according to different criteria. These decisions come as a trade-off between expressiveness and compaction of the patterns. The proposed solution achieves good results on the tested dataset, reducing the number of patterns found to less than half.
In this paper we describe an application that helps in the evaluation and rehabilitation of children with low vision. It comprises three complementary subsystems: assessment, advisement, and rehabilitation. With the first two we model the children's impairments and needs, while with the last one we train visual perception. We have developed some tools to support information gathering and its evaluation, namely questionnaires, perception tests and exercises to train eye movements. The next step will be the integration and processing of data, in order to design the child's profile and specific needs. In addition, we will construct a first prototype of the advisement system, using expert-system development techniques. Finally, the rehabilitation exercises will be customised in accordance with the visual impairments previously detected.
Despite the efforts made in recent decades to center the process of knowledge discovery on the user, the balance between the discovery of unknown and interesting patterns is far from being reached. The discovery of association rules is a paradigmatic case, where this balance is quite difficult to establish. In this paper, we propose a new framework for pattern mining, Onto4AR. This framework is centered on the use of ontologies for the representation and introduction of domain knowledge into the mining process. By defining constraints based on an ontology, the framework provides a mining environment independent of the problem domain. With this simplification in the definition and use of constraints, the framework contributes to reducing the gap between discovered rules and user expectations.