Papers by Sara C. Madeira

Energies, Apr 18, 2022
Data are an important asset that the electric power industry has available today to support management decisions, excel in operational efficiency, and be more competitive. The advent of smart grids has increased power grid sensorization and so, too, the data availability. However, the inability to recognize the value of data beyond the siloed application in which data are collected is seen as a barrier. Power load time series are one of the most important types of data collected by utilities, because of the inherent information in them (e.g., power load time series capture human behavior, economic momentum, and other trends). The area of time series analysis in the energy domain is attracting considerable interest because of growing available data as more sensorization is deployed in power grids. This study considers the shapelet technique to create interpretable classifiers for four use cases. The study systematically applied the shapelet technique to data from different hierarchical power levels (national, primary power substations, and secondary power substations). The study has experimentally shown shapelets as a technique that embraces the interpretability and accuracy of the learning models, the ability to extract interpretable patterns and knowledge, and the ability to recognize and monetize the value of the data, important subjects to reinforce the importance of data-driven services within the energy sector.
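
To make the shapelet idea in the abstract above concrete, here is a minimal sketch of the usual building block: the distance from a series to a shapelet is the smallest distance between the shapelet and any window of the series with the same length. The function names, the load profiles, and the threshold are hypothetical illustrations, not taken from the paper, and the sketch omits the subsequence normalization that shapelet methods often apply.

```python
import math


def shapelet_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any equal-length window."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((w - s) ** 2 for w, s in zip(window, shapelet)))
        best = min(best, dist)
    return best


def classify(series, shapelet, threshold):
    """Label a series by whether it contains a close match to the shapelet."""
    return "match" if shapelet_distance(series, shapelet) <= threshold else "no match"


# Hypothetical load profiles (arbitrary units) and a hypothetical "peak" shapelet.
peak = [1.0, 3.0, 1.0]
profile_with_peak = [1.0, 1.0, 1.0, 3.0, 1.0, 1.0]
flat_profile = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

print(classify(profile_with_peak, peak, threshold=0.5))
print(classify(flat_profile, peak, threshold=0.5))
```

A classifier built this way stays interpretable because each decision points to a concrete subsequence of the load curve that can be inspected.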

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Sep 1, 2014
Identifying patterns in temporal data is key to uncovering meaningful relationships in diverse domains, from stock trading to social interactions. Also of great interest are clinical and biological applications, namely monitoring patient response to treatment or characterizing activity at the molecular level. In biology, researchers seek to gain insight into gene functions and dynamics of biological processes, as well as potential perturbations of these leading to disease, through the study of patterns emerging from gene expression time series. Clustering can group genes exhibiting similar expression profiles, but focuses on global patterns denoting rather broad, unspecific responses. Biclustering reveals local patterns, which more naturally capture the intricate collaboration between biological players, particularly under a temporal setting. Despite the general biclustering formulation being NP-hard, considering specific properties of time series has led to efficient solutions for the discovery of temporally aligned patterns. Notably, the identification of biclusters with time-lagged patterns, suggestive of transcriptional cascades, remains a challenge due to the combinatorial explosion of delayed occurrences. Herein, we propose LateBiclustering, a sensible heuristic algorithm enabling a polynomial rather than exponential time solution for the problem. We show that it identifies meaningful time-lagged biclusters relevant to the response of Saccharomyces cerevisiae to heat stress.

Lecture Notes in Computer Science, 2015
The discovery of dense biclusters in biological networks has received increasing attention in recent years. However, despite the importance of understanding the cell behavior, dense biclusters can only identify modules where genes, proteins or metabolites are strongly connected. These modules are thus often associated with trivial, already known interactions or background processes not necessarily related to the studied conditions. Furthermore, despite the availability of biclustering algorithms able to discover modules with more flexible coherency, their application over large-scale biological networks is hampered by efficiency bottlenecks. In this work, we propose BicNET (Biclustering NETworks), an algorithm to discover non-trivial yet coherent modules in weighted biological networks with heightened efficiency. First, we motivate the relevance of discovering network modules given by constant, symmetric and plaid biclustering models. Second, we propose a solution to discover these flexible modules without time and memory bottlenecks by seizing high efficiency gains from the inherent structural sparsity of networks. Results from the analysis of protein and gene interaction networks support the relevance and efficiency of BicNET.

Algorithms for Molecular Biology: AMB, 2016
Despite the recognized importance of module discovery in biological networks to enhance our understanding of complex biological systems, existing methods generally suffer from two major drawbacks. First, there is a focus on modules where biological entities are strongly connected, leading to the discovery of trivial/well-known modules and to the inaccurate exclusion of biological entities with subtler yet relevant roles. Second, there is a generalized intolerance towards different forms of noise, including uncertainty associated with less-studied biological entities (in the context of literature-driven networks) and experimental noise (in the context of data-driven networks). Although state-of-the-art biclustering algorithms are able to discover modules with varying coherency and robustness to noise, their application for the discovery of non-dense modules in biological networks has been poorly explored and it is further challenged by efficiency bottlenecks. This work proposes Biclu...

Lecture Notes in Computer Science, 2015
Biclustering has been largely applied for gene expression data analysis. In recent years, a clearer understanding of the synergies between pattern mining and biclustering gave rise to a new class of biclustering algorithms, referred to as pattern-based biclustering. These algorithms are able to discover exhaustive structures of biclusters with flexible coherency and quality. Background knowledge has also been increasingly applied for biological data analysis to guarantee relevant results. In this context, despite numerous contributions from domain-driven pattern mining, there is not yet a solid view on whether and how background knowledge can be applied to guide pattern-based biclustering tasks. In this work, we extend pattern-based biclustering algorithms to effectively seize efficiency gains in the presence of constraints. Furthermore, we illustrate how constraints with succinct, (anti-)monotone and convertible properties can be derived from knowledge repositories and user expectations. Experimental results show the importance of incorporating background knowledge within pattern-based biclustering to foster efficiency and guarantee non-trivial yet biologically relevant solutions.

Journal of Integrative Bioinformatics, Jan 15, 2011
The constant drive towards a more personalized medicine led to an increasing interest in temporal gene expression analyses. It is now broadly accepted that considering a temporal perspective represents a great advantage to better understand disease progression and treatment results at a molecular level. In this context, biclustering algorithms emerged as an important tool to discover local expression patterns in biomedical applications, and CCC-Biclustering arose as an efficient algorithm relying on the temporal nature of data to identify all maximal temporal patterns in gene expression time series. In this work, CCC-Biclustering was integrated in new biclustering-based classifiers for prognostic prediction. As a case study we analyzed multiple gene expression time series in order to classify the response of Multiple Sclerosis patients to the standard treatment with Interferon-β, to which nearly half of the patients reveal a negative response. In this scenario, using an effective predi...

Amyotrophic Lateral Sclerosis is a devastating neurodegenerative disease characterized by a usually fast progression of muscular denervation, generally leading to death in a few years from onset. In this context, any significant improvement of the patient's life expectancy and quality is of major relevance. Several studies have been made to address problems such as ALS diagnosis, and more recently, prognosis. However, these analyses have been mostly restricted to classical statistical approaches used to find the features most associated with a given outcome of interest. In this work we explore an innovative approach to the analysis of clinical data characterized by multivariate time series. We use a distance measure between patients as a reflection of their relationship to build a network of patients, which in turn can be studied from a modularity point of view in order to search for communities (groups of similar patients). Preliminary results show that it is possible to extract relevant information from such groups, each presenting a particular behavior for some of the features (patient characteristics) under analysis.

Until recently, knowledge discovery would be restricted to a static analysis, disregarding any temporal or sequential relations within the data. In the last decade, temporal data mining has become a hot topic of research, looking for those temporal dependencies and unveiling new insights in various areas of interest, including bioinformatics. Sequential pattern mining tries to achieve such goals by finding frequent patterns within a population and returning them to the user. However, to our knowledge, its application as a basis for a direct classification problem with clinical data has never been studied. Hence, this work uses discovered sequential patterns as features for standard classifiers, using a clinical dataset obtained from Amyotrophic Lateral Sclerosis (ALS) patients. The preliminary results are very promising, achieving a prediction accuracy over 83% with a very reduced set of features, both from original data and sequential patterns. Future work includes advancing from a classification problem to prognosis prediction.
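
The idea of using discovered sequential patterns as classifier features, described in the abstract above, can be sketched as follows. Each pattern becomes one binary feature that records whether the pattern occurs, in order and with gaps allowed, in a patient's event sequence. The event codes, patterns, and function names are hypothetical, and the sketch simplifies sequences to single items per time step; it does not reproduce the mining step or the classifiers used in the work.

```python
def contains_pattern(sequence, pattern):
    """True if the items of `pattern` occur in `sequence` in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator until the item is found,
    # so later items are only searched for after the earlier ones.
    return all(item in remaining for item in pattern)


def pattern_features(sequences, patterns):
    """One row per sequence, one binary column per pattern."""
    return [[1 if contains_pattern(seq, p) else 0 for p in patterns]
            for seq in sequences]


# Hypothetical patient event sequences and hypothetical mined patterns.
patients = [["A", "B", "C"], ["B", "A", "C"], ["A", "C"]]
patterns = [("A", "C"), ("A", "B"), ("B", "C")]

features = pattern_features(patients, patterns)
for row in features:
    print(row)
```

The resulting rows could be concatenated with features from the original data before being passed to any standard classifier.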

With the expansion of information systems and the increased interest in the education field, the quantity of data about education has exploded, along with a new field: Educational Data Mining (EDM). The focus of EDM is the development of methods for exploring the types of data that come from an educational context. Predicting students' performance has been approached by several techniques, but the combination of supervised and non-supervised techniques appeared as a new tool for improving the results. In this dissertation, we studied the inclusion of an unsupervised technique, Biclustering, that has been successfully applied in areas such as gene expression and information retrieval, but not used in the educational context. We presented a methodology that allows us to use Biclustering algorithms in educational data to get new patterns and use these results as a complement to the classification. In particular, using matrices with grades of graduate Computer Science students (LEIC) of Instituto Superior Técnico, we are able to anticipate the average grade of those students in the Master's program (MEIC). By applying this new technique we can improve the accuracy of the classifiers, similarly to other techniques previously used, finding new types of patterns which until now had never been discovered.

An increasing number of biomedical tasks, such as pattern-based biclustering, require the disclosure of the transactions (e.g. genes) that support each pattern (e.g. expression profiles). The discovery of patterns with their supporting transactions, referred to as full-pattern mining, has been solved by resorting to extensions over Apriori and vertical-based algorithms for frequent itemset mining. Although pattern-growth alternatives are known to be more efficient across multiple biological datasets, there are not yet adaptations for the efficient delivery of full-patterns. In this paper, we propose a pattern-growth algorithm able to discover full-patterns with heightened efficiency and minimum memory overhead. Results confirm that for dense datasets or low support thresholds, a common requirement in biomedical settings, this method can achieve significant performance improvements against its peers.

The research on sequential pattern mining has been driven by efficiency principles. However, efficiency is still a critical drawback for tasks that require the discovery of sequential patterns with medium-to-large length. Many of these tasks, such as pattern-based biclustering, rely on datasets with item-indexable properties. An item-indexable database, typically observed in order-preserving datasets across biological and customer-service domains, does not allow item repetitions per sequence. In this work, we propose a new sequential pattern mining method, called IndexSpan, which is able to mine sequential patterns over item-indexable databases with heightened efficiency in comparison with the existing alternatives. The superior performance of IndexSpan is demonstrated on both synthetic and real datasets, and its relevance for multiple applications is discussed.

Lecture Notes in Computer Science, 2008
This paper presents the logic programming concept of thread-based competitive or-parallelism, which combines the original idea of competitive or-parallelism with committed-choice nondeterminism and speculative threading. In thread-based competitive or-parallelism, an explicit disjunction of subgoals is interpreted as a set of concurrent alternatives, each running in its own thread. The individual subgoals usually correspond to predicates implementing different procedures that, depending on the problem specifics, are expected to either fail or succeed with different performance levels. The subgoals compete for providing an answer and the first successful subgoal leads to the termination of the remaining ones. We discuss the implementation of thread-based competitive or-parallelism in the context of Logtalk, an object-oriented logic programming language, and present experimental results.

International Journal of Data Mining and Bioinformatics, 2012
Transcription Factors (TFs) control transcription by binding to specific sites in the promoter regions of the target genes, which can be modelled by structured motifs. In this paper we propose AliBiMotif, a method combining sequence alignment and a biclustering approach based on efficient string matching techniques using suffix trees to unravel approximately conserved sets of blocks (structured motifs) while straightforwardly disregarding non-conserved stretches in-between. The ability to ignore the width of non-conserved regions is a major advantage of the proposed method over other motif finders, as the lengths of the binding sites are usually easier to estimate than the separating distances.

Proceedings of the 5th Asia-Pacific Bioinformatics Conference, 2007
Biclustering algorithms have emerged as an important tool for the discovery of local patterns in gene expression data. For the case where the expression data correspond to time series, efficient algorithms that work with a discretized version of the expression matrix are known. However, these algorithms assume that the biclusters to be found are perfect, in the sense that each gene in the bicluster exhibits exactly the same expression pattern along the conditions that belong to it. In this work, we propose an algorithm that identifies genes with similar, but not necessarily equal, expression patterns over a subset of the conditions. The results demonstrate that this approach identifies biclusters that are biologically more significant than those discovered by other algorithms in the literature.

Lecture Notes in Computer Science, 2009
We present a summary of a PhD thesis proposing efficient biclustering algorithms for time series gene expression data analysis, able to discover important aspects of gene regulation such as anticorrelation and time-lagged relationships, and a scoring method based on statistical significance and similarity measures. The ability of the proposed algorithms to efficiently identify sets of genes with statistically significant and biologically meaningful expression patterns is shown to be instrumental in the discovery of relevant biological phenomena, leading to more convincing evidence of specific transcriptional regulatory mechanisms.

Lecture Notes in Computer Science, 2014
An increasingly relevant set of tasks, such as the discovery of biclusters with order-preserving properties, can be mapped as a sequential pattern mining problem on data with item-indexable properties. An item-indexable database, typically observed in biomedical domains, does not allow item repetitions per sequence and is commonly dense. Although multiple methods have been proposed for the efficient discovery of sequential patterns, their performance rapidly degrades over item-indexable databases. The target tasks for these databases benefit from lengthy patterns and tolerate local mismatches. However, existing methods that consider noise relaxations to increase the typically short length of sequential patterns scale poorly, aggravating the already critical efficiency problem. In this work, we first propose a new sequential pattern mining method, IndexSpan, which is able to mine sequential patterns over item-indexable databases with heightened efficiency. Second, we propose a pattern-merging procedure, MergeIndexBic, to efficiently discover lengthy noise-tolerant sequential patterns. The superior performance of IndexSpan and MergeIndexBic against competitive alternatives is demonstrated on both synthetic and real datasets.

PLoS ONE, 2012
Disease gene prioritization aims to suggest potential implications of genes in disease susceptibility. Often accomplished in a guilt-by-association scheme, promising candidates are sorted according to their relatedness to known disease genes. Network-based methods have been successfully exploiting this concept by capturing the interaction of genes or proteins into a score. Nonetheless, most current approaches suffer from at least some of the following limitations: (1) networks comprise only curated physical interactions leading to poor genome coverage and density, and bias toward a particular source; (2) scores focus on adjacencies (direct links) or the most direct paths (shortest paths) within a constrained neighborhood around the disease genes, ignoring potentially informative indirect paths; (3) global clustering is widely applied to partition the network in an unsupervised manner, attributing little importance to prior knowledge; (4) confidence weights and their contribution to edge differentiation and ranking reliability are often disregarded. We hypothesize that network-based prioritization related to local clustering on graphs and considering full topology of weighted gene association networks integrating heterogeneous sources should overcome the above challenges. We term such a strategy Interactogeneous. We conducted cross-validation tests to assess the impact of network sources, alternative path inclusion and confidence weights on the prioritization of putative genes for 29 diseases. Heat diffusion ranking proved the best prioritization method overall, increasing the gap to neighborhood and shortest paths scores mostly on single source networks. Heterogeneous associations consistently delivered superior performance over single source data across the majority of methods. Results on the contribution of confidence weights were inconclusive.
Finally, the best Interactogeneous strategy, heat diffusion ranking and associations from the STRING database, was used to prioritize genes for Parkinson's disease. This method effectively recovered known genes and uncovered interesting candidates which could be linked to pathogenic mechanisms of the disease.
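
As a rough illustration of heat diffusion ranking on a weighted network, the sketch below spreads "heat" from known disease genes (seeds) over the graph by approximating exp(-tL) applied to the seed vector, where L = D - W is the graph Laplacian. The four-node network, the edge weights, the diffusion time t, and the truncated Taylor series are all assumptions made for illustration; the paper's exact formulation and parameters are not given in the abstract.

```python
def laplacian(weights):
    """Graph Laplacian L = D - W for a symmetric weight matrix with a zero diagonal."""
    n = len(weights)
    return [[(sum(weights[i]) if i == j else 0.0) - weights[i][j]
             for j in range(n)] for i in range(n)]


def heat_diffusion_scores(weights, seeds, t=0.5, terms=30):
    """Approximate exp(-t L) applied to `seeds` with a truncated Taylor series."""
    lap = laplacian(weights)
    n = len(seeds)
    term = list(seeds)    # k-th series term, starting with (-tL)^0 s / 0!
    scores = list(seeds)
    for k in range(1, terms):
        # Next term: (-t / k) * L applied to the previous term.
        term = [-t / k * sum(lap[i][j] * term[j] for j in range(n))
                for i in range(n)]
        scores = [s + x for s, x in zip(scores, term)]
    return scores


# Hypothetical chain of four genes, 0 - 1 - 2 - 3, with confidence-weighted edges.
W = [
    [0.0, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.8],
    [0.0, 0.0, 0.8, 0.0],
]
seeds = [1.0, 0.0, 0.0, 0.0]  # gene 0 plays the role of a known disease gene

scores = heat_diffusion_scores(W, seeds)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Unlike neighborhood or shortest-path scores, every path between a seed and a candidate contributes to the diffusion score, which is the property the abstract highlights. The fixed number of series terms is only adequate for small graphs and small t.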

Nucleic Acids Research, 2010
Babelomics is a response to the growing necessity of integrating and analyzing different types of genomic data in an environment that allows an easy functional interpretation of the results. Babelomics includes a complete suite of methods for the analysis of gene expression data, including normalization (covering most commercial platforms), pre-processing, differential gene expression (case-controls, multiclass, survival or continuous values), predictors and clustering, as well as methods for large-scale genotyping assays (case-controls and TDTs, with population stratification analysis and correction). All these genomic data analysis facilities are integrated and connected to multiple options for the functional interpretation of the experiments. Different methods of functional enrichment or gene set enrichment can be used to understand the functional basis of the experiment analyzed. Many sources of biological information, which include functional (GO, KEGG, Biocarta, Reactome, etc.), regulatory (Transfac, Jaspar, ORegAnno, miRNAs, etc.), text-mining or protein-protein interaction modules, can be used for this purpose. Finally, a tool for the de novo functional annotation of sequences has been included in the system. This provides support for the functional analysis of non-model species. Mirrors of Babelomics or command line execution of their individual components are now possible. Babelomics is available at .