Motivation: Integration of heterogeneous data in life sciences is a growing and recognized challenge. The problem is not only how to enable the study of such data within the context of a biological question but also, more fundamentally, how to represent the available knowledge and make it accessible for mining. Results: Our integration approach is based on the premise that relationships between biological entities can be represented as a complex network. The context dependency is achieved by a judicious use of distance measures on these networks. For visualization, the biological entities and the distances between them are mapped into a lower-dimensional space using Sammon's mapping. The system implementation is based on a multi-tier architecture using a native XML database and a software tool for querying and visualizing complex biological networks. The functionality of our system is demonstrated with two examples: (1) a multiple pathway retrieval, in which, given a pathway name, the system finds all the relationships related to the query by checking available metabolic pathway, transcriptional, signaling, protein-protein interaction and ontology annotation resources, and (2) a protein neighborhood search, in which, given a protein name, the system finds all its connected entities within a specified depth. These two examples show that our system is able to conceptually traverse different databases to produce testable hypotheses and lead towards answers to complex biological questions. Contact: matej.oresic@vtt.fi
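The protein neighborhood search described above can be illustrated with a breadth-first traversal that stops at a given depth. This is only a minimal sketch, not the system's implementation: the adjacency-list representation, the function name `neighborhood`, and the entity names in the toy network are all assumptions made for illustration.

```python
from collections import deque


def neighborhood(graph, start, max_depth):
    """Return {entity: depth} for every entity within max_depth links of start.

    graph is a hypothetical adjacency list: {entity: [connected entities]}.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in graph.get(node, ()):
            if neighbor not in depths:  # first visit gives the shortest depth
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths


# Toy network with made-up entity names (not taken from any database).
toy_network = {
    "proteinA": ["proteinB", "geneX"],
    "proteinB": ["proteinA", "metaboliteM"],
    "geneX": ["proteinA"],
    "metaboliteM": ["proteinB", "pathwayP"],
    "pathwayP": ["metaboliteM"],
}
result = neighborhood(toy_network, "proteinA", 2)
```

With a depth of 2, the traversal reaches `metaboliteM` but not `pathwayP`, which lies three links away from the query protein.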
Biological phenomena are usually described by a relational model of interactions and dependencies between different entities. Therefore, a network-based representation of biological knowledge seems to be an obvious choice. In this paper, we propose such a representation for integrating data from heterogeneous life science data sources, including information extracted from biomedical literature. We show that such a representation enables explanatory analysis in a context-dependent manner. The context is enabled by a judicious assignment of weights on the quality dimensions. Analysis of clusters of nodes and links in the context of underlying biological questions may lead to the emergence of new concepts and understanding. Results are obtained with our megNet software, an integrative platform based on a multi-tier architecture using a native XML database.
Int. Conference on Artificial Neural Networks, 2006
In time series prediction, accuracy of predictions is often the primary goal. At the same time, however, it is desirable to be able to interpret the system under study. To this end, we have devised a fast input selection algorithm to choose a parsimonious, or sparse, set of input variables. The method is in the spirit of backward selection and is used in conjunction with a resampling procedure. In this paper, our strategy is to select a sparse set of inputs using linear models and then to use the selected inputs in nonlinear prediction based on multi-layer perceptron networks. We compare the prediction accuracy of our parsimonious nonlinear models with the linear models and the regularized nonlinear perceptron networks. Furthermore, we quantify the importance of the individual input variables in the nonlinear models using the partial derivatives. The experiments in a problem of electricity load prediction demonstrate that the fast input selection method yields accurate and parsimonious prediction models that give insight into the original problem.
Data mining algorithms such as the Apriori method for finding frequent sets in sparse binary data can be used for efficient computation of a large number of summaries from huge data sets. The collection of frequent sets gives a collection of marginal frequencies about the underlying data set. Sometimes, we would like to use a collection of
Asbestos is a pulmonary carcinogen known to give rise to DNA and chromosomal damage, but the exact carcinogenic mechanisms are still largely unknown. In this study, gene expression arrays were performed on lung tumor samples from 14 heavily asbestos-exposed and 14 non-exposed patients matched for other characteristics. Using a two-step statistical analysis, 47 genes were revealed that could differentiate the
We present a general framework for Self-Organizing Maps, which store probabilistic models in map units. We introduce the negative log probability of the data sample as the error function and motivate its use by showing its correspondence to the Kullback-Leibler distance between the unknown true distribution of data and our empirical models. We present a general winner
The Self-Organizing Map (SOM) is a powerful neural network for analysis and visualization of high-dimensional data. It maps nonlinear statistical relationships between high-dimensional input data into simple geometric relationships on a usually two-dimensional grid. The mapping roughly preserves the most important topological and metric relationships of the original data elements and, thus, inherently clusters the data. The need for efficient
Abstract: Assessment of model properties with respect to data is important for reliable analysis of data. After training, a Self-Organizing Map (SOM) can be assessed, for instance, with respect to its quantization or its topology preservation properties with one-number summaries. In this paper, we present a decomposition of the SOM distortion measure for measuring different aspects of the SOM for map units locally. The terms measure quantization quality, the
IEEE International Symposium on Circuits and Systems, 1996
In this paper, a neural network based analysis method for monitoring and modeling the dynamic behavior of complex industrial processes is considered. The method is based on the unsupervised learning property of the Self-Organizing Map (SOM) algorithm. The time series produced by several sensors measuring the process parameters as well as other process data are used in mapping the process behavior and dynamics into the network.
The Self-Organizing Map (SOM) is a powerful neural network method for analysis and visualization of high-dimensional data. It maps nonlinear statistical dependencies between high-dimensional measurement data into simple geometric relationships on a usually two-dimensional grid. The mapping roughly preserves the most important topological and metric relationships of the original data elements and, thus, inherently clusters the data. The need for
Data mining algorithms such as the Apriori method for finding frequent sets in sparse binary data can be used for efficient computation of a large number of summaries from huge data sets. The collection of frequent sets gives a collection of marginal frequencies about the underlying data set. Sometimes, we would like to use a collection of such marginal frequencies instead of the entire data set (e.g. when the original data is inaccessible for confidentiality reasons) to compute other interesting summaries. Using combinatorial arguments, we may obtain tight upper and lower bounds on the values of inferred summaries. In this paper, we consider a class of summaries wider than frequent sets, namely that of frequencies of arbitrary Boolean formulae. Given the frequencies of a number of arbitrary Boolean formulae, we consider the problem of finding tight bounds on the frequency of another arbitrary formula. We give a general formulation of the problem of bounding formula frequencies given some background information, and show how the bounds can be obtained by solving a linear programming problem. We illustrate the accuracy of the bounds by giving empirical results on real data sets.
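To make the idea of bounding a formula frequency concrete, consider the simplest special case: only the marginal frequencies of two items A and B are known, and the query is the conjunction A AND B. For this case the linear program has a well-known closed-form solution (the Fréchet bounds). The sketch below covers only that special case and is not the paper's general formulation; the function name is an assumption.

```python
def conjunction_bounds(freq_a, freq_b):
    """Tight bounds on the frequency of (A AND B) given only freq(A) and freq(B).

    Frequencies are fractions of rows in [0, 1]. These are the Frechet bounds,
    the closed-form optimum of the two-item linear program.
    """
    if not (0.0 <= freq_a <= 1.0 and 0.0 <= freq_b <= 1.0):
        raise ValueError("frequencies must lie in [0, 1]")
    lower = max(0.0, freq_a + freq_b - 1.0)  # rows forced to contain both items
    upper = min(freq_a, freq_b)              # cannot exceed the rarer item
    return lower, upper


bounds = conjunction_bounds(0.75, 0.5)
```

With frequencies 0.75 and 0.5, at least a quarter of the rows must contain both items and at most half can. For arbitrary formulae and richer background information no such closed form is available in general, which is where a linear programming solver comes in.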
Several specific cytogenetic changes are known to be associated with childhood acute lymphoblastic leukemia (ALL), and many of them are important prognostic factors for the disease. Little is known, however, about the changes in gene expression in ALL. Recently, the development of cDNA array technology has enabled the study of expression of hundreds to thousands of genes in a single experiment. We used the cDNA array method to study the gene expression profiles of 17 children with precursor-B ALL. Normal B cells from adenoids were used as reference material. We discuss the 25 genes that were most over-expressed compared to the reference. These included four genes that are normally expressed only in the myeloid lineages of the hematopoietic cells: RNASE2, GCSFR, PRTN3 and CLC. We also detected over-expression of S100A12, expressed in nerve cells but also in myeloid cells. In addition to the myeloid-specific genes, other over-expressed genes included AML1, LCP2 and FGF6. In conclusion...
Structural Health Monitoring in Wireless Sensor Networks by the Embedded Goertzel Algorithm
2011 IEEE/ACM Second International Conference on Cyber-Physical Systems, 2011
Structural health monitoring aims to provide an accurate diagnosis of the condition of civil infrastructure during its life-span by analyzing data collected by sensors. To this end, detection and localization of damage are fundamental tasks. This paper introduces a wireless sensor network for structural damage detection and localization in which the sensor nodes, in order to estimate the energies of
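The paper's title names the Goertzel algorithm, which evaluates the energy of a signal at a single frequency without computing a full FFT. The following is a minimal sketch of the textbook algorithm, not the authors' embedded implementation; the function name, sample rate, and test frequencies are assumptions chosen for illustration.

```python
import math


def goertzel_power(samples, sample_rate, target_freq):
    """Squared magnitude of the DFT bin closest to target_freq (textbook Goertzel)."""
    n = len(samples)
    k = round(n * target_freq / sample_rate)  # nearest DFT bin index
    omega = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(omega)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        # Second-order recursion: s[n] = x[n] + 2*cos(omega)*s[n-1] - s[n-2]
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


# Hypothetical test signal: a unit-amplitude 50 Hz sine sampled at 1 kHz.
SAMPLE_RATE = 1000.0
signal = [math.sin(2.0 * math.pi * 50.0 * i / SAMPLE_RATE) for i in range(200)]
power_at_50 = goertzel_power(signal, SAMPLE_RATE, 50.0)
power_at_120 = goertzel_power(signal, SAMPLE_RATE, 120.0)
```

The recursion needs only two state variables and one multiplication per sample, which is why it suits resource-constrained nodes that monitor a few frequencies. In this example nearly all the energy appears at 50 Hz and almost none at 120 Hz.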
Statistical models for environmental monitoring strongly rely on automatic data acquisition systems that use various physical sensors. Often, sensor readings are missing for extended periods of time, while model outputs need to be continuously available in real time. With a case study in solar-radiation nowcasting, we investigate how to deal with massively missing data (around 50% of the time some data are unavailable) in such situations. Our goal is to analyze characteristics of missing data and recommend a strategy for deploying regression models that are robust when data are massively missing. We seek a single model that performs well at all times, with and without data gaps. Due to the need to provide instantaneous outputs with minimum energy consumption for computing in the data streaming setting, we dismiss computationally demanding data imputation methods and resort to a mean replacement, accompanied by a robust regression model. We use an established strategy for assessing different regression models and for determining how many missing sensor readings can be tolerated before model outputs become obsolete. We experimentally analyze the accuracies and robustness to missing data of seven linear regression models. We recommend using regularized PCA regression together with our established guideline for training regression models that are themselves robust to missing data.
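The mean-replacement step mentioned above can be sketched as follows: each missing sensor reading is replaced by that sensor's mean over the training data before a linear model is applied. This is a minimal sketch under stated assumptions, not the paper's pipeline. The function names, the use of `None` for a missing reading, and the fixed weights are hypothetical, and a plain linear model stands in for the regularized PCA regression the abstract recommends.

```python
def column_means(rows):
    """Per-sensor means over training rows, ignoring missing (None) readings.

    Assumes every sensor has at least one observed value in the training rows.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return means


def predict(weights, bias, means, reading):
    """Linear prediction with missing readings replaced by the training means."""
    filled = [m if x is None else x for x, m in zip(reading, means)]
    return bias + sum(w * v for w, v in zip(weights, filled))


# Made-up training readings from two sensors, with gaps.
training_rows = [[1.0, 4.0], [3.0, None], [None, 8.0]]
means = column_means(training_rows)

# Hypothetical, already-fitted model parameters.
output = predict([0.5, 0.25], 1.0, means, [None, 8.0])
```

Because the replacement is a table lookup, the model can still produce an output immediately when a sensor drops out, which matches the streaming and low-energy constraints described in the abstract.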
Proceedings of the 4th International Conference on PErvasive Technologies Related to Assistive Environments - PETRA '11, 2011
Time series representations are not always rich enough to describe the temporal activity, for instance, when the context and the relations of the observed elements are of interest. Sequences of temporal intervals use such intervals as primitives in their representation, and allow focusing on the temporal relations of these elements. This is a useful representation of data across many domains. Searching, indexing, and mining such sequences is essential for domain experts in order to discover useful information from them. In this paper, we formulate the problem of comparing sequences of temporal intervals and propose a novel distance measure. We discuss the properties of the measure and study its robustness in the domain of sign language. Experiments on real data show that the measure is robust in terms of retrieval accuracy even for high levels of artificially introduced distortion.
Proceedings of the 4th International Conference on PErvasive Technologies Related to Assistive Environments - PETRA '11, 2011
An important theoretical topic in assistive environments is reasoning about temporal patterns that represent the sequential output of various sensors, and that can give us information about the health and activities of humans and the state of the environment. The recent growth in the quantity and quality of sensors for assistive environments has made it possible to create large databases
Papers by Jaakko Hollmén