This file contains the full set of gene–disease associations integrated from all sources of evidence in DISEASES v2.
Finding interesting association rules is an important and active research field in data mining. The algorithms of the Apriori family are based on two rule extraction measures, support and confidence. Although these two measures have the virtue of being algorithmically fast, they generate a prohibitive number of rules, most of which are redundant and irrelevant. It is therefore necessary to use further measures to filter out uninteresting rules. Many survey studies of interestingness measures have since been carried out from several points of view. Various studies have sought to identify "good" properties of rule extraction measures, and these properties have been assessed on 61 measures. The purpose of this paper is twofold: first, to extend the number of measures and properties studied and to formalize the properties proposed in the literature; second, in the light of this formal study, to categorize the studied measures. The paper thus identifies categories of measures to help users efficiently select one or more appropriate measures during the knowledge extraction process. Evaluating the properties on the 61 measures enabled us to identify 7 classes of measures, which we obtained using two different clustering techniques.
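As an illustration of the two measures named in the abstract above, the following minimal sketch computes support and confidence for a rule A → B over a toy transaction list. The function names and the toy data are hypothetical, and lift is included only as one commonly used additional interestingness measure; the abstract does not state which 61 measures the paper evaluates.

```python
from typing import Iterable, List, Set


def support(itemset: Iterable[str], transactions: List[Set[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Set[str]]) -> float:
    """Estimate of P(consequent | antecedent): support(A and B) / support(A)."""
    a = frozenset(antecedent)
    sup_a = support(a, transactions)
    if sup_a == 0:
        return 0.0  # rule never fires; treat confidence as 0 by convention
    return support(a | frozenset(consequent), transactions) / sup_a


def lift(antecedent: Iterable[str], consequent: Iterable[str],
         transactions: List[Set[str]]) -> float:
    """Confidence divided by support of the consequent (1.0 means independence)."""
    sup_b = support(consequent, transactions)
    if sup_b == 0:
        return 0.0
    return confidence(antecedent, consequent, transactions) / sup_b


# Toy data (hypothetical), used only to exercise the functions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support({"bread", "milk"}, transactions))       # 2 of 4 transactions
print(confidence({"bread"}, {"milk"}, transactions))  # 2 of the 3 bread transactions
print(lift({"bread"}, {"milk"}, transactions))        # below 1: slightly negative association
```

In this toy example the rule bread → milk has reasonable support and confidence, yet its lift is below 1, which shows why a measure beyond support and confidence can change the assessment of a rule.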
Behavioral study of interestingness measures of knowledge extraction
The search for interesting association rules is an important and active field in data mining. Because the algorithms used in knowledge discovery from data (KDD) tend to generate a large number of rules, it is difficult for users to select the genuinely interesting knowledge by themselves. To address this problem, automatic post-filtering of the rules is essential to reduce their number substantially. Hence the many interestingness measures proposed in the literature, among which the user is expected to choose the one best suited to their goals. Since interestingness depends on both the user's preferences and the data, the measures have been grouped into two categories: subjective (user-oriented) measures and objective (data-oriented) measures. We focus on the study of objective measures. Nevertheless, there is a plethora of objective measures...
This file contains the filtered non-redundant set of gene–disease associations obtained from automatic text mining in DISEASES v2.
This file contains the full set of gene–disease associations obtained from automatic text mining in DISEASES v2.
This file contains the filtered non-redundant set of gene–disease associations from the TIGA database of GWAS associations in DISEASES v2.
This file contains the full set of gene–disease associations from the TIGA database of GWAS associations in DISEASES v2.
This file contains the filtered non-redundant set of gene–disease associations from curated knowledge sources in DISEASES v2.
This file contains the full set of gene–disease associations from curated knowledge sources in DISEASES v2.
This file contains the human gene and disease names used for text mining in the DISEASES database v2.
In this paper, we introduce an approach for analyzing complex biological data obtained from metabolomic analytical platforms. Such platforms generate massive and complex data that need appropriate methods for discovering meaningful biological information. The datasets to analyze consist of a limited set of individuals and a large set of attributes (variables). In this study, we are interested in mining metabolomic data to identify predictive biomarkers of metabolic diseases, such as type 2 diabetes. Our experiments show that a combination of numerical methods, e.g. SVM, Random Forests (RF), and ANOVA, with a symbolic method such as FCA, can be successfully used for discovering the best combination of predictive features. Our results show that RF and ANOVA seem to be the best-suited methods for feature selection and discovery. We then use FCA for visualizing the markers in a suggestive and interpretable concept lattice. The outputs of our experiments consist of a short list of the 10...
In this paper, we introduce a hybrid approach for analyzing metabolomic data on type 2 diabetes. Identifying biomarkers that signal the disease is very important and can be guided by data mining methods. The data to be analyzed are massive and complex and are organized around a small set of individuals and a large set of variables (attributes). In this study, we based our experiments on a combination of efficient numerical supervised methods, namely Support Vector Machines (SVM), Random Forests (RF), and ANOVA, and a symbolic unsupervised method, namely Formal Concept Analysis (FCA). The data mining strategy is based on ten specific classification processes, which are organized around three main operations: filtering, feature selection, and post-processing. The numerical methods are mainly used in filtering and feature selection, while FCA is mainly used for visualization and interpretation purposes. The first results are encouraging and show that the pre...
The trajectory and underlying mechanisms of human health are determined by a complex interplay between intrinsic and extrinsic factors. Its evolution is a continuum of transitions, involving multifaceted processes at multiple levels, and there is an urgent need for integrative biomarkers that can characterize and predict health status evolution. The objective of the present study was to identify accurate and robust multidimensional markers, predictive of type 2 diabetes (T2D). A case-control approach was used within the French population-based cohort GAZEL (n~20,000) [1]. Overweight male subjects (n=112, 25≤BMI<30 kg/m², 52-64 y.o.), free of T2D at baseline, were selected. Cases were defined as having developed T2D at follow-up (5 years later) and were compared for several parameters (clinical, biochemical parameters, and food habits) with Controls matched for BMI, age, and sex. Baseline serum samples were analyzed using mass spectrometry-based untargeted metabolomics [2]. Data mi...
The evolution of human health is a continuum of transitions, involving multifaceted processes at multiple levels, and there is an urgent need for integrative biomarkers that can characterize and predict progression toward disease development. The objective of this work was to perform a systems metabolomics approach to predict metabolic syndrome (MetS) development. A case-control design was used within the French occupational GAZEL cohort (n = 112 males: discovery study; n = 94: replication/validation study). Our integrative strategy was to combine untargeted metabolomics with clinical, sociodemographic, and food habit parameters to describe early phenotypes and build multidimensional predictive models. Different models were built from the discriminant variables, and prediction performances were optimized either when reducing the number of metabolites used or when keeping the associated signature. We illustrated that a selected reduced metabolic profile was able to reveal subtle phen...
Machine Learning and Knowledge Discovery in Databases, 2016
The analysis of complex and massive biological data produced by metabolomic analytical platforms is a challenge of high importance. The analyzed datasets consist of a limited set of individuals and a large set of features, from which predictive biomarkers of clinical outcomes should be mined. Accordingly, in this paper, we propose a new hybrid knowledge discovery approach for discovering meaningful predictive biological patterns. This hybrid approach combines numerical classifiers such as SVM, Random Forests (RF) and ANOVA, with a symbolic method, namely Formal Concept Analysis (FCA). The related experiments show how we can discover some of the best potential predictive biomarkers of metabolic diseases thanks to specific combinations of classifiers mainly involving RF and ANOVA. The visualization of predictive biomarkers is based on heatmaps, while FCA is mainly used for visualization and interpretation purposes, complementing the computational power of numerical methods.
Papers by Dhouha Grissa