An Entropy Approach for Choosing Gene Expression Cutoff

Hy Vuong; Tung Nguyen; Huy Nguyen; Thao Truong; Son Pham

doi:10.1101/2022.05.05.490711

Huy Nguyễn

2022, bioRxiv (Cold Spring Harbor Laboratory)

https://doi.org/10.1101/2022.05.05.490711

Abstract

Annotating cell types using single-cell transcriptome data usually requires binarizing the expression data, either to distinguish background noise from real expression or to distinguish low expression from high expression. A common approach is to choose a "reasonable" cutoff value, but it remains unclear how to choose it. In this work, we describe a simple yet effective approach for finding this threshold value.
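The abstract does not spell out the procedure, so the following is only a minimal sketch of one way an entropy-based cutoff could be chosen. It assumes that each cell already carries a group label (for example, a cluster or a provisional cell type) and that the cutoff is picked to maximize information gain, as in decision-tree induction. The function names `shannon_entropy` and `best_cutoff`, the candidate-cutoff scheme (midpoints between consecutive distinct values), and the toy data are all illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cutoff(expression, labels):
    """Return (cutoff, information_gain) for one gene.

    Assumption: candidate cutoffs are midpoints between consecutive
    distinct expression values, and a cell counts as "expressed" when
    its value is above the cutoff. Returns (None, 0.0) if no split
    improves on the unsplit label entropy.
    """
    pairs = sorted(zip(expression, labels), key=lambda p: p[0])
    values = [v for v, _ in pairs]
    sorted_labels = [lab for _, lab in pairs]
    n = len(pairs)
    base = shannon_entropy(sorted_labels)

    best = (None, 0.0)
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue  # no cutoff can separate identical values
        low, high = sorted_labels[:i], sorted_labels[i:]
        # Label entropy remaining after binarizing at this cutoff.
        remaining = (len(low) / n) * shannon_entropy(low) \
            + (len(high) / n) * shannon_entropy(high)
        gain = base - remaining
        if gain > best[1]:
            best = ((values[i - 1] + values[i]) / 2, gain)
    return best


# Toy example (made-up values): three non-expressing and three expressing cells.
expr = [0.0, 0.1, 0.2, 2.5, 3.0, 3.2]
cell_labels = ["other"] * 3 + ["T cell"] * 3
cutoff, gain = best_cutoff(expr, cell_labels)
print(cutoff, gain)
```

In this toy case the chosen cutoff falls between 0.2 and 2.5, where binarization separates the two groups completely. On real data the gain would be smaller, and the choice of labels strongly affects the result.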

Adrien Six

PLOS ONE

Background: Identifying differentially expressed genes between experimental conditions is still the gold-standard approach to interpret transcriptomic profiles. Alternative approaches based on diversity measures have been proposed to complement the interpretation of such datasets but are only used marginally. Methods: Here, we reinvestigated diversity measures, which are commonly used in ecology, to characterize mouse pregnancy microenvironments based on a public transcriptome dataset. Mainly, we evaluated the Tsallis entropy function to explore the potential of a collection of diversity measures for capturing relevant molecular event information. Results: We demonstrate that the Tsallis entropy function provides additional information compared to the traditional diversity indices, such as the Shannon and Simpson indices. Depending on the relative importance given to the most abundant transcripts based on the Tsallis entropy function parameter, our approach allows appreciating the impac...
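To make the role of the Tsallis parameter concrete, here is a small sketch using the standard definition S_q = (1 - Σ p_i^q) / (q - 1), which tends to the Shannon entropy (natural log) as q approaches 1 and equals the Gini-Simpson index at q = 2. The function name `tsallis_entropy` and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def tsallis_entropy(counts, q):
    """Tsallis entropy of a vector of non-negative abundances.

    q = 1 is handled as the limiting case (Shannon entropy, in nats);
    q = 2 gives the Gini-Simpson index 1 - sum(p_i ** 2).
    Larger q gives more weight to the most abundant transcripts.
    """
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    if math.isclose(q, 1.0):
        return -sum(p * math.log(p) for p in probs)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


# Made-up transcript counts: one dominant transcript and three rare ones.
counts = [97, 1, 1, 1]
for q in (0.0, 1.0, 2.0):
    print(q, tsallis_entropy(counts, q))
```

At q = 0 the value depends only on how many transcripts are present (richness minus one), while at q = 2 the dominant transcript drives the result, which is the sensitivity to abundant transcripts that the abstract describes.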

Entropy as a measure of variability and stemness in single-cell transcriptomics

Nicolas B Garnier

Current Opinion in Systems Biology, 2021

Multi-Algorithmic Approaches to Gene Expression Binarization

Jaime Seguel

2015

A basic problem in the construction of network representations of gene interactions is deciding whether a gene is or is not expressed at a time instant. This problem, referred to here as the gene expression decision problem, has been approached with statistical and numerical algorithms. Numerical methods are based on different intuitions about what signals a gene expression threshold and, as a consequence, they often return different answers. Consequently, the choice of a particular gene expression decision algorithm influences the gene interaction model. This article proposes an aggregation methodology for numerical gene expression decision algorithms that is based on voting. The result is thus the expression decision made by the majority of the algorithms, provided that the decision is consistent with an underlying logical law referred to as the doctrine. The proposed method is compared with some non-voting aggregation algorithms.

Clustering gene expression by dynamics: A maximum entropy approach

Luis Diambra

Physica A: Statistical Mechanics and its Applications, 2008

Arrays allow simultaneous measurements of the expression levels of thousands of mRNAs. By mining this data one can identify sets of genes with similar profiles. We show that information-theoretic methods are capable of modeling and assessing dissimilarities between the dynamics underlying gene expression time series. Using a maximum entropy-based method for building models, we built a distance between two gene expression profiles, which takes into account the dynamic features of the expression. The proposed distance measure can be implemented over a wide variety of clustering algorithms, enhancing their usefulness.

Class discovery in gene expression data

Amir Ben-dor

Proceedings of the fifth annual international conference on Computational biology - RECOMB '01, 2001

Recent studies (Alizadeh et al.; Bittner et al.; Golub et al.) demonstrate the discovery of putative disease subtypes from gene expression data. The underlying computational problem is to partition the set of sample tissues into statistically meaningful classes. In this paper we present a novel approach to class discovery and develop automatic analysis methods. Our approach is based on statistically scoring candidate partitions according to the overabundance of genes that separate the different classes. Indeed, in biological datasets, an overabundance of genes separating known classes is typically observed. We measure overabundance against a stochastic null model. This allows for highlighting subtle, yet meaningful, partitions that are supported on a small subset of the genes.

Statistical Distribution as a Way for Lower Gene Expressions Threshold Cutoff

Alessandro Giuliani

2018

While in mathematics (and in logic) the basic divide is between 'true' and 'false', in experimental science the frontier is between 'relevant' and 'irrelevant', and this is a much trickier border. The classical way to track this frontier builds upon inferential statistics (signal analysis is a synonym more popular among engineers) and is based on the definition of what we mean by 'randomness' in a given situation. Here we comment on the setting of the threshold between 'informative' and 'random' territories in the case of gene expression data, where the definition of randomness is not only a 'statistical' but also a 'biological' affair.

A Combinatorial Approach to Clustering Gene Expression Data Extended Abstract

Louise Showe

2004

We present a new algorithm for discovering natural partitions of a set of samples based on their gene expression patterns found with microarray experiments. The algorithm uses a bicriteria combinatorial optimization search to simultaneously identify an interesting set of genes and a partition of the array samples. Each gene in the gene set should respect the sample partition in the sense that if the gene's values are colored according to the partition class they come from, then the values, when sorted, should have a minimal number of color changes. We refer to this as the full color criterion. It measures how well a particular gene sorts the various partition classes. The other is the black and white criterion, where we color the values of one of the partition classes black and the remaining values white and again count the number of color changes. For each gene, we choose the partition class to color black that minimizes this count. This criterion measures how well a gene distinguishes one sample class from the remaining samples. Using a branch-and-bound algorithm we are able to find both the optimal gene set and the sample partitioning that has the fewest total number of black and white and full color changes on this gene set. Additionally, we can calculate the likelihood of observing a particular outcome in a random data set, thus permitting the calculation of a "p-value" to interpret the significance of the results. The algorithm can be run in a completely unsupervised way, or a user can constrain the search to require that a particular group of samples be in the same partition (e.g. controls) or that groups of samples belong to different partitions. We have tested the algorithm on a 30-sample Cutaneous T-cell Lymphoma data set; it was able to almost perfectly discriminate short-term survivors from long-term survivors and normal controls.
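The two scoring criteria described in this abstract can be illustrated with a short sketch. It covers only the per-gene scores for a fixed partition, not the branch-and-bound search or the p-value calculation. The function names and the toy data are illustrative assumptions, and ties between equal expression values are broken by sample order, a detail the abstract does not specify.

```python
def color_changes(values, colors):
    """Count adjacent color changes after sorting samples by value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    seq = [colors[i] for i in order]
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)


def full_color_score(values, classes):
    """Full color criterion: each partition class is its own color."""
    return color_changes(values, classes)


def black_white_score(values, classes):
    """Black and white criterion: color one class black and the rest
    white, then keep the class that gives the fewest color changes."""
    return min(
        color_changes(values, [c == black for c in classes])
        for black in set(classes)
    )


# Toy gene (made-up values) measured in seven samples from three classes.
values = [0.1, 0.2, 0.3, 1.0, 1.1, 2.0, 2.1]
classes = ["A", "A", "A", "B", "B", "C", "C"]
print(full_color_score(values, classes))   # 2
print(black_white_score(values, classes))  # 1
```

A gene that sorts k classes perfectly has k - 1 full color changes, and a gene that isolates one class at either end of the sorted values has a black and white score of 1.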

The Gene Expression Deconvolution Interactive Tool (GEDIT): accurate cell type quantification from gene expression data

Misha Khan

GigaScience

Background: The cell type composition of heterogeneous tissue samples can be a critical variable in both clinical and laboratory settings. However, current experimental methods of cell type quantification (e.g., cell flow cytometry) are costly, time consuming and have potential to introduce bias. Computational approaches that use expression data to infer cell type abundance offer an alternative solution. While these methods have gained popularity, most fail to produce accurate predictions for the full range of platforms currently used by researchers or for the wide variety of tissue types often studied. Results: We present the Gene Expression Deconvolution Interactive Tool (GEDIT), a flexible tool that utilizes gene expression data to accurately predict cell type abundances. Using both simulated and experimental data, we extensively evaluate the performance of GEDIT and demonstrate that it returns robust results under a wide variety of conditions. These conditions include multiple pla...

Accurate estimation of cell composition in bulk expression through robust integration of single-cell information

Marcus Alvarez

Nature Communications

We present Bisque, a tool for estimating cell type proportions in bulk expression. Bisque implements a regression-based approach that utilizes single-cell RNA-seq (scRNA-seq) or single-nucleus RNA-seq (snRNA-seq) data to generate a reference expression profile and learn gene-specific bulk expression transformations to robustly decompose RNA-seq data. These transformations significantly improve decomposition performance compared to existing methods when there is significant technical variation in the generation of the reference profile and observed bulk expression. Importantly, compared to existing methods, our approach is extremely efficient, making it suitable for the analysis of large genomic datasets that are becoming ubiquitous. When applied to subcutaneous adipose and dorsolateral prefrontal cortex expression datasets with both bulk RNA-seq and snRNA-seq data, Bisque replicates previously reported associations between cell type proportions and measured phenotypes across abundan...

An entropy-based gene selection method for cancer classification using microarray data

Arun Krishnan

BMC bioinformatics, 2005

Accurate diagnosis of cancer subtypes remains a challenging problem. Building classifiers based on gene expression data is a promising approach; yet the selection of non-redundant but relevant genes is difficult. The selected gene set should be small enough to allow diagnosis even in regular clinical laboratories and ideally identify genes involved in cancer-specific regulatory pathways. Here an entropy-based method is proposed that selects genes related to the different cancer classes while at the same time reducing the redundancy among the genes. The present study identifies a subset of features by maximizing the relevance and minimizing the redundancy of the selected genes. A merit called normalized mutual information is employed to measure the relevance and the redundancy of the genes. In order to find a more representative subset of features, an iterative procedure is adopted that incorporates an initial clustering followed by data partitioning and the application of the algori...

Andrew Teschendorff

The ability to quantify differentiation potential of single cells is a task of critical importance for single-cell studies. So far however, there is no robust general molecular correlate of differentiation potential at the single cell level. Here we show that differentiation potency of a single cell can be approximated by computing the signaling promiscuity, or entropy, of a cell’s transcriptomic profile in the context of a cellular interaction network, without the need for model training or feature selection. We validate signaling entropy in over 7,000 single cell RNA-Seq profiles, representing all main differentiation stages, including time-course data. We develop a novel algorithm called Single Cell Entropy (SCENT), which correctly identifies known cell subpopulations of varying potency, enabling reconstruction of cell-lineage trajectories. By comparing bulk to single cell data, SCENT reveals that expression heterogeneity within single cell populations is regulated, pointing towa...

sc-REnF: An entropy guided robust feature selection for single-cell RNA-seq data

Snehalika Lall

Briefings in Bioinformatics, 2021

Annotation of cells in single-cell clustering requires a homogeneous grouping of cell populations. Since single-cell data are susceptible to technical noise, the quality of genes selected prior to clustering is of crucial importance in the preliminary steps of downstream analysis. Therefore, robust gene selection has gained considerable attention in recent years. We introduce sc-REnF [robust entropy based feature (gene) selection method], aiming to leverage the advantages of Rényi and Tsallis entropies in gene selection for single cell clustering. Experiments demonstrate that with a tuned parameter (q), Rényi and Tsallis entropies select genes that improve the clustering results significantly over the other competing methods. sc-REnF can capture relevancy and redundancy among the features of noisy data extremely well due to its robust objective function. Moreover, the selected features/genes are able to determine the unknown cells with a hig...

sc-REnF:An entropy guided robust feature selection for clustering of single-cell rna-seq data

Snehalika Lall

2020

Many single-cell typing methods require pure clustering of cells, which is susceptible to technical noise and heavily dependent on high-quality informative genes selected in the preliminary steps of downstream analysis. Techniques for gene selection in single-cell RNA sequencing (scRNA-seq) data are seemingly simple, which causes problems with respect to the resolution of (sub-)type detection and marker selection, and ultimately affects cell annotation. We introduce sc-REnF, a novel and robust entropy based feature (gene) selection method, which leverages the landmark advantage of ‘Renyi’ and ‘Tsallis’ entropy achieved in their original application, in single cell clustering. Thereby, gene selection is robust and less sensitive to the technical noise present in the data, producing a pure clustering of cells, beyond classifying independent and unknown samples with utmost accuracy. The corresponding software is available at: https://github.com/Snehalikalall/sc...

Semantics and Accuracy of Gene Expression Threshold Computations. A Case Study

Jaime Seguel

2013

The precise inner workings of cellular mechanisms remain largely unknown and, therefore, their modeling is usually based on conjectures. The availability of large amounts of genetic data, and the lack of abstract mathematical models, make computer algorithms the only tool available for searching for these hypothetical realities. We call the conjectured algorithm-independent reality that underlies the method design and intention the semantics of the algorithm. This article is a brief semantics analysis exercise performed with four binary quantization algorithms for time series of gene expression data. Keywords: binary quantization; gene expression; quantitative semantics

Uncovering Fine Structure in Gene Expression Profile by Maximum Entropy Modeling of cDNA Microarray Images and Kernel Density Methods

George Nikiforidis

Handbook of Research on Systems Biology Applications in Medicine, 2009

Pre-processing for noise detection in gene expression classification data

Ana Carolina Lorena

Journal of the Brazilian Computer Society, 2009

Due to the imprecise nature of biological experiments, biological data is often characterized by the presence of redundant and noisy data. This may be due to errors that occurred during data collection, such as contamination of laboratory samples. This is the case for gene expression data, where the equipment and tools currently used frequently produce noisy biological data. Machine Learning algorithms have been successfully used in gene expression data analysis. Although many Machine Learning algorithms can deal with noise, detecting and removing noisy instances from the training data set can help the induction of the target hypothesis. This paper evaluates the use of distance-based pre-processing techniques for noise detection in gene expression data classification problems. This evaluation analyzes the effectiveness of the techniques investigated in removing noisy data, measured by the accuracy obtained by different Machine Learning classifiers over the pre-processed data.

Identifying splits with clear separation: a new class discovery method for gene expression data

Anja von Heydebreck

Bioinformatics, 2001

We present a new class discovery method for microarray gene expression data. Based on a collection of gene expression profiles from different tissue samples, the method searches for binary class distinctions in the set of samples that show clear separation in the expression levels of specific subsets of genes. Several mutually independent class distinctions may be found, which is difficult to obtain from most commonly used clustering algorithms. Each class distinction can be biologically interpreted in terms of its supporting genes. The mathematical characterization of the favored class distinctions is based on statistical concepts. By analyzing three data sets from cancer gene expression studies, we demonstrate that our method is able to detect biologically relevant structures, for example cancer subtypes, in an unsupervised fashion. Contact: heydebre@molgen.mpg.de

A signal-to-noise classification model for identification of differentially expressed genes from gene expression data

Barnali Sahu

2011

A major focus in cancer research is identifying genetic markers or biomarkers. To build a robust classifier, we have to find the differentially expressed genes (key genes) in binary classification. Selecting the differentially expressed genes, or biomarker genes, is the preprocessing task for cancer classification. In this paper, we compare the results of two approaches for selecting biomarkers from a Leukemia data set. The first approach to feature selection applies k-means clustering and the signal-to-noise ratio (SNR) method for gene ranking; the top-scored genes from each cluster are selected and given to the classifiers. The second approach uses signal-to-noise ratio ranking only for feature selection. To validate both approaches, we used k nearest neighbor (kNN), support vector machine (SVM), probabilistic Neural Network (PNN) and Feed Forward Neural Network (FNN) classifiers. With the first approach we obtained 100%, 96% and 96% accuracy with SVM, kNN and PNN respectively using five genes, whereas the performance of FNN is 2.17 with 10 genes. With the second approach we obtained 96%, 96% and 62% accuracy for SVM, kNN and PNN respectively with five genes, and the performance of FNN is 2.52 for 10 genes.

A Classification-Based Machine Learning Approach for the Analysis of Genome-Wide Expression Data

Soumyaroop Bhattacharya, James Lyons-Weiler

Genome Research, 2003

Three important areas of data analysis for global gene expression analysis are class discovery, class prediction, and finding dysregulated genes (biomarkers). The clinical application of microarray data will require marker genes whose expression patterns are sufficiently well understood to allow accurate predictions on disease subclass membership. Commonly used methods of analysis include hierarchical clustering algorithms, t-, F-, and Z-tests, and machine learning approaches. We describe an approach called the maximum difference subset (MDSS) algorithm that combines classification algorithms, classical statistics, and elements of machine learning and provides a coherent framework. By integrating prediction accuracy, the MDSS algorithm learns the critical threshold of statistical significance (the α or P-value), eliminating the arbitrariness of setting a threshold of statistical significance and minimizing the effect of the normality assumptions. To reduce the false positive rate and to increase external validity of the predictive gene set, a jackknife step is used. This step identifies and removes genes in the initial MDSS with low combined predictive utility. The overall MDSS provides a prediction that is less dependent on an arbitrary study design (sample inclusion or exclusion) and should thus have high external validity. We demonstrate that this approach, unlike other published methods, identifies biomarkers capable of predicting the outcome of anthracycline-cytarabine chemotherapy in cases of acute myeloid leukemia. By incorporating two criteria (statistical significance and predictive utility), the approach learns the significance level relevant for a given data set. The MDSS approach can be used with any test and classifier operator pair.

An Integrated Framework for Fuzzy Classification and Analysis of Gene Expression Data

Hadi Khabbaz

New Concepts and Developments
