An Entropy Approach for Choosing Gene Expression Cutoff
2022, bioRxiv (Cold Spring Harbor Laboratory)
https://doi.org/10.1101/2022.05.05.490711…
Abstract
Annotating cell types from single-cell transcriptome data usually requires binarizing the expression data, either to separate background noise from real expression or to separate low expression from high expression. A common approach is to pick a "reasonable" cutoff value, but it remains unclear how that value should be chosen. In this work, we describe a simple yet effective approach for finding this threshold.
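The abstract does not spell out the procedure, so the following is only a minimal sketch of one plausible entropy-based rule, not the authors' method. It assumes that reference labels (for example, known cell types) are available for a set of cells, and it picks the cutoff that maximizes the information gain of the binarized gene, in the style of a decision-tree split. The function names `shannon_entropy` and `best_cutoff` and the toy data are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def best_cutoff(expression, labels):
    """Return (cutoff, information_gain) for the best binarization.

    Assumption: candidate cutoffs are midpoints between consecutive
    distinct expression values, and a cell counts as "expressed" when
    its value is strictly greater than the cutoff.
    """
    n = len(expression)
    base = shannon_entropy(labels)
    values = sorted(set(expression))
    best = (None, 0.0)
    for lo, hi in zip(values, values[1:]):
        cutoff = (lo + hi) / 2
        below = [lab for x, lab in zip(expression, labels) if x <= cutoff]
        above = [lab for x, lab in zip(expression, labels) if x > cutoff]
        # Entropy of the labels after the split, weighted by group size.
        conditional = (len(below) / n) * shannon_entropy(below) \
            + (len(above) / n) * shannon_entropy(above)
        gain = base - conditional
        if gain > best[1]:
            best = (cutoff, gain)
    return best


# Hypothetical toy data: one gene measured in eight cells.
expr = [0.0, 0.0, 1.0, 1.0, 5.0, 6.0, 7.0, 9.0]
cell_type = ["other"] * 4 + ["T"] * 4

cutoff, gain = best_cutoff(expr, cell_type)
binary = [int(x > cutoff) for x in expr]  # 1 = expressed, 0 = not expressed
```

On this toy input the selected cutoff falls in the gap between the two groups of values, so the binarized gene separates the two labels perfectly. Real data is noisier, and an unsupervised variant would need a different criterion because it has no labels to compute the gain against.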
