Traveling on discrete embeddings of gene expression
2016, Artificial Intelligence in Medicine
https://doi.org/10.1016/J.ARTMED.2016.05.002
Abstract
Objective: High-throughput technologies have generated an unprecedented amount of high-dimensional gene expression data. Algorithmic approaches can help distill this information and derive compact, interpretable representations of the statistical patterns present in the data. This paper proposes a mining approach that extracts an informative representation of gene expression profiles based on a generative model called the Counting Grid (CG). Method: Using the CG model, gene expression values are arranged on a discrete grid, learned so that "similar" co-expression patterns lie in close proximity, resulting in an intuitive visualization of the dataset. The model also makes it possible to identify the genes that distinguish between classes (e.g., different types of cancer). Finally, each sample can be characterized by a discriminative signature, extracted from the model, that can be effectively employed for classification. Results: A thorough evaluation on several gene expression datasets demonstrates the suitability of the proposed approach from two perspectives: numerically, we reached state-of-the-art classification accuracies on 5 of 7 datasets, and similar results when the approach was tested in a gene selection setting (with stability always above 0.87); clinically, many of the genes highlighted by the model as significant were confirmed to also play a key role in cancer biology. Conclusion: The proposed framework can be successfully exploited to meaningfully visualize the samples, detect medically relevant genes, and properly classify samples.
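The grid arrangement described in the Method can be illustrated with a minimal sketch. It assumes the general counting-grid formulation from the literature, not this paper's implementation: each grid cell holds a distribution over genes, a sample is explained by the average of the cells inside a small window, and the sample is placed where that averaged distribution fits its expression counts best. The function names, grid size, window size, and random parameters below are hypothetical, and learning the grid itself (e.g., by EM) is not shown.

```python
import numpy as np

def window_histograms(pi, win):
    """Average the per-cell distributions pi[i, j, :] over a win x win window
    anchored at each cell, wrapping around the grid borders (torus)."""
    h = np.zeros_like(pi)
    for di in range(win):
        for dj in range(win):
            # A negative shift brings cell (i + di, j + dj) to position (i, j).
            h += np.roll(pi, shift=(-di, -dj), axis=(0, 1))
    return h / (win * win)

def locate_sample(counts, pi, win):
    """Posterior over grid positions for one sample of non-negative counts,
    assuming a uniform prior over positions (an assumption of this sketch)."""
    h = window_histograms(pi, win)
    log_lik = (counts * np.log(h)).sum(axis=-1)  # shape (rows, cols)
    log_lik -= log_lik.max()                     # stabilize before exponentiating
    post = np.exp(log_lik)
    return post / post.sum()

# Illustrative use with random (not learned) grid parameters.
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(6), size=(5, 5))      # 5 x 5 grid, 6 hypothetical genes
sample = rng.integers(0, 20, size=6)             # made-up expression counts
posterior = locate_sample(sample, pi, win=2)
row, col = np.unravel_index(posterior.argmax(), posterior.shape)
```

Because neighbouring windows share most of their cells, samples with similar expression profiles tend to be mapped to nearby grid positions, which is what makes the grid usable as a visualization.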