Clustering genomic signatures

Erik Norlander

2018

Abstract

Pathogens such as bacteria and viruses are leading causes of disease worldwide, which makes it essential to identify them in DNA samples. Rather than working on raw DNA sequences, mathematical models based on variable length Markov chains (VLMCs), known as genomic signatures, make it possible to classify DNA samples faster than traditional alignment-based methods. To analyse a set of genomic signatures, we use clustering, an unsupervised machine-learning method. Clustering VLMCs requires an accurate and fast similarity measure (distance function). To evaluate distance functions and clusters, we define metrics based primarily on the taxonomic ranks of the underlying organisms. For the distance functions, we primarily analyse whether VLMCs within the same taxonomic rank are closest to each other. For the cluster analysis, we use the silhouette metric to determine how well separated the clusters are and define the average percentages, sensitivity, and spec...
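The two evaluation ideas named in the abstract can be illustrated with a minimal Python sketch. It assumes a precomputed, symmetric distance matrix between signatures; the function names, the toy distances, and the family labels below are hypothetical and chosen only for illustration, and the VLMC distance function itself is not shown. The first function measures how often an item's nearest neighbour shares its taxonomic label; the second computes the standard mean silhouette value.

```python
from typing import Hashable, Sequence


def nearest_neighbour_agreement(dist: Sequence[Sequence[float]],
                                labels: Sequence[Hashable]) -> float:
    """Fraction of items whose nearest other item carries the same label."""
    n = len(labels)
    hits = 0
    for i in range(n):
        nearest = min((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        hits += labels[nearest] == labels[i]
    return hits / n


def mean_silhouette(dist: Sequence[Sequence[float]],
                    clusters: Sequence[Hashable]) -> float:
    """Mean silhouette value; items in singleton clusters contribute 0."""
    n = len(clusters)
    total = 0.0
    for i in range(n):
        sums, counts = {}, {}
        for j in range(n):
            if j == i:
                continue
            c = clusters[j]
            sums[c] = sums.get(c, 0.0) + dist[i][j]
            counts[c] = counts.get(c, 0) + 1
        own = clusters[i]
        others = [sums[c] / counts[c] for c in counts if c != own]
        if own not in counts or not others:
            continue  # singleton cluster, or only one cluster in total
        a = sums[own] / counts[own]   # mean distance within own cluster
        b = min(others)               # mean distance to nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


# Toy distances (made up): items 0-1 and items 2-3 form two tight groups.
toy_dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]
toy_families = ["Herpesviridae", "Herpesviridae", "Siphoviridae", "Siphoviridae"]
toy_clusters = [0, 0, 1, 1]

print(nearest_neighbour_agreement(toy_dist, toy_families))
print(mean_silhouette(toy_dist, toy_clusters))
```

A silhouette value close to 1 indicates compact, well-separated clusters, while values near 0 indicate overlapping clusters.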

Cluster size; families (count); hosts (counts not given)

(size not given); Coronaviridae 716, Myoviridae 11; Homo sapiens, Camelus dromedarius, Sus scrofa, Gallus gallus
426; Adenoviridae 244, Herpesviridae 172, Alloherpesviridae 7, Phycodnaviridae 3; Homo sapiens, Not Found, Gallus gallus, Equus caballus
299; Filoviridae 196, Siphoviridae 48, Myoviridae 31, Retroviridae 8; Homo sapiens, Epomops franqueti, Myonycteris torquata, Not Found
177; Siphoviridae 128, Myoviridae 45, Podoviridae 4; Mycobacterium smegmatis str. MC2 155, Not Found, Mycobacterium, Mycobacterium smegmatis
142; Myoviridae 142; Synechococcus sp. WH 7803, Synechococcus
101; Poxviridae 85, Mimiviridae 6, Lipothrixviridae 5, Iridoviridae 2; Homo sapiens, Mus musculus, Bos taurus, Acinonyx jubatus
89; Herpesviridae 69, Phycodnaviridae 5, Adenoviridae 3, Myoviridae 3; Felidae, Not Found, Homo sapiens, Chlorella variabilis
81; Siphoviridae 51, Adenoviridae 23, Herpesviridae 6, Filoviridae 1; Not Found, Odocoileus hemionus columbianus, Pygoscelis antarcticus, Mus musculus
80; Herpesviridae 80; Homo sapiens, Bos taurus, Macaca fascicularis, Macaca leonina
77; Baculoviridae 71, Ascoviridae 4, Polydnaviridae 2; Not Found, Lepidoptera, Erinnyis ello, Trichoplusia ni

Table D.2: The ten largest clusters from the extended data set, with the distribution of families and hosts within the clusters. Only the four largest groups within each cluster are displayed. In most clusters there is primarily a single family, but the correlation with respect to host is not as clear.
