Alignment-free Research Papers

Alignment-Free Sequence Comparison (I): Statistics and Power

2024, Journal of Computational Biology

Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D2 statistic, relies on the comparison of the k-tuple content for both sequences. Although it has...
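
As a rough illustration of the k-tuple comparison described above, the following Python sketch computes the basic D2 statistic as the sum, over all k-tuples, of the product of their counts in the two sequences. The function names `kmer_counts` and `d2` are illustrative and are not taken from the paper.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-tuple (k-mer) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """Basic D2: sum over k-tuples w of count_a(w) * count_b(w)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for k-tuples missing from seq_b.
    return sum(n * counts_b[w] for w, n in counts_a.items())
```

Because only k-tuple counts are needed, the cost grows linearly with sequence length, which is what makes the statistic attractive for large-scale comparison.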

Efficient Influenza A Virus Origin Detection

by Karen Schlauch

2024

This research describes a novel, alignment-free method of genomic sequence comparisons based on absent nucleotide words and expression levels. Testing this method on Influenza A virus isolates, three classifications are presented which...
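
To make the notion of absent nucleotide words concrete, here is a minimal Python sketch that lists every word of length k over the DNA alphabet that never occurs in a sequence. It illustrates the concept only; the name `absent_words` and the brute-force enumeration are assumptions, not the method used in the paper.

```python
from itertools import product


def absent_words(seq: str, k: int, alphabet: str = "ACGT") -> list[str]:
    """Return all length-k words over the alphabet that do not occur in seq."""
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    all_words = ("".join(letters) for letters in product(alphabet, repeat=k))
    return [word for word in all_words if word not in present]
```

Enumerating all 4^k words is only practical for small k; it is used here for clarity.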

Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements

by Diogo Pratas

2024, GigaScience

Background: The development of high-throughput sequencing technologies and, as its result, the production of huge volumes of genomic data, has accelerated biological and medical research and discovery. Study on genomic rearrangements is...

Figure 1: Similarities between synthetic sequences with different sizes, detected by Smash++. The parameters used are k-mer size = 14 and number of substitutions in substitution-tolerant Markov model (STMM) = 5, which are the default parameters used by Smash++. For the threshold, the default values of 1.5 and 1.97 are used for panels a-d and e, respectively. (a) 1.5 kb sequences; (b) 100 kb sequences. No similarity is detected for Part II of the reference because it is mutated 90%. Parts III and IV of the reference and I and II of the target are joined because there is no space between consecutive regions. (c) 5 Mb sequences; (d) 100 Mb sequences; (e) 60 kb sequences. Roughly 43% of mutation is detected.

Figure 2: Similarities in a real dataset, detected by Smash++. (a) G. gallus (chicken) chr. 18 and M. gallopavo (turkey) chr. 20. The parameters were k-mer size = 14, No. substitutions in STMM = 5, threshold = 1.9, and minimum block size (m) = 500,000; i.e., regions smaller than 500,000 bp were not considered for further processing; (b) G. gallus chr. 14 and M. gallopavo chr. 16. The result is obtained by setting k = 14, No. substitutions = 5, threshold = 1.95, and m = 400,000; (c) H. sapiens (human) chr. 12 and P. troglodytes (chimpanzee) chr. 12. The parameters were k = 14, without using STMM, threshold = 1.9, and m = 100,000; (d) X. oryzae pv. oryzae PXO99A (a rice pathogen) and X. oryzae pv. oryzae MAFF 311018 (a rice pathogen). The result was obtained by setting k = 13, threshold = 1.55, and m = 10,000.

Figure 4: (a) The peak memory consumption, in gigabytes; and (b) the elapsed (wall clock) time usage, in minutes, of Smash++ obtained by running on all synthetic and real datasets described in Table 1.

Figure 5: Comparison of Smash++ and Smash, in terms of (a) memory usage; and (b) time usage, running on real and synthetic data described in Table 1. To have a fair comparison, only 1 model (FCM) is used by Smash++, and self-complexity is not computed. Diamonds indicate the mean, and bars the ranges from minimum to maximum values.

Figure 6: The schema of Smash++. The process of finding similar regions in reference and target sequences and computing the redundancy in each region includes 8 stages. Smash++ outputs a x.pos file that includes the positions of the similar regions, which can then be visualized, resulting in an SVG image.

Figure 7: Data model used by Smash++. (a) Cooperation between finite-context models (FCMs) and substitution-tolerant Markov models (STMMs). Note that each STMM needs to be associated with an FCM. (b) Probability of an input symbol is estimated by using the probability and weight values that have been obtained from processing previous symbols.

Figure 8: The data structures used by Smash++ to store the models in memory. (a) Table of 64-bit counters that uses up to 128 MB of memory, (b) table of 32-bit counters that consumes at most 960 MB of memory, (c) table of 8-bit approximate counters with memory usage of up to 1 GB, and (d) Count-Min-Log sketch of 4-bit counters which consumes up to (1/2) w × d B of memory; e.g., if w = 2^30 and d = 4, it uses 2 GB of memory.
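
The memory figure for the sketch in panel (d) follows from simple arithmetic: w × d counters of 4 bits each occupy w × d / 2 bytes, so w = 2^30 and d = 4 give 2 GB. The Python sketch below checks that arithmetic and shows a plain Count-Min sketch for orientation. It is not the Count-Min-Log variant used by Smash++, it stores ordinary Python integers rather than 4-bit counters, and the names `sketch_bytes` and `CountMinSketch` and the SHA-256-based hashing are assumptions made for illustration.

```python
import hashlib


def sketch_bytes(w: int, d: int, bits_per_counter: int = 4) -> int:
    """Memory in bytes of a w-by-d table of fixed-width counters."""
    return w * d * bits_per_counter // 8


class CountMinSketch:
    """Plain Count-Min sketch: d rows of w counters, one hash per row."""

    def __init__(self, w: int, d: int) -> None:
        self.w, self.d = w, d
        self.rows = [[0] * w for _ in range(d)]

    def _index(self, row: int, item: str) -> int:
        # Derive a per-row hash by mixing the row number into the key.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def update(self, item: str) -> None:
        for r in range(self.d):
            self.rows[r][self._index(r, item)] += 1

    def query(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.rows[r][self._index(r, item)] for r in range(self.d))
```

A query never underestimates the true count; shrinking w saves memory at the cost of more collisions and therefore larger overestimates.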

Figure 9: Approximate counting update and query.

Figure 11: Finding similar regions in reference and target sequences. Smash++ first finds the regions in the target that are similar to the reference and then finds the regions in the reference that are similar to the detected target regions. This procedure is performed for both regular and inverted homologies.

Table 1: Synthetic and real datasets used in the experiments. The real datasets can be downloaded from NCBI via the accession numbers (access.) provided in the descriptions.

Figure 10: Count-Min-Log Sketch update and query.

The Kolmogorov complexity is not computable; hence, an alternative is required to compute it approximately. It has been shown in the literature that a compression algorithm can be used for this purpose [44-46]. In this article, we use a reference-free compressor to approximate the complexity and, conse-
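
The idea of approximating complexity with a compressor can be sketched in a few lines of Python. The example below uses the general-purpose `zlib` compressor from the standard library as a stand-in; it is not the reference-free compressor used in the article, and the function name `approx_complexity_bits` is an assumption for illustration.

```python
import random
import zlib


def approx_complexity_bits(seq: str) -> int:
    """Compressed size in bits, used as a computable proxy for complexity."""
    return 8 * len(zlib.compress(seq.encode("ascii"), 9))


# A highly repetitive sequence versus a pseudo-random one of equal length.
repetitive = "A" * 1000
rng = random.Random(0)  # fixed seed so the example is reproducible
shuffled = "".join(rng.choices("ACGT", k=1000))

print(approx_complexity_bits(repetitive), approx_complexity_bits(shuffled))
```

The repetitive sequence compresses to far fewer bits than the pseudo-random one, which is the behaviour a complexity estimate should show. A compressor only gives an upper-bound style approximation, and a DNA-specific model would typically give tighter estimates than `zlib`.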

Efficient alignment-free software applications for next generation sequencing-based molecular epidemiology

by Hector Espitia

2023

CHAPTER 4. Application of the STing algorithm to public health and environmental genomics
4.1 Applying STing to public health: Shiga toxin-producing Escherichia coli (E. coli) virulence profiling
4.1.1 Materials and methods
4.1.2 Results and discussion
4.2 Applying STing to environmental genomics: nifH gene-based taxonomic assignment of amplicon sequencing samples
4.2.1 Materials and methods
4.2.2 Results and discussion
CHAPTER 5. Conclusions and future prospects
PUBLICATIONS
APPENDIX A. SUPPLEMENTARY DATA FOR CHAPTER 2
A.1 Pseudocode for database indexing
A.2 Pseudocode for sequence typing
APPENDIX B. SUPPLEMENTARY DATA FOR CHAPTER 3
B.1 WebSTing data dictionary
APPENDIX C. SUPPLEMENTARY DATA FOR CHAPTER 4

SUMMARY

Public health agencies increasingly couple next generation sequencing (NGS) based characterization of microbial genomes with bioinformatics analysis methods for molecular epidemiology. The overhead associated with the bioinformatics methods used for this purpose, in terms of both the required human expertise and computational resources, represents a critical bottleneck that limits the potential impact of microbial genomics on public health. This is particularly true for local public health agency laboratories, which are typically staffed with microbiologists who may not have substantial bioinformatics expertise or ready access to high-performance computational resources. There is a pressing need for bioinformatics solutions to genome-enabled molecular epidemiology that are accurate, easy to use, fast, and computationally efficient. The development of an alignment-free algorithm for NGS data analysis and its implementation into turn-key software applications tailored explicitly for genome-enabled molecular epidemiology and environmental microbial genomics is the focus of my research. I explored a computational strategy based on k-mer frequencies to distinguish among sequences of interest in NGS read samples.
By combining this strategy with the enhanced suffix array (ESA), an efficient data structure, I developed a base algorithm for the rapid analysis of unprocessed NGS reads. I further adapted and implemented this algorithm into a suite of software applications for sequence typing, gene detection, and gene-based taxonomic read classification. My thesis research focused on three specific aims: (1) development of an alignment- and assembly-free algorithm and software solution for NGS-based molecular epidemiology, (2) development of an alignment- and assembly-free, fully automated Web-platform for the comprehensive characterization of bacterial isolates using whole genome sequencing (WGS) data, and (3) expanding the applicability of the alignment-free algorithm to different problems.

Sequence typing and gene detection are essential for pathogen characterization in genome-enabled approaches to molecular epidemiology. To this end, I developed an assembly- and alignment-free algorithm, STing, which I implemented into two turn-key software utilities for sequence typing and gene detection. Benchmarking and validation analyses showed that STing is an ultrafast and accurate solution for genome-enabled molecular epidemiology, which performs better than existing bioinformatics methods for sequence typing and gene detection.

Limited access to bioinformatics-related infrastructure and expertise impedes the successful adoption of genome-enabled approaches to molecular epidemiology in public health. To overcome this challenge, I developed WebSTing, a Web-platform that uses the STing algorithm to supply easy access to the accurate and rapid alignment-free automated characterization of WGS samples of bacterial isolates.

To demonstrate the utility of STing in problems beyond simple sequence typing and gene detection, I applied the alignment-free algorithm to two different areas: (1) public health, with the virulence gene profiling of Shiga toxin-producing Escherichia coli (STEC) isolates, and (2) environmental microbial genomics, with the nifH gene-based taxonomic classification of amplicon sequencing reads. I showed that STing performs better than the gold-standard method for STEC isolate characterization and that it correctly classifies amplicon sequencing reads on simulated communities of nitrogen-fixing organisms.

Research advance 1: A novel k-mer frequencies approach combined with the ESA data structure was used to develop an assembly- and alignment-free algorithm, STing, for rapid analysis of unprocessed NGS reads. The STing algorithm was implemented into two software applications for genome-enabled molecular epidemiology: (1) the STing typer for sequence typing, and (2) the STing detector for gene detection. The STing typer utility was compared to six widely used programs for genome-enabled sequence typing, using the traditional multilocus sequence typing (MLST) scheme and two larger typing schemes, ribosomal MLST (rMLST) and core genome MLST (cgMLST). Comparison results showed that STing outperformed the other applications in terms of accuracy and efficiency (runtime and RAM) using the MLST and rMLST schemes and was second with the cgMLST scheme. Most importantly, STing was the only application able to perform the typing analysis using all three of the typing schemes assessed, while also showing the ability to scale successfully to genome-enabled typing schemes like cgMLST. The detector utility was used to evaluate the ability of STing to detect two epidemiologically relevant types of markers: antimicrobial resistance (AMR) genes and virulence factor (VF) genes.
Results showed that STing had 100% accuracy in detecting AMR and VF genes on 71 WGS samples generated from 17 bacterial species of high priority in clinical microbiology research.

Research advance 2: A Web-based platform, WebSTing, was developed to provide fully automated NGS-based characterization of bacterial pathogens. The main goal of WebSTing is to provide easy access to genome-enabled approaches to molecular epidemiology in public health laboratories that have limitations in bioinformatics-related infrastructure and expertise. WebSTing uses the STing algorithm and supplies assembly- and alignment-free sequence typing, gene detection, and phylogenetic analysis of WGS samples of bacterial isolates.

Research advance 3: The applicability of the STing algorithm was expanded to solve problems in two different areas: (1) public health, and (2) environmental microbial genomics. For the public health application, STing was used for virulence gene profiling of STEC isolates from WGS samples. Here, STing was compared to the PCR method, the current gold-standard technique used by public health laboratories for STEC characterization. Results showed that STing is more accurate than PCR for characterizing virulence genes in STEC samples. STing showed between 98% and 100% accuracy in characterizing the four genes used as markers for STEC determination (stx1, stx2, eae, and ehxA), compared to a PCR accuracy between 90% and 94%. Most importantly, and unlike the PCR technique, STing was able to detect novel genes in the analyzed STEC isolates. In the environmental microbial application, the STing algorithm was extended for nifH gene-based taxonomic classification of amplicon sequencing reads. The algorithm was implemented into the STing classifier utility. Using a nifH gene reference database, the STing classifier program was able to classify reads correctly in samples of nitrogen-fixing organism communities, simulated down to the lowest sequencing depth of 1x coverage.
Importantly, results showed that full-length reference sequences of the gene nifH led to better taxonomic classification of the sequencing reads.
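
The general idea of distinguishing candidate sequences by k-mer matches against unassembled reads can be sketched as follows. This is a deliberately simplified Python illustration, not the STing algorithm: it uses a dictionary index in place of an enhanced suffix array, and the names `kmers` and `best_allele` as well as the simple hit-count scoring are assumptions.

```python
from collections import Counter


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def best_allele(reads: list[str], alleles: dict[str, str], k: int):
    """Return the allele name whose k-mers are hit most often by the reads."""
    # Index: k-mer -> names of the alleles that contain it (stand-in for an ESA).
    index: dict[str, set[str]] = {}
    for name, seq in alleles.items():
        for kmer in set(kmers(seq, k)):
            index.setdefault(kmer, set()).add(name)

    # Scan raw reads and tally k-mer hits per allele; no alignment or assembly.
    hits: Counter = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            for name in index.get(kmer, ()):
                hits[name] += 1
    return hits.most_common(1)[0][0] if hits else None
```

For example, `best_allele(["ACGTAC", "GTACGT"], {"a1": "ACGTACGTAA", "a2": "TTTTGGGGCC"}, 4)` selects `"a1"`, because every k-mer in the reads occurs in that allele and none occurs in the other.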

Influenza A Virus (H3N2) Genomic Sequence Difference Measures Based on Word Absence and Expression Levels

by Karen Schlauch

2023

In a genomic sequence, the oligonucleotide signature represents the ratio of the observed to expected number of occurrences of all possible nucleotide words of a specific length. Word absence is also found in genomic sequences whereby...
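
The observed-to-expected ratio can be illustrated with a short Python sketch. The abstract does not say how the expected count is obtained, so the example assumes a zero-order model in which nucleotides occur independently at their observed frequencies; that model and the name `oligo_signature` are assumptions for illustration only.

```python
from collections import Counter
from itertools import product
from math import prod


def oligo_signature(seq: str, k: int, alphabet: str = "ACGT") -> dict[str, float]:
    """Observed/expected ratio for every length-k word (zero-order model)."""
    n_words = len(seq) - k + 1
    base_freq = {b: seq.count(b) / len(seq) for b in alphabet}
    observed = Counter(seq[i:i + k] for i in range(n_words))

    signature = {}
    for letters in product(alphabet, repeat=k):
        word = "".join(letters)
        # Expected count if bases were independent: positions * product of base frequencies.
        expected = n_words * prod(base_freq[b] for b in word)
        signature[word] = observed[word] / expected if expected > 0 else 0.0
    return signature
```

Words with a ratio of zero are exactly the absent words of that length, which links the signature to the word-absence measure mentioned in the abstract.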

Alignment-Free Measures for Whole-Genome Comparison

by Davide Verzotto

2016, Pattern Recognition in Computational Molecular Biology: Techniques and Approaches

With the progress of modern sequencing technologies a large number of complete genomes are now available. Traditionally the comparison of two related genomes is carried out by sequence alignment. There are cases where these techniques...
