CHAPTER 4. Application of the STing algorithm to public health and environmental genomics
4.1 Applying STing to public health: Shiga toxin-producing Escherichia coli (E. coli) virulence profiling
4.1.1 Materials and methods
4.1.2 Results and discussion
4.2 Applying STing to environmental genomics: nifH gene-based taxonomic assignment of amplicon sequencing samples
4.2.1 Materials and methods
4.2.2 Results and discussion
CHAPTER 5. Conclusions and future prospects
PUBLICATIONS
APPENDIX A. SUPPLEMENTARY DATA FOR CHAPTER 2
A.1 Pseudocode for database indexing
A.2 Pseudocode for sequence typing
APPENDIX B. SUPPLEMENTARY DATA FOR CHAPTER 3
B.1 WebSTing data dictionary
APPENDIX C. SUPPLEMENTARY DATA FOR CHAPTER 4

SUMMARY

Public health agencies increasingly couple next-generation sequencing (NGS) based characterization of microbial genomes with bioinformatics analysis methods for molecular epidemiology. The overhead associated with the bioinformatics methods used for this purpose, in terms of both the required human expertise and computational resources, represents a critical bottleneck that limits the potential impact of microbial genomics on public health. This is particularly true for local public health agency laboratories, which are typically staffed with microbiologists who may not have substantial bioinformatics expertise or ready access to high-performance computational resources. There is a pressing need for bioinformatics solutions to genome-enabled molecular epidemiology that are accurate, easy to use, fast, and computationally efficient.

The focus of my research is the development of an alignment-free algorithm for NGS data analysis and its implementation into turn-key software applications tailored explicitly for genome-enabled molecular epidemiology and environmental microbial genomics. I explored a computational strategy based on k-mer frequencies to distinguish among sequences of interest in NGS read samples.
By combining this strategy with the enhanced suffix array (ESA), an efficient data structure, I developed a base algorithm for the rapid analysis of unprocessed NGS reads. I further adapted and implemented this algorithm into a suite of software applications for sequence typing, gene detection, and gene-based taxonomic read classification.

My thesis research focused on three specific aims: (1) development of an alignment- and assembly-free algorithm and software solution for NGS-based molecular epidemiology; (2) development of a fully automated, alignment- and assembly-free Web platform for the comprehensive characterization of bacterial isolates using whole genome sequencing (WGS) data; and (3) expansion of the applicability of the alignment-free algorithm to different problems.

Sequence typing and gene detection are essential for pathogen characterization in genome-enabled approaches to molecular epidemiology. To this end, I developed an assembly- and alignment-free algorithm, STing, which I implemented into two turn-key software utilities for sequence typing and gene detection. Benchmarking and validation analyses showed that STing is an ultrafast and accurate solution for genome-enabled molecular epidemiology, which performs better than existing bioinformatics methods for sequence typing and gene detection.

Limited access to bioinformatics-related infrastructure and expertise impedes the successful adoption of genome-enabled approaches to molecular epidemiology in public health. To overcome this challenge, I developed WebSTing, a Web platform that uses the STing algorithm to supply easy access to the accurate and rapid alignment-free automated characterization of WGS samples of bacterial isolates.
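
The k-mer frequency strategy for sequence typing described above can be illustrated with a minimal sketch. This is not the STing implementation: a plain Python dictionary stands in for the enhanced suffix array, reverse complements are ignored, and the allele names, sequences, reads, and the value of k are all hypothetical toy values chosen only for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(alleles, k):
    """Map each k-mer to the set of allele names that contain it.

    A dict is used here for brevity; it is an assumed stand-in for
    an index such as an enhanced suffix array.
    """
    index = {}
    for name, seq in alleles.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def count_hits(reads, index, k):
    """Count, per allele, how many read k-mers match that allele."""
    hits = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            for name in index.get(kmer, ()):
                hits[name] += 1
    return hits


# Hypothetical toy data: two alleles of one locus that differ at one base.
alleles = {
    "locusA_1": "ACGTACGTTAGC",
    "locusA_2": "ACGTACCTTAGC",
}
# Unprocessed reads, both drawn from locusA_1 (no assembly, no alignment).
reads = ["ACGTACGTT", "TACGTTAGC"]

k = 5
index = build_index(alleles, k)
hits = count_hits(reads, index, k)

# The allele with the highest k-mer hit count is the candidate call.
best_allele = max(hits, key=hits.get)
print(best_allele, dict(hits))
```

In this toy case the reads share many more k-mers with the first allele than with the second, so the hit counts alone separate the two candidates without aligning or assembling the reads.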

To demonstrate the utility of STing in problems beyond simple sequence typing and gene detection, I applied the alignment-free algorithm to two different areas: (1) public health, with the virulence gene profiling of Shiga toxin-producing Escherichia coli (STEC) isolates, and (2) environmental microbial genomics, with the nifH gene-based taxonomic classification of amplicon sequencing reads. I showed that STing performs better than the gold-standard method for STEC isolate characterization and that it correctly classifies amplicon sequencing reads on simulated communities of nitrogen-fixing organisms.

Research advance 1: A novel k-mer frequency approach combined with the ESA data structure was used to develop an assembly- and alignment-free algorithm, STing, for rapid analysis of unprocessed NGS reads. The STing algorithm was implemented into two software applications for genome-enabled molecular epidemiology: (1) the STing typer for sequence typing, and (2) the STing detector for gene detection. The STing typer utility was compared to six widely used programs for genome-enabled sequence typing, using the traditional multilocus sequence typing (MLST) scheme and two larger typing schemes, ribosomal MLST (rMLST) and core genome MLST (cgMLST). Comparison results showed that STing outperformed the other applications in terms of accuracy and efficiency (runtime and RAM) using the MLST and rMLST schemes and ranked second with the cgMLST scheme. Most importantly, STing was the only application able to perform the typing analysis using all three of the typing schemes assessed, while also showing the ability to scale successfully to genome-enabled typing schemes like cgMLST. The detector utility was used to evaluate the ability of STing to detect two epidemiologically relevant types of markers: antimicrobial resistance (AMR) genes and virulence factor (VF) genes.
Results showed that STing had 100% accuracy in detecting AMR and VF genes on 71 WGS samples generated from 17 bacterial species of high priority in clinical microbiology research.

Research advance 2: A Web-based platform, WebSTing, was developed to provide fully automated NGS-based characterization of bacterial pathogens. The main goal of WebSTing is to provide easy access to genome-enabled approaches to molecular epidemiology in public health laboratories that have limitations in bioinformatics-related infrastructure and expertise. WebSTing uses the STing algorithm and supplies assembly- and alignment-free sequence typing, gene detection, and phylogenetic analysis of WGS samples of bacterial isolates.

Research advance 3: The applicability of the STing algorithm was expanded to solve problems in two different areas: (1) public health, and (2) environmental microbial genomics. For the public health application, STing was used for virulence gene profiling of STEC isolates from WGS samples. Here, STing was compared to the PCR method, the current gold-standard technique used by public health laboratories for STEC characterization. Results showed that STing is more accurate than PCR for characterizing virulence genes in STEC samples. STing showed between 98% and 100% accuracy in characterizing the four genes used as markers for STEC determination (stx1, stx2, eae, and ehxA), compared to a PCR accuracy between 90% and 94%. Most importantly, and unlike the PCR technique, STing was able to detect novel genes in the analyzed STEC isolates. In the environmental microbial genomics application, the STing algorithm was extended for nifH gene-based taxonomic classification of amplicon sequencing reads. The algorithm was implemented into the STing classifier utility. Using a nifH gene reference database, the STing classifier program was able to classify reads correctly in simulated samples of nitrogen-fixing organism communities, even at the lowest simulated sequencing depth of 1x coverage. Importantly, results showed that full-length reference sequences of the nifH gene led to better taxonomic classification of the sequencing reads.
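
The gene-based taxonomic classification of reads described above can be sketched as a simple k-mer voting scheme. This is an illustrative assumption, not the STing classifier itself: the dictionary index, the `min_fraction` threshold, the taxon labels, and the short sequences are all hypothetical, and real nifH references are far longer.

```python
from collections import Counter


def kmers(seq, k):
    """Return the list of overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_taxon_index(references, k):
    """Map each reference k-mer to the set of taxa whose gene contains it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify_read(read, index, k, min_fraction=0.5):
    """Assign a read to the taxon that collects the most k-mer votes.

    Returns None when the read is too short, matches nothing, or the
    winning taxon explains less than min_fraction of the read's k-mers
    (min_fraction is a hypothetical cutoff for this sketch).
    """
    read_kmers = kmers(read, k)
    if not read_kmers:
        return None
    votes = Counter()
    for kmer in read_kmers:
        for taxon in index.get(kmer, ()):
            votes[taxon] += 1
    if not votes:
        return None
    taxon, count = votes.most_common(1)[0]
    return taxon if count / len(read_kmers) >= min_fraction else None


# Hypothetical toy marker-gene references, one per taxon.
references = {
    "taxon_X": "ATGGCTCGTAAACCGGTT",
    "taxon_Y": "ATGGCACGTTTACCGCTT",
}
# Toy amplicon reads: one from each taxon and one unrelated read.
reads = ["GCTCGTAAACC", "CACGTTTACCG", "GGGGGGGGGGG"]

k = 6
index = build_taxon_index(references, k)
labels = [classify_read(read, index, k) for read in reads]
print(labels)
```

Each read is labeled independently from its own k-mers, so a sketch like this needs no alignment step; reads that share no k-mers with any reference are left unclassified.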