Sequence analysis

Paul R Berg

doi:10.1093/BIOINFORMATICS/BTM154

2013



Abstract

SNP detection exploiting multiple sources of redundancy in large EST collections improves validation rates

Justin Zook

Nature Biotechnology, 2014


A cross-sample statistical model for SNP detection in short-read sequencing data

G. Natsoulis

Nucleic Acids Research, 2012

Highly multiplex DNA sequencers have greatly expanded our ability to survey human genomes for previously unknown single nucleotide polymorphisms (SNPs). However, sequencing and mapping errors, though rare, contribute substantially to the number of false discoveries in current SNP callers. We demonstrate that we can significantly reduce the number of false positive SNP calls by pooling information across samples. Although many studies prepare and sequence multiple samples with the same protocol, most existing SNP callers ignore cross-sample information. In contrast, we propose an empirical Bayes method that uses cross-sample information to learn the error properties of the data. This error information lets us call SNPs with a lower false discovery rate than existing methods.


Benefit-of-doubt (BOD) scoring: A sequencing-based method for SNP candidate assessment from high to medium read number data sets

Fritz Sedlazeck

Genomics, 2013

Identification of single nucleotide polymorphisms (SNPs) is a key element in sequence-based genetic analysis. Next generation sequencing offers a cost-effective basis to generate the necessary, large sequence data sets, and bioinformatic methods are being developed to process sequencing machine readouts. We were interested in detection of SNPs in a 350 kb region of an EMS-mutagenized Arabidopsis chromosome 3. The region was selectively analyzed using PCR-generated, overlapping fragments for Solexa sequencing. The ensuing reads provided a high coverage and were processed bioinformatically. In order to assess the SNP candidates obtained with a frequently used alignment program and SNP caller, we developed an additional method that allows the identification of high confidence SNP loci. The method can easily be applied to complete genome sequence data of sufficient coverage.


Reference-free detection of isolated SNPs

R. Uricaru

Nucleic Acids Research, 2014

Detecting single nucleotide polymorphisms (SNPs) between genomes is becoming a routine task with next-generation sequencing. Generally, SNP detection methods use a reference genome. As non-model organisms are increasingly investigated, the need for reference-free methods has been amplified. Most of the existing reference-free methods have fundamental limitations: they can only call SNPs between exactly two datasets, and/or they require a prohibitive amount of computational resources. The method we propose, DISCOSNP, detects both heterozygous and homozygous isolated SNPs from any number of read datasets, without a reference genome, and with very low memory and time footprints (billions of reads can be analyzed with a standard desktop computer). To facilitate downstream genotyping analyses, DISCOSNP ranks predictions and outputs quality and coverage per allele. Compared to finding isolated SNPs using a state-of-the-art assembly and mapping approach, DISCOSNP requires significantly less computational resources, shows similar precision/recall values, and highly ranked predictions are less likely to be false positives. An experimental validation was conducted on an arthropod species (the tick Ixodes ricinus) on which de novo sequencing was performed. Among the predicted SNPs that were tested, 96% were successfully genotyped and truly exhibited polymorphism.


Genome-wide evaluation of the public SNP databases

Andreas Windemuth

Pharmacogenomics, 2003

The public SNP databases are an important resource for groups performing genetic association and linkage studies. Both academic and commercial groups are developing large numbers of genotyping assays for SNPs in candidate genes or spread across the genome. These databases now contain in excess of 6 million SNPs that have been generated using a large number of methods and cohorts. Today, however, only a small fraction of these SNPs are well characterized and validated. The latest release of dbSNP contains ~ 3.7 million non-redundant entries, only 0.5 million of which are validated, and 0.2 million of which have frequency information. Users of these databases have several common questions. How many of the SNPs are real? What is the frequency spectrum of the SNPs in these databases? What is the distribution picture of these SNPs across different ethnic and geographical populations? What fraction of the total number of SNPs is already captured by these databases? In order to address these questions, we compared the public SNPs against a well-characterized collection of gene-centric SNPs that we have developed. From this comparison, we find that > 50% of high frequency SNPs in the genome (> 20% minor allele frequency) have already been captured by these databases. The coverage drops dramatically below frequencies of 10%. At high frequencies, there is no sampling bias with respect to ethnicity or to regions of the genome. Finally, a relatively large fraction (> 40%) of SNPs in these databases were not seen in our study, which means that they are either of very low frequency, mismapped, or not polymorphic at all.


Performance comparison of SNP detection tools with illumina exome sequencing data--an assessment using both family pedigree information and sample-matched SNP array data

Ming Yi, Yongmei Zhao

Nucleic Acids Research, 2014

To apply exome-seq-derived variants in the clinical setting, there is an urgent need to identify the best variant caller(s) from a large collection of available options. We have used an Illumina exome-seq dataset as a benchmark, with two validation scenarios--family pedigree information and SNP array data for the same samples, permitting global high-throughput cross-validation, to evaluate the quality of SNP calls derived from several popular variant discovery tools from both the open-source and commercial communities using a set of designated quality metrics. To the best of our knowledge, this is the first large-scale performance comparison of exome-seq variant discovery tools using high-throughput validation with both Mendelian inheritance checking and SNP array data, which allows us to gain insights into the accuracy of SNP calling through such high-throughput validation in an unprecedented way, whereas the previously reported comparison studies have only assessed concordance of ...


An Integrated SNP Mining and Utilization (ISMU) Pipeline for Next Generation Sequencing Data

BhanuPrakash Amindala

PLoS ONE, 2014

Open source single nucleotide polymorphism (SNP) discovery pipelines for next generation sequencing data commonly require working knowledge of the command line interface, massive computational resources and expertise, which is daunting for biologists. Further, the SNP information generated may not be readily used for downstream processes such as genotyping. Hence, a comprehensive pipeline has been developed by integrating several open source next generation sequencing (NGS) tools along with a graphical user interface called Integrated SNP Mining and Utilization (ISMU) for SNP discovery and their utilization by developing genotyping assays. The pipeline features functionalities such as pre-processing of raw data, integration of open source alignment tools (Bowtie2, BWA, Maq, NovoAlign and SOAP2), SNP prediction (SAMtools/SOAPsnp/CNS2snp and CbCC) methods and interfaces for developing genotyping assays. The pipeline outputs a list of high quality SNPs between all pairwise combinations of genotypes analyzed, in addition to the reference genome/sequence. Visualization tools (Tablet and Flapjack) integrated into the pipeline enable inspection of the alignment and errors, if any. The pipeline also provides a confidence score or polymorphism information content value with flanking sequences for identified SNPs in the standard format required for developing marker genotyping (KASP and Golden Gate) assays. The pipeline enables users to process a range of NGS datasets such as whole genome re-sequencing, restriction site associated DNA sequencing and transcriptome sequencing data at a fast speed. The pipeline is very useful for members of the plant genetics and breeding community with no computational expertise who want to discover SNPs and utilize them in genomics, genetics and breeding studies. The pipeline has been parallelized to process huge datasets of next generation sequencing. It has been developed in Java and is available at http://hpc.icrisat.cgiar.org/ISMU as standalone free software.


Evaluating information content of SNPs for sample-tagging in re-sequencing projects

H. Ropers

Scientific Reports, 2015

Sample-tagging is designed for identification of accidental sample mix-up, which is a major issue in re-sequencing studies. In this work, we develop a model to measure the information content of SNPs, so that we can optimize a panel of SNPs that approach the maximal information for discrimination. The analysis shows that as low as 60 optimized SNPs can differentiate the individuals in a population as large as the present world, and only 30 optimized SNPs are in practice sufficient in labeling up to 100 thousand individuals. In the simulated populations of 100 thousand individuals, the average Hamming distances, generated by the optimized set of 30 SNPs are larger than 18, and the duality frequency, is lower than 1 in 10 thousand. This strategy of sample discrimination is proved robust in large sample size and different datasets. The optimized sets of SNPs are designed for Whole Exome Sequencing, and a program is provided for SNP selection, allowing for customized SNP numbers and int...


SNPdetector: A Software Tool for Sensitive and Accurate SNP Detection

Raman Sood

PLOS Computational Biology, 2005

Identification of single nucleotide polymorphisms (SNPs) and mutations is important for the discovery of genetic predisposition to complex diseases. PCR resequencing is the method of choice for de novo SNP discovery. However, manual curation of putative SNPs has been a major bottleneck in the application of this method to high-throughput screening. Therefore it is critical to develop a more sensitive and accurate computational method for automated SNP detection. We developed a software tool, SNPdetector, for automated identification of SNPs and mutations in fluorescence-based resequencing reads. SNPdetector was designed to model the process of human visual inspection and has a very low false positive and false negative rate. We demonstrate the superior performance of SNPdetector in SNP and mutation analysis by comparing its results with those derived by human inspection, PolyPhred (a popular SNP detection tool), and independent genotype assays in three large-scale investigations. The first study identified and validated inter- and intra-subspecies variations in 4,650 traces of 25 inbred mouse strains that belong to either the Mus musculus species or the M. spretus species. Unexpected heterozygosity in CAST/Ei strain was observed in two out of 1,167 mouse SNPs. The second study identified 11,241 candidate SNPs in five ENCODE regions of the human genome covering 2.5 Mb of genomic sequence. Approximately 50% of the candidate SNPs were selected for experimental genotyping; the validation rate exceeded 95%. The third study detected ENU-induced mutations (at 0.04% allele frequency) in 64,896 traces of 1,236 zebra fish. Our analysis of three large and diverse test datasets demonstrated that SNPdetector is an effective tool for genome-scale research and for large-sample clinical studies. SNPdetector runs on Unix/Linux platform and is available publicly (http://lpg.nci.nih.gov). Citation: Zhang J, Wheeler DA, Yakub I, Wei S, Sood R, et al. (2005) SNPdetector: A software tool for sensitive and accurate SNP detection. PLoS Comput Biol 1(5): e53.


SNP detection using overlapping bovine ESTs

John C McEwan

2001

A methodology is described to identify probable single nucleotide polymorphisms (SNPs) and small insertion deletion (indel) polymorphisms in silico using overlapping bovine Expressed Sequence Tag (EST) sequences. A stratified random sample consisting of 200 unique overlapping regions identified 63 SNPs and 9 indels, where the minor allele frequency was greater than 0.15. Given that 144,995bp of sequence was examined, this translates to 1 SNP per 2302bp and 1 indel per 16,476bp. The observed proportion of SNPs that were transitions (56%) was significantly higher than the expected 1:2 transition to transversion ratio (P=0.0055). Forty five percent of mutations occurred at a potential CpG site, significantly higher than expected based on the mean 26% guanine content of the ESTs. For the 60 percent of SNPs that were in a likely protein encoding region (CDS), 26, 5 and 68 percent were located in the first, second and third codon position, a significant deficit at the second (P=0.0027) and excess at the third codon (P=0.0002) positions respectively. The predicted amino acid was altered in 51 percent of the CDS SNPs (31% of all SNPs), significantly lower than the 76% expected by chance (P<0.0002). Sixty percent of the amino acid changes were conservative in nature based on them having positive similarity values using the PAM 250 weight matrix. These results reflect the chemical likelihood of various potential mutations and the effects of negative selection on the subsequent changes. Based on 354,928 bovine ESTs used in this study we predict 2416 SNPs and 286 indels would be detected if all overlapping regions were examined.



References (11)

Buetow,K.H. et al. (1999) Reliable identification of large numbers of candidate SNPs from public EST data. Nat. Genet., 21, 323-325.
Ewing,B. et al. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res., 8, 175-185.
Gordon,D. et al. (1998) Consed: a graphical tool for sequence finishing. Genome Res., 8, 195-202.
Guryev,V. et al. (2004) Single nucleotide polymorphisms associated with rat expressed sequences. Genome Res., 14, 1438-1443.
Hawken,R.J. et al. (2004) An interactive bovine in silico SNP database (IBISS). Mamm. Genome., 819-827.
Irizarry,K. et al. (2000) Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nat. Genet., 26, 233-236.
Koop,B.F. and Davidson,W.S. (2007) cGRASP (http://web.uvic.ca/cbr/grasp/)
Lee,M.A. et al. (2006) Establishment of a pipeline to analyse non-synonymous SNPs in Bos taurus. BMC Genomics, 26, 298.
Marth,G.T. et al. (1999) A general approach to single-nucleotide polymorphism discovery. Nat. Genet., 23, 452-456.
Rise,M.L. et al. (2004) Development and application of a salmonid EST database and cDNA microarray: data mining and interspecific hybridization characteristics. Genome Res., 14, 478-490.
Taillon-Miller,P. et al. (1998) Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res., 8, 748-754.

Related papers

SNP Discovery through EST Data Mining

Zhanjiang Liu

Liu/Next Generation Sequencing and Whole Genome Selection in Aquaculture, 2010

The key issue of single-nucleotide polymorphism (SNP) marker applications in aquaculture species is the availability of SNPs. Identification of large numbers of SNPs requires massive sequencing efforts and resources (Picoult-Newberg et al., 1999). Prior to the application of next generation sequencing, large-scale genome sequencing was not possible with aquaculture species. With recent adoption of next generation sequencing, it is obvious that SNP discovery has been made possible through both whole genome sequencing or through sequencing of reduced representation libraries (RRLs; for details, see Chapter 5). In spite of such efforts and possibilities, it is expected that in the near future, whole genome sequences will still not become available for the vast majority of aquaculture species, especially not for the minor aquaculture species. Therefore, SNP identification in aquaculture species will still likely be using various alternative available resources such as expressed sequence tags (ESTs). In this chapter, we will focus on SNP discovery through mining of EST databases.

Advantages and Disadvantages of SNP Discovery through EST Data Mining

ESTs are already available in the databases, so additional sequencing efforts are not essential. ESTs are single-pass sequence reads generated by direct sequencing of cDNA clones. They have been generated in the course of expression studies. In recent years, EST resources have become available for many aquaculture species, and a summary is provided in Table 6.1 for major aquaculture species. ESTs-derived SNPs are associated with genes, therefore they are type I markers. Gene-associated SNPs can account for genomic causes of phenotypes. In this regard, gene-associated markers are superior to markers identified from anonymous genomic regions. EST-derived SNPs should correlate genes in terms of genomic locations. While the numbers of markers available is very important, it is even more important to have markers that are evenly distributed in the genome. In genomic scale, genes are distributed in all chromosomes and chromosome segments, allowing EST-derived SNPs to have the potential of the same distribution in the genome, thereby reducing the levels of marker clustering.


Assessment of Utility of ESTs for Nucleotide Diversity

Winston Hide

giw.hgc.jp


EST analysis online: WWW tools for detection of SNPs and alternative splice forms

Jens Reich

Trends in Genetics, 2000


A pipeline for high throughput detection and mapping of SNPs from EST databases

Richard Visser

Molecular Breeding, 2010

Single nucleotide polymorphisms (SNPs) represent the most abundant type of genetic variation that can be used as molecular markers. The SNPs that are hidden in sequence databases can be unlocked using bioinformatic tools. For efficient application of these SNPs, the sequence set should be as error-free as possible, targeting single loci and suitable for the SNP scoring platform of choice. We have developed a pipeline to effectively mine SNPs from public EST databases with or without quality information using QualitySNP software, select reliable SNPs and prepare the loci for analysis on the Illumina GoldenGate genotyping platform. The applicability of the pipeline was demonstrated using publicly available potato EST data, genotyping individuals from two diploid mapping populations and subsequently mapping the SNP markers (putative genes) in both populations. Over 7000 reliable SNPs were identified that met the criteria for genotyping on the GoldenGate platform. Of the 384 SNPs on the SNP array approximately 12% dropped out. For the two potato mapping populations 165 and 185 segregating SNP loci could be mapped on the respective genetic maps, illustrating the effectiveness of our pipeline for SNP selection and validation.


Accurate detection and genotyping of SNPs utilizing population sequencing data

Vikas Bansal

Genome Research, 2010

Next-generation sequencing technologies have made it possible to sequence targeted regions of the human genome in hundreds of individuals. Deep sequencing represents a powerful approach for the discovery of the complete spectrum of DNA sequence variants in functionally important genomic intervals. Current methods for single nucleotide polymorphism (SNP) detection are designed to detect SNPs from single individual sequence data sets. Here, we describe a novel method SNIP-Seq (single nucleotide polymorphism identification from population sequence data) that leverages sequence data from a population of individuals to detect SNPs and assign genotypes to individuals. To evaluate our method, we utilized sequence data from a 200-kilobase (kb) region on chromosome 9p21 of the human genome. This region was sequenced in 48 individuals (five sequenced in duplicate) using the Illumina GA platform. Using this data set, we demonstrate that our method is highly accurate for detecting variants and can filter out false SNPs that are attributable to sequencing errors. The concordance of sequencing-based genotype assignments between duplicate samples was 98.8%. The 200-kb region was independently sequenced to a high depth of coverage using two sequence pools containing the 48 individuals. Many of the novel SNPs identified by SNIP-Seq from the individual sequencing were validated by the pooled sequencing data and were subsequently confirmed by Sanger sequencing. We estimate that SNIP-Seq achieves a low false-positive rate of ~2%, improving upon the higher false-positive rate for existing methods that do not utilize population sequence data. Collectively, these results suggest that analysis of population sequencing data is a powerful approach for the accurate detection of SNPs and the assignment of genotypes to individual samples.


4Pipe4 - A 454 data analysis pipeline for SNP detection in datasets with no reference sequence or strain information

Dora Batista

BMC Bioinformatics, 2016

Next-generation sequencing datasets are becoming more frequent, and their use in population studies is becoming widespread. For non-model species, without a reference genome, it is possible from a panel of individuals to identify a set of SNPs that can be used for further population genotyping. However the lack of a reference genome to which the sequenced data could be compared makes the finding of SNPs more troublesome. Additionally when the data sources (strains) are not identified (e.g. in datasets of pooled individuals), the problem of finding reliable variation in these datasets can become much more difficult due to the lack of specialized software for this specific task. Here we describe 4Pipe4, a 454 data analysis pipeline particularly focused on SNP detection when no reference or strain information is available. It uses a command line interface to automatically call other programs, parse their outputs and summarize the results. The variation detection routine is built-in in ...


A fast and accurate SNP detection algorithm for next-generation sequencing data

Junwen Wang

Various methods have been developed for calling single-nucleotide polymorphisms from next-generation sequencing data. However, for satisfactory performance, most of these methods require expensive high-depth sequencing. Here, we propose a fast and accurate single-nucleotide polymorphism detection program that uses a binomial distribution-based algorithm and a mutation probability. We extensively assess this program on normal and cancer next-generation sequencing data from The Cancer Genome Atlas project and pooled data from the 1,000 Genomes Project. We also compare the performance of several state-of-the-art programs for single-nucleotide polymorphism calling and evaluate their pros and cons. We demonstrate that our program is a fast and highly accurate single-nucleotide polymorphism detection method, particularly when the sequence depth is low. The program can finish single-nucleotide polymorphism calling within four hours for 10-fold human genome next-generation sequencing data (30 gigabases) on a standard desktop computer.


Genome-Wide SNP Calling from Genotyping by Sequencing (GBS) Data: A Comparison of Seven Pipelines and Two Sequencing Technologies

Davoud Torkamaneh

Next-generation sequencing (NGS) has revolutionized plant and animal research in many ways including new methods of high throughput genotyping. Genotyping-by-sequencing (GBS) has been demonstrated to be a robust and cost-effective genotyping method capable of producing thousands to millions of SNPs across a wide range of species. Undoubtedly, the greatest barrier to its broader use is the challenge of data analysis. Herein we describe a comprehensive comparison of seven GBS bioinformatics pipelines developed to process raw GBS sequence data into SNP genotypes. We compared five pipelines requiring a reference genome (TASSEL-GBS v1 & v2, Stacks, IGST, and Fast-GBS) and two de novo pipelines that do not require a reference genome (UNEAK and Stacks). Using Illumina sequence data from a set of 24 re-sequenced soybean lines, we performed SNP calling with these pipelines and compared the GBS SNP calls with the re-sequencing data to assess their accuracy. The number of SNPs called without a reference genome was lower (13k to 24k) than with a reference genome (25k to 54k SNPs) while accuracy was high (92.3 to 98.7%) for all but one pipeline (TASSEL-GBSv1, 76.1%). Among pipelines offering a high accuracy (>95%), Fast-GBS called the greatest number of polymorphisms (close to 35,000 SNPs + Indels) and yielded the highest accuracy (98.7%). Using Ion Torrent sequence data for the same 24 lines, we compared the performance of Fast-GBS with that of TASSEL-GBSv2. It again called more polymorphisms (25.8K vs 22.9K) and these proved more accurate (95.2 vs 91.1%). Typically, SNP catalogues called from the same sequencing data using different pipelines resulted in highly overlapping SNP catalogues (79–92% overlap). In contrast, overlap between SNP catalogues obtained using the same pipeline but different sequencing technologies was less extensive (~50–70%).


SNP calling using genotype model selection on high-throughput sequencing data

N. You

Bioinformatics, 2012

Motivation: A review of the available single nucleotide polymorphism (SNP) calling procedures for Illumina high-throughput sequencing (HTS) platform data reveals that most rely mainly on base-calling and mapping qualities as sources of error when calling SNPs. Thus, errors not involved in base-calling or alignment, such as those in genomic sample preparation, are not accounted for. Results: A novel method of consensus and SNP calling, Genotype Model Selection (GeMS), is given which accounts for the errors that occur during the preparation of the genomic sample. Simulations and real data analyses indicate that GeMS has the best performance balance of sensitivity and positive predictive value among the tested SNP callers. Availability: The GeMS package can be downloaded from https://sites.google.com/a/bioinformatics.ucr.edu/xinping-cui/home/software or http://computationalbioenergy.org/software.html Contact: xinping.cui@ucr.edu Supplementary information: Supplementary data are avail...


In silico quality assessment of SNPs

Thomas Lange

In silico quality assessment of SNPs - A case study on the Axiom® Wheat genotyping arrays, 2020

Genotyping arrays proved to be an exemplary tool for the simultaneous analysis of a multitude of single nucleotide polymorphisms (SNPs), a special case of genomic variants. By the example of SNPs represented on the Axiom® Wheat HD genotyping array as well as on the Axiom® Wheat Breeder's genotyping array, we applied a three way classification system to assess the quality of SNPs in bread wheat (Triticum aestivum L.) and subsequently the quality of these genotyping arrays. Class 1 SNPs could be aligned uniquely to the reference genome and did not show any genomic variants in their flanking sequence. Class 2 SNPs could also be aligned uniquely to the reference genome but showed genomic variants in their flanking sequence. The remaining SNPs were assigned to class 3. To determine the number of genomic variants in a SNP's flanking sequence, we used all currently available SNPs in the Ensembl Plants database. From the 819,571 SNPs on the Axiom® Wheat HD genotyping array, we assigned 24,343 to class 1 and from the 35,143 SNPs on the Axiom® Wheat Breeder's genotyping array we classified 2295 SNPs as class 1. We show that class 1 SNPs of the Axiom® Wheat HD genotyping array result in an equidistant coverage of the reference genome. We make the classification table as well as R-scripts available to give breeders and researchers the possibility to reproduce our analysis in an easy way. Moreover, we discuss the possibilities and limitations of such an in silico analysis of genotyping arrays as well as future research possibilities for this approach.


Cited by

An Integrated Approach to Gene Discovery and Marker Development in Atlantic Cod (Gadus morhua)

Sophie Hubert, Catherine Kozera

Marine Biotechnology, 2011

Atlantic cod is a species that has been overexploited by the capture fishery. Programs to domesticate this species are underway in several countries, including Canada, to provide an alternative route for production. Selective breeding programs have been successfully applied in the domestication of other species, with genomics-based approaches used to augment conventional methods of animal production in recent years. Genomics tools, such as gene sequences and sets of variable markers, also have the potential to enhance and accelerate selective breeding programs in aquaculture, and to provide better monitoring tools to ensure that wild cod populations are well managed. We describe the generation of significant genomics resources for Atlantic cod through an integrated genomics/selective breeding approach. These include 158,877 expressed sequence tags (ESTs), a set of annotated putative transcripts and several thousand single nucleotide polymorphism markers that were developed from, and have been shown to be highly variable in, fish enrolled in two selective breeding programs. Our EST collection was generated from various tissues and life cycle stages. In some cases, tissues from which libraries were generated were isolated from fish exposed to stressors, including elevated temperature, or antigen stimulation (bacterial and viral) to enrich for transcripts that are involved in these response pathways. The genomics resources described here support the developing aquaculture industry, enabling the application of molecular markers within selective breeding programs. Marker sets should also find widespread application in fisheries management.


Transcriptome Sequencing, and Rapid Development and Application of SNP Markers for the Legume Pod Borer Maruca vitrata (Lepidoptera: Crambidae)

Tolulope Agunbiade, Fernando Gallardo Covas, Larry Murdock

PLoS ONE, 2011

The legume pod borer, Maruca vitrata (Lepidoptera: Crambidae), is an insect pest species of crops grown by subsistence farmers in tropical regions of Africa. We present the de novo assembly of 3729 contigs from 454- and Sanger-derived sequencing reads for midgut, salivary, and whole adult tissues of this non-model species. Functional annotation predicted that 1320 M. vitrata protein coding genes are present, of which 631 have orthologs within the Bombyx mori gene model. A homology-based analysis assigned M. vitrata genes into a group of paralogs, but these were subsequently partitioned into putative orthologs following phylogenetic analyses. Following sequence quality filtering, a total of 1542 putative single nucleotide polymorphisms (SNPs) were predicted within M. vitrata contig assemblies. Seventy one of 1078 designed molecular genetic markers were used to screen M. vitrata samples from five collection sites in West Africa. Population substructure may be present with significant implications in the insect resistance management recommendations pertaining to the release of biological control agents or transgenic cowpea that express Bacillus thuringiensis crystal toxins. Mutation data derived from transcriptome sequencing is an expeditious and economical source for genetic markers that allow evaluation of ecological differentiation.

A dense SNP-based linkage map for Atlantic salmon (Salmo salar) reveals extended chromosome homeologies and striking differences in sex-specific recombination patterns

Paul R Berg

BMC Genomics, 2011

The Atlantic salmon genome is in the process of returning to a diploid state after undergoing a whole genome duplication (WGD) event between 25 and 100 million years ago. Existing data on the proportion of paralogous sequence variants (PSVs), multisite variants (MSVs) and other types of complex sequence variation suggest that the rediploidization phase is far from over. The aims of this study were to construct a high-density linkage map for Atlantic salmon, to characterize the extent of rediploidization and to improve our understanding of genetic differences between sexes in this species. Results: A linkage map for Atlantic salmon comprising 29 chromosomes and 5650 single nucleotide polymorphisms (SNPs) was constructed using genotyping data from 3297 fish belonging to 143 families. Of these, 2696 SNPs were generated from ESTs or other gene-associated sequences. Homeologous chromosomal regions were identified through the mapping of duplicated SNPs and through the investigation of syntenic relationships between Atlantic salmon and the reference genome sequence of the threespine stickleback (Gasterosteus aculeatus). The sex-specific linkage maps spanned a total of 2402.3 cM in females and 1746.2 cM in males, highlighting a difference in sex-specific recombination rate (1.38:1) which is much lower than previously reported in Atlantic salmon. The sexes, however, displayed striking differences in the distribution of recombination sites within linkage groups, with males showing recombination strongly localized to telomeres.

Transcriptome-Wide Single Nucleotide Polymorphisms (SNPs) for Abalone (Haliotis midae): Validation and Application Using GoldenGate Medium-Throughput Genotyping Assays

Aletta E Bester-van der Merwe

International Journal of Molecular Sciences, 2013

Haliotis midae is one of the most valuable commercial abalone species in the world, but is highly vulnerable due to exploitation, habitat destruction and predation. In order to preserve wild and cultured stocks, genetic management and improvement of the species have become crucial. Fundamental to this is the availability and employment of molecular markers, such as microsatellites and single nucleotide polymorphisms (SNPs). Transcriptome sequences generated through sequencing-by-synthesis technology were utilized for the in vitro and in silico identification of 505 putative SNPs from a total of 316 selected contigs. A subset of 234 SNPs was further validated and characterized in wild and cultured abalone using two Illumina GoldenGate genotyping assays. Combined with VeraCode technology, this genotyping platform yielded a 65%−69% conversion rate (percentage polymorphic markers) with a global genotyping success rate of 76%−85% and provided a viable means for validating SNP markers in a non-model species. The utility of 31 of the validated SNPs in population structure analysis was confirmed, while a large number of SNPs (174) were shown to be informative and are, thus, good candidates for linkage map construction. The non-synonymous SNPs (50) located in coding regions of genes that showed similarities with known proteins will also be useful for genetic applications, such as the marker-assisted selection of genes of relevance to abalone aquaculture.

Single-Nucleotide Polymorphisms (SNP) Mining and Their Effect on the Tridimensional Protein Structure Prediction in a Set of Immunity-Related Expressed Sequence Tags (EST) in Atlantic Salmon (Salmo salar)

Mónica Imarai

Frontiers in Genetics

Single-nucleotide polymorphisms (SNPs) are single genetic code variations considered one of the most common forms of nucleotide modifications. Such SNPs can be located in genes associated with the immune response and, therefore, they may have direct implications for the phenotype of susceptibility to infections affecting the productive sector. In this study, a set of immune-related genes (cc motif chemokine 19 precursor [ccl19], integrin b2 [itb2, also named cd18], glutathione transferase omega-1 [gsto-1], heat shock 70 kDa protein [hsp70], major histocompatibility complex class I [mhc-I]) were analyzed to identify SNPs by data mining. These genes were chosen based on their previously reported expression in the infectious pancreatic necrosis virus (IPNV)-infected Atlantic salmon phenotype. The available EST sequences for these genes were obtained from the Unigene database. Twenty-eight SNPs were found in the genes evaluated, and most of them were identified as transition base changes. The effect of the SNPs located in the 5'-untranslated region (UTR) or 3'-UTR upon transcription factor binding sites and alternative splicing regulatory motifs was assessed and ranked with a low-to-medium predicted FASTSNP risk score. Synonymous SNPs were found in itb2 (c.2275G > A), gsto-1 (c.558G > A), and hsp70 (c.1950C > T) with a low predicted FASTSNP risk score. The difference in the relative synonymous codon usage (RSCU) value between the variant codons and the wild-type codon (DRSCU) showed one negative (hsp70 c.1950C > T) and two positive DRSCU values (itb2 c.2275G > A; gsto-1 c.558G > A), suggesting that these

Quality assessment parameters for EST-derived SNPs from catfish

Zhanjiang Liu

BMC Genomics, 2008

Background: SNPs are abundant, codominantly inherited, and sequence-tagged markers. They are highly adaptable to large-scale automated genotyping and are therefore most suitable for association studies and applicable to comparative genome analysis. However, discovery of SNPs requires genome sequencing efforts through whole genome sequencing or deep sequencing of reduced representation libraries. Such genome resources are not yet available for many species, including catfish. A large resource of ESTs is to become available in catfish, allowing identification of a large number of SNPs, but the reliability of EST-derived SNPs is relatively low because of sequencing errors. This project was designed to answer some of the questions relevant to quality assessment of EST-derived SNPs. Results: Two factors were found to be most significant for validation of EST-derived SNPs: the contig size (number of sequences in the contig) and the minor allele sequence frequency. The larger the contigs were, the...

SNP Discovery through EST Data Mining

Zhanjiang Liu

Liu/Next Generation Sequencing and Whole Genome Selection in Aquaculture, 2010
