Sequence analysis
2013
https://doi.org/10.1093/BIOINFORMATICS/BTM154…
Abstract
SNP detection exploiting multiple sources of redundancy in large EST collections improves validation rates
Related papers
Nature Biotechnology, 2014
Nucleic Acids Research, 2012
Highly multiplex DNA sequencers have greatly expanded our ability to survey human genomes for previously unknown single nucleotide polymorphisms (SNPs). However, sequencing and mapping errors, though rare, contribute substantially to the number of false discoveries in current SNP callers. We demonstrate that we can significantly reduce the number of false positive SNP calls by pooling information across samples. Although many studies prepare and sequence multiple samples with the same protocol, most existing SNP callers ignore cross-sample information. In contrast, we propose an empirical Bayes method that uses cross-sample information to learn the error properties of the data. This error information lets us call SNPs with a lower false discovery rate than existing methods.
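To illustrate why pooling across samples can suppress false positives, the following is a minimal sketch. It is not the empirical Bayes model described in the abstract. It swaps in a much simpler stand-in: a leave-one-out pooled error rate at a single position, followed by a binomial tail test. The function names, the pseudocount, and the example counts are all hypothetical.

```python
from math import comb


def binom_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def leave_one_out_pvalues(counts, pseudo=0.5):
    """Score each sample against an error rate learned from the other samples.

    counts: list of (nonref_reads, depth) tuples, one per sample, all taken
            at the same genomic position (hypothetical input layout).
    pseudo: pseudocount that keeps the pooled rate away from exactly 0 or 1
            (an assumption of this sketch).
    """
    total_k = sum(k for k, _ in counts)
    total_n = sum(n for _, n in counts)
    pvalues = []
    for k, n in counts:
        # Error rate estimated from every sample except the one being tested.
        other_k, other_n = total_k - k, total_n - n
        err = (other_k + pseudo) / (other_n + 2 * pseudo)
        # Small values mean the non-reference reads are hard to explain as errors.
        pvalues.append(binom_tail(k, n, err))
    return pvalues


# Made-up counts: three samples with sporadic mismatches, one likely variant.
example_counts = [(1, 100), (0, 120), (2, 90), (45, 100)]
example_pvalues = leave_one_out_pvalues(example_counts)
```

In this toy input only the last sample carries many non-reference reads, so it receives a very small tail probability, while the others are scored against an error rate inflated by the variant sample and are left uncalled. A real caller would also model base quality, strand, and mapping error.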
Genomics, 2013
Identification of single nucleotide polymorphisms (SNPs) is a key element in sequence-based genetic analysis. Next generation sequencing offers a cost-effective basis to generate the necessary large sequence data sets, and bioinformatic methods are being developed to process sequencing machine readouts. We were interested in detection of SNPs in a 350 kb region of an EMS-mutagenized Arabidopsis chromosome 3. The region was selectively analyzed using PCR-generated, overlapping fragments for Solexa sequencing. The ensuing reads provided high coverage and were processed bioinformatically. In order to assess the SNP candidates obtained with a frequently used alignment program and SNP caller, we developed an additional method that allows the identification of high confidence SNP loci. The method can easily be applied to complete genome sequence data of sufficient coverage.
Nucleic Acids Research, 2014
Detecting single nucleotide polymorphisms (SNPs) between genomes is becoming a routine task with next-generation sequencing. Generally, SNP detection methods use a reference genome. As non-model organisms are increasingly investigated, the need for reference-free methods has been amplified. Most of the existing reference-free methods have fundamental limitations: they can only call SNPs between exactly two datasets, and/or they require a prohibitive amount of computational resources. The method we propose, DISCOSNP, detects both heterozygous and homozygous isolated SNPs from any number of read datasets, without a reference genome, and with very low memory and time footprints (billions of reads can be analyzed with a standard desktop computer). To facilitate downstream genotyping analyses, DISCOSNP ranks predictions and outputs quality and coverage per allele. Compared to finding isolated SNPs using a state-of-the-art assembly and mapping approach, DISCOSNP requires significantly less computational resources, shows similar precision/recall values, and highly ranked predictions are less likely to be false positives. An experimental validation was conducted on an arthropod species (the tick Ixodes ricinus) on which de novo sequencing was performed. Among the predicted SNPs that were tested, 96% were successfully genotyped and truly exhibited polymorphism.
Pharmacogenomics, 2003
The public SNP databases are an important resource for groups performing genetic association and linkage studies. Both academic and commercial groups are developing large numbers of genotyping assays for SNPs in candidate genes or spread across the genome. These databases now contain in excess of 6 million SNPs that have been generated using a large number of methods and cohorts. Today, however, only a small fraction of these SNPs are well characterized and validated. The latest release of dbSNP contains ~ 3.7 million non-redundant entries, only 0.5 million of which are validated, and 0.2 million of which have frequency information. Users of these databases have several common questions. How many of the SNPs are real? What is the frequency spectrum of the SNPs in these databases? What is the distribution picture of these SNPs across different ethnic and geographical populations? What fraction of the total number of SNPs is already captured by these databases? In order to address these questions, we compared the public SNPs against a well-characterized collection of gene-centric SNPs that we have developed. From this comparison, we find that > 50% of high frequency SNPs in the genome (> 20% minor allele frequency) have already been captured by these databases. The coverage drops dramatically below frequencies of 10%. At high frequencies, there is no sampling bias with respect to ethnicity or to regions of the genome. Finally, a relatively large fraction (> 40%) of SNPs in these databases were not seen in our study, which means that they are either of very low frequency, mismapped, or not polymorphic at all.
Nucleic Acids Research, 2014
To apply exome-seq-derived variants in the clinical setting, there is an urgent need to identify the best variant caller(s) from a large collection of available options. We have used an Illumina exome-seq dataset as a benchmark, with two validation scenarios--family pedigree information and SNP array data for the same samples, permitting global high-throughput cross-validation, to evaluate the quality of SNP calls derived from several popular variant discovery tools from both the open-source and commercial communities using a set of designated quality metrics. To the best of our knowledge, this is the first large-scale performance comparison of exome-seq variant discovery tools using high-throughput validation with both Mendelian inheritance checking and SNP array data, which allows us to gain insights into the accuracy of SNP calling through such high-throughput validation in an unprecedented way, whereas the previously reported comparison studies have only assessed concordance of ...
PLoS ONE, 2014
Open source single nucleotide polymorphism (SNP) discovery pipelines for next generation sequencing data commonly require working knowledge of the command line interface, massive computational resources and expertise, which is daunting for biologists. Further, the SNP information generated may not be readily used for downstream processes such as genotyping. Hence, a comprehensive pipeline has been developed by integrating several open source next generation sequencing (NGS) tools along with a graphical user interface called Integrated SNP Mining and Utilization (ISMU) for SNP discovery and their utilization by developing genotyping assays. The pipeline features functionalities such as pre-processing of raw data, integration of open source alignment tools (Bowtie2, BWA, Maq, NovoAlign and SOAP2), SNP prediction (SAMtools/SOAPsnp/CNS2snp and CbCC) methods and interfaces for developing genotyping assays. The pipeline outputs a list of high quality SNPs between all pairwise combinations of genotypes analyzed, in addition to the reference genome/sequence. Visualization tools (Tablet and Flapjack) integrated into the pipeline enable inspection of the alignment and errors, if any. The pipeline also provides a confidence score or polymorphism information content value with flanking sequences for identified SNPs in the standard format required for developing marker genotyping (KASP and Golden Gate) assays. The pipeline enables users to process a range of NGS datasets, such as whole genome re-sequencing, restriction site associated DNA sequencing and transcriptome sequencing data, at a fast speed. The pipeline is very useful for members of the plant genetics and breeding community with no computational expertise who want to discover SNPs and use them in genomics, genetics and breeding studies. The pipeline has been parallelized to process huge next generation sequencing datasets. It has been developed in Java and is available at http://hpc.icrisat.cgiar.org/ISMU as free standalone software.
Scientific Reports, 2015
Sample-tagging is designed to identify accidental sample mix-ups, a major issue in re-sequencing studies. In this work, we develop a model to measure the information content of SNPs, so that we can optimize a panel of SNPs that approaches the maximal information for discrimination. The analysis shows that as few as 60 optimized SNPs can differentiate the individuals in a population as large as the present world, and only 30 optimized SNPs are in practice sufficient to label up to 100 thousand individuals. In simulated populations of 100 thousand individuals, the average Hamming distances generated by the optimized set of 30 SNPs are larger than 18, and the duality frequency is lower than 1 in 10 thousand. This strategy of sample discrimination proves robust across large sample sizes and different datasets. The optimized sets of SNPs are designed for Whole Exome Sequencing, and a program is provided for SNP selection, allowing for customized SNP numbers and int...
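The Hamming distance reported for a 30-SNP panel can be related to a back-of-the-envelope calculation. The sketch below assumes, without support from the abstract, that distance is counted per genotype (0, 1 or 2 copies of the minor allele), that SNPs are independent, and that genotypes follow Hardy-Weinberg proportions. The function names are hypothetical, and this is not the authors' information-content model.

```python
def genotype_probs(maf):
    """Hardy-Weinberg genotype frequencies for a minor allele frequency."""
    p, q = 1.0 - maf, maf
    return (p * p, 2.0 * p * q, q * q)


def mismatch_prob(maf):
    """Probability that two unrelated individuals differ at one SNP."""
    return 1.0 - sum(g * g for g in genotype_probs(maf))


def expected_hamming(mafs):
    """Expected number of differing genotypes across a panel of SNPs."""
    return sum(mismatch_prob(m) for m in mafs)


def identical_tag_prob(mafs):
    """Probability that two unrelated individuals share every genotype."""
    prob = 1.0
    for m in mafs:
        prob *= 1.0 - mismatch_prob(m)
    return prob


# A panel of 30 SNPs, each with minor allele frequency 0.5 (the most
# informative value for a biallelic marker under these assumptions).
panel = [0.5] * 30
panel_distance = expected_hamming(panel)
panel_collision = identical_tag_prob(panel)
```

Under these assumptions each SNP separates two individuals with probability 0.625, so a 30-SNP panel gives an expected distance of 18.75, which is in line with the "larger than 18" reported for the optimized set. Lower-frequency SNPs contribute less per marker, which is why panel optimization matters.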
PLOS Computational Biology, 2005
Identification of single nucleotide polymorphisms (SNPs) and mutations is important for the discovery of genetic predisposition to complex diseases. PCR resequencing is the method of choice for de novo SNP discovery. However, manual curation of putative SNPs has been a major bottleneck in the application of this method to high-throughput screening. Therefore it is critical to develop a more sensitive and accurate computational method for automated SNP detection. We developed a software tool, SNPdetector, for automated identification of SNPs and mutations in fluorescence-based resequencing reads. SNPdetector was designed to model the process of human visual inspection and has very low false positive and false negative rates. We demonstrate the superior performance of SNPdetector in SNP and mutation analysis by comparing its results with those derived by human inspection, PolyPhred (a popular SNP detection tool), and independent genotype assays in three large-scale investigations. The first study identified and validated inter- and intra-subspecies variations in 4,650 traces of 25 inbred mouse strains that belong to either the Mus musculus species or the M. spretus species. Unexpected heterozygosity in the CAST/Ei strain was observed in two out of 1,167 mouse SNPs. The second study identified 11,241 candidate SNPs in five ENCODE regions of the human genome covering 2.5 Mb of genomic sequence. Approximately 50% of the candidate SNPs were selected for experimental genotyping; the validation rate exceeded 95%. The third study detected ENU-induced mutations (at 0.04% allele frequency) in 64,896 traces of 1,236 zebrafish. Our analysis of three large and diverse test datasets demonstrated that SNPdetector is an effective tool for genome-scale research and for large-sample clinical studies. SNPdetector runs on Unix/Linux platforms and is publicly available (http://lpg.nci.nih.gov). Citation: Zhang J, Wheeler DA, Yakub I, Wei S, Sood R, et al. (2005) SNPdetector: A software tool for sensitive and accurate SNP detection. PLoS Comput Biol 1(5): e53.
2001
A methodology is described to identify probable single nucleotide polymorphisms (SNPs) and small insertion/deletion (indel) polymorphisms in silico using overlapping bovine Expressed Sequence Tag (EST) sequences. A stratified random sample consisting of 200 unique overlapping regions identified 63 SNPs and 9 indels, where the minor allele frequency was greater than 0.15. Given that 144,995 bp of sequence was examined, this translates to 1 SNP per 2302 bp and 1 indel per 16,476 bp. The observed proportion of SNPs that were transitions (56%) was significantly higher than the expected 1:2 transition to transversion ratio (P=0.0055). Forty-five percent of mutations occurred at a potential CpG site, significantly higher than expected based on the mean 26% guanine content of the ESTs. For the 60 percent of SNPs that were in a likely protein encoding region (CDS), 26, 5 and 68 percent were located in the first, second and third codon positions, respectively, a significant deficit at the second (P=0.0027) and excess at the third (P=0.0002) position. The predicted amino acid was altered in 51 percent of the CDS SNPs (31% of all SNPs), significantly lower than the 76% expected by chance (P<0.0002). Sixty percent of the amino acid changes were conservative in nature, based on their having positive similarity values in the PAM 250 weight matrix. These results reflect the chemical likelihood of various potential mutations and the effects of negative selection on the subsequent changes. Based on the 354,928 bovine ESTs used in this study, we predict that 2416 SNPs and 286 indels would be detected if all overlapping regions were examined.
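The expected 1:2 transition to transversion ratio follows from counting substitutions: each base has exactly one transition partner (purine to purine, or pyrimidine to pyrimidine) and two transversion partners, provided every substitution is assumed equally likely. The sketch below makes that count explicit. The helper names are hypothetical and the code is illustrative only; it is not part of the described methodology.

```python
from itertools import permutations

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
BASES = "ACGT"


def is_transition(ref, alt):
    """Return True for A<->G or C<->T changes, False for transversions."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def transition_fraction(snps):
    """Fraction of (ref, alt) pairs in an iterable that are transitions."""
    snps = list(snps)
    if not snps:
        raise ValueError("no SNPs given")
    return sum(is_transition(r, a) for r, a in snps) / len(snps)


# Enumerate all 12 ordered substitutions between the four bases.
all_substitutions = list(permutations(BASES, 2))
n_transitions = sum(is_transition(r, a) for r, a in all_substitutions)
n_transversions = len(all_substitutions) - n_transitions
null_fraction = transition_fraction(all_substitutions)
```

Of the 12 possible substitutions, 4 are transitions and 8 are transversions, so the null expectation is that one third of SNPs are transitions. The observed 56% is compared against that one third.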
