The genetic dissection of complex diseases represents a formidable challenge for modern human genetics. Recently, it has been suggested that linkage disequilibrium (LD) based methods will be a powerful approach for delineating complex disease genes. Most proposed LD test statistics search for association between a single marker and a putative trait locus. However, the power of a single-marker association test may suffer because LD information contained in flanking markers is ignored. Intuitively, haplotypes (which can be regarded as a collection of ordered markers) may be more powerful than individual, unorganised markers. In this study, we derive analytical tools based on standard chi-square statistics to directly investigate and compare the power of multilocus haplotype and single-marker LD tests. More specifically, novel formulas are obtained to calculate expected haplotype frequencies of unlimited size. This study demonstrates that the use of haplotypes can significantly improve the power and robustness of mapping disease genes. Additionally, we detail how the power of haplotype-based association tests is affected by important population genetic parameters such as the genetic distance between markers and disease locus, mode of disease inheritance, age of the trait-causing mutation, frequency of the associated marker allele, and level of initial LD. Finally, published data from the Hereditary Hemochromatosis disease region are used to illustrate the utility of haplotypes.
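The dependence of LD on genetic distance, mutation age, and initial LD mentioned above can be illustrated with the textbook deterministic decay of the LD coefficient under random mating, D_t = D_0 (1 - θ)^t. The sketch below is not the paper's haplotype-frequency formulas; the function name and parameters are hypothetical and chosen only for illustration.

```python
def ld_after_generations(d0: float, theta: float, generations: int) -> float:
    """Expected LD coefficient after `generations` of random mating.

    Assumes the standard two-locus model with no selection, drift, or
    new mutation: D_t = D_0 * (1 - theta) ** t, where theta is the
    recombination fraction between the two loci.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return d0 * (1.0 - theta) ** generations


if __name__ == "__main__":
    # Fraction of the initial LD that remains after 100 generations
    # for markers at increasing genetic distance from the trait locus.
    for theta in (0.001, 0.01, 0.05):
        remaining = ld_after_generations(1.0, theta, 100)
        print(f"theta={theta}: fraction of initial LD remaining = {remaining:.3f}")
```

Under these assumptions, older mutations and more distant markers both leave less LD to detect, which is one reason tightly linked flanking markers carry useful information.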
Noncoding genetic variation is known to significantly influence gene expression levels in a growing number of specific cases; however, the patterns of genome-wide noncoding variation present within populations, the evolutionary forces acting on noncoding variants, and the relative effects of regulatory polymorphisms on transcript abundance are not well characterized. Here, we address these questions by analyzing patterns of regulatory variation in motifs for 177 DNA binding proteins in 37 strains of Saccharomyces cerevisiae. We found considerable polymorphism in regulatory motifs across S. cerevisiae strains (mean π = 0.005) as well as diversity in regulatory motifs (mean 0.91 motif differences per regulatory region). Population genetics analyses reveal that motifs are under purifying selection, and there is considerable heterogeneity in the magnitude of selection across different motifs. Finally, we obtained RNA-Seq data in 22 strains and identified 49 polymorphic DNA sequence motifs in 30 distinct genes that are significantly associated with transcriptional differences between strains. In 22 of these genes, there was a single polymorphic motif associated with expression in the upstream region. Our results provide comprehensive insights into the evolutionary trajectory of regulatory variation in yeast and the characteristics of a compendium of regulatory alleles.
With the ability to measure thousands of related phenotypes from a single biological sample, it is now feasible to genetically dissect systems-level biological phenomena. The genetics of transcriptional regulation and protein abundance are likely to be complex, meaning that genetic variation at multiple loci will influence these phenotypes. Several recent studies have investigated the role of genetic variation in transcription by applying traditional linkage analysis methods to genome-wide expression data, where each gene expression level was treated as a quantitative trait and analyzed separately from the others. Here, we develop a new, computationally efficient method for simultaneously mapping multiple gene expression quantitative trait loci that directly uses all of the available data. Information shared across gene expression traits is captured in a way that makes minimal assumptions about the statistical properties of the data. The method produces easy-to-interpret measures of statistical significance for both individual loci and the overall joint significance of multiple loci selected for a given expression trait. We apply the new method to a cross between two strains of the budding yeast Saccharomyces cerevisiae, and estimate that at least 37% of all gene expression traits show two simultaneous linkages, where we have allowed for epistatic interactions. Pairs of jointly linking quantitative trait loci are identified with high confidence for 170 gene expression traits, where it is expected that both loci are true positives for at least 153 traits. In addition, we are able to show that epistatic interactions contribute to gene expression variation for at least 14% of all traits. We compare the proposed approach to an exhaustive two-dimensional scan over all pairs of loci. Surprisingly, we demonstrate that an exhaustive two-dimensional scan is less powerful than the sequential search used here. In addition, we show that a two-dimensional scan does not truly allow one to test for simultaneous linkage, and the statistical significance measured from this existing method cannot be interpreted among many traits.
The dispersal of humans throughout the world was accompanied by adaptations to local environments. New research shows that a previously identified haplotype of the EPAS1 gene, which allows Tibetans to live at high altitude, was inherited from archaic hominin ancestors.
Considerable work has been devoted to identifying regions of the human genome that have been subjected to recent positive selection. Although detailed follow-up studies of putatively selected regions are critical for a deeper understanding of human evolutionary history, such studies have received comparatively less attention. Recently, we have shown that ALMS1 has been the target of recent positive selection acting on standing variation in Eurasian populations. Here, we describe a careful follow-up analysis of genetic variation across the ALMS1 region, which unexpectedly revealed a cluster of substrates of positive selection. Specifically, through the analysis of SNP data from the HapMap and Human Genome Diversity Project-Centre d'Etude du Polymorphisme Humain samples as well as sequence data from the region, we find compelling evidence for three independent and distinct signals of recent positive selection across this 3 Mb region surrounding ALMS1. Moreover, we analyzed the HapMap data to identify other putative clusters of independent selective events and conservatively discovered 19 additional clusters of adaptive evolution. This work has important implications for the interpretation of genome scans for positive selection in humans and more broadly contributes to a better understanding of how recent positive selection has shaped genetic variation across the human genome.
Understanding patterns of gene-expression variation within and among human populations will provide important insights into the molecular basis of phenotypic diversity and the interpretation of patterns of expression variation in disease. However, little is known about how gene-expression variation is apportioned within and among human populations. Here, we characterize patterns of natural gene-expression variation in 16 individuals of European and African ancestry. We find extensive variation in gene-expression levels and estimate that ∼83% of genes are differentially expressed among individuals and that ∼17% of genes are differentially expressed among populations. By decomposing total gene-expression variation into within- versus among-population components, we find that most expression variation is due to variation among individuals rather than among populations, which parallels observations of extant patterns of human genetic variation. Finally, we performed allele-specific quantitative polymerase chain reaction to demonstrate that cis-regulatory variation in the lymphocyte adaptor protein (SH2B adapter protein 3) contributes to differential expression between European and African samples. These results provide the first insight into how human population structure manifests itself in gene-expression levels and will help guide the search for regulatory quantitative trait loci.
The roles of positive directional selection (selective sweeps) and negative selection (background selection) in shaping the genome-wide distribution of genetic variation in humans remain largely unknown. Here, we optimize the parameter values of a model of the removal of deleterious mutations (background selection) to observed levels of human polymorphism, controlling for mutation rate heterogeneity by using interspecific divergence. A point of "best fit" was found between background-selection predictions and estimates of human effective population sizes, with reasonable parameter estimates whose uncertainty was assessed by bootstrapping. The results suggest that the purging of deleterious alleles has had some influence on shaping levels of human variation, although the effects may be subtle over the majority of the human genome. A significant relationship was found between background-selection predictions and measures of skew in the allele frequency distribution. The genome-wide action of selection (positive and/or negative) is required to explain this observation.
Determining historical sex ratios throughout human evolution can provide insight into patterns of genomic variation, the structure and composition of ancient populations, and the cultural factors that influence the sex ratio (e.g., sex-specific migration rates). Although numerous studies have suggested that unequal sex ratios have existed in human evolutionary history, a coherent picture of sex-biased processes has yet to emerge. For example, two recent studies compared human X chromosome to autosomal variation to make inferences about historical sex ratios but reached seemingly contradictory conclusions, with one study finding evidence for a male bias and the other study identifying a female bias. Here, we show that a large part of this discrepancy can be explained by methodological differences. Specifically, through reanalysis of empirical data, derivation of explicit analytical formulae, and extensive simulations, we demonstrate that two estimators of the effective sex ratio based on population structure and nucleotide diversity preferentially detect biases that have occurred on different timescales. Our results clarify apparently contradictory evidence on the role of sex-biased processes in human evolutionary history and show that extant patterns of human genomic variation are consistent with both a recent male bias and an earlier, persistent female bias.
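The logic of comparing X-chromosome with autosomal variation can be sketched with the classical equilibrium expressions for effective population size given the numbers of breeding females and males. This is textbook theory, not the estimators or analytical formulae derived in the paper, and the function names are hypothetical.

```python
def autosomal_ne(n_females: float, n_males: float) -> float:
    """Classical autosomal effective size: 4*Nf*Nm / (Nf + Nm)."""
    return 4.0 * n_females * n_males / (n_females + n_males)


def x_linked_ne(n_females: float, n_males: float) -> float:
    """Classical X-linked effective size: 9*Nf*Nm / (2*Nf + 4*Nm)."""
    return 9.0 * n_females * n_males / (2.0 * n_females + 4.0 * n_males)


def x_to_autosome_ratio(n_females: float, n_males: float) -> float:
    """Expected ratio of X-linked to autosomal effective size."""
    return x_linked_ne(n_females, n_males) / autosomal_ne(n_females, n_males)


if __name__ == "__main__":
    # Equal sex ratio gives 3/4; an excess of breeding females pushes the
    # ratio toward 9/8, an excess of breeding males toward 9/16.
    for n_f, n_m in ((100, 100), (1000, 10), (10, 1000)):
        print(n_f, n_m, round(x_to_autosome_ratio(n_f, n_m), 4))
```

Under these assumptions, an observed X-to-autosome ratio above 0.75 suggests a female-biased effective sex ratio and a ratio below 0.75 suggests a male-biased one. The sketch assumes a single constant population and so says nothing about the timescale effects the abstract describes.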
Structural variation is an important and abundant source of genetic and phenotypic variation. Here we describe the first systematic and genome-wide analysis of segmental duplications and associated copy number variants (CNVs) in the modern domesticated dog, Canis familiaris, which exhibits considerable morphological, physiological, and behavioral variation. Through computational analyses of the publicly available canine reference sequence, we estimate that segmental duplications comprise ∼4.21% of the canine genome. Segmental duplications overlap 841 genes and are significantly enriched for specific biological functions such as immunity and defense and KRAB box transcription factors. We designed high-density tiling arrays spanning all predicted segmental duplications and performed aCGH in a panel of 17 breeds and a gray wolf. In total, we identified 3583 CNVs, ∼68% of which were found in two or more samples, that map to 678 unique regions. CNVs span 429 genes that are involved in a wide variety of biological processes such as olfaction, immunity, and gene regulation. Our results provide insight into mechanisms of canine genome evolution and generate a valuable resource for future evolutionary and phenotypic studies. [Supplemental material is available online at www.genome.org. All aCGH data from this study have been submitted to Gene Expression Omnibus (GEO) under accession no. GSE13266.]
Identifying regions of the human genome that have been targets of positive selection will provide important insights into recent human evolutionary history and may facilitate the search for complex disease genes. However, the confounding effects of population demographic history and selection on patterns of genetic variation complicate inferences of selection when a small number of loci are studied. To this end, identifying outlier loci from empirical genome-wide distributions of genetic variation is a promising strategy to detect targets of selection. Here, we evaluate the power and efficiency of a simple outlier approach and describe a genome-wide scan for positive selection using a dense catalog of 1.58 million SNPs that were genotyped in three human populations. In total, we analyzed 14,589 genes, 385 of which possess patterns of genetic variation consistent with the hypothesis of positive selection. Furthermore, several extended genomic regions were found, spanning >500 kb, that contained multiple contiguous candidate selection genes. More generally, these data provide important practical insights into the limits of outlier approaches in genome-wide scans for selection, provide strong candidate selection genes to study in greater detail, and may have important implications for disease-related research.
Chromatin accessibility is an important functional genomics phenotype that influences transcription factor binding and gene expression. Genome-scale technologies allow chromatin accessibility to be mapped with high resolution, facilitating detailed analyses into the genetic architecture and evolution of chromatin structure within and between species. We performed Formaldehyde-Assisted Isolation of Regulatory Elements sequencing (FAIRE-Seq) to map chromatin accessibility in two parental haploid yeast species, Saccharomyces cerevisiae and Saccharomyces paradoxus, and their diploid hybrid. We show that although broad-scale characteristics of the chromatin landscape are well conserved between these species, accessibility is significantly different for 947 regions upstream of genes that are enriched for GO terms such as intracellular transport and protein localization. We also develop new statistical methods to investigate the genetic architecture of variation in chromatin accessibility between species, and find that cis effects are more common and of greater magnitude than trans effects. Interestingly, we find that cis and trans effects at individual genes are often negatively correlated, suggesting widespread compensatory evolution to stabilize levels of chromatin accessibility. Finally, we demonstrate that the relationship between chromatin accessibility and gene expression levels is complex, and a significant proportion of differences in chromatin accessibility might be functionally benign.
The rapid development of a dense single-nucleotide-polymorphism marker map has stimulated numerous studies attempting to characterize the magnitude and distribution of background linkage disequilibrium (LD) within and between human populations. Although genotyping errors are an inherent problem in all LD studies, there have been few systematic investigations documenting their consequences on estimates of background LD. Therefore, we derived simple deterministic formulas to investigate the effect that genotyping errors have on four commonly used LD measures (D′, r, Q, and d) in studies of background LD. We have found that genotyping error rates as small as 3% can have serious effects on these LD measures, depending on the allele frequencies and the assumed error model. Furthermore, we compared the robustness of D′, r, Q, and d in the presence of genotyping errors. In general, Q and d are more robust than D′ and r, although exceptions do exist. Finally, through stochastic simulations, we illustrate how genotyping errors can lead to erroneous inferences when measures of LD between two samples are compared.
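For readers unfamiliar with the LD measures named above, the following sketch computes three of them (D′, the correlation coefficient r, and Yule's Q) from the four haplotype frequencies of two biallelic loci, using their standard textbook definitions. It does not reproduce the paper's error-model formulas, it omits d, and the function name and argument order are assumptions made for illustration.

```python
import math


def ld_measures(p11: float, p12: float, p21: float, p22: float) -> dict:
    """Standard LD measures for two biallelic loci.

    p11, p12, p21, p22 are the frequencies of haplotypes AB, Ab, aB, ab
    and must sum to 1.
    """
    if abs(p11 + p12 + p21 + p22 - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    p_a = p11 + p12  # frequency of allele A at the first locus
    p_b = p11 + p21  # frequency of allele B at the second locus
    d = p11 - p_a * p_b  # raw LD coefficient D

    # D' rescales D by its maximum possible magnitude given allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r is the correlation between allelic states at the two loci.
    variance_product = p_a * (1 - p_a) * p_b * (1 - p_b)
    r = d / math.sqrt(variance_product) if variance_product > 0 else 0.0

    # Yule's Q compares the two diagonals of the haplotype table.
    q_denominator = p11 * p22 + p12 * p21
    yule_q = (p11 * p22 - p12 * p21) / q_denominator if q_denominator > 0 else 0.0

    return {"D": d, "D_prime": d_prime, "r": r, "Q": yule_q}


if __name__ == "__main__":
    print(ld_measures(0.4, 0.1, 0.1, 0.4))
```

Perturbing the four input frequencies slightly, as a genotyping error model would, and recomputing the measures is one simple way to see that they do not all respond equally.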
Advances in sequencing technology have enabled whole-genome sequences to be obtained from multiple individuals within species, particularly in model organisms with compact genomes. For example, 36 genome sequences of Saccharomyces cerevisiae are now publicly available, and SNP data are available for even larger collections of strains. One potential use of these resources is mapping the genetic basis of phenotypic variation through genome-wide association (GWA) studies, with the benefit that associated variants can be studied experimentally with greater ease than in outbred populations such as humans. Here, we evaluate the prospects of GWA studies in S. cerevisiae strains through extensive simulations and a GWA study of mitochondrial copy number. We demonstrate that the complex and heterogeneous patterns of population structure present in yeast populations can lead to a high type I error rate in GWA studies of quantitative traits, and that methods typically used to control for population stratification do not provide adequate control of the type I error rate. Moreover, we show that while GWA studies of quantitative traits in S. cerevisiae may be difficult depending on the particular set of strains studied, association studies to map cis-acting quantitative trait loci (QTL) and Mendelian phenotypes are more feasible. We also discuss sampling strategies that could enable GWA studies in yeast and illustrate the utility of this approach in Saccharomyces paradoxus. Thus, our results provide important practical insights into the design and interpretation of GWA studies in yeast and other model organisms that possess complex patterns of population structure.
Oligonucleotide microarrays provide a high-throughput method for exploring genomes. In addition to their utility for gene-expression analysis, oligonucleotide-expression arrays have also been used to perform genotyping on genomic DNA. Here, we show that in segregants from a cross between two unrelated strains of Saccharomyces cerevisiae, high-quality genotype data can also be obtained when mRNA is hybridized to an oligonucleotide-expression array. We were able to identify and genotype nearly 1000 polymorphisms at an error rate close to 3% in segregants and at an error rate of 7% in diploid strains, a performance comparable to methods using genomic DNA. In addition, we demonstrate how simultaneous genotyping and gene-expression profiling can reveal cis-regulatory variation by screening hundreds of genes for allele-specific expression. With this method, we discovered 70 ORFs with evidence for preferential expression of one allele in a diploid hybrid of two S. cerevisiae strains.
Telomere length variation in deletion strains of Saccharomyces cerevisiae was used to identify genes and pathways that regulate telomere length. We found 72 genes that when deleted confer short telomeres, and 80 genes that confer long telomeres relative to those of wild-type yeast. Among identified genes, 88 have not been previously implicated in telomere length control. Genes that regulate telomere length span a variety of functions that can be broadly separated into telomerase-dependent and telomerase-independent pathways. We also found 39 genes that have an important role in telomere maintenance or cell proliferation in the absence of telomerase, including genes that participate in deoxyribonucleotide biosynthesis, sister chromatid cohesion, and vacuolar protein sorting. Given the large number of loci identified, we investigated telomere lengths in 13 wild yeast strains and found substantial natural variation in telomere length among the isolates. Furthermore, we crossed a wild isolate to a laboratory strain and analyzed telomere length in 122 progeny. Genome-wide linkage analysis among these segregants revealed two loci that account for 30%-35% of telomere length variation between the strains. These findings support a general model of telomere length variation in outbred populations that results from polymorphisms at a large number of loci. Furthermore, our results lay the foundation for studying genetic determinants of telomere length variation and their roles in human disease.
Gene expression levels are determined by the balance between rates of mRNA transcription and decay, and genetic variation in either of these processes can result in heritable differences in transcript abundance. Although the genetics of gene expression has been a subject of intense interest, the contribution of heritable variation in mRNA decay rates to gene expression variation has received far less attention. To this end, we developed a novel statistical framework and measured allele-specific differences in mRNA decay rates in a diploid yeast hybrid created by mating two genetically diverse parental strains. We estimate that 31% of genes exhibit allelic differences in mRNA decay rates, of which 350 can be identified at a false discovery rate of 10%. Genes with significant allele-specific differences in mRNA decay rates have higher levels of polymorphism compared to other genes, with all gene regions contributing to allelic differences in mRNA decay rates. Strikingly, we find widespread evidence for compensatory evolution, such that variants influencing transcriptional initiation and decay have opposite effects, suggesting that steady-state gene expression levels are subject to pervasive stabilizing selection. Our results demonstrate that heritable differences in mRNA decay rates are widespread and are an important target for natural selection to maintain or fine-tune steady-state gene expression levels.
Papers by Joshua Akey