Anna-sophie Fiston-Lavier

Papers by Anna-sophie Fiston-Lavier

PlasForest: a homology-based random forest classifier for plasmid detection in genomic datasets

BMC Bioinformatics, Jun 26, 2021

Plasmids are extra-chromosomal fragments of DNA that replicate autonomously in the host cell. They often carry genes that can provide a benefit under specific environmental conditions. These mobile genetic elements remain a major biological concern for health and agriculture policies due to their ability to accumulate and spread resistance genes. Indeed, the frequency of plasmids, and of the resistance genes they carry, can increase quickly in populations thanks to their high mobility both within hosts (through ...

Abstract. Background: Plasmids are mobile genetic elements that often carry accessory genes, and are vectors for horizontal transfer between bacterial genomes. Plasmid detection in large genomic datasets is crucial to analyze their spread and quantify their role in bacterial adaptation, particularly in the propagation of antibiotic resistance. Bioinformatics methods have been developed to detect plasmids. However, they suffer from low sensitivity (i.e., most plasmids remain undetected) or low precision (i.e., these methods identify chromosomes as plasmids), and are overall not adapted to identify plasmids in whole genomes that are not fully assembled (contigs and scaffolds). Results: We developed PlasForest, a homology-based random forest classifier identifying bacterial plasmid sequences in partially assembled genomes. Without knowing the taxonomical origin of the samples, PlasForest identifies contigs as plasmids or chromosomes with an F1 score of 0.950. Notably, it can detect 77.4% of plasmid contigs below 1 kb with 2.8% false positives and 99.9% of plasmid contigs over 50 kb with 2.2% false positives. Conclusions: PlasForest outperforms other currently available tools on genomic datasets by being both sensitive and precise. The performance of PlasForest on metagenomic assemblies is currently well below that of other k-mer-based methods, and we discuss how homology-based approaches could improve plasmid detection in such datasets.
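
The F1 score quoted in this abstract is the harmonic mean of precision and recall. As a minimal sketch of that metric only (the function name and the counts below are illustrative assumptions, not PlasForest code or results), it can be computed from confusion counts as follows:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from confusion counts."""
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)  # fraction of predicted plasmids that are plasmids
    recall = tp / (tp + fn)     # fraction of real plasmids that were detected
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for contigs labelled "plasmid" (illustrative only).
example = f1_score(tp=95, fp=5, fn=5)
```

A tool with low sensitivity lowers recall, and a tool that calls chromosomes plasmids lowers precision; F1 penalizes both, which is why it suits the comparison described in the abstract.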

BREC: An R package/Shiny app for automatically identifying heterochromatin boundaries and estimating local recombination rates along chromosomes

bioRxiv (Cold Spring Harbor Laboratory), Jun 30, 2020

Background: Meiotic recombination is a vital biological process playing an essential role in the structural and functional dynamics of genomes. Genomes exhibit highly variable recombination profiles along chromosomes, associated with several chromatin states. However, eu-heterochromatin boundaries are neither available nor easily obtained for non-model organisms, especially newly sequenced ones. Hence, we lack the accurate local recombination rates necessary to address evolutionary questions. Results: Here, we propose an automated computational tool, based on the Marey maps method, that identifies heterochromatin boundaries along chromosomes and estimates local recombination rates. Our method, called BREC (heterochromatin Boundaries and RECombination rate estimates), is non-genome-specific, running even on non-model genomes as long as genetic and physical maps are available. BREC is based on pure statistics and is data-driven, implying that good input data quality remains a strong requirement. Therefore, a data pre-processing module (data quality control and cleaning) is provided. Experiments show that BREC handles variation in marker density and distribution. Conclusions: BREC's heterochromatin boundaries have been validated with cytological equivalents experimentally generated on the fruit fly Drosophila melanogaster genome, for which BREC returns congruent corresponding values. Also, BREC's recombination rates have been compared with previously reported estimates. Based on the promising results, we believe our tool has the potential to help bring data science into the service of genome biology and evolution. We introduce BREC within an R package and a user-friendly Shiny web application, yielding a fast, easy-to-use, and broadly accessible resource. The BREC R package is available at the GitHub repository https://github.com/GenomeStructureOrganization.

Relationship between AChE1 activity and insecticide resistance, or the number of R copies

PLOS Biology, 2016

(A) Boxplots present the distribution of AChE1R activity for [RR] individuals selected at low (0.02 mg/l, n = 30) and high (0.04 mg/l, n = 30) doses of chlorpyrifos methyl. ***: Student's t test, p < 0.001. (B) Regression analysis showing a significant positive relationship (GLM, *: p < 0.05) between AChE1R activity and the number of R copies (as the distribution by chromosome is unknown, the total number of R copies is given for each individual). Underlying data can be found in DRYAD: http://dx.doi.org/10.5061/dryad.4f7qg

Genomic structure of the 202.91 kb amplicon encompassing the ace-1 gene

PLOS Biology, Dec 5, 2016

Transposable sequence evolution is driven by gene context

arXiv (Cornell University), Sep 2, 2012

Background: Transposable elements (TEs) in eukaryote genomes are quantitatively the main components affecting genome size, structure and expression. The dynamics of their insertion and deletion depend on diverse factors varying in strength and nature along the genome. We address here how TE sequence evolution is affected by neighboring genes and the chromatin status (euchromatin or heterochromatin) at their insertion site. Results: We estimated ages of TE sequences in Arabidopsis thaliana, and found that they depend on the distance to the nearest genes: TEs located close to genes are older than those that are more distant. Consequently, TE sequences in heterochromatic regions, which are gene-poor, are surprisingly younger and longer than those elsewhere. Conclusions: We provide evidence for a biased TE age distribution close to genes. Interestingly, TE sequences in euchromatin and those in heterochromatin evolve at different rates, which could explain why TE sequences in heterochromatin tend to be younger and longer. We then revisit models of TE sequence dynamics and point out differences for TE-rich genomes, such as maize and wheat, compared to TE-poor genomes such as fly and A. thaliana.

Transposable element population dynamics in Drosophila melanogaster using next generation sequencing data

Work presented at the Annual Meeting of the Society for Molecular Biology and Evolution (SMBE ...

T-lex3: an accurate tool to genotype and estimate population frequencies of transposable elements using the latest short-read whole genome sequencing data

Bioinformatics, Oct 3, 2019

Motivation: Transposable elements (TEs) constitute a significant proportion of the majority of genomes sequenced to date. TEs are responsible for a considerable fraction of the genetic variation within and among species. Accurate genotyping of TEs in genomes is therefore crucial for a complete identification of the genetic differences among individuals, populations and species. Results: In this work, we present a new version of T-lex, a computational pipeline that accurately genotypes and estimates the population frequencies of reference TE insertions using short-read high-throughput sequencing data. In this new version, we have redesigned the T-lex algorithm to integrate the BWA-MEM short-read aligner, which is one of the most accurate short-read mappers and can be launched on longer short-reads (e.g. reads >150 bp). We have added new filtering steps to increase the accuracy of the genotyping, and new parameters that allow the user to control both the minimum and maximum number of reads, and the minimum number of strains to genotype a TE insertion. We also showed for the first time that T-lex3 provides accurate TE calls in a plant genome. Availability and implementation: To test the accuracy of T-lex3, we called 1630 individual TE insertions in Drosophila melanogaster, 1600 individual TE insertions in humans, and 3067 individual TE insertions in the rice genome. We showed that this new version of T-lex is a broadly applicable and accurate tool for genotyping and estimating TE frequencies in organisms with different genome sizes and different TE contents.

TrEMOLO: accurate transposable element allele frequency estimation using long-read sequencing data combining assembly and mapping-based approaches

Genome Biology, Apr 3, 2023

Transposable Element MOnitoring with LOng-reads (TrEMOLO) is a new software that combines assembly- and mapping-based approaches to robustly detect genetic elements called transposable elements (TEs). Using high- or low-quality genome assemblies, TrEMOLO can detect most TE insertions and deletions and estimate their allele frequency in populations. Benchmarking with simulated data revealed that TrEMOLO outperforms other state-of-the-art computational tools. TE detection and frequency estimation by TrEMOLO were validated using simulated and experimental datasets. Therefore, TrEMOLO is a comprehensive and suitable tool to accurately study TE dynamics. TrEMOLO is available under GNU GPL3.0 at https://github.com/DrosophilaGenomeEvolution/TrEMOLO.

Drosophila melanogaster recombination rate calculator

Gene, Sep 1, 2010

The ace-1 Locus Is Amplified in All Resistant Anopheles gambiae Mosquitoes: Fitness Consequences of Homogeneous and Heterogeneous Duplications

PLOS Biology, Dec 5, 2016

Gene copy-number variations are widespread in natural populations, but investigating their phenotypic consequences requires contemporary duplications under selection. Such duplications have been found at the ace-1 locus (encoding the organophosphate and carbamate insecticides' target) in the mosquito Anopheles gambiae (the major malaria vector); recent studies have revealed their intriguing complexity, consistent with the involvement of various numbers and types (susceptible or resistant to insecticide) of copies. We used an integrative approach, from genome to phenotype level, to investigate the influence of duplication architecture and gene-dosage on mosquito fitness. We found that both heterogeneous (i.e., one susceptible and one resistant ace-1 copy) and homogeneous (i.e., identical resistant copies) duplications segregated in field populations. The number of copies in homogeneous duplications was variable and positively correlated with acetylcholinesterase activity and resistance level. Determining the genomic structure of the duplicated region revealed that, in both types of duplication, ace-1 and 11 other genes formed tandem 203 kb amplicons. We developed a diagnostic test for duplications, which showed that ace-1 was amplified in all 173 resistant mosquitoes analyzed (field-collected in several African countries), in heterogeneous or homogeneous duplications. Each type was associated with different fitness trade-offs: heterogeneous duplications conferred an intermediate phenotype (lower resistance and fitness costs), whereas homogeneous duplications tended to increase both resistance and fitness cost, in a complex manner. The type of duplication selected thus seemed to depend on the intensity and distribution of selection pressures. This versatility of trade-offs available through gene duplication highlights the importance of large mutation events in adaptation to environmental variation. This impressive adaptability could have a major impact on vector control in Africa.

PlasForest: a homology-based random forest classifier for plasmid detection in genomic datasets

bioRxiv (Cold Spring Harbor Laboratory), Oct 7, 2020

Background: Plasmids are extra-chromosomal fragments of DNA that replicate autonomously in the host cell. They often carry genes that can provide a benefit under specific environmental conditions [1]. These mobile genetic elements remain a major biological concern for health and agriculture policies due to their ability to accumulate and spread resistance genes. Indeed, the frequency of plasmids, and of the resistance genes they carry, can increase quickly in populations thanks to their high mobility both within hosts (through ...

Abstract. Background: Plasmids are mobile genetic elements that often carry accessory genes, and are vectors for horizontal transfer between bacterial genomes. Plasmid detection in large genomic datasets is crucial to analyze their spread and quantify their role in bacterial adaptation, particularly in the propagation of antibiotic resistance. Bioinformatics methods have been developed to detect plasmids. However, they suffer from low sensitivity (i.e., most plasmids remain undetected) or low precision (i.e., these methods identify chromosomes as plasmids), and are overall not adapted to identify plasmids in whole genomes that are not fully assembled (contigs and scaffolds). Results: We developed PlasForest, a homology-based random forest classifier identifying bacterial plasmid sequences in partially assembled genomes. Without knowing the taxonomical origin of the samples, PlasForest identifies contigs as plasmids or chromosomes with an F1 score of 0.950. Notably, it can detect 77.4% of plasmid contigs below 1 kb with 2.8% false positives and 99.9% of plasmid contigs over 50 kb with 2.2% false positives. Conclusions: PlasForest outperforms other currently available tools on genomic datasets by being both sensitive and precise. The performance of PlasForest on metagenomic assemblies is currently well below that of other k-mer-based methods, and we discuss how homology-based approaches could improve plasmid detection in such datasets.

Structural variation turnovers and defective genomes: key drivers for the in vitro evolution of the large double-stranded DNA koi herpesvirus (KHV)

bioRxiv (Cold Spring Harbor Laboratory), Mar 10, 2022

Structural variations (SVs) constitute a significant source of genetic variability in virus genomes. Yet knowledge about SV variability and contribution to the evolutionary process in large double-stranded (ds)DNA viruses is limited. Cyprinid herpesvirus 3 (CyHV-3), also commonly known as koi herpesvirus (KHV), has the largest dsDNA genome within herpesviruses. This virus has become one of the biggest threats to common carp and koi farming, resulting in high morbidity and mortality of fish, serious environmental damage, and severe economic losses. A previous study analyzing CyHV-3 virulence evolution during serial passages on carp cell cultures suggested that CyHV-3 evolves, at least in vitro, through an assembly of haplotypes that alternately become dominant or under-represented. The present study investigates the SV diversity and dynamics in the CyHV-3 genome during 99 serial passages in cell culture using, for the first time, ultra-deep whole-genome and amplicon-based sequencing. The results indicate that KHV polymorphism mostly involves SVs. These SVs display a wide distribution along the genome and exhibit high turnover dynamics with a clear bias towards inversion and deletion events. Analysis of the pathogenesis-associated ORF150 region in ten intermediate cell passages highlighted mainly deletion, inversion and insertion variations that deeply altered the structure of ORF150. Our findings indicate that SV turnovers and defective genomes represent key drivers in the viral population dynamics and in vitro evolution of KHV. Thus, the present study can contribute to the basic research needed to design safe live-attenuated vaccines, classically obtained by viral attenuation after serial passages in cell culture.

BREC: an R package/Shiny app for automatically identifying heterochromatin boundaries and estimating local recombination rates along chromosomes

BMC Bioinformatics, Aug 6, 2021

Background: Meiotic recombination is a vital biological process playing an essential role in the structural and functional dynamics of genomes. Genomes exhibit highly variable recombination profiles along chromosomes, associated with several chromatin states. However, eu-heterochromatin boundaries are neither available nor easily obtained for non-model organisms, especially newly sequenced ones. Hence, we lack the accurate local recombination rates necessary to address evolutionary questions. Results: Here, we propose an automated computational tool, based on the Marey maps method, that identifies heterochromatin boundaries along chromosomes and estimates local recombination rates. Our method, called BREC (heterochromatin Boundaries and RECombination rate estimates), is non-genome-specific, running even on non-model genomes as long as genetic and physical maps are available. BREC is based on pure statistics and is data-driven, implying that good input data quality remains a strong requirement. Therefore, a data pre-processing module (data quality control and cleaning) is provided. Experiments show that BREC handles variation in marker density and distribution. Conclusions: BREC's heterochromatin boundaries have been validated with cytological equivalents experimentally generated on the fruit fly Drosophila melanogaster genome, for which BREC returns congruent corresponding values. Also, BREC's recombination rates have been compared with previously reported estimates. Based on the promising results, we believe our tool has the potential to help bring data science into the service of genome biology and evolution. We introduce BREC within an R package and a user-friendly Shiny web application, yielding a fast, easy-to-use, and broadly accessible resource. The BREC R package is available at the GitHub repository https://github.com/GenomeStructureOrganization.
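
A Marey map plots each marker's genetic position (cM) against its physical position (Mb), and the local recombination rate is the slope of that curve. The sketch below illustrates the idea with simple finite differences between consecutive markers; it is written in Python for illustration, the function name and marker values are assumptions, and it is not BREC's R implementation or its fitting procedure.

```python
def local_recombination_rates(physical_mb, genetic_cm):
    """Slope of the Marey map (cM/Mb) between consecutive markers.

    Returns a list of (interval midpoint in Mb, rate in cM/Mb) tuples.
    """
    markers = sorted(zip(physical_mb, genetic_cm))
    rates = []
    for (p0, g0), (p1, g1) in zip(markers, markers[1:]):
        if p1 == p0:
            continue  # skip markers that share a physical position
        rates.append(((p0 + p1) / 2, (g1 - g0) / (p1 - p0)))
    return rates


# Hypothetical markers on one chromosome (illustrative values only).
physical = [0.0, 1.0, 2.0, 4.0]   # Mb
genetic = [0.0, 2.0, 5.0, 6.0]    # cM
rates = local_recombination_rates(physical, genetic)
```

Finite differences are sensitive to noisy or sparse markers, which is consistent with the abstract's point that input data quality and marker density matter; a smoothed regression over the map would be the more robust choice in practice.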

Population Genomics of Transposable Elements in Drosophila melanogaster

Molecular Biology and Evolution, Dec 16, 2010

Transposable elements (TEs) are the primary contributors to the genome bulk in many organisms and are major players in genome evolution. A clear and thorough understanding of the population dynamics of TEs is therefore essential for full comprehension of eukaryotic genome evolution and function. Although TEs in Drosophila melanogaster have received much attention, the population dynamics of most TE families in this species remain entirely unexplored. It is not clear whether the same population processes can account for the population behaviors of all TEs in Drosophila or whether, as has been suggested previously, different orders behave according to very different rules. In this work, we analyzed population frequencies for a large number of individual TEs (755 TEs) in five North American and one sub-Saharan African D. melanogaster populations (75 strains in total). These TEs have been annotated in the reference D. melanogaster euchromatic genome and have been sampled from all three major orders (non-LTR, LTR, and TIR) and from all families with more than 20 TE copies (55 families in total). We find strong evidence that TEs in Drosophila across all orders and families are subject to purifying selection at the level of ectopic recombination. We show that the strength of this selection varies predictably with recombination rate, length of individual TEs, and copy number and length of other TEs in the same family. Importantly, these rules do not appear to vary across orders. Finally, we built a statistical model that considered only individual TE-level properties (such as the TE length) and family-level properties (such as the copy number) and were able to explain more than 40% of the variation in TE frequencies in D. melanogaster.

Study of repeat dynamics in eukaryotic genomes: from their formation to their elimination (Etude de la dynamique des repetitions dans les genomes eucaryotes: de leur formation a leur elimination)

who agreed to evaluate this thesis with great care, as well as to Vincent Colot

Illumina TruSeq Synthetic Long-Reads Empower De Novo Assembly and Resolve Complex, Highly-Repetitive Transposable Elements

PLOS ONE, Sep 4, 2014

High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, in part due to the presence of dispersed repeats which introduce ambiguity during genome reconstruction. Transposable elements (TEs) can be particularly problematic, especially for TE families exhibiting high sequence identity, high copy number, or complex genomic arrangements. While TEs strongly affect genome function and evolution, most current de novo assembly approaches cannot resolve long, identical, and abundant families of TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly-parallel library preparation and local assembly of short read data and which achieve lengths of 1.5-18.5 Kbp with an extremely low error rate (~0.03% per base). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain y; cn, bw, sp) achieving an N50 contig size of 69.7 Kbp and covering 96.9% of the euchromatic chromosome arms of the current reference genome. TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recovered and accurately placed 4,229 (77.8%) of the 5,434 annotated transposable elements with perfect identity to the current reference genome. As TEs are ubiquitous features of genomes of many species, TruSeq synthetic long-reads, and likely other methods that generate long reads, offer a powerful approach to improve de novo assemblies of whole genomes.
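
The N50 contig size reported in this abstract is the length L such that contigs of length at least L together cover at least half of the assembly. A minimal sketch of that statistic, using made-up contig lengths rather than data from the paper:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths in Kbp (illustrative only).
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

Because N50 weights contigs by length, a few long contigs can raise it even when many short ones remain, so it is usually read together with coverage figures such as the 96.9% quoted in the abstract.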

Transposable elements drive recent adaptation in Drosophila melanogaster

Work presented at the International Congress on Transposable Elements (ICTE 2016), held in ...

Finding and Characterizing Repeats in Plant Genomes

Methods in molecular biology (Clifton, N.J.), 2016

Plant genomes contain a particularly high proportion of repeated structures of various types. This chapter proposes a guided tour of available software that can help biologists to look for these repeats and check some hypothetical models intended to characterize their structures. Since transposable elements are a major source of repeats in plants, many methods have been used or developed for this large class of sequences. They are representative of the range of tools available for other classes of repeats and we have provided a whole section on this topic as well as a selection of the main existing software. In order to better understand how they work and how repeats may be efficiently found in genomes, it is necessary to look at the technical issues involved in the large-scale search of these structures. Indeed, it may be hard to keep up with the profusion of proposals in this dynamic field and the rest of the chapter is devoted to the foundations of the search for repeats and more...

Involving repetitive regions in scaffolding improvement

Journal of Bioinformatics and Computational Biology, Dec 17, 2021

In this paper, we investigate, through a preliminary study, the influence of repeat elements during the assembly process. We analyze the link between the presence and nature of one type of repeat element, the transposable element (TE), and misassembly events in genome assemblies. We propose to improve assemblies by taking into account the presence of repeat elements, including TEs, during the scaffolding step. We analyze the results and relate the misassemblies to TEs before and after correction.

Author Correction: Machine learning reveals bilateral distribution of somatic L1 insertions in human neurons and glia

Nature Neuroscience

In the version of the article originally published, Eric Courchesne was incorrectly listed as a c...