Storage and Retrieval of Individual Genomes

2009, Lecture Notes in Computer Science

https://doi.org/10.1007/978-3-642-02008-7_9

Abstract

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of the suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on the suffix tree of the human genome. However, this reduction factor remains constant when more sequences are added to the collection. We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N/n. We believe the structures developed in this work will provide a fundamental basis for the storage and retrieval of individual genomes as they become available due to rapid progress in sequencing technologies.
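The claim that space can depend on n and s rather than on N can be made concrete with a toy experiment. The sketch below is not the index family described in the abstract; it only illustrates one well-known effect that such indexes can exploit, namely that the Burrows-Wheeler transform of a repetitive collection tends to contain few runs of equal symbols. Everything in it is an assumption made for illustration: the function names (`bwt`, `count_runs`, `make_collection`), the `#` separator, the `$` end marker, the substitution-only edit model, and the naive suffix sorting, which is only usable on very small inputs.

```python
import random


def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform of a text ending in a unique '$'."""
    # Sort suffix start positions; fine for a toy input, far too slow for genomes.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT symbol of a suffix is the character just before it (cyclically).
    return "".join(text[i - 1] for i in order)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal symbols in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])


def make_collection(base: str, copies: int, edits_per_copy: int,
                    rng: random.Random) -> str:
    """Concatenate `copies` variants of `base`, each with a few random substitutions."""
    parts = []
    for _ in range(copies):
        seq = list(base)
        for _ in range(edits_per_copy):
            seq[rng.randrange(len(seq))] = rng.choice("ACGT")
        parts.append("".join(seq) + "#")  # '#' separates the copies (assumption)
    return "".join(parts) + "$"  # unique end marker (assumption)


if __name__ == "__main__":
    rng = random.Random(0)
    base = "".join(rng.choice("ACGT") for _ in range(200))  # base length n = 200
    for copies in (1, 5, 20):
        text = make_collection(base, copies, edits_per_copy=1, rng=rng)
        # Columns: number of copies, total length N, number of BWT runs.
        print(copies, len(text), count_runs(bwt(text)))
```

Running the script prints, for each collection size, the total length N next to the number of BWT runs. On inputs like these the run count would be expected to grow much more slowly than N as copies are added, which is the intuition behind a space bound in terms of n and s.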
