Text sparsification via local maxima

2003, Theoretical Computer Science

https://doi.org/10.1016/S0304-3975(03)00142-7

Abstract

In this paper we investigate some properties and algorithms related to a text sparsification technique based on the identification of local maxima in the given string. As the number of local maxima depends on the order assigned to the alphabet symbols, we first consider the case in which the order can be chosen arbitrarily. We show that finding an order that minimizes the number of local maxima in the given text string is an NP-hard problem. Then, we consider the case in which the order is fixed a priori. Even though the order is not necessarily optimal, we can exploit the property that the average number of local maxima induced by the order in an arbitrary text is approximately one third of the text length. In particular, we describe how to repeat the selection of local maxima for one or more iterations, so as to obtain a sparsified text. We show how to use this technique to filter the access to unstructured texts, which appear to have no natural division into words. Finally, we experimentally show that our approach can be successfully used to create a space-efficient index for searching sufficiently long patterns in a DNA sequence as quickly as a full index.
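
The dependence of the number of local maxima on the alphabet order can be illustrated with a minimal Python sketch. It is not the paper's implementation. The abstract does not state the exact definition of a local maximum, so the sketch assumes that a position counts when its symbol ranks at least as high as its left neighbour and strictly higher than its right neighbour. The names `local_maxima` and `best_order_bruteforce` are hypothetical. The brute-force search tries every permutation of the alphabet, which is feasible only for tiny alphabets and is consistent with the minimization problem being NP-hard.

```python
from itertools import permutations


def local_maxima(text, order):
    """Return the 0-based positions of local maxima of `text` under `order`.

    `order` lists the alphabet symbols from smallest to largest.
    Assumed definition (not taken from the abstract): position i is a local
    maximum when rank[text[i-1]] <= rank[text[i]] > rank[text[i+1]].
    The first and last positions are never selected in this sketch.
    """
    rank = {symbol: r for r, symbol in enumerate(order)}
    return [
        i
        for i in range(1, len(text) - 1)
        if rank[text[i - 1]] <= rank[text[i]] > rank[text[i + 1]]
    ]


def best_order_bruteforce(text):
    """Try every alphabet order and return (order, number_of_local_maxima).

    Exponential in the alphabet size; only meant for very small examples.
    """
    alphabet = sorted(set(text))
    best = min(permutations(alphabet), key=lambda o: len(local_maxima(text, o)))
    return best, len(local_maxima(text, best))


if __name__ == "__main__":
    text = "CACAB"
    print(local_maxima(text, "ABC"))  # order A < B < C
    print(local_maxima(text, "CBA"))  # order C < B < A
    print(best_order_bruteforce(text))
```

On the string `CACAB`, the order A < B < C selects one position while the reversed order selects two, which shows why the choice of order matters for how sparse the result is.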
