Academia.eduAcademia.edu

Outline

Dictionary matching and indexing with errors and don't cares

2004, Proceedings of the thirty-sixth annual ACM symposium on Theory of computing

https://doi.org/10.1145/1007352.1007374

Abstract

This paper considers various flavors of the following online problem: preprocess a text or collection of strings, so that given a query string p, all matches of p with the text can be reported quickly. In this paper we consider matches in which a bounded number of mismatches are allowed, or in which a bounded number of "don't care" characters are allowed. The specific problems we look at are: indexing, in which there is a single text t, and we seek locations where p matches a substring of t; dictionary queries, in which a collection of strings is given upfront, and we seek those strings which match p in their entirety; and dictionary matching, in which a collection of strings is given upfront, and we seek those substrings of a (long) p which match an original string in its entirety. These are all instances of an all-to-all matching problem, for which we provide a single solution. The performance bounds all have a similar character. For example, for the indexing problem with n = |t| and m = |p|, the query time for k substitutions is O(m + (c 1 log n) k k! + # matches), with a data structure of size O(n (c 2 log n) k k!) and a preprocessing time of O(n (c 2 log n) k k!), where c1, c2 > 1 are constants. The deterministic preprocessing assumes a weakly nonuniform RAM model; this assumption is not needed if randomization is used in the preprocessing.

References (41)

  1. REFERENCES
  2. K. Abrahamson. Generalized string matching. SIAM J. Computing, 16(6):1039-1051, 1987.
  3. A.V. Aho and M.J. Corasick. Efficient string matching. Comm. ACM, 18(6):333-340, 1975.
  4. T. Akutsu. Approximate string matching with don't care characters. Information Processing Letters, 55:235-239, 1995.
  5. T. Akutsu. Approximate string matching with variable length don't care characters. IEICE Trans. Information and Systems, E79-D:1353-1354, 1996.
  6. A. Amir and M. Farach. Adaptive dictionary matching. Proc. of the Symposium on Foundations of Computer Science, 1991, 760-766.
  7. A. Amir, M. Farach, R. Giancarlo, Z. Galil and K. Park. Dynamic dictionary matching. J. of Computer and System Sciences, 49(2):208-222, 1994.
  8. A. Amir, M. Farach, R.M. Idury, J.A. La Poutré, and A.A Schäffer. Improved dynamic dictionary matching. Information and Computation, 119(2):258-282, 1995.
  9. A. Amir, D. Keselman, G.M. Landau, N. Lewenstein, M. Lewenstein, and M. Rodeh. Text Indexing and dictionary matching with one error. J. of Algorithms, 37(2): 309-325, 2000.
  10. A. Amir, G. Landau, M. Lewenstein and D. Sokol. Dynamic pattern, static text matching. Submitted for journal publication; preliminary version, Proc. of Workshop on Algorithms and Data Structures, 2003, 340-352.
  11. A. Amir, M. Lewenstein and E. Porat. Faster algorithms for string matching with k mismatches. Proc. of the Symposium on Discrete Algorithms, 2000, 794-803.
  12. A. Amir, M. Lewenstein and E. Porat. Approximate subset matching with don't cares. Proc. of the Symposium on Discrete Algorithms, 2001, 279-288.
  13. G. S. Brodal and L. Gasieniec. Approximate dictionary queries. Proc. of the Symposium on Combinatorial Pattern Matching, 1996, 65-74. (LNCS 1075).
  14. G. S. Brodal and S. Venkatesh. Improved bounds for dictionary lookup with one error. Information Processing Letters, 75(1-2):57-59 (2000).
  15. A.L. Buchsbaum, M.T. Goodrich, J. Westbrook. Range searching over tree cross products. Proc. of the European Symposium on Algorithms, 2000, 120-131.
  16. R. Cole and R. Hariharan. Approximate string matching: A faster simpler algorithm. Proc. of the Symposium on Discrete Algorithms, 1998, 463-472.
  17. R. Cole and R. Hariharan. Verifying candidate matches in sparse and wildcard matching. Proc. of the Symposium on Theory of Computing, 2002, 592-601.
  18. M. Farach. Optimal suffix tree construction with large alphabets. Proc. of the Symposium on Foundations of Computer Science, 1997, 137-143.
  19. P. Ferragina, S. Muthukrishnan, and M. deBerg. Multi-method dispatching: a geometric approach with applications to string matching. Proc. of the Symposium on the Theory of Computing, 1999, 483-491.
  20. M.J. Fischer and M.S. Paterson. String matching and other products. Complexity of Computation, R.M. Karp (editor), SIAM-AMS Proceedings, 7:113-125, 1974.
  21. Z. Galil and R. Giancarlo. Improved string matching with k mismatches. SIGACT News, 17(4), 52-54, 1986.
  22. D. Greene, M. Parnas and F. Yao. Multi-index hashing for information retrieval. Proc. of the Symposium on the Foundations of Computer Science, 1994, 722-731.
  23. R. Grossi, J.S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. Proc. of the Symposium on Theory of Computing, 2000, 397-406.
  24. T. Hagerup, P. B. Miltersen, R. Pagh. Deterministic dictionaries. J. of Algorithms, 41(1):69-85, 2001.
  25. D. Harel, R.E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338-55, 1984.
  26. R.M. Idury and A.A Schäffer. Dynamic dictionary matching with failure functions. Proc. of the 3rd Annual Symposium on Combinatorial Pattern Matching, 1992, 273-284.
  27. P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. Proc. of the Symposium on Theory of Computing, 1998, 604-613.
  28. E. Kushilevitz, R. Ostrovsky, Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM Journal on Computing, 30(2):457-474, 2000.
  29. G. M. Landau and U. Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43, 239-249, 1986.
  30. G. M. Landau and U. Vishkin. Fast parallel and serial approximate string matching. J. of Algorithms, 10(2):157-169, 1989.
  31. U. Manber and S. Wu. An algorithm for approximate membership checking with applications to password security. Information Processing Letters, 50:92-197, 1994.
  32. E. M. McCreight. A space-economical suffix tree construction algorithm. J. of the ACM, 23:262-272, 1976.
  33. K. Mehlhorn. Data Structures and Efficient Algorithms, Vol. 1: Sorting and Searching, pp. 177-178. Springer Verlag, EATCS Monographs, 1984.
  34. M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge, Mass., 1969.
  35. S. C. Sahinalp and U. Vishkin. Efficient approximate and dynamic matching of patterns using a labeling paradigm. Proc. of the Symposium on Foundations of Computer Science, 1996, 320-328.
  36. B. Schieber and U. Vishkin. On finding lowest common ancestors: simplifications and parallelization. SIAM Journal on Computing, 17(6):1253-62, 1988.
  37. S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu, C.T. Lu. Spatial databases -accomplishments and research needs. TKDE, 11(1):45-55 (1999).
  38. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-260, 1995.
  39. P. Weiner. Linear pattern matching algorithm. Proc. of the Symposium on Switching and Automata Theory, 1973, 1-11.
  40. D. E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(n). Information Processing Letters, 17(2):81-84, 1983.
  41. A. C.-C. Yao and F. F. Yao. Dictionary lookup with one error. J. of Algorithms, 25(1):194-202, 1997.