Dictionary matching and indexing with errors and don't cares
2004, Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
https://doi.org/10.1145/1007352.1007374Abstract
This paper considers various flavors of the following online problem: preprocess a text or collection of strings, so that given a query string p, all matches of p with the text can be reported quickly. In this paper we consider matches in which a bounded number of mismatches are allowed, or in which a bounded number of "don't care" characters are allowed. The specific problems we look at are: indexing, in which there is a single text t, and we seek locations where p matches a substring of t; dictionary queries, in which a collection of strings is given upfront, and we seek those strings which match p in their entirety; and dictionary matching, in which a collection of strings is given upfront, and we seek those substrings of a (long) p which match an original string in its entirety. These are all instances of an all-to-all matching problem, for which we provide a single solution. The performance bounds all have a similar character. For example, for the indexing problem with n = |t| and m = |p|, the query time for k substitutions is O(m + (c 1 log n) k k! + # matches), with a data structure of size O(n (c 2 log n) k k!) and a preprocessing time of O(n (c 2 log n) k k!), where c1, c2 > 1 are constants. The deterministic preprocessing assumes a weakly nonuniform RAM model; this assumption is not needed if randomization is used in the preprocessing.
References (41)
- REFERENCES
- K. Abrahamson. Generalized string matching. SIAM J. Computing, 16(6):1039-1051, 1987.
- A.V. Aho and M.J. Corasick. Efficient string matching. Comm. ACM, 18(6):333-340, 1975.
- T. Akutsu. Approximate string matching with don't care characters. Information Processing Letters, 55:235-239, 1995.
- T. Akutsu. Approximate string matching with variable length don't care characters. IEICE Trans. Information and Systems, E79-D:1353-1354, 1996.
- A. Amir and M. Farach. Adaptive dictionary matching. Proc. of the Symposium on Foundations of Computer Science, 1991, 760-766.
- A. Amir, M. Farach, R. Giancarlo, Z. Galil and K. Park. Dynamic dictionary matching. J. of Computer and System Sciences, 49(2):208-222, 1994.
- A. Amir, M. Farach, R.M. Idury, J.A. La Poutré, and A.A Schäffer. Improved dynamic dictionary matching. Information and Computation, 119(2):258-282, 1995.
- A. Amir, D. Keselman, G.M. Landau, N. Lewenstein, M. Lewenstein, and M. Rodeh. Text Indexing and dictionary matching with one error. J. of Algorithms, 37(2): 309-325, 2000.
- A. Amir, G. Landau, M. Lewenstein and D. Sokol. Dynamic pattern, static text matching. Submitted for journal publication; preliminary version, Proc. of Workshop on Algorithms and Data Structures, 2003, 340-352.
- A. Amir, M. Lewenstein and E. Porat. Faster algorithms for string matching with k mismatches. Proc. of the Symposium on Discrete Algorithms, 2000, 794-803.
- A. Amir, M. Lewenstein and E. Porat. Approximate subset matching with don't cares. Proc. of the Symposium on Discrete Algorithms, 2001, 279-288.
- G. S. Brodal and L. Gasieniec. Approximate dictionary queries. Proc. of the Symposium on Combinatorial Pattern Matching, 1996, 65-74. (LNCS 1075).
- G. S. Brodal and S. Venkatesh. Improved bounds for dictionary lookup with one error. Information Processing Letters, 75(1-2):57-59 (2000).
- A.L. Buchsbaum, M.T. Goodrich, J. Westbrook. Range searching over tree cross products. Proc. of the European Symposium on Algorithms, 2000, 120-131.
- R. Cole and R. Hariharan. Approximate string matching: A faster simpler algorithm. Proc. of the Symposium on Discrete Algorithms, 1998, 463-472.
- R. Cole and R. Hariharan. Verifying candidate matches in sparse and wildcard matching. Proc. of the Symposium on Theory of Computing, 2002, 592-601.
- M. Farach. Optimal suffix tree construction with large alphabets. Proc. of the Symposium on Foundations of Computer Science, 1997, 137-143.
- P. Ferragina, S. Muthukrishnan, and M. deBerg. Multi-method dispatching: a geometric approach with applications to string matching. Proc. of the Symposium on the Theory of Computing, 1999, 483-491.
- M.J. Fischer and M.S. Paterson. String matching and other products. Complexity of Computation, R.M. Karp (editor), SIAM-AMS Proceedings, 7:113-125, 1974.
- Z. Galil and R. Giancarlo. Improved string matching with k mismatches. SIGACT News, 17(4), 52-54, 1986.
- D. Greene, M. Parnas and F. Yao. Multi-index hashing for information retrieval. Proc. of the Symposium on the Foundations of Computer Science, 1994, 722-731.
- R. Grossi, J.S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. Proc. of the Symposium on Theory of Computing, 2000, 397-406.
- T. Hagerup, P. B. Miltersen, R. Pagh. Deterministic dictionaries. J. of Algorithms, 41(1):69-85, 2001.
- D. Harel, R.E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338-55, 1984.
- R.M. Idury and A.A Schäffer. Dynamic dictionary matching with failure functions. Proc. of the 3rd Annual Symposium on Combinatorial Pattern Matching, 1992, 273-284.
- P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. Proc. of the Symposium on Theory of Computing, 1998, 604-613.
- E. Kushilevitz, R. Ostrovsky, Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM Journal on Computing, 30(2):457-474, 2000.
- G. M. Landau and U. Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43, 239-249, 1986.
- G. M. Landau and U. Vishkin. Fast parallel and serial approximate string matching. J. of Algorithms, 10(2):157-169, 1989.
- U. Manber and S. Wu. An algorithm for approximate membership checking with applications to password security. Information Processing Letters, 50:92-197, 1994.
- E. M. McCreight. A space-economical suffix tree construction algorithm. J. of the ACM, 23:262-272, 1976.
- K. Mehlhorn. Data Structures and Efficient Algorithms, Vol. 1: Sorting and Searching, pp. 177-178. Springer Verlag, EATCS Monographs, 1984.
- M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge, Mass., 1969.
- S. C. Sahinalp and U. Vishkin. Efficient approximate and dynamic matching of patterns using a labeling paradigm. Proc. of the Symposium on Foundations of Computer Science, 1996, 320-328.
- B. Schieber and U. Vishkin. On finding lowest common ancestors: simplifications and parallelization. SIAM Journal on Computing, 17(6):1253-62, 1988.
- S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu, C.T. Lu. Spatial databases -accomplishments and research needs. TKDE, 11(1):45-55 (1999).
- E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-260, 1995.
- P. Weiner. Linear pattern matching algorithm. Proc. of the Symposium on Switching and Automata Theory, 1973, 1-11.
- D. E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(n). Information Processing Letters, 17(2):81-84, 1983.
- A. C.-C. Yao and F. F. Yao. Dictionary lookup with one error. J. of Algorithms, 25(1):194-202, 1997.