Abstract
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it uses about the same memory as a binary search tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all ...
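The abstract does not spell out how the structure works, so the following is only a rough sketch under the usual description of a burst trie: an access trie whose leaves are small containers of records, where a container that grows too large is "burst" into a new trie node with smaller containers beneath it. The names `BurstTrie`, `Container`, `TrieNode`, and `BURST_LIMIT`, the dictionary-based containers, and the simple size-limit burst rule are all illustrative assumptions, not the paper's implementation or its recommended parameters.

```python
from typing import Dict, Iterator, Tuple, Union

BURST_LIMIT = 4  # assumed threshold: burst once a container holds more records than this


class Container:
    """Small bucket mapping the remaining suffix of each string to its counter."""

    def __init__(self) -> None:
        self.records: Dict[str, int] = {}


class TrieNode:
    """Access-trie node with one child slot per leading character."""

    def __init__(self) -> None:
        self.children: Dict[str, Union["TrieNode", Container]] = {}
        self.end_count = 0  # counter for the string that ends exactly at this node


class BurstTrie:
    def __init__(self, limit: int = BURST_LIMIT) -> None:
        self.root = TrieNode()
        self.limit = limit

    def add(self, word: str) -> None:
        """Insert one occurrence of word, bursting a container if it overflows."""
        node, i = self.root, 0
        while i < len(word):
            ch = word[i]
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = Container()
            if isinstance(child, Container):
                suffix = word[i + 1:]
                child.records[suffix] = child.records.get(suffix, 0) + 1
                if len(child.records) > self.limit:
                    node.children[ch] = self._burst(child)
                return
            node, i = child, i + 1
        node.end_count += 1

    @staticmethod
    def _burst(container: Container) -> TrieNode:
        """Replace a container by a trie node, redistributing records by next character."""
        new_node = TrieNode()
        for suffix, count in container.records.items():
            if not suffix:
                new_node.end_count = count
            else:
                sub = new_node.children.setdefault(suffix[0], Container())
                sub.records[suffix[1:]] = count
        return new_node

    def count(self, word: str) -> int:
        """Return the counter stored for word, or 0 if it was never added."""
        node = self.root
        for i, ch in enumerate(word):
            child = node.children.get(ch)
            if child is None:
                return 0
            if isinstance(child, Container):
                return child.records.get(word[i + 1:], 0)
            node = child
        return node.end_count

    def items(self) -> Iterator[Tuple[str, int]]:
        """Yield (word, counter) pairs in lexicographic order."""
        yield from self._walk(self.root, "")

    def _walk(self, node: TrieNode, prefix: str) -> Iterator[Tuple[str, int]]:
        if node.end_count:
            yield prefix, node.end_count
        for ch in sorted(node.children):
            child = node.children[ch]
            if isinstance(child, Container):
                # Containers are unsorted, so they are sorted only when traversed.
                for suffix in sorted(child.records):
                    yield prefix + ch + suffix, child.records[suffix]
            else:
                yield from self._walk(child, prefix + ch)
```

In this sketch the trie levels keep the containers in order by prefix, so a sorted traversal only has to sort the small containers themselves. The burst threshold is a parameter that trades trie size against container search cost.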