MergedTrie: Efficient textual indexing

PLOS ONE

https://doi.org/10.1371/JOURNAL.PONE.0215288

Abstract

Accessing and processing textual information (i.e. storing and querying a set of strings) is important for many current applications (e.g. information retrieval and social networks), especially in the fields of Big Data and IoT, which require handling very large string dictionaries. Typical data structures for textual indexing are Hash Tables and some variants of Tries, such as the Double Trie (DT). In this paper, we propose an extension of the DT that we have called MergedTrie. It improves the DT compression by merging both Tries into a single one and by segmenting the indexed term into two fixed-length parts in order to balance the new Trie. Thus, a higher overlap of both prefixes and suffixes is obtained. Moreover, we propose a new implementation of Tries that achieves better compression rates than the Double-Array representation usually chosen for implementing Tries. Our proposal also overcomes the limitation of static implementations, which do not allow insertions and updates in their compact representations. Finally, our MergedTrie implementation experimentally improves on the efficiency of Hash Tables, DTs, the Double-Array, the Crit-bit, Directed Acyclic Word Graphs (DAWG), and Acyclic Deterministic Finite Automata (ADFA) data structures, while requiring less space than the original text to be indexed.
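The idea of splitting each term into two parts and storing both parts in one shared Trie can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the class name `MergedTrieSketch`, the midpoint split rule, the dictionary-based nodes, and the `links` set that pairs the two parts are all assumptions made for illustration, and no compact representation is attempted.

```python
class Node:
    """Trie node; `links` holds the second-part nodes paired with this node."""
    __slots__ = ("children", "links")

    def __init__(self):
        self.children = {}   # character -> Node
        self.links = set()   # Node objects (hashed by identity)


class MergedTrieSketch:
    """Illustrative only: one Trie stores both halves of every term."""

    def __init__(self):
        self.root = Node()
        self.node_count = 1

    @staticmethod
    def _split(term):
        # Assumed split rule: cut at the midpoint and reverse the second
        # part, so that terms sharing a suffix share a path from the root.
        mid = (len(term) + 1) // 2
        return term[:mid], term[mid:][::-1]

    def _walk(self, part, create):
        node = self.root
        for ch in part:
            nxt = node.children.get(ch)
            if nxt is None:
                if not create:
                    return None
                nxt = Node()
                node.children[ch] = nxt
                self.node_count += 1
            node = nxt
        return node

    def insert(self, term):
        head, tail = self._split(term)
        head_node = self._walk(head, create=True)
        tail_node = self._walk(tail, create=True)
        head_node.links.add(tail_node)   # the link represents the term

    def __contains__(self, term):
        head, tail = self._split(term)
        head_node = self._walk(head, create=False)
        tail_node = self._walk(tail, create=False)
        return (head_node is not None and tail_node is not None
                and tail_node in head_node.links)


trie = MergedTrieSketch()
for word in ("testing", "resting"):
    trie.insert(word)
print("testing" in trie, "test" in trie, trie.node_count)
```

In this sketch both words reuse the same path for the reversed second part `gni`, which is the kind of suffix overlap the abstract refers to. A term is only reported as present when the link between its two end nodes exists, so two stored halves that were never inserted together are not matched.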
