The Design of Text Signatures for Text Retrieval Systems

Abstract

Signature files are one technique for indexing documents for full-text retrieval systems. This paper discusses two methods for generating text signatures: the word fragmentation and the pseudo-random generation techniques. The paper evaluates the effectiveness and efficiency of generating text signatures using these techniques. It also determines the optimal set of characteristics that define a text signature that is to be used for superimposed signature file indexes. The optimal set of characteristics can be used to create text signatures that minimise the number of false drops retrieved from the information system.

Keywords

Full-text retrieval; Searching; Signature Files; Superimposed coding; Text retrieval systems; Text signatures.

1. Introduction

A text retrieval system is characterised by two components. The text database consists of a collection of text documents. The documents can either be unstructured (that is, devoid of any of the traditional database field str...
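As an illustration of the superimposed coding idea named in the abstract, the following is a minimal sketch rather than the paper's method. It assumes a hash-based pseudo-random generator: SHA-256 is a stand-in chosen for this example, and the function names and the parameters `width` and `bits_per_word` are hypothetical. Each word sets a few bit positions, the word signatures are ORed together into a block signature, and a query matches when all of its bits are present in the block signature.

```python
import hashlib


def word_signature(word: str, width: int = 64, bits_per_word: int = 3) -> int:
    """Return a width-bit signature with bits_per_word distinct bits set.

    Bit positions are derived from a hash of the word (an assumed stand-in
    for a pseudo-random generator seeded by the word).
    """
    if not 0 < bits_per_word <= width:
        raise ValueError("bits_per_word must be between 1 and width")
    signature = 0
    counter = 0
    # Keep hashing with a changing counter until enough distinct bits are set.
    while bin(signature).count("1") < bits_per_word:
        data = f"{word.lower()}:{counter}".encode("utf-8")
        digest = hashlib.sha256(data).digest()
        position = int.from_bytes(digest[:4], "big") % width
        signature |= 1 << position
        counter += 1
    return signature


def block_signature(words, width: int = 64, bits_per_word: int = 3) -> int:
    """Superimpose (bitwise OR) the signatures of all words in a text block."""
    signature = 0
    for word in words:
        signature |= word_signature(word, width, bits_per_word)
    return signature


def may_contain(block_sig: int, query_sig: int) -> bool:
    """True if every bit of the query signature is set in the block signature.

    A True result is only a candidate match: the block must still be checked
    against the text, because other words may have set the same bits.
    """
    return (block_sig & query_sig) == query_sig


block = block_signature(["signature", "file", "index"])
print(may_contain(block, word_signature("file")))
```

A false drop is a block for which `may_contain` returns true although the block does not contain the query term. In this sketch the rate of such matches depends on `width`, `bits_per_word`, and the number of words superimposed per block, which is the kind of trade-off the abstract refers to when it speaks of an optimal set of signature characteristics.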
