Mining Postal Addresses
2008
Abstract
This paper presents FuMaS (Fuzzy Matching System), a system for efficiently retrieving postal addresses from noisy queries. Fuzzy postal address retrieval has many possible applications, including data warehouse deduplication, correction of input forms, and integration within online street directories. The paper describes the system architecture along with a series of experiments performed using FuMaS. The experimental results show that FuMaS is effective at retrieving noisy postal addresses: it retrieves almost 85% of them, an improvement of 15% over the other systems tested in this set of experiments.