Mining Postal Addresses

Jose Maria Gomez Hidalgo

2008

Abstract

This paper presents FuMaS (Fuzzy Matching System), a system for the efficient retrieval of postal addresses from noisy queries. Fuzzy postal address retrieval has many possible applications, ranging from data warehouse deduplication to the correction of input forms and integration within online street directories. The paper presents the system architecture along with a series of experiments performed using FuMaS. The experimental results show that FuMaS is very useful for retrieving noisy postal addresses, retrieving almost 85% of them. This represents an improvement of 15% over the other systems tested in this set of experiments.

Figures (2)

Figure 1. Conceptual view of the FuMaS architecture (subfigure A) and a module decomposition by abstraction level (subfigure B). As can be seen, the FuMaS architecture is based on three abstraction levels, each specialized in one kind of entity (addresses, phrases and words).

After a small set of experiments, SOUNDEX was ruled out because it was designed for English names, and its similarity values are not meaningful for Spanish words. Cosine similarity uses only the number of words that the two input strings have in common, which is not very useful for this problem. Dice's similarity works similarly to cosine similarity and was ruled out for the same reason. The only two algorithms that really seem to fit this problem are therefore Levenshtein distance and q-grams. The first set of experiments evaluates how well each algorithm fits the given problem.

Figure 2. Comparison of relevant addresses retrieved by FuMaS, Uniserv and 4 street directories that integrate some techniques to correct user input. Dark bars show the results for recall 10 and light bars show the results for recall 1.
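As a rough illustration of the two measures retained in the text, Levenshtein distance and q-grams, the following Python sketch computes both for a pair of strings. It is not the FuMaS implementation. The function names, the padding character `#`, the default `q=3`, and the normalization of both measures to a similarity in [0, 1] (a Dice-style coefficient over q-gram multisets for the second one) are assumptions made only for this example.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def qgrams(text: str, q: int = 3) -> Counter:
    """Multiset of overlapping q-grams, padded with '#' (an assumption)
    so that the first and last characters weigh as much as the others."""
    padded = "#" * (q - 1) + text + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    """Assumed normalization: Dice-style coefficient over q-gram multisets."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 1.0
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    return 2.0 * shared / total


if __name__ == "__main__":
    # Hypothetical noisy query against a hypothetical stored street name.
    query, stored = "cale mayor", "calle mayor"
    print(levenshtein(query, stored))
    print(round(levenshtein_similarity(query, stored), 3))
    print(round(qgram_similarity(query, stored), 3))
```

Both functions work on characters rather than whole words, so a single typing error changes the score only slightly. This is the property that the word-count-based cosine and Dice measures lack.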
