Postal Address Detection from Web Documents
2005
https://doi.org/10.1109/WIRI.2005.28…
6 pages
Abstract
An approach to postal address detection from webpages is proposed. The webpages are first segmented into text blocks based on their visual similarity. The text content of each block then undergoes a recognition process that employs a syntactic approach: grammars covering almost all possible patterns of postal addresses are built for this purpose. The results of our preliminary experiments on 44 webpages containing 56 true addresses show that our approach detects postal addresses with high precision (89.3%) and a low false-alarm rate (3.8%).
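To make the syntactic idea concrete, here is a minimal Python sketch that encodes one production of a US-style address grammar as a regular expression. The terminals, the pattern set, and the example address are illustrative assumptions, not the grammars actually built in the paper.

import re

# Illustrative terminals of a US-style address grammar; the paper's
# actual grammars cover far more patterns and are not reproduced here.
STREET_TYPE = r"(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Boulevard|Dr|Drive|Ln|Lane)"
STATE = r"[A-Z]{2}"
ZIP = r"\d{5}(?:-\d{4})?"

# One production of the grammar:
#   ADDRESS -> NUMBER STREET_NAME STREET_TYPE "," CITY "," STATE ZIP
ADDRESS = re.compile(
    rf"\b\d+\s+(?:[A-Z][a-z]+\s?)+{STREET_TYPE}\.?,\s*"
    rf"(?:[A-Z][a-z]+\s?)+,\s*{STATE}\s+{ZIP}\b"
)

def detect_addresses(text_block: str) -> list[str]:
    """Return candidate postal addresses found in one text block."""
    return [m.group(0) for m in ADDRESS.finditer(text_block)]

print(detect_addresses("Visit us at 221 Baker Street, Springfield, IL 62704."))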
Related papers
SPIE Proceedings, 2005
This paper presents the implementation and evaluation of a Hidden Markov Model for extracting addresses from OCR text. Although Hidden Markov Models discover addresses with high precision and recall, this type of information extraction task appears to be negatively affected by the noise present in OCR output.
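As a rough illustration of HMM-based address extraction, the sketch below runs Viterbi decoding over a toy two-state model (ADDR vs. OTHER). All states, observation classes, and probabilities are hand-picked assumptions for demonstration, not the paper's trained model.

import numpy as np

states = ["OTHER", "ADDR"]
obs_classes = {"word": 0, "number": 1, "zip": 2}

# Log-probabilities of an assumed toy model.
start = np.log([0.9, 0.1])
trans = np.log([[0.8, 0.2],    # P(next state | OTHER)
                [0.3, 0.7]])   # P(next state | ADDR)
emit = np.log([[0.90, 0.08, 0.02],   # P(obs class | OTHER)
               [0.40, 0.35, 0.25]])  # P(obs class | ADDR)

def viterbi(obs: list[int]) -> list[str]:
    """Most likely state sequence for a sequence of observation classes."""
    v = start + emit[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = v[:, None] + trans        # scores[i, j]: end in j coming from i
        back.append(scores.argmax(axis=0))
        v = scores.max(axis=0) + emit[:, o]
    path = [int(v.argmax())]
    for ptr in reversed(back):             # follow back-pointers
        path.append(int(ptr[path[-1]]))
    return [states[s] for s in reversed(path)]

tokens = ["Contact", "123", "Main", "St", "90210"]
obs = [obs_classes["zip" if len(t) == 5 and t.isdigit() else
                   "number" if t.isdigit() else "word"] for t in tokens]
print(list(zip(tokens, viterbi(obs))))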
2006
Although there are now working systems for sorting mail under certain constraints, segmenting gray-level images of envelopes and locating address blocks in them is still a difficult problem. Pattern recognition research has contributed greatly to this area, since the problem involves feature design, extraction, and recognition, as well as image segmentation when one deals with the original gray-level images from the start. This paper presents a segmentation and address-block location algorithm based on feature selection in wavelet space. The aim is to automatically separate the regions of postal envelopes corresponding to background, stamps, rubber stamps, and address blocks. First, a typical image of a postal envelope is decomposed using the Mallat algorithm with the Haar basis. High-frequency channel outputs are analyzed to locate salient points and separate out the background. A statistical hypothesis test is applied to select the more consistent regions and clean out remaining noise. The selected points are projected back onto the original gray-level image, where the evidence from wavelet space seeds a growing process that includes the pixels most likely to belong to the regions of stamps, rubber stamps, and the written area. Besides the new features and the growing process controlled by salient points, a fully comprehensive experimental setup was run: blocks in the envelopes were separated and classified, and the results were validated by a pixel-to-pixel accuracy measure against a ground-truth database of 2,200 images with different layouts and backgrounds. The success rate achieved for address-block location is over 90%.
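The following minimal sketch shows the salient-point idea under stated assumptions: one level of the Mallat decomposition with the Haar basis via the PyWavelets package (assumed available), followed by a simple mean-plus-k-sigma threshold on the combined high-frequency channels. The paper uses a statistical hypothesis test rather than this ad hoc threshold.

import numpy as np
import pywt  # PyWavelets, assumed available

def salient_point_mask(gray: np.ndarray, k: float = 2.5) -> np.ndarray:
    """Boolean mask of salient points in wavelet space (illustrative rule)."""
    cA, (cH, cV, cD) = pywt.dwt2(gray.astype(float), "haar")
    energy = np.abs(cH) + np.abs(cV) + np.abs(cD)   # high-frequency energy
    thresh = energy.mean() + k * energy.std()       # stand-in for the paper's test
    return energy > thresh

# Placeholder envelope image; a real pipeline would load a scanned envelope.
envelope = np.random.default_rng(0).integers(0, 256, (256, 512))
mask = salient_point_mask(envelope)
print("salient points:", int(mask.sum()), "of", mask.size)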
Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., 2004
A method for extracting a recipient address block from a mail image has been developed. The method is composed of two steps: nomination of address-block candidates and evaluation of these candidates using the Bayesian rule according to each address-block type. The proposed method can therefore cope with various types of address blocks. Its effectiveness was confirmed in several address extraction experiments, which show that the top-five extraction results include the correct address block in 94% of printed-mail cases and 89% of handwritten-mail cases.
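A minimal sketch of the Bayesian candidate evaluation, assuming SciPy is available: each candidate block is scored with an unnormalized posterior per address-block type. The two types, their priors, and the Gaussian per-feature likelihoods are illustrative assumptions, not the paper's model.

from scipy.stats import norm  # SciPy assumed available

# Assumed block types with priors and (mean, std) likelihoods for two
# features: aspect ratio and normalized vertical position on the envelope.
TYPES = {
    "printed":     {"prior": 0.6, "aspect": (4.0, 1.0), "y": (0.6, 0.15)},
    "handwritten": {"prior": 0.4, "aspect": (3.0, 1.5), "y": (0.5, 0.25)},
}

def posterior(candidate, params):
    """Unnormalized P(type | candidate) = P(candidate | type) * P(type)."""
    p = params["prior"]
    p *= norm.pdf(candidate["aspect"], *params["aspect"])  # aspect-ratio likelihood
    p *= norm.pdf(candidate["y"], *params["y"])            # position likelihood
    return p

candidates = [{"aspect": 3.8, "y": 0.62}, {"aspect": 1.2, "y": 0.10}]
for c in candidates:
    scores = {t: posterior(c, prm) for t, prm in TYPES.items()}
    best = max(scores, key=scores.get)
    print(c, "->", best, round(scores[best], 4))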
Pattern Recognition, 2010
Keywords: multi-script postal address block; script identification for numerals; quad-tree based feature extraction; handwritten numeral recognition; support vector machine.
Abstract: Recognition of numeric postal codes in a multi-script environment is a classical problem in any postal automation system. In such postal documents, determining the script of handwritten postal codes is crucial for subsequently invoking the digit recognizers for the respective scripts. The current framework attempts to infer the script of the numeric postal code without any bias from the script of the textual part of the rest of the address block, as the two might differ in a potential multi-script environment. The scope of the current work is to recognize postal codes written in any of four popular scripts, viz. Latin, Devanagari, Bangla, and Urdu. For this purpose, we first implement a Hough transformation based technique to localize the postal-code blocks in structured postal documents with a defined address-block region. Isolated handwritten digit patterns are then extracted from the localized postal-code region. In the next stage of the developed framework, similarly shaped digit patterns of the said four scripts are grouped into 25 clusters. A script-independent unified pattern classifier is then designed to classify the numeric postal codes into one of these 25 clusters. Based on these classification decisions, a rule-based script inference engine infers the script of the numeric postal code. One of the four script-specific classifiers is subsequently invoked to recognize the digit patterns of the corresponding script. A novel quad-tree based image partitioning technique is also developed in this work for effective feature extraction from the numeric digit patterns. The average recognition accuracy over ten-fold cross-validation for the support vector machine (SVM) based 25-class unified pattern classifier is 92.03%. With randomly selected six-digit numeric strings of the four scripts, an average script inference accuracy of 96.72% is achieved. The average ten-fold cross-validation recognition accuracies of the individual SVM classifiers for Latin, Devanagari, Bangla, and Urdu numerals are 95.55%, 95.63%, 97.15%, and 96.20%, respectively.
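A minimal sketch of quad-tree feature extraction: the digit image is recursively split into quadrants and the foreground density of every node is recorded as a feature. The recursion depth and the density feature are assumptions; the paper's partitioning rule may differ.

import numpy as np

def quadtree_features(img: np.ndarray, depth: int = 2) -> list[float]:
    """Foreground density of each quad-tree node, root first."""
    feats = [float(img.mean())]            # density of the current node
    if depth == 0:
        return feats
    h, w = img.shape[0] // 2, img.shape[1] // 2
    for quad in (img[:h, :w], img[:h, w:], img[h:, :w], img[h:, w:]):
        feats += quadtree_features(quad, depth - 1)
    return feats

# Placeholder binarized digit image; vectors like x would feed the SVMs.
digit = (np.random.default_rng(1).random((32, 32)) > 0.7).astype(float)
x = quadtree_features(digit)               # 1 + 4 + 16 = 21 features at depth 2
print(len(x), x[:3])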
Engineering Applications of Artificial Intelligence, 2016
The World Wide Web is growing at an ever-increasing rate, and the resources available online represent a large source of knowledge for various business and research interests. For instance, over the past years, increasing attention has been focused on retrieving information related to the geographical location of places and entities, which is largely contained in web pages and documents. However, such resources come in a wide variety of generally unstructured formats, which does not help end users find the desired information. The automatic annotation and comprehension of toponyms, location names, and addresses (at different resolution and granularity levels) can deliver significant benefits to the whole web community by improving search engines' filtering capabilities and intelligent data mining systems. The present paper addresses the problem of gathering geographical information from unstructured text in web pages and documents. Specifically, the proposed method aims at extracting the geographical location (at street-number resolution) of commercial companies and services by annotating geo-related information from their web domains. The annotation process is based on Natural Language Processing (NLP) techniques for text comprehension, and relies on Pattern
2008
This paper presents FuMaS (Fuzzy Matching System), a system capable of efficiently retrieving postal addresses from noisy queries. Fuzzy postal address retrieval has many possible applications, ranging from data warehouse de-duplication to the correction of input forms and integration within online street directories. This paper presents the system architecture along with a series of experiments performed with FuMaS. The experimental results show that FuMaS is very useful for retrieving noisy postal addresses, recovering almost 85% of them; this represents an improvement of 15% over the other systems tested in this set of experiments.
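As a toy stand-in for fuzzy address retrieval, the sketch below ranks a small reference list by string similarity to a noisy query using Python's standard library. FuMaS' actual matching pipeline is more elaborate; the address list and similarity measure are assumptions for illustration only.

from difflib import SequenceMatcher  # stdlib stand-in for fuzzy matching

ADDRESSES = [
    "Calle Mayor 12, 28013 Madrid",
    "Gran Via 45, 28013 Madrid",
    "Calle de Alcala 120, 28009 Madrid",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k reference addresses most similar to a noisy query."""
    ranked = sorted(
        ADDRESSES,
        key=lambda a: SequenceMatcher(None, query.lower(), a.lower()).ratio(),
        reverse=True,
    )
    return ranked[:k]

print(retrieve("calle maior 12 madrid"))  # tolerates the misspelling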
Ingeniería, 2023
Context: This article proposes the use of regular grammars as a strategy for validating the textual structure of email addresses. It focuses on the RFC 5321 standard and its syntax, formalizing regular grammars and applying production rules with the aim of validating the syntactic context of each structure of an email address. Method: This article presents a literature review and the development of an email validation model. Related work focuses on the Internet Protocol, along with building automata that apply the IPv4 protocol. There are three phases: development of the model from syntax and regular grammar rules, its construction, and its application. Results: The result is a functional application that validates email addresses based on regular grammars and existing standards. In efficiency tests, our application obtained a higher email validation margin than JFLAP. The library can serve as a strong analyzer of grammatical or lexical structures. Conclusions: The email validation tool based on regular grammars contributes to the practical use of specialized algorithms in computer science, since it can be applied to the recognition of search patterns such as the analysis of lexical structures (e.g., NITs, alphanumeric codes, and valid URLs).
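Since regular grammars and regular expressions describe the same language class, a rough Python analogue of such a validator is a regular expression approximating RFC 5321's Mailbox rule. The sketch below deliberately simplifies: it accepts only dot-string local parts and multi-label domains, omitting quoted strings, address literals, and single-label domains that the full standard allows.

import re

# atext characters from RFC 5321/5322 (simplified validator, see above).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
LOCAL = rf"{ATEXT}+(?:\.{ATEXT}+)*"                   # Dot-string local part
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"   # one domain label
MAILBOX = re.compile(rf"^{LOCAL}@{LABEL}(?:\.{LABEL})+$")

for addr in ["user.name@example.com", ".bad@example.com", "a@b"]:
    print(addr, "->", bool(MAILBOX.match(addr)))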
2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016
Automatic semantic annotation of data from databases or the web is an important pre-process for data cleansing and record linkage. It can be used to resolve the problem of imperfect field alignment in a database or to identify comparable fields for matching records from multiple sources. The annotation process is not trivial because data values may be noisy, containing abbreviations, variations, or misspellings. In particular, overlapping features usually exist in a lexicon-based approach. In this work, we present a probabilistic address parser based on linear-chain conditional random fields (CRFs), which allow more expressive token-level features than hidden Markov models (HMMs). In addition, we propose two general enhancement techniques to improve performance. One takes the original semi-structure of the data into account. The other post-processes the output sequences of the parser by combining its conditional probability with a score function based on a learned stochastic regular grammar (SRG) that captures segment-level dependencies. Experiments compared the CRF parser to an HMM parser and a semi-Markov CRF parser on two real-world datasets. The CRF parser outperformed both on each dataset in terms of classification accuracy. Leveraging the structure of the data and combining the linear-chain CRF with the SRG further improved the parser, achieving an accuracy of 97% on a postal dataset and 96% on a company dataset.
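A minimal sketch of a linear-chain CRF address parser with token-level features, assuming the third-party sklearn-crfsuite package is installed; the feature set, labels, and tiny training example are illustrative, not the paper's.

import sklearn_crfsuite  # pip install sklearn-crfsuite (assumed available)

def features(tokens, i):
    """Token-level features for position i, including neighbor context."""
    t = tokens[i]
    return {
        "lower": t.lower(),
        "is_digit": t.isdigit(),
        "is_title": t.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# Toy training data; a real parser would train on many labeled addresses.
train_tokens = [["12", "High", "St", "Sydney", "2000"]]
train_labels = [["NUMBER", "STREET", "STREET", "CITY", "POSTCODE"]]

X = [[features(s, i) for i in range(len(s))] for s in train_tokens]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_labels)

test = ["45", "George", "St", "Newcastle", "2300"]
print(crf.predict([[features(test, i) for i in range(len(test))]]))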
2004
With the explosive growth of the World Wide Web, millions of documents are published and accessed online. Statistics show that a significant part of Web text information is encoded in Web images. Since Web images have special characteristics that distinguish them from other types of images, commercial OCR products often fail to recognize them. This paper proposes a novel Web image processing algorithm that locates text areas and prepares them for an OCR procedure with better results. Our methodology for text area identification has been fully integrated with an OCR engine and with an Information Extraction system. We present quantitative results for the performance of the OCR engine as well as qualitative results concerning its effects on the Information Extraction system. Experimental results obtained from a large corpus of Web images demonstrate the efficiency of our methodology.
