Postal Address Detection from Web Documents

2005

https://doi.org/10.1109/WIRI.2005.28

Abstract

An approach to detecting postal addresses in webpages is proposed. Each webpage is first segmented into text blocks based on visual similarity. The text in each block then undergoes a recognition process that employs a syntactic approach, for which grammars covering almost all possible postal address patterns are built. Preliminary experiments on 44 webpages containing 56 true addresses show that the approach detects postal addresses with high precision (89.3%) and a low false-alarm rate (3.8%).
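The recognition step can be illustrated with a minimal sketch. It is not the authors' grammar, which the abstract does not spell out. It assumes the page has already been segmented into text blocks (plain strings here) and uses a regular expression as a simple stand-in for a syntactic grammar covering a single US-style pattern. The names `ADDRESS_RE`, `STREET_SUFFIX`, and `detect_addresses` are hypothetical.

```python
import re

# Hypothetical toy grammar for ONE US-style pattern (an assumption, not the
# paper's grammar):
#   address -> number street_name suffix "," city "," state zip
STREET_SUFFIX = r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln)\.?"
STATE = r"[A-Z]{2}"
ZIP = r"\d{5}(?:-\d{4})?"

ADDRESS_RE = re.compile(
    rf"\b\d{{1,5}}\s+"                  # street number
    rf"(?:[A-Z][a-z]+\s+){{1,3}}"       # one to three capitalized street-name words
    rf"{STREET_SUFFIX},?\s+"            # street suffix, optional comma
    rf"(?:[A-Z][a-z]+\s?){{1,3}},\s*"   # city of one to three words, then a comma
    rf"{STATE}\s+{ZIP}\b"               # two-letter state and ZIP or ZIP+4
)


def detect_addresses(blocks):
    """Return (block_index, matched_text) for each address-like span."""
    hits = []
    for index, block in enumerate(blocks):
        # Collapse line breaks and repeated spaces so a pattern can span lines.
        text = " ".join(block.split())
        for match in ADDRESS_RE.finditer(text):
            hits.append((index, match.group(0)))
    return hits


if __name__ == "__main__":
    # Stand-ins for text blocks produced by the segmentation step.
    sample_blocks = [
        "Contact us: 123 Main Street, Springfield, IL 62704",
        "Copyright 2005 Example Corp. All rights reserved.",
        "Visit 45 Oak Tree Ave, San Jose, CA 95112-1234 today",
    ]
    for index, address in detect_addresses(sample_blocks):
        print(index, address)
```

Matching block by block keeps a candidate address from straddling unrelated page regions, which is the role the segmentation step plays in the description above. A real system would need many more patterns than this single one.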
