Structured Querying of Web Text Data: A Technical Challenge

Dan Suciu

CIDR, 2007

Abstract

The Web contains a huge amount of text that is currently beyond the reach of structured access tools. This unstructured data often contains a substantial amount of implicit structure, much of which can be captured using information extraction (IE) algorithms. By combining an IE system with an appropriate data model and query language, we could enable structured access to all of the Web's unstructured data. We propose a general-purpose query system called the extraction database, or ExDB, which supports SQL-like structured queries over Web text. We also describe the technical challenges involved, motivated in part by our experiences with an early 90M-page prototype.

Figures (6)

Figure 1: The top-ranked results for a query to our ExDB prototype. The query here is q(?a, ?b, ?c) :- invented(?a, ?b), died-in(?a, <year> ?c). This query took 30 seconds to process on a database of 90M Web pages.

{mjc, chrisre, suciu, etzioni, banko}@cs.washington.edu
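The query in Figure 1 can be read as a join of two extracted relations on the shared variable ?a, with ?c restricted to objects of type year. The following is a minimal sketch of that reading, not the ExDB implementation: the `Fact` class, the `TYPES` table, the toy facts, and all probability values are hypothetical, and scoring an answer by multiplying the probabilities of the facts that produced it assumes those facts are independent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Fact:
    predicate: str
    subject: str
    obj: str
    prob: float  # extractor confidence (hypothetical value)


# Toy extractions; the probabilities are invented for illustration only.
FACTS = [
    Fact("invented", "Edison", "phonograph", 0.9),
    Fact("invented", "Edison", "light bulb", 0.8),
    Fact("died-in", "Edison", "1931", 0.7),
    Fact("died-in", "Edison", "New Jersey", 0.6),
    Fact("invented", "Morse", "telegraph", 0.5),
]

# Probabilistic type assertions, standing in for the <year> restriction.
TYPES: Dict[Tuple[str, str], float] = {("1931", "year"): 0.95}

Answer = Tuple[Tuple[str, str, str], float]


def query(facts: List[Fact], types: Dict[Tuple[str, str], float]) -> List[Answer]:
    """q(?a, ?b, ?c) :- invented(?a, ?b), died-in(?a, <year> ?c)."""
    invented = [f for f in facts if f.predicate == "invented"]
    died_in = [f for f in facts if f.predicate == "died-in"]
    results: List[Answer] = []
    for inv in invented:
        for died in died_in:
            if inv.subject != died.subject:  # join on ?a
                continue
            type_prob = types.get((died.obj, "year"), 0.0)
            if type_prob == 0.0:  # ?c is not known to be a year
                continue
            # Independence assumption: multiply the contributing probabilities.
            score = inv.prob * died.prob * type_prob
            results.append(((inv.subject, inv.obj, died.obj), score))
    # Rank answers by descending probability.
    results.sort(key=lambda r: r[1], reverse=True)
    return results


if __name__ == "__main__":
    for answer, score in query(FACTS, TYPES):
        print(answer, round(score, 4))
```

Morse produces no answer because no died-in fact was extracted for him, and the "New Jersey" fact is filtered out because it carries no year type assertion. A real system would not use a nested loop over 90M pages of extractions; the sketch only illustrates the semantics of the query.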

Figure 2: Constructing the ExDB requires several processing steps. In step 1, we run information extractors over the downloaded web text, as described in Section 3. In step 2, the extracted information is stored in the ExDB data model, described in Section 2. Finally, applications can query the ExDB middleware and probabilistic RDBMS. Sections 4 and 5 describe query processing and possible applications.
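The three stages in Figure 2 (extract, store, query) can be sketched end to end on a toy corpus. Everything here is an assumption made for illustration: the regular-expression extractor, the fixed confidence of 0.6, the function names, and the use of noisy-or to combine repeated extractions of the same fact are simple stand-ins, not the extractors or the data model the paper describes.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Extraction = Tuple[str, str, str, float]  # (predicate, arg1, arg2, confidence)
Key = Tuple[str, str, str]

# Hypothetical single-pattern extractor for "<Name> invented the <thing>".
PATTERN = re.compile(r"(?P<a>[A-Z]\w+) invented the (?P<b>\w+)")


def extract(pages: Iterable[str], confidence: float = 0.6) -> Iterator[Extraction]:
    """Step 1: run the extractor over each page of text."""
    for page in pages:
        for m in PATTERN.finditer(page):
            yield ("invented", m.group("a"), m.group("b"), confidence)


def store(extractions: Iterable[Extraction]) -> Dict[Key, float]:
    """Step 2: keep one probability per distinct fact.

    Repeated sightings are combined with noisy-or (an assumption):
    P(fact) = 1 - product of (1 - p_i) over all sightings.
    """
    miss: Dict[Key, float] = defaultdict(lambda: 1.0)
    for pred, a, b, p in extractions:
        miss[(pred, a, b)] *= 1.0 - p
    return {key: 1.0 - m for key, m in miss.items()}


def lookup(db: Dict[Key, float], predicate: str) -> List[Tuple[str, str, float]]:
    """Step 3: answer a single-predicate query, ranked by probability."""
    rows = [(k[1], k[2], p) for k, p in db.items() if k[0] == predicate]
    return sorted(rows, key=lambda r: r[2], reverse=True)


PAGES = [
    "Edison invented the phonograph in 1877.",
    "Many say Edison invented the phonograph.",
    "Morse invented the telegraph.",
]

db = store(extract(PAGES))

if __name__ == "__main__":
    for row in lookup(db, "invented"):
        print(row)
```

The fact seen on two pages ranks above the fact seen once, which is the intuition behind using redundancy on the Web as evidence for an extraction.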

Table 3: Top-10 prototype-ExDB results for q(?s) :- invented(<scientist> ?s, ?x), and Google search results for the keyword query "scientist invented". The goal is to retrieve a list of practical-minded scientists. Only one document returned by Google arguably contains an answer to the query, and that answer is still embedded in unstructured text. Note that some of the ExDB entries are duplicates and should be merged; object synonyms will make this possible.

Figure 4: Construction pipeline for the Schema Extraction Model. We still run an IE system over downloaded text, but use the resulting extractions to compute a single traditional relational database.

Figure 5: The simple construction pipeline for the Text Query Model.
