Structured Querying of Web Text Data: A Technical Challenge
CIDR, 2007
Abstract
The Web contains a huge amount of text that is currently beyond the reach of structured access tools. This unstructured data often contains a substantial amount of implicit structure, much of which can be captured using information extraction (IE) algorithms. By combining an IE system with an appropriate data model and query language, we could enable structured access to all of the Web's unstructured data. We propose a general-purpose query system called the extraction database, or ExDB, which supports SQL-like structured queries over Web text. We also describe the technical challenges involved, motivated in part by our experiences with an early 90M-page prototype.