Abstract
Many applications today need to manage large data sets with uncertainties. In this paper, we describe the foundations of managing data where the uncertainties are quantified as probabilities. We review the basic definitions of the probabilistic data model, present some fundamental theoretical results for query evaluation on probabilistic databases, and discuss several challenges, open problems, and research directions.