Papers by Phokion Kolaitis

arXiv (Cornell University), May 7, 2019
An inconsistent database is a database that violates one or more integrity constraints, such as functional dependencies. Consistent Query Answering is a rigorous and principled approach to the semantics of queries posed against inconsistent databases. The consistent answers to a query on an inconsistent database are the intersection of the answers to the query on every repair, i.e., on every consistent database that differs from the given inconsistent one in a minimal way. Computing the consistent answers of a fixed conjunctive query on a given inconsistent database can be a coNP-hard problem, even though every fixed conjunctive query is efficiently computable on a given consistent database. We designed, implemented, and evaluated CAvSAT, a SAT-based system for consistent query answering. CAvSAT leverages a set of natural reductions from the complement of consistent query answering to SAT and to Weighted MaxSAT. The system is capable of handling unions of conjunctive queries and arbitrary denial constraints, which include functional dependencies as a special case. We report results from experiments evaluating CAvSAT on both synthetic and real-world databases. These results provide evidence that a SAT-based approach can give rise to a comprehensive and scalable system for consistent query answering.
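The repair semantics described above can be illustrated with a minimal brute-force sketch (toy data and names are hypothetical; this is not CAvSAT's SAT encoding, which avoids enumerating the exponentially many repairs):

```python
from itertools import product

# Facts of a relation R(key, value) violating the key constraint on the
# first column: the two "alice" facts conflict with each other.
facts = [("alice", "NY"), ("alice", "LA"), ("bob", "SF")]

# Group conflicting facts by key; a repair keeps exactly one fact per group.
groups = {}
for f in facts:
    groups.setdefault(f[0], []).append(f)

def repairs():
    for choice in product(*groups.values()):
        yield set(choice)

def query(db):
    # Example query: the set of values occurring in the database.
    return {v for (_, v) in db}

# Consistent answers = intersection of the answers over all repairs.
consistent = set.intersection(*(query(r) for r in repairs()))
print(sorted(consistent))  # only "SF" appears in every repair
```

Here "SF" is a consistent answer because it survives in both repairs, whereas "NY" and "LA" each appear in only one of them.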
arXiv (Cornell University), Mar 2, 2015
During the past decade, there has been an extensive investigation of the computational complexity of the consistent answers of Boolean conjunctive queries under primary key constraints. Much of this investigation has focused on self-join-free Boolean conjunctive queries. In this paper, we study the consistent answers of Boolean conjunctive queries involving a single binary relation, i.e., we consider arbitrary Boolean conjunctive queries on directed graphs. In the presence of a single key constraint, we show that for each such Boolean conjunctive query, either the problem of computing its consistent answers is expressible in first-order logic, or it is polynomial-time solvable, but not expressible in first-order logic.

Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2019
Election databases are the main elements of a recently introduced framework that aims to create bridges between the computational social choice and the data management communities. An election database consists of incomplete information about the preferences of voters, in the form of partial orders, along with standard database relations that provide contextual information. Earlier work in computational social choice focused on the computation of possible winners and necessary winners that are determined by the available incomplete information and the voting rule at hand. The presence of the relational context, however, permits the formulation of sophisticated queries about voting rules, candidates, potential winners, issues, and positions on issues. Such queries can be given possible answer semantics and necessary answer semantics on an election database, where the former means that the query is true on some completion of the given partial orders and the latter means that the query is true on every completion.

ACM Transactions on Database Systems, 2016
We introduce and develop a declarative framework for entity linking and, in particular, for entity resolution. As in some earlier approaches, our framework is based on a systematic use of constraints. However, the constraints we adopt are link-to-source constraints, unlike in earlier approaches where source-to-link constraints were used to dictate how to generate links. Our approach makes it possible to focus entirely on the intended properties of the outcome of entity linking, thus separating the constraints from any procedure of how to achieve that outcome. The core language consists of link-to-source constraints that specify the desired properties of a link relation in terms of source relations and built-in predicates such as similarity measures. A key feature of the link-to-source constraints is that they employ disjunction, which enables the declarative listing of all the reasons two entities should be linked. We also consider extensions of the core language that capture collec...
Constraint Satisfaction, Complexity, and Logic
Lecture Notes in Computer Science, 2004
22nd Annual IEEE Symposium on Logic in Computer Science (LICS 2007), 2007
Advances in finite model theory have appeared in LICS proceedings since the very beginning of the LICS Symposium. The goal of this paper is to reflect on finite model theory by highlighting some of its successes, examining obstacles that were encountered, and discussing some open problems that have stubbornly resisted solution.

Annals of Pure and Applied Logic, 1995
First-order logic is known to have a severely limited expressive power on finite structures. As a result, several different extensions have been investigated, including fragments of second-order logic, fixpoint logic, and the infinitary logic L^ω_∞ω, in which every formula has only a finite number of variables. In this paper, we study generalized quantifiers in the realm of finite structures and combine them with the infinitary logic L^ω_∞ω to obtain the logics L^ω_∞ω(Q), where Q = {Q_i : i ∈ I} is a family of generalized quantifiers on finite structures. Using the logics L^ω_∞ω(Q), we can express polynomial-time properties that are not definable in L^ω_∞ω, such as "there is an even number of x" and "there exist at least n/2 x" (n is the size of the universe), without going to second-order logic. We show that equivalence of finite structures relative to L^ω_∞ω(Q) can be characterized in terms of certain pebble games that are a variant of the Ehrenfeucht-Fraïssé games. We combine this game-theoretic characterization with sophisticated combinatorial tools from Ramsey theory, such as van der Waerden's Theorem and Folkman's Theorem, in order to investigate the scope and limits of generalized quantifiers in finite model theory. We obtain sharp lower bounds for expressibility in the logics L^ω_∞ω(Q) and discover an intrinsic difference between adding finitely many simple unary generalized quantifiers to L^ω_∞ω and adding infinitely many. In particular, we show that if Q is a finite sequence of simple unary generalized quantifiers, then the equicardinality, or Härtig, quantifier is not definable in L^ω_∞ω(Q). We also show that the query "does the equivalence relation E have an even number of equivalence classes" is not definable in the extension L^ω_∞ω(I, Q) of L^ω_∞ω by the Härtig quantifier I and any finite sequence Q of simple unary generalized quantifiers.
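As a reminder of the central notion, the standard semantics of the Härtig (equicardinality) quantifier I mentioned above can be written as follows (textbook formulation, not the paper's exact notation):

```latex
% A structure A satisfies I x y (phi(x), psi(y)) iff the two
% definable sets have the same cardinality:
\mathfrak{A} \models I\,x\,y\,\bigl(\varphi(x), \psi(y)\bigr)
  \iff
  \bigl|\{\, a \in A : \mathfrak{A} \models \varphi(a) \,\}\bigr|
  =
  \bigl|\{\, a \in A : \mathfrak{A} \models \psi(a) \,\}\bigr|
```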

Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2021
An ontology specifies an abstract model of a domain of interest via a formal language that is typically based on logic. Although description logics are popular formalisms for modeling ontologies, tuple-generating dependencies (tgds), originally introduced as a unifying framework for database integrity constraints, and later on used in data exchange and integration, are also well suited for modeling ontologies that are intended for data-intensive tasks. The reason is that, unlike description logics, tgds can easily handle higher-arity relations that naturally occur in relational databases. In recent years, there has been an extensive study of tgd-ontologies and of their applications to several different data-intensive tasks. However, the fundamental question of whether the expressive power of tgd-ontologies can be characterized in terms of model-theoretic properties remains largely unexplored. We establish several characterizations of tgd-ontologies, including characterizations of ontologies specified by such central classes of tgds as full, linear, guarded, and frontier-guarded tgds. Our characterizations use the well-known notions of critical instance and direct product, as well as a novel locality property for tgd-ontologies. We further use this locality property to decide whether an ontology expressed by frontier-guarded (respectively, guarded) tgds can be expressed by tgds in the weaker class of guarded (respectively, linear) tgds, and effectively construct such an equivalent ontology if one exists. CCS CONCEPTS • Theory of computation → Logic and databases; Description logics.
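For readers unfamiliar with the tgd classes named above, the following are representative textbook shapes (illustrative examples with made-up relation symbols, not formulas from the paper):

```latex
% General shape of a tuple-generating dependency (tgd):
\forall \bar{x}\,\bigl(\varphi(\bar{x}) \rightarrow \exists \bar{y}\,\psi(\bar{x},\bar{y})\bigr)

% full: no existential quantifiers in the head
\forall x\,\forall y\,\bigl(E(x,y) \rightarrow R(x,y)\bigr)

% linear: a single atom in the body
\forall x\,\forall y\,\bigl(R(x,y) \rightarrow \exists z\,R(y,z)\bigr)

% guarded: some body atom (here G) contains all body variables
\forall x\,\forall y\,\forall z\,\bigl(G(x,y,z) \wedge E(x,y) \rightarrow \exists w\,R(z,w)\bigr)
```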

During the past fifteen years, data exchange has been explored in depth and in a variety of different settings. Even though temporal databases constitute a mature area of research studied over several decades, the investigation of temporal data exchange was initiated only very recently. We analyze the properties of universal solutions in temporal data exchange with emphasis on the relationship between universal solutions in the context of concrete time and universal solutions in the context of abstract time. We show that challenges arise even in the setting in which the data exchange specifications involve a single temporal variable. After this, we identify settings, including data exchange settings that involve multiple temporal variables, in which these challenges can be overcome. 2012 ACM Subject Classification Information systems → Data management systems; Theory of computation → Data exchange; Information systems → Temporal data

Communications in Computer and Information Science, 2020
We consider the problem of answering temporal queries on RDF stores, in the presence of time-agnostic RDFS domain ontologies, of relational data sources that include temporal information, and of rules that map the domain information in the source schemas into the target ontology. Our proposed solution consists of two rule-based domain-independent algorithms. The first algorithm materializes target RDF data via a version of data exchange that enriches both the data and the ontology with temporal information from the relational sources. The second algorithm accepts as inputs temporal queries expressed in terms of the domain ontology, using SPARQL supplemented with a lightweight easy-to-use formalism for time annotations and comparisons. The algorithm translates the queries into the standard SPARQL form that respects the structure of the temporal RDF information while preserving the semantics of the questions, thus ensuring successful evaluation of the queries on the materialized temporally-enriched RDF data. In this paper we present the algorithms, report on their implementation and experimental results for two application domains, and discuss future work.

Journal of Data and Information Quality
We consider the problem of answering temporal queries on RDF stores, in the presence of atemporal RDFS domain ontologies, of relational data sources that include temporal information, and of rules that map the domain information in the source schemas into the target ontology. Our proposed practice-oriented solution consists of two rule-based domain-independent algorithms. The first algorithm materializes target RDF data via a version of data exchange that enriches both the data and the ontology with temporal information from the relational sources. The second algorithm accepts as inputs temporal queries expressed in terms of the domain ontology using a lightweight temporal extension of SPARQL, and ensures successful evaluation of the queries on the materialized temporally-enriched RDF data. To study the quality of the information generated by the algorithms, we develop a general framework that formalizes the relational-to-RDF temporal data-exchange problem. The framework includes a chas...

Peer Data Management (PDM) deals with the management of structured data in unstructured peer-to-peer (P2P) networks. Each peer can store data locally and define relationships between its data and the data provided by other peers. Queries posed to any of the peers are then answered by also considering the information implied by those mappings. The overall goal of PDM is to provide semantically well-founded integration and exchange of heterogeneous and distributed data sources. Unlike traditional data integration systems, peer data management systems (PDMSs) thereby allow for full autonomy of each member and need no central coordinator. The promise of such systems is to provide flexible data integration and exchange at low setup and maintenance costs. However, building such systems raises many challenges. Beside the obvious scalability problem, choosing an appropriate semantics that can deal with arbitrary, even cyclic topologies, data inconsistencies, or updates while at the same tim...

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
We develop a novel framework that aims to create bridges between the computational social choice and the database management communities. This framework enriches the tasks currently supported in computational social choice with relational database context, thus making it possible to formulate sophisticated queries about voting rules, candidates, voters, issues, and positions. At the conceptual level, we give rigorous semantics to queries in this framework by introducing the notions of necessary answers and possible answers to queries. At the technical level, we embark on an investigation of the computational complexity of the necessary answers. In particular, we establish a number of results about the complexity of the necessary answers of conjunctive queries involving the plurality rule that contrast sharply with earlier results about the complexity of the necessary winners under the plurality rule.

Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2015
The framework of database repairs provides a principled approach to managing inconsistencies in databases. Informally, a repair of an inconsistent database is a consistent database that differs from the inconsistent one in a "minimal way." A fundamental problem in this framework is the repair-checking problem: given two instances, is the second a repair of the first? Here, all repairs are taken into account, and they are treated on a par with each other. There are situations, however, in which it is natural and desirable to prefer one repair over another; for example, one data source is regarded to be more reliable than another, or timestamp information implies that a more recent fact should be preferred over an earlier one. Motivated by these considerations, Staworko, Chomicki and Marcinkowski introduced the framework of preferred repairs. The main characteristic of this framework is that it uses a priority relation between conflicting facts of an inconsistent database to define notions of preferred repairs. In this paper we focus on the globally-optimal repairs, in the case where the constraints are functional dependencies. Intuitively, a globally-optimal repair is a repair that cannot be improved by exchanging facts with preferred facts. In this setting, it is known that there is a fixed schema (i.e., signature and functional dependencies) where globally-optimal repair checking is coNP-complete. Our main result is a dichotomy in complexity: for each fixed relational signature and each fixed set of functional dependencies, the globally-optimal repair-checking problem either is solvable in polynomial time or is coNP-complete. Specifically, the problem is solvable in polynomial time if, for each relation symbol in the signature, the functional dependencies are equivalent to either a single functional dependency or a set of two key constraints; in all other cases, the globally-optimal repair-checking problem is coNP-complete. We also show that there is a polynomial-time algorithm for distinguishing between the tractable and the intractable cases. The setup of preferred repairs assumes that preferences are only between conflicting facts. In the last...
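The basic (non-preferred) repair-checking problem can be sketched as follows for a single key constraint, under subset-repair semantics where a repair is a maximal consistent subset (toy data and names are hypothetical; the paper's globally-optimal variant additionally consults a priority relation between conflicting facts):

```python
# Facts of a relation R(key, value); the key constraint forbids two facts
# sharing a key.
def consistent(db):
    keys = [k for (k, _) in db]
    return len(keys) == len(set(keys))  # no two facts share a key

def is_repair(inconsistent_db, candidate):
    """Is `candidate` a maximal consistent subset of `inconsistent_db`?"""
    if not candidate <= inconsistent_db or not consistent(candidate):
        return False
    # Maximality: no removed fact can be added back without a violation.
    return all(not consistent(candidate | {f})
               for f in inconsistent_db - candidate)

I = {("alice", "NY"), ("alice", "LA"), ("bob", "SF")}
print(is_repair(I, {("alice", "NY"), ("bob", "SF")}))  # True
print(is_repair(I, {("bob", "SF")}))                   # False: not maximal
```

For a fixed set of functional dependencies this check is clearly polynomial; the paper's dichotomy concerns the harder globally-optimal variant, where coNP-completeness can arise.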

Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2014
During the past decade, schema mappings have been extensively used in formalizing and studying such critical data interoperability tasks as data exchange and data integration. Much of the work has focused on GLAV mappings, i.e., schema mappings specified by source-to-target tuple-generating dependencies (s-t tgds), and on schema mappings specified by second-order tgds (SO tgds), which constitute the closure of GLAV mappings under composition. In addition, nested GLAV mappings have also been considered, i.e., schema mappings specified by nested tgds, which have expressive power intermediate between s-t tgds and SO tgds. Even though nested GLAV mappings have been used in data exchange systems, such as IBM's Clio, no systematic investigation of this class of schema mappings has been carried out so far. In this paper, we embark on such an investigation by focusing on the basic reasoning tasks, algorithmic problems, and structural properties of nested GLAV mappings. One of our main results is the decidability of the implication problem for nested tgds. We also analyze the structure of the core of universal solutions with respect to nested GLAV mappings and develop useful tools for telling apart SO tgds from nested tgds. By discovering deeper structural properties of nested GLAV mappings, we show that also the following problem is decidable: given a nested GLAV mapping, is it logically equivalent to a GLAV mapping?
Towards a theory of schema-mapping optimization
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2008

The inversion of schema mappings has been identified as one of the fundamental operators for the development of a general framework for data exchange, data integration, and, more generally, for metadata management. Given a mapping M from a schema S to a schema T, an inverse of M is a new mapping that describes the reverse relationship from T to S, and that is semantically consistent with the relationship previously established by M. In practical scenarios, the inversion of a schema mapping can have several applications. For example, in a data exchange context, if a mapping M is used to exchange data from a source to a target schema, an inverse of M can be used to exchange the data back to the source, thus reversing the application of M. The formalization of a clear semantics for the inverse operator has proved to be a very difficult task. In fact, in recent years, several alternative notions of inversion for schema mappings have been proposed in the literature. This chapter prov...

ACM Transactions on Database Systems, 2020
During the past 15 years, schema mappings have been extensively used in formalizing and studying such critical data interoperability tasks as data exchange and data integration. Much of the work has focused on GLAV mappings, i.e., schema mappings specified by source-to-target tuple-generating dependencies (s-t tgds), and on schema mappings specified by second-order tgds (SO tgds), which constitute the closure of GLAV mappings under composition. In addition, nested GLAV mappings have also been considered, i.e., schema mappings specified by nested tgds, which have expressive power intermediate between s-t tgds and SO tgds. Even though nested GLAV mappings have been used in data exchange systems, such as IBM's Clio, no systematic investigation of this class of schema mappings has been carried out so far. In this article, we embark on such an investigation by focusing on the basic reasoning tasks, algorithmic problems, and structural properties of nested GLAV mappings. One of our main r...
Proceedings of the 12th International Conference on Database Theory, 2009
* A preliminary version of this paper has been presented at the Workshop on Logic in Databases-LID '08. † Work was done partly while this author was visiting at IBM Almaden and at Stanford University. ‡ This research project, no 03ED176, is co-financed by E.U.-European Social Fund (80%) and the Greek Ministry of Development-GSRT (20%).

We develop a flexible, open-source framework for query answering on relational databases by adopting methods and techniques from the Semantic Web community and the data exchange community, and we apply this framework to a medical use case. We first deploy module-extraction techniques to derive a concise and relevant sub-ontology from an external reference ontology. We then use the chase procedure from the data exchange community to materialize a universal solution that can be subsequently used to answer queries on an enterprise medical database. Along the way, we identify a new class of well-behaved acyclic EL-ontologies extended with role hierarchies, suitably restricted functional roles, and domain/range restrictions, which cover our use case. We show that such ontologies are C-stratified, which implies that the chase procedure terminates in polynomial time. We provide a detailed overview of our real-life application in the medical domain and demonstrate the benefits of this appro...
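The chase procedure mentioned above can be illustrated with a naive sketch for a single tgd with an existential head (hypothetical relation names and data; the paper relies on C-stratification to guarantee polynomial-time termination, which this sketch does not implement):

```python
from itertools import count

# tgd: Employee(x) -> exists y. WorksFor(x, y)
fresh = count()  # supplies fresh labeled nulls N0, N1, ...

def chase(db):
    """Repeatedly repair tgd violations by inventing labeled-null witnesses."""
    db = set(db)
    changed = True
    while changed:
        changed = False
        employees = {t[1] for t in db if t[0] == "Employee"}
        satisfied = {t[1] for t in db if t[0] == "WorksFor"}
        for x in employees - satisfied:
            # Violation: x has no WorksFor fact, so add a fresh null witness.
            db.add(("WorksFor", x, f"N{next(fresh)}"))
            changed = True
    return db

result = chase({("Employee", "ann"), ("Employee", "bob"),
                ("WorksFor", "bob", "acme")})
# "ann" gets a null witness; "bob" already satisfies the tgd.
```

The chased instance is a universal solution in the sense used above: its labeled nulls can be homomorphically mapped into any other solution.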