Overview
The computational complexity of query evaluation in databases is usually measured in terms of “data complexity” and “combined complexity.” Unfortunately, these complexity classes do not encourage the design of efficient algorithms for many interesting query evaluation problems. The main weak spot of these measures is their failure to take into account the size of the output, as well as the speed at which answers are returned to the user. In recent years, various concepts of enumeration have arisen in the fields of Databases, Computational Logic, and Algorithms, motivated by applications of data analysis and query evaluation. Common to all of these concepts is the desire to compute a stream of items with the smallest possible waiting time between consecutive items, referred to as the “delay.” Alongside each concept, algorithmic techniques have evolved for developing solvers, along with proof techniques for establishing complexity bounds. In addition to the traditional guarantees of “polynomial delay” and “incremental polynomial time,” researchers have been pursuing stronger guarantees such as “constant delay” in the context of logical query evaluation, “dynamic complexity” of incremental maintenance, and “factorized databases.” Our research aims to develop novel algorithms and techniques for enumeration with both complexity and quality guarantees (e.g., ranked order), as well as to understand the limitations that preclude the existence of such algorithms (i.e., lower bounds).
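To make the “delay” notion concrete, here is a minimal Python sketch (our own illustration, not taken from any of the papers below) of constant-delay enumeration after linear-time preprocessing, for the join of two binary relations r(A, B) and s(B, C):

```python
from collections import defaultdict

def join_enumerate(r, s):
    """Enumerate the natural join of r(A, B) and s(B, C) as (a, b, c) tuples."""
    # Linear-time preprocessing: index s on the join attribute B, then
    # drop "dangling" r-tuples with no partner in s (a semi-join), so that
    # every outer step below yields at least one answer -- constant delay.
    index = defaultdict(list)
    for b, c in s:
        index[b].append(c)
    r_live = [(a, b) for (a, b) in r if b in index]
    # Enumeration phase: stream the answers one at a time via a generator.
    for a, b in r_live:
        for c in index[b]:
            yield (a, b, c)

answers = list(join_enumerate([(1, 2), (3, 4)], [(2, 5), (2, 6)]))
# answers == [(1, 2, 5), (1, 2, 6)]
```

The point of the generator is that the first answer is available right after the preprocessing phase, without materializing the full (possibly large) output.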
People
Selected Publications
Noam Ravid, Dori Medini, Benny Kimelfeld,
"Ranked Enumeration of Minimal Triangulations", CoRR abs/1709.10254 (2017). To appear in PODS, 2019
Tree decompositions facilitate computations on complex graphs by grouping vertices into bags interconnected in an acyclic structure; hence their importance in a plethora of problems such as query evaluation over databases and inference over probabilistic graphical models. Different applications take varying benefits from different tree decompositions, and hence, measure them by diverse (sometimes complex) cost functions. For generic cost functions (such as width or fill-in), an optimal tree decomposition can be computed in some cases, notably when the number of minimal separators is bounded by a polynomial (due to Bouchitté and Todinca); we refer to this assumption as “poly-MS.” Yet, in general, finding an optimal tree decomposition is computationally intractable even for these cost functions, and approximations or heuristics are commonly used. Furthermore, the generic cost functions hardly cover the benefit measures needed in practice. Therefore, it has recently been proposed to devise algorithms for enumerating many decomposition candidates for applications to select from using specialized, or even machine-learned, cost functions. We present the first algorithm for enumerating the minimal triangulations of a graph by increasing cost, for a wide class of cost functions. Consequently, we get ranked enumeration of the (non-redundant) tree decompositions of a graph, for a class of cost functions that substantially generalizes the above generic ones. On the theoretical side, we establish the guarantee of polynomial delay if poly-MS is assumed, or if we are interested only in decompositions of a width bounded by a constant. Lastly, we describe an experimental evaluation on graphs of various domains (join queries, Bayesian networks and random graphs), and explore both the applicability of the poly-MS assumption and the performance of our algorithm relative to the state of the art.
Nofar Carmeli, Markus Kröll,
"Enumeration Complexity of Conjunctive Queries with Functional Dependencies", ICDT 2018: 11:1-11:17
We study the complexity of enumerating the answers of Conjunctive Queries (CQs) in the presence of Functional Dependencies (FDs). Our focus is on the ability to list output tuples with a constant delay in between, following a linear-time preprocessing. A known dichotomy classifies the acyclic self-join-free CQs into those that admit such enumeration, and those that do not. However, this classification no longer holds in the common case where the database exhibits dependencies among attributes. That is, some queries that are classified as hard are in fact tractable if dependencies are accounted for. We establish a generalization of the dichotomy to accommodate FDs; hence, our classification determines which combination of a CQ and a set of FDs admits constant-delay enumeration with a linear-time preprocessing. In addition, we generalize a hardness result for cyclic CQs to accommodate a common type of FDs. Further conclusions of our development include a dichotomy for enumeration with linear delay, and a dichotomy for CQs with disequalities. Finally, we show that all our results apply to the known class of “cardinality dependencies” that generalize FDs (e.g., by stating an upper bound on the number of genres per movie, or friends per person).
Dominik D. Freydenberger, Benny Kimelfeld, Liat Peterfreund,
"Joining Extractions of Regular Expressions", PODS 2018: 137-149
Regular expressions with capture variables, also known as “regex formulas,” extract relations of spans (interval positions) from text. These relations can be further manipulated via Relational Algebra as studied in the context of document spanners, Fagin et al.’s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness already arises for single-character text! Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bounds. In particular, UCQs can be evaluated with polynomial delay, provided that every CQ has a bounded number of atoms (while unions and projection can be arbitrary). Furthermore, UCQ evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the parameter is the size of the UCQ.
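The idea of a regex formula extracting a relation of spans can be sketched with Python's standard `re` module (a plain illustration of span relations, not the document-spanner formalism itself): each match contributes one tuple of (start, end) intervals, one per capture variable.

```python
import re

def extract_spans(pattern, text):
    """Yield, per match, a tuple of (start, end) spans, one per capture group."""
    regex = re.compile(pattern)
    for m in regex.finditer(text):
        yield tuple(m.span(i) for i in range(1, regex.groups + 1))

# Two capture variables: a "word" span followed by a "number" span.
rows = list(extract_spans(r"([A-Za-z]+)-(\d+)", "item-12 and part-7"))
# rows == [((0, 4), (5, 7)), ((12, 16), (17, 18))]
```

The extracted relation of span tuples is what CQ and UCQ evaluation then operates on; joining such relations is where the complexity questions above arise.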
Ester Livshits, Benny Kimelfeld,
"Counting and Enumerating (Preferred) Database Repairs", PODS 2017: 289-301
In the traditional sense, a subset repair of an inconsistent database refers to a consistent subset of facts (tuples) that is maximal under set containment. Preferences between pairs of facts allow one to distinguish a set of preferred repairs based on the relative reliability (source credibility, extraction quality, recency, etc.) of data items. Previous studies explored the problem of categoricity, where one aims to determine whether preferences suffice to repair the database unambiguously, or in other words, whether there is precisely one preferred repair. In this paper we study the ability to quantify ambiguity, by investigating two classes of problems. The first is that of counting the number of subset repairs, both preferred (under various common semantics) and traditional. We establish dichotomies in data complexity for the entire space of (sets of) functional dependencies. The second class of problems is that of enumerating (i.e., generating) the preferred repairs. We devise enumeration algorithms with efficiency guarantees on the delay between generated repairs, even for constraints represented as general conflict graphs or hypergraphs.
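When constraints are given as a conflict graph (vertices are facts, edges connect conflicting facts), the subset repairs are exactly the maximal independent sets of that graph. The brute-force Python sketch below (exponential in the number of facts, unlike the delay-bounded algorithms of the paper) makes the correspondence concrete on a toy instance:

```python
from itertools import combinations

def subset_repairs(facts, conflicts):
    """Enumerate subset repairs: maximal (under containment) conflict-free
    subsets of facts, where conflicts is a list of pairs of conflicting facts."""
    def consistent(subset):
        s = set(subset)
        return all(not set(edge) <= s for edge in conflicts)

    # All conflict-free subsets (brute force, for illustration only).
    candidates = [set(c) for k in range(len(facts) + 1)
                  for c in combinations(facts, k) if consistent(c)]
    # Keep only the maximal ones: those not strictly contained in another.
    return [c for c in candidates if not any(c < d for d in candidates)]

facts = ["f1", "f2", "f3"]
conflicts = [("f1", "f2")]  # f1 and f2 cannot coexist
repairs = subset_repairs(facts, conflicts)
# Two repairs: {f1, f3} and {f2, f3}
```

Counting such repairs, and enumerating the preferred ones with bounded delay, is where the hardness studied in the paper sets in.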
Nofar Carmeli, Batya Kenig, Benny Kimelfeld,
"Efficiently Enumerating Minimal Triangulations", PODS 2017: 273-287
We present an algorithm that enumerates all the minimal triangulations of a graph in incremental polynomial time. Consequently, we get an algorithm for enumerating all the proper tree decompositions, in incremental polynomial time, where “proper” means that the tree decomposition cannot be improved by removing or splitting a bag. The algorithm can incorporate any method for (ordinary, single result) triangulation or tree decomposition, and can serve as an anytime algorithm to improve such a method. We describe an extensive experimental study of an implementation on real data from different fields. Our experiments show that the algorithm improves upon central quality measures over the underlying tree decompositions, and is able to produce a large number of high-quality decompositions.