There are numerous challenges in distributed query processing. The focus of this thesis is to provide solutions to three problem areas: (a) querying incomplete data, (b) approximate query processing (AQP) over subsets of data, and (c) high cost of shuffling data while processing distributed queries.;In distributed databases, large volumes of data are generally stored partitioned across multiple nodes and a user query typically spans many nodes. As the number of nodes accessed by a query increases, the probability of nodes being unavailable also increases; additionally, the amount of data shuffled across nodes also increases, thus increasing communication costs.;To provide fast responses to queries over distributed databases, AQP has been proposed. In AQP, queries are processed over a representative subset of the database and estimates of the query result are provided along with confidence bounds. While AQP provides estimates of query results in a fraction of the time required to run the query over all data, quickly obtaining representative samples for a query in a distributed setting is challenging.;We first consider the problem of querying over incomplete data. In failure and straggler scenarios, parts of the database that are still available form an incomplete database. We propose m-tables, a new representation system for representing and querying over incomplete databases.;Next, we consider the problem of AQP over subsets of data. We propose the ASAP (Approximation Strategies for Aggregate queries through Partitioning) framework to provide estimates and confidence bounds for aggregate queries using any subset of a database when the database is co-hash partitioned. A database is co-hash partitioned when some tables are hash partitioned, and the remaining tables are co-located through join predicates.;Finally, we study the problem of high cost of shuffling data across nodes for distributed query processing. Ideally, given a query and data distribution, we want to execute the query without any communication: in this case, the query is said to be parallel-correct w.r.t. the distribution. We again consider co-hash distribution schemes and as our main result, we determine the conditions for a given query to be parallel-correct for a given co-hash distribution scheme.
展开▼