
Distributed Query Processing Over Incomplete, Sampled, and Locality-Aware Data


Abstract

There are numerous challenges in distributed query processing. The focus of this thesis is to provide solutions to three problem areas: (a) querying incomplete data, (b) approximate query processing (AQP) over subsets of data, and (c) high cost of shuffling data while processing distributed queries.

In distributed databases, large volumes of data are generally stored partitioned across multiple nodes and a user query typically spans many nodes. As the number of nodes accessed by a query increases, the probability of nodes being unavailable also increases; additionally, the amount of data shuffled across nodes also increases, thus increasing communication costs.

To provide fast responses to queries over distributed databases, AQP has been proposed. In AQP, queries are processed over a representative subset of the database and estimates of the query result are provided along with confidence bounds. While AQP provides estimates of query results in a fraction of the time required to run the query over all data, quickly obtaining representative samples for a query in a distributed setting is challenging.

We first consider the problem of querying over incomplete data. In failure and straggler scenarios, parts of the database that are still available form an incomplete database. We propose m-tables, a new representation system for representing and querying over incomplete databases.

Next, we consider the problem of AQP over subsets of data. We propose the ASAP (Approximation Strategies for Aggregate queries through Partitioning) framework to provide estimates and confidence bounds for aggregate queries using any subset of a database when the database is co-hash partitioned. A database is co-hash partitioned when some tables are hash partitioned, and the remaining tables are co-located through join predicates.

Finally, we study the problem of high cost of shuffling data across nodes for distributed query processing. Ideally, given a query and data distribution, we want to execute the query without any communication: in this case, the query is said to be parallel-correct w.r.t. the distribution. We again consider co-hash distribution schemes and as our main result, we determine the conditions for a given query to be parallel-correct for a given co-hash distribution scheme.
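
The notions of co-hash partitioning and parallel-correctness described in the abstract can be illustrated with a small sketch. This is not code from the thesis: the tables (customers, orders), the modulo placement function node_for, and the cluster size NUM_NODES are all hypothetical names chosen for illustration. The sketch hash partitions one table, co-locates a second table through a join predicate, and then checks that evaluating the join on each node independently yields the same answer as evaluating it over the whole database.

```python
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size (assumption, not from the thesis)


def node_for(key: int) -> int:
    """Placement function; assumed here to be a simple modulo on an integer key."""
    return key % NUM_NODES


# Hypothetical tables: customers(id, name) and orders(order_id, customer_id, amount).
customers = [(1, "Ada"), (2, "Bo"), (3, "Cy"), (4, "Di")]
orders = [(10, 1, 5.0), (11, 1, 7.5), (12, 3, 2.0), (13, 4, 9.0)]


def co_hash_partition(customers, orders):
    """Hash partition customers on id, and co-locate each order with its
    customer through the join predicate orders.customer_id = customers.id."""
    nodes = defaultdict(lambda: {"customers": [], "orders": []})
    for c in customers:
        nodes[node_for(c[0])]["customers"].append(c)
    for o in orders:
        # The order follows the placement of the customer it joins with.
        nodes[node_for(o[1])]["orders"].append(o)
    return nodes


def join(customers, orders):
    """Equi-join on customers.id = orders.customer_id, as a set of tuples."""
    return {(c[1], o[0], o[2]) for c in customers for o in orders if c[0] == o[1]}


nodes = co_hash_partition(customers, orders)

# Evaluate the join on every node separately; no tuples move between nodes.
local_union = set()
for fragment in nodes.values():
    local_union |= join(fragment["customers"], fragment["orders"])

# Reference answer computed over the complete, unpartitioned database.
global_result = join(customers, orders)

print(local_union == global_result)  # prints True
```

In this toy setting the join can be answered without communication because its predicate is the same one used to co-locate the tables. A query joining on a different attribute would not, in general, find its matching tuples on the same node, which is the kind of condition the thesis characterizes for co-hash distribution schemes.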

Bibliographic record

  • Author

    Sundarmurthy, Bruhathi

  • Author affiliation

    The University of Wisconsin - Madison

  • Degree-granting institution: The University of Wisconsin - Madison
  • Subject: Computer science
  • Degree: Ph.D.
  • Year: 2018
  • Pages: 154 p.
  • Total pages: 154
  • Original format: PDF
  • Language of text: English
