Conference: Scientific and Statistical Database Management

Scalable Clustering Algorithm for N-Body Simulations in a Shared-Nothing Cluster

Abstract

Scientists' ability to generate and collect massive-scale datasets is increasing. As a result, constraints in data analysis capability rather than limitations in the availability of data have become the bottleneck to scientific discovery. MapReduce-style platforms promise to address this growing data analysis problem, but it is not easy to express many scientific analyses in these new frameworks. In this paper, we study data analysis challenges found in the astronomy simulation domain. In particular, we present a scalable, parallel algorithm for data clustering in this domain. Our algorithm makes two contributions. First, it shows how a clustering problem can be efficiently implemented in a MapReduce-style framework. Second, it includes optimizations that enable scalability, even in the presence of skew. We implement our solution in the Dryad parallel data processing system using DryadLINQ. We evaluate its performance and scalability using a real dataset comprising 906 million points, and show that in an 8-node cluster, our algorithm can process even a highly skewed dataset 17 times faster than the conventional implementation and offers near-linear scalability. Our approach matches the performance of an existing hand-optimized implementation used in astrophysics on a dataset with little skew and significantly outperforms it on a skewed dataset.
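The abstract does not spell out the algorithm, so the following is only a generic illustration of how a distance-based clustering problem might be shaped for a MapReduce-style framework: partition the points spatially, cluster each partition independently, then merge clusters that span a partition boundary. Everything here is an assumption made for illustration: the function name `cluster`, the linking distance `eps`, the strip width `cell`, and the use of union-find. It runs in a single process, does not use Dryad or DryadLINQ, and does not include the paper's skew optimizations.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def find(parent, x):
    """Return the representative of x, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the sets containing a and b; the smaller root index wins."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def cluster(points, eps, cell):
    """Label 2-D points so that points linked by hops of length <= eps share a label.

    Hypothetical sketch: assumes eps <= cell, so two linked points are never
    more than one strip apart.
    """
    # Partition step ("map"): assign each point index to a vertical strip.
    strips = defaultdict(list)
    for i, (x, _y) in enumerate(points):
        strips[int(x // cell)].append(i)

    parent = list(range(len(points)))

    # Local step: each strip links its own points; strips are independent,
    # so a parallel framework could run this once per partition.
    for members in strips.values():
        for i, j in combinations(members, 2):
            if dist(points[i], points[j]) <= eps:
                union(parent, i, j)

    # Merge step ("reduce"): link points across neighbouring strips.
    for s, members in strips.items():
        for i in members:
            for j in strips.get(s + 1, []):
                if dist(points[i], points[j]) <= eps:
                    union(parent, i, j)

    return [find(parent, i) for i in range(len(points))]
```

For example, `cluster([(0.0, 0.0), (0.9, 0.0), (1.1, 0.0), (5.0, 5.0)], eps=0.5, cell=1.0)` places the second and third points in the same cluster even though they fall in different strips, because the merge step links them across the boundary. A dense region would make one strip far more expensive than the others in the local step, which is the kind of skew the abstract refers to.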
