首页> 外文会议>International conference on very large data bases >Scalable Distributed Subgraph Enumeration
【24h】

Scalable Distributed Subgraph Enumeration

机译:可伸缩的分布式子图枚举

获取原文

摘要

Subgraph enumeration aims to find all the subgraphs of a large data graph that are isomorphic to a given pattern graph. As the subgraph isomorphism operation is computationally intensive, researchers have recently focused on solving this problem in distributed environments, such as MapReduce and Pregel. Among them, the state-of-the-art algorithm, TwinTwigJoin, is proven to be instance optimal based on a left-deep join framework. However, it is still not scalable to large graphs because of the constraints in the left-deep join framework and that each decomposed component (join unit) must be a star. In this paper, we propose SEED - a scalable subgraph enumeration approach in the distributed environment. Compared to TwinTwigJoin, SEED returns optimal solution in a generalized join framework without the constraints in TwinTwigJoin. We use both star and clique as the join units, and design an effective distributed graph storage mechanism to support such an extension. We develop a comprehensive cost model, that estimates the number of matches of any given pattern graph by considering power-law degree distribution in the data graph. We then generalize the left-deep join framework and develop a dynamic-programming algorithm to compute an optimal bushy join plan. We also consider overlaps among the join units. Finally, we propose clique compression to further improve the algorithm by reducing the number of the intermediate results. Extensive performance studies are conducted on several real graphs, one containing billions of edges. The results demonstrate that our algorithm outperforms all other state-of-the-art algorithms by more than one order of magnitude.
机译:子图枚举旨在查找与给定模式图同构的大型数据图的所有子图。由于子图同构运算的计算量很大,因此研究人员最近将重点放在解决分布式环境(例如MapReduce和Pregel)中的此问题上。其中,最先进的算法TwinTwigJoin被证明是基于左深连接框架的实例最佳实例。但是,由于左深连接框架中的约束,并且每个分解的组件(连接单元)必须是星形,因此它仍然无法缩放到大图。在本文中,我们提出了SEED-分布式环境中的可伸缩子图枚举方法。与TwinTwigJoin相比,SEED在不受TwinTwigJoin约束的通用联接框架中返回最佳解决方案。我们使用star和clique作为连接单元,并设计一种有效的分布式图形存储机制来支持这种扩展。我们开发了一个综合成本模型,该模型通过考虑数据图中的幂律度分布来估计任何给定模式图的匹配数。然后,我们对左深连接框架进行泛化,并开发一种动态编程算法来计算最佳的丛集连接计划。我们还考虑了联接单元之间的重叠。最后,我们提出集团压缩以通过减少中间结果的数量来进一步改进算法。在几张真实的图上进行了广泛的性能研究,其中一张包含数十亿条边。结果表明,我们的算法比所有其他现有技术算法高出一个数量级。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号