Journal: IEEE Transactions on Parallel and Distributed Systems

Computing programs containing band linear recurrences on vector supercomputers

Abstract

Many large-scale scientific and engineering computations, e.g., some of the Grand Challenge problems, spend a major portion of their execution time in core loops that compute band linear recurrences (BLRs). Conventional compiler parallelization techniques cannot generate scalable parallel code for this type of computation because they respect loop-carried dependences (LCDs) in programs, and a BLR offers only limited parallelism when its LCDs are respected. For many applications, using library routines to replace the core BLR requires separating the BLR from its dependent computation, which usually incurs significant overhead. In this paper, we present a new scalable algorithm, called the Regular Schedule, for parallel evaluation of BLRs. We describe our implementation of the Regular Schedule and discuss how to obtain maximum memory throughput when implementing the schedule on vector supercomputers. We also illustrate our approach, based on the Regular Schedule, to parallelizing programs containing BLRs and other kinds of code. For a range of programs containing BLRs, implementations in C using the Regular Schedule show significant improvements in CPU performance on the Convex C240 over the same programs implemented using highly optimized, assembly-coded BLAS routines [11]. Our approach can be used both at the user level, in parallel programming of code containing BLRs, and in compiler parallelization of such programs, combined with recurrence recognition techniques, for vector supercomputers.
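To make the loop-carried dependence concrete, the following is a minimal sketch, not the paper's Regular Schedule (which the abstract does not specify). It assumes the usual textbook form of an order-m band linear recurrence, x[i] = c[i] + a[i][0]*x[i-1] + ... + a[i][m-1]*x[i-m], with terms before the start of the sequence taken as zero. The function names `blr_sequential` and `first_order_recursive_doubling` are hypothetical. The second function uses recursive doubling, a different and well-known technique for the first-order case, shown only to illustrate how a recurrence can be restructured so that each pass consists of independent, vectorizable updates.

```python
def blr_sequential(a, c, m):
    """Evaluate x[i] = c[i] + sum_{j=1..m} a[i][j-1] * x[i-j] in order.

    Terms with i - j < 0 are treated as zero (an assumption of this sketch).
    Each x[i] needs x[i-1], ..., x[i-m]: this is the loop-carried dependence.
    """
    n = len(c)
    x = [0.0] * n
    for i in range(n):
        acc = c[i]
        for j in range(1, min(m, i) + 1):
            acc += a[i][j - 1] * x[i - j]
        x[i] = acc
    return x


def first_order_recursive_doubling(a, c):
    """Evaluate x[i] = a[i] * x[i-1] + c[i] (with x[-1] = 0) in about log2(n) passes.

    Invariant before each pass with stride d: x[i] = A[i] * x[i-d] + C[i],
    and x[i] = C[i] for i < d. Within one pass, every i is independent.
    """
    n = len(c)
    A = list(a)
    C = list(c)
    if n:
        A[0] = 0.0  # x[-1] is zero, so the first coefficient never contributes
    d = 1
    while d < n:
        new_A = A[:]
        new_C = C[:]
        for i in range(d, n):  # no dependence between iterations of this loop
            new_C[i] = A[i] * C[i - d] + C[i]
            new_A[i] = A[i] * A[i - d]
        A, C = new_A, new_C
        d *= 2
    return C


if __name__ == "__main__":
    coeffs = [0.0, 2.0, 3.0, 0.5]
    consts = [1.0, 1.0, 1.0, 1.0]
    print(blr_sequential([[v] for v in coeffs], consts, 1))
    print(first_order_recursive_doubling(coeffs, consts))
```

Both calls in the usage lines evaluate the same first-order recurrence, so they should print the same list. The doubling version performs more arithmetic overall than the sequential loop; that extra work is the usual price of exposing parallelism in a recurrence.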