Journal: IEEE Transactions on Parallel and Distributed Systems

Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on Reconfigurable Computing Systems

Abstract

The abundant hardware resources on current reconfigurable computing systems provide new opportunities for high-performance parallel implementations of scientific computations. In this paper, we study designs for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, on reconfigurable computing systems. We first analyze design trade-offs in implementing this kernel. These trade-offs are caused by the inherent parallelism of matrix multiplication and the resource constraints, including the number of configurable slices, the size of on-chip memory, and the available memory bandwidth. We propose three parameterized algorithms which can be tuned according to the problem size and the available hardware resources. Our algorithms employ a linear array architecture with simple control logic. This architecture effectively utilizes the available resources and reduces routing complexity. The Processing Elements (PEs) used in our algorithms are modular so that it is easy to embed floating-point units into them. Experimental results on a Xilinx Virtex-II Pro XC2VP100 show that our algorithms achieve good scalability and high sustained GFLOPS performance. We also implement our algorithms on Cray XD1. XD1 is a high-end reconfigurable computing system that employs both general-purpose processors and reconfigurable devices. Our algorithms achieve a sustained performance of 2.06 GFLOPS on a single node of XD1.
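To make the linear-array idea concrete, here is a minimal software sketch. It is not one of the three algorithms from the paper, which the abstract does not spell out. It assumes a hypothetical cyclic layout in which each PE owns a subset of the columns of B and C, while elements of A stream past every PE in turn. The names `linear_array_matmul`, `num_pes`, and `sustained_gflops` are illustrative, and the GFLOPS helper uses the conventional count of 2n^3 floating-point operations for an n x n multiplication.

```python
def linear_array_matmul(A, B, num_pes):
    """Simulate C = A * B on a hypothetical linear array of num_pes PEs.

    Assumption (not from the paper): PE j owns columns j, j + num_pes, ...
    of B and accumulates the matching columns of C in its local memory.
    """
    n, k, m = len(A), len(B), len(B[0])
    # Cyclic assignment of columns to PEs; a PE may own no columns if num_pes > m.
    owned = [list(range(j, m, num_pes)) for j in range(num_pes)]
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for t in range(k):
            a = A[i][t]  # one element of A enters the array
            for cols in owned:  # ...and is passed from PE to PE
                for c in cols:
                    # Each PE does a multiply-accumulate on its own columns.
                    C[i][c] += a * B[t][c]
    return C


def sustained_gflops(n, seconds):
    """GFLOPS for an n x n matrix product, counting 2 * n**3 operations."""
    return 2 * n**3 / seconds / 1e9


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    print(linear_array_matmul(A, B, num_pes=2))
    print(sustained_gflops(1000, 1.0))
```

In this sketch, `num_pes` stands in for the kind of tuning parameter the abstract mentions: more PEs means more columns are processed in parallel, at the cost of more slices and on-chip memory.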
