
Efficient mapping of fast Fourier transform on the Cyclops-64 multithreaded architecture.


Abstract

Emerging multi-core architectures offer massive on-chip parallelism through hardware support, but they also present great challenges to application developers and system software designers. In this work, we report our experience optimizing the Fast Fourier Transform (FFT) on the IBM Cyclops-64 (C64) architecture, a novel multi-core architecture that provides massive on-chip parallelism through 160 threads, an explicit memory hierarchy, and an interconnection network. C64 has no data cache and thus cannot take advantage of cache-oblivious algorithms. In addition, current implementations of the cache-oblivious method are based entirely on a cache memory hierarchy, which does not lend itself to constructing an accurate performance model, and such a model is critical for optimizing data movement through a Cyclops-64-style explicit memory hierarchy. Making a cache-oblivious FFT work efficiently on C64 is therefore probably non-trivial.

This thesis takes a different path. Its main contributions are: (1) An iterative search approach is proposed and implemented for the C64 architecture. The approach fully exploits the explicitly addressable on-chip memory hierarchy (without caches) to construct an accurate, deterministic performance model analytically. The model is used to rapidly calculate the performance of different "FFT computation plans", and the search-based optimization procedure uses these predicted performance numbers to choose among plans. (2) A new technique for optimizing scratch-pad memory utilization is proposed; it judiciously explores live-range splitting methods and, as our experiments show, achieves a significant performance gain. (3) The proposed methods are implemented on the C64 software and simulation toolchain, and a detailed scalability study of FFT on the C64 architecture is conducted. Furthermore, an in-depth analysis shows how the optimal (as opposed to maximum) number of threads is chosen in each situation, based on the tradeoff between computation power and synchronization overhead. The experimental results demonstrate up to a 25.5% speedup over the best known non-search-based method.
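To make the idea of searching over "FFT computation plans" with an analytic model concrete, the following is a minimal Python sketch, not the thesis's implementation. Everything in it is an assumption made for illustration: a plan is taken to be an ordered split of log2(N) into stage sizes, the scratch-pad is assumed to limit a stage to radix 8, and the cost constants and the names `plans`, `predicted_cycles`, and `best_plan` are hypothetical rather than C64 figures or APIs.

```python
# All constants are illustrative assumptions, not C64 measurements.
MEM_CYCLES_PER_POINT = 20       # assumed cost to load or store one complex point
FLOP_CYCLES_PER_POINT = 4       # assumed arithmetic cost per point per radix-2 level
BARRIER_CYCLES_PER_THREAD = 50  # assumed barrier cost that grows with thread count
MAX_LOG_RADIX = 3               # assumed scratch-pad limit: at most a radix-8 stage


def plans(log_n, max_log_radix=MAX_LOG_RADIX):
    """Yield every ordered split of log_n into stage sizes 1..max_log_radix."""
    if log_n == 0:
        yield ()
        return
    for r in range(1, min(max_log_radix, log_n) + 1):
        for rest in plans(log_n - r, max_log_radix):
            yield (r,) + rest


def predicted_cycles(plan, threads):
    """Toy analytic model: per-stage work divided among threads, one barrier per stage."""
    n = 1 << sum(plan)
    total = 0.0
    for r in plan:
        memory = 2 * n * MEM_CYCLES_PER_POINT     # each point is loaded and stored once
        compute = n * r * FLOP_CYCLES_PER_POINT   # r radix-2 levels fused into one stage
        total += (memory + compute) / threads     # work is split evenly across threads
        total += BARRIER_CYCLES_PER_THREAD * threads  # synchronization overhead
    return total


def best_plan(log_n, max_threads):
    """Exhaustively evaluate every (plan, thread count) pair and return the cheapest."""
    candidates = (
        (predicted_cycles(p, t), p, t)
        for p in plans(log_n)
        for t in range(1, max_threads + 1)
    )
    return min(candidates)


if __name__ == "__main__":
    cycles, plan, threads = best_plan(6, 160)
    print(f"plan={plan} threads={threads} predicted_cycles={cycles}")
```

Because the model is deterministic, each candidate is scored without running it, which is what makes an exhaustive search cheap. Under these made-up constants, a 64-point transform is predicted to run best on far fewer than the 160 available threads, since the barrier term eventually outweighs the gain from splitting the work further; this mirrors the optimal-versus-maximum thread count tradeoff described in the abstract.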

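Bibliographic record

  • Author

    Xue, Liping

  • Author affiliation

    University of Delaware, Department of Electrical and Computer Engineering

  • Degree-granting institution: University of Delaware, Department of Electrical and Computer Engineering
  • Subject: Engineering, Electronics and Electrical
  • Degree: M.S.
  • Year: 2007
  • Pages: 88 p.
  • Total pages: 88
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Radio electronics, telecommunications technology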