Architecture, Mapping Algorithms and Physical Design of Mesh-of-Functional-Units FPGA Overlays for Pipelined Execution of Data Flow Graphs


Abstract

FPGAs can deliver high performance but their programmability wall hinders widespread use: they require hardware expertise and their CAD tools have long compile times. We tackle this challenge by exploring overlays: pre-compiled FPGA circuits that are themselves programmable via software-familiar models without FPGA CAD tools.

We propose a high-performance mesh-of-functional-units overlay architecture that projects a model of pipelined execution of data flow graphs (DFGs). It consists of cells, each containing a functional unit (FU) and routing logic, with elastic pipelines and FIFOs in every routing hop. The architecture realizes latency insensitive data-driven execution, facilitates high Fmax and scales to large mesh sizes. We design a DFG-to-overlay mapping algorithm that places, routes, and balances DFGs on the overlay for high throughput. We also propose a bottom-up CAD flow based on partitioning and floorplanning of an overlay into tiles. The flow maintains high Fmax for large overlays and enables parallel compilation and quick stitching of tiles from a pre-compiled library.

We prototype two overlays on a Stratix IV FPGA that has 212K ALMs: a 355 MHz 24x16 integer overlay and a 312 MHz 18x16 floating-point overlay. We map 16 DFGs and show that the two overlays deliver throughput of up to 37 GOPS and 22 GFLOPS, respectively. The DFG mapping is fast, taking less than 7 seconds. The tile-based bottom-up flow achieves 37% higher Fmax than the flat flow (the default CAD flow), with only 8% more resources. Compared to the flat flow, which compiles an overlay in 4 hours, the bottom-up flow stitches together an overlay from pre-compiled tiles in 35 minutes and can compile one from scratch in one hour, if tiles are compiled in parallel.

The resource overhead of implementing the DFGs on the floating-point and integer overlays as opposed to compiling them directly to the FPGA averages to 4x and 9x, respectively. The compile time across the DFGs is reduced by 1500x.

Our work demonstrates the feasibility of designing high-performance overlays that project a software-familiar programming model, scale with increasing FPGA resources and provide a considerable reduction in compile time. These benefits come with a modest resource overhead.
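
The data-driven execution model described in the abstract can be illustrated with a toy software sketch. This is not the thesis's implementation: the `Node` class, the `run` function, the FIFO depth of 2, and the example DFG `(a + b) * c` are all assumptions made for illustration. It shows only the firing rule of an elastic pipeline, where a functional unit fires when every input FIFO holds a token and every output FIFO has room.

```python
from collections import deque


class Node:
    """Hypothetical functional unit in a data flow graph."""

    def __init__(self, name, op, inputs, outputs):
        self.name = name
        self.op = op
        self.inputs = inputs    # list of deques feeding this node
        self.outputs = outputs  # list of deques this node writes to

    def can_fire(self, fifo_depth):
        # Fire only if all operands are present and no output FIFO is full.
        return (all(len(q) > 0 for q in self.inputs)
                and all(len(q) < fifo_depth for q in self.outputs))

    def fire(self):
        args = [q.popleft() for q in self.inputs]
        result = self.op(*args)
        for q in self.outputs:
            q.append(result)


def run(nodes, sink, expected, fifo_depth=2, max_cycles=1000):
    """Simulate cycles until `expected` results have left the sink FIFO."""
    results = []
    for _ in range(max_cycles):
        # Readiness is decided from the state at the start of the cycle.
        ready = [n for n in nodes if n.can_fire(fifo_depth)]
        for n in ready:
            n.fire()
        while sink:
            results.append(sink.popleft())
        if len(results) >= expected:
            break
    return results


# Example DFG (assumed for illustration): out = (a + b) * c
a_in = deque([1, 2, 3])
b_in = deque([10, 20, 30])
c_in = deque([2, 2, 2])
s = deque()    # FIFO between the adder and the multiplier
out = deque()  # sink FIFO

add = Node("add", lambda x, y: x + y, [a_in, b_in], [s])
mul = Node("mul", lambda x, y: x * y, [s, c_in], [out])

results = run([add, mul], out, expected=3)
print(results)
```

Because each node checks only its own FIFOs, the result does not depend on how many cycles a token spends in transit, which is the sense in which such a pipeline is latency insensitive. In this sketch the adder and multiplier overlap on successive inputs once the first token reaches the multiplier.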

Bibliographic record

  • Author

    Capalija, Davor

  • Author affiliation

    University of Toronto (Canada)

  • Degree-granting institution: University of Toronto (Canada)
  • Subjects: Computer engineering; Electrical engineering; Computer science
  • Degree: Ph.D.
  • Year: 2017
  • Pages: 278 p.
  • Total pages: 278
  • Original format: PDF
  • Language of text: eng
