Conference: International Conference on Computer Design

CNN-MERP: An FPGA-based memory-efficient reconfigurable processor for forward and backward propagation of convolutional neural networks

Abstract

Large-scale deep convolutional neural networks (CNNs) are widely used in machine learning applications. Because CNNs are computationally very demanding, VLSI (ASIC and FPGA) chips, which deliver high-density integration of computational resources, are regarded as a promising platform for CNN implementation. With massively parallel computational units, however, the external memory bandwidth, which is constrained by the pin count of the VLSI chip, becomes the system bottleneck. Moreover, VLSI solutions are usually regarded as lacking the flexibility to be reconfigured for the various parameters of CNNs. This paper presents CNN-MERP to address these issues. CNN-MERP incorporates an efficient memory hierarchy that significantly reduces the bandwidth requirements through multiple optimizations, including on/off-chip data allocation, data flow optimization and data reuse. The proposed 2-level reconfigurability, which is based on the control logic and the multiboot feature of the FPGA, enables fast and efficient reconfiguration. As a result, an external memory bandwidth requirement of 1.94 MB/GFlop is achieved, which is 55% lower than prior art. Under limited DRAM bandwidth, a system throughput of 1244 GFlop/s is achieved on the Virtex UltraScale platform, which is 5.48 times higher than state-of-the-art FPGA implementations.
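
The figures quoted in the abstract imply a few quantities it does not state directly. The short Python sketch below works them out. It is only arithmetic on the abstract's numbers, not a result reported by the authors, and it rests on two assumptions: that "55% lower" means the new requirement equals 45% of the prior one, and that "5.48 times higher" means the new throughput equals 5.48 times the prior one. The function name `implied_figures` is hypothetical.

```python
# Figures quoted in the abstract.
BANDWIDTH_PER_GFLOP_MB = 1.94   # external memory traffic, MB per GFlop
THROUGHPUT_GFLOPS = 1244.0      # system throughput, GFlop/s
REDUCTION_VS_PRIOR = 0.55       # "55% lower than prior art"
SPEEDUP_VS_PRIOR = 5.48         # "5.48 times higher"


def implied_figures():
    """Derive quantities the abstract implies but does not state."""
    # Sustained DRAM traffic = (MB per GFlop) * (GFlop per second).
    dram_mb_per_s = BANDWIDTH_PER_GFLOP_MB * THROUGHPUT_GFLOPS
    # Assumption: "55% lower" means new = prior * (1 - 0.55).
    prior_mb_per_gflop = BANDWIDTH_PER_GFLOP_MB / (1.0 - REDUCTION_VS_PRIOR)
    # Assumption: "5.48 times higher" means new = prior * 5.48.
    prior_gflops = THROUGHPUT_GFLOPS / SPEEDUP_VS_PRIOR
    return dram_mb_per_s, prior_mb_per_gflop, prior_gflops


if __name__ == "__main__":
    dram, prior_bw, prior_tp = implied_figures()
    print(f"Implied DRAM traffic at peak throughput: {dram:.2f} MB/s")
    print(f"Implied prior bandwidth requirement:     {prior_bw:.2f} MB/GFlop")
    print(f"Implied prior FPGA throughput:           {prior_tp:.1f} GFlop/s")
```

Under these assumptions, the quoted throughput corresponds to roughly 2.4 thousand MB/s of external memory traffic, which illustrates why the bandwidth requirement per GFlop, rather than raw compute, is presented as the limiting factor.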
