IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Toward an Efficient Deep Pipelined Template-Based Architecture for Accelerating the Entire 2-D and 3-D CNNs on FPGA

Abstract

3-D convolutional neural networks (3-D CNNs) are used efficiently in many computer vision applications. Most previous work in this area has concentrated only on the design and optimization of accelerators for 2-D CNNs; few attempts have been made to accelerate 3-D CNNs on FPGA. We find the acceleration of 3-D CNNs on FPGA to be challenging due to their high computational complexity and storage demands. More importantly, although the computational patterns of 2-D and 3-D CNNs are analogous, the conventional approaches adopted for the acceleration of 2-D CNNs may be unfit for 3-D CNN acceleration. In this paper, in order to accelerate 2-D and 3-D CNNs within a uniform framework, we first propose a uniform template-based architecture that uses templates based on the Winograd algorithm to enable the rapid development of 2-D and 3-D CNN accelerators. Then, with the aim of efficiently mapping all layers of 2-D/3-D CNNs onto a pipelined accelerator, techniques are developed to improve the throughput and computational efficiency of the accelerator, including layer fusion, layer clustering, and a workload-balancing scheme. Finally, we demonstrate the effectiveness of the deep pipelined architecture by accelerating real-life 2-D and 3-D CNNs on a state-of-the-art FPGA platform. On the VCU118, we achieve 3.7 TOPS for VGG-16, which outperforms state-of-the-art FPGA-based CNN accelerators. Comparisons with CPU and GPU solutions show that our 3-D CNN implementation achieves performance and energy-efficiency gains of up to 17.8x and 64.2x, respectively, over a CPU solution, and a 5.0x energy-efficiency gain over a GPU solution.
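The templates in the proposed architecture are built on the Winograd algorithm, which trades multiplications for additions. As a minimal illustration of the idea (not the authors' actual templates), the 1-D case F(2,3) below computes two outputs of a 3-tap convolution with 4 multiplications instead of the 6 a direct sliding window needs; the filter-side transform values are hypothetical names chosen here for clarity.

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 1-D, 3-tap 'valid' convolution
    of input d (4 elements) with filter g (3 elements), using only
    4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (can be precomputed once per filter).
    G0 = g0
    G1 = (g0 + g1 + g2) / 2
    G2 = (g0 - g1 + g2) / 2
    G3 = g2
    # Input transform followed by 4 element-wise multiplications.
    m0 = (d0 - d2) * G0
    m1 = (d1 + d2) * G1
    m2 = (d2 - d1) * G2
    m3 = (d1 - d3) * G3
    # Output transform.
    return [m0 + m1 + m2, m1 - m2 - m3]

def direct_conv(d, g):
    """Reference: direct sliding-window convolution, 6 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

In the 2-D and 3-D cases the same nesting applies tile by tile, which is why a single templated datapath can serve both kinds of CNN layers.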

