IEEE Transactions on Parallel and Distributed Systems

Optimizing Depthwise Separable Convolution Operations on GPUs


Abstract

The depthwise separable convolution is commonly used in convolutional neural networks (CNNs) to reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target accelerating model training with large batch sizes, where a large number of samples are processed at once. Such approaches are inadequate for small-batch-size model training and for the typical inference scenario where the model takes in only a few samples at a time. This article aims to bridge this gap by optimizing depthwise separable convolutions for the GPU architecture. We achieve this by designing two novel algorithms that improve column and row reuse in the convolution operation, reducing the number of memory operations performed along the width and height dimensions of the 2D convolution. Our approach employs a dynamic tile size scheme to adaptively distribute the computational data across GPU threads, improving GPU utilization and hiding memory access latency. We apply our approach on two GPU platforms, an NVIDIA RTX 2080Ti GPU and an embedded NVIDIA Jetson AGX Xavier GPU, and two data types, 32-bit floating point (FP32) and 8-bit integer (INT8). We compare our approach against cuDNN, which is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 2x (up to 3x) performance improvement over cuDNN. With a moderate batch size, our approach reduces the end-to-end training time of MobileNet and EfficientNet by 9.7 and 7.3 percent on average, respectively, and reduces their end-to-end inference time by 12.2 and 11.6 percent, respectively.
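For context on the operation being optimized: in a depthwise separable convolution, each input channel is filtered independently by one KxK depthwise filter, and the channel outputs are then combined by a 1x1 pointwise convolution, which reduces the multiply-accumulate count of a standard convolution by roughly a factor of 1/C_out + 1/K^2. The sketch below is a minimal, hypothetical baseline for the depthwise stage only, not the authors' kernel: a naive FP32 CUDA kernel with one thread per output element and no row or column reuse, so every thread re-reads its full KxK input window from global memory. The kernel name, the 3x3 filter size, stride 1, and "same" zero padding are illustrative assumptions; the redundant memory traffic it exhibits is the kind of overhead the paper's reuse algorithms and dynamic tiling aim to remove.

// Hypothetical baseline (not the authors' implementation): naive FP32 depthwise
// 2D convolution, one thread per output element, one channel per block in z.
#include <cuda_runtime.h>

#define K 3  // assumed 3x3 depthwise filter, stride 1, "same" zero padding

__global__ void depthwise_conv2d_naive(const float* __restrict__ in,    // [C, H, W]
                                       const float* __restrict__ filt,  // [C, K, K]
                                       float* __restrict__ out,         // [C, H, W]
                                       int C, int H, int W)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    int c = blockIdx.z;                             // channel: one filter per channel
    if (x >= W || y >= H || c >= C) return;

    const int pad = K / 2;
    float acc = 0.0f;
    // Each thread reloads its K x K input window from global memory; neighboring
    // threads reload mostly the same rows and columns, which is the redundancy
    // that column/row reuse eliminates.
    for (int ky = 0; ky < K; ++ky) {
        for (int kx = 0; kx < K; ++kx) {
            int iy = y + ky - pad;
            int ix = x + kx - pad;
            if (iy >= 0 && iy < H && ix >= 0 && ix < W)
                acc += in[(c * H + iy) * W + ix] * filt[(c * K + ky) * K + kx];
        }
    }
    out[(c * H + y) * W + x] = acc;
}

// Example launch configuration: a 16x16 thread tile per block, one block layer per channel.
// dim3 block(16, 16);
// dim3 grid((W + 15) / 16, (H + 15) / 16, C);
// depthwise_conv2d_naive<<<grid, block>>>(d_in, d_filt, d_out, C, H, W);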
