IEEE Transactions on Parallel and Distributed Systems

Model Parallelism Optimization for Distributed Inference Via Decoupled CNN Structure

Abstract

It is promising to deploy CNN inference on local end-user devices for high-accuracy and time-sensitive applications. Model parallelism has the potential to provide high throughput and low latency in distributed CNN inference. However, it is non-trivial to use model parallelism because the original CNN model has an inherently tightly-coupled structure. In this article, we propose DeCNN, a more effective inference approach that uses a decoupled CNN structure to optimize model parallelism for distributed inference on end-user devices. DeCNN consists of three novel schemes. Scheme-1 is structure-level optimization: it exploits group convolution and channel shuffle to decouple the original CNN structure for model parallelism. Scheme-2 is partition-level optimization: it partitions the convolutional layers by channel group and then uses an input-based method to partition the fully connected layers, exposing a higher degree of parallelism. Scheme-3 is communication-level optimization: it uses inter-sample parallelism to hide communication for better performance and robustness, especially over weak network connections. We use the ImageNet classification task to evaluate the effectiveness of DeCNN on a distributed multi-ARM platform. Notably, when scaling from 1 to 4 devices, DeCNN accelerates the inference of large-scale ResNet-50 by 3.21x and reduces the memory footprint by 65.3 percent, with a 1.29 percent accuracy improvement.
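The paper's own implementation is not reproduced here, but the structural idea behind Scheme-1 can be illustrated with a minimal PyTorch sketch. The names (DecoupledBlock, channel_shuffle) are hypothetical, not DeCNN's actual code; the block simply combines a group convolution with a ShuffleNet-style channel shuffle, so that each channel group computes independently and the shuffle is the only cross-group exchange point where devices would need to communicate.

```python
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Interleave channels across groups so information still mixes
    # between the otherwise-independent group-convolution branches.
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)


class DecoupledBlock(nn.Module):
    """Hypothetical decoupled block: the group convolution touches only
    its own channel group, so each group could run on a separate device;
    the shuffle is the single point of cross-group data exchange."""

    def __init__(self, channels: int, groups: int):
        super().__init__()
        self.groups = groups
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=1, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn(self.conv(x)))
        return channel_shuffle(x, self.groups)


if __name__ == "__main__":
    block = DecoupledBlock(channels=64, groups=4)
    y = block(torch.randn(1, 64, 56, 56))
    print(y.shape)  # torch.Size([1, 64, 56, 56])
```

Under this structure, partitioning by channel group (Scheme-2) amounts to slicing the convolution's weight tensor along the group dimension, and the shuffle boundary is where inter-device communication would occur and could be overlapped with computation (Scheme-3).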
