
Logarithm-approximate floating-point multiplier is applicable to power-efficient neural network training



Abstract

Recently, emerging "edge computing" moves data and services from the cloud to nearby edge servers to achieve short latency and wide bandwidth and to address privacy concerns. However, edge servers, often embedded with GPU processors, strongly demand a solution for power-efficient neural network (NN) training because of their power and size constraints. Moreover, because the gradient values computed in NN training span a broad dynamic range, floating-point representation is more suitable. This paper proposes adopting a logarithm-approximate multiplier (LAM) for the multiply-accumulate (MAC) computation in NN training engines, where LAM approximates a floating-point multiplication as a fixed-point addition, resulting in smaller delay, fewer gates, and lower power consumption. We demonstrate the efficiency of LAM on two platforms: dedicated NN training hardware and an open-source GPU design. Compared with NN training using the exact multiplier, our implementation of the NN training engine for a 2-D classification dataset achieves a 10% speed-up and 2.3X improvements in power and area efficiency, respectively. LAM is also highly compatible with conventional bit-width scaling (BWS). When BWS is applied together with LAM on five test datasets, the implemented training engines achieve more than a 4.9X power-efficiency improvement with at most 1% accuracy degradation, of which a 2.2X improvement originates from LAM. The advantage of LAM can also be exploited in processors: a GPU design embedded with LAM, implemented on an FPGA and executing an NN-training workload, shows a 1.32X power-efficiency improvement, which reaches 1.54X with LAM + BWS. Finally, LAM-based training in deeper NNs is evaluated: up to a 4-hidden-layer NN, LAM-based training achieves accuracy highly comparable to that of the exact multiplier, even with aggressive BWS.
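The core idea stated in the abstract, approximating a floating-point multiplication by a fixed-point addition on the operands' bit patterns, can be illustrated with a short sketch. The C code below is a minimal Mitchell-style approximation for IEEE-754 single precision; it is not the paper's LAM hardware design, and the function name lam_mul and the test values are illustrative assumptions only.

/* Minimal sketch (assumption, not the paper's exact LAM circuit):
 * interpreting an IEEE-754 float's bit pattern as an integer gives a
 * piecewise-linear approximation of its base-2 logarithm, so adding two
 * bit patterns and re-centering the exponent bias approximates the
 * product. Zeros, denormals, infinities, and NaNs are not handled here. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float lam_mul(float a, float b)
{
    uint32_t ia, ib, sign, prod;
    memcpy(&ia, &a, sizeof ia);               /* view floats as raw bits   */
    memcpy(&ib, &b, sizeof ib);

    sign = (ia ^ ib) & 0x80000000u;           /* sign of the product       */
    ia &= 0x7FFFFFFFu;                        /* work on magnitudes only   */
    ib &= 0x7FFFFFFFu;

    /* Fixed-point addition of the approximate logarithms: add the two
     * bit patterns and subtract one exponent bias (127 << 23).            */
    prod = (ia + ib - 0x3F800000u) | sign;

    float r;
    memcpy(&r, &prod, sizeof r);
    return r;
}

int main(void)
{
    printf("exact: %f  approx: %f\n", 1.7f * 2.3f, lam_mul(1.7f, 2.3f));
    return 0;
}

Because the whole multiplication reduces to one integer addition (plus sign handling), a hardware realization needs no mantissa multiplier array, which is the source of the delay, gate-count, and power savings reported above.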
