Cluster Computing

An implementation of matrix–matrix multiplication on the Intel KNL processor with AVX-512

Abstract

The second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL), has recently emerged with a 2D tile mesh architecture and the Intel AVX-512 instruction set. However, it is difficult for general users to obtain maximum performance from the new architecture, since doing so requires familiarity with optimal cache reuse, efficient vectorization, and assembly language. In this paper, we illustrate several development strategies for achieving good performance in the C programming language, using general matrix–matrix multiplication as a case study and without resorting to assembly language. Our implementation is based on blocked matrix multiplication, an optimization technique that improves data reuse. We use data prefetching, loop unrolling, and the Intel AVX-512 instructions to optimize the blocked matrix multiplications. On a single core of the KNL, our implementation achieves up to 98% of the SGEMM performance and 99% of the DGEMM performance of the Intel MKL, the current state-of-the-art library. Our parallel DGEMM implementation using all 68 cores of the KNL achieves up to 90% of the performance of the Intel MKL DGEMM.
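The kernel below is a minimal illustrative sketch in C of the general techniques the abstract mentions: cache blocking, AVX-512 vectorization with fused multiply–add, and data prefetching. It is not the paper's tuned kernel; the function name dgemm_blocked_avx512, the block sizes MB/NB/KB, the row-major layout, and the assumption that n is a multiple of 8 are choices made only for this example, and the inner loops are kept simple rather than fully unrolled. The paper's parallel version additionally distributes the work across all 68 KNL cores.

```c
/*
 * Illustrative sketch only: a blocked DGEMM in C with AVX-512 intrinsics,
 * showing cache blocking, broadcast/FMA vectorization, and prefetching.
 * Block sizes and loop structure are hypothetical, not the paper's tuned
 * kernel. Compile with e.g. -O2 -mavx512f.
 */
#include <immintrin.h>
#include <stddef.h>

#define MB 64   /* hypothetical block sizes chosen for cache reuse */
#define NB 64
#define KB 64

/* C (m x n) += A (m x k) * B (k x n), row-major, n assumed a multiple of 8 */
void dgemm_blocked_avx512(size_t m, size_t n, size_t k,
                          const double *A, const double *B, double *C)
{
    for (size_t i0 = 0; i0 < m; i0 += MB)
      for (size_t j0 = 0; j0 < n; j0 += NB)
        for (size_t p0 = 0; p0 < k; p0 += KB) {
          size_t imax = i0 + MB < m ? i0 + MB : m;
          size_t jmax = j0 + NB < n ? j0 + NB : n;
          size_t pmax = p0 + KB < k ? p0 + KB : k;

          for (size_t i = i0; i < imax; ++i)
            for (size_t j = j0; j < jmax; j += 8) {       /* 8 doubles per zmm */
              __m512d c = _mm512_loadu_pd(&C[i*n + j]);   /* accumulator */
              for (size_t p = p0; p < pmax; ++p) {
                /* hint the next row of B into L1; prefetch never faults,
                   so reading one row past the block is harmless */
                _mm_prefetch((const char *)&B[(p+1)*n + j], _MM_HINT_T0);
                __m512d a = _mm512_set1_pd(A[i*k + p]);   /* broadcast A(i,p) */
                __m512d b = _mm512_loadu_pd(&B[p*n + j]); /* 8 elems of B(p,:) */
                c = _mm512_fmadd_pd(a, b, c);             /* c += a * b */
              }
              _mm512_storeu_pd(&C[i*n + j], c);
            }
        }
}
```

A production kernel would additionally unroll the i and j loops into a register-blocked micro-kernel so that several zmm accumulators stay live across the p loop, which is where most of the remaining gap to MKL-level performance is closed.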
