Cluster Computing

An implementation of matrix–matrix multiplication on the Intel KNL processor with AVX-512

Abstract

The second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL), has recently emerged with a 2D tile mesh architecture and the Intel AVX-512 instruction set. However, it is difficult for general users to obtain maximum performance from the new architecture, since doing so requires familiarity with optimal cache reuse, efficient vectorization, and assembly language. In this paper, we illustrate several development strategies for achieving good performance in the C programming language, using general matrix–matrix multiplication as a case study and without resorting to assembly language. Our implementation is based on blocked matrix multiplication, an optimization technique that improves data reuse. We use data prefetching, loop unrolling, and the Intel AVX-512 instructions to optimize the blocked matrix multiplications. On a single core of the KNL, our implementation achieves up to 98% of the SGEMM performance and 99% of the DGEMM performance of the Intel MKL, the current state-of-the-art library. Our parallel DGEMM implementation using all 68 cores of the KNL achieves up to 90% of the performance of the Intel MKL DGEMM.
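The kernel below is a minimal illustrative sketch in C of the general techniques the abstract mentions: cache blocking, AVX-512 vectorization with fused multiply–add, and data prefetching. It is not the paper's tuned kernel; the function name dgemm_blocked_avx512, the block sizes MB/NB/KB, the row-major layout, and the assumption that n is a multiple of 8 are choices made only for this example, and the inner loops are kept simple rather than fully unrolled. The paper's parallel version additionally distributes the work across all 68 KNL cores.

```c
/*
 * Illustrative sketch only: a blocked DGEMM in C with AVX-512 intrinsics,
 * showing cache blocking, broadcast/FMA vectorization, and prefetching.
 * Block sizes and loop structure are hypothetical, not the paper's tuned
 * kernel. Compile with e.g. -O2 -mavx512f.
 */
#include <immintrin.h>
#include <stddef.h>

#define MB 64   /* hypothetical block sizes chosen for cache reuse */
#define NB 64
#define KB 64

/* C (m x n) += A (m x k) * B (k x n), row-major, n assumed a multiple of 8 */
void dgemm_blocked_avx512(size_t m, size_t n, size_t k,
                          const double *A, const double *B, double *C)
{
    for (size_t i0 = 0; i0 < m; i0 += MB)
      for (size_t j0 = 0; j0 < n; j0 += NB)
        for (size_t p0 = 0; p0 < k; p0 += KB) {
          size_t imax = i0 + MB < m ? i0 + MB : m;
          size_t jmax = j0 + NB < n ? j0 + NB : n;
          size_t pmax = p0 + KB < k ? p0 + KB : k;

          for (size_t i = i0; i < imax; ++i)
            for (size_t j = j0; j < jmax; j += 8) {       /* 8 doubles per zmm */
              __m512d c = _mm512_loadu_pd(&C[i*n + j]);   /* accumulator */
              for (size_t p = p0; p < pmax; ++p) {
                /* hint the next row of B into L1; prefetch never faults,
                   so reading one row past the block is harmless */
                _mm_prefetch((const char *)&B[(p+1)*n + j], _MM_HINT_T0);
                __m512d a = _mm512_set1_pd(A[i*k + p]);   /* broadcast A(i,p) */
                __m512d b = _mm512_loadu_pd(&B[p*n + j]); /* 8 elems of B(p,:) */
                c = _mm512_fmadd_pd(a, b, c);             /* c += a * b */
              }
              _mm512_storeu_pd(&C[i*n + j], c);
            }
        }
}
```

A production kernel would additionally unroll the i and j loops into a register-blocked micro-kernel so that several zmm accumulators stay live across the p loop, which is where most of the remaining gap to MKL-level performance is closed.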
