Autotuning GEMM Kernels for the Fermi GPU

Kurzak Jakub; Tomov Stanimire; Dongarra Jack

Abstract

In recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected with error correction codes. Being the crucial component of numerical software packages, such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the more important workloads to be implemented on these devices. This paper presents a methodology for producing matrix multiplication kernels tuned for a specific architecture, through a canonical process of heuristic autotuning, based on generation of multiple code variants and selecting the fastest ones through benchmarking. The key contribution of this work is in the method for generating the search space; specifically, pruning it to a manageable size. Performance numbers match or exceed other available implementations.
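The generate, prune, and benchmark loop described in the abstract can be illustrated with a small CPU-only sketch. This is not the paper's code generator, and the abstract does not state the authors' actual pruning criteria. The tile parameters `(bm, bn, bk)`, the divisibility rule, the `budget` working-set limit, and all function names are assumptions made up for this illustration. A pure-Python blocked multiply stands in for a generated GPU kernel.

```python
import itertools
import time


def full_space(sizes):
    """Every (bm, bn, bk) tile-shape combination, before pruning."""
    return list(itertools.product(sizes, repeat=3))


def prune(space, n, budget):
    """Keep tile shapes that divide n and fit a working-set budget.

    Both rules are hypothetical stand-ins for hardware constraints.
    """
    kept = []
    for bm, bn, bk in space:
        divides = n % bm == 0 and n % bn == 0 and n % bk == 0
        fits = bm * bk + bk * bn <= budget  # one tile of A plus one tile of B
        if divides and fits:
            kept.append((bm, bn, bk))
    return kept


def blocked_gemm(a, b, n, bm, bn, bk):
    """C = A * B for n-by-n lists of lists, computed tile by tile."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bm):
        for j0 in range(0, n, bn):
            for k0 in range(0, n, bk):
                for i in range(i0, i0 + bm):
                    row_a, row_c = a[i], c[i]
                    for k in range(k0, k0 + bk):
                        aik, row_b = row_a[k], b[k]
                        for j in range(j0, j0 + bn):
                            row_c[j] += aik * row_b[j]
    return c


def benchmark(variant, a, b, n, repeats=3):
    """Best-of-`repeats` wall-clock time in seconds for one tile shape."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        blocked_gemm(a, b, n, *variant)
        best = min(best, time.perf_counter() - start)
    return best


def autotune(a, b, n, sizes, budget):
    """Prune the search space, time every survivor, return the fastest."""
    candidates = prune(full_space(sizes), n, budget)
    timings = {v: benchmark(v, a, b, n) for v in candidates}
    best = min(timings, key=timings.get)
    return best, timings


if __name__ == "__main__":
    n = 16
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    sizes = (1, 2, 3, 4, 8, 16)
    best, timings = autotune(a, b, n, sizes, budget=64)
    print("candidates before pruning:", len(full_space(sizes)))
    print("candidates after pruning: ", len(timings))
    print("fastest tile shape (bm, bn, bk):", best)
```

The point of the sketch is the ordering of the steps: cheap static checks discard most of the space before any timing is done, so the expensive benchmarking step only runs on variants that could plausibly be used. Which tile shape wins depends on the machine, so no expected output is given.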

Bibliographic details

Source: Parallel and Distributed Systems, IEEE Transactions on, 2012, Issue 11, pp. 2045-2057 (13 pages)
Authors: Kurzak Jakub; Tomov Stanimire; Dongarra Jack
Affiliation: University of Tennessee, Knoxville
Original format: PDF
Language: English
Keywords: BLAS; CUDA; GEMM; Graphics processing unit; automatic tuning; code generation; matrix multiplication
