Performance Analysis and Optimization of Clang's OpenMP 4.5 GPU Support


Abstract

The Clang implementation of OpenMP® 4.5 now provides full support for the specification, offering the only open source option for targeting NVIDIA® GPUs. While using OpenMP allows portability across different architectures, matching native CUDA® performance without major code restructuring is an open research issue.

In order to analyze the current performance, we port a suite of representative benchmarks, and the mature mini-apps TeaLeaf, CloverLeaf, and SNAP, to the Clang OpenMP 4.5 compiler. We then collect performance results for those ports, and their equivalent CUDA ports, on an NVIDIA Kepler GPU. Through manual analysis of the generated code, we are able to discover the root cause of the performance differences between OpenMP and CUDA.

A number of improvements can be made to the existing compiler implementation to enable performance that approaches that of hand-optimized CUDA. Our first observation was that the generated code did not use fused-multiply-add instructions, which was resolved using an existing flag. Next we saw that the compiler was not passing any loads through non-coherent cache, and added a new flag to the compiler to assist with this problem.

We then observed that the compiler partitioning of threads and teams could be improved upon for the majority of kernels, which guided work to ensure that the compiler can pick more optimal defaults. We uncovered a register allocation issue with the existing implementation that, when fixed alongside the other issues, enables performance that is close to CUDA.

Finally, we use some different kernels to emphasize that support for managing memory hierarchies needs to be introduced into the specification, and propose a simple option for programming shared caches.
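The abstract refers to the compiler's partitioning of threads and teams. As a minimal sketch, and not code taken from the paper, the C function below shows how an OpenMP 4.5 target region can set that partitioning explicitly through the `num_teams` and `thread_limit` clauses instead of relying on compiler defaults. The function name `scaled_add`, its parameters, and the kernel itself are hypothetical, chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical kernel: a[i] = b[i] + scalar * c[i].
 * With an offload-capable OpenMP 4.5 compiler the loop runs on the target
 * device; without OpenMP support the pragma is ignored and the loop runs
 * serially on the host. */
void scaled_add(double *a, const double *b, const double *c,
                double scalar, int n, int n_teams, int n_threads)
{
    /* Host-side sanity check of the requested partitioning. */
    assert(n >= 0 && n_teams > 0 && n_threads > 0);

    /* num_teams and thread_limit request an explicit teams/threads split;
     * the map clauses copy b and c to the device and a back to the host. */
    #pragma omp target teams distribute parallel for \
        num_teams(n_teams) thread_limit(n_threads) \
        map(to: b[0:n], c[0:n]) map(from: a[0:n])
    for (int i = 0; i < n; ++i) {
        /* A multiply followed by an add: the kind of expression a compiler
         * may contract into a fused-multiply-add instruction. */
        a[i] = b[i] + scalar * c[i];
    }
}
```

A caller would pass host arrays of length `n` together with the desired team count and per-team thread limit. Which values perform best depends on the kernel and the device, which is presumably why the abstract argues for better compiler defaults.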


Bibliographic record

Source
2016 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems | 2016 | pp. 54-64 | 11 pages
Conference location: Salt Lake City (US)
Authors
Matt Martineau; Simon McIntosh-Smith; Carlo Bertolli; Arpith C. Jacob; Samuel F. Antao; Alexandre Eichenberger; Gheorghe-Teodor Bercea; Tong Chen; Tian Jin; Kevin O'Brien; Georgios Rokos; Hyojin Sung; Zehra Sura
Author affiliations

Univ. of Bristol, Bristol, UK (first two authors)
IBM T.J. Watson Res. Lab., NY, USA (remaining eleven authors)

Original format: PDF
Language: English
Keywords
Kernel; Graphics processing units; Benchmark testing; Registers; Optimization; Ports (Computers); Programming


Similar literature


1. OpenMP 4.5 compiler optimization for GPU offloading [J]. E. Tiotto, B. Mahjour, W. Tsang, IBM Journal of Research and Development. 2020, no. 3/4

2. High performance data clustering: a comparative analysis of performance for GPU, RASC, MPI, and OpenMP implementations [J]. Luobin Yang, Steve C. Chiu, Wei-Keng Liao, Journal of Supercomputing. 2014, no. 1

3. Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 Perlmutter system [J]. Yang Charlene, Kurth Thorsten, Williams Samuel, Concurrency and Computation: Practice and Experience. 2020, no. 20

4. Performance Analysis and Optimization of Clang's OpenMP 4.5 GPU Support [C]. Matt Martineau, Simon McIntosh-Smith, Carlo Bertolli, International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems; International Conference for High Performance Computing, Networking, Storage and Analysis. 2016

5. Analysis and Performance Optimization of a GPGPU Implementation of Image Quality Assessment (IQA) Algorithm VSNR [D]. Gupta, Ayush. 2017

6. High Performance Data Clustering: A Comparative Analysis of Performance for GPU, RASC, MPI, and OpenMP Implementations [O]. Luobin Yang, Steve C. Chiu, Wei-Keng Liao

7. The Productivity, Portability and Performance of OpenMP 4.5 for Scientific Applications Targeting Intel CPUs, IBM CPUs, and NVIDIA GPUs [O]. Martineau, Matt; McIntosh-Smith, Simon. 2018

