Performance Analysis and Optimization of Clang's OpenMP 4.5 GPU Support


Abstract

The Clang implementation of OpenMP® 4.5 now provides full support for the specification, offering the only open source option for targeting NVIDIA® GPUs. While using OpenMP allows portability across different architectures, matching native CUDA® performance without major code restructuring is an open research issue.

In order to analyze the current performance, we port a suite of representative benchmarks, and the mature mini-apps TeaLeaf, CloverLeaf, and SNAP to the Clang OpenMP 4.5 compiler. We then collect performance results for those ports, and their equivalent CUDA ports, on an NVIDIA Kepler GPU. Through manual analysis of the generated code, we are able to discover the root cause of the performance differences between OpenMP and CUDA.

A number of improvements can be made to the existing compiler implementation to enable performance that approaches that of hand-optimized CUDA. Our first observation was that the generated code did not use fused-multiply-add instructions, which was resolved using an existing flag. Next we saw that the compiler was not passing any loads through non-coherent cache, and added a new flag to the compiler to assist with this problem.

We then observed that the compiler partitioning of threads and teams could be improved upon for the majority of kernels, which guided work to ensure that the compiler can pick more optimal defaults. We uncovered a register allocation issue with the existing implementation that, when fixed alongside the other issues, enables performance that is close to CUDA.

Finally, we use some different kernels to emphasize that support for managing memory hierarchies needs to be introduced into the specification, and propose a simple option for programming shared caches.
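
To make the teams and threads partitioning mentioned in the abstract concrete, here is a minimal sketch of an OpenMP 4.5 offloaded loop in C. It is not taken from the paper: the function name `daxpy_target` and the `teams` and `threads` parameters are hypothetical, and the values passed for them are illustrative only. The `num_teams` and `thread_limit` clauses are standard OpenMP 4.5 and let a programmer override whatever defaults the compiler would pick. The loop body is a multiply-add, which is the kind of expression a compiler may contract into a fused-multiply-add instruction. Clang's `-ffp-contract` option controls such contraction, but the abstract does not say which flag the authors used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the paper): y[i] = a * x[i] + y[i].
 * 'teams' and 'threads' are illustrative overrides of the compiler's
 * default partitioning. When compiled without OpenMP support the pragma
 * is ignored and the loop simply runs serially on the host. */
void daxpy_target(int n, double a, const double *x, double *y,
                  int teams, int threads)
{
    assert(x != NULL && y != NULL && n >= 0);

    /* Offload the loop, copy x to the device, and copy y both ways. */
    #pragma omp target teams distribute parallel for \
        num_teams(teams) thread_limit(threads) \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        /* Multiply-add: a candidate for FMA contraction. */
        y[i] = a * x[i] + y[i];
    }
}
```

Calling `daxpy_target(n, a, x, y, 2, 64)` requests at most 2 teams of at most 64 threads each. Timing the same kernel with different values is one simple way to see how sensitive a kernel is to this partitioning.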
