IEEE Transactions on Parallel and Distributed Systems

A Performance Modeling and Optimization Analysis Tool for Sparse Matrix-Vector Multiplication on GPUs


Abstract

This paper presents a performance modeling and optimization analysis tool to predict and optimize the performance of sparse matrix-vector multiplication (SpMV) on GPUs. We make the following contributions: 1) We present an integrated analytical and profile-based performance model that accurately predicts the execution times of CSR, ELL, COO, and HYB SpMV kernels. Our proposed approach is general: it is limited neither to a particular GPU programming language nor to a specific GPU architecture. In this paper, we use CUDA-based SpMV kernels and an NVIDIA Tesla C2050 for our performance modeling and experiments. According to our experiments, for 77 out of 82 test cases, the difference between the predicted and measured execution times is less than 9 percent; for the remaining five test cases, the differences are between 9 and 10 percent. For the CSR, ELL, COO, and HYB SpMV CUDA kernels, the average differences are 6.3, 4.4, 2.2, and 4.7 percent, respectively. 2) Based on the performance model, we design a dynamic-programming-based SpMV optimal solution auto-selection algorithm that automatically reports an optimal solution (i.e., optimal storage strategy, storage format(s), and execution time) for a target sparse matrix. In our experiments, the average performance improvements of the optimal solutions are 41.1, 49.8, and 37.9 percent, compared to NVIDIA’s CSR, COO, and HYB CUDA kernels, respectively.
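
For readers unfamiliar with the storage formats named in the abstract, the following is a minimal sequential sketch in Python of SpMV (y = A·x) for the CSR, COO, and ELL layouts. It is not the paper's CUDA kernels and not its performance model. The function names, the use of -1 as the ELL padding marker, and the 3×3 example matrix are assumptions made purely for illustration. HYB is omitted; it is commonly described as an ELL part plus a COO part for the entries that overflow the ELL width.

```python
from typing import List


def spmv_csr(row_ptr: List[int], col_idx: List[int],
             values: List[float], x: List[float]) -> List[float]:
    """CSR: row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def spmv_coo(rows: List[int], cols: List[int], values: List[float],
             x: List[float], n_rows: int) -> List[float]:
    """COO: one (row, col, value) triple per nonzero."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, values):
        y[r] += v * x[c]
    return y


def spmv_ell(ell_cols: List[List[int]], ell_vals: List[List[float]],
             x: List[float]) -> List[float]:
    """ELL: every row is padded to the same width.

    Padding slots are marked with column -1 (an illustrative convention).
    """
    y = [0.0] * len(ell_cols)
    for row, (cols, vals) in enumerate(zip(ell_cols, ell_vals)):
        for c, v in zip(cols, vals):
            if c >= 0:  # skip padding slots
                y[row] += v * x[c]
    return y


# Hypothetical example matrix:
#   [[1, 0, 2],
#    [0, 3, 0],
#    [4, 0, 5]]
x = [1.0, 2.0, 3.0]

y_csr = spmv_csr([0, 2, 3, 5], [0, 2, 1, 0, 2],
                 [1.0, 2.0, 3.0, 4.0, 5.0], x)
y_coo = spmv_coo([0, 0, 1, 2, 2], [0, 2, 1, 0, 2],
                 [1.0, 2.0, 3.0, 4.0, 5.0], x, n_rows=3)
y_ell = spmv_ell([[0, 2], [1, -1], [0, 2]],
                 [[1.0, 2.0], [3.0, 0.0], [4.0, 5.0]], x)

print(y_csr, y_coo, y_ell)
```

All three layouts hold the same nonzeros and produce the same result vector; they differ in how the nonzeros are indexed and padded, and the abstract reports per-format execution-time predictions for exactly these layouts.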
