International Journal of High Performance Computing Applications
Multi-core and many-core shared-memory parallel raycasting volume rendering optimization and tuning



Abstract

Given the computing industry trend of increasing processing capacity by adding more cores to a chip, the focus of this work is tuning the performance of a staple visualization algorithm, raycasting volume rendering, for shared-memory parallelism on multi-core CPUs and many-core GPUs. Our approach is to vary tunable algorithmic settings, along with known algorithmic optimizations and two different memory layouts, and measure performance in terms of absolute runtime and L2 memory cache misses. Our results indicate there is a wide variation in runtime performance on all platforms, as much as 254% for the tunable parameters we test on multi-core CPUs and 265% on many-core GPUs, and the optimal configurations vary across platforms, often in a non-obvious way. For example, our results indicate the optimal configurations on the GPU occur at a crossover point between those that maintain good cache utilization and those that saturate computational throughput. This result is likely to be extremely difficult to predict with an empirical performance model for this particular algorithm because it has an unstructured memory access pattern that varies locally for individual rays and globally for the selected viewpoint. Our results also show that optimal parameters on modern architectures are markedly different from those in previous studies run on older architectures. In addition, given the dramatic performance variation across platforms for both optimal algorithm settings and performance results, there is a clear benefit for production visualization and analysis codes to adopt a strategy for performance optimization through auto-tuning. These benefits will likely become more pronounced in the future as the number of cores per chip and the cost of moving data through the memory hierarchy both increase.
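To make the tuning strategy concrete, the following is a minimal, self-contained sketch (not the authors' code) of the kind of parameter sweep the abstract describes: a toy C++ raycaster over a synthetic volume is timed for several image-tile sizes, a hypothetical stand-in for the paper's tunable work-decomposition parameters, and the fastest configuration is reported. The actual study also measures L2 cache misses with hardware counters, which this sketch omits.

```cpp
// Minimal auto-tuning sketch: sweep a hypothetical tunable (image tile size)
// for a toy raycaster and report the fastest configuration.
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

constexpr int VOL = 128;   // synthetic volume resolution
constexpr int IMG = 256;   // output image resolution

// Nearest-neighbour sample of the synthetic density field, clamped to bounds.
static float sample(const std::vector<float>& vol, float x, float y, float z) {
    int xi = std::min(VOL - 1, std::max(0, (int)x));
    int yi = std::min(VOL - 1, std::max(0, (int)y));
    int zi = std::min(VOL - 1, std::max(0, (int)z));
    return vol[(zi * VOL + yi) * VOL + xi];
}

// Cast one axis-aligned ray with front-to-back compositing and
// early-ray termination once opacity saturates.
static float cast_ray(const std::vector<float>& vol, int px, int py, float step) {
    float color = 0.0f, alpha = 0.0f;
    float x = px * (float)VOL / IMG, y = py * (float)VOL / IMG;
    for (float z = 0.0f; z < VOL && alpha < 0.98f; z += step) {
        float s = sample(vol, x, y, z);
        float a = s * 0.05f;                 // toy transfer function
        color += (1.0f - alpha) * a * s;
        alpha += (1.0f - alpha) * a;
    }
    return color;
}

int main() {
    // Synthetic volume: a radial falloff so rays traverse varying densities.
    std::vector<float> vol(VOL * VOL * VOL);
    for (int z = 0; z < VOL; ++z)
        for (int y = 0; y < VOL; ++y)
            for (int x = 0; x < VOL; ++x) {
                float dx = x - VOL / 2.0f, dy = y - VOL / 2.0f, dz = z - VOL / 2.0f;
                vol[(z * VOL + y) * VOL + x] =
                    std::exp(-(dx * dx + dy * dy + dz * dz) / (VOL * VOL * 0.1f));
            }

    std::vector<float> image(IMG * IMG);
    const int tile_sizes[] = {8, 16, 32, 64};   // hypothetical tunable values
    double best_ms = 1e30;
    int best_tile = 0;

    for (int tile : tile_sizes) {
        auto t0 = std::chrono::steady_clock::now();
        // Tiled traversal: tile size controls the work-decomposition
        // granularity, one of the knobs an auto-tuner would sweep.
        #pragma omp parallel for collapse(2) schedule(dynamic)
        for (int ty = 0; ty < IMG; ty += tile)
            for (int tx = 0; tx < IMG; tx += tile)
                for (int py = ty; py < ty + tile && py < IMG; ++py)
                    for (int px = tx; px < tx + tile && px < IMG; ++px)
                        image[py * IMG + px] = cast_ray(vol, px, py, 0.5f);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("tile %2d: %8.2f ms\n", tile, ms);
        if (ms < best_ms) { best_ms = ms; best_tile = tile; }
    }
    std::printf("best tile size: %d (%.2f ms)\n", best_tile, best_ms);
    return 0;
}
```

Compiled with OpenMP enabled (e.g. -fopenmp), the pragma parallelizes the tiled loop across cores, matching the shared-memory setting of the paper; without it the sketch still runs serially, which is enough to see how an auto-tuner would compare candidate configurations.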
