Toward Efficient SIMT Execution---A Microarchitecture Perspective.

Abstract

The design philosophy of many-core architectures such as graphics processing units (GPUs) is to exploit thread-level parallelism (TLP) to achieve high throughput. Compared to central processing unit (CPU) designs, GPU-like many-core architectures spend their on-die area mainly on computation rather than on complex instruction processing, and are therefore more energy efficient.

In this dissertation, we identify several inefficiencies of current GPU designs and propose architectural designs for higher performance and better energy efficiency. First, I will present our study on eliminating computational redundancy within GPGPUs. According to our study, there is significant computational redundancy in SIMD execution, where different execution lanes operate on the same operand values. Besides redundancy within a uniform vector, different vectors can also hold identical values. Therefore, we propose detailed architecture designs that exploit both types of redundancy for performance improvements and energy savings. For redundancy within a uniform vector, we propose either to extend the vector register file with token bits or to add a separate small scalar register file, eliminating redundant computations as well as redundant data storage. For redundancy across different uniform vectors, we adopt instruction reuse, originally proposed for CPU architectures, to detect and eliminate the redundancy. Eliminating redundant computations and data storage leads to both significant energy savings and performance improvement. Furthermore, we propose to leverage such redundancy to protect arithmetic-logic units (ALUs) and register files against hardware errors.

Second, I will present a novel resource management scheme for GPGPUs. In this study, we observe that the thread-block-level (TB-level) resource management currently used inside GPGPUs can severely limit the TLP achievable in hardware. Because different warps in a TB may finish at different times, the resources allocated to early-finishing warps are essentially wasted under TB-level management: they are held until the longest-running warp in the same TB finishes. TB-level management can also lead to resource fragmentation. To overcome these inefficiencies, we propose to allocate and release resources at the warp level. Warps are dispatched to a streaming multiprocessor (SM) as long as it has sufficient resources for a warp rather than for a whole TB. Furthermore, whenever a warp completes, its resources are released and can accommodate a new warp. This way, we effectively increase the number of active warps without actually increasing the size of critical resources.

Finally, I will present our study on the impact of techniques that enhance instruction-level parallelism (ILP) on GPGPUs. In this study, we show that these ILP techniques can greatly reduce the dependency of performance on TLP. This is especially useful for applications whose resource usage prevents the hardware from running a sufficient number of threads concurrently. In such cases, the ILP techniques can deliver significant performance gains at modest hardware cost. Based on this workload-dependent behavior, we then propose a heterogeneous architecture for GPU computing. In our proposed heterogeneous GPU architecture, there are two types of in-order shader cores: one customized for applications with limited TLP due to their resource usage, and the other customized for applications with sufficient TLP. This way, applications can be scheduled to either core type based on their resource requirements and characteristics for better performance and energy efficiency.
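
The contrast between TB-level and warp-level resource management described in the abstract can be illustrated with a toy simulation. The Python sketch below is not the dissertation's simulator or hardware design. It assumes a single resource (a fixed number of warp slots on one SM), in-order dispatch, and made-up warp runtimes, and it ignores barriers and per-TB resources such as shared memory. The function names `tb_level_makespan` and `warp_level_makespan` are hypothetical.

```python
import heapq


def tb_level_makespan(blocks, slots):
    """Time to finish all blocks when slots are held per thread block.

    blocks: list of thread blocks, each a list of warp runtimes (assumed values).
    slots:  number of warp slots on the SM (assumed single resource).
    """
    running = []  # min-heap of (block_finish_time, slots_held)
    now, free = 0, slots
    for warps in blocks:
        need = len(warps)
        if need > slots:
            raise ValueError("thread block does not fit on the SM")
        # Slots come back only when a whole block retires.
        while free < need:
            now, held = heapq.heappop(running)
            free += held
        # The block occupies all its slots until its slowest warp is done.
        heapq.heappush(running, (now + max(warps), need))
        free -= need
    return max([t for t, _ in running], default=now)


def warp_level_makespan(blocks, slots):
    """Time to finish all blocks when each warp releases its slot on completion."""
    running = []  # min-heap of warp finish times
    now, finish = 0, 0
    for warps in blocks:
        for runtime in warps:
            if len(running) == slots:
                # Wait only for the earliest-finishing warp, not a whole block.
                now = heapq.heappop(running)
            heapq.heappush(running, now + runtime)
            finish = max(finish, now + runtime)
    return finish


# Three blocks of four warps; one warp per block runs much longer (assumed data).
blocks = [[10, 2, 2, 2]] * 3
print(tb_level_makespan(blocks, slots=4))    # 30
print(warp_level_makespan(blocks, slots=4))  # 14
```

With these assumed runtimes, the slow warp in each block keeps three idle slots reserved under TB-level management, whereas warp-level release lets warps from later blocks fill them. The numbers only illustrate the mechanism and say nothing about the gains reported in the dissertation.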

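
The two kinds of redundancy named in the abstract (identical values across the lanes of one vector, and identical values across different uniform vectors) can be sketched in software. The Python model below is only an illustration under stated assumptions: a 32-lane warp, a boolean `uniform` field standing in for a per-register token bit, and a plain dictionary standing in for an instruction reuse buffer keyed by opcode and operand values. The names `VReg`, `write_reg`, and `simt_add` are hypothetical and do not come from the dissertation.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # assumed number of SIMD lanes per warp


@dataclass
class VReg:
    lanes: list    # one value per lane
    uniform: bool  # models a token bit: True when all lanes hold the same value


def write_reg(lanes):
    """Write a vector register and set its token bit by comparing all lanes."""
    return VReg(lanes, all(v == lanes[0] for v in lanes))


def simt_add(a, b, reuse):
    """Add two vector registers; return (result, lane-level ALU operations used).

    reuse: dict standing in for a reuse buffer (an assumption of this sketch).
    """
    if a.uniform and b.uniform:
        key = ("add", a.lanes[0], b.lanes[0])
        if key in reuse:
            # Same operation on the same operand values was seen before: no ALU work.
            return VReg([reuse[key]] * WARP_SIZE, True), 0
        # One scalar add stands in for all lanes; the result is broadcast.
        reuse[key] = a.lanes[0] + b.lanes[0]
        return VReg([reuse[key]] * WARP_SIZE, True), 1
    # Divergent operands: every lane does its own add.
    return write_reg([x + y for x, y in zip(a.lanes, b.lanes)]), WARP_SIZE


reuse = {}
base = write_reg([7] * WARP_SIZE)         # uniform vector
stride = write_reg([4] * WARP_SIZE)       # uniform vector
tid = write_reg(list(range(WARP_SIZE)))   # per-lane thread index, not uniform

print(simt_add(base, stride, reuse)[1])   # 1
print(simt_add(base, stride, reuse)[1])   # 0
print(simt_add(base, tid, reuse)[1])      # 32
```

The operation counts show where work is saved in this model: a uniform add costs one operation instead of 32, and a repeated uniform add costs none. How the token bits and the reuse buffer are actually organized in hardware is not specified in the abstract, so those details here are placeholders.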

Bibliographic record

Author: Xiang, Ping.
Author affiliation: North Carolina State University.
Degree-granting institution: North Carolina State University.
Subject: Computer engineering.
Degree: Ph.D.
Year: 2014
Pages: 127 p.
Total pages: 127
Original format: PDF
Language: English
