Extending Vector Processing Units for Enhanced Linear Algebra Performance
Licentiate thesis, 2024

Vector Processing Units (VPUs) have made a comeback in the landscape of computer architecture as a response to the diminishing returns of technology scaling and to power-density limitations. VPUs are used as general-purpose accelerators, offering a trade-off between the flexibility of general-purpose architectures and the performance of hardware accelerators. However, application demands keep growing. Thus, we want to extract even more performance from VPUs, as well as to achieve better area and energy efficiency. One approach to these improvements is to enhance current VPUs with Instruction Set Architecture (ISA) extensions tailored to specific kernels or applications.

Linear algebra kernels are a relevant and widely used set of kernels today. They have been applied in multiple domains for decades, but they are also at the core of Machine Learning (ML) applications, one of the domains with the fastest-growing requirements in terms of both performance and energy. Consequently, there is high interest in computing these kernels faster and more efficiently. VPUs are a good match for these kernels, but they do not offer the same performance and efficiency as custom accelerators.

This Thesis presents two different extensions for enhancing linear algebra kernels on VPUs. The first extension enhances VPUs with the functionality of Systolic Arrays (SAs) for more efficient computation of General Matrix-Matrix Multiplication (GEMM). This enhancement is achieved by remapping the functional units of the VPU from a 1D to a 2D array. In addition, this Thesis analyzes the implications of this new SA-like functionality, proposing corresponding new memory instructions and an analysis to dynamically select the functionality that maximizes resource utilization. The second extension is a memory extension that provides VPUs with index-matching functionality for sparse linear algebra operations. This extension transforms the index-matching problem into one of hash lookup, and implements it in hardware using cache-like techniques. These extensions achieve speedups of up to 4.22x and 3.19x, respectively.
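To illustrate the idea behind the second extension, the index-matching step of a sparse dot product can be modeled in software as a hash lookup. This is a minimal Python sketch of the concept only, not the hardware design from the Thesis (which implements the lookup with cache-like structures); the function name and coordinate-format inputs are illustrative assumptions.

```python
def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Sketch of hash-based index matching for a sparse dot product.

    Each operand is a sparse vector in coordinate form: a list of
    indices and a list of the corresponding nonzero values.
    A Python dict stands in for the hardware hash table.
    """
    # Build the hash table from one operand: index -> value.
    table = {i: v for i, v in zip(idx_a, val_a)}

    # Stream the other operand; only matching indices contribute,
    # and each match is found in O(1) instead of by searching.
    acc = 0.0
    for i, v in zip(idx_b, val_b):
        match = table.get(i)  # hash lookup replaces index search
        if match is not None:
            acc += match * v
    return acc

# Indices 3 and 7 match: 2.0*4.0 + 3.0*6.0 = 26.0
print(sparse_dot([0, 3, 7], [1.0, 2.0, 3.0],
                 [3, 5, 7], [4.0, 5.0, 6.0]))
```

The hash table replaces the pairwise index comparisons that otherwise dominate sparse kernels, which is what makes the operation amenable to a dedicated hardware structure.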

Linear Algebra

SIMD

Sparse

Vector

ISA extension

Dense

Room EE, EDIT Building, Chalmers
Opponent: Antonio González, Universitat Politècnica de Catalunya, Spain

Author

Mateo Vázquez Maceiras

Chalmers, Computer Science and Engineering, Computer Engineering

VSA: A Hybrid Vector-Systolic Architecture

Proceedings - IEEE International Conference on Computer Design: VLSI in Computers and Processors (2022), p. 368-376

Paper in proceedings

Exploiting the Potential of Flexible Processing Units

Proceedings - Symposium on Computer Architecture and High Performance Computing (2023), p. 34-45

Paper in proceedings

Mateo Vázquez, Muhammad Waqar Azhar, Mohammad Ali Maleki, Pedro Petersen Moura Trancoso. Scalable Hardware Hash for Index-Matching in Vector Architectures.

Very Efficient Deep Learning in IoT (VEDLIoT)

European Commission (EU) (EC/H2020/957197), 2020-11-01 -- 2023-10-31.

European, extendable, energy-efficient, energetic, embedded, extensible, Processor Ecosystem (eProcessor)

European Commission (EU) (EC/H2020/956702), 2021-01-01 -- 2024-06-30.

Areas of Advance

Information and Communication Technology

Subject Categories

Computer Science

Publisher

Chalmers


More information

Created

2024-05-24