Extending Vector Processing Units for Enhanced Linear Algebra Performance
Licentiate thesis, 2024

Vector Processing Units (VPUs) have made a comeback in the computer architecture landscape as a response to the diminishing returns of technology scaling and to power density limitations. VPUs serve as general-purpose accelerators, offering a trade-off between the flexibility of general-purpose architectures and the performance of hardware accelerators. However, application demands keep growing, so we want to extract even more performance out of VPUs while also achieving better area and energy efficiency. One approach to these improvements is to enhance current VPUs with Instruction Set Architecture (ISA) extensions tailored to specific kernels or applications.

A relevant set of widely used kernels are linear algebra kernels. These kernels have been used in multiple domains for decades, but today they are also at the core of Machine Learning (ML) applications, one of the domains with the fastest-growing requirements in terms of both performance and energy. Consequently, there is high interest in computing these kernels faster and more efficiently. Linear algebra kernels map well onto VPUs, but VPUs do not offer the same performance and efficiency as custom accelerators.

This Thesis presents two extensions for enhancing linear algebra kernels in VPUs. The first extension enhances VPUs with the functionality of Systolic Arrays (SAs) for more efficient computation of General Matrix-Matrix Multiplication (GEMM), by remapping the functional units of the VPU from a 1D to a 2D array. In addition, this Thesis analyzes the implications of this new SA-like functionality, proposing corresponding new memory instructions and an analysis to dynamically select the functionality that maximizes resource utilization. The second extension is a memory extension that provides VPUs with index-matching functionality for sparse linear algebra operations. It transforms the index-matching problem into a hash-lookup problem and implements it in hardware using cache-like techniques. These extensions achieve speedups of up to 4.22x and 3.19x, respectively.
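To make the index-matching idea concrete, the following is a minimal software sketch of a sparse dot product where index matching is performed as a hash lookup, as the abstract describes. This is an illustration only: a Python dict stands in for the cache-like hardware hash table, and the function name and vector format (parallel index/value arrays) are assumptions, not the Thesis's actual interface.

```python
def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors given as index/value pairs.

    Build phase: insert the first vector's indices into a hash table.
    Probe phase: look up each index of the second vector; a hit means
    the indices match, so the corresponding values are multiplied.
    """
    table = {i: v for i, v in zip(idx_a, val_a)}  # build phase
    acc = 0.0
    for i, v in zip(idx_b, val_b):                # probe phase
        if i in table:                            # index match found
            acc += table[i] * v
    return acc

# Nonzeros overlap only at indices 4 and 9: 3.0*5.0 + 1.0*2.0 = 17.0
result = sparse_dot([1, 4, 9], [2.0, 3.0, 1.0],
                    [4, 7, 9], [5.0, 1.0, 2.0])
print(result)
```

The point of the transformation is that each probe is a constant-time lookup rather than a scan over the other vector's indices, which is what makes a cache-like hardware realization attractive.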

Vector

Sparse

Dense

Linear Algebra

SIMD

ISA extension

Room EE, EDIT Building, Chalmers
Opponent: Antonio González, Universitat Politècnica de Catalunya, Spain

Author

Mateo Vázquez Maceiras

Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)

VSA: A Hybrid Vector-Systolic Architecture

Proceedings - IEEE International Conference on Computer Design: VLSI in Computers and Processors, Vol. 2022-October (2022), p. 368-376

Paper in proceeding

Exploiting the Potential of Flexible Processing Units

Proceedings - Symposium on Computer Architecture and High Performance Computing (2023), p. 34-45

Paper in proceeding

Mateo Vázquez, Muhammad Waqar Azhar, Mohammad Ali Maleki, Pedro Petersen Moura Trancoso. Scalable Hardware Hash for Index-Matching in Vector Architectures.

Very Efficient Deep Learning in IOT (VEDLIoT)

European Commission (EC) (EC/H2020/957197), 2020-11-01 -- 2023-10-31.

European, extendable, energy-efficient, energetic, embedded, extensible, Processor Ecosystem (eProcessor)

European Commission (EC) (EC/H2020/956702), 2021-01-01 -- 2024-06-30.

Areas of Advance

Information and Communication Technology

Subject Categories

Computer Science

Publisher

Chalmers


More information

Latest update

7/31/2024