Extending Vector Processing Units for Enhanced Linear Algebra Performance
Licentiate thesis, 2024
Linear algebra kernels are a widely used class of kernels that has served multiple domains for decades. They are also at the core of Machine Learning (ML) applications, one of the domains whose performance and energy requirements are growing fastest. Consequently, there is high interest in computing these kernels faster and more efficiently. Vector Processing Units (VPUs) map well onto these kernels, but they do not offer the same performance and efficiency as custom accelerators.
This thesis presents two extensions for enhancing linear algebra kernels in VPUs. The first extension gives VPUs the functionality of Systolic Arrays (SAs) for more efficient computation of General Matrix-Matrix Multiplication (GEMM). It does so by remapping the functional units of the VPU from a 1D to a 2D array. The thesis also analyzes the implications of this new SA-like functionality, proposing corresponding new memory instructions and an analysis for dynamically selecting the functionality that maximizes resource utilization. The second extension is a memory extension that provides VPUs with index-matching functionality for sparse linear algebra operations. It transforms the index-matching problem into a hash-lookup problem and implements it in hardware using cache-like techniques. These extensions achieve speedups of up to 4.22x and 3.19x, respectively.
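As a rough software analogy for the second extension, index matching between two sparse vectors can be written as a hash lookup. The sketch below is only an illustration of that idea and is not the hardware design described in the thesis; the function name `sparse_dot_hash` and the (indices, values) input layout are assumptions made for the example.

```python
def sparse_dot_hash(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors given as (indices, values) lists.

    Illustrative assumption: a Python dict stands in for the hash table;
    the thesis implements the lookup in hardware with cache-like techniques.
    """
    # Build a hash table mapping each nonzero index of A to its value.
    table = dict(zip(idx_a, val_a))
    total = 0.0
    # Look up every nonzero index of B; a hit means the indices match.
    for j, v in zip(idx_b, val_b):
        hit = table.get(j)
        if hit is not None:
            total += hit * v
    return total
```

For example, with A having nonzeros at indices 0, 3, 7 and B at indices 3, 5, 7, only indices 3 and 7 match, so only those two products contribute to the result.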
Vector
Sparse
Dense
Linear Algebra
SIMD
ISA extension
Author
Mateo Vázquez Maceiras
Chalmers, Computer Science and Engineering (Chalmers), Computer Engineering (Chalmers)
VSA: A Hybrid Vector-Systolic Architecture
Proceedings - IEEE International Conference on Computer Design: VLSI in Computers and Processors, Vol. 2022-October (2022), p. 368-376
Paper in proceeding
Exploiting the Potential of Flexible Processing Units
Proceedings - Symposium on Computer Architecture and High Performance Computing (2023), p. 34-45
Paper in proceeding
Mateo Vázquez, Muhammad Waqar Azhar, Mohammad Ali Maleki, Pedro Petersen Moura Trancoso. Scalable Hardware Hash for Index-Matching in Vector Architectures.
Very Efficient Deep Learning in IOT (VEDLIoT)
European Commission (EC) (EC/H2020/957197), 2020-11-01 -- 2023-10-31.
European, extendable, energy-efficient, energetic, embedded, extensible, Processor Ecosystem (eProcessor)
European Commission (EC) (EC/H2020/956702), 2021-01-01 -- 2024-06-30.
Areas of Advance
Information and Communication Technology
Subject Categories
Computer Science
Publisher
Chalmers
Room EE, EDIT Building, Chalmers
Opponent: Antonio González, Universitat Politècnica de Catalunya, Spain