Efficient Processing of Compact and Heterogeneous Deep Neural Networks
Licentiate thesis, 2024

The unprecedented success of Deep Learning (DL) algorithms, or Deep Neural Networks (DNNs), is driving the trend toward deploying them in a variety of environments, including ones with tight resource and time constraints. This has led to the emergence of compact DNNs. On the one hand, compact DNNs have fewer operations and lower resource requirements, which makes them the right choice for time-critical and energy-constrained applications. On the other hand, they pose new challenges for deep learning accelerator and library design. First, these DNNs are composed of operators with widely varying computational requirements, which makes them more heterogeneous. Second, they contain novel operators whose computational requirements and bottlenecks differ from those of the operators in traditional DNNs. These characteristics render generic accelerator architectures and library routines inefficient and necessitate custom designs that account for these DNNs' characteristics.
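
As a concrete illustration of the "fewer operations" point (standard depthwise-separable arithmetic, not a result from the thesis): for an $H \times W$ feature map with $C$ input channels, $F$ output channels, and a $K \times K$ kernel, a standard convolution costs $HWCFK^2$ MACs, while a depthwise-pointwise pair costs $HWCK^2 + HWCF$ MACs, a reduction by a factor of

$\frac{HWCK^2 + HWCF}{HWCFK^2} = \frac{1}{F} + \frac{1}{K^2} \approx 0.115 \quad \text{for } K = 3,\ F = 256,$

i.e., roughly $8.7\times$ fewer operations. The same arithmetic hints at the shifted bottlenecks: the depthwise part performs only $K^2$ MACs per loaded value, so its arithmetic intensity is low and it tends to be memory-bound rather than compute-bound.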

The constant evolution of state-of-the-art DNNs and their use in domains with constantly changing algorithms and standards motivate deploying them on flexible hardware, that is, hardware that can be programmed or reconfigured to support such variations. Moreover, the massive parallelism present in these DNNs suggests that such hardware should support parallelism. Field Programmable Gate Arrays (FPGAs) and General-Purpose Graphics Processing Units (GPGPUs) are two widely used devices that offer flexibility and support parallelism. This thesis presents hardware and software designs, i.e., accelerators and library routines, that enable efficient processing of compact DNNs and their constituent operators on FPGAs and GPGPUs.

The first contribution of the thesis is the Fixed Budget Hybrid CNN Accelerator (FiBHA). FiBHA is a hybrid architecture that combines dedicated and reusable processing engines in a way that strikes a balance between capturing DNN model heterogeneity and being resource-aware. The second contribution is Fused Convolutional Modules (FCMs), a set of GPU kernels fusing various combinations of two core operators used in compact vision models, including convolutional neural networks (CNNs) and vision transformers (ViTs). These operators are depthwise (DW) and pointwise (PW) convolutions. FCMs alleviate these operators' performance bottlenecks, leading to low-latency and energy-efficient execution.
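
To illustrate the idea behind fusing DW and PW convolutions (a minimal NumPy sketch of the general fusion technique, not the thesis's actual FCM kernels; all shapes and names here are illustrative): in layer-by-layer execution the depthwise output is written to and re-read from memory in full, whereas a fused kernel keeps each pixel's per-channel depthwise results in a small local buffer and applies the pointwise convolution immediately.

import numpy as np

def dw_pw_layer_by_layer(x, dw_w, pw_w):
    # Unfused baseline: depthwise then pointwise, with the full
    # intermediate tensor materialized in memory between the two.
    C, H, W = x.shape          # x:    (C, H, W) input feature map
    K = dw_w.shape[1]          # dw_w: (C, K, K), one filter per channel
    Ho, Wo = H - K + 1, W - K + 1
    mid = np.empty((C, Ho, Wo))
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                mid[c, i, j] = np.sum(x[c, i:i+K, j:j+K] * dw_w[c])
    # Pointwise (1x1) convolution re-reads the whole intermediate.
    return np.einsum('fc,chw->fhw', pw_w, mid)   # pw_w: (F, C)

def dw_pw_fused(x, dw_w, pw_w):
    # Fused variant: per output pixel, the C depthwise results live in
    # a small local buffer and feed the pointwise step immediately, so
    # the (C, Ho, Wo) intermediate never goes back to main memory.
    C, H, W = x.shape
    K = dw_w.shape[1]
    F = pw_w.shape[0]
    Ho, Wo = H - K + 1, W - K + 1
    out = np.empty((F, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            local = np.array([np.sum(x[c, i:i+K, j:j+K] * dw_w[c])
                              for c in range(C)])
            out[:, i, j] = pw_w @ local          # 1x1 conv = matvec
    return out

rng = np.random.default_rng(0)
x, dw_w, pw_w = rng.random((8, 16, 16)), rng.random((8, 3, 3)), rng.random((4, 8))
assert np.allclose(dw_pw_layer_by_layer(x, dw_w, pw_w), dw_pw_fused(x, dw_w, pw_w))

On a GPU, the per-pixel local buffer in the fused variant would live in registers or shared memory, eliminating the round trip to DRAM between the two operators; this is the kind of data movement that fusing DW and PW convolutions avoids.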

FiBHA improves throughput by up to 4x and 2.5x compared to the prior art, achieves up to 2x higher resource utilization, and improves energy efficiency by up to 28%. FCMs achieve up to 3.7x speedup over the layer-by-layer implementations of standard DL libraries, and up to 1.8x end-to-end speedup over a state-of-the-art DL compiler. Moreover, FCM-based implementations consume as little as 34% of the energy per inference compared to the compiler's implementations.

Keywords: Layer Fusion, GPU, Inter-Layer Pipelining, Deep Learning Accelerators, Deep Neural Networks (DNNs), FPGA

HC2
Opponent: Prof. Christos Savvas Bouganis, Imperial College London, UK

Author

Fareed Mohammad Qararyah

Chalmers University of Technology, Computer Science and Engineering, Computer Engineering

An Efficient Hybrid Deep Learning Accelerator for Compact and Heterogeneous CNNs

ACM Transactions on Architecture and Code Optimization (2024)

Journal article

FiBHA: Fixed Budget Hybrid CNN Accelerator

Proceedings of the Symposium on Computer Architecture and High Performance Computing (2022), pp. 180-190

Paper in proceeding

FCMs: Fusing Depthwise and Pointwise Convolutions for Efficient Inference on GPUs

Very Efficient Deep Learning in IoT (VEDLIoT)

European Commission (EC) (EC/H2020/957197), 2020-11-01 -- 2023-10-31.

Subject Categories

Computer Engineering

Computer Science

Computer Systems

Publisher

Chalmers

More information

Latest update

5/3/2024