Fusing Depthwise and Pointwise Convolutions for Efficient Inference on GPUs

Fareed Mohammad Qararyah; Muhammad Waqar Azhar; Mohammad Ali Maleki; Pedro Petersen Moura Trancoso

doi:10.1145/3677333.3678153

Paper in proceedings, 2024

Depthwise and pointwise convolutions have fewer parameters and perform fewer operations than standard convolutions. As a result, they are increasingly used in compact DNNs, including convolutional neural networks (CNNs) and vision transformers (ViTs). However, they have a lower compute-to-memory-access ratio than standard convolutions, so memory accesses are often their performance bottleneck. This paper explores fusing depthwise and pointwise convolutions on GPUs to overcome this bottleneck. Prior work on GPU-based fusion suffers from one or more of the following limitations: (1) it fuses only a convolution with an element-wise operator, or multiple non-convolutional operators; (2) it does not explicitly optimize for memory accesses; (3) it does not support depthwise convolutions. This paper proposes Fused Convolutional Modules (FCMs), a set of novel fused depthwise and pointwise GPU kernels. FCMs significantly reduce the memory accesses of pointwise and depthwise convolutions, improving execution time and energy efficiency. To evaluate the trade-offs associated with fusion and to determine which convolutions are beneficial to fuse and the optimal FCM parameters, we propose FusePlanner. FusePlanner consists of cost models that estimate the memory accesses of depthwise, pointwise, and FCM kernels given GPU characteristics. Our experiments on three GPUs using representative CNNs and ViTs demonstrate that FCMs save up to 83% of the memory accesses and achieve speedups of up to 3.7x compared to cuDNN. Complete model implementations of various CNNs using our modules outperform TVM's, achieving speedups of up to 1.8x and saving up to two-thirds of the energy. FCM and FusePlanner implementations are open source: https://github.com/fqararyah/Fusing_DW_and_PW_on_GPUs
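The compute-to-memory-access argument in the abstract can be illustrated with a back-of-the-envelope estimate. The sketch below is not FusePlanner and not the paper's cost model. It is an idealized count that assumes stride 1, same padding, and that every input, weight, and output element crosses off-chip memory exactly once. The function names (`conv_stats`, `fused_dw_pw_moved`) and the layer shape are hypothetical and chosen only for illustration.

```python
def conv_stats(h, w, c_in, c_out, k, kind):
    """Return (multiply-accumulates, elements moved) for one layer.

    Idealized assumption: stride 1, same padding, and each input,
    weight, and output element is read or written exactly once.
    """
    if kind == "standard":
        macs = h * w * c_in * c_out * k * k
        weights = c_in * c_out * k * k
    elif kind == "depthwise":
        c_out = c_in  # one k x k filter per channel
        macs = h * w * c_in * k * k
        weights = c_in * k * k
    elif kind == "pointwise":
        macs = h * w * c_in * c_out  # 1 x 1 filters
        weights = c_in * c_out
    else:
        raise ValueError(f"unknown convolution kind: {kind}")
    moved = h * w * c_in + weights + h * w * c_out
    return macs, moved


def fused_dw_pw_moved(h, w, c_in, c_out, k):
    """Elements moved when a depthwise layer feeds a pointwise layer and
    the intermediate feature map stays on chip (idealized assumption)."""
    return h * w * c_in + c_in * k * k + c_in * c_out + h * w * c_out


# Hypothetical layer shape, for illustration only.
H, W, C, K = 56, 56, 128, 3

std_macs, std_moved = conv_stats(H, W, C, C, K, "standard")
dw_macs, dw_moved = conv_stats(H, W, C, C, K, "depthwise")
pw_macs, pw_moved = conv_stats(H, W, C, C, K, "pointwise")

ratio_std = std_macs / std_moved
ratio_dw = dw_macs / dw_moved
ratio_pw = pw_macs / pw_moved

unfused_moved = dw_moved + pw_moved
fused_moved = fused_dw_pw_moved(H, W, C, C, K)
saving = 1 - fused_moved / unfused_moved

print(f"MACs per element moved: standard={ratio_std:.1f}, "
      f"pointwise={ratio_pw:.1f}, depthwise={ratio_dw:.1f}")
print(f"Idealized traffic saved by fusing DW+PW: {saving:.1%}")
```

Under these assumptions the saving comes entirely from not writing and re-reading the intermediate feature map. Real kernels also have to deal with tiling, halo regions, and on-chip capacity limits, which this count ignores.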

layer fusion

depthwise convolution

CNN

vision transformer

pointwise convolution

GPU

Authors

Fareed Mohammad Qararyah

Chalmers, Computer Science and Engineering, Computer Engineering


Muhammad Waqar Azhar

ZEROPOINT TECHNOLOGIES AB


Mohammad Ali Maleki

Chalmers, Computer Science and Engineering, Computer Engineering


Pedro Petersen Moura Trancoso

Chalmers, Computer Science and Engineering, Computer Engineering


ACM International Conference Proceeding Series

58-67
9798400718021 (ISBN)

53rd International Conference on Parallel Processing, ICPP 2024
Gotland, Sweden

Very Efficient Deep Learning in IOT (VEDLIoT)

European Commission (EU) (EC/H2020/957197), 2020-11-01 -- 2023-10-31.


EPI SGA2

European Commission (EU) (101036168), 2022-01-01 -- 2024-12-31.


Subject categories (SSIF 2011)

Computer Sciences

DOI

10.1145/3677333.3678153


More information

Last updated

2024-10-28
