Vectorized Barrier and Reduction in LLVM OpenMP Runtime
Paper in proceedings, 2021

Barrier synchronization is a well-known operation in parallel processing that can be an obstacle to achieving performance in parallel programs, particularly at high thread counts. Similarly, reduction is a collective communication pattern frequently used in parallel applications and needs to be optimized for applications to achieve their best performance. With the introduction of multi-core and many-core processors, several new barrier and reduction implementations have been proposed. As the number of cores per node continues to grow, implementations of these primitives need to be revisited and adapted for upcoming architectures. We see an opportunity to improve synchronization by exploiting vector units present in modern and future CPU designs based on vector ISAs such as ARM’s Scalable Vector Extension and the RISC-V Vector extension. In this work we propose vectorized barriers and reductions using the vector-length-agnostic paradigm and implement them in the LLVM OpenMP runtime. Our barrier implementation achieves up to 2.2× and 1.4× speedup over the default LLVM OpenMP implementation on Intel KNL and Fujitsu A64FX, respectively.

Reduction

Vectorization

OpenMP

Barrier

Authors

Muhammad Nufail Farooqi

Chalmers, Computer Science and Engineering, Computer Engineering

Miquel Pericas

Chalmers, Computer Science and Engineering, Computer Engineering

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

0302-9743 (ISSN) 1611-3349 (eISSN)

Vol. 12870 LNCS 18-32
9783030852610 (ISBN)

17th International Workshop on OpenMP, IWOMP 2021
Bristol, United Kingdom

The European Processor Initiative (EPI)

European Commission (EU) (EC/H2020/800928), 2018-12-01 -- 2021-11-30.

Subject categories (SSIF 2011)

Computer Engineering

Embedded Systems

Computer Systems

DOI

10.1007/978-3-030-85262-7_2

More information

Last updated

2021-10-04