Obsidian: GPU Kernel Programming in Haskell
Graphics Processing Units (GPUs) are evolving into powerful general purpose computing platforms. At first, GPU performance was driven by the requirements of 3D computer games. To fit this workload, GPUs were designed as many-core processors suited to the data-parallel programming paradigm. Today, GPUs come with hundreds of processing elements and a theoretical single-precision floating-point performance in the teraflop range.
Because of the computing power of modern GPUs, programmers are increasingly interested in making use of them for non-graphics applications. This desire has given rise to the research field that studies General Purpose Computations on GPUs (GPGPU). The manufacturers of GPUs are also acknowledging this trend and are tailoring their GPUs to meet the desires of both those playing games and the GPGPU community.
CUDA is NVIDIA’s toolset for GPGPU programming on their GPUs. CUDA is a big improvement for the GPGPU programmer compared to what was available before. In the early days, the GPGPU programmer was forced to express the algorithm being implemented as a computer graphics computation. CUDA provides a C compiler and a set of libraries for general purpose programming on the GPU, freeing the programmer from graphics APIs. In CUDA, the programmer decomposes the problem into a set of kernels. A kernel is an isolated data-parallel program executed by a number of threads on the GPU. CUDA still has some problems: it is a very low-level interface to the GPU's capabilities, and CUDA kernels are not easily composable.
Obsidian is a language for implementing kernels, embedded in the functional programming language Haskell. CUDA code is generated from higher level descriptions of algorithms based on combinators. With this approach, Obsidian kernels are more compositional, and the programmer is relieved from inventing the typically complex index arithmetic expressions that are used to load and store data in data-parallel algorithms. The indexing arithmetic is hidden away from the programmer in the set of combinators provided as a library.
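The idea of hiding index arithmetic inside combinators can be illustrated with a small, self-contained Haskell sketch. This is not Obsidian's actual API: the type Arr and the functions fromList, toList, amap, rev, halve, azipWith and sumStep are hypothetical names chosen for illustration, and the sketch evaluates arrays directly in Haskell rather than generating CUDA code. It assumes an array can be modelled as a length together with a function from index to element.

```haskell
module Main where

-- Hypothetical model of an array: a length plus an indexing function.
-- Illustrative only; this is not Obsidian's real array type.
data Arr a = Arr Int (Int -> a)

fromList :: [a] -> Arr a
fromList xs = Arr (length xs) (xs !!)

toList :: Arr a -> [a]
toList (Arr n f) = map f [0 .. n - 1]

-- Element-wise map: composes with the indexing function.
amap :: (a -> b) -> Arr a -> Arr b
amap g (Arr n f) = Arr n (g . f)

-- Reverse: the index arithmetic (n - 1 - i) lives in the library.
rev :: Arr a -> Arr a
rev (Arr n f) = Arr n (\i -> f (n - 1 - i))

-- Split an array into a lower and an upper half.
halve :: Arr a -> (Arr a, Arr a)
halve (Arr n f) = (Arr h f, Arr (n - h) (\i -> f (i + h)))
  where
    h = n `div` 2

-- Combine two arrays element by element.
azipWith :: (a -> b -> c) -> Arr a -> Arr b -> Arr c
azipWith g (Arr n f) (Arr m h) = Arr (min n m) (\i -> g (f i) (h i))

-- One step of a tree-shaped sum, written with no explicit indices.
sumStep :: Num a => Arr a -> Arr a
sumStep arr = azipWith (+) lo hi
  where
    (lo, hi) = halve arr

main :: IO ()
main = do
  let xs = fromList [1 .. 8 :: Int]
  print (toList (sumStep xs))          -- prints [6,8,10,12]
  print (toList (rev (amap (* 2) xs))) -- prints [16,14,12,10,8,6,4,2]
```

The user-level definition sumStep mentions no indices at all; the offsets i + h and n - 1 - i appear only inside the combinators, and the combinators compose freely, as in rev (amap (* 2) xs).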
The performance of kernels generated using Obsidian is decent. It does not match that of optimized handwritten code, but it is good when the implementation effort is taken into consideration. Obsidian allows the programmer to think about the problem at hand, rather than being weighed down by the lower level details.
In this thesis, two different implementations of Obsidian are shown. The first is based on monads and the second on arrows, two concepts familiar to functional programmers. A number of applications are presented, expressed using the arrow-based version.