GPU programming with Haskell
Abstract
This chapter leads you through using Haskell and the accelerate package to write a parallel program for the high-performance, multi-threaded GPU in your computer, and finishes with a program that lets you calculate Black-Scholes option prices on your GPU.
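As a taste of what the chapter builds toward, the following is a minimal sketch of Black-Scholes call pricing written with the accelerate package. The polynomial approximation of the cumulative normal distribution, the fixed rate and volatility, the example option data, and the use of the interpreter backend are illustrative assumptions; on a machine with an NVIDIA GPU you would import a GPU backend such as Data.Array.Accelerate.LLVM.PTX instead.

```haskell
-- A minimal sketch of Black-Scholes call pricing with the accelerate package.
-- The Interpreter backend is used only so the fragment is self-contained;
-- swap in a GPU backend (e.g. Data.Array.Accelerate.LLVM.PTX) to run on the GPU.
import Data.Array.Accelerate              as A
import Data.Array.Accelerate.Interpreter  as I

-- Assumed, fixed market parameters for the illustration
riskfree, volatility :: Exp Float
riskfree   = 0.02
volatility = 0.30

-- Cumulative normal distribution via the Abramowitz-Stegun polynomial approximation
cnd :: Exp Float -> Exp Float
cnd d =
  let k    = 1 / (1 + 0.2316419 * abs d)
      poly = k * (0.31938153 + k * (-0.356563782 + k * (1.781477937
               + k * (-1.821255978 + k * 1.330274429))))
      w    = 1 - (1 / sqrt (2 * pi)) * exp (-d * d / 2) * poly
  in  d A.> 0 ? (w, 1 - w)

-- Price one European call option from (spot, strike, years to expiry)
callOption :: Exp (Float, Float, Float) -> Exp Float
callOption opt =
  let (s, x, t) = unlift opt :: (Exp Float, Exp Float, Exp Float)
      d1 = (log (s / x) + (riskfree + 0.5 * volatility * volatility) * t)
             / (volatility * sqrt t)
      d2 = d1 - volatility * sqrt t
  in  s * cnd d1 - x * exp (-riskfree * t) * cnd d2

-- Price a whole vector of options as one collective array operation
blackscholes :: Acc (Vector (Float, Float, Float)) -> Acc (Vector Float)
blackscholes = A.map callOption

main :: IO ()
main = do
  let opts = fromList (Z :. 3) [(60, 65, 0.25), (100, 100, 1.0), (30, 25, 2.0)]
               :: Vector (Float, Float, Float)
  print (I.run (blackscholes (use opts)))
```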
Related papers
2009
The success of the gaming industry is now pushing processor technology like we have never seen before. Since recent graphics processors (GPUs) have been improving both their programmability and their floating-point processing power, they are very appealing as accelerators for general-purpose computing. This first EuroGPU Minisymposium gave an overview of some of these advancements by bringing together several experts working on the development of techniques and tools that improve the programmability of GPUs, as well as experts interested in utilizing the computational power of GPUs for scientific applications. This short summary thus gives a very useful, but quick, overview of some of the major recent advancements...
Proceedings of the sixth workshop on Declarative aspects of multicore programming - DAMP '11, 2011
Current GPUs are massively parallel multicore processors optimised for workloads with a large degree of SIMD parallelism. Good performance requires highly idiomatic programs, whose development is work intensive and requires expert knowledge. To raise the level of abstraction, we propose a domain-specific high-level language of array computations that captures appropriate idioms in the form of collective array operations. We embed this purely functional array language in Haskell with an online code generator for NVIDIA's CUDA GPGPU programming environment. We regard the embedded language's collective array operations as algorithmic skeletons; our code generator instantiates CUDA implementations of those skeletons to execute embedded array programs. This paper outlines our embedding in Haskell, details the design and implementation of the dynamic code generator, and reports on initial benchmark results. These results suggest that we can compete with moderately optimised native CUDA code, while enabling much simpler source programs.
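To make the embedding concrete, here is a small sketch in the style of that embedded array language: the collective operations zipWith and fold are the kind of algorithmic skeletons the code generator instantiates as CUDA implementations. The interpreter backend is used below only so the fragment is self-contained; it stands in for the CUDA backend the paper describes.

```haskell
-- Dot product expressed with collective array operations; each collective
-- operation (zipWith, fold) corresponds to a skeleton that a GPU backend
-- can instantiate as a CUDA kernel.
import Data.Array.Accelerate             as A
import Data.Array.Accelerate.Interpreter as I

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = fromList (Z :. 5) [1 .. 5]       :: Vector Float
      ys = fromList (Z :. 5) [5, 4, 3, 2, 1] :: Vector Float
  print (I.run (dotp (use xs) (use ys)))
```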
2009
In this paper, we investigate the differences and tradeoffs imposed by two parallel Haskell dialects running on multicore machines. GpH and Eden are both constructed using the highly-optimising sequential GHC compiler, and share thread scheduling, and other elements, from a common code base. The GpH implementation investigated here uses a physically-shared heap, which should be well-suited to multicore architectures.
Proceedings of the 2011 ACM SIGPLAN X10 Workshop, 2011
GPU architectures have emerged as a viable way of considerably improving performance for appropriate applications. Program fragments (kernels) appropriate for GPU execution can be implemented in CUDA or OpenCL and glued into an application via an API. While there is plenty of evidence of performance improvements using this approach, there are many issues with productivity. Programmers must understand an additional programming model and API to program the accelerator; concurrency and synchronization in this programming model is typically expressed differently from the programming model for the host. On top of this, the languages used to write kernels are very low level and thus prone to the kinds of errors that one does not encounter in higher level languages. Programmers must explicitly deal with moving data back-and-forth between the host and the accelerator. These problems are compounded when the user code must be run across a cluster of accelerated nodes. Now the host programming model must further be extended with constructs to deal with scale-out and remote accelerators. We believe there is a critical need for a single source programming model that can be used to write clean, efficient code for heterogeneous, multi-core and scale-out architectures. The APGAS programming model has been developed for such architectures over the past six years. APGAS is based on four fundamental (and architecture-independent) notions: locality, asynchrony, conditional atomicity and order. X10 is an instantiation of the APGAS programming model on top of a base sequential language with Java-style productivity. Earlier work has shown that X10 can be used to write clean and efficient code for homogeneous multi-cores, SMPs, Cell-accelerated nodes, and clusters of such nodes. In this paper we show how X10 programmers can write code that can be compiled and run on GPUs. GPU programming idioms such as threads, blocks, barriers, constant memory, local registers, shared memory variables, etc. can be directly expressed in X10, and do not require new language extensions. We present the design of an extension of the X10-to-C++ compiler which recognizes such idioms and produces CUDA kernel code. We show several benchmarks written in this style. The performance of these kernels is within 80% of handwritten CUDA kernels. We believe these results establish X10 as a single-source programming language in which clean, efficient programs can be written for GPU-accelerated clusters.
2013
Multicore and NUMA architectures are becoming the dominant processor technology and functional languages are theoretically well suited to exploit them. In practice, however, implementing effective high level parallel functional languages is extremely challenging. This paper is a systematic programming and performance comparison of four parallel Haskell implementations on a common multicore architecture. It provides a detailed analysis of the performance, and contrasts the programming effort that each language requires with the parallel performance delivered. The study uses 15 ‘typical’ programs to compare a ‘no pain’, i.e. entirely implicit, parallel implementation with three ‘low pain’, i.e. semi-explicit, language implementations. We report detailed studies comparing the parallel performance delivered. The comparative performance metric is speedup which normalises against sequential performance. We ground the speedup comparisons by reporting both sequential and parallel runtimes a...
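For readers unfamiliar with the 'low pain', semi-explicit style such studies compare, the following is a minimal GpH-flavoured sketch using Control.Parallel.Strategies; the fib workload and its parameters are illustrative assumptions, not programs from the study's benchmark suite.

```haskell
-- Semi-explicit parallelism: parMap marks list elements that may be
-- evaluated in parallel, without changing the program's meaning.
-- Compile with -threaded and run with +RTS -N<cores> to use multiple cores.
import Control.Parallel.Strategies (parMap, rdeepseq)

-- A deliberately expensive function standing in for real work
fib :: Int -> Integer
fib n | n < 2     = toInteger n
      | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main = print (sum (parMap rdeepseq fib [28 .. 36]))
```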
Journal of Computer Science and Technology, 2012
ArXiv, 2014
Since the first idea of using GPUs for general-purpose computing, things have evolved over the years and there are now several approaches to GPU programming. GPU computing practically began with the introduction of CUDA (Compute Unified Device Architecture) by NVIDIA and Stream by AMD. These are APIs designed by the GPU vendors to be used together with the hardware that they provide. A new emerging standard, OpenCL (Open Computing Language), tries to unify different GPU general-purpose computing API implementations and provides a framework for writing programs executed across heterogeneous platforms consisting of both CPUs and GPUs. OpenCL provides parallel computing using task-based and data-based parallelism. In this paper we will focus on the CUDA parallel computing architecture and programming model introduced by NVIDIA. We will present the benefits of the CUDA programming model. We will also compare the two main approaches, CUDA and AMD APP (Stream), and the new framework, OpenCL, that tries...
ACM SIGPLAN Notices, 2009
With the emergence of commodity multicore architectures, exploiting tightly-coupled parallelism has become increasingly important. Functional programming languages, such as Haskell, are, in principle, well placed to take advantage of this trend, offering the ability to easily identify large amounts of fine-grained parallelism. Unfortunately, obtaining real performance benefits has often proved hard to realise in practice.
CUDA is now the dominant language used for programming GPUs, one of the most exciting hardware developments of recent decades. With CUDA, you can use a desktop PC for work that would have previously required a large cluster of PCs or access to a HPC facility. As a result, CUDA is increasingly important in scientific and technical computing across the whole STEM community, from medical physics and financial modelling to big data applications and beyond. This unique book on CUDA draws on the author's passion for and long experience of developing and using computers to acquire and analyse scientific data. The result is an innovative text featuring a much richer set of examples than found in any other comparable book on GPU computing. Much attention has been paid to the C++ coding style, which is compact, elegant and efficient. A code base of examples and supporting material is available online, which readers can build on for their own projects.
The Haskell developers focus on providing a broad range of packages and libraries in various research areas. In particular, image processing is naturally expressed in terms of parallel array operations, and here we use Repa as a great tool for coding image-manipulation algorithms. Our target is to analyze the execution time of a parallel Haskell implementation and to compare the results with the corresponding C++ implementation. A representative example from the image-processing area of interest is selected. The conclusion is that the compared execution-time values depend on both the physical and the logical parameters of the applied solutions.
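As an illustration of the Repa style referred to above, here is a minimal sketch of a pixel-wise operation written as a parallel array computation; the brightness adjustment, the luminance-only image representation, and the tiny hand-written input are illustrative assumptions rather than the paper's actual benchmark.

```haskell
-- A pixel-wise image operation in Repa: R.map builds a delayed array and
-- computeP evaluates it in parallel across the available cores.
import Data.Array.Repa as R

-- One luminance value per pixel, stored in an unboxed 2-D array
type Image = Array U DIM2 Double

-- Increase brightness, clamping to the valid [0, 1] range
brighten :: Monad m => Double -> Image -> m Image
brighten k img = computeP (R.map (\p -> min 1 (p + k)) img)

main :: IO ()
main = do
  let img = fromListUnboxed (Z :. 2 :. 3)
              [0.1, 0.5, 0.9, 0.2, 0.6, 1.0] :: Image
  out <- brighten 0.2 img
  print (R.toList out)
```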
