GPU programming with Haskell
Abstract
This chapter leads you through using Haskell and the accelerate package to write a parallel program for the high-performance, multi-threaded GPU in your computer, and finishes with a program that lets you calculate Black-Scholes option prices on your GPU.
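As a taste of what the chapter builds toward, the following is a minimal sketch of Black-Scholes call pricing written with the accelerate package. The polynomial approximation of the cumulative normal distribution, the fixed rate and volatility, the example option data, and the use of the interpreter backend are illustrative assumptions; on a machine with an NVIDIA GPU you would import a GPU backend such as Data.Array.Accelerate.LLVM.PTX instead.

```haskell
-- A minimal sketch of Black-Scholes call pricing with the accelerate package.
-- The Interpreter backend is used only so the fragment is self-contained;
-- swap in a GPU backend (e.g. Data.Array.Accelerate.LLVM.PTX) to run on the GPU.
import Data.Array.Accelerate              as A
import Data.Array.Accelerate.Interpreter  as I

-- Assumed, fixed market parameters for the illustration
riskfree, volatility :: Exp Float
riskfree   = 0.02
volatility = 0.30

-- Cumulative normal distribution via the Abramowitz-Stegun polynomial approximation
cnd :: Exp Float -> Exp Float
cnd d =
  let k    = 1 / (1 + 0.2316419 * abs d)
      poly = k * (0.31938153 + k * (-0.356563782 + k * (1.781477937
               + k * (-1.821255978 + k * 1.330274429))))
      w    = 1 - (1 / sqrt (2 * pi)) * exp (-d * d / 2) * poly
  in  d A.> 0 ? (w, 1 - w)

-- Price one European call option from (spot, strike, years to expiry)
callOption :: Exp (Float, Float, Float) -> Exp Float
callOption opt =
  let (s, x, t) = unlift opt :: (Exp Float, Exp Float, Exp Float)
      d1 = (log (s / x) + (riskfree + 0.5 * volatility * volatility) * t)
             / (volatility * sqrt t)
      d2 = d1 - volatility * sqrt t
  in  s * cnd d1 - x * exp (-riskfree * t) * cnd d2

-- Price a whole vector of options as one collective array operation
blackscholes :: Acc (Vector (Float, Float, Float)) -> Acc (Vector Float)
blackscholes = A.map callOption

main :: IO ()
main = do
  let opts = fromList (Z :. 3) [(60, 65, 0.25), (100, 100, 1.0), (30, 25, 2.0)]
               :: Vector (Float, Float, Float)
  print (I.run (blackscholes (use opts)))
```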
Related papers
2009
The success of the gaming industry is now pushing processor technology like we have never seen before. Since recent graphics processors (GPUs) have been improving both their programmability and their floating-point processing power, they are very appealing as accelerators for general-purpose computing. This first EuroGPU Minisymposium gave an overview of some of these advancements by bringing together several experts working on the development of techniques and tools that improve the programmability of GPUs, as well as experts interested in utilizing the computational power of GPUs for scientific applications. This short summary thus gives a very useful, but quick, overview of some of the major recent advancements...
Proceedings of the sixth workshop on Declarative aspects of multicore programming - DAMP '11, 2011
Current GPUs are massively parallel multicore processors optimised for workloads with a large degree of SIMD parallelism. Good performance requires highly idiomatic programs, whose development is work intensive and requires expert knowledge. To raise the level of abstraction, we propose a domain-specific high-level language of array computations that captures appropriate idioms in the form of collective array operations. We embed this purely functional array language in Haskell with an online code generator for NVIDIA's CUDA GPGPU programming environment. We regard the embedded language's collective array operations as algorithmic skeletons; our code generator instantiates CUDA implementations of those skeletons to execute embedded array programs. This paper outlines our embedding in Haskell, details the design and implementation of the dynamic code generator, and reports on initial benchmark results. These results suggest that we can compete with moderately optimised native CUDA code, while enabling much simpler source programs.
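To make the embedding concrete, here is a small sketch in the style of that embedded array language: the collective operations zipWith and fold are the kind of algorithmic skeletons the code generator instantiates as CUDA implementations. The interpreter backend is used below only so the fragment is self-contained; it stands in for the CUDA backend the paper describes.

```haskell
-- Dot product expressed with collective array operations; each collective
-- operation (zipWith, fold) corresponds to a skeleton that a GPU backend
-- can instantiate as a CUDA kernel.
import Data.Array.Accelerate             as A
import Data.Array.Accelerate.Interpreter as I

dotp :: Acc (Vector Float) -> Acc (Vector Float) -> Acc (Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = fromList (Z :. 5) [1 .. 5]       :: Vector Float
      ys = fromList (Z :. 5) [5, 4, 3, 2, 1] :: Vector Float
  print (I.run (dotp (use xs) (use ys)))
```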
2009
In this paper, we investigate the differences and tradeoffs imposed by two parallel Haskell dialects running on multicore machines. GpH and Eden are both constructed using the highly-optimising sequential GHC compiler, and share thread scheduling, and other elements, from a common code base. The GpH implementation investigated here uses a physically-shared heap, which should be well-suited to multicore architectures.
Proceedings of the 2011 ACM SIGPLAN X10 Workshop, 2011
GPU architectures have emerged as a viable way of considerably improving performance for appropriate applications. Program fragments (kernels) appropriate for GPU execution can be implemented in CUDA or OpenCL and glued into an application via an API. While there is plenty of evidence of performance improvements using this approach, there are many issues with productivity. Programmers must understand an additional programming model and API to program the accelerator; concurrency and synchronization in this programming model is typically expressed differently from the programming model for the host. On top of this, the languages used to write kernels are very low level and thus prone to the kinds of errors that one does not encounter in higher level languages. Programmers must explicitly deal with moving data back-and-forth between the host and the accelerator. These problems are compounded when the user code must be run across a cluster of accelerated nodes. Now the host programming model must further be extended with constructs to deal with scale-out and remote accelerators. We believe there is a critical need for a single source programming model that can be used to write clean, efficient code for heterogeneous, multi-core and scale-out architectures. The APGAS programming model has been developed for such architectures over the past six years. APGAS is based on four fundamental (and architecture-independent) notions: locality, asynchrony, conditional atomicity and order. X10 is an instantiation of the APGAS programming model on top of a base sequential language with Java-style productivity. Earlier work has shown that X10 can be used to write clean and efficient code for homogeneous multi-cores, SMPs, Cell-accelerated nodes, and clusters of such nodes. In this paper we show how X10 programmers can write code that can be compiled and run on GPUs. GPU programming idioms such as threads, blocks, barriers, constant memory, local registers, shared memory variables, etc. can be directly expressed in X10, and do not require new language extensions. We present the design of an extension of the X10-to-C++ compiler which recognizes such idioms and produces CUDA kernel code. We show several benchmarks written in this style. The performance of these kernels is within 80% of handwritten CUDA kernels. We believe these results establish X10 as a single-source programming language in which clean, efficient programs can be written for GPU-accelerated clusters.
2013
Multicore and NUMA architectures are becoming the dominant processor technology and functional languages are theoretically well suited to exploit them. In practice, however, implementing effective high level parallel functional languages is extremely challenging. This paper is a systematic programming and performance comparison of four parallel Haskell implementations on a common multicore architecture. It provides a detailed analysis of the performance, and contrasts the programming effort that each language requires with the parallel performance delivered. The study uses 15 ‘typical’ programs to compare a ‘no pain’, i.e. entirely implicit, parallel implementation with three ‘low pain’, i.e. semi-explicit, language implementations. We report detailed studies comparing the parallel performance delivered. The comparative performance metric is speedup which normalises against sequential performance. We ground the speedup comparisons by reporting both sequential and parallel runtimes a...
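For readers unfamiliar with the 'low pain', semi-explicit style such studies compare, the following is a minimal GpH-flavoured sketch using Control.Parallel.Strategies; the fib workload and its parameters are illustrative assumptions, not programs from the study's benchmark suite.

```haskell
-- Semi-explicit parallelism: parMap marks list elements that may be
-- evaluated in parallel, without changing the program's meaning.
-- Compile with -threaded and run with +RTS -N<cores> to use multiple cores.
import Control.Parallel.Strategies (parMap, rdeepseq)

-- A deliberately expensive function standing in for real work
fib :: Int -> Integer
fib n | n < 2     = toInteger n
      | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main = print (sum (parMap rdeepseq fib [28 .. 36]))
```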
Journal of Computer Science and Technology, 2012
ArXiv, 2014
Since the first idea of using GPUs for general-purpose computing, things have evolved over the years and there are now several approaches to GPU programming. GPU computing practically began with the introduction of CUDA (Compute Unified Device Architecture) by NVIDIA and Stream by AMD. These are APIs designed by the GPU vendors to be used together with the hardware that they provide. A new emerging standard, OpenCL (Open Computing Language), tries to unify different GPU general-purpose computing API implementations and provides a framework for writing programs executed across heterogeneous platforms consisting of both CPUs and GPUs. OpenCL provides parallel computing using task-based and data-based parallelism. In this paper we will focus on the CUDA parallel computing architecture and programming model introduced by NVIDIA. We will present the benefits of the CUDA programming model. We will also compare the two main approaches, CUDA and AMD APP (Stream), and the new framework, OpenCL, that tries...
ACM SIGPLAN Notices, 2009
With the emergence of commodity multicore architectures, exploiting tightly-coupled parallelism has become increasingly important. Functional programming languages, such as Haskell, are, in principle, well placed to take advantage of this trend, offering the ability to easily identify large amounts of fine-grained parallelism. Unfortunately, obtaining real performance benefits has often proved hard to realise in practice.
CUDA is now the dominant language used for programming GPUs, one of the most exciting hardware developments of recent decades. With CUDA, you can use a desktop PC for work that would have previously required a large cluster of PCs or access to a HPC facility. As a result, CUDA is increasingly important in scientific and technical computing across the whole STEM community, from medical physics and financial modelling to big data applications and beyond. This unique book on CUDA draws on the author's passion for and long experience of developing and using computers to acquire and analyse scientific data. The result is an innovative text featuring a much richer set of examples than found in any other comparable book on GPU computing. Much attention has been paid to the C++ coding style, which is compact, elegant and efficient. A code base of examples and supporting material is available online, which readers can build on for their own projects.
The Haskell developers focus on providing a broad range of packages and libraries in various research areas. In particular, image processing is naturally expressed in terms of parallel array operations, and here we use Repa as a great tool for coding image-manipulation algorithms. Our target is to analyze the execution time of a parallel Haskell implementation and to compare the results with the corresponding C++ implementation. A representative example from the image-processing area of interest is selected. The conclusion is that the compared execution-time values depend on both the physical and the logical parameters of the applied solutions.
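As an illustration of the Repa style referred to above, here is a minimal sketch of a pixel-wise operation written as a parallel array computation; the brightness adjustment, the luminance-only image representation, and the tiny hand-written input are illustrative assumptions rather than the paper's actual benchmark.

```haskell
-- A pixel-wise image operation in Repa: R.map builds a delayed array and
-- computeP evaluates it in parallel across the available cores.
import Data.Array.Repa as R

-- One luminance value per pixel, stored in an unboxed 2-D array
type Image = Array U DIM2 Double

-- Increase brightness, clamping to the valid [0, 1] range
brighten :: Monad m => Double -> Image -> m Image
brighten k img = computeP (R.map (\p -> min 1 (p + k)) img)

main :: IO ()
main = do
  let img = fromListUnboxed (Z :. 2 :. 3)
              [0.1, 0.5, 0.9, 0.2, 0.6, 1.0] :: Image
  out <- brighten 0.2 img
  print (R.toList out)
```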
