Basilio B Fraguela

Universidade da Coruña, Departamento de Electrónica y Sistemas, Faculty Member

Papers by Basilio B Fraguela

Hierarchically Tiled Arrays Vs. Intel Threading Building Blocks for Programming Multicore Systems

Multicore systems are now the norm. Programmers can no longer rely on faster clock rates to speed up their applications. Thus, software developers are increasingly forced to face the complexities of parallel programming. The Intel Threading Building Blocks (TBBs) library was designed to facilitate parallel programming. The key notion is to separate logical task patterns, which are easy to understand, from physical threads, and delegate the scheduling of the tasks to the system. On the other hand, Hierarchically Tiled Arrays (HTAs) are data structures that facilitate locality and parallelism of array intensive computations with a block-recursive nature. The model underlying HTAs provides programmers with a data parallel, single-threaded view of the execution. The HTA implementation in C++ has been recently extended to support multicore machines. In this work we implement several algorithms using both libraries in order to compare ease of programming and performance.

High productivity multi-device exploitation with the Heterogeneous Programming Library

Journal of Parallel and Distributed Computing, 2017

Heterogeneous devices require much more work from programmers than traditional CPUs, particularly when there are several of them, as each one has its own memory space. Multidevice applications require distributing kernel executions and, even worse, array portions that must be kept coherent among the different device memories and the host memory. In addition, when devices with different characteristics participate in a computation, optimally distributing the work among them is not trivial. In this paper we extend an existing framework for the programming of accelerators called Heterogeneous Programming Library (HPL) with three kinds of improvements that facilitate these tasks. The first two are the ability to define subarrays and subkernels, which distribute kernels on different devices. The last one is a convenient extension of the subkernel mechanism to distribute computations among heterogeneous devices seeking the best work balance among them. This last contribution includes two analytical models that have proved to automatically provide very good work distributions. Our experiments also show the large programmability advantages of our approach and the negligible overhead incurred.

Design and Use of htalib – A Library for Hierarchically Tiled Arrays

Lecture Notes in Computer Science, 2007

Hierarchically Tiled Arrays (HTAs) are data structures that facilitate locality and parallelism of array intensive computations with a block-recursive nature. The model underlying HTAs provides programmers with a global view of distributed data as well as a single-threaded view of the execution. In this paper we present htalib, a C++ implementation of HTAs. This library provides several novel constructs: (i) A map-reduce operator framework that facilitates the implementation of distributed operations with HTAs. (ii) Overlapped tiling in support of tiling in stencil codes. (iii) Data layering, facilitating the use of HTAs in adaptive mesh refinement applications. We describe the interface and design of htalib and our experience with the new programming constructs.

A highly optimized skeleton for unbalanced and deep divide-and-conquer algorithms on multi-core clusters

The Journal of Supercomputing

Efficiently implementing the divide-and-conquer pattern of parallelism in distributed memory systems is very relevant, given its ubiquity, and difficult, given its recursive nature and the need to exchange tasks and data among the processors. This task is noticeably further complicated in the presence of multi-core systems, where hybrid parallelism must be exploited to attain the best performance, and when unbalanced and deep workloads are considered, as additional measures must be taken to load balance and avoid deep recursion problems. In this manuscript a parallel skeleton that fulfills all these requirements while providing high levels of usability is presented. In fact, the evaluation shows that our proposal is on average 415.32% faster than MPI codes and 229.18% faster than MPI + OpenMP benchmarks, while offering an average improvement in the programmability metrics of 131.04% over MPI alternatives and 155.18% over MPI + OpenMP solutions.

OpenCNN: A Winograd Minimal Filtering Algorithm Implementation in CUDA

Mathematics, 2021

Improving the performance of the convolution operation has become a key target for High Performance Computing (HPC) developers due to its prevalence in deep learning applied mainly to video processing. The improvement is being pushed by algorithmic and implementation innovations. Algorithmically, the convolution can be solved as it is mathematically enunciated, but other methods allow transforming it into a Fast Fourier Transform (FFT) or a GEneral Matrix Multiplication (GEMM). In this latter group, the Winograd algorithm is a state-of-the-art variant that is especially suitable for smaller convolutions. In this paper, we present openCNN, an optimized CUDA C++ implementation of the Winograd convolution algorithm. Our approach achieves speedups of up to 1.76× on Turing RTX 2080Ti and up to 1.85× on Ampere RTX 3090 with respect to Winograd convolution in cuDNN 8.2.0. OpenCNN is released as open-source software.

ScalaParBiBit: scaling the binary biclustering in distributed-memory systems

Biclustering is a data mining technique that allows us to find groups of rows and columns that are highly correlated in a 2D dataset. Although there exist several software applications to perform biclustering, most of them suffer from a high computational complexity which prevents their use in large datasets. In this work we present ScalaParBiBit, a parallel tool to find biclusters on binary data, quite common in many research fields such as text mining, marketing or bioinformatics. ScalaParBiBit takes advantage of the special characteristics of these binary datasets, as well as of an efficient parallel implementation and algorithm, to accelerate the biclustering procedure in distributed-memory systems. The experimental evaluation proves that our tool is significantly faster and more scalable than the state-of-the-art tool ParBiBit in a cluster with 32 nodes and 768 cores. Our tool, together with its reference manual, is freely available at https://github.com/fraguela/ScalaParBiBit.

A software cache autotuning strategy for dataflow computing with UPC++ DepSpawn

Computational and Mathematical Methods

A Comparison of Task Parallel Frameworks based on Implicit Dependencies in Multi-core Environments

Proceedings of the 50th Hawaii International Conference on System Sciences (2017), 2017

The larger flexibility that task parallelism offers with respect to data parallelism comes at the cost of a higher complexity due to the variety of tasks and the arbitrary patterns of dependences that they can exhibit. These dependences should be expressed not only correctly, but optimally, i.e. avoiding over-constraints, in order to obtain the maximum performance from the underlying hardware. There have been many proposals to facilitate this non-trivial task, particularly within the scope of today's ubiquitous multi-core architectures. A very interesting family of solutions because of their large scope of application, ease of use and potential performance are those in which the user declares the dependences of each task, and lets the parallel programming framework figure out which are the concrete dependences that appear at runtime and schedule the parallel tasks accordingly. Nevertheless, as far as we know, there are no comparative studies of them that help users identify their relative advantages. In this paper we describe and evaluate four tools of this class discussing the strengths and weaknesses we have found in their use.

Heterogeneous distributed computing based on high-level abstractions

Concurrency and Computation: Practice and Experience

The rise of heterogeneous systems has given rise to great challenges for users, as they involve new concepts, restrictions and frameworks. Their exploitation is further complicated in the context of distributed memory systems, which require the usage of additional different programming paradigms and tools. In this paper we propose a novel approach to program heterogeneous clusters that is based on high level abstractions such as tiles and hierarchical decomposition combined with the powerful APIs that data types and embedded languages can provide in languages such as C++. Rather than building our proposal from scratch, we have implemented it as a natural integration of the existing Hierarchically Tiled Arrays (HTA) and Heterogeneous Programming Library (HPL) projects, the first one being focused on distributed computing and the second one on heterogeneous processing. The result, called Heterogeneous Hierarchically Tiled Arrays (H2TA), is very intuitive and easy to use thanks to the global view of the data and the single-threaded view of the execution that it provides at cluster level together with the transparency it provides with respect to the management of the heterogeneous devices. An evaluation comparing our proposal with MPI-based implementations shows its large programmability advantages and the reasonable overhead incurred.

Easy Dataflow Programming in Clusters with UPC++ DepSpawn

IEEE Transactions on Parallel and Distributed Systems

The Partitioned Global Address Space (PGAS) programming model is one of the most relevant proposals to improve the ability of developers to exploit distributed memory systems. However, despite its important advantages with respect to the traditional message-passing paradigm, PGAS has not yet been widely adopted. We think that PGAS libraries are more promising than languages because they avoid the requirement to (re)write the applications using them, with the implied uncertainties related to portability and interoperability with the vast amount of APIs and libraries that exist for widespread languages. Nevertheless, the need to embed these libraries within a host language can limit their expressiveness and very useful features can be missing. This paper contributes to the advance of PGAS by enabling the simple development of arbitrarily complex task-parallel codes following a dataflow approach on top of the PGAS UPC++ library, implemented in C++. In addition, our proposal, called UPC++ DepSpawn, relies on an optimized multithreaded runtime that provides very competitive performance, as our experimental evaluation shows.

Portable and efficient FFT and DCT algorithms with the Heterogeneous Butterfly Processing Library

Journal of Parallel and Distributed Computing

The existence of a wide variety of computing devices with very different properties makes it essential to develop software that is not only portable among them, but which also adapts to the properties of each platform. In this paper, we present the Heterogeneous Butterfly Processing Library (HBPL), which provides optimized portable kernels for problems of small sizes that allow using orthogonal transform algorithms such as the FFT and DCT on different accelerators and regular CPUs. Our library is implemented on the OpenCL standard, which provides portability on a large number of platforms. Furthermore, high performance is achieved on a wide range of devices by exploiting run-time code generation and metaprogramming guided by a parametrization strategy. An exhaustive evaluation on different platforms shows that our proposal obtains competitive or better performance than related libraries.

Facilitating the development of stencil applications using the Heterogeneous Programming Library

Concurrency and Computation: Practice and Experience

Stencil computations are very common in scientific codes. Heterogeneous systems achieve good results solving these problems, but their programming is complex because of the ghost regions required in multi-device implementations and the difficulty of properly exploiting their hardware. The Heterogeneous Programming Library (HPL) is a recent framework that improves the programmability of heterogeneous devices. This paper describes two extensions of HPL focused on stencil computations. The first one automatically updates the ghost regions they involve. The second one automates the implementation of the computational kernels of these algorithms. In our evaluation the first mechanism reduces on average the number of lines of code and the Halstead programming effort of the host code of comparable HPL baselines by 34% and 64.2%, respectively, while the second contribution reduces these metrics by 72% and 79% in the computational kernels, respectively. Also, the first technique has negligible performance overheads, while the second one matches the performance of manually developed kernels. As an added benefit, the facilitation of the development of these codes thanks to these techniques helps programmers experiment with optimizations suited for these applications such as the ghost cell expansion technique, which provides speedups of up to 13% in our experiments.

Accelerating the HyperLogLog Cardinality Estimation Algorithm

Scientific Programming

In recent years, vast amounts of data of different kinds, from pictures and videos from our cameras to software logs from sensor networks and Internet routers operating day and night, are being generated. This has led to new big data problems, which require new algorithms to handle these large volumes of data and as a result are very computationally demanding because of the volumes to process. In this paper, we parallelize one of these new algorithms, namely, the HyperLogLog algorithm, which estimates the number of different items in a large data set with minimal memory usage, as it lowers the typical memory usage of this type of calculation from O(n) to O(1). We have implemented parallelizations based on OpenMP and OpenCL and evaluated them in a standard multicore system, an Intel Xeon Phi, and two GPUs from different vendors. The results obtained in our experiments, in which we reach a speedup of 88.6 with respect to an optimized sequential implementation, are very positive, parti...

A general and efficient divide-and-conquer algorithm framework for multi-core clusters

Cluster Computing

Divide-and-conquer is one of the most important patterns of parallelism, being applicable to a large variety of problems. In addition, the most powerful parallel systems available nowadays are computer clusters composed of distributed-memory nodes that contain an increasing number of cores that share a common memory. The optimal exploitation of these systems often requires resorting to a hybrid model that mimics the underlying hardware by combining a distributed and a shared memory parallel programming model. This results in longer development times and increased maintenance costs. In this paper we present a very general skeleton library that allows parallelizing any divide-and-conquer problem in hybrid distributed-shared memory systems with little effort while providing much flexibility and good performance. Our proposal combines a message-passing paradigm at the process level and a threaded model inside each process, hiding the related complexity from the user. The evaluation shows that this skeleton provides performance comparable to, and often better than, that of manually optimized codes while requiring considerably less effort when parallelizing applications on multi-core clusters.

Novel parallelization of simulated annealing and Hooke & Jeeves search algorithms for multicore systems with application to complex fisheries stock assessment models

Journal of Computational Science, 2016

Estimating parameters of a statistical fisheries assessment model typically involves a comparison of disparate datasets to a forward simulation model through a likelihood function. In all but trivial cases the estimations of these models tend to be time-consuming due to issues related to multi-modality and non-linearity. This paper develops novel parallel implementations of popular search algorithms, applicable to expensive function calls typically encountered in fisheries stock assessment. It proposes two versions of both Simulated Annealing and Hooke & Jeeves optimization algorithms with the aim of fully utilizing the processing power of common multicore systems. The proposals have been tested on a 24-core server using three different input models. Results indicate that the parallel versions are able to take advantage of available resources without sacrificing the quality of the solution.

Performance comparison of MPI on MPP and workstation clusters

Uso de modelos analíticos en la búsqueda del tamaño de bloque óptimo

Computación de Altas Prestaciones: Actas de las XV Jornadas de Paralelismo, Almería, 15, 16 y 17 de septiembre de 2004, ISBN 84-8240-714-7, pp. 162-167, 2004

Cache miss prediction in sparse matrix computations

New abstractions for data parallel programming

Developing applications is becoming increasingly difficult due to recent growth in machine complexity along many dimensions, especially that of parallelism. We are studying data types that can be used to represent data parallel operations. Developing parallel programs with these data types has numerous advantages and such a strategy should facilitate parallel programming and enable portability across machine classes and machine generations without significant performance degradation. In this paper, we discuss our vision of data parallel programming with powerful abstractions. We first discuss earlier work on data parallel programming and list some of its limitations. Then, we introduce several dimensions along which it is possible to develop more powerful data parallel programming abstractions. Finally, we present two simple examples of data parallel programs that make use of operators developed as part of our studies.

Cflex: a programming language for the flexram intelligent memory architecture
