Academia.edu

Data Parallelism

644 papers
2,368 followers
About this topic
Data parallelism is a computational paradigm that involves distributing data across multiple processing units, allowing simultaneous execution of operations on the data. This approach enhances performance and efficiency in processing large datasets by leveraging parallel computing architectures, such as multi-core processors or distributed systems.

Key research themes

1. How can nested and irregular data-parallelism be implemented efficiently for algorithms on complex data structures?

This research theme addresses the challenge of supporting efficient data-parallel computations on irregular and nested data structures such as graphs, sparse matrices, trees, and complex objects within parallel computing frameworks and languages. Efficient execution models and programming abstractions are needed that can represent irregular data and enable parallel traversals or computations without excessive overhead, while preserving portability and performance across different hardware architectures.

Key finding: The paper presents NESL, the first portable nested data-parallel language supporting nested data structures and nested data-parallel functions, enabling concise expression of parallel algorithms over irregular data like...
by Wolfram Schulte and 1 more
Key finding: The study proposes an intermediate language and runtime scheduling method to map independent traversals of multiple irregular pointer-based data structures (like forests of decision trees and regular expressions) onto SIMD...
Key finding: This work investigates intra- and inter-object parallelism for query processing on complex, nested objects such as 3D models or molecules in non-standard databases. It proposes a layered architecture using nested transactions...
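The core trick behind nested data-parallel systems in the NESL tradition is flattening: a ragged nested sequence is stored as one flat value array plus a segment descriptor, so operations over irregular structure become regular operations over flat arrays. The sketch below shows only the representation and a sequential segmented sum; actual implementations replace the loop with a parallel segmented scan:

```python
# A nested sequence such as [[1, 2], [], [3, 4, 5]] is stored flat,
# with a segment descriptor recording the length of each inner sequence.
def segmented_sum(values, segments):
    # values:   flat element array, e.g. [1, 2, 3, 4, 5]
    # segments: inner-sequence lengths, e.g. [2, 0, 3]
    sums, i = [], 0
    for length in segments:
        sums.append(sum(values[i:i + length]))  # one sum per segment
        i += length
    return sums
```

Because the flat arrays are regular, the same representation handles graphs, sparse matrices, and trees without per-element pointers, which is what makes vector and SIMD back-ends feasible for irregular data.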

2. What strategies enable efficient hybrid parallelism for training large-scale deep neural networks beyond conventional data or model parallelism?

As deep learning models grow in size and complexity, single-model data parallelism or model parallelism alone often become insufficient for memory capacity or communication efficiency. This research theme explores techniques combining or extending data, model, pipeline, and operator-level parallelisms to efficiently train huge networks on distributed systems. Key questions include how to split models and data, partition operators, optimize communication, and schedule tasks for maximal scalability and throughput while maintaining training accuracy.

Key finding: RaNNC middleware automatically partitions PyTorch models for hybrid parallelism by identifying atomic subcomponents and grouping them into blocks, then using dynamic programming to find balanced pipeline partitions fitting...
Key finding: Alpa framework unifies data, operator (intra-operator), and pipeline (inter-operator) parallelisms into a hierarchical execution plan space and automatically derives efficient distributed training plans. By mapping...
Key finding: SplitBrain introduces hybrid parallelism for distributed CNN training by combining data parallelism and model parallelism with layer-specific partitioning. Compute-intensive convolution layers are co-located while...
Key finding: This work extends data parallelism by incorporating spatial parallelism to partition single samples in large 3D CNNs, enabling strong scaling beyond mini-batch size limitations. Implemented within LBANN for CosmoFlow and 3D...
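All of the hybrid schemes above build on the basic data-parallel training step: each worker computes a gradient on its own data shard, the gradients are averaged (the all-reduce), and every replica applies the same update. A minimal sketch of that step, for an assumed one-parameter linear model y = w*x with squared loss (the model and all names are illustrative, not taken from any of the papers):

```python
def local_gradient(w, shard):
    # Gradient of mean squared error for y = w * x on this worker's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    # Each shard's gradient would be computed on a separate worker.
    grads = [local_gradient(w, s) for s in shards]
    g = sum(grads) / len(grads)   # the all-reduce: average across workers
    return w - lr * g             # identical update on every replica
```

Model, pipeline, and operator parallelism come into play when a single replica no longer fits on one device; the frameworks surveyed here automate the choice of how to combine them with this data-parallel loop.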

3. How can task-parallel pipeline programming models and asynchronous execution improve performance and composability in parallel algorithms?

Pipeline parallelism is a fundamental pattern capturing sequences of task stages with dependencies, common in streaming and hierarchical computations. Current pipeline frameworks focus on data-centric abstractions which are convenient but can be inefficient and inflexible for purely task-parallel pipeline algorithms. This theme investigates programming models, scheduling algorithms, and runtime techniques that separate data abstraction from pipeline task scheduling, enhance composability with other parallel paradigms, and enable efficient dynamic load balancing and resource utilization.

Key finding: Pipeflow is a novel C++ task-parallel pipeline programming framework built atop the Taskflow system that decouples pipeline scheduling from data abstractions. It provides a composable interface enabling users to explicitly...
Key finding: The Encore programming language integrates active object parallelism with unshared local heaps and capabilities to guarantee race-free concurrency without complex synchronization. Combining message-based concurrency with...
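The pipeline pattern this theme describes can be sketched as a chain of stages connected by queues: each stage runs as its own task, pulling items from its input queue and pushing results downstream, with a sentinel marking end of stream. This is a generic illustration of the pattern (not Pipeflow's API); all names are the author's own:

```python
from queue import Queue
from threading import Thread

def stage(fn, q_in, q_out):
    # Pull items until the end-of-stream sentinel (None), apply fn, push on.
    while (item := q_in.get()) is not None:
        q_out.put(fn(item))
    q_out.put(None)  # propagate the sentinel to the next stage

def run_pipeline(items, fns):
    # One queue between every pair of adjacent stages.
    queues = [Queue() for _ in range(len(fns) + 1)]
    workers = [Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(fns)]
    for w in workers:
        w.start()
    for item in items:          # feed the first stage
        queues[0].put(item)
    queues[0].put(None)
    out = []
    while (item := queues[-1].get()) is not None:
        out.append(item)        # drain the last stage
    for w in workers:
        w.join()
    return out
```

Because the stages run concurrently, item k can be in stage 2 while item k+1 is in stage 1; the data-centric frameworks criticized above bake the queue and item representation into the model, whereas task-centric designs like Pipeflow let the scheduler own only the stage dependencies.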

All papers in Data Parallelism

This paper describes and evaluates three architectural methods for accomplishing data parallel computation in a programmable embedded system. Comparisons are made between the well-studied Very Long Instruction Word (VLIW) and Single... more
Knowledge distillation (KD) is a popular method for condensing large language models (LLMs) into smaller, faster, more efficient versions without sacrificing performance. However, it is no longer feasible to run distillation on a... more
Task-based libraries such as Intel's Threading Building Blocks (TBB) provide higher levels of abstraction than threads for parallel programming. Work remains, however, to determine how straightforward it is to use these libraries to... more
This paper describes a numerical method for the parallel solution of the differential measure inclusion problem posed by mechanical multibody systems containing bilateral and unilateral frictional constraints. The method proposed has been... more
With current systems, some important complex queries may take days to complete because of: (1) the volume of data to be processed, (2) limited aggregate resources. Introducing parallelism addresses the first problem. Cheaper, but powerful... more
Emerging computing architectures such as near-memory computing (NMC) promise improved performance for applications by reducing the data movement between CPU and memory. However, detecting such applications is not a trivial task. In this... more
Automatic optimization of application-specific instruction-set processor (ASIP) architectures mostly focuses on the internal memory hierarchy design, or the extension of reduced instruction-set architectures with complex custom... more
SIMD (single instruction multiple data)-type processors have been found very efficient in image processing applications, because their repetitive structure is able to exploit the huge amount of data-level parallelism in pixel-type... more
Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages... more
Codes written in a naive way seldom effectively exploit the computing resources, while writing optimized codes is usually a complex task that requires certain levels of expertise. This problem is further increased in the presence of... more
Heterogeneous devices require much more work from programmers than traditional CPUs, particularly when there are several of them, as each one has its own memory space. Multidevice applications require to distribute kernel executions and,... more
Multicore machines are becoming common. There are many languages, language extensions and libraries devoted to improve the programmability and performance of these machines. In this paper we compare two libraries, that face the problem of... more
Machine learning can provide deep insights into data, allowing machines to make high-quality predictions and having been widely used in real-world applications, such as text mining, visual classification, and recommender systems. However,... more
Parallel Machine (APM) model separates the definitions of parallel operations from the application algorithm, which defines the sequence of parallel operations to be executed. An APM contains a set of parallel operation definitions, which... more
SystemML aims at declarative, large-scale machine learning (ML) on top of MapReduce, where high-level ML scripts with R-like syntax are compiled to programs of MR jobs. The declarative specification of ML algorithms enables---in contrast... more
With the increasing demand for deep learning in the last few years, CNNs have been widely used in many applications and have gained interest in classification, regression, and image recognition tasks. The training of these deep neural... more
In this paper, we investigate how to exploit task-parallelism during the execution of the Cholesky factorization on clusters of multicore processors with the SMPSs programming model. Our analysis reveals that the major difficulties in... more
Background: Sequence similarity searching is an important and challenging task in molecular biology and next-generation sequencing should further strengthen the need for faster algorithms to process such vast amounts of data. At the same... more
Cathedral: total centralized control (design, implement, test, release). Bazaar: total decentralization (release early, release often, make users partners in software development). "Given enough eyeballs, all bugs are shallow." Code errors,... more
Modern Molecular Dynamics methods are employed to study quantum manybody systems, chemically reactive systems including explicit electronic degrees of freedom, and combinations thereof, as well as large classical biomolecular systems.... more
In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across... more
K-means clustering, a fundamental unsupervised machine learning technique, is widely used in anomaly detection, image recognition, and customer segmentation. Traditional Python implementations, especially those using NumPy, face... more
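The abstract above concerns parallelizing K-means; its assignment step is a textbook data-parallel workload, since each point's nearest centroid is computed independently. A hedged sketch of that step (plain Python with a thread pool for portability — the paper's actual implementation is not shown here, and all names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def assign_chunk(args):
    # Label each point in the chunk with its nearest centroid's index.
    points, centroids = args
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

def parallel_assign(points, centroids, n_workers=4):
    # Split the points into chunks; each chunk is labeled independently.
    size = (len(points) + n_workers - 1) // n_workers
    chunks = [(points[i:i + size], centroids)
              for i in range(0, len(points), size)]
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        out = pool.map(assign_chunk, chunks)
    return [label for part in out for label in part]
```

The centroid-update step parallelizes the same way: each worker accumulates partial sums and counts per cluster for its chunk, and the partials are reduced before dividing.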
Pipeline is a fundamental parallel programming pattern. Mainstream pipeline programming frameworks count on data abstractions to perform pipeline scheduling. This design is convenient for data-centric pipeline applications but inefficient... more
In the future, if we are to continue to expect improved application performance we will have to achieve it by exploiting coarse-grained hardware parallelism rather than simply relying on processor cycles getting faster. Programmers will,... more
this paper we present COLT HPF (COordination Layer for Tasks expressed in HPF), a portable coordination /communication layer for HPF tasks. COLT HPF is implemented on top of PVM and provides suitable mechanisms for starting, even at... more
Experience in applicative fields, above all deriving from the development of multidisciplinary parallel applications, seems to suggest a model where art outer coordination level is provided to allow data parallel tasks to run concurrently... more
Model parallelism is a standard paradigm to decouple a deep neural network (DNN) into sub-nets when the model is large. Recent advances in class parallelism significantly reduce the communication overhead of model parallelism to a single... more
Nearly two decades of research in the area of Inductive Logic Programming (ILP) have seen steady progress in clarifying its theoretical foundations and regular demonstrations of its applicability to complex problems in very diverse... more
The main objective of compiler and processor designers is to effectively exploit the instruction-level parallelism (ILP) available in applications. Although most of the times their research activities have been conducted separately, we... more
Complex objects to support non-standard database applications require the use of substantial computing resources because their powerful operations must be performed and maintained in an interactive environment. Since the exploitation of... more
I must give thanks to libre software. It has provided me with a lot of material to study and learn from, play with, and contribute to. I can't remember anything I took a look at, as fortuitous as it may have seemed at the time, that didn't... more
This work proposes RaNNC (Rapid Neural Network Connector) as middleware for automatic hybrid parallelism. In recent deep learning research, as exemplified by T5 and GPT-3, the size of neural network models continues to grow. Since such... more
Many institutions (e.g., universities, banks) are moving towards clusters. These clusters consist of commodity components such as PCs connected by fast networks. However, the single factor limiting the harnessing of the enormous computing... more
A number of computational applications lack instruction-level parallelism. This loss is particularly acute on sequences of dependent instructions on wide-issue or deeply pipelined architectures. We consider four real applications from... more
In this paper we present ALDIMS, a language that combines the expressibility of general functional (MIMD) parallelism with compact expressibility of data (SPMD) parallelism. It uses distributed data structures for specifying data... more
H.264/AVC has been introduced in recent years to decrease the bit-rate and to increase the flexibility of implementations. After careful study and analysis, we have concluded that the complexity of this video codec depends mainly on its... more
We study a class of deep neural networks with architectures that form a directed acyclic graph (DAG). For backpropagation defined by gradient descent with adaptive momentum, we show weights converge for a large class of nonlinear... more
Applications containing compute-intensive kernels with nested loops can effectively leverage FPGAs to exploit fineand coarse-grained parallelism. HLS tools used to translate these kernels from high-level languages (e.g., C/C++), however,... more
The chemical programming paradigm was introduced at the end of the 1980s as an elegant way to define distributed programs mathematically. The principle rests on an analogy with chemical reactions, in which a... more
Two ways to exploit chips with a very large number of transistors are multicore processors and programmable logic chips. Some data parallel algorithms can be executed efficiently on ordinary parallel computers, including multicores. A... more
We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by... more
The massive addition of cores on a chip is adding more pressure to the accesses to main memory. In order to avoid this bottleneck, we propose the use of a simple producer-consumer model, which allows for the temporary results to be... more
This paper addresses the design of visual paradigms for observing the parallel execution of logic programs. First, an intuitive method is proposed for arriving at the design of a paradigm and its implementation as a tool for a given model... more