Academia.edu

Data Parallelism

644 papers
2,368 followers
About this topic
Data parallelism is a computational paradigm that involves distributing data across multiple processing units, allowing simultaneous execution of operations on the data. This approach enhances performance and efficiency in processing large datasets by leveraging parallel computing architectures, such as multi-core processors or distributed systems.

Key research themes

1. How can nested and irregular data-parallelism be implemented efficiently for algorithms on complex data structures?

This research theme addresses the challenge of supporting efficient data-parallel computations on irregular and nested data structures such as graphs, sparse matrices, trees, and complex objects within parallel computing frameworks and languages. Efficient execution models and programming abstractions are needed that can represent irregular data and enable parallel traversals or computations without excessive overhead, while preserving portability and performance across different hardware architectures.

Key finding: The paper presents NESL, the first portable nested data-parallel language supporting nested data structures and nested data-parallel functions, enabling concise expression of parallel algorithms over irregular data like...
by Wolfram Schulte and 1 more
Key finding: The study proposes an intermediate language and runtime scheduling method to map independent traversals of multiple irregular pointer-based data structures (like forests of decision trees and regular expressions) onto SIMD...
Key finding: This work investigates intra- and inter-object parallelism for query processing on complex, nested objects such as 3D models or molecules in non-standard databases. It proposes a layered architecture using nested transactions...
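The core trick behind nested data-parallel systems in the NESL tradition is flattening: a ragged nested sequence is stored as one flat value array plus a segment descriptor, so operations over irregular structure become regular operations over flat arrays. The sketch below shows only the representation and a sequential segmented sum; actual implementations replace the loop with a parallel segmented scan:

```python
# A nested sequence such as [[1, 2], [], [3, 4, 5]] is stored flat,
# with a segment descriptor recording the length of each inner sequence.
def segmented_sum(values, segments):
    # values:   flat element array, e.g. [1, 2, 3, 4, 5]
    # segments: inner-sequence lengths, e.g. [2, 0, 3]
    sums, i = [], 0
    for length in segments:
        sums.append(sum(values[i:i + length]))  # one sum per segment
        i += length
    return sums
```

Because the flat arrays are regular, the same representation handles graphs, sparse matrices, and trees without per-element pointers, which is what makes vector and SIMD back-ends feasible for irregular data.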

2. What strategies enable efficient hybrid parallelism for training large-scale deep neural networks beyond conventional data or model parallelism?

As deep learning models grow in size and complexity, single-model data parallelism or model parallelism alone often become insufficient for memory capacity or communication efficiency. This research theme explores techniques combining or extending data, model, pipeline, and operator-level parallelisms to efficiently train huge networks on distributed systems. Key questions include how to split models and data, partition operators, optimize communication, and schedule tasks for maximal scalability and throughput while maintaining training accuracy.

Key finding: RaNNC middleware automatically partitions PyTorch models for hybrid parallelism by identifying atomic subcomponents and grouping them into blocks, then using dynamic programming to find balanced pipeline partitions fitting...
Key finding: Alpa framework unifies data, operator (intra-operator), and pipeline (inter-operator) parallelisms into a hierarchical execution plan space and automatically derives efficient distributed training plans. By mapping...
Key finding: SplitBrain introduces hybrid parallelism for distributed CNN training by combining data parallelism and model parallelism with layer-specific partitioning. Compute-intensive convolution layers are co-located while...
Key finding: This work extends data parallelism by incorporating spatial parallelism to partition single samples in large 3D CNNs, enabling strong scaling beyond mini-batch size limitations. Implemented within LBANN for CosmoFlow and 3D...
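All of the hybrid schemes above build on the basic data-parallel training step: each worker computes a gradient on its own data shard, the gradients are averaged (the all-reduce), and every replica applies the same update. A minimal sketch of that step, for an assumed one-parameter linear model y = w*x with squared loss (the model and all names are illustrative, not taken from any of the papers):

```python
def local_gradient(w, shard):
    # Gradient of mean squared error for y = w * x on this worker's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    # Each shard's gradient would be computed on a separate worker.
    grads = [local_gradient(w, s) for s in shards]
    g = sum(grads) / len(grads)   # the all-reduce: average across workers
    return w - lr * g             # identical update on every replica
```

Model, pipeline, and operator parallelism come into play when a single replica no longer fits on one device; the frameworks surveyed here automate the choice of how to combine them with this data-parallel loop.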

3. How can task-parallel pipeline programming models and asynchronous execution improve performance and composability in parallel algorithms?

Pipeline parallelism is a fundamental pattern capturing sequences of task stages with dependencies, common in streaming and hierarchical computations. Current pipeline frameworks focus on data-centric abstractions which are convenient but can be inefficient and inflexible for purely task-parallel pipeline algorithms. This theme investigates programming models, scheduling algorithms, and runtime techniques that separate data abstraction from pipeline task scheduling, enhance composability with other parallel paradigms, and enable efficient dynamic load balancing and resource utilization.

Key finding: Pipeflow is a novel C++ task-parallel pipeline programming framework built atop the Taskflow system that decouples pipeline scheduling from data abstractions. It provides a composable interface enabling users to explicitly...
Key finding: The Encore programming language integrates active object parallelism with unshared local heaps and capabilities to guarantee race-free concurrency without complex synchronization. Combining message-based concurrency with...
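The pipeline pattern this theme describes can be sketched as a chain of stages connected by queues: each stage runs as its own task, pulling items from its input queue and pushing results downstream, with a sentinel marking end of stream. This is a generic illustration of the pattern (not Pipeflow's API); all names are the author's own:

```python
from queue import Queue
from threading import Thread

def stage(fn, q_in, q_out):
    # Pull items until the end-of-stream sentinel (None), apply fn, push on.
    while (item := q_in.get()) is not None:
        q_out.put(fn(item))
    q_out.put(None)  # propagate the sentinel to the next stage

def run_pipeline(items, fns):
    # One queue between every pair of adjacent stages.
    queues = [Queue() for _ in range(len(fns) + 1)]
    workers = [Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(fns)]
    for w in workers:
        w.start()
    for item in items:          # feed the first stage
        queues[0].put(item)
    queues[0].put(None)
    out = []
    while (item := queues[-1].get()) is not None:
        out.append(item)        # drain the last stage
    for w in workers:
        w.join()
    return out
```

Because the stages run concurrently, item k can be in stage 2 while item k+1 is in stage 1; the data-centric frameworks criticized above bake the queue and item representation into the model, whereas task-centric designs like Pipeflow let the scheduler own only the stage dependencies.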

All papers in Data Parallelism

This paper describes and evaluates three architectural methods for accomplishing data parallel computation in a programmable embedded system. Comparisons are made between the well-studied Very Long Instruction Word (VLIW) and Single... more
Knowledge distillation (KD) is a popular method for condensing large language models (LLMs) into smaller, faster, more efficient versions without sacrificing performance. However, it is no longer feasible to run distillation on a... more
Task-based libraries such as Intel's Threading Building Blocks (TBB) provide higher levels of abstraction than threads for parallel programming. Work remains, however, to determine how straightforward it is to use these libraries to... more
This paper describes a numerical method for the parallel solution of the differential measure inclusion problem posed by mechanical multibody systems containing bilateral and unilateral frictional constraints. The method proposed has been... more
With current systems, some important complex queries may take days to complete because of: (1) the volume of data to be processed, (2) limited aggregate resources. Introducing parallelism addresses the first problem. Cheaper, but powerful... more
Emerging computing architectures such as near-memory computing (NMC) promise improved performance for applications by reducing the data movement between CPU and memory. However, detecting such applications is not a trivial task. In this... more
Automatic optimization of application-specific instruction-set processor (ASIP) architectures mostly focuses on the internal memory hierarchy design, or the extension of reduced instruction-set architectures with complex custom... more
SIMD (single instruction multiple data)-type processors have been found very efficient in image processing applications, because their repetitive structure is able to exploit the huge amount of data-level parallelism in pixel-type... more
Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages... more
Codes written in a naive way seldom effectively exploit the computing resources, while writing optimized codes is usually a complex task that requires certain levels of expertise. This problem is further increased in the presence of... more
Heterogeneous devices require much more work from programmers than traditional CPUs, particularly when there are several of them, as each one has its own memory space. Multidevice applications require to distribute kernel executions and,... more
Multicore machines are becoming common. There are many languages, language extensions and libraries devoted to improve the programmability and performance of these machines. In this paper we compare two libraries, that face the problem of... more
Machine learning can provide deep insights into data, allowing machines to make high-quality predictions and having been widely used in real-world applications, such as text mining, visual classification, and recommender systems. However,... more
Parallel Machine (APM) model separates the definitions of parallel operations from the application algorithm, which defines the sequence of parallel operations to be executed. An APM contains a set of parallel operation definitions, which... more
SystemML aims at declarative, large-scale machine learning (ML) on top of MapReduce, where high-level ML scripts with R-like syntax are compiled to programs of MR jobs. The declarative specification of ML algorithms enables---in contrast... more
With the increasing demand for deep learning in the last few years, CNNs have been widely used in many applications and have gained interest in classification, regression, and image recognition tasks. The training of these deep neural... more
In this paper, we investigate how to exploit task-parallelism during the execution of the Cholesky factorization on clusters of multicore processors with the SMPSs programming model. Our analysis reveals that the major difficulties in... more
Background: Sequence similarity searching is an important and challenging task in molecular biology and next-generation sequencing should further strengthen the need for faster algorithms to process such vast amounts of data. At the same... more
Cathedral: total centralized control (design, implement, test, release). Bazaar: total decentralization (release early, release often, make users partners in software development). "Given enough eyeballs, all bugs are shallow." Code errors,... more
Modern Molecular Dynamics methods are employed to study quantum manybody systems, chemically reactive systems including explicit electronic degrees of freedom, and combinations thereof, as well as large classical biomolecular systems.... more
In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across... more
K-means clustering, a fundamental unsupervised machine learning technique, is widely used in anomaly detection, image recognition, and customer segmentation. Traditional Python implementations, especially those using NumPy, face... more
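The abstract above concerns parallelizing K-means; its assignment step is a textbook data-parallel workload, since each point's nearest centroid is computed independently. A hedged sketch of that step (plain Python with a thread pool for portability — the paper's actual implementation is not shown here, and all names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def assign_chunk(args):
    # Label each point in the chunk with its nearest centroid's index.
    points, centroids = args
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

def parallel_assign(points, centroids, n_workers=4):
    # Split the points into chunks; each chunk is labeled independently.
    size = (len(points) + n_workers - 1) // n_workers
    chunks = [(points[i:i + size], centroids)
              for i in range(0, len(points), size)]
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        out = pool.map(assign_chunk, chunks)
    return [label for part in out for label in part]
```

The centroid-update step parallelizes the same way: each worker accumulates partial sums and counts per cluster for its chunk, and the partials are reduced before dividing.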
Pipeline is a fundamental parallel programming pattern. Mainstream pipeline programming frameworks count on data abstractions to perform pipeline scheduling. This design is convenient for data-centric pipeline applications but inefficient... more
In the future, if we are to continue to expect improved application performance we will have to achieve it by exploiting coarse-grained hardware parallelism rather than simply relying on processor cycles getting faster. Programmers will,... more
this paper we present COLT HPF (COordination Layer for Tasks expressed in HPF), a portable coordination /communication layer for HPF tasks. COLT HPF is implemented on top of PVM and provides suitable mechanisms for starting, even at... more
Experience in applicative fields, above all deriving from the development of multidisciplinary parallel applications, seems to suggest a model where art outer coordination level is provided to allow data parallel tasks to run concurrently... more
Model parallelism is a standard paradigm to decouple a deep neural network (DNN) into sub-nets when the model is large. Recent advances in class parallelism significantly reduce the communication overhead of model parallelism to a single... more
Nearly two decades of research in the area of Inductive Logic Programming (ILP) have seen steady progress in clarifying its theoretical foundations and regular demonstrations of its applicability to complex problems in very diverse... more
The main objective of compiler and processor designers is to effectively exploit the instruction-level parallelism (ILP) available in applications. Although most of the times their research activities have been conducted separately, we... more
Complex objects to support non-standard database applications require the use of substantial computing resources because their powerful operations must be performed and maintained in an interactive environment. Since the exploitation of... more
I must give thanks to libre software. It has provided me with a lot of material to study and learn from, play with, and contribute to. I can't remember anything I took a look at, as fortuitous as it may have seemed at the time, that didn't... more
This work proposes RaNNC (Rapid Neural Network Connector) as middleware for automatic hybrid parallelism. In recent deep learning research, as exemplified by T5 and GPT-3, the size of neural network models continues to grow. Since such... more
Many institutions (e.g., universities, banks) are moving towards clusters. These clusters consist of commodity components such as PCs connected by fast networks. However, the single factor limiting the harnessing of the enormous computing... more
A number of computational applications lack instruction-level parallelism. This loss is particularly acute on sequences of dependent instructions on wide-issue or deeply pipelined architectures. We consider four real applications from... more
In this paper we present ALDIMS, a language that combines the expressibility of general functional (MIMD) parallelism with compact expressibility of data (SPMD) parallelism. It uses distributed data structures for specifying data... more
H.264/AVC has been introduced in recent years to decrease the bit-rate and to increase the flexibility of implementations. After careful study and analysis, we have concluded that the complexity of this video codec depends mainly on its... more
We study a class of deep neural networks with architectures that form a directed acyclic graph (DAG). For backpropagation defined by gradient descent with adaptive momentum, we show weights converge for a large class of nonlinear... more
Applications containing compute-intensive kernels with nested loops can effectively leverage FPGAs to exploit fineand coarse-grained parallelism. HLS tools used to translate these kernels from high-level languages (e.g., C/C++), however,... more
The chemical programming paradigm was introduced at the end of the 1980s as an elegant way to define distributed programs mathematically. The principle rests on an analogy with chemical reactions, in which a... more
Two ways to exploit chips with a very large number of transistors are multicore processors and programmable logic chips. Some data parallel algorithms can be executed efficiently on ordinary parallel computers, including multicores. A... more
We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by... more
The massive addition of cores on a chip is adding more pressure to the accesses to main memory. In order to avoid this bottleneck, we propose the use of a simple producer-consumer model, which allows for the temporary results to be... more
This paper addresses the design of visual paradigms for observing the parallel execution of logic programs. First, an intuitive method is proposed for arriving at the design of a paradigm and its implementation as a tool for a given model... more