Papers by Daniel Alejandro Orozco

International Journal of Parallel Programming, 2015
This paper provides an extended description of the design and implementation of the Time Iterated Dependency Flow (TIDeFlow) execution model. TIDeFlow is a dataflow-inspired model that simplifies the scheduling of shared resources on many-core processors. To accomplish this, programs are specified as directed graphs, and the dataflow model is extended through the introduction of intrinsic constructs for parallel loops and the arbitrary pipelining of operations. The main contributions of this paper are: (1) a formal description of the TIDeFlow execution model and its programming model, (2) a description of the TIDeFlow implementation and its strengths over previous execution models, such as the ability to natively express parallel loops and task pipelining, and (3) an analysis of experimental results showing the advantages of TIDeFlow with respect to expressing parallel programs on many-core architectures.
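The graph-based model described above can be illustrated with a toy runtime: program nodes are parallel loops, and a node fires once all of its dependencies have completed. This is only a sketch of the general idea; the `Actor` and `run_graph` names are hypothetical and do not reflect TIDeFlow's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

class Actor:
    """A graph node: a parallel loop with a trip count and a body."""
    def __init__(self, name, iterations, body, deps=()):
        self.name, self.iterations, self.body = name, iterations, body
        self.deps = list(deps)

def run_graph(actors, workers=4):
    """Fire each actor once all of its dependencies have completed;
    an actor's iterations run as a parallel loop. Assumes the graph
    is acyclic (a cycle would spin forever in this sketch)."""
    done, order = set(), []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(done) < len(actors):
            ready = [a for a in actors
                     if a.name not in done and all(d.name in done for d in a.deps)]
            for a in ready:
                list(pool.map(a.body, range(a.iterations)))
                done.add(a.name)
                order.append(a.name)
    return order
```

A two-stage pipeline is simply two actors connected by an edge; extending this toward TIDeFlow's arbitrary pipelining would require firing actors repeatedly rather than once.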

20th Annual International Conference on High Performance Computing, 2013
Optimization of parallel applications under new many-core architectures is challenging even for regular applications. Successful strategies inherited from previous generations of parallel or serial architectures return only incremental gains in performance, and further optimization and tuning are required. We argue that conservative static optimizations are not the best fit for modern many-core architectures. The limited advantages of static techniques stem from the new scenarios present in many-cores: plenty of thread units sharing several resources under different coordination mechanisms. We point out that scheduling and data movement across the memory hierarchy are extremely important to the performance of applications. In particular, we found that the scheduling of data movement operations significantly impacts performance. To overcome those difficulties, we took advantage of the fine-grain synchronization primitives of many-cores to define percolation operations that schedule data movement properly. In addition, we have fused percolation operations with dynamic scheduling into a dynamic percolation approach. We used Dense Matrix Multiplication on a modern many-core processor to illustrate how our proposed techniques are able to increase performance under these new environments. In our study on the IBM Cyclops-64, we raised the performance from 44 GFLOPS (out of 80 GFLOPS possible) to 70.0 GFLOPS (operands in on-chip memory) and 65.6 GFLOPS (operands in off-chip memory). The success of our approach also resulted in excellent power efficiency: 1.09 GFLOPS/Watt and 993 MFLOPS/Watt when the input data resided in on-chip and off-chip memory, respectively.
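The combination of dynamic scheduling with percolation-style data movement can be sketched as follows: worker threads pull output tiles from a shared queue, and each task first copies ("percolates") its operand tiles into local scratch buffers before computing on them. This is a simplified illustration in plain Python, not the Cyclops-64 implementation; all names are hypothetical:

```python
import threading, queue

def blocked_matmul_percolated(A, B, n, bs, workers=4):
    """C = A*B on n-by-n lists with tile size bs. Each (i, j) output tile is a
    task claimed dynamically from a shared queue; for every K-step the task
    percolates the operand tiles into local scratch (a stand-in for on-chip
    memory) before the inner-product computation."""
    C = [[0.0] * n for _ in range(n)]
    lock = threading.Lock()                  # tiles are disjoint; lock kept for clarity
    tasks = queue.Queue()
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            tasks.put((i, j))

    def worker():
        while True:
            try:
                i, j = tasks.get_nowait()    # dynamic scheduling: grab next tile
            except queue.Empty:
                return
            acc = [[0.0] * bs for _ in range(bs)]
            for k in range(0, n, bs):
                a = [row[k:k + bs] for row in A[i:i + bs]]   # percolated A tile
                b = [row[j:j + bs] for row in B[k:k + bs]]   # percolated B tile
                for x in range(bs):
                    for z in range(bs):
                        axz = a[x][z]
                        for y in range(bs):
                            acc[x][y] += axz * b[z][y]
            with lock:
                for x in range(bs):
                    for y in range(bs):
                        C[i + x][j + y] = acc[x][y]

    ts = [threading.Thread(target=worker) for _ in range(workers)]
    for t in ts: t.start()
    for t in ts: t.join()
    return C
```

In the paper's setting the percolation step is a genuine off-chip-to-on-chip copy overlapped with computation; here the slicing merely marks where that copy would happen.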

Lecture Notes in Computer Science, 2011
This paper proposes tiling techniques based on data dependencies and not on code structure. The work presented here leverages and expands previous work by the authors in the domain of non-traditional tiling for parallel applications. The main contributions of this paper are: (1) A formal description of tiling from the point of view of the data produced and not of the source code. (2) A mathematical proof of an optimum tiling in terms of maximum reuse for stencil applications, addressing the disparity between computation power and memory bandwidth in many-core architectures. (3) A description and implementation of our tiling technique for well-known stencil applications. (4) Experimental evidence that confirms the effectiveness of the proposed tiling in alleviating the disparity between computation power and memory bandwidth in many-core architectures. Our experiments, performed using one of the first Cyclops-64 many-core chips produced, confirm the effectiveness of our approach in reducing the total number of memory operations of stencil applications as well as the running time of the application.
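The reuse argument can be illustrated with a 1D three-point stencil under periodic boundaries: loading a tile together with a halo of T ghost cells once allows T time steps to be computed without further trips to slow memory. This is a simplified overlapped time-tiling sketch, not the provably optimal tile shape derived in the paper:

```python
def stencil_tiled(u, T, bs):
    """Apply T steps of the periodic 3-point averaging stencil to list u,
    tile by tile. Each tile of bs points is loaded once with a halo of T
    ghost cells on each side; the valid region shrinks by one cell per
    side per step, leaving exactly bs correct outputs after T steps."""
    n = len(u)
    out = [0.0] * n
    loads = 0
    for s in range(0, n, bs):
        local = [u[(s - T + j) % n] for j in range(bs + 2 * T)]  # tile + halo
        loads += len(local)                                       # count slow-memory reads
        for _ in range(T):
            local = [(local[i - 1] + local[i] + local[i + 1]) / 3
                     for i in range(1, len(local) - 1)]
        out[s:s + bs] = local
    return out, loads
```

For n = 16, bs = 4 and T = 3 the tiled version issues 40 loads instead of the 48 an untiled sweep would need (16 per step), and the advantage grows with T as long as the tile is much wider than the halo.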

2011 First Workshop on Data-Flow Execution Models for Extreme Scale Computing, 2011
The many-core revolution brought forward by recent advances in computer architecture has created immense challenges in the writing of parallel programs for High Performance Computing (HPC). Development of parallel HPC programs remains an art, and a universal doctrine for synchronization, scheduling and execution in general has not been found for many-core/multi-core architectures. These issues are exacerbated by the popularity of traditional execution models derived from the serial-programming school of thought. Previous solutions for parallel programming, such as OpenMP, MPI and similar models, require significant effort from the programmer to achieve high performance. This paper provides an introduction to the Time Iterated Dependency Flow (TIDeFlow) model, a parallel execution model inspired by dataflow, and a description of its associated runtime system. TIDeFlow was designed for efficient development of high-performance parallel programs for many-core architectures. The TIDeFlow execution model was designed to efficiently express (1) parallel loops, (2) dependencies (data, control or other) between parallel loops and (3) composability of programs. TIDeFlow is a work in progress. This paper presents an introduction to the TIDeFlow execution model and shows examples and preliminary results to illustrate the qualities of TIDeFlow. The main contributions of this paper are: 1. A brief description of the TIDeFlow execution model and its programming model, 2. A description of the implementation of the TIDeFlow runtime system and its capabilities, and 3. Preliminary results showing the suitability of TIDeFlow to express parallel programs in many-core architectures.

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012
The recent evolution of many-core architectures has resulted in chips where the number of processor elements (PEs) is in the hundreds and continues to increase every day. In addition, many-core processors are more and more frequently characterized by the diversity of their resources and the way the sharing of those resources is arbitrated. On such machines, task scheduling is of paramount importance to orchestrate a satisfactory distribution of tasks with an efficient utilization of resources, especially when fine-grain parallelism is desired or required. In the past, the primary focus of scheduling techniques has been on achieving load balancing and reducing overhead with the aim of increasing total performance. This focus has resulted in a scheduling paradigm where Static Scheduling (SS) is preferred to Dynamic Scheduling (DS) for highly regular and embarrassingly parallel applications running on homogeneous architectures. We have revisited the task scheduling problem for these types of applications under the scenario imposed by many-core architectures to investigate whether there exist scenarios where DS is better than SS. Our main contribution is the idea that, for highly regular and embarrassingly parallel applications, DS is preferable to SS in some situations commonly found in many-core architectures. We present experimental evidence that shows how the performance of SS is degraded by the new environment on many-core chips. We analyze three reasons that contribute to the superiority of DS over SS on many-core architectures under the situations described: 1) A uniform mapping of work to processors that does not consider the granularity of tasks is not necessarily scalable under limited amounts of work. 2) The presence of shared resources (i.e., the crossbar switch) produces unexpected and stochastic variations in the duration of tasks that SS is unable to manage properly. 3) Hardware features, such as in-memory atomic operations, greatly contribute to decreasing the overhead of DS.
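The third point, cheap hardware-assisted dynamic scheduling, boils down to distributing iterations with a fetch-and-add on a shared counter. The sketch below emulates the atomic with a lock (on a chip like Cyclops-64, an in-memory atomic would make the claim nearly free); the function and parameter names are illustrative:

```python
import threading

def dynamic_sum_squares(n, workers=4, chunk=8):
    """Sum i*i for i in [0, n) with dynamic scheduling: workers repeatedly
    claim the next chunk of iterations via an atomic-style fetch-and-add
    on a shared index (emulated here with a lock)."""
    next_idx = [0]
    lock = threading.Lock()
    partial = [0] * workers
    def worker(w):
        while True:
            with lock:                       # stands in for fetch-and-add
                start = next_idx[0]
                next_idx[0] += chunk
            if start >= n:
                return
            for i in range(start, min(start + chunk, n)):
                partial[w] += i * i
    ts = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for t in ts: t.start()
    for t in ts: t.join()
    return sum(partial)
```

Unlike a static split of `n` into `workers` equal pieces, this tolerates stochastic variation in per-iteration cost: a delayed worker simply claims fewer chunks.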
Lecture Notes in Computer Science, 2011

Lecture Notes in Computer Science, 2012
This paper presents a comprehensive account of the development of simpler performance models for distributed implementations of the Fast Fourier Transform in 3 Dimensions (FFT3D). We start by providing an overview of several implementations and their performance models. Then, we present arguments to support the use of a simple power function instead of the full performance models proposed by other publications. We argue that our model can be obtained for a particular problem size with minimal experimentation, while other models require significant tuning to determine their constants. Our advocacy for simpler performance models is inspired by the difficulties found when estimating the performance of FFT3D programs. Correctly estimating how well large-scale programs (such as FFT3D) will perform is one of the most challenging problems faced by scientists. The significant effort devoted to this problem has resulted in numerous works on performance modeling. The results produced by an exhaustive performance modeling study may predict the performance of a program with reasonably good accuracy. However, those studies may become unusable because their aim for accuracy can make them so difficult and cumbersome to use that direct experimentation with the program may be preferable, defeating their original purpose. We propose an alternative approach in this paper that does not require a full, accurate performance model. Our approach mitigates a central problem of existing performance models: each of the parameters and constants in a model has to be carefully measured and tuned, a process that is intrinsically harder than direct experimentation with the program at hand. Instead, we were able to simplify our approach by (1) building performance models that target particular applications in their normal operating conditions and (2) using simpler models that still produce good approximations for the particular case of a program's normal operating environment. We have conducted experiments using the Blue Fire supercomputer at the National Center for Atmospheric Research (NCAR), showing that our simplified model can predict the performance of a particular implementation with a high degree of accuracy and very little effort when the program is used in its intended operating range. Finally, although our performance model does not cover extreme cases, we show that its simple approximation under the normal operating conditions of FFT3D is able to provide solid, useful approximations.
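A minimal version of such a power-function model is T(p) = c * p^e, whose two constants can be recovered from a handful of runs by least squares in log-log space. This sketch illustrates the general approach, not the exact model from the paper:

```python
import math

def fit_power_model(procs, times):
    """Fit T(p) = c * p**e to measured (processor count, runtime) pairs
    by ordinary least squares on log T = log c + e * log p."""
    xs = [math.log(p) for p in procs]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    e = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)          # slope = scaling exponent
    c = math.exp(my - e * mx)                     # intercept = base cost
    return c, e
```

A few timing runs at different processor counts suffice to fit the model for one problem size, which is the "minimal experimentation" argument made above.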

2007 IEEE International Parallel and Distributed Processing Symposium, 2007
Automatic library generators, such as ATLAS, Spiral [8] and FFTW, are promising technologies for generating efficient code for different computer architectures. Library generators usually tune programs using two layers of optimization: search at the algorithm level, and optimization of micro kernels. The micro optimizations are important for the performance of a library because the optimized micro kernels are the basis of the algorithm-level search and have a great impact on the overall performance of the generated libraries. A successfully optimized micro kernel requires thorough understanding of the interaction between architectural features and highly optimized code. However, the literature on library generators focuses more on algorithm-level optimization and usually gives only a brief discussion of how kernel codes are generated and tuned. As a result, the optimization of micro kernels is still an art that depends on individual expertise and is insufficiently documented. In this paper, we study the problem of how to generate efficient FFT kernels. We apply a series of micro optimizations, for example memory hierarchy locality enhancements, to several FFT routines, and use hardware counters to observe the interactions between those optimizations and the Intel Pentium 4 and the latest Intel Core 2 architecture. We achieve good speedups and, more importantly, we present methods that can be used to generate high-performance FFT kernels on different architectures.
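For context, the kind of kernel these generators tune can be as small as a radix-2 Cooley-Tukey butterfly loop; the micro optimizations discussed (locality enhancements, twiddle-factor handling, unrolling) are applied to code of roughly this shape. This is a textbook sketch, not a kernel produced by ATLAS, Spiral or FFTW:

```python
import cmath

def fft_radix2(x):
    """Iterative radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    x = list(x)
    # Bit-reversal permutation puts the inputs in butterfly order.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # Butterfly passes of doubling size; this triple loop is the micro
    # kernel that generators unroll and schedule for the target machine.
    size = 2
    while size <= n:
        w = cmath.exp(-2j * cmath.pi / size)
        for s in range(0, n, size):
            wk = 1.0
            for k in range(size // 2):
                u = x[s + k]
                v = x[s + k + size // 2] * wk
                x[s + k] = u + v
                x[s + k + size // 2] = u - v
                wk *= w
        size *= 2
    return x
```

The hardware-counter methodology described above would measure, for instance, how reordering these loops or precomputing the twiddle factors `wk` changes cache and TLB behavior.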

2009 International Conference on Parallel Processing, 2009
This paper reports a study of mapping the Finite Difference Time Domain (FDTD) application to the IBM Cyclops-64 (C64) many-core chip architecture [1]. C64 is chosen for this study as it represents the current trend in computer architecture to develop a class of many-core architectures with distinct features, e.g. a software-managed on-chip memory hierarchy (vs. a hardware-managed data cache), high on-chip bandwidth, and fine-grain multithreading and synchronization, among others. Major results of our study include: 1. A good mapping of FDTD can effectively exploit the on-chip parallelism of C64-like architectures and shows good performance and scalability. 2. Such performance improvement is obtained by employing a number of code optimization techniques, such as time skewing and split tiling, that judiciously exploit the architecture features described in (1). 3. High performance requires maximum reuse of on-chip memory, which is obtained by tiling with non-conventional tile shapes. 4. The code optimization techniques used in (2) and tiling such as that used in (3) should be implementable within a reasonable compilation framework, opening a new set of possibilities for compiler optimizations.

Journal of Physics: Conference Series, 2011
The development of precipitating warm clouds is affected by several effects of small-scale air turbulence, including enhancement of the droplet-droplet collision rate by turbulence, entrainment and mixing at the cloud edges, and coupling of mechanical and thermal energies at various scales. Large-scale computation is a viable research tool for quantifying these multiscale processes. Specifically, top-down large-eddy simulations (LES) of shallow convective clouds typically resolve the scales of turbulent energy-containing eddies, while the effects of the turbulent cascade toward viscous dissipation are parameterized. Bottom-up hybrid direct numerical simulations (HDNS) of cloud microphysical processes fully resolve the dissipation-range flow scales but only partially resolve the inertial subrange scales. It is desirable to systematically decrease the grid length in LES and increase the domain size in HDNS so that they can be better integrated to address the full range of scales and their coupling. In this paper, we discuss computational issues and physical modeling questions in expanding the ranges of scales realizable in LES and HDNS, and in bridging LES and HDNS. We review our ongoing efforts in transforming our simulation codes toward petascale computing, in improving physical representations in LES and HDNS, and in developing better methods to analyze and interpret the simulation results.

Runspace Method, System and Apparatus
Disclosed is a method and system, referred to as a Runspace, in the fields of computing-system control, data processing and data communications, for providing resource-efficient computation for tasks executed across a plurality of processing elements in many possible configurations. The method and system (101) use a metric space representing code and data locality to guide the allocation and movement of code and data, perform analysis to mark code regions so that the runtime has opportunities to improve execution, dispatch compact sections of code near the local memory they access, and provide a low-power, local, secure memory-management system. The Runspace further provides mechanisms for hierarchical allocation, optimization, monitoring and control, supporting recoverable, energy-efficient large-scale computation.
Codeletset Representation, Manipulation, and Execution-Methods, System and Apparatus

ACM Transactions on Architecture and Code Optimization, 2012
Advanced many-core CPU chips already have a few hundred processing cores (e.g., 160 cores in an IBM Cyclops-64 chip), and more and more processing cores become available as computer architecture progresses. The underlying runtime systems of such architectures need to efficiently serve hundreds of processors at the same time, requiring all basic data structures within the runtime to sustain unprecedented throughput. In this paper, we analyze the throughput requirements that must be met by algorithms in runtime systems to be able to handle hundreds of simultaneous operations in real time. We reach a surprising conclusion: many traditional algorithmic techniques are poorly suited for highly parallel computing environments because of their low throughput. We conclude that the intrinsic throughput of a parallel program depends on both its algorithm and the processor architecture where the program runs. We provide theory to quantify the intrinsic throughput of algorithms, an...
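The throughput limit imposed by serialized sections can be captured with a back-of-the-envelope bound: with p cores, each operation doing some cycles of independent work plus some cycles inside a serialized update, sustained throughput cannot exceed the smaller of the parallel rate and the serial rate. The formula and parameter names below are illustrative, not the paper's model:

```python
def max_throughput(p, t_parallel, t_critical):
    """Upper bound on operations per cycle for p cores, where each operation
    spends t_parallel cycles of independent work and t_critical cycles in a
    serialized section (e.g., a lock-protected queue update). No core count
    can push throughput past 1/t_critical."""
    return min(p / (t_parallel + t_critical), 1.0 / t_critical)
```

With 160 cores, 100 cycles of parallel work and a 10-cycle serialized update, throughput saturates at 0.1 operations per cycle once roughly a dozen cores are busy, consistent with the observation that traditional lock-based structures cannot serve hundreds of cores.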
Synchronization for Dynamic Task Parallelism on Manycore Architectures
Toward Efficient Fine-grained Dynamic Scheduling on Many-Core Architectures
Lecture Notes in Computer Science, 2013
Dynamic percolation
Proceedings of the 9th conference on Computing Frontiers - CF '12, 2012