Compiler Optimizations for High Performance Architectures
Abstract
We describe two ongoing compiler projects for high performance architectures at the University of Maryland being developed using the Stanford SUIF compiler infrastructure. First, we are investigating the impact of compilation techniques for eliminating synchronization overhead in compiler-parallelized programs running on software distributed-shared-memory (DSM) systems. Second, we are evaluating data layout transformations to improve cache performance on uniprocessors by eliminating conflict misses through inter- and intra-variable padding. Our optimizations have been implemented in SUIF and tested on a number of programs. Preliminary results are encouraging.
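As a rough illustration of the padding idea (not code from the paper), the sketch below shows inter-variable padding in C. It assumes a 4 KB direct-mapped cache and 4-byte floats, so a[] and b[] map to the same cache sets unless a pad is inserted between them; the pad size and variable names are made up for the example.

/* Illustrative sketch of inter-variable padding (assumed parameters, not
 * the paper's): with a 4 KB direct-mapped cache and 4-byte floats, a[]
 * occupies exactly one cache-sized region, so a[i] and b[i] map to the
 * same set and evict each other on every iteration.  The pad shifts b[]
 * to different sets and removes the conflict misses. */
#define N 1024

float a[N];
float pad[8];   /* inter-variable padding a compiler might insert */
float b[N];

float dot(void)
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += a[i] * b[i];   /* without pad[], each pair of loads conflicts */
    return s;
}

Intra-variable padding is analogous: the inner dimension of an array is widened (for example, declaring a[N][M+1] instead of a[N][M]) so that successive rows no longer map to the same cache sets.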
Related papers
ACM SIGARCH Computer Architecture News
In this issue, we present a selection of papers from several workshops held in September 2001 in Barcelona, Spain. The workshops were hosted within the PACT (Parallel Architecture and Compilation Techniques) Conference [1], [2]. The advances in technology are improving the processing power and the computing speed of systems. As addressed by keynote speakers, the time has never been so propitious to explore the potential of compilers on the architecture and vice versa, due to the strong demand for advances in the interaction of these two areas. The increasing interest is also shown by the record number of attendees this year. This is also due to the high-quality workshops focused on hot topics in the Compiler and Computer Architecture research areas. This year 2001, five different workshops covered hot research themes: the Compilers and Operating Systems for Low Power (COLP) workshop, the European Workshop on OpenMP (EWOMP), the MEmory DEcoupling Architecture workshop (MEDEA), the Ubiquitous Computing and Communication (UCC) workshop, and the Workshop on Binary Translation (WBT). For copyright reasons, we cannot include
Abstract: Technological trends require that future scalable microprocessors be decentralized. Applying these trends toward memory systems shows that the size of the cache accessible in a single cycle will decrease in a future generation of chips. Thus, a bank-exposed memory system comprised of small, decentralized cache banks must eventually replace that of a monolithic cache. This paper considers how to effectively use such a memory system for sequential programs. This paper presents Maps, the software technology central to bank-exposed architectures, which are architectures with bank-exposed memory systems. Maps solves the problem of bank disambiguation: that of determining at compile-time which bank a memory reference is accessing. Bank disambiguation is important because it enables the compile-time optimization for data locality, where data can be placed close to the computation that requires it. Two methods for bank disambiguation are presented: equivalence-class unification and modulo unrolling. Experimental results are presented using a compiler for the MIT Raw machine, a bank-exposed architecture that relies on the compiler to 1) manage its memory and 2) orchestrate its instruction level parallelism and communication. Results on Raw using sequential codes demonstrate that using bank disambiguation improves performance by a factor of 3 to 5 over using ILP alone.
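The sketch below gives a flavor of modulo unrolling, assuming word-level low-order interleaving of a[] across four memory banks; the loop, array, and bank count are hypothetical and are not taken from the Raw compiler.

/* Hypothetical sketch of modulo unrolling (assumed interleaving and bank
 * count): if a[] is low-order interleaved so that element i lives in bank
 * i % 4, the reference a[i] in the original loop could touch any bank.
 * Unrolling the loop by the number of banks makes each static reference
 * fall in a fixed bank, so it can be disambiguated at compile time and
 * routed to that bank's memory port. */
#define N 1024
int a[N];   /* element a[i] assumed to live in bank i % 4 */

void scale(void)
{
    for (int i = 0; i < N; i += 4) {
        a[i + 0] *= 2;   /* always bank 0 */
        a[i + 1] *= 2;   /* always bank 1 */
        a[i + 2] *= 2;   /* always bank 2 */
        a[i + 3] *= 2;   /* always bank 3 */
    }
}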
Proceedings. 1998 International Conference on Parallel Processing (Cat. No.98EX205), 1998
To execute a shared memory program efficiently, we have to manage memory consistency with low overhead and utilize the communication bandwidth of the platform as much as possible. A software distributed shared memory (DSM) can solve these problems with proper support from an optimizing compiler. The optimizing compiler can detect shared write operations using interprocedural points-to analysis. It also coalesces shared write commitments onto contiguous regions and removes redundant write commitments using interprocedural redundancy elimination. A page-based target software DSM system can utilize communication bandwidth owing to this coalescing optimization. We have implemented the above optimizing compiler and a runtime software DSM on the AP1000+. We have obtained a high speed-up ratio with the SPLASH-2 benchmark suite. The result shows that using an optimizing compiler to assist a software DSM is a promising approach to obtaining good performance. It also shows that appropriate protocol selection at a write commitment is an effective optimization.
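A minimal sketch of what coalescing write commitments could look like at the runtime level is shown below; the range_t type and coalesce() function are assumptions made for illustration, not the interface of the paper's DSM system.

/* Illustrative sketch (assumed types and names): the compiler records the
 * byte ranges a processor wrote within a shared page, and the runtime
 * merges adjacent or overlapping ranges so that one contiguous region is
 * committed instead of many small ones. */
#include <stdlib.h>

typedef struct { size_t start, end; } range_t;   /* half-open range [start, end) */

static int cmp_start(const void *x, const void *y)
{
    const range_t *a = x, *b = y;
    return (a->start > b->start) - (a->start < b->start);
}

/* Sort the ranges, merge them in place, and return the new count. */
size_t coalesce(range_t *r, size_t n)
{
    if (n == 0)
        return 0;
    qsort(r, n, sizeof *r, cmp_start);
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (r[i].start <= r[out].end) {          /* adjacent or overlapping */
            if (r[i].end > r[out].end)
                r[out].end = r[i].end;
        } else {
            r[++out] = r[i];
        }
    }
    return out + 1;
}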
2007
Abstract: Experience with commercial and research high-performance architectures has indicated that the compiler plays an increasingly important role in real application performance. In particular, the difficulty in programming some of the so-called "hardware first" machines underscores the need for integrating architecture design and compilation strategy. In addition, architectures featuring novel hardware optimizations require compilers that can take advantage of them in order to be commercially viable.
2003
Traditional compiler analyses and back-end optimizations, which play an important role in generating efficient code for modern high-performance processors, are quite mature, well understood, and have been widely used in production compilers. However, recent advances in high-performance (general purpose) processor architecture, the emergence of novel architectural paradigms, the emphasis on application-specific processors and embedded systems, and the increasing trend toward compiling applications directly onto silicon present several interesting challenges and opportunities in high performance compilation techniques. In this paper we discuss the trends that are emerging to meet the above challenges. In particular, we discuss recent advances in data flow analyses, compiling techniques for embedded and DSP processors, and compiling techniques that reduce power consumption.
Instruction Level Parallelism (ILP) is not a new idea. Unfortunately, ILP architectures are not well suited to all conventional high-level language compilers and compiler optimization techniques. Instruction Level Parallelism is the technique that allows a sequence of instructions derived from a sequential program (without rewriting) to be parallelized for execution on multiple pipelined functional units. As a result, performance is increased when running existing software. At the implicit level this is achieved by modifying the compiler, and at the explicit level it is done by exploiting the parallelism available in the hardware. To achieve a high degree of instruction-level parallelism, it is necessary to analyze and evaluate techniques for speculative execution and control dependence analysis, and to follow multiple flows of control. Researchers are continuously discovering ways to increase parallelism by an order of magnitude beyond current approaches. In this paper we present the impact of control flow support on highly parallel architectures with 2 cores and 4 cores. We also investigate the scope of parallelism both explicitly and implicitly. For our experiments we used the Trimaran simulator. The benchmarks are tested on abstract machine models created through the Trimaran simulator.
1998
Data parallel languages like High Performance Fortran (HPF) are emerging as the architecture-independent mode of programming distributed memory parallel machines. In this paper, we present the interprocedural optimizations required for compiling applications with irregular data access patterns when coded in such data parallel languages.
ACM SIGPLAN Notices, 1988
The partitioning of shared memory into a number of memory modules is an approach to achieve high memory bandwidth for parallel processors. Memory access conflicts can occur when several processors simultaneously request data from the same memory module. Although work has been done to improve access performance for vectors, no work has been reported to improve the access performance of scalars. For systems in which the processors operate in a lock-step mode, a large percentage of memory access conflicts can be predicted at compile-time. These conflicts can be avoided by appropriate distribution of data among the memory modules at compile-time. A long instruction word machine is an example of a system in which the functional units operate in a lock-step mode performing operations on data fetched in parallel from multiple memory modules. In this paper, compile-time techniques for distribution of scalars to avoid memory access conflicts are presented. Furthermore, algorithms to schedule...
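One plausible shape for such a compile-time technique is sketched below as a greedy coloring that spreads conflicting scalars across memory modules; the data structures, module count, and greedy policy are assumptions made for illustration, not the algorithms presented in the paper.

/* Illustrative sketch (assumed data structures and policy): scalars that
 * are fetched in the same long instruction word conflict if they live in
 * the same memory module, so each scalar is assigned the first module not
 * already used by a conflicting, earlier-assigned scalar. */
#define NSCALARS 6
#define NMODULES 4

int conflict[NSCALARS][NSCALARS];   /* conflict[i][j] != 0: i and j fetched in the same cycle */
int module_of[NSCALARS];

void assign_modules(void)
{
    for (int i = 0; i < NSCALARS; i++) {
        int used[NMODULES] = {0};
        for (int j = 0; j < i; j++)
            if (conflict[i][j])
                used[module_of[j]] = 1;   /* module claimed by a conflicting scalar */
        int m = 0;
        while (m < NMODULES - 1 && used[m])
            m++;                          /* first free module, else fall back to the last */
        module_of[i] = m;
    }
}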
Microprocessing and Microprogramming, 1995
The problem of automatically generating programs for massively parallel computers is a very complicated one, mainly because there are many architectures, each of them seeming to pose its own particular compilation problem. The purpose of this paper is to propose a framework in which to discuss the compilation process, and to show that the features which affect it are few and generate a small number of combinations. The paper is oriented toward fine-grained parallelization of static control programs, with emphasis on data flow analysis, scheduling and placement. When going from there to more general programs and to coarser parallelism, one encounters new problems, some of which are discussed in the conclusion.
1994
Multithreaded node architectures have been proposed for future multiprocessor systems. However, some open issues remain: can efficient multithreading support be provided in a multiprocessor machine such that it is capable of tolerating the synchronization and communication latencies, without intruding on the performance of sequentially-executed code? And how much (quantitatively) does such non-intrusive multithreading support contribute to the scalable parallel performance in the presence of increasing ...