Compiler Optimizations for High Performance Architectures
Abstract
We describe two ongoing compiler projects for high performance architectures at the University of Maryland being developed using the Stanford SUIF compiler infrastructure. First, we are investigating the impact of compilation techniques for eliminating synchronization overhead in compiler-parallelized programs running on software distributed-shared-memory (DSM) systems. Second, we are evaluating data layout transformations to improve cache performance on uniprocessors by eliminating conflict misses through inter- and intra-variable padding. Our optimizations have been implemented in SUIF and tested on a number of programs. Preliminary results are encouraging.
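As a rough illustration of the padding idea (not code from the paper), the sketch below shows inter-variable padding in C. It assumes a 4 KB direct-mapped cache and 4-byte floats, so a[] and b[] map to the same cache sets unless a pad is inserted between them; the pad size and variable names are made up for the example.

/* Illustrative sketch of inter-variable padding (assumed parameters, not
 * the paper's): with a 4 KB direct-mapped cache and 4-byte floats, a[]
 * occupies exactly one cache-sized region, so a[i] and b[i] map to the
 * same set and evict each other on every iteration.  The pad shifts b[]
 * to different sets and removes the conflict misses. */
#define N 1024

float a[N];
float pad[8];   /* inter-variable padding a compiler might insert */
float b[N];

float dot(void)
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += a[i] * b[i];   /* without pad[], each pair of loads conflicts */
    return s;
}

Intra-variable padding is analogous: the inner dimension of an array is widened (for example, declaring a[N][M+1] instead of a[N][M]) so that successive rows no longer map to the same cache sets.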
Related papers
ACM SIGARCH Computer Architecture News
In this issue, we present a selection of papers from several workshops held in September 2001 in Barcelona, Spain. The workshops were hosted within the PACT (Parallel Architecture and Compilation Techniques) Conference [1], [2]. The advances in technology are improving the processing power and the computing speed of systems. As addressed by keynote speakers, the time has never been so propitious to explore the potential of compilers on the architecture and vice versa, due to the strong demand for advances in the interaction of these two areas. The increasing interest is also shown by the record number of attendees this year. This is also due to the high-quality workshops focused on hot topics in the Compiler and Computer Architecture research areas. This year 2001, five different workshops covered hot research themes: the Compilers and Operating Systems for Low Power (COLP) workshop, the European Workshop on OpenMP (EWOMP), the MEmory DEcoupling Architecture workshop (MEDEA), the Ubiquitous Computing and Communication (UCC) workshop, and the Workshop on Binary Translation (WBT). For copyright reasons, we cannot include
Abstract: Technological trends require that future scalable microprocessors be decentralized. Applying these trends toward memory systems shows that the size of the cache accessible in a single cycle will decrease in a future generation of chips. Thus, a bank-exposed memory system comprised of small, decentralized cache banks must eventually replace that of a monolithic cache. This paper considers how to effectively use such a memory system for sequential programs. This paper presents Maps, the software technology central to bank-exposed architectures, which are architectures with bank-exposed memory systems. Maps solves the problem of bank disambiguation: that of determining at compile-time which bank a memory reference is accessing. Bank disambiguation is important because it enables the compile-time optimization for data locality, where data can be placed close to the computation that requires it. Two methods for bank disambiguation are presented: equivalence-class unification and modulo unrolling. Experimental results are presented using a compiler for the MIT Raw machine, a bank-exposed architecture that relies on the compiler to 1) manage its memory and 2) orchestrate its instruction level parallelism and communication. Results on Raw using sequential codes demonstrate that using bank disambiguation improves performance by a factor of 3 to 5 over using ILP alone.
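The sketch below gives a flavor of modulo unrolling, assuming word-level low-order interleaving of a[] across four memory banks; the loop, array, and bank count are hypothetical and are not taken from the Raw compiler.

/* Hypothetical sketch of modulo unrolling (assumed interleaving and bank
 * count): if a[] is low-order interleaved so that element i lives in bank
 * i % 4, the reference a[i] in the original loop could touch any bank.
 * Unrolling the loop by the number of banks makes each static reference
 * fall in a fixed bank, so it can be disambiguated at compile time and
 * routed to that bank's memory port. */
#define N 1024
int a[N];   /* element a[i] assumed to live in bank i % 4 */

void scale(void)
{
    for (int i = 0; i < N; i += 4) {
        a[i + 0] *= 2;   /* always bank 0 */
        a[i + 1] *= 2;   /* always bank 1 */
        a[i + 2] *= 2;   /* always bank 2 */
        a[i + 3] *= 2;   /* always bank 3 */
    }
}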
Proceedings. 1998 International Conference on Parallel Processing (Cat. No.98EX205), 1998
To execute a shared memory program efficiently, we have to manage memory consistency with low overhead and utilize the communication bandwidth of the platform as much as possible. A software distributed shared memory (DSM) can solve these problems with proper support from an optimizing compiler. The optimizing compiler can detect shared write operations using interprocedural points-to analysis. It also coalesces shared write commitments onto contiguous regions and removes redundant write commitments using interprocedural redundancy elimination. A page-based target software DSM system can utilize communication bandwidth owing to this coalescing optimization. We have implemented the above optimizing compiler and a runtime software DSM on the AP1000+. We have obtained a high speed-up ratio with the SPLASH-2 benchmark suite. The result shows that using an optimizing compiler to assist a software DSM is a promising approach to obtaining good performance. It also shows that appropriate protocol selection at a write commitment is an effective optimization.
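A minimal sketch of what coalescing write commitments could look like at the runtime level is shown below; the range_t type and coalesce() function are assumptions made for illustration, not the interface of the paper's DSM system.

/* Illustrative sketch (assumed types and names): the compiler records the
 * byte ranges a processor wrote within a shared page, and the runtime
 * merges adjacent or overlapping ranges so that one contiguous region is
 * committed instead of many small ones. */
#include <stdlib.h>

typedef struct { size_t start, end; } range_t;   /* half-open range [start, end) */

static int cmp_start(const void *x, const void *y)
{
    const range_t *a = x, *b = y;
    return (a->start > b->start) - (a->start < b->start);
}

/* Sort the ranges, merge them in place, and return the new count. */
size_t coalesce(range_t *r, size_t n)
{
    if (n == 0)
        return 0;
    qsort(r, n, sizeof *r, cmp_start);
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (r[i].start <= r[out].end) {          /* adjacent or overlapping */
            if (r[i].end > r[out].end)
                r[out].end = r[i].end;
        } else {
            r[++out] = r[i];
        }
    }
    return out + 1;
}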
2007
Abstract: Experience with commercial and research high-performance architectures has indicated that the compiler plays an increasingly important role in real application performance. In particular, the difficulty in programming some of the so-called "hardware first" machines underscores the need for integrating architecture design and compilation strategy. In addition, architectures featuring novel hardware optimizations require compilers that can take advantage of them in order to be commercially viable.
2003
Traditional compiler analyses and back-end optimizations, which play an important role in generating efficient code for modern high-performance processors, are quite mature, well understood, and have been widely used in production compilers. However, recent advances in high-performance (general purpose) processor architecture, the emergence of novel architectural paradigms, the emphasis on application-specific processors and embedded systems, and the increasing trend toward compiling applications directly onto silicon present several interesting challenges and opportunities in high performance compilation techniques. In this paper we discuss the trends that are emerging to meet the above challenges. In particular, we discuss recent advances in data flow analyses, compiling techniques for embedded and DSP processors, and compiling techniques that reduce power consumption.
Instruction Level Parallelism (ILP) is not a new idea. Unfortunately, ILP architectures are not well suited to all conventional high-level language compilers and compiler optimization techniques. Instruction Level Parallelism is the technique that allows a sequence of instructions derived from a sequential program (without rewriting) to be parallelized for execution on multiple pipelined functional units. As a result, performance is increased when running existing software. At the implicit level this is achieved by modifying the compiler, and at the explicit level it is done by exploiting the parallelism available in the hardware. To achieve a high degree of instruction-level parallelism, it is necessary to analyze and evaluate techniques for speculative execution and control dependence analysis, and to follow multiple flows of control. Researchers are continuously discovering ways to increase parallelism by an order of magnitude beyond current approaches. In this paper we present the impact of control flow support on highly parallel architectures with 2 cores and 4 cores. We also investigate the scope of parallelism both explicitly and implicitly. For our experiments we used the Trimaran simulator. The benchmarks are tested on abstract machine models created through the Trimaran simulator.
1998
Data parallel languages like High Performance Fortran (HPF) are emerging as the architecture-independent mode of programming distributed memory parallel machines. In this paper, we present the interprocedural optimizations required for compiling applications with irregular data access patterns when coded in such data parallel languages.
ACM SIGPLAN Notices, 1988
The partitioning of shared memory into a number of memory modules is an approach to achieve high memory bandwidth for parallel processors. Memory access conflicts can occur when several processors simultaneously request data from the same memory module. Although work has been done to improve access performance for vectors, no work has been reported to improve the access performance of scalars. For systems in which the processors operate in a lock-step mode, a large percentage of memory access conflicts can be predicted at compile-time. These conflicts can be avoided by appropriate distribution of data among the memory modules at compile-time. A long instruction word machine is an example of a system in which the functional units operate in a lock-step mode performing operations on data fetched in parallel from multiple memory modules. In this paper, compile-time techniques for distribution of scalars to avoid memory access conflicts are presented. Furthermore, algorithms to schedule...
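One plausible shape for such a compile-time technique is sketched below as a greedy coloring that spreads conflicting scalars across memory modules; the data structures, module count, and greedy policy are assumptions made for illustration, not the algorithms presented in the paper.

/* Illustrative sketch (assumed data structures and policy): scalars that
 * are fetched in the same long instruction word conflict if they live in
 * the same memory module, so each scalar is assigned the first module not
 * already used by a conflicting, earlier-assigned scalar. */
#define NSCALARS 6
#define NMODULES 4

int conflict[NSCALARS][NSCALARS];   /* conflict[i][j] != 0: i and j fetched in the same cycle */
int module_of[NSCALARS];

void assign_modules(void)
{
    for (int i = 0; i < NSCALARS; i++) {
        int used[NMODULES] = {0};
        for (int j = 0; j < i; j++)
            if (conflict[i][j])
                used[module_of[j]] = 1;   /* module claimed by a conflicting scalar */
        int m = 0;
        while (m < NMODULES - 1 && used[m])
            m++;                          /* first free module, else fall back to the last */
        module_of[i] = m;
    }
}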
Microprocessing and Microprogramming, 1995
The problem of automatically generating programs for massively parallel computers is a very complicated one, mainly because there are many architectures, each of them seeming to pose its own particular compilation problem. The purpose of this paper is to propose a framework in which to discuss the compilation process, and to show that the features which affect it are few and generate a small number of combinations. The paper is oriented toward fine-grained parallelization of static control programs, with emphasis on data flow analysis, scheduling and placement. When going from there to more general programs and to coarser parallelism, one encounters new problems, some of which are discussed in the conclusion.
1994
Multithreaded node architectures have been proposed for future multiprocessor systems. However, some open issues remain: can efficient multithreading support be provided in a multiprocessor machine such that it is capable of tolerating the synchronization and communication latencies, without intruding on the performance of sequentially-executed code? And how much (quantitatively) does such non-intrusive multithreading support contribute to the scalable parallel performance in the presence of increasing ...