Efficiency in vector handling is the key to obtaining high performance in numerical programs. So far, the main defect of dataflow computers has been inefficiency in vector processing. We propose structure-flow processing as a new scheme for handling data structures such as vectors in a dataflow architecture. The main objective of structure-flow processing is to enhance the vector processing performance of a dataflow computer. In this scheme, the arrival of a data structure unrolls the control structure that processes the data structure itself. A high-level structure is the implementation mechanism of the structure-flow scheme on a practical dataflow computer. Since all computation is executed by an instruction-level dataflow architecture, scalar-level and function-level parallelism are also fully utilized by this scheme. The SIGMA-1 architecture that supports high-level structure processing is discussed and its performance is measured. According to the measurements, vector programs can be executed three to four times faster than by unfolding with scalar dataflow processing.
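A minimal sketch of the unrolling idea (our illustration, not the SIGMA-1 mechanism): the arrival of a whole vector token spawns one element-level activation per element, so vector work flows through the same instruction-level scheduler as scalar work. All names here are hypothetical.

```python
# Toy illustration of structure-flow unrolling; not the SIGMA-1 implementation.
from concurrent.futures import ThreadPoolExecutor

def on_structure_arrival(vector, op, pool):
    """An arriving vector 'structure token' unrolls into per-element activations."""
    futures = [pool.submit(op, x) for x in vector]  # one dataflow activation per element
    return [f.result() for f in futures]            # gathered back into a result structure

with ThreadPoolExecutor() as pool:
    print(on_structure_arrival([1.0, 2.0, 3.0], lambda x: 2 * x, pool))  # [2.0, 4.0, 6.0]
```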
Proceedings of the 7th International Symposium on Parallel Architectures, Algorithms and Networks, 2004
This manuscript was written for ISPAN03. In this paper, we point out a problem with the congestion control mechanism of TCP on Long Fat pipe Networks (LFNs) through precise analysis using our handmade tools, and we propose a "congestion control algorithm for parallel streams" and a "packet spacing algorithm in slow start phase" for LFNs. This paper presents (1) observation of TCP/IP multi-stream data transfer across 7,500 miles, (2) analysis tools for the congestion problem, i.e., a packet logging tool and a pseudo-LFN emulator, (3) pseudo-LFN experiments, and (4) modification of the congestion control algorithm: "balancing parallel streams with TCP fairness and TCP compatibility" and "packet spacing".
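To make the packet-spacing idea concrete, here is a minimal sketch under the common assumption that pacing spreads each congestion window evenly over one RTT instead of sending it back-to-back; the paper's actual slow-start spacing algorithm may differ in detail.

```python
# Sketch of packet spacing ("pacing") during slow start; illustrative only.
def pacing_interval(cwnd_packets, rtt_s):
    """Inter-packet gap that spreads cwnd packets evenly across one RTT."""
    return rtt_s / cwnd_packets

rtt = 0.350  # 350 ms trans-Pacific RTT, as in the paper's experiments
for cwnd in (2, 4, 8, 16):  # exponential window growth in slow start
    print(f"cwnd={cwnd:3d} packets -> gap={pacing_interval(cwnd, rtt) * 1e3:.1f} ms")
```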
Proceedings of the 2004 workshop on Computer architecture education held in conjunction with the 31st International Symposium on Computer Architecture - WCAE '04, 2004
In this paper, we present the new curriculum of the processor laboratory of the Department of Computer Science at the University of Tokyo. This laboratory is part of the computer architecture education curriculum. In it, students design and implement their own processors using field-programmable gate arrays (FPGAs) and write the necessary software. In 2003, the curriculum of the laboratory was changed, the main change being a switch to a larger FPGA to widen the range of design trade-offs. As a result, students are now able to implement techniques used in modern processors, such as an FPU, caches, branch prediction, and superscalar architecture. In this paper, we detail the new curriculum and report the educational results of the year following the changes. In particular, we focus on the educational advantages of the large FPGA.
To execute shared-memory-based parallel programs efficiently, we introduce two compiler-assisted software cache schemes which are well suited to automatic optimization of remote communications. One scheme is a full user-level software cache (User-level Distributed Shared Memory: UDSM) and the other is a page-based cache (Asymmetric Distributed Shared Memory: ADSM) which exploits the TLB/MMU only on read-access misses. Under these schemes we can apply several optimizing techniques, which exploit the capabilities of middle-grained or coarse-grained remote memory accesses, to reduce the number and volume of communications. We also introduce a high-speed user-level communication and synchronization scheme, "Memory-Based Communication Facilities (MBCF)", for providing these capabilities in a general-purpose system with off-the-shelf communication hardware. In this paper, we outline our approach, the UDSM and the ADSM, the MBCF, and optimizing techniques for remote communications. Finally, we present experimental results on the effects of our approach, using our prototype optimizing compiler "Remote Communication Optimizer (RCOP)" and the MBCF on Fast Ethernet.
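The UDSM flavor can be pictured as a compiler-inserted miss check before each shared access, serviced entirely at user level without a page-fault trap. The sketch below is illustrative; `fetch_remote` and the line size are assumptions, not the paper's API.

```python
# Hedged sketch of a user-level software cache in the spirit of UDSM.
LINE = 64  # assumed cache-line granularity

class SoftwareCache:
    def __init__(self, fetch_remote):
        self.lines = {}                  # line base address -> bytes
        self.fetch_remote = fetch_remote # user-level remote read (hypothetical)

    def read(self, addr):
        base = addr - addr % LINE
        if base not in self.lines:       # the instrumented "miss check"
            self.lines[base] = self.fetch_remote(base, LINE)
        return self.lines[base][addr - base]

cache = SoftwareCache(lambda base, n: bytes(n))  # stub remote fetch
print(cache.read(130))                           # miss fills the line, then reads one byte
```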
Proceedings of the 1982 ACM symposium on LISP and functional programming - LFP '82, 1982
Design of a 10 MIPS Lisp machine used for symbolic algebra is presented. Besides incorporating the hardware mechanisms which greatly speed up primitive Lisp operations, the machine is equipped with parallel hashing hardware for content-addressed associative tabulation and a …
Proceedings of the 12th international conference on Supercomputing - ICS '98, 1998
Binary compatibility, efficient utilization of parallelism, and scalability are three fundamental requirements for practical, general-purpose computation. In this paper, we propose a new execution model that satisfies these three requirements, the "Speculative SPMD model with duplicated execution (SD-SPMD model)". The SD-SPMD model exploits parallelism from a sequential program at run time and reduces inter-processor communication through duplicated execution. We describe two examples of parallel processing using the SD-SPMD model: a processor architecture for exploiting high-level parallelism at the architectural level, and a parallelizing Java Virtual Machine that exploits parallelism existing in Java bytecode. In both examples, execution on a parallel computer is based on the SD-SPMD execution model, in which each processing element executes a speculative sequential thread that communicates with other processing elements only through speculative memory accesses. In order to eliminate other kinds of inter-processor communication, such as register copying and direct signaling, part of the execution of each thread is duplicated on all the processing elements. Transformation of a sequential binary program into SD-SPMD threads is performed either (1) at run time, by analyzing control and data dependencies among instructions, including data dependencies among memory access instructions, or (2) by a bytecode interpreter that analyzes Java bytecode together with execution history. We have evaluated the performance of an on-chip MIMD processor based on the SD-SPMD model by simulation.
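A rough sketch of duplicated execution under the stated model (our simplification, not the authors' run-time transformer): every processing element redundantly executes the sequential part, so its results never need to be communicated, and only the parallel slice is partitioned among PEs.

```python
# Toy illustration of SPMD with duplicated execution; names are hypothetical.
def pe_body(pe_id, n_pes, data):
    scale = sum(data) / len(data)  # sequential prefix, duplicated on every PE
                                   # (no broadcast/register copy is needed)
    my_slice = data[pe_id::n_pes]  # parallel part: each PE works on its slice
    return [x / scale for x in my_slice]

results = [pe_body(i, 4, list(range(16))) for i in range(4)]  # 4 "PEs"
```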
Proceedings of the 12th international conference on Supercomputing, 1998
We introduce a novel high-speed user-level communication and synchronization scheme, "Memory-Based Communication Facilities (MBCF)", for a general-purpose system with off-the-shelf communication hardware. This mechanism is protected and virtualized as completely as memory, so that it can be used not only in parallel systems but also in distributed systems. The MBCF realizes direct remote memory accesses in user task spaces and offers programmers and compilers a wide variety of functions and a large shared-memory space. In this paper we first explain the outline and features of the MBCF and present high-speed implementation techniques for it. We then describe the variety of MBCF functions and introduce two novel memory-based mechanisms: the Memory-Based FIFO and the Memory-Based Signal. Next we show that the MBCF is more flexible and less expensive than message-passing-style system interfaces for communication or generalized active messages. Finally we show performance evaluations of the real MBCF implementation using 100BASE-TX Ethernet.
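The Memory-Based FIFO can be pictured as a ring buffer living in ordinary user memory, where an enqueue is just a (remote) memory write plus a tail-pointer update, with no system call on the fast path. This is a schematic sketch; the real MBCF performs these operations on remote task address spaces over the network.

```python
# Illustrative single-producer/single-consumer FIFO in plain user memory.
class MemoryBasedFIFO:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0  # consumer-owned index
        self.tail = 0  # producer-owned index

    def enqueue(self, item):               # done by the (remote) writer
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False                   # full: sender retries or backs off
        self.buf[self.tail] = item         # data write ...
        self.tail = nxt                    # ... then tail-pointer update
        return True

    def dequeue(self):                     # done by the local consumer
        if self.head == self.tail:
            return None                    # empty
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return item
```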
A massively parallel processor called JUMP-1 has been developed to build an efficient cache-coherent distributed shared memory (DSM) on a large system with more than 1000 processors. Here, the dedicated processor called MBP (Memory-Based Processor)-light, which manages the DSM of JUMP-1, is introduced, and its preliminary performance with two protocol policies, update and invalidate, is evaluated. Simulation results show that, under both policies, simple operations such as tag checks and the collection/generation of acknowledgment packets are mostly handled by the hardware mechanisms in MBP-light without the aid of the core processor. Also, the buffer-register architecture adopted by the core processor in MBP-light is fully exploited in processing a protocol transaction under both policies.
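The two policies compared can be sketched as follows (a generic directory-protocol illustration, not MBP-light's packet handling): on a write, an update protocol pushes the new value to every sharer, while an invalidate protocol drops their copies.

```python
# Generic sketch of update vs. invalidate on a directory-based DSM.
def handle_write(directory, caches, addr, value, policy):
    sharers = directory.get(addr, set())
    for node in sharers:
        if policy == "update":
            caches[node][addr] = value    # push fresh data to all sharers
        else:  # "invalidate"
            caches[node].pop(addr, None)  # drop stale copies
    if policy == "invalidate":
        directory[addr] = set()           # writer becomes the sole owner

directory = {0x40: {1, 2}}
caches = {1: {0x40: 7}, 2: {0x40: 7}}
handle_write(directory, caches, 0x40, 8, "invalidate")  # both copies dropped
```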
International Symposium on Code Generation and Optimization
Many excepting instructions cannot be removed by existing Partial Redundancy Elimination (PRE) algorithms because ordering constraints, which we call exception dependencies, must be preserved between the excepting instructions. In this work, we propose Sentinel PRE, a PRE algorithm that overcomes exception dependencies while retaining program semantics. Sentinel PRE first hoists excepting instructions without considering exception dependencies, and then detects exception reordering by a fast analysis. If an exception occurs at a reordered instruction, the code is deoptimized to its pre-hoisting form. Since exceptions are rare in real programs, the optimized code is executed in almost all cases. We implemented Sentinel PRE in a Java just-in-time compiler and conducted experiments. The results show a 9.0% performance improvement on the LU program in the Java Grande Forum Benchmark Suite. (Running example from the original figure: (a) 1 nullcheck a; 2 x:=a.field1; 3 nullcheck a; 4 y:=a.field1 — (b) after PRE: 1 nullcheck a; 2 t:=a.field1; x:=t; 4 y:=t; with hoisted 5 nullcheck a; 6 t:=a.field1.)
Proceedings of the 1997 International Symposium on Parallel Architectures, Algorithms and Networks (I-SPAN'97)
We have proposed an "Asymmetric Distributed Shared Memory (ADSM)" that provides users with an efficient shared-memory model. The ADSM is a hybrid system that requires both operating system support and compiler support. The ADSM executes a load instruction as a shared read with the assistance of the virtual memory mechanism. As for shared writes, the ADSM executes a sequence of instructions for consistency management after the corresponding store instruction. We describe an algorithm to reduce the overhead of consistency management. The algorithm coalesces a sequence of consistency-management instructions using information about affine memory accesses. The coalescing algorithm is evaluated using the SPLASH-2 benchmarks. The performance evaluation shows that the coalescing algorithm achieves execution-time improvements over the unoptimized result ranging from 76% to 85%.
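The coalescing idea can be illustrated as follows: for an affine store pattern, the per-store consistency bookkeeping collapses into a single range operation covering the whole loop. The helper below is a hypothetical sketch, shown for a unit-stride loop where the range is exact (for larger strides a single range would be a conservative over-approximation).

```python
# Illustrative coalescing of consistency management for an affine store loop
# a[i*stride + base]; one range op replaces trip_count per-store updates.
def coalesced_dirty_range(base, stride, trip_count, elem_size):
    """Return one (start, length) dirty region for an affine store loop."""
    first = base
    last = base + (trip_count - 1) * stride * elem_size
    return (min(first, last), abs(last - first) + elem_size)

print(coalesced_dirty_range(base=0x1000, stride=1, trip_count=256, elem_size=8))
```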
Proceedings of the 14th international conference on Supercomputing, 2000
To execute shared-memory parallel programs efficiently on distributed-memory systems without remote-caching hardware mechanisms, software-caching mechanisms must be used. We have proposed two compiler-assisted software-caching schemes. One is a page-based system (Asymmetric Distributed Shared Memory: ADSM) that uses virtual memory mechanisms only for cache misses. The other is a segment-based system (User-level Distributed Shared Memory: UDSM) that uses user-level instrumentation code to maintain software-cache coherence. The purpose of this paper is to investigate the performance trade-offs between the page-based scheme and the segment-based scheme by running fully optimized real applications. Our optimizing compiler directly analyzes the shared-memory source programs and optimizes them. Along with lazy release consistency, it exploits the capabilities of middle-grained or coarse-grained remote memory accesses to reduce the volume of communications and the overhead of the cache-management routines. It performs interprocedural points-to analysis, interprocedural redundancy elimination, and loop-level optimizations such as coalescing. In ADSM, shared writes are supported by the compiler, while in UDSM not only shared writes but also shared reads are supported by the optimizing compiler. We implemented this optimizing compiler for both ADSM and UDSM, together with a run-time system for user-level cache emulation. The run-time system runs on an SS20 workstation cluster connected with a 100BASE-TX Ethernet. We quantitatively compare UDSM with ADSM using nine benchmarks from SPLASH-2. The experimental results clearly show that the performance of the ADSM scheme is limited by the communication of unnecessary data.
We introduce a software/hardware scheme called the Field Array Compression Technique (FACT) which reduces cache misses due to recursive data structures. Using a data layout transformation, data with temporal affinity are gathered in contiguous memory, where the recursive pointers and integer fields are compressed. As a result, one cache block can capture a greater amount of data with temporal affinity, especially pointers, improving the prefetching effect of a cache block. In addition, the compression enlarges the effective cache capacity. On a suite of pointer-intensive programs, FACT achieves a 41.6% reduction in memory stall time and a 37.4% speedup on average.
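A hedged sketch of the layout transformation: fields of a recursive structure are gathered into per-field arrays, and child pointers become compact indices, so far more nodes fit in a cache block than with full machine addresses. The class below is illustrative, not FACT's actual encoding.

```python
# Illustrative "field array" layout for a linked list; pointers become indices.
class FieldArrayList:
    def __init__(self):
        self.value = []  # one contiguous array per field
        self.next = []   # child "pointer" stored as a compact index, -1 = null

    def cons(self, v, next_index=-1):
        self.value.append(v)
        self.next.append(next_index)
        return len(self.value) - 1  # the compressed "pointer" to the new node

lst = FieldArrayList()
head = lst.cons(1, lst.cons(2, lst.cons(3)))  # builds the list 1 -> 2 -> 3
```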
Proceedings of the 18th annual international conference on Supercomputing - ICS '04, 2004
We propose a novel replacement algorithm, called Inter-Reference Gap Distribution Replacement (IGDR), for set-associative secondary caches of processors. IGDR attaches a weight to each memory block, and on a replacement request it selects the memory block with the smallest weight for eviction. The time difference between successive references to a memory block is called its Inter-Reference Gap (IRG). IGDR estimates the ideal weight of a memory block using the reciprocal of its IRG. To estimate this reciprocal, each memory block is assumed to have its own probability distribution of IRGs, from which IGDR calculates the expected value of the reciprocal of the IRG to use as the weight of the memory block. In the implementation, IGDR does not have the probability distribution; instead it records IRG distribution statistics at run time. IGDR classifies memory blocks and records statistics for each class. It is shown that the IRG distributions of memory blocks correlate with their reference counts, which enables classifying memory blocks by reference count. IGDR is evaluated through execution-driven simulation. For ten of the SPEC CPU2000 programs, IGDR achieves up to 46.1% (on average 19.8%) miss reduction and up to 48.9% (on average 12.9%) speedup over the LRU algorithm.
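The weight computation can be sketched directly from the abstract: record an IRG histogram per block class at run time, use the empirical E[1/IRG] as the weight, and evict the block with the smallest weight. A minimal sketch with hypothetical names:

```python
# Sketch of IGDR's weight estimation from run-time IRG statistics.
from collections import defaultdict

irg_hist = defaultdict(lambda: defaultdict(int))  # class -> {irg: count}

def record(block_class, irg):
    irg_hist[block_class][irg] += 1

def weight(block_class):
    """Empirical E[1/IRG] for a class; larger means referenced again sooner."""
    hist = irg_hist[block_class]
    total = sum(hist.values())
    if total == 0:
        return 0.0
    return sum(cnt / gap for gap, cnt in hist.items()) / total

def victim(candidates):
    """candidates: iterable of (block, block_class); evict the smallest weight."""
    return min(candidates, key=lambda bc: weight(bc[1]))[0]
```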
We introduce a software/hardware scheme called the Field Array Compression Technique (FACT) which reduces cache misses caused by recursive data structures. Using a data layout transformation, data with temporal affinity are gathered in contiguous memory, where recursive pointer and integer fields are compressed. As a result, one cache block can capture a greater amount of data with temporal affinity, especially pointers, thereby improving the prefetching effect. In addition, the compression enlarges the effective cache capacity. On a suite of pointer-intensive programs, FACT achieves a 41.6% average reduction in memory stall time and a 37.4% average increase in speed.
Power consumption has become an important factor in the design of high-performance computer systems. The power consumption of newer systems is now published, but it is unknown for many older systems. Data from only two or three generations of systems are insufficient for projecting the performance/power of future systems. We measured the performance and power consumption of 70 computer systems from 1989 to 2011. Our collection of computers included desktop and laptop personal computers, workstations, handheld devices, and supercomputers. This is the first paper reporting the performance and power consumption of systems over twenty years using a uniform method. The primary benchmark we used was Dhrystone; we also used the NAS Parallel Benchmarks and the SPEC CPU2006 suite. The Dhrystone/power ratio was found to be growing exponentially. The data we obtained indicate that the Dhrystone result and CINT2006 in SPEC CPU2006 correlate closely; the NAS Parallel Benchmarks and CFP2006 results also correlate. Using the Dhrystone/power trend that we obtained, we predict that the Dhrystone/power ratio will reach 2,963 VAX MIPS/Watt in 2018, when exaflops machines are expected to appear.
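The kind of projection described amounts to a log-linear fit over the measured years. The sketch below uses made-up data points, since the paper's prediction rests on its own 70-system measurements.

```python
# Toy log-linear extrapolation of an exponentially growing MIPS/Watt ratio.
import math

years  = [1990, 1995, 2000, 2005, 2010]
ratios = [0.05, 0.4, 3.0, 25.0, 200.0]  # hypothetical Dhrystone/Watt values

n = len(years)
mx = sum(years) / n
my = sum(math.log(r) for r in ratios) / n
slope = sum((x - mx) * (math.log(y) - my) for x, y in zip(years, ratios)) \
        / sum((x - mx) ** 2 for x in years)
print(math.exp(my + slope * (2018 - mx)))  # projected ratio in 2018
```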
Proceedings of the 7th international conference on Supercomputing, 1993
Latencies associated with memory accesses and process communications are among the most difficult obstacles in constructing a practical massively parallel system. So far, two approaches to hiding latencies have been proposed: prefetching and multi-threading. An instruction-level data-driven computer is an ideal test-bed for evaluating these latency-hiding methods, because prefetching and multi-threading are naturally implemented in an instruction-level data-driven computer as unfolding and concurrent execution of multiple contexts. This paper evaluates latency-hiding methods on SIGMA-1, a dataflow supercomputer developed at the Electrotechnical Laboratory. The evaluation shows that these methods effectively hide static latencies but not dynamic latencies, and that concurrent execution of multiple contexts is more effective than prefetching.
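The trade-off between contexts and latency can be captured by a textbook utilization model (our illustration, not the SIGMA-1 measurement methodology): with C concurrent contexts each doing W cycles of work per remote request of latency L, a processing element saturates once C*W >= W + L.

```python
# Back-of-the-envelope model of multi-threaded latency hiding.
def utilization(contexts, work_cycles, latency_cycles):
    busy = contexts * work_cycles            # work available per W+L period
    return min(1.0, busy / (work_cycles + latency_cycles))

for c in (1, 2, 4, 8):
    print(c, utilization(c, work_cycles=20, latency_cycles=100))
```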
Proceedings of the 5th international conference on Supercomputing - ICS '91, 1991
In the past, various computer languages aimed at users of dataflow computers have been designed and implemented; VAL, Id, and SISAL are famous examples. One concept common to such languages is to maintain a sense of functionality by introducing a single-assignment rule for all variables. DFCII (Dataflow C II), the language proposed here, is designed for writing practical application programs and system programs that must be executed in parallel on the SIGMA-1, the most important dataflow research project being undertaken by the Electrotechnical Laboratory of Japan. This paper proposes the design of this new dataflow language, DFCII. 1 Introduction: Many languages have been proposed to describe parallelism for parallel computer systems. Most parallel languages look like some sort of extension of the FORTRAN languages or some form of object-oriented language, so the targets of these languages are directed to parallel von Neumann computer systems and …
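For concreteness, the single-assignment rule means each name is bound exactly once, so every value has a unique producer and the dataflow graph is explicit. A tiny Python-flavored illustration (DFCII's actual syntax is C-like and is not shown in this abstract):

```python
# Single-assignment style: names denote dataflow arcs, not mutable storage.
def step(x):
    a = x + 1   # each name is bound exactly once ...
    b = a * 2   # ... so a, b, c each have a unique producer
    c = a + b
    return c    # no name is ever reassigned
```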
As network speeds grow, inter-layer coordination becomes more important. This paper presents three inter-layer coordination methods: (1) "Comet-TCP", cooperation of the data-link layer and transport layer using hardware; (2) "Transmission Rate Controlled TCP (TRC-TCP)", cooperation of the data-link layer and transport layer using software; and (3) "Dulling Edges of Cooperative Parallel streams (DECP)", cooperation of the transport layer and application layer. We show the experimental results of file transfer at the Bandwidth Challenge at SC2003: a one-and-a-half round trip from Japan to the U.S. (15,000 miles) with 350 ms RTT and 8.2 Gbps bandwidth. The Comet-TCP hardware solution attained a maximum of 7.56 Gbps using a pair of 16 IA servers, which is 92% of the available bandwidth, and the DECP software attained a maximum of 7.01 Gbps using a pair of 32 IA servers.
2011 Second International Conference on Networking and Computing, 2011
One of the significant issues in processor architecture is overcoming memory latency. Prefetching can greatly improve cache performance, but it has the drawback of cache pollution unless its aggressiveness is properly set. Although several prefetcher-throttling techniques have been proposed which use accuracy as a metric, their robustness was insufficient due to variations in program working-set sizes and cache capacities. In this paper, we revisit cache behavior from the viewpoint of data lifetime in a cache with prefetching. Based on this observation, we propose Cache-Convection-Control-based Prefetch Optimization (CCCPO), which exploits the characteristics of cache-line reuse to control prefetcher aggressiveness. Evaluation results show that this novel approach achieves a 4.6% improvement over the most recent prefetcher-throttling algorithms in the geometric mean of the SPEC CPU2006 benchmark suite with a 256KB LLC.
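For context, accuracy-based throttling, the baseline family that CCCPO improves on, can be sketched as a simple feedback loop on the prefetch degree; this illustration does not model CCCPO's lifetime-based ("convection") metric, and all thresholds are assumptions.

```python
# Generic sketch of accuracy-driven prefetcher throttling (not CCCPO itself).
class ThrottledPrefetcher:
    def __init__(self):
        self.degree = 2   # current aggressiveness: lines prefetched per miss
        self.issued = 0
        self.useful = 0

    def on_prefetch_issue(self, n=1):
        self.issued += n

    def on_prefetch_hit(self):        # a prefetched line was actually used
        self.useful += 1

    def adjust(self):                 # called once per measurement epoch
        if self.issued == 0:
            return
        accuracy = self.useful / self.issued
        if accuracy > 0.75:
            self.degree = min(self.degree * 2, 8)   # prefetch harder
        elif accuracy < 0.40:
            self.degree = max(self.degree // 2, 1)  # back off: pollution risk
        self.issued = self.useful = 0               # start a new epoch
```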
Proceedings of the 2007 ACM/IEEE conference on Supercomputing, 2007
We describe the GRAPE-DR (Greatly Reduced Array of Processor Elements with Data Reduction) system, which will consist of 4096 processor chips, each with 512 cores operating at a clock frequency of 500 MHz. The peak speed of a processor chip is 512 Gflops (single precision) or 256 Gflops (double precision). The GRAPE-DR chip works as an attached processor to standard PCs. Currently, a PCI-X board with a single GRAPE-DR chip is in operation. We are developing a 4-chip board with a PCI-Express interface, which will have a peak performance of 1 Tflops. The final system will be a cluster of 512 PCs, each with two GRAPE-DR boards. We plan to complete the final system by early 2009. The application area of GRAPE-DR covers particle-based simulations such as astrophysical many-body simulations and molecular-dynamics simulations, quantum chemistry calculations, various applications which require dense matrix operations, and many other compute-intensive applications.
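As a sanity check on the quoted figures (our arithmetic, assuming two single-precision operations per core per cycle, e.g. a multiply-add, which is what the stated 512 Gflops implies):

```python
# Peak-performance arithmetic derived from the numbers in the abstract.
cores, clock_hz, ops_per_cycle = 512, 500e6, 2   # 2 SP ops/cycle is our inference
chip_peak_sp = cores * clock_hz * ops_per_cycle  # 5.12e11 = 512 Gflops per chip
chips = 4096
print(chip_peak_sp / 1e9, "Gflops/chip;", chips * chip_peak_sp / 1e15, "Pflops system")
```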