Papers by Jaswinder Pal Singh

Operating Systems Review, 1994
The two predominant multiprocessor communication paradigms are implicit communication through a shared address space and explicit communication via message passing. A shared address space presents programmers with a favorable programming abstraction, and hardware cache-coherent shared-address-space machines have been shown to perform well on tasks that require fine-grain communication. Message passing machines, on the other hand, perform well on tasks that require coarse-grain communication. Integrating a coarse-grain communication facility into a hardware cache-coherent shared-address-space machine offers the potential for a favorable programming abstraction and good performance over a wide range of communication grain sizes. In this type of machine, fine-grain communication is managed by the cache-coherence hardware, while coarse-grain communication is managed by a block transfer facility external to the main processor.
SPLASH: Stanford parallel applications for shared-memory
ACM Sigarch Computer Architecture News, 1992
We present the Stanford Parallel Applications for Shared-Memory (SPLASH), a set of parallel applications for use in the design and evaluation of shared-memory multiprocessing systems. Our goal is to provide a suite of realistic applications that will serve as a well-...
ACM Sigarch Computer Architecture News, 1993
The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines?
The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors
Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. Computer Systems Laboratory, Stanford University, Stanford, CA 94305. ...
Journal of Parallel and Distributed Computing, 1995
Hierarchical N-body methods, which are based on a fundamental insight into the nature of many physical processes, are increasingly being used to solve large-scale problems in a variety of scientific/engineering domains. Applications that use these methods are challenging to parallelize effectively, however, owing to their nonuniform, dynamically changing characteristics and their need for long-range communication.
THE EFFECTS OF LATENCY, OCCUPANCY, AND BANDWIDTH IN DISTRIBUTED SHARED MEMORY MULTIPROCESSORS
Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and John Hennessy. Technical Report No. CSL-TR-95-660, January 1995. ...
Scaling Parallel Programs for Multiprocessors: Methodology and Examples
IEEE Computer, 1993
Jaswinder Pal Singh, John L. Hennessy, and Anoop Gupta, Stanford University. This approach scales all relevant parameters under constraints imposed by the application domain. ...

Sigplan Notices, 1994
Several multiprocessors have been proposed that offer programmable implementations of scalable cache coherence as well as support for message passing. In the FLASH machine, flexibility is obtained by the use of a programmable node controller, called MAGIC, through which all transactions in a node pass. We use the actual code sequences that implement the cache coherence protocol, together with a detailed simulator of the MAGIC design, to evaluate the performance costs of flexibility. We compare the performance of FLASH to an idealized hardwired machine on representative applications. In many cases, the overhead of the programmable protocol can be hidden behind the memory access time. When the miss rates are low, the performance differences between the ideal machine and FLASH are small. At high miss rates, performance is not good for either machine, though the increased remote access latencies and the contention within MAGIC can lead to larger performance losses for the flexible design. The results of our initial investigations point to a number of improvements that could be made to increase robustness in a flexible design such as FLASH.
Parallelizing the simulation of ocean eddy currents

IEEE Computer, 1994
Shared-address-space multiprocessors are effective vehicles for speeding up visualization and image synthesis algorithms. This article demonstrates excellent parallel speedups on some well-known sequential algorithms. Several recent algorithms have substantially sped up complex and time-consuming visualization tasks. In particular, novel algorithms for radiosity computation and volume rendering have demonstrated performance far superior to earlier methods. Despite these advances, visualization of complex scenes or data sets remains computationally expensive. Rendering a 256 x 256 x 256-voxel volume data set takes about 5 seconds per frame on a 100-MHz Silicon Graphics Indigo workstation using Levoy's ray-casting algorithm, and about a second per frame using a new shear-warp algorithm. These times are much larger than the 0.03 second per frame required for real-time rendering or the 0.1 second per frame required for interactive rendering. Realistic radiosity and ray-tracing computations are much more time-consuming.
An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors
Jaswinder Pal Singh, Truman Joe, Anoop Gupta, and John L. Hennessy. Computer Systems Laboratory, Stanford University. ...