Papers by Richard Fujimoto

Simuletter, Jul 1, 1997
It is well known that Time Warp may suffer from poor performance due to excessive rollbacks caused by overly optimistic execution. Here we present a simple flow control mechanism using only local information and GVT that limits the number of uncommitted messages generated by a processor, thus throttling overly optimistic Time Warp (TW) execution. The flow control scheme is analogous to traditional networking flow control mechanisms. A "window" of messages defines the maximum number of uncommitted messages allowed to be scheduled by a process. Committing messages is analogous to acknowledgments in networking flow control. The initial size of the window is calculated using a simple analytical model that estimates the instantaneous number of messages that a process will eventually commit. This window is expanded so that the process may progress up to the next commit point (generally the next fossil collection), and to accommodate optimistic execution. The expansions to the window are based on monitoring TW performance statistics, so the window size automatically adapts to changing program behaviors. The flow control technique presented here is simple and fully automatic. No global knowledge or synchronization (other than GVT) is required. We also develop an implementation of the flow control scheme for shared-memory multiprocessors that uses dynamically sized pools of free message buffers. Experimental data indicate that the adaptive flow control scheme maintains high performance for "balanced workloads", and achieves as much as a factor of 7 speedup over unthrottled TW for certain irregular workloads.

1 Introduction

Time Warp is a well-known parallel discrete event simulation synchronization protocol that detects out-of-order executions of events as they occur, and recovers using a rollback mechanism [11]. It is well known that Time Warp may suffer from long rollbacks due to overly optimistic execution. Depending on the cost and frequency of rollback, the rollback overheads may dominate processing time. In addition, logical processes (LPs) further ahead in virtual time consume memory that could be better utilized by LPs closer to GVT. In such cases it is better to block the optimistic LPs and prevent long rollbacks rather than spend resources undoing the wrong computation after the fact. Numerous variations of Time Warp have been proposed that attempt to reduce the amount of rolled-back computation that may occur. Surveys of methods in this regard are described in [7, 18]. Broadly, there are two classes of optimism control schemes: non-adaptive and adaptive. System parameters (e.g., window sizes) remain static in non-adaptive schemes, whereas they are dynamic in adaptive schemes.
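As a rough illustration of the scheme described above, the following sketch (hypothetical names and structure, not the paper's implementation) shows how a per-process window on uncommitted messages can throttle an overly optimistic LP and adapt at each commit point:

```python
class FlowControlWindow:
    """Illustrative per-process window for throttling Time Warp.

    A process may hold at most `window` uncommitted (sent but not yet
    fossil-collected) messages; the window is re-estimated at each
    GVT/commit point from monitored statistics.
    """

    def __init__(self, initial_window, slack=1.5):
        self.window = initial_window    # max uncommitted messages allowed
        self.uncommitted = 0            # messages sent but not yet committed
        self.slack = slack              # headroom for optimistic progress

    def can_send(self):
        # The LP blocks (is throttled) once the window is exhausted.
        return self.uncommitted < self.window

    def on_send(self):
        assert self.can_send()
        self.uncommitted += 1

    def on_commit(self, n_committed, est_commit_count):
        # Fossil collection at a new GVT "acknowledges" committed messages,
        # analogous to ACKs in networking flow control.
        self.uncommitted -= n_committed
        # Adapt: size the window to the estimated number of messages this
        # process will commit by the next commit point, plus slack.
        self.window = max(1, int(est_commit_count * self.slack))
```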

Workshop on Parallel and Distributed Simulation, Jul 1, 1995
Mechanisms for managing message buffers in Time Warp parallel simulations executing on cache-coherent shared-memory multiprocessors are studied. Two simple buffer management strategies called the sender pool and receiver pool mechanisms are examined with respect to their efficiency, and in particular, their interaction with multiprocessor cache-coherence protocols. Measurements of implementations on a Kendall Square Research KSR-2 machine using both synthetic workloads and benchmark applications demonstrate that sender pools offer significant performance advantages over receiver pools. However, it is also observed that both schemes, especially the sender pool mechanism, are prone to severe performance degradations due to poor locality of reference in large simulations using substantial amounts of message buffer memory. A third strategy called the partitioned buffer pool approach is proposed that exploits the advantages of sender pools, but exhibits much better locality. Measurements of this approach indicate that the partitioned pool mechanism yields substantially better performance than both the sender and receiver pool schemes for large-scale, small-granularity parallel simulation applications. The central conclusions from this study are: (1) buffer management strategies play an important role in determining the overall efficiency of multiprocessor-based parallel simulators, and (2) the partitioned buffer pool organization offers significantly better performance than the sender and receiver pool schemes. These studies demonstrate that poor performance may result if proper attention is not paid to realizing an efficient buffer management mechanism.
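To make the partitioned organization concrete, here is a minimal sketch, under the assumption (not stated in the abstract) that free buffers are kept in separate pools per sender-receiver pair, so a buffer tends to be reused along the same communication path and its cache lines stay warm:

```python
from collections import defaultdict, deque

class PartitionedBufferPool:
    """Hedged sketch of a partitioned free-buffer organization."""

    def __init__(self, buffer_factory):
        self.pools = defaultdict(deque)   # (sender, receiver) -> free buffers
        self.buffer_factory = buffer_factory

    def allocate(self, sender, receiver):
        pool = self.pools[(sender, receiver)]
        # Reuse a buffer previously used on this path if one is free;
        # otherwise allocate a fresh one.
        return pool.popleft() if pool else self.buffer_factory()

    def release(self, sender, receiver, buf):
        # Return the buffer to the pool of the pair that used it, rather
        # than to a single shared pool (sender scheme) or the receiver's
        # pool, limiting cross-processor traffic on the buffer's lines.
        self.pools[(sender, receiver)].append(buf)
```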
Workshop on Parallel and Distributed Simulation, May 1, 1999

The Journal of Defense Modeling and Simulation, Apr 1, 2008
Many of today's military services and applications run on geographically distributed sites. Before these services and applications can be deployed in an actual network, they need to be tested and evaluated under realistic scenarios with many unpredictable factors. Existing experimental tools cannot meet the requirements for scale, accuracy, and timeliness. A network emulation framework called ROSENET is proposed that can meet these requirements by using a remote parallel simulation server to model a wide area network and a local network emulator to provide timely QoS predictions for testing real-world applications. This paper discusses the challenges faced in applying ROSENET to defense applications through two case studies. In the first case study we apply synthetic traffic workloads over DARPA's NMS network topology for a large-scale simulation and define a metric called remote emulation delay to evaluate and quantify ROSENET's performance. In the second case study we illustrate the procedure of using ROSENET to evaluate a contemporary real-time distributed VoIP application, Skype, and present experimental results.

Techniques for efficient parallel simulation and their application to large-scale telecommunication network models
It is widely recognized that parallel simulation technology is necessary to address the new simulation requirements of important applications such as large-scale network models. However, current parallel simulation techniques possess limitations in the type and sizes of models that can be efficiently simulated in parallel. In particular, optimistic parallel simulation of fine-grained models has been plagued by large state saving overheads, both in event-oriented and process-oriented views, resulting in unsatisfactory parallel execution speed for many important applications. In the absence of alternative solutions, it was generally believed that optimistic approaches were inapplicable in such applications. This thesis asserts that, by use of appropriately constructed novel techniques, it is indeed possible to perform fast optimistic parallel simulation of fine-grained models, using both event-oriented and process-oriented views. In support of this claim, techniques are presented here that significantly lower the overheads, thereby enabling the capability to efficiently simulate large-scale, fine-grained models in parallel. On sample models, when compared to previously known approaches, the techniques presented here improve the simulation speed by a factor of 3 or more, while simultaneously reducing the memory requirements by almost half. The first technique addresses the high overheads of state saving mechanisms that are traditionally used in supporting rollback operations in optimistic parallel simulation. An alternative approach called reverse computation is identified for realizing rollback, which is demonstrated to significantly improve the parallel simulation speed while greatly reducing the memory utilization of the simulation. The next technique concerns the process-oriented worldview, which is extremely useful in many domains, such as telecommunication network protocol modeling. An approach called stack reconstruction is developed to address the high execution overheads traditionally associated with process-oriented views, and its effectiveness is demonstrated in achieving a high rate of process-context switching during optimistic simulation. Additional contributions of this thesis include the identification and solution of other typical problems encountered in the design, development, and parallel simulation of models for real-life telecommunication network protocols. This work serves to demonstrate the readiness and feasibility of applying parallel simulation technology to today's large and complex models. The parallel simulation techniques described in this thesis are applied and analyzed in the context of telecommunication network simulation, using representative network models. The techniques, however, are not restricted to network simulation, but are equally applicable to other domains as well. For example, the parallel simulation of any network of queues can benefit from the reverse computation system presented here. In fact, the reverse computation system is relevant to other application areas as well, such as database recovery and debugging environments. Similarly, the stack reconstruction approach is applicable to any multi-threaded system that requires an efficient incremental checkpointing facility for its thread states.
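As a minimal illustration of the reverse computation idea (an illustrative queue model, not the thesis code), each forward event handler is paired with an inverse that undoes its state changes on rollback, so no state need be saved before each event:

```python
class QueueLP:
    """Toy logical process for a queue, with reversible event handlers."""

    def __init__(self):
        self.queue_len = 0
        self.served = 0

    def arrival(self):              # forward event
        self.queue_len += 1         # constructive update: exactly invertible

    def arrival_reverse(self):      # inverse event, executed on rollback
        self.queue_len -= 1

    def departure(self):            # forward event
        self.queue_len -= 1
        self.served += 1

    def departure_reverse(self):    # inverse undoes updates in reverse order
        self.served -= 1
        self.queue_len += 1
```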
Simuletter, Jul 1, 1996
Recently, a considerable amount of effort in the U.S. Department of Defense has been devoted to defining the High Level Architecture (HLA) for distributed simulations. This paper describes the time management component of the HLA that defines the means by which individual simulations (called federates) advance through time. Time management includes synchronization mechanisms to ensure event ordering when this is needed. The principal challenge of the time management structure is to support interoperability among federates using different local time management mechanisms, such as that used in DIS, conservative and optimistic mechanisms developed in the parallel simulation community, and real-time hardware-in-the-loop simulations.
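For illustration only, a conservative federate's interaction with the Runtime Infrastructure might look like the following sketch; the call names are hypothetical stand-ins, not the actual HLA Time Management service names:

```python
# Hypothetical sketch of a conservative federate's time-advance loop.
def federate_loop(rti, federate, end_time):
    t = 0.0
    while t < end_time:
        # Ask the RTI to advance to the next local event's timestamp.
        rti.request_advance_to(federate.next_local_event_time())
        # The grant arrives only when the RTI can guarantee no other
        # federate will deliver an event with a smaller timestamp,
        # which is how event ordering is preserved across federates.
        t = rti.wait_for_grant()
        federate.process_events_at(t)
```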
HOP: A process model for synchronous hardware; semantics and experiments in process composition
Integration, Dec 1, 1989
We present a language “hardware viewed as objects and processes” (HOP) for specifying the structure, behavior, and timing of hardware systems. HOP embodies a simple process model for lock-step synchronous processes. An absproc specification written in ...

IEEE Transactions on Parallel and Distributed Systems, Mar 1, 2008
The High-Level Architecture (HLA) standard developed by the Department of Defense in the United States is a key technology to perform distributed simulation. Inside the HLA framework, many different simulators (termed federates) may be interconnected to create a single more complex simulator (federation). Data Distribution Management (DDM) is an optional subset of services that controls which federates should receive notification of state modifications made by other federates. A simple DDM implementation will usually generate much more traffic than needed, whereas a complex one might introduce too much overhead. In this work, we describe an approach to DDM that delegates a portion of the DDM computation to a processor on the network card in order to provide more CPU time for other federate and Runtime Infrastructure (RTI) computations while still being able to exploit the benefits of a complex DDM implementation to reduce the amount of information exchange.
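The core of region-based DDM matching is an interval-overlap test per dimension. The following sketch is an illustrative brute-force version (hypothetical data layout, not the RTI or network-card implementation):

```python
def regions_overlap(update, subscription):
    """Each region is a list of (low, high) extents, one per dimension.

    Regions overlap only if their extents intersect in every dimension.
    """
    return all(u_lo < s_hi and s_lo < u_hi
               for (u_lo, u_hi), (s_lo, s_hi) in zip(update, subscription))

def matching_federates(update_region, subscriptions):
    # subscriptions: {federate_id: subscription_region}
    # Only matching federates are notified of the state modification,
    # reducing traffic relative to broadcasting every update.
    return [fid for fid, region in subscriptions.items()
            if regions_overlap(update_region, region)]
```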
Efficient computer simulation of complex physical phenomena has long been challenging due to their multiphysics and multi-scale nature. In contrast to traditional time-stepped execution methods, we describe an approach using optimistic parallel discrete event simulation (PDES) and reverse computation techniques. We show that reverse computation-based optimistic parallel execution can significantly reduce the execution time of a plasma simulation without requiring a significant amount of additional memory compared to conservative execution techniques. We describe an application-level reverse computation technique that is efficient and suitable for complex scientific simulations involving floating point operations.
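A brief sketch of why floating point complicates reverse computation: an increment followed by the matching decrement need not restore the original bits because of rounding. The workaround shown here (swapping rather than overwriting, a form of incremental state saving) is illustrative, not necessarily the paper's technique:

```python
x = 1.0
orig = x
y = 1e-16
x += y            # rounds: 1.0 + 1e-16 == 1.0 in double precision
x -= y            # 1.0 - 1e-16 rounds down to the next double below 1.0
print(x == orig)  # False: the naive "inverse" did not restore the state

# An exactly reversible alternative: swap in the new value and keep the
# old one until commit, restoring it verbatim on rollback.
def forward(state, i, new_value):
    old = state[i]
    state[i] = new_value
    return old                # retained until fossil collection

def reverse(state, i, old):
    state[i] = old            # bit-exact restoration
```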
An empirical study of short range communications for vehicles
This paper presents a detailed measurement study of short range communications between vehicles and between vehicles and roadside stations in a realistic highway scenario. We show the expected wireless communication characteristics in a driving ...

Workstations: The Virtual Communication Machine-based Architecture
This paper presents a novel networking architecture designed for communication intensive parallel applications running on clusters of workstations (COWs) connected by high speed networks. The architecture addresses what is considered one of the most important problems of cluster-based parallel computing: the inherent inability of scaling the performance of communication software along with the host CPU performance. The Virtual Communication Machine (VCM), resident on the network coprocessor, presents a scalable software solution by providing configurable communication functionality directly accessible at user-level. The VCM architecture is configurable in that it enables the transfer to the VCM of selected communication-related functionality that is traditionally part of the application and/or the host kernel. Such transfers are beneficial when a significant reduction of the host CPU's load translates into a small increase in the coprocessor's load. The functionality imple...

Parallel and distributed simulation tools are emerging that offer the ability to perform detailed, packet-level simulations of large-scale computer networks on an unprecedented scale. The state-of-the-art in large-scale network simulation is characterized quantitatively. For this purpose, a metric based on the number of Packet Transmissions that can be processed by a simulator per Second of wallclock time (PTS) is used as a means to quantitatively assess packet-level network simulator performance. An approach to realizing scalable network simulations that leverages existing sequential simulation models and software is described. Results from a recent performance study are presented concerning large-scale network simulation on a variety of platforms ranging from workstations to cluster computers to supercomputers. These experiments include runs utilizing as many as 1536 processors yielding performance as high as 106 Million PTS. The performance of packet-level simulations of web and ftp traffic, and Denial of Service attacks on networks containing millions of network nodes are briefly described, including a run demonstrating the ability to simulate a million web traffic flows in near real-time. New opportunities and research challenges to fully exploit this capability are discussed.
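The PTS metric itself is simple arithmetic, as the worked example below shows (the 106 million PTS figure is quoted from above; the transmission count and runtime are illustrative numbers chosen to yield it):

```python
def pts(packet_transmissions, wallclock_seconds):
    # Packet Transmissions processed per Second of wallclock time.
    return packet_transmissions / wallclock_seconds

# e.g., processing 1.06e10 packet transmissions in 100 s of wallclock
# time corresponds to 106 million PTS.
print(pts(1.06e10, 100.0))   # 106000000.0
```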
Winter Simulation Conference, Dec 13, 2009
Workshop on Parallel and Distributed Simulation, Jul 1, 1998