Efficacy and performance impact of value prediction
Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192)
Bohuslav Rychlik, John Faistl, Bryon Krug, John P. Shen. Electrical and Computer Engineering, Carnegie Mellon University. {bohuslav, jf4w, krug, shen}@ece.cmu.edu ...
Abstract: An approach is developed that exploits the deterministic behavior of a processor to perform concurrent fault monitoring. A very low-cost and highly effective technique called Continuous Signature Monitoring (CSM) has been developed. This technique is capable of detecting transients with very low detection latency, and requires minimal memory overhead and performance penalty. It has been applied to both CISC and RISC processors. Both analytical and experimental results have been obtained in validating the effectiveness of the approach. CSM has been adopted by two aerospace companies in their design of a 32-bit RISC processor targeted for avionics and space applications. It appears that the signature monitoring technique can also be extended to detect computer viruses via a form of program encryption.
As mobile devices enter a new era with high-speed connectivity and increasing compute capabilities, a new class of applications called social networking applications is being showcased as the next revolution in mobile computing. In this class of applications each user in a ...
The relationship between fault tolerance and performance is explored for β-networks used as interconnection networks in multicomputer systems. The networks of interest are composed of 2×2 switches (β-elements) and are represented by a graph model called a β-graph. Two parameters derived from β-graphs are used to characterize β-networks. The fault tolerance (FT) parameter is the maximum number of β-element faults that can be tolerated. The communication delay (CD) parameter, representing the worst-case delay between any pair of computers, is used as a measure of the performance of the β-networks. Tight bounds for both FT and CD parameters are derived. Two important classes of β-networks are introduced, namely, DPR-networks and MISE-networks. It is shown that DPR-networks possess the maximal fault tolerance, and the class of DPR-networks is unique in achieving the maximum possible fault tolerance. The class of MISE-networks is minimally fault tolerant, but has the minimum communication delay. A class of β-networks, called RD'IT-networks, that achieve an optimal balance of the FT and CD parameters is also presented.
Proceedings of the 13th international conference on Supercomputing, 1999
This paper presents the concept of dynamic control independence (DCI) and shows how it can be detected and exploited in an out-of-order superscalar processor to reduce the performance penalties of branch mispredictions. We show how DCI can be leveraged during branch misprediction recovery to reduce the number of instructions squashed on a misprediction, as well as how it can be used to avoid predicting unpredictable branches by fetching instructions out-of-order. A realistic implementation is described and evaluated using six SPECint95 benchmarks. We show that exploiting DCI during branch misprediction recovery improves performance by 0.9-9.9% on a 4-wide processor, by up to 11.2% on an 8-wide processor, and by 1.9-15.3% on a 12-wide processor. We also show that using DCI information to fetch instructions out-of-order when an unpredictable branch is encountered potentially improves performance by 0.9-15.2% on a 4-wide processor, by 2.0-14.8% on an 8-wide processor, and by 2.6-16.2% on a 12-wide processor. Some of the largest performance gains are observed on go and gcc, which have traditionally posed the most difficult challenge to aggressive branch prediction techniques. *Currently with Intel Corp. (jmfung@ichips.intel.com).
This paper describes a framework for modeling program behavior and applies it to optimizing prescient instruction prefetch, a novel technique that uses helper threads to improve single-threaded application performance. Spawn-target pair selection is optimized by modeling program behavior with Markov chains and analyzing them with path expression mappings. Mappings are presented for reaching and posteriori probability, path length mean and variance, and path footprint. A limit study demonstrates speedups of 4.8% to 17% for benchmarks with high I-cache miss rates. The framework has been applied to data prefetch helper threads and is potentially applicable to other thread speculation techniques.
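As a rough illustration of what a reaching probability means in a Markov-chain model of control flow, the sketch below estimates the probability that execution starting in each block eventually reaches a target block. It is not the paper's path expression method: it swaps in a plain fixed-point iteration, and the function name, the graph, and the transition probabilities are all hypothetical.

```python
def reaching_probability(transitions, target, iterations=1000, tol=1e-12):
    """Estimate the probability of eventually reaching `target` from each state.

    transitions: dict mapping state -> {successor: probability}.
    States without outgoing transitions are treated as program exits.
    """
    # Collect every state that appears as a source or a successor.
    states = set(transitions)
    for succs in transitions.values():
        states.update(succs)

    prob = {s: 0.0 for s in states}
    prob[target] = 1.0  # the target trivially reaches itself

    # Repeatedly apply p(s) = sum over successors t of P(s -> t) * p(t).
    for _ in range(iterations):
        delta = 0.0
        for s in states:
            if s == target:
                continue
            new = sum(p * prob[t] for t, p in transitions.get(s, {}).items())
            delta = max(delta, abs(new - prob[s]))
            prob[s] = new
        if delta < tol:
            break
    return prob


# Hypothetical control-flow graph: A goes to B (0.9) or exits (0.1);
# B loops back to A (0.5) or falls through to the target T (0.5).
cfg = {
    "A": {"B": 0.9, "exit": 0.1},
    "B": {"A": 0.5, "T": 0.5},
}
probs = reaching_probability(cfg, "T")
```

Under these assumed probabilities, a spawn point with a high reaching probability for its target would be a more attractive candidate than one whose paths mostly leave the region first.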
Voice control is a prominent interaction method on personal computing devices. While automatic speech recognition (ASR) systems are readily applicable for large audiences, there is room for further adaptation at the edge, i.e., locally on devices, targeted for individual users. In this work, we explore improving ASR systems over time through a user's own interactions. Our online learning approach for speaker-adaptive language modeling leverages a user's most recent utterances to enhance the speaker-dependent features and traits. We experiment with the Large-Vocabulary Continuous Speech Recognition corpus Tedlium v2, and demonstrate an average reduction in perplexity (PPL) of 19.18% and an average relative reduction in word error rate (WER) of 2.80% compared to a state-of-the-art baseline.
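For readers unfamiliar with the two metrics, the sketch below shows the standard definitions of perplexity and of a relative reduction. It is only a minimal illustration: the function names and the sample numbers are made up and are not taken from the paper's experiments.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence given the model probability of each token."""
    n = len(token_probs)
    # PPL = exp(-(1/N) * sum(log p_i))
    return math.exp(-sum(math.log(p) for p in token_probs) / n)


def relative_reduction(baseline, adapted):
    """Fractional improvement of `adapted` over `baseline` (lower is better)."""
    return (baseline - adapted) / baseline


# Illustrative values only: a model that assigns 0.25 to every token,
# and a hypothetical metric that drops from 50.0 to 40.0 after adaptation.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
gain = relative_reduction(50.0, 40.0)
```

The same `relative_reduction` helper applies to WER, since both metrics are lower-is-better.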
IoT (Internet of Things) devices, such as network-enabled wearables, are carried by more and more people throughout daily life. Information from multiple devices can be aggregated to gain insights into a person’s behavior or status. For example, an elderly care facility could monitor patients for falls by combining fitness bracelet data with video of the entire class. For this aggregated data to be useful to each person, we need a multi-modality association of the devices’ physical ID (i.e., location, the user holding it, visual appearance) with a virtual ID (e.g., IP address/available services). Existing approaches for multi-modality association often require intentional interaction or direct line-of-sight to the device, which is infeasible for a large number of users or when the device is obscured by clothing. We present IDIoT, a calibration-free passive sensing approach that fuses motion sensor information with camera footage of an area to estimate the body location of motio...
Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources. This paper describes the Hyper-Threading Technology architecture, and discusses the microarchitecture details of Intel's first implementation on the Intel Xeon processor family. Hyper-Threading Technology is an important addition to Intel's enterprise product line and will be integrated into a wide variety of products.
Proceedings Sixth Annual Workshop on Interaction between Compilers and Computer Architectures
This paper examines the efficiency of the register stack engine (RSE) in the canonical Itanium architecture, and introduces novel optimization techniques to enhance RSE performance. To minimize spills and fills of the physical register file, optimizations are applied to reduce internal fragmentation in statically allocated register stack frames. Through the use of dynamic register usage (DRU) and dead register value information (DVI), the processor can dynamically guide allocation and deallocation of register frames. Consequently, a speculatively allocated register frame with a dynamically determined frame size can be much smaller than the statically determined frame size, thus achieving minimum spills and fills. Using the RSE in the canonical Itanium architecture as the baseline reference, we thoroughly study and gauge the tradeoffs of the RSE and the proposed optimizations using a set of SPEC CPU2000 benchmarks built with different compiler optimizations. A combination of frame allocation policies using the most frequent frame size and deallocation policies using dead register information proves to be highly effective. On average, a 71% reduction in aggregate spills and fills can be achieved over the baseline reference.
Proceedings of the 24th annual international symposium on Microarchitecture - MICRO 24, 1991
An architecture synthesis method for the automated design of high-performance application-specific processors has been proposed. This method divides the design task into the Specification Optimization (behavioral) ... and synthesis of these examples are on the order of a few seconds.
Proceedings of MICRO-27. The 27th Annual IEEE/ACM International Symposium on Microarchitecture
The current trace-driven simulation approach to determine superscalar processor performance is widely used but has some shortcomings. Modern benchmarks generate extremely long traces, resulting in problems with data storage, as well as very long simulation run times. More fundamentally, simulation generally does not provide significant insight into the factors that determine performance or a characterization of their interactions. This paper proposes a theoretical model of superscalar processor performance that addresses these shortcomings. Performance is viewed as an interaction of program parallelism and machine parallelism. Both program and machine parallelisms are decomposed into multiple component functions. Methods for measuring or computing these functions are described. The functions are combined to provide a model of the interaction between program and machine parallelisms and an accurate estimate of the performance. The computed performance, based on this model, is compared to simulated performance for six benchmarks from the SPEC 92 suite on several configurations of the IBM RS/6000 instruction set architecture. *Supported by a National Science Foundation Graduate Fellowship. MICRO 27, 11/94, San Jose, CA, USA. © 1994 ACM 0-89791-707-3/94/0011.
Proceedings Eighth International Symposium on High Performance Computer Architecture
The performance of in-order execution Itanium™ processors can suffer significantly due to cache misses. Two memory latency tolerance approaches can be applied for the Itanium processors. One uses an out-of-order (OOO) execution core; the other assumes multithreading support and exploits cache prefetching via speculative precomputation (SP). This paper evaluates and contrasts these two approaches. In addition, this paper assesses the effectiveness of combining the two approaches. For a select set of memory-intensive programs, an in-order SMT Itanium processor using speculative precomputation can achieve performance improvement (92%) comparable to that of an out-of-order design (87%). Applying both OOO and SP yields a total performance improvement of 141% over the baseline in-order machine. OOO tends to be effective in prefetching for L1 misses, whereas SP is primarily good at covering L2 and L3 misses. Our analysis indicates that the two approaches can be redundant or complementary depending on the type of delinquent loads that each targets. Both approaches are effective on delinquent loads in the loop body; however, only SP is effective on delinquent loads found in loop control code.
Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34
A large number of memory accesses in memory-bound applications are irregular, such as pointer dereferences, and can be effectively targeted by thread-based prefetching techniques like Speculative Precomputation. These techniques execute instructions, for example on an available SMT thread context, that have been extracted directly from the program they are trying to accelerate. Proposed techniques typically require manual user intervention to extract and optimize instruction sequences. This paper proposes Dynamic Speculative Precomputation, which performs all necessary instruction analysis, extraction, and optimization through the use of back-end instruction analysis hardware, located off the processor's critical path. For a set of memory-limited benchmarks, an average speedup of 14% is achieved when constructing simple p-slices, and this gain grows to 33% when making use of aggressive optimizations.
Proceedings Eighth International Symposium on High Performance Computer Architecture
As the frequency gap between main memory and modern microprocessors grows, the implementation and efficiency of on-chip caches become more important. The growing latency to memory is motivating new research into load instruction behavior and selective data caching. This work investigates the classification of load instruction behavior. A new load classification method is proposed that classifies loads into those vital to performance and those not vital to performance. A limit study is presented to characterize different types of non-vital loads and to quantify the percentage of loads that are non-vital. Finally, a realistic implementation of the non-vital load classification method is presented and a new cache structure called the Vital Cache is proposed to take advantage of non-vital loads. The Vital Cache caches data for vital loads only, deferring non-vital loads to slower caches. Results: The limit study shows that 75% of all loads are non-vital, with only 35% of the accessed data space being vital for caching. The Vital Cache improves the efficiency of the cache hierarchy and the hit rate for vital loads. The Vital Cache increases performance by 17%.
Papers by John Shen