Energy optimization of multi-level processor cache architectures
1995, Proceedings of the 1995 international symposium on Low power design - ISLPED '95
https://doi.org/10.1145/224081.224090
5 pages
Abstract
To optimize the performance and power of a processor's cache, a multiple-divided module (MDM) cache architecture is proposed that saves power in the memory peripherals as well as in the bit array. For an M×B-divided MDM cache, the access latency is equivalent to that of the smallest module, while the power consumption is only 1/(M×B) that of a regular, non-divided cache. Based on this architecture and given transistor budgets for on-chip processor caches, this paper extends the investigation to analyze the energy effects of cache parameters in a multi-level cache design. The analysis is based on execution of the SPECint92 benchmark programs, using the miss ratios of a RISC processor.
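The 1/(M×B) claim in the abstract can be illustrated with a minimal sketch. This is a hypothetical idealization, not the paper's model: it assumes the cache is divided into M×B equally sized modules and that only one module's bit array and peripherals are activated per access, so dynamic energy per access scales down by the division factor.

```python
# Idealized illustration of the MDM energy claim (assumption: only one
# of the M*B modules is activated per access, and module energy scales
# linearly with module size).
def mdm_energy_per_access(e_undivided, m, b):
    """Dynamic energy per access for an MxB-divided MDM cache."""
    return e_undivided / (m * b)

e_full = 1.0  # normalized energy of the non-divided cache
print(mdm_energy_per_access(e_full, 4, 2))  # 4x2 division -> 0.125
```

Under this idealization, a 4×2 division spends one eighth of the undivided cache's access energy, matching the 1/(M×B) figure.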
Related papers
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2000
The memory hierarchy of high-performance and embedded processors has been shown to be one of the major energy consumers. For example, the Level-1 (L1) instruction cache (I-Cache) of the StrongARM processor accounts for 27% of the power dissipation of the whole chip, whereas the instruction fetch unit (IFU) and the I-Cache of Intel's Pentium Pro processor are the single most important power-consuming modules, with 14% of the total power dissipation [2]. Extrapolating current trends, this portion is likely to increase in the near future, since the devices devoted to the caches occupy an increasingly larger percentage of the total area of the chip. In this paper, we propose a technique that uses an additional mini cache, the L0-Cache, located between the I-Cache and the CPU core. This mechanism can provide the instruction stream to the data path and, when managed properly, it can effectively eliminate the need for high utilization of the more expensive I-Cache. We propose, implement, and evaluate five techniques for dynamic analysis of the program instruction access behavior, which is then used to proactively guide the access of the L0-Cache. The basic idea is that only the most frequently executed portions of the code should be stored in the L0-Cache, since this is where the program spends most of its time. We present experimental results to evaluate the effectiveness of our scheme in terms of performance and energy dissipation for a series of SPEC95 benchmarks. We also discuss the performance and energy tradeoffs that are involved in these dynamic schemes. Results for these benchmarks indicate that more than 60% of the dissipated energy in the I-Cache subsystem can be saved.
Microprocessors and Microsystems, 2002
The line size/performance trade-offs of off-chip second-level caches are revisited in light of energy efficiency. Based on a mix of applications representing server and mobile computer system usage, we show that while the large line sizes (128 bytes) typically used maximize performance, they result in high power dissipation owing to limited exploitation of spatial locality. In contrast, small blocks (32 bytes) are found to cut the energy-delay product by more than a factor of 2 with only a moderate performance loss of less than 25%. As a remedy, prefetching, if applied selectively, is shown to avoid the performance losses of small blocks while keeping power consumption low.
2005
Instruction caches typically consume 27% of the total power in modern high-end embedded systems. We propose a compiler-managed instruction store architecture (K-store) that places the computation-intensive loops in a scratchpad-like SRAM memory and allocates the remaining instructions to a regular instruction cache. At runtime, execution is switched dynamically between the instructions in the traditional instruction cache and the ones in the K-store by inserting jump instructions. The necessary jump instructions add 0.038% on average to the total dynamic instruction count. We compare the performance and energy consumption of our K-store with that of a conventional instruction cache of equal size. When used in lieu of an 8KB, 4-way set-associative instruction cache, K-store provides a 32% reduction in energy and a 7% reduction in execution time. Unlike loop caches, K-store maps the frequent code in a reserved address space and hence can switch between the kernel memory and the instruction cache without any noticeable performance penalty.
To meet the ever-increasing computing requirements of the embedded market, multiprocessor chips have been proposed as the way forward. In this work we investigate the energy consumption of these embedded MPSoC systems. One efficient way to reduce energy consumption is to reconfigure the cache memories. This approach has been applied to single-cache-level, single-processor architectures, but has not yet been investigated for multiprocessor architectures with two cache levels. The main contribution of this paper is to explore a two-level-cache (L1/L2) multiprocessor architecture by estimating its energy consumption. Using a simulation platform, we first build a multiprocessor architecture, and then propose a new algorithm that tunes the two-level cache hierarchy (L1 and L2). The cache-tuning approach is based on three parameters: cache size, line size, and associativity. To find the best cache configuration, the application is divided into several execution intervals, and for each interval we generate the best cache configuration. Finally, the approach is validated using a set of open-source benchmarks (SPEC 2006, Splash-2, MediaBench), and we discuss the performance in terms of speedup and energy reduction.
Proceedings of the 20th symposium on Great lakes symposium on VLSI - GLSVLSI '10, 2010
On-chip memory organization is one of the most important aspects that can influence overall system behavior in multiprocessor systems. Following the trend set by high-performance processors, high-end embedded cores are moving from single-level on-chip caches to a two-level on-chip cache hierarchy. Whereas in the embedded world there is general consensus on private L1 caches, for L2 there is still no dominant architectural paradigm. Cache architectures that work for high-performance computers turn out to be inefficient for embedded systems, mainly due to power-efficiency issues. This paper presents a virtual platform for design space exploration of L2 cache architectures in low-power Multi-Processor Systems-on-Chip (MPSoCs). The tool contains several L2 cache templates, and new architectures can be easily added using our flexible plug-in system. Given a set of constraints for a specific system (power, area, performance), our tool will perform extensive exploration to find the cache organization that best suits our needs. Through some practical experiments, we show how it is possible to select the optimal L2 cache, and how this kind of tool can help designers avoid some common misconceptions. Benchmarking results in the experiments section show that, for a case study with multiple processors running communicating tasks allocated on different cores, the private L2 cache organization still performs better than the shared one.
IEEE Transactions on Very Large Scale Integration Systems, 2003
Microprocessor performance has been improved by increasing the capacity of on-chip caches. However, the performance gain comes at the price of static energy consumption due to subthreshold leakage current in cache memory arrays. This paper compares three techniques for reducing static energy consumption in on-chip level-1 and level-2 caches. One technique employs low-leakage transistors in the memory cell. Another technique, power supply switching, can be used to turn off memory cells and discard their contents. A third alternative is dynamic threshold modulation, which places memory cells in a standby state that preserves cell contents. In our experiments, we explore the energy and performance tradeoffs of these techniques. We also investigate the sensitivity of microprocessor performance and energy consumption to additional cache latency caused by leakage-reduction techniques.
2010 12th International Conference on Computer Modelling and Simulation, 2010
With the increasing processor-memory performance gap, it has become important to gauge the performance of cache architectures in order to evaluate their impact on the energy requirements and throughput of the system. Multilevel caches are increasingly prevalent in high-end processors. Additionally, the recent drive towards multicore systems has necessitated the use of multilevel cache hierarchies for shared memory architectures. This paper presents simplified and accurate mathematical models to estimate the energy consumption and the impact on throughput of multilevel caches in single-core systems.
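The kind of model the abstract above describes can be sketched in its most generic textbook form. This is not the paper's actual model; it is the standard recursive miss-penalty formulation for an L1/L2 hierarchy, and all parameter values below are placeholders.

```python
# Generic two-level cache latency/energy model (standard textbook form,
# not the model from the paper; all numbers are illustrative placeholders).
def amat(t_l1, t_l2, t_mem, m_l1, m_l2):
    """Average memory access time: L1 hit time plus the miss path
    through L2 and, on an L2 miss, main memory (cycles)."""
    return t_l1 + m_l1 * (t_l2 + m_l2 * t_mem)

def energy_per_access(e_l1, e_l2, e_mem, m_l1, m_l2):
    """Expected dynamic energy per memory reference, weighting each
    level's access energy by the probability of reaching it."""
    return e_l1 + m_l1 * (e_l2 + m_l2 * e_mem)

# Example: 5% L1 miss ratio, 20% L2 (local) miss ratio.
print(amat(1, 10, 100, 0.05, 0.2))                   # ~2.5 cycles
print(energy_per_access(1.0, 5.0, 50.0, 0.05, 0.2))  # ~1.75 units
```

Because every L1 miss pays the L2 access and every L2 miss additionally pays the memory access, both quantities follow the same weighted-sum shape; this is also why shrinking L1 miss ratio has an outsized effect on both latency and energy.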
2014 International Conference on Electronics and Communication Systems (ICECS), 2014
Minimizing the power consumption of chip multiprocessors has drawn the attention of researchers in recent years. A single chip contains a number of processor cores and correspondingly large caches. Recent research shows that on-chip caches consume the largest share of the total power consumed by the chip. Reducing on-chip cache size may reduce on-chip power consumption, but it degrades performance. In this paper we present a study of reducing cache capacity and analyze its effect on power and performance. We reduce the number of available cache banks and observe the effect on dynamic and static energy. Experimental evaluation shows that for most benchmarks we obtain a significant reduction in static energy, which can help control chip temperature. We use CACTI and a full-system simulator for our experiments.
Communications on Applied Electronics
The search goes on for another groundbreaking development to reduce the ever-increasing disparity between CPU performance and storage. There have been encouraging breakthroughs in enhancing CPU performance through fabrication technologies and changes in chip design, but far less progress has been made on computer storage, with a material negative effect on system performance. A great deal of research effort has been put into finding techniques that can improve the energy efficiency of cache architectures. This work is a survey of energy-saving techniques, grouped by whether they save dynamic energy, leakage energy, or both. The aim of this work is to compile a quick reference guide to energy-saving techniques from 2013 to 2016 for engineers, researchers, and students.

References (7)
- S. Date, N. Shibata, S. Mutoh, and J. Yamada, "1V 30MHz Memory-Macrocell-Circuit Technology with a 0.5 µm Multi-Threshold CMOS," Proceedings of the 1994 Symposium on Low Power Electronics, San Diego, CA, pp. 90-91, Oct. 1994.
- S. T. Chu, "A 25 ns Low Power Full-CMOS 1Mbit (128Kx8) SRAM," Journal of Solid State Circuits, vol. 23, pp. 1078-1084, Oct. 1988.
- D. T. Wong, "An 11 ns 8Kx18 CMOS Static RAM with 0.5 µm Devices," Journal of Solid State Circuits, vol. 23, pp. 1095-1103, Oct. 1988.
- B. Amrutur, and M. Horowitz, "Techniques to Reduce Power in Fast Wide Memories," Proceedings of the 1994 Symposium on Low Power Electronics, San Diego, CA, pp. 92-93, Oct. 1994.
- K. Itoh, K. Sasaki, and Y. Nakagome, "Trends in Low- Power RAM Circuit Technologies," Proceedings of the 1994 Symposium on Low Power Electronics, San Diego, CA, pp. 84-87, Oct. 1994.
- A. J. Smith, "Cache Memories," Computing Surveys, pp. 473-530, Sep. 1982.
- J. Gee, M. D. Hill, D. N. Pnevmatikatos, and A.J. Smith, "Cache Performance of the Spec92 Benchmark Suite," IEEE Micro, pp. 17-27, Aug. 1993.