Shared last-level TLBs for chip multiprocessors
2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture
https://doi.org/10.1109/HPCA.2011.5749717
Abstract
Translation Lookaside Buffers (TLBs) are critical to processor performance. Much past research has addressed uniprocessor TLBs, lowering access times and miss rates. However, as chip multiprocessors (CMPs) become ubiquitous, TLB design must be re-evaluated. This paper is the first to propose and evaluate shared last-level (SLL) TLBs as an alternative to the commercial norm of private, per-core L2 TLBs. SLL TLBs eliminate 7-79% of system-wide misses for parallel workloads. This is an average of 27% better than conventional private, per-core L2 TLBs, translating to notable runtime gains. SLL TLBs also provide benefits comparable to recently-proposed Inter-Core Cooperative (ICC) TLB prefetchers, but with considerably simpler hardware. Furthermore, unlike these prefetchers, SLL TLBs can aid sequential applications, eliminating 35-95% of the TLB misses for various multiprogrammed combinations of sequential applications. This corresponds to a 21% average increase in TLB miss eliminations compared to private, per-core L2 TLBs. Because of their benefits for parallel and sequential applications, and their readily-implementable hardware, SLL TLBs hold great promise for CMPs.
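The intuition behind the miss reduction for parallel workloads can be illustrated with a toy model. The sketch below is not the paper's simulator or configuration: it assumes fully associative LRU structures, a single shared address space, no L1 TLBs, and no access-latency modeling. The names `LRUTLB` and `count_misses` are hypothetical. It compares private per-core TLBs against one shared TLB of the same total capacity on a synthetic trace in which every core touches the same pages.

```python
from collections import OrderedDict


class LRUTLB:
    """Toy fully associative TLB with LRU replacement (an assumption,
    not the organization evaluated in the paper)."""

    def __init__(self, entries):
        self.entries = entries
        self.map = OrderedDict()  # virtual page number -> present
        self.misses = 0

    def access(self, vpn):
        """Look up a virtual page number; return True on a hit."""
        if vpn in self.map:
            self.map.move_to_end(vpn)  # mark as most recently used
            return True
        self.misses += 1
        self.map[vpn] = True
        if len(self.map) > self.entries:
            self.map.popitem(last=False)  # evict least recently used
        return False


def count_misses(trace, num_cores, entries_per_core, shared):
    """Count total misses for a trace of (core, vpn) pairs.

    shared=False: one private TLB per core.
    shared=True:  one TLB holding num_cores * entries_per_core entries.
    """
    if shared:
        tlbs = [LRUTLB(num_cores * entries_per_core)]
    else:
        tlbs = [LRUTLB(entries_per_core) for _ in range(num_cores)]
    for core, vpn in trace:
        tlb = tlbs[0] if shared else tlbs[core]
        tlb.access(vpn)
    return sum(t.misses for t in tlbs)


# Synthetic trace: 4 cores each touch the same 8 pages, twice.
NUM_CORES, ENTRIES, PAGES = 4, 8, 8
trace = [(core, vpn)
         for _ in range(2)
         for vpn in range(PAGES)
         for core in range(NUM_CORES)]

private_misses = count_misses(trace, NUM_CORES, ENTRIES, shared=False)
shared_misses = count_misses(trace, NUM_CORES, ENTRIES, shared=True)
print(private_misses, shared_misses)
```

In this model each private TLB must miss once per page, while the shared TLB misses only when the first core touches a page, so later cores hit on translations another core already fetched. Real workloads share only a fraction of their pages and a shared structure is slower to access, so this sketch says nothing about the magnitude of the gains reported in the abstract.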