Hardware Translation Coherence for Virtualized Systems

Guilherme Cox

doi:10.1145/3140659.3080211

2017, ACM SIGARCH Computer Architecture News

14 pages

Abstract

To improve system performance, operating systems (OSes) often undertake activities that require modification of virtual-to-physical address translations. For example, the OS may migrate data between physical pages to manage heterogeneous memory devices. We refer to such activities as page remappings. Unfortunately, page remappings are expensive. We show that a large part of this cost arises from address translation coherence, particularly on systems employing virtualization. In response, we propose hardware translation invalidation and coherence, or HATRIC, a readily implementable hardware mechanism that piggybacks translation coherence atop existing cache coherence protocols. We perform detailed studies using KVM-based virtualization, showing that HATRIC achieves up to 30% performance and 10% energy benefits, for per-CPU area overheads of 0.2%. We also quantify HATRIC's benefits on systems running Xen and find up to 33% performance improvements.
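
The abstract does not describe how the mechanism is built, so the following is only a toy illustration of the general idea of piggybacking translation invalidation on cache coherence, not HATRIC's actual design. It assumes, purely for illustration, that each cached translation remembers the address of the page-table entry it was filled from, and that a directory tracks which CPUs hold that entry. All class names, method names, and addresses are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TLB:
    """Toy per-CPU TLB whose entries remember their source page-table entry."""
    # Maps virtual page number -> (physical page number, address of source PTE).
    entries: dict = field(default_factory=dict)

    def fill(self, vpn, ppn, pte_addr):
        self.entries[vpn] = (ppn, pte_addr)

    def on_coherence_invalidate(self, pte_addr):
        # Drop only the translations filled from the PTE being written.
        stale = [vpn for vpn, (_, src) in self.entries.items() if src == pte_addr]
        for vpn in stale:
            del self.entries[vpn]
        return len(stale)


class Directory:
    """Toy coherence directory tracking which CPUs cached each PTE (assumed)."""

    def __init__(self, tlbs):
        self.tlbs = tlbs
        self.sharers = {}  # pte_addr -> set of CPU ids

    def record_walk(self, cpu, vpn, ppn, pte_addr):
        # A page walk fills the CPU's TLB and registers it as a sharer of the PTE.
        self.tlbs[cpu].fill(vpn, ppn, pte_addr)
        self.sharers.setdefault(pte_addr, set()).add(cpu)

    def write_pte(self, pte_addr):
        # A store to the PTE invalidates its sharers, as for any cached line;
        # in this sketch the same message also reaches each sharer's TLB.
        notified = self.sharers.pop(pte_addr, set())
        for cpu in notified:
            self.tlbs[cpu].on_coherence_invalidate(pte_addr)
        return notified


def remap_demo():
    tlbs = [TLB() for _ in range(4)]
    directory = Directory(tlbs)
    directory.record_walk(cpu=0, vpn=0x10, ppn=0xA0, pte_addr=0x1000)
    directory.record_walk(cpu=2, vpn=0x10, ppn=0xA0, pte_addr=0x1000)
    directory.record_walk(cpu=2, vpn=0x20, ppn=0xB0, pte_addr=0x1008)
    notified = directory.write_pte(0x1000)  # a page remapping touches one PTE
    return notified, [sorted(t.entries) for t in tlbs]
```

In this sketch only the CPUs that actually cached the modified entry are contacted, and only the matching translations are dropped, whereas a software approach would typically interrupt every CPU that might hold the mapping and flush more broadly.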

