Hardware Translation Coherence for Virtualized Systems
2017, ACM SIGARCH Computer Architecture News
https://doi.org/10.1145/3140659.3080211
Abstract
To improve system performance, operating systems (OSes) often undertake activities that require modification of virtual-to-physical address translations. For example, the OS may migrate data between physical pages to manage heterogeneous memory devices. We refer to such activities as page remappings. Unfortunately, page remappings are expensive. We show that a large part of this cost arises from address translation coherence, particularly on systems employing virtualization. In response, we propose hardware translation invalidation and coherence, or HATRIC, a readily implementable hardware mechanism that piggybacks translation coherence atop existing cache coherence protocols. We perform detailed studies using KVM-based virtualization, showing that HATRIC achieves up to 30% performance and 10% energy benefits, for per-CPU area overheads of 0.2%. We also quantify HATRIC's benefits on systems running Xen and find up to 33% performance improvements.