Hardware Translation Coherence for Virtualized Systems

2017, ACM SIGARCH Computer Architecture News

https://doi.org/10.1145/3140659.3080211

Abstract

To improve system performance, operating systems (OSes) often undertake activities that require modification of virtual-to-physical address translations. For example, the OS may migrate data between physical pages to manage heterogeneous memory devices. We refer to such activities as page remappings. Unfortunately, page remappings are expensive. We show that a large part of this cost arises from address translation coherence, particularly on systems employing virtualization. In response, we propose hardware translation invalidation and coherence, or HATRIC, a readily implementable hardware mechanism that piggybacks translation coherence atop existing cache coherence protocols. We perform detailed studies using KVM-based virtualization, showing that HATRIC achieves up to 30% performance and 10% energy benefits, for per-CPU area overheads of 0.2%. We also quantify HATRIC's benefits on systems running Xen and find up to 33% performance improvements.
