An Early Functional and Performance Experiment of the MarFS Hybrid Storage EcoSystem
Many computing sites, LANL among them, have a requirement for long-term retention of mostly cold data. Although the main function of this storage tier is capacity, it also has a bandwidth requirement. For many years, tape was the most economical solution for this requirement. Over time, however, data sets have grown faster than tape bandwidth has improved, and we have now entered a regime in which disk is the more economically efficient medium for this storage tier. Data also increasingly dominates the computing world: there is a "sea" of data in many different formats, such as file, object, and key-value, that needs to be efficiently managed and effectively used. In this paper, we introduce a new hybrid storage system named MarFS. MarFS is a near-POSIX file system that uses scale-out commercial/cloud object storage for data and many POSIX file systems for metadata services. It supports a data lake for HPC, sitting on industry-standard commodity storage hardware as a software layer that provides a global namespace and near-POSIX semantics, and it can serve as an umbrella over a variety of underlying storage layers. We present the system architecture of the proposed MarFS near-POSIX file system, report early functional and performance tests of MarFS's software components, and describe the current deployment status and future development plans for MarFS.
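To make the data/metadata split concrete, here is a minimal Python sketch of a MarFS-like layer, assuming data objects land in a scale-out object store while a POSIX file system holds one small metadata file per logical file. The ObjectStore stand-in and the names MarfsLike, write, and read are illustrative assumptions, not the actual MarFS interfaces.

```python
# Minimal sketch of a MarFS-like data/metadata split (illustrative only).
# Data bytes go to an object store; a POSIX tree holds a tiny metadata
# file per logical file that records which object holds the data.
import json
import os
import uuid

class ObjectStore:
    """In-memory stand-in for a scale-out commercial/cloud object store."""
    def __init__(self):
        self.objects = {}
    def put(self, oid, data):
        self.objects[oid] = data
    def get(self, oid):
        return self.objects[oid]

class MarfsLike:
    def __init__(self, md_root, store):
        os.makedirs(md_root, exist_ok=True)
        self.md_root, self.store = md_root, store

    def _md_path(self, path):
        return os.path.join(self.md_root, path.lstrip("/"))

    def write(self, path, data):
        oid = uuid.uuid4().hex
        self.store.put(oid, data)                 # data -> object tier
        md_path = self._md_path(path)
        os.makedirs(os.path.dirname(md_path), exist_ok=True)
        with open(md_path, "w") as f:             # metadata -> POSIX tier
            json.dump({"oid": oid, "size": len(data)}, f)

    def read(self, path):
        with open(self._md_path(path)) as f:      # namespace lookup is POSIX
            return self.store.get(json.load(f)["oid"])

fs = MarfsLike("/tmp/marfs-md", ObjectStore())
fs.write("/proj/run1/out.dat", b"checkpoint bytes")
assert fs.read("/proj/run1/out.dat") == b"checkpoint bytes"
```

Because the namespace lives in ordinary POSIX directories, familiar tools and semantics keep working, while the bulk data rides on commodity object storage.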
To improve the checkpoint bandwidth of critical applications at LANL, we developed the Parallel Log-Structured File System (PLFS) [1]. PLFS is a transformative I/O middleware layer placed within our storage stack: it transforms a concurrently written single shared file into non-shared component pieces. This reorganized I/O has made write size a non-issue and improved checkpoint performance by orders of magnitude, meeting the project's L2 milestone to demonstrate increased checkpointing performance with LANL codes. LANL is working with EMC under an umbrella Cooperative Research and Development Agreement (CRADA) to further enhance, design, build, test, and deploy PLFS. PLFS has been integrated with multiple types of storage systems, including cloud storage, and has shown improvements in file storage sizes and metadata rates.
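The reorganization PLFS performs can be sketched in a few lines of Python: each writer appends to its own log and records where each logical chunk landed, so N processes writing one shared file never contend on a single stream. The names PlfsLikeFile, write_at, and read_at are hypothetical, not the real PLFS API, and the real system persists its index rather than keeping it in memory.

```python
# Illustrative sketch of PLFS-style log structuring (not the real API).
import os

class PlfsLikeFile:
    """One instance per writing process (rank); logs are never shared."""
    def __init__(self, container, rank):
        os.makedirs(container, exist_ok=True)
        self.log_path = os.path.join(container, f"data.{rank}")
        self.log = open(self.log_path, "ab")
        self.index = []  # (logical_offset, length, physical_offset)

    def write_at(self, logical_offset, data):
        physical_offset = self.log.tell()
        self.log.write(data)      # every write becomes a sequential append
        self.index.append((logical_offset, len(data), physical_offset))

    def read_at(self, logical_offset, length):
        self.log.flush()
        # Newest matching index entry wins, as in a log-structured design.
        for l_off, l_len, p_off in reversed(self.index):
            if l_off <= logical_offset and logical_offset + length <= l_off + l_len:
                with open(self.log_path, "rb") as f:
                    f.seek(p_off + (logical_offset - l_off))
                    return f.read(length)
        raise IOError("logical range was never written")

f = PlfsLikeFile("/tmp/plfs-container", rank=0)
f.write_at(4096, b"strided checkpoint chunk")  # arbitrary offset, small write
assert f.read_at(4096, 24) == b"strided checkpoint chunk"
```

A global read would merge the per-rank indexes; the key point is that the backing parallel file system only ever sees independent sequential streams, the access pattern it handles best.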
Storage Systems and Input/Output: Organizing, Storing, and Accessing Data for Scientific Discovery. Report for the DOE ASCR Workshop on Storage Systems and I/O. [Summary Brief]
The Crossroads supercomputer was designed to simulate some of the most complex physical devices in the world. These simulations routinely require half a petabyte or more of system memory and run on thousands of compute nodes for months at a time on the most powerful supercomputers. Improvements in time to solution for these workloads can have a major impact on our mission capabilities. In this paper we present early results for representative application workloads on 4th Gen Intel Xeon processors (codenamed Sapphire Rapids) with HBM. These results demonstrate an extremely promising 8.57x node-to-node improvement over our prior-generation Intel Broadwell (BDW) based HPC systems. No code modifications were required to achieve this speedup, providing a compelling path toward major reductions in time to solution and toward simulating more complex physical systems in the future.
DOE Advanced Scientific Computing Advisory Subcommittee (ASCAC) Report: Top Ten Exascale Research Challenges
Exascale computing systems are essential for the scientific fields that will transform the 21st-century global economy, including energy, biotechnology, nanotechnology, and materials science. Progress in these fields is predicated on the ability to perform advanced scientific and engineering simulations and to analyze the resulting deluge of data. On July 29, 2013, ASCAC was charged by Patricia Dehmer, the Acting Director of the Office of Science, to assemble a subcommittee to provide advice on exascale computing. The subcommittee was directed to return a list of no more than ten technical approaches (hardware and software) that would enable the development of a system achieving the Department's goals for exascale computing. Numerous reports over the past few years have documented the technical challenges and the non-viability of simply scaling existing computer designs to reach exascale. These challenges revolve around energy consumption, memory performance, resilience, extreme concurrency, and big data. Drawing from these reports and more recent experience, this ASCAC subcommittee has identified the top ten computing technology advancements that are critical to building a capable, economically viable exascale system.