In 2003, the High End Computing Revitalization Task Force designated file systems and I/O as an area in need of national focus. The purpose of the High End Computing Interagency Working Group (HECIWG) is to coordinate government spending on File Systems and I/O (FSIO) R&D by all the government agencies that are involved in High End Computing. The HECIWG tasked a smaller advisory group to list, categorize, and prioritize HEC I/O and File Systems R&D needs. In 2005, leaders in FSIO from academia, industry, and government agencies collaborated to list and prioritize areas of research in HEC FSIO. This led to a very successful High End Computing University Research Activity (HECURA) call from NSF in 2006 and has prompted a new HECURA call from NSF in 2009. This paper serves both as a review of the research gaps identified by HEC FSIO in 2008 and as a preview of this forthcoming HECURA call.
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, 2013
The I/O bottleneck in high-performance computing is becoming worse as application data continues to grow. In this work, we explore how patterns of I/O within these applications can significantly affect the effectiveness of the underlying storage systems and how these same patterns can be utilized to improve many aspects of the I/O stack and mitigate the I/O bottleneck. We offer three main contributions in this paper. First, we develop and evaluate algorithms by which I/O patterns can be efficiently discovered and described. Second, we implement one such algorithm to reduce the metadata quantity in a virtual parallel file system by up to several orders of magnitude, thereby increasing the performance of writes and reads by up to 40 and 480 percent respectively. Third, we build a prototype file system with pattern-aware prefetching and evaluate it to show a 46 percent reduction in I/O latency. Finally, we believe that efficient pattern discovery and description, coupled with the observed predictability of complex patterns within many high-performance applications, offers significant potential to enable many additional I/O optimizations.
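As a concrete illustration of the pattern-description idea above, the sketch below collapses a fixed-stride sequence of write records into a single compact descriptor; the function name and descriptor fields are our own invention for illustration, not the paper's algorithm.

```python
# Hypothetical sketch: collapsing a strided sequence of write records
# into one compact pattern descriptor, the general idea behind
# pattern-based metadata reduction.

def describe_strided(records):
    """records: list of (offset, length) tuples in arrival order.
    Returns a compact descriptor if the accesses form a fixed-stride,
    fixed-size pattern, else None."""
    if len(records) < 2:
        return None
    length = records[0][1]
    stride = records[1][0] - records[0][0]
    for prev, cur in zip(records, records[1:]):
        if cur[1] != length or cur[0] - prev[0] != stride:
            return None  # irregular access; fall back to per-record metadata
    return {"start": records[0][0], "stride": stride,
            "length": length, "count": len(records)}

# 1000 per-record metadata entries collapse to one four-field descriptor.
writes = [(i * 4096, 1024) for i in range(1000)]
print(describe_strided(writes))
```

The same shape of descriptor also supports prefetching: once a stride is recognized, the next offsets are predictable.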
2012 IEEE International Conference on Cluster Computing, 2012
Extracting high data bandwidth and metadata rates from parallel file systems is notoriously difficult. User workloads almost never achieve the performance of synthetic benchmarks. The reason for this is that real-world applications are not as well-aligned, well-tuned, or consistent as are synthetic benchmarks. There are at least three possible ways to address this challenge: modification of the real-world workloads, modification of the underlying parallel file systems, or reorganization of the real-world workloads using transformative middleware. In this paper, we demonstrate that transformative middleware is applicable across a large set of high performance computing workloads and is portable across the three major parallel file systems in use today. We also demonstrate that our transformative middleware layer is capable of improving the write, read, and metadata performance of I/O workloads by up to 150x, 10x, and 17x respectively, on workloads with processor counts of up to 65,536.
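One way transformative middleware of this kind can reorganize a workload is PLFS-style log-structuring: each process appends to its own physical log and keeps an index mapping logical offsets to physical ones, so a poorly-aligned shared-file workload becomes N well-formed sequential streams. This is a minimal sketch under that assumption; the class and field names are invented.

```python
# Illustrative sketch (invented names): log-structured reorganization.
# Non-contiguous logical writes become contiguous physical appends,
# with an index recording where each logical extent actually landed.

class LogStructuredWriter:
    def __init__(self, rank):
        self.log = bytearray()   # stand-in for a per-process data file on disk
        self.index = []          # (logical_offset, physical_offset, length)
        self.rank = rank

    def write(self, logical_offset, data):
        self.index.append((logical_offset, len(self.log), len(data)))
        self.log += data         # always a sequential append

w = LogStructuredWriter(rank=0)
w.write(4096, b"aaaa")   # non-contiguous logical offsets...
w.write(0, b"bbbb")      # ...still land as contiguous appends
print(w.index)
```

Reads consult the index to reassemble the logical file, which is where the metadata-quantity concerns of the previous abstract come in.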
2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), 2012
In the petascale era, the storage stack used by the extreme scale high performance computing community is fairly homogeneous across sites. On the compute edge of the stack, file system clients or IO forwarding services direct IO over an interconnect network to a relatively small set of IO nodes. These nodes forward the requests over a secondary storage network to a spindle-based parallel file system. Unfortunately, this architecture will become unviable in the exascale era. As the density growth of disks continues to outpace increases in their rotational speeds, disks are becoming increasingly cost-effective for capacity but decreasingly so for bandwidth. Fortunately, new storage media such as solid state devices are filling this gap; although not cost-effective for capacity, they are so for performance. This suggests that the storage stack at exascale will incorporate solid state storage between the compute nodes and the parallel file systems. There are three natural places into which to position this new storage layer: within the compute nodes, the IO nodes, or the parallel file system. In this paper, we argue that the IO nodes are the appropriate location for HPC workloads and show results from a prototype system that we have built accordingly. Running a pipeline of computational simulation and visualization, we show that our prototype system reduces total time to completion by up to 30%.
2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), 2012
In applications ranging from radio telescopes to Internet traffic monitoring, our ability to generate data has outpaced our ability to effectively capture, mine, and manage it. These ultra-high-bandwidth data streams typically contain little useful information and most of the data can be safely discarded. Periodically, however, an event of interest is observed and a large segment of the data must be preserved, including data preceding detection of the event. Doing so requires guaranteed data capture at source rates, line speed filtering to detect events and data points of interest, and TiVo-like ability to save past data once an event has been detected. We present Valmar, a system for guaranteed capture, indexing, and storage of ultra-high-bandwidth data streams. Our results show that Valmar performs at nearly full disk bandwidth, up to several orders of magnitude faster than flat file and database systems, works well with both small and large data elements, and allows concurrent read and search access without compromising data capture guarantees.
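The "TiVo-like" capture behavior described above can be illustrated with a bounded ring buffer that overwrites old data at capture rate but preserves the preceding window when an event fires. This is only a toy sketch with invented names, not Valmar's implementation.

```python
# Toy sketch (invented names): bounded capture window with
# event-triggered preservation of the data that preceded the event.

from collections import deque

class CaptureBuffer:
    def __init__(self, capacity):
        self.ring = deque(maxlen=capacity)  # oldest items fall off automatically
        self.preserved = []                 # frozen windows, one per event

    def capture(self, item, is_event):
        self.ring.append(item)
        if is_event(item):
            # Event detected: preserve the whole window, including
            # the data that arrived before the event itself.
            self.preserved.append(list(self.ring))

buf = CaptureBuffer(capacity=4)
for x in [1, 2, 3, 99, 5]:
    buf.capture(x, is_event=lambda v: v > 50)
print(buf.preserved)
```

A real system must additionally sustain this at source rates and index the preserved windows for concurrent search, which is where the engineering difficulty lies.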
International Series in Operations Research & Management Science, 2004
We describe NeST, a flexible software-only storage appliance designed to meet the storage needs of the Grid. NeST has three key features that make it well-suited for deployment in a Grid environment. First, NeST provides a generic data transfer architecture that supports multiple data transfer protocols (including GridFTP and NFS), and allows for the easy addition of new protocols. Second, NeST is dynamic, adapting itself on-the-fly so that it runs effectively on a wide range of hardware and software platforms. Third, NeST is Grid-aware, implying that features that are necessary for integration into the Grid, such as storage space guarantees, mechanisms for resource and data discovery, user authentication, and quality of service, are a part of the NeST infrastructure. We include a practical discussion about building grid tools using the NeST software.
Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid, 2001
The key to storage manageability is adaptation. In traditional storage systems, adaptation is performed by a human administrator, who must assess problems, and then manually adjust various knobs and levers to bring the behavior of the system back to an acceptable level. Future storage systems must themselves adapt, and in doing so, reduce the need for manual intervention. In this paper, we describe the Wisconsin Network Disks project (WiND), wherein we seek to understand and develop the key adaptive techniques required to build a truly manageable network-attached storage system. WiND gracefully and efficiently adapts to changes in the environment, reducing the burden of administration and increasing the flexibility and performance of storage for an eclectic range of clients. In particular, WiND will automatically adapt to the addition of new disks to the system, the failure or erratic performance of existing disks, and changes in client workload and access patterns. (The group at CMU refers to NASD as "network-attached secure disks"; since we wish to refer to something more general, we simply refer to disks on a network as network-attached storage devices.)
Proceedings 11th IEEE International Symposium on High Performance Distributed Computing
We present NeST, a flexible software-only storage appliance designed to meet the storage needs of the Grid. NeST has three key features that make it well-suited for deployment in a Grid environment. First, NeST provides a generic data transfer architecture that supports multiple data transfer protocols (including GridFTP and NFS), and allows for the easy addition of new protocols. Second, NeST is dynamic, adapting itself on-the-fly so that it runs effectively on a wide range of hardware and software platforms. Third, NeST is Grid-aware, implying that features that are necessary for integration into the Grid, such as storage space guarantees, mechanisms for resource and data discovery, user authentication, and quality of service, are a part of the NeST infrastructure.
Proceedings of the 4th Annual Workshop on Petascale Data Storage, 2009
MapReduce-tailored distributed filesystems, such as HDFS for Hadoop MapReduce, and parallel high-performance computing filesystems are tailored for considerably different workloads. The purpose of our work is to examine the performance of each filesystem when both sorts of workload run on it concurrently. We examine two workloads on two filesystems. For the HPC workload, we use the IOR checkpointing benchmark and the Parallel Virtual File System, Version 2 (PVFS); for Hadoop, we use an HTTP attack classifier and the CloudStore filesystem. We analyze the performance of each file system when it concurrently runs its "native" workload as well as the non-native workload.
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, 2010
Due to the explosive growth in the size of scientific data sets, data-intensive computing is an emerging trend in computational science. Many application scientists are looking to integrate data-intensive computing into computation-intensive High Performance Computing facilities, particularly for data analytics. We have observed several scientific applications which must migrate their data from an HPC storage system to a data-intensive one. There is a gap between the data semantics of HPC storage and data-intensive systems; hence, once migrated, the data must be further refined and reorganized. This reorganization requires at least two complete scans through the data set and then at least one MapReduce program to prepare the data before analyzing it. Running multiple MapReduce phases causes significant overhead for the application, in the form of excessive I/O operations. For every MapReduce application that must be run in order to complete the desired data analysis, a distributed read and write operation on the file system must be performed. Our contribution is to extend MapReduce to eliminate the multiple scans and also reduce the number of pre-processing MapReduce programs. We have added additional expressiveness to the MapReduce language to allow users to specify the logical semantics of their data such that 1) the data can be analyzed without running multiple data pre-processing MapReduce programs, and 2) the data can be simultaneously reorganized as it is migrated to the data-intensive file system. Using our augmented MapReduce system, MapReduce with Access Patterns (MRAP), we have demonstrated up to 33% throughput improvement in one real application, and up to 70% in an I/O kernel of another application.
2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, 2011
Mean Time To Failure (MTTF) is a commonly accepted metric for reliability. In this paper we present a novel approach to achieve the desired MTTF with minimum redundancy. We analyze the failure behavior of large scale systems using failure logs collected by Los Alamos National Laboratory. We analyze the root cause of failures and present a choice of specific hardware and software components to be made fault-tolerant, through duplication, to achieve target MTTF at minimum expense. Not all components show similar failure behavior in the systems. Our objective, therefore, was to arrive at an ordering of components to be incrementally selected for protection to achieve a target MTTF. We propose a model for MTTF for tolerating failures in a specific component, system-wide, and order components according to the coverage provided. Systems grouped based on hardware configuration showed similar improvements in MTTF when different components in them were targeted for fault-tolerance.
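A toy version of the component-ordering idea can be sketched under two simplifying assumptions that are ours, not the paper's model: MTTF is estimated as operating hours divided by failure count, and protecting a component removes all of its failures from the log. The component names and counts below are illustrative, not from the LANL logs.

```python
# Toy sketch (our simplification): rank components by how much
# protecting each one would raise the estimated system MTTF.

def mttf(hours, n_failures):
    """Estimated MTTF: observed operating hours per failure."""
    return hours / n_failures if n_failures else float("inf")

def rank_components(hours, failures_by_component):
    total = sum(failures_by_component.values())
    base = mttf(hours, total)
    # MTTF gain from removing each component's failures, system-wide.
    gains = {c: mttf(hours, total - n) - base
             for c, n in failures_by_component.items()}
    # Protect the component with the largest gain first.
    return sorted(gains, key=gains.get, reverse=True)

log = {"memory": 120, "disk": 45, "network": 30, "software": 205}
print(rank_components(hours=8760, failures_by_component=log))
```

The greedy ordering falls out directly: each step protects whichever remaining component buys the most MTTF per unit of added redundancy.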
2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 2012
HDF5 is a data model, library and file format for storing and managing data. It is designed for flexible and efficient I/O for high volume and complex data. Natively, it uses a single-file format where multiple HDF5 objects are stored in a single file. In a parallel HDF5 application, multiple processes access a single file, thereby resulting in a performance bottleneck in I/O. Additionally, a single-file format does not allow semantic post processing on individual objects outside the scope of the HDF5 application. We have developed a new plugin for HDF5 using its Virtual Object Layer that serves two purposes: 1) it uses PLFS to convert the single-file layout into a data layout that is optimized for the underlying file system, and 2) it stores data in a unique way that enables semantic post-processing on data. We measure the performance of the plugin and discuss work leveraging the new semantic post-processing functionality it enables. We further discuss the applicability of this approach for exascale burst buffer storage systems.
As HPC applications run on increasingly high process counts on larger and larger machines, both the frequency of checkpoints needed for fault tolerance [14] and the resolution and size of Data Analysis Dumps are expected to increase proportionally. In order to maintain an acceptable ratio of time spent performing useful computation work to time spent performing I/O, write bandwidth to ...
Condor is a distributed system that harnesses the power of users' unused workstations to deliver large amounts of computing to CPU intensive projects. Because users can and do claim their machines at unforeseeable times, Condor checkpoints programs' state periodically and migrates interrupted jobs to new host machines. Additionally, Condor checkpoints a job when it detects user activity at the terminal; this is called a vacate checkpoint. As enrollment in a Condor pool is usually voluntary, the Condor system must ...
We present measurements and analysis of the Linux ext3 file system. We develop and apply a novel analysis method known as semantic block-level analysis (SBA), which examines the low-level block stream that a file system generates in order to understand its behavior under a series of controlled workloads. We use SBA to evaluate the strengths and weaknesses of the ext3 design and implementation; in comparison to standard benchmarking approaches, SBA enables us to understand why the file system behaves in ...
Intelligent routing using network processors: guiding design through analysis
We present our experience of turning a Linux cluster into a high-performance parallel sorting system. Our implementation, WIND-SORT, broke the Datamation record by roughly a factor of two, sorting 1 million 100-byte records in 0.48 seconds. We have identified three keys to our success: developing a fast remote execution service, configuring the cluster properly, and avoiding the potential ill-effects of occasionally faulty hardware.
Papers by John Bent