Data prefetching and cache replacement algorithms have been intensively studied in the design of high-performance microprocessors. Typically, the data prefetcher operates in the private caches and does not interact with the replacement policy in the shared Last-Level Cache (LLC). Similarly, most replacement policies do not treat demand and prefetch requests as different types of requests. In particular, program counter (PC)-based replacement policies cannot learn from prefetch requests, since the data prefetcher does not generate a PC value. PC-based policies can also be negatively affected by compiler optimizations. In this paper, we propose a holistic cache management technique called Kill-the-PC (KPC) that overcomes the weaknesses of traditional prefetching and replacement policy algorithms. KPC cache management makes three novel contributions. First, a prefetcher approximates the future use distance of prefetch requests based on its prediction confidence. Second, a simple replacement policy based on global hysteresis provides similar or better performance than current state-of-the-art PC-based prediction. Third, KPC integrates prefetching and replacement policy into a whole system which is greater than the sum of its parts: information from the prefetcher is used to improve the performance of the replacement policy and vice-versa. Finally, KPC removes the need to propagate the PC through the entire on-chip cache hierarchy while providing a holistic cache management approach with better performance than state-of-the-art PC- and non-PC-based schemes. Our evaluation shows that KPC provides 8% better...
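To make the first contribution concrete, here is a minimal C sketch of the general idea: a prefetcher's confidence counter is mapped to an RRIP-style insertion position, so low-confidence prefetches land near the eviction point while high-confidence ones are retained. The thresholds, counter width, and function name are illustrative assumptions, not KPC's actual design.

```c
#include <stdint.h>

/* Hypothetical sketch: map a prefetcher's confidence to an RRIP-style
 * re-reference prediction value (RRPV) on LLC insertion. High-confidence
 * prefetches are predicted to be reused soon, so they insert with a low
 * RRPV; low-confidence prefetches insert at the eviction point. */

#define RRPV_MAX 3  /* 2-bit RRPV, as in SRRIP */

static inline uint8_t insertion_rrpv(uint8_t confidence /* 0..15 */)
{
    if (confidence >= 12) return 0;            /* near-immediate reuse     */
    if (confidence >= 8)  return RRPV_MAX - 1; /* intermediate reuse       */
    return RRPV_MAX;                           /* distant reuse: evict 1st */
}
```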
Network-on-Chip (NoC) designs have emerged as a replacement for traditional shared-bus designs for on-chip communication. Typically, these systems require fully balanced clock distribution trees to enable synchronous communication between all nodes on-chip, resulting in higher power consumption. One approach to reduce power consumption is to replace the balanced clock tree with a globally-asynchronous, locally-synchronous (GALS) mesochronous clocking scheme. NoCs implemented with a GALS clocking scheme, however, tend to have high latencies, as packets must be synchronized at every hop between source and destination. In this paper, we propose a novel router microarchitecture for GALS NoCs which offers superior performance versus typical synchronizing router designs. Our approach features Asynchronous Bypass Channels (ABCs) at intermediate nodes, thus avoiding synchronization delay. We also propose a new network topology that leverages the advantages of the bypass channel offered by our router design. Our experiments show that our design improves the performance of a conventional synchronizing design with similar resources by up to 26% at low loads and increases saturation throughput by up to 11%.
ACM Transactions on Architecture and Code Optimization
Industry is moving towards large-scale hardware systems which bundle processor cores, memories, accelerators, etc. via 2.5D integration. These components are fabricated separately as chiplets and then integrated using an interposer as an interconnect carrier. This new design style is beneficial in terms of yield and economies of scale, as chiplets may come from various vendors and are relatively easy to integrate into one larger sophisticated system. However, the benefits of this approach come at the cost of new security challenges, especially when integrating chiplets that come from untrusted, or not fully trusted, third-party vendors. In this work, we explore these challenges for modern interposer-based systems of cache-coherent, multi-core chiplets. First, we present basic coherence-oriented hardware Trojan attacks that pose a significant threat to chiplet-based designs and demonstrate how these basic attacks can be orchestrated to pose a significant threat to interposer-based sy...
Hybrid memory systems, comprised of emerging non-volatile memory (NVM) and DRAM, have been proposed to address the growing memory demand of applications. Emerging NVM technologies, such as phase-change memories (PCM), memristor, and 3D XPoint, have higher capacity density, minimal static power consumption, and lower cost per GB. However, NVM has longer access latency and limited write endurance as opposed to DRAM. The different characteristics of the two memory classes point towards the design of hybrid memory systems containing multiple classes of main memory. In the iterative and incremental development of new architectures, the timeliness of simulation completion is critical to project progression. Hence, a highly efficient simulation method is needed to evaluate the performance of different hybrid memory system designs. Design exploration for hybrid memory systems is challenging, because it requires emulation of the full system stack, including the OS, memory controller, and interconnect. Moreover, benchmark applications for memory performance tests typically have much larger working sets, thus requiring an even longer simulation warm-up period. In this paper, we propose an FPGA-based hybrid memory system emulation platform. We target mobile computing systems, which are sensitive to energy consumption and are likely to adopt NVM for its power efficiency. Because the focus of our platform is on the design of the hybrid memory system, we leverage the on-board hard IP ARM processors to enhance simulation performance while improving the accuracy of results. Thus, users can implement their data placement/migration policies with the FPGA logic elements and evaluate new designs quickly and effectively. Results show that our emulation platform provides a speedup of 9280x in simulation time compared to the software counterpart gem5.
Processor design validation and debug is a difficult and complex task, which consumes the lion's share of the design process. Design bugs that affect processor performance rather than its functionality are especially difficult to catch, particularly in new microarchitectures. This is because, unlike functional bugs, the correct processor performance of new microarchitectures on complex, long-running benchmarks is typically not deterministically known. Thus, when performance benchmarking new microarchitectures, performance teams may assume that the design is correct when the performance of the new microarchitecture exceeds that of the previous generation, despite significant performance regressions existing in the design. In this work we present a two-stage, machine learning-based methodology that is able to detect the existence of performance bugs in microprocessors. Our results show that our best technique detects 91.5% of microprocessor core performance bugs whose average IPC impact across the studied applications is greater than 1% versus a bug-free design, with zero false positives. When evaluated on memory system bugs, our technique achieves 100% detection with zero false positives. Moreover, the detection is automatic, requiring very little performance engineer time.
Erasure coding is widely used in storage systems to achieve fault tolerance while minimizing the storage overhead. Recently, Minimum Storage Regenerating (MSR) codes have emerged to minimize repair bandwidth while maintaining storage efficiency. Traditionally, erasure coding is implemented in the storage software stack, where poor cache performance and high CPU and memory utilization hinder normal operations and tie up resources that could be serving other user needs. In this paper, we propose a generic FPGA accelerator for MSR code encoding/decoding which maximizes the computation parallelism and minimizes the data movement between off-chip DRAM and the on-chip SRAM buffers. To demonstrate the efficiency of our proposed accelerator, we implemented the encoding/decoding algorithms for a specific MSR code called Zigzag code on a Xilinx VCU1525 acceleration card. Our evaluation shows our proposed accelerator can achieve ∼2.4-3.1x better throughput and ∼4.2-5.7x better power efficiency compared to the state-of-the-art multi-core CPU implementation, and ∼2.8-3.3x better throughput and ∼4.2-5.3x better power efficiency compared to a modern GPU accelerator.
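The inner computation such an accelerator parallelizes is, at bottom, wide XOR arithmetic over data blocks. As a rough illustration only — a plain systematic XOR parity, far simpler than a true MSR/Zigzag code, whose symbols are combined according to zigzag permutation patterns — this sketch shows the independent, data-parallel structure an FPGA pipeline can exploit:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: XOR parity over k data blocks. Each byte position
 * is independent of every other, so an FPGA can compute many positions
 * per cycle while streaming blocks from DRAM through SRAM buffers. */
void xor_parity(const uint8_t *const data[], size_t k,
                uint8_t *parity, size_t block_len)
{
    for (size_t j = 0; j < block_len; j++) {
        uint8_t acc = 0;
        for (size_t i = 0; i < k; i++)
            acc ^= data[i][j];
        parity[j] = acc;
    }
}
```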
As core counts increase, lock acquisition and release become even more critical because they lie on the critical path of shared memory applications. In this paper, we show that many applications exhibit regular and repeating lock sharing patterns. Based on this observation, we introduce SpecLock, an efficient hardware mechanism which speculates on the lock acquisition pattern between cores. Upon the release of a lock, the cache line containing the lock is speculatively forwarded to the next consumer of the lock. This forwarding action is performed via a specialized prefetch request and does not require coherence protocol modification. Further, the lock is not speculatively acquired; only the cache line containing the lock variable is placed in the private cache of the predicted consumer. Speculative forwarding serves to hide the remote core's lock acquisition latency. SpecLock is distributed, and all predictions are made locally at each core. We show that SpecLock captures 87% of predictable lock patterns correctly and improves performance by an average of 10% with 64 cores. SpecLock incurs a negligible overhead, with a 75% area reduction compared to past work. Compared to two state-of-the-art methods, SpecLock provides a speedup of 8% and 4%, respectively.
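A software model of the kind of per-lock successor prediction the abstract describes might look like the following; the table organization, sizes, and training rule are assumptions for illustration, not SpecLock's hardware.

```c
#include <stdint.h>

#define NUM_CORES     64
#define TABLE_ENTRIES 128

/* Hypothetical lock-consumer predictor: for each tracked lock, learn
 * which core tends to acquire it after each owner. On a release, the
 * predicted next consumer would receive the lock's cache line in its
 * private cache via a specialized prefetch (not modeled here). */
struct lock_entry {
    uintptr_t lock_addr;        /* tag: address of the lock variable */
    uint8_t   cur_owner;        /* core currently holding the lock   */
    uint8_t   succ[NUM_CORES];  /* learned successor per owner       */
    uint8_t   valid;
};

static struct lock_entry table[TABLE_ENTRIES];

static unsigned idx(uintptr_t a) { return (a >> 6) % TABLE_ENTRIES; }

/* Called when `core` acquires the lock: train the successor link. */
void on_acquire(uintptr_t lock, uint8_t core)
{
    struct lock_entry *e = &table[idx(lock)];
    if (e->valid && e->lock_addr == lock)
        e->succ[e->cur_owner] = core;   /* learn: cur_owner -> core */
    else { e->lock_addr = lock; e->valid = 1; }
    e->cur_owner = core;
}

/* Called on release: the lock itself is never speculatively acquired;
 * only the line would be forwarded to the returned core. */
uint8_t predict_next_consumer(uintptr_t lock)
{
    struct lock_entry *e = &table[idx(lock)];
    return (e->valid && e->lock_addr == lock)
               ? e->succ[e->cur_owner] : (uint8_t)0;
}
```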
The emerging non-volatile memory (NVM) has attractive characteristics such as DRAM-like low latency together with the non-volatility of storage devices. Recently, byte-addressable, memory bus-attached NVM has become available. This paper addresses the problem of combining a smaller, faster byte-addressable NVM with a larger, slower storage device, like SSD, to create the impression of a larger and faster byte-addressable NVM which can be shared across many applications. In this paper, we propose vNVML, a user space library for virtualizing and sharing NVM. vNVML provides applications with transaction-like memory semantics that ensure write ordering and persistency guarantees across system failures. vNVML exploits DRAM for read caching, to enable improvements in performance and potentially to reduce the number of writes to NVM, extending the NVM lifetime. vNVML is implemented and evaluated with realistic workloads to show that our library allows applications to share NVM, both in a single OS and when Docker-like containers are employed. The results from the evaluation show that vNVML incurs less than 10% overhead while providing the benefits of an expanded virtualized NVM space to the applications, allowing applications to safely share the virtual NVM.
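The write-ordering and persistence guarantees described here are commonly built on a redo-log discipline. The following is a minimal sketch of that general discipline, not vNVML's actual API or log format — all names are invented, `persist()` stands in for flushing cache lines to the NVM media, and records are assumed to be at most one cache line of data.

```c
#include <immintrin.h>   /* _mm_clflush, _mm_sfence */
#include <string.h>
#include <stdint.h>

/* Flush the cache lines covering [p, p+len) and fence, so the data is
 * durable on NVM before any subsequent store. */
static void persist(const void *p, size_t len)
{
    for (uintptr_t a = (uintptr_t)p & ~63ULL; a < (uintptr_t)p + len; a += 64)
        _mm_clflush((const void *)a);
    _mm_sfence();
}

struct log_rec { void *dst; size_t len; uint8_t data[64]; };

struct tx {
    struct log_rec recs[16];
    size_t n;
    volatile uint64_t committed;   /* commit marker, lives in NVM */
};

/* Stage a write: the redo-log entry is made durable before commit.
 * Assumes len <= 64 for brevity. */
void tx_write(struct tx *t, void *dst, const void *src, size_t len)
{
    struct log_rec *r = &t->recs[t->n++];
    r->dst = dst; r->len = len;
    memcpy(r->data, src, len);
    persist(r, sizeof *r);                   /* 1. log entry durable */
}

void tx_commit(struct tx *t)
{
    t->committed = 1;
    persist((const void *)&t->committed, 8); /* 2. commit marker     */
    for (size_t i = 0; i < t->n; i++) {      /* 3. apply in place    */
        memcpy(t->recs[i].dst, t->recs[i].data, t->recs[i].len);
        persist(t->recs[i].dst, t->recs[i].len);
    }
    t->committed = 0;                        /* 4. retire the log    */
    persist((const void *)&t->committed, 8);
}
```

After a crash, recovery replays the log only if the commit marker is set, which is what makes the in-place updates atomic with respect to failures.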
ACM Transactions on Architecture and Code Optimization, May 4, 2022
Packet classification methods rely upon matching packet content/header against pre-defined rules, which are generated by network applications and their configurations. With the rapid development of network technology and fast-growing network applications, users seek more enhanced, secure, and diverse network services. Hence it becomes critical to improve the performance of arbitrary matching operations. This paper presents SIMD-Matcher, an efficient Single Instruction Multiple Data (SIMD) and cache-friendly arbitrary matching framework. To further improve arbitrary matching performance, SIMD-Matcher adopts a trie node with a fixed high fanout and a varying span for each node depending on the data distribution. The trie node layout leverages the cache and modern processor features such as SIMD instructions. To support arbitrary matching, we first interpret arbitrary rules into three fields: value, mask, and priority. Second, to support insertion of randomly positioned wildcards into arbitrary rules, we propose the SIMD-Matcher extraction algorithm to process the wildcard bits. Third, we add an array of wildcard entries to the leaf entries, which stores the wildcard rules and guarantees the correctness of the matching result. Experiments show that SIMD-Matcher outperforms GenMatcher on large-scale rule and key sets in terms of search time, insert time, and memory cost. Specifically, with 5M rules, our method achieves a 2.7X speedup on search time, and insertion takes ∼7.3 seconds, a 1.38X speedup; meanwhile, the memory cost reduction is up to 6.17X.
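The value/mask/priority interpretation of arbitrary rules can be shown with a scalar reference version; the struct layout and names below are illustrative, and the SIMD trie layout itself is beyond a short sketch. A key matches a rule when it agrees with the rule's value on every bit the mask cares about, and the highest-priority match wins.

```c
#include <stdint.h>
#include <stddef.h>

struct rule {
    uint64_t value;
    uint64_t mask;      /* 1 = "care" bit; 0 = wildcard bit        */
    int      priority;  /* larger value wins among matching rules  */
};

/* Returns the index of the best matching rule, or -1 if none match.
 * A SIMD version would test many rules' (key ^ value) & mask in one
 * vector operation; this loop is the scalar equivalent. */
int match(const struct rule *rules, size_t n, uint64_t key)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (((key ^ rules[i].value) & rules[i].mask) == 0 &&
            (best < 0 || rules[i].priority > rules[best].priority))
            best = (int)i;
    }
    return best;
}
```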
Key-value (KV) stores have been widely deployed in a variety of scale-out enterprise applications such as online retail, big data analytics, social networks, etc. Key-Value SSDs (KVSSDs) provide a key-value interface directly from the device, aiming at lowering software overhead and reducing I/O amplification for such applications. In this paper, we present KVRAID, a high-performance, write-efficient erasure coding management scheme for emerging key-value SSDs. The core innovation of KVRAID is to use logical-to-physical key conversion to efficiently pack similar-size KV objects and dynamically manage the membership of erasure coding groups. Such a design enables packing multiple user objects into a single physical object to reduce the object amplification compared to prior works. By applying an out-of-place update technique, KVRAID can significantly reduce the I/O amplification compared to state-of-the-art designs. Our experiments show that KVRAID outperforms a state-of-the-art software KV-store with block RAID by 28x in terms of insert throughput and significantly reduces CPU utilization, tail latency, and write amplification. Compared to state-of-the-art erasure coding management for KV devices, KVRAID reduces object amplification by ∼2.6x compared to StripeFinder and reduces I/O amplification by ∼9.6x compared to KVMD and StripeFinder for update-intensive workloads.
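One way to picture the logical-to-physical key conversion is as size-class binning: similar-size user objects share a physical object of a uniform stripe unit. The sketch below is a hypothetical illustration of that idea only — the class boundaries, packing factor, and key encoding are assumptions, not KVRAID's actual format.

```c
#include <stdint.h>
#include <stddef.h>

#define PACK_PER_PHYS 4   /* user objects packed per physical object */

/* Bin a value into a power-of-two size class; classes start at 512 B. */
static unsigned size_class(size_t value_len)
{
    unsigned c = 0;
    while (((size_t)1 << (c + 9)) < value_len)
        c++;
    return c;
}

/* Physical placement of a logical object: a per-class sequence number
 * determines which packed physical object it joins and its slot there. */
struct phys_loc { unsigned cls; uint64_t phys_seq; unsigned slot; };

struct phys_loc map_logical(size_t value_len, uint64_t class_seq[])
{
    unsigned c = size_class(value_len);
    uint64_t seq = class_seq[c]++;
    struct phys_loc loc = {
        c, seq / PACK_PER_PHYS, (unsigned)(seq % PACK_PER_PHYS)
    };
    return loc;
}
```

Because each physical object has a uniform size within its class, erasure-coding group membership can be managed over physical objects rather than over raggedly-sized user objects.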
Shared-memory, multi-threaded applications often require programmers to insert thread synchronization primitives (i.e., locks, barriers, and condition variables) in critical sections to synchronize data access between threads. Scaling performance requires balanced per-thread workloads with little time spent in critical sections. In practice, however, threads often waste time waiting to acquire locks/barriers, leading to thread imbalance and poor performance scaling. Moreover, critical sections often stall data prefetchers that mitigate the effects of waiting by ensuring data is preloaded in core caches when the critical section is done. This paper introduces a pure hardware technique to enable safe data prefetching beyond synchronization points in chip multiprocessors (CMPs). We show that successful prefetching beyond synchronization points requires overcoming two significant challenges in existing techniques. First, typical prefetchers are designed to trigger prefetches based on current misses. Unlike cores in single-threaded applications, a multi-threaded core stalled on a synchronization point does not produce new references to trigger a prefetcher. Second, even if a prefetch were correctly directed to read beyond a synchronization point, it would likely prefetch shared data from another core before this data has been written. This prefetch would be considered "accurate" but highly undesirable, because it would lead to three extra "ping-pong" movements due to coherence, costing more latency and energy than without prefetching. We develop a new data prefetcher, Synchronization-aware B-Fetch (SB-Fetch), built as an extension to a previous single-threaded data prefetcher. SB-Fetch addresses both issues for shared memory multi-threaded workloads. The novelty in SB-Fetch is that it explicitly issues prefetches for data beyond synchronization points and it distinguishes between data likely and unlikely to incur cache coherence overhead. These two features are directly synergistic, since blindly prefetching beyond synchronization is likely to incur coherence penalties. No prior work includes both features. SB-Fetch is evaluated using a representative set of benchmarks from Parsec [4], Rodinia [7], and Parboil [39]. SB-Fetch improves execution time by 12.3% over the baseline and 4% over best-in-class prefetching.
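The second feature — distinguishing data likely to incur coherence overhead — amounts to a filter in front of the prefetch queue. Here is a small, hypothetical sketch of such a filter; the classifier interface and state names are invented for illustration and are not SB-Fetch's mechanism.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical coherence-aware prefetch filter: issue a prefetch past
 * a synchronization point only when the line is not predicted to be
 * actively written by another core, avoiding "accurate" prefetches
 * that would ping-pong the line via the coherence protocol. */
enum sharing_pred { PRIVATE, SHARED_STABLE, SHARED_ACTIVE_WRITER };

bool should_prefetch_past_sync(uint64_t line_addr,
                               enum sharing_pred (*classify)(uint64_t))
{
    switch (classify(line_addr)) {
    case PRIVATE:
    case SHARED_STABLE:          /* last written long ago; safe to pull */
        return true;
    case SHARED_ACTIVE_WRITER:   /* producer may still be writing       */
    default:
        return false;
    }
}
```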
Routing deadlocks, i.e., a cyclic dependence between buffered packets, are a fundamental network design challenge. Existing solutions require resource over-provisioning. We propose a new theory for deadlock freedom, Synchronized Progress in Interconnection Networks (SPIN), that solves the problem through coordinated movement of deadlocked packets.
Energy-efficient implementations of GF(p) and GF(2^m) elliptic curve cryptography
While public-key cryptography is essential for secure communications, the energy cost of even the most efficient algorithms based on Elliptic Curve Cryptography (ECC) is prohibitive on many ultra-low-energy devices such as sensor-network nodes and identification tags. Although an abundance of hardware acceleration techniques for ECC have been proposed in the literature, little research has focused on understanding the energy benefits of these techniques. Therefore, we evaluate the energy cost of ECC on several different hardware/software configurations across a range of security levels. Our work comprehensively explores implementations of both GF(p) and GF(2^m) ECC, demonstrating that GF(2^m) provides a 1.31 to 2.11 factor improvement in energy efficiency over GF(p) on an extended RISC processor. We also show that including a 4KB instruction cache in our system can reduce the energy cost of ECC by as much as 30%. Furthermore, our GF(2^m) coprocessor achieves a 2.8 to 3.61 factor improvement in energy efficiency compared to instruction set extensions and significantly outperforms prior work.
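The field arithmetic underlying GF(2^m) ECC is carry-free polynomial multiplication with reduction, which is why it maps so cheaply to hardware (XORs and shifts, no carry chains). As a minimal sketch, here is bit-serial multiplication for m = 8 with the reduction polynomial x^8 + x^4 + x^3 + x + 1; real ECC uses m in the hundreds with multi-word operands, but the structure is the same.

```c
#include <stdint.h>

/* Bit-serial multiplication in GF(2^8): for each bit of b, conditionally
 * XOR in a, then multiply a by x (a left shift) and reduce modulo the
 * field polynomial when the shift overflows bit 7. */
uint8_t gf2m_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;               /* add (XOR) current multiple of a  */
        uint8_t carry = a & 0x80;
        a <<= 1;                  /* multiply a by x                  */
        if (carry)
            a ^= 0x1B;            /* reduce by x^8 + x^4 + x^3 + x + 1 */
        b >>= 1;
    }
    return p;
}
```

The absence of carries is exactly what a GF(2^m) coprocessor exploits: each iteration is a handful of XOR gates, versus the adders and carry propagation GF(p) arithmetic requires.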
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Mar 1, 2018
Networks-on-chip (NoCs) are gaining in popularity as a replacement for shared-medium interconnects in chip multiprocessors (CMPs) and multiprocessor systems-on-chip (MPSoCs), and their performance is becoming essential to system performance. Emerging studies aim to achieve better power/energy efficiency on NoCs without performance degradation; however, the mechanisms of these power-efficient approaches still introduce non-negligible latency. To alleviate the latency problem and to transfer data efficiently with high utilization of interconnect resources, we propose an on-chip network architecture that improves both latency and bandwidth. Increasing the data/link widths across the network might resolve this problem but is a costly proposition, both in terms of device area and of power. Alternatively, we propose a dual-path router architecture that efficiently exploits path diversity to attain low latency without significant hardware overhead. By 1) doubling the number of injection and ejection ports, 2) splitting packets into two halves, 3) recomposing the routing policy to support path diversity, and 4) provisioning the network hardware design, we can considerably enhance network resource utilization to achieve much lower latency. The proposed simultaneous dual-path routing (SDPR) scheme outperformed the conventional dimension-order routing (DOR) technique across synthetic workloads by 31-40% in average latency, with up to a 100% improvement in throughput, running on a 49-core CMP. Our synthesizable model for the SDPR router and network provides accurate power and area reports. According to the synthesis reports, SDPR incurs insignificant overhead compared to the baseline XY DOR router.
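The packet-splitting idea can be sketched at the injection side: halve the packet and send the halves on separate ports over disjoint dimension-ordered paths (one XY, one YX), roughly halving serialization time. The field names and route encoding below are illustrative assumptions, not SDPR's implementation.

```c
#include <stdint.h>
#include <string.h>

enum route { ROUTE_XY, ROUTE_YX };

struct half_pkt { enum route r; uint16_t len; uint8_t payload[256]; };

/* Split a packet across the doubled injection ports: the two halves
 * traverse diverse XY/YX paths and are recomposed at the destination. */
void split_inject(const uint8_t *pkt, uint16_t len,
                  struct half_pkt *h0, struct half_pkt *h1)
{
    uint16_t half = len / 2;
    h0->r = ROUTE_XY; h0->len = half;
    memcpy(h0->payload, pkt, half);
    h1->r = ROUTE_YX; h1->len = len - half;
    memcpy(h1->payload, pkt + half, len - half);
}
```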
Proceedings of the International Symposium on Memory Systems, Oct 2, 2017
The quest for greater performance and efficiency has driven modern cloud applications towards "in-memory" implementations, such as memcached and Apache Spark. Looking forward, however, the costs of DRAM, due to its low area density and high energy consumption, may make this trend unsustainable. Traditionally, OS paging mechanisms were intended to bridge the gap between expensive, under-provisioned DRAM and inexpensive, dense storage; in the past twenty years, however, the latency of storage relative to DRAM became too great to overcome without significant performance impact. Recent NVM storage devices, such as Intel Optane drives and aggressive 3D flash SSDs, may dramatically change the picture for OS paging. These new drives are expected to provide much lower latency compared to existing flash-based SSDs or traditional HDDs. Unfortunately, even these future NVM drives are still much too slow to replace DRAM, since the access latency of fast NVM storage is expected to be on the order of tens of microseconds, and they often require block-level access. Unlike the traditional HDDs for which the baseline OS paging policies are designed, these new SSDs impose no penalty for "random" access, and their access latency promises to be significantly less than that of traditional SSDs, thus arguing for a re-architecting of the OS paging system. In this paper, we propose SPAN (Speculative PAging for future NVM storage), a software-only, OS swap-based, page management and prefetching scheme designed for emerging NVM storage. Unlike the baseline OS swapping mechanism, which is highly optimized for traditional spinning disks, SPAN leverages the inherent parallelism of NVM devices to proactively fetch a set of pages from NVM storage into the small, fast main DRAM. In doing so, SPAN yields a speedup of ∼18% versus swapping into the NVM with the baseline OS (recovering ∼50% of the performance lost by the baseline OS relative to placing the entire working set in DRAM). The proposed technique thus enables the utilization of such hybrid systems for memory-hungry applications, lowering memory cost while keeping performance comparable to a DRAM-only system.
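As a minimal sketch of the speculative-fetch idea — not SPAN's actual policy — the following fetches a batch of predicted pages from an NVM-backed swap file on each fault. The prediction callback, batch size, and DRAM-cache layout are assumptions; a real implementation would issue the reads asynchronously to exploit the device's internal parallelism rather than looping synchronously.

```c
#include <unistd.h>
#include <sys/types.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define BATCH     8   /* pages speculatively fetched per fault; illustrative */

/* On a fault for virtual page `vpn`, pull a batch of predicted pages
 * from swap into a DRAM staging buffer. Since NVM drives impose no
 * random-access penalty, predict() need not return contiguous pages. */
int prefetch_batch(int swap_fd, uint64_t vpn,
                   uint64_t (*predict)(uint64_t vpn, int i),
                   void *dram_cache)
{
    for (int i = 0; i < BATCH; i++) {
        uint64_t target = predict(vpn, i);   /* e.g., vpn + i + 1 */
        char *dst = (char *)dram_cache + (size_t)i * PAGE_SIZE;
        if (pread(swap_fd, dst, PAGE_SIZE,
                  (off_t)(target * PAGE_SIZE)) != PAGE_SIZE)
            return -1;
    }
    return 0;
}
```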