This paper deals with two-level on-chip cache memories. We show the impact of three different relationships between the contents of these levels on system performance. In addition to the classical Inclusion contents management, we propose two alternatives, namely Exclusion and Demand, developing for them the necessary coherence support and quantifying their relative performance in a design space (sizes, latencies, ...) in agreement with the constraints imposed by integration. Two performance metrics are considered: the second-level cache miss ratio and the system CPI. The experiments have been carried out running a set of integer and floating-point SPEC'92 benchmarks. We conclude by showing the superiority of our improved version of Exclusion throughout the whole sizing and workload spectrum studied.
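A minimal sketch may help fix ideas about how Inclusion and Exclusion differ when an L1 miss is served. The Cache class, FIFO replacement, and function names below are illustrative assumptions, not the paper's actual simulator:

```python
# Sketch (under stated assumptions): contrast Inclusion and Exclusion
# contents management on an L1 miss. Sets of block tags stand in for
# real cache arrays; replacement is simplified to FIFO.

class Cache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = []                    # oldest first (FIFO)

    def insert(self, block):
        victim = None
        if len(self.blocks) == self.capacity:
            victim = self.blocks.pop(0)
        self.blocks.append(block)
        return victim                       # evicted block, if any

def serve_l1_miss(block, l1, l2, policy):
    if policy == "inclusion":
        if block not in l2.blocks:
            l2.insert(block)                # L2 holds a superset of L1
        l1.insert(block)
    elif policy == "exclusion":
        if block in l2.blocks:
            l2.blocks.remove(block)         # contents stay disjoint
        victim = l1.insert(block)
        if victim is not None:
            l2.insert(victim)               # L1 victims demoted into L2

l1, l2 = Cache(2), Cache(4)
serve_l1_miss("A", l1, l2, "exclusion")
```

Under Exclusion, the two levels never duplicate a block, so the effective capacity approaches the sum of both sizes; the demotion of L1 victims is what keeps recently used blocks on chip.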
Several emerging non-volatile (NV) memory technologies are rising as interesting alternatives for building the Last-Level Cache (LLC). Their advantages over SRAM are higher density and lower static power, but write operations wear out the bitcells to the point of eventually losing their storage capacity. In this context, this paper presents a novel LLC organization designed to extend the lifetime of the NV data array, together with a procedure to forecast in detail the capacity and performance of such an NV-LLC over its lifetime. From a methodological point of view, although different approaches are used in the literature to analyze the degradation of an NV-LLC, none of them allows its temporal evolution to be studied in detail. In this sense, this work proposes a forecasting procedure that combines detailed simulation and prediction, allowing an accurate analysis of the impact of different cache control policies and mechanisms (replacement, wear-leveling, compression, etc.) on the temporal evolution of the indices of interest, such as the effective capacity of the NV-LLC or the system IPC. We also introduce L2C2, an LLC design intended for implementation in NV memory technology that combines fault tolerance, compression, and internal write wear leveling for the first time. Compression is not used to store more blocks and increase the hit rate, but to reduce the write rate and extend the lifetime during which the cache sustains near-peak performance. In addition, to support byte loss without a performance drop, L2C2 inherently allows N redundant bytes to be added to each cache entry. Thus, L2C2+N, the endurance-scaled version of L2C2, allows the cost of redundant capacity to be balanced against the benefit of a longer lifetime. As a use case, we have implemented the L2C2 cache in STT-RAM technology. It has affordable hardware overheads in terms of area, latency and energy consumption compared to a baseline NV-LLC without compression, and it extends by a factor of 6-37, depending on the variability of the manufacturing process, the time until 50% of the effective capacity has degraded. Compared to L2C2, L2C2+6, which adds 6 bytes of redundant capacity per entry (a 9.1% storage overhead), extends by a factor of 1.4-4.3 the time during which the system sustains its initial peak performance.
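The interplay between compression, worn-out bytes and the N redundant bytes can be captured by a simple fit test. This is a hedged sketch, not the paper's actual entry layout; the sizes and function name are illustrative:

```python
# Sketch of the fit test underlying an L2C2+N-style entry (assumed
# layout): an entry of ENTRY_BYTES plus N redundant bytes with F faulty
# bytes can still hold any block whose compressed size fits in the
# remaining healthy bytes.

ENTRY_BYTES = 64        # nominal entry size (assumption)
N_REDUNDANT = 6         # extra bytes per entry, i.e. the "+6" configuration

def entry_can_store(compressed_size, faulty_bytes):
    healthy = ENTRY_BYTES + N_REDUNDANT - faulty_bytes
    return compressed_size <= healthy

# A 50-byte compressed block tolerates up to 20 worn-out bytes here.
print(entry_can_store(50, faulty_bytes=18))   # True  (52 healthy bytes)
print(entry_can_store(64, faulty_bytes=10))   # False (only 60 healthy bytes)
```

This is why compression buys lifetime rather than hit rate here: a block that compresses well keeps fitting in an entry long after several of its bytes have worn out.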
Nowadays, most computer manufacturers offer chip multiprocessors (CMPs) due to the ever-increasing chip density. These CMPs have a broad range of characteristics, but all of them support the shared-memory programming model. As a result, every CMP implements a coherence protocol to keep local caches coherent. Coherence protocols consume an important fraction of power in determining which coherence action to perform. Specifically, on CMPs with write-through local caches, a shared cache and a directory-based coherence protocol implemented as a duplicate of the local cache tags, we have observed that energy is wasted in the directory for two main reasons. First, an important fraction of directory lookups are useless, because the target block is not located in any local cache. The power consumed by the directory could be reduced by filtering out useless directory lookups. Second, useful directory lookups (where there are local copies of the target block) are performed on target blocks that are shared by a small number of processors. The directory power consumption could be reduced by limiting the lookups to only the directory entries that hold a copy of the block. In this thesis we propose two filtering mechanisms, each focused on one of the problems described above: the first proposal reduces the number of directory lookups performed, while the second reduces the associativity of directory lookups. Several implementations of both filtering approaches have been proposed and evaluated, all of them with very limited hardware complexity. Our results show that the power consumed by the directory can be reduced by as much as 30%.
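As a rough illustration of the first filtering idea, a small presence structure queried before the directory can skip lookups for blocks that no local cache holds. The per-set counter below is one simple, hypothetical realization, not the thesis's actual filter design:

```python
# Illustrative sketch (assumed design): skip the directory lookup when
# no local cache can hold the target block. A counter per directory set
# tracks how many cached copies map to that set.

class LookupFilter:
    def __init__(self, num_sets):
        self.counters = [0] * num_sets   # copies mapped to each set

    def on_fill(self, set_idx):          # a local cache filled a block
        self.counters[set_idx] += 1

    def on_evict(self, set_idx):         # a local cache evicted a block
        self.counters[set_idx] -= 1

    def must_lookup(self, set_idx):
        # Zero means no local cache holds any block of this set, so the
        # power-hungry directory lookup can be filtered out entirely.
        return self.counters[set_idx] > 0
```

The second mechanism would act orthogonally, narrowing a necessary lookup to only the ways that can hold a copy instead of probing the full associativity.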
Power density has become the limiting factor in technology scaling, as the power budget restricts the amount of hardware that can be active at the same time. Reducing the supply voltage to ultra-low ranges close to the threshold region holds the promise of great energy savings. However, the potential savings of voltage scaling are limited by the correct operation of SRAM cells, which is not guaranteed below Vdd_min, the minimum voltage at which cache structures operate reliably. Understanding the effects of operating below Vdd_min requires complex modeling, so we introduce an updated failure probability model of SRAM cells at 22nm and explore the reliability impact of lowering the chip supply voltage below Vdd_min in shared-memory coherent chip multiprocessors (CMPs) running a variety of parallel workloads. A microarchitectural technique to cope with cache reliability at ultra-low voltages is block disabling; however, in many cases the savings in the on-chip caches do not compensate for the consumption in the rest of the system, as the consumption increase of the off-chip memory may offset the on-chip gain. We make the case that existing coherence mechanisms can provide the substrate to improve energy savings with block disabling, and we propose two low-complexity techniques. Taking the best of both techniques, we can scale voltage below Vdd_min and reduce system energy by up to 39%, and system energy-delay by up to 10%. Besides, by lowering the CMP consumption in a power-constrained scenario, we could activate offline cores, reaching a potential speedup of between 3.7 and 4.4.
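A back-of-the-envelope calculation shows why block disabling erodes capacity so quickly below Vdd_min. The bitcell failure probabilities below are made-up illustrations, not the paper's 22nm model:

```python
# Sketch: a 64-byte block (512 bitcells) must be disabled if any one of
# its cells fails at the chosen voltage. Assuming independent failures
# with per-cell probability p_cell (values here are illustrative only):

def p_block_faulty(p_cell, block_bits=512):
    return 1.0 - (1.0 - p_cell) ** block_bits

for p in (1e-6, 1e-4, 1e-3):
    print(f"p_cell={p:g} -> p_block_faulty={p_block_faulty(p):.3f}")
# p_cell=1e-06 -> ~0.001, p_cell=0.0001 -> ~0.050, p_cell=0.001 -> ~0.401
```

Even a seemingly small per-cell failure rate of 10^-3 disables about 40% of the blocks, which is why the lost capacity can push misses to off-chip memory and offset the on-chip savings.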
The performance impact of the Physical Register File (PRF) size on Simultaneous Multithreading processors has not been extensively studied, in spite of being a critical shared resource. In this paper we analyze the effect of the PRF size on performance for a broad set of resource allocation policies (Icount, Stall, Flush, Flush++, Static, Dcra and Hill-climbing) and evaluate them under two metrics: instructions per second (IPS) for throughput and the harmonic mean of weighted IPCs (Hmean-wIPC) for fairness. We have found that the resource allocation policy and the PRF size should be considered together in order to obtain the best score in the proposed metrics. For instance, for the analyzed 2- and 4-threaded SPEC CPU2000 workloads, small PRFs are best managed by Flush, whereas for larger PRFs, Hill-climbing and Static lead to the best values for the throughput and fairness metrics, respectively. The second contribution of this work is a simple procedure that, for a given resource allocation policy, selects the PRF size that maximizes IPS and obtains an Hmean-wIPC value close to its maximum. According to our results, Hill-climbing with a 320-entry PRF achieves the best figures for 2-threaded workloads. When executing 4-threaded workloads, Hill-climbing with a 384-entry PRF achieves the best throughput, whereas Static obtains the best throughput-fairness balance.
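The selection procedure can be sketched as a simple maximization over measured points. The numbers below are placeholders to make the sketch runnable, not the paper's results:

```python
# Sketch of the PRF-size selection procedure for one allocation policy,
# assuming IPS and Hmean-wIPC have already been measured per PRF size
# (the figures below are placeholders, not measured data).

measured = {            # prf_size: (IPS, Hmean_wIPC)
    256: (2.10e9, 0.42),
    320: (2.35e9, 0.47),
    384: (2.33e9, 0.49),
}

best_size = max(measured, key=lambda size: measured[size][0])  # max IPS
ips, fairness = measured[best_size]
print(best_size, f"{ips:.2e}", fairness)   # throughput-optimal PRF size
```

The point of the procedure is that the IPS-optimal size usually lands near the fairness optimum as well, so a single sweep per policy suffices.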
Concurrency and Computation: Practice and Experience, Jul 12, 2014
Optical networks-on-chip (ONoCs) are gaining momentum as a way to reduce energy consumption and improve bandwidth scalability in next-generation multicore and many-core systems. Although many valuable research works have investigated their properties, the vast majority of them lack an accurate exploration of the network interface architecture required to support optical communications on the silicon chip. The complexity of this architecture is especially critical for a specific kind of ONoC: the wavelength-routed ones. These are capable of delivering contention-free all-to-all connectivity without the need for path reservation, unlike space-routed ONoCs. From a logical viewpoint, they can be considered full non-blocking crossbars; thus, the control complexity is implemented at the network interfaces. To our knowledge, this paper proposes the first complete network interface architecture for wavelength-routed optical NoCs, coping with the intricacy of networking issues such as flow control, buffering strategy, deadlock avoidance, serialization and, above all, their co-design in a complete architecture. The evaluation methodology spans from area and energy analysis via actual synthesis runs in 40-nm technology to register-transfer-level (RTL)-equivalent SystemC modelling of the network architecture, and aims at verifying whether the projected benefits of ONoCs versus their electrical counterparts are still preserved when the complexity of their network interface is considered in the analysis.
Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is closely tied to the size and number of ports of the register file. In conventional register renaming schemes, register releasing is conservatively done only after the instruction that redefines the same register is committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early-releasing hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.
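The difference in release timing can be stated as two predicates. This is a conceptual sketch under assumed event names, not RTL and not the paper's exact conditions:

```python
# Sketch: when does a physical register's lifetime end?

def conventional_release(event):
    # Conventional renaming: free the register only when the instruction
    # redefining the same logical register commits; the value may sit
    # dead in the file for many cycles before that.
    return event == "redefiner_committed"

def early_release(event, remaining_readers):
    # Early release: free the register as soon as no further use of the
    # value is possible, i.e. its last pending reader has consumed it.
    return event == "operand_read" and remaining_readers == 0

print(conventional_release("redefiner_committed"))          # True
print(early_release("operand_read", remaining_readers=0))   # True
```

The gap between these two points in time is exactly the dead-value occupancy that early release reclaims.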
Design, Automation, and Test in Europe, Apr 20, 2009
To deal with the "memory wall" problem, microprocessors include large secondary on-chip caches. But as these caches grow, a new latency gap opens between them and the fast L1 caches (the inter-cache latency gap). Recently, Non-Uniform Cache Architectures (NUCAs) have been proposed to sustain the size growth trend of secondary caches, which is threatened by wire-delay problems. NUCAs are size-oriented, and they were not conceived to close the inter-cache latency gap. To tackle this problem, we propose Light NUCAs (L-NUCAs), which leverage on-chip wire density to interconnect small tiles through specialized networks that convey packets with distributed and dynamic routing. Our design reduces the tile delay (cache access plus one-hop routing) to a single processor cycle and places cache lines at a finer granularity than conventional caches, reducing cache latency. Our evaluations show that, in general, L-NUCA simultaneously improves performance, energy, and area when integrated into either conventional or D-NUCA hierarchies.
This paper proposes and evaluates a new microarchitecture for out-of-order processors that supports speculative renaming. We call speculative renaming the speculative omission of physical register allocation combined with the speculative early release of physical registers. These renaming policies may cause a register operand not to be kept in the Physical Register File (PRF). Thus, we add a low-ported Auxiliary Register File (XRF), located outside the processor core, that keeps the values absent from the PRF and supplies them at a higher latency. To support the location of register operands in either the PRF or the XRF, we use virtual registers. We consider omission and release policies directed by hardware prediction; namely, we use a single last-use predictor that directs both speculative omission and release. We call this mechanism SR-LUP (Speculative Renaming based on Last-Use Prediction). Two last-use predictor designs of incremental complexity and performance are analyzed. In a 256-ROB, 8-way processor with an 80int+80fp PRF, SR-LUP with an 11-port 256int+256fp XRF speeds up computation by up to 11.5% and 29% for INT and FP SPEC2K benchmarks, respectively. For FP benchmarks, if the PRF limits the clock frequency, a conventionally managed 128int+128fp PRF can be replaced, using SR-LUP, by a 64int+64fp PRF backed by a 10-port 224int+224fp XRF, showing a 19% IPS gain.
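A last-use predictor in the spirit of SR-LUP can be sketched as a PC-indexed table of saturating counters. The table geometry, threshold and update rule below are illustrative assumptions, not either of the paper's two designs:

```python
# Hedged sketch of a PC-indexed last-use predictor (assumed geometry:
# 4096 entries of 2-bit saturating counters, confidence threshold 2).

TABLE_SIZE = 4096

class LastUsePredictor:
    def __init__(self):
        self.table = [0] * TABLE_SIZE

    def _index(self, pc):
        return pc % TABLE_SIZE

    def predict_last_use(self, pc):
        # Predict "this read is the value's last use" when confident.
        return self.table[self._index(pc)] >= 2

    def train(self, pc, was_last_use):
        # Update at commit, once the true last use is known.
        i = self._index(pc)
        if was_last_use:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

A single such predictor can drive both decisions: omitting the PRF allocation for values predicted short-lived, and releasing a register once its predicted last reader has issued, with the XRF catching any misprediction.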
A new radiolocation method for precise depth estimation and its application to the analysis of changes in groundwater levels in Colonia Clunia Sulpicia
We introduce a multi-level prefetching framework with three setups, aimed respectively at minimizing cost (Mincost), minimizing losses in individual applications (Minloss), or maximizing performance at moderate cost (Maxperf). Performance is boosted in all cases by a sequential tagged prefetcher in the L1 cache with an effective static-degree policy. In both cache levels (L1 and L2) we also apply prefetch filters. In the L2 cache we use a novel adaptive policy that selects the best prefetching degree within a fixed set of values by tracking the performance gradient. Mincost resorts to sequential tagged prefetching in the L2 cache as well. Minloss relies on an accurate, in-house correlating prefetcher (PDFCM, Differential Finite Context Method prefetcher). Maxperf maximizes performance at the expense of slight performance losses in a small number of benchmarks, by integrating a sequential tagged prefetcher with PDFCM in the L2 cache.
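The gradient-tracking degree policy can be sketched as a small hill-climbing loop over the candidate degrees. The epoch granularity and the degree set below are illustrative assumptions:

```python
# Sketch of the adaptive L2 degree policy: each epoch, keep moving the
# prefetch degree in the direction that improved performance, reverse
# otherwise, always staying within a fixed candidate set.

DEGREES = [1, 2, 4, 8, 16]      # assumed candidate degrees

def next_degree(idx, perf_now, perf_prev, direction):
    # Follow the performance gradient observed over the last epoch.
    if perf_now < perf_prev:
        direction = -direction          # last move hurt: back off
    idx = max(0, min(len(DEGREES) - 1, idx + direction))
    return idx, direction

idx, direction = 2, +1                  # start at degree 4, climbing up
idx, direction = next_degree(idx, perf_now=1.05, perf_prev=1.00,
                             direction=direction)
print(DEGREES[idx])                     # 8: performance improved, keep going
```

Restricting the search to a fixed set keeps the mechanism cheap while still adapting the degree to each program phase.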
Identifying influencing factors on Branch Target Cache Memory performance
In this paper, we study a particular organization for an instruction cache memory. This cache is tagged with the target addresses of taken branches and caches the target instruction as well as a fixed number of consecutive instruction bytes. Using traces from two 32-bit architectures (DEC VAX-11, Berkeley RISC-II), we find that a Branch Target Cache Memory (BTCM), together with a burst-mode external memory and a prefetch mechanism, can be very useful to supply instructions at the rate needed by the processor. We define several hit ratios in order to isolate the factors that influence the global performance of a system with a BTCM. Results of these measures for different line sizes and numbers of entries are presented.
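The kind of hit-ratio decomposition described above can be illustrated over a toy branch trace. The event fields are hypothetical names, not the paper's definitions:

```python
# Toy sketch: decompose BTCM behavior on taken branches into a tag hit
# ratio and a stricter "line covers the fetch" hit ratio, to isolate
# the factors behind global performance.

def btcm_ratios(events):
    taken = [e for e in events if e["taken"]]
    tag_hits = [e for e in taken if e["tag_hit"]]
    line_hits = [e for e in tag_hits if e["line_covers_fetch"]]
    n = len(taken)
    return len(tag_hits) / n, len(line_hits) / n

trace = [
    {"taken": True, "tag_hit": True,  "line_covers_fetch": True},
    {"taken": True, "tag_hit": True,  "line_covers_fetch": False},
    {"taken": True, "tag_hit": False, "line_covers_fetch": False},
]
print(btcm_ratios(trace))   # (0.666..., 0.333...)
```

Separating the ratios this way shows whether misses come from too few entries (tag misses) or from lines too short to feed the processor until the next branch.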
The late release policy of conventional renaming keeps many registers in the register file assigned in spite of containing values that will never be read in the future. In this work, we study the potential of a novel scheme that speculatively releases a physical register as soon as it has been read by a predicted last instruction that references its value. An auxiliary register file placed outside the critical paths of the processor pipeline holds the early-released values, just in case they are unexpectedly referenced by some instruction. In addition to demonstrating the feasibility of a last-use predictor, this paper analyzes the auxiliary register file (latency and size) required to support a speculative early release mechanism that uses a perfect predictor. The obtained results set the performance bound that any real speculative early release implementation can reach. We show that in a processor with a 64int+64fp register file, perfect early release supported by an unbounded auxiliary register file has the potential of speeding up computation by up to 23% and 47% for SPECint2000 and SPECfp2000 benchmarks, respectively. Speculative early release can also be used to reduce the register file size without losing performance. For instance, a processor with a conventionally managed 96int+96fp register file could be replaced, at equal IPC, by a 64int+64fp register file managed with perfect early register release and backed by a 64int+64fp auxiliary register file, representing a 12% IPS (instructions per second) increase if the processor frequency were constrained by the register file access time.
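The 12% figure follows from IPS = IPC × frequency: if IPC is preserved and the smaller register file shortens the cycle time, throughput scales with the frequency gain. The access-time numbers below are illustrative, not the paper's circuit data:

```python
# Worked sketch of the IPS claim, assuming the register file access time
# sets the cycle time and IPC is unchanged between configurations.

ipc = 2.0                        # same IPC for both register files (assumed)
t_cycle_96 = 0.50                # ns, hypothetical 96-entry RF cycle time
t_cycle_64 = 0.50 / 1.12         # ~12% faster cycle with the 64-entry RF

ips_96 = ipc / t_cycle_96        # instructions per ns
ips_64 = ipc / t_cycle_64
print(f"IPS gain: {ips_64 / ips_96 - 1:.0%}")   # 12%
```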
Time Domain Reflectometry (TDR) is a technique widely used in hydrology and agronomy that allows real-time estimation of soil volumetric water content (θ), which is related to the soil's apparent permittivity (εa) and bulk electrical conductivity (σ). This work presents an enhanced release of TDR-Lab, a software package that controls instrumentation for field measurements of θ and σ, enabling convenient recording and retrieval of data. TDR-Lab 2.0 supports the Tektronix 1502C, the TDR-100 (Campbell Sci.) and the TRASE (Soil-moisture Equipment Corp.), which can be connected to a multiplexing system (SDMX50, Campbell Sci.) allowing automated scheduled measurements from up to 251 different probes. This new release, when connected to the TDR-100, allows increasing the waveform accuracy up to 2048 points per waveform. Multiple TDR waveforms can now be compared within the same TDR screen. Graphical or numerical methods can be used for the estimation of θ and σ from soil measurements taken with different probes. Additional features for water-surface-level measurements, such as matric potential and soil solution electrical conductivity, are also available when measurements are made with the corresponding specific probes. Two different versions, a lite and a full release, have been developed for field and laboratory applications. The lite version (TDR-Lab Lite), with a reduced set of features, was designed to run on low-end ultraportable devices. TDR-Lab Lite works with XML files instead of the SQL database engine of the full TDR-Lab, and has lower system requirements and a faster boot-up time. A robust import/export graphical user interface (GUI) facilitates transferring projects between the centralized SQL database and XML files. A new project manager window has been implemented, where the bar menu has been complemented with a useful set of icons. The display of system projects has been improved and simplified. Within the project, a new, friendly configuration manager for cables and TDR probes has been developed. Finally, an improved calibration procedure for TDR probes has been implemented. TDR-Lab 2.0 supports more devices and estimates additional quantities from soil measurements. In addition, the low hardware requirements of TDR-Lab Lite facilitate faster in-the-field tests and measurements.
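The abstract does not state which θ(εa) calibration TDR-Lab applies; a widely used default in TDR practice is Topp's (1980) empirical equation, sketched here for reference only:

```python
# Topp's (1980) empirical relation between apparent permittivity (eps_a)
# and volumetric water content (theta, m3/m3). Whether TDR-Lab uses this
# or a probe-specific calibration is an assumption, not stated above.

def theta_topp(eps_a):
    return (-5.3e-2 + 2.92e-2 * eps_a
            - 5.5e-4 * eps_a**2 + 4.3e-6 * eps_a**3)

print(round(theta_topp(20.0), 3))   # ~0.345 for a fairly moist soil
```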
[...] alternate with colder stages, represented by different tills belonging to the Sabiñánigo (MIS 6) and Salinas (MIS 4) phases. KEY WORDS: Endokarst, glacial features, Quaternary, Cotiella massif, Pyrenees.