Papers by Jose A Gregorio

Journal of Parallel and Distributed Computing, 2019
This work explores the feasibility of specialized hardware implementing the Cortical Learning Algorithm (CLA) in order to fully exploit its inherent advantages. This algorithm, which is inspired by the current understanding of the mammalian neocortex, is the basis of the Hierarchical Temporal Memory (HTM). In contrast to other machine learning (ML) approaches, its structure is not application dependent and relies on fully unsupervised continuous learning. We hypothesize that a hardware implementation will be able not only to extend the already practical uses of these ideas to broader scenarios but also to exploit the hardware-friendly characteristics of the CLA. The proposed architecture will enable a degree of scalability unfeasible for software solutions and will fully capitalize on one of the many CLA advantages: very low computational requirements and optimal storage utilization. Compared with a state-of-the-art CLA software implementation, it could improve performance by four orders of magnitude and energy efficiency by up to eight orders of magnitude. Embracing the problem's complex nature, we found that the most demanding issue, from a scalability standpoint, is the massive degree of connectivity required, and we propose a packet-switched network to tackle it. The paper addresses the fundamental issues of such an approach, proposing solutions that scale. We analyze cost and performance using well-known architectural techniques and tools. The results obtained suggest that even with CMOS technology, under constrained cost, it might be possible to implement a large-scale system. We found that the proposed solutions save ~90% of the original communication costs when running either synthetic or realistic workloads.
Frequency response of RC-active circuits using Norton amplifiers with highly asymmetrical slew rates
IEE Proceedings G (Electronic Circuits and Systems), 1984
Petri net modeling of interconnection networks for massively parallel architectures
Proceedings of the 9th international conference on Supercomputing - ICS '95, 1995
J. A. Gregorio, F. Vallejo, R. Beivide and C. Carrión, Departamento de Electrónica, Universidad de Cantabria, 39005 Santander, Spain (e-mail: ja@ctrhp3.unican.es).

Performance evaluation of the bubble algorithm: benefits for k-ary n-cubes
Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99, 1999
The bubble algorithm evaluated in this paper ensures message deadlock freedom in k-ary n-cube networks without using virtual channels. The algorithm is based both on dimension-order routing (DOR) and on a restricted injection policy extended to dimension changes. An exhaustive comparison between the bubble mechanism and the classical deterministic virtual-channel solution is presented here. For that purpose, the message router of both proposals has been designed using VHDL descriptions and the Synopsys VLSI CAD tool. Additionally, formal models of the routers, based on colored Petri nets, have been developed together with simulation techniques in order to validate the results and shorten the design cycle. The performance evaluation of n-dimensional tori highlights the benefits of the bubble algorithm, as both the temporal delay and the silicon area required by the message router are reduced.
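The core of the restricted injection policy can be illustrated with a minimal sketch, assuming the usual formulation of bubble flow control: a packet may enter a dimension ring (by injection or a dimension change) only if the destination queue keeps room for at least two packets, so one free buffer (the "bubble") always remains in the ring. Names and the slot count are illustrative, not taken from the paper's VHDL design.

```python
# Sketch of the bubble flow-control injection rule (assumed formulation):
# entering a ring requires space for two packets, so the ring can never
# fill completely and transit traffic always keeps moving.
BUBBLE = 2  # free packet slots required to enter a ring (assumed value)

def can_enter_ring(free_slots: int) -> bool:
    """Restricted injection: entering a ring needs space for two packets."""
    return free_slots >= BUBBLE

def can_advance_in_ring(free_slots: int) -> bool:
    """Packets already travelling inside a ring only need one free slot."""
    return free_slots >= 1

# A queue with one free slot lets transit traffic advance but blocks new
# injections, which is what prevents deadlock without virtual channels.
assert can_advance_in_ring(1) and not can_enter_ring(1)
```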
Analytical models for rapid prototyping of multiprocessor systems

Proceedings of the 27th international ACM conference on International conference on supercomputing - ICS '13, 2013
This paper explores the benefits of scheduling off-chip memory operations in a Chip Multiprocessor (CMP) according to their execution relevance. In a scenario with many out-of-order execution cores in the CMP, from the processor's perspective the importance of the instruction that triggers an access to off-chip memory may vary considerably. Consequently, it makes sense to take this point of view into account at the memory controller level to reorder outgoing memory accesses. After exploring different processor-centric sorting criteria, we conclude that the simplest and most useful metric for scheduling a memory operation is the position in the reorder buffer of the instruction that triggers the on-chip miss. We propose a simple memory controller scheduling policy that employs this information as its main parameter. The proposal significantly improves system responsiveness, in terms of both throughput and fairness. The idea is analyzed through full-system simulation, running a broad set of workloads with diverse memory behavior. Compared with other scheduling algorithms of similar complexity, throughput improves by an average of 10% and fairness by an average of 15%, even in very adverse usage scenarios. Moreover, the idea supports dynamically favoring throughput or fairness according to end-user requirements.
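The scheduling criterion described in the abstract can be sketched as follows, under the assumption that each pending request carries the reorder-buffer (ROB) position of its triggering instruction: the memory controller services first the request whose instruction is closest to the ROB head, since that one is most likely to be stalling commit. All names are illustrative, not the paper's implementation.

```python
# Sketch of a processor-centric memory scheduling policy (assumed design):
# pick the pending off-chip request whose triggering instruction sits
# closest to the head of its reorder buffer.
from dataclasses import dataclass

@dataclass
class MemRequest:
    address: int
    rob_position: int  # distance from the ROB head when the miss was issued

def schedule(pending: list[MemRequest]) -> MemRequest:
    """Service the request triggered by the oldest (closest-to-head) instruction."""
    return min(pending, key=lambda r: r.rob_position)

queue = [MemRequest(0x100, 40), MemRequest(0x200, 3), MemRequest(0x300, 17)]
assert schedule(queue).address == 0x200  # closest to the ROB head wins
```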

IEEE Computer Architecture Letters, 2011
This paper presents a simple analytical model for predicting on-chip cache hierarchy effectiveness in chip multiprocessors (CMPs) for a state-of-the-art architecture. Given the complexity of this type of system, we use rough approximations, such as the empirical observation that the re-reference timing pattern follows a power law and the assumption of a simplistic delay model for the cache, in order to provide a useful model of memory hierarchy responsiveness. The model enables the analytical determination of average access time, which makes design-space pruning possible before sweeping the vast design space of this class of systems. The model is also useful for predicting cache hierarchy behavior in future systems. Its fidelity has been validated using a state-of-the-art, full-system simulation environment, on a system with up to sixteen out-of-order processors with coherent caches, using a broad spectrum of applications, including complex multithreaded workloads. This simple model can predict a near-optimal on-chip cache distribution while also estimating how future systems running future applications might behave.
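The flavor of such a model can be sketched with a toy calculation, assuming a power-law miss curve (miss rate shrinking as capacity to a negative exponent) combined with fixed per-level latencies to estimate average memory access time (AMAT). The exponent, scale, and latency values below are invented for illustration and are not the paper's fitted parameters.

```python
# Toy analytical cache model (assumed parameter values): a power-law
# miss curve per level, folded into the classic AMAT recurrence.
def miss_rate(capacity_kb: float, alpha: float = 0.5, scale: float = 1.0) -> float:
    """Power-law miss curve: bigger caches miss less, with diminishing returns."""
    return min(1.0, scale * capacity_kb ** -alpha)

def amat(l1_kb: float, l2_kb: float,
         t_l1: float = 1.0, t_l2: float = 10.0, t_mem: float = 200.0) -> float:
    """Average memory access time for a two-level hierarchy."""
    m1 = miss_rate(l1_kb)
    m2 = miss_rate(l2_kb)
    return t_l1 + m1 * (t_l2 + m2 * t_mem)

# Doubling L2 from 512 KB to 1 MB lowers the estimated AMAT, but by less
# than the previous doubling did: the power law captures diminishing returns.
assert amat(32, 1024) < amat(32, 512)
```

A sweep of `amat` over candidate (L1, L2) splits is exactly the kind of design-space pruning the abstract describes: cheap to evaluate analytically before committing to full-system simulation.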

IEEE Transactions on Parallel and Distributed Systems, 2016
This work shows how, by adapting replacement policies in contemporary cache hierarchies, it is possible to extend the lifespan of a write-endurance-limited main memory by almost one order of magnitude. The idea stems from two observations about cache residency: (1) blocks are modified in a bimodal way, either most of the content of the block is modified or most of it never changes; and (2) in most applications, the majority of blocks are only slightly modified. When the cache replacement algorithms take these facts into account, it is possible to significantly reduce the number of bit-flips per write-back to main memory. Our proposal favors the off-chip eviction of slightly modified blocks according to an adaptive replacement algorithm that operates in a coordinated way in L2 and L3. This makes it possible to significantly improve main-memory lifetime with negligible performance degradation. We found that a few bits per block are enough to track changes in cache blocks with respect to the main-memory content. With a slightly modified sectored LRU and a simple cache performance predictor, a straightforward implementation is achievable with minimal area cost and no impact on cache access time. On average, our proposal increases the memory lifetime obtained with an LRU policy up to ten times (10×), and up to fifteen times (15×) when combined with other memory-centric techniques. In both cases, the performance degradation can be considered negligible.
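The eviction bias can be sketched as follows, assuming a few bits per block record how many sectors differ from main memory: among the oldest blocks in a set, the replacement policy prefers the least-modified victim, so write-backs flip fewer bits. The names, the half-of-the-set window, and the counter granularity are illustrative assumptions, not the paper's exact sectored-LRU algorithm.

```python
# Sketch of an endurance-aware victim choice (assumed design): within the
# older half of a set, evict the block whose write-back would flip the
# fewest bits, as approximated by a coarse per-block dirty-sector counter.
from dataclasses import dataclass

@dataclass
class CacheBlock:
    tag: int
    lru_age: int        # higher means older (further from MRU)
    dirty_sectors: int  # coarse change counter: a few bits per block

def choose_victim(candidate_set: list[CacheBlock]) -> CacheBlock:
    """Among the oldest half of the set, evict the least-modified block."""
    by_age = sorted(candidate_set, key=lambda b: -b.lru_age)
    oldest_half = by_age[: (len(by_age) + 1) // 2]
    return min(oldest_half, key=lambda b: b.dirty_sectors)

ways = [CacheBlock(0xA, 4, 7), CacheBlock(0xB, 3, 1), CacheBlock(0xC, 1, 0)]
assert choose_victim(ways).tag == 0xB  # old but barely modified: cheap write-back
```

Restricting the choice to the older half is one way to trade endurance for hit rate: pure least-modified eviction could hurt performance, while pure LRU ignores bit-flip cost.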

Modeling of interconnection subsystems for massively parallel computers
Performance Evaluation, 2002
The analysis, design and evaluation of the interconnection subsystem for massively parallel architectures is normally carried out using computer simulation tools, which require high computational costs. Moreover, in some cases, these simulation processes present serious difficulties when both experiments and results have to be reproduced by other research or design teams. This work shows the suitability of formal representation methods, such as DSPNs (stochastic Petri nets with deterministic and exponential firing times), for describing message routers, focusing on two important features. First, network performance indicators can be obtained by simulating the resulting models at a lower computational cost than with conventional techniques; in some cases, analytical results can also be obtained. Second, the basic parameters of the network design become relatively independent of the router implementation features, thus simplifying the procedure for establishing the behavior of new router structures. This approach has been successfully applied to the analysis of both symmetrical torus and asymmetrical mesh interconnection topologies, with virtual cut-through flow control, oblivious routing and random traffic. It should be noted that most modern parallel computers employ a local buffer space big enough to store at least a complete packet. Two different functional router structures have been considered in each case: transit buffers located at the input or at the output router links.
Comparative study of the effect of slew-induced distortion on single-amplifier biquadratic stages
International Journal of Electronics, 1984
A comparative study is presented of the effect of slew-rate-induced distortion on the main single-amplifier biquadratic stages. Evidence is presented to show that only one of those stages is free of regenerative phenomena for all input conditions. A normalized comparison criterion is defined that can be used to establish the differences between the various biquadratic stages independently of the input-output transfer functions implemented in each filter. This criterion can be used to obtain the precise operating conditions that will guarantee a regenerative-phenomena-free linear response in each of the different structures.
Performance evaluation of parallel systems by using unbounded generalized stochastic Petri nets
IEEE Transactions on Software Engineering, 1992
Mercedes Granda, José M. Drake, and José A. Gregorio, IEEE Transactions on Software Engineering, vol. 18, no. 1, January 1992, p. 55.
A flow control mechanism to prevent message deadlock in k-ary n-cube networks
Beneficios del uso de la Red de Interconexión en la Aceleración de la Coherencia (Benefits of Using the Interconnection Network to Accelerate Coherence)
... When the number of processors is higher, the complete lack of scalability in bus bandwidth makes it necessary to use interconnection networks ...
TOPAZ: Un simulador de redes de interconexión para CMPs y supercomputadores (TOPAZ: An Interconnection Network Simulator for CMPs and Supercomputers)
Towards a Shared/Private Non-Uniform Cache Architecture in CMP Systems
Javier Merino, Valentín Puente, Pablo Prieto, José Ángel Gregorio. Grupo de Arquitectura y Tecnología de Computadores, Universidad de Cantabria.
Topology-aware CMP design

Cornell University - arXiv, Jan 24, 2019
Simulation is a fundamental research tool in the computer architecture field. These tools enable the exploration and evaluation of architectural proposals, capturing the most relevant aspects of the highly complex systems under study. Many state-of-the-art simulation tools focus on single-system scenarios, but the scalability required by trending applications has shifted towards distributed computing systems integrated via complex software stacks. Web services with client-server architectures and the distributed storage and processing of scale-out data analytics (Big Data) are among the prime examples. The complete simulation of a distributed computer system is the appropriate methodology for conducting accurate evaluations. Unfortunately, this methodology can significantly increase the already large computational effort of detailed simulation. In this work, we conduct a set of experiments to evaluate this accuracy/cost tradeoff. We measure the error made if client-server applications are evaluated in a single-node environment, as well as the overhead induced by the methodology and simulation tool employed for multi-node simulations. We quantify this error for different micro-architecture components, such as the last-level cache and the instruction/data TLBs. Our findings show that the accuracy loss can lead to completely wrong conclusions about the effects of proposed hardware optimizations. Fortunately, our results also demonstrate that the computational overhead of a multi-node simulation framework is affordable, suggesting multi-node simulation as the most appropriate methodology.

Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013
This paper introduces a new coherence protocol that addresses the challenges of complex multilevel cache hierarchies in future many-core systems. To keep coherence protocol complexity bounded, inclusiveness is required to track coherence information across levels in this type of system, but this may introduce unsustainable costs for directory structures. Cost-reduction decisions taken to reduce this complexity may introduce artificial inefficiencies in the on-chip cache hierarchy, especially when the number of cores and the size of the private caches are large. The coherence protocol presented in this work, denoted MOSAIC, introduces a new approach to tackling this problem. In energy terms, the protocol scales like a conventional directory coherence protocol, but it relaxes the inclusiveness of the shared information. This overcomes the performance implications of reducing directory size and associativity. Contrary to the common belief that inclusiveness is inescapable when attempting to keep complexity constrained, MOSAIC is even simpler than a conventional directory. The results of our evaluation show that the approach is quite insensitive, in terms of performance and energy expenditure, to the size and associativity of the directory.
Necessary and Sufficient Conditions for Deadlock-free Networks
In this paper we develop a new and generic theory of the necessary and sufficient conditions for deadlock-free routing in interconnection networks. An extension of the channel dependency graph described by Dally is defined: the channel dynamic dependency graph. The main achievement of this new concept is a consequence of introducing the notion of time and the flow control function into its definition. Our theory remains valid for different routing and flow control functions, showing that even if the conditions of Duato's theorem are not fulfilled, the network can be deadlock-free. Index Terms: multicomputer networks, deadlock, flow control, routing.
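For context, Dally's classic static condition, which this theory extends, states that a routing function is deadlock-free if its channel dependency graph is acyclic. A minimal sketch of that static cycle check is shown below; the paper's dynamic graph additionally incorporates time and flow control, which this sketch does not model. The graph encoding is illustrative.

```python
# Sketch of the static check underlying Dally's condition: a routing
# function is deadlock-free if the channel dependency graph (edges from
# a channel to the channels a packet may request next) has no cycle.
def has_cycle(deps: dict[str, list[str]]) -> bool:
    """DFS-based cycle detection over a channel dependency graph."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / finished
    color = {c: WHITE for c in deps}

    def dfs(c: str) -> bool:
        color[c] = GRAY
        for nxt in deps.get(c, []):
            if color.get(nxt, WHITE) == GRAY:   # back edge: cycle found
                return True
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and dfs(c) for c in deps)

ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c0"]}  # unidirectional ring: cyclic
broken = {"c0": ["c1"], "c1": ["c2"], "c2": []}    # dependency removed: acyclic
assert has_cycle(ring) and not has_cycle(broken)
```

The ring example is the canonical deadlock case that bubble flow control and virtual channels address in different ways; the paper's contribution is showing that flow control can break such cycles dynamically even when the static graph is cyclic.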