Papers by Jose A Gregorio

Journal of Parallel and Distributed Computing, 2019
This work explores the feasibility of specialized hardware implementing the Cortical Learning Algorithm (CLA) in order to fully exploit its inherent advantages. This algorithm, which is inspired by the current understanding of the mammalian neocortex, is the basis of the Hierarchical Temporal Memory (HTM). In contrast to other machine learning (ML) approaches, its structure is not application dependent and relies on fully unsupervised continuous learning. We hypothesize that a hardware implementation will be able not only to extend the already practical uses of these ideas to broader scenarios but also to exploit the hardware-friendly characteristics of the CLA. The proposed architecture will enable a degree of scalability unfeasible for software solutions and will fully capitalize on one of the many CLA advantages: very low computational requirements and optimal storage utilization. Compared with a state-of-the-art CLA software implementation, it could improve performance by four orders of magnitude and energy efficiency by up to eight orders of magnitude. Embracing the problem's complex nature, we found that the most demanding issue, from a scalability standpoint, is the massive degree of connectivity required, and we propose a packet-switched network to tackle it. The paper addresses the fundamental issues of such an approach, proposing solutions that scale. We analyze cost and performance using well-known architectural techniques and tools. The results obtained suggest that even with CMOS technology, under constrained cost, it might be possible to implement a large-scale system. We found that the proposed solutions save ~90% of the original communication costs when running either synthetic or realistic workloads.
Frequency response of RC-active circuits using Norton amplifiers with highly asymmetrical slew rates
IEE Proceedings G (Electronic Circuits and Systems), 1984
Petri net modeling of interconnection networks for massively parallel architectures
Proceedings of the 9th international conference on Supercomputing - ICS '95, 1995
J. A. Gregorio, F. Vallejo, R. Beivide and C. Carrión, Departamento de Electrónica, Universidad de Cantabria, 39005 Santander, Spain (e-mail: ja@ctrhp3.unican.es).

Performance evaluation of the bubble algorithm: benefits for k-ary n-cubes
Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99, 1999
The bubble algorithm evaluated in this paper ensures message deadlock freedom in k-ary n-cube networks without using virtual channels. The algorithm is based both on dimension-order routing (DOR) and on a restricted injection policy extended to dimension changes. An exhaustive comparison between the bubble mechanism and the classical deterministic virtual-channel solution is presented here. For that purpose, the message router of both proposals has been designed using VHDL descriptions and the Synopsys VLSI CAD tool. Additionally, formal models of the routers, based on colored Petri nets, have been developed together with simulation techniques in order to validate the results and shorten the design cycle. The performance evaluation of n-dimensional tori highlights the benefits of the bubble algorithm, as both the temporal delay and the silicon area required by the message router are reduced.
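The core of the restricted injection policy can be illustrated with a minimal sketch, assuming the usual formulation of bubble flow control: a packet may enter a dimension ring (by injection or a dimension change) only if the destination queue keeps room for at least two packets, so one free buffer (the "bubble") always remains in the ring. Names and the slot count are illustrative, not taken from the paper's VHDL design.

```python
# Sketch of the bubble flow-control injection rule (assumed formulation):
# entering a ring requires space for two packets, so the ring can never
# fill completely and transit traffic always keeps moving.
BUBBLE = 2  # free packet slots required to enter a ring (assumed value)

def can_enter_ring(free_slots: int) -> bool:
    """Restricted injection: entering a ring needs space for two packets."""
    return free_slots >= BUBBLE

def can_advance_in_ring(free_slots: int) -> bool:
    """Packets already travelling inside a ring only need one free slot."""
    return free_slots >= 1

# A queue with one free slot lets transit traffic advance but blocks new
# injections, which is what prevents deadlock without virtual channels.
assert can_advance_in_ring(1) and not can_enter_ring(1)
```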
Analytical models for rapid prototyping of multiprocessor systems

Proceedings of the 27th international ACM conference on International conference on supercomputing - ICS '13, 2013
This paper explores the benefits of scheduling off-chip memory operations in a Chip Multiprocessor (CMP) according to their execution relevance. In a scenario with many out-of-order execution cores in the CMP, from the processor's perspective the importance of the instruction that triggers an access to off-chip memory may vary considerably. Consequently, it makes sense to take this point of view into account at the memory controller level to reorder outgoing memory accesses. After exploring different processor-centric sorting criteria, we conclude that the simplest and most useful metric for scheduling a memory operation is the position in the reorder buffer of the instruction that triggers the on-chip miss. We propose a simple memory controller scheduling policy that employs this information as its main parameter. The proposal significantly improves system responsiveness, in terms of both throughput and fairness. The idea is analyzed through full-system simulation, running a broad set of workloads with diverse memory behavior. Compared with other scheduling algorithms of similar complexity, throughput improves by an average of 10% and fairness by an average of 15%, even in very adverse usage scenarios. Moreover, the idea supports dynamically favoring throughput or fairness according to end-user requirements.
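The scheduling criterion described in the abstract can be sketched as follows, under the assumption that each pending request carries the reorder-buffer (ROB) position of its triggering instruction: the memory controller services first the request whose instruction is closest to the ROB head, since that one is most likely to be stalling commit. All names are illustrative, not the paper's implementation.

```python
# Sketch of a processor-centric memory scheduling policy (assumed design):
# pick the pending off-chip request whose triggering instruction sits
# closest to the head of its reorder buffer.
from dataclasses import dataclass

@dataclass
class MemRequest:
    address: int
    rob_position: int  # distance from the ROB head when the miss was issued

def schedule(pending: list[MemRequest]) -> MemRequest:
    """Service the request triggered by the oldest (closest-to-head) instruction."""
    return min(pending, key=lambda r: r.rob_position)

queue = [MemRequest(0x100, 40), MemRequest(0x200, 3), MemRequest(0x300, 17)]
assert schedule(queue).address == 0x200  # closest to the ROB head wins
```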

IEEE Computer Architecture Letters, 2011
This paper presents a simple analytical model for predicting on-chip cache hierarchy effectiveness in chip multiprocessors (CMPs) for a state-of-the-art architecture. Given the complexity of this type of system, we use rough approximations, such as the empirical observation that the re-reference timing pattern follows a power law and the assumption of a simplistic delay model for the cache, in order to provide a useful model of memory hierarchy responsiveness. The model enables the analytical determination of average access time, which makes design-space pruning possible before sweeping the vast design space of this class of systems. The model is also useful for predicting cache hierarchy behavior in future systems. Its fidelity has been validated using a state-of-the-art, full-system simulation environment, on a system with up to sixteen out-of-order processors with coherent caches, using a broad spectrum of applications, including complex multithreaded workloads. This simple model can predict a near-optimal on-chip cache distribution while also estimating how future systems running future applications might behave.
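The flavor of such a model can be sketched with a toy calculation, assuming a power-law miss curve (miss rate shrinking as capacity to a negative exponent) combined with fixed per-level latencies to estimate average memory access time (AMAT). The exponent, scale, and latency values below are invented for illustration and are not the paper's fitted parameters.

```python
# Toy analytical cache model (assumed parameter values): a power-law
# miss curve per level, folded into the classic AMAT recurrence.
def miss_rate(capacity_kb: float, alpha: float = 0.5, scale: float = 1.0) -> float:
    """Power-law miss curve: bigger caches miss less, with diminishing returns."""
    return min(1.0, scale * capacity_kb ** -alpha)

def amat(l1_kb: float, l2_kb: float,
         t_l1: float = 1.0, t_l2: float = 10.0, t_mem: float = 200.0) -> float:
    """Average memory access time for a two-level hierarchy."""
    m1 = miss_rate(l1_kb)
    m2 = miss_rate(l2_kb)
    return t_l1 + m1 * (t_l2 + m2 * t_mem)

# Doubling L2 from 512 KB to 1 MB lowers the estimated AMAT, but by less
# than the previous doubling did: the power law captures diminishing returns.
assert amat(32, 1024) < amat(32, 512)
```

A sweep of `amat` over candidate (L1, L2) splits is exactly the kind of design-space pruning the abstract describes: cheap to evaluate analytically before committing to full-system simulation.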

IEEE Transactions on Parallel and Distributed Systems, 2016
This work shows how, by adapting replacement policies in contemporary cache hierarchies, it is possible to extend the lifespan of a write-endurance-limited main memory by almost one order of magnitude. The idea stems from two observations about cache residency: (1) blocks are modified in a bimodal way, either most of the content of the block is modified or most of it never changes; and (2) in most applications, the majority of blocks are only slightly modified. When the cache replacement algorithms take these facts into account, it is possible to significantly reduce the number of bit-flips per write-back to main memory. Our proposal favors the off-chip eviction of slightly modified blocks according to an adaptive replacement algorithm that operates in a coordinated way in L2 and L3. This makes it possible to significantly improve main-memory lifetime with negligible performance degradation. We found that a few bits per block are enough to track changes in cache blocks with respect to the main-memory content. With a slightly modified sectored LRU and a simple cache performance predictor, a straightforward implementation is achievable with minimal area cost and no impact on cache access time. On average, our proposal increases the memory lifetime obtained with an LRU policy up to ten times (10×), and up to fifteen times (15×) when combined with other memory-centric techniques. In both cases, the performance degradation can be considered negligible.
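The eviction bias can be sketched as follows, assuming a few bits per block record how many sectors differ from main memory: among the oldest blocks in a set, the replacement policy prefers the least-modified victim, so write-backs flip fewer bits. The names, the half-of-the-set window, and the counter granularity are illustrative assumptions, not the paper's exact sectored-LRU algorithm.

```python
# Sketch of an endurance-aware victim choice (assumed design): within the
# older half of a set, evict the block whose write-back would flip the
# fewest bits, as approximated by a coarse per-block dirty-sector counter.
from dataclasses import dataclass

@dataclass
class CacheBlock:
    tag: int
    lru_age: int        # higher means older (further from MRU)
    dirty_sectors: int  # coarse change counter: a few bits per block

def choose_victim(candidate_set: list[CacheBlock]) -> CacheBlock:
    """Among the oldest half of the set, evict the least-modified block."""
    by_age = sorted(candidate_set, key=lambda b: -b.lru_age)
    oldest_half = by_age[: (len(by_age) + 1) // 2]
    return min(oldest_half, key=lambda b: b.dirty_sectors)

ways = [CacheBlock(0xA, 4, 7), CacheBlock(0xB, 3, 1), CacheBlock(0xC, 1, 0)]
assert choose_victim(ways).tag == 0xB  # old but barely modified: cheap write-back
```

Restricting the choice to the older half is one way to trade endurance for hit rate: pure least-modified eviction could hurt performance, while pure LRU ignores bit-flip cost.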

Modeling of interconnection subsystems for massively parallel computers
Performance Evaluation, 2002
The analysis, design and evaluation of the interconnection subsystem for massively parallel architectures is normally carried out using computer simulation tools, which require high computational costs. Moreover, in some cases, these simulation processes present serious difficulties when both experiments and results have to be reproduced by other research or design teams. This work shows the suitability of formal representation methods, such as DSPNs (stochastic Petri nets with deterministic and exponential firing times), for describing message routers, focusing on two important features. First, network performance indicators can be obtained by simulating the resulting models at a lower computational cost than with conventional techniques; in some cases, analytical results can also be obtained. Second, the basic parameters of the network design become relatively independent of the router implementation features, thus simplifying the procedure for establishing the behavior of new router structures. This approach has been successfully applied to the analysis of both symmetrical torus and asymmetrical mesh interconnection topologies, with virtual cut-through flow control, oblivious routing and random traffic. It should be noted that most modern parallel computers employ a local buffer space big enough to store at least a complete packet. Two different functional router structures have been considered in each case: transit buffers located at the input or at the output router links.
Comparative study of the effect of slew-induced distortion on single-amplifier biquadratic stages
International Journal of Electronics, 1984
A comparative study is presented of the effect of slew-rate-induced distortion on the main single-amplifier biquadratic stages. Evidence is presented to show that only one of those stages is free of regenerative phenomena for all input conditions. A normalized comparison criterion is defined that can be used to establish the differences between the various biquadratic stages independently of the input-output transfer functions implemented in each filter. This criterion can be used to obtain the precise operating conditions that will guarantee a regenerative-phenomena-free linear response in each of the different structures.
Performance evaluation of parallel systems by using unbounded generalized stochastic Petri nets
IEEE Transactions on Software Engineering, 1992
Mercedes Granda, José M. Drake, and José A. Gregorio, IEEE Transactions on Software Engineering, vol. 18, no. 1, January 1992, p. 55.
A flow control mechanism to prevent message deadlock in k-ary n-cube networks
Beneficios del uso de la Red de Interconexión en la Aceleración de la Coherencia (Benefits of Using the Interconnection Network to Accelerate Coherence)
... When the number of processors is higher, the complete lack of scalability in bus bandwidth makes it necessary to use interconnection networks ...
TOPAZ: Un simulador de redes de interconexión para CMPs y supercomputadores (TOPAZ: An Interconnection Network Simulator for CMPs and Supercomputers)
Towards a Shared/Private Non-Uniform Cache Architecture in CMP Systems
Javier Merino, Valentín Puente, Pablo Prieto, José Ángel Gregorio. Grupo de Arquitectura y Tecnología de Computadores, Universidad de Cantabria.
Topology-aware CMP design

Cornell University - arXiv, Jan 24, 2019
Simulation is a fundamental research tool in the computer architecture field. These tools enable the exploration and evaluation of architectural proposals, capturing the most relevant aspects of the highly complex systems under study. Many state-of-the-art simulation tools focus on single-system scenarios, but the scalability required by trending applications has shifted towards distributed computing systems integrated via complex software stacks. Web services with client-server architectures and the distributed storage and processing of scale-out data analytics (Big Data) are among the prime examples. The complete simulation of a distributed computer system is the appropriate methodology for conducting accurate evaluations. Unfortunately, this methodology can significantly increase the already large computational effort of detailed simulation. In this work, we conduct a set of experiments to evaluate this accuracy/cost tradeoff. We measure the error made if client-server applications are evaluated in a single-node environment, as well as the overhead induced by the methodology and simulation tool employed for multi-node simulations. We quantify this error for different micro-architecture components, such as the last-level cache and the instruction/data TLBs. Our findings show that the accuracy loss can lead to completely wrong conclusions about the effects of proposed hardware optimizations. Fortunately, our results also demonstrate that the computational overhead of a multi-node simulation framework is affordable, suggesting multi-node simulation as the most appropriate methodology.

Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013
This paper introduces a new coherence protocol that addresses the challenges of complex multilevel cache hierarchies in future many-core systems. To keep coherence protocol complexity bounded, inclusiveness is required to track coherence information across levels in this type of system, but this may introduce unsustainable costs for directory structures. Cost-reduction decisions taken to reduce this complexity may introduce artificial inefficiencies in the on-chip cache hierarchy, especially when the number of cores and the size of the private caches are large. The coherence protocol presented in this work, denoted MOSAIC, introduces a new approach to tackling this problem. In energy terms, the protocol scales like a conventional directory coherence protocol, but it relaxes the inclusiveness of the shared information. This overcomes the performance implications of reducing directory size and associativity. Contrary to the common belief that inclusiveness is inescapable when attempting to keep complexity constrained, MOSAIC is even simpler than a conventional directory. The results of our evaluation show that the approach is quite insensitive, in terms of performance and energy expenditure, to the size and associativity of the directory.
Necessary and Sufficient Conditions for Deadlock-free Networks
In this paper we develop a new and generic theory of the necessary and sufficient conditions for deadlock-free routing in interconnection networks. An extension of the channel dependency graph described by Dally is defined: the channel dynamic dependency graph. The main achievement of this new concept is a consequence of introducing the notion of time and the flow control function into its definition. Our theory remains valid for different routing and flow control functions, showing that even if the conditions of Duato's theorem are not fulfilled, the network can be deadlock-free. Index Terms: multicomputer networks, deadlock, flow control, routing.
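For context, Dally's classic static condition, which this theory extends, states that a routing function is deadlock-free if its channel dependency graph is acyclic. A minimal sketch of that static cycle check is shown below; the paper's dynamic graph additionally incorporates time and flow control, which this sketch does not model. The graph encoding is illustrative.

```python
# Sketch of the static check underlying Dally's condition: a routing
# function is deadlock-free if the channel dependency graph (edges from
# a channel to the channels a packet may request next) has no cycle.
def has_cycle(deps: dict[str, list[str]]) -> bool:
    """DFS-based cycle detection over a channel dependency graph."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / finished
    color = {c: WHITE for c in deps}

    def dfs(c: str) -> bool:
        color[c] = GRAY
        for nxt in deps.get(c, []):
            if color.get(nxt, WHITE) == GRAY:   # back edge: cycle found
                return True
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and dfs(c) for c in deps)

ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c0"]}  # unidirectional ring: cyclic
broken = {"c0": ["c1"], "c1": ["c2"], "c2": []}    # dependency removed: acyclic
assert has_cycle(ring) and not has_cycle(broken)
```

The ring example is the canonical deadlock case that bubble flow control and virtual channels address in different ways; the paper's contribution is showing that flow control can break such cycles dynamically even when the static graph is cyclic.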