Papers by Ruben Gran Tejero
Zenodo (CERN European Organization for Nuclear Research), Sep 20, 2023
Towards the Inclusion of FPGAs on Commodity Heterogeneous Systems
Nowadays, most commodity heterogeneous systems consist of either a CPU+GPU or a CPU+FPGA. An attractive alternative is to merge both into a new class of heterogeneous system, CPU+GPU+FPGA, in order to combine the advantages of all three into a single platform. In a three-device heterogeneous system, as in a two-device system, we must face problems such as programmability versus device performance, data-buffer management, and workload distribution. In this work, we present the first steps towards a runtime that deals with these problems for a CPU+GPU+FPGA system.
Non-Speculative Techniques to Enhance Instruction Scheduling
"Truth is much too complicated to allow anything but mere approximations."

Parallel Computing, Mar 7, 2018
In this paper, we investigate how to enhance an existing software-defined framework to reduce overheads and enable the parallel utilization of all the programmable processing resources present in systems that include FPGA-based hardware accelerators. To remove overheads, a new hardware platform is created based on interrupts, which removes spin-locks and frees the processing resources. Additionally, instead of simply using the hardware accelerator to offload a task from the CPU, we propose a scheduler that dynamically distributes the tasks among all the resources to minimize load imbalance. The experimental evaluation shows that the interrupt-based heterogeneous platform increases performance by up to 22% while reducing energy requirements by 15%. Additionally, we measure between a 25% and 50% reduction in execution time when the CPU cores assist FPGA execution at the same level of energy requirements, depending on hardware speed-ups.
Zenodo (CERN European Organization for Nuclear Research), Sep 20, 2022

Consumers of personal devices such as desktops, tablets, or smartphones run applications based on image or video processing, as they enable natural computer-user interaction. The challenge with these computationally demanding applications is to execute them efficiently. One way to address this problem is to use on-chip heterogeneous systems, where tasks can execute on the device where they run most efficiently. In this paper, we discuss the optimization of a feature tracking application, written in OpenCL, running on an on-chip heterogeneous platform. Our results show that OpenCL can facilitate the programming of these heterogeneous systems because it provides a unified programming paradigm while delivering significant performance improvements. We show that, after optimization, our feature tracking application runs 3.2, 2.6, and 4.3 times faster and consumes 2.2, 3.1, and 2.7 times less energy when running on the multicore, the GPU, or both the CPU and the GPU of an Intel i7, respectively.

arXiv (Cornell University), Feb 9, 2018
In this paper, we introduce a software-defined framework that enables the parallel utilization of all the programmable processing resources available in heterogeneous systems-on-chip (SoC), including FPGA-based hardware accelerators and programmable CPUs. Two platforms with different architectures are considered, and a single C/C++ source code is used in both of them for the CPU and FPGA resources. Instead of simply using the hardware accelerator to offload a task from the CPU, we propose a scheduler that dynamically distributes the tasks among all the resources to fully exploit all computing devices while minimizing load imbalance. The multi-architecture study compares an ARMv7 and an ARMv8 implementation with different numbers and types of CPU cores, as well as different FPGA micro-architectures and sizes. We measure that both platforms benefit from having the CPU cores assist FPGA execution at the same level of energy requirements.
Gated-CNN: Combating NBTI and HCI aging effects in on-chip activation memories of Convolutional Neural Network accelerators
Journal of Systems Architecture, Jul 1, 2022
Zenodo (CERN European Organization for Nuclear Research), Sep 20, 2022

PLOS ONE, 2020
One of the key challenges in real-time systems is the analysis of the memory hierarchy. Many Worst-Case Execution Time (WCET) analysis methods supporting an instruction cache are based on iterative or convergence algorithms, which are rather slow. Our goal in this paper is to reduce the WCET analysis time on systems with a simple lockable instruction cache, focusing on the Lock-MS method. First, we propose an algorithm to obtain a structure-based representation of the Control Flow Graph (CFG). It organizes the whole WCET problem as nested subproblems, which takes advantage of the common branch-and-bound algorithms of Integer Linear Programming (ILP) solvers. Second, we add support for multiple locking points per task, each one with specific cache contents, instead of a single locked content for the whole task execution. Locking points are set heuristically before outer loops. Such a simple heuristic adds no complexity and reduces the WCET by exploiting the temporal reuse found in loops. Since loops can be processed as isolated regions, the optimal contents to lock into the cache for each region can be obtained, and the WCET analysis time is further reduced. With these two improvements, our WCET analysis is around 10 times faster than other approaches. Also, our results show that the WCET is reduced, and the hit ratio achieved for the lockable instruction cache is similar to that of a real execution with an LRU instruction cache. Finally, we analyze the WCET sensitivity to compiler optimization, showing the right choices for each benchmark and pointing out that O0 is always the worst option.
Jornada de Jóvenes Investigadores del I3A, 2018
This communication presents part of the work being carried out within the doctoral thesis of Mª Angélica Dávila-Guzmán. Specifically, a workload scheduler is being developed for heterogeneous computing systems composed of devices such as CPUs, GPUs, and FPGAs. This work is at an early stage; we present its objectives and the first results achieved.
Jornada de Jóvenes Investigadores del I3A, 2017
In real-time systems it is essential to analyze the worst-case execution time (WCET) of tasks, which is especially difficult in the presence of cache memories. Our proposal automates this analysis through the Lock-MS method for lockable instruction caches. This makes it possible to analyze complex programs while arbitrarily varying the relevant hardware or software parameters.

The Journal of Supercomputing, 2019
Heterogeneous computing that exploits simultaneous co-processing with different device types has been shown to be effective at both increasing performance and reducing energy consumption. In this paper we extend a scheduling framework, encapsulated in a high-level C++ template and previously developed for heterogeneous chips comprising CPU and GPU cores, to new high-performance platforms for the data center, which include a cache-coherent FPGA fabric and manycore CPU resources. Our goal is to evaluate the suitability of our framework on these new FPGA-based platforms, identifying performance benefits and limitations. We target the state-of-the-art HARP processor, which includes 14 high-end Xeon-class cores tightly coupled to an FPGA device located in the same package. We select 8 benchmarks from the High-Performance Computing domain that have been ported and optimized for this heterogeneous platform. The results show that a dynamic and adaptive scheduler that exploits simultaneous processing among the devices can improve performance by up to a factor of 8x compared to the best alternative solutions that only use the CPU cores or the FPGA fabric. Moreover, our proposal achieves up to 15% and 37% improvement compared to the best heterogeneous solutions found with purely Dynamic and Static schedulers, respectively.

IEEE Transactions on Very Large Scale Integration Systems, Feb 1, 2016
The efficiency of the reconfiguration process in modern FPGAs can improve drastically if an on-chip configuration memory is included in the system, because it can reduce both the reconfiguration latency and its energy consumption. However, FPGA on-chip memory resources are very limited. Thus, it is very important to manage them effectively in order to improve the reconfiguration process as much as possible, even when the size of the on-chip configuration memory is small. This paper presents a hardware implementation of an on-chip configuration memory controller that efficiently manages run-time reconfigurations. In order to optimize the use of the on-chip memory, this controller includes support for configurations that have been divided into blocks of customizable size. When a reconfiguration must be carried out, our controller provides the blocks stored on-chip and retrieves the remaining blocks by accessing the off-chip configuration memory. Moreover, it dynamically decides which blocks must be stored on-chip. To this end, the designed controller implements a simple but efficient technique that maximizes the benefits of the on-chip memories. Experimental results demonstrate that its implementation cost is very affordable and that it introduces negligible run-time management overheads.
peRISCVcope: A Tiny Teaching-Oriented RISC-V Interpreter
2022 37th Conference on Design of Circuits and Integrated Circuits (DCIS)
Lightweight asynchronous scheduling in heterogeneous reconfigurable systems
Journal of Systems Architecture, 2022
Ideal and predictable hit ratio for matrix transposition in data caches

IEEE Access, 2020
The ever-increasing parallelism demand of General-Purpose Graphics Processing Unit (GPGPU) applications pushes toward larger and more energy-hungry register files in successive GPU generations. Reducing the supply voltage beyond its safe limit is an effective way to improve the energy efficiency of register files. However, at these operating voltages, the reliability of the circuit is compromised. This work aims to tolerate permanent faults from process variations in large GPU register files operating below the safe supply-voltage limit. To do so, this paper proposes a microarchitectural patching technique, DC-Patch, exploiting the inherent data redundancy of applications to compress registers at run time with neither compiler assistance nor instruction-set modifications. Instead of disabling an entire faulty register-file entry, DC-Patch leverages the reliable cells within a faulty entry to store compressed register values. Experimental results show that, with more than a third of ...
A generic framework to integrate data caches in the WCET analysis of real-time systems
Journal of Systems Architecture, 2021

Mathematics, 2020
Matrix transposition is a fundamental operation, but it may present a very low and hardly predictable data cache hit ratio for large matrices. Safe (worst-case) hit-ratio predictability is required in real-time systems. In this paper, we obtain the relations among the cache parameters that guarantee the ideal (predictable) data hit ratio, assuming a Least Recently Used (LRU) data cache. Considering our analytical assessments, we compare a tiling matrix transposition to a cache-oblivious algorithm modified with phantom padding to improve its data hit ratio. Our results show that, with an adequate tile size, the tiling version achieves an equal or better data hit ratio. We also analyze the energy consumption and execution time of matrix transposition on real hardware with pseudo-LRU (PLRU) caches. Our analytical hit/miss assessment enables the use of a data cache for matrix transposition in real-time systems, since the number of misses in the worst case is bounded. In general and hi...