Papers by Laura Tosoratto
Dynamic Many-process Applications on Many-tile Embedded Systems and HPC Clusters: the EURETILE programming environment and execution platforms
Journal of Systems Architecture, 2015
APEnet+ 34 Gbps data transmission system and custom transmission logic
Journal of Instrumentation, 2013
APEnet+ is a point-to-point, low-latency, 3D-torus network controller integrated in a PCIe Gen2 board based on the Altera Stratix IV FPGA. We characterize the transmission system (embedded transceivers driving external QSFP+ modules) analyzing signal integrity, throughput, latency, BER and jitter at different data rates up to 34 Gbps. We estimate the efficiency of a custom logic able to sustain 2.6 GB/s per link with an FPGA on-chip memory footprint of 40 KB, providing deadlock-free routing and systemic awareness of faults. Finally, we show the preliminary results obtained with the embedded transceivers of a next-generation FPGA and outline some ideas to increase the performance with the same FPGA memory footprint.

Journal of Physics: Conference Series, 2014
APEnet+ is an INFN (Italian Institute for Nuclear Physics) project aiming to develop a custom 3-dimensional torus interconnect network optimized for hybrid CPU-GPU clusters dedicated to High Performance scientific Computing. The APEnet+ interconnect fabric is built on an FPGA-based PCI-express board with 6 bi-directional off-board links showing 34 Gbps of raw bandwidth per direction, and leverages the peer-to-peer capabilities of Fermi and Kepler-class NVIDIA GPUs to obtain real zero-copy, GPU-to-GPU low latency transfers. The minimization of APEnet+ transfer latency is achieved through the adoption of an RDMA protocol implemented in FPGA with specialized hardware blocks tightly coupled with an embedded microprocessor. This architecture provides a high performance, low latency offload engine for both the transmit and receive sides of data transactions: preliminary results are encouraging, showing a 50% bandwidth increase for large packet size transfers. In this paper we describe the APEnet+ architecture, detailing the hardware implementation, and discuss the impact of such RDMA specialized hardware on host interface latency and bandwidth.

2013 IEEE Nuclear Science Symposium and Medical Imaging Conference (2013 NSS/MIC), 2013
Interest in many-core architectures applied to real time selections is growing in High Energy Physics (HEP) experiments. In this paper we describe performance measurements of many-core devices when applied to a typical HEP online task: the selection of events based on the trajectories of charged particles. We use as a benchmark a scaled-up version of the algorithm used at the CDF experiment at the Tevatron for online track reconstruction (the SVT algorithm) as a realistic test case for low-latency trigger systems using new computing architectures for LHC experiments. We examine the complexity/performance trade-off in porting existing serial algorithms to many-core devices. We measure the performance of different architectures (Intel Xeon Phi and AMD GPUs, in addition to NVidia GPUs) and different software environments (OpenCL, in addition to NVidia CUDA). Measurements of both data processing and data transfer latency are shown, considering different I/O strategies to/from the many-core devices.

GPUs for real-time processing in HEP trigger systems
Journal of Physics: Conference Series, 2014
We describe a pilot project for the use of Graphics Processing Units (GPUs) for online triggering applications in High Energy Physics (HEP) experiments. Two major trends can be identified in the development of trigger and DAQ systems for HEP experiments: the massive use of general-purpose commodity systems such as commercial multicore PC farms for data acquisition, and the reduction of trigger levels implemented in hardware, towards a pure software selection system (trigger-less). The very innovative approach presented here aims at exploiting the parallel computing power of commercial GPUs to perform fast computations in software at both low- and high-level trigger stages. General-purpose computing on GPUs is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerators in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughputs, the use of such devices for real-time applications in high-energy physics data acquisition and trigger systems is becoming very attractive. We discuss in detail the use of online parallel computing on GPUs for synchronous low-level triggers with fixed latency. In particular we show preliminary results on a first test in the NA62 experiment at CERN. The use of GPUs in high-level triggers is also considered; the ATLAS experiment at CERN (and in particular its muon trigger) is taken as a case study of possible applications.
Journal of Instrumentation, 2014
NaNet is an FPGA-based PCIe X8 Gen2 NIC supporting 1/10 GbE links and the custom 34 Gbps APElink channel. The design has GPUDirect RDMA capabilities and features a network stack protocol offloading module, making it suitable for building low-latency, real-time GPU-based computing systems. We provide a detailed description of the NaNet hardware modular architecture. Benchmarks for latency and bandwidth for GbE and APElink channels are presented, followed by a performance analysis on the case study of the GPU-based low level trigger for the RICH detector in the NA62 CERN experiment, using either the NaNet GbE or the APElink channel. Finally, we give an outline of future project activities.
Architectural improvements and technological enhancements for the APEnet+ interconnect system
Journal of Instrumentation, 2015

Applications of GPUs to online track reconstruction in HEP experiments
2012 IEEE Nuclear Science Symposium and Medical Imaging Conference Record (NSS/MIC), 2012
One of the most important issues that particle physics experiments at hadron colliders have to solve is real-time selection of the most interesting events. Typical collision frequencies do not allow all events to be written to tape for offline analysis, and in most cases, only a small fraction can be saved. The most commonly used strategy is based on two or three selection levels, with the low level ones usually exploiting dedicated hardware to decide within a few to ten microseconds if the event should be kept or not. This strict time requirement has made the usage of commercial devices inadequate, but recent improvements to Graphics Processing Units (GPUs) have substantially changed the conditions. Thanks to their highly parallel, multi-threaded, multicore architecture with remarkable computational power and high memory bandwidth, these commercial devices may be used in scientific applications, among which the event selection system (trigger) in particular may benefit, even at low levels. This paper describes the results of an R&D project to study the performance of GPU technology for low latency applications, such as HEP fast tracking trigger algorithms. On two different setups, we measure the latency to transfer data to/from the GPU, exploring the timing of different I/O technologies on different GPU models. We then describe the implementation and the performance of a track fitting algorithm which mimics the CDF Silicon Vertex Tracker. These studies provide performance benchmarks to investigate the potential and limitations of GPUs for future real-time applications in HEP experiments.

Journal of Physics: Conference Series, 2014
Interest in parallel architectures applied to real time selections is growing in High Energy Physics (HEP) experiments. In this paper we describe performance measurements of Graphics Processing Units (GPUs) and the Intel Many Integrated Core architecture (MIC) when applied to a typical HEP online task: the selection of events based on the trajectories of charged particles. We use as a benchmark a scaled-up version of the algorithm used at the CDF experiment at the Tevatron for online track reconstruction (the SVT algorithm) as a realistic test case for low-latency trigger systems using new computing architectures for LHC experiments. We examine the complexity/performance trade-off in porting existing serial algorithms to many-core devices. Measurements of both data processing and data transfer latency are shown, considering different I/O strategies to/from the parallel devices.
legaSCi
ACM Transactions on Embedded Computing Systems, 2014
Design and implementation of a modular, low latency, fault-aware, FPGA-based network interface
2013 International Conference on Reconfigurable Computing and FPGAs (ReConFig), 2013

legaSCi: Legacy SystemC Model Integration into Parallel SystemC Simulators
2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, 2013
Virtual prototyping of parallel and embedded systems increases insight into existing computer systems. It also makes it possible to explore the properties of new systems as early as their specification phase. Virtual prototypes of such systems benefit from parallel simulation techniques due to the increased simulation speed. One common problem that implementers of full system simulators face is the revision and integration of legacy models coded without thread-safety and deterministic behavior in mind. To lessen this burden, this paper presents a methodology to integrate unmodified SystemC legacy models into parallel SystemC simulators. Using the proposed technique, the embedded platform simulator of the EU FP7 project EURETILE, which inherited a number of legacy models from its predecessor project SHAPES, has been transformed into a parallel simulation platform, demonstrating speed-ups of up to 3.36 on four simulation host cores.
Time-decoupled parallel SystemC simulation
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2014
High speed data transfer with FPGAs and QSFP+ modules
IEEE Nuclear Science Symposium Conference Record, 2010
... The test system is composed of an Altera Stratix IV GX 230 development kit [6], based on an Altera Stratix IV FPGA with 230k Logic Elements, 14 Mbit of embedded memory and 24 serial transceivers (see figure 1). 8 serial transceivers are used for the PCI-Express gen 2 ...

Journal of Instrumentation, 2011
We present test results and characterization of a data compression system for the readout of the NA62 liquid krypton calorimeter trigger processor. The Level-0 electromagnetic calorimeter trigger processor of the NA62 experiment at CERN receives digitized data from the calorimeter main readout board. These data are stored in an on-board DDR2 RAM and read out upon reception of a Level-0 accept signal. The maximum raw data throughput from the trigger front-end cards is 2.6 Gbps. To read out these data over two Gbit Ethernet interfaces we investigated different implementations of a data compression system based on Rice-Golomb coding: one is implemented in the FPGA as a custom block and one is implemented on the FPGA embedded processor running C code. The two implementations are tested on a set of sample events and compared with respect to achievable readout bandwidth.
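As a rough illustration of the Rice-Golomb coding named in the abstract above, the following minimal sketch encodes a non-negative integer with a power-of-two (Rice) parameter. The function names, the bit-string representation and the unary convention (ones terminated by a zero) are assumptions made for this example; the abstract does not give the parameters or the layout used in the NA62 implementations.

```python
def rice_encode(value: int, k: int) -> str:
    """Encode a non-negative integer as a Rice code with parameter k.

    The quotient value >> k is written in unary (that many '1' bits
    followed by a terminating '0'); the remainder follows in k binary bits.
    """
    if value < 0 or k < 0:
        raise ValueError("value and k must be non-negative")
    quotient = value >> k
    bits = "1" * quotient + "0"
    if k > 0:
        remainder = value & ((1 << k) - 1)
        bits += format(remainder, "0{}b".format(k))
    return bits


def rice_decode(bits: str, k: int) -> int:
    """Decode a single Rice code word produced by rice_encode."""
    quotient = bits.index("0")  # number of leading '1' bits
    start = quotient + 1
    remainder = int(bits[start:start + k], 2) if k > 0 else 0
    return (quotient << k) | remainder
```

For instance, `rice_encode(9, 2)` splits 9 into quotient 2 and remainder 1, giving `"11001"`. Small values produce short code words, which is why this family of codes suits data whose samples are mostly small.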

A hierarchical watchdog mechanism for systemic fault awareness on distributed systems
Future Generation Computer Systems, 2015
Systemic fault tolerance is usually pursued with a number of strategies, like redundancy and checkpoint/restart; any of them needs to be triggered by safe and fast fault detection. We devised a hardware/software approach to fault detection that enables system-level Fault Awareness by implementing a hierarchical Mutual Watchdog. It relies on an improved high performance Network Interface Card (NIC), implementing an n-dimensional mesh topology and a Service Network. The hierarchical watchdog mechanism is able to quickly detect faults on each node, as the Host and the high performance NIC guard each other while every node monitors its own first neighbours in the mesh. Duplicated and distributed Supervisor Nodes receive diagnostic messages routed through either the Service Network or the n-dimensional Network, then assemble a global picture of the system status. In this way our approach achieves Fault Awareness with no single point of failure. We describe an implementation of this hardware/software co-design for our high performance 3D torus NIC, with a focus on how routed diagnostic messages do not affect system performance.
ASIP acceleration for virtual-to-physical address translation on RDMA-enabled FPGA-based network interfaces
Future Generation Computer Systems, 2015

LO-FA-MO: Fault Detection and Systemic Awareness for the QUonG Computing System
2014 IEEE 33rd International Symposium on Reliable Distributed Systems, 2014
QUonG is a parallel computing platform developed at INFN and equipped with commodity multi-core CPUs coupled with last generation NVIDIA GPUs. Computing nodes communicate through a point-to-point, high performance, low latency 3D torus network implemented by the APEnet+ FPGA-based interconnect. Scaling of this cluster towards peta- and possibly exascale is a prominent investigation point and in this context fault tolerance issues are structural. Typical fault tolerance solutions for HPC systems (e.g. checkpoint/restart) need to be triggered to be applied in an automated and transparent way, or at least knowledge about occurring faults must be propagated in order to prompt a readjustment: an effective tool to detect faults and make the system aware of them is required. Thus, as a first step towards a fault tolerant QUonG we designed the Local Fault Monitor (LO|FA|MO), an HW/SW solution aimed at providing systemic fault awareness. LO|FA|MO allows the detection of node faults thanks to a mutual watchdog mechanism between the host and the APEnet+ NIC; moreover, diagnostic messages can be delivered to neighbour nodes through both the 3D network and a secondary connection for service communication. The double path ensures that no fault remains unknown at the global level, guaranteeing systemic fault awareness with no single point of failure. In this paper we describe our LO|FA|MO implementation, reporting preliminary measurements that show scalability and its next to nil impact on system performance.

Virtual-to-Physical address translation for an FPGA-based interconnect with host and GPU remote DMA capabilities
2013 International Conference on Field-Programmable Technology (FPT), 2013
We developed a custom FPGA-based Network Interface Controller named APEnet+ aimed at GPU accelerated clusters for High Performance Computing. The card exploits peer-to-peer capabilities (GPU-Direct RDMA) for the latest NVIDIA GPGPU devices and the RDMA paradigm to perform fast direct communication between computing nodes, offloading the host CPU from network task execution. In this work we focus on the implementation of a Virtual to Physical address translation mechanism, using the FPGA embedded soft-processor. Address management is the most demanding task for the NIC receiving side (we estimated up to 70% of the μC load), which makes it the main cause of the data bottleneck. To improve the performance of this task and hence improve data transfer over the network, we added a specialized hardware logic block acting as a Translation Lookaside Buffer. This block makes use of a peculiar Content Addressable Memory implementation designed for scalability and speed. We present detailed measurements to demonstrate the benefits coming from the introduction of such custom logic: a substantial address translation latency reduction (from a measured value of 1.9 μs to 124 ns) and a performance enhancement of both host-bound and GPU-bound data transfers (up to ∼60% bandwidth increase) in given message size ranges.
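The idea of putting a Translation Lookaside Buffer in front of a slower translation path can be sketched in software as below. This is only an illustrative model, not the APEnet+ design: the class name, the 4 KiB page size, the FIFO replacement policy and the dictionary standing in for the Content Addressable Memory are all assumptions made for this example.

```python
PAGE_SHIFT = 12                     # assumed 4 KiB pages (not stated in the abstract)
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TinyTlb:
    """Toy TLB: a small associative cache in front of a slow page-table lookup."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table  # slow path, stands in for the soft-processor lookup
        self.capacity = capacity
        self.entries = {}             # fast path, stands in for the CAM (hypothetical)
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        """Return the physical address for vaddr, caching the page mapping."""
        vpn = vaddr >> PAGE_SHIFT     # virtual page number
        offset = vaddr & PAGE_MASK
        if vpn in self.entries:
            self.hits += 1
            pfn = self.entries[vpn]
        else:
            self.misses += 1
            pfn = self.page_table[vpn]  # raises KeyError for an unmapped page
            if len(self.entries) >= self.capacity:
                # Evict the oldest inserted entry (FIFO, an assumed policy).
                self.entries.pop(next(iter(self.entries)))
            self.entries[vpn] = pfn
        return (pfn << PAGE_SHIFT) | offset
```

Repeated accesses to the same page take the fast path after the first miss, which is the effect the abstract quantifies as the drop from 1.9 μs to 124 ns in hardware.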
2012 IEEE Nuclear Science Symposium and Medical Imaging Conference Record (NSS/MIC), 2012
APEnet+ is our custom developed PCIe gen2 board based on an Altera Stratix IV FPGA. We demonstrate reliable usage of Altera's embedded transceivers coupled with QSFP+ (Quad Small Form-factor Pluggable) technology. The QSFP+ standard defines a hot-pluggable transceiver available in copper or optical cable assemblies for an aggregated bandwidth of up to 40 Gbps. We use embedded transceivers in a 4 lane configuration, each one capable of 8.5 Gbps, for an aggregate bandwidth of 34 Gbps per link. On Stratix IV 290 we can place up to 6 bidirectional links, together with a PCIe gen2 x8 hard IP. We describe the design and implementation of this data transmission system.
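The per-link figure in this abstract follows from simple arithmetic on the lane count and lane rate, as the short sketch below shows. The constant and function names are hypothetical, the board-level total is derived here rather than quoted from the paper, and all values are raw line rates that ignore any encoding or protocol overhead.

```python
LANES_PER_LINK = 4        # from the abstract: 4 lane configuration
LANE_RATE_GBPS = 8.5      # from the abstract: 8.5 Gbps per lane
LINKS_PER_BOARD = 6       # from the abstract: up to 6 bidirectional links


def link_raw_gbps(lanes=LANES_PER_LINK, lane_rate=LANE_RATE_GBPS):
    """Raw aggregate bandwidth of one link, per direction, in Gbps."""
    return lanes * lane_rate


def board_raw_gbps(links=LINKS_PER_BOARD):
    """Raw aggregate bandwidth of all links on a board, per direction, in Gbps."""
    return links * link_raw_gbps()


def gbps_to_gbytes_per_s(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8
```

With the numbers above, one link carries 34 Gbps raw (4.25 GB/s), and six links together carry 204 Gbps raw per direction.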