We demonstrate that a small library of customizable interconnect components permits low-area, high-performance, reliable communication tuned to an application, by analogy with the way designers customize their compute. Whilst soft cores for standard protocols (Ethernet, RapidIO, Infiniband, Interlaken) are a boon for FPGA-to-other-system interconnect, we argue that they are inefficient and unnecessary for FPGA-to-FPGA interconnect. Using the example of BlueLink, our lightweight pluggable interconnect library, we describe how to construct reliable FPGA clusters from hundreds of lower-cost commodity FPGA boards. Utilizing the increasing number of serial links on FPGAs demands efficient use of soft logic, making domain-optimized custom interconnect attractive for some time to come.
Prototyping large SoCs (Systems on Chip) using multiple FPGAs introduces a risk of errors on inter-FPGA links, which raises the question of how we can demonstrate the correctness of a SoC prototyped in this way. We propose using high-speed serial interconnect between FPGAs, with a transparent error detection and correction protocol working on a link-by-link basis. Our inter-FPGA interconnect has an interface that resembles that of a network-on-chip, providing a consistent interface to a prototype SoC and masking the difference between on-chip and off-chip interconnect. Low-latency communication and low area usage are favoured at the expense of a little bandwidth inefficiency, a trade-off we believe is appropriate given the high bandwidth of inter-FPGA links.
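The abstract does not spell out the error detection and correction mechanism, so the following Python sketch is only an illustration of the kind of link-level scheme described: per-flit CRC and sequence numbering with retransmission on failure. The framing, field widths and go-back-N recovery are assumptions for exposition, not the published design.

```python
# Illustrative model of transparent link-level error detection and retransmission.
# The 8-bit sequence number, CRC-8 and go-back-N recovery are assumptions made
# for exposition only; they are not taken from the published design.

def crc8(data: bytes, poly: int = 0x07) -> int:
    """Simple CRC-8 over the flit body."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def make_flit(seq: int, payload: bytes) -> bytes:
    """Prepend a sequence number and append a CRC so the receiver can detect corruption."""
    body = bytes([seq & 0xFF]) + payload
    return body + bytes([crc8(body)])

def receive_flit(flit: bytes, expected_seq: int):
    """Return (payload, ack_seq) on success, or (None, last_good_seq) to request retransmission."""
    body, crc = flit[:-1], flit[-1]
    if crc8(body) != crc or body[0] != (expected_seq & 0xFF):
        return None, (expected_seq - 1) & 0xFF   # NAK: sender rewinds to this point
    return body[1:], expected_seq & 0xFF          # ACK

if __name__ == "__main__":
    good = make_flit(5, b"hello")
    corrupted = bytes([good[0]]) + b"hellp" + good[-1:]   # single-byte error in flight
    print(receive_flit(good, 5))        # (b'hello', 5)
    print(receive_flit(corrupted, 5))   # (None, 4) -> sender retransmits from seq 5
```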
Managing the memory wall is critical for massively parallel FPGA applications where data-sets are large and external memory must be used. We demonstrate that a soft vector processor can efficiently stream data from external memory whilst running computation in parallel. A non-trivial neural computation case study illustrates that multi-core vector processing coupled with careful layout of data structures performs similarly to an elaborate full-custom memory controller and execution pipeline. The vector processing version was far simpler to code, so we encourage others to consider vector machines before contemplating a full-custom architecture on FPGA.
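As a software analogy of the streaming technique described above, the sketch below overlaps fetching of data chunks with computation using double buffering. The chunk size, the background reader thread and the stand-in computation are illustrative assumptions; the soft vector processor realises the equivalent overlap in hardware.

```python
# Software analogy of overlapping external-memory streaming with computation
# (double buffering). The buffer size and the background thread are illustrative
# assumptions; the paper's soft vector processor performs the same overlap in
# hardware with streamed memory accesses.
import threading
import queue

CHUNK = 4096  # illustrative chunk size, in elements

def stream_reader(data, out_q):
    """Producer: models the memory controller streaming chunks from external memory."""
    for i in range(0, len(data), CHUNK):
        out_q.put(data[i:i + CHUNK])
    out_q.put(None)  # end-of-stream marker

def process(data):
    """Consumer: computes on one chunk while the next is being fetched."""
    total = 0
    q = queue.Queue(maxsize=2)          # two buffers in flight: one loading, one computing
    reader = threading.Thread(target=stream_reader, args=(data, q))
    reader.start()
    while (chunk := q.get()) is not None:
        total += sum(x * x for x in chunk)   # stand-in for the vector computation
    reader.join()
    return total

if __name__ == "__main__":
    print(process(list(range(100_000))))
```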
"Reverse-engineering the brain is one of the US National Academy of Engineering’s “Grand Challeng... more "Reverse-engineering the brain is one of the US National Academy of Engineering’s “Grand Challenges.” The structure of the brain can be examined at many different levels, spanning many disciplines from low-level biology through psychology and computer science. This thesis focusses on real-time computation of large neural networks using the Izhikevich spiking neuron model.
Neural computation has been described as “embarrassingly parallel” as each neuron can be thought of as an independent system, with behaviour described by a mathematical model. However, the real challenge lies in modelling neural communication. While the connectivity of neurons has some parallels with that of electrical systems, its high fan-out results in massive data processing and communication requirements when modelling neural communication, particularly for real-time computations.
It is shown that memory bandwidth is the most significant constraint on the scale of real-time neural computation, followed by communication bandwidth. This leads to a decision to implement a neural computation system on a platform based on a network of Field Programmable Gate Arrays (FPGAs), using commercial off-the-shelf components with some custom supporting infrastructure. This brings implementation challenges, particularly the lack of on-chip memory, but also many advantages, particularly high-speed transceivers. An algorithm to model neural communication that makes efficient use of memory and communication resources is developed and then used to implement a neural computation system on the multi-FPGA platform.
Finding suitable benchmark neural networks for a massively parallel neural computation system proves to be a challenge. A synthetic benchmark with biologically-plausible fan-out, spike frequency and spike volume is proposed and used to evaluate the system, which is shown to be capable of computing the activity of a network of 256k Izhikevich spiking neurons, each with a fan-out of 1k, in real-time using a network of 4 FPGA boards. This compares favourably with previous work, with the added advantage of scalability to larger neural networks using more FPGAs.
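To give a feel for the memory-bandwidth constraint identified above, a rough estimate at the benchmark scale (256k neurons, fan-out of 1k) is sketched below. The mean firing rate and data-structure sizes are assumed values chosen for illustration, not figures taken from the thesis.

```python
# Back-of-envelope estimate of why memory bandwidth dominates at the benchmark
# scale above (256k neurons, fan-out 1k). The 10 Hz mean firing rate, 4-byte
# synaptic entries and 16-byte neuron state are illustrative assumptions.
NEURONS      = 256_000
FAN_OUT      = 1_000
FIRING_RATE  = 10        # Hz, a biologically plausible mean rate (assumed)
SYNAPSE_SIZE = 4         # bytes fetched per target synapse when a neuron spikes
STATE_SIZE   = 16        # bytes of neuron state read+written per 1 ms timestep
TIMESTEPS    = 1_000     # timesteps per second of real time

synaptic_bw = NEURONS * FIRING_RATE * FAN_OUT * SYNAPSE_SIZE   # bytes/s
state_bw    = NEURONS * TIMESTEPS * STATE_SIZE                 # bytes/s

print(f"synaptic fetch traffic: {synaptic_bw / 1e9:.1f} GB/s")   # ~10.2 GB/s
print(f"neuron state traffic:   {state_bw / 1e9:.1f} GB/s")      # ~4.1 GB/s
```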
It is concluded that communication must be considered as a first-class design constraint when implementing massively parallel neural computation systems.
Bluehive is a custom 64-FPGA machine targeted at scientific simulations with demanding communication requirements. Bluehive is designed to be extensible, with a reconfigurable communication topology suited to algorithms with demanding high-bandwidth and low-latency communication, something which is unattainable with commodity GPGPUs and CPUs. We demonstrate that a spiking neuron algorithm can be efficiently mapped to Bluehive using Bluespec SystemVerilog by taking a communication-centric approach. This contrasts with many FPGA-based neural systems, which are very focused on parallel computation, resulting in inefficient use of FPGA resources. Our design allows 64k neurons with 64M synapses per FPGA and is scalable to a large number of FPGAs.
Communication on- and off-chip now dominates the power and performance of modern electronic circuits. We propose the use of modern field programmable gate arrays (FPGAs) to investigate the communication properties of systems capable of simulating one billion neurons. Each FPGA provides gigabits per second of chip-to-chip communication bandwidth and on- and off-chip memory bandwidth. The FPGA structure allows us to control the allocation of this bandwidth in great detail, allowing optimisations and analysis to be performed. We present our architectural explorations and initial findings.
In this paper we demonstrate how error-correcting addition and multiplication can be performed using self-checking modules. Our technique is based on the observation that a suitably designed full adder, in the presence of any single stuck-at fault, produces the fault-free complement of the desired output when fed with the complement of its functional input. We initially apply conventional parity-based error detection in arithmetic modules; upon detection of a fault, this is followed by input inversion, recomputation, and suitable output inversion. We present adder, register and multiplier designs that can be used in this context. We also design a large-scale circuit using this technique (an elliptical filter), outlining the area savings with respect to traditional triple modular redundancy.
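The observation the technique rests on is the self-duality of the full adder: complementing all of its inputs complements both outputs, so a single stuck-at fault that corrupts the normal computation leaves the complemented recomputation fault-free. The Python sketch below demonstrates this recovery on a ripple-carry adder with an injected stuck-at fault; the 8-bit width and the fault-injection mechanism are illustrative choices, and the parity-based detection step from the paper is not modelled.

```python
# Demonstration of the core observation above: when a single stuck-at fault
# corrupts the normal addition, recomputing with complemented operands (and
# complemented carry-in) through the same self-dual structure, then inverting
# the result, recovers the fault-free sum. Width and fault model are illustrative.

WIDTH = 8
MASK = (1 << WIDTH) - 1

def ripple_add(a, b, cin=0, stuck=None):
    """stuck = (bit_index, node, value) forces the 'sum' or 'carry' node of one
    full adder to a constant, modelling a single stuck-at fault."""
    s, carry = 0, cin
    for i in range(WIDTH):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        sum_bit = ai ^ bi ^ carry
        carry_out = (ai & bi) | (ai & carry) | (bi & carry)
        if stuck and stuck[0] == i:
            if stuck[1] == "sum":
                sum_bit = stuck[2]
            else:
                carry_out = stuck[2]
        s |= sum_bit << i
        carry = carry_out
    return s, carry

def add_with_recovery(a, b, fault):
    normal, _ = ripple_add(a, b, stuck=fault)
    # Recompute with complemented operands and carry-in; inverting the output of
    # this self-dual structure yields the fault-free sum.
    redo, _ = ripple_add(a ^ MASK, b ^ MASK, cin=1, stuck=fault)
    return normal, redo ^ MASK

if __name__ == "__main__":
    a, b = 0x5A, 0x36
    # Fault chosen so it manifests in the normal computation (correct bit 4 is 1);
    # in the paper's scheme this manifestation is what the parity check detects.
    fault = (4, "sum", 0)                    # sum bit 4 stuck-at-0
    normal, recovered = add_with_recovery(a, b, fault)
    print(f"correct   : {(a + b) & MASK:#04x}")   # 0x90
    print(f"faulty    : {normal:#04x}")           # 0x80, corrupted by the fault
    print(f"recovered : {recovered:#04x}")        # 0x90, matches the correct sum
```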
Lecture given to Year 10 (15 year old) students to explain some of what Computer Science is about, particularly focussing on the basic operation of a processor.