RISC-V2: A Scalable RISC-V Vector Processor
2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020
https://doi.org/10.1109/ISCAS45731.2020.9181071

Abstract
Machine learning has seen widespread adoption in recent years, with neural network implementations at the forefront. In light of these developments, vector processors are experiencing a resurgence of interest, owing to their inherent suitability for accelerating the data-parallel algorithms that dominate machine learning workloads. In this paper, we propose a scalable and high-performance RISC-V vector processor core. The presented processor employs a triptych of novel mechanisms that work synergistically to achieve the desired goals. An enhanced vector-specific incarnation of register renaming is proposed to facilitate dynamic hardware loop unrolling and alleviate instruction dependencies. Moreover, a cost-efficient decoupled execution scheme splits instructions into execution and memory-access streams, while hardware support for reductions accelerates the execution of key instructions in the RISC-V ISA. Extensive performance evaluation and hardware synthesis analysis validate the efficiency of the new architecture.
FAQs
What performance gains does the register remapping technique achieve in vector processors?
The proposed design achieves a 2.1× increase in average throughput, measured in Elements Per Cycle (EPC), thanks to the improved instruction scheduling enabled by register remapping and dynamic register allocation.
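The renaming idea behind this answer can be illustrated with a minimal sketch: an architectural-to-physical rename table backed by a free list. The class and field names below are illustrative assumptions, not taken from the paper; the point is only that each new write of an architectural vector register receives a fresh physical register, removing write-after-write and write-after-read hazards so successive loop iterations can issue back to back.

```python
# Hedged sketch of vector register renaming with a rename table and a
# free list of physical registers (names are illustrative assumptions).

class RenameTable:
    """Maps architectural vector registers to physical registers."""

    def __init__(self, num_arch, num_phys):
        self.table = list(range(num_arch))           # arch -> phys mapping
        self.free = list(range(num_arch, num_phys))  # unallocated physical regs

    def read(self, arch_reg):
        # Reads see the most recent mapping of the architectural register.
        return self.table[arch_reg]

    def write(self, arch_reg):
        # Each write allocates a fresh physical register, so an older
        # in-flight read of the same architectural register is unaffected.
        phys = self.free.pop(0)
        self.table[arch_reg] = phys
        return phys

rt = RenameTable(num_arch=4, num_phys=8)
p0 = rt.write(2)  # first write to v2
p1 = rt.write(2)  # second write to v2 lands in a different physical register
assert p0 != p1 and rt.read(2) == p1
```

Because consecutive writes to the same architectural register no longer collide, the hardware can overlap loop iterations without software unrolling, which is the "dynamic hardware loop unrolling" the answer refers to.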
How does the decoupling of execution and memory access improve vector processor efficiency?
Decoupling the execution and memory-access streams lets arithmetic instructions proceed while loads and stores are still in flight, minimizing stalls, improving resource utilization, and thereby raising overall performance.
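A minimal sketch of the decoupling idea: a single decoded instruction stream is partitioned into separate compute and memory-access queues that drain independently, so a long-latency load does not block independent arithmetic. The queue names, opcode strings, and instruction format here are illustrative assumptions rather than the paper's actual microarchitecture.

```python
# Sketch: splitting one decoded stream into execution and memory streams.
from collections import deque

def split_streams(instructions):
    """Partition decoded instructions into execution and memory queues."""
    exec_q, mem_q = deque(), deque()
    for inst in instructions:
        # Memory operations go to the memory-access stream; everything
        # else goes to the execution stream.
        (mem_q if inst["op"] in ("vload", "vstore") else exec_q).append(inst)
    return exec_q, mem_q

program = [
    {"op": "vload",  "dst": "v1"},
    {"op": "vadd",   "dst": "v3"},  # independent of the pending load
    {"op": "vmul",   "dst": "v4"},
    {"op": "vstore", "src": "v3"},
]
exec_q, mem_q = split_streams(program)
# The compute pipeline can issue vadd/vmul while the memory unit
# services vload/vstore concurrently.
```

In a real design the two streams still synchronize on true data dependences (e.g. vstore waiting for v3), typically via scoreboarding or queues of produced values; the sketch shows only the stream separation itself.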
What unique reduction operation acceleration method is used in the proposed architecture?
The architecture employs a dynamically generated reduction tree, reducing the latency of a reduction to log2(VL) steps, where VL is the vector length, thereby accelerating neural network (NN) kernel execution.
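The latency claim can be checked with a short software model of a tree reduction, assuming for simplicity that VL is a power of two. A sequential reduction needs VL − 1 dependent additions, while each level of the tree performs independent additions in parallel, so only log2(VL) levels are needed. This sketch is an illustration of the general technique, not the paper's hardware.

```python
# Sketch: tree reduction over a vector of length VL (VL a power of two).
def tree_reduce(values):
    """Sum a list pairwise, counting the number of tree levels used."""
    levels = 0
    while len(values) > 1:
        # Each level pairs up neighbours; all additions within a level
        # are independent, so hardware can do them in one step.
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels

total, levels = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])
# For VL = 8: total = 36 after 3 levels, versus 7 sequential additions.
assert total == 36 and levels == 3
```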
How does vector processor scalability compare to traditional architectures under testing?
The architecture demonstrated near-linear throughput gains as the number of execution lanes increased for linear algebra and DSP algorithms, whereas NN algorithms scaled less well due to their more complex memory-access patterns.
What are the hardware cost implications of integrating a vector core with a superscalar processor?
While area and power consumption increase with the addition of the vector core, the architecture achieves significantly better power efficiency, measured in EPC/Watt, suitable for high-demand applications.