RISC-V2: A Scalable RISC-V Vector Processor
2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020
https://doi.org/10.1109/ISCAS45731.2020.9181071

Abstract
Machine learning has seen widespread adoption in recent years, with neural network implementations at the forefront. In light of these developments, vector processors are experiencing a resurgence of interest, owing to their inherent suitability for accelerating the data-parallel algorithms that dominate machine learning workloads. In this paper, we propose a scalable and high-performance RISC-V vector processor core. The presented processor employs a triptych of novel mechanisms that work synergistically to achieve the desired goals. An enhanced vector-specific incarnation of register renaming is proposed to facilitate dynamic hardware loop unrolling and alleviate instruction dependencies. Moreover, a cost-efficient decoupled execution scheme splits instructions into execution and memory-access streams, while hardware support for reductions accelerates the execution of key instructions in the RISC-V ISA. Extensive performance evaluation and hardware synthesis analysis validate the efficiency of the new architecture.
FAQs
What performance gains does the register remapping technique achieve in vector processors?
The proposed design achieves a 2.1× increase in average throughput, measured in Elements Per Cycle (EPC), thanks to the improved instruction scheduling enabled by register remapping and dynamic register allocation.
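The renaming idea behind this answer can be illustrated with a minimal sketch: an architectural-to-physical rename table backed by a free list. The class and field names below are illustrative assumptions, not taken from the paper; the point is only that each new write of an architectural vector register receives a fresh physical register, removing write-after-write and write-after-read hazards so successive loop iterations can issue back to back.

```python
# Hedged sketch of vector register renaming with a rename table and a
# free list of physical registers (names are illustrative assumptions).

class RenameTable:
    """Maps architectural vector registers to physical registers."""

    def __init__(self, num_arch, num_phys):
        self.table = list(range(num_arch))           # arch -> phys mapping
        self.free = list(range(num_arch, num_phys))  # unallocated physical regs

    def read(self, arch_reg):
        # Reads see the most recent mapping of the architectural register.
        return self.table[arch_reg]

    def write(self, arch_reg):
        # Each write allocates a fresh physical register, so an older
        # in-flight read of the same architectural register is unaffected.
        phys = self.free.pop(0)
        self.table[arch_reg] = phys
        return phys

rt = RenameTable(num_arch=4, num_phys=8)
p0 = rt.write(2)  # first write to v2
p1 = rt.write(2)  # second write to v2 lands in a different physical register
assert p0 != p1 and rt.read(2) == p1
```

Because consecutive writes to the same architectural register no longer collide, the hardware can overlap loop iterations without software unrolling, which is the "dynamic hardware loop unrolling" the answer refers to.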
How does the decoupling of execution and memory access improve vector processor efficiency?
Decoupling the execution and memory-access streams lets arithmetic instructions proceed while loads and stores are still in flight, minimizing stalls, improving resource utilization, and thereby raising overall performance.
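A minimal sketch of the decoupling idea: a single decoded instruction stream is partitioned into separate compute and memory-access queues that drain independently, so a long-latency load does not block independent arithmetic. The queue names, opcode strings, and instruction format here are illustrative assumptions rather than the paper's actual microarchitecture.

```python
# Sketch: splitting one decoded stream into execution and memory streams.
from collections import deque

def split_streams(instructions):
    """Partition decoded instructions into execution and memory queues."""
    exec_q, mem_q = deque(), deque()
    for inst in instructions:
        # Memory operations go to the memory-access stream; everything
        # else goes to the execution stream.
        (mem_q if inst["op"] in ("vload", "vstore") else exec_q).append(inst)
    return exec_q, mem_q

program = [
    {"op": "vload",  "dst": "v1"},
    {"op": "vadd",   "dst": "v3"},  # independent of the pending load
    {"op": "vmul",   "dst": "v4"},
    {"op": "vstore", "src": "v3"},
]
exec_q, mem_q = split_streams(program)
# The compute pipeline can issue vadd/vmul while the memory unit
# services vload/vstore concurrently.
```

In a real design the two streams still synchronize on true data dependences (e.g. vstore waiting for v3), typically via scoreboarding or queues of produced values; the sketch shows only the stream separation itself.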
What unique reduction operation acceleration method is used in the proposed architecture?
The architecture employs a dynamically generated reduction tree, reducing the latency of a reduction to log2(VL) steps, where VL is the vector length, thereby accelerating neural network (NN) kernel execution.
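The latency claim can be checked with a short software model of a tree reduction, assuming for simplicity that VL is a power of two. A sequential reduction needs VL − 1 dependent additions, while each level of the tree performs independent additions in parallel, so only log2(VL) levels are needed. This sketch is an illustration of the general technique, not the paper's hardware.

```python
# Sketch: tree reduction over a vector of length VL (VL a power of two).
def tree_reduce(values):
    """Sum a list pairwise, counting the number of tree levels used."""
    levels = 0
    while len(values) > 1:
        # Each level pairs up neighbours; all additions within a level
        # are independent, so hardware can do them in one step.
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels

total, levels = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])
# For VL = 8: total = 36 after 3 levels, versus 7 sequential additions.
assert total == 36 and levels == 3
```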
How does vector processor scalability compare to traditional architectures under testing?
The architecture demonstrated near-linear throughput gains as the number of execution lanes increased for linear algebra and DSP algorithms, whereas NN algorithms scaled less well due to their more complex memory-access patterns.
What are the hardware cost implications of integrating a vector core with a superscalar processor?
While area and power consumption increase with the addition of the vector core, the architecture achieves significantly better power efficiency, measured in EPC/Watt, suitable for high-demand applications.