Figure 2 – uploaded by Karyofyllis Patsidis

Traditionally, synchronization in decoupled processor schemes is achieved by employing so-called synchronization queues and special move operations [24]. However, such schemes are not amenable to vector processors, where hundreds of elements have to be moved from/to the memory. The synchronization queues incur a significant hardware overhead, while the inserted move instructions block the computation stream until all the data has been transferred. Fig. 2. The three-phase process of remapping the registers and dynamically allocating the register file. The software initially requests a number of logical registers, which dictates the number of groups the vector register file is going to be split into. Each instruction dispatches multiple μops to cover the full size of the allocated group.
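The grouping described in the Fig. 2 caption can be illustrated with a minimal sketch. It assumes a register file of 32 physical vector registers, equal-sized contiguous groups, and one micro-op per physical register in a group; none of these details are given in the excerpt, and the names `allocate_groups` and `expand_instruction` are hypothetical.

```python
def allocate_groups(num_physical_regs, num_logical_regs):
    """Split the physical register file into one equal group per logical register.

    Assumption: groups are contiguous and equally sized (not stated in the source).
    """
    if num_logical_regs <= 0 or num_physical_regs % num_logical_regs != 0:
        raise ValueError("logical register count must evenly divide the register file")
    group_size = num_physical_regs // num_logical_regs
    return {
        logical: list(range(logical * group_size, (logical + 1) * group_size))
        for logical in range(num_logical_regs)
    }


def expand_instruction(opcode, dst, srcs, groups):
    """Expand one vector instruction into one micro-op per register of its group."""
    group_size = len(groups[dst])
    return [
        (opcode, groups[dst][i], tuple(groups[s][i] for s in srcs))
        for i in range(group_size)
    ]


# Hypothetical example: software requests 8 logical registers out of 32 physical ones,
# so each logical register maps to a group of 4 physical registers.
groups = allocate_groups(num_physical_regs=32, num_logical_regs=8)
uops = expand_instruction("vadd", dst=2, srcs=(0, 1), groups=groups)
for uop in uops:
    print(uop)
```

Requesting fewer logical registers yields larger groups, so each instruction expands into more micro-ops and covers a longer vector.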

Related Figures (7)

Fig. 1. A high-level overview of the micro-architecture of the proposed vector processor. All vector instructions are diverted to the vector execution path upon completion of the scalar Issue Stage (sIS).

TABLE I. THE EXECUTION LATENCIES OF THE VARIOUS INSTRUCTION TYPES. The execution latency is variable and depends on the operation being executed. Table I lists the latencies for the various classes of instructions. When a result is generated, it becomes available to the issue stage through the forwarding paths. Since the execution latency is variable, the orchestration of instruction progress is performed by the scoreboard, which notifies stalled instructions in the issue stage whenever their pending operand values are ready. The stalled instructions "wake up" and proceed to the next pipeline stage. During execution, vector μops may trigger the same operation in multiple execution lanes, based on the vector length.
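The wake-up behaviour of the scoreboard can be sketched as a small software model. This is only an illustration under stated assumptions: the `Scoreboard` class, its methods, and the latency value are hypothetical, since the Table I values are not reproduced in this excerpt.

```python
class Scoreboard:
    """Toy model: tracks pending results and wakes instructions stalled on them."""

    def __init__(self):
        self.pending = {}  # destination register -> cycle at which its value is ready
        self.waiters = {}  # register -> names of instructions stalled on it

    def issue(self, dest, latency, now):
        """Record that `dest` will be produced `latency` cycles after `now`."""
        self.pending[dest] = now + latency

    def stall(self, instr, reg):
        """Park an instruction in the issue stage until `reg` becomes ready."""
        self.waiters.setdefault(reg, []).append(instr)

    def tick(self, now):
        """Return the instructions woken up in cycle `now`."""
        ready = [reg for reg, cycle in self.pending.items() if cycle <= now]
        woken = []
        for reg in ready:
            del self.pending[reg]
            woken.extend(self.waiters.pop(reg, []))
        return woken


# Hypothetical 3-cycle producer; the consumer stalls on v1 until the result is ready.
sb = Scoreboard()
sb.issue("v1", latency=3, now=0)
sb.stall("vadd v2, v1, v0", "v1")
woken_early = sb.tick(1)
woken_on_time = sb.tick(3)
print(woken_early, woken_on_time)
```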

Fig. 3. The deployment of the dynamically generated hardware reduction tree in a setup with 4 execution lanes. Each cycle the length of the vector is reduced in half by computing the neighboring partial results, until the final result is ready.
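The halving pattern in the Fig. 3 caption can be modelled in software as a pairwise reduction. The sketch below assumes a power-of-two number of partial results and an arbitrary binary operator; `tree_reduce` is a hypothetical name, and the code models only the dataflow, not the hardware.

```python
import operator


def tree_reduce(lanes, op=operator.add):
    """Combine neighboring partial results until one value remains.

    Returns the final result and the number of halving steps ("cycles").
    Assumption: the number of partial results is a power of two.
    """
    if not lanes or len(lanes) & (len(lanes) - 1):
        raise ValueError("expected a non-empty, power-of-two number of partial results")
    level = list(lanes)
    cycles = 0
    while len(level) > 1:
        # Each step pairs element 2k with element 2k+1, halving the vector length.
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        cycles += 1
    return level[0], cycles


# Four execution lanes, as in the figure: two halving steps are needed.
result, cycles = tree_reduce([1, 2, 3, 4])
print(result, cycles)
```

With n lanes the model takes log2(n) steps, which is why the final result for a 4-lane setup is available after two halvings.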

IMPACT OF THE HARDWARE-BASED REDUCTION TREE ON THE THROUGHPUT (EPC) OF NN ALGORITHMS IN VARIOUS CONFIGURATIONS. Finally, we evaluate the scalability of the overall vector processor design. We compare three different vector configurations using 4, 8, and 16 execution lanes and a baseline …

HARDWARE IMPLEMENTATION RESULTS OF FOUR INVESTIGATED DESIGNS (CACHES ARE EXCLUDED) AT 45 NM / 0.8 V AT 1 GHZ. As expected, the area increases significantly when augmenting the superscalar processor with a vector core. The area overhead of the vector core scales almost linearly with the increase in the number of execution lanes. The same trends are also followed by the power consumption. Nevertheless, modern systems (and especially resource-constrained ones) demand increasingly higher computational power implemented in a cost-effective manner. Therefore, a key metric is that of power efficiency (EPC/Watt). Clearly, the proposed architecture achieves a markedly better overall power efficiency that scales well with bigger vector configurations.

Fig. 4. The performance improvement obtained when using the novel register remapping mechanism and the dynamic allocation of the register file. Both vector cores feature 8 execution lanes. We first examine the impact of the novel register remapping scheme discussed in Section I-A. We compare the proposed design with a simpler baseline vector processor [22] that does not have the register remapping mechanism and operates with a shorter pipeline (i.e., one without the VRRM stage). Figure 4 depicts the results, normalized to the throughput of the baseline design. The average throughput, measured in Elements Per Cycle (EPC, the ratio of total elements to execution time), increases by 2.1x. This significant improvement is primarily attributed to the enhanced instruction scheduling resulting from the synergistic effect of register remapping, instruction expansion, and the dynamically allocated register file.
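The EPC metric and its normalization to a baseline can be written out as a short sketch. It assumes execution time is counted in cycles; the function names and all numbers below are made up for illustration and are not results from the paper.

```python
def elements_per_cycle(total_elements, cycles):
    """EPC: total elements processed divided by execution time in cycles."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return total_elements / cycles


def normalized_throughput(epc, baseline_epc):
    """Throughput relative to the baseline design (1.0 means no change)."""
    return epc / baseline_epc


# Hypothetical numbers only: the same 1024-element kernel on two designs.
baseline_epc = elements_per_cycle(total_elements=1024, cycles=512)
proposed_epc = elements_per_cycle(total_elements=1024, cycles=256)
speedup = normalized_throughput(proposed_epc, baseline_epc)
print(baseline_epc, proposed_epc, speedup)
```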

Fig. 5. Performance scaling for 3 different vector configurations, as compared to a baseline dual-issue superscalar core.
