Traditionally, synchronization in decoupled processor schemes is achieved by employing so-called synchronization queues and special move operations [24]. However, such schemes are not amenable to vector processors, where hundreds of elements have to be moved from/to memory. The synchronization queues incur a significant hardware overhead, while the inserted move instructions block the computation stream until all the data has been transferred.

Fig. 2. The three-phase process of remapping the registers and dynamically allocating the register file. The software initially requests a number of logical registers, which dictates the number of groups the vector register file is going to be split into. Each instruction dispatches multiple μops to cover the full size of the allocated group.
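The grouping described in the caption of Fig. 2 can be illustrated with a minimal sketch. It is not the actual hardware mechanism. It assumes a register file of 32 physical vector registers, equally sized groups of consecutive registers, and one μop per physical register in a group. The names `NUM_PHYS_REGS`, `Uop`, `allocate_groups`, and `dispatch` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

NUM_PHYS_REGS = 32  # assumption: size of the physical vector register file


@dataclass(frozen=True)
class Uop:
    opcode: str
    dst: int      # physical destination register
    srcs: tuple   # physical source registers


def allocate_groups(requested_logical, num_phys=NUM_PHYS_REGS):
    """Split the register file into equal groups, one per logical register."""
    if not 1 <= requested_logical <= num_phys:
        raise ValueError("request must be between 1 and the physical register count")
    # Assumption: integer division; any leftover registers stay unused.
    group_size = num_phys // requested_logical
    return {
        logical: list(range(logical * group_size, (logical + 1) * group_size))
        for logical in range(requested_logical)
    }


def dispatch(opcode, dst, srcs, groups):
    """Expand one instruction into one uop per physical register of its group."""
    group_size = len(groups[dst])
    return [
        Uop(opcode, groups[dst][i], tuple(groups[s][i] for s in srcs))
        for i in range(group_size)
    ]


groups = allocate_groups(8)                 # software requests 8 logical registers
uops = dispatch("vadd", 2, (0, 1), groups)  # logical v2 = v0 + v1
print(len(uops), uops[0])
```

Under these assumptions, requesting fewer logical registers yields larger groups, so each instruction expands into more μops and covers a longer effective vector.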