Papers by Eli Bozorgzadeh

arXiv (Cornell University), Oct 1, 2019
Despite the many commercial and open-source frameworks that ease deploying FPGAs to accelerate applications, current schemes fail to support sharing multiple accelerators among various applications. An accelerator sharing scheme must support three main features: exploiting the dynamic parallelism of multiple accelerators for a single application, sharing accelerators among multiple applications, and providing a non-blocking, congestion-free environment in which applications invoke the accelerators. In this paper, we develop a scalable, fully functional hardware controller, called UltraShare, with a supporting software stack that provides a dynamic accelerator sharing scheme through an accelerator grouping mechanism. UltraShare allows software applications to fully utilize FPGA accelerators in a non-blocking, congestion-free environment. Our experimental results for a simple scenario that invokes a combination of three streaming accelerators show an improvement of up to 8x in accelerator throughput, achieved by removing accelerator idle time.

This paper presents a theoretical framework that optimally solves many open problems in time budgeting. Our approach unifies a large class of existing time-management paradigms. Examples include time budgeting for maximizing total weighted delay relaxation, minimizing the maximum relaxation, and min-skew time budget distribution. We show that many time-management problems can be transformed into a min-cost flow instance that can be solved optimally and efficiently through well-known combinatorial techniques. Experiments include mapping several designs, implemented using parameterized CoreGen IP cores, onto Xilinx FPGA devices. Different time budgeting policies were applied during the mapping stage. Our time-management techniques always improved the area requirement of the implemented testbenches compared to a widely used path-based method. We also compared maximum budgeting and fairness in delay budget assignments. Our experimental results show that an average improvement of 19% in area can be achieved when fairness and maximum budgeting policies are combined, compared to pure maximum budgeting.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Aug 1, 2004
The excess delay that each component of a design can tolerate under a given timing constraint is referred to as its delay budget. Delay budgeting has been widely exploited to improve design quality in the VLSI CAD flow. The objective of the delay budgeting problem investigated in this paper is to maximize the total delay budget assigned to the nodes of a directed acyclic graph under a given timing constraint. Because component timing in the libraries used during the design optimization flow is discrete, a discrete solution for delay budgeting is essential. We present an optimal integer delay budgeting algorithm and prove that the problem can be solved optimally in polynomial time. In addition, we look at extensions of the delay budgeting problem, such as maximizing the weighted sum of the delay budgets assigned to the nodes, and imposing lower and upper bounds on the delay budget allocated to each node. We prove that for both of these extensions, our algorithm produces an optimal integer solution in polynomial time. Our algorithm is generic and can be applied to different design tasks at different levels of abstraction. We applied our optimal delay budgeting algorithm to library mapping during datapath synthesis on an FPGA platform, using pre-optimized cores of FPGA libraries. For each application, we go through the synthesis and place-and-route stages in order to obtain accurate results. Our optimal algorithm outperforms the ZSA algorithm [4] in terms of area by 10% on average across all applications. In some applications, optimal delay budgeting speeds up place-and-route runtime by up to 2 times.

Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016
This paper presents a two-step aging-aware methodology for selecting Representative Critical Paths (RCPs) from a large number of Critical Paths (CPs) in programmable logic devices. First, CPs are nominated by filtering on delay, temperature, and a lexicographic function of duty cycle and switching activity, which are the major factors in the Bias Temperature Instability (BTI) and Hot Carrier Injection (HCI) aging mechanisms. Second, RCPs are selected based on the fan-out (FO) and the physical location of Logic Blocks (LBs) along a CP, in order to decrease aging propagation and to improve the fairness of sensor distribution, respectively. We then present a sensor insertion algorithm that is used during design placement to avoid sensor inaccuracy. The implementation steps of sensor insertion are performed automatically with limited human interaction. In our experiments, the RCPs show a higher aging rate than the unselected CPs, which demonstrates the effectiveness of the proposed methodology.

The increase in complexity of integrated circuits results in the need to develop hardware platforms shared among a set of applications in the same domain. Today's general-purpose processors cannot satisfy the aggressive future timing and power constraints of a specific application. On the other hand, conventional ASIC design methodologies are costly and require a long time-to-market for today's complex designs. We need a platform-based system optimized for a set of applications in the same domain. Reconfiguration has to be integrated into system design. We must exploit the regularity (or similarity) among applications in target system design. This regularity depends on the application domain. For example, each application can demand a different set of modules to be embedded as fixed cores in the target system. In this work, we study one of the important issues in domain-specific programmable design methodologies. We introduce the pattern selection problem. Patterns are application-specific computational units. The patterns embedded on the target system are selected by exploiting regularity among the applications in the same domain. The number of patterns nominated by applications for embedding can be large, and the patterns can overlap. We present a gain model that can represent different characteristics of patterns. Using this gain model, we propose an algorithm that selects a set of patterns such that the objective is maximized. Our model and proposed algorithm can be applied at different levels of the design hierarchy. Our method also considers the overlap between the patterns. The experimental results show that our method chooses different sets of patterns when the area limit for embedded cores on the system changes.
We used our method to select a set of embedded modules on the SPS architecture [6] for multimedia applications. Comparing the results obtained by our algorithm with the method in which only common patterns are chosen to be embedded, latency and utilization of embedded patterns can be improved by ¢ £¢ ¥¤ and ¦ £ § £¤ , respectively.
RPack
Proceedings of the 2001 conference on Asia South Pacific design automation - ASP-DAC '01, 2001

Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design, 2010
In this paper, a novel thermal-aware dynamic placement planner for reconfigurable systems is presented, which targets transient temperature reduction. Rather than solving time-consuming differential equations to obtain the hotspots, we propose a fast and accurate heuristic model based on power budgeting to plan the dynamic placements of the design statically, while considering the boundary conditions. Based on our heuristic model, we have developed a fast optimization technique to plan the dynamic placements at design time. Our results indicate that our technique is two orders of magnitude faster than thermal-aware placement techniques that perform thermal simulations inside the search engine, while the quality of the generated placements in terms of temperature and interconnection overhead is the same, if not better.

Proceedings of the 6th IEEE/ACM/IFIP international conference on Hardware/Software codesign and system synthesis - CODES/ISSS '08, 2008
A configurable multiprocessor system is a promising design alternative because of its high degree of flexibility, short development time, and potentially high performance under constraints and challenges driven by applications. An important design challenge at 45nm for multi-core systems is manufacturing process variation. Due to the increasing concern over within-die (WID) variation, designers will have to choose configurations of processing cores that maximize the yield of the system without violating performance and throughput constraints. Because processor configuration selection and task allocation are interdependent, and both affect yield and latency constraints, we tackle the two problems simultaneously. In this paper, we propose the problem of task allocation and configuration selection for yield optimization. We prove that the problem is NP-hard and propose an optimal pseudo-polynomial algorithm on Serial-Parallel graphs. We target streaming applications in pipelined reconfigurable multiprocessor systems. We provide a case study of configurable Leon processors as the cores implemented on an FPGA. Results show that solving the proposed problem can significantly improve the timing yield of the system by exploiting extra slack on tasks.
Combinatorial Optimization, 2003

Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 2006
The major drawback of partial dynamic reconfiguration is the reconfiguration delay overhead. To reduce the reconfiguration bitstream between two consecutive implementations, design components are reused. However, this imposes additional physical constraints on the design, which can lead to unroutability and congestion. In this paper, we propose a physically aware component reuse strategy. We propose a floorplanning algorithm to support two-dimensional partial reconfiguration. The proposed floorplanning tool enables a wide design space exploration for component reuse. Key features are selection of the fixed modules, location of the fixed modules, mapping to the fixed modules, and interconnect planning between the fixed and reconfigurable modules. We implemented a sequence of dataflow graphs on Xilinx Virtex 4 devices using our tool for component reuse. When reuse is exploited, the experimental results report more than 50% reduction in the number of reconfiguration frames compared to a flow in which component reuse is not applied. Our proposed floorplan-aware matching technique (to map the modules to fixed components) can reduce the reconfiguration frames by 10% on average compared to a dependency-based matching algorithm. In addition, we show that with different placements of the modules for two consecutive tasks, the number of reconfiguration frames can vary by 25%-60%, or the circuits may even become unroutable. The results imply that there is a need to tune the physical design tools for minimizing runtime reconfiguration delay overhead.

2008 IEEE International Conference on Computer Design, 2008
Software Defined Radio (SDR) base stations can compensate for failures in disaster scenarios by assimilating different communication technologies. FPGAs play an important role in the platform of an SDR base station because of the flexibility and DSP processing power that they deliver. The flexibility of FPGAs comes at the high cost of reconfiguration time overhead, which can be a serious deterrent because of the QoS requirements of real-time traffic. In this paper, we propose a solution that reduces reconfiguration time overhead at the system level, where we are provided the configuration of each wireless system. We then go a step further and integrate our solution into a floorplanner to generate placements for wireless systems that can systematically hide or reduce reconfiguration time overhead. Our experiments show the effectiveness of our approach.
ACM SIGBED Review, 2013
This paper presents a novel graph representation that captures the transition overhead due to runtime configuration of the underlying hardware in reconfigurable embedded systems, resulting from various configuration schemes such as FPGA-like reconfiguration or dynamic voltage/frequency scaling (DVFS). We propose an intuitive heuristic to solve the combined configuration selection and task scheduling problem on this graph. In addition, when applied to DVFS, our algorithm provides simultaneous task ordering and configuration selection of the system, which outperforms the state-of-the-art DVFS methods applied after task ordering.
2007 25th International Conference on Computer Design, 2007
In this paper, we present the co-processor selection problem for minimum energy consumption in hw/sw co-design on FPGAs with dual power mode. We provide theoretical analysis of the problem under no constraint, a resource constraint, and a timing constraint. We prove that the problem is NP-hard in each case, and we provide a generalized ILP formulation. We compared the energy achieved by our approach with that of other approaches that do not consider both static and dynamic power during optimization, and we showed that we can reduce energy by 63% in some cases.

2006 International Conference on Field Programmable Logic and Applications, 2006
Partial dynamic reconfiguration is an emerging area in FPGA design that is used for saving device area and cost. In order to reduce the reconfiguration overhead, two consecutive similar sub-designs should be placed in the same locations to get the maximum reuse of common components. This requires that all the future designs be considered while floorplanning for any given design. In this work, we introduce a new floorplanner based on a multi-layer sequence pair representation that allows overlap of static and non-static components of multiple designs and guarantees a feasible overlapping floorplan with minimal area packing. The multi-layer sequence pair is an efficient representation that helps reduce the total floorplan runtime significantly. It also improves the design quality of the whole sequence, as floorplans of all the designs are computed simultaneously. In our experiments, compared to a traditional sequential floorplanner, our floorplanner removes infeasibility in many designs, improves the clock period by 12% on average, and reduces the place and route time by as much as 3 times. It also reduces the average wirelength by 50% in the designs. Our proposed floorplanner could be used for finding high-quality floorplans for applications that use partial reconfiguration.

Proceedings. 42nd Design Automation Conference, 2005., 2005
Many reconfigurable architectures offer partial dynamic configurability, but current system-level tools cannot guarantee feasible implementations when exploiting this feature. We present a physically aware hardware-software (HW-SW) scheme for minimizing application execution time under HW resource constraints, where the HW is a reconfigurable architecture with partial dynamic reconfiguration capability. Such architectures impose strict placement constraints that lead to implementation infeasibility of even optimal scheduling formulations that ignore the nature of these constraints. We propose an exact and a heuristic formulation that simultaneously partition, schedule, and do linear placement of tasks on such architectures. With our exact formulation, we prove the critical nature of placement constraints. We demonstrate that our heuristic generates high-quality schedules by comparing the results with the exact formulation for small tests and with a popular, but placement-unaware, scheduling heuristic for larger tests. With a case study, we demonstrate an extension of our approach to handle heterogeneous architectures with specialized resources distributed between general-purpose programmable logic columns. The execution time of our heuristic is very reasonable: task graphs with hundreds of nodes are processed in a couple of minutes.

Proceedings of the 40th conference on Design automation - DAC '03, 2003
Delay budget is the excess delay that each component of a design can tolerate under a given timing constraint. Delay budgeting has been widely exploited to improve design quality. We present an optimal integer delay budgeting algorithm. Due to numerical instability and the discreteness of component libraries during library mapping in the design optimization flow, an integer solution for delay budgeting is essential. We prove that the integer budgeting problem, a 20-year-old open problem in design optimization [7], can be solved optimally in polynomial time. We applied optimal delay budgeting to mapping applications on an FPGA platform using pre-optimized cores of FPGA libraries. For each application, we go through the synthesis and place-and-route stages in order to obtain accurate results. Our optimal algorithm outperforms the ZSA algorithm in terms of area by 10% on average across all applications. In some applications, optimal delay budgeting speeds up place-and-route runtime by up to 2 times.

ICCAD-2005. IEEE/ACM International Conference on Computer-Aided Design, 2005., 2005
Due to decreasing transistor sizes and increasing clock frequency, interconnect delay is a dominant factor in achieving timing closure in deep sub-micron designs. Techniques like wire pipelining and retiming can manage the delay of timing-critical wires. The latency of the system, however, limits the total pipelining in the design. New techniques are thus needed at the synthesis stage to consider the effect of critical wires in the design. In this work, we propose a novel, intuitive algorithm, the Critical Edge Reduction (CER) algorithm, which produces a maximal delay budgeting solution under fixed latency while minimizing the number of critical wires. We also present an in-depth analysis of the trade-off between maximum budgeting and critical edge minimization. We implemented our design flow using a set of MediaBench data paths on Xilinx VirtexE FPGA devices. Using our algorithm, the Xilinx Place and Route tool achieved timing closure, on average, 2.8 times faster than with maximum budgeting. The average clock period obtained with the CER algorithm is 6% better than the one obtained with maximum budgeting.
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, 2005
Modern FPGA architectures provide ample routing resources so that designs can be routed successfully. The routing architecture is designed to handle versatile connection configurations. However, providing such great flexibility comes at a high cost in terms of area, delay, and power. We propose a new FPGA routing architecture that utilizes a mixture of hardwired and traditional flexible switches. The result is a 24% reduction in leakage power consumption, 7% smaller area, and 24% shorter delays, which translates to a 30% increase in clock frequency. Despite the increase in clock speeds, the overall power consumption is reduced by 8%.