Figure 20 – uploaded by Avishek Biswas

after the other, on the test-chip. Data is transferred back and forth between MATLAB (running on a host PC) and the test-chip, via an FPGA board. Table II shows the detailed mapping of the 4 CONV/FC layers to the CSRAM array to compute the convolutions. Let us first consider layer C3. It has a filter size of 5 x 5, with 6 input channels and 16 output channels (number of 3-D filters). Each of the 16 3-D filters is mapped to one of the 16 local arrays in the CSRAM. Since each row in the local array has 64 bit-cells, a maximum of 2 (= ⌊64/(5 x 5)⌋) input channels can fit per row. Therefore, 3 (= 6/2) rows are required in each local array to fit the entire 3-D filter. In every clock cycle, 50 (= 5 x 5 x 2) X_IN's are sent through a buffer (shift-registers) to the CSRAM array to compute 16 partial convolution outputs. Thus, the CSRAM array processes 50 x 16 x 2 operations (1 MAV = 2 OPs: 1 multiply + 1 add/average) per clock cycle. For layer F5, the entire filter cannot fit at once in the CSRAM array (due to its limited 16 Kb size in the test-chip). Hence, the entire process, explained above, is repeated multiple times to finish all the computations. However, having multiple CSRAM arrays operating in parallel can easily alleviate this problem, by fitting all the filter weights together on-chip.

Fig. 19. Test setup for automatically running the 4 CONV/FC layers of LeNet-5 CNN on Conv-SRAM, for a given input image (28 x 28).
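The mapping arithmetic for layer C3 can be checked with a short sketch. The function and variable names are hypothetical and not from the paper; the sketch assumes one 1-b weight per bit-cell, that a channel's weights are never split across rows, and 2 OPs per MAV as stated in the text.

```python
def map_filter_to_local_array(filter_size, in_channels, bits_per_row=64):
    """Return (channels per row, rows needed, inputs sent per cycle).

    Assumes each 1-b weight occupies one bit-cell and that the weights of
    one input channel stay within a single row (illustrative assumption).
    """
    weights_per_channel = filter_size * filter_size
    channels_per_row = bits_per_row // weights_per_channel
    if channels_per_row == 0:
        raise ValueError("one channel of this filter does not fit in a row")
    # Ceiling division: rows needed to hold all input channels.
    rows_needed = -(-in_channels // channels_per_row)
    inputs_per_cycle = weights_per_channel * channels_per_row
    return channels_per_row, rows_needed, inputs_per_cycle


# Layer C3 from the text: 5 x 5 filters, 6 input channels, 16 local arrays.
NUM_LOCAL_ARRAYS = 16
OPS_PER_MAV = 2  # 1 multiply + 1 add/average

channels_per_row, rows_needed, inputs_per_cycle = map_filter_to_local_array(5, 6)
ops_per_cycle = inputs_per_cycle * NUM_LOCAL_ARRAYS * OPS_PER_MAV

print(channels_per_row, rows_needed, inputs_per_cycle, ops_per_cycle)
```

The same function could be called with other filter sizes and channel counts to see how many rows a layer would occupy, under the same assumptions.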

Related Figures (29)

Fig. 1. Basics of a typical convolutional neural network (CNN) for a classification problem, showing the structure for the CONV and FC layers [4], [5].

Convolutional neural networks (CNN) provide state-of-the-art results in a wide variety of AI/ML applications, ranging from image classification [3] to speech recognition [1]. However, they are highly computation-intensive and require huge amounts of storage. Hence, they consume a lot of energy when implemented in hardware and are not suitable for energy-constrained applications, e.g. "edge-computing".

Fig. 2. Comparison of conventional approach vs. proposed approach of memory-embedded convolution computation, for processing of CNNs.

In general, CNNs use real-valued inputs and weights. However, in order to reduce their storage and compute complexity, recent works have strived towards using small bit-widths to represent the input/filter-weight values. [6] proposed a binary-weight-network (BWN), where the filter weights (w_i's) can be trained to be +1/-1 (with a common scaling factor per 3-D filter: α). This leads to a significant reduction in the amount of storage required for the w_i's, making it possible to store them entirely on-chip. BWNs also simplify the MAC operation to an add/subtract operation, since α is common for a given 3-D filter and it can be incorporated after finishing the entire convolution computation for that filter. As shown in [6], this algorithm does not compromise much on the original classification accuracy of the CNN, obtained using full-precision weights. BWN performs better than binary-connect [7], which does not incorporate the scaling factor α per filter, and also binarized-neural-networks [8], where both weights and activations are constrained to ±1. In the conventional all-digital implementation of CNNs [4], [5], [9], with the memory and the processing elements being physically separate, reading the w_i's and the partial sums from the on-chip SRAMs leads to a lot of data movement per computation [10] and hence makes these implementations energy-hungry. This is because, in modern CMOS processes, the energy required to access data from memory can be much higher than the energy needed for a compute operation with that data [11]. To address this problem, we present an SRAM-embedded convolution computation architecture [12], conceptually shown in Fig. 2.
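The simplification of the MAC to an add/subtract in a binary-weight network can be illustrated with a minimal sketch. The function names and numeric values are hypothetical, chosen only for illustration; the sketch assumes weights restricted to +1/-1 and a single scaling factor per filter that is applied once, after accumulation.

```python
def bwn_dot(inputs, signs, alpha):
    """Dot product with +1/-1 weights: only adds and subtracts inside the
    loop, with the per-filter scaling factor applied once at the end."""
    acc = 0
    for x, s in zip(inputs, signs):
        if s not in (1, -1):
            raise ValueError("binary weights must be +1 or -1")
        acc = acc + x if s == 1 else acc - x
    return alpha * acc


def scaled_weight_dot(inputs, signs, alpha):
    """Reference: multiply every input by the scaled weight alpha * s."""
    return sum(x * (alpha * s) for x, s in zip(inputs, signs))


# Illustrative values, not taken from the paper.
example_inputs = [3, -1, 4, 2]
example_signs = [1, -1, -1, 1]
example_alpha = 0.5

bwn_result = bwn_dot(example_inputs, example_signs, example_alpha)
ref_result = scaled_weight_dot(example_inputs, example_signs, example_alpha)
print(bwn_result, ref_result)
```

Both functions return the same value, but the first needs one multiplication per filter instead of one per weight.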

where H is the width/height of the IFMP (with padding), E is the OFMP width/height (for a stride S), R is the filter width/height, C is the number of IFMP/filter channels and M is the number of filters/OFMP channels for a given CONV/FC layer. The width and height of the feature-maps/filters are assumed to be the same for simplicity and also because this is very common in most of the popular CNNs.
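The equation that these symbols belong to is not reproduced in this excerpt. As a sketch under standard assumptions (not necessarily the paper's own formula), the output size of a convolution over an already-padded square input and the resulting MAC count can be written in terms of the same symbols. The function names and the value of H below are hypothetical.

```python
def ofmp_size(h, r, s=1):
    """OFMP width/height E for a padded IFMP of size H, filter size R and
    stride S, using the usual relation E = (H - R) / S + 1."""
    if h < r:
        raise ValueError("filter is larger than the (padded) input")
    return (h - r) // s + 1


def mac_count(h, r, c, m, s=1):
    """Total MACs for one layer: every one of the E x E x M outputs needs
    R x R x C multiply-accumulates (standard count, assumed here)."""
    e = ofmp_size(h, r, s)
    return e * e * r * r * c * m


# R, C and M follow layer C3 in the text; H = 14 is an illustrative guess.
example_e = ofmp_size(h=14, r=5, s=1)
example_macs = mac_count(h=14, r=5, c=6, m=16, s=1)
print(example_e, example_macs)
```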

Fig. 3. Concept of embedded convolution computation as averaging in SRAMs for binary-weight convolutional neural networks.

Fig. 5. Simulated results for the MNIST dataset with the LeNet-5 CNN by varying: (a) bit-width to represent IFMP/OFMP values, (b) averaging factor (N).

Fig. 6. Comparison of the conventional and proposed approaches of using SRAM bit-cells for embedded analog computations.

Fig. 7. Comparison of the conventional and proposed approaches on write-disturb issue of SRAM bit-cells during compute mode.

Fig. 8. Schematic of the column-wise GBL_DAC circuit, showing the digital-to-time converter (bottom-left) and time-to-analog converter (top-left). Also shown are the timing signals and operation waveforms for 2 input codes (right).

Fig. 9. Architecture of a 16x64 local array of the Conv-SRAM, showing the 10T bit-cells storing the filter weights and local analog multiply-and-average (MAVa) circuits. Also shown are typical operation waveforms (bottom) for one column.

Fig. 10. Variation of the local bit-line discharge time for weight evaluation/multiplication in phase-2.

Fig. 11. Simulated distribution of the partial convolution output from the ADC (Y_OUT), for a typical CONV layer (C3) in the LeNet-5 CNN.

Choosing the ADC architecture is crucial since it would be replicated multiple times in the CSRAM array. Hence, area and power consumption are key metrics to consider. In addition, the typical distribution of the ADC outputs (Y_OUT's) should also be considered to find the more appropriate architecture. As seen from simulation results in Fig. 11, for a typical CONV layer with a full-scale input range of ±31, Y_OUT has an absolute mean value of 1.3 and is typically limited to ±7. Hence, a serial integrating ADC architecture is more suitable in this scenario, compared to other area-intensive (e.g. SAR) and more power-hungry ones (e.g. flash). In spite of its serial nature, in most cases we can expect the ADC to finish operation within a few cycles, due to the particular Y_OUT distribution.

Fig. 13. Circuit for the 2-cycle offset-cancellation technique for the SA in CSH_ADC.

Fig. 12 also shows the waveforms for a typical CSH_ADC operation. It starts by sending a SA_EN pulse from the ADC logic block to the SA. The SA compares V_p,AVG and V_n,AVG, and sends its outputs (SAOp, SAOn) to the ADC logic block. The first comparison determines the sign of the output, e.g. for the case shown in Fig. 12, Y_OUT is positive since V_p,AVG is higher than V_n,AVG. After the first comparison, the lower of the 2 voltage rails (V_n,AVG) is integrated by charge-sharing it with a reference local bit-line (BLN_ref), using the equalize signal (EQ_N in this case). The reference bit-line, which replicates the local bit-line capacitance, was pre-charged during the SA comparison using the PCHp signal to V_ref (= 1 V in this work). Therefore, the step-size of the integration is ~V_ref/N, where N is the number of SRAM local columns that were averaged. The pre-charge and equalize/integrate operations, along with the SA comparison, continue until the lower voltage rail (V_n,AVG) exceeds the higher one (V_p,AVG). When this happens, the SA outputs flip, indicating the end-of-conversion (EOC). After this, no more timing pulses are generated. A counter in the ADC logic block counts the number of equalize pulses (EQ) it takes to reach EOC, and that generates the digital value of the convolution/dot-product output (Y_OUT), which is +4 for the example shown in Fig. 12.
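The conversion sequence described above (sign comparison, then repeated integrate-and-compare until the rails cross) can be sketched as an idealized behavioral model. This is not the circuit's actual transfer function: it assumes a fixed integration step of V_ref/N, ignores comparator offset and charge-sharing non-idealities, and uses N = 64 as an assumed number of averaged columns. All names and voltages are hypothetical.

```python
def csh_adc_convert(v_p_avg, v_n_avg, v_ref=1.0, n_columns=64, max_pulses=64):
    """Idealized integrating-ADC model: returns a signed pulse count."""
    # First comparison: the sign of the output.
    sign = 1 if v_p_avg > v_n_avg else -1
    higher = max(v_p_avg, v_n_avg)
    lower = min(v_p_avg, v_n_avg)
    step = v_ref / n_columns  # approximate step size from the text

    # Integrate the lower rail one step per equalize pulse until it
    # exceeds the higher rail (end of conversion) or the limit is hit.
    pulses = 0
    while lower <= higher and pulses < max_pulses:
        lower += step
        pulses += 1
    return sign * pulses


# Illustrative voltages chosen so that the rails cross after 4 pulses.
y_out_pos = csh_adc_convert(v_p_avg=0.55, v_n_avg=0.50)
y_out_neg = csh_adc_convert(v_p_avg=0.50, v_n_avg=0.55)
print(y_out_pos, y_out_neg)
```

In this model the number of loop iterations equals the output magnitude, which mirrors why a serial integrating ADC finishes quickly when most outputs are small.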

Fig. 12. Architecture (left) for the charge-sharing based ADC (CSH_ADC) for 1 local array of the Conv-SRAM and typical waveforms (right) for the digital output (Y_OUT) computation for the convolution (dot-product) operation.

8.6% by the local MAVa circuits, 7.3% by the CSH_ADCs and the rest by global timing circuits. The test-chip summary is shown in Table I.

Fig. 14. Die micro-photograph in a 65-nm CMOS process.

Fig. 16. Measured transfer function and energy consumption of CSH_ADC at V_dd,ADC = 1 V, V_dd,ARY = 0.8 V and f_ADC = 250 MHz.

stack in them operating in the saturation region (as a constant current source). It can be seen from Fig. 15 that there is a good linearity in the DAC transfer function with DNL < 1 LSB. Since the SAs have an NMOS input-pair, low values of V_a cannot be properly estimated. Hence, the characterization is done till X_IN = 16 (or V_a = 500 mV).

Fig. 15. Measured transfer function of GBL_DAC at V_dd,DAC = 1.2 V, with V_ref = 1 V and t_0 ≈ 250 ps.

Fig. 18. Architecture of the LeNet-5 CNN, showing the sizes of the feature maps (top) and the filters (bottom).

Fig. 17. Measured distribution of convolution output values (Y_OUT) from CSH_ADC with and without the offset-cancellation (OC) technique, for two values of the input code (X_IN).

the CSH_ADCs, all X_IN's are fed the same input code, all w_i's are written the same value and then the ADC outputs (Y_OUT's) are observed. The measurement results show a good linearity in the overall transfer function and low variation in the Y_OUT values, which is due to the fact that the variation in BL capacitance (used in CSH_ADC) is much lower than transistor V_t-variation. It can also be seen from Fig. 16 that the energy/ADC scales linearly with the input/output value, which is expected for the integrating ADC topology.

TABLE II. PARAMETER MAPPING FOR THE CONV/FC LAYERS OF LENET-5 CNN TO THE CSRAM ARRAY

Table footnotes: Repeated 8 times to cover all the 120 filters. 15 columns and 8 rows mapping used at V_DD = 0.8 V.

Fig. 20. Measured error rate for the 10K images in the MNIST test dataset using LeNet-5 CNN, with and without BN, at V_dd = V_a,max = 1 V.

measured, each experiment is repeated multiple times, and the average value of the error rate is reported. We tested 2 different versions of LeNet-5: with and without Batch-Normalization (BN) layers preceding the CONV/FC layers. Without BN layers ('v1') we achieve a classification error rate of 2.5% after all the 4 layers. The error rate is improved to 1.7% by using the BN layers ('v2'). This is mostly because BN normalizes the convolution inputs for every layer, with a mean around 0, and also limits the maximum value of the inputs. Hence, after input quantization to 6-b, its features are better preserved compared to an un-normalized input distribution. The measured error rate, which is close to the expected value from an ideal digital implementation, shows the robustness of the CSRAM architecture to compute convolutions. The error rate for the MNIST dataset is improved by 8.3 percentage points compared to prior work on in/near-memory compute [13], [16], where a 10% error rate was achieved. Next, we tested functionality at a lower voltage setting of V_dd,DAC = 1 V and the rest of the circuits operating at V_dd (rest) = 0.8 V, with a clock period of 400 ns. The maximum DAC pre-charge voltage (V_a,max), corresponding to the maximum input code, is calibrated to 0.8 V. Hence, the magnitude of 1 LSB is ~26 mV (instead of 32 mV for the previous case with V_a,max = 1 V). Fig. 21 shows the measured error rate for this set of voltages. Due to reduced analog voltage precision, the error rates are slightly higher, with 'v1' achieving 3.4% and 'v2' achieving 1.9% for the MNIST test dataset.
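The two LSB values quoted above follow from dividing the maximum DAC pre-charge voltage by the largest input magnitude. A minimal sketch, assuming the 6-b input consists of 1 sign bit and 5 magnitude bits so that the maximum code is 31 (the function name is hypothetical):

```python
def dac_lsb_mv(v_a_max, total_bits=6):
    """LSB size in mV for a signed input of `total_bits` (1 sign bit plus
    magnitude bits), with the maximum code mapped to v_a_max volts."""
    max_code = 2 ** (total_bits - 1) - 1  # 31 for 6-b including sign
    return 1000.0 * v_a_max / max_code


lsb_at_1v = dac_lsb_mv(1.0)
lsb_at_0v8 = dac_lsb_mv(0.8)
print(round(lsb_at_1v), round(lsb_at_0v8))
```

Rounded to the nearest millivolt, these reproduce the 32 mV and 26 mV figures given in the text.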

Fig. 21. Measured error rate for the 10K images in the MNIST test dataset using LeNet-5 CNN, with and without BN, at V_dd = V_a,max = 0.8 V.

TABLE III. MEASURED ENERGY-EFFICIENCY* (TOPS/W) FOR THE CONV/FC LAYERS OF LENET-5 CNN, AT V_dd = 1 V

Fig. 23. Measured distribution of the 6-b convolution inputs (X_IN's) for the 4 different CONV/FC layers of the LeNet-5 CNN.

scaled and quantized to 6-b (including sign bit) before being sent to the CSRAM array to compute the convolutions. As seen from the figure, all the layers have a high proportion of 0's for the X_IN's. This helps in reducing the GBL_DAC energy to convert and send them to the columns of the CSRAM array.

Fig. 22. Measured distribution of the partial convolution outputs (Y_OUT's) for the 4 different CONV/FC layers of the LeNet-5 CNN.

Fig. 25. Measured energy consumption of the CSRAM array when running the 4 different CONV/FC layers of the LeNet-5 CNN, with V_dd,DAC = 1 V, V_dd (rest) = 0.8 V and f_clk,main = 2.5 MHz.

* 1 MAV = 1 multiply + 1 average = 2 OPs, with 6-b inputs and 1-b weights

Fig. 24. Measured energy consumption of the CSRAM array when running the 4 different CONV/FC layers of the LeNet-5 CNN, with V_dd,DAC = 1.2 V, V_dd,ARY = 0.8 V, V_dd (rest) = 1 V and f_clk,main = 5 MHz.

Whereas, for 'v2' (with BN), layer F5 achieves the best energy-efficiency of 40.3 TOPS/W, utilizing 15 of the 16 local arrays. Fig. 24 also shows the energy breakdown for the 3 major components: GBL_DAC, ARY+MAV, and CSH_ADC. The energy for the GBL_DACs is limited by the bit-precision requirement for representing the IFMP values. Whereas, the energy for the ARY, MAV, and CSH_ADC circuits can be scaled down by scaling their supply voltages while sacrificing speed. Fig. 25 shows the measured energy consumption of the CSRAM array, with V_dd,DAC = 1 V, V_dd (rest) = 0.8 V and f_clk = 2.5 MHz. The reduced supply voltages help in decreasing the energy consumption, leading to better energy-efficiency numbers (Table IV).
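An energy-efficiency figure in TOPS/W can be restated as energy per operation, which is sometimes easier to compare against circuit-level numbers. The sketch below is plain unit conversion (1 TOPS/W = 10^12 operations per joule) and counts 2 OPs per MAV as the text does; the function name is hypothetical.

```python
def energy_per_op_fj(tops_per_watt):
    """Convert TOPS/W to femtojoules per operation.

    1 TOPS/W = 1e12 OP/J, so energy per OP = 1e-12 / TOPS/W joules,
    which is 1e3 / TOPS/W femtojoules.
    """
    if tops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return 1e3 / tops_per_watt


OPS_PER_MAV_OPERATION = 2  # 1 multiply + 1 average

# Best-case value quoted in the text for layer F5 of 'v2'.
fj_per_op = energy_per_op_fj(40.3)
fj_per_mav = OPS_PER_MAV_OPERATION * fj_per_op
print(round(fj_per_op, 1), round(fj_per_mav, 1))
```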

TABLE IV. MEASURED ENERGY-EFFICIENCY* (TOPS/W) FOR THE CONV/FC LAYERS OF LENET-5 CNN, AT V_dd = 0.8 V

* 1 MAV = 1 multiply + 1 average = 2 OPs, with 6-b inputs and 1-b weights

TABLE V. COMPARISON WITH PRIOR WORK ON LOW BIT-WIDTH HARDWARE IMPLEMENTATIONS OF ML ALGORITHMS

1 SVM: Support Vector Machine with 45 binary classifiers, each with 81 inputs, i.e. 81 x 45 MACs per 10-way classification
2 k-NN: k-Nearest Neighbor, only 4 output classes (out of 10) were demonstrated with 100 test images
3 We assume 2 operations (OPs) for 1 MAV (1 mult. + 1 avg.), similar to a MAC (1 mult. + 1 acc.)
4 Does not include energy to access IFMP/OFMP memories
5 Assuming a 65-nm implementation and Energy ∝ (Tech.)^2
