In recent years, deep learning has become one of the most important topics in computer science. Its applications now appear in many aspects of everyday life, such as object detection, speech recognition, and natural language processing. Nearly all major sciences and technologies benefit from the advantages of deep learning, including high accuracy, speed, and flexibility; therefore, any effort to improve the performance of the underlying techniques is valuable. Deep learning accelerators are hardware architectures designed and optimized to increase the speed, efficiency, and accuracy of computers running deep learning algorithms. In this paper, after reviewing background on deep learning, we investigate a well-known accelerator architecture named MAERI. Using an open-source tool called MAESTRO, we measure and compare the performance of a deep learning task under two different dataflow strategies: NLR and NVDLA. The measured performance indicators show that the NVDLA dataflow achieves higher L1 and L2 computation reuse and lower total runtime (in cycles) than NLR.