Real-time approximate and combined 2D convolvers for FPGA-based image processing

Convolution is widely used as a core operation in digital image processing applications. Its performance is challenged by a large number of memory accesses and a large amount of computation. Most previously proposed convolvers are based on exact computation. Although exact convolvers keep the accuracy of the convolution operation at its highest level, performance can often be improved by giving up a negligible amount of accuracy. Approximate computing is a recent technique for reducing computation overhead. In this paper, approximate 2D convolvers are presented that reduce the memory access rate and the amount of computation by performing the multiply-and-accumulate (MAC) operation on a factor of reusable terms. To preserve flexibility for different accuracy requirements, the proposed approximate convolvers are also combined with exact designs through real-time pre-processing stages that use innovative methods to keep the hardware overhead low. In comparison with conventional convolvers, the proposed designs reduce the number of active resources, which leads to a significant reduction in power consumption. For a 3 × 3 kernel, evaluation on the Xilinx Virtex-7 (XC7V2000t) FPGA device shows 34% and 20% power savings for the proposed approximate and combined convolvers, respectively, in comparison with the exact convolver (EC). This improvement grows as the kernel size increases. Finally, a comparison based on RMSE and PSNR for different sample images and filters shows that the error rate and the reduction in image quality are acceptable for many real-time image processing applications.


Introduction
Two-dimensional (2D) convolution is a widespread operator in computer vision and digital image processing applications. For example, convolution operators are widely used for edge detection in advanced mobile vision applications. High-pass filtering (sharpening) and low-pass filtering (blurring) are also performed by 2D convolution filters [1]. As another example, image recognition with convolutional neural networks (CNNs) involves a large number of parallel convolutional computations to extract features from the input image [2]. Therefore, a convolver that works with a low pixel access rate and optimized performance is necessary.
It is well known that the computational complexity of 2D convolution challenges its performance. To compute one output of the convolution between an image and a filter with a k × k kernel, k² − 1 additions and k² multiplications must be performed. In addition, 2D convolution requires high memory bandwidth: for a k × k kernel, k² image pixels must be read from memory to calculate a single output [4]. These problems become critical as the kernel size grows. Therefore, the performance of the convolver depends significantly on the design of the computational units and on the memory bandwidth.
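The per-output cost above can be made concrete with a small sketch. The helper below is illustrative only: the name `exact_conv_cost`, the "valid" output size (no padding), and the assumption that no pixel is reused between neighbouring windows are ours, not the paper's.

```python
def exact_conv_cost(height, width, k):
    """Count operations for a direct k x k convolution (stride 1, no padding).

    Hypothetical helper: assumes every output is computed independently,
    with no reuse of pixels between neighbouring windows.
    """
    outputs = (height - k + 1) * (width - k + 1)  # number of valid outputs
    return {
        "outputs": outputs,
        "multiplications": outputs * k * k,       # k^2 per output
        "additions": outputs * (k * k - 1),       # k^2 - 1 per output
        "pixel_reads": outputs * k * k,           # k^2 per output
    }


if __name__ == "__main__":
    # Show how the cost grows with the kernel size for a 512 x 512 image.
    for k in (3, 5, 7):
        print(k, exact_conv_cost(512, 512, k))
```

The quadratic growth in k is what makes both the multiplier count and the memory traffic critical for larger kernels.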
In image processing applications such as edge detection, the focus is usually on developing the most effective convolutional computation, with little attention to computational complexity and hardware requirements. Although an implementation on a personal computer or supercomputer may not be performance-limited, in embedded applications the hardware power consumption must be considered. Therefore, a heuristic high-performance computing paradigm with improved hardware utilization is needed. As one of the most energy-efficient computing strategies, approximate computing has drawn research attention in recent years. Different approximate computing techniques are presented in [3]. By exploiting approximate techniques, metrics such as power consumption, critical path delay, and computational complexity can be improved at the expense of a negligible loss of accuracy.
A large number of convolution-based applications tolerate some degradation in accuracy. Many conventional convolvers, however, are based on exact computation. Exact designs are suitable when applications need high computational accuracy, but this comes at a significant cost in performance. In this paper, new approximate convolvers are presented to address these limitations. In the proposed designs, by preprocessing a sample image and exploiting a power-efficient approximate computing strategy, the convolutional computations for repetitive or narrow-range pixels are performed only once. As a result, the resource utilization of the target hardware platform is improved significantly. Different applications may need different levels of accuracy. Therefore, the proposed approximate convolvers are combined with exact designs to provide flexibility for different accuracy requirements. The proposed design switches between the approximate and exact convolvers according to the accuracy required by the application, which is set through a threshold. The choice between the two convolvers for the current convolution computation is made by a preprocessing stage, which can be designed to operate in real time. However, a real-time decision system in the preprocessing stage increases the resource utilization, power consumption, and critical path delay of the combined convolvers. Thus, it is crucial to design an optimized real-time decision system with low hardware overhead. In addition, the approximate convolver must be structured so that it compensates for the design cost of the real-time preprocessing stage.
As a practical use case for the combined architecture, suppose that an embedded device in an application such as a self-driving car, face recognition, or a military tracking system must run an image processing algorithm with minimum power consumption and hardware resources within a range of acceptable accuracy. The embedded hardware system, typically a simple controller (an ARM processor) with an FPGA board as a hardware accelerator, dictates the amount of available hardware resources. The accuracy is specified by a threshold value. Moreover, the operating system of the ARM core lets the user choose among various power and accuracy modes. Thus, the user selects a power and accuracy plan, and the operating system sets the threshold value according to the user's requirement. Finally, the combined architecture processes the algorithm according to the threshold value specified by the operating system.
The convolution operation can be executed on different hardware platforms such as application-specific integrated circuits (ASICs), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs). Among these platforms, FPGA-based devices are suitable for the hardware implementation of 2D convolution for three main reasons. First, they can be easily reconfigured for different architectures. Second, for exploiting the parallel processing opportunities in 2D convolution, FPGA-based platforms are the best option because of their fine-grained parallel architecture. Finally, designs can be developed faster than on other hardware platforms [4,12,13]. For these reasons, we have selected FPGA-based platforms to implement our proposed designs.
In this paper, an innovative design for an approximate convolver is proposed that exploits narrow-range pixels to minimize the convolution computation complexity and the pixel access rate. The proposed design offers five main advantages:
(1) Because the multiply-and-accumulate (MAC) operation for narrow-range pixels is performed on a factor of reusable terms, fewer multipliers are required, which improves resource utilization and power consumption.
(2) The narrow-range pixels are considered within a window, a row, or a column of the input image for different kernel sizes, in order to investigate the resulting error rates and performance.
(3) To manage the degradation in accuracy, the proposed approximate convolvers are combined with the exact design, and new dual-purpose convolver architectures are presented that provide power optimization with negligible on-chip overhead in resource utilization.
(4) New real-time preprocessing stages are designed for the combined convolvers, using pipeline processing and an innovative decision system with minimum hardware overhead.
(5) Since the convolutional calculation provides reusable terms, the presented convolvers require a lower pixel access rate.
The rest of the paper is organized as follows. In Sect. 2, the related works are reviewed. The motivation in image processing and a computational analysis are presented in Sect. 3. The proposed 2D approximate and combined convolvers are explained in Sect. 4. The FPGA implementation and the accuracy evaluation of the filtered images are provided in Sect. 5. Finally, Sect. 6 concludes the research.

Related works
As mentioned earlier, there are two main challenges in designing a convolver: computational complexity and memory bandwidth. Several approaches have been considered to reduce the computational complexity of 2D convolution. In [4], a fine-grained 2D pipelined convolver is proposed. The basic idea is that convolution contains reusable computations and that pipelining decreases the computation delay. The critical path delay, pixel access rate, power consumption, and resource utilization of the proposed reduced-access pipelined convolver (RAPC) are compared with those of a conventional non-pipelined convolver (NPC). The authors of [5] have optimized the FPGA implementation of a convolution-based 2D filtering processor for image processing applications. The proposed filter replaces the multiplication unit with floating-point adders and exploits a set of pre-computed coefficients to design a 32-bit multiplier module.
The second part of the literature emphasizes approximate methods for improving key design metrics. In [6], the FPGA implementation of a fixed-point 2D Gaussian filter for image processing is proposed, using an approximate bit-width selection strategy for the fractional part. Because floating-point computation consumes a large amount of power, a fixed-point 2D Gaussian filter improves processing performance and decreases computational cost. In [7], a 2D convolver is presented in which approximate techniques are used at both the circuit and algorithm levels: truncation is exploited as the circuit-level technique, while bit-width reduction is used at the algorithm level. The authors of [8] have implemented a reduced precision redundancy (RPR) multiply-and-accumulate (MAC) unit. RPR uses approximate reduced-precision copies instead of replicating the whole circuit, which greatly reduces the hardware overhead while still allowing the largest errors to be corrected. The authors of [9] have presented an FPGA-based accelerator for the Gaussian filter that exploits approximate computing. In that work, the hardware architecture of the 2D convolver is modified, based on approximate techniques, to improve on-chip resource utilization. For example, some of the Gaussian filter coefficients that multiply the input pixels are repeated, so the input pixels that share a coefficient are added before the multiplication. In [10], the authors have proposed a low-power hardware accelerator for Sobel edge detection that uses an approximate gradient magnitude. Separate gradient components are obtained for the vertical and horizontal orientations, and the approximate method reduces the complexity of the gradient computations. The authors of [11] have also applied new approximate techniques to compute the gradient orientation and magnitude.
The third part of the related work focuses on dividing the 2D convolution into several smaller sections to decrease the computational complexity. For example, in [12], a multi-window partial buffering approach for 2D convolvers on FPGA platforms is proposed. In that work, the authors focus on balancing performance and cost in the use of FPGA resources, and the approach achieves a suitable trade-off between resource utilization and on-chip memory bus bandwidth. The authors of [13] have partitioned the 2D convolver into several one-dimensional convolution sections. Other approaches try to increase the clock frequency and minimize the power consumption by exploiting multiplier-less constant multiplication units for fixed kernel elements [14–17]. In these methods, the performance is optimized and the power consumption is improved by proposing kernel-dependent convolvers. However, these convolvers can be used only in specific applications because the kernel sizes are limited to particular values. In [18], a 2D convolver is designed using the recurrently decomposable (RD) filter, in which the convolution mask is separated into a set of smaller masks. In that work, the resource utilization is improved, but the critical path delay is increased.
Other methods exploit different pipelining techniques to increase the throughput of the design. For example, in [15, 19–21], the convolution is expressed as the sum of products of the image pixels and the kernel coefficients, while the ordinary pipelined convolver uses separate pipeline stages for the buffering, multiplication, and adder modules. The proposed design works at a high clock frequency, but at the expense of a large computational overhead in each pipeline stage. In [22], 2D convolvers are compared based on four methods: non-pipelined, reduced-bandwidth pipelined, multiplier-less pipelined, and time-shared convolvers. The critical path delay, memory bandwidth, and resource utilization are analyzed for various convolution kernel sizes, and the different convolver types for executing 2D convolution are explained. In the non-pipelined convolver, besides the large resource utilization, the critical path delay is large. In the pipelined convolver, the critical path delay is greatly reduced, but this reduction costs a large amount of FPGA resources. In the multiplier-less pipelined convolver, a constant is prepared that is multiplied by a constant multiplication module, so the flexibility is reduced. Finally, in the time-shared convolver, the FPGA resources are significantly reduced, but the computation time grows significantly. The authors of [23] have proposed an area-efficient FPGA-based reconfigurable 2D convolver for image processing. In that work, the arrangement of logic blocks for recent convolvers is analyzed, and the throughput and convolution computation time are compared with previously proposed convolvers. Although that work optimizes the computation time for different kernel sizes, the optimization is achieved by using a large number of FPGA resources.

Several researchers have focused on approximate convolvers to make the processing of convolutional neural networks more efficient. For instance, the authors of [25] have shown that using approximate multipliers improves the training performance of CNNs in terms of power, area, and speed. In their work, approximate binary multipliers are presented that use 2's complement to represent the data. In addition, approximate adders are used in the data path of the proposed design to minimize delay and area. In [26], instead of using DSP blocks, an 8-bit fixed-point MAC unit is presented to customize the FPGA accelerator of a CNN, which increases the computational speed. Solovyev et al. implemented a fixed-point representation of the input data in convolutional blocks for digit recognition [27]. Finally, the authors of [28] have exploited fine-grained pruning methods to minimize the computation overhead. Moreover, in their research, to improve CNN acceleration in terms of area and power, a parallel accumulate share MAC (PMAC) is used in a weight-shared CNN.
As the related works show, different methods have been considered to minimize the computational complexity and memory access of convolutional computation. Approximate computing has a special place among these techniques, and it is essential to use it in a way that balances accuracy and performance. In addition, an architecture that supports various levels of accuracy by switching between approximate and exact convolvers is needed. In this paper, such a design is proposed, with power optimization achieved by minimizing the number of active hardware resources.

Motivation and pre-analysis
In this section, a scenario is considered to motivate our architecture in image processing applications. In addition, before the design of the approximate and combined 2D convolvers is presented, a computational analysis is provided that examines the proposed designs theoretically.

Motivation in image processing
Approximate computing offers an opportunity to address the computation overhead challenges in image processing, and there are several options in this scope. As shown in Fig. 1, there are regions in an image where pixels are repetitive or lie in a narrow range. To decrease the memory access rate while reducing the computational overhead, one approach is to average close-range input pixels and perform the computation only once. Similar ideas appear in spatial and temporal coding for image processing applications. In spatial coding, instead of sending the same pixels repeatedly, only the color value and the number of repeated pixels are sent to the computation units. In temporal coding, only the differences from frame (i) are sent instead of the whole frame (i + 1) [24]. Therefore, by performing the computation for repetitive or close-range pixels at once, the computational overhead decreases considerably. These are the underlying conditions for proposing approximate and combined 2D convolvers for image processing applications.

Computational analysis
The previous subsection studied the existence of repetitive or close-range pixels. A theoretical investigation helps to define how this opportunity can be exploited in convolutional computations. Two-dimensional convolution with a k × k kernel is given by Eq. (1), where X, Y, and h represent the input image, the output image, and the convolution kernel, respectively. If the input pixels in a k × k window are fixed, the convolution with a k × k kernel reduces to Eq. (2). As Eq. (2) shows, if the input pixels in a k × k window are fixed, only one memory access is required because all pixels are repeated. Therefore, all coefficients in the k × k kernel are added, and the single input pixel is multiplied by the sum of the coefficients.
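For reference, Eqs. (1) and (2) can be written as follows. This is a sketch under an assumed index convention (zero-based kernel indices i and j, and the flipped-kernel form of convolution); the symbol X_0 for the common pixel value is also an assumption introduced here.

```latex
% Assumed convention: i, j index the kernel rows and columns (0 .. k-1).
\begin{equation}
Y(m,n) = \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} h(i,j)\, X(m-i,\, n-j) \tag{1}
\end{equation}
% If every pixel in the k x k window equals the same value X_0 (assumed symbol):
\begin{equation}
Y(m,n) = X_0 \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} h(i,j) \tag{2}
\end{equation}
```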
Treating all pixels in a window as one single value is an ideal assumption. Since these values commonly lie in a narrow range [24], only their average is sent from off-chip memory to the processing element. The average of all close-range pixels is given by Eq. (3), and Eq. (2) then becomes Eq. (4), where Avg(X) is the average of all input pixels in the current window. Since the average is computed over a window of pixels, Eq. (4) is named window-based convolution.
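Under the assumption that i and j index the kernel rows and columns from 0 to k − 1 and that the current window is X(m − i, n − j), Eqs. (3) and (4) can be sketched as:

```latex
% Window average over the k x k window (index convention is an assumption).
\begin{equation}
\mathrm{Avg}(X) = \frac{1}{k^{2}} \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} X(m-i,\, n-j) \tag{3}
\end{equation}
% Window-based convolution: one multiplication by the sum of coefficients.
\begin{equation}
Y(m,n) \approx \mathrm{Avg}(X) \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} h(i,j) \tag{4}
\end{equation}
```

The approximation sign in Eq. (4) reflects that the result is exact only when all pixels in the window are equal.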
In the scenario above, all pixels in the selected input window are assumed to lie in a narrow range. Another scenario is to consider the pixels in a row or a column of the input window to be in a narrow range. Let i be the row index and j the column index. If the pixels in a row are in a narrow range (not fixed), then Eq. (3) is modified to Eq. (5), where X_m denotes the mth row of X. For simplicity, the average of X_m is written as Avg_m. Then, based on Eqs. (4) and (5), Eq. (6) follows, which is named row-based convolution. In other words, Eq. (6) computes Y(m, n) by multiplying the average of each row of X by the coefficients of that row.
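With i and j assumed to index the kernel rows and columns from 0 to k − 1, the row-based forms can be sketched as below. Here row m − i of X plays the role of the row written X_m in the text, so its average appears as Avg with subscript m − i; this subscript choice is our assumption.

```latex
% Average of one image row inside the current window (assumed indexing).
\begin{equation}
\mathrm{Avg}_{m-i} = \frac{1}{k} \sum_{j=0}^{k-1} X(m-i,\, n-j) \tag{5}
\end{equation}
% Row-based convolution: each row average times the sum of that row's coefficients.
\begin{equation}
Y(m,n) \approx \sum_{i=0}^{k-1} \mathrm{Avg}_{m-i} \sum_{j=0}^{k-1} h(i,j) \tag{6}
\end{equation}
```

Only k multiplications remain, one per row, instead of k² in the exact form.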
In a further scenario, the pixels in a column are in a narrow range (not fixed); Eqs. (5) and (6) then change to Eqs. (7) and (8), where Avg_n is the average of all input pixels in the nth column of X, and Eq. (8) is named column-based convolution. The main intuition behind the window-, row-, and column-based methods is that close-range pixels exist in many real-world images, so the MAC operation can be performed on a simple factor of reusable terms. This approach minimizes resource utilization by using fewer multipliers. More details are discussed in Sect. 4.
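The column-based forms mirror the row-based ones with the roles of the two indices exchanged. As a sketch, again assuming kernel indices i and j from 0 to k − 1 and writing the average of image column n − j as Avg with subscript n − j:

```latex
% Average of one image column inside the current window (assumed indexing).
\begin{equation}
\mathrm{Avg}_{n-j} = \frac{1}{k} \sum_{i=0}^{k-1} X(m-i,\, n-j) \tag{7}
\end{equation}
% Column-based convolution: each column average times the sum of that column's coefficients.
\begin{equation}
Y(m,n) \approx \sum_{j=0}^{k-1} \mathrm{Avg}_{n-j} \sum_{i=0}^{k-1} h(i,j) \tag{8}
\end{equation}
```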

Proposed real-time approximate and combined 2D convolvers
In this section, the design of the proposed real-time approximate and combined 2D convolvers is explained. First, the architecture of the exact convolver is presented. Figure 2 shows the exact 2D convolver for a 3 × 3 kernel based on Eq. (1). In the exact design, nine input pixels are registered and multiplied by nine kernel coefficients. The partial products are then added with parallel adders. For a k × k kernel, the exact convolver (EC) contains k² − 1 adders, k² multipliers, and k² + 1 I/O registers (pixels/kernel coefficients/result). The critical path delay includes ⌈log₂ k²⌉ adders and one multiplier.

Real-time approximate convolver
Figure 3 shows the general architecture of the proposed real-time approximate convolver. In this architecture, a row or column of the input pixels is registered. Then, based on Eqs. (3), (5), and (7), the average of all input pixels in a window, row, or column, respectively, is computed. To implement pipeline processing, the average value is stored before it is sent to the proposed approximate convolver. The final convolution result is then registered and transferred to the output port. Note that in image processing the convolution window slides by one pixel (stride = 1), so after the pipeline has filled, the approximate convolver produces one output result in each cycle. The binary average (BA) is a new hardware-efficient averaging method. In this module, the operands are grouped according to equivalent binary values so that the division is performed only with shift operators. The next subsections explain the BA and AWC/ARC modules in more detail.

Approximate window-based convolver (AWC)
Based on Eq. (4), when the close-range input pixels are in a window, the exact convolver is changed as shown in Fig. 4. First, the averaged pixel is registered. Then, instead of adding all of the partial products as in the EC, the kernel coefficients are added with parallel adders. Finally, the sum of the coefficients is multiplied by the registered average pixel. For a k × k kernel, the AWC includes k² − 1 adders, one multiplier, and two I/O registers. The critical path delay is ⌈log₂ k²⌉ adders and one multiplier.
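The difference between the exact and window-based data paths can be illustrated with a small functional model. This is a software sketch, not the hardware design: the function names are hypothetical, the kernel is assumed to be already oriented (no flipping), and an exact arithmetic mean is used in place of the binary average module.

```python
def exact_window(window, kernel):
    """Exact MAC over one k x k window (kernel assumed already oriented)."""
    return sum(p * c
               for pixel_row, coeff_row in zip(window, kernel)
               for p, c in zip(pixel_row, coeff_row))


def awc_window(window, kernel):
    """Window-based approximation: window average times sum of coefficients."""
    k = len(window)
    avg = sum(sum(row) for row in window) / (k * k)   # exact mean, not the BA module
    coeff_sum = sum(sum(row) for row in kernel)       # reusable term
    return avg * coeff_sum                            # single multiplication


if __name__ == "__main__":
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]        # unnormalized smoothing kernel
    window = [[100, 101, 99], [100, 102, 100], [98, 100, 100]]
    print(exact_window(window, kernel))               # exact result
    print(awc_window(window, kernel))                 # approximate result
```

For this narrow-range window the exact result is 1607 and the approximation is 1600, while a constant window gives identical results on both paths.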

Approximate row-based convolver (ARC)
The proposed approximate row-based convolver (ARC) is presented for a 3 × 3 kernel in Fig. 5. First, three input pixels are stored in three registers. As mentioned before, because the three pixels in a row are in a narrow range, the average pixel computed from Eq. (5) is transferred to the appropriate buffering module. Second, based on Eq. (6), all coefficients of the corresponding row are added in the multiplication module, and the sum is multiplied by the corresponding average of the input pixels. Third, these product terms are added by a two-operand adder. Finally, the final convolution result is computed from the two-operand adder output. For a k × k kernel, the ARC includes k multipliers, k² − 1 adders, and k + 1 input/output registers, whereas the EC requires k², k² − 1, and k² + 1, respectively. The critical path of the proposed non-pipelined convolver includes ⌈log₂ k²⌉ adders and one multiplier for a k × k kernel.
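The resource counts stated for the EC, AWC, and ARC can be tabulated for any kernel size. The helper names below are hypothetical; the formulas are the ones given in the text, and the averaging (BA) module is not included.

```python
from math import ceil, log2


def resources(k):
    """Resource counts for a k x k kernel, using the formulas stated in the text."""
    k2 = k * k
    return {
        "EC":  {"multipliers": k2, "adders": k2 - 1, "io_registers": k2 + 1},
        "AWC": {"multipliers": 1,  "adders": k2 - 1, "io_registers": 2},
        "ARC": {"multipliers": k,  "adders": k2 - 1, "io_registers": k + 1},
    }


def critical_path_adders(k):
    """Adder levels on the critical path: ceil(log2(k^2)), plus one multiplier."""
    return ceil(log2(k * k))


if __name__ == "__main__":
    for k in (3, 5, 7):
        print(k, critical_path_adders(k), resources(k))
```

The adder count and critical path are the same in all three designs; the saving comes from the multipliers and registers.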

Binary average (BA)
Figure 6 shows the binary average for the ARC with a 3 × 3 kernel. As mentioned earlier, in this module the operands are grouped by powers of two, following the binary representation of the number of operands. For example, in Fig. 6, three operands must be averaged. Instead of adding all three numbers and dividing by three, they are grouped into 2 + 1 operands, which corresponds to (11)b. As another example, when k = 5, the binary representation of 5 is (101)b, so the operands are grouped into 4 + 1 terms. The proposed method reduces the hardware overhead because the division is performed by shift operators; in addition, the number of shift operators is optimized. Moreover, because the pixel values are assumed to be close range, the error introduced by this averaging is negligible. Based on Fig. 6, to average nine input pixels, the three pixels in the first row are selected first. Their average is computed with add and shift operators and stored in a shift register. The same process is performed for the second and third rows. Finally, with a serial-to-parallel shift register (STP_SHR), all averaged values are sent to the approximate convolver. For a k × k kernel, the BA for the ARC uses k − 1 adders, ⌊log₂ k⌋ shifters, and k registers (a serial-to-k-parallel output shift register).
Similarly, Fig. 7 shows the BA for the AWC with a 3 × 3 kernel. The only difference is that an additional BA module is needed to average all three rows and thus compute the average value of the window. The BA for the AWC uses 2(k − 1) adders, 2⌊log₂ k⌋ shifters, and k registers. Note that the approximate column-based convolver (ACC) architecture is similar to the row-based method in its hardware realization. In the evaluation section, the two presented methods are named real-time approximate window-based convolver (RAWC) and real-time approximate row-based convolver (RARC), following the general architecture shown in Fig. 3.
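The text does not spell out the exact adder and shifter wiring of the BA module, so the following is only one plausible software reading of the power-of-two grouping, not the authors' circuit. It assumes that each power-of-two group is averaged by a right shift and that group averages are then merged pairwise by add-and-shift, which weights groups equally and is therefore only accurate for close-range values. All function names are hypothetical, and the hardware resource counts of this sketch are not claimed to match those in the text.

```python
def power_of_two_groups(n):
    """Group sizes given by the set bits of n, largest first (5 -> [4, 1])."""
    return [1 << b for b in range(n.bit_length() - 1, -1, -1) if n & (1 << b)]


def binary_average(values):
    """Shift-only approximate average of integer values (assumed interpretation)."""
    if not values:
        raise ValueError("at least one operand is required")
    result = None
    start = 0
    for size in power_of_two_groups(len(values)):
        # Average of a power-of-two group: sum, then shift right by log2(size).
        group_avg = sum(values[start:start + size]) >> (size.bit_length() - 1)
        start += size
        # Merge group averages pairwise with add-and-shift (equal weighting).
        result = group_avg if result is None else (result + group_avg) >> 1
    return result


def window_binary_average(window):
    """Window average as the binary average of the per-row binary averages."""
    return binary_average([binary_average(row) for row in window])


if __name__ == "__main__":
    print(power_of_two_groups(3), power_of_two_groups(5))
    print(binary_average([10, 12, 11]))
    print(window_binary_average([[100, 101, 99], [100, 102, 100], [98, 100, 100]]))
```

Because of the truncating shifts and equal weighting, the result can differ slightly from the true mean (for the 3 × 3 window above the sketch returns 99 where the true mean is 100), which is consistent with the close-range assumption rather than a general-purpose average.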

Real-time combined convolver
In the previous section, real-time approximate convolvers based on the window-based and row-based methods were proposed. As mentioned earlier, different applications need different levels of accuracy. To manage the error rate of the proposed approximate convolvers, real-time combined 2D convolvers are presented in this section, in which the proposed approximate convolvers are combined with the exact one. The general architecture of the real-time combined convolver is shown in Fig. 8. As in the real-time approximate convolvers, a row or column of input pixels is registered, and the required computation is performed by the BA module. Next, a decision module selects the approximate or the exact convolver based on a relation and a threshold. Because of the pipeline architecture of the combined convolver, the intermediate results are stored. Finally, the input pixels are sent to the CWC/CRC through three STP_SHR modules (one serial row to a parallel window). As in the real-time approximate convolver, after the pipeline fill-up time, one convolution result is generated in each cycle without interruption.

Combined window-based convolver (CWC)
Figure 9 shows the architecture of the proposed combined window-based convolver (CWC) for a 3 × 3 kernel. In this convolver, the close-range pixels are considered within a window. In the exact process, the select bit is equal to 1, so the enable bit is set to 1 and the nine input pixels of the window are registered. The nine multiplication modules are also enabled, and the kernel coefficients are multiplied by the input pixels. Next, with the select bit equal to 1, the multiplexers pass the partial products to their outputs. The parallel adders compute the sum of the partial products and pass the result to the next multiplexer level. In the approximate process, the nine input registers are turned off because the select and enable bits are equal to 0. The kernel coefficients are sent through path 0 of the multiplexers, and the sum of the coefficients is computed. All nine multiplication units are turned off as well; instead, one multiplier is enabled to multiply the sum of the coefficients by the average of the input pixels. For a k × k kernel, the CWC uses k² + 1 multipliers, k² + 1 multiplexers, k² − 1 adders, and k² + 2 I/O registers. In the approximate process, only k² + 1 multiplexers, k² − 1 adders, two registers, and one multiplier are turned on. Finally, the critical path delay of the CWC in each process is the same as that of the RAWC and EC designs, respectively, plus two multiplexer levels.

Combined row-based convolver (CRC)
Figure 10 shows the architecture of the proposed combined row-based convolver (CRC) for a 3 × 3 kernel. First, nine input pixels are registered in the buffering modules. The input pixels are then multiplied by the kernel coefficients for the exact process. Because the approximate process requires the sum of the kernel coefficients in each row, multiplexers route either the coefficients or the partial products to the inputs of the two-operand adders. With two levels of parallel adders, the sum of the coefficients or of the partial products is computed, depending on the select bit of the multiplexers.
In the next level, for the approximate process, the sum of the coefficients is multiplied by the average of each row. Next, with further levels of multiplexers and two-operand adders, the design selects whether to add the sums of partial products from the previous level (exact process) or the products of the current level (approximate process). In the proposed design, multipliers with an enable pin are used, and the select bit determines whether the nine multipliers of the exact computation or the three multipliers of the approximate convolver are turned off. For a k × k kernel size, CRC uses … In the evaluation section, the two presented designs are named real-time combined window-based convolver (RCWC) and real-time combined row-based convolver (RCRC), following the general architecture shown in Fig. 8.
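The behaviour of the combined row-based data path can be sketched as a functional model. This is an illustration under assumptions, not the hardware: the function name is hypothetical, the kernel is assumed already oriented, an exact mean replaces the BA module, and each row's select bit is assumed to choose its path independently (the text sends three select bits to the convolver, but whether mixed rows are allowed is our assumption).

```python
def combined_row_conv(window, kernel, select_bits):
    """Functional model of a combined row-based convolver for one window.

    select_bits[r] == 1: exact path, sum of pixel * coefficient for row r.
    select_bits[r] == 0: approximate path, row average * sum of row coefficients.
    """
    total = 0
    for pixel_row, coeff_row, sel in zip(window, kernel, select_bits):
        if sel:
            # Exact path: k multiplications for this row.
            total += sum(p * c for p, c in zip(pixel_row, coeff_row))
        else:
            # Approximate path: one multiplication for this row.
            total += (sum(pixel_row) / len(pixel_row)) * sum(coeff_row)
    return total


if __name__ == "__main__":
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
    window = [[100, 101, 99], [10, 200, 50], [98, 100, 100]]
    print(combined_row_conv(window, kernel, [1, 1, 1]))  # all rows exact
    print(combined_row_conv(window, kernel, [0, 1, 1]))  # first row approximated
```

In this example the first row is close range, so approximating it changes the result only from 1719 to 1718, while the wide-range second row stays on the exact path.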

Decision (ApEx)
To balance the accuracy of the filtered image against the performance of the proposed convolver, a selection mechanism between the approximate and exact processes is required. Therefore, a sub-module named decision is included in the architecture of Fig. 8. Since the decision module works in real time, it must be designed so that it not only avoids high hardware overhead but also makes a reliable decision to manage the degradation in accuracy. Moreover, since the workflow of Fig. 8 is based on pipeline processing, the execution time of the decision module must be less than that of the combined convolver to ensure that the critical path delay does not increase.
The selection strategy of the decision module is based on Eq. (9). This relation compares the mean absolute error (MAE) with a threshold value:

$$\mathrm{MAE}(\text{original pixels}, \text{averaged pixels}) < \text{threshold} \tag{9}$$

If MAE is less than the threshold value, the error rate is acceptable and the approximate process is selected; otherwise, the exact process is selected. The decision sub-module receives the input pixels, the average value (based on Figs. 6 and 7), and the threshold value as inputs and generates the select bit as output. It is worth mentioning that, in contrast to mean square error (MSE) and root-mean-square error (RMSE), MAE involves no square-root or power operations, which minimizes the pre-processing overhead. The MAE formula is shown in Eq. (10), where $x_i$ is the input pixel, $x_{avg}$ is the average of all input pixels, and $n$ is the number of input pixels:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| x_i - x_{avg} \right| \tag{10}$$

Figure 11 shows the architecture of the decision module for CRC based on Eqs. (9) and (10). First, the input pixels are subtracted from the average of each row, which was computed in Fig. 6. After the average of the absolute differences is computed, a comparator module compares the result with the threshold value. The output of the comparator specifies whether the approximate or the exact convolver is used, with a select bit of 0 or 1, respectively. Finally, with the STP_SHR module, all three prepared select bits are sent to the combined convolver. For a k × k kernel size, the decision module for CRC utilizes k subtractors, k absolute modules, k − 1 adders, ⌊log₂ k⌋ shifters, one comparator, and k registers (serial to k parallel output shift register).

The architecture of the decision module for CWC is demonstrated in Fig. 12. In this module, because the decision is made based on a window of input pixels, all pixels are subtracted from the average of the whole window, which is computed in Fig. 7. After the average of the absolute differences in each row is computed, the partial results are sent to the BA module through the STP_SHR module. Finally, a comparator module prepares the select bit for the whole window. For a k × k kernel size, the decision module for CWC uses k subtractors, k absolute modules, 2(k − 1) adders, 2⌊log₂ k⌋ shifters, one comparator, and k registers (serial to k parallel output shift register).
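The decision rule of Eqs. (9) and (10) can be sketched in software as follows. This is only a minimal Python illustration, not the VHDL design: the function names `mae`, `select_bit`, and `crc_select_bits` are hypothetical, a plain arithmetic mean is assumed in place of the hardware binary-average (BA) module, and the select-bit encoding (0 for approximate, 1 for exact) follows the description of Fig. 11.

```python
def mae(pixels, avg):
    """Mean absolute error between the pixels and their average (Eq. 10)."""
    return sum(abs(p - avg) for p in pixels) / len(pixels)


def select_bit(pixels, threshold):
    """Return 0 (approximate path) if MAE < threshold, else 1 (exact path).

    Assumption: the average is a plain arithmetic mean; the hardware
    binary-average module may round differently.
    """
    avg = sum(pixels) / len(pixels)
    return 0 if mae(pixels, avg) < threshold else 1


def crc_select_bits(window, threshold):
    """Row-based (CRC-style) decision: one select bit per row of the window."""
    return [select_bit(row, threshold) for row in window]


# Hypothetical 3 x 3 window of 8-bit pixels.
window = [
    [100, 101, 102],  # nearly flat row
    [10, 120, 250],   # row crossing a strong edge
    [50, 50, 50],     # perfectly flat row
]
print(crc_select_bits(window, threshold=5))
print(crc_select_bits(window, threshold=0))
```

Because the comparison in Eq. (9) is strict, a threshold of 0 never selects the approximate path in this sketch, which matches the behavior described for threshold 0 in the error rate evaluation.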
As a final note in this section, regarding Fig. 8, the proposed combined designs include the binary average and decision modules, and in the evaluation section, all of the analyses are conducted on the full (end-to-end) hardware realization. Therefore, the impact of the additional hardware steps is measured. Since the multipliers are the main part of the hardware overhead in convolutional computations, using fewer multipliers drastically decreases the design cost, including delay, area, and power. In the combined designs, processing the computations through the approximate path deactivates multipliers, and the added circuits, mainly the binary average and decision circuits, cost less than the multipliers that are deactivated. This observation makes our design highly resource-efficient.

Hardware Evaluation
All of the proposed approximate convolvers are coded in VHDL for various kernel sizes. First, we have synthesized the designs for the Xilinx Virtex-7 (XC7V2000t) FPGA device using Xilinx Vivado (v2018.3) with 8-bit kernel coefficients and input pixels. To make a fair comparison, one of the related designs proposed in [4] is implemented to compare critical path delay, pixel access rate, power consumption, and resource utilization with the approximate and combined convolvers. Finally, the implementations are extended to other FPGA devices such as XC6SLX16, XCV4LX25, and XCV4LX160 by using the Xilinx ISE tool to make an equitable comparison with other related works.

Critical path delay
In Fig. 13, the critical path delay for the proposed approximate and combined convolvers is compared with EC and the proposed design in [4]. […] critical path delay in comparison with EC, since the 8-bit coefficients are added before multiplication, and RARC's delay is approximately equal to that of EC. The combined convolvers, however, have a negligible overhead because two levels of multiplexers are in the critical path of the circuit. For the combined designs, the critical path delays of both the approximate and exact paths are shown in Fig. 13, but the larger one is selected as the critical path. Except for RAPC [4], the delay of all convolvers goes up with the kernel size, since the number of operands of the multiply-and-accumulate operator increases. In RAPC, because it is a pipelined design, only one multiplier is in the critical path of the circuit, and the slight growth with the kernel size is due to the additional routing required in the FPGA device. It is worth mentioning that, as future work, the proposed AWC, ARC, CWC, and CRC can be pipelined to improve the critical path delay.

Fig. 13 Critical path delay comparison among various kernel sizes

Pixel access rate
Figure 14 shows the pixel access rate of the proposed designs in comparison with EC and RAPC [4]. The pixel access rate of RAWC, RARC, RCWC, and RCRC is equal to k and grows linearly, whereas EC's pixel access rate is equal to k² and increases quadratically with the kernel size. In RAPC, because a pipelined convolver is used, this parameter also goes up linearly with the kernel size. Therefore, the proposed designs outperform EC in terms of memory accesses. As mentioned before, convolutional computation needs a huge amount of input data. Moreover, memory accesses are one of the most significant challenges in designing convolvers, since off-chip DRAM access consumes the main part of chip power. The proposed convolvers address these problems by minimizing the off-chip pixel accesses. Finally, by decreasing the pixel access rate, the computational complexity of the MAC operation in the convolution unit also diminishes.
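To make the linear versus quadratic growth concrete, the short Python sketch below counts pixel fetches for a fixed number of window positions. It is an illustration only: the function name `pixel_accesses` and the number of window positions are hypothetical, and the per-window rates (k for the proposed convolvers, k² for EC) are taken from the text rather than measured.

```python
def pixel_accesses(k, num_windows, exact):
    """Pixels fetched for num_windows window positions.

    Assumption: the exact convolver (EC) fetches k*k pixels per window,
    while the proposed convolvers fetch k pixels per window, as stated
    in the text.
    """
    per_window = k * k if exact else k
    return per_window * num_windows


# Hypothetical number of window positions, kept fixed for every kernel size.
num_windows = 510 * 510
for k in (3, 5, 7, 9):
    ec = pixel_accesses(k, num_windows, exact=True)
    proposed = pixel_accesses(k, num_windows, exact=False)
    print(f"k={k}: EC={ec}, proposed={proposed}, ratio={ec // proposed}")
```

Under these assumptions the ratio between the two counts is simply k, so the saving in memory traffic grows with the kernel size.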

Power consumption
Figure 15 demonstrates the power consumption of the approximate and combined convolvers in comparison with EC and RAPC [4] for various kernel sizes, showing that the proposed designs reduce power consumption. For instance, when the kernel size is 3, RAWC and RARC consume 17.52 and 23.05 watts of on-chip power, while EC consumes 26.55 watts, which corresponds to 34% and 13% improvement for the mentioned designs, respectively. In addition, by increasing the kernel size, the difference in power consumption goes up quadratically, since the number of look-up tables (LUTs), flip-flops (FFs), and I/Os is decreased in the approximate convolvers. On the other hand, in the combined designs, using the approximate path (process) significantly improves the dynamic power consumption and the total on-chip power. For example, when k is 3, RCWC (Ap) consumes 21.17 watts of on-chip power, which indicates a 20% improvement in comparison with EC, since only one multiplier and input register are turned on instead of nine in the EC module. It is worth mentioning that the reduction in power consumption goes up with increasing kernel size. Therefore, executing a large number of computations on the approximate path has a negligible effect on accuracy, yet it significantly reduces the dynamic power consumption.
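The quoted percentages follow directly from the on-chip power figures given above for k = 3, taking EC (26.55 W) as the baseline. The arithmetic below is only a check of those figures and introduces no new measurements:

```latex
\begin{align*}
\text{RAWC:}\quad      & \frac{26.55 - 17.52}{26.55} \approx 0.340 \quad (34\%) \\
\text{RARC:}\quad      & \frac{26.55 - 23.05}{26.55} \approx 0.132 \quad (13\%) \\
\text{RCWC (Ap):}\quad & \frac{26.55 - 21.17}{26.55} \approx 0.203 \quad (20\%)
\end{align*}
```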
Fig. 16 shows the dynamic power improvement versus the number of approximate convolvers. The comparison is done for the RCWC and RCRC architectures among different kernel sizes. As can be seen in this figure, ten processes are executed on the combined designs. When the number of approximate processes increases (and the number of exact processes decreases accordingly), the dynamic power improves. As predicted, the RCWC method is superior, since a window of pixels is in a narrow range, in contrast to RCRC, which averages the pixels in a row of the input data. Moreover, by increasing the kernel size, the dynamic power improves because more pixels are considered to be in a close range, and fewer hardware resources are required to process the convolution computations.

Fig. 15 Power consumption comparison among various kernel sizes

Resource utilization
In Fig. 17, the number of LUT slices is compared. By increasing the kernel size, the number of LUTs grows. The resource utilization of the proposed approximate convolvers is significantly improved in comparison with EC. For instance, when the kernel size is 3, RAWC and RARC use 155 and 281 LUTs, whereas EC and RAPC [4] use 767 and 771, respectively. It is worth mentioning that the total number of utilized LUTs for RCWC and RCRC is more than for EC, since both the approximate and exact designs have been implemented; however, the extra resources are turned off in each process. For example, in RCWC (Ap), the number of active LUTs is 292 out of 1160 total LUTs. For this reason, the number of active LUTs in the approximate and exact processes of the combined designs is also given in Fig. 17.

Fig. 16 Dynamic power improvement versus the number of approximate convolvers among various kernel sizes for RCWC and RCRC designs
Figure 18 shows the number of registers used by the different convolvers in comparison with EC and other related works. For k = 5, RAWC and RARC use 98 and 126 FFs, respectively, while EC and RAPC use 216 and 977. In the combined designs, the total number of registers is more than for EC (NPC [4]), but the number of active FFs is highly reduced. For instance, when k = 5, the total number of RCRC registers is 343, while RCRC uses 143 FFs in the approximate process. Also, the number of FFs increases with the kernel size. As can be seen in this figure, RAPC [4] utilizes more registers, since it exploits pipeline processing and must hold temporary internal results, which is a drawback of the design proposed in [4]; however, it comes with a significant reduction in critical path delay. In the combined designs, the total and active numbers of FFs are given in each scenario to show the superiority of the approximate path when a large number of registers are turned off. For example, when k = 3, the total number of FFs in the RCWC method is 186, while the approximate path turns on 114 of them, which shows up to 39% improvement. Finally, when the approximate path is selected, the reduced resource utilization also reduces the power consumption.
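The share of resources that the approximate path leaves switched off can be computed from the totals and active counts quoted in this section. The following Python sketch is illustrative only; the helper name `reduction_percent` and the dictionary labels are hypothetical, and the numbers are the ones stated in the text.

```python
def reduction_percent(total, active):
    """Share of resources switched off, as a percentage of the total."""
    return 100.0 * (total - active) / total


# (total, active) counts quoted in the text for the approximate path
# of the combined designs.
cases = {
    "RCWC LUTs": (1160, 292),
    "RCWC FFs (k=3)": (186, 114),
    "RCRC FFs (k=5)": (343, 143),
}
for name, (total, active) in cases.items():
    print(f"{name}: {reduction_percent(total, active):.1f}% inactive")
```

For the RCWC flip-flops at k = 3 this evaluates to roughly 39%, in line with the improvement stated above.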
To show the superiority of the proposed designs, we have also compared them with other related 2D convolvers. As shown in Table 1, the pixel access rate and resource utilization of the convolvers proposed in [5, 6] and [19] are compared with the approximate and combined convolvers. In this discussion, the FPGA devices are changed to make a fair comparison with the other designs. Based on this table, the resource utilization of the approximate convolvers is improved significantly, as shown in the total resource utilization column of Table 1. In the proposed combined convolvers, the total number of LUTs, FFs, and DSPs is less than in the convolvers proposed in [5]. Table 1 also gives the number of active resources in the approximate and exact processes, which shows a significant reduction in the approximate execution for RCWC and RCRC. In comparison with [6], the proposed approximate convolvers utilize fewer LUTs and FFs; the combined designs use more LUTs, while the number of FFs is reduced in RCRC. Moreover, the proposed approximate and combined designs outperform the convolver proposed in [19] in both pixel access rate and resource utilization.
In comparison with [23], the RAWC and RARC methods show a significant enhancement in resource utilization. In addition, when the approximate path is selected, the combined designs outperform in the number of active resources, and the improvement is more considerable when compared with [15]. In the frequency and throughput comparison, our methods can also compete with [6, 15, 19], and [23], but the work in [5] reports a better clock frequency. On the other hand, in contrast with [5], our proposed methods are not tailored to particular applications and can be configured for various conditions; moreover, the throughput is sacrificed in [5]. At last, since pipeline processing is used in [4], its clock frequency and throughput are increased. As future work, by modifying the proposed methods into a pipelined architecture similar to [4], we hope to improve the mentioned parameters. It is worth noting that the main characteristic of the compared related works is that they report comparable key parameters in the same scenario; in addition, their designs are implemented on FPGA-based hardware platforms.

Error rate evaluation
The final part of the evaluation conducts the convolution operation on sample images to compare the degradation in accuracy when using the proposed approximate and combined convolvers. All of the experimental analyses in this section are conducted in MATLAB, a popular mathematical and scientific software package. In this evaluation, four popular image processing benchmarks named Cameraman, Lena, Mandrill, and Living-room are selected. To compare the error rate and filtered image quality of the exact and approximate convolvers, the root-mean-square error (RMSE), the peak signal-to-noise ratio (PSNR), and the number of approximate and exact convolvers in each scenario are measured. RMSE shows the error rate by checking the similarity of two images; the closer it is to zero, the higher the similarity. PSNR, on the other hand, compares the image quality between the reconstructed image and the original one. An increase in RMSE is equivalent to a decrease in PSNR [24]. In this section, four filters, i.e., Sobel, Gaussian, Laplacian, and Sharpening, are selected to evaluate the proposed convolvers. In image processing applications, the Sobel filter in directions x and y is used for detecting a wide range of edges in a sample image, while the Gaussian filter blurs images by decreasing the amount of intensity variation between neighboring pixels [1]. The Laplacian and Sharpening filters are explained further in the next sections. In the convolution computation, we have used both exact and approximate convolvers for the trade-off between accuracy and performance. As mentioned earlier, a pre-determined threshold decides how often the approximate convolver is used, which prevents high degradation in accuracy. In the evaluation scenarios, the threshold is set from 0 to 10. Based on Eqs. (9) and (10), when the threshold value grows, the number of approximate convolvers increases, and vice versa.
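For readers who want to reproduce the two quality metrics, a minimal Python sketch is given below. It uses the standard definitions of RMSE and PSNR and assumes 8-bit images (peak value 255), consistent with the 8-bit input pixels used in the hardware evaluation. The function names and the tiny test images are hypothetical, and this is not the MATLAB code used for the reported results.

```python
import math


def rmse(reference, test):
    """Root-mean-square error between two equally sized 2D images."""
    diffs = [
        (r - t) ** 2
        for ref_row, test_row in zip(reference, test)
        for r, t in zip(ref_row, test_row)
    ]
    return math.sqrt(sum(diffs) / len(diffs))


def psnr(reference, test, peak=255):
    """PSNR in dB; infinite when the images are identical (RMSE = 0)."""
    error = rmse(reference, test)
    if error == 0:
        return math.inf
    return 20 * math.log10(peak / error)


# Hypothetical 2 x 2 outputs of an exact and an approximate convolver.
exact = [[10, 20], [30, 40]]
approx = [[12, 18], [30, 44]]
print(rmse(exact, exact), psnr(exact, exact))
print(rmse(exact, approx), psnr(exact, approx))
```

Since PSNR depends on the error only through the ratio of the peak value to RMSE, a larger RMSE always yields a smaller PSNR, and identical images give an RMSE of zero and an infinite PSNR.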

Error rate evaluation with Sobel filter
Figure 19 shows the Sobel filter coefficients in directions x and y. To evaluate the proposed designs, two general scenarios are selected. First, the experimental analyses are conducted on the approximate convolver by using window-, row-, and column-based convolutional computations. Second, the evaluation is performed on the combined convolvers under the same methods of convolutional computation.
Figures 20 and 21 demonstrate different filtered images obtained with the Sobel kernel in directions x and y by using the approximate convolver on the Cameraman benchmark. As shown in these figures, the column- and row-based strategies are superior for the Sobel kernels in directions x and y, respectively. Since the Sobel filter in direction x finds vertical edges and the column-based method averages a column of input pixels, the error rate is bounded. In contrast, for the y direction, the row-based method is superior because horizontal edges are found. Finally, since the window-based method computes the average of the input pixels in a window, the error rate for both filters is increased.

The RMSE, PSNR, and the number of approximate and exact convolvers for the Sobel filter in direction x under different threshold values are demonstrated in Fig. 22. When the threshold is set to 0, all exploited convolvers are exact, so the RMSE value is zero and the PSNR is accordingly ∞. As the threshold value goes up, in all approximate convolutional computation techniques, the number of approximate convolvers increases and the number of exact convolvers decreases accordingly. Because the Sobel filter in direction x is used in this scenario, the column-based method is superior to all other strategies, as confirmed by the PSNR and RMSE values. On the other hand, the window-based method has a negligible superiority over the row-based strategy because it uses fewer approximate convolvers. In Fig. 23, to compare the two strategies fairly, the number of approximate convolvers is equalized. In that case, the PSNR value of the row-based method is higher than that of the window-based strategy, and the RMSE value is decreased accordingly. To show the effect of the proposed strategies on a sample image, Figs. 24, 25, and 26 are depicted under threshold values of 0, 5, and 10, respectively. As can be seen, by increasing the threshold value, the column-based method has the minimum degradation in edge detection accuracy. But, the […] this scenario, the PSNR for the row-based method ranges from 65 to 40 under different threshold values. Also, the RMSE value is significantly lower than that of the other methods. In this scenario, the window-based method is superior to the column-based method because it uses fewer approximate convolvers. In Fig. 28, we have also equalized the number of approximate convolvers among the different strategies, after which the PSNR and RMSE of the column-based method improve.
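The direction dependence discussed above can be illustrated on a single window. The Python sketch below is a simplified model, not the hardware design: it assumes a plain arithmetic mean instead of the binary-average module, it assumes that the column-based variant mirrors the row-based one (sum of a column's coefficients times the average of that column's pixels), it applies the kernel without flipping, and all function names and the sample window are hypothetical.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]


def transpose(matrix):
    """Swap rows and columns of a 2D list."""
    return [list(col) for col in zip(*matrix)]


def exact_mac(window, kernel):
    """Exact multiply-and-accumulate over a k x k window (k*k products)."""
    return sum(p * c for prow, crow in zip(window, kernel)
               for p, c in zip(prow, crow))


def row_based(window, kernel):
    """Per row: (sum of the row's coefficients) x (average of the row's pixels)."""
    return sum(sum(crow) * (sum(prow) / len(prow))
               for prow, crow in zip(window, kernel))


def column_based(window, kernel):
    """Same idea applied to columns, by transposing window and kernel."""
    return row_based(transpose(window), transpose(kernel))


def window_based(window, kernel):
    """(Sum of all coefficients) x (average of all pixels in the window)."""
    count = sum(len(row) for row in window)
    return sum(map(sum, kernel)) * (sum(map(sum, window)) / count)


# Hypothetical window containing a vertical edge (dark left, bright right).
window = [[10, 10, 200], [10, 10, 200], [10, 10, 200]]
for name, fn in [("exact", exact_mac), ("column", column_based),
                 ("row", row_based), ("window", window_based)]:
    print(name, fn(window, SOBEL_X))
```

In this toy case the column-based approximation reproduces the exact response to the vertical edge, while the row- and window-based variants return zero because the Sobel x coefficients sum to zero along each row and over the whole kernel. This is consistent with the observation that the column-based strategy suits the Sobel filter in direction x, although it does not reproduce the reported measurements.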

Error rate evaluation with Gaussian filter
To show the error rate of the proposed approximate and combined convolvers for various kernel sizes, the Gaussian filter is selected. Figure 29 shows the Gaussian filter coefficients for 3 × 3, 5 × 5, and 7 × 7 kernel sizes. The Gaussian filter performs a blurring operation on a sample image. With the Gaussian filter, as predicted, the error rate of the three methods is generally lower than with the Sobel filter. To make a fair comparison, we have also equalized the number of approximate and exact convolvers in Figs. 30, 31, 32, and 33. As shown in Fig. 30, all three methods have a PSNR of more than 30 dB even with heavy use of the approximate convolvers. The window-based method has the largest error rate because it averages over a window of pixels. In the column- and row-based methods, the error rate depends on the input image, and based on the selected benchmark, one of them is superior. To show the effect of increasing the kernel size on the RMSE and PSNR values, Figs. 32 and 33 are drawn for 5 × 5 and 7 × 7 kernel sizes, respectively. As evident from these figures, the error rate increases, but the PSNR value remains above 30 dB, which is acceptable for many applications. In addition, because the highest error rate occurs when the number of exact convolvers is close to zero, the […]

Error rate evaluation with Laplacian and sharpening filters
For a more thorough evaluation of the proposed methods, we have extended the filters to the Laplacian and Sharpening filters, which are shown in Fig. 34. In addition, new benchmark images named Boat and Pirate are used to vary the experimental conditions. The Laplacian kernel is an edge detector exploited to […] Figures 35 and 36 show different filtered images with the same number of approximate and exact convolvers for the Laplacian and Sharpening kernels on the Boat and Pirate benchmarks, respectively, with threshold = 10. With the Laplacian and Sharpening filters, all three methods produce acceptable filtered images even with a large number of approximate convolvers, and the degradation is not detectable by the human eye. Moreover, Fig. 37 demonstrates the comparison between RMSE and PSNR under the same number of approximate and exact convolvers with the Laplacian kernel on the Boat benchmark for a 3 × 3 kernel size. As can be seen, for a large number of approximate convolvers, the proposed methods still have PSNR values above 30 dB. Finally, based on Fig. 38, for the Sharpening filter and the Pirate benchmark the error rate is increased, but for some threshold values the PSNR is still above 30 dB.
As a final note in this section, the convolved image evaluation shows that exploiting the approximate convolver in selected regions of the image not only does not significantly reduce the accuracy of the filtered image, but also improves the resource utilization and power consumption. Also, we have evaluated our proposed combined convolver under different scenarios by using a specific relation to decide whether the exact or the approximate convolver is suitable for the current computation, which can be extended to other related scenarios. Finally, it is up to the requirements of the application to decide between the accuracy and the performance of the convolvers.

Conclusion
In this paper, approximate and combined 2D convolvers with optimized power consumption and a low pixel access rate are introduced. The presented designs can be exploited in real-time image processing applications. In the approximate convolvers, the resource utilization and power consumption outperform their counterparts, with a critical path delay competitive with EC. In the combined convolvers, the number of active resources in the approximate path is improved with a negligible reduction in clock frequency. In addition, we have given applications the opportunity to choose different levels of accuracy. Different computational analyses and experimental evaluations are performed on the approximate and combined convolvers using various FPGA-based hardware platforms. The hardware and error rate results reveal that, with a negligible loss of accuracy, the power consumption is improved by up to 34% and 20% in the approximate and combined convolvers, respectively, for a 3 × 3 kernel size. As future work, CNNs need a high-performance processing engine, and the basic module of a CNN is the convolution unit. By exploiting the proposed convolvers and developing a CNN accelerator, the on-chip power consumption can be reduced significantly.

Fig. 1 Spatial and temporal coding example

Fig. 3 General architecture of the proposed real-time approximate convolver

Fig. 8 General architecture of the real-time combined convolver

Fig. 11 Architecture of the decision module for CRC

Fig. 14 Pixel access rate comparison among various kernel sizes

Fig. 17 Number of LUTs among various kernel sizes

Fig. 19 Sobel filters in directions x and y

Fig. 21 Different filtered images with the Sobel in direction y kernel by using approximate convolver under Cameraman Benchmark

Fig. 23 Comparison between RMSE and PSNR with the same number of approximate and exact convolvers by the Sobel in direction x kernel using combined approximate and exact convolvers under Lena Benchmark

Fig. 26 Different filtered images with the same number of approximate and exact convolvers by the Sobel in direction x kernel using combined approximate and exact convolvers under Lena Benchmark with threshold = 10

Fig. 27 Comparison between RMSE, PSNR, and the number of approximate and exact convolvers with the Sobel in direction y kernel by using combined approximate and exact convolvers under Lena Benchmark

Fig. 28 Comparison between RMSE and PSNR with the same number of approximate and exact convolvers by the Sobel in direction y kernel using combined approximate and exact convolvers under Lena Benchmark

Fig. 30 Comparison between RMSE and PSNR under the same number of approximate and exact convolvers with the Gaussian kernel by using combined approximate and exact convolvers under Lena Benchmark for 3 × 3 kernel size

Fig. 31 Comparison between RMSE and PSNR under the same number of approximate and exact convolvers with the Gaussian kernel by using combined approximate and exact convolvers under Cameraman Benchmark for 3 × 3 kernel size

Fig. 32 Comparison between RMSE and PSNR under the same number of approximate and exact convolvers with the Gaussian kernel by using combined approximate and exact convolvers under Lena Benchmark for 5 × 5 kernel size

Fig. 33 Comparison between RMSE and PSNR under the same number of approximate and exact convolvers with the Gaussian kernel by using combined approximate and exact convolvers under Lena Benchmark for 7 × 7 kernel size

Fig. 37 Comparison between RMSE and PSNR under the same number of approximate and exact convolvers with the Laplacian kernel by using combined approximate and exact convolvers under Boat benchmark for 3 × 3 kernel size

Fig. 38 Comparison between RMSE and PSNR under the same number of approximate and exact convolvers with the Sharpening kernel by using combined approximate and exact convolvers under Pirate benchmark for 3 × 3 kernel size

[…] k² + k + 1 I/O registers. But, in the approximate process, k² + k multiplexers, k² − 1 adders, k + 1 I/O registers, and k multipliers are activated. At last, the critical path delay is similar to RARC and EC in each process, respectively, plus two levels of multiplexers. It is worth noting that CWC and CRC are demonstrated in sub-Sects. 4.2.1 and 4.2.2.

Table 1 Comparison of pixel access rate and resource utilization of different convolution structures (N/R denotes "not reported" in the corresponding reference. *PAR: […])

Fig. 18 Number of FFs among various kernel sizes