Prediction of Approximate Multiplier for 16-Bit DICOM Image Contrast Scaling Using Classical Machine Learning Approach

Approximate arithmetic becomes relevant when processors are used for multimedia signal processing applications. The multiplier has a major impact on many of the operations performed by these processors. As the multiplier width increases, compressors form the core architecture of the reduction stage. Approximations are then introduced in the compressors to limit the error without affecting signal quality. In this work, a scalable-split compressor is designed and a counter-matching method is developed for approximation. The 32x32 and 16x16 multipliers built with these new compressors are synthesised in 45 nm using Synopsys Design Compiler and show an improvement of 25% in chip area and 27% in power. The split-scalable architecture attempts to reduce the delay with a trade-off in area and power. Mean Error Distance (MED) and Normalized Error Distance (NED) are the parameters used to assess the quality of any design based on approximate arithmetic. 16-bit medical images are processed with both existing and proposed multipliers, and the resulting Peak Signal to Noise Ratio (PSNR) is compared. Finally, for several input natures and targeted PSNR values, the best system is identified using classical machine learning models.


Maria Dominic Savio, T. Deepa
SRM Institute of Science and Technology, Chennai, 603203, Tamil Nadu, India. Tel.: 9994076650. E-mail: mariadom@srmist.edu.in

Introduction
The digital processing system (DPS) plays a significant role in today's real-time applications [1]. The design components of a DPS are the adder, subtractor, multiplier, shifter, and comparator, which perform the necessary operations. The design of the multiplier is the most complex task in the construction of a DPS. Among all design components, the multipliers consume the most energy and introduce the most delay [2]. Basic operations such as arithmetic computation, comparison, and multiplication of signals are performed by DPSs [3]. These systems are embedded with arithmetic blocks that provide high accuracy and reliability [4]. When these arithmetic blocks are used in image processing, approximate arithmetic is applied to reduce the design complexity with minimal error and without degrading the application's performance [5]. Approximate computing is carried out by driving some output function directly from an input, without the intervening circuit logic. Approximate subtractors are developed in [6] by Boolean simplification using the Karnaugh map (K-map), and an approximate divider is designed with the same subtractor to subtract two images for background subtraction. Several approximate comparators are designed in [7]-[9] to reduce power and area and to increase speed. Salt-and-pepper noise is removed with these comparators with no degradation in image quality. Compressors are the best alternative to full adders in Wallace and Dadda multipliers for dot product reduction [10]. Compressors are usually built from full adders; as an alternative, the 4:2 and 5:3 compressors in [11] are designed using XOR-MUX.
Various multiplier architectures for image processing applications have been built with the imprecise compressors proposed in [12]-[14], with the overall error kept limited. Multipliers are constructed in three steps: (i) dot product generation, (ii) reduction of the dot products using a suitable tree, and (iii) addition of the dot products using adders [15]. Of these, the second stage is the most complicated because it consumes the most energy and contributes the most delay. The Wallace and Dadda tree architectures are not efficient for dot product reduction if full adders are used [16]. In the reduction stage, full adders are replaced with compressors to reduce the design complexity [17]. The 4:2 compressors are a good fit for all kinds of multipliers of lower width. In medical imaging, 16-bit pixel images are used to improve accuracy, and their nature differs from that of 8-bit pixels [18]. When 16x16 or wider multipliers are designed using 4:2 compressors, the reduction stage grows. Hence, in [19], a 16x16 multiplier is constructed with 15:4 compressors.
In a multiplier, the approximate compressor plays a key role in reducing area, power, and delay [20]. For a lower-order compressor, the approximation is achieved through the truth table and the K-map. In the truth table, matches between inputs and outputs are identified, and the highly correlated values are then connected directly. In the K-map, the hardware is reduced by removing essential prime implicants [21]. All approximate computing methods give equal attention to image quality and hardware cost. Image quality is measured through the peak signal to noise ratio (PSNR), which quantifies the variation between two digital images [22]. A PSNR value of 30 dB is sufficient for most image processing applications [23]-[28]. The overall multiplier quality is measured through the error distance (ED), which represents the difference between the accurate product and the approximate product [29].
A novel counter-based comparison method between inputs and outputs is presented in this paper. The proposed compressors are efficient in terms of area, power, delay, NED, and MED. A data-set is created from various standard medical images drawn from different databases, with input parameters such as nature of image, size, and contrast ratio, together with the output PSNR. Artificial Neural Network (ANN) and Logistic Regression (LR), as improved in [34], are used to predict the best approximate multiplier according to how the metrics match those of a newly applied image. The remainder of the paper is organised as follows. Section II explains the split-scalable compressor design. The design of impreciseness using a counter-based comparison method is discussed in Section III. The higher-width multiplier architecture is discussed in Section IV. The system performance analysis is presented in Section V. Multiplication of medical images and standard test images with the proposed multiplier is given in Section VI. Machine learning prediction is discussed in Section VII. Finally, the conclusion is presented in Section VIII.

DESIGN OF SCALABLE-SPLIT COMPRESSOR
Several 15:4, 9:4, 8:4, 8:2, and 7:2 compressor designs have been proposed by researchers. These designs are built from sub-components of three or more small compressors. Many authors build these small compressors with XOR-MUX instead of full adders, as shown in Fig.1. Approximation becomes a tedious process when higher-order compressors are used. The proposed scalable compressors are designed with a streamlined approach of parallel inputs to outputs through an XOR-MUX architecture, as shown in Fig.2. The term scalable means that the proposed compressor can be extended to any higher order. In Fig.2, the different levels of the compressor are marked with dotted square boxes. With the same approach, an N:2 compressor can be constructed with (N-3) $C_{in}$ and (N-3) $C_{out}$ signals, where N is the number of inputs and the two system outputs are sum and carry.
A higher-order (8:2) compressor built from four 4:2 compressors was developed in [35]. For any compressor, the XOR of all inputs computes the sum, and every $C_{out}$ is taken from a MUX output. The select line of the MUX is the XOR of the first two input values, whereas the MUX data inputs are the first and third inputs. The sum for the 5:2 stage is given in equation (1), and the carry based on XOR is given in equation (2).
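The XOR-MUX stage described above can be modelled behaviourally. The following Python sketch is an illustration only and not the authors' circuit: the names `xor_mux_stage`, `scalable_compressor`, and `verify` are hypothetical, and the order in which stages are chained is an assumption read from the description of Fig.2. It checks exhaustively that an N:2 compressor built this way, with (N-3) carry-ins, preserves the arithmetic sum of its inputs.

```python
from itertools import product


def xor_mux_stage(a, b, c):
    """One XOR-MUX stage: select = a XOR b, MUX picks c (third) or a (first)."""
    sel = a ^ b
    x = sel ^ c               # XOR of the three stage inputs
    m = c if sel else a       # MUX output, equal to the majority of (a, b, c)
    return x, m


def scalable_compressor(inputs, cins):
    """Behavioural N:2 compressor (assumed chaining): returns (sum, carry, couts)."""
    n = len(inputs)
    if n < 4 or len(cins) != n - 3:
        raise ValueError("expected N >= 4 inputs and N-3 carry-ins")
    x, m = xor_mux_stage(inputs[0], inputs[1], inputs[2])
    mux_outs = [m]
    # Each further stage takes the running XOR, one new input and one carry-in.
    for a, cin in zip(inputs[3:], cins):
        x, m = xor_mux_stage(x, a, cin)
        mux_outs.append(m)
    *couts, carry = mux_outs  # last MUX output is the carry, the rest are C_out
    return x, carry, couts


def verify(n):
    """Check sum(inputs) + sum(cins) == sum + 2 * (carry + sum(couts)) for all cases."""
    for bits in product((0, 1), repeat=2 * n - 3):
        inputs, cins = bits[:n], bits[n:]
        s, carry, couts = scalable_compressor(inputs, cins)
        if sum(bits) != s + 2 * (carry + sum(couts)):
            return False
    return True


if __name__ == "__main__":
    for n in (4, 5, 6):
        print(n, verify(n))
```

In this model a 5:2 compressor has 5 inputs and 2 carry-ins, which agrees with the 7-input count used for the 5:2 example in the counter-matching section.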

Fig. 2 Scalable split compressor
The proposed compressors do not contain any lower-order compressor. As shown in Fig.2, in each dotted box the final XOR output (X) that passes to the next stage is the sum output, and the output collected from the MUX (X − 1) is the carry for that stage, in keeping with the (N-3) $C_{out}$ and $C_{in}$ signals. This work targets 32x32 multiplication, so compressors up to 32:2 were designed. The concept of the split compressor is introduced on top of the scalable compressor, as shown in Fig.3. The upper and lower split increases the design complexity of the tree stage in the multiplier but reduces the critical path delay. All partial products arrive after one gate delay, which is associated with the AND gate. As shown in Fig.3, the generation of the sum takes the longest path and affects multiplier performance, so generating two sum terms with the split compressor minimizes the delay. The upper and lower halves operate in parallel to produce two sum terms, and a half adder combines them into the final sum term, so this structure reduces the delay from T to T/2 + 1.
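The delay argument above can be illustrated with a toy model. The sketch below assumes that the sum path is a linear chain of XOR gates with one unit of delay per gate, which is a simplification and not a timing result from the paper; all function names are hypothetical. It shows that splitting the chain into two halves and combining them with a half adder keeps the same sum bit while roughly halving the chain depth.

```python
from functools import reduce
from itertools import product
from operator import xor


def chain_sum(bits):
    """Sum bit of a non-split compressor: XOR of all signals on the sum path."""
    return reduce(xor, bits, 0)


def chain_depth(k):
    """XOR gates in series for k signals in a linear chain (unit-delay assumption)."""
    return max(k - 1, 0)


def split_sum(bits):
    """Split the signals into upper and lower halves, then combine with a half adder."""
    half = len(bits) // 2
    upper = chain_sum(bits[:half])
    lower = chain_sum(bits[half:])
    # Half adder: sum = XOR, carry = AND (an extra carry of the next weight).
    return upper ^ lower, upper & lower


def split_depth(k):
    """Depth of the split structure: the longer half plus one half-adder stage."""
    half = k // 2
    return max(chain_depth(half), chain_depth(k - half)) + 1


def check_equivalence(k):
    """The split structure must give the same sum bit as the single chain."""
    for bits in product((0, 1), repeat=k):
        final_sum, _ = split_sum(bits)
        if final_sum != chain_sum(bits):
            return False
    return True


if __name__ == "__main__":
    print(check_equivalence(8))
    print(chain_depth(16), split_depth(16))
```

Under these assumptions a 16-signal chain has depth 15 and the split version has depth 8, which is in line with the T to T/2 + 1 reduction stated in the text.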

DESIGN OF COUNTER-MATCHING FOR APPROXIMATION
Approximate computing is a state-of-the-art method for reducing energy and chip area and for increasing speed. A novel comparator- and counter-based approximation technique is presented in this paper. The circuit checks $2^N$ samples, where N is the number of inputs. For example, a 5:2 compressor has 7 inputs, $a_0$-$a_4$ and $C_{in0}$-$C_{in1}$, and therefore $2^7 = 128$ input combinations. The circuit has 4 outputs: $C_{out0}$-$C_{out1}$, sum, and carry. This work improves on the approximation done in [19] by matching each input to every output over all possible input samples, so it can be applied to approximate any part of the circuit. The approximation finder block is described in Fig.4, which reports for how many combinations an input and an output are the same. The pseudo-code for approximation is shown in Algorithm-1, which demonstrates every input's correlation to every output. The proposed approximation finder circuit consists of J = N*M counters and comparators, where M is the number of outputs. All 1-bit comparator inputs are connected to compressor inputs and outputs. The 1st comparator is connected to $a_0$ and $C_{out0}$, the 2nd comparator is connected to $a_0$ and $C_{out1}$, and so on, until the J-th comparator is connected to the N-th input and the M-th output. Each of the J counter outputs gives the count of matched input and output combinations. Based on the counter outputs, various levels of approximation are performed while maintaining the image quality.
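The counter-matching procedure described above can be sketched in software. The Python example below is a minimal illustration under stated assumptions: the 5:2 compressor is modelled as a conventional chain of three XOR-MUX full-adder stages, which may differ from the authors' circuit, and the names `compressor_5_2`, `match_counters`, and `best_matches` are hypothetical. It enumerates all $2^7 = 128$ input combinations and keeps one counter per input/output pair, giving J = 7 × 4 = 28 counters.

```python
from itertools import product

INPUTS = ("a0", "a1", "a2", "a3", "a4", "cin0", "cin1")
OUTPUTS = ("cout0", "cout1", "sum", "carry")


def full_adder(a, b, c):
    """XOR-MUX full adder: returns (sum bit, carry bit)."""
    sel = a ^ b
    return sel ^ c, (c if sel else a)


def compressor_5_2(a0, a1, a2, a3, a4, cin0, cin1):
    """Assumed exact 5:2 compressor built from three chained stages."""
    s1, cout0 = full_adder(a0, a1, a2)
    s2, cout1 = full_adder(s1, a3, cin0)
    total, carry = full_adder(s2, a4, cin1)
    return {"cout0": cout0, "cout1": cout1, "sum": total, "carry": carry}


def match_counters():
    """Count, for every (input, output) pair, how often the two bits are equal."""
    counters = {(i, o): 0 for i in INPUTS for o in OUTPUTS}
    for bits in product((0, 1), repeat=len(INPUTS)):
        ins = dict(zip(INPUTS, bits))
        outs = compressor_5_2(*bits)
        for (i, o) in counters:
            # 1-bit comparator: increment the counter when input equals output.
            if ins[i] == outs[o]:
                counters[(i, o)] += 1
    return counters


def best_matches(counters, top=3):
    """Pairs with the highest match count, i.e. candidates for approximation."""
    return sorted(counters.items(), key=lambda kv: kv[1], reverse=True)[:top]


if __name__ == "__main__":
    counts = match_counters()
    for pair, count in best_matches(counts):
        print(pair, count, f"{100 * count / 2 ** len(INPUTS):.1f}%")
```

In this model the pair (cin1, carry) matches in 96 of the 128 combinations (75%), so wiring the last carry-in straight to the carry output is the kind of candidate approximation the counters are meant to expose.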

MULTIPLIER DESIGN
Most recent multipliers are constructed with compressors. In [12], discrete cosine transform (DCT) operations are performed with approximate compressors. In this work, many new designs of the approximate multiplier are developed and compared with existing ones. The designs are:
Design-1: 32x32 multiplier with scalable compressor.
Design-2: 32x32 multiplier with split-scalable compressor.
Design-3: 32x32 multiplier with approximate scalable compressor in the LSB area.
Design-4: 32x32 multiplier with approximate scalable-split compressor in the upper split of the LSB area.
Design-5: 32x32 multiplier with approximate scalable-split compressor in the lower split of the LSB area.
Design-6: 32x32 multiplier with approximate entire scalable-split compressor in the LSB area.
Design-7: 32x32 multiplier using 15:4 compressor [19].
Design-8: 16x16 multiplier with exact scalable compressor.
Design-9: 16x16 multiplier with exact split-scalable compressor.
Design-10: 16x16 multiplier with approximate scalable compressor in the LSB area.
Design-11: 16x16 multiplier with approximate split-scalable compressor in the upper region of the LSB.
Design-12: 16x16 multiplier with approximate split-scalable compressor in the lower region of the LSB.
Design-13: 16x16 multiplier with approximation in both the upper and lower compressors in the LSB.
Design-14: 16x16 multiplier with 15:4 compressors [19].
Design-15: 16x16 multiplier with 4:2 compressor [14].

Design-1 32x32 multiplier with scalable compressors
In [14], an 8x8 multiplier is designed with a 4:2 compressor. The new multiplier is 32x32 wide, so using any older compressor type increases the design complexity, power, and area; it is easier to design with the scalable compressor, as shown in Fig.5.

Design-2 32x32 multiplier with split-scalable compressor
To reduce the delay of the proposed multiplier, the concept of the split compressor is introduced. Here, the maximum scaling length of one 17:2 compressor and one 16:2 compressor is used in the center part of the multiplier, and either side is scaled down to fit the bit length.

Design-3 32x32 multiplier with approximate scalable compressor in LSB area
The exact compressor is replaced with an approximate compressor in the rightmost part of the multiplier, from the 3rd column to the 32nd column.

Design-4 32x32 multiplier with approximate scalable-split compressor in upper-split of LSB area
The exact compressor is replaced with an approximate compressor in the rightmost part of the multiplier, only in the upper half of the split compressors.

Design-5 32x32 multiplier with approximate scalable-split compressor in lower-split of LSB area
The exact compressor is replaced with an approximate compressor in the rightmost part of the multiplier, only in the lower half of the split compressors.

Design-6 32x32 multiplier with approximate entire scalable-split compressor in LSB area
The exact compressor is replaced with an approximate compressor in the rightmost part of the multiplier, in both the upper and lower halves of the split compressors.

Design-7 32x32 multiplier with 15:4 compressor [19]
To verify the proposed approximate multiplier against previous work [19] and to compare the performance metrics, the 32x32 multiplier is designed using the accurate 15:4 compressor developed in [19]. The best approximate design from [19] is also implemented to compare the image quality of the existing design with that of the designs proposed in this work.

Design-8 16x16 multiplier with scalable compressor
The scalable compressors are used to design the 16x16 multiplier in the same way as Design-1, as shown in Fig.5. A compressor drawn with an empty dot denotes a zero input, added only to fit the two-stage reduction architecture.

Design-9 16x16 multiplier with exact split-scalable compressor
To reduce the delay of the scalable architecture, the normal multiplier reduction stage is split into upper and lower compressors. The split is implemented only in the middle column of the multiplier, as shown with the dotted line in Fig.6.

Design-10 16x16 multiplier with approximate scalable compressor in the LSB area
The exact compressor is replaced with an approximate compressor in the rightmost part of the multiplier, from the 3rd column to the 16th column.

Design-11 16x16 multiplier with approximate split-scalable compressor in upper region of LSB
The exact compressor is replaced with an approximate compressor in the rightmost part of the multiplier, only in the upper half of the split compressors.

Design-12 16x16 multiplier with approximate split-scalable compressor in lower region of LSB
The exact compressor is replaced with an approximate compressor in the rightmost part of the multiplier, only in the lower half of the split compressors.

Design-13 16x16 multiplier with approximate on both upper and lower compressor in LSB
The exact compressor is replaced with an approximate compressor in the rightmost part of the multiplier, in both the upper and lower halves of the split compressors.

Design-14 16x16 multiplier with 15:4 compressor with 5:3 as sub-component [19]
The 15:4 compressor is built with 5:3 compressors and a parallel adder as sub-components. In [19], the multipliers are constructed with 15:4 compressors, 5:3 compressors, full adders, and half adders. The approximation method is applied only to the 5:3 compressor. This existing model is implemented to compare it with the straightforward approach of the proposed multiplier, and the performance is analyzed.

Design-15 16x16 multiplier with 4:2 compressor [14]
In much of the literature, 4:2 compressors are used to build 8x8 multipliers. The 4:2 compressor shown in Fig.1 is used here to build the 16x16 multiplier so that the results can be compared with the proposed multiplier.

DELAY COMPARISON OF SCALABLE-SPLIT COMPRESSOR
The critical path delay is reduced to T/2 + 1 when the scalable-split compressors are used. The longest path in the 16x16 multiplier is the 16th column, where the split compressors are used, as shown in Fig.7. To compare the delay of the sum term, the XOR path is traced in the split and non-split architectures, and both are synthesised with a 90 nm library in Cadence Virtuoso to compare area, power, and delay. Table-1 shows that the delay of the split-scalable compressor is reduced and, moreover, that glitches are also reduced.

Approximation
The Verilog simulation result of the 12:2 compressor approximation is shown in Fig.8. It clearly shows that the counter values for the pairs ($C_{in6}$, $C_{out8}$), ($C_{in7}$, carry), and ($C_{in8}$, carry) are the most highly correlated over the total cycles. Functional verification shows that many inputs match outputs in 75% of the cycles. Table-2 shows the highly correlated combinations for the different higher-order compressors. Many approximations can be performed based on Table-2. However, approximation over the least significant bits (LSB) reduces the error distance (ED) and normalized error distance (NED), so in all compressors the final $C_{in}$ is approximated as the carry to reduce the hardware. The error of the new compressors is studied against the exact compressor. Equations (3) and (4) describe the NED and MED.
Here N is the input length of the multiplier, and $ED_i$ is the difference between the exact and approximate multiplier outputs for the i-th output vector. D is the maximum possible error, given by $(2^N - 1)^2$. Table-3 shows the NED and MED of the proposed and existing designs. From Fig.9, Design-11 and Design-12 perform well and produce less error. Table-4 and Table-5 clearly show that the area of the proposed multiplier is improved by 30%, while the power and delay are improved by 25%.
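The error metrics above can be computed exhaustively for small operand widths. The Python sketch below assumes the conventional definitions, since equations (3) and (4) are not reproduced in this text: MED is taken as the mean of $|ED_i|$ over all $2^{2N}$ operand pairs, and NED as MED divided by $D = (2^N - 1)^2$. The approximate multiplier used here, `truncated_mul`, is a hypothetical stand-in that drops the lowest k partial-product columns; it is not one of the proposed designs.

```python
def truncated_mul(a, b, n, k):
    """Hypothetical approximate multiplier: ignore partial products in columns < k."""
    total = 0
    for i in range(n):
        if not (a >> i) & 1:
            continue
        for j in range(n):
            # Keep the partial product a_i * b_j only if its column i + j >= k.
            if (b >> j) & 1 and i + j >= k:
                total += 1 << (i + j)
    return total


def error_metrics(n, approx):
    """Return (MED, NED) of `approx` against the exact product for n-bit operands."""
    size = 1 << n
    total_ed = 0
    for a in range(size):
        for b in range(size):
            total_ed += abs(a * b - approx(a, b))   # error distance ED_i
    med = total_ed / (size * size)                  # mean over 2^(2n) input pairs
    d_max = (size - 1) ** 2                         # D = (2^n - 1)^2
    return med, med / d_max


if __name__ == "__main__":
    n = 8
    for k in (0, 2, 4, 6):
        med, ned = error_metrics(n, lambda a, b: truncated_mul(a, b, n, k))
        print(f"k={k}: MED={med:.4f}, NED={ned:.6e}")
```

With k = 0 the stand-in multiplier is exact and both metrics are zero; increasing k trades accuracy for fewer partial products, which is the trade-off MED and NED are meant to quantify.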

APPLICATION
The proposed designs are used for image multiplication. The difference between the image produced by the exact multiplier and the image produced by the approximate multiplier is expressed as the mean square error (MSE). Based on the MSE, the quality of each approximate multiplier is assessed through the PSNR value of the multiplied images [36]. The calculations for MSE and PSNR are given in equations (5) and (6).
$MAX_I$ is the maximum possible value of a pixel, m and n are the image dimensions, and X(i, j) and Y(i, j) are the pixel values at position (i, j) of the accurate and approximate images, respectively. The PSNR of each noisy image is depicted in Fig.10 and Fig.11. The medical images are collected from the 16-bit Digital Imaging and Communications in Medicine (DICOM) samples of aycan OsiriXPRO (http://www.aycan.de/lp/sample-dicom-images.html). In Fig.11, the CT image of Arterielle is multiplied with the proposed and existing multipliers, and the PSNR levels are compared.
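The MSE and PSNR computation can be sketched for 16-bit images as follows. This Python example uses the standard definitions, MSE as the mean of the squared pixel differences over the m × n image and PSNR as $10 \log_{10}(MAX_I^2 / MSE)$, which are assumed to correspond to equations (5) and (6). It also assumes that pixels use the full 16-bit range, so $MAX_I = 65535$; a DICOM file may store fewer significant bits, in which case `max_value` should be adjusted. The images are plain nested lists and the function names are illustrative.

```python
import math

MAX_16BIT = 2 ** 16 - 1  # assumed MAX_I for full-range 16-bit pixels


def mse(exact, approx):
    """Mean square error between two equally sized images (lists of rows)."""
    m, n = len(exact), len(exact[0])
    total = 0
    for row_x, row_y in zip(exact, approx):
        for x, y in zip(row_x, row_y):
            total += (x - y) ** 2
    return total / (m * n)


def psnr(exact, approx, max_value=MAX_16BIT):
    """Peak signal to noise ratio in dB; infinite when the images are identical."""
    err = mse(exact, approx)
    if err == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / err)


if __name__ == "__main__":
    exact_image = [[1000, 2000], [3000, 4000]]
    approx_image = [[1000, 2000], [3000, 4010]]
    print(mse(exact_image, approx_image))
    print(round(psnr(exact_image, approx_image), 2))
```

Here `exact_image` would hold the output of the exact multiplier and `approx_image` the output of the approximate multiplier for the same input image and scaling factor.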

MODEL PREDICTION
With the advancement of modern digital technology, machine learning and deep learning work with big data to make more precise predictions in areas such as weather forecasting, trade investment, and hospital patient management. A large data-set is constructed from several subjects taken from (https://www.aliza-dicom-viewer.com/download/datasets). For model development, 62% of the initial data (training set) is used for training, while 38% (testing set) is set aside randomly for post-training evaluation of the models. After multiplication with the proposed and existing systems, the input features, namely nature of image, height, width, contrast ratio (CR), and PSNR, are recorded in an Excel sheet to predict the best system, as shown in Fig.12.
In the Excel sheet, systems 0-5 correspond to Design-10 to Design-15, and nature 0 stands for the CT image of Arterielle. In the same way, 15 subjects are taken. Each subject consists of 20 images, and all are multiplied with systems 0-5, so the data-set contains 1800 samples. The improved multi-class classification algorithms LR and SVM from [34] are used to predict the best system. The training graph and confusion matrix are shown in Fig.13. When a new medical image is taken for contrast scaling, the algorithm identifies the best approximate method from the input parameters, with an accuracy of 96.52% for SVM and 95.65% for LR.
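The structure of the data-set and the 62%/38% split can be sketched as follows. This Python example only enumerates the sample layout described above (15 subjects × 20 images × 6 systems = 1800 rows) and performs a random split; it contains no measured values. The field names, the `None` placeholders for the measured features, and the fixed seed are assumptions made for illustration.

```python
import random

N_SUBJECTS = 15          # "nature" labels 0-14
IMAGES_PER_SUBJECT = 20
N_SYSTEMS = 6            # systems 0-5 map to Design-10 ... Design-15


def build_rows():
    """One row per (subject, image, system); measured features are left empty."""
    rows = []
    for nature in range(N_SUBJECTS):
        for image in range(IMAGES_PER_SUBJECT):
            for system in range(N_SYSTEMS):
                rows.append({
                    "nature": nature,
                    "image": image,
                    "system": system,
                    "design": f"Design-{system + 10}",
                    # Placeholders: to be filled from the actual images and results.
                    "height": None, "width": None, "cr": None, "psnr": None,
                })
    return rows


def split_rows(rows, train_fraction=0.62, seed=0):
    """Random split into training and testing sets (62% / 38% by default)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    rows = build_rows()
    train, test = split_rows(rows)
    print(len(rows), len(train), len(test))
```

With these counts the split yields 1116 training rows and 684 testing rows. Once the feature columns are filled in, rows of this shape could be passed to a multi-class classifier such as logistic regression or an SVM.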

CONCLUSION
A new scalable-split approximate compressor architecture has been designed in this work. The multiplier constructed with the new approximate compressors shows an average improvement of 25% in area, 27% in power, and 20% in delay. The proposed multiplier produces a PSNR that is 5 dB to 12 dB higher than that of the previous design across the different application images. Among the 32x32 multipliers, Design-3 produces attractive results; among the 16x16 multipliers, Design-9 is good in circuit aspects, and Design-7 is good in image quality. Depending on the target, the machine learning model therefore identifies the best multiplier. In future work, a standalone image processing processor will be constructed with adaptive approximate computing to reduce energy consumption.