## 3.1 Basic principle

In 2006, the concept of "deep learning" was proposed, showing that neural networks with multiple hidden layers have strong learning ability. As a result, neural networks have been applied to image recognition with increasing success. Among them, the convolutional neural network (CNN) is the most widely used and best-performing model in the field of deep learning. A CNN performs image recognition mainly through convolution combined with neural network operations: convolution provides excellent feature extraction, while the neural network learns its parameters efficiently, so combining the two is effective for such tasks. Owing to these learning characteristics, CNNs excel in the field of image recognition.

An artificial neural network simulates, by computer, the biological information processing of the human brain in order to reproduce its task-processing capability. It is a topological structure of interconnected units: it accepts various types of input, continuous or discrete, extracts features from the raw data through the adaptive nonlinear dynamic network formed by the interconnections, and then classifies them, realizing an intelligent computer simulation of the human brain. The theory of deep learning networks originates from the artificial neural network.

Figure 1 is a schematic diagram of an artificial neuron. Its inputs are either source inputs or the outputs of upstream neurons; each input is multiplied by a weight and the products are summed. A positive weight denotes excitation, with larger values producing a stronger effect; a negative weight denotes inhibition, with values closer to zero producing weaker inhibition. The accumulated sum is then compared with a threshold: if it exceeds the threshold, the output is 1; otherwise, the output is 0.

In practical applications of neural networks, different activation functions can be selected, such as the S-shaped (sigmoid) function:

$$\sigma\left(x\right)=\mathrm{sigm}\left(x\right)=\frac{1}{1+e^{-x}}\tag{1}$$

The sigmoid function converts arbitrary input signals into values in the open interval (0, 1), so a binary classification result, nominally 0 or 1, can be read as a probability of belonging to either class. Its main strength is this smooth, nonlinear mapping from input to output. The network learns its parameters on the training set and is then verified on the test set.
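As a minimal sketch of Eq. (1) in plain Python (the function name and sample values here are illustrative, not from the text):

```python
import math

def sigmoid(x):
    """Squash any real input into (0, 1), interpretable as a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Large positive inputs approach 1, large negative inputs approach 0,
# and sigmoid(0) is exactly 0.5 -- the natural binary decision boundary.
print(sigmoid(0))            # 0.5
print(round(sigmoid(4), 3))  # 0.982
print(round(sigmoid(-4), 3)) # 0.018
```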

Tanh function:

$$\tanh\left(x\right)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}\tag{2}$$

Tanh has characteristics similar to sigmoid. The tanh function compresses its output into the range −1 to 1, so its mean is approximately zero. For inputs within [−1, 1] the output is very sensitive to changes, but once the input exceeds this range the output changes very little and the function saturates.

ReLU function:

$$\mathrm{ReLU}\left(x\right)=\max\left(0,x\right)\tag{3}$$

The ReLU function compensates for the shortcomings of the sigmoid and tanh functions. If the input is less than or equal to 0, the output is 0; if the input is greater than 0, the output is the input itself. ReLU reduces the interdependence between parameters and thus helps alleviate overfitting.

The most important part of the convolution layer is the convolution kernel, whose purpose is to extract features. Its working principle is to first set the kernel and the stride, then take the dot product of the kernel with the corresponding region of the input image; the sum of the products becomes one pixel value of the output image. After the first convolution, the kernel moves by the preset stride to compute the next output pixel, and so on, until the full output feature map of the layer is obtained. Unlike traditional neural networks, where all parameters are learned automatically, some parameters of a convolutional neural network are predetermined, such as the kernel size, which is usually one of 5 × 5, 3 × 3, or 1 × 1. Each convolution layer typically uses dozens or even hundreds of kernels to extract features from the image, and different strides, fixed in advance by experiment, yield different features. The principle is as shown in the formula: in layer t, the feature maps of the previous layer are convolved with the kernels, and the feature maps of the current layer are obtained through the activation function:

$$y_{l}^{t}=f\left(\left(\sum_{i\in x^{t-1}} x_{i}^{t-1}\otimes k_{i,l}^{t}\right)+b_{l}^{t}\right)\tag{4}$$
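The sliding-window dot product described above can be sketched in plain Python (the function name, image, and kernel values are illustrative, not from the text):

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image; each output pixel is the sum of
    elementwise products between the kernel and the region it covers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = 0.0
            for u in range(kh):
                for v in range(kw):
                    s += image[i * stride + u][j * stride + v] * kernel[u][v]
            row.append(s)
        out.append(row)
    return out

# A 3x3 vertical-edge-style kernel applied to a 4x4 image with stride 1
image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 1, 1, 1]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
print(conv2d(image, kernel))  # [[-6.0, 12.0], [-4.0, 10.0]]
```

Note how a 4 × 4 input and a 3 × 3 kernel with stride 1 produce a 2 × 2 output, matching the size formula implied by the sliding-window procedure.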

After the image has been processed by the convolution layer, the feature maps are downsampled to reduce their size and obtain aggregate statistics of the features at different locations:

$$y_{l}^{t}=f\left(\mathrm{down}\left(y_{l}^{t-1}\right)\cdot w_{l}^{t}+b_{l}^{t}\right)\tag{5}$$
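The down(·) operator in Eq. (5) is commonly implemented as max or mean pooling; a max-pooling sketch in plain Python (names and sample values are ours):

```python
def max_pool(fmap, size=2, stride=2):
    """Aggregate each size x size window into its maximum value,
    shrinking the feature map while keeping the strongest responses."""
    out = []
    for i in range(0, len(fmap) - size + 1, stride):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, stride):
            row.append(max(fmap[i + u][j + v]
                           for u in range(size) for v in range(size)))
        out.append(row)
    return out

fmap = [[1, 3, 2, 1],
        [4, 6, 5, 0],
        [7, 2, 9, 8],
        [1, 0, 3, 4]]
print(max_pool(fmap))  # [[6, 5], [7, 9]]
```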

At present, the ReLU function is widely used; its formula is as follows:

$$f\left(x\right)=\begin{cases}x, & x>0\\ 0, & x\le 0\end{cases}\tag{6}$$

The leaky ReLU (LReLU) was initially used in acoustic models, where tests showed it to be slightly better than ReLU. Its formula is as follows:

$$f\left(x\right)=\begin{cases}x, & x>0\\ 0.01x, & \text{otherwise}\end{cases}\tag{7}$$

An exponential term can be applied to negative inputs to avoid neuron "death". The most representative such function is the ELU, whose formula is as follows:

$$f\left(x\right)=\begin{cases}x, & x>0\\ a\left(e^{x}-1\right), & \text{otherwise}\end{cases}\tag{8}$$
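The contrast between Eqs. (6)-(8) can be seen side by side in a small sketch (function names and sample inputs are illustrative):

```python
import math

def relu(x):
    """Eq. (6): zero for non-positive inputs."""
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Eq. (7): a small linear slope keeps negative inputs alive."""
    return x if x > 0 else slope * x

def elu(x, a=1.0):
    """Eq. (8): an exponential tail, smooth and bounded below by -a."""
    return x if x > 0 else a * (math.exp(x) - 1)

# For negative inputs ReLU outputs exactly 0 (the neuron "dies"),
# while LReLU and ELU keep a small nonzero output.
for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), leaky_relu(x), round(elu(x), 4))
```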

Two classifiers, the softmax classifier and the sigmoid classifier, are used in this paper. Sigmoid is a function commonly used in logistic regression and is often applied to binary classification problems. The formula is as follows:

$$s\left(x\right)=\frac{1}{1+e^{-x}}\tag{9}$$

## 3.2 Model construction

The Radon transform maps an image along projections at a given angle. The image is regarded as a two-dimensional function: for an image Image = G(x, y), the Radon transform is the line integral of the function along one direction. The complete formula is as follows:

$$R_{G}(\rho,\theta)=\int_{-\infty}^{+\infty} G(\rho\cos\theta-s\sin\theta,\ \rho\sin\theta+s\cos\theta)\,\mathrm{d}s\tag{10}$$

$$\begin{bmatrix}\rho\\ s\end{bmatrix}=\begin{bmatrix}\cos\theta & \sin\theta\\ -\sin\theta & \cos\theta\end{bmatrix}\begin{bmatrix}x\\ y\end{bmatrix}\tag{11}$$
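Eq. (11) is a plane rotation, and its inverse is exactly the argument of G in Eq. (10). A quick numerical check of this relationship in plain Python (function names and sample point are ours):

```python
import math

def to_rho_s(x, y, theta):
    """Rotate image coordinates (x, y) into the (rho, s) frame of Eq. (11)."""
    rho = x * math.cos(theta) + y * math.sin(theta)
    s = -x * math.sin(theta) + y * math.cos(theta)
    return rho, s

def from_rho_s(rho, s, theta):
    """Inverse rotation -- the coordinate pair appearing inside G in Eq. (10)."""
    x = rho * math.cos(theta) - s * math.sin(theta)
    y = rho * math.sin(theta) + s * math.cos(theta)
    return x, y

theta = math.pi / 6
rho, s = to_rho_s(2.0, 1.0, theta)
x, y = from_rho_s(rho, s, theta)
print(round(x, 10), round(y, 10))  # recovers 2.0 1.0
```

Because the two transforms are mutually inverse rotations, integrating G over s at fixed rho traces one projection line of the image.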

The function of batch normalization is to keep the input distribution of each layer of the neural network stable during training: it extends the normalization of the input data to the inputs of the other layers and reduces the internal covariate shift of the data. As shown in the following formulas, this processing effectively improves the generalization ability and training speed of the network.

$$\begin{cases}\mu_{B}\leftarrow \dfrac{1}{m}\displaystyle\sum_{i=1}^{m} x_{i}\\ \sigma_{B}^{2}\leftarrow \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}\left(x_{i}-\mu_{B}\right)^{2}\\ \hat{x}_{i}\leftarrow \dfrac{x_{i}-\mu_{B}}{\sqrt{\sigma_{B}^{2}+\epsilon}}\\ y_{i}\leftarrow \gamma\hat{x}_{i}+\beta \equiv \mathrm{BN}_{\gamma,\beta}\left(x_{i}\right)\end{cases}\tag{12}$$

After training, the scale and shift of the above equations are folded together for inference, giving:

$$y=\frac{\gamma}{\sqrt{V[x]+\epsilon}}\,x+\left(\beta-\frac{\gamma\, E[x]}{\sqrt{V[x]+\epsilon}}\right)\tag{13}$$
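Eqs. (12) and (13) can be sketched for a single feature in plain Python (function names and batch values are ours; the population statistics passed to the inference form would in practice be running averages collected during training):

```python
import math

def bn_train(xs, gamma, beta, eps=1e-5):
    """Batch-normalize a mini-batch as in Eq. (12)."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

def bn_inference(x, gamma, beta, mean, var, eps=1e-5):
    """Folded inference form of Eq. (13): a single scale and shift."""
    scale = gamma / math.sqrt(var + eps)
    return scale * x + (beta - scale * mean)

batch = [1.0, 2.0, 3.0, 4.0]
# Normalized batch has (approximately) zero mean and unit variance.
print([round(v, 3) for v in bn_train(batch, gamma=1.0, beta=0.0)])
```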

A densely connected network contains redundancy, so DenseNet keeps each layer very narrow. Each Dense Block unit is in fact a bottleneck layer consisting of a 1 × 1 conv and a 3 × 3 conv. Between blocks there is a transition layer, consisting of a BN, a 1 × 1 conv, and pooling, which also reduces redundancy: of the m feature maps produced by a block, the transition layer outputs only a fraction, determined by a compression parameter between 0 and 1. The structure of the network is shown in Fig. 2.

This model is extensible and can recognize a line of text in an image accurately. This paper takes the unified motor-vehicle sales tax invoice as an example: the bill is precisely located and segmented into single lines of text, each of which is then recognized, so that the whole tax invoice can be identified. To extend the method to other types of tax invoice, one only needs to determine the coordinates of each internal region of the invoice, i.e., the precise position of each field, so the approach can be generalized to all tax invoices.

## 3.3 Model training results

To show that maxout + dropout is better suited to the data in this paper, the change in the model's loss function is plotted in Fig. 3.

As the figure shows, the loss function decreases over the iterations; the training accuracy of the model is shown in Fig. 4. The accuracy on the training set eventually reaches 98%, which indicates that the model is overfitting. Dropout is an effective method for combating overfitting and works well together with the maxout function; this combination is often used to correct the model's fit.

Precision is defined with respect to the prediction results: it is the proportion of truly positive samples among all samples predicted as positive. A positive prediction is either a correctly predicted positive sample (TP) or a negative sample predicted as positive (FP). Precision is calculated as follows:

$$P=\frac{TP}{TP+FP}\tag{14}$$

Recall is defined with respect to the original samples: it is the proportion of positive samples that are correctly predicted as positive (FN denotes positive samples predicted as negative). Its formula is as follows:

$$R=\frac{TP}{TP+FN}\tag{15}$$
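Eqs. (14) and (15) can be computed directly from binary labels; a minimal sketch (function name and sample labels are ours):

```python
def precision_recall(y_true, y_pred):
    """Count TP, FP, FN from binary labels (1 = positive) and
    apply Eq. (14) for precision and Eq. (15) for recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
print(precision_recall(y_true, y_pred))  # (0.75, 0.75)
```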

With dropout, the activation value used during model training takes the form shown in the following formula (D denotes the dropout mask):

$$f\left(\sum_{k=1}^{d_{i}}\omega_{k}x_{k}+b\right)=D\times a\left(\sum_{k=1}^{d_{i}}\omega_{k}x_{k}+b\right)\tag{16}$$

The Bernoulli random variable has the following probability mass function (k is the possible outcome, p is the probability that a neuron is dropped):

$$f\left(k;p\right)=\begin{cases}p, & \text{if } k=1\\ 1-p, & \text{if } k=0\end{cases}\tag{17}$$

Applying dropout to the i-th neuron gives:

$$O_{i}=X_{i}\,a\left(\sum_{k=1}^{d_{i}}\omega_{k}x_{k}+b\right)=\begin{cases}a\left(\sum_{k=1}^{d_{i}}\omega_{k}x_{k}+b\right), & \text{if } X_{i}=1\\ 0, & \text{if } X_{i}=0\end{cases}\tag{18}$$
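Eqs. (17) and (18) amount to zeroing each activation with probability p; a minimal sketch in plain Python (function name and sample activations are ours, and the seed is fixed only to make the example reproducible):

```python
import random

def dropout_layer(activations, p):
    """Eq. (18): each neuron's output is kept (X_i = 1) with probability
    1 - p and zeroed (X_i = 0) with probability p, per Eq. (17)."""
    return [a if random.random() >= p else 0.0 for a in activations]

random.seed(1)
acts = [0.5, 1.2, -0.3, 0.8]
# Roughly half of the activations are zeroed when p = 0.5.
print(dropout_layer(acts, p=0.5))
```

At inference time no units are dropped; frameworks typically compensate by scaling, consistent with the mask factor D in Eq. (16).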

To improve the prediction accuracy, we repeatedly adjust the parameter of the dropout layer to find the value that gives the highest accuracy. The prediction accuracy on the test samples is recorded for each value, and the results are shown in Table 1.

Table 1

Accuracy of the dropout layer under different parameters

| Parameter | Accuracy rate |
| --- | --- |
| 0.1 | 86.46% |
| 0.2 | 87.87% |
| 0.3 | 84.43% |
| 0.4 | 81.96% |
| 0.5 | 77.67% |
| 0.6 | 73.83% |
| 0.7 | 70.65% |
| 0.8 | 53.47% |