3.1 Theoretical knowledge
The Boltzmann machine is a Markov random field based on an energy model. The energy model uses an energy function to establish the functional relationship between the energy of the system in a given state and the probability of that state occurring, and thereby measures the probability distribution of a stochastic network. From statistical mechanics, any probability distribution can be transformed into an energy-based model. The Boltzmann machine uses the existing properties and learning procedures of the energy model to perform unsupervised maximum-likelihood learning of the data's probability distribution. The RBM adopts a layered neural network structure that restricts the connectivity of the Boltzmann machine: neurons within a layer are not connected, while neurons in adjacent layers are fully connected. For this reason, the RBM is often used as the front-end module of a deep neural network, handling common tasks such as feature extraction and dimensionality reduction on the input data.
The RBM model consists of two layers, a visible layer and a hidden layer. As shown in Fig. 1, the lower layer is the visible layer, containing m neurons v1, v2, ..., vm, recorded as v = (v1, v2, ..., vm); the upper layer is the hidden layer, containing n neurons h1, h2, ..., hn, recorded as h = (h1, h2, ..., hn). The model parameters include the bias of the visible layer b = (b1, b2, ..., bm), the bias of the hidden layer c = (c1, c2, ..., cn), and the weight matrix W of size n × m on the edges between the visible layer and the hidden layer.
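To make the energy-based formulation concrete, the following is a minimal Python/numpy sketch of the standard RBM energy function E(v, h) = −b·v − c·h − hᵀWv defined by these parameters; the language and the layer sizes are illustrative choices, not taken from the paper:

```python
import numpy as np

# Illustrative sizes only: m visible units, n hidden units.
m, n = 4, 3
rng = np.random.default_rng(0)

b = rng.normal(size=m)        # visible-layer bias b = (b1, ..., bm)
c = rng.normal(size=n)        # hidden-layer bias c = (c1, ..., cn)
W = rng.normal(size=(n, m))   # weights on visible-hidden edges, shape n x m

def energy(v, h):
    """Standard RBM energy: E(v, h) = -b.v - c.h - h^T W v."""
    return -b @ v - c @ h - h @ W @ v

v = rng.integers(0, 2, size=m)  # binary visible state
h = rng.integers(0, 2, size=n)  # binary hidden state
print(energy(v, h))
```

Lower-energy configurations correspond to higher-probability states, which is the relationship the energy function is meant to capture.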
3.2 Algorithm model
In a deep neural network, the design of the activation function is an important step. The activation function is usually applied to the output of the linear function z = ωx + b. It is introduced to add nonlinearity, and is usually required to be nonlinear and differentiable. Because the activation function is nonlinear, the change of the output value is easy to control; if the activation function were linear, stacked layers would collapse into a single linear mapping and the change of the output value could not be controlled. Common activation functions are as follows:
Sigmoid function:
The sigmoid function, also known as the logistic function, is defined as follows:
$$f(x)=\frac{1}{1+e^{-x}} \tag{1}$$
The sigmoid function was one of the earliest activation functions. It introduces nonlinearity by mapping the activation value into the interval between 0 and 1, where the value indicates the degree of activation: a value of 1 indicates full activation, and a value of 0 indicates full deactivation.
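A direct Python transcription of Eq. 1, shown only to make the mapping into (0, 1) concrete (numpy is an implementation choice, not from the paper):

```python
import numpy as np

def sigmoid(x):
    """Eq. 1: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.007, 0.5, 0.993]
```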
Tanh function: The tanh function, also known as the hyperbolic tangent function, is defined as follows:
$$\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} \tag{2}$$
The tanh function is very similar in form to the sigmoid function; both are elongated "S" curves. However, the tanh function maps inputs to the range −1 to 1: an output value of 1 indicates full activation, and an output value of −1 indicates full deactivation. The curves of the tanh and sigmoid functions are shown in Fig. 2. From the relationship between x and y, it can be seen that the sigmoid curve saturates when |x| > 4 and tanh saturates when |x| > 2, which affects the convergence of the model parameters during training.
Because of the saturating nonlinearity of the tanh and sigmoid functions, vanishing gradients arise during gradient-based training. Therefore, when using these two activation functions, batch normalization (BN) is typically added to alleviate the saturation and vanishing-gradient problems.
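The saturation described above can be checked numerically: the derivatives σ′(x) = σ(x)(1 − σ(x)) and tanh′(x) = 1 − tanh²(x) are already near zero at the thresholds quoted in the text, which is what causes the gradient to vanish. A minimal sketch (sample points chosen for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (0.0, 2.0, 4.0):
    d_sig = sigmoid(x) * (1.0 - sigmoid(x))  # sigmoid derivative
    d_tanh = 1.0 - np.tanh(x) ** 2           # tanh derivative
    print(f"x={x}: sigmoid'={d_sig:.4f}, tanh'={d_tanh:.4f}")
# At x=4 both derivatives are tiny, so gradient updates nearly stall.
```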
ReLU function: The rectified linear unit (ReLU) is an activation function commonly used in neural network structures. Its expression is shown in Formula 3:
$$f(x)=\max(x,0) \tag{3}$$
It can be seen from Eq. 3 that the function is divided into two parts at the origin: the left part is 0, and the right part is a straight line with slope 1, satisfying the nonlinearity condition. If x < 0, the output is 0; if x > 0, the output equals the input. This one-sided suppression makes the network's activations sparse, reducing the number of active neurons, which increases the model's ability to extract data features and improves the data-fitting effect.
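A direct sketch of Eq. 3, showing the one-sided suppression on a few sample inputs (the inputs are arbitrary examples):

```python
import numpy as np

def relu(x):
    """Eq. 3: 0 for negative inputs, identity for positive inputs."""
    return np.maximum(x, 0.0)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
```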
The softmax function is usually used in the output layer of a neural network as the output of the classifier. Each output value represents the probability of a class, and its functional form is as follows:
$$\sigma_{i}(z)=\frac{e^{z_{i}}}{\sum_{j=1}^{m} e^{z_{j}}} \tag{4}$$
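A sketch of Eq. 4; subtracting max(z) before exponentiating is a common numerical-stability step that does not change the result, and the input values below are arbitrary examples:

```python
import numpy as np

def softmax(z):
    """Eq. 4: exponentiate and normalize so the outputs sum to 1."""
    e = np.exp(z - np.max(z))  # shift for numerical stability (result unchanged)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # class probabilities, summing to 1
```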
The input to the hidden layer is computed from the input vector, the weights, and the biases. The calculation formula is:
$$u_{j}=\sum_{i=1}^{n} \omega_{ij}\, x_{i}+a_{j},\quad j=1,2,\dots,l \tag{5}$$
The output of the hidden layer is obtained by applying the activation function to the hidden layer's input. The calculation formula is:
$$y_{j}=f\left(u_{j}\right),\quad j=1,2,\dots,l \tag{6}$$
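Equations 5 and 6 together describe one forward pass through the hidden layer. A minimal sketch with arbitrary sizes n (inputs) and l (hidden units), using the sigmoid of Eq. 1 as f; all values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
n, l = 5, 3                      # illustrative sizes: n inputs, l hidden units
x = rng.normal(size=n)           # input vector
omega = rng.normal(size=(n, l))  # weights omega_ij
a = rng.normal(size=l)           # biases a_j

u = omega.T @ x + a              # Eq. 5: u_j = sum_i omega_ij * x_i + a_j
y = 1.0 / (1.0 + np.exp(-u))     # Eq. 6: y_j = f(u_j), here f = sigmoid
print(y)
```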
In classification, the learned model computes posterior probabilities, and the class with the maximum posterior probability is taken as the classifier's result. According to Bayes' theorem, the Bayesian classifier is expressed in terms of the posterior probability:
$$y=\arg\max_{c_{k}} \frac{P\left(Y=c_{k}\right)\prod_{i} P\left(X^{(i)}=x^{(i)}\mid Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right)\prod_{i} P\left(X^{(i)}=x^{(i)}\mid Y=c_{k}\right)} \tag{7}$$
The parameters of a Bayesian classifier can be estimated by maximum likelihood estimation or by Bayesian estimation. Generally speaking, the naive Bayes algorithm is a relatively simple algorithm, and it can be combined with N-grams to represent different contexts.
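A minimal sketch of the decision rule in Eq. 7 for categorical features; the class names, priors, and likelihood tables below are toy values invented purely for illustration:

```python
import numpy as np

classes = ["c1", "c2"]
prior = {"c1": 0.6, "c2": 0.4}  # P(Y = c_k), toy values
# P(X^(i) = x^(i) | Y = c_k) for two binary features, toy values
likelihood = {
    "c1": [{0: 0.7, 1: 0.3}, {0: 0.4, 1: 0.6}],
    "c2": [{0: 0.2, 1: 0.8}, {0: 0.9, 1: 0.1}],
}

def classify(x):
    """Eq. 7: pick the class with the largest posterior probability."""
    joint = {c: prior[c] * np.prod([likelihood[c][i][xi] for i, xi in enumerate(x)])
             for c in classes}
    z = sum(joint.values())                      # normalizer (the denominator)
    posterior = {c: joint[c] / z for c in classes}
    return max(posterior, key=posterior.get), posterior

print(classify((1, 0)))  # -> ('c2', {...}): c2 has the larger posterior
```

Note that the denominator is the same for every class, so in practice the argmax can be taken over the numerators alone; it is kept here to match Eq. 7.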
For linearly separable data, a linear classifier can be constructed by maximizing the hard margin. The corresponding learning algorithm is called the maximum-margin method, and this type of support vector machine is also called a hard-margin support vector machine. The feature space is X = R^n, and the output space is Y = {+1, −1}. The separating hyperplane is expressed as:
$$H: w\cdot x+b=0 \tag{8}$$
The input sample points closest to the separating hyperplane are called support vectors, and a support vector satisfies:
$$y_{i}\left(w\cdot x_{i}+b\right)-1=0 \tag{9}$$
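A minimal numerical check of Eqs. 8 and 9 with a hand-picked hyperplane and sample points, chosen only for illustration: points on the margin boundary satisfy y_i(w·x_i + b) − 1 = 0, while points farther from the hyperplane give a strictly positive value:

```python
import numpy as np

w = np.array([1.0, 0.0])  # hyperplane H: w.x + b = 0 (Eq. 8), toy parameters
b = -1.0

# (x_i, y_i) pairs; the first two lie exactly on the margin boundaries.
samples = [(np.array([2.0, 0.0]), +1),   # w.x + b =  1 -> support vector
           (np.array([0.0, 0.0]), -1),   # w.x + b = -1 -> support vector
           (np.array([4.0, 1.0]), +1)]   # farther from H -> not a support vector

for x, y in samples:
    slack = y * (w @ x + b) - 1.0        # Eq. 9: zero exactly for support vectors
    print(x, "support vector" if np.isclose(slack, 0.0) else "interior point")
```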
After the word-frequency matrix is obtained, the second step is to calculate the TF-IDF weight of each word, where TF stands for term frequency and IDF stands for inverse document frequency. Assume the number of documents in the corpus is |D|, the word whose TF-IDF weight is being calculated is w, the current document is Di, the total number of words in the document is |Di|, and the number of occurrences of w is |w|. Then TF and IDF are calculated as shown in Equations 10 and 11, respectively:
$$\mathrm{TF}_{w}=\frac{|w|}{\left|D_{i}\right|} \tag{10}$$
$$\mathrm{IDF}_{w}=\log\frac{|D|}{\left|\left\{D_{i} : w\in D_{i}\right\}\right|} \tag{11}$$
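A direct sketch of Eqs. 10 and 11 over a toy corpus (the documents are invented for illustration); real systems usually add smoothing to the IDF denominator to avoid division by zero, which is omitted here to match the formulas:

```python
import math

# Toy corpus: each document is a list of tokens.
corpus = [["deep", "learning", "model"],
          ["deep", "network"],
          ["bayes", "model", "model"]]

def tf_idf(w, doc, docs):
    tf = doc.count(w) / len(doc)            # Eq. 10: |w| / |D_i|
    df = sum(1 for d in docs if w in d)     # |{D_i : w in D_i}|
    idf = math.log(len(docs) / df)          # Eq. 11
    return tf * idf

print(tf_idf("model", corpus[2], corpus))   # weight of "model" in document 3
```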