## 3.1 Neural network algorithm

The basic design and algorithms of convolutional neural networks (CNNs) have evolved through continuous development and application. Building on this basis, researchers proposed multilayer CNNs such as LeNet-5, which can recognize handwritten digits in images.

The neurons are interconnected in a way similar to the organization of the animal visual cortex. The receptive fields of different neurons overlap, so that together they cover the entire visual field. Suppose the input $net_i$ of the i-th neuron in the input layer is:

$${net}_{i}=\sum _{i=1}^{M}{x}_{i}+{\theta }_{i} \tag{1}$$

Here $\theta_i$ is the threshold of the i-th neuron. The corresponding output $a_i$ is given by Formula (2):

$${a}_{i}=f\left({net}_{i}\right) \tag{2}$$

During learning, the network captures nonlinear features mainly through the hidden layer and the output layer. In general, the input $net_j$ of the j-th neuron in the hidden layer is:

$${net}_{j}=\sum _{i=1}^{M}{w}_{ij}{x}_{i}+{\theta }_{j} \tag{4}$$

Here $w_{ij}$ is the weight connecting the i-th input neuron to the j-th hidden neuron, and $\theta_j$ is the threshold of the j-th neuron. The corresponding output $a_j$ is:

$${a}_{j}=f\left({net}_{j}\right) \tag{5}$$

The input $net_k$ of the k-th neuron in the output layer is:

$${net}_{k}=\sum _{j=1}^{L}{w}_{jk}{a}_{j}+{\theta }_{k} \tag{6}$$
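The hidden-layer and output-layer computations described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the layer sizes, the random weights, the function name `forward`, and the choice of the sigmoid for the activation f are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Activation f; choosing the sigmoid here is an assumption."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, theta_hidden, w_out, theta_out, f=sigmoid):
    """Forward pass through one hidden layer and one output layer.

    x:            input vector, shape (M,)
    w_hidden:     weights w_ij, shape (M, H)
    theta_hidden: thresholds theta_j, shape (H,)
    w_out:        weights w_jk, shape (H, K)
    theta_out:    thresholds theta_k, shape (K,)
    """
    net_j = x @ w_hidden + theta_hidden   # weighted input of each hidden neuron
    a_j = f(net_j)                        # hidden-layer output
    net_k = a_j @ w_out + theta_out       # weighted input of each output neuron
    return net_j, a_j, net_k

rng = np.random.default_rng(0)
M, H, K = 4, 3, 2                         # hypothetical layer sizes
x = rng.normal(size=M)
net_j, a_j, net_k = forward(
    x,
    rng.normal(size=(M, H)), np.zeros(H),
    rng.normal(size=(H, K)), np.zeros(K),
)
print(net_j.shape, a_j.shape, net_k.shape)
```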

The output size of a convolutional layer's feature map can be computed as follows:

$$Output=\frac{{map}_{size}-{kernel}_{size}+2\times padding}{stride}+1 \tag{7}$$
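As a quick check of Formula (7), the sketch below evaluates it for a few common settings. The function name `conv_output_size` is hypothetical, and using floor division when the stride does not divide evenly is an assumption that follows common framework behaviour rather than anything stated in the text.

```python
def conv_output_size(map_size, kernel_size, padding=0, stride=1):
    """Output width (or height) of a convolution, following Formula (7).

    Floor division is an assumption for strides that do not divide evenly.
    """
    return (map_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(32, 5))             # 32x32 input, 5x5 kernel -> 28
print(conv_output_size(28, 3, padding=1))  # padding of 1 keeps the size at 28
print(conv_output_size(28, 2, stride=2))   # 2x2 window with stride 2 -> 14
```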

Max pooling keeps the maximum value in each 2 × 2 region of the feature map, and average pooling keeps the average of the pixels in each 2 × 2 region. As shown in Fig. 1, average pooling is obtained by keeping the mean of the pixels in each region of the feature map. Although pooling may discard some of the information in the feature map, the pooling operation introduces no additional learnable parameters. In general, the loss of features has a small impact.
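The two pooling operations can be illustrated with a small NumPy sketch. The function name `pool2x2` and the example feature map are made up for illustration, and non-overlapping 2 × 2 windows (stride 2) are assumed.

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """Non-overlapping 2x2 pooling on a 2-D feature map with even height and width."""
    h, w = feature_map.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Split the map into (h/2, w/2) blocks of size 2x2.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    raise ValueError(f"unknown mode: {mode}")

fmap = np.array([[1, 2, 5, 6],
                 [3, 4, 7, 8],
                 [9, 10, 13, 14],
                 [11, 12, 15, 16]], dtype=float)
max_pooled = pool2x2(fmap, "max")   # [[4, 8], [12, 16]]
avg_pooled = pool2x2(fmap, "avg")   # [[2.5, 6.5], [10.5, 14.5]]
print(max_pooled)
print(avg_pooled)
```

Neither function has any trainable parameter, which matches the point made above.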

A neural network is a composition of functions across multiple layers, and it needs activation functions to be nonlinear, because a composition of linear functions is still a linear function. The activation layer exists to activate neurons: it maps each neuron's input to its output, which makes the model's feature representation richer. Early neural networks mainly used Sigmoid as the activation function, whose expression is:

$$\sigma \left(x\right)=\frac{1}{1+{e}^{-x}} \tag{8}$$

Like Sigmoid, the gradient of Tanh approaches 0 when the input is very large or very small, which slows network updates. The expression of Tanh is:

$$\tanh\left(x\right)=\frac{1-{e}^{-2x}}{1+{e}^{-2x}} \tag{9}$$

The ReLU function is piecewise linear with a kink at 0: its gradient is either 0 or 1, so the function as a whole is nonlinear. The output of ReLU is:

$$\mathrm{ReLU}\left(x\right)=\max(0,x) \tag{10}$$

When x > 0, the ReLU output equals the input. When x ≤ 0, the ReLU output is 0. Consequently, the ReLU activation function switches off a neuron's computation when its input is less than or equal to 0, which reduces the interaction between neurons and lightens the model.
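The three activation functions and their gradients can be compared numerically with the sketch below. It is only an illustration: the helper names and the sample inputs are assumptions, and taking the ReLU gradient at exactly 0 to be 0 is a convention chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Written as in Formula (9); numerically equivalent to np.tanh for moderate x.
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

def relu(x):
    return np.maximum(0.0, x)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - tanh(x) ** 2

def relu_grad(x):
    # Gradient is 0 for x <= 0 and 1 for x > 0 (value at 0 chosen by convention).
    return (x > 0).astype(float)

xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
for name, grad in (("sigmoid", sigmoid_grad), ("tanh", tanh_grad), ("relu", relu_grad)):
    print(name, grad(xs))
```

At the extreme inputs the Sigmoid and Tanh gradients are close to 0, while the ReLU gradient stays at 1 for every positive input.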

Recurrent neural networks (RNNs) have been studied and applied extensively in text translation, natural language processing, computer vision, and speech recognition, which has greatly broadened their application scenarios. At time step t, the hidden state and the output of an RNN are given by Formulas (11) and (12):

$${a}^{t}={g}_{1}({W}_{aa}{a}^{t-1}+{W}_{ax}{x}^{t}+{b}_{a}) \tag{11}$$

$${y}^{t}={g}_{2}({W}_{ya}{a}^{t}+{b}_{y}) \tag{12}$$

Note that the weight matrices $W_{**}$ and the thresholds $b_{*}$ are shared across all time steps, and that a deep RNN is obtained by stacking multiple single-layer RNN structures.
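The recurrence in Formulas (11) and (12), with the same parameters reused at every time step, can be sketched as follows. The dimensions, the random parameters, the zero initial state, and the choices of tanh for g1 and the identity for g2 are assumptions made for illustration only.

```python
import numpy as np

def rnn_forward(xs, W_aa, W_ax, W_ya, b_a, b_y, g1=np.tanh, g2=lambda z: z):
    """Run a single-layer RNN over a sequence xs of shape (T, n_x).

    The same W_aa, W_ax, W_ya, b_a, b_y are used at every time step.
    """
    a = np.zeros(W_aa.shape[0])               # a^0, assumed to be zero
    states, outputs = [], []
    for x_t in xs:
        a = g1(W_aa @ a + W_ax @ x_t + b_a)   # Formula (11)
        y = g2(W_ya @ a + b_y)                # Formula (12)
        states.append(a)
        outputs.append(y)
    return np.stack(states), np.stack(outputs)

rng = np.random.default_rng(0)
T, n_x, n_a, n_y = 5, 3, 4, 2                 # hypothetical sizes
xs = rng.normal(size=(T, n_x))
W_aa = 0.1 * rng.normal(size=(n_a, n_a))
W_ax = rng.normal(size=(n_a, n_x))
W_ya = rng.normal(size=(n_y, n_a))
b_a, b_y = np.zeros(n_a), np.zeros(n_y)

states, outputs = rnn_forward(xs, W_aa, W_ax, W_ya, b_a, b_y)
print(states.shape, outputs.shape)
```

Stacking would feed `states` from one layer in as the input sequence `xs` of the next layer.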

Because of the vanishing gradient problem, an RNN can effectively memorize and process only information from nearby time steps. It cannot effectively memorize distant information, which makes RNN training extremely difficult. The LSTM network was proposed as an improvement on the RNN. Unlike the basic recurrent layer of an RNN, LSTM uses three different gate structures to control the state and the output of the recurrent layer at each time step.

Because a deep neural network updates its parameters through backpropagation, the gradient of the loss reaching the deeper layers has been multiplied through many derivatives and becomes very small, which seriously affects the training of the model. GANs, which can be designed with deeper structures, also suffer from the vanishing gradient problem. If the discriminator can accurately identify the real samples, gradient vanishing in the generator becomes a serious problem. The optimal discriminator is shown in Eq. (13):

$${D}_{G}^{*}=\frac{{p}_{data}\left(x\right)}{{p}_{data}\left(x\right)+{p}_{g}\left(x\right)} \tag{13}$$

The original GAN model uses the KL divergence and the JS divergence to quantify the difference between the two distributions, as shown in Formulas (14) and (15):

$$C\left(G\right)=-\mathrm{lb}\,4+KL\left({p}_{data}\,\Big\|\,\frac{{p}_{data}+{p}_{g}}{2}\right)+KL\left({p}_{g}\,\Big\|\,\frac{{p}_{data}+{p}_{g}}{2}\right) \tag{14}$$

$$C\left(G\right)=-\mathrm{lb}\,4+2\,JS\left({p}_{data}\,\|\,{p}_{g}\right) \tag{15}$$

where:

$$KL\left({p}_{1}\,\|\,{p}_{2}\right)={E}_{x\sim {p}_{1}}\left(\mathrm{lb}\frac{{p}_{1}}{{p}_{2}}\right) \tag{16}$$

$$JS\left({p}_{1}\,\|\,{p}_{2}\right)=\frac{1}{2}KL\left({p}_{1}\,\Big\|\,\frac{{p}_{1}+{p}_{2}}{2}\right)+\frac{1}{2}KL\left({p}_{2}\,\Big\|\,\frac{{p}_{1}+{p}_{2}}{2}\right) \tag{17}$$
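For discrete distributions, Formulas (13) to (17) can be checked numerically with the sketch below. It reads lb as the base-2 logarithm; the function names and the two example distributions are assumptions chosen for illustration.

```python
import numpy as np

def kl(p1, p2):
    """KL(p1 || p2) in bits for discrete distributions, as in Formula (16)."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    mask = p1 > 0                    # terms with p1 = 0 contribute nothing
    return float(np.sum(p1[mask] * np.log2(p1[mask] / p2[mask])))

def js(p1, p2):
    """JS(p1 || p2), as in Formula (17)."""
    m = 0.5 * (np.asarray(p1, dtype=float) + np.asarray(p2, dtype=float))
    return 0.5 * kl(p1, m) + 0.5 * kl(p2, m)

def generator_cost(p_data, p_g):
    """C(G) = -lb 4 + 2 JS(p_data || p_g), as in Formula (15)."""
    return -np.log2(4.0) + 2.0 * js(p_data, p_g)

p_data = np.array([0.5, 0.3, 0.2])   # made-up data distribution
p_g = np.array([0.2, 0.3, 0.5])      # made-up generator distribution

d_star = p_data / (p_data + p_g)     # optimal discriminator, Formula (13)
print(d_star)
print(generator_cost(p_data, p_g))
print(generator_cost(p_data, p_data))  # minimum value, reached when p_g = p_data
```

When the two distributions have disjoint support, the JS divergence saturates at 1 bit and the cost stays constant, which is one way to see why the generator gradient vanishes when the discriminator separates real and generated samples perfectly.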

## 3.2 Music extraction technology

Four pianos are used to identify the characteristics of the selected music. The recordings from the four pianos cover different playing forms and playing intensities, and are captured through different audio circuits. Music information is collected as time-domain waveforms and converted into frequency-domain form through the Fourier transform.
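The role of the Fourier transform can be illustrated on a synthetic tone. The sketch below is not the authors' pipeline: the signal is a made-up stand-in for a piano note (a 440 Hz fundamental with two weaker harmonics), the 0.5 s duration is arbitrary, and the 44.1 kHz sampling rate is taken from the audio data described in this section.

```python
import numpy as np

sample_rate = 44_100                 # Hz
duration = 0.5                       # seconds (arbitrary choice)
t = np.arange(int(sample_rate * duration)) / sample_rate

# Synthetic stand-in for a piano note: fundamental plus two weaker harmonics.
f0 = 440.0
signal = (1.00 * np.sin(2 * np.pi * f0 * t)
          + 0.50 * np.sin(2 * np.pi * 2 * f0 * t)
          + 0.25 * np.sin(2 * np.pi * 3 * f0 * t))

# Magnitude spectrum of the real-valued signal and the matching frequency axis.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)

peak_freq = freqs[np.argmax(spectrum)]
print(peak_freq)                     # frequency of the strongest component
```

The strongest spectral peak sits at the fundamental frequency, and the harmonics appear as smaller peaks at integer multiples of it.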

The construction process consists of revising the formula and extracting the waveform. The waveform revision depends on the pitch, and the pitch depends on the fundamental frequency. Therefore, the extracted formula must be revised again, as shown in Formula (18).

$$S\left(w\right)=\begin{cases}A\dfrac{{a}_{1}}{{a}_{1}^{2}+{(w-{w}_{l})}^{2}}, & {w}_{l}-{\alpha }_{1}\le w\le {w}_{l}+{\alpha }_{1}\\ A\dfrac{2}{|w-{w}_{l}|}, & \text{otherwise}\end{cases} \tag{18}$$

where $S(w)$ is the corrected vibration coordinate value, $a_1$ is the parameter of the waveform curve near the adjusted waveform peak, $A$ is the amplitude, and $w_l$ is the specific audio frequency or its frequency multiple. Different sounds have different characteristics in the time domain and the frequency domain. In addition, real audio data sampled at 44.1 kHz contains a large amount of data, so feeding it directly into a convolutional neural network demands a large amount of computing power.
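The piecewise definition in Formula (18) can be evaluated with the short sketch below. The function name `corrected_spectrum` and all numeric values are hypothetical. Because the formula uses both $a_1$ and $\alpha_1$, the sketch keeps them as two separate parameters, which is an assumption about how the formula should be read.

```python
import numpy as np

def corrected_spectrum(w, w_l, A, a1, alpha1):
    """Evaluate S(w) from Formula (18) at the frequencies in w.

    a1 and alpha1 are kept as separate parameters (an assumption).
    """
    w = np.atleast_1d(np.asarray(w, dtype=float))
    dist = np.abs(w - w_l)
    near = dist <= alpha1                       # w_l - alpha1 <= w <= w_l + alpha1
    out = np.empty_like(w)
    out[near] = A * a1 / (a1 ** 2 + (w[near] - w_l) ** 2)
    out[~near] = A * 2.0 / dist[~near]          # all other frequencies
    return out

# Hypothetical values: peak at 440 Hz, unit amplitude, a1 = alpha1 = 2.
values = corrected_spectrum([440.0, 442.0, 444.0], w_l=440.0, A=1.0, a1=2.0, alpha1=2.0)
print(values)
```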