## Convolutional Neural Network

With the continuous evolution of human needs, artificial intelligence has become an important carrier of scientific and social development [6], and the machine learning technology underpinning it has become a social hotspot. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how computers can simulate or realize human learning behavior to acquire new knowledge or skills, reorganize existing knowledge structures, and continuously improve their performance. As a branch of machine learning, deep learning stands out in current scientific and technological research and has achieved extraordinary results [7]. Deep learning is a machine learning technique that simulates the human brain. At present, its development not only provides important impetus for the advancement of machine learning but also contributes to the development of many other fields [8]. The source and structure of deep learning techniques are shown in Fig. 1.

As Fig. 1 shows, the main idea of deep learning is to learn the activity mechanism of the human brain, convert it into an algorithmic model that can be computed and run on a computer, and use this model to solve problems. Such neural networks are typically highly capable learners and can solve a wide variety of problems, so deep learning techniques are widely used across fields [9]. At present, deep learning is mainly applied to face recognition, image recognition, image classification, image segmentation, speech recognition, and scene recognition, where it has made great progress. Deep learning includes many branch technologies; among them, the achievements of CNN technology are particularly remarkable [10]. The basic computation of a neuron in deep learning is shown in Eq. (1):

$$y=f\left(\sum _{i=0}^{N}{w}_{i}{x}_{i}+b\right) \tag{1}$$

In Eq. (1), *w* represents the weight, *b* represents the bias, and *f* represents the activation function. The calculation of each level of the neural network is shown in Eq. (2):
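As a minimal sketch of Eq. (1), the single-neuron computation can be written directly; the sigmoid chosen as the activation *f* and the numeric inputs below are illustrative assumptions:

```python
import numpy as np

# Eq. (1): y = f(sum_i w_i * x_i + b), with a sigmoid as the activation f.
def neuron(x, w, b):
    z = np.dot(w, x) + b              # weighted sum plus bias
    return 1.0 / (1.0 + np.exp(-z))   # activation f

x = np.array([1.0, 2.0, 3.0])   # inputs
w = np.array([0.5, -0.25, 0.1]) # weights
b = 0.2                         # bias
y = neuron(x, w, b)
print(round(float(y), 4))  # sigmoid(0.5) ≈ 0.6225
```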

$${x}^{\left(k\right)}={f}^{\left(k\right)}\left({w}^{\left(k\right)}{x}^{(k-1)}+{b}^{\left(k\right)}\right) \tag{2}$$

In Eq. (2), *k* represents the level. The calculation of the learning state of the neural network is shown in Eq. (3):

$${x}^{\prime }=x-\epsilon {\nabla }_{x}f\left(x\right) \tag{3}$$
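Eq. (3) describes a single gradient-descent step on a function *f*. A small numeric illustration, assuming the toy objective f(x) = x² (so its gradient is 2x), shows the iterate moving toward the minimizer:

```python
# One gradient-descent step per Eq. (3): x' = x - eps * grad_f(x).
def grad_step(x, eps, grad):
    return x - eps * grad(x)

x = 4.0
for _ in range(50):
    x = grad_step(x, 0.1, lambda v: 2.0 * v)  # grad of f(x) = x^2
print(round(x, 6))  # approaches the minimizer x = 0
```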

The information transfer is shown in Eqs. (4) and (5):

$${z}^{\left(k\right)}={w}^{\left(k\right)}{x}^{(k-1)}+{b}^{\left(k\right)} \tag{4}$$

$${x}^{\left(k\right)}={f}^{\left(k\right)}\left({z}^{\left(k\right)}\right) \tag{5}$$

The derivatives with respect to the weights and biases are shown in Eqs. (6) and (7):

$$\frac{\partial \mathcal{L}(\hat{y},y)}{\partial {w}^{\left(k\right)}}=\frac{\partial \mathcal{L}(\hat{y},y)}{\partial {z}^{\left(k\right)}}\cdot \frac{\partial {z}^{\left(k\right)}}{\partial {w}^{\left(k\right)}} \tag{6}$$

$$\frac{\partial \mathcal{L}(\hat{y},y)}{\partial {b}^{\left(k\right)}}=\frac{\partial \mathcal{L}(\hat{y},y)}{\partial {z}^{\left(k\right)}}\cdot \frac{\partial {z}^{\left(k\right)}}{\partial {b}^{\left(k\right)}} \tag{7}$$

Applying the chain rule across layers yields the error recursion in Eq. (8):

$${\delta }^{\left(k\right)}={\delta }^{(k+1)}\cdot \frac{\partial {z}^{(k+1)}}{\partial {x}^{\left(k\right)}}\cdot \frac{\partial {x}^{\left(k\right)}}{\partial {z}^{\left(k\right)}} \tag{8}$$

Together with the forward conduction of Eqs. (4) and (5), Eq. (8) yields both the network output and the backward-propagated error.
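The forward conduction of Eqs. (4)-(5) and the backward error recursion of Eq. (8) can be sketched for a small two-layer network. The layer sizes, sigmoid activation, and squared-error loss below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Forward pass (Eqs. 4-5): z^(k) = W^(k) x^(k-1) + b^(k), x^(k) = f(z^(k)).
# Backward pass (Eq. 8): delta^(k) = (W^(k+1)^T delta^(k+1)) * f'(z^(k)).
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

x0 = np.array([0.5, -1.0, 2.0])   # input
y = np.array([1.0, 0.0])          # target

# Forward conduction
z1 = W1 @ x0 + b1; x1 = sigmoid(z1)
z2 = W2 @ x1 + b2; x2 = sigmoid(z2)

# Backward propagation of the error (sigmoid' = x * (1 - x))
delta2 = (x2 - y) * x2 * (1 - x2)          # output-layer error (squared loss)
delta1 = (W2.T @ delta2) * x1 * (1 - x1)   # Eq. (8) recursion

# Gradients w.r.t. weights and biases (Eqs. 6-7)
dW2 = np.outer(delta2, x1); db2 = delta2
dW1 = np.outer(delta1, x0); db1 = delta1
print(dW1.shape, dW2.shape)
```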

CNN technology originated around the 1960s, when the design of this type of neural network was inspired by studies of the cat's primary visual cortex [11]. CNNs are most commonly used in research for image recognition and classification. A CNN mainly comprises input, convolutional, pooling, fully connected, and output layers. This multi-level structure makes CNN technology very versatile [12]. The main structure of a CNN is shown in Fig. 2.

In Fig. 2, CNN has a multi-level structure, and each level includes many neural network layers. The two most important processes in CNN are convolution and pooling, as shown in Fig. 3.

Figure 3 shows the two key working layers of a CNN: the convolutional layer and the pooling layer. The convolutional layer extracts local features by sliding convolution kernels over the input; because the kernels share weights, the number of parameters is reduced and the computational efficiency of the model improves [13]. The pooling layer downsamples the input by locally screening its elements, summarizing each local region with a single value and eliminating some of the data, so that the amount of computation is reduced as much as possible. Common pooling operations include max pooling and average pooling [14]. Stochastic pooling, a randomized variant, instead selects one element per region according to a probability derived from its activation; its computation is shown in Eq. (9):

$${p}_{i}=\frac{{a}_{i}}{\sum _{k\in {R}_{j}}{a}_{k}} \tag{9}$$

In Eq. (9), \({p}_{i}\) represents the probability that element \(i\) is selected by pooling; \({R}_{j}\) represents the set of input elements in pooling region \(j\); \(a\) represents the input elements; \(i\) and \(k\) index the input elements. The pooled output is then obtained by a multinomial sampling operation over these probabilities, as shown in Eq. (10):

$${s}_{j}={a}_{l},\quad l\sim P\left({p}_{1},\dots ,{p}_{\left|{R}_{j}\right|}\right) \tag{10}$$

In Eq. (10), \({s}_{j}\) is the pooled output of region \({R}_{j}\) and \(l\) is the sampled index.
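The probabilities of Eq. (9) followed by multinomial sampling can be sketched for a single pooling region; the region contents, random seed, and sample count below are illustrative assumptions:

```python
import numpy as np

# Stochastic pooling sketch: within each pooling region R_j, element
# probabilities are p_i = a_i / sum(a_k), and the pooled value is drawn
# from a multinomial distribution over those probabilities.
rng = np.random.default_rng(42)

def stochastic_pool(region):
    a = np.asarray(region, dtype=float)
    p = a / a.sum()                 # Eq. (9): normalized activations
    idx = rng.choice(len(a), p=p)   # multinomial sampling of one index
    return a[idx]

region = [1.0, 2.0, 5.0]  # non-negative activations in one pooling region
samples = np.array([stochastic_pool(region) for _ in range(10000)])
# Larger activations are chosen more often: 5.0 should appear ~5/8 of the time
print(round(float(np.mean(samples == 5.0)), 2))
```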

The Softmax classifier is used to optimize the CNN model. The principle is to optimize the model by minimizing the negative log-likelihood; in the two-class case, the hypothesis is the logistic function shown in Eq. (11):

$${f}_{\theta }\left(x\right)=\frac{1}{1+\exp (-{\theta }^{T}x)} \tag{11}$$

In Eq. (11), \(x\) represents the sample; the superscript \(T\) denotes the transpose; \(\theta\) is the model's parameter vector. During training, \(\theta\) is optimized to reduce the model's cost. The calculation of the cost is shown in Eq. (12):

$$C\left(\theta \right)=-\frac{1}{m}\sum _{i=1}^{m}\left[{y}^{i}\log {h}_{\theta }\left({x}^{i}\right)+(1-{y}^{i})\log (1-{h}_{\theta }\left({x}^{i}\right))\right] \tag{12}$$

In Eq. (12), *x* and *y* represent a sample and its label; *m* is the number of samples; *h*<sub>θ</sub> is the model's hypothesis (predicted output). For the multi-class case, the vector of class probabilities is computed as shown in Eq. (13):

$${f}_{\theta }\left({x}_{i}\right)=\left[\begin{array}{c}p({y}_{i}=1\mid {x}_{i};\theta )\\ p({y}_{i}=2\mid {x}_{i};\theta )\\ \vdots \\ p({y}_{i}=n\mid {x}_{i};\theta )\end{array}\right] \tag{13}$$

Based on this, the cost is generalized to the multi-class case, as shown in Eq. (14):

$$c\left(\theta \right)=-\frac{1}{m}\left[\sum _{i=1}^{m}\sum _{j=1}^{k}1\{{y}_{i}=j\}\log \frac{{e}^{{\theta }_{j}^{T}{x}_{i}}}{\sum _{v=1}^{k}{e}^{{\theta }_{v}^{T}{x}_{i}}}\right] \tag{14}$$

Gradient descent is used to minimize \(c\left(\theta \right)\) in Eq. (14) and improve the classifier. The resulting update rule is shown in Eq. (15):

$${\theta }_{k}={\theta }_{k}-a{\nabla }_{{\theta }_{k}}c\left(\theta \right) \tag{15}$$

In Eq. (15), \(a\) is the learning rate, and the remaining parameters have the same meaning as in Eq. (14) [15]. After the classifier is used to optimize the CNN, the performance of the Softmax Convolutional Neural Network (SCNN) is greatly improved. This design uses the optimized deep CNN model, SCNN, for wireless communication: the efficiency of wireless communication is improved by optimizing its task-processing mechanism.
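The class-probability vector, the multi-class negative log-likelihood cost, and the gradient-descent update described above can be sketched together. The toy two-class data, learning rate, and iteration count below are illustrative assumptions:

```python
import numpy as np

# Softmax classifier sketch: class probabilities, negative log-likelihood
# cost, and a full-batch gradient-descent update on theta.
def softmax(theta, x):
    logits = theta @ x                    # theta: (k, d), x: (d,)
    e = np.exp(logits - logits.max())     # numerically stabilized exponentials
    return e / e.sum()                    # vector of class probabilities

def cost(theta, X, ys):
    # Mean negative log-probability of the true class
    return -np.mean([np.log(softmax(theta, x)[y]) for x, y in zip(X, ys)])

def grad_step(theta, X, ys, lr=0.5):
    # theta <- theta - lr * grad(cost); grad per sample is (p - onehot(y)) x^T
    grad = np.zeros_like(theta)
    for x, y in zip(X, ys):
        p = softmax(theta, x)
        p[y] -= 1.0
        grad += np.outer(p, x)
    return theta - lr * grad / len(X)

X = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # toy samples
ys = [0, 1]                                       # their labels
theta = np.zeros((2, 2))
before = cost(theta, X, ys)
for _ in range(100):
    theta = grad_step(theta, X, ys)
after = cost(theta, X, ys)
print(after < before)  # the cost decreases under gradient descent
```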

## Wireless communication

At present, wireless communication technology has become one of the mainstream technologies in society. With the rapid development of information technology, wireless transmission traffic has greatly increased. Mobile communication has changed the way people communicate, exchange information, and entertain themselves [16], and mobile wireless communication has revolutionized daily life. Official estimates of the International Telecommunication Union indicate that mobile traffic will grow at an unprecedented rate of more than 50% per year. The capacity and resources of wireless networks will therefore become the bottleneck that limits wireless services for massive numbers of users [17, 18]. The main principle of wireless communication is shown in Fig. 4.

As Fig. 4 shows, the development of wireless communication technology provides a strong foundation for the communication industry. Long Term Evolution (LTE) in unlicensed bands can help utilize wireless spectrum resources more efficiently, providing better service to users, so it has attracted great attention in wireless communication research. Several technologies have been introduced to let LTE coexist harmoniously with mature technologies in the unlicensed spectrum, such as Wi-Fi. However, the wireless environment is inherently uncertain and forms an extremely complex, heterogeneous network that changes dynamically and continuously: users join and leave frequently, new networks can be deployed, and running networks can be taken down at any time. In addition, the amount of data each wireless node must transmit and the load on the network vary [19]. The working principle of LTE is shown in Fig. 5.

Figure 5 shows the working principle of LTE. Techniques for fair coexistence among different wireless technologies in unlicensed spectrum must account for potential changes in the wireless environment. Wireless network communication has developed greatly and has become the main carrier of online teaching, so optimizing it can effectively improve the effectiveness of online teaching and promote the comprehensive development of the online teaching system. Activation functions are then used to optimize the model further. The activation function layer, often called the nonlinear mapping layer, improves the CNN's ability to extract nonlinear features. Even if a deep neural network contains many linear operation layers, without nonlinear activations it can express only linear mappings and cannot represent complex functions [20]. Various activation functions can be chosen in practice; this work uses the Sigmoid activation function, as shown in Eq. (16):

$$\operatorname{Sigmoid}\left(x\right)=\frac{1}{1+\exp (-x)} \tag{16}$$

The activation function takes its inspiration from biological neurons and models them mathematically: it accepts input data and outputs a corresponding result. In neuroscience, a biological neuron typically has a fixed threshold. When the neuron receives an input signal whose effect exceeds this threshold, the neuron fires and is said to be in the active state; otherwise, it remains in the inhibitory state. The role of the activation function is therefore very important [21].
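The threshold-like behavior described above can be checked numerically for the Sigmoid of Eq. (16): it squashes any input into (0, 1), with values near 1 corresponding to the active state and values near 0 to the inhibitory state:

```python
import numpy as np

# Sigmoid activation of Eq. (16): a soft threshold around zero.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))             # 0.5: the threshold point
print(round(sigmoid(5.0), 3))   # 0.993, near 1: activated
print(round(sigmoid(-5.0), 3))  # 0.007, near 0: inhibited
```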

With the rapid development of deep learning, DNN algorithms have gradually been applied in wireless communication and the Internet of Things. To adapt to changes in the wireless environment, the advantages of reinforcement learning have been fully demonstrated: LTE is enhanced with deep reinforcement learning techniques to provide reasonable coexistence between co-located LTE and Wi-Fi networks. The optimized wireless communication model is the Softmax Convolutional Neural Network-Long Term Evolution (SCNN-LTE): CNN technology is used to optimize wireless communication, which is then applied in the music teaching interactive system to improve the effectiveness of current online music teaching.

## Research design

1) Interactive system for music teaching

The basis of music teaching interaction is the construction of a system whose core concept is interactive teaching. Interactive teaching is a method based on the idea of scaffolding and built on teacher-student dialogue. In a network environment, interactive learning is "human-machine-human" interaction based on computer multimedia technology and network communication technology: a teaching system built on computer technology that uses interactive teaching and learning methods. Interaction is originally a computing term referring to the process in which a system receives input from a terminal, processes it, and returns the result to the terminal, that is, a human-machine dialogue. In communication, interaction is the exchange of information between sender and receiver. Interaction therefore exists in all forms of teaching activity and is one of the most basic characteristics of teaching; only its ways and features differ across teaching forms. Optimizing the music teaching interactive system by strengthening wireless communication technology is thus breakthrough research.

2) Setup of research data

The design idea is to optimize wireless communication technology with a CNN to improve its performance and then apply it to the online music teaching interactive system to improve the efficiency of interactive teaching between teachers and students. This study therefore focuses on optimizing wireless communication technology with the CNN and applying the result to the online music teaching interactive system.

Firstly, the performance of the CNN model is evaluated, using the following datasets for training and evaluation:

1) Modified National Institute of Standards and Technology (MNIST): a large-scale handwritten-digit database collected by the National Institute of Standards and Technology in the United States, containing a training set of 60,000 examples and a test set of 10,000. Each sample is a 28 × 28-pixel grayscale picture of a handwritten digit [22].

2) Canadian Institute for Advanced Research-10 (CIFAR-10): 60,000 32 × 32 color images in 10 classes, with 6,000 images per class, split into 50,000 training images and 10,000 test images. The dataset is divided into five training batches and one test batch of 10,000 images each. The test batch contains 1,000 randomly selected images from each class; the training batches contain the remaining images in random order, so an individual training batch may contain more images from one class than another, but the five training batches together contain exactly 5,000 images from each class [23].

3) Canadian Institute for Advanced Research-100 (CIFAR-100): 100 classes, each with 600 color images of size 32 × 32, of which 500 are used for training and 100 for testing. Each image carries fine_labels and coarse_labels, representing its fine-grained and coarse-grained labels, respectively; the CIFAR-100 dataset is hierarchical [24].

During the test evaluation, the model is assessed by three indicators: recall, precision, and accuracy [25]. Their calculations are shown in Eqs. (17), (18), and (19), respectively:

$$recall=\frac{tp}{tp+fn} \tag{17}$$

$$precision=\frac{tp}{tp+fp} \tag{18}$$

$$accuracy=\frac{tp+tn}{tp+fp+tn+fn} \tag{19}$$

Here, *tp*, *fp*, *tn*, and *fn* denote true positives, false positives, true negatives, and false negatives, respectively, so that: 1) *tp* + *fn* is the actual number of positive (correct) samples. 2) *tp* + *fp* is the total number of samples predicted positive. 3) *fp* + *tn* is the actual number of negative (incorrect) samples. 4) *tn* + *fn* is the total number of samples predicted negative [26].
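Eqs. (17)-(19) can be computed directly from a confusion-matrix tally; the counts below are invented for illustration:

```python
# Hypothetical confusion-matrix counts for one evaluation run.
tp, fp, tn, fn = 40, 10, 45, 5

recall = tp / (tp + fn)                      # Eq. (17)
precision = tp / (tp + fp)                   # Eq. (18)
accuracy = (tp + tn) / (tp + fp + tn + fn)   # Eq. (19)
print(recall, precision, accuracy)
```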