The launch of the 5G network will bring about a substantial transformation in music education. Compared to existing network technologies, 5G will provide considerable gains in bandwidth, service dependability, and device density. We focus on arts education, a field in which high capacity is vital for sharing high-quality multimedia streams and in which two-way communication latency must be kept low, in the range of milliseconds. A simplified illustration of the suggested technique is shown in Figure 1.

**A. Description of the dataset**:

We selected the GTZAN dataset because it has been used in many earlier studies, which enables us to assess the model more reliably. The GTZAN dataset covers ten music genres and contains 100 distinct samples for each genre [21].

**B. Pre-Processing using Normalization**:

Since the incoming data is unprocessed, it may contain duplicated sequences and incomplete records. Thorough cleaning and high-level processing are performed to remove recurring and duplicate occurrences, as well as gaps in the data. Since the datasets for education systems are extensive, sample reduction methods must be used to ensure that the data remain representative. Because of the enormous range of features in this database, feature extraction approaches are required to exclude characteristics that are not important. The data are then normalized during the pre-processing stage. Equation (1) describes how the z-score is generated in the first step of the normalization procedure.

$$U=\left[\frac{Tu-\omega }{\alpha }\right]$$

1

Here,

\(\omega\) - data mean

\(\alpha\) - standard deviation of the data

U may be written as,

$$U=\frac{Tu-\stackrel{-}{Tu}}{UUK}$$

2

Here,

\(\stackrel{-}{Tu}\) - sample mean, \(UUK\) - sample standard deviation.

The random sample is modeled as follows:

$${U}_{a}={\beta }_{0}+{\beta }_{1}{Tu}_{a}+{\epsilon o}_{a}$$

3

Where,

\({\epsilon o}_{a}\) denotes the error term, which depends on the variance \({\alpha }^{2}\).

Next, the errors must be independent of each other, as indicated in the following expression:

$${t}_{a}\sim \sqrt{v}\frac{t}{\sqrt{{t}^{2}+v-1}}$$

4

\(t\) - random variable

In the next step, the standard deviation is used to normalize the variation in the variables.

The scaling moment can be estimated using the following equations:

$$FUK=\frac{{\lambda }^{fuc}}{{\varphi }^{fuc}}$$

5

Where,

\(fuc\) denotes the order of the scaling moment.

$${\lambda }^{fuc}=Exp{\left(Ti-\omega \right)}^{fuc}$$

6

Where,

\(Exp\) represents the expected value

\(Ti\) represents the random variable

\(\omega\) represents the data mean

$${\varphi }^{fuc}={\left(\sqrt{Exp{\left(Ti-\omega \right)}^{2}}\right)}^{fuc}$$

7

$${t}_{c}=\frac{fuc}{\stackrel{-}{Ta}}$$

8

Where,

\({t}_{c}\) represents the coefficient of variation

The feature scaling process concludes by rescaling the values of all parameters into the range [0, 1]. This process is known as unity-based normalization. The normalized value can be written as:

$${Ti}^{\prime }=\frac{Ti-{t}_{min}}{{t}_{max}-{t}_{min}}$$

9

After normalization, the relative ordering and irregularity of the input are preserved. This step is performed to reduce delay. The normalized data are then used as input to the following phases of the procedure.
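The two normalizations used in this stage can be sketched in a few lines of NumPy. This is a minimal illustration rather than the pipeline used in this work: the function names and the sample `features` column are hypothetical, and the z-score follows Eq. (2) with the sample standard deviation.

```python
import numpy as np

def z_score(x):
    """Standardize x using the sample mean and sample standard deviation (Eq. 2)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def min_max(x):
    """Rescale x linearly into the range [0, 1] (Eq. 9)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:                      # constant input: avoid division by zero
        return np.zeros_like(x)
    return (x - x.min()) / span

features = np.array([2.0, 4.0, 6.0, 8.0])   # hypothetical feature column
standardized = z_score(features)
scaled = min_max(features)
```

The guard for a constant column is an added assumption; Eq. (9) itself is undefined when the maximum and minimum coincide.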

**C. Spectrum Based Feature Extraction**:

**a. Spectral Centroid**

The spectral centroid (also known as the frequency spectrum centroid) is a statistic used in digital signal processing to characterize a spectrum. It indicates the location of the "center of mass" of the frequency range and is closely linked to the perceived brightness of a sound. The lower its value, the more of the signal's power is concentrated in the low-frequency region. Because the spectral centroid represents the brightness of a sound well, it is widely used in digital audio and music signal analysis as a measure of musical timbre. It is defined mathematically as:

$${D}_{s}=\frac{{\sum }_{a=1}^{A}{B}_{s}\left[a\right]*a}{{\sum }_{a=1}^{A}{B}_{s}\left[a\right]}$$

10

Here,

\({B}_{s}\left[a\right]\) denotes the magnitude of the Fourier transform of the s-th frame at frequency bin \(a\), and \(A\) is the number of frequency bins.
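As a rough illustration of Eq. (10), the centroid of a single frame can be computed from its FFT magnitudes. The sketch below assumes bins are numbered from 1, as in the summation limits, and uses two synthetic sine frames (`low`, `high`) that are purely hypothetical test signals.

```python
import numpy as np

def spectral_centroid(frame):
    """Magnitude-weighted mean bin index of one frame (bins numbered from 1)."""
    mag = np.abs(np.fft.rfft(frame))
    bins = np.arange(1, mag.size + 1)
    total = mag.sum()
    if total == 0:                     # silent frame: centroid is undefined, return 0
        return 0.0
    return float((mag * bins).sum() / total)

sr = 8000                              # assumed sampling rate in Hz
t = np.arange(1024) / sr
low = np.sin(2 * np.pi * 250 * t)      # energy concentrated at a low frequency
high = np.sin(2 * np.pi * 2000 * t)    # energy concentrated at a higher frequency
```

A frame whose energy sits at lower frequencies yields a smaller centroid, which matches the interpretation given above.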

**b. Spectral Flux**

Spectral flux measures the rate at which the signal spectrum changes. It is determined by comparing the spectrum of the current frame with the spectrum of the previous frame, and it is usually computed as the 2-norm between the two normalized spectra. Because the spectra are normalized, the flux computed in this way does not depend on the overall signal power, and because only magnitudes are compared, it does not depend on phase. Spectral flux is commonly used to characterize the timbre of an audio source or to detect onsets. It is defined mathematically as follows, where \({A}_{s}\left[a\right]\) denotes the normalized magnitude of the Fourier transform of the s-th frame at frequency bin \(a\):

$${U}_{s}=\sum _{a=1}^{A}{\left({A}_{s}\left[a\right]-{A}_{s-1}\left[a\right]\right)}^{2}$$

11
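A minimal sketch of the flux computation is given below. It assumes that each magnitude spectrum is normalized to unit Euclidean length and that the differences are squared before summation, in line with the 2-norm description in the text; the function name is hypothetical.

```python
import numpy as np

def spectral_flux(prev_frame, cur_frame):
    """Sum of squared differences between two normalized magnitude spectra."""
    def normalized_magnitude(frame):
        mag = np.abs(np.fft.rfft(frame))
        norm = np.linalg.norm(mag)
        return mag / norm if norm > 0 else mag

    diff = normalized_magnitude(cur_frame) - normalized_magnitude(prev_frame)
    return float(np.sum(diff ** 2))

sr = 8000                              # assumed sampling rate in Hz
t = np.arange(1024) / sr
tone_a = np.sin(2 * np.pi * 250 * t)
tone_b = np.sin(2 * np.pi * 2000 * t)
```

Because of the normalization, scaling a frame by a constant leaves the flux unchanged, whereas a change in spectral shape (for example from `tone_a` to `tone_b`) produces a large value.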

**c. Spectral Contrast**

Spectral contrast is a feature used to categorize different types of music. It is defined as the difference in decibels between the peaks and valleys of a spectrum, which reflects the relative spectral characteristics of different types of music and sounds.
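One simple way to approximate this feature is sketched below. The text does not specify the band layout or how peaks and valleys are estimated, so the equal-width bands (`n_bands`) and the fraction of bins averaged at each extreme (`alpha`) are assumptions made only for illustration.

```python
import numpy as np

def spectral_contrast(frame, n_bands=6, alpha=0.2):
    """Per-band difference in dB between the strongest and weakest spectral bins."""
    mag = np.abs(np.fft.rfft(frame)) + 1e-10       # small offset avoids log(0)
    db = 20.0 * np.log10(mag)
    contrasts = []
    for band in np.array_split(db, n_bands):       # assumed: equal-width sub-bands
        ordered = np.sort(band)
        k = max(1, int(round(alpha * ordered.size)))
        valley = ordered[:k].mean()                # average of the weakest bins
        peak = ordered[-k:].mean()                 # average of the strongest bins
        contrasts.append(peak - valley)
    return np.array(contrasts)

rng = np.random.default_rng(0)
noisy_frame = rng.standard_normal(1024)            # hypothetical test frame
contrast = spectral_contrast(noisy_frame)
```

Each entry of `contrast` is non-negative by construction, since the peak average is taken over bins that are at least as strong as those in the valley average.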

**d. Mel-Frequency Cepstral Coefficients (MFCC)**

Because the cochlea has filtering properties, it maps different frequencies to different positions on the basilar membrane. As a result, the cochlea is often regarded as a filter bank. Based on this characteristic, psychologists obtained, through psychological experiments, a set of filter banks that mimic the cochlear effect, which they named the Mel frequency filter bank. Because the pitch perceived by the human ear is not linearly proportional to the frequency of the sound, researchers introduced the notion of Mel frequency to account for this. The Mel frequency scale matches the auditory characteristics of the human ear better than the Hertz frequency scale. The relationship between the Mel frequency and the frequency u is:

$${u}_{mel}=2595\lg \left(1+\frac{u}{700}\right)$$

12

Here, \({u}_{mel}\) denotes the Mel frequency and \(u\) denotes the frequency in Hz.
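The conversion in Eq. (12) and its inverse can be written directly, assuming the common form with a base-10 logarithm and the constants 2595 and 700. The function names are hypothetical.

```python
import numpy as np

def hz_to_mel(u):
    """Convert a frequency in Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + np.asarray(u, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse conversion, from Mel back to Hz."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

mels = hz_to_mel([0.0, 700.0, 1000.0, 8000.0])
```

With these constants, 1000 Hz maps to approximately 1000 Mel, and equal steps on the Mel scale correspond to progressively wider steps in Hz.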

First, the audio signal is divided into frames and pre-emphasized before being windowed. A short-time Fourier transform is then applied to obtain the frequency spectrum of the audio signal. Next, a Mel filter bank with L channels is laid out on the Mel frequency scale. The value of N is determined by the highest frequency of the signal and is usually between 12 and 16. The Mel filters are equally spaced on the Mel frequency scale. The three frequencies of neighboring triangular filters are related as follows:

$$d\left(l\right)=h\left(l-1\right)=o\left(l+1\right)$$

13

Assume,

\(d\left(l\right)\) denotes Centre frequencies

\(h\left(l\right)\) denotes upper frequencies limit

\(o\left(l\right)\) denotes lower frequencies limit.

The output of each Mel filter is:

$$y\left(l\right)=\sum _{k=o\left(l\right)}^{h\left(l\right)}{V}_{l}\left(k\right)\left|{X}_{b}\left(k\right)\right|,\quad l=1,2,\dots ,N$$

14

The filter's frequency characteristics are as follows:

$${V}_{l}\left(k\right)=\left\{\begin{array}{ll}\frac{k-o\left(l\right)}{d\left(l\right)-o\left(l\right)},& o\left(l\right)\le k\le d\left(l\right)\\ \frac{h\left(l\right)-k}{h\left(l\right)-d\left(l\right)},& d\left(l\right)\le k\le h\left(l\right)\end{array}\right.$$

15

The MFCCs are obtained by taking the logarithm of the filter outputs and applying the discrete cosine transform:

$$MFCC\left(a\right)=\sum _{l=1}^{L}\lg y\left(l\right)\cos \left[\pi \left(l-0.5\right)\frac{a}{L}\right]$$

16

Here, \(a=1,2,\dots ,N\)
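The steps above (pre-emphasis, windowing, Fourier transform, triangular Mel filtering, logarithm, and cosine transform) can be sketched for a single frame as follows. This is an illustrative outline under stated assumptions, not the exact configuration used in this work: the pre-emphasis coefficient 0.97, the Hamming window, the 26 filters, and the 13 coefficients are common defaults chosen here for the example, and all function names are hypothetical.

```python
import numpy as np

def hz_to_mel(u):
    return 2595.0 * np.log10(1.0 + u / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sr):
    """Triangular filters equally spaced on the Mel scale (Eqs. 13 and 15)."""
    mel_points = np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2)
    edges = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for l in range(1, n_filters + 1):
        lo, mid, hi = edges[l - 1], edges[l], edges[l + 1]
        for k in range(lo, mid):                   # rising edge of the triangle
            bank[l - 1, k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):                   # falling edge of the triangle
            bank[l - 1, k] = (hi - k) / (hi - mid)
    return bank

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    """MFCCs of one frame: filter-bank outputs, log, then cosine transform."""
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    windowed = emphasized * np.hamming(emphasized.size)
    spectrum = np.abs(np.fft.rfft(windowed))
    y = mel_filter_bank(n_filters, frame.size, sr) @ spectrum      # Eq. 14
    log_y = np.log10(np.maximum(y, 1e-10))         # floor avoids log(0)
    l = np.arange(1, n_filters + 1)
    a = np.arange(1, n_coeffs + 1)[:, None]
    return (log_y * np.cos(np.pi * (l - 0.5) * a / n_filters)).sum(axis=1)  # Eq. 16

sr = 16000                                         # assumed sampling rate in Hz
frame = np.sin(2 * np.pi * 440 * np.arange(512) / sr)
coeffs = mfcc(frame, sr)
bank = mel_filter_bank(26, 512, sr)
```

In practice the same computation is repeated for every frame of the signal, which produces one coefficient vector per frame.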

**D. Classification Using Bi-RNN (Recurrent Neural Network)**:

An RNN can detect the structure hidden in a sequence as it unfolds over time. An audio signal can itself be regarded as a time sequence, so using an RNN to process music captures the temporal dependency of the signal. The sound spectrum likewise extends along the time dimension. Because the feature map obtained after one-dimensional convolution can be treated as a temporal feature sequence, an RNN can also be used to analyze sound spectrum information. This research employs a Bi-RNN to model the music sequence in order to better represent the bidirectional dependency in the time dimension and to approximate the brain's perception of music more closely. A Bi-RNN takes into account both the preceding and the subsequent inputs, which may aid data modeling.

The structure of the Bi-RNN is shown in Figure 2. In the forward computation, \({\overrightarrow{Z}}^{a}\) is connected to \({\overrightarrow{Z}}^{a-1}\); in the backward computation, \({\overleftarrow{Z}}^{a}\) is connected to \({\overleftarrow{Z}}^{a+1}\), where \({\overleftarrow{Z}}^{a}\) indicates the current state of the hidden layer. \({\overleftarrow{Z}}^{a}\) is calculated as follows:

$${\overleftarrow{Z}}^{a}=u\left({V}^{\prime }{X}^{a}+{M}^{\prime }{\overleftarrow{Z}}^{a+1}\right)$$

17

To obtain the final network output, combine the forward and backward hidden states at each step:

$${T}_{a}=M{\overrightarrow{Z}}^{a}+{M}^{\prime }{\overleftarrow{Z}}^{a}$$

18
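A forward pass through such a network can be sketched in NumPy as shown below. Several details are assumptions, since the text does not fix them: the activation \(u\) is taken to be tanh, the forward recurrence is assumed to mirror the backward one in Eq. (17), and the parameter names (`V_f`, `W_f`, `V_b`, `W_b`, `M_f`, `M_b`) and the random weights are hypothetical.

```python
import numpy as np

def bi_rnn(X, params, u=np.tanh):
    """X has shape (steps, n_in); returns outputs of shape (steps, n_out)."""
    steps = X.shape[0]
    n_hidden = params["W_f"].shape[0]
    fwd = np.zeros((steps, n_hidden))
    bwd = np.zeros((steps, n_hidden))

    h = np.zeros(n_hidden)
    for a in range(steps):                  # forward pass: state a depends on a-1
        h = u(params["V_f"] @ X[a] + params["W_f"] @ h)
        fwd[a] = h

    h = np.zeros(n_hidden)
    for a in reversed(range(steps)):        # backward pass: state a depends on a+1 (Eq. 17)
        h = u(params["V_b"] @ X[a] + params["W_b"] @ h)
        bwd[a] = h

    # Combine both directions at every step (Eq. 18)
    return fwd @ params["M_f"].T + bwd @ params["M_b"].T

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, steps = 3, 4, 2, 5   # hypothetical sizes
params = {
    "V_f": rng.normal(scale=0.5, size=(n_hidden, n_in)),
    "W_f": rng.normal(scale=0.5, size=(n_hidden, n_hidden)),
    "V_b": rng.normal(scale=0.5, size=(n_hidden, n_in)),
    "W_b": rng.normal(scale=0.5, size=(n_hidden, n_hidden)),
    "M_f": rng.normal(scale=0.5, size=(n_out, n_hidden)),
    "M_b": rng.normal(scale=0.5, size=(n_out, n_hidden)),
}
X = rng.normal(size=(steps, n_in))
outputs = bi_rnn(X, params)
```

Because of the backward pass, the output at the first step also depends on the last input, which a unidirectional RNN cannot provide.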

**E. Improved TCP Congestion Control Algorithm**:

This protocol is designed for on-demand access to an extensive music collection and would be unsuitable for live broadcasts. For instance, a client can upload a track only if it holds the whole track. This simplifies the protocol, because clients no longer need to indicate which portions of a track they own. The drawbacks are limited because tracks are small. Instead of UDP, which is the most common transport protocol for streaming applications, Spotify uses TCP.

First, a reliable transport protocol simplifies protocol design and implementation. Second, TCP is well suited to the network because its congestion control is friendly to other traffic, and its explicit connection signaling benefits stateful firewalls. Finally, since streamed content is shared through a peer-to-peer network, re-sending lost packets is beneficial to the application. A single TCP connection is used between each pair of hosts, and the application protocol multiplexes messages over it. While a client is running, it maintains a TCP connection to a Spotify server. Application-layer messages are buffered and sorted by priority before being delivered to the operating system's TCP buffers. Messages required for interactive browsing, for example, are prioritized over bulk traffic.
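The priority-ordered buffering of application-layer messages can be illustrated with a small priority queue. The sketch below is not Spotify's implementation: the class name, the priority levels, the channel labels, and the byte budget standing in for the free space in the operating system's TCP buffer are all hypothetical.

```python
import heapq
import itertools

INTERACTIVE, BULK = 0, 1                 # assumed levels: lower value is sent first

class PriorityMessageBuffer:
    """Buffers messages multiplexed over one connection and releases them by priority."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def enqueue(self, priority, channel, payload):
        heapq.heappush(self._heap, (priority, next(self._order), channel, payload))

    def drain(self, max_bytes):
        """Pop messages in priority order until the byte budget would be exceeded."""
        sent, used = [], 0
        while self._heap and used + len(self._heap[0][3]) <= max_bytes:
            _, _, channel, payload = heapq.heappop(self._heap)
            sent.append((channel, payload))
            used += len(payload)
        return sent

buf = PriorityMessageBuffer()
buf.enqueue(BULK, "track-chunk", b"x" * 4000)
buf.enqueue(INTERACTIVE, "browse", b"search:jazz")
buf.enqueue(BULK, "track-chunk", b"y" * 4000)
sent = buf.drain(max_bytes=5000)
```

In this example the browsing request is released ahead of the track data even though it was queued later, and the second chunk stays buffered until more space is available.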

**F. Honey Bee Optimization Algorithm**:

The Honey Bees method combines random search with neighborhood search and applies to both functional and combinatorial optimization. The fundamental goal of this method, as illustrated in Figure 3, is to identify an optimal solution by imitating the natural foraging activity of honey bees. In general, the following parameters are needed: the number of scout bees (n), the number of sites chosen from the visited sites (m), a stopping criterion, the number of best sites among the chosen sites (e), the initial patch size, which covers a site and its surroundings, the number of bees recruited for the best sites, and the number of bees recruited for the remaining chosen sites.

The bees are first placed at random in the search space and their fitness is assessed. The bees with the best fitness are chosen, and the sites they visited are selected for neighborhood search. Bees are then recruited for the selected sites and their fitness is assessed. The fittest bee from each patch is chosen. The remaining bees are allocated to the search space at random and their fitness is evaluated. These stages are repeated until the stopping criterion is fulfilled.

The bees method is used in various applications, including clustering, neural network pattern matching, and construction. In sensor networks, nodes near the sink must forward both their own data and the data received from nodes farther away, which depletes their energy. The resulting network isolation issue, also known as the hot spot problem, is caused by the energy depletion of the nodes surrounding the sink. Sink mobility significantly alleviates this issue because it balances the energy consumption of neighboring nodes. Biological methods are also used to improve the packet delivery ratio, throughput, and delay.
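The search loop described above can be sketched as follows for a continuous minimization problem. This is a simplified illustration under several assumptions: the stopping criterion is a fixed number of iterations, the patch size stays constant, lower fitness is better, and the parameter names `nep` and `nsp` (bees recruited for the best sites and for the other chosen sites) as well as the `sphere` test function are hypothetical.

```python
import random

def bees_algorithm(fitness, bounds, n=20, m=5, e=2, nep=7, nsp=3,
                   patch=0.5, iterations=50, seed=0):
    """Minimize `fitness` inside the box `bounds`; returns (best_point, best_value)."""
    rng = random.Random(seed)

    def random_point():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def neighbor(site):
        # Sample inside the patch around a site, clipped to the search bounds
        return [min(hi, max(lo, x + rng.uniform(-patch, patch)))
                for x, (lo, hi) in zip(site, bounds)]

    # Place n scout bees at random and rank them by fitness
    scouts = sorted((random_point() for _ in range(n)), key=fitness)
    for _ in range(iterations):
        next_sites = []
        for rank, site in enumerate(scouts[:m]):   # m chosen sites
            recruits = nep if rank < e else nsp    # more bees for the e best sites
            candidates = [site] + [neighbor(site) for _ in range(recruits)]
            next_sites.append(min(candidates, key=fitness))  # fittest bee per patch
        # Remaining bees scout the search space at random
        next_sites += [random_point() for _ in range(n - m)]
        scouts = sorted(next_sites, key=fitness)
    best = scouts[0]
    return best, fitness(best)

def sphere(p):
    return sum(x * x for x in p)

bounds = [(-5.0, 5.0), (-5.0, 5.0)]
best_point, best_value = bees_algorithm(sphere, bounds)
```

Keeping the current site among the candidates of its own patch means the best solution found so far is never lost between iterations.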