Creating Speech Synthesis Process by Using Pulse Model in Log-Domain Vocoder With Whale Optimized Deep Convolution Neural Networks

Speech synthesis is the artificial production of human speech by a computer system called a speech synthesizer. Previous works use several statistical parameters to perform the speech synthesis process, but the vocoder is combined only with simpler models. Owing to this lack of sequence modeling, quality degradation reduces the synthesized speech. Therefore, the Pulse Model in Log-domain vocoder with a whale optimized deep convolution recurrent neural network is applied to investigate the vocoder in this work. During this analysis, Mel-Generalized Cepstrum (MGC), maximum voiced frequency (MVF), and F0 features are extracted from the processed signal, and the vocoder is generated successfully. The system's effectiveness is then evaluated using experimental results compared against deep neural networks and traditional recurrent networks.


Background Study
Speech synthesis [1][2][3][4] is the artificial reproduction of human speech with the help of a computer device. This speech synthesis process is mainly utilized to convert text into audio information applicable to different applications [5] such as mobile applications, voice-enabled services, etc. Significantly, the synthesis process is used as an assistive technology for vision-impaired individuals. Historically, Homer Dudley's Voder [6], developed at Bell Laboratories, was the first practical voice synthesizer. The quality of a created synthesizer is determined by computing the similarity of the synthesized speech with the original human voice. Several operating systems adopted the speech synthesizer process in the 1990s, which was done by concatenating recorded speech. The discussed speech synthesis [7,8] is performed in two stages: preprocessing and synthesis. The preprocessing step helps to remove the ambiguity present in words such as homographs. After eliminating this noise, speech synthesis is performed by converting the text into a series of sounds. Although the text-to-speech synthesis process [9][10][11][12] is successfully utilized in different techniques, synthetic quality, flexibility, control, and robustness remain significant problems. In addition, inadequate acoustic model capability complicates the nonlinear relationship between acoustic and linguistic features. To overcome this issue, several researchers have analyzed the synthesis process to improve the system's overall efficiency. Most researchers utilize the Hidden Markov Model (HMM) [13] to create the speech synthesis process, but it has high computational complexity when analyzing many parameters.
Therefore, deep learning techniques [14][15][16] are utilized in the speech synthesis process to successfully analyze high-dimensional data and extract high-level information. Owing to their effective analysis, learning capability, and easy adaptation, deep learning techniques incorporate different neural network types. These networks effectively map the linguistic features to the vocoding parameters, from which the decisions are carried out.
Considering the advantages of deep learning techniques in the speech synthesis process, various analysts have utilized deep learning concepts to create speech synthesis systems. The following discusses different researchers' opinions, ideas, thoughts, and frameworks for developing such systems. (Kolokas N., et al., 2018) [17] created a recurrent neural network-based keyword-to-text synthesis system. Initially, keywords are obtained from users, and a part-of-speech tagging library is utilized to examine them. From the generated keywords, the network maps the related text for generating the synthesis. The identified text synthesis is closely related to the keywords, and the system is developed with minimum cost and complexity. (Al-Radhi M.S., et al., 2017) [18] addressed the discontinuities in traditional vocoders by applying sequence-to-sequence modeling and convolutional neural networks. During this process, gated recurrent networks, bidirectional long short-term memory networks, and long short-term memory networks are continuously examined. These neural networks are applied to the Mel-Generalized Cepstrum and maximum voiced frequency using the continuous vocoder. The process helps generate speech synthesis and reduces the high time consumption compared with traditional feed-forward deep learning networks. (Bollepalli, et al., 2019) [19] applied recurrent and long short-term memory neural networks to create a normal-to-Lombard adaptation based speech synthesis process. This process uses three adaptation methods: learning hidden unit contributions, auxiliary features, and fine-tuning. These adaptation models work along with the neural networks that change the text-to-speech synthesis process. The created synthesis system is evaluated using similarity tests and speech intelligibility tests to assess its efficiency. (Anumanchipalli, et al., 2019) [20] developed a speech synthesis system from spoken speech using neural decoding. This process uses recurrent neural networks to examine cortical activities and articulatory movements, because this helps to investigate the speech acoustics and makes it easy to synthesize the spoken speech. In addition, neuroprosthetic techniques are applied in the articulatory decoding process to improve the communication process. (Hwang et al., 2020) [22] improved the LPCNet vocoder performance by applying linear prediction structured mixture density networks. In this process, an autoregressive neural vocoder examines the vocal source and vocal tract components. Then a continuous density distribution process is applied to combine the linear prediction structured mixture density model with the LPCNet vocoder. This incorporation creates a persuasive speech synthesis process, meaning the quality of the speech is enhanced effectively. The LPCNet-based vocoder attains a 4.41 opinion score while creating the text-to-speech structure.
(Y. Zhao et al., 2018) [23] developed a multi-speaker speech synthesis system by applying the WaveNet vocoder. The system analyzes the text-to-speech acoustic and natural features and reduces the mismatched characteristics. During this process, a generative adversarial network is utilized as the acoustic model, and its output is used in the WaveNet. This process is further enhanced by applying a discretized mixture of logistics loss model that reduces the error rate while creating the text-to-speech synthesis system. (M. Airaksinen et al., 2018) [24] analyzed different vocoders such as sinusoidal vocoding, glottal, and STRAIGHT in statistical parametric speech synthesis. During this process, text-to-speech synthesis splits the text into vocoder-related features, and synthesis is done using the shared envelope model. Then the wave is generated using these vocoders, and the efficiency of the system is analyzed using four voices. From the analysis, the glottal vocoder attains significant results compared to the other vocoders.

Recent studies directly show that deep learning networks can successfully create the speech synthesis process. Even so, these approaches fail to meet the required accuracy due to the lack of sequence modeling, and the resulting quality degradation reduces the synthesized speech. To overcome this issue, the Pulse Model in Log-domain vocoder with a whale optimized deep convolution recurrent neural network is applied to investigate the vocoder in this work. The overall contributions of the paper are listed as follows:

- Maximizing the speech synthesis accuracy by applying a useful Pulse Model in Log-domain vocoder.
- Improving the quality of speech synthesis by extracting the significant features from the signal.
- Reducing the error rate while creating the speech synthesis system.

The created whale optimized deep convolution recurrent neural network-based speech synthesis system is implemented using an experimental setup, and different performance metrics such as accuracy and error rate are used to examine the system's efficiency.
The rest of the paper is arranged as follows: Section 2 analyzes the whale optimized deep convolution recurrent neural network-based speech synthesis system, Section 3 evaluates the proficiency of the speech synthesis process, and Section 4 concludes the paper.

Process of Speech Synthesis
This section discusses the speech synthesis process, whose main aim is to transform the input text into a speech signal. During this process, the system takes text as the input, which is analyzed continuously; different acoustic and linguistic features are extracted and then changed into the waveform. Based on this discussion, the working process of speech synthesis is illustrated in figure 1.
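To make this pipeline concrete, the following is a minimal Python sketch of the text-to-waveform flow of figure 1. All function names, feature dimensions, and frame settings here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def text_to_linguistic_features(text: str) -> np.ndarray:
    """Stub linguistic front end: encode each character as a one-hot vector.
    A real system would produce phoneme/context labels instead."""
    codes = np.frombuffer(text.encode("ascii", "ignore"), dtype=np.uint8)
    return (codes[:, None] == np.arange(128)[None, :]).astype(np.float32)

def predict_acoustic_features(linguistic: np.ndarray):
    """Stub acoustic model: frame-level MGC (25-dim), MVF and F0 tracks."""
    frames = len(linguistic)
    mgc = np.zeros((frames, 25), dtype=np.float32)  # Mel-Generalized Cepstrum
    mvf = np.full(frames, 4000.0)                   # maximum voiced frequency (Hz)
    f0 = np.full(frames, 120.0)                     # fundamental frequency (Hz)
    return mgc, mvf, f0

def synthesize(text: str) -> np.ndarray:
    linguistic = text_to_linguistic_features(text)
    mgc, mvf, f0 = predict_acoustic_features(linguistic)
    # A real system would hand these features to the PML vocoder;
    # this stub returns silence (80 samples/frame, i.e., 5 ms at 16 kHz).
    return np.zeros(len(f0) * 80, dtype=np.float32)
```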
The spectral envelope value is obtained by interpolating the harmonic peak amplitudes:

$$S(z) = E(z)\,I(z) \qquad (1)$$

In eqn. (1), $E(z)$ is defined as the spectral envelope and $I(z)$ is defined as the flat excitation signal.

After that, the phase distortion deviation (PDD) value is estimated using the harmonic phase distortion value, which is computed using eqn. (2):

$$PD_i(h) = \phi_i(h+1) - \phi_i(h) - \phi_i(1) \qquad (2)$$

Eqn. (2) determines the phase distortion at each harmonic frequency $h$ of frame $i$ relative to the F0 component; PDD is the standard deviation of $PD$ across frames. According to the F0 contour value, the synthesis process quantizes PDD into a binary noise mask: the phase spectrum value is replaced by a random noise value wherever $PDD(\omega)$ exceeds a threshold $\tau$:

$$\phi(\omega) = \angle E_{\min}(\omega) + \eta(\omega)\,\mathbb{1}\!\left[PDD(\omega) > \tau\right] \qquad (3)$$

In eqn. (3), the minimum-phase pulse value is denoted as $\phi(\omega)$, the spectral envelope minimum-phase response value is denoted as $\angle E_{\min}(\omega)$, and $\eta(\omega)$ is the random noise phase. The timestamp process incorporates the time-dimensional features in an optimized manner.
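As an illustration of eqns. (2) and (3), the following NumPy sketch computes PD and PDD from harmonic phases and derives the binary noise mask. The circular-deviation formula and the 0.75 threshold are assumptions commonly used with PML-style vocoders, not values stated in this paper.

```python
import numpy as np

def phase_distortion(phi: np.ndarray) -> np.ndarray:
    """Phase distortion per frame (eqn. 2): PD(h) = phi(h+1) - phi(h) - phi(1).
    phi: array of shape (frames, harmonics), harmonic phases in radians."""
    pd = phi[:, 1:] - phi[:, :-1] - phi[:, [0]]
    return np.angle(np.exp(1j * pd))  # wrap to (-pi, pi]

def phase_distortion_deviation(phi: np.ndarray) -> np.ndarray:
    """PDD: circular standard deviation of PD across frames, per harmonic."""
    pd = phase_distortion(phi)
    mean_vector = np.abs(np.mean(np.exp(1j * pd), axis=0))
    return np.sqrt(-2.0 * np.log(np.maximum(mean_vector, 1e-12)))

# Binary noise mask (eqn. 3): randomize the phase where PDD exceeds a
# threshold; 0.75 is an assumed, commonly cited value.
phi = np.random.uniform(-np.pi, np.pi, size=(100, 40))
mask = phase_distortion_deviation(phi) > 0.75
```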
The bi-directional recurrent network's output is fed into fully connected layers to create a useful acoustic prediction model; a sketch of this architecture is given below. During this process, the neural network's efficiency is enhanced by using the whale optimization technique. The algorithm works according to whale hunting behavior while optimizing the network performance: the whale utilizes the prey encircling phase, the bubble-net attacking phase, and the prey searching phase for selecting its food.
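Before detailing the optimization, the following is a minimal sketch of such a convolution-plus-bidirectional-recurrent acoustic model, assuming TensorFlow/Keras; the layer counts and sizes are illustrative assumptions rather than the paper's configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dcrnn(input_dim: int = 128, output_dim: int = 27) -> tf.keras.Model:
    """Conv feature extractor -> bidirectional recurrent layers -> dense head.
    output_dim covers the vocoder parameters (e.g., 25-dim MGC + MVF + F0)."""
    inputs = layers.Input(shape=(None, input_dim))  # (frames, linguistic features)
    x = layers.Conv1D(256, 5, padding="same", activation="relu")(inputs)
    x = layers.Conv1D(256, 5, padding="same", activation="relu")(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    outputs = layers.TimeDistributed(layers.Dense(output_dim))(x)  # fully connected head
    return models.Model(inputs, outputs)

model = build_dcrnn()
model.compile(optimizer="adam", loss="mse")
```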
Initially, the prey encircling behavior is computed as follows:

$$\vec{D} = \left|\vec{C}\cdot\vec{X}^{*}(t) - \vec{X}(t)\right| \qquad (6)$$
$$\vec{X}(t+1) = \vec{X}^{*}(t) - \vec{A}\cdot\vec{D} \qquad (7)$$

In eqns. (6) and (7), the current iteration is defined as $t$, the coefficient vectors are denoted as $\vec{A}$ and $\vec{C}$, the position of the current optimal solution is denoted as $\vec{X}^{*}$, $\vec{X}$ is the position vector, and $|\cdot|$ denotes the absolute value. The coefficient vectors are computed as follows:

$$\vec{A} = 2\vec{a}\cdot\vec{r} - \vec{a} \qquad (8)$$
$$\vec{C} = 2\vec{r} \qquad (9)$$

For every iteration, the value of $\vec{a}$ decreases from 2 to 0, and $\vec{r}$ is a random vector with values from 0 to 1. Based on the position values, the whale's encircled prey is identified, and the optimized one is selected according to the bubble-net attacking phase. The mathematical derivation of this process is described in eqn. (10):

$$\vec{X}(t+1) = \vec{D}'\, e^{bl} \cos(2\pi l) + \vec{X}^{*}(t), \qquad \vec{D}' = \left|\vec{X}^{*}(t) - \vec{X}(t)\right| \qquad (10)$$

In eqn. (10), $\vec{D}'$ defines the distance used in the position updating process, $b$ is a constant defining the spiral shape, $l$ is a random number with a value from -1 to 1, and $p$ is a random value from 0 to 1 that switches between the encircling and spiral updates. After that, the best prey is searched according to eqns. (11) and (12):

$$\vec{D} = \left|\vec{C}\cdot\vec{X}_{rand} - \vec{X}(t)\right| \qquad (11)$$
$$\vec{X}(t+1) = \vec{X}_{rand} - \vec{A}\cdot\vec{D} \qquad (12)$$

Here, $\vec{X}_{rand}$ is a random position vector. Based on the above process, the optimized weight values are selected for the network, which reduces the error value while predicting linguistic and acoustic features. The generated features are fed into the vocoder for creating the speech synthesis, which is achieved with the help of a trained speech synthesizer. Then the efficiency of the system is evaluated using experimental results and discussions.
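For illustration, a compact Python sketch of the WOA update rules in eqns. (6)-(12) is given below, applied to a generic fitness function; using it to select network weights as described above would simply make the fitness the feature prediction error. The population size, iteration count, and bounds are illustrative assumptions.

```python
import numpy as np

def woa(fitness, dim, n_whales=30, iters=100, bounds=(-1.0, 1.0), b=1.0):
    lo, hi = bounds
    X = np.random.uniform(lo, hi, (n_whales, dim))       # whale positions
    best = X[np.argmin([fitness(x) for x in X])].copy()  # current optimum X*
    for t in range(iters):
        a = 2.0 - 2.0 * t / iters                        # a decreases from 2 to 0
        for i in range(n_whales):
            r = np.random.rand(dim)
            A, C = 2 * a * r - a, 2 * np.random.rand(dim)
            p, l = np.random.rand(), np.random.uniform(-1, 1)
            if p < 0.5:
                if np.all(np.abs(A) < 1):                # encircle prey, eqns. (6)-(7)
                    X[i] = best - A * np.abs(C * best - X[i])
                else:                                    # search for prey, eqns. (11)-(12)
                    rand = X[np.random.randint(n_whales)]
                    X[i] = rand - A * np.abs(C * rand - X[i])
            else:                                        # spiral bubble-net attack, eqn. (10)
                D = np.abs(best - X[i])
                X[i] = D * np.exp(b * l) * np.cos(2 * np.pi * l) + best
            X[i] = np.clip(X[i], lo, hi)
        cand = X[np.argmin([fitness(x) for x in X])]
        if fitness(cand) < fitness(best):
            best = cand.copy()
    return best

# Example: minimize the sphere function in 5 dimensions.
w = woa(lambda x: float(np.sum(x ** 2)), dim=5)
```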

Results and Discussions

Evaluations
The effectiveness of the discussed whale optimized deep convolution recurrent neural network (WODCRNN) based speech synthesis process is evaluated in this section. During this process, efficiency is determined in terms of subjective and objective evaluations. First, the vocoder performance is assessed, because the vocoder converts the extracted features into the synthesized speech. The created WODCRNN approach's efficiency is then estimated, because it determines the predictive models of both the linguistic and acoustic models.

Evaluation in Terms of Objective Measures
This section evaluates the system using objective measures. The process creates the speech synthesis successfully, and the error rate (correlation error) values are reduced by the successful updating of the whale optimal prey solution selection process. The minimum error value maximizes the overall efficiency of the speech synthesis.
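As a concrete example of such objective measures, the sketch below computes RMSE and Pearson correlation between predicted and reference F0 tracks; the exact error metric behind table 3 is not specified here, so these are assumed standard choices, and the data is synthetic.

```python
import numpy as np

def rmse(pred: np.ndarray, ref: np.ndarray) -> float:
    """Root mean squared error between predicted and reference tracks."""
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def correlation(pred: np.ndarray, ref: np.ndarray) -> float:
    """Pearson correlation between predicted and reference tracks."""
    return float(np.corrcoef(pred.ravel(), ref.ravel())[0, 1])

# Synthetic example: a reference F0 track and a noisy prediction of it.
f0_ref = np.random.uniform(100, 200, 500)
f0_pred = f0_ref + np.random.normal(0, 5, 500)
print(rmse(f0_pred, f0_ref), correlation(f0_pred, f0_ref))
```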
Therefore, the obtained accuracy values are illustrated in table 3. Based on the discussion, the respective graphical analysis is illustrated in figure 4.

Figure 4: Training Efficiency
The above results clearly illustrate that the whale optimized deep convolution recurrent neural network (WODCRNN) algorithm attains significant results on the training speech data.
The predictive training model helps to analyze the testing speech from the user. The obtained testing evaluation is then determined using subjective analysis.

Evaluation in Terms of Subjective Measures
This section discusses the excellence of the whale optimized deep convolution recurrent neural network (WODCRNN) algorithm in a subjective manner. As discussed earlier, the speech data is collected from the CMU-ARCTIC database and includes both male and female speakers. Therefore, the testing efficiency is determined using different participants against the respective baseline methods.

Conclusion
Thus, this paper analyzed the whale optimized deep convolution recurrent neural network (WODCRNN) with the Pulse Model in Log-domain vocoder for the speech synthesis process. In this process, speech data is analyzed continuously using a predictive model for creating the training process. The training model generates the linguistic and acoustic model feature frames, and the trained features are stored in the database. The CMU-ARCTIC database speaker speech information is collected and analyzed by the created optimized model, which predicts the speech synthesis from the F0 contour, MGC, and MVF feature generations. The method attains the maximum synthesis recognition accuracy because of the vocoder training and the weight and bias updating process. Finally, the system's efficiency is determined using implementation results, in which the system creates speech synthesis with 99.3% accuracy. In the future, the speech synthesis process can be enhanced by applying a meta-heuristic based speech feature selection process.