Integration of Audio-Visual Speech Recognition Using LSTM and Feed-Forward Neural Network

Audio-visual speech recognition (AVSR) is currently one of the emerging fields of research, but there is still a deficiency of appropriate visual features for recognizing visual speech. Human lip-readers are increasingly being presented as useful in the gathering of forensic evidence but, like all humans, they are unreliable in analyzing lip movement. Here we use a custom dataset and design a system that predicts the spoken word from lip reading. The problem of speaker-independent lip reading is very demanding due to unpredictable variations between people. With recent developments and advances in the fields of signal processing and computer vision, the task of automating lip reading is becoming a field of great interest, and the AVSR technique is attracting attention as a reliable solution to the speech detection problem. We use MFCC techniques for audio processing and an LSTM method for visual speech recognition, and finally integrate the audio and video using a feed-forward neural network (FFNN), obtaining good accuracy. The final model was capable of making more appropriate decisions while predicting the spoken word, achieving an accuracy of about 92.38%.


Introduction
Audio speech recognition is the most widely used technique today to automatically detect what a person is saying and transcribe it into text. In the modern era it has gained a lot of popularity, and we use it in our day-to-day life in the form of Google Assistant or Amazon Alexa. However, a common observation is that audio speech recognition is mostly used indoors and does not respond well outdoors. This is due to the interference of noise: noise adds to the audio signal and much of the necessary information is lost. That is not the case with visual speech recognition (VSR), which has some advantages over audio speech recognition: (a) it is not sensitive to acoustic noise, and changes in the acoustic environment have no effect on the data; (b) it does not need the user to make a sound. At present we have a lot of data available and high computational power, which gives us the advantage of applying various machine learning and deep learning algorithms to get the best results possible.
The LRS2 database is the most common database available [1][7]. Feature extraction is done within the region of interest. In noisy conditions, the performance of audio-only speech recognition is poor compared to that of AVSR. The noise can be of different types, such as street or train noise; it is observed that noise of any type gives similar results. Lip reading is the task of deciphering text from the movement of a speaker's mouth. Ahmad B. A. Hassanat explained different approaches to lip localization [2]. Ayaz A. Shaikh et al. proposed that a depth-sensor camera can also be used to obtain a third dimension in the dataset [3]; during the creation of that dataset, the above-mentioned factors were controlled by using a headrest [3]. Themos Stafylakis et al. proposed residual-network and LSTM techniques for the LRW database and achieved 83% accuracy [4]. Shillingford et al. used LipNet, which takes two approaches to the problem: learning the visual features and prediction. The reported rates of this work are 89.8% and 76.8% [6].
Another architecture used to implement lip reading is the Long Short-Term Memory (LSTM) network [7]. LSTMs used for lip reading can selectively indicate which spatiotemporal scales are important for an individual dataset; the LRS2 dataset is used in that model, which achieves 85.2%. G. Sterpu et al. [8] investigate future-oriented deep neural network architectures for lip reading based on a sequence-to-sequence recurrent neural network, considering both engineered and 2D/3D convolutional neural network visual front-ends, optional monotonic attention, and a joint Connectionist Temporal Classification sequence-to-sequence loss. The system is evaluated with fifty-nine speakers and a vocabulary of over six thousand words on the publicly available TCD-TIMIT dataset. Kumar et al. [9] presented a detailed set of experiments for speaker-dependent, out-of-vocabulary and speaker-independent settings. To demonstrate the real-time nature of the audio produced by the system, the latency values of Lipper were compared with other speech-reading systems. The audio-only accuracy is 80.25% with an annotation accuracy variance of 2.72%, and the audio-visual accuracy is 81.25% with an annotation accuracy variance of 1.97%.
One of the common datasets used in lip reading is the GRID audio-visual dataset, on which the work in [10] is based. The visual data are recorded at a frame rate of 25 frames per second, i.e., 75 frames per 3-second sample. In this work LCANet, an end-to-end deep neural network, is used; the system achieves a 1.3% character error rate and a 3.0% word error rate. Dilip Kumar Murgm et al. suggested a novel 3D-2D-CNN-Bidirectional Long Short-Term Memory (BLSTM) architecture [11]. Two approaches are analyzed: a 3D-2D-CNN-BLSTM trained with CTC loss on characters, and a 3D-2D-CNN-BLSTM trained with CTC loss on word labels. For the first approach, the word error rate is 3.2% and 15.2% for seen and unseen words respectively. For the second approach on the GRID dataset, the word error rate is 1.3% and 8.6% for seen and unseen words respectively. On the unseen Indian English dataset, the word error rates are 19.6% and 12.3% for the two approaches.
One of the most famous datasets used for lip reading is "Lip Reading in the Wild" (LRW) [12][13][22], collected from BBC TV; it contains 500 target words. Themos Stafylakis et al. used residual networks and bidirectional LSTMs; the misclassification rate of the architecture is 11.92%. Using the same database and method, 83% accuracy was obtained. Audio-visual speech recognition is one prospective solution for speech recognition in noisy environments [13]. Shiliang Zhang et al. used a bimodal DFNN with 150 hours of multi-condition training data and achieved a 12.6% phone error rate on clean test data; the word error rate is 29.98%. Kuniaki Noda et al. introduced a multi-stream HMM (MSHMM) model for the integration of audio and visual features [14]; the word recognition rate of the MSHMM is 65% at a signal-to-noise ratio of 10 dB. Stavros Petridis et al. presented an end-to-end LSTM-based visual speech recognition system [15]. The model consists of two streams which extract features directly from the mouth images, fused via bidirectional LSTMs. On the OuluVS2 and CUAVE databases the work reports gains of 9.7% and 1.5% respectively. Fei Tao et al. [16] compared their proposed structure with conventional HMMs whose observation models are implemented with Gaussian mixture models (GMMs); the channel-matched word error rate is 3.70% and the channel-mismatched word error rate is 11.48%. A hybrid Connectionist Temporal Classification architecture for audio-visual recognition of speech in the wild is used in [17].
Audio features are of many kinds; the three used in [18] are LPC, PLP and MFCC. The study shows that MFCC has the highest accuracy, about 94.6%, for the Hindi language in a noiseless environment. MFCC shows an overall performance of about 86%, compared to about 83% for PLP. An RNN is used for audio input prediction. For the visual features, Active Shape Models and the Active Appearance Model are used to detect the lip region [2]. It takes a lot of time to create and process the data into the format required by the application. The objective defined in [19] can be affected by varying light intensity, movement of the head, and the distance from the camera.
Ochiai et al. proposed an attention-based feature extraction scheme in which the most significant speaker cues are extracted from the dataset. They used a 3-layer BLSTM with 512 units per layer. They further suggest using datasets in which the video is taken from different angles; doing so can improve accuracy, as it is common for humans to move their heads while talking [20].
Joon Son Chung et al. proposed a new database called LRS; it contains 100,000 natural sentences from BBC television [21].
Jha, Namboodiri et al. used Charlie Chaplin videos; their word-spotting technique achieves 35% higher mean average precision than a recognition-based technique on the extensive LRW dataset. They demonstrate the applicability of the technique by spotting words in a standard speech video, "The Great Dictator" by Charlie Chaplin [22].
Z. Thabet et al. applied machine learning approaches to lip reading: nine different classifiers were applied and tested, and their confusion matrices among different clusters of words were reported. Several classifiers were tried in the classification procedure, but three obtained the best outcomes: Gradient Boosting, Support Vector Machine and logistic regression, with results of 64.7%, 63.5% and 59.4% respectively [23]. Yaman Kumar et al. describe speech reading, or lip reading, as the method of understanding and extracting phonetic features from a speaker's visual cues such as the movement of the mouth, face, teeth and tongue. It has an extensive range of multimedia applications, such as surveillance, Internet telephony, and as an aid to people with hearing impairments [24]. Y. Lu et al. proposed that visual lip reading is a technology which combines machine vision and language understanding.
In lip-reading techniques, first the face region is detected in the input image, then the mouth region of the speaker is extracted, and the pronunciation is determined from these features by a recognition model; in this manner the speech content is recognized [25]. Lip reading can be used to enhance the understanding of what a person says, and it is also greatly beneficial for hearing-impaired people [26,27]. Thus a main use would be for hearing-impaired people, who would benefit most from an accurate transcript of what the speaker is saying. Many experiments and research studies have shown that people with [26,27] and without [28] hearing impairment use visual cues to understand and to enhance their understanding of the speaker's words. The rest of this paper is organized as follows: Section 2 explains the methodology in detail. Section 3 describes the database creation and the challenges faced while creating the database. Section 4 presents the results and applications of the work, and Section 5 concludes the proposed work.

Methodology
Various studies on human-machine interfaces have shown that utilizing not only the available audio information but also the speech information present in the visual data can boost the accuracy of speech recognition. The addition of visual information to automatic speech recognition is found to improve accuracy, especially in acoustically noisy conditions where the audio data is corrupted. The proposed work uses an English dataset for audio-visual speech recognition. The custom English dataset contains 245 videos of five speakers pronouncing seven English words: 'ABOUT', 'BOTTLE', 'DOG', 'ENGLISH', 'GOOD', 'PEOPLE', and 'TODAY'. The model expects a video of a speaker uttering one of the seven pre-trained words and displays the result in the form of text. However, this work does not account for speakers with different accents, nor does it deal with predicting sentences or individual phonemes.
The method for developing the audio-visual speech recognition system consists of six steps, explained one by one below. Data extraction: the video is custom data and contains audio data too. We extracted the audio and video into separate files. In our proposed work we have taken 245 videos covering seven words. The audio has to be extracted from the given MP4 file. We took the custom dataset video and, in the first step, extracted the frames from it. The output is a color image, which is converted to grayscale to decrease the computational overhead.
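The grayscale conversion in this step can be sketched as follows. This is a minimal numpy sketch under our own naming (the paper gives no code); in practice the frames would come from a video reader such as OpenCV's cv2.VideoCapture, which we only indicate in a comment.

```python
import numpy as np

# Hypothetical sketch of the frame pre-processing described above.
# The frames themselves would come from a video reader (e.g. OpenCV's
# cv2.VideoCapture on the MP4 file); here a frame is simply an
# (H, W, 3) RGB uint8 array.

def to_grayscale(frame: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) RGB frame to an (H, W) grayscale image using
    the standard ITU-R BT.601 luma weights, reducing computational
    overhead for the later feature-extraction steps."""
    weights = np.array([0.299, 0.587, 0.114])
    return (frame.astype(np.float64) @ weights).astype(np.uint8)

def preprocess_frames(frames):
    """Convert every extracted frame of a clip to grayscale."""
    return [to_grayscale(f) for f in frames]
```

The per-channel weighting is the usual luma formula; any standard RGB-to-gray conversion would serve the same purpose here.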

Audio Feature Extraction:
To accomplish audio speech recognition, an audio dataset holding the essential features has to be created. A module called MoviePy is used for this task: since our custom dataset is in MP4 format, this module allows us to extract the audio from the MP4 files. For audio processing we used the open-source library Librosa. For audio preprocessing and recognition we used five feature types: MFCC, chroma, mel spectrogram, spectral contrast, and tonnetz.
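A common pattern with Librosa is to compute each framewise feature matrix and mean-pool it over time into one fixed-length vector per clip. The sketch below shows only that aggregation step in plain numpy, with the corresponding librosa.feature calls indicated in comments; the coefficient counts are typical Librosa tutorial defaults, not values stated in the paper.

```python
import numpy as np

# Sketch of the audio feature aggregation. Each framewise matrix of shape
# (n_coefficients, n_frames) -- in practice produced by
# librosa.feature.mfcc, librosa.feature.chroma_stft,
# librosa.feature.melspectrogram, librosa.feature.spectral_contrast and
# librosa.feature.tonnetz -- is averaged over time into one vector.

def aggregate_features(feature_matrices):
    """Mean-pool each (n_coeffs, n_frames) matrix over the time axis and
    concatenate the results into one 1-D clip-level feature vector."""
    return np.concatenate([m.mean(axis=1) for m in feature_matrices])

# Example with random stand-ins for the five framewise feature types
# (coefficient counts are common defaults, assumed for illustration).
rng = np.random.default_rng(0)
n_frames = 120
clip_features = aggregate_features([
    rng.normal(size=(40, n_frames)),    # MFCC
    rng.uniform(size=(12, n_frames)),   # chroma
    rng.uniform(size=(128, n_frames)),  # mel spectrogram
    rng.uniform(size=(7, n_frames)),    # spectral contrast
    rng.normal(size=(6, n_frames)),     # tonnetz
])
```

With these counts the clip vector has 40 + 12 + 128 + 7 + 6 = 193 dimensions; the paper does not state its exact vector length.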

Video Feature Extraction:
In order to extract the video features, we first grouped the videos of the same word into a single folder. Videos are taken individually from the folders, and the "shapepredictor68facelandmarks.dat" file, which is available on the internet, is used to locate the facial landmarks. Figure 1 shows the shape predictor applied to a human face. Our work requires only the lip region, so we neglected all other features and considered only those around the mouth. Figure 7 shows how a video is divided into a number of frames. The coordinate values of each landmark position are collected in an array, and the mean deviation of the coordinates of each position is taken as the feature. This provides two coordinates for each of twenty positions, i.e., forty video features per video, which are stored in "Videofeatures.csv".
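The per-landmark statistic described above can be sketched as follows. The paper does not give the exact formula for "mean deviation", so mean absolute deviation across frames is our assumption; the 20 mouth points correspond to indices 48-67 of dlib's 68-point predictor.

```python
import numpy as np

# Sketch of the video feature step: for each of the 20 mouth landmarks
# (points 48-67 of the 68-point dlib shape predictor), track its (x, y)
# position across all frames of a video and reduce each coordinate to one
# "mean deviation" statistic, giving 20 x 2 = 40 features per video.
# Mean absolute deviation is our assumed interpretation of that statistic.

def mouth_features(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: (n_frames, 20, 2) array of mouth-landmark coordinates.
    Returns a flat (40,) vector of per-coordinate mean absolute
    deviations from the per-video mean position."""
    mean_pos = landmarks.mean(axis=0)                      # (20, 2)
    deviation = np.abs(landmarks - mean_pos).mean(axis=0)  # (20, 2)
    return deviation.reshape(-1)                           # (40,)
```

One row of "Videofeatures.csv" would then be this 40-dimensional vector for one video.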

Training models and saving weights:
The audio features are trained using a simple deep neural network in Keras, and to train on the visual speech features we use a long short-term memory (LSTM) recurrent neural network. The integration model is trained on the predictions made by the audio-based and video-based models. The weights of the models are saved using the ModelCheckpoint callback available in Keras.

Loading the models:
The models with their corresponding architectures are loaded with the weights saved in the previous step in HDF5 format, and the three models are then ready to predict the output.

Performance Evaluation:
Using the trained models, the audio and visual speech in the test data can be predicted. First we evaluate the audio part and obtain its confusion matrix, then we obtain the confusion matrix of the video part, and finally we combine the audio and video predictions and obtain the confusion matrix of the combined model.

Block diagram:
Figure 3 is the block diagram depicting the working of the AVSR system. The input data is a video. In the first step, the audio is extracted from the video and processed using the techniques described above. Second, the video stream (without audio) is taken as input: face detection and lip localization are performed, the lip coordinates are extracted, and the visual data is trained using an LSTM deep neural network, while the audio data is trained using a simple DNN. Finally, the two are combined using a feed-forward neural network.
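The fusion stage of this diagram can be sketched as a small feed-forward pass over the concatenated class-probability outputs of the audio and video models. The hidden size and the random weights below are illustrative placeholders, not the paper's trained parameters.

```python
import numpy as np

# Illustrative sketch of the fusion stage: the 7-class probability vectors
# predicted by the audio model and the video model are concatenated and
# passed through a small feed-forward network. Layer sizes and weights
# here are placeholders for the trained parameters.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse(p_audio, p_video, W1, b1, W2, b2):
    """One forward pass of a 14 -> hidden -> 7 fusion network."""
    x = np.concatenate([p_audio, p_video])  # (14,) fused input
    h = np.maximum(0.0, W1 @ x + b1)        # ReLU hidden layer
    return softmax(W2 @ h + b2)             # fused 7-class probabilities

rng = np.random.default_rng(1)
hidden = 16  # assumed hidden width, for illustration only
W1, b1 = rng.normal(size=(hidden, 14)), np.zeros(hidden)
W2, b2 = rng.normal(size=(7, hidden)), np.zeros(7)
p = fuse(rng.dirichlet(np.ones(7)), rng.dirichlet(np.ones(7)), W1, b1, W2, b2)
```

The design point is late fusion: the network sees only the two models' predictions, so a word that the noisy audio model misclassifies can still be recovered from the video model's vote.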

Forward Pass
Let x^t be the input vector at time t, N the number of LSTM blocks and M the number of inputs. Then we get the following weights for an LSTM layer:

Input weights: W_z, W_i, W_f, W_o ∈ R^(N×M)
Recurrent weights: R_z, R_i, R_f, R_o ∈ R^(N×N)
Peephole weights: p_i, p_f, p_o ∈ R^N
Bias weights: b_z, b_i, b_f, b_o ∈ R^N

Then the vector formulas for a vanilla LSTM layer forward pass can be written as:

z̄^t = W_z x^t + R_z y^(t-1) + b_z,            z^t = g(z̄^t)    (block input)
ī^t = W_i x^t + R_i y^(t-1) + p_i ⊙ c^(t-1) + b_i,  i^t = σ(ī^t)    (input gate)
f̄^t = W_f x^t + R_f y^(t-1) + p_f ⊙ c^(t-1) + b_f,  f^t = σ(f̄^t)    (forget gate)
c^t = z^t ⊙ i^t + c^(t-1) ⊙ f^t                              (cell state)
ō^t = W_o x^t + R_o y^(t-1) + p_o ⊙ c^t + b_o,      o^t = σ(ō^t)    (output gate)
y^t = h(c^t) ⊙ o^t                                            (block output)

where σ, g and h are pointwise nonlinear activation functions. The logistic sigmoid σ(x) = 1/(1 + e^(-x)) is used as the gate activation function, and the hyperbolic tangent g(x) = h(x) = tanh(x) is usually used as the block input and output activation function. Pointwise multiplication of two vectors is denoted by ⊙.
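The forward pass above can be transcribed directly into numpy; the sketch below implements exactly one step with the weight shapes defined in the text (the weight values here are random, for illustration only).

```python
import numpy as np

# Direct numpy transcription of the vanilla LSTM forward pass above.
# Shapes follow the text: each W* is (N, M), each R* is (N, N), and the
# peepholes p* and biases b* are length-N vectors.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, y_prev, c_prev, W, R, p, b):
    """One forward step. W, R, b are dicts keyed 'z', 'i', 'f', 'o';
    p is a dict keyed 'i', 'f', 'o'. Returns (y_t, c_t)."""
    z = np.tanh(W['z'] @ x + R['z'] @ y_prev + b['z'])                    # block input
    i = sigmoid(W['i'] @ x + R['i'] @ y_prev + p['i'] * c_prev + b['i'])  # input gate
    f = sigmoid(W['f'] @ x + R['f'] @ y_prev + p['f'] * c_prev + b['f'])  # forget gate
    c = z * i + c_prev * f                                                # cell state
    o = sigmoid(W['o'] @ x + R['o'] @ y_prev + p['o'] * c + b['o'])       # output gate
    y = np.tanh(c) * o                                                    # block output
    return y, c

# Tiny example: N = 3 LSTM blocks, M = 4 inputs, random weights.
rng = np.random.default_rng(42)
N, M = 3, 4
W = {k: rng.normal(size=(N, M)) for k in 'zifo'}
R = {k: rng.normal(size=(N, N)) for k in 'zifo'}
p = {k: rng.normal(size=N) for k in 'ifo'}
b = {k: np.zeros(N) for k in 'zifo'}
y, c = lstm_step(rng.normal(size=M), np.zeros(N), np.zeros(N), W, R, p, b)
```

Running the step over a whole landmark-feature sequence, one frame at a time while carrying (y, c) forward, is what the Keras LSTM layer does internally during visual speech recognition.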

Dataset Creation
The dataset is created for both English and Kannada words using an extensive setup which includes an electronic gimbal for stable video and a smartphone with sufficient storage space. Table 1 lists the parameters of the dataset. The dataset comprises synchronized audio and lip-movement data in videos of multiple subjects uttering the same words. It was created to enable the development and validation of the procedures used to train and test the lip-motion method: it is a collection of videos of volunteers reciting a fixed script, intended to be used to train software to recognize lip-motion patterns. The recordings were collected in a controlled, noise-free, indoor setting with a smartphone capable of recording at 4K resolution. The dataset consists of around 240 video samples per person; 11 male and 13 female subjects, with ages ranging from 18 to 30, volunteered for the dataset creation process. This dataset can be used for speech recognition and lip-reading applications.

Challenges while Creating Dataset
There were various challenges encountered during the dataset creation process, explained below:
 Interference of external noise may disrupt audio feature extraction; a noise-free environment is an important requirement of dataset creation.
 The lip movements of an individual should be consistent in order to extract the lip features; random lip movement leads to errors.
 Each person contributing to the database has to spare around 30 to 45 minutes reciting the words, which can be tedious.
 Recording a video of a person with a moustache or beard leads to difficulty in detecting the lip movement.
 Selecting English and Kannada words for the database was difficult, as some of the words have similar pronunciations.
 Creating the database for both Kannada and English words is tedious and time-consuming, as a number of samples of the same word must be taken from different people, and the videos must regularly be uploaded to a hard disk or system to clear space on the mobile phone so that new videos can be captured.

Result and Discussion
The metrics used to measure the performance of the model are classification accuracy and the confusion matrix. Accuracy rate: it is defined as the number of correct predictions made by the model divided by the total number of predictions. Figure 5 shows the feed-forward neural architecture of the audio-only model, and Figure 6 shows the confusion matrix of the audio model with its correct and incorrect predictions of the words; the audio-only model accuracy is 91.42%. Figure 7 shows the confusion matrix of the video model, whose accuracy is 80%. Figure 12 shows the confusion matrix of the combined model, whose accuracy is 92.38%.
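The two evaluation metrics can be sketched as follows for the seven-word vocabulary; this is a generic implementation of the stated definitions, not code from the paper.

```python
import numpy as np

# Sketch of the evaluation metrics: classification accuracy and the
# confusion matrix over the paper's seven-word vocabulary.

WORDS = ['ABOUT', 'BOTTLE', 'DOG', 'ENGLISH', 'GOOD', 'PEOPLE', 'TODAY']

def accuracy(y_true, y_pred):
    """Number of correct predictions divided by total predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def confusion_matrix(y_true, y_pred, n_classes=len(WORDS)):
    """cm[i, j] counts samples of true class i predicted as class j;
    the diagonal holds the correct predictions."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, pred in zip(y_true, y_pred):
        cm[t, pred] += 1
    return cm
```

The reported 92.38% combined accuracy corresponds to the trace of such a matrix divided by its total count.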
Figure 8 shows the accuracy of the combined audio-video model and Figure 9 shows its loss curve. Figure 10 shows the accuracy curve of the LSTM model used for visual speech recognition, and Figure 11 shows its loss curve. Output: for testing the developed model, a user-friendly interface has been developed using Python, as shown in the following figures.

Applications
 The AVSR technique can be used in forensics, so that crime branch personnel can understand what was spoken from a lip-reading video alone.
 Hearing-impaired individuals read lips, and lip-reading applications may help them improve their lip-reading skills.
 AVSR is also used in Human-Computer Interaction applications to improve the user experience.

Conclusions
In this work we developed audio-visual speech recognition for a custom dataset of English words. First we extracted the audio from the video, computed five types of audio features, and obtained 91.42% accuracy; visual speech recognition using the LSTM technique obtained 80% accuracy. Finally, the audio and video models were combined using a feed-forward neural network, and the combined AVSR model achieved 92.38% accuracy.