Linguistic multidimensional perspective data simulation based on speech recognition technology and big data

In recent years, multidimensional visualization technology has advanced rapidly, producing enormous volumes of data, raising the requirements on related technologies, and creating new opportunities. Against the background of big data applications, multidimensional visualization methods can excel in many fields, including linguistic research. Although the technology is widely applicable, existing methods cannot effectively and intuitively display multidimensional voice feature data and thus struggle to meet the requirements of parameter visualization. To study linguistic speech recognition in depth, this paper applies speech recognition technology to build and refine a multidimensional perspective analysis system for speech data and uses the socket mechanism to construct the server, including voice recognition, data enhancement, and model training modules. The system treats the target voice data collection component as the calling end: over a socket connection, it exchanges data with the server, meets the task requirements of the multidimensional perspective analysis system, and achieves two-way data interaction. Simulation results show that the system, built on the OpenSMILE toolkit, effectively extracts high-dimensional features with excellent performance and meets most task requirements. The features cover many kinds of acoustic data, mitigate over-compression of the original signal, and mine the characteristics of voice waveforms to explain the relationship between frames. The system also recognizes multiple speakers more accurately with high-dimensional features than with low-dimensional ones. In summary, this paper designs an effective simulation system by applying speech recognition technology to multidimensional linguistic data analysis in the context of big data.


Introduction
With the development of big data and artificial intelligence technology, as well as the continuous improvement of sensing equipment, the human capacity to generate complex multidimensional data keeps growing (Spratt et al. 2013). However, because the human brain analyzes and processes multidimensional data far more slowly than the data are created, deep and valuable information is often submerged in the ocean of data and is difficult to find and apply (Capelleras et al. 2010). When dealing with complicated multidimensional data, multidimensional data visualization helps us understand the deep rules and information contained in the data and thereby discover new knowledge, which is of great significance to production and daily life (Etemadpour et al. 2014). As an interdisciplinary research field, visualization draws on graphics, computer vision, interaction technology, and other areas. Multidimensional visualization technology is universal and has been applied with good results in many fields such as astronomy, physics, and politics. Because of this universality, the amount of information generated keeps expanding (Viau et al. 2010). This process brings challenges as well as opportunities: the complexity of applying visualization technology itself increases with the amount of data. It is difficult for people to analyze and understand multidimensional data, and equally difficult to extract valuable information from it (Zholobov 2014). Moreover, the limits of human visual cognition make processing large multidimensional data sets very hard. How to easily understand, analyze, and mine multidimensional data is therefore the main research direction of multidimensional visualization technology.
Using multidimensional visualization technology to visualize features in linguistics is an important topic (Chen et al. 2015). Traditional multidimensional visualization methods cannot visualize multidimensional voice feature data and have difficulty meeting the demand for multidimensional voice feature parameter visualization. Against this background, this paper builds a multidimensional perspective analysis system for speech data. Its speech recognition covers two aspects: speaker recognition and speech content recognition (Abdel-Hamid et al. 2014). Many speech recognition problems can be explained intuitively through multidimensional visualization. The basic idea of multidimensional speech recognition is to make full use of the correlation between different kinds of speech information and to recognize them in the speech signal simultaneously (Naz et al. 2017). This work starts from the joint recognition of speaker gender, emotion, and related information, and will be expanded to recognition tasks of more dimensions in the future, such as simultaneous recognition of voice content and environmental background sound.

Related work
The literature has established a multidimensional speech recognition model that simultaneously recognizes the gender, emotion, and identity of speakers (Zayene et al. 2018). The model uses Mel-frequency cepstral coefficient (MFCC) features as the speech recognition feature parameters, selects a multi-task neural network structure with an attribute dependency layer, shares network parameters through an RNN sharing layer, and learns features common to all tasks. A task-specific layer learns the unique characteristics of each detection task, and a multi-task learning (MTL) mechanism adjusts the weight of each task's loss in the total loss function (Liao and Kao 2010). The model tunes its performance to the characteristics of the voice database and then outputs the recognition results of the three tasks simultaneously. Other work combines CNN-extracted features with hand-crafted MFCC features over the speech spectrum, making full use of the multidimensional information in the speech signal so that the two kinds of features complement each other, and finally feeds the fused features into a multi-task recurrent neural network classifier to recognize speaker identity, gender, and emotion (Tsanas et al. 2012). The literature has also designed and optimized a DNN-based speech enhancement model that divides DNN training into two stages and uses the log power spectrum as the recognition feature, and has proposed a parallel utterance-level backpropagation training method based on Hadoop (Ashraf and Zaman 2017). In a fully distributed environment, the method uses the MapReduce parallel computing framework to divide the training data set into multiple small data sets as the input of multiple sub-nodes and uses a batch BP algorithm to update the CNN model parameters (Al_Duais et al. 2020).
The literature also uses online speech synthesis to generate speech, adds the generated speech to the existing data set as the basic training set of the speech recognition model, and then uses a data enhancement algorithm based on a genetic algorithm to further expand the training set (Reddy and Rao 2017).
Speech recognition and big data-related technology theory

Data warehouse technology based on big data
The data source is the basic business layer. The voice system generates a large amount of business data every day. The data layer mainly uses the big data processing tools Sqoop and Flume in the Hadoop ecosystem for data transmission. The data in the MySQL online database are regularly transferred to the HDFS distributed file system through Sqoop and scripts. Ralph Kimball's dimensional modeling theory is used for data modeling, and after modeling is complete, layering theory is applied to the model and the data, so that the voice data structure becomes clearer, reducing data iteration and establishing an effective database. The commonly used models in this process are the star model and the snowflake model. The data model built in this paper is based on Ralph Kimball's dimensional modeling theory; the data are processed with the data layering method, which enables data lineage tracking and greatly reduces repeated development. The star model is used to build the data warehouse from a fact table and dimension tables.

Speech signal preprocessing

By ''center clipping'' or ''three-level clipping'', the interference of channel responses such as formants is removed, so as to compress irrelevant information and reduce the computation after clipping. The pitch period corresponding to the peak can be read off the autocorrelation function of the clipped signal, and the reciprocal of the pitch period is the pitch frequency. The short-term autocorrelation function Rn(k) is

Rn(k) = ∑ s(n + m) s(n + m + k), summed over m = 0, 1, …, N − 1 − k,

where N is the window length and n is the starting position of the window. The basic idea of linear prediction is: for a voice signal s(n), the predicted value is obtained as a linear combination of the p samples before time n,

ŝ(n) = ∑ a(k) s(n − k), summed over k = 1, 2, …, p.

Minimizing the mean square prediction error (the MMSE criterion) leads to the normal equations

∑ a(k) R(i − k) = R(i), for i = 1, 2, …, p,

which the Levinson-Durbin algorithm, a commonly used autocorrelation method, solves recursively. Because the linear prediction coefficients are derived from the speech channel model, the computational complexity is low and the recognition time short.
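As a rough sketch of this LPC analysis (the paper gives no code; the function names and the test signal below are our own), the autocorrelation and Levinson-Durbin steps look like:

```python
def autocorr(x, max_lag):
    """Short-term autocorrelation R(k) = sum_m x(m) x(m + k) over one frame."""
    n = len(x)
    return [sum(x[m] * x[m + k] for m in range(n - k)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations recursively.

    Returns the prediction coefficients [a(1), ..., a(p)] and the residual error.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a, err = new_a, err * (1.0 - k * k)
    return a[1:], err

# Sanity check: an AR(1) signal s(n) = 0.9 s(n-1) should give a(1) close to 0.9.
signal = [0.9 ** n for n in range(200)]
lpc, _ = levinson_durbin(autocorr(signal, 1), 1)
```

The recursion runs in O(p^2) rather than the O(p^3) of direct matrix inversion, which is why it is the standard solver for the autocorrelation method.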
Linear predictive cepstral coefficients (LPCC) are derived recursively from the LPC coefficients and are commonly used spectrum-based parameters in traditional one-dimensional recognition. They effectively remove the information of the excitation source and capture the characteristics of the vocal tract. The LPCC coefficients c(n) are obtained from the LPC coefficients a(k) by the recursion

c(1) = a(1),  c(n) = a(n) + ∑ (k/n) c(k) a(n − k), summed over k = 1, …, n − 1, for 1 < n ≤ p.

Due to the characteristics of the human vocal organs, the high-frequency components of speech signals are weak and easily masked by noise. To compensate for and enhance the high-frequency part, the speech signal is usually pre-emphasized, which flattens the spectrum and eases further processing. Pre-emphasis passes the sampled and quantized speech signal through a first-order high-pass filter:

y(n) = s(n) − a · s(n − 1),
where a is the pre-emphasis coefficient, usually set between 0.9 and 1. There are many methods for detecting signal endpoints. The method commonly used in speech signal processing is the ''double threshold method'': useful speech segments are judged by comparing the short-time energy against an energy threshold and the short-time zero-crossing rate against a zero-crossing threshold. First, the short-time energy of the voice signal is calculated as

En = ∑ s(n + m)^2, summed over m = 0, 1, …, N − 1.
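A minimal Python sketch of these preprocessing steps follows (pre-emphasis, short-time energy, zero-crossing rate, and a simplified double-threshold decision; the frame sizes and thresholds are illustrative, not the paper's values):

```python
import math

def pre_emphasis(x, a=0.97):
    """y(n) = x(n) - a * x(n-1): first-order high-pass boosting high frequencies."""
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

def short_time_energy(x, frame_len=160, hop=80):
    """Sum of squared samples per frame."""
    return [sum(v * v for v in x[i:i + frame_len])
            for i in range(0, len(x) - frame_len + 1, hop)]

def zero_crossing_rate(x, frame_len=160, hop=80):
    """Fraction of adjacent sample pairs with a sign change, per frame."""
    def zcr(frame):
        return sum(1 for i in range(1, len(frame))
                   if frame[i - 1] * frame[i] < 0) / len(frame)
    return [zcr(x[i:i + frame_len]) for i in range(0, len(x) - frame_len + 1, hop)]

def double_threshold_endpoints(x, energy_thr, zcr_thr, frame_len=160, hop=80):
    """Simplified double-threshold rule: a frame is speech if either measure exceeds its threshold."""
    e = short_time_energy(x, frame_len, hop)
    z = zero_crossing_rate(x, frame_len, hop)
    return [ei > energy_thr or zi > zcr_thr for ei, zi in zip(e, z)]

# 100 ms of silence followed by 100 ms of a 1 kHz tone at an 8 kHz sampling rate.
x = pre_emphasis([0.0] * 800 +
                 [math.sin(2 * math.pi * 1000 * i / 8000) for i in range(800)])
flags = double_threshold_endpoints(x, energy_thr=1.0, zcr_thr=0.9)
```

A production double-threshold detector also extends detected segments backward using the lower zero-crossing threshold; the version above only makes the per-frame decision.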

Speech enhancement model training
The DNN network structure used here is a feedforward nonlinear neural network that maps the features of noisy speech segments to those of clean speech segments. To avoid falling into local optima during training, a deep generative model over the normalized log power spectrum of noisy speech is pre-trained by stacking multiple restricted Boltzmann machines, and the back-propagation algorithm then trains the DNN. The network is trained with mini-batch stochastic gradient descent to update the parameters and improve the convergence of the learning algorithm. After training, the model takes the log power spectrum features of the input noisy speech and produces estimates of the log power spectra of the clean speech and the noise, which serve as the DNN learning targets. Since the time-frequency mask value represents the speech proportion at each time-frequency point, the clean speech estimate can be post-processed with the mask. At the same time, to better handle non-stationary noise, we use noise-aware training so that the DNN obtains an estimate of the noise scene: noise estimates are appended to the DNN input, so that the model can use more noise information to estimate clean speech more accurately. The noise signal is pre-estimated from the log power spectra Ym of the first M frames of the current noisy utterance, with M = 6. In the training phase, a voice activity detection (VAD) model determines whether each noisy frame contains speech. After obtaining the CDNN model, which effectively reduces speech distortion, and the NDNN model, which effectively removes background noise, the two are combined in a further refinement step into the VAD-DNN fusion model. Table 1 compares the PESQ results of the DNN model and the VAD-DNN model for three kinds of noise.
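To make the mask-based post-processing step concrete, the sketch below uses an ideal ratio mask, a common DNN learning target; the paper does not specify its exact mask definition, so treat this as an assumption:

```python
def ideal_ratio_mask(speech_pow, noise_pow):
    """IRM per time-frequency bin: speech power over total power, in [0, 1]."""
    return [[s / (s + n) if s + n > 0 else 0.0 for s, n in zip(srow, nrow)]
            for srow, nrow in zip(speech_pow, noise_pow)]

def apply_mask(noisy_pow, mask):
    """Post-process: scale each noisy bin by the mask to estimate the clean spectrum."""
    return [[y * m for y, m in zip(yrow, mrow)]
            for yrow, mrow in zip(noisy_pow, mask)]

# One frame, two frequency bins: strong speech in bin 0, strong noise in bin 1.
mask = ideal_ratio_mask([[4.0, 0.5]], [[1.0, 4.5]])
clean_est = apply_mask([[5.0, 5.0]], mask)
```

In a real system the DNN predicts the mask from the noisy log power spectrum; here the mask is computed from known speech and noise powers purely to show how it attenuates noise-dominated bins.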
Table 2 shows the subjective listening results of the VAD-DNN model and the DNN model under three kinds of mismatched noise. For the subjective test, 10 subjects are selected, and each subject is assigned audiometric sentences covering every signal-to-noise ratio. The sentences enhanced by the VAD-DNN model receive more user preferences, and its listening results are better than those of the DNN model.
The time required for parallel training with the BP algorithm can be expressed as the sum of the training time of the DataNodes and the communication time. Assuming there are n DataNodes, the time for parallel training with the batch BP algorithm is approximately

T(n) ≈ T_serial / n + T_comm(n),

where T_serial is the serial training time and T_comm(n) is the communication overhead, which grows with n. The analysis shows that the computation part of parallel training is about 1/n of serial training: the computing time decreases as the number of DataNodes increases, but the communication time increases accordingly, so adding DataNodes beyond a certain point yields diminishing returns.
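This trade-off can be made concrete with a toy cost model (the constants below are illustrative, not measurements from the paper's cluster):

```python
def parallel_training_time(t_serial, comm_per_node, n):
    """T(n) = T_serial / n + c * n: compute shrinks with n, communication grows with n."""
    return t_serial / n + comm_per_node * n

def best_node_count(t_serial, comm_per_node, candidates):
    """Pick the candidate node count with the smallest modeled total time."""
    return min(candidates, key=lambda n: parallel_training_time(t_serial, comm_per_node, n))

# With these illustrative constants, 6 nodes beats both 4 and 8,
# mirroring the saturation behavior reported in the experiment.
best = best_node_count(3600.0, 100.0, [4, 6, 8])
```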
To determine the optimal number of nodes for the model in this paper, Fig. 1 shows the training efficiency of the batch BP parallel training method with 4, 6, and 8 nodes. The number of nodes has little effect on the recognition rate, and parallel training reaches the final stable recognition rate in less time than serial training. The training time with 8 nodes is longer than with 6 nodes because, at 6 nodes, the training efficiency of the model has already saturated; adding more nodes only increases the communication time between nodes and thus reduces efficiency. Therefore, more nodes do not always mean better results. Depending on the cluster model, data size, and hardware performance, the optimal number of nodes differs. For the IV3 DensNet speech recognition model used in this chapter's experiments, the optimal number of nodes is 6.

Design and implementation of the multidimensional perspective analysis system for voice data

System architecture design
The speech recognition data acquisition client, the speech data enhancement model, and the speech recognition machine learning model belong to different systems and run in different environments. Therefore, the voice acquisition system must provide a call mechanism through which the client system can invoke the corresponding functions. This paper uses the socket communication mechanism: the voice recognition module, the voice data enhancement module, the voice recognition model training module, and the language training module act as the socket server, and the voice data acquisition system acts as the socket client. When the system needs one of these functions, it establishes a socket connection with the server to complete two-way data transmission. To avoid the performance cost of repeatedly establishing socket connections, a message broker module is built on the server side: when the client software starts, it establishes a long-lived socket connection with the message broker module and forwards call messages through it. After the server finishes a task, the result is forwarded back to the caller through the message broker module and displayed in the system GUI, as shown in Fig. 2.
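A minimal sketch of this request/reply pattern over a socket follows (localhost only; the real server modules and message formats are not specified in the paper, so the `recognize:` message here is hypothetical):

```python
import socket
import threading

def start_server(host="127.0.0.1", port=0):
    """Broker-style server: accept one connection, read a task message, reply with a tagged result."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))
    srv.listen(1)

    def handle():
        conn, _ = srv.accept()
        with conn:
            data = conn.recv(1024)            # e.g. b"recognize:sample.wav"
            conn.sendall(b"result:" + data)   # a real server would dispatch to a module here
        srv.close()

    threading.Thread(target=handle, daemon=True).start()
    return srv.getsockname()[1]               # actual port chosen by the OS

def call_server(port, message):
    """Client side: open a connection, send the task, wait for the reply."""
    with socket.create_connection(("127.0.0.1", port)) as c:
        c.sendall(message)
        return c.recv(1024)
```

A production broker would keep the connection open across many calls (the long-lived connection the text describes) and frame messages explicitly; this sketch handles a single round trip to show the two-way data flow.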

Extraction of multidimensional feature parameters of speech signal
Feature parameter extraction is the key technology for identifying speaker-related information. Its purpose is to extract, from the speech signal of a known object, the individual features that distinguish it from other objects, and to recover, from the rich features hidden in the speech signal, those that characterize the recognized objects. In the real environment, the voice signal is actually a mixed signal containing a great deal of useful information. It can reflect the personal characteristics of the speaker, such as the speaker's gender and emotional state when speaking, and can also reflect different speakers, different semantic features, and so on. Therefore, the great challenges faced by multidimensional recognition tasks are: (1) feature parameter representations may be incomplete or redundant; (2) features are easily affected by noise and other factors and thus have stability problems. At present, there are three representative categories of commonly used acoustic features: prosodic features, timbre features, and spectrum-based features. The features used in traditional one-dimensional information recognition must meet the following standards: (1) They differ significantly across speaker identities, emotional states, and genders, so that different types of target objects can be identified accurately.
(2) They are easy to obtain, with moderate computational cost.
(3) They are less affected by noise and channel degradation and have good stability. (4) They are difficult to imitate. (5) They are not easily affected by space, time, and other factors.
On this basis, parameters carrying different parts of the multidimensional identification information should be combined into one feature vector. From the perspective of space division, a one-dimensional feature is like a coordinate on a line, where thresholds can separate one object from others; two-dimensional features are like a plane, where a curve is needed to separate different objects; the three-dimensional case is like a space, where a complex surface is needed to divide different objects. It follows that only fusion features composed of multiple parameters can provide comprehensive and useful enough information for multidimensional recognition tasks.
The energy component of speech signal is closely related to emotion recognition and is a very important prosodic component. It is determined by the amplitude of the intra-frame signal vibration and represents the intensity of the sound. When sad, the voice is usually very low; when emotions are excited, such as anger or happiness, the voice is usually high pitched. In this paper, short-term energy is used as a part of the fusion feature in multidimensional recognition.
The Gaussian mixture model (GMM) can be used to identify speakers and is the main identification method. For speaker identification, the probability density function of an M-order GMM is

p(x | λ) = ∑ ci · bi(x), summed over i = 1, 2, …, M,

where x is the K-dimensional acoustic feature vector, λ is the parameter set of the GMM, i is the index of the Gaussian component, bi(x) is the density of the i-th component, and ci is the mixing weight of the i-th component, satisfying ∑ ci = 1. The GMM parameter set λ consists of the weight, mean vector, and covariance matrix of each mixture component:

λ = {ci, μi, Σi}, i = 1, 2, …, M.

The covariance matrix can be a general matrix or a diagonal matrix; because a diagonal matrix is easy to compute and performs well, it is usually chosen. GMM training is the process of obtaining a set of model parameters λ, generally through parameter estimation. For a given set of speech feature parameters, the training vectors can be expressed as X = {xt}, t = 1, 2, …, T, where T is the total number of training speech frames, and the likelihood of the GMM is

P(X | λ) = ∏ p(xt | λ), taken over t = 1, 2, …, T.

Following the process of one-dimensional speech recognition, we can design a process framework for the simulation analysis of the multidimensional view of speech data, as shown in Fig. 3, which includes four steps: preprocessing, feature extraction, model matching, and judgment of results. First of all, multidimensional recognition has several recognition tasks, and the system eventually outputs the judgment results of all of them at the same time. In the preprocessing part, multidimensional speech recognition is similar to one-dimensional speech recognition: voice activity detection, framing, windowing, enhancement, noise reduction, and so on are carried out to make the speech signal more suitable for recognition and classification.
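The GMM scoring described above can be sketched for diagonal covariances as follows (the per-speaker parameters below are toy values chosen for illustration, not trained models):

```python
import math

def gmm_logpdf(x, weights, means, variances):
    """Log of p(x | lambda) = sum_i c_i N(x | mu_i, diag(var_i))."""
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        log_comp = 0.0
        for xj, mj, vj in zip(x, mu, var):
            log_comp += -0.5 * (math.log(2 * math.pi * vj) + (xj - mj) ** 2 / vj)
        total += c * math.exp(log_comp)
    return math.log(total)

def identify_speaker(x, speaker_models):
    """Pick the speaker whose GMM assigns the highest likelihood to feature vector x."""
    return max(speaker_models, key=lambda name: gmm_logpdf(x, *speaker_models[name]))

# Toy 1-D models: speaker "A" centered at 0, speaker "B" centered at 5.
models = {"A": ([1.0], [[0.0]], [[1.0]]),
          "B": ([1.0], [[5.0]], [[1.0]])}
who = identify_speaker([0.1], models)
```

In practice the parameters would be estimated with the EM algorithm over the training vectors X = {xt}, and the per-frame log-likelihoods would be summed over a whole utterance before comparing speakers.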
In feature extraction, each one-dimensional recognition task generally has its own appropriate feature parameters. Multidimensional speech recognition, by contrast, involves many recognition tasks that are correlated with one another and requires multidimensional speech information; unlike one-dimensional recognition, it needs parameters that contain speech information of different dimensions and are applicable to as many of the classification tasks as possible, which makes it significantly more difficult. In pattern matching and decision-making, multidimensional speech recognition also differs greatly from one-dimensional recognition. In the training phase, multidimensional models are trained for multiple recognition tasks at the same time; the tasks share model parameters and interact, so that the multidimensional model and features stay consistent and different kinds of speech information are learned from the speech simultaneously. In the pattern matching and judgment phase, multidimensional speech recognition uses multidimensional templates to judge multiple recognition tasks at the same time; the parameters and results of the tasks influence each other, and the recognition results of all tasks are finally displayed together. The multidimensional model places certain requirements on the choice of classifier: it can be a combination of multiple classifiers or a single classifier that outputs multiple judgment results at once. In general, the design of a multidimensional speech recognition model has much in common with a one-dimensional model but is more difficult in feature selection and classifier construction.

Analysis of speech feature recognition results
The recognition rates of the model trained on digital voice N3 when tested on other voices are shown in Table 3. As Table 3 and Fig. 4 show, the recognition rate is high when the model trained on digital voice N3 is tested on digital voices N1 and N2, and low when it is tested on text tones T1, T2, T3, and T4.
Because this work uses two types of fused feature vectors, and considering the possible over-compression problem of low-dimensional features and the redundancy problem of high-dimensional features, the experiment is designed to compare the effects of different features. For multidimensional recognition performance, low-dimensional and high-dimensional features are used in turn as the input of the reference system to obtain the recognition rates of the three multidimensional recognition categories, and the overall multidimensional recognition rate is their average, as shown in Fig. 5.
As Fig. 5 shows, the first three columns are the three single-category results of multidimensional identification based on the reference system proposed in this paper. Except for gender classification, the recognition rate of high-dimensional features is generally higher than that of low-dimensional features. The high-dimensional features obtained with the OpenSMILE toolkit thus perform excellently: because they cover many acoustic characteristics, they alleviate the over-compression of the original signal and make full use of the relationship between the voice feature waveform and each frame. For simultaneous recognition of multidimensional speaker information, the recognition rate of the system using high-dimensional features is 1.46% higher than that of the system using low-dimensional features.
To reduce the multi-instance multi-label (MIML) learning framework to the ML learning framework, the k-medoids algorithm executed at the level of input speech samples plays an important role. The choice of the parameter k, the number of subsets after grouping, is critical. To improve the classification effect of multidimensional recognition, this work needs to determine a suitable number of clustering subsets for the voice database. The parameter k is proportional to the clustering ratio, so only this ratio needs to be set. The experiment compares the influence of different clustering ratio parameters on the recognition rate, and the results are shown in Fig. 6. As the clustering ratio increases, the multidimensional recognition rate curves for different features follow essentially the same trend: a significant rise followed by gradual flattening. In practical applications, computational complexity and storage consumption must also be considered; therefore, the low-dimensional features are kept below 988 dimensions. To obtain the best clustering rate, since the high-dimensional features contain more information and require longer computation, they should be compressed by a large ratio and the low-dimensional features by a small ratio, which matches the actual situation. In addition, the figure shows that the average speaker recognition rate using multidimensional information from high-dimensional features is higher than that using low-dimensional features, which shows the importance of full sampling.
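The grouping step can be sketched with a simple k-medoids routine (a greedy variant written for illustration; the paper's exact algorithm, distance measure, and data are not given):

```python
def k_medoids(points, k, dist, iters=10):
    """Greedy k-medoids: assign each point to its nearest medoid, then
    re-pick each cluster's medoid as the member minimizing total in-cluster distance."""
    medoids = points[:k]                       # naive initialization
    clusters = {}
    for _ in range(iters):
        clusters = {i: [] for i in range(k)}
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, medoids[i]))
            clusters[nearest].append(p)
        new_medoids = []
        for i in range(k):
            members = clusters[i] or [medoids[i]]
            new_medoids.append(min(members,
                                   key=lambda c: sum(dist(c, m) for m in members)))
        if new_medoids == medoids:             # converged
            break
        medoids = new_medoids
    return medoids, clusters

# Six 1-D "samples" forming two obvious groups.
medoids, clusters = k_medoids([0, 1, 2, 10, 11, 12], k=2,
                              dist=lambda a, b: abs(a - b))
```

Unlike k-means, the cluster centers here are always real samples, which is why k-medoids suits the sample-level grouping the text describes: each subset can be represented by an actual speech instance.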

System performance test analysis
Because the main purpose of the system is to help staff collect voice input data, there are many function calls in the voice input process. To verify correct operation of the software, this section tests only the input and output functions of the speech recognition process.
(1) Speech recognition function test. Test purpose: whether the voice recognition function correctly recognizes the voice transcription results, and whether the recognition results are automatically inserted into the result table.
Prerequisites: The voice recognition function has been successfully activated, and the date and time information has been successfully entered.  Step: Click the table tennis match video to start recording, and click it again to stop recording.
Expected results: the speech recognition results are correct and successfully entered into the data sheet and displayed in the operation bar.
Actual results: After manually correcting the speech recognition results, the results are successfully entered into the data table and displayed in the operation column.
(2) Voice data enhancement function test. Test purpose: whether the voice data enhancement module can add a voice dataset and add enhanced voice to the current dataset.
Prerequisite: The dataset contains at least one voice.
Step: Click the ''Data Enhancement'' button to start the data enhancement function pop-up menu, select the data enhancement method and enter the parameters corresponding to the enhancement method.
Expected results: The amount of data in the voice dataset has increased.
Actual result: In the local folder and in the current dataset, the dataset is enlarged, and the added audio can be played normally.
(3) Detailed voice information display function test. Test purpose: check specific sound information, including duration, corresponding sound tag and spectrum diagram.
Prerequisite: The voice set contains at least one voice.
Step: Select a voice from the voice dataset, and then click the View Information button.
Expected results: the details of the voice file are obtained.
Actual results: A detailed information interface appears, displaying information such as sound name, total duration, sampling rate, corresponding sound tag, and spectrogram.

Conclusion
Since the beginning of the twenty-first century, information technology has developed rapidly. In the wave of artificial intelligence, simple, fast, and smooth human-computer interaction has become a common goal. Speech interaction has always been an important part of human-computer interaction, and speech recognition is its key technology. However, most current research on speech recognition focuses on recognizing a single kind of content or information. Therefore, based on speech recognition technology, this paper constructs a multidimensional perspective analysis system for speech data by considering the human brain's ability to process multidimensional speech information and by exploiting the correlation between different kinds of multidimensional speech information. Based on mixed signal processing, machine understanding can be improved, and grasping the real meaning of speech can meet the requirements of intelligent human-computer interaction, making speech recognition technology more anthropomorphic and intelligent.