2.4.1 DTW-based intonation assessment algorithm
For intonation assessment, this paper adopts a feature-comparison method: the quality of a learner's intonation is measured by the difference between the learner's pronunciation and the corresponding reference standard pronunciation. The assessment proceeds as follows. First, the pitch feature, which represents intonation characteristics, is extracted from the learner's speech and from the reference standard speech, and the MFCC acoustic features are extracted at the same time. Then, the learner's speech is aligned with the standard phonetic information and segmented into phonemes accordingly. Next, using the obtained phoneme boundary information, the two utterances are first aligned at the level of phoneme segments with similar content, and DTW is then used to align the two pitch feature sequences within each phoneme segment and to calculate the similarity between them. Finally, the trained score mapping model maps this similarity to the final intonation score. The block diagram of intonation assessment is shown in Fig. 3.
The DTW algorithm is based on dynamic programming and combines time warping with distance calculation, which effectively solves the matching problem between time series of different lengths [18].
Let the pitch sequence extracted from the learner's speech be \(\text{T}=\left\{{\text{t}}_{1},{\text{t}}_{2},\dots ,{\text{t}}_{\text{N}}\right\}\) and the pitch sequence of the corresponding reference standard speech be \(\text{R}=\left\{{\text{r}}_{1},{\text{r}}_{2},\dots ,{\text{r}}_{\text{M}}\right\}\), where N and M are the numbers of speech frames in the two sequences and each element is the pitch value extracted from one frame.
The DTW algorithm seeks a warping path \(\text{W}=\left\{{\text{w}}_{1},{\text{w}}_{2},\dots ,{\text{w}}_{\text{K}}\right\}\) that minimizes the distance between the two sequences along the path. Here K is the path length, and \({\text{w}}_{\text{k}}=\left({\text{t}}_{{\text{i}}_{\text{k}}},{\text{r}}_{{\text{j}}_{\text{k}}}\right),\text{k}=\text{1,2},\dots ,\text{K}\) means that the k-th point of the path matches frame \({\text{i}}_{\text{k}}\) of sequence T with frame \({\text{j}}_{\text{k}}\) of sequence R. Because N and M generally differ, the two frame indices are not required to be equal. In other words, the DTW algorithm finds a time warping function W that maps the learner's pitch sequence T nonlinearly onto the reference standard pitch sequence R such that the cumulative distortion between the two sequences is minimized.
Given the temporal nature of the speech signal, the time warping function W must satisfy certain constraints, typically monotonicity, continuity, and start-point and end-point constraints. In addition, to reduce the computational complexity of the DTW algorithm and improve system accuracy, the region in which the warping function may lie is usually restricted, which gives rise to different path constraints. The DTW algorithm starts from the beginning of the sequences and recursively calculates the cumulative distance until the end of the sequences being matched. In intonation assessment, the DTW distance between the two pitch sequences is calculated to measure the intonation similarity between the learner's speech and the reference standard speech. Once the DTW distance has been calculated, it is mapped by Formula (4) to an intonation score comparable to the manual score.
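The recursion described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this paper: the function name `dtw_distance`, the absolute difference used as the local frame distance, the three-step pattern, and the optional band parameter `window` (a Sakoe-Chiba style restriction of the warping region) are all assumptions made for the example.

```python
import math


def dtw_distance(t, r, window=None):
    """Cumulative DTW distance between pitch sequences t (learner) and r (reference).

    Assumptions for illustration: local distance is |t_i - r_j|, allowed steps
    are (1,0), (0,1) and (1,1), and `window` optionally limits |i - j|.
    """
    n, m = len(t), len(r)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    # The band must be at least |n - m| wide, otherwise the end point
    # (n, m) cannot be reached.
    if window is None:
        window = max(n, m)
    window = max(window, abs(n - m))

    inf = math.inf
    # d[i][j]: minimal cumulative distance for aligning t[:i] with r[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0  # start-point constraint: the path begins at (t_1, r_1)

    for i in range(1, n + 1):
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            cost = abs(t[i - 1] - r[j - 1])
            # Monotonicity and continuity: each step advances by at most
            # one frame in either sequence and never moves backwards.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])

    return d[n][m]  # end-point constraint: the path ends at (t_N, r_M)
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` returns 0.0, because the repeated reference frame is absorbed by the warping even though the two sequences differ in length.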
\(\text{score}=\frac{100}{1+\text{a}{\left(\text{dist}\right)}^{\text{b}}}\) | (4) |
Here dist is the DTW distance between the two pitch sequences, a and b are parameters to be trained, and score is the final intonation score, which lies in the range 0 to 100.
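As an illustration of Formula (4), the following sketch maps a DTW distance to a score and fits a and b by a simple grid search against manual scores. The training procedure is not specified in this section, so the least-squares grid search, the function names, and all numerical values below are hypothetical and serve only to show how the mapping behaves.

```python
def intonation_score(dist, a, b):
    """Formula (4): map a non-negative DTW distance to a score of at most 100."""
    if dist < 0:
        raise ValueError("DTW distance must be non-negative")
    return 100.0 / (1.0 + a * dist ** b)


def fit_parameters(dists, manual_scores, a_grid, b_grid):
    """Hypothetical fitting step: choose (a, b) from the given grids so that the
    squared error between mapped scores and manual scores is smallest."""
    best = None
    for a in a_grid:
        for b in b_grid:
            err = sum(
                (intonation_score(d, a, b) - s) ** 2
                for d, s in zip(dists, manual_scores)
            )
            if best is None or err < best[0]:
                best = (err, a, b)
    return best[1], best[2]
```

With a = 0.25 and b = 2, a distance of 0 gives a score of 100 and a distance of 2 gives 100 / (1 + 0.25 · 4) = 50, so larger distances are mapped to lower scores.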
2.4.2 Oral scoring algorithm based on SVR
Support vector regression (SVR) is a machine learning algorithm based on the structural risk minimization criterion. It can learn complex data patterns from only a limited number of training samples [19].
Consider the training data set \(\left\{\left({\text{x}}_{1},{\text{y}}_{1}\right),\left({\text{x}}_{2},{\text{y}}_{2}\right),\cdots ,\left({\text{x}}_{\text{m}},{\text{y}}_{\text{m}}\right)\right\}\), where \({\text{x}}_{\text{i}}\in {\text{R}}^{\text{n}}\) is the n-dimensional feature vector extracted from the i-th spoken passage, \({\text{y}}_{\text{i}}\in \text{R}\) is the manual score corresponding to \({\text{x}}_{\text{i}}\), and m is the total number of samples in the training data set. The goal is to train on all sample pairs \(\left({\text{x}}_{\text{i}},{\text{y}}_{\text{i}}\right)\) in the data set to find a regression function \(\text{y}=\text{f}\left(\text{x}\right)\) that is as flat as possible, approximates the relationship between features and scores, and minimizes the prediction error.
Nonlinear data \(\text{x}\) are difficult to fit with a linear function in the original space. The SVR algorithm therefore uses a nonlinear function \({\Phi }\left(\text{x}\right)\) to map x into a high-dimensional feature space, in which the regression function is defined as:
\(\text{f}\left(\text{x}\right)=\langle \text{w},{\Phi }\left(\text{x}\right)\rangle +\text{b}\) | (5) |
Here \(\text{w}\) is the weight vector, \(\text{b}\) is the offset term, and \(\langle \cdot ,\cdot \rangle\) denotes the inner product.
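To make Formula (5) concrete, the sketch below evaluates the regression function with an explicit feature map. The map `phi` (the original features plus their squares) is a hypothetical choice made only for illustration; practical SVR implementations normally use a kernel function instead of computing \({\Phi }\left(\text{x}\right)\) explicitly. The ε-insensitive loss shown alongside it is the loss used in standard SVR and is not stated in this section; it is included to show what prediction error means in that setting.

```python
def phi(x):
    """Hypothetical explicit feature map: the features followed by their squares."""
    return list(x) + [v * v for v in x]


def inner(u, v):
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(p * q for p, q in zip(u, v))


def f(x, w, b):
    """Formula (5): f(x) = <w, phi(x)> + b."""
    return inner(w, phi(x)) + b


def eps_insensitive_loss(y, y_hat, eps):
    """Standard SVR loss: deviations smaller than eps are not penalized."""
    return max(0.0, abs(y - y_hat) - eps)
```

For x = (1, 2), w = (1, 0, 0.5, 0.25) and b = 3, the mapped vector is (1, 2, 1, 4) and f(x) = 1 + 0 + 0.5 + 1 + 3 = 5.5. A manual score of 5.0 would then incur zero loss for ε = 1, since the deviation of 0.5 lies inside the tolerance band.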
The process of pronunciation quality assessment based on the SVR algorithm is as follows:
(1) From the English reading pronunciation data of Chinese students collected in the experiment, the pronunciation standard, fluency and prosodic features are extracted, and the feature scores are calculated;
(2) The cubic polynomial function \({\text{a}}_{1}{\text{x}}_{\text{i}}^{3}+{\text{a}}_{2}{\text{x}}_{\text{i}}^{2}+{\text{a}}_{3}{\text{x}}_{\text{i}}+{\text{a}}_{4}\) is used to normalize each feature score \({\text{x}}_{\text{i}}\), so that its range is consistent with that of the manual score;
(3) The SVR training sample set is constructed with the multi-dimensional evaluation feature scores as input and the manual score as output;
(4) The parameters of the SVR scoring model are trained;
(5) The trained SVR scoring model is used to fuse these features, so as to achieve an effective evaluation of the overall quality of the students' reading pronunciation.
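The normalization and sample-construction steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the use of one coefficient tuple per feature, and the example numbers are hypothetical, and how the coefficients \({\text{a}}_{1},\dots ,{\text{a}}_{4}\) are obtained is not specified in this section.

```python
def cubic_normalize(x, coeffs):
    """Apply a1*x^3 + a2*x^2 + a3*x + a4 to one feature score (Horner form)."""
    a1, a2, a3, a4 = coeffs
    return ((a1 * x + a2) * x + a3) * x + a4


def build_training_set(feature_scores, manual_scores, coeffs_per_feature):
    """Build SVR inputs X and targets y.

    feature_scores: one row per utterance, one raw score per feature.
    coeffs_per_feature: one (a1, a2, a3, a4) tuple per feature (assumed given).
    """
    if len(feature_scores) != len(manual_scores):
        raise ValueError("each utterance needs exactly one manual score")
    X = []
    for row in feature_scores:
        if len(row) != len(coeffs_per_feature):
            raise ValueError("one coefficient tuple is required per feature")
        X.append([cubic_normalize(v, c) for v, c in zip(row, coeffs_per_feature)])
    y = list(manual_scores)
    return X, y


# Hypothetical example: two features for a single utterance.
X, y = build_training_set(
    [[2.0, 1.0]],
    [4.0],
    [(1.0, 0.0, -1.0, 3.0), (0.0, 0.0, 10.0, 0.0)],
)
```

The resulting `X` and `y` would then be passed to an SVR implementation for training, for example the `SVR` class in scikit-learn, whose `predict` method would give the fused overall score for new utterances.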