Structural analysis of the SMOSS model
The core idea of the SMOSS model is to extract local features from sentences with sliding windows, thereby capturing local semantic information [17, 18]. Its structure consists of the following key components: an input layer, a sliding window, a feature extractor, feature matching, and an output layer [19]. The core of the SMOSS model is the matching function, which performs a matching operation on each sliding window. Assuming the sliding window size is w and the window position is p, the matching function \(M\left(p\right)\) can be expressed as Eq. (1) [20].
$$M\left(p\right)=f\left(X\left[p:p+w\right]\right)$$
1
\(X[p:p+w]\) represents the word vector sequence in a window of length w starting at position p. The function f(·) is a nonlinear mapping that maps the word vectors in the window to the matching representation space.
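As a concrete illustration of Eq. (1), the sketch below slides a window over a matrix of word vectors; since the paper leaves the nonlinear map f(·) abstract, it is represented here by a dense layer with tanh activation, and the names and sizes (window_size, the match dimension) are illustrative rather than taken from the paper.

```python
import tensorflow as tf

# Minimal sketch of Eq. (1), M(p) = f(X[p:p+w]). X is a (seq_len, embed_dim)
# matrix of word vectors with a statically known length; f is illustrated
# by a dense layer with tanh, since the paper leaves it abstract.
window_size = 3
f = tf.keras.layers.Dense(64, activation="tanh")    # match_dim = 64, illustrative

def match_windows(X):
    n = X.shape[0] - window_size + 1                # number of positions p
    windows = tf.stack([X[p:p + window_size] for p in range(n)])
    flat = tf.reshape(windows, (n, -1))             # flatten each window
    return f(flat)                                  # one M(p) per row, shape (n, 64)
```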
To capture the relationship between matching representations at different positions, the SMOSS model introduces a matching attention mechanism. Let A denote the attention weights between matching representations; the weighted representation of the i-th matching representation \(M\left({p}_{i}\right)\) is given by Eq. (2) [21].
$${M}_{\text{weighted }}\left({p}_{i}\right)=\sum _{j=1}^{N} {A}_{ij}\cdot M\left({p}_{j}\right)$$
2
\({A}_{ij}\) represents the attention weight between the i-th matching representation and the j-th matching representation.
The weighted matching representations are fed into a fully connected layer to obtain the overall SMOSS representation, as shown in Eq. (3) [22].
$$\mathrm{SMOSS}_{\text{output}}=\mathrm{ReLU}\left(W\cdot M_{\text{weighted}}+b\right)$$
3
W and b are the parameters of the fully connected layer, and ReLU represents the activation function.
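Continuing the sketch above, Eqs. (2) and (3) can be realized as follows; the paper does not specify how the attention weights \({A}_{ij}\) are computed, so softmax-normalized dot-product similarity is assumed here for illustration.

```python
# Minimal sketch of Eqs. (2)-(3): attention-weighted matching representations
# followed by a ReLU fully connected layer. M stacks M(p_1) .. M(p_N) row-wise.
fc = tf.keras.layers.Dense(128, activation="relu")  # W and b of Eq. (3)

def smoss_output(M):
    A = tf.nn.softmax(tf.matmul(M, M, transpose_b=True), axis=-1)  # assumed A_ij
    M_weighted = tf.matmul(A, M)    # Eq. (2): sum_j A_ij * M(p_j)
    return fc(M_weighted)           # Eq. (3): ReLU(W * M_weighted + b)
```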
The specific structure of SMOSS is shown in Fig. 1.
As Fig. 1 shows, the input layer of the SMOSS model receives the text data to be processed. In the syntactic analysis task, the input layer transforms English sentences into word sequences and embeds them, mapping each word to a dense vector representation that better captures the semantic relationships between words. The sliding window is one of the core components of the SMOSS model: it cuts the input word sequence into several fixed-length subsequences, and feature extraction and matching are then performed on each subsequence [23]. The sliding window size is an important hyperparameter, as it determines the range of the local features.
Optimization of the improved LSTM model based on CTC
Connectionist Temporal Classification (CTC) is an end-to-end sequence learning method widely used in sequence labelling tasks. The basic structure of CTC is shown in Fig. 2.
In this paper, CTC is applied to the syntactic analysis task, and the LSTM model is optimized with the CTC loss function to improve the syntactic analysis effect in English writing teaching [24].
In a sequence labelling task, given the input sequence \(X=\left({x}_{1},{x}_{2},\dots ,{x}_{T}\right)\) of length T, the model must predict the output sequence \(Y=\left({y}_{1},{y}_{2},\dots ,{y}_{U}\right)\) of length U. However, because sequence lengths may differ across samples, traditional labelled data usually need to be strictly aligned, that is, the input and output sequences are required to have the same length [25]. This is a challenge for syntactic analysis, because learners' sentence structures are diverse, so different sentences have different lengths.
The key to the CTC principle is that it solves the sequence alignment problem by introducing a blank symbol Ø together with a merging rule for repeated symbols. Suppose the target output sequence is "hello". The goal of the CTC model is to find all alignment paths that collapse to this output and to sum their probabilities. To this end, CTC allows blank symbols in the alignment path, marking positions where no label is emitted, and merges consecutive repetitions of the same symbol into one. A blank between identical characters, as in "heØlØlo", keeps the repeated l's from being merged, whereas a path such as "hellllo" collapses to "helo". Through the blank symbol and the merging rule, CTC can enumerate all possible alignment paths [26].
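The collapse map B underlying \({B}^{-1}\left(Y\right)\) can be stated in a few lines of plain Python; the blank is written Ø as in the text.

```python
# The CTC collapse map B: merge consecutive repeats, then remove blanks.
# B^{-1}(Y) in Eq. (4) is the set of all paths that collapse to Y.
BLANK = "Ø"

def collapse(path):
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != BLANK:  # merge repeat runs, drop blanks
            out.append(ch)
        prev = ch
    return "".join(out)

assert collapse("hheØlØlloo") == "hello"  # the blank preserves the double l
assert collapse("hellllo") == "helo"      # without a blank, repeats merge
```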
When calculating the CTC loss function, all possible alignment paths are accumulated to measure the difference between the predicted output sequence and the true output sequence [27]. Minimizing this loss makes the model better fit the mapping between input and output sequences, thereby optimizing the whole LSTM model.
Given the input sequence \(X=\left({x}_{1},{x}_{2},\dots ,{x}_{T}\right)\) and the output sequence \(Y=\left({y}_{1},{y}_{2},\dots ,{y}_{U}\right)\), the CTC loss function is defined as Eq. (4).
$${L}_{CTC}=-\log\sum _{\pi \in {B}^{-1}\left(Y\right)} P(\pi \mid X)$$
4
\({B}^{-1}\left(Y\right)\) represents the set of all alignment paths that collapse to the output sequence Y. \(P(\pi \mid X)\) represents the conditional probability of the path \(\pi\) given the input sequence X. Minimizing \({L}_{CTC}\) optimizes the model parameters so that the difference between the predicted output sequence and the true output sequence is as small as possible.
Through the blank symbol and the merging rule for repeats, CTC's flexible sequence alignment mechanism enables the improved LSTM model to learn without fully aligned labels and to better adapt to the diverse sentence structures in learners' written expression. This provides new ideas and methods for applying syntactic analysis in English writing teaching, and lays the foundation for the innovation of this study.
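In TensorFlow, named below as the framework used in this study, Eq. (4) corresponds to the built-in tf.nn.ctc_loss; the shapes and sizes in this sketch are illustrative.

```python
# Minimal sketch of Eq. (4) with TensorFlow's built-in CTC loss.
# The last class index is reserved for the blank Ø.
batch, T, num_labels, U = 4, 50, 30, 12                     # illustrative sizes
logits = tf.random.normal((batch, T, num_labels + 1))       # stand-in for LSTM output
labels = tf.random.uniform((batch, U), 0, num_labels, dtype=tf.int32)

loss = tf.nn.ctc_loss(
    labels=labels,
    logits=logits,
    label_length=tf.fill([batch], U),
    logit_length=tf.fill([batch], T),
    logits_time_major=False,    # logits are (batch, time, classes)
    blank_index=num_labels,     # blank is the last class
)                               # per sample: -log of the sum over B^{-1}(Y) of P(pi|X)
```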
To apply the CTC method to the optimization of the improved LSTM model, the CTC loss function is connected to the output layer of the LSTM. In a traditional LSTM model, the output layer maps the hidden states to the classification label space through a fully connected layer. In this paper, however, the output sequence of the LSTM model is used directly as the input sequence of CTC, without introducing an additional fully connected layer.
In the LSTM model, the hidden state sequence \(H=\left({h}_{1},{h}_{2},\dots ,{h}_{T}\right)\) is obtained by recursive computation, where \({h}_{t}\) denotes the hidden state at time t, and the output sequence \(O=\left({o}_{1},{o}_{2},\dots ,{o}_{T}\right)\) of the LSTM model is derived from H.
The procedure for combining the improved LSTM model with CTC is shown in Table 1 [28]; a minimal code sketch follows the table:
Table 1
Combination steps of improved LSTM model based on CTC
Step number | Specific content |
1 | Input the sequence X into the LSTM model and obtain the hidden state sequence \(H=\left({h}_{1},{h}_{2},\dots ,{h}_{T}\right)\) through recursive computation. |
2 | Derive the output sequence \(O=\left({o}_{1},{o}_{2},\dots ,{o}_{T}\right)\) of the LSTM model from the hidden states H, and take O as the input sequence of CTC. |
3 | The CTC model uses the blank symbol Ø and the merging rule for repeated symbols to find all possible alignment paths and learn the correspondence between output sequences and true labels. |
4 | Minimize the CTC loss function to optimize the parameters of the whole model, that is, \({min}_{\theta } {L}_{CTC}\left(\theta \right)\), where \(\theta\) denotes the model parameters. |
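A minimal sketch of the four steps in Table 1, reusing the tf.nn.ctc_loss conventions above; the LSTM width is set to the label count plus one blank class so that, as stated in the text, no extra fully connected layer is needed.

```python
# Steps 1-4 of Table 1 in one forward pass: X -> H -> O -> CTC loss.
num_labels = 30
lstm = tf.keras.layers.LSTM(num_labels + 1, return_sequences=True)  # +1 for Ø

def ctc_forward(X, labels, label_length):
    O = lstm(X)                      # steps 1-2: hidden states H yield outputs O
    return tf.nn.ctc_loss(           # steps 3-4: align with Ø and score
        labels=labels,
        logits=O,
        label_length=label_length,
        logit_length=tf.fill([tf.shape(X)[0]], tf.shape(X)[1]),
        logits_time_major=False,
        blank_index=num_labels,
    )
```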
The training of the improved LSTM model based on CTC aims to optimize the model parameters by minimizing the CTC loss function, so as to achieve a better syntactic analysis effect. During training, the whole model is trained jointly using the back-propagation algorithm together with the CTC loss. Let \(P(Y\mid X)\) denote the probability that the CTC model maps the LSTM output sequence O to the true output sequence Y; the CTC loss is then \({L}_{CTC}=-\log P(Y\mid X)\), and the parameters of the whole model are optimized by minimizing it.
During training, the Stochastic Gradient Descent (SGD) algorithm is used for optimization, together with a learning rate adjustment strategy to speed up convergence. In addition, Dropout and L2 regularization are introduced to prevent over-fitting. Specifically, for each sample (X, Y), the gradient \({\nabla }_{\theta }{L}_{CTC}(X,Y;\theta )\) of the CTC loss with respect to the parameters \(\theta\) is computed, and the parameters are updated with the learning rate \(\alpha\) according to Eq. (5) [29].
$$\theta \leftarrow \theta -\alpha \cdot {\nabla }_{\theta }{L}_{CTC}(X,Y;\theta )$$
5
\(\alpha\) is the learning rate, which controls the step size of each parameter update.
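Using the ctc_forward sketch above, one SGD training step implementing Eq. (5) might look as follows; the learning rate is illustrative.

```python
# One SGD step of Eq. (5): theta <- theta - alpha * grad(L_CTC).
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)     # alpha

def train_step(X, labels, label_length):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(ctc_forward(X, labels, label_length))
    grads = tape.gradient(loss, lstm.trainable_variables)   # gradient w.r.t. theta
    optimizer.apply_gradients(zip(grads, lstm.trainable_variables))
    return loss
```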
To prevent over-fitting, Dropout is introduced: it randomly sets the outputs of some neurons to zero, which reduces the co-dependence between neurons and enhances the generalization ability of the model [30]. L2 regularization is adopted as well, adding the L2 norm of the parameters to the loss function to penalize overly large parameter values [31]. For Dropout, let \({h}_{i}\) denote the output of the i-th neuron; during training, a retention probability p controls whether each neuron is kept or dropped, as in Eq. (6):
$${h}_{i}=\left\{\begin{array}{ll}{h}_{i}, & {r}_{i}<p\\ 0, & {r}_{i}\ge p\end{array}\right.$$
6
\({r}_{i}\) is a random number drawn from the uniform distribution U(0,1), so each neuron is kept with probability p and dropped with probability 1 - p. At test time, Dropout is disabled and the outputs of all neurons are kept.
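A direct reading of Eq. (6), drawing \({r}_{i}\sim U(0,1)\) and keeping a neuron when \({r}_{i}<p\), can be sketched as below; note that library layers such as tf.keras.layers.Dropout instead rescale the kept activations by 1/p during training, which likewise leaves test-time outputs unchanged.

```python
# Sketch of Eq. (6): keep h_i when r_i < p, zero it otherwise.
def dropout(h, p, training=True):
    if not training:
        return h                           # test time: keep all neurons
    r = tf.random.uniform(tf.shape(h))     # r_i ~ U(0, 1)
    return tf.where(r < p, h, tf.zeros_like(h))
```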
L2 regularization penalizes overly large parameter values by adding the L2 norm of the parameters to the loss function, further preventing the model from over-fitting the training data. The regularization term is computed as in Eq. (7).
$$R\left(\theta \right)=\sum _{i=1}^{N} {\theta }_{i}^{2}$$
7
\(\theta\) denotes the model parameters and \(N\) is the total number of parameters. Adding the L2 regularization term to the original CTC loss yields the regularized loss function in Eq. (8).
$${L}_{\text{reg}}={L}_{CTC}+\lambda \cdot R\left(\theta \right)$$
8
\(\lambda\) is a regularization parameter used to control the regularization intensity.
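Eqs. (7) and (8) amount to summing the squared parameters and adding the result to the CTC loss; a sketch with an illustrative \(\lambda\):

```python
# Sketch of Eqs. (7)-(8): L_reg = L_CTC + lambda * sum_i theta_i^2.
def regularized_loss(ctc_loss, variables, lam=1e-4):
    R = tf.add_n([tf.reduce_sum(tf.square(v)) for v in variables])  # Eq. (7)
    return ctc_loss + lam * R                                       # Eq. (8)
```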
Design and implementation of the comprehensive syntactic analysis framework
In the comprehensive syntactic analysis framework, the SMOSS model is integrated with the optimized LSTM model. Specifically, the SMOSS model first captures the local features of the input text and produces encoded representations of the sliding-window subsequences. These encoded representations are then fed into the optimized LSTM model, whose encoder yields an encoded representation of the whole text.
To make better use of the contextual information shared between the SMOSS model and the LSTM model, a matching layer is introduced into the framework. The matching layer adopts an attention mechanism and realizes the interaction between local and global features by computing the similarity between each sliding-window subsequence and the other subsequences. Specifically, assuming there are N sliding-window subsequences in the input text, the output of the matching layer can be expressed as Eq. (9).
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V$$
9
Q, K and V respectively denote the local features output by the SMOSS model, the global features output by the LSTM model, and the features to be fused; \({d}_{k}\) is the feature dimension.
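Eq. (9) is standard scaled dot-product attention; as a sketch, with Q taken from the SMOSS features and K, V from the LSTM features as described above:

```python
# Sketch of the matching layer in Eq. (9): scaled dot-product attention.
def matching_layer(Q, K, V):
    d_k = tf.cast(tf.shape(K)[-1], tf.float32)
    scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(d_k)
    return tf.matmul(tf.nn.softmax(scores, axis=-1), V)
```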
To better address the task of error detection in English writing teaching, the LSTM model is extended into a sequence-to-sequence (seq2seq) model consisting of an encoder, which encodes the input text, and a decoder, which generates the corrected text.
In the encoder, the improved LSTM model encodes the input text into a representation C. In the decoder, another LSTM model generates the corrected text step by step. Assuming the target text is Y, the goal of the decoder is to maximize the conditional probability of generating the target text, as in Eq. (10).
$$P(Y\mid X)=\prod _{t=1}^{T} P\left({y}_{t}\mid {y}_{1},{y}_{2},\dots ,{y}_{t-1},C\right)$$
10
T represents the target text length.
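As a sketch of Eq. (10), the decoder below conditions each step on the previous tokens (teacher forcing) and on the encoder representation C, which is assumed here to initialize the decoder's hidden state; the vocabulary and layer sizes are illustrative, not from the paper.

```python
# Sketch of Eq. (10): the decoder LSTM models P(y_t | y_1..y_{t-1}, C).
vocab_size, hidden = 8000, 256                   # illustrative sizes
embed = tf.keras.layers.Embedding(vocab_size, hidden)
decoder = tf.keras.layers.LSTM(hidden, return_sequences=True, return_state=True)
project = tf.keras.layers.Dense(vocab_size)      # logits over the vocabulary

def decode(y_prev, C):
    # y_prev: (batch, T) target tokens shifted right; C: (batch, hidden).
    out, _, _ = decoder(embed(y_prev), initial_state=[C, tf.zeros_like(C)])
    return project(out)    # softmax of these logits gives P(y_t | y_<t, C)
```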
In the implementation of the comprehensive parsing framework, the SMOSS model and the improved LSTM model are implemented in Python with the TensorFlow deep learning framework. The model is trained in batches with an optimizer, and a cross entropy loss function measures the difference between the model's predictions and the true labels.
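A sketch of the training setup just described, pairing the decoder logits from the previous sketch with the reference tokens under a cross entropy loss:

```python
# Batched cross entropy between decoder predictions and reference tokens.
xent = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def batch_loss(y_true, y_prev, C):
    # y_true: (batch, T) reference tokens; y_prev: the same, shifted right.
    return xent(y_true, decode(y_prev, C))
```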