As shown in Fig. 1, our framework is composed of four modules: (1) a self-training module; (2) an encoder module; (3) a sample selection module; and (4) a sample recall module.
We use RoBERTa as the encoder to obtain text representations. In addition, we leverage MC dropout to estimate model uncertainty and employ an entropy-based strategy to select the samples the model is uncertain about for self-training. Finally, in the sample recall module, we automatically recall high-quality samples with low confidence from the two perspectives of word overlap and semantic fluency.
3.1 Self-training Module
Self-training [14] is a semi-supervised learning method that expands the labeled training set by generating pseudo-labels for unlabeled data; it is now widely used in the field of sample selection.
Therefore, we use self-training to predict and filter the samples generated by data augmentation. Formally, given an original corpus with \(N\) samples \({D}_{l}={\{{x}_{l}^{i},{y}_{l}^{i}\}}_{i=1}^{N}\) and an augmented corpus with \(M\) samples \({D}_{g}={\{{x}_{g}^{i},{y}_{g}^{i}\}}_{i=1}^{M}\), the module proceeds as follows:

Step 1: Train a teacher model \({RoBERTa}_{T}\) with the cross-entropy (CE) loss on the original corpus \({D}_{l}\).

Step 2: Use the teacher network to predict pseudo-labels \({D}_{p}\) for the augmented corpus \({D}_{g}\).

Step 3: Apply a variety of selection strategies to the augmented corpus \({D}_{g}\) to obtain a subset of the augmented corpus \({S}_{g}\).

Step 4: Train a student model \({RoBERTa}_{Stu}\) on \({S}_{g}\) together with \({D}_{l}\).

Step 5: Treat the current student model as the teacher model, go back to Step 2, and repeat Steps 2 to 4 until the model converges.
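The iterative procedure above can be sketched in a few lines. This is a minimal, framework-agnostic sketch: the `train`, `predict`, and `select` helpers are hypothetical placeholders for, respectively, fitting a RoBERTa classifier with CE loss, pseudo-labeling, and the selection strategies of Sect. 3.3.

```python
def self_training(D_l, D_g, train, predict, select, max_rounds=10):
    """Teacher-student self-training loop (sketch).

    train(corpus)      -> model   (hypothetical: fit a classifier with CE loss)
    predict(model, xs) -> labels  (hypothetical: pseudo-labels for samples)
    select(model, D)   -> subset  (hypothetical: sample-selection strategies)
    """
    teacher = train(D_l)                              # Step 1: teacher on D_l
    for _ in range(max_rounds):                       # until convergence
        xs = [x for x, _ in D_g]
        D_p = list(zip(xs, predict(teacher, xs)))     # Step 2: pseudo-labels
        S_g = select(teacher, D_p)                    # Step 3: select subset S_g
        student = train(D_l + S_g)                    # Step 4: student on S_g + D_l
        teacher = student                             # Step 5: student becomes teacher
    return teacher
```

In practice the loop would stop on a validation criterion rather than a fixed round count.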
3.2 Encoder Module
RoBERTa (Robustly Optimized BERT Pretraining Approach) [15], a variant of BERT, retains only the masked language modeling (MLM) task for pretraining. The first and last positions of the input sentence are given the special tokens [CLS] and [SEP], respectively. For each token in an input sentence, the input representation is constructed by summing its corresponding token, segment, and position embeddings.
We leverage RoBERTa as the encoder to train the teacher and student models in this paper. For the classification task, the final hidden state of [CLS] is usually used to represent the sentence. Hence, for the \(i\)th sentence, the sentence representation is calculated as follows:
$${S}_{i}=RoBERTa({a}_{i},{b}_{i},{c}_{i})$$
where \({a}_{i},{b}_{i},{c}_{i}\) are the token, segment, and position embeddings.
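The per-position input construction (token + segment + position embeddings, summed) can be illustrated with a toy numpy sketch. The vocabulary, dimensions, and random lookup tables here are illustrative only, not RoBERTa's actual weights or tokenizer:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_segments, d = 100, 16, 2, 8   # toy sizes, not RoBERTa's

tok_emb = rng.normal(size=(vocab_size, d))   # token embedding table
seg_emb = rng.normal(size=(n_segments, d))   # segment embedding table
pos_emb = rng.normal(size=(max_len, d))      # position embedding table

token_ids = np.array([0, 42, 17, 2])         # e.g. [CLS] w1 w2 [SEP] (toy ids)
segment_ids = np.zeros(len(token_ids), dtype=int)
positions = np.arange(len(token_ids))

# Each position's input representation is the sum of its three embeddings.
x = tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]
assert x.shape == (len(token_ids), d)
```

The encoder then maps this (4, d) input to contextual hidden states, and the final [CLS] state serves as the sentence representation \(S_i\).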
3.3 Sample Selection Module
Since data augmentation generates some noisy samples, the model tends to gradually drift towards the noisy data during self-training, hurting model performance. Previous works have focused on confidence-based methods that remove noisy samples without considering the uncertainty of the teacher model. Inspired by these methods, we propose a two-stage sample selection method to solve this problem, consisting of clean data selection and confidence evaluation.
Clean Data Selection. To verify the correctness of the predicted labels, we introduce a binary classification task that aims to filter out augmented data with wrongly predicted labels. We use the teacher model to predict the labels of the augmented samples and compare the predicted labels with their ground truths. Formally, given an augmented sample \({x}_{i}^{g}\) with its ground truth \({y}_{i}^{g}\), we first use the encoder with a linear classifier to predict its probability distribution \(f({x}_{i}^{g},\theta )\) and obtain the pseudo-label \({\tilde{y}}_{i}^{g}\).
$$f\left({x}_{i}^{g},\theta \right)=softmax\left(W{x}_{i}^{g}+b\right)$$
$${\tilde{y}}_{i}^{g}=argmax f\left({x}_{i}^{g},\theta \right)$$
Then, we regard the samples whose predictions match their ground truths as clean samples and place them into the set \({D}_{clean}\), while the others are placed into the set \({D}_{noisy}\).
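The clean/noisy split above reduces to comparing the argmax pseudo-label with the inherited ground truth. A minimal sketch, assuming the teacher's class probabilities for all augmented samples are already available as a matrix:

```python
import numpy as np

def split_clean_noisy(probs, labels):
    """Split augmented samples by whether the teacher's argmax prediction
    (the pseudo-label) agrees with each sample's ground-truth label.

    probs:  (M, C) array of class probabilities from the teacher.
    labels: length-M ground-truth labels.
    Returns (clean_indices, noisy_indices).
    """
    preds = np.asarray(probs).argmax(axis=1)       # pseudo-labels (argmax)
    mask = preds == np.asarray(labels)             # agreement with ground truth
    return np.flatnonzero(mask).tolist(), np.flatnonzero(~mask).tolist()
```

The returned index lists correspond to \(D_{clean}\) and \(D_{noisy}\).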
Confidence Evaluation. After filtering out the wrongly predicted samples, we use an entropy-based confidence measurement over the probability distribution to quantify the confidence of the samples in \({D}_{clean}\). To better evaluate sample confidence, we first apply MC dropout, which conducts \(T\) forward passes with dropout layers enabled to predict labels for the augmented samples. For example, given an augmented sample \({x}_{i}\), its probability of class \(c\) in the \(t\)th dropout pass, \({p}_{ic}^{t}\), is as follows:
$${p}_{ic}^{t}= p\left(y=c\mid {x}_{i}\right)=softmax\left({\tilde{W}}_{t}{x}_{i}+B\right)$$
Then, based on the probability distributions \({p}_{ic}^{t}\), we aggregate the predictions from the \(T\) passes and compute the entropy-based sample confidence as follows:
$$H\left({p}_{i}\right)=-\frac{1}{C}\sum _{c=1}^{C}\frac{1}{T}\sum _{t=1}^{T}{p}_{ic}^{t}\text{l}\text{o}\text{g}\left({p}_{ic}^{t}\right)$$
$${w}_{easy}^{i} =1-H\left({p}_{i}\right)$$
$${w}_{hard}^{i}= H\left({p}_{i}\right)$$
where \({w}_{easy}^{i}\) and \({w}_{hard}^{i}\) denote the easy and hard confidences for the \(i\)th sample, respectively. Eventually, we rank all samples by confidence and obtain two sets, \({D}_{easy}\) and \({D}_{hard}\).
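The confidence computation above can be sketched with numpy. MC dropout itself just means running the classifier \(T\) times with dropout left on; here we assume those \(T\) per-pass class distributions are already collected in a (T, C) array:

```python
import numpy as np

def entropy_confidence(probs):
    """Entropy-based confidence for one sample, per the formulas above.

    probs: (T, C) array -- class distributions from T MC-dropout passes.
    H     = -(1/C) * mean over T passes of sum_c p log p
    Returns (w_easy, w_hard) = (1 - H, H).
    """
    probs = np.asarray(probs)
    T, C = probs.shape
    per_pass_entropy = -(probs * np.log(probs)).sum(axis=1)  # length-T vector
    H = per_pass_entropy.mean() / C                          # average, scaled by 1/C
    return 1.0 - H, H
```

Low entropy (confident, consistent predictions across passes) yields a high \(w_{easy}\); high entropy yields a high \(w_{hard}\).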
3.4 Sample Recall Module
Some high-quality samples in \({D}_{noisy}\) and \({D}_{hard}\) would be wrongly filtered out, even though they could improve model performance to a certain extent. Therefore, we propose to re-recall them from the two perspectives of word overlap and semantic fluency.
Word Overlap. Word overlap measures text similarity at the word level. It can effectively identify samples that are similar in meaning to the original sentence. In this paper, we use the Jaccard coefficient to calculate the word overlap between the augmented samples and the original samples. Letting \({x}_{l}^{i}\) and \({x}_{g}^{i}\) denote the word sets of the \(i\)th original and augmented samples, the word overlap \(J\left({x}_{l}^{i},{x}_{g}^{i}\right)\) is calculated as follows:
$$J\left(x\right)=J\left({x}_{l}^{i},{x}_{g}^{i}\right)=\frac{\left|{x}_{l}^{i}\cap {x}_{g}^{i}\right|}{\left|{x}_{l}^{i}\cup {x}_{g}^{i}\right|}=\frac{\left|{x}_{l}^{i}\cap {x}_{g}^{i}\right|}{\left|{x}_{l}^{i}\right|+\left|{x}_{g}^{i}\right|-\left|{x}_{l}^{i}\cap {x}_{g}^{i}\right|}$$
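Treating each sentence as its set of word tokens, the Jaccard coefficient is direct to compute (whitespace tokenization here is a simplifying assumption):

```python
def jaccard(original, augmented):
    """Word-overlap J = |A ∩ B| / |A ∪ B| over the two sentences' word sets."""
    a, b = set(original.split()), set(augmented.split())
    if not (a | b):            # both sentences empty: define overlap as 0
        return 0.0
    return len(a & b) / len(a | b)
```

For example, `jaccard("the cat sat", "the cat ran")` gives 0.5: two shared words out of four distinct words.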
Semantic Fluency. Fluency can be captured by statistical language models [16] and is important for grammatical error correction tasks [17, 18]. We therefore propose a metric to evaluate the semantic fluency of the augmented data. Formally, given an original sample \({x}_{l}^{i}=\left\{{w}_{1}^{i},{w}_{2}^{i},\dots ,{w}_{n}^{i}\right\}\) with \(n\) words and its augmented sample \({x}_{g}^{i}=\left\{{u}_{1}^{i},{u}_{2}^{i},\dots ,{u}_{m}^{i}\right\}\) with \(m\) words, we first compute their perplexities \(P(\bullet )\) using the language model ResLSTM [19] trained on the One Billion Word[1] corpus.
$$P\left({x}_{l}^{i}\right)=\sqrt[n]{\prod _{j=1}^{n}\frac{1}{p\left({w}_{j}^{i}\mid {w}_{1}^{i}{w}_{2}^{i}\dots {w}_{j-1}^{i}\right)}}$$
$$P\left({x}_{g}^{i}\right)=\sqrt[m]{\prod _{j=1}^{m}\frac{1}{p\left({u}_{j}^{i}\mid {u}_{1}^{i}{u}_{2}^{i}\dots {u}_{j-1}^{i}\right)}}$$
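Given the per-word conditional probabilities from a language model (here assumed already computed, since the actual scorer is ResLSTM), the perplexity formula above is the geometric mean of the inverse probabilities, which is numerically safest in log space:

```python
import math

def perplexity(cond_probs):
    """Sentence perplexity from per-word conditional probabilities.

    cond_probs: [p(w_1), p(w_2 | w_1), ..., p(w_n | w_1..w_{n-1})],
    as produced by a language model (assumed given).
    Returns (prod 1/p_j)^(1/n), computed in log space to avoid underflow.
    """
    n = len(cond_probs)
    return math.exp(-sum(math.log(p) for p in cond_probs) / n)
```

A uniform model over a vocabulary of size V assigns every word probability 1/V, giving perplexity exactly V.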
The semantic fluency \({F}_{i}\) for the \(i\)th sentence is then calculated based on the perplexity.
$$F\left(x\right)=F({x}_{l}^{i},{x}_{g}^{i})=\left|P\left({x}_{l}^{i}\right)-P\left({x}_{g}^{i}\right)\right|$$
After that, to recall the high-quality samples from \({D}_{noisy}\) and \({D}_{hard}\), we use \({D}_{easy}\) to automatically derive the thresholds, which avoids the extra work of normalization and probabilistic modeling. Specifically, we calculate the average Jaccard coefficient \({J}_{avg}\) and average fluency \({F}_{avg}\) of the samples in \({D}_{easy}\) and use them as the thresholds to select the recalled samples.
$${R}_{cutoff}=\left\{x\in {D}_{noisy}\cup {D}_{hard}\mid J\left(x\right)>{J}_{avg} \wedge F\left(x\right)<{F}_{avg}\right\}$$
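The recall step then reduces to averaging the two scores over \(D_{easy}\) and keeping the candidates that beat both thresholds. A sketch, where the `jaccard` and `fluency` scorers over an (original, augmented) pair are assumed to be supplied:

```python
def recall_samples(easy, candidates, jaccard, fluency):
    """Recall high-quality candidates using D_easy-derived thresholds.

    easy, candidates: lists of (original, augmented) sentence pairs,
        candidates drawn from D_noisy ∪ D_hard.
    jaccard, fluency: scoring functions over such a pair (assumed given).
    Keeps candidates with word overlap above J_avg and fluency score
    below F_avg.
    """
    j_avg = sum(jaccard(o, g) for o, g in easy) / len(easy)
    f_avg = sum(fluency(o, g) for o, g in easy) / len(easy)
    return [(o, g) for o, g in candidates
            if jaccard(o, g) > j_avg and fluency(o, g) < f_avg]
```

Deriving both thresholds from \(D_{easy}\) keeps the two scores on their natural scales, so no normalization across metrics is needed.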
[1] https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark