Heterogeneous-training: A Semi-supervised Text Classification Method

With the advent of the information age, the amount of text data on the Internet keeps growing. Since text is the most widely distributed information carrier with the largest volume of data, it is particularly important to use text classification technology to organize and manage massive data scientifically. In this paper, a semi-supervised ensemble learning algorithm, Heterogeneous-training, is proposed and applied to text classification. Building on the Tri-training algorithm, Heterogeneous-training improves it by using different classifiers, dynamically updating the probability thresholds, and adaptively editing the data. Extensive experiments show that our method outperforms the Tri-training algorithm on benchmark text classification datasets.


INTRODUCTION
With the continuous advancement of computer and network technology, the amount of textual data on the Internet is growing rapidly. Since text is the primary information carrier with the widest distribution and the greatest volume of data, it is imperative to organize and manage massive amounts of data scientifically through text classification techniques. Text classification therefore plays a crucial role in natural language processing and in present-day machine learning. Conventional supervised learning requires manual labeling of text categories, resulting in high labor costs. Semi-supervised text classification can learn from unlabeled samples and has thus drawn increasing attention. The most prevalent weakly supervised methods [1][2][3][4] for text classification generate pseudo-labels and train a classifier to learn the mapping between documents and classes. The quality of the pseudo-labels contributes significantly to the final classification accuracy; however, because they rely on heuristics, pseudo-labels inevitably contain noise. They are often generated using heuristics such as String-Match between the documents and seed words provided by the user [5]. The high risk of erroneous predictions from a classifier trained on such noisy labels cannot be ignored. It is therefore crucial to minimize the noise introduced by pseudo-labels as much as possible.
Ensemble learning [6] is a machine learning technique that has proven effective in reducing noise in classification problems by integrating multiple weakly supervised classifiers. The co-training algorithm was proposed by Blum et al. [7]. In this approach, the dataset is assumed to have two redundant views, one for each attribute set. These views must satisfy two conditions: each attribute set should be capable of describing the entire dataset, and the attribute sets should be independent of each other. In practice, however, these conditions are often difficult to satisfy. For instance, many datasets may have multiple views, but not necessarily sufficient ones [8], and the views may not be entirely independent. The method in [9] does not require the problem to have redundant views, but it instead restricts the types of classifiers that can be used and increases the computational overhead of estimating labeling confidence.
To address the limitations of co-training, Tri-training was developed to improve efficiency by eliminating the long validation times required during co-training while still utilizing multiple classifiers [10]. The algorithm first bootstrap-samples the labeled sample set to obtain three labeled training sets, and then generates a classifier from each of them. During co-training, the new labeled examples for each classifier are provided by the other two classifiers in collaboration: if the predictions of two classifiers for the same unlabeled example are identical, the example is considered to have high labeling confidence and is added, after labeling, to the training set of the third classifier. When predicting unseen examples, Tri-training no longer selects a single classifier as previous algorithms did; instead, it uses the voting scheme common in ensemble learning to form an ensemble of the three classifiers. Unlike previous co-training algorithms, Tri-training implicitly compares the labeling confidence of different unlabeled samples by checking the consistency of the predictions of the three classifiers, which eliminates the need for time-consuming statistical testing techniques.
The traditional Tri-training method implicitly compares the labeling confidence of different unlabeled examples by checking whether the predictions of the three classifiers agree, which avoids frequent and time-consuming statistical tests [11]. However, this implicit treatment is often less accurate than an explicit estimate of labeling confidence. For example, if the initial classifiers are weak, unlabeled examples may be mislabeled, introducing noise into the training of the third classifier. The algorithm also ignores the class imbalance that can build up in the training set as classifier errors accumulate. Moreover, since the same learning method is applied to all three training sets, the resulting classifiers tend to produce similar classification results even when the training sets differ, regardless of the data distribution. Consequently, the generalization ability of the Tri-training algorithm is limited.
To address these limitations, we bring ideas from semi-supervised learning and co-training into text classification. We improve the traditional information gain algorithm and the Tri-training algorithm, and propose a semi-supervised learning algorithm based on the improved Tri-training. Specifically, we introduce a semi-supervised classification method called Heterogeneous-training, built on the Tri-training algorithm, to perform text classification and to make full use of unlabeled sample data to improve classification performance.
The remainder of this paper is organized as follows: the second part reviews semi-supervised text classification, ensemble learning and the Tri-training algorithm; the third part presents our improved model; the fourth part describes the experiments; and the final part summarizes the conclusions of this paper.

RELATED WORK

Semi-supervised Text Classification
The task of text classification is complex due to the ambiguity and polysemy present at various levels of natural language text. For example, text length can affect the selection of an appropriate classifier, and label-level complexity can pose challenges for semantic recognition. Moreover, words can acquire new meanings over time, which further complicates text classification.
Recent studies have focused on deep self-training for semi-supervised text classification (SSTC) [12][13][14][15][16][17], where classifiers learn deep features from labeled and unlabeled texts under a unified framework. This approach repeatedly replaces the current deep classifier with a new one that updates the pseudo-labels of the unlabeled texts and is then retrained on the labeled and pseudo-labeled texts. For example, virtual adversarial training (VAT) [12,15] aims to resist random and local perturbations: it first generates a prediction for the original text, applies local perturbations to the embedding of the original text, and trains the deep classifier using a consistency loss between the original prediction and the classifier's output on the perturbed input. Additionally, other methods combine maximum likelihood and adversarial training [14], virtual adversarial training, entropy minimization, and cross-view training into a unified objective.
Unsupervised data augmentation [16] techniques such as back-translation and TF-IDF word substitution can be used to apply a consistency loss between the prediction on unlabeled text and the prediction on the corresponding augmented text, instead of applying local perturbations. Cross-view training [13] matches the prediction of an auxiliary prediction module on a restricted view of the unlabeled text (for example, only part of a sentence) with the prediction of the main prediction module on the corresponding full view.
Overall, research on text classification remains a challenging area with many open issues, and continued research is necessary to improve the accuracy and efficiency of these methods.

Ensemble Learning
Ensemble learning [18] is a machine learning technique that combines multiple independent models to improve overall performance. Deep learning models with multi-layer processing architectures have shown superior performance compared to shallow or traditional classification models. The main idea behind ensemble learning is to generate multiple learners through specific rules, integrate them with an integration strategy, and make a comprehensive decision based on the combined result. Typically, the learners in an ensemble are homogeneous "weak learners". From such a weak learner, several learners can be generated by introducing disturbances to the sample set, the input features, the output representation, or the algorithm parameters. A "strong learner" [19] with improved accuracy can then be obtained by fusing these learners. With the advancement of ensemble learning research, a broader definition [20] of ensemble learning has gradually been accepted by scholars: learning methods are selected for multiple groups of learners without distinguishing the nature of the learners.

Tri-training [10]
Many semi-supervised learning algorithms use generative classifier models, which employ expectation maximization to model the process of label estimation or parameter estimation. One notable achievement in this area is the co-training paradigm proposed by Blum and Mitchell [7], which trains two classifiers separately on two different views, i.e., two independent sets of attributes. The predictions made by each classifier on unlabeled examples are then used to augment the training set of the other. Dasgupta et al. [21] showed that when these requirements are met, jointly trained classifiers can achieve lower generalization error. Unfortunately, in most cases this demand is difficult to satisfy. Goldman and Zhou [9] proposed an algorithm that does not require an attribute partition. However, it requires two different supervised learning algorithms that partition the instance space into a set of equivalence classes, and it employs time-consuming cross-validation to determine how to label the unlabeled examples and how to produce the final hypothesis.
Zhou et al. proposed a new co-training-style algorithm named Tri-training, which requires neither redundant views nor multiple supervised learning algorithms that partition the instance space into equivalence classes, making it easy to apply to common data mining scenarios. Unlike previous algorithms that use two classifiers, Tri-training uses three classifiers. This design settles the problems of how to label unlabeled examples and how to generate the final hypothesis, which greatly improves the efficiency of the algorithm. Additionally, combining the three classifiers enhances their generalization ability.
The general procedure of the Tri-training classification algorithm is as follows. First, the small labeled data set $L$ is bootstrap-sampled to obtain three training sets, and a classifier $h_1$, $h_2$, $h_3$ is trained on each of them. Tri-training implicitly compares the labeling confidence of different unlabeled samples through the agreement of the classifiers instead of explicitly estimating it, which avoids frequent and time-consuming statistical tests. The downside is that this implicit process is often less accurate than an explicit estimate of labeling confidence, especially when an initial classifier is weak, so unlabeled samples may be mislabeled and introduce noise into the training of the third classifier. If the newly labeled samples are sufficiently accurate, however, the negative impact of the introduced noisy data can be offset by the benefit of using a large number of unlabeled samples [10].
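As a rough illustration of this labeling rule, the following minimal Python sketch performs one Tri-training update for a single classifier, assuming scikit-learn-style estimators and dense feature matrices; all names are illustrative and not taken from the original implementation.

```python
import numpy as np
from sklearn.base import clone


def tri_training_round(h1, h2, h3, L_X, L_y, U_X):
    """One Tri-training update for h1: unlabeled samples on which h2 and h3
    agree are pseudo-labeled with that shared prediction and appended to
    h1's training data, after which h1 is retrained."""
    pred2, pred3 = h2.predict(U_X), h3.predict(U_X)
    agree = pred2 == pred3  # implicit confidence: the other two classifiers agree
    new_X = np.vstack([L_X, U_X[agree]])
    new_y = np.concatenate([L_y, pred2[agree]])
    return clone(h1).fit(new_X, new_y)
```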

HETEROGENEOUS-TRAINING
Tri-training has achieved impressive results in classification by means of semi-supervised ensemble learning. However, the Tri-training algorithm itself has some shortcomings. The original Tri-training method implicitly determines the confidence of different unlabeled samples based on the consistency of their prediction results, which eliminates the need for frequent and time-consuming statistical tests. However, this implicit processing is often less accurate than explicit confidence estimation, especially when the initial classifiers are weak. As a result, examples may be mistakenly labeled, leading to noisy training of the third classifier. Moreover, the algorithm does not consider the class imbalance in the training set caused by accumulated classifier errors.
To address these issues, we propose an improved Tri-training algorithm. Our proposed algorithm shares one training set among the three classifiers, which reduces the probability of classifier errors. We also impose more stringent restrictions on the samples admitted into the labeled data set. In addition, the probability thresholds are updated dynamically according to the change in the proportion of sample categories after each training iteration. Finally, we combine the RemoveOnly [24] editing operation with an adaptive data editing strategy [25] in the Tri-training learning process.
The steps of text classification include text preprocessing, text feature selection, and classifier training. Our new model is described below in terms of these three steps, together with the flow and characteristics of the algorithm.

Text Preprocessing
Text preprocessing standardizes the selected text data and converts it into a form suitable for subsequent feature extraction; it includes word segmentation, stop word removal, special character removal, and root reduction (stemming) of English words. Word segmentation separates words into meaningful units using English spaces and punctuation marks as separators. Stop words are words with no significant meaning and can be removed using a common English stop word list. Special characters, such as non-English characters, numbers, and punctuation, can be filtered with regular expressions. Root reduction simplifies English words by removing affixes to obtain their base forms.
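A minimal preprocessing sketch covering these four steps is shown below, assuming NLTK's English stop word list and Porter stemmer are available (the stop word corpus must be downloaded beforehand); the function name is illustrative.

```python
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))  # requires nltk.download("stopwords")
STEMMER = PorterStemmer()


def preprocess(text):
    # Special character removal: keep only English letters and whitespace.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Word segmentation: split on whitespace (punctuation is already stripped).
    tokens = text.lower().split()
    # Stop word removal and root reduction (stemming).
    return [STEMMER.stem(tok) for tok in tokens if tok not in STOP_WORDS]


print(preprocess("The classifiers were trained on 20,000 news articles!"))
```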

Selection of Text Features
Information gain is a common feature selection algorithm in text classification. Information gain reflects the importance of a feature: the larger the information gain, the more important the feature.
The information gain of a feature $t$ is computed as
$$IG(t) = -\sum_{i=1}^{k} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{k} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{k} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t}),$$
where $k$ is the number of text classes; $P(c_i)$ is the probability that a text of class $c_i$ appears among all texts, $P(c_i) = N_i / N$, with $N_i$ the number of texts of class $c_i$ and $N$ the total number of texts; $P(t)$ is the probability that a text containing feature $t$ appears among all texts, $P(t) = N_t / N$, with $N_t$ the number of texts containing feature $t$; $P(c_i \mid t)$ is the conditional probability that a text belongs to class $c_i$ when feature $t$ appears, $P(c_i \mid t) = N_{i \cap t} / N_t$, with $N_{i \cap t}$ the number of texts of class $c_i$ that contain feature $t$; $P(\bar{t})$ is the probability that a text without feature $t$ appears among all texts, $P(\bar{t}) = N_{\bar{t}} / N$, with $N_{\bar{t}}$ the number of texts not containing feature $t$; and $P(c_i \mid \bar{t})$ is the conditional probability that a text belongs to class $c_i$ when feature $t$ does not appear, $P(c_i \mid \bar{t}) = N_{i \cap \bar{t}} / N_{\bar{t}}$, with $N_{i \cap \bar{t}}$ the number of texts of class $c_i$ that do not contain feature $t$.
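As a concrete illustration of these definitions, the following sketch computes the information gain of a single term from document-level counts; the helper names and the toy data are illustrative only.

```python
import math
from collections import Counter


def information_gain(docs, labels, term):
    """IG of `term` following the definitions above; `docs` is a list of
    token sets and `labels` holds the class of each text."""
    N = len(docs)
    class_counts = Counter(labels)            # N_i for each class c_i
    has_term = [term in doc for doc in docs]
    N_t = sum(has_term)                       # number of texts containing the feature

    def plogp(p):
        return p * math.log2(p) if p > 0 else 0.0

    # -sum_i P(c_i) log P(c_i)
    ig = -sum(plogp(n / N) for n in class_counts.values())
    # + P(t) sum_i P(c_i|t) log P(c_i|t)  +  P(t~) sum_i P(c_i|t~) log P(c_i|t~)
    for present in (True, False):
        n_side = N_t if present else N - N_t
        if n_side == 0:
            continue
        cond = Counter(lab for lab, h in zip(labels, has_term) if h == present)
        ig += (n_side / N) * sum(plogp(cond[c] / n_side) for c in class_counts)
    return ig


docs = [{"goal", "match"}, {"stock", "market"}, {"match", "win"}, {"market", "rally"}]
print(information_gain(docs, ["sports", "finance", "sports", "finance"], "match"))  # 1.0
```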

Classifier Training
Let $L$ be a labeled balanced training set containing $|L|$ samples with a given label set $C = \{c_1, c_2, c_3, \ldots, c_m\}$; let $U$ be an unlabeled training set containing $|U|$ samples $\{x_1, x_2, \ldots, x_{|U|}\}$; and let $V$ and $T$ be the validation and test sets. $L$, $U$, $V$ and $T$ obey the same data distribution. The three initial classifiers $h_1$, $h_2$ and $h_3$ are trained on the original labeled balanced training set, and these classifiers are used to make predictions on the unknown samples; only samples that satisfy the aforementioned conditions (i.e., all three classifiers give the same prediction for the same unlabeled sample, and the prediction probabilities given by all three classifiers are greater than their respective probability thresholds) are added to the original training set. The prediction probabilities are obtained from the classification algorithms of the classifiers: for any sample $x_j$ in $U$, each of the three classifiers outputs a probability distribution over the $m$ classes that sums to 1,
$$P_k(x_j) = \big(p_k(c_1 \mid x_j),\, p_k(c_2 \mid x_j),\, \ldots,\, p_k(c_m \mid x_j)\big), \quad k = 1, 2, 3,$$
and the maximum probability is taken as the prediction probability,
$$p_k(x_j) = \max_{1 \le i \le m} p_k(c_i \mid x_j).$$
The probability thresholds are updated dynamically according to the change in the proportion of sample categories after each training iteration, so an initial set of probability thresholds for the three classifiers must be obtained first. The initial "pseudo-labels" predicted by $h_1$, $h_2$ and $h_3$ for $U$ are obtained first, and the frequency of each of the $m$ categories among the labels predicted by classifier $k$ is computed as
$$f_k(c_i) = \frac{n_k(c_i)}{|U|},$$
where $n_k(c_i)$ is the number of samples that classifier $k$ assigns to class $c_i$; the set of initial probability thresholds is then defined from these frequencies. The initial classifiers are constructed as follows: (1) reduce the dimensionality of the preprocessed texts to obtain feature vectors, and train them with Naive Bayes, SVM, and XGBoost [29] to obtain three initial classifiers $h_1$, $h_2$ and $h_3$ with large differences; (2) calculate the initial class frequencies for the $m$ categories on the unlabeled set to obtain the initial probability thresholds.
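The sketch below illustrates one possible reading of the admission condition and of the threshold initialization described above, assuming integer-encoded class labels and scikit-learn-style classifiers with a predict_proba method; in particular, interpreting the initial per-class threshold of each classifier as the frequency of that class among its pseudo-labels is our assumption, not a detail confirmed by the text.

```python
import numpy as np


def initial_thresholds(classifier, U_X, m):
    """Per-class initial thresholds for one classifier, taken here as the
    frequency of each class among its pseudo-labels on the unlabeled set."""
    pseudo = classifier.predict(U_X)              # integer labels in {0, ..., m-1}
    return np.bincount(pseudo, minlength=m) / len(pseudo)


def admit(classifiers, thresholds, x):
    """Admission condition: all three classifiers predict the same class for x,
    and each prediction probability exceeds that classifier's threshold."""
    labels, confident = [], True
    for h, thr in zip(classifiers, thresholds):
        proba = h.predict_proba(x.reshape(1, -1))[0]  # distribution over m classes, sums to 1
        label = int(np.argmax(proba))
        labels.append(label)
        confident = confident and proba[label] > thr[label]
    return confident and len(set(labels)) == 1, labels[0]
```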

EXPERIMENTS

Dataset
We conducted experiments on four datasets. The details of the datasets are as follows:

The New York Times (NYT): The NYT dataset consists of news articles published by The New York Times, categorized into 5 coarse-grained genres (e.g., science, sports) and 25 fine-grained categories (e.g., music, football, dance, basketball).
Twenty Newsgroups (20 News): The 20 News dataset is a collection of newsgroup documents partitioned broadly into 6 groups (e.g., recreation, computers) and 20 fine-grained classes (e.g., graphics, windows, baseball, hockey).

AG-News [26]: AG-News is a vast collection of news articles divided into four coarse-grained topics (e.g., business, politics, sports, and technology).
Books [27,28]: The Books dataset contains book descriptions, user-book interactions, and user book reviews collected from Goodreads, a popular online book review website.
The dataset statistics and the corresponding noise ratios for the initial pseudo-labels are provided in Table 1.

Compared Methods
We compare with several label selection methods mentioned below: Tri-training [10], Tri-training with different classifiers, Tri-training+RemoveOnly [24], Tri-training with updated probability thresholds and Tri-training with "admission condition".
For all the baselines above, we admit the same number of samples in each iteration as Heterogeneous-training, because we cannot tune individual thresholds for each dataset: there is no human-annotated data in a weakly supervised setting, and a fixed threshold does not work for all datasets due to the different prediction probability distributions across datasets. Therefore, for controlled experiments and fair comparisons, we use the same number of samples as Heterogeneous-training in each iteration.

Experimental Settings
Hyperparameters in text classification experiments are typically selected by manual experience or k-fold cross-validation. In this study, to avoid excessive complexity, some hyperparameters were determined manually. We set the number of iterations to t = 10 and fixed the upper bound, the lower bound, and the fine-tuning step. We focused on the impact of n, the number of subsets of the unlabeled training set, on the experimental results. If n is set too large, the algorithm's time complexity increases and classification performance may decrease; conversely, if n is set too small, the classifier may not be able to distinguish unknown samples accurately due to insufficient iterations. We found that setting n to 10 was optimal. For the classifiers, we used SVM, Naive Bayes, and XGBoost [29] to classify the text.

Evaluation Metrics
The accuracy (precision) and completeness (recall) rates are defined as
$$P = \frac{a}{a + c}, \qquad R = \frac{a}{a + b},$$
where $a$ is the number of texts that actually belong to a category and are predicted by the classifier to be in that category, $b$ is the number of texts that actually belong to a category but are predicted to be in other categories, and $c$ is the number of texts that do not belong to a category but are predicted to be in that category. The accuracy rate and the completeness rate reflect two different aspects of automatic text classification and are generally in tension, i.e., it is difficult to improve both indicators at the same time, so they should be considered together. The $F_1$ value is therefore used as the evaluation index of the experiments:
$$F_1 = \frac{2PR}{P + R}.$$
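A small worked computation of these quantities, with the counts a, b and c chosen purely for illustration:

```python
def f1_from_counts(a, b, c):
    """Precision P = a/(a+c), recall R = a/(a+b), F1 = 2PR/(P+R)."""
    precision = a / (a + c)
    recall = a / (a + b)
    return 2 * precision * recall / (precision + recall)


# Example: 90 texts of a class correctly assigned, 10 of the class missed,
# 20 texts from other classes wrongly assigned to it.
print(round(f1_from_counts(a=90, b=10, c=20), 3))  # 0.857
```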

Model Sensitivity
Figures 2 and 3 examine the sensitivity of the model to the number of unlabeled training subsets n and to the number of unlabeled samples under different sizes of the initial training set; the detailed analysis accompanies the figures below.

Results
We present a summary of the evaluation results obtained with different classifiers and selection methods in Table 2. Initial pseudo-labels are generated using String-Match [5]. Micro- and Macro-F1 scores are used as evaluation metrics. All experiments are run with three random seeds, and the means and standard deviations are reported as percentages. For a fair comparison, we admit the same number of samples for each baseline as for Heterogeneous-training. Cases where Heterogeneous-training outperforms the baselines are rendered in bold.
Experimental results show that the performance is significantly improved for all classifiers on the fine-grained datasets. In some cases, such as NYT-Coarse, F1 improves by 8.9 points. However, we also find that some methods perform worse in some cases. This may be because the number of noisy labels is higher in fine-grained datasets, which leads to higher noise, or because the chosen classification methods are not well suited to some datasets, so the algorithm does not reach the expected performance there compared with other classification algorithms. The results show that Heterogeneous-training with three different classifiers outperforms the original Tri-training algorithm on the F1 index of the experimental datasets. However, our improved algorithm still has limitations. With a small initial training set, the performance of Heterogeneous-training may be lower, and more mislabeled data may be generated during the classification process. How to better eliminate this noise will therefore remain a focus of future research on Tri-training-style algorithms. Additionally, many hyperparameters in the algorithm are determined by human experience, and this issue requires further investigation. Finally, we believe that the Heterogeneous-training algorithm has the potential to play a significant role in the field of deep learning.

$h_1$, $h_2$ and $h_3$ are trained on $L$. If $x$ is any point in the unlabeled data set $U$, and $h_2$ and $h_3$ have the same classification result for $x$, then $x$ is labeled as $h_2(x)$ and added to the training set of $h_1$, thus forming a new training set $L_1 = L \cup \{x \mid x \in U,\ h_2(x) = h_3(x)\}$ for $h_1$. Similarly, the training sets of $h_2$ and $h_3$ are expanded to $L_2$ and $L_3$, respectively.

Figure 1: Schematic diagram of the Heterogeneous-training algorithm model. In the figure, Labeled Training Set represents $L$, including the original labels and pseudo-labels; Classifier 1, Classifier 2 and Classifier 3 represent $h_1$, $h_2$ and $h_3$; Unlabeled Training Set represents $U$; Unlabeled Training Set 1, Unlabeled Training Set 2, ..., Unlabeled Training Set t represent $U_1$, $U_2$, ..., $U_t$; Pseudo-labeled data 1, Pseudo-labeled data 2, ..., Pseudo-labeled data l represent the pseudo-labeled data sets produced from these subsets; and the final classifier shown represents the trained classifier after $T$ iterations.

Figure 2: Heterogeneous-training with different numbers of unlabeled training subsets and different numbers of initial training samples.

Figure 2 shows the Heterogeneous-training method with different numbers of unlabeled training subsets and different numbers of initial training samples. It investigates the influence of changes in n on the classification performance under different numbers of initial training samples. We set n = 6, 7, 8, 9, 10 and the feature dimension to 450; 160, 200, or 240 text segments constitute the original labeled training set $L$, with separate partitions of the data serving as the validation set $V$, the test set $T$, and the unlabeled training set $U$. All data come from NYT-Coarse. As can be seen from the figure, when there are relatively few labeled training samples, the Heterogeneous-training algorithm proposed in this study can use unlabeled text samples to enhance supervised learning. For all three sizes of the initial training set, $F_1$ changes greatly when n increases from 6 to 7; when n continues to increase, the further improvement in $F_1$ is not obvious. This shows that to obtain the best performance together with lower time complexity, n should be neither too large nor too small. When the size of the initial training set increases, $F_1$ improves, which shows that appropriately enlarging the initial training set can improve the classification performance of this algorithm. Figure 3 compares Heterogeneous-training with different numbers of unlabeled samples and different numbers of initial training samples, investigating the influence of the number of unlabeled samples on the classification performance under different sizes of the initial training set. Here we set n = 8 and the feature dimension to 450.

Figure 3: Heterogeneous-training with different numbers of unlabeled samples and different numbers of initial training samples.
The adaptive editing strategy for RemoveOnly is as follows. Let $L_t$ denote the set of newly labeled samples for $h_1$ in iteration $t$, let $L_t^{de}$ denote the set that remains after editing, and let $e_t$ and $e_t^{de}$ denote the hypothetical classification error rates of $h_1$ trained on $L \cup L_t$ and on $L \cup L_t^{de}$, respectively. If RemoveOnly was not triggered in the $(t-1)$-th iteration and $|L_t| > |L_{t-1}|$, RemoveOnly is triggered under the guarantee that $|L_t^{de}| > |L_{t-1}|$; if RemoveOnly was not triggered in the $(t-1)$-th iteration and $|L_t| > |L_{t-1}|$, but $e_t < e_{t-1}$ cannot be satisfied while $e_t^{de} < e_{t-1}$ can, RemoveOnly is triggered; if RemoveOnly was triggered in the $(t-1)$-th iteration and $|L_t| > |L_{t-1}^{de}|$, RemoveOnly is triggered under the guarantee that $e_t^{de} < e_t < e_{t-1}^{de}$; if RemoveOnly was triggered in the $(t-1)$-th iteration and $|L_t| > |L_{t-1}^{de}|$, RemoveOnly is triggered when $e_t < e_{t-1}^{de}$ is not satisfied but $e_t^{de} < e_{t-1}^{de}$ is satisfied; in all other cases, RemoveOnly is not triggered. Algorithm 1 summarizes the running process of the model.

Algorithm 1 Heterogeneous-training
Input: original labeled balanced training set $L$, unlabeled training set $U$, validation set $V$ and test set $T$, number of subsets of the unlabeled training set $n$, number of iterations $t_0$, sample category proportion upper bound, lower bound, and fine-tuning step.
Output: final classifier, test result $F_1$.
(1) Preprocess the labeled balanced training set $L$.
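For orientation, the following compact Python sketch shows one possible shape of the overall loop in Algorithm 1. It assumes integer class labels, non-negative count or TF-IDF features (required by Multinomial Naive Bayes), and a fixed confidence threshold standing in for the dynamically updated per-class thresholds; the RemoveOnly editing step is omitted, so this is an outline under our own assumptions rather than the exact procedure.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from xgboost import XGBClassifier


def heterogeneous_training(L_X, L_y, U_X, n=10, threshold=0.8):
    """Sketch of the main loop: three different base learners share one growing
    training set; in each round, samples from the next unlabeled subset are
    admitted only when all three classifiers agree and are sufficiently confident."""
    classifiers = [MultinomialNB(), SVC(probability=True), XGBClassifier()]
    X, y = L_X.copy(), L_y.copy()
    for subset in np.array_split(np.arange(len(U_X)), n):   # n subsets of the unlabeled pool
        for h in classifiers:
            h.fit(X, y)
        probas = [h.predict_proba(U_X[subset]) for h in classifiers]
        preds = [p.argmax(axis=1) for p in probas]
        agree = (preds[0] == preds[1]) & (preds[1] == preds[2])
        confident = np.all([p.max(axis=1) > threshold for p in probas], axis=0)
        keep = agree & confident
        X = np.vstack([X, U_X[subset][keep]])                # shared, growing training set
        y = np.concatenate([y, preds[0][keep]])
    for h in classifiers:                                    # final fit on the enlarged set
        h.fit(X, y)
    return classifiers  # predictions on new samples are made by majority vote
```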

Table 2: Evaluation results with different classifiers and label selection methods.

CONCLUSION

In this paper, we present a novel approach for improving the Tri-training algorithm and propose a new model, Heterogeneous-training, for text classification. Our results demonstrate that Heterogeneous-training outperforms the original Tri-training algorithm in terms of accuracy, even without modifying the training set. By dynamically adjusting the probability thresholds and imposing stricter admission conditions, we were able to significantly enhance the performance of Heterogeneous-training.