Neural labeled LDA: a topic model for semi-supervised document classification

Recently, statistical topic modeling approaches based on LDA have been applied to supervised document classification, where the model generation procedure incorporates prior knowledge to improve classification performance. However, these customizations of topic modeling are limited by the cumbersome derivation of a specific inference algorithm for each modification. In this paper, we propose a new supervised topic modeling approach for document classification, Neural Labeled LDA (NL-LDA), which builds on the VAE framework and designs a special generative network to incorporate prior information. The proposed model supports semi-supervised learning based on the manifold assumption and the low-density assumption. Meanwhile, NL-LDA has a consistent and concise inference method for semi-supervised learning and prediction. Quantitative experimental results demonstrate that our model outperforms the compared approaches on supervised document classification, including traditional statistical and neural topic models. Notably, the proposed model supports both single-label and multi-label document classification. NL-LDA performs particularly well on semi-supervised classification, especially when only a small amount of labeled data is available. Further comparisons with related works also indicate that our model is competitive with state-of-the-art topic modeling approaches on semi-supervised classification.

class (Burkhardt and Kramer 2019b). Standard LDA is a completely unsupervised algorithm, so how to incorporate prior knowledge into the topic modeling procedure is a popular research direction (Chen et al. 2019). A major challenge of these LDA customizations is the computational cost of computing the posterior distribution. For standard LDA, the popular inference methods include variational inference (Blei et al. 2003), collapsed Gibbs sampling (Griffiths and Steyvers 2004), and collapsed variational Bayes (Teh et al. 2006). However, all these methods share the drawback of requiring a re-derivation of the inference algorithm even for a small change to the modeling procedure. Recently, the Variational Auto-Encoder (VAE) (Kingma and Welling 2013; Rezende et al. 2014) has been considered as a new choice for topic models, because it is deemed a black-box inference method and does not require model-specific derivations. To the best of our knowledge, the Neural Variational Document Model (NVDM) proposed by Miao et al. (2016) is the first text topic model based on the VAE framework, but it does not use a Dirichlet prior, which in LDA promotes sparsity and leads to more interpretable topics. To handle this issue, many methods have been introduced and achieve competitive performance (Burkhardt and Kramer 2019a). Furthermore, some topic modeling approaches based on the VAE framework can incorporate prior knowledge for supervised learning (Card et al. 2018).
As digital text grows explosively on the Web, where unlabeled data is abundant while only a limited subset of samples carries labels, supporting semi-supervised classification has become a growing research direction for topic modeling approaches in recent years (Pavlinek and Podgorelec 2017; Soleimani and Miller 2017). Standard LDA is an unsupervised algorithm, so combining standard unsupervised LDA and customized supervised LDA to support semi-supervised learning is a natural thought (Wang et al. 2012; Zhang and Wei 2014; Soleimani and Miller 2016).
Meanwhile, there are also some semi-supervised approaches under the VAE framework (Kingma et al. 2014), especially for document classification (Xu et al. 2017). However, topic modeling approaches adopting the VAE framework for semi-supervised document classification are still rare, or have a complicated model structure (Zhou et al. 2020). To address this challenge, we propose Neural Labeled LDA (NL-LDA). To handle the Dirichlet distribution, we employ a Laplace approximation following Srivastava and Sutton (2017). Inspired by SLDA (Blei and McAuliffe 2010), the proposed model incorporates prior knowledge through an additional label generative network with a weight parameter, which leads to a flexible model that can employ various types of prior information. To support semi-supervised learning, we apply the manifold assumption and the low-density assumption. Learning on unlabeled data helps discover the latent topics, i.e., the manifold, and thus improves the classification performance. To further improve semi-supervised learning, we adopt the low-density assumption by extending the objective function. Meanwhile, the model supports supervised learning, unsupervised learning, and prediction with a consistent and concise inference method.
Our contribution is summarized as follows. We propose a novel topic model, NL-LDA, which is an extension of SLDA for semi-supervised document classification. The proposed approach is a flexible model that allows a variety of extensions for incorporating prior knowledge, and has a consistent and concise inference method based on the VAE framework. The model has been evaluated on several typical document classification tasks, including single-label and multi-label classification. The experimental results demonstrate that the proposed model performs better than related works, including traditional and neural topic modeling approaches. In particular, the proposed model has significant advantages on semi-supervised document classification under a small amount of labeled data. The rest of the paper is structured as follows. Section 2 reviews related work; Sect. 3 describes the proposed method; Sect. 4 introduces the experiments and evaluation results. We discuss the results in Sect. 5. Finally, Sect. 6 gives concluding remarks and an outline of future work.

Related work
Our work is related to two research lines: traditional statistical supervised topic models and neural topic modeling approaches.
LDA, proposed by Blei et al. (2003), is a hierarchical Bayesian model that maps a text document into a latent low-dimensional space based on a set of topics. The model considers each document as a random mixture over topics, and each topic as a distribution over words. However, the automatically learned topics are hard to interpret and may not suit an end-user application, e.g., categorization. To incorporate prior information in the generative process, there are two kinds of approaches: one first generates the words and then generates the response variables; the other generates the prior knowledge first and then generates the words conditioned on it. The typical approach of the first type is Supervised Latent Dirichlet Allocation (SLDA) (Blei and McAuliffe 2010), where each document is paired with a response variable that drives topic inference. Labeled LDA (L-LDA), introduced by Ramage et al. (2009), is a typical approach of the second type. It simply defines a one-to-one correspondence between topics and observed labels, and then incorporates the observed label information through the Dirichlet prior of the document-topic distribution. L-LDA has been widely applied for its efficiency and concision, but it constrains the topic distributions to the observed labels, which leads to over-focusing on them. To alleviate this problem, Dependency-LDA (Rubin et al. 2011) incorporates another topic model to capture the observed label correlations, which are deemed crucial for multi-label classifiers (Burkhardt and Kramer 2019b). Another recent improvement of L-LDA is Twin Labeled LDA (Wang et al. 2020b), which employs two sets of parallel topic modeling processes: one incorporates the prior label information through hierarchical Dirichlet distributions, and the other models the grouping tags that carry prior knowledge about the label correlation.
Most LDA variants require approximate inference methods, which have the drawback that small changes to the modeling procedure result in a re-derivation of the inference algorithm. To overcome this challenge, some neural topic models that employ a black-box inference mechanism based on the VAE framework have been proposed.
In the VAE framework, an unobserved variable z is generated from some prior distribution $p_\theta(z)$, then a value x is generated from some conditional distribution $p_\theta(x|z)$. By introducing a recognition model $q_\phi(z|x)$ to approximate the intractable true posterior $p_\theta(z|x)$, the marginal log likelihood of x can be bounded as

$$\log p_\theta(x) \ge -D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big].$$

This lower bound to the marginal log likelihood is also known as the evidence lower bound (ELBO). The first term of the ELBO attempts to match the posterior over latent variables to the prior on the latent variables, and the second term aims to reconstruct the data. The principal idea of VAEs is to build an inference neural network, i.e., $q_\phi(z|x)$, which directly maps a document to an approximate posterior distribution over latent variables, and a generative neural network, i.e., $p_\theta(x|z)$, which reproduces from the latent variables a document close to the observed one. To compute the expectations with respect to $q_\phi$, Kingma and Welling (2013) and Rezende et al. (2014) use a Monte Carlo estimator and the "reparameterization trick" (Williams 1992). The two networks are jointly learned through stochastic gradient descent.
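To make these mechanics concrete, the following minimal NumPy sketch computes the two ELBO terms for a toy diagonal-Gaussian VAE; the `encode` and `decode` functions are hypothetical stand-ins for illustration, not the networks used in this paper.

```python
# A minimal sketch of the ELBO terms and the reparameterization trick for a
# diagonal-Gaussian posterior q_phi(z|x) with a standard normal prior N(0, I).
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    # Hypothetical recognition network q_phi(z|x): returns mean and log-variance.
    return 0.1 * x[:2], np.zeros(2)

def decode(z):
    # Hypothetical generative network p_theta(x|z): returns Bernoulli means.
    return 1.0 / (1.0 + np.exp(-np.tile(z, 3)))   # shape (6,)

x = rng.integers(0, 2, size=6).astype(float)       # a toy binary "document"
mu, logvar = encode(x)

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I), so gradients
# could flow through mu and logvar in an autodiff framework.
eps = rng.standard_normal(2)
z = mu + np.exp(0.5 * logvar) * eps

# Term 1: KL(q_phi(z|x) || p(z)), analytic for two diagonal Gaussians.
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# Term 2: single-sample Monte Carlo estimate of E_q[log p_theta(x|z)]
# (Bernoulli log-likelihood here).
p = decode(z)
recon = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

print(f"KL={kl:.3f}  recon={recon:.3f}  ELBO={recon - kl:.3f}")
```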
Consequently, the VAE framework is considered capable of discovering representative topics, as topic modeling procedures do. NVDM (Miao et al. 2016) successfully uses the idea of VAEs to train a topic model with a Gaussian prior for the latent variables. The Dirichlet distribution is not a location-scale family, which hinders the reparameterization utilized in the VAE framework. To overcome this problem, Srivastava and Sutton (2017) employ a Laplace approximation for modeling a Dirichlet prior of the latent variables; Joo et al. (2020) approximate the inverse cumulative distribution function of the Gamma distribution, which is a component of the Dirichlet distribution; Zhang et al. (2018) utilize the Weibull distribution; and Burkhardt and Kramer (2019a) solve this problem based on rejection sampling. Meanwhile, to utilize a Dirichlet prior, Wang et al. (2020a) abandon the VAE framework and propose the Bidirectional Adversarial Topic (BAT) model, which applies bidirectional adversarial training for neural topic modeling. All these unsupervised neural topic models have achieved competitive results. Furthermore, SCHOLAR (Card et al. 2018), built on ProdLDA proposed by Srivastava and Sutton (2017), is a general neural framework for supervised topic models. It can use metadata as labels to help infer topics that are relevant in predicting those labels.
Semi-supervised learning is the branch of machine learning that uses labeled and unlabeled data to perform certain learning tasks, e.g., document classification. Research on statistical topic modeling approaches focused on semi-supervised classification is well documented. Wang et al. (2012) describe semi-supervised LDA based on standard LDA and L-LDA. HSLDA, proposed by Zhang and Wei (2014), adopts the joint distribution of LDA and SLDA to generate semi-supervised topic models. ST LDA, proposed by Pavlinek and Podgorelec (2017), combines self-training with LDA topic models. MCCTM, proposed by Soleimani and Miller (2017), is a semi-supervised model that jointly extracts topics from a collection of text documents and classifies new documents. They report that MCCTM outperforms other semi-supervised topic models with respect to classification performance. There are few neural topic modeling approaches for semi-supervised document classification. Zhou et al. (2020) propose a semi-supervised topic model, S-VAE-GM, under the VAE framework. The approach assumes that a document is modeled as a mixture of classes, and a class as a mixture of latent topics under a Gaussian mixture assumption.

The proposed method
Firstly, we review the SLDA model and introduce the generative story of Neural Labeled LDA (NL-LDA) in Sect. 3.1; then we propose the model inference method in Sect. 3.2. Finally, Sect. 3.3 gives the semi-supervised learning algorithm of the proposed model. We summarize some important notations in Table 1.

Generative story
Our model builds on SLDA, which is a supervised extension of LDA. To incorporate prior information, e.g., classification labels, SLDA adds to LDA a response variable associated with each document. Given the number of labels $K$, the number of documents $D$, and the number of words $V$, $\Theta$ is the matrix of document-topic distributions and $\Phi$ is the matrix of topic-word distributions. $\theta_{dt}$ is the proportion of topic $t$ in document $d$ with $\sum_{t=1}^{K} \theta_{dt} = 1$, $\phi_{tn}$ is the probability of word $n$ under topic $t$ with $\sum_{n=1}^{V} \phi_{tn} = 1$, and $y_d$ is the response variable of document $d$. SLDA as a generative process is summarized in Algorithm 1, where $N_d$ is the number of words in document $d$, and $z_{dn}$ and $w_{dn}$ are the $n$th topic and word in document $d$, respectively. $\eta$ is the hyper-parameter of the Dirichlet prior, and $\gamma$ and $\tau^2$ are the normal linear model parameters; a NumPy walk-through of the process follows.
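The following NumPy sketch walks through the generative process of SLDA (Algorithm 1) under the notation above; all sizes and the values of gamma and tau2 are illustrative assumptions.

```python
# A minimal sketch of SLDA's generative process under the paper's notation.
import numpy as np

rng = np.random.default_rng(1)
K, V, N_d = 4, 10, 20            # topics, vocabulary size, words in document d
eta = np.ones(K)                 # Dirichlet hyper-parameter
Phi = rng.dirichlet(np.ones(V), size=K)    # topic-word distributions phi_t
gamma, tau2 = rng.standard_normal(K), 0.5  # normal linear model parameters

theta_d = rng.dirichlet(eta)               # 1. draw topic proportions theta_d
z_d = rng.choice(K, size=N_d, p=theta_d)   # 2. draw a topic z_dn for each word
w_d = np.array([rng.choice(V, p=Phi[z]) for z in z_d])  # 3. draw word w_dn

z_bar = np.bincount(z_d, minlength=K) / N_d         # empirical topic frequencies
y_d = rng.normal(gamma @ z_bar, np.sqrt(tau2))      # 4. draw the response y_d
print(theta_d.round(2), w_d[:8], round(y_d, 2))
```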
Algorithm 1 Generative process of SLDA

Based on SLDA, we propose NL-LDA. To handle the Dirichlet within the Variational Auto-Encoder (VAE), we follow AVITM proposed by Srivastava and Sutton (2017), who collapse $z$ and employ a logistic normal prior on $\theta_d$ instead of the Dirichlet prior. To approximate a Dirichlet prior with hyper-parameter $\eta$, the mean and diagonal covariance matrix of a multivariate normal prior, i.e., $\mu_1(\eta)$ and $\Sigma_1(\eta)$, are given by the Laplace approximation of Hennig et al. (2012), where $t$ is the index into the vector $\eta$, the vector $\mu_1$, and the matrix $\Sigma_1$. Consequently, we get

$$\mu_{1t} = \log \eta_t - \frac{1}{K}\sum_{k=1}^{K}\log \eta_k, \qquad \Sigma_{1tt} = \frac{1}{\eta_t}\Big(1 - \frac{2}{K}\Big) + \frac{1}{K^2}\sum_{k=1}^{K}\frac{1}{\eta_k}. \quad (1)$$

To transform the document-topic distribution into the document-word distribution, we further replace the matrix product of $\theta_d$ and $\Phi$ with a generative neural network $f_g$ followed by a softmax transform, represented by $\sigma(\cdot)$. Roughly speaking, the neural network is more flexible, while the matrix product is more interpretable. Meanwhile, instead of using a normal linear model as in SLDA, we suggest a new response variable function $f_y$, a more flexible multi-layer neural network followed by a softmax transform $\sigma$. NL-LDA as a generative story is summarized in Algorithm 2.
Algorithm 2 Generative process of NL-LDA

In the proposed algorithm, $y_d$ is the classification label probability distribution of document $d$ with a simplex constraint. For single-label classification, we select the label that has the maximum probability, and for multi-label classification, we select the labels with relatively high probability.
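The sketch below illustrates the generative story of Algorithm 2 under the definitions above; the single linear layers standing in for f_g and f_y, and all sizes, are illustrative assumptions.

```python
# A minimal sketch of NL-LDA's generative story: the Dirichlet prior is
# replaced by its logistic-normal Laplace approximation (mu_1, Sigma_1),
# and f_g / f_y stand in for the two generative networks.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(2)
K, V, N_d = 4, 10, 20
eta = np.ones(K)

# Laplace approximation of Dir(eta), cf. Eq. (1).
mu1 = np.log(eta) - np.log(eta).mean()
var1 = (1.0 / eta) * (1 - 2.0 / K) + (1.0 / eta).sum() / K**2

# Hypothetical generative networks (single linear layers for illustration);
# here the number of labels equals K.
W_g, W_y = rng.standard_normal((V, K)), rng.standard_normal((K, K))

eps = rng.standard_normal(K)
theta_d = softmax(mu1 + np.sqrt(var1) * eps)   # document-topic proportions
word_probs = softmax(W_g @ theta_d)            # document-word distribution
w_d = rng.choice(V, size=N_d, p=word_probs)    # generate the words
y_d = softmax(W_y @ theta_d)                   # label distribution on a simplex
print(theta_d.round(2), y_d.round(2))          # argmax(y_d): single-label choice
```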

Inference method
Typically, to obtain an applicable generative model, we would maximize the log likelihood of the data $p(w_d, y_d)$, which is computationally intractable. To address this issue, the VAE framework is used. We assume that, conditioned on $\theta_d$, the joint distribution factorizes as $p(w_d \mid \theta_d)\,p(y_d \mid \theta_d)$, i.e., the values $w_d$ and $y_d$ of document $d$ are generated independently from the conditional distributions given $\theta_d$, so

$$p(w_d, y_d) = \int p(\theta_d)\,p(w_d \mid \theta_d)\,p(y_d \mid \theta_d)\,d\theta_d. \quad (2)$$

We also assume a recognition model $q(\theta_d \mid w_d)$, which is an approximation to the intractable true posterior $p(\theta_d \mid w_d)$. For document $d$, using Jensen's inequality and Eq. (2), we get the ELBO

$$\log p(w_d, y_d) \ge \mathbb{E}_{q(\theta_d \mid w_d)}\big[\log p(w_d \mid \theta_d)\big] + \mathbb{E}_{q(\theta_d \mid w_d)}\big[\log p(y_d \mid \theta_d)\big] - D_{KL}\big(q(\theta_d \mid w_d)\,\|\,p(\theta_d)\big). \quad (3)$$

Meanwhile, we incorporate the prior label information through $p(y_d \mid \theta_d)$, which is utilized to generate predictive labels.

We can write the modified variational objective function following AVITM, one of the recent VAE methods for efficient black-box inference in topic models. Given two inference networks $f_\mu$ and $f_\Sigma$, for document $d$ we define the distribution $q(\theta_d)$ to be logistic normal with mean $\mu_0 = f_\mu(w_d)$ and diagonal covariance matrix $\Sigma_0 = \mathrm{diag}(f_\Sigma(w_d))$, where diag converts a column vector to a diagonal matrix. With the "reparameterization trick", we generate samples from $q(\theta_d)$ by computing

$$\theta_d = \sigma\big(\mu_0 + \Sigma_0^{1/2}\epsilon\big), \qquad \epsilon \sim \mathcal{N}(0, I). \quad (4)$$

Based on Algorithm 2, the first two terms in Eq. (3) can be estimated with a Monte Carlo sample as

$$\mathbb{E}_{q(\theta_d \mid w_d)}\big[\log p(w_d \mid \theta_d)\big] \approx w_d^{\top}\log\sigma\big(f_g(\theta_d)\big), \qquad \mathbb{E}_{q(\theta_d \mid w_d)}\big[\log p(y_d \mid \theta_d)\big] \approx y_d^{\top}\log\sigma\big(f_y(\theta_d)\big).$$

Based on Eqs. (1) and (4), the third term of Eq. (3) is the KL divergence between two multivariate normal distributions and can be computed in closed form as

$$D_{KL}\big(q(\theta_d \mid w_d)\,\|\,p(\theta_d)\big) = \frac{1}{2}\Big[\mathrm{tr}\big(\Sigma_1^{-1}\Sigma_0\big) + (\mu_1 - \mu_0)^{\top}\Sigma_1^{-1}(\mu_1 - \mu_0) - K + \log\frac{|\Sigma_1|}{|\Sigma_0|}\Big].$$

To maximize the ELBO, we minimize the loss $L_l(w_d, y_d)$, which is written as

$$L_l(w_d, y_d) = L_w + L_y + L_{KL}, \quad (5)$$

where $L_w = -w_d^{\top}\log\sigma(f_g(\theta_d))$ is the word reconstruction term, $L_y = -y_d^{\top}\log\sigma(f_y(\theta_d))$ is the label term, and $L_{KL}$ is the KL divergence term. To enhance the effects of prior labels, we weight $L_y$ with a hyper-parameter $\alpha$ and rewrite Eq. (5) as

$$L_l(w_d, y_d) = L_w + \alpha L_y + L_{KL}. \quad (6)$$

To optimize Eq. (6), we apply stochastic gradient descent using Monte Carlo samples of $\epsilon \sim \mathcal{N}(0, I)$.
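A forward-pass sketch of Eqs. (4)-(6) may help; the linear maps standing in for f_mu, f_Sigma, f_g, and f_y are hypothetical, and a single Monte Carlo sample is used.

```python
# A forward-pass sketch of the supervised loss in Eq. (6).
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(3)
K, V = 4, 10
alpha = 10.0                                   # weight of the label loss L_y

# Prior from the Laplace approximation, cf. Eq. (1).
eta = np.ones(K)
mu1 = np.log(eta) - np.log(eta).mean()
var1 = (1.0 / eta) * (1 - 2.0 / K) + (1.0 / eta).sum() / K**2

# Hypothetical inference/generative networks (linear, for illustration).
W_mu, W_s = rng.standard_normal((K, V)), rng.standard_normal((K, V))
W_g, W_y = rng.standard_normal((V, K)), rng.standard_normal((K, K))

w_d = rng.integers(0, 3, size=V).astype(float) # bag-of-words counts
y_d = np.eye(K)[1]                             # one-hot prior label

mu0, logvar0 = W_mu @ w_d, W_s @ w_d           # f_mu, f_Sigma
eps = rng.standard_normal(K)
theta = softmax(mu0 + np.exp(0.5 * logvar0) * eps)        # Eq. (4)

L_w = -w_d @ np.log(softmax(W_g @ theta))      # word reconstruction loss
L_y = -y_d @ np.log(softmax(W_y @ theta))      # label loss
var0 = np.exp(logvar0)                         # diagonal KL between Gaussians
L_KL = 0.5 * np.sum(var0 / var1 + (mu1 - mu0) ** 2 / var1
                    - 1.0 + np.log(var1) - logvar0)

print(round(L_w + alpha * L_y + L_KL, 3))      # Eq. (6)
```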

Semi-supervised learning
We assume that the training corpus W consists of labeled and unlabeled documents, denoted as W_l and W_u, respectively. While learning on unlabeled documents, the term L_y in Eq. (6) is unavailable, so the proposed model behaves like AVITM. Actually, the topics of a document learned by topic modeling approaches form a low-dimensional representative space known as a manifold. The manifold assumption, i.e., that all data points lie on multiple low-dimensional manifolds and data points lying on the same manifold often have the same label, extends many algorithms to the semi-supervised setting (Sheikhpour et al. 2017; Engelen and Hoos 2019). So learning on the unlabeled data W_u is useful for constructing the manifold, and thus helps improve the model performance.
To further improve the semi-supervised learning performance, we apply the low-density assumption (Grandvalet and Bengio 2004) to the proposed model through an entropy regularization term,

$$L_e = -\sum_{k=1}^{K} p(y_{dk} \mid \theta_d)\log p(y_{dk} \mid \theta_d).$$

Consequently, the objective function while learning on unlabeled documents W_u is

$$L_u(w_d) = L_w + L_{KL} + \beta L_e,$$

where β is a hyper-parameter. While training on labeled data, the label generative network f_y is optimized by L_y; however, during unsupervised learning it would be optimized only by L_e, with the result that the network cannot converge. To address this issue, during unsupervised learning we optimize the inference networks, i.e., f_μ and f_Σ, as well as the generative network f_g, but not f_y. On the contrary, we optimize all networks using Eq. (6) during supervised learning. The learning algorithm of NL-LDA is summarized as Algorithm 3, and a sketch of the alternating updates follows.
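The following PyTorch sketch illustrates the alternating updates just described (our actual implementation uses TensorFlow); for brevity it assumes a standard normal prior in L_KL rather than Eq. (1), and the tiny linear networks are placeholders.

```python
# Alternating semi-supervised updates: labeled batches optimize all networks
# with Eq. (6); unlabeled batches optimize f_mu, f_Sigma, f_g with
# L_w + L_KL + beta * L_e, keeping the label network f_y fixed.
import torch
import torch.nn.functional as F

K, V, alpha, beta = 4, 10, 10.0, 1.0
f_mu, f_sigma = torch.nn.Linear(V, K), torch.nn.Linear(V, K)
f_g, f_y = torch.nn.Linear(K, V), torch.nn.Linear(K, K)
opt = torch.optim.Adam([p for m in (f_mu, f_sigma, f_g, f_y)
                        for p in m.parameters()], lr=2e-3)

def shared_losses(w):
    mu0, logvar0 = f_mu(w), f_sigma(w)
    theta = F.softmax(mu0 + torch.exp(0.5 * logvar0) * torch.randn_like(mu0), -1)
    L_w = -(w * F.log_softmax(f_g(theta), -1)).sum(-1).mean()
    # Standard-normal prior here for brevity; the model uses Eq. (1).
    L_KL = 0.5 * (torch.exp(logvar0) + mu0 ** 2 - 1 - logvar0).sum(-1).mean()
    return theta, L_w, L_KL

def step(loss, freeze_fy):
    opt.zero_grad()
    loss.backward()
    if freeze_fy:                              # keep the label network fixed
        for p in f_y.parameters():
            p.grad = None                      # Adam skips params without grads
    opt.step()

def labeled_step(w, y):                        # optimize all networks, Eq. (6)
    theta, L_w, L_KL = shared_losses(w)
    L_y = -(y * F.log_softmax(f_y(theta), -1)).sum(-1).mean()
    step(L_w + alpha * L_y + L_KL, freeze_fy=False)

def unlabeled_step(w):                         # optimize all networks but f_y
    theta, L_w, L_KL = shared_losses(w)
    p = F.softmax(f_y(theta), -1)
    L_e = -(p * torch.log(p + 1e-10)).sum(-1).mean()   # entropy regularizer
    step(L_w + L_KL + beta * L_e, freeze_fy=True)

w_l = torch.rand(8, V)
y_l = F.one_hot(torch.randint(0, K, (8,)), K).float()
labeled_step(w_l, y_l)
unlabeled_step(torch.rand(8, V))
```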

Experiments
In this section, we evaluate NL-LDA on several typical document classification tasks. Firstly, we introduce the collections and the metrics; then the implementation details are presented. Lastly, we list the results of our model and the compared approaches on supervised and semi-supervised classification.

Collection and metric
We select five typical collections to evaluate the performance of the proposed models. All these datasets are publicly available and have been widely used in the existing document classification literature, including some of the most classical topic modeling approaches and neural models under the VAE framework.
Yahoo Arts and Health multi-label subsets are from the Yahoo Collection (Ueda and Saito 2002). We randomly selected 6,441 and 8,109 documents for training, respectively, ensuring that each label appeared at least once. 20NewsGroups is a collection of news articles across 20 different newsgroups, which are considered as 20 different classification labels. In our experiments, we used 18,846 samples; 60% of them were selected for training and the remaining items for testing. The IMDB dataset contains 50,000 movie reviews with either positive or negative sentiment, i.e., there are two classes in this dataset. We divided it equally for training and testing. AGNews is a massive collection of news articles. We used the same training and test data presented by Zhang et al. (2015), who chose the four largest classes from the original dataset. Each class contains 30,000 training documents and 1,900 testing items. After deleting stop words, we removed low-frequency words, and about 8,000 words were retained in each dataset. The datasets are summarized in Table 2.
To compare classification accuracies, we compute the correct classification rate (CCR) as follows:

$$\mathrm{CCR} = \frac{1}{D}\sum_{d=1}^{D}\delta(y_d, \hat{y}_d),$$

where $y_d$ and $\hat{y}_d$ denote the true and predicted class labels of document d, respectively. When classifying single-label datasets, δ is an indicator variable such that $\delta(y_d, \hat{y}_d) = 1$ if $y_d = \hat{y}_d$ and zero otherwise. When classifying multi-label datasets, we define $\delta(y_d, \hat{y}_d) = 1$ if the top-ranked predicted label is in $y_d$, and zero otherwise. Larger values imply better performance. For multi-label classification, we additionally consider binary prediction metrics, i.e., Macro-F1 and Micro-F1 scores, to evaluate our model. Firstly, we define recall (R), precision (P), and the F1-score (F1) (Goutte and Gaussier 2005) for a document as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2PR}{P + R},$$

where TP, FP, and FN are the numbers of true positive, false positive, and false negative labels of the document. The Macro-F1 score averages the F1-scores over all testing documents, while the Micro-F1 score considers the full testing corpus as a document (Yang 1999), i.e., it is computed from the aggregate counts of TP, FP, and FN.
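As an illustration, the following Python snippet computes CCR and the document-pivoted Macro/Micro-F1 described above; treating Macro-F1 as an average over documents follows our reading of the definitions and should be taken as an assumption.

```python
# Illustrative CCR and document-pivoted Macro/Micro-F1 on toy label sets.
import numpy as np

def ccr(y_true, y_top):
    # y_true: list of true label sets; y_top: top-ranked predicted label per doc.
    return np.mean([1.0 if t in s else 0.0 for s, t in zip(y_true, y_top)])

def f1_counts(true_set, pred_set):
    tp = len(true_set & pred_set)
    return tp, len(pred_set) - tp, len(true_set) - tp    # TP, FP, FN

def macro_micro_f1(y_true, y_pred):
    per_doc, totals = [], np.zeros(3)
    for s, p in zip(y_true, y_pred):
        tp, fp, fn = f1_counts(s, p)
        totals += (tp, fp, fn)
        per_doc.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    tp, fp, fn = totals
    micro = 2 * tp / (2 * tp + fp + fn)      # corpus treated as one document
    return np.mean(per_doc), micro           # Macro-F1 (over documents), Micro-F1

y_true = [{0, 2}, {1}, {2, 3}]
y_pred = [{0}, {1, 3}, {3}]
print(ccr(y_true, [0, 1, 3]), macro_micro_f1(y_true, y_pred))
```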

Implementation
To encode the posterior distribution over latent variables, we utilize two inference networks f_μ and f_Σ for the mean and log diagonal covariance of the logistic normal, respectively. The encoders f_μ and f_Σ are designed as multi-layer neural networks with two shared fully connected layers, as well as an exclusive batch-norm layer each. The networks use the softplus activation in the hidden layers. To reduce overfitting, we apply dropout to the second fully connected layer. The encoders take the bag-of-words representation of documents as input, and the size of the output layers is constrained to equal the number of topics. To decode the hidden variable sampled from the logistic normal, we utilize two generative networks f_g and f_y for generating documents and labels, respectively. The decoders f_g and f_y are designed as two connected layers plus one batch-norm layer. Like the encoders, the decoders also use softplus in the hidden layers and apply dropout to the input. The output of the decoder f_g is transformed into word probabilities by a softmax layer, and the output of the decoder f_y is transformed into label probabilities by a softmax layer. A sketch of this wiring follows.
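The following PyTorch sketch reconstructs this wiring (our actual implementation uses TensorFlow); the dropout rate and the exact layer arrangement are inferred, and the layer sizes anticipate the settings reported in the next paragraph.

```python
# An illustrative reconstruction of the encoder/decoder architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Shared trunk for f_mu and f_Sigma: two shared FC layers with softplus,
    dropout on the second layer, then an exclusive batch-norm head each."""
    def __init__(self, vocab=8000, topics=200):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(vocab, 2000), nn.Softplus(),
            nn.Linear(2000, 1000), nn.Softplus(), nn.Dropout(0.2))
        self.mu = nn.Sequential(nn.Linear(1000, topics), nn.BatchNorm1d(topics))
        self.logvar = nn.Sequential(nn.Linear(1000, topics), nn.BatchNorm1d(topics))

    def forward(self, bow):
        h = self.shared(bow)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Template for f_g (out_dim = vocabulary) and f_y (out_dim = labels):
    dropout on the input, hidden FC layers with softplus, one batch norm,
    and a softmax output."""
    def __init__(self, topics=200, out_dim=8000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(topics, 1000), nn.Softplus(),
            nn.Linear(1000, 2000), nn.Softplus(),
            nn.Linear(2000, out_dim), nn.BatchNorm1d(out_dim))

    def forward(self, theta):
        return F.softmax(self.net(theta), dim=-1)

enc, f_g, f_y = Encoder(), Decoder(), Decoder(out_dim=20)  # 20 labels, e.g.
mu, logvar = enc(torch.rand(4, 8000))
theta = F.softmax(mu + torch.exp(0.5 * logvar) * torch.randn_like(mu), dim=-1)
print(f_g(theta).shape, f_y(theta).shape)
```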
The sizes of the hidden layers of f_μ and f_Σ are 2,000 and 1,000, and those of f_g and f_y are 1,000 and 2,000, respectively. We heuristically set the number of topics to 200. Figure 1 shows in detail the network architecture of the proposed model used in the experiments. To train the model, we set the batch size to 200 and the maximum number of epochs to 2,500, as well as α = 10 and β = 1. To initialize the weights of the networks, we use the Xavier uniform initializer in TensorFlow (Abadi et al. 2016). The Adam optimizer is used with a second exponential decay rate of 0.99 and a learning rate of 0.002. It is worth noting that, in each epoch, we use different objective functions to optimize different sub-networks for labeled and unlabeled documents, respectively. In the implementation, we save the first value of L_y. After each training epoch, we obtain the value of L_y, which indicates the label prediction performance of the proposed model. The network models and the value of L_y are saved if the new L_y is better than the existing one, i.e., if its value is lower than the recorded one. After training, the saved model is used for label prediction.

To evaluate the proposed model on supervised classification, we use the Yahoo subsets, which represent multi-label classification tasks, as well as the 20NewsGroups, IMDB, and AGNews datasets, which represent single-label classification tasks. Dependency-LDA (Rubin et al. 2011), TL-LDA (Wang et al. 2020b), and SCHOLAR (Card et al. 2018) are chosen as the baselines. The first two are state-of-the-art supervised topic modeling approaches, and the latter is a neural topic model that incorporates metadata, including prior labels. We utilize the original author-provided implementations of Dependency-LDA and TL-LDA without modifying the hyper-parameters and sampling parameters, as well as SCHOLAR with tone as a label, vocabulary_size = 5,000, and embedding_dim = 300. The number of topics and training epochs have an impact on the performance of SCHOLAR; we experimentally set the topic numbers for 20NewsGroups, IMDB, and AGNews to 50, 40, and 20, respectively, and the numbers of training epochs to 2,000, 500, and 200, respectively.

Evaluation on supervised document classification
CCR and binary prediction results are listed in Tables 3 and 4, respectively. Our model performs well in terms of CCR (Table 3). It gets the best scores among all compared algorithms, including the traditional statistical topic models, i.e., Dependency-LDA and TL-LDA, as well as the neural topic model, i.e., SCHOLAR. Furthermore, the proposed approach has significant advantages across all five datasets, while each compared model performs poorly on some particular dataset. For example, TL-LDA performs well on most of the datasets except IMDB, and SCHOLAR performs well on IMDB and AGNews but poorly on 20NewsGroups. Table 4 demonstrates that the proposed NL-LDA performs better than the compared models on Micro-F1 and Macro-F1. It is clear that the proposed approach performs well on multi-label document classification.

Evaluation on semi-supervised document classification
To evaluate the proposed model on semi-supervised classification, we use 20NewsGroups, AGNews, and Yahoo Arts. The training data labels were randomly removed in a given proportion. Figure 2 shows the CCR results of the proposed NL-LDA in semi-supervised and supervised mode, i.e., learning only from labeled data, respectively. It is clear that the semi-supervised mode performs significantly better than the supervised mode of NL-LDA under a small amount of labeled data. For example, the CCR difference between the two modes is about 1% when 20% or more labels are reserved on the 20NewsGroups dataset, while the difference is more than 7% with 5% labels reserved, i.e., 565 labeled training samples. AGNews and Yahoo Arts show similar trends. The semi-supervised mode scores about 17% higher than the supervised mode with 0.2% labels reserved, i.e., 240 labeled documents, on the AGNews dataset, and about 13% higher with 5% labels reserved, i.e., 322 documents, on Yahoo Arts. We further compare against existing results of semi-supervised topic modeling approaches, including a neural topic model and two traditional statistical topic modeling approaches.
Fig. 2 Experimental CCR results of semi-supervised (twill) and supervised (grey) modes of NL-LDA with different labeled data proportions

S-VAE-GM (Zhou et al. 2020), which is a semi-supervised topic model under the VAE framework with a Gaussian mixture assumption, demonstrates competitive performance on the 20NewsGroups, IMDB, and AGNews datasets. To compare with the reported results of S-VAE-GM, we computed label-pivoted F1-scores, i.e., F1-scores computed for specific labels, with 20% labels reserved on the three datasets, and used box plots following Zhou et al. (2020). In Fig. 3, the box plots indicate the distributions of label-pivoted F1-scores of our model as well as the corresponding results of S-VAE-GM. Obviously, even considering the randomness of the training and test data selection, our model performs better than S-VAE-GM.
MCCTM (Soleimani and Miller 2017) is a class-based mixture of topic models for classifying documents using both labeled and unlabeled examples. The reported results show that it achieves better CCR than some state-of-the-art semi-supervised topic modeling approaches. ssLDA is an extension of Supervised LDA to a semi-supervised framework for document classification (Blei and McAuliffe 2010; Wang et al. 2009). To compare with the reported results of MCCTM and ssLDA, we use 20NewsGroups following Soleimani and Miller (2017), i.e., 13,105 and 5,063 documents in the training and test sets, respectively. Table 5 lists the results (bold entries denote the best scores), which show that the proposed NL-LDA has significant advantages over the compared approaches on six of the seven reserved label proportions. It scores about 10% higher than the second-best results when 30% or more labels are reserved. However, MCCTM performs best under a very small amount of labeled data, i.e., 5% labeled samples, where it scores about 2% higher than ours.

Discussion
The experimental results clearly demonstrate that the proposed NL-LDA has significant advantages on supervised document classification relative to the compared topic modeling approaches, including traditional statistical and neural topic models. Statistical topic modeling approaches based on LDA have been widely applied in the field of document classification, including multi-label and single-label multi-class classification; however, they require deriving a specific inference method for each customization of LDA. Our work not only supports single-label and multi-label document classification, but also, thanks to the VAE framework, requires no model-specific derivations for customized modifications.
One striking aspect of the experimental results is that NL-LDA outperforms SCHOLAR, which is also a neural topic modeling approach based on the VAE framework. One difference between the two models is the method of incorporating prior labels. SCHOLAR incorporates the prior information into the input of the model, whereas the proposed model incorporates prior labels through the objective term L_y, which acts directly on the output of the neural networks and helps improve performance during back-propagation training. This finding differs from traditional statistical topic modeling approaches, where, roughly speaking, generating the prior side information first and then generating the words conditioned on it yields better predictive ability (Soleimani and Miller 2017).
The other difference is that the proposed model can adjust the weight of the prior knowledge through the hyper-parameter α, whereas SCHOLAR has no corresponding mechanism. Furthermore, for test data without prior labels, SCHOLAR needs to specially prepare an encoder without labels or to replace the label vector in the inference network input with a vector of all zeros, which is a cumbersome procedure. On the contrary, our model can easily be used for prediction without any modification.
Another interesting aspect of the results is that the proposed NL-LDA performs particularly well on semi-supervised document classification. Firstly, the proposed approach makes good use of unlabeled data to improve the model performance, especially under a small amount of labeled data (Fig. 2). Secondly, NL-LDA scores better than the compared existing semi-supervised topic modeling approaches, including neural and classical statistical topic models (Fig. 3, Table 5). The results suggest that the two assumptions, i.e., the manifold and low-density assumptions, have helped to improve the performance of semi-supervised document classification. Our model applies the low-density assumption through the entropy regularization term, which is weighted by the hyper-parameter β. To demonstrate the effects of the entropy regularization term on semi-supervised document classification, we use the Yahoo subsets, 20NewsGroups, IMDB, and AGNews with 10% labels reserved. Table 6 lists the results. It is clear that the entropy regularization term has positive effects on semi-supervised classification tasks.
However, one limitation of neural topic models is that many network parameters need to be trained. The traditional statistical semi-supervised modeling approach, MCCTM, scores about 2% higher than our model under a very small amount of labeled data on the 20NewsGroups dataset. Another statistical model, ssLDA, scores only about 3% lower than ours in that setting, while the proposed model scores 15% higher than ssLDA with 50% or more labeled data. These results suggest that statistical approaches model better than neural topic models under a very small amount of labeled data, because neural models have more parameters and need a certain amount of labeled data to train.
Pre-trained language models such as BERT (Devlin et al. 2018) and T5 (Raffel et al. 2019) have shown outstanding performance as core methods in recent years. However, because topic models are bag-of-words-based models, word embeddings obtained from pre-trained models cannot be used directly as input. Some works suggest integrating topic models and BERT in a unified framework for semantic similarity detection (Peinelt et al. 2020), text summarization (Ma et al. 2021), sentiment analysis (Palani et al. 2021), and document classification (Chaudhary et al. 2020). These efforts demonstrate impressive performance while losing the interpretability of topic modeling approaches for document classification. Still, more research is needed to address this challenge. We leave this for future work.

Conclusion
Statistical topic models based on LDA have been widely developed in the field of document classification. Some of them can support semi-supervised classification and achieve results competitive with state-of-the-art approaches. However, these customizations of LDA have a drawback: small changes to the modeling procedure result in a re-derivation of the inference algorithm, which limits their application. To address this issue, we propose a novel semi-supervised topic model, Neural Labeled LDA (NL-LDA), based on the VAE framework, which provides black-box inference. An additional label generative network with a weight parameter is employed to incorporate prior knowledge, resulting in flexibility, simplicity, and outstanding performance. NL-LDA supports semi-supervised learning based on two assumptions, i.e., the manifold assumption and the low-density assumption. It is worth noting that we utilize different objective functions and optimize different sub-networks when learning from labeled and unlabeled documents, respectively, and the proposed model has a consistent and concise inference method for semi-supervised learning and prediction.
We evaluate the proposed model and the compared methods, including traditional statistical and neural topic models, on supervised and semi-supervised document classification tasks. The results show that the proposed model has significant advantages across all experimental datasets, including single-label and multi-label datasets. The proposed NL-LDA performs particularly well on semi-supervised classification, especially under a small amount of labeled data.
In conclusion, the proposed NL-LDA has three significant advantages. Firstly, our model has a concise inference method thanks to the VAE framework. Secondly, our model has good flexibility because it incorporates metadata through a generative network. Thirdly, our model performs well on both supervised and semi-supervised document classification.
In the future, we intend to further improve the performance of the proposed model by adopting pre-trained language models, and plan to apply the models to some other applications, e.g., news video segmentation and summarization.