The detection of mental health conditions by incorporating external knowledge

Mental health conditions have become a growing problem; they increase the likelihood of premature death for patients and impose a high economic burden on the world. However, some studies have shown that if patients are detected and treated early, the social impact and economic costs of mental illness can be reduced. With the popularity of social media, people share their feelings on it, which allows social media data to be used to study mental health conditions. However, past research has been limited to optimizing the model or using the different types of data available on social media, resulting in models that rely only on data to make decisions. People, in contrast, judge things not only by the data collected but also by background knowledge. Therefore, we considered the diagnostic process of doctors and incorporated the knowledge of psychological screening tools and diagnostic criteria into the model. In addition, we tested the effect of incorporating general knowledge. We retrieve the top m most relevant knowledge segments for each user’s post and then feed both into the prediction model. Experimental results show that our method outperforms previous studies, with the F1-score increasing by more than 10% in some situations. Moreover, because the knowledge segments are retrieved automatically, our method requires no additional manual labeling, and the knowledge set can be freely adjusted. These results show that our method can help detect mental health conditions and can be continuously optimized in practice.


Introduction
According to the World Health Organization (WHO), mental health has become an important indicator of sustainable development (World Health Organization, 2019). Statistics show that people with mental disorders have disproportionately high rates of disability and death. For example, people with schizophrenia and depression are 40% to 60% more likely to die prematurely than the general population (World Health Organization, 2021). A meta-study of 174 surveys estimated the proportion of people in the world suffering from mental illness at 29.2% (Steel et al., 2014). Worse yet, mental illness is one of the leading causes of disability, driving the global cost of mental health treatment into trillions of dollars (World Health Organization, 2021; Patel et al., 2018).
Some studies have shown that if mental illness is detected and treated early, the treatment and long-term results are greatly improved (Bird et al., 2010; Treasure & Russell, 2011). In the long run, detecting mental illness early will reduce its impact on our society and reduce the economic burden. WHO's Mental Health Action Plan (World Health Organization, 2021) suggests strengthening research on mental health information systems. We follow this suggestion and focus on the early detection of mental illness by analyzing social media data.
Owing to the popularity of the modern Internet, social media is becoming a closer part of people's lives, and most people are increasingly willing to share their lives through social networks. This allows us to indirectly understand the inner world of people through their posts on social media. With the participation of billions of people, social media data is large enough to be suitable for deep learning.
Many studies began to collect datasets from social media for the analysis or detection of mental health conditions (Coppersmith et al., 2015; MacAvaney et al., 2021; Benton et al., 2017). At the same time, many studies tried to improve the prediction performance by changing the model architecture or considering different types of data (Gui et al., 2019; Shen et al., 2017). As a result, the performance of the proposed models was often limited by the distribution of the training data, which constrains the generality of the models and also makes the models unable to scale.
Moreover, people judge things not only by the data collected, but also by experience and background knowledge. This means that in the process of making decisions, people often incorporate relevant knowledge and follow established conventions. Clinically, the doctor or psychotherapist asks the patient questions based on mental health screening tools to understand their psychological state and symptoms (Butcher et al., 2001; Krug et al., 2008). A final assessment is then made based on the physician's expertise and standard diagnostic criteria (American Psychiatric Association, 2013). A precise diagnosis of mental illness therefore requires a great deal of knowledge.
To overcome overreliance on data gathered from social media, we draw on the decision-making processes of people in general and doctors in particular to incorporate external knowledge into the model. The main idea is to incorporate relevant knowledge from the mental health screening tools and diagnostic criteria into the deep learning model so that the model gains a psychological perspective. As an extension, we also explore the impact of introducing simple common sense into the model. We collected screening tools for the mental health conditions, Wikipedia's mental health-related entries 1,2 and the authoritative book DSM-5 as our psychological knowledge. The contents of Wiki dpr 3, which were created by Hugging Face, are also collected and treated as common sense. Our goal is to study whether external knowledge from psychology or common sense can improve the predictive ability and interpretability of the model. Introducing external knowledge has another advantage: the external knowledge can be retrieved automatically and replaced freely. The former means that incorporating the external knowledge does not need manual annotation, because the relevant knowledge is found automatically. This saves substantial labor costs and makes our method easier to adopt widely. The latter means that the contents of the external knowledge can be adjusted freely, which solves the scalability problem for a large-scale model.
Our work includes the following four steps: (1) gather and index the employed external knowledge, (2) incorporate the knowledge into the model to aid in prediction, (3) add an attention layer for determining which posts and knowledge receive more attention, and (4) use a fully connected layer and a sigmoid activation function to predict mental health conditions.
Our contributions can be summarized as follows:
- We incorporate relevant external knowledge into the model, and the experiment results show that the F1-score of the prediction is increased by more than 10% in some situations compared with existing approaches.
- By providing the model with external knowledge from psychology and other fields, humans can better understand what the model has learned, which makes it easier for researchers to optimize the model.
- This method can be automated, and the external knowledge can be freely adjusted. This minimizes the cost of manual annotation and solves the scalability problem of the model.
- Finally, some statistical guidelines are provided for those who want to adopt our approach to build external knowledge for their model.
The rest of the paper is organized as follows: the related works are reviewed in Section 2, the details of the external knowledge are introduced in Section 3, our approach is described in Section 4, the experiment results are presented in Section 5, and the conclusion is provided in Section 6.

Related work
In this section, we first describe the datasets for detecting mental health conditions, and then introduce the past work on the detection of mental health conditions.

Data collection from social media
With the rise of social media, researchers have begun to use social media data to study mental illness (Park et al., 2012). Moreover, a growing body of research is concentrating on analyzing the copious amounts of text on social media to learn more about mental health conditions (Coppersmith et al., 2015; Birnbaum et al., 2017). However, these studies use manual annotation to label data. Although manual labeling can obtain reliable data, the number of users from whom the data is collected is limited (Choudhury et al., 2021). Even when crowdsourcing is used to collect and label data, it is still difficult to collect a large amount of user data.
To collect more data, Coppersmith et al. (2014) developed a method to identify self-diagnostic posts on social media by using regular expressions, which was widely used to collect data from users with mental disorders on social media. Four types of mental health conditions were considered, and the collected tweets were analyzed using corresponding linguistic features and predictive models. Since depressive users tend to express their emotions and even reveal the fact of being diagnosed on social media, labeling users by this method can often achieve a high degree of reliability. Cohan et al. (2018) extended this approach using Reddit data for a larger number of mental health conditions, and called the dataset Self-reported Mental Health Diagnoses (SMHD). A self-reported diagnosis means that if, for example, "I was diagnosed with depression last year" is found in a post, then this user is considered a depression patient. The dataset contains data from users with nine different mental health conditions: depression disorder, attention deficit hyperactivity disorder (ADHD), anxiety disorder, bipolar disorder, post-traumatic stress disorder (PTSD), autism disorder, obsessive-compulsive disorder (OCD), schizophrenia, and eating disorder. Note that a user may have one or more mental disorders.
For each diagnosed user, nine or more control users were collected according to the following restrictions (Cohan et al., 2018): the number of posts by the control user must be between half and twice that of the diagnosed user, and the control user must have at least one post on a subreddit where the diagnosed user once posted. It is important to note that these control users cannot have any mental health-related post. Likewise, the posts of each diagnosed user are filtered by removing those containing mental health signals, leaving only general posts in the final dataset, which allows the text analysis to focus on the tendencies of diagnosed users in general posts. Table 1 shows the statistics of the posts in the two groups of diagnosed users and control users.

Detection of mental health conditions on social media
To improve the prediction performance, early studies focused more on the optimization of the model (Jiang et al., 2020; Murarka et al., 2021) or on considering the different types of data available on social media. Some studies (Choudhury et al., 2021; Coppersmith et al., 2014; Reece & Danforth, 2017) used handcrafted features from different types of data, such as the number of posts per day or the number of faces in a photo, as input for the predictive model. Other studies combined different types of data, such as text and images, to construct a multimodal model (Gui et al., 2019; Shen et al., 2017) for the prediction. However, relying only on social media data to determine whether a user suffers from mental health conditions is less convincing. Past research has focused on adding different types of features, such as the handcrafted features and image features mentioned above, so that the predictive performance of the model is limited by the training data. Moreover, because the model parameters cannot be easily changed, the model's generalizability is limited and the model itself is unable to scale.
In other fields, researchers solve the above-mentioned problems by incorporating external knowledge into the model. For example, Ghazvininejad et al. (2018) used the memory network to import external knowledge in the conversation generation domain to enable a chatbot to answer questions asked by humans. With the introduction of the external knowledge, the chatbot can answer questions with knowledge not from the training data. For another example, in the question answering domain, Li et al. (2020)  Although these methods have shown the effectiveness of incorporating external knowledge, the method of retrieving relevant knowledge into the model is too straightforward. In recent years, owing to the excellent feature extraction capabilities of Pre-trained Language Models (PLMs), the Meta AI team developed a retrieval method based on high-dimensional feature representations, Dense Passage Retrieval (DPR) (Karpukhin et al., 2020), which greatly increased the accuracy and reliability of knowledge retrieval. The experiment results show that the retrieval accuracy of DPR not only exceeds that of the traditionally used BM25 (Robertson & Zaragoza, 2009) but, because DPR is based on PLMs, can also continue to improve with more training. The Meta AI team also used DPR to achieve gratifying results in the field of open-domain question answering and generation (Lewis et al., 2020).
Because of DPR's outstanding retrieval ability, we use it to retrieve external knowledge, and then import the relevant knowledge and posts into deep learning models for predicting mental health conditions. The model is then tested on the SMHD dataset to evaluate its performance.

Introduction of external knowledge
We consider two types of external knowledge to incorporate into our model. One is the knowledge that a clinical psychologist uses, called psychological knowledge, and the other is unconstrained general knowledge. By combining these two, we hope to effectively improve the predictive performance of the model and enhance the interpretability of the final results.

Psychological knowledge
Psychological knowledge is collected from three main sources:
1. Screening tools for the mental health conditions (psychological test questionnaires).
2. Wikipedia's mental health-related entries.
3. DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (American Psychiatric Association, 2013)).

Screening tools
Screening tools are used to assess an individual's mental health status or to identify signs or symptoms of mental disorders. These tools help clinicians understand the individual's situation in order to provide suitable treatment. Nine types of screening tools were collected, one type for each mental health condition included in SMHD. Additional screening tools that can be used simultaneously for multiple disorders, such as MMPI (Butcher et al., 2001), were also collected. In total, 10 types of screening tools were collected. The collected screening tools are listed in the Appendix.
In order to turn the screening tool into model-usable knowledge, it needs to be broken down into knowledge segments.We treat each question in the screening tool as a knowledge segment.If the question is too long to fit within the length limit (100 tokens) of a knowledge segment, we divide it into different segments.
For the ten types of screening tools, a total of 50 screening tools with 2556 questions were collected and divided into 2674 knowledge segments as shown in Table 2.
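The segmentation rule described above (one question per segment, with over-long questions split at the length limit) can be sketched as follows. This is a minimal illustration, not the authors' code: whitespace tokenization and the function names `split_into_segments` and `questions_to_segments` are assumptions, since the text does not say which tokenizer defines the 100-token limit.

```python
def split_into_segments(text, max_tokens=100):
    """Split text into consecutive segments of at most max_tokens tokens.

    Assumption: tokens are whitespace-separated words. The paper does not
    state which tokenizer is used to measure the 100-token limit.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def questions_to_segments(questions, max_tokens=100):
    """Turn each questionnaire item into one or more knowledge segments."""
    segments = []
    for question in questions:
        # A short question yields one segment; a long one yields several.
        segments.extend(split_into_segments(question, max_tokens))
    return segments
```

Under this rule the number of segments is never smaller than the number of questions, which is consistent with 2556 questions producing 2674 segments.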

Wikipedia's mental health-related entries
In Wikipedia, related entries are grouped and placed in templates. When collecting Wikipedia data, we found two templates related to mental health conditions, namely mood disorders and mental disorders, and scraped all the webpages listed there. These webpages contain the description of the diseases, the symptoms, and other information, making the knowledge more comprehensive. As with the screening tools, Wikipedia data are divided into knowledge segments by sentence to be used by the model. A total of 165 webpages with 22,002 sentences were collected and divided into 22,002 knowledge segments (there is no sentence with more than 100 tokens).

DSM-5
DSM-5 is the standard diagnostic criteria for mental health conditions. The DSM serves as the principal authority for psychiatric diagnoses in the United States. It contains disease definitions, symptoms, treatment recommendations, and so on. We remove the parts before the preface and after the Appendix, and collect the body of DSM-5 as psychological knowledge. As above, we use a sentence as a segment and divide the book into many knowledge segments. Long sentences or paragraphs that are difficult to segment with automatic tools are also split into several segments, with 100 tokens as breakpoints.
There are 17329 sentences in total, which are finally split into 17482 knowledge segments. It can be seen that there are not many sentences with more than 100 tokens, and most sentences completely retain their meanings.

General knowledge
Wikipedia is a vast online encyclopedia whose content is universal and unrestricted, so we treat it as common sense. The Wikipedia data we use is Wiki dpr, created by Hugging Face. It covers a wide range and a large amount of content from Wikipedia.
The text in Wiki dpr is segmented every 100 tokens, giving 21 million segments in total. Wiki dpr not only contains the text but also provides the embeddings of the knowledge segments, generated with the same knowledge encoder as ours. The creator has built a quick search index for these segments, so users can quickly and accurately find the most relevant knowledge segments.
Since Wiki dpr contains a wide range of contents, we treat it as common sense to analyze whether the model would perform better if general knowledge was introduced.

Mental health conditions detection
In this section, we describe the methods for detecting mental health conditions, including extracting features from the posts and external knowledge segments, finding external knowledge segments related to the posts, and the deep learning model architecture used.

Feature extraction
To convert text into representative features, a good choice is to represent the text as a high-dimensional vector. In recent years, the rise of Pre-trained Language Models (PLMs) has made extracting features into high-dimensional vectors more universal and effective.
BERT (Devlin et al., 2019) is a transformer-based natural language processing model that is pre-trained on a large corpus and can cover multiple languages at the same time. Because the self-attention mechanism takes the context of the article into account, the meaning of individual words or of the whole article can be accurately expressed. It is one of the most popular pre-trained language models nowadays.
We use BERT as our feature extraction encoder. The user's posts and the knowledge segments are fed into their respective encoders, and the outputs are used as the final feature representations. In this way, the vector representations of all posts and knowledge segments are obtained, which completes the feature extraction.

Relevant knowledge retrieval
Dense Passage Retrieval (DPR) (Karpukhin et al., 2020) is a tool that has been widely used for knowledge retrieval in recent years. DPR is useful for retrieving relevant content and provides a solution to the scalability problem of large pre-trained models. Its basic concept is to use the excellent feature extraction ability of PLMs to find the correlation between the query and the knowledge so that the knowledge related to the query can be found.
All external knowledge segments are first passed through PLMs to get their feature representations. An index is then built on these knowledge segments so that the relevant knowledge segments can be efficiently retrieved. Next, we pass the query through PLMs to get its feature representation. Finally, Maximum Inner Product Search (MIPS) is performed between the feature representation of the query and those of all knowledge segments, and the top m knowledge segments related to the query are obtained.
We treat each post as a query and let each post find its top m relevant knowledge segments. Then we pass the feature representations of the posts and knowledge segments to the model for the prediction. In the end, the input of the prediction model includes n posts and m * n knowledge segments, where m is a parameter giving the number of most relevant knowledge segments retrieved per post.
We train a total of nine binary classification models for the nine corresponding mental health conditions to get nine prediction results. The overall flow is illustrated in Fig. 1.
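The retrieval step can be illustrated with a small exhaustive inner-product search. This is only a sketch under stated assumptions: the vectors stand in for encoder outputs, the function names are hypothetical, and a brute-force scan replaces the search index that a real DPR setup would use for millions of segments.

```python
import heapq


def inner_product(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_m(post_vec, knowledge_vecs, m):
    """Return the indices of the m segments with the largest inner product.

    Exhaustive scan over all segments; indices come back ordered from the
    highest score to the lowest.
    """
    scores = [inner_product(post_vec, k) for k in knowledge_vecs]
    return heapq.nlargest(m, range(len(scores)), key=scores.__getitem__)


def retrieve_for_user(post_vecs, knowledge_vecs, m):
    """Treat every post as a query: n posts yield n lists of m indices."""
    return [retrieve_top_m(p, knowledge_vecs, m) for p in post_vecs]
```

For a user with n posts, `retrieve_for_user` returns n lists of m indices, matching the m * n knowledge segments that accompany the n posts in the model input.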
The details of this method are described as follows. Let the posts be $P = \{p_1, p_2, \ldots, p_t, \ldots, p_n\}$, where $n$ is the total number of posts and $p_t$ is the $t$-th post. For each $p_t$, we use DPR to compute the similarity with all $k_j \in K$, where $K$ represents all knowledge segments:
$$ke(k_j) = \mathrm{PLMs}_{ke}(k_j), \quad pe(p_t) = \mathrm{PLMs}_{pe}(p_t) \tag{1}$$
where

Deep learning model architecture
Because different posts and knowledge segments contribute differently to determining mental health conditions, we use an attention mechanism (Vaswani et al., 2017) to let the model pick out the posts and knowledge segments that are more important for the prediction.
We use three attention model architectures to test different interactions between posts and knowledge segments:
1. Each post is concatenated with the relevant knowledge segments, then enters the self-attention layer.
2. The posts and knowledge segments enter the self-attention layers separately.
3. The posts and knowledge segments enter the attention layer separately, with cross attention to each other first, followed by the self-attention layer.
The details are as follows.
1. After finding the relevant knowledge segments, let each $h_{p_t}$ be concatenated with $h_{k_t}$ to become the $t$-th hidden state $h_t$. Put $h_1$ to $h_n$ into the self-attention mechanism to calculate the attention weight of each hidden state. Then, the output of each hidden state $h_t$ is obtained by a linear combination of the weights. The output goes through a global average pooling layer to obtain the user representation υ.
2. The posts and the knowledge segments are passed to the attention mechanism separately to calculate the attention weights between the posts and between the knowledge segments. After the same linear combination of the weights is done, a global average pooling layer is applied. Then we concatenate the weighted average hidden states of posts and knowledge segments to obtain the user representation υ. This method corresponds to Fig. 1 with the cross-attention layer removed.
3. The cross attention on the sequences of posts and knowledge segments allows the posts and knowledge segments to interact with each other. Then, as in the second method, the posts and knowledge segments are passed to the attention mechanism separately, followed by the global average pooling layer. Then we concatenate the weighted average hidden states of posts and knowledge segments to get the user representation υ. The complete process is shown in Fig. 1.
After obtaining the user representation, we pass it through the fully connected layer and a sigmoid activation function to obtain the prediction ŷ, the predicted value of the binary classification for each mental health condition:
$$\hat{y} = \mathrm{sigmoid}(\mathrm{FC}(\upsilon))$$
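To make the second architecture concrete, the sketch below runs self-attention separately over post and knowledge hidden states, pools each by global averaging, concatenates the two pooled vectors into a user representation, and applies a single linear unit with a sigmoid. It is a simplified illustration rather than the authors' implementation: the attention has no learned query, key, or value projections, the fully connected layer is reduced to one weight vector and a bias, and all names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(states):
    """Scaled dot-product self-attention without learned projections."""
    dim = len(states[0])
    outputs = []
    for query in states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in states
        ]
        weights = softmax(scores)
        # Each output is the attention-weighted combination of all states.
        outputs.append(
            [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]
        )
    return outputs


def global_average_pool(states):
    """Average a sequence of vectors into a single vector."""
    count = len(states)
    return [sum(s[d] for s in states) / count for d in range(len(states[0]))]


def user_representation(post_states, knowledge_states):
    """Attend to posts and knowledge separately, pool, then concatenate."""
    pooled_posts = global_average_pool(self_attention(post_states))
    pooled_knowledge = global_average_pool(self_attention(knowledge_states))
    return pooled_posts + pooled_knowledge


def predict(post_states, knowledge_states, fc_weights, fc_bias):
    """Fully connected unit followed by a sigmoid, giving a value in (0, 1)."""
    rep = user_representation(post_states, knowledge_states)
    logit = sum(w * r for w, r in zip(fc_weights, rep)) + fc_bias
    return 1.0 / (1.0 + math.exp(-logit))
```

The first architecture would instead concatenate each post state with its knowledge state before a single `self_attention` call, and the third would add a cross-attention step between the two sequences before the separate self-attention.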

Experiments
In this section, we first give dataset statistics. Then we present the setup of our experiment and analyze the results of the experiment. Next, we provide some statistical guidelines on knowledge sources. Finally, we explain the effectiveness of our model with cases.

Dataset statistics
We use SMHD, which was presented in Section 2.1, to validate our method. It has been divided into training, validation and test sets in equal proportions (Cohan et al., 2018). The statistics are shown in Table 3.

Experiment setup
Our models are trained with batch sizes of 32 and 8. The optimizer for training the models is Adam with an initial learning rate of $10^{-4}$. The loss function is binary cross entropy. We use TensorFlow 2 to implement the models and train them on an NVIDIA Tesla V100.
In terms of feature extraction, we use BERT as our PLMs and freeze the parameters of BERT without fine-tuning. The main reason is that the feature extraction ability of the pre-trained BERT is good enough to test the effectiveness of our method. For retrieving relevant knowledge segments, we use a pre-trained encoder trained by Hugging Face as PLMs for our knowledge encoder. 4
Cohan et al. (2018) selected more than nine control users for each diagnosed user, and mixed all control users and all diagnosed users. To compare the experiment results fairly, Sekulic and Strube (2019) set the ratio of the diagnosed users to the control users as 1 : 9, and implemented the benchmarks in Cohan et al. (2018) for a comparison. We follow this setup in our experiments. It is important to note that our way of adjusting the user ratio is different from that of Sekulic and Strube (2019), where the ratio of the diagnosed users to control users is adjusted through multiple random sampling. We did it by first making predictions for all users and then adjusting the values of the false positives (FP) and true negatives (TN). Because the sum of the values of TP and FN is equal to the number of diagnosed users, and the sum of the values of FP and TN is equal to the number of control users, we calculated the variable x based on the ratio of the diagnosed to control users for each mental health condition to fit the following equation:
$$\frac{TP + FN}{(FP + TN)/x} = \frac{1}{9}$$
We then used the values of $FP/x$ and $TN/x$ to calculate the F1 score. Because we make predictions for all users, the results are not biased. In contrast, random sampling has to be done a sufficient number of times to remove bias. Our experiment results are therefore more statistically significant and more representative of the true predictive power of the model.
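The ratio adjustment can be worked through in a few lines. The sketch below assumes the scaling factor x is chosen so that the scaled control count (FP + TN)/x equals nine times the diagnosed count TP + FN, which is how the 1 : 9 setup described above is read here; the function name and the example counts are hypothetical.

```python
def adjusted_f1(tp, fn, fp, tn, control_ratio=9):
    """F1 after rescaling control-user counts to a 1 : control_ratio setup.

    Assumes at least one diagnosed user and at least one control user.
    x is chosen so that (fp + tn) / x == control_ratio * (tp + fn).
    """
    diagnosed = tp + fn
    controls = fp + tn
    x = controls / (control_ratio * diagnosed)
    fp_scaled = fp / x  # TN is scaled the same way but does not enter F1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp_scaled)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 100 diagnosed users and 1800 control users (1 : 18).
score = adjusted_f1(tp=80, fn=20, fp=180, tn=1620)
```

With these made-up counts x is 2, so the 180 false positives are scaled to 90 before precision is computed. Recall is unaffected because TP and FN are left unchanged.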
Furthermore, we only make predictions on 160 posts per user, following the recommendation of Sekulic and Strube (2019). If the user has more than 160 posts, the most recent 160 posts are selected; if the user has fewer than 160 posts, the remaining positions are filled with zero vectors.
From Table 3, we see that the number of control users is much larger than that of diagnosed users. To address the problem of data imbalance, we randomly draw diagnosed users and control users into the batch with the same probability during training. This allows all diagnosed and control users to be fairly used in the training.
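The truncation and padding rule can be sketched as below. Two details are assumptions made for illustration, because the text does not specify them: posts are taken to be ordered from oldest to newest, and the zero vectors are appended after the real posts. The function name `select_posts` is hypothetical.

```python
def select_posts(post_vectors, dim, max_posts=160):
    """Keep the most recent max_posts vectors and zero-pad shorter histories.

    Assumptions: post_vectors is ordered oldest to newest, every vector has
    length dim, and padding is placed after the real posts.
    """
    recent = list(post_vectors[-max_posts:])
    missing = max_posts - len(recent)
    padding = [[0.0] * dim for _ in range(missing)]
    return recent + padding
```

The output always has exactly `max_posts` rows, so every user presents the same input shape to the model.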
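One way to read this balancing strategy is that each slot in a batch is filled from the diagnosed group or the control group with probability 0.5 each, regardless of group size. The sketch below implements that reading; it is an interpretation rather than the authors' code, and the function name and the (user, label) output format are assumptions.

```python
import random


def sample_balanced_batch(diagnosed, controls, batch_size, rng=random):
    """Fill each batch slot from either group with equal probability.

    Returns (user, label) pairs, with label 1 for diagnosed users and
    0 for control users. Sampling within a group is uniform with
    replacement.
    """
    batch = []
    for _ in range(batch_size):
        if rng.random() < 0.5:
            batch.append((rng.choice(diagnosed), 1))
        else:
            batch.append((rng.choice(controls), 0))
    return batch
```

Passing a seeded `random.Random` instance as `rng` makes the batches reproducible. In expectation half of each batch is diagnosed users even when controls outnumber them nine to one.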

Experiment results
We compare our results with those of Sekulic and Strube (2019), whose main method is based on the Hierarchical Attention Network (Yang et al., 2016). The basic concept is to use two layers of GRU-based encoders for feature extraction. The attention operation is performed on the feature representations of the posts to obtain the user representation. That study also re-implemented some classic machine learning architectures based on the benchmark models of Cohan et al. (2018), including Logistic Regression, Linear SVM, and Supervised FastText.
To test the impact of different types of knowledge on the model, we divide the knowledge into two categories for the experiments. One is the psychological knowledge specially collected for this task; the other is the psychological knowledge with general knowledge added, called total knowledge. The main purpose of the total knowledge experiment is to see if common sense helps model predictions. For each post, we retrieve the top 1 relevant knowledge segment in the first experiment, so each post has one most related knowledge segment.
The method of incorporating external knowledge segments is detailed in Section 4. We train the first two prediction models with knowledge segments from the two categories, and finally obtain four experiment results, as shown in Table 4. The experiment results show that psychological knowledge is more effective than total knowledge. Therefore, we further test psychological knowledge using a cross-attention mechanism to test whether the interaction of knowledge segments and posts improves model performance.

Analysis of the experiment results
It can be seen from Tables 4 and 5 that the Only Post model performs better than all the baselines from previous work. This coincides with our conjecture that PLMs have very good feature extraction capabilities. We use this model as a baseline to analyze the results of adding external knowledge segments.
Using psychological knowledge outperforms using total knowledge. There are two possible reasons. First, the total knowledge has more than 20 million knowledge segments, which makes it difficult to find the commonalities and differences between various kinds of knowledge segments and makes prediction difficult. Second, the total knowledge contains more common sense than the psychological knowledge, so it is easier to retrieve knowledge segments unrelated to mental health. The relevant knowledge segments are then not found, leading to poor results.
In terms of the models, passing the hidden states of knowledge segments and posts to the attention layer separately performs better than concatenating them. The reason is that the importance of a knowledge segment and that of its post are not consistent. The posts with greater attention weights do not necessarily retrieve the important knowledge segments, and vice versa. Therefore, simply concatenating the hidden states of knowledge segments and posts and passing them to the attention layer reduces the predictive power of the model, because it makes it more difficult for the model to find important contents. In contrast, passing the hidden states of knowledge segments and posts separately through the attention mechanism results in better performance. Because the model can process the hidden states of knowledge segments and posts separately, it becomes feasible to find the hidden states of posts and knowledge segments that are important for model prediction. However, separating the hidden states of knowledge segments and posts makes them unable to influence each other. We therefore add a cross-attention mechanism to further improve the prediction performance. Finally, we achieve good results in most of the disease categories. Among these results, schizophrenia shows the greatest improvement, with an increase of more than 10% in F1 score. The incorporation of psychological knowledge and the use of the cross-attention layer increase the performance by more than 3%.

Extended experiments: More knowledge segments
This section explores the impact of introducing more external knowledge segments on model predictions. We let each post find the top 1, 3, and 5 related knowledge segments (from Psychological Knowledge) and compare the impact on the model. The results are shown in Tables 6 and 7.
It is found that more related knowledge segments may be helpful for model prediction, but not necessarily. While importing more knowledge segments increases the likelihood of acquiring important and relevant knowledge segments, it also creates more noise and makes it harder for the model to focus on the really important knowledge segments. For some mental health conditions with a small number of diagnosed users, the more knowledge segments are input, the harder it is to find useful information and, as a result, the more serious the overfitting problem becomes. For example, for eating disorders, the more knowledge segments are introduced, the worse the results become. Therefore, introducing more knowledge segments is not necessarily helpful; the decision depends on the amount of data and the actual situation of the experiment.

Co-occurrence of mental health conditions
In this section, we discuss whether the nine mental health conditions predicted by our binary classification model are able to distinguish the multiple disorders of the user, and analyze the comorbidity.
We use the model with the best predictive performance, i.e.Only Post + PK + C, and analyze the prediction results of all diagnosed users on the nine disorders to study the comorbidities.We made predictions for the nine disorders for each diagnosed user.We used Exact Match Ratio (EMR), a strict metric where each label in a multi-label needs to be exactly correct, and Hamming Loss (HL), a soft metric used to report the average number of Fig. 2 The relative co-occurrence of the disorder with other disorders in the test dataset Fig. 3 The relative co-occurrence of the disorder with other disorders in the prediction result incorrectly predicted class labels.The results show that the EMR is only 1.34%, while the HL result is 61.04%, which shows that our model cannot well distinguish the differences between different disorders.
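The two multi-label metrics used in this section can be computed as in the sketch below, using their standard definitions. Each user is represented by a list of nine binary labels, one per disorder; the function names and the toy labels in the usage note are illustrative only.

```python
def exact_match_ratio(y_true, y_pred):
    """Fraction of users whose full label vector is predicted exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of individual labels that are predicted incorrectly."""
    wrong = sum(
        ti != pi
        for t, p in zip(y_true, y_pred)
        for ti, pi in zip(t, p)
    )
    total = sum(len(t) for t in y_true)
    return wrong / total
```

For two toy users with three labels each, true labels `[[1, 0, 0], [0, 1, 1]]` and predictions `[[1, 0, 0], [1, 1, 0]]`, one of the two users matches exactly and two of the six labels are wrong. A model that predicts nearly every disorder for every diagnosed user scores a low EMR and a high HL, which is the pattern reported above.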
We compared the comorbidity in the test data (Fig. 2) with that in our predictions (Fig. 3). Fig. 3 shows that whenever a user has one mental health condition, the model almost always predicts that the user has the other conditions as well. However, this is not the case in the actual data. Our model can detect well whether a user has a mental health condition, but it cannot accurately tell which condition it is.
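One plausible way to compute a relative co-occurrence matrix like those in Figs. 2 and 3 is sketched below. The definition used here is an assumption, not taken from the paper: for each disorder a, it is the share of users with a who also have disorder b. The disorder names and label rows are toy values.

```python
def relative_cooccurrence(label_rows, names):
    """For each disorder a, return the fraction of users with a who also have b."""
    n = len(names)
    counts = [[0] * n for _ in range(n)]
    for row in label_rows:
        for i in range(n):
            if row[i]:
                for j in range(n):
                    if row[j]:
                        counts[i][j] += 1
    result = {}
    for i, a in enumerate(names):
        total = counts[i][i]  # number of users with disorder a
        result[a] = {
            b: (counts[i][j] / total if total else 0.0)
            for j, b in enumerate(names)
            if j != i
        }
    return result


# Toy multi-label rows (1 = user has the disorder). Invented values.
names = ["depression", "anxiety", "ptsd"]
rows = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1]]

cooc = relative_cooccurrence(rows, names)
for a, others in cooc.items():
    print(a, others)
```

Running the same function on gold labels and on predicted labels gives two matrices that can be compared directly.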

Source statistics for psychological knowledge
The psychological knowledge is collected from three sources: screening tools, Wikipedia (mental health-related entries), and the DSM-5. Table 8 shows a statistical analysis of the knowledge segments retrieved by our method from these three sources.
We retrieve the top-5 related psychological knowledge segments for all posts and count their sources. Table 8 shows that, in the source distribution of the top 1-5 related knowledge segments, segments from Wikipedia (mental health-related entries) are retrieved much more often than those from the other two sources. As Section 3.1 shows, DSM-5 makes up the largest proportion of the psychological knowledge, yet it has the least relevance and a low probability of being retrieved. We speculate that there are two reasons. First, the encoder weights of DPR are trained on Wikipedia, which makes segments from Wikipedia easier to retrieve. Second, DSM-5 is a book; compared with the screening tools and Wikipedia (Mental Health), it usually uses abstract or high-level sentences, which differ from the colloquial language of posts.
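The source count described here can be sketched as a simple tally over the retrieved segments. The example assumes each retrieved segment carries a source label. The labels and the toy retrieval lists are invented, and the exact layout of Table 8 is not reproduced.

```python
from collections import Counter


def source_distribution(retrieved_sources):
    """Share of retrieved segments per source, pooled over all posts."""
    counts = Counter(src for per_post in retrieved_sources for src in per_post)
    total = sum(counts.values())
    return {src: c / total for src, c in counts.items()}


# Toy data: source labels of the top-5 segments for two posts. Invented values.
retrieved = [
    ["Wikipedia", "Wikipedia", "Screening tools", "Wikipedia", "DSM-5"],
    ["Wikipedia", "Screening tools", "Wikipedia", "Wikipedia", "Screening tools"],
]

dist = source_distribution(retrieved)
for src, share in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{src}: {share:.0%}")
```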
To understand which sources of knowledge segments matter most for predicting mental health conditions, we analyze the model Only Post + PK + S and compute statistics on the sources of the top ten related knowledge segments to which the model pays the most attention. As can be seen from Table 9, DSM-5 is again the least important source compared with screening tools and Wikipedia (Mental Health).
From Tables 8 and 9, we conclude that the screening tools and Wikipedia (Mental Health) are very important to the model. If the amount of data from both sources can be increased, there is a good chance that the model will perform even better.

Case study
In this section, we show why introducing external knowledge segments can improve model prediction and interpretability. The model used is Only Post + PK + S.
To respect user privacy and the data usage agreement, we paraphrased the posts. The post attention parts present the three posts with the highest attention weights. The knowledge attention parts present the knowledge segment with the highest attention weight, together with the post that retrieved it.
Table 10 shows a case in which the baseline model's prediction is wrong but becomes correct after external knowledge segments are included. Based on the content of the posts alone, it is difficult to determine that this user is a depressed patient. However, the attention weights on the external knowledge segments indicate that the patient has a tendency toward impulse purchasing, which leads the model to the correct judgment. It is worth noting that some past studies (Mueller et al., 2011; Lejoyeux et al., 1997) have found a fairly high correlation between impulse purchasing and depression, which is consistent with the knowledge the model focuses on.
Table 11 is another example in which the baseline prediction is incorrect but becomes correct after the external knowledge segments are included. This example is from an autistic patient. From the posts alone, it is hard to understand why the model made this prediction. The most important knowledge segments suggest that the user does not like contact with other people, which helps explain the reason for the prediction.
Table 12 lists some counter-examples that illustrate the inadequacies of the current method. Since our method retrieves knowledge segments for each post, some unimportant posts retrieve noisy knowledge segments during training and testing. This problem of miscited knowledge segments needs to be overcome when incorporating knowledge segments into the model.

Conclusion
In this study, we improved the performance of detecting mental health conditions by incorporating psychological knowledge. The experimental results show that our method outperforms previous work, and the F1-score increases by more than 10% in some situations. Through the attention weights of the knowledge segments, our model can find the knowledge segments that are important for predicting mental health conditions, and the content of those segments improves interpretability. This suggests that our model has the potential to serve as a reference for psychiatrists assessing patients, or to allow users to learn more about their mental health.
Moreover, DPR is an automatable process, and the external knowledge can be adjusted freely, which makes our method more likely to be applied in practice. Through the source statistics for knowledge segments, useful sources can be found to improve the performance of the model. We are working on solving the problem of the miscited knowledge segments.
[...] used the Word Mover's Distance algorithm to compare the distance between the query and the external knowledge, and put the most matching external knowledge into the model. Li et al. tested the model on the National Licensed Pharmacist Examination in China, and the results showed that the model can pass the examination.

Table 2
Statistics of the screening tools

$\mathrm{ke}(k_j)$ is a feature representation of a knowledge segment produced by the knowledge encoder based on PLMs, and $\mathrm{pe}(p_t)$ is a feature representation produced by the post encoder, also based on PLMs. Calculating $\text{top-}m(p_\eta(\cdot \mid p_t))$ is a MIPS problem, and we let each $p_t$ find all $p_\eta(k_j \mid p_t)$, where $k_j \in K$. Then we sort $K$ according to the result to obtain $K_t$. Lastly, we take the top $m$ elements in $K_t$ to get $K_{t,m} = \{k^t_1, k^t_2, \ldots, k^t_m\}$, which is the top $m$ relevant knowledge segments of $p_t$. We represent the hidden state of $p_t$ as h

Table 3
Train, Validation, Test Split
Control | Depress. | ADHD | Anxiety | Bipolar | PTSD | Autism | OCD | Schizo. | Eating

Table 8
The source distribution of the top 1-5 related knowledge segments

Table 9
The source distribution of the top ten related knowledge segments that the model pays the most attention to

Table 10
Example of depression. Values in bold indicate information focused on by the models from the experiments

Table 11
Example of autism. Values in bold indicate information focused on by the models from the experiments

Table 12
Counter-example. Values in bold indicate information focused on by the models from the experiments