The use of social media in mental health research has gained popularity in recent years because it provides a rich source of information about users' thoughts, feelings, and behaviors. The potential of social media in mental health research is vast. As previously stated, researchers use it to gain insights into the mental health of a population, track changes in mental health over time, and identify risk and protective factors for mental health issues [34]. Automated systems for detecting mental health issues have been developed to analyze textual data, with NLP and transformer models being used to identify patterns [7].
An effective method for diagnosing mental health disorders on social networks is to analyze users' self-report statements. This approach can be used to collect positive and negative samples, which can then be automatically validated to train a machine-learning model, and previous research has shown that it detects mental health symptoms efficiently. However, BERT models have not yet been explored in this context, and their potential impact remains unknown. Given the success of these models in various fields, we aim to investigate their effectiveness for mental health diagnosis in social media data [21], [35].
As part of this study, we propose using Twitter users' tweets and bios to predict depression. Specifically, we use BERT models from the Hugging Face library that have been fine-tuned on large datasets of reviews, tweets, and other textual data. Transformer-based deep learning models have demonstrated impressive results in various NLP tasks, such as sentiment analysis, question answering, and text classification [10]. The Hugging Face open-source Transformers library is widely used in the NLP community for a range of Natural Language Understanding (NLU) tasks and contains thousands of models pre-trained in more than 100 languages.
Using pre-trained BERT models from the Hugging Face library, we can predict symptoms based on users' textual data. Figure 2 presents our four-step approach, consisting of data selection, preprocessing, training, and validation modules. To determine which model is best suited to our problem, we tested four BERT models: distilbert-base-uncased-finetuned-sst-2-english (DBUFS2E) [36], bert-base-uncased (BBU) [37], mental-bert-base-uncased (MBBU) [20], and distilroberta-base (DRB) [38]. The following subsections describe the methodology in detail, covering fine-tuning and evaluation. Finally, we analyze the performance of the four models and explore the potential implications for mental health research.
3.1. Data selection
Data collection is the systematic process of gathering and analyzing information on targeted variables in a manner that enables one to answer stated research questions, test hypotheses, and evaluate results. For model training, data collection is a critical component, as it provides the inputs needed to build models capable of accurately predicting outcomes.
The present study utilizes the Autodep dataset[1], which was automatically collected and evaluated through the Twitter API [21]. It encompasses a range of data, including posts, bio descriptions, profile pictures, and banner images of Twitter users who have publicly disclosed their mental health status. To ensure the authenticity of the results, benchmarking techniques were employed to compare the outcomes of various measures. The dataset contains 11,890,632 tweets and 553 bio descriptions. In the most recent study of predictive models for depressive symptoms on Autodep, using only tweets and bio texts yielded accuracy rates of 91% and 83%, respectively [21]. In this work, we extract the user bios and tweets from the raw data and save them as separate files. To analyze the tweets holistically, all tweets for each user are merged into a single cell. The following section details how we prepare the data for analysis.
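As a rough illustration of this preparation step, the following sketch merges each user's tweets into a single cell and writes tweets and bios to separate files; the column and file names are assumptions, not those of the released dataset.

```python
import pandas as pd

# Hypothetical raw export of the Autodep data; the column names are assumptions.
raw = pd.read_csv("autodep_raw.csv")   # columns: user_id, tweet_text, bio, label

# Merge all tweets of each user into a single cell for holistic analysis.
tweets = (
    raw.groupby(["user_id", "label"])["tweet_text"]
       .apply(lambda texts: " ".join(texts.astype(str)))
       .reset_index()
       .rename(columns={"tweet_text": "text"})
)

# Keep one bio description per user.
bios = (
    raw.drop_duplicates("user_id")[["user_id", "bio", "label"]]
       .rename(columns={"bio": "text"})
       .dropna(subset=["text"])
)

# Save the two feature sets as separate files.
tweets.to_csv("tweets.csv", index=False)
bios.to_csv("bios.csv", index=False)
```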
3.2. Preprocessing Steps
Data preprocessing is an essential step in preparing to train our model. Social media posts are often noisy, containing hashtags, links, special characters, and emojis that add little value to the text. We therefore applied several preprocessing steps to clean the studied features (tweets and bio descriptions) and extract meaningful information from them.
Algorithm 1. Data Preprocessing Steps
Input: Tweets and bio descriptions (control and diagnosed groups)
Output: Preprocessed data
Method:
for each value in groups
    if the value is not in English, skip the value
    if the value starts with "RT", skip the value
    if the value starts with "@", skip the value
    if the value contains URLs, remove them from the value
    if the value contains emojis, remove them from the value
    if the value contains any special character, sanitize the value
    convert the value to lowercase
    if the value contains any additional spaces, remove them
    tokenize the value
    if the value contains stop words, remove them
    lemmatize the value
end for
As illustrated in Algorithm 1, the first step is to check the language and to skip retweets and mentions, since they do not add value. Next, URLs are removed from the text using regular expressions, as URLs are unlikely to be relevant to the prediction of mental disorders. URLs follow a pattern beginning with http/https or www, and regular expressions are used to match and remove them from the text.
Next, we remove all emojis from the text, since they may not provide significant contextual information and we wish to keep our approach as straightforward as possible. Emojis fall within specific Unicode ranges and can be removed with a regular expression. We also remove special characters such as "*", "^", and "@" using regular expressions, which reduces noise and yields a smaller dataset. As part of this removal, we make sure to eliminate any extra spaces that remain.
The text is then converted to lowercase to ensure consistency: because text data can mix case types, lowercasing guarantees that the same word is treated identically regardless of its case. To further improve the quality of the text, we remove any remaining extra spaces that may affect the algorithm's performance.
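A minimal sketch of this cleaning stage is given below; the skip rules for retweets and mentions and the regular expressions mirror Algorithm 1, while the emoji Unicode ranges and the function name are illustrative assumptions.

```python
import re
from typing import Optional

URL_RE = re.compile(r"(https?://\S+|www\.\S+)")
# Assumed Unicode ranges covering the most common emoji blocks.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]+"
)
SPECIAL_RE = re.compile(r"[^A-Za-z\s]")   # anything that is not a letter or a space
SPACES_RE = re.compile(r"\s+")

def clean(value: str) -> Optional[str]:
    """Apply the cleaning rules of Algorithm 1; return None when a value is skipped.

    The English-language check is assumed to have been applied upstream.
    """
    if value.startswith("RT") or value.startswith("@"):
        return None                       # skip retweets and mentions
    value = URL_RE.sub(" ", value)        # remove URLs
    value = EMOJI_RE.sub(" ", value)      # remove emojis
    value = SPECIAL_RE.sub(" ", value)    # sanitize special characters such as *, ^, @
    value = value.lower()                 # convert to lowercase
    value = SPACES_RE.sub(" ", value)     # remove additional spaces
    return value.strip()
```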
After that, we tokenize the text using the NLTK library, which transforms long texts into smaller units known as tokens. Tokenization splits the text into individual words or tokens that can then be used as input for further analysis.
We use the Natural Language Toolkit (NLTK) [39] library to eliminate insignificant stop words of the English language, such as 'I', 'am', 'a', 'the', 'of', and 'to'. We then apply lemmatization to group different inflected forms of a word so that they can be analyzed together as a single item. Lemmatization normalizes words into their base or root form, which reduces the complexity of the text data and the number of variables. Lemmatization is performed with the WordNetLemmatizer[2] from the NLTK library. We also experimented with stemming, which reduces words to their basic form by removing suffixes; however, lemmatization provided slightly better results than stemming, with a minimal difference between the two.
After completing the preprocessing steps, we verify that the length of each remaining value does not fall below five characters, discarding any unnecessary words or characters that may have been left over from the previous steps. The results are then passed to the training stage.
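The tokenization, stop-word removal, lemmatization, and minimum-length check can be sketched with NLTK roughly as follows; the function and variable names are illustrative.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time download of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def normalize(cleaned_text: str) -> str:
    """Tokenize, remove stop words, and lemmatize an already cleaned string."""
    tokens = word_tokenize(cleaned_text)                    # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]     # stop-word removal
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]      # lemmatization
    text = " ".join(tokens)
    # Discard values shorter than five characters, as described above.
    return text if len(text) >= 5 else ""
```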
Figure 3 provides an example of the preprocessing steps described above.
3.3. Model training
Upon completion of the preprocessing steps, we utilize the Hugging Face library to fine-tune four pre-trained BERT models: distilbert-base-uncased-finetuned-sst-2-english, bert-base-uncased, distilroberta-base, and mental-bert-base-uncased. This enables us to develop a predictive model for detecting depression.
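For reference, the four checkpoints can be loaded from the Hugging Face hub roughly as in the sketch below; the hub identifiers, in particular the namespace of the MentalBERT checkpoint, are assumptions and may differ from the exact ones used in our experiments.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed Hugging Face hub identifiers for the four evaluated checkpoints.
CHECKPOINTS = {
    "DBUFS2E": "distilbert-base-uncased-finetuned-sst-2-english",
    "BBU": "bert-base-uncased",
    "MBBU": "mental/mental-bert-base-uncased",  # namespace is an assumption
    "DRB": "distilroberta-base",
}

def load_checkpoint(name: str):
    """Load the tokenizer and a binary sequence-classification model for one checkpoint."""
    checkpoint = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    return tokenizer, model
```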
Distilbert-base-uncased-finetuned-sst-2-english is a text classification model developed by Hugging Face: distilbert-base-uncased fine-tuned on the SST-2 dataset. Beyond this classification task, the underlying model can also be used for masked language modelling or fine-tuned on other downstream tasks. DistilBERT was developed to simplify the original BERT model, making it smaller, faster, and more efficient while maintaining most of its performance.
The bert-base-uncased model is a pretrained NLP model introduced in [37]. It was trained on a large corpus of English data in a self-supervised fashion, i.e., on raw texts without human labeling, with two objectives: Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). MLM randomly masks 15% of the words in a sentence and predicts the masked words, allowing the model to learn a bidirectional representation of the sentence. In NSP, two masked sentences are concatenated, and the model predicts whether they follow each other in the original text. The learned features of the BERT model can then be applied to downstream tasks such as classification.
The RoBERTa-base model was pre-trained on a large corpus of English data with the MLM objective and without human labeling. The model learns a bidirectional representation of the sentence from which useful features can be extracted for downstream tasks such as sequence classification, token classification, or question answering. It is case-sensitive, meaning it differentiates between different capitalizations. Distilroberta-base is a distilled version of RoBERTa-base that follows the same training procedure as DistilBERT [40]. Compared to RoBERTa-base, it has fewer layers, dimensions, and parameters, resulting in a smaller, faster, and more efficient model that maintains comparable performance in most cases.
As discussed earlier, MentalBERT is a variant of the bert-base-uncased model that has been fine-tuned on a dataset of mental health-related posts from Reddit. This allows the model to better capture the nuances and complexities of mental health language, which is useful for tasks such as sentiment analysis, classification, or question answering on related content. Its learned features can likewise be applied to downstream tasks, and the use of MentalBERT represents an important step forward in applying NLP to mental health issues and improving our understanding of this important area.
We found that utilizing pre-trained models gave our approach a significant advantage. By using the pre-existing weights and architectures, we were able to decrease the amount of computational resources needed for training while also enhancing the model's performance. As a result of their training on large quantities of text data, these models could learn complex patterns and relationships within the text, which is crucial for detecting depression.
As part of our methodology, we loaded the datasets as CSV files and divided them into training and testing sets using an 80/20 split. It is worth noting that this split was applied to two distinct datasets: tweets and bios. Next, we tokenized the text data with the pre-trained BERT tokenizer specific to the model being fine-tuned. The resulting tokens were then used as inputs to the model for further analysis and processing.
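A condensed sketch of this step is shown below, assuming the CSV files produced earlier with text and label columns, and using scikit-learn and the datasets library for splitting and batching; these choices are illustrative rather than prescriptive.

```python
import pandas as pd
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Load one of the two preprocessed feature sets (tweets or bios).
data = pd.read_csv("tweets.csv")          # assumed columns: text, label

# 80/20 train/test split.
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

# Tokenize with the tokenizer that matches the checkpoint being fine-tuned.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

train_ds = Dataset.from_pandas(train_df, preserve_index=False).map(tokenize, batched=True)
test_ds = Dataset.from_pandas(test_df, preserve_index=False).map(tokenize, batched=True)
```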
As an integral component of the training process, we partitioned the datasets into ten distinct folds and shuffled them before initiating training. K-fold cross-validation helped us avoid overfitting and improve model accuracy. Each fold was trained separately, with five epochs per training run; at each epoch, the model was trained on the training set and evaluated on the validation set. We loaded, preprocessed, and fine-tuned pre-trained BERT models for both tweets and bios using the Hugging Face library and the Trainer API. By fine-tuning the pre-trained models on our preprocessed Twitter dataset, we were able to construct high-performing depression detection models, and K-fold cross-validation ensures that each model is trained and evaluated on diverse subsets of the data, supporting quality and accuracy.
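Continuing the previous sketch, the cross-validated fine-tuning loop could be written with scikit-learn's KFold and the Hugging Face Trainer as follows; apart from the ten folds and five epochs, the hyperparameters are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

kfold = KFold(n_splits=10, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(kfold.split(np.arange(len(train_ds)))):
    # A fresh model is fine-tuned for every fold.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    args = TrainingArguments(
        output_dir=f"fold_{fold}",
        num_train_epochs=5,                 # five epochs per fold
        per_device_train_batch_size=16,     # illustrative batch size
        evaluation_strategy="epoch",        # evaluate on the validation fold each epoch
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds.select(train_idx),
        eval_dataset=train_ds.select(val_idx),
    )
    trainer.train()
```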
3.4. Validation
Several metrics were used to evaluate the performance of our fine-tuned BERT models for depression detection, including accuracy, F1 score, receiver operating characteristic (ROC), and the area under the curve (AUC).
Accuracy indicates the proportion of instances that were correctly predicted. The F1 score is the harmonic mean of precision and recall and reflects the balance between them. Precision measures the proportion of true positive predictions among all positive predictions made by the model, while recall measures the proportion of true positive predictions among all actual positive instances in the data. The AUC indicates how well the model performs in terms of true positive rates (TPRs) and false positive rates (FPRs); it is the area under the ROC curve, which plots the TPR against the FPR. Using the confusion matrix derived from the models' predictions, the accuracy and F1 score were calculated for each fold using Equations 1 to 4:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$
To calculate the F1 score, we first compute the precision and recall using Eq. 2 and Eq. 3.
$$Precision = \frac{TP}{TP + FP} \quad (2)$$

$$Recall = \frac{TP}{TP + FN} \quad (3)$$

$$F1\;score = \frac{2 \times (Precision \times Recall)}{Precision + Recall} \quad (4)$$
Here, TP (True Positive) represents the number of positive instances correctly predicted as positive, TN (True Negative) the number of negative instances correctly predicted as negative, FP (False Positive) the number of negative instances incorrectly predicted as positive, and FN (False Negative) the number of positive instances incorrectly predicted as negative.
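For each validation fold, these metrics can be computed with scikit-learn roughly as follows; the labels, predictions, and positive-class probabilities are assumed to come from the fine-tuned model.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, roc_curve

def evaluate_fold(y_true, y_pred, y_prob):
    """Compute accuracy, F1, AUC, and the ROC curve for one validation fold.

    y_true: gold labels, y_pred: predicted labels,
    y_prob: predicted probability of the positive (depressed) class.
    """
    accuracy = accuracy_score(y_true, y_pred)    # Eq. 1
    f1 = f1_score(y_true, y_pred)                # Eqs. 2-4
    auc = roc_auc_score(y_true, y_prob)          # area under the ROC curve
    fpr, tpr, _ = roc_curve(y_true, y_prob)      # points of the ROC curve
    return {"accuracy": accuracy, "f1": f1, "auc": auc, "roc": (fpr, tpr)}
```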
To assess the final performance of the models, we employed K-fold cross-validation. Specifically, we partitioned the dataset into ten folds and trained each fold individually, with a training duration of five epochs. At each epoch, the model was trained on the training set and evaluated on the validation set. This process was repeated for each of the ten folds, resulting in a comprehensive evaluation of the model's performance across the entire dataset. By utilizing K-fold cross-validation, we aim to provide a robust and reliable assessment of the model’s performance, while minimizing the risk of overfitting to a specific subset of the data.