3.1. Scientific Production in the Field
Figure 1 shows the yearly publication count on the topic. Interest in text analytics for early detection of anxiety and depression has risen recently: almost 41% of the research in this field was published in 2022. Moreover, 45 papers were published from 2009 to 2020, whereas 58 papers were published in the past 3 years (2021–2023). The higher publication output in 2022 is consistent with the COVID-19 effect on mental health [2]. The field is still in an early development stage, and the average annual growth of 16.01% (Table 2) suggests that more research will be conducted in the future.
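The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not the tool's actual computation: the helper names are hypothetical, and treating the "average annual growth" as a compound rate is an assumption. Only the paper counts (45 and 58) come from the text above.

```python
def share_percent(part, total):
    """Return `part` as a percentage of `total`."""
    return 100.0 * part / total


def compound_annual_growth(first_count, last_count, n_years):
    """Compound annual growth rate in percent (assumed definition)."""
    return 100.0 * ((last_count / first_count) ** (1.0 / n_years) - 1.0)


# Counts reported in the text.
papers_2009_2020 = 45
papers_2021_2023 = 58
total_papers = papers_2009_2020 + papers_2021_2023

# Share of all papers published in the last three years.
recent_share = share_percent(papers_2021_2023, total_papers)
print(f"Total papers: {total_papers}")
print(f"Share published in 2021-2023: {recent_share:.1f}%")
```

The share shows that more than half of the collected papers appeared in the last three years of the window, which supports the claim of a recent increase in interest.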
Figure 2 presents a three-field plot showing the relationship among cited references (CR), authors (AU), and author keywords (DE), with the cited references on the left and the author keywords on the right.
The top 10 most relevant sources are presented in Fig. 3. The data, collected in May 2023, reveal that the most productive source in Scopus and WoS is the CEUR Workshop Proceedings. The only other notable source is Lecture Notes in Computer Science; together, the two sources account for 67% of the documents. Each of the remaining 8 sources contributes fewer than 3 documents.
Figure 4 shows that the top 4 most productive authors in the field have each produced 4 documents, and that 5 of the remaining 6 authors have also produced the same number of documents as one another. The top authors in the field are therefore almost equally productive.
Lotka's law, a special application of Zipf's law, describes how often authors publish within a particular field. It states an approximate inverse-square relationship: the number of authors publishing n articles is about 1/n^2 of the number of authors publishing just one article [10]. Hence, as the quantity of published articles rises, authors generating that volume of publications become less frequent. From Fig. 5, we can see that although the author productivity does not exactly match Lotka's law, it follows a similar shape, implying that Lotka's law still approximately holds for this field.
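The inverse-square relationship can be illustrated with a short sketch. The function names and the example figure of 100 single-paper authors are hypothetical and are not taken from the dataset. The exponent of 2 is the classical value; a fitted exponent may differ.

```python
def lotka_expected_authors(single_paper_authors, n_papers, exponent=2.0):
    """Expected number of authors with `n_papers` publications under Lotka's law."""
    return single_paper_authors / (n_papers ** exponent)


def lotka_proportions(max_papers, exponent=2.0):
    """Theoretical share of authors with 1..max_papers publications (sums to 1)."""
    weights = [1.0 / (n ** exponent) for n in range(1, max_papers + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example: 100 authors have published exactly one paper.
for n in range(1, 5):
    expected = lotka_expected_authors(100, n)
    print(f"{n} paper(s): about {expected:.2f} authors expected")
```

Comparing such theoretical proportions with the observed author counts is one way to judge how closely a field follows the law.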
Figure 6 shows the corresponding authors' countries. A single country publication (SCP) is an intra-country collaboration in which all authors belong to the same country, whereas a multiple country publication (MCP) involves authors from different countries and therefore represents an international collaboration [11]. From Fig. 6, India, China, and Spain have the most publications, all of which are single-country publications, suggesting that these countries have a well-established research infrastructure in this field and a domestic network of researchers. International collaboration is also visible: the UK and USA both have an MCP ratio of 0.5, implying that the field is not isolated and that researchers are exposed to diverse perspectives, methodologies, and ideas from researchers in other countries.
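As a small sketch, the MCP ratio can be computed as the share of a country's publications that involve more than one country. This definition is an assumption about how the ratio is obtained, and the counts below are hypothetical, chosen only to reproduce a ratio of 0.5.

```python
def mcp_ratio(scp_count, mcp_count):
    """Share of a country's publications that are multiple country publications."""
    total = scp_count + mcp_count
    if total == 0:
        raise ValueError("country has no publications")
    return mcp_count / total


# Hypothetical counts: 2 single-country and 2 multi-country publications.
print(mcp_ratio(2, 2))
# A country with only single-country publications has a ratio of 0.
print(mcp_ratio(10, 0))
```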
A list of the top globally cited documents is given in Fig. 7. The paper by Losada D.E., published in 2016 in "Lecture Notes in Computer Science", is the most cited document with 154 citations, followed by the papers of Yang M. and Cheng O.J. The top 3 globally cited documents have the most impact and influence within the field, as they account for 39% of all citations in this field according to the data collected from Scopus and WoS. These documents can serve as a starting point for understanding key concepts, theories, and methodologies in the field, and they influence its research direction. Although these are the most cited documents, newer studies, emerging trends, and alternative viewpoints should also be considered to gain a complete picture of the ongoing research in the field.
Reference spectroscopy characterizes the decay in popularity of a publication in terms of the number of citations over time (Fig. 8). As a publication becomes older, it receives fewer citations. The red line shows the deviation from the 5-year median, which serves as the expected citation trajectory for a set of publications based on their publication year. A deviation from the red line indicates a difference between the actual citation count of a publication and the citation count expected for its publication year.
The first cited reference dates from 1985 and the next from 2000, which explains why the line stays flat at 0 from 1986 to 1999. From 2009, the black line lies above the red line, indicating overperformance: the citation count is higher than expected for the publication year. This suggests that publications from these years have received more attention than anticipated. Further, deviations from the red line can help identify influential papers, as these continue to be highly cited. For example, in 2017 there is a significant gap between the red and black lines, signifying that a highly cited paper was released in that year. Such papers are early indicators of growing interest in a particular area and present important research findings that resonate with the research community.
3.2. Keywords and Thematic Analysis
Analysis of authors' keywords helps researchers in understanding and identifying the most relevant and interesting topics in this field [4]. Furthermore, these words provide a means for understanding the dimensions of current research and future trends.
Figure 9 shows the most used words in text analytics for early detection of DAD, with depression and ML being the most prominent. Rather than scaling the text size by the raw frequency of occurrence, we have used the log of the frequency so that less frequent words do not become too small.
The log-scaled frequencies of the authors' keywords show that research is being conducted at the intersection of natural language processing, machine learning, deep learning, social media, and linguistic metadata. Social media is also a prominent source of the data used in these studies. These topics can help guide research efforts to address the early detection of DAD. A treemap of the top 50 keywords is depicted in Fig. 10, in which depression (occurrence = 10%), machine learning (occurrence = 10%), social media (occurrence = 7%), and natural language processing (occurrence = 7%) are the top keywords.
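The effect of log scaling on text size can be sketched as follows. The keyword frequencies and the font-size range are hypothetical and only illustrate the idea. The mapping to sizes is an assumed min-max rescaling, not necessarily what the plotting tool does.

```python
import math


def scaled_sizes(freqs, transform, min_size=10.0, max_size=60.0):
    """Map keyword frequencies to font sizes after applying `transform`."""
    values = {word: transform(count) for word, count in freqs.items()}
    low, high = min(values.values()), max(values.values())
    if high == low:
        # All words equally frequent: give them all the same size.
        return {word: max_size for word in freqs}
    span = max_size - min_size
    return {w: min_size + (v - low) / (high - low) * span for w, v in values.items()}


# Hypothetical keyword frequencies.
frequencies = {"depression": 100, "transfer learning": 10, "rare term": 1}

linear_sizes = scaled_sizes(frequencies, lambda count: float(count))
log_sizes = scaled_sizes(frequencies, math.log)

for word in frequencies:
    print(f"{word}: linear={linear_sizes[word]:.1f}, log={log_sizes[word]:.1f}")
```

With raw frequencies, the mid-frequency keyword is drawn almost as small as the rarest one; with log scaling it lands midway between the extremes, which keeps less frequent words legible.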
The co-occurrence of keywords can be understood from Fig. 11, where all these top keywords have a strong association in the network, as indicated by node size and edge thickness.
The co-occurrence network in Fig. 11 is a useful tool for understanding the knowledge structure of the field. It is evident that depression is a constant term in the field, co-occurring with almost all the prominent key terms, which suggests that most of the research involves depression. Furthermore, strong connections are seen between depression and ML, early detection, NLP, and social media, suggesting that more research is being conducted on these topics in relation to depression.
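A minimal sketch of how keyword co-occurrence counts, which drive the edge thickness in such a network, might be computed is shown below. The keyword lists are hypothetical, and the function is an illustration rather than the procedure used to produce Fig. 11.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(documents):
    """Count how many documents each unordered keyword pair appears in together."""
    pair_counts = Counter()
    for keywords in documents:
        # Normalize case and drop duplicates within a document.
        unique = sorted({keyword.lower() for keyword in keywords})
        for first, second in combinations(unique, 2):
            pair_counts[(first, second)] += 1
    return pair_counts


# Hypothetical author keywords for three documents.
docs = [
    ["Depression", "Machine Learning"],
    ["Depression", "Social Media", "Machine Learning"],
    ["Depression", "NLP"],
]

counts = keyword_cooccurrence(docs)
for pair, count in counts.most_common():
    print(pair, count)
```

Pairs with higher counts correspond to thicker edges, and keywords that appear in many pairs correspond to larger, more central nodes.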
In Fig. 12, the trending topics are based on the authors' keywords in the time span 2018–2022, with a minimum word frequency of 5 and a maximum of 25. Keywords that are connected to publication content characterize topical aspects of a research field [4]. In 2018, “early detection” was the most used keyword, and it remained in trend until 2022. From 2019 to 2022, “machine learning,” “mental health,” and “early risk detection” were trending, with “machine learning” used most frequently. In 2022, “natural language processing” is the trending topic, with a term frequency of 20, and there is a small emphasis on “transfer learning,” with a word frequency of 5. This suggests that research has shifted towards natural language processing. Although “natural language processing” first appeared only in 2021, it is now the most trending topic; terms such as “machine learning” and “social media” nevertheless remain relevant.
Figure 13 provides the thematic map for early detection of DAD using text analytics. A typical thematic map has four quadrants: the specialized themes in the top left, the driving (motor) themes in the top right, the emerging or declining themes in the bottom left, and the main (basic) themes in the bottom right [6]. The aim of the thematic map is to give a comprehensive view of the current condition of the field and to evaluate its prospects. This information is crucial, as it helps researchers and stakeholders identify prospective avenues to explore and follow the progression of research development. Thematic analysis groups the most representative authors' keywords into themes, which are positioned by centrality on the x-axis and density on the y-axis.
Density is a measure of cohesiveness, while centrality is a measure of correlation among various nodes [5]. The thematic map was generated with the number of words set to 50, the minimum cluster set to 5, and Louvain as the clustering algorithm. Nodes are formed by the clustering of related terms or concepts, and their arrangement within the map indicates their relationships and similarities based on co-occurrence patterns in the literature.
From the thematic map, we can see that suicide, multimodality, and mental disorders are clustered in one node, which falls among the emerging or declining themes. The motor themes in a thematic map represent the central, influential, and interconnected topics or concepts within a research domain. They help researchers identify key research areas, understand the structure of knowledge, and gain insights into the dynamics and trends shaping the field. Here, the motor themes include CNNs and support vector machines. The field also has many basic or main themes, forming three nodes, each significantly bigger than the nodes of the other themes.
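The quadrant layout described above can be sketched as a simple classification by centrality and density. The cut-off values are an assumption (for example, the median of each measure across themes), and the numbers in the example are hypothetical.

```python
def classify_theme(centrality, density, centrality_cut, density_cut):
    """Assign a theme to a thematic-map quadrant.

    Centrality is the x-axis and density the y-axis; the cut-offs that
    split the map into quadrants are assumed to be supplied by the caller.
    """
    high_centrality = centrality >= centrality_cut
    high_density = density >= density_cut
    if high_centrality and high_density:
        return "motor (driving) theme"        # top right
    if high_density:
        return "specialized theme"            # top left
    if high_centrality:
        return "basic (main) theme"           # bottom right
    return "emerging or declining theme"      # bottom left


# Hypothetical themes with (centrality, density) values and cut-offs of 0.5.
themes = {"theme A": (0.8, 0.9), "theme B": (0.2, 0.1), "theme C": (0.9, 0.2)}
for name, (centrality, density) in themes.items():
    print(name, "->", classify_theme(centrality, density, 0.5, 0.5))
```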
A "collaboration network" among authors is the network of co-authorship relationships among researchers (Fig. 14). It helps identify research communities and provides insights into knowledge diffusion, specialization, and collaboration dynamics within a field. The network illustrates the flow of knowledge within a research community: researchers share and exchange ideas, methods, and findings through their collaborations. In this field, researchers do collaborate, but they work within their own groups, with little dissemination of knowledge outside them. This can be seen from the network, which contains nine research groups with no links between any of the groups.
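Isolated research groups of this kind correspond to connected components of the co-authorship graph. The sketch below finds them with a depth-first search; the author labels and pairs are hypothetical and do not reproduce the nine groups in Fig. 14.

```python
from collections import defaultdict


def research_groups(coauthor_pairs):
    """Return the connected components of an undirected co-authorship graph."""
    graph = defaultdict(set)
    for first, second in coauthor_pairs:
        graph[first].add(second)
        graph[second].add(first)

    seen = set()
    groups = []
    for start in graph:
        if start in seen:
            continue
        # Depth-first search collects every author reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            author = stack.pop()
            if author in seen:
                continue
            seen.add(author)
            component.add(author)
            stack.extend(graph[author] - seen)
        groups.append(component)
    return groups


# Hypothetical co-authorship pairs forming two separate groups.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
groups = research_groups(pairs)
print(len(groups), "groups:", [sorted(group) for group in groups])
```

If no pair links authors from different components, the groups are fully disconnected, which matches the situation described above.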
Table 2
Parameter values in generating the conceptual structure.
Parameter | Value
field | DE
ngrams | 1
method | MCA
minDegree | 2
clust | 3
k.max | 8
stemming | FALSE
labelsize | 6
documents | 5
graph | FALSE
To determine the correlation between the keywords, a factor analysis was performed. Multiple correspondence analysis (MCA) was used to create three clusters (shown in blue, green, and red colors in Fig. 15) based on the word affinity as it appears throughout the documents. Table 2 lists the values for additional parameters.
A dendrogram is a tree-like representation for showing hierarchical relationships among different data points. Figure 16 shows a dendrogram representation of different author keywords divided into three categories with a unique color-coding scheme.