Analyzing Research Fronts in Medical Informatics Field Based on Topic Model and Life Cycle Theory

Background: Medical informatics (MI) is a multidisciplinary field in which researchers pursue scientific exploration, problem-solving, and decision-making to facilitate the effective use of biomedical data, information, and knowledge for the improvement of human health. The purpose of this study is to identify research fronts in the field of MI and ultimately elucidate research activities and trends in this field. Methods: This study used a topic model based on the latent Dirichlet allocation (LDA) method to identify research topics in the field of MI, and topic clouds were used to visualize the research topics. To identify the research front topics, we proposed a set of indicators. In addition, we investigated how front topics change over time and divided them into five categories based on the life cycle theory. Results: The data were collected from 35981 published journal abstracts between 2007 and 2016. In the topic distribution of MI, we found that the scope of MI-related research has become increasingly interdisciplinary, particularly for medical data analysis. Also, in the analysis of research fronts of MI, we found that the use of natural language processing and medical text knowledge extraction plays an essential role in the systematic analysis and indexing of the underlying semantic contents. Conclusions: By categorizing the research fronts, the results show that there are twelve growing, five stable, and two declining research fronts. We hope that this work will facilitate greater exploration of methods for identifying research fronts. Moreover, the findings of this study provide insight into the research fronts and trends in MI.


Background
Medical informatics (MI) is a multidisciplinary field in which researchers pursue scientific exploration, problem-solving, and decision-making to facilitate the effective use of biomedical data, information, and knowledge for the improvement of human health [1]. The objective of this study is to identify and analyze the research fronts in the field of MI.
Research fronts represent the focal points and the most difficult areas of scientific research. Identifying research fronts in a timely and accurate manner is of great significance for countries, institutions, and researchers.
The two methods are effective for detecting research topics. However, our literature review revealed two main problems in research front identification: ① Indicators for recognizing research fronts are lacking, because the existing detection methods depend on the accumulation of terms or citations. ② The existing methods, namely citation analysis and content analysis, neglect the semantic information between texts and therefore cannot detect research fronts semantically.
With the rapid development of machine learning technology, examining large collections of literature can help researchers to understand cryptic knowledge. Researchers have proposed some novel methods to detect research fronts, such as topic models, neural networks, support vector machines (SVM), and decision trees. Among them, the topic model can extract a valuable latent topic distribution through semantic analysis of the full text. The most widely applied topic model is the latent Dirichlet allocation (LDA) method [21][22][23].
With the purpose of identifying research fronts in the field of MI, this study applied a topic model to quantitatively investigate scientific articles published in 26 MI journals. In addition, to elucidate research activities and trends, the study explored the changes of research topics over time and divided them into five categories based on the life cycle theory. We hope that this work will facilitate greater exploration of methods for identifying research fronts.
Moreover, the findings of this study provide insight into the research front topics and trends in MI.

Data collection
We chose the Web of Science™ as our data source and adopted abstracts as the analytical corpus, because the abstract of an article, which is regarded as a condensed representation thereof, has been used to successfully identify and interpret the scientific themes of articles [24].

The topic model and parameters setting
LDA is a generative probabilistic model applicable to collections of discrete data. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words [25].
In Fig. 1, M denotes the total number of articles, N is the total number of words in an article, and K is the number of topics. We can view LDA as a dimensionality reduction technique, but with proper underlying generative probabilistic semantics that make sense for the type of data that it models. We used the R package topicmodels [26] to perform LDA modeling with the Gibbs sampling method.
The LDA model has three key parameters: the Dirichlet hyperparameters α and β, and the number of topics K. The α parameter reflects how documents are distributed over topics; the smaller the α, the better the discrimination among topics [27]. The value of α is related to the value of K in the LDA model. The β parameter reflects how the vocabulary is distributed within each topic. The value of β affects model granularity: the smaller the β, the more topics. Following Griffiths (2004) and our experience in this study, we set α = 0.1 and β = 50/K [28].
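The generative view described above can be illustrated with a minimal sketch. It is not the authors' pipeline (they used the R package topicmodels with Gibbs sampling); it only samples a toy corpus from the LDA generative process, using α for the document-topic prior and β for the topic-word prior as in the text. The function names, the toy vocabulary, and the use of symmetric Dirichlet priors are assumptions made for illustration.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, words_per_doc, num_topics, vocab, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process (hypothetical helper)."""
    rng = random.Random(seed)
    # One word distribution per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # One topic mixture per document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet(alpha, num_topics, rng)
        doc = []
        for _ in range(words_per_doc):
            topic = rng.choices(range(num_topics), weights=theta)[0]
            word = rng.choices(vocab, weights=phi[topic])[0]
            doc.append(word)
        corpus.append(doc)
    return corpus, phi


if __name__ == "__main__":
    toy_vocab = ["gene", "image", "record", "risk"]  # illustrative vocabulary
    docs, _ = generate_corpus(3, 8, num_topics=2, vocab=toy_vocab,
                              alpha=0.1, beta=50 / 2, seed=1)
    for doc in docs:
        print(" ".join(doc))
```

Fitting LDA reverses this process: given only the words, Gibbs sampling estimates the topic mixtures and the topic-word distributions.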
Perplexity [29] is a standard measure of performance for statistical models of natural language and is defined as $\exp\{-\log P(\mathbf{w}_{\text{test}} \mid \phi)/n_{\text{test}}\}$, where $\mathbf{w}_{\text{test}}$ and $n_{\text{test}}$ indicate the identities and number of words in the test set, respectively. Perplexity indicates the uncertainty in predicting a single word; lower values are better. In this paper, K was chosen by the perplexity of the topic model: the optimal K is the value at which perplexity is smallest.
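The perplexity definition and the selection of K can be sketched as follows. This is a minimal illustration, assuming the held-out log-likelihood for each candidate K has already been computed by some LDA implementation; the function names and the numbers in the usage example are hypothetical and are not results from this study.

```python
import math


def perplexity(test_log_likelihood, num_test_words):
    """exp{-log P(w_test | phi) / n_test}; lower values are better."""
    return math.exp(-test_log_likelihood / num_test_words)


def select_num_topics(log_likelihood_by_k, num_test_words):
    """Return the candidate K with the smallest perplexity, plus all scores."""
    scores = {k: perplexity(ll, num_test_words)
              for k, ll in log_likelihood_by_k.items()}
    best_k = min(scores, key=scores.get)
    return best_k, scores


if __name__ == "__main__":
    # Hypothetical held-out log-likelihoods for three candidate values of K.
    candidates = {10: -7000.0, 20: -6500.0, 30: -6800.0}
    best_k, scores = select_num_topics(candidates, num_test_words=1000)
    print(best_k, scores)
```

A useful sanity check is that a model assigning uniform probability over V words has a perplexity of exactly V.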

Indicators of Identifying the Research Fronts
This study hypothesized that the topics derived from the LDA model contain many research front topics. So how can we detect them accurately and quickly? We use word clouds to visualize each topic-word distribution φ_k, showing the top 10 words in each topic. The size of each word in a word cloud is proportional to the probability yielded by the topic model. We believe that each topic is a bag of words related to semantic content. In each topic, the words with higher probability reflect the content of the topic. Thus, we assign each topic a label or research subfield based on the high-probability words (see Fig. 4).
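The step from a topic-word distribution to word-cloud input can be sketched as below. This is an illustrative helper under stated assumptions, not the authors' code: a topic is represented as a plain dictionary from word to probability, and the linear scaling to a maximum font size is a hypothetical choice.

```python
def top_words(topic_word_probs, n=10):
    """Return the n most probable (word, probability) pairs of one topic."""
    ranked = sorted(topic_word_probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


def word_sizes(topic_word_probs, n=10, max_size=48.0):
    """Scale font sizes so that they are proportional to word probabilities."""
    top = top_words(topic_word_probs, n)
    if not top:
        return {}
    largest = top[0][1]
    return {word: max_size * prob / largest for word, prob in top}


if __name__ == "__main__":
    # Toy topic-word distribution; the words and values are illustrative only.
    topic = {"image": 0.4, "tumor": 0.2, "segmentation": 0.1, "the": 0.05}
    print(top_words(topic, n=3))
    print(word_sizes(topic, n=3))
```

The ranked list is also what a human expert would read when assigning a label to the topic.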

Identification of research fronts
Following the extraction of 62 topics, we calculated the strength of each topic, the results of which are shown in Fig.6.
With regard to novelty, we created a boxplot to visualize the values. In Fig. 7, the boxplot contains one rectangle, a dotted line, and two borderlines. The spacing between the different parts of the box indicates the degree of dispersion (spread) and skewness in the data. In addition to showing the points themselves, boxplots provide a visual estimate of various L-estimators, notably the interquartile range, midhinge, mid-range, and trimean.
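The L-estimators named above can be computed directly from the quartiles. The sketch below uses Python's standard library and the "inclusive" quartile convention, which is an assumption; other quartile conventions give slightly different values. The input values are illustrative and are not the novelty values of this study.

```python
import statistics


def boxplot_l_estimators(values):
    """Compute the L-estimators that a boxplot lets the reader estimate visually."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "interquartile_range": q3 - q1,
        "midhinge": (q1 + q3) / 2,
        "mid_range": (min(values) + max(values)) / 2,
        "trimean": (q1 + 2 * q2 + q3) / 4,
    }


if __name__ == "__main__":
    print(boxplot_l_estimators([1, 2, 3, 4, 5]))
```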
We then calculated the novelty of the 22 topics that had strengths higher than the average. In view of these findings, topics 3, 4, 8, 19, 20, and 34 were more novel than the others, because they remained active until 2015.
When we considered both the strength and novelty values, we characterized research fronts as topics with a strength higher than the average and a publication time within the most recent 5 years of our study period. Using this definition, we identified 19 research fronts in MI.
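The selection rule stated above can be sketched as a simple filter. The exact formulas for strength and novelty are not reproduced in this section, so the sketch assumes that each topic already has a strength score and a representative publication year (for example, a mean publication year); the function name, parameters, and input values are hypothetical.

```python
def identify_research_fronts(strength, publication_year, end_year=2016, window=5):
    """Keep topics with above-average strength and a recent publication time.

    strength: topic id -> strength score (assumed precomputed).
    publication_year: topic id -> representative publication year (assumed).
    """
    mean_strength = sum(strength.values()) / len(strength)
    first_recent_year = end_year - window + 1  # 2012 for the 2007-2016 period
    return sorted(
        topic for topic in strength
        if strength[topic] > mean_strength
        and publication_year[topic] >= first_recent_year
    )


if __name__ == "__main__":
    # Illustrative values only, not results from this study.
    toy_strength = {1: 0.03, 2: 0.01, 3: 0.02, 4: 0.04}
    toy_year = {1: 2013.2, 2: 2014.0, 3: 2015.0, 4: 2010.5}
    print(identify_research_fronts(toy_strength, toy_year))
```

In the toy input, topic 4 is strong but too old and topic 2 is recent but weak, so only topic 1 satisfies both conditions.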

Classification of research fronts
In this study, the corpus was divided into two time windows: T1 (2007-2011) and T2 (2012-2016). The topic growth rate was then calculated, and the results are shown in Figure 8.
Each sphere in the figure represents a topic, and the size and height of the sphere represent the growth rate of the topic.

Table 1 The types of Research Fronts
Type       No   Topic label
Growing    1    Tumor image analysis
Growing    3    Algorithm on medical data mining
Growing    4    Medical text knowledge extraction
Growing    8    Health care application
Growing    19   New medical pattern based on web
Growing    21   Disease classification method
Growing    22   Medical system and software
Growing    31   Health information system evaluation
Growing    34   Disease survival model
Growing    48   Computer-assisted diagnosis of disease
Growing    54   Semantic analysis of clinical knowledge
Growing    58   Medical big data platform
Stable     9    Community health service
Stable     17   Clinical decision support
Stable     41   Medical informatics methods and techniques
Stable     45   Electronic medical records
Stable     53   Machine learning algorithms in medicine
Declining  24   Medical data integration
Declining  46   Disease risk prediction
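The comparison between the two time windows can be sketched as follows. The growth-rate formula and the category boundaries are not given in this section, so the sketch assumes a relative change in topic share between T1 and T2 and a hypothetical tolerance band of ±10% around zero for the "Stable" type. It covers only the three types that appear in Table 1.

```python
def growth_rate(share_t1, share_t2):
    """Relative change of a topic's share from T1 to T2 (assumed definition)."""
    return (share_t2 - share_t1) / share_t1


def classify_front(rate, tolerance=0.10):
    """Map a growth rate to a type; the tolerance is a hypothetical threshold."""
    if rate > tolerance:
        return "Growing"
    if rate < -tolerance:
        return "Declining"
    return "Stable"


if __name__ == "__main__":
    # Illustrative topic shares for T1 (2007-2011) and T2 (2012-2016).
    toy_shares = {"A": (0.020, 0.030), "B": (0.020, 0.0205), "C": (0.020, 0.015)}
    for topic, (t1, t2) in toy_shares.items():
        rate = growth_rate(t1, t2)
        print(topic, round(rate, 3), classify_front(rate))
```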

Discussion
In the topic distribution of MI, we found that the scope of MI-related research has become increasingly interdisciplinary, particularly for medical data analysis. There are 16 topics concerned with medical data analysis. With the ability to deal with large volumes of both structured and unstructured data from different sources (T58), big data analytical tools (T41, T53, T3) hold the promise to study outcomes of large-scale population-based longitudinal studies, as well as to capture trends and propose predictive models (T12, T46) for data generated from electronic medical and health records (T45). A unique opportunity lies in the integration of traditional MI with mobile health (T8) and social health (T59), addressing both acute and chronic diseases in a way that we have never seen before [31].
In the analysis of research fronts of MI, we found that the use of natural language processing and medical text knowledge extraction (T4) plays an essential role in the systematic analysis and indexing of the underlying semantic contents (T54). Mining electronic health records (EHRs) is a valuable tool for improving clinical knowledge and supporting clinical research (T17), for example, in discovering phenotype information.
More importantly, genetic testing and analysis (T10) will help to screen out patients who are more likely to develop related diseases [32]. Further studies on these high-risk patients (T46) based on tumor image analysis (T1) may provide insight into the rate of disease development. To assist disease diagnosis (T48) and support clinical decisions (T17), a combination of medical images, medical records (T45), demographics, and lab test results is key to characterizing the structure, function, and progression of diseases (T24). This requires the implementation of effective and optimized querying systems (T22) [33] in order to reduce the computational complexity of handling these data. In addition, the evaluation of health information systems (T31) focuses on the quality management and evaluation of the system, such as its effect on health services [34,35].
Our study has several limitations, mostly due to the complex nature of our research subject. One limitation is that we considered only sources indexed by WoS. Therefore, our approach did not include a few MI journals that are not covered by ISI. The LDA method itself also imposes several limitations. The text mining method still depends on important choices of parameters, and the attribution of labels to groups is a matter of expert opinion that needs substantial human intervention.
The number of topics can be chosen to be either smaller or larger, depending on just how "loosely and generally" one wishes to define such a heterogeneous and complex field of study and application as MI. We set the K value by the perplexity of the LDA model, however

Word clouds for Topic 1-30.
Figure 6 Strength of topics.

Figure 7 Boxplot of novelty.

Figure 8 The growth rate of research fronts.