Research framework
Building on previous research on topic recognition in the medical field, this paper constructs a model for recognizing burst topics in that field. The framework is shown in Fig. 1.
Burst topics can be defined as topics whose prominence increases suddenly over a period of time; identifying them helps surface the most urgent and important research in large collections. In this study, the identification of burst topics in the medical field is divided into three main steps: topic word extraction, burst word extraction, and burst topic extraction.
Topic words extraction
Interpreting and using medical information is a complicated problem. Because of the special nature of medical information data, preprocessing it with a knowledge organization system (KOS) is particularly important (Wu et al. 2015)[[1]]. The Unified Medical Language System (UMLS) (Bodenreider 2004)[[2]] is one of the most important KOS in the biomedical field, and MetaMap (Aronson and Lang 2010)[[3]] is a tool that extracts concepts from text based on the UMLS. This paper takes the MeSH terms of PubMed records as the object of study. To supplement them with more meaningful entity words, we map the title and abstract of each record to UMLS concepts, which yields the concepts found in biomedical terms together with their semantic types; the mapping is performed with MetaMap.
First, English-language journal articles from PubMed are selected as the data source, and cancer-related literature in a given field is retrieved and downloaded. One year is used as the time window, and the MeSH terms of each time window are extracted and set aside. The text mapping tool MetaMap is then used to map free text to concept words. The concept words produced by MetaMap are annotated with scores and semantic types; we select the first set of mappings returned by MetaMap and store it in the database for later use.
Burst words extraction
Although a word set has now been obtained for each document, it contains many meaningless and generic words, such as Adult, Aged, Female and Humans. We define a common-word dictionary and use it to filter these out. Identifying burst words within a large word set is a further important problem. This paper identifies burst words from three dimensions of word features: the word frequency characteristic, the word frequency increment characteristic and the word semantic characteristic, in the hope of finding more meaningful burst words.
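As a rough illustration of the filtering and per-window counting described above, the following Python sketch (not the authors' Java implementation) removes generic words and tallies term frequencies per one-year window. The function name count_terms_by_window, the input layout of (year, terms) pairs, and the choice to count each term once per document are assumptions made for this example; the dictionary holds only the four generic words named in the text.

```python
from collections import Counter, defaultdict

# Only the four examples given in the text; the full dictionary is not published here.
GENERIC_WORDS = {"Adult", "Aged", "Female", "Humans"}


def count_terms_by_window(documents, generic_words=GENERIC_WORDS):
    """Return {year: Counter(term -> number of documents containing it)}.

    `documents` is an iterable of (year, list_of_terms) pairs (assumed layout).
    """
    counts = defaultdict(Counter)
    for year, terms in documents:
        # Drop generic words and count each remaining term once per document.
        kept = {t for t in terms if t not in generic_words}
        counts[year].update(kept)
    return dict(counts)


docs = [
    (2015, ["Humans", "Neoplasms", "Immunotherapy"]),
    (2015, ["Adult", "Neoplasms"]),
    (2016, ["Female", "Immunotherapy", "CRISPR-Cas Systems"]),
]
freq = count_terms_by_window(docs)
```

Here freq[2015] counts "Neoplasms" twice and contains no generic words; matching is exact and case-sensitive, which may need relaxing for real data.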
Word frequency characteristic:
The word frequency characteristic is the most intuitive reflection of the importance of a word in the data set of a time window. If a word appears frequently, it is more likely to be relevant to the burst words in that time window. Following the method of Wang Jian (2018)[[4]], this paper does not set a word frequency threshold directly but considers the frequency of a word relative to the highest word frequency in the same time window. The equation is as follows:

$$C_n(w) = \frac{tf_n(w)}{tf_n^{\max}} \tag{1}$$

In Eq. 1, $C_n(w)$ is the word frequency weight of word $w$ in time window $T_n$, $tf_n(w)$ is the frequency of word $w$ in time window $T_n$, and $tf_n^{\max}$ is the highest frequency of any word in time window $T_n$. This method keeps the words with relatively high word frequency when extracting burst words.
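A minimal Python sketch of this relative weighting, assuming the weight is simply the ratio of a word's frequency to the highest frequency in the same window. The function name frequency_weights and the sample counts are hypothetical.

```python
def frequency_weights(tf):
    """Map each word to its frequency divided by the window's maximum frequency.

    `tf` is a dict word -> positive frequency for one time window.
    """
    if not tf:
        return {}
    tf_max = max(tf.values())
    return {word: freq / tf_max for word, freq in tf.items()}


weights = frequency_weights({"Neoplasms": 40, "Immunotherapy": 10, "Exosomes": 4})
```

Because the weight is relative, a cut-off applied to it adapts to windows of different sizes, unlike a fixed threshold on raw counts.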
Word frequency increment characteristic:
The word frequency weight considers only the high-frequency words in a time window and does not consider the trend of word frequency over time. If an event occurs suddenly, the frequency of the related burst words increases sharply in that time window; the word frequency increment characteristic is therefore introduced to identify burst words. A burst word may be a new word or an existing word whose frequency rises suddenly. This paper therefore adds a word to the candidate set of burst words if it satisfies either of the following conditions:
(1) the word does not appear in time window $T_n$ but appears in $T_{n+1}$ and $T_{n+2}$;
(2) the word appears in both time window $T_n$ and time window $T_{n+1}$, and its word frequency increment is greater than a certain threshold.
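The two conditions above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the increment is taken as the absolute difference in frequency between consecutive windows (the text does not specify absolute or relative), and the function name candidate_burst_words and the threshold parameter min_increment are hypothetical.

```python
def candidate_burst_words(tf_n, tf_n1, tf_n2, min_increment):
    """Return candidate burst words given frequencies in three consecutive windows.

    tf_n, tf_n1, tf_n2: dicts word -> frequency in T_n, T_{n+1}, T_{n+2}.
    min_increment: assumed threshold on the absolute frequency increase.
    """
    candidates = set()
    for word, f1 in tf_n1.items():
        if f1 <= 0:
            continue  # word is not actually present in T_{n+1}
        f0 = tf_n.get(word, 0)
        if f0 == 0:
            # Condition (1): absent from T_n, present in T_{n+1} and T_{n+2}.
            if tf_n2.get(word, 0) > 0:
                candidates.add(word)
        elif f1 - f0 > min_increment:
            # Condition (2): present in both windows with a large enough increase.
            candidates.add(word)
    return candidates


tf_n = {"A": 10, "B": 5}
tf_n1 = {"A": 30, "B": 6, "C": 3, "D": 2}
tf_n2 = {"A": 35, "C": 8}
result = candidate_burst_words(tf_n, tf_n1, tf_n2, min_increment=10)
```

In this toy input, "A" qualifies through condition (2) and "C" through condition (1), while "D" is rejected because it does not persist into the third window.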
Word semantic characteristic:
Medical texts contain words denoting entities such as genes, proteins, enzymes and drugs. The frequency of these words may be low, so they are easily missed by the two methods above. Therefore, this paper selects the semantic types of words with entity meaning, such as gene, protein and enzyme, and retains as burst words those words that have a low word frequency but a high annual growth rate.
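A hedged Python sketch of this semantic rule follows. Several details are assumptions, since the text does not fix them: the annual growth rate is computed as (current - previous) / previous, the thresholds max_freq and min_growth are hypothetical, the three UMLS semantic type names are an illustrative selection, and words absent from the previous year are skipped because their growth rate is undefined.

```python
# Illustrative selection of UMLS semantic types for entity-like words.
ENTITY_TYPES = {"Gene or Genome", "Amino Acid, Peptide, or Protein", "Enzyme"}


def semantic_burst_words(tf_prev, tf_curr, semantic_type, max_freq, min_growth):
    """Keep low-frequency entity words whose year-over-year growth is high.

    tf_prev, tf_curr: dicts word -> frequency in consecutive yearly windows.
    semantic_type: dict word -> semantic type name (assumed to come from MetaMap).
    """
    kept = set()
    for word, curr in tf_curr.items():
        if semantic_type.get(word) not in ENTITY_TYPES:
            continue  # not an entity type of interest
        prev = tf_prev.get(word, 0)
        if curr > max_freq or prev == 0:
            continue  # not low-frequency, or growth rate undefined
        growth = (curr - prev) / prev
        if growth >= min_growth:
            kept.add(word)
    return kept


tf_prev = {"BRCA1": 2, "PD-L1": 3, "Neoplasms": 100, "Aspirin": 2}
tf_curr = {"BRCA1": 6, "PD-L1": 4, "Neoplasms": 300, "Aspirin": 8}
types = {
    "BRCA1": "Gene or Genome",
    "PD-L1": "Amino Acid, Peptide, or Protein",
    "Neoplasms": "Neoplastic Process",
    "Aspirin": "Pharmacologic Substance",
}
selected = semantic_burst_words(tf_prev, tf_curr, types, max_freq=10, min_growth=1.0)
```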
Burst topics extraction
To better retain effective documents, we keep only the documents whose number of burst words exceeds a certain threshold. From the texts in which the burst words occur, we construct the burst word-document matrix, and then cluster this matrix with the Repeated Bisection method in the gCLUTO software.
Burst words-document matrix constructing:
Each text containing burst words can be represented as a vector: $text_i = \{text_{i,j} \mid j = 1, 2, 3, \ldots, N\}$, $i = 1, 2, 3, \ldots, T$, where $text_i$ represents the $i$th text and $text_{i,j}$ indicates whether the $j$th word occurs in the $i$th text: $text_{i,j} = 1$ means that the text contains the word, and $text_{i,j} = 0$ means that it does not.
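The construction of this binary matrix, together with the document filter described earlier in this section, can be sketched in Python as follows. The function name build_matrix, the dictionary input layout and the parameter min_burst_words are assumptions for illustration; columns are ordered alphabetically only to make the output deterministic.

```python
def build_matrix(doc_terms, burst_words, min_burst_words=0):
    """Build a binary document-by-burst-word matrix.

    doc_terms: dict doc_id -> list of terms in that document (assumed layout).
    Documents are kept only if they contain more than `min_burst_words`
    distinct burst words. Returns (doc_ids, vocab, rows).
    """
    vocab = sorted(burst_words)
    index = {word: j for j, word in enumerate(vocab)}
    doc_ids, rows = [], []
    for doc_id, terms in doc_terms.items():
        present = {t for t in terms if t in index}
        if len(present) <= min_burst_words:
            continue  # too few burst words: drop the document
        row = [0] * len(vocab)
        for word in present:
            row[index[word]] = 1  # 1 marks that the document contains the word
        doc_ids.append(doc_id)
        rows.append(row)
    return doc_ids, vocab, rows


docs = {"d1": ["A", "B", "X"], "d2": ["B"], "d3": ["X"]}
doc_ids, vocab, rows = build_matrix(docs, {"A", "B"}, min_burst_words=0)
```

With this input, document d3 is dropped because it contains no burst words, leaving two rows over the columns A and B.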
Burst topics identification by Clustering:
Based on the burst word-document matrix, this paper uses repeated bisection to identify burst topics through double clustering. The cosine function is selected as the similarity function. It is calculated as follows:

$$\cos(i, j) = \frac{\sum_{k} T_{ki}\, T_{kj}}{\sqrt{\sum_{k} T_{ki}^{2}}\; \sqrt{\sum_{k} T_{kj}^{2}}}$$

where $T_{ki}$ is the value of concept $k$ in document $i$ and $T_{kj}$ is the value of concept $k$ in document $j$.
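For reference, the standard cosine similarity between two document vectors can be computed as in the Python sketch below. It illustrates the similarity function only, not the repeated bisection procedure itself, which is carried out by gCLUTO; the function name and the sample vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: an empty document is similar to nothing
    return dot / (norm_u * norm_v)


d1 = [1, 1, 0, 1]
d2 = [1, 0, 0, 1]
d3 = [0, 0, 1, 0]
sim_12 = cosine_similarity(d1, d2)
sim_13 = cosine_similarity(d1, d3)
```

For binary vectors such as these, the numerator is simply the number of burst words the two documents share.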
Tools
(1) MetaMap (https://mmtx.nlm.nih.gov/): MetaMap is a program that matches biomedical text to the concepts in the UMLS Metathesaurus. Its parameters control the internal operating mode and the output format of the results.
(2) Java programs and MySQL database: Java programs are used to complete the format conversion, data processing and calculation.
(3) gCLUTO: it can perform double clustering analysis on a co-occurrence matrix or a word-document matrix, clustering rows and columns at the same time. It offers four clustering methods, Repeated Bisection, Direct, Agglomerative and Graph, from which the user can choose as needed. Its features include managing data files, providing visual solutions and presenting a clustering tree view of the project.