Quantitative and visualized analysis of 5,097 triple-negative breast cancer (TNBC) publications by using multiple machine learning algorithms: TNBC is an excellent forerunner in cancer research

Triple-negative breast cancer (TNBC) is a subtype of breast cancer proposed at the beginning of this century. It remains the most challenging breast cancer subtype because of its aggressive behavior, including early relapse, metastatic spread, and poor survival. This study explores the current research status and deficiencies of TNBC publications from a macro perspective by using machine learning methods. All publications indexed under the MeSH term "Triple Negative Breast Neoplasms" in PubMed as of December 2020 were searched and downloaded. R and Python were used to extract MeSH terms, geographic information, abstracts, and other metadata. The Latent Dirichlet Allocation algorithm was applied to identify specific research topics. The Louvain algorithm was used to establish a topic network and identify the relationships between the topics.


Introduction
It is well known that breast cancer is the most common malignancy in women. It currently accounts for 30% of newly diagnosed malignant tumors and 15% of cancer deaths in women [1]. Breast cancer has been studied for a long time, and in 2000 Perou et al. described its intrinsic molecular subtypes for the first time using cDNA microarray technology [2]. Nevertheless, the incidence of breast cancer is increasing worldwide due to industrialization and changes in lifestyle [3]. Triple-negative breast cancer (TNBC) is the most aggressive subtype of breast cancer, accounting for about 10-20% of breast cancer cases [4,5].
Unfortunately, the survival of patients with TNBC is unsatisfactory due to the lack of specific therapeutic targets [6]. In addition, with the rapid progress of the publishing industry, the number of published papers has risen sharply [7]. As a result, researchers and administrators can hardly follow the TNBC research situation from a macro perspective, which makes it difficult for them to see the whole picture of research on TNBC. Without such a macro understanding and proper design to coordinate research progress, TNBC will continue to impose a heavy burden on humankind in the foreseeable future.
Quantitative analysis of the literature can reveal past research interests, public concerns, and study deficiencies, and can help predict future research. Bibliometrics is a quantitative method for analyzing academic publications that can reveal the progress of a discipline from a macro perspective and support the choice of future research directions. Bibliometric analyses of the TNBC literature are extremely scarce. In 2018, Teles et al. conducted a bibliometric study of 1,932 publications to study the global trend of nanomedicine research on TNBC [8]. However, the inclusion criteria of that study are too broad, and its analysis methods are insufficient to characterize the status quo of TNBC research. Because practical language analysis tools for integrating textual metadata have been lacking, a bibliometric study covering all publications on TNBC is still missing.
Natural Language Processing (NLP), a part of machine learning, is a computing technology used to analyze human language [9]. This technology has been successfully applied to medical text information [10]. Latent Dirichlet Allocation (LDA) is the most classical topic modeling method in bibliometrics for summarizing large amounts of unstructured text [10]. LDA can perform topic analysis on a large number of texts. Compared with earlier methods, LDA can uncover the semantic structure behind the text, and it has been widely used in medicine and other fields [12]. We recently applied LDA and NLP methods to analyze more than 23,000 rectal cancer-related publications between 1994 and 2018 [12]. We successfully identified research deficiencies in the last 25 years and predicted the future research focus. Therefore, mature LDA methods and machine learning techniques can be used to survey current research from a macro perspective, to identify research topics that were missing in the past, and to predict potential research breakthroughs in the future. In the present study, we analyzed all past TNBC publications indexed by PubMed under the medical subject heading (MeSH) term "Triple Negative Breast Neoplasms". We improved our algorithm based on our previous research and conducted a more detailed analysis of all TNBC publications, with more visual presentation, to highlight current hot areas in TNBC together with research deficiencies and specific areas with future opportunities.

Methods
All publications under the MeSH term "Triple Negative Breast Neoplasms" were downloaded as of December 31, 2020. The complete record of the search results was downloaded in XML format through R's easyPubMed package. R and Python were used to extract XML data, including the publication year, abstract, research type, geographic information, and other MeSH terminology.
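The extraction step can be illustrated with a minimal Python sketch. It is not the code used in this study: the sample record is a hand-written placeholder that follows the general layout of PubMed XML exports, the function name parse_records is hypothetical, and geographic information (usually derived from affiliation fields) is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hand-written placeholder record in the PubMed XML layout (not real data).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>0</PMID>
      <Article>
        <Journal><JournalIssue><PubDate><Year>2019</Year></PubDate></JournalIssue></Journal>
        <Abstract>
          <AbstractText Label="BACKGROUND">Placeholder background.</AbstractText>
          <AbstractText Label="RESULTS">Placeholder results.</AbstractText>
        </Abstract>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
        </PublicationTypeList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Triple Negative Breast Neoplasms</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_records(xml_text):
    """Return one dict per PubmedArticle with year, abstract, types and MeSH terms."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for article in root.iter("PubmedArticle"):
        # Structured abstracts are split into several AbstractText nodes; join them.
        abstract = " ".join(
            "".join(node.itertext()).strip()
            for node in article.iter("AbstractText")
        )
        records.append({
            "pmid": article.findtext(".//PMID"),
            "year": article.findtext(".//PubDate/Year"),
            "abstract": abstract,
            "types": [t.text for t in article.iter("PublicationType")],
            "mesh": [d.text for d in article.iter("DescriptorName")],
        })
    return records


records = parse_records(SAMPLE_XML)
print(records[0]["year"], records[0]["mesh"])
```

Each resulting record keeps the abstract as a single string, which is the form a topic model would typically take as input.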
LDA was used to identify more specific research topics in each article. Python was used to model the topics by analyzing the abstracts of all indexed articles in the record. The number of topics was set to 50, chosen on the basis of perplexity, redundancy, and legibility. Based on the topic probabilities calculated by the algorithm, we finally determined the topic to which each article belongs.
Next, we manually checked the name of each topic based on the abstracts. Finally, we used the Louvain algorithm and Gephi to perform cluster analysis and establish a topic network that shows the relationships between topics [14]. In each publication, we identified the two topics with the highest attribution probability, counted the number of times each pair of topics occurred together across documents, and established links between topics accordingly.
All the original data, including all retrieval methods, algorithm code, and raw literature data in this article, have been uploaded and are publicly available. The literature search and download were performed in R with the easyPubMed package (https://cran.r-project.org/web/packages/easyPubMed/index.html). The R code is publicly available on GitHub (https://github.com/christopherBelter/pubmedXML). We have uploaded the relevant Python code to GitHub (https://github.com/yanwen0614/Medicine-Bibliometric-Analysis). The network visualization in this article was carried out using the software package Gephi (https://github.com/gephi/gephi). The original data of the study can be downloaded (https://pan.baidu.com/s/1nuw6MBvyYDOfMAsBEivtXQ; access code: cgvs). This study used publicly published data and did not need to be approved by the relevant institutional review board or ethics committee.
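The final assignment step can be sketched as follows. This is an illustration only: it assumes the fitted LDA model has already produced a document-topic probability matrix, the matrix shown is a made-up toy example with three topics rather than 50, and the helper names dominant_topic and topic_sizes are hypothetical.

```python
from collections import Counter


def dominant_topic(topic_probs):
    """Index of the topic with the highest probability for one document."""
    return max(range(len(topic_probs)), key=topic_probs.__getitem__)


def topic_sizes(doc_topic):
    """Number of documents assigned to each topic."""
    return Counter(dominant_topic(row) for row in doc_topic)


# Toy document-topic matrix: one row per abstract, one column per topic.
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.55, 0.05, 0.40],
]

assignments = [dominant_topic(row) for row in doc_topic]
sizes = topic_sizes(doc_topic)
print(assignments, dict(sizes))
```

The per-topic counts produced this way are the kind of quantity that can be mapped to node size in a topic network.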
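The link-building rule described above can be sketched in Python. The sketch makes several assumptions: the document-topic matrix is a made-up toy example, the helper names are hypothetical, and the edge list is written with Source, Target, and Weight columns, a layout commonly used for importing edges into Gephi. Community detection itself is not implemented here; it would be run afterwards on the weighted edge list.

```python
import csv
import io
from collections import Counter


def top_two(topic_probs):
    """Return the two most probable topics of a document as a sorted pair."""
    ranked = sorted(range(len(topic_probs)),
                    key=lambda k: topic_probs[k], reverse=True)
    return tuple(sorted(ranked[:2]))


def cooccurrence_edges(doc_topic):
    """Count how often each pair of topics is the top-two pair of a document."""
    return Counter(top_two(row) for row in doc_topic)


def to_edge_csv(edges):
    """Serialize weighted edges as CSV text (assumed Gephi-style columns)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(edges.items()):
        writer.writerow([a, b, weight])
    return buf.getvalue()


# Toy document-topic matrix with four topics (illustrative values only).
doc_topic = [
    [0.50, 0.30, 0.15, 0.05],  # top two topics: 0 and 1
    [0.40, 0.45, 0.10, 0.05],  # top two topics: 0 and 1
    [0.05, 0.15, 0.50, 0.30],  # top two topics: 2 and 3
    [0.45, 0.05, 0.40, 0.10],  # top two topics: 0 and 2
]

edges = cooccurrence_edges(doc_topic)
print(to_edge_csv(edges))
```

Sorting each pair makes the edges undirected, so a document ranking topics (1, 0) and another ranking them (0, 1) contribute to the same link.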

Results
Overview of the data
Although TNBC was proposed at the beginning of this century, it was not officially included as a MeSH term until 2014. We identified and analyzed all 5,097 publications between 2012 and 2020 (Fig. 1). On average, 141 more publications were published each year than in the previous year, with an average annual growth rate of 33.5%. The high growth rate indicates that TNBC is a research hotspot and that the research is progressing rapidly.
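For readers who wish to reproduce this kind of summary, the two statistics can be computed as sketched below. The yearly counts are hypothetical placeholders, not the data of this study, and the growth rate is taken here as the arithmetic mean of year-over-year rates, which is an assumption because a compound rate is an equally common definition.

```python
def yearly_increases(counts):
    """Difference between each year's count and the previous year's count."""
    return [b - a for a, b in zip(counts, counts[1:])]


def mean_annual_growth_rate(counts):
    """Arithmetic mean of the year-over-year growth rates."""
    rates = [(b - a) / a for a, b in zip(counts, counts[1:])]
    return sum(rates) / len(rates)


# Hypothetical publication counts for three consecutive years.
counts = [100, 150, 180]

increases = yearly_increases(counts)
mean_increase = sum(increases) / len(increases)
growth = mean_annual_growth_rate(counts)
print(mean_increase, round(growth * 100, 1))
```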

Categories of publications on TNBC research
To explore the research areas regarding TNBC, we first divided the publications into nine categories according to the database-provided areas in cancer research (Fig. 2). We found that randomized clinical trials and related studies accounted for roughly 25% of total publications (Fig. 2). Such a high proportion of clinical trials has rarely been reported in research on other tumor types, indicating that translational research on TNBC has been carried out quickly. Reviews and meta-analyses accounted for around 32% and 2.2% of publications, respectively. Because high-quality meta-analyses are generally regarded as guiding clinical practice, it is reasonable to anticipate that the number of meta-analyses on TNBC will increase. The large number of clinical trials on TNBC has already improved clinical practice and will continue to do so.

Uneven geographical distribution of global TNBC research
To further understand the global TNBC research situation, we analyzed the geographic information. We found that 88 countries or regions worldwide have published research on TNBC (Fig. 3). The top 10 countries accounted for 74.4% of all publications, indicating a pronounced head effect. Moreover, more than half of the publications came from the United States, China, Italy, and France, which accounted for 23.1%, 20.5%, 6.9%, and 6.0% of all publications, respectively (Fig. 4).
This phenomenon reminds us that the vast majority of the global population has not participated in TNBC research and even lacks primary data. It also suggests that current TNBC research does not contain enough data on the different genetic backgrounds, and we only have a limited view of the whole picture of TNBC in humanity.

Pathogenic mechanisms and drugs are most studied in TNBC research
A total of 4,299 MeSH terms appeared 133,259 times in all 5,097 publications, indicating that the studies covered multiple aspects. On average, 26 MeSH terms appeared in each publication, showing the breadth and depth of studies on TNBC and the intense intersection between different disciplines. The 10 most frequent MeSH terms are listed in Fig. 5. Both metabolism and genetics have appeared more than 2,000 times so far, suggesting that past research on TNBC focused on exploring its molecular pathogenesis. In addition, 5 of the 10 most frequent MeSH terms are directly related to medication research. Therefore, we infer that pathogenic mechanisms and medication research will continue to be a focus of TNBC research in the foreseeable future.
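The counting behind these figures can be sketched in a few lines of Python. The MeSH lists below are made-up placeholders and the function name mesh_frequencies is hypothetical; only the final check uses the totals reported above (133,259 occurrences across 5,097 publications).

```python
from collections import Counter


def mesh_frequencies(mesh_lists):
    """Count how many times each MeSH term occurs across all publications."""
    return Counter(term for terms in mesh_lists for term in terms)


# Placeholder MeSH lists for three publications (not real records).
mesh_lists = [
    ["Humans", "Female", "Triple Negative Breast Neoplasms"],
    ["Humans", "Triple Negative Breast Neoplasms"],
    ["Female", "Humans"],
]

freq = mesh_frequencies(mesh_lists)
distinct_terms = len(freq)              # number of different terms
total_occurrences = sum(freq.values())  # number of term occurrences
top_terms = freq.most_common(1)

# Consistency check on the reported totals: occurrences per publication.
reported_average = 133_259 / 5_097
print(distinct_terms, total_occurrences, top_terms, round(reported_average, 1))
```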

TNBC research is mainly in three areas
The topic network built with LDA and the Louvain algorithm highlights the areas where interrelated topic clusters appear together and provides insight into the relationships between the main topics of interest. We divided the publications into 50 topics. As seen in Fig. 6, the size of each circle represents the number of papers published, and the thickness of each line represents the degree of relevance between two research topics. The results of the analysis suggest that all TNBC-related studies fall roughly into three clusters (Fig. 6).
The therapy plan itself is mainly studied in the TNBC treatment plan research cluster (in orange). However, this cluster also includes some studies on signaling pathways and targets, such as Wnt/β-catenin signaling, PTEN, EGFR, BRCA, and PARP1. This cluster is particularly close to the other two clusters, indicating that clinical research and basic research on TNBC are closely linked. We also found that basic research can be quickly translated into clinical practice through clinical trials, improving patient prognosis.
In the new biomarkers research cluster (in green), receptor study, age and risk study, and immune checkpoint are extensively investigated topics. These three topics are closely related to one another and to the therapy plan. In the cluster on the regulation mechanism for TNBC aggressive behavior (in purple), growth and regulation factors, apoptosis study, and nanoparticle delivery system are the three most studied topics. Furthermore, we found that magnetic resonance imaging (MRI) and ultrasound were closely related to the hypoxia study, suggesting that imaging changes may be related to hypoxia in the tumor microenvironment.

Discussion
Since the initial proposal of TNBC at the beginning of this century, substantial progress has been made in identifying therapeutic targets and in more detailed tumor classification [2]. However, few studies on hospice care, the patient perspective, surgical treatment of metastasis, and economics are available. Future research will focus on applying new technologies to understand the complex system of TNBC and on finding more specific therapeutic targets. In addition, the recurrence of TNBC after long-term survival still presents new challenges for its clinical management. A large number of clinical trials on TNBC are currently under way. The rapid clinical progress suggests that TNBC is an excellent pioneering area in tumor research and will provide an excellent example for the diagnosis and treatment of other tumors.
It is well known that TNBC is substantially heterogeneous. Therefore, it is necessary to further classify TNBC into multiple subtypes [15,16]. Successful subtyping provides a solid theoretical basis for the precision therapy of TNBC [17]. Gene sequencing technology allows us to fully understand the mutation rate of TNBC, which is about 1.68bp/Mb [18]. Mutations occur in genes in multiple key signaling pathways, such as the PI3K/Akt/mTOR pathway, RAS/RAF/MEK pathway, JAK/STAT pathway, DNA repair pathway, and cell cycle checkpoint [19-21]. Therefore, a variety of drugs targeting these signaling pathways are currently undergoing clinical trials. Some inhibitors have been used as potential medications for TNBC treatment, including PI3K, MEK, PARP, EGFR, VEGF, and AR inhibitors [22]. On the other hand, studies on surgery and radiotherapy, especially on re-operations related to the risk of local-regional recurrence or distant metastasis, were rarely reported. In fact, many studies suggest that surgery has an essential role in treating distant metastases of cancers such as colorectal cancer [23]. In addition, many studies on other cancers, including pancreatic cancer and colorectal cancer, demonstrated that the tumor microenvironment, especially the extracellular matrix, plays an essential role in cancer metastasis, local recurrence, and chemotherapeutic drug resistance [23]. Several potential drugs are being used because of their ability to target the extracellular matrix, such as PEGPH20 (an enzyme that targets matrix hyaluronic acid) and pegilodecakin (a PEGylated IL-10) [25,26]. However, so far, research on the extracellular matrix in TNBC is insufficient. Studies of the extracellular matrix in TNBC are likely to play an essential role in the future.
In the present study, we quantitatively analyzed all 5,097 publications on TNBC through multiple machine learning algorithms for the first time. We found that the research on TNBC mainly focuses on three clusters: TNBC treatment plan research, new biomarkers research, and the regulation mechanism for TNBC aggressive behavior. In the cluster on the regulation mechanism for TNBC aggressive behavior, we found an interesting close connection between hypoxia and imaging. Hypoxia refers to the inability of malignant cells to receive enough oxygen as the tumor expands [27]. Low oxygen levels stabilize hypoxia-inducible factors (HIFs), which in turn regulate the transcriptional activation of a group of genes, allowing cells to survive under hypoxic conditions [28]. On the one hand, we speculate that these manifestations will be evident in imaging, which will improve the early diagnosis of TNBC and the detection of possible distant metastases in the future. On the other hand, the hypoxic properties of TNBC bring new hopes for treatment with the aid of imaging such as MRI and ultrasound, which are highly sensitive to the hypoxic microenvironment of TNBC. Studies have found that carbonic anhydrase-IX-directed albumin nanoparticles have a lethal effect on hypoxic triple-negative breast cancer cells and can be visualized in real time [29]. At present, new hypoxia-inducible factor-2α antagonists such as PT2385 have shown encouraging results in phase I clinical trials of previously treated advanced clear cell renal cell carcinoma [30]. More than two-thirds of patients reported clinical benefits with acceptable toxicity profiles. PT2977, a more effective second-generation hypoxia-inducible factor-2α antagonist, is the subject of several ongoing solid tumor clinical trials (NCT02974738, NCT03634540, NCT03401788) [31]. With in-depth study of TNBC hypoxia aided by imaging, developing new treatment strategies for cancers is possible.
Although research on TNBC has made significant progress in many aspects, the present study also found some research deficiencies. First, there are considerable geographic disparities in research. It has been found that individuals with different genetic backgrounds have different responses to TNBC, survival rates, and mortality rates [32,33]. However, the global distribution of TNBC research is uneven, and the research on populations with different genetic backgrounds is insufficient. For example, only 61 publications in the PubMed database are related to African descent. Strengthening TNBC research on individuals with different genetic backgrounds could facilitate our understanding of TNBC from different perspectives [34]. Second, there is a lack of research on TNBC from the patient's perspective, e.g., health economics and hospice care. As the 5-year overall survival rate of most tumors has greatly improved, helping tumor patients with psychological issues re-enter society will become an important new research topic [35]. Compared with other breast cancer subtypes, TNBC is more likely to relapse and metastasize, resulting in greater pressure, both mental and economic, on patients and their families. Studies on patients with more prolonged survival can improve our understanding of TNBC and even of other tumors with long-term survival [36]. We will face more challenges for patients with a long survival period of 5-10 years in the future [37].
There are some limitations in the present study. First, besides PubMed, several other databases, including Scopus, Web of Science, and Embase, could be used for bibliometric research. Although PubMed contains the highest quality peer-reviewed research and excludes irrelevant, non-peer-reviewed papers, exploring other databases simultaneously would provide more detailed and comprehensive knowledge. Second, publications tend to report positive research results. Negative results and clinical participants' perspectives are inherently more difficult to publish.
In summary, with the further expansion of the publishing industry and the explosion in the number of information dimensions, the vast array of documents will make the understanding of TNBC more difficult. Researchers can become lost in the information because it is not easy to obtain what they want. With the development of complete medical record texts, publication databases, and improved algorithms, it is reasonable to expect machine learning to play a more active auxiliary role in future clinical practice. Machine learning and natural language processing may be an extremely effective new tool for scientists to extract objective and comprehensive clues from large amounts of data. The data presented in this study will hopefully help scientists understand the current status of TNBC research and design more relevant basic and clinical research projects.

Concluding Remarks
We analyzed 5,097 TNBC publications through NLP methods. We summarized the current research status from a macro perspective and found that current research is mainly concentrated in three clusters. TNBC research shows some insufficiencies, especially in long-term survival-related research and in research from patients' perspectives. TNBC is a good forerunner in cancer research.

Declarations
Ethics approval and consent to participate
This study used publicly published data and did not need to be approved by the relevant institutional review board or ethics committee.

Consent for publication
Not applicable

Availability of data and materials
The original contributions presented in the study are included in the article. Further inquiries can be directed to the corresponding authors.

Figure 6. Topic cluster network studied by Latent Dirichlet Allocation: inter- and intra-relationships. TNBC treatment plan research (shown in orange), new biomarkers research (shown in green), and regulation mechanism for TNBC aggressive behavior (shown in purple) are the three major clusters in TNBC research. The circle size represents the number of papers in each topic; the line's thickness represents the weight of the connection between each pair of topics.