Optimizing Literature Search: TEMAS, A New Text-Mining Algorithm-Assisted Search Tool

Background: Literature search is challenging when thousands of articles are potentially involved. To facilitate literature search, we created TEMAS, a Text Mining Algorithm-assisted Search tool, which we compared to a PubMed reference search (RS) in the context of etiological epidemiology.
Methods: TEMAS has four steps: 1) a classic PubMed global search; 2) a first sort that removes articles without abstracts or containing off-topic terms; 3) a clustering step based on a descending hierarchical classification that groups articles into independent classes; 4) a final sort that extracts from the targeted class the abstracts containing the terms of interest, with a link to the corresponding PubMed articles. Validation was performed for risk factors of breast cancer. We estimated the precision and recall rates compared to RS. Average precision and discounted cumulative gain (DCG) were also computed to perform a ranking-based evaluation. We also compared TEMAS results with the articles selected in two meta-analyses.
Results: Four risk factors of breast cancer were explored: breastfeeding, mammographic density, oral contraceptive, and menarche. TEMAS consistently increased precision vs RS (by 23 to 32 percentage points), with a recall rate of 95% to 97%, and divided the number of selected articles to read by a factor of 2.3 to 4.8. Mean average precision for 100 articles was 47.4% for TEMAS vs 20.9% for PubMed ranked by best match, and DCG showed a consistent improvement for TEMAS compared to PubMed best match.
Discussion: TEMAS divided the number of results of a literature search by 3.2 on average, and improved the precision rate, the average precision, and the DCG compared to RS for epidemiological studies. Reducing the number of selected articles inevitably impacted the recall rate. However, it remained satisfactory and did not bias the corpus of information. Moreover, the recall rate was 100% for the two meta-analyses we analyzed, which suggests that the loss of recall observed above concerned articles not relevant enough to be included in the meta-analyses.
Conclusion: TEMAS provides a user-friendly interface for non-specialists of literature search confronted with thousands of articles and appeared useful for meta-analyses.


INTRODUCTION
Literature search is an integral part of any research project. It requires using the correct syntax and the right search engine, as well as appropriately selecting and managing information and documents. This last task becomes complex when the search returns thousands of articles.
The PubMed search engine indexes more than 30 million citations of the biomedical literature, from more than 5,200 journals worldwide and online books. Several communities have addressed the challenge of automated information retrieval in the literature [1][2][3]. Several methods are available for performing systematic reviews, such as PRISMA [4], MOOSE [5], or COSMOS-E [6]. These methods are very effective when the research question is clearly formulated; however, they afford little guidance when the informational need of the researcher is less circumscribed, for instance when looking for risk factors or predictive factors. There are also many methods using machine learning algorithms [7][8][9], or methods focused on specific topics [10].
However, these methods do not help sort a large number of articles without a priori assumptions.
We therefore developed a new tool that allows researchers to easily explore very large volumes of data and to select relevant articles using a text mining approach. We assessed whether TEMAS, a text mining algorithm-assisted search, was more relevant than a classic PubMed search in the field of etiological epidemiology, which studies the causal factors of diseases. Most etiological studies are observational case-control studies or exposed/non-exposed cohort studies. The validation of this tool was carried out by comparing TEMAS and a classic PubMed search for non-genetic risk factors of breast cancer and for meta-analyses including case-control or cohort studies.

MATERIALS AND METHODS
We developed an interactive web application using Shiny [11] to implement our new Text Mining Algorithm-assisted Search (TEMAS) tool. Shiny is an R [12] package for building interactive web apps straight from R; it combines the computational power of R with the interactivity of the modern web. We complied with the MECIR [13] guidelines for searching and selecting studies (1.5 and 1.6), available in Additional file 3. This four-step web application is available at https://shiny.temas-bonnet.site.

TEMAS Tool description
A flowchart describing the successive steps of the TEMAS tool is shown in Figure 1.

TEMAS Step 1: Global Search
Our new tool allows the user to choose a date range for the study and to enter the search keywords (labeled "Global Search"). An advanced search is possible since the tool respects the PubMed search syntax. The number of PubMed results matching the search criteria is displayed.
All the abstracts are retrieved using the rentrez package [14] and stored in a TXT format that will be used at the next step.
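To make the global search concrete, here is a minimal sketch of how a date-restricted PubMed query can be expressed against the NCBI E-utilities ESearch endpoint, the web service that rentrez wraps. TEMAS itself is written in R; this Python version is only illustrative. The function name `build_esearch_url`, the example query, and the `retmax` value are assumptions, not taken from the tool.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint used to search Entrez databases.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, min_year, max_year, retmax=100):
    """Build an ESearch URL restricted to a publication-date range.

    `query` uses the regular PubMed search syntax (field tags, AND/OR/NOT).
    """
    params = {
        "db": "pubmed",
        "term": query,
        "datetype": "pdat",   # filter on publication date
        "mindate": str(min_year),
        "maxdate": str(max_year),
        "retmax": str(retmax),
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical query, for illustration only.
url = build_esearch_url('"breast neoplasms"[MeSH Terms] AND risk', 2010, 2020)
print(url)
```

The sketch only builds the request; fetching the matching abstracts would be a second call to the EFetch endpoint with the returned identifiers.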

Clustering exploration
The resulting abstracts are analyzed using the Rainette package [16]. This is an R package for multidimensional analysis of texts and questionnaires that enables statistical analyses of large corpora of texts [17]. More specifically, it classifies abstracts by similarity using Reinert's classification [18].
It is a descending hierarchical classification (DHC) carried out in two stages, which offers a global view of the corpus explored. After the corpus is partitioned, statistically independent classes of words are identified. This type of analysis groups articles according to concepts. The classes are interpretable through their profiles, which are characterized by specific correlated terms. The DHC summarizes this process in a dendrogram, which makes it possible to choose a clustering of 2 to 6 classes. Once these classes are characterized, the class of interest can be identified among them: it is the class containing the terms or concepts that best match the user's search.

Clustering extraction
Then the extraction step begins. At this stage, there are two distinct possibilities, depending on whether the search deals with a single term or multiple associated terms:
-Single term extraction: select the abstracts containing the "term of interest" within the retained class of interest mentioned above.
-Multiple terms extraction: first, target the most representative term, here called the "main term", and define the "complementary term". Then define the maximum distance (expressed as a number of characters) allowed between the "main" and "complementary" terms. The selected articles contain the "main term" and the associated "complementary term" within the defined maximum distance (e.g. for "oral contraceptive", the "main" term is "contraceptive" and the "complementary" term is "oral").
These terms are called "extracted term(s)".
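The two extraction modes can be sketched as follows. This is an illustrative Python version, not the R code of TEMAS. It assumes that the distance is the number of characters separating the two matches (end of one to start of the other), in either order, and that matching is a case-insensitive substring search. The function names are hypothetical.

```python
import re


def has_single_term(abstract, term):
    """True if the (possibly truncated) term occurs anywhere in the abstract."""
    return term.lower() in abstract.lower()


def has_term_pair(abstract, main, complementary, max_distance):
    """True if `complementary` occurs within `max_distance` characters of `main`.

    Assumption: the distance is the gap between the end of one match and the
    start of the other, whichever comes first in the text.
    """
    text = abstract.lower()
    main_spans = [m.span() for m in re.finditer(re.escape(main.lower()), text)]
    comp_spans = [m.span() for m in re.finditer(re.escape(complementary.lower()), text)]
    for m_start, m_end in main_spans:
        for c_start, c_end in comp_spans:
            if c_end <= m_start:
                gap = m_start - c_end      # complementary term comes first
            elif m_end <= c_start:
                gap = c_start - m_end      # main term comes first
            else:
                continue                   # overlapping matches are ignored
            if gap <= max_distance:
                return True
    return False


near = "Use of oral contraceptives and breast cancer risk."
far = "Oral hygiene was recorded. Many years later, hormonal contraceptive use was assessed."
print(has_term_pair(near, "contracept", "oral", 25))  # True
print(has_term_pair(far, "contracept", "oral", 25))   # False
```

A character-based window is a simple proxy for "the two words belong to the same expression"; it tolerates hyphens, plurals, and short intervening words without requiring any linguistic parsing.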

TEMAS Step 4: Final sort
This last step performs the final sorting of the articles extracted in the previous step. By default, it keeps only the articles whose abstracts contain at least one of the following terms: Odds Ratio, Odds, OR, Relative Risk, RR, Hazards Ratio, or HR. Other terms relevant to the targeted search may also be added.
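A possible form of this default filter is sketched below in Python, for illustration only. How TEMAS matches these terms is not specified here, so two choices are assumptions of the sketch: full phrases are matched case-insensitively (also accepting "hazard ratio"), and abbreviations are matched as whole upper-case words so that the conjunction "or" is not counted as "OR".

```python
import re

# Full phrases, matched case-insensitively ("odds" also covers "odds ratio").
PHRASES = re.compile(r"\b(odds|relative risks?|hazards? ratios?)\b", re.IGNORECASE)
# Abbreviations, matched as whole upper-case words (optionally plural: "ORs").
ABBREVIATIONS = re.compile(r"\b(OR|RR|HR)s?\b")


def passes_final_sort(abstract, extra_terms=()):
    """Keep an abstract if it mentions a default term or a user-supplied one."""
    if PHRASES.search(abstract) or ABBREVIATIONS.search(abstract):
        return True
    lowered = abstract.lower()
    return any(term.lower() in lowered for term in extra_terms)


print(passes_final_sort("The OR was 1.2 (95% CI 1.0-1.4)."))            # True
print(passes_final_sort("Patients received surgery or radiotherapy."))  # False
```

The `extra_terms` argument plays the role of the additional terms that the user may add for a targeted search.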

Precision and recall rates
We studied the "relevance" of our information retrieval [19]. Information retrieval evaluation is based on relevance metrics, namely recall and precision, which were estimated as follows [20][21][22]. Precision is the fraction of relevant instances among the retrieved instances. Recall is the fraction of the total amount of relevant instances that were actually retrieved. Both precision and recall are therefore based on an understanding and measure of relevance.
The measurement of precision and recall rates requires a qualified individual to inspect the output from a search and to sort it into two groups of articles: relevant and not relevant. The RS retrieved $n_{RS}$ articles, which corresponds to the total number of RS articles.
In order to ascertain the content of each selected article, a stratified random sample of 300 articles to read was drawn: $\frac{n_{TEMAS}}{n_{RS}} \times 300$ articles among the $n_{TEMAS}$ articles retained by TEMAS, and $\frac{n_{RS} - n_{TEMAS}}{n_{RS}} \times 300$ articles among the $n_{RS} - n_{TEMAS}$ remaining RS articles.
A figure explaining the stratified random sampling is available in Additional file 1.
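As an illustration, the proportional allocation of the 300 articles and the set-based definitions of precision and recall can be sketched in Python. The function names are hypothetical. The allocation example uses the breastfeeding counts reported in the Results (142 TEMAS articles and 186 further RS articles); the article identifiers in the precision/recall example are invented toy data.

```python
def proportional_allocation(n_temas, n_rs, sample_size=300):
    """Split the sample between the TEMAS stratum and the remaining RS articles."""
    from_temas = round(sample_size * n_temas / n_rs)
    from_rest = sample_size - from_temas
    return from_temas, from_rest


def precision(retrieved, relevant):
    """Fraction of retrieved articles that are relevant."""
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of relevant articles that were retrieved."""
    return len(retrieved & relevant) / len(relevant)


# Breastfeeding: 142 TEMAS articles out of 142 + 186 = 328 RS articles.
print(proportional_allocation(142, 142 + 186))  # (130, 170)

# Toy identifiers, for illustration only.
retrieved_ids = {1, 2, 3, 4}
relevant_ids = {2, 4, 5}
print(precision(retrieved_ids, relevant_ids))  # 0.5
print(recall(retrieved_ids, relevant_ids))
```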
Two authors (EB and PL) independently read the 300 randomly selected articles for each risk factor to assess their relevance. The criterion of relevance was the presence, in the abstract, of an odds ratio (OR), a relative risk (RR), or a hazards ratio (HR) for the "extracted term(s)".
The final step in the comparison was to verify that TEMAS did not introduce a bias in the representativeness of its selected articles relative to the RS articles. We performed a chi-squared test (with a simulated p-value for small numbers) to compare the distribution of ORs (< 1, = 1, or > 1) among the relevant articles retained by TEMAS and by RS.
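A chi-squared test with a simulated p-value can be sketched as a Monte Carlo procedure that keeps the table margins fixed, which is the idea behind the `simulate.p.value` option of R's `chisq.test`. The Python version below is only an illustration under that assumption: it permutes the OR categories across articles instead of sampling tables directly, and the counts in the example are invented.

```python
import random
from collections import Counter


def chi2_statistic(groups, categories):
    """Pearson chi-squared statistic for a groups-by-categories table."""
    n = len(groups)
    group_totals = Counter(groups)
    category_totals = Counter(categories)
    observed = Counter(zip(groups, categories))
    stat = 0.0
    for g, g_total in group_totals.items():
        for c, c_total in category_totals.items():
            expected = g_total * c_total / n
            stat += (observed[(g, c)] - expected) ** 2 / expected
    return stat


def simulated_p_value(groups, categories, n_sim=500, seed=0):
    """Monte Carlo p-value: share of permuted tables at least as extreme."""
    rng = random.Random(seed)
    observed_stat = chi2_statistic(groups, categories)
    shuffled = list(categories)
    at_least_as_extreme = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # keeps both margins fixed
        if chi2_statistic(groups, shuffled) >= observed_stat - 1e-12:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_sim + 1)


# Hypothetical counts: identical OR distributions in both groups.
groups = ["TEMAS"] * 30 + ["RS"] * 30
same = (["<1"] * 10 + ["=1"] * 5 + [">1"] * 15) * 2
print(simulated_p_value(groups, same))  # 1.0
```

When the two distributions are identical the statistic is zero, so every permuted table is at least as extreme and the simulated p-value equals 1.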

Average Precision and Discounted Cumulative Gain (DCG)
To evaluate our method, we also calculated the average precision (AP) and the Discounted Cumulative Gain (DCG) [23] to take ranking into account. We therefore retrieved the 100 most recent TEMAS articles, ranked by decreasing publication date, and the first 100 PubMed articles ranked by best match. The average precision at rank $n$, $AP@n$, is the average of the precision values obtained after each relevant document retrieved. We computed $AP@n$ for $n \in \{25, 50, 75, 100\}$.
The Mean Average Precision (MAP) is the arithmetic mean of the average precision values for an information retrieval system over a set of $q$ queries. Let $AP@n_k$ be the average precision at $n$ for query $k$; then $MAP@n = \frac{1}{q} \sum_{k=1}^{q} AP@n_k$.
Discounted Cumulative Gain (DCG) assumes that the greater the ranked position of a relevant document, the less valuable it is for the searcher, because the searcher is less likely to examine the document, or at least has to make more effort to find it. DCG formalizes these assumptions by crediting a retrieval system for each relevant document retrieved, with a gain that is discounted by a factor dependent on the logarithm of the document's ranked position. Let $r_i$ ($i \in \{1, \dots, n\}$) be the rating of article $i$: $r_i = 0$ if the article is not relevant and $r_i = 1$ if the article is relevant.
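These ranking metrics can be illustrated with a short Python sketch over a list of binary ratings. The exact normalization of AP@n and the exact DCG discount are not recoverable from the text, so the sketch assumes two common conventions: AP@n averages precision@k over the ranks k ≤ n that hold a relevant article, and DCG uses a log2(i + 1) discount. The example ratings are invented.

```python
import math


def precision_at(ratings, k):
    """Precision over the first k ranked articles (ratings are 0 or 1)."""
    return sum(ratings[:k]) / k


def average_precision_at(ratings, n):
    """Mean of precision@k over the ranks k <= n that hold a relevant article."""
    top = ratings[:n]
    hits = [precision_at(top, k) for k in range(1, len(top) + 1) if top[k - 1] == 1]
    return sum(hits) / len(hits) if hits else 0.0


def mean_average_precision_at(ratings_per_query, n):
    """Arithmetic mean of AP@n over all queries."""
    values = [average_precision_at(r, n) for r in ratings_per_query]
    return sum(values) / len(values)


def dcg_at(ratings, n):
    """DCG@n with a log2(i + 1) discount (an assumed, commonly used variant)."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(ratings[:n], start=1))


ranked = [1, 0, 1, 1, 0]  # hypothetical relevance of the first five articles
print(average_precision_at(ranked, 5))  # (1 + 2/3 + 3/4) / 3
print(dcg_at(ranked, 5))                # 1 + 1/2 + 1/log2(5)
```

With this discount a relevant article at rank 1 contributes a full point, while one at rank 4 contributes about 0.43, which reflects the assumption that lower-ranked articles are less likely to be read.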

Meta-analysis
We tested whether TEMAS would be suitable to find all the articles selected by the researchers in meta-analyses. On the one hand, we performed a reference search following the queries described in the meta-analysis method to retrieve the articles. On the other hand, we followed the TEMAS method from step 1 to step 3 using the same query strategy. If the relevance criterion in the meta-analysis was the presence of an OR, an RR, or an HR, we also performed TEMAS step 4; otherwise, we stopped TEMAS at step 3.

We then retrieved all the articles used in two meta-analyses and compared these articles to those retrieved by the reference search and TEMAS.

Material studied
We performed these effectiveness analyses for several searches:
-Precision and recall, average precision, and DCG were calculated for four non-genetic risk factors of breast cancer: breastfeeding, menarche, oral contraceptive, and breast density;
-Comparisons with two meta-analyses were performed: the first on physical activity, the second on age at menarche and at menopause, both as risk factors for breast cancer.

TEMAS step 1: Global Search
With this Global Search (GS) we obtained a set of 15,582 articles.

TEMAS step 2: First sort
We excluded articles without an abstract, and we defined the following terms as off-topic since they were downstream of the screening procedure:
-"mastectom" for mastectomy, mastectomies...

Among the articles extracted in Step 1, we excluded 954 articles without abstract and 4,582 off-topic articles. We thus obtained a database of 10,046 classifiable articles.
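The first sort can be sketched as a simple filter over the retrieved records. The Python version below is illustrative only. It assumes that each record is a dictionary with an `abstract` field and that off-topic terms are matched as case-insensitive truncated stems within the abstract; the two toy records with text are invented.

```python
def first_sort(records, off_topic_stems):
    """Split records into kept, no-abstract, and off-topic groups."""
    kept, no_abstract, off_topic = [], [], []
    for record in records:
        abstract = (record.get("abstract") or "").strip()
        if not abstract:
            no_abstract.append(record)
        elif any(stem.lower() in abstract.lower() for stem in off_topic_stems):
            off_topic.append(record)
        else:
            kept.append(record)
    return kept, no_abstract, off_topic


# Toy records, for illustration only.
records = [
    {"pmid": "A", "abstract": "Age at menarche and breast cancer risk."},
    {"pmid": "B", "abstract": ""},
    {"pmid": "C", "abstract": "Outcomes after bilateral mastectomy."},
]
kept, no_abstract, off_topic = first_sort(records, ["mastectom"])
print([r["pmid"] for r in kept])  # ['A']

# Consistency check on the counts reported above.
print(15582 - 954 - 4582)  # 10046
```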

TEMAS step 3: Clustering exploration
Clustering: As a result of the classification, 99.9% of the abstracts were classified into five disjoint classes, as shown in Figure 2.
Class 1 gathered genetic terms such as: polymorphism, gene, allele, genotype, DNA or genetic.
Class 2 included risk factors such as menopause, BMI, age, hormone, parity for the most relevant terms. It also comprised "statistical" terms such as: ratio, confidence interval (CI), hazards ratio (HR), relative risk (RR), logistic, Cox. In this class we therefore expected to find articles relative to risk factors corresponding to our criterion of relevance, which was the presence in the abstract of an odds-ratio (OR), a relative risk (RR), or a Hazards Ratio (HR) for the "risk factor studied".
Class 3 gathered all breast tumor detection methods with terms such as MRI, ultrasound, mammogram, image, or biopsy.
Class 4 was related to public health aspects of screening. Indeed, the selected terms were: screen, program, access, public, social or communication. Thus this class referred to a systemic approach and management of breast cancer screening rather than a risk factors search.
Class 5 focused on treatments, as suggested by the first terms of the class: therapy, treatment, or treat. This class was therefore of less interest because it referred to conditions that were downstream of breast cancer screening.
Therefore Class 2 appeared to be the most appropriate class of interest because it contained information related to risk factors.

We illustrate single term extraction, multiple terms extraction, and the combination of both as follows.

Single term extraction:
-Menarche: the extraction term was "menarch" and extraction led to 246 abstracts.

Multiple term extraction is illustrated by the two following examples:
-Oral contraceptive: the main term retained was "contracept" and the complementary term was "oral" with a maximum distance of 25 characters. This extraction led to 144 articles.
-Breast/mammographic density:
i. Breast density: the main term was "dens" and the complementary term was "breast" with a maximum distance of 10 characters. This extraction led to 204 articles;
ii. Mammographic density: the main term was "dens" and the complementary term was "mammo" with a maximum distance of 25 characters. This extraction led to 316 articles;
iii. After removing duplicates, this extraction led to 368 abstracts.

Both single and multiple terms extraction:
-Breastfeed/Breast-feed:
i. Breastfeed single term extraction: the extraction term was "breastf" and extraction led to 162 abstracts;
ii. Breast-feed multiple term extraction: the main term was "feed" and the complementary term was "breast" with a maximum distance of 10 characters. This extraction led to 176 abstracts;
iii. After removing duplicates, this extraction led to 189 abstracts.

TEMAS step 4: Final sort
We kept the default option, which retains only the articles whose abstracts contain at least one of the following terms: Odds Ratio, Odds, OR, Relative Risk, RR, Hazards Ratio, or HR.
All the complete search queries are available in Additional file 2.
Thus the final TEMAS and RS databases for the 4 risk factors are shown in Table 1.

TEMAS effectiveness analysis

Precision and recall rates
The calculation method and results for precision and recall are displayed in Figures 3 to 6.
For "breastfeeding" we performed a stratified random sampling of 300 articles as follows: 130 out of the 142 articles in the TEMAS set, and 170 out of the 186 remaining ($n_{RS} - n_{TEMAS}$) articles of the RS set.
This gave a representative sample of 300 articles. The RS returned 23% of relevant articles.
On the other hand, the precision of TEMAS was 51%. Thus, TEMAS effectively concentrated the relevant articles, dividing the number of articles to be read by 2.3 and increasing the precision rate by 28 percentage points (p < 0.0001). Moreover, the recall rate of TEMAS was 97%. The precision and recall rates for the other risk factors studied are displayed in Table 2.
For "menarche", TEMAS divided the number of articles to be read by 2.8 and increased the precision rate by 23 percentage points (p < 0.0001).
For "oral contraceptive", TEMAS divided the number of articles to be read by 2.7 and increased the precision rate by 32 percentage points (p < 0.0001).
For "breast density", TEMAS concentrated the relevant articles, dividing the number of articles to be read by 4.8 and increasing the precision rate by 26 percentage points (p < 0.0001).
For the four risk factors studied, the slight loss in recall did not affect the representativeness of TEMAS: the distribution of ORs, shown in Table 3, was not significantly different from that of RS (p = 1).

Average Precision and Discounted Cumulative Gain (DCG)
The results for AP and DCG are displayed in Tables 4 and 5. For all the risk factors studied, the average precision and the DCG were higher with TEMAS than with a reference search on PubMed. $MAP@25$ was 60.16% for TEMAS and 24.22% for PubMed, and $MAP@100$ was still better with TEMAS (48.28%) than with PubMed (21.5%).

Meta-analysis
We focused on two meta-analyses to test the ability of TEMAS to retrieve all the articles selected by the researchers to perform their meta-analysis.
The first one focused on physical activity and the risk of breast cancer [25] and was based on prospective studies. The authors selected 31 articles to perform their meta-analysis; we will call these articles "relevant articles". We performed a PubMed reference search with the search terms indicated in the meta-analysis, which led to 552 articles. These included 30 of the 31 relevant articles. The missing article was found by the authors either in another database or through another search not detailed in the article methodology. TEMAS (whose method applied to this performance test for meta-analysis is detailed in Additional file 2) led to a total of 145 articles, among which we also found the 30 relevant articles. The recall rate for TEMAS with this meta-analysis was therefore 100%, while dividing the number of articles to be read by 3.8.
The second meta-analysis [26] focused on age at menarche and at menopause as risk factors for breast cancer. The articles included in the meta-analysis were divided into 2 groups, cohort studies and case-control studies. There were 31 relevant articles available on PubMed.
Among these 31 articles, 26 were found by the reference search; the other articles came from other sources, as indicated in the methodology. The recall rate for TEMAS was 100% (we stopped TEMAS at step 3 because the meta-analysis methodology set no condition on the presence of an OR or RR), and TEMAS divided the number of articles to read by 2.56 (5,431 articles to read with TEMAS vs 13,885 articles to read with a PubMed reference search).
For case-control studies, the TEMAS recall rate was also 100%: all 52 relevant articles were found, while the number of articles to read was divided by 2.92.
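The reduction factors quoted in this section are simply the ratio between the number of articles returned by the reference search and the number returned by TEMAS. The short Python check below reproduces the factors for which both counts are given in the text; the function name is hypothetical.

```python
def reduction_factor(n_reference, n_temas):
    """How many times fewer articles TEMAS leaves to read than the reference search."""
    return n_reference / n_temas


# Counts taken from the text.
print(round(reduction_factor(142 + 186, 142), 1))  # 2.3  (breastfeeding)
print(round(reduction_factor(552, 145), 1))        # 3.8  (physical activity)
print(round(reduction_factor(13885, 5431), 2))     # 2.56 (menarche/menopause, cohorts)
```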
For each of the three sets of articles tested across the two meta-analyses, the TEMAS recall rate was 100%. It therefore appears that TEMAS can help reduce the time required to search the literature on PubMed in order to perform a meta-analysis, by dividing the number of articles to be read by a factor of 2.5 to 3.8 while maintaining a recall rate of 100%.
The workload dedicated to reading all the articles selected by a classic search becomes so large that new approaches are needed. Several approaches are based on text mining using machine learning methods [7][8][9]. However, they need to be trained on a carefully constructed training data set representative of the topic of interest. Of note, these algorithms must be trained for each targeted topic. Moreover, implementing them requires specific skills. For instance, text mining tools have been implemented to extract information about microbial biodiversity in food [10]. However, this search cannot be extended to another subject without recalibration. Conversely, TEMAS does not require a training set. It uses a hierarchical classification procedure based on co-occurrences of terms appearing in the abstracts. This framework made it possible to develop different types of searches without a training step.
Furthermore, this method can be used by non-specialists thanks to the R Shiny web app, which provides a user-friendly interface available at https://shiny.temas-bonnet.site.
Our method has some limitations. At the classification stage, the choice of the optimal number of classes might seem complex. Indeed, there are no prescriptive thresholding rules. In our classification procedures, we kept 5 classes for breast cancer since this yielded sufficiently disjoint classes and an informative clustering for our searches. The optimal number of classes differs from one search to another.
For multiple terms searches, we selected a distance of 10 to 25 characters between the main and complementary terms, depending on the risk factor studied and on the expected proximity between the retained terms. These thresholds were tested and appeared satisfactory: they did not omit relevant articles, while limiting irrelevant information.
Reducing the number of articles to read inevitably impacted the recall rate. However, the recall rate of TEMAS compared to RS remained satisfactory, ranging from 95% to 97% for the risk factors of breast cancer. This meant only one to three missing articles per search. In addition, we classified the relevant articles according to the value of the OR (< 1, = 1, or > 1) for the risk factor studied. We did not observe any significant difference in the distribution of ORs between RS and TEMAS relevant articles. Even though we did not reach a recall rate of 100%, the loss in recall did not bias the overall information, given the above-mentioned distribution of the OR values.
We also checked whether TEMAS would achieve 100% recall on articles selected in meta-analyses. For the two meta-analyses we tested, the recall rate was 100%.