Automated Screener Based on Convolutional Neural Network for Randomized Controlled Trials in Chinese Language: A Comparative Study of Different Classification Strategies

Objective: To explore the influence of modified literature classification strategies for Chinese biomedical literature on an automated screener based on a convolutional neural network (CNN). Methods: Citations of studies indexed as 'Oral Science' published in Chinese between 2014 and 2018 were retrieved from the China National Knowledge Infrastructure. Apart from dividing the studies into 2 categories (RCTs and non-RCTs), 3-category (RCTs, may-be-RCTs, and non-RCTs) and 5-category (RCTs, randomization-unclear controlled trials, non-randomized clinical trials/studies, non-clinical literature, and unclear) classifications were also employed. The multi-category strategies took into consideration the diversity of study types and the presence of vague expression. Consistent with real-world practice, full-text-needed studies included those that certainly concerned RCTs and those that might be RCTs but lacked information in their abstracts. Screening and classification were performed independently by 2 experienced researchers. The classification results after peer discussion and/or senior decision were used for training the CNN model. The probability thresholds for the classification of each category were set at a high-sensitivity level. The area under the receiver-operating characteristic curve (AUC) was calculated when applicable. An isolated sample of citations was used in a prospective comparative trial that compared the sensitivity (SEN) and specificity (SPE) of screening for RCTs, may-be-RCTs, and full-text-needed studies using algorithms with different strategies and manual screening. Results: In total, 12,166 citations were used for CNN model training. All 3 training strategies performed well in RCT screening, with AUCs higher than 0.99. The training showed that, when screening for RCTs, the 5- and 3-category strategies can yield better performance than the 2-category strategy. When screening for may-be-RCTs and full-text-needed studies, the 5-category model achieved better SENs, while the 3-category model achieved higher SPEs. The comparative trial with 1,422 samples presented similar results. Conclusion: The CNN algorithm shows promising results in the automatic screening of Chinese literature.


Introduction
Evidence-based medicine (EBM), which is based on the principle that every clinical decision should be made according to the best clinical evidence, has significant value for clinical practice and research.
Systematic reviews (SRs) are the foundation of EBM and the resulting clinical guidelines. In general, an SR involves protocol design, literature searching, citation screening, full-text retrieval, data extraction, critical appraisal, statistical analysis, and writing. Among these steps, literature searching and citation screening determine how systematic a review can be.
To minimize any potential bias from the design of a study, researchers often need, whenever possible, to screen for randomized controlled trials (RCTs) for SRs that focus on medical interventions. Because SRs must comprehensively cover all available evidence, researchers often adopt search strategies with high sensitivity (> 0.98) and relatively low specificity (< 0.75), 1 which often yield thousands of references to be screened at the title and abstract level. Moreover, owing to the ever-growing corpus of published literature, abstract-level screening is an increasingly time-consuming task. In 2015 alone, an average of 100 manuscripts describing RCTs of medical interventions were published daily. 2 This task is thus one of several bottlenecks that make performing SRs costly in both time and money. SRs concerning clinical trials are a critical resource supporting policy and clinical decision-making; however, it is challenging to keep them up to date. The production and updating of SRs is resource-intensive, and the constantly increasing volume of new evidence can outpace our ability to keep up. 3 Researchers have found that completing an SR requires screening an average of 1,781 studies, with a screen-out rate of 97.1%. 4 Another study revealed that a Cochrane review takes an average of 2 years to publish, which falls far short of the requirement to stay up to date. 5 It is well known that SRs require highly skilled reviewers to complete a series of very specialized manual and repetitive tasks. Notably, one study indicated that, when not conducted by volunteers, some SRs may cost approximately $141,194.80 to complete. 6 A fundamental problem is that current screening methods, though rigorous, simply are not designed to meet the demands imposed by the voluminous scale of the current evidence base.
To tackle this crucial problem, several efforts have been made, which can be categorized into 3 main subfields: modified search filters, crowdsourcing, and machine learning (ML). Search filters are based on combinations of text strings and database tags, which are developed by information specialists. The current best-performing filters achieve near-perfect sensitivity, but they provide comparatively low specificity.
Crowdsourcing, such as the Cochrane Crowd, the most well-known crowdsourcing project, has been used in RCT screening and in table and figure selection, among other tasks. 7,8 However, some serious disadvantages of crowdsourcing have prevented it from becoming popular, including unsatisfactory inter-person consistency and the extensive effort required to run a crowdsourcing task. 9 ML refers to the application of artificial intelligence (AI) that enables computer systems to learn and improve from experience, typically from large amounts of training data, without being explicitly programmed. The application of ML to text analysis is known as natural language processing (NLP), referring to the analysis of human language. Several studies have focused on methods for semi-automating SRs via NLP. In recent years, a few researchers have used various ML-aided software packages or platforms in real-world SRs to replace their conventional counterparts, and the results verified their feasibility. 10 In a series of reviews published in the Journal of Clinical Epidemiology in 2018, the term 'living systematic review' was introduced, 11-14 which calls for a fully automated SR workflow that can only be achieved with ML. In the past few years, neural network models have come to dominate NLP in general and text classification in particular. 15 Notably, convolutional neural networks (CNNs), originally used for image classification, have emerged as state-of-the-art models for text categorization.
To date, studies on assisting citation screening of scientific articles in English are thriving, with an accompanying body of work. 16 However, none of the existing studies have included Chinese literature. The EBM principle requires SRs to be as inclusive as possible. A potential source of bias in SRs is the publication language of the included studies. 17,18 It has been pointed out that trials with significant results are more likely to be published in English-language journals. 19 Cochrane Collaboration standards recommend that "whenever possible review authors should attempt to identify and assess for eligibility all possibly relevant reports of trials irrespective of language of publication". 1 Including studies published in non-English languages not only prevents language bias but also increases the power of meta-analysis estimates. It has been suggested 20 that all SRs in epidemiology and public health should include literature published in the major languages of the world, which means that the use of regional and non-English bibliographic databases should become routine. In China, there are already 263 research institutions and medical laboratories with an estimated 926,000 researchers, making Chinese medical researchers second in number only to those of the United States in 2006. 21 Moreover, China's share of published scientific papers worldwide increased from less than 1% in 1980 to about 12% in 2011, ranking second behind the US in 2013. 22 One study has pointed out that using Chinese-language databases can significantly increase the number of potentially relevant references per search. 23 Nevertheless, only a few researchers in the EBM field take this issue seriously. For example, a 2015 study found that among 8,680 published reviews indexed in the Cochrane Library, only 243 (3%) had searched at least one of the major Chinese databases. 24 If there were more convenient ways to overcome the language barrier and identify RCTs in Chinese, it would help increase the inclusion rate of Chinese studies in SRs.
This study targeted Chinese-language reports of RCTs. The ambiguity of abstracts and the variety of clinical study designs were taken into consideration when the CNN model was trained, in order to improve the screening performance for Chinese RCTs. Although ML has already been used in various studies to assist English literature screening, it cannot be applied directly to Chinese literature without specific training and calibration. This is because many Asian languages, such as Chinese and Japanese, are written without explicit word delimiters. These characteristics of the Chinese language require complex preprocessing before ML can be applied, which poses additional difficulties for the automation of Chinese literature screening. Furthermore, although there are guidelines that instruct proper reporting, abstracts can still be ambiguous or lack important methodological description. Consequently, dichotomous classification of abstracts into RCTs and non-RCTs may miss an important body of evidence, i.e., may-be-RCTs. May-be-RCTs refer to those studies that cannot be determined to be either RCTs or non-RCTs. In the actual title and abstract screening process, full texts of both RCTs and may-be-RCTs need to be acquired. Dichotomous classification cannot solve this problem; thus, the sensitivity of screening results from dichotomous classification can be flawed. To further exploit the potential help from clinical epidemiology, this study aims to explore the influence of modified literature classification strategies for Chinese biomedical literature on an automated screener based on a CNN. In this study, the literature was classified into 5 different categories based on study design and publication type. Finally, the performance of the final AI models was validated through a comparative study.

Classification criteria
The classification criteria took into consideration the ambiguity of title and abstract expression and the variety of study types. In the finest version of classification, i.e., classifying the literature by its nature regardless of the screeners' intention to include or exclude it in their reviews, the literature was divided into 5 categories: RCTs, randomization-unclear controlled trials (RUCTs), non-randomized clinical trials/studies (NRCTs), non-clinical literature (NC), and Unclear. The RUCTs comprised comparative studies in which the group-assignment method was not stated clearly enough. The NRCTs included clinical studies other than RCTs and RUCTs, e.g., case-control studies. The NC category included medical reviews, laboratory experiments, and other literature not related to medicine at all. In the Unclear category, the titles and abstracts contained so little information that the literature could not be assigned to any of the aforementioned categories.
Following the finest classification, the literature was further merged for 3-category and 2-category training. The 3-category scheme comprised RCTs, may-be-RCTs, and non-RCTs: the may-be-RCTs category included both RUCTs and Unclear, while the non-RCTs category contained the literature classified as NRCTs and NC. Finally, traditional dichotomous classification was obtained by further merging the may-be-RCTs and non-RCTs categories, as illustrated below.
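The merging of labels can be expressed as simple mappings. A minimal sketch in Python follows; the category abbreviations are those defined above, and the dictionary names are illustrative only:

```python
# Illustrative label-merging maps from the 5-category scheme down to the
# 3- and 2-category schemes; dictionary names are hypothetical.
TO_3CAT = {
    "RCT": "RCT",
    "RUCT": "may-be-RCT", "Unclear": "may-be-RCT",
    "NRCT": "non-RCT",    "NC": "non-RCT",
}
# The 2-category labels collapse everything that is not certainly an RCT.
TO_2CAT = {k: ("RCT" if v == "RCT" else "non-RCT") for k, v in TO_3CAT.items()}
```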

Literature collection and labelling
Chinese literature published between Jan 1st, 2014 and Dec 31st, 2018 and indexed as 'Oral Science' in the China National Knowledge Infrastructure (CNKI) was searched and exported. Citations were excluded where key information, such as the abstract, author, or publication information, was missing.
Literature classification was conducted separately by 2 experienced researchers (S.C. and Y.X.) using an online platform that our group developed especially for this study. The platform assists researchers with screening and label-checking tasks for Chinese literature citations, and provides citation management, screening assistance, and automatic label checking. The citations were screened by the 2 researchers independently, according to the 5-category criteria. After the labeling results of the 2 researchers were checked, citations with the same labels were used as the gold standard. Results with different labels were resolved by consensus after peer discussion and/or by turning to an experienced senior researcher (C.L.) for a final decision. The 3- and 2-category labels were derived from the 5-category labels according to the aforementioned classification criteria.
AI screener training

CNN model description

In this paper, a customized neural network architecture is proposed that uses a CNN to process text inputs (Fig. 1). The input of the model is the citation of a Chinese study, including its title and abstract, while the output is the confidence coefficient of whether the study should be labeled as belonging to each of the different categories. The architecture represents words as vectors, and the input text is a concatenation of word vectors. During preprocessing, JIEBA, 25 a Chinese text segmentation tool, was employed as the tokenizer for text segmentation and stop-word removal. Each citation was then processed into a sequence of words in the order in which they appeared in the original text. Word vectors were randomly initialized and further adjusted during model training.
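As an illustration of this preprocessing step, a minimal sketch using JIEBA follows; the stop-word set shown is a hypothetical placeholder, not the list used in the study:

```python
# Minimal preprocessing sketch with JIEBA; the stop-word set is assumed
# for illustration only.
import jieba

STOP_WORDS = {"的", "了", "与", "和", "在"}  # hypothetical stop words

def tokenize(citation_text: str) -> list[str]:
    """Segment a Chinese citation (title + abstract) and drop stop words."""
    return [w for w in jieba.cut(citation_text)
            if w.strip() and w not in STOP_WORDS]

tokens = tokenize("目的:探讨不同分类策略对随机对照试验自动筛选的影响。")
```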
Each word in the sequence then corresponded to a d-dimensional word vector. A convolution operation involved a filter applied to a window of w words to produce a new feature. Multiple filters were applied to each possible window of words in the sequence, producing many feature maps. Subsequently, a max-pooling operation was applied over each feature map, and the maximum value was taken as the feature corresponding to that particular filter. These features formed the penultimate layer and were passed to a fully connected layer. The final output was the probability of the citation belonging to a specific category, e.g., RCT, ranging between 0 and 1: the closer the probability to 1, the more likely the citation belonged to the category, and vice versa. Cross entropy was the loss function of the model, and the Adam optimizer was used to update the network parameters by backpropagation.
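The design described above corresponds to the widely used CNN-for-text architecture. A minimal PyTorch sketch follows; PyTorch itself and the hyperparameter values (embedding size, filter widths, filter counts) are illustrative assumptions, as the paper does not state them:

```python
# Minimal sketch of the CNN text classifier described above, in PyTorch.
# Hyperparameter values are illustrative assumptions, not the study's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, num_classes, embed_dim=128,
                 filter_sizes=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        # Word vectors are randomly initialized and adjusted during training.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One convolution per window width w over the word-vector sequence.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in filter_sizes
        )
        self.dropout = nn.Dropout(dropout)
        # Fully connected layer maps the pooled features to class scores.
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Max-over-time pooling keeps the strongest response of each filter.
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        x = self.dropout(torch.cat(feats, dim=1))
        return self.fc(x)  # logits; softmax yields per-category probabilities

# Cross entropy as the loss function, per the paper.
loss_fn = nn.CrossEntropyLoss()
```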

Parameter adjustment
The model parameters were tuned on the dev set, with the learning rate initially set to 0.0005. The batch size was set to an integer power of 2, and the number of filters and the size of each filter's convolution window were also adjusted. The pooled results were treated with dropout, with the keep probability taking values in the range (0, 1). In addition, to prevent overfitting of the model, an L2 penalty was applied, whose parameters could be adjusted according to the accuracy on the dev set.
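Continuing the TextCNN sketch above, the training configuration might look as follows; apart from the initial learning rate of 0.0005 stated in the paper, the concrete values are assumptions to be tuned on the dev set:

```python
# Hedged training-configuration sketch; values other than the initial
# learning rate (0.0005, per the paper) are assumptions.
import torch

model = TextCNN(vocab_size=50_000, num_classes=5, dropout=0.5)  # keep prob in (0, 1)
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-4,            # initial learning rate, per the paper
    weight_decay=1e-4,  # L2 penalty against overfitting (value assumed)
)
batch_size = 64         # an integer power of 2, as described
```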

Model storage and usage
After each training process with the respective classification strategy, a stable model was obtained, and the model parameters and word vectors were saved. When a model was used, the parameters were loaded and the fully connected layer's 'dropout_keep_prob' was set to 1. The input data were then processed, and the results were the predicted probabilities of the citation belonging to each literature category, which were saved with the relevant citation data.
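In PyTorch terms (the 'dropout_keep_prob = 1' phrasing is a TensorFlow-style convention), storage and inference reuse might look like this sketch; the file name and dummy input batch are assumptions:

```python
# Sketch of model storage and reuse; the file name and dummy batch are
# illustrative. model.eval() disables dropout, which corresponds to
# setting 'dropout_keep_prob' to 1 at inference time.
torch.save(model.state_dict(), "cnn_screener_5cat.pt")

reloaded = TextCNN(vocab_size=50_000, num_classes=5)
reloaded.load_state_dict(torch.load("cnn_screener_5cat.pt"))
reloaded.eval()  # dropout off at inference

dummy_batch = torch.randint(0, 50_000, (4, 200))  # 4 citations, 200 tokens each
with torch.no_grad():
    probs = torch.softmax(reloaded(dummy_batch), dim=1)  # per-category probabilities
```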

Sample and model preparation
In this comparative study, a sample of 1,422 citations, which had been kept isolated from the screening researchers and the CNN models, was adopted to reveal the performance of the models with the different classification strategies. During model preparation, the cutting thresholds for the probability coefficients given by the CNN models were determined. In our previously published study, 26 the sensitivity (SEN) and specificity (SPE) of the CNN model could be adjusted by setting different thresholds. The primary goal was to keep the screening as sensitive as possible: ideally the sensitivity should be 1, but if such a SEN could not be reached, 0.95 was the bottom line. The secondary goal was to keep the SPE as high as possible; an SPE higher than 0.8 was therefore desired but not mandatory. Accordingly, the High-sensitivity Threshold was determined as follows: when the SPE can be maintained above 0.8, we first seek SEN = 1. If no such threshold exists, the standard is lowered to 1 > SEN > 0.95 with the SPE still above 0.8. If no threshold meets this subordinate standard either, the High-sensitivity Threshold is set at SEN = 0.95 irrespective of how low the SPE is. Receiver-operating characteristic (ROC) curves and the area under the curve (AUC), along with the SEN and SPE at the chosen thresholds, were employed to compare the models obtained with the different classification strategies.
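This three-step rule can be made concrete as a small search over ROC operating points. The sketch below uses scikit-learn and is an illustration of the rule as stated, not the authors' code:

```python
# Illustrative implementation of the High-sensitivity Threshold rule;
# the function name and use of scikit-learn are assumptions.
import numpy as np
from sklearn.metrics import roc_curve

def high_sensitivity_threshold(y_true, y_prob):
    """y_true: 0/1 gold labels; y_prob: model probabilities for one category."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    spe = 1 - fpr  # specificity at each candidate threshold
    # Step 1: SEN = 1 while SPE > 0.8 (pick the most specific such point).
    mask = (tpr == 1.0) & (spe > 0.8)
    if mask.any():
        return thresholds[mask].max()
    # Step 2: 1 > SEN > 0.95 while SPE > 0.8.
    mask = (tpr > 0.95) & (tpr < 1.0) & (spe > 0.8)
    if mask.any():
        return thresholds[mask][np.argmax(spe[mask])]
    # Step 3: fall back to SEN >= 0.95 regardless of SPE.
    mask = tpr >= 0.95
    return thresholds[mask].max()
```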
It should be noted that, while the 2-category model had only one High-sensitivity Threshold (for the RCTs category), the 3- and 5-category models had 3 and 5 High-sensitivity Thresholds, respectively, one for each category. For example, when trained with the 5-category strategy, the CNN model provides 5 probabilities for a single citation, so 5 thresholds were needed to determine to which category the citation belonged. This could result in a citation being labeled with multiple category tags if the probabilities for several categories were all above their respective thresholds. Therefore, a combination of the different thresholds was needed; the workflow of this combination is shown in Fig. 2. Since the sensitivity of citation screening was the main aim, no citations were discarded until positive selection was completed. The first screening was for RCTs, and the positive results were kept. Then, the RUCTs and Unclear categories were screened from the remaining results and set aside. Finally, screening for NRCTs and NC was performed on the remainder, where the negative results (Ambiguous) were kept and the positive ones were discarded.
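The sequential workflow can be sketched as follows; the function and dictionary keys are hypothetical names chosen for illustration:

```python
# Sketch of the sequential threshold-combination workflow (Fig. 2) for the
# 5-category model; names are hypothetical.
def combine_thresholds(probs, thr):
    """probs/thr: per-category probabilities and High-sensitivity Thresholds."""
    if probs["RCT"] >= thr["RCT"]:
        return "RCT"            # positive: kept for full-text retrieval
    if probs["RUCT"] >= thr["RUCT"] or probs["Unclear"] >= thr["Unclear"]:
        return "may-be-RCT"     # positive: kept for full-text retrieval
    if probs["NRCT"] >= thr["NRCT"] or probs["NC"] >= thr["NC"]:
        return "non-RCT"        # positive here means discard
    return "Ambiguous"          # negatives are kept, per the workflow
```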

Screening performance validation
The 1,422 samples underwent both manual and CNN model screening. The same independent screening process was performed by the 2 experienced researchers, and the final classification labels after discussion were used as the gold standard. The independent results of both researchers prior to discussion were taken as the performance of manual screening. The fully prepared CNN models performed the screening according to the aforementioned thresholds and workflow. Finally, SEN and SPE were reported with 95% confidence intervals (95% CI).
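For reference, sensitivity and specificity with 95% CIs can be computed from confusion-matrix counts as sketched below; the Wilson score interval is an assumption, as the paper does not state which CI method it used:

```python
# Sketch: SEN/SPE with 95% CIs from confusion-matrix counts. The Wilson
# score interval is an assumption about the method.
from statsmodels.stats.proportion import proportion_confint

def sen_spe_with_ci(tp, fn, tn, fp):
    sen = tp / (tp + fn)  # sensitivity: true positives among gold positives
    spe = tn / (tn + fp)  # specificity: true negatives among gold negatives
    sen_ci = proportion_confint(tp, tp + fn, alpha=0.05, method="wilson")
    spe_ci = proportion_confint(tn, tn + fp, alpha=0.05, method="wilson")
    return sen, sen_ci, spe, spe_ci
```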

CNN model training
In total, 12,166 citations (2,382 RCTs, 886 RUCTs, 402 Unclear, 3,839 NRCTs, and 4,657 NC) were collected for CNN model training. Among them, 10,266 citations were used as the training set, while 1,900 randomly selected citations were used for testing. Screening for RCTs, may-be-RCTs, and full-text-needed citations were the 3 principal scenarios. After testing the 3 differently trained CNN models, the SEN, SPE, and AUCs for these 3 scenarios at the High-sensitivity Thresholds were documented (Table 1). The 2-category model was not able to screen for may-be-RCTs and full-text-needed citations. Because some screening tasks were performed by threshold combinations, ROC curves and AUCs were not applicable for those scenarios.

The results of the validation trial are presented in Table 2. They revealed that the sensitivities of the CNN models were close to, or even better than, those of manual screening, especially when screening for may-be-RCTs. Nevertheless, in all 3 scenarios, manual screening had much higher specificities than all CNN models, which can be partially attributed to the fact that the high-sensitivity workflow sacrifices model specificity. The performance of the CNN models in the validation trial was close to or somewhat better than that during training. Comparing the CNN models, in RCT screening the 5-category model had a remarkable SEN of 1 together with the highest SPE. In the other 2 scenarios, although the 95% CIs overlapped, the 5-category model tended to be more sensitive, while the 3-category model tended to be more specific. In concrete terms, the 5-category model rescued 5/202 may-be-RCTs at the cost of 26 more false-positive citations than the 3-category model; for the full-text-needed citations, 8/583 citations were rescued at the cost of 32 false-positive citations. Consequently, the choice between these 2 models depends on the balance between the costs and benefits of SEN and SPE required by the researchers for their screening task.

Discussion
In general, SRs are time- and resource-intensive, requiring an average of 5 researchers and approximately 41 weeks before submission to a journal. An average-sized SR search cites 1,781 references (range 27 ~ 92,020), requires abstract-level screening of 1,286 references (range 14 ~ 77,910), full-text screening of 63 studies (range 0 ~ 4,385), and final inclusion of 15 studies (range 0 ~ 291). 4 Proper methods are required to solve this long-standing problem. This study focused on the automation of Chinese medical literature screening and employed different training strategies to improve the screening results.
This study was designed to address the absence of an ML-based Chinese medical literature screener able to classify abstracts by study type. Additional efforts were made to improve the screening performance by adopting different training strategies. In the scenario of screening for citations that certainly concern RCTs, the 5-category strategy yielded the best performance: the SEN and SPE of the 5-category model were higher than those of the other 2 models, with the SEN reaching the perfect value of 1. Although the perfect SEN might partly reflect sampling error, the model's excellent performance and its superiority over the 3- and 2-category models were evident. For the scenario of screening for may-be-RCTs, the 2-category model was inherently not applicable. The remaining 2 models exhibited relative advantages over each other; i.e., the 3-category model had a higher SPE, while the 5-category model had a higher SEN, although no statistically significant difference was found. The screening performance for full-text-needed studies (i.e., certain RCTs and may-be-RCTs) was similar.
Although no statistically significant difference was found, the 5-category model again achieved a perfect SEN of 1, while the 3-category model had a higher SPE than its counterpart (0.6972 vs. 0.6591). The reason the 5-category model had a perfect SEN in screening full-text-needed studies while missing some may-be-RCTs was that all the missed ones were labeled as RCTs. In real-world practice, the choice between the 5- and 3-category models depends on whether the user demands a higher SEN or SPE.

Crowdsourcing has also been applied to citation screening. 34 In this field, Cochrane Crowd is one of the most well-known projects. 35 This project has been remarkably successful, with over 1,600,000 articles already labeled as RCTs or clinical controlled trials. An evaluation of its output against double assessment by experienced researchers revealed that the sensitivity and specificity exceeded 99%. 36 Crowdsourcing may facilitate the updating of previously published reviews, or contribute to real-time, up-to-date online SRs, i.e., "living SRs". 37 However, the quality of the crowd's work and the large effort required to set up a reliable crowdsourcing task remain the top 2 barriers hindering its wide application. 38 In recent years, it has been inspiring to see studies using ML-aided methods, improved search filters, and crowdsourcing projects to replace conventional manual screening in SR research. 10,39-41 However, the ultimate goal should be a comprehensive database containing well-structured and indexed information concerning study type, research details, outcome data, etc. Various methods have been employed to identify such sets of information from published studies. If authors, publishers, and databases could scientifically tag and index studies prior to publication, it would constitute a further step toward this ultimate goal.
This study has several limitations that should be mentioned. First, the training set included 12,166 citations; however, ML relies heavily on the amount of training data to generate better results. More effort will be needed to obtain larger-scale training sets in order to further improve the performance of CNNs. Second, an imbalance in sample distribution was observed; i.e., the Unclear category comprised only a small proportion, owing to the nature of this category. Three sets of approaches could be applied to deal with this imbalance: data pre-processing approaches, algorithmic approaches with cost-sensitive classification, and ensemble methods. 42 Third, a study published in 2020 revealed that, compared to CNNs, other novel ML techniques, such as the multilayer perceptron, bidirectional long short-term memory networks (biLSTM), and CNN-biLSTM, can be more efficient in text classification tasks. 43 However, the current study mainly focused on the influence of modified literature classification strategies for Chinese biomedical literature on screening performance, so only a conventional CNN was involved. In future studies, carefully combining advanced models with multi-category training strategies might further enhance the screening performance.

Conclusion
Automated screening of Chinese literature was successfully achieved in this study. The proposed CNN models trained with multi-category strategies outperformed the model trained with dichotomous data in the automatic screening of Chinese medical literature. They are not only more sensitive and specific when screening for RCTs, but also capable of picking out literature that may be RCTs. The choice of screening model in real-world practice should depend on the user's specific need for higher sensitivity or specificity. Further study is needed to establish the generalizability of these conclusions.

Declarations
Ethics approval and consent to participate: Not applicable.

Figure 1. The proposed CNN architecture.