Integrated Model for COVID-19 Diagnosis Based on Computed Tomography AI and Clinical Features: A Multicenter Cohort Study


Background

We developed and validated a machine learning diagnostic model for novel coronavirus (COVID-19) disease, integrating artificial-intelligence-based computed tomography (CT) imaging and clinical features.
Methods

We conducted a retrospective cohort study in 11 Japanese tertiary care facilities that treated COVID-19 patients. Participants were tested with both real-time reverse transcription polymerase chain reaction (RT-PCR) and chest CT between January 1 and May 30, 2020. We chronologically split each hospital's dataset into training and test sets in a 7:3 ratio. The Light Gradient Boosting Machine model was used for analysis.
Results

A total of 703 patients were included, and two models, the full model and the A-blood model, were developed for their diagnosis. The A-blood model included eight variables (the Ali-M3 confidence, along with seven clinical features comprising blood counts and biochemistry markers). The areas under the receiver operating characteristic curves of both models (0.91, 95% confidence interval [CI], 0.86 to 0.95 for the full model and 0.90, 95% CI, 0.86 to 0.94 for the A-blood model) were better than that of the Ali-M3 confidence (0.78, 95% CI, 0.71 to 0.83) in the test set.
Conclusions

The A-blood model, a COVID-19 diagnostic model developed in this study, combines machine learning and CT evaluation with blood test data and outperforms the existing Ali-M3 framework. It would significantly aid physicians in making a quicker diagnosis of COVID-19.


Introduction
Since the discovery of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and the resulting novel coronavirus disease (COVID-19), in December 2019, humanity has been plunged into a global pandemic (1). Although the devastating effects of the virus have been mitigated by vaccination (2,3), breakthrough infections caused by new variants of the virus prevent the pandemic from coming to an end (4).
The gold standard for COVID-19 diagnosis is the real-time reverse transcription polymerase chain reaction (RT-PCR) test. However, RT-PCR has several drawbacks: in several cases, it is known to be insufficiently sensitive to the virus even in symptomatic patients, leading to false negatives (5,6). In addition, in facilities that must transport specimens to external laboratories, the turnaround time is long (7). These aspects reveal the need for a more accurate and timely diagnosis; for this, several diagnostic models have been developed using clinical characteristics, laboratory data, and radiographic images. However, most such models have not been validated with datasets external to the development phase (8). Moreover, many have methodological flaws and/or underlying biases, making it difficult to determine model validity. In consequence, there are no diagnostic models using chest computed tomography (CT) with potential clinical use (9).
Furthermore, no diagnostic system automatically interprets both CT images and clinical features. To overcome these limitations of diagnostic models using chest CT, we previously externally validated a deep-learning-based CT diagnostic system for COVID-19 (Ali-M3) (10). To further improve its accuracy, it is important to correctly diagnose COVID-19 patients whose pneumonia is not detectable by CT. For this purpose, we integrated the Ali-M3 model with the clinical characteristics of patients suspected of having COVID-19 using machine learning, and validated this new system.

Materials And Methods
We used the datasets from the external validation of Ali-M3; the details of these datasets have been published elsewhere (10). We followed the guidelines of the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement in reporting this study (Supplementary Table 1) (11). The institutional review board of each facility approved our study and waived the need to obtain written informed consent.

Study Design
This was a retrospective cohort study conducted in 11 Japanese tertiary care facilities that provided treatment for patients with COVID-19.

Participants
We included patients who underwent both RT-PCR and chest CT for the diagnosis of COVID-19. Potentially eligible participants were identified as those who had, on the advice of physicians, undergone both RT-PCR and chest CT when they presented with symptoms or were suspected of having COVID-19. RT-PCR results were extracted from the patients' medical records at each facility. We selected patients using consecutive sampling between January 1 and May 30, 2020. We excluded patients when the time interval between chest CT and the first RT-PCR assay exceeded 7 days.

Chest CT and Artificial Intelligence
We considered, for each patient, the CT image that was taken closest to the onset of symptoms. All these images featured the patient in a supine position.
Ali-M3 is a three-dimensional deep-learning framework for the detection of COVID-19 infections, developed from 7,000 chest CT scans (12). It predicts COVID-19 infection with a confidence level in the range of 0-1. Training of Ali-M3 was frozen before our evaluation (10), and the investigators who entered data from the CT images into Ali-M3 were blinded to the corresponding RT-PCR results. The area under the curve (AUC) of Ali-M3 for predicting a COVID-19 diagnosis was 0.797 (95% confidence interval [CI]: 0.762-0.833) (10).

Clinical Characteristics
We extracted, from electronic medical records, those clinical characteristics that were recorded at a time closest to the date of the chest CT scans. At the time of data acquisition, the turnaround time of RT-PCR was a few days; therefore, all predictive variables were recorded without the RT-PCR results.

Reference Standard
COVID-19 was diagnosed by the RT-PCR test, which detects the nucleic acid of SARS-CoV-2 in sputum, throat swabs, and secretions of the lower respiratory tract (13). This test was established as the primary reference standard. Although chest CT findings interpreted by radiologists were included as a reference standard in the AI development phase of this framework, we did not include them as a reference standard in this study.

Model Development
We used the machine learning model Light Gradient Boosting Machine (LightGBM), a highly effective gradient-boosting decision-tree algorithm (13). In the boosting algorithm, weak classifiers (decision trees) are created sequentially, each minimizing the prediction errors made by the previous classifier (14). The result is a powerful ensemble classifier with superior predictive performance. To avoid overfitting, parameters specific to the algorithm (known as hyperparameters) must be well tuned before fitting the final model; LightGBM also handles missing data natively, so missing values were treated as such.
The creation of the prediction models consisted of three steps. First, the dataset in each hospital was chronologically split into a training set (70% of the patients) and a test set (the remaining 30%). Second, hyperparameters were tuned to maximize the area under the receiver operating characteristic curve (AUROC) by performing five-fold cross-validation on the training set, using stratified splitting into equally sized groups. A Bayesian optimization algorithm was used for tuning, with the following search parameters and spaces: "num_leaves" (maximum number of leaves in one tree), 10-150; "max_depth" (maximum tree depth), 10-150; "learning_rate" (learning rate), 0.005-0.5; "subsample_for_bin" (number of samples used to construct feature-discretization bins), 20,000-300,000; "min_child_samples" (minimum number of samples in one leaf), 10-100; "reg_alpha" (L1 regularization), 0.0-1.0; "reg_lambda" (L2 regularization), 0.0-1.0; "colsample_bytree" (fraction of features selected in training each tree), 0.5-1.0; "subsample" (fraction of data selected in training each tree), 0.5-1.0; and "is_unbalanced", True or False. The optimal "n_estimators" (number of trees) was determined automatically by early stopping ("early_stopping_rounds" = 100). Third, we used the entire training set to fit two final models with the tuned hyperparameters. One model, the full model, included all the above-mentioned variables, while the other, the A-blood model, included only eight variables (the Ali-M3 confidence, in addition to seven variables pertaining to blood test results: white blood cell count, hemoglobin, platelet count, aspartate aminotransferase, alanine aminotransferase, lactate dehydrogenase, and C-reactive protein). These blood test variables were selected owing to the ease of availability of their data and their relative importance in the full model, computed as Shapley Additive exPlanations (SHAP) values (13).
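The per-hospital chronological 7:3 split and the reported search space can be sketched as follows. This is a minimal, standard-library-only illustration; the record fields (`hospital_id`, `admission_date`) and the `search_space` dictionary layout are hypothetical conveniences, not the study's actual data schema or tuning code.

```python
from collections import defaultdict

def chronological_split(records, train_frac=0.7):
    """Split each hospital's patients chronologically into train/test sets."""
    by_hospital = defaultdict(list)
    for rec in records:
        by_hospital[rec["hospital_id"]].append(rec)
    train, test = [], []
    for recs in by_hospital.values():
        recs.sort(key=lambda r: r["admission_date"])  # oldest patients first
        cut = int(len(recs) * train_frac)             # first 70% -> training
        train.extend(recs[:cut])
        test.extend(recs[cut:])
    return train, test

# Bayesian-optimization search space reported above, as (low, high) bounds.
search_space = {
    "num_leaves": (10, 150),
    "max_depth": (10, 150),
    "learning_rate": (0.005, 0.5),
    "subsample_for_bin": (20_000, 300_000),
    "min_child_samples": (10, 100),
    "reg_alpha": (0.0, 1.0),
    "reg_lambda": (0.0, 1.0),
    "colsample_bytree": (0.5, 1.0),
    "subsample": (0.5, 1.0),
    "is_unbalanced": (False, True),
}
```

Splitting chronologically (rather than randomly) within each hospital mimics prospective deployment: the model is always evaluated on patients admitted after those it was trained on.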
SHAP values quantify the association between each variable and the outcome of each patient.
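The idea behind SHAP values can be illustrated with an exact Shapley-value computation for a toy model: each feature's value is its marginal contribution to the prediction, averaged over all feature orderings. This brute-force sketch is for intuition only; real SHAP computations on tree ensembles use efficient dedicated algorithms, and the toy `predict` function below is purely hypothetical.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution of each feature
    over all orderings, switching features from baseline values to x."""
    n = len(x)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        current = list(baseline)
        for i in order:
            before = predict(current)
            current[i] = x[i]          # reveal feature i
            phi[i] += predict(current) - before
    return [p / len(perms) for p in phi]

# Toy additive model: Shapley values recover each feature's contribution,
# and they sum to predict(x) - predict(baseline) (the efficiency property).
predict = lambda v: 2 * v[0] + 3 * v[1]
phi = shapley_values(predict, x=[1.0, 1.0], baseline=[0.0, 0.0])
```

For the additive toy model, `phi` is simply each term's coefficient times its value; for a boosted tree ensemble, the same averaging attributes the prediction of each patient across the input variables.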

Model Validation
We compared the discrimination of the machine-learning models with that of the Ali-M3 confidence using the AUROC in the test set, with 95% confidence intervals calculated by bootstrap resampling (1000 samples). The AUROC is an effective measure of overall diagnostic accuracy, which is deemed "outstanding" if AUROC ≥ 0.9, "excellent" if 0.8 < AUROC < 0.9, and "acceptable" if 0.7 < AUROC < 0.8 (18). Calibration was assessed using the Brier score (19) and a calibration plot. For a binary outcome, the Brier score is given by Brier Score = (1/N) Σ (p_i - o_i)^2, where p_i is the predicted probability of the outcome for patient i (1 for an outcome predicted to definitely occur and 0 for one predicted to definitely not occur), o_i is the observed outcome, and N is the number of patients; smaller values indicate superior model performance. The AUROC values and Brier scores of the machine-learning models were compared with those of the Ali-M3 confidence using bootstrap resampling (1000 samples). We calculated SHAP values and present them in the figures.

Results

Patient Characteristics
A total of 703 patients were included in the study, comprising 326 PCR-positive and 377 PCR-negative patients. The training set included 490 patients, of whom 247 were PCR-positive. The test set included 213 patients, of whom 79 were PCR-positive. Patient characteristics are shown in Table 1.

Model Performance
We developed two models: a full model and an A-blood model. Details of the A-blood model are accessible in an online web calculator (20). Model discrimination and calibration on the test data are shown in Table 2. The AUROC values of both the full model (0.91, 95% CI, 0.86 to 0.95) and the A-blood model (0.90, 95% CI, 0.86 to 0.94) were better than that of the Ali-M3 confidence (0.78, 95% CI, 0.71 to 0.83) in the test set. The calibration, evaluated by the Brier score, of both the full model (0.10, 95% CI, 0.07 to 0.13) and the A-blood model (0.12, 95% CI, 0.08 to 0.16) was better than that of the Ali-M3 confidence (0.23, 95% CI, 0.19 to 0.27) in the test set. The ROC curves for the test data are shown in Figure 1, with SHAP values shown in Figures 2 and 3; Figure 2 shows all of the predictive variables that we used. Figures 4-6 show the calibration plots for the Ali-M3 framework, the full model, and the A-blood model, respectively.

Discussion
We developed and validated two diagnostic models integrating the Ali-M3 framework with the clinical characteristics of patients with suspected COVID-19. Based on the relative importance of each variable, we shrank the full model to a more compact A-blood model, whose variables comprise the Ali-M3 confidence and seven routinely collected blood markers. This A-blood model showed discrimination and calibration performance comparable to that of the full model despite using far fewer variables.
Our diagnostic model is the first to automatically interpret clinical data in conjunction with CT scans. Several problems faced by existing diagnostic models, such as the separate collection of cases and controls, lack of external validation, and insufficient reporting (8, 9), have been overcome in this study through rigorous methodology, and our model achieved good discrimination and calibration performance.
The A-blood model would allow for quicker diagnoses. Even in facilities with on-site RT-PCR testing, the A-blood model could be a better option because of its shorter turnaround time, as it requires only general blood test and CT results. When RT-PCR results are not yet known, the A-blood model could help physicians determine indications for timely treatment with antibody drugs (21). For patients in whom COVID-19 infection cannot be ruled out based on a single negative RT-PCR result, physicians may be able to use the A-blood model to determine whether a patient can be released from quarantine. These clinical implications need to be evaluated in further studies (22).
This study has several limitations. First, the dataset used in this study is from the first wave of infections in the spring of 2020 and therefore does not include vaccinated patients or the later variants of the SARS-CoV-2 virus. Thus, further external validation is necessary. A second limitation is the occurrence of false negatives, which includes patients falsely regarded as COVID-negative on the basis of a single negative PCR result. This misclassification may affect the accuracy, but the magnitude of this bias cannot be estimated. Further studies are required with datasets that also include sufficient follow-up.
In conclusion, we developed the A-blood model, which is a COVID-19 diagnostic tool that combines machine learning and CT evaluation with blood test data. Physicians would be able to use this model for the rapid diagnosis of COVID-19. Further validation studies, especially those including SARS-CoV-2 variants and subjects inoculated with different vaccines, are warranted.

Declarations
Reporting Checklist: The authors have completed the TRIPOD reporting checklist.
Data Sharing Statement: Data will be shared upon reasonable request to the corresponding author.

Conflicts Of Interest:
All authors have completed the ICMJE unified disclosure form. Junichi Matsumoto received a lecture fee from M3 Inc. The other authors declare no conflicts of interest.

Funding information:
This study was partially supported by the Kyoto University managing fund for English editing. The article processing fee was supported by the Scientific Research Works Peer Support Group (SRWS-PSG). The funders played no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or the decision to submit the manuscript for publication.

Figure 4
Predicted versus observed probability of a positive COVID-19 real-time reverse transcription polymerase chain reaction (RT-PCR) result (calibration; pink line) for the Ali-M3 confidence.

Figure 5
Predicted versus observed probability of a positive COVID-19 real-time reverse transcription polymerase chain reaction (RT-PCR) result (calibration; pink line) for the machine learning model using all variables (full model).