iCTCF: an integrative resource of chest computed tomography images and clinical features of patients with COVID-19 pneumonia

The outbreak of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was initially reported in Wuhan, China, in December 2019. Here, we report a timely and comprehensive resource named iCTCF to archive 256,356 chest computed tomography (CT) images, 127 types of clinical features (CFs), and laboratory-confirmed SARS-CoV-2 clinical status from 1170 patients, reaching a data volume of 38.2 GB. To facilitate COVID-19 diagnosis, we integrated the heterogeneous CT and CF datasets and developed a novel framework of Hybrid-learning for UnbiaSed predicTion of COVID-19 patients (HUST-19) to distinguish negative cases, mild/regular patients, and severe/critically ill patients. Although both CT images and CFs are informative in predicting patients with or without COVID-19 pneumonia, the integration of the CT and CF datasets achieved a striking accuracy, with an area under the curve (AUC) value of 0.978, much higher than that obtained when exclusively using either CT (0.919) or CF data (0.882). Together with HUST-19, iCTCF can serve as a fundamental resource for improving the diagnosis and management of COVID-19 patients.


Introduction
Since December 2019, the outbreak of an initially unknown viral pneumonia has severely affected Wuhan, China. The causative virus was quickly identified and named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) 1-3, and the resulting disease was designated coronavirus disease 2019 (COVID-19) pneumonia by the World Health Organization (WHO) [4][5][6][7][8]. By the end of March 2020, nearly 200 countries and regions had been affected, with > 500,000 confirmed cases and numbers still rising. Such a severe situation underscores the urgency of developing effective measures to control this pandemic.
To control the outbreak, early diagnosis of patients with COVID-19 pneumonia for timely treatment is critical, especially in epidemic regions 2,[8][9][10][11]. However, this is challenging to achieve. Across many COVID-19-stricken regions, limited medical resources and rapidly accumulating patients commonly result in long waiting times for diagnosis and medical decisions, such as quarantine or hospitalization, which potentially increases the chance of cross-infection and leads to poor prognosis. Although COVID-19 confirmation relies on real-time polymerase chain reaction (RT-PCR) to detect the presence of SARS-CoV-2, this PCR testing has been found to have a high specificity (Sp) 8 but a rather low sensitivity (Sn), with a reported positive rate of merely 38%~57% 12.
In addition to etiological laboratory confirmation, other key diagnostic elements that could facilitate identification of COVID-19 pneumonia include clinical features (CFs) and chest computed tomography (CT) imaging 13,14. Consistent with the importance of these elements, COVID-19 guidelines, for instance the one followed in China, use all of them to define mild, regular, severe, and critically ill forms of COVID-19 pneumonia 9,14-18. Although understanding remains far from complete, a limited number of relevant studies have begun to reveal informative CFs, including symptoms of COVID-19 such as fever, dry cough, myalgia and shortness of breath 10,19,20. Other CFs, such as lymphopenia, elevated levels of inflammatory cytokines, and reduction in T cell subsets, are also frequently found 10,11,15. Moreover, chest CT imaging characteristics of infected lungs reportedly include ground glass opacity (GGO) and severity-correlated consolidation 21. Although the full picture of each element is still emerging, comprehensively pooling the features of these diagnostic elements might collectively improve diagnostic accuracy and efficacy.
Amid the ongoing COVID-19 pandemic, the availability of first-hand radiographic and clinical datasets is essential to help guide clinical decision-making, to deepen the understanding of this viral infection, and to provide a basis for systematic modeling that may facilitate early diagnosis and timely medical intervention. A way to achieve this goal is to build an open-access and comprehensive resource containing chest CT images and CFs of individual patients, offering a platform for internationally joint efforts to combat COVID-19 pneumonia.
From the accumulated data in our hospitals, we enrolled 1170 anonymous patients, including 649 laboratory-confirmed, 222 COVID-19-negative/control and 299 suspected patients, and collected their corresponding chest CT images, CFs and SARS-CoV-2 laboratory testing results where available. Then, we developed a patient-centric resource named integrative CT images and CFs for COVID-19 (iCTCF) to archive and share the rich data. Through integration of the highly heterogeneous CT and CF datasets, we built a novel framework of Hybrid-learning for UnbiaSed predicTion of COVID-19 patients (HUST-19) to predict negative cases (Control), mild/regular (Type I) and severe/critically ill (Type II) patients. Ten-fold cross-validation was conducted for model training, parameter optimization and performance evaluation.

In iCTCF, all patients can be browsed by customizing the desired selections (Fig. 2a). By clicking the "Submit" button, the results are displayed in a tabular list with 20 patients' information per page (Fig. 2a).
For convenience, we also provided an "Example" button that can be clicked to automatically load pre-configured selections, followed by the presentation of several typical cases (Fig. 2b). Here, we selected "Patient 4" as an example to show the annotations in iCTCF. Patient 4 had intermittent fever (peaking at 38.5 °C), fatigue, shortness of breath, and myalgia for ten days prior to admission. He coughed occasionally with sputum. On Feb 3, the RT-PCR test for SARS-CoV-2 nucleic acid on his throat swab specimens was positive. He was admitted on Feb 4 with a finger blood oxygen saturation of 90% in ambient air, which reached 98% with face mask oxygen support (3 L/min).
According to the Guideline of China (6th edition), he was diagnosed with the severe form of COVID-19 and regarded as a Type II case in iCTCF.
By clicking "Patient 4", detailed information on the anonymous patient would be shown ( Fig. 2c). On patient page, a brief summary of the patient was presented, whereas 5 representative CT images would be displayed (Fig. 2c, d). All numerical forms of CFs were provided in a tabular list, and the laboratory-confirmed SARS-CoV-2 status was also shown (Fig. 2c, e). Consistent with his brief clinical summary described above, users would find the age of this patient (73), gender (male), his body temperature (38.5℃), the positive SARS-CoV-2 infection status, and his Udis (aorta calcification).
His CT examination at the clinic suggested possible bilateral viral pneumonia. His tabular list contained 82 numerical CFs, including but not limited to decreased EC, lymphocyte count (LY), EOP and lymphocyte percentage (LYP), and increased levels of NEP, erythrocyte sedimentation rate (ESR) and C-reactive protein (CRP) (Fig. 2e).
He was diagnosed with the severe form of COVID-19 (Fig. 2c).

A hybrid-learning framework to predict COVID-19 patients
To exemplify the usefulness of iCTCF, we developed a computational method named HUST-19 that integrates the datasets of CT images and CFs for the prediction of patients with COVID-19 pneumonia. First, 13-layer convolutional neural networks (CNNs) were implemented to classify individual CT images as non-informative (NiCT), positive (pCT) or negative (nCT) for COVID-19. Second, another set of 13-layer CNNs was implemented to transform the individual CT image-based prediction into the patient-based prediction (Fig. 3). For each patient, the ten most probable pCT images were reserved as representative images, which were inputted into the secondary 13-layer CNNs to classify the patient as a control, Type I or Type II case. Third, the CF-based classification of patients into the three types was implemented in a framework of 7-layer deep neural networks (DNNs), including one input layer, 5 dense layers, and one output layer (Fig. 3). In contrast to the CNNs, the DNNs did not have convolutional and pooling layers. Finally, the predictions from CT images and CFs were integrated through the penalized logistic regression (PLR) algorithm to output the final predictions on patient classification (Control, Type I and Type II).
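To make this four-step pipeline concrete, the following minimal Python sketch traces the data flow for a single patient. The model objects, the pCT column index and all array shapes are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def hust19_predict(ct_images, cf_vector,
                   image_cnn, patient_cnn, cf_dnn, plr_model):
    """Illustrative HUST-19 data flow for one patient (not the original code)."""
    # Step 1: classify every CT image as NiCT, pCT or nCT.
    image_probs = image_cnn.predict(ct_images)                 # (n_images, 3)
    # Step 2: keep the ten images with the highest pCT probability
    # (column 1 assumed to be pCT) for the patient-level CNN.
    top10 = ct_images[np.argsort(image_probs[:, 1])[-10:]]
    ct_scores = patient_cnn.predict(top10[np.newaxis])         # (1, 3)
    # Step 3: CF-based prediction with the 7-layer DNN.
    cf_scores = cf_dnn.predict(cf_vector[np.newaxis])          # (1, 3)
    # Step 4: integrate the 6 intermediate scores with penalized
    # logistic regression to yield Control / Type I / Type II scores.
    features = np.hstack([ct_scores, cf_scores])               # (1, 6)
    return plr_model.predict_proba(features)
```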

The prediction accuracy of HUST-19
For training the individual image-based models in HUST-19, labelled NiCT, pCT and nCT images were used, and the three types of CT images could be accurately distinguished in the 10-fold cross-validation (Fig. 4a). For the CT image-based prediction of patients, an AUC value of 0.919 was achieved in identifying COVID-19 patients (Fig. 4b).
Compared to the CT image-based prediction, the CF-based prediction had a lower accuracy in general for identifying COVID-19 patients, as evidenced by an AUC value of 0.882 (Fig. 4c). However, CFs were more effective than CT images in classifying Type I and II COVID-19 patients (Fig. 4c). Thus, CT images and CFs had their unique advantages, providing a justification for integration of these two types of datasets.
Indeed, their combination produced much higher AUC values of 0.978, 0.921 and 0.931 in predicting controls, Type I and II patients, respectively (Fig. 4d).
For a general prediction, the type of a CT image or a patient was determined by the highest probability score among the three output values. Under this threshold, confusion matrices were generated from the 10-fold cross-validations to visualize the average agreement between actual and predicted results (Fig. 4e-h). It was found that NiCT, pCT and nCT images could be correctly recognized with high accuracy (Fig. 4e). For patient-based predictions, the CT image-based and CF-based models achieved similar performance in recognizing control cases and Type I patients, whereas the CF-based prediction exhibited a higher efficiency in correctly predicting Type II patients (Fig. 4f, g). The integration of CT images and CFs considerably improved the prediction efficiency in recognizing control cases and Type II patients (Fig. 4h). The results of other performance measurements under the general threshold are shown in Supplementary Table 4.
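As a minimal sketch, this general decision rule and the confusion-matrix summary can be expressed as follows, assuming probs is an (n, 3) array of output scores and y_true holds the integer-coded actual classes:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# probs: (n_samples, 3) scores; columns ordered as the three output classes.
y_pred = np.argmax(probs, axis=1)        # assign the class with the highest score
cm = confusion_matrix(y_true, y_pred)    # rows = actual classes, columns = predicted
print(cm)
```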
In HUST-19, the best model with the highest 10-fold cross-validation AUC value was reserved for making the final predictions. Besides the general threshold, a more sensitive threshold was also defined, and the corresponding performance measurements are shown in Supplementary Table 4.

Computational annotations of suspected cases
In iCTCF, there were 299 suspected cases without definitive SARS-CoV-2 laboratory confirmation at the time of enrollment. Here, we used HUST-19 with the sensitive threshold to predict 21, 207 and 71 of the 299 suspected cases to be COVID-19-negative cases, Type I cases, and Type II cases, respectively (Fig. 5a). For each patient, the six intermediate scores generated from the CT image-based and CF-based predictions were retrieved and analyzed by t-distributed stochastic neighbor embedding (t-SNE) in a 2-dimensional (2D) plot. The t-SNE result demonstrated that the suspected cases were dispersed across the three types, and the predicted Type I and II cases were in close proximity to confirmed COVID-19 cases (Fig. 5b). For example, Patients 324 and 610 were predicted as Type I and Type II cases, respectively (Fig. 5b).
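A minimal sketch of this visualization step follows, assuming scores is an (n_patients, 6) array of the intermediate scores and patient_type holds the predicted classes; the t-SNE settings are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# scores: (n_patients, 6) intermediate scores (3 CT-derived + 3 CF-derived).
embedding = TSNE(n_components=2, random_state=0).fit_transform(scores)

plt.scatter(embedding[:, 0], embedding[:, 1], c=patient_type, s=8)
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
plt.show()
```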
Patient 324, a 34-year-old female, was admitted to the HUST-UH on Feb 9 because of "fever for seven days" and a ground glass lesion in the left lower lung suggested by CT imaging (Fig. 5c). Although not laboratory-confirmed at enrollment, she tested positive for SARS-CoV-2 serum IgM and IgG when she came back to the hospital for a follow-up examination on Mar 6 (Day 15 after discharge). Thus, her diagnosis was eventually corrected to "COVID-19 regular form" (Fig. 5c).

Data collection and preparation

COVID-19 case definitions
The CFs collected for each patient are summarized in Supplementary Table 1. The basic information included age, gender, body temperature (°C), and Udis, which were derived from the patients' medical records.
In HUST-UH, routine blood tests, such as hemoglobin (HGB), were carried out.

The 13-layer CNNs
We used 2 sets of 13-layer CNNs for the image-based and patient-based predictions, respectively. In each CNN framework, there were one input layer, 3 sets of dual convolutional and pooling layers, 2 dense layers and one output layer (Fig. 3). In the 11 hidden layers, neurons were the basic computational units, and both internal feature encoding and computational outcomes were connected and propagated by the neurons inside each layer. The convolutional layers were used for feature extraction and representation, and the widely used rectified linear unit (ReLU) function was adopted to activate the outcome of a neuron, defined as below:

$\mathrm{ReLU}(x) = \max(0, x)$

where $x$ was the weighted sum of the inputs of a neuron.
In the pooling layers, feature selection and information filtering were performed by the max pooling strategy. The last 2 hidden layers were dense layers for generating prediction outcomes. To prevent the overfitting that frequently occurs in deep learning algorithms, we used a simple dropout method to randomly select a number of nodes from the 2 dense layers and set their corresponding scores to 0 if doing so increased the average accuracy (Ac) value. In the output layer, 3 sigmoid nodes were set to separately calculate 3 scores for an inputted CT image, as shown below:

$\mathrm{Score}(y) = \frac{1}{1 + e^{-y}}$

where $y$ was the input of a sigmoid node derived from the dense layer. In the CNN model for the image-based prediction, $\mathrm{Score}(y)$ was a value between 0 and 1 representing the probability of a CT image being classified as a NiCT, pCT or nCT image. For the patient-based prediction, $\mathrm{Score}(y)$ was a value between 0 and 1 reflecting the probability of a patient being a control case, Type I patient, or Type II patient.
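To make the layer layout concrete, here is a minimal Keras sketch of a 13-layer CNN of this shape. The input size, filter counts and dense-layer widths are illustrative assumptions, as they are not specified in this section.

```python
from keras.models import Sequential
from keras.layers import InputLayer, Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
model.add(InputLayer(input_shape=(200, 200, 1)))   # input layer (size assumed)
for filters in (32, 64, 128):                      # 3 sets of dual conv + pooling
    model.add(Conv2D(filters, (3, 3), padding='same', activation='relu'))
    model.add(Conv2D(filters, (3, 3), padding='same', activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))      # max pooling for feature selection
model.add(Flatten())                               # reshaping only, not a counted layer
model.add(Dense(256, activation='relu'))           # dense layer 1
model.add(Dropout(0.5))                            # dropout against overfitting (rate assumed)
model.add(Dense(64, activation='relu'))            # dense layer 2
model.add(Dense(3, activation='sigmoid'))          # 3 sigmoid output nodes
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```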

Normalization of CF data and the 7-layer DNNs
For each patient, the diagnosed value of a CF was $f$, and $f$ was normalized as below:

$F = \frac{f - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}$

where $F$ was the normalized value of $f$, and the normal range of the CF was Min to Max. If $f$ was unavailable, we set $F$ to 0.5. For the two CFs of gender and Udis, we used 0 or 1 to encode males or females, and adopted 0 and 1 to encode patients with and without Udis, respectively.
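A minimal Python sketch of this normalization, assuming the normal range of each CF is known; the function name is hypothetical.

```python
def normalize_cf(f, cf_min, cf_max):
    """Normalize a clinical feature f against its normal range [cf_min, cf_max].

    Unavailable values are encoded as 0.5, as described above.
    """
    if f is None:                              # unavailable measurement
        return 0.5
    return (f - cf_min) / (cf_max - cf_min)

# Example with a hypothetical normal range of 1.1-3.2 for lymphocyte count:
print(normalize_cf(0.8, 1.1, 3.2))             # < 0, i.e., below the normal range
print(normalize_cf(None, 1.1, 3.2))            # 0.5 for a missing value
```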
To enable the classification of patients based on normalized CFs, we used 7-layer DNNs, including one input layer, 5 dense layers and one output layer (Fig. 3). Again, to avoid over-fitting, the dropout method was used to randomly drop nodes from the 5 hidden layers if doing so increased the average Ac value. In the first step, the input layer received the numerical values of CFs for each patient. The 5 hidden layers were mainly adopted for feature extraction and representation, with the ReLU activation function used to transform the data at each node. Again, the output layer contained 3 sigmoid neurons to individually calculate 3 values ranging from 0 to 1 for each patient. Finally, the DNN model was trained for the classification of patients into the 3 types.
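A corresponding Keras sketch of such a 7-layer DNN follows; the layer widths and dropout rate are illustrative assumptions, while the 127 input CFs match the number archived in iCTCF.

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

N_CF = 127                                    # normalized clinical features per patient
dnn = Sequential()
dnn.add(Dense(128, activation='relu', input_shape=(N_CF,)))   # dense layer 1
for width in (128, 64, 64, 32):                               # dense layers 2-5
    dnn.add(Dense(width, activation='relu'))
    dnn.add(Dropout(0.3))                     # dropout against over-fitting (rate assumed)
dnn.add(Dense(3, activation='sigmoid'))       # Control, Type I and Type II scores
dnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```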

The PLR algorithm
The integration of the predictions from CT images and CFs was performed by the PLR algorithm, which was implemented in Python 3.7 with Scikit-learn 0.21.2. For each patient, the CNN models and DNN models were individually used to calculate 3 scores each. Then, the 6 intermediate values were taken as secondary features, and the weight score of each value was initially set to 1. The ridge regression (L2 regularization) penalty was adopted to optimize the weight scores if doing so increased the average Ac value.
Finally, the PLR model calculated 3 scores for predicting control cases, Type I or Type II patients.
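A minimal Scikit-learn sketch of this integration step, assuming cnn_scores and dnn_scores are (n_patients, 3) arrays of intermediate scores and y holds the integer-coded patient types; the hyperparameters are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stack the 3 CT-derived and 3 CF-derived scores into 6 secondary features.
X = np.hstack([cnn_scores, dnn_scores])              # shape: (n_patients, 6)

# L2-penalized (ridge) multinomial logistic regression over the 3 patient types.
plr = LogisticRegression(penalty='l2', solver='lbfgs', multi_class='multinomial')
plr.fit(X, y)                                        # y: 0 = Control, 1 = Type I, 2 = Type II

final_scores = plr.predict_proba(X)                  # 3 final scores per patient
```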

Model training and parameter optimization
To train the 13-layer CNN models for the individual CT image-based prediction, we randomly generated a training dataset and a testing dataset at a ratio of approximately 9:1, in which the labelled NiCT, pCT and nCT images were proportionally distributed. We further randomly split the training dataset into 10 parts, 9 of which were used for model training. The remaining part was then used to calculate the average Ac value for predicting the three types of CT images, and parameter optimization was stopped when the Ac value no longer increased. The randomization and parameter optimization on the training dataset were repeated 10 times, and the model with the highest Ac value was reserved. Using the determined parameters, the final model was trained on the full training dataset. The testing dataset was not used for training; it was only used to count TP, TN, FP and FN values and to calculate the performance measurements. To avoid any bias, the 10-fold cross-validation on the full dataset was randomly repeated 10 times. For the 10 best models, the average Sn, Sp, Ac, PPV, NPV, MCC and AUC values were computed, and a confusion matrix was visualized. The same 10-fold cross-validation was also adopted for the CT image-based prediction of patients, the CF-based prediction of patients, and the integration of predictions from CT images and CFs.
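As a sketch, this split-and-validate protocol can be written with Scikit-learn as follows, assuming X and y hold the samples and their integer labels; the random seeds are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hold out ~10% as an untouched testing set, with classes proportionally distributed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# 10-fold cross-validation on the training set for parameter optimization.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X_train, y_train):
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    y_tr, y_val = y_train[train_idx], y_train[val_idx]
    # fit the model on (X_tr, y_tr) and compute the average Ac on (X_val, y_val)
```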
For model training, we used a lab computer with an Intel(R) Core(TM) i7-6700K @ 4.00 GHz central processing unit (CPU), 32 GB of RAM and an NVIDIA GeForce GTX 1070 graphics card. Keras version 2.2.4 (http://github.com/fchollet/keras), a neural network API written in Python on top of TensorFlow 1.13.1 (https://github.com/tensorflow), was adopted for parallel computing. For the CNNs, the Adam optimizer in Keras was adopted, with learning rates of 0.001 and 0.0007 for the CT image-based and patient-based predictions, respectively, and a mini-batch size of 64. For the DNNs, parameters such as the learning rate, mini-batch size, and dropout probability were simultaneously optimized to achieve optimal performance.
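For reference, the stated optimizer settings translate into Keras 2.2.4 calls along these lines; image_model and patient_model stand for the two CNNs sketched above, and the training arrays and epoch count are placeholders.

```python
from keras.optimizers import Adam

# CT image-based CNN: learning rate 0.001, mini-batch size 64.
image_model.compile(optimizer=Adam(lr=0.001), loss='binary_crossentropy')
image_model.fit(X_images, y_images, batch_size=64, epochs=30)

# Patient-based CNN: learning rate 0.0007, same mini-batch size.
patient_model.compile(optimizer=Adam(lr=0.0007), loss='binary_crossentropy')
patient_model.fit(X_patients, y_patients, batch_size=64, epochs=30)
```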

Data Availability
All source datasets, including chest CT images, CFs and laboratory confirmations, are freely available from the iCTCF database.