A Fully Automatic AI-based CT Image Analysis System for Accurate Detection, Diagnosis, and Quantitative Severity Evaluation of Pulmonary Tuberculosis

Background: Accurate and rapid diagnosis of pulmonary tuberculosis (TB) is crucial for timely prevention and appropriate treatment of the disease. This study aimed to develop and evaluate an artificial intelligence (AI)-based, fully automated CT image analysis system for the detection, diagnosis, and burden quantification of pulmonary TB. Methods: From December 2007 to September 2020, 892 chest CT scans from pathogen-confirmed TB patients were retrospectively included. Deep learning networks were connected in a cascading framework to create a processing pipeline. To train and validate the model, 1921 lesions were manually labeled, classified into six categories of critical imaging features, and visually scored for lesion involvement as the ground truth. A “TB score” was calculated from the network-activation map to assess the disease burden quantitatively. Independent test datasets from two additional hospitals and the NIH TB Portal were used to externally validate the performance of the AI model. Results: CT scans from 526 participants (mean age, 48.5 years ± 16.5; 206 women) were analyzed. The lung lesion detection subsystem yielded a mean average precision of 0.68 on the validation cohort. In the independent datasets, the overall classification accuracy for six pulmonary critical imaging findings indicative of TB was 81.08%-91.05%. A moderate to strong correlation was demonstrated between the AI model-quantified “TB score” and the radiologist-estimated CT score. Conclusion: This end-to-end AI system based on chest CT can achieve human-level diagnostic performance and holds great potential for early management and medical resource optimization for patients with pulmonary TB in clinical practice.

Abbreviations: AI: artificial intelligence; CNN: convolutional neural network; COVID-19: Coronavirus disease 2019; CT: computed tomography; AUC: area under the curve; DL: deep learning; FN: false negative; FP: false positive; ROI: region of interest; TB: tuberculosis; TP: true positive.


Background
Tuberculosis (TB), caused by the bacillus Mycobacterium tuberculosis, is a major public health problem because of its rapid spread [1]. It is the leading cause of death by infectious disease worldwide [2]. Although the global morbidity and mortality of TB have slowly declined, the disease burden remains substantial in endemic countries [3]. Chest imaging plays a crucial role in the work-up of patients with pulmonary TB [4]. In particular, computed tomography (CT) has been used to help diagnose pulmonary TB, monitor imaging changes, and evaluate disease severity [5]. In addition, radiographic features on CT can suggest the activity and progression of TB [6]. This supports clinicians' decision making for timely isolation and appropriate treatment. However, the complexity of chest CT images creates a challenging workload for specialists. In countries where health and surveillance systems are weak, underdiagnosis and under-reporting of TB are common.
Artificial intelligence (AI) has gained significant attention in recent years, and many applications have been proposed in medical image recognition and interpretation [7]. Deep learning (DL), the core technique of modern AI, has made great progress in medical image analysis, including skin disease classification [8], diabetic retinopathy detection [9], and lung cancer screening on chest CT [10]. Recent advances in CT-based deep learning systems have demonstrated the potential of AI-assisted radiological diagnosis [11][12][13]. For example, a few studies have already reported supervised deep learning to diagnose, quantify the extent of, and predict triage for Coronavirus disease 2019 (COVID-19) on CT during the global pandemic [14]. Therefore, we hypothesized that end-to-end DL networks, which learn features automatically and adaptively through backpropagation, can be designed to achieve human expert-level performance for diagnosis and follow-up.
In this study, we developed and evaluated an AI-based, fully automated CT image analysis model to support the detection, diagnosis, and disease severity quantification of pulmonary TB.

Study participants
The study was approved by our institutional review board. Written informed consent was waived for this retrospective study. From December 2017 to September 2020, we searched the picture archiving and communication system for chest CT scans of patients with suspected TB. A total of 1356 CT scans from 865 patients met the following inclusion criteria (Fig. 1): (a) age ≥18 years; (b) CT examination for known or suspected primary or secondary TB; (c) CT slice thickness < 1.5 mm. Exclusion criteria were inadequate image quality (n=73), no typical imaging findings (cavitation, consolidation, centrilobular nodules and tree-in-bud, clusters of nodules, fibronodular scarring, or calcified granulomas) indicative of TB (n=215), and negative Mycobacterium tuberculosis culture from sputum, bronchoalveolar lavage, or lung biopsy sample (n=176).

CT acquisition parameters and image pre-processing
Chest CT exams were acquired on different CT scanners at multiple centers. The acquisition and reconstruction parameters are summarized in Table 1. CT images were reconstructed with a 512×512 matrix and a slice thickness of 1-1.5 mm. Images were preprocessed by applying the lung window (window width, 1500 HU; window level, -700 HU) and resampled to a voxel size of 1×1×1 mm³. A three-dimensional reconstruction approach was used to display the severity of TB visually.
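The lung-window step can be illustrated with a minimal sketch. It assumes the CT volume is already available as a NumPy array of Hounsfield units; the function name `apply_lung_window` and the rescaling to [0, 1] are illustrative assumptions, not details taken from the released code.

```python
import numpy as np

def apply_lung_window(hu, level=-700.0, width=1500.0):
    """Clip HU values to the lung window and rescale to [0, 1].

    With level=-700 and width=1500 the window spans -1450 HU to 50 HU.
    The [0, 1] rescaling is an assumption made for illustration.
    """
    low = level - width / 2.0
    high = level + width / 2.0
    clipped = np.clip(np.asarray(hu, dtype=np.float32), low, high)
    return (clipped - low) / (high - low)

# Air-like, window-centre, upper-bound, and bone-like values.
windowed = apply_lung_window(np.array([-2000.0, -700.0, 50.0, 400.0]))
```

Values below -1450 HU map to 0, values above 50 HU map to 1, and the window centre maps to 0.5.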

Diagnosis system and network architectures
We proposed a convolutional neural network (CNN)-based AI cascading system to diagnose tuberculosis from a raw chest CT scan. Our model consisted of four subsystems that first identify the abnormal CT slices, then localize the lung region of interest (ROI), then perform region-specific disease diagnosis, and finally evaluate the disease severity quantitatively. The cascading networks were automatically connected to create an end-to-end processing pipeline (Fig. 2). All code is available on GitHub (https://github.com/ColinWine/Accurate-and-rapid-pulmonary-tuberculosisdiagnosis-system). The detailed network structure of the AI model is summarized in the Supplementary Materials.

Image labeling and data annotation
For DL algorithm development (training) and optimization (validation) for lesion localization and classification, a thoracic radiologist (CY) with 11 years of experience performed the data annotation. First, abnormal slices with pulmonary TB lesions and normal slices without pathological findings were manually labeled and used as the gold standard to train the DL network. Second, for each CT scan, the TB-related imaging features were judged according to the Fleischner Society glossary [15] and labeled on the center slices with the maximum area of TB lesions. The bounding boxes of these critical findings (2.2 lesions per CT scan on average) were drawn with the open-source software ITK-SNAP (version 3.6.0) for lesion detection and segmentation. Finally, the CT slices and cropped masks (ROIs) from the corresponding lesion areas were saved for each scan. For the training phase, up to seven center slices per CT scan were included for patients with multiple lesions.

Slice selection CNN
We designed a binary classification CNN based on Attention Branch ResNet [16] as the slice selection model. A sigmoid function was used as the last activation function (Fig. 3). Two attention mechanisms were introduced to improve its performance. To train the network, 124,890 and 34,161 slices were manually annotated as normal and abnormal, respectively. Slices in each CT scan were ranked by their abnormal scores, taken from the output of the final sigmoid function. The top ten abnormal slices from a raw chest CT scan were selected by the algorithm for the subsequent disease diagnosis. We then evaluated the ability of the model to screen at the case level for TB patients versus normal healthy controls. The validation data are provided in the Supplementary Materials.
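The ranking step described above can be sketched as follows. The per-slice sigmoid outputs are assumed to be available as a one-dimensional array; `select_top_slices` is a hypothetical helper, not a function from the published repository.

```python
import numpy as np

def select_top_slices(abnormal_scores, k=10):
    """Return indices of the k slices with the highest abnormal score.

    Indices are ordered from most to least abnormal. If the scan has
    fewer than k slices, all indices are returned.
    """
    scores = np.asarray(abnormal_scores, dtype=float)
    k = min(k, scores.size)
    # A stable sort on the negated scores keeps the original slice
    # order among ties.
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()

# Toy scan with four slices, keeping the two most abnormal ones.
top_two = select_top_slices([0.1, 0.9, 0.5, 0.7], k=2)
```

In the toy example the selected indices are slices 1 and 3, the two with the highest scores.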

Object detection CNN
The model is based on the CenterNet [6] architecture and provides ROI detection for the subsequent classification task [17] (Fig. S1). We used 1,921 bounding boxes with lung pathological features as the input for the model. The ROIs were randomly divided into training and validation cohorts in an 8:2 ratio, balanced for imaging findings. The trained CNN outputs the location of the TB lesions and the corresponding segmented ROIs. Mean average precision (mAP) was used to evaluate the lesion detection performance of the proposed system.
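One way to realize an 8:2 split balanced for imaging findings is a stratified random split. The sketch below is an assumption about how such a split could be done; the paper does not describe the exact procedure, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample indices so each class keeps about the same ratio.

    labels: one class label per ROI, in sample order.
    Returns two sorted lists of indices: (train, validation).
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        cut = round(len(idxs) * train_fraction)
        train.extend(idxs[:cut])
        val.extend(idxs[cut:])
    return sorted(train), sorted(val)

# Toy example: 10 ROIs of one finding and 5 of another.
train_idx, val_idx = stratified_split(["cavitation"] * 10 + ["consolidation"] * 5)
```

Splitting within each class, rather than over the pooled ROIs, keeps rare findings represented in both cohorts.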

Disease diagnosis CNN
For disease diagnosis, we used an 18-layer squeeze-and-excitation ResNet model (SeNet-ResNet-18) [18] pretrained on the ImageNet dataset to classify detected lesions from the selected abnormal CT slices. The network took images with bounding boxes as input and produced a final prediction of the categories and activity of TB (Fig. S2). The bounding boxes were labeled as cavitation (n=170), consolidation (n=281), centrilobular nodules and tree-in-bud (n=429), clusters of nodules (n=261), fibronodular scarring (n=264), or calcified granulomas (n=516). We augmented the training data by random rotations within a range of 0.2 and by flipping (horizontal and vertical). A max-pooling layer that outputs the maximum probability was used as the last layer.
To further address a clinically important task, the activity of TB, we merged the output prediction probabilities into a two-class classification of active TB (cavitation, consolidation, centrilobular nodules and tree-in-bud, clusters of nodules) and inactive TB (fibronodular scarring or calcified granulomas) [4]. The label of a patient is then predicted by combining the predictions for all local regions. Max pooling serves as an 'OR' gate that labels an image or a case as active TB if any subregion is TB (+).
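The 'OR'-gate behaviour of max pooling over regions can be illustrated with a short sketch. Two details are assumptions made only for illustration: a region's active-TB probability is taken as the sum of its class probabilities over the four active findings, and a case is called active when the pooled value reaches 0.5.

```python
ACTIVE = {
    "cavitation",
    "consolidation",
    "centrilobular nodules and tree-in-bud",
    "clusters of nodules",
}

def region_active_probability(class_probs):
    """Merge six-class probabilities into one active-TB probability.

    class_probs maps a finding name to its predicted probability.
    Summing over the active findings is an assumed merging rule.
    """
    return sum(p for name, p in class_probs.items() if name in ACTIVE)

def patient_is_active(regions, threshold=0.5):
    """Max-pool region probabilities: one active region flags the case."""
    if not regions:
        return False
    pooled = max(region_active_probability(r) for r in regions)
    return pooled >= threshold

regions = [
    {"calcified granulomas": 0.9, "cavitation": 0.1},
    {"cavitation": 0.6, "consolidation": 0.2, "fibronodular scarring": 0.2},
]
active = patient_is_active(regions)
```

Because only the maximum matters, a single confidently active region is enough to label the case as active, even when the other regions look inactive.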

Severity evaluation
To evaluate the TB burden, we used a Grad-CAM algorithm based on the slice selection CNN described above [16] to generate an activation map. The computerized quantitative approach segmented the lung tissue using thresholds and adaptive region growing. Lesion regions were segmented on the "network-activation maps" of all positive slices; these regions correspond to the areas contributing most to the network's decision. The predefined activation threshold was set to 0.88 on the basis of visual evaluation by radiologists in our experiments. A lung "TB score" was calculated as the ratio of the summed lesion volume to the volume of the corresponding lung lobe.
Additionally, each lobe was visually scored by two independent radiologists with more than 6 years of imaging experience as follows: 0, no lesion; 1, < 5% involvement; 2, 5% to less than 25% involvement; 3, 25% to less than 50% involvement; 4, 50% to less than 75% involvement; and 5, ≥ 75% involvement. Any disagreement between the two radiologists was resolved by another senior radiologist. According to the subjective score of the lung lobes, the patient was defined as severe (greater than or equal to 2) or non-severe (lower than 2).
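The TB score for one lobe can be written as a ratio of voxel counts. The sketch below assumes an activation volume normalized to [0, 1], a boolean lobe mask of the same shape, and isotropic voxels, so that voxel counts stand in for volumes. `tb_score` is a hypothetical helper, not the authors' implementation.

```python
import numpy as np

def tb_score(activation, lobe_mask, threshold=0.88):
    """Fraction of a lobe whose activation reaches the threshold.

    activation: array of Grad-CAM values, assumed to lie in [0, 1].
    lobe_mask:  boolean array of the same shape marking one lung lobe.
    """
    activation = np.asarray(activation, dtype=float)
    lobe_mask = np.asarray(lobe_mask, dtype=bool)
    lobe_voxels = int(lobe_mask.sum())
    if lobe_voxels == 0:
        return 0.0
    lesion_voxels = int(((activation >= threshold) & lobe_mask).sum())
    return lesion_voxels / lobe_voxels

# Toy 2x2 "volume" in which two of four lobe voxels exceed 0.88.
score = tb_score([[0.90, 0.50], [0.95, 0.10]], np.ones((2, 2), dtype=bool))
```

For the toy input the score is 0.5, since half of the lobe voxels pass the threshold.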
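The visual scoring bands can be expressed as a simple lookup, reading the 50% to less than 75% band as a score of 4 so that the scale runs from 0 to 5. The function name is hypothetical, and the radiologists scored visually rather than from a measured percentage.

```python
def involvement_to_score(percent):
    """Map a lobe's percentage involvement to the 0-5 visual score."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    if percent == 0:
        return 0  # no lesion
    if percent < 5:
        return 1
    if percent < 25:
        return 2
    if percent < 50:
        return 3
    if percent < 75:
        return 4
    return 5

scores = [involvement_to_score(p) for p in (0, 4.9, 5, 25, 50, 75)]
```

Each lower bound is inclusive, so an involvement of exactly 5% falls in band 2 and exactly 75% in band 5.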

Performance of the AI model in external validation datasets
Model performance was tested independently on Dataset-2 (Yanling Hospital, n=99), Dataset-3 (Haikou Hospital, n=86), and Dataset-4 [NIH TB Portal dataset (https://tbportals.niaid.nih.gov/), n=171]. Patients who matched the above criteria were selected from Dataset-2 and Dataset-3. From the NIH cohort, we excluded CT scans that fulfilled any of the following criteria: (a) slice thickness >2 mm; (b) inadequate image quality for diagnosis; (c) could not be clearly allocated to any of the classification categories. The model output probabilities for the six critical imaging findings and for activity, and the consensus categories assigned by two radiologists with 11 and 6 years of clinical experience were considered the ground truth. Candidate related regions on contiguous slices were merged to avoid clustered standard errors. False positive patches were deleted and false negatives were added manually.

Statistical analysis
The confusion matrix was calculated to evaluate the multiclass classifier, whereas recall, precision, and the more balanced F1-score were used to measure per-class performance. Spearman correlation analysis was performed to assess the correlation between the radiologist-estimated CT score and the TB score quantified by the algorithm. The correlation was defined as mild (r<0.3), moderate (0.3≤r<0.5), good (0.5≤r<0.8), or strong (r≥0.8). We also calculated the interobserver reliability of the subjective CT score in 30 randomly chosen cases by using the intraclass correlation coefficient. The TB scores of severe and non-severe patients were compared by Student t test. Statistical analysis was performed using SPSS (version 23.0, IBM). Statistical significance was defined as a P value <0.05.
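As an illustration of the per-class measures and the correlation bands defined above, a minimal sketch follows. It assumes the rows of the confusion matrix are true classes and the columns are predicted classes; the study itself used SPSS, so this is not the authors' code.

```python
import numpy as np

def per_class_metrics(confusion):
    """Return (recall, precision, f1) arrays from a confusion matrix.

    Rows are assumed to be true classes, columns predicted classes.
    Classes with an empty denominator receive a value of 0.
    """
    cm = np.asarray(confusion, dtype=float)
    tp = np.diag(cm).copy()
    actual = cm.sum(axis=1)
    predicted = cm.sum(axis=0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return recall, precision, f1

def correlation_strength(r):
    """Label a correlation coefficient using the bands in the text."""
    if r < 0.3:
        return "mild"
    if r < 0.5:
        return "moderate"
    if r < 0.8:
        return "good"
    return "strong"

recall, precision, f1 = per_class_metrics([[8, 2], [1, 9]])
```

The F1-score is the harmonic mean of precision and recall, which for one class equals 2TP / (2TP + FP + FN).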

Performance of the AI model
Our AI cascading model consisted of four subsystems, which provide consistent visual descriptions: (1) screening to distinguish between normal and abnormal CT images, (2) object detection and localization of pulmonary infectious lesions, (3) diagnostic assessment of radiological features (6 types) and TB activity, (4) severity evaluation.

Slice selection
We performed a slice-level analysis to select the top ten positive slices for each CT scan according to the predicted probability. The average accuracy in the training and validation cohorts was 99.6% and 99.8%, respectively.

Object detection and localization
An advanced real-time object detection algorithm based on CenterNet was used to localize lesions, which yielded a mAP of 0.68 on the validation cohort. In the test Dataset-2, 563 candidate regions were detected by the AI model, including 518 true positive (TP), 45 false positive (FP), and 26 false negative (FN). In the test Dataset-3, 502 candidate regions were detected with 440 TP, 62 FP, and 23 FN. In the test Dataset-4, 931 candidate regions were detected with 869 TP, 62 FP, and 37 FN.
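Detection precision and recall follow directly from the counts reported above. The sketch below derives them for illustration only; these derived values are not reported in the paper, and `detection_metrics` is a hypothetical helper.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall) from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# (TP, FP, FN) counts as reported for the three external test datasets.
counts = {
    "Dataset-2": (518, 45, 26),
    "Dataset-3": (440, 62, 23),
    "Dataset-4": (869, 62, 37),
}

for name, (tp, fp, fn) in counts.items():
    precision, recall = detection_metrics(tp, fp, fn)
    print(f"{name}: precision={precision:.3f}, recall={recall:.3f}")
```

In each dataset the number of candidate regions equals TP + FP, so precision is the share of detected candidates that were true lesions, and recall is the share of true lesions that were detected.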

Classification and diagnosis
The classification CNN demonstrated a training accuracy of 99.6% ± 1.0 (1531/1537 patches) across the six critical imaging findings indicative of TB. Over the 20 iterations, the average model accuracy for overall six-way lesion type classification in the validation cohort was 87.37% (346/396). The recall rate for individual lesion types ranged from 83.87% (52/62) for clusters of nodules to 90.00% (90/100) for calcified granulomas. The classification results are summarized as a confusion matrix for the critical imaging findings predicted by the AI model (Fig. 4a). In the validation cohort, the DL model showed good discriminative performance for each independent critical finding (area under the curve, AUC = 0.959-0.983, Fig. S3). The per-class classification results of the AI model are shown in Table S1.
To validate the general applicability of our AI system, we obtained CT images from an NIH open-source dataset and additional data from our collaborators. In the independent test cohorts (Dataset-2, Dataset-3, and Dataset-4), the overall classification accuracy for the six pulmonary infiltrate types was 91.50% (474/518), 87.65% (440/502), and 86.08% (748/869), respectively. The confusion matrix for each testing dataset is shown in Fig. 4a. The corresponding per-class recall, precision, and F1-score are listed in Table 2.

Severity evaluation
A Grad-CAM framework for automatically highlighting pulmonary lesions was used to assess the extent of the disease. The intraclass correlation coefficient for agreement between the two radiologists' subjective scores was strong (0.92, 95% CI: 0.90-0.95). As displayed in the attention heatmap obtained by fusion (Fig. 5a), the suspicious infectious areas discovered by the AI model closely matched the actual pulmonary TB lesions. Spearman correlation analyses demonstrated a good correlation between the AI model-quantified "TB score" and the radiologist-estimated CT score (r ranged from 0.545 to 0.713) in the validation cohort. The correlation results are summarized in Table 3. The TB scores per lobe in the severe patients were significantly higher than those in the non-severe patients in the validation and testing sets (all P<0.05; Fig. 5b). Fig. 6 shows some examples of TB and the corresponding prediction results.

Discussion
In this retrospective, multicohort, diagnostic study, we developed and evaluated an AI cascading model for fully automated diagnosis and triage of pulmonary tuberculosis based on chest CT images. We found that the model was useful for the detection and classification of critical imaging features and achieved an overall accuracy of 0.86-0.92 on external datasets. Moreover, the attention heatmap highlighted infectious areas for TB burden evaluation with human-level accuracy. The AI system succeeded in stratifying patients into severe and non-severe groups by the TB score quantified by the algorithm. Our study demonstrated that the DL system allowed accurate detection, diagnosis, and severity assessment of TB lesions. The proposed AI system can help clinicians meet the significant demand for pulmonary TB screening, diagnosis, and follow-up in daily clinical practice.
TB is the leading cause of death among all infections worldwide, killing an estimated 1.4 million people each year [19]. Although the overall incidence has slowly decreased during the past decade, TB remains an enormous burden globally [3]. The disease is mainly transmitted by cough aerosol and is usually characterized by necrotising granulomatous inflammation in the lung (~85% of cases) [20]. Prompt detection, timely treatment, and routine follow-up are priorities to prevent TB-related morbidity and spread to unaffected people. Although sputum and blood assays have been developed as the standard for diagnosing TB, these tests are inadequate for assessing smear-negative TB, and results can take up to several days [2]. Lung biopsy provides a pathologically proven diagnosis but is an invasive procedure with significant comorbid risks [21]. Chest radiography is a widely available imaging tool for the screening and diagnosis of TB [22]. In comparison, CT is more precise than radiography in the detection and localization of pulmonary abnormalities [4]. Experimental results have shown that CT scans can help radiologists review suspected TB cases when chest radiographs are inconclusive [23]. Furthermore, CT imaging features (including centrilobular nodules, tree-in-bud, consolidation, and cavitation) have demonstrated a strong correlation with the positivity and grading of sputum microbiology results [24].
AI has become the state of the art for medical image analysis and plays a role in supporting clinical decision making with respect to diagnosis and risk stratification [7,25]. Lakhani et al [26] recently reported a deep learning CNN system that classified TB on chest radiographs quickly and accurately, with a sensitivity of 97.3% and a specificity of 100.0%. When evaluating a three-dimensional CT scan, abnormal slices in the full series of images must first be identified. To reduce the computational burden of the AI model, we used a pretrained network to select the ten key slices with the highest confidence to represent a complete CT scan in this study. We obtained an accuracy of 99.80%, similar to that of the BConvLSTM U-Net method for image selection in the literature [16]. However, this strategy comes with a tradeoff: diagnostic information in the slices that are not selected for further analysis may be missed.
In lung diseases, several previous studies have explored the detection and classification of pulmonary infections, especially during the recent COVID-19 pandemic [16][27][28][29]. Jaeger et al. [30] used a simple but effective one-stage detection model, Retina U-Net, for lesion detection and localization, achieving a mAP of 0.50. In our work, a CNN based on the CenterNet architecture was built and trained with human annotations of the inflammatory areas. The DL system automatically discovered the suspected regions that are strongly indicative of TB, with a mAP of 0.68. More recently, Li and colleagues [31] reported a state-of-the-art 3D deep learning model that annotates the spatial location of lesions and classifies five critical CT imaging types of TB disease (miliary, infiltrative, caseous, tuberculoma, and cavitary), with a classification precision of 90.9%. Our model has shown similar overall accuracy (0.86-0.92 vs. 0.91) for six typical CT imaging findings. Another novel development in our AI model is the ability to analyze the imaging features and predict activity simultaneously. The higher accuracy of participant-wise prediction (98% compared with a region-wise accuracy of 90%) is reasonable: errors in region-wise predictions are reduced when the predictions are pooled to identify patients with active TB. This could support more effective identification, intervention, and isolation of active cases.
Further, computer-aided CT image analysis tools were recently developed for the diagnosis and evaluation of disease burden in coronavirus-positive patients, which can help to predict progression to critical illness [14]. Similarly, Shan et al [29] reported a CT-based deep learning system for automatic segmentation and quantification of infection regions, which displayed suspicious abnormal areas distributed in both lungs in a slice-based "heat map". This would allow consistent and quantitative automatic infection delineation and severity prediction. In this study, we demonstrated that the proposed "TB score" corresponds to disease severity. The lesion percentages determined by radiologists and by the AI model showed a moderate to good correlation (r, 0.453-0.761). Furthermore, we demonstrated significant differences in TB score between the severe and non-severe groups in the testing datasets (all P < 0.05). We therefore hypothesize that such a TB score may provide a quantitative tool for patient follow-up and management to monitor the progression and regression of findings. More importantly, unlike prior supervised algorithms based on slice-level analysis, our approach can search an entire CT study without human guidance. The exported quantitative report, with the overall TB infection probability, imaging features with spatial coordinates, and activity and severity predictions, may then serve as an effective reference for clinical decision making, which is well suited to real-life health services.
Our study has several limitations. First, the sample size was relatively small, and insufficient data for training the AI networks could limit performance. CT data from different centers are quite heterogeneous in scanning parameters and slice thickness. However, such heterogeneous data may make the results more generalizable. As demonstrated in the independent test cohorts, our AI model is robust to slice thickness and lesion distribution. Second, this study focused only on the six typical types of pulmonary infiltrates and ignored other rare signs, including pleural effusion and enlarged lymph nodes. The sample size of the rare imaging features was not sufficient for training. Additionally, CT examinations were consecutively acquired, which may introduce selection bias across the different types of imaging findings. Third, given the overlap of imaging features with other types of pneumonia, our AI system needs to be trained with a larger dataset that includes abnormalities due to other diseases, such as community-acquired pneumonia or fungal infections. Lastly, complex deep learning models are known to face the problem of overfitting. The external validation datasets in our study are inherently different because the cohorts comprise patients with different disease burdens. A slight drop in performance in the test phase was observed in this study.

Conclusions
In conclusion, our deep learning cascading model based on chest CT images can be clinically applicable for accurate detection, diagnosis and triage of TB in the lungs. This fully automated AI system has great potential to be utilized in clinical practice for rapid assessment of disease activity and guidance of therapy and management for patients with pulmonary TB.

Declarations
Ethics approval and consent to participate
This study was approved by the Ethics Committee of Nanfang Hospital of Southern Medical University. Informed consent and clinical trial registration were waived owing to the retrospective nature of this study.

Consent for publication
Not applicable.

Availability of data and materials
The datasets used or analyzed during the current study are available from the corresponding author on reasonable request.

Table 3
Correlation coefficient (r) of the AI-quantified TB score and the radiologist-estimated subjective score in lung lobes.

Figure 1
Flowchart of the study process for the training and testing datasets.

Figure 2
Illustration of the proposed cascading AI pipeline. Our AI diagnostic system consisted of four subsystems, which provide consistent visual descriptions: (1) screening to distinguish between normal and abnormal CT images, (2) object detection and localization of pulmonary infectious lesions, (3) diagnostic assessment of radiological features (6 types) and TB activity, (4) severity assessment. AI artificial intelligence, CT computed tomography, TB tuberculosis.