Deep Learning Classification of Active Tuberculosis Using Chest X-Rays: Efficacy of Transfer Learning and Generalization Performance of Cross-Population Datasets

James Devasia, Jawaharlal Institute of Post Graduate Medical Education and Research; Hridyanand Goswami, Marwari Hospitals; Subitha Lakshminarayanan (subitha.l@gmail.com), Jawaharlal Institute of Post Graduate Medical Education and Research; Manju Rajaram, Jawaharlal Institute of Post Graduate Medical Education and Research; Subathra Adithan, Jawaharlal Institute of Post Graduate Medical Education and Research; Ambalavanan Bharanidharan, Sri Ramakrishna Engineering College


Introduction
Tuberculosis (TB) is a deadly infectious disease caused by the bacillus Mycobacterium tuberculosis. In 2019, the World Health Organization (WHO) estimated 10 million (8.9-11.0, 95% UI) active cases and 1.2 million (1.1-1.3, 95% UI) deaths from TB, plus an additional 208,000 (177,000-242,000, 95% UI) deaths among people with HIV, all from a single infectious agent despite the disease being preventable and curable 1 . The goal of the United Nations General Assembly meeting held in September 2018 was to detect and treat 40 million people with Tuberculosis by 2022 2 . One of the strategies to achieve this goal is to improve and expedite the screening and triage procedures for active Tuberculosis. Sputum smear microscopy, rapid molecular tests, and culture tests are the principal methods for diagnosing active TB and drug-resistant TB; however, these approaches are relatively expensive and not easily available or applicable in low-resourced regions. The manifestations of Tuberculosis on chest radiography (CXR) images are broadly classified into parenchymal and pleural involvement. The most common parenchymal pathologies are consolidation, cavitation, reticular opacity, fibrosis, bronchiectasis, calcification, hilar adenopathy, and collapsed lung. In the pleural cavity, the most commonly exhibited pathologies are pleural effusion, thickening, calcification, and pneumothorax 3 . Posteroanterior (PA) chest radiography therefore plays a vital role as a screening tool for detecting Tuberculosis in many algorithms, as it is a fast and economically viable solution despite its low specificity and high sensitivity 4,5 .
In emerging economies, the lack of trained clinicians or radiologists in remote areas and inter/intra-reader variability lead to delays in diagnosis and missed active Tuberculosis cases 6,7 ; sometimes another pathology is misdiagnosed as TB. In this scenario, there has been increased interest in developing and using Computer-Aided Detection (CAD) for radiology interpretation using diverse methods, which show promising results 8-12 . Recent (2021) WHO guidelines recommend using CAD packages for automated screening and triage of active TB among individuals above 15 years of age based on the interpretation of digital CXR 6 ; 56% of active TB cases that developed in 2019 occurred in individuals aged 15 years and above 1 .
The Deep Learning research community was attracted to radiology interpretation for detecting many diseases after the breakthrough success of the winning software AlexNet 13 , based on Convolutional Neural Networks (CNN), in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Before 2012, CAD packages for detecting TB were handcrafted Machine Learning approaches and used TB-specific textural feature selection and classification [14][15][16][17] . Since 2012, classification of radiology images has gained much attention through state-of-the-art Deep Convolutional Neural Network (DCNN) technologies, transfer learning, and publicly available large datasets with interpretations. This has resulted in remarkable accuracy in lung segmentation 18 using a Total Variation-based Active Contour algorithm, lung nodule detection 19 with RetinaNet and a modified U-Net, and localization and multi-label classification of cardiothoracic diseases 20 on par with experienced radiologists and clinicians.
Dataset shift, also known as a shift in the distribution of the variables, is a known issue in predictive models where training and testing data differ in the distribution of single or multiple features or of the class itself [21][22][23] . Reported works 8,9,11,14,16,17,20,24 on classification of Tuberculosis use training and testing data from the same dataset to measure the performance metrics used to evaluate the model. Few predictive models in biomedical research have addressed this phenomenon 10,25,26 , and few have reported and incorporated changes to tackle the situation 27,28 . Reported works 25,29,30 established that dataset shift leads to poor generalizability in deep learning predictive models. There is a paucity of evidence regarding cross-regional or cross-population train/test experiments and their effect on diagnostic accuracy in detecting TB. In this work, we assess the efficacy of transfer learning and the diagnostic generalization performance of DCNNs using cross-population train/test datasets to detect active Tuberculosis.

Materials And Methodology
A. Study design
Retrospective study with model creation using transfer learning and analysis of diagnostic generalization to detect active Tuberculosis.

B. Dataset sources and curation
The following four datasets, two of them publicly available, were used to assess cross-population diagnostic classification accuracy for active Tuberculosis. Patients were drawn from the Revised National Tuberculosis Control Programme (RNTCP) referral register and were confirmed as active TB using sputum or culture tests from 2017-2020. CXR were downloaded from PACS using the patient identifier in Tag Image File Format (TIFF). Healthy control subjects' demography and CXR were collected from an ongoing TB project.
All CXR used in the study were de-identified using a system-generated study identifier, and any overlay information in the CXR was removed to protect patient privacy. There were no missing data on patient demographics or CXR. The basic characteristics of the datasets are given in Table 1, which also lists the final number of abnormal and healthy images for training, validation, and testing of the models.

C. Pre-processing of CXR
The input data were pre-processed before being fed to the model as follows: (1) segmentation of the region of interest (ROI) using the U-Net 33 architecture; we captured the extreme points of the lung-area ROI and padded them with an extra 50 pixels to ensure the bounding box retained the full ROI (e.g., in TB patients with pleural effusion); (2) extraction of the ROI, resized to 224x224 and saved in Joint Photographic Experts Group (JPEG) format. Supplementary Figures S1 and S2 show the pre-processing pipeline for all CXR images.
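The padded-bounding-box step can be sketched as follows; `padded_roi_bbox` is a hypothetical helper (not the authors' code), assuming the U-Net output is available as a binary lung mask:

```python
import numpy as np

def padded_roi_bbox(mask, pad=50):
    """Bounding box (top, bottom, left, right) of the lung mask,
    padded by `pad` pixels and clipped to the image borders."""
    rows = np.any(mask, axis=1)   # rows containing any lung pixel
    cols = np.any(mask, axis=0)   # columns containing any lung pixel
    top, bottom = np.where(rows)[0][[0, -1]]
    left, right = np.where(cols)[0][[0, -1]]
    h, w = mask.shape
    return (max(top - pad, 0), min(bottom + pad, h - 1),
            max(left - pad, 0), min(right + pad, w - 1))
```

Cropping the CXR with this box, then resizing to 224x224 and saving as JPEG (e.g., with Pillow), would complete the pipeline described above.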

E. Data Partitions
We randomly selected 55% (1672) of the IN collection healthy control images to use in conjunction with the NIAID dataset for training, validation, and testing, similar to the approach of Lakhani & Sundaram 24 . Each dataset was split into training (80%), validation (10%), and intramural holdout test (10%) sets. Repeated images of the same patients were included only in the training set to avoid data leakage into the holdout test set.
Extramural test sets are datasets that were not used for training and validation.
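A patient-level split along these lines can be sketched as below. This is a simplified illustration, not the authors' exact procedure; `records` pairs an image identifier with its patient identifier, and patients with repeated images are routed to the training set:

```python
import random
from collections import Counter

def split_dataset(records, seed=42):
    """Split (image_id, patient_id) records into train/val/test.
    Patients with multiple images go to training only (no leakage);
    the remaining single-image patients are split 80/10/10."""
    counts = Counter(pid for _, pid in records)
    multi = {pid for pid, c in counts.items() if c > 1}
    train = [r for r in records if r[1] in multi]
    singles = [r for r in records if r[1] not in multi]
    rng = random.Random(seed)
    rng.shuffle(singles)
    n = len(singles)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train += singles[:n_train]
    val = singles[n_train:n_train + n_val]
    test = singles[n_train + n_val:]
    return train, val, test
```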

F. Model
ResNet50 and DenseNet121 networks pre-trained on ImageNet, which are widely used in medical image classification, were used in this work as base models. The input image size for ResNet50 and DenseNet121 was set to 224x224. We used weights from ImageNet (transfer learning) to initialize the networks. The classifier in the base model was replaced with the following: (1) a Global Average Pooling layer, (2) a Dense layer with ReLU activation, (3) a Dropout layer, and (4) a classifier layer with two outputs and softmax activation. The deep learning framework was based on TensorFlow (version 2.7.0, Google Brain Team, CA, USA, https://tensorflow.org) and Keras (version 2.7.0, https://keras.io), with Python (version 3.7, Python Software Foundation, DE, USA, https://python.org) as the programming language. The desktop computer was equipped with an Intel i9-9820X CPU @ 3.30 GHz, 64 GB RAM, and dual NVIDIA GeForce RTX 2080Ti GPUs with 11 GB memory each. Figure 1 shows the outline of the DCNN model.
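The replacement head can be sketched with Keras as below. The Dense units and Dropout rate shown are placeholders, since the actual values were chosen by hyperparameter tuning; the `build_model` helper and its parameters are illustrative, not from the paper:

```python
import tensorflow as tf

def build_model(base="densenet", weights="imagenet",
                units=256, dropout=0.5):
    """Backbone (ResNet50 or DenseNet121) with the replacement head:
    GlobalAveragePooling -> Dense(ReLU) -> Dropout -> 2-way softmax."""
    cls = (tf.keras.applications.DenseNet121 if base == "densenet"
           else tf.keras.applications.ResNet50)
    backbone = cls(include_top=False, weights=weights,
                   input_shape=(224, 224, 3))
    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    x = tf.keras.layers.Dense(units, activation="relu")(x)
    x = tf.keras.layers.Dropout(dropout)(x)
    out = tf.keras.layers.Dense(2, activation="softmax")(x)
    return tf.keras.Model(backbone.input, out)
```

Passing `weights="imagenet"` reproduces the transfer-learning initialization described in the text; `weights=None` builds the same architecture with random initialization.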

G. Training
All layers except the Batch Normalization layers in the base model were set non-trainable, as the mean and variance of the training dataset differed from ImageNet 37 . Hyperparameter tuning for the added layers was carried out with the Keras Hyperband Tuner 38 (max_epochs=10, hyperband_iterations=5) to optimize the Dense layer units, Dropout rate, and learning rate. Training was performed using categorical cross-entropy as the loss function and the Adam 39 optimizer. The network was trained with mini-batches of 32 samples, shuffled on each epoch to randomize the training and reduce overfitting. We also employed Early Stopping guided by validation loss to reduce overfitting. We did not use any data augmentation technique in this work. Model selection was performed by the Keras Tuner. A summary of the optimal ResNet50 and DenseNet121 models for the various datasets is shown in Supplementary Table S2.
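The freezing rule described above (every base layer frozen except Batch Normalization) can be sketched as follows; `freeze_base` is an illustrative helper, and the commented training setup mirrors the text rather than the authors' actual script:

```python
import tensorflow as tf

def freeze_base(base_model):
    """Set every layer non-trainable except BatchNormalization, whose
    statistics should adapt to the CXR distribution (which differs
    from ImageNet)."""
    for layer in base_model.layers:
        layer.trainable = isinstance(
            layer, tf.keras.layers.BatchNormalization)
    return base_model

# Remaining training configuration, as described in the text (sketch):
# model.compile(optimizer=tf.keras.optimizers.Adam(),
#               loss="categorical_crossentropy", metrics=["accuracy"])
# early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
#                                               restore_best_weights=True)
# model.fit(train_data, validation_data=val_data, batch_size=32,
#           shuffle=True, callbacks=[early_stop])
```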

H. Evaluation Metrics
Testing of the models was done using the intramural holdout and extramural test sets. The following metrics were used to evaluate the efficacy of the models on the test sets: (1) Sensitivity, (2) Specificity, (3) Area Under the Receiver Operating Characteristic curve (AUC), (4) Accuracy, (5) Precision, and (6) F1-score. Confidence Intervals (CI) for AUC were calculated using the Hanley & McNeil method 40 , and CIs for sensitivity and specificity were obtained using the Wilson Score method 41 . Statistical analysis was done using the Python 3.7 statistical libraries, and a P-value of 0.05 was considered statistically significant.

Ethics approval and consent
Ethical approval for this study was obtained from the Institutional Ethics Committee for Observational Studies of Jawaharlal Institute of Postgraduate Medical Education and Research (JIPMER), Puducherry, India (JIPMER ethics committee number JIP/IEC/2019/533). A waiver of written informed consent was approved by the JIPMER institutional ethics committee, as all data sources used (patient demography, laboratory records, and chest radiography) were previously available and no patients needed to be contacted. Additionally, all data were collected anonymously and de-identified using a study identifier before review by the radiologist and before model development, validation, and testing. All methods were carried out in accordance with Indian Council of Medical Research (ICMR) and International Council for Harmonisation Good Clinical Practice (ICH-GCP) guidelines and regulations.

Results
The SH collection consists of 662 PA CXR, of which 336 show various manifestations of TB and 326 are from healthy control patients, including 35 pediatric (<=18 years of age) CXR. The MC collection contains 138 PA CXR, 13% of them pediatric, comprising 58 abnormal CXR and 80 from healthy patients. The NIAID collection entails 1678 abnormal images, including 45 pediatric images. The IN dataset entails 4392 PA chest X-rays, consisting of 30% abnormal and 70% normal images, with 3% (172) of the images being pediatric. The exclusion criteria for CXR are given in Supplementary Table S1. We excluded patients under 15 years of age from model training as per the new WHO guidelines 6 . We also excluded inactive TB and Pneumonia images from the SH and MC collections, as the objective of this study is to determine the diagnostic accuracy for active TB cases.
Demographic characteristics of TB subjects and healthy controls for all datasets are presented in Table 2. Between active TB cases and healthy controls, the female distribution of CXR was significantly different in all datasets except SH. The mean age (SD) was significantly different between active TB cases and healthy controls in all datasets. Multiple images from the same patients were predominantly seen among TB CXR compared to control CXR. Except for the MC dataset, all datasets are balanced in the number of CXR images in each class.

Supplementary Figure S3 portrays the comparison of the 95% CIs of AUC across the ResNet50 and DenseNet121 models. The intramural test set of the MC dataset contained fewer than 10 TB and fewer than 10 healthy normal CXR and thus had wide CIs. It is evident from the evaluation metric results in Table 3 that the ResNet50 model trained on SH and tested on NIAID showed the worst performance of all extramural evaluations. Overall accuracy for ResNet50 ranged from 30.02% to 100%. Precision across the ResNet50 models ranged from 21.55% to 100%, with the lowest for the model trained on SH and tested on NIAID and the highest for the model trained and tested on the IN dataset; for the DenseNet121 models it ranged from 26.67% to 100%, with the lowest for the model trained on NIAID and tested on MC and the highest for the models trained on NIAID and tested on the intramural and IN datasets. The F1-score ranged from 17.78% to 100% for the ResNet50 models and from 14.81% to 99.62% for the DenseNet121 models. Details of accuracy, precision, and F1-score are available in Supplementary Table S3.

Discussion
The study aimed to analyze the efficacy of transfer learning and how training on one geographical dataset affects diagnostic accuracy and performance estimates on a different geographical dataset. Our experiments show that sensitivity, specificity, and AUC for the intramural test sets are consistent with previous studies on detecting TB with DCNNs 9,24,42 . Our cross-geographical train/test experiments exhibit instability in various performance estimates among models compared with similar published work. Our results show better AUC, sensitivity, and specificity for DenseNet121 trained on MC and tested on SH compared with the InceptionNet V3 model trained on MC and tested on SH reported by Das, Santosh, & Pal 26 , who employed histogram equalization for contrast enhancement. Santosh & Antani 10 reported better results than our experiments, but they employed a voting ensemble of three different classifiers and used handcrafted feature selection and lung symmetry for detecting TB. Moreover, both works 10,26 used the off-the-shelf MC and SH datasets, whereas our experiments excluded CXR from MC and SH based on the exclusion criteria.
A similar study 25 reported a considerable drop in AUC when an SH-trained InceptionNet V3 model was tested on the ChestX-ray8 dataset, even though the authors employed several data augmentation methods to improve generalisability. Data distribution shift plays a crucial role in the failure of machine learning systems in terms of generalisability. Although extensive data augmentation can generally improve generalization, in radiological datasets the use of unsuitable data augmentation can adversely affect model learning (horizontal flipping of CXR inadvertently creates a medical condition called situs inversus). We also examined the datasets used by commercial CE-marked CAD products on the market for diagnosing Tuberculosis, such as Genki, CAD4TB, and qXR: Genki reportedly used 1,500,000 CXR from 10 countries for training and was tested over 30 different imaging machines, and CAD4TB was trained with 1,000,000 CXR from several countries and continents.

Findings from our study indicate that it is unlikely that an accurate deep learning model built with transfer learning on ResNet50 and DenseNet121, trained and validated on one dataset, can detect active TB in geographically different population datasets. Even though lung pathology is consistent across populations for active Tuberculosis, discrepancies in the specifications and standard operating procedures of medical imaging machines, and in the underlying image manipulation techniques used to produce Digital Radiography and Computed Radiography modalities, inevitably play an essential role in the performance shift of the cross-population train and test method. Moreover, the technical quality of the CXR, image conversion (DICOM to JPG), and image resolution also play a crucial role in classification accuracy. Different modalities and equipment were used for image acquisition in the datasets in our experiment, which potentially influenced the learning and classification of active TB.
There are some limitations to this study. First, we did not use any data augmentation methods or k-fold cross-validation. Second, we used a relatively small dataset (MC, 109 instances) and a mid-size dataset (IN, 2655 instances) in our experiments. The study's strengths were the use of multiple datasets across different geographical regions, state-of-the-art deep learning architectures, and hyperparameter tuning with the Keras Hyperband Tuner to optimize model parameters.

Conclusion
This work has addressed the efficacy and fidelity of using transfer learning on geographically different train and test datasets for detecting TB. We demonstrated the cross-population test on two state-of-the-art models without any data augmentation. The results revealed significant variance in AUC, sensitivity, specificity, accuracy, and other measures when the train and test datasets differ. The paucity of Tuberculosis datasets and their annotations in the public domain remains the main hindrance for many researchers in this field. Further investigation in this area should combine CXR from multiple countries, or CXR generated from different imaging machines and modalities, to make deep learning usable in real-world TB screening scenarios.