A. Study design
This was a retrospective study in which models were built using transfer learning and their diagnostic generalization for detecting active tuberculosis was analysed across populations.
B. Dataset sources and curation
Four datasets were used to assess cross-population diagnostic classification accuracy for active tuberculosis. Two are publicly available, Health Insurance Portability and Accountability Act (HIPAA)-compliant datasets maintained by the National Library of Medicine, Maryland, USA: (a) the Shenzhen No. 3 People’s Hospital, Guangdong Medical College, Shenzhen, China (SH) collection31 and (b) the Department of Health and Human Services, Montgomery County, Maryland, USA (MC) collection31. The SH and MC collections are in Portable Network Graphics (PNG) format. The third dataset, (c) the TB Portals Program (TBPP) data, National Institute of Allergy and Infectious Diseases (NIAID), Bethesda, Maryland, USA (NIAID) collection32, follows the HL7 Fast Healthcare Interoperability Resources (HL7 FHIR) standard and is in Digital Imaging and Communications in Medicine (DICOM) format. The NIAID dataset comprises primarily drug-resistant TB cases; its PA CXRs come from the TB Portals Consortium and participating institutions in India, Belarus, Romania, Georgia, and Azerbaijan, with a significant share from Belarus. The fourth dataset, (d) the Jawaharlal Institute of Postgraduate Medical Education and Research (JIPMER Hospital), Puducherry, India (IN) collection, uses an HL7 integration interface between the Picture Archival and Communication System (PACS) and the Health Information System. For the IN collection, patient demographics were collected from the Revised National Tuberculosis Control Programme (RNTCP) referral register, in which patients were confirmed as having active TB by sputum or culture tests from 2017 to 2020. CXRs were downloaded from the PACS using the patient identifier in Tag Image File Format (TIFF). Healthy control subjects’ demographics and CXRs were collected from an ongoing TB project.
All CXRs used in the study were de-identified using a system-generated study identifier, and any overlay information in the CXRs was removed to protect patient privacy. There were no missing data on patient demographics or CXRs. The basic characteristics of the datasets are given in Table 1, which also lists the final numbers of abnormal and healthy images used for training, validating, and testing the models.
C. Pre-processing of CXR
The input data were pre-processed before being fed to the model as follows: (1) segmentation of the region of interest (ROI) using the U-Net33 architecture; we captured the extreme points of the lung-area ROI and padded the bounding box with an extra 50 pixels to ensure it retained the full ROI (e.g., in TB patients with pleural effusion); (2) extraction of the ROI, resizing to 224×224, and saving in Joint Photographic Experts Group (JPEG) format (Supplementary Figures S1 and S2 show the pre-processing pipeline for all CXR images); (3) application of the architecture-specific pre-processing functions of ResNet5034 and DenseNet12135 to all saved images during training, validation, and testing.
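The padded bounding-box step can be sketched as follows; `padded_roi_bbox` is a hypothetical helper (not the authors' code), and the binary lung mask is assumed to come from the U-Net segmentation:

```python
import numpy as np

def padded_roi_bbox(mask, pad=50):
    """Bounding box of the lung ROI from a binary mask, padded by `pad`
    pixels on each side (clipped to the image) so context such as a
    pleural effusion at the lung border is not cropped away."""
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    y0 = max(int(ys.min()) - pad, 0)
    y1 = min(int(ys.max()) + pad + 1, h)
    x0 = max(int(xs.min()) - pad, 0)
    x1 = min(int(xs.max()) + pad + 1, w)
    return y0, y1, x0, x1

# Usage: crop, then resize the crop to 224x224 with any image library
# before saving as JPEG.
mask = np.zeros((300, 300), dtype=np.uint8)
mask[100:200, 120:180] = 1
y0, y1, x0, x1 = padded_roi_bbox(mask)
roi = mask[y0:y1, x0:x1]
```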
D. Ground Truth
All PA TB CXRs in the IN collection were read by a single radiologist (HG) for various TB manifestations and reported in a specified format approved by the institute committee. Peer validation of 10% of the CXRs was performed by a radiologist (SA) and a pulmonologist (MR). The interobserver agreement36 between HG and SA and between HG and MR was almost perfect (κ = 0.83, 95% CI [0.72–0.93] and κ = 0.80, 95% CI [0.71–0.94], respectively). The normal CXRs in the IN collection were read by a clinician and peer-validated by MR; the inter-rater reliability was almost perfect (κ = 0.90, 95% CI [0.87–0.98]). The ground truth and subject demographics for the MC, SH, and NIAID datasets were obtained from the corresponding dataset sources.
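The agreement statistic reported above is Cohen's kappa; a minimal sketch of its computation for two readers with binary labels (the helper name is illustrative):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two readers' labels of the same cases.
    a, b: equal-length lists of labels (e.g., 1 = abnormal, 0 = normal).
    Assumes the readers are not in complete chance agreement (pe < 1)."""
    n = len(a)
    # Observed agreement: fraction of cases where both readers agree.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each reader's marginal label rates.
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)
```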
E. Data Partitions
We randomly selected 55% (1,672) of the healthy control images in the IN collection for use in conjunction with the NIAID dataset for training, validation, and testing, similar to the approach of Lakhani, P. & Sundaram24. Each dataset was split into training (80%), validation (10%), and intramural holdout test (10%) sets. Repeated images of the same patients were included only in the training set to avoid data leakage into the holdout test set. Extramural test sets are datasets that were not used for training or validation.
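A patient-level split of this kind might be sketched as follows; `patient_level_split` and the record layout are illustrative assumptions, not the authors' code. Patients with repeated images are forced into the training set, so no patient spans two partitions:

```python
import random
from collections import defaultdict

def patient_level_split(records, train_frac=0.8, val_frac=0.1, seed=42):
    """records: list of (patient_id, image) pairs.
    Returns (train, val, test) record lists. Patients with more than one
    image go to training only, preventing leakage into the holdout set."""
    by_patient = defaultdict(list)
    for pid, img in records:
        by_patient[pid].append(img)
    repeated = [p for p, imgs in by_patient.items() if len(imgs) > 1]
    singles = [p for p, imgs in by_patient.items() if len(imgs) == 1]
    rng = random.Random(seed)
    rng.shuffle(singles)
    n = len(singles)
    n_train, n_val = int(train_frac * n), int(val_frac * n)
    groups = (set(repeated) | set(singles[:n_train]),      # train
              set(singles[n_train:n_train + n_val]),       # validation
              set(singles[n_train + n_val:]))              # holdout test
    return tuple([(p, i) for p in g for i in by_patient[p]] for g in groups)
```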
F. Model
ResNet50 and DenseNet121 networks pre-trained on ImageNet, which are widely used in medical image classification, served as base models in this work. The input image size for both ResNet50 and DenseNet121 was set to 224×224. We initialized the networks with ImageNet weights (transfer learning). The classifier head of each base model was replaced with the following: (1) a Global Average Pooling layer, (2) a Dense layer with ReLU activation, (3) a Dropout layer, and (4) a classifier layer with two outputs and softmax activation. The deep learning framework was built on TensorFlow (version 2.7.0, Google Brain Team, CA, USA, https://tensorflow.org) and Keras (version 2.7.0, https://keras.io), with Python (version 3.7, Python Software Foundation, DE, USA, https://python.org) as the programming language. The desktop computer was equipped with an Intel i9-9820X CPU @ 3.30 GHz, 64 GB RAM, and dual NVIDIA GeForce RTX 2080 Ti GPUs (11 GB each). Figure 1 shows the outline of the DCNN model.
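The head replacement described above can be sketched with the Keras functional API. The values of `dense_units` and `dropout_rate` are placeholders (the study tuned them; see Training), and `weights=None` is used here only to avoid downloading weights, whereas the study used `weights="imagenet"`:

```python
import tensorflow as tf

def build_model(base_name="resnet50", dense_units=256, dropout_rate=0.5):
    """Base CNN with the classifier replaced by GAP -> Dense(ReLU) ->
    Dropout -> Dense(2, softmax). Hyperparameter defaults are illustrative."""
    if base_name == "resnet50":
        base = tf.keras.applications.ResNet50(
            include_top=False, weights=None, input_shape=(224, 224, 3))
    else:
        base = tf.keras.applications.DenseNet121(
            include_top=False, weights=None, input_shape=(224, 224, 3))
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    x = tf.keras.layers.Dense(dense_units, activation="relu")(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    out = tf.keras.layers.Dense(2, activation="softmax")(x)
    return tf.keras.Model(base.input, out)
```

Each architecture also has a matching `preprocess_input` function (e.g., `tf.keras.applications.resnet50.preprocess_input`) applied to the images, as noted in the pre-processing section.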
G. Training
All layers except the Batch Normalization layers in the base model were set to non-trainable, as the mean and variance of the training dataset differed from those of ImageNet37. Hyperparameter tuning for the added layers was carried out with the Keras Hyperband tuner38 (max_epochs=10, hyperband_iterations=5) to optimize the Dense layer units, the Dropout rate, and the learning rate. Training used categorical cross-entropy as the loss function and the Adam39 optimizer. The network was trained with mini-batches of 32 samples, which were shuffled at each epoch to randomize training and reduce overfitting. We also employed early stopping based on the validation loss to further reduce overfitting. No data augmentation techniques were used in this work. Model selection was performed by the Keras Tuner. A summary of the optimal ResNet50 and DenseNet121 models for the various datasets is given in Supplementary Table S2.
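The layer-freezing rule and the early-stopping setup can be sketched generically; a small stand-in network is used here instead of the full ResNet50/DenseNet121, and the `patience` and `learning_rate` values are assumptions (the learning rate was tuned in the study):

```python
import tensorflow as tf

base = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(8, 3),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv2D(8, 3),
    tf.keras.layers.BatchNormalization(),
])

# Freeze every layer except Batch Normalization, whose statistics should
# adapt to the CXR data rather than keep ImageNet's mean and variance.
for layer in base.layers:
    layer.trainable = isinstance(layer, tf.keras.layers.BatchNormalization)

# Stop when validation loss stops improving; patience=3 is an assumed value.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)
```

The compiled model would then use `loss="categorical_crossentropy"`, the Adam optimizer, `batch_size=32`, and `shuffle=True` (the Keras default) in `model.fit`, with `early_stop` passed via `callbacks`.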
H. Evaluation Metrics
The models were tested on the intramural holdout and extramural test sets. The following metrics were used to evaluate model efficacy on the test sets: (1) sensitivity, (2) specificity, (3) area under the receiver operating characteristic curve (AUC), (4) accuracy, (5) precision, and (6) F1-score. Confidence intervals (CIs) for AUC were calculated using the Hanley & McNeil method40, and CIs for adjusted sensitivity and specificity were obtained using the Wilson score method41. Statistical analysis was performed using Python 3.7 statistical libraries, and a P-value below 0.05 was considered statistically significant.
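Both CI methods can be sketched in plain Python (the function names are illustrative):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion, e.g., sensitivity = TP / (TP + FN)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of the AUC per Hanley & McNeil, from the AUC value and
    the numbers of positive (abnormal) and negative (healthy) test cases."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc * auc / (1 + auc)
    num = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc * auc)
           + (n_neg - 1) * (q2 - auc * auc))
    return math.sqrt(num / (n_pos * n_neg))
```

A 95% CI for the AUC then follows as AUC ± 1.96 × SE.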
Ethics approval and consent
Ethical approval for this study was obtained from the Institutional Ethics Committee for Observational Studies of the Jawaharlal Institute of Postgraduate Medical Education and Research (JIPMER), Puducherry, India (JIPMER ethics committee number JIP/IEC/2019/533). A waiver of written informed consent was approved by the JIPMER institutional ethics committee, as all data sources used (patient demographics, laboratory records, and chest radiographs) were previously available and no patients needed to be contacted. Additionally, all data were collected anonymously and de-identified using a study identifier before reading by the radiologist and before model development, validation, and testing. All methods were carried out in accordance with Indian Council of Medical Research (ICMR) and International Council for Harmonisation Good Clinical Practice (ICH-GCP) guidelines and regulations.