Predictive Value of a Five-biomarker Signature to Diagnose Active Pulmonary Tuberculosis Patients

Background: Improving the accuracy with which active pulmonary tuberculosis (TB) is distinguished from pneumonia (PN) in low-income areas with inadequate facilities is one of the biggest problems facing public health today. Most TB patients are difficult to diagnose, especially those who are acid-fast bacillus smear-negative (AFB−) but IGRA®.TB test positive (IGRA+). We therefore aimed to develop a low-cost, rapid risk model to distinguish AFB− IGRA+ TB from PN. Methods: We retrospectively analyzed 41 laboratory variables of 204 AFB− IGRA+ TB and 156 PN participants. Candidate variables were identified by t-statistic test and univariate logistic model. Logistic regression analysis was used to construct the multivariate risk model and nomogram, with internal and external validation. Results: There were differential correlations between variable pairs in AFB− IGRA+ TB and PN. We found several significant variables in TB compared with PN. Among them, uric acid (UA) was elevated and acted as a protective factor with an odds ratio (OR) < 1. By integrating five variables, namely Age, UA, albumin (ALB), hemoglobin (Hb) and white blood cell count (WBC), we constructed a multivariate risk model with a concordance index (C-index) of 0.7 (95% CI: 0.61, 0.8). The nomogram showed that UA and Hb had a protective effect, while Age, WBC and ALB acted as risk factors for TB occurrence. Internal and external validation revealed good agreement between nomogram prediction and actual observations. Conclusions: Differential correlations existed between variable pairs in AFB− IGRA+ TB and PN. A combination of five biomarkers (Age, UA, ALB, Hb and WBC) can be used to distinguish TB from PN in AFB− IGRA+ clinical samples.


Background
Tuberculosis remains a severe global public health problem, especially in developing countries [1]. As of 2018, Jiangxi province, China, reported 34,000 new cases of tuberculosis each year, with a reported incidence of 70.83/100,000, and tuberculosis has long ranked second among infectious diseases. Reducing the tuberculosis epidemic in the province is a major issue for economic and social development.
Detection of Mycobacterium tuberculosis (Mtb) DNA using the Gene Xpert MTB/RIF assay is more sensitive and rapid for diagnosing TB and rifampicin resistance. However, because of its cost, environmental limitations and supply requirements, it is difficult to use for screening in low-income rural areas [2,3]. Interferon-gamma release assays (IGRAs) are commonly used in the diagnostic workup of Mtb, but distinguish poorly between active TB (TB) and latent tuberculosis infection (LTBI) [4,5]. In addition, the symptoms of tuberculosis are very similar to those of pneumonia, and many tuberculosis patients have no sputum specimens or are Mtb smear/culture-negative, which makes diagnosis even more difficult. Therefore, the development of new detection methods to improve the diagnostic accuracy of sputum smear-negative tuberculosis remains important.
Host markers, including secreted molecules in blood, have been reported as novel candidate markers to distinguish TB from LTBI, such as interferon-gamma inducible protein ten kD (IP-10) [6], interleukin 2 (IL-2) [7], IL-6 [8], C-reactive protein (CRP) [9] and vascular endothelial growth factor (VEGF) [8]. However, the reported biomarkers could not be applied in low-income areas because of technical, instrumentation or cost issues. To address this problem, the present study retrospectively analyzed the differences in routinely monitored blood laboratory variables between AFB− IGRA+ TB and PN, built a risk model to differentiate AFB− IGRA+ TB from PN, and validated its application in an external independent cohort.

Study design and subject oversight
The patients provided a full medical history, participated in regular physical examinations and underwent routine investigations, including HIV serology, chest radiography, IGRAs and microbiological sputum examination, where possible. HIV infection was excluded in all participants in this study. Pneumonia (PN) cases referred to upper or lower respiratory tract infections by viral or non-Mtb bacterial pathogens, although no attempts were made to identify the organisms by bacterial or viral cultures. Acid-fast bacilli (AFB)-negative active pulmonary TB (TB) participants were those with no sputum, or with negative smear and negative Mtb results, but with chest computed tomographic (CT) scan or chest X-ray evidence and symptoms responding to TB treatment.
The participants were divided into two groups: the discovery cohort and the external validation cohort. For the discovery cohort, participants were enrolled at Ganzhou Fifth Hospital (Ganzhou, China) between August 2018 and August 2020, including 748 AFB-negative TB participants and 531 PN participants. As the predefined goal was to assess the ability of laboratory biomarkers to distinguish IGRA-positive patients presenting with AFB-negative TB, 287 participants with IGRA-positive TB were selected for further analysis (AFB− IGRA+ TB). To analyze more biomarkers, participants with over 50% of laboratory data missing were excluded. In total, 89 TB and 38 PN participants with recorded biomarker values were used to construct a risk model after statistical analysis. The external validation cohort of 134 participants was collected from Shenzhen Third People's Hospital (Shenzhen, China) between June 2018 and June 2019 (Fig. 1). Among them, 15 were excluded due to missing variables, leaving 77 AFB− IGRA+ TB and 42 PN participants.
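The exclusion rule described above (dropping participants with more than 50% of laboratory values missing) can be sketched as follows. This is a minimal illustration in Python rather than the study's own code; the record layout, the variable subset and the use of `None` for a missing value are assumptions, and the toy data are invented.

```python
from typing import Optional

# Hypothetical record layout: variable name -> measured value, or None if missing.
Record = dict[str, Optional[float]]


def missing_fraction(record: Record) -> float:
    """Return the fraction of laboratory variables that are missing (None)."""
    if not record:
        return 1.0
    missing = sum(1 for value in record.values() if value is None)
    return missing / len(record)


def exclude_sparse(cohort: dict[str, Record], max_missing: float = 0.5) -> dict[str, Record]:
    """Keep participants whose missing fraction does not exceed max_missing."""
    return {
        pid: record
        for pid, record in cohort.items()
        if missing_fraction(record) <= max_missing
    }


# Toy data, invented purely for illustration.
cohort = {
    "P1": {"UA": 310.0, "ALB": 38.2, "Hb": 121.0, "WBC": 7.4},
    "P2": {"UA": None, "ALB": None, "Hb": None, "WBC": 9.1},
    "P3": {"UA": 280.0, "ALB": None, "Hb": 133.0, "WBC": None},
}
kept = exclude_sparse(cohort)
print(sorted(kept))
```

In this sketch a participant with exactly 50% missing values is retained, since the text excludes only those with over 50% missing.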

Data collection
The medical records of all participants were reviewed by experienced TB clinicians, including medical history, symptoms, clinical signs, microbiological tests, laboratory findings, chest CT, chest X-ray and treatment measures. Forty-one laboratory biomarkers were assessed by differential statistics and odds ratio (OR) calculation for variable selection (Additional file 2: Table S2).

Statistical analyses
For laboratory results, continuous variables were preprocessed by log2-transformation before data analysis. Laboratory variables were compared using moderated t-statistics in the R package "limma", which is suitable for data that may not be normally distributed [10,11]. P-values were adjusted by the false discovery rate (FDR). Differences in variables between two conditions were defined as statistically significant when the FDR was < 0.2 [12]. A generalized linear model (glm) was used to calculate the odds ratio (OR) for each laboratory biomarker.
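The preprocessing and multiple-testing steps above can be illustrated with a short sketch. It assumes that the FDR adjustment is the Benjamini-Hochberg procedure, which the text does not state explicitly; the function names and the raw p-values are hypothetical, and the moderated t-test of "limma" itself is not reproduced here.

```python
import math


def log2_transform(values: list[float]) -> list[float]:
    """Log2-transform strictly positive laboratory values."""
    return [math.log2(v) for v in values]


def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Invented raw p-values for five hypothetical variables.
raw = [0.001, 0.04, 0.03, 0.20, 0.50]
fdr = benjamini_hochberg(raw)
# Apply the FDR < 0.2 threshold used in the text.
significant = [q < 0.2 for q in fdr]
print(fdr, significant)
```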
The regression coefficient of the glm was regarded as the log OR. Variables whose FDR was < 0.2 or that were statistically significant (p-value < 0.05) in univariate glm analysis were candidates for construction of a multivariate logistic regression risk model (lrm) and nomogram with the R package "rms", and the final variables were determined using Akaike's information criterion (AIC) as a stopping rule. The goodness of fit of the multivariate risk model lrm was calculated by the Hosmer and Lemeshow C statistic. The performance of the nomogram was evaluated by the concordance index (C-index) and assessed by comparing nomogram-predicted vs actual observations, and bootstrapping with 1000 resamples was applied to calibration to decrease overfit bias [13,14].
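As a small worked illustration of the two quantities above, the sketch below converts a logistic regression coefficient into an OR and evaluates the AIC formula. The Wald confidence interval, the standard error and the log-likelihood value are assumptions added for illustration and are not results from the study; only the OR of 0.36 for UA is taken from the text.

```python
import math


def odds_ratio(beta: float, se: float, z: float = 1.96) -> tuple[float, float, float]:
    """Convert a logistic coefficient and its standard error to an OR with a Wald interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike's information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# A coefficient of ln(0.36) corresponds to the OR of 0.36 reported for UA.
# The standard error of 0.25 is hypothetical.
beta_ua = math.log(0.36)
or_ua, lower, upper = odds_ratio(beta_ua, se=0.25)

# Hypothetical log-likelihood for a model with five predictors plus an intercept.
model_aic = aic(log_likelihood=-70.0, n_params=6)
print(or_ua, lower, upper, model_aic)
```

A negative coefficient yields an OR below 1 (protective), and a positive coefficient yields an OR above 1 (risk factor).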
Differences in the correlations of biomarker pairs between TB and PN were analyzed with the R package "DGCA". DGCA transforms sample correlation coefficients to z-scores and uses differences in z-scores to calculate p-values of differential correlation between gene pairs [15]. The package "pheatmap" was used to draw a correlation heatmap between two conditions. All analyses and figures were generated in R version 4.0.3 (https://www.r-project.org/) and arranged for publishing using Photoshop CS3 (Adobe, San Diego, CA, USA).
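The z-score comparison of correlations described above can be sketched with the Fisher transformation. This is an illustration of the general technique under a normal approximation, not the DGCA implementation; the correlation values are invented, and only the group sizes (89 and 38) come from the text.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher transformation of a correlation coefficient."""
    return math.atanh(r)


def diff_corr_z(r1: float, n1: int, r2: float, n2: int) -> float:
    """z-score for the difference between two independent Pearson correlations."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se


def two_sided_p(z: float) -> float:
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Invented correlations for one variable pair in TB (n = 89) and PN (n = 38).
dz = diff_corr_z(r1=0.2, n1=89, r2=0.6, n2=38)
p = two_sided_p(dz)
print(dz, p)
```

A negative value here means the pair is less strongly correlated in the first group than in the second. Permutation-based p-values would replace the normal approximation used in `two_sided_p`.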

Results
Baseline characteristics of AFB− IGRA+ TB and PN participants
The characteristics of the AFB− IGRA+ TB and PN participants are shown in Additional file 1:

Differential laboratory biomarkers between AFB− IGRA+ TB and PN
To gain more insight into the overall change in correlation structure of laboratory variables between AFB− IGRA+ TB and PN, we used DGCA to visualize all variable pair correlations in both conditions (Fig. 2A). Variables in this heatmap were ordered by their median z-score correlation difference with all of the other variables, without restriction to positive correlations. By quantifying the difference in correlation between all variable pairs using permutation testing [15], we found a lower correlation between most variable pairs in TB than in PN (mean difference in z-score (dz) = 0.11, p = 0.04), suggesting that the TB group had a more coordinated perturbation of the profile than the PN group.
To estimate the odds of TB progression given exposure to these laboratory variables, the odds ratio (OR) was assessed for each variable with a univariate generalized linear model. We identified 11 variables significantly associated with TB progression (p-value < 0.05, Table 1). Among them, four variables indicated a protective effect against TB progression (OR < 1), including UA, red blood cell (RBC), Hb and albumin (ALB), while seven variables appeared as risk factors for TB progression (OR > 1).

Multivariate risk model to predict TB progression probability
Combining the results above, and using AIC as a stopping rule, we finally selected five laboratory variables (Age, UA, ALB, Hb and WBC) to develop a multivariate risk model with 89 AFB− IGRA+ TB and 38 PN participants (Figure 2D). Using the nomogram, we mapped the values of each variable to points on a scale axis ranging from 0 to 10. With a corresponding number of points assigned to given magnitudes of the variables, the risk probability was calculated from the cumulative point score for all the variables [16]. We found that UA had the most protective effect on TB progression, followed by Hb, while Age, WBC and ALB acted as risk factors (Fig. 3C).
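The point-scoring logic of a nomogram described above can be sketched as follows. The coefficients, intercept and value ranges are hypothetical placeholders and are not the fitted model from this study; the sketch only shows how per-variable points on a 0 to 10 axis add up to a total that maps back to a predicted probability in a logistic model.

```python
import math

MAX_POINTS = 10.0  # The text describes a point axis ranging from 0 to 10.

# Hypothetical intercept, coefficients and value ranges; NOT the study's fitted model.
INTERCEPT = 0.5
MODEL = {
    # name: (coefficient, lowest value, highest value)
    "UA": (-1.0, 6.0, 9.0),   # negative coefficient: protective
    "WBC": (0.4, 1.0, 4.0),   # positive coefficient: risk factor
}

# The variable with the widest effect range defines the full 0-10 axis.
LARGEST_SPAN = max(abs(c) * (hi - lo) for c, lo, hi in MODEL.values())


def points(name: str, value: float) -> float:
    """Points for one variable, shifted so its least risky value scores zero."""
    coef, low, high = MODEL[name]
    base = min(coef * low, coef * high)
    return (coef * value - base) / LARGEST_SPAN * MAX_POINTS


def probability(values: dict[str, float]) -> float:
    """Map the total points back to the linear predictor and then to a probability."""
    total = sum(points(name, value) for name, value in values.items())
    offset = sum(min(c * lo, c * hi) for c, lo, hi in MODEL.values())
    linear_predictor = INTERCEPT + offset + total * LARGEST_SPAN / MAX_POINTS
    return 1.0 / (1.0 + math.exp(-linear_predictor))


patient = {"UA": 7.0, "WBC": 2.0}
print(points("UA", 7.0), points("WBC", 2.0), probability(patient))
```

For a protective variable the highest points fall at its lowest value, which is why a variable such as UA contributes fewer points as its level rises.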
The multivariate risk model performed well in an independent validation cohort
Next, we prospectively collected an external validation cohort of 134 participants, of whom 15 were excluded due to missing variables, leaving 77 AFB− IGRA+ TB and 42 PN participants. The total points of each participant in the external cohort were calculated according to the established nomogram, and then used to evaluate the performance of the nomogram. The C-index of the nomogram for predicting the external cohort was 0.77 (95% CI: 0.68, 0.86, Fig. 3A red line). The calibration plot also showed good agreement between the prediction by the nomogram and actual observation (Fig. 3D), with a p-value of 0.13 by the Hosmer-Lemeshow test.
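For a binary outcome, the C-index reported above is the proportion of case-control pairs in which the case receives the higher predicted score, with ties counted as one half. The sketch below shows that calculation on invented scores and labels; it is an illustration of the metric, not the "rms" implementation, and it does not compute the confidence interval.

```python
def c_index(scores: list[float], labels: list[int]) -> float:
    """Concordance index for a binary outcome (equal to the ROC AUC)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    concordant = 0.0
    for case in cases:
        for control in controls:
            if case > control:
                concordant += 1.0   # case ranked above control
            elif case == control:
                concordant += 0.5   # tie counts as half
    return concordant / (len(cases) * len(controls))


# Invented predicted scores (e.g. nomogram total points) and outcomes (1 = TB, 0 = PN).
scores = [0.9, 0.8, 0.4, 0.35, 0.2]
labels = [1, 1, 0, 1, 0]
print(c_index(scores, labels))
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two groups.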

Discussion
In this study, different profiles were analyzed between AFB− IGRA+ TB and PN from several aspects, and five laboratory variables (Age, UA, ALB, Hb and WBC) were selected to construct a multivariate risk model and nomogram. Internal validation and the calibration plot showed moderate agreement between nomogram probability and actual observation, with a C-index of 0.7 (95% CI: 0.61, 0.8). We achieved a similar result in an external validation cohort (C-index: 0.77, 95% CI: 0.68, 0.86). These findings indicate that five laboratory variables can be used to predict TB disease probability when a clinical sample is AFB-negative and IGRA-positive.
It has been reported that TB patients tend to display increased levels of C-reactive protein (CRP), erythrocyte sedimentation rate (ESR) and UA, and low levels of Hb [17]. Anemia, with a decreased level of Hb, is a common comorbidity in TB. The prevalence of anemia in TB patients ranged between 32% and 96% [18], most of which was due to anemia of inflammation [19,20]. Previous evidence has shown an unambiguous relationship between anemia, iron redistribution and TB susceptibility [21]; moreover, anemia was correlated with poor clinical prognosis and mortality after TB diagnosis [20].
The anti-tuberculous agents pyrazinamide (PZA) and ethambutol can increase the serum UA level, by decreasing UA excretion and by decreasing renal clearance of UA, respectively [22,23]. An increased UA level was observed in 28.2% of men and 37.5% of women with multidrug-resistant pulmonary TB prior to chemotherapy, and more often during the first 2 months of treatment in both men and women [24]. In our study, serum UA was significantly higher in TB (FDR < 0.001), with an OR of 0.36 (p-value < 2.5e-05) compared with PN (Table 3).
Reduced plasma ALB concentrations have been reported in TB [25] and might be used as a diagnostic and prognostic marker in pretreated HIV and TB patients. Hypoalbuminemia was associated with an increased risk of mortality in patients with tuberculosis, and serum albumin concentrations < 3.2 g/dL were associated with 85% specificity [26]. WBC was significantly increased in TB patients compared with healthy controls, and decreased significantly during TB treatment [27,28]. In our study, WBC was statistically significant and a significant risk factor (OR: 1.49), but did not show higher counts by fold change compared with PN (Table 3).
To predict the risk of TB for each AFB− IGRA+ patient, we used a nomogram to provide a more accurate profile. With five variables, the nomogram showed good predictive accuracy with a C-index of 0.7, with UA making the largest protective contribution against TB progression (OR: 0.36, p-value < 0.01, data not shown), indicating that it might be a specific protective factor for TB patients in our data. External validation is essential to confirm that the model can be applied to patients outside of the cohort. Thus, we collected a second participant cohort from another center and tested it on the nomogram, and the result showed good consistency with actual observation (C-index of 0.77).
The present study still had several limitations. First, in low-income, resource-constrained areas, not all patients received all routine laboratory tests, leading to many missing values in the first cohort of participants.
In order to analyze more biomarkers, participants with over 50% of missing data were excluded, leaving 41 laboratory variables and a small number of participants (89 AFB− IGRA+ TB and 38 PN). Second, although internal and external validation showed good performance, further investigations are required to optimize the nomogram in larger cohorts and in more types of pulmonary tuberculosis. After improvement, we anticipate that this model might help clinicians reduce the cost and time needed to diagnose AFB− IGRA+ TB in low-income, high-burden and resource-constrained rural areas.

Conclusion
The study identified a five-variable signature for distinguishing AFB− IGRA+ TB from PN patients. A risk model was built to differentiate AFB− IGRA+ TB from PN and validated in an external independent cohort, and it could be applied in low-income and resource-constrained rural areas.

analyses and drafted the manuscript. All authors were involved in critically revising the manuscript and providing final approval.

Acknowledgments
We thank the staff of the Ganzhou Fifth Hospital and Shenzhen Third People's Hospital for facilitating access to the relevant medical records.

Funding
This work was supported by the Natural Science Foundation of Jiangxi Province (20202BAB206059).

Ethics declarations
Ethical approval and consent to participate
The study protocol was approved by the Ethics Committees of Ganzhou Fifth Hospital and Shenzhen Third People's Hospital to allow retrospective access to patients' records and files. Written informed consent was waived by the Ethics Commission as this was an observational and retrospective analysis.

Consent for publication
Not applicable.

Availability of data and materials
The clinical data used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Figure 1
Overview of study design and analysis workflow. A total of 748 AFB-TB and 531 PN participants were prospectively evaluated. Individuals with over 50% of clinical data missing were excluded. Candidate variables were filtered by statistical differences. 89 AFB− IGRA+ TB participants and 38 PN participants with candidate variables were finally included in multivariate risk model construction.