Of the 22 study participants randomly selected from the main study’s database, 11 were culture-positive TB patients. All study participants (including those with ORD) were QuantiFERON TB Gold In Tube positive and HIV negative. The mean age of all study participants was 40.3 ± 8.9 years. The clinical and demographic characteristics of study participants are shown in Table 1.
Protein identification by shotgun proteomics
Representative total ion chromatograms (TICs) were obtained by pooling samples from the TB and ORD groups (Figure 2). These pooled samples were used as quality control (QC) references for the remainder of the individual clinical runs. The raw data for these references were processed according to Tyanova and colleagues (23). We identified 1176 protein groups across all samples. Of these, 46 (3.91%) were contaminants, 12 (1.02%) were reverse hits and 170 (14.46%) were single-peptide protein groups. The contaminants, reverse hits and protein groups represented by single peptides were removed and not included in further analysis. Among the remaining 948 proteins, 26 had intensities that differed significantly between the TB and ORD groups (Table 2). These differentially expressed proteins were subjected to GO and IPA analysis.
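The filtering arithmetic above can be checked with a short sketch. It uses only the counts reported in the text and assumes the three removed categories do not overlap; the function name `filter_summary` is hypothetical and not part of the study's pipeline.

```python
# Counts reported in the text (assumed to be mutually exclusive categories).
TOTAL_PROTEIN_GROUPS = 1176
REMOVED = {
    "contaminants": 46,
    "reverse hits": 12,
    "single-peptide protein groups": 170,
}


def filter_summary(total, removed):
    """Return the number of groups left and each removed category as % of total."""
    remaining = total - sum(removed.values())
    percents = {name: round(100 * n / total, 2) for name, n in removed.items()}
    return remaining, percents


remaining, percents = filter_summary(TOTAL_PROTEIN_GROUPS, REMOVED)
print(remaining)
print(percents)
```

Under that assumption, the remaining count and the percentages agree with the values given in the text.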
Gene Ontology of the differentially expressed proteins
To obtain an overall picture of the differentially expressed salivary proteome in the two groups, the differentially expressed proteins were classified using PANTHER and the GO database according to their molecular functions, biological processes and protein classes (Figure 3). In the biological process ontology, the largest fraction of the identified proteins was assigned to cellular process, followed by biological regulation and metabolic process (Figure 3A). In the molecular function ontology, binding, catalytic activity and molecular function regulator were the main functions of the proteins, followed by structural molecule activity, molecular transducer activity and transcription regulator activity (Figure 3B). Additionally, the protein class ontology showed that the majority of these differentially expressed proteins belonged to the enzyme modulator protein class, followed by the oxidoreductase, cytoskeletal protein and hydrolase classes (Figure 3C). The remaining smaller percentages of the differentially expressed proteins fell into the calcium-binding protein, cell adhesion molecule, defense/immunity protein, lyase, receptor, signaling molecule, surfactant and transferase protein classes.
Pathway analysis by IPA software
Protein-protein relationships and putative network and pathway analyses of the differentially expressed proteins were performed with IPA (Ingenuity Inc.). The selected proteins (Table 2) were uploaded to the IPA software with their corresponding UniProt IDs and respective log10 ratios to map proteins into key pathways and to retrieve protein-protein interactions. We identified 26 canonical pathways associated with the differentially expressed proteins. Among these, the IL-8 signaling pathway appeared as the top hit (Figure 4A). IL-8 is a chemokine associated with inflammation that plays an essential role in neutrophil recruitment and neutrophil degranulation. It induces chemotaxis mainly in neutrophils as well as other granulocytes, resulting in the migration of these target cells to the site of inflammation (24). As shown in Figure 4A, we further identified several important pathways, mainly signaling pathways, in which these differentially expressed proteins were involved.
Using the IPA software, we also explored protein-protein interaction networks. The protein-protein interaction analysis showed significant interactions among the differentially expressed proteins. We identified two interaction networks, with the major network including 18 of the 26 differentially expressed proteins. Of the 18 proteins in the protein-protein interaction network, only two were upregulated in individuals with TB in comparison to those with ORD (Figure 4B). Within this interaction network, we identified multiple central nodes, namely PLG, IL1B, TNF, P38 MAPK, ERK 1/2, HGF, ITGAM and ITGB2. These nodes were located in different cell compartments, with the majority of them localized in the cytoplasm and extracellular space (Figure 4B). Of these central nodes, only PLG, ITGAM and ITGB2 were proteins identified by the proteomics analysis; the rest appeared as additional proteins of this network.
Utility of individual proteins in the diagnosis of TB
After removal of contaminants, reverse hits, and proteins represented by single peptides, data were log10 transformed and the LFQ intensities of the remaining proteins were compared between individuals with TB and those with ORD using the unpaired t-test. Intensities of 26 proteins were significantly different between the two groups (Table 2).
When data for individual proteins were assessed by ROC curve analysis, the area under the ROC curve (AUC) was ≥ 0.75 for 14 of the 26 proteins. Notably, the AUCs for 5 of these proteins, namely macrophage-capping protein (P40121), plasminogen (P00747), profilin-1 (P07737), F-actin-capping protein subunit beta (P47756) and alpha-1-antichymotrypsin (P01011), were ≥ 0.80 (Figure 5, Table 2).
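The comparison described above can be sketched for a single protein as follows. This is a minimal illustration rather than the study's analysis code: it assumes the pooled-variance (Student's) form of the unpaired t-test, since the text does not say whether equal variances were assumed, and the function names are hypothetical. It returns only the t statistic and degrees of freedom; a p-value would additionally require the t distribution.

```python
import math
from statistics import mean, variance


def log10_transform(intensities):
    """Log10-transform raw LFQ intensities (all values must be > 0)."""
    return [math.log10(v) for v in intensities]


def students_t(group_a, group_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom.

    Assumption: equal variances in both groups (Student's t-test).
    """
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(group_a) - mean(group_b)) / std_err, df


# Hypothetical raw intensities for one protein in each group (not study data).
tb = log10_transform([1.0e6, 2.0e6, 1.5e6, 1.2e6])
ord_ = log10_transform([4.0e6, 5.0e6, 3.5e6, 6.0e6])
t_stat, dof = students_t(tb, ord_)
print(t_stat, dof)
```

A negative statistic here simply means the first group has the lower mean log10 intensity.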
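For readers unfamiliar with the AUC, the sketch below computes it for a single marker using its rank-based (Mann-Whitney) interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The function name and the input values are hypothetical and are not taken from the study.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of all score pairs."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical log10 intensities for one protein (not study data).
tb_scores = [6.1, 6.3, 6.0, 6.4]
ord_scores = [6.5, 6.2, 6.6, 6.7]
auc = roc_auc(tb_scores, ord_scores)
# For a marker that is lower in the positive group, the AUC falls below 0.5;
# its discriminatory ability is then commonly reported as 1 - AUC.
print(max(auc, 1 - auc))
```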
Performance of combinations of proteins in the diagnosis of TB
To investigate whether the ability of proteins to discriminate between the TB and ORD groups would be enhanced when they were used in combination, intensity data from the 26 proteins that were significantly different between the TB and ORD groups were fitted into GDA models. With the number of combined variables restricted to a maximum of five, optimal prediction of TB was achieved with a combination of either four or five proteins (Figure 6A).
The most accurate five-protein biosignature comprised alpha-1-antichymotrypsin (P01011), NAD(P)H-hydrate epimerase (Q8NCW5), proteasome subunit beta type-6 (P28072), immunoglobulin kappa variable 1-33 (A0A2Q2TTZ9) and neuroserpin (Q99574). This 5-marker biosignature diagnosed TB disease with an AUC of 1.00 (95% CI, 1.00-1.00) (Figure 6C and D), corresponding to a sensitivity of 100% (95% CI, 76.2-100%) and a specificity of 100% (95% CI, 76.2-100%). After leave-one-out cross-validation, there was no change in the sensitivity (100%); however, the specificity dropped to 90.9% (95% CI, 58.7-99.8%). The negative predictive value (NPV) obtained for the 5-marker model after leave-one-out cross-validation was 100%, whereas the positive predictive value (PPV) was 90.91% (95% CI, 60.49-98.49%) (Table 3).
The most accurate four-protein combination identified by best subsets GDA comprised flavin reductase (NADPH) (P30043), myosin-9 (P35579), neuroserpin (Q99574) and protein S100-A11 (P31949), which accurately classified 90.9% of both the TB and ORD participants after leave-one-out cross-validation (Table 3).
The most frequently occurring proteins in the 20 protein combinations that most accurately predicted TB disease or ORD were P30043, which appeared in all 20 combinations, and Q99574 and P31949, which appeared in 9 of the 20 combinations (Figure 6B).
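Leave-one-out cross-validation, as used for the models above, can be illustrated with the following sketch. Each sample is held out in turn, the classifier is fitted on the remaining samples, and the held-out sample is then predicted. A simple nearest-centroid classifier is swapped in here as a stand-in, because GDA is not reproduced; all function names and data values are hypothetical.

```python
from statistics import mean


def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [row for row, y in zip(train_x, train_y) if y == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_sensitivity_specificity(xs, ys, positive="TB"):
    """Hold out each sample once; return (sensitivity, specificity)."""
    tp = fn = tn = fp = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        pred = nearest_centroid_predict(train_x, train_y, xs[i])
        if ys[i] == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical two-marker log10 intensities (not study data).
xs = [[6.0, 5.1], [6.1, 5.0], [5.9, 5.2], [6.8, 4.2], [6.9, 4.1], [7.0, 4.3]]
ys = ["TB", "TB", "TB", "ORD", "ORD", "ORD"]
sensitivity, specificity = loocv_sensitivity_specificity(xs, ys)
print(sensitivity, specificity)
```

Because every prediction is made on a sample the model has not seen, the cross-validated figures can be lower than those obtained when the model is fitted and evaluated on the same data.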