Supervised Machine Learning Model Performance
Mapping and alignment of reads to the ARS-UCD1.2 reference assembly identified 33,129 genes across all 151 libraries (n = 28 controls from 10 animals, n = 123 BRD from 32 animals; Supplemental file 2); the corresponding count matrix resulted in a total library size of 5,132,593,936, with a median library size of approximately 32.7 million counts per library. The count matrix was partitioned into nine pathogen challenge classes; overall testing performance for each ML algorithm is provided in Table 2. Support vectors machine (SVM) modeling, a non-sparse classifier, performed best in terms of balanced accuracy for all testing groups except for BVDV, which the nearest shrunken centroids model provided by the pamr package (PAM) outperformed all other models (86.7%). Because sparse classifiers utilize a subset of the data for classification, genes were acquired and compiled from the top sparce models within each experimental challenge comparison. PAM performed best in terms of balanced accuracy when classifying Virus (89.9%), BRSV (100.0%), BVDV (86.7%), M. bovis (71.4%), and M. haemolytica (73.3%) against controls. Poisson linear discriminant analysis (PLDA) performed best when classifying Bacteria (70.0%). Both power-transformed Poisson linear discriminant analysis (PLDA2) and PAM performed identically when classifying IBR (100.0%). BVDV was less accurate (PAM; 86.7%), which most likely affected classification accuracy when evaluating all viral samples (PAM; 89.9%). Bacteria-challenged classes performed worse overall, with accompanying top balanced accuracies of 80.0%, 71.4%, and 80.0% for M. haemolytica (SVM), M. bovis (SVM/PAM), and Bacteria (SVM) classification, respectively. Combination of all challenge classes (BRD) possessed poor balanced classification accuracy, with the highest non-sparse classifier at 65.0% (SVM) and sparse classifiers (VNSC) at 60.8%.
Table 2: ML performance across all testing sets. Calculations for ML performance metrics are defined by Goksuluk and colleagues (2019). PLDA: sparse Poisson linear discriminant analysis; PLDA2: sparse Poisson linear discriminant analysis with a power transformation; NBLDA: negative binomial linear discriminant analysis; SVM: support vector machine; PAM: nearest shrunken centroids provided by the pamr package; VNSC: sparse voom-based nearest shrunken centroids.
Test set
|
Features Tested
|
Accuracy
|
Accuracy upper
|
Accuracy lower
|
Kappa
|
McNemar p-value
|
Sensitivity
|
Specificity
|
Prevalence
|
Detection rate
|
Balanced accuracy
|
BRD PLDA
|
22062
|
0.5870
|
0.7300
|
0.4323
|
0.0839
|
0.0665
|
0.5000
|
0.6111
|
0.2174
|
0.1087
|
0.5556
|
BRD PLDA2
|
22062
|
0.5870
|
0.7300
|
0.4323
|
-0.0282
|
0.3588
|
0.3000
|
0.6667
|
0.2174
|
0.0652
|
0.4833
|
BRD NBLDA
|
22062
|
0.5652
|
0.7107
|
0.4111
|
-0.0502
|
0.2636
|
0.3000
|
0.6389
|
0.2174
|
0.0652
|
0.4694
|
BRD SVM
|
22062
|
0.8478
|
0.9366
|
0.7113
|
0.4015
|
0.0233
|
0.3000
|
1.0000
|
0.2174
|
0.0652
|
0.6500
|
BRD PAM
|
22062
|
0.7391
|
0.8573
|
0.5887
|
0.1737
|
0.7728
|
0.3000
|
0.8611
|
0.2174
|
0.0652
|
0.5806
|
BRD VNSC
|
22062
|
0.7826
|
0.8905
|
0.6364
|
0.2532
|
0.3428
|
0.3000
|
0.9167
|
0.2174
|
0.0652
|
0.6083
|
Virus PLDA
|
21685
|
0.6765
|
0.8261
|
0.4947
|
0.3909
|
0.0026
|
1.0000
|
0.5769
|
0.2353
|
0.2353
|
0.7885
|
Virus PLDA2
|
21685
|
0.7647
|
0.8925
|
0.5883
|
0.4809
|
0.0771
|
0.8750
|
0.7308
|
0.2353
|
0.2059
|
0.8029
|
Virus NBLDA
|
21685
|
0.7353
|
0.8712
|
0.5564
|
0.3953
|
0.1824
|
0.7500
|
0.7308
|
0.2353
|
0.1765
|
0.7404
|
Virus SVM
|
21685
|
0.9412
|
0.9928
|
0.8032
|
0.8211
|
0.4795
|
0.7500
|
1.0000
|
0.2353
|
0.1765
|
0.8750
|
Virus PAM
|
21685
|
0.9118
|
0.9814
|
0.7632
|
0.7650
|
1.0000
|
0.8750
|
0.9231
|
0.2353
|
0.2059
|
0.8990
|
Virus VNSC
|
21685
|
0.7941
|
0.9130
|
0.6210
|
0.5296
|
0.1306
|
0.8750
|
0.7692
|
0.2353
|
0.2059
|
0.8221
|
Bacteria PLDA
|
21162
|
0.7143
|
0.8872
|
0.4782
|
0.3636
|
0.1336
|
0.6667
|
0.7333
|
0.2857
|
0.1905
|
0.7000
|
Bacteria PLDA2
|
21162
|
0.8095
|
0.9455
|
0.5809
|
0.4167
|
0.6831
|
0.3333
|
1.0000
|
0.2857
|
0.0952
|
0.6667
|
Bacteria NBLDA
|
21162
|
0.8095
|
0.9455
|
0.5809
|
0.4815
|
0.6171
|
0.5000
|
0.9333
|
0.2857
|
0.1429
|
0.7167
|
Bacteria SVM
|
21162
|
0.8571
|
0.9695
|
0.6366
|
0.6316
|
1.0000
|
0.6667
|
0.9333
|
0.2857
|
0.1905
|
0.8000
|
Bacteria PAM
|
21162
|
0.7619
|
0.9178
|
0.5283
|
0.2222
|
0.0736
|
0.1667
|
1.0000
|
0.2857
|
0.0476
|
0.5833
|
Bacteria VNSC
|
21162
|
0.7143
|
0.8872
|
0.4782
|
0.0000
|
0.0412
|
0.0000
|
1.0000
|
0.2857
|
0.0000
|
0.5000
|
BRSV PLDA
|
20692
|
0.9474
|
0.9987
|
0.7397
|
0.8834
|
1.0000
|
1.0000
|
0.9231
|
0.3158
|
0.3158
|
0.9615
|
BRSV PLDA2
|
20692
|
0.9474
|
0.9987
|
0.7397
|
0.8834
|
1.0000
|
1.0000
|
0.9231
|
0.3158
|
0.3158
|
0.9615
|
BRSV NBLDA
|
20692
|
1.0000
|
1.0000
|
0.8235
|
1.0000
|
N/A
|
1.0000
|
1.0000
|
0.3158
|
0.3158
|
1.0000
|
BRSV SVM
|
20692
|
1.0000
|
1.0000
|
0.8235
|
1.0000
|
1.0000
|
1.0000
|
1.0000
|
0.3158
|
0.3158
|
1.0000
|
BRSV PAM
|
20692
|
1.0000
|
1.0000
|
0.8235
|
1.0000
|
1.0000
|
1.0000
|
1.0000
|
0.3158
|
0.3158
|
1.0000
|
BRSV VNSC
|
20692
|
0.6842
|
0.8742
|
0.4345
|
0.0000
|
0.0000
|
0.0000
|
1.0000
|
0.3158
|
0.0000
|
0.5000
|
IBR PLDA
|
20858
|
0.9375
|
0.9984
|
0.6977
|
0.8710
|
1.0000
|
1.0000
|
0.9000
|
0.3750
|
0.3750
|
0.9500
|
IBR PLDA2
|
20858
|
1.0000
|
1.0000
|
0.7941
|
1.0000
|
N/A
|
1.0000
|
1.0000
|
0.3750
|
0.3750
|
1.0000
|
IBR NBLDA
|
20858
|
0.9375
|
0.9984
|
0.6977
|
0.8710
|
1.0000
|
1.0000
|
0.9000
|
0.3750
|
0.3750
|
0.9500
|
IBR SVM
|
20858
|
1.0000
|
1.0000
|
0.7941
|
1.0000
|
N/A
|
1.0000
|
1.0000
|
0.3750
|
0.3750
|
1.0000
|
IBR PAM
|
20858
|
1.0000
|
1.0000
|
0.7941
|
1.0000
|
N/A
|
1.0000
|
1.0000
|
0.3750
|
0.3750
|
1.0000
|
IBR VNSC
|
20858
|
0.5000
|
0.7535
|
0.2465
|
-0.2308
|
0.2888
|
0.0000
|
0.8000
|
0.3750
|
0.0000
|
0.4000
|
BVDV PLDA
|
20468
|
0.5625
|
0.8025
|
0.2988
|
0.2432
|
0.0233
|
1.0000
|
0.3000
|
0.3750
|
0.3750
|
0.6500
|
BVDV PLDA2
|
20468
|
0.5000
|
0.7535
|
0.2465
|
0.0000
|
0.7237
|
0.5000
|
0.5000
|
0.3750
|
0.1875
|
0.5000
|
BVDV NBLDA
|
20468
|
0.5000
|
0.7535
|
0.2465
|
0.0000
|
0.7237
|
0.5000
|
0.5000
|
0.3750
|
0.1875
|
0.5000
|
BVDV SVM
|
20468
|
0.7500
|
0.9273
|
0.4762
|
0.5294
|
0.1336
|
1.0000
|
0.6000
|
0.3750
|
0.3750
|
0.8000
|
BVDV PAM
|
20468
|
0.8750
|
0.9845
|
0.6165
|
0.7333
|
1.0000
|
0.8333
|
0.9000
|
0.3750
|
0.3125
|
0.8667
|
BVDV VNSC
|
20468
|
0.3750
|
0.6457
|
0.1520
|
-0.1765
|
0.3428
|
0.5000
|
0.3000
|
0.3750
|
0.1875
|
0.4000
|
M. bovis PLDA
|
20345
|
0.3571
|
0.6486
|
0.1276
|
-0.2857
|
1.0000
|
0.4286
|
0.2857
|
0.5000
|
0.2143
|
0.3571
|
M. bovis PLDA2
|
20345
|
0.6429
|
0.8724
|
0.3514
|
0.2857
|
1.0000
|
0.7143
|
0.5714
|
0.5000
|
0.3571
|
0.6429
|
M. bovis NBLDA
|
20345
|
0.4286
|
0.7114
|
0.1766
|
-0.1429
|
0.2888
|
0.7143
|
0.1429
|
0.5000
|
0.3571
|
0.4286
|
M. bovis SVM
|
20345
|
0.7143
|
0.9161
|
0.4190
|
0.4286
|
0.6171
|
0.8571
|
0.5714
|
0.5000
|
0.4286
|
0.7143
|
M. bovis PAM
|
20345
|
0.7143
|
0.9161
|
0.4190
|
0.4286
|
0.1336
|
1.0000
|
0.4286
|
0.5000
|
0.5000
|
0.7143
|
M. bovis VNSC
|
20345
|
0.5000
|
0.7696
|
0.2304
|
0.0000
|
0.0233
|
0.0000
|
1.0000
|
0.5000
|
0.0000
|
0.5000
|
M. haemolytica PLDA
|
20659
|
0.5000
|
0.7535
|
0.2465
|
0.0000
|
0.7237
|
0.5000
|
0.5000
|
0.3750
|
0.1875
|
0.5000
|
M. haemolytica PLDA2
|
20659
|
0.5625
|
0.8025
|
0.2899
|
-0.1200
|
0.1306
|
0.0000
|
0.9000
|
0.3750
|
0.0000
|
0.4500
|
M. haemolytica NBLDA
|
20659
|
0.6250
|
0.8480
|
0.3543
|
0.2500
|
0.6831
|
0.6667
|
0.6000
|
0.3750
|
0.2500
|
0.6333
|
M. haemolytica SVM
|
20659
|
0.7500
|
0.9273
|
0.4762
|
0.5294
|
0.1336
|
1.0000
|
0.6000
|
0.3750
|
0.3750
|
0.8000
|
M. haemolytica PAM
|
20659
|
0.7500
|
0.9273
|
0.4762
|
0.4667
|
1.0000
|
0.6667
|
0.8000
|
0.3750
|
0.2500
|
0.7333
|
M. haemolytica VNSC
|
20659
|
0.6250
|
0.8480
|
0.3543
|
0.0000
|
0.0412
|
0.0000
|
1.0000
|
0.3750
|
0.0000
|
0.5000
|
Visualization of Gene Expression Variation
Multidimensional scaling (MDS) was applied to the integrated ML dataset, to discern dissimilarities between its individual mRNA-Seq libraries based on gene variation. Each point on x- and y- axes represents a different individual dataset and subsequent transformed Euclidean distance in two dimensions. Patterns from the top 500 genes based on log2-normalized standard deviation revealed that there was an overall similarity in gene expression across specific tissue types. While differences can be appreciated for each dataset with distinction to tissue site, lung (cluster 1; light blue) and lymphoid tissues (cluster 2; purple) were the most evident in terms of dissimilarity (Fig. 1). Notably, bronchial lymph node tissue from Johnston et al. 32 (cluster 3; green) demonstrated equivalent dissimilarity from lung tissue as the bronchial lymph node tissue from Tizioto et al.30. However, the bronchial lymph node tissue from the two different studies were distinct from one-another when evaluated in the second dimensional space. Tissue-level gene expression differences (e.g., lung versus all other tissue types) were more pronounce compared to disease or etiological differences.
A heat map was generated for each dataset using the gene classifiers identified through the top ML sparse model in each etiologic-specific test group (Fig. 2). A total of 572 genes were identified across the five etiological test groups, 357 of which were unique (Supplemental file 3). Expression patterns within each column is accompanied by unsupervised hierarchical clustering, visualizing likeness in tissue type, etiology, and disease classification. Similar to the MDS plot, distinction based on gene expression can be appreciated across lung and lymphoid tissue types. Pearson correlation coefficients clustering of genes (rows) allowed for the visualization of distinct expression patterns. Particularly, three visual expression modules were identified, and labeled as 1, 2, and 3. Visual expression module 1 contained the genes PSMB8, PPA1, PARP12, EPSTI1, CXCL10, CLEC4F, TIFA, ZNFX1, MX1, DHX58, LOC100139670, GBP4, ZBP1, PLAC8, LOC618737, LOC512486, ISG15, IFIT2, IFITM1, PML, FAM111B, and CD274, which were distinctly higher in expression in lymphatic tissue sampled from cattle experimentally challenged with BRSV and IBR compared to all remaining. Visual expression module 2 contained the genes CPSF6, TMEM123, CIRBP, ATP6, ATP8, ND4L, LPP, IFITM2, LOC112444847, DTX3L, LDHA, RPS26, STIP1, PSME2, PARP9, LOC786372, PTP4A2, CDC42SE1, and NLRC5, which were distinctly decreased in disease lung tissue sampled from cattle experimentally challenged with Mycoplasma bovis, Mannheimia haemolytica, and IBR. Visual expression module 3 contained the genes WDFY4, OTUD4, LCP2, OCDC69, TLN1, RPS7, VPC, HNRNPU, and HMGB2, which were distinctly increased in bronchial lymph node tissue sampled from cattle in the control group and experimentally challenged BRSV.
To explore the complex overlap of gene classifiers between etiological groups, we employed an UpSetR matrix intersection technique (Fig. 3). Among the genes identified through the top sparse classifiers, BRSV was the most distinct with 109 unique genes. There was an apparent separation of viral-related genes, whereas BRSV and IBR possessed the highest overlap (42), BVDV possessed 24 unique genes, and only four genes were shared across all three viruses. Similarly, the bacterial datasets possessed minor overlap, with 25 and 22 genes identified uniquely for M. haemolytica and M. bovis, respectively, and only four genes shared between both bacterial analyses.
Functional Analysis of Gene Classifiers
Gene Ontology (GO) terms for biological processes and Reactome pathway enrichment analyses were performed with WebGestalt (FDR ≤ 0.05). 120, 72, 1, and 48 GO-BP terms were significantly enriched by gene classifiers identified for BRSV, IBR, BVDV, and M. haemolytica, respectively; no significant GO-BP terms were enriched for M. bovis. 47, 15, and 15 pathways were enriched by gene classifiers identified for BRSV, IBR, and M. haemolytica, respectively; no pathways were enriched for BVDV and M. bovis. All GO-BP terms and pathways identified is found in Supplemental file 4. Overlap of the GO-BP terms and pathways identified for each etiological group is shown in Fig. 4A and 4B. BRSV and IBR possessed the highest overlap of functional associations, with 37 GO-BP terms and 12 pathways shared between the two. GO-BP terms and pathways between BRSV and IBR were primarily related to type I interferon production and signaling, cellular metabolism and ATP production, unfolded protein response, antigenic cross presentation, and IL-8 secretion. Between BRSV, IBR, and M. haemolytica, 12 GO-BP terms and 4 pathways were shared across all three. GO-BP terms and pathways between BRSV, IBR, and M. haemolytica were related to innate immune response, apoptosis, and unfolded protein response. M. haemolytica differed in functional enrichment with processes and pathways related to neutrophilic activation and degranulation, classical and alternative complement activation, and immunoglobulin-mediated humoral immunity. All five etiological groups shared genes involved in heat-shock protein response.