Genes and Mechanisms Associated With Experimentally Induced Bovine Respiratory Disease Identi ed With Supervised Machine Learning Methodology on Integrated Transcriptomic Datasets


 Bovine respiratory disease (BRD) is a multifactorial disease involving complex host immune interactions shaped by pathogenic agents and environmental factors. Advancements in RNA sequencing and associated analytical methods are improving our understanding of host response related to BRD pathophysiology. Supervised machine learning (ML) approaches present one such method for analyzing new and previously published transcriptome data to identify novel genes and mechanisms. Our objective was to apply ML models to lung and immunological tissue datasets acquired from previous clinical BRD experiments to identify genes that classify disease with high accuracy. Raw mRNA sequencing reads from 151 bovine datasets (n=123 BRD, n=28 control) were downloaded from NCBI-GEO. Quality filtered reads were assembled in a HISAT2/Stringtie2 pipeline. Raw gene counts for ML analysis were normalized, transformed, and analyzed with MLSeq, utilizing six ML models. Cross-validation parameters (5-fold, repeated 10 times) were applied in a 70:30 training/testing ratio. Downstream analysis of genes identified by the top sparse classifiers for each etiological association was performed within WebGestalt and Reactome (FDR < 0.05). Support vector machines was routinely the top non-sparse classifier for predicting etiological disease versus sham control. Nearest shrunken centroid and Poisson linear discriminant analysis with power transformation could reliably classify IBR and BRSV with 100% accuracy. Genes identified in IBR and BRSV, but not BVDV, were related to type I interferon production and IL-8 secretion, specifically in lymphoid tissue and not lung. Genes identified in Mannheimia haemolytica infections were involved in activating classical and alternative pathways of complement. Novel findings, including expression of genes related to reduced mitochondrial oxygenation and ATP synthesis in consolidated lung tissue, were discovered. Genes identified in each analysis represent distinct genomic events relevant to understanding and predicting clinical BRD. The few genes shared across analyses may be reliably associated with clinical BRD. Our analysis demonstrates the utility of ML with published datasets for discovering functional information to support prediction and understanding BRD acquisition.


Introduction
Bovine respiratory disease (BRD) is the most important disease complex in beef cattle production. Although extensively researched, BRD remains the leading cause of infectious disease and economic loss in post-weaned beef cattle worldwide 1,2,3,4 . Due to the multifactorial and polymicrobial nature of BRD, effort has been made to illustrate host factors, management schema, etiological associations, and stressful environmental factors associated with disease development and progression 1,2,4 . Recent research has been focused on predicting BRD susceptibility and outcomes over time 5,6,7,8 . Unfortunately, clinical diagnostic and prognostic prediction models remain contested, and mechanistic information regarding host-pathogen interactions and the development of clinical BRD is not fully understood.
Clinical BRD is often linked with a select number of bacterial and viral etiologies. Bacteria, such as Histophilus somni, Mannheimia haemolytica, Mycoplasma bovis, and Pasteurella multocida, and viruses, such as bovine respiratory syncytial virus (BRSV), bovine viral diarrhea virus (BVDV), bovine herpesvirus-1 (IBR), and bovine parain uenza type 3 virus (PI3), are well studied regarding their pathological capacity and disease association 9,10,11,12,13,14,15 . However, the clinical presentation of BRD is highly variable and antemortem diagnosis is often made without accompanying etiological identi cation 9,13,16,17 . Additionally, cattle experimentally exposed to these agents often fail to develop severe clinical BRD, demonstrating the underlying complexity of the disease and the requirement of implied predisposing factors 18, 19 . Consequentially, current vaccination protocols possess varying effects in reducing ongoing rates of morbidity and mortality associated with BRD, and targeted antimicrobial usage and antimicrobial resistance is of particular public interest 20,21,22,23,24,25 . Therefore, research is needed to elucidate underlying host mechanisms associated with infectious BRD that represent biological components and regulatory functions amendable to manipulation to improve disease response and clinical diagnosis.
High-throughput RNA sequencing (RNA-Seq) is a highly sensitive methodology used to comprehensively evaluate functional mechanisms and molecular heterogeneity through global gene expression analysis 26,27,28,29 . Because of the high sensitivity of the technology, growing technological applications in research, and decreasing costs, RNA-Seq has become an excellent method of evaluating cellular transcriptomes and functionality at a given point in time. Several RNA-Seq studies performed with samples from post-weaned beef cattle have identi ed underlying genes and host mechanisms associated with both naturally occurring and experimentally induced BRD 30,31,32,33,34,35 . However, the results are highly dependent on the experimental design, sequencing technology, and selected data analysis technique, which may be highly conservative in nature 28,36,37,38,39 . Therefore, the use of supervised machine learning models with previously published RNA-Seq data could identify additional gene expression and mechanistic information related to clinical presentation of BRD.
Supervised machine learning (ML) models used in biological research aid in the discover of molecules and establishment of dynamic models that recognize, classify, and predict disease outcomes 40,41,42,43,44 . In recent years, studies have employed the use of ML framework to identify candidate biomarkers for disease classi cation, cell and tumor expression signatures, and novel protein mechanisms within publicly available RNA-Seq datasets 45,46,47,48,49 . However, to our knowledge, the use of ML-based methodology has not been explored with BRD-associated datasets. Therefore, we combined mRNA-Seq data from lung and immunological tissue of cattle experimentally challenged with causative agents of BRD, and tested the classi cation performance of ML methodology and selected gene classi ers. Our objective for this study was to integrate three publicly available datasets and utilize ML methodology, in order to both corroborate ndings previously discovered through differential gene expression analysis and to potentially identify novel genes and mechanisms associated with experimentally induced BRD.

Dataset Acquisition
One hundred and sixty high throughput mRNA sequencing datasets were acquired from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) 50,51 . The datasets originated from lymphoid tissue and lung harvested during peak clinical signs in cattle that were experimentally challenged with isolated BRD pathogens (n = 35), or their sham controls (n = 10). Analyses of these datasets has been previously reported 30,31,32 . Details of sample sizes for challenged and control cattle, isolated BRD pathogens used for challenge, and tissue samples that were subjected to mRNA sequencing are summarized in Table 1. Paired-end read les for each dataset were concatenated to their corresponding forward and reverse direction. To eliminate potential variations induced by differing work ow toolkits, all reads were processed identically. Quality assessment, read trimming, and adapter contamination removal was performed with FastQC v0.11.9 52 and Trimmomatic v0.39 53 . Brie y, reads were trimmed by removing leading and trailing bases if base quality scores were less than 3, scanning each read with a 4-base pair sliding window and removing read segments below a minimum base quality score of 15, and retaining reads above a minimum length of 36 bases. Read quality analysis was summarized and evaluated for each study with MultiQC v0.37 54 . Read survival and quality assessment information are provided in Supplemental le 1. Trimmed reads were mapped to the bovine reference assembly ARS-UCD1.2 using HISAT2 v2.2.0 55 . Reference-guided transcript/gene assembly and quanti cation was performed with StringTie v2.1.2 56,57 . A gene-level raw count matrix was generated for each dataset with the program prepDE.py 58 . Five datasets (86684_Retrop_LN (control), 86688_Retrop_LN (BRSV), 86710_Retrop_LN (BVDV), 86698_dlung (M. bovis), and SRR1956908 (control)) were removed from further analysis due to low read count quantity and technical variability. Additionally, the four datasets related to Pasteurella multocida infection (SRR1952370, SRR1952371, SRR1952372, and SRR1952373) were removed to avoid unbalanced classi cation. The resulting compiled ML dataset was composed of 151 mRNA-Seq datasets.

Supervised Machine Learning Analysis
A total of 151 mRNA-Seq datasets, spanning six tissue types, constituted the compiled ML dataset for further classi cation and feature selection. Raw gene counts generated for each dataset were processed and analyzed in R v4.0.2 with the Bioconductor package MLSeq v2.6.0 (https://github.com/dncR/MLSeq) 59 . The 151 mRNA-Seq libraries were allocated into 9 classes based on the nature of the experimental pathogen challenge: 1) sham-challenged controls (Control; n = 28), 2) challenged with any BRD pathogen (BRD; n = 123), 3) challenged with a BRD viral pathogen (Virus; n = 82), 4) challenged with a BRD bacterial pathogen (Bacteria; n = 41), and categories 5-9 for each of the 5 independent challenge pathogens (BRSV; n = 35, BVDV; n = 23, IBR; n = 24, M. haemolytica; n = 24, and M. bovis; n = 17). The objectives of the ensuing ML analysis were to develop ML models that would 1) accurately "classify" an mRNA-Seq dataset within the 9 experimental pathogen challenge classes and, 2) extract genes and gene sets or "features" that accurately assign an mRNA-Seq dataset to its experimental pathogen challenge class. These objectives were pursued by comparisons of the 8 pathogen challenge classes and the control challenge class. The raw gene count matrix used for this approach is available in Supplemental le 2. Brie y, offset values of one were added to the count matrix to reduce the likelihood of convergence in model tting and to reduce bulk sparsity 60,61 . Genes with a minimum count-permillion of 0.5 in three or more mRNA-Seq libraries were retained for analysis. Library normalization was performed with the DESeq median ratio approach, using default settings 62 . The resulting ML dataset was strati ed into a training and testing set (70% and 30%, respectively), using controls as the comparative baseline (i.e., class statement).
Feature selection from sparse classi er models was set to a maximum of 2000 genes, based on maximum variance ltering. Sparse classi er models (PLDA, PLDA2, VNSC, and PAM) and generated feature (gene) lists were manually selected as the top models for each test set based on highest associated balanced accuracy and Kappa statistic; if two or more models were equal, gene lists would be merged. Performance metric calculations are de ned by Goksuluk and colleagues 59 . Further information regarding work ow parameters, model building, and optimization are found in the package vignette and associated GitHub repository mirror (https://bioconductor.org/packages/release/bioc/html/MLSeq.html; https://github.com/dncR/MLSeq).

Visualization of Gene Expression Variation
Multidimensional scaling (MDS) was applied to the integrated ML dataset, to discern dissimilarities between its individual mRNA-Seq libraries based on gene variation. Each point on x-and y-axes represents a different individual dataset and subsequent transformed Euclidean distance in two dimensions. Patterns from the top 500 genes based on log2-normalized standard deviation revealed that there was an overall similarity in gene expression across speci c tissue types. While differences can be appreciated for each dataset with distinction to tissue site, lung (cluster 1; light blue) and lymphoid tissues (cluster 2; purple) were the most evident in terms of dissimilarity (Fig. 1). Notably, bronchial lymph node tissue from Johnston et al. 32 (cluster 3; green) demonstrated equivalent dissimilarity from lung tissue as the bronchial lymph node tissue from Tizioto et al. 30 . However, the bronchial lymph node tissue from the two different studies were distinct from one-another when evaluated in the second dimensional space. Tissue-level gene expression differences (e.g., lung versus all other tissue types) were more pronounce compared to disease or etiological differences.
A heat map was generated for each dataset using the gene classi ers identi ed through the top ML sparse model in each etiologic-speci c test group (Fig. 2). A total of 572 genes were identi ed across the ve etiological test groups, 357 of which were unique (Supplemental le 3). Expression patterns within each column is accompanied by unsupervised hierarchical clustering, visualizing likeness in tissue type, etiology, and disease classi cation. Similar to the MDS plot, distinction based on gene expression can be appreciated across lung and lymphoid tissue types. Pearson correlation coe cients clustering of genes (rows) allowed for the visualization of distinct expression patterns. Particularly, three visual expression modules were identi ed, and labeled as 1, 2, and 3. Visual expression module 1 contained the genes PSMB8, PPA1, PARP12, EPSTI1, CXCL10, CLEC4F, TIFA, ZNFX1, MX1, DHX58, LOC100139670, GBP4, ZBP1, PLAC8, LOC618737, LOC512486, ISG15, IFIT2, IFITM1, PML, FAM111B, and CD274, which were distinctly higher in expression in lymphatic tissue sampled from cattle experimentally challenged with BRSV and IBR compared to all remaining. Visual expression module 2 contained the genes CPSF6, TMEM123, CIRBP, ATP6, ATP8, ND4L, LPP, IFITM2, LOC112444847, DTX3L, LDHA, RPS26, STIP1, PSME2, PARP9, LOC786372, PTP4A2, CDC42SE1, and NLRC5, which were distinctly decreased in disease lung tissue sampled from cattle experimentally challenged with Mycoplasma bovis, Mannheimia haemolytica, and IBR. Visual expression module 3 contained the genes WDFY4, OTUD4, LCP2, OCDC69, TLN1, RPS7, VPC, HNRNPU, and HMGB2, which were distinctly increased in bronchial lymph node tissue sampled from cattle in the control group and experimentally challenged BRSV.
To explore the complex overlap of gene classi ers between etiological groups, we employed an UpSetR matrix intersection technique (Fig. 3). Among the genes identi ed through the top sparse classi ers, BRSV was the most distinct with 109 unique genes. There was an apparent separation of viral-related genes, whereas BRSV and IBR possessed the highest overlap (42), BVDV possessed 24 unique genes, and only four genes were shared across all three viruses. Similarly, the bacterial datasets possessed minor overlap, with 25 and 22 genes identi ed uniquely for M. haemolytica and M. bovis, respectively, and only four genes shared between both bacterial analyses.
Functional Analysis of Gene Classi ers Gene Ontology (GO) terms for biological processes and Reactome pathway enrichment analyses were performed with WebGestalt (FDR ≤ 0.05). 120, 72, 1, and 48 GO-BP terms were signi cantly enriched by gene classi ers identi ed for BRSV, IBR, BVDV, and M. haemolytica, respectively; no signi cant GO-BP terms were enriched for M. bovis. 47, 15, and 15 pathways were enriched by gene classi ers identi ed for BRSV, IBR, and M. haemolytica, respectively; no pathways were enriched for BVDV and M. bovis. All GO-BP terms and pathways identi ed is found in Supplemental le 4. Overlap of the GO-BP terms and pathways identi ed for each etiological group is shown in Fig. 4A and 4B. BRSV and IBR possessed the highest overlap of functional associations, with 37 GO-BP terms and 12 pathways shared between the two. GO-BP terms and pathways between BRSV and IBR were primarily related to type I interferon production and signaling, cellular metabolism and ATP production, unfolded protein response, antigenic cross presentation, and IL-8 secretion. Between BRSV, IBR, and M. haemolytica, 12 GO-BP terms and 4 pathways were shared across all three. GO-BP terms and pathways between BRSV, IBR, and M. haemolytica were related to innate immune response, apoptosis, and unfolded protein response. M. haemolytica differed in functional enrichment with processes and pathways related to neutrophilic activation and degranulation, classical and alternative complement activation, and immunoglobulin-mediated humoral immunity. All ve etiological groups shared genes involved in heat-shock protein response.

Discussion
Over the past several years, RNA-Seq analysis has been utilized in bovine disease research to evaluate gene expression related to risk of BRD development, stress response, and viral lesion development 30,31,32,33,34,35,73,74 . Primarily, studies that generate RNA-Seq data utilize statistical platforms and techniques for the detection of differentially expressed genes and subsequent construction of functional networks or modules. Many RNA-Seq studies are thus limited in extrapolatory capacity, as analyses are often performed through subsampling single populations and tting xed statistical models, which may be conservative when analyzing gene expression datasets with overdispersion 75,76,77 . Fortunately, publicly available gene expression repositories, such as the NCBI Gene Expression Omnibus, make it possible to acquire, integrate, and analyze datasets related to a particular eld or disease. Such studies have been performed in mammalian species, including cattle, to better characterize genomic mechanisms and protein production related to a particular disease or condition 49,78,79 . Additionally, with the dynamic capacity for analysis that supervised ML models allow, it is possible to explore and characterize gene expression patterns associated with clinical BRD with profound sensitivity 42,79 . In this study, we integrated gene expression data from controlled BRD experiments and determine expression patterns and classi cation potential through supervised ML analysis.
Some of the limitations of this study are evident. First, data were integrated from three studies, two of which utilized the same animals for their transcriptomic analysis 30,31 . While a clear separation in gene expression patterns was appreciated across tissue types, corroborating the ndings Behura and colleagues 31 , utilizing datasets from a limited number of animals and at single time points may not account for the heterogenous nature of gene expression related to BRD development and clinical presentation 75,80 . Additionally, these datasets were acquired from samples of cattle experimentally challenged with single pathogens. BRD challenge models using single etiological components often fails to elicit severe disease, as described by the three studies used here and may not fully represent the complex nature of the disease process seen with naturally occurring BRD 81,82 . Moreover, RNA-Seq analysis remains a relatively new modality in BRD research, and publicly available data are limited at this time. Nonetheless, this study, which to our knowledge is the rst to evaluate host transcriptomes related to BRD with integrated supervised ML methodology, substantiating many of the gene expression ndings previously identi ed, and may serve as a template for modern data analysis in bovine health research.
Between all testing groups and the six models utilized in this study, the support vector machines (SVM) model typically performed the best in terms of classi cation capacity. While originally utilized in microarray experiments, this algorithm is popular for genomic classi cation research in RNA-Seq, as it has been used to discover cancer biomarkers in humans, classify genes related to early conception in cattle, and automate single-cell RNA-Seq identi cation 49,83,84 . While this algorithm was capable of accurately classifying BRSV and IBR challenged datasets, compared to controls, this model is a nonsparse classi er and therefore does not have a built-in process for feature selection and gene extraction within MLSEq. Therefore, particular interest was placed on the PLDA, PLDA2, PAM, and VNSC algorithms, as subsets of genes used to drive classi cation decisions could be obtained. The compiling of datasets for classifying total BRD, viral, and bacterial challenge yielded mixed results. For total BRD, sparse classi ers PAM and VNSC yielded high classi cational accuracy for identifying the challenged cattle, but performed poorly in discerning them from the controls, as illustrated by the accompanying sensitivity and balanced accuracy. This nding may be related to the complexity and distinction of infection processes associated with each etiological component, and highlights that an all-encompassing approach may be inappropriate for determining relevant gene expression and pathways in BRD. To a lesser extent, this is similarly found when compiling bacterial datasets, as discernment from controls was relatively poor. Viral datasets yielded much higher overall balanced accuracies, compared to the bacterial counterparts. Regarding sparse classi ers, BRSV, IBR, and BVDV were capable of being classi ed with high balanced accuracy (100%, 100%, 86.67%, respectively) through the PAM model; IBR was also identi ed with 100% balanced accuracy with PLDA2.
Generally, viruses were independently the most well classi ed, followed by M. haemolytica. Collectively, BRSV and IBR were well de ned by genes involved in the production and response to type I interferons. More speci cally, these genes were seen to be driven primarily by lymphoid tissue, as opposed to the lungs (expression module 1, Fig. 2). This result, coupled with the subsequent lack of type I interferon genes from the BVDV class, corroborates ndings previously described 30,31 . Biologically, the lack of this innate interferon response has been described as a potential immunosuppressive response driven by BVDV, allowing for persistent infection and viral shedding 85,86,87,88 . Notably, non-cytopathic BVDV strains used in experimental infection models, such as the one utilized in this project, have been shown to reduce proin ammatory signaling 31,89 . While the functional enrichment of the genes classi ed for BVDV were largely non-speci c, several have been previously identi ed and have known immunological functionality 30,31 . Related to M. haemolytica, there was apparent overlap in functionality of genes identi ed through ML (Fig. 4). Largely, this was driven by genes encoding for heat shock proteins, calreticulin, talin-1, and Xbox binding protein. These proteins are involved in apoptotic and cell stress events, and may ultimately impact immunoglobulin production and cellular homeostasis 90,91,92,93 . Additionally, genes classi ed in M. haemolytica were unique to the activation of classical and alternative pathways of complement. While complement-related genes were identi ed across all viruses in previously reported gene expression studies and here, the alternative complement component CFB was only identi ed in M. haemolytica. This may be an indication that the presence and activation of the alternative complement pathway is more indicative of in ammation and disease associated with lipopolysaccharide, often caused by extracellular Gram-negative bacteria such as M.
Lastly, novel ndings were identi ed through visual expression modules found in Fig. 2. Expression module 2 possessed 19 genes with reduced expression in disease lung tissue sampled from cattle experimentally challenged with Mycoplasma bovis, Mannheimia haemolytica, and IBR. While often assumed that the oxygenating capability of consolidated lung space during acute respiratory disease is substantially decreased, this expression module provides evidence of this event, as these genes largely possess aerobic ATP synthase and mitochondrial function 95,96,97,98 . Expression module 3 had nine genes with increased expression in bronchial lymph node tissue sampled from cattle in the control group and BRSV. These genes play important roles in T-cell proliferation, integrin activation and antigenic presentation through actin/tubulin reorganization 99,100,101,102 . Potentially, this serves as an underlying mechanism of immunological stimulation unique to lymph nodes of the lower airway.

Conclusion
This study was conducted to integrate and analyze mRNA-Seq datasets with supervised ML methodology. This approach allowed for a novel and comprehensive analysis of lung and immunological tissues in to experimentally induced BRD. ML enabled the classi cation of viral-induced BRD, speci cally with BRSV and IBR, with 100% balanced accuracy, against sham controls, regardless of the tissue type. This experimental investigation illustrates a novel and powerful approach to the investigation of host response mechanisms in BRD through the use of mRNA-Seq and supervised ML analysis.

DATA AVAILABILITY
The data utilized in this study were downloaded from the National Center for Biotechnology Information Gene Expression Omnibus (NCBI-GEO), Bioproject numbers PRJNA272725 and PRJNA543752.  Heatmap of the 357 unique genes identi ed by top ML sparse classi er across all 151 datasets. Ward clustering of datasets and gene expression was performed with Euclidian distance and Pearson correlation coe cients, respectively. Visual expression modules (1, 2 or 3) were empirically identi ed by class dissimilarity.

Figure 3
Matrix intersection of gene classi ers identi ed for each etiological class. 572 genes identi ed by top ML sparse classi ers were visualized for determining functional relevance and comparative uniqueness. Figure 4