Study System. Bombus impatiens is a native eusocial bee species in North America, ranging from Maine to Ontario to the eastern Rocky Mountains and south through Florida (67). They are generalists that visit a range of agricultural and native plants. B. impatiens have also been domesticated for crop pollination services throughout much of North America (68–70), subsequently making them a widely utilized study species. An annotated reference genome for B. impatiens (71) is available from the National Center for Biotechnology Information (NCBI). At the time of this study, NCBI BIMP_2.2 (GenBank assembly accession: GCA_000188095.4) contained 13,161 transcripts that code for 24,471 proteins.
Crithidia bombi (Zoomastigophora: Trypanosomatidae) is an infectious protozoan gut pathogen that can be contracted at flowers via fecal transmission and can also be horizontally transmitted within colonies (72, 73). Crithidia sp. reduce learning and foraging efficiency in worker bumble bees (74, 75), slow colony growth rates, especially early in the colony life cycle (76), reduce the likelihood of successful reproduction in wild colonies (77), and reduce infected queen fitness (78). Crithidia sp. infection is common; for example, Crithidia sp. infected over 60% of wild-caught B. impatiens in western MA (79) and commercial colonies can have high levels of infection (74).
Sunflowers (Helianthus sp.) belong to a large and diverse family (Asteraceae) with over 32,000 described species (80). Helianthus annuus is a major domesticated oilseed crop cultivated worldwide and a native US wildflower (81). With nearly two million acres of sunflowers planted in the US (82) and ten million acres planted in Europe annually (83), the high abundance of cultivated sunflowers combined with large nectar and pollen yields make it an important resource for bees.
Preparing inoculation treatments. Live Crithidia bombi cells were harvested from three wild B. impatiens workers collected near Stone Soup Farm, Hadley, MA, USA in 2014 (42.363911 N, -72.567747 W) and housed in commercial colonies of B. impatiens thereafter. The Crithidia species was identified in a previous study and confirmed to be C. bombi (84). Both the C. bombi source colony and experimental colony used in this experiment were purchased from Koppert Biological Systems (Howell, MI, USA). Colonies were fed with 30% sucrose solution and mixed wildflower pollen throughout their lifetimes and housed in a dark room at 21 – 24ºC and ~50% rh. We made C. bombi inoculum using an established protocol (12, 85, 86). Briefly, bee digestive tracts of 15 workers, excluding the honey crop, were removed with forceps, placed into 1.5 mL microcentrifuge tubes with 300 µL of distilled water, and ground with a pestle. We allowed each sample to rest at room temperature for 4-5 hours so that gut material settled and C. bombi cells could ascend into the supernatant. Crithidia bombi cells were counted from a 0.02 µL sample of supernatant per bee with a Neubauer hemacytometer under a compound light microscope at 400X magnification. We then mixed 150 µL of the supernatant with distilled water to achieve a concentration of 2400 cells µL−1. The sample was then mixed with an equal volume of 50% sucrose solution to yield inoculum with 1200 cells µL−1 in 25% sucrose. We made the sham inoculum following the same procedure as the C. bombi inoculum, but instead used the digestive tracts of five bees from the un-infected experimental colony.
Preparing pollen diets. We prepared two pollen diet treatments – sunflower and wildflower. Honey bee-collected sunflower pollen pellets were obtained from Changge Hauding Wax Industry (China) and sorted by color to remove impurities. We verified a pure batch of sunflower pollen by staining five samples with basic fuschin dye (87) and visually confirming only sunflower pollen was present with a compound microscope at 400X magnification. Honey bee-collected mixed wildflower pollen pellets were obtained from Koppert Biological Systems (Howell, MI, USA) and microscopically confirmed to contain < 5% Asteraceae pollen, identified by having spines on the exine (61). Experimental pollen diets were provided to bees as a paste produced by mixing ground pollen pellets with distilled water to achieve a uniform consistency.
Inoculation treatment. Experimental adult worker bumble bees were obtained from a single commercial B. impatiens colony that was determined to be uninfected by screening five workers using the methods described in Preparing inoculation treatments. Workers were removed from the colony and placed into individual plastic containers (7.5 cm x 10 cm x 5 cm) with mesh screen flooring. We starved the bees for 3-5 hours and then fed each a 10 µL drop of either the C. bombi inoculum or sham control; bees were assigned at random to inoculation treatment. The dose of C. bombi inoculum contained 12,000 C. bombi cells, which is within the concentration range bees are exposed to when foraging on flowers in the wild (88). Only bees that consumed the entire droplet (n = 120; 60 with C. bombi and 60 with sham inoculum) were used in the experiment.
All bees were then randomly assigned within inoculation treatment to either the sunflower (n = 60) or wildflower pollen diet (n = 60). Each day we fed bees fresh pollen paste of their assigned treatment, packed into an inverted lid of a 1.5 mL microcentrifuge tube and 1 mL of 30% sucrose via a filled and inverted plastic 1.5 mL microcentrifuge tube plugged with cotton (Richmond Dental & Medicine, Charlotte, NC, USA). We harvested tissue samples for RNA extraction 72 hours post-inoculation. We selected this time period because C. bombi counts start to diverge between bees fed sunflower vs. wildflower pollen between 48- and 72-hours post-inoculation (Figure S1; Supporting Information). Moreover, a recent study demonstrated that consuming sunflower pollen for approximately the first 72 hours or for 7 days after inoculation both reduced C. bombi intensity in bumble bees compared with control pollen (89). We harvested tissue samples for RNA extraction from five workers per treatment that had the greatest average daily rate of pollen consumption (see Pollen consumption below). In total, we sequenced 20 samples: 5 replicates for each inoculation treatment and pollen diet. The remaining 100 bees (referred to as non-RNAseq-bees) were reserved to indirectly determine inoculation efficacy by measuring C. bombi infection.
Pollen consumption. Previous work showed that consuming higher concentrations of sunflower pollen had a stronger medicinal effect (23). Because it was not feasible to control how much pollen an individual bee consumed in our study, estimating pollen consumption was important to effectively model the relationship between diet and gene expression. To estimate consumption of pollen over the 72-hour period, we recorded the weight of each pollen feeder each day before placing it into the container with the bee and also 24 hours later. We accounted for feeder weight change caused by evaporation by placing an additional 30 pollen and nectar feeders (15 per pollen type) into containers that lacked a bee. Each day bees were provided fresh sucrose and pollen, yielding three days (post inoculation) of pollen consumption and evaporation measurements. We were not able to estimate nectar consumption because nectar feeders often leaked.
All statistical analyses using linear models were conducted with R version 4.0.2 (90). To estimate pollen consumption, we calculated evaporation-adjusted net consumption based on change in weight of the pollen feeder for each bee per day. Using the evaporation controls, we fit separate linear regressions for each day and pollen type, with initial weight regressed against weight 24-hr later. We then used the predict function in R to calculate an evaporation-adjusted feeder weight, yielding a net consumption estimate for each bee each day. Consumption variables (day 1, day 2, day 3, average daily rate (mg/day) and total) were strongly correlated based on Pearson's product moment correlations (t > 4.538, df = 34, p < 0.001 for all combinations). We thus focused solely on average daily pollen consumption rate for all gene expression analyses (see Differential gene expression analysis), as this was the metric used to select bees for RNA sequencing. We used ANOVA to test for differences in average daily pollen consumption rate between pollen diets and inoculation treatments for RNAseq bees. Model estimated means and Tukey-adjusted pairwise comparisons were obtained using the “emmeans” package (91).
Efficacy of inoculation. To verify that bees inoculated with C. bombi were infected and that sunflower pollen reduced C. bombi infection relative to wildflower pollen, we measured C. bombi prevalence and infection intensity of a random subset of the remaining bees that were not selected for RNA extraction, but also consistently consumed their pollen treatments over the first 72 hours [n(sunflower pollen) = 15 bees, n(wildflower pollen) = 18 bees]. Each bee was dissected and C. bombi cells were counted as in Preparing inoculation treatments, with the addition that all tools were washed with 70% ethanol and thoroughly dried between bees to prevent cross-contamination. We measured prevalence as the presence (1 or more C. bombi cells) or the absence of C. bombi cells per 0.02 µL sample, and C. bombi infection intensity as the number of flagellate C. bombi cells per 0.02 µL. We also removed the right forewing of each bee to measure marginal cell length, a proxy for bee size (92).
We used generalized linear models to analyze how pollen diets affected C. bombi infection prevalence and intensity. Crithidia bombi prevalence models were fit with a binomial distribution and infection intensity models were fit with a negative binomial distribution using the “MASS” package (93).
RNA extractions and sequencing. For bees selected for RNA sequencing, at 72 hours post-inoculation, the bees were anesthetized in a container of dry ice for 2 min. Using flame-sterilized forceps, we removed the abdomen of each anesthetized bee and placed it into a sterile 2 mL microcentrifuge tube with 2 mL of RNA-stabilizing reagent (RNAlater; ThermoFisher, Waltham, MA, USA; cat. No. AM7021). Each abdomen was slightly torn open with forceps for the RNA-stabilizing reagent to fully saturate the tissue sample and stored at 4°C for 24 hours. All samples were then kept in a -80°C freezer until RNA extraction. Crithidia bombi is a gut pathogen in bumble bees, and since our interests were in the effects of diet, we focus the sequencing on abdominal tissues. We did not use whole-bees to avoid potential tissue-specific gene expression patterns (e.g., differences between brain and gut gene expression), which has been shown in other insects (94) and may make it difficult to disentangle gut-specific responses.
Total RNA samples were submitted to the NCSU Genomic Sciences Laboratory for Illumina RNA library construction and sequencing. Purification of messenger RNA (mRNA) was performed using oligo-dT beads in the NEBNExt Poly(A) mRNA Magnetic Isolation Module. Complementary DNA (cDNA) libraries for Illumina sequencing were constructed using the NEBNext Ultra Directional RNA Library Prep Kit (NEB) and NEBNext Mulitplex Oligos for Illumina (NEB) using the manufacturer-specified protocol. Double-stranded cDNA was purified, end repaired, and “a-tailed” for adaptor ligation. Following ligation, samples were processed for a final fragment size (adapters included) of 400-550 bp using sequential AMPure XP bead isolation (Beckman Coulter, USA). Prior to library construction, RNA integrity, purity, and concentration was assessed using an Agilent 2100 Bioanalyzer with an RNA 6000 Nano Chip. Library enrichment was performed and specific indexes for each sample were added during the protocol-specified PCR amplification. The amplified library fragments were purified and checked for quality and final concentration using an Agilent 2100 Bioanalyzer with a High Sensitivity DNA chip. The final quantified libraries were pooled in equimolar amounts for clustering and sequencing on an Illumina NextSeq 500 DNA sequencer, using a 75 bp x 2 single end sequencing reagent kit. The software package Real Time Analysis was used to generate raw bcl (base call files), which were then de-multiplexed by sample into fastq files. Low-quality bases and adapter sequences were removed from raw sequence data for each sample using the Trimmomatic software package. Clean reads were mapped to the B. impatiens genome (BIMP 2.2) with HiSat2 version 2.1.0 (95) using default parameters. Gene expression was quantified using StringTie version 2.0 (96) to determine the number of reads uniquely mapping to exons and summed at the transcript level using gene features annotated in the NCBI B. impatiens annotation file (BIMP 2.2; GCA_000188095.4). Here after, for each transcript product mentioned throughout we report NCBI RefSeq protein accession IDs for B. impatiens or blast top hit taxa if B. impatiens was unavailable.
Differential gene expression analysis. We used a negative binomial generalized linear model using DESeq2 version 1.28.1 (97) in R to test for differences in gene expression between treatments. We tested for effects of treatment (pairwise comparisons of pollen diet and inoculation treatment) on gene expression and included pollen consumption rate as a continuous covariate to control for variation caused by differences in average daily pollen consumption among bees. We used the Wald test to assess the significance of differentially expressed transcripts (DETs) and corrected for multiple testing using the Benjamini-Hochberg method with a cutoff of FDR < 0.05. Shrunken log2 fold changes for normalized transcript counts were obtained using lfcShrink function in DESeq2 with a shrinkage estimator based on a normal prior (97).
Machine Learning analysis. Transcriptomic data often suffers from the ‘curse of dimensionality’ due to having many more features than samples (98). Small sample sizes and the rapid loss of degrees of freedom thus make it a poor fit for traditional linear statistics, like regression and ANOVA (99). Standard analyses typically use multiple testing corrections to control false discovery rate (FDR). This method does not consider the highly interactive system of the transcriptome and often fails to detect small changes in gene expression (100). Machine learning, or artificial intelligence tools, can be used to address these challenges by building models from the data rather than fitting the data to rigid models.
We applied support vector machines (SVM) for classification using Weka 3.8.4 (101) to the gene expression profiles of each pairwise treatment comparison. Classification in machine learning is the task of learning to distinguish data points that belong to two or more categories in a dataset. Feature selection techniques can then be used to select a reduced number of variables that can maintain accurate classification. For example, comparing infected bumble bees fed either sunflower or wildflower pollen, support vector machines can be used to determine how well pollen diet can be classified from gene expression profiles. Feature selection can then be used to determine a reduced set of transcripts that accurately classify pollen diet, and are thus important (i.e., akin to statistical significance). To optimize data dimensionality for feature selection, we first selected a subset of transcripts from each pairwise treatment from the DESeq2 models that were differentially expressed based on an uncorrected p-value < 0.05. Transcripts (attributes) were then ranked using the InfoGain attribute evaluator and Ranker search method. This process evaluates the worth of an attribute by measuring the information gain with respect to the treatment (102). Specifically, InfoGain measures the difference in the Shannon's entropy of the system H(S) before a new attribute X is introduced, and H(S|X) is the entropy of the system after the attribute X has been introduced. We then created a series of ranked datasets, each including a subset of the top-ranked transcripts in a serial manner.
Preliminary classifier runs demonstrated that a support vector machine SMO, using the 10-fold (stratified hold-out) cross-validation method, correctly classified an average of 82.90% (SD = 27.68%) of bee treatment effects based on gene expression profiles, and thus was used in our analyses. SMO implements John Platt’s sequential minimal optimization algorithm for training a support vector classifier by globally replacing all missing values and transforming nominal attributes, in our case transcripts, into binary attributes (103). The machine-learning algorithm SMO has been successfully used to analyze gene expression profiles (104). We used the ranked data sets to train the SMO algorithm using both the 10-fold (stratified hold-out) and 66% split cross-validation method. For each data set, we repeated model training 100 times and used the classification performance metrics percent correct classification (%CC) and kappa statistic (k) to evaluate model performance.
We tested the efficacy of the optimized SMO model using a negative control method for machine learning. We first created 10 randomized data sets using the DESeq2 normalized counts with treatment randomly assigned. We then re-ran the SMO model training using the number of attributes that provided the best classification. Since there are always two class types, the predicted correct classification rate from random assignment should be approximately 50%, based on the Law of Probability, and the kappa statistic should be close to zero for a randomized negative control to demonstrate that true learning occurred in the optimized SMO models. This approach is detailed in previous studies (105, 106).
Gene Ontology enrichment analysis. Protein descriptions and gene ontology (GO) annotations for transcript sequences were obtained using OmicsBox version 1.4.11 software (https://www.biobam.com/omicsbox/). First, a BED formatted file of transcript coordinates was parsed from the NCBI BIMP 2.2 Annotation release. We then used bedtools version 2.29.2 (107) to extract nucleotide sequences based on the BED file coordinates. A BLASTX search was then performed with an E-value of 10−25 against all arthropod sequences in the NCBI non-redundant database, with the number of hits restricted to 20, followed by GO mapping and annotation for the resulting hits. We then ran InterProScan annotation for the sequences using the default settings and merged InterProScan GO annotations with BLASTX annotations. GO enrichment analysis was performed for all treatment comparisons to find significantly (FDR < 0.05) enriched GO biological process and molecular function terms in the test set of DEGs with respect to the reference set. We used the publicly available databases GeneCards (108) and UniProtKB/Swiss-Prot (109) as additional primary sources of information about DEGs.
IPA canonical pathway analysis. We used Qiagen Ingenuity Pathway Analysis (IPA) software to further interpret the differential expression of transcripts in each treatment pairwise comparison. IPA Knowledge Base maintains a large set of databases that consist of curated metabolic and signaling pathways. Transcripts were manually mapped to human, mouse, or rat ortholog gene IDs or those of other species based on UniProtKB accession numbers for use in IPA. We also performed an additional blastx for all transcripts using the same methods described in Gene Ontology enrichment analysis, but restricted to the Homo sapiens, Mus and Rattus taxonomies. Using these two different methods allowed us to double-check ambiguous orthologous gene symbols. We then performed a Core Analysis in IPA to determine enrichment of relevant canonical metabolic and signaling pathways based on gene expression patterns. We repeated Core Analysis for each pairwise treatment comparison based on the list of important transcripts identified using machine learning. The significance of the association between each gene set and a Canonical Pathway was determined from a p-value of overlap calculated using a right-tailed Fisher’s Exact Test. In addition, IPA calculates a z-score based on the gene expression fold change values of each gene to estimate the state of activation or inhibition of each pathway. We report gene symbols (Homo sapiens, Mus or Rattus) for each gene product mentioned hereafter in the IPA canonical pathway results.