Field Site and Sampling Scheme
We sampled trees at the Torrey Pines State Natural Reserve (TPSNR, the mainland population) near La Jolla/Del Mar in San Diego County, California. The reserve occurs along coastal bluffs and comprises multiple habitat types including coastal pine forest, chaparral, and coastal sage scrub. A lagoon separates the reserve into two sites including the main reserve and the Torrey Pines Extension (hereafter TPSR and TPEX populations, respectively). Our sampling design included pairs of proximate trees with and without signs of Ips in order to reduce environmental variation that may influence beetle attack. Characteristic signs of Ips infestation included small entrance holes exuding red pitch and frass on the trunk and branches, and needle yellowing or browning throughout the tree crown as infestation progressed. We sampled tree pairs throughout the reserve to capture a representative pool of variation for diversity and structure estimates.
We sampled a total of 40 trees from October to November 2017 and from November 2018 to April 2019 (Fig. 1, S1, Table S1). Fall of 2017 had high Ips activity with widespread tree mortality in clusters throughout the reserve, while the season beginning in Fall of 2018 had lower Ips activity in scattered localities. In 2017, we sampled 22 mainland trees including seven pairs of trees (n = 14) with and without signs of Ips, four trees throughout the reserve to capture additional genotypes, and four trees with a Santa Rosa Island origin planted prior to 1960 at the nearby Sanford Burnham Prebys Discovery Institute (SBPDI). Beginning in Fall of 2018, we sampled nine mainland tree pairs (n = 18) with and without signs of Ips. Because mortality can proceed rapidly once needle yellowing begins, we sampled beetle-colonized trees with sufficient living tissue opportunistically as we identified infested trees. Where possible, we matched beetle-colonized trees to nearby asymptomatic trees of similar diameter at breast height (DBH), and in the second season, we matched environmental variables between tree pairs using 2018 census data provided by TPSNR and geographic information system data. Tree pairs occurred within approximately 30 – 50 m of one another (or up to 150 m for more isolated trees). A multiple factor analysis indicated comparable size, spatial, and environmental conditions between groups (Fig. S2) despite sampling across a heterogeneous landscape.
In November 2019, we re-surveyed all 40 trees, assigning dead trees with signs of Ips infestation as ‘cases’ and living trees as ‘controls.’ Some controls had potential signs of past beetle attack limited to isolated branches, though no evidence of ongoing infestation. One Ips-infested individual identified in 2017 was still living in 2019 and one asymptomatic individual had succumbed to beetle attack by 2018, and they were reassigned as a ‘control’ and ‘case’, respectively. A second bark beetle, Dendroctonus valens LeConte (red turpentine beetle) also occurred extensively throughout the reserve, although it is largely restricted to the lower tree bole. Prior to sampling, 21 of 40 trees displayed signs of Dendroctonus including 13 case trees and 7 control trees (infestation could not be determined for three trees whose lower bole was covered by shrubbery).
For each of the 40 trees, we sampled both twig and needle tissue to maximize transcript diversity. We selected small, terminal branches healthy in appearance when possible, although case trees were in various stages of Ips colonization. From each branch, we sampled needles and twig segments 15 – 20 cm behind the oldest needle growth. We preserved tissues on liquid nitrogen or dry ice prior to transporting and storing at –80°C at the San Diego Zoo Institute for Conservation until further processing.
RNA Extraction, Library Preparation, and Sequencing
We extracted total RNA from twig and needle tissue using the Spectrum Plant Total RNA Kit (Sigma-Aldrich, St. Louis, MO) according to the manufacturer’s protocol. For twigs, we used a razor blade to remove the outer bark layer to reduce lichen and other potential contaminants. We verified RNA quality and quantity using a Nanodrop One spectrophotometer (Thermo Scientific, Waltham, MA), Qubit BR Assay kit (Life Technologies, Waltham, MA) and an Agilent 2100 Bioanalyzer, and cleaned extracts with AMPure XP beads as needed. Of two cases with dry twigs, needles were exclusively (sample tpsr0195) or predominately (sample tpsr0912) used for sequencing. However, RNA integrity number (RIN) scores did not differ significantly between groups (cases: 7.1 – 9.9, mean 8.43; controls: 7.2 – 9.2, mean 8.35; F=1,30=3.16, P=0.578).
We prepared mRNA libraries with a TruSeq Stranded mRNA Kit (Illumina, San Diego, CA) and unique dual index adapters following the manufacturer’s protocol (October 2013). Libraries were PCR amplified for 15 cycles, quantified by Qubit, and validated on an Agilent 2100 Bioanalyzer. We performed double size selection with AMPure XP beads as needed to reduce minor secondary peaks in the Bioanalyzer trace. We sequenced final libraries as paired end 75 bp reads across four runs on an Illumina NextSeq500 platform at SBPDI (San Diego, CA). We included six libraries as technical replicates across runs to control for potential batch effects and re-sequenced seven low coverage samples (Table S1).
Read Filtering, Mapping, and Variant Calling
We filtered raw, demultiplexed fastq files using the HTStream pre-processing pipeline (https://bioinformatics.ucdavis.edu/software). We removed 28s and 18s Pinus taeda rRNA, PCR duplicates, adapters, and poor quality sequence from 3’ read ends (with quality score < 20 in a sliding window of 10 bp). Finally, we removed remaining N characters and any processed reads under 50 bp.
Following the Genome Analysis Tool Kit (GATK) best practice guidelines for RNA-Seq short variant discovery (https://gatk.broadinstitute.org/), we aligned reads to the Pinus taeda genome v2.01 (Zimin et al. 2017) downloaded from the TreeGenes database (Falk et al. 2018) with modified steps as follows. Due to the large and fragmented nature of the genome assembly (22 Gb, 1.76 million contigs), we used a custom perl script (S. Fitz-Gibbon, personal communication) to sort contigs by size and concatenate them into 222 ~100 Mb ‘pseudocontigs’ with contigs separated by 10 N’s. Using two-pass mode with the splice aware aligner STAR v2.6.0 (Dobin et al. 2013), we aligned processed reads to the P. taeda pseudocontigs with settings --genomeSuffixLengthMax 1000, --outSJfilterCountTotalMin 9 3 3 3, and --alignIntronMax 160000, where maximum intron length was based on estimates for P. taeda (Wegrzyn et al. 2014). We used Samtools v1.9 (Li et al. 2009) to retain only uniquely mapping reads and to remove PCR/optical duplicates, secondary or supplementary alignments, reads failing QC, and unpaired reads.
To call variants, we ran two iterations of GATK v4.1.4.1 HaplotypeCaller and GenotypeGVCFs with option –stand-call-conf 30 and base quality score recalibration (BQSR). During the first variant calling iteration, we applied hard filtering thresholds in GATK to remove SNPs with the following INFO fields: QD < 5.0, FS > 30.0, ReadPosRankSum < -4.0 or > 6.0, BaseQRankSum < -4.0. We then used VCFtools (Danecek et al. 2011) to remove SNPs with minimum mean depth < 5 and call rate < 50%, resulting in over 419 K SNPs for BQSR. During the second iteration, we performed hard filtering as above with an additional filter to remove SNPs with ExcessHet > 16.43 (Z-score 2). Using VCFtools, we set genotypes with depth (DP) < 5 and genotype quality (GQ) < 20 to missing, and removed sites with mean DP < 15, minor allele frequency < 0.05, > 2 alleles, mean GQ < 44, and call rate < 95%. We modified further filtering steps from the dDocent pipeline (http://www.ddocent.com/filtering/) to remove sites with high read depth that may contain potential paralogs (all loci with DP > 2 standard deviations (SD) from the mean; loci with DP > 1 SD from the mean if QUAL < 2 ´ DP). Finally, we used BCFtools v1.9 to prune SNPs in linkage disequilibrium (r2 > 0.5 in a window of 5 SNPs) using the original genome coordinates. We subset SNPs with and without LD-pruning by population for use in downstream analyses (Table 1).
Kinship, population structure, and genetic diversity
Our first goal was to assess functional genetic diversity within and among populations of Torrey pines. To test for family groups that could result from sampling proximate case/control pairs, we estimated relatedness among samples with the KING inference method in VCFtools (Manichaikul et al. 2010). We then assessed genetic differentiation controlling for kinship through Principal Components Analysis in Related samples (PC-AiR) (Conomos et al. 2015) in the R package GENESIS v.2.12.4 (Gogarten et al. 2019) for all individuals, mainland individuals, and case/control pairs using the LD-pruned SNP set. PC-AiR first identifies an unrelated subset of individuals in the data representing all ancestries. It then uses those individuals to infer principal components (PCs) and projects the remaining samples onto those PCs based on genetic similarity. We ran one to two iterations of PC and kinship estimation with a scaling constant of 1 when estimating kinship, a kinship threshold of 0.0884 (second degree relatives) for identifying unrelated individuals, and two ancestry informative PCs to estimate kinship accounting for ancestry.
For each sample group, we calculated genetic diversity and structure estimates for all individuals and for unrelated individuals identified by PC-AiR. We estimated the mean expected (HE) and observed multi-locus heterozygosity (HO) per population with the R package Adegenet v2.1.1 (Jombart et al. 2010) and the average inbreeding coefficient (F) per individual within population with VCFTools. Using the R package HIERFSTAT (Goudet 2005), we calculated pairwise FST based on Weir and Cockerham’s q (1984). We performed all analyses in R v.3.5.2 (R Core Team, 2018) using the LD-pruned SNP set.
Multivariate analysis of genetic divergence between cases and controls
To test for divergence between multivariate genotypes of cases and controls, we employed Discriminant Analysis of Principal Components (DAPC) in Adegenet. To control for population structure, we first performed DAPC on the LD-pruned SNP set with population (TPEX or TPSR) as group for the 32 paired samples. We then regressed the full SNP set on the first discriminant function (DF) summarizing 2 PCs from the population model. We used residuals from the regression in a second DAPC model with case/control as group and missing data set to the mean. To select the optimal number of PCs for the DF, we performed cross validation, assigning 80% of the data to a training set and the remaining 20% to a validation set in a stratified random manner, with 50 replicates. Highest mean success and lowest root mean squared error estimates indicated an optimum of 12 PCs. To determine whether the difference between case and control groups was significant, we ran a MANOVA Pillai test with 12 PCs as the dependent variable. We repeated DAPC for subsets of the data dropping a PCA outlier or samples with high missing data. All analyses were limited to SNPs with a 95% call rate across the case/control dataset.
Association of individual SNPs with case/control status
To identify individual SNPs associated with case/control status, we ran the linear mixed model GEMMA which calculates an exact association test statistic (Zhou and Stephens 2012). We selected a ‘centered’ covariance matrix to control for hidden kinship and population stratification and included latitude, longitude, and population (TPEX, TPSR) as additional covariates. Only SNPs with a 95% call rate among the 32 paired samples were included, amounting to 854 LD-pruned SNPs to estimate the relatedness matrix and 1526 SNPs from the full SNP set for the association test. We repeated the association test separately for TPSR, TPEX, and subsets of the data excluding a PCA outlier or samples with high missing data. Using the R package qvalue v.2.14.1, we adjusted P-values from the likelihood ratio test by employing the false discovery rate method to correct for multiple testing (Storey and Tibshirani 2003). Additionally, PCs for differing tiers of ‘top SNPs’ in the tail of the distribution were visualized based on estimates from the R package SNPRelate v1.16.0 (Zheng et al. 2012). We used a dDocent script to test deviations from Hardy-Weinberg equilibrium in top SNPs within TPEX, TPSR, and the combined mainland population.
Functional annotation and gene ontology enrichment analysis of top SNPs
We matched SNPs to functional annotations for 51,751 gene models in the Pinus taeda reference genome from the TreeGenes Database (Pita.2_01.final_annotations_lvl0.tsv downloaded April 2020: https://treegenesdb.org/FTP/Genomes/Pita/v2.01/annotation/). SNPs lacking a functional annotation in the reference were supplemented with custom annotations. Briefly, we used Stringtie v2.1.1 (Pertea et al. 2015) to assemble transcripts for each sample using the P. taeda gene models as a guide and 15´ minimum read coverage (-c 15 -s 15). We then merged individual assemblies to create a single transcriptome assembly. To functionally annotate transcripts containing top SNPs, we ran EnTAP v0.9.1 (Hart et al. 2020) with blastx similarity search across the Uniprot-SwissProt and Plant RefSeq databases (downloaded May 2020) with the following parameters: E-value threshold 10-6, 30% query and target coverage, and Pinaceae taxon prioritization. Finally, we identified gene ontology (GO) terms for all SNPs tested in GEMMA by running EggNOG-mapper (Huerta-Cepas et al. 2016) through EnTAP.
To identify GO terms overrepresented among SNPs associated with case/control status, we performed singular enrichment analysis in AgriGO v2.0 (Tian et al. 2017). We ran two sets of enrichment tests, using either GO terms from the P. taeda reference annotations or from custom annotations with EnTAP (633 vs 1170 of 1526 SNPs with at least one GO term, respectively). If a SNP matched multiple alternative transcripts, we selected GO terms from the longest transcript. We separated gene ids and corresponding GO terms into a query set (SNPs identified in the top 5% or 10% of the distribution of the GEMMA association test) and a reference set (the non-overlapping set of the remaining SNPs tested in GEMMA). Only one SNP per gene was considered. We performed Complete GO enrichment analysis separately for biological process, cellular component, and molecular function levels. To assess significance, we used a Chi-squared test for GO terms matching at least 5 transcripts and adjusted for multiple testing with the Benjamini and Hochberg (1995) false discovery rate (FDR) method.