Study region
Three mountains in Sulawesi (Gunung Dako, Gunung Torompupu, and Gunung Ilomata; Figure 1) were sampled in coordination with a biotic survey of the Sulawesi mountains. This was a large collaborative project coordinating with the National Research and Innovation Agency (BRIN, formerly Indonesian Institute of Sciences, LIPI), Museum Zoologicum Bogoriense (MZB)) in Bogor, Indonesia and Tadulako University in Palu, Indonesia. Permits were granted by LIPI and RISTEKDIKTI (now BRIN) (23//SI/MZB/VIII/2018). Gunung Dako, located on the northern peninsula, has a summit at 2,260m; the majority of sites were located in the Gunung Dako Forest Reserve, but low elevations (< 500m) were located outside the reserve and have been converted to mixed agroforestry (Supplementary Materials, Figure 10). Gunung Torompupu, located in the central core, has a summit of 2,495m; this mountain borders the Lore Lindu National Park and has more intact lowland forest than Gunung Dako. The mountain we call Gunung Ilomata is also located on the northern peninsula, farther east of Gunung Dako; this mountain is located near Bogani Nani Wartebone National Park Sites and is part of a larger mountain range. Accessing higher elevations deeper into the park proved impossible and so the highest elevation sampled was 1,211m. The low elevation sites below 500m were in largely intact forest, contrasting the converted lowlands of Dako.
Sample collection
Sites of 20x20m were chosen at 400m intervals across the elevation gradient on each mountain, allowing thorough collection from distinct habitat types (Supplementary Materials, Table 2; Supplementary Materials, Figure 10). Elevation bands were defined as < 500m, 500-1,000m, 1,000-1,500m, 1,500-2,000m and > 2000m. Sites below 1,500m have lowland forest habitats with higher elevations displaying transitional forest type, and lower sites featuring higher anthropogenic disturbance. Above 1,500m, the habitat transitions to mossy forests. Two sites at each elevation band were chosen when possible. However, the number of sites at each elevation and the elevational range we were able to sample were dependent on local field conditions, accessibility, and time limitations.
Community sampling was conducted using standardized methods and were conducted by one individual. Each site was sampled using one hour of hand collection during the day, two hours of hand collection during the night, leaf litter sorting using a standardized volume of pre-sifted soil, timed beat sampling with 20 seconds of beating for seven distinct plant types, two pitfall traps and one hour targeted web documentation with subsequent specimen collection. No plant materials were collected in this study. Specimens were stored in 95% EtOH and, upon importation to the USA, samples were stored at -20°C until DNA extractions were performed after which samples were transferred to 70% EtOH and stored at room temperature.
Molecular Procedures
Juvenile specimens from each collection unit (sampling method and site) were combined into pools while adult taxa were grouped into morphospecies by unit. A female and/or male of each morphospecies was sequenced for each site individually. DNA was extracted from adult specimens from pulverized leg tissue or non-destructively by soaking the entire specimen in a solution of cell lysis buffer and Proteinase K for three hours, with volumes altered dependent on sample size. Pooled juvenile samples were extracted non-destructively by soaking in a 600 µL solution of cell lysis buffer and Proteinase K for three hours. Samples were incubated for only three hours to preserve morphological characters (Supplementary materials, Figure 11). Following lysing, DNA isolation was performed following the Qiagen PureGene protocol (Qiagen, Hilden, Germany) and eluted in 20µL of Elution Buffer. To increase the throughput of the protocol, all extractions were performed in 96-well plates. To test the success of extractions, DNA was spot-checked across each plate via NanoDrop (ThermoFisher Scientific, Waltham, MA, USA).
Amplification was performed using the Qiagen multiplex kit (Qiagen, Hilden, Germany). The mini-barcode region was amplified using the forward primer LCO1490 (5′‐GGTCAACAAATCATAAAGATATTGG‐3′) [38] and the reverse primer COI-CFMRb (5′‐GGNACTAATCAATTHCCAAATCC‐3′) [39], each which has the additional TruSeq tail for Illumina flow cell binding. The PCR reaction consisted of 5µL of the Qiagen PCR MasterMix (Qiagen, Hilden, Germany), 3µL of H20, 0.5µl of each primer, and 1µL of template DNA. Amplifications were conducted at an annealing temperature of 46°C for 30 cycles. A negative PCR control was included on each plate, consisting of Qiagen PCR MasterMix, H2O, and each primer. PCR products were visualized using gel electrophoresis on a 3% agarose gel, run at 140V. PCR products and associated DNA from specimens lacking bands were checked for quality using Nanodrop and Qubit and then re-amplified when deemed appropriate.
A dual indexing strategy was implemented using a second round of PCR to attach 8bp indexes to the TruSeq tails. The annealing temperature for indexing PCR was 55°C and ran for 6 cycles. PCR products were visualized once again using a 3% agarose gel, run at a lower voltage (100V) to allow clear visualization of fragment length to confirm the addition of indexes to each amplicon. Final libraries were constructed by pooling PCR products proportionally based on band strength, adding either 1µL, 2µL, or 3µL. Separate libraries were constructed for individually-extracted adult specimens and bulk-extracted juvenile samples. Final library preparation was conducted at QB3 Genomics (QB3, Berkeley CA, USA). Libraries were quantified using both qPCR and a Qubit Fluorometer (ThermoFisher Scientific, Waltham, MA, USA). Size selection was performed using Pippin Prep (Sage Science Inc, Beverly MA, USA) followed by Fragment Analyzer (Agilent, Santa Clara, CA USA) to confirm amplicon sizes. Pools were sequenced using Illumina MiSeq v3 300PE (Illumina Inc, San Diego, CA USA) on two lanes, combined with other libraries unrelated to this project.
Bioinformatics
Samples were demultiplexed by QB3 staff (QB3 Genomics, Berkeley, CA, RRID:SCR_022170”) using custom methods. CutAdapt was used for primer removal with the paired-end approach [40]. Following primer removal, sequence data were denoised using DADA2 as implemented in R [24]; DADA2 relies on error models that incorporate error data produced during sequencing and accurately produces amplicon sequence variants that retain fine-scale nucleotide variation. Following DADA2 processing, sequences were filtered to the expected length of 181bp and chimeras were removed using the removeBimeraDenovo method implemented in the DADA2 package. The function isContaminant in package decontam was used to remove potential widespread contamination based on sequences detected in negative controls [25]. This method assesses prevalence in controls versus positive samples to identify contaminants. The threshold parameter was set to 0.5, which removes any contaminant more prevalent in negative controls than in true samples. The LULU curation algorithm was performed on the resulting sequence data to remove additional erroneous sequences [26]. Each of the above methods was performed by sequencing lane to account for run-specific errors and subsequently combined
The retained sequences were aligned in Geneious Prime v. 2022.0.2 and manually assessed for reading frame interruptions or stop codons, representing pseudogenes; these sequences were removed. Sequences were then assigned taxonomy using megablast and any non-spider sequences were removed. A maximum likelihood phylogeny was constructed using IQTREE based on the alignment of spider sequences. The appropriate substitution model was determined using ModelFinderPlus [41]. The general time reversible model was selected, using the FreeRate model of rate heterogeneity and empirical codon frequencies (GTR+F+R7). 440 total iterations were run to reach parameter optimization. Sequences with low BLAST matches, and therefore unknown family identifications, were assigned a family identity based on their clade placement. Targeted morphological identifications were performed to inform the taxonomy of uncertain clades.
Individual sequence variants were clustered into operational taxonomic units using swarmv2 [27] with the --fastidious option. Unlike other clustering methods, swarm does not rely on a universal threshold and instead uses an iterative single-linkage clustering algorithm to create units. The fastidious option reduces under-grouping by collapsing low-abundance OTUs into larger “parent” OTUs. This method was used to create species-like units while ASVs, with their fine-scale nucleotide variability, were treated as haplotypes. Despite the multiple well-supported filtering methods we used, many samples still contained more than one sOTU. The prevalence of multiple low-abundance variants appearing in a sample is a well-known problem when using NGS approaches. To produce quantitatively informed cutoffs to use as universal filtering methods, we compared our morphological family assignments to the family identity associated with an sOTU that was found in the identified sample and assessed what were the strongest predictors of a matched sOTU classification. The number of reads for each sOTU in a sample and the proportion of the reads found within a sample to the sOTU reads found across the libraries showed strong relationships to correct or incorrect identities (Supplementary Materials, Figures 1 and 2). Cutoffs were chosen based on interquartile ranges, separated by incorrect and correct family matches, and these cutoffs were applied to all samples, including juvenile pools.
Statistical Analysis
To test if sOTUs found across mountains were more likely to consist of multiple haplotypes (ASVs), we performed Pearson’s Chi-squared test. To examine alpha diversity, we used three hill numbers: q = 0 which represents pure richness, q = 1 which is the exponential of Shannon’s entropy index, and q = 2 which is the inverse of Simpson’s concentration index. To look at the relationship between alpha diversity and elevation, we constructed three linear models for each Hill number, using elevation as a linear predictor. To examine beta diversity, we calculated phylogenetic beta diversity using the package BAT [42] using the maximum likelihood tree as input. Rarefaction was set to 20 to account for differences in abundance. We chose to use rarefaction due to skewed results between sites of similar elevations that had different sampling sizes; upon examining data, this seemed to be an artifact of abundance differences rather than a lack of similarity. The resulting mean beta diversity values were used for analysis.
We used NMDS as our ordination method to visualize group differences based on the mean total beta diversity with k = 2 and then tested for significant compositional differences by the categorical groupings of mountain and elevation using the ADONIS variant of PERMANOVA with adonis2 implemented in vegan [43]. Differences in group dispersions can make PERMANOVA unreliable so we additionally tested for dispersion differences using PERMDISP using the function betadisper (Supplementary Materials, Table 2). More aggressive clustering was performed using function otu in the package kmer [44] at 0.97 and 0.95 to represent coarser taxonomic units. Throughout this paper, these are referred to as 97% clusters and 95% clusters, while the swarmv2 species-like clustering results are referred to as sOTUs. Venn diagrams were constructed using the package ggvenn [45].
Analyses were run in R v4.2.2 and Rstudio v2022.12.0. Figures were made using ggplot2 and ggpubr and tables were made using stargazer. Code was written using the tidyverse. Further figure edits were made in Inkscape v1.2.