Participant recruitment and dust collection
Floor dust samples were collected from nine different non-moisture damaged homes across the US from May 2021 to November 2021. Overall methods are shown in Fig. 1. Three homes were from Ohio and the remaining six were homes from six different states in the US (Table S1, Fig. 2). Due to COVID-19 restrictions, we used an online approach for participant recruitment and instructions for dust collection by participant. Using participant collected dust as a surrogate for collection by a project staff has shown to be equally effective for studies reporting allergen concentration in dust [40, 41]. Participants were initially recruited via social media and additional recruitment and screening were completed over email.
A Qualtrics survey (Qualtrics, Provo, UT) containing the consent form, as well as questions on relevant home and indoor environmental measures were used for screening participants. Participants were asked if there was any evidence of present water damage, moisture, leaks (such as damp carpet or leaky plumbing) or visible mold inside their homes. If participants answered in the affirmative, then these homes were not recruited for our study. The survey also contained information about the floor area and flooring type that was vacuumed, the frequency of vacuuming, types of floor cleaning, the number of occupants (adults and children), number of pets (dogs, cats, birds, and other furry pets) as well as any prior history of moisture damage and mold in participants’ homes within the last five years.
One home located in Texas was initially recruited but due to consistently low quality of the extracted RNA, the dust was not including in sequencing and was excluded from this study. Two of the homes had potential moisture damage even though the participants answered in the negative to “Is there evidence of water damage, moisture, or leaks (such as damp carpet or leaky plumbing)?” (Table S1). One home (Dust sample ID: KS, Table S1) reported to have a temporary leak that occurred after heavy rains and was gone within 24 hours and the other reported to have a leak more than 10 years ago (Dust sample ID: WA, Table S1). These two samples were not excluded because they did not meet the extent of moisture damage necessary for exclusion criteria due to the extent of the damage and length of time since the damage, respectively.
Dust collection instructions were sent to the participants over email. Participants were asked to collect floor dust (> 25 g), emphasizing collection from the main living areas inside their homes (living room and bedroom) using their home vacuum. If the home vacuum did not contain a vacuum bag, participants were asked to remove dust from the canister and place it in a zip top bag. Participants were then asked to ship their collected dust to our lab or have it dropped off to a designated location for us to pick up. Once we received the dust, all dust was screened to eliminate for the presence of SARS CoV-2, using a previously described protocol [42] and no dust samples were excluded. Recruitment and dust collection procedures were approved by The Ohio State University Behavioral Institutional Review Board under study number 019B0457 for the duration of the study.
The collected dust was then hand-sieved to 300µm to remove larger sized dust particles and was stored at 25°C prior to chamber experiments. Dust was never frozen to maintain intact microbial communities.
Chamber Experiments
For the chamber experiments, 100 mg aliquots of sieved dust were incubated in glass chambers at 25°C for a period of one week, at relative humidities of 50%, 85% and 100% ERH [16]. A total of 27 dust samples were incubated (9 sites x 3 ERH conditions). Relative humidity levels in the glass chambers were maintained using salt solutions or distilled water, as detailed in previous work [20]. 50% and 85% ERH were maintained by using salt solutions with water activities of 0.5 aw and 0.85 aw, respectively, and distilled water was used to maintain an ERH of 100%. The water activities of these salt solutions were tested for accuracy using an AquaLab™ Dew Point Water Activity Meter (Decagon 125 Devices) with a margin of error of +/−0.005.
RNA extractions and nucleic acid sequencing
Immediately following the one-week incubation, RNA was extracted from incubated dust using a previously used modified protocol of the Qiagen RNeasy PowerMicrobiome extraction kit (Qiagen, Hilden, Germany) [35]. To prevent RNA degradation from RNases, the manufacturer’s protocol was modified to use 10x the concentration of β-mercaptoethanol in the first step and 70% ethanol in place of PM4 in the RNA binding step. Extracted RNA was immediately frozen at -80°C prior to use and transported on dry ice.
To ensure high RNA quality and integrity, all RNA extracts were analyzed using the High Sensitivity RNA ScreenTape analysis on the Agilent 4200 TapeStation Bioanalyzer (Agilent, Santa Clara, CA, USA) at The Genomics Shared Resource Center (The Ohio State University Comprehensive Cancer Center Shared Resources, Columbus, OH, USA).
RNA extracts were then sent to the Yale Center for Genomic Analysis (Yale University, New Haven, CT, USA) where they were reverse transcribed and then sequenced on a NovaSeq 2x100 lane with 25 million reads per sample. RNASeq library preparation was performed using the NEB Next Single Cell/Low Input RNA Library Prep Kit (New England Biolabs, USA) and the NEB Ultra II FS (New England Biolabs, USA) workflow for Illumina. The polyA selection protocol was used to select for eukaryotic mRNA. Sequence data was submitted to GenBank under accession number PRJNA1072816.
Initial processing, metatranscriptome assembly and transcript quantification
Processing of sequenced reads followed protocols previously described [34, 35]. FastQC (v.0.12.0) was used for quality assessment of sequences [43]. rCorrector (v.1.0.6) was utilized to correct erroneous k-mers created due to sequencing errors [44]. After correction, reads deemed unfixable by rCorrector were filtered out using the TranscriptomeAssemblyTools package [45].
De novo metatranscriptome assembly was conducted using Trinity (v.2.12.0) [46] with default settings and was run on the Ohio Supercomputer (Ohio Supercomputer Center, Ohio). Trimmomatic within the Trinity pipeline was used to remove poor quality reads and contigs with a length less than 300 base pairs (bp) [47, 48]. Contigs from the Trinity assembly were clustered using CD-HIT-EST (v.4.8.1) based on 80% sequence similarity [49, 50]. These clusters from CD-HIT-EST represent all expressed contigs and constitutes the full transcriptome.
Abundance estimation and alignment were run within the Trinity pipeline with default parameters. RSEM (v.1.3.3) was used to estimate transcript abundance in each sample and to determine transcript-level expression counts of the RNA-Seq fragments for each transcript using alignment-based quantification [51]. Bowtie2 was used to align the quality trimmed paired-end reads after Trimmomatic to the full transcriptome created using CD-HIT-EST [52]. Read coverage was then quantified using Samtools to capture read alignment statistics for concordant read pairs (yielding concordant alignments 1 or more times to the CD-HIT-EST transcriptome) with a MAPQ greater than 2.
Transcript-level abundance estimates were used to construct a matrix of counts and a matrix of normalized expression values. Normalized expression values include Counts Per Million (CPM), Transcripts per Million (TPM) [53] and Trimmed Mean of M-values (TMM) [54] and account for transcript length, number of reads mapped to a transcript, total number of reads over all transcripts and library size (sequencing depth). Gene-level count and gene-level normalized expression matrices were calculated using txImport [55] implemented directly in the Trinity pipeline.
Differential expression analysis
DESeq2 was used within the Trinity pipeline to perform Differential Gene Expression (DGE) analysis of expressed genes [56]. DGE performed using gene-level counts was used for downstream target gene identification. Performing differential expression analysis on gene levels, in addition to contig or transcript levels, improves interpretation of annotated contigs and potentially increases statistical power [57]. Pairwise comparisons between the three ERH conditions (50%, 85% and 100%) were performed, giving rise to six pairwise ERH comparisons. Genes that were most differentially expressed based on the most significant False Discovery rate (FDR) [58] (FDR-adjusted p ≤ 0.001) and log2FC (log2 fold change) values (log2FC ≥ 2) were extracted and used for subsequent Gene Ontology (GO) enrichment analysis.
Functional annotation and Gene Ontology enrichment
Transcripts were annotated using Trinotate (v.3.2.2), designed for comprehensive functional annotation of de novo transcriptomes [59]. Trinotate integrates all functional annotation data into an SQLite database, which is used to create a whole annotation report for the transcriptome. For functional annotation, Trinotate used BLAST + sequence homology search of transcripts and predicted coding regions against the SwissProt database [60, 61] and protein domain identification using a HMMER (v.3.3.2) search against the PFAM database [62, 63]. Predicted coding regions were identified using TransDecoder (v.5.5.0) that utilizes a minimum length Open Reading Frame (ORF) found in a transcript sequence [64]. The TrEMBL/SwissProt database was used for Gene Ontology (GO) and KEGG assignments of transcripts using Trinotate [59, 65, 66]. KEGG assignments for genes were analyzed using the KEGG Mapper tool to identify the number of metabolic pathways [67] and visualized using the iPath3 tool [68] as metabolic pathway maps.
GOseq, developed specifically to account for gene length bias in RNA-seq data, was used within the Trinity pipeline to perform functional GO enrichment testing [69]. Results from the GO enrichment was analyzed for enriched GO categories based on significance of enrichment using FDR values and the number of DE genes within these GO categories at each pairwise ERH comparison.
Identifying potential target genes associated with fungal growth at high moisture
GO enrichment was performed on the most highly significant and differentially expressed genes with a cutoff of FDR-adjusted p ≤ 0.001 and log2FC ≥ 2. GO enrichment results were then analyzed for GO terms associated with fungal growth that were significantly enriched at higher moisture conditions (FDR < 0.05). Higher moisture conditions comprised of GO terms enriched at 100% compared to 85% ERH, enriched at 100% compared to 50% ERH and enriched at 85% compared to 50% ERH. Finally, genes upregulated within these GO categories associated with fungal growth at higher ERH and having a known fungal annotation (BLASTX) were used to identify genes as potentially targets that are indicative of mold growth.
Target genes were chosen based on the criteria that (i) genes are upregulated at high ERH conditions: upregulated at 100% compared to 85% ERH, at 100% compared to 50% ERH or at 85% compared to 50% ERH (ii) genes have a log2FC ≥ 5 (iii) genes are upregulated (expressed) in at least two-thirds of sampling sites (n ≥ 6, out of a total n = 9 locations) and (iv) genes are not expressed at the 50% ERH condition in any sample. Counts in the 0–10 range are usually considered ‘noise’ [70] and therefore, the target genes were required to have a count < 10 at 50% ERH. Exceptions were made for some genes that did not meet criteria (iii) if the gene was essential for fungal growth based on prior knowledge, and genes (n = 15) upregulated in at least three sampling sites were included based on this exception. Criteria (iv) was included because using such genes as a marker for moisture would be simpler in that they can be quantified without being dependent on increases in abundances/counts and would not need to be compared to a baseline level. If a gene was upregulated in more than one ERH comparison (for example, upregulated at 100% compared to 50% and at 100% compared to 85%), then the largest log2FC value was used and the ERH comparison corresponding to the log2FC value was reported (Table S10).
To analyze how genes performed when compared to fungal taxa (species and genus) as targets of high moisture conditions, we applied similar criteria to fungal taxa. Fungal taxa were analyzed based on the criteria that they are (i) more abundant at high ERH conditions: upregulated at 100% compared to 85% ERH, at 100% compared to 50% ERH or at 85% compared to 50% ERH (ii) more abundant in at least two-thirds of sampling sites (n ≥ 6, out of a total n = 9 locations) and (iii) not expressed at the 50% ERH condition in any sample.
Species identification in samples
50 mg of sieved dust, identical to those used for RNA extractions, were used for DNA extractions. Dust samples were incubated for 1-week at 50%, 85% and 100% ERH (similar to RNA extractions), prior to DNA extraction (n = 27, 9 sites x 3 ERH conditions). DNA extractions were performed using the Maxwell RSC PureFood GMO and Authentication Kit (Promega, USA). After addition to CTAB buffer, proteinase K, and RNAse, a 5 min bead beating step (BioSpec Products, Inc., Bartlesville, OK, USA) was added to release spore contents where bead tubes contained 0.3 g of 100 µm glass beads, 0.1g of 500 µm glass beads, and 1 g of garnett particles (all BioSpec Products Incl., Bartlesville, OK, USA) [71]. Further protocol modification included 1) 30 min of room-temperature incubation between beat beating and centrifugation and 2) reduction of eluate volume to 75 µL. DNA extracts were stored at − 20°C. Fungal concentration in DNA extracts was measured using a qPCR assay targeting the 18 S rRNA gene with the universal fungal primer pair FF2/FR1 [72]. QPCR reagents, standards, and cycling parameters were described previously [20]. Next-generation DNA sequencing was performed at Research and Testing Laboratory (Lubbock, TX, USA) using an Illumina MiSeq with 2x300 bp chemistry. The adapters ITS1F (5’ – CTTGGTCATTTAGAGGAAGTAA – 3’) and ITS2aR (5’ – GCTGCGTTCTTCATCGATGC – 3’) were selected to amplify ITS1 region [73]. Raw sequence data is archived in GenBank (PRJNA1072816).
A DADA2-based bioinformatics pipeline customized for ITS sequences [74] was run using R [75] on Ohio Supercomputer (Ohio Supercomputer Center, Ohio). Adapters were first removed using Cutadapt [76], BioStrings [77], and ShortRead [78]. Denoising was performed using DADA2 [79] where the maxEE and truncQ parameters of the filterAndTrim function were both set to eight following Rolling et al. [80]. The UNITE version 9.0 database [81] was used for taxonomic identification. Absolute abundance of organisms in the samples was determined as described previously [36, 82].
Statistical Analysis
The statistical analysis software R (v.4.2.2) [75] was used to perform statistical testing. To compare gene expression profiles based on moisture condition, relationships between samples were analyzed using Principal Component Analysis (PCA). Gene expression values in Counts Per Million (CPM) that account for library size normalization were used for PCA. Log2 transformed and mean-centered standardization, typically applied in gene expression studies, were performed prior to analysis to reduce bias towards highly expressed transcripts [56, 83]. PCoA was performed for relative abundance of fungal taxa (Amplicon Sequence Variants (ASVs), species and genus) using Bray-Curtis distances. We also performed PCoA using Aitchison distances to compare gene expression and fungal taxa abundances. Aitchison distance is Euclidean distance after a centered log ratio (clr) transformation of data and uses relative abundances that is suitable for compositional microbiome data [84]. This ensures that the metric for sample distances is similar for both gene expression and taxa, for a more even comparison as PCA is identical to Principal Coordinates Analysis (PCoA) when using Euclidean distances. The adonis2 function in R using the vegan package (v.2.6.4) [85] was used to determine statistical significance of ERH groupings (p < 0.05) from the Euclidean, Bray-Curtis and Aitchison distance matrix. The test employed 10,000 permutations and used FDR [58] to adjust for multiple comparisons. Significance was defined at FDR-adjusted p < 0.05 [86]. A 95% confidence ellipse using the stat_ellipse function within the ggplot2 package (v.3.4.3) [87] was created to compare moisture conditions to each other.
The Spearman rank correlation coefficient was calculated using the corrplot package [88] for differentially expressed genes based on moisture condition. False Discovery Rate was used to adjust for multiple comparisons and only the correlation coefficients that were significant (FDR-adjusted p < 0.05) were considered. The Spearman rank correlation coefficient determines the strength and direction in the relationship between the data where a value of 1 indicates the strongest positive correlation.
Gene expression heatmaps were plotted using the ComplexHeatmap [89] package in R, based on TMM-normalized (Trimmed Mean of M-values normalized) expression values for direct comparison of gene expression across samples [54]. Log2 transformed and mean-centered standardization were performed prior to analysis to reduce bias towards highly expressed transcripts [56, 83].
To identify species with differences in absolute abundances between the ERH levels, the Kruskal-Wallis test was first performed to determine significant difference (p < 0.05), followed by pairwise Wilcoxon rank sum test using FDR [58] to control for multiple comparisons. To determine significant differences between the number of fungal genes present by ERH condition, Kruskal-Wallis test followed by pairwise Wilcoxon rank sum test was performed, with FDR to adjust for multiple comparisons. These tests were also used for determining differences in absolute fungal concentrations based on ERH condition. Kruskal-Wallis tests were used as a non-parametric alternative to ANOVA when the data was determined to not be normally distributed (Shapiro-Wilk p < 0.05) [90].
Visualization
Figures in the manuscript were generated using R scripts (v.4.2.2) [75], scripts within Trinity [46], ggplot2 (v.3.4.3) [87], ComplexHeatmap (v.2.15.1) [89] for heatmaps (Fig. 6 and Supplementary Figure S13), corrplot (v.0.92) [88] for Supplementary Fig S6, iPath3 [68] for global metabolic pathway map (Fig. 4, Supplementary Fig S9 and S10), Canva (https://www.canva.com) for generating a GIF image, Adobe Illustrator (v.28.3) (http://www.adobe.com/au/products/illustrator.html) for Fig. 1 and Inkscape (v.1.3) (https://inkscape.org/) for Fig. 1 and finalizing other figures. The map in Fig. 2 was created using the R packages ggplot2 (v.3.4.3) [87], maps (v.3.4.1) [91] and ggmaps (v.3.0.2) [92] and further finalized using Inkscape.