Functional Proling of Saliva Microbiome is Essential for Oral Cancer Prediction

Background: The association between microbiome and host disease has been documented in oral cancer, one of the leading cancers worldwide. Huge efforts are made to use the prole of oral microbiome and distinct signature species as markers to distinguish oral cancer patients from healthy individuals. The previous results, however, remain inconclusive. The assembly mechanisms of oral microbiome and their response to changes in oral carcinogenesis also have not been characterized. Here, using 16S rRNA gene amplicon sequencing and in-silico function prediction approaches, we analyzed the saliva microbiome in cohorts of orally healthy, oral verrucous hyperplasia, and oral cancer at taxon and function levels, and compared their corresponding predictive performance of oral cancer using machine learning algorithms. Results: Analyses of diversity and phylogenetic proles of bacterial communities in saliva showed that microbiome dysbiosis was signicantly linked to oral health status. As oral health deteriorated, the number of core species (>75% prevalence) as a percentage of overall species richness declined. In line with the null model-based analysis, taxonomic and functional assemblies of saliva microbiomes were primarily governed by the stochastic processes. Correspondingly, the quantitative assessment of partitioned beta-diversity suggested extremely high species turnover but low function turnover, revealing a functional redundancy of the oral ecosystem. Functional beta-diversity in salvia microbiome shifted from turnover to nestedness during carcinogenesis of oral verrucous hyperplasia, but this pattern was not observed at the taxon level. Moreover, using both taxon and function data as training features for machine learning-aided prediction on host health status supports a superior predictive performance when using functional proling. Similar results were also obtained and validated using publicly accessible data. Conclusions: Our results suggest that altered oral bacterial communities are highly associated with carcinogenesis of oral verrucous hyperplasia. Partly owing to high taxonomic turnover and stochastic assembly processes of the oral ecosystem, discovering oral microbial consortia as universal biomarkers for oral cancer may prove dicult and arduous. Functional proles are relatively stable and evolve a nestedness pattern during oral carcinogenesis, serving as a new benchmark to study the interplay of the oral microbiome and host health in the

and oral verrucous hyperplasia (OVH), that may precede OSCC development [6]. The presence of OPMD signi es a risk of malignant transformation, though the transformation rate varies, ranging from less than 1% to over 30% [6,7]. Smoking, alcohol, and betel quid chewing are major risk factors [8]. OVH is one of the most common precancers observed in the oral cavity of betel quid chewers and has the highest transformation rate (up to 21.3%) [6,7]. Therefore, it is crucial to identify prognostic factors for OSCC development in OPMDs, and saliva can serve as a non-invasive source of biological signatures for OSCC [9].
The microorganisms present in the human oral cavity are referred to as the human oral microbiome, and they play an essential role in the etiology of oral and systemic diseases [10]. Several previous studies have reported the association between oral microbiome and oral cancers, but the results have not always been consistent. For example, the genus Actinomyces was reported to be associated with tumor development in one study [11], but the opposite microbial pattern was identi ed in another study [12].
These inconsistent observations can be attributed to study design, confounding variables, sequencing artifacts, and/or taxonomic resolution [6,13]. Alternatively, the inconsistency can be explained by microbial "functional redundancy", de ned as the coexistence of different taxa that can perform the same function [14]. A recent systemic review of studies on the microbiome of OSCC patients reported that the tumor-associated microbiome presented similar functional potentials regardless of variations in taxonomic pro les [15]. Thus, taxonomic information in conjugation with functional pro les may shed some light on the variation of the oral microbiome. Continuous efforts to update analysis tools of the microbiome in terms of taxon and function (e.g., ner resolution of taxa [16,17] and up-to-date metagenome prediction software [18,19]) should provide opportunities to facilitate the understanding of the interplay between the microbiome and host health.
Microbial ecology concepts have been applied to microbiome studies to evaluate the differences in microbial communities separated by host phenotypes or health status. In exploratory studies of casecontrol microbiome analysis, researchers have hypothesized that microbiomes differ between host phenotypes, and distinct clusters have been generated when visualized in ordination spaces. However, the stressors produce substantial perturbations to microbes and drive the microbial community assembly into distinct patterns [20]. To decipher how microbiomes respond to stressors, we must rst learn the patterns of microbial community assembly and the underlying mechanisms shaping the microbiome.
The between-sample diversity (beta diversity) is often used to measure the differences between samples and can be disentangled into nestedness and turnover components: the former is a non-random process of species loss or gain, while the latter is the replacement of some species by others [21]. These patterns are microbial responses to the deterministic (e.g., competition, predation, and environmental ltering) or stochastic processes (e.g., birth, death, migration, and colonization) [22,23]. It is crucial to quantify the stochasticity of the microbial community assembling in an ecosystem, especially when the primary goal is to identify speci c taxa or consortium between phenotypes. However, for the oral microbiome, the aforementioned ecological characteristics and their responses to disease development, such as oral carcinogenesis, have not yet been studied.
In the present study, to understand the changes in microbial communities associated with oral health status, we assessed oral microbiomes of orally healthy (normal), OPMD (speci cally OVH), and OSCC cohorts through 16S rRNA amplicon sequencing at a single nucleotide resolution. To investigate the role of ecological patterns in healthy and diseased oral microbiomes, both taxonomic pro les and functional potentials were studied in terms of the dichotomy of beta diversity (nestedness and turnover), along with the stochasticity ratio. We validated our results with publicly available data using the same pipeline and then used machine learning techniques to predict oral health status.

Study participants and sample collection
Healthy participants were recruited from the oral screening activity, and saliva samples of OVH and OSCC patients were withdrawn from the biobank in Chi Mei Medical Center (CMMC) in Liouying, Taiwan, with the approval of the Institutional Review Board (No.10612-L02) of CMMC. The saliva collection complied with the following protocol. Participants were interviewed and instructed to refrain from eating, drinking, or using oral hygiene products for at least 1 hour prior to saliva collection, and to rinse their mouth with drinking water. Five minutes after oral rinsing, participants were instructed to spit into a 50 mL centrifuge tube, which was kept on ice, and were cautioned not to cough up sputum. A total of 5 mL of saliva was collected from each participant within a 30-minute time frame. Saliva samples were then centrifuged at 2,600 × g at 4 °C for 15 min. One milliliter of the supernatant was transferred to a new centrifuge tube for other research, and the rest of the saliva supernatant was treated with RNase Inhibitor (Ambion, Austin, TX, USA) and stored at − 80 °C for further analysis. The samples were processed and frozen within 30 minutes after collection. DNA extraction, PCR, and 16S rRNA gene sequencing Bacterial genomic DNA was extracted from saliva samples using a QIAamp DNA Mini Kit (Qiagen, Germany) according to the manufacturer's spin column protocol. The extracted DNA was ampli ed using a barcoded Bacteria-speci c primer set (341F/806R) that targets the V3-V4 hypervariable region of the 16S rRNA gene. The PCR amplicons were sequenced on a MiSeq platform (Illumina, USA) using v3 Chemistry Kits (2 × 300 bp). The detailed sequencing protocol has been described previously [24].

Bioinformatics analyses
16S rRNA sequence processing High-throughput amplicon sequencing data were analyzed on the QIIME 2 platform (v2019.4) [25]. After the primers at both ends were trimmed, raw sequences were quality ltered, denoised, merged, and chimera ltered using DADA2 to produce amplicon sequence variants (ASVs) [17]; the maximum number of expected errors was set at 3. The denoised ASVs with lengths outside the interval between 400 and 450 nt were excluded from the subsequent analysis. Taxonomic annotation of ASVs was performed by a customized classi er trained on the expanded Human Oral Microbiome Database (version 15.1) [26] using the q2-feature-classi er plugin with default settings.

Diversity analysis
Alpha and beta diversity indices were calculated using the q2-diversity plugin at a rarefaction depth of 43,313 reads per sample. A phylogenetic tree was constructed using the q2fragmentinsertion plugin for phylogenetic alpha (Faith's phylogenetic diversity) and beta diversity (UniFrac) measurements. Kruskal-Wallis rank-sum test was used to compare the differences between the alpha diversity indices among cohorts. The contribution of participant age, oral health status (healthy, OVH, and OSCC), and lifestyle factors (alcohol, betel nut, or cigarette consumption) were analyzed using Adonis with 9999 permutations. Distance-based permutational multivariate analysis of variance (PERMANOVA) was used to test the signi cant difference levels of the centroid of beta diversity metrics among cohorts in the ordination space of principal co-ordinate analysis (PCoA). For the observed signi cant PERMANOVA results, a permutational test for homogeneity of multivariate dispersion (PERMDISP) was then performed with 9999 permutations to determine the within-group homogeneity of dispersion. The Benjamini-Hochberg procedure was applied to control the false discovery rate (FDR) for multiple testing by statsmodels (0.10.2) [27]. To evaluate the respective contribution of turnover and nestedness components to beta-diversity as a whole, we calculated the multiple-site dissimilarity (Sørensen-based, β SØR ), and the partitioning dissimilarities that accounted only for turnover (Simpson-based, β SIM ) and for nestedness (β NES ) components, respectively [21].

Core microbiome analysis
Core microbiome analysis was performed using a customized Python script and visualized using matplotlib-venn (0.11.5). The feature table was rst converted to incidence data (presence/absence), and the prevalence of each taxon in each cohort was calculated. If the prevalence of a given taxon was greater than 75%, it was considered a core species in a cohort [28,29]. A Venn diagram was used to illustrate the distinct and shared core species between cohorts.

In silico metagenome prediction
The metagenomic content was developed in silico from the denoised 16S rRNA genes using PICRUSt2 [18]. HMMER (www.hmmer.org), EPA-NG, and GAPPA were performed to place ASVs into reference phylogeny [30,31]. The castor R package was subsequently used for hidden state prediction to infer gene family copy numbers [32]. Finally, the EC number abundances were predicted based on the adjusted gene family abundances. To infer pathway abundances, MinPath was applied to identify a set of minimum pathways based on the predicted gene families [33]. Default settings were used to regroup EC numbers to MetaCyc reactions and further inferred to MetaCyc pathway abundances [34]. Statistical testing of differential abundance Linear discriminant analysis (LDA) effect size (LEfSe) was applied to identify differentially abundant species and metabolic pathways among cohorts [35]. The input of the frequency matrix was rare ed to the same depth and then transformed into a relative abundance matrix. The signi cance level was 0.05 for the Kruskal-Wallis test, and the cutoff of the logarithmic LDA scores was 3.

Stochasticity ratio estimation
To evaluate the drivers of the microbiome assembly, the stochasticity ratio was estimated using the nullmodel-based index based on the Bray-Curtis dissimilarity metric, as described previously [36]. For the null model algorithm, proportional taxa/pathway occurrence frequency and richness were applied to generate random microbial/functional communities [37]. Samples in each cohort shared the sample regional taxa/pathway pool in the null model algorithm.
Public data acquisition, processing, and re-analysis Academic search systems, including Google Scholar and PubMed, were used to nd studies published in the last ve years (2015-2020) with the search terms "oral microbiome", "saliva microbiome", and "OSCC". We included 16S rRNA amplicon-based studies (Table S1) with publicly available sequences and metadata indicating OSCC or control for each sample. To compare the results, we only included studies with samples collected by non-invasive collection methods (oral swab, oral rinse, or saliva samples) and excluded studies that used tissue biopsies and those without sample metadata. Studies with the OSCC cohort consisting of both the oral cavity and oropharynx types were also included. The raw sequence processing, diversity analysis, and core species/metabolic pathway analysis were performed as described in previous sections.

Prediction using machine learning analysis
The QIIME2 q2-sample-classi er plugin [38] was used to predict sample health statuses based on taxonomic pro les and predicted functional pro les generated from this study, and the publicly available dataset was re-analyzed. Input data were randomly split into 80% for training and 20% for testing. The Random Forest classi er was applied for supervised machine learning. Cross-validated recursive feature elimination was applied for feature selection, with 5% of features eliminated at each iteration.
Hyperparameters were automatically tuned using a random grid search with 5-fold cross-validation. Based on taxonomic and functional pro les, respectively, we performed the analysis procedure 100 times with different random seeds and recorded the testing accuracy ratio for each iteration. The resulting accuracy ratio data was tested using an independent t-test to determine the statistical signi cance of the machine learning prediction results. Area under the receiver-operating characteristic curve (AUROC) metrics were calculated using the scikit-learn package [39]. For multi-class classi cation, micro-average was used. The data in each study was trained and validated separately to minimize the experimental batch effects.

Cohort descriptions and sequencing quality
Saliva samples collected from 75 male participants, including healthy controls (normal, n = 27), nonrecurrent OVH patients with > 8-year follow-up (2011 December-2019 November) (NRVH, n = 21), and patients having primary OVH followed by OSCC development within 8 years follow-up (OSCC, n = 27), were included in this study (Table S2). Statistical analysis of the participants' metadata (age and lifestyle factors) showed that the differences in the studied cohorts were signi cant for age between normal and OSCC cohorts, and for betel nut chewing habits between OVH and OSCC cohorts (Table S3). Illumina high-throughput sequencing generated a total of 14,261,633 raw sequences targeting the V3-V4 region of the 16S rRNA gene. After sequence denoising, 8,522,211 denoised reads were retained from 75 samples, with an average of 113,629 ± 33,379 high-quality sequences per library. The plateau rarefaction curves indicated that the sequencing depth was su cient for downstream analysis ( Figure S1).

Diversity analysis of salivary microbiomes
When multiple alpha diversity indices (observed ASVs, Pielou's evenness, Shannon index, and Faith's phylogenetic diversity) of the salivary microbial communities were compared among the cohorts, the observed ASVs were signi cantly higher (p = 0.044 before FDR adjustment) only between the healthy (192 ± 44) and the NRVH (164 ± 57) cohorts ( Fig. 1A and Table S4). The mean observed ASVs for the OSCC cohort (174 ± 53) were lower than that of the control cohort without statistical signi cance (p = 0.119 before FDR adjustment). However, no signi cant difference was found for all indices of alpha diversity after FDR adjustment (Table S4). To evaluate the effects of risk factors, including participant age, oral health status, and lifestyle on the changes of the microbial community, UniFrac distance-based Adonis analysis was performed. As shown in Figs. 1B and 1C, oral health status was detected as the strongest explanatory power (Adonis R 2 = 0.037 for unweighted UniFrac and 0.062 for weighted UniFrac), and was the only factor to signi cantly differentiate the cohorts (FDR-adjusted p < 0.05). Age and betel nut chewing habits, although exhibiting signi cant distinction between some cohorts (Table S3), were not signi cantly associated with changes in oral microbial communities. The UniFrac-based beta diversity distribution of salivary microbiomes from the cohorts was visualized using a PCoA plot (Fig. 1D-E), and showed random distribution on the ordination space. Pairwise PERMDISP analysis further con rmed that the dispersion effect was not found among cohorts based on the unweighted UniFrac metric (p PERMDISP = 0.1669); however, this effect was observed between normal and diseased (NRVH/OSCC) cohorts. In particular, the dispersion effect reached a signi cant level between the normal and NRVH cohorts (FDRadjusted p PERMDISP = 0.0362) in the weighted UniFrac distance measurement, suggesting heterogeneous dispersion of abundant taxa in salivary microbiota in correspondence with the oral health status.

Core salivary microbiome
To further compare the differences in salivary microbiomes among cohorts, we investigated the "core" species, de ned as the taxa commonly present in the saliva of each cohort with prevalence > 75% [28,29]. The number of core species was 55 (67.14 ± 11.06% of total abundance), 39 (47.24 ± 14.96% of total abundance), and 30 (44.52 ± 12.81% of total abundance) in normal, NRVH, and OSCC cohorts, respectively, and 24 species taxa were universal in the saliva samples from all cohorts, even when the oral health status altered ( Fig. 2A). Five species (Anaeroglobus geminatus, Porphyromonas gingivalis, Prevotella oulorum, Saccharibacteria (TM7) [G5] bacterium HMT-356, and Tannerella forsythia) were speci c to the NRVH cohort, while two species (Capnocytophaga sputigena and Catonella morbi) were speci c to the OSCC cohort. One species (Dialister invisus) was speci c to the NRVH and OSCC cohort. Interestingly, a decreased trend in overall core species richness (gamma diversity) was clearly observed with deteriorating oral health status (Fig. 2B): from 12.53% in the normal cohort to 6.34% in the OSCC cohort.

Differences in predicted metagenome among cohorts
We applied PICRUSt2, an updated version of a widely used metagenomic prediction tool [40], to infer the functional pro les of the microbial communities using the denoised ASVs. The nearest-sequenced taxon index (NSTI) of 81.68% of reads was less than 0.15 ( Figure S2A), suggesting the high-quality metagenome predicted [40]. LEfSe analysis identi ed 26, 7, and 24 inferred pathways that were signi cantly abundant speci c to normal, NRVH, and OSCC cohorts, respectively ( Figure S2B). By categorizing these pathways to higher classes, we found that most of them belong to amino acid biosynthesis (10 pathways) and cofactor, prosthetic group, electron carrier, and vitamin biosynthesis (8 pathways) for the normal cohort; while cell structure biosynthesis (5 pathways), fatty acid and lipid biosynthesis (4 pathways), and nucleoside and nucleotide metabolism (3 pathways for biosynthesis; 2 for degradation) were abundant in the OSCC cohort (Fig. 3). Only three pathways belonging to TCA cycles and nucleic acid processing were found to be signi cantly higher in abundance in speci c relation to the NRVH cohort.

Partitioning turnover and nestedness
To determine the ecological drivers shaping the salivary microbiomes among cohorts, we quantitatively compared the assemblage dissimilarities of the salivary microbiomes based on taxonomic and functional pro les. For assemblage assessment of taxonomic data, the species nestedness was 0.042, 0.058, and 0.043, (sustainably lower than the turnover: 0.822, 0.802, and 0.841) for the normal, NRVH, and OSCC cohorts, respectively (Fig. 4), showing that the differentiation of salivary microbiomes was in uenced by the species turnover. By contrast, the functional pro les of the salivary microbiomes were relatively stable: the mean multi-site Sørensen dissimilarity related to pathways (0.456 ± 0.044) was lower than that related to species taxa (0.869 ± 0.013) by approximately 50%. Notably, the numerical distributions of function nestedness (0.180, 0.187, and 0.295) and function turnover (0.291, 0.209, and 0.204) were relatively similar in each of the three cohorts (normal, NRVH, and OSCC, respectively), as compared to species-based Sørensen dissimilarity (Fig. 4). Conspicuously, the ratio of the function nestedness to turnover increased from 0.620 for normal, 0.896 for NRVH, to 1.449 for OSCC, suggesting that the effect of nestedness gradually dominates the functional differentiation of salivary microbiomes during the process of oral carcinogenesis.
To validate these ndings, a null model-based analysis of stochasticity with taxonomic and functional pro les was performed. Figure S3 shows that the stochastic process dominated the salivary microbiome assembly, with a higher stochastic ratio of inferred pathways (85.97 ± 16.73% to 93.04 ± 4.02%) than that of species taxa (61.88 ± 13.75 to 64.44 ± 13.11%). The distribution patterns of taxa-based stochastic ratio among cohorts were relatively similar. By contrast, a far wider distribution spectra for the function-based stochastic ratios were observed for the OSCC cohort than for the normal cohort. This nding supports a shift of the underlying mechanism of the functional alternation towards the deterministic process. Regardless of the observed high species turnover of saliva microbial communities, the dominance of functional nestedness in the OSCC cohort suggests that a set of distinct microbial functions, which may be associated with oral carcinogenesis, evolved accordingly in the oral cavity.

Comparisons with other studies
To generalize the idea of using the functional potential as a valid signature for oral cancer detection, we compared the taxonomic and functional diversity data of this study with those of previous studies. We included two studies for the comparison (see justi cations in Table S1). PCoA plots (Fig. 5A-B) based on Bray-Curtis distance metric revealed that the percentage of variance explained by the rst two axes was higher (47.06%) using functional pro les than that using taxonomic pro les (20.61%). The PERMANOVA results showed that at least one cohort was signi cantly different from the rest of the cohorts (namely, normal vs. NRVH and OSCC) in both taxonomic (p PERMANOVA =0.0001, Fig. 5A) and functional pro les (p PERMANOVA =0.0001, Fig. 5B). However, the PERMDISP results suggest heterogeneous dispersion in the taxonomic pro le (p PERMDISP =0.0061, Fig. 5A), but not in the functional pro le (p PERMDISP =0.1267, Fig. 5B). Although limited data regarding the pre-cancer cohorts are available from previous studies, the results shown here further support the concept that the functional pro le of the oral microbiota is more effectively related to the oral health status.
Because all the previous studies focused on the analysis of 16S rRNA sequences with a 97% similarity threshold using a clustering approach, many potential signature taxa were reported at the genus level. We re-analyzed the sequences with the same analytical pipeline to achieve phylogenetic resolution down to the species level. Although the core species (> 75% prevalence) were found in these studies, no taxon could be identi ed across all three studies for the normal and OSCC cohorts (Fig. 5C-D). Instead, we found a comparatively high number of pathways between studies when comparing functional pro les, although there was still no common pathway across studies obtained for the normal cohort (Fig. 5E). For the OSCC cohort, four pathways involved in nucleotide biosynthesis (UMP biosynthesis I) and cell structure biosynthesis (UDP-N-acetylmuramoyl-pentapeptide biosynthesis I and II, and peptidoglycan biosynthesis I) were prevalent across the studies (Fig. 5F).

Machine learning to predict oral health status
To further study the discriminability of taxonomic and functional pro les for predicting the occurrence of oral cancer, we trained random forest models with both pro les separately. The datasets from each study were processed independently in accordance with the analytical process illustrated in Fig. 6A. The mean accuracy ratio (the mean of the ratios of predicted accuracy to the accuracy of random guess, 100 iterations) was higher when using the functional pro le compared to when using the taxonomic pro le ( Fig. 6B and Figure S4), although statistical signi cance was detected in only two of the three datasets (Wolf et al. and Zhao et al., with p = 0.043 and 0.001, respectively). In addition, we bootstrapped the receiver-operating characteristic (ROC) analysis 100 times and demonstrated that, for the three datasets, AUROC was higher when functional pro les were used to distinguish OSCC samples from healthy controls, compared when using taxonomic pro les (Fig. 6C-D).

Discussion
In this study, we found that altered oral microbiomes were signi cantly associated with oral health status. A number of previous studies have reported the association between the change of oral microbiome and oral health status, as well as risk factors including betel nut chewing, cigarette smoking, and alcohol consumption [41][42][43][44]. A recent investigation of multiple factors that potentially affect the oral microbiome assemblage between nasopharyngeal carcinoma patients and healthy controls showed that differences in oral microbial communities were most associated with oral health status and smoking [45].
In agreement with previous studies, our ndings also showed that oral microbiome may be linked to the progression of OVH carcinogenesis, and the statistical signi cance was especially strong when more weights were put on rare ASVs (unweighted UniFrac measure). Cigarette smoking and other risk factors like betel nut usage and alcohol drinking, however, were not the main contributors to differences in oral microbiomes in the present study. Additionally, although the within-group diversity was not signi cantly different among cohorts, a decreased trend of core microbiome from healthy or OVH to OSCC was observed when the total diversity of each cohort (gamma diversity) was considered. Consistent with our results, previous studies have reported that 9.6-13.1% of detected bacterial species are common among healthy individuals [28,29]. Gamma diversity can be an index related to oral health status. Together, our ndings showed dysbiosis of the core salivary microbiome as the patients developed OSCC, implying that the salivary microbiome is linked to oral health status.
More speci cally, several enriched species were associated with the health status of the oral cavity.
Previous studies [12,[46][47][48] have reported that the bacterial genera Lautropia, Rothia, Streptococcus, and Veillonella were enriched in the healthy control or pre-cancer lesion groups, whereas the genera Fusobacterium, Capnocytophaga, Dialister, Selenomonas, Parvimonas, Peptostreptococcus, and Treponema exhibited higher abundance in the OSCC cohort. In line with these studies, we also identi ed signature species belonging to the genera Rothia, Streptococcus, and Veillonella within the non-cancer cohorts, while other species belonging to Capnocytophaga and Parvimonas were more abundant speci cally in the OSCC cohort. However, several species (Prevotella melaninogenica, Streptococcus salivarius, and Rothia mucilaginosa) that were highly abundant in patients with OPMD or OSCC [49][50][51] are identi ed as signatures for the healthy cohort in our study. Besides, many genera/species found in the present study have not been previously reported, and vice versa. This inconsistency between studies could be attributed to the experimental design (e.g., sample types or hypervariable regions of the 16S rRNA gene [52]), the bioinformatics analysis pipeline (e.g., sequence denoising approaches [53] and reference databases [54]), the genetics of studied cohorts (e.g., racial factors [55]), and the complexity of oral carcinogenesis [56]. Alternatively, our results suggest that the inconsistencies may be due to the extremely high species turnover of oral ecosystems, and the dominance of the stochastic process in shaping microbial communities in saliva ( Fig. 4 and Figure S3). As revealed by the component partitioning of dissimilarity in the present study, microbes in saliva are subject to a high population dynamic [57], representing an extraordinarily dynamic ecosystem; consequently, detecting bacterial species as stable or universal signatures to re ect the oral health status can be challenging.
We found that the stable function pro les remained distinguishable, despite the high variation and uctuation of the microbial populations. Previous studies have illustrated that microbial communities of the oral cavity are highly dynamic and individual-speci c [58][59][60]. This is not surprising; the oral cavity represents an open environment that is exposed to exogenous bacteria and rapid changes in environmental factors in addition to host effects [61]. The stable function pro les but highly varied species composition in the salivary ecosystem may be accounted for by functional redundancy that has been proposed for other microbial ecosystems, such as soil and the human gut [62,63].
To our knowledge, this is the rst report to disentangle the contribution of the turnover and nestedness of both taxonomic and functional compositions in three different states of oral cavity ecosystem (orally healthy, OPMD, and OSCC). Notably, the primary mechanism contributing to the differences in functional pro les shifted from turnover to nestedness in accordance with OVH carcinogenesis (Fig. 4). Since nestedness re ects the effect of environmental ltering [64,65], a possible explanation is that the diseased status of the oral cavity (OVH and OSCC) was a more in uential environmental ltering factor than the healthy one, resulting in the loss of speci c metabolic niches and an increase in nestedness components during the shift from healthy to diseased status. This can be further supported by the nullmodel-based stochasticity analysis ( Figure S3) that revealed a more deterministic in uence, which has been linked to the nestedness effect [66], in the metabolic pathway pro les in the NRVH and OSCC cohorts compared to healthy controls.
The re-analysis of OSCC-associated oral microbiomes from publicly accessible studies reinforced the use of functional pro les for more consistent results. The relationship between oral microbiome and OSCC remains uncertain and has not been intensively studied. To circumvent various technical challenges that potentially yielded different results and inconclusive interpretations, we re-analyzed the raw sequences with the same bioinformatic analysis pipeline, which generated more comparable results across different studies, as shown in the function-based PCoA ordination (Fig. 5). These more comparable results may be partly attributable to the bias (due to different hypervariable regions of 16S rRNA) being alleviated by inserting high-resolution amplicons into a reference phylogenetic tree [67], an improved inference process included in PICRUSt2. However, the accuracy of the taxonomic classi er based on phylogenic inference was varied to some extent; meanwhile, precisely comparing the ASVs of different sequence regions in taxonomy remains challenging [68]. Thus, it is not surprising to observe a within-group dispersion effect based on taxonomic pro les and to identify no common core taxon among three studies in healthy control and OSCC cohorts, respectively. No common core metabolic pathway was identi ed in healthy controls either. This is counterintuitive; oral microbiota should possess a number of essential common metabolic pathways for survival [69]. Likely, the dietary patterns and lifestyles of the control groups from different geographical regions can serve as a stochastic factor that drives the functional pro les to be dissimilar to one another [70]. As for the functional pro les in the OSCC cohorts, the common core pathways (nucleotide biosynthesis and bacterial cell structure) are related to generate building blocks for DNA and RNA synthesis as well as cell wall formation. Unlike the pathways for metabolites such as fatty acids and amino acids, which can be utilized exogenously by the host cell to generate complex hostmicrobe interactions [71,72], the enrichment of these common pathways could be the consequence of functional alterations of microbial communities in response to changing oral health status. Further study is required to investigate the relationship between enriched metabolic pathways and the host. Raw sequence sharing, along with completely publicly available metadata, will also enable to reproduce, compare, and validate results across studies through meta-analysis. This is especially crucial for hard-toobtain disease-associated samples (e.g., OPMD and OSCC in the present study) in order to achieve clinical application of disease progression prediction.
In several previous studies, machine learning-aided models were trained for disease prediction using taxonomic pro les derived from 16S rRNA genes [73][74][75]. Since the saliva microbiome was characterized as a highly dissimilar, high species turnover, and stochasticity-dominated entity ( Fig. 4 and Figure S3), using taxonomic data as input features for classi cation and prediction tasks can be suboptimal.
However, the functional potential of the saliva microbiome was much more similar and displayed a nestedness pattern, where deterministic processes gradually dominated as the oral health status deteriorated. In this study, we demonstrated that predictive accuracy could be improved by "transforming" the taxonomic data into functional pro les with a widely used metagenome prediction tool. The consistently higher mean accuracy using functional pro les as the input feature could be primarily attributed to the lower variation of the pathway data. Although it has long been discussed whether functional pro les outperform taxonomic pro les for microbiome classi cation tasks [76], the predicted functional pro les from OSCC-associated individuals were reported similar despite the variation of the taxonomic pro les [15]. Besides, a recent shotgun metagenomic study [77] also supports our results by discriminating healthy controls from dental caries cohorts more effectively using pathways than taxa. In contrast to our ndings, another study with gut microbiome datasets [78] showed a worse performance from function-based searches compared to those using the taxa-based strategy. This con ict may be due to the lower stability of oral microbiomes relative to gut microbiomes [79]. Also, the updated metagenome prediction tool improved its accuracy by allowing genome prediction via ASVs, expanding the database of reference genomes, and employing a more stringent method of generating pathway-level functional pro les [18]. These technical improvements make it successful for predicting phenotypes via functional pro les in our study. One limitation of using predicted functional pro les is the loss of function data attributed to a fraction of microbes without genomic information. This can be improved in the future by incorporating both metagenomics and culturomics to expand the microbiome database [80].

Conclusions
Overall, this study has revealed that altered oral microbial communities are highly associated with oral health status (normal, OVH, and OVH-associated OSCC). From the perspective of microbial ecology, because the oral ecosystems are high taxonomic turnover (i.e., high variance and uctuation), any attempt of discovering oral microbial consortia as biomarkers for oral cancer would be a daunting task. Compared to taxonomic pro les, functional pro les are relatively stable and display a nestedness pattern in the process of oral carcinogenesis, thus serving as a suitable alternative for OSCC signatures. Machine learning models yielded a better mean predictive performance by using functional pro les as training data when compared to using taxonomic pro les. The re-analysis of datasets from the published microbiomes further con rmed the feasibility of using functional pro les for OSCC prediction. This function-based method serves as a new benchmark to study the interplay of the oral microbiome and host health in the future. Further studies are warranted to elucidate the association and causation between the ecological pattern of microbiota and the host, as well as how to improve the predictive performance by understanding the ecological assemblage of the microbiome.      Distribution of signature pathways. The signature pathways, which abundances are signi cantly higher concerning each studied cohort, are detected using LEfSe. The inferred pathways are collapsed to each category based on Metacyc's pathway ontology. Colored boxes indicate a higher rank of the categories. Distribution of signature pathways. The signature pathways, which abundances are signi cantly higher concerning each studied cohort, are detected using LEfSe. The inferred pathways are collapsed to each category based on Metacyc's pathway ontology. Colored boxes indicate a higher rank of the categories. Multiple-site beta diversity (Sørensen dissimilarity) and corresponding nestedness and turnover components. The dissimilarities were analyzed in terms of species and metabolic pathway pro les in each cohort.

Figure 4
Multiple-site beta diversity (Sørensen dissimilarity) and corresponding nestedness and turnover components. The dissimilarities were analyzed in terms of species and metabolic pathway pro les in each cohort.

Figure 5
Comparison of diversity and core analysis between taxonomic and functional pro les from different studies. (A-B) Principal coordinate analysis (PCoA) plots of (A) taxonomic and (B) metabolic pathway pro les based on Bray-Curtis distance metric from previous studies and this study. The dispersion effect was found between normal and OSCC cohorts (FDR-adjusted pPERMDISP = 0.0018) using taxonomic pro les. The percentages of explained variance are displayed along with each principal coordinate axis.
(C-F) Venn diagrams reveal common core species/pathways (prevalence > 75%) in normal and OSCC cohorts, respectively. (Taxonomic pro les of normal (C) and OSCC (D) cohorts; metabolic pathway pro les of normal (E) and OSCC (F) cohorts).  Evaluating functional pro le as an alternative signature for OSCC detection using machine learning. (A) Flow chart illustrating the process of obtaining predicted results from published sequences. Blue boxes indicate input or output les, and red boxes indicate bioinformatics methods or pipelines. (B) The mean accuracy ratios of 100 iterations of the randomly split dataset (80% training and 20% testing) based on taxonomic and functional pro les. The accuracy ratio is de ned as the predicted accuracy to the accuracy of a random guess. Vertical bars indicate 95% con dence intervals. (C-D) The 2D-density plots of ROC curves from 100 iterations demonstrate a higher mean AUROC using (C) functional pro les to distinguish OSCC from normal cohorts compared to that using (D) taxonomic pro les. Evaluating functional pro le as an alternative signature for OSCC detection using machine learning. (A) Flow chart illustrating the process of obtaining predicted results from published sequences. Blue boxes indicate input or output les, and red boxes indicate bioinformatics methods or pipelines. (B) The mean accuracy ratios of 100 iterations of the randomly split dataset (80% training and 20% testing) based on taxonomic and functional pro les. The accuracy ratio is de ned as the predicted accuracy to the accuracy of a random guess. Vertical bars indicate 95% con dence intervals. (C-D) The 2D-density plots of ROC curves from 100 iterations demonstrate a higher mean AUROC using (C) functional pro les to distinguish OSCC from normal cohorts compared to that using (D) taxonomic pro les.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.