Insights from shotgun metagenomics into bacterial species and metabolic pathways associated with NAFLD in obese youth

Abstract Nonalcoholic fatty liver disease (NAFLD) is the most common form of liver disease and is often the precursor for more serious liver conditions such as nonalcoholic steatohepatitis and cirrhosis. Although the gut microbiome has been implicated in the development of NAFLD, the strong association of obesity with NAFLD and its effect on microbiome structure has made interpreting study outcomes difficult. In the present study, we examined the taxonomic and functional differences between the microbiomes of youth with obesity and with and without NAFLD. Shotgun metagenome sequencing was performed to profile the microbiomes of 36 subjects, half of whom were diagnosed with NAFLD using abdominal magnetic resonance imaging. Beta diversity analysis showed community‐wide differences between the groups (p = 0.002). Specific taxonomic differences included increased relative abundances of the species Fusicatenibacter saccharivorans (p = 0.042), Romboutsia ilealis (p = 0.046), and Actinomyces sp. ICM47 (p = 0.0009), and a decrease of Bacteroides thetaiotamicron (p = 0.0002), in the NAFLD group as compared with the non‐NAFLD group. At the phylum level, Bacteroidetes (p < 0.0001) was decreased in the NAFLD group. Functionally, branched‐chain amino acid (p = 0.01343) and aromatic amino acid (p = 0.01343) synthesis pathways had increased relative abundances in the NAFLD group along with numerous energy use pathways, including pyruvate fermentation to acetate (p = 0.01318). Conclusion: Community‐wide differences were noted based on NAFLD status, and individual bacterial species along with specific metabolic pathways were identified as potential drivers of these differences. The results of the present study support the idea that the NAFLD phenotype displays a differentiated microbial and functional signature from the obesity phenotype.


INTRODUCTION
Nonalcoholic fatty liver disease (NAFLD) is the most common hepatic complication of obesity in children and adolescents, with an estimated 34% of obese youths affected by this disease. [1][2][3] Moreover, data based on histological specimens have shown that inflammation and fibrosis may occur earlier in life in patients who develop NAFLD during childhood, [4] making them more susceptible to develop liver failure at a younger age. [5] Although NAFLD is a common complication of obesity in youth, the reason why some patients are susceptible to the disease while others never develop it remains unclear. Nutritional and genetic factors certainly play a key role in conveying susceptibility to NAFLD, [6][7][8] but other factors may drive individuals to develop this condition. In particular, it has been suggested that the composition of the gut microbial community may play a pivotal role in this regard. [9,10] Several studies have shown associations between intrahepatic fat content and gut microbiota, including decreased microbial diversity in severe fatty liver disease and an enrichment of specific taxa, including the bacterial phyla Proteobacteria, Firmicutes and Bacteroidetes, and the genera Escherichia, Veillonella, Faecalibacterium, Eubacterium, Bacteroides, and Oscillospira. [10][11][12][13][14][15] The results of seminal studies performed in twin adults with NAFLD demonstrated that within the spectrum of NAFLD, a metagenomic signature can not only differentiate between subjects with differing degrees of fibrosis, [12] but also between those with and without cirrhosis. [16] Based on these and other findings, we previously assessed whether differences exist between the gut microbiomes of youths with and without NAFLD. [17] We showed that youths with NAFLD have a different metagenomic profile than those without NAFLD. [17] However, our initial analysis was limited by the use of sequence data from the V4 region of the 16S rRNA gene, which typically allows for only familylevel and genus-level identification and no direct assessment of the physiological potential of the microbiome.
In the present study, we followed up on our previous findings by shotgun sequencing and analyzing the metagenomes of obese youths with and without NAFLD. Within this group of subjects, we observed community-wide differences in metagenome composition and identified specific species associated with these differences. We also identified metabolic gene pathways that were increased or decreased in subjects with NAFLD and the species contributing to these differences.

Study cohort
Thirty-eight obese youths (body mass index [BMI] ≥ 95th percentile) were included in the present study. The participants were recruited from the Yale Pediatric Obesity Clinic. Those participants who had known non-NAFLD hepatic diseases, diabetes, or medication that would interfere with glucose production and liver function were excluded from the study. All participants underwent an oral glucose tolerance test and abdominal fast-magnetic resonance imaging (MRI) to assess abdominal body fat partitioning and intrahepatic fat content as previously described. [38][39][40] Hepatic fat fraction (HFF) was assessed by abdominal MRI and used to categorize non-NAFLD (HFF < 5.5%) and NAFLD (HFF ≥ 5.5%) groups. In addition, fasting blood samples to measure liver function and lipid profiles were obtained. The studies were conducted at the Yale Center for Clinical Investigation at 8 h after a 12-hour overnight fast. A stool sample from each subject was also collected. Written parental informed consent and written child assent were obtained from all participants. Yale University Human Investigation Committee approved the study.

Statistical analysis
Differences were tested using chi-square tests for categorical variables, two-sample t-tests for normally distributed, continuous variables, and Wilcoxon rank-sum tests for nonnormally distributed, continuous variables, respectively. Subject characteristic differences were evaluated between the NAFLD and non-NAFLD groups. Chi-square testing was applied to the variables of gender (male/female), race (White or Caucasian/Black or African American/Hispanics/Others/Asian), and glucose tolerance (normal/impaired). The distributions of continuous variables were examined by histogram. The following normally distributed variables were examined using two sample t-tests: visceral fat; BMI, body fat (%); fasting glucose; whole-body insulin sensitivity index (WBISI); hemoglobin A1C; and total, high-density lipoprotein, and low-density lipoprotein cholesterol. Wilcoxon rank-sum tests were used to test differences for the following variables: hepatic fat fraction (%), subcutaneous, deep subcutaneous, superficial subcutaneous, deep/superficial subcutaneous, age, BMI z-score, 2-h glucose, fasting insulin, insulinogenic index, disposition index, triglycerides, alanine aminotransferase (ALT), and aspartate aminotransferase (AST). Subject characteristics analyses were performed using SAS (version 9.4; SAS Institute Inc., Cary, NC, USA).

Shotgun metagenome sample preparation and sequencing
DNA was extracted using a MoBio PowerMag Soil 96well kit (MoBio Laboratories, Carlsbad, CA) from 0.25 g of fecal sample. DNA extracts were quantified using the Quant-iT PicoGreen kit (Invitrogen, Thermo Fisher Scientific, Waltham, MA). Metagenome libraries were produced using a TruSeq PCR-Free kit (Illumina, San Diego, CA), diluted, and then pooled for loading onto an Illumina HiSeq 2500. The libraries underwent 75 × 75 bp paired-end sequencing.

Shotgun metagenome preprocessing
Raw paired-end read data for 36 shotgun metagenomes were downloaded from Basespace (basespace. illumina.com) following demultiplexing. Libraries for two samples failed and were excluded from the study. The reads were then processed using KneadData (version 0.7.4, https://github.com/bioba kery/knead data) with the default parameters to remove adapter sequences, quality filter and quality trim reads, and remove contaminating host-associated reads. The median read depth following Quality Control was 46 million reads, and the range was 69 million reads.

Taxonomic profiling and comparison
MetaPhlAn (version 3.0) [41] was used to taxonomically profile the gut microbial communities to the species level using the latest marker gene ChocoPhlAn database (release 2019.01). To investigate potential differences between the viromes of the two study groups, a virus profiling parameter was also included. Individual sample taxonomic profiles were merged using the merge_ metaphlan_tables.py utility script, resulting in a single table with relative abundances provided for each taxon. Absolute counts for use in downstream analyses were recovered by multiplying the total read count for each sample by the relative abundances resulting from the merge table script. Species-level counts were retrieved from this table and reformatted for import into R (version 4.1.0). [42] An Operational taxonomic unit table, taxonomy table, and metadata table were imported into RStudio (version 1.4.1717) and read into the package phyloseq (version 1.36.0). [43] Alpha and beta diversity calculations were performed using the packages phyloseq and microViz (version 0.7.7). [44] For ecological distance metrics, samples were rarefied to 8,893,658 million reads (the lowest read count for a sample) before calculating distances and ordination. For compositional biplots, a centered log-ratio transformation was performed as recommended before the generation of principal component analysis (PCA) biplots. Differential abundance testing was performed using the ANCOM statistical framework [45,46] implemented by the package ANCOM-BC. [47] Taxa were aggregated at the species, genus, family, order, class, and phylum levels using the microbiome package (version 1.14) [48] before performing the ANCOM test on each level. Taxa had to appear in a minimum of 25% of samples to be included in ANCOM-BC analyses.

Functional profiling and comparison
HUMAnN (version 3.0) [49] was used to functionally profile the microbial communities. Paired-end sequence files were first concatenated before running HUMAnN. The full ChocoPhlAn pangenome database (release 2019.01) was used for functional pathway abundance and coverage determination, whereas the UniRef90 database (release 2021.03) [50,51] was used for gene family abundance determination. The output pathway and gene family abundance files for each sample were normalized to relative abundances, and the resulting files were joined.
Enriched pathways and gene families were identified using the R package MaAsLiN2 (version 1.6.0). [52] Pathways and gene families achieving a corrected p-value of 0.05 or less were classified as significantly increased within one of the two patient groups. Identified pathways and gene families were then plotted using the bar plot utility script in HUMAnN.

Population characteristics stratified by NAFLD status
The clinical and demographic characteristics of 36 study participants (18 NAFLD and 18 non-NAFLD) were evaluated for differences with respect to NAFLD status ( Table 1). Most of the subject characteristics were not significantly different between the two groups. Subjects with NAFLD had a higher BMI, BMI z-score, fasting insulin level, alanine transaminase (ALT) level, and visceral and hepatic fat fractions, but a lower WBISI as compared to subjects without NAFLD. These all attained statistical significance (p < 0.05). Although not statistically significant, subjects with NAFLD had elevated fasting blood glucose, body fat percentage, total cholesterol, triglycerides, and AST as compared to the subjects without NAFLD.

Gut microbial community profiles of 36 obese youths
Thirty-six shotgun metagenomes of the gut microbial communities of 18 subjects with NAFLD and 18 without NAFLD were analyzed in the present study. The results presented in Figure 1 compare the alpha and beta diversity between these groups and profiles the microbial communities at the species and phylum levels. Shannon-Weiner diversity index values were calculated for all samples and were slightly elevated in the NAFLD group ( Figure 1A), although this difference was not significant (p = 0.14, t-test). Bray-Curtis dissimilarity values were calculated and plotted using a nonmetric multidimensional scaling ordination to compare overall gut microbial community structure at the species level ( Figure 1B). Permutational multivariate analysis of variance (PERMANOVA) testing revealed that the subjects with NAFLD had significantly different clustering from the non-NAFLD subjects (p adj = 0.001, R 2 = 0.1006, PERMANOVA) and showed a significant difference in dispersion (p adj = 0.023, β-dispersion). These results indicate a significant difference between the centroids of these groups at the species level, and this difference may be due to the tested factor (disease), variability within each group, or a combination of these two factors. [18] Agglomeration at higher taxonomic levels (genus thru phylum) led to reduced and no longer significant dispersion differences while maintaining the significant clustering differences as measured by PERMANOVA testing ( Figure S1). Between the NAFLD and non-NAFLD groups at the phylum level ( Figure 1C, Figure S2A), the mean relative abundance of Firmicutes (72.1% vs. 56.1%) and Bacteroidetes (6.7% vs. 33.2%) differed. Actinobacteria (18.8% vs. 7.9%) was the third most prevalent phylum for both groups. Verrucomicrobia (0.9% vs. 1.9%) and Proteobacteria (0.3% vs. 0.5%) were present as minor bacterial phyla along with the archaeal phylum Euryarcheota (0.8% vs. 0.4%) as well as viruses (0.2% vs. 0.1%). At the species level ( Figure 1D, Figure S2B), the Firmicutes Faecalibacterium prausnitzii (9.4% vs. 12

Taxonomic differences between the NAFLD and non-NAFLD groups
The differences between the gut microbial communities of individuals with NAFLD and without NAFLD were first compared using PCA biplots. PCA plotting presents a compositional approach in which results are generally more reproducible, and variance in the data is directly accounted for within the plot (as opposed to requiring separate statistical analysis as in principal coordinate plotting). [19] Figure 2 shows the species-level and phylum-level PCA plots, with arrows representing the contribution of individual taxa to each principal coordinate (the length of the arrow represents the strength of the effect). Beneath each PCA biplot is an iris plot showing the taxonomic composition for each sample corresponding to the location on the PCA plot above. The percent of the variation explained in the PCA plot increased from the species level (PC1 Species with a large effect size in the quadrant containing most of the non-NAFLD samples included Alistepes putredinis, Odoribacter splanchnicus, Barnesiella intestinihominis, Parabacteroides merdae, Bacteroides thetaiotamicron, and Bacteroides fragilis (Figure 2A,C). None of the top 10 species shown were dramatically enriched in the individuals with NAFLD. At the phylum level, Bacteroidetes was enriched in the area of the samples from the non-NAFLD group, while Actinobacteria and viruses were enriched in the area of the samples from the NAFLD group ( Figure 2B,D). Proteobacteria and Verrucomicrobia were enriched in a subset of both individuals with and without NAFLD, whereas Firmicutes does not have a clear correlation with either group.
The biplots suggested that specific taxa may correlate with the disease state of the subjects, and differential abundance testing was then used to probe this in a statistically meaningful way. Analysis of composition of microbiomes with bias correction (ANCOM-BC) was used to determine whether any taxa differed significantly in the microbiomes from subjects with and without NAFLD. All taxonomic levels from species to phylum were tested, and the results are presented in Table 2. The phylum Bacteroidetes, class Bacteroidia, order Bacteroidales, family Bacteroidaceae, and genus Bacteroides were all significantly decreased in the NAFLD group. Three species were shown to have elevated relative abundances in the NAFLD group, including Fusicatenibacter saccharivorans, Romboutsia ilealis, and Actinomyces sp. ICM47. The B. thetaiotamicron species was shown to be decreased in the NAFLD group as compared with the non-NAFLD group.
The differentially abundant species identified using ANCOM-BC were also plotted in Figure S3 as relative abundances in box plots. Certain species, including Bacteroides thetaiotaomicron and Fusicatenibacter saccharivorans, make up a larger portion of the average species consortium (> 0.1%). Other species, including Actinomyces sp. ICM47 and Romboutsia ilealis, amount to a smaller portion (< 0.1%) of the gut microbial community. The significant results from all other taxonomic levels are plotted in Figures S4-S7.

Functional differences between the NAFLD and non-NAFLD groups
In addition to taxonomic differences, functional differences between the groups were also investigated. Following the functional classification of reads, their abundances were compared, and the top 50 statistically significant results are reported in Table 3. Of these pathways, 29 had an increased relative abundance in the NAFLD group, while 21 were decreased, among which many biosynthetic pathways were identified, including multiple amino acid synthesis pathways. Multiple lysine synthesis pathways had increased relative abundances along with methionine, isoleucine, ornithine, threonine, serine, tryptophan, arginine, aspartate, and glycine pathways. Superpathways for branched-chain amino acid (BCAA) and aromatic amino acid (AAA) synthesis along with peptidoglycan synthesis were also observed at higher relative abundances in the NAFLD group. Another subset of pathways was decreased in the NAFLD group and included the ornithine (a distinct pathway from the aforementioned one), glutamine, glutamate, and isoleucine (a distinct pathway from the aforementioned one) synthesis pathways. Degradation, use, and energy generation pathways were also identified. Pyruvate fermentation to acetate and lactate, methanogenesis from acetate, lactose and galactose degradation, glycerol degradation, stachyose degradation, guanosine degradation, glycolysis, and the pentose phosphate pathway were increased in individuals with NAFLD. Two histidine degradation pathways as well as the urea and tricarboxylic acid cycles were decreased in the individuals with NAFLD. All significant results are presented in Table S1.

T A B L E 3 (Continued)
Additionally, the specific contributions of individual species to a particular metabolic pathway were calculated and then compared between groups to identify differentially abundant taxonomic contributions. Bar plots are provided in Figures S9-S14 for a subset consisting of pathways of interest. Eubacterium hallii and Fusicatenibacter saccharivorans were by far the two most common taxa contributing to increased pathways in the NAFLD group. Additionally, Blautia wexlerae, Blautia obeum, Streptococcus salivarius, Ruminococcus torques, Streptococcus parasanguinis, Coprococcus catus, Streptococcus thermophilus, Romboutsia ilealis, Roseburia sp. CAG 471, and Dorea formicigenerans were also increased in relative abundance in certain pathways. Reduced taxa contributing to decreased metabolic pathways in the NAFLD group included Bacteroides vulgatus, Bacteroides thetaiotaomicron, Bacteroides ovatus, Bacteroides stercoris, Parabacteroides distasonis, Bacteroides fragilis, Bacteroides xylanisolvens, Ruthenibacterium lactatiformans, and Parabacteroides merdae. All significant results are presented in Table S1.

DISCUSSION
In the present study, we used shotgun metagenomic sequencing to illuminate a number of taxonomic and functional differences between the gut microbial communities of obese youths with and without NAFLD. Previous studies sequenced relatively short variable regions of the 16S rRNA gene, which limits the taxonomic resolution capacity and lacks important functional information. [20] The analysis of the shotgun metagenome data at the species level revealed that the composition of the microbiomes differed significantly between the NAFLD and non-NAFLD groups ( Figure 1). The high degree of variability of the human microbiome, especially at the species and genus levels, has been well documented. [21,22] This variability may explain the dispersion visualized in Figure 1B, particularly for the non-diseased group. The reduction in variance at higher taxonomic levels suggests that a larger number of genera and species vary among the subjects without NAFLD than the subjects with NAFLD.
The taxonomic contributors of the observed difference in beta diversity were analyzed using PCA biplots. Interestingly, a number of species appeared to correlate with the non-NAFLD group, whereas the NAFLD group lacks obvious ones. This result indicates that a lack of certain microbial taxa and general dysbiosis may contribute to the pathogenesis of NAFLD as opposed to the presence of specific taxa being causative. [23] The most commonly observed and significant taxonomic difference was the relatively decreased abundance of bacteria belonging to the phylum Bacteroidetes in subjects with NAFLD. Both the PCA biplot and ANCOM differential abundance testing identified this group as strongly decreased in individuals with NAFLD. This was also the finding of an earlier study that we performed on a larger cohort using 16S rRNA gene sequencing and that included the subjects in this study. [17] A commonly cited overabundance of Proteobacteria [24] in individuals with NAFLD was not observed in our data set, although a subset of both study groups possessed increased relative abundances of Proteobacteria. A number of species were identified as being differentially abundant, among which B. thetaiotamicron was shown to be significantly decreased in the NAFLD group and was identified as a major driver in PCA (Figure 2). B. thetaiotamicron has been shown to reduce diet-induced body weight gain and adiposity in mice and was observed at lower abundances in obese human subjects compared with healthy ones. [25] O. splanchnicus, a species within the family Odoribacteriaceae that was decreased in the NAFLD group, was shown to be associated with decreased incidence of NAFLD, [26] cystic fibrosis, [27] and inflammatory bowel disease [28] in previous studies. This species was identified as a major driver in PCA, and its bacterial family was identified via ANCOM analysis. Fusicatenibacter saccharivorans, the lone species within a relatively newly recognized genus, [29] comprised a fairly large portion of the NAFLD community. This species has been associated with a diet high in processed foods [30] and is capable of fermenting a wide variety of saccharides, producing short-chain fatty acids (SCFAs) as a result. [29] The species Actinomyces sp. ICM47 is primarily associated with the human oral microbiome, [31] and its overabundance in individuals with NAFLD could be partly due to increased saliva entering the gastrointestinal tract through more frequent eating or an increased salivary response (although this is only speculation). It is important to note that detecting DNA from specific bacterial species does not establish their living presence within a certain niche. The presence of oral microbial DNA in fecal samples is to be expected but does not necessarily indicate that a species is a gut resident.
Functionally, an even more expansive list of features was shown to be differentially abundant between the two groups. BCAA ( Figure S9) and AAA ( Figure S10) biosynthetic genes were observed to be at higher relative abundances in the group with NAFLD, and previous studies have reported BCAAs and AAAs to be associated with NAFLD. [32,33] Interestingly, pathways for isoleucine (a BCAA) were shown to be increased in both groups. Isoleucine pathways I ( Figure S11) and III ( Figure S12) were increased in individuals with NAFLD, while pathway IV ( Figure S13) was decreased. Pathways 1 and 3 use 2-oxobutanoate and glutamate, respectively, whereas pathway 4 uses propanoate. Among the three aforementioned pathways, contributions by Eubacterium, Blautia, and Streptococcus spp. were noted as increased in individuals with NAFLD, while Bacteroides and Parabacteroides spp. were decreased. Although these shotgun data only detect the presence of certain genes and do not provide information on expression levels, it is possible that these pathways are active and lead to an increase in free isoleucine associated with NAFLD.
The fermentation of pyruvate to lactate and acetate, a SCFA, is another interesting pathway that was observed at higher relative abundances in the NAFLD group compared with the non-NAFLD group ( Figure S14)-a difference primarily driven by the abundances of Fusicatenibacter saccharivorans and two Streptococcus species. Total SCFA content in the gut has been correlated with obesity status. [23] The classification of both evaluated groups as obese may indicate that certain bacterial community members further exacerbate SCFA concentration, increasing the chances for more severe obesity and NAFLD progression. The increase or decrease in specific SCFAs (acetate, propionate, and butyrate) may also contribute to disease progression. Butyrate and acetate are the most and least potent anti-inflammatory SCFAs, respectively. [34] The gene potential for increased acetate production by F. saccharivorans and a decrease in butyrate production from a decreased abundance of F. prausnitzii (the most abundant species in these metagenomes and a known butyrate producer) ( Figure  S2B) may contribute to more severe gut inflammation in the NAFLD group. [35] This study has some limitations, the main one being the relatively small sample size (n = 36) used, which could have led to a potentially high false discovery rate (FDR). To account for this issue, we made FDR-related statistical adjustments and used minimum prevalence values to exclude taxa or features only present in a small subset of samples. Another consideration, particularly for the functional data, is that shotgun metagenomic data provide the functional potential of the community present rather than the expression levels of these genes. This distinction is important, as certain genes may be present but not actively transcribed, whereas others may be overexpressed. Although a metatranscriptomic analysis could provide a functional snapshot of the community, this approach has its own set of caveats, as the transcriptomes of fecal microbiome samples are not necessarily representative of the physiology in the colon. The significant differences noted for BMI and body fat percentage between the two groups may also present as a confounding variable. The NAFLD group had consistently higher BMI and percent body fat measurements, indicating more severe obesity, which has been shown to affect microbiome diversity. [36] In the present study, we profiled a largely undercharacterized age group within the population, [37] including thorough clinical phenotyping of the patient groups using MRI and shotgun deep sequencing of the gut microbial communities. In summary, we found that the two groups had significantly different microbial community compositions, and that these differences were at least partially driven by a specific subset of bacterial species and that biosynthetic pathways producing metabolites (BCAAs, AAAs, SCFAs) previously correlated with disease were again identified here. There is great potential for future studies to investigate these identified species as probiotics or drug targets as well as using correlated metabolite production as a biomarker for increased disease risk to inform clinical interventions. The comparison examined was also unique in that only obese youth with and without NAFLD were included, thereby controlling for the effect of obesity on the microbiome. Given NAFLD's link with type 2 diabetes, cardiovascular disease, and advanced liver disease, as well as its increasing prevalence, NAFLD poses a major concern to public health and begs more effective strategies for prevention and treatment. While gut dysbiosis has been postulated as a potential contributor to the pathogenesis of NAFLD and NASH, more research is needed to elucidate how gut-liver axis signaling influences hepatic fat accumulation and inflammation and how these mechanisms might be modulated for therapeutic benefit. Future studies that combine clinical, metabolomic, and microbiome data will be extremely important for prospective therapies for NAFLD and its more severe progressions, especially considering children and adolescents in high-income countries are in the midst of an obesity epidemic.

A C K N O W L E D G M E N T
The authors are grateful for the clinical and research staff of the Hospital Research Unit, Church Street Research Unit, and YCCI Core Lab at Yale School of Medicine. They also thank Dr. Jeremiah Marden for providing feedback on the manuscript.

C O N F L I C T S O F I N T E R E S T
Nothing to report.

E T H I C S A P P R O VA L
The Yale University Human Investigation committee approved the study.

C O N S E N T T O PA R T I C I PAT E A N D C O N S E N T F O R P U B L I C AT I O N
Written child assent and written parental permission were obtained from all participants involved in the study.

AVA I L A B I L I T Y O F D ATA A N D M AT E R I A L
Data can be found in the NCBI SRA database under project ID PRJNA328258.

C O D E AVA I L A B I L I T Y
Code can be found at https://github.com/joerg grafl ab/ Code-for-Teste rman-Li-2021.