Chemical space coverage and the features associated with demographic factors
Samples, data generation and analysis workflow are depicted in Fig. 1A. All HRMS analysis was performed in triplicate using a dual column chromatography scheme that included hydrophilic interaction liquid chromatography (HILIC) and reversed phase liquid chromatography (RPLC; C18) columns to maximize the chemical space coverage. A total of 14,342 features were quantitatively measured for HILIC and C18 columns (N = 8,739 and 5,603, respectively). A subset of these features was experimentally validated using LC-MS/MS for known chemical compounds (N = 97 for HILIC and 69 for C18, Table S1) [16]. The demographic characteristics of our cohort are summarized in Table S2. To identify metabolites that were associated with demographic variables, we used a generalized linear model to control for batch effect and global genetic ancestry using the top ten eigenvectors—hereafter referred to as principal components (PCs) 1–10—from EIGENSTRAT analysis of 619,688 common variants (see Fig. S1 for population stratification with the first two PCs).
A total of 338 features were associated with age and gender, of which 19 were validated for known chemical compounds (false discovery rate (FDR) < 0.05, N = 120, with 10 validated and 218, with 9 validated, respectively, Table S3). Among the age correlated metabolites, urate and creatinine are known to be correlated with age [17, 18] and we found novel associations including quinic acid, arachidonic acid, hydroxyproline, acetylcholine, 1-methylhistidine, trimethyllysine, cystine, and retinol. Between females and males, amino acid derivatives (hydroxyproline, hydroxylysine, asymmetric dimethylarginine (ADMA)), carnitines (carnitine, propionylcarnitine, butyrylcarnitine), cholines ((lysoPC(18:0) and lysoPC(18:1)), microbial product (valerobetaine), and a product of urea cycle (fumaric acid) were significantly different among the validated features.
As of the impact of global genetic ancestry on individual’s metabotype, PC1—correlated with global ancestry of African descents from non-African descents—was associated with 141 features including 8 validated metabolites: amino acid derivatives (citrulline, indoleacetate, and kynurenine), carbohydrate (arabinose), vitamins (retinol (A), thiamine (B1), nicotinamide (B3)), and cholines (lysoPE(18:0), lysoPE(20:3)) (FDR < 0.05) (Table S3). Except for PC4 and PC7, the other PCs were not correlated with any of the validated metabolites. PC4 was correlated with subgroups of European decent and PC7 distinguished two individuals of European descents. Nonetheless, the features that were significantly correlated with PCs may reflect combined effect of genetic and environmental factors, such as diet and lifestyle. All the significant features associated with demographic variables—age, gender, and global genetic ancestry—were not biased for mass-to-charge ratio (m/z) and retention time (RT) for both HILIC and C18 (Figs. S2-S4).
Next we performed pathway enrichment analysis using Mummichog [19] with statistical scores obtained from univariate analyses as described above. Pathways were selected for adjusted p-value < 0.01 with five or more empirically validated chemicals overlapping for each pathway (Table S4). Glutamate metabolism pathway including five empirically validated metabolites—glutamine, glutamate, carbamoyl phosphate, 2-oxoglutarate and N-methylglycine—was significant for age (adjusted p-value 0.00495). Tricarboxylic acid cycle pathway was enriched with differentially detected metabolites between males and females. Ascorbate and aldarate metabolism and pentose phosphate pathways were significant for PC1 and PC4.
Narrow-sense heritability estimation of feature levels
To capture the genetic influence on the variance of metabolite levels across individuals, we estimated a narrow-sense heritability (h2) of feature levels using genomic relatedness matrix (GRM) restricted maximum likelihood (GREML) method implemented in Genome-wide Complex Trait Analysis (GCTA) [20]. We found a wide range of h2 estimates with a bimodal distribution having 15.9% of features with high h2 (> 0.8) and 62.0% with low h2 (< 0.2) (Fig. 1B). Next, we checked h2 distributions in each chemical species with the validated features. Carbohydrates and lipids had low h2 overall with 7.1% and 3.7% had h2 between 0.6–0.8, respectively. No feature was high h2 (> 0.8) for these chemical classes. In comparison, large proportions of amino acids and derivatives and nucleic acids had high h2 (> 0.8) (24.5% and 33.3%, respectively) (Fig. 1C).
Genome by metabolome-wide association analysis
We calculated age and gender corrected feature intensities and included the top 10 PCs as covariates to perform GxMWAS (see Methods) discovering 54 associations among 29 features and 43 common genetic variants at the threshold of p < 3.5 x 10− 12 (= 5 x 10− 8/14,342 features) (Table 1 for selected associations; Fig. 2 for associations involving features with validated annotations; Dataset S1 for full list of associations at genome-wide significance level, p < 5 x 10− 8). On average, a genetic variant was associated with 1.3 ± 0.79 features (range 1–4) and a feature was associated with the median of one variant (range 1–5). Most variants associated with feature levels were intronic (25 of 43, 58.1%) and three (7.0%) were in protein coding exons.
Table 1
Genetically influenced metabolites. Significant genetic variant-metabolite associations at a genome-wide significance of p < 3.5x10− 12 (= 5 x 10− 8/14,342 features) are shown. The features that are confirmed using authentic standards are annotated with chemical names.
Metabolite annotation(s) | Feature | Lead SNP Position* | EA, OA | EAF | Beta | Standard error | P-value | Candidate Genes | |
N/A | c18_3017 (m/z 394.8915, RT 68.1) | chr9:72,902,395 | G, A | 0.14 | -0.49 | 0.063 | 4.94E-14 | ALDH1A1 | |
N/A | c18_4819 (m/z 640.3195, RT 278.2) | chr20:38,052,517 | G, T | 0.08 | -1.23 | 0.167 | 7.35E-13 | RPRD1B | |
chr10:66,637,441 | A, G | 0.07 | -0.69 | 0.090 | 1.56E-13 | CTNNA3 | |
N/A | c18_5068 (m/z 709.0644, RT 190.3) | chr8:47,661,411 | A, G | 0.07 | 0.93 | 0.128 | 1.80E-12 | SPIDR | |
N/A | hilic_252 (m/z 101.5811, RT 100.6) | chr10:98,377,943 | T, C | 0.46 | -0.61 | 0.065 | 4.82E-19 | PYROXD2** | |
Methyl Lysine or its variants | hilic_1381 (m/z 161.1285, RT 106.1) | chr10:98,377,943 | T, C | 0.46 | -0.49 | 0.061 | 6.98E-15 | PYROXD2** | |
Methyl Lysine or its variants | hilic_1401 (m/z 162.1321, RT 106.9) | chr10:98,377,943 | T, C | 0.46 | -0.54 | 0.064 | 3.32E-16 | PYROXD2** | |
NeNe dimethyllysine | hilic_1651 (m/z 175.1442, RT 104.5) | chr10:98,377,943 | T, C | 0.46 | -0.55 | 0.056 | 6.20E-21 | PYROXD2** | |
N/A | hilic_1914 (m/z 188.9574, RT 44.2) | chr12:66,599,541 | C, T | 0.06 | -1.22 | 0.164 | 5.69E-13 | GRIP1 | |
chr15:88,215,350 | A, C | 0.15 | -0.84 | 0.116 | 2.85E-12 | NTRK3 | |
chr21:14,901,801 | G, C | 0.13 | -0.85 | 0.117 | 1.94E-12 | NRIP1 | |
Dimethylarginine | hilic_2476 (m/z 220.1777, RT 24) | chr16:7,273,260 | C, T | 0.08 | -1.03 | 0.137 | 3.45E-13 | RBFOX1 | |
Dehydrochorismic acid | hilic_3208 (m/z 269.0023, RT 125.5) | chr4:40,241,127 | G, C | 0.06 | 0.47 | 0.061 | 9.20E-14 | RHOH | |
Tridecan-1-ol, Tridecan-2-ol, 10-methyl-1-dodecanol, 4-Methyldodecan-7-ol | hilic_3410 (m/z 283.2741, RT 27.6) | chr7:12,354,019 | G, A | 0.12 | 0.27 | 0.037 | 1.55E-12 | VWDE | |
chr7:72,362,534 | A, G | 0.07 | 0.35 | 0.047 | 3.02E-13 | CALN1 | |
Menthol propylene glycol carbonate | hilic_3653 (m/z 300.2167, RT 37.1) | chr2:210,210,185 | C, T | 0.31 | 0.68 | 0.077 | 5.62E-17 | ACADL** | |
TG(14:0/O-18:0/18:2(9Z,12Z)) | hilic_7586 (m/z 781.7483, RT 89.3) | chr8:3,674,391 | C, T | 0.11 | 0.73 | 0.093 | 2.29E-14 | CSMD1 | |
EA, effect allele; OA, other allele; EAF, effect allele frequency; m/z, mass to charge ratio; RT, retention time * Represented as chromosome:position based on the human reference genome GRCh38. ** Previously reported GIMs in adults. |
A previously reported GIM between PYROXD2 and N-dimethyl-lysine (m/z 175.1442, RT 104.5) was successfully replicated in our pediatric cohort, which was indeed the strongest association in our results (Fig. 3A). This association has been independently replicated by the studies using urine, plasma, and cerebrospinal fluid (CSF) samples[4, 21–23]. The ACADL gene encodes acyl-CoA dehydrogenase long chain (ACADL) that is a subunit of the four enzymes involved in the initial step of mitochondrial beta-oxidation of straight-chain fatty acid. One missense and two intronic variants were significantly associated with hilic_3653 (m/z 300.2167, RT 37.1) that matches a long chain fatty acid, (6E,8S,10Z)-8-hydroxy-3-oxohexadecadienoic acid in HMDB (Fig. 3B). The association between nonanoyl carnitine and ACADL (p = 2.3 x 10− 9) did not pass the statistical significance threshold while this association has been previously reported.
Lipid species had low h2 overall; however, we found novel GIMs for triglycerides. Among novel GIMs discovered in our cohort, triglyceride (TG)(14:0/20:1(11Z)/O-18:0) (m/z 283.2741, RT 27.6) level was associated with multiple intronic SNVs in CALN1 and VWDE (p-values 3.02 x 10− 13 and 1.55 x 10− 12, respectively) (Figs. 3C and 3D). A monostearyl alcohol triglyceride level (m/z 781.7483, RT 89.3) was associated with an intronic SNV (rs2624100) in the CSMD1 (CUB and Sushi Multiple Domains 1) gene (p = 2.29 x 10− 14).
Some feature levels were associated with multiple genomic loci. For instance, hilic_1914 (m/z 188.9574, RT 44.2) was associated with three loci on chromosomes 12, 15, and 21 (p-values 5.69 x 10− 13, 2.85 x 10− 12 and 1.94 x 10− 12, respectively). The GRIP1 (glutamate receptor interacting protein 1) gene in chromosome 12q14.3 is involved in synapse formation [24]. The NTRK3 (neurotrophic receptor tyrosine kinase 3) gene in chromosome 15q25.3 encodes a receptor tyrosine kinase that binds to its ligand neurotrophin-3 and plays a role in nervous system development. AF127577.4 is a long non-coding RNA on chromosome 21q11.2 and 5’-end overlaps with the NRIP1 gene. Nuclear receptor interacting protein 1 (NRIP1) is a nuclear protein that interacts with the hormone-dependent nuclear receptors and expressed in neuronal and glial cells [25].
For the experimentally validated features, we found several GIM candidates at less stringent but a genome-wide significant level (i.e., p < 5 x 10− 8) including previous reported ones. For amino acids and its derivatives, we found GIM candidates for arginine, glutamate, lysine, and sulfinoalanine. Arginine and lysine levels were associated with multiple genetic loci in different genes. Six loci including intronic variants in APBB2 and CHAF1A were significantly associated with plasma arginine level. Lysine level was associated with four loci including intronic variants in the CCDC85A, CAMK4, TMEM106B, and OSBPL1A genes. Glutamate is the most abundant excitatory neurotransmitter in CNS and its plasma level was significantly associated with genetic variants in the ADAMTS8 and RHOU genes. For carbohydrates, glucose level was significant for an intronic variant (rs11954514) in the HARS1 (Histidyl-tRNA synthetase 1) gene that is a disease-causing gene for Usher syndrome type 3b (MIM ID: 614504)[26]. An intronic variant in the SLC7A2 gene was associated with plasma glycogen level. The SLC7A2 gene encodes a cationic amino acid transporter and is implicated in arginine metabolism. Slc7a2 knockout mice had 20% higher blood glucose compared to wild-type mice [27]. N-Acetylneuraminate (Neu5Ac) is a sialic acid found in cell membrane. In neuronal cells, Neu5Ac residues are found in membrane bound glycoproteins, i.e., gangliosides. Neu5Ac interacts with bacterial and viral pathogens in diverse cell types. Multiple SNVs in the PALLD gene were associated with plasma Neu5Ac levels. Palladin, encoded by the PALLD gene, is a cytoskeletal protein found in actin filaments. Among lipid species, butyrylcarnitine-ACADS association was notable, which has been reported in independent studies. All genetic variants associated with annotated features as identified by unique m/z and RT are listed in Dataset S1.
Gene-level enrichment with the variants associated with features
Next, we performed a gene-level enrichment analysis for each feature with its GWAS summary statistics using Multi-marker Analysis of GenoMic Annotation (MAGMA) [28]. A total of 572 genes were enriched with variants associated with 217 features at FDR < 0.01 (Dataset S2). The most significant association between gene and feature was found for the SKIDA1 gene and urea citrate (m/z 269.0023, RT 125.5). Among the validated features, the HLA-C gene was associated with arginine level. Bilirubin was significantly associated with UDP-glucuronosyltransferase isoforms, which has been replicated by independent studies using different metabolomics platforms [4, 7, 29, 30]. Of the ten features associated with UGT1A isoforms, six were identified and/or annotated as bilirubin while four features were not. Correlation structure of these features showed that the unmatched features could represent other chemicals than bilirubin (Fig. S5) as UGT1As are the enzymes of the glucuronidation pathway processing small lipophilic molecules such as steroids, bilirubin, hormones, and drugs into water-soluble and excretable metabolites.
We created a network of gene-feature associations with MAGMA results to check the interconnectivity (Fig. 4A and Fig. S6). The average number of neighbors was 2.0 and a one-to-one association was found for 67 gene-feature pairs. The largest subgraph had 13 features and 72 genes. We checked enriched gene ontology terms for the genes in each subgraph with 6 or more genes. Six subgraphs were enriched with one or more GO biological pathway terms (hypergeometric test, FDR < 0.05). A subgraph with 4 features and 19 genes was enriched with the genes involved in purine metabolism such as dADP (deoxyadenosine diphosphate) biosynthetic process (hypergeometric test, FDR 0.018, Fig. 4B). A sterol, hilic_5455 (m/z 467.256, RT 278.1) was associated with the variants in 12 genes that were enriched in nucleosome assembly (hypergeometric test, FDR 0.013, Fig. 4C). Seven genes associated with an oxidized CDP-diacylglycerol (CDP-DG), c18_4077 (m/z 522.734, RT 43.2) were enriched for plasma membrane long-chain fatty acid transport and ketone body biosynthetic process (hypergeometric test, FDR 0.036 and 0.036, respectively, Fig. 4D).
Revealing the underlying network among genetically influenced metabotypes
Focusing on the 29 GIMs with p < 3.5 x 10− 12, we identified a GIM-causal network at the type I error rate of 5%. Briefly, causal networks are based on conditional (in)dependency established in the principles of Mendelian Randomization [31]. We found two disjoint modules of interconnected GIMs. The modules comprised 7 and 12 GIMs with directed connections pointing a prediction target in each module (Fig. 4E and F). A prediction target captures the effect from multiple other GIMs in the module, so its concentration levels can be representative of the module [31]. In the module with seven features, methyl-lysine and TG(14:0/8:0/8:0) showed a significant connectivity/dependency (p = 2 x 10− 4) while four features that were all mapped to methyl-lysine (Fig. 4E). In the module with 12 features, the connection between an unannotated feature (hilic_1914) and urea citrate (hilic_3208) was one of the most significant connectivity (p = 5 x 10− 13)(Fig. 4F).