Correlation analysis between metabolomics and expression datasets successfully predicts SLC substrates
We set out to de-orphanise SLC proteins by investigating the potential effects of their expression on cellular metabolite concentrations. We reasoned that the transporter activity of SLCs might result in correlations between SLC expression level and the intracellular concentrations of their corresponding substrates (Figure 1A). We used a major cancer cell panel in which 225 metabolites were profiled by liquid chromatography-mass spectrometry (LC-MS) across almost a thousand cancer cell lines from the Cancer Cell Line Encyclopedia 2019 (CCLE2019) (Li et al., 2019). We selected a list of SLC and SLC-like genes (S1 Table) from previous curations (Gyimesi & Hediger, 2022; Meixner et al., 2020). For each SLC or SLC-like gene in the list, we related its normalised transcript levels to metabolite concentrations in 913 cell lines, and calculated the Z-score of the absolute values of the correlation coefficients (Spearman’s ρ) for each metabolite to account for varying degrees of correlation strength across different metabolites (Methods). Upon inspection of this set we observed many cases where expression of an SLC correlated most strongly with its known substrate. For example, SLC6A6, a Na+/Cl--dependent β-alanine and taurine transporter (Ramamoorthy et al., 1994), correlated most strongly with β-alanine and taurine, whilst SLC6A8, a Na+/Cl--dependent creatine transporter (Skelton et al., 2011), correlated most strongly with creatine (Figure 1B). Notably, expression of SLC6A8 also correlated strongly with two other metabolites, phosphocreatine and creatinine, which are direct derivatives of creatine (Taegtmeyer & Ingwall, 2013; Wyss & Kaddurah-Daouk, 2000).
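The normalisation described above can be illustrated with a minimal sketch. It assumes that, for a single metabolite, the absolute Spearman’s ρ values are standardised across all SLC genes; the function names and the toy data are hypothetical and are not taken from the Methods.

```python
from statistics import mean, pstdev

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def zscored_abs_rho(expression, metabolite_levels):
    """Z-score |rho| across genes for one metabolite (assumed normalisation).

    expression: {gene: [transcript level per cell line]}
    metabolite_levels: [metabolite level per cell line], same order.
    """
    abs_rho = {g: abs(spearman(v, metabolite_levels)) for g, v in expression.items()}
    values = list(abs_rho.values())
    mu, sd = mean(values), pstdev(values)
    return {g: (r - mu) / sd if sd else 0.0 for g, r in abs_rho.items()}

# Toy data: three hypothetical genes measured in five cell lines.
expression = {
    "SLC_A": [1, 2, 3, 4, 5],
    "SLC_B": [2, 1, 4, 3, 5],
    "SLC_C": [3, 1, 5, 2, 4],
}
metabolite = [10, 20, 30, 40, 50]
z = zscored_abs_rho(expression, metabolite)
```

In this toy example the gene whose expression tracks the metabolite perfectly ("SLC_A") receives the highest Z-score.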
These examples indicated that the expression and metabolite level variation across cancer cell lines might be more generally predictive of the functions of SLCs. To test this, we first expanded the correlation analysis to two other major cell line panels, NCI-60 (Shoemaker, 2006) and CCL180 (Cherkaoui et al., 2022) (Tables S2-S4). These three datasets were generated using different methodologies to measure metabolite levels and gene expression; nevertheless, our analysis demonstrated significant concordance in SLC-metabolite correlations between them, suggesting that our method generated robust predictions (Figure 1C). We next investigated how well our correlation method was able to capture the known substrates of SLCs. We updated and expanded the database of SLC annotations based on a previous report (Meixner et al., 2020), and selected a list of known substrates that are present in the metabolomics datasets we used (Table S5; see also Table S1 for references). Since the annotated names for the same metabolites were often different across the databases, we manually examined the metabolite annotations in the three metabolomics datasets to ensure consistent nomenclature across them (Table S6). We tested what fraction of the known substrates were recovered by our correlation analysis and compared this to a random expectation generated by shuffling the SLCs and substrates (“simulated random pairs”). In all three datasets, the mean Z-score normalised Spearman’s ρ of known SLC-substrate pairs (Table S5; see also Table S1 for references) was significantly higher than that of simulated random pairs (Figure 1D), indicating that correlation analysis was able to successfully predict known interactions.
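The comparison against simulated random pairs can be sketched as follows. This is an illustration only: the text does not state which statistical test was used, so the permutation-style p-value, the function names and the toy scores below are all assumptions.

```python
import random
from statistics import mean

def simulated_random_pairs(known_pairs, rng):
    """Shuffle substrates among SLCs and drop any pair that is actually known."""
    slcs = [s for s, _ in known_pairs]
    substrates = [m for _, m in known_pairs]
    rng.shuffle(substrates)
    known = set(known_pairs)
    return [(s, m) for s, m in zip(slcs, substrates) if (s, m) not in known]

def mean_score(pairs, z):
    """Mean Z-score over the pairs that were measured; NaN if none were."""
    vals = [z[p] for p in pairs if p in z]
    return mean(vals) if vals else float("nan")

def permutation_p_value(known_pairs, z, n_perm=1000, seed=0):
    """Fraction of shuffled sets whose mean score reaches the observed mean."""
    rng = random.Random(seed)
    observed = mean_score(known_pairs, z)
    null = [mean_score(simulated_random_pairs(known_pairs, rng), z)
            for _ in range(n_perm)]
    null = [v for v in null if v == v]  # discard NaN from empty shuffles
    hits = sum(v >= observed for v in null)
    return observed, (hits + 1) / (len(null) + 1)

# Toy scores: hypothetical SLCs A-D, each matched to one hypothetical metabolite.
slcs, mets = ["A", "B", "C", "D"], ["a", "b", "c", "d"]
z = {(s, m): (3.0 if s.lower() == m else 0.0) for s in slcs for m in mets}
known_pairs = list(zip(slcs, mets))
observed, p_value = permutation_p_value(known_pairs, z)
```

Excluding shuffled pairs that coincide with known pairs is one possible design choice; it keeps the random set free of true positives at the cost of slightly variable set sizes.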
To use our method to generate predictions for novel SLC-substrate pairs, we sought to define a cutoff for the correlation strength indicative of a strong prediction (Methods). To do this we systematically varied the cutoff on the normalised Spearman’s ρ and attempted to maximise the fractional difference between true positives (fraction of the set of known pairs above the cutoff) and false positives (fraction of the set of simulated random pairs above the cutoff). We defined this threshold separately for each of the three datasets (Figure S1A-C). To unify the correlations from the three datasets we assigned each metabolite/SLC pair a score such that a higher score corresponded to a stronger prediction. To assign a score for a particular metabolite/SLC pair we first normalised the absolute value of the correlation coefficient to give a Z-score. We then compared this Z-score to the Z-scores of the correlation coefficients for all the experimentally determined metabolite/SLC pairs. To give an example of how the score is calculated, the correlation between SLC6A8 and creatine has a Z-score of 3.08 in the NCI-60 panel and 8.89 in the CCLE2019 panel, and is not represented in the CCL180 panel because creatine was not measured. A Z-score of 3.08 is within the top 10% of known metabolite/SLC pairs in NCI-60 and is therefore assigned a score of 10; 8.89 is in fact the best correlation within the known set for CCLE2019 and so is assigned a score of 11. These scores are summed and multiplied by 3 to give a total score of 63 for the creatine/SLC6A8 pair. Judged by the mean confidence score of the known SLC-substrate pairs, our score had good predictive power relative to the simulated random sets (Figure 1E), and this held across a range of confidence score cutoffs (Figure 1F). Taken together, these results indicate that correlation analysis is able to provide an accurate indication of SLC-substrate pairs and thus could be a potential method to predict new substrates for orphan SLCs.
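The cutoff search described above can be sketched as a simple scan. This is a minimal illustration that assumes candidate cutoffs are taken from the observed scores themselves; the function name and the toy Z-scores are hypothetical.

```python
def best_cutoff(known_scores, random_scores, candidates=None):
    """Scan cutoffs and return the one maximising (known fraction - random fraction).

    Returns (cutoff, fractional difference, known fraction, random fraction).
    """
    if candidates is None:
        candidates = sorted(set(known_scores) | set(random_scores))
    best = None
    for c in candidates:
        known_frac = sum(s >= c for s in known_scores) / len(known_scores)
        random_frac = sum(s >= c for s in random_scores) / len(random_scores)
        diff = known_frac - random_frac
        if best is None or diff > best[1]:
            best = (c, diff, known_frac, random_frac)
    return best

# Toy Z-scores for known pairs and simulated random pairs (illustrative only).
known = [2.5, 3.0, 1.8, 0.2, 2.2]
simulated = [0.1, -0.3, 0.5, 0.9, -1.0]
cutoff, diff, known_frac, random_frac = best_cutoff(known, simulated)
```

With these toy values the scan selects a cutoff that retains four of the five known pairs and none of the simulated pairs.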
Data from gene dependency screens improve SLC substrate predictions
We next considered whether data from genome-wide gene dependency screens (Bock et al., 2022) could be incorporated into our predictions of SLC substrates. We reasoned that cell growth may be dependent on a specific SLC if the cells grow more slowly after the expression of that SLC is depleted, owing to loss of the metabolite(s) it transports (Figure 2A). Metabolites whose concentrations are significantly different between dependent and non-dependent cell lines might therefore be candidate substrates for the SLC in question. We used the CRISPR-Cas9 dependency screen (Tsherniak et al., 2017), which records cell growth in over a thousand cell lines upon CRISPR knockout and in which 625 cell lines overlap with the CCLE2019 metabolomics profiling. To infer the dependency of cell lines on specific genes we used the gene effect score as defined previously (Meyers et al., 2017). A negative gene effect score indicates that cells exhibit reduced proliferation upon deletion of the gene compared to normal cells. For each SLC gene in the curated list, we ranked the cell lines based on their corresponding gene effect scores, excluding positive scores where growth was improved by loss of the SLC. For each metabolite, we computed a p-value, corrected for multiple testing, for the difference between cell lines with the top 20% of negative scores (more dependent) and the bottom 20% of negative scores (less dependent) (Table S7). Across a range of p-value cutoffs, the fraction of significant pairs recovered from the known set is higher than that recovered from simulated random pairs. The fractional difference is maximised when the p-value cutoff for significance is set to 0.077 (Figure S1D). For p-values smaller than the cutoff, a confidence score is assigned based on its position within a similarly calculated decile-based quantile of known pairs, scaled to 0.1 (Figure S1E).
Thus, incorporating the CRISPR-Cas9 dependency screen as an additional data source improved the prediction performance, as the confidence score difference between known pairs and simulated random pairs increased by 7% (Figure 2C).
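The grouping of cell lines by dependency can be sketched as below. The statistical test is not named in the text, so the permutation test on the difference of means is an assumption, multiple-testing correction is omitted, and the cell line names and values are hypothetical.

```python
import random
from statistics import mean

def dependency_groups(gene_effect, fraction=0.2):
    """Split cell lines with negative gene effect scores into two groups.

    gene_effect: {cell_line: gene effect score}; positive scores are excluded.
    Returns (more_dependent, less_dependent), each holding `fraction` of the
    negative-scoring cell lines.
    """
    negative = sorted((s, c) for c, s in gene_effect.items() if s < 0)
    n = max(1, int(len(negative) * fraction))
    more_dependent = [c for _, c in negative[:n]]   # most negative scores
    less_dependent = [c for _, c in negative[-n:]]  # scores closest to zero
    return more_dependent, less_dependent

def permutation_p(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means (assumed test)."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy gene effect scores for eleven hypothetical cell lines.
scores = [-1.0, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0.5]
gene_effect = {f"line{i}": s for i, s in enumerate(scores)}
more_dependent, less_dependent = dependency_groups(gene_effect)

# Toy metabolite levels in the two groups (illustrative only).
p_value = permutation_p([10, 11, 12, 13, 14], [1, 2, 3, 4, 5])
```

In a full analysis the resulting p-values would be corrected for multiple testing across metabolites before applying the cutoff.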
Inclusion of adjacent metabolites improves substrate prediction for SLCs
The results above established that the correlation analysis and CRISPR-Cas9 dependency carry predictive power for recovering known SLC-substrate pairs. However, substrates might be rapidly converted into downstream derivatives, leading to poor correlation and reduced predictive power. Furthermore, the prediction of a substrate would be reinforced if the SLC also correlates with derivatives of that substrate. To address these ideas, we created a metabolite adjacency matrix (Table S8) from annotated KEGG metabolic pathways. This was done by extracting the number of conversion steps required for one metabolite node to reach another, with each unit of adjacency representing a conversion edge between two metabolite nodes. We reasoned that the expression of the SLC that transports the substrate molecule may correlate with its proximal derivatives, while for its distant derivatives, the distribution of correlation strength will be more random and thus more similar to that of metabolites unrelated to the substrate of the SLC (Figure 3A). To validate this hypothesis, we generated adjacency tables containing the derivatives at increasing numbers of conversion steps from the substrates in the original SLC-substrate table, from proximal to distant. For each SLC-derivative pair in a table, we measured the similarity between the correlation of the SLC-derivative pair and that of the original SLC-substrate pair by calculating the difference in their Spearman’s ρ. Subsequently, these differences were compared against control tables containing randomly sampled non-adjacent metabolites, defined as metabolites that cannot be linked to the substrate node via any continuous path. We performed one-tailed Wilcoxon tests to compare the Spearman’s ρ differences between the adjacent tables and non-adjacent controls, testing the null hypothesis that the differences in adjacent tables are not smaller than those in non-adjacent controls.
Additionally, to ensure robustness, each comparison was repeated in a one-to-one manner against 100 randomly sampled non-adjacent controls. Our results demonstrate that proximal derivatives (those requiring fewer conversion steps) showed greater similarity to the original substrates than distant derivatives did (Figure 3B), and that the difference between derivatives and non-adjacent molecules decreased with increasing distance from the substrate (Figure S2). We next determined the optimal correlation coefficient and the increase in confidence score to be added to maximise the difference between known SLC-metabolite pairs and simulated random pairs (Figure S1F-I). This improved the fractional recovery of known SLC-metabolite pairs by 90% (Figure 3D), indicating that metabolite adjacency information bolstered the accuracy of our prediction algorithm.
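Counting conversion steps between metabolite nodes, as described for the adjacency matrix, amounts to a shortest-path search on a graph. The sketch below uses breadth-first search and assumes that conversion edges are treated as undirected; the small edge list is illustrative only and was not extracted from KEGG.

```python
from collections import deque

def conversion_steps(edges, source):
    """Breadth-first search: conversion steps from `source` to each reachable metabolite.

    Metabolites absent from the result cannot be reached via any path and
    would therefore count as non-adjacent.
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)  # assumption: edges are undirected
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def adjacency_matrix(edges):
    """Nested dict of step counts between every pair of connected metabolites."""
    nodes = sorted({n for edge in edges for n in edge})
    return {n: conversion_steps(edges, n) for n in nodes}

# Illustrative conversion edges (toy graph).
edges = [
    ("creatine", "phosphocreatine"),
    ("creatine", "creatinine"),
    ("guanidinoacetate", "creatine"),
    ("glycine", "guanidinoacetate"),
    ("taurine", "hypotaurine"),
]
steps = conversion_steps(edges, "creatine")
matrix = adjacency_matrix(edges)
```

Here phosphocreatine and creatinine sit one step from creatine, glycine sits two steps away, and taurine is unreachable and so would serve as a non-adjacent control.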
Predicted substrates for orphan SLCs
When the pairs for each SLC were ranked according to their confidence scores, true SLC-substrate pairs tended to rank higher than simulated random pairs, as measured by median rank (Figure 4A). We further demonstrated that the fractional difference peaked when we considered only predictions ranked ≤ 178, with the fraction of true positives reaching 50% (Figure 4B). However, in order to generate a number of predictions for orphan SLCs that could be reasonably tested experimentally, we sought to reduce the number of predictions further. We reasoned that we could improve predictive power for a smaller number of possible substrates by simultaneously identifying over-represented metabolite pathways within the set. We curated a list of 623 metabolites across the three metabolomics datasets that could be linked to 57 metabolic pathways (Table S9). Using the known SLC-metabolite pairs we showed that 20 metabolite predictions were optimal to successfully predict enriched metabolic pathways containing the known substrate (Figure 4C).
On this basis we used our prediction algorithm to create a list of substrate predictions with high confidence scores for 128 orphan SLCs (Table S10). We identified many predictions that are in line with experimental data. For example, we found strong associations between the orphan SLC CLN3 and several glycerol phosphate-related metabolites (e.g. phosphatidylcholine, alpha-glycerophosphocholine, alpha-glycerophosphate, glycerylphosphorylethanolamine), agreeing with recent research indicating that CLN3 mutation in zebrafish leads to accumulation of glycerophosphodiesters (GPDs) in early development (Heins-Marroquin et al., 2024). We predicted that MTCH1 could be associated with metabolites involved in glutathione synthesis (glycine, glutathione, glutamate, pyroglutamate, NADPH), which aligns with the recent observation that MTCH1 deficiency correlates with NAD+ depletion in mitochondria (Wang et al., 2023). Moreover, our results converge with a previous attempt to predict SLC substrates using sequence information (Meixner et al., 2020). In that publication, SLC25A45, SLC22A25 and SLC35E2B were all predicted to have nucleobase-containing substrates, and our algorithm also predicted a variety of nucleobases as substrates for these transporters (Table S10). Together, our predictions could be used to generate plausible hypotheses for novel SLC substrates, which can be used to narrow down subsets of metabolites for downstream experimental verification, leading to faster de-orphanisation.
Leveraging drug repurposing panels provides new predicted SLC-drug interactions
Solute carriers are known to play an important role in determining drug pharmacokinetics, safety and efficacy profiles (Alam et al., 2023). A central goal of the recently established International Transporter Consortium is to identify key transporters involved in drug transport and to highlight potential issues around adverse drug-drug interactions involving transporters during clinical trials. Therefore, in parallel to the prediction of physiologically relevant substrates, we investigated whether interrogation of omics datasets could be used to identify drug molecules that are substrates for specific SLC proteins. We reasoned that expression of SLCs might affect drug efficacy and thus alter the shape of the dose-response curve reporting the relationship between viability and drug concentration. For example, when considering anticancer drugs, if cell death is enhanced or attenuated at higher SLC expression levels, one possible explanation is that the drug is a substrate for transport by the SLC in question (Figure 5A). We investigated our hypothesis using the cancer repurposing screen profiling 1448 active drugs against 578 cancer cell lines across 8 doses (Corsello et al., 2020). The cancer cell lines screened were ranked based on each SLC’s transcript levels in CCLE2019, with the highest and lowest 20% marked as “high expression” and “low expression”, respectively. We fitted non-linear regression models to the annotated screen data, and compared the curves between high expression and low expression cell lines. The differences captured by our analysis were consistent with previously validated results. For example, SLC35F2 expression sensitised cells to the drug YM-155, a known substrate imported by this SLC (Winter et al., 2014). SLC19A2 encodes a plasma membrane thiamine transporter (Dutta et al., 1999), but thiamine uptake is not a dose-dependent factor affecting cell viability (Figure 5B).
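Comparing dose-response curves between high and low expression groups can be illustrated with a minimal sketch. The form of the non-linear regression model is not specified in the text, so the two-parameter Hill curve and the grid-search least-squares fit below are assumptions, as are the function names and the synthetic viability values.

```python
import math

def hill(dose, ec50, slope):
    """Two-parameter Hill curve: viability falls from 1 towards 0 as dose rises."""
    return 1.0 / (1.0 + (dose / ec50) ** slope)

def fit_hill(doses, viability):
    """Least-squares fit by grid search over log10(EC50) and slope.

    Grid search keeps the sketch dependency-free; a dedicated optimiser
    would normally be used instead.
    """
    best = None
    for i in range(-40, 41):        # log10(EC50) from -4.0 to 4.0 in steps of 0.1
        ec50 = 10 ** (i / 10)
        for j in range(1, 41):      # slope from 0.1 to 4.0 in steps of 0.1
            slope = j / 10
            sse = sum((hill(d, ec50, slope) - v) ** 2
                      for d, v in zip(doses, viability))
            if best is None or sse < best[0]:
                best = (sse, ec50, slope)
    return best[1], best[2]

# Eight doses, matching the number of doses in the screen; values are synthetic.
doses = [0.001, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]
high_expression = [hill(d, 0.1, 1.0) for d in doses]  # more sensitive group
low_expression = [hill(d, 1.0, 1.0) for d in doses]   # less sensitive group

ec50_high, slope_high = fit_hill(doses, high_expression)
ec50_low, slope_low = fit_hill(doses, low_expression)
log10_shift = math.log10(ec50_low / ec50_high)  # positive: high expression sensitises
```

In this synthetic case the fitted EC50 differs tenfold between groups, which is the kind of curve shift that would suggest sensitisation by SLC expression.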
To test whether the analysis shows systematic predictive power in the drug repurposing screen, we compiled a list of drug molecules known to be transported by SLCs (Table S11), and used this to benchmark our predictions. Compared to simulated random pairs, known pairs showed a larger, dose-dependent difference between cell lines with high and low expression levels (Figure 5C), with the fractional difference maximised at a p-value cutoff of 0.17 (Figure S1H).
To remove the general impact of drug properties, we calculated a drug-specific significance threshold. For every drug, we randomly picked 20% of cell lines, separated these into high and low expression groups and compared their model predictions, repeating this procedure 100 times. To filter out insignificant pairs we used a drug-specific significance threshold two standard deviations away from the mean of the log-transformed p-values (Table S12). Subsequently, we ranked the remaining pairs by absolute mean difference and dose-dependent strength, and removed any drugs that target specific mutations. We then selected the top 50 predictions (Table S13). The SLC-drug pair with the best prediction statistics was YM-155 and SLC35F2, for which transport has been experimentally validated (Winter et al., 2014). Our algorithm also predicted previously unknown links; for example, we predicted an interaction between the orphan SLC Patched Domain Containing 4 (PTCHD4) and the drug molecule idasanutlin, which acts as a small molecule antagonist of the p53 activity suppressor Mouse double minute 2 homolog (MDM2) (Figure 5B; Ding et al., 2013). We also noticed a group of SLCs (SLC3A1, SLC7A7, SLC16A4, SLC23A1, SLC37A1, SLC37A2, SLC41A2, NPC1L1, CLN3) that interact with the small molecule inhibitor RITA, which induces apoptosis by (re)activating wild-type or mutant p53 (Wiegering et al., 2017). SLC3A1 is associated with attenuation of the drug killing effect, while the others are associated with sensitisation (Figure S3A). Importantly, our predictions worked across a range of cell lines (Figure S3B), demonstrating…. In summary, our work provides a possible route to predict SLC-drug interactions in parallel to physiological substrate determination, aiding the exploration of SLCs as a reservoir of therapeutic targets and alerting drug discovery teams to potential downstream issues with cell toxicity or adverse impacts on drug pharmacokinetics.
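The drug-specific significance threshold can be sketched as follows. The sketch assumes that the threshold lies two standard deviations below the mean of the log10-transformed p-values obtained from the random splits, so that only pairs more significant than the random background pass; the direction of the threshold, the function names and the toy p-values are assumptions.

```python
import math
from statistics import mean, stdev

def drug_threshold(null_p_values, n_sd=2.0):
    """Threshold on log10(p): mean of the random-split values minus n_sd standard deviations."""
    logs = [math.log10(p) for p in null_p_values]
    return mean(logs) - n_sd * stdev(logs)

def is_significant(p_value, threshold):
    """A pair passes if its log10(p) falls below the drug-specific threshold."""
    return math.log10(p_value) < threshold

# Toy p-values from random high/low splits for one hypothetical drug.
null_p_values = [0.5, 0.2, 0.8, 0.1, 0.9, 0.4, 0.3, 0.6]
threshold = drug_threshold(null_p_values)
strong_pair = is_significant(0.001, threshold)
weak_pair = is_significant(0.2, threshold)
```

Because the threshold is derived per drug, compounds that produce small p-values for any arbitrary split of cell lines are held to a stricter standard.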