Differential expression analysis reveals few common genes and regulated pathways between craving-related datasets. After filtering out genes with low expression, the datasets contained 10,026 (Carpenter), 9,948 (Walker), and 7,104 (Powell) genes each. The Supplementary Materials contain summaries of the processing for each sample (Supplementary Table 1) and final statistics for each dataset (Supplementary Table 2). Two outlier samples were removed from the Powell C21 group because they did not cluster with other samples on the MDS plot. MDS and variancePartition plots showed that the experimental variables sequencing batch (Walker) and lane (Powell) explained a larger portion of the variation in gene expression than the treatment groups, so these variables were incorporated into the linear models before differential expression analysis. Supplementary Figs. 2–4 contain MDS and variancePartition plots, along with voom44 results before and after linear modeling. Final sample sizes were 12 (6/group; Carpenter), 11 (5–6/group; Walker) and 4 (2/group; Powell).
To demonstrate how to use evolutionary and human medical data to prioritize RNAseq results, we proceeded with using a P value < 0.05 (Fig. 2, Supplementary Tables 2–4). For the Carpenter contrast, there were 633 DEGs, with 304 up- and 329 downregulated in C28 relative to S28. A similar number of DEGs were identified for the Walker contrast (139 DEGs; 39 up- and 100 downregulated in C30 relative to S30), while the Powell contrast had 633 DEGs (289 up- and 291 downregulated in C21 relative to C1). In contrast, only 1 DEG was identified using an FDR threshold of 0.05 or 0.1 across all three datasets. This gene, AC239701.1 (ENSRNOG00000060437, adjusted P = 0.046), was significantly downregulated in the Powell dataset and is thought to be a pseudogene and does not yet have a known function, though a recent study suggests it may be sexually dimorphic in microglia 51.
We further used sequence homology to identify comparable genes across studies. To compare DEGs between the Powell contrast and the others, Ensembl IDs for rat genes were first converted to a list of mouse homologs. This reduced the Powell dataset to 534 rat DEGs (270 up- and 264 downregulated) corresponding to 548 mouse homologs (280 up- and 268 downregulated). Some rat genes had more than one ortholog in mouse, resulting in a higher number of mouse orthologs than initial rat genes. No genes were differentially expressed between all 3 contrasts. There were 15 shared genes between Carpenter and Walker, 45 between Carpenter and Powell, and 5 between Walker and Powell, of which 10, 21, and 3 were regulated in the same direction. These correspond to 9, 21, and 3 human homologs, respectively, for a total of 33 Craving genes when taking the union across all three comparisons. Of these 33 genes, one was excluded from the analyses because it was not annotated/present in the majority of the downstream datasets, in human it is annotated as a novel transcript (AC009690.3 or ENSG00000273025), and because it is annotated as a novel paralog of another Craving gene, CELF6. The final 32 genes are as follows: AGK, AMZ1, B2M, BCAS1, BTG1, CACYBP, CARTPT, CCDC88C, CELF6, EGR2, FABP7, FKBP4, FTH1, GPD1, GUCY1A3, HAPLN2, HSPA8, IRS2, KIF5A, LYPD1, MBP, MOBP, NTS, PHLDA1, PITPNM3, RGS5, RPS6KA2, SOX17, TIPARP, TTLL1, USP46, and VIM. For full names of all discussed genes, see the Supplementary Material.
RNA-seq power analysis. Because almost no DEGs were identified for each study using an FDR cutoff, we calculated the sample size per group required to reach 0.8 power for each dataset with FDR < 0.05. Estimated sample size varied for each dataset based on the expected proportion of DEGs and their fold change values (Fig. 3). For example, assuming a small effect size for each DEG (fold change of 1.25) but a high proportion of DEGs (0.2, or 20%), the datasets were estimated to need 7 (Carpenter), 13 (Walker), or about 25 (Powell) samples per group. Calculating for both a large effect size (fold change of 2) and high proportion of DEGs (0.2, or 20%), all three datasets were estimated to need only about 3 samples. In a more realistic scenario, where fewer genes are differentially expressed (5%, corresponding to 500 genes for every 10,000), each with a small fold change difference (1.25), the estimated necessary sample size varies more between datasets, with just 8 for Carpenter, but 16 for Walker and > 30 for Powell. This is unsurprising, since power analysis tools for RNA-seq are more accurate in estimating necessary sample size when the pilot or input data has more samples52, and the Carpenter and Powell datasets contained the most and fewest samples per group, respectively. The Powell dataset also shows the highest mean dispersion across all genes (0.056, vs. 0.015 for Carpenter and 0.028 for Walker; Supplementary Table 2).
Prioritizing Craving genes using evolutionarily conservation. We propose that therapeutically relevant Craving genes are conserved between rodents and humans, and as such evolutionary conservation is a measure to prioritize RNAseq candidates. We find that overall, Craving genes showed somewhat higher sequence similarity for both human-mouse (Mann–Whitney U = 307576, n1 = 31, n2 = 16699, median1 = 91.7, median2 = 87.1, P = 0.0696) and human-rat (Mann–Whitney U = 270762, n1 = 29, n2 = 15863, median1 = 89.1, median2 = 86.2, P = 0.0988) homologs when compared to all other homologous pairs in the genome (Fig. 4A), though results did not reach statistical significance. Craving genes did not have significantly different dN/dS values for either human-mouse (Mann–Whitney U = 182458, n1 = 28, n2 = 15067, median1 = 0.093, median2 = 0.109, P = 0.216) or human-rat (Mann–Whitney U = 162708, n1 = 27, n2 = 14260, median1 = 0.088, median2 = 0.110, P = 0.164) homologs compared to other homologs (Fig. 4B). Only one-to-one homologs were included in the statistical analyses, but where available, sequence similarity and dN/dS ranges for all homologs are available in Supplementary Table 6.
We suggest developmental trajectory as another measure of conservation for prioritizing candidate genes. A Fisher’s exact test indicated no difference in the evolutionary conservation of Craving gene homolog developmental trajectories in the forebrain compared to all other genes (n1 = 21, n2 = 9824, P = 0.643, Fig. 4C). This analysis excluded genes that had no data available (CELF6, FKBP4, FTH1, HSPA8, LYPD1, TIPARP, and USP46), or that had data only for human-mouse (GPD1, PHLDA1, RPS6KA2) or human-rat (SOX17) comparisons and not the other rodent species18. All Craving genes with complete available data showed conservation in their developmental expression in the forebrain of human, mouse, and rat, with the exception of CACYBP, NTS, and RGS5. Individual conservation results for each gene are presented in Supplementary Table 6.
For each Conservation measurement (sequence similarity, dN/dS, developmental conservation), genes were categorized as having High, Medium, or Low conservation (Table 3). For sequence similarity, genes were categorized as High, Medium, and Low if they had similarity ≥ 90%, between 80 and 90%, or < 80%, respectively. For dN/dS scores, genes were considered to have High conservation if they had values below the Craving median (0.11), Medium conservation at any other non-outlier value (≤ 0.2), and Low if they were outliers (> 0.2). dN/dS values were categorized as “High” if they were missing due to 100% sequence similarity, and not categorized in any other cases of missing data. Developmental conservation was categorized as High if the gene shared its developmental expression pattern in human, mouse and rat (HMR), Medium if its temporal expression in humans was shared with either mouse (HM) or rat (HR), and Low for no conservation with either rodent species (H). Genes with no data for both the human-mouse and human-rat comparisons were not categorized. For genes where data was available for one rodent species and not the other, they were categorized as Medium for shared developmental expression in the forebrain with that species, and Low if they did not.
Table 3. Prioritization of candidate Craving genes.
Genes were categorized as High (green), Medium (yellow), and Low (red) priority for each conservation and brain expression metric. Complete data are listed here, even for homologs that do not map one-to-one and were not included in the statistical analyses. Sequence similarity was measured as the percentage match between a rodent gene and its human homolog. Data from Cardoso-Moreira et al.18 was used to determine conservation of gene expression patterns across development in the forebrain. HMR = conserved expression across development in human, mouse, and rat; HM = conserved expression in only human and mouse; HR = conserved expression only in human and rat; H = developmental expression not conserved between humans and either rodent species. Developmental conservation assignments listed in this table are based on this available data only, assuming developmental expression is not conserved with the untested species. For example, GPD1 is "HM" because its developmental expression is conserved between human and mouse, but data is not available for human vs. rat. Brain specificity was calculated as the log2 fold change of the total expression in central nervous system (CNS) tissues compared to non-CNS tissues, using data from the GTEx database. The brain region(s) with the highest expression were determined based on post-hoc Tukey’s tests (Supplementary Table 9). See Supplementary Material for brain region abbreviations. Mean brain expression was calculated by obtaining the mean expression in TPM for each brain tissue across all individuals regardless of age, then calculating the grand mean.
* Indicates genes that do not have complete developmental expression data available, i.e. genes that only had data comparing human and mouse (GPD1, PHLDA1, RPS6KA2) or human and rat (SOX17).
$ Indicates homologs with one-to-many or many-to-one mapping. For some of these homologs, sequence similarity and/or dN/dS are given as a range.
^ FTH1 has a rat ortholog according to the January 2020 but not December 2021 Ensembl release, and thus a dN/dS value is listed but not sequence similarity.
+ CELF6 is a paralog of another Craving gene that corresponds to a poorly annotated novel transcript, ENSG00000273025, which was excluded from the analyses.
Though the overall set of Craving genes is not more or less evolutionarily conserved than other genes, there are individual genes that would make poor candidates based on their conservation. For example, human FTH1 has a homolog in mouse but not rat, so studies of this gene would need to be confined only to mouse models, limiting translational utility. Some Craving genes were also outliers with very low sequence similarity and/or high dN/dS values, including BCAS1, B2M, FABP7, and MBP (Fig. 5A, B, Table 3); however, all of these genes are conserved in their developmental expression in the forebrain across the three species. Importantly, several other genes are not developmentally conserved, including NTS and RGS5, which have conserved expression between human and rat only. CACYBP, which has high sequence similarity between human and mouse (93.4%) and human and rat (90.3%), is not conserved between human and either rodent species for developmental expression in the forebrain. This emphasizes the importance of considering functional conservation when selecting a candidate gene.
Narrowing candidate Craving genes by brain specificity and brain tissue- and sex-specific expression. Utilizing the GTEx dataset, we found that Craving genes are expressed with higher brain specificity than other genes after CNS- and body-exclusive genes were removed (Mann-Whitney U = 1233008, n1 = 32, n2 = 55305, median1 = -1.44, median2 = -2.51, P = 0.000116, Fig. 5A). This result remains significant even after filtering out lowly-expressed genes with a grand mean of ≤ 1 (Mann–Whitney U = 428911, n1 = 31, n2 = 20624, median1 = -1.54, median2 = -2.36, P = 0.000992; excluding the lowly-expressed Craving gene AMZ1) or ≤ 4 TPM (Mann–Whitney U = 299236, n1 = 31, n2 = 14404, median1 = -1.54, median2 = -2.35, P = 0.00105; also excluding AMZ1) across all tissues. Figure 5B and 5C depict the overall gene expression across tissues for 2 previously identified addiction genes, CREB1 53,54 and C1QL2 32,55, and the 4 Craving genes with the highest CNS specificity: MOBP, KIF5A, MBP, and HAPLN2.
Next, we probed for possible sex and tissue differences in brain expression of Craving genes, which is important when developing additional experiments and considering translation to humans. Using age-matched data for males and females ≥ 55 years, we performed two-way ANOVAs by Sex and Tissue for the 12 CNS tissues. ANOVA and post-hoc Tukey statistics for Sex and Sex:Tissue interactions are reported in Supplementary Tables 7–10. All 32 genes showed a significant main effect of Tissue. Post-hoc Tukey’s tests were used to determine which brain tissue(s) showed enriched expression of a gene relative to others and to generate compact letter displays (CLDs) to cluster tissues with similar expression levels (e.g., “a” has the highest expression, “b” has the second-highest expression, and so on) (Supplementary Table 9). For example, KIF5A shows higher expression in the cortex relative to all other tissues and is denoted with the only “a”; however, for NTS, the pituitary gland shows significantly higher expression than all other tissues except the hypothalamus, so they are designated as “a” and “ab”, respectively; Supplementary Fig. 5). We then categorized genes as having “enriched” expression in specific brain tissue(s) if the CLD included the letter “a” (Table 3).
Of the 5 genes that had a significant main effect of Sex, 4 had higher expression in males (AMZ1, NTS, PITPNM3, TTLL1) and 1 was more highly expressed in females (B2M). Several genes also showed a significant Sex:Tissue interaction, including CARTPT, which has higher expression in males in BRN-CAU, as well as NTS and RPS6KA2, which both have higher expression in males in BRN-PTRY. HAPLN2 also showed a significant Sex:Tissue interaction, but none of the post-hoc tests for this interaction were significant after Tukey’s correction. Summary brain expression results for each gene are presented in Table 3. Supplementary Fig. 5 shows expression of each Craving gene across Sex and Tissue using the age-matched samples.
Craving genes were categorized as High, Medium, or Low priority based on the brain expression data (Table 3). The following criteria were used: High - brain specificity > 0 and enriched expression in reward regions, but not cerebellum or spinal cord, according to post-hoc Tukey’s tests (Supplementary Tables 9 & 11); Medium - brain specificity > 0 but enriched expression only in non-reward reward regions, or brain specificity < 0 but enriched expression in reward regions; Low - brain specificity ≤ 0 and has enriched expression only in cerebellum and/or spinal cord. The following tissues were considered reward-related: anterior cingulate cortex, amygdala, caudate, frontal cortex, hippocampus, hypothalamus, nucleus accumbens, pituitary gland, putamen, and substantia nigra. The grand mean of expression (TPM) for all brain tissues was also calculated for each gene, and genes were categorized as High, Medium, and Low if their expression was ≥ 20, between 10 and 20, or < 10 TPM, respectively.
Using this approach, we found genes that would make exciting candidates to explore for a role in craving and reward. Though most of these genes do not show sex differences in brain expression for healthy human adults, for those that do (e.g. CARTPT, NTS), extra consideration should be made to design any preclinical experiments with adequate power to detect possible sex differences. Ten genes have higher expression in the brain than the rest of the body: MOBP, KIF5A, MBP, HAPLN2, BCAS1, FABP7, LYPD1, CARTPT, AMZ1, and NTS (Supplementary Table 11). Post-hoc Tukey’s tests indicate that five of these genes (MOBP, MBP, HAPLN2, BCAS1, and FABP7) are preferentially expressed in the spinal cord or cerebellum, which are tissues not conventionally associated with addiction. Rather, creating therapeutics that target genes expressed highly in these tissues may lead to negative side effects. However, other genes are highly expressed in reward-related regions. For example, KIF5A and LYPD1 are most expressed in the cortex and nucleus accumbens, respectively. AMZ1 has equally high expression in the anterior cingulate cortex, amygdala, and nucleus accumbens, which are all involved in emotional and reward processing. Many other Craving genes do not have high brain specificity, but within the brain are most expressed in reward-related regions. This includes IRS2 (highest expression in pituitary gland), CCDC88C (highest in caudate and nucleus accumbens), and GUCY1A3 (highest in nucleus accumbens).