2.1 UK Biobank samples and mental phenotypes
The phenotypic and genotypic data of this study were derived from the UK Biobank health resource (http://biobank.ndph.ox.ac.uk), a population-based prospective cohort study. The UK Biobank cohort recruited 502,656 participants aged between 40 and 69 years at recruitment, from 2006 to 2010. The study has collected a rich variety of phenotypic and health-related information on each participant, including physical and biological measurements, lifestyle indicators, imaging of the body and brain, and genome-wide genotyping. Longitudinal follow-up for a wide range of health-related information is provided through linkage to health and medical records.
Several potential measures of smoking behavior were selected to define the phenotype of ever smoking. UK Biobank data field 20432 is described as ongoing behavioural or miscellaneous addiction. Anxiety and depression were defined according to a previous study [29], based on the Generalized Anxiety Disorder scale (GAD-7) and the Patient Health Questionnaire (PHQ-9) [30, 31]. The fluid intelligence score was the simple unweighted sum of the number of correct answers given to the 13 fluid intelligence questions. The maximum number of reported past or current cigarettes (or pipes/cigars) consumed per day was used to define the frequency of smoking (UK Biobank data fields: 20116, 2887 and 3456). In addition, the frequency of alcohol consumption (UK Biobank data field: 20117) was defined as the sum of all alcoholic beverages consumed per week. Ethical approval for the UK Biobank study was granted by the National Health Service National Research Ethics Service (reference 11/NW/0382). Detailed definitions of the phenotypes are shown in Additional file 1.
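As a minimal sketch of the two derived measures described above (the unweighted fluid intelligence sum and the maximum reported number of cigarettes per day), the following Python fragment may help. The function names, the boolean/None input layout, and the treatment of missing answers are illustrative assumptions and do not reflect the UK Biobank export format.

```python
from typing import Optional, Sequence


def fluid_intelligence_score(correct: Sequence[bool]) -> int:
    """Unweighted sum of correct answers over the 13 questions.

    `correct` is a hypothetical list of 13 flags, one per question.
    """
    if len(correct) != 13:
        raise ValueError("expected answers to 13 questions")
    return sum(1 for answer in correct if answer)


def max_cigarettes_per_day(reports: Sequence[Optional[float]]) -> Optional[float]:
    """Maximum of the non-missing past/current daily amounts.

    Missing reports are assumed to be encoded as None; returns None
    when no amount was reported at all.
    """
    observed = [r for r in reports if r is not None]
    return max(observed) if observed else None
```

For example, a participant reporting 10 cigarettes per day currently and 20 in the past would be assigned a smoking frequency of 20 under this sketch.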
2.2 UK Biobank genotyping, imputation and quality control
A total of 488,377 participants have genome-wide genotype data, which were assayed using two similar genotyping arrays. DNA was extracted from stored blood samples collected when participants visited a UK Biobank assessment centre. Genotyping was carried out using the Affymetrix UK BiLEVE Axiom array or the Affymetrix UK Biobank Axiom array (Santa Clara, CA, USA), which share 95% of their marker content. Imputation was carried out with IMPUTE4 (https://jmarchini.org/software/) in chunks of approximately 50,000 imputed markers, each with a 250 kb buffer region. Routine quality checks were carried out throughout the process, including sample retrieval, DNA extraction and genotype calling. Statistical tests were performed to identify poor-quality markers by checking the consistency of genotype calling across experimental factors, including batch effects, plate effects, departures from Hardy-Weinberg equilibrium, sex effects, array effects, and discordance across control replicates. Based on self-reported ethnicity (UK Biobank data field: 21000), the analysis was restricted to “white British” individuals. Finally, participants whose self-reported gender was inconsistent with their genetic gender, who were genotyped but not imputed, or who withdrew their consent were excluded. Detailed descriptions of the array design, genotyping and quality control procedures can be found in previous studies [32, 33].
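The sample-level exclusions listed above can be sketched as a single predicate. This is only an illustration under stated assumptions: the `Participant` record, its attribute names, and the string encoding of ethnicity and sex are hypothetical and are not the UK Biobank field encodings.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    eid: int
    ethnicity: str       # self-reported label (hypothetical encoding)
    reported_sex: str    # self-reported
    genetic_sex: str     # inferred from genotype data
    imputed: bool        # False if genotyped but not imputed
    withdrawn: bool      # True if consent was withdrawn


def passes_sample_qc(p: Participant) -> bool:
    """Keep white British participants with consistent sex, imputed data and valid consent."""
    return (
        p.ethnicity == "white British"
        and p.reported_sex == p.genetic_sex
        and p.imputed
        and not p.withdrawn
    )


def filter_cohort(cohort):
    """Return the participants retained for analysis."""
    return [p for p in cohort if passes_sample_qc(p)]
```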
2.3 GWAS data of sex hormone related traits
The sex hormone associated SNPs were derived from a published GWAS [23]. Briefly, that study analyzed four sex hormone traits: sex hormone-binding globulin (SHBG), testosterone, bioavailable testosterone and estradiol. Association testing was conducted using linear mixed models implemented in BOLT-LMM (v2.3.2), which account for cryptic population structure and relatedness. Genotypic data were derived from the ‘v3’ release of UKBB [32], which contained the full set of Haplotype Reference Consortium (HRC) and 1000 Genomes imputed variants. We selected sex hormone associated SNPs at a significance threshold of p < 5 × 10⁻⁸ to calculate PRS. Detailed descriptions of the sample characteristics, array design, quality control and statistical analysis can be found in the previous study [23].
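The SNP selection step can be illustrated with a short filter over summary statistics. The row layout (keys `snp`, `effect_allele`, `beta`, `p`) is a hypothetical one chosen for the sketch; the published GWAS files may use different column names.

```python
GENOME_WIDE_P = 5e-8  # significance threshold stated in the text


def select_significant_snps(summary_stats, threshold=GENOME_WIDE_P):
    """Return {snp_id: (effect_allele, beta)} for rows with p below the threshold.

    `summary_stats` is an iterable of dicts, one per SNP (assumed layout).
    """
    return {
        row["snp"]: (row["effect_allele"], row["beta"])
        for row in summary_stats
        if row["p"] < threshold
    }
```

The comparison is strict, so a SNP with p exactly equal to 5 × 10⁻⁸ would not be retained under this sketch.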
2.4 PRS calculation of sex hormone related traits
Using the genotype data of the UK Biobank cohort, PRS were calculated using PLINK’s “--score” command [34]. Briefly, $PRS_i$ denotes the PRS of sex hormone levels for the ith subject, defined as $PRS_i = \sum_{n=1}^{t} \beta_n SNP_{in}$, where n (n = 1, 2, 3, …, t) indexes the genetic markers and i (i = 1, 2, 3, …, k) indexes the subjects, so that t and k denote the number of genetic markers and the sample size, respectively. $\beta_n$ is the effect parameter of the risk allele of the nth significant SNP related to sex hormone levels, which was obtained from the published study. $SNP_{in}$ is the dosage (0 to 2) of the risk allele of the nth SNP for the ith subject. The PRS values were standardized to have mean 0 and variance 1 before further analyses.
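The weighted sum and the subsequent standardization can be sketched in a few lines of Python. This is an illustration of the definition, not the PLINK implementation: the function names and the list-of-lists dosage layout are assumptions, and the population standard deviation is assumed for the scaling.

```python
from statistics import mean, pstdev


def polygenic_scores(dosages, betas):
    """Weighted sum of risk-allele dosages for each subject.

    dosages: one list per subject, each holding t dosages in [0, 2].
    betas:   the t per-SNP effect sizes from the published GWAS.
    """
    scores = []
    for row in dosages:
        if len(row) != len(betas):
            raise ValueError("each subject needs one dosage per SNP")
        scores.append(sum(beta * dosage for beta, dosage in zip(betas, row)))
    return scores


def standardize(values):
    """Rescale to mean 0 and (population) variance 1."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        raise ValueError("cannot standardize a constant vector")
    return [(v - centre) / spread for v in values]
```

PLINK 1.9 reports per-allele averages from “--score” by default rather than sums; with complete genotypes the two differ only by a constant factor, which the standardization step removes.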
2.4 Statistical analysis
Four serum sex hormone traits (SHBG, testosterone, bioavailable testosterone and estradiol) were analyzed both within and across sexes, with the exception of estradiol, for which analyses were performed only in men. Logistic regression models were used to assess the associations of the individual PRS of sex hormone traits with ever smoking and with ongoing behavioural or miscellaneous addiction. Correspondingly, linear regression models were used to evaluate the associations of the individual PRS of sex hormone traits with anxiety score, depression score, fluid intelligence score, and the frequencies of alcohol consumption and smoking. The regression analyses were conducted in R (version 3.5.3). Sex, age, and 10 principal components of population structure were included as covariates in the regression models. The Benjamini-Hochberg false discovery rate (FDR) procedure was used to control for multiple testing in this study.
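For readers unfamiliar with the multiple-testing correction, the Benjamini-Hochberg adjustment can be sketched as follows. This is a generic textbook implementation written for illustration; it is assumed, not confirmed by the text, that the study used an equivalent routine in R.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p values in the same order as the input."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

An association would then be declared significant when its adjusted value falls below the chosen FDR level.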
2.5 Genome-wide environmental interaction analysis
Based on the results of the regression models, GWGIS was then performed to assess the interaction effects between genetic factors and sex hormone related PRS on fluid intelligence, the frequency of smoking per day and the frequency of alcohol consumption per week in the UK Biobank cohort. The GWGIS was conducted using PLINK 2.0 [34, 35]. Letting D be the disease outcome variable, the penetrance model takes the following form:
$\operatorname{logit}[P(D = 1 \mid G, E)] = \beta_0 + \beta_g G + \beta_e E + \beta_{ge} G E$
where G denotes the genetic factor and E the environmental factor [36]. In this study, the outcome variables were fluid intelligence score, the frequency of smoking per day and the frequency of alcohol consumption per week, and the PRS of serum sex hormone levels were used as the instrumental variables, entering the model as E. For quality control, SNPs with Hardy-Weinberg equilibrium (HWE) p values < 0.001, minor allele frequencies (MAFs) < 0.01, or call rates < 0.90 were excluded. Significant interactions were identified at p < 5.0 × 10⁻⁸. A rectangular Manhattan plot was generated using the “CMplot” R script (https://github.com/YinLiLin/R-CMplot).
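To make the interaction model and the SNP-level filters concrete, the sketch below evaluates the penetrance model in the logistic form written above and applies the stated quality control thresholds. The function names and argument layout are hypothetical, and the code is an illustration of the formulas rather than the PLINK 2.0 procedure.

```python
import math


def penetrance(g, e, beta_0, beta_g, beta_e, beta_ge):
    """P(D = 1 | G, E) under the logistic interaction model.

    g is the SNP dosage, e the standardized PRS acting as E,
    and beta_ge the coefficient of the G x E product term.
    """
    linear_predictor = beta_0 + beta_g * g + beta_e * e + beta_ge * g * e
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def passes_snp_qc(hwe_p, maf, call_rate):
    """Apply the exclusion thresholds stated in the text."""
    return hwe_p >= 0.001 and maf >= 0.01 and call_rate >= 0.90


def significant_interactions(results, alpha=5.0e-8):
    """Return SNP ids whose interaction p value is below alpha.

    `results` is an iterable of dicts with assumed keys
    'snp' and 'p_interaction'.
    """
    return [r["snp"] for r in results if r["p_interaction"] < alpha]
```

When beta_ge is zero the effect of G on the log-odds is the same at every value of E; a non-zero beta_ge means that the per-allele effect changes with the PRS, which is the quantity tested for each SNP.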