Study participants
Data used in the present study were obtained from the Qatar Biobank (QBB) dataset. The QBB cohort is the first population-based prospective cohort study that included participants of Qatari nationals or long-term residents (living in Qatar for ≥ 15 years), aged 18 years and older, and appearing phenotypically healthy. Physical measurements for all participants were collected, and each participant completed a standardized questionnaire reporting lifestyle and medical history information. In addition, detailed baseline sociodemographic data, phenotypic data, clinical biomarkers, and biochemical tests were covered for the study participants. More details of the QBB project are explained previously (24).
Hamad Medical Corporation Ethics Committee approved the QBB study design (MOPH-AQBB-000222). All QBB participants signed informed consent waivers before participation. We submitted a request to access the QGP and QBB data (https://www.qatarbiobank.org.qa/research/how-to-apply), which was approved by the QBB IRB (IRB project number, QF-QGP‐RES‐ACC‐00075). The first QBB dataset release (N = 6,218 individuals) was used to carry out the current GWAS analysis. A large GWAS published recently using European, African, and South Asian participants were used in performing the meta-analysis and replication analyses (N = 363,228 individuals) (10).
Circulating 25(OH)D and dependent covariates
Blood samples were collected, centrifuged for serum separation, and immediately stored at -80°C for all participants. Quantitative evaluations of serum 25(OH)D concentrations were analyzed using a fully automated chemiluminescent immunoassay (CLIA), DiaSorin LIAISON, Germany, in the diagnostic laboratories at Hamad Medical City. Briefly, serum Vitamin D was dissociated from Vitamin D binding protein, and a labeled tracer was added. A washing step was performed to remove any unbound protein before initiating the CLIA reaction. 25(OH)D levels were determined using a photomultiplier. Serum concentration of 25(OH)D were available for 5885 subjects. Before the statistical analyses, the phenotype was normalized using rank-based inverse standard transformation by R (version 3.4.0). The anthropometric measurements, such as body weight and height, were performed by the Seca 284 stadiometer and balance. Body mass index (BMI) was calculated by dividing the weight (kg) by the square of height (m2).
Whole genome sequencing and bioinformatics analysis
Genomic DNA was isolated from whole peripheral blood using an automated QIASymphony SP instrument following the Qiagen MIDI kit protocol's instructions (Qiagen, Germany). Quantification was performed on the FlexStaion 3 (Molecular Devices, USA) using Quant-iT dsDNA Assay (Invitrogen, USA). Whole genome sequencing was conducted on the Illumina HiSeq X Ten (Illumina, USA) platform with an average coverage of 30x at Sidra Clinical Genomics Laboratory Sequencing Facility as previously described(25). Briefly, FastQC data (version 0.11.2) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) were aligned to the human reference genome GRCh37 (hs37d53) using Burrow-Wheeler Aligner (version 7.12) (BWA, https://github.com/lh3/bwa/tree/master/bwakit). Variant calling was obtained using HaplotypeCaller provided by Genome Analysis Toolkit (version 3.4) (GATK, https://software.broadinstitute.org/gatk/documentation/article?id=3238). Joint calling was conducted on all individual intermediate genomic variant call files (gVCF) to create a joint multi-samples VCF file for all the samples. The process consisted of two steps. We first applied GenomicsDB to combine regions for all samples. We then utilized GenotypeGVCFs using SNP/Indel recalibration to merge all regions. Variants with the PASS filter were only included for downstream analysis following the GATK VQSR filtering steps.
A comprehensive quality control (QC) assessment was performed to minimize population structure and genetic diversity in our data using PLINK (version 2.0) (26). SNPs with genotyping call rate < 90%, the minor allele frequency (MAF) < 1%, or the Hardy–Weinberg equilibrium P < 1 × 10− 6 were excluded. Additionally, we excluded samples with call rate < 95% (N = 1), duplicates (N = 10), excess heterozygosity (N = 8), and gender ambiguity (N = 65). We also performed multidimensional scaling (mds) analysis to identify population ancestry outliers with PLINK (26). A set of pruned independent autosomal SNPs (N = 62,475) was used to determine the pairwise identity-by-state (IBS) matrix through a window size of 200 SNPs and LD threshold of r2 = 0.05. Population outliers were considered and excluded if they deviated from the mean of the first two mds components by four standard deviation units or more (± 4 SD). The genome-wide association analysis was conducted on 7,880,618 genetic variants obtained from 6,047 participants.
Genome-wide association analysis
GWAS analysis under a variance component-based linear model was performed using GRAMMAR-Gamma (27) within the GenABEL/R package (version 1.8-0) (28) to study the association of each variant and 25(OH)D levels. This method corrects for relatedness and genetic substructure by using the genomic kinship matrix. Considering the mixture of the Qatari population, we performed principal components analysis through PLINK software (26). The first four PCs were used as covariates in the association model to minimize bias from population stratification. Further, we adjusted the regression model for age (years) and gender. Genome-wide significance cutoff was defined as P < 5 × 10− 8 and the nominal significance level as P < 0.05 (29). The quantile-quantile (Q-Q) plot, Manhattan plot, and genomic inflation factor were generated using R (version 3.4.0). Heritability (h2) was identified as part of the polygenic model in GenABEL to estimate the degree of variation in the 25(OH)D levels due to inter-individual genetic variation in a population (30).
Meta-analysis
We combined the results of the QBB GWAS and a recent large GWAS (GCST90019526) from the UK Biobank by the N Sinnott-Armstrong et al (10). to derive a combined meta-analysis for the suggestively associated loci. The GCST90019526 GWAS models were characterized with the same phenotype and methods following the correction of relevant covariates, including age, sex, and genotype PCs (10). Summary statistics for study GCST90019526 (10) were obtained from the NHGRI-EBI GWAS Catalog (31) taken on 02 December 2022. We confirmed matching A1 and A2 alleles with the alternate and reference alleles in the GCST90019526 study. We also canonicalized our association statistics as our reported effect size based on the alternate allele in the reference genome. Notably, 25(OH)D measurements were inversely normalized using rank-based transformation in both GWASs. PLINK (version 1.9) was utilized to perform an inverse-variance weighted meta-analysis and estimate heterogeneity of effects analysis (26).
Validation of previous association with 25(OH)D
We compared our findings to those of the GCST90019526 GWAS study on Vitamin D from the UK Biobank project (10) to evaluate the extent of replication and correlation of effect size and allele frequency for common signals. To test the possibility that SNPs observed in our meta-analysis were previously reported in the UK biobank at P < 5.0 × 10− 8, each SNP was checked in the GWAS catalog (EFO_0004631) of Vitamin D measurement trait from the November 2022 released data that was accessed on 02 December 2022 (31). We first examined the locus associated with Vitamin D levels in QBB and UK biobank driven by the same variants. We then determined markers within a 250-kb upstream and downstream region of the GWAS catalog signals to identify significant variants associated with Vitamin D.
Polygenic scores analysis
We tested the performance of European-derived polygenic risk score (PRS) on the QBB cohort to estimate genetic liability to Vitamin D using PLINK (version 1.9) (26). Polygenic scores consist of combined SNPs by the sum of risk alleles, which are weighted by corresponding effect sizes predicted by GWAS results. We used polygenic scores derived from one of the largest Vitamin D genome-wide association studies carried out in European populations (PGS000702: n = 255,256 individuals and 8,012 variants (10)). The scoring files of the study were obtained via the Polygenic Score Catalog (https://www.PGSCatalog.org) (32) accessed on 09 December 2022. Pearson’s correlation (R) between the inverse normalized Vitamin D levels, and European-derived PRS was computed with adjustment of age, gender, and first four PCs using the R software to evaluate the performance of the models on the Qatari population. We also used the area under the receiver operating characteristic (ROC) curve, also known as the "area under curve" (AUC). The AUC ranges from 0.5 (no distinction) to 1 (complete distinction), indicating the effectiveness of the derived PRSs in identifying those with Vitamin D deficiency, defined as serum 25(OH)D levels below 20 ng/mL and Vitamin D insufficiency when 25(OH)D levels between 21 to 30 ng/mL.
SNP annotation and functional analysis
The identified Vitamin D associates from the GWAS and meta-analysis were annotated using the Ensembl Variant Effect Predictor release 108 (VEP, https://grch37.ensembl.org/index.html) (33). We used the Genome Aggregation Database (gnomAD, https://gnomad.broadinstitute.org) and Allele Frequency Aggregator (ALFA, www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/) to compare the frequencies of the identified Vitamin D variants with those in the global populations.