Study population
We used data from three Greenlandic cohorts; The Inuit Health in Transition (IHIT, 2005-2010), comprising 3,115 individuals living in Greenland (19), the Greenland population study (B99) and BBH cohorts (1998-2001) comprising 1,401 individuals living in Greenland, and 503 Greenlanders living in Denmark, respectively (20). There was an overlap of individuals between IHIT and B99 (N = 295), and these were assigned to B99. By combining the three cohorts, we were able to include up to 4,473 individuals in the analyses.
Biochemical measurements
Blood samples were collected from fasting participants in IHIT (19) as well as the majority of participants of B99, and in fasting and non-fasting participants of BBH. Serum concentrations of total cholesterol, HDL-cholesterol, triglycerides, apoA1, and apoB were measured and concentrations of LDL-cholesterol were calculated. Triglyceride measurement and the calculation of LDL-cholesterol was only done for fasting participants (21). Participants also had anthropometric measures and blood pressure taken, as well as an oral glucose tolerance test for those aged ≥35 years in B99 and BBH and for those aged ≥18 years in IHIT, from whom blood samples were also taken two hours after the glucose ingestion to measure plasma glucose, serum insulin, and serum C-peptide, as previously described (19, 20).
Cardiovascular disease outcomes
We obtained information on ischaemic stroke, myocardial infarction, and revascularization procedures from The Greenland National Patient register, the Danish National Patient Register, and the causes of death register from both Greenland and Denmark. We used the International Statistical Classification of Diseases and Related Health Problems- (ICD-) 8, ICD10, and International Classification of Primary Care 2 (ICPC-2) codes, as well as The Classification of Operations and Treatments and The Nordic Medico-Statistical Committee (NOMESCO) Classification of Surgical Procedures for procedures (Supplemental Table 1 and 2). Follow-up data in registries were only available for IHIT and B99, resulting in the inclusion of 4,110 individuals in the analyses of CVD events.
Genotyping
All samples were genotyped on the Multi-Ethnic Global Array (MEGA chip, Illumina), with ~2M variants. Genotypes were called jointly for all cohorts using the GenCall module of the GenomeStudio software (Illumina). The data set underwent quality control, removing duplicate samples, individuals missing >5% genotypes, and variants with >1% missing genotypes. In total, data from 4,607 individuals and 1,769,375 variants passed the quality control.
Imputation
To create a reference panel, we obtained short paired-end whole genome sequencing data (Illumina) for 447 Greenlanders. Genotyping was done using GATK best practices with their build 38 resource bundle. The variants were lifted to hg19 using Picard (http://broadinstitute.github.io/picard/) and sites with >3 % missingness or mendelian discordance >5 % based on the large number of trios and duos present in the data set were removed. We phased the remaining data using ShapeIt2 (22), using available duo/trio information for some of the individuals. These phased data were then combined with 1000 genomes data for 99 CEU, 91 GBR, 103 CHB, 105 CHS, and 93 CDX. For the 1000 genomes data, we included sites that overlapped with the whole genome data or had a minor allele frequency (MAF) > 1 %.
For the 4,182 individuals without WGS data, we phased the MegaChip data with more than 5 minor alleles using ShapIt2 also using the duo/trio family information as well as the constructed reference panel. The imputation was performed using IMPUTE2 (23). The recombination map for the reference variants was inferred with linear interpolation using the hg19 genomic map from IMPUTE2 as a template, and an effective population size of 1,500 (24). The imputation generated genotype data with an info score (r2) above 0.8 for 12,270,968 variants. Of these we removed 16 tri-allelic variants on chromosome 12 and applied a MAF filter of 0.5 %, resulting in a data set of 10,265,398 variants with a mean info score of >0.98, which we included in the analyses.
Statistical analysis
Analyses of the Greenlandic data
To perform association analyses, we used a linear mixed model implemented in the software GEMMA (25) to account for the relatedness and population structure. For each phenotype, we included all individuals across the three cohorts with information about that phenotype, and the general relatedness matrix used as additional input in GEMMA was estimated from the same individuals using only SNPs with MAF above 5% and at most 1% missingness. We assumed an additive effect and included sex, age, cohort, and whether an individual had imputed or genotype data as covariates. Lipid levels were quantile transformed to a standard normal distribution within each sex before association analyses to obtain p-values and effect size estimates. In GWAS analyses, we applied a significance threshold of p<5x10-8. QQ plots showed no indications of inflation due to confounding in analyses of the six lipid traits (Supplementary Figure 1).
We estimated the partial variance explained (PVE) (26) for the identified variants in the Greenlanders and for the variants reported in Europeans (27).
CVD risk was assessed with Cox regression on the number of years lived until the first CVD event (survival time) using the R-package survival (28). These analyses were adjusted for sex, age, and the top 10 principal components (PC) to correct for the population structure. Years until an event was calculated as the years from birth until an event or end of follow-up (2018).
Analyses of UK Biobank data
Variants of interest were analysed in the UK Biobank version 3 (29) (UK Biobank project ID: 32683). LDL-cholesterol (Field ID: 30780 ‘LDL direct’) and total cholesterol (Field ID: 30690 ‘Cholesterol’) were rank-based inverse normal transformed separately for each sex and analysed in a linear model using plink2 (30). CVD risk was assessed using Cox regression, defining CVD events by the diagnosis codes in supplementary tables 1 and 2 extracted using the field IDs: 41270 (ICD10), 41271 (ICD9), 41272 (OPCS), 20002 (self-reported), and 20004 (self-reported operations) and their related data fields for date of diagnosis/operation. All analyses were adjusted for birth year and month, sex, and the first 10 PCs, and conditioned on the PCSK9 loss-of-function variant (rs11591147) to obtain the independent effect estimates for the PCSK9 rs12117661 variant.
Genetic risk scores
Genetic risk scores (GRS) were calculated for each of the six lipid traits, by including variants identified by the GWAS (p<5x10-8) and using the estimated effect size from the GWAS as weight. We did 10-fold cross-validation, dividing all the individuals randomly into 10 groups of equal size and estimating the GRS of the individuals in each group by using variants and effect sizes from a GWAS performed on the remaining 9 groups combined. Each of these GWAS was performed as the main lipid GWAS performed in this study. To avoid including multiple variants that associate due to LD, we sorted all the variants by p-value, kept the top variant, and discarded all other SNPs in a region +/- 5 Mb region around it. We repeated this procedure until there were no more variants left with p<5x10-8.
We calculated the variance explained by each GRS by the partial r2 of the GRS in a model adjusted for age, sex, and the first 10 PCs using quantile-transformed lipid levels.
In-silico variant analyses
The Ensembl database (31) and Encode 3 data (32) were queried to assess the co-localization of variants with regulatory elements, such as transcription factor binding sites, promoter regions, and regions of DNase hypersensitivity. Moreover, RNA expression data from 48 tissues (with >70 samples, range: 80-399) were queried through the GTEx Portal to assess possible effects of the genetic variants on the expression of nearby genes (33).