Study design: description of cohorts
The present study was submitted to and approved by the Clinical Research Ethics Committee (CEIC) of the Hospital Clínico Universitario de Valencia (Spain) on September 29th, 2016 (2016/169) and July 13th, 2018 (2018/139), and was conducted in compliance with the Declaration of Helsinki.
This is a case-control study compiling full genotyping and phenotypic data for a cohort recruited between January 2017 and December 2018 from two sources: the Hospital Clínico Universitario de Valencia and the Valencian Community Screening Programme (General Directorate of Public Health), both in the Autonomous Community of Valencia (on the Mediterranean coast). A total of 867 healthy women and 640 breast cancer patients aged 30 to 70 years were recruited. Patients had developed breast cancer within the five years before data collection, while controls were women who had not developed breast cancer during the same period. Participants with incomplete phenotypic data or genotyping failure were excluded, which left 1,097 participants: 642 healthy women and 455 breast cancer cases.
The patient cohort was composed of 45% Luminal A, 20% Luminal B, 20% Her-2 positive and 15% Triple Negative tumors (approximate percentages).
Data collection
Clinical information was collected for all subjects at recruitment: family history of breast cancer, date of birth, age, age at menarche, age at menopause, age at first pregnancy, and mammographic density (MD). Breast density was assessed from craniocaudal and mediolateral oblique mammographic projections by a radiologist with more than ten years of experience. The radiologist used the image viewer system (DICOM, from General Electric GIMD company) and classified MD according to Boyd's semiquantitative scale (8).
SNP selection and genotyping
As in our previous PRS risk analysis (4), we initially selected 76 SNPs from the European Collaborative Oncological Gene Environment Study (COGS) (9). These SNPs were significant or showed a trend towards significance in our previous validation with Spanish samples. The correlation of the genetic variants analyzed with prediction of breast cancer risk in women of the Spanish population has already been described (4). In brief, we analyzed the performance of our PRS using the 76 selected SNPs for breast cancer risk prediction in a Spanish case-control cohort. The initial selection was extended to 123 SNPs by including additional SNPs obtained from the OncoArray Project (10). Of these, 28 SNPs with an OR close to 1 (0.95<OR<1.05) and another 3 SNPs with platform genotyping failure were removed. In this way, a total of 92 SNPs (11–16) were eventually employed for the current analysis (Online Resource 1).
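The two exclusion rules above (an OR strictly between 0.95 and 1.05, or a platform genotyping failure) can be written as a simple filter. This is a minimal Python sketch, not the authors' code: the function name, the tuple layout and the toy identifiers are hypothetical.

```python
def select_snps(candidates, failed_ids, lower=0.95, upper=1.05):
    """Keep SNPs whose OR lies outside (lower, upper) and that genotyped successfully.

    candidates: list of (snp_id, odds_ratio) tuples (hypothetical layout).
    failed_ids: set of SNP ids that failed on the genotyping platform.
    """
    return [
        (snp_id, odds_ratio)
        for snp_id, odds_ratio in candidates
        if not (lower < odds_ratio < upper) and snp_id not in failed_ids
    ]


# Toy input with made-up identifiers and ORs, for illustration only.
toy = [("snp_a", 1.20), ("snp_b", 1.01), ("snp_c", 0.90), ("snp_d", 1.10)]
selected = select_snps(toy, failed_ids={"snp_d"})
```

Applied to the study's numbers, the same rule takes the 123 candidates down to 123 - 28 - 3 = 92 SNPs.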
The genotyping method has been described previously (4). In short, 10 ml of peripheral blood was collected in an EDTA tube. One µg of deoxyribonucleic acid (DNA) was used for the genotypic analysis (minimum concentration of 25 ng/µL). Genotyping was performed with the Open Array® Real-Time PCR platform (Life Technologies) using the Acufill® system and Taqman® probes. The data obtained were analyzed using Genotyper software. Samples with a call rate <0.95 were discarded. SNPs with a genotyping rate <0.95 and SNPs generating errors in control duplicates were also excluded.
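The call-rate thresholds above can be illustrated with a short Python sketch. It assumes a hypothetical data layout (a dictionary of samples, each mapping SNP ids to an allele count or `None` for a failed call) and assumes samples are filtered before SNPs; the text does not state the order. The check against control duplicates is not modelled here.

```python
CALL_RATE_MIN = 0.95  # threshold stated in the text


def call_rate(calls):
    """Fraction of genotype calls that are not missing (None)."""
    return sum(c is not None for c in calls) / len(calls)


def qc_filter(genotypes, threshold=CALL_RATE_MIN):
    """Drop low-call-rate samples, then low-call-rate SNPs.

    genotypes: dict sample_id -> dict snp_id -> 0/1/2 or None (hypothetical layout).
    Filtering samples first and SNPs second is an assumption.
    """
    if not genotypes:
        return {}
    kept = {s: g for s, g in genotypes.items()
            if call_rate(list(g.values())) >= threshold}
    if not kept:
        return {}
    snp_ids = list(next(iter(kept.values())))
    good_snps = [snp for snp in snp_ids
                 if call_rate([g[snp] for g in kept.values()]) >= threshold]
    return {s: {snp: g[snp] for snp in good_snps} for s, g in kept.items()}
```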
Statistical analysis
Sample size was calculated assuming a 95% confidence level (two-tailed test), 80% statistical power, a control-to-case ratio of 1.3 and an initial breast cancer prevalence of 12%; the required total was 1,138 women, close to the size of our case-control cohort (1,097). In an initial exploratory univariable step, each risk factor was compared between cases and controls using the Wilcoxon test, with a two-sided p-value threshold of 0.05.
The PRS was based on the combined effect of 92 SNPs statistically associated with breast cancer. This strategy assumes an independent effect of each SNP, ignoring departures from a multiplicative model (17). The PRS was derived for each study subject using the formula:
\[ \mathrm{PRS} = \sum_{k=1}^{n} \beta_k x_k \]
where $x_k$ is the number of risk alleles (0, 1 or 2) carried at SNP $k$, $\beta_k$ is the weight of SNP $k$ and $n$ is the number of SNPs (92). The $\beta_k$ weights are the ORs of the risk alleles associated with breast cancer described in Online Resource 1. This strategy has been used in other studies (5,6). The resulting values were normalized using the median PRS value of the control samples of the cohort.
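The score described above, a weighted count of risk alleles followed by normalization to the control median, can be sketched in a few lines of Python. The function names are hypothetical, and dividing by the control median (rather than subtracting it) is an assumption, since the text does not say how the normalization was applied.

```python
from statistics import median


def polygenic_risk_score(allele_counts, weights):
    """Weighted sum of risk-allele counts: sum over k of beta_k * x_k."""
    if len(allele_counts) != len(weights):
        raise ValueError("one weight per SNP is required")
    return sum(b * x for b, x in zip(weights, allele_counts))


def normalise_by_control_median(scores, is_control):
    """Divide every score by the median score of the controls (assumed form)."""
    reference = median(s for s, c in zip(scores, is_control) if c)
    return [s / reference for s in scores]
```

With this convention a control at the median has a normalized score of 1, so values above 1 indicate a higher genetic burden than the typical control.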
In the phenotypic analysis, the phenotypic categories were transformed into quantitative variables using the ORs described in the Pollan et al. study (8), except for family history, the ORs of which were based on the Pharoah et al. study (18). In addition, the age of women (age at diagnosis for patients and age at interview for controls) was grouped into five-year periods, as in other publications (19), which allowed the groups to be transformed into quantitative variables. The final number of cases and controls in our cohort was 455 and 642, respectively.
For the univariable analysis, logistic regression adjusted for age and centre was applied to each risk factor. The coefficients of the model were standardized using the reghelper library of R (20). Additionally, the PRS was adjusted for the first five principal components. The interaction effect between variables was also evaluated using the likelihood ratio test (LRT). All analyses were two-sided and employed a p-value threshold of 0.05.
To confirm the independence of the PRS and the other phenotypic risk factors, pairwise Spearman correlations were computed among unaffected controls.
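As an illustration of the correlation used here, Spearman's coefficient is the Pearson correlation of the ranks, with tied values sharing their average rank. The following Python sketch uses hypothetical function names and is not the authors' code.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A coefficient near zero between the PRS and a phenotypic factor among controls is what supports treating them as independent predictors.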
For the multivariable study, we performed a logistic regression analysis that incorporated the statistically significant variables obtained in the previous steps, including the interaction terms. Family history and age at menarche were also included in the analyses, even though they were not significant, since they are well-known risk factors. The significance of the final model was evaluated using the Wald test (21). To assess the fit of the final multivariable model, a global Hosmer-Lemeshow goodness-of-fit test was performed using deciles (22).
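The Hosmer-Lemeshow statistic mentioned above groups subjects by predicted risk and compares observed with expected case counts in each group. A minimal Python sketch follows, with a hypothetical function name. It returns only the statistic and its degrees of freedom; obtaining the p-value requires a chi-square distribution function, which is left out here. It also assumes that no group has a mean predicted risk of exactly 0 or 1.

```python
def hosmer_lemeshow_statistic(predicted, observed, groups=10):
    """Hosmer-Lemeshow chi-square statistic and degrees of freedom (groups - 2).

    predicted: fitted probabilities from the model.
    observed: 0/1 outcomes (1 = case).
    Subjects are sorted by predicted risk and cut into equal-sized groups.
    """
    pairs = sorted(zip(predicted, observed), key=lambda t: t[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of cases
        events = sum(y for _, y in chunk)     # observed number of cases
        mean_p = expected / size
        statistic += (events - expected) ** 2 / (size * mean_p * (1 - mean_p))
    return statistic, groups - 2
```

A small statistic relative to the chi-square distribution with the returned degrees of freedom indicates no evidence of poor fit.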
To evaluate improvement in risk prediction for the different models and risk factors, the area under the curve (AUC) was evaluated (23) as a measure of discrimination between cases and control women. This calculation was performed using the pROC (24) library of R. To avoid possible overfitting of the model, the 95% confidence interval (CI) of the AUC was assessed using a cross-validation strategy (25). This step was based on calculating the AUC in 1,000 permutations, each using a random selection of 90% of women as a training set and the remaining 10% as a test set.
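The resampling scheme above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pROC-based analysis: the AUC is computed directly as the probability that a case outscores a control (ties count one half), `fit` is a hypothetical user-supplied function that trains a model and returns a scoring function, and the interval is taken as the 2.5th and 97.5th percentiles of the test-set AUCs, which the text does not specify.

```python
import random


def auc(scores, labels):
    """Probability that a random case scores above a random control (ties = 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def repeated_split_auc(rows, labels, fit, n_repeats=1000, test_fraction=0.10, seed=0):
    """Mean test-set AUC and percentile interval over repeated 90/10 splits.

    fit(train_rows, train_labels) must return a function mapping a row to a score.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    n_test = max(1, round(test_fraction * len(rows)))
    aucs = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        test_labels = [labels[i] for i in test]
        if len(set(test_labels)) < 2:
            continue  # AUC is undefined without both cases and controls
        score = fit([rows[i] for i in train], [labels[i] for i in train])
        aucs.append(auc([score(rows[i]) for i in test], test_labels))
    aucs.sort()
    lower = aucs[int(0.025 * (len(aucs) - 1))]
    upper = aucs[int(0.975 * (len(aucs) - 1))]
    return sum(aucs) / len(aucs), (lower, upper)
```

Because each AUC is computed on women the model did not see during fitting, the spread of these values reflects out-of-sample discrimination rather than fit to the training data.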
Finally, women were stratified into deciles based on their final individual risk factor, obtained from the multivariable model. The ORs of the extreme deciles were evaluated using logistic regression, with the 40-60% range as the reference.
Based on the characteristics of our cohort, the final individual risk factor proposed in this study describes the relative risk of women in the Spanish population of developing breast cancer within a maximum period of five years.
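A Python sketch of the decile comparison is given below. It relies on the fact that, for an unadjusted logistic regression whose only predictor is the decile indicator, the estimated OR equals the ratio of case-to-control odds in that decile to the odds in the reference group. Whether the authors' regression included further covariates is not stated, so this is an assumption, and the function name is hypothetical. Every group compared is assumed to contain at least one control.

```python
def decile_odds_ratios(risk, is_case, reference=(4, 5)):
    """ORs of the lowest and highest deciles against the 40-60% band.

    risk: individual risk values from the model.
    is_case: 0/1 integers (1 = case).
    Deciles are indexed 0..9, so the 40-60% band is deciles 4 and 5.
    """
    order = sorted(range(len(risk)), key=lambda i: risk[i])
    n = len(order)
    counts = [[0, 0] for _ in range(10)]  # [controls, cases] per decile
    for position, i in enumerate(order):
        counts[position * 10 // n][is_case[i]] += 1
    ref_controls = sum(counts[d][0] for d in reference)
    ref_cases = sum(counts[d][1] for d in reference)
    ref_odds = ref_cases / ref_controls
    return {d: (counts[d][1] / counts[d][0]) / ref_odds for d in (0, 9)}
```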