In the current study, PRSs for predicting overall and subtype-specific breast cancer in Chinese women were developed using a GWAS dataset and validated in an external case-control dataset. The best PRSs (PRSANN and PRSLRR), based on 24 SNPs, showed good predictive ability (PRSANN: IQ-OR 1.76; AUC 0.601; PRSLRR: IQ-OR 1.58; AUC 0.598) and calibration (PRSANN: O/E OR 1.09; PRSLRR: O/E OR 1.08) for overall breast cancer. More importantly, the PRSANN was largely independent of Gail-2 model 5-year risk and other non-genetic risk factors. Both PRSs remained predictive of overall breast cancer after adjustment for classical breast cancer risk factors (PRSANN: IQ-OR 1.68; AUC 0.596; PRSLRR: IQ-OR 1.57; AUC 0.595), although calibration was slightly altered for PRSANN (O/E OR 1.17). These results indicate that our PRSs provide additional risk information and are therefore suitable for use alongside breast cancer risk prediction models based on non-genetic risk factors to stratify women into risk groups. Since the Gail-2 model is the only publicly available model that can be used to predict the risk of breast cancer in Chinese women, we also investigated combining the PRSANN and PRSLRR with the Gail-2 model. Although the AUC increased substantially when the PRSs were added to the Gail-2 model (from 0.531 to approximately 0.58), the combined models had lower predictive ability than the PRSs alone. This was largely due to the poor performance of the Gail-2 model in the SBCCS dataset, which is consistent with a recent meta-analysis reporting a pooled AUC of 0.55 (95% CI 0.52–0.58) for the Gail-2 model in Asian women.
Therefore, although our PRSs showed great potential to contribute additional risk information and increase predictive ability when combined with classical breast cancer risk factors, further studies are still needed to investigate their performance when combined with a more accurate non-genetic risk prediction model for Chinese women.
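For readers less familiar with the discrimination metric reported above, the AUC can be read as the probability that a randomly chosen case has a higher score than a randomly chosen control. The following is a minimal sketch of that definition only; the function name and the score values are hypothetical and are not taken from the SGWAS or SBCCS datasets, and it is not the software used in this study.

```python
def auc(case_scores, control_scores):
    """Probability that a random case outscores a random control.

    Ties between a case and a control count as one half.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical PRS values, for illustration only.
cases = [0.9, 0.4, 0.7, 0.2]
controls = [0.1, 0.5, 0.3, 0.6]
print(f"AUC = {auc(cases, controls):.4f}")
```

Under this reading, an AUC of 0.5 corresponds to no discrimination, which is why values of 0.53 to 0.55 for the Gail-2 model indicate performance only slightly better than chance.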
Another important application of the PRS is to identify women at high risk of breast cancer who could benefit from more frequent breast cancer screening or preventive therapy. It is therefore also important to assess how well the PRSs predict risk in the tails of the distribution. In the current study, the adjusted Q4th vs. Q1st ORs for PRSANN and PRSLRR were 2.51 and 2.47, respectively, meaning that women in the highest quartile of either PRS had an approximately 2.5-fold greater risk of having breast cancer than those in the lowest quartile. This represents a substantial improvement in predictive ability compared with previous PRSs developed for Chinese women (Supplementary Table S9). This improvement can be attributed to the use of individual-level genotype data and a more sophisticated approach to PRS construction. Nevertheless, our best PRSs were still less predictive than some recent PRSs developed for women of European ancestry [22, 23, 35, 36], perhaps reflecting the difference in the number of SNPs included. The performance of these PRSs can therefore still be improved by including more SNPs associated with breast cancer risk in the Chinese population.
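The quartile comparison can be illustrated with a simple unadjusted calculation. The sketch below assumes that quartile cut points are taken from the control distribution, which is a common convention but is an assumption here; it also omits the covariate adjustment used for the ORs reported above. The function name and the example scores are hypothetical.

```python
from statistics import quantiles


def quartile_odds_ratio(case_scores, control_scores):
    """Unadjusted odds ratio, top vs. bottom PRS quartile.

    Assumption: quartile cut points are defined on the controls.
    """
    q1, _, q3 = quantiles(control_scores, n=4)
    cases_top = sum(s > q3 for s in case_scores)
    controls_top = sum(s > q3 for s in control_scores)
    cases_bottom = sum(s <= q1 for s in case_scores)
    controls_bottom = sum(s <= q1 for s in control_scores)
    if controls_top == 0 or cases_bottom == 0:
        raise ValueError("empty cell in the 2x2 table; odds ratio undefined")
    return (cases_top * controls_bottom) / (controls_top * cases_bottom)


# Hypothetical scores, for illustration only.
control_prs = [1, 2, 3, 4, 5, 6, 7, 8]
case_prs = [1, 5, 7, 8, 9, 10]
print(quartile_odds_ratio(case_prs, control_prs))
```

An adjusted estimate of the kind reported in this study would instead come from a regression model that includes the quartile indicator together with the classical risk factors.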
Previous studies conducted in women of European ancestry showed that breast cancer PRSs were generally less predictive of ER− breast cancer than ER+ breast cancer. We confirmed this result in our dataset in the Chinese population. The primary PRSs were significantly less predictive and poorly calibrated for ER− breast cancer, with adjusted IQ-ORs and AUCs ranging from 1.24 to 1.30 and 0.548 to 0.555, respectively. To improve the prediction of ER− breast cancer, we developed ER− PRSs in the current study. The results indicated that training PRSs solely on ER− breast cancer cases yielded a substantial gain in predictive ability for ER− breast cancer. ER− breast cancer is a more aggressive subtype, and patients with ER− breast cancer have a significantly worse prognosis than patients with ER+ breast cancer. Identifying women at high risk of ER− breast cancer, regardless of their overall breast cancer risk, is therefore of great value in clinical practice and breast cancer screening. Our results highlight the need to optimize future PRSs for ER− breast cancer by incorporating more ER− cases in the training dataset and, perhaps, by including more SNPs associated with ER− breast cancer.
We compared three different approaches to PRS construction: the traditional RLR approach using summary statistics, and the LRR approach and the newly proposed ANN-based approach, both using individual-level genotype data. Compared with the traditional summary statistics-based RLR approach, the LRR and ANN approaches can address the issues of overfitting, collinearity and confounding by using individual-level genotype data, thus providing more accurate estimates of the weighting parameters. As expected, the primary PRSs constructed using the ANN and LRR approaches both achieved better predictive performance than PRSRLR (including the SNP-17-based PRSRLR in the sensitivity analysis). Through the use of a non-linear activation function and multiple hidden layers, the ANN model is able to fit high-order interactions between variables. Therefore, in theory, the ANN approach can capture interactions between breast cancer SNPs [38–40] and thereby achieve better predictive performance than the linear LRR approach. Our results are consistent with this expectation. The primary PRSANN showed higher predictive ability than the primary PRSLRR in predicting overall and ER+ breast cancer, which suggests the existence of interactions between the included SNPs. To explore possible SNP-SNP interactions, we conducted RLR tests to identify pairwise interactions in the SGWAS dataset. A total of 13 pairs of SNPs with possible SNP-SNP interactions (P < 0.05) were identified, but none of them reached the Bonferroni-corrected level of statistical significance (P < 1.8 × 10^−4). Further post-hoc analysis revealed that the interaction between rs10789190 and rs7799039 was statistically significant in both datasets (P < 0.05). The SNPs rs10789190 and rs7799039 are located in the leptin (LEP) and leptin receptor (LEPR) genes, respectively, making their interaction biologically plausible.
Adding this interaction term to the PRSLRR slightly improved its predictive ability (PRSLRR with interaction term: IQ-OR 1.62; AUC 0.602), indicating that the difference between the primary PRSANN and PRSLRR can be partially attributed to this interaction term. In other words, the ANN approach automatically captures potential interactions between SNPs, which are likely to be omitted in the traditional approaches. Nevertheless, the ANN approach is more sophisticated and less flexible than the LRR and RLR approaches. Whether the ANN can be considered the optimal approach to PRS construction remains to be investigated.
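The arithmetic behind the multiple-testing threshold, and the way a single interaction term enters an otherwise linear score, can be sketched as follows. The sketch assumes that all pairs among the 24 SNPs were tested, which is inferred from the reported threshold rather than stated above. The function name and all weights are hypothetical placeholders, not the fitted PRSLRR coefficients.

```python
from math import comb

# Bonferroni threshold for all pairwise tests among 24 SNPs
# (assumption: every unordered pair was tested once).
n_snps = 24
n_pairs = comb(n_snps, 2)       # number of unordered SNP pairs
threshold = 0.05 / n_pairs
print(n_pairs, f"{threshold:.2e}")


def linear_prs(genotypes, weights, interaction=None):
    """Weighted sum of risk-allele counts, with an optional product term.

    genotypes:   dict mapping SNP id to allele count (0, 1 or 2)
    weights:     dict mapping SNP id to a per-allele weight (hypothetical)
    interaction: optional tuple (snp_a, snp_b, weight) adding
                 weight * genotype_a * genotype_b to the score
    """
    score = sum(w * genotypes[snp] for snp, w in weights.items())
    if interaction is not None:
        snp_a, snp_b, w_ab = interaction
        score += w_ab * genotypes[snp_a] * genotypes[snp_b]
    return score


# Hypothetical genotypes and weights, for illustration only.
genotypes = {"rs10789190": 2, "rs7799039": 1}
weights = {"rs10789190": 0.1, "rs7799039": 0.2}
base = linear_prs(genotypes, weights)
with_term = linear_prs(genotypes, weights, ("rs10789190", "rs7799039", 0.05))
print(base, with_term)
```

A linear score needs each such product term to be specified explicitly, whereas a network with non-linear hidden layers can in principle learn them from the data, which is the contrast drawn in the paragraph above.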
The current study has several strengths. First, the PRSs were validated in an external dataset, which mitigates concerns about overfitting. Second, we examined the performance of the PRSs by ER status and further optimized the PRSs for ER− breast cancer prediction, which has not previously been done in Chinese women. Third, all the SNPs in the SGWAS and SBCCS were genotyped directly; imputation was conducted only for sporadic missing genotypes. However, the study also has some limitations. First, due to budget constraints, the search for candidate SNPs was limited to those that are well validated in the Chinese population; hence some newly identified SNPs, and SNPs that remain to be validated in the Chinese population, were omitted. The results of our study should therefore be interpreted with caution. Future studies should include more SNPs associated with breast cancer susceptibility, especially those identified in recent GWASs. High-quality genetic studies are also needed to identify and validate more breast cancer susceptibility SNPs in the Chinese population. Second, the performance of the PRSs in combination with classical breast cancer risk factors could not be assessed adequately, since there is no suitable breast cancer risk prediction model for Chinese women. Further studies are warranted to investigate the performance of the PRSs when incorporated into more accurate risk prediction models for Chinese women. Finally, our PRSs and study results are limited to Han Chinese women and may not be generalizable to Chinese women in other ethnic groups, although these groups account for only around 9% of the total population.
In summary, the SNP-24-based breast cancer PRSs showed significantly better predictive ability than previous PRSs developed for Chinese women. Our SNP-24-based PRSs were largely independent of classical breast cancer risk factors and thus have great potential to improve clinical practice and future risk-based breast cancer screening programs by providing additional risk information for the general population. Nevertheless, the predictive performance of the current PRSs can still be improved by incorporating more SNPs that are associated with breast cancer risk in Chinese women. The subtype-specific PRSs showed substantial improvement for ER− breast cancer prediction and can be used to identify women at high risk of ER− breast cancer. Our newly proposed ANN-based PRS construction approach automatically captures the potential interactions between SNPs and showed better performance than the traditional approaches, although additional studies are needed to further validate and investigate this approach.