A Comprehensive Screening Strategy of Breast Cancer Based on Different Penetrance Susceptibility Genes in Healthy Han Chinese Females

Background: Morbidity and mortality associated with breast cancer (BC) have increased rapidly in China. Early screening and intervention could greatly reduce the risk of hereditary BC. Several risk models have been used over the last decades to predict individual BC risk, but most of them do not assess the polygenic risk of low-penetrance genes. A novel screening method integrating susceptibility genes of different penetrance is urgently needed.

Methods: Twenty variants of high and moderate penetrance susceptibility genes (HVs) and twenty-three loci of low penetrance susceptibility genes (LVs) were selected from previous studies. These variants were genotyped in 3777 healthy Han Chinese women (HCW) and 401 BC subjects. Based on the mutation profiles, we proposed a comprehensive screening strategy using HVs and LVs to evaluate the polygenic risk score (PRS) in healthy individuals.

Results: Three HVs, in the BRCA1, BRCA2 and PALB2 genes, were detected in the study population, which suggests the necessity of genetic testing in healthy HCW. LVs were widely carried by the subjects, and their frequencies differed greatly between HCW and Western populations. After quality control and lasso dimension reduction, nineteen of the twenty-three LVs were used to construct a logistic regression model to evaluate the cumulative genetic risk score. The area under the curve (AUC), sensitivity and specificity of the model were 0.993, 0.9676 and 0.9617 respectively, indicating that it is robust. Finally, a screening strategy using HVs and LVs was put forward to evaluate the risk of BC in healthy subjects.

Conclusion: The distribution of HVs and LVs differed greatly between Han Chinese females and Western populations. A screening strategy combining HVs and LVs showed strong efficacy in distinguishing high-risk individuals from healthy women.


Breast cancer (BC) is the most common invasive cancer and the leading cause of cancer-related deaths in women worldwide and in Asian countries [1]. Morbidity and mortality associated with BC have increased rapidly in China [2]. Although the precise mechanisms underlying this heterogeneous disease have not been fully elucidated, increasing evidence indicates that genetic variants contribute to the heritable risk of breast cancer, in addition to environmental and lifestyle factors [3].

Over the past 30 years, at least 170 susceptibility genes or loci with different population penetrance [...]

[...] (0.041 μl) was parsed with three steps: 94℃ 30 s, 94℃ 5 s; 52℃ 5 s, 80℃ 5 s; 5 cycles; Step 1 & Step 2: 40 cycles; Step 3: 72℃ 3 min, 4℃ hold. Reaction products were purified with resin after PCR extension, SAP digestion and single-nucleotide base extension. Genotype analysis was carried out on a MALDI-TOF (matrix-assisted laser desorption/ionization time-of-flight) mass spectrometry analyzer. The molecular weight of the single-nucleotide extension product was analyzed by the MassARRAY system to obtain the site typing results, and the quality value of the typing results was determined according to the signal-to-noise ratio, peak tip position and peak width of the mass spectrum peak.

Data analysis
All tables and text files were parsed with in-house Perl (version 5.18) scripts. Statistical analyses and figure rendering were performed using R (version 3.6.1; http://www.r-project.org/). Plink (version 1.07-x86_64) was used to analyze the linkage between loci [30]; any two sites with r² > 0.5 were judged to be in linkage disequilibrium (LD). The Hardy-Weinberg equilibrium (HWE) test was also performed with Plink (--hardy). Tumor dataset GSE118527 was downloaded from the NCBI Gene [...]

All 3777 subjects were healthy Han Chinese females without a family history of breast cancer. The median age was 40 (19-86); 1733 subjects were below 40 years old and 2044 were above 40 years old. The age distribution is shown in Fig. 1a. In total, 43 sites were selected for further study, including 20 HVs and 23 LVs (Fig. 1b). The HVs belong to the BRCA1, BRCA2, TP53, PALB2, ATM and CHEK2 genes and include three mutation types: nonsense, missense and frameshift mutations. All HVs are located in exon regions (Fig. 1c). Detailed information is given in Table S1. All 23 LVs are intronic or intergenic variants, derived from case-control studies of breast cancer in the GwasCatalog database (Table S2). The OR values of the LVs range from 1 to 2, and the maximum LD coefficient between any two LVs was 0.36 (Fig. 1d). Genotypes of all subjects were measured on the MassARRAY platform. All genotyping data are listed in Table S3.

In the LD analysis of the 23 LVs, r² between any two loci was less than 0.5.
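To make the pairwise LD check concrete, the following is a minimal Python sketch, not the authors' Plink pipeline. It assumes genotypes are coded as 0/1/2 allele counts and takes r² as the squared Pearson correlation of those counts; the function names (`genotype_r2`, `pairs_in_ld`) and the locus labels in the usage note are hypothetical.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two loci coded as 0/1/2 allele counts.

    Assumption: samples with a missing call (None) at either locus are skipped.
    Returns NaN when a locus is monomorphic (zero variance).
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return float("nan")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")
    return (sxy * sxy) / (sxx * syy)


def pairs_in_ld(genotypes, threshold=0.5):
    """Return (locus_a, locus_b, r2) for every pair whose r2 exceeds the threshold.

    genotypes: dict mapping a locus name to a list of allele counts,
    one entry per sample, in the same sample order for every locus.
    """
    names = sorted(genotypes)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r2 = genotype_r2(genotypes[a], genotypes[b])
            if r2 > threshold:  # NaN never compares greater, so it is not flagged
                flagged.append((a, b, r2))
    return flagged
```

Calling `pairs_in_ld({"locusA": [...], "locusB": [...]})` on a genotype table returns an empty list when no pair exceeds the 0.5 cutoff, which is the outcome reported for the 23 LVs.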

Variants of high and moderate penetrance genes
Three of the 20 HVs were detected in the study population, as shown in Table 1. Notably, rs180177110 (PALB2) had a high carrier frequency, with 10 heterozygous carriers and 1 homozygous carrier. In addition, two heterozygous pathogenic mutations were detected, in rs80356978 (BRCA1) and rs80358600 (BRCA2) respectively. HVs in TP53, CHEK2 and ATM were not detected in this study.

Variants of low penetrance genes
Based on the genotype information of the study population, the 23 LVs were tested for HWE. Of these, rs10941679, rs7072776 and rs1436904 did not conform to HWE (BH-adjusted p value < 0.05), as shown in Table S4.
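The HWE filtering step can be illustrated with a short Python sketch. It is an approximation under stated assumptions rather than the procedure used in the study: it uses a chi-square goodness-of-fit test with 1 degree of freedom in place of the test Plink reports with --hardy, and the function names (`hwe_chisq_pvalue`, `bh_adjust`) are hypothetical. The Benjamini-Hochberg (BH) adjustment follows the standard step-up definition.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) goodness-of-fit p-value for Hardy-Weinberg equilibrium.

    Inputs are the observed counts of the three genotypes at one locus;
    at least one genotyped sample is assumed.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Under this sketch, a locus would be excluded when its BH-adjusted p value falls below 0.05, mirroring the criterion stated above.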

The allele frequencies of the 20 LVs meeting HWE were highly correlated with the East Asian population frequencies in the 1000 Genomes and gnomAD databases (r² > 0.98), but differed greatly from those of Western populations (r² < 0.6). The frequencies of rs10474352, rs3960987 and rs9485372 were all more than 0.2 higher than the mean of the Western populations, while the frequencies of rs2380205, rs889312 and rs7107217 were significantly lower than those of the Western populations (Table 2).

[...] dataset and subjects older than 40 years (1880) as the training set, 1000 iterations of lasso regression were conducted to select the major independent variables. The results showed that the number of LVs included in the model ranged from 8 to 19, and models with 19 LVs had the highest proportion (details in Fig. S5). Therefore, all 19 LVs were included to construct the logistic regression model, with the genotypes of the LVs as independent variables and disease status as the dependent variable. The threshold value (y = 0.215) was calculated at the maximum AUC. With this threshold, the classifier works best, with balanced sensitivity and specificity (0.962 and 0.968) (Fig. 3a). [...] Table S6.

A comprehensive screening strategy using HVs and LVs to predict PRS of BC

A comprehensive screening strategy was adopted to assign the PRS to subjects (Fig. 3c). The PRS of a tested individual was defined as 1 whenever an HV was detected, whether heterozygous or homozygous, and further genetic counseling would be provided for these individuals (13/3777). The PRS of individuals with the wild-type genotype at the HVs would then be assessed by the LV logistic model. Ones [...]

[...] technology would be more and more commonly used. Relevant disease data generated in hospitals can make breast cancer health data more complete and disease risk assessment more accurate. However, many problems remain to be solved before genetic testing can be used for breast cancer screening, and this requires joint efforts from various parties.

The main objective of our study is to evaluate the mutation profile of HVs and LVs in HCW and to provide a novel synthetic screening strategy to assess the risk in healthy individuals. It should be noted that this study has examined only PRS in the study population. In fact, many other factors,