Leveraging GWAS Data Derived From a Large Cooperative Group Trial to Identify a SNP Cluster Associated With the Risk of Taxane-induced Peripheral Neuropathy

Background
Chemotherapy-induced peripheral neuropathy (CIPN) is a common toxicity of taxanes for which there is no effective intervention. Genomic CIPN risk determination has yielded promising, but inconsistent, results. The present study assessed the utility of a collective SNP cluster, identified using a novel analytic approach, to describe taxane-associated CIPN risk.

Methods
We analyzed GWAS data derived from ECOG-5103, first identifying SNPs that were most strongly associated with CIPN using Fisher's ratio. We then rank-ordered those SNPs which discriminated CIPN-positive from CIPN-negative phenotypes based on their discriminatory power and developed the cluster of SNPs which provided the highest predictive accuracy using leave-one-out cross validation (LOOCV).

Results
Using GWAS aggregate data, we identified a 267 SNP cluster which was associated with a CIPN+ phenotype with an accuracy of 96.1%.

Conclusions
A 267 SNP cluster could accurately predict CIPN risk. Validation using an independent patient cohort should be performed.


We demonstrated that cancer regimen-related toxicities can be reliably predicted using a learned SNP analysis strategy that is based on the hypothesis that risk identification is enhanced using a multiple-SNP signature [16]. Here, we applied a novel analytic approach to the comprehensive GWAS outputs from ECOG 5103 (Doxorubicin hydrochloride, cyclophosphamide, and paclitaxel with or without bevacizumab in treating patients with lymph node-positive or high-risk, lymph node-negative breast cancer; NCT00433511), as reported by Schneider et al [17].

were from non-CIPN developers and the balance from patients who had significant symptoms.

Alleles were designated as either A or B, and a numerical parameterization of the nucleotides was defined.

The parameterization used for the nucleotides was as follows:

With this parameterization, a mean of 3.77 for a particular SNP in one of the classes means that the corresponding nucleotide is mostly G (between T and G).
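As a minimal sketch of how such a class mean can be read, the example below encodes nucleotide calls as numbers and averages them within one class. The mapping is an assumption: only T = 3 and G = 4 are implied by the statement that a mean of 3.77 lies between T and G, while A = 1 and C = 2 are chosen purely for illustration. The function name and the example calls are hypothetical.

```python
import numpy as np

# ASSUMED mapping: only T=3 and G=4 are implied by the text above;
# A=1 and C=2 are illustrative placeholders.
NUCLEOTIDE_CODE = {"A": 1, "C": 2, "T": 3, "G": 4}

def class_mean(calls):
    """Mean numeric code of the nucleotide calls at one SNP within one class."""
    codes = np.array([NUCLEOTIDE_CODE[c] for c in calls], dtype=float)
    return float(codes.mean())

# Hypothetical calls at one SNP in one class: 10 G and 3 T.
example_calls = ["G"] * 10 + ["T"] * 3
mean_code = class_mean(example_calls)  # (10*4 + 3*3) / 13
print(round(mean_code, 2))
```

Under this assumed mapping, a class in which most calls are G and a few are T gives a mean close to, but below, 4.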

Analysis of the discriminatory power of the SNPs

We used Fisher's ratio as a measure of the individual discriminatory power of each SNP, ordered the SNPs by decreasing Fisher's ratio, and then found the smallest aggregate SNP cluster via leave-one-out cross validation (LOOCV).
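The ranking step can be sketched as follows. This is an illustration under stated assumptions, not the exact implementation used in the analysis: it assumes the common two-class form of Fisher's ratio, (difference of class means)^2 / (sum of class variances), applied to numerically encoded SNPs, and adds a small `eps` to the denominator to avoid division by zero. The function name and toy data are hypothetical.

```python
import numpy as np

def fisher_ratio(X, y, eps=1e-12):
    """Per-SNP Fisher's ratio between class 1 (y == 1) and class 2 (y == 0).

    X: (n_samples, n_snps) array of numerically encoded SNPs.
    Assumed form: (mean1 - mean2)^2 / (var1 + var2 + eps).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    a, b = X[y == 1], X[y == 0]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0) + eps  # eps guards against zero variance
    return num / den

# Toy data: 3 CIPN (y=1) and 3 non-CIPN (y=0) samples, 3 SNPs.
X = np.array([[4, 4, 1],
              [4, 4, 2],
              [3, 4, 1],
              [3, 3, 1],
              [3, 3, 2],
              [4, 4, 1]])
y = np.array([1, 1, 1, 0, 0, 0])

fr = fisher_ratio(X, y)
order = np.argsort(-fr, kind="stable")  # SNP indices by decreasing Fisher's ratio
print(fr, order)
```

In the toy data the second SNP separates the classes best, the first only weakly, and the third not at all, so the ranking places them in that order.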

We defined the Fisher's ratio for a SNP j in classes 1 and 2 (CIPN and no CIPN), c1, c2
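For reference, a commonly used two-class form of Fisher's ratio is given below. It is stated here as an assumption and may differ from the exact expression used in this analysis. The symbols $\mu_{jk}$ and $\sigma_{jk}^2$ are introduced for illustration and denote the mean and variance of the parameterized SNP $j$ in class $c_k$.

```latex
\mathrm{FR}_j = \frac{\left(\mu_{j1} - \mu_{j2}\right)^2}{\sigma_{j1}^2 + \sigma_{j2}^2}
```

Under this form, a SNP has a high ratio when its class means are far apart relative to the spread within each class.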

Once we identified those SNPs which differentiated CIPN-positive from CIPN-negative patients, we ranked them in decreasing order based on their discriminatory power. We hypothesized that, rather than individual SNPs being most able to differentiate highest risk, the discriminatory power could be optimized by considering a cluster of SNPs. Consequently, we identified the smallest collection of SNPs with the highest prognostic accuracy using an algorithm based on recursive elimination of lower discriminatory SNPs. In our analysis, we differentiated those SNPs which were highly discriminatory from those which did not significantly contribute to informing differential

We included the uncertainty analysis because the number of monitored SNPs was much larger than the number of samples, which gives the corresponding phenotype prediction problem an associated uncertainty space. We dealt with this by imposing a minimum discriminatory power on the SNPs.
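The selection of the smallest, most accurate SNP cluster can be sketched as below, under several assumptions. The classifier is not specified in the text, so a nearest-centroid rule is used as a stand-in. SNPs below a minimum Fisher's ratio (`min_fr`) are discarded first, and the lowest-ranked SNP is then eliminated one at a time while LOOCV accuracy is tracked. All function names, parameters, and the toy data are hypothetical.

```python
import numpy as np

def loocv_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (assumed stand-in).

    Assumes every class keeps at least one sample when one sample is left out.
    """
    n = len(y)
    hits = 0
    for i in range(n):
        keep = np.arange(n) != i
        Xt, yt = X[keep], y[keep]
        c1 = Xt[yt == 1].mean(axis=0)  # centroid of class 1 (CIPN)
        c0 = Xt[yt == 0].mean(axis=0)  # centroid of class 0 (no CIPN)
        pred = 1 if np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0) else 0
        hits += int(pred == y[i])
    return hits / n

def smallest_best_cluster(X, y, fr, min_fr=0.0):
    """Return (SNP indices, LOOCV accuracy) of the smallest top-ranked cluster
    that reaches the highest LOOCV accuracy."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    fr = np.asarray(fr, dtype=float)
    ranked = np.argsort(-fr, kind="stable")
    ranked = ranked[fr[ranked] > min_fr]  # minimum discriminatory power
    best_k, best_acc = 0, -1.0
    # Drop the least discriminatory SNP at each step.
    for k in range(len(ranked), 0, -1):
        acc = loocv_accuracy(X[:, ranked[:k]], y)
        if acc >= best_acc:  # '>=' while shrinking keeps the smallest cluster on ties
            best_k, best_acc = k, acc
    return ranked[:best_k], best_acc

# Toy data: SNP 0 separates the classes, SNP 1 is noise.
X = np.array([[4, 1], [4, 2], [4, 1], [3, 1], [3, 2], [3, 1]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])
cluster, acc = smallest_best_cluster(X, y, fr=[5.0, 0.1])
print(cluster, acc)
```

Evaluating nested prefixes of the ranked list is one simple way to realize recursive elimination; other elimination schemes would also fit the description.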

Patient characteristics

Patients were part of the parent protocol (ECOG 5103) described by Schneider et al [17] and had consented to DNA analysis. ECOG 5103 was a phase III adjuvant breast cancer trial that randomized patients with node-positive or high-risk node-negative breast cancer to intravenous doxorubicin and cyclophosphamide every 2 or 3 weeks (at the discretion of the treating physician) for four cycles followed by 12 weeks of weekly paclitaxel alone (Arm A) or to the same chemotherapy with either concurrent bevacizumab or concurrent plus sequential bevacizumab.

We focused our analysis on patients with available germline DNA who never developed CIPN (score of 0) vs. those who developed moderate to severe CIPN (scores of 3 or 4). We reasoned that focusing on the two most extreme phenotypes, and excluding those patients who had mild CIPN, increased the likelihood of more clearly differentiating genotype/phenotype differences.
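The extreme-phenotype selection can be illustrated with a short sketch. It assumes CIPN scores are available as integers from 0 to 4 per patient; the function name and patient identifiers are hypothetical.

```python
def select_extreme_phenotypes(scores):
    """Map patient id -> binary label, keeping only the extreme phenotypes.

    Score 0 -> label 0 (never developed CIPN); score 3 or 4 -> label 1.
    Intermediate scores (1 and 2) are excluded.
    """
    labels = {}
    for patient_id, score in scores.items():
        if score == 0:
            labels[patient_id] = 0
        elif score >= 3:
            labels[patient_id] = 1
    return labels

# Hypothetical scores for five patients.
example_scores = {"p1": 0, "p2": 2, "p3": 3, "p4": 4, "p5": 1}
labels = select_extreme_phenotypes(example_scores)
print(labels)
```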

Dataset analysis

Using the hold-out experiment (HOE) dataset, 7.15% of the total number of alleles analyzed (104,063) had a discriminatory power > 0.5 as defined by their Fisher's ratio. The median cumulative distribution function (cdf) for the set of the most discriminatory SNPs (n=110) was 1.97 with a low interquartile range (0.12) (Figure S1).

Using LOOCV analysis, we achieved an accuracy of 100% in differentiating patients with CIPN from those without CIPN with an aggregate of 110 SNPs which had a Fisher's ratio (FR) > 3 (Table S1). Whereas the single most discriminatory SNP (rs969768_A; C in controls and T in patients) only provided a classification accuracy of 58%, adding the second most discriminatory SNP

We then performed hold-out sampling (75% for training and 25% for validation), selecting those genes with an FR > 2.2 in the training set and a LOOCV validation accuracy higher than 95%. Using this set of genes. Posterior frequency analysis of this SNP set (Table S3)

We subsequently performed hold-out sampling, selecting those genes with an FR > 2.2 in the training set and a LOOCV validation accuracy higher than 95%. Using this set of genes, we performed posterior frequency analysis.
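One way to read this step is sketched below, with the caveat that the details are assumptions rather than the procedure actually used. Repeated random 75%/25% splits are drawn, SNPs with FR > 2.2 in the training part are kept, and each split is accepted only if a nearest-centroid classifier (a stand-in) exceeds 95% accuracy on the validation part. For simplicity the sketch uses the hold-out validation accuracy rather than LOOCV. The posterior frequency of a SNP is then taken to be the fraction of accepted splits in which it was selected. All names, parameters, and toy data are hypothetical.

```python
import numpy as np
from collections import Counter

def fisher_ratio(X, y, eps=1e-12):
    """Assumed form: (mean1 - mean2)^2 / (var1 + var2 + eps), per SNP."""
    a, b = X[y == 1], X[y == 0]
    return (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0) + eps)

def holdout_snp_frequency(X, y, n_runs=100, train_frac=0.75,
                          min_fr=2.2, min_acc=0.95, seed=0):
    """Fraction of accepted hold-out splits in which each SNP was selected."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    counts, accepted = Counter(), 0
    for _ in range(n_runs):
        perm = rng.permutation(n)
        tr, va = perm[:n_train], perm[n_train:]
        if len(np.unique(y[tr])) < 2 or len(va) == 0:
            continue  # need both classes for training and a non-empty validation set
        Xtr, ytr = X[tr], y[tr]
        keep = np.flatnonzero(fisher_ratio(Xtr, ytr) > min_fr)
        if keep.size == 0:
            continue
        c1 = Xtr[ytr == 1][:, keep].mean(axis=0)
        c0 = Xtr[ytr == 0][:, keep].mean(axis=0)
        d1 = np.linalg.norm(X[va][:, keep] - c1, axis=1)
        d0 = np.linalg.norm(X[va][:, keep] - c0, axis=1)
        acc = float(((d1 < d0).astype(int) == y[va]).mean())
        if acc > min_acc:
            counts.update(keep.tolist())
            accepted += 1
    return {j: c / accepted for j, c in counts.items()} if accepted else {}

# Toy data: SNP 0 separates the classes perfectly, SNP 1 is constant.
X = np.column_stack([np.r_[np.full(10, 4.0), np.full(10, 3.0)], np.full(20, 2.0)])
y = np.r_[np.ones(10, dtype=int), np.zeros(10, dtype=int)]
freq = holdout_snp_frequency(X, y, n_runs=20)
print(freq)
```

SNPs that are selected in nearly every accepted split are the most stable members of the signature under this reading.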

Genomics' contribution to CIPN risk has been studied using candidate gene and GWAS

We reasoned that conglomerating the two databases might enhance our prediction and functionality objectives. The merged dataset also provided the advantage that the number of non-CIPN controls was increased (n=55), so we were able to create a 267 SNP cluster which predicted CIPN risk with an accuracy of 96.1% (p<0.0093).

Our findings demonstrate the potential value of an undirected analytical approach to determine CIPN risk. An association between SNPs and CIPN is not new. Nor is the use of SNPs to assess the risk of other cancer regimen-related toxicities [23]. Typically, associations between SNPs and phenotypes have been investigated using GWAS or candidate gene approaches with the goal of identifying the most predictive SNP or gene. Replication of findings has been problematic, and critical reviews of these approaches have been reported [15,24]. Our process was driven by the concept that risk is the consequence of multiple genes simultaneously interacting to impact phenotype. In this case, genes were defined by attribution from SNPs, an approach that excludes non-SNP-related genes. While these may not be uniquely associated with risk, the assumption that they did not contribute globally undermines a systems-based hypothesis. While SNP arrays provide huge potential value, a whole genome array could be more comprehensive. The results of the network generation exercise included in this paper are illustrative of that potential. Likewise, to assume that risk is solely attributable to genomic influencers is naive. More likely, it is the sum of metabolic, epigenetic, proteomic, and microbiome elements, among others. Analyses of all these inputs will require the application of newer methods such as multiplex networks [25,26].

External validation of the SNP set described here will be required to confirm its clinical utility. The mechanistic information provided by our analyses is reassuring in that it confirms the value of genomic-based assessments to describe potential CIPN pathogenesis.

We demonstrated that a SNP analysis strategy can reliably predict cancer regimen-related toxicities, where risk identification is improved by using a multiple-SNP signature. We applied a novel analytic approach to the comprehensive GWAS outputs from ECOG 5103 (Doxorubicin hydrochloride, cyclophosphamide, and paclitaxel with or without bevacizumab in treating patients with lymph node-positive or high-risk, lymph node-negative breast cancer; NCT00433511) to identify a comprehensive, hierarchical single nucleotide polymorphism (SNP) cluster that robustly predicts CIPN risk. Using the GWAS aggregate data, we identified the SNP cluster which was associated with a CIPN+ phenotype with an accuracy of 96.1%. Thus, the present work contributes to accurately determining CIPN risk; however, validation studies using an independent patient cohort should be performed.

Table S2: Nucleotide substitutions in the five most discriminating SNPs.

Merged data set. Fisher's ratio of the 267 discriminatory SNPs as a function of -log10(p-value).
