Differences in sensitivity to new therapies between primary and metastatic breast cancer: A need to stratify the tumor response?

Abstract Objective We compared therapeutic response of Varlitinib + Capecitabine (VC) versus Lapatinib + Capecitabine (LC) in patients with human epidermal growth factor receptor 2‐positive metastatic breast cancer after trastuzumab therapy by assessing changes in target lesion (TL) diameter and volume per location. Methods We retrospectively analyzed the CT data of the ASLAN001‐003 study (NCT02338245). We analyzed TL size and number at each location, focusing on therapeutic response from baseline to Week 12. We used TL diameter and volume to conduct an inter‐arm comparison of the response in three ways: according to RECIST 1.1, stratified per TL location, and considering TLs independently. Multiple pairwise intra‐arm comparisons of therapeutic responses were performed. Considering TLs independently, weighted models were designed by adding weighted mean TL responses grouped by location. Results We evaluated 42 patients (88 TL) and 35 patients (74 TL), respectively, at baseline and Week 12. We found reductions in breast TL burden in the VC arm compared to the LC arm (p = 0.002 (diameter), p < 0.001 (volume)). Responses and TL sizes at baseline were not correlated. Explained variabilities of volume change per TL location, patient and patient:TL interaction were 36%, 10% and 4% (VC), and 13%, 1% and 23% (LC). A test of inter‐arm difference of responses yielded p = 0.07 (diameter), and p < 0.001 (volume). Conclusions The therapeutic responses differed across tumor locations, and the magnitude of these differences was drug‐dependent. Stratified analysis of the response by tumor location improved drug comparisons and is a powerful tool to understand TL heterogeneity.


| BACKGROUND
The Response Evaluation Criteria in Solid Tumors (RECIST) remain the most widely used criteria for assessing drug efficacy using imaging, 1 primarily due to their simplicity and the lack of better established criteria. 2 Heterogeneous treatment responses observed in radiology following cytotoxic chemotherapy have already been reported. 3 More recently, some groups 4,5 have raised concerns that RECIST may be suboptimal for assessing treatment response to new generations of therapeutics such as tyrosine kinase inhibitors (TKIs), whose mechanisms of action (MoAs) differ from that of chemotherapy.
Since 2010, the Food and Drug Administration (FDA) has been considering new types of anti-cancer therapies 6,7 for which the pattern of response was neither observed nor considered when RECIST were developed. Since then, radiological assessment has not kept pace with the emergence of these new treatments.
Molecular intra-tumor heterogeneity is often encountered with the use of TKIs, which rely on a novel MoA 8,9 and categorizing a disease as stable is often evidence of drug effectiveness. 10,11 Consequently, with new generations of anti-cancer treatments, patterns of radiology response may vary with tumor locations 12,13 suggesting that sometimes, use of a stratified analysis would be more appropriate than a global one. Continued use of chemotherapy-based response criteria for assessing clinical efficacy of new therapies is therefore suboptimal. 14 Similar limitations apply when assessing clinical efficacy of cocktails of drugs or response in basket trials.
In recent years, there have been rapid developments in the field of quantitative imaging in radiology, with the release of guidelines for qualification of quantitative imaging biomarkers (QIBs) 15 and recommendations for their implementation. 16 Tumor volume on computed tomography (CT) has recently been presented as a valuable QIB that passed all qualification steps. 17 Coupling tumor volume as an advanced QIB with a stratified analysis of the therapeutic response per disease location may offer useful insight into drug efficacy. In our study, we compared therapeutic response of Varlitinib + Capecitabine (VC) versus Lapatinib + Capecitabine (LC) in patients with human epidermal growth factor receptor 2 (HER2)-positive metastatic breast cancer (MBC) after trastuzumab therapy, using changes in tumor diameter and volume per tumor location.

| MATERIALS AND METHODS
Our study was exempted by the Institutional Review Board (IRB) due to its retrospective nature. Written informed consent was not required as patient management was not impacted.

| Data collection
We retrospectively analyzed CT scan measurements and annotations of 42 patients from the phase 2A multicenter ASLAN001-003 clinical trial (NCT02338245), which compared the therapeutic response of VC versus LC in patients with HER2-positive MBC after trastuzumab therapy. In the ASLAN001-003 trial, RECIST 1.1 were applied and, additionally, changes in the sum of target lesion (TL) volumes were monitored. The ASLAN001-003 trial used the LMS platform (Median Technologies, France), which automatically recorded tumor type, location, longest axial diameter (LAD), short axial diameter (SAD) and manually delineated volume.
Demographics and disease characteristics of the ASLAN001-003 trial are summarized in Table 1. The key inclusion criteria were: 1. Documented histological confirmation of breast cancer with HER2 overexpression or gene amplification (immunohistochemistry 3+ or 2+ with fluorescent/chromogenic/silver in situ hybridization+) prior to study entry. 2. HER2-positive MBC that had progressed on prior first-line treatment with trastuzumab in the metastatic setting or relapsed within 1 year of treatment with trastuzumab in the adjuvant setting.
The key exclusion criteria were: 1. Patients who have received more than 2 lines of any therapies in metastatic stage, radiation treatment or major surgical procedures within 21 days prior to study entry. 2. Patients with any history of other malignancy unless in remission for more than 1 year. 3. Patients with an uncontrolled intercurrent illness.

| Study workflow and analysis
For our study, measurements and annotations were automatically retrieved from the original trial database. Measurements (LAD, SAD and volume) and annotations recorded at baseline and Week 12 were quality controlled and analyzed by a medical imaging expert with more than 15 years of experience. Lymph node (LN) measurements (SAD) were specifically checked for compliance with RECIST recommendations.
Our study plan was as follows:

| Population statistics
We compared the tumor size and the number of tumors at each disease location, and for each treatment arm.

| Inter-arm comparison of the responses (Figure 1)
We analyzed the mean changes in tumor burden (as %) considering: 1. The definition of tumor burden given by RECIST 1.1, where, for each patient, at each time point, the size (LAD/volume) of all TLs (up to 5 in number, independent of location, but no more than 2 per location) was summed, and these sums were monitored from baseline to Week 12; 2. Stratified tumor burden, where, for each patient, at each time point, the sizes (LAD/volume) of TLs from the same location were summed, and these sums were monitored from baseline to Week 12; 3. All tumors considered independently of patients and monitored from baseline to Week 12, with the average change in tumor size (LAD/volume) computed per tumor location.
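The three definitions above can be illustrated with a minimal Python sketch. It is not the authors' implementation (the study used R); the record layout, the function names and all lesion sizes are hypothetical, and the RECIST 1.1 caps on the number of TLs are assumed to have been applied upstream.

```python
from collections import defaultdict

# Hypothetical records: (patient, location, size at baseline, size at Week 12).
# Sizes can be LAD or volume; all values are invented for illustration.
LESIONS = [
    ("P1", "breast", 40.0, 20.0),
    ("P1", "liver", 20.0, 22.0),
    ("P2", "liver", 10.0, 5.0),
    ("P2", "lung", 30.0, 30.0),
]


def pct_change(baseline, week12):
    """Relative change from baseline, in percent."""
    return 100.0 * (week12 - baseline) / baseline


def burden_change(lesions, key):
    """Sum lesion sizes within each group given by key(lesion),
    then return the percent change of each summed burden."""
    sums = defaultdict(lambda: [0.0, 0.0])
    for lesion in lesions:
        group = key(lesion)
        sums[group][0] += lesion[2]  # baseline sum
        sums[group][1] += lesion[3]  # Week 12 sum
    return {g: pct_change(b, w) for g, (b, w) in sums.items()}


def mean_change_by_location(lesions):
    """Treat each lesion independently and average its percent change per location."""
    changes = defaultdict(list)
    for _, location, baseline, week12 in lesions:
        changes[location].append(pct_change(baseline, week12))
    return {loc: sum(v) / len(v) for loc, v in changes.items()}


# 1. RECIST-like burden: one sum per patient (TL caps assumed applied upstream).
per_patient = burden_change(LESIONS, key=lambda l: l[0])
# 2. Stratified burden: one sum per patient and location.
per_patient_location = burden_change(LESIONS, key=lambda l: (l[0], l[1]))
# 3. Independent tumors: mean percent change per location.
per_location = mean_change_by_location(LESIONS)
```

With these invented values, patient P1 shows a 30% decrease overall, which hides a 50% decrease in the breast lesion and a 10% increase in the liver lesion; this is the kind of location-specific information the stratified analysis is meant to expose.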
We tested if a significant relationship existed between tumor size at baseline and change from baseline at Week 12.

| Intra-arm comparison of the responses
In each arm, independently of patients, we grouped tumors by location and then compared the average response (as %) between these groups. We performed multiple pairwise comparisons of the responses between the various tumor locations (liver-breast, lung-breast, lymph node-breast, lung-liver, lymph node-liver and lymph node-lung). Using either LAD (SAD in the case of LN) or volume, we tested for significant differences at Week 12 in each treatment arm. Finally, we computed the mean tumor changes by stratifying patients' responses and performed an analysis of variance (ANOVA).

| Modeling of the stratified response
Considering tumors independently of patients, we designed a model by adding mean tumor responses, grouped by location and weighted by the proportion of tumors at these locations. The weighted model summarized the response to treatment. The model was computed for each arm to allow for more accurate comparisons of inter-arm responses. Treatment responses were summarized as follows:
$$\text{Response} = \sum_{i} \frac{Nb_i}{\sum_{j} Nb_j}\,\Delta TL_i$$
With: $Nb_i$: Number of TLs at disease location $i$ ($i$ = breast, lung, liver or nodal tumors). $\Delta TL_i$: Mean change of TL size (LAD or volume) at disease location $i$.
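As an illustration of the weighted model described above, the following Python sketch adds the mean responses per location, each weighted by the proportion of TLs at that location. The function name and the example counts and mean changes are hypothetical and are not taken from the study tables.

```python
def weighted_response(counts, mean_changes):
    """Weighted sum of mean TL changes (in %), where each location is
    weighted by its share of the total number of TLs."""
    total = sum(counts.values())
    return sum(counts[loc] / total * mean_changes[loc] for loc in counts)


# Hypothetical arm: number of TLs and mean change (%) per location.
example_counts = {"breast": 2, "lung": 3, "liver": 3, "lymph node": 2}
example_changes = {"breast": -40.0, "lung": -10.0, "liver": 0.0, "lymph node": -20.0}

example_response = weighted_response(example_counts, example_changes)
```

For these invented inputs the weighted response is (2 × -40 + 3 × -10 + 3 × 0 + 2 × -20) / 10 = -15%, up to floating-point rounding.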
FIGURE 1 Inter-arm comparisons of tumor responses. Three different analyses were performed. (1) Patient tumor burden was monitored according to RECIST 1.1. At each time point, for a given patient, the size of all target tumors was summed, and these sums were monitored over time. (2) Patient tumor burden was stratified by tumor location. At each time point, for a given patient, the size of target tumors from the same location was summed, and these sums were monitored over time. (3) All tumors were considered as independent from patients and monitored independently over time; the mean tumor change was computed per location.

| Sensitivity analysis
We tested the robustness of our results by slightly changing the study input as follows 18 : 1. Excluding patients exhibiting extreme treatment response at Week 12, then re-testing our conclusions with and without outliers; 2. Adjusting for the imbalance in the number of independent tumors and the number of patients after stratifying per tumor location at Week 12.

| Statistics
The multiple comparisons of tumor sizes per tumor location were tested using Tukey Honest Significant Differences. Comparisons of tumor proportions at each location relied on a two-sided Chi-square test. We computed waterfall plots of patients' responses, both summing all tumors for each patient and stratifying each patient's response per tumor location, and used the Wilcoxon rank test to test the equivalence of inter-arm and stratified intra-arm responses.
Tests of multiple comparisons of tumor response per tumor location were performed applying Tukey Honest Significant Differences.
We used a two-sided Chi-square test for evaluating inter-arm difference of response derived from the weighted models.
Eta-squared derived from the ANOVA reported the proportion of explained variabilities.
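To make the eta-squared statistic concrete, the sketch below computes it for a simplified one-way layout (a single location factor) as the between-group sum of squares divided by the total sum of squares. This is a simplification: the study's ANOVA also included patient and patient:tumor interaction terms, and the function name and response values here are hypothetical.

```python
def eta_squared(groups):
    """One-way eta-squared: share of total variability explained by group
    membership (between-group sum of squares / total sum of squares)."""
    values = [v for group in groups.values() for v in group]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups.values()
    )
    return ss_between / ss_total


# Hypothetical percent changes of independent tumors, grouped by location.
example_groups = {
    "breast": [-50.0, -30.0],
    "liver": [0.0, 10.0],
    "lung": [-10.0, -10.0],
}

example_eta2 = eta_squared(example_groups)
```

Here the between-location sum of squares is 2100 and the total is 2350, so location explains about 89% of the variability in this invented dataset.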
As a prerequisite for performing ANOVA, data were tested for homoscedasticity using Levene's test and for normality using the Jarque-Bera test. Both tests are available from the "lawstat" R package.
Data were considered outliers when they fell outside 1.5 times the interquartile range. 19 R 3.5.1 (CRAN) was used for statistics; p < 0.05 was considered to indicate a significant difference.
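A minimal Python sketch of an interquartile-range outlier rule follows. It assumes the common Tukey convention (values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are flagged), which the text does not spell out; the quartile method, the function name and the response values are likewise assumptions made for illustration.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return the values lying outside the fences
    [Q1 - k * IQR, Q3 + k * IQR], in their original order."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical percent changes at Week 12.
example_changes = [-30, -60, -20, 90, -35, -15, -25, -10]

example_outliers = find_outliers(example_changes)
```

For these invented values the quartiles are -31.25 and -13.75, giving fences at -57.5 and 12.5, so -60 and 90 are flagged.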

| RESULTS

| Population statistics
At baseline, 42 patients displayed at least 1 TL. A set of 88 TLs was distributed per disease location as follows: lung (31%, n = 27), breast (26%, n = 23), liver (23%, n = 20), lymph nodes (17%, n = 15) and miscellaneous (3%, n = 3). Miscellaneous locations (skin and mediastinal lesions) were excluded as they were under-represented. Therefore, 85 TLs were classified into 4 major groups by location (Table 2). Of note, 22 patients had no visible primary breast tumors on CT, either due to previous trastuzumab treatment or because their tumors were visible only on mammography.
At Week 12, 35 patients remained in the study (14 and 21 patients in the VC and LC arms, respectively) (Table 3) and 74 tumors were measured.
Distributions of tumor size at baseline per tumor location in both treatment arms are displayed in Figure 2 for both QIBs.
At baseline, there was no significant difference between the treatment arms in the proportion of tumors (p = 0.27), though there was a greater proportion of lung tumors in the LC arm versus the VC arm (p = 0.07) (Table 2). When considering either QIB, the mean size of breast tumors was significantly larger than that of tumors at the other locations (p < 0.002).

| Inter-arm comparison of the responses
Tumor burden changes, in both treatment arms, are presented in Table 3. A significant inter-arm difference was found only for changes in breast tumor burden (p = 0.002 for LAD, p < 0.001 for volume, in favor of the VC arm). No significant inter-arm differences were noted for other TLs. Table 4 summarizes the mean tumor response with tumors considered independently of patients and grouped by location. Pooling all tumors regardless of disease location and patient, a test of inter-arm difference of the response yielded p = 0.02 for tumor LAD and p = 0.015 for tumor volume. There was no significant relationship between the response and baseline tumor size by LAD or volume.

TABLE 2 Proportion of tumors at each disease location at baseline

| Intra-arm comparison of the responses
Tables 5 and 6 summarize, for tumor diameter and volume, respectively, the difference of responses between the different pairs of tumor locations. For changes in tumor diameter (Table 5), explained variabilities per tumor location, patient and patient:tumor interaction were 22%, 5% and 16%, respectively, in the VC arm, and 2%, 0.5% and 30%, respectively, in the LC arm.
For changes in tumor volume (Table 6), explained variabilities per tumor location, patient and patient:tumor interaction were 36%, 10% and 4%, respectively, in the VC arm, and 13%, 1% and 23%, respectively, in the LC arm.

| Model design
We applied our model using the distribution of tumor location (Table 2) and the average response by tumor location (independent tumors) in Tables 5 and 6. Thus, we modeled the response to treatment for the VC and LC arms, respectively, in Equations 1 and 2 (using LAD) and, respectively, in Equations 3 and 4 (using volume). A test for a significant difference in inter-arm responses yielded p = 0.07 (using LAD) and p < 0.001 (using volume), both in favor of the VC arm.

| Sensitivity analysis
The inter-arm comparison of the stratified responses yielded p = 0.015 (using LAD) and p = 0.03 (using volume) after removing outliers at Week 12 (n = 6 for tumor diameter, n = 3 for tumor volume) (Table S1). When considering each tumor independently from patients, inter-arm comparison yielded p < 0.007 (for tumor diameter) and p = 0.016 (for tumor volume) after removing outliers (n = 7 for tumor diameter, n = 4 for tumor volume). Intra-arm comparisons of the stratified responses by disease location are summarized in Tables S2 and S3.

| DISCUSSION
Our study showed that breast tumors were, on average, significantly larger than other tumors (p < 0.001). There was no significant inter-arm difference in the proportion of tumors at different disease locations, though there was a greater proportion of lung tumors in the LC arm (p = 0.07).
Inter-arm tests showed a trend toward superiority of the VC arm per patient, and confirmed superiority of the VC arm when tumors were considered independently. Multiple intra-arm comparisons showed that tumor volume is more sensitive than LAD for detection of differential responses at different disease locations. In the VC arm, we found a significant differential response between breast and liver tumors using volume (p = 0.007) and a trend toward superiority using volume in differential response for lymph node versus liver tumors (p = 0.057). No significant differences were measured in the LC arm using LAD or volume. Results of the intra-arm multiple comparisons confirmed the stratified inter-arm results, showing a more favorable response in the VC arm compared to the LC arm, for both QIBs (p = 0.07 for LAD, p < 0.001 for volume). These results were also confirmed by the inter-arm comparisons of the weighted models and the ANOVA, indicating a greater variability per tumor location in the VC arm. The results of our study are strengthened by a sensitivity analysis that reported no significant impact of outliers upon our conclusions, and no change in the stratified responses of VC over LC, after adjusting the proportion of TLs at each disease location. Our stratified analysis showed the effectiveness of the drug at specific disease locations. This insight would help to improve drug indications and to design more effective drug combinations.

FIGURE 4 Waterfall plot showing stratified changes from baseline to Week 12 in tumor volume (on right) and diameter (on left) for breast, lung, liver, and lymph node tumors (from top to bottom). Green bars represent responses in the Varlitinib + Capecitabine arm; red bars represent responses in the Lapatinib + Capecitabine arm. There was a significant inter-arm difference only for changes in breast tumor burden (p = 0.002 and p < 0.001, respectively, for tumor diameter and volume as quantitative imaging biomarkers).
Researchers have reported differential responses according to disease location. Menzies et al 20 found significantly different Time To Best Response for subcutaneous soft tissue and lung metastases compared to lymph node and liver metastases, and Crusz et al 21 found that 55.6% of patients showed a heterogeneous response. These studies drew contradictory conclusions regarding a relationship between tumor size at baseline and response. Our study did not show a relationship between tumor size at baseline and response. Tumors usually have complex shapes and are heterogeneous; volumetric measurements have long shown better precision and accuracy than linear measurements, notably in patients with advanced lung cancer. 17,21 However, very few studies have shown that changes in tumor volume correlate better with the disease or can serve as an alternative endpoint in clinical trials. In our study, we found that when tumor volumes were used, p values were lower when testing inter-arm response according to a weighted model of stratified response (p = 0.07 for tumor diameter and p < 0.001 for tumor volume) or when tumors were all considered as independent (p = 0.02 for tumor diameter and p = 0.015 for tumor volume). We also found that tumor volume was more discriminant than diameter when testing differential response (e.g. p = 0.007 for liver-breast in the VC arm). Similar discrimination was not observed with RECIST (p = 0.13 for volume, p = 0.086 for diameter). This can be explained by the design of RECIST, which recommend summing tumors from different locations, thereby losing the benefit of the volume. We can also consider that the stratification of the imaging therapeutic response represents a means of investigation per se. 22 It is known that spatial and temporal tumor heterogeneity can be due to the mutational status of tissues, their cellular morphology, metabolism, and proliferative and metastatic potential. 23 Therapeutic response stratification can therefore be seen as indirect, noninvasive feedback on tumor heterogeneity. More specifically, the temporal monitoring of clinical data coupled with stratified responses could inform about different resistance mechanisms and their onset. 24 Enriching biological data with stratified imaging responses would help to understand the MoA, identify drug-sensitive or drug-resistant cells and investigate new targeted therapy approaches.

TABLE 4 Mean proportional change (%) in diameter and volume of tumors considered independent of patients and grouped by disease locations
Our study had some limitations, the first being that we analyzed tumor response over a short period of time. To match the ASLAN001-003 trial setting, we restricted our analysis to Week 12. We may hypothesize that the stratified response at each tumor location can vary over time. For instance, at treatment onset, a drug could exhibit superior efficacy on primary breast tumors compared to metastases, an advantage that could fade, disappear, or even reverse over time. Because of the limited dataset, we could not extend our study over multiple time points. A second limitation of our study was that we did not consider measurement reliability. In our dataset, tumors had different size distributions according to location, and the proportions of tumors at various locations differed slightly between the arms. Several groups have investigated measurement reliability according to tumor size and location. 25,26 A more sophisticated model of stratified response would include the reliability of measurements as a parameter. A third limitation of our study was that it was not possible to consider all RECIST aspects, such as the unequivocal appearance of new lesions and progression of non-target lesions (nTLs). In our study, at Week 12, a single unequivocal new lesion was detected, 2 nTLs progressed and 9 decreased. The small sample size precluded any significant conclusions.
A fourth limitation is inherent to the ASLAN001-003 trial, which mainly included Asian patients, whereas Wagner et al. 27 noted that responses may differ between Asian and Caucasian ethnicities. Therefore, the generalizability of our observations needs to be confirmed in non-Asian cohorts.

| CONCLUSION
We found that drugs have different efficacy across tumor locations. In the era of new therapies, stratified analysis of response will provide better assessments and drug comparisons, and be a powerful tool contributing to improved understanding of the MoA behind tumor heterogeneity.

AUTHOR CONTRIBUTIONS
Hubert Beaumont: Conceptualization, formal analysis, original draft, writing, review and editing. Nathalie Faye: Conceptualization, Data curation, review and editing. Antoine Iannessi: formal analysis, original draft, review and editing. Emmanuel Chamorey: methodology, formal analysis, review. Catherine Klifa: Conceptualization, methodology, project administration, review and editing. Chih-Yi Hsieh: Data curation, review and editing. We certify that all co-authors contributed equally and significantly to the study and to the design of the manuscript.