This study intended to evaluate the presence of bias in SAS and RAS [intra-observer], along with differences in inter-observer rating. We found good agreement between SAS and RAS, with SAS showing slightly better agreement with BCCT.core, suggesting that subjective rating comes closer to objective rating [BCCT.core] when breast photographs are evaluated serially. This parameter has not been evaluated in earlier studies of a similar nature and hence merits detailed analysis. The study also establishes the influence of the experience of the rater and of the composition of the panel on cosmetic assessment.
The cosmetic assessment of the treated breast is often done subjectively during routine follow-up in the clinic, as BCCT.core analysis is not feasible in high-volume centres: it requires acquiring a breast photograph and processing it in dedicated software, a time-consuming process that limits its utility. On the other hand, BCCT.core has become very popular for objective assessment in the research setting. Heil et al found very good agreement between different users of the semi-automated BCCT.core software. [10] Stored photographs can hence be scored objectively, as well as subjectively by an individual physician or by a panel of experts retrospectively. Panel assessment is a standard approach adopted in large breast cancer trials. [11]
In our study, the mean scores for SAS and RAS were in good agreement [kappa 0.659], meaning that either approach can be adopted when a panel has to evaluate several images. Both SAS and RAS showed fair agreement with BCCT.core, though the value was slightly higher for SAS [0.343] than for RAS [0.301]. This indicates that providing serial images from earlier visits is likely to improve the rating of the current image. However, this finding needs to be validated in a larger cohort before implementation.
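For readers less familiar with the statistic, the agreement measure used throughout this discussion is Cohen's kappa; a brief restatement follows, with the descriptive bands [slight, fair, moderate, good/substantial] assumed here to follow the conventional Landis and Koch benchmarks, as the exact scale is not restated in this section.

\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\]

where \(p_o\) is the observed proportion of agreement between two raters and \(p_e\) is the agreement expected by chance alone. Under the Landis and Koch convention, \(\kappa \le 0.20\) indicates slight agreement, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1.00 almost perfect agreement.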
No prior studies have evaluated the significance of providing serial images [SAS]. However, in a study conducted by Heil et al, images taken immediately after surgery and one year after surgery were each evaluated twice by a panel, one hour apart. They found significant intra-rater agreement between the two ratings: kappa was 0.41 shortly after surgery and 0.32 one year after surgery, demonstrating either a worsening of agreement with time after treatment or the lack of serial images, as the raters had only one image for comparison. [10] In our study, the intra-rater agreement for SAS between the baseline and first follow-up images was lower [0.342] than for RAS [0.442].
The time elapsed between treatment and rating is also significant. This may be due to the natural healing of the tissue, or to a timeline bias in which freshly treated tissue appears better or worse than it actually is. In our study, both SAS and RAS showed better agreement with BCCT.core at the first follow-up than at baseline [Supplementary Table 2], contrary to Heil et al. [10]
Preuss et al found fair agreement [kappa 0.37] between median subjective [panel] scores and BCCT.core. Agreement was higher with LD flap [kappa 0.51] and delayed reconstruction [0.60] than with TRAM flap reconstruction and immediate reconstruction. [12] In our study, 33 patients had undergone open cavity BCS and only 17 had undergone oncoplastic procedures; we found good agreement between SAS and RAS for open cavity procedures [0.733] and moderate agreement for oncoplastic procedures [0.497]. For oncoplastic procedures, agreement with BCCT.core improved from fair to moderate when serial images were provided for assessment, again indicating the positive influence of serial images. This observation also needs to be tested in a larger cohort.
Heil et al also found only slight to fair inter-rater agreement within each group [five groups of 5-7 members each, of uniform profile, comprising breast surgeons, nursing staff, patients, and medical and non-medical students], and agreement was again better immediately after surgery than one year after surgery. [10] Corica et al have also shown modest inter-rater agreement between patients, doctors, nurses, and BCCT.core. [13] In our study, inter-rater kappa was not calculated, as the three groups in the panel consisted of only two members each.
Haloua et al showed substantial agreement between the BCCT.core software and the mean overall panel score for cosmetic outcome [weighted kappa 0.68]. This is higher than the agreement with BCCT.core in our study [SAS 0.343 and RAS 0.301]. The higher agreement may be due to their larger panel, the quality of their photography, or the greater proportion of highly experienced clinicians on their panel, which consisted of two experienced breast surgeons, two surgical residents, four experienced plastic surgeons, and two lay persons. [14]
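As a point of comparison with the unweighted statistic given earlier, the weighted kappa reported by Haloua et al penalises disagreements according to their distance on the ordinal cosmesis scale; the general form is shown below, with the specific weighting scheme [linear or quadratic] left as an assumption, as it is not stated here.

\[
\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, o_{ij}}{\sum_{i,j} w_{ij}\, e_{ij}}
\]

where \(o_{ij}\) and \(e_{ij}\) are the observed and chance-expected proportions of rating pairs in cell \((i, j)\), and \(w_{ij}\) is the disagreement weight, typically \(|i - j|\) [linear] or \((i - j)^2\) [quadratic] for an ordinal scale.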
Yu et al determined the kappa statistic between patients' self-reported scores and BCCT.core and found a low value of 0.12, indicating only slight agreement. [15]
The composition of a panel must be chosen carefully, as a good panel has the potential to achieve high agreement with BCCT.core. For example, as discussed above, the panel of Haloua et al consisted of ten members, of whom six were experienced surgeons [breast and plastic], two were surgical residents, and only two were lay persons. [14] In the Cambridge Breast IMRT trial, cosmetic evaluation was done by a team of seven clinicians [four oncologists, one radiographer, one surgeon, and one breast care nurse] sitting as a panel of any three at a time, and Mukesh et al concluded that improved dose homogeneity with simple IMRT translates into superior overall cosmesis. [16] In the EORTC 'boost versus no boost' trial, cosmetic evaluation was performed by a five-member panel. [17] In the UK START B trial, changes in the appearance of breast photographs were assessed by three blinded observers. [7] In line with the literature, the current study also involved a panel comprising six members.
The professional background and sex of the rater can also potentially affect the rating. In our study, all panellists were female and from a radiation oncology background. No comprehensive study has evaluated bias between ratings from radiation oncology and surgical oncology backgrounds, as most panels are an amalgamation of the two specialities. [14] Such a study may be worthwhile, as a surgeon has a deeper understanding of surgical nuances, techniques, and practices, which may create a bias.
In our study, the panel consisted of three groups of two panellists each [two senior breast RO, two junior RO, and two RN]. We found that agreement with BCCT.core improved with the increasing experience of the panellist: the senior breast RO had the highest agreement with BCCT.core, suggesting that the experience of the rating panellist is significant in cosmetic rating [Figure 4]. Similar results were seen in a study by Cardoso et al, who found higher inter-observer agreement for experienced observers [multiple kappa 0.59] than for moderately experienced and inexperienced observers; overall inter-observer agreement across their 13 observers was 0.33 [multiple kappa]. They concluded that a homogeneous group of observers experienced in breast cancer conservative treatment provides better inter-observer agreement than a mixed group of clinicians with different levels of expertise. [9]
The strength of our study lies in its simple design, its testing of raters with different experience levels, and its being the first study to report the impact of the assessment setting [serial versus random presentation of images] on cosmetic rating. We included breast cancer patients with a mix of clinical stages, breast sizes, skin colours, and surgeries to reflect a real-world clinical setting. Though simple in design, our study has limitations. It does not evaluate the influence of professional background [plastic surgeon vs. oncosurgeon vs. radiation oncologist] on the rating. Only frontal photographs were assessed, objectively and subjectively; hence, if the scar was not visible on the frontal photograph, it was not considered by the panel for cosmetic rating. Not all patients had a uniform number of photographs. Our photographs were not captured by a professional photographer, and minor variations were therefore present in image quality, background, distance, and patient posture. We tried to overcome these limitations by post-processing the photographs for sharpness, quality, and contrast, and we expect the impact of these issues to be uniform across all raters and across both SAS and RAS. The study did not intend to compare subjective and objective methods of assessment with each other.