COMFORTneo scale: a reliable and valid instrument to measure prolonged pain in neonates?

We studied the reliability and validity of the COMFORTneo scale, designed to measure neonatal prolonged pain. This prospective observational study evaluated four clinimetric properties of the COMFORTneo scale from NICU nurses’ assessments of neonates’ pain. Intra-rater reliability was determined from three video fragments at two time points. Inter-rater reliability and construct validity were determined in five neonates per nurse with the COMFORTneo and numeric rating scales (NRS) for pain and distress. Pain scores using N-PASS were correlated with COMFORTneo scores to further evaluate construct validity. Intra-rater reliability: Twenty-two nurses assessed pain twice, yielding an intraclass correlation coefficient (ICC) of 0.70. Inter-rater reliability: The ICC for 310 paired COMFORTneo scores from 62 nurses was 0.93. Construct validity: The correlations of the COMFORTneo with NRS pain, NRS distress, and N-PASS were 0.34, 0.72, and 0.70, respectively. The COMFORTneo can be used to reliably and validly assess pain in NICU patients.


INTRODUCTION
Experiencing pain negatively impacts a premature infant's development with respect to cognitive, motor, behavioral and neurological outcome [1][2][3][4][5][6]. Neonatal Intensive Care Unit (NICU) nurses consider the prevention and reduction of pain and stress in NICU patients the most important research priority [7]. The accurate assessment of pain in these patients is essential to accomplish adequate pain management [8]. The use of self-report is the first choice in assessing pain in pediatric and adult patients, but this is impossible in neonates [9,10]. Because of the lack of a gold standard, the assessment of pain remains a difficult aspect of neonatal care [11]. The application of a measurement instrument to quantify pain is considered the best alternative. Nowadays, more than 40 measurement instruments have been developed to assess pain in neonates [8,12]. These instruments primarily use behavioral observations to quantify the level of pain, sometimes combined with physiological aspects and contextual information such as gestational age. Despite the large number of observational pain measurement instruments, more research focusing on the reliability, validity, clinical utility, and applicability of these instruments is necessary in order to ensure that pain is assessed adequately.
According to the framework provided by Anand in 2017, neonatal pain can be divided into either acute (episodic or recurrent) or prolonged, persistent, and chronic pain depending primarily on the onset and duration of pain [13]. Most pain measurement instruments focus on the assessment of acute pain related to procedures, for example heel sticks and venipunctures [11].
More attention for the assessment of prolonged pain, unrelated to procedures, is needed in NICU patients. A survey study from 2017 in 18 European countries showed that prolonged pain was assessed at least once during the NICU stay in 32% of the patients, with daily assessment occurring in only 10% of all neonates [14]. This is worrying because a lack of assessment impedes sufficient treatment [15]. One of the instruments that has been developed specifically to assess prolonged pain is the COMFORTneo scale (Supplementary file 1). This instrument was introduced in 2004 at the NICU of the Sophia Children's Hospital. In 2009 the first validation study was published and concluded that the instrument showed preliminary reliability and validity for the evaluation of prolonged pain [16]. Nowadays, an increasing number of NICUs worldwide use the COMFORTneo scale either in clinical practice or for research purposes [17][18][19][20][21].
The validation of an instrument is a continuous process; it is never fully complete [22]. For one, knowledge regarding the measurement of pain in NICU patients is evolving and this strengthens the possibility to validate a pain measurement instrument [22]. The validation process is further complicated because a gold standard for pain assessment, self-report, is unavailable for infants [23]. The original COMFORTneo validation study already mentioned possibilities to strengthen the evaluation of the instrument's measurement properties [16]. In that study the same nurse assessed the Numeric Rating Scores (NRS) for pain and distress and the COMFORTneo, whereas these should ideally be assessed by different caregivers to minimize observer bias. The Neonatal Pain, Agitation and Sedation Scale (N-PASS) was not yet published during the first validation study, but has since been validated to assess prolonged pain in neonates [24]. Van Dijk et al. mentioned that both the N-PASS and the COMFORTneo should be assessed by two independent assessors at the same time to confirm the construct validity [16]. Lastly, the intra-rater reliability was not determined during the first study.
Therefore, our study aimed to further evaluate the reliability and validity of the COMFORTneo scale as an instrument to measure prolonged pain at the NICU.

SUBJECTS AND METHODS
Design
This prospective validation study addressed four measurement properties: inter-rater and intra-rater reliability, concurrent validity, and construct validity.

Patients and setting
Data collection was conducted from November 2015 until April 2016 in the level 3 NICU of the Erasmus MC -Sophia, Rotterdam, the Netherlands. Approximately 100 NICU nurses are employed at this NICU. There were no exclusion criteria for patients or nurses, as in clinical practice COMFORTneo is also applied by all nurses and to all preterm and term patients. The nurses only assessed the pain of each infant once, but different nurses could observe the same patient. The neonates could be observed at any time of the day, but all observations were made during rest while the patients were not disturbed. Both nurses (depending on their presence and availability during their shift) and patients (depending on practical reasons such as the absence of parents) were selected based on convenience sampling.

Measurement instruments
COMFORTneo. The COMFORTneo consists of seven behavioral items (alertness, calmness/agitation, respiratory response, crying, body movement, facial tension, and muscle tone), of which six are scored: either respiratory response or crying is scored, depending on the presence of invasive ventilation [16]. To score these items, the neonate is observed for 2 min. Each item is scored from 1 to 5, so the total score ranges from 6 to 30. A score of 14 or higher is considered a sign of distress and pain. A score below 9 suggests that it might be possible to decrease the opioid or sedative dose. All NICU nurses are trained to apply the COMFORTneo when they start working at the NICU because the COMFORTneo is part of our standard of care. They are at least vocational or bachelor-trained nurses with a certified NICU specialization or are in training for this NICU specialization. The COMFORTneo training starts with a presentation focusing on pain in NICU patients and the COMFORTneo scale as an assessment tool. After this presentation, trainees assess pain with the COMFORTneo in 10 NICU patients, each time simultaneously with but independently of a qualified nurse who has already completed the training. If the linearly weighted Cohen's kappa is lower than 0.65, the differences are discussed and the ten paired assessments are repeated until the agreement exceeds 0.65.
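The scoring rule described above can be illustrated with a minimal Python sketch. It is not part of the study's tooling. The function names, the item keys, and the wording of the returned labels are our own, and the sketch assumes that respiratory response is the item scored in invasively ventilated neonates and crying in all others. The text gives no interpretation for totals of 9 to 13, so the sketch returns a neutral label there.

```python
from typing import Dict

# Five items scored in every neonate (key names are illustrative).
COMMON_ITEMS = (
    "alertness",
    "calmness_agitation",
    "body_movement",
    "facial_tension",
    "muscle_tone",
)


def comfortneo_total(items: Dict[str, int], ventilated: bool) -> int:
    """Sum the six applicable COMFORTneo items (each 1-5, total 6-30)."""
    # Assumption: respiratory response is scored when the neonate is
    # invasively ventilated, crying otherwise.
    sixth = "respiratory_response" if ventilated else "crying"
    total = 0
    for name in COMMON_ITEMS + (sixth,):
        score = items[name]
        if not 1 <= score <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {score}")
        total += score
    return total


def interpret_total(total: int) -> str:
    """Map a total score to the cut-offs quoted in the text."""
    if not 6 <= total <= 30:
        raise ValueError("total must be between 6 and 30")
    if total >= 14:
        return "possible pain or distress"
    if total < 9:
        return "consider lowering opioid or sedative dose"
    return "no action suggested by the score alone"
```

For example, a non-ventilated neonate scoring 2 on all six applicable items has a total of 12, which falls between the two cut-offs.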
NRS pain and NRS distress. NRS scores range from 0 (no pain) to 10 (worst pain possible) with cut-off scores set at 4 or higher for both pain and distress. In clinical practice, NICU nursing staff are trained to always apply the COMFORTneo and NRS scores simultaneously.
N-PASS. The N-PASS consists of 5 items with scores ranging from −2 to 2: four behavioral items (crying/irritability, behavior state, facial expression, extremities tone) and one item for vital signs (changes in heart rate, respiratory rate, blood pressure, and oxygen saturation). Pain is scored from 0 to 2 for each behavioral and physiological criterion, giving a total pain score between 0 (no pain) and 10 (pain/agitation). Sedation is scored from −2 to 0 per item, and the total sedation score ranges from −10 to 0. Additionally, a correction for gestational age is applied (+3 if <28 weeks, +2 if 28-31 weeks, +1 if 32-35 weeks). The goal of pain treatment is an N-PASS score of 3 or less. The N-PASS was validated in 2008 for prolonged pain and in 2010 for acute pain [24,25].
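As an illustration of the N-PASS pain score and its age correction, a short Python sketch follows. The function names are hypothetical, and age is assumed to be given in completed weeks. The sketch returns both the uncorrected and the corrected total and does not cap the corrected score, because the text does not state a cap.

```python
from typing import Sequence, Tuple


def gestational_age_correction(age_weeks: int) -> int:
    """Points added for prematurity, using the bands quoted in the text."""
    if age_weeks < 28:
        return 3
    if age_weeks <= 31:
        return 2
    if age_weeks <= 35:
        return 1
    return 0


def npass_pain_score(item_scores: Sequence[int], age_weeks: int) -> Tuple[int, int]:
    """Return (uncorrected, corrected) N-PASS pain totals.

    item_scores holds the five pain item scores, each 0-2.
    """
    if len(item_scores) != 5:
        raise ValueError("the N-PASS has exactly five items")
    if any(not 0 <= s <= 2 for s in item_scores):
        raise ValueError("each pain item is scored 0, 1 or 2")
    raw = sum(item_scores)
    return raw, raw + gestational_age_correction(age_weeks)
```

A neonate of 27 completed weeks with item scores 1, 0, 1, 0, 0 would have an uncorrected total of 2 and a corrected total of 5, which is above the treatment goal of 3 or less.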

Data collection
We repeated the evaluation of the inter-rater reliability more than 10 years after the introduction of the COMFORTneo and added an evaluation of the intra-rater reliability. Next to this, we asked different raters to independently apply the COMFORTneo and either NRS or N-PASS scores to determine the construct validity in the present study. The institutional ethical review board waived the need for approval because this is an observational study and data were analyzed anonymously (MEC-2014-547).
Before starting the validation study. The principal investigator (PI; NM) was trained before the start of the study by assessing pain using the COMFORTneo score and the N-PASS during ten paired observations for each scale together with a pain expert (MvD). Linearly weighted Cohen's kappa for the PI compared to the pain expert after ten paired scores with the COMFORTneo score and the N-PASS was 0.92 and 0.95, respectively. Figure 1 shows a flow chart of the study design.
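For readers unfamiliar with the agreement statistic used during training, the following Python sketch shows one common way to compute a linearly weighted Cohen's kappa from paired scores. It is an illustration only. The function name is ours, and the choice of category set (for COMFORTneo totals, all integers from 6 to 30) is an assumption, since the text does not state which categories were used.

```python
from typing import Hashable, Sequence


def linear_weighted_kappa(
    rater_a: Sequence[Hashable],
    rater_b: Sequence[Hashable],
    categories: Sequence[Hashable],
) -> float:
    """Linearly weighted Cohen's kappa for two raters on ordered categories."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of scores")
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Disagreement weights |i - j|; the common factor 1/(k - 1) cancels.
    obs_dis = sum(abs(i - j) * observed[i][j] for i in range(k) for j in range(k))
    exp_dis = sum(abs(i - j) * row[i] * col[j] for i in range(k) for j in range(k))
    if exp_dis == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - obs_dis / exp_dis
```

With COMFORTneo totals, a call could look like `linear_weighted_kappa(trainee_scores, expert_scores, list(range(6, 31)))`, where both score lists are hypothetical inputs.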
Intra-rater reliability (Part A). Three video fragments lasting exactly two minutes were selected by the PI (NM) based on 1) different gestational ages of the neonates (one neonate with a gestational age below 28 weeks, one between 28 and 32 weeks and one older than 32 weeks) and 2) different pain levels. This selection was made since the instrument should measure pain reliably in patients with different gestational ages and pain levels. The video fragments were to be shown twice at a four-week interval to at least twenty NICU nurses. The NICU nurses were invited to rate the fragments during a coffee break depending on their availability without discussing the observations with each other. During the first time, nurses were not informed that they would be asked to observe and assess the video fragments a second time.
Inter-rater reliability (Part B). Each nurse that participated in part B of this study assessed pain at the bedside together with but independently of the principal investigator in five patients using the COMFORTneo. During the assessment, these patients were lying in the incubator and not exposed to any procedure.
Construct validity (Part C). During the simultaneous observations with the PI to evaluate the inter-rater reliability, the nurses also scored the NRS pain and NRS distress. During the last part of this study, after observing a neonate at the bedside for 2 min, the principal investigator (NM) applied the N-PASS to assess pain while a trained NICU nurse simultaneously applied the COMFORTneo scale. A total of 50 different neonates were scored, resulting in 50 combined assessments.

Data analysis
Patient characteristics and other continuous data are presented as mean (standard deviation, SD) when normally distributed or as median (interquartile range, IQR) when not normally distributed; categorical variables are presented as percentages. In case of a skewed distribution or small sample size, nonparametric statistics were used (detailed below). All statistical tests used a two-sided significance level of 0.05. Data were analyzed in IBM SPSS Statistics for Windows, version 25 (IBM Corp., Armonk, NY). Measurement properties were calculated according to the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) guidelines [26].
Intra-rater reliability (Part A). The intraclass correlation coefficient (ICC, 95% CI) was used to calculate intra-rater reliability for all the COMFORTneo total scores and each video fragment separately (two-way mixed effects model, absolute agreement for single measures). An ICC value of 0.70 is considered acceptable [27].
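To make the ICC definition concrete, the sketch below computes the single-measures, absolute-agreement ICC for a two-way layout from the usual ANOVA mean squares. It is a plain-Python illustration under our own naming, not the SPSS procedure used in the study, and it does not compute the confidence interval. Each row is one rated target and each column one rating occasion or rater.

```python
from typing import Sequence


def icc_absolute_single(ratings: Sequence[Sequence[float]]) -> float:
    """Two-way ICC, absolute agreement, single measures.

    ratings[i][j] is the score for target i on occasion (or by rater) j.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need a complete table with >= 2 rows and >= 2 columns")

    grand = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]

    # Mean squares for rows (targets), columns (occasions) and residual error.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_rows - ms_err) / denom
```

For the intra-rater analysis, each row would hold the first and second score of one nurse for one video fragment. Because absolute agreement is requested, a constant shift between occasions lowers the coefficient: the table `[[1, 2], [2, 3], [3, 4]]` gives 2/3 rather than 1.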
Inter-rater reliability (Part B). The ICC was used to determine the inter-rater reliability for the COMFORTneo total scores (two-way mixed effects model, absolute agreement for single measures). An ICC value of 0.70 is considered acceptable. Due to the complex design with repeated measurements of both nurses and patients, the calculation of a valid confidence interval was not considered feasible. Because this analysis was not adjusted for the repeated measurements within the same patient, we also calculated the ICC (95% CI) for the first paired pain assessment in each patient.
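The restriction to the first paired assessment per patient can be sketched as a simple filter. The record layout (patient identifier, nurse score, PI score) and the function name are hypothetical, and the records are assumed to be in chronological order.

```python
from typing import Hashable, Iterable, List, Tuple

# (patient_id, nurse_score, pi_score); the layout is illustrative.
Pair = Tuple[Hashable, int, int]


def first_pair_per_patient(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep only the first paired assessment recorded for each patient."""
    seen = set()
    kept: List[Pair] = []
    for pair in pairs:
        patient_id = pair[0]
        if patient_id not in seen:
            seen.add(patient_id)
            kept.append(pair)
    return kept
```

The filtered pairs contain one row per patient, so they no longer include repeated measurements within the same patient.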
Construct validity (Part C). The correlation between the NRS pain and distress scores from the nurses and the COMFORTneo scores from the PI was calculated with the Spearman rank order correlation coefficient. The correlation coefficients were calculated over all observations, without adjustment for repeated measurements. Due to the complex design with repeated measurements of both nurses and patients, the calculation of a valid confidence interval was not considered feasible.
Because of the non-normal distribution, the Spearman rank correlation coefficient (95% CI) was also used to determine the correlation between the COMFORTneo score from the nurses and the N-PASS score from the PI.
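The Spearman coefficient is the Pearson correlation of the ranked scores, with tied values sharing their average rank. The following stdlib-only Python sketch illustrates that calculation. The function names are ours, and the sketch does not reproduce the confidence intervals reported in the study.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with >= 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

Ties matter in this setting because NRS scores cluster heavily at 0, so many observations share the same rank.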
We formulated hypotheses regarding these correlations a priori, according to the COSMIN guidelines, namely that the correlations of the COMFORTneo with the NRS pain score and with the N-PASS should each be at least 0.60 [28].

RESULTS
Table 1 shows the patient characteristics of all 130 neonates that were observed once or multiple times during the 426 paired pain scores for the different study parts. Gestational age ranged from 24+0 to 41+3 weeks and postnatal age from 0 to 125 days. If neonates were observed more than once, the mean postnatal age was calculated and used to determine the median postnatal age for all 130 neonates.

Intra-rater reliability (Part A)
Twenty-two nurses assessed all three video fragments twice with a range of 4 to 10 weeks between the two observation days. Four nurses never reassessed the video fragments after the first assessment and therefore were excluded. For fragments 1, 2, and 3 respectively, the median of the mean COMFORTneo scores was 14.5, 13.5, and 18.3. The systematic difference between the first and second assessments was close to zero (mean difference −0.23) and comparable for each of the three video fragments (mean difference −0.27, 0.09, and −0.50 for fragments 1, 2, and 3, respectively).
The ICC of all 66 paired COMFORTneo scores between the first and second observation was 0.70 (95% CI 0.55 to 0.80; p < 0.001).

Inter-rater reliability (Part B)
Sixty-two nurses participated in Part B of the study. The median COMFORTneo score was 12 (IQR 10 to 14) for the nurses and 12 (IQR 10 to 14) for the PI. The ICC of all 310 paired COMFORTneo scores (62 nurses × 5 assessments) versus the scores of the PI was 0.93. Figure 2 shows the correlation between the paired COMFORTneo scores. Pain could be assessed in these neonates multiple times by different nurses. After selecting only the first paired COMFORTneo score for each individual patient, the ICC for those 104 COMFORTneo scores was 0.96 (95% CI 0.94 to 0.97).

Construct validity-NRS (Part C)
The 62 nurses also rated the level of pain and distress using the NRS for all 310 paired assessments with the PI (applying the COMFORTneo). The median COMFORTneo score, NRS pain, and NRS distress of the PI for these observations were 12 (IQR 10 to 14), 0 (IQR 0 to 0) and 0 (IQR 0 to 1), respectively. In 178 assessments (57.4%) no pain or distress was suspected (NRS 0) by the nurses. The NRS pain and/or NRS distress was rated 4 or higher by the nurses during 28 observations (9.0%).
The Spearman rank correlation coefficient between the 310 COMFORTneo scores assessed by the PI and the NRS pain and NRS distress assessed by the nurses was 0.34 and 0.72, respectively (Fig. 3a, b). When selecting only the first paired assessment for each individual patient, the Spearman rank correlation coefficient was 0.37 (95% CI 0.21 to 0.49) and 0.73 (95% CI 0.62 to 0.81), respectively.

Construct validity-N-PASS (Part C)
Fifty different patients were simultaneously assessed once by both the PI applying the N-PASS and a nurse applying the COMFORTneo scale. The Spearman rank correlation coefficient between the COMFORTneo score of the nurse and the N-PASS score assessed by the PI was 0.70 (95% CI 0.52-0.82) and 0.75 (0.59-0.85) with the new correction for gestational age. In Fig. 4, pain scores are shown for the different postmenstrual age groups for which the N-PASS score was corrected. For 43 of the 50 patients (86%) the vital signs remained within normal limits (N-PASS item score 0).

DISCUSSION
Our study shows that COMFORTneo is an instrument with good inter-rater reliability and acceptable intra-rater reliability and construct validity to measure prolonged pain in newborns admitted to the NICU. Our findings complement and strengthen the conclusion of the previous validation study [16].
Directly after the implementation of this scale in the NICU, ten years ago, the inter-rater reliability was high with a linearly weighted Cohen's kappa of 0.79 [16]. After using the COMFORTneo for over ten years, the inter-rater reliability has further improved with an ICC of 0.93. A possible explanation may be the increased experience of the NICU nurses using this scale. This corresponds with the findings by Stenkjaer et al., who also found a significantly improved inter-rater reliability five years after the implementation of the COMFORTneo [18].
The intra-rater reliability of the COMFORTneo was lower than expected. The ICC of 0.70 was equal to the lowest acceptable limit we set before the start of this study. The validation studies of two other pain measurement instruments, the Neonatal Infant Acute Pain Assessment Scale (NIAPAS) and the Bernese Pain Scale for Neonates (BPSN), found a higher level of agreement between the same assessors at different time points: 0.99-1.00 (Pearson correlation coefficient, 2 raters) and 0.98-0.99 (Cronbach's alpha reliability, 4 raters) per rater, respectively [29,30]. While the intra-rater reliability of these instruments was much better than in our study, the lower number of raters, the shorter time interval between the two assessments, and the fact that the raters in the NIAPAS and BPSN validation studies were aware of the re-assessment during the first assessment could explain this difference.
Another explanation for our lower intra-rater reliability could be that the environmental circumstances differed during the observations of the video fragments for the determination of the intra-rater reliability. Also, with video fragments, one relies on the angle of the recording, whereas with bedside observations one can move around to have a full view of the neonate. This would mean that the intra-rater reliability was influenced by environmental circumstances related to both the surroundings and the way in which the neonate is observed (i.e., bedside or video). Interestingly, Black et al. specifically recommend using video recordings for research purposes in order to improve consistency [31].
Regarding the construct validity, the correlation between the COMFORTneo scale and the NRS pain was lower than hypothesized. Furthermore, the correlation between the COMFORTneo and the NRS distress was higher than with the NRS pain. In our ward, the COMFORTneo is always assessed together with the NRS for pain and distress in order to differentiate pain from distress [16]. In the current study, fortunately, few patients were exposed to pain; only two of the 310 NRS pain scores were four or higher (0.6%). The lack of patients that were considered painful decreases the variation and therefore deflates the correlation. The COMFORTneo should be able to measure prolonged pain in all NICU patients in order to make it clinically applicable. It seems necessary to validate the instrument in a population with greater variability in prolonged pain levels. It is important to determine which patients are at risk for experiencing this type of pain, but this is complicated without a clear definition. Referring to the framework presented by Anand [13], Ilhan et al. recently formulated consensus-based definitions for acute episodic and chronic pain, but not for prolonged pain [32]. It seems that prolonged or persistent pain might be caused by painful conditions (e.g., necrotizing enterocolitis) unrelated to procedures, by tissue injury (e.g., postoperative), and by repeatedly experiencing painful procedures while an infant has not yet recovered from earlier procedures [13,33].
It is difficult to differentiate pain from distress in neonates based on their behavior [34]. When applying the COMFORTneo together with these NRS scores, this may enable NICU clinicians to objectify and differentiate both pain and distress and treat accordingly.
Although there is some overlap between the COMFORTneo and the N-PASS, the most important differences between both scores are the addition of the vital parameters and the correction for gestational age in the N-PASS [35]. Hummel et al. chose to correct for gestational age because previous studies showed premature neonates are less able to show signs of pain than term infants [24]. However, in the validation studies of the N-PASS score as well as the COMFORTneo the mean pain scores were similar for each gestational age group, without adding additional points for different gestational age groups [16,24]. The COMFORTneo does not include vital parameters because of the lack of evidence for a relationship with prolonged pain [16,35]. The N-PASS item that assesses vital signs showed very little variability between patients in our study with 86% of the patients receiving a score of 0. Hummel et al. did not present scores per item in their N-PASS validation study, though it would be interesting to see if they found greater variability because they specifically included ventilated and/or postoperative infants that are expected to experience a higher level of prolonged pain [24].
One of the strengths of the current study is the use of COSMIN guidelines and checklist [26]. Giordano et al. used this checklist to evaluate the quality of validation studies focusing on pain and sedation scales for neonatal and pediatric patients and found that COMFORTneo was one of the seven most relevant scales for this patient population with a low risk of bias [12]. Another strength of our current study is that all simultaneous assessments took place with the same researcher with a high level of agreement with the pain expert. The different COMFORTneo, N-PASS, and NRS scores that were correlated were assessed independently by different assessors, which reduces the risk of bias. This study also has some limitations. The fact that only few NICU patients were painful or distressed is reassuring but also limits this study. Still, we need to keep in mind that this is also due to our focus on prolonged pain and not on acute pain caused by heel pricks or venipunctures, for example. The latter type of pain will occur more often than prolonged pain. Furthermore, in daily care doctors will prescribe pain-reducing medication as soon as a child is diagnosed with a painful condition such as necrotizing enterocolitis. This may result in low pain scores despite the condition of the child. While patients with varying gestational and postnatal age were included in our study, specific patient groups such as infants with necrotizing enterocolitis or asphyxiated infants might need additional attention in future validation studies. Next, we are not able to provide nursing characteristics. Since they were selected based on convenience sampling, however, we expect the participating nurses to be representative of the full NICU nursing staff. Furthermore, we did not test responsiveness, 'the ability of an instrument to detect change over time in the construct to be measured' [22,36].
Since an instrument for prolonged pain is necessary in order to evaluate the effect of pain-reducing interventions, it is important to also evaluate this measurement property in a future study. Finally, our data collection was performed in 2015-2016 and the delay in publishing our findings could be considered a drawback. However, neither our policy nor our patient mix has changed in the past years and we still apply the same pain and sedation protocols as at the time of the data collection.
The behavioral response to pain might not always correspond with brain and spinal cord activity [37]. Physiological indicators are being studied for acute pain assessment. For example, skin conductance, heart rate variability, and methods that focus on the brain such as Near-Infrared Spectroscopy (NIRS) and electroencephalography could give information regarding the level of pain in neonates. The results of these studies are promising, but more research is needed before these methods will be available to use in clinical practice and for prolonged pain [38]. More advanced physiological methods such as heart rate variability and NIRS could complement behavioral observations but require more testing. Assessment of pain and stress in vulnerable NICU patients depends on the use and interpretation of observational measurement instruments such as the COMFORTneo scale.
This validation study shows that the COMFORTneo scale has acceptable inter-rater reliability and moderate intra-rater reliability. Next to this, the COMFORTneo correlates well with the N-PASS, but less so with the NRS pain. Future validation studies should focus on neonates with prolonged painful conditions and this underlines the continuous process of validating measurement instruments. Combining the COMFORTneo score with an NRS for pain and distress might be an easy way to improve observational pain assessment in neonates until more advanced pain assessment methods become available.

DATA AVAILABILITY
The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request.