Closing an Eye for the Reliability of Orbital Tightening in Pain Detection: A Model-Specific Simplification of the Mouse Grimace Scale

Despite its long establishment and its applicability in pain detection in mice, the Mouse Grimace Scale (MGS) still seems to be underused for acute pain detection during chronic experiments. However, broadening its application can help to identify possible refinement approaches with regard to cumulative severity and habituation to painful stimuli. Therefore, this study focuses on two main aspects. First, five composite MGS criteria were evaluated with two independent methods (the MoBPS algorithm and a penalized least squares regression) and ranked for their relative importance. Second, the most important variable was used in a further analysis to specifically evaluate pain after an i.p. injection (intervention) in two treatment groups (CCl4 and oil (control)) at fixed times throughout four weeks in 24 male C57BL/6N mice. One hour before and after each intervention, video recordings were taken and the MGS assessment was performed. The results indicate orbital tightening as the most important criterion. In this experimental setup, a highly significant difference after treatment between weeks 0 and 1 was found in the CCl4 group, resulting in a medium-sized effect (W = 62.5, p-value < 0.0001, r CCl4 = 0.64). The oil group showed no significant difference (week 0 vs 1, W = 291.5, p-value = 0.7875, r control = 0.04). Therefore, the study showed that the pain caused by i.p. injections depended only on the applied substance and that no significant cumulation or habituation occurred due to the intervention. Further, the results indicated that the MGS system can be simplified.


Introduction
The EU Directive 2010/63 protects animal life and welfare when animals are used in experiments, e.g., for biomedical research 1. When using animals, the aim should always be the greatest possible well-being and the reduction of animal suffering through pain, distress, or harm. When assessing severity, pain recognition is one major factor to be considered 2. The perception of pain varies between individuals, and it can also be expressed in different ways across animal species. In this context, facial expressions are one way in which certain animals, e.g., rodents, show pain 3. The pain face, or so-called grimace scale, which was originally developed in humans for the recognition of pain in children or other patients who depend on non-verbal communication 4, scales the pain sensation based on the expression of different facial features. Meanwhile, the Mouse Grimace Scale (MGS) 5 has been developed and transferred to other animal species as well 6-11. Numerous studies were able to demonstrate and verify the applicability and utilization of the grimace scale for pain recognition 12,13. The following animal-specific facial criteria are taken into account: orbital tightening (OT), ear position (EP), cheek bulge (CB), nose bulge (NB), and whisker change (WC) 5. These five criteria are scored by observers and classified into degrees of deviation as a function of severity classes. Summing the scores allows the animal to be assigned a degree of pain at the respective time point. All criteria are equally weighted in this approach.
The application of the grimace scales in laboratory animal science is intended both to provide the possibility of classifying certain interventions and treatments and to ensure better medical care for the animals within the experiment through the direct assessment of the pain condition. This means that the MGS can also be used directly as a target for possible refinement measures in the context of the 3R principles 14. Despite the simplicity of learning this methodology 11, the grimace scale has not yet been widely used on a routine basis during experiments. Most studies that use the grimace scale either focus on the evaluation of the MGS system 13,15 using different techniques or settings, or have pain detection and assessment as a direct scientific focus 16,17.
The studies that have investigated the applicability of the grimace scale show that time and personnel requirements still impede its extensive use and, above all, a direct on-site approach, owing to the nature of its evaluation procedure 13,18,19. Inter-individual variations in the assessment criteria and the influence of subjective perceptions on the assessment result represent further difficulties in the usability of this method 15,19. These standardization problems can lead to the conclusion that the application is too intricate or too extensive in its basic structure to achieve unambiguous results. Our exemplary study aimed to characterize the five examination criteria and their contribution to the overall score, to evaluate changes in the criteria or their use in isolation, and to determine whether this could facilitate the scoring of the animals' pain face. These examinations were carried out as a pain assessment of repeated i.p. injections (intervention) of CCl4 or oil (treatment) at predefined regular intervals; the injections are therefore considered the pain stimulus to be classified.

Variable importance and selection
To analyze the severity of the intervention based on the MGS image scores, a total of 4944 images were randomly selected for evaluation using a picture selection tool similar to that of our previous studies 15. Of these images, 749 could not be included because of poor quality or because the evaluation criteria (e.g., whisker change) were not recognizable (marked as -1 = rejected in the raw data). For repeated measurements from different video sources, data were aggregated as mean values. Further, in addition to the five MGS criteria, the time resolution of the measurements was noted in two variables, "week" (0, 1, 2, 3, 4) and "day" (day 1, 2, and 3), as well as the variables treatment (Oil, CCl4), intervention (baseline, pre, and post), and animal ID. The final data set had 498 rows with n = 24 unique animal identifiers.
Initially, the priority of the different MGS evaluation criteria was determined with the MoBPS algorithm. As a result, the expressiveness of the parameters was ranked and quantified relative to the most meaningful value (defined as 100%). Figure 1 shows the result of these analyses and identifies orbital tightening as the first-ranked parameter and whisker change as the last-ranked parameter. Further, the algorithm explored criteria combinations, e.g., ranking OT and NB as second best.
In addition to the expressiveness, time- and intervention-independent correlations of the grimace scale criteria in each treatment group were analyzed (Table 1). The overall correlations in the CCl4 group were higher than in the Oil group. In both treatment groups, the NB~CB combination shows the highest correlation of all criteria (Oil, r NB~CB = 0.817; CCl4, r NB~CB = 0.901). In general, however, the results show that all parameters are highly correlated and will, therefore, show strong collinearity in regular regression analysis. To compensate for this, we used a penalized maximum likelihood regression that was capable of both variable selection and regularization of the model. We used 10-fold cross-validation to minimize the mean squared error on the λ estimator (λ 1SE,Oil = 0.001, λ 1SE,CCl4 = 0.306). Figure 2 shows the result of the coefficient ranking from the LASSO regression. The MoBPS algorithm finds whisker change to be the worst-performing variable, while the LASSO regression finds nose bulge to be the worst, in each case in both treatment groups. In the regression model, whisker change performs better than cheek bulge in the CCl4 group. In the control group, this was reversed.
Because our results agreed on the high applicability of orbital tightening, and because it is easy to recognize, also with regard to future automated examination procedures, we selected orbital tightening as the "target parameter" for subsequent examinations.

The regression model of the OT analysis
In the second part of the analysis, multiple linear mixed regression models with orbital tightening as the dependent variable were built to analyze how the different treatments and interventions affected the orbital tightening variable over time (Table 2). The main aim was to investigate the effects of treatment, intervention, and time on the OT parameter. In model I (Supplemental Material S2-3), the highest available time resolution "day" was included in an interaction with the "intervention" variable and the "treatment" groups (Oil and CCl4). The between-treatments model (I) with animal ID as random effect (RE) was extended by a random intercept term in which "day" was nested within the "week" variable (β Intercept = 2.59, CI 95% [2.04; 3.14], p < 0.001). Of the total variance, the animal ID explained 21.56% (τ ID = 0.32), the interaction day:week 5.33% (τ day:week = 0.08), and week 0.77% (τ week = 0.01). The between-treatments predictor was not significant, but the interaction with intervention shows that CCl4 post-intervention was higher than Oil pre-intervention. In model I, "day" and its interactions with "treatment" or "intervention" did not show significant differences (Fig. 3A).

Model II - Orbital Tightening within-CCl4 analysis
The analysis in model II focused on CCl4 data (Supplemental Material S2, S4). Here, the within-treatment development of severity over time was modeled. Therefore, baseline data (at week 0) with missing interventions were excluded. As a result, the default level of "week" was 1 in this model. Baseline level comparisons are shown in model I. Orbital tightening was modeled as a function of the interaction terms "week" and "intervention" (β Intercept = 3.30
In the control group, the median development of the post-interventional severity was not as high as it was in the CCl4 group (see "intervention (post)" in models II and III, Fig. 3
distribution of data into the three discretized severity classes was also different in the group comparisons. CCl4 showed more directionality towards higher severity in the post-intervention group (red points in the red area) than the control group. Figure 4B explores the cumulative and time-independent development of severity in the data. For this, data in the discrete classes were counted (Table 3) and expressed as percentages (for absolute numbers, see Supplemental S6). There was a clear trend towards higher severity in the post-intervention procedure in the CCl4 group (also see the "intervention (post)" coefficient in model II). Here, the severity in the post-intervention was always higher than before an
Orbital tightening data were summarized and grouped by "treatment" and "intervention". Since the orbital tightening variable showed mixed distributions over time (Supplemental Material S7) and the time-independent distribution was also not normally distributed (Shapiro-Wilk test, p < 0.0001), value development was characterized as medians using a 10000-fold bootstrapping, from which the 95% confidence intervals were also obtained. The treatment-based medians were depicted and grouped by the intervention ("pre" (steel blue) / "post" (red)), together with the corresponding confidence bands (Fig. 5).
No animals were injected in week 0, which served as the baseline measurement in both treatment groups. The control group showed no significant difference between the animals at baseline and after the intervention (week 0 vs 1, W = 291.5, p-value = 0.7875, r control = 0.04). In the CCl4 group, however, a highly significant difference after treatment between weeks 0 and 1 was found, corresponding to a medium-sized effect (W = 62.5, p-value < 0.0001, r CCl4 = 0.64).

Discussion
This study addresses the possibility of simplifying the Mouse Grimace Scale for severity assessment and pain detection in mice. Various studies have shown that the Mouse Grimace Scale, despite being widely known, is not used on a large scale within the field of laboratory animal science 3,13. Our research aimed at the evaluation of the different MGS criteria and the potential simplification of its application, mainly to achieve a faster and more widespread implementation. In evaluating the MGS method, various criticisms of its use are repeatedly raised 13,20 20. In their recently published study, it was shown that the interrater variability also depends primarily on the examination criterion. There it was reported that the best agreement was achieved with the orbital tightening criterion, while the lowest agreement was achieved with nose and cheek bulge. In earlier studies, we were also able to identify gradations in the recognizability of the different criteria 15. In general, these earlier studies have shown that there were no significant differences between or within raters when they were experienced. Despite this, the different criteria cannot be recognized with equal ease. Cohen and Beths 19 give a good overview in their review of the use of grimace scales in different animal species. Looking at their reappraisals, it becomes clear that mainly criteria for changes in orbital tightening, ears, and nose are selected for assessment across all animal species. Taking together the results from the literature and the results of our study, the conclusion can be drawn that the orbital tightening criterion is a key parameter in the MGS. On the one hand, orbital tightening appears to be the most readily discernible parameter 20, and on the other hand, it has the highest influence on the MGS score (Fig. 1, Table 1). This finding was demonstrated in two independent analyses, using the MoBPS algorithm (Fig. 1) as well as the penalized least squares regression (Fig. 2). Both approaches confirmed the parameter rankings (and their combinations) in each treatment group.
Although automation by image processing and scoring algorithms is strongly demanded 3 and pushed forward 18,22,23, equal inclusion of all criteria is not yet feasible. Taking into account the various difficulties of parameter recognition, the lack of feasibility in automation, and the high effort required to examine all criteria, the question of simplification arises. Consequently, if automation is sought, there will be a need for simplified evaluation criteria. The whisker change, for example, is a criterion that is often not reliably assessed by either experienced raters or algorithms. Our approach examined, by way of example, the impact of the individual scoring criteria on the total score or on the assignment of an animal to a discrete severity level. On this basis, and given the observation that orbital tightening is the most reproducible 20 as well as the most reliably identified criterion for evaluation (Fig. 1, Fig. 2), it was selected as the assessment parameter for further investigations in our study.
The results presented in Figure 1 indicate that the orbital tightening parameter has the highest impact on the overall score, while whisker change has the least impact. While Table 1 shows that there is a high correlation between the individual parameters, it was confirmed in both groups (Fig. 2A,B) that orbital tightening ranked highest. The orbital tightening criterion mainly indicates differences in the intervention (Fig. 3A,B), especially in the CCl4 group after treatment (Fig. 3A), which was an expected pain stimulus and was therefore of particular interest in the investigation. Thus, we conclude that orbital tightening is a meaningful criterion in the grimace scale for investigating acute pain stimuli in our animal model. Rating of orbital tightening can discriminate differences between two treatment groups over time (Fig. 4A). As a pain stimulus, the injection itself and also the influence of the treatment (CCl4 vs. oil) were studied over 4 weeks. However, significant differences in the baseline values of the treatment groups can be observed. Hence, the significance of the results between the treatment groups is diminished, indicating the limitations of this study.
By examining the distribution of the assessment data in the severity classes (Fig. 4), we can show that baseline values mostly result in at most a mild, and occasionally a moderate, degree of orbital tightening. With the start of the treatments in week 1, a clear increase in severity occurred. Hence, a clear acute pain stimulus was recognized in this model (Fig. 4A). While single animals in the oil group also showed severe facial expressions in orbital tightening, this was seen in the CCl4 group in up to 14% of the cases after an intervention (Fig. 4B). This shows that the cumulation of pain compared to baseline is caused by both the intervention of the i.p. injection (oil group) and the injected substance itself, independently of time.
The development of the bootstrapped median severity estimates pre- and post-treatment of the two groups over time with their 95% confidence interval is shown in Figure 5. The estimates in the control group showed no significant differences over time. We were able to show that the injection of CCl4 has an impact on the degree of pain and can be considered, in general, a model with moderate severity (Fig. 5).
Even though the cumulative severity in the severe CCl4 class (Fig. 4B) rose from 2.3% to 13.8%, the largest shift took place in the moderate class. Here, a shift of 30% was observed (51.6-76.6%). There was no indication that the treatments or interventions caused severe pain. Instead, there was a moderate shift away from the mild class towards the moderate class. Nevertheless, some animals also showed short-term severe orbital tightening, which cannot, however, be explained by the treatment or time variable.
An overlap in confidence intervals in Fig. 5 indicates that the respective comparison showed no evidence for differences. If we look at the CCl4 group in detail, we see increased values shortly after the injection, especially in weeks 2 and 3. This indicates a painful impulse caused by the injection, which lasts over the investigation period of one hour after injection. These findings are in line with our recently published study on the severity of the CCl4 model itself 24, which showed the highest severity of the animals in various clinical and behavioral parameters also during the second week of treatment. In Fig. 5 it was also demonstrated that the animals in the control group receiving only oil injections showed only a mild to moderate degree of severity in the orbital tightening scores. We can show that there is a high positive slope within the CCl4 group, which is most evident in the first and second weeks of treatment (Fig. 5).
However, in the intervention of the control group, the pain stimulus did not seem to be caused by the medication but only by the intervention itself. The pain stimulus triggered by the injection alone did not seem to lead to either cumulative or habituation effects at these intervals. However, the negative slope in the post-intervention CCl4 group (Fig. 5) leads to smaller differences between pre- and post-intervention states over time. Consequently, the continuous decrease in the within-subjects intervention differences points towards a certain habituation effect in the CCl4 group. Although not significant, a decreasing effect of intervention severity over time (Fig. 3B) is perceivable, also supporting evidence for a possible habituation effect. However, this habituation effect in the CCl4 group may be due to the increased liver metabolism in the turnover of toxic CCl4 with the second week of treatment. These changes in liver metabolism were shown elsewhere by blood analysis in the CCl4 model 24.

Conclusion
Our study shows that in the present experimental setting, an examination with the primary focus on orbital tightening yields sufficient results for both the assessment of the degree of severity and the inter-treatment group analyses. Taking these results into account, it can be concluded that this simplification of the Mouse Grimace Scale is feasible for practical use. We suggest that this can lead to faster applicability, an easier automated procedure, and more quickly obtainable results. This is made possible by the better recognizability of the orbital tightening parameter, which will increase reproducibility through an increase in precision. A quick and simplified application is necessary when the MGS procedure is applied to more immediate settings, which can also serve as a potential target for refinement measures. This should be applied to and verified with random studies, proving that the pain stimulus shown in orbital tightening can also be detected in other stimuli. The procedure of simplification provides a basis for quick decision-making support and a further improvement in the quality of care and may provide opportunities for automated monitoring. At the same time, the MGS scoring in this study demonstrated that the severity caused by intraperitoneal injections was mainly dependent on the injected substance and not necessarily on the number of injections or the injection interval.

Materials And Methods
Ethical statement
This study was performed in accordance with the application of the 3Rs criteria as a branch project from a recently published animal study on evaluation severity assessment in fibrosis induction 24. The animals were examined retrospectively; no additional experiments were carried out. The study was performed and reported in accordance with the ARRIVE guidelines 26.

Animals and study design
Twenty-four male C57BL/6N animals (Janvier, France) of approximately 8 weeks of age were used. During the experiment, the animals were kept in a controlled SPF barrier according to the FELASA recommendations 27. Humane endpoints were set at each stage of the study to avoid severe pain, harm, or distress of the animals. These animals were weighed and then divided into two treatment groups, a CCl4 group and a control group (oil), for further investigation in a liver fibrosis model 24. For this purpose, the animals were injected i.p. with 50 µl of the treatment solution three times a week over 4 weeks (Monday, Wednesday, and Friday). On these treatment days, the MGS examination was carried out according to a set-up that we have recently published 15. Briefly, the animals were each filmed in an MGS observation box for 10 minutes. The filming was carried out 1 h before the injection and exactly 1 h after the injection of the respective animal. To investigate the effect of the intervention (= injection) between the different treatment groups, the animals were observed at the same time of day on the intervention days. At each time point, eight images were randomly selected in each video by the algorithm 15. Subsequently, these pictures were presented blinded and manually evaluated by the investigator in this study.

Data Science and Analysis
Statistical analysis and data evaluation were performed using the R software (v4.0.3 28) and the recently published algorithm for identification of the best performing variable by data-mining and cooperative game theory for evaluating study criteria (MoBPS = mining on best parameter search) 29. Data were grouped and summarized using the dplyr 30 package. Distributions were tested with quantile-quantile plots and the Shapiro-Wilk test. In the case of non-Gaussian or mixed distributions, 10000-fold bootstrapping was applied to obtain the median estimates and 95% confidence intervals (CI) (boot 31). Raw data are available at https://github.com/mytalbot/MGS_data.
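The bootstrapped median estimates described above can be illustrated with a minimal sketch. It is written in Python rather than R and uses a simple percentile interval; the interval type actually used with the boot package is not stated in the text, so this choice, the function name `bootstrap_median_ci`, and the example scores are assumptions made for illustration only.

```python
import random
import statistics


def bootstrap_median_ci(values, n_boot=10000, alpha=0.05, seed=1):
    """Percentile bootstrap for the median (illustrative, not the boot package).

    Returns (sample median, lower bound, upper bound).
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and keep the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    # Percentile interval: cut alpha/2 from each tail of the bootstrap medians.
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return statistics.median(values), medians[lo_idx], medians[hi_idx]


# Hypothetical orbital tightening scores, not taken from the study data.
scores = [1.0, 2.5, 3.0, 3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5]
median, lower, upper = bootstrap_median_ci(scores)
print(median, lower, upper)
```

Non-overlapping intervals from two such calls (e.g., pre vs. post) would correspond to the kind of evidence for a difference that is read from the confidence bands.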
To explore the variables' impact on the average picture score, two independent strategies were followed. In the first approach, the five independent criteria (orbital tightening (OT), nose bulge (NB), cheek bulge (CB), ear position (EP), and whisker change (WC)) were analyzed with the MoBPS algorithm.
MoBPS examines the ability of parameter combinations to quantify intervention effects between pre- and post-intervention conditions of treatment groups. The assumption is that multivariate measures can have greater explanatory power than single variables. Measures of univariate comparisons of treatment groups are statistical effect sizes. MoBPS modifies effect sizes to make groups of different sizes and distributions comparable and creates a multi-parameter measure M. This M is determined for each possible combination and normalized to the maximum occurring value M max. Also, the effect of each parameter on the overall measure was determined using a Shapley value.
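To make the Shapley value step concrete, the following sketch computes exact Shapley values for the five criteria from a coalition measure. The measure used here (`toy_measure`, built from the made-up `TOY_WEIGHTS` plus an arbitrary OT/NB interaction bonus) is purely hypothetical and does not reproduce the MoBPS measure M; only the Shapley averaging itself is the standard definition.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a characteristic function value(frozenset)."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


# Hypothetical per-criterion contributions (NOT values from the study).
TOY_WEIGHTS = {"OT": 0.5, "NB": 0.2, "CB": 0.15, "EP": 0.1, "WC": 0.05}


def toy_measure(coalition):
    """Hypothetical stand-in for the multi-parameter measure M."""
    total = sum(TOY_WEIGHTS[c] for c in coalition)
    if {"OT", "NB"} <= coalition:
        total += 0.1  # arbitrary interaction bonus, shared by OT and NB
    return total


phi = shapley_values(list(TOY_WEIGHTS), toy_measure)
for name, contribution in sorted(phi.items(), key=lambda kv: -kv[1]):
    print(name, round(contribution, 3))
```

The values sum to the measure of the full criteria set, so each one can be read as that criterion's share of the overall measure.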
In a second approach, a generalized linear model with a penalized maximum likelihood (glmnet) was applied 32, in which the average picture score was modeled as a function of the highly correlated grimace scale criteria and their interactions with time ("week") and intervention ("pre/post") using 10-fold cross-validation and a least absolute shrinkage and selection operator (LASSO) (α = 1) to ensure the robustness of the coefficients. The most parsimonious model within one standard error of the best-performing model was used to select the coefficients. This was calculated independently in each treatment group (control ("Oil") and "CCl4").
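The one-standard-error selection rule mentioned above can be sketched as follows. The function `lambda_one_se` and the cross-validation numbers are hypothetical; the sketch only shows how, given a mean error and its standard error per candidate λ, the largest (most strongly penalizing) λ within one standard error of the minimum would be chosen.

```python
def lambda_one_se(lambdas, cv_mean, cv_se):
    """Return (lambda at minimum CV error, lambda chosen by the 1-SE rule)."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    # Any model whose error is below this threshold counts as "as good".
    threshold = cv_mean[best] + cv_se[best]
    eligible = [lam for lam, err in zip(lambdas, cv_mean) if err <= threshold]
    # A larger lambda shrinks more coefficients to zero (more parsimonious).
    return lambdas[best], max(eligible)


# Hypothetical cross-validation summary, not results from the study.
lambdas = [0.001, 0.01, 0.1, 0.3, 1.0]
cv_mean = [0.52, 0.50, 0.51, 0.53, 0.80]
cv_se = [0.04, 0.04, 0.04, 0.05, 0.06]

lam_min, lam_1se = lambda_one_se(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

In this toy example the minimum error occurs at λ = 0.01, but λ = 0.3 is still within one standard error and would be preferred because it yields the sparser model.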
Week 0 was excluded due to rank deficiency of the intervention variable (intervention started in week 1). The input variables were scaled so that the resulting coefficients could be ranked and compared with each other.
The "most meaningful" dependent variable from the MGS ensemble was tested for both the between-treatments and within-treatment contrasts. Further, two different time resolutions (day and week) were tested. The change of default levels for these contrasts made it necessary to restructure the model for the analyses, e.g., to assess the specific coefficients in each treatment separately (see Supplemental Material S1-2 for more information). The independent variables (treatment, day/week, and intervention) were set as fixed effects (FE) and interactions. In total, three models were used in the analysis: (I) a generalized between-treatments model at the highest available time resolution (day) and with day nested in weeks as random effects (RE), (II) a within-treatment model of CCl4, excluding data from week 0 to avoid rank deficiency for the missing intervention data, and (III) the same as (II) but with the control group. The models were calculated as linear mixed-effects regressions (lmer (lme4 33, lmerTest 34)) using the animal ID as random effects (RE) in a random intercepts model with the restricted maximum likelihood estimator. The Kenward-Roger approximation of the degrees of freedom was used to calculate the confidence intervals and p-values of the mixed models.
To assess the impact of the intervention variable on animal welfare and baseline differences, a Mann-Whitney U test was used to test whether there was a difference between animals in week 0 without an intervention ("bsl" = baseline) and after an intervention ("post") in week 1. This was performed in both treatment groups (control and CCl4) under the alternative hypothesis that the true location shift was not equal to 0.
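For readers who want to see the mechanics of such a comparison, the following is a small self-contained sketch of a two-sided Mann-Whitney U test with midranks for ties and a normal approximation. It is not the R routine used in the study: it omits the continuity correction, and the effect size r = |z| / sqrt(N) is one common convention assumed here, since the text does not state how r was derived. The example data are hypothetical.

```python
import math


def midranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney(x, y):
    """Two-sided test via tie-corrected normal approximation.

    Returns (W, z, p, r); W is the U statistic of the first sample.
    Assumes the pooled values are not all identical (variance > 0).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    w = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction for the variance of W.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t**3 - t for t in counts.values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (w - n1 * n2 / 2) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    r = abs(z) / math.sqrt(n)  # assumed effect-size convention
    return w, z, p, r


# Hypothetical baseline (week 0) and post-intervention (week 1) scores.
baseline = [1.0, 1.5, 2.0, 2.0, 2.5, 1.0]
post = [3.0, 3.5, 2.5, 4.0, 4.5, 3.0]
print(mann_whitney(baseline, post))
```

With small samples an exact test would be preferable to the normal approximation shown here.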
Further, group differences in time-independent cumulative severity counts were determined with a χ² test. Post-hoc tests were calculated with the rcompanion 35 package to adjust for multiple comparisons.
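The cumulative severity counts rest on discretizing each score into the severity classes used in this study (score < 3: mild; 3 to 6: moderate; > 6: severe). A minimal sketch of that counting step is given below; the function names and the example scores are hypothetical, and the χ² test itself is not reproduced.

```python
from collections import Counter

CLASSES = ("mild", "moderate", "severe")


def severity_class(score):
    """Map a score to a severity class using the study's thresholds."""
    if score < 3:
        return "mild"
    if score <= 6:
        return "moderate"
    return "severe"


def class_fractions(scores):
    """Percentage of observations falling into each severity class."""
    counts = Counter(severity_class(s) for s in scores)
    total = len(scores)
    return {c: 100 * counts.get(c, 0) / total for c in CLASSES}


# Hypothetical pre- and post-intervention scores, not study data.
pre = [1.0, 2.0, 2.5, 3.0, 4.0, 1.5, 2.0, 3.5]
post = [2.5, 3.0, 4.5, 5.0, 6.5, 3.0, 4.0, 7.0]

pre_fractions = class_fractions(pre)
post_fractions = class_fractions(post)
print(pre_fractions)
print(post_fractions)
```

The resulting class counts per group form the contingency table on which a χ² test of group differences would operate.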
Results with p ≤ 0.05 were considered significant in all inferential tests. In all examinations, a severity assessment was performed at the following thresholds, discretizing the grimace scale into classes [Score Level = MGS < 3: mild; MGS >= 3 and <= 6: moderate; MGS > 6: severe], following the current publications 5,11,36.

Figure 3. Coefficient estimates with 95% confidence intervals from the linear mixed-effects regressions of the orbital tightening variable (red/blue color: negative/positive coefficients). (A) General between-treatments model (default levels were Oil, day 1, and intervention (pre)) with significant coefficients for intervention (post) and CCl4:intervention (post). (B) Within-CCl4 data over weeks (default level week = 1). No significant coefficients for week:intervention were found but there was evidence for a negative slope

Figure 4. (A) Distribution of orbital tightening over time contrasted by the within-subjects intervention regimes pre (steel blue) / post (red) in the two treatment groups. The untreated baseline values are shown in week 0 (dark green). Note that in week 1 the animals show higher values after the intervention (red) than in week 0. These differences were not prominent in the control group. Further, the grimace scale thresholds are shown as colored regions on the y-scale (green = mild, orange = moderate, red = severe). In the CCl4 group, more animals were found in the moderate and the severe classes than in the control group. In the CCl4 group, the animals show an elevated baseline (60% of the CCl4 animals in week 0, compared to 38.9% in the Oil group). Further, the fraction of severity was increased in both treatment groups after the intervention. (B) Time-independent cumulative severity estimation. The number of animals in each severity class was counted and expressed as a percentage (fraction). The severity classes are colorized as in A.