Design and participants
The present study used cross-sectional data from the baseline tests of the School in Motion project. This was a multicenter study involving four geographically separate regional test centers in Norway. Out of 103 invited lower secondary schools, 29 agreed to participate. Only eighth-grade students (13-14-year-olds) were invited to participate in the study (n = 2733). Informed parental consent was obtained from 76% of the invited students (n = 2084). Not all students had valid measures on all variables; Fig. 1 shows an overview of the participant flow. The participants were tested in the spring of 2017, during school time, at their respective schools. All test personnel received the same training beforehand to ensure that the tests were carried out consistently. All test procedures were approved by the Norwegian Centre for Research Data (project number 49094), and the project is in accordance with the Declaration of Helsinki for experiments involving humans.
Fig. 1 Flow chart of recruitment and participation with an overview of missing values. SES = socioeconomic status. BMI = body mass index. CRF = cardiorespiratory fitness. TDS = total difficulties score.
Participants’ weight without shoes was measured with a digital scale (Seca 899, Hamburg, Germany), and all measurements were recorded to the nearest 0.1 kg. Their clothing was noted, and their weight was adjusted in the subsequent analysis: 1 kg was subtracted for pants and/or sweater, and 0.5 kg was subtracted for shorts/tights and t-shirt. Height was measured with a portable stadiometer (Seca 123, Hamburg, Germany) and recorded to the nearest mm. The values were used to calculate individual body mass index (BMI) scores (kg/m²). None of the measurements were disclosed to the participants.
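The clothing adjustment and BMI calculation described above can be illustrated with a minimal sketch. The function names, the clothing labels and the example values are hypothetical and are not taken from the study data; only the allowances (1 kg and 0.5 kg) and the BMI formula come from the text.

```python
# Clothing allowances in kg as stated in the text; the dictionary keys are
# hypothetical labels chosen for this illustration.
CLOTHING_ALLOWANCE_KG = {
    "pants_or_sweater": 1.0,
    "shorts_tights_tshirt": 0.5,
}


def adjusted_weight_kg(measured_kg, clothing):
    """Subtract the clothing allowance from the measured weight."""
    return measured_kg - CLOTHING_ALLOWANCE_KG[clothing]


def bmi(weight_kg, height_mm):
    """Body mass index in kg/m^2; height is recorded in millimetres."""
    height_m = height_mm / 1000.0
    return weight_kg / height_m ** 2


# Hypothetical participant: 55.3 kg measured in pants, 1652 mm tall.
example_weight = adjusted_weight_kg(55.3, "pants_or_sweater")
example_bmi = bmi(example_weight, 1652)
```

Converting height from millimetres to metres before squaring keeps the result in the conventional kg/m² unit.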
Sit-ups (n/30 seconds), standing broad jump (best of two attempts) and handgrip test (best of two attempts), as described in the EUROFIT test battery, were used to measure muscular strength. Participants performed sit-ups with their knees at a 90-degree angle, their fingers locked behind their head, and their feet held to the floor by test personnel. For a repetition to count, the participants had to touch their knees with their elbows on the way up and touch the floor with their shoulders on the way down. Participants performed the standing broad jump by jumping as far as they could from a standstill position, and the distance was recorded from the heel closest to the starting point. Measurements were recorded to the nearest cm. The handgrip strength test was performed with the participants’ dominant hand: they held their arm down alongside their body and gripped a dynamometer (Baseline® Hydraulic Hand Dynamometer, Elmsford, NY, USA) as hard as they could for three seconds. Measurements were recorded to the nearest kg.
CRF was assessed by a 10-minute intermittent running test. The participants ran between two marked lines, 16 meters apart, inside a gymnasium. They ran for 15 seconds, then paused for 15 seconds on the test leader’s whistle. They were required to touch the floor behind the line with one hand before turning and running back. The procedure was repeated for 10 minutes. According to the test protocol, the intended distance between the lines is 20 meters; however, limited space in many school gymnasiums compelled us to set a new standard distance of 16 meters. Because of this, we could not estimate maximum oxygen uptake from the test results. We therefore use running distance in meters (m) as an indirect measure of CRF when describing our results.
To measure mental health, the participants completed a Norwegian-language version of the Strengths and Difficulties Questionnaire [SDQ; 36]. The questionnaire consists of 25 items divided into five subscales, which cover emotional symptoms, conduct problems, hyperactivity, peer relationships and prosocial behavior. The questionnaire contains statements such as “I worry a lot”, “I am easily distracted, I find it difficult to concentrate” and “Other people my age generally like me”. Participants reply to the statements on a three-point Likert scale: “not true”, “somewhat true” and “certainly true”. Each subscale is scored from 0 to 10. Except for the prosocial subscale, a higher score signifies a higher degree of difficulties. A high score on the prosocial subscale signifies social strengths. The scores from all subscales except the prosocial subscale are summed to create the total difficulties score (TDS). TDS ranges from 0 to 40 and is a dimensional measure of mental health for children and adolescents, which means that on a population level, there is a detectable reduction in psychopathology for each one-point reduction on the scale. It therefore represents an indication of the general mental health state in the measured population; in the remainder of the paper, we refer to the outcome as either TDS or psychological difficulties. The psychometric properties of the SDQ have been validated in several countries [38-40], including Norway.
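The construction of the TDS from the subscale scores can be sketched as follows. This is an illustration only: the subscale keys and the example scores are hypothetical, and the sketch starts from already-scored subscales rather than reproducing the item-level scoring of the SDQ.

```python
# The four difficulty subscales that are summed into the TDS; the prosocial
# subscale is deliberately excluded. Key names are hypothetical.
DIFFICULTY_SUBSCALES = ("emotional", "conduct", "hyperactivity", "peer")


def total_difficulties_score(subscale_scores):
    """Sum the four difficulty subscales (each 0-10) into a TDS (0-40)."""
    total = 0
    for name in DIFFICULTY_SUBSCALES:
        score = subscale_scores[name]
        if not 0 <= score <= 10:
            raise ValueError(f"{name} score {score} is outside the 0-10 range")
        total += score
    return total


# Hypothetical participant; the prosocial score is present but ignored.
example_scores = {
    "emotional": 3,
    "conduct": 1,
    "hyperactivity": 4,
    "peer": 2,
    "prosocial": 8,
}
example_tds = total_difficulties_score(example_scores)
```

Because the prosocial subscale is scored in the opposite direction (higher is better), leaving it out keeps the TDS a pure measure of difficulties.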
Other variables associated with mental health are sex, domestic or foreign birthplace, and socioeconomic status [SES; 44]. The participants’ sex was noted by test personnel, and birthplace (“Were you born in Norway?”) was assessed in the questionnaire. Parents’ education level was included as a measure of SES.
Data were managed and analyzed in IBM SPSS Statistics 25 (IBM, Armonk, New York, USA). SDQ data were scored according to the syntax provided by the SDQ information web page. The syntax summed the scores from the four subscales needed to create the TDS variable. Cronbach’s alpha was used to assess the internal consistency of TDS (α = .62).
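For readers unfamiliar with Cronbach’s alpha, a minimal sketch of the standard formula is given below. The text does not state whether alpha was computed over the four subscales or over the individual items, so the sketch accepts any set of components; the function name and the example responses are hypothetical and unrelated to the study data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of participants.

    Each row holds one participant's scores on the k components
    (items or subscales) that are summed into the total score.
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two components")
    # Sample variance of each component across participants.
    component_variances = [variance(column) for column in zip(*rows)]
    # Sample variance of the summed total score.
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(component_variances) / total_variance)


# Hypothetical scores for four participants on four components.
example_rows = [
    [2, 1, 3, 2],
    [4, 3, 5, 4],
    [1, 0, 2, 1],
    [3, 2, 4, 3],
]
example_alpha = cronbach_alpha(example_rows)
```

In this constructed example the components rise and fall together perfectly, so alpha reaches its upper bound; real questionnaire data would give a lower value.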
We created z scores stratified by sex and BMI quartile for handgrip strength, standing broad jump and sit-ups. The z scores were averaged to create one composite z score for muscular strength. SES was analyzed using only the parent with the highest education level. Parents’ education level was then categorized as “lower secondary school or less”, “upper secondary school”, “less than four years university education” or “four years or more university education”.
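The stratified standardization can be sketched as follows, assuming each participant has already been assigned a sex and a BMI quartile label (how the quartile cut points were derived is not stated in the text). The record layout, field names and example values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical field names for the three muscular strength tests.
TESTS = ("handgrip", "broad_jump", "sit_ups")


def composite_strength_z(records):
    """Return {participant id: mean z score across the three tests}.

    Each test is standardized within the participant's own stratum,
    defined by sex and BMI quartile.
    """
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["sex"], rec["bmi_quartile"])].append(rec)

    z_scores = defaultdict(dict)
    for members in strata.values():
        for test in TESTS:
            values = [m[test] for m in members]
            mu, sd = mean(values), stdev(values)
            for m in members:
                z_scores[m["id"]][test] = (m[test] - mu) / sd

    return {pid: mean(list(z.values())) for pid, z in z_scores.items()}


# Hypothetical records: two strata with three participants each.
example_records = [
    {"id": 1, "sex": "girl", "bmi_quartile": 1, "handgrip": 20, "broad_jump": 150, "sit_ups": 15},
    {"id": 2, "sex": "girl", "bmi_quartile": 1, "handgrip": 24, "broad_jump": 160, "sit_ups": 18},
    {"id": 3, "sex": "girl", "bmi_quartile": 1, "handgrip": 28, "broad_jump": 170, "sit_ups": 21},
    {"id": 4, "sex": "boy", "bmi_quartile": 1, "handgrip": 26, "broad_jump": 170, "sit_ups": 20},
    {"id": 5, "sex": "boy", "bmi_quartile": 1, "handgrip": 30, "broad_jump": 160, "sit_ups": 22},
    {"id": 6, "sex": "boy", "bmi_quartile": 1, "handgrip": 34, "broad_jump": 180, "sit_ups": 18},
]
example_composite = composite_strength_z(example_records)
```

Standardizing within strata means a composite score of zero indicates average strength relative to peers of the same sex and similar BMI, rather than relative to the whole sample.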
Out of 2045 participants, 27% (n = 559; girls = 38.2%) had at least one missing value. A new grouping variable was created to analyze differences between participants with all values (n = 1486) and participants with missing values (n = 559). The primary analyses described below were carried out on the complete-case group only, while extensive missing value analyses were conducted to examine whether the missing data influenced the primary results.
Complete-case primary analyses
Descriptive statistics were calculated and are presented as means and standard deviations (SD). Seven linear mixed effects models were fitted with TDS as the dependent variable. In models one to six, we assessed the separate associations between TDS and the muscular strength variables and the health-related fitness components. In the seventh model, the fitness components were adjusted for each other. All models were adjusted for the covariates (sex, domestic birthplace and SES). We report estimates (unstandardized coefficients) and their 95% confidence intervals (95% CI). Estimates reflect the change in TDS associated with a one-unit increase in the independent variable. Initial linear mixed effects modelling showed no statistically significant interaction effects between sex and the physical fitness variables, using TDS as the dependent variable. To account for possible clustering of observations within schools, school site was included as a random effect in all models. A p value < .05 indicated statistical significance.
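As an illustration, one of the single-predictor models can be written out as below. This is a sketch under assumptions rather than the exact specification fitted in SPSS: it assumes a random intercept for school, and the symbols (i for participant, j for school, x for the fitness variable) are introduced here for illustration. SES is shown as a single term for brevity, although a four-category variable would enter as a set of indicator terms.

```latex
\begin{aligned}
\mathrm{TDS}_{ij} &= \beta_0 + \beta_1 x_{ij} + \beta_2\,\mathrm{sex}_{ij}
  + \beta_3\,\mathrm{birthplace}_{ij} + \beta_4\,\mathrm{SES}_{ij}
  + u_j + \varepsilon_{ij}, \\
u_j &\sim \mathcal{N}(0, \sigma_u^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{aligned}
```

Under this reading, the reported estimate for a fitness variable corresponds to the coefficient on x, and the school-level term absorbs differences in average TDS between schools.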
Missing value analyses
To assess whether missing values were missing completely at random (MCAR), Little’s MCAR test was used. The analysis did not support MCAR (χ² = 104.331, df = 24, p < .001). Pattern analysis (not shown) indicated that the data were likely missing at random (MAR). A possible explanation for the missing values is that we never forced the participants to complete the tests, which may have caused some participants to opt out. For instance, many stated that they did not want to run the CRF test. Moreover, the SDQ was one of many components in a large and extensive questionnaire. The missing data from the SDQ may be a consequence of the size and duration of the questionnaire, which may have caused many to quit before completion. However, this is unclear, and there may be other reasons unknown to us.
One-way ANOVA was used to assess differences between the complete-case group and the missing-values group. Pearson’s correlation analysis was applied to the fitness variables and TDS to examine whether associations were similar in both groups. Multicenter studies are vulnerable to differences in missingness between test centers, and this was examined using frequency statistics. As our final step in handling the missing values, we employed multiple imputation [48, 49]. Five imputations were generated from relevant variables, using the automatic procedure with 10 iterations, under the assumption that data were missing at random. A linear mixed effects model was fitted to the imputed dataset, with TDS as the dependent variable and all health-related physical fitness components and covariates entered as independent variables. The imputed dataset results are presented in addition to the complete-case results, as recommended by Manly and Wells and Sterne et al.
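Estimates from multiply imputed datasets are commonly combined using Rubin’s rules. A minimal sketch of that pooling step is given below. It is an assumption that the software used here pooled results in exactly this way, and the five estimates and standard errors are hypothetical values, not results from the study.

```python
from statistics import mean, variance


def pool_rubin(estimates, standard_errors):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    # Average sampling variance within the imputed datasets.
    within = mean([se ** 2 for se in standard_errors])
    # Variance of the estimates between the imputed datasets.
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total ** 0.5


# Hypothetical coefficient estimates and standard errors from five imputations.
example_estimates = [-0.10, -0.12, -0.08, -0.11, -0.09]
example_ses = [0.03, 0.03, 0.03, 0.03, 0.03]
pooled_estimate, pooled_se = pool_rubin(example_estimates, example_ses)
```

The pooled standard error is larger than any single-dataset standard error because it also reflects the uncertainty introduced by the imputation itself.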