Although most SOPARC studies report high reliability and the protocol has been widely accepted and adopted in outdoor recreation research [6, 7, 27, 29, 30], it is still important to understand how the reliability of observations might be compromised by factors such as the characteristics of the park users being observed or contextual conditions. With SOPARC increasingly used to assess park use behaviors and preferences in diverse communities and environments [8, 31, 32], it is even more important to understand how its reliability can change across observation contexts. Identifying which factors lead to lower reliability can help improve future SOPARC training protocols.
In this context, this study uses 4725 paired SOPARC observations conducted in 20 New York City parks during Spring and Summer 2017 to analyze the reliability and interobserver agreement of observers using SOPARC to assess race/ethnicity and physical activity in different park settings. Results concur with a large body of evidence regarding SOPARC reliability for observing park users and physical activity [4, 7, 10, 33, 34]. High levels of reliability were achieved when counting the number of people in the parks (ICC = 0.92), indicating excellent agreement beyond chance. Reliability scores, however, were affected by the population being observed, the physical activity level, and the contextual conditions and settings of the target area at the time of the observation.
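As an illustration of how an ICC for paired counts can be obtained, the following is a minimal sketch, not the procedure used in this study. It assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1); the text above does not state which ICC form was applied. The function name `icc_2_1` and the example counts are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per observed target (e.g. one scan of
    one target area); each row holds the counts recorded by each observer.
    Assumption: this ICC form is chosen for illustration only.
    """
    n = len(ratings)       # number of targets
    k = len(ratings[0])    # number of observers (2 for paired observations)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-target mean square
    msc = ss_cols / (k - 1)               # between-observer mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical paired counts (observer A, observer B), invented for illustration.
example_counts = [(3, 3), (12, 10), (0, 0), (7, 8), (25, 21), (5, 5)]
print(round(icc_2_1(example_counts), 3))
```

Under this form, systematic differences between observers (one observer consistently counting more people) lower the coefficient, which is why it is often preferred when absolute counts matter.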
Reliability and race/ethnicity
Although observations of all three targeted race/ethnicity groups (Asian American, Latino, and African American) achieved high levels of reliability (ICC > 0.75), agreement between observers was harder to achieve when counting Latino park users. This was also recently observed by Banda et al. [8]. Lower interobserver agreement when counting Latinos in a park area can be explained by race/ethnicity being a socially constructed classification that depends both on phenotype and on the meanings associated with it. In our case, parks located in neighborhoods with predominantly Latino populations had a more mixed racial/ethnic composition than parks located in Asian American neighborhoods. The fact that Latinos and African Americans were usually found together in the same parks, while the population in predominantly Asian American parks was more homogeneously Asian, might help explain why observers had more trouble agreeing when observing Latino and African American park users and agreed more often when classifying Asian park users.
Low reliability when assessing the number of Latinos in an area can also be explained by a potential two-way misclassification. In our experience during training, trying to assess the race/ethnicity of a potentially Latino person from a distance often led to discussions about how to classify that person. Skin color plays an important role in racial and ethnic identification [35, 36], and because the Latino population typically exhibits larger intragroup phenotype variation [37], doubts about how to classify Latino park users were recurrent. In informal discussions, raters mentioned particular difficulty distinguishing between Latino and White; Latino and South Asian (India, Pakistan); and Latino and African American. Disagreements regarding the race/ethnicity of a potentially Asian person were often discussed as a decision between Asian and White, and only in some cases between Asian and Latino. In the same way, disagreements regarding the race/ethnicity of a potentially African American person involved only a decision between African American and Latino and, in some cases, between African American and White. Also noteworthy is that, when in doubt, observers were instructed to default to the “other” race/ethnicity category. This allows for a scenario in which the first observer has doubts about how to classify a potentially Latino person and defaults to “other”, while the second observer has no doubts and classifies that person as Latino.
In any case, our findings suggest that most studies using SOPARC to identify Latino park use might actually be analyzing only the behavior of Latinos with darker skin tones, who are more easily classified as Latino. Given the relevant socioeconomic inequalities between light- and dark-skinned Latino individuals within the same ethnic groups [37], it is important that future SOPARC studies acknowledge and address this limitation. With the growth of Latino populations in the US and the need to design tailored public policies that encourage physical activity, these findings could be valuable when designing future park-use studies. SOPARC training should devote specific attention to how to properly assess race/ethnicity based on phenotype characteristics. Nonetheless, these characteristics should only be applied in direct observations if race/ethnicity is an important individual variable for the study. Employing local community members in SOPARC observations has been reported as a way to help overcome some of these issues [38], although it should be noted that past studies did not find evidence of an association between observers’ demographic characteristics and better identification of race/ethnicity traits [36, 39].
Reliability and physical activity
Agreement on physical activity levels was even harder to reach than agreement on race/ethnicity, consistent with the original reliability assessment by McKenzie et al. [33]. Our results suggest that observing sedentary and vigorous activities might be easier than observing moderate physical activity. This is a counterintuitive finding, as more dynamic activity should be harder to assess, but it was also recently reported by Santos et al. [10]. Once again, this can be partially explained by regression to the mean and by the fact that moderate physical activity is adjacent to two other categories, which allows for two types of misclassification (sedentary-moderate and vigorous-moderate), while sedentary and vigorous activity are each adjacent only to the moderate category. Other possible explanations are that counting immobile (sedentary) people might be easier than counting moving people, and that observers might unconsciously fixate on the most dynamic (vigorous) movement in the target area. While training provided a clear list of activities and the physical activity level at which each should be classified, it is also possible that observers unconsciously defaulted some activities that should have been classified as vigorous to moderate. All of these factors might help explain why reliability is lower when counting moderate physical activity and higher when assessing sedentary or vigorous physical activity, and they should be accounted for in future training.
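The adjacency argument above can be made concrete with a toy calculation. The sketch below is an illustrative assumption, not a model fitted to the study data: it supposes that each observer independently slips from the true category into each adjacent category with the same probability `slip`, and it computes the probability that two observers record the same category for one person. The function names and the value of `slip` are hypothetical.

```python
CATEGORIES = ["sedentary", "moderate", "vigorous"]


def recording_distribution(true_index, slip):
    """Probability of each recorded category, given the true category.

    Assumption: an observer slips into each adjacent category with
    probability `slip` and never skips over a category.
    """
    neighbours = [i for i in (true_index - 1, true_index + 1)
                  if 0 <= i < len(CATEGORIES)]
    if slip < 0 or slip * len(neighbours) > 1:
        raise ValueError("slip is too large for a valid probability distribution")
    probs = [0.0] * len(CATEGORIES)
    for i in neighbours:
        probs[i] = slip
    probs[true_index] = 1.0 - slip * len(neighbours)
    return probs


def agreement_probability(true_index, slip):
    """Probability that two independent observers record the same category."""
    probs = recording_distribution(true_index, slip)
    return sum(p * p for p in probs)


for index, name in enumerate(CATEGORIES):
    print(name, round(agreement_probability(index, slip=0.10), 2))
```

With `slip = 0.10`, this toy model gives an agreement probability of about 0.82 for sedentary and vigorous but about 0.66 for moderate, purely because the middle category has two routes to misclassification. Real observations are counts per target area rather than single classifications, so the sketch only illustrates the direction of the effect.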
Reliability, contextual conditions and target area settings
Regarding the contextual conditions and target area settings that might affect reliability scores, an important finding of this study is that interobserver agreement did not decay with each additional round of observation. The SOPARC protocol with the modified format seems to be adequate, as the quality of observations was not affected by the amount of time spent observing or by observer fatigue. Beyond that, the most important contextual conditions affecting interobserver reliability were the number of people present in the target area and the type of target area being observed. Observers achieved very high agreement when observing areas with five or fewer people. The type of target area also significantly affected reliability: while swing areas recorded very high interobserver reliability, basketball courts and playgrounds reached low agreement rates, close to 60%.
These low reliability statistics in specific areas of the park can be partially explained by a combination of factors: some race/ethnicity groups are more difficult to assess than others, some physical activity levels are harder to identify, and areas with more people might be harder to assess. African American males, for instance, are more frequently observed engaging in vigorous physical activity on basketball courts [15]. The combination of more people in the target area and a highly dynamic activity can lower interobserver reliability. Similarly, complex cognitive tasks, such as identifying race/ethnicity and physical activity at the same time, can work well in calm and well-defined areas such as swings but can prove tricky in intricate areas such as playgrounds.
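Agreement rates broken down by target area type or by crowd size, as discussed in this section, can be tabulated with a simple grouping step. The following is a minimal sketch under stated assumptions: a paired observation counts as an agreement only when both observers record exactly the same count, which may differ from the definition used in this study, and the record fields (`area`, `observer_a`, `observer_b`) and the example records are hypothetical.

```python
from collections import defaultdict


def percent_agreement_by_group(records, key):
    """Percent of paired observations with identical counts, per group.

    `records` is an iterable of dicts with hypothetical fields
    'observer_a' and 'observer_b'; `key` maps a record to its group label.
    Assumption: agreement means exactly equal counts.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        if record["observer_a"] == record["observer_b"]:
            matches[group] += 1
    return {group: 100.0 * matches[group] / totals[group] for group in totals}


def crowd_size_bin(record):
    """Split at five people, using the larger of the two recorded counts."""
    largest = max(record["observer_a"], record["observer_b"])
    return "5 or fewer" if largest <= 5 else "more than 5"


# Invented records for illustration only; not data from the study.
example_records = [
    {"area": "swings", "observer_a": 2, "observer_b": 2},
    {"area": "swings", "observer_a": 4, "observer_b": 4},
    {"area": "basketball court", "observer_a": 9, "observer_b": 11},
    {"area": "basketball court", "observer_a": 6, "observer_b": 6},
    {"area": "playground", "observer_a": 14, "observer_b": 12},
]

print(percent_agreement_by_group(example_records, key=lambda r: r["area"]))
print(percent_agreement_by_group(example_records, key=crowd_size_bin))
```

Passing the grouping rule as a function keeps the same tabulation usable for any contextual variable, such as observation round or time of day, without changing the agreement definition.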
Reliability and the combination of physical activity, race, context, and target area settings
Trying to assess a hard-to-measure target, such as physical activity categories that are hard to distinguish, in a difficult-to-observe target area can push reliability values substantially below acceptable thresholds. This variability was also described by Chung-Do et al. [40], who reported agreement rates as high as 0.94 for sedentary girls and as low as 0.44 for vigorous boys.
Our results suggest that observers might benefit from subdividing target areas whenever more than five people are present or when more moderate activities are taking place. Modifying the SOPARC form to anchor observations on race/ethnicity and age instead of sex and physical activity could also improve reliability, since race and physical activity are the more difficult-to-observe variables. Whenever possible, researchers should also consider balancing the number of scans against the number of inputs to be recorded per scan. Transitioning towards more scans per observation (one scan for each sex, age, and race/ethnicity, with each scan recording only the physical activity level) could alleviate problems associated with large counts. While technological assistance such as iSOPARC can be valuable for streamlining coding and data management [10], future SOPARC studies that want to keep increasing the amount of information gathered per scan should consider a technological change or accept the drawback of lower reliability scores. In the future, if the appropriate permits are obtained, researchers might want to start using pictures or video cameras, which can provide static records of the conditions of the park area and its users that can later be examined more thoroughly by human researchers or by machine learning [41].