Rasch Model of the COVID-19 Symptom Checklist - Towards a Higher Quality of Self-Reported Measurement

Background: Inaccurate measurement with self-reported instruments, including questionnaires and symptom checklists, jeopardizes the comparability of results. We therefore used advanced psychometric modelling to determine whether an online self-reported COVID-19 symptom checklist met the fundamental principles of measurement or whether adaptations were necessary to increase measurement precision. Methods: Fit to the Rasch model was examined in a sample of 1638 Austrian citizens who completed an online COVID-19 symptom checklist on up to 20 days during a period of restrictive country-wide COVID-19 measures. Results: The longitudinal application of the self-reported COVID-19 symptom checklist increased the fit to the Rasch model. The items ‘fatigue’, ‘headache’ and ‘sneezing’ had the highest likelihood to be affirmed. The item ‘cough’ showed a significant misfit to the fundamental measurement model and an additional dependency to ‘dry cough/no sputum production’. Several personal factors, such as gender, age group, educational status, COVID-19 test status, comorbidities, immunosuppressive medication, pregnancy and pollen allergy, led to systematic differences in the patterns of how symptoms were affirmed. Adjustments ranged from ±0.25 to ±0.01 on the metric scales (0 to 10) to which the raw scores were transformed. Conclusion: Except for some basic adaptations, the present analysis supports the combination of items. More accurate item wordings co-created with lay persons and adjustments for personal factors would increase measurement precision of the self-reported COVID-19 symptom checklist.


Introduction
The novel Coronavirus Disease 2019 (COVID-19) spread rapidly worldwide and the number of cases increased globally at an accelerated rate [1,2]. While measures are needed to reduce the spread of the virus and mitigate the impact of the pandemic, effective monitoring is essential to tailor these measures to the current situation. Self-reported symptom tracking contributes to monitoring [3], and a number of self-reported symptom checklists and trackers for COVID-19 exist; some of them also allow their users to estimate the risk of a SARS-CoV-2 infection [4]. Moreover, self-reported symptom tracking could increase patient safety at home for COVID-19 patients who do not need hospitalisation. In oncology, even survival could be improved by collecting self-reported side-effects in a timely manner [5]. Self-reported data are thus a vital part of a comprehensive outcome measurement scheme, and large international research initiatives are developing the technical, ethical and legal infrastructure to collect such data at country level.
Self-reported patient data are acquired with specific instruments. To be fit-for-purpose, these tools must fulfill certain psychometric properties. Most commonly, psychometric properties are examined on the basis of the classical test theory [6-9].
However, the classical test theory focuses on overall, sample-based statistics, which provide little insight into how individual items actually work [9]. Furthermore, the classical test theory makes specific assumptions, such as normally distributed populations and interval-scaled response data, that are rarely met in practice. An alternative psychometric approach to assess whether an instrument fulfills the fundamental principles of measurement is the application of the Rasch model [10]. It overcomes some of the simplistic assumptions of the classical test theory and provides insight at the item level. Based on the Rasch model, items can be brought into a hierarchy according to their level of 'difficulty' or likelihood to be affirmed, which means that more 'difficult' items are only affirmed by persons with higher 'abilities'. A range of possible misfits to the Rasch model exists. A basic test compares actual responses to the responses expected from the parameter estimates. Another test checks whether the expected response to a specific item differs between respondents from different sub-groups of persons (e.g. females versus males, people from different age groups, or people with versus without co-morbidities) although they have the same 'ability'/likelihood to affirm a certain number of items. This is referred to as differential item functioning [11]. If differential item functioning is apparent, parameter invariance is violated and findings between different groups of people cannot be compared without adjusting the scale. Related to parameter invariance is local dependency. Local dependency represents an additional dependency between items beyond the relationship explained by the latent construct that the instrument measures. Thus, local dependency also distorts the metric of the measure. Another fundamental requirement of measurement is unidimensionality [12], which applies whenever a scale is to be summarized by one measure.
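For orientation, the dichotomous Rasch model referred to above is commonly written as shown below. The symbols are standard textbook notation and an assumption of this sketch, not taken from the original text: $\theta_n$ is the location ('ability') of person $n$, $\delta_i$ is the location ('difficulty') of item $i$, and $X_{ni}$ is 1 if person $n$ affirms item $i$ and 0 otherwise.

```latex
P(X_{ni} = 1 \mid \theta_n, \delta_i)
  = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
```

Under this form, an item with a lower location $\delta_i$ has a higher probability of being affirmed by a person at any given location $\theta_n$, which is what allows items to be ordered into a hierarchy.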
Although a variety of self-reported COVID-19 symptom screening checklists exists [4], items are often not formulated in a standard manner and psychometric information on these instruments is scarce. This lack of knowledge about the psychometric properties of the symptom checklists prevents us from judging whether we use them in the right way or whether adaptations might be necessary. As a larger number of symptoms might increase the likelihood of a suspected case, summing symptoms might be similar to the measurement of a latent construct. Furthermore, a recent Cochrane review concluded that individual COVID-19 signs and symptoms, such as cough, sore throat, fever, myalgia or arthralgia, fatigue, and headache, had very poor diagnostic power [13]. Moreover, inaccurate measurement in self-reported instruments jeopardizes the comparability of the results. We therefore used the Rasch model to determine whether an online self-reported COVID-19 symptom checklist met the fundamental principles of measurement or whether adaptations were necessary to increase measurement precision.

Methods
We conducted a psychometric analysis using a sample of 1638 Austrian citizens who completed an online COVID-19 symptom checklist on up to 20 days during a period of restrictive country-wide COVID-19 measures. After the first confirmed cases of COVID-19 in Austria on 25 February 2020, nationwide infection control measures were ordered by the Austrian government from 16 March 2020 onwards. Public life in Austria remained severely affected before the first easing measures were implemented in mid-April 2020 [14]. The self-reported, online COVID-19 symptom checklist used in the present study was therefore available from 22 March to 30 April 2020.
We developed the checklist based on the WHO symptom descriptions for COVID-19 [15] and included fever, fatigue, cough, dry cough/no sputum production, pain in limbs, sore throat, headache, shortness of breath, chills, vomiting, diarrhea, nasal congestion, sneezing, sniffles/rhinitis and smell and taste disorders (Supplement Table A). Response options and scores for each item were as follows: 'Yes' (scoring 1), 'No' (0) and 'I can't say that' (3). For the analysis, we dichotomized the items by collapsing 'No' and 'I can't say that' because we assumed that these two answers would indicate that the participant had not experienced a certain symptom. In addition, we asked the participants to state gender, age group, the highest completed education, the current smoking status, body height and weight, whether any type of COVID-19 test had been conducted and, if so, what the result of this test was, whether any comorbidities existed (nervous system, cardiovascular, gastrointestinal, liver, kidney, oncologic, high blood pressure and/or diabetes), whether he or she was taking immunosuppressive medication and whether the participant was pregnant (females only). We used a self-reported code consisting of numbers and strings based on first name initials, initials of first names of relatives and birth months to allocate multiple assessments to the correct participants while guaranteeing anonymity. Due to the psychometric nature of this study, only complete cases were included. The relevant ethical committees approved the study (Medical University of Vienna 1379/2020, Medical University of Innsbruck 1076/2020 and ethical committee of the region Vorarlberg).
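The dichotomization described above can be illustrated with a minimal sketch. The response codes (1, 0, 3) come from the checklist description; the function names and the example record are hypothetical and only serve as illustration, not as the code used in the study.

```python
# Response codes as described for the checklist:
# 'Yes' = 1, 'No' = 0, "I can't say that" = 3.
YES, NO, CANNOT_SAY = 1, 0, 3


def dichotomize(score):
    """Collapse 'No' and "I can't say that" into 0 and keep 'Yes' as 1."""
    if score not in (YES, NO, CANNOT_SAY):
        raise ValueError(f"unexpected response code: {score!r}")
    return 1 if score == YES else 0


def dichotomize_record(record):
    """Apply the collapsing rule to one assessment (item name -> raw score)."""
    return {item: dichotomize(score) for item, score in record.items()}


# Hypothetical example record, not real study data.
example = {"fever": 0, "fatigue": 1, "cough": 3}
collapsed = dichotomize_record(example)
```

Raising an error for unknown codes is a design choice of this sketch; it keeps silent miscoding out of the dichotomous data that the Rasch analysis relies on.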

Fit to the Rasch measurement model
Overall and item-based fit to the Rasch model was explored in a series of dichotomous models [16] using two different data sets: one data set with the questionnaire filled in for the first time when the participants entered the study and another data set in which we recorded a symptom as affirmed by a participant if it was ticked at least once during the period of nationwide restrictive measures. We used raw scores without weighting [17] and determined the hierarchy of items based on their location parameters.
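As an illustration of how the two data sets described above could be derived from repeated assessments, consider the following sketch. The tuple layout `(participant_code, day, responses)` and the function name are assumptions made for this example; the original text does not describe the data structure used.

```python
def build_first_and_ever(assessments):
    """Derive the 'first' and 'ever' data sets from repeated assessments.

    assessments: iterable of (participant_code, day, responses) tuples,
    where responses maps item names to dichotomized scores (0 or 1).
    This input layout is hypothetical.
    """
    first = {}      # responses from each participant's earliest assessment
    first_day = {}  # earliest day seen so far per participant
    ever = {}       # item is 1 if affirmed on at least one day
    for code, day, responses in assessments:
        if code not in first_day or day < first_day[code]:
            first_day[code] = day
            first[code] = dict(responses)
        merged = ever.setdefault(code, {})
        for item, value in responses.items():
            merged[item] = max(merged.get(item, 0), value)
    return first, ever


# Hypothetical example data, not real study data.
assessments = [
    ("A1", 2, {"fever": 0, "cough": 1}),
    ("A1", 1, {"fever": 0, "cough": 0}),
    ("A1", 3, {"fever": 1, "cough": 0}),
    ("B2", 1, {"fever": 0, "cough": 0}),
]
first, ever = build_first_and_ever(assessments)
```

In this example, participant A1 affirms nothing at the first assessment but affirms both symptoms in the 'ever' data set, which shows why the second data set contains fewer zero scores.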
The item location parameter refers to the likelihood of each item to be affirmed. Items differ in this likelihood, so a hierarchy of items can be determined based on it. Likewise, a hierarchy of persons can be established based on the likelihood that a person affirms more or fewer items.
Item fit residuals between −2.5 and +2.5 with non-significant F-tests represented individual item fit. Non-significant chi-squared values were interpreted as fit to the latent trait. Local dependency between items was determined using residual correlations based on a cut-off of 0.2 above the mean [18]. To assess the instrument's item-based internal consistency and reliability, we compared Cronbach's alpha with the person separation index (PSI). The PSI refers to the reproducibility of relative measure location and indicates whether a scale is able to distinguish between people with higher and lower levels of the concept measured by the instrument [19]; in general, a PSI ≥ 0.7 indicates that the instrument is sufficiently suitable for group comparisons.
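The local dependency criterion mentioned above (residual correlations more than 0.2 above the mean) can be sketched as follows. This is a minimal illustration that assumes the residual correlation matrix has already been computed by the Rasch software; the function name and the correlation values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def flag_local_dependency(residual_corr, items, margin=0.2):
    """Flag item pairs whose residual correlation exceeds mean + margin.

    residual_corr: square matrix (list of lists) of residual correlations.
    items: item names in the same order as the matrix rows and columns.
    Only the off-diagonal upper triangle is used.
    """
    pairs = list(combinations(range(len(items)), 2))
    values = [residual_corr[i][j] for i, j in pairs]
    cutoff = mean(values) + margin
    flagged = [
        (items[i], items[j]) for i, j in pairs if residual_corr[i][j] > cutoff
    ]
    return cutoff, flagged


# Hypothetical residual correlations, not values from the study.
items = ["cough", "dry cough", "fever"]
residual_corr = [
    [1.00, 0.45, -0.10],
    [0.45, 1.00, -0.05],
    [-0.10, -0.05, 1.00],
]
cutoff, flagged = flag_local_dependency(residual_corr, items)
```

With these made-up values the mean off-diagonal correlation is about 0.1, so the cut-off is about 0.3 and only the pair cough and dry cough is flagged.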

Unidimensionality
To test unidimensionality, we used an approach proposed by Smith [20] and combined principal component analysis of the item residuals with a series of t-tests to assess whether subsets of items whose residuals loaded positively or negatively resulted in different estimates of person parameters. These sets of items were chosen to maximize the contrast between them and were thus most likely to violate the assumption of unidimensionality [21].

Differential item functioning
Differential item functioning was assessed separately for each item by comparing the expected responses to a specific item between respondents from different sub-groups who shared the same likelihood to affirm a certain number of items. The sub-groups were built based on gender (female, male, diverse/other), age group (10 sub-groups listed in Table 1), highest completed education (6 levels listed in Table 1), COVID-19 test status (positive/negative/no test), comorbidities (yes/no), immunosuppressive medication (yes/no), pregnancy (yes/no), current smoking status (yes/earlier, but not now/never) and body mass index (BMI; above versus below median). As only seven participants indicated 'diverse/other' for gender, we did not include this category in the differential item functioning analysis and compared only female and male participants. If differential item functioning was apparent in an item for a personal factor with more than two categories, we determined between which sub-groups these differences occurred using post hoc analysis of the residual means.

Person-item targeting
Person-item targeting was inspected graphically using a person-item map.

Transformation to a metric interval scale
Based on the logit scale from the Rasch model, we transformed the raw sum scores into a metric scale. If differential item functioning existed for a personal factor, we split the respective item and performed a separate metric transformation for each sub-group. We used the differences between these metric scales to adjust the scores for the respective personal factor, e.g. for people with and without comorbidities. All analyses were performed with either Microsoft Excel, RUMM2030 or the eRm and ltm packages in R (www.r-project.org).
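The text above does not state the exact formula used for the metric transformation. The following sketch therefore assumes a simple linear rescaling of logit values to a 0 to 10 range, which is one common choice, and a plain difference between two sub-group scales as the adjustment. Function names and all numbers are hypothetical.

```python
def rescale_logits(logits, low=0.0, high=10.0):
    """Linearly map logit values onto the interval [low, high].

    Assumption of this sketch: the smallest logit maps to `low` and the
    largest to `high`; the study's actual transformation may differ.
    """
    lo, hi = min(logits), max(logits)
    if hi == lo:
        raise ValueError("logits must span a non-zero range")
    return [low + (x - lo) * (high - low) / (hi - lo) for x in logits]


def subgroup_adjustment(scale_a, scale_b):
    """Difference between two sub-group metric scales at each raw score."""
    return [a - b for a, b in zip(scale_a, scale_b)]


# Hypothetical logit values for raw sum scores 0..3, not study results.
metric = rescale_logits([-4.0, -2.0, 0.0, 4.0])

# Hypothetical metric scales for two sub-groups at the same raw scores.
adjustment = subgroup_adjustment([0.0, 2.5, 5.0], [0.0, 2.25, 5.0])
```

In this made-up example the adjustment is largest in the middle of the scale and zero at the ends, which mirrors the observation that adjustments can differ in magnitude over the range of a metric scale.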

Results
Participants from all age groups, ranging from 0 to 9 years to ≥ 90 years, filled in the symptom checklist. Two thirds of the participants (66%) were female (

Diagnosis of measurement problems
The data set with the first responses of the participants (model 1 in Table 2) showed a considerable misfit to the Rasch model and a substantial discrepancy between PSI (-0.06) and Cronbach's alpha (0.68). The second data set, in which a symptom was recorded as affirmed if it was scored at least once during the observation period (model 2 in Table 2), showed a better model fit (Table B).

Table 2 Model fit statistics. Mean item log residual test of fit, item-trait interaction chi-square statistics and Root Mean Square Error of Approximation (RMSEA) were calculated to assess model fit. Model 1 ("First") refers to the questionnaires filled in for the first time when the participants entered the study. In Model 2 ("Ever"), a symptom was considered affirmed by a participant if it was ticked at least once during the period of nationwide restrictive measures.

CI = Confidence interval
Only one item (cough) had a significantly deviating F-test, with an item-based fit residual below −2.5 and a significantly deviating chi-squared value (Table 3). Fatigue, headache and sneezing were the items with the highest likelihood to be affirmed, whereas fever had the lowest probability of affirmation by the participants, followed by dry cough and chills.

Table 3 Item fit statistics sorted in ascending order according to item location in the data set with the items affirmed if they were scored positively at least once by each participant during the observation period. The smallest (negative) item location of fatigue implies that it was the item with the highest likelihood to be affirmed, whereas fever showed the lowest likelihood. Only one item (item 3, cough) showed a fit residual below the threshold of −2.5 and a significant F-test, which represented individual item misfit. Cough also had a significant chi-squared value, which can be interpreted as misfit to the latent trait. The Bonferroni corrected significance level was 0.000667. Significances are highlighted in bold letters. Differential item functioning (DIF; depicted in the far right column) exists if the expected response to a specific item differs between respondents from different sub-groups of persons (e.g. females versus males, people from different age groups or people with and without co-morbidities), even though they have the same 'ability'/likelihood to affirm the same number of items.

Adjustment for differential item functioning
All personal factors, except for smoking status and BMI, led to differential item functioning in certain items (Table 3). As expected, taste and smell disorders were more likely to be affirmed by participants with a positive COVID-19 test result, participants with pollen allergy were more likely to indicate nasal congestion and sneezing, and pregnant women were more likely to experience vomiting. Less obvious was that participants with comorbidities exhibited a higher likelihood to indicate shortness of breath and were less likely to score headache when compared to participants without comorbidities; pain in the limbs was more often affirmed by persons with immunosuppressive medication than by persons without; women were more likely to indicate headache when compared to men, and men indicated rhinitis/sniffles more often than women (Fig. 1). Participants with higher educational status (completed post-secondary non-tertiary education or first stage of tertiary education) were more likely to indicate fatigue than people with completed apprenticeships (the significant absolute residual mean differences in the post hoc analysis were 0.29 and 0.4, respectively) and unfinished compulsory education (the significant absolute residual mean differences were 0.5 and 0.61, respectively).
Likewise, people with completed second stage of tertiary education were more likely to affirm fatigue in comparison to participants with unfinished compulsory education (the significant absolute residual mean difference was 0.5). Fatigue was also most prevalent in younger adults (from 20 to 49 years of age) when compared to both children/adolescents up to 19 and older adults of 50+ years (Supplement Figure A). However, as the highest completed education also depended on the age group, e.g. participants below the age of 20 could not have completed the second stage of tertiary education, and the numbers of participants in the age groups were heterogeneous (ranging from 1 to 541; Table 1), we did not further assess differences between specific age groups.

Local dependency in relation to item fit
Local dependency was detected between items 3 (cough) and 4 (dry cough/no sputum production), items 12 (nasal congestion) and 13 (sneezing), as well as items 12 (nasal congestion) and 14 (sniffles/rhinitis). The additional 'local' dependency between nasal congestion, sneezing and/or sniffles/rhinitis seemed rather evident and is well known. Moreover, these three symptoms might be less important regarding COVID-19 than, for example, cough. Cough and dry cough also showed such an additional dependency, which might also relate to the misfit of cough (item 3) in Table 3. Finally, we found both data sets to be unidimensional (the four far right columns in Table 2 show these results).

Person item targeting
From the graphical inspection of the person-item map (Supplement Figure B), it is evident that the items in general cover the range of symptoms in the population. We used the data set in which an item was recorded as affirmed if indicated at least once during the observation period, which contained fewer zero-scored symptoms than the data set of the questionnaires filled in only once.

Transformation to a metric interval scale
The raw sum scores and the corresponding values on an interval metric scale from zero to ten are depicted in Table 4. Higher scores indicate more symptoms. We split and adjusted the items 'headache', 'shortness of breath', 'rhinitis/sniffles', 'pain in limbs' and 'fatigue' according to the personal factors comorbidities, gender, immunosuppressive medication and educational status (Table 4; details of the calculations are depicted in Supplement Table C). Adjustments for personal factors differed in magnitude over the range of each metric scale.

Table 4 Transformations of raw scores into metric scales (logit scales and from zero to ten). 'COM' refers to persons with comorbidities; 'IMM' refers to persons with immunosuppressive medication. The numbers below COM, IMM and gender represent the difference between the metric scales for these subgroups, e.g. persons with and without comorbidities, and offer a possibility to adjust the metric scores for better comparability of persons with different personal factors.

Discussion
For the first time, we assessed the measurement precision of a COVID-19 symptom checklist using advanced psychometric methods and suggested some basic adaptations which could increase its precision. Local dependency between the items 'cough' and 'dry cough/no sputum production', for example, showed the need for more accurate item wordings which would make it clearer to the participants how to differentiate between these items. The unclear difference between what is meant by 'cough' and 'dry cough/no sputum production' could have prevented participants from scoring precisely, and an improved item wording would be preferable. Likewise, internationally agreed standard item wordings for COVID-19 symptom checklists co-created with lay persons could improve comparability of the findings, especially across data sets and different countries. Such an agreed version should then also be tested for measurement precision using an approach similar to ours.
Our analysis showed that participants with certain characteristics, such as comorbidities, intake of immunosuppressive medications, gender and educational status, scored systematically differently on some items, although having in principle the same likelihood to affirm the respective items. This means, for example, that people with comorbidities had a higher likelihood to affirm shortness of breath due to this confounder. Likewise, people with higher educational status were more likely to score fatigue despite their otherwise similar likelihood to affirm a similar number of items when compared to people with a lower educational status. The symptoms on the COVID-19 checklists could also occur due to other diseases or in relation to certain personal characteristics. An adjustment of scores according to these characteristics would increase the comparability of the findings between individuals from different sub-groups.
The low PSI in the data set with the questionnaires filled in for the first time by the participants could be related to the small number of people who affirmed the items when they first responded to the questionnaire. It might thus be difficult to determine the measurement precision of symptom checklists if only a small number of people report symptoms. We recorded similar findings in previous studies where the large number of zero-scored symptoms led to right-skewed distributions and an unfavorable person-item targeting [21,22]. In general, the measurement precision of self-reported symptom checklists, especially in the case of easily fluctuating symptoms such as the ones used in our COVID-19 checklist, might be increased by using these checklists longitudinally.
We could show that fever, dry cough and chills had a low probability of affirmation. This could be related to participants being anxious about reporting these symptoms in a digital way, even though anonymity was guaranteed and participation in our study was voluntary. Participants could be reluctant to fill in such symptoms in digital tools and might fear that this could lead to personal consequences, such as quarantine or other restrictions. This needs to be taken into account when self-reported checklists are used and could be a reason for the very poor diagnostic power of individual signs and symptoms [13].
As we used only Austrian data in our study, differential item functioning between countries and cross-cultural differences could not be assessed. In terms of future work, it would be interesting to repeat the Rasch analysis in an international data set collected with an online self-reported COVID-19 symptom checklist with adapted wording of specific items.

Conclusion
Apart from some basic adjustments, the analysis supports the present combination of items into a comprehensive COVID-19 symptom questionnaire. More accurate item wordings co-created with lay persons and adjustments for personal factors would increase measurement precision of the self-reported COVID-19 symptom checklist.

Declarations
Funding Source: This study was partly funded by the COVID-19 Rapid Response Funding Scheme of the Vienna Science and Technology Fund (project number COV20-028). The funding institution had no influence on the results of this work.

Ethics approval and consent to participate
The relevant ethical committees approved the study (Medical University of Vienna 1379/2020, Medical University of Innsbruck 1076/2020 and ethical committee of the region Vorarlberg). All participants gave written consent before filling in the questionnaire.

Consent for publication
Not applicable

Competing interests
We declare no competing interests related to this work.


Authors' contributions
VR, MO, RA, NM, AR, MS, SW, ToS, EM and TS were responsible for the study conceptualization and design. VR, EM and TS designed the online survey and analysed the data. VR, MO, RA, NM, AR, MS, SW, ToS, EM and TS interpreted the data, wrote and reviewed the manuscript. EM and TS were responsible for the visualization, including tables and figures.