Computerized Adaptive Testing for Sleep Disorders: Development of an Item Bank and Validation in a Simulated Study

Background: As sleep disorders grow increasingly prevalent, developing an efficient, inexpensive and accurate assessment tool for screening them is becoming more urgent. This study developed a computerized adaptive test for sleep disorders (CAT-SD). Methods: A large sample of 1,304 participants was recruited to construct the item pool of the CAT-SD and to investigate its psychometric characteristics. More specifically, analyses of unidimensionality, model fit, item fit, item discrimination and differential item functioning (DIF) were first conducted to construct a final item pool meeting the requirements of item response theory (IRT) measurement. In addition, a simulated CAT study using participants' real response data was performed to investigate the psychometric characteristics of the CAT-SD, including reliability, validity and predictive utility (sensitivity and specificity). Results: The final unidimensional item bank of the CAT-SD not only showed good item fit, high discrimination and no DIF, but also had acceptable reliability, validity and predictive utility. Conclusions: The CAT-SD could be used as an effective and accurate assessment tool for measuring the severity of individuals' sleep disorders and offers a brand-new perspective for screening sleep disorders with psychological scales.

electromyography (EMG) and other physiological detectors, it can record various physiological changes during sleep. Sleep scales mainly require subjects to respond to the items in the scales, and subjects' sleep status is then analyzed according to their responses. For example, the Pittsburgh Sleep Quality Index (PSQI; Buysse et al., 1989) and the Insomnia Severity Index (ISI; Morin, 1993) are widely used sleep scales.
Each of the above four sleep assessment methods has its own pros and cons; this article mainly focuses on the shortcomings of sleep scales. Most existing sleep scales were developed within the framework of classical test theory (CTT). One of the most prominent problems of CTT is that a large number of items is needed to cover a wide range of the construct with high measurement precision.
In the clinical field, there is high demand for mental health assessments that are both short in duration and of good quality (e.g., Gardner et al., 2004; Cella et al., 2007; Smits et al., 2007).
Computerized Adaptive Testing (CAT), in which items are administered via computer, offers substantial promise here. In CAT, each item is dynamically selected from an item bank and is optimal for the respondent in question (Smits et al., 2011). CAT relies on modern test theory, also known as Item Response Theory (IRT). The main focus of IRT research is the relationship between subjects' responses to items and the latent traits measured by the test. IRT models have item parameters that quantify the relationship between the latent trait and the item score (Smits et al., 2011). In recent years, many researchers have used IRT to improve existing scales. For example, O'Connor et al. (2014) used IRT to analyze the Subjective Happiness Scale (SHS), Cho et al. (2015) applied IRT to analyze an emotional intelligence scale, and Wang (2018) developed an adaptive test with a hierarchical item response theory (H-IRT) model. Generally speaking, CAT may be the most intriguing new application of IRT.
Compared with the traditional paper-and-pencil (P&P) test, the greatest advantage of CAT is that it can greatly reduce the number of items without loss of measurement accuracy. CAT has many other advantages as well. For instance, the presentation of items is more standardized: the computer can precisely control what the examinee sees and hears, as well as the length of the test. CAT also has disadvantages, such as being a complex technique and requiring substantial human and financial resources to organize a CAT program. However, studies have shown that the advantages of CAT far outweigh its disadvantages (Meijer & Nering, 1999).
According to the literature, CAT has been widely used in psychological and clinical fields. For example, Fliege et al. (2005) developed a CAT for depression; Abberger et al. (2013) developed a CAT to assess anxiety in cardiovascular rehabilitation patients; and Gibbons et al. (2017) used a quality-of-life CAT to adjust for cross-cultural differences among participants. However, few CAT studies for sleep disorders have been reported, which hampers the measurement and assessment of sleep disorders. Considering the seriously negative effects of sleep disorders on people, it is urgent to develop an efficient and accurate assessment tool for them. Moreover, the technology, algorithms and implementation of a CAT for sleep disorders deserve further discussion. To address these issues, this study proposes a CAT for measuring sleep disorders (CAT-SD) built with real data.
The current study is expected to contribute to the theory and practice of the measurement and assessment of sleep disorders. In theory, a) combining CAT with sleep disorders broadens the application area of CAT; b) although some sleep scales are short, they may not adequately cover the domains related to sleep disorders, whereas in CAT multiple sleep scales are combined into a large, relatively comprehensive item bank; and c) CAT mitigates measurement error while maximizing efficiency, since only the items pertinent to accurately measuring the trait level are administered (Kirisci et al., 2012). In practice, a) cost is minimal because responses are scored automatically and immediately after the subject completes the questionnaire; b) privacy is ensured because there is no paper record of the subjects' responses and access to the information is password-protected (Kirisci et al., 2012); and c) sleep scales are important auxiliary tools for doctors in diagnosing patients' sleep status, and patients can complete them at home.
The following sections describe the development of the CAT-SD and the evaluation of its test properties in a simulation study.

Methods

Participants
The college student sample consisted of 1,304 participants from ten universities in five Chinese cities (Beijing, Shanghai, Jingzhou, Jingdezhen and Nanchang). All participants took part voluntarily and anonymously after being informed that their personal information would be kept confidential. Participants included both healthy individuals (81.18%) and patients (18.82%); the patients reported that they had been diagnosed with sleep disorders by professional doctors and were screened as having severe sleep disorders by several sleep scales. The sample comprised 585 males (44.86%) and 719 females (55.14%), of whom 752 (57.67%) were from rural areas and 552 (42.33%) from cities. The mean age was 19.15 years (SD = 1.52, range 15 to 25), and 87.40% of participants were aged 18 to 22.

Measures
The first step was to conduct a Delphi process. We started with 133 available items originating from 8 self-rating sleep scales that are widely used in routine diagnostic examinations, including the Pittsburgh Sleep Quality Index (PSQI; Buysse et al., 1989), the Insomnia Severity Index (ISI; Morin, 1993), the sleep scale of Hays et al. (2005), and the Quality of Life in Neurological Disorders-Sleep Disturbance Scale (Neuro-QOL-SD; Perez et al., 2007). The Athens Insomnia Scale (AIS; Soldatos et al., 2000), which is widely used in diagnosing sleep disorders, was selected as the criterion scale to evaluate the CAT-SD. All questionnaires were administered to participants via computer.

Unidimensionality. A basic assumption of many IRT models is that the test measures one main latent trait and that no other factors affect the characteristics of the examinees' responses to the items; unidimensional IRT models include the two-parameter logistic model (2PL), the three-parameter logistic model (3PL), the Graded Response Model (GRM; Samejima, 1968) and the Generalized Partial Credit Model (GPCM; Muraki, 1992). To confirm acceptable unidimensionality of the dataset, an exploratory factor analysis (EFA) was conducted. If the ratio of the first eigenvalue to the second eigenvalue is greater than 4 and the first eigenvalue explains more than 20% of the total variance (Reckase, 1979), the test can be considered to conform to the unidimensionality hypothesis.
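As a rough illustration of this eigenvalue screen, the check can be sketched in Python; `unidimensionality_check` is a hypothetical helper (the authors used SPSS), with the cut-offs following Reckase (1979) as cited above:

```python
import numpy as np

def unidimensionality_check(responses, ratio_cut=4.0, var_cut=0.20):
    """Reckase-style unidimensionality screen: the first-to-second
    eigenvalue ratio of the inter-item correlation matrix must exceed
    ratio_cut, and the first eigenvalue must explain more than var_cut
    of the total variance."""
    corr = np.corrcoef(responses, rowvar=False)        # items in columns
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending order
    ratio = eigvals[0] / eigvals[1]
    var_explained = eigvals[0] / eigvals.sum()
    return bool(ratio > ratio_cut and var_explained > var_cut)
```

Data driven by one strong common factor pass the screen, while pure noise does not.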
Model fit. In IRT, selecting an optimal IRT model is the premise of accurate statistical analyses. In the current study, three polytomous IRT models (i.e., the GRM, the PCM and the GPCM) were simultaneously fitted to the items of the CAT-SD, and the optimal model was selected based on test-level model-fit indices, including the −2 log-likelihood (−2LL; Spiegelhalter et al., 1998), Akaike's information criterion (AIC; Akaike, 1974) and the Bayesian information criterion (BIC; Schwarz, 1978). Smaller values of these indices indicate better model fit.
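To illustrate how these indices are formed from a fitted model's log-likelihood, here is a minimal sketch; `information_criteria` is a hypothetical helper, not the 'mirt' implementation used in the study:

```python
import math

def information_criteria(log_lik, n_params, n_obs):
    """Test-level fit indices used for model selection: -2LL, AIC and
    BIC. Smaller values indicate better model fit; BIC penalises the
    number of free parameters more heavily as sample size grows."""
    neg2ll = -2.0 * log_lik
    aic = neg2ll + 2.0 * n_params
    bic = neg2ll + n_params * math.log(n_obs)
    return neg2ll, aic, bic
```

Comparing the GRM, GPCM and PCM then amounts to computing these triples for each fitted model and choosing the model with the smallest values.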
Discrimination parameters. Item discrimination parameters (a) show the extent to which individuals with similar trait levels can be differentiated by an item. An item with a high discrimination parameter helps obtain a more precise estimate of the examinee's latent trait. Item discrimination parameters were estimated under the optimal model, and items with discrimination parameters greater than 0.5 were retained (Chang & Ying, 1996).
Item fit. To evaluate item fit, the S-X² statistic (Kang & Chen, 2008), which quantifies the differences between observed and expected frequencies under the IRT model, was used. Items with S-X² p values less than 0.01 were deemed to have poor item fit (Flens et al., 2017) and were removed.
Differential item functioning (DIF). Another assumption of IRT models is that an item has the same item parameters in different samples. If parameter values differ between groups, the test is said to suffer from differential item functioning (DIF; Embretson & Reise, 2000, Chap. 10). The consequence of DIF is that respondents from different groups who actually have an identical score on the latent trait have different probabilities of endorsing an item (Smits et al., 2011). The change in McFadden's pseudo-R² was used to evaluate effect size, and the null hypothesis of no DIF was rejected when the R² change was greater than 0.02 (Bjorner, 2003). DIF analyses were carried out with respect to gender (male, female) and region (rural, city) groups.
The IRT analyses of unidimensionality, model fit, item fit, discrimination parameters and DIF were sequentially performed until the remaining items of item bank fully satisfied the above rules.
Consequently, the remaining items constituted the final item bank of the CAT-SD, and the item parameters of the final item bank and the person parameters were then estimated using the optimal model.

CAT-SD Simulation Study
In this part, all of the real participants' response data were used in a CAT-SD simulation study to investigate the characteristics, criterion-related validity and predictive utility (sensitivity and specificity) of the CAT-SD.
Initial item selection. In the CAT simulation, item selection depends on the participants' responses to previously administered items. However, the computer has no prior information about a participant at the beginning, so the random selection method was applied for the initial item in this study.
Item selection method. Once the CAT has an estimate of a participant's latent trait, it chooses the most appropriate next item according to that estimate. The Maximum Fisher Information (MFI) method (Baker, 1992) is a commonly used item selection method that selects the item with maximum information at the estimated theta point. This so-called statistical information is a function of the item parameters and is related to the measurement error of the estimated latent variable: the higher the information of an item, the more it reduces the measurement error associated with that estimate (Smits et al., 2011). The Fisher information was defined as

I_j(θ̂) = Σ_k [P′_jk(θ̂)]² / P_jk(θ̂),

where I_j(θ̂) is the item information function of item j given the estimated latent trait level θ̂, P_jk(θ̂) is the probability of obtaining score k given θ̂, and P′_jk(θ̂) is the first derivative of P_jk(θ̂) with respect to θ. A new item with the highest I_j(θ̂) at the estimated theta point was selected.
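A sketch of MFI selection under the GRM, assuming the usual boundary-curve parameterization; `grm_item_information` and `select_mfi` are illustrative names, not the 'catR' API used in the study:

```python
import math

def grm_item_information(theta, a, bs):
    """Fisher information of a graded-response-model item at theta.
    bs are ordered threshold parameters; boundary curves are
    P*_k(theta) = 1 / (1 + exp(-a * (theta - b_k)))."""
    p_star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    info = 0.0
    for k in range(len(p_star) - 1):
        p_cat = p_star[k] - p_star[k + 1]  # category probability P_jk
        dp = a * (p_star[k] * (1 - p_star[k]) - p_star[k + 1] * (1 - p_star[k + 1]))
        info += dp * dp / p_cat            # (P'_jk)^2 / P_jk
    return info

def select_mfi(theta, item_params, administered):
    """Maximum Fisher Information selection: among unused items, return
    the index of the item with the largest information at theta."""
    best, best_info = None, -1.0
    for j, (a, bs) in enumerate(item_params):
        if j in administered:
            continue
        info = grm_item_information(theta, a, bs)
        if info > best_info:
            best, best_info = j, info
    return best
```

With two items of equal thresholds, the more discriminating item carries more information near theta = 0 and is therefore selected first.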
Scoring algorithm. After a participant responded to an item, his or her sleep-disorders theta was updated with the expected a posteriori method (EAP; Bock & Mislevy, 1982). EAP is a kind of Bayesian estimation based on the participant's responses to the selected items. The EAP estimator was defined as

θ̂_i = Σ_q X_q L_i(X_q) W(X_q) / Σ_q L_i(X_q) W(X_q),

where X_q is one of the quadrature points, L_i(X_q) is the likelihood function of participant i's specific response pattern given the ability value X_q, and W(X_q) is the weight of quadrature point X_q.
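The EAP update can be illustrated with a simple quadrature sketch; `eap_estimate` is a hypothetical helper that takes an arbitrary response-likelihood function and assumes a standard-normal prior for the weights W(X_q):

```python
import math

def eap_estimate(likelihood_fn, grid=None):
    """Expected a posteriori theta: the mean of the posterior formed by
    the response likelihood L(X_q) times a standard-normal prior W(X_q),
    evaluated on a fixed grid of quadrature points X_q."""
    if grid is None:
        grid = [-4.0 + 0.1 * q for q in range(81)]     # X_q from -4 to 4
    num, den = 0.0, 0.0
    for x in grid:
        w = likelihood_fn(x) * math.exp(-0.5 * x * x)  # L(X_q) * W(X_q)
        num += x * w
        den += w
    return num / den
```

For instance, a Gaussian likelihood centered at 1 combined with the standard-normal prior yields a posterior mean of about 0.5, illustrating the shrinkage toward the prior that is characteristic of EAP.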
Stopping rule. The CAT algorithm alternately selects items and updates the estimate of the participant's latent trait until the item bank is empty, unless termination criteria are set. There are generally two approaches to terminating the test: one is fixed length, the other is variable length. In this study, the latter rule was applied; that is, the test was terminated when the standard error (SE) of theta reached the pre-set value of SE(θ). The SE for a trait level (Magis & Raiche, 2012) can be defined as

SE(θ̂) = 1 / sqrt(Σ_{j=1}^{n} I_j(θ̂)),

where n denotes the total number of administered items. Different stopping rules yield different estimation accuracy: generally speaking, the larger the SE(θ), the lower the accuracy of the theta estimate, and vice versa. Several cut-off values of SE(θ) were used in the CAT-SD simulation: the whole final item bank (None), SE(θ) ≤ 0.3, SE(θ) ≤ 0.4, SE(θ) ≤ 0.5, and SE(θ) ≤ 0.6 (Tan et al., 2018).
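The SE computation and the variable-length stopping check can be sketched as follows; `standard_error` and `should_stop` are illustrative helpers, not the 'catR' functions used in the study:

```python
import math

def standard_error(item_infos):
    """SE of the theta estimate after n administered items: the inverse
    square root of the total test information at the current theta."""
    return 1.0 / math.sqrt(sum(item_infos))

def should_stop(item_infos, se_cut=0.4, bank_empty=False):
    """Variable-length stopping rule: stop once SE(theta) falls to the
    pre-set cut-off, or when no items are left in the bank."""
    return bank_empty or standard_error(item_infos) <= se_cut
```

Each administered item adds its information at the current theta, so the SE shrinks monotonically as the test proceeds, and the loop halts at the chosen precision.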
Characteristics of CAT-SD. To explore the characteristics of the CAT-SD, several statistics were calculated: the mean and standard deviation (SD) of the number of administered items, the mean SE of the theta estimates, the Pearson's correlation between the theta estimated under each stopping rule and the theta estimated using the whole item bank, and the marginal reliability, i.e., the mean reliability over all levels of theta (Smits et al., 2011). When the mean and SD of theta are 0 and 1, respectively, the corresponding reliability of each examinee can be derived via the following formula (Samejima, 1994):

r_i = 1 − 1 / I_i(θ̂),

where I_i(θ̂) is the test information for participant i, which can be computed from the parameters of the administered items and his/her responses, and r_i is the corresponding IRT reliability for participant i; the marginal reliability is the average of these reliabilities across participants.
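This averaging step is straightforward to sketch; `marginal_reliability` is a hypothetical helper, and the per-person formula r_i = 1 − 1/I_i assumes theta scaled to mean 0 and SD 1 as stated above:

```python
def marginal_reliability(test_infos):
    """Average per-person IRT reliability. With theta scaled to mean 0
    and SD 1, each person's reliability is r_i = 1 - SE_i^2 = 1 - 1/I_i,
    where I_i is that person's total test information."""
    return sum(1.0 - 1.0 / info for info in test_infos) / len(test_infos)
```

For example, participants measured with test informations of 4 and 10 have reliabilities of 0.75 and 0.9, so the marginal reliability over the pair is 0.825.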
Additionally, the number of selected items and the test information were plotted as functions of the estimated theta under different stopping rules. The test information indicates the measurement precision of the CAT-SD: the lower its value, the larger the error of the theta estimate.
Criterion-related validity and predictive utility (sensitivity and specificity) of CAT-SD. To further investigate the criterion-related validity and predictive utility (sensitivity and specificity) of the CAT-SD, the AIS, which is widely used and well validated in measuring sleep disorders, was selected as the criterion scale. The Pearson's correlations between the estimated theta in the CAT-SD and the standard scores of the AIS under different stopping rules were calculated. Predictive utility was examined by calculating Receiver Operating Characteristics (ROC). The area under the curve (AUC) can be seen as the probability that a randomly selected unhealthy individual scores higher than a randomly selected healthy individual on the sleep scales, and its value ranges from 0.5 to 1. The predictive utility of the estimated theta for diagnosing sleep disorders is similar to random guessing when AUC = 0.5 and optimal when AUC = 1. Swets and colleagues (1988) suggested heuristically interpreting AUC values as small (0.5 < AUC ≤ 0.7), moderate (0.7 < AUC ≤ 0.9), or high (0.9 < AUC ≤ 1). Sensitivity refers to the probability that a patient is accurately diagnosed with the disease, and specificity refers to the probability that a healthy individual is diagnosed as having no illness; the larger these two indicators, the better the diagnosis (Tan et al., 2018). The cut-off score was determined by maximizing the Youden index (YI = sensitivity + specificity − 1) (Schisterman et al., 2005). The AIS served as the classification variable (according to the scoring standard of this scale, participants with a total score greater than 6 were diagnosed as having insomnia and coded as 1, while others were coded as 0), and the estimated theta in the CAT-SD was used as a continuous variable for sleep disorders to plot the ROC curve under different stopping rules.
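The Youden-index cut-off search can be sketched as a scan over candidate thresholds on the continuous CAT score; `youden_cutoff` is an illustrative helper (the study used SPSS's ROC procedure), with labels coded 1 for sleep disorder (AIS total > 6) and 0 for healthy:

```python
def youden_cutoff(scores, labels):
    """Return the cut-off on the continuous score that maximises the
    Youden index J = sensitivity + specificity - 1, together with J.
    A case is classified positive when its score >= the cut-off."""
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cut)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cut)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cut)
        sens = tp / (tp + fn)   # P(positive test | patient)
        spec = tn / (tn + fp)   # P(negative test | healthy)
        j = sens + spec - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j
```

On perfectly separated data the scan recovers the separating threshold with J = 1; on real data it trades sensitivity against specificity.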

Software
The EFA and ROC curve analyses were carried out in SPSS 23.0. All other analyses were performed in the free statistical package R (Version 3.4.1; R Core Team, 2015). Specifically, IRT model selection, item fit and discrimination parameter analyses were conducted with the 'mirt' package (Version 1.24; Chalmers, 2012); DIF tests with the 'lordif' package (Version 0.3-3; Choi, 2015); and the CAT algorithm with the 'catR' package (Magis & Barrada, 2017).

Results

Construction of item bank for CAT-SD

Unidimensionality
Results of the EFA of the 133 items in the initial item bank for sleep disorders indicated that the ratio of the first eigenvalue (λ1 = 22.003) to the second eigenvalue (λ2 = 4.552) was 4.834, and that the first eigenvalue accounted for 24.448% of the total variance, exceeding the 20% criterion. These results therefore supported the unidimensionality of the CAT-SD.

Model fit
Model-fit indices of the GRM, the GPCM and the PCM are documented in Table 1. The values of −2LL, AIC and BIC of the GRM were all smaller than those of the other IRT models, which suggested that the GRM fitted the data best. Therefore, the GRM was used as the optimal model in the further analyses.

Discrimination parameters, Item fit, and DIF
The discrimination parameters of 13 items in the initial item bank were less than 0.5, so these items were removed from the item bank. Of the remaining 120 items, 14 had S-X² p values less than 0.01 and were removed due to poor item fit. Regarding DIF among the remaining 106 items, no DIF was found for the region groups, while 12 items showed R² changes greater than 0.02 for the gender groups; these 12 items with DIF were therefore eliminated.
Consequently, the final item bank of the CAT-SD comprised 94 items after 39 items were eliminated according to the above psychometric criteria. Unidimensionality, model fit, item fit, discrimination and DIF analyses were then conducted again on the remaining 94 items, and all items met the requirements of IRT measurement. The statistics of items in the final item bank of the CAT-SD are partly presented in Table 2, and the statistics of the whole item bank are provided in the Supplementary material. For the final item bank, the average IRT discrimination parameter (a) was 1.31 (SD = 0.40), which suggested that the final item bank had high quality, and the location parameters (b) ranged from −4.66 to 4.88, which implied that they covered a large range of the trait.

CAT-SD Simulation Study
Characteristics of CAT-SD

Table 3 shows several characteristics of the CAT-SD under different stopping rules. The first row shows the characteristics of the CAT-SD when no stopping rule was applied, that is, when all items in the final item bank were administered. The second and third columns show the mean number of items administered and the associated SD, respectively. Obviously, the higher the level of measurement precision, the larger the mean number of administered items. Note: ** indicates significance at the 0.01 level; None = the whole item bank was administered; r = the Pearson's correlation between the theta estimated in the CAT-SD and the theta estimated via the whole item bank.

Figure 1 depicts the standard error of the estimated theta of the CAT-SD under stopping rules SE(θ) ≤ 0.4 and SE(θ) ≤ 0.3. Subjects with moderate and high CAT-SD scores have a smaller standard error. This result is consistent with the fact that screening was more effective for people with moderate or severe sleep disorders than for people with mild sleep disorders.

Figure 2 shows the number of items administered along with the test information as functions of the estimated theta in the CAT-SD under stopping rules SE(θ) ≤ 0.4 and SE(θ) ≤ 0.3. In particular, a large number of items had to be administered for subjects with lower theta, and the test information was low; fewer items were administered for subjects with middle or high theta, and the test information was high. For example, under the stopping rule SE(θ) ≤ 0.3, a) the test information was less than 8 for those whose theta ranged from −3.8 to −2 even when the entire item bank was administered to them, while b) the test information was over 12 for those whose theta ranged from 0 to 3 with only about 10 administered items.

Figure 3 illustrates the density distributions of the sleep disorder scores obtained from the traditional test (the whole item bank) and the CAT-SD.
As can be seen, the two distributions are almost identical under different stopping rules; the Pearson's correlations between the two kinds of test are 0.91 and 0.95 under stopping rules SE(θ) ≤ 0.4 and SE(θ) ≤ 0.3, respectively (see Table 4). Figure 4 displays the marginal reliabilities of the CAT-SD for each participant under different stopping rules. Under SE(θ) ≤ 0.4 and SE(θ) ≤ 0.3, the marginal reliabilities were above the overall average (r = 0.84), which indicated that the CAT-SD had high reliability for most participants. Furthermore, the marginal reliabilities for participants with estimated theta greater than −2.5 were maximal under SE(θ) ≤ 0.3, while the marginal reliabilities under SE(θ) ≤ 0.3 and SE(θ) ≤ 0.4 were equal when the estimated theta was less than −2.5, and those under SE(θ) ≤ 0.3, SE(θ) ≤ 0.4 and SE(θ) ≤ 0.5 were equal when the estimated theta was less than −3. Individuals always had the minimum marginal reliabilities under stopping rule SE(θ) ≤ 0.6, regardless of the theta estimate.

Criterion-related validity and Predictive utility (sensitivity and specificity) of CAT-SD
The results show that the correlations between the CAT-SD estimated theta and the AIS (ranging from 0.684 to 0.777, p < 0.001) were significant under different stopping rules, which indicated that the CAT-SD had acceptable criterion-related validity. The ROC analysis for the CAT-SD is presented in Table 4 and Figure 5; the values of AUC were higher than the critical value of 0.7, universally regarded as the lower bound for moderate predictive utility, under all stopping rules. In this study, the minimum probability that patients were accurately screened as having sleep disorders was 0.781, and the minimum probability that healthy individuals were accurately screened as having no sleep disorders was 0.751, both higher than the random level (0.5).

Discussion
This study focused on the development of the CAT-SD, which provides optimal items for individuals based on the severity of their sleep disorders, assessing sleep disorders effectively and significantly reducing the test burden without loss of measurement accuracy.
The whole study was divided into two parts: construction of the CAT-SD item bank and a CAT-SD simulation study. To construct a high-quality item bank for the CAT-SD, items were carefully selected from eight widely used sleep scales, and analyses of unidimensionality, model fit, item fit, discrimination and DIF were carried out.
Results showed that the final unidimensional item bank of the CAT-SD contained 94 items with good item fit, high discrimination and no DIF. In the CAT-SD simulation study, the real participants' response data were used to investigate the psychometric characteristics of the CAT-SD, including reliability, validity and predictive utility (sensitivity and specificity). Simulated CAT-SD runs under different stopping rules (required standard errors in decreasing steps of 0.1) were performed, and the results revealed that: a) individuals with moderate or severe sleep disorders can be accurately screened by administering only a few items; among participants with a similar degree of sleep disorders, small differences are more easily detected for participants with high sleep-disorder scores than for those with low scores, a result similar to previous studies (e.g., Smits et al., 2011; Reise & Waller, 2009); b) high correlations were observed between the traditional test and the CAT-SD, yet participants only needed to complete 15.19 and 8.47 items on average under the stopping rules SE(θ) ≤ 0.3 and SE(θ) ≤ 0.4, respectively (see Table 4); the main advantage CAT offers over the traditional test is that only the optimal items are administered to each participant, minimizing test burden without sacrificing measurement precision; c) the CAT-SD had an acceptable marginal reliability with an average of 0.84; meanwhile, it also had acceptable and reasonable criterion-related validity with the AIS, as the Pearson's correlation coefficients under different stopping rules were all greater than 0.6, a widely used lower bound for moderate correlation; d) from the ROC curve analysis, the AUC values (0.857 ~ 0.902) did not change much and were higher than the lower bound (0.7) for moderate predictive utility under different stopping rules; therefore, the CAT-SD had good screening performance for sleep disorders.
The sensitivity (0.781 ~ 0.815) and specificity (0.751 ~ 0.865) of the CAT-SD were both acceptable; e) the simulation study indicated that the stopping rules SE(θ) ≤ 0.3 and SE(θ) ≤ 0.4 seem to be optimal, because these two rules outperformed the others in terms of reliability and validity; although more items were used, the number of administered items remained within an acceptable range.
Although the results for the proposed CAT-SD are promising, there were still some limitations. Firstly, when the criterion-related validity and predictive utility (sensitivity and specificity) of the CAT-SD were examined, only one scale was selected as the criterion scale and the same participants were used. Future studies should employ more criterion scales to stabilize and cross-validate the validity of the CAT-SD and should ensure that the subjects involved in the development and verification processes are different. Secondly, given that a CAT item bank requires a sufficient number of high-quality items with a wide range of location parameters (Howard, 1990), more high-quality items should be added to the item bank of the CAT-SD in the future. The size of an item bank generally considered appropriate is 6 to 12 times the number of items in a P&P test (Stocking et al., 1993). There were 94 items in the final item bank of the CAT-SD, but if item exposure rate, item elimination, item content distribution and other issues are taken into consideration, the existing item bank should be further expanded. Thirdly, the sample is not representative; therefore, future research needs to be conducted on participants with sleep disorders. Finally, in the current research, a CAT-SD simulation study with real response data from a traditional test was carried out; however, a real CAT-SD administration should be implemented in future research to further explore the efficiency of the CAT-SD, since simulated and real CAT administration may produce different results (Smits et al., 2011).
In real situations, participants' responses are affected by many factors, such as environment, mood, other people and time, whereas simulation studies are usually performed under ideal conditions. Fortunately, research (Kocalevent et al., 2009) has found that the results of simulated CAT are consistent with those of actual CAT. Consequently, this study still has practical significance.

Conclusion
The CAT-SD could be used as an effective and accurate assessment tool for measuring the severity of individuals' sleep disorders and offers a brand-new perspective for screening sleep disorders with psychological scales. More studies are needed to assess the performance of adaptive tests in both mental health specialty and other clinical settings.

Declarations
- Consent for publication: Not applicable.
- Availability of data and materials: The data and materials that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
- Competing interests: All authors declare that they have no conflict of interest related to this work.
- Funding: This work was supported by the National Natural Science Foundation of China [grant numbers 31960186, 31760288, 31660278]. The funders had no role in study design, data collection, data analysis, data interpretation, or writing of the manuscript.

Ethics approval and consent to participate
The study was carried out following the recommendations of psychometrics studies on mental health at the Research Center of Mental Health, Jiangxi Normal University and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. All participants and their parents or legal guardian provided verbal informed consent and this practice was approved by the ethics committee of the Research Center of Mental Health, Jiangxi Normal University.

Figures

Figure 1
Standard error (SE) of CAT-SD score under stopping rules SE(θ) ≤ 0.3 and SE(θ) ≤ 0.4. The plot suggests good measurement precision for the majority of the CAT-SD score range.

Figure 4
Marginal reliability as a function of estimated theta under different stopping rules.

Figure 5
The ROC curve of CAT-SD with several stopping rules.

Supplementary Files
This is a list of supplementary files associated with this preprint.
Additionalfile1.pdf