This study had three phases: Phase 1 consisted of the development of the scale, Phase 2 involved the validation of the scale, and Phase 3 involved standard setting for the scale.
Two groups of participants were involved in this study: Sample 1 was used in Phase 1 and Phase 2, and Sample 2 was used in Phase 3.
The medical curriculum at Ankara University School of Medicine (AUSM) is a 6-year programme comprising 3 years of preclinical work followed by 3 years of clinical work (2 years of clerkships and a one-year internship). Communication training with SPs in the preclinical years is a mandatory part of the curriculum at AUSM, and SP encounters are conducted during the second and third years. For this reason, we included second- and third-year medical students in the 2016-2017 academic year. Participation was voluntary, and the inclusion criterion was having had at least one previous SP encounter.
A total of 702 students participated in the Phase 1 study. In determining the sample size, the requirements of the multivariate data analysis methods used in this study [Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA)] were considered. As multivariate techniques, they require large sample sizes. According to Comrey and Lee, a sample size of 200 is fair, and a sample size of 300 is suitable for EFA. Moreover, at least 300 cases are needed when communalities are low, the number of factors is small, and there are just three or four indicators for each factor.
As EFA and CFA should be conducted with two different groups selected from the same population, we divided the participants between the two analyses. In the study, second-year students completed the SP performance evaluations earlier than third-year students. Since EFA was performed before CFA, EFA was conducted on the data of the second-year students (n=307) and CFA on the data of the third-year students (n=395).
The standard-setting study of the scale was carried out with a test-centered approach, for which expert opinions were collected. Experts with at least five years of experience in using and training SPs were selected by purposive sampling. Purposive sampling is a type of nonprobability sampling in which the researcher deliberately selects specific elements or subjects for inclusion in a study to ensure that they have certain characteristics relevant to the study. In addition, the experts were selected from different departments of the medical school, as they may have different perspectives. For this purpose, two SP trainers and faculty from the Department of Medical Education and five faculty from the Departments of “Infectious Diseases”, “Child Health and Diseases”, “Psychiatry”, “Radiology” and “Forensic Medicine” participated in this phase of the study.
Development of Data Collection Tools
The scale development process comprised seven steps: a literature review, interviews, synthesis of the literature review and interviews, item development, expert validation, preliminary application, and pilot testing.
Literature Review: At this stage, the keywords "standardized/simulated patient performance" and "standardized/simulated patient scale" were used on the Web of Science, Google Scholar, and ProQuest search engines to search the relevant literature. Two studies focusing on the development of measurement tools for evaluating SPs’ performance were identified and reviewed [13,14].
The domains of SPs’ performance were defined as follows: the ability to portray a patient, to observe the medical student's behavior, to recall the encounter, and to give feedback. SPs must give an accurate medical history and realistically depict the patient's educational level, psychological state, and emotional condition while observing the student's performance. After the interview, the SPs must recall the details of the student's behavior and give thoughtful, beneficial, and effective feedback from the standpoint of the patient the SP was portraying. These conceptual definitions of the domains were adopted as the basis for measuring SPs’ performance.
Conducting Interviews: In addition to the literature review, interviews were conducted with 9 faculty, 50 students (different from the participants in Phase 1 and Phase 2), and two field experts who participated in SP training. Individual oral interviews of 15-30 minutes were held with faculty drawn from among the 45 faculty who had been involved in SP selection or had worked with SPs for at least seven years, especially in communication skills training. They were asked what they considered to be the main attributes of good and poor SP performance. They focused on different performance characteristics of the role, such as persuasion, successful portrayal, respecting the scenario, and giving effective feedback. The interviews were stopped after nine tutors, when the answers began to repeat. Both written and verbal answers were collected from these interviews.
Synthesis of the Literature Review and Interviews: The data from the literature review and the interviews were evaluated together against the domains of SPs’ performance. As a result, the scope and content of the measurement tool to be developed were determined, and nine indicators of performance were identified (Table 1).
Developing Items: An item pool of 18 items was created, with two items assigned to each indicator (Appendix 1), so that the scope of the scale would not narrow if an item was removed as a result of expert opinion or item analysis.
All developed items were positively worded. A five-point Likert-type scale was chosen in consultation with the experts (three medical educators and one measurement-evaluation specialist, excluding the experts who participated in Consulting Expert Validation).
The response anchors of these items were defined as "poor (1)", "fair (2)", "good (3)", "very good (4)", and "excellent (5)". After taking these steps, a draft version of the scale was formed.
Consulting Expert Validation: To obtain opinions on the 18-item draft scale, seven experts working in the field were consulted: four volunteer faculty experienced in using and training SPs, two linguists, and one of the authors, who is a measurement-evaluation specialist. These experts examined the scale items in terms of content, scope, language, comprehensibility, and measurement and evaluation principles using an evaluation form. On the form, the experts rated each item as "applicable," "not applicable," or "needs revision" and then added their recommendations for the item. Based on the recommendations from the experts, six items were excluded, and one item was revised; thus, a 12-item pilot version of the scale was created (Appendix 1).
Preliminary Application: At this stage, the scale was applied to a group of 81 students (different from the participants in Phase 1 and Phase 2), in order to determine the approximate duration of implementation, to correct any incomprehensible items, and to make changes, where necessary. As a result of the preliminary application, no item was misunderstood, none of the items were left unanswered, the instructions were comprehensible, and the evaluation of an SP took three to five minutes.
Pilot Testing: The 12-item pilot form was administered to a large group of medical students (N=702), after which the validity and reliability studies comprising Phase 2 of the study were performed. After the completion of these analyses, the scale was finalized. Relevant findings are presented in the "Data Analysis" section.
In this study, the extended Angoff method was used to determine the cut-off score for the scale. In this method, the experts estimated the number of scale points that they believed borderline examinees would obtain on each item. In this context, the experts determined the borderline level of SP performance for each item in the scale by using the "Standard-Setting Form". The form has two sections, each of which records the borderline level of SP performance for each item. The experts completed one section in the first evaluation session and the other in the second evaluation session, with a discussion held between the two sessions.
At AUSM, the students practice interviews with SPs who are trained to act as patients with conflicts as well as a defined medical and life history. The SPs have regular script training sessions to learn new roles, refresh established roles, and practice giving feedback. Before entering the university SP pool, all SPs signed the “SP Commitment Form”, which includes confirmation that materials related to them could be used for educational or research purposes. During SP encounters, SPs give verbal feedback to the students from the patient’s perspective.
The Ethics Committee of AUSM approved the study. During the communication skills program, the students were informed about the study, and participation was voluntary. Before the interviews with the SPs, two authors delivered a 20-minute rater training session for the volunteer students. The training covered the standards of SPs’ performance, how to assess SPs immediately after the encounter, and how to fill in the scale. Consent was obtained before the interviews. Twenty-five SPs (6 male and 19 female, aged between 32 and 65 years) took part in the study.
The students were asked to respond to the scale immediately after the encounters. Each student evaluated the SP he/she interviewed only once during the communication skills program. Care was taken to ensure that the students completed the scale alone, without any interaction with their peers or others.
First, the experts were trained by one of the authors, who is a specialist in the field of measurement and evaluation. During this training, the aim and methods of standard setting were explained, and information was given about the procedures to be followed. In the next step, the experts discussed the characteristics they considered an SP should have and agreed on the level of competence that a borderline SP should have. In the first evaluation session, the experts were asked to give the score (min: 1, max: 5) that a borderline SP would obtain on each item. This process was carried out individually. They were subsequently asked to share their evaluations with the group and justify them. Then, the experts stated their opinions about each other's evaluations, which was followed by a discussion. In the second session, the experts were asked once again to provide the scores for the borderline SP. The first session, the discussion, and the second session together took approximately one hour. Two rounds are used in the Angoff method to reduce the deviations in the evaluations and to obtain more reliable results: after discussing their first-round evaluations, the participants have the opportunity to change them in the second round if they deem it necessary.
The Validity of the Scale
Construct Validity: Before EFA, the Kaiser-Meyer-Olkin (KMO) measure and Bartlett's test of sphericity were examined to test whether the data were appropriate for factor analysis. Bartlett's test of sphericity was applied to determine whether the correlation matrix differs from the identity matrix. A statistically significant chi-square value in Bartlett's test indicates that the data are appropriate for factor analysis.
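The two checks described above can be illustrated with a minimal sketch. It assumes the standard textbook formulas for Bartlett's chi-square and the overall KMO measure and is not the SPSS procedure used in this study; the function names and the example correlation matrix are hypothetical.

```python
import math

def invert(matrix):
    """Gauss-Jordan inversion of a square matrix; returns (inverse, determinant)."""
    n = len(matrix)
    # Augment the matrix with the identity matrix.
    a = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(matrix)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= a[col][col]
        scale = a[col][col]
        a[col] = [v / scale for v in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0.0:
                factor = a[r][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a], det

def bartlett_sphericity(corr, n_obs):
    """Chi-square statistic and degrees of freedom for Bartlett's test."""
    p = len(corr)
    _, det = invert(corr)
    chi2 = -(n_obs - 1 - (2 * p + 5) / 6) * math.log(det)
    return chi2, p * (p - 1) // 2

def kmo(corr):
    """Overall KMO: squared correlations relative to squared partial correlations."""
    p = len(corr)
    inv, _ = invert(corr)
    r2 = a2 = 0.0
    for i in range(p):
        for j in range(p):
            if i != j:
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                r2 += corr[i][j] ** 2
                a2 += partial ** 2
    return r2 / (r2 + a2)

# Hypothetical 3-item correlation matrix, not data from this study.
example = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
chi2, df = bartlett_sphericity(example, n_obs=100)
overall_kmo = kmo(example)
```

The chi-square value would then be compared with the chi-square distribution for the returned degrees of freedom; that lookup is left out to keep the sketch dependency-free.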
The principal components method was used for factor extraction. The number of factors was determined using the scree plot and parallel analysis. Since the scale had a single-factor structure, no rotation method was used.
In the CFA process, as the data did not satisfy the multivariate normality assumption, the analysis was carried out with the weighted least squares method, and the standardized coefficients, the corresponding t values, and selected fit indices were evaluated. A ratio of the chi-square value to the degrees of freedom below 2.5 indicates perfect fit, as do values above 0.95 for the Non-Normed Fit Index (NNFI), Comparative Fit Index (CFI), Goodness of Fit Index (GFI), and Adjusted Goodness of Fit Index (AGFI). For the Root Mean Square Error of Approximation (RMSEA) and Standardized Root Mean Square Residual (SRMR), values below 0.05 indicate perfect fit.
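The fit criteria listed above can be expressed as a simple lookup, sketched below. The thresholds are those stated in the text; the function name, the dictionary keys, and the example values are hypothetical and are not results of this study.

```python
# Perfect-fit criteria as stated in the text above.
UPPER_BOUNDS = {"chi2_df": 2.5, "RMSEA": 0.05, "SRMR": 0.05}              # value must be below
LOWER_BOUNDS = {"NNFI": 0.95, "CFI": 0.95, "GFI": 0.95, "AGFI": 0.95}     # value must be above

def check_fit(indices):
    """Return {index name: True/False} for whether each index meets its criterion."""
    result = {}
    for name, value in indices.items():
        if name in UPPER_BOUNDS:
            result[name] = value < UPPER_BOUNDS[name]
        elif name in LOWER_BOUNDS:
            result[name] = value > LOWER_BOUNDS[name]
        else:
            raise KeyError(f"no criterion defined for fit index {name!r}")
    return result

# Hypothetical values for illustration only.
report = check_fit({"chi2_df": 2.1, "CFI": 0.97, "RMSEA": 0.06})
```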
Item Discrimination: To assess item discrimination, the scores of the participants in the upper and lower 27% groups were compared for each item using the Mann-Whitney U test. Since the scores given for each item were at the ordinal level of measurement, parametric tests (t-test for independent groups, etc.) were not used in this comparison.
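A minimal sketch of this comparison follows. It assumes that the upper and lower groups are formed from total scale scores and that the p-value comes from the tie-corrected normal approximation without continuity correction; neither detail is specified in the text, and statistical packages may compute the test differently. All function names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes

def mann_whitney_u(x, y):
    """Return (U, two-sided p) using the tie-corrected normal approximation."""
    n1, n2 = len(x), len(y)
    ranks, ties = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in ties) / (n * (n - 1))
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
    if sigma == 0:
        return u, 1.0  # every score is identical, so no difference can be detected
    z = (u - n1 * n2 / 2) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2))

def item_discrimination(responses, fraction=0.27):
    """responses[s][i] is respondent s's score on item i; returns (U, p) per item."""
    totals = [sum(r) for r in responses]
    k = max(1, round(len(responses) * fraction))
    order = sorted(range(len(responses)), key=lambda s: totals[s])
    lower, upper = order[:k], order[-k:]
    return [mann_whitney_u([responses[s][i] for s in upper],
                           [responses[s][i] for s in lower])
            for i in range(len(responses[0]))]
```

An item discriminates when the upper group scores significantly higher on it than the lower group, that is, when the returned p-value falls below the chosen significance level.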
Reliability of the Scale
To assess the reliability of the developed scale, the Cronbach's alpha internal consistency coefficient and the split-half reliability coefficient were calculated. The test-retest reliability coefficient could not be calculated because it was practically impossible to reach the participants twice. Inter-rater reliability was not calculated either, because it would not have been practical with such a large number of medical students evaluating the SPs.
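As an illustration of the two coefficients, the sketch below computes Cronbach's alpha from item and total-score variances and a split-half coefficient. The odd-even split and the Spearman-Brown correction are assumptions made for this example, since the text does not state how the halves were formed; the function names are hypothetical.

```python
import math
from statistics import mean, variance

def cronbach_alpha(responses):
    """responses[s][i] is respondent s's score on item i."""
    k = len(responses[0])
    item_variances = [variance([r[i] for r in responses]) for i in range(k)]
    total_variance = variance([sum(r) for r in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_half(responses):
    """Odd-even split-half reliability with the Spearman-Brown correction (assumed)."""
    first = [sum(r[0::2]) for r in responses]   # items 1, 3, 5, ...
    second = [sum(r[1::2]) for r in responses]  # items 2, 4, 6, ...
    r = pearson(first, second)
    return 2 * r / (1 + r)
```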
Phase 3: Standard Setting of the Scale
The methods commonly used in standard setting can be classified as test-centered and examinee-centered. The Angoff method, a test-centered standard-setting approach, is widely used and practical. With this method, a cut-off score can be determined before the test is administered. In examinee-centered methods (e.g., the borderline regression method and the contrasting groups method), the cut-off score is determined after the test is administered. To use these methods, the experts involved in the standard-setting process need to be familiar with all the SPs because they have to classify them as successful or unsuccessful. Since the experts in this study did not know the SPs well enough, the Angoff method, a test-centered approach, was chosen. In addition, test-centered and examinee-centered standard-setting methods give similar results if applied correctly.
An adaptation of the Angoff method for items with more than two possible scores is called the extended Angoff method. Candidates at the borderline are those on the boundary between sufficient and insufficient performance, that is, those considered barely sufficient. Using the extended Angoff method, the experts estimated the score that a borderline SP would obtain on each item and recorded these estimates. The mean of the experts' estimates was calculated for each item, and the sum of these means gave the cut-off score.
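The cut-off calculation described above can be sketched as follows. The function name and the example ratings are hypothetical and are not the experts' actual estimates; the sketch can be applied to the ratings from whichever session is taken as final.

```python
from statistics import mean

def angoff_cutoff(ratings):
    """ratings[e][i] is expert e's estimate (1-5) of a borderline SP's score on item i.

    Returns the per-item means and their sum, which is the cut-off score.
    """
    n_items = len(ratings[0])
    item_means = [mean(expert[i] for expert in ratings) for i in range(n_items)]
    return item_means, sum(item_means)

# Hypothetical estimates from three experts on a two-item scale.
example_ratings = [
    [3, 4],
    [3, 2],
    [3, 3],
]
item_means, cutoff = angoff_cutoff(example_ratings)
```

An SP whose total scale score falls below the cut-off would then be classified as below the borderline level of performance.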
SPSS 21.0, LISREL 8.7, Excel 2016, and the Monte Carlo PCA for Parallel Analysis program were used for the data analysis.