Assessing Physician Performance Using Multisource Surveys: Do Biases Exist Due to Gender, Country of Training, Native Language, and Age?

Background: With a growing number of foreign-trained physicians joining the United States workforce, there is a need to assess their job performance fairly. The purpose of this study was to explore the fairness of a multisource competency assessment on U.S.- and non-U.S.-trained physicians. Methods: We conducted a non-experimental retrospective analysis on physicians working in the United States (n=258) who participated in a physician assessment and education program. Results: There were no significant differences in performance outcomes based on demographic differences, either at the scale level (teamwork, motivating or discouraging behaviors, technical practice, and patient interactions) or at the item level. Conclusions: The PULSE 360 is a powerful tool that can be used to evaluate physician performance without bias due to demographic differences, including gender, country of physician medical training, physician native language, and age.


Background
In 2017, foreign-trained physicians made up over 25% of practicing physicians in the United States. 1 Given their substantial contributions to the U.S. healthcare system, 2,3 fairly assessing their performance is critical. Policy suggestions have been developed recently to promote the benefits of employing international health care workers, such as treating them transparently and fairly. 4 However, how the performance of U.S.-trained physicians (USTPs) compares to that of non-U.S.-trained physicians (NUSTPs) is not well understood. 5 Exploring soft skills like professionalism, interpersonal and communication skills, teamwork, and patient interactions is critical, and physician competency assessments should be fair regardless of demographic differences like age, sex, and nationality. The purpose of this paper is to explore the fairness of a multisource competency assessment program when evaluating U.S.- and non-U.S.-trained physicians working in the United States. While previous studies have explored where physicians were educated or trained (international medical graduates), in this study, NUSTP refers to physicians who completed their residencies outside the United States.

Assessing Physician Competence
Fair assessment of physician competencies is a key step in improving job performance. Reasons for evaluating physician performance range from appraisal to recertification, identifying high-risk physicians, and remediating those with a previous history of poor performance. 6 Maintaining quality patient care is important because 6-12% of physicians are referred to remediation for poor clinical skills. 7 The Institute of Medicine estimates that physician dyscompetency is one contributor to preventable medical errors at an estimated cost of $17 billion. 7,8 To determine which physicians should be referred for dyscompetency, one model for performance remediation starts with an assessment of the physician's competence. 9 A common framework for maintaining physician competency is the Maintenance of Certification program developed by the American Board of Medical Specialties (ABMS MOC). This four-part framework includes maintaining licensure, lifelong learning, cognitive expertise, and quality improvement. 10

Multisource Feedback
Multisource feedback is the use of physicians' team members (other physicians, nurses, and staff) to evaluate job performance. 11,12 The scope and depth of multisource feedback are valuable given the argument that patient evaluations of physician performance are subjective at best. 13 For example, patient evaluations have been found to be influenced by the race and gender of the physician such that only physicians who were white and male benefited from a customer satisfaction judgment, even after controlling for objective measures of performance. 14 Beyond clinical skills, physician performance is based on a combination of individual differences including specialty area, gender, and age. 15,16 Evidence suggests that biases against international medical graduates may lead to more complaints against physicians and disciplinary outcomes, 17 but findings on biased physician performance evaluations are mixed. 18 Given the inconclusive evidence, having two examiners appears to mitigate potential sex or ethnic biases against physicians who are being evaluated based on their clinical performance. 19 The PULSE 360 Program has been conducting multisource feedback assessments for physicians since 2002, with over 15,000 unique healthcare professionals participating in the program and receiving over a million completed surveys of feedback. The original PULSE 360 Survey was developed with the help of subject matter experts (SMEs), including senior physician leaders and other healthcare administrators, to determine the behaviors that they believed were most relevant to physicians being able to succeed as part of an interdisciplinary healthcare team. This led to the creation of over 100 behavioral rating items, which have been revised through years of item analyses and outcome studies into the most commonly used survey today, which consists of 25 behavioral items.
PULSE conducts assessments for academic medical centers, community hospitals, and other healthcare organizations throughout the US and Canada. The PULSE 360 family of surveys has been used in several scholarly research projects showing strong validity and reliability in evaluating physicians' performance as part of a healthcare team.
Some research has explored the use of multirater assessments on international medical graduates and found them reliable, 20 but little research has examined bias in physician assessment as a function of training country (i.e., USTPs versus NUSTPs). Of the research on assessing physician performance, one experiment found that after holding education, experience, and personality constant, international medical graduates were rated more poorly than those who had been born in the prospective patients' home country. However, physicians who had been trained in an industrialized and high-income country benefited on their evaluations. 21 There are no significant differences in mortality rates for international versus national practitioners, but differences may exist in regard to the soft skills of communication, teamwork, and ethical issues. 22-24 Part of this bias may be a function of the examiners themselves. 25 In one study, international medical graduates had lower mortality rates than U.S. medical graduates. 26 Further, there is evidence that in Canada, international medical graduates are disciplined for misconduct more frequently than North American medical graduates. 27 In Australia, international medical graduates receive more complaints and disciplinary adverse findings. 17 Thus, there is a critical need for fair tools to evaluate USTPs and NUSTPs on their job performance. Given the mixed findings on the effects of gender, country of training, language, and age on performance, the following is predicted:
Hypothesis 1: There will be significant differences in PULSE 360 physician performance (as rated by colleagues) based on gender such that females will have higher scores.
Hypothesis 2: There will be significant differences in PULSE 360 physician performance based on country where training occurred such that USTPs will have higher scores.
Hypothesis 3: There will be significant differences in PULSE 360 physician performance based on first language spoken such that native English speakers will have higher scores.
Hypothesis 4: There will be significant differences in PULSE 360 physician performance based on age such that younger physicians will have higher scores.

Methods
Design
A non-experimental retrospective analysis of data was conducted for two hundred and fifty-eight physicians (n=258) who participated in a physician education program focused on fitness for duty.

Statistical Analyses
For hypotheses 1-4, we conducted a post-hoc power analysis using G*Power v. 3.0.10 to determine the sample size needed to detect significant effects. 28 Given a one-tailed independent samples t-test with a medium effect size (d = 0.5, α = .05), the preferred sample size is 176 participants. Independent samples t-tests using a Bonferroni-adjusted alpha of 0.0125 (.05/4) were conducted in order to evaluate potential biases in PULSE 360 scale scores due to demographic differences, including gender, country in which training occurred, first language spoken, and age; see Tables 1-4.
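The two calculations in this paragraph can be sketched as follows. This is an illustration only, not the authors' procedure: the power level of .95 is an assumption (the text does not report one), the function names are hypothetical, and the sample size uses a normal approximation rather than the exact noncentral t computation that G*Power performs, so it comes out slightly below the reported 176.

```python
import math
from statistics import NormalDist


def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test alpha that keeps the family-wise error rate at family_alpha."""
    return family_alpha / n_tests


def n_per_group_normal_approx(d: float, alpha: float, power: float) -> int:
    """Approximate per-group n for a one-tailed, two-sample comparison.

    Normal approximation: n = 2 * (z_(1-alpha) + z_(power))^2 / d^2.
    The exact t-based answer is slightly larger.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Four hypotheses tested at a family-wise alpha of .05.
adjusted_alpha = bonferroni_alpha(0.05, 4)

# d and alpha are taken from the text; power = 0.95 is an ASSUMPTION.
n_group = n_per_group_normal_approx(d=0.5, alpha=0.05, power=0.95)
total_n = 2 * n_group

print(adjusted_alpha, n_group, total_n)
```

Under these assumptions the approximation gives a total in the mid-170s, in the same range as the figure reported above. The sample of 258 physicians exceeds either number.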

Participants
Eighty percent (80%) of the physicians were male (n=206), 62% were trained in the United States (n=157), 66% had English as their first language (n=166), the average physician age was 61 (range 41-84), and 78% were board certified (n=202). Age was median split into two groups: 1) 62 or younger (n=126, 49%) and 2) 63 or older (n=125, 48%); age data were missing for n=7 (3%). There were thirty (n=30) different specialties represented within the sample of physicians, including internal medicine, obstetrics and gynecology, and anesthesiology. All physicians in the sample were participating in an assessment conducted through a physician education program at an academic medical center in the United States.
Physicians had been referred to this program by a healthcare organization for a fitness for duty evaluation.

PULSE 360 Survey
The PULSE 360 Survey is an assessment of leadership, teamwork, communication, professionalism, and other physician behaviors based on multisource feedback from other physicians, advanced practice providers, clinical staff, and administrative staff members who interact with a physician. The survey used with the current sample included n=25 items and is made up of 5 performance domains, including a total composite performance score known as the Teamwork Index (TI) Score. All items are scored using a 5-point Likert-type scale regarding the extent to which raters perceive a physician engages in a target behavior. The internal consistency reliability estimates for all performance domains are as follows: TI Score (25 items) α = .92, Motivating Behavior Score (9 items) α = .85, Discouraging Behavior Score (7 items) α = .84, Technical Practice Score (5 items) α = .82, and Patient Interaction Score (4 items) α = .79. TI scores typically range from 0 to 100 with a national mean score of 68.9 for physicians. The other scale scores range from 20 to 100. Prior research has demonstrated both the internal and external validity of PULSE 360 scores in relation to important physician outcomes such as malpractice risk and patient satisfaction. 11,29-34

Data Analyses
The PULSE 360 survey item data are collected at the ordinal level of measurement, while scale scores created from these data are interval-level data. We opted to perform parametric analyses (independent samples t-tests) because the observed data demonstrated an approximately normal distribution. However, we also conducted non-parametric chi-square comparisons of the expected distribution of scores for each hypothesis given the ordinal nature of the item-level data. We report the results of the parametric analyses only because the non-parametric analyses produced the same results/conclusions at both the scale and item level.
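As a minimal sketch of the parametric comparison described above, the following computes a pooled-variance independent samples t statistic and Cohen's d for two groups. The scores are hypothetical and are not study data. The pooled (equal-variance) form is an assumption, since the text does not say whether equal variances were assumed, and the function name is illustrative only.

```python
import math
from statistics import mean, variance


def pooled_t_test(group_a, group_b):
    """Return (t, df, cohens_d) for a pooled-variance two-sample t-test."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    diff = mean(group_a) - mean(group_b)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = diff / standard_error
    cohens_d = diff / math.sqrt(pooled_var)
    return t_stat, df, cohens_d


# HYPOTHETICAL scale scores for two demographic groups (not study data).
scores_group_1 = [70.0, 72.0, 68.0, 74.0, 66.0]
scores_group_2 = [69.0, 71.0, 67.0, 73.0, 65.0]

t_stat, df, d = pooled_t_test(scores_group_1, scores_group_2)
print(t_stat, df, d)
```

The resulting t statistic would then be converted to a p-value using the t distribution with df degrees of freedom and compared against the chosen alpha level. That step is left out here because it needs a statistics library beyond the Python standard library.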

Results
Hypothesis 1 was not supported; there were no significant differences in the mean PULSE 360 scores for male vs. female physicians (see Tables 1-4 for independent samples t-test results). Hypothesis 2 was not supported; there were no significant differences in the mean PULSE 360 scores for US-trained vs. foreign-trained physicians. Hypothesis 3 was not supported; there were no significant differences in the mean PULSE 360 scores for native English speakers vs. non-native English speakers. Hypothesis 4 was not supported; there were no significant differences in the mean PULSE 360 scores between age groups (see Table 4). Additionally, all post hoc item-level comparisons yielded no significant differences in mean scores for any of the comparison groups mentioned in Hypotheses 1-4.

Discussion
This study compared the performance of USTPs and NUSTPs using a multisource assessment tool. Given the growing demand for NUSTPs in the United States and potential bias in how their performance is assessed, there is a need to fairly select, train, and support this diverse group of international physicians. 34 Physicians in this study were evaluated on their performance using the PULSE 360 Survey and were compared across gender, country of training, native language, and age. There were no significant differences in their reported performance on professionalism, teamwork, motivating behaviors, discouraging behaviors, technical practice style, or patient interactions. These findings suggest that there are valid and reliable tools established to fairly evaluate the performance of both U.S.-trained physicians and foreign-trained physicians. This is valuable given that previous research has found that some measures discriminate against some protected classes. 17 The physicians in our current sample were recruited to the physician assessment and education program for a variety of reasons that may not be representative of practicing physicians in the United States. However, the use of multisource data allows us to move beyond self-reported performance to a more comprehensive view of physicians' performance on the job. Further research will be needed to more thoroughly explore these findings, but at least within our sample, there were no significant variations in feedback scoring patterns attributable to protected class membership.

Conclusions
The use of multisource feedback can provide a more comprehensive and unbiased assessment of others' perceptions of physician behavior and performance within the healthcare team than traditional single source methods of feedback.

Declarations
Ethics approval and consent to participate: Because this study used de-identified archival data, it was deemed unnecessary to ask participants for informed consent. This study was approved by the UNK IRB #041320-2.
Consent for publication: Not applicable.
Availability of data and materials: The dataset used and analyzed during the current study is available from the corresponding author on reasonable request.