Assessing Differential Item Functioning (DIF) for the Pearson Test of English (PTE): A Study of Test Takers with Different Fields of Study

Differential Item Functioning (DIF), a statistical feature of an item that signals unexpected item performance on a test, occurs when different groups of test takers with the same level of ability perform differently on a single test. The aim of this paper was to examine DIF on the Pearson Test of English (PTE) test items. To that end, 250 intermediate EFL learners aged 26 to 36 from two different fields of study (125 Engineering and 125 Sciences) were randomly chosen for the analysis. The Item Response Theory (IRT) Likelihood Ratio (LR) approach was utilized to find items showing DIF. The scored items of the 250 PTE test takers were analyzed using the IRT three-parameter model, comprising item difficulty (b parameter), item discrimination (a parameter), and pseudo-guessing (c parameter). The results of the independent samples t-test for comparing the means of the two groups showed that Science participants performed better than the Engineering ones, particularly in the Speaking & Writing and Reading sections. The PTE test was statistically easier for the Science students at the 0.05 level. Linguistic analyses of the DIF items also confirmed the findings of the quantitative part, indicating a far better performance on the part of the Science students.


Introduction
The growth of psychometric tests and testing procedures has been affected by social and political fluctuations within the past few decades (Owen, 1998). When psychometric tests are used to perform individual or group comparisons, item bias ought to be considered to lessen unfitting interpretations. Test bias differs from test fairness in that bias is usually measured quantitatively, while fairness is judged subjectively and intuitively and cannot be described in absolute terms; no one can categorize a test as simply fair or unfair. It can be taken as read that it is not the test characteristics that are significant on their own but the interpretations of the scores and the resulting decisions that are of overriding significance, as students' educational futures are usually determined by these decisions.
The term bias pertains to the instruments applied, the testing procedures, and the methods of score interpretation. Score differences between two groups do not by themselves define bias (Osterlind, 1983). The term bias has largely been superseded by differential item functioning (DIF), which denotes that individuals who are parallel in their level of ability perform differently on a test and accordingly gain different scores. Test bias or DIF is concerned with systematic errors and discloses psychometric characteristics of items showing that the items do not measure impartially across different individuals or groups. In actual fact, DIF arises when "individuals from various classes have the similar ability level but display different likelihood in responding to an item accurately" (Osterlind, 1983, p. 32). Basically, non-DIF represents the situation in which test takers with an analogous level of ability, irrespective of their in-group differences, have the same probability of answering an item correctly. DIF deals with the extent to which test items differentiate between participants of the same ability level who come from different groups defined by gender, ethnicity, education, etc. (Zumbo, 2007).
Parameters contributing to item/test bias are "culture, education, language, socioeconomic status, and so on" (Van de Vijver, 1998, p. 35). Test bias or DIF should be evaluated and calculated during the test construction process (Osterlind, 1983). Tests ought to be constructed in such a way that, when inconsistency in examinees' test results is observed, the discrepancy can be attributed to differences in the construct the test is intended to assess. By detecting and eliminating items demonstrating DIF, alongside item analysis, test developers find problematic items lacking sound psychometric properties. This paper investigated item analysis of the PTE, an internationally recognized proficiency test, by means of an item response theory (IRT) based DIF study.

Methods of DIF identification
Finding items demonstrating DIF permits test developers to match examinees with the pertinent knowledge. DIF is concerned with students' scores on tests, the measurement of their latent ability, and the examination of individuals who are analogous with reference to their level of capability but come from different backgrounds yet do not perform identically on an item. The Mantel-Haenszel χ² test (Mantel and Haenszel, 1959) is used for detecting DIF; it suits even small numbers of participants and empowers test makers to utilize simple arithmetic measures, in line with the logistic regression methods proposed by Zumbo (2007). These modest arithmetic procedures offer a more in-depth explanation of DIF and permit researchers to distinguish between uniform and non-uniform DIF. Other procedures to detect DIF employ IRT models, as stated by Lord (1980), Raju (1990), and Thissen, Steinberg, and Wainer (1994). These methods deal with examinees' ability and item characteristics more accurately but require larger sample sizes. Among these models, IRT is used most by researchers to spot items flagging DIF, as these models "render the most useful data for identifying differences on particular items" (Ertuby, 1996, p. 51).
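As a rough illustration of the logistic-regression DIF procedure attributed to Zumbo (2007) above, the following Python sketch tests a single item for uniform and non-uniform DIF by comparing nested logistic models. It is a minimal sketch, not the analysis used in this study: the variable names (item, total, group) and the use of the total score as the matching variable are illustrative assumptions.

```python
# Minimal logistic-regression DIF sketch in the spirit of Zumbo (2007).
# Hypothetical inputs: `item` is a 0/1 NumPy array of scores on the
# studied item, `total` is the total test score (ability proxy), and
# `group` is a 0/1 field-of-study indicator.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def logistic_dif(item, total, group):
    """Return p-values for uniform and non-uniform DIF on one item."""
    base = sm.Logit(item, sm.add_constant(total)).fit(disp=0)
    unif = sm.Logit(item, sm.add_constant(
        np.column_stack([total, group]))).fit(disp=0)
    nonu = sm.Logit(item, sm.add_constant(
        np.column_stack([total, group, total * group]))).fit(disp=0)
    # Adding `group` tests uniform DIF; adding the ability-by-group
    # interaction on top of that tests non-uniform DIF (1 df each).
    p_uniform = chi2.sf(2 * (unif.llf - base.llf), df=1)
    p_nonuniform = chi2.sf(2 * (nonu.llf - unif.llf), df=1)
    return p_uniform, p_nonuniform
```

A significant p-value on the group term flags uniform DIF; a significant p-value on the interaction term flags non-uniform DIF.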

Models of Item Response Theory (IRT)
Most measurement procedures, particularly in the fields of education and psychology, deal with latent variables (Hambleton, 1996). The chance of answering correctly hinges on both the item characteristics and the examinee's level of ability. Such a relationship is mathematically stated as the item characteristic curve (ICC). Any ICC ought to predict examinees' scores based on their underlying abilities, and it is also known as the item response function. Examinees' ability levels are shown along the X-axis and represented by theta (θ), while the probability of responding to items correctly is demonstrated on the Y-axis and shown by P(θ). As Baker (1985) proposed, the ICC shape rests on the item difficulty (b-parameter), item discrimination (a-parameter), and guessing power, known as the pseudo-chance parameter (c-parameter). Depending on their horizontal location, ICCs may vary, locating individuals' ability levels against items' difficulty. The b-parameter marks the point on the ability scale at which the likelihood of selecting the right answer is 0.50 (i.e., a 50 percent chance of a correct response). Larger b-values stand for more difficult items; in theory, b ranges from −2.5 to +2.5, from very easy items to very tough ones.
Item discrimination (a-parameter) displays the slope of the ICC and the accuracy of measurement of a given item. The curve slope and item discrimination are positively related, in the sense that a steeper slope shows greater discriminating power of an item. The a-value ranges from 0 to 2; items below 0.5 have little discriminating power, while items with larger discrimination power can better differentiate among individuals. The guessing power (c-parameter) displays the probability that a test taker with the lowest level of ability answers the item accurately; the c-parameter ranges from 0 to 1. IRT models differ with respect to the item properties they involve. The one-parameter or Rasch model has to do only with item difficulty and the ability level of examinees. The two-parameter model deals with item discrimination as well as item difficulty (the probability of a correct response based on the examinee's ability level). The third, pseudo-chance parameter is used when items have a multiple-choice format and examinees can get the correct response by guessing. IRT models assume unidimensionality and local independence, and they are based upon the shape of the ICC and the examinees' level of ability.
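To make the roles of the three parameters concrete, the following short Python sketch evaluates the 3PL item characteristic curve described above; the parameter values are illustrative only.

```python
import numpy as np

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL item characteristic curve: probability of a correct response
    at ability level theta, given discrimination a, difficulty b, and
    pseudo-guessing c. D = 1.7 scales the logistic curve toward the
    normal ogive."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# The curve rises from roughly c (low ability) toward 1.0 (high ability),
# passing through c + (1 - c)/2 at theta = b.
print(icc_3pl(np.array([-3.0, 0.5, 3.0]), a=1.2, b=0.5, c=0.2))
```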
Non-uniform vs. Uniform DIF
With regard to the logistic regression model, DIF usually has two distinct categories: uniform and non-uniform. Uniform DIF affects participants at all ability levels equally, meaning that the two groups' ICCs have the same shape but are shifted relative to each other. De Beer (2004) notes that in uniform DIF the likelihood of one group picking the correct answer is consistently less than that of the other group; in his view, the ICC of one class of testees therefore lies entirely below that of the other group, as illustrated in Fig. 1.
When two groups differ in their slopes, the item shows non-uniform DIF. In other words, in non-uniform DIF the ICCs have different shapes for the different groups of examinees, and the DIF influences examinees inconsistently across the ability range. De Beer (2004, p. 42) states that "the ICC shapes cross at a given point implying that one group has a lesser possibility to answer the test items accurately while such possibility for the other group was still higher". Fig. 2 shows the ICC shape for an item demonstrating non-uniform DIF.
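Building on the icc_3pl helper sketched earlier, the following lines contrast the two DIF types numerically: shifting only the b parameter yields curves that never cross (uniform DIF), while changing the a parameter yields crossing curves (non-uniform DIF). All parameter values are illustrative.

```python
import numpy as np
# Reuses icc_3pl from the earlier sketch; all parameter values are illustrative.
theta = np.linspace(-3, 3, 121)

# Uniform DIF: same a and c, different b -> the focal group's curve lies
# below the reference group's curve at every ability level (no crossing).
p_ref = icc_3pl(theta, a=1.0, b=0.0, c=0.2)
p_foc = icc_3pl(theta, a=1.0, b=0.7, c=0.2)
print((p_ref > p_foc).all())  # True: curves never cross

# Non-uniform DIF: different a -> the curves cross, so which group is
# advantaged depends on where on the ability scale one looks.
q_ref = icc_3pl(theta, a=1.5, b=0.0, c=0.2)
q_foc = icc_3pl(theta, a=0.6, b=0.0, c=0.2)
diff = q_ref - q_foc
print(diff.min() < 0 < diff.max())  # True: the sign of the gap flips
```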
A DIF analysis for test takers with various language backgrounds, namely Chinese and Spanish, was conducted by Chen and Henning (1985). They employed Transformed Item Difficulty (TID), first presented by Angoff (1993); TID compares the item difficulty indices between two groups of test takers and identifies outliers. One hundred eleven test takers, comprising seventy-seven Chinese and thirty-four Spanish speakers, took part in the research; nevertheless, the sample was not ample enough for the difficulty parameter to be consistently measured. Lawrence, Curley, and McHale (1988) studied DIF on an achievement test (FCAT) by taking language, gender, and ethnicity into account, and figured out that vocabulary and phraseology items favored non-English language learners irrespective of gender and ethnicity. Aryadoust and Zhang (2015) applied a Rasch model to a test of reading comprehension in a Chinese context; they found that while one class performed better on vocabulary, grammar, and general English proficiency, the other class surpassed it in the skimming and scanning parts. The results of most prior studies showed that gender had a trivial impact on readers' performance (Hong & Min, 2007; Chen & Jiao, 2014). Federer, Nehm, and Pearl (2016) explored differences in the way male and female participants answered open-ended questions and found that women performed better under novel circumstances. In another study focusing on evolution, Smith (2016) built an instrument dealing with evolution theory and succeeded in distinguishing between high school and university students using items flagging DIF.

The Current Study
The present paper aimed at finding and identifying the items that were susceptible to DIF, as well as determining which field of study was advantaged on those items. Most DIF investigations are based upon comparisons between genders (e.g., Lawrence, Curley, & McHale, 1988; Carlton, 1992; Federer et al., 2016), ethnicities (Schmitt, 1990; Koo, 2014), or languages (Chen & Henning, 1985; Ryan & Bachman, 1992), whereas far fewer have considered test takers' fields of study. In line with the purposes of the study, the researchers applied one instrument, as follows:

Pearson Test of English (PTE)
Pearson Language Tests is devoted to measuring and validating the English language proficiency of non-native English speakers. The tests comprise the Pearson Test of English (PTE) Academic, PTE General, and PTE Young Learners, administered in association with Edexcel, the world's largest examining body. In 2009, Pearson Language Tests introduced PTE Academic, which is recognized by the Graduate Management Admission Council (GMAC). The test score has been aligned with the levels defined in the Common European Framework of Reference for Languages (CEFR). PTE Academic is delivered through the Pearson VUE (Virtual User Environment) centers, which are also in charge of administering the GMAT (Graduate Management Admission Test). Upon its launch, it was accepted by nearly 6,000 organizations; as a case in point, the test is accepted by the UK Border Agency and the Australian Department of Immigration and Citizenship for visa applications. The test is mostly scored by computer rather than by a human marker to decrease the waiting time for results.

Data Collection Procedures
The researchers requested the PTE candidates to provide them with the report card of their score in each section as well as their total scores. In addition, the scores on each item were collected and used for data analysis. The scores for each part were calculated from the correct responses, with no negative marking for wrong answers. During the administration of the PTE test, the usual precautions were observed:
1. Strict administration procedures were followed to minimize the effects of external factors such as cheating.
2. The same ID details shared when booking the test had to be presented by the test taker on the day of the test.
3. The name on the ID had to exactly match the name used when booking the test.
4. Test takers failing to produce the required ID were not allowed into the test room and forfeited their test fee.
5. Copies were not accepted; the original document had to be provided, and no other ID was accepted at the test center.

Design
In view of the fact that the researchers could not manipulate or control the independent variable, the design of this study was ex post facto, as described by Hatch and Farhady (1982). Such a design is normally utilized when there is no interference on the part of the researchers with the participants' traits. This study comprised the test takers' fields of study as the independent variable and their PTE test scores as the dependent variable.

Data Analysis Procedures
The PTE scored items of the two hundred and fifty Iranian EFL test takers were entered into the IRT 3PL model, which gives the probability that a test taker with an ability of theta (θ) answers an item correctly, as a function of item difficulty (b parameter), item discrimination (a parameter), and pseudo-guessing (c parameter) (Hambleton, Swaminathan, & Rogers, 1991). In standard 3PL notation, this relationship is

$$P(x = 1 \mid \theta) = c + (1 - c)\,\frac{1}{1 + e^{-Da(\theta - b)}}$$

where x is an item response, θ is the estimated ability, a is item discrimination, b is item difficulty, c is the pseudo-guessing parameter, D is a scaling factor (= 1.7) devised to approximate the logistic IRT model to the cumulative normal curve, and e is a transcendental number whose value is 2.718. However, because the c parameter is often poorly estimated, a prior distribution (M = 0.2, SD = 1; Thissen, 1991) was applied; Thissen, Steinberg, and Wainer (1988) proposed applying a prior to the c parameters whenever DIF is studied using the 3PL IRT model. The IRT LR is a model-based approach that compares a model in which all parameters are constrained to be equal across groups (hence, no DIF) with an augmented model permitting the parameters to vary freely across groups. Using the likelihood ratio goodness-of-fit statistic, G², the fit of each model to the data is estimated, and the difference in G² between the two models is tested against the chi-square distribution. In this way, item discrimination (the a parameter), item difficulty (the b parameter), and G² were evaluated by means of chi-square probabilities. If both parameters are invariant across groups, the item shows no DIF; if the b parameter varies significantly while the a parameter is constant, the item shows uniform DIF; and if the a parameter varies across groups, the item shows non-uniform DIF regardless of the b parameter.
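Because the full 3PL estimation is carried out by dedicated IRT software, only the likelihood-ratio decision itself is illustrated in the minimal Python sketch below. The log-likelihood values are placeholders, not estimates from the PTE data, and the function name irt_lr_test is hypothetical.

```python
# Sketch of the IRT-LR comparison described above (Thissen, Steinberg,
# & Wainer, 1988): the constrained model fixes an item's parameters to
# be equal across groups, the augmented model frees them. The
# log-likelihood values below are placeholders, not PTE estimates.
from scipy.stats import chi2

def irt_lr_test(ll_constrained, ll_free, n_freed_params):
    """G2 = 2 * (logL_free - logL_constrained), compared to a chi-square
    distribution with df equal to the number of freed parameters."""
    g2 = 2.0 * (ll_free - ll_constrained)
    return g2, chi2.sf(g2, df=n_freed_params)

# Freeing a, b, and c for one studied item gives df = 3 (placeholder values):
g2, p = irt_lr_test(ll_constrained=-4210.7, ll_free=-4203.2, n_freed_params=3)
print(f"G2 = {g2:.2f}, p = {p:.4f}")  # a significant p flags the item for DIF
```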

The Outcomes of the Research Question
The results of the DIF investigations under the IRT 3PL LR model are shown in Tables 2, 3, and 4, following the procedure of Thissen, Steinberg, and Wainer (1988). As Table 3 indicates, five items (3, 10, 13, 14, and 15) were found to display DIF at the 0.05 significance level.

Comparing the Two Groups Based on Descriptive Statistics
To discover which group (Engineering vs. Sciences) performed better on each part of the exam and on the whole test, an independent samples t-test for comparing the means of the two groups was carried out.
As Tables 5 and 6 demonstrate, comparing the mean score of the Science test takers (M = 45.52, SD = 11.11) with that of the Engineering test takers (M = 35.55, SD = 13.38) shows that the Science test takers outperformed the Engineering ones. It can be inferred that the exam was statistically easier for the Science test takers at the 0.05 level.
Table 5. Descriptive Statistics for the Comparison of the Two Groups (Engineering vs. Sciences) in the Three Parts of the PTE
In the meantime, the descriptive statistics and reliability estimates for the data sample (n = 250) on the PTE total test as well as its three sections are given in Table 7. As presented in Table 7, the PTE proved to be a highly reliable test: the reliability estimates for the whole test and for the Listening, Speaking & Writing, and Reading parts were .95, .88, .82, and .93, respectively.
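For reference, the group comparison reported above can be reproduced directly from the summary statistics given in this section (means, standard deviations, and 125 test takers per group); a minimal Python sketch using SciPy, assuming equal variances:

```python
# Independent samples t-test recomputed from the reported summary
# statistics (Science: M = 45.52, SD = 11.11; Engineering: M = 35.55,
# SD = 13.38; n = 125 per group), assuming equal variances.
from scipy.stats import ttest_ind_from_stats

t, p = ttest_ind_from_stats(mean1=45.52, std1=11.11, nobs1=125,
                            mean2=35.55, std2=13.38, nobs2=125)
print(f"t = {t:.2f}, p = {p:.4f}")  # p < .05: Science outperformed Engineering
```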

Linguistic Analyses of Differential Item Functioning Items
In order to better understand the results of the DIF analyses, the researcher undertook an investigation of the linguistic features of these items. The goal of this linguistic analysis was to determine whether the DIF findings between pairs of fields of study could be explained by variances across the study fields in one of the following: the linguistic differences or similarities between the two fields of study, i.e., Engineering and Science.
The approaches or methods that are frequently utilized to teach English in the two fields of study.
Consistent with the previously mentioned procedures for the DIF analyses, the results from the linguistic analyses are organized first by section of the assessment and then by pairwise field-of-study analyses within each section. Some items appeared to create bias in assessing the test takers' proficiency. Nevertheless, such inconsistencies were not that great, denoting that the difficulty level of the items was not the same for the two groups of examinees in different fields of study. As already confirmed by Zumbo (2007), these discrepancies among examinees' performance may be linked to some prevailing covariates. In this study, almost twenty percent of the original questions were ultimately flagged as items showing differential item functioning; they need to be discarded from the test's next administration. These findings contrast with the general international results proposed by McBride (1997), who believes one third of the original items need to be deleted in any test. The findings of this research are in line with earlier studies in which speaking, vocabulary, listening, and reading were found to cause disparities among examinees' performance and to cause DIF (Grabe, 2009; Koda, 2005; see also Tittle, 1982; Clauser, 1990). Sections such as Speaking & Writing and Reading play a pivotal role in any language proficiency test, and it is therefore worthwhile to dedicate further time and energy in the learning context to teaching these parts more systematically. Learners should be assisted to better appreciate the implications and importance of these factors and to do their best to improve these skills. This study has implications for PTE test developers and those who take the test: the former are highly recommended to conduct more studies to identify items that may flag DIF and to heed the researchers' findings in this regard, and the latter can be assured that the test scores are not biased against any specific type of examinee. Nonetheless, given that gender is also a contributing factor, it is recommended to perform a post hoc study to inspect the influence of gender.

Ethics approval: No approval of research ethics committees was required to accomplish the goals of this study because the work was conducted with data collected from participants who agreed to their contribution.

Consent to participate:
Informed consent was obtained from all individual participants included in the study.
Author's contribution: There is only a single author for the present study, so the corresponding author is responsible for the entirety of this paper.

Data Availability Statements
The datasets analyzed during the current study are not publicly available because they are confidential company data of the Pearson Test of English.
Fig. 2. Non-uniform DIF item (Adopted from De Beer, 2004)