Main Idea:
The accuracy of MIMIC-model methods for DIF testing was examined when the focal group is small, and the results were compared with those obtained from two-group item response theory (IRT) analysis. A simulation study was used for the comparison, along with an empirical study demonstrating the application of the MIMIC-model methods. In general, accuracy was better for binary item data than for ordinal data. The results support the utility of the MIMIC approach.
Key concepts:
DIF:
Definition: Differential item functioning (DIF) occurs when an item on a test or questionnaire has different measurement properties for one group of people versus another, irrespective of mean differences on the construct. In an IRT context, an item with DIF has a different category response function (CRF) for one group of people versus another.
Types: Uniform DIF occurs when the CRFs for the two groups are different and do not cross; one group is more or less likely to endorse a higher response over the entire range of ability. If the CRFs cross, the DIF is non-uniform.
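A minimal numeric sketch in Python, assuming a 2PL response function and hypothetical parameter values (not taken from the article): equal discriminations with different difficulties give uniform DIF, while different discriminations give CRFs that cross (non-uniform DIF).

import numpy as np

def p_endorse(theta, a, b):
    # 2PL category response function: P(y = 1 | theta)
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)

# Uniform DIF: same a, different b -> focal CRF is shifted and never crosses
uniform_gap = p_endorse(theta, 1.7, 0.0) - p_endorse(theta, 1.7, 0.5)

# Non-uniform DIF: different a -> the CRFs cross and the gap changes sign
nonuniform_gap = p_endorse(theta, 1.7, 0.0) - p_endorse(theta, 0.8, 0.0)

print(np.sign(uniform_gap))      # same sign over the whole range of theta
print(np.sign(nonuniform_gap))   # sign flips where the CRFs cross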
IRT-LR-DIF(IRT-based likelihood-ratio testing for DIF):
Procedures:
Statistically comparing nested two-group item response models with varying constraints to evaluate whether the response function(s) for a particular item differs for the reference and focal groups:
A model with all parameters for the studied item constrained equal between groups is compared with a model with all parameters for the studied item permitted to vary between groups.
The mean and variance of θ are fixed to 0 and 1 (respectively) for the reference group to identify the scale and are estimated for the focal group.
Programme: IRTLRDIF
Evaluation of the result:
The LR test statistic is -2 times the difference between the optimized log-likelihoods and is approximately χ²-distributed with df equal to the difference in the number of free parameters. Statistical significance indicates the presence of DIF.
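A minimal sketch in Python of the comparison step, assuming the constrained and free two-group models have already been fitted (e.g., with IRTLRDIF) and only their optimized log-likelihoods are in hand; the numerical values below are hypothetical.

from scipy.stats import chi2

def lr_dif_test(loglik_constrained, loglik_free, df):
    # G2 = -2 * (LL_constrained - LL_free), referred to a chi-square(df)
    g2 = -2.0 * (loglik_constrained - loglik_free)
    return g2, chi2.sf(g2, df)

# Hypothetical log-likelihoods; freeing a 2PL item's a and b for the
# focal group adds 2 free parameters, so df = 2.
g2, p = lr_dif_test(loglik_constrained=-5231.8, loglik_free=-5226.4, df=2)
print(f"G2 = {g2:.2f}, p = {p:.4f}")   # p < .05 would indicate DIF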
MIMIC models:
Procedures:
Choosing designated anchor items based on preliminary tests in which all other items serve as anchors.
In one model, all studied items can be regressed on the grouping variable, with individual tests of these regression parameters interpreted as DIF tests. Alternatively, each studied item can be tested by comparing a full model that allows DIF in all studied items with a model in which the DIF path is removed for that one studied item.
Programme: Mplus
Evaluation of the result
Item responses are regressed on the grouping variable to test for DIF. There is evidence of DIF if group membership significantly predicts item response, controlling for any mean differences on θ.
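A small sketch of the data-generating structure that a MIMIC uniform-DIF test targets, with hypothetical values: the item depends on θ plus a direct effect of group membership, over and above the group difference in θ. Estimating the actual MIMIC model (latent θ, categorical indicators) would still be done in SEM software such as Mplus.

import numpy as np

rng = np.random.default_rng(1)
n_ref, n_foc = 500, 100                     # reference / focal sample sizes
group = np.r_[np.zeros(n_ref), np.ones(n_foc)]
theta = np.r_[rng.normal(0.4, 1.0, n_ref),  # group mean difference on theta
              rng.normal(0.0, 1.0, n_foc)]

a, b = 1.7, 0.0     # hypothetical discrimination and difficulty
beta_dif = 0.6      # hypothetical direct group effect = uniform DIF

logit = a * (theta - b) - beta_dif * group
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
# In the MIMIC analysis, y is regressed on the latent factor and on group;
# a significant group coefficient (with anchors fixing the scale) is DIF.
print(y[group == 0].mean(), y[group == 1].mean())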
Simulation study:
Number of items: 6, 12, or 24
Binary:
Focal-group:
Sample size: 25, 50, 100, 200, or 400
θ ~ N(0, 1)
a ~ N(1.7, 0.3)
b ~ N(0, 1)
Reference-group:
Sample size: 500 or 1,000
θ ~ N(0.4, 1)
Ordinal:
Focal-group:
Sample size: 50, 100, 200, or 400
θ ~ N(0, 1)
Reference-group:
Sample size: 500 or 1,000
θ ~ N(0.4, 1)
a ~ N(1.7, 0.6)
b ~ N(-0.4, 0.9) (first reference-group threshold)
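A hedged sketch of the binary-condition data generation implied by the design above, assuming a 2PL generating model and that, for non-DIF items, the listed a and b distributions apply to both groups; the group sizes shown are one of the listed conditions.

import numpy as np

rng = np.random.default_rng(2024)
n_items, n_ref, n_foc = 12, 500, 100            # one of the listed conditions

a = rng.normal(1.7, 0.3, n_items)               # item discriminations
b = rng.normal(0.0, 1.0, n_items)               # item difficulties
theta = np.concatenate([rng.normal(0.4, 1.0, n_ref),    # reference group
                        rng.normal(0.0, 1.0, n_foc)])   # focal group

logit = a[None, :] * (theta[:, None] - b[None, :])
responses = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
print(responses.shape)   # (600, 12) person-by-item binary data matrix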
Results:
With small NF, tests of uniform DIF with binary or five-category ordinal responses were more accurate with MIMIC models than with IRT-LR-DIF. With larger NF, IRT-LR-DIF has been found to perform more accurately than it did in the present study.
For all values of NF, Type I error was well below the nominal α level, and power was greater for the MIMIC approach than for IRT-LR-DIF.
An important limitation of MIMIC methods is that they cannot test for non-uniform DIF.
Future Study:
To evaluate the extent to which the differences in hit rates and false positives between MIMIC models and IRT-LR-DIF translate into practical consequences for score interpretation.
To determine how well these results and recommendations