There are some assumptions underlying the common DIF detection methods, which define DIF as the presence of differences in the probability of a correct response between two manifest groups (e.g., gender, ethnicity), conditional on the latent trait. The first assumption is that the manifest groups are homogeneous with respect to the items; that is, the items do not exhibit DIF within the groups. The second assumption is that the manifest variable is related to the source of the DIF; in other words, group membership is the source of the DIF.
However, the grouping variable may not be manifest, as gender is. Rather, it may be unobserved to varying degrees, like the innate strategy used to solve mathematics problems. In such cases the mixture IRT model, which treats group membership as a latent class, is more informative about why DIF is occurring.
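The situation described above can be illustrated with a minimal simulation sketch. It is not the paper's data-generating procedure: the Rasch model, the item difficulties, the single DIF item, and the way the manifest variable is tied to the latent class (agreement with probability (1 + r) / 2, which gives a correlation of r when the latent classes are balanced) are all assumptions made for illustration, and the function name `simulate_latent_dif` is hypothetical.

```python
import numpy as np

def simulate_latent_dif(n=20000, n_items=10, dif_size=0.8, r=0.6, seed=0):
    """Simulate Rasch responses where item 0 is harder for latent class 1."""
    rng = np.random.default_rng(seed)
    latent = rng.integers(0, 2, size=n)             # balanced latent classes (true source of DIF)
    agree = rng.random(n) < (1 + r) / 2             # manifest matches latent with prob (1 + r) / 2
    manifest = np.where(agree, latent, 1 - latent)  # observed grouping variable
    theta = rng.normal(0.0, 1.0, size=n)            # ability, same distribution in both classes
    b = np.linspace(-1.0, 1.0, n_items)             # assumed item difficulties
    shift = np.zeros((n, n_items))
    shift[:, 0] = dif_size * latent                 # uniform DIF on item 0 only
    logits = theta[:, None] - (b[None, :] + shift)
    prob = 1.0 / (1.0 + np.exp(-logits))
    responses = (rng.random((n, n_items)) < prob).astype(int)
    return latent, manifest, responses

latent, manifest, responses = simulate_latent_dif()

# Correlation between the manifest and latent groupings (close to r here).
phi = np.corrcoef(latent, manifest)[0, 1]

# Gap in proportion correct on the DIF item, split by each grouping variable.
# Ability is identically distributed across classes, so the gap reflects DIF.
gap_latent = responses[latent == 0, 0].mean() - responses[latent == 1, 0].mean()
gap_manifest = responses[manifest == 0, 0].mean() - responses[manifest == 1, 0].mean()
print(round(phi, 2), round(gap_latent, 3), round(gap_manifest, 3))
```

In this sketch the gap seen through the manifest variable is smaller than the gap seen through the latent class, because each manifest group is a mixture of both latent classes. This is the sense in which a weakly related manifest variable can hide DIF.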
The main research questions in this paper are:
(1) Whether a mixture IRT model performs better than the manifest DIF model under different conditions.
(2) Whether incorporating the manifest variable in the mixture IRT model improves or weakens the identification of DIF.
The manipulated variables include:
(1) Sample size: 1,000, 5,000 and 25,000;
(2) Balanced or unbalanced group size;
(3) DIF size: 0.3 to 1.1, in increments of 0.1;
(4) The correlation between the manifest and latent groups: 0 to 1, in increments of 0.2.
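As a rough sketch, the design implied by these four factors can be enumerated as follows. This assumes the factors were fully crossed, which the summary above does not state; the variable names are illustrative only.

```python
import itertools

sample_sizes = [1000, 5000, 25000]
group_balance = ["balanced", "unbalanced"]
dif_sizes = [round(0.3 + 0.1 * i, 1) for i in range(9)]   # 0.3, 0.4, ..., 1.1
correlations = [round(0.2 * i, 1) for i in range(6)]      # 0.0, 0.2, ..., 1.0

# Assumption: every level of each factor is combined with every level of the others.
conditions = list(itertools.product(sample_sizes, group_balance, dif_sizes, correlations))
print(len(conditions))  # 3 * 2 * 9 * 6 = 324
```

If the study did not cross all factors, the number of conditions would be smaller than this count.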
The main results:
(1) The mixture IRT model performs better in identifying DIF items than DIF detection using manifest variables only, when the sample size is large (larger than 5,000), the group sizes are balanced, and the correlation between the manifest and latent variables is high.
(2) When the correlation between the source of the bias and the manifest variable is low, the manifest DIF methods perform poorly. (This is self-evident.)
(3) Incorporating the manifest variable, even when the association between the source of the bias and the manifest variable is low, improves the identification of DIF by the mixture IRT model. Conversely, even if the manifest variable is highly correlated with the source of DIF, including a latent grouping variable does no harm.
Comments:
(1) The mixture model depends heavily on the sample size. The sample sizes used in this study were 1,000, 5,000 and 25,000. It was found that the mixture IRT model could not accurately estimate the variance/covariance matrix of the item parameters with 1,000 cases. Although a much larger sample may be needed to classify people according to unobserved characteristics, it is hard to obtain more than 1,000 participants in practice. When the sample size is less than 5,000, the performance of the mixture IRT model is not reliable.
(2) It seems that including a latent grouping variable at least does no harm, and that including the manifest variable can improve the detection of DIF items. Can it be concluded that the best strategy is to incorporate both the manifest and the latent variable when detecting DIF items? Is that feasible?
(3) The empirical example showed that the manifest and latent DIF detection methods identified the same number of DIF items (is this necessarily the case?), but not all of the flagged items were the same. This raises the question of which method we should believe. It was said that the heterogeneity in the minority group was captured, so the manifest DIF method may be wrong. However, since N = 1,134, only slightly larger than 1,000, the estimates from the mixture IRT model may not be accurate either.
(4) For an empirical study, we need to decide how many latent classes to use. It was suggested that we study the fit of models with different numbers of latent classes.
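A minimal sketch of such a comparison is given below. It uses BIC, which is one common choice and not necessarily the criterion used in the paper. The helper names are hypothetical, and the log-likelihoods and parameter counts are made-up placeholders that would in practice come from fitting the mixture IRT model with each number of classes.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; smaller values indicate a better trade-off."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def best_class_count(fits, n_obs):
    """fits maps number of classes -> (log-likelihood, number of free parameters)."""
    scores = {k: bic(ll, p, n_obs) for k, (ll, p) in fits.items()}
    return min(scores, key=scores.get), scores

# Placeholder values for illustration only; they are NOT results from the paper.
example_fits = {1: (-15250.0, 21), 2: (-15100.0, 43), 3: (-15080.0, 65)}
chosen, scores = best_class_count(example_fits, n_obs=1000)
print(chosen)
```

With these placeholder inputs the two-class solution has the smallest BIC: the three-class model improves the log-likelihood only slightly, and the penalty for its extra parameters outweighs the gain.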
(5) Future studies may use the mixture IRT model to detect nonuniform DIF. Also, the multilevel mixture IRT model has not yet been well explored.