The authors propose the bifactor model to account for construct shift across grades in the context of vertical scaling. In addition to a general factor common to all grades, each grade has its own group-specific dimension, and the authors argue that the bifactor model is suited to this structure. In the simulation study, the percentage of common items, the sample size, and the degree of construct shift (the variance of the group-specific dimension) were manipulated to investigate parameter recovery under the bifactor model and under UIRT for vertical scaling. Concurrent calibration was used to scale three grades, so a multiple-group bifactor model was specified. The results showed that the bifactor model recovered both person and item parameters well, as well as the variances of the person parameters on the grade-specific dimensions, which represent the construct shift in vertical scaling. The results also showed that when the variance of the group-specific dimensions is high, both the bifactor model and UIRT may yield large SEs for the parameter estimates. In general, however, the bifactor model is superior to UIRT when construct shift is present. In an illustrative empirical study, the constrained bifactor model showed better model-data fit than the 2P testlet model and the Rasch testlet model according to AIC and BIC. The authors also discuss topics for further study, such as polytomously scored items, data generated from models other than the bifactor model, varying item discrimination parameters, other estimation techniques, and the use of collateral information to obtain more accurate parameter estimates.
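To make the manipulated factor concrete, the data-generating idea summarized above can be sketched as follows. This is only an illustration under assumptions of mine, not the authors' actual generating model: it assumes a Rasch-type bifactor model with unit loadings on both dimensions, a general ability drawn from a normal distribution with a grade-specific mean, and a grade-specific ability with mean zero and variance `specific_var`. The function name `simulate_grade`, the item difficulties, and all numeric values are hypothetical.

```python
import math
import random


def simulate_grade(n_persons, difficulties, grade_mean, specific_var, rng):
    """Simulate 0/1 responses for one grade.

    Assumed model (not taken from the manuscript):
        logit P(X = 1) = theta_general + theta_specific - b
    where theta_specific has variance `specific_var`, standing in for
    the degree of construct shift.
    """
    specific_sd = math.sqrt(specific_var)
    data = []
    for _ in range(n_persons):
        theta_general = rng.gauss(grade_mean, 1.0)     # common dimension
        theta_specific = rng.gauss(0.0, specific_sd)   # grade-specific dimension
        row = []
        for b in difficulties:
            logit = theta_general + theta_specific - b
            p = 1.0 / (1.0 + math.exp(-logit))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


rng = random.Random(2024)                      # fixed seed for reproducibility
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]     # hypothetical item difficulties
responses = simulate_grade(200, difficulties, grade_mean=0.0,
                           specific_var=0.5, rng=rng)
```

Setting `specific_var` to zero reduces this sketch to a unidimensional Rasch model, which is the sense in which a larger variance corresponds to a larger construct shift.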
Questions and Comments:
1. What about multidimensional vertical scaling? If each dimension comes with a group-specific factor, a four-dimensional model (common in practice, for example in math tests) would have eight factors, which might impose a computational burden. Also, how should the correlations among the common dimensions and the group-specific dimensions be taken into account?
2. Separate calibration could be tried in future work. An important drawback of concurrent calibration is that the common items need to be screened before calibration, whereas separate calibration offers the opportunity to eliminate bad common items. Also, by using a multiple imputation approach to account for the uncertainty in each grade, separate calibration could be investigated thoroughly.
3. The group-specific dimensions must be orthogonal in the model, but this might not hold in practice. Also, an additional dimension can arise from various sources (such as the dimensionality detection method, guessing, speededness, etc.), so even if an additional dimension is found, it is still hard to judge whether it represents construct shift in practice. For instance, the authors claim that FA was conducted to confirm the existence of primary and secondary dimensions in real vertically scaled assessment data, but how can one be sure that these reflect construct shift across grades? In short, it might be hard to equate the bifactor structure with construct shift, and the variance of the group-specific dimension with the magnitude of the construct shift.
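As an illustration of the point in comment 2, the following minimal sketch shows how separate calibration leaves room to drop poorly behaving common items before linking. It assumes Rasch-type difficulty estimates obtained separately for two adjacent grades and a simple mean/mean shift. The function name `link_with_screening`, the 0.5-logit threshold, and the item estimates are hypothetical and are not taken from the manuscript.

```python
from statistics import mean, median


def link_with_screening(b_lower, b_upper, threshold=0.5):
    """Mean/mean linking of Rasch difficulties with a simple item screen.

    b_lower, b_upper: dicts mapping item id -> difficulty estimate from
    separate calibrations of the lower and upper grade.
    Returns the shift that places upper-grade estimates on the
    lower-grade scale, plus the list of flagged common items.
    """
    common = sorted(set(b_lower) & set(b_upper))
    diffs = {i: b_lower[i] - b_upper[i] for i in common}
    center = median(diffs.values())   # robust first guess at the shift
    kept = [i for i in common if abs(diffs[i] - center) <= threshold]
    flagged = [i for i in common if i not in kept]
    if not kept:
        raise ValueError("no common items survived screening")
    shift = mean(diffs[i] for i in kept)
    return shift, flagged


# Hypothetical estimates: item "c4" behaves differently across grades.
b_lower = {"c1": 0.2, "c2": 0.9, "c3": 1.5, "c4": -0.3}
b_upper = {"c1": -0.8, "c2": -0.1, "c3": 0.5, "c4": -0.1}
shift, flagged = link_with_screening(b_lower, b_upper)
```

Under concurrent calibration there is no comparable step between estimation and linking, so any such screening has to happen beforehand.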