Effects of Vertical Scaling Methods on Linear Growth Estimation
Pui-Wa Lei and Yu Zhao
Vertical scaling refers to the process of converting number-correct scores from test forms that measure the same or similar constructs, but at different difficulty levels, onto a common scale. There are two calibration approaches: concurrent calibration and separate calibration. In concurrent calibration, item and ability parameters for all groups are estimated simultaneously, in one run. In separate calibration, item and ability parameters are estimated from each group's responses to its unique and common items, one group at a time, in separate runs.
Moment methods:
- Mean/Mean (MM): uses the mean of the item difficulty estimates and the mean of the item discrimination estimates from the common items.
- Mean/Sigma (MS): uses the mean and the SD of the item difficulty estimates from the common items.

Characteristic curve methods:
- Haebara: minimizes the sum, over common items, of the squared differences between groups in the common items' characteristic curves.
- Stocking–Lord: minimizes the squared difference between groups in the sum of the common items' characteristic curves.
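The two moment methods can be sketched in a few lines. The item estimates below are made-up numbers for illustration, and the linking convention assumed here is that the new-grade parameters are rescaled onto the base-grade scale via b* = A·b + B and a* = a/A:

```python
from statistics import mean, stdev

def mean_sigma(b_base, b_new):
    """Mean/Sigma: A from the SD ratio of the common-item difficulty
    estimates, B from their means."""
    A = stdev(b_base) / stdev(b_new)
    B = mean(b_base) - A * mean(b_new)
    return A, B

def mean_mean(a_base, a_new, b_base, b_new):
    """Mean/Mean: A from the ratio of mean discrimination estimates
    (since a* = a / A), B from the mean difficulty estimates."""
    A = mean(a_new) / mean(a_base)
    B = mean(b_base) - A * mean(b_new)
    return A, B

# Hypothetical common-item estimates from two separate calibration runs.
b_base = [-1.0, -0.2, 0.4, 1.1]   # difficulties on the base-grade scale
b_new  = [-2.8, -1.2, 0.0, 1.4]   # same items, new-grade scale
a_base = [0.9, 1.1, 1.0, 1.2]     # discriminations, base-grade scale
a_new  = [0.45, 0.55, 0.5, 0.6]   # same items, new-grade scale

A, B = mean_sigma(b_base, b_new)
# Rescale new-grade difficulties onto the base scale.
b_rescaled = [A * b + B for b in b_new]
```

With these toy numbers both methods recover the same constants, but with real (noisy) estimates they generally differ, which is why the study treats them as distinct separate-calibration methods.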
Research findings on the relative performance of concurrent versus separate calibration are far from unanimous, and few studies have examined these calibration methods with small samples and short tests.
The current study focuses on the effects of sample sizes and test lengths, especially of small samples and short tests, on the quality of vertical scales established by different scaling methods.
Four factors are manipulated in the study:
(a) Calibration methods: 5 methods (one concurrent and four separate)
(b) IRT models: 1PL and 2PL
(c) Sample sizes: 50, 100, 250, 500, and 1,000
(d) Test lengths: 10, 20, 30, and 40 items (25% common items)
An unstructured random-component structure was analyzed. The fixed effects of growth intercept and growth rate were estimated, along with the random components: growth-intercept variance, growth-rate variance, and the covariance between growth intercept and growth rate across individuals.
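The growth model described above can be written out explicitly; the notation here is chosen for illustration and follows the standard linear growth formulation consistent with that description:

```latex
y_{ti} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\, t + \varepsilon_{ti},
\qquad
\begin{pmatrix} b_{0i} \\ b_{1i} \end{pmatrix}
\sim N\!\left(\mathbf{0},\;
\begin{pmatrix} \tau_{00} & \tau_{01} \\ \tau_{01} & \tau_{11} \end{pmatrix}\right)
```

Here beta_0 and beta_1 are the fixed growth intercept and growth rate, tau_00 and tau_11 are the random intercept and slope variances, and tau_01 is their covariance; "unstructured" means all three random components are estimated freely rather than constrained (e.g., tau_01 is not fixed to zero).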
The results show that concurrent calibration produced slightly lower total estimation error than separate calibration in the worst combination of short test length (≤20 items) and small sample size (n ≤ 100), whereas separate calibration, except in the case of the Mean/Sigma method, produced similar or somewhat lower amounts of total error in other conditions.
Comments:
1. Since the SAS MIXED procedure had difficulty estimating the unstructured random components properly when the sample size was very small (n = 50) or when a small sample was combined with a short test, is there another method that could be used? In real applications, this may affect the parameter estimates.
2. On page 32, compared with scaling using the 1PL model, scaling using the 2PL model produced slightly larger bias when the test was short. Since the data were generated with the 2PL model, how can this be explained? Is it because too few items are available to estimate the additional discrimination parameters stably?