This paper describes an interesting approach to modeling rater effects in IRT. The hierarchical rater model (HRM) of Patz et al. (2002) is presented in Figure 1 and Equation 1, but the model has two problems. First, when the rater precision parameter is small, the likelihood is nearly flat (close to uniform) with respect to the severity parameter. As a result, MLE and other estimation methods yield very similar severity estimates even for raters who differ in severity. Second, the model captures only rater severity and not other types of rater effects, such as a central-tendency rating style. To address these problems, the HRM-SDT model replaces level 1 of the HRM with the signal detection theory (SDT) model in Equation 2. This equation relates the latent category to rater severity: the higher the rater's severity relative to the latent category, the higher the probability of choosing a lower response category. This model expresses rater effects better. However, I don't think Figure 3 conveys the structure of the HRM-SDT well: the model should still have two levels, but Figure 3 looks like a three-level model.
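To make the first problem concrete, here is a minimal Python sketch. It assumes the HRM level-1 form as I recall it from Patz et al. (2002): the probability that a rater assigns category k, given ideal category xi, is proportional to exp(-(k - (xi + phi))^2 / (2 * psi^2)), where phi is the rater's shift (severity) and psi is the rater's variability. I take "small precision" to mean large psi. The function name, the sign convention for phi, and all numeric values are my own illustrative choices, not taken from the paper.

```python
import numpy as np

def hrm_rating_probs(ideal, shift, psi, n_categories=5):
    """Rating probabilities for one rater under the assumed HRM level-1 form.

    ideal : ideal (latent) category xi
    shift : rater shift phi (hypothetical sign convention: positive = rates higher)
    psi   : rater variability; large psi means low precision
    """
    k = np.arange(n_categories)
    log_w = -((k - (ideal + shift)) ** 2) / (2.0 * psi ** 2)
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return w / w.sum()

def max_gap(psi, ideal=2, shift_a=-1.0, shift_b=1.0):
    """Largest difference in category probabilities between two raters
    who differ only in their shift parameter."""
    p_a = hrm_rating_probs(ideal, shift_a, psi)
    p_b = hrm_rating_probs(ideal, shift_b, psi)
    return float(np.abs(p_a - p_b).max())

gap_sharp = max_gap(psi=0.5)  # precise raters: the two distributions differ clearly
gap_flat = max_gap(psi=5.0)   # imprecise raters: both distributions are nearly uniform

print(f"max probability gap, psi=0.5: {gap_sharp:.3f}")
print(f"max probability gap, psi=5.0: {gap_flat:.3f}")
```

With large psi the two raters' rating distributions are almost the same, so the observed ratings carry little information about the shift parameter. If my reading is right, this is the flat-likelihood problem described above.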
Comment
1) In previous research, two essay items were not enough to estimate the GPCM item parameters at level 2 well. It was suggested to add other informative items, such as multiple-choice items, or a third essay item. I think the problem arose because there was too little information to estimate theta (the examinee's ability), but the paper does not report ability recovery results.
2) The mean of the difficulty parameters in Table 1 is not zero. Is it still necessary to fix the mean of ability to zero in this model? I think the answer is yes. I also think that one of the C and Eta parameters in Equation 2 should be fixed.
3) On page 343, the priors used in posterior mode estimation are placed on the response probabilities and the latent class probabilities. I am not sure how to explain why this prior design penalizes boundary solutions.