Reliability and VAM

One of the problems with using value-added measurement (‘VAM’) to assess individual teachers is the imprecision of the estimates. Brad Lindell explains the problem in testimony in a New York case challenging teacher assessment:

If the same test-retest reliability from the teacher assigned yearly VAM scores (.40) was applied to the WISC [IQ score] full-scale to determine the 90% confidence interval, the range would be ridiculously large….

If a student scored a full-scale IQ of 100 (average) then the 90% confidence interval would be an 81 to 119. This indicates that there would be a wide range where the scores from repeated administrations of the WISC would be expected to fall for this student. One could not have confidence in the validity of a intelligence test with low reliability. Without adequate reliability, there can not be validity. This same holds true for VAM scores, whose reliabilities have been found to be notorious[ly] low.

The reliability of the WISC is generally in the .80 to .90 range. The 90% confidence intervals are generally in the +/- 6 range. So this same person with a 100 full-scale IQ would have a 90% confidence range of 94-106. Quite a smaller range.

This is why reliability is so important, which has repeatedly been shown to be low like .2 to .4 for year-to-year VAM scores. This is also why teachers year to year VAM score vary so considerably, like in the case of Sheri Lederman. Without reliability there cannot be adequate validity.

To put this in perspective, someone with the median IQ (100), the 50th percentile, would have a ninety percent chance of testing somewhere between roughly the 10th percentile (81) and the 90th percentile (119). Not exactly a reliable test.
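For anyone who wants to check the arithmetic, here is a rough sketch. It assumes the textbook standard error of measurement, SEM = SD × sqrt(1 − reliability), and an IQ scale with mean 100 and standard deviation 15. The testimony doesn't say which formula was used, so treat that as my assumption; it does reproduce the 81 to 119 range for a reliability of .40. The function names are mine.

```python
from statistics import NormalDist


def confidence_interval(score, reliability, sd=15.0, level=0.90):
    """Interval around an observed score, assuming SEM = SD * sqrt(1 - r)."""
    sem = sd * (1.0 - reliability) ** 0.5
    # Two-sided critical value for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return score - z * sem, score + z * sem


def percentile(score, mean=100.0, sd=15.0):
    """Percentile rank of a score on an assumed normal IQ scale."""
    return 100.0 * NormalDist(mean, sd).cdf(score)


low, high = confidence_interval(100, 0.40)
print(round(low), round(high))  # 81 119

# Where the endpoints fall in the population distribution.
print(percentile(81), percentile(119))
```

Passing other reliabilities to `confidence_interval` shows how quickly the interval narrows as reliability rises.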

Another key point: because teachers are scored relative to one another, no matter how well they do in absolute terms, someone will end up looking like a low-performing teacher:

Using the NYSED’s logic and methods referenced above, regardless of how well students performed on the State tests, a certain portion of teachers would have to be deemed ineffective. Even if 100% of students were proficient on the State tests and evidenced significant growth from the prior year, given that teachers are compared to each other in their MPG’s [Adjusted Mean Growth Percentile], a pre-determined portion (i.e., those falling 1.5 standard deviations below the mean) would have to be deemed ineffective. This could deem teachers whose students evidenced significant growth as ineffective. This is not rational.

No, it is not. And then people wonder why there is a teacher shortage.
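The relative-cutoff point is easy to see in a toy simulation. This is only a sketch with made-up numbers, not NYSED's actual model: it assumes roughly normal teacher scores and flags anyone more than 1.5 standard deviations below the group mean. `share_flagged` is a hypothetical helper of mine.

```python
import random
from statistics import mean, pstdev


def share_flagged(scores, cutoff_sd=1.5):
    """Fraction of scores more than cutoff_sd standard deviations below the mean."""
    threshold = mean(scores) - cutoff_sd * pstdev(scores)
    return sum(x < threshold for x in scores) / len(scores)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical teacher scores; the numbers are illustrative only.
scores = [rng.gauss(50, 10) for _ in range(10_000)]

# Every teacher improves by the same large amount.
better = [x + 30 for x in scores]

# The flagged share is essentially unchanged, because the cutoff moves with the mean.
print(share_flagged(scores), share_flagged(better))
```

Under a normal distribution about 6.7% of teachers fall below that cutoff, however high the whole group scores.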
