One thing I’ve noted over the years is that, when it comes to teacher evaluation through student testing*, you have to understand the limitations of your data. Most scores, whether percentiles or reported scale scores, are non-linear, which means that two ‘equal’ differences do not represent the same amount of learning (e.g., the gap between the tenth and fifteenth percentile usually does not equal the gap between the fiftieth and 55th percentile).
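To make the non-linearity concrete, here is a minimal sketch assuming scores follow a normal distribution (an assumption for illustration, not a property any particular test guarantees):

```python
from statistics import NormalDist

# Map a percentile (as a fraction) back to a score in standard-deviation
# units, assuming a normal distribution of underlying scores.
z = NormalDist().inv_cdf

gap_low = z(0.15) - z(0.10)  # 10th to 15th percentile
gap_mid = z(0.55) - z(0.50)  # 50th to 55th percentile

print(f"10th-15th percentile gap: {gap_low:.3f} SD")  # ~0.245 SD
print(f"50th-55th percentile gap: {gap_mid:.3f} SD")  # ~0.126 SD
```

Under that assumption, the same five-percentile jump is nearly twice as large in score terms near the tenth percentile as it is near the median.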
Related to the point above, this can have real effects on how we perceive teachers or teaching interventions, as a recent Brookings Institution report describes (boldface mine):
Teacher evaluation is another good example. In recent years, many districts have started to use measures of teacher value-added as part of their determination of promotion, tenure, and even compensation. A teacher’s “value-added” is based on how much improvement his or her students make on standardized tests during the school year (sometimes adjusted for various demographic characteristics). A teacher whose students grew by, say, 15 points is considered more effective than a teacher whose students only grew 10 points. However, if the students in these classrooms started from a different baseline, then this type of comparison depends entirely on the scaling of the exam. For example, it might be the case that a teacher who raises the scores of low-achieving students by 10 points has provided the students more than her colleague who manages to raise the scores of higher-achieving students by 15 points.
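The scaling dependence is easy to demonstrate with made-up numbers (the scores and the square-root rescaling below are hypothetical, chosen only to show that an order-preserving change of scale can reverse which teacher looks more effective):

```python
import math

# Hypothetical gains, not from the report: teacher A raises low scorers
# 20 -> 30 (10 points); teacher B raises high scorers 70 -> 85 (15 points).
def rescale(x):
    # Any concave monotone transform works; it preserves every student's
    # rank but compresses differences at the top of the scale.
    return math.sqrt(x)

gain_a_raw = 30 - 20
gain_b_raw = 85 - 70
gain_a_rescaled = rescale(30) - rescale(20)
gain_b_rescaled = rescale(85) - rescale(70)

print(gain_b_raw > gain_a_raw)            # True: B looks better on raw points
print(gain_a_rescaled > gain_b_rescaled)  # True: A looks better after rescaling
```

Both scorings rank every individual student identically; only the claim about which *teacher* added more value flips.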
And this seems important:
Consider the case of the black-white test score gap. Past studies have found that white students typically score about one standard deviation higher than black students, before accounting for important socioeconomic factors such as family income. Moreover, many studies find that this difference actually grows over time. However, a recent study documents that the change in the black-white test score gap between kindergarten and third grade can be as small as zero or as large as 0.6 standard deviations depending on how one chooses to scale the test.
Basically, depending on how the test is scaled, there is either no racial gap or a gap that would place the average black child at roughly the 27th percentile of the white score distribution. Seems kinda important.
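For reference, converting a standard-deviation gap into a percentile, again assuming normally distributed scores:

```python
from statistics import NormalDist

cdf = NormalDist().cdf  # fraction of a normal distribution below a z-score

# Where does the average of a group sitting 0.6 SD below another group
# fall within the higher-scoring group's distribution?
print(round(cdf(-0.6) * 100))  # 27 (the ~27th percentile)
print(round(cdf(0.0) * 100))   # 50 (no gap: the averages coincide)
```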
But it’s more complicated (or worse) than that. Tests can be designed in several different ways, including the number of questions, the model used to weight different questions, and the procedure used to reduce the effect of a ‘good’ or ‘bad’ testing day (‘shrinking’), all of which can have a significant effect on reported scores (boldface mine):
Comparing two of the most common approaches to test scoring, one study found that roughly 12.5 percent of students would be classified into different performance levels depending on the technique chosen…
Similarly, the use of shrunken scores will lead one to underestimate group differences. Consider, for example, if one wanted to estimate the black-white test score gap using data from the large-scale, nationally representative, federal study of the progress of children through school (The Early Childhood Longitudinal Study of Kindergarteners, or ECLS-K). If black children score lower on average than white children, then the reported difference in the test scores between the groups based on the shrunken estimates will understate the black-white gap because, on average, scores of black students will be adjusted up, toward the population mean, while the opposite is true for white students.
…if a black and white student respond identically to questions on the NAEP assessment, the reported ability for the black student will be lower than for the white student, reflecting the lower average performance of black students on this assessment.
This does not bias the average black-white test score gap. The average score of all black students remains the same because the scores of high-performing black students are pushed down just as the scores of low-performing black students are pushed up, as is the case for white students. However, individual scores are affected, which can create important biases in more complex secondary analyses.
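Both shrinkage effects described above are visible in a toy sketch (the scores and the reliability factor r are made up purely for illustration):

```python
# "Shrinking" pulls each estimated score toward a mean by a factor r < 1.
def shrink(score, target_mean, r=0.8):
    return target_mean + r * (score - target_mean)

white = [60, 70, 80]
black = [40, 50, 60]
pop_mean = sum(white + black) / 6  # 60.0

# 1. Shrinking toward the overall population mean (the ECLS-K case)
#    compresses the measured black-white gap.
raw_gap = sum(white) / 3 - sum(black) / 3
shrunk_gap = (sum(shrink(s, pop_mean) for s in white) / 3
              - sum(shrink(s, pop_mean) for s in black) / 3)
print(raw_gap, shrunk_gap)  # 20.0 16.0: the shrunken gap understates the raw gap

# 2. Shrinking toward each group's own mean (a stand-in for conditioning
#    on group membership, as in NAEP) leaves the group average unchanged
#    but moves every individual score toward that mean.
gmean = sum(black) / 3  # 50.0
shrunk = [shrink(s, gmean) for s in black]  # [42.0, 50.0, 58.0]
print(sum(shrunk) / 3 == gmean)  # True: group average preserved
```

The group-level gap survives the second procedure intact, which is why the bias only surfaces in analyses that use individual scores.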
Unfortunately, many of these tests are proprietary, which makes it difficult, if not impossible, for researchers to examine how different assumptions alter supposed test performance (and if the initial result matches your preconceived notions, there’s little incentive to re-examine the data). And, of course, test makers have a strong incentive not to emphasize how sensitive statements about performance are to how the scores were calculated.
The limitations of your data do matter.
*These tests are never used to help individual students, so the moniker fits.