One of the supposed key innovations in educational ‘reform’ is the adoption of value added testing. Basically, students are tested at the start of the school year (or at the end of the previous year) and then again at the end of the year. The improvement in scores is supposed to reflect the effect of the teacher on student learning*. I’ve discussed some of the methodological problems with value added testing before, and the Economic Policy Institute has a good overview of the subject. But what I want to discuss is a very serious flaw with value added testing–one I would call fatal–laid out in a paper by Jesse Rothstein (pdf).
Before we look at the abstract of the paper, we need to be very clear about what we’re measuring. We are taking the difference in test scores–the gain–for each student, assigning each student to a teacher, and then asking whether we can detect an effect of teachers on the variation in gains (the difference in year-to-year test scores). This is not the same as correlations between annual scores (e.g., high scores in third grade mean high scores in fourth grade). A teacher who has a class full of students who score 80 out of 100 could have a class that does well at the end of the year (average ~80) and thus show little gain, but a teacher who starts with a class average of 50 and pulls it up to 70 has done well–that gain is what is being assessed.
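To make that concrete, here’s a quick sketch in Python (the numbers are invented for illustration) of the difference between end-of-year levels and gains:

```python
# Toy numbers, invented for illustration: end-of-year levels vs. gains.

def mean(xs):
    return sum(xs) / len(xs)

# Teacher A inherits high scorers and holds them steady.
teacher_a_start = [78, 82, 80, 81, 79]   # class average ~80
teacher_a_end   = [80, 81, 79, 82, 80]   # still ~80

# Teacher B inherits low scorers and pulls them up.
teacher_b_start = [48, 52, 50, 51, 49]   # class average ~50
teacher_b_end   = [68, 72, 70, 71, 69]   # now ~70

# By end-of-year levels, Teacher A's class "wins" (~80 vs. ~70)...
print(mean(teacher_a_end), mean(teacher_b_end))

# ...but value added looks at gains: ~0 points for A, ~20 for B.
print(mean(teacher_a_end) - mean(teacher_a_start),
      mean(teacher_b_end) - mean(teacher_b_start))
```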
On to the abstract (boldface mine):
Growing concerns over the inadequate achievement of U.S. students have led to proposals to reward good teachers and penalize (or fire) bad ones. The leading method for assessing teacher quality is “value added” modeling (VAM), which decomposes students’ test scores into components attributed to student heterogeneity and to teacher quality. Implicit in the VAM approach are strong assumptions about the nature of the educational production function and the assignment of students to classrooms. In this paper, I develop falsification tests for three widely used VAM specifications, based on the idea that future teachers cannot influence students’ past achievement. In data from North Carolina, each of the VAMs’ exclusion restrictions are dramatically violated. **In particular, these models indicate large “effects” of 5th grade teachers on 4th grade test score gains.** I also find that conventional measures of individual teachers’ value added fade out very quickly and are at best weakly related to long-run effects. I discuss implications for the use of VAMs as personnel tools.
If the boldfaced sentence seems problematic, you’re right: it is.
There is no known way a fifth grade teacher could possibly affect fourth grade improvement–students are supposedly reshuffled between grades, and the method requires this assumption**. There are two explanations here:
1) Value added testing has serious methodological issues (the technical phrase is “fucking bullshit”).
2) The North Carolina primary school system (where the study was conducted) routinely violates space-time. If this is in fact happening, we have far more important things than student achievement to be worrying about.
I’m going with option #1. So here’s the methodological problem:
Panel data allows flexible controls for individual heterogeneity, but even panel data models can identify treatment effects only if assignment to treatment satisfies strong exclusion restrictions. This has long been recognized in the literature on program evaluation, but has received relatively little attention in the literature on the estimation of teachers’ effects on student achievement. In this paper, I have shown how the availability of lagged outcome measures can be used to evaluate common value added specifications.
The results presented here show that the assumptions underlying common VAMs are substantially incorrect, at least in North Carolina. Classroom assignments are not exogenous conditional on the typical controls, and estimates of teachers’ effects based on these models cannot be interpreted as causal. Clear evidence of this is that each VAM indicates that 5th grade teachers have quantitatively important “effects” on students’ 4th grade learning.
One key point Rothstein makes is that principals don’t randomly assign students to classes. Instead, they typically take previous student performance and perceived (or misperceived) teacher quality into account. Some might place poorly-performing students with the ‘best’ teachers in order to pull those students up (which will make ‘good’ teachers look worse than they are). Other principals might place the ‘best’ students with the ‘best’ teachers. And in other cases, some teachers might have a reputation for performing well with either poorly-performing or high-performing students. Unless student assignment is random, the models break down***. This problem is only magnified when comparing students across different schools, where one can’t even attempt randomization. (“We would like to improve teacher evaluation, so, thanks to the luck of the draw, your child will be bused to another school an hour away this year.” That’ll work…)
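To see how non-random assignment alone can manufacture these impossible ‘effects,’ here’s a toy simulation (my own sketch, not Rothstein’s actual model): student ability is fixed, each test score is ability plus luck, and the principal tracks students into 5th grade classrooms by their 4th grade scores. No 5th grade teacher does anything at all, yet under tracking their classrooms differ sharply in 4th grade gains:

```python
# A toy simulation of sorting bias; my own sketch, not Rothstein's model.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_classes = 600, 6

ability = rng.normal(50, 10, n_students)            # fixed student ability
score_3rd = ability + rng.normal(0, 5, n_students)  # 3rd grade = ability + luck
score_4th = ability + rng.normal(0, 5, n_students)  # 4th grade = ability + luck
gain_4th = score_4th - score_3rd                    # the 4th grade "value added"

# Tracking: 5th grade rosters assigned by 4th grade performance.
# The 5th grade teachers have not taught these students a single day.
tracked = np.argsort(score_4th).reshape(n_classes, -1)

# Random assignment, for comparison.
random_rosters = rng.permutation(n_students).reshape(n_classes, -1)

print("Mean 4th grade gain by 5th grade classroom (tracked):",
      np.round([gain_4th[c].mean() for c in tracked], 2))
print("Mean 4th grade gain by 5th grade classroom (random): ",
      np.round([gain_4th[c].mean() for c in random_rosters], 2))
```

The top tracked classroom is full of students who got lucky on the 4th grade test, so its 4th grade ‘gain’ is inflated (and the bottom track’s is deflated), while the randomly assigned classrooms all show gains near zero. A VAM that attributes classroom-level differences in gains to teachers reads that sorting artifact as a 5th grade teacher effect on the past.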
I realize this seems pretty technical (and if you think I’m bad, read the paper), but if education ‘reformers’ want to claim that their methods are rigorous, they have to get the methods right–that’s how science works. If the methods fail, nobody cares about your results, and any discussion of those results is moot. You can’t violate the assumptions of your methods.
Or the space-time continuum.
*This method is actually taken from studies that look at how different firms pay their employees. Intelligent Designer save us from the economists….
**Whether this is good for the students overall is a separate question–should evaluation trump classroom coherency?
***Given correlations with the previous year in terms of absolute scores (and, interestingly, weak negative correlations with regard to gains), you need to ensure students are randomized; otherwise, the gain might be a cohort effect and not a teacher effect.