Last week, E.D. Kain took Megan McArdle to task for promoting the use of student testing as a means to evaluate teachers. This, to me, was the key point:
….nobody is arguing against tests as a way to measure outcomes. Anti-standardized-tests advocates are arguing against the way tests are being used, and the prioritization of tests. If you really, truly want to measure outcomes, you should not create a system that incentives teaching to a test. Teaching to a test not only narrows the curriculum, it means that teachers prepare students specifically for the test. This skews the outcomes of test scores enormously. Testing should be done outside of normal instruction so that each teacher, school, and student can be fairly measured.
Tests are a good, if not absolutely perfect, way of assessing how well students have learned (if the tests are well-designed). If you’re trying to assess how a particular change in teaching works (e.g., a new math curriculum), you do need some method to assess performance.
But where ‘reformers’ go off the rails is their incessant belief that testing is a good way to evaluate how well a teacher has taught* (this belief also seems to imply that many teachers aren’t performing up to snuff, but I’ll let that slide…).
First, the methodological assumptions of value-added testing, the best (or least worst) method, are usually violated; random** student assignment to classes is one example. In one study, this led to fifth grade teachers appearing to affect fourth grade student performance nearly as much as the students’ actual fourth grade teachers did. Yes, you read that last sentence correctly. Either there are problems with the method (likely), or else this school system routinely violates our current assumptions about space-time (not so likely).
Second, the precision in figuring out how well a teacher taught is, to be charitable, non-existent. When a teacher’s rating can range from abysmal to ‘middle of the pack–let’s give her tenure’, this isn’t a very precise measure. An evaluation scheme this capricious can best be described as ‘demotivational.’
A 2010 study from the Annenberg Institute for School Reform authored by Sean Corcoran describes just how imprecise the estimates of teacher performance are. First, consider how Houston, TX teachers would be assessed using two different tests, the Stanford Test and TAKS. In the figure below, teachers are assigned to quintiles based on the TAKS-reading exam, and then compared to scores on the Stanford reading exam.
In every case, one quarter or more of teachers placed on the Stanford exam two quintiles or more away from their TAKS quintile (using a two-quintile difference is very conservative, as a teacher who placed at the 19th percentile on one exam and the 21st on the other would be placed into different quintiles). One out of six teachers who placed in the highest TAKS quintile fell into the bottom two Stanford exam quintiles, and vice versa. Believe it or not, this is the ‘least worst’ evidence for the imprecision of value-added testing estimates of teacher ability.
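To get a feel for how measurement noise alone can scramble quintiles like this, here is a minimal sketch in Python. It uses purely synthetic data, not the Houston results: the function names, the normally distributed ‘true’ teacher effect, and the `noise_sd` parameter are all my assumptions for illustration, and nothing here reproduces Corcoran’s actual analysis.

```python
import random


def quintile_ranks(scores):
    """Return the quintile (0 = lowest, 4 = highest) for each score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    quintiles = [0] * len(scores)
    for rank, idx in enumerate(order):
        quintiles[idx] = rank * 5 // len(scores)
    return quintiles


def share_two_or_more_apart(n_teachers=1000, noise_sd=1.0, seed=0):
    """Share of simulated teachers whose quintiles on two noisy tests
    differ by two or more.

    Assumption: each test score is the same underlying teacher effect
    plus independent, normally distributed noise. Hypothetical model.
    """
    rng = random.Random(seed)
    true_effect = [rng.gauss(0, 1) for _ in range(n_teachers)]
    test_a = [t + rng.gauss(0, noise_sd) for t in true_effect]
    test_b = [t + rng.gauss(0, noise_sd) for t in true_effect]
    q_a = quintile_ranks(test_a)
    q_b = quintile_ranks(test_b)
    far_apart = sum(1 for a, b in zip(q_a, q_b) if abs(a - b) >= 2)
    return far_apart / n_teachers


if __name__ == "__main__":
    for sd in (0.5, 1.0, 2.0):
        print(sd, share_two_or_more_apart(noise_sd=sd))
```

Raising `noise_sd` relative to the spread of the underlying effect should push more simulated teachers two or more quintiles apart, even though nothing about their teaching differs between the two tests.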
Consider this range of variation in New York City’s Teacher Data Reports (boldface mine):
As expected, the level of uncertainty is higher when only one year of test results are used (the 2007-2008 bars) as against three years of data (all other bars). But in both cases, the average range of value-added estimates is very wide. For example, for all teachers of math, and using all years of available data, which provides the most precise measures possible, the average confidence interval width is about 34 points (i.e., from the 46th to 80th percentile). When looking at only one year of math results, the average width increases to 61 percentile points. That is to say, the average teacher had a range of value-added estimates that might extend from, for example, the 30th to the 91st percentile. The average level of uncertainty is higher still in ELA. For all teachers and years, the average confidence interval width is 44 points. With one year of data, this rises to 66 points.
I think a Magic Eight Ball would be more reliable. And since this method is supposed to be able to identify good teachers, how does it perform at that task? Not well (boldface mine):
Given the level of uncertainty reported in the data reports, half of teachers in grades three to eight who taught math have wide enough performance ranges that they cannot be statistically distinguished from 60 percent or more of all other teachers of math in the same grade. One in four teachers cannot be distinguished from 72 percent or more of all teachers. These comparisons are even starker for ELA, as seen in Figure 8. In this case, three out of four teachers cannot be statistically distinguished from 63 percent or more of all other teachers. Only a tiny proportion of teachers – about 5 percent in math and less than 3 percent in ELA – received precise enough percentile ranges to be distinguished from 20 percent or fewer other teachers.
Not working well at all.
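One rough way to think about “cannot be statistically distinguished” is to ask how many other teachers have a percentile range that overlaps yours. The sketch below does just that. Treating overlap as ‘indistinguishable’ is my simplifying assumption, not necessarily the procedure used in the report, and apart from the two ranges borrowed from the quoted examples (46th to 80th, 30th to 91st), the ranges are made up.

```python
def share_indistinguishable(intervals):
    """For each teacher, return the share of OTHER teachers whose
    percentile range overlaps theirs.

    Assumption: two teachers are 'indistinguishable' when their
    ranges overlap. This is an illustrative simplification.
    """
    n = len(intervals)
    shares = []
    for i, (lo_i, hi_i) in enumerate(intervals):
        overlapping = sum(
            1
            for j, (lo_j, hi_j) in enumerate(intervals)
            if j != i and lo_i <= hi_j and lo_j <= hi_i
        )
        shares.append(overlapping / (n - 1))
    return shares


# Hypothetical percentile ranges (low, high) for five teachers.
example_ranges = [(46, 80), (30, 91), (5, 20), (85, 99), (40, 60)]
shares = share_indistinguishable(example_ranges)

if __name__ == "__main__":
    for rng_, share in zip(example_ranges, shares):
        print(rng_, share)
```

The wider a teacher’s range, the more colleagues it overlaps, which is why ranges 60-odd points wide leave so few teachers who can be told apart from the pack.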
Again, the issue is the misuse of tests: testing is a good way to determine if a particular intervention works, or to get a handle on the relative importance of various demographic variables when looking at a large number of students. But as a method of measuring teacher performance, value-added testing–which is the ‘best’ method–stinks. As Corcoran notes:
Persistently exceptional or failing teachers – say, those in the top or bottom 5 percent – may be successfully identified through value-added scores, but it seems unlikely that school leaders would not already be aware of these teachers’ persistent successes or failures….
But teachers, policymakers, and school leaders should not be seduced by the elegant simplicity of “value added.”
In light of our inability to make meaningful statements about teachers, maybe combating poverty doesn’t look so intractable….
*The phrase commonly used is ‘teacher performance’, but “how well a teacher taught” seems to be a more accurate description of what they’re purporting to measure.
**With respect to the variables of interest.