Yes, it's a brutal charge, but, sadly, it's true. Economist Moshe Adler (who has also written an excellent book, Economics for the Rest of Us: Debunking the Science That Makes Life Dismal) has squared off against Harvard economist Raj Chetty and his coauthors, who have used value-added measurements (which we find very problematic when applied to education) to claim that teacher quality can affect lifetime economic earnings. Chetty's work received a lot of attention when President Obama mentioned it in his 2012 State of the Union address (and the initial reports from 2010 had problems as well). Here's a summary of Chetty's work by Adler:
The first part of the report (NBER Working Paper No. 19423) reviewed here linked information about students and teachers in grades three through eight in New York City (NYC), spanning the years 1989-2009. The research used this linked dataset for “value added” (VA) calculations for individual teachers. The model used for the VA calculation controls for factors such as students’ prior achievement, parents’ income, and the performance of other students in the classroom. But none of these factors can entirely predict a child’s performance. After accounting for all known and easily measurable social and economic factors, a residue still remains. The report assumes that this residue is attributable to the teacher, and it calculates a teacher’s value-added by using the residues of his or her students.
The second part of the report then linked the incomes of adults with the value-added of their teachers when those earner-adults were students. Using this linked data set they found that a one unit (one standard deviation) increase in teacher value-added increases income at age 28 by $286 per year or 1.34%. The study then assumes that this percentage increase in income will hold for a person’s entire working life, producing a cumulative lifetime increase of $39,000 per student.
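Before getting to the problems, it may help to see mechanically what a residual-based VA score looks like. Here is a minimal sketch in Python; the column names, the bare-bones OLS specification, and the simple averaging are illustrative assumptions, not Chetty et al.'s actual estimator.

```python
# Minimal sketch of a residual-based "value added" (VA) score.
# Column names and the bare-bones OLS specification are assumptions
# for illustration; the model in the report is far more elaborate.
import pandas as pd
import statsmodels.formula.api as smf

def teacher_value_added(df: pd.DataFrame) -> pd.Series:
    """df: one row per student-year, with columns
    score, prior_score, parent_income, class_mean_prior, teacher_id."""
    # Regress current test score on the observable controls.
    fit = smf.ols(
        "score ~ prior_score + parent_income + class_mean_prior", data=df
    ).fit()
    # Whatever the controls fail to explain is the residual...
    df = df.assign(residual=fit.resid)
    # ...and a teacher's VA is the average residual of his or her students.
    return df.groupby("teacher_id")["residual"].mean()
```

The key move is the last line: everything the controls fail to explain gets averaged up and attributed to the teacher, which is exactly the assumption Adler flags.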
The headline numbers sound impressive. But there are problems, as Adler lays out (boldface mine):
1. An earlier version of the report found that an increase in teacher value-added has no effect on income at age 30, but this result is not mentioned in this revised version. Instead, the authors state that they did not have a sufficiently large sample to investigate the relationship between teacher value-added and income at any age after 28, but this claim is untrue. They had 220,000 observations (p. 15), which is a more than sufficiently large sample for their analysis.
2. The method used to calculate the 1.34% increase is misleading, since observations with no reported income were included in the analysis, while high earners were excluded. If done properly, it is possible that the effect of teacher value-added is to decrease, not increase, income at age 28 (or 30).
3. The increase in annual income at age 28 due to having a higher quality teacher “improved” dramatically from the first version of the report ($182 per year, report of December, 2011) to the next ($286 per year, report of September, 2013). Because the data sets are not identical, a slight discrepancy between estimates is to be expected. But since the discrepancy is so large, it suggests that the correlation between teacher value-added and income later in life is random.
4. In order to achieve its estimate of a $39,000 income gain per student, the report makes the assumption that the 1.34% increase in income at age 28 will be repeated year after year. Because no increase in income was detected at age 30, and because 29.6% of the observations consisted of non-filers, this assumption is unjustified.
5. The effect of teacher value-added on test scores fades out rapidly. The report deals with this problem by citing two studies that it claims buttress the validity of its own results. This claim is both wrong and misleading.
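Adler's fourth point is worth making concrete. A rough back-of-the-envelope shows how much work the "repeat it every year" assumption does; the starting salary and growth rate below are made-up illustrations (the report's own earnings assumptions are what get it to roughly $39,000).

```python
# Back-of-the-envelope for the lifetime-gain claim: the big number only
# appears if the 1.34% gain measured at age 28 is assumed to recur every
# year of a career.  The earnings path is a made-up illustration.
pct_gain = 0.0134                                   # +1.34% at age 28 (from the report)

# Hypothetical earnings path: $20,000 at age 25, growing 2% per year to 65.
earnings = [20_000 * 1.02 ** (age - 25) for age in range(25, 65)]

one_year_gain = pct_gain * earnings[28 - 25]        # effect at age 28 only
career_gain = sum(pct_gain * y for y in earnings)   # effect repeated every year

print(f"Gain if the effect holds only at age 28: ${one_year_gain:,.0f}")   # ~$284
print(f"Gain if the effect persists for 40 years: ${career_gain:,.0f}")    # ~$16,000
```

The single-year estimate of a few hundred dollars becomes a headline-sized lifetime figure only by assuming it persists for an entire working life, and that is precisely the assumption the missing age-30 effect and the non-filer problem call into question.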
While Adler’s entire critique is damning (and accessible to non-specialists), point #1 just floors me.
Nowhere in any of the correspondence between Adler and Chetty et alia is there any serious effort to assess the power of the test; Adler, on the other hand, has done those power calculations, and finds that the smaller dataset is about eight times larger than the minimal size required (Adler's response to their response is, erm… devastating). In other words, there's no reason to discard the data for the 30-year-olds except that they are inconvenient.
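For readers who haven't run one, a minimum-sample-size (power) calculation is not exotic. Here is a minimal sketch using the Fisher z method for a correlation; the correlation value is an illustrative assumption, not Adler's exact input.

```python
# How many observations are needed to detect a small correlation r between
# teacher VA and adult income at alpha = 0.05 with 80% power?
# The r = 0.02 used below is an illustrative assumption.
import numpy as np
from scipy.stats import norm

def n_required(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size to detect correlation r (two-sided test, Fisher z method)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    fisher_z = np.arctanh(r)          # Fisher transform of r
    return int(np.ceil(((z_alpha + z_power) / fisher_z) ** 2 + 3))

print(n_required(0.02))               # roughly 20,000 observations
```

Even with a toy effect size this small, the required sample is a fraction of the 220,000 observations the authors had, which is the thrust of Adler's point.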
You don’t do science like this. Period.
This seems like a case where an initial, trumpeted finding, not confirmed by peer review, is suddenly confronted with additional data and falls apart.
In the entire history of science, this has never happened. I kid: it happens all the time. The question is, as a scientist, do you double down or admit that the effect is either very weak or non-existent? Human nature being what it is, the temptation is to double down, especially when you've made your bones on this stuff (and the President mentioned your work!). Hell, senior researchers often don't back off until they are completely dogpiled.
At what point does the dam break? Just how many shoddy methods, misuses and misrepresentations of data, and the like will it take before education reformers lose legitimacy? Good policy can't be built on incorrect science and evidence.
Sadly, I think there’s a long way to the bottom of that bottle.
Economists don’t do power calculations. We do *teach* power in our undergraduate stats classes, but then we promptly forget it.
Given what I've seen with other econometric trends, I expect power calculations will be "in" in economics in about 10-20 years. We're just getting over a fling with propensity score matching, and we're starting to have good feelings about factor analysis and pre-registering hypotheses. Essentially, when something has been "out" in psychology for about 10 years, it gets discovered by economists.
I did just get asked to do a post-hoc power calculation for a revise and resubmit; I had to explain to my coauthor what that was. There's starting to be a backlash about post-hoc power calculations in psychology… so I'm guessing they'll be de rigueur in econ in about a decade.
And sadly a lot of people still don’t really have a good understanding of alpha… We do all this super complicated stuff but don’t necessarily have good intuition about the basics.
I would argue that there are good reasons we don't use Cohen's d: when we're deciding whether an effect is large or small, we care more about its policy-relevant size than about how many standard deviations it is away from a mean.
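A toy contrast makes the distinction (the income standard deviation here is made up purely for illustration):

```python
# Standardized vs. raw effect size: the same estimate can look "tiny" in
# standard-deviation units while still being the number policy cares about.
# The income SD is a hypothetical value for illustration only.
effect_dollars = 286      # estimated gain in annual income at 28 (from the report)
income_sd = 15_000        # hypothetical cross-sectional SD of income at that age

standardized = effect_dollars / income_sd
print(f"In SD units: {standardized:.3f}")        # ~0.02 of a standard deviation
print(f"In dollars:  ${effect_dollars}/year")    # the policy-relevant magnitude
```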
Ick, I hope economists don't follow psychologists' publishing fads. I glanced at Adler's complaints at: http://nepc.colorado.edu/thinktank/review-measuring-impact-of-teachers. His 5th item, "report cites studies as support for the authors' methodology, even though they don't provide that support," is a personal pet peeve. I have read dozens of psychology papers in the last few years where the authors claim X and cite a few papers. I find claim X interesting or unbelievable, look up the citations, only to discover the citations don't match the claim. Sometimes they never mention X, other times the results demonstrate the opposite of X, and frequently X is an insignificant/small/indirect effect.
If an abstract stated “IQ was studied in populations living at high altitudes”, a subsequent psych paper would inevitably come out claiming, “IQ has been correlated with living at higher altitudes. This supports our finding that people living at sea-level may have brains suffering from over-oxygenation. However, we were not able to rule out that people with high IQ are drawn to mountaintops like seers of ancient societies.”