I highly recommend this devastating indictment of New York City’s education ‘reform’ policies by an anonymous NYC Department of Education insider. Wonkiness is rarely so brutal. What I want to focus on is the section about evaluation, since reformers love their metrics. First, it’s pretty clear that the school grading system is garbage (boldface mine):
School grades on the Progress Reports fluctuate wildly from year to year and do not reflect genuine changes in school quality. One year’s data showed that 75% of the schools that received F’s the previous year got A’s or B’s the next. Another year, 60% of schools moved 200 places or more in the rankings as compared to the prior year. Yet another year, over 55% of schools moved a letter grade or more (out of only 5 possible grades) as compared to the prior year. One analysis demonstrated that this year-to-year change in grades is only slightly less than would be expected by random luck. One year’s grades showed a correlation of -.02, in other words no correlation at all from one year to the next, in schools’ progress scores. Another set of grades showed low moderate stability in school growth scores, with small schools showing very little stability. Schools with fewer than 500 students saw the largest swings. A fascinating experiment demonstrated that a random number generator was 4-10% more accurate in predicting a school’s grade than the actual grade the school received the prior year.
Ok, ok, but it’s not like the evaluation process is screwing up student evaluations, right? Crap:
Many of the measures used in the report cards are easily gamed by schools and have nothing to do with the quality of education students receive. Up until last year schools graded their own Regents exams, leading to a situation where the number of passing scores was inflated by 4-8.3%, depending on the exam, and “the manipulation of Regents scores was noticeably more common in NYC than elsewhere in the state.” Allegations of test-tampering and grade-changing have “more than tripled” under mayoral control.
Thankfully, at least these policies aren’t providing incentives to avoid the students who need the most help, right? Sigh:
The report cards claim to account for this by only comparing schools to other schools with similar student populations. But is this, in fact, the case? The data say no. A report by New York University found that schools serving higher proportions of Black and Latino students, English Language Learners, and students with disabilities received lower grades. The same pattern was found by a professor at Columbia’s Teachers College and by NYC’s Independent Budget Office. They also found that the less selective a school was in accepting students the lower the average grade. Other analyses found that the report cards favor schools that start with higher student baseline scores, that as the percent of self-contained special education students at a school increases from 0% to 14.5% the average report card grade falls by over 20 points, and that schools that get “D” and “F” grades have many more students entering overage than schools that get an “A.” But that is not all. Schools in the lowest quartile of grades have lower entering student scores and 2.5x more self-contained students than schools in the top quartile. An analysis of another year’s data found that the median proportion of self-contained special education students at schools with “F” grades was 2,100% greater than at schools with “A” grades. Schools with the lowest levels of students receiving free lunch were 3.5 times more likely to get A’s than schools with the highest levels. So it is obvious that the report cards do not account for differences in incoming student characteristics.
But maybe this could be used to evaluate teacher performance. Heh:
For years NYC created value-add reports for math and English 4th-8th grade teachers. As of 2013 New York State will be creating value-add reports for even more teachers as part of the Race to the Top teacher evaluation system. What does the data from NYC tell us about the reliability of such reports? The data used to create value-add teacher rankings in NYC was often inaccurate. The DOE itself admitted that a third of all value-add reports were not reliable and that in 30 schools the reports for every single teacher were not reliable. Scores for teachers of classes at the top or bottom were particularly unreliable, with 3,900 of 11,800 multiyear ratings falling into this category. Even with multiple years of data, up to 70% of teachers could not be distinguished from average. A .01 change in either direction changed a teacher’s percentile ranking up to 63%, making small real world differences appear larger than they really were. Scores changed by large amounts from year to year, with 49% of teachers moving downwards. There was almost no correlation between the English and math scores of teachers who taught both subjects during the same year. 98% of teachers fell in a very narrow range, meaning that the numbers should not be used to create rankings of teachers. Looking at the same exact teachers, in the same exact schools, teaching the same subjects, there was no correlation between teachers’ 2005-06 scores and their 2007-08 scores. There was no correlation between a teacher’s value-add score from year to year. It was, in fact, close to random. A teacher in the 90+ percentile one year had only a 1 in 4 chance of remaining there the next. A teacher in the bottom 10% had only a 7% chance of remaining there the following year. Only 7% of teachers landed above the median for 3 years in a row, with lots of movement between the upper half and the bottom third. Predictions about future student achievement assumed by the formula were not accurate.
Scores were biased against teachers of high performing students. There was a 3+:1 ratio of teachers who taught high-performing students rated below average versus above average. A single extra question correct on the exams of a teacher in this group raised their value-add score by 10-20 points while an incorrect answer lowered their score by 20-50 points. The reports did not control for school level factors and class size. For example, a teacher was 7.3% less likely to receive a good rating for each additional student increase in average class size. A teacher’s score one year predicted only 5-8% of the next year’s score. The value-added scores of teachers who taught similar groups of students with similar pre-test scores for two years in a row showed almost no correlation. 43% of teachers with very high value-add scores in 2009 did not meet that mark in 2010. Of the thousands of teachers in the top 20% in 2005-06 only 14 math teachers and 5 ELA teachers remained there each year through 2009-10. The educator rated as the worst teacher in the city taught the highest need English Language Learners, very few of whom took the exams the rating was based on. What’s worse, 40% of her students had “imputed” scores which are wholly unreliable.
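The "98% of teachers fell in a very narrow range" detail explains the wild percentile swings: when raw scores are tightly clustered, a tiny nudge in the score moves a teacher a long way through the rankings. A toy illustration (my own made-up numbers, not the DOE's value-add model):

```python
import random
import bisect

random.seed(1)

# Hypothetical illustration, NOT the DOE formula: 10,000 teachers
# whose raw scores cluster in a very narrow band around 0.50.
scores = sorted(random.gauss(0.50, 0.01) for _ in range(10_000))

def percentile(score):
    # percent of teachers at or below this raw score
    return 100 * bisect.bisect_right(scores, score) / len(scores)

before = percentile(0.50)
after = percentile(0.51)  # a mere 0.01 bump in the raw score
print(round(before), round(after))  # a large jump in percentile rank
```

With scores packed this tightly, a 0.01 change in raw score sweeps past a huge fraction of the distribution, which is how "small real world differences appear larger than they really were" when you insist on ranking everyone anyway.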
Most reformers pride themselves on being hard-assed technocrats who solve problems. They’re grownups, not touchy-feely Dirty Hippies. But these particular technocrats couldn’t find their own asses. In any competent scientific program, a first-year graduate student wouldn’t be allowed to get away with this statistical chicanery.
But it’s only the cognitive development of children, so no biggie.
Mind you, I didn’t even mention how poor students are being provided with thirteen percent less funding than wealthy students, or the ridiculous expansion of the Department of Education central staff. You’ll have to read that on your own (section #1).