Matthew Di Carlo provides a very interesting figure comparing year-to-year school proficiency rates in New York City (i.e., the share of students at a given school who pass a threshold on an assessment exam):
Di Carlo (boldface mine):
Classes and schools tend to be quite small, and test scores vary far more between- than within-student (i.e., over time). As a result, testing results often exhibit a great deal of nonpersistent variation (Kane and Staiger 2002). In other words, much of the differences in test scores between schools, and over time, is fleeting, and this problem is particularly pronounced in smaller schools. One very simple, though not original, way to illustrate this relationship is to compare the results for smaller and larger schools….
The “sideways cone” shape of the dots indicates that the changes among larger schools – i.e., the dots further to the right of the plot – are considerably more modest than those of smaller schools. Just to give a better idea of these differences, consider that roughly one in four schools in this sample have sample sizes of fewer than 200 students, while almost one in five (17 percent) of schools have samples of 500 or more students. The mean absolute change (positive or negative) for the former schools (fewer than 200 tested students) is 6.7 percentage points, which is almost 50 percent larger than the average absolute change (4.5 percentage points) among the latter schools (samples of 500 or more students).
In other words, again, smaller schools exhibit much larger year-to-year changes, whether positive or negative, than larger schools. And this presumably is not because there is something about attending a smaller school that causes students’ measured performance to fluctuate more. It is because of their smaller samples.
This matters in accountability systems because it means that smaller schools are more likely to be rewarded or punished, not because they are any better or worse, but simply because their results are noisier. And the same goes for accountability systems that hold schools and districts accountable for performance among student subgroups – diverse schools would be less likely to be punished or rewarded, because their subgroup-specific sample sizes are larger.
We’ve noted before that small sample sizes are a problem when it comes to evaluating individual teachers: there is a lot of year-to-year variability, leading to incredibly imprecise assessments. That doesn’t mean test outcomes, whether focused on teachers or schools, can’t be useful, but they need to be taken with a huge grain of salt and shouldn’t be used, except in extremis, to make hiring and firing decisions.
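The sampling effect Di Carlo describes is easy to see in a quick simulation. The sketch below is hypothetical and uses made-up numbers (it is not Di Carlo's data): every simulated school has the same true proficiency rate of 60 percent, and students pass independently. Even so, the mean absolute year-to-year change in measured proficiency comes out larger for schools testing 200 students than for schools testing 500, purely because of sample size.

```python
import random

random.seed(1)

def proficiency_rate(n, p=0.6):
    """Percent of n students passing, each passing independently
    with probability p (p=0.6 is an arbitrary illustrative value)."""
    return 100 * sum(random.random() < p for _ in range(n)) / n

def mean_abs_change(n_students, n_schools=1000):
    """Average absolute change between two simulated 'years' of
    proficiency rates, across n_schools schools of the same size."""
    changes = [abs(proficiency_rate(n_students) - proficiency_rate(n_students))
               for _ in range(n_schools)]
    return sum(changes) / len(changes)

small = mean_abs_change(200)   # schools testing 200 students
large = mean_abs_change(500)   # schools testing 500 students
print(f"mean |year-to-year change|, n=200: {small:.1f} points")
print(f"mean |year-to-year change|, n=500: {large:.1f} points")

# The small-school figure is reliably larger, even though every
# simulated school has the identical true proficiency of 60 percent:
# nothing about the schools differs except how many students are tested.
```

Nothing here is about school quality; the "sideways cone" in the figure falls straight out of binomial sampling noise shrinking as the number of tested students grows.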