Understanding the Limitations of Your Data: The Testing Protocol Edition

In the era of Big (Infatuation With) Data, one thing to keep in mind is that all data have limitations. How the data are collected and analyzed can have huge effects on your conclusions. For example, how poor students are defined can really affect conclusions about educational policy. Here’s an even more disturbing case from Baltimore (boldface mine):

In Maryland, meanwhile, more than 41,000 Baltimore County students in grades 3-8 took the PARCC exams in 2014-15. Fifty-three percent of students took the math exam online, while 29 percent took the English/language arts exam online. The mode of test administration was decided on a school-by-school basis, based on the ratio of computers to students in each building’s largest grade.

Like Illinois, Baltimore County found big score differences by mode of test administration. Among 7th graders, for example, the percentage of students scoring proficient on the ELA test was 35 points lower among those who took the test online than among those who took the test on paper.

To identify the cause of such discrepancies, district officials compared how students and schools with similar academic and demographic backgrounds did on each version of the exams.

They found that after controlling for student and school characteristics, students were between 3 percent and 9 percent more likely to score proficient on the paper-and-pencil version of the math exam, depending on their grade levels. Students were 11 percent to 14 percent more likely to score proficient on the paper version of the ELA exam.

“It will make drawing comparisons within the first year’s results difficult, and it will make drawing comparisons between the first- and second-year [PARCC results] difficult as well,” said Brown, the accountability chief for the Baltimore County district.

“This really underscores the need to move forward” with the district’s plan to move to an all-digital testing environment, he said….

“Because we’re in a transition stage, where some kids are still taking paper-and-pencil tests, and some are taking them on computer, and there are still connections to high stakes and accountability, it’s a big deal,” said Derek Briggs, a professor of research and evaluation methodology at the University of Colorado at Boulder.

“In the short term, on policy grounds, you need to come up with an adjustment, so that if a [student] is taking a computer version of the test, it will never be held against [him or her],” said Briggs, who serves on the technical-advisory committees for both PARCC and Smarter Balanced.

While I haven’t been able to find any information about what this means in terms of standard deviations (i.e., useful information), it seems to me that if a district implemented an intervention that led to increases of three to fourteen percentage points in proficiency, this would be heralded as a significant effect. Politicians would be congratulating themselves non-stop for such improvements. Except here the breakthrough intervention is paper and pencil.
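For a rough sense of scale, here is a minimal sketch of how a gap in proficiency rates could be translated into standard deviation units. It rests on assumptions that are not in the reported results: that scores are normally distributed with the same spread in both groups, that the reported differences are percentage points, and that the baseline proficiency rates are the hypothetical values listed in the code. The function name `gap_in_sd_units` is made up for illustration.

```python
from statistics import NormalDist


def gap_in_sd_units(base_rate: float, gap: float) -> float:
    """Mean shift (in SD units) implied by a gap in proficiency rates.

    Assumes both groups have normally distributed scores with equal spread
    and face the same cut score. `base_rate` is the proportion proficient
    in the lower-scoring group; `gap` is the difference in proportions.
    """
    if not (0.0 < base_rate < 1.0 and 0.0 < base_rate + gap < 1.0):
        raise ValueError("proficiency rates must lie strictly between 0 and 1")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # A group with proficiency rate p sits inv_cdf(p) SDs above the cut score,
    # so the difference between the two groups is the implied mean shift.
    return std_normal.inv_cdf(base_rate + gap) - std_normal.inv_cdf(base_rate)


# Hypothetical baseline rates; the source does not report them.
for base_rate in (0.20, 0.35, 0.50):
    for gap in (0.03, 0.14):  # the low and high ends of the reported range
        shift = gap_in_sd_units(base_rate, gap)
        print(f"baseline {base_rate:.0%}, gap {gap:.0%}: about {shift:.2f} SD")
```

The same gap implies a different shift depending on the baseline rate, which is one reason proficiency percentages alone are hard to interpret.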

What kills me is the call by Baltimore County’s ‘accountability chief’ (really? Are you fucking kidding me?) for students to switch over to computer testing. Because choosing the more infrastructure-intensive (and expensive) testing method, one that also lowers scores, is the obvious way to go here. Or something.

Anyway, something to consider when test scores are discussed: how a test is actually administered in the real world can make the results tricky to assess.
