To use a phrase. The Texas Observer has a very interesting story about a researcher who testified in front of the Texas Lege about a fundamental problem with that state’s regime of standardized tests (boldface mine):
Then Stroup sat down at the witness table and offered the scientific basis behind the widely held suspicion that what the tests measured was not what students have learned but how well students take tests….
Stroup argued that the tests were working exactly as designed, but that the politicians who mandated that schools use them didn’t understand this….
What he noticed was that most students’ test scores remained the same no matter what grade the students were in, or what subject was being tested. According to Stroup’s initial calculations, that constancy accounted for about 72 percent of everyone’s test score. Regardless of a teacher’s experience or training, class size, or any other classroom-based factor Stroup could identify, student test scores changed within a relatively narrow window of about 10 to 15 percent.
Stroup knew from his experience teaching impoverished students in inner-city Boston, Mexico City and North Texas that students could improve their mastery of a subject by more than 15 percent in a school year, but the tests couldn’t measure that change. Stroup came to believe that the biggest portion of the test scores that hardly changed—that 72 percent—simply measured test-taking ability….
Stroup concluded that the tests were 72 percent “insensitive to instruction,” a graduate-school way of saying that the tests don’t measure what students learn in the classroom.
This claim earned Stroup a rebuke from the TEA [the Texas Education Agency], which stated that his findings betrayed “fundamental misunderstandings” about the way tests were constructed. The idea that most of a student’s test score carries over almost automatically, with little variance, year to year, was new, but it shouldn’t have been. After three years, STAAR scores have not budged much at all, and the TEA’s own recent report on the STAAR test results largely agrees with Stroup’s finding: The state agency declared that about 58 percent of middle school test scores showed little change from year to year.
Naturally, both the TEA and Pearson, the company that makes a lot of money selling Texas the tests, struck back at Stroup. There is some question whether the invariant component of the test is as high as 72 percent; it might ‘only’ be 50 percent. Regardless, these tests seem to be focused on test-taking ability*:
Determining whether the number was 50 percent or 72 percent is one thing, but the real question is what that percentage meant. Stroup thought it quantified the portion of the test that measured test-taking ability. Another theory, from James Popham, emeritus professor in the Graduate School of Education and Information Studies at the University of California-Los Angeles, was that these types of tests measured innate intelligence, a morally dubious deduction when the results neatly correlate with race and ethnicity.
Way hypothesized that the 50 percent correlation “most likely reflects the fact that students are retaining what they’ve learned in previous years’ instruction and are building on that knowledge in the expected way.” But if that were true, then some students would do better in math than in reading, for example.
But that’s not what the research showed. A student in the third grade did as well on a math test as that same student did in the eighth grade on a language arts test as the same student did in the 10th grade on a different test. Regardless of changes in school, subject and teacher, a student could count on a test result remaining 50 to 72 percent unchanged no matter what. Stroup hypothesized that the tests were so insensitive to instruction that a test could switch out a science question for a math question without having any effect on how that student would score.
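To see why a big stable component would make scores look “insensitive to instruction,” here’s a toy simulation. This is my illustrative sketch, not Stroup’s actual model: the 0.72 weight, the Gaussian noise, and all the names are assumptions, but the mechanism is the one described above — if most of a score comes from a trait the student carries from year to year, scores stay highly correlated across years (and subjects) no matter what happens in the classroom.

```python
import random

random.seed(0)

N_STUDENTS = 1000
STABLE_SHARE = 0.72  # hypothetical stable fraction of the score ("test-taking ability")

# Each student has a fixed trait that persists across years and subjects.
students = [random.gauss(0, 1) for _ in range(N_STUDENTS)]

def score(trait):
    # score = big stable part + smaller year/subject-specific "instruction" part
    return STABLE_SHARE * trait + (1 - STABLE_SHARE) * random.gauss(0, 1)

def corr(xs, ys):
    # Plain Pearson correlation, no external libraries.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Two testing occasions -- call them year 1 math and year 2 reading;
# the stable trait is the same, only the smaller noise term differs.
year1 = [score(t) for t in students]
year2 = [score(t) for t in students]
print(f"correlation across tests: {corr(year1, year2):.2f}")
```

Even with completely independent “instruction” effects on each test, the two sets of scores come out strongly correlated, because the stable trait dominates the variance. That’s the shape of the finding: high test-to-test correlation tells you about the students, not about the teaching.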
At some point, you might expect Campbell’s Law to rear its ugly head. You’ll never guess what happened next!
If it’s true that the test measured primarily students’ ability to take a test, then, Stroup reasoned to the House Public Education Committee in June 2012, “it is rational game theory strategy to target the 72 percent.” That means more Pearson worksheets and fewer field trips, more multiple-choice literary analysis and fewer book reports, and weeks devoted to practice tests and less classroom time devoted to learning new things. In other words, logic explained exactly what was going on in Texas’ public schools.
When business lobbyists and legislators desired tests that measure whether a student was “college and career ready,” they didn’t dramatically reform the curriculum. They needed harder questions based on the same curriculum, a trick Pearson managed by incorporating logic puzzles into questions about knowledge.
As a tangential aside, it’s bizarre that we value students’ ability to puzzle out trick questions, when, as adults, we typically value clear, direct explanations (and requests). Odd training for the workplace, that is.
Keep in mind that this insensitivity could be masking larger differences in educational outcomes. So, leaving aside the use of tests to assess teaching, we aren’t even getting an accurate picture of what students are learning.
By the way, you’ll hardly be surprised that Stroup’s job is in jeopardy, since the Pearson Foundation gave his institution, the University of Texas, a one million dollar grant to found “the Pearson Center for Applied Psychometric Research”. Pearson and UT, of course, deny any such link. Coincidence, I’m sure.
*My personal experience with these sorts of tests–and I did really well on them–is that there is a lot of test-taking strategy involved. Typically, you’re better off quickly assessing which answer seems right, and then working backwards from that answer to the question (hell, they give you the answer).