The Crumbling Tower of PISA Evaluation

A few weeks ago, the PISA test scores–and the belief that U.S. education is DOOMED!–found their way into what passes for our discourse by way of this NY Times op-ed by economist Richard Gordon:

Then there is the poor quality of our schools. The Program for International Student Assessment tests have consistently rated American high schoolers as middling at best in reading, math and science skills, compared with their peers in other advanced economies.

This has been followed by the PR blitz surrounding Amanda Ridley’s new book, The Smartest Kids in the World, in which she dwells almost exclusively on PISA scores for international comparisons. (An aside: I read Ridley’s book while on vacation, and it was derptacular. Poland? Really? Somerby has been eviscerating it, so thankfully I don’t have to. Also, can we please stop with the ‘relating individual stories to The Great Theme’ school of writing? There’s a reason Rebecca Skloot won a Pulitzer with this style: it’s really hard to pull off. In most cases, it’s just meaningless anecdata.)

So there’s a whole lotta PISA goin’ on (as the kids used to say). Except there are some real problems with focusing on this test. I’ve written about this before: PISA is based on a different educational philosophy from what many countries use (ironically, while Finland does really well on PISA, Finnish math professors think this educational philosophy leaves students woefully unprepared for college mathematics). Relatedly, a test that is supposed to measure accurately what students have achieved should line up with what they’ve been taught.

However, there are two additional problems that have cropped up. First, the recent PISA evaluations tested too many poor children in the U.S. (one would think that would be hard to do…). This dramatically alters the U.S. standings (boldface mine):

•Because in every country, students at the bottom of the social class distribution perform worse than students higher in that distribution, U.S. average performance appears to be relatively low partly because we have so many more test takers from the bottom of the social class distribution.

•A sampling error in the U.S. administration of the most recent international (PISA) test resulted in students from the most disadvantaged schools being over-represented in the overall U.S. test-taker sample. This error further depressed the reported average U.S. test score.

•If U.S. adolescents had a social class distribution that was similar to the distribution in countries to which the United States is frequently compared, average reading scores in the United States would be higher than average reading scores in the similar post-industrial countries we examined (France, Germany, and the United Kingdom), and average math scores in the United States would be about the same as average math scores in similar post-industrial countries.

•A re-estimated U.S. average PISA score that adjusted for a student population in the United States that is more disadvantaged than populations in otherwise similar post-industrial countries, and for the over-sampling of students from the most-disadvantaged schools in a recent U.S. international assessment sample, finds that the U.S. average score in both reading and mathematics would be higher than official reports indicate (in the case of mathematics, substantially higher).

•This re-estimate would also improve the U.S. place in the international ranking of all OECD countries, bringing the U.S. average score to sixth in reading and 13th in math. Conventional ranking reports based on PISA, which make no adjustments for social class composition or for sampling errors, and which rank countries irrespective of whether score differences are large enough to be meaningful, report that the U.S. average score is 14th in reading and 25th in math.

•Disadvantaged and lower-middle-class U.S. students perform better (and in most cases, substantially better) than comparable students in similar post-industrial countries in reading. In math, disadvantaged and lower-middle-class U.S. students perform about the same as comparable students in similar post-industrial countries.

•At all points in the social class distribution, U.S. students perform worse, and in many cases substantially worse, than students in a group of top-scoring countries (Canada, Finland, and Korea). Although controlling for social class distribution would narrow the difference in average scores between these countries and the United States, it would not eliminate it.
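To see how class composition alone can move a national average, here is a minimal sketch of the kind of re-weighting the report describes. Every number in it is made up for illustration (none comes from PISA or from the report), and the names `weighted_mean`, `country_a_*` and `country_b_*` are hypothetical. It is not the report's actual method, just the basic arithmetic behind this sort of adjustment.

```python
# Hypothetical illustration only: the shares and scores below are invented,
# not taken from PISA or from the report quoted above.

def weighted_mean(scores, shares):
    """Average of group mean scores, weighted by each group's population share."""
    return sum(s * w for s, w in zip(scores, shares)) / sum(shares)

# Mean score within each social class group, lowest to highest (hypothetical).
country_a_scores = [450, 500, 550]
country_b_scores = [440, 490, 540]   # lower than country A in every group

# Share of test takers in each social class group (hypothetical).
country_a_shares = [0.40, 0.40, 0.20]   # many disadvantaged test takers
country_b_shares = [0.15, 0.45, 0.40]   # few disadvantaged test takers

# Unadjusted national averages, as a conventional ranking would report them.
raw_a = weighted_mean(country_a_scores, country_a_shares)
raw_b = weighted_mean(country_b_scores, country_b_shares)

# Country A's average if it had country B's social class distribution.
adjusted_a = weighted_mean(country_a_scores, country_b_shares)

print(f"raw A: {raw_a:.1f}, raw B: {raw_b:.1f}, adjusted A: {adjusted_a:.1f}")
```

In this toy setup, country A outscores country B in every social class group, yet its unadjusted average is lower because more of its test takers come from the bottom group. Once both countries are weighted by the same distribution, the order flips.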

Not great for the U.S., but we’re hardly DOOMED! either. PISA also has test question issues (boldface mine):

What if you learned that Pisa’s comparisons are not based on a common test, but on different students answering different questions? And what if switching these questions around leads to huge variations in the all-important Pisa rankings, with the UK finishing anywhere between 14th and 30th and Denmark between fifth and 37th? What if these rankings – that so many reputations and billions of pounds depend on, that have so much impact on students and teachers around the world – are in fact “useless”?

…For example, in Pisa 2006, about half the participating students were not asked any questions on reading and half were not tested at all on maths, although full rankings were produced for both subjects. Science, the main focus of Pisa that year, was the only subject that all participating students were tested on.

Professor Svend Kreiner of the University of Copenhagen, Denmark, has looked at the reading results for 2006 in detail and notes that another 40 per cent of participating students were tested on just 14 of the 28 reading questions used in the assessment. So only approximately 10 per cent of the students who took part in Pisa were tested on all 28 reading questions.

“This in itself is ridiculous,” Kreiner tells TES. “Most people don’t know that half of the students taking part in Pisa (2006) do not respond to any reading item at all. Despite that, Pisa assigns reading scores to these children.”
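Taking the approximate shares quoted above at face value, a quick back-of-the-envelope calculation shows how few reading questions the average participant actually saw. The variable names are mine, and the shares are the rounded figures from the TES passage, not official counts.

```python
# Approximate shares of Pisa 2006 participants by number of reading questions
# answered, as described in the TES passage quoted above.
TOTAL_READING_QUESTIONS = 28
coverage = [
    (0.50, 0),    # about half answered no reading questions
    (0.40, 14),   # another 40 per cent answered 14 of the 28
    (0.10, 28),   # roughly 10 per cent answered all 28
]

# Sanity check: the three groups should account for everyone.
share_total = sum(share for share, _ in coverage)

# Average number of reading questions answered per participant.
mean_questions = sum(share * n for share, n in coverage)
mean_fraction = mean_questions / TOTAL_READING_QUESTIONS

print(f"average reading questions answered: {mean_questions:.1f} "
      f"of {TOTAL_READING_QUESTIONS} ({mean_fraction:.0%})")
```

On these rounded figures, the typical participant answered under a third of the reading questions, which is why the reading scores have to be estimated by a statistical model and not simply tallied.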

The other problem is a statistical one: the model that is used to compare PISA results among different countries has real problems. What most people don’t realize is that, due to cultural differences, scores aren’t actually reported; ‘plausible values’ are estimated. According to the experts in that particular statistical methodology, there are serious problems (boldface mine):

The Rasch model is at the heart of some of the strongest criticisms being made of Pisa. It is also the black box within Pisa’s black box: exactly how the model works is something that few people fully understand.

But Kreiner does. He was a student of Georg Rasch, the Danish statistician who gave his name to the model, and has personally worked with it for 40 years. “I know that model well,” Kreiner tells TES. “I know exactly what goes on there.” And that is why he is worried about Pisa.

He says that for the Rasch model to work for Pisa, all the questions used in the study would have to function in exactly the same way – be equally difficult – in all participating countries. According to Kreiner, if the questions have “different degrees of difficulty in different countries” – if, in technical terms, there is differential item functioning (DIF) – Rasch should not be used….

But Kreiner’s research suggests that the variation is still too much to allow the Rasch model to work properly. In 2010, he took the Pisa 2006 reading test data and fed them through the Rasch model himself. He said that the OECD’s claims did not stand up because countries’ rankings varied widely depending on the questions used. That meant the data were unsuitable for Rasch and therefore Pisa was “not reliable at all”.

In addition to the UK and Denmark variations already mentioned, the different questions meant that Canada could have finished anywhere between second and 25th and Japan between eighth and 40th. It is, Kreiner says, more evidence that the Rasch model is not suitable for Pisa and that “the best we can say about Pisa rankings is that they are useless”.

…more significantly, it [the OECD] has now admitted that there is “uncertainty” surrounding Pisa country rankings and that “large variation in single ranking positions is likely”.
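For readers who haven’t met the Rasch model: in its standard form, the probability of a correct answer depends only on the gap between a person’s ability and the item’s difficulty. The toy sketch below shows why differential item functioning matters. It is not Kreiner’s analysis or the OECD’s scaling procedure; the two countries, the abilities and the difficulties are all invented, and `expected_score` and `compare` are hypothetical helper names.

```python
import math

def rasch_p(theta, b):
    """Rasch model: P(correct) for a person of ability theta on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def expected_score(theta, difficulties):
    """Expected fraction of items answered correctly."""
    return sum(rasch_p(theta, b) for b in difficulties) / len(difficulties)

# Hypothetical setup: students in both countries have the SAME ability...
theta = 0.0
# ...but the items do not function the same way in each country (DIF).
# Items 0-1 are harder in country X; items 2-3 are harder in country Y.
difficulty_x = [0.5, 0.5, -0.5, -0.5]
difficulty_y = [-0.5, -0.5, 0.5, 0.5]

def compare(items):
    """Expected scores for X and Y on the chosen subset of item indices."""
    x = expected_score(theta, [difficulty_x[i] for i in items])
    y = expected_score(theta, [difficulty_y[i] for i in items])
    return x, y

x_first, y_first = compare([0, 1])       # Y looks better
x_last, y_last = compare([2, 3])         # X looks better
x_all, y_all = compare([0, 1, 2, 3])     # a tie

print(f"items 0-1: X={x_first:.3f} Y={y_first:.3f}")
print(f"items 2-3: X={x_last:.3f} Y={y_last:.3f}")
print(f"all items: X={x_all:.3f} Y={y_all:.3f}")
```

With identical abilities, which country comes out ahead depends entirely on which questions are used. That is the mechanism behind rankings that swing with the choice of items, though the real data are far messier than this sketch.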

Technical issues–that is, technique–matter.

So while PISA scores don’t look great, they tell a much worse story about the U.S. than either the TIMSS or PIRLS tests do. This isn’t to deny that there are real problems: the Alabama-Massachusetts gap, which is comparable to the effects of moderate lead poisoning, comes to mind. But we have models that work, and we should adopt them.

This entry was posted in Education.

2 Responses to The Crumbling Tower of PISA Evaluation

  1. dr2chase says:

    Erm, are the complaints about not all students being tested on all questions any different from a misunderstanding of sampling and statistical methods? That tends to be how we test the quality of widgets: select 1000 out of a batch of a million, measure their quality, efficiency, whatever, and we can be pretty sure we learned something about the remainder of the million. Assuming, of course, that our sample is randomly selected.

  2. Min says:

    I have a nodding acquaintance with Rasch testing, having used it for a research project in grad school. IMO, Rasch’s great insight was that test questions have different relative difficulty for different test takers. So if you are comparing two students or two different groups of students, you do not use test questions that have different relative difficulty for the two students or two groups. That is Kreiner’s point. PISA is not really using Rasch’s approach.
