Education rankings "flawed"
by Catherine Woulfe
A test that ranks countries’ educational achievements has serious flaws, some academics say, and basing reforms on the league tables is a big mistake.
The BBC has called Pisa “the world’s most important exam”. This Programme for International Student Assessment (Pisa) is often compared to the Olympics. Every three years, in more than 60 countries, about half a million 15-year-olds sit down to answer questions on mathematics, reading and science.
The stakes are high: topping the Pisa charts made Finland the education world’s golden child. In Korea, students reportedly stand in lines on the big day, applauding their peers as they file into the testing rooms. Taking part cost New Zealand about $2 million in this round alone.
The tests are run by the Organisation for Economic Co-operation and Development (OECD), which crunches the data and uses it to underpin a huge range of analyses and policy recommendations.
Crucially, it also releases a league table, listing countries from best to worst, and it’s these rankings that get all the attention.
Countries that don’t do as well as expected – or drop drastically, as New Zealand has this year – often plunge into what’s known as “Pisa shock”. It happened to Germany and Denmark in the first round of tests in 2000, and Education Secretary Michael Gove is reforming the United Kingdom’s school system on the back of a “plummet” in the rankings.
Even when countries rank highly in Pisa, as New Zealand has every time – until, arguably, this year – politicians never stop coveting the next rung on the ladder. “Standing still is slipping behind,” Education Minister Hekia Parata told a teachers’ union conference in Nelson last year, pointing out that the six countries – six, from 65 – ahead of New Zealand in the 2009 rankings had been behind us a few years previously.
In October she sent the Secretary for Education and representatives from the major unions and principals’ groups to Singapore and Hong Kong “to investigate the characteristics of these top-performing systems and report on what can be learnt and what might be applicable to the New Zealand context … We need to learn from those who are ranked ahead of us in the Pisa rankings. We cannot be complacent.”
But as Pisa’s influence has grown, so has the attention it gets from academics. And 13 years in – with a towering stack of policy and reforms and reputations at stake – some who have examined Pisa closely are adamant that the whole thing is built on swampy statistical ground. Many believe there are problems with the way data is collected and analysed. These problems go so deep and matter so much, some say, that we should ignore the rankings completely – and certainly stop using them to drive changes to the way we teach our children.
This story is going to press the day after results were released. The words “nosedive”, “plummet” and “slump” are popping up all over, from Morning Report to stuff.co.nz. Whereas Parata says the results are “serious” but not a surprise, Labour’s education spokesperson Chris Hipkins says we are in “absolute freefall”.
In the fine print, the OECD warns countries not to get too hung up on the rankings. For example, in the latest tests, New Zealand ranked 23rd in maths, which was the main subject tested.
But in its report, a few pages past the league table, the OECD notes that the difference between us and Austria – which is five slots ahead on the league table – is not statistically significant. That doesn’t just mean the gap is small; it means that in statistical terms there is no gap at all.
In fact, we’re level-pegging with Austria, Australia (phew), Ireland, Slovenia, Denmark, the Czech Republic, France and the United Kingdom. Same thing in 2009: we came seventh but could easily have slid up or down two places, as we were on a par with Singapore, Canada, Japan and Australia.
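The arithmetic behind "not statistically significant" can be sketched in a few lines. This is a minimal illustration, not the OECD's procedure: it assumes each country's mean score comes with a standard error, treats the two samples as independent, and uses the conventional 1.96 cut-off for a 95% test. The function name and the figures passed to it are hypothetical placeholders, not values taken from the Pisa report.

```python
import math

def difference_is_significant(mean_a, se_a, mean_b, se_b, z_crit=1.96):
    """Return True if two mean scores differ by more than sampling noise allows.

    Assumes independent samples; z_crit=1.96 corresponds to a two-sided 95% test.
    """
    # Standard error of the difference between two independent means.
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = abs(mean_a - mean_b) / se_diff
    return z > z_crit

# Hypothetical figures for illustration only (not official Pisa values).
print(difference_is_significant(500, 2.2, 506, 2.7))  # six-point gap
print(difference_is_significant(500, 2.2, 520, 2.7))  # twenty-point gap
```

With these made-up standard errors, a six-point gap falls inside the noise while a 20-point gap does not, which is how countries several slots apart on a league table can still be statistically tied.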
(It’s also worth noting that some of those ahead of us, such as Hong Kong and the reigning champ, Shanghai, are special cases: cities, rather than countries, and educational hothouses at that.)
The Pisa reports break down the rankings into three broad bands – average, above average and below average – and some academics think that’s as detailed as the comparisons should get. Despite New Zealand’s slip, we are still in the “above average” band.
WHAT’S THE PROBLEM?
Part of the issue, academics say, is that the OECD is not forthcoming about aspects of how it runs the tests. Its technical reports, according to Danish statistician Svend Kreiner, are more like textbooks: they outline how Pisa is carried out and give assurances that aspects of the system are checked for glitches – but do not contain the results of those checks.
Kreiner is the reluctant epicentre of the current criticism. His work focuses on the questions in the Pisa tests and how the answers are processed. This differs from other criticism, he says, because it’s black-and-white maths: he’s using Pisa’s own data and the methods outlined in those technical documents, but getting different results.
Pisa questions are geared to be equitable across countries – to let a teenager sitting the test in Qatar, for example, have just as good a shot as a teen with the same ability in Korea. A little variation is to be expected. But although Pisa says it has managed “remarkable consistency” across countries, Kreiner claims to have found “remarkable inconsistency”. And he is adamant this plays havoc with the final results.
Kreiner, professor emeritus at the University of Copenhagen, examined 20 questions used in the 2006 reading tests and followed Pisa’s own methodology to look at how they behaved across countries. It took him only three or four hours to be convinced there was a serious problem.
None of the 20 questions behaved in an equal enough way: basically, results jumped around depending where students came from, regardless of the ability of those students. This would be fine if the results were then analysed in the right way. But Kreiner says Pisa feeds the results into a number-crunching model that can only give accurate data when there is no such instability or jumping around. This is called the Rasch model, and Kreiner is extremely familiar with it, having studied under its inventor and worked with it for 40 years.
He believes the Rasch model can be a valuable tool, but the evidence against using it to deal with Pisa data is “irrefutable”.
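In its basic textbook form, the Rasch model says the chance of a correct answer depends only on the gap between a student's ability and a question's difficulty. The sketch below uses that standard formula to show what goes wrong when a question's difficulty shifts from country to country. The ability value, difficulty values and country labels are invented for illustration; they are not drawn from Kreiner's paper or from Pisa data.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the basic Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Two students of identical ability (hypothetical value, in logits).
ability = 0.5

# Hypothetical: the same question is effectively harder in country B.
difficulty_by_country = {"A": 0.0, "B": 0.8}

for country, difficulty in difficulty_by_country.items():
    p = rasch_probability(ability, difficulty)
    print(f"Country {country}: P(correct) = {p:.2f}")
```

The model assumes one difficulty per question for everyone. If the real difficulty differs by country, as in this toy case, equally able students end up with different expected scores, and a model fitted with a single difficulty will read that gap as a difference in ability.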
A version of his paper, co-authored with colleague Karl Bang Christensen, was published online in the journal Psychometrika in June. Two weeks later the story broke in the UK’s Times Educational Supplement. “The best we can say about Pisa rankings is that they are useless,” Kreiner said then. Has he changed his mind? “No, no,” he tells the Listener, with a wry laugh. “Not at all.”
He does his best to explain in simple terms what using the Rasch model does to those all-important rankings. “This means the ranking of the countries based on the test scores are really confounded. So there’s a systematic bias, and exactly what the effect of that bias is is hard to say, but we have made an attempt in our paper to look at it. And it looks as if it could be very serious.”
How serious? Well, in the reading tests Kreiner checked, he found the fluctuation in questions could have seen Japan wind up anywhere from 8th to 40th; the UK from 14th to 30th; and Canada from 2nd to 25th.
So Kreiner has an emphatic message for our Education Minister: “Forget about these rankings. Disregard them.”
That’s not to say we should ignore other aspects of our results – such as how many points we actually scored, as opposed to where that leaves us on the league table. But we do have to do our own analysis of these before using them.
“Try to use the Pisa data to analyse the changes over time in New Zealand,” Kreiner advises. “Forget about all the other countries. You have to do a serious statistical analysis – that goes without saying – and you have to do it better than the OECD, but that’s definitely possible.”
(According to the Pisa reports, our performance in all three subjects was flat between 2000 and 2009 and has now slipped slightly; maths saw the biggest drop, from 519 points in 2009 to 500 last year, while reading dropped nine points and science 16.)
If we really wanted to push our luck, Kreiner says, we could be “a little less ambitious” than the OECD and compare ourselves with only a few similar countries, such as Australia. “Try to do that and see what can be done in a way that actually works out. Don’t try to compare test results from 65 countries.”
Kreiner says he initially tried to stay out of the public debate and refused to communicate with other critics. He resents the fact he is doing research he believes the OECD should be carrying out – and making public. This is why he refuses to expand his research by looking at Pisa results from other years.
OECD ACKNOWLEDGES "UNCERTAINTY"
In the Times Educational Supplement (TES) story, the OECD acknowledges, for what is thought to be the first time, that there will always be “uncertainty” and “large variation in single [country] ranking positions is likely”.
That’s exactly his point, Kreiner says with a sigh. “At the end of the day, are they going to change anything? I’m not too optimistic. They have simply used so much money, wasted so much time, so they can’t just admit that it’s not worth anything. And of course, the OECD is an elephant and I am a mosquito, so …”
But he’s not alone. The BBC recently sent University of Cambridge professor David Spiegelhalter to interview Kreiner and examine his work. His conclusion? “Pisa does have a lot to offer, but as a statistician I’m left with serious concerns about the reliability of the league tables and the lack of evidence to support Pisa’s methodology in compiling them. Governance and external review seem extremely limited. I will treat the scores and ranks with considerable skepticism.”
The Listener asked Victoria University’s Michael Johnston, a senior lecturer in education policy and research, to review Kreiner’s work. Johnston says the Rasch model is used to produce “more meat in the data”. That is, Pisa’s statisticians feed in the raw results from questions students have answered, and the model builds on them to generate a set of new results – which Pisa’s technical documents call “plausible values”. These values are not earned in real tests. Both types of result – the real test scores and the “plausible values” – are then treated the same to come up with the country rankings.
The OECD did not respond to Listener questions about the proportion of 2012 results that are generated this way, but the TES said that in 2006, “about half the participating students were not asked any questions on reading and half were not tested at all on maths, although full rankings were produced for both subjects”.
Kreiner drilled down further into the reading data from that year and found another 40% of students were tested on only half the questions. That’s a large proportion of generated data, Johnston says, and it doesn’t sit well with him. “I don’t like imputing data, I’m an empiricist – I like to measure things.”
That’s not to say there’s anything inherently wrong with the technique, he stresses. But he goes back to Kreiner’s point: “Even if there’s nothing wrong with the principle, the way it’s being executed here is likely to produce distortions in the data and have an effect on the country rankings. I think that Pisa have a case to answer in terms of [the variability in the relative difficulty of questions for students from different countries] and their process for generating the plausible values. And until they do that, I would be taking the rankings with a big grain of salt.”
The frontman for Pisa, Andreas Schleicher, responded to the TES revelations with an op-ed dismissing Kreiner’s work and that of other critics. “The OECD does not see any scientific or academic merit in these papers, and considers the accusations made in the TES, based on these flawed analyses, to be completely unjustified,” Schleicher concluded.
Johnston finds this response “glib and profoundly unconvincing”. “Either they have not read or properly understood Kreiner’s criticisms, or they have ignored them and responded to straw-man criticisms instead … I think they believe that the media in particular will fail to follow the complexities of the argument and conclude that Kreiner and Christensen are just being picky. For what it’s worth, I don’t think they are.”
THE DANES ARE DONE
Kreiner is retired and doing this work for free in his spare time. He tells the Listener he does it because for the past 10 years, Pisa rankings have been the main weapon in the “political battlefield” of his country’s public schools.
“They use Pisa to claim that the Danish public school is not good enough, that the teachers are not good enough, that something has to be done …”
He sounds exhausted by it all. “What I really needed was to have the Ministry of Education listen to me … It is a disappointment that that did not work out.”
Two days after this interview, with the OECD poised to release the 2012 results, Kreiner sends the Listener a remarkably restrained email. “Things have been happening in Denmark over the weekend. The Danish Ministry of Education has actually decided to abandon the country ranks.”
Danish Education Minister Christine Antorini has since told media Denmark’s rank does not matter, and she is more interested in whether the country is below or above average. It is now just above.
The OECD did not directly respond to Listener questions, referring us to documents now available (below), along with the Times Educational Supplement story and Andreas Schleicher’s response.
The full PISA 2012 report.
(The OECD directed the Listener to Annex A when we asked about principals deliberately 'deselecting' students they thought would perform poorly).
Technical reports for 2012, which explain in detail how PISA is carried out, will be added when they are released. Here's the 2009 version.
The July 2013 Times Educational Supplement investigation into PISA, including links to academic papers.
Response to that investigation, written by PISA frontman Andreas Schleicher and published in the Times Educational Supplement.
Q & A with Schleicher, from an interview with Catherine Woulfe on July 10.
Gaming the system
There are some dubious ways to get a better Pisa score.
Professor Joerg Blasius, a cheerful German sociologist based at the University of Bonn, has been investigating what he calls “impression management” in Pisa.
He tells the Listener that although Pisa selects the schools that will take part, “schools can accept or not accept – that’s the first point”. Those who believe they would drag their country’s ranking down are less likely to say yes.
Once a school has accepted, Pisa picks the specific classes of children that will take part – but here, too, he says, principals can “deselect” students they know won’t do well. This is because on the day, Pisa does not need the whole class to take part. The principal “knows which students are good and which are bad”, Blasius says, so “you exclude the bad students”. When the Pisa tester turns up, he or she knows “maybe 25 students are in the class and today there are 22. Perfect. You could [go ahead and do the test].”
What does that do to results? “The school does better, and when there is more than one principal doing so, the Pisa ranking goes up. It’s a very, very easy thing to do, to get a better ranking. Nobody will check it, nobody will see it.”
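The effect Blasius describes is simple averaging. As a rough sketch, take a class of 25 in which three students score well below the rest, and compare the class mean with and without them. The scores and the function name are hypothetical, chosen only to match the 25-versus-22 head count in his example.

```python
from statistics import mean

def mean_after_deselection(scores, n_excluded):
    """Mean score once the n_excluded lowest scorers are kept out of the test."""
    if not 0 <= n_excluded < len(scores):
        raise ValueError("n_excluded must be between 0 and len(scores) - 1")
    kept = sorted(scores)[n_excluded:]
    return mean(kept)

# Hypothetical class of 25 scores, invented for illustration.
class_scores = [380, 400, 420] + [500] * 22

print("All 25 students:", mean(class_scores))
print("Lowest 3 kept out:", mean_after_deselection(class_scores, 3))
```

In this invented class the average rises by 12 points once the three lowest scorers are left out, without any student learning anything more.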
Blasius’s advice to New Zealanders trying to understand our Pisa rankings is: “Don’t take them so seriously. Don’t worry if you go five points down in the rankings. If you go five points up, don’t be happy – this means nothing.”
Blasius recently completed a research paper, expected to be published next year, about the background surveys principals fill out. These surveys are analysed alongside the test results to produce some of Pisa’s deeper analyses. He says it is therefore crucial that the information given is accurate and robust.
Principals are asked to rate their school on all sorts of factors, from resourcing and management style to bullying and drug use among students. Blasius says those who don’t trust Pisa’s guarantee of anonymity – or who simply can’t spare the 90 minutes it takes to fill the form out properly – will “just say everything is very good”.
“Tick, tick, tick. Yes, exactly.”
He stresses that he has analysed only the 2009 surveys. Based on that, he believes 8-10% of principals worldwide were employing some sort of “impression management” – it was much more prevalent in some countries than others. It is not practical for him to go back and examine New Zealand’s data, but he remembers it being “average”. He claims that although “the Scandinavian countries are very good”, he is not so confident about the data from Germany, China, the United States and the United Kingdom.