Testing times

Concerns continue to be raised about GCSE gradings this summer. As well as the continuing and well-rehearsed fiasco over the English results, we now have issues being raised with the grading of Science modules. With English, the unfairness grows following the regrading of GCSEs for students in Welsh schools but not for those in England.

The question I have been asked a number of times in the past couple of weeks (by people whose only connection with education is through their children being at school) is just how hard can it be to write two exam papers on the same narrow range of topics that have a similar level of difficulty. Which, when you think about it, is a very good question.

Generally speaking, the task is not that difficult. Teachers do it on a daily basis for their own classes. Clearly a test designed for a single class is easier to write than one designed for national consumption. A number of factors come into play here.

Firstly, the wider the range of possible content, the more difficult the task becomes. This is simply due to the inability to fit all the topics in if the range is too wide. So do you leave the same topics out of both papers, or do you mix and match? If you take the second approach (to help avoid any teaching to the test), then keeping the difficulty levels together is harder.

Secondly, we have to look at the specific subject. Recreating papers of similar difficulty becomes harder the more subjective the content becomes. So it is easier for Maths, harder for some aspects of Science, and harder still for English. Theoretically, in a criterion-referenced system it shouldn’t be, but the criteria can never be drawn finely enough to enable this.

The really big problem is with the students who are going to take the exam. Simply put, if you gave the same paper to the same groups of students on two different occasions, you would get two different sets of results. Teachers are well aware of this phenomenon. Students have off days. Some of them perform out of their socks on the day for no apparent reason. Some of them get up with a headache. Obviously this is more noticeable in a single classroom than it is when 600,000 students are taking the same exam. This is why teachers don’t like to rely on a single test result to allocate students to particular sets, for example. But when it’s 600,000 students, the law of large numbers kicks in. Across such a cohort the ups and downs tend to even out. Exam boards can ignore the effect of any ups and downs for individual students and look at the overall effect. This is why changes in grade boundaries tend to be quite small, because the cohort effect averages out the individual student effects.

The problems the exam boards had arose in modules that not all of the cohort took at the same time. The smaller the cohort, the less likely it is that individual student effects will work themselves out, making it difficult to moderate grades between two separate papers sat by distinct subgroups of the whole student population. Consequently, they have had to implement large grade boundary changes. Whilst this is, strictly speaking, permitted by the structure of the system, it is not normal practice, nor can it be considered to be good practice.
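For anyone who wants to see the cohort effect in numbers, here is a rough simulation sketch in Python. Everything in it is an assumption made for illustration, not real exam data: each student has the same underlying mark of 50, their mark on the day is that figure plus random "good day, bad day" noise with a standard deviation of 10, and the function name `cohort_mean_spread` is my own. It sits each cohort through the same paper many times and measures how much the cohort's average mark wobbles from one sitting to the next.

```python
import random
import statistics


def cohort_mean_spread(cohort_size, sittings=200, true_mark=50.0,
                       day_noise=10.0, seed=0):
    """Spread (standard deviation) of a cohort's average mark across
    repeated sittings of the same paper.

    All parameters are illustrative assumptions, not real exam figures.
    """
    rng = random.Random(seed)
    averages = []
    for _ in range(sittings):
        # Each student's mark on the day is their true mark plus random noise.
        marks = [rng.gauss(true_mark, day_noise) for _ in range(cohort_size)]
        averages.append(statistics.fmean(marks))
    return statistics.stdev(averages)


if __name__ == "__main__":
    # A single class, a modest module entry, and a larger entry.
    for size in (30, 500, 5000):
        print(size, round(cohort_mean_spread(size), 3))
```

Under these assumptions the wobble in the average shrinks roughly in proportion to one over the square root of the cohort size, which is the law of large numbers at work. A single class can swing by a couple of marks between sittings; a large entry barely moves. The smaller the group sitting a module, the more of any difference between two papers could simply be noise.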

The modular system consequently has two contradictory effects. On the one hand it is fairer for the individual student as it allows their individual ups and downs to be evened out across a range of modules*. On the other hand, there is the potential for the small cohort effects to adversely impact whole groups of students in a way that they have no control over, leading, in extremis, to the unfairness we have seen with the English grades this summer.

So, the short answer is that it isn’t that hard to create papers of similar difficulty, but the fact that it is human beings sitting them makes ensuring that the real-life outcomes are comparable a difficult task.

And now we are heading towards a single end-of-course exam. This was certainly the implication of the Secretary of State’s words in Parliament the other day.

We want to remove controlled assessment and coursework from core subjects. These assessment methods have – in all too many cases – corrupted the fair testing of students. We want to ensure that children are tested transparently on what they – and they alone – can do at the end of years of deep learning. Where individual practical work needs to be assessed in specific subjects, we will be flexible.

It is also strongly implied in the consultation document published soon after.

Our preferred approach is to remove internal assessment from all six English Baccalaureate subjects. Existing qualifications in all six subjects, offered in England and overseas, already offer opportunities for 100% external assessment – for example, in modern foreign languages, through externally assessed speaking exams, which are conducted and recorded by the teacher and marked by an external examiner.

Wherever we end up, it will be nearer to wholly examined end-of-course assessment than we are now. Reason will probably out and we won’t have 3-hour exams; more likely there will be two or three exams, with a mixture of exam types (multiple choice, short answers, long answers) similar to those in existence for O Levels. There will also be elements for specific subjects assessed through means other than a written exam, though these are likely to be externally marked (with considerable impact on the cost of exams).

Whatever the exact nature of what we end up with, it is disingenuous for the SoS to suggest that the current system unfairly advantages some students whilst a changed system wouldn’t. At the individual student level, different types of assessment will always advantage some and disadvantage others. What the revised system will do is make it easier for Ofqual and the exam boards to do their job of ensuring that the assessment standard is maintained over time. This will, however, be at some cost to ensuring that students’ capabilities are assessed by the best measures possible. It will unfairly benefit those students who perform best in one-off exams, those who can cram large amounts of facts and regurgitate them, and those whose backgrounds provide them with facilities for quiet and considered revision, at the expense of those whose backgrounds don’t.

I have no doubt that these costs outweigh the benefits.


*There is some evidence that across a cohort students perform better in a linear system rather than a modular one, but again this ignores the impact on individual students who would perform better with modules.