WARNING. Contains Maths.
Let's see if we can agree on a couple of things.
It is possible for one professional to assess if another professional is meeting the standards of their shared profession for a particular task. Make a judgement, if you will. The professional making the judgement would have to be experienced and preferably have the trust of the person being judged (though this is not a requirement – it is a requirement that anyone using the judgement would have to trust them).
The judgement can take a number of forms. Broadly speaking it can be summative or it can be formative. It can be formal or it can be informal.
I’m also going to state here that the summative judgement has its place. Yes, the best assessment is to provide feedback on how any observed outcome could be improved, but at some point it is also necessary to provide a comparative assessment of capability. Why? One reason is that it would be helpful to know if you had a school full of, say, ‘inadequate’ teachers. That might spur you on to do something other than attempt to make incremental gains with the existing staff. But to argue that issue is not the real purpose of this post.
The question is, how reliable is the individual judgement? What I want to set out here is that whilst the judgement can be useful, the statistical reliability of lesson observation judgements as carried out in schools is low. Such judgements should only be used as one of a number of measures of teacher ‘quality’, with the user having a clear understanding of just how unreliable the judgement probably is.
Let's look at some numbers. I’m going to restrict this to secondary level as the sums are easier, but the general principles apply elsewhere. And I’m going to look at this at the teacher, department and school level.
Assume that a school has a 25 hour per week timetable and 80 full time teachers. The year is 39 weeks; call it 36 to allow for exams, leavers etc. So each teacher (on an average 90% timetable) teaches around 800 lessons per year (25 × 0.9 × 36 = 810; this is at the low end of the scale, as most schools have lessons that are less than 1 hour). Which gives us in total 64,800 lessons each year in the school (80 × 810).
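For anyone who wants to check the sums, here is a minimal sketch of the arithmetic. The variable names are mine; the figures are the assumptions stated above (one-hour lessons, 36 usable weeks, a 90% timetable).

```python
# Timetable assumptions taken from the paragraph above.
hours_per_week = 25      # timetabled one-hour lessons per week
teaching_weeks = 36      # 39-week year, less exams, leavers etc.
timetable_load = 0.90    # an average teacher teaches 90% of the timetable
teachers = 80

lessons_per_teacher = round(hours_per_week * timetable_load * teaching_weeks)
lessons_per_school = lessons_per_teacher * teachers

print(lessons_per_teacher)  # 810, i.e. "around 800"
print(lessons_per_school)   # 64800
```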
How would you describe a ‘good’ teacher? Clearly there are Ofsted criteria we can use, but that’s not going to help very much with the statistics. For the sake of this exercise, I’m going to say that a ‘good’ teacher is one for whom only 1 in every 20 lessons is not ‘good’.
Let's assume you are observing a known ‘good’ teacher. Further assume you make three observations of this individual. The likelihood of seeing three ‘good’ lessons at random is (0.95)³, which equals 0.857 or ~86%. So, you have a known ‘good’ teacher but there is still a 14% (or one in seven) chance that you will see at least one lesson that is not ‘good’. Even if the ‘good’ teacher only has one not ‘good’ lesson every two weeks (about 1 lesson in 45), you would still have a one in fifteen chance of seeing a not ‘good’ lesson in a series of three observations. To reduce this to less than a one in a hundred chance of seeing a not ‘good’ lesson, the ‘good’ teacher cannot deliver a not ‘good’ lesson more than twice in a year.
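The three figures above all come from the same sum, which can be sketched as below. It assumes the three observed lessons are independent and picked at random, and it uses 810 lessons a year and 45 lessons a fortnight (both derived from the timetable figures earlier). The function and constant names are my own.

```python
def p_not_good_seen(p_bad: float, observations: int = 3) -> float:
    """Chance that at least one of `observations` independent, randomly
    chosen lessons is not 'good', given a per-lesson chance `p_bad`."""
    return 1 - (1 - p_bad) ** observations

LESSONS_PER_YEAR = 810       # 25 hours x 90% x 36 weeks
LESSONS_PER_FORTNIGHT = 45   # 22.5 lessons a week x 2

print(round(p_not_good_seen(1 / 20), 3))                     # 0.143, about 1 in 7
print(round(p_not_good_seen(1 / LESSONS_PER_FORTNIGHT), 3))  # 0.065, about 1 in 15
print(round(p_not_good_seen(2 / LESSONS_PER_YEAR), 4))       # 0.0074, under 1 in 100
print(round(p_not_good_seen(3 / LESSONS_PER_YEAR), 4))       # 0.0111, over 1 in 100
```

The last two lines show why the limit is two: a third not ‘good’ lesson in the year pushes the chance back above one in a hundred.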
This is a very, very simplistic statistical model, but I find simple statistics allow us to illustrate simple truths. A ‘good’ teacher has to be very consistently good in order to have a low probability of being assessed as having had a not ‘good’ lesson.
There are of course many other issues at play here. Will a teacher teach better or worse for an observed lesson? In my experience that depends on the person and is usually not connected to their ability as a teacher. Will the observer allow for the different characters they are observing? Well, an internal observation will be able to, an external one won’t (and perhaps shouldn’t). If the first lesson observed is not ‘good’ will that affect the outcome of the second? The simple statistical model can’t measure that, but I have no doubt that it will have an effect. And that effect will be different for different teachers.
What the above tells me is that using a series of three lesson observations can only ever shine a light on the edges. It can show you those teachers who are truly ‘good’ and those that are truly not ‘good’ but it is not useful for looking at the middle, which is where real incremental development can work best.
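To illustrate the point about the edges, here is a small sketch using the same simple model (independent, randomly chosen lessons). The ‘good’ rates in the loop are illustrative values of mine, not data from any school.

```python
def p_all_good(p_good: float, observations: int = 3) -> float:
    """Chance that every one of `observations` independent lessons is
    'good', for a teacher whose true share of 'good' lessons is `p_good`."""
    return p_good ** observations

# Illustrative true rates only; none of these come from real observations.
for p_good in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
    chance = p_all_good(p_good)
    print(f"true rate {p_good:.2f} -> three out of three with probability {chance:.3f}")
```

A teacher whose lessons are ‘good’ 80% of the time still has roughly an even chance (0.8 cubed is 0.512) of showing three ‘good’ lessons out of three, so the result says little about anyone in the middle.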
There are statistical methods for working out how many observations we need to have a statistically sound judgement. Using these methods we can calculate that to be 95% confident we have the right judgement (in the ‘good’- not ‘good’ scenario above) we would need in excess of 60 observations. So to make an accurate judgement on an individual we need (proportionately) a very high number of observations. Segmenting the judgement (into, say, four categories) will increase the number of observations required unless you are predetermining the judgement you are going to make. For example, you may only want to decide if a teacher is ‘good’ or ‘outstanding’ if you are sure they are one or the other.
When it comes to larger numbers of lessons the results are different. Statistical measures of any complexity are very rarely linear. That is, if we double the population size we don’t need to double the sample size. So consider a subject department of, say, 5 teachers. This is 4000 lessons per annum. To make the same judgement as above we would need around 70-75 observations. So not many more than for one teacher. But still way too many per teacher.
It is only when we get to the school level that this becomes possible. If we use the figure of 64,800 lessons, and we are looking for a ‘good’ – not ‘good’ judgement then we would need only a few more observations than for the department. Which is a realistic proposition over a school year. This is still way more than Ofsted would do, but they are looking for specific issues highlighted by their statistical analysis, so this simple overview does nothing to challenge the validity of their use of observations at the school level. I would still suggest that at the department level, and certainly at the individual teacher level, Ofsted judgements are not completely reliable.
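I haven’t said which method produces these observation counts, so treat the following as a sketch under assumptions. One standard approach that gives numbers in line with those quoted is Cochran’s sample size formula for a proportion, with a finite population correction. I am assuming 95% confidence (z = 1.96), an expected ‘good’ share of 0.95, and a margin of error of plus or minus 5 percentage points; the margin in particular is an assumption, chosen because it lands close to the figures in the text.

```python
import math

def observations_needed(population: int, p: float = 0.95,
                        margin: float = 0.05, z: float = 1.96) -> int:
    """Cochran's sample size for estimating a proportion, with the finite
    population correction. The default `margin` (plus or minus 5 points)
    is an assumption, not a figure stated in the post."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2   # size for an unlimited population
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# Lessons per year at each level, from the timetable figures earlier.
for label, lessons in [("teacher", 810), ("department", 4000), ("school", 64_800)]:
    print(label, observations_needed(lessons))
# teacher 68
# department 72
# school 73
```

The count barely moves between one teacher and the whole school, which is the non-linearity described above: the sample needed depends far more on the precision you want than on the number of lessons you are sampling from.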
So, in summary, if you are judging the capability of a teacher solely on the outcome of three lesson observations then you are probably committing a crime against statistics. If you are using the three lesson observations you do on each teacher to make a summative assessment of the overall teaching ‘level’ in the school then you are on safe ground. You can’t do the same for small departments. Probably only in Science, English and Maths can the lesson observations (on their own) provide a sound(-ish) judgement.
A question I have is where did the three observations per year figure come from? I can only assume that someone worked out the number of observations required to give a detailed judgement about whole school teaching level and then divided by the number of teachers in a school.
Which is another reason why we need to teach more statistics!