**WARNING. Contains Maths.**

Let’s see if we can agree on a couple of things.

It is possible for one professional to assess if another professional is meeting the standards of their shared profession for a particular task. Make a judgement, if you will. The professional making the judgement would have to be experienced and preferably have the trust of the person being judged (though this is not a requirement – it is a requirement that anyone using the judgement would have to trust them).

The judgement can take a number of forms. Broadly speaking it can be summative or it can be formative. It can be formal or it can be informal.

I’m also going to state here that the summative judgement has its place. Yes, the best assessment is to provide feedback on how any observed outcome could be improved, but at some point it is also necessary to provide a comparative assessment of capability. Why? One reason is that it would be helpful to know if you had a school full of, say, ‘inadequate’ teachers. That might spur you on to do something other than attempt to make incremental gains with the existing staff. But to argue that issue is not the real purpose of this post.

The question is, how reliable is the individual judgement? What I want to set out here is that whilst the judgement can be useful, the statistical reliability of lesson observation judgements as carried out in schools is low. They should only be used as one of a number of measures of teacher ‘quality’, with the user having a clear understanding of just how unreliable the judgement probably is.

Let’s look at some numbers. I’m going to restrict this to secondary level as the sums are easier, but the general principles apply elsewhere. And I’m going to look at this at the teacher, department and school level.

Assume that a school has a 25 hour per week timetable and 80 full-time teachers. Take the 39 week year and call it 36 to allow for exams, leavers etc. So each teacher (on an average 90% timetable) teaches around 800 lessons per year: 25 × 0.9 × 36 = 810, to be exact (this is at the low end of the scale as most schools have lessons that are less than 1 hour). Which gives us in total 64,800 lessons each year in the school.
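For anyone who wants to check the sums, here is a minimal sketch of the arithmetic. The figures are the ones assumed in the paragraph above; the variable names are just illustrative.

```python
# Timetable assumptions taken from the post (one lesson per timetabled hour).
HOURS_PER_WEEK = 25      # timetabled hours in the school week
TIMETABLE_LOAD = 0.9     # average teacher is on a 90% timetable
TEACHING_WEEKS = 36      # 39 week year, less exams, leavers etc.
TEACHERS = 80            # full-time teachers in the school

lessons_per_teacher = HOURS_PER_WEEK * TIMETABLE_LOAD * TEACHING_WEEKS
lessons_per_school = lessons_per_teacher * TEACHERS

print(round(lessons_per_teacher))  # lessons taught by one teacher in a year
print(round(lessons_per_school))   # lessons taught across the whole school
```

The per-teacher figure comes out at 810, which the rest of the post rounds to 800; the whole-school total of 64,800 uses the unrounded 810.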

How would you describe a ‘good’ teacher? Clearly there are Ofsted criteria we can use, but that’s not going to help very much with the statistics. For the sake of this exercise, I’m going to say that a ‘good’ teacher is one who teaches a lesson that is not ‘good’ only 1 in every 20 lessons.

Let’s assume you are observing a known ‘good’ teacher. Further assume you make three observations of this individual. The likelihood of seeing three ‘good’ lessons at random is 0.95³, which equals 0.857 or ~86%. So, you have a known ‘good’ teacher but there is still a 14% (or one in seven) chance that you will see at least one lesson that is not good. Even if the ‘good’ teacher only has one not ‘good’ lesson every two weeks you would still have a one in fifteen chance of seeing a not ‘good’ lesson in a series of three observations. To reduce this to less than a one in a hundred chance, the ‘good’ teacher cannot deliver a not ‘good’ lesson more than twice in a year.
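The three figures in that paragraph can be reproduced with a few lines of Python. This is only a sketch of the simple model: it assumes each observation is an independent random draw from the teacher’s lessons, and it takes a fortnight to be 45 lessons (22.5 a week) and a year to be 800, as in the timetable sums earlier. The function name is my own.

```python
def p_at_least_one_not_good(p_not_good, observations=3):
    """Chance that at least one of the observed lessons is not 'good',
    assuming every observation is an independent random draw."""
    return 1 - (1 - p_not_good) ** observations

LESSONS_PER_FORTNIGHT = 45   # assumption: 25 hours x 90% timetable x 2 weeks
LESSONS_PER_YEAR = 800       # rounded per-teacher figure used in the post

# One not 'good' lesson in every 20: roughly a one in seven chance.
print(round(p_at_least_one_not_good(1 / 20), 3))

# One not 'good' lesson per fortnight: roughly a one in fifteen chance.
print(round(p_at_least_one_not_good(1 / LESSONS_PER_FORTNIGHT), 3))

# How many not 'good' lessons a year keep the chance under one in a hundred?
for slips in (2, 3):
    chance = p_at_least_one_not_good(slips / LESSONS_PER_YEAR)
    print(slips, round(chance, 4), chance < 0.01)
```

Two not ‘good’ lessons a year keeps the chance just under one in a hundred; a third pushes it over.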

This is a very, very simplistic statistical model, but I find simple statistics allow us to illustrate simple truths. A ‘good’ teacher has to be consistently good, very consistently so, in order to have a low probability of being assessed as having had a not ‘good’ lesson.

There are of course many other issues at play here. Will a teacher teach better or worse for an observed lesson? In my experience that depends on the person and is usually not connected to their ability as a teacher. Will the observer allow for the different characters they are observing? Well, an internal observation will be able to, an external one won’t (and perhaps shouldn’t). If the first lesson observed is not ‘good’ will that affect the outcome of the second? The simple statistical model can’t measure that, but I have no doubt that it will have an effect. And that effect will be different for different teachers.

What the above tells me is that using a series of three lesson observations can only ever shine a light on the edges. It can show you those teachers who are truly ‘good’ and those that are truly not ‘good’ but it is not useful for looking at the middle, which is where real incremental development can work best.

There are statistical methods for working out how many observations we need to have a statistically sound judgement. Using these methods we can calculate that to be 95% confident we have the right judgement (in the ‘good’- not ‘good’ scenario above) we would need in excess of 60 observations. So to make an accurate judgement on an individual we need (proportionately) a very high number of observations. Segmenting the judgement (into, say, four categories) will increase the number of observations required unless you are predetermining the judgement you are going to make. For example, you may only want to decide if a teacher is ‘good’ or ‘outstanding’ if you are sure they are one or the other.

When it comes to larger numbers of lessons the results are different. Statistical measures of any complexity are very rarely linear. That is, if we double the population size we don’t need to double the sample size. So consider a subject department of, say, 5 teachers. This is 4000 lessons per annum. To make the same judgement as above we would need around 70-75 observations. So not many more than for one teacher. But still way too many per teacher.

It is only when we get to the school level that this becomes possible. If we use the figure of 64,800 lessons, and we are looking for a ‘good’ – not ‘good’ judgement then we would need only a few more observations than for the department. Which is a realistic proposition over a school year. This is still way more than Ofsted would do, but they are looking for specific issues highlighted by their statistical analysis, so this simple overview does nothing to challenge the validity of their use of observations at the school level. I would still suggest that at the department level, and certainly at the individual teacher level, Ofsted judgements are not completely reliable.
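The post doesn’t name the method behind these sample sizes, so treat the following as a sketch rather than the calculation actually used. It assumes the standard sample-size formula for a proportion (Cochran’s formula with a finite population correction), a 95% confidence level (z = 1.96), a ±5% margin of error and an expected ‘good’ rate of 95%. Those parameter choices are my assumptions, picked because they give figures in line with the ones quoted.

```python
import math

def required_observations(population, p=0.95, margin=0.05, z=1.96):
    """Sample size for estimating a proportion in a finite population.

    population: total number of lessons being judged
    p:          expected proportion of 'good' lessons (assumed)
    margin:     acceptable margin of error (assumed, +/- 5%)
    z:          z-score for the confidence level (1.96 for 95%)
    """
    # Cochran's formula for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Finite population correction: small populations need slightly fewer.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

levels = [("teacher", 800), ("department", 4000), ("school", 64800)]
for label, lessons in levels:
    print(label, lessons, required_observations(lessons))
```

On these assumptions the answer is 67 observations for a single teacher, 72 for a five-teacher department and 73 for the whole school. The required sample barely moves as the population grows eighty-fold, which is why three observations of each of 80 teachers (240 in all) is plenty at the school level but nowhere near enough for any one teacher.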

So, in summary, if you are judging the capability of a teacher solely on the outcome of three lesson observations then you are probably committing a crime against statistics. If you are using the three lesson observations you do on each teacher to make a summative assessment of the overall teaching ‘level’ in the school then you are on safe ground. You can’t do the same for small departments. Probably only in Science, English and Maths can the lesson observations (on their own) provide a sound(-ish) judgement.

A question I have is where did the three observations per year figure come from? I can only assume that someone worked out the number of observations required to give a detailed judgement about whole school teaching level and then divided by the number of teachers in a school.

Which is another reason why we need to teach more statistics!

Important points made here. But should SLTs be making judgements about quality of teaching based solely on 3 lesson observations? Inspections look for wider evidence, including progress over time, student perceptions.

I would say that most SLTs would do that. The pressure will come now however from having to make a judgement each year on pay progression, and part of that judgement will inevitably involve the use of observations.

Hi Mike, I don’t argue with your statistics, or your motivations, merely with your argument’s assumptions. Making a binary assumption about ‘good’ and ‘not good’ teachers (and then compounding that by jumping into ‘good’ and ‘not good’ lessons – naughty) allows you the numbers. But what about ‘better’ and ‘worse’? If most lessons fall within a ‘good’ range, say the top three quartiles, then three inspections should see you through. A ‘not good’ teacher would have to fall in the bottom quartile at least twice, and, if they were ‘not good’, they would only be missed if they were extremely lucky.

Nevertheless, all measurements of performance should be based on a range of indicators, which should include both snapshots and long-term indicators, but most importantly conversations. If you know your job, and you chat to someone who does that job, you are far more likely to assess their capacity through a conversation than by an inquisition or by box-ticking. (Not sure what the statistical probability is though) 😉

‘Good’ and not ‘good’ are binary, as indeed are ‘better’ and not ‘better’, so I’m not sure about that as a problem. You say a not ‘good’ teacher would have to fall into the bottom of the range twice. That’s only true if the judgement is made on the basis of a majority decision. It is undoubtedly the case that an observer’s judgement is influenced by past outcomes, so if the one not ‘good’ judgement is the first observation, that will impact on those carried out subsequently.

Is the implication of your suggestion about quartiles that lesson obs are moderated against each other?

The point of making those (binary) assumptions was to simplify the problem. The truth is that if you expand the problem to the Ofsted four grade judgement then three observations is even less reliable. Prior data will be able to give some guidance to the observer as to what they might expect to see, but whilst that does allow a narrowing of the focus for the judgement it does also risk inserting a degree of bias.

The only real point of the post is to highlight the unreliability of small numbers of observations.

I think we still disagree. The relevant binary of ‘better’ is ‘worse’ (not the negation of ‘better’), and that implies a point along a continuum between ‘best’ and ‘worst’, so a performance should be judged by where it falls on that continuum. What I’m trying to say is the math doesn’t work if you don’t accept that 1 in 20 are ‘not good’. You would expect a better practitioner to have a lower incidence of poorer lessons than a worse one, so a sample should be a good guide, although in all such things a larger sample is better than a smaller one (greater confidence level). I’m not implying that lesson obs should be moderated against each other, but against a reasonable standard. I would think most teachers should be able to hold an objective standard of ideal vs unacceptably poor lessons, and judge where on that continuum any single lesson ranks, but I speak outside of my knowledge here.

I suppose the answer here directly concerns confidence levels, as that is what sample size affects. So is it possible to calculate (even roughly) the number of observations required to provide a +90% confidence level?

The specific 1 in 20 is there to provide some simplification in the maths, not as a statement of how often a good teacher might deliver a lesson that isn’t good. As you say, the way to deal with this is through confidence levels which is why I was using the 1 in 20 to let me edge towards that.

Theoretically observations are judged against a standard, but it is a subjective standard interpreted by a range of individuals. Moderation would clearly help but it is not always built into the process.

The numbers of observations required for 95% CIs are those numbers quoted in the second half of the post. Usually a school will do 3 obs per teacher, which at the school level is sufficient to make a judgement. The issue arises when using that info for just the one teacher. The post is attempting to look at the likelihood of ‘mis-diagnosis’. For a ‘Good’ teacher to be mis-diagnosed as not-‘good’ is damaging so the risk of it occurring needs to be lower than a 3 obs process will allow.

I continue to disagree about better / worse being binary. There is a bit in the middle: ‘no change’. This is why I use ‘better’ and not ‘better’ as the pair.

Thanks, that penultimate point is important and I missed it. What works at school (aggregate) level is insufficient at the individual level. True, generalisable and often overlooked.

I will however maintain my opposition to artificial dichotomies 🙂