HadCrut4 reported last night, so I decided to fish out my script to test whether the trends since 2001 are consistent with individual models, using the spread of ‘weather noise’ estimated from repeat runs of each model. My script adds, to the spread due to “weather noise”, estimates of “measurement” noise based on claims from the individual agencies (and a bit of other stuff I won’t discuss in detail for the time being). But basically, using this method, if a horizontal line indicating the trend from an observation lies below the short, similarly colored lower dashed line just outside the blue “spider” bands, the observation is inconsistent with that model’s ensemble of projections even accounting for the measurement noise the observational groups estimate for their measurements.
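For readers who like to see the bones of a calculation, here is a minimal sketch of that per-model check. It is not my actual script; in particular, the quadrature combination of the two noise terms and the two-sided critical value are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def trend_inconsistent_low(obs_trend, model_run_trends, obs_meas_sigma, ci=0.95):
    """Toy version of the per-model check: is the observed trend below the lower
    end of the model's spread once 'weather noise' (spread across repeat runs)
    and the observational group's measurement noise are combined?"""
    weather_sigma = np.std(model_run_trends, ddof=1)        # 'weather noise' from repeat runs
    total_sigma = np.hypot(weather_sigma, obs_meas_sigma)   # combined in quadrature (assumption)
    z = norm.ppf(0.5 + ci / 2.0)                            # two-sided critical value, ~1.96 for 95%
    lower_bound = np.mean(model_run_trends) - z * total_sigma
    return obs_trend < lower_bound                          # True => inconsistent at this level
```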
So far, I’ve only coded cases from the A1B SRES of the AR4. We can see below that the trends based on observations from NCDC and HadCrut are inconsistent with 5 out of 11 of these models:

The test of the multi-model mean is provided at the far right. The one I consider most important is the 2nd from right: that is the test of the multi-model mean computed over all 22 models (i.e. the one communicated by the modelers in the AR4). The 2nd-from-right case estimates “weather noise” as the average “weather noise” in each model. According to that test, we would reject the multi-model mean if compared to 1 of the observational groups but not the other two. (Clearly, one would have to examine the 3rd decimal point to determine whether it is NOAA/NCDC or HadCrut4 that triggers the reject. It’s that close!)
The uncertainty intervals around the case shown furthest to the right are computed making the egregious error of treating the spread in the model-mean projections as “weather”. That is: it uses the method in Easterling and Wehner (2009), which, though apparently embraced by those who wish to defend models– likely because it expands the uncertainty intervals and thereby “saves” them– is absolute bunk.
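To make the distinction concrete, here is a toy contrast between the two spreads. The made-up numbers and the exact averaging choice for the “2nd from right” case are assumptions, not a description of my actual code.

```python
import numpy as np

# Placeholder data: 22 models, 5 repeat-run trends each (C/decade); values are made up.
rng = np.random.default_rng(0)
per_model_run_trends = [rng.normal(0.2, 0.15, size=5) for _ in range(22)]

# '2nd from right' style: 'weather noise' taken as the average within-model spread
# across repeat runs of each model.
sigma_weather = np.sqrt(np.mean([np.var(runs, ddof=1) for runs in per_model_run_trends]))

# Rightmost case (Easterling and Wehner style): the spread ACROSS the per-model mean
# projections (structural model-to-model disagreement) treated as if it were 'weather'.
sigma_across_model_means = np.std([np.mean(runs) for runs in per_model_run_trends], ddof=1)
```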
For those wondering: what about the rejections lucia shows in other posts? Well, those tests use uncertainty estimates based on the observations. When we estimate uncertainty intervals based on the noise in the observations rather than from the model variability, the uncertainty intervals tend to be smaller than in most (though not all) models. So, we reject the model mean in those cases. It’s a different method.
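For anyone who wants a rough picture of what “uncertainty based on the noise in the observations” can look like, here is a minimal sketch using an OLS trend with an AR(1) correction to the standard error. This is only illustrative; it is not necessarily the method used in those other posts.

```python
import numpy as np

def trend_with_ar1_interval(monthly_anoms, z=1.96):
    """Illustrative: OLS trend on monthly anomalies (C/decade) with the standard
    error inflated for lag-1 autocorrelation of the residuals."""
    y = np.asarray(monthly_anoms, dtype=float)
    n = len(y)
    t = np.arange(n) / 120.0                          # time in decades (120 months/decade)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    rho = np.corrcoef(resid[:-1], resid[1:])[0, 1]    # lag-1 autocorrelation of residuals
    n_eff = n * (1 - rho) / (1 + rho)                 # effective sample size under AR(1)
    se = np.sqrt(np.sum(resid**2) / (n - 2) / np.sum((t - t.mean())**2)) * np.sqrt(n / n_eff)
    return slope, (slope - z * se, slope + z * se)
```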
I would show the trends since 2000, which are a bit more favorable to the models. However, since overall we aren’t rejecting the multi-model mean for trends beginning in 2001, I don’t think it’s necessary today. I’m going to budget my time to finally add in the A2 models and also try to fish up the data from the upcoming AR5. Meanwhile, we’ll be watching what happens to the comparison using the model spread over the course of the year. If El Nino kicks in, the model mean will switch to failing to reject for all three cases; if La Nada, or more surprisingly, La Nina kicks in, we’ll start to see the model mean move into rejection territory (where it has been in the past).
Lucia,
How sure are you that the model anomaly projections (and the multi-model means) are calculated using the baseline of 1980 to 1999?
My own calculation of the anomalies, using temperature projections from actual IPCC data files, suggests that the official anomalies and means are actually calculated using the baseline of 1980 to 2000.
“My script adds bads the spread”
Huh?
Ray
I go by the text in the report, which says they use Jan 1980-Dec 1999.
‘is absolute bunk’
I say, steady on old girl. Anyone would think you smell a rat in the question Easterling and Wehner ask.
Roger Pielke, Jr. analyzed it thus:
Imagine that you are playing a game of poker, in which you are dealt 5 cards. You’ve never played poker before so you don’t know the odds for a particular hand. You look at your 5 cards and see that you’ve been dealt two pairs. You then ask your companion, a poker expert, whether or not your hand is “likely” so that you can evaluate it rigorously.
Which of the two responses that follow would you consider to be a more straightforward response to your question?
Response #1 You can see over many hands that being dealt two pairs can and likely will occur. In fact, Joe had two pairs in a hand dealt 20 minutes ago and Tim had one 10 minutes before that. And if you simulate a game of poker you’ll find a hand dealt 25 minutes from now has 2 pairs and one 7 minutes later also has two pairs. So in conclusion, both observations and simulations show that two pairs can and are even likely to occur. Your hand, therefore, is utterly normal and entirely possible.
Response #2 The odds of you being dealt two pair in any given hand is about 1 in 21, so it is a statistically rare event.
http://thebreakthrough.org/archive/spinning_probabilities_in_grl
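As an aside, the “1 in 21” in Response #2 is easy to check with straight combinatorics (this is just standard poker arithmetic, nothing from the linked post):

```python
from math import comb

two_pair_hands = comb(13, 2) * comb(4, 2)**2 * 11 * 4  # two ranks for the pairs, suits for each pair,
                                                       # then one of the 11 remaining ranks and 4 suits
all_hands = comb(52, 5)
print(two_pair_hands / all_hands)                      # ~0.0475, i.e. roughly 1 in 21
```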
Lucia:
“I go by the text in the report, which says they use Jan 1980-Dec 1999.”
That may be what they say, but the only way I can get the calculated multi-model means to match the “official” ones is to use 1980 to 2000. Using 1980 to 1999 produces a slightly higher multi-model mean anomaly than published, which of course means that actual observations are slightly lower relative to the MMM.
The difference is about 0.012 °C.
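To make the baseline question concrete, here is a tiny sketch of the comparison Ray describes; the function and array names are placeholders, not IPCC data, and “Jan 1980 to Dec 1999” is 240 months while “1980 to 2000” inclusive would be 252.

```python
import numpy as np

def anomalies(monthly_temps, first_year, base_start, base_end):
    """Anomalies relative to the mean over Jan(base_start)..Dec(base_end), inclusive.
    Assumes monthly_temps begins in January of first_year."""
    temps = np.asarray(monthly_temps, dtype=float)
    i0 = (base_start - first_year) * 12
    i1 = (base_end - first_year + 1) * 12
    return temps - temps[i0:i1].mean()

# The shift Ray reports is just the difference of the two baseline means, e.g.:
# anomalies(x, 1900, 1980, 1999) vs anomalies(x, 1900, 1980, 2000)
```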
Lucia –
There is a small bug in the script which produced the chart above. Rejecting 5 out of 11 models doesn’t appear to me to be either 5.6% or 6%.
HaroldW–
It’s not a bug. Based on what you say, I don’t know if you think it’s too high or too low. If you tell me your method of guessing what it should be, maybe we can discuss that.
Ray–
I’m going with the text, which gives a 20-year (240-month) period.
Lucia –
Apologies for being unclear. I wasn’t complaining about your method, which determined that, for example, 5 of 11 models were inconsistent with Hadcrut4 observations. I was just picking a nit with the text in the graphic’s legend, which says “HadCrut4: 145 months, -0.02C/dec reject 5/11 (5.6%).” I interpreted the “5.6%” to be a restatement of the 5/11 as a percentage, in which case it is not accurate. Perhaps it has a different meaning?
HaroldW–
No. The number is calculated using a long Monte Carlo simulation to find the rate at which we would get 5/11 rejections at 95% confidence if the ‘noise’ were as large as shown– and making some estimates for the amount of noise in the weather. There are different ‘ways’ someone might imagine it should be calculated– and some will make the mistake of thinking you can use the binomial distribution for 5 independent rejections out of 11. If you did that, you’d get a really, really low probability. If, on the other hand, you thought you could do a simple Bonferroni correction to find the ‘p’ threshold for the individual tests so that 1 rejection out of 11 corresponds to ±95%, you end up with a method that has almost no power.
So, instead, I do the ±95%– a level we often look at anyway– and then report how likely we are to get the number of rejections we are seeing, using a somewhat computationally intensive method.
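For anyone who wants a feel for the kind of calculation that produces the “5.6%”, here is a rough sketch. To be clear, this is not my actual script: the null used here (true trend equal to the average model trend) and the form of the weather-noise draw are assumptions for illustration. The point it does capture is that every simulation compares one shared ‘observed’ trend against all the models, which is why the per-model tests are correlated and why treating the rejections as independent binomial trials would be misleading.

```python
import numpy as np

def rate_of_at_least_k_rejections(model_means, model_weather_sigmas, obs_meas_sigma,
                                  k, n_sims=100_000, z=1.96, seed=0):
    """Illustrative Monte Carlo: under the null that the models are right, draw one
    shared 'observed' trend per simulation (weather noise plus measurement noise),
    test it against each model's +/-95% interval, and count how often at least k
    of the per-model tests reject on the low side."""
    rng = np.random.default_rng(seed)
    means = np.asarray(model_means, dtype=float)
    sigmas = np.asarray(model_weather_sigmas, dtype=float)
    lower_offsets = z * np.hypot(sigmas, obs_meas_sigma)   # half-width of each model's interval
    hits = 0
    for _ in range(n_sims):
        # One shared realization: an earth-like weather wiggle plus measurement noise.
        obs = means.mean() + rng.normal(0.0, sigmas.mean()) + rng.normal(0.0, obs_meas_sigma)
        n_reject = int(np.sum(obs < means - lower_offsets))
        if n_reject >= k:
            hits += 1
    return hits / n_sims
```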
Lucia (#110879) –
Ah, ok, thanks for that explanation. My mistake.
HaroldW– I explained it a little way back in… oh… September? Right now it’s not very important since we are above 5%. There are some specific individuals who will want to know details but only once it’s saying <5%.