Yesterday I said I’d update the post comparing observed trends to the distribution of trends in AR4 models driven by the A1B SRES, showing the comparison when trends are computed ending in April 2013, the most recent month for which we have data. I also mentioned I’d add some information on the statistical power of the test. Below is a combined graphic that shows both:

Upper Graphs: Observed Trends Compared to Model Projections
In the upper graph, we can see how observed trends compare to the spread of trends around the A1B multi-model mean, with the spread of trends computed from runs of models having more than one run in the forecast period. The heavy black trace is the mean trend for the multi-model ensemble; the blue lines indicate the ±2σ spread of trends one would expect from a system (e.g. the earth, a single model, etc.) with a distribution of trends equal to that of an “average” model. (I described the method yesterday here.) Roughly, if the earth’s weather variability is equal to that expected of an “average” model and the earth’s trend is equal to the multi-model mean under the A1B SRES, we would expect 95% of observed trends to fall inside the range bounded by the blue lines.
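For concreteness, here is a minimal sketch of that sort of bound calculation. The pooling scheme and all the numbers are placeholders of mine, not the exact procedure or values from yesterday’s post:

```python
import numpy as np

# Sketch only: estimate the "weather" spread of trends from models with more
# than one run, then form multi-model mean +/- 2 sigma bounds. The pooling
# scheme and all numbers below are illustrative placeholders.

def pooled_weather_sigma(trends_by_model):
    """Pooled within-model standard deviation of trends (C/decade)."""
    deviations = []
    for runs in trends_by_model.values():
        runs = np.asarray(runs, dtype=float)
        deviations.extend(runs - runs.mean())   # each run's departure from its own model mean
    # lose one degree of freedom per model mean estimated
    return float(np.sqrt(np.sum(np.square(deviations)) / (len(deviations) - len(trends_by_model))))

# Placeholder trends (C/decade) for two hypothetical multi-run models:
example_runs = {"model_a": [0.18, 0.25, 0.21], "model_b": [0.15, 0.22]}
sigma = pooled_weather_sigma(example_runs)
multi_model_mean = 0.20                          # placeholder multi-model mean trend
lower, upper = multi_model_mean - 2 * sigma, multi_model_mean + 2 * sigma
print(f"~95% range for an observed trend: {lower:.2f} to {upper:.2f} C/decade")
```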
The observed trends for each year are shown with open circles.
On the graph I’ve marked a few key points: A vertical grey trace indicates the boundary between trends whose start point falls in the “hindcast” period (i.e. 2000 or earlier) and those that do not. For points to the left of this trace, at least some of the temperature data used to compute the trends was known before the models were developed and run and before the SRES were frozen. Also: prior to this point, modeling groups could use their own estimates of emissions paths. Consequently some models included the solar cycle; some did not. Some models included volcanic aerosols; some did not. This can make comparison of trends starting in the hindcast problematic. Nevertheless, the data are shown for comparison purposes.
Focusing our attention on trends whose start date falls after the hindcast ends, we see the observed 12.3 year trends are outside and below the range consistent with models (i.e. below the solid blue line). However, the dashed lines indicate my conservative estimate of the extra width one might require to account for measurement error. Accounting for this measurement uncertainty and squinting my eyes, it appears the HadCrut4 and NOAA/NCDC 12.3 year trends are outside the range consistent with models, while GISTemp lies inside.
The 11.3 year trends are ambiguous; squinting, I can’t tell. Meanwhile, we cannot reject the null of “the multi-model mean is correct” based on 10.3 or 9.3 year trends.
So what do we conclude?
First: the main advantage of the graph above is that it helps avoid the temptation to cherry pick. The graph shows results of tests over a range of start years and helps prevent us from crowing about results from one particular start year.
However, there is also some danger that people who wish to ignore any “rejects” of the multi-model mean will do so by pointing to short term trends. For example: we are seeing rejects for 12.3 year trends, but we clearly do not reject for 10.3 or 9.3 year trends. Does this mean that we should ignore the rejection at 12.3 years?
The answer is: probably not. To understand why, let us look at the lower graph above. I’ll reproduce it here:
Using the standard deviation of trends used to create the spread in the top graph, I estimated the power of this method to reject “H0: the multi-model mean is right” in cases where that hypothesis is wrong. To do so, I assumed the distribution of trends due to weather is Gaussian, found the lower cut-off (i.e. the one shown with the blue line in the top graph) and then, assuming the same standard deviation for trends, found the probability that we would reject the multi-model mean if the real trend is 0C/dec. I call this the “Slayer” hypothesis, named after “The Dragon Slayers”. That probability is the ‘power of the test’ under the ‘Slayer hypothesis’, and it is shown with the dark blue trace.
What this power tells us is this:
If (a) the “Dragon Slayers” are correct (i.e. the true trend is 0C/decade), (b) the variability of trends in models is also correct and (c) we decide to use the test in the upper graph to decide whether to reject the null of “the multi-model mean is correct”, then if we apply the test to 9.3 year trends, we would have a 25% chance of rejecting the null hypothesis. That 25% chance is called the “power” and denotes the probability our statistical test would give the correct answer: that the multi-model mean is wrong (which it is under the hypothetical used for computing power).
When the power is 25%, the “Type II error” of the test is 75%. That is: the chance that we will incorrectly fail to reject a null hypothesis that is wrong is 75%.
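Here is a minimal sketch of that power calculation. The σ below is a placeholder chosen only so the example lands near the ~25% figure quoted for 9.3 year trends; it is not the value taken from the models:

```python
from scipy.stats import norm

def power_of_test(multi_model_mean, sigma_trend, true_trend, alpha=0.05):
    """Probability of rejecting "true trend = multi-model mean" when the truth is
    true_trend, assuming Gaussian weather noise with standard deviation sigma_trend."""
    z = norm.ppf(1.0 - alpha / 2.0)              # ~1.96 for a 5% two-sided test
    lower = multi_model_mean - z * sigma_trend   # lower blue line in the top graph
    upper = multi_model_mean + z * sigma_trend   # upper blue line
    p_below = norm.cdf(lower, loc=true_trend, scale=sigma_trend)
    p_above = 1.0 - norm.cdf(upper, loc=true_trend, scale=sigma_trend)
    return p_below + p_above

# Placeholder "weather" sigma for short (~9 year) trends, in C/decade:
sigma_short = 0.155
print(power_of_test(0.2, sigma_short, true_trend=0.0))   # "Slayer" alternate: roughly 0.25
```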
Notice that as trends become longer, the power of the test increases. For 12 year trends, the power of the test against the “Slayer” hypothesis is roughly 50%. So: if the “Slayer” hypothesis were right, statistical tests of 12 year trends would have a 50% chance of rejecting the null of “models are right”. This means that if someone who believes the models tries to persuade a Dragon Slayer that their hypothesis is wrong based on a “fail to reject” using a 12 year trend, the Dragon Slayer is justified in saying: “But we’d get a fail to reject 50% of the time even if we were right. Bring me more data.”
Equally importantly: it’s worth noting that if the Slayers are right and we have to watch data trickle in and test models, what is likely to happen is that somewhere near 12 years after a forecast is made we will begin to get some “rejects” based on trends since the forecast date. These ‘rejects’ will not be robust. Instead, they will flicker on and off until we have sufficiently long trends to reach a power near 100%.
Note that, in contrast, if the Slayers are wrong and the multi-model mean is correct, we will get few “rejects”. Admittedly we might see some, but the probability of a ‘reject’ is dictated by the level of Type I error selected by the analyst. In the upper figure that was roughly 5% (with 2.5% of trends expected to fall above the top blue line and 2.5% below the bottom one).
Because of the “hindcast” issue it is a bit dangerous to discuss power for earlier periods, so I’ll defer that discussion to comments if anyone is interested.
Power for other alternate hypotheses
It happens that the power of a statistical test always depends on the alternate hypothesis chosen. Above I discussed the power of the test in the top panel under the alternate “Slayer” hypothesis. That hypothesis corresponded to “the models are totally, completely wrong; AGW is a FRAUD!!!”. But what if a group of people merely think the models are biased high? I’ll describe a group of people we will call ‘coolers’ who think the real trend ought to be 0.1C/dec. Possibly they think this because the earth is rebounding from the Little Ice Age. Who knows? But for the purpose of the exercise, they have some belief.
Now, supposing these “coolers” are right, the power of the test is shown by the light blue line. Note the power is lower. This is because, in this case, the multi-model mean is less far from the truth than under the “Slayer” hypothesis.
Examining the power curve under the “cooler” hypothesis, the power to reject the null of “the multi-model mean is correct” only rises to 50% sometime after 18 years. So, if the coolers believe they are right and no data have yet been made available, they would estimate it would take 18 years before a statistical test would give more accurate results than a coin toss. This would be their estimate because, for periods shorter than that, the probability of a “reject” outcome is less than 50%.
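Using the same sort of calculation (and the same placeholder numbers as the sketch above), one can see how the power responds to the alternate hypothesis:

```python
from scipy.stats import norm

# Power falls as the alternate trend gets closer to the multi-model mean.
# All numbers are illustrative placeholders, in C/decade.
multi_model_mean, sigma = 0.2, 0.155
lower, upper = multi_model_mean - 1.96 * sigma, multi_model_mean + 1.96 * sigma
for alt in (0.0, 0.05, 0.10, 0.15):              # "Slayer" = 0.0, "cooler" = 0.10
    power = norm.cdf(lower, alt, sigma) + (1.0 - norm.cdf(upper, alt, sigma))
    print(f"true trend {alt:.2f} C/dec -> power {power:.2f}")
```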
So can you learn anything from short trends?
Having written all this I’m sure some people will continue with their mantra of “you can’t make conclusions from short trends”. That mantra is false. You can make conclusions. These are:
- If you observe a “reject the null”, the shortness of the trend doesn’t matter. Provided the statistical assumptions are correct (here, for example, that the spread of trends can be estimated from models), that rejection of the null is meaningful in the sense that the event observed has a low probability if the null were true.
So: on that day, with that data, you can say that contingent on the data you have, you would reject the null.
- Unfortunately, as data trickle in over time, if the reject occurs when the power is low (especially < 50%) the “reject” may not ‘stick’. That is: it may not be robust even though the reject is the correct result. However, if the null is wrong, it is likely that the “reject” will recur as even more data trickle in. (This is, by the way, what we are currently seeing. We are on our 2nd round of rejects.)
- Mind you: lack of robustness will make things difficult if one needs to persuade. The same lack of robustness would be seen if the “reject” result were an outlier. However, in that case, it is relatively unlikely we will see additional rejects.
- Meanwhile: if one believes the null, they must be very careful not to over-interpret “fail to reject”. In the case above, the power calculations tell us that “fail to reject the multi-model mean” based on 12 year trends is more probable than not even if the true trend is 0C/decade. So, clearly, one should never take “fail to reject” with a trend shorter than 12 years to mean anything at all. (In the case of the upper figure, that means the two “fail to rejects” at 9.3 and 10.3 years provide absolutely no evidence we should doubt a reject based on earlier start years. So, the argument that we ought to ignore the 12.3 year trend results because of the failure to reject with shorter trends is vacuous.)
So above we see that we are beginning to see “rejects”. As in the previous period (2009) when we were seeing these, we can’t be sure they will hold. In fact: given the statistical power of the test, it is likely they will not hold even if the models are wrong. But they might; and if they do, the lack of power is not an argument against taking them seriously.
I’ll be plotting these as the year moves forward. Given what the data look like right now (i.e. the most recent monthly data lie below the best fit trend lines for 10-12 year trends) it is likely we will see the observed trends decline. I’m sure quite a few people will be interested in knowing if or when the observations do fall below the lower 2σ bound, and others will want to know if observations rise back up into the center of the spread. We’ll see.
For the short term I’m expecting to see “rejections” for the multi-model mean for start years in 2000, 2001 and 2002 at year end. But I could be wrong. 🙂
Lucia, is what you call “weather” above calculated from the maximum highs and lows of all the runs used in AR4?
Very interesting Lucia.
(basically unrelated to anything you said): I have this tendency to want to make something of the fact that we hit right on the border of rejecting the null (Slayer test) pretty close to right out of the starting gate. I can’t, can I? I mean, it’s a 2.5% chance, and it doesn’t matter that the first (or one of the first) rolls of the dice came up 2.5% or less. Or am I missing something?
I’ll bite, although I’m not confident I’ll understand. What’s the hindcast issue and why does it have anything to do with the power of the tests for the early periods? By ‘hindcast issue’ do you mean that the models were tweaked to match available data up to a certain point? I’d imagine that would lean us further towards accepting the null hypothesis.
Both the test itself and the estimate of the power of the test assume that one is testing something whose outcome was unknown and so probabilistic. When testing hindcasts, some aspect of the outcome is known and deterministic. So, testing hindcasts using probability theory is a bit dangerous.
Moreover, in the specific case of these tests, during the 90s some models applied forcings for Pinatubo and some did not. Those that did not apply forcings for Pinatubo should not match the observed trend. So interpreting the agreement or disagreement between observations and models when forcings are known to be different (and were known to be different even before the projections were made) is dicey.
Generally, the danger in testing hindcasts is that, because the ‘observations’ are known, one will tend to “fail to reject”: the people making ‘projections’ already knew the answer they were “projecting”. So, “fail to reject” should rarely be taken very seriously even if a power calculation would otherwise suggest we should take it seriously. In contrast, testing forecasts is a true test.
The reason the fact that it’s the first roll matters is that, by definition, if there is only one roll, no one can be cherry picking. In contrast, if we had many to choose from, one might “pick” the case that looked good. But right now, there is only one 12 year trend.
Similarly, ending “now” has a preferred status, since it uses all currently available data. People who pick some end point other than “now” might be cherry picking, especially if evaluation using the end point called “now” changes their answers.
Thanks Lucia! 🙂
Thanks for exploring these uncertainties.
For an earlier validation analysis, Green, Armstrong & Soon (2009) compare a “No Change” naïve model vs the Gore/IPCC projection of 0.03C/annum.
Kesten C. Green, J. Scott Armstrong, Willie Soon, Validity of climate change forecasting for public policy decision making, International Journal of Forecasting 25 (2009) 826–832
Scott Armstrong challenged Al Gore
“When and under what conditions would you be willing to engage in a scientific test of your forecasts?”
Armstrong challenged Gore to a bet that the Green, Armstrong & Soon “No Change Model” was superior to Al Gore’s IPCC 0.03 C/annum projection for global warming.
See the Global Warming Challenge
http://www.theclimatebet.com/
Though Gore has not accepted, the effective results are graphed in Gore v Armstrong since 2007. Through March 2013, No-Global-Warming was the best bet in 55 of 63 months.
Per Green et al., I suggest calling “0C/decade” a “Naïve” or “No Change” model.
I suggest a category of “Natural warming” of 0.05C/decade.
For a classic “natural warming” paper with this average, see:
Syun-Ichi Akasofu, On the recovery from the Little Ice Age, Natural Sciences, Vol.2, No.11, 1211-1224 (2010) doi:10.4236/ns.2010.211149
Can you do a plot of decadal temperature trend vs length of time needed to reject at 95% confidence? Thus if 1 degree per decade, we would only need a short period; for zero degrees we need …..
DocMartyn
I’m not sure what plot you are asking for. Making the plot in the present post required me to set the Mac to running; then I mowed the lawn and took a bath. Then it finished. This is in addition to coding. Given how long they take to run, I would need to think the question is important. I’m not doing that just for “lookie-loos”.
Sorry Lucia, I wanted to know how long a low (no) rate would have to be before it disproved a modeled 0.2 degree/decade. It was the lineshape that interested me.
Doc— How is a “no rate” any different from a trend of 0C/dec? If it’s the same (and I think it’s pretty similar), a 12 year trend of 0C/dec would be hard to explain as consistent with “weather” and a mean trend of 0.2C/dec. Meanwhile, if you are looking for inconsistency with the entire batch of models, some of which warm more slowly than 0.2C/dec, you need to wait 15 years. I discussed this the day before yesterday and showed an annotated graph.
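As a toy illustration of the shape of that calculation (the σ values here are made-up placeholders, not the spreads estimated from the runs):

```python
# Toy sketch: for each trend length, check whether 0 C/decade lies below the
# lower ~2 sigma bound around a 0.2 C/decade mean trend. The sigma values are
# invented placeholders meant only to show the shape of the calculation.
hypothetical_sigma = {8: 0.18, 10: 0.14, 12: 0.10, 15: 0.07, 20: 0.05}   # C/decade

model_mean = 0.2
for years, sigma in sorted(hypothetical_sigma.items()):
    lower = model_mean - 1.96 * sigma
    verdict = "consistent" if lower <= 0.0 else "inconsistent"
    print(f"{years:2d} yr trend of 0 C/dec is {verdict} (lower bound {lower:+.3f} C/dec)")
```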
Lucia, I compiled the mean monthly global temperature trends for the CMIP5 models for the RCP45 scenario (the one closest to reality) for the period 1970-May 2013 and compared them to the same trends for the observed temperatures from HadCRU4, GISS and GHCN. I averaged the replicate data for those models having replicate runs, which resulted in 42 model series and trends. The model data and most of the observed data were extracted from the KNMI web site. I needed to go to the original observed sources to obtain the last few months’ data for CRU and GHCN.
The CMIP5 model data have separate runs for the historical temperatures and runs that include both the historical and 21st century temperatures in the series. The historical runs have very nearly the same trends as the historical-plus-21st-century runs, which makes me wonder why there are separate historical runs. Also, the 1970-May 2013 trends for a given model are much the same across all the scenarios, i.e. RCP26, 45, 60 and 85. I selected the RCP45 scenario as that is the one I see used when investigators are studying the expected GHG conditions.
The model and observed averages, standard deviations and sample sizes for the trends are given here (units for trends are degrees C per decade):
Models: Mean= 0.2067; SD= 0.0461; n=42
Observed: Mean=0.1587; SD= 0.0107; n=3
Using these results I compared the model and observed means and calculated a t value of 5.09. That t value shows a very significant difference between the observed and model mean trends for the period 1970-May 2013.
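For what it’s worth, a Welch-style two-sample t statistic computed from the summary numbers above comes out at about the quoted value; this sketch assumes that form, which the comment does not state explicitly:

```python
import math

# Two-sample (Welch) t statistic from the summary numbers quoted above.
mean_models, sd_models, n_models = 0.2067, 0.0461, 42
mean_obs,    sd_obs,    n_obs    = 0.1587, 0.0107, 3

t = (mean_models - mean_obs) / math.sqrt(sd_models**2 / n_models + sd_obs**2 / n_obs)
print(f"t = {t:.2f}")   # prints t = 5.09
```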
Kenneth–
I think I’ve downloaded them. At first I tried downloading “all members”, but that seemed odd. So, I took the time to download each model individually. (I haven’t fully checked. I have 98 runs and I think I need 105. . . So I have to sit down and see which I missed.)
I’m reality checking my downloads before taking any other post processing seriously. (Right now, MIROC5 looks… uhmm… rather amazing. I’m not sure crops could grow on planet MIROC5.)
Lucia, in order to make this comparison between models and observations in the manner used by some investigators when comparing modeled and observed lower tropical troposphere temperature trends, I am going back and comparing the observations to the models, one observed series at a time, using the standard error of the observed trend with autocorrelation accounted for. Just looking, without calculation at this point, I think the t value will show a significant difference.
If it would help you, Lucia, I can email an Excel file with all the CMIP5 data. You need not trust but rather verify.
Kenneth–
Did you get your CMIP5 data from the Climate Explorer? That’s where I got mine. If we are drawing from the same source, there is no difference, so I don’t need it emailed. If you got it from somewhere else, I’d like to know the source so I can add it to the places I check.
Lucia, I got mine at KNMI, which is Climate Explorer. I always like to give KNMI credit when I use their data. I really appreciate what Geert Jan van Oldenborgh has done single-handedly with climate data in putting it into useful downloadable form, particularly since I have gone to other sources for data and found they no longer support their programs for efficiently retrieving it and do not seem to really care whether it can be used by others or not.
Kenneth–
Ok. We got them from the same place. I just found that if I used “all members” and tried to get all of the data that way, the result was “weird”. (I also get weirder results using Firefox than Safari. Go figure.) Anyway, I’m trying to do fiduciary checks before I say too much.
Lucia, I did a comparison between the observed and the modeled CMIP5 RCP45 mean monthly global temperature series from Jan 1970 through May 2013, in the form used in Santer et al. (08), “Consistency of modelled and observed temperature trends in the tropical troposphere”. The autocorrelation adjustment of the number of degrees of freedom for calculating CIs for the observed series was done from a 10,000-replication Monte Carlo simulation of an ARMA(2,0) model. Actually, the adjustment resulting from the equation nadj = n(1 − r1)/(1 + r1), where n is the degrees of freedom and r1 is the AR1 coefficient, which was used in Santer for an ARMA(1,0) series, gave similar results. The pertinent data and results are listed below. The t values show a significant difference between the trends for the 42-model mean and each of the 3 major observed series.
Series   mean    SD      n    t value
Models   0.207   0.046   42   NA
CRU4     0.168   0.015   1    2.61
GISS     0.147   0.013   1    4.03
GHCN     0.161   0.013   1    3.08
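A minimal sketch of the AR(1) effective-sample-size adjustment quoted above, nadj = n(1 − r1)/(1 + r1), applied by inflating the naive OLS standard error of the trend. The synthetic series and the choice to scale the standard error by sqrt(n/nadj) are illustrative assumptions, not Kenneth’s actual code:

```python
import numpy as np

def lag1_autocorrelation(x):
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))

def trend_with_adjusted_se(y):
    """OLS trend of a monthly series with an AR(1)-adjusted standard error."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    r1 = lag1_autocorrelation(resid)
    n_adj = len(y) * (1.0 - r1) / (1.0 + r1)                  # effective sample size
    se_naive = np.sqrt(np.sum(resid**2) / (len(y) - 2) / np.sum((t - t.mean())**2))
    return slope, se_naive * np.sqrt(len(y) / n_adj)          # inflate SE for autocorrelation

# Illustrative use on a synthetic monthly series, Jan 1970 - May 2013 (521 months):
rng = np.random.default_rng(0)
months = np.arange(521)
y = 0.0017 * months + rng.normal(0.0, 0.1, size=521)          # ~0.2 C/decade trend plus noise
print(trend_with_adjusted_se(y))                              # (slope per month, adjusted SE)
```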
Kenneth—
I’m not at all surprised to find a statistically significant difference! 🙂