On Monday, Roger Pielke Jr. alerted me to the 17-author paper Santer et al. (2008). Yesterday, I downloaded the pdf and gave it a quick look. This paper presents the results of a statistical test comparing the consistency of modeled and observed temperature trends in the tropical troposphere; the paper was recently discussed by Gavin, one of its authors.
Since I’m engrossed in the question, “How well do the models used in the IPCC AR4 predict or hind-cast Global Mean Surface Temperature (GMST)?”, on reading the article, I asked myself this:
“I wonder what I’ll find if I apply these methods to GMST instead of temperature trends in the tropical troposphere (TTT)?”
So, I dashed to my Macintosh and began to apply the techniques in the paper I will henceforth refer to as “Santer17” to test the consistency between modeled and observed trends in GMST rather than trends in TTT.
For purposes of testing I used monthly data, and I examined three time periods.
- To provide a long period to test the hindcast, I used the 30-year period from Jan 1970- Dec 1999. For this period, the modelers had access to data on radiative forcings.
- As I am also interested in examining the fidelity of projections, I examined the longer time period stretching from Jan 1970-August 2008.
- For the later portions of this period, models forming the basis of projections in the IPCC AR4 were forced using various SRES scenarios. I examine the cases forced using the A1B scenarios.
The results from these three periods are not statistically independent, but showing results from all three permits readers to develop an impression about the impact of the start and end dates.
I will now begin to relate the results of my quick investigation. :)
Since this is a blog, documentation will be less involved than one might expect of a journal article. I will take the liberty of assuming you all have access to Santer17 which itself describes the caveats to the test. To minimize typing of equations, I will refer to various hypotheses and equations numbers using the values in that paper. Also, I will only compare models to one data set for GMST. It is the one I call “Merge 3” and represents the average of NOAA/NCDC, GISS Land/Ocean and HadCrut. Also, I have not obtained all runs included in the IPCC AR4 and extending to the current period. Instead, I have used the runs discussed in an earlier post.
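For readers who want to reproduce the merge, it is nothing fancier than an average of the three series. Here is a minimal sketch, assuming the three monthly GMST anomaly series have already been aligned on a common set of months; the function name and inputs are my own placeholders, not anything from Santer17:

```python
import numpy as np

def merge_three(noaa, giss, hadcrut):
    """Average three monthly GMST anomaly series into one merged series.

    Inputs are 1-D arrays aligned month-by-month.  Averaging anomalies
    computed on slightly different baselines shifts the level of the
    merged series but does not affect the trend.
    """
    return np.mean(np.column_stack([noaa, giss, hadcrut]), axis=1)
```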
Time period
Readers likely are asking: Why use a start date of 1970?
The choice is somewhat arbitrary but was influenced by the following:
- I wanted the test of the hindcast to include a full standard 30-year climatological period.
- Prior to 2000, modelers drove their models based on measured forcings. Afterwards, forcings are dictated by the SRES scenarios.
- Models that include volcanic forcing were reported to better match observations of ocean heat content (Domingues 2008); I wanted to see if including volcanic forcing had a similar effect on simulations of GMST.
- I prefer more recent periods, as observations are likely to be more accurate.
The start date of 1970 satisfied these criteria. (Besides the above, I had previously used this period to investigate some issues related to the effect of volcanic forcing. This made the choice convenient, as I already had files set up to analyze with a start date of 1970!) Note that, in contrast, Santer17 uses a start date of 1979. Their choice was dictated by their desire to use the satellite measurements of the troposphere; these observations are limited to the period after 1979.
Now that I have explained the start date, I intend, over the course of a few posts, to apply various tests suggested in Santer17 to the data. Today, I will apply the sort of tests discussed in section 5.1.1 of “Santer17”; these are “Tests of Individual Model Realizations”.
Tests of Individual Model Realizations
In Section 5.1.1 of “Santer17” (pdf), you will find their Figure 3a which shows
- Trends in tropical tropospheric temperature for 49 model runs along with the “2σ” uncertainty adjusted for autocorrelation; the uncertainty is computed using “Santer17” equation 12 and supporting equations.
- Trends in tropical tropospheric temperature based on RSS v3.0 and UAH v5.2, along with their uncertainty adjusted for autocorrelation; these confidence intervals are computed using equations (4), (5) and (6) in Santer17 (a sketch of this adjusted-interval calculation appears just after this list).
- Grey bands denoting the “1σ” and “2σ” confidence intervals for observations.
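For readers who want to follow along, here is a minimal sketch of how I compute an autocorrelation-adjusted trend interval of this kind. It reflects my reading of equations (4)-(6): fit a least-squares trend, take the lag-1 autocorrelation of the residuals, shrink the sample size to an effective value, and inflate the standard error of the trend accordingly. It is not code from Santer17, so check the paper for the exact forms.

```python
import numpy as np

def trend_with_adjusted_ci(y):
    """OLS trend of a monthly series with a 2-sigma interval adjusted
    for lag-1 autocorrelation of the regression residuals (my reading
    of Santer17 Eqs. 4-6)."""
    n = len(y)
    t = np.arange(n, dtype=float)
    slope, intercept = np.polyfit(t, y, 1)           # least-squares fit
    resid = y - (intercept + slope * t)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]    # lag-1 autocorrelation of residuals
    n_eff = n * (1.0 - r1) / (1.0 + r1)              # effective sample size
    s_e2 = np.sum(resid**2) / (n_eff - 2.0)          # adjusted residual variance
    s_b = np.sqrt(s_e2 / np.sum((t - t.mean())**2))  # adjusted std. error of the trend
    return slope, 2.0 * s_b   # trend per month and its 2-sigma (x1200 for deg C/century)
```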
Their graph looks like this:

Figure 1: Comparison of simulated and observed trends in T2LT (Figure 3A from Santer17)
Discussing this graph, Santer17 says: “The adjusted 2σ confidence intervals on the RSS T2LT trend includes 47 of 49 simulated trends. This strongly suggests that there is no fundamental inconsistency between modeled and observed results.”
What Santer17 are saying is that because 47 of the 49 “dots” corresponding to model runs fall inside the light gray bands describing the uncertainty of the observations, it’s not possible to say the models are inconsistent with the observed results. By implication, if many of the “dots” fell outside the light gray bands, that would suggest the predictions are inconsistent with the data.
So, what of GMST?
So now, let’s examine a similar graph but looking at GMST from 1970-2008 rather than T2LT from 1979-1999. Voila!

In my figure, the symbols represent the best estimate of the trend in GMST determined using ordinary least squares. The “whiskers” represent the ±2σ confidence intervals computed using equations (4), (5) and (6) in Santer17. The bluish region is the ±2σ confidence interval based on the observation set for GMST I described above.
You may now proceed to count: the best estimates of the trend from 19 of the 38 individual runs lie outside the ±2σ confidence intervals for the observations of GMST. One might ordinarily expect roughly 5% of the data to fall outside this range: 19/38 is 50%. That so many individual runs are inconsistent with the observations may suggest (dare I say it?) that the trends predicted by models may be inconsistent with the observations!
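The counting step itself is trivial. Here is a minimal sketch, assuming `model_trends` holds the best-estimate trend from each run and `obs_trend`, `obs_2sigma` come from the adjusted-interval function sketched earlier; all of the names are my own placeholders:

```python
import numpy as np

def count_outside(model_trends, obs_trend, obs_2sigma):
    """Count how many best-estimate model trends fall outside the
    observations' +/- 2-sigma interval, and the fraction they represent."""
    model_trends = np.asarray(model_trends, dtype=float)
    outside = (model_trends < obs_trend - obs_2sigma) | (model_trends > obs_trend + obs_2sigma)
    return int(outside.sum()), float(outside.mean())
```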
But recall that this period includes the eruptions of Pinatubo, El Chichón and Fuego. These eruptions injected aerosols into the stratosphere and are thought by many to have affected temperatures afterwards. Maybe the inconsistency between observations and models is due to the inclusion of the 11 runs from models that do not include the effect of volcanic forcings.
Let’s consider only results from the 27 model runs that did include volcanic forcings in the simulations.
Doing so, we find the best estimates of the trend for 15 runs lie outside the ±2σ confidence intervals for the observations: that’s 56%. So, by filtering out trends from the 11 runs associated with models that use ahistorical (i.e., known to be incorrect) forcings, the simulated trends appear to be more inconsistent with the observations than otherwise!
But maybe the disagreement is due to some mismatch between the A1B SRES and the actual forcings since Jan 2000? Here’s how the graph looks if we eliminate the months from Jan 2000 onward.

By eliminating 8 years’ worth of data, we now find the best estimates of the trend from 17 out of 38 model runs fall outside the ±2σ confidence intervals for the observations. Focusing on the 27 runs that include volcanic forcings, we find the best estimates of the trend fall outside the ±2σ confidence intervals for the observations for 59% of those runs. (Note that as we reduce the number of years, we expect to eliminate fewer and fewer runs even if the models are inaccurate. This occurs because the ±2σ confidence intervals for the observations widen. The main difficulty with using short time periods to test models is that we risk failing to detect inconsistency where it exists.)
Even though eliminating years tends to reduce the power of a test, thereby making it more difficult to detect inconsistency when models are wrong, I was curious to see what result I would have obtained if I used the exact set of years chosen by Santer17: Jan 1979-Dec 1999. The results for this period are illustrated below:

In this case, the best estimates for the trend in GMST for 20 of 38 model runs (53%) fall outside the ±2σ confidence intervals for the observations. Considering only the runs with volcanic forcing, we find 19 out of 27 (70%) fall outside the ±2σ confidence intervals for the observations; 1 out of 11 (9%) of the trends from runs with no volcanic forcing falls outside the ±2σ confidence intervals for the observations.
It is worth noting that for this time period, the inconsistent trends for runs with volcanic forcing sometimes fall below the observed trend and sometimes fall above it. This may, in part, be due to the timing of the 1979 cut-off relative to the volcanic eruptions.
Paired t-tests of individual models.
Naturally, “Santer17” did not limit itself to eyeballing the graph and counting how often the best estimate of the trend from individual models was found to be inconsistent with the ±2σ confidence intervals for the observations. They also applied a “paired t-test” to determine whether the trend computed in each individual run was inconsistent with the data. The paired t-test used in Santer17 section 5.1.1 is based on “Santer17” equations (12) and (13). This test asks the question: Is the trend in individual model run “n” consistent with the observations?
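Here is a minimal sketch of how I set up that test. It reflects my reading of equations (12) and (13) — a difference of trends normalized by the pooled, autocorrelation-adjusted standard errors — so the exact form and the choice of degrees of freedom should be checked against the paper.

```python
import numpy as np
from scipy import stats

def paired_trend_test(b_model, s_model, b_obs, s_obs, dof):
    """Sketch of a 'paired trends' test in the spirit of Santer17
    Eqs. (12)-(13): the difference between a model-run trend and the
    observed trend, divided by the pooled (adjusted) standard errors.

    `dof` is the effective degrees of freedom for the reference
    distribution; this is my assumption about how to apply it.
    """
    d = (b_model - b_obs) / np.sqrt(s_model**2 + s_obs**2)
    p_value = 2.0 * stats.t.sf(abs(d), dof)   # two-sided p-value
    return d, p_value
```

In the counts below, a run is flagged as inconsistent when its two-sided p-value comes in below 0.05.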
I replicated that test, but applied it to GMST using a 5% significance level. Using monthly data from Jan 1970-August 2008, I found:
- 32% of the trends in GMST from the 38 individual model runs were inconsistent with the observed trend in GMST to p=5%.
- 37% of the trends in GMST from the 27 individual model runs that include the effect of volcanic eruptions on the radiative forcing were inconsistent with the observed trend to p=5%.
- 18% of the trends in GMST from the 11 individual model runs that fail to include the effect of volcanic eruptions on the radiative forcing were inconsistent with the observed trend to p=5%.
Bear in mind, if the models correctly simulate the trend in the earth’s GMST, we’d expect to find 5% of the simulated trends inconsistent at p=5%. In contrast, using the test applied in section 5.1.1 of Santer17, we find roughly 1/3rd of all 38 model runs result in trends that are inconsistent with the observed trend to p=5%. If we screen out the runs that use unrealistic volcanic forcings, the inconsistency between models and observations worsens. In contrast, the results improve if we include the cases that fail to include the forcings due to volcanic activity. However, even for the cases with physically unrealistic forcings, 18% of the individual runs fail at p=5%.
Of course, I also tested the other sets of years. For the period Jan 1970-Dec 1999, I found 26% of the trends in GMST from individual model runs were inconsistent with the observed trend to p=5%; 37% of the trends from runs with volcanic forcing were found inconsistent. For the period Jan 1979-Dec 1999, 21% of all individual model trends were inconsistent with the observed trend to p=5%; 30% of trends from runs with volcanic forcing were found inconsistent.
Preliminary results
The results shown so far suggest the trends predicted by models are inconsistent with observations.
However, this is only part I, and it applies only one of the tests in Santer17. Also, Santer17 does discuss some caveats to consider before drawing conclusions of this sort; you’ll find them all in the pdf.
In addition: recall that I have not accessed every run used by the IPCC in the AR4. I only downloaded those runs that forecast GMST using the A1B scenarios after 2000 and which were available sometime last month. Eventually, I plan to repeat this analysis using all runs; if I obtain different results, I will report that finding.
In future posts, I will also discuss the caveats further.
What’s next?
Later on, I’ll be showing the results that parallel the analysis in 5.1.2 of Santer17. I’ll also discuss some caveats.
In the meantime I leave you with this: the observations do show statistically significant warming. The fact that the models’ predictions appear inconsistent with the observations does not erase that conclusion. The inconsistency between models and observations only affects the assessment of the likely accuracy and precision of models relative to other methods of forecasting trends in GMST.
Update
1. Click to download Excel spreadsheet containing raw data for HadCrut3 “Mean of northern and southern hemisphere averages (Recommended for general use).” The trend is computed using LINEST. The lag 1 autocorrelation computed using “Correl” is r= 0.739541406.
2. The “whiskers” on the original 1979-1999 graphs showed 1σ rather than 2σ errors. The blue and purple bands for 1σ and 2σ were correct. I fixed this formatting issue and replaced the graph. The numerical results are unchanged.
Typos:
“So now, let’s examine a similar graph but looking at GMST from 1970-1008 ”
And is this missing a “do not?”
“Bear in mind…. If we screen out the runs that use realistic volcanic forcings, the results are worse.”
Delete at will.
Thanks Alan! Fixed.
Just mowed the lawn… now I need to go out and buy walnuts and chocolate for brownies.
Do you have the list of models used? Gavin claimed they all included volcanic forcing.
Steve: Here is table 10.1 from the AR4 WG1.
There is a column indicating whether or not volcanic forcings are included in the runs. According to the table, “NA” means volcanic forcings were not specified in the 20th century simulations.
Santer et al includes, FGOAL-g1, which according to the table did not specify volcanic forcings during the 20th century.
I’ll cut and paste the list from Santer et al in a sec so you can compare.
Models used in Santer– from the legend for figure 3.

Using the information in table 10.1 which I included in the comment above, I filtered the simulated GMST values and created this graph showing the GMST averaged over models listed as using volcanic forcing and those listed as “NA” for volcanic forcing.
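In case it helps anyone, the filtering and averaging step is simple; a minimal sketch, with the array names as my own placeholders (one row of monthly GMST per run, plus a boolean flag per run saying whether table 10.1 lists volcanic forcing for that model):

```python
import numpy as np

def group_means(runs, has_volcano):
    """Average monthly GMST across runs, split by whether the model's
    20CEN forcings include volcanic forcing.

    `runs` is a 2-D array (n_runs x n_months); `has_volcano` is a
    boolean array of length n_runs built from table 10.1.
    """
    runs = np.asarray(runs, dtype=float)
    flag = np.asarray(has_volcano, dtype=bool)
    return runs[flag].mean(axis=0), runs[~flag].mean(axis=0)
```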
Note that Pinatubo and El Chichón ’cause’ the temperature to dip for the average computed over runs that do not say “NA” for volcanic forcing (shown in purple).
These eruptions do not produce a dip in GMST for the cases where the average is computed over runs that say “NA” for volcanic forcing. (See the red trace.)
This result suggests to me that the “NA” really, truly means “NA”. The volcanic forcing was not used for cases that state “NA” for volcanic forcing. (The alternative explanation is that, in those models, the “physics” say stratospheric aerosols have no effect. That alternative would not result in our placing much confidence in those models.)
FWIW: I discussed this graph in an earlier post: http://rankexploits.com/musings/2008/effect-of-including-volcanic-eruptions-on-hindcastforecast-of-gmst/ Also, FWIW: If we pick the 1970-2008 time period, and do a paired t-test to compare the 28 “Volcano” trends with the 11 “NA- Volcano” trends, the two trends are found to be inconsistent to p=5%. (I can’t remember the precise number– I think it’s really to p=0.02%. I’ll be discussing that later.)
Steve Mosher– Here’s what Santer17 actually says about forcings in sect2.2 “Model Data”:
“These so-called twentieth-century (20CEN) simulations are the most appropriate runs for direct comparison with satellite and radiosonde data, and provide valuable information on current structural and statistical uncertainties in model-based estimates of historical climate change. Inter-model differences in the 20CEN results reflect differences in model physics, dynamics, parameterizations of sub-grid scale processes, horizontal and vertical resolution and the applied forcings (Santer et al. 2005, 2006).”
In other words: the applied forcings differ across the models used to hindcast the 20th century. (So do many other things.) Based on the table in the IPCC AR4 WG1, some models do not specify volcanic forcing. So, at most, “included” might mean “included as some sort of constant value known to apply almost never”.
I’ll admit, by “included” I have meant “included as a forcing that varies dramatically after eruptions like Pinatubo.”
Lucia,
I am a bit puzzled. The Santer graph you show has about 0.5ºC per decade (5ºC per century) between the upper and lower confidence limits and a number of the error bars for the models are a similar size (some are much larger and a few are smaller).
On your 1979/99 graph the CI lines are about 1.3ºC per century apart. I put this reduction down to the fact that you were using a merge of three data sets giving a reduced variability. The puzzle is that the error bars for the models also seem to be much reduced in size as well.
It seems a bit odd that the model error bars should be so much smaller for the surface temp as compared to T2LT.
Am I missing something?
By a strange coincidence I spent my afternoon reading the Santer paper and have reached exactly the part you are using for this post. All I did was read it and you have already produced this response. Amazing!
Jorge–
Santer examines a different metric: “Temperature Trends in the Tropical Troposphere”. (Say that three times fast! 🙂 )
I am applying their method to GMST. So, the numerical values for trends will be different because the tropical troposphere is expected to warm at a different rate than the global mean surface temperature.
So…. I am not disputing the actual results in Santer. I have a few quibbles which I will get to later. But basically, I am seeing what results are obtained if we apply the method to a different metric: GMST.
I can do this fast because I have the GMST data. Plus, the paper is mostly straightforward, both in the things that appear correct (much of it) and the things that I have a few misgivings about.
Lucia,
It is not so much the fact that the trends are different that bothers me. I rather expected that. The puzzle is the change in size of the error bands in the models in your GMST graph compared to the model error bands in the Santer T2LT graph.
The implication is that the model runs used in the Santer graph for T2LT are much noisier than the model runs you have used for the GMST. The difference really does seem to be huge.
Jorge– I haven’t downloaded the model data for T2LT. Certainly, those uncertainty bands based on the monthly “model” data are huge. They are estimated under the assumption that the “weather noise” is AR(1) and also that excursions in temperature due to volcanic eruptions would occur at random times during “other” realizations. (I’m going to have to explain this at some point.)
You will notice that for many of those models in Santer, the standard deviation based on the “whiskers” is huge compared to what you might estimate based on the 5 runs, 3 runs, etc. This seems particularly true for GFDL2.0.
The numbers all come from computer programs.
To date, we have no evidence that the numbers actually represent proper results from the model equations, solution methods, and application procedures of any of the programs.
Proper results covers an enormous range of critically important issues; correctness and completeness of the model equations, correct coding of the model equations and solution methods, correctness of the specified initial and boundary conditions, correctness of all pre-processing methods and procedures, consistency with physical phenomena and processes, stable and converged numerical solution methods, correct application procedure of the programs, correct handling of the numbers in post-processing methods and procedures, along with many others.
Verification that the numbers are in fact proper results must always, and there are never any exceptions, precede any comparisons of calculated results with data.
Data comparisons in the absence of prior Verification of all models and methods is not a fruitful exercise.
Snip at will.
Dan–
Why would I snip?
I recognize that you believe one should not compare computer results to data if there has been no formal verification process. However, these results are presented to the public and policy makers. So, whether or not this is the proper order, I think it’s fruitful to compare to data.
If the public and policy makers were being given guidance based on readings of Nostradamus, I would still believe it’s fruitful to compare the predictions to data.
I think you might need to check your calculation of the observational error bars. I looked at the HadCRUT3v monthly data Jan 1979-Dec 1999 and get a trend of 1.51 degC/century and a standard 95% CI of +/-0.25. With the adjustment used in Santer et al (equ 5), with a lag1 auto-correlation of 0.84, the n_eff is 22 (down from 252). Thus the adjusted 95% CI is 0.25*sqrt(250/20.) = 0.88. Giving 2 sigma range of 0.6 to 2.4 – significantly wider than that shown on your graph.
Lucia,
You write that “observations do show a statistically significant warming”. I am seriously puzzled. I thought that by now we can no longer bypass the fact that there is a growing and hard-to-refute body of data from a broad variety of sources, built up over the past 10 years or so, that puts strong question marks around that assumption. 2008 global mean temps are at 1988 levels. A good number of observers [including Lindzen, Pielke Sr and others] have pointed out, quite correctly in my view, that whatever warming may have occurred between the mid-70s and late 90s, all relevant temperature metrics have shown a leveling off since 1998 and a decrease since 2003. Lindzen goes as far as arguing that any statistically significant increase in temps ceased in 1997, and Pielke Sr has argued that the Earth’s atmospheric system has in fact been shedding joules since 2004. I know you believe we are dealing with AGW, and Pielke Sr is of the view that Homo sapiens has a measurable impact on climate [Lindzen doesn’t]. I still don’t understand how this growing body of data jibes with your contention.
Gavin–
Which HadCrut data are you using?
I checked the dataset I linked above. My data are fresh. I get the same trend you get, but I get a lag 1 correlation of r=0.739541406 for the monthly data from Jan 1979-Dec 1999. Naturally, since the ratio of the number of effective data points to the total number is (1-r)/(1+r), my number of degrees of freedom drops less than yours does.
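As a rough check on the arithmetic, using the n = 252 months you quoted: n_eff = n(1-r)/(1+r) works out to about 252 × 0.26/1.74 ≈ 38 effective points with my r = 0.74, versus about 252 × 0.16/1.84 ≈ 22 with your r = 0.84, which goes a long way toward explaining the difference in the widths of our intervals.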
I provided a link to the spreadsheet above.
Lucia,
Some quick questions about Santer et al:
1) What is the time period for which the models are compared to the tropospheric data? 1979-2007?
2) When were the models (whose runs where considered in comparison to the tropospheric temperature data) actually run? In the article it appears as if they were run for the 2007 IPCC report, but I couldn’t find when they were actually run.
3) When was the tropical tropospheric versus tropical surface data trend inconsistencies first raised as an issue of modeling accuracy?
Marty–
1) Santer et al use the time period of 1979-1999.
2) Each model is run at a different time. However, there was a deadline to submit results sometime in 2005. The IPCC TAR was published in 2001. So, mostly, they were run after 2001 but before 2005.
3) I don’t know when the surface trend inconsistencies were first raised as an issue of model accuracy. However Santer17 is a response to Douglas et al (2007)
http://www3.interscience.wiley.com/journal/117857349/abstract?CRETRY=1&SRETRY=0 which discussed the inconsistency.
Douglas states:
One of the reasons Santer’s findings differ from Douglas involves updated data. Another is Douglas did not include the uncertainty in the observed trend. (I agree they should have.)
Tetris– In the data sets examined, the warming trends are all statistically significant. They would be even with Gavin’s suggested larger error bars. ( We’ll see if he is using a different version of HadCrut data set, as I continue to get the ones I show in the article.)
Gavin– Again: I did notice the whiskers on the graph were incorrect. The blue and lavender 1 and 2σ bands were fine, but I’d accidentally left the whiskers matched to 1σ. That was due to a formatting mistake. The correct graph is displaying now.
At these confidence values, what would be the effect of the propagation of errors to, say, the year 2100? (Most of these errors are probably not even known.) Has anyone done such a calculation? Just because there is some agreement with today’s data does not mean that will be the case in the future.
eilert:
We can’t know.
Consistency between models and observation gives greater confidence the models are correctly reproducing the physical processes; inconsistency reduces that confidence. If the models are missing something, or off for some reason, that “something” may be important only during certain periods. In which case, the models could be right at other times.
(Examples: getting volcanic eruptions correct only matters when volcanoes erupt. So, were models to be a bit off estimating the effects of eruptions, they could still be right if volcanoes stopped erupting. )
Statistical tests of the sort done here can’t tell us why individual runs from certain models are off. They only tell us that many of them are outside the 95% confidence limits for the observations. Of course, this does assume those limits can be computed as suggested in Santer17.
Lucia,
Thanks for resolving the mystery of the shaved whiskers. Now another puzzle. You and Gavin are talking about the Hadcrut data set. I thought you were using merged data for all the graphs.
Is it coincidence that Hadcrut alone shows the same trend from 1979/99 as the merged set?
What happens to the effective number of data points when you use a merge of three sets – does it still simply depend on the lag 1 r values of the merged set?
I have a feeling that some modellers are not going to be happy about using merged data if this results in smaller error bars for the observations. 🙂
Jorge–
I am showing the merged.
I assume Gavin, like you and me, saw that the whiskers didn’t look right, didn’t notice that the whiskers for the observations didn’t match the blue band (as they must), and then just calculated using some HadCrut data he has on hand.
The merge with HadCrut, GISS and NOAA has to give similar values and uncertainty bands, so clearly, if things are off by a factor of 2, that will show.
When he first commented, I focused on the calculations. I forgot that when I make the graphs, I need to manually adjust those blue and purple bands, which are just rectangles overlaid. So, my file has a “multiple” box that multiplies the whiskers. I enter “2” to create the blue box; enter “1” to create the purple one; then I set it back to “2”.
On the last graph, I forgot to set it back to “2”. So, the only thing affected was the whiskers– the boxes and the numbers in the text were still fine.
No, it’s not a coincidence. HadCrut makes up 1/3rd of the merge. Also, the trend is nearly the same for all 3 observational sets. The correlations are nearly the same for all three, and all have about the same number of effective data points. I use the “r” for the merged set.
I get nearly the same results no matter which set I use. For 1979-1999 NOAA gives the smallest error bars, HadCrut the largest, and the merge gives the second largest. The differences are tiny.
If you run the analysis on the tropical tropospheric data shown in figure 1, do you get the same results as they do?
Chris– I don’t have the model data for the tropical troposphere. I could get it, but it would take a while. I’m interested in GMST so I’m looking at that.
Since that article has 17 authors, giving many eyes to proof-read, I assume they were able to apply ordinary least squares and compute uncertainty intervals using the very simple, straightforward methods discussed in the paper, and did not make mistakes. However, at some point, I may ask John Christy for the “model data.” That would save me time locating or recalculating the “model data” for this region. (Recalculating the “model data” for T2LT might involve downloading gridded data for the planet and computing for each month. It’s doable– but this isn’t Climate Audit. I think what Steve does is great– but my interests differ somewhat.)
Lucia, all four data sets are now in for Sept 2008:
http://hadobs.metoffice.com/hadcrut3/diagnostics/global/nh+sh/monthly
The analysis changes completely when you include 2008 and 2007 and the effects of a La Nina into the dataset.
I have modeled the Nino 3.4 region anomaly with a 3 month lag with the Hadley monthly temperature data going back to 1940.
The Hadley global temperature anomaly is remarkably correlated with the anomaly of the Nino 3.4 region (lagged 3 months).
The global temperature anomaly is 0.15 * the Nino 3.4 region anomaly (3 months prior), or +0.5C for the biggest El Ninos such as the 1997-98 El Nino and -0.3C for the peak period of the 2007-08 La Nina.
Including this factor into a model of monthly Hadley temperature anomalies leaves one with just 0.08C per decade global warming from GHGs impact – far below the 0.2C to 0.3C per decade predicted by the models for the tropical troposphere.
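In code form, the sort of lagged fit I am describing looks roughly like this (the variable names are placeholders; this version estimates the ENSO factor and a residual linear trend in one regression, which may not be exactly how the numbers above were produced):

```python
import numpy as np

def lagged_enso_fit(hadley, nino34, lag=3):
    """Regress monthly global anomalies on the Nino 3.4 anomaly `lag`
    months earlier, plus a residual linear trend.  Inputs are aligned
    1-D monthly anomaly arrays covering the same months."""
    y = hadley[lag:]
    x = nino34[:-lag]
    t = np.arange(len(y), dtype=float)
    design = np.column_stack([np.ones_like(t), x, t])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    _, enso_coef, monthly_trend = coef
    # ENSO factor and residual trend in deg C per decade
    return enso_coef, monthly_trend * 12 * 10
```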
Of course, if the ENSO is also driven by GHGs, then the analysis falls apart.
But the ENSO anomalies vary so much, independently of GHG changes, that GHGs cannot be responsible. The monthly Hadley global anomalies also vary so much that GHGs cannot be responsible. The highly variable ENSO cycle explains Hadley’s global temp variation extremely well, and thus I am forced to conclude the warmers are just trying to take credit for the increased El Nino trend of 1986 to 2006 in their global warming trend analysis. Including the temperature declines of 2007 and 2008 would change the conclusions considerably.
Just a correction for the record. I misunderstood an inline comment (#21) gavin made WRT volcanic forcing in the models. He’s clearly on record now saying that he never said “all the models used volcanic forcing.” So, I apologized for the misunderstanding on RC and do the same here in case people here don’t read RC.
steve– Thanks! It makes more sense to read that Gavin said some models use volcanic forcing, rather than all. Some do, some don’t.
Lucia,
Yet he has no problems including models which don’t model a rather significant climate feature in the ensemble.
This is not science. It is a numbers game designed to give pseudo-scientific justification for conclusions that were determined in advance.
Hi Lucia,
Is it correct to think that most of these models used hindcasting as a method to test validity (i.e. the hindcasts were significantly correlated to the prior instrument record)? If that is the case, wouldn’t you expect that the early portion of the models’ time series would maintain significance, and to the extent that the correlation becomes insignificant, that such insignificance would be back-end weighted to the newer portion of the time series? For example, a model designed in 2000 should generate pretty good looking runs for 1970 to 2000. If a model that maintains significance for 30 hindcast years runs off the track within 6-7 years thereafter, dragging the entire time series with it, the most recent result must really be aberrant. As a non-scientist, I know that what I think might be insight is possibly superficial tripe. Feel free to tell me the same.
PS- I have spent a little time recently at Tamino and wanted to let you know that you are not allowed to study the 8 year temperature timeframe from 2000-2008, as this is weather. Feel free to focus on the 8 year Arctic Ice extent trends however, as this is an unambiguous climate signal.
Lucia re 5878 and 5882, your calculation correctly implemented the Santer method; Gavin’s didn’t. You calculated the AR1 coefficient of the residuals from the trend fit, while Gavin calculated the AR1 coefficient of the series itself (which in this case is higher).
Steve Mc–Thanks! Why Gavin got a larger value was puzzling me. His was an easy enough mistake to make.
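For anyone following along, the distinction in code looks roughly like this (a minimal sketch; `anoms` is the monthly anomaly array for the window in question):

```python
import numpy as np

def lag1(x):
    """Lag-1 autocorrelation of a 1-D series."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

def r1_residuals_vs_series(anoms):
    """Contrast the two calculations: AR(1) of the residuals from the
    trend fit (what I used) versus AR(1) of the raw, trending series
    (which generally comes out larger when a trend is present)."""
    t = np.arange(len(anoms), dtype=float)
    slope, intercept = np.polyfit(t, anoms, 1)
    resid = anoms - (intercept + slope * t)
    return lag1(resid), lag1(anoms)
```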
Oh yes. Thanks.
Lucia, with respect to your argument with beaker over the exegesis of the test in Santer et al, I’ve been able to reproduce the Santer Table III numbers and they confirm your interpretation over beaker’s. Both articles use the standard deviation of the estimate of the inter-model mean, with the difference in test statistic being, as you (and I) surmised, the inclusion of an allowance for uncertainty in the observed trend in Santer. See http://www.climateaudit.org/?p=4163.