The Blackboard

Where Climate Talk Gets Hot!


Santer Method Applied Since Jan 2001: Average trend based on 38 IPCC AR4 models rejected.

23 October, 2008 (13:20) | Data Comparisons Written by: lucia

After Steve McIntyre established that equation (12) in Santer17 (pdf) does not contain a typo, I decided to apply the paired t-test in Santer17 section 4.2 to test two hypotheses; both are similar in form to the one Santer17 refers to as “H2”. My tests, however, relate to Global Mean Surface Temperature (GMST) and are based on slightly different ensembles; to distinguish them from Santer’s H1 and H2, I’ll number my hypotheses H3 and H4.

The hypotheses I tested are

  1. H3: Averaging the 38 trends in GMST computed from 38 runs, is the “multi-run” trend consistent with observations?
  2. H4: Averaging the 19 best estimates of the trend, one from each of 19 models, is the “multi-model” trend consistent with observations?

Both hypotheses will be tested using equations (12) & (13) in Santer17, but the test quantities will be modified so as to test the slightly modified hypotheses.

Since Santer17 relies on the Nychka method to correct for serial autocorrelation, and this method is known to result in some false positives when noise really is AR(1), I also repeated the test using the correction suggested by Lee & Lund 2008. Monte Carlo tests indicate this method slightly overcompensates and results in uncertainty intervals that are too large when the noise really is AR(1).

Observations

For this test, I use data from HadCrut, GISS and NOAA/NCDC; I also averaged the three to obtain a “Merge 3” set.

Models and Runs

For this test, I use the 38 runs downloaded from The Climate Explorer, which are discussed here.

Nomenclature

As much as possible, I will be using the notation in Santer17 (pdf). Note that the subscript “o” is used for observations, “m” for model values.

Calculation of Model Statistics

All quantities are computed as described in equations (12) and (13) in Santer17, using supporting equations as detailed in Santer17 section 4.2. However, because I define my “ensemble” slightly differently, I modify the computation slightly.

Under hypothesis H3, each run is treated as one realization in an ensemble of all possible runs of models that could pass the IPCC AR4 criterion. To calculate an average, I simply summed all 38 trends and divided by 38. All other averages over the ensemble were then computed in the normal way. This test is interesting because Gavin appears to have previously discussed averages and standard deviations using averages over individual runs.

Under hypothesis H4, the average trend from all runs of a particular model is treated as one realization of all possible models that could pass the IPCC AR4 criterion. After computing the 38 trends based on individual runs, I computed the average trend for each model. So, if a particular model was associated with 5 runs, I summed the five trends and divided by 5. This created an ensemble of 19 “models”. The average over this ensemble was then calculated in the normal way. This test is interesting because the IPCC appears to base their projections on the multi-model average over all models.

In contrast, in section 5.1.2, Santer17 applies their test to each model individually. So, for their tests, the runs for a particular model constitute the ensemble for that model.
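To make the difference between the H3 and H4 ensembles concrete, here is a minimal sketch of the two averaging schemes. The model names and trend values are made up for illustration (they are not the 38 Climate Explorer runs), and the variable names are my own.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (model, trend in C/century) pairs; NOT the actual 38 runs.
runs = [
    ("model_A", 2.0), ("model_A", 2.4), ("model_A", 1.6),
    ("model_B", 3.0),
    ("model_C", 1.0), ("model_C", 1.4),
]

# H3 ensemble: every run counts as one realization.
run_trends = [trend for _, trend in runs]
multi_run_mean = mean(run_trends)
multi_run_sd = stdev(run_trends)      # sample standard deviation over runs
n_runs = len(run_trends)

# H4 ensemble: average the runs of each model first, then treat each
# per-model average as one realization.
trends_by_model = defaultdict(list)
for model, trend in runs:
    trends_by_model[model].append(trend)
model_means = [mean(trends) for trends in trends_by_model.values()]
multi_model_mean = mean(model_means)
multi_model_sd = stdev(model_means)   # sample standard deviation over models
n_models = len(model_means)

print(f"multi-run:   mean={multi_run_mean:.3f}, sd={multi_run_sd:.3f}, n={n_runs}")
print(f"multi-model: mean={multi_model_mean:.3f}, sd={multi_model_sd:.3f}, n={n_models}")
```

A model with many runs pulls the multi-run average toward itself, while in the multi-model average every model gets equal weight. That is why the two ensembles can give different means, standard deviations and sample sizes.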

The numerical results for key quantities are provided in the table below. The top of the table provides quantities that are computed for either the models or the observations. The lower portion provides results for the t-test comparing the multi-run best estimate of the trend to each observational set.

Table 1: Results for comparison of Multi-run ensemble to observations based on Santer Equations (12) and (13).

Quantity                                                                       | Multi-Run | NOAA  | HadCrut | GISS  | Merge
Average trend (C/century): «bm»                                                | 2.22      | -0.26 | -1.24   | -0.26 | -0.59
Sample standard deviation of trends (C/century): s{〈bm〉} or s{〈bo〉}          | 2.42      | 0.71  | 0.78    | 0.95  | 0.78
Effective Number: nm or ne                                                     | 38        | 45    | 29      | 33    | 35
Degrees of Freedom (DOF): eqn. (13)                                            | NA        | 66    | 41      | 42    | 49
d1*: eqn. (12)                                                                 | NA        | 3.05  | 3.97    | 2.42  | 3.22
tinv(5%, DOF)                                                                  | NA        | 2.00  | 2.02    | 2.02  | 2.01
Multi-run estimate consistent with data?                                       | NA        | False | False   | False | False

The final row in the table indicates the conclusion of the hypothesis test: H3 is rejected at a significance level of 5%.

That is to say: The best estimate of the trend based on the multi-run average is found inconsistent with the observations during the period from Jan 2001-Sept. 2008. This particular test says we should treat the hypothesis that the average over model runs matches observations as false.

Are you wondering which values in the table tell us H3 is rejected? Ultimately, it’s the comparison of “tinv(5%,DOF)” and “d1*” for a particular observational set. The quantity d1* is the difference between the best estimate of a trend and the observed trend normalized by the standard error for the difference in these two quantities.
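As a rough check on the d1* row, the sketch below recomputes it from the trends, standard deviations and nm listed in Table 1. It assumes equation (12) has the form d1* = («bm» - bo) / sqrt( s{〈bm〉}^2 / nm + s{bo}^2 ); that is my reading of the Santer paper and is not quoted in this post, so treat the formula as an assumption. The function and variable names are my own.

```python
from math import sqrt

def d1_star(b_model, s_model, n_model, b_obs, s_obs):
    """Normalized trend difference, under the ASSUMED form of eqn. (12):
    (model mean trend - observed trend) divided by the standard error
    of that difference."""
    return (b_model - b_obs) / sqrt(s_model ** 2 / n_model + s_obs ** 2)

# Multi-run column of Table 1 (C/century).
B_MODEL, S_MODEL, N_MODEL = 2.22, 2.42, 38

# Observational columns of Table 1: (trend, standard deviation, reported d1*).
table_1 = {
    "NOAA":    (-0.26, 0.71, 3.05),
    "HadCrut": (-1.24, 0.78, 3.97),
    "GISS":    (-0.26, 0.95, 2.42),
    "Merge":   (-0.59, 0.78, 3.22),
}

recomputed = {
    name: d1_star(B_MODEL, S_MODEL, N_MODEL, b_obs, s_obs)
    for name, (b_obs, s_obs, _) in table_1.items()
}

for name, (_, _, reported) in table_1.items():
    print(f"{name:8s} recomputed d1* = {recomputed[name]:.2f}, table = {reported:.2f}")
```

Because the table entries are rounded to two decimals, the recomputed values should only be expected to agree with the tabulated d1* to within a hundredth or two.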

If d1* is larger than tinv(5%, DOF), then we reject the hypothesis that the best estimate of the trend based on the multi-run average is consistent with that particular data set. Note that tinv(5%, DOF) is approximately equal to 2, but may be higher or lower. When the number of degrees of freedom is very large, tinv(5%, DOF) approaches 1.96; otherwise, it is larger. The precise value depends on both the significance level and the number of degrees of freedom.
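The decision rule can be sketched in a few lines using the d1* and tinv(5%, DOF) values already listed in Table 1. Taking the absolute value of d1* (a two-sided test) is my assumption; for these data d1* is positive in every column, so it makes no difference here. The large-DOF limit of 1.96 is checked against the standard normal distribution from the Python standard library.

```python
from statistics import NormalDist

# (d1*, tinv(5%, DOF)) pairs copied from Table 1.
table_1 = {
    "NOAA":    (3.05, 2.00),
    "HadCrut": (3.97, 2.02),
    "GISS":    (2.42, 2.02),
    "Merge":   (3.22, 2.01),
}

# The multi-run estimate is "consistent" with a data set only if |d1*|
# does not exceed the critical t value for that data set.
consistent = {
    name: abs(d1) <= t_crit for name, (d1, t_crit) in table_1.items()
}

# With very many degrees of freedom the t distribution approaches the
# normal, so the two-sided 5% critical value approaches about 1.96.
large_dof_limit = NormalDist().inv_cdf(0.975)

for name, ok in consistent.items():
    print(f"{name:8s} consistent with multi-run estimate? {ok}")
print(f"large-DOF limit of tinv(5%, DOF): {large_dof_limit:.3f}")
```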

Multi-Model Results

The computations to test H4 are similar to the multi-run test performed above. However, the average, standard deviation and number of samples for the “model” are modified to match the values for the ensemble of 19 models rather than 38 runs. The best estimate of the multi-model trend based on averaging over the best estimate from 19 models was 2.59 C/century; the standard deviation was 2.25 C/century. The values for the observations are unaffected.

The main result is that the best estimate based on the multi-model ensemble is found inconsistent with observations at a significance level of 5%. So, this test says we should treat the hypothesis that the best estimate of the trend based on averaging over models matches observations as false.

Lee and Lund Results

I repeated both tests using the modification described in Lee and Lund (2008). This involves modifying equation (6) in Santer to account for a finite number of observations. The final conclusions were unchanged: the best estimates of the trend based on both the multi-run average and the multi-model average are inconsistent with observations. That is: the tests guide us to treat as false both the hypothesis that the best estimate of the trend based on the average over runs matches observations and the hypothesis that the best estimate based on the average over models does.

Caveats

This result is interesting because it demonstrates that if we account for model uncertainty as discussed in Santer17, the best estimates based on either multi-model or multi-run averages are inconsistent with observations. This result is for a forecast rather than a hindcast, and so represents a test of the models’ ability to predict events before they occur.

As in my previous article discussing Santer17, I note that there are caveats associated with the application of the method of Santer17. I advise readers to read the paper; I plan to discuss these sometime in the future.

Future plans

There are some caveats associated with application of Santer17 to data since Jan 2001 or any data.

The first caveat relates to Santer17’s use of AR(1) noise. Their test is based on the assumption that all deviations from linear trends in the observations are weather noise and that this noise is AR(1). As I see it, the use of AR(1) tends to underestimate the confidence intervals, resulting in too many false positives. That is, we say the models are wrong when they are correct.
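For readers who want to see how an AR(1) assumption enters the trend uncertainty, here is a minimal sketch. It assumes the adjustment takes the common effective-sample-size form ne = nt (1 - r1) / (1 + r1), where r1 is the lag-1 autocorrelation of the residuals from the linear fit; I believe this matches equation (6) in Santer, but take that as an assumption. The series is synthetic AR(1) noise with made-up parameters, not GMST data, and all function names are mine.

```python
import math
import random

def ols_residuals(y):
    """Residuals of y about its least-squares line (x = 0, 1, 2, ...)."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    return [v - y_mean - slope * (t - t_mean) for t, v in enumerate(y)]

def lag1_autocorr(x):
    """Lag-1 sample autocorrelation of a series."""
    m = sum(x) / len(x)
    num = sum((a - m) * (b - m) for a, b in zip(x[:-1], x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den

def effective_n(n, r1):
    """ASSUMED effective sample size under AR(1): n (1 - r1) / (1 + r1)."""
    return n * (1 - r1) / (1 + r1)

def ar1_series(n, phi, sigma, rng):
    """Synthetic AR(1) noise: x[t] = phi * x[t-1] + white noise."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, sigma)
        out.append(x)
    return out

rng = random.Random(0)
n_months = 93                      # Jan 2001 through Sept 2008
noise = ar1_series(n_months, phi=0.5, sigma=0.1, rng=rng)  # made-up parameters

r1 = lag1_autocorr(ols_residuals(noise))
n_eff = effective_n(n_months, r1)
# Ratio of the adjusted to the naive standard error of the trend when
# the residual variance is divided by (n_eff - 2) instead of (n - 2).
inflation = math.sqrt((n_months - 2) / (n_eff - 2))

print(f"r1 = {r1:.2f}, effective n = {n_eff:.1f}, SE inflation = {inflation:.2f}")
```

Positive autocorrelation shrinks the effective number of independent samples and so widens the uncertainty interval on the trend. If the real noise is more persistent than AR(1) allows, this adjustment does not widen the interval enough, which is the concern described above.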

Santer17 discusses this, and it’s a fair point. We know from previous tests that treating the weather noise as “ARMA(1,1)” or “AR(1)+White Noise” widens the uncertainty intervals. I’ll be posting this month’s test using ARMA(1,1) at some point. (They will not have changed drastically since last month.)

Despite that caveat, I think it is interesting to examine the results if we use the method in Santer17, but we should bear in mind it would be wiser to use ARMA(1,1).

The second caveat is a bit more complicated. Santer17 treat all deviations from a linear trend as noise and assume this noise is “iid”. The assumption that this “noise” is independent across models is certainly untrue for model runs that include volcanic eruptions as external forcings. I’ll be discussing this in detail later, so I will not belabor it now.

For now: The GMST since 2001 fails the Santer17 test for the “multi-model ensemble-mean trend” based on the 19 models and 38 runs I have been able to download.

Handy back links

New visitors to the blog frequently ask the same questions. In anticipation of this, I am providing some links to previous articles.

  1. Previous article on Santer17.
  2. The choice of 2001 as “start date” and its effect on results of hypothesis tests is discussed here.
  3. The effect of ENSO is discussed in many places. Most recently, here.
  4. The choice of statistical model matters. Last month’s test using AR(1)+white noise is here. An updated analysis for GISS using data through September 2007 is here.
  5. The 38 runs downloaded from The Climate Explorer are discussed here. Checking for updates and downloading additional runs is on my “to do” list.

Comments

Bob B (Comment#5950)

Lucia, at some point could you run the same test using UAH and RSS data? Surface temp data is UHI infected and I think between the work Steve and Anthony have done shows it is basically crap.

lucia (Comment#5951)

Bob B–
In principle yes.

But in practice, there are people who just can’t get over the idea of comparing the temperatures of the lower troposphere to the surface. Silly distracting and time consuming arguments ensue.

So, rather than do something that ends up wasting loads of my time on silly arguments, I would like to have model predictions for the lower troposphere. I don’t currently have these. So, currently, doing that test involves getting gridded data, computing monthly averages of GMST, then doing the comparison.

Even though the trends aren’t all that different, doing the comparison you prefer is a lot of work!

But, I’d be very interested if someone else did the comparison. :)

Anyway, I’m not convinced the surface-based data sets are as bad as you suggest. They correlate well with UAH and RSS, so clearly, they can’t be all that horrific!

Les Johnson (Comment#5952)

Good job, Lucia. It took me several read-throughs, to get my layman’s brain around it, and I have a headache, but that’s what beer is for…

I would think that there might be a paper in those calculations. Any thoughts on doing that?

Raven (Comment#5953)

Lucia,

Are you suggesting that the noise model (AR(1) or whatever) is a way to estimate the distribution of beaker’s ‘infinite earths’?

Roger A. Pielke Sr. (Comment#5954)

Hi Lucia – With respect to the value of the surface temperature data to assess long term trends, we also find that they correlate well with interannual and longer term variations; e.g. see Figure 20 in

Pielke Sr., R.A., C. Davey, D. Niyogi, S. Fall, J. Steinweg-Woods, K. Hubbard, X. Lin, M. Cai, Y.-K. Lim, H. Li, J. Nielsen-Gammon, K. Gallo, R. Hale, R. Mahmood, S. Foster, R.T. McNider, and P. Blanken, 2007: Unresolved issues with the assessment of multi-decadal global land surface temperature trends. J. Geophys. Res., 112, D24S08, doi:10.1029/2006JD008229.
http://www.climatesci.org/publ...../R-321.pdf

However, since there is a warm bias in the long term surface temperature trends, the warming that the global data sets present is overstated.

P.S. Great weblog series you are completing. I look forward to seeing this in the literature too!

Roger Sr.

lucia (Comment#5955)

Raven–
No. The noise model is somewhat independent of the “infinite earths”. There will be a “volcanoes and Santer” post, which is more related to beaker’s idea of “infinite earths.” The question is: what constraints do we put on that set of eligible earth weather? There is someone on CA mumbling the right things. I’m trying to remember who….. It’s a difficult thing to express without getting concrete. Ahh… it’s MC!

They are trying to fit a linear trend that cannot be assumed. They should average measured data sets and provide errors but the variation that is observed will be a real observation (depending on the data set integrity/modelling issues of RSS as Steve mentioned before). This on the face of it is a more complicated function. If one were to assume in nature there exists an intrinsic variation (which has been captured by measurement) then the model outputs should converge to this actual data trend.

[...]
As for the models they assume that there is a ym = phi_m + noise meaning and that the long term trend is to reduce to a linear time-independent function (autocorrelation effects should disappear as the ENSO noise should reduce. This is as stated in posts above). So in essence what you have is the intersection of a linear trend over the period with a higher order variation (real data) at two points (Jan 1979 and Dec 1999).

What he’s saying is this:

The method of Santer assumes for the models that if you ran them over and over and over, you would get a linear trend during the period analyzed. All deviations from linear are “weather noise”.

Another thing he could have said is the deviations from linear in the separate realizations are uncorrelated. So, all dips should average out and if we ran the models over and over, we get a straight line during the period of analysis.

The purple line is the one we get if we average over volcano runs:

If we call deviations from a straight line “noise”, then the “noise” in model run 1 is not independent from the “noise” in model run 2 and so on. The model runs all share the “noise” from Pinatubo!

Now, think about beaker’s “infinite earths”. Are they all infinite earths conditioned on sharing the same forcing since 1880? And starting from weather sort of like that in 1880? And with the same volcanic eruptions? Including Pinatubo and El Chichón?

Those are conditions for the model. But the Santer statistical method implicitly treats the deviations from linear due to volcanic eruptions as something that would average out over all these infinite earths!

You can do all the algebra you like, but this is a conceptual error.
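A toy illustration of the point, with entirely made-up numbers: two synthetic “runs” share the same trend and the same volcanic-style cooling dip but have independent white noise. After removing a straight line from each, the leftover “noise” in the two runs is strongly correlated, because both contain the same dip. The dip shape, noise level and function names are all my own assumptions; nothing here comes from the actual model runs.

```python
import math
import random

def detrend(y):
    """Residuals of y about its least-squares line (x = 0, 1, 2, ...)."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    return [v - y_mean - slope * (t - t_mean) for t, v in enumerate(y)]

def pearson(a, b):
    """Pearson correlation between two equal-length series."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def make_run(forced, trend, sigma, rng):
    """Linear trend + shared forced signal + independent white noise."""
    return [trend * t + f + rng.gauss(0.0, sigma) for t, f in enumerate(forced)]

n = 120                                    # months in the toy analysis period
# Hypothetical eruption at month 40: a 0.4 C dip that decays over ~12 months.
dip = [-0.4 * math.exp(-(t - 40) / 12) if t >= 40 else 0.0 for t in range(n)]
no_dip = [0.0] * n

rng = random.Random(1)
with_volcano = [make_run(dip, 0.002, 0.03, rng) for _ in range(2)]
without_volcano = [make_run(no_dip, 0.002, 0.03, rng) for _ in range(2)]

r_shared = pearson(detrend(with_volcano[0]), detrend(with_volcano[1]))
r_indep = pearson(detrend(without_volcano[0]), detrend(without_volcano[1]))

print(f"residual correlation, shared eruption: {r_shared:.2f}")
print(f"residual correlation, no eruption:     {r_indep:.2f}")
```

In the no-eruption case the residuals are close to uncorrelated, which is what the statistical method assumes. With a shared eruption they are not, so averaging over more runs does not make the dip go away.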

Bob B (Comment#5956)

Lucia, it has been shown almost 1/2 of the warming in the US can be traced to UHI effects:

http://www.uoguelph.ca/~rmckit.....RDec07.pdf

lucia (Comment#5957)

Bob–
I’m not disputing that! But ideally, the lower troposphere measurements should be compared to projections to the lower troposphere.

I personally think it’s interesting to compare to the 2C/century, but it results in too much time wasted engaging those who don’t like it. If you know of a resource with somewhat pre-digested “model data” providing monthly average temperatures for the lower troposphere, I’ll be glad to put UAH and RSS back in the mix. But I only have a certain amount of time, and there are a number of things I want to say about the surface measurements.

steven mosher (Comment#5958)

I’m glad to see you follow up on Santer and Volcanos. I felt like I had taken crazy pills.

DeWitt Payne (Comment#5963)

So the Santer et al. test fails in the limit of perfect measurement because it overestimates the variability of bo by (incorrectly) assuming that the residuals are always noise? That would lead to the opposite error from Douglass et al. in that models or ensembles would be accepted that should have been rejected.

lucia (Comment#5964)

DeWitt–
Essentially, yes. I’ll be discussing this later.

The difficulty is the Santer method has two assumptions that offset each other in terms of estimating confidence intervals.

* One assumption makes their method overestimate the confidence interval, and results in too few “falsifications”. This assumption is to treat residuals from linear as arising from “noise” processes that are uncorrelated between different model runs. In all model runs with volcanic eruptions, Pinatubo blasts off at the same time. So, this “noise” is correlated between runs. (In beaker’s notion of a bajillion realizations of earth weather starting with different IC’s, they assume the Pinatubo deviation is uncorrelated from earth “realization” to earth “realization”, all of which have Pinatubo erupting at the same time.)

* The other assumption makes their method underestimate the confidence intervals. They assume the “noise” is AR(1). Though we can’t know the statistical process for the noise on the true earth with any great certainty, it seems likely that the measurements of GMST (and probably many other features) are a higher order process, or something else entirely. If so, using AR(1) is likely to underestimate the uncertainty intervals.

I have no idea which issue dominates and so can’t say whether their uncertainty intervals are, in the end “too large” or “too small”.

However, if I use AR(1) + white noise and/or ARMA(1,1) getting the coefficients from a period with no major volcanic eruptions, it appears to me that
a) The Santer confidence intervals are a little too small for what they test but
b) Basing the parameter estimates for ‘weather noise’ on periods with volcanic eruptions probably results in confidence intervals that are too large, assuming the statistical model for ENSO/PDO/AMO etc. “weather noise” is appropriate.

So, I use the “no volcano” estimate for ARMA(1,1)! This puts my confidence intervals in between those currently advocated by Tamino and those in Santer17.

Jeff Id (Comment#5992)

Lucia,

Really nice stuff. I also made a comment about the assumptions regarding weather noise at CA, nowhere near as detailed as above. I haven’t got much time lately but I will be back to see your next posts.

IPCC 2C/century: Remains in Low Confidence Region Through Sept. | The Blackboard (Pingback#6329)

[...] Recall, in Santer17, the authors assumed the “weather noise” was AR1, but also included the uncertainty in the best estimate of the trend based on models “model noise”. We already know that if we use that and apply it to test GMST since Jan 2001, the models are found inconsistent with data. [...]

Niche Modeling » Review of Antarctic Snowfall (Pingback#34656)

[...] here and a 600 comment discussion follows. Or search on douglass for the saga, see Lucia’s posts and Air [...]

 
