The subject of today’s post is, once again, Santer17, the paper intended to rebut Douglass’s contention that AGCM model predictions of Tropical Tropospheric Temperatures (TTT) are inconsistent with observed data. While the paper has its merits, it also has some curious features, some of which have been discussed at Climate Audit. I’ve mostly focused on examining the conclusions we would draw if the method of Santer17 were applied to data for Global Mean Surface Temperature (GMST).
In the process of comparing models to data, I paused to look into this perplexing issue: Santer17 claims their statistical method is “too liberal” in the sense that, when applied, their test will tend to reject models that are correct more often than dictated by the choice of confidence level.
I believe that, as implemented in Santer17, this claim is false. Because their method implicitly assumes the “underlying signal” varies linearly with time, but they apply their method during a period when the “underlying signal” is clearly non-linear, their test is, more plausibly, “too strict”. That is: if the “noise” is AR(1) (as assumed in Santer17), and the “underlying signal” varies non-linearly with time, the Santer17 test will result in too few rejections.
In part 1 of this series, I provided a qualitative explanation of why I believe the Santer17 test is not too liberal, or at least not too liberal as applied in Santer17.
Today, I will report some quantitative results estimating the magnitude of the effect of non-linearity on the false rejection rate for the statistical test used in Santer17. Since this is a blog, for brevity I will borrow notation from Santer17, available here (pdf).
Outline
The magnitude of the effect of non-linearity in an “underlying temperature signal” on the liberality (or non-liberality) of the Santer Test will be estimated using the following steps:
1. Propose a plausible form for the “underlying signal” for monthly average temperatures φm(t) for the GMST, based on the average of 27 model predictions.
2. Estimate the magnitude of the residuals from linear fits, se, and the lag-1 autocorrelation, r1, for model predictions of GMST, based on average values from fits to 27 model runs.
3. Run synthetic experiments in which the “underlying signal” for each month is φm(t), described in step 1 above, and the noise is AR(1), with a lag-1 autocorrelation and innovations selected such that, when averaged over many runs, the residuals and lag-1 autocorrelation for the resulting noise + underlying-signal process match those estimated in step 2 above.
4. Repeat step 3, but replace the underlying signal with a linear trend and use an AR(1) process with the lag-1 temporal autocorrelation from step 2. The purpose of this step is to show agreement with Santer17’s claim that the method is “liberal” provided the underlying trend is linear.
5. Apply the method of Santer17 and report the “false positive” rate.
Plausible form of Underlying Trend
The term “underlying trend” appears to be used by Santer17 to describe what people in some other fields refer to as “the expected value” of some quantity that includes a random component. In principle, we would either determine this from theory or estimate it by sampling the quantity an infinite number of times. For the problem at hand, it is not possible to sample the GMST for any particular month on earth an infinite number of times. The best we can do is use model predictions as proxies.
So, to estimate a plausible form for the underlying trend on earth, I:
- Obtained 27 model runs that include volcanic eruptions from “The Climate Explorer”.
- Determined the average monthly anomaly in GMST based on these 27 runs.
- Smoothed the average to eliminate the jitter from the individual runs.
The result for smoothed GMST from 1970 to now is shown below. (Further analysis uses the smoothed data from 1979-1999 inclusive.)
Assumed “Underlying Signal” for GMST
Using the process above, I obtained monthly GMST values, shown in red below:
[Figure 1: Average monthly GMST anomaly from 27 model runs (blue) and the smoothed “underlying signal” (red), 1970 to present. The purple box marks Jan 1979-Dec 1999.]
The blue symbols indicate the average GMST based on 27 model runs. (The choice of runs is discussed here.) The higher-frequency “jitter” in this average exists because 27 runs are not sufficient to entirely smooth out the random component often referred to as “internal variability” or “model weather”. However, this jitter is easily distinguished from the temperature dips after the eruptions of El Chichón and Pinatubo, which are surely “underlying signal” and would not vanish if modelers reran their models many times.
Since we anticipate a plausible “underlying signal” would include the dips due to the two major volcanic eruptions, but not the jitter due to “weather noise”, I reduced “weather noise” by smoothing the data. The smoothed data are illustrated in red. The purple box indicates the portion of the data between Jan 1979-Dec 1999, as used in Santer17.
The smoothed data are used to represent the “underlying climate signal” in later steps. (For those wondering: Yes, this is an assumption going forward.)
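For those who want to reproduce something similar, a minimal sketch of the averaging-and-smoothing step is shown below. The post does not specify the smoother used for the red curve, so the 12-month centered moving average here is purely an illustrative choice, and the array shape is an assumption.

```python
import numpy as np

def smoothed_underlying_signal(runs, window=12):
    """Average an ensemble of monthly GMST anomaly series, then smooth out "model weather" jitter.

    runs   : array of shape (n_runs, n_months), e.g. 27 model runs on a common monthly grid.
    window : width, in months, of a simple centered moving average.  The smoother actually used
             for the figure is not stated in the post; 12 months is only an illustrative choice.
    """
    ensemble_mean = runs.mean(axis=0)          # average over the runs (the blue curve)
    kernel = np.ones(window) / window
    # mode="same" keeps the series length; the first and last few months are less reliable
    # because the averaging window hangs off the ends of the record.
    return np.convolve(ensemble_mean, kernel, mode="same")

# Example use: feed in runs covering 1970 onward, then keep only Jan 1979-Dec 1999
# of the returned series for use as the assumed "underlying signal" phi_m(t).
```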
Magnitude of residuals
To estimate the typical value of residuals in models, I used Excel to apply an OLS fit to each of the 27 individual model runs described above. I also determined the residuals and lag-1 temporal autocorrelation for each of the 27 runs.
The average values are provided in the table below; values matched in later synthetic experiments are shown in bold. (An excessive number of significant figures is provided for purposes of matching characteristics in later steps. A number of ancillary values are provided for those familiar with Santer17.)
Table 1.

| | OLS trend (C/century) | Residuals from linear fit, se (C) | Lag-1 correlation, ρ | (1+ρ)/(1-ρ) | Neff | Standard error in OLS trend using Santer17 (4), (5) and (6) (C/century) |
|---|---|---|---|---|---|---|
| Average result of 27 OLS fits to the 27 individual runs | 1.896 (See 1.) | **±0.1548** | **0.7399** | 8.38 | 39.2 | ±0.49 |
| Trend based on smoothed average of 27 runs with volcanic forcing | **1.887** (See 1.) | ±0.0717 | 0.9992 | 2418 | 0.1 | ~ ±3.67 (See 2.) |

Notes:
1. The average of the individual trends does differ from the trend for the average, and smoothing alters the mean trend slightly.
2. This value is provided to illustrate the uncertainty we would deduce if we believed the bulges in the “underlying signal” were noise. The method of Santer17 cannot actually be used to estimate the uncertainty here because the number of effective samples is less than 2: inserting this into the equations results in a negative variance, which violates the definition of a variance, and exact application of equations (4)-(6) would require taking the square root of a negative number, making the magnitude of the confidence interval imaginary. To provide a reference for comparison, I adjusted the procedure by not subtracting the “2” that accounts for the degrees of freedom consumed by the OLS fit.
Out of curiosity, and to determine the OLS trend associated with the assumed “underlying trend”, I also applied the fit to the “underlying trend” discussed above; this is the second row of the table.
This is a bit of a tangent. However, the exercise does illustrate why treating all excursions from linear behavior as “noise” is a bit odd. If the “underlying signal” is thought to be the signal in the presence of these specific eruptions, timed as they are, then the excursions are not “noise”, they are signal. And if we choose to treat them as noise, we are forced to treat them as noise consistently.
This means that if we applied the method of Santer17 to the underlying signal itself for the data between 1979 and 1999, with the endpoints as selected, we would conclude there is substantial statistical uncertainty (the ~ ±3.67 C/century entry in the table) associated with the determination of the OLS trend fit to that data.
This is nuts!
While one may correctly argue that fitting a line to the non-linear “underlying signal” from 1979-1999 is inappropriate, we know there is no statistical uncertainty associated with applying a fit to a deterministic signal containing no noise! If this were the true underlying trend, and the models were perfect, then as the number of runs approaches infinity, the statistical uncertainty in determining the best-fit trend to that line would approach zero.
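To see why note 2 of the table required bending the procedure, plug the numbers for the smoothed series into the effective-sample-size adjustment implied by the (1+ρ)/(1-ρ) and Neff columns, taking n = 252 months for Jan 1979-Dec 1999. (This is my arithmetic, not a calculation from Santer17.)

```latex
n_{\mathrm{eff}} \;=\; n\,\frac{1-\rho}{1+\rho}
\;=\; 252 \times \frac{1-0.9992}{1+0.9992} \;\approx\; 0.1 ,
\qquad n_{\mathrm{eff}} - 2 \;<\; 0 .
```

With n_eff - 2 negative, the adjusted variance of the trend is negative and the confidence interval becomes imaginary, which is why the ~ ±3.67 C/century entry could only be produced by skipping the subtraction of 2.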
Results of Synthetic Model Simulations
To perform the synthetic simulations, I created series of monthly data, ym(t), that obeyed the equations:

ym(t) = φm(t) + ηm(t)
ηm(t) = ρ ηm(t-1) + εm(t)

where φm(t) is the assumed “underlying signal”, ηm(t) is the AR(1) “weather noise”, and εm(t) are the white-noise innovations.
Volcano simulations
In the case where the underlying trend is assumed non-linear, the monthly data for φm(t) were taken from the values for the underlying trend determined above. (That is, the data points plotted in red in figure 1.)
The “noise” ηm(t) was assumed AR(1), with properties that resulted in se = 0.1548 C and lag-1 autocorrelation ρ = 0.7399. These are the average values for the 27 model runs, highlighted in Table 1.
The appropriate innovations and autocorrelation for the AR(1) process were determined by iteration. Once they were determined, I ran 50,000 simulations, applied the method of Santer17 to each of the 50,000 runs, and calculated the rejection rate at significance levels of 5%, 10% and 20%. The results indicate that the method of Santer17 would produce approximately 1.7% rejections at the 5% significance level, 4.1% at the 10% significance level, and 10.8% at the 20% significance level.
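For concreteness, here is a minimal sketch of how such an experiment can be run. It is my reconstruction, not code from Santer17: the standard-error adjustment follows my reading of equations (4)-(6), the rejection decision compares the fitted trend to the applied trend using two-sided normal critical values (an assumption about how the confidence level is converted to a threshold), and the AR(1) parameters are assumed to have been tuned beforehand, as described above.

```python
import numpy as np

Z_CRIT = {0.05: 1.960, 0.10: 1.645, 0.20: 1.282}        # two-sided normal critical values

def ar1_series(n, rho, sigma_innov, rng):
    """AR(1) "weather noise": eta(t) = rho*eta(t-1) + eps(t), started from the stationary distribution."""
    eta = np.empty(n)
    eta[0] = rng.normal(0.0, sigma_innov / np.sqrt(1.0 - rho**2))
    for i in range(1, n):
        eta[i] = rho * eta[i - 1] + rng.normal(0.0, sigma_innov)
    return eta

def trend_is_rejected(y, b_true_per_month, alpha):
    """OLS trend with the autocorrelation-adjusted standard error (my reading of Santer17
    eqs (4)-(6)); returns True if the applied trend is rejected at significance level alpha."""
    n = len(y)
    t = np.arange(n, dtype=float)
    b, a = np.polyfit(t, y, 1)                           # slope (C/month) and intercept
    resid = y - (a + b * t)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]        # lag-1 autocorrelation of the residuals
    n_eff = n * (1.0 - r1) / (1.0 + r1)                  # effective sample size
    s_b = np.sqrt((resid @ resid) / (n_eff - 2.0) / np.sum((t - t.mean()) ** 2))
    return abs(b - b_true_per_month) / s_b > Z_CRIT[alpha]

def rejection_rate(phi, rho, sigma_innov, b_true_per_century, alpha, n_sims=50_000, seed=0):
    """Fraction of synthetic series (phi + AR(1) noise) for which the applied trend is rejected."""
    rng = np.random.default_rng(seed)
    b_true = b_true_per_century / 1200.0                 # C/century -> C/month
    hits = sum(
        trend_is_rejected(phi + ar1_series(len(phi), rho, sigma_innov, rng), b_true, alpha)
        for _ in range(n_sims)
    )
    return hits / n_sims

# Volcano case: phi = the smoothed Jan 1979-Dec 1999 signal (252 monthly values),
# rho = 0.6918, sigma_innov = 0.1013, b_true_per_century = 1.887, alpha in {0.05, 0.10, 0.20}.
# Linear case: replace phi with a straight line of the same trend, rho = 0.7399, sigma_innov = 0.1041.
```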
In the table below, the applied parameters are on the left, the simulation results (averaged over 50,000 runs) are on the right, and the last three columns give the rate at which the applied trend is rejected at the stated significance level.

| Case | Applied trend (C/cent.) | Applied innov., AR(1) (C) | Applied lag-1 autocorr., ρ | Simulated trend (C/cent.) | Simulated residuals from linear fit, se (C) | Simulated lag-1 autocorr., ρ | Rejected at 5% | Rejected at 10% | Rejected at 20% |
|---|---|---|---|---|---|---|---|---|---|
| Targets (See 1.) | 1.887 | | | | ±0.1548 | 0.7399 | 5% | 10% | 20% |
| Volcanic forcing | 1.887 | 0.1013 | 0.6918 | 1.888 | ±0.1547 | 0.7401 | 1.7% | 4.1% | 10.8% |
| Linear | 1.887 | 0.1041 | 0.7399 | 1.884 | ±0.1508 | 0.7196 | 5.8% | 11.0% | 20.8% |

Note:
1. The target trend matches the average trend for the smoothed average of 27 runs with volcanic forcing. The targets for the residuals and lag-1 correlation match the averages over the 27 individual runs.
So, examining the table, we see that for cases where the “underlying trend” shares characteristics expected to arise from the volcanic eruptions that occurred during the analysis period, applying Santer17’s analysis results in rejecting correct models less frequently than the nominal significance level anticipates. So, contrary to the claim in Santer17, the test is not “too liberal”; it is “too strict”.
Since the modeling convention is to treat the forcing due to volcanic eruptions that actually occurred as known, deterministic forcing, the deviations in GMST (or any temperature signal) due to these eruptions ought not to be suddenly viewed as “random” when testing models. (Certainly, these excursions ought not to be treated as deterministic when rejoicing over models’ ability to replicate them, and then as noise when doing so conveniently inflates the uncertainty intervals.)
Linear trend simulations
We can repeat the analysis by assuming the trend φm(t) is linear and the noise is AR(1) with a lag-1 autocorrelation of ρ = 0.7399, driven by an innovation that would result in average residuals of se = ±0.1548 C if we ran the process for an infinite number of months.
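As a check on the innovation listed in the “Linear” row of the table, the variance formula for a stationary AR(1) process gives (ignoring the small amount of variance absorbed by the OLS fit; this is my arithmetic, not a formula from Santer17):

```latex
\operatorname{Var}(\eta) \;=\; \frac{\sigma_{\varepsilon}^{2}}{1-\rho^{2}}
\quad\Longrightarrow\quad
\sigma_{\varepsilon} \;=\; s_e\sqrt{1-\rho^{2}}
\;=\; 0.1548\sqrt{1-0.7399^{2}} \;\approx\; 0.104\ \mathrm{C},
```

which matches the 0.1041 C innovation in the table. In the volcanic case the same formula does not apply directly: by this formula the AR(1) noise alone (ρ = 0.6918, innovation 0.1013) would contribute only about 0.14 C of residual, with the remainder coming from the non-linear excursions in the signal, which is why those parameters had to be found by iteration.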
In that case, we do replicate the results indicated in Santer17: The 5% significance level results in approximately 5.8% rejections and so on. That is to say: If the underlying trend or climate signal really is linear, and the noise really is AR(1), use of the method in Santer17 would reject correct models more often than intended.
So what does this mean?
Owing to dramatic non-linearities in the “underlying signal” for temperature from 1979-1999, the method of Santer17, when applied to GMST, may be “too strict” rather than “too liberal” as claimed in Santer17.
It is known that the eruptions of both Pinatubo and El Chichón result in distinct cooling followed by a rebound in temperature. The phenomenology of the effect of major volcanic eruptions on global temperature is fairly well understood, and it is treated as deterministic in the forcing files for models. While each run may manifest ENSOs, PDOs and AMOs at different times, all runs schedule the volcanic eruptions when they actually occurred.
The temperature dips after Pinatubo and El Chichón exist in the expected value (i.e., the “underlying signal”) for all runs simulating weather between 1979 and 1999.
Santer17’s decision to treat this effect as “weather noise” that averages out over all model runs (or over all possible realizations of earth’s weather between 1979 and 1999 forced by these two specific volcanoes) will tend to make their method of testing hypotheses about model means too strict. Assuming the model for the noise is correctly chosen (i.e., the noise is AR(1)), treating all deviations from a linear trend as “noise” when a large portion is “signal” will tend to make it too difficult to reject models that are wrong.
Open Questions
So, is the Santer17 test too strict overall?
Actually, I don’t know. I have discussed the issue of the non-linearity in GMST due to the eruptions of Pinatubo and El Chichón. However, like Santer17, my analysis assumes “internal climate variability” is AR(1). If the internal climate variability is not AR(1), their method may be too strict or too liberal. We cannot answer that question without determining the correct form of the noise.
As my readers know, I have been using AR(1) + white noise, but estimating the parameters based on periods without volcanic eruptions. With luck, those who have not understood why I focus on the “no volcanoes” period may now understand the reason. It is:
No matter what period we select, if we conflate the deterministic effect of volcanic eruptions on GMST with “ENSO-PDO-AMO” type noise, we will obtain uncertainty intervals that are too large. The result will be that we will be unable to detect real differences between the best estimate of the trend based on models and the best estimate exhibited by observations.
Fooling ourselves in this way is dangerous because it tempts us to be too certain about the accuracy and precision of model projections.
Other articles of interest
You may be interested in the first post in this series, Application of Santer17 to test GMST.