ARIMA(1,1) MC corrected: GISTemp trends inconsistent with 0.2C/decade.

In Using ARMA(1,1): Reject AR4 projections of 0.2 C/decade, I tested the trends exhibited by GISTemp ending with August 2012 data, using various start dates, under the assumption that the temperature series during each individual period could be treated as a linear trend plus noise, with the noise described as ARMA(1,1). The result of that computation was summarized in this graph:

That graph suggested that if we accept the statistical model, the trends computed using Jan 2001-August 2012 data or Jan 2002-August 2012 data fall below 0.2C/decade (the AR4 nominal projection) and that the difference observed is statistically significant. This is diagnosed by noting that the two black open circles on the far right hand side of the upper black trace fall below the blue dashed line indicating 0.2C/decade. Those black open circles are supposed to represent the ±95% confidence intervals for the trend if the residuals to the fit are “noise” (that is: not forced response) and if the “noise” can be described using an ARMA process.
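For concreteness, a minimal sketch of that kind of fit in R (the vector giss is a placeholder for the monthly GISTemp anomalies over the period being analyzed; this is the idea, not the exact script):

    # Linear trend with ARMA(1,1) errors, fit by maximum likelihood.
    # 'giss' is assumed to hold the monthly anomalies for, e.g., Jan 2001-Aug 2012.
    t_dec <- (seq_along(giss) - 1) / 120                     # time in decades
    fit   <- arima(giss, order = c(1, 0, 1), xreg = cbind(trend = t_dec))
    slope <- unname(fit$coef["trend"])                       # trend, C/decade
    se    <- unname(sqrt(fit$var.coef["trend", "trend"]))    # nominal standard error
    c(lower = slope - 1.96 * se, upper = slope + 1.96 * se)  # uncorrected ~95% interval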

Since the AR4 projections are based on SRES frozen in 2001, and since one should test forecasts against future data– not past data– this should lead those who accept ARMA(1,1) models as describing the noise to suspect the multi-model mean is biased high.

However, I observed that we know the uncertainty intervals shown in the graph above are too small. Specifically, while that trace is supposed to indicate a random excursion that happens at a rate of less than 5% if the observed mean is “true”, running monte carlo shows that the method used to estimate the trend rejects too frequently. Specifically, if we run monte carlo tests to determine how often we see excursions of that size when the (ar and ma) coefficients match those of the observations computed based on the GISTemp series, the rejection rates are as shown in black below:

Note that all rejection rates are higher than 5%.
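A minimal sketch of that sort of monte carlo check (ar_giss, ma_giss and sd_giss are placeholders for the ARMA coefficients and innovation standard deviation estimated from GISTemp; again, this shows the idea, not the exact script):

    # Generate ARMA(1,1) noise around a known trend, re-fit with arima(), and count
    # how often the nominal 95% interval misses the true slope.
    n      <- 140                               # months, e.g. Jan 2001 - Aug 2012
    m_true <- 0.2                               # trend used to generate the data, C/decade
    t_dec  <- (seq_len(n) - 1) / 120
    nrep   <- 10000
    tstat  <- rep(NA_real_, nrep)
    for (i in seq_len(nrep)) {
      y   <- m_true * t_dec + arima.sim(list(ar = ar_giss, ma = ma_giss), n = n, sd = sd_giss)
      fit <- try(arima(y, order = c(1, 0, 1), xreg = cbind(trend = t_dec)), silent = TRUE)
      if (inherits(fit, "try-error")) next      # skip the occasional non-convergent fit
      tstat[i] <- (fit$coef["trend"] - m_true) / sqrt(fit$var.coef["trend", "trend"])
    }
    mean(abs(tstat) > 1.96, na.rm = TRUE)       # empirical rejection rate; ~0.05 if the method worked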

Fortunately, it is possible to adjust the rejection rate using the same monte carlo runs we use to detect that the rejection rate is too high. I ran monte carlo at the (ar, ma) values estimated based on the GISTemp data and corrected the uncertainty intervals. The method of correcting the intervals is described in corrections for arima.
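The linked post describes the correction itself; purely to illustrate the idea (reusing slope, se and tstat from the sketches above), one simple variant replaces the nominal 1.96 with an empirical critical value taken from the same runs:

    # Empirical 97.5% quantile of |t| from the monte carlo runs above; using it in
    # place of 1.96 makes the interval cover ~95% by construction for this (ar, ma, n).
    crit <- quantile(abs(tstat), 0.975, na.rm = TRUE)
    c(lower = slope - crit * se, upper = slope + crit * se)  # corrected ~95% interval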

The new uncertainty intervals are shown in purple below:

Because the initial post was motivated by a commenter asking why my uncertainty intervals seemed larger than those estimated by Tamino, estimates based on some attempts to apply his method are shown. The “green” method is an application of the “Tamino” method with (ar, ma, and sigma) all estimated from the time series being analyzed, sigma being the magnitude of the residuals to the ARMA fit. The “blue” method is the “Tamino” method with (ar, ma) computed based on the time series since 1976, but the ‘sigma’ computed based on the time series under consideration; red is the “Tamino” method with (ar, ma and sigma) all computed based on the time series since 1976.

It is worth noting that my “purple” method should work if

  1. The forced response during the period for which we compute uncertainty intervals is linear.
  2. The residuals to the forced response are ARMA11.
  3. The residuals are normally distributed and homoskedastic.

So, for example, if we are estimating uncertainty intervals for trends computed since 2001, these assumptions must apply to the forced trend and noise for the period from Jan 2001-now. No assumption is made about the forced linear trend or nature of noise outside that period.

Tamino’s “blue” or “red” methods require the additional assumption that these conditions apply to the period since 1976 even if one is computing the uncertainty over shorter time spans. So, if — for example– one believes or suspects that volcanic aerosols caused the forced trend to be non-linear between 1976-1995 (or so), then Tamino’s ‘blue’ or ‘red’ method will provide incorrect estimates for the uncertainty intervals for trends after 1995 (or so). In contrast, the “purple” method would not be affected by this known or suspected non-linearity.

Moreover, in cases where the assumption that the forced trend since 1976 is linear happened to be true, Tamino’s “blue” and my “purple” method should give comparable results for the size of uncertainty intervals. Some differences would exist–as one always finds for estimated quantities computed in different ways– but the differences would not be large.

I happen to believe that the eruption of Pinatubo and other volcanoes during the 70s, 80s and 90s resulted in a non-linearity in the forced trends. That is: if one could ‘repeat’ that period with slightly different initial conditions (as is done in climate models) the mean over many runs would show a ‘dip’ in the temperature (as indeed it does in climate models.) For this reason, I think the “blue” method of estimating uncertainty in short term trends is flawed. For those who might wish to see the ratio of the size of various estimates of the ±95% confidence intervals relative to the “ARIMA-R: monte carlo corrected” method, I show them below:


Obviously, the purple trace is unity in all instances. The black trace is the uncorrected ARIMA-R method, whose confidence intervals are too small. The green is “unfrozen/unfrozen” ‘Tamino’ and blue and red are based on estimating some or all parameters based on the time series from 1976-now.

I have long said that I think the decision to compute the uncertainty intervals by estimating the parameters for ARMA(1,1) based on the time series since 1976 is both unphysical and cherry picking. It is the former because our understanding of the effect of stratospheric eruptions is that they result in a forcing that causes cooling. It is the latter because the analyst selecting that period ought to be aware that selecting a period with that level of irregular forcing to estimate variability during periods without similar forcing will result in excessively high estimates for the variability of trends, given our knowledge that no Pinatubo-sized eruptions have occurred since the eruption of Pinatubo.

Getting back to the important observation:

If we model the GISTemp time series since Jan 2001, Jan 2002 or Jan 2003 through August 2012 as “trend + ARMA11 noise”, we reject the hypothesis that the true trend is 0.2C/decade with a statistical confidence of 95%. This is diagnosed by noticing the three right-most ‘purple’ circles in the upper trace fall below 0.2C/decade shown in blue.

For much shorter trends, we are not going to get rejections. These ‘fail to rejects’ would not contradict the three (or possibly more) rejects we are getting. They could likely be explained by the low power of the test for short time periods. I have not estimated that power, but now that I have a corrected model, I have the tools to do that. Because some people always want to suggest that the presence of any “fail to reject m=0.2C/decade” even at short time periods somehow contradicts the rejections we are seeing, I’ll be computing that power. (Note: many of those same people do understand that “fail to reject m=0 C/decade” for short periods doesn’t contradict the reject at longer periods. But…well… such are climate blog-viations! 🙂 )
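For what it’s worth, the power computation is a small extension of the same monte carlo: generate series whose true trend equals a chosen alternative and count how often the m=0.2C/decade null rejects. A rough sketch (the 0.17C/dec alternative and the other inputs are placeholders, not results; crit is the corrected critical value from the sketch above, or 1.96 for the uncorrected test):

    # Power of the test against a chosen alternative trend.
    power_mc <- function(m_alt, n, ar, ma, sdev, crit, nrep = 5000) {
      t_dec   <- (seq_len(n) - 1) / 120
      rejects <- rep(NA, nrep)
      for (i in seq_len(nrep)) {
        y   <- m_alt * t_dec + arima.sim(list(ar = ar, ma = ma), n = n, sd = sdev)
        fit <- try(arima(y, order = c(1, 0, 1), xreg = cbind(trend = t_dec)), silent = TRUE)
        if (inherits(fit, "try-error")) next
        tt  <- (fit$coef["trend"] - 0.2) / sqrt(fit$var.coef["trend", "trend"])
        rejects[i] <- abs(tt) > crit
      }
      mean(rejects, na.rm = TRUE)   # fraction of runs rejecting m = 0.2 C/decade
    }
    # e.g. power_mc(0.17, n = 140, ar = ar_giss, ma = ma_giss, sdev = sd_giss, crit = crit)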

158 thoughts on “ARIMA(1,1) MC corrected: GISTemp trends inconsistent with 0.2C/decade.”

  1. Nice work.

    Just one small note, I actually use the data including 1975 (i.e. year >= 1975) in my calculations, and I believe Tamino does the same – I match his numbers quite closely.
    Your wording “since 1976” suggests to me that you are using year>1975.
    I do not think that this will change anything in your analysis.

  2. SRJ–
    I started in 1976. I thought that was the year– I could change to 1975. It wouldn’t make much difference.

    I’m actually curious about how the ARMA(1,1) compares to variability across runs from models with multiple runs. ARIMA(1,0,1) out of the box uncertainty was too low for some models — but too high for others. But if the “fixed up” version is high enough for all models (and it might be) I may be able to call it a “bounding” estimate. So… gotta see!

    So, I plan to add to this code going forward. To avoid constantly doing monte-carlo, I need to start storing some of the results in files and re-reading.

  3. Interesting that your MC corrected method also rejects Tamino’s linear 0.17C per decade at 95% confidence.

  4. Lucia,

    A couple of comments. You said on an earlier thread:
    “There are reasons why this is likely not correct: One is that the student-t distribution was developed for testing when residuals are normally distributed and statistically independent. By definition, ARMA11 residuals are not statistically independent.”

    But I think you’re using a “sm” from arima which should be derived from the underlying iid noise. I think the problem may be that that “sm” isn’t the variance of “m”, but is the sqrt of the [4,4] diagonal term of the covariance matrix. That would work if the cov matrix were close to diagonal, but I think for ARMA11, it probably isn’t.

    In this thread, you’re showing a sequence of 27 years and saying that 3 years are significantly below 0.2. You’re leaving out later years with a suspicion that they probably aren’t below. Now if they were independent, getting 3 out of 27+ at 95% wouldn’t be at all surprising. Obviously they are highly correlated here, but surely still in attaching significance to those points, you have to allow for the number of points you tried without significance?

  5. Nick Stokes,
    I have been thinking about the same. It is connected to the multiple comparison problem. Or what high energy physicists call the “look-elsewhere effect”.
    If we are testing 20 different trends at the 95% level, we should expect to see at least one trend with p<0.05 even if there was no real trend. One could probably apply the Bonferroni correction and adjust the significance level from 1-alpha to 1-alpha/m, m being the number of trends we test.

  6. Nick

    But I think you’re using a “sm” from arima which should be derived from the underlying iid noise.

    You mean the stuff remaining after the arima fit? Yes. That should be iid. But the original residuals aren’t. That’s what I meant.

    In this thread, you’re showing a sequence of 27 years and saying that 3 years are significantly below 0.2. You’re leaving out later years with a suspicion that they probably aren’t below. Now if they were independent, getting 3 out of 27+ at 95% wouldn’t be at all surprising. Obviously they are highly correlated here, but surely still in attaching significance to those points, you have to allow for the number of points you tried without significance?

    I’ll be computing the power later– at that point, I’ll have more than a “suspicion” about what we can say about later years. But I think you know perfectly well that power will be lower for short trend periods. This isn’t controversial– and is in fact constantly “patiently explained” to those who notice that the current short trends are <0 C/dec and conclude the earth is cooling.

    If we are testing 20 different trends at the 95% level, we should expect to see at least one trend with p<0.05 even if there was no real trend. One could probably apply the Bonferroni correction and adjust the significance level from 1-alpha to 1-alpha/m, m being the number of trends we test.

    No, Nick.

    In the first place, you know the “projections” were created well after 1975. So, including those as some sort of “test” that one would count in a Bonferroni correction is… well… nuts. All that data substantially pre-existed, and if you want to do some sort of “real” correction, you create a test where the data prior to 2000 (or 2001) were a hindcast that was tweaked to fit the pre-existing data, and then start taking on the new (random) data.

    But it makes more sense to just limit “testing of forecasts” to after 2000.

    In the second place, even if you were doing a “Bonferroni correction” you would have to run monte-carlo to figure out how to apply it in the case of a series of trends.

    And in the third place: Stop behaving as if you don’t know that the power for trends beginning in — say 2009– is low, and suggesting that I’m just “suspecting” this.

  7. SRJ

    One could probably apply the Bonferroni correction and adjust the significance level from 1-alpha to 1-alpha/m, m being the number of trends we test.

    Bonferroni would apply if we would reject the hypothesis in the event that 1 or more of m tests rejected. So: Say you have two tests and you reject in all of the following:
    1) test 1 only rejects.
    2) test 2 only rejects.
    3) both test 1 and test 2 reject.

    But it doesn’t work for collective assessments where we are– for example– testing 10 times and making judgements when 4 out of ten reject at 95% or something like that.

    In this case, we have a forecast. The forecast is based on models that were run with SRES available in 2001. The models themselves were tuned to 20th century data. (Or even if someone claims they weren’t– the data was available for tuning.)

    So the tests of forecasting are limited to the ‘n’ after 2000.

    In the graph above, if we deem 2000 as the first possible year in which to test forecasting then we would use the trends starting in 2000 as the “test”. In that case, 3 out of 4 reject at 95%. So, if we thought these were independent, the chance that at most 3 out of 4 would reject at 5% is pbinom(3,4,0.05). (If we did that, we’d conclude we should reject because the observed count falls outside the p=0.9999938 intervals, which makes this event fall outside the 0.95 intervals.)

    The rejections aren’t independent so that doesn’t work. But what the graph shows is that the diagnosis of “reject” is pretty robust to choice of start year– provided we don’t use data that predates the forecast.

    After we find the expected power for year N, we could also do a test based on the number of “failed to rejects”– and that would let us include the ‘failed to reject’ for short term trends also. Since the power is a function of the alternate hypothesis and also the number of years, it would be a bit more difficult to do. But I plan to compute the power function, so if you want to pick an ‘alternate hypothesis’ that we use I can use that. I would suggest that for the null of m=0.2C/dec we could call the “alternate hypothesis” m=0.17C/dec, selecting that value because it’s the one Tamino shows. Otherwise, if you know the value from the TAR, I can use that as the alternate. I can even run monte-carlo to do a more-correct thing than Bonferroni or the binomial theorem with each point treated independently.

    But Bonferroni doesn’t work for this– and it would be easy to show it is too strict to just take alpha/n and apply it in this circumstance– and that could be shown using monte carlo.
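    For reference, the binomial arithmetic above, under the (incorrect, but illustrative) assumption that the start-year tests are independent:

      pbinom(3, 4, 0.05)        # P(at most 3 of 4 reject)  = 0.9999938
      1 - pbinom(2, 4, 0.05)    # P(3 or more of 4 reject) ~= 0.00048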

  8. Oh– I should add: If someone wants to test the trend forecast by the multi-model mean using various start years, and they just want to see fidelity to the hindcast (as in starting from 1980), we need to actually compute that trend for each start year. I put the 0.2C/dec because that’s pretty close to the trend if computed from:
    1) 1980-now or
    2) 2000-any point in the first 30 years of this century.

    But it would be higher if we computed it starting in 1990 or so. Because… of… volcanoes. Those are in the multi-model mean.

  9. Lucia, IIRC, when you first started this, Nick was one of those complaining of the power of your work due to the short record length about a year ago.

    Nick what caused this change of heart? Is there a more powerful stat method Lucia should be using?

  10. John F. Pittman–
    Lots of people have complained that for some mysterious reason that violates all understanding of statistics we should ignore “rejects” due to lack of power. Such claims are tortured statistics.

    As things reject more and more robustly, we start seeing more novel ideas. Long ago, Gavin suggested we can ignore “reject at 95% ending in year ‘x'” because it failed to reject in “year x-1”. Well… ok. But if you are going to make a two year test and claim you are applying it using p=95% you need to tweak the test. I ran that:
    http://rankexploits.com/musings/2009/so-what-if-the-hypothesis-didnt-reject-last-year/
    These things are all do-able. But given that the data are all correlated and short trends do have low power it’s important that
    a) someone doesn’t basically ‘double count’ (like Gavin’s apparent suggestion that we require rejecting two years in a row, failing to note that if we want to maintain the claim that the test is still 95%, then we need to reject two years in a row at a lower p for each individual year),
    b) someone doesn’t start insisting that we treat a “fail to reject” for a test with a power of 0.01% as somehow “overriding” rejects at higher powers. Of course we’ll get scads of “fail to rejects” in low power tests. Of course someone can find them. It’s borderline idiotic to claim that somehow “rejects” need to happen in all these low power tests before we take them seriously, and
    c) someone shouldn’t be able to insist that hypotheses like “does this have forecasting ability?” can only be tested using loads of data from the hindcast which was used to create the forecast.

    Of course, we might be able to do complicated things including testing with data from the hindcast– but in that case, doing it properly– so that we only treat the “uncertain” aspects from the hindcast as uncertain– is a real bear. And we might be able to do things with low power tests– but in that case, people should describe the fair test before we do it and that test should have a p=95% confidence level which we figure out before doing the test.

  11. Oh– I should add that if we are going to concoct ad hoc tests (because we really do have special circumstances) choices should be made to maximize power given choices within the test type.

    For example: If someone is going to consider the collective results from tests using trends computed starting in years (2000:20xx), and they specify ‘xx’, and they were to say:

    “I want a criterion where I reject if one or more years reject at p=(1-α), and I want to pick α so my test rejects a true null at the rate of 5%. Could you find α for the individual tests?” That test can be done.

    But then, after we find α, if we apply the test and it rejects, we report that. In principle, if they claimed they would stick with that, they should stick with that test and not add an additional requirement above and beyond the previous tests.

    And of course, whatever criterion they have, if they apply it to test one null (i.e. m=0) they should apply it the same way to test another (m=0.2).

  12. Ok…. well today, I’m going to organize this to ‘store’ results so I can reuse. Then I’ll be able to overlay multi-model means, individual runs and so on. I’ll store the “Tamino” method results, but mostly dropping them from graphs as clutter.

    Going forward, I’ll compute power based on two or three detection thresholds so we can quickly compute various things. (Like: how we should interpret the slew of “fail to reject no trend since 2000”– my theory is those are low power. But if Nick is going to seriously suggest that we use “fail to reject m=0.2” at low power as evidence to override the rejects, I’ll go ahead and show that by that logic, we should also accept m=0, and decree warming has ended. (Which is wrong– but if he wants to decree such a test applies to testing m=0.2, that test ought to apply to testing m=0.))

  13. BTW: If the probability of rejections for each start year was independent, the probability of getting 3 out of 11 rejections at 95% using start years beginning in 2000 to now is 1-pbinom(3,11,0.05) = 0.15%. And I haven’t run this out to check all start years. But prior to correcting, we were rejecting five years out of those starting in 2000-2008. Rejecting 5 out of 11 (from 2000-2011) would be something that happened at a rate of 5.801345e-04%.

    Of course– the tests aren’t independent. But if someone wants to apply Bonferroni — which assumes independence– it’s better to at least apply a more appropriate test. And I can assure you that the binomial test has the greater power of the two tests. That favors it as closer to the “right” test.

  14. Lucia,

    Elsewhere you and others referred to the Lee & Lund method. It is applicable to ARMA if the correct values are computed.

    In their examples (L&L 2004) they discuss the estimation of Neff where the model parameters are unknown. For an AR(1) process for large N they discuss the justification of an estimate based on the ratio of (1-rho^) to (1+rho^). In case others have not made it clear their example formulae are NOT applicable to ARMA(1,1) processes.

    You wrote:

    “Specifically, if we run monte carlo tests to determine how often we see excursions of that size when the (ar and ma) coefficients match that of the observations computed based on the GISTemp series, the rejection rates are as shown in black below:”

    I can read this two ways. One in which the model parameters are fixed and the results are selected based on whether certain correlation coefficients match those of the data, and another where the model parameters vary over some plausible distribution of values. The second is Bayesian in form, a reverse probability. The first would be some halfway house which I should find puzzling.

    Is it either of these?

    Alex

  15. Thanks for the link Lucia. That was the thread where Gavin stated: “The point about masking or distorting is very apt, however that is the problem that is inevitably encountered in drawing inferences about climate from short term observations, the shorter the timespan involved, the more the forced component is likely to be obscured by the variability of the unforced component.” wrt unforced ENSO, etc. Somewhere there was also the autocorrelated argument as to why short term observations were poor for drawing inferences about climate. I wonder if pink noise will be brought up.

  16. Alexander Harvey
    I’m not sure what you are asking me in your paragraph. I think my answer is “I am doing neither of those two things”. What I did was this:

    (1) Created an N month series using AR=ar_giss, MA= ma_giss where the _giss values match the values we computed based on the observation in _giss.

    (2) Pretend we *do not know* the AR and MA for the series just generated. Use the method Tamino used to estimate the parameters used in the Lee and Lund method.

    (3) Reject or accept according to the rules in the “Tamino” implementation. In this implementation, “true” AR and MA are known to exist because of (1), but we don’t know them. So, we are just implementing the “Lee and Lund” method the way Tamino coded it– using estimates based on the synthetic time series.

    Repeat a “shitwad” of times.

    Based on this, we determine the rejection rate that would arise conditional on the (AR,MA) pair we used to create every single run in step (1). I don’t think this is “Bayesian”. I have no prior describing the probability of any particular values for (AR,MA).

    I also don’t think the method is “One in which the model parameters are fixed and the results are selected based on whether certain correlation coefficients match those of the data”. I don’t even know what that means. But I do fix the parameters I use to create synthetic runs. Then I apply Tamino’s method to determine whether the trend for that run is inside or outside the 95% confidence intervals– according to the criterion his bit of code implements. Then I report the rate at which his method would reject a trend that is known to be true. This is a pretty standard way of checking if a method works.

    The only slight– and it’s very slight– complication is that for ARMA, a “complete” result would require someone to find

    Rejection_Rate=function(AR,MA, Num_months).

    So, you need a result that is a function of 3 things. I’m only exploring the axis of

    Rejection_Rate=function(AR=a specific value,MA=a specific value, Num_months).

  17. Lucia,
    Two (simple) questions:

    1) Do the “95% limits” calculated by your MC corrected method enclose 90% of the probability (5% higher than the upper limit and 5% lower than the lower limit) or do they enclose 95% of the outcomes (2.5% higher, 2.5% lower)? I assume it is 90% enclosed by the limits.

    2) The three points on the (purple) upper limit, which fall below the 0.2C per decade warming line, fall a substantial distance below that line. Can we then not infer that there is even less than 5% chance the canonical 0.2C per decade value is correct?

  18. SteveF-

    1) My 95% probabilities enclose 95% (2.5% higher, 2.5% lower). I do 2 sided.

    2)

    The three points on the (purple) upper limit, which fall below the 0.2C per decade warming line, fall a substantial distance below that line. Can we then not infer that there is even less than 5% chance the canonical 0.2C per decade value is correct?

    If you mean would we still reject them had we picked a different confidence level: yes. But I don’t know the specific p for those and trying to figure it out would require running the monte-carlo code. It’s not worth it. Also: remember that ordinarily, the ‘best’ practice would be for the method — including what number of years to use and so on– to be defined before the data to test the hypothesis are collected. Then you use that method.

    With climate stuff, this is never the case because data just pour in. People often don’t even think about whether they are selecting their test periods “randomly” or fairly. (They always think their choices are objective, but those of others who get results they don’t ‘like’ are cherry picked.)

    But with respect to your question: If you just highlight results since 2002 or something and report that confidence level, that’s really not fair. Because… after all… you know it looks like a local maximum. OTOH, it’s not fair to pick 1999 for the same reason.

    Multi-year methods might be useful– if we had a standard one.

  19. Lucia,

    That should have been three questions, not two. 🙂
    3) If a reasonable contribution for volcanoes could be subtracted from the pre-2000 GISS data, would that not allow a better estimate of the “natural variability” absent volcanoes. I mean, could you then not use all of the data to model the noise for the post-2000 period (as Tamino would like to)?

  20. Lucia,
    OK. Now 4 questions. If the limits are two sided, then doesn’t a point above the upper limit represent only a 2.5% chance, not 5%?

  21. Lucia,

    I am trying to follow but still confused.

    If by Lee & Lund method you mean either Eq 3.6 or 3.7 (D. Nychka et al). I believe the following to be the case.

    They are not applicable to ARMA(1,1) e.g. they cannot be justified in the same way as they can be for AR(1) as they are not the correct ARMA(1,1) limit as N goes to infinity. They can be a long way out. They give the wrong answer (even for large N).

    If those equations are being generally used as a starting point I think that they will cause a problem and that there is no justification in L&L 2004 for their use in ARMA(1,1) or anything other than AR(1). If they are generally used then someone must have been first. Has anyone published using them for ARMA(1,1)?

    I have read that paper and it is good stuff. It does generalise to ARMA(1,1) or any other linear time invariant process. I will look to see if there is a simple equivalent formulae for ARMA(1,1).

    If that is not what you mean by the Lee & Lund method I am puzzled about what you and others are saying.

    I am just trying to get on the same page as the rest of you.

    Alex

  22. 3) If a reasonable contribution for volcanoes could be subtracted from the pre-2000 GISS data,

    If? Yes. But I don’t have much confidence in being able to do that. We know it can’t just be a linear response to an optical depth plus a lag because that’s not what the earth would do if you suddenly had a step function in forcing. But also: You still have all the uncertainty you had about whether or not ARMA ought to work.

    3) Because quite a few models did include volcanic aerosols, if my goal is to test models and there is a problem of shared volcanic eruptions in both series, I’m inclined to think the best way to do it is to first create this series,

    GIS_norm=GISTemp-MultiModelMean.

    Under the “null” hypothesis that the “model is right”, this takes out the shared non-linear response to the volcanic forcings. After that, you analyze that difference series.

    There are some problems because some models don’t include volcanic aerosols. But at least you aren’t going to do something really odd like “correct” a multi-model mean using a volcanic aerosol loading and response function that differs from the one applied to the models.

  23. If by Lee & Lund method you mean either Eq 3.6 or 3.7 (D. Nychka et al). I believe the following to be the case.

    No. I said I use the method Tamino used. He cited Lee and Lund as a justification, but Tamino is not using 3.6 or 3.7. You have to read the Tamino blog posts. If I recall correctly, Tamino’s development would start at 2.3 and then discuss what you get if you use the correlogram for ARMA 1,1.

    But I merely use Tamino’s code snip. He cites Lee and Lund. But I just use Tamino’s code snip and find the rejection rate. The method he uses does work for ARMA 1,1 as the number of months->Infinity. It also agrees with ARIMA from R as months->infinity. It would do neither if he was using Nychka.

  24. SteveF

    OK. Now 4 questions. If the limits are two sided, then doesn’t a point above the upper limit represent only a 2.5% chance, not 5%?

    Depends how you mean it. A point that far out in magnitude (absolute value) would happen 5% of the time. But one with that much difference and in that particular direction would happen 2.5% of the time.

    Whether it’s fair to use one sided or two sided tests depends on the specifics of the argument. In my case, I would have decreed models wrong for excess excursions on either side, so I use 2 sided. Had I had a pre-existing notion they must be too high prior to testing, I might use 1 sided.

    This isn’t really unconventional usage– it’s just that one uses 1 sided or 2 sided and the choice depends on whether you are testing a null of m=0.2 with alternate of m≠0.2 or whether you are testing a claim that 0.2<= m with alternate m<0.2. The IPCC claims m=0.2. Not 0.2<= m. So two sided is appropriate.

  25. Lucia #104493,
    “No nick. “
    Actually, that was SRJ. But OK, the low power bit is extra. Let’s just look at the points you’ve shown. 3 out of about 27 have a CI that dips below 2. Now there’s lots of autocorrelation, so a Bonferroni correction won’t do. But the same issue is there. You didn’t just pick 2001 as a random year to test.

    It seems to me that this is similar to testing whether control charts (Shewhart) are out of range. You can’t just apply single point tests.

  26. Nick–
    First, I didn’t show 3 out of 27 because I don’t show 27. Count the dots: Not 27. At all.

    Second: even by claiming 27 you are failing to make the distinction between “hindcast” and “forecast” in whatever test one was making. Clearly, if one’s goal was to test a forecast, then one would recognize you can’t test a forecast properly using data that was used to tune the models that were used to make the projection. So, the trends computed starting with data prior to 2000 are clearly irrelevant to that– and that is true whether they are computed, or not.

    Those “3” on the chart you see are in the forecast region. Third: I think you well know by now that the models don’t predict a trend of 0.2 if you compute the trend starting in years in the middle of the 1990s.

    I promised I would improve the “model” line and I now have:

    Note that if model trends also include the dip in temperatures from the volcanic eruptions, we are rejecting starting from other years. (Alas, I did not compute trends for 27 years– I computed fewer so I can get things done rather than having the Macintosh take 24 hours to get fiddle factors estimated.) But if you look at this:

    You will see I am rejecting 5 out of 11 computed. I don’t know what trends starting in 1991, or 1992 would do and so on because I didn’t compute them. See: No dots. I did not compute 27 cases.

    I’ve never said I picked 2001 “randomly”. I think it’s best owing to the date when the SRES used to create projections were frozen. Other people “like” 2000. Currently, if the “standard” had been 2001, we would reject. If the standard had been 2000, we would not reject. There was no “standard” for testing forecasts decreed by those who made the forecast.

    I didn’t say that when looking at this we must consider a single point test. Only that Bonferroni is nowhere near the right thing.

  27. Lucia,
    I don’t see that forecast or hindcast is the issue here. You’re calculating trends – there are I think 27 in the top plot, although it’s true there that more than 3 upper CI fall below 2 (though not 1.7), and you’re citing the CI’s significance for an individual trend. Yes, Bonferroni isn’t right, but to make no allowance isn’t right either.

    I don’t object to calculating the single trend CI – just that this can’t go further without some appropriate measure for the significance of an autocorrelated series dipping below a prescribed value. I think control charts might be the place to look.

  28. Nick–
    Of course forecast and hindcast can be an issue. Whether it is or is not depends on which hypothesis you are testing. If you refuse to test whether the forecasts are reliable, you might decree forecast/hindcast makes no difference. But if you want to test the forecast it is a very important difference.

    Mind you: I can understand why people who want to “protect” the notion that models are correct want to refuse to test forecasts and so might wish to not recognize this distinction. But I think testing forecasts is important.

    Yes, Bonferroni isn’t right, but to make no allowance isn’t right either.

    Sure. But as I noted: when testing the forecast I have 3/4 rejections. And — if one were to use the principle that picking the test with the highest power to detect discrepancies was important (which it is) one would use the “binomial theorem” method. And this would be a very strong rejection of the premise that the models can forecast.

    In this situation, Bonferroni is ass-backwards, and monte-carlo would show this.

    Can they hindcast? Sure.

  29. Alexander Harvey (Comment #104515)

    “If that is not what you mean by the Lee & Lund method I am puzzled about what you and others are saying.”

    I had the same concerns when I read Lee & Lund, but that is the reference that was provided in Tamino’s paper for the adjustment used for an arima(1,0,1) model. Lee and Lund did not derive the adjustment used in Tamino’s paper.

  30. “I don’t object to calculating the single trend CI – just that this can’t go further without some appropriate measure for significance an autocorrelated series dipping below a prescribed value. I think control charts might be the place to look.”

    Control charts are a compromise for keeping a process in control and not over adjusting it. The application here is not the same for showing trend directions. It is what it is. Tamino evidently found a way to make a black and white comment on the latter-day trends and now it appears Nick wants to use another system to make a black and white comment on results that are counter to Tamino’s findings. What Lucia has provided is simply a convenient way of looking at short segment trends.

  31. Lucia,
    “Whether it is or is not depends on which hypothesis you are testing.”
    You’re testing whether GISS falls significantly below 2.0 °C/cen. If you want to test a model-based forecast you’d have to 1. show that it was someone’s model-based forecast for that period and 2. take account of its error range.

  32. Just out of interest

    “I can understand why people who want to “protect” the notion that models are correct want to refuse to test forecasts and so might wish to not recognize this distinction.”

    How do you know that testing the forecast will show if the models are correct? Surely ‘disproving’ the multi model mean doesn’t mean the models are collectively wrong.

    Wouldn’t a better method be to re-run the models with real world inputs and see if they match what happened. You could then select the ones that performed well and then investigate how they behave in the future.

    There are so many things that could be incorrect about the models, the various assumptions of ENSO conditions, solar conditions etc.

    Perhaps you could do an exploration of what the various models are sensitive to and see how they performed when external factors didn’t match their assumed magnitude.

  33. “Wouldn’t a better method be to re-run the models with real world inputs and see if they match what happened. You could then select the ones that performed well and then investigate how they behave in the future.”

    Nathan, just out of interest, has this approach been seriously proposed by any climate modelers?

  34. Kenneth
    Putting known real world inputs into models is how they ‘train’ them, no?

    In any event, it was just a suggestion.

    How would you test the performance of a particular model?

  35. Nick Stokes,
    “You’re testing whether GISS falls significantly below 2.0 °C/cen. If you want to test a model-based forecast you’d have to 1. show that it was someone’s model-based forecast for that period and 2. take account of its error range.”
    Humm.. the “about 0.2C per decade” comes from the IPCC. It is a prediction of warming, and it is based mainly on the average of the GCM’s considered by the IPCC. Do you think that it is not appropriate to compare the measured evolution of temperatures (GISS in this case) since that prediction was made to the rise that was predicted?

  36. SteveF

    Wouldn’t a better test be examining the GISS performance against its own projection?

  37. Lucia, just for the record I see that your graph and mine for the unfrozen segment trends and CIs match very closely.

  38. Lucia,

    Thanks

    I have now read the Grant & Rahmstorf paper

    The equation A.6 seems to be the one in question and it is, I am very much afraid, in error. Even were it stated as an estimate it can easily be out by about 20% or more. It is not the limit as N goes to infinity nor much of a fit. For unusual but perfectly valid parameter values it can become negative which is problematic for a ratio of variances.

    Equation A.5 is referenced to L&L 2004 but I do not think correctly. A similar equation appears with a weighting term wj but it would not be correct for ARMA(1,1). It is mentioned in relation to AR(1) and is not general.

    Now obviously someone has got this very wrong. Perhaps it is I, perhaps it’s not.

    Perhaps it would be kinder to Lee & Lund to refer to it as the Grant & Rahmstorf method. I fail to see what in Lee & Lund they think supports their assertion.

    Alex

  39. SteveF (Comment #104535)

    What Nick is referring to here is similar to Santer’s complaint with Douglass – only the other way around. Douglass did not use CI limits for the various potential sources for observed temperatures. Here, however, I believe Lucia has merely used the 0.2 degree C per decade as a marker.

    The beauty of the Santer/Douglass debate was in the revelation that the observed (counting radio sondes) and the modeled data had such large CIs that while one could, with the data available at that time, say that no significant differences existed, one could also say that with those wide CIs we could hardly make any reasonable statements about the temperatures in the tropical troposphere.

  40. Nathan (Comment #104536),
    Well, maybe, but you would need enough runs to establish a reasonable estimate of the variability of that one model, with each run based on a different initial state. That could be a lot of runs! The IPCC position seems to be that a pooled average of all the runs from many models is more representative of reality than any single model (not my argument). My view is that the range of model behaviors and diagnosed equilibrium sensitivity is large enough that we know for certain most have to be quite wrong… after all, there is only one right answer on climate sensitivity. I am not sure how examining only one model helps, absent lots of runs.

  41. “Perhaps it would be kinder to Lee & Lund to refer to it as the Grant & Rahmstorf method. I fail to see what in Lee & Lund they think supports their assertion.”

    The question then goes back to what I posed originally and that is whether Grant & Rahmstorf were using an adjustment they derived or one that is commonly used or at least independently discussed in the literature.

  42. Kenneth Fritsch (Comment #104539),
    Sure, but since the Santer/Douglass dust-up, newer data has pushed even the Santer analysis into the “reject” territory, at least in some of the measures of the troposphere (Chad, Steve McIntyre, and others even published a paper, I think).

  43. Lucia,
    “Whether it is or is not depends on which hypothesis you are testing.”
    You’re testing whether GISS falls significantly below 2.0 °C/cen.

    No. As far as testing goes, I would want to test whether the multimodel mean is correct in the forecast period.

    I might be interested if it also fails in the hindcast, but I would be most interested in the ability of models to forecast.

    That said, in a post where someone asked me to compare my CI to the ones Tamino gave, I compare my CI to the ones Tamino gave. I might also make observations about how the model is doing. But this is not a “test” of the model’s forecast ability.

    Nathan

    SteveF

    Wouldn’t a better test be examining the GISS performance against its own projection?

    GISTemp– as an observation– makes no “forecast”. It is an observation. We can test the multimodel mean against that observation. The coincidence that GISS also has a modeling group means we could also test their individual models. But that doesn’t mean that testing GISTemp against the GISS Model E is a “good” way to test how the multi model mean from the AR4 works. To test how the multi-model mean from the AR4 compares to the observation of GISTemp, we compare the AR4 to GISTemp.

    This is not a difficult concept.

  44. Nathan…

    How do you know that testing the forecast will show if the models are correct?

    Uh… testing the forecast will tell us if the forecast is correct.

  45. “How would you test the performance of a particular model?”

    First what I often see is that a particular scenario or several scenarios have been used in running a climate model a number of years ago. We then look at those results to determine the predictive skill of those models while at the same time recognizing that those particular scenarios did not occur in the real world. I would suggest going back and putting in the real world inputs and looking at those results. I have not seen this being done. What you have in this case are truly out-of-sample tests to apply. Of course, a common reason for not doing these tests is that models have evolved to a better state and testing old models is a waste of time. All that means is that we will never get truly out-of-sample testing until climate models stop evolving.

    Even with the limitation of in-sample testing, it could be done by any modeler with the latest and best models but would have to be done by presenting a model for benchmarking to an independent body and replicate runs made. This avoids the modeler throwing away bad runs. It does not get around a model being chosen for benchmarking by fitting it to real conditions.

    Competing algorithms for adjusting temperatures have been independently evaluated using benchmarking, but I have not seen this attempted by climate modelers. Have you?

  46. Kenneth

    Lucia, just for the record I see that your graph and mine for the unfrozen segment trends and CIs match very closely.
    Good! Two totally independent approaches and results agreeing means it’s less likely either of us made a boneheaded error. (I was worried I’d made one for about 24 hours. People probably saw my panicked note when I found a bug I thought would be of consequence!)

  47. Kenneth

    Grant & Rahmstorf were using an adjustment they derived or one that is commonly used or at least independently discussed in the literature.

    I think they are using a method that Grant Foster came up with. It’s not necessarily bad. But I think he’s cherry picked in this instance.

  48. Lucia,
    these two statements seem to be at odds

    “I can understand why people who want to “protect” the notion that models are correct want to refuse to test forecasts and so might wish to not recognize this distinction.”

    “Uh… testing the forecast will tell us if the forecast is correct.”

    Why make the comment about models?

  49. Kenneth

    “Competing algorithms for adjusting temperatures have been independently evaluated using benchmarking, but I have not seen this attempted by climate modelers. Have you?”

    No, but I don’t know any climate modellers.

    What you suggested, is in agreement with my thoughts though.
    I am guessing here, but I would think the logical way to create an improved model would be to test it against real world data and then see where your model went wrong. That way you wouldn’t be completely re-writing the model, you’d be updating an old one.

    I am assuming this is probably what does happen.

    Do you know how modellers move from model A to model Av2?

  50. Lucia

    “To test how the multi-model mean from the AR4 compares to the observation of GISTemp, we compare the AR4 to GISTemp.”

    I understand that this is what you are doing, it just doesn’t seem a particularly useful thing to do.

    You also made the confusing statement

    “I can understand why people who want to “protect” the notion that models are correct want to refuse to test forecasts and so might wish to not recognize this distinction.”

    Because testing the multi-model mean doesn’t test the models.

  51. “I can understand why people who want to “protect” the notion that models are correct want to refuse to test forecasts and so might wish to not recognize this distinction.”

  52. Nathan,
    “Because testing the multi-model mean doesn’t test the models.”
    So does that suggest you think the IPCC projection (based on the multi-model mean) is not a useful projection? Or do you think neither the individual models nor their mean are making useful projections?

  53. I see the IPCC projection as an approximation, and it is useful in that it gives a rough idea of what we should expect.

    Simply checking to see if temps have matched is not particularly useful as it says nothing about why it doesn’t match.

    Ideally it would be better to test each model and then sort through what went wrong in each case to try and get a better model (and hence a better projection).

    I don’t know how they’ll be doing it for the next report, but I doubt they’ll just present the same data again. So if you like this is already an admission that the AR4 was ‘wrong’. But that is on its own a useless statement. What we need is to know WHY.

    Have you read James Annan’s take on this?

  54. SteveF,
    the “about 0.2C per decade” comes from the IPCC.
    Well, from the SPM. A first question would be, over what time period. It isn’t entirely clear, but it seems to refer to the future, written in 2007. It can’t be taken to indicate a hindcast/forecast transition at 2001. And even if it did, in this trend analysis the 2001-2012 period is contained in all the trends.

    But then there is “about”, and error limits (unstated in SPM, but needed for quantitative analysis).

  55. Nick:

    But then there is “about”, and error limits (unstated in SPM, but needed for quantitative analysis).

    I commented on this on your blog too. If you follow your tack to its logical conclusion, then the number 0.2°C/decade is completely meaningless. (It could have any bounds needed to get it to agree with the data, since none were provided.) If it was just a SWAG, they should have stated this.

    As I said on your blog, this 0.2°C/decade with no error bound is symptomatic of a disease suffered by almost the entire field, that of understating or leaving unstated the uncertainty in their own results. (“Almost”.)

  56. Lucia,

    I have checked and there is an exact expression for the required limit value as N goes to infinity for ARMA(1,1):

    ((1+ma)/(1-ar))^2/(1+((ma+ar)^2/(1-ar^2)))

    Where the correlation vector is

    1
    (ar+ma)
    (ar+ma)*ar
    (ar+ma)*ar^2

    Hopefully the formula is without typos.
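    A direct transcription into R, in case anyone wants to check it numerically (ar and ma being the ARMA(1,1) coefficients):

      arma11_limit <- function(ar, ma) {
        # large-N variance inflation factor for ARMA(1,1):
        # ((1+ma)/(1-ar))^2 / (1 + (ar+ma)^2/(1-ar^2))
        ((1 + ma) / (1 - ar))^2 / (1 + (ar + ma)^2 / (1 - ar^2))
      }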

    Alex

  57. Nathan–
    Yes. I can see people who want to protect models don’t want to test forecasts based on models. For some reason they think that showing the forecast fails might be perceived as proving “models are wrong”. But I’m testing the forecast. I think it’s important to test the forecast. If it’s wrong we need to know that. After it’s agreed it’s wrong (which btw it is) then people can sort out the reason. Could be the models. Could be the forcings. Etc. But I’m testing the forecast.

  58. Nathan

    Because testing the multi-model mean doesn’t test the models.

    It tests the forecast. I think that’s more important than testing the models.

  59. Well, from the SPM.

    It’s not just in the SPM. Also, the forecast is given in a graph– which is continuous over time. And it’s based on the multi-model mean– which is continuous. So, that can be tested.

  60. Nathan

    and it is useful in that it gives a rough idea of what we should expect.

    Simply checking to see if temps have matched is not particularly useful as it says nothing about why it doesn’t match.

    Sorry, but I think knowing the forecasts don’t match observations is a useful thing. Out of curiosity, are you conceding the observations don’t match the forecast? Because if you agree they don’t then we can move on. But otherwise, trying to change the subject to the importance of knowing why they don’t match isn’t going to make detecting that they don’t match unimportant to my eyes or those of many others.

  61. Kenneth–

    Here, however, I believe Lucia has merely used the 0.2 degree C per decade as a marker.

    In the present post, I am using 0.2C/dec as a marker. We are merely observing whether trends computed from certain start years lie above or below this marker. The discussion was– to a large extent– motivated by someone wanting to know how “my” error bar compared to Taminos. So I am showing what I get relative to what he gets– and the years for testing are chosen based on his choices.

    They are not particularly motivated as representing a “test”. But I think it’s interesting to see that many years in the forecast period do have their upper bound 95% confidence intervals fall below 0.2C/dec using ARIMA(1,0,1). This happens to be the case.

    Observing simple facts of this sort seems to upset Nathan and Nick. But… well, that’s where the confidence intervals are if we compute them using Arima(1,0,1).

  62. Alexander Harvey:
    But the difficulty is that we are dealing with small N. So, whatever the correct formula is, it gives wrong answers for small N. That’s why we need to correct. And as far as I can tell (and Kenneth) we need to do that with Monte Carlo. Do you think something different?

  63. Carrick,
    “If it was just a SWAG, they should have stated this. “
    Well, they did say “about” 🙂

    But famously, the SPM is hammered out in a plenary session, and tends to get away from the scientists. One can criticise that, but there’s not much point in trying to reverse engineer a testable projection from the SPM when there is a whole AR4 chapter on that.

  64. Nick–
    They gave more detail than “about” and the forecast isn’t just in the SPM. It’s in the body of the report too.

    One can criticise that, but there’s not much point in trying to reverse engineer a testable projection from the SPM

    There is a testable projection in the body of the report. The fact that the forecast is pulled out and repeated in the SPM doesn’t make it untestable.

  65. Lucia

    “It tests the forecast. I think that’s more important than testing the models.”

    Why?

    “Sorry, but I think knowing the forecasts don’t match observations is a useful thing. Out of curiosity, are you conceding the observations don’t match the forecast? Because if you agree they don’t then we can move on.”

    Obviously the recent observations don’t match the multi-model mean. They track differently.

    What I am saying is that as a test, or as something to test, it’s not that useful.

    I’m not upset, I am hoping you will go to the interesting question (as I can’t). The question of why… So what has happened that has made this projection inaccurate. Is it the sequence of La Nina’s? Is it the lower TSI? is it because the climate sensitivity is too high?

    When Tamino did his paper he looked at these factors… And it seemed that when you compensate for them you end up pretty much with the models being more or less accurate; they just failed to adequately predict the TSI and ENSO conditions. That’s my understanding of his work anyway.

  66. Lucia,
    “Observing simple facts of this sort seems to upset Nathan and Nick.”
    Far from it. It’s what I said you were doing (testing 2°C/cen) and there’s nothing wrong with that. And it’s what the title says.

    I demur when it’s said that somehow doing the test after 2000 (forecast) is different from before. You haven’t shown a basis for that.

    “And it’s based on the multi-model mean– which is continuous. So, that can be tested.”
    Yes, but in terms of model performance in the short term, this isn’t much better than 2C/cen. The models individually represent the attempt to model weather, not the mean. At least you’d want to test whether the actual weather could statistically be in the population of models, which is different from saying whether it differs significantly from the mean.

  67. Nathan

    What I am saying is that as a test, or as something to test, it’s not that useful

    Well, then we simply disagree. I think whether the forecasts are high or low is of great importance to making decisions going forward. If someone in the public wants to make plans about investments in infrastructure, or changing sources of energy and so on, knowing if forecasts are biased is more important than specifically knowing why they are biased.

    I admit that specialists working on improving models might care why they are biased. But this is a teensie-beensie fraction of people who might wish to use forecasts. So, knowing they are biased is more important to nearly everyone.

    And it seemed that when you compensate for them you end up pretty much with the models being more or less accurate.

    Actually, you might want to squint at the figures a bit more. Even with his extra big error bars, and even with data ending near a “top”, the upper uncertainty falls below 0.2C/dec for some observational sets.

    That’s my understanding of his work anyway.

    Well… you might want to squint at those figures a bit more and read what he actually writes. Because showing that there is no change in the trend is not the same as showing the AR4 forecast is not off. Upload in a second.

  68. Note that for start years near 1980, T’s figures have upper 95% confidence intervals below 0.2C/dec for RSS and UAH. They just graze for GISS, NOAA and Hadley. And for shorter periods– I’ve just shown his error bars are computed to be wide.

  69. Shoot… I chopped off UAH! Those trends are way lower than “about 0.2C/dec” in Foster and Rahmstorf. Get the paper and look at figure 6.

  70. Nick

    I demur when it’s said that somehow doing the test after 2000 (forecast) is different from before. You haven’t shown a basis for that.

    Shown? What’s to show? Testing forecasts is different from testing hindcasts. Period. There is nothing to “show”. Either you understand that or you don’t.

    The models individually represent the attempt to model weather, not the mean.

    Nonsense. Model runs attempt to model the mean+weather. Some models have numerous runs– up to 7. The average is an attempt to model the mean.

    Moreover, even with an individual run, it is possible to make statements about the mean.

    At least you’d want to test whether the actual weather could statistically be in the population of models, which is different from saying whether it differs significantly from the mean.

    I haven’t said you can’t do that test too. It’s possible to test whether the earth weather falls inside the model run spread and also test whether the multi-model mean is consistent with weather. I have posts that discuss one and some that discuss the other.

    This particular post happens to discuss whether the model mean is consistent with the earth weather. The answer is that the model mean is not consistent with earth weather– and that is true even if the earth weather falls in the spread of model runs. That spread is affected both by “weather” and by the spread in model means. So, to the extent that model means are wrong, it’s potentially wider than the spread of earth weather. I happen to think it is wider than the spread of earth weather– but that’s not the subject of this post.

    The subject here is whether the model mean is consistent with earth weather. Trying to change the subject to a different question is not going to change the answer.

  71. Lucia,
    “Testing forecasts is different from testing hindcasts. Period. There is nothing to “show”.”

    So what do you do differently for forecasts? What would be different if their runs were completed in 1995 say? Is it a different 2°/cen?

    In any case, you can’t even say that the 1995 number is a hindcast. It’s the trend from 1995 to Aug 2012.

  72. Lucia

    “I think whether the forecasts are high or low is of great importance to making decisions going forward. If someone in the public wants to make plans about investments in infrastructure, or changing sources of energy and so on, knowing if forecasts are biased is more important than specifically knowing why they are biased.”

    I guess we will have to disagree. So is the point of these posts to encourage people to not take action based on the multi-model mean?

    Could you say why you think the multi-model mean is used in making decisions? Can you explain what bodies or organisations you think use the multi-model mean for decisions?

    Local modelling – like the models CSIRO has done here in Australia – is used for those planning decisions here. I’m not sure if they form part of the multi-model mean. And even if they are ‘wrong’ (which they most certainly are – they are models), how should we make decisions? What should we base the decision making process on? Is there actually any alternative?

    If you don’t think we should base decisions on these models, what should we use?

  73. Lucia

    “Here is the graph with UAH. Note that with start years near 1980, trends with RSS and UAH are both below 0.2C/dec.”

    Ok, so when they say “about 0.2” that doesn’t mean exactly 0.2… Yes? They have one significant figure there. So about 0.2 should be seen as 0.1500000…00001 to 0.24999999999…..99999

  74. “So, knowing they are biased is more important to nearly everyone.”

    really interested in this remark… Can you define or quantify what you mean by ‘biased’?

  75. Nathan– No. 0.17 is not “about” 0.2 and it wasn’t that in the AR4. You can look at their tables and charts. 0.17 as a true trend is distinctly slower warming.

    And by “way lower” I mean there is a large visible gap between the dashed line and the top of the 95% confidence interval.

    Nick

    I guess we will have to disagree, so is the point of these posts to encourage people to not take action based on the multi-model mean?

    I think we should take action. But I also think we should take appropriate action. In that context, I think it is important for people to know where things are relative to forecasts. I am merely observing where things are. If the temperature were above the mmmean, I would observe that.

    And even if they are ‘wrong’ (which they most certainly are – they are models), how should we make decisions? What should we base the decision making process on. Is there actually any alternative?

    Alternative to what? Do you mean: Is there an alternative to refusing to let people who are making decisions know that forecasts based on collections of AOGCMS are running too warm?

    Yes. There is an alternative to presenting them model data while concealing from them the fact that collectively models are running too warm. The alternative is presenting them with local model data and also letting them know that sets of models are running too warm for the planet as a whole.

  76. Nathan-

    Can you define or quantify what you mean by ‘biased’?

    Yes. There is a difference between the multi-model mean and the true mean. That’s the definition of ‘bias’.

    That is, if we claim y is an approximation or estimate of x, but
    E[x]-E[y]≠0, with E indicating ‘expected value’, then y is a biased approximation of x. Unbiased approximations or estimates have the property E[x]-E[y]=0.

    This is a pretty dang standard definition.
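
    A toy numerical illustration of that definition, with made-up numbers standing in for the ‘true’ trend and a handful of model trends (in R; nothing here comes from actual model output):

    true_trend   <- 0.17                                    # suppose this is E[x], the true value
    model_trends <- c(0.28, 0.19, 0.25, 0.21, 0.24, 0.23)   # hypothetical model trends, draws of y
    mean(model_trends) - true_trend                         # estimate of E[y] - E[x]; positive => biased high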

  77. Lucia

    “Nathan– No. 0.17 is not “about” 0.2 and it wasn’t that in the AR4. You can look at their tables and charts. 0.17 as a true trend is distinctly slower warming.”

    For real?

    So when the say ‘”about 0.2 “, you say they really mean exactly 0.2…

    That sounds a bit like you’re putting words in their mouths.

    But hey, if you believe that who am I to dispute it?

  78. Nathan –
    Lucia is not putting words in anyone’s mouth. From AR4 WG1 SPM: “For the next two decades, a warming of about 0.2°C per decade is projected for a range of SRES emission scenarios”. Took me under a minute to find it.

    If you want a more precise estimate, see this comment on another thread.

  79. You claimed people or organisations were using the IPCC multi-model mean to guide them. What organisations are using the multi-model mean? Who is using this to determine action?

    Also, given that the IPCC is about to produce a new one, don’t you think disproving this one is a bit ‘useless’?

    “… let them know that sets of models are running too warm for the planet as a whole.”

    Ok, so they’ve been running warmer for about 10 years? Or so? I don’t want to quibble on the number of days they have been running warm, so when I say ‘about 10’ I mean of that order, so somewhat more than 5 and less than 15.

    On what basis do you assume they’ll continue to run hotter into the future? We have already seen (based on Foster and Rahmstorf) that the reason they’re running hotter is basically because they overestimated the TSI and MEI… How do you know that this problem will continue?

    This is why you need to know ‘why’. Because just saying ‘The IPCC projection is hot’ or something does nothing to say whether it’ll still be running hotter in the future. Looking at earlier data we have seen sudden accelerations before, so why do you think they will not happen again?

  80. HaroldW

    I know the line, it’s just that Lucia has decided that ‘about 0.2C’ actually means ‘0.2C’.

    apparently 0.17C cannot be considered ‘about 0.2C’

  81. Lucia given that in natural systems the estimate or approximation y of x will NEVER satisfy your equation E[x]-E[y]=0, how can we ever have an unbiased approximation?

    It’s a pretty useless definition in this context.

  82. “Lucia has decided that ‘about 0.2C’ actually means ’0.2C’.”

    When comparing a stated number to another number or numbers, I think it’s reasonable to use the stated number in your comparison.

    This is what I have an internet connection for: Discussing the dynamic applications of the word ‘about’. 😉

    Andrew

  83. SteveF–
    As I said: To deal with volcanoes, I would just analyze the difference. Here’s the trend in GISTemp-MMMean:

    The uncertainty intervals are too small because
    a) I haven’t taken the hours required to run the monte carlo and
    b) Nick noted long ago this method loses the spread in the multi-model mean due to the difference in model means. Adding the latter in is easy (at least if you don’t mind the fact that the method includes a bit of extra spread due to ‘weather noise’ still remaining in the model mean because it ends up counted twice.)

    I’ll add both later. (I’d add (b), but there is no point in adding it until after I run the monte carlo. For now– just attend to the note and don’t make conclusions, especially about recent years.)
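
    A minimal sketch of that difference approach in R, with synthetic stand-ins for the two series (the numbers are invented and this is not the code behind the figure; the naive OLS interval still needs the corrections noted above):

    set.seed(2)
    n <- 140                                                                 # Jan 2001 - Aug 2012, monthly
    t <- (seq_len(n) - 1) / 120                                              # time in decades
    mmm     <- 0.23 * t + arima.sim(list(ar = 0.3), n, sd = 0.03)            # stand-in multi-model mean
    gistemp <- 0.10 * t + arima.sim(list(ar = 0.5, ma = 0.3), n, sd = 0.09)  # stand-in observations
    fit <- lm(I(gistemp - mmm) ~ t)          # trend in the difference (observation minus MMM)
    coef(fit)["t"]                           # trend in C/decade
    confint(fit, "t")                        # naive OLS interval; the monte carlo correction and
                                             # the model-mean spread still need to be added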

  84. Re: Nick Stokes (Comment #104577)

    Lucia,

    “Testing forecasts is different from testing hindcasts. Period. There is nothing to ‘show.’”
    So what do you do differently for forecasts?

    Do you mean: what do you do differently other than placing your bet before the race?

  85. Nathan: Lucia has observed that if we look at the model means, the trends compute to higher than 0.2. You can just keep using “about” to say
    0.23 is “about 0.2”.
    And 0.17 is “about 0.2”.
    And 0.17 is also “about 0.15”,

    Ergo 0.23 is “about” 0.15.

    In the current situation 0.17 disagrees with the multi-model mean which is greater than 0.2.

  86. Nathan —
    Let me get this straight. The purpose of the exercise is to evaluate AR4 model projections, specifically temperature. You’re saying that we shouldn’t compare the trend of the model projections against measured temperatures, but instead we should compare the trend computed from adjusted measured temperatures since 1979, against measured temperatures?

  87. Oliver

    Do you mean: what do you do differently other than placing your bet before the race?

    I predict JFK will be shot in 1963, Columbus will arrive in the Americas in 1492 and a guy named William will conquer England in 1066. I also predict the end of the world will occur in November 2012.

    If the world does not end in 2012, I will insist we use Nick’s rules to decree I am right in 3/4 cases with each prediction being a prediction of a low probability event. We will then deem me a true psychic.

  88. Regarding the 0.2°C/decade trend, there is the narrower question: is the 0.2°C/decade trend that the IPCC suggested we should expect, which is based upon an assumed climate sensitivity of 3°C/doubling of CO2, consistent with observed data?

    The reason this is interesting is if the 0.2°C/decade trend isn’t consistent with data, what it suggests is 3°C/doubling might be too high of a sensitivity.

    The 3°C/doubling = 0.2°C/decade relationship is pretty much model independent, since you are starting with an assumed sensitivity (instead of asking the model to provide it) plus a forecast of radiative forcings, as long as you’ve looked at periods long enough that internal variability doesn’t dominate the trend estimate. And of course that’s the purpose of this post: to take into account the uncertainty introduced by internal variability.

    If you get a disagreement between trend and IPCC prediction, there are three explanations: 1) the sensitivity is lower than the IPCC favored value of 3°C/doubling of CO2, 2) forecast radiative forcings did not follow the trend assumed by the IPCC, or 3) there is other internal variability that isn’t being modeled using the method described above (e.g. a 56-year period AMO). [Or some combination of these three.]

    The bottom line is if the real number is much below 0.2°C/decade, you probably aren’t going to be able to explain that using a climate model that has a 3°C/doubling of CO2 sensitivity.
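
    A rough back-of-envelope version of that arithmetic, with illustrative numbers of my own choosing (none of them are taken from the AR4):

    F2x       <- 3.7    # W/m^2 per doubling of CO2 (commonly quoted value)
    TCR       <- 2.0    # assumed transient response, C per doubling, roughly what a ~3 C
                        # equilibrium sensitivity tends to go with
    dF_decade <- 0.35   # assumed net forcing growth, W/m^2 per decade
    TCR / F2x * dF_decade   # ~0.19 C/decade, i.e. "about 0.2"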

  89. Nathan:

    We have already seen (based on Foster and Rahmstorf) that the reason they’re running hotter is basically because they overestimated the TSI and MEI… How do you know that this problem will continue?

    I’m not sure I agree with your argument about what “we have already seen” but you should realize when you say “they’re running hotter is basically because they overestimated the TSI and MEI…”, you’re arguing that the models are overestimating CO2 sensitivity.

    “I am guessing here, but I would think the logical way to create an improved model would be to test it against real world data and then see where your model went wrong. That way you wouldn’t be completely re-writing the model, you’d be updating an old one.

    I am assuming this is probably what does happen.

    Do you know how modellers move from model A to model Av2?”

    Nathan, it is my understanding that the modelers claim not to fit their models to real world data but rather to apply the appropriate physics based on the Navier–Stokes equations and, in part, parameterizations where the physics becomes intractable. The ideal is then to test the model results from first or near-first principles against real world results. Corrections/adjustments that are made based purely on fitting the model results to the real world would of course make a sham of the modelers’ claim as described above and further make any statistical analysis of the models’ forecasting (and hindcasting) skill difficult to impossible to do properly. I think the use of corrections such as those for aerosols, where modelers apparently use different values that tend to make models conform that otherwise would not, is a potential source of this problem.

    My point is that model performances and skills need to be evaluated independently and particularly so where out-of-sample testing is difficult to do and takes years of waiting to apply. The models would have to be run, not by the modelers, but by an independent group and all model runs accounted for and reported. Adjustments made, above and beyond the first principles, would have to be closely scrutinized as to how objectively these adjustments were derived and how it might differ from model to model.

    Maybe someone reading here is aware of an effort like the one I described.

  91. Kenneth
    You quoted Nathan

    “I am guessing here, but I would think the logical way to create an improved model would be to test it against real world data and then see where your model went wrong.

    What is truly amazing about Nathan’s comment is that in context the notion that it is logical to test the model against real world data appears to be being advanced as the reason why comparing models against real world data is not a useful thing to do.

    I think it is logical to test models against real world data and I do it. I think this step always needs to be done.

    Do my current tests tell me anything about how future models might do? No. It merely tests a recent batch that is widely promulgated and which many suggest we should take seriously — particularly when making plans about the future.

    As for my view of Nathan’s opinion of the value of this exercise: If he doesn’t think it’s worth doing, he doesn’t have to do it. Which he doesn’t. He is also permitted to pay no attention to the comparisons at all. But while protesting he does seem to pay an awful lot of attention to the comparisons.

  92. Nathan

    HaroldW

    I know the line, it’s just that Lucia has decided that ‘about 0.2C’ actually means ’0.2C’.

    apparently 0.17C cannot be considered ‘about 0.2C’

    If you read the paper itself, the adjusted warming trend for UAH is 0.141C/dec. If you think it’s ok to call 0.17 “about 0.2C/dec”, I assume you would call 0.141 “about 0.1C/dec”. That’s not “about 0.2C”. And as I have noted, 0.2C/decade is outside the uncertainty intervals for UAH. So, claiming 0.141 C/decade (which rounds to 0.1) is “about” 0.2C/dec when 0.2C/dec isn’t even in the 95% uncertainty bounds is silly.

  93. Nathan,
    “However, this is after adjusting for MEI and TSI.”
    You really seem to put a lot of weight on the conclusions of that paper; you may be already aware that there are lots of people who think the F&R study is little more than a curve fit exercise.
    .
    To wit: Consider the rather odd conclusion from F&R that the response of temperature to variations in TSI and AOD have very different lags (5 to 7 months for AOD, and 0 to 1 month for TSI), even though both are presumably radiative forcings. The calculated lag of near zero for “optimum fit” for TSI is non-physical in light of the considerable thermal inertia of the system. How that made it past review without a clear and convincing explanation is puzzling. Even the lag of 5-7 months for AOD seems remarkably short, and only consistent with an effective thermal capacity that is lower than expected.

  94. Lucia,

    What is truly amazing about Nathan’s comment is that in context the notion that it is logical to test the model against real world data appears to be being advanced as the reason why comparing models against real world data is not a useful thing to do.

    Only amazing if one thinks Nathan’s objective is technical progress; not amazing if his objective is to make more likely a certain course of public action. It makes perfect sense if Nathan’s argument is considered a political (rather than scientific) argument, designed to avoid a course of public action (or lack thereof) on climate change which he happens to not agree with. As is too often the case, it’s just politics masquerading as science.

  95. Alexander Harvey (Comment #104557)
    October 4th, 2012 at 8:21 pm

    “Lucia,

    I have checked and there is an exact equation for the required limit values as N goes to infinity for ARMA(1,1)

    ((1+ma)/(1-ar))^2/(1+(ma+ar)^2/(1-ar^2))”

    Alex, how does your equation above square with Foster’s correction factor for the standard error for the regression trend which I see as (1+2p1/(p1-p1p2))^(1/2), where p1 and p2 are the first and second ac coefficients from the regression residuals? How was the equation you show derived? What is the relationship of ar and ma to the acf coefficients?

  96. Lucia (Comment #104617)
    “If the world does not end in 2012, I will insist we use Nick’s rules to decree…”

    I press my question – what do you do differently for the “forecast” years? OK, you choose to talk about it differently. But in fact all you do is test whether GISS data, which has no forecast element, shows a trend significantly less than 2.0 °/cen. As your title says. Same test whether it’s 1990 or 2001.

    You don’t even really know whether 2001 was forecast or hindcast (by whom?). Doesn’t matter.

  97. SteveF
    Politics? I’m simply saying that information about whether some (unspecified) model was running forecast or hindcast is not used in the analysis (and is in fact unavailable to it).

  98. SteveF:

    Your politics are showing Nick.

    Or the lack of any real empirical training. The people I work with would eat somebody up and spit them out over a statement like that.

    The only way it would never matter is if “the model can’t be used in this way.”

  99. Nick
    I’ve already answered your question in lucia (Comment #104593) when I said “interpret what it means”. That is an answer to your question. That is: what you do differently is you interpret what the results of the test mean differently. If the test is applied to a forecast, you can interpret whether the method has an ability to forecast, i.e.: predict data not known while developing the method. If it’s applied to a hindcast, you interpret whether people have managed to get it to fit data already available during development.

    If you are asking what mathematical operations you do differently, none. But shoving data into mathematical operations is not the only thing one does with data. Interpreting what it means is an important thing that is done. And that is done differently.

    You don’t even really know whether 2001 was forecast or hindcast (by whom?). Doesn’t matter.

    We dang well know 1990 isn’t a forecast!

    But moreover, these forecasts are created using SRES and those were not available prior to 2001. Once they were chosen, temperatures after 2001 couldn’t affect this aspect of the decision and modeling groups had to use those.

    Generally, the specific AOGCMs’ runs were created before the TAR– in fact, they had to be run after the SRES were available. So, the process for creating these things is frozen about that time.

    So 2001 actually represents a date when one can legitimately claim a boundary defining data in the forecast period. That is: data modelers couldn’t “peek at” to tweak anything coming out of their models.

    1980, 1990 and 1995 don’t fall in the forecast period. It’s not even close.

  100. Nick,

    I’m simply saying that information about whether some (unspecified) model was running forecast or hindcast is not used in the analysis (and is in fact unavailable to it).

    The information is available to the analyst who can use that information to interpret what the output of some mathemagical operation can mean.

    Interpreting results is doing something. It’s even an important thing that one ought to do.

  101. Lucia,
    “We dang well know 1990 isn’t a forecast!”
    Well, no, you don’t. The model may not have used post-1990 data. And how do you connect SRES to the 2 °/Cen number?

    There is also the aspect that your 1990 number is actually a GISS trend from 1990 to 2012. Is that forecast or hindcast? Does the distinction even make sense?

  102. Nick

    Well, no, you don’t. The model may not have used post-1990 data.

    If you are trying to suggest this is the case for models in the AR4, this is nonsense. Total nonsense.

    And how do you connect SRES to the 2 °/Cen number?

    The SRES were used to drive the models.

    There is also the aspect that your 1990 number is actually a GISS trend from 1990 to 2012.

    It’s mixed. But hindcast data affect that computation. And the degree to which the model was already tuned to that data would affect the correct value of the confidence intervals if you bothered to compute them. It could be done– you could run monte-carlo with some sort of “preload” to recognize that the data from 1990-2000 you are using was in the hindcast and already “agreed” and then find the magnitude of deviation required to reject a hypothesis that m=predicted value.

    That deviation is going to be smaller than the value you would get if you assume that you didn’t actually know the data from 1990-2000 before making your forecast. Someone could do this using monte carlo. But it’s a PITA and there is no advantage relative to just limiting testing of the forecast to data that arrived after the forecast was “frozen”.

    It’s not as if testing the forecast is difficult– just stick to the later data and otherwise use standard methods. Trying to deal with mixed stuff would be complicated– and no one does it. Those who want to test “forecasts” using that data would rather just delude themselves that it’s ok to claim you can test forecasting ability using data available before the forecast period.
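
    A minimal sketch of “stick to the later data and otherwise use standard methods”, run here on synthetic data rather than GISTemp (the trend, ar, ma and sigma values are invented, and the interval would still need the monte carlo small-sample correction discussed in the post):

    set.seed(3)
    n  <- 140                                              # Jan 2001 - Aug 2012, monthly
    tt <- cbind(trend = (seq_len(n) - 1) / 120)            # time in decades, as a named regressor
    y  <- 0.10 * tt[, "trend"] + arima.sim(list(ar = 0.5, ma = 0.3), n, sd = 0.09)
    fit <- arima(y, order = c(1, 0, 1), xreg = tt)         # linear trend with ARMA(1,1) errors
    b   <- unname(fit$coef["trend"])
    se  <- sqrt(fit$var.coef["trend", "trend"])
    c(trend = b, lo = b - 2 * se, hi = b + 2 * se)         # compare hi against 0.2 C/decade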

  103. Carrick,
    ” The people I work with would eat somebody up and spit them out over a statement like that.”

    OK, how would these bulimic cannibals say that forecast/hindcast (of what?) matters to this analysis?

  104. A mind that can imagine “bulimic cannibals” as a retort but cannot conceptualize a distinction, as evidenced by “Is that forecast or hindcast? Does the distinction even make sense?”, is mind boggling.

  105. Nick Stokes,
    “bulimic cannibals”
    Humm.. chewing and spitting out is not bulimic… that would involve swallowing and later vomiting. Never heard of cannibals that had this eating disorder.

  106. SteveF’s climate science axiom: Anyone who can’t or won’t acknowledge clear factual evidence is not interested in advancing understanding; they have other motivations.

  107. Lucia,
    “The SRES were used to drive the models. “

    The SRES were simply a set of emissions scenarios, published in 2000. In terms of whether a model run is a forecast or not, the relevant question is what forcings they used. Even for GHGs, they could well have used observed GHG ppmv post 2000 where available.

  108. Lucia

    “What is truly amazing about Nathan’s comment is that in context the notion that it is logical to test the model against real world data appears to be being advanced as the reason why comparing models against real world data is not a useful thing to do.”

    You are again linking the multi-model mean to the performance of individual models.

    Your test here is not comparing real world data to the models. It’s comparing real world data to the average of a bunch of models.

    Additionally this was simply a step to investigating what part of the model isn’t working. The models make assumptions about TSI, MEI, etc. and if it is simply that these assumptions are wrong, then it’s not a problem with the model, it’s a problem with the assumed TSI and MEI.

    This means the models are actually fine, in the sense that Kenneth was talking about (in terms of the driving physics equations).

    This also shows that Carrick’s claim that the less than 0.2 we see for the last few years implies a climate sensitivity of less than 3 is not right either. The climate sensitivity is not the only reason they gave a mmm projection of about 0.2; it was also based on assumptions of TSI and MEI etc.

    And yes Lucia, 0.14 is ‘about 0.1’. Significant figures aren’t hard to understand.
    I think the 0.17 was for GISTemp, no?

  109. SteveF

    “SteveF’s climate science axiom: Anyone who can’t or won’t acknowledge clear factual evidence is not interested in advancing understanding; they have other motivations.”

    I am actually trying to encourage Lucia to take this investigation further to advance understanding.

  110. Nick

    In terms of whether a model run is a forecast or not, the relevant question is what forcings they used. Even for GHGs, they could well have used observed GHG ppmv post 2000 where available.

    You are making distinctions without differences. Forcings were selected based on the emissions scenarios or SRES and those drove the models.

    observed GHG ppmv post 2000 where available.

    Clearly, they can’t observe things that have either not occurred or not been measured.

  111. Kenneth Fritsch (Comment #104639)
    October 5th, 2012 at 12:52 pm

    Alexander Harvey (Comment #104557)
    October 4th, 2012 at 8:21 pm

    Alex, my skills with linear algebra are limited but I think I know that the AR coefficients can be derived from ACF coefficients through the Yule-Walker equations and the Levinson algorithm. I also am aware of R functions (black box to me) for deriving AR coefficients from ACF coefficients, but that doesn’t help me understand how Foster derived his degree of freedom adjustment.
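
    A small illustration of the ACF-to-AR step Kenneth mentions, using base R’s stats functions (purely illustrative; the ar and ma values are arbitrary and this does not reproduce Foster’s adjustment):

    rho <- ARMAacf(ar = 0.5, ma = 0.3, lag.max = 10)   # theoretical ACF of an ARMA(1,1), lags 0..10
    acf2AR(rho)                                        # Yule-Walker AR coefficients for orders 1..10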

  112. Nathan

    And yes Lucia, 0.14 is ‘about 0.1′. significant figures aren’t hard to understand.
    I think the 0.17 was for GIStemp, no?

    You were responding to this

    trends with RSS and UAH are both below 0.2C/dec.

    RSS and UAH are not GISTemp, and observing that their trends are not about 0.2C/dec is not defining “about 0.2C/dec” as == 0.2. Their trends would be “about 0.1” if we round the way you do (which I wouldn’t.)

  113. Fine Lucia,
    I thought we were talking about them all… But yes sure, 0.14 is ‘about 0.1’

    How would you round?

  114. I think we have had this discussion with Nick before with regards to selection of proxies for temperature reconstructions and as far as I remember he has not been able to conceptualize the difference between in-sample and out-of-sample testing from a statistical standpoint. I have seen this failure before – and with scientists and otherwise intelligent people. For some reason there is seldom an epiphany where these people slap their heads with an “oh my god now I see”, so I think it is best we move on.

  115. Nathan (Comment #104661)

    Nathan, you would not be offended if I call you Nick II or Nick Jr.

  116. Of course not.

    Strange thing is, Kenneth, we’re more or less in agreement. Is that not something to celebrate?

  117. SteveF,
    “chewing and spitting out is not bulimic”
    OK, I guess I’ll just have to take it as a slight to my edibility.

    “Anyone who can’t or won’t acknowledge clear factual evidence…”
    I guess that means me. I’m in the familiar position of being accused of not accepting clear facts that can’t be clearly stated. The facts are:
    1. The AR4 SPM (2007) has a rather vague statement that:
    “For the next two decades, a warming of about 0.2°C per decade is projected for a range of SRES emission scenarios.”
    2. Lucia has taken this and done an analysis of trends in Gistemp to 2012, looking at whether they differ significantly from 0.2°C per decade.

    First observation – “next two decades” would reasonably mean 2007-2026, or maybe 2011-2030. It does not associate the figure 0.2 with 2001 or 2002.

    The SPM does not say what model runs the figure is based on; it does say some used SRES scenarios.

    Even if model runs did switch from observed GHG forcings to scenario based forcings from 2001-2005, it is extremely unlikely that this switch would have made any difference to the “about 0.2” figure, especially as it relates to 2001 observations.

    That’s the only number which connects Lucia’s analysis of existing GISS data to the SPM projection.

  118. Lucia I think I can see where our confusion is coming from.

    ” So, claiming 0.141 C/decade — which rounds to 0.1 is “about” 0.2C/dec when 0.2C/dec isn’t even in the 95% uncertainty bounds is silly.”

    Ok, so you are testing whether 0.2C lies within the 95% bounds of the measured UAH. We both agree that it presently lies outside those bounds, and yes I would round 0.141C to 0.1C if I wanted to say what it was to 1 sig dig.

    But, what I am saying is that the test shouldn’t be to see if 0.2C lies within the bounds, as the 0.2C estimation has uncertainty. They said ‘about 0.2C’ so how do you address what they mean by ‘about’. You can’t test for exactly 0.2C and claim they’re wrong, when they said ‘about 0.2C’.

    From memory I don’t think they were clear, but I don’t see how you can ignore the implied uncertainty.

  119. HaroldW

    “Nathan –
    Let me get this straight. The purpose of the exercise is to evaluate AR4 model projections, specifically temperature. You’re saying that we shouldn’t compare the trend of the model projections against measured temperatures, but instead we should compare the trend computed from adjusted measured temperatures since 1979, against measured temperatures?”

    See, this is the problem, you can’t say the ‘models’ are wrong by looking at the multi-model mean. It just makes no sense.
    You CAN say the multi-model mean doesn’t match the current temps, but this doesn’t mean the ‘models are wrong’.

  120. Nathan,
    It sure means the average of the modeled trends is wrong. No need to concern ourselves with whether the individual models are wrong; they cover such a wide range of diagnosed sensitivities that most of them (and maybe all) must be wrong, since there can only be one correct value for climate sensitivity. What Lucia shows is that the model mean is biased high relative to what has actually happened. And yes, I do think that is important to know.

  121. SteveF

    “No need to concern ourselves with whether the individual models are wrong; they cover such a wide range of diagnosed sensitivities that most of them (and maybe all) must be wrong, since there can only be one correct value for climate sensitivity. ”

    This is really strange. The only way we will improve the future estimate is by seeing which models perform well and which don’t. It is critical to look at the performance of individual models.

    “What lucia shows is that the model mean is biased high relative to what has actually happened. And yes, i do think that is important to know.”

    Ok, so why is it important? To me it seems relatively trivial, because there is no investigation of why the current temps are lower. It says nothing about what to expect in the future. We’ve seen rapid increases in temp in the recent past, so when they say about 0.2C/decade for the next two decades you have to consider that over the next 5 years or so we could see a rise in temps that makes the trend about 0.2C/decade.

    Lucia claimed people were using the mmm for planning. But could not give any evidence for that.

  122. Nick:

    OK, how would these bulimic cannibals say that forecast/hindcast (of what?) matters to this analysis?

    Hm..”chewin’ and spittin'” may not be terms you Aussies know about.

    Nasty habit many Southerners have in the US … chewing tobacco then spitting it out in e.g., used coke bottles. I’m actually glad they don’t chew up and spit out people who don’t know how to test models with data. Chewed tobacco is bad enough to see in empty bottles.

    Prior to circa 2001, the modelers were informed about how temperature behaved for that period.

    Humans being humans, we get affected by prior knowledge when we make decisions about parameters in our model, especially if our model ends up not aligning well with data. This probably doesn’t matter much for the really crappy models that nobody believes anyway, but the better models almost certainly get enough trials that they can work out whether their model is agreeing with data or not, as long as you don’t extend the model past the period where there is data, that is.

    Of course if the model is really bad (or if they don’t have enough CPU power to make enough runs to allow them to adjust their model if it isn’t fitting well), then you’d expect that model to not do well during the overlap period where simulations and data were contemporaneously available. In principle, you could use poor performance during the period 1980-2000 for example to winnow out the models that are so s*cky that we shouldn’t include them in our average.

    You could even argue that were the IPCC not such a politicized organization, these model runs wouldn’t have been included to start with, or there’d be an additional notation “excellent!”, “great”, “good”, “poor”, “bad”, “crappy”, “s*cky”, “really bites”, etc.

    Anyway in terms of comparing measurements to model output, the period for which data was available is often referred to as the “verification period”. No model that fails to verify should be really considered seriously (in AR4 models fail in significant ways). Fails to verify = model contains serious flaws in the implementations of the known physics (wrong parameters in parametrized physics is included in that).

    The period post when data was available is often referred to as the “validation period”. Since it’s the period where you didn’t have a priori knowledge, it’s the only part that can be used for validating the model. Fails to validate means there are flaws in the assumed physics, which is more serious than just an implementation problem. (These could be as mundane as wrong forcings, as “simple” as too coarse a model to something more serious, like flaws in underlying physical model used.)

    We all get this (at least in my organization we know about this and the importance of model testing as well as building). I’m actually convinced you know all of this too, which makes me wonder as to your motive for bringing this up without providing your own explanation for why it’s important.

  123. Nathan,
    If the objective is to improve models, then each model would need enough independent runs to give a fair representation of the model variability. It also means each model running the same assumed forcings and aerosols. No fixing up the models by adjusting aerosols at will. Then start to cull the herd.

  124. Interesting question. Why are people so resistant to the very notion that sensitivity might be less than 3C when at least 50% of the CDF for sensitivity is below 3C?

    I find that really odd. I find it even more odd with the recent studies that have argued for numbers between 2.5 and 3.

    Even the arguments around ‘about .2C’ are odd (in a denialist way).
    About .2C could mean .17C. Brilliant. It could also mean .2356.
    Imagine the outrage if Lucia tested .2356C and argued that it was “about .2C”. Really odd argument, that one.

    If people want to argue that the IPCC published untestable claims and led people to believe that they were summarizing science, if people want to accuse them of malpractice, well go ahead. But it seems very odd to criticize people like Lucia who take them at their word and test the claims.

    That’s probably the threshold question.

    1. is the IPCC claim malpractice or a testable claim?

    then, if it is a testable claim, how should we test it ?

    But arguing against testing it is just silly, because then one is left with the question: what kind of claim is it that pretends to be science but can’t be tested?

  125. Mosher that’s a straw man.

    Most people are open to climate sensitivity being between 1.5 and 4.5

    I think James Annan often claims it is ‘3’ but probably he means ‘about 3’.

    It confuses me that Lukewarmers don’t actually attempt to calculate climate sensitivity, when that’s the fundamental point of difference between yourselves and the IPCC.

    “Even the arguments around ‘about .2C’ are odd (in a denialist way).
    About .2C could mean .17C. Brilliant. It could also mean .2356.
    Imagine the outrage if Lucia tested .2356C and argued that it was “about .2C”. Really odd argument, that one.”

    They’re not odd. You have it around the wrong way. If Lucia calculated something at 0.2356C one would expect she had the number right, so there’d be no need to say that was ‘about 0.2C’ (which it actually is). The IPCC used ‘about 0.2C’ because they weren’t prepared to give it more certainty. This is because there is uncertainty in their projection. This uncertainty is what Lucia is ignoring.

  126. Steven–
    The claim that it’s uninteresting to test is also odd. Reasons why it might be uninteresting to test are things like:
    1) Any claim about AGW is so unimportant we shouldn’t pay any attention to it– so no need to test.
    2) Claims about AGW are important, but it’s not necessary to know whether the claims are correct.

    And so on.

  127. Mosher

    “But arguing against testing it, is just silly because then one is left with the question ? what kind of claim is it.. that pretends to be science but cant be tested.”

    Well, you need to make an appropriate test for a start.
    I don’t think ‘about 0.2C’ is untestable. A simple test would be are the main temp indices within the 0.150000…01 to 0.24999999… band? Don’t forget to add your error bars for the main indices too…

  128. Lucia,

    testing whether the indices are trending at exactly 0.2C/decade is not interesting, because no one claimed they should.

    You still can’t say WHY it is interesting.

  129. SteveF–
    I have sometimes looked at individual models in the past. There is enough data to show some are wrong. I’m pretty sure Nathan won’t like those tests either. But I think Nathan’s opinions are pretty unimportant. If he doesn’t want to test models or read about models being tested and so on, he doesn’t need to spend any of his time doing testing. Not my problem. I’m going to resolve to ignore Nathan’s yammering. Because it’s pretty much just a lot of unimportant silliness and I’m beginning to suspect his goal is mostly to waste my time responding to his series of idiotic strawmen, misstatements and dunderheadedness.

  130. Mosher

    ” if it is a testable claim, how should we test it ?”

    I agree. I think it is probably testable.
    So, how should we test it?

  131. Lucia

    “I’m pretty sure Nathan won’t like those tests either. But I think Nathan’s opinions are pretty unimportant. If he doesn’t want to test models or read about models being tested and so on, he doesn’t need to spend any of his time doing testing. Not my problem. I’m going to resolve to ignore Nathan’s yammering. Because it’s pretty much just a lot of unimportant silliness.”

    Well, this pretty much contradicts everything I’ve said. I have consistently said you need to look at individual models. Then see why they don’t work.

    You just ignore me, yes… Good grief.

    Yes ignore the implied uncertainty in ‘about 0.2C’ because it’s unimportant silliness…

  132. Lucia,

    ok I will excuse myself from this discussion.

    but this:

    “1) Any claim about AGW is so unimportant we shouldn’t pay any attention to it– so no need to test.
    2) Claims about AGW are important, but it’s not necessary to know whether the claims are correct.”

    Is not even close to my position. My position is that your test is not rigorous enough. My position is that testing is essential, but that this test does not go far enough.

  133. Nathan, how far you want to test is until it reaches your preconceptions. That is confirmation bias. That is your position from what you have written.

  134. SM said “Why are people so resistant to the very notion that sensitivity might be less than 3C when at least 50% of the CDF for sensitivity is below 3C?
    I find that really odd. I find it even more odd with the recent studies that have argued for numbers between 2.5 and 3.”

    I’m sure some people do argue that, but I don’t think that is the consensus view, e.g.

    http://www.ipcc.ch/publications_and_data/ar4/wg1/en/ch9s9-6-4.html

    “…Results from studies of observed climate change and the consistency of estimates from different time periods indicate that ECS is very likely larger than 1.5°C with a most likely value between 2°C and 3°C….”

    I also agree with Annan

    http://julesandjames.blogspot.co.uk/2010/08/how-not-to-compare-models-to-data-part.html?showComment=1281802422423#c6813569415250812342

    “…Every extra year that does not break the 1998 HadCRUT record will be a small piece of evidence towards a slightly lower sensitivity/transient response, but this process is a slow drift in my beliefs and there is nothing very conclusive yet” IMO…..

    But I am less confident that the gradient of the MMM up to 2012 tells us that much about the climate sensitivity or TCR – when we compared the short term trend of the different models last time there didn’t seem much correlation between ECS/TCR and short term trend.

    Indeed I think if we did the same analysis that Lucia has done with real temperatures with the actual model runs that make up the MM ensemble then a lot of the model runs would ‘falsify’ the MMM ! (and I take the point this is a lot to do with the model spread)

  135. Nathan:
    “testing whether the indices are trending at exactly 0.2C/decade is not interesting, because no one claimed they should.”

    The IPCC says the trend is about 0.2C per decade.
    Lucia shows that the trend is certainly less than 0.2C.
    We can exclude values above 0.2C. Is this result not interesting in relation to the original claim? I’m at a loss as to why you say this.

    If your truck is about 5.8m high and the bridge says you can pass if it is lower than 5.85m, I think you would appreciate additional information that excluded the truck being higher than 5.80m.

  136. Niels A Nielsen
    “The IPCC says the trend is about 0.2C per decade. Lucia shows that the trend is certainly less than 0.2C.”

    No. The IPCC in the 2007 SPM projected a trend of “about 0.2C per decade” over the “next two decades”. Lucia showed that at 95% confidence the trend from 2001 (and 2002) to 2012 was less than 0.2. There is no way this period could be construed as the “next two decades” from 2007.

  137. Lucia,

    I must apologise.

    Stating things in terms of the response vector R.

    R(0) = 1
    R(1) = (ar+ma)
    R(2) = (ar+ma)*ar
    R(3) = (ar+ma)*ar^2

    etc. What I wrote was correct but unhelpful.

    Working from model parameters it gives the correct result.

    However:

    In terms of the autocovariance vector at lag j g(j) = R · R(j) {where · indicates the dot product}

    For the variance explained by the slope.

    Vs = g(0) + 2 * Sum(g(j) * w(j)) {summed from j=1} after Lee & Lund

    In the limit, as N goes to infinity, whilst g(j) goes to 0 as j increases, w(j) remains =~ 1, so

    Vs = g(0) + 2 * Sum(g(j))

    Vs = g(0) + 2 * g(1)/(1 – g(2)/g(1))

    so Vs/g(0) = 1 + 2 * (g(1)/g(0))/(1 – g(2)/g(1))

    so in terms of autocorrelation

    Vs/g(0) = 1 + 2 * rho(1)/(1-rho(2)/rho(1))

    giving Vs/g(0) = 1 + 2 * rho(1)/(1 – phi) Foster & Rahmstorf

    So the result I very much maligned is correct.

    So I apologise to G Foster and S Rahmstorf and all I may have misled.

    Why I chose to mistake an autocorrelation vector for a response vector, even to the degree of describing one and calling it the other, is a problem for me. It both puzzles and worries me.

    So F&R are very much correct. I was very much away with the Faeries.

    Working from the ARMA(1,1) process model, I stated the correct formula in terms of model coefficients, but that is quite a different matter.

    It is true, but trivial, that for an AR(1) process, R(i), g(i) and rho(i) are all exponential and except for a scaling interchangeable, but that makes for no excuse.

    Alex
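
    A quick numerical check of that limit result, comparing the truncated Lee & Lund sum with the two closed forms discussed in this thread (the ar and ma values are arbitrary; this is not Alex’s or Tamino’s code):

    ar <- 0.55; ma <- 0.30
    rho <- ARMAacf(ar = ar, ma = ma, lag.max = 500)                      # theoretical ACF, lags 0..500
    vs_sum  <- 1 + 2 * sum(rho[-1])                                      # N -> infinity sum, truncated
    vs_acf  <- 1 + 2 * rho["1"] / (1 - rho["2"] / rho["1"])              # 1 + 2*rho(1)/(1 - phi), the F&R form
    vs_arma <- ((1 + ma) / (1 - ar))^2 / (1 + (ma + ar)^2 / (1 - ar^2))  # closed form in ARMA coefficients
    c(vs_sum, unname(vs_acf), vs_arma)                                   # all three agree (about 4.1 here)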

  138. PeteB–

    Indeed I think if we did the same analysis that Lucia has done with real temperatures with the actual model runs that make up the MM ensemble then a lot of the model runs would ‘falsify’ the MMM ! (and I take the point this is a lot to do with the model spread)

    I do plan to do things with individual models and/or runs. There are actually extra things that need to be done with individual model runs. The reason extra things need to be done is that they can be done. 🙂

    Among those:

    1) For models where we have more than 1 run, we can check whether the method gives too large or too small uncertainty intervals by comparing the variability over repeat runs. (A toy version of this check is sketched below.)

    2) Potentially, we could also find the (AR,MA) pair that maximizes the AIC for *all three runs in a period together*. (That is: if there is a ‘true’ AR,MA, and that’s a property of the “weather”, the AR,MA ought to be the same for all three runs.)

    3) There is a bit of a question whether it ought to be the same for all periods– it might not be when volcanic eruptions occur. But otherwise, for periods with slowly varying forcings it ought to be. With models, we have some control runs– and we should be able to get the best AR,MA then. Also, the 21st century runs all have slowly varying forcings– so we should be able to get those from there.

    Whether or not (2) and (3) get done, (1) has to be done. We need to know whether the method gives too small or too large confidence intervals for a model. This is actually also necessary for the MMM. If the error bars are too small…. no one is going to be convinced.
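
    A toy version of check (1) in R, with synthetic “runs” sharing one forced trend and one ARMA(1,1) weather process (invented parameters, not CMIP output):

    set.seed(4)
    n  <- 140
    tt <- cbind(trend = (seq_len(n) - 1) / 120)
    one_run <- function() {
      y   <- 0.2 * tt[, "trend"] + arima.sim(list(ar = 0.5, ma = 0.3), n, sd = 0.09)
      fit <- arima(y, order = c(1, 0, 1), xreg = tt)
      c(b = unname(fit$coef["trend"]), se = sqrt(fit$var.coef["trend", "trend"]))
    }
    runs <- replicate(200, one_run())
    sd(runs["b", ])      # actual run-to-run spread of fitted trends
    mean(runs["se", ])   # spread the method claims; if this is smaller, the intervals are too narrow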

    (BTW: If you read a previous post: http://rankexploits.com/musings/2012/how-far-off-are-uncertainty-intervals-computed-using-red-noise/, where I discussed how large they were compared to uncorrected ‘red noise’, you know I’ve already begun doing this. And, I anticipate these error bars will be large enough– but I can’t be sure until I do it.)

    In that graph, you will see this: [figure shown in the linked post]

    I’m going to repeat that exercise to see if fitting with ARIMA and adjusting for number of years helps to bound the uncertainty intervals. (I don’t expect perfection. Enough said for a comment in a post.)

    Anyway: While (1) is being done, people are going to be seeing more stuff compared to the MMM which I think is interesting in and of itself and which — in any case– must be done if you are doing individual models. Because it is always important to know if a collection is biased.

    ( BTW: I think these two things hold even if someone like Nathan wants to yammer on and on about how it is uninteresting to compare data to a multi-model mean. And also wants to complain his questions, which have been answered repeatedly, were not answered merely because he disagrees with the answers!)

  139. Nathan/

    “I don’t think ‘about 0.2C’ is untestable. A simple test would be are the main temp indices within the 0.150000…01 to 0.24999999… band? Don’t forget to add your error bars for the main indices too”

    Keep your identity hidden, Nathan. That kind of stupidity could be career ending.

  140. In response to Nathan’s question of why: the quotes indicate that Nathan and Nick appear to be throwing out AR4. Lucia, based on the hindcasting, you could make the argument that 2000 is the year to start for testing the skill at forecasting per chapter 9, especially wrt the model and weather noise point. The quotes also show why you can’t use years before 2000, since the models and the projections are based on whether the scaling and results matched the observational record. According to chapter 9, most if not all of Nick’s and Nathan’s concerns over your test were answered, and the AR4 agrees with your approach. Perhaps later I will go get the Ch 10 material where we examine what they wrote about natural variability by 2030, and the basis for the 0.2 for the first two decades, which graphically and by way of Ch 9 indicate that 2000 is the correct start date so that they could claim the variability factor by 2030 and not 2031. That is the 30-year definition that AR4 states is used. But I have a few chores to do.

    Ch 9 p 665
    Greenhouse gas forcing has very likely caused most of the observed global warming over the last 50 years. This conclusion takes into account observational and forcing uncertainty, and the possibility that the response to solar forcing could be underestimated by climate models. It is also robust to the use of different climate models, different methods for estimating the responses to external forcing and variations in the analysis technique.
    p 669
    Detection and attribution results based on several models or several forcing histories do provide information on the effects of model and forcing uncertainty. Such studies suggest that while model uncertainty is important, key results, such as attribution of a human influence on temperature change during the latter half of the 20th century, are robust.
    p 670
    These ‘forward calculations’ can then be directly compared to the observed changes in the climate system. Uncertainties in these simulations result from uncertainties in the radiative forcings that are used, and from model uncertainties that affect the simulated response to the forcings. Forward calculations are explored in this chapter and compared to observed climate change….. However, for the linear combination of responses to be considered consistent with the observations, the scaling factors for individual response patterns should indicate that the model does not need to be rescaled to match the observations (Sections 9.1.2, 9.4.1.4 and Appendix 9.A) given uncertainty in the amplitude of forcing, model response and estimate due to internal climate variability.
    p 684
    The interannual variability in the individual simulations that is evident in Figure 9.5 suggests that current models generally simulate large-scale natural internal variability quite well, and also capture the cooling associated with volcanic eruptions on shorter time scales….The fact that climate models are only able to reproduce observed global mean temperature changes over the 20th century when they include anthropogenic forcings, and that they fail to do so when they exclude anthropogenic forcings, is evidence for the influence of humans on global climate.
    p 689
    has a nice graph as to the limits/start of hindcasting

  141. StevenMosher–
    The fact that Nathan jumbles up which things are just observations of facts (0.2C/dec is outside the ARIMA uncertainty band for start years blah, blah and blah) with actual tests like t-tests in the next post and somehow suggests that the t-tests don’t include the uncertainty bands on the projections makes his comments, claims and “advice” singularly ridiculous. Uncertainties in projections are included in all tests.

    Uncertainties in observations– which he seems to think are not accounted for– are automatically included in the ARIMA confidence intervals because they make up a portion of the noise! I realize that some people might want to double, triple and quadruple count the noise– and to make an argument for a claim even more convincing one might include the suggestion to double count the noise when estimating the uncertainty– but that doesn’t mean that double counting the effect of the noise when estimating uncertainty is correct!

  142. Steven Mosher:

    Interesting question. Why are people so resistant to the very notion that sensitivity might be less than 3C when at least 50% of the CDF for sensitivity is below 3C?

    That’s pretty easy. They are taking (what I think is a slightly too high) central tendency as the lower bound.

    In some circles you’ll get booted out for saying it’s less than 3°C or for suggesting that the broad tails on some distributions of climate sensitivities are unphysical (like the ones that hypothetically show a climate sensitivity of 9°C/doubling of CO2; I’d say it’s established fact there are no “hind-cast” models with 9°C/doubling that are even marginally consistent with the 20th century data).

    [I should mention the long-tails are an artifact of very limited constraints from the available data/model output. More realistic sensitivity analyses don’t affect the low-end version much, but pretty well chop off the high end of the curve. See e.g. this. I suspect a better job could be made using an EOF based approach for regional scale climate variability from the “better” models and by rejecting models that fail to verify.]

  143. Carrick, I thought the large tail was from using del T = sum(del F), rather than an impulse of a*exp(-b*t), and plotting on an arithmetic scale and not semi-log. And by choosing priors that were informed, such as it cannot go below 1C for a doubling, rather than having less than 1C as very low probability.

  144. Alexander Harvey (Comment #104690)

    I greatly appreciate your showing the connection between the Foster and Lee/Lund adjustments and equations. That is something I was unable to do.

    To me it is important to know that an adjustment like this one is founded in work from a published paper and the conditions under which it is derived. I think the key is the simplifications made when N goes to infinity that could account for Foster’s (Tamino’s) short segment results differing from a Monte Carlo result. A lingering question would be why Tamino did not do a Monte Carlo simulation.

    As someone with limited skills in some of these areas, but willing to learn, I appreciate the efforts that people like yourself make at these blogs.

  145. Kenneth Fritsch (Comment #104639)

    Which reminds me that my adjustment equation for degrees of freedom from the Foster paper for a trend se for an arima(1,0,1) trend stationary model should have been (1+2p1/(1-p2/p1))^(1/2) and not (1+2p1/(p1-p1p2))^(1/2), as I reported in the post noted above, where p1 and p2 are the first and second correlations in the acf for the regression residuals.
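
    A minimal sketch of applying that corrected factor to a regression on a synthetic monthly series (not Kenneth’s or Foster’s code; the trend and noise parameters are invented):

    set.seed(1)
    n <- 443                                               # roughly Jan 1975 - Nov 2011, monthly
    t <- (seq_len(n) - 1) / 120                            # time in decades
    y <- 0.17 * t + arima.sim(list(ar = 0.5, ma = 0.3), n, sd = 0.1)
    fit <- lm(y ~ t)                                       # OLS trend
    r  <- acf(residuals(fit), lag.max = 2, plot = FALSE)$acf
    p1 <- r[2]; p2 <- r[3]                                 # lag-1 and lag-2 autocorrelations (r[1] is lag 0)
    nu <- 1 + 2 * p1 / (1 - p2 / p1)                       # variance inflation factor
    se_adj <- summary(fit)$coefficients["t", "Std. Error"] * sqrt(nu)
    c(trend = unname(coef(fit)["t"]), adj_se = se_adj)     # adjusted standard error of the trend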

  146. Steve Mosher,

    Why are people so resistant to the very notion that sensitivity might be less than 3C when at least 50% of the CDF for sensitivity is below 3C?

    A good question. The short answer is “People are people, and so less than perfectly rational”.
    A longer version: You are right that even talking about a climate sensitivity below 3C seems verboten in some circles. I believe the reason is that many people perceive (quite correctly) that the urgency of draconian public action to immediately reduce CO2 emissions falls when the mode and shape of the probability function is rationally considered, and falls further when that distribution has a shorter tail of extreme sensitivities.
    .
    The longer temperatures go on not rising very rapidly, the shorter is the credible width of the high sensitivity tail; Lucia’s analysis primarily points to that. If the extreme tail is of very low probability, then the societal ‘risk’ of waiting to see what happens before choosing a course of public action is lower, while the economic benefits of fossil fuel use remain the same.
    .
    The IPCC’s application of a post-hoc ‘prior’ to increase both the mean and the breadth of a published PDF (as shown by Nic Lewis) is characteristic of the problem. Those who believe immediate and drastic reduction in fossil fuel use is absolutely required, as most leading lights in climate science clearly believe, do not want to acknowledge as accurate any result which might prove their technical assessment incorrect, or diminish the probability of drastic public action. IOW, we are seeing a combination of personal values/priorities/morals and confirmation bias influencing acceptance or rejection of rather clear technical results.

  147. Lucia, I am attempting to summarize for my satisfaction what was revealed in these analyses with regards to the Tamino trends and CIs shown in the first post on this subject at this blog. My calculations are in good but not exact agreement with what you have shown in graphs in these threads discussing CIs. I have used the global monthly mean GHCN temperature series from Jan 1975 to Nov 2011 and any difference between that series and what you used could explain the small differences I see.

    What I see is that the Monte Carlo results and those using the Tamino adjustments from the Foster(Tamino) published paper using the acf coefficients to adjust the CIs agree well for both the unfrozen and frozen cases. The Tamino adjustment gives a slightly smaller CI range. That result shows that Tamino used some other method to obtain the graphed results that you showed initially to start this discussion.

    I am assuming that using the se, frozen and unfrozen, does not duplicate what Tamino showed in his graph. Using a frozen se appears to me to produce a CI range larger than Tamino’s and the unfrozen version has a CI range narrower than Tamino’s and that from Monte Carlo simulations.

    What I find interesting about this whole exercise is that, while one can obtain results from another blog as in the case of Tamino’s, and one can obtain results from a paper such as Foster’s (Tamino’s), in order to truly analyze these results, alternative methods should be run and scrutinized, such as was done here at the Blackboard. It is a form of sensitivity testing that I find lacking too frequently from climate science papers.

    I cannot find any obvious weaknesses in the Monte Carlo simulations run here that would prevent using those results for the trend CI runs as a standard to be met by other methods. Tamino’s method of adjustment for trend CIs as described in his paper closely approaches the Monte Carlo simulation results in both the frozen and unfrozen cases, but at his blog he evidently used another, unknown method. Why did he make that choice and why did he not do a Monte Carlo analysis? Perhaps I missed something, but why were not more details provided for his CI estimations that appear in the graph he presented at his blog?
