Comparison of Observed and Simulated Trends: Hindcast, Volcano Only.

Yes. I’ve been silent… I’ve been wanting to double check some results so I can discuss some findings in detail. The double checking has taken time, so the final discussion will be deferred for a while. In the meantime, I thought I’d let you have a look at a graphical comparison of the OLS trends based on monthly data between Jan 1970 and Dec 1999 and trends from GCM simulations over a similar period:

Figure 1: Comparison of Observed and Simulated Trends (Hindcast, Volcanic only).

They say a picture is worth a thousand words. Unfortunately, we still need words. In bullet form:

    • The two black squares on the left represent the OLS trends based on GISTemp and HadCRUT3. The whiskers indicate the ±95% confidence intervals computed using the red-noise correction discussed in Santer et al. 2008 (a.k.a. “Santer 17”).

      In principle, we are kinda-sorta supposed to believe there is a 95% probability that the underlying climate trend for the period under consideration falls between the top and bottom of those whiskers. (Caveats apply.)

      The other black squares represent OLS trends from individual IPCC model runs that incorporated volcanic forcing. These were downloaded from The Climate Explorer. The whiskers illustrate the ±95% confidence intervals for the trend, computed using the method mentioned above.

      I’ve outlined the squares corresponding to similar models to help you decide whether some models seem low or high compared to the data. (Number crunching will be discussed later… possibly much later.)

      There is a fancy-schmancy way of testing whether the prediction of an individual run is consistent with the observations at some confidence level (there’s a rough sketch of the arithmetic just after this list). But if you are eyeballing a graph and comparing two cases, generally speaking, if the whiskers computed at some confidence level don’t overlap at least a little, the two trends will be found to be inconsistent with each other. Needless to say, model CCCMA is not looking good.
    • Feel free to speculate about how many of the individual trends fail the Santer17 test at a significance level of 5%. (Remember: if the models are perfectly correct, we expect 5% to fail.)
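
For anybody who wants to see the mechanics behind those whiskers, here is a rough sketch in Python. The function names and the synthetic series are mine (this is not the script used for the figure); it just illustrates the AR1 “red noise” correction to the trend uncertainty and the simple two-trend comparison, roughly as I understand the Santer et al. 2008 recipe:

import numpy as np

def trend_with_ar1_ci(y, months_per_decade=120):
    # OLS trend of a monthly anomaly series, with the standard error inflated
    # for lag-1 autocorrelation ("red noise") in the residuals.
    n = len(y)
    t = np.arange(n, dtype=float)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (intercept + slope * t)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]   # lag-1 autocorrelation
    n_eff = n * (1.0 - r1) / (1.0 + r1)             # effective sample size
    s2 = np.sum(resid ** 2) / (n_eff - 2.0)
    se = np.sqrt(s2 / np.sum((t - t.mean()) ** 2))
    # return the trend and the ~95% half-width, both in deg C per decade
    return slope * months_per_decade, 1.96 * se * months_per_decade

def inconsistent_at_95(trend1, half1, trend2, half2):
    # Compare two trends: normalize their difference by the combined
    # standard errors and check against ~1.96 (the "d*" style comparison).
    d = abs(trend1 - trend2) / np.sqrt((half1 / 1.96) ** 2 + (half2 / 1.96) ** 2)
    return d > 1.96

# Toy example with made-up series (360 months = Jan 1970 through Dec 1999)
rng = np.random.default_rng(0)
obs = 0.0013 * np.arange(360) + rng.normal(0.0, 0.1, 360)
run = 0.0025 * np.arange(360) + rng.normal(0.0, 0.1, 360)
b_obs, ci_obs = trend_with_ar1_ci(obs)
b_run, ci_run = trend_with_ar1_ci(run)
print(b_obs, ci_obs, b_run, ci_run, inconsistent_at_95(b_obs, ci_obs, b_run, ci_run))

(With real data you would feed in the observed and simulated monthly anomalies for Jan 1970 through Dec 1999 instead of the synthetic series.)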

    13 thoughts on “Comparison of Observed and Simulated Trends: Hindcast, Volcano Only.”

    1. So if you throw out the seven runs above the box and the one run below (GISS model E, somewhat ironic that it’s the lowest), what’s the ensemble average slope? My eyeball estimate is that it will be pretty close to 1.5 C/decade.

    2. DeWitt–
      I think that in terms of theoretical reasons for throwing away simulation results, it would make more sense to throw out whole models rather than individual runs from models. (Though maybe someone could concoct a scheme to weight our confidence based on the number of runs that don’t fail individually.)

      If we apply the test Santer17 called “Tests with multi-model ensemble-mean trend”, GISS ER fails if the cutoff is falling outside the ±90% confidence interval, but it survives if the cutoff is the ±95% confidence interval.

      Also, with regard to individual runs, 4 out of 9 of the GISS ER runs fail the “individual run” Santer test at both 95% and 90% during this period. If the model and data are unbiased, you’d expect a run to fail 5% of the time with the 95% test and 10% of the time with the 90% test (there’s a quick binomial sketch at the end of this comment). So… unless the data are bad (or my calculations are wrong)… GISS ER looks… not too good.

      I should note that my calculations might be wrong. I’ve been checking, and I don’t think they are. But I am trying to double check a whole bunch of stuff and document it coherently. Since bugs in the program would screw up everything equally, I can’t really post too much incrementally.
      But I thought people might be entertained by the number of runs outside the 95% confidence intervals for the data. 🙂

      BTW: There is only 1 ER run for 2000–2008 SRES A1B at The Climate Explorer, but the IPCC says there are a total of 5 runs for that scenario.

      Also, Santer’s paper shows 5 for the period from 1979-1999. ER is one of the few models with incomplete sets at The Climate Explorer. I don’t know why this is.
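
      Since I brought up the 5% and 10% expectations above, here is the little binomial sketch I mentioned. It is only an illustration of the arithmetic (my own, not anything from Santer), and it treats the 9 runs as independent trials, which glosses over the fact that they are all being compared against the same observed series:

      from math import comb

      def prob_at_least(k, n, p):
          # P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more
          # failures out of n runs if each run independently fails with probability p
          return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

      print(prob_at_least(4, 9, 0.05))   # roughly 0.0006 for the 95% test
      print(prob_at_least(4, 9, 0.10))   # roughly 0.008 for the 90% test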

    3. The one other thing that struck me was that GISS ER fails when compared against itself. Multiple runs are outside of the confidence interval for other runs of the same model.

    4. BarryW–
      The uncertainty estimates are based on the assumption of AR1 noise, which may not apply to either real or simulated weather. However, that assumption was used by the 17 authors of Santer. There are some interesting issues associated with that choice.

      I did take a peek and compare the standard deviation of the trends for models with more than one run to the estimates based on AR1 noise (a sketch of that cross-check is at the end of this comment). If the “weather noise” were AR1, the two should match on average. Interestingly, the two are generally close on average if we look at the models overall.

      But … not for GISS-ER! I’m not going to discuss the relationship between the standard deviation of actual runs and the AR1-based estimate in that much detail anywhere, but yes, using this assumption, the different runs for GISS ER appear to disagree with other runs of GISS ER. (Notice that for a few other models the individual runs are very, very closely spaced despite having very high uncertainty intervals based on the AR1 noise assumption.)
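
      For anyone who wants to reproduce that cross-check, this is roughly what I mean. The function name is mine, and the sketch assumes you already have the monthly anomaly series for each run of a single model in hand:

      import numpy as np

      def spread_vs_ar1(runs, months_per_decade=120):
          # `runs` is a list of equal-length monthly anomaly series from one model.
          # Returns (a) the standard deviation of the OLS trends across the runs and
          # (b) the average AR1-based standard error for a single run's trend.
          # If the "weather noise" really were AR1, the two should be comparable.
          slopes, ses = [], []
          for y in runs:
              n = len(y)
              t = np.arange(n, dtype=float)
              slope, intercept = np.polyfit(t, y, 1)
              resid = y - (intercept + slope * t)
              r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
              n_eff = n * (1.0 - r1) / (1.0 + r1)
              s2 = np.sum(resid ** 2) / (n_eff - 2.0)
              se = np.sqrt(s2 / np.sum((t - t.mean()) ** 2))
              slopes.append(slope * months_per_decade)
              ses.append(se * months_per_decade)
          return np.std(slopes, ddof=1), float(np.mean(ses))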

    5. Yes, the fact that some of the other models’ runs were closely spaced is what made the GISS ER runs stand out. Even CCCMA, although it was an outlier, was internally consistent.

      I wonder what differences there were in the initial conditions that were used in the GISS ER model runs. You would have thought they would have made some sort of checks along the lines of what you’re doing, at least.

    6. Barry–
      The fact that the relationship looks weird for GISS isn’t, per se, an inconsistency.

      It could just mean that GISS’s “weather noise” has some higher-order lag structure than AR1. The AR1 is an assumption made to estimate the uncertainty intervals; it’s not gospel. There isn’t any theory telling us what the lag structure should be, but we know the weather noise isn’t white. (There’s a quick sketch of one way to check this at the end of this comment.)

      But based on this assumption, which is also the one made in Santer, GISS doesn’t agree with itself. The problem could be GISS ER, or it could be the assumption of AR1 noise, or both!
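
      If anyone wants to poke at the AR1 assumption themselves, here is a crude check (my own sketch, run on a made-up series): compare the sample autocorrelation of the residuals at several lags to the geometric decay r1**k that AR1 implies. If the empirical autocorrelations decay much more slowly, or oscillate, the noise has higher-order structure and the AR1-based error bars are suspect:

      import numpy as np

      def acf(x, max_lag=6):
          # sample autocorrelation of a series at lags 1..max_lag
          x = np.asarray(x, dtype=float) - np.mean(x)
          denom = np.sum(x * x)
          return np.array([np.sum(x[:-k] * x[k:]) / denom for k in range(1, max_lag + 1)])

      # Made-up AR(2) series, just to show what "not AR1" looks like
      rng = np.random.default_rng(1)
      e = rng.normal(0.0, 1.0, 2000)
      x = np.zeros_like(e)
      for i in range(2, len(e)):
          x[i] = 0.5 * x[i - 1] + 0.3 * x[i - 2] + e[i]

      rho = acf(x)
      print(np.round(rho, 2))                          # empirical lags 1..6
      print(np.round(rho[0] ** np.arange(1, 7), 2))    # what pure AR1 with the same r1 would give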

    7. A third problem could be with the initial conditions set for each run. The model may have quantization issues, nonlinearities, or discontinuities, or it just might actually represent what happens if your initial conditions are different enough. With the amount of information available there is no way to tell. There might just be “tipping points”!

    8. BarryW–
      Information about initial conditions is provided at PCMDI. Generally speaking, the IPCC models do this:

      Sometime near the beginning of the thermometer record (i.e. 1850–1900), each model is initiated with a set of conditions that either corresponds to historical conditions (possibly Levitus) or comes from conditions spun up from a control run using forcings thought to apply at the time of initiation. (When they use Levitus, I suspect they do some “tweaking” to get some variability over the ICs, but I don’t know what they do.)

      The models are run to roughly 2000 using forcings the modeling groups consider realistic. Some end in Dec 1999, some as late as Dec 2003. (Note the resulting difference in the start dates of the scenario runs that follow. This actually matters if we use the argument that all differences in post-2000 runs are due to “weather noise”. In reality, there can be some differences related to model biases but also to applied forcings.)

      At whatever year the modelers chose to end those runs, they then switch to the SRES, using one of their 20th century runs as the initial condition.

      You can get the pdf for the initial conditions for GISS ER here:
      www-pcmdi.llnl.gov/ipcc/model_documentation/GISS-E.pdf

      The key bit discussing initial conditions for the 20th century says:

      1 – A: pre-industrial control experiment: E3AoM20A – 1880 atm.conditions
      B: initial conditions = final state of a preceding 200 year run;
      that model started up from a series of previous models whose
      combined simulation time added up to 428 years.
      The initial run of that series started up from observed
      atmospheric conditions (1 Dec 1977) and ground conditions
      from a long series of earlier runs.

      2 – A: present day control experiment – nothing submitted

      3 – A: 20C3M: ensemble of 9 runs E3Af8[a-i]oM20A
      B: E3Af8aoM20A start: 1/1/1880 = 1/1/year 6 of E3AoM20A
      E3Af8boM20A start: 1/1/1880 = 1/1/year 7 of E3AoM20A
      E3Af8coM20A start: 1/1/1880 = 1/1/year 8 of E3AoM20A
      E3Af8doM20A start: 1/1/1880 = 1/1/year 9 of E3AoM20A
      E3Af8eoM20A start: 1/1/1880 = 1/1/year 10 of E3AoM20A
      E3Af8foM20A start: 1/1/1880 = 1/1/year 31 of E3AoM20A
      E3Af8goM20A start: 1/1/1880 = 1/1/year 56 of E3AoM20A
      E3Af8hoM20A start: 1/1/1880 = 1/1/year 81 of E3AoM20A
      E3Af8ioM20A start: 1/1/1880 = 1/1/year 106 of E3AoM20A

      So, I think their 20th century runs initiate off a very long control run with forcings thought to apply in 1880. They “randomize” by starting at different years in the control run. It is noteworthy that 5 of the 9 initiating years are clustered between years 5 and 10. If there are any long cycles in climate, that means they started during similar AMO/PDO-type cycles. On the other hand, they’d be fairly well distributed over an El Niño. The later ones are spaced roughly every 25 years.

      Does this randomize enough? Who knows.

      The IC entry for the SRES A1B experiments says:

      6 – A: SRES A1B experiment: ensemble of 5 runs E3IP_A1B[^f-i]oM20
      B: E3IP_A1BoM20 start: 7/1/2003 of E3Af8coM20A, 1/1/1880=1/1/yr 8 of E3AoM20A
      E3IP_A1BfoM20 start: 7/1/2003 of E3Af8foM20A, 1/1/1880=1/1/yr 31 of E3AoM20A
      E3IP_A1BgoM20 start: 7/1/2003 of E3Af8goM20A, 1/1/1880=1/1/yr 56 of E3AoM20A
      E3IP_A1BhoM20 start: 7/1/2003 of E3Af8hoM20A, 1/1/1880=1/1/yr 81 of E3AoM20A
      E3IP_A1BioM20 start: 7/1/2003 of E3Af8ioM20A, 1/1/1880=1/1/yr 106 of E3AoM20A

      As you can see, their SRES experiment initiates in July 2003, and the initial conditions come from 20th century runs whose own ICs were at roughly 25-year intervals. (BTW: Only one of these 5 runs is available at The Climate Explorer. I don’t know why the other four are missing.)

    9. The CCCMA might not be within the confidence intervals, but the inter-run variation and general trend errors (even including the Santer method) look a lot tighter than for the other models. Ironically, that is a good sign. The general feeling I get from looking at the plot is that there is still a lot of work to do on these models to get a tighter convergence across runs. I wouldn’t chest-beat very much with this data. It’s good you’ve shown it, Lucia.
      A slightly different question: if the forcings are functions of time (independent of direction) and weather noise is assumed, could the models be run in reverse and the trends compared with the forward direction? If there are no iterative components that contribute discrete/chaotic effects to the trend (in fact the Santer 17 paper assumes this), would this not be a nice test for consistency? Don’t worry, I’m not expecting you to do this!

    10. MC–
      I’ve never thought of the idea of running the models in reverse. It’s a bit of an odd idea, as there are practical issues that make it difficult to run a set of conservation equations backwards. (Some modeling approximations don’t violate the 2nd law of thermo if run forward, but might if run backwards.)

    11. With regard to the thermodynamics, the system would just be cooling. What I meant was that some of the forcing relationships could be reversed in sign and then compared with the temperature record in reverse, as if the planet had cooled. There should be no difference between starting at a warm level and reducing the forcing. In fact, if you think about it, if the models don’t show a cooling trend when the CO2, water vapour and similar forcings are applied with a decreasing trend, then there is something wrong in the models. A bias.

    12. MC–
      Not exactly. You’re thinking first law only.

      The difficulty with the 2nd law has to do with what types of simplifications might be imposed on a computation going forward. Some simplifications only work if you predict going forward in time or compute flow going from an upstream condition to a downstream one.

      Example issue: suppose you have a pressurized container, and air flows out through a converging-diverging nozzle into a test chamber (and later out some suitably shaped exit). If the pressure is high enough, the flow accelerates to supersonic conditions. Maybe it does something in a test chamber (like a wind tunnel). Maybe various shocks occur.

      You could write a program to predict what happens inside the test chamber knowing the upstream conditions and marching forward. You can’t really start at the downstream conditions and march backwards in your computation. (At least, I think you can’t.) Difficulties arise due to the second law of thermodynamics and the flow not being reversible.

      Similar things can happen in lots of systems. Usually, the issues aren’t as dramatic or obvious as shock waves. Irreversibilities in other flows are smaller, less dramatic, less noisy, etc. But basically, it’s difficult to march backwards in time from a solution for many problems involving conservation of mass, momentum and energy. If you had a forward problem, you would really need to think carefully about whether any approximation made still works when you run backwards in time. They might not. (Or they might. You need to look at the problem.)

    13. I agree it depends on the problem and the models. With regard to the 2nd law of thermodynamics, it is a bit of a conundrum: it’s a classical formulation, yet it describes an irreversible process with discontinuities. I bet, though, that the thermodynamics equations and heat flow in the models are made simple enough to get the trend-plus-noise outputs, so some sort of reversibility may be allowed. The reason I was wondering is that, even if only on an abstract level, it is sometimes good to run a process in reverse to check that the mathematics is behaving as it should and that the software isn’t generating unexpected artefacts. How accurate or complete the reversal needs to be could be determined beforehand. It might only require a simple examination of periodic output with the model run forwards.
