Anomalies: Mimicking AR4 Figure 10.4?

From time to time, I like to think about how to compare simulated and observed temperatures not in the way I think “best” but using methods “most similar” to figures contained in the AR4 itself. Today, I decided it would be useful to concoct a figure that compares the observed anomalies to model output using a graph that is “similar to” Figure 10.4 in the AR4, aka the figure created by the authors of the IPCC AR4 to communicate their projections and associated uncertainties to policy makers and the public.

Here’s my similar graph:

Figure 1: Comparison of Simulated and Observed Temperature Anomalies.

My main observation is this: The current annual average surface temperatures fall outside and below the ±1SD range for the model mean temperature anomalies. So, in some sense, the observed temperatures fall outside the uncertainty range the authors of the IPCC implied when they created and published this graph.

Of course, some will point out that in other senses, something about the observations falls inside some other sort of uncertainty range that can be concocted based on model runs. Yes: it happens that we can pick and choose all sorts of tests, some with more statistical power, some with less. The test shown above has low statistical power compared to some others, but it’s possible for someone who wants to protect models to come up with tests with even lower statistical power. (It is a rule of statistics that one should choose a method with greater statistical power rather than lower statistical power. This rule results in fewer errors overall. But I’ll discuss that more in another blog post.)

Still, seeing the above graph, a question one might ask is: Would it be fair to use this graph to say the models “have failed”?

Well… in some sense it is fair.

What makes it fair?
The sort of comparison shown in figure 1 was implied by the choices made by the authors of the AR4.

Back in 2006-2007, when temperatures were “on track”, the authors (many of whom are modelers) elected to communicate this specific type of uncertainty interval.

At that time the modelers could have:

  1. shown ±95% uncertainty intervals based on model means. This would have given the public and policy makers an impression that the uncertainty was larger than shown in Figure 10.4 of the WG1 contribution to the AR4.
  2. shown either 1 SD or ±95% uncertainty intervals based on all model runs. (These would be intervals based on “all weather in all models”. They include uncertainty due to both “weather noise” and model bias due to unknown physics, but in an odd sort of way, because some models have more runs than others.)
  3. created uncertainty intervals in some other way to account for the unequal number of runs for each model.
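
For anyone who wants to see concretely how much these choices matter, here is a minimal sketch of the arithmetic. The data are made-up stand-ins for the actual archive, the run counts are illustrative, the ±95% intervals are approximated as ±1.96 SD, and the “Figure 10.4 style” band reflects my reading of that figure:

```python
import numpy as np

# Made-up stand-in data: each "model" contributes a different number of runs,
# as in the AR4 archive. Each entry has shape (n_runs, n_years).
rng = np.random.default_rng(0)
run_counts = [1, 1, 3, 3, 5, 17]          # unequal run counts (illustrative only)
years = np.arange(1950, 2010)
runs_by_model = {
    f"model_{i}": rng.normal(0.01 * (years - 1950), 0.1, size=(n, years.size))
    for i, n in enumerate(run_counts)
}

# Band like the one in Figure 1 / Figure 10.4 (as I read it): +/-1 SD of the model means.
model_means = np.array([runs.mean(axis=0) for runs in runs_by_model.values()])
center = model_means.mean(axis=0)
sd_means = model_means.std(axis=0, ddof=1)
band_1sd = (center - sd_means, center + sd_means)

# Option 1 above: ~95% interval based on model means -- wider than +/-1 SD.
band_95 = (center - 1.96 * sd_means, center + 1.96 * sd_means)

# Option 2 above: spread of *all* runs pooled ("all weather in all models").
all_runs = np.vstack(list(runs_by_model.values()))
sd_all = all_runs.std(axis=0, ddof=1)
band_all_1sd = (all_runs.mean(axis=0) - sd_all, all_runs.mean(axis=0) + sd_all)
```

With either of the first two options the bands come out noticeably wider, which is exactly why the choice matters when judging whether observations have “broken out”.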

As for me: I would interpret Figure 10.4 as at least communicating the modelers’ expectation of the range of temperatures the public would see, with temperatures breaking out of the uncertainty bands being unexpected to some extent. That the authors of the IPCC did not explain precisely what their uncertainty intervals meant is both apparent and regrettable.

Nevertheless, the uncertainty interval they did select and publish created an impression about the uncertainty range in the minds of the public and policy makers.

So, I think it’s fair to at least say the observed temperature anomalies are lower than the modelers expected them to be. Or, failing that, they fall outside the range the modelers told the public and policy makers to expect.

That’s the main blog post! 🙂

Things to know when you read other comparisons of observations and models.

There are some odd ‘features’ about the uncertainty intervals in Figure 10.4. These make it difficult to explain what breaking out of the uncertainty bands means.

  1. The observations of surface temperatures include “weather noise”.

    In principle, the uncertainty intervals in Figure 1 here (or Figure 10.4 in the AR4) do not include “weather noise”. In practice, both include quite a bit of “weather noise”.

    Why is this so?

    In principle, when applying the idea of “ensemble averaging” to modeling, many runs are performed for each model. Averaging over many individual runs removes “weather noise” for a given model. The remaining spread in model mean temperature anomalies would communicate the spread of model biases and would indicate the effects of the different physics governing each “model planet”. (The model physics are intended to mimic those of the earth; whether or not they do so sufficiently well to forecast climate trends is the question asked in the “climate blog wars”.)

    In practice, quite a few models provide only 1 run; few provide more than 3. Only the Essence runs provide more than 7. If you bear in mind that averaging over 4 runs will generally reduce the root mean square (rms) noise by a factor of 2, averaging over 9 runs reduces it by a factor of 3, and so on (see the short sketch after this list), it’s clear that for most models used by the IPCC, the “model weather noise” is not averaged out in any significant way.

    The result is that the ±1SD spread in my Figure 1 and in the AR4 Figure 10.4 includes contributions from both “model weather noise” and “model bias” in an odd sort of mix.

  2. This collection of models does not completely match the IPCC collection. I’ve included all SRES A1B scenario runs from all but one of the models available at The Climate Explorer. (MRIJMA TL959L60 is omitted because only 10 years are available for the 20th century.)
  3. Examining these graphs by eye is interesting. However, diagnosing model bias using anomalies themselves tends to provide low-power statistical tests. That is: we need a lot of data to detect when an incorrect projection is off track in any specific way. Other methods permit us to detect that wrong models are wrong with smaller amounts of data. (This is one reason I prefer testing trends.)
  4. Comparisons using anomalies can change dramatically when we change our choice of baseline. For this reason, I prefer comparing trends as they are less sensitive to choice of baseline. (The Lukewarmer mug uses the 1900-1999 baseline.)
  5. I could make a similar graph using all 74 runs from all models. This would be useful if we want to discuss whether the earth’s weather trajectory falls inside “all weather in all models”. Such a test is pitifully weak from a statistics point of view, but those who want to protect the reputation of models are likely to embrace it.
  6. This graph begins in 1950 because the Essence runs begin in 1950; this is the shortest time period of the runs used.
  7. These anomalies are not corrected for model drift; the IPCC anomalies are corrected for model drift. I’m looking into that.
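
Here is the sketch promised in point 1: a quick numerical check of the 1/√N rule. The weather noise level below is made up; only the ratios matter.

```python
import numpy as np

# Check of the 1/sqrt(N) rule from point 1: averaging N runs of pure
# "weather noise" shrinks its standard deviation by roughly sqrt(N).
rng = np.random.default_rng(1)
weather_sd = 0.1   # assumed interannual noise level in deg C (illustrative only)

for n_runs in (1, 4, 9, 16):
    # 100,000 trials of "average n_runs noisy values for one year"
    ensemble_means = rng.normal(0.0, weather_sd, size=(100_000, n_runs)).mean(axis=1)
    print(f"{n_runs:2d} runs: residual noise sd = {ensemble_means.std():.4f} "
          f"(1/sqrt(N) prediction: {weather_sd / np.sqrt(n_runs):.4f})")
```

With only 1 to 3 runs for most models, almost none of that noise averages away, which is why it leaks into the ±1SD band.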

5 thoughts on “Anomalies: Mimicking AR4 Figure 10.4?”

  1. What is model drift?

    Heh. I knew someone would ask that. 🙂

    In principle, the way the models are run is:

    1) Guess the steady state solution you would get for some level of applied forcing. The solution you guessed is your “initial condition”. (You would pick the level of forcing that applies to the first year you intend to actually model. Let’s say that’s 1900 for cases we care about.)

    2) Because your guess is a guess, run the model for a billion zillion years at the applied forcing.

    3) Watch to see if the model reaches “pseudo-equilibrium”. In principle, you would get the same pseudo-equilibrium answer no matter what initial condition you selected.

    4) Continue to run the model to create a bunch more years at “pseudo-equilibrium”. Save the data from Dec. 31 of year N for these cases.

    5) Now, pick some particular year after you reached “pseudo-equilibrium” and call it year 1900. This is your initial condition for the 20th century runs.

    6) Vary the forcing as you expect it to vary from 1900 forward. These are your “real” runs.

    7) When creating new runs, pick a different year from the spin up to call 1900. This randomizes the effect of a particular start year on your results.

    During step 2 (called “spin up”) the temperature of the planet “drifts” toward pseudo equilibrium.

    Now, suppose that instead of running the model a zillion billion years in spin up, you cheaped out and ran it 400 years but picked year 100 as the initial condition for the 20th century. Then, you might think the model “remembered” the initial condition in step 1. You might expect that even if you had not varied the forcings as for the 20th century, the temperatures would continue to “drift” toward the pseudo-equilibrium solution.

    If you actually still have results from years 100-200 of the spin up, you can estimate this drift and subtract it. (It would, of course, be better to just run the spin up longer. But… anyway, that’s model drift.)
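
    In case a sketch helps: one simple way to estimate and remove this sort of drift, if you still have a parallel stretch of control/spin-up output, is a linear fit and subtraction. This is an illustration, not necessarily the procedure the modeling groups or the IPCC actually use; the numbers and function names are made up.

```python
import numpy as np

# Illustrative drift correction: fit a linear trend to a control/spin-up
# segment and subtract that trend from the forced run over the same years.
def estimate_drift(control_temps, years):
    """Least-squares linear drift (deg C per year) in the control segment."""
    slope, _intercept = np.polyfit(years, control_temps, 1)
    return slope

def remove_drift(forced_temps, years, drift_per_year):
    """Subtract the estimated drift, referenced to the first year."""
    return forced_temps - drift_per_year * (years - years[0])

# Made-up example: a control run still creeping toward pseudo-equilibrium
# at 0.002 C/yr, and a forced run that warms faster on top of the same creep.
rng = np.random.default_rng(2)
years = np.arange(1900, 2000)
control = 0.002 * (years - 1900) + rng.normal(0, 0.05, years.size)
forced = 0.007 * (years - 1900) + rng.normal(0, 0.05, years.size)

drift = estimate_drift(control, years)
corrected = remove_drift(forced, years, drift)   # forced warming with the creep removed
```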

  2. A “prediction interval” is wider than a “confidence interval”. (I leave it as homework for Boris to look up definitions.) The point is: one does not expect 95% of future observations to fall within a 95% confidence interval. What one is 95% confident about is that the interval around the sample mean contains the true mean. A rough guess would be that ~75% of observations are expected to fall within the 95% confidence interval. i.e. By random chance alone you would expect 1/4 of observations to fall outside that confidence interval. That’s pretty permissive. 2-3 years out of every decade. One standard deviation is what – a 68% confidence interval? You expect half of future (out-of-sample) points to fall outside that interval. Incredibly permissive.
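
    A quick Monte Carlo makes the commenter’s distinction concrete. The sample size and noise level below are made up, and the coverage fraction depends strongly on both; with these numbers it comes out well under 95%, which is the point.

```python
import numpy as np
from scipy import stats

# How often does one new observation land inside a 95% confidence interval
# for the *mean*? Far less often than 95%.
rng = np.random.default_rng(3)
n, trials = 10, 50_000
inside = 0
for _ in range(trials):
    sample = rng.normal(0.0, 1.0, n)
    half = stats.t.ppf(0.975, n - 1) * sample.std(ddof=1) / np.sqrt(n)   # CI half-width
    new_obs = rng.normal(0.0, 1.0)                                       # out-of-sample point
    if (sample.mean() - half) <= new_obs <= (sample.mean() + half):
        inside += 1

print(f"fraction of new observations inside the 95% CI: {inside / trials:.2f}")
# A 95% *prediction* interval uses the wider half-width t * s * sqrt(1 + 1/n)
# instead, and does capture roughly 95% of new observations.
```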

  3. For some reason you ignored the other excursion outside the limits. The models underestimating temperature anomalies is not a problem; overestimating is.

  4. When you consider the task the modelers have set themselves, it’s amazing how accurate the results are. I would have thought congratulations were in order.
