How far off are the AR4 AOGCMs?

Yesterday, I showed that the 108 year Hadcrut trend is greater than the GISSTemp trend. BarryW asked me to post a graph showing the differences, with a trendline superimposed. Doing that made me think, “Hmm…. what if I plot the differences between the model-mean global mean surface temperature (GMST) and the observed GMST?”

So, I did. I made three charts, which I’ll show today. There are no intensive computations; this is mostly a visual. It is not a powerful statistical method, but it does give you an idea of whether the case that the models over-predict the trend falls apart if we switch to anomalies and/or longer time periods.
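For readers who want to play along, here is a minimal sketch of the differencing and trendline described above. Everything in it is a stand-in: the synthetic `model_mean` and `observed` arrays would, in practice, be the 22-model mean and the HadCrut or GISS monthly anomalies on a common time axis.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the real series, for illustration only.
rng = np.random.default_rng(0)
months = np.arange(1200)                        # Jan 1900 .. Dec 1999
model_mean = 0.0005 * months + 0.05 * rng.standard_normal(1200)
observed = 0.0004 * months + 0.12 * rng.standard_normal(1200)

diff = model_mean - observed                    # model minus observation
slope, intercept = np.polyfit(months, diff, 1)  # OLS trendline

plt.plot(months, diff, lw=0.5, label="model mean - observed")
plt.plot(months, slope * months + intercept, "r", label="OLS trend")
plt.legend()
plt.show()
```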

Monthly Anomalies: Baseline 1900-2000

Before displaying differences between model means and observations of monthly global surface temperatures, it’s worth comparing some features of the model mean and the observations. Figure 1 shows the model mean temperature, which I obtained by averaging the monthly temperatures predicted by 22 AOGCMs.

Figure 1: Monthly global surface temperature anomalies.

For this graph, the baseline is Jan. 1900 through Dec. 1999. I don’t usually use this baseline; I picked it to force the average difference to zero during the 20th century.
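In case the baselining step isn’t obvious, here is a minimal sketch, with hypothetical names, of shifting an anomaly series so its Jan. 1900 through Dec. 1999 mean is zero:

```python
import numpy as np

def rebaseline(anoms, years, start=1900, end=1999):
    """Shift a monthly anomaly series so its mean over start..end is zero."""
    mask = (years >= start) & (years <= end)
    return anoms - anoms[mask].mean()

# Toy usage: 1200 months spanning Jan 1900 - Dec 1999.
years = 1900 + np.arange(1200) // 12
anoms = np.linspace(-0.3, 0.5, 1200)
shifted = rebaseline(anoms, years)
print(round(shifted.mean(), 12))  # 0.0 by construction
```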

Features worth noting:

  1. The model mean is much smoother than the other curves. This is because a) the individual model means sometimes involve several runs, which averages out some “weather noise” within each model, and b) the remaining weather noise is further averaged out by averaging over the 22 models. (A sketch of this two-stage averaging follows the list.)
  2. Individual observations are affected by all three of the following: a) “weather noise” as it happened on the real earth, b) measurement error and c) whatever processes drive systematic changes in the earth’s temperature. (These can include solar, volcanic, ghg forcing etc.)
  3. The difference between the observations and the model mean is affected by both the earth’s “weather noise”, any biases in the model mean and to some extent, measurement error.
  4. The relatively smooth multi-model mean curve currently exceeds the curve showing observations.
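For concreteness, here is a rough sketch of the two-stage averaging mentioned in item 1. The dictionary layout, model names, and shapes are purely illustrative, not the actual archive structure:

```python
import numpy as np

def multi_model_mean(runs_by_model):
    """Average runs within each model first, then across models."""
    per_model = [runs.mean(axis=0) for runs in runs_by_model.values()]
    return np.mean(per_model, axis=0)

# Illustrative shapes: (n_runs, n_months) per model.
rng = np.random.default_rng(1)
runs_by_model = {
    "model_a": rng.normal(0.2, 0.1, (5, 120)),  # 5 runs
    "model_b": rng.normal(0.1, 0.1, (1, 120)),  # a single run
}
print(multi_model_mean(runs_by_model).shape)    # (120,)
```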

Monthly Model Means

Figure 2 is created by subtracting the observations from the model mean.

Figure 2: Difference between observations and model mean (monthly).

Features:

  1. The choice of baseline means the average of the difference during the 20th century is zero. If someone believed the model mean perfectly represented the best estimate of the temperature anomaly during that period, then the difference between an observation and the model mean might be construed as arising from “weather noise” and/or “measurement noise”. At the other extreme, if someone thought the models were totally wrong, the differences would represent error in the model mean. That is: we would say the model mean provided a biased representation.
  2. The least squares trend applied to the difference indicates the model mean tends to over-predict in recent years and under-predict in past years. If we treat all residuals as AR(1) noise, the trend in the differences is significant at 95% for GISS. The trend is not significant for HadCrut. (A sketch of this test follows the list.)
  3. The maximum difference between the model mean and the observations occurred in 2008, when the model mean exceeded the observations by 0.6603 C. The minimum occurred in 1939, when the difference was -0.5534 C. Both are excursions relative to HadCrut.
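Here is a minimal sketch of the AR(1)-adjusted trend test mentioned in item 2. It deflates the sample size by the lag-1 autocorrelation of the residuals; this is one common recipe, not necessarily line-for-line what I ran:

```python
import numpy as np
from scipy import stats

def ar1_adjusted_trend(y):
    """OLS trend of y with an AR(1)-deflated effective sample size."""
    n = len(y)
    t = np.arange(n, dtype=float)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]   # lag-1 autocorrelation
    n_eff = n * (1.0 - r1) / (1.0 + r1)             # effective sample size
    sxx = np.sum((t - t.mean()) ** 2)
    se = np.sqrt(np.sum(resid ** 2) / (n_eff - 2.0) / sxx)
    p = 2.0 * stats.t.sf(abs(slope) / se, df=max(n_eff - 2.0, 1.0))
    return slope, se, p
```

Feeding the monthly difference series to `ar1_adjusted_trend` gives a slope per month and a rough two-sided p-value; “significant at 95%” corresponds to p < 0.05.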

So, basically, there may or may not be a trend in the difference between the model mean and the observations. The diagnosis depends on whether you like Hadcrut or GISS better. If we pick a baseline from [1900-2000), the maximum monthly difference between the model mean and the observations happened in 2008.

What does this mean? Who knows. We could pick different baselines to make the difference seem less dramatic. Picking recent baselines forces agreement near the present time; picking older baselines makes things look worse.

Annual Averages

Since I showed 12-month averages yesterday, I thought some people might be curious about how this looks based on annual averages. After all, one might think I’m picking whichever metric looks worse, right? (The reason I showed monthly values above is that both more and less smoothing are worth considering.)

Anyway, here is the annual average plot:

Figure 3: Difference between model mean and observations (12-month lagging average).

  1. The maximum positive deviation still occurs in 2008; it’s 0.3515 C. The largest negative deviation occurs in 1943; it’s -0.3626 C. So, in some sense, by averaging over 12 months, the divergence between model projections and observations looks “less bad”. Still, even with 12-month averaging, excursions of 0.3515 C between model predictions and observations have occurred. Maybe it’s weather noise? (Or not.) (A sketch of the lagging average follows.)
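The lagging average itself is nothing fancy. Assuming a plain numpy series, a sketch:

```python
import numpy as np

def lagging_mean_12(x):
    """Trailing 12-month average: entry i averages months i-11 .. i."""
    return np.convolve(x, np.ones(12) / 12.0, mode="valid")

monthly_diff = np.sin(np.linspace(0.0, 20.0, 240)) + 0.1  # toy series
annual = lagging_mean_12(monthly_diff)
print(len(monthly_diff), len(annual))  # 240 229
```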

By choosing different baselines or averaging time periods, I can make the result we saw with the monthly values more or less dramatic. That said, currently, the model mean is high relative to the data. During the past century, the model mean never exceeded the data by the amount displayed in 2008.

Wrap Up

As I said, there will be no major conclusions. This is just a look at the data. We could talk about over-interpreting, under-interpreting etc. But, honestly, there are quite a few pesky issues involved in the assumptions surrounding any statistical treatment. (The ‘noise’ is probably not AR(1); the ‘underlying trend’ is certainly not linear. Both have implications.)

So for now, the only message is: the difference between observations and models in 2008 was large relative to the differences we saw during the 1900-2000 “hindcast” period.

24 thoughts on “How far off are the AR4 AOGCMs?”

  1. Here’s an idea. Imagine a data set where the value for each month is the trend from that month through to the present. Do that for both the observed GMST and the modeled GMST starting in 1900. Then take the difference of the two fabricated data sets. The resultant data set should demonstrate over what time frames the modeled data trends higher or lower than the observed data.
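(A rough sketch of the idea in comment 1, assuming plain monthly series and an ordinary least-squares fit for each start month; the names and toy data are mine for illustration, not the commenter’s.)

```python
import numpy as np

def trends_to_present(y):
    """OLS trend from each start month through the end of the series.

    The last two start months are dropped: a trend needs >= 3 points.
    """
    n = len(y)
    t = np.arange(n, dtype=float)
    return np.array([np.polyfit(t[i:], y[i:], 1)[0] for i in range(n - 2)])

rng = np.random.default_rng(2)
obs = np.cumsum(rng.normal(0.0010, 0.1, 1200))    # toy observed GMST
model = np.cumsum(rng.normal(0.0012, 0.1, 1200))  # toy modeled GMST
trend_diff = trends_to_present(model) - trends_to_present(obs)
```

Where `trend_diff` is positive, the modeled trend from that start month through the present exceeds the observed one.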

  2. Should not the early period differences between the model mean and observations be near zero since (I believe, but may be wrong) parameters are fit in the models to minimize historical residuals?

  3. Jack
    The parameters in GCMs are tuned, but they are not tuned to specifically minimize the historical residuals.

    It’s more like this:

    Conservation of mass, momentum and energy is applied. But that’s not enough to say whether water vapor in a big grid cell will condense to form clouds or rain or whatever. So, someone says “Hey! We know some thermo, and if the relative humidity is 1, let’s say clouds form.” Then, we add something to decide whether the drops are large enough to fall. Also, we add something to decide whether those big puffy clouds reflect sun or do something else.

    Now, let’s say relative humidity 1 doesn’t quite work: clouds don’t form often enough, or rain doesn’t fall when it should. Then, you might tweak to improve things. But, when tuning, you try to examine the literature to figure out the relationship between cloud cover, relative humidity and/or any other feature you compute in your model. Then, you create a little submodel to predict when clouds form and what properties they have.

    You constrain yourself to pick parameters that match data you have for cloud properties as a function of relative humidity etc. You don’t specifically tune to match the earth’s historic surface temperature.

    Mind you: To some extent, the experimental data may not constrain the cloud model very much. So, you have some latitude.

    And, ultimately, if a GCM doesn’t match the earth’s GMST overall, it will be seen as failed. So, the GMST information will, inevitably, propagate in and affect some parameters.

    But, the parameters are not specifically tuned with the goal of minimizing those residuals.

  4. Jack– Well, if they were tuned to minimize residuals with surface temperatures, then modelers couldn’t brag that they weren’t. The truth is in between. They are not directly tuned to minimize those residuals. But the parameterizations aren’t always strongly constrained, and everyone always does compare to GMST over time. To at least some extent, some inputs may be tweaked for better matching. (I can even find quotes from peer reviewed papers that say so. This isn’t all that controversial. The fact of tuning is discussed in the AR4!)

  5. Do they ever do a sensitivity analysis of parameters, e.g. things like sensitivity to CO2? In financial modelling, marginal sensitivities (just partial derivatives) are computed either analytically or empirically.
    Sorry if I ask too many questions. Just tell me to lay off if I do.

  6. Jack–
    Yes. Modelers do sensitivity studies. So, they do have some information about the effects of tweaking a parameter. I don’t know details though.

    Tweaking and running a full GCM for many years would be very computationally intensive, so I don’t know if they can do a huge amount of testing for sensitivity, but they do at least some.
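    To illustrate the empirical flavor Jack mentions, here is a toy finite-difference sketch; `run_model` is a cheap stand-in for a GCM, and the parameter names are invented:

```python
import numpy as np

def run_model(params):
    """Cheap toy stand-in for a GCM; returns a 'global mean temperature'."""
    return 14.0 + 3.0 * np.log(params["co2"] / 280.0) + 0.5 * params["cloud"]

def sensitivity(params, key, frac=0.01):
    """Empirical marginal sensitivity via a central finite difference."""
    up, dn = dict(params), dict(params)
    up[key] *= 1.0 + frac
    dn[key] *= 1.0 - frac
    return (run_model(up) - run_model(dn)) / (2.0 * frac * params[key])

base = {"co2": 560.0, "cloud": 0.6}
print(sensitivity(base, "co2"))  # ~ 3/560, the analytic derivative
```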

    Actually, I think these are useful discussions because normally, all we read are:

    Skeptic/ model detractor: “But the models are TUNED!”
    Alarmists/ model defender: “Not the way you think!”
    Skeptic/ model detractor: “But the models are TUNED!”
    Alarmists/ model defender: “Not the way you think!”
    Skeptic/ model detractor: “But the models are TUNED!”
    Alarmists/ model defender: “Not the way you think!”

    Rinse, repeat. The parameters are tuned. But they are tuned in a way that is familiar to engineers who work in computational fluid dynamics, heat transfer etc. (Engineers use both types of tuning, by the way. Some empirical methods are nothing more than curve fits. Others aren’t. So, we do make sure students understand the difference.)

  7. Hi Lucia,

    I have some questions regarding the surface temperatures that are calculated from typical GCM codes.

    (1) Is the surface temperature assumed to exist right at the earth’s surface (z = 0) or is it at the center of the grid cell adjacent to the earth’s surface?
    (2) If it is assumed to exist at the surface, then the boundary condition must be a Neumann (flux) or mixed boundary condition. How is this done? There must be a lot of hand waving to figure out how to set the surface heat flux balance and ultimately back out a temperature.
    (3) Do the computed surface temperature values represent point values in space or averages over the surface grid cell?
    (4) When they compute the mean temperature, are the GCMs consistent with the historical data in using daily (Tmax+Tmin)/2 (assuming the time step is on the order of an hour or less), or do they do something else?
    (5) Has anyone looked at *local* comparisons of GCM output with historical data? For example, how does the GCM temperature history for the grid cell containing Chicago, IL USA, compare with the historical data for Chicago? I suspect predicting local variations is going to be much more of a challenge than the whole earth (where you’re averaging Siberia with the Sahara!).

    I suppose I could look at model E to get some answers for that code (I’ll see what I can find). Maybe they’ve documented it…ummm…..

    Thanks!

  8. FrankK

    Is the surface temperature assumed to exist right at the earth’s surface (z = 0) or is it at the center of the grid cell adjacent to the earth’s surface?

    1) My impression is the temperature is supposed to be at 2m off the surface. I’m not sure how the codes are structured to distinguish between the surface of the top molecules of dirt and the air above it. But… that’s my impression: 2m off the ground.

    If is assumed to exist at the surface, then the boundary condition must be a Neumann (flux) or mixed boundary condition…..

    2) I don’t know the details of how they apply the boundary condition at the surface. (This lack of knowledge includes what they do for both the temperature and velocity!) I don’t see it as any special problem. Engineers apply mixed boundary conditions at the surface of, say, heated pipes all the time. Depending on the level of detail…well.. you parameterize! But anyway, I think they define surface temperature at 2m above the surface.

    Depending on your point of view any parameterization can be either “hand waving” or “semi-empirical formulation based on understanding the physics or transport”.

    (3) Do the computed surface temperature values represent point values in space or averages over the surface grid cell

    Some sort of average.

    (4) When they compute the mean temperature, are the GCMs consistent with the historical data in using daily (Tmax+Tmin)/2 (assuming the time step is on the order of an hour or less), or do they do something else?

    I’m not sure what you are asking. Are they consistent at the hour-by-hour level?

    They aren’t expected to be consistent in the sense that any individual realization of an AOGCM run matches the earth’s realization. The hope is that they are consistent in the sense that if we ran a model a zillion times, the average result would tell us the expected value for the population of all possible realizations of earth’s weather in response to imposed forcings. So, the average over all model realizations is supposed to be the mean climate.

    Given this, one would not expect to match any hour by hour set of measurements of the earth’s temperature field.
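    As a toy illustration of that “run it a zillion times” notion: average enough synthetic realizations around an imposed trend and the ensemble mean converges toward it, even though no single run matches it. (The numbers below are invented for illustration.)

```python
import numpy as np

# Many noisy realizations around an imposed forced trend: no single run
# matches the "truth", but the ensemble mean converges toward it.
rng = np.random.default_rng(3)
forced = 0.002 * np.arange(1200)                         # imposed response
runs = forced + rng.normal(0.0, 0.15, size=(500, 1200))  # 500 realizations
ensemble_mean = runs.mean(axis=0)
print(np.abs(ensemble_mean - forced).max())  # small; shrinks as runs grow
```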

    (5) Has anyone looked at *local* comparisons of GCM output with historical data?

    I think Koutsoyiannis tried to look at local comparisons. But he didn’t compare individual realizations of model weather to the individual realization of earth weather.

  9. Thank you Lucia.

    As for the temperature history, what I am asking essentially is: does the GCM temperature represent a time-average in the Reynolds-Averaged sense? I know GCMs aren’t predicting weather per se, but they are using very similar time-marching algorithms and time steps on the order of an hour of physical time (maybe even smaller than that). So they should be able to resolve daily temperature variation and hence derive a min-max.

    I guess this is why I like to get back to the descriptions of the equations that are being solved (aka code documentation). What do the variables represent? Are the equations really conservation statements of *instantaneous* mass, momentum, energy, etc.? Or are they filtered equations (in the sense of LES)? So many questions…

    Perhaps it would be fun someday to take the local temperature output from Model E and run it through GISTEMP to see what homogenizing that computer generated urban bias gives us ;^)

  10. FrankK:

    does the GCM temperature represent a time-average in the Reynolds-Averaged sense?

    Supposedly not. It appears more volume averaged, but that sometimes gets called Reynolds averaged. (Isn’t the term Reynolds averaged supposed to mean “ensemble averaged” in the sense of expected value? That’s the way it’s discussed in Batchelor and in Tennekes & Lumley and most theory texts. The “ensemble” vs. “time” average distinction matters for time-varying problems.)

    If the code were truly ensemble averaged (which is the meaning I give to RANS), then the individual realizations from a single model shouldn’t differ so much. On the other hand, if it’s volume averaged, and thought of more as Large Eddy Simulation, then you can have realizations that differ. But the way the papers discuss the sub-grid parameterizations reads more like RANS as implemented in … the 70s? Very simple low order parameterizations. (And quite a few of them– as required for a computation with conservation of mass for more than one species, conservation of momentum and energy, plus parameterizations to explain phase change and the radiative properties of water in its various phases.)

    Obviously, they aren’t running Direct Computations in the sense of DNS for turbulence.

    So they should be able to resolve daily temperature variation and hence derive a min-max.

    Sure. But they could only do comparisons of statistics. You can’t expect a day-to-day match. No one can specify the appropriate initial condition, and the resolution is not sufficient to stay on any specific weather trajectory for very long. Even if the parameterizations are right on average, they are certainly approximate instantaneously. This means someone ends up kicking the butterfly!

  11. I like your conclusion that the difference between models and observations in 2008 was greater than in the hindcast period. That is certainly true and, based on the PDO switching to its cool phase in late 2007, will probably remain true for the next 30 years.

    People sometimes make statements about the difference between weather and climate. Let me attempt one: weather prediction is based on a study of the atmosphere; climate prediction is based on a study of the ocean.

    Levitus et al. (2005) studied the warming experienced in the late 20th century and found about 84% of the excess heat is stored in the ocean, about 5% heats the continents, about 4% is absorbed by the atmosphere, and the remainder melts sea ice and glaciers. If true, and his finding makes perfect sense, then a study of surface temps is the wrong approach. Why not compare observations to model predictions of ocean heat content? I think you will find the differences between models and observations to be far greater than for surface temps.

  12. RonCram

    Why not compare observations to model predictions of ocean heat content? I think you will find the differences between models and observations to be far greater than for surface temps.

    The answer to the first question is: Because
    a) we don’t have a standard source describing long term changes in ocean heat content, and the recent good data is likely to be revised; and
    b) the IPCC AR4 made no explicit projections for ocean heat content at the level of detail seen with their projections for surface temperatures. So, I would have to download the gridded data and come up with my own method of making a projection based on what I think they would have projected had they wished to make a projection. Testing my ginned up idea of what the IPCC would have projected is not the same as testing an IPCC projection.

    So, while your idea makes sense based on physics, there are legitimate reasons to look at surface temperatures instead.

    But, if someone else wants to do the work, I’d be happy to read what they find.

  13. Jae–
    Discussion of heat storage and radiation is not either/or. Heat storage (or non-storage) would be a consequence of a radiation imbalance. So, the existence or lack thereof is evidence for or against the hypothesis that the imbalance exists.

  14. Lucia: Yes, but the “believers” get heat storage conflated with why the Earth is 33 C (or whatever) warmer than it “should be” without GHGs. The reason for the 33 C is FIRST OF ALL that the heat is stored in the oceans and air. That stored heat CAUSES the IR that is observed, not the other way around. The whole “greenhouse effect” is misunderstood (I know I won’t win that argument here, however! 🙂 )

    It is very interesting now to contemplate just why this putative “radiation imbalance” is not having any warming effect over the last 10-12 years. The world, including the oceans is cooling. The CO2 sure didn’t go away; in fact it increased. The “believers” insist that the Sun has nothing to do with our climate. So where is the effect of the putative radiation imbalance now? Could it possibly be that there is no such thing?

  15. There are two issues with your comparison of models and observations:

    -Over the 20th century, natural forcings (solar forcing, volcanic eruptions) are not included in all model simulations.
    -After 2000, models do not use observed forcings but concentration scenarios. What is the scenario used after 2000 in your graph? A1B, A2, B1? Even with perfect models, you would expect greater differences between models and observations after 2000.

    Not really big issues given your goal here, but worth noting.

  16. Ignatus– According to the tables in chapter 10 of WG1 of the AR4, most of the 20th century runs include variations in solar. Some don’t include volcanic eruptions. The model mean based on the models that do include volcanic eruptions is in worse agreement with data than the set that did not include the volcanic forcing that has occurred.

    My graph is A1B. I forgot to say that here, but many of my readers know that’s what I’ve been downloading and examining. (I still should remember to repeat. But… anyway… a1b).

    However, if you read the AR4, you’ll see that the specific scenarios had little effect on the projections during the first part of the century. All start with the current levels of forcings and diverge afterwards. So, the temperatures similarly don’t diverge until after the scenarios diverge.

  17. 5 models (+ 2 probably) don’t have solar forcing.
    7 models (+ 2 probably) don’t have volcanic forcing.
    It is not negligible.

    Even if models with volcanic forcing are in worse agreement with observations concerning the mean, it is not really relevant here: you look at the trend. And over the 1900-2000 period, volcanism decreases the temperature trend. Models without volcanic forcing probably increase the discrepancy between observation and the ensemble mean at the end of the 20th century.

    Projections under different scenarios clearly do not diverge widely in the 2000-2008 period, but you are looking here at very small differences, and it is possible that the discrepancy between observed forcings and the SRESA1B scenario induces very small differences.

  18. Ignatus–

    Models without volcanic forcing probably increase the discrepancy between observation and the ensemble mean at the end of the 20th century.

    That’s what you’d expect. However, if you compare, you will find the opposite.

    Solar forcing increased over the 20th century. It’s not a large increase. But the models over-predict warming even though some of them do not include this increase in solar forcing.

    The models over-predict if I compare trends beginning in 1980 or in 1900. The magnitude of the over-prediction is less. But they over-predict.

  19. lucia,

    It seems to me the IPCC has already established a process for generating a “model mean” (if that is the right term) for climate projections. What I don’t know is if the models generate easily identifiable numbers for OHC. And if they do, where is that information found? I have seen Roger Pielke’s projections but his methodology is not transparent to me.
