Yesterday, I showed that whether models appear “goodish” or “badish” when we compare anomalies is somewhat sensitive to the choice of baseline. This sensitivity arises when models are wrong in the sense that their estimate of deterministic changes (i.e. “climate” changes) differs from that of the real earth, which is what they are intended to predict. To some extent, the “sensitivity” also arises when a graph compares model projections or hindcasts over a short time series, which sometimes permits disagreement to be hidden. When models are off, if you show enough data, you can always show the model and data disagree at some point in time; what you can pick is whether they mismatch in the past and match now, or vice versa.
Today, I want to show a few interesting features in plots ginned up using the baseline method. Knowing these features can help you understand how to interpret these graphs, and also recognize when someone might be accidentally making choices that hide the mis-match between models and observations.
Models rebaselined using 1930-1949
I’m going to start by using a baseline that will make the multi-model mean mismatch observations by quite a bit now. But first, I’ll show some features without showing the data.
When one wants to detect (or emphasize) the mismatch between models and observations in the years 2000-2011, the first step is always to pick a baseline that is fairly far in the past. This will make models and data agree in the distant past, but if the simulations give a different amount of warming than the observations, permitting a long time to lapse gives the two time to drift apart. In the specific case of simulations of the earth's temperature, it turns out that rebaselining to the somewhat inexplicably warm 1930-1949 period will make models look worse than picking 1900-1910, so I’m going to do that. (This is cherry picking. Yes. But the choice shows something else too.)
Below, you can see a graph showing the multi-model mean computed with all runs rebaselined to the 1930-1949 period. The dashed grey lines illustrate the ±1-sigma bands for model means.
Occasionally, people imagine (or are given the impression) that the spread in the ±1-sigma bands arises mostly due to “weather”. What I want you to notice on this graph is that the spread of the ±1-sigma bands is narrowest during the baseline period. The spread widens as we get further away from the baseline period. My choice of the 30s-40s permits you to see that the spread happens both for years that precede and for years that follow the baseline.
Clearly, this makes no sense if the spread is mostly due to “weather”.
It’s possible to explain that this spreading has very little to do with weather; it mostly arises because individual models predict different amounts of warming since 1920. You can see this more clearly in the following graph, which shows individual runs from models that had more than 1 run. Each model run is shown in a brilliantly colored trace. (Observations are superimposed in dull grayish hues. Squint and you’ll see where they fall.)
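For anyone who wants to see why rebaselining pinches the spread near the baseline window, here is a minimal sketch of the arithmetic. The synthetic runs, trends and noise levels below are purely illustrative; this is not my actual script or the CMIP3 output.

```python
import numpy as np

# Illustrative setup: synthetic "runs", not actual CMIP3 output.
years = np.arange(1900, 2012)
rng = np.random.default_rng(0)
n_runs = 20

# Each fake run warms at its own rate and carries some "weather" noise.
trends = rng.uniform(0.005, 0.02, n_runs)                  # deg C per year
runs = np.outer(trends, years - years[0]) \
       + rng.normal(0, 0.1, (n_runs, years.size))

# Rebaseline: subtract each run's own 1930-1949 mean from that run.
base = (years >= 1930) & (years <= 1949)
anoms = runs - runs[:, base].mean(axis=1, keepdims=True)

# Mean and +/- 1-sigma spread across runs at each year.
mean = anoms.mean(axis=0)
sigma = anoms.std(axis=0, ddof=1)

# The spread is pinched near the baseline window and fans out away from it,
# because every run was forced to average zero over 1930-1949.
print(sigma[base].mean(), sigma[years >= 2000].mean())
```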
Examining the brilliantly colored traces, it seems evident that each model shows distinctly different temperature rises since the 30s.
Examining the observations, you can see they fall on top of the models that show the least amount of warming. I can illustrate this by showing the multi-model mean, the ±1-sigma bands for model means and the observations:
With this choice of baseline, the 13-month boxcar-smoothed anomalies fall outside and below the ±1-sigma bands.
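For reference, the 13-month boxcar smooth is nothing fancy: a centered, equally weighted 13-month running mean. A minimal sketch, using a placeholder series rather than the actual observations:

```python
import numpy as np
import pandas as pd

# Placeholder monthly anomalies; substitute the observed anomaly series.
dates = pd.date_range("1900-01", "2011-10", freq="MS")
obs = pd.Series(np.random.default_rng(1).normal(0, 0.2, dates.size), index=dates)

# 13-month boxcar: centered, equally weighted running mean.
smooth = obs.rolling(window=13, center=True).mean()
```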
But remember: I cherry picked. I (and everyone familiar with climate change data) knew that the 30s-40s were warm on the real earth. Picking that period tends to make the models look warmer than most other choices. In contrast, if I re-baseline using 1990-2010, the model mean and observations will agree perfectly now, but the models will look inexplicably cool during the 1930s-1940s.
But, of course, if I wanted to hide that mismatch, I could trim the graph to show they agree well now, leave people in the dark about the mismatch in the hindcast, and possibly insist that the choice of baseline doesn’t matter or that the choice of 1990-2010 is “right” for some reason. (I believe someone at another blog recently did something similar.) In that case, I get this:
In this graph, the model and observation anomalies appear to generally agree because their 20-year means were mathe-magically forced to agree between Jan 1990-Dec 2009. This results in a large disagreement in the hindcast, but that has been chopped off. If one doesn’t notice the mismatch in the trends, the result looks quite flattering to the models.
Those who want to create graphs that make models that aren’t so good look good can do so by:
- Picking the most recent baseline they can justify, including a baseline that includes the period being compared.
- Choosing a start or end time for the graph that hides the mismatch that exists at some time in the time series.
- If possible, using different baselines to plot the hindcast (pre-2001) and the forecast (post-2001). This permits a choice of baseline that makes the hindcast look pretty good while still letting the forecast look pretty good.
Of course, it’s worth remembering: If models are spot on predicting warming due to climate change, choice of baseline ends up being unimportant. In that case, other than issues having to do with “weather” (i.e. ENSO, AMO etc.) models and observations will agree fairly well using all baselines.
Lucia,
Cool post.
I have a hard time, though, calling a 1930 time period cherry picked in this case, because the models are supposed to represent temperature. Cherry picked might be matching temps in recent times yet ignoring the past. Since models are used for projecting temps a minimum of 50 years out, it seems reasonable to baseline 80 years in the past for verification. If you were to baseline in the early 1800s, maybe, but if we can’t trust them not to overheat over the last 100 years, how can we predict the next?
I see the bump now in the curves. 1930-40 is worst case.
JeffId–
It’s cherry picked in the sense that it makes the mismatch greatest now. Since, generally speaking, people are focusing on whether or not the models predicted correctly, it’s a choice that makes the models look unable to predict.
It’s true that it would make sense to pick a baseline in the far past. If models were right it wouldn’t matter. If they are wrong, it gives models time to get off track. But picking that particular period tends to make things look bad.
I can’t pick the early 1800s because I don’t have data back then. One might argue that we should pick 1950-1980 because we don’t trust data before 1950, which makes 1950-1980 the earliest baseline where we trust the data. That would be a defensible argument. The difficulty is that if we pick the baseline today, we know in advance what each choice does.
Again, this is going to depend on start points, but isn’t one of the arguments that, even if the actual values don’t match the model temps, the overall trend is correct? What if, instead of looking at the model temps you compared the distribution of model trends against the actual trends? Wouldn’t that to some degree eliminate any differences in starting points and initial conditions?
You may have already dealt with the question and I’ve forgotten or missed it.
I think a 1920-1960 mean would be hard for climatology to disagree with. It is over 30 years and incorporates a lot of data.
Am I the naughty boy who asks whether the models actually capture all the climatic effects?
Lucia,
Nice clear explanation.
Of course, the divergence between the different models is not just due to differences in diagnosed sensitivity, although that is part of it. The substantial differences in assumed net forcing history make the models appear much more consistent with each other and with the temperature history than they are in reality. It would be very interesting if a single forcing history could be applied in multiple runs of all of the models…. I am not going to hold my breath.
SteveF–
The modelers were permitted to pick their own forcings during the 20th century. They also all know that runs are going to be compared to global surface temperatures, and will have looked at that themselves. So, the divergence in warming in the 20th century ‘hindcast’ is probably less than would happen if the modelers really had to do this all blind, without knowing what 20th century temperatures were before even starting modeling. (The problem of being able to “predict” something already known and breaking down when predicting blind is well known in engineering, where we have the luxury of not having to wait for nature to take its course!)
It would be interesting. The difficulty is that model runs are time-consuming, and there are a zillion different things that might also be interesting.
If running models was cheap, I suspect the IPCC panel could enforce a rule that only modelers who used a proscribed forcing history would be used in the future IPCC reports. They could also permit the modelers to also show their best estimate based on their judgement of the best forcings.
The difficulty is that models are expensive. When push comes to shove, the IPCC panel needs to use what is available, and modeling groups aren’t going to project into the 21st century using models they couldn’t even get to match the 20th century. No one will do that. I wouldn’t do it. You wouldn’t do it. So, even though it means the modeling groups picked aerosol levels to make their hindcasts look ok given other features, it’s the only thing anyone has to work with!
lucia’s charts here show that the choice of the base period does influence how people “perceive” the data (when it is charted), even though nothing else has changed.
The base period, for example, should NOT be in the middle of the series.
If you take the model predictions and the observations and centre both of them in the centre of the graph, the human visual sense will see both lines as very similar.
The observations are higher at the beginning of the chart (than the models) and then lower at the end of the chart, while the middle is the same.
This makes both lines look very similar to our human visual perception even though they are very different trend-wise at least.
All of the GHCN temperature reconstruction charts I have seen are produced in this manner.
The base period should be constructed so that an accurate representation of the data is apparent.
In this case, the hindcasts of the models are not particularly good but they are tuned to more-or-less reflect the historical temperature. The AR4 forecasts, however, are much higher than the observations (and most AR4 forecasts used data up to the end of 2003 and some were even being re-submitted/fixed in early 2005). So, the base period should at least end in 2003.
Secondly, why not just use the Hadcrut3 base period? That way, no one has to go into a spreadsheet and manually adjust the observations for a new base period. One should just be able to compare the 0.8C IPCC AR4 forecast to the currently public 0.350C Hadcrut3 value.
Thirdly, using the 1961-1990 base period of Hadcrut3 doesn’t allow the Dana1981’s of the world to play around with the numbers (even making Hansen’s 1988 predictions look right – I mean really – the SkepticalScience crowd just fell for it). There is less chance for manipulation and it is easier for people to check for themselves if we just use a standard base period.
So that’s my view on the issue.
Bill
No choice of baseline materially reduces the effort. If you are comparing to models, you need to read the runs in, pick a baseline, and show the comparison using that baseline. You then want to compare to any number of observations, each of which picks its own baseline. Using Hadcrut3 has no particular advantage in saving any manual effort.
Ordinarily, I compare IPCC projections in the AR4 using the Jan 1980-Dec 1999 baseline because that’s how they chose to make their projections. That defines the choice for testing projections.
One may then, of course, show how the hindcasts look using that baseline and so on. But there is no doubt that the IPCC made its projections expressed relative to the mean from Jan 1980-Dec 1999, so that’s the only way to make the comparison.
Neither would using Jan 1980-Dec 1999, which has the advantage of actually being the baseline the AR4 used to make its projections.
Yes. The Skeptical Science crowd doesn’t understand the issue related to baselining. There’s a lot of stuff they don’t understand. Nothing new there.
Actually, to do this correctly you should remember that the global average the models represent is a full-coverage average. The observations are not full spatial coverage. In attribution studies this difference is actually taken into account.
Lucia, as well as doing global averages, do not the models also return regional averages?
How do they do, for instance, on the Pacific or Continental USA ?
Lucia,
“If running models was cheap, I suspect the IPCC panel could enforce a rule that only modelers who used a proscribed forcing history would be used in the future IPCC reports. They could also permit the modelers to also show their best estimate based on their judgement of the best forcings.”
I will assume ‘proscribed’ is a Freudian slip in this case! 😉
The GISS forcings, including aerosols, are readily available; do you know if other groups publish their forcing histories?
Brilliant! This should be required reading (‘Cherrypicking 101’).
PS: “someone **accidentally** making choices that hide the mis-match between models and observations”. Very diplomatic!
SteveF–
I think the forcings from all groups are available publicly with some digging. However, I haven’t dug. GISS makes them more readily available.
Doc–
I’m plotting global averages, but the monthly global average was computed by someone else. You could get continental averages if you downloaded data from PCMDI and wrote a script to do all that. I haven’t and don’t plan to do so.
Steve Mosher–
Yes. Chad has computed global model temperatures omitting bits left out by the observational groups. It doesn’t make very much difference over shorter time frames. I don’t know how much difference it makes over longer times.
I think the ideal baseline for such a presentation should be chosen so that the actual and model trends intersect at the start of the period. The 1990-2010 baseline has the trends intersecting around 2002 (eyeball estimate), so the difference in 2011 is effectively equal to the amount of difference that has accumulated since about 2002.
Your original chart shows that even in 2000 the trendline for actual is above the model. Again an eyeball estimate is that the trends may intersect in 1998, so effectively your original chart shows 13 years of accumulated difference between model and actual temperature, and gives the impression that this occurred within 11 years.
Perhaps equally valid would be to show the model and actual temperatures intersecting in the middle of the reporting period. That then splits the accumulated error in two with half at the start of the period and half at the end. This would obviously make the models look even better.
If you have a predisposed preference to try and make the models look bad or good it is obvious whether you’d prefer to show agreement at the start or end of the reporting period. I can’t think of any good argument for why one would be more ‘honest’ than the other though.
I do note a fairly standard approach for comparing two variables by normalising over a time period would tend to see the trend lines match somewhere near the middle. Try it on Wood For Trees and see.
I think the forcings are poorly enough known that trying to match up the models to historical reconstructions isn’t going to be much use. I suspect that’s why they are using 1980-99, for which I suspect (and somebody could correct me if I’m wrong) the forcings should be pretty much the same for all models…
Bill Illis –
I’m glad it’s not just me that is staggered by dana1981’s ‘manipulations’ – when I see the question being asked at SkS “Lets see how good the FAR predictions were” I know there’ll be some ‘Texas-bullseye-graph-drawing’ – draw the graph you want, then start the process of adjusting the data. And it is so blatant!
Has HadCrut gone AWOL in the last two graphs?
Lucia writes “They could also permit the modelers to also show their best estimate based on their judgement of the best forcings.”
And yet the only defensible “valid” result is for model result averages based on all feasible forcings, and that would put model results all over the ballpark (and out of it too).
Carrick (Comment #86583) December 2nd, 2011 at 12:07 am
“I think the forcings are poorly enough known that trying to match up the models to historical reconstructions isn’t going to be much use.”
Actually, this argument applies to the variance repression of the reconstructions as well. CS likes to say that high historic variance means greater sensitivity. What it may mean instead is that they haven’t quantified sources of natural variance accurately.
Concerning the applied forcings (possibly I missed the point of yesterday’s discussion):
http://cmip-pcmdi.llnl.gov/cmip5/forcing.html
and Schmidt et al. 2011.
Concerning the PMIP3 forcings for the last millennium runs: http://www.geosci-model-dev.net/4/33/2011/gmd-4-33-2011.html
(Climate forcing reconstructions for use in PMIP simulations of the last millennium (v1.0)),
and v1.1 http://www.geosci-model-dev-discuss.net/4/2451/2011/gmdd-4-2451-2011.html
I am an engineer who has been involved in model development in the aerospace industry for many years. I continue to not understand the logic of the processes being used for model use when it comes to the study of CO2 and the climate.
1. Why does it make sense to “average” the outputs of multiple models? When we have a better model we do not average its results with those models that have been demonstrated to be flawed; why do that in climate modeling? It seems utterly foolish. Just use the best model and throw out the lower performing models.
2. What specific attributes are these models designed to accurately forecast within what margin of error?
Overall, IMO a model should be designed to accurately forecast specific criteria and should be measured to determine how well it performs in that function. Different “climate” models do not get evaluated in this simple manner. They should have outputs that we could say “measure temperature” and rainfall at specific locations and report a degree of accuracy after “x years”. For instance, we should be able to report that after 3 years we are confident +/- 10%, and after 20 years +/- 30% (for example). All the discussion about hindcasting is a part of model development and not relevant to validation of a model’s performance. That can only be done by comparing the model’s forecasts to observations of actual conditions.
I have a question about the first graphic. Is the data between the bars (1930-1949) supposed to determine the slope of the dotted line? If so it sure looks like the dotted line is steeper than the data suggests. Why’s that?
Mark
The baseline barely affects the trend. The trend is from the beginning of observed or model data through Oct 2011. The best fit line for data from 1920-Oct 2011 is steeper than the best fit line for data from Jan 1930-Dec 1949.
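To make that comparison concrete, here is a minimal sketch of fitting the two least-squares lines. The series below is a placeholder with a made-up trend and noise; substitute the observed record.

```python
import numpy as np
import pandas as pd

# Placeholder monthly series; substitute the observed record through Oct 2011.
dates = pd.date_range("1920-01", "2011-10", freq="MS")
temps = pd.Series(0.007 * np.arange(dates.size) / 12.0
                  + np.random.default_rng(2).normal(0, 0.2, dates.size),
                  index=dates)

def trend_per_decade(series):
    """OLS slope in degrees per decade, with time in fractional years."""
    t = series.index.year + (series.index.month - 0.5) / 12.0
    return 10.0 * np.polyfit(t, series.values, 1)[0]

print(trend_per_decade(temps))                       # 1920 through Oct 2011
print(trend_per_decade(temps["1930-01":"1949-12"]))  # Jan 1930 - Dec 1949 only
```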
Carrick
I agree. But the comparison is worthwhile because people are told the ‘good’ comparison during the hindcast ought to give us confidence in the predictive ability. That means it’s worth looking to see how “good” the comparison was during the hindcast.
Rob:
1) Few here would disagree with you. I show model means because the AR4 happened to highlight that. Summaries of projections use the multi-model mean.
2) The models predict “everything” and “nothing”.
Lucia
Thanks for the reply. My first time here, and I read from Mosher over at Judith’s that you were knowledgeable on GCMs, so I thought I’d stop by.
Rob,
The answer to your first question is politics, pure and simple. The IPCC does not wish to offend any country by ignoring their work.
Rob–
I’m not that knowledgeable. I’m an engineer. I don’t run GCMs. But they are models, and there is nothing particularly special about them as models.
Rob.
As a former aerospace guy who ran models, I share your befuddlement. Going on 4 years now waiting for somebody to clarify.
taps foot.
Basically there is no spec for the model. They are all research projects.
Rob,
From what I can tell, models are averaged for two reasons, one justifiable and one not so much.
First the same modeling group will run dozens or hundreds of different model runs with the same parameters but somewhat different initial weather conditions to smooth out internal variability between model runs. This is good practice, as models tend to try and represent noisy patterns in things like ENSO, volcanoes, and other sources of natural variability which may differ in time occurrence from real-world variability.
Second, the IPCC tends to average together model runs from different modeling groups to get a multimodel mean. This is rather silly, as it gives equal weight to different modeling groups who may have differing skill in modeling, and is somewhat influenced by politics (e.g. not wanting to piss off any countries by excluding or downweighting their contributions).
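A minimal sketch of those two averaging steps (the layout, group names and numbers below are purely illustrative):

```python
import pandas as pd

# Illustrative layout: one row per (modeling group, run, year) anomaly.
runs = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B", "C"],
    "run":   [1,   2,   1,   2,   3,   1],
    "year":  [2000] * 6,
    "anom":  [0.31, 0.35, 0.48, 0.52, 0.50, 0.40],
})

# Step 1: average runs within each modeling group to smooth internal variability.
group_means = runs.groupby(["group", "year"])["anom"].mean()

# Step 2: average the group means with equal weight to get the multi-model mean.
multi_model_mean = group_means.groupby(level="year").mean()
print(multi_model_mean)
```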
Rob,
If the most nonperforming models were predicting excessive cooling, I don’t think that national sensibilities (as Zeke graciously surmises) would prevent them from being dropped from the ensemble.
The simpler answer is that keeping bogus high-end projections based on excessively high sensitivities in the mix allows the summary writers to say things like:
Six degrees? Seriously?
The large spread allows one to smuggle in highly improbable upper end estimates which are scarier and serve the express political and policy ends of the IPCC.
Also, as long as reality hugs the bottom end of a wide error range for a model mean comprised of disparate projected outcomes, it can be claimed that the ensemble as a whole remains in play even as the individual high end model projected outcomes move rapidly further away from actual data.
So far, the IPCC has had no incentive to get it right or pick the best performing model or range of models if it/they show(s) a lukewarm outcome. Maybe with their remaining credibility on the line they will thin the ensemble next time around.
Lucia:
Yep. If they did a “good enough” job in the hindcast, it shouldn’t matter what the baseline is. The fact that it does matter is a kind of indirect metric telling us there are still issues with the models in the hindcast.
Wondering who you’re thinking of when you say “people are told the ‘good’ comparison during the hindcast ought to give us confidence in the predictive ability”. Is that Gavin?
George.
Part of the issue is sensitivity as well.
The sensitivity of models varies between 2.2 and 4.4.
The argument goes that within those ranges you can get good hindcasts, so that range is taken as some “evidence” that the true value lies between those numbers. If all modellers adopted one sensitivity, say 3C, then your forecasts going forward would have really narrow bounds (±0.5C) and you would be overestimating your confidence.
Tough problem. no simple answer
FWIW, the spread of individual models is shown in IPCC AR4, Figure 10.5.
It’s easier to look at on the PDF, which has hi-res images.
Re: Steven Mosher (Comment #86605)
As I’m sure you know, the doubling sensitivity isn’t just a number you can plug in at the start of the run.
Standardizing the historical forcings would seemingly help constrain the range of sensitivities, though.
Lucia – Wouldn’t it be better to use absolute values rather than anomalies to evaluate how well the models match the current warming and replicate the past temperature history, as you did here http://rankexploits.com/musings/wp-content/uploads/2009/03/temperatures_absolute.jpg
It seems to me that whatever base line you choose for the anomalies it makes the models look a lot better than they look when you look at absolute values.
Well, actually it’s pretty good:
http://www.ipcc.ch/publications_and_data/ar4/wg1/en/faq-8-1-figure-1.html
Note that despite the alignment on the 1900-1951 baseline, the past is reconstructed pretty well all the way to 2000.
If the temperature has unforced excursions of a decade or so (like in the 40s) [EDIT: and if the baseline is short], the choice of baseline is always going to matter for short-term predictions, right?
Carrick
It’s just one of those general claims put forth here and there. This is wording from the AR4:
http://www.ipcc.ch/publications_and_data/ar4/wg1/en/faq-8-1.html
The text includes a comparison of the multi-model mean to observations. In that case “Temperature anomalies are shown relative to the 1901 to 1950 mean.”
BobN-
Yes and no. If you were just trying to discuss why we might have less confidence in models, you would want to show that the absolute values are all over the place. However, if you want to test projections, you need to accept that a group picked a specific method to create projections, and then communicated their projections. Whether or not you, I, or the flea on the wall think that method is legitimate, or whether we have any confidence in the method, if we want to test it we need to test those projections, and also talk about how that method might or might not work.
So, if you want to discuss whether the method used by the IPCC seems to work, you have to use anomalies because they use anomalies. Of course, it’s perfectly legitimate to discuss other subjects– but we can’t pretend that the IPCC uses raw temperature because they don’t. They use anomalies.
toto–
Using the baseline method, it is possible to make the hindcast look ok. However, if we keep that baseline, the forecast looks not-so-good. It’s important to not forget that you can’t take credit for a good hindcast with one baseline and then switch to another one to evaluate the forecast.
Wrong. There are circumstances where choice of baselines doesn’t matter for short-term predictions.
If the models are correct, the baseline won’t matter much. Ever. It won’t matter for long predictions; it won’t matter for short predictions. The only thing you would have to consider is how variable weather is– but that’s not really a baseline issue.
If what you mean is that if you show a small number of years, then we can always pick a baseline that makes the models look good over a short time, that’s true. If you show short term predictions, it is always possible to pick a baseline that permits you to make even the crappiest of models look ok.
So, I guess if you assume the models are crap, then for short term predictions the baseline always matters, because the disagreement can be hidden from those who do not understand baselining by some choice of baseline. But under the hypothetical situation where models were good, choice of baseline wouldn’t ever matter.
OK, fair enough. But eyeballing the graph, I can’t see a very large discrepancy in the 1980-1999 period, which leads me to suppose (rightly or wrongly) that shifting to the latter baseline would have little impact on hindcast quality.
Only if you define “correct” as “able to reproduce every single short-term fluctuation of the temperature curve”. As soon as you allow for unmodelled short-term fluctuations (like the bump in the 40s), moving the baseline can shift the whole curve up or down significantly – as you showed.
Toto, Lucia used a 20-year baseline (1930-1949).
I don’t think that natural fluctuation is going to substantially affect the baseline over that period.
So what do you define as “short” and how did you make that criterion choice?
If your reasoning is that the models break down therefore it’s too short, I detect a loop in your logic. I will even make the offer to do the Monte Carlo if you promise to admit the error of your ways once I prove you wrong. 😛
I think a much more reasonable approach is to say that something’s wrong with the model forcings, the temperature reconstruction, or both. If you want to call this “unmodelled short-term fluctuation” that’s fair enough, but don’t mistake unmodelled with unmodellable.
Toto
It won’t look horrible. I’m not sure what you mean by “little impact”. It will have an impact. The forecast will look the way it did in my original: Hadcrut is outside the 1-sigma and the others are near it. The temperatures remained below the multi-model mean even during El Niño. That is to say: it won’t look very good.
Huh? No. I’m not defining “correct” that way. I’m defining it as being able to reproduce climate trends, that is, the underlying trends. If the models are good in the sense that they are not off in the long term trends, then choice of baseline doesn’t matter much. But if they can’t reproduce long term trends, it matters.
If what you mean is that if you only show a short bit of data and you use the best baseline for that short bit of data, then you can always make even crappy models look good– that’s what I said.
However, if the models are good in the long run, the baseline doesn’t really matter so much. The models will look good on pretty much any baseline– provided you evaluate them keeping weather in mind.
Steven Mosher (Comment #86605)
December 2nd, 2011 at 3:06 pm
George.
Part of the issue is sensitivity as well.
The sensitivity of models varies between 2.2 and 4.4.
The argument goes that within those ranges you can get good hindcasts
———————————-
The hindcasts of the models match the observations by increasing the negative aerosol forcing when the CO2/GHG sensitivity is higher. Several Climategate emails make note of this and consider it an important indication of simple tuning in the models.
http://img36.imageshack.us/img36/8167/kiehl2007figure2.png
Re: Bill Illis (Comment #86616)
There is an entire paper on this in the literature. It’s no secret.
Probably you mean this:
JT Kiehl, “Twentieth century climate model response and climate sensitivity”
GEOPHYSICAL RESEARCH LETTERS, VOL. 34, L22710, doi:10.1029/2007GL031383, 2007
(Purloined from Bill’s link… the file was called “kiehl2007figure2.png”.)
It’s enough of a secret that I occasionally have people argue with me and claim that climate models don’t involve tuning (and sometimes we work out we don’t mean the same thing by “tuning.”)
This thread had a lot of discussion on tuning btw
http://www.youtube.com/watch?v=Z4uvS9l-FcQ&feature=related
Planet Earth’s temperature seems to be sometimes in agreement with the ECHAM model – albeit without the perpetual toggling every other year between record-breaking (in size and intensity) El Niños and La Niñas.
Success!
Re: Carrick (Comment #86618)
December 2nd, 2011 at 11:24 pm
So it was… I saw “…” in his link and didn’t see the full file name. 🙂
Yep those ellipses hide a multitude of sins…
Here’s figure one of Kiehl btw.
You really need to compare it to figure 2 to get a full picture.
From my perspective he should have had a table of models vs their sensitivities too. Oh well.
Kiehl occasionally publishes with Treebeard Trenberth (ha ha, bet he’s never heard that one before), so technically he’s on the team. No word on whether the paper helped “the cause.” 😛
George Tobin says: “So far, the IPCC has had no incentive to get it right or pick the best performing model or range of models if it/they show(s) a lukewarm outcome. Maybe with their remaining credibility on the line they will thin the ensemble next time around.”
George, for the hindcast discussions and comparisons with observed temperatures in AR4, the IPCC excluded half of the models in the CMIP3 archive. So if they were to thin the ensemble next time around even more, they aren’t going to be using many models. In Chapter 9, the IPCC says they excluded the models due to drift, but if you plot the models excluded and compare to the ones they included, what’s missing from the excluded ones stands out like a sore thumb. There are no dips and rebounds from volcanoes.
http://i39.tinypic.com/2572cuu.jpg
If you were to use the CMIP3 ensemble mean data from the KNMI Climate Explorer during any statistical evaluation, the absence of the volcano dips and rebounds in half of the models could impact whether there’s a fail or pass during a given epoch. It would, of course, depend on the period being evaluated.
Lucia: There’s always the option of using the same base period the IPCC did in their comparisons in chapter 9 of AR4, which was 1901 to 1950.
http://www.ipcc.ch/publications_and_data/ar4/wg1/en/figure-9-5.html
That way you couldn’t be accused of cherry-picking the base years.
It’s a bit academic really (there is some divergence among the models, so at the very least some must be worse than others at estimating the trend) but I can’t see how you infer that.
If there are short term fluctuations that are not captured by the models, and if these fluctuations have significant duration in comparison to the length of the baseline, then it follows that the choice of baseline can shift the entire curve up or down, by an amount comparable to the amplitude of the fluctuations. It is just trivial arithmetic – if you subtract a larger/smaller number, you get a lower/higher curve.
This is completely independent of the long term trend accuracy – or lack thereof. Bad long term accuracy will compound the problem if you push the baseline further back, but even an ideal model that predicts the long term trend perfectly would still be subject to that effect.
Lucia,
“It’s possible to explain that this spreading has very little to do with weather; it mostly arises because individual models predict different amounts of warming since 1920.”
Does this mean that if you only used one model instead of 14, the 1-sigma range would be zero? That hardly seems likely. Wouldn’t it be better to determine an actual error range for each model based on each model’s hindcast period? Then combine inter-model variances and intra-model variances using ANOVA. Even better, weight each model by the inverse of its internal variance. That would give us a much better idea of what the real 1-sigma range is — and what the real mean is, too.
Toto:
You still haven’t defined what you mean by “short term” (usually that is less than 10 years).
Lucia is using a baseline that is much longer than what is normally meant by “short term”.
The models should be able to capture this not very short term fluctuation, and they do not.
You are avoiding admitting this, why?
In my monthly post comparing observations to projections, I use the baseline the IPCC choose to describe their projections. Warming is projected relative to the average from Jan 1980-Dec. 1999.
Toto may object this is somehow too short– but it’s the one the authors of the AR4 chose to describe projections. Because they used it, I think it’s the fairest one to pick to compare observations to projections.
It happens to be one of the choices that makes models look more favorable than most others.
Toto–
It is simply a mathematical consequence of baselining and has nothing to do with model divergence.
Are you worrying about someone using a week-long baseline? Or day-long? Sure. But once again: this is a “weather” issue. If one interprets the graph keeping weather in mind, then if the models are right in the long term, the baseline won’t matter much. In the case of the IPCC models, the data spread would be large. Moreover, the earth's weather would be so noisy that the models wouldn’t look bad; you’d get what appears to be agreement within weather.
Run some Monte Carlo and give it a try. You might be able to teach yourself something.
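Here is roughly the kind of Monte Carlo I have in mind, sketched with made-up numbers: a linear “climate” trend plus AR(1) “weather”. The trends, noise level and persistence are illustrative, not fitted to anything.

```python
import numpy as np

rng = np.random.default_rng(3)
years = np.arange(1900, 2012)

def fake_series(trend, n, sd=0.12, phi=0.5):
    """Linear 'climate' trend plus AR(1) 'weather' noise (illustrative parameters)."""
    noise = np.zeros(n)
    for i in range(1, n):
        noise[i] = phi * noise[i - 1] + rng.normal(0, sd)
    return trend * np.arange(n) + noise

def rebaseline(x, start, end):
    sel = (years >= start) & (years <= end)
    return x - x[sel].mean()

# Pseudo-observations warm at 0.008 C/yr. Try a "correct" model (same trend)
# and a "wrong" model (twice the trend); compare the average 2000-2011 gap
# under a distant baseline and a recent one.
recent = years >= 2000
for model_trend in (0.008, 0.016):
    for b0, b1 in ((1930, 1949), (1980, 1999)):
        gaps = []
        for _ in range(1000):
            obs = rebaseline(fake_series(0.008, years.size), b0, b1)
            mod = rebaseline(fake_series(model_trend, years.size), b0, b1)
            gaps.append((mod - obs)[recent].mean())
        # Correct trend: the gap is near zero for either baseline.
        # Wrong trend: the gap grows the further back the baseline sits.
        print(model_trend, (b0, b1), round(float(np.mean(gaps)), 3))
```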
KAP
No, I don’t mean that. You can’t compute a 1-sigma spread for the model means with 1 model. Ever. You end up dividing 0 by 0!
You need at least 2 models to compute a standard deviation.
You need to define your goal to decide what is “better”. I show the multi-model mean and the 1-sigma spread because that’s what they use to illustrate projections in the AR4. If the goal is to show how temperatures are tracking, matching the conventions in the AR4, you use the conventions in the AR4.
If you have some other question, some other graph might be more enlightening. But FWIW: I think SteveMc agrees with you, and I don’t disagree. But the idea you are suggesting doesn’t seem to have caught on in climate science. (FWIW: it will make models generally look pretty bad. Suggesting that the full model spread is “weather”, or if not that, that it is the ‘only’ way to compare models to data, is the way to create really large uncertainty bands, and that seems to be the preferred method du jour.)
No, I don’t mean that. You can’t compute a 1-sigma spread for the model means with 1 model. Ever. You end up dividing 0 by 0!
You need at least 2 models to compute a standard deviation.
Fair enough, but if you had used 2 models, the standard deviation would be zero at every point the models’ predictions cross. That still seems inadequate to me, regardless of how the IPCC does it. If the method I propose makes errors larger (and it does), that’s a reflection of reality: model errors are large. It would be nice to have a graph that shows that.
Kap–
Sure. If they crossed, then at the instant they crossed, that would happen. Standard deviations computed from finite samples are estimates of what you would get as the number of components in the sample approached infinity. So, sometimes the estimate based on the sample is smaller than the thing you want to estimate and sometimes it’s larger. This is a feature of finite samples.
In principle, you could get a zero standard deviation when the runs from 22 models all instantaneously cross too. Heck, in principle it could happen with 1000 models. This is just a finite sample thing. The more samples you have, the less often you get “0”. This isn’t a real problem; we can also estimate the uncertainty in the standard deviation relative to what we would get with N->infinity models (all from the same population.)
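A quick toy illustration of the finite-sample point. The “true” spread below is made up; the draws just stand in for different models’ anomalies at one year.

```python
import numpy as np

rng = np.random.default_rng(4)
true_sigma = 0.15  # made-up population spread of model anomalies at some year

for n_models in (2, 5, 22):
    # Sample standard deviation from n_models draws, repeated many times.
    draws = rng.normal(0.0, true_sigma, (20000, n_models))
    s = draws.std(axis=1, ddof=1)
    # Small samples scatter widely around the true value and occasionally land
    # near zero (e.g. when two "models" nearly cross); large samples rarely do.
    print(n_models, round(float(s.mean()), 3),
          np.quantile(s, [0.05, 0.95]).round(3))
```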
My understanding (I may be mistaken) is that models hindcast as easily as they project and models that can’t hindcast accurately aren’t expected to accurately forecast.
So it would follow that the expectations are that projections are as accurate as hindcasts? In other words the modelers expect the forecasts to have the same margin of error as the hindcasts?
If the above statements are true, then if the models’ forecast margin of error exceeds the hindcast margin of error, doesn’t that falsify the model?
Genghis:
Well, since you don’t know future forcings (indeed they are unknowable), you won’t be able to forecast with the same accuracy as you can hindcast.
I think a lot comes down to the length of the baseline period. As Toto wrote above:
“If the temperature has unforced excursions of a decade or so (like in the 40s) [EDIT: and if the baseline is short], the choice of baseline is always going to matter for short-term predictions, right?”
Lucia replied:
“Wrong. There are circumstances where choice of baselines doesn’t matter for short-term predictions.
If the models are correct, the baseline won’t matter much.”
I think that is only true in an ideal world with a 100% correct model. No model is 100% correct. (“all models are wrong, but some are useful”)
If I try to simulate the diurnal pattern of a certain measured compound, then only trying to simulate daytime concentrations is surely going to make the agreement with nighttime concentrations worse.
Likewise, with an imperfect model and in the presence of unforced variability in the data, the shorter the baseline period, the worse the agreement outside of that base period.
Lucia, is there a way to obtain the model runs (just the global surface temperature) without going through some lengthy registration process or having to download masses of irrelevant data?
Toto –
You can go to Climate Explorer: http://climexp.knmi.nl/selectfield_co2.cgi?id=someone@somewhere
Select “tas” (surface air temperature), then select to average across the globe for all members.
I don’t believe this is the same exact ensemble as AR4, but it’s pretty close.
Bart
Sure:
1) “If the models are correct, the baseline won’t matter much.” Maybe you could figure out other ways to word what I wrote, but I don’t think my statement becomes incorrect merely because, as you correctly observe, for a model that is completely correct baselines don’t matter at all.
2)
Not if the model also simulates nighttime concentrations correctly. It will only make the simulations of night time concentrations look poor to the extent that the model cannot simulate concentrations correctly. This is important to recognize if one is claiming the model simulates both.
3)
Yes. That’s why I’ve been mentioning the issue of “weather” repeatedly. IOTH: I said it right off. Then, I’ve agreed with everyone who brings it up as some sort of correction to what I said and I have pointed out that I’ve said this.
toto–
For time series, go to the KNMI Climate Explorer. That’s what I did.
Fair enough. If the period is long enough for the unforced and/or unpredictable variability to be levelled out, then the better the model is, the less the choice of baseline should matter, and vice versa. With an imperfect model (i.e. all of them), I would expect a shorter baseline to provide a less good fit.
It also depends on the question one is posing:
When the question is how well a certain time period can be simulated, that time period should be the baseline.
When the question is, given the simulation in time period X, how well does it simulate time period Y, the baseline should be X. Just thinking out loud here, there probably are more formal rules to apply.
Bart– Or another way to look at it is that when assessing the difference between a model and observations using a given baseline, you need to consider the magnitude of weather variability associated with that length of baseline. If you picked a 1-year-long baseline, that variability would be quite large; a 5-year baseline would be smaller; a 20-year baseline smaller still, and so on.
Obviously, one can make things look pretty bad with a very short baseline, especially if you hunt around and cherry pick. But that can be dealt with by insisting people include information about how much deviation you would expect to be introduced by picking a very short baseline and selecting the time periods at random, and also by examining the baseline they actually picked to see if the choice shows signs of cherry picking. (Using the 20 year period from 1930-1950 is cherry picked, and that is so even though 20 years is pretty long.)
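One way to put a number on that expected deviation, sketched with a placeholder noise series (substitute detrended observed anomalies): compute the offset every possible baseline window of a given length would introduce, and look at its spread.

```python
import numpy as np
import pandas as pd

# Placeholder monthly "weather" series; substitute detrended observed anomalies.
dates = pd.date_range("1900-01", "2011-10", freq="MS")
weather = pd.Series(np.random.default_rng(5).normal(0, 0.2, dates.size), index=dates)

def baseline_offset_spread(series, n_months):
    """Std dev of the shift introduced by every possible baseline of n_months."""
    candidate_offsets = series.rolling(n_months).mean().dropna()
    return candidate_offsets.std()

# Shorter baselines introduce more scatter in where the anomalies land.
for yrs in (1, 5, 20):
    print(yrs, round(float(baseline_offset_spread(weather, 12 * yrs)), 3))
```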
Also: I would note that if, when setting the baseline to X and comparing Y, the comparison during time period Z goes to heck, and someone points out that this points to a problem, then you can’t suggest the disagreement in period Z should be ignored because it can be made to disappear if you test period Z using baseline W.
One of the problems with this rebaselining issue is that either intentionally or accidentally, arguments of this sort are advanced:
1) Figure 1 using baseline X shows pretty good agreement during the 20th century.
2) Figure 2 using baseline Y (which differs from X) shows pretty good agreement in the 21st century and
3) then communicating the notion that, given these two figures we’ve now shown there is good agreement.
But the difficulty is that if you use baseline Y, the 20th century might look bad, even considering differences you might expect from “weather”, while if you use baseline X the 21st century looks bad. Of course we can all argue about whether a particular level of agreement is good or bad. But it seems to me that if you are going to claim good agreement over a full time span from, say, 1900-2011, then one has to show that the agreement remains good over the full period. You can’t use one baseline to compare during a sub-time period (say the 20th century) and then a different one during a different sub-time period (say the 21st century.)
OK, thanks Lucia and Troy!
I answered my own question: actually, the multi-model median and mean are virtually identical, and remain so in the future. So apparently outliers don’t have that much of an influence over the mean.
Disclaimer: I included all 22 models from the climate explorer archive, including those with only 1 run, and baselined everything on 1980-1999. And of course I may have done things wrong.
Lucia… I’ve got another silly question… (please don’t go catholic-school on me!) 🙂
On the first graph of that post, where you showed models with a 1961-1990 baseline… are you absolutely sure that you re-baselined the observations as well?
Because I only get a similar graph if I keep the observations on a 1980-1999 baseline, while applying the 1961-1990 baseline to models.
toto–
Hmm… you know what. I’m not sure!
I’d been fiddling with something, and I may have screwed something up in the script and not caught it. Let me check. The screwup may have been fiddling I did after– since what I’m seeing now wouldn’t make the graphs in this post. But still…
I’ll check and get back to you.
There’s a tarball or zip file floating around somewhere with the averaged versions of the individual model runs. I can’t remember anymore where I found it (seems like somebody on this blog pointed me to it).
Anyway, here’s a copy of it.