This is just a snark for historical reasons. Long time readers know that I have always considered 2001 the “best” start year for comparing observations to projections, with my reasons discussed here. However, there have been “others” who insist that the start date “should” be 2000 for different reasons. Needless to say, I like the date the authors of the AR5 use as the “start” for projections.

Included in the caption we find:
For the AR4 results are presented as single model runs of the CMIP3 ensemble
for the historical period from 1950 to 2000 (light grey lines) and for three scenarios (A2, A1B and B1) from 2001 to 2035.
So, in the AR5, the “start” year for the AR4 projections is 2001. The “end” year for the AR4 hindcasts is 2000. Grin. 🙂
(Substantive comments on this graph deferred until later. But for convenience the entire caption reads
Figure 1.4: Estimated changes in the observed globally and annually averaged surface temperature anomaly relative to 1961–1990 (in °C) since 1950 compared with the range of projections from the previous IPCC assessments. Values are harmonized to start from the same value in 1990. Observed global annual mean surface air temperature anomaly, relative to 1961–1990, is shown as squares and smoothed time series as solid lines (NASA (dark blue), NOAA (warm mustard), and the UK Hadley Centre (bright green) reanalyses). The coloured shading shows the projected range of global annual mean surface air temperature change from 1990 to 2035 for models used in FAR (Figure 6.11 in Bretherton et al., 1990), SAR (Figure 19 in the TS of IPCC, 1996), TAR (full range of TAR Figure 9.13(b) in Cubasch et al., 2001). TAR results are based on the simple climate model analyses presented and not on the individual full three-dimensional climate model simulations. For the AR4 results are presented as single model runs of the CMIP3 ensemble for the historical period from 1950 to 2000 (light grey lines) and for three scenarios (A2, A1B and B1) from 2001 to 2035. The bars at the right hand side of the graph show the full range given for 2035 for each assessment report. For the three SRES scenarios the bars show the CMIP3 ensemble mean and the likely range given by –40% to +60% of the mean as assessed in Meehl et al. (2007). The publication years of the assessment reports are shown. See Appendix 1.A for details on the data and calculations used to create this figure.
)
AR4 was released in 2007. Shouldn’t that be the start date?
MikeN–
In principle, one could argue that. The difficulty is that the modelers insist they are projecting from earlier dates, and there is some justification for that. The reason is that it takes them a while to build and tune their models, so the ‘tunings’ are in some sense frozen at some point and then the modelers start to make runs, and upload to places like CMIP5. So, many of the AR5 runs predate 2013– and will certainly predate the official release date. (They tend to predate the first draft.)
By the same token, some decisions about the projections can be made after models are run. The authors of the AR5 figured that out and some of their ‘projections’ are ‘observationally constrained’, with the constraints figured out after the models were uploaded to places like the CMIP5. So….
Anyway, I wrote up why I was willing to call 2001 the “start year” but not go back any further. (“Way back when” some bloggers and modelers like to insist the ‘start year’ is 1990– which, taken literally, means projections from 2007 were somehow “blind” to data after 1990. This despite the fact that none of the AOGCMs used in the AR4 even existed in 1990!)
I still don’t quite get what jive they’re pulling in realigning (“harmonising”) the data here, but there is serious misrepresentation involved.
According to this graph, when TAR came out in 2001 full of Mann’s hockey-stick and full-on alarmism, the actual recorded temps were breaking out of the upper limit of TAR projections.
i.e. TAR projections were not centred on what observations were at that time, and the projected range indicated a slowdown was in store.
I really don’t recall the spin in 2001 being “our projected range of temperatures suggests there will be a dramatic reduction in the rate of warming over the next 15 years and by the end of that time we expect temperatures to be in the middle of our projected range. ”
It was like: OMG, look at the last 20y, our models predict this exponential rise will continue and we’re toast. We must act NOW.
“Harmonising” is such an elegant piece of wordcraft, it sounds so natural; who could object to “harmonising”? It sounds so much better than “moving the goalposts”, don’t you think?
Apart from that, they’ve overlaid so much information that the whole thing becomes a jumbled, illegible mess that no one can read unless they are already pretty familiar with all the data.
In a word: obfuscation.
I just don’t get it. The only way you can honestly (yeah right) compare two anomalies on a point by point basis is if they are adjusted to a common anomaly based on their start points or at least an average based on the start of the model run. It would be like taking two voltage traces of a ramp and a flat line and adjusting the bias of one to overlap the middle of the other then claiming the end points were close. The only “valid” comparison that I can see is if they showed trend lines instead of anomalies. Of course that wouldn’t work out that well for them.
Lucia: “The authors of the AR5 figured that out and some of their ‘projections’ are ‘observationally constrained’, with the constraints figured out after the models were uploaded to places like the CMIP5. So….”
How can something be “constrained” after the fact? I don’t understand what that can mean.
(“Way back when” some bloggers and modelers like to insist the ‘start year’ is 1990– which, taken literally, means projections from 2007 were somehow “blind” to data after 1990.
The only range of projections that looks reasonably close (at least on this rendition) is SAR. FAR was pretty much a shotgun approach, but it was early days. SAR got considerably better, eliminating a lot of the more extreme end. But then spin and agenda trumped science and it all went wrong. We seem to have moved into the era of “honest vs effective dilemmas”.
The final draft looked pretty clear to me, except that I could not see what the big grey envelope swelling up out of the floor was about. I’ve never seen that on a graph before.
Barry: The only “valid” comparison that I can see is if they showed trend lines instead of anomalies.
God forbid. Trend lines are the worst thing that ever happened to climate science. Should be banned, along with running mean “smoothers”.
[correction] It was SOD that had the clearer graph with just ranges plus obs data.
It was clear that AR4, with the benefit of 20-20 tuning hindsight, was able to follow the dip in the 90s which earlier work was obviously unaware of.
It also allowed an easy visual assessment of how the ranges of projections compared to obs data.
If the data had been uplifted on that plot the fix would have been blatant, with the whole of the 90s well above the model that was supposed to have been tuned to it. It would have been obvious it was rigged.
That’s why they had to do all the spaghetti to obfuscate what was going on.
Greg Goodman
Some of the guys at the UK Met Office wrote a paper explaining that trends for the models in the upper range of the CMIP3 are inconsistent with data. They then did a Bayesian analysis based on some complicated stuff involving observations, and came out with an adjustment for each model based on how much it over-predicted trends in the past and so on. They then describe their adjusted projections.
I suspect those are the “observationally constrained” projections in the AR5, as they are the only ‘observationally constrained’ projections I know of.
Lucia,
Please pardon me if I’m harping on the obvious. That strikes me as perfectly outrageous. If the model is inconsistent with the data, the thing to do is address whatever problem is causing the model to be inconsistent with the data, not find some fudge factor for the results. If the model can’t be fixed, fine; give it up. But there’s no call for anyone to pretend they’ve got a model that works by indulging in charades like that.
Caveat – if there is some reasonable theoretical basis for adjusting the projections, ok; then we can consider that adjustment part of the model, albeit one that is applied manually. It doesn’t sound to me like that’s what we’re talking about though.
Here I thought I was beyond being astonished at anything climate scientists do.
Mark Bofill, I can give an example where a version of this might be necessary.
Suppose you have a model, but the inputs to the model aren’t predictable. For example, for a climate model, you don’t really know the future CO2 forcings, let alone volcanic aerosol forcings.
If you want to test the quality of the model, you can try and run a series of scenarios (e.g., A1, A2, B1 and B2 emission scenarios) and obtain a series of projections from the models based on these scenarios. Note these are “projections” rather than “forecasts” because the inputs are alternative futures that will in practice never happen. Also, to be clear, we run the model multiple times with independent starting conditions, and generate an ensemble of model outputs for that model for each scenario.
Next you wait five years, and then you compare the model projections against the actual measured results. Only there’s a problem because you never ran the model with the actual forcings that were present during those five years. So how do you constrain the model in this case? To make this clear, suppose at the time the code was run, we knew the forcings until 2005, and after 2005 the forcings were all projections. We then want to compare those model outcomes to the actual data from 2005 to 2010.
The best way I know would be a form of “constraining after the fact”:
You require that the GCM code be designed so that the set of input conditions can be fed to it without requiring the code to be recompiled. (So many scientists build their inputs into the code—this is okay for exploratory work, not good at all for code that you want to verify.) You also require that the program can be stopped at any intermediate point in such a way that it can “save” the intermediate state (2005 in this case), and allow it to be run from there into the future, either with the same forcings, or new ones.
To test the validity of this particular code what you do is run the same exact binary (same timestamp and everything) that you ran in e.g. 2005 starting with the “saved states” from 2005 through 2010 using the “actual” forcings.
As far as I know, this isn’t done, but it’s the only way I know to legitimately test an observational model.
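A minimal toy sketch of this save-state/re-run idea, in R. The “model” below is a trivial stand-in recursion, not a GCM, and every name, forcing series and number in it is invented purely to illustrate the workflow described above:

```r
# Toy illustration of "constraining after the fact" via saved state + re-run.
# The "model" is a trivial relaxation recursion, not a GCM; all names, forcings
# and numbers are hypothetical.

step_model <- function(state, forcing) {
  # one annual step: relax toward a forcing-dependent equilibrium, plus noise
  state + 0.2 * (0.8 * forcing - state) + rnorm(1, sd = 0.05)
}

run_model <- function(state, forcings) {
  out <- numeric(length(forcings))
  for (i in seq_along(forcings)) {
    state <- step_model(state, forcings[i])
    out[i] <- state
  }
  list(final_state = state, series = out)
}

set.seed(1)
known_forcings    <- seq(0.5, 1.0, length.out = 56)   # "1950-2005", known at run time
scenario_forcings <- seq(1.0, 1.1, length.out = 5)    # "2005-2010", guessed in 2005
actual_forcings   <- c(1.00, 0.98, 1.02, 1.05, 1.03)  # what actually happened

hist_run <- run_model(state = 0, forcings = known_forcings)
saveRDS(hist_run$final_state, "state_2005.rds")       # freeze the 2005 state

# Original projection, made in "2005" with the scenario forcings:
proj <- run_model(readRDS("state_2005.rds"), scenario_forcings)

# Validation run, done after the fact from the same saved state,
# but driven by the forcings that actually occurred:
valid <- run_model(readRDS("state_2005.rds"), actual_forcings)

# 'valid$series' is what you would compare against observations;
# proj$series - valid$series isolates the error due to the forcing scenario.
```

The only point of the sketch is the workflow: freeze the state at the branch year, then drive the same state forward once with the scenario forcings and once with the forcings that actually occurred.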
The point here though, is validation in observational science is done differently than with experimental science. You don’t get to reset the universe and run the “experiment” multiple times, you just get one shot, and you generally aren’t able to control all of the conditions affecting the system being measured during the observation periods.
It’s important to recognize that the A1, A2, B1 and B2 scenarios are more useful in comparing model outputs between models, and to get an idea of the effects of different environmental policy on future AGW. Since the actual emissions for the period 2005-2010 were unknowable at the start of 2005, there really isn’t a good way to directly look at how a model actually performs against data outside of the verification period (e.g., 1980-2005).
The case can be made that you really need to do this: The software writers likely selected a subset of the model outcomes that did particularly well during the verification period (this isn’t always intentional, but people will tweak parameters to make their output look better for the period where data are available). Running them past 2005 helps you sort out this type of selection bias.
Carrick,
Thanks so much for your answer. I haven’t had more than a minute to quickly scan through (relatives visiting), and I’ll admit I’m a little puzzled at a quick glance, but I look forward to going through it in detail.
Carrick,
Okay, I am grateful for your explaining this, because this certainly isn’t what I thought Lucia was talking about. I have no issue with what appears to me (if I’m understanding you correctly) to be rerunning the model, filling in after the fact the inputs (forcings) that were only guessed at in the earlier runs of the various projections.
What I thought Lucia was saying was quite different. I thought the results or outputs were modified on the basis of some statistical procedure. If the modification is merely ‘filling in the best guess blanks with what was actually observed for forcings’ in the inputs, then that’s fine.
Thanks for the explanation.
Actually, what I mean is that in the AR5, there are several sets of projections.
i: One is based on raw output of models.
ii: Another is ‘observationally constrained”.
iii: Yet another!
‘i’ is likely done as Carrick described. This is how models are run.
‘ii’ is done more or less the way Mark Bofill thinks I said. Some looked at the output of models, diagnosed that it was too high relative to data, and suggested the models will continue to be too high in the future because they over-respond to GHGs. They then created a projection which is not the average over models; it is an adjusted value that uses both model output and observations to predict the future. (A rough sketch of this sort of adjustment is below.)
iii: Seems to be a survey of opinions. But those opinions are influenced by whatever the people holding the opinions have read, talked about and so on, so likely influenced by the models.
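For what it’s worth, here is a minimal sketch of the kind of adjustment ‘ii’ describes. It is not the actual AR5 or Met Office procedure, just an illustration of blending model-projected trends with an observed hindcast trend; all the numbers are invented:

```r
# Illustrative only: blend model-projected trends with an observed trend,
# down-weighting models that over-predicted the historical trend.
# Numbers are invented; this is not the actual AR5 / Met Office method.

model_hist_trend <- c(0.20, 0.25, 0.30, 0.15, 0.28)  # C/decade, hindcast period
model_proj_trend <- c(0.22, 0.30, 0.35, 0.18, 0.32)  # C/decade, projection period
obs_hist_trend   <- 0.16                             # C/decade, observed

# Weight each model by how close its hindcast trend is to the observed trend
w <- exp(-((model_hist_trend - obs_hist_trend) / 0.05)^2)
w <- w / sum(w)

raw_mean         <- mean(model_proj_trend)       # "one model, one vote"
constrained_mean <- sum(w * model_proj_trend)    # observationally weighted

c(raw = raw_mean, constrained = constrained_mean)
# The constrained value is pulled toward the models that tracked the past better.
```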
@BarryW
The only “valid” comparison that I can see is if they showed trend lines instead of anomalies.
There is a graph in chap. 9 (p 162) comparing trend lines. A good one, without too much information.
Carrick makes a good point:
“To test the validity of this particular code what you do is run the same exact binary (same timestamp and everything) that you ran in e.g. 2005 starting with the “saved states” from 2005 through 2010 using the “actual” forcings.”
From a financial discipline, you would like to run your model against the demand that actually happened, in order to be able to say how much of the variance against your original business plan was due to reduced demand, rather than different input costs or different labour efficiency. Given the extent to which the IPCC is model-driven, you would kind of expect these kinds of variance analyses to be explicit in their write-ups. In real-life financial disciplines, it is often next-to-impossible to capture these variances to the desired detail, but they can be approximated. From what Carrick is saying, the IPCC does not even start to go down this route. In which case, no model could ever be falsified.
Mark— I wasn’t actually suggesting that Lucia meant what I was describing, just that you can “constrain after the fact”, and that it not only makes sense with observationally-based computational models, but that this is the only “pure” way to validate these models.
Diogenes, in my opinion, this sort of validation testing should be part of the IPCC model-testing framework. The AR5 should have included a section discussing the results of this testing of the AR4 models for example. I actually think certainty of falsification might be a factor here. (Why should I run this test if it is only going to show how bad my model is?)
Carrick,
OK.
Lest anyone wonder what bugs me about this, the short and simple form of the problem as far as I’m concerned is that there are an arbitrarily large number of possible ways to map or scale the output of a model to the observations that the output is not matching. Not only are the odds of a correct mapping based entirely on an empirical approach extremely low, but there is no guarantee there ~is~ a correct mapping, period; the model may simply be fundamentally flawed. But this approach is particularly bad in the observational science case, where you can’t even collect a set of observational ‘runs’ to use to try to come up with a reasonably comprehensive mapping from flawed model output to actual observed results. It’s sort of like saying this: the model says f(x) = x^2+10, the observations are that for input x=3 the result is 10, so apply a correction factor of 1/2 to the outputs to make the results closer.
Phooey. Phooey I say.
Lucia,
A bit off topic, but did you not do some posting a couple of years ago discussing how long a period of time the departure from the ‘consensus’ models would have to go on until the observations and projections diverged sufficiently to call the predictive work into question?
Carrick,
“Why should I run this test if it is only going to show how bad my model is?”
.
Indeed. I agree that there is quite a lot of this tendency in both AR4 and AR5. Critical model testing/validation is one of those things which seems to be treated with very (VERY!) soft kid gloves in climate science. It shouldn’t be.
A while back, CA posted some observations on the dearth of interest among the tree recording-thermometer gang in gathering additional data. I don’t think anyone suggested what I took to be the real reason, which was that additional data might not converge with previous interpretations but only add noise. A result like this would undermine confidence in the efficacy of trees in this application and would seem counter-productive, at least to the trade.
SteveF (#120089)
Not only is critical model assessment skirted in the IPCC reviews, it seems to be missing in the entire development process. From the preface to the “National Strategy for Advancing Climate Modeling”:
Aside from the single (and vague) word “reliable”, there is not a trace of concern about the accuracy of models. It’s all about finer-resolution, more complex models. More MIPS! More terabytes! It’s as if they take it for granted that the GCMs are inherently correct and complete, and the only thing that’s missing is to provide greater horsepower for finer grids and including secondary effects which are currently omitted. A bureaucrat’s dream, no doubt.
I imagine there are modelers (probably almost all of them in fact) who are diligently attempting to reconcile modeling results with reality. There are huge impediments: we have only one Earth to observe; we can not conduct controlled experiments; and the time-scale of behavior is so long that it is difficult to separate the effects from the multiple causes present. Nevertheless, I find it appalling that climate models’ prognostications are treated as being equally reliable as, say, astronomical predictions, when it is clear that they have quite a bit of maturation to do.
In particular, AR5’s demurral that the “projections from the models were never intended to be predictions over such a short time scale” just compounds the situation. It is true that short-term predictions can be upset by “weather” (e.g., short-time-scale and apparently unforced events such as ENSO), but it is precisely the modeling of the longer-term processes which has the least reliability due to the limited duration of solid observations.
HaroldW,
I agree that climate modelers and their supporters assume far too much in terms of validity, and as Lucia (and others) has regularly shown, there is very little chance that the existing models as a group encompass reality: the chance that the ensemble is a reasonable depiction of reality is miniscule.
.
I agree with Professor Brown (Duke University), as he recently pointed out at Climate Audit, that the entire idea of evaluating the model ensemble against reality is logically flawed. Each model is supposed to be a rational representation of Earth’s climate, and as such, each should be individually evaluated by seeing if the run-to-run model spread, as well as the range and spectral characteristics of the model’s emergent variability, is statistically consistent with the Earth’s measured temperature history. It seems pretty clear to me that all of the models in the ensemble (perhaps excepting one or two) are clearly wrong, IOW, inconsistent with the data at high statistical significance.
.
Were this any normal scientific endeavor, that clear discrepancy would be enough to push the modelers to reevaluate and modify the models to better match reality. But climate science is most clearly not normal science; it carries huge political and moral baggage, along with huge potential social and economic impacts. The fundamental disagreement in the ‘climate wars’ is not about the accuracy of the models; heck, I have no doubt most modelers themselves recognize the models are not close to capable of making accurate projections, or of determining climate sensitivity as an ’emergent property’. The disagreement is about values, morals, priorities, judgement of risks and benefits, and about who controls the path society, national and international, takes….. just like many politically contentious issues. Resolution of the obvious technical problems with climate models, so that they could (at last!) make reasonably accurate projections, would not end the political disagreement, but might, if we are lucky, move the debate to the purely political arena where it has always belonged. In my darker moments, I suspect that the lack of validation, rigorous testing, and movement toward concordance with reality is due to a desire, conscious or otherwise, to keep the debate/disagreement ‘scientific’, so that draconian changes can be couched as ‘demanded by the science’.
.
I hope I am wrong about that, but the longer modelers refuse to make changes needed to bring models into line with reality, and with those changes, substantially lower estimates of climate sensitivity, the more I think my hopes are mistaken.
Lucia,
Your search engine was very effective.
This is the article that I was asking about:
http://rankexploits.com/musings/2011/statistical-significance-since-1995-not-with-hadcrut/
If I may be so bold as to offer a question, it would be to ask what if anything has confirmed or challenged the analysis you offered in the above linked post?
hunter
I think long ago, based on white noise and residuals to the historic fit, I did something like “how long” we could get a trend below zero or something like that.
hunter (Comment #120098)
That was just a question raised by someone who read an interview of Jones at the time. It’s probably significant now.
Steve F,
The point about ensembles and accuracy is interesting.
I recall the idea that error-filled models tend to magnify, not reduce, error.
The latest post at Climate Audit shows how the IPCC is reduced to increasing the error bands in AR5 compared to AR4 in order to sustain the IPCC conclusion of model accuracy.
If the models were improving accuracy, one would expect the error bands to be shrinking, not growing.
You are correct about the deep need for the AGW promoters to maintain the facade of science on their social movement. If the AGW consensus were seen as just another hand out in the political money smorgasbord, then people might start asking about cost-benefit and where did the money go?
hunter
Not quite, because of the way AOGCMs work, which involves uncertainty in the initial conditions and the sensitivity to that. (And until someone can measure every single temperature at every single point in the earth’s atmosphere, ocean and possibly dirt on the earth and so on, this uncertainty will remain. It’s in some sense fundamental and you can’t get past it. It is the ‘weather prediction’ problem.)
Anyway, given the uncertainty in initial conditions, if models become more accurate:
(a) You would expect structural uncertainty to shrink. That is: the mean response from different models would converge. This is true even if the earth’s trajectory is more like a “run” than an “average over all runs, i.e. model mean”.
(b) You may or may not expect the spread in runs for an individual model to shrink. The spread in runs for an individual model is related to ‘weather’, and whether this spread will increase or decrease as a model becomes more accurate depends on whether the current estimate is too high or too low. We don’t know which it is– but we know that ‘weather’ happens on earth and the spread in runs for an individual model should not be zero.
So finally: the total spread has contributions from ‘structural uncertainty’ and from ‘weather’. To the extent that ‘structural uncertainty’ dominates, increasing accuracy should shrink uncertainty intervals. To the extent ‘weather’ dominates, it should expand them. (A toy decomposition along these lines is sketched below.)
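A toy decomposition of the two contributions, with invented numbers, just to make the bookkeeping concrete:

```r
# Toy decomposition of ensemble spread into "structural" (across model means)
# and "weather" (run-to-run within a model) components. Purely illustrative.

set.seed(42)
n_models <- 5; n_runs <- 6
model_means <- rnorm(n_models, mean = 0.22, sd = 0.05)   # C/decade; structural differences
runs <- sapply(model_means, function(m) rnorm(n_runs, mean = m, sd = 0.08))  # "weather"

structural_var <- var(colMeans(runs))         # spread of the model-mean responses
weather_var    <- mean(apply(runs, 2, var))   # average within-model run spread
total_var      <- var(as.vector(runs))

c(structural = structural_var, weather = weather_var, total = total_var)
# If models improved, 'structural' should shrink toward zero, while 'weather'
# should converge on the earth's own internal variability, not on zero.
```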
Now: with regard to the IPCC figures. In the past the IPCC chose to show error bars that included ‘structural uncertainty’ only. In the current report, they are showing the contributions of both (based on model runs). That’s because the “spaghetti” springs from both the difference in model mean responses and the ‘weather’ about each model mean.
Part of the argument over Figure 1.4 being an “improvement” or not is whether the error bars for the AR4 “should be” the structural uncertainty or the combined effect of both structural uncertainty and ‘weather’. But… there’s another issue, which is: should it show what the authors picked regardless? My view is they should have shown what the authors picked and then, if they needed to explain it, also showed what they ‘now’ think is more appropriate for testing whether models are off. But collectively, the ‘borg’ mind of IPCC authors don’t seem to want to write a section where they say that in the past they chose to show structural uncertainty etc.
lucia,
Thank you for your interesting-as-always answers.
My take away from your second answer is that ‘they’ have chosen to rely on widening error bars, for whatever motive, to support their prior conclusions. Not so much to clarify but to hide. In industry this is a popular dodge to hide, err, declines and other disappointments from decision makers.
In AGW-land, however, the decision makers seem at least as committed to not seeing disappointments as those generating the reports.
As to your first answer: yes, very interesting indeed. And the current situation was foreseen over two years ago and discussed right here.
Sincere thanks,
etc.
Could someone explain the method of observationally constraining models? The description above suggests they take the model output, look at the actual results, and redraw the model output to better match.
I like Lucia’s discussion of the error as being split into structural uncertainty + what I would call “internal variability”. I’d call “weather” anything with a period less than 6 months and “climate” anything with periods longer than 6 months.
It’s important to recognize that climate variability should be split into a sum of internal oscillations + some type of red noise. Typically people discuss the ENSO, the Pacific Decadal Oscillation and the North Atlantic Oscillation.
Nick has been playing with ARIMA models and has been finding these are inadequate in reproducing the observed autocorrelation function (I had predicted this). Of course the ACF is related to the power spectrum by a Fourier transform (the Wiener–Khinchin theorem).
The approach I have been using for Monte Carlo’ing climate noise is spectrally based. The simplest version of this (sketched below) is to take the magnitude of the Fourier amplitudes from the power spectrum, then randomize the phases, and inverse transform.
This method is imperfect because it doesn’t allow for persistence, phase locking with external forcings, or interaction terms between the modes (e.g. partial mode entrainment).
A simple cheat that partly addresses the persistence is to subtract off the internal modes from the Fourier amplitude, assign these the corresponding amplitude, and use a phase model that includes a phase diffusion term (e.g., allow the phases to randomly walk). You’d probably need a coupled limit-cycle oscillator model that includes entrainment and interaction with the annual radiative forcing cycle to do much better than this.
What I’ve found is, for climate noise, most of the variability is explained by the internal modes, and the red noise component can simply be neglected. YMMV.
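For anyone who wants to play with it, here is a minimal R sketch of the simplest version (amplitudes kept, phases randomized, inverse transform). It deliberately omits the persistence and phase-diffusion refinements, and the input series is just a stand-in AR(1):

```r
# Phase-randomization surrogate: keep the Fourier amplitude spectrum of a
# series, scramble the phases (conjugate-symmetrically, so the result is real),
# and inverse-transform. Minimal sketch only.

phase_randomize <- function(x) {
  n   <- length(x)
  X   <- fft(x)
  amp <- Mod(X)
  ph  <- runif(n, 0, 2 * pi)
  k   <- 2:ceiling(n / 2)
  ph[n + 2 - k] <- -ph[k]                 # enforce conjugate symmetry
  ph[1] <- Arg(X)[1]                      # keep the DC (mean) component as-is
  if (n %% 2 == 0) ph[n / 2 + 1] <- Arg(X)[n / 2 + 1]  # and the Nyquist bin
  Re(fft(complex(modulus = amp, argument = ph), inverse = TRUE)) / n
}

set.seed(7)
x <- as.numeric(arima.sim(model = list(ar = 0.6), n = 512))  # stand-in "climate noise"
surrogate <- phase_randomize(x)
# 'surrogate' has (nearly) the same power spectrum as 'x' but independent phases.
```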
Regarding the comments by Robert Brown, people might remember I’ve made similar criticisms in the past about the errors in treating the “structural” error as if it could be described as part of a statistical ensemble. The issue with “structural” error is, well, it is errors in simulation of the underlying physics.
Similar to what SteveF was saying, I’ve argued all along that not all climate models were created equally. Some appear to be included in the ensemble for political reasons, and are really quite primitive both in terms of the complexity of the model and in terms of model verification.
In other words, in treating the models as ensembles, you would expect one model to perform better over a certain period, and worse over another. However the social dynamics relating to these models imposes a statistical stratification in the models: some models will consistently have larger variance relative to actual climate than other models.
In terms of analyzing the IPCC claims, it is reasonable to use their algorithms for testing the validity of the models, without change. Essentially you’re asking, “given the algorithm they use and the assumptions they make, how well are they doing in practice”? I’ve viewed Lucia’s approach as something along these lines.
But it’s also important to recognize the limits of the validity of the statistical methodology used by the IPCC (and groups in climate science in general). It’s my impression that Ben Santer is a particular advocate of this approach.
At the least, I’d apply a weighting based on how well the model reproduces the internal variability (amplitude and phase of ENSO) in assessing the uncertainty in the model projections of global mean temperature (e.g., use a weighted ensemble).
However, I’m not sure how you go from an ensemble, weighted or otherwise, to predicting what the actual physics would give, had you done it properly. To the extent that the errors are dominated by mesh size and time step, I suspect the ensemble method isn’t a terrible approach.
To the extent that errors are dominated by physics errors (e.g., cloud feedback, incorrect model of aerosols, etc.), I really don’t have any suggestion for how you’d combine the models.
What we would do in physics is grade each model separately, and not try and reduce our uncertainty by combining the models with each other. That would be regarded in most physics fields as a bizarre attempt to manipulate the true uncertainty to make things look better than they really are.
James Annan has discussed the use of empirical orthogonal decomposition to compare models to data. This seems like a better approach than single metrics, especially if you do it for an increasingly complex hierarchy of climate models. (What you’d like to see is a “convergence” of the lowest order modes, the modes that explain most of the variance in the system.)
There’s a nice review of this method here. Not surprisingly, the lead authors are Russians—who don’t have huge budgets traditionally for research, so they are willing to use their brains and pencils occasionally instead of throwing the entire US science budget at the problem.
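As an illustration of the mechanics only (not Annan’s actual analysis): an EOF decomposition is essentially a PCA of a time-by-space anomaly matrix. A minimal R sketch on a synthetic field, with invented patterns, looks like this:

```r
# Sketch of an EOF decomposition: leading spatial modes of a (time x space)
# anomaly field via PCA. The field below is synthetic; a real comparison would
# use gridded observations and model output on the same grid.

set.seed(3)
n_time <- 120; n_space <- 50
pattern1 <- sin(seq(0, pi, length.out = n_space))        # a broad "mode"
pattern2 <- sin(seq(0, 2 * pi, length.out = n_space))    # a dipole-like "mode"
field <- outer(rnorm(n_time, sd = 1.0), pattern1) +
         outer(rnorm(n_time, sd = 0.5), pattern2) +
         matrix(rnorm(n_time * n_space, sd = 0.2), n_time, n_space)

eof <- prcomp(field, center = TRUE, scale. = FALSE)
explained <- summary(eof)$importance["Proportion of Variance", 1:3]

eofs <- eof$rotation[, 1:2]   # leading spatial patterns (EOF1, EOF2)
pcs  <- eof$x[, 1:2]          # their time series (principal components)
explained
# A model-vs-data comparison would then ask whether the leading EOFs/PCs of the
# model field converge on those of the observed field (pattern correlation,
# explained variance), mode by mode.
```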
Carrick
The only way in which “structural errors” become part of an ensemble is if we think in terms of ‘set’ theory and the “set” the AOGCMs are drawn from is “all possible AOGCMs that mostly funded researchers would concoct given the current level of understanding of climate and current ability to run models on available computers”.
This is an odd set of things– but one could ask: Is ‘property X from this set biased relative to the property X for the earth?’
Pretty much yes. Because in the past their uncertainty intervals pretty much were “structural”. (In the AR5, they are a weird amalgam of structural and internal variability.)
Carrick,
“That would be regarded in most physics fields as a bizarre attempt to manipulate the true uncertainty to make things look better than they really are.”
.
Not just physics my friend, any field of rational understanding. 😉
.
I do have one comment about the “structural” versus “weather” uncertainties in models. Weather in climate models (indeed, variability at all time scales) is an emergent property, and so necessarily dependent on structural accuracy; structural inaccuracies (including parameterized sub-grid behaviors like cloud properties) ought to make ‘model weather’ over multiple runs have different spectral properties compared to the real Earth, and indeed, visual inspection of the trend histories from multiple runs of individual models shows that many models are wildly more variable than Earth (frequency and/or magnitude much too high). That kind of deviation, independent of inaccuracy in projected trend, is a more damning deficiency, because it can’t be so easily discounted by claiming uncertainty in forcing (the well-known aerosol offset kludge) or heat accumulation (the well-known heat hiding in the deep ocean kludge). I have not looked, but my SWAG is that there may be a correlation between high diagnosed climate sensitivity and variability which is much too high.
Lucia, to make it clear, my concern is with the uncertainty estimation associated with the treatment of the ensemble of models.
Uncertainty is an estimate of how large the error is between the model output and the “true” values. If none of the models include realistic physics of e.g. clouds, or the physics of aerosol particles in the models is wrong (“too simplistic”), looking at the spread of outputs associated with “structural uncertainty” tells you nothing about the probability that a given model is within a certain accuracy relative to the “truth”.
James Annan refers to the notion that the mean of the models represents something about the underlying reality as the “truth centered paradigm”.
[See the above link for a fuller description of his perspective on the flaws in this paradigm.]
If the only thing that affected model output were differences in how the finite element code was developed, you might have a glimmer of a hope that the mean of models would converge on the “truth”. IMO, what we have are some really bad models mixed in the soup along with some somewhat-less-bad models.
So the differences in accuracies between models doesn’t represent a random scatter around the truth, but rather a random scatter around a mean that has no useful interpretation.
That doesn’t imply that looking at the metric provided by AR4 (“mean of models”) isn’t useful, but it’s important to realize that this metric isn’t helpful in estimating how likely it is that a given model (or ensemble of models) is reproducing the “true” climate values.
Quick note here—real climate is a sum of deterministic plus stochastic components associated with internal variability (for sake of simplicity, let’s ignore nonlinear effects of the internal variability on the deterministic component).
If you reran the real Earth starting with 1850, each time you would get a different realization of the internal variability of the system superimposed on a deterministic component. So when we are talking about “true” values, we mean the average over ensemble of possible realizations, not the value of one particular realization, not even the particular realization that actually happened observationally.
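A toy version of that “ensemble of possible Earths” point, with an invented forced trend plus ARMA weather:

```r
# Toy "ensemble of possible Earths": one deterministic component plus
# independent realizations of internal variability. Purely illustrative.

set.seed(11)
years <- 1850:2010
deterministic <- 0.005 * (years - 1850)      # made-up slow forced warming (C)
one_realization <- function() {
  deterministic + as.numeric(arima.sim(model = list(ar = c(0.5, 0.2)),
                                       n = length(years), sd = 0.1))
}

ensemble <- replicate(200, one_realization())   # 200 "possible Earths"
ensemble_mean  <- rowMeans(ensemble)            # converges on 'deterministic'
observed_earth <- ensemble[, 1]                 # but we only ever get one column

max(abs(ensemble_mean - deterministic))    # small: weather averages out over the ensemble
max(abs(observed_earth - deterministic))   # not small: it never averages out in one realization
```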
I agree that the mean of models doesn’t give a useful metric for predicting the “true” climate value.
That’s the way I see it. I consider the variance due to “internal variability” to be the spread across those runs and so on. But for the earth, we will have only 1 realization.
In contrast, each model is an attempt to create a black box that replicates that. But that replication might be “true” or not. And if we put together a bunch of them, the average over models for any particular feature of interest might or might not match the earth.
The difficulty is– of course– that most discussions in climate science do estimate the likelihood of any outcome using “one model-one vote”. If the models are not “truth centered”, this method of creating an estimate from the ensemble will be biased. After all, the average will be wherever the center of the models is, and not at “the truth”. I don’t know about other people– but even if I don’t believe in the “truth centered” model, I think it’s useful to try to gauge if the truth is to one side or the other of the models.
lucia:
Yes, so additional assumptions have to be made in order to model internal variability. Inevitably you will be forced to do some form of “windowing” of the data in order to estimate the spectral properties of the internal variability, and this will limit the longest resolvable period to the observation period divided by the number of windows.
I think this is a politically based decision rather than based on issues of scientific validity.
I agree there is value in seeing whether the ensemble of models is biased relative to the data. The problem is interpretive, from my perspective… how do you evaluate the inevitable gap between your measured and model trends? In particular, when you are tossing models that are very bad in with models that are less bad, I’m not sure what the variance from the models is supposed to mean in this case.
SteveF:
I think this is a reasonable guess.
I did look at ENSO from the climate models (and the AR4 has a comparison of it too), and it does seem that the models with high sensitivities tend to have larger variability associated with ENSO.
real climate is a sum of deterministic plus stochastic components associated with internal variability (for sake of simplicity, let’s ignore nonlinear effects of the internal variability on the deterministic component).
So which is Gavin?
Carrick (Comment #120106)
October 9th, 2013 at 9:56 am
“Nick has been playing with ARIMA models and has been finding these are inadequate in reproducing the observed autocorrelation function (I had predicted this). Of course the ACF is related to the power spectrum by a Fourier transform (the Wiener–Khinchin theorem).”
I’ll have to see what Nick has done, but my ARMA modeling of the CMIP5 and observed temperature series gives a very decent fit. I have modeled the CMIP5 Historical model runs for those models with at least 6 multiple runs, of which there are nine: two with 10 runs and seven with 6. I have used the period 1964-2005. The CMIP5 runs fit an ARMA(1,1) model best by way of AIC scoring, and similarly the 3 observed series of HadCRU4, GISS and NCDC fit an ARMA(2,0) model best. The fits score well when the ARMA residuals are tested for independence with a Box.test in R with a lag of 20.
I have found that the variability of the trends from these CMIP5 model series best fits that variability found from simulations of an ARMA model of the series when using a loess filter with a span=0.40.
It turns out that the variability of trends with either ARMA(1,1) or ARMA(2,0) simulations are very nearly the same.
I do not know what Nick had in mind for doing his modeling, but I see an ARMA model with simulations as a means of estimating the weather/chaotic noise in the series and making comparisons of climate model to climate model and climate model to observed for both the weather and deterministic parts.
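For readers who want to reproduce the flavor of this, here is a minimal sketch of the fit-and-diagnose steps described above, run on a synthetic monthly series rather than the actual CMIP5 or observed data:

```r
# Sketch of the fit-and-diagnose procedure: detrend a monthly series, fit
# candidate ARMA models to the residuals, compare AIC, and test the ARMA
# residuals for independence with a Ljung-Box test. Synthetic data only.

set.seed(5)
n   <- 12 * 42                                  # monthly, "1964-2005"
mon <- seq_len(n)
x   <- 0.015 * mon / 12 +                       # invented ~0.15 C/decade trend
       as.numeric(arima.sim(model = list(ar = 0.45, ma = 0.3), n = n, sd = 0.12))

resid_lin <- residuals(lm(x ~ mon))             # residuals from a linear trend
# (a loess smooth, e.g. loess(x ~ mon, span = 0.40), could be used instead)

fit_arma11 <- arima(resid_lin, order = c(1, 0, 1), include.mean = FALSE)
fit_ar2    <- arima(resid_lin, order = c(2, 0, 0), include.mean = FALSE)
AIC(fit_arma11, fit_ar2)                        # lower AIC preferred

Box.test(residuals(fit_arma11), lag = 20, type = "Ljung-Box", fitdf = 2)
# a p-value well above 0.05 means no evidence the ARMA residuals are dependent
```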
Carrick, thanks for the link to Nick’s blog. I just quickly read his post on the acf and see that he goes much further out in looking for significant dependence than I did (I went to 20 months). He also makes a point about the non-linear character of a series going back further than 1980. That is where the use of a filter like loess can be useful, I think.
I have gone back further than 20 months in previous analysis and seen what Nick is referencing, but I did not recall the dependence being very significant. Now I’ll have to take another look.
Kenneth, I think the big issue is quasi-periodic internal variability. I’m willing to be wrong here of course.
I think 20-months is going to be too short of a period to really test any issues, ENSO can be approximated by coupled 2 and 4.5 year oscillations, so the smallest interval that I think would be useful for comparison is 10-years.
For comparison, I estimate that about 50% of the total variance in the global mean temperature series comes from the ENSO contribution (and 25% from sub-annual).
My detrending methodology has no problem with non-constant trends (I detrend on a per-window basis).
Carrick, I went back and looked at a Box.test with a lag=60 months for acf of the ARMA residuals for the 3 observed series and 62 runs of the 9 models I noted above. I did this for both ARMA models based on the residuals from a linear trend and a loess filter span=0.40 for the 3 observed series and using the loess filter for the modeled series. All series were monthly global mean temperatures for the period 1964-2005.
The observed series ARMA residuals showed independence for the residuals from a linear trend and the loess filter (p.values in the range of 0.5 and higher), and 52 of the 62 model runs showed independence using a Box.test p.value >0.05. Those that failed did not fail by much.
Alternatively I have modeled these series using an ARMA(X,0) model where I increased X until I reached a Box.test p.value =0.70. Some of these series require a rather high value of X to reach that p.value, but in the end I have a model. I think cycles in a time series can be handled with an ARMA model.
What I found interesting was that model runs for a given model could require very different values of X to obtain a p.value=0.70
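A sketch of that “increase X until the Box.test p-value reaches the target” loop, on a synthetic residual series; as noted above, a high X is over-fitting and the exercise is mainly useful for comparing series with one another:

```r
# Sketch of the "increase the AR order X until Box.test gives p >= 0.70" idea.
# Synthetic residuals; a high X is over-fitting, so this is for comparison of
# series rather than for practical modelling.

min_ar_order <- function(resid, p_target = 0.70, max_order = 20, lag = 60) {
  for (X in 1:max_order) {
    fit <- arima(resid, order = c(X, 0, 0), include.mean = FALSE)
    p   <- Box.test(residuals(fit), lag = lag, type = "Ljung-Box", fitdf = X)$p.value
    if (!is.na(p) && p >= p_target) return(list(order = X, p.value = p))
  }
  list(order = NA, p.value = NA)   # target never reached within max_order
}

set.seed(9)
resid_demo <- as.numeric(arima.sim(model = list(ar = c(0.5, 0.2)), n = 504))
min_ar_order(resid_demo)
```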
I did my lag=60 before reading your post. I’ll go back and look at 120 months – if the Box.test in R permits that many lags.
Carrick, I used the Box.test in R with lag=120 on the ARMA residuals for all the series noted above and in all cases the p.values were all higher indicating no or less dependence. Of the 62 model runs only 3 now had a p.value <0.05.
OT:
Nature is at it again.
Article: http://www.nature.com/nature/journal/v502/n7470/full/nature12540.html
Scary press release: http://www.thedailybeast.com/cheats/2013/10/10/study-record-heat-to-hit-in-30-years.html
Good thing I wasn’t drinking anything when I read that or I’d probably need a new keyboard.
Kenneth Fritsch, thanks for the comments. To be clear, it’s non-purely-cyclic behavior that I’m more concerned with, rather than purely cyclic behavior. Hence “quasi-oscillatory” behavior.
I don’t think there are any problems with ARMA models as interpolating functions (put enough parameters in and you’ll be able to fit to any continuous waveform “nearly everywhere”). Can you tell me how many parameters your ARMA model ended up with?
I’d also be interested in seeing the forecasts from your model and am wondering if you can provide this—
When I look at the forecast from ARIMA type models, what I see is they are brute force fitting the quasi-periodic behavior within the calibration period, but not preserving the quasi-periodic behavior outside of it.
This issue is important if what you want is a method that you can use to generate random instances of the same noise model:
What I do is develop a noise model for the observed period of data, then “run it forward” by steps of the same number of years to generate “independent realizations of the same underlying noise field”.
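A small sketch of that “fit a noise model, then run it forward” step; for brevity the noise model here is a fitted ARMA rather than the spectral method described earlier, and the input series is synthetic:

```r
# Sketch of "develop a noise model on the observed period, then run it forward"
# to generate independent realizations for the projection period. The noise
# model here is a fitted ARMA (for brevity), not the spectral method above.

set.seed(13)
obs_noise <- as.numeric(arima.sim(model = list(ar = 0.5, ma = 0.3), n = 360))  # stand-in

fit      <- arima(obs_noise, order = c(1, 0, 1), include.mean = FALSE)
phi      <- coef(fit)["ar1"]
theta    <- coef(fit)["ma1"]
innov_sd <- sqrt(fit$sigma2)

# Generate many independent "future" realizations of the same noise process,
# e.g. to put Monte Carlo uncertainty bands on a 15-year trend.
future_noise <- replicate(1000,
  as.numeric(arima.sim(model = list(ar = phi, ma = theta), n = 180, sd = innov_sd)))

trend_per_decade <- apply(future_noise, 2, function(y) {
  120 * coef(lm(y ~ seq_along(y)))[2]    # slope per month -> per decade
})
quantile(trend_per_decade, c(0.025, 0.975))  # trend spread due to "weather" alone
```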
Re: DeWitt Payne (Oct 10 14:17),
“At that point, the coldest temperatures on earth will be higher than the current hottest temperatures…”
… on Europa!
I’m sure that’s what they meant 😉
Carrick (Comment #120124)
October 10th, 2013 at 2:24 pm
When I ran ARMA (X,0) models of the CMIP5 model run temperature series by increasing X until the Box.test p.value =>0.70, the range of X was 2 to 19 – as I recall. That model with a high order ar was not often the highest scoring with AIC, but surprisingly sometimes it was. I would have to go back and look for exact details, but as I remember the range of X could be quite large for the multiple runs of the same model. Obviously using a large order for ar is over fitting the model and of no practical value other than showing (and pondering) the differences from within model runs and with other models.
Using an ARMA(1,1) model for the CMIP5 Historical temperature series for the period 1964-2005 provided a good fit for those models with at least 6 multiple runs. Using a loess smooth with a span=0.40, I could also obtain reasonable agreement between the trend variability from ARMA simulations and the observed variability of the multiple runs for a given model. I take your comments to suggest that the ARMA model be tested using separate calibration and validation periods. Of course, the test would be to determine how well the red/white noise in the model (or instrumental) series were handled. I need to think about the best way to run this test. The historical part of the CMIP5 models from 1880-2005 were run separately from the runs into the future (2005-2100) and later joined together. The multiple historical runs for a given model were varied by using different initial conditions.
I have mixed feelings about these exercises, as on the one hand I am saying, given the differences between climate models and observed temperature series, what is the best method of comparing them with some reasonably made assumptions, and, on the other hand, knowing that there are differences within a given model in multiple runs when applying an ARMA model and further that the observed series and climate models fit different ARMA models: ARMA(1,1) for most model runs and ARMA(2,0) for the observed series. There are also, for models with at least 6 multiple runs, large differences in the standard deviations of trends for the 1964-2005 period for some pairwise model comparisons. I think that a comparison of the observed temperature trends with climate model trends might not be ready for prime time until these differences can be resolved and probably with the elimination of a good portion of the climate models as candidates for comparison.
Obviously, of the two parts in the comparison of climate model series to observed temperature series, the important part is the deterministic trend. Unfortunately, without making assumptions, separating the deterministic trend from the white/red/cyclical noise is problematic. Models like ARMA can be applied to estimate the noise as residuals of a trend determined by either linear regression, segmented linear regression, loess or some other filter. Without assumptions fixing either the noise or the trend, the modeling proposition remains circular. If we do make assumptions and obtain the separation of the deterministic trend and the noise, can we confidently compare trends from the observed and climate models knowing that the noise parts are different? If we could show that the deterministic trend and the noise are independent, I would suppose we could more confidently compare model to observed trends.
I have been reading the linked paper below, which presents a good review – in my estimation – of modeling observed temperature series with techniques borrowed from econometrics. There is probably too much space devoted to unit roots for the taste of many reading here. I could find only one sentence noting the physical limitations of differencing a temperature series.
There is a good discussion of improving the modeling of temperature series by handling the breaks and non-linearity of the series, and how this avoids the use of differencing in producing stationary series. I was not aware of the problem that investigators have had with detection and attribution, whereby temperature series were considered trend stationary and the forcings series, without accounting for the breaks and non-linearities in the series, could be made stationary only by differencing.
http://people.bu.edu/perron/papers/Estrada_Perron_revised.pdf
Re: Kenneth Fritsch (Oct 12 07:33),
Atmospheric CO2 forcing doesn’t need to be differenced to make it stationary because it’s totally deterministic. The only randomness is in the measurement process. Modeling CO2 as if it were any sort of ARIMA process with or without a linear trend is stupid. It’s not like it’s a stock price. Econometricians who meddle in physics should learn some physics first. Right now what they’re doing is mostly mathturbation.
DeWitt Payne (Comment #120150)
October 12th, 2013 at 9:28 am
I was thinking that my posts here on unit roots and mass extinctions would surely bring DeWitt forward.
I agree that the physical limitations of these models are evidently lost on some of these econometricians. The authors do make a one-sentence reference to that limitation of using an I(1) model for temperature.
What I found of interest was that evidently some mathturbators in previous times were using differencing in attempts to get around the trend breaks in the temperature series. I think they understood a single linear trend would be trend stationary and not require differencing. It was the non-linear nature of the trends that led to differencing.
Actually in my view econometricians should get what they do right first. Some observers, like me who are influenced by the Austrians, see these economists making some rather precarious assumptions in order to simplify the math and stats they apply to economic processes.
Coming from the “one picture is worth a million words” department, I thought it might be a useful exercise in the visual interpretation of graphical information to combine IPCC/AR5 Figure 1.4 with the Hadley Center’s graph of Central England Temperature (CET), 1772-2013, placing both graphics onto one common page.
.
This exercise is yet another phase in my ongoing efforts to expand my “CET is Anything and Everything” climate science paradigm into uncharted visual communication territory.
.
A major characteristic of the CET-is-Anything-and-Everything paradigm is the assumption that pre-2007 rates of temperature change in the CET historical record can be used as rough predictive indicators for post-2007 GMST rates of change — at least to the extent of stating that similar rates of change have been experienced within the past 240 years which cover similar (or longer) timeframes as does the AR5 2013-2035 predictive timeframe of twenty-five to thirty years.
.
Here it is: AR5 Figure 1.4 and CET 1772-2013
.
The illustration has two major graphical elements:
.
-> The first major graphical element, located in the upper-left quarter section of the illustration, displays an adaptation of IPCC AR5 Figure 1.4 which highlights the boundaries of the “AR5 Expanded Modeling Envelope”; i.e. that section of the original Figure 1.4 which illustrates the observation validation zone between the year 2001 and the year 2035 of past IPCC model runs. Overlain on the Figure 1.4 adaptation is a series of seven temperature rate-of-change trend lines spaced in 0.1 degree increments, each of which begins in the year 2007, and each of which also has a historical precedent in the Central England Temperature record.
.
-> The second major graphical element, which is shaded in light gray and which covers approximately three-quarters of the illustration, documents the method which was used to visually fit the approximate slopes of the seven CET temperature trends occurring between 1772 and 1975 which are being used as the historical CET precedents. A third graphic illustrating Global Mean Temperature between 1850 and 2008 is also included for visual reference and comparison. The original source graphics for CET and for GMT are from the Hadley Center.
.
Let’s remark here that the Central England Temperature record is the only instrumental record we have that goes back as far as it does; and that its recent temperature trends are approximately reflective of recent global temperature trends.
.
Concerning the derivation of my own graphical adaptations of the IPCC and Hadley Center source graphics, the process by which the slopes of historical CET trend lines were determined is readily evident from direct examination of the illustration, without any further explanation other than to clarify that all fitting of trend slopes was done by visually placing each linearized trend line onto the HadCET plot wherever it was appropriate in the CET record for the particular decadal rate of change being fitted: -0.1, -0.03, +0.03, +0.1, +0.2, +0.3, or +0.4.
.
Several points become immediately evident from a casual look at this one-page graphical illustration:
.
(1) GMST could fall at a rate of -0.03 C per decade between 2007 and 2021 and still remain inside the AR5 model validation envelope.
.
(2) GMST could stay flat between 2007 and 2028 — i.e., have a trend of 0 C per decade for a period of 21 years — and still remain inside the AR5 model validation envelope.
.
(3) A small upward trend of +0.03 C per decade is the approximate rate of change in CET for the period of 1772 through 2007, a period of 235 years. GMST could rise with that same small upward trend of +0.03 C per decade for another 28 years beyond 2007 and still remain inside the AR5 model validation envelope.
.
(4) For the timeframe covering the period between 2007 and 2035, GMST could experience a rising temperature trend of anywhere from +0.03 per decade on up to +0.4 C per decade, while still remaining within the scope of past historical precedents documented in the Central England Temperature record for similar periods of time.
.
(5) Rates of CET temperature change which covered time periods of at least twenty-five years, and which ranged from a low of -0.1 C per decade on up to a high of +0.4 C per decade, occurred at pre-industrial levels of CO2.
.
What does it all mean?
.
It means we have seen it all before, and we will probably see it all again; i.e., there is nothing new under the sun.