In discussion in comments, I showed Lazar a plot illustrating the results of tests of the hypothesis that the multi-model mean trend agrees with the trend underlying the single realization of earth weather. The tests indicate the multi-model mean trend is probably not equal to that for the earth: i.e. it appears the mean trend for the models is currently “off”, and the results are statistically significant for choices of start year over a broad range going back to 1960.
Lazar then requested that I vary the end points. I asked
“Why would I vary endpoints?”
My reasoning for using the most recent data as end points for trend analysis is based on these notions:
a) it is traditional to use all available data, throwing away only data suspected of being erroneous, and
b) if the case in favor of my argument were based on trends ending in 2007, ignoring available data from 2008, those who disagree with my result would insist I use 2008. In this case, there would be no question: the analysis that counts is the one using current data, not the one based on an arbitrary decision to ignore recent contradictory data. (I would agree with them.)
c) if El Nino zooms in, and the trends ending in 2009 make the models look ‘not so bad’, admirers of models will likely show those graphs and not waste paper or time displaying graphs of trends ending in 2008. Failing to show graphs omitting data arriving after 2008 will not be considered misleading; those graphs will be deemed unimportant.
However, because I asked him why I should vary the end points, Lazar answered:
Gavin showed the results are highly dependent on endpoint, and without wishing to get into too philosophical an argument, I don’t think it can be convincingly argued that the selection of endpoint is random. 2008 was a cool year relative to the linear trend from 1970. Those trends are all on the upper 95% boundary going right back to 1960 which does not agree with the plot in this post where the observed trends sit squarely in the center of the model distribution… I would argue that the plot in this post is a better indicator of model performance.
Let’s not get into precisely what Gavin said or what he showed. Suffice it to say that Lazar’s use of the phrase “the observed trends sit squarely in the center of the model distribution” cannot be said to contradict the phrase “the observed trends appear to lie below the multi-model mean for nearly all start years, but aren’t hugging the 95% uncertainty interval.”
In any case, 2008 happened.
There is no reason to believe mis-measurement on the part of GISS or Hadley caused the observed temperature anomaly in 2008 to be erroneously low (or at least it’s no worse than any other measurement.) That means the suggestion to create and contemplate graphs based on dropping a perfectly good data point is not motivated by believing that data is bad.
Lazar suggests there is a problem because the decision to end in 2008 is not “random”. It’s not random.
The choice of ending with the most recently available data is dictated by convention. When people test models projecting into the future, it makes sense to use all relevant available data, so we use recent data rather than ignoring it. If, when 2009 comes along, I insist on performing trend analysis using data ending in 2008 because I prefer that result to the one using data from 2009, that would be cherry picking.
Is the fact that ‘some’ might do more analyses when the weather turns cold a big problem? Not really. First, tests are being done constantly by both ‘some’ and ‘others’. It’s true that when excursions happen, we will notice them. It is also conventional to accept that from time to time, statistical tests will return false rejections: if a test has a confidence level of 95%, we’ll see true hypotheses rejected in 5% of cases. “Some” will report them. On the other hand, when excursions aren’t happening, tests are still run, and failing to reject is reported by ‘others’, often quite loudly. There also exist ‘unique individuals’ who would report both rejections and failures to reject regardless of their preferences.
So, I see no big problem with showing or describing results ending in 2008, and no need to turn myself into a pretzel re-casting my graphs to show what they would have shown had I ignored data since Jan 2008 and ended in 2007.
But what if you think Lazar has a point?
The reason I am often willing to examine what happens if we indulge someone suggesting we make an analytical choice I consider poor is this: Not everyone can quickly grasp why such choices are poor. Gavin threw this stinking fish on the floor. Some blog readers have smelled it and are now concerned. After all, 2007 was an El Nino; 2008 was a La Nina.
Before we look at results of trend analysis ending in 2008, it might be useful to ask: If we look at both results including 2008 and excluding 2008, what criteria will we use to reject the models?
Here are candidates:
- Gut feel. To reject the hypothesis that models are on track, the results ending in 2008 must reject at 95%, and based on ordinary eyeball inspection of the magnitude of d* for results ending in 2007, your intuition must tell you that the results ending in 2007 “don’t look different” from 2008. (We don’t need no frigging numbers!)
- To reject the hypothesis that models are on track, the results ending in 2008 must reject at 95% confidence, those ending in 2007 must reject at 95%, and both rejections must occur in the same direction; i.e. both say the models are too high or both say the models are too low. (That must be the 95% confidence level, right? 😉 )
- To reject the hypothesis that models are on track, the results ending in 2008 must reject at 95% confidence and the trends ending in 2007 must fall on the same side of the multi-model mean. (That is, if the hypothesis rejects because the model mean is too high in 2008, then the model mean must also be too high in 2007.) (Note that, to a large extent, Gavin’s graphs mostly showed trends ending in 2007 still below the model projections, but not as low as the ones from 2008. So, this is relevant to Lazar’s concern that the results are sensitive to the end point.)
In all cases above, notice the 95% confidence level is mentioned.
This might cause naive people to believe the tests correspond to the 95% confidence level. That is obviously a mistake. In all cases, what the analyst has done is raise the confidence level above 95% without stating the actual confidence level used.
What are the type I error rates for the three cases?
Let’s do some tests to estimate the effect of superimposing the additional criteria based on 2007 on the probability of type I error associated with the three cases above.
The first case: No one can compute confidence intervals for this nebulous method because no one can guess in advance what your gut is going to tell you when you compare graphs of trends ending in 2008 to those ending in 2007. One person might not consider two results “contradictory” unless a 2007-ending trend rejects low at 95% confidence while a 2008-ending trend rejects high at 95% confidence. You might consider the results contradictory unless both reject at 95%. If we don’t describe our criteria in advance, and have no history of applying this two-year criterion, who can guess what we mean?
So, I’ll skip that.
I can compute the type I error rates for the second and third cases and determine the confidence level of the new tests, provided I specify the type and level of “noise” in the process. I’ll do this for a special case comparing 96 month trends and 84 month trends. (These numbers of months correspond to 2001-2008 and 2001-2007 for comparisons of AR1 models.) Doing so for real data with real distributions would be a pain in the tuckus (and encourage debate over the true structure of the noise). So, using synthetic data, I’ll give you an idea what happens if we assume a) the underlying trends are linear, and b) the residuals are gaussian white noise. (We know this noise structure does not describe earth weather.)
To do this, I used the same 40,000 cases of 96 month “runs” of monthly data I generated last night. It happens that when I generated that data, I also analyzed trends based on the first 84 months. (Like Prometheus, I must have foresight!)
I already know the cut-off multiple for the 95% confidence intervals for 94 degrees of freedom is 1.986. Recall that I obtained 4.82% type I rejections. (That is, false rejections of a correct hypothesis; if I ran many more cases, I should get 5%.)
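Here is a minimal sketch of the kind of Monte Carlo involved. This is an illustrative reconstruction, not the script I actually ran, and the white-noise level is an arbitrary choice; it just generates trendless white-noise “monthly anomalies”, fits a least squares trend to each 96 month run, and counts how often the t-statistic clears the 95% cutoff:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

N_CASES = 40_000   # number of synthetic "runs", as in the post
N_96 = 96          # months: 2001-2008
SIGMA = 0.1        # arbitrary white-noise level (deg C); illustrative only

t96 = np.arange(N_96)

# 95% two-sided cutoff for a trend fit with N - 2 = 94 degrees of freedom
cutoff_96 = stats.t.ppf(0.975, N_96 - 2)   # ~1.986

def fit_trend(y, t):
    """Return the OLS slope, its standard error, and the t-statistic for slope = 0."""
    tbar = t.mean()
    Stt = ((t - tbar) ** 2).sum()
    slope = ((t - tbar) * (y - y.mean())).sum() / Stt
    resid = y - (y.mean() + slope * (t - tbar))
    se = np.sqrt((resid ** 2).sum() / (len(t) - 2) / Stt)
    return slope, se, slope / se

rejects_96 = 0
for _ in range(N_CASES):
    y = rng.normal(0.0, SIGMA, N_96)       # true trend is zero, so any rejection is false
    _, _, tstat = fit_trend(y, t96)
    if abs(tstat) > cutoff_96:
        rejects_96 += 1

print("96-month type I rate: %.2f%%" % (100 * rejects_96 / N_CASES))
# Expect roughly 5%; the post reports 4.82% for its particular 40,000 draws.
```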
The 95% rejection/ 95% rejection method
I examined the number of type I rejections if I insisted that both the 96 month trend and the 84 month trend rejected at 95%. (Note that using this criterion, we would set aside Michaels’ conclusion based on graphs of trends ending in 2008 because of the appearance of graphs of trends ending in 2007.)
Using the cut-off multiple for the 95% confidence intervals for 84 month trends, I obtained 4.91% type I rejections based on the 84 month trends. How often would I reject the model mean if I insisted I must reject in both years and both rejections must be in the same direction? 2.30% of the time.
So, by insisting on this criterion, I’ve taken what appears to be a test with a confidence level of 95% and raised it to roughly a 97.7% confidence level.
The 95% rejection with previous year “on same side”
I examined the number of type I rejections if I insisted that the 96 month trend reject at 95% and that the 84 month trend be on the “same side” of the multi-model mean as the 96 month trend; i.e. if the 96 month trend rejects the model mean on the high side at 95%, the 84 month model mean trend must be at least a tiny bit high compared to the observations.
Now, if I count the number of rejections at 95% and then simply check whether the model mean is on the same side of the observations in both years, I get 4.81% rejections, down from 4.82%. So, for this level and type of “noise” with 96 months worth of data, the extra criterion made almost no difference; sometimes the extra criterion screens out a case, but it’s rare. (The extra requirement would make a difference if we had less data, or noisier data.)
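A similar sketch (again an illustrative reconstruction with an arbitrary noise level) can count the two combined criteria discussed above: “reject in both years, in the same direction” and “reject on the full record, with the embedded shorter trend on the same side”:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N_CASES, N_96, N_84, SIGMA = 40_000, 96, 84, 0.1   # sigma is an arbitrary illustrative choice

def trend_stats(y):
    """OLS slope and its t-statistic for the hypothesis slope = 0."""
    t = np.arange(len(y))
    tbar = t.mean()
    Stt = ((t - tbar) ** 2).sum()
    slope = ((t - tbar) * (y - y.mean())).sum() / Stt
    resid = y - (y.mean() + slope * (t - tbar))
    se = np.sqrt((resid ** 2).sum() / (len(y) - 2) / Stt)
    return slope, slope / se

cut96 = stats.t.ppf(0.975, N_96 - 2)
cut84 = stats.t.ppf(0.975, N_84 - 2)

both_reject_same_dir = 0   # criterion 2: reject in both "years", same direction
reject_and_same_side = 0   # criterion 3: reject on full record, embedded trend on same side

for _ in range(N_CASES):
    y = rng.normal(0.0, SIGMA, N_96)    # the null is true: the underlying trend is zero
    s96, t96 = trend_stats(y)
    s84, t84 = trend_stats(y[:N_84])    # same run with the last 12 months dropped
    if abs(t96) > cut96:
        if abs(t84) > cut84 and np.sign(s96) == np.sign(s84):
            both_reject_same_dir += 1
        if np.sign(s96) == np.sign(s84):
            reject_and_same_side += 1

print("both reject, same direction: %.2f%%" % (100 * both_reject_same_dir / N_CASES))
print("reject + same side:          %.2f%%" % (100 * reject_and_same_side / N_CASES))
# The post reports roughly 2.3% and 4.8% respectively for its draws.
```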
Generalization
As I did yesterday, today I illustrated a general principle by way of example. The principle is: If we add an extra criterion to the standard t-test, we are jacking up the confidence level by some unknown amount.
Using the noise type and level in my example, tested at 96 months vs 84 months: if we insist on getting 95% rejections in two consecutive years, the confidence level of the new test rises from 95% to 97.7%. If we use the 95% confidence criterion for the second year, but only require no sign change in the comparison from the previous year, the 95% level barely budges. These numerical values for both methods change depending on a) the noise type and level, b) the number of months in the sample and c) the number of months the analyst decides to drop. The general principle will always be: If we add a criterion above and beyond simply rejecting at 95% based on this year’s data, we will be lowering the rate of incorrect rejections, and raising the confidence level associated with those rejections we do detect.
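The arithmetic behind the principle is just this: whatever the extra criterion is, requiring it on top of the 95% rejection can only shrink the set of cases that reject. That is,

P( reject at 95% and extra criterion satisfied ) <= P( reject at 95% ) = 5%,

so the rate of type I error can only fall, and the effective confidence level can only rise above 95%.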
Is there anything wrong with using a test with a confidence level of 97.7% instead of 95%?
Not necessarily. However, the most straightforward way to do this is to simply admit that you want a higher confidence level.
There would be something not quite right with proposing that we begin by using the standard one year method, and then switch to the two year method only when we decide we don’t like the results of the customary analysis including all current data. There is something even worse in providing extremely vague descriptions of the actual criteria for the two year test, selecting the criteria only after running the test and verifying you get the results you prefer, and then dispensing with these new criteria should the data become more favorable to your case. (This would be called “special pleading”.)
Now, I suspect some readers are wondering whether the elevation of the 95% confidence level to 97.7% represents the upper bound of the possible effect when deciding to do a test based on trends ending in both 2007 and 2008. Nope.
I used arbitrary levels of white noise to show the effect of dropping one year of data on the confidence level of a test one might mistake for one with a “95% confidence level”. I could get all sorts of different numerical results using different arbitrary data. The one consistent finding would be this: Adding another criterion reduces the rate of type I error below the rate the analyst appears to claim.
If “someone” proposes a new, idiosyncratic “ending in various different years” method of testing trends, “they” should figure out what their criteria really are, describe them, and then do the computations to figure out the honest confidence level associated with their test. Otherwise, for all the audience reading an analysis knows, the confidence level has been jacked up to 99.99999999%, making it virtually impossible to declare a model incorrect even in the face of overwhelming evidence.
For now, I can estimate the effect of adding the extra year on a simple t-test applied using 96 observations of earth data. If I assume the monthly weather noise is AR(1), driven by an innovation of 0.1C and having a lag 1 autocorrelation of 0.5 (which is roughly consistent with observed data since 2001), I find the rates of type I error using the original Nychka method applied at a nominal 95% are 6.4% and 6.5% for the 96 month trend and the embedded 84 month trend respectively. The combined method requiring serial rejections exhibited 3.2% type I errors, which corresponds to a 96.8% confidence level.
If I applied the Nychka correction discussed in Lee & Lund 2004, the method exhibited type I rejections at rates of 4.3% and 4.3% when applied to each case individually, and 2% for the “combined” method requiring rejections in two years in a row. So, the combined method had a confidence level of 98% rather than the rate of 95% one might have suspected based on the use of 95% confidence levels in the description of the method.
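For readers who want to see the mechanics of the red-noise case, here is a minimal sketch. It is an illustrative reconstruction, not my original script: it generates AR(1) “weather noise” with a 0.1C innovation and lag 1 autocorrelation of 0.5 (the values quoted above), applies the naive OLS t-test, and also applies one common effective-sample-size adjustment of the form n(1-r)/(1+r). I’m not claiming this reproduces the exact Nychka or Lee & Lund corrections behind the numbers above; it just shows why uncorrected tests over-reject and why the corrections matter:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N_CASES, N, PHI, SIG_INNOV = 40_000, 96, 0.5, 0.1   # AR(1) parameters quoted in the post

def ar1_series(n):
    """AR(1) noise: x[i] = PHI * x[i-1] + innovation, started in its stationary state."""
    x = np.empty(n)
    x[0] = rng.normal(0.0, SIG_INNOV / np.sqrt(1 - PHI ** 2))
    for i in range(1, n):
        x[i] = PHI * x[i - 1] + rng.normal(0.0, SIG_INNOV)
    return x

def trend_test(y, adjust=False):
    """OLS slope t-test at 95%; optionally shrink the sample size for AR(1) autocorrelation."""
    n = len(y)
    t = np.arange(n)
    tbar = t.mean()
    Stt = ((t - tbar) ** 2).sum()
    slope = ((t - tbar) * (y - y.mean())).sum() / Stt
    resid = y - (y.mean() + slope * (t - tbar))
    if adjust:
        r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]    # lag-1 autocorrelation of residuals
        n_eff = max(n * (1 - r1) / (1 + r1), 3)          # assumed effective-sample-size form
    else:
        n_eff = n
    se = np.sqrt((resid ** 2).sum() / (n_eff - 2) / Stt)
    return abs(slope / se) > stats.t.ppf(0.975, n_eff - 2)

naive = adjusted = 0
for _ in range(N_CASES):
    y = ar1_series(N)                                    # true trend is zero
    naive += trend_test(y, adjust=False)
    adjusted += trend_test(y, adjust=True)

print("naive OLS type I rate: %.1f%%" % (100 * naive / N_CASES))     # well above 5%
print("adjusted type I rate:  %.1f%%" % (100 * adjusted / N_CASES))  # much closer to 5%, usually still a bit high
```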
Using more realistic “weather noise”, the level of type I error decreases, meaning rejections correspond to a noticeably higher confidence level. For those concerned about type II error (i.e. failing to detect an incorrect hypothesis): when you increase the confidence level, you nearly always increase the type II error. When people concoct novel methods, they also usually increase type II error compared to standard methods applied with matching type I error rates. (Though it is possible someone could happen to concoct a better method.)
We know that adding the ‘extra’ criterion of considering the results of analysis dropping the most recent data sneakily lowers the type I error rate. It is a way of misleading yourself (or others) into believing you have applied a 95% confidence level for those rejections you report, when, in reality, the confidence level may be much higher. The effect on type II error is not known, but it’s quite likely that any novel proposed method will result in significantly larger type II errors. (That is: failing to reject a hypothesis about the multi-model mean when it is wrong.)
I’d be willing to show how dropping 2007 affects the appearance of graphs. But before I do, I’d like those advocating this notion to explain their criteria for deciding whether we will “reject” or “fail to reject” the models based on the additional information in such graphs. That way, when we discuss specific points, we can run the numbers and see whether the results dropping 2007 make any difference to the reject/fail to reject decision. We can also compare the type II error of the ‘slick’ new method to that of the standard methods everyone learns in school.
References
Paper by Santer (pdf).
Lucia:
There’s the general question: “With what confidence should any values near an upper or lower confidence boundary (e.g., near 5% or 95%) be taken as definitive in trying to decide an issue when the original ‘agreement’ was to examine data against a criterion of a 95% confidence interval?”
This is not meant to denigrate your efforts to teach us the statistical nuances, such as those connected with selecting monthly versus annual averaging to estimate trends.
But the bottom line is that both methods give ‘nearly’ the same result. And the small differences between them should not be used as a basis for abruptly deciding to accept OR reject the use of the current set of GCMs in attempts to unravel what might be happening to our planet – as I’m sure you agree.
Len–
Saying the multi-model mean appear to over-estimate what is happening is not the same as saying we should “reject the use of the current set of GCM’s in attempts to unravel what might be happening on our planet.”
It’s just saying the multi-model mean appears biased high, and explaining the basis for why it appears to be high. Projections can be wrong/off etc., yet, in a broader sense, AOGCM’s still have uses.
So, I think we agree.
That said, some of these posts are responses to people who are arguing the model projections are somehow on track, and that we can’t tell they are off track. They often do this without quantifying anything, and their notions of newer and better ways to consider the data don’t make much sense if you think about them a tiny bit. (That’s why these newer and better ways are not standard.)
The suggestion to use 2007 as the end point (or ignore 2008) would be valid if 2008 was a cool year due to an isolated manmade or natural incident such as a major volcanic eruption. But it wasn’t. Any cooling is purely due to natural cycles, whether it’s solar activity, oceanic, or some other natural influence that has yet to be discovered.
If a major El Nino were to develop this year, causing global temperatures to spike, and you chose not to use 2009 as an end point, your critics would lambast you for it!
Bill
Of course they would– and they would be correct to do so. That’s why I wouldn’t ignore 2009 data if El Nino came along.
Yes. If the low from 2008 was due to a volcanic eruption or some man-made incident, that would be an argument for ignoring or downplaying results that hinge on including 2008.
Lucia: OT, but what’s happening to the NH ice bets this year? LOL
Interesting points. I had thought of the relation to multiple comparisons (esp. criteria for “corrections” for them), but hadn’t connected it with raising the bar for rejection in a “hidden” fashion.
But I read Lazar’s main concern differently (and perhaps this is o.t. for this post). I thought he was worrying about the endpoint because of the short time period: small sample => any individual score can carry extra weight in analysis. That is, enough weight to swing a test when the sample is very small, but not when larger.
Well, frankly, why do we care what the trends were anytime other than right now?
Ian–
Lazar tried many tacks for suggesting that the one point matters.
One extra point is enough to swing the sample trend a lot. But when there are very few points, the range of trends within the 95% uncertainty interval is also very large. This is because for N evenly spaced data, the standard error for the trend varies as sqrt[1/{(N-2)N}].
So, the end point doesn’t increase the probability of a false rejection more during short periods than during long ones. (If the method is correct, the probability of false rejection at the 95% confidence level is 5%, and stays there for all lengths of time in the sample.)
What does happen with small amounts of data is that the probability of type II error is high. That is: Because the uncertainty range is so wide, we are very likely to fail to reject hypotheses that are wrong. Yes, the observed trends are only a little below the models if we stop the analysis at the top of the El Nino in 2007; the uncertainty intervals are large, and so we can’t say the models are off using a t-test based on data ending in the 2007 El Nino.
(We are, by the way, going to see this argument persist, because El Nino will arrive, and if the models really are off (as they appear to be) we are probably going to see “fail to rejects”, and then a return to “rejects”. That’s what happens if you watch data trickle in and test constantly. The oscillations will continue until we have enough data to make the probability of type II error low.)
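To see the type I / type II trade-off concretely, here is a toy calculation. The noise level and the assumed discrepancy between the model-mean trend and the true trend are both arbitrary, hypothetical choices; the point is only that the false rejection rate stays near 5% at every window length, while the chance of detecting a real discrepancy grows as the window lengthens:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
SIGMA = 0.1            # arbitrary white-noise level, deg C
BIAS = 0.2 / 120.0     # hypothetical model-mean error: 0.2 C/decade, expressed per month
N_CASES = 20_000

def reject_rate(n_months, true_trend):
    """Fraction of runs where an OLS t-test rejects 'trend = 0' at 95%."""
    t = np.arange(n_months)
    tbar = t.mean()
    Stt = ((t - tbar) ** 2).sum()
    cutoff = stats.t.ppf(0.975, n_months - 2)
    hits = 0
    for _ in range(N_CASES):
        y = true_trend * t + rng.normal(0.0, SIGMA, n_months)
        slope = ((t - tbar) * (y - y.mean())).sum() / Stt
        resid = y - (y.mean() + slope * (t - tbar))
        se = np.sqrt((resid ** 2).sum() / (n_months - 2) / Stt)
        hits += abs(slope / se) > cutoff
    return 100.0 * hits / N_CASES

for n in (36, 60, 96):
    print("N=%3d months: type I ~%5.1f%%   detection of the assumed bias ~%5.1f%%"
          % (n, reject_rate(n, 0.0), reject_rate(n, BIAS)))
# The type I rate hovers near 5% for every window length; the detection rate
# (one minus the type II error) climbs as the window gets longer.
```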
But in any case, Lazar’s end point issue could not have been related to the short time issue. The ‘end point’ issue was raised by Gavin (and became a talking point) when people began to show that even long trends ending ‘now’ are either rejecting or skirting rejection regardless of the start point. Michaels’ trends were as long as 15 years. The trends I show go back to the 60s!
So, when the “too short time” talking point vanished, the end point one popped up.
A thorough piece of analysis IMO, and I cannot see why anyone wants to consider “let’s stop in 2007”, except perhaps those who think we should stop in 1998/9!! However, I think the white noise hypothesis might be false. I think the Mark 1 eyeball suggests that the temperature anomaly has a “same as last month” tendency or trend. I haven’t done the analysis and I am not sure how much it would change your conclusions. It would, I think, change the probability for the “95% in 2008 and the same direction in 2007” case. Any thoughts?
Colin– Weather noise is not white. There is no doubt about this.
I repeated the test with red noise, with properties similar to GISS since 2001. Oddly, I get roughly the same results. The auto-correlation matters, but on the other hand when the innovations are equal, the red process is “noisier”, so the two balance out.
To answer the question, Lucia, I think the ‘So what?’ is that if in one year we have a trend not falsified, in the next year we have it falsified, and if in the year following we have the potential of it being unfalsified again, then we have to understand that ‘falsified’ isn’t meaning what quite a lot of folks might think it means.
By the same token, the next year a trend could be falsified again. I guess “unfalsified” doesn’t really mean anything at all, does it?
Andrew
I guess “unfalsified” doesn’t really mean anything at all, does it?
Well, it shouldn’t do, if ‘falsified’ meant anything in the first place. That’s rather my point.
You used the word “unfalsified” my good man. It must mean something to you. What does it mean? That something could still be true? So What? 😉
Andrew
Simon and Andrew_KY:
If the models move back into “fail to reject at 95%”, we will still know they did reject. However, we also know that should happen in 5% of cases even if the models are right. So…. we’ll be in a period of ambiguity.
If the models are way off, they’ll reject again and soon. If they are a little off, they’ll reject again, and it will take a while. (If they are only a little off, we probably will see them ‘unreject’ before they move back to reject.)
Given temperatures so far in 2009, it seems very likely the model-mean will still be rejecting over the period since 2001 at the end of 2009. But that’s speculative. Temperatures could rise even more than they dropped in 2008. If that happens, I’ll just report that it happened. (It’s not like someone else wouldn’t anyway!)
Personally, I would prefer p values to the reject / fail-to-reject dichotomy. After all, the cutoffs are kind of arbitrary.
Andrew_FL– Yes. Cutoffs are somewhat arbitrary. For much scientific work 95% is customary.
Some work done for regulatory agencies requires proof at higher confidence levels.
The ensemble average and the output from each model are plotted here (http://www.drroyspencer.com/). If I were a modeler, after the ensemble average fails the Santer17 test, I’d just throw out the worst models and form a new ensemble average that passes the test.
Lucia, I don’t understand this;
‘If the low from 2008 was due to a volcanic eruption or some man-made incident, that would be an argument for ignoring or downplaying results that hinge on including 2008.’
Since volcanoes and men/women are part of the climate system, why would a year with volcanic or human activity be any more or less statistically important? Thanks
Gary–
The forcings in the SRES account for human impact through modifying the level of GHG’s, but not volcanoes. So, if a volcano erupts, the forcing deviates from those used to drive the models. By man-made incident, I assumed the previous poster meant dramatic events like nuclear winter or something like that.
It’s not the statistical issue so much as something happening well outside the scenarios for forcing.
As the multi-model mean is an estimate of only the forced component of the trend, whereas the observed trend is the result of both forced and unforced components, an estimate of the size of the unforced component is required to know how similar we should expect the observed trends and multi-model mean to be.
If the results are sensitive to start and end dates, surely this just suggests that the unforced component (which varies in time) dominates the forced component (which is essentially constant) over the time scales in question?
Basically, what I am trying to say is that the choice of start and end dates is not important if you have a good estimate of the plausible magnitude of the unforced component of the change, which depends more on the length of the window than the particular start and end dates (the shorter the window, the larger the variance of the distribution of plausible values for the unforced component).
Gavin, don’t you think the “unforced components” (I imagine you mean natural variability caused by the ENSO and other natural climate drivers) should then be incorporated in some manner in the models.
If they are capable of affecting the trends, then they are also capable of masking or distorting our understanding of the actual empirical impact of the forced components.
Bill said: “Gavin, don’t you think the “unforced components” (I imagine you mean natural variability caused by the ENSO and other natural climate drivers) should then be incorporated in some manner in the models.”
The models are capable of modelling ENSO (and do so), but not predicting it, which is an important distinction. Each model run will give a slightly different unforced component of the trend, as they will each give different realisations of things like ENSO. By model averaging, the unforced components of the trend tend to cancel each other out and you are left with only the forced component. The spread (variance) of the individual runs, though, is an estimate of the plausible spread of the unforced component.
As I understand it, for hindcasting you can include the observed ENSO as a forcing in the models (which ought to make the multi-model mean a lot closer to the observed trend). However if you do that (a) it becomes unrepresentative of the plausible skill of forecast models where ENSO is unknown, (b) the multi-model mean is no longer an estimate of the forced component of the trend (which I think is the thing climatologists, rather than meteorologists, are principally interested in), (c) ENSO is only one source of the unforced component and (d) they would have to run all of the models again, which I doubt would be a good use of processor time.
Bill said:”If they are capable of affecting the trends, then they are also capable of masking or distorting our understanding of the actual empirical impact of the forced components.”
Yes, that is why it is important to look at the whole spread of the model runs rather than focussing only on the multi-model mean (unless you are interested in the best guess of the forced trend, in which case the multi-model mean and its standard error is what you want).
The point about masking or distorting is very apt, however that is the problem that is inevitably encountered in drawing inferences about climate from short term observations, the shorter the timespan involved, the more the forced component is likely to be obscured by the variability of the unforced component.
HTH
Gavin
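Gavin’s point about averaging can be illustrated with a toy calculation (all numbers arbitrary): give every synthetic “model run” the same forced trend plus independent unforced noise, and the ensemble-mean trend lands close to the forced trend, while the spread of the individual run trends reflects the unforced component:

```python
import numpy as np

rng = np.random.default_rng(4)
N_RUNS, N_MONTHS = 20, 96
FORCED = 0.2 / 120.0   # forced trend: 0.2 C/decade expressed per month (arbitrary)
SIGMA = 0.1            # unforced "weather" noise, arbitrary white noise (deg C)

t = np.arange(N_MONTHS)

def ols_slope(y):
    tbar = t.mean()
    return ((t - tbar) * (y - y.mean())).sum() / ((t - tbar) ** 2).sum()

# Each "run": identical forced component, independent unforced component.
runs = np.array([FORCED * t + rng.normal(0.0, SIGMA, N_MONTHS) for _ in range(N_RUNS)])

run_trends = np.array([ols_slope(r) for r in runs]) * 120   # deg C/decade
mean_trend = ols_slope(runs.mean(axis=0)) * 120

print("forced trend:          %.3f C/decade" % (FORCED * 120))
print("ensemble-mean trend:   %.3f C/decade" % mean_trend)
print("spread of run trends:  %.3f C/decade (std dev)" % run_trends.std(ddof=1))
# Averaging the runs cancels most of the unforced component, so the ensemble-mean
# trend sits near the forced trend; the run-to-run spread estimates how far a single
# realization (like the real earth) can plausibly wander from it over this window.
```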
Gavin:
You’re not posting from a NY address, but I’ll assume you aren’t the ‘other’ Gavin who posts at Marohasy’s blog?
First, when saying “sensitive”, can you quantify? Results of trend analysis will always be at least somewhat sensitive to start and end points. Can you quantify what you mean by a “good” estimate? We can never get perfect estimates.
When doing statistical tests, we are only trying to figure out if the data are ‘good’ or ‘insensitive’ enough to draw a conclusion about some very specific hypothesis. The level of “goodness” is specified by uncertainty intervals. We can draw some conclusions based on the magnitude of these uncertainty intervals. (There are other conclusions we cannot make.)
To quite a large extent, the sensitivity to picking different endpoints is already accounted for in the formalism of the statistical tests when we estimate the uncertainty in the trend in the first place. When we have large unforced variability from year to year, the residuals from the least squares trend will be large. If you examine the equation used to compute the uncertainty in the estimate of the trend based on N data points in a time series, the standard deviation of the residuals appears in the numerator of the uncertainty for the fit (i.e. Sy’y’, where y’ is the residual to the fit for individual data points). These residuals will be large to the extent that El Ninos and La Ninas fling around the temperature, so that’s in the uncertainty intervals.
The denominator contains the standard deviation of the times multiplied by the number of degrees of freedom (i.e. the product [N-2]*St’t’, where t’ is the difference between the time for a data point and the mean time), which increases roughly in proportion to the number of data points in the sample. So, the denominator increases as the window gets larger.
One expects Sy’y’ to be an unbiased estimate of the residual scatter, and it should stay more or less constant for any number of data points. But the quantity in the denominator increases as you increase your window length. So, the estimate of the uncertainty decreases as the window length increases.
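Written out, with Sy’y’ the variance of the residuals about the fit and St’t’ the variance of the times about their mean, the standard least-squares uncertainty in the fitted trend b is approximately

SE(b) ≈ sqrt[ Sy’y’ / ( (N-2) * St’t’ ) ]

The residual scatter sits in the numerator; the spread of the times, which grows as the window lengthens, sits in the denominator along with the degrees of freedom.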
As you can see, the uncertainty estimate for the trend accounts for exactly the features in the process you are discussing. When testing a hypothesis, we are comparing the observed deviation to the range one might observe given the unforced variability and the size of the window.
By suggesting that we discount a rejection with the current end point based on sensitivity to ignoring data from 2007, you are setting up a criterion where you insist we
1) Account for the sensitivity of the trend to the choice of end points by using uncertainty intervals and insisting on some criterion like “reject at 95%”. The formalism already accounts for the variability in trends due to “unforced variability”, and so already accounts for the inclusion or exclusion of the last year.
2) Then, if your preferred hypothesis fails, check whether it would have failed at (who knows what level) after specifically throwing out the previous year.
Doing this, you are accounting for the impact of the sensitivity to the inclusion or exclusion of 2008 twice. But what is worse, you are not specifying how ‘different’ the result dropping 2007 must be before you think it would not be foolish to draw any conclusions based on data ending in 2008.
In any case, do you really want to hang your hat on a two year test now? I’d strongly advise against it! It’s really not in your interest. (If you come up with something formal, it’s going to take longer for the weather noise to pull us out of an outlier based on a two year test than on a one year test. So, switching to a two year test now and then switching back when El Nino comes is not going to look good!)
Bill
Gavin is right; models do simulate ENSOs. Here are ENSOs in ECHAM5 and GISS AOM.
Without the volcanoes, you can really see those ECHAM5 ENSOs. 🙂
There do seem to be a fair few Gavins contributing to blogs on this topic, but I am indeed a different one.
The point I am trying to make is not that there is a problem with the sensitivity of the tests, or that one recipe for selecting the end points is better than another. It was more that there often seems to be a case of comparing apples and oranges in that the multi-model mean is not intended as an accurate estimator of the observed trend, just an estimator of the forced component of that trend. If the unforced component is dominant (which is implied by the variability in trends over short timespans), then even a very accurate estimate of the forced trend is unlikely to be “close” to the observed trend itself over that period.
More (pseudo-)formally: Let A = B + C be the observed trend, where B is the forced component and C is the unforced component, and let M be the output of a model providing an estimator of B. In asking if the model trend is close to the observed trend we are asking whether there is a significant difference between A and M, but is that a sensible question in the first place (i.e. is the hypothesis reasonable a priori)? Only if we can define what we mean by “close” in terms of C. If C is a random variable with large magnitude compared to A, then we should expect to see a substantial difference between A and M, even if M were an excellent estimator of B. It is the estimate of C that seems to be missing in many model-data comparisons.
In the multi-model case, the spread of the individual model runs gives an indication of the range of values for the observed trend that the ensemble considers plausible, so that gives the appropriate test (i.e. one where the models are being evaluated on something they can reasonably be claimed to be able to do, at least in principle if not in practice). If you do that, the spread of the model trends increases in an appropriate manner as you narrow the window over which the trend is calculated, so the exact start and end dates should be irrelevant.
I certainly wouldn’t recommend anyone draw strong conclusions about *climate* from short term trends (especially not a two year one), simply because in the short term our view of the climate is so greatly obscured by the weather.
Gavin:
The statistical tests are based on the notion that the multi-model mean only estimates the forced component of the trend. So, the test is apples to apples. The amount of error in determining the trend associated with the forced component is already included in the computed uncertainty intervals, which include information about the variability in the unforced component. The uncertainty intervals are very large for short time spans for precisely the reasons you give. We don’t reject the hypothesis about the multi-model mean unless it falls outside the range of these very large uncertainty intervals.
I don’t know who your “we” is. But I ask if the observed difference is statistically significant, and it’s always reasonable to ask this.
There are indeed a surprising number of Gavins involved in some area or another of climate science. A quick Google finds, in addition to our friendly NASA Gavin:
Gavin Foster at the University of Bristol
Gavin Jennings, the Minister for the Environment and Climate Change in Australia
Gavin Young at Australian National University
Gavin Starks at a rather nifty company called AMEE
And an eponymous pussycat over at Tamino’s house
On a more serious note, I’d be curious to see when the last time was that the observed temperature record fell outside two standard deviations for a period of 8 years. I’d hazard a guess that around the time of the 97-98 ENSO an 8 year trend would be anomalously high (and probably lead to numerous articles about warming occurring faster than expected), though I don’t have an easy way to test this conjecture. Still, anomalous events do happen, so a rejection at two standard deviations once 2008 is included should raise eyebrows and be grounds for further analysis, but not start a stampede for fur coats to ward off the coming ice age.
2008 was not a particularly cold La Nina by historical standards (though it was not insignificant). We also have a rather anomalous solar minimum, though the literature suggests its effect on overall climate forcings will be small (though the LOSU is still low). It will certainly be fun to watch the climate blogs over the next few years, whether temperature jumps back up or remains low relative to projections.
Zeke-
Here are the 8 year trends. I indicate by the beginning year rather than the end year. So, 2001 is 2001-2008 inclusive.
http://rankexploits.com/musings/2009/simulated-and-observed-8-year-trends/
Edit: (The trends ending near 1998 don’t pop out on the high side. All previous pop-outs are borderline and associated with volcanic eruptions. Just getting the timing or magnitude of the volcanoes a bit wrong could cause this.)
Zeke–
I noticed I’d screwed up my graph with temperature anomalies themselves. This is right:
The 1998 El Nino does not pop out of the ±95% bands (Just as before.) I don’t smooth the standard deviations, and they just happen to be large at that point!
Thanks Lucia. The anomalies are interesting, but the 8 year trends are probably more meaningful. I wonder what happened back in 1955? Granted, it’s long enough ago that there are probably a lot more uncertainties in forcings.
Zeke–
A few eruptions register on the Dust Veil index in the 50s. They are smallish compared to Agung, which erupted in 63-64. http://en.wikipedia.org/wiki/Mount_Agung . The 50s volcanoes also weren’t on the equator. I don’t know whether the “buckets to jet inlets, then back again” issue could possibly extend into the 50s.
On the trend graph, a dip at 1955 corresponds to 1955-1962 inclusive. Notice the models show a small dip there too.
Gavin’s comment above has one thing which bothers me about it. He suggests that by looking at the models’ spread, we get a look at the differences due to unforced variability. In the short term this is more or less true. But I’m sure he is aware that the differences between models are due to much more than just different realizations of internal variability: they differ also in their sensitivities and other important aspects. It would make more sense to me that you could get an idea of the internal variability of each model by scaling the ensemble mean to each model and then removing that “forcing signal”: the resulting noise would not have the forcing trend. That would be your unforced variability, not the differences between individual models and the unscaled ensemble mean, which would have strong trends that have nothing to do with ENSO etc. but have more to do with model sensitivity.