Deep Climate sure likes to come up with his own idiosyncratic methods of computing trends for the earth’s surface temperature!
In comment 12745, Deep introduced the notion that we could learn something by noticing that 20 year average trends have been increasing. I pointed out that I was aware they are increasing, but at a rate less than forecast by the models underlying projections in the AR4.
Deep Climate has now refined his notion of which specific long term trends will tell us “that which we wish to know.” (What “that” is seems to vary from comment to comment, but currently seems to be related to debating Pat Michaels’ assessment of models presented to Congress.)
At his blog, Deep has now devised two new methods for comparing trends from IPCC AOGCM model runs to observations of the actual, honest to goodness, physical earth.
I’ll discuss both today. As usual, I’ll compute trends both ways, and show graphs.
Trends based on “The Deep Climate Naive Method”.
Deep Climate’s recent post begins by discussing his first novel method of computing trends:
My first (naive) implementation of such a metric simply computes the moving 20-year average over the post-baseline period, and establishes a trend metric by dividing the difference in average anomaly of the two periods by the elapsed time. The following graph shows the result of the “naive” approach.
This is an interesting approach. 🙂
Suppose that the IPCC defines their baseline using years 1-20 inclusive. Now, suppose that we want to use Deep’s method to compute the trend from the baseline period to year 21 using annual average data; that is, with 1 year worth of data after the baseline.
We know the 20 year average ending in year 21 includes years 2-21, inclusive, and is one year later than the baseline. We can compute the two averages, subtract and divide by 1 year. Right?
But, before punching loads of numbers into our calculator, let’s think back to when we were 12 taking pre-algebra and see if we can simplify the formula.
First, we write down the formula for the average temperature over the baseline period, which involves summing over years 1-20 and dividing by 20. Next, we get the 20 year average over years 2-21 by summing and dividing by 20. We are supposed to divide the difference by 1 year — the time between year 20 and year 21.
Next…, we scratch out the temperature terms that cancel each other. We delighted 6th graders discover that the simplified formula is to subtract the temperature in year 1 from the temperature in year 21 and divide by 20.
Wow! That’s a lot fewer numbers to punch into the calculator!
We are also delighted to discover that, for a one year lag using annual average data, Deep Climate has reproduced what is commonly known as the “George Will Formula for Determining a Trend” otherwise known as “Connecting Two Points with a Line”.
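If you would like to check that algebra numerically, here is a minimal sketch in Python (the anomaly series is invented, and the year numbering follows the example above; none of this is DC’s actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1, 22)                                 # years 1..21
temps = 0.02 * years + rng.normal(0, 0.15, years.size)   # made-up annual anomalies

baseline = temps[0:20].mean()     # years 1-20: the 20-year baseline
moving   = temps[1:21].mean()     # years 2-21: the 20-year average ending one year later

dcn_trend   = (moving - baseline) / 1.0        # DC naive: difference of averages / 1 year elapsed
george_will = (temps[20] - temps[0]) / 20.0    # line connecting year 1 and year 21

print(dcn_trend, george_will)     # identical, up to floating point
```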
Of course, the Deep Climate Naive (DCN) method isn’t quite as bad as the “George Will” method when applied to lags of two years or more. When we arrive at year 31, the DCN method will permit us to compare the observed changes in the earth’s surface temperature to the projections in Table 10.5 of the AR4 using the exact method used to define changes in temperature in that table. (The method I discussed yesterday, and called the “IPCC method”, shares the latter good feature without ever degrading into the “George Will Method”.)
Because DC showed ‘DCN’ trends on his blog, I applied the method to the monthly data already in my EXCEL spreadsheet, then plotted, starting with the first month in 2000.

Using the ordinary eyeball method to interpret the graph, we see application of the “Deep Climate Naive” method to obtain the trend from the IPCC baseline to the first month in 2000 resulted in a negative 20 year trend. (Let’s write an editorial for the Washington Post! 🙂 )
The noise in the DCN type trends diminishes as we include additional months. In 2009, when the method essentially involves the difference between 9 year averages separated by 20 years, the observed trend falls 20% or more below the multi-model trends.
Is the 20% or more difference between models and observations statistically significant? Are the models failing abjectly? Or is this difference expected given the magnitude of “weather noise” on earth? Who knows?
The DCN method is non-standard. Standard methods based on least squares exist, so I am not inclined to waste time coming up with a test for significance based on the DCN method of trend analysis. (I think it’s better to stick to more standard methods of trend analysis. Concocting one’s own method of determining trends just increases type II error: i.e. you will decide differences are not significant when they are.)
What I can say is this:
- The model projections exceed the observed trends when we compute the trends using “The Deep Climate Naive” method.
- Variations in the observed trend from 2000-2009 computed using the “The Deep Climate Naive” method should probably not be interpreted as communicating whether the 20 year trend has increased or decreased since 2000. These trends are pretty noisy.
- Observed trends in the earth’s surface temperature computed using “The Deep Climate Naive” method have decreased recently.
- Those who prefer not to risk significance testing might like this non-standard idiosyncratic method when the hypothesis they prefer is looking shaky. The reason: methods to test the statistical significance of deviations between models and observations computed with a non-standard idiosyncratic method are not pre-coded into EXCEL and may not even appear in the peer reviewed literature.
This means we can all notice the model projections look poor, but there is little risk we can show the difference is statistically significant.
That’s about all I can say of the results based on that method.
Deep Climate Less Naive Method
Unsatisfied with his first stab at computing trends using one non-standard method, Deep Climate came up with a second, more complicated idiosyncratic method, which I could call the Deep Climate Complicated Method, but which I will call DCII. He describes it thusly:
It would be better, then, to limit this comparison metric to the period beyond the baseline, that is starting in 2000. This complicates the calculation of the elapsed time slightly, as it is based on a comparison of periods of unequal length (and will continue to be until 2019). In the case of the period 2000-2008, the elapsed time for the purpose of trend calculation is 14.5 years (the difference between the exact centre of each period).
(When the post-baseline period exceeds 20 years, it appears DC computes the 20 year trend.)
The DCII method can also be computed easily by using the AVERAGE() function in EXCEL to compute the average time for the post-baseline averaging period, and the result plotted rather easily.
It has the advantage of using all 20 years in the IPCC baseline. Like the “Deep Climate Naive”/“George Will” method, this method contains lots of noise when applied using only 1 year of post-baseline data.
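For concreteness, here is a sketch of the DCII arithmetic as I read DC’s description (Python rather than EXCEL, annual rather than monthly data, and an invented anomaly series; the point is only the elapsed-time bookkeeping):

```python
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(1980, 2009)                                     # 1980..2008
anoms = 0.02 * (years - 1980) + rng.normal(0, 0.1, years.size)    # hypothetical anomalies

base = (years >= 1980) & (years <= 1999)   # 20-year IPCC baseline
post = years >= 2000                       # post-baseline period (2000-2008 here)

# Elapsed time = difference between the centres of the two periods: 2004.0 - 1989.5 = 14.5
elapsed = years[post].mean() - years[base].mean()

dc2_trend = (anoms[post].mean() - anoms[base].mean()) / elapsed
print(elapsed, dc2_trend)                  # 14.5 years, trend in degrees per year
```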
I have plotted the trends based on monthly observations and multi-model means using the DCII method. Have a look:

What can we say about these trends?
- The trends are quite noisy for short periods of time. They are less noisy than linear trends computed using data from 2000-now. This is because all trends are preceded by identical changes in temperature from the baseline period to December 1999.
- The current observed trends computed using the DCII method are 20% or more less than the multi-model trend.
- We don’t know if the differences of 20% or more are statistically significant. Like the DC Naive method, the DCII method of computing trends is non-standard.
So… what do you think of the DC methods?
Here’s what I have to say: Well, they are original!
Though I cannot apply significance tests to determine whether the 20% or larger difference between models and observations is statistically significant, I will note that the observed discrepancy arises even though climate scientists making projections had a rough idea of the temperature rise from the baseline period up to 2001 before developing the SRES used to drive the models, and the AR4 projections themselves were not formally published until 2007. So, unlike the methods I prefer, these trend comparisons do involve knowledge of features of the data that existed before the projections were made. This is true despite the fact that, in DCII, those data are not arithmetically involved in computing the average temperature after 2000.
To determine whether the difference between model projections and observations is statistically significant, it would be advisable to use more standard methods, like t-tests for least squares trends, and to avoid polluting a test meant to diagnose forecasting ability with data that existed prior to the forecast: i.e. test projections excluding data that was known before the method used to create the projections was developed. (I use 2001 as my cutoff because that’s when the SRES were formalized. Arguments for other choices are possible, but this is mine.)
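For readers who want to see what I mean by a more standard method, here is a bare-bones sketch of an OLS trend with a plain t-test against a hypothesized slope. It is deliberately minimal: a serious comparison would also adjust the standard error for autocorrelation (e.g. AR(1)), which this sketch omits, and the function and variable names are mine, not from any particular package.

```python
import numpy as np
from scipy import stats

def ols_trend_test(t, y, b_hypothesized):
    """OLS slope, its standard error, and a two-sided t-test against a hypothesized slope.
    t, y are numpy arrays (time in years, anomalies); no autocorrelation correction."""
    n = len(t)
    X = np.column_stack([np.ones(n), t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - 2)                          # residual variance
    se_slope = np.sqrt(s2 / np.sum((t - t.mean()) ** 2))  # standard error of the slope
    tstat = (beta[1] - b_hypothesized) / se_slope
    p = 2 * stats.t.sf(abs(tstat), df=n - 2)
    return beta[1], se_slope, p

# Usage (hypothetical names): test monthly anomalies from 2001 on against, say, 0.02 C/year
# slope, se, p = ols_trend_test(time_in_years, observed_anomalies, 0.02)
```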
Condescension is best left to the comments. You’re stealing Raven’s and my thunder!
Boris–
When I saw Deep’s method becomes the “George Will” method in the limit of short times… well…. How could I help myself? (BTW: I noticed this in… oh… seconds of reading the description. I blinked… reread… blinked again.)
1) We have it on the highest authority that it takes at least 30 years to confirm a climate trend that is adverse to any expectation formed by consensus.
2) I’m partial to 360-year baselines (end of the Little Ice Age) to bask in a steady gentle warming, though those hard core denialists would probably prefer something in the range of 800 years (height of the alleged Medieval Warming period) to get a slight cooling trend. Cherry-picking bastards!
3) This is off topic but I had a dream of being dressed as a medieval knight sweating profusely while a bald man carved a hockey stick from a bristlecone pine and then used it to stir up mud in a Finnish lake while yelling Eureka! What could that mean?
George,
3) Wait… was the bald man wearing pants? 😉
Andrew
George-
It means you need to get out more. 😉
Just hope it’s not in my neighborhood. 🙂
Shallow weather.
I have been asking occasional questions at various blogs and people have been very nice to explain some very basic concepts. Somewhere I missed a memo. The state of the art in climate science seems to be something like this:
Model 1 2+2=Wrong_Answer_1
Model 2 2+2=Wrong_Answer_2
….
Model N 2+2=Wrong_Answer_N
———————————————–
Conclusion: (Wrong_1 + Wrong_2 + … +Wrong_N)/N = Right_Answer.
For some reason I have more confidence in “The Old Farmers Almanac.” At least they seem to have the guts to publish a climate forecast a year in advance. And when their forecast is reduced to a 50:50 chance of being right or wrong, by golly, they are right about half the time.
So my naive question is, “Where can I find the one year forecasts from each of the models?” Weather is chaotic enough that next year’s weather will be due to climate, not to today’s weather. With access to today’s volcanic activity, aerosol and sulfate measurements, sea surface temperatures, ice pack, solar activity, etc., it seems like a reasonable test of any model and the only way to work on improving a model in a reasonable time period.
A skillful prediction of the last freeze of spring and the first freeze in the fall on a regional basis would be the most useful temperature based prediction. Summer precipitation would be the most useful prediction on moisture. (I worry about good crops.) Thanks. I’m probably off topic but I’m losing interest in the statistics of the average of a bunch of wrong answers. Or I just missed that important memo.
PS. It would also be nice to get rid of that tropical warmth at mid altitudes and the assumption of constant relative humidity because both have been falsified. (Do all of them really do that? I would bet an adult beverage against any three preselected models beating “The Old Farmers Almanac”. Of course all three have to agree or I win, no averaging!)
GaryP–
You can’t. They don’t do that. The fact they can’t is not a flaw of climate models.
“You can’t. They don’t do that. The fact they can’t is not a flaw of climate models.”
If you have a hitter who can’t hit a curve ball when it’s actually thrown, you have a flawed hitter.
If you have climate models that can’t produce time-period-x forecasts, you have flawed models. Models change over time because they are flawed. They are used to produce an array of future scenarios because they are flawed, and can’t produce a single scenario that reflects reality today. They suck. 😉
Andrew
The models cannot make a one year prediction? But I’ve seen pictures of global temperature maps supposedly from the models with claims of how well they match reality. (always in the past) Why can they not make a temperature map for next year?
Each year I travel to Michigan during the deer hunting season. The global warming in recent years kept it so warm that it was not cold enough to safely hang a deer from a tree at hunting camp. A couple years ago, we even delayed hunting to the second week in hopes of freezing weather. But last year in the summer we made arrangements to hunt the first week. I had high confidence in a return to freezing weather by Nov. 15 based on low solar activity as reported at Anthony Watts’ web site and an Alaskan volcano. We had freezing weather. Apparently my sniffing at the wind is all I can count on. You are telling me I cannot get a forecast (right or wrong) for seven months out from the climate models? They are being outperformed by “The Old Farmers Almanac”! 50% right sure beats 0%.
Sorry, if I am ranting. For some reason I had thought these models were being tested against the real climate in some type of reasonable time frame so there could be continuous improvement.
By the way, it will be freezing at night in the Upper Peninsula Nov 15-20 this year. Still a quiet sun, another volcano in Alaska, Arctic ice extent is at a 7? year high, and the Pacific Decadal Oscillation (PDO) is in the 2nd year of its cooler mode. Air conditioning bills will be low again this summer. Anybody have a fuzzy caterpillar report? It seems that’s all we have to go on. …………. Nice site. I visit almost every day.
Go ahead have your fun Lucia, just remember they laughed at Bozo too.
No, that isn’t fair. I’ve been reading DeepClimate’s posts both here and at CA; dammit, he means well and all that, but being bright and meaning well just isn’t always enough. I’m sure it isn’t because he doesn’t know how to calculate the trends correctly, or have Excel calculate them for him; the problem is he wants the trends to be as “robust” as possible, and that requires a little creativity.
But speaking of George Will have you seen the latest Arctic Ice extents?
http://www.ijis.iarc.uaf.edu/seaice/extent/AMSRE_Sea_Ice_Extent.png
Wonderful stuff. Very funny, very much to the point. Mind like a steel trap. A great read. Thanks.
“I had a dream of being dressed as a medieval knight sweating profusely”
Even further off topic: the Latin word for armoured warriors on horse back is clibanarii, which means “oven bearers”, which is what it felt like to be riding around dressed from head to toe in metal armour.
Andrew_KY Models do appear flawed in some ways, but their goal is not weather prediction which involves knowing when events are going to happen.
Gary P Modelers only claim models reproduce the statistical properties of the climate. They don’t predict the timing of any event large or small. So, for example, having ENSO with the proper cyclicity is a goal. Predicting the precise timing of ENSO is not.
Kazinski Does ‘robust’ mean “hide the short term negative trend”? If all DC said was the longer term trends are positive, and they are less noisy, that would be fine. But… then I showed the positive long term trend in the post where I jumped in.
He had to keep making claims that were either a) simply incorrect or b) rendered meaningless if we switched to apples-to-apples comparisons with models.
Lucia,
If models can’t determine the ~when~, I’d say What’s The Point?
Andrew
Anyway, the title of a recent thread on this site has the words “Models… Predict”, implying that the models DO attempt to describe a ~when~.
Is the ~when~ merely “sometime in the future”?
I’m not seeing any value here.
Andrew
Andrew_KY: They attempt to predict climate in the future. That is: they want to determine the average weather and the correct statistical distribution of weather events. They don’t try to forecast whether snow will hit Chicago this weekend.
It’s appropriate to ask whether they do what they try to do well. But the fact that they can’t do what they neither claim to do nor try to do isn’t a “flaw”. Focusing on that issue diverts attention from the important question, which is: How well (or poorly) do they predict climate (i.e. the statistical properties of weather)?
Lucia,
Do the models do what they “try to do” well? Do they predict the climate perfectly?
Flaw-
–noun 1. a feature that mars the perfection of something; defect; fault:
Andrew
The models are being run all the time and producing simulations of climate next month, next year and the year 2100, 2200 etc. There is no reason why they can’t produce public short-term forecasts.
The only times this has been done (Hansen in 1988, Jones in 2005), they were spectacular failures and the IPCC has learnt this lesson now so all you will get from now on is really thick lines going out to 2100 where the lines themselves are as thick as 0.1C so you can’t even eyeball a short-term trend.
I pulled apart the components of the 2005 version of GISS Model E (with hindcast data to 2003) and then extended its forecast based on the trends and the forcings.
I’ve started building in a solar forcing decline of 0.1C from 2007 to 2010 given the state of the Sun.
So, GISS Model E would be out about 0.23C right now (which is a rather large amount). gavin commented on RealClimate recently about what his current Model E-R and E-H runs were and they were about +/- 0.3C higher and lower than this extension (quite a range in my opinion).
http://img259.imageshack.us/img259/6594/modeleextramar09n.png
Andrew_KY– I don’t believe they predict climate perfectly. My point is only this: criticizing them for not predicting weather distracts from discussing whether or not they predict climate adequately.
So, if you want to focus on evaluating climate models, it is best to avoid criticizing them for not predicting weather. They can’t– but they don’t claim to. When you criticize them for not predicting weather, people will respond “You don’t understand the difference between weather and climate.”
Focus on whether they predict average properties and, when asking, be quantitative. (How close are they on the annual average temperature for the whole planet? Answer: off by degrees C.)
When Climate is on the mound, he throws you Weather. If you can’t hit what he throws, you belong on the bench, in the minor leagues or try some other sport. 😉
Andrew
–Bill Illis comment 13000–
Thanks, I was suspecting that the owners of the models were not publishing forecasts because the forecasts were failing.
Lucia, without some actual prediction, such as “there will be an ENSO event next year,” we cannot judge whether or not they are even as good as pure chance. If they are worse than pure chance then there is a fundamental error in the model. I am not even getting an answer to that.
Of course we are already aware that they are fundamentally flawed. The lack of a mid altitude tropical warm spot and the declining humidity at 300 mb proves that they are less accurate than pure chance. Is there even one model that does not have these fatal flaws? No one can argue that these two are “weather”. It is incorrect to say that the models have no skill because that implies they are as good as chance. The truth is they are flat out wrong. How are they going to be fixed, verified, and validated in our lifetime?
Lucia,
This was a little mean and I think you may have enjoyed writing this a little too much. On the other hand, DC was particularly arrogant in his original comment so he probably deserves this medicine. But this is certainly a blow to any aspirations he may have of credibility in the climate blogosphere.
Sorry to be the guy who rains on the parade, but let’s all keep firmly in mind that whether models are even marginally credible or simply falsify on an ongoing basis, these very models and their climate change “forecasts” are used as justification for multi-billion dollar carbon mitigation/management/taxation policy decisions being made by governments in the EU countries and more recently by the Obama administration and government agencies in DC.
Unfortunately, that reality doesn’t falsify…
[Please note that this message does not come with an umbrella]
dribble– I admit the “George Will” reference may have been a little mean. DC was equally arrogant here in comments, and said many things that were either a) wrong, b) obscure or c) obscured and turned out to be silly when clarified by his blog post.
FWIW: I know some really nice people who are naturally nice to their core. I am not one of them.
David Gould:
Could you elaborate? I’m not sure what you mean by saying you got runs with projected trends above 15 or what you mean by running models. (The average of models in my graph is the average over models actually used by the IPCC. I didn’t run them. )
tetris,
You are correct, but we have to continue to resist the stupidity of all of this as long as it continues to affect us… even if that means making the same obvious points over and over again, every day, for an indefinite period of time. Those are the cards we have been dealt, and we either play or fold. 😉
Andrew
Lucia,
The data that you have posted shows a run of a number of real trends being below predicted trends, the predicted trends being the IPCC model predictions and the real trends being taken from actual climate data. What I wanted to discover was whether such a run of 12 or 13 data points being below the predicted trend is unusual if we assume that the model is actually correct.
I have created a simulation of temperature. It is basically random noise – the standard deviation of which you can adjust – plus a trend, which you can set at any number you like.
This simulation obviously predicts that any trend – whether 20-year, 30-year, 40-year or 100-year – will be the trend that you build into it at the start.
However, when you run the simulation and check the running 20-year averages, what you find is that there is considerable variance from the predicted trend.
More than that, you find that runs of 10 or more 20-year trends which are below or above the predicted trend are quite common.
Indeed, I suspect that if I had thought about it a bit more I should have predicted this.
For each successive 20-year trend, 19 of the data points are the same. Thus, the chances are good that the difference between successive trends will be relatively small. Sometimes it will bounce around, of course, as we are dealing with random data that sometimes throws out a big differential. But much of the time successive 20-year trends will be similar.
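A toy simulation along these lines (my own sketch, not David’s actual spreadsheet) makes the point: because successive 20-year windows share 19 points, long runs of trends on one side of the built-in trend are common.

```python
import numpy as np

rng = np.random.default_rng(42)
true_trend, sd, n_years = 0.02, 0.15, 60

years = np.arange(n_years)
temps = true_trend * years + rng.normal(0, sd, n_years)   # white noise plus a built-in trend

# rolling 20-year OLS trends
trends = np.array([np.polyfit(years[i:i+20], temps[i:i+20], 1)[0]
                   for i in range(n_years - 19)])

below = trends < true_trend
# longest run of consecutive 20-year trends falling below the built-in trend
longest_run = max(len(r) for r in ''.join('1' if b else '0' for b in below).split('0'))
print(trends.std(), longest_run)
```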
Overall, of course, the model is right. It cannot be otherwise, as we are not comparing the model with real data; we are comparing the predicted trends in the model with data generated by the model.
Thus, it is impossible to use the current comparisons between the IPCC predictions and the real data to say whether or not the IPCC models are wrong.
And I understand that you are not saying this – I just wanted to see if I could narrow down a bit better whether this was so or not.
Oops: ‘running 20-year averages’ should be ‘running 20-year trends’.
So David,
you are saying the models are completely independent of reality??
kuhnkat,
No – the opposite.
What I am saying is that the models and reality are thus far in agreement as far as can statistically be determined.
To come to the conclusion that the models and reality have diverged, we would need more data – perhaps another decade.
Alternatively, a particular 20-year trend would have to be outside two standard deviations of the predicted 20-year trend. And that would be a significant drop from current observed 20-year trends.
As an example, in my model I am currently using a standard deviation for the random data of around .15 degrees centigrade – less than the deviation in Hadley, GISS, UAH or RSS data.
This gives a standard deviation in the trend slopes of around .005 degrees/year when I set the trend at .02 degrees/year.
A trend greater than two standard deviations below the predicted trend would thus be lower than .01 degrees / year.
You can see that we have not reached this point yet in the data that Lucia posted. As such, the IPCC models are currently in statistical agreement with reality.
And I should point out that in, say, 100 years of data we should expect one or two data points to be more than two standard deviations from the predicted mean.
With all this, what I was trying to do was work out whether the data that Lucia posted was sufficient to determine that the IPCC models were wrong. Clearly, it is not. And note that she did not claim this – she was making a point about graphs speaking for themselves (the point being: they don’t).
“To come to the conclusion that the models and reality have diverged, we would need more data – perhaps another decade.”
So in your informed opinion, the models are completely worthless for at least 10 more years? Thanks for the info. In ten years we’ll have ten more years of unpredicted weather. Brilliant! 😉
Andrew
Andrew,
No.
What *is* worthless is using the *average* predicted trend of the model runs to examine the worth or otherwise of a model *without* examining expected variability. (Note: Lucia is *not* doing this and neither should you).
What I have shown is that the models are currently in agreement with reality. How that can lead to the notion that the models are ‘completely worthless’ is beyond me, I’m afraid. You will need to share more of your thought processes.
As an example, if I make a prediction that Ponting will score between 400 and 600 runs in the upcoming Ashes series, taking the average of that prediction will give a result of 500. If he then scores 420 runs, you cannot conclude that my prediction was wrong by comparing 420 with the average of my prediction.
By the way: a prediction can of course be correct and be completely worthless at the same time. As an example, if I predicted that Ponting would score between 0 and 5000 runs in the upcoming Ashes series, my prediction would definitely be right. But it would also just as definitely be worthless.
However, an IPCC prediction of trends of between .01 and .03 per year *is* useful. And thus far such a prediction has proven accurate.
David,
How is a prediction of .01 and .03 per year useful? Who is it useful to and what useful thing do these people do with the information?
Andrew
Andrew,
It provides upper and lower bounds to projected global temperature change by any given year in the future.
This enables other scientists to, for example, set upper and lower bounds to impacts on other things.
It also enables policy makers – assuming they bother to base their policies on actual science, which is dubious of course – to set policies to mitigate or adapt to climate change.
If your argument is that we would be better served by smaller bounds, I agree. But normal temperature variation will see quite high variation in 20-year trends. If we run 30-year trends, 40-year trends and 50-year trends, the variation becomes much much smaller.
For example, in my models the standard deviation for a 30-year trend is .0025 instead of .005. For the 40-year trend, it is .0017.
This clearly shows that examining short-term trends is not as useful as looking at the long-term trend.
Weather is very noisy, that’s all.
David–
I’m still not entirely certain I know what you did. Earth’s surface temperatures are autocorrelated, so you should account for that when running your monte-carlo simulations.
However, I agree that if we specifically use the trend from 1988-2009, the earth’s trend does not fall outside the range of all weather in all models projected/hindcast by the models. I’ve never suggested otherwise.
I usually don’t focus on that question but instead focus on whether the model mean trend is consistent with the underlying mean driving the earth’s data. To test that somewhat different question, I don’t check whether the earth’s trend as weather falls outside all weather in all models.
Instead, I used a t-test to determine the probability that the mean trend associated with model projections and the population mean underlying the monthly data for the earth’s single realization are the same.
These are different questions, and I think if we wish to figure out whether the models are on track, testing whether the mean trends differ is the more important of the two. (In any case, if the models are wrong, testing the means is the more sensitive test. That is: it has lower type II error. Both tests have roughly equal type I error. So, the test with the lower type II error has lower error rates all around.)
I illustrate the normalized difference between the model mean and the observations below:
Figure 2: Normalized error as a function of start year.
As you can see, if we had happened to pick 20 year trends, the value of d* falls inside the ±95% confidence intervals. This is not too surprising as this test involves mostly data that was already known before the projections/predictions were made. If we focus on data after the IPCC froze the SRES used to drive the models (2001), the value of d* is well outside the confidence range.
This suggests that the model mean trend differs from the trend underlying the earth’s data. However, the deviation does not necessarily mean that the earth’s weather falls outside all weather possible in all models. (That would take quite a long time to occur.)
You can read more here: http://rankexploits.com/musings/2009/multi-model-mean-trend-aogcm-simulations-vs-observations/
The important point is this: just because weather is very noisy does not result in us being unable to make meaningful statements about our climate future.
Lucia,
Yes, the autocorrelation is something that I am working on. This is one reason why I made sure that my standard deviation for temperature was lower than that in the Hadley and GISS data prior to any significant warming trend.
I will examine the rest in a moment – thanks.
David–
The graphs in this post do not “speak for themselves”. Trends computed by these particular idiosyncratic definitions DC suggested are unnecessarily noisy, based on throwing away data, etc.
We can all argue about the start year, using a t-test vs comparing to all weather in all models etc. But creating novel ways to compute trends is stepping a bit outside of what is useful.
Lucia,
Basically, what I did was set up a model that had a built-in trend. This trend obviously becomes the mean predicted trend of the model.
Then I ran the model and compared rolling 20-year trends with the predicted trend.
With a now altered standard deviation in the random temperature data of .11, I got a standard deviation from the predicted trend of .0036. (I changed it from the previous standard deviation because it seemed to produce too choppy results – a consequence of not accounting for auto-correlation, I am guessing).
This indicates that anything in the range of .013 to .027 in 20-year trends is not out of the ordinary.
The other thing that I checked for was runs of below average 20-year trends – for example, 10 20-year trends in a row being below the predicted one.
These occurred quite frequently.
(And I know that I am not explaining myself very well – I do not yet speak the language …)
I will look more at the d* argument in a moment. Thanks.
David Gould:
“Ponting will score between 400 and 600 runs in the upcoming Ashes series”
Now there *is* a model that is completely independent of reality. Let’s look at the most recent real data: 6 innings, average of 35, total of 210. Dream on, Mr. Gould.
David,
Thanks for taking the time to respond. You speak in generalities, which are not useful in specific situations.
You say we can still make meaningful statements about our climate future, which is a pretty meaningless statement itself.
What specific meaningful statement about the climate future do you mean?
Do you have a date in mind and some kind of specific occurrence in mind? Or are you just saying something you think sounds nice?
Andrew
Lucia at 13039,
Agreed.
Andrew,
Well, one specific thing that the models predict is significant warming. In my opinion, that is very meaningful.
However, if you can give me an example of what you think would be significant, then perhaps we can narrow it down a little. My suspicion is that you are talking about predicting weather events (you raised the notion of dates and specific occurrences, which sounds like weather). The models certainly cannot predict the weather.
David–
I think the models appear to be a heuristic tool, but their accuracy appears rather limited. So, I think they give the qualitatively correct answer that if we increase GHGs, the temperature goes up. But the trend based on the multi-model mean is biased high compared to the trend that underlies the earth’s temperatures since the time the SRES were selected.
This could be the result of a number of things that range from problems with parameterizations, inadequate scenarios and/or other.
Dribble,
Well, as an England supporter, I hope he doesn’t score anywhere near 400 runs in the series …
Lucia,
Given my lack of experience in statistical analysis, I would need to do more work to see if I agree or disagree. (And I do not know enough yet to ask enough questions).
However, based on my admittedly limited model, it does not seem that any bias is needed in a model to overpredict a series of between 10 and 20 rolling 20-year temperature trends.
David Gould
Likewise, just because weather is very noisy doesn’t mean we can’t do statistical tests to determine whether a multi-model mean lies outside the range consistent with data. Short periods of time mean wider uncertainty bounds. But there isn’t a magic time threshold when “weather noise” suddenly drops to zero.
We can test a hypothesis with fairly short time periods. The tests will have large type II errors– but it doesn’t mean we can’t test the hypothesis.
(For some reason, some in comments elsewhere want to talk about “defining the mean”. No one here has ever said the short term trend “defines the mean”, or has anything other than quite large uncertainty intervals. They are big.)
Andrew–
I think saying the long term trend is between 0.1 and 0.3 C/century but their best estimate is 0.2C/century would be meaningful. We’d know that warming was expected, but the uncertainty is large.
But, I happen to also think it’s useful to figure out whether the trend the IPCC communicated as their best estimate may be biased high. If, given data arriving after the projections were made, we can now be 95% sure the long term trend is less than 0.2 C/century, I think that is also meaningful.
It’s not as if there is only one meaningful possible statement about model accuracy.
Lucia,
A trend of .2C/century (assuming it’s real) for an indeterminate amount of time should compel a person to do what exactly?
Andrew
Lucia at 13048,
Agreed.
I have had a look at using d*.
I read how you used the equation given to determine the standard deviation. Could you tell me what the standard deviation that you derived using that equation is?
The thing is, while I may not be understanding this very well, the standard deviations for each model seem to be listed in table 1.
These standard deviations seem to be quite high, so I am betting that I am misunderstanding this.
I understand that answering my questions and helping me overcome my lack of knowledge is going to take time, and that you might not be willing to take that time. However, any assistance would be appreciated.
I think some of my posts vanished – not sure, though; they might be awaiting moderation.
Andrew_KY–
Compel? I’m not into compulsion. But at a minimum, it can be useful to plan ahead. For example, if I lived in a dry climate and thought less water was likely to fall in the future, I might consider building more desalination plants. Or bigger reservoirs.
Other people might want to consider other options: Moving. Or, if they thought they could do something to prevent the future drought, they could propose that.
By the same token, I’d also like to monitor whether projections or predictions are panning out. It’s a waste to build a 1,000,000 gal reservoir when it turned out you only needed a 500,000 gal reservoir.
Meaningful is a pretty squishy word. Lots of things are meaningful. Imprecise, biased projections can still be “meaningful”. This is because “meaningful” doesn’t necessarily mean “accurate” or “precise”. The word “meaningful” can be very useful for people who want to promote models that aren’t particularly accurate or precise. It gives the impression that someone is claiming the models are accurate or precise, but when push comes to shove, all they really claimed was that the model projections aren’t utterly meaningless or utterly useless.
No one wants to promote their research with slogans like “Our state of the art AOGCMs: Not totally useless!”
Lucia,
I can speculate all the things you mentioned with or without a trend from any models. The trend from the model doesn’t necessarily mean there will be ANY of the specifics you described about less water or whatever.
How ’bout this for a model slogan (Yours was pretty good)-
AOGCM’s: Because, Dammit! Just Freakin’ Because, OK? Shut Up! Now Leave Me Alone! 😉
Andrew
Lucia,
Given that the IPCC projections for the century warming range from 1.1 to 6.4 °C, it would seem to me that there is a significant standard deviation in the models – one that is perhaps greater than the one you used for your d* analysis.
For rolling 20-year trends, the deviation is likely to be even higher than for a century trend – that is certainly the case in the model that I am running.
Andrew_KY–
I agree policy makers and people can consider all sorts of thing when making plans. These include:
1) Past observed trends.
2) Simple 1 – D models from radiative physics.
3) Full AOGCM’s.
I’m not sure that AOGCM’s add utility, accuracy or precision beyond what we get from 1&2. Those two things give us sufficient basis to motivate some planning decisions.
I’ve never seen anything that I think shows models tell us much beyond what 1 & 2 already tell us, and the uncertainty from model projections seems high.
But others may have different opinions.
Andrew_KY (Comment#13052)
A trend of .2C/century (assuming it’s real) for an indeterminate amount of time should compel a person to do what exactly
I’m guessing that Lucia meant 0.2C/decade or 2C/century.
Thanks Lucia and Simon. Happy Friday to everyone! I can’t wait for 17:00:01!
The difference in the trend between .02C/decade and 2C/century would mean what, for whom? Obviously, there’s a numerical difference, but what reality does that translate into? What kind of weather difference is that? What change in whose climate is predicted for the lower trend, and how would that be different from what is predicted for the higher trend?
Andrew
Sorry, I meant .2C/century and 2C/century. 😉
Andrew
Predicting maximum damage with a high probability of occurrence is a form of compulsion. Prudent policy requires that a planned investment in prevention be approximately equal to or more than the product of the probability of an adverse outcome times the cost of that outcome. My impression of alarmist ideology is that the goal is to dictate policies not subject to such cost analysis. It does so by means of inflating both the scale and the probability of loss due to climate change. The refusal to agree to yield all forms of economic and energy planning to one’s purported saviors is to be complicit in planetary disaster of unimaginable proportions.
Bjorn Lomborg made the eminently reasonable observations that (a) the probability of some warming is high to near certain but the probability of catastrophic warming is very low; (b) we should accept the cost of simply adapting to that likely modest warming rather than embark on costly economically destructive and futile efforts to prevent warming; and (c) in addition to being futile and destructive it would waste resources that should be directed to more immediate needs. He does so in very non-polemic tones with lots of factual references. For that he has been vilified and attacked.
Alarmism is a form of compulsion.
Lucia,
I have looked at d* tests. Pretty simple stuff, really.
But the problem I have is this: the test depends on the how meaningful the SD is.
Each individual temperature data point in the Hadley data is within two SDs of the mean predicted by the models.
But the trends are not.
Doesn’t this suggest to you that either:
1.) The SD is wrong for the individual temperature points; or
2.) The SD is wrong for the trend.
If the same set of data can give two different results if you test it in two different ways – one that the data is all within two standard deviations of the mean, and the other that it is not – *something* must be wrong.
Lazar raised this more eloquently at Tamino’s blog, so all credit for thinking of this should go to him.
(However, I will point out that my modelling showed precisely this: that a set of temperatures with a small SD will result in a large SD for the 20-year temperature trends.)
As an example, when I autocorrelate my randomly generated temperature data, a standard deviation of .13 in the temperature data (as an example, this is lower than the standard deviation in the Hadley data from 1850 to 1949) gives a standard deviation of .0076 in the 20-year trends.
Taking a +.02 degrees centigrade / year trend, this means that results as low as +.005 degrees centigrade / year are still within two standard deviations of the mean.
So: methinks that there is something wrong with the SD you calculated for your d* test.
(Without any autocorrelation, a standard deviation of .128 for temperature data gives a standard deviation of .0044 for the 20-year trends, which means that .012 is still within 2 standard deviations).
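A toy version of that comparison (my own sketch; the AR(1) coefficient and sample counts are illustrative, not David’s actual settings) looks something like this:

```python
import numpy as np

rng = np.random.default_rng(3)

def sd_of_20yr_trends(temp_sd, phi, n_years=20, n_sims=5000):
    """Spread of 20-year OLS trends for noise with a given marginal SD:
    white noise when phi = 0, AR(1) with coefficient phi otherwise."""
    years = np.arange(n_years)
    innov_sd = temp_sd * np.sqrt(1.0 - phi ** 2)   # keeps the marginal SD at temp_sd
    slopes = np.empty(n_sims)
    for k in range(n_sims):
        e = rng.normal(0.0, innov_sd, n_years)
        noise = np.empty(n_years)
        noise[0] = rng.normal(0.0, temp_sd)
        for t in range(1, n_years):
            noise[t] = phi * noise[t - 1] + e[t]
        # any built-in trend cancels out of the spread, so plain noise is enough here
        slopes[k] = np.polyfit(years, noise, 1)[0]
    return slopes.std()

print(sd_of_20yr_trends(0.13, 0.0))   # white noise
print(sd_of_20yr_trends(0.13, 0.6))   # autocorrelated: noticeably wider spread of trends
```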
I guess my next step is asking for an explanation of equation 12 in Santer.
It looks to me as though this is going to be different for any particular 20-year trend – I may be in error, of course, but I will lay out what I *think* I know and then you can laugh and correct me. 😉
Beneath the square root sign in the expression in the denominator there are three things which I think I know.
I think that 1/Nm is the number of data points over which you are testing – the more data points, the better.
Then you have what looks like the inner product of the variance of the model squared. This does not make sense to me, as the variance is a single variable and so I am unclear what an inner product of that would be. So I likely do not understand the symbols.
Then there is the estimated variance in the real world data squared.
I do not at this point have any inkling of what ‘s’ is.
Any explanation would be helpful. 🙂
David–
There is no contradiction. Consider this example:
Say the average height of men is 5’10″. But someone comes along and claims it’s 5’8″. To test his claim that the average is 5’8″, you collect 400 men at random and measure. Suppose you find the average height of the 400 men is 5’10.1″, but the standard deviation of their individual heights is 2″.
Your uncertainty in determining the mean will be 2″/sqrt(400) = 0.1″. (This value is called the ‘standard error in the mean’.) So, the 95% uncertainty intervals for the mean are roughly 5’9.9″ to 5’10.3″. (Notice I ginned this up so the true mean actually did fall inside the uncertainty intervals. You can’t ever know that– you can just collect more and more data to make your uncertainty intervals smaller.)
Based on this amount of data, you are very confident the mean is not 5’8″. You know that if you repeat the experiment and gather up 400 different men, the averages height of those men is almost certainly not going to be 5’8″. You could inform those who insist it’s 5’8″ based on some model that their models is probably wrong.
But now, suppose they come back and tell you they can find tons of men who are 5’8″. In this example, the 2 sd range for individual heights is roughly 5’6.1″ to 6’2.1″. So, you have, in their mind, failed to prove that the average is not 5’8″ because 5’8″ falls within 2 SD of the mean.
Would you conclude, in this case, there must be something wrong with the computation of the mean? Or of the standard deviation of the population of men? Or, would you realize that the uncertainty in the mean height is different from the scatter in the individual heights?
(You should conclude the latter. The uncertainty in the mean computed from many samples can be much, much smaller than the standard deviation of the data. This is actually why we bother to average and also why the George Will method of getting trends using two points at two ends is not as good as using all the data.)
In this example, I used real variations in height as the ‘noise’ relative to the mean. So, the standard distribution in the individual data was driven by something in the system studied. But the concept of how well we can know the mean is equally valid if the ‘noise’ is measurement uncertainty or a combination of measurement uncertainty and system noise. For measurements of the surface of the earth, it’s both.
Discussing the uncertainty in trends based on ordinary least squares is a bit more complicated. But the method is a sort of average. Just as we can determine the uncertainty in an ordinary average, we can do the same for trends. The uncertainty in the trend can be very small if we collect lots and lots of data. (This is why we can do lab experiments to determine things like lift coefficients on wings very, very precisely. There are additional complications with time series, but the thing you are seeing as a contradiction is just not a contradiction.)
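A quick numerical illustration of the distinction (the heights are simulated, not real survey data):

```python
import numpy as np

rng = np.random.default_rng(7)
heights = rng.normal(70.0, 2.0, 400)        # 400 men, true mean 70" (5'10"), SD 2"

scatter = heights.std(ddof=1)               # ~2": spread of the individuals
se_mean = scatter / np.sqrt(len(heights))   # ~0.1": uncertainty in the *mean*

print(heights.mean(), scatter, se_mean)
# The mean is pinned down to about +/-0.2" (two standard errors) even though
# individuals routinely fall 4" above or below it.
```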
Well, then this answer should be useful to him too! 🙂
If your modeling has been done correctly, and used appropriate levels of noise, it will reproduce the Santer17 results. So, apply their test to compare to realizations, repeat that and see if you reproduce their uncertainty intervals.
In fact, I advise doing the ordinary t-test with white noise first. This has been around for ages, appears in my sophomore year advanced math text, is repeated in many experimental methods courses, and is pre-programmed into EXCEL. If your monte-carlo ‘proves’ that method wrong, it’s your monte-carlo that is wrong, not the t-test.
David Gould–
The specific values you obtain for the sample mean, sample standard error, sample variance etc. are themselves, random and will include noise. So, you will get a distribution of numerical values in different experiments. This is not a problem.
The formalism of the method is set up so that, based on the data you have, you will ‘reject’ the null hypothesis in x% of all possible experiments. (We’ve been using x% = 5%.) That is to say: you are willing to make, or intend to make, this type of mistake (i.e. type I error, or incorrectly rejecting a null hypothesis) at a rate of 5%. The individual parameters like the sample mean and sample standard deviation are estimates for the true mean, true standard deviation etc. But the sample values aren’t necessarily equal to the true values. So, they will not be the same in all experiments.
Flipping back in the paper is a good practice when you don’t know what the symbols mean. See equation (9) where the symbol is defined. Equation (12) contains no inner products of variances. S is a standard deviation not a variance. The variance is the square of the standard deviation or the inner product of the individual residuals, if you prefer.
The notes directly under equation 12 also discuss these terms in words.
No. That’s the square of the standard deviation, as mentioned in the paragraphs below the equation.
Evidently not. 😉
Read the paragraphs under the equations, and when in doubt flip back to previous equations for the definitions. This helps if you aren’t familiar with these notational choices. That said, Santer’s choices are pretty conventional: ‘s’ is fairly standard for a sample standard deviation; ‘s²’ is fairly standard for a variance.
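Spelled out in code, the statistic has roughly this shape (my reading of equation (12); the variable names are mine, and the degrees of freedom below are a simple placeholder rather than Santer’s equation (13)):

```python
import numpy as np
from scipy import stats

def d_star(model_trends, b_obs, se_obs, alpha=0.05):
    """A d*-style test in the spirit of Santer et al. equation (12).
    model_trends : trend from each model (one value per model)
    b_obs        : observed trend
    se_obs       : standard error of the observed trend (autocorrelation-adjusted upstream)"""
    n_m = len(model_trends)
    b_m = np.mean(model_trends)             # multi-model mean trend
    s_bm = np.std(model_trends, ddof=1)     # inter-model standard deviation of trends
    d = (b_m - b_obs) / np.sqrt(s_bm ** 2 / n_m + se_obs ** 2)
    dof = n_m - 1                           # placeholder; Santer eq. (13) is more careful
    crit = stats.t.ppf(1 - alpha / 2, dof)
    return d, crit, abs(d) > crit
```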
Thanks, Lucia. That’s helped shine a light on some things for me. 🙂 I think that my main problem is that I was not thinking in terms of averages, but only in individual runs of the model.
(I still have one problem with this, which I will raise in the next post, and I would appreciate very much any explanation that you can provide.)
Regarding white noise and proving it wrong, can you go through that in a little more detail, as I am not sure that I understand.
Prior to putting in some autocorrelation, I have used pure random data to generate my temperatures. I have examined these with a trend of zero. Is this what you mean by using pure white noise?
Okay: on to the problem that I have (well, the problem to do with this specific thing, at any rate …).
If we run a model a number of times and then take averages of those runs, the standard deviation of the averages will be less than the standard deviation in any particular run.
Indeed, as we take more and more runs, the average will end up not looking very much like any particular run at all.
Now, the thing is: the real world data is not going to be the average of 40 runs. It is only going to be one single run.
If we assume that the model perfectly captures the real world (within the limits of the randomness in the system, which is what gives us our standard deviation) then we should not expect the real world to meet the d* test of the average of 40 model runs, should we?
It seems to me that every single one of the 40 runs of my model used to generate the average fails the d* test when measured against the average of those 40 runs.
So I am wondering what comparing the average of 40 runs to the real world data tells us, exactly.
Surely it would be best to average the standard deviations rather than take the standard deviation of the average to determine if a particular model run or set of real world data fits within the model parameters.
That is what I conclude, anyway. As you and others are doing this averaging, however, it must give us valuable information. So: I must be missing something rather important here. Can you explain what it is?
David Gould:
I’m not sure which of the following you mean, but both are true:
a) The standard error for the sample mean trend will be less than the standard deviation for the collection of means of the groups.
b) If we create a time series based on the mean of all model runs, and estimate the standard error in the trend based on that data, that estimate will be lower than what we get for the individual runs.
The following is not necessarily true:
The standard deviation of trends from the collection of N models will be less than the estimate based on the time series data from 1 run.
In fact, if the residuals are of approximately equal magnitude, and the residuals from a straight line are “red noise”, the two will be approximately equal. (If we check with the models, they are close to equal.)
Correct. The average will converge to the mean for the model.
If the models perfectly match the world, the real world should match the d* test. Notice the standard error for the noise in the real world realization does not go to zero– this is the second term under the square root in equation 12 in Santer.
As we take the model runs to infinity, the d* test becomes an ordinary t-test with the world compared to the average result from the models.
Before convincing yourself using that standard deviation is better think about this:
If I told you the average standard deviation of all trends from all the models was less than the value of s{bo} used in equation 12, would you want to use the model value? That would make the models fail worse. That could happen if the noise in models is too low (as used to be the case with older AOGCMs). Would you want this to be the rule?
How about if I told you the average standard deviation of all trends from all the models was larger than s{bo}, but there is a test that can show the average s{bm} in models is inconsistent with s{bo} for the earth? (If the models are correct, the two should match. I haven’t run tests on this because it’s time consuming, but I know the numbers, and I’m pretty sure I’d get this outcome.)
Will the rule be: Take whichever method is more flattering to models? Even if statistical tests show the model value is inconsistent with observations?
It turns out that, though you think you want to modify the method used in Santer (which is a standard method used in many fields), there are good reasons that method and not the new one you propose is standard.
a) Your proposed method ignores information about the noise level obtained from the real earth (i.e. it ignores some of the data.) Statistical methods to test models that ignore data are generally frowned on.
b) The method you propose is not a standard method – whereas the d* method actually is. So, because it’s just not the method people have been applying, it has the odor of someone making up a new method based on not liking the results from a standard method.
c) Your method treats data from climate models as weather forecasts rather than climate models that are only thought to reproduce the mean correctly.
d) The modelers themselves admit the weather noise in climate models is often not earth-like, and they even describe known issues. You can find notes at PCMDI. So, if even the modelers think they don’t get this right, why would we ignore the earth data in favor of it?
Now, I’m going to comment on a worry I suspect you have but are not stating directly. I know many who don’t want to believe the models are off track have developed the concern that, because we’ve only sampled the earth for a short period of time, we might not get the correct estimate of the standard deviation for the earth’s noise.
This is true. We might not. Whatever the ‘true value’ is, if we compute using finite data, sometimes our sample will show too little weather noise and sometimes too much.
You should be aware that “Student” (aka Gosset), motivated to make a more perfect beer, went to a lot of trouble to account for the fact that both the sample mean and the sample standard deviation differ from the true values, and incorporated any biases into the t-test. He came up with the Student-t distribution to deal with this issue, and t-tests account for the uncertainty in the sample standard deviation when implementing the full test. So, although Tamino and the others tend to simplify and tell you the multiple of “2” represents the 95% uncertainty intervals, that’s a simplification. The correct value depends on the number of degrees of freedom. That’s computed under DOF in equation 13 in Santer. I use that DOF in all tests. As it happens, this means I always use a value larger than 2 for my cut-off. (You can discover how much difference this makes. Whip out EXCEL, look at TINV(0.05,1), TINV(0.05,30), TINV(0.05,100000). You’ll learn that if you have 1 degree of freedom, a multiplier of about 12.7 marks the 95% confidence interval. Using 2 would be wrong.)
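The same exercise in Python instead of EXCEL, for anyone without a spreadsheet handy (scipy’s t.ppf(0.975, dof) corresponds to Excel’s two-tailed TINV(0.05, dof)):

```python
from scipy import stats

for dof in (1, 30, 100000):
    print(dof, stats.t.ppf(0.975, dof))
# 1      -> ~12.7  (nowhere near 2 with a single degree of freedom)
# 30     -> ~2.04
# 100000 -> ~1.96
```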
Since the time Gosset did this, people have safely used the sample standard deviation based on observations despite the fact that small samples make its magnitude uncertain. That gets fixed up by referring to the t-distribution.
Are there still difficulties with the Santer method? Sure, if the noise in the earth’s weather is not AR(1), then it will give wrong answers. Whether the answers will be too high or too low, we cannot know for sure. So, this is something worth investigating.
What I can tell you is that I’ve compared the average s{bm} estimated from individual realizations, as in Santer, to the standard deviation of trends across runs for models with multiple runs. (I can’t do the test with models with only 1 run.)
That is, I compare the average of the single-realization estimates of s{bm} to the spread of the actual trends. The standard deviation over all trends is lower than the estimate based on individual realizations. If the same holds true for the earth, then the Santer method overestimates the standard deviation of trends for the earth! (That is, this does the opposite of what Tamino has suggested when telling us the earth’s noise is not AR(1), but ARMA(1,1) with a particular estimate for the parameters.)
What I just described is a comparison one would make to figure out what sort of problems one might have using AR(1), but it relies purely on models that have multiple runs. However, there really aren't enough models to conclude anything. The results are the opposite of what would happen if the weather is ARMA(1,1) with the properties suggested by Tamino, but, for all I know, this could happen by chance even if he is correct.
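To make the bookkeeping concrete, here is a minimal synthetic sketch in Python of the kind of check described above. It assumes pure AR(1) noise with made-up parameter values and uses no real model output, so it only illustrates the comparison, not the result:

import numpy as np

rng = np.random.default_rng(0)
n, n_runs, phi, sigma = 120, 200, 0.5, 0.1   # length, realizations, AR(1) coefficient, innovation SD (all made up)
t = np.arange(n)

def ar1_series(rng, n, phi, sigma):
    # Generate one stationary AR(1) noise series with coefficient phi.
    x = np.empty(n)
    x[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - phi**2))
    for i in range(1, n):
        x[i] = phi * x[i - 1] + rng.normal(0.0, sigma)
    return x

trends, se_ar1 = [], []
for _ in range(n_runs):
    y = ar1_series(rng, n, phi, sigma)             # zero true trend
    b, a = np.polyfit(t, y, 1)                     # fitted slope and intercept
    resid = y - (a + b * t)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]  # lag-1 autocorrelation of residuals
    n_eff = n * (1.0 - r1) / (1.0 + r1)            # effective sample size, Santer-style
    s_b = np.sqrt(np.sum(resid**2) / (n_eff - 2) / np.sum((t - t.mean())**2))
    trends.append(b)
    se_ar1.append(s_b)

print("spread of fitted trends across realizations:", np.std(trends, ddof=1))
print("average AR(1)-based s{bm} from individual realizations:", np.mean(se_ar1))

When the noise really is AR(1), the two printed numbers come out broadly similar; a systematic gap between them, of the kind described above, is what would suggest the AR(1) assumption is off.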
I don’t know if I’ve managed to explain what you might be missing. But.. that’s pretty much it.
Lucia,
I certainly do not have enough knowledge to be suggesting non-standard tests! 🙂
I am just concerned that, when looking at my model, *none* of the individual runs passes the d* test when the standard deviation of the average of those 40 runs is used.
I have a prediction, one which you or I can test if we have access to the data (I am not sure where to look for it, but you probably are).
My prediction is this: if we use the d* test on any of the 40 runs of the model(s), we will find that the run fails to pass the d* test.
If my prediction is correct, finding that the real values taken from the earth fail to pass the d* test *cannot* mean what you say it does.
In this particular example, the standard deviation of the averages is a measure of the statistical limits of the true trend of the model. It is *not* a measure of the variance expected in any particular run of the model. Thus, using it as the standard deviation in the d* test cannot be appropriate.
A measure of the variance expected in any particular run of the model is the average of the standard deviations of the runs, not the standard deviation of the average of the runs. So, this is the standard deviation that should be used in a d* test.
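To put numbers on that distinction, here is a toy calculation in Python (all values are synthetic and chosen only for illustration, not taken from any actual model):

import numpy as np

rng = np.random.default_rng(1)
n_runs = 40
# Pretend each run's trend equals the model's true trend (0.2 C/decade)
# plus 'weather noise' with a spread of 0.1 C/decade. All values are made up.
run_trends = rng.normal(0.2, 0.1, size=n_runs)

spread_of_individual_runs = np.std(run_trends, ddof=1)             # what any single run can do
se_of_ensemble_mean = spread_of_individual_runs / np.sqrt(n_runs)  # how well the mean trend is pinned down

print(spread_of_individual_runs)   # about 0.1
print(se_of_ensemble_mean)         # about 0.016, and it keeps shrinking as runs are added

The first number describes how far an individual realization can wander from the model's mean trend; the second only describes how precisely the ensemble average estimates that mean. Which of the two belongs in the test is exactly the point under dispute here.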
Now, if my prediction is wrong, I accept that my argument must be wrong. Can you point me to the data so that I can do a test? (Although it might be better if you did it, as my objectivity is likely compromised to some extent as it is my prediction and I want very much for it to be correct. )
By the way: the only rule that I want is the right one. The reason that I am harping on with regard to this point is because of the results that I am getting in my model. My model might well be poorly constructed, and that could very well be the reason.
But, if my model is a reasonable approximation, it is not unusual to get results that fail the d* test that you subjected the real data to; in fact, it is *expected*.
I am also not sure that I understand your example of the standard deviation of the averages not being less than the averages of the standard deviation. It seems that in the best possible case that you outline, it will be nearly equal. Can you elaborate a little?
(And I know that I am being a pain here and it is not your responsibility to teach me statistics. But I am learning stuff from you, and I really appreciate it.)
One other comment: do you agree that if 100 runs had been done instead of 40, the SD would have been smaller and thus the real values would have had to have been closer to the average to pass the d* test?
Don’t you think it odd that the number of times that you run the models affects how close the earth has to get to the average in order for the earth data to pass the d* test?
I mean, if they had only run the models 10 times, the earth would have been that much closer to passing the d* test …
Something is wrong with that picture, don’t you think?
And all 40 of my runs pass the d* test if the average of all their SDs is used instead of the SD of the averages. This seems to me to be strong evidence that the SD to use in the d* test on the real earth values should be the average of all their SDs. This average does not change all that much the more tests you do, either.
David
If you coded your monte carlo, tried to apply the d* test, and all your individual runs fail, then there is something wrong with your code. Santer tested the d* test and discusses the results around Figure 5. I replicated a point on that graph in about 5 minutes of coding, and I already knew this was true. So, I'm not too concerned that your monte carlo runs are contradicting the d* test. It only means you have a problem with your code.
If you want model data, you can download it from the climate explorer: http://climexp.knmi.nl/selectfield_co2.cgi?someone@somewhere
If "weather noise" were really AR(1) and all residuals were due to "weather noise", the two would be equal on average. If it's not AR(1), they might not be.
But, even if the noise is AR(1), as with all experiments, if you have finite amounts of data, the observed value will have scatter. So, you might not get perfect equality in an individual experiment.
It happens that, in the 'experiment' based on monthly data starting in 2001 and ending "now", the observed standard deviation of the model trends is less than the average of the estimates of that standard deviation based on the assumption that residuals are AR(1) noise. (Tamino has suggested that the true nature of the noise is ARMA(1,1); if he were right, the standard deviation of trends should come out greater than the AR(1)-based estimate. There is not enough data to prove his contention wrong, but the relationship happens to run in the opposite direction from what he suggested.)
David Gould–
Nothing is wrong with the picture. The uncertainty in the model mean decreases, but the uncertainty in the test never drops below the level dictated by the 'weather noise' on earth.
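Schematically (this is a rough rendering in the notation used above, not Santer's exact equation), the statistic looks like

d* = ( b{obs} − <b{model}> ) / sqrt( s{b-obs}^2 + s{<b-model>}^2 )

Adding more runs shrinks the second term in the denominator toward zero, but the first term, set by the weather noise on the real earth, stays put. That is why the test tightens up to a point and no further.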
Lucia,
Do you agree that if the standard deviation for an individual run of the model is greater than the standard deviation of the averages of 40 runs then the individual run will be more likely to fail a d* test based on the standard deviation of the averages of 40 runs?
This is beginning to sound very familiar. Didn’t we beat this horse to death back when Santer first came out? Does the name ‘beaker’ ring any bells?
Lucia,
I’m afraid that the climate explorer stuff is way above my head. I’m not even sure where to begin, I’m afraid. 🙁
David:
The result in the first sentence is strong evidence you are doing something wrong when running your monte-carlo simulations.
Dewitt-
Beaker and David are not similar. I'm not going to explain the difference, but, even though David is a bit less familiar with statistical terms, I think better of David than of beaker.
Well, I will get to work on trying to make sense of climate explorer. It may take a while.
And I accept that my simulation could indeed be very wrong.
Thanks for the help. 🙂
DeWitt,
I know that my questions can be annoying and repetitive, and in all likelihood they have been asked by others before. Lucia is under no obligation to answer my questions, and I am grateful that she is taking time to do so. My aim is to become less ignorant about statistics in particular – I am doing a unit on inference next semester, so I am trying to get my head around a few things.
David–
I can’t check your simulation. But when you do this sort of thing, you need to get in the habit of checking pesky details from step 1.
So,
1) Run your script with zero trend and no autocorrelation. Fit trends to all the individual cases and check that a) the standard deviation of the residuals is what you expected based on the noise you used, b) the residuals are normally distributed, and c) the trend is zero. If they aren't, you have a bug. Find it; fix it. Retest using some very long synthetic time series.
2) After passing test 1, run with an imposed trend. Make sure that gives you the right answer.
3) After passing test 2, still with zero autocorrelation, apply a t-test to 100 cases using the 95% cut-off. See if you get roughly 5% rejections. If you don't, you have a bug. Find it.
Etc.
Once you have something that passes all well-known, well-accepted, non-controversial tests, you can make sure you understand the d* test and check that. Bear in mind: the test works pretty well, and the graphs in Santer are correct. Currently, you are describing results that seem to contradict Santer. So, either a) you don't understand Santer, or b) your simulations have a bug, or c) there is some other problem I haven't imagined!
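As a rough illustration of what steps 1 and 3 involve, here is a bare-bones Python sketch (white noise, zero true trend, an ordinary two-sided t-test on the fitted slope; all parameter values are arbitrary stand-ins):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, sigma, n_trials = 120, 0.1, 1000   # series length, noise SD, number of synthetic cases (all arbitrary)
t = np.arange(n)

rejections = 0
resid_sds = []
for _ in range(n_trials):
    y = rng.normal(0.0, sigma, n)                 # step 1: zero trend, plain white noise
    b, a = np.polyfit(t, y, 1)                    # fitted slope and intercept
    resid = y - (a + b * t)
    resid_sds.append(np.std(resid, ddof=2))       # two fitted parameters
    s_b = np.sqrt(np.sum(resid**2) / (n - 2) / np.sum((t - t.mean())**2))
    if abs(b / s_b) > stats.t.ppf(0.975, n - 2):  # step 3: two-sided t-test on the slope
        rejections += 1

print("average residual SD (should be close to the input 0.1):", np.mean(resid_sds))
print("rejection rate at the 95% cut-off (should be close to 0.05):", rejections / n_trials)

If numbers like these don't come out right with plain white noise, there is no point moving on to autocorrelated noise, let alone to d*.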
David:
One thing I advise with these tests: try to keep track of the question being asked. It's really easy to get lost when delving into the details.
The Santer test asks: are the underlying means the same?
Other tests flung around the blogosphere ask other questions. It's very, very easy to forget which one you are asking, and to get confused in the process.
Lucia,
Thanks. I will run tests 1b and 3 (I have already done 1c and 2).
However, I am not sure how to tell from the noise what SD I should expect from the residuals.
David–
If it's white noise and normally distributed, the standard deviation of the residuals will, on average, be equal to the standard deviation of the noise you applied. We can worry about autocorrelation later!
Hmmm. I must be missing something here – this could be where my problem lies.
Some clarification on residuals: by this, I am assuming you mean the difference between the deviation from the expected temperature (which I have set at zero) and the actual temperature, which is derived from the random data.
The SDs are identical. But that is kind of built in. So, I am guessing that I must not understand what you mean when you talk about ‘residuals’.
Ah! The straight line fit … thanks. 🙂
They look good.
And they are normally distributed (to a pretty good approximation, at least, checking 200 data points).
I will look at a t-test now.
By the way: my model is *very* clumsily constructed. I know that there must be simpler ways of doing things in Excel, and I will try to learn them. Next semester, I will be learning R, which should be interesting (I have had a brief look at it, and it seemed fun).
As an example of the clumsiness, I am not using rand(0). I am using random.org to generate my random data and then copying and pasting that into Excel.
Stuck on the t-test, I’m afraid. I will do more reading on this.
Very tough, this t-test thing.
From what I can see, it is (mean of the values − expected mean) / (standard deviation / √n).
At a .05 rejection level/95 per cent confidence level for n = 100, DF = 99, which would mean that t needs to be less than 1.9842 (two tails).
At least, that is how I think it works. Am I right?
David — If I may answer for Lucia (it is about 5 a.m. her time), that is the right formula for the two-sided, one-sample t-test. So you would be using the average and s from your random data and an expected mean of zero.
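For the record, the same calculation takes only a few lines of Python (the 100 values below are just a synthetic stand-in for whatever random data the spreadsheet holds):

import numpy as np
from scipy import stats

# 100 synthetic values standing in for the spreadsheet data; true mean is zero.
x = np.random.default_rng(7).normal(0.0, 1.0, 100)

t_stat = (x.mean() - 0.0) / (x.std(ddof=1) / np.sqrt(len(x)))
t_crit = stats.t.ppf(0.975, len(x) - 1)   # about 1.9842 for 99 degrees of freedom

print(t_stat, t_crit, abs(t_stat) < t_crit)
print(stats.ttest_1samp(x, 0.0))          # scipy's built-in version gives the same statistic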
Thanks, Andrew.
I think that I have figured out my error.
It is not in my model.
It is in the fact that I do not understand Santer.
I will look more into this over the next few days.
David–Excel has tinv coded. tinv(0.05,100) would provide the result for 100 degrees of freedom, two tailed.
I’ll email you privately because this is filling comments!
Just to make sure that those interested know: I was wrong about this, big time. I was not using the second term in Santer. Dumb mistake due to not understanding it.