As I have mentioned in the past, I have been screaming at the R programming language recently. But more about why R sometimes makes me scream later. . . Today, I have a graph showing information people might think worth digesting:

I’m not going to try to discuss the implications of this graph, as that discussion would involve discussing a whole bunch of subtopics (which may come up in comments).
Are you wondering what the graph above shows? Assuming it doesn’t just show mistakes….
I’ll first describe what I did, and then explain what the graph shows:
Roughly, it shows a test to determine whether an observed trend from Jan 2001-Dec 2010 is equal to the sample mean trend computed over “n” runs from an individual model. The particular test used is a type of “z” test.
In a “z” test, we assume we know the standard deviation of 120 month trends we would compute from an infinite number of runs with an identical history of forcings prior to and through the end of any 120 month period. Of course, we don’t really know this standard deviation. Because no modeling group can run an infinite number of runs when projecting into the 21st century, it can’t be computed.
In principle, we can get an estimate by computing the variance in “n” model trends from Jan 2001-Dec 2010. Unfortunately, because the number of model runs is small, this results in a sizable uncertainty in our estimate of the standard deviation. So, to get a better estimate, I computed the variances over 9 non-overlapping 120 month periods during the 21st century and then averaged those. My estimate of the standard deviation σ of 120 month trends for an individual model is the standard deviation computed by taking the square root of the averaged variance.
In my analysis, this σ is assumed to describe the variability in a trend due to “model weather” during any individual realization of 120 month period simulated by a model. (Other variability over time would be due to the deterministic (i.e. climate) response to forcings.)
If a model was perfect in all regards, this value of σ would also describe the variability arising from “weather” during 1 realization of earth weather over any period. (Note: this does not include variability due to variations in forcing over time.)
So, assuming we know the value of σ and that value corresponds to both “model weather” and “earth weather”, the difference between the model mean trend based on “N” runs and the observed trend (one realization of earth weather) will have a variance of σ_T^2 = σ^2*(N + 1)/N, i.e. a standard deviation of σ_T = σ*sqrt(1 + 1/N).
At this point, I made a rather wild assumption: that the value of σ contains no uncertainty and that trends across model runs are normally distributed. I then computed the 95% uncertainty intervals using the ‘z’ value for the normal distribution. These are shown.
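For readers who want to see the mechanics, here is a minimal R sketch of the test just described. All the numbers below (σ, N, the model mean trend, the observed trend) are made-up placeholders, not the values behind the graph:

```r
# Hypothetical inputs -- placeholders, not the values used for the graph.
sigma      <- 0.12   # assumed-known sd of 120 month "weather" trends (C/decade)
N          <- 4      # number of runs used to form the model mean trend
model_mean <- 0.25   # model mean trend, Jan 2001-Dec 2010 (C/decade)
observed   <- 0.03   # observed trend from one surface record (C/decade)

# Under the assumptions above, (observed - model_mean) has standard
# deviation sigma_T = sigma * sqrt((N + 1)/N).
sigma_T <- sigma * sqrt((N + 1) / N)

# Two-sided 95% "z" test against the normal distribution.
z      <- (observed - model_mean) / sigma_T
reject <- abs(z) > qnorm(0.975)
c(z = z, p_value = 2 * pnorm(-abs(z)), reject = reject)
```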
So, under these assumptions, we can see that
- the observed trend computed based on HadCrut is outside the ±95% confidence intervals for 4 models,
- the observed trend computed based on NOAA/NCDC is outside the ±95% confidence intervals for 3 models,
- the observed trend computed based on GISTemp is outside the ±95% confidence intervals for 1 model.
(Caution: Before someone says “Wow, 4!” and thinks “binomial theory can be used to estimate the probability of 4 rejections out of 10 if all models are ‘perfect’ ” … no… don’t do that. The tests aren’t independent. I’ll be discussing that further at some point.)
This is one test I will be discussing further in later posts. I’m also getting to the point of assembling another set of tests that will use time series analyses to estimate the standard deviation of earth trends. So, we’ll be seeing more graphs. But since I got this one done and feel comfortable enough to at least answer the questions that seem immediately obvious, I thought I’d show it.
More later.
Update: I thought I should display sensitivity to start year by showing results if we begin in 2000 as well. 
Lucia – Something seems wrong with your labeling versus your 3 bullet points. From the graph, I get NCDC outside 95% in 4 models, CRU outside 95% in 3 models, and GISS outside in 1 case.
Overall, my interpretation is that, as a whole, the existing climate models tend to overestimate the mean warming trend. In all but one case, the model’s mean trend is greater than all three of the indices.
BobN– You’re right. I had “dark green” and “blue” reversed in the “lines” command!
Lucia
I like your thinking. When you are an expert in R, would you pop over to France and give me some lessons, please?
stephen–
uhm… I love France. But pop over to teach you R? I’d rather pop over to do a tour of the wine country.
Shouldn’t the 3rd bullet point refer to GISTemp?
S.Geiger–Yes. Fixed. Clearly, not proofreading well this morning.
I’d rather pop over to do a tour of the wine country.
You mean la Gironde, Bordeaux, which I live close to? Good wine at €2.50 a bottle; $1.92 a bottle. Today it was 18°C in the shade and about 30°C on the terrace. Just the ‘climate’ for a glass or two.
Lucia,
I’m not sure I understand this. Do not the 120 month variances include whatever overall trend was present? If you are trying to establish if the recent trend is statistically different from the model trends, doesn’t that have to take into account how the “forced” warming increased with rising GHG’s? I mean, comparing an early 20th century period (with little change in GHG forcing) to the recent trend (with rapidly increasing forcing) seems to me to give an estimate of the variance that is not representative of what is expected now. I hope I am making this clear enough.
Stephen Richards,
US$3.50 I think, but still, very inexpensive. I don’t see French wines in the States for that price.
SteveF–
By definition, the sample variance of the trends over the “n” runs during period “i” is <x’_ij^2>, where x’_ij is the difference between the j-th trend x_ij and the sample mean of the x_ij over all j for period i.
So, the average trend for period ‘i’ doesn’t affect the variance. The average trend over a number of runs would be an estimate of the deterministic response based on the forcings and the initial condition for that period. The variance then estimates the “weather”.
Did this answer your question?
SteveF–
FWIW: Using 21st century model data, I ran Kruskal-Wallis tests to test these hypotheses:
1. variances in trends are identical for all models. (Reject)
2. variances in trends are identical for all periods. (Accept)
3. trends themselves are identical for all models. (Reject)
4. trends themselves are identical for all periods. (Reject)
The fourth test relates to what you are discussing. If computing the variance for the 120 month period did not remove the portion due to the mean trend, the fact that (4) is a reject would pollute my test because part of the “variance” would be due to the evolution of the deterministic trend.
Conversely, if the outcome of 4 was “fail to reject”, that would suggest that I could just compute the variance due to weather by assuming the “true” ten year trend is a constant for all periods. It’s not. In fact, if it was, well… we wouldn’t be discussing AGW. The entire premise of AGW is that the GHGs affect the mean trend.
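To make the mechanics concrete, here is a rough R sketch of how Kruskal-Wallis tests of this kind can be run. The data frame is synthetic stand-in data (models, periods and trends are all invented), not the actual model output, and the squared-deviation approach for the variance hypotheses is one possible choice, not necessarily the one used here:

```r
# Synthetic stand-in: one 120 month trend per run, labelled by model and period.
set.seed(1)
trends <- data.frame(
  model  = factor(rep(paste0("model", 1:3), each = 9 * 4)),
  period = factor(rep(rep(1:9, each = 4), times = 3)),
  trend  = rnorm(3 * 9 * 4, mean = 0.2, sd = 0.1)
)

# Hypothesis 3: trends themselves identical across models.
kruskal.test(trend ~ model, data = trends)

# Hypothesis 4: trends themselves identical across periods.
kruskal.test(trend ~ period, data = trends)

# For hypotheses 1 and 2 (equality of variances), one option is to apply
# Kruskal-Wallis to squared deviations from each model/period group mean.
trends$dev2 <- with(trends, ave(trend, model, period,
                                FUN = function(x) (x - mean(x))^2))
kruskal.test(dev2 ~ model, data = trends)
kruskal.test(dev2 ~ period, data = trends)
```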
Lucia, you have done a “Douglass et al” by not showing the error bars for the observed values. I just scanned your post but I did not see the Nychka correction of trend SE due to autocorrelation.
Also, don’t you have to correct the degrees of freedom when you use overlapping data?
Maybe all of this is accounted for in your methods and I did not see it on first pass.
Would not the logical conclusion to all this be that the scatter of model means and model variances is so great as to not be worth much in the kind of comparison you are making here? If someone could show a method or rationale for selecting a best model, or at least a few best models, the comparison might mean something more.
Kenneth-
No. Instead of showing separate error bars on the observations and the models, the error bars around the model are pooled to include the error in the observation and the model mean. So, the variance is computed as σ^2*(1 + 1/N), where
σ^2 is the variance for 1 realization of “weather”. This is actually what you would see around the observation if I put error bars there.
σ^2/N is the variance due to uncertainty in the model mean, and is based on the “N” runs from the model. (Model runs range between N=2 and N=7.)
So, the distance between the two points should be compared to sqrt(σ^2*(1 + 1/N)) multiplied by the “z” value for the normal distribution.
Or, looked at another way, even before we have access to any observations, we can say any individual weather realization should fall within z*sqrt(σ^2*(1 + 1/N)) of the model mean based on ‘N’ runs. So, we can draw the error bars. The “1” accounts for the weather. The “1/N” is for the uncertainty in the model mean.
Then, when you get a weather realization (or a future model realization from a run not yet obtained) you can put it on the graph and see if it falls within the uncertainty intervals. You don’t add uncertainty again.
So… not a Douglass. (In Douglass, the error bars around the models only had the 1/N term.)
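A short R snippet makes the difference from a Douglass-style interval concrete (σ and N here are placeholders, chosen only for illustration):

```r
sigma <- 0.12   # hypothetical sd of 120 month "weather" trends (C/decade)
N     <- 4      # hypothetical number of runs behind the model mean

# Half-width of the pooled 95% interval used here: weather + model-mean uncertainty.
pooled_halfwidth   <- 1.96 * sigma * sqrt(1 + 1/N)

# Half-width of a Douglass-style 95% interval: model-mean uncertainty only.
douglass_halfwidth <- 1.96 * sigma / sqrt(N)

c(pooled = pooled_halfwidth, douglass = douglass_halfwidth)
```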
Kenneth
We can test models individually even if some are crap and some are good. The goal is to eventually test whether some models are obviously off and whether that can be shown. (I think at least one is.)
Lucia,
“Did this answer your question?”
Yes. I did not understand that the trend was included in the calculation. I assumed it was just based on difference of each month in a 120 month period from the average for that period.
Shouldn’t there be variance/ci in the surface records trends that is not displayed?
I have to think about what you have done here, but isn’t it a bit difficult to attribute errors and understand the comparison by combining the model and observational errors?
What about autocorrelation and overlap adjustments to degrees of freedom?
Douglass made two major errors in that he did not compensate for autocorrelation and did not put error bars on the observed trends.
Instead of showing separate error bars on the observations and the models, the error bars around the model are pooled to include the error in the observation and the model mean.
.
I see your reply to KF 3:15 should address my question as well. But I still don’t understand a piece. Did you take the average variance in all the obs?
I would suggest there’s a basic problem.
Let us say:
All/(each of) the models are/is of the form y=x(n) where x is a constant, n denotes a particular x (y is anomaly [whatever]). SD of the ensemble is ZERO! No matter how many times they run them, SD = 0. SD is only changed if the mix of model runs is changed. Run each once, thrice, ten or a million times, if each model is run the same number of times, SD = 0.
Any variability that arises is programmed into the models. Meaningless!
Provide something showing SD (confidence limits) of real data.
Not for this test and choice of method of display.
Ron–
The 21st century is divided into 9 non-overlapping 120 month periods. (Some months are left over.)
For model 1 with 4 runs, I did the following.
For period 1, I:
* computed 4 trends, (t1, t2, t3, t4). These are now just numbers. I then computed the sample mean trend and the sample variance in trends for this period in the ordinary way. Call these T1 and VAR1.
So, T1 is an estimate of the “true” trend we would get if we could run a zillion runs. VAR1 is an estimate of the “true” variance about the mean we would get if we ran a zillion runs. So, VAR1 is an estimate of a “weather” quantity.
Another way to look at it: VAR1 is an estimate of the repeatability of T1 if the modeling group had run a zillion runs. That is: it’s the estimate of the variance in the trend due to “model weather”. In contrast T1 is an estimate of the trend arising from “climate”.
I don’t have any data to improve the estimate of T1, the trend computed from 2001-2010. That’s just the ‘dot’ for a model.
But I want a better estimate for VAR. To get a better estimate I made the following assumption:
Even though ‘T’ varies from period to period (specifically, warming rates depend on forcing), the “true” value of VAR (a “weather” variable) will not vary from period to period. Based on that assumption I did this:
* repeated this for each period ‘i’, and so had a collection of Ti with i from 1 to 9 and VARi with i from 1 to 9.
* I then found the average of the VARi over the 9 periods. This is the model’s average VAR based on the sample of 9 periods. I then took the square root of that average. That gives me what I use for σ in the post.
This value is an estimate of variability in a population of computed 120 month trends for any period. So, it’s a weather value.
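Here is a sketch of that bookkeeping in R. The matrix of trends is synthetic (random numbers standing in for the actual per-run, per-period model trends), so only the structure of the calculation is meant to be illustrative:

```r
# Synthetic stand-in: trends[j, i] = 120 month trend for run j in period i.
set.seed(2)
n_runs    <- 4
n_periods <- 9
trends <- matrix(rnorm(n_runs * n_periods, mean = 0.2, sd = 0.1),
                 nrow = n_runs, ncol = n_periods)

# Per-period sample mean trend (T_i) and sample variance about it (VAR_i).
T_i   <- colMeans(trends)
VAR_i <- apply(trends, 2, var)

# Average the nine variances, then take the square root: this plays the
# role of the "model weather" sigma described in the post.
sigma <- sqrt(mean(VAR_i))
sigma
```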
Rarm
This is incorrect. Runs from the same AOGCM give different results based on differences in initial conditions.
I will be showing uncertainty intervals for the observations based on data in a separate analysis.
@Lucia 4:49 pm
Any variance is purely a result of programming – different initial conditions or not. Same initial conditions = same result (aside from programmed randomness). What’s the pdf of initial conditions? Determined by the modeller?
QED.
@Lucia 4:49 pm
The “black box” model has been presented from both (each?) sides of the climate divide (Eschenbach on the skeptic side, I forget who on the ‘warmist’ side, [recent] good paper). Both show that a simple deterministic model tracks GISS-E (or any other IPCC-used model) to r^2 well over 95%. Beats anything else I’ve seen in Climate Science.
If it’s deterministic (or can so be modelled), variance, hence SD = 0. Any output represents only the choice of initial conditions.
@Lucia 4:49 pm
Sorry, really cluttering up your blog. But my thinking develops (“Bear of very little brain” – slow).
Initial conditions are NOW, surely? What hypotheticals do they imagine? Start with Venus? Mars? Jupiter? 500,000 years ago?
Puh-uh-rlease!
Rarm–
I mean initial conditions for the model run. Model runs for the AR4 were generally initiated from different points in a long spin up with conditions from pre-1900. The modelers sampled from different years to get the individual runs started at different points of any oscillations like AMO, PDO, ENSO etc. So, the “weather” is different in each run, but the average across runs should give a climate response.
So the climate in 2100 depends upon the weather in a year pre-1900.
And they didn’t even know about AMO, PDO, ENSO &c. a few years ago. They certainly didn’t understand them! They still don’t.
Come off the grass!
Rarm–
No. Weather in 2100 is affected by weather pre-1900. And, guess what, weather really is affected by initial conditions.
But it’s climate in 2100 that they try to scare us with! Not weather.
And any Met Office (US, UK, Aus, NZ) has a truly abysmal record in forecasting weather at much more than ‘look out the window’ time-frames.
Hey, you and I don’t disagree [much!] philosophically. I just have a problem with variances calculated on model outputs.
I’ve raised at least 6 points earlier. Answers?
Rarm– Maybe. But you are discussing an issue that is not relevant to the content or point of my post.
Here’s a real question: What don’t you like about variances being calculated based on model outputs?
I would want to know why there is such a large variance in the average trend between each model. What made an individual model have an average as low as 0.03C or one have an average as high as 0.36C. Probably impossible to answer but that is the question the chart is asking.
And at the end of the day, what is one really testing? I’m more concerned about warming by 2100, not whether some model had a random downturn between 2001 to 2010 and somehow the variance magically stayed within the -0.03C Hadcrut trend. Perhaps expand the timeline for the models so that it more accurately reflects what the model is really saying ie. 0.2C per decade that results in +3.0C warming by 2100. Eliminate the random downturns or upswings and focus on what the model is really predicting.
lucia (Comment#71515)
If you show them – relevant. If you discuss how they are derived – relevant. If you discuss the methods of those behind the figures from which they are derived – relevant. If you discuss whether the figures as derived affect past, present or future predictions (sorry, scenarios) – relevant. O/T – well, your gaff, your rules, but I don’t think so.
To answer your question – models reflect the pre-conceptions (prejudices, whatever) of the programmer. I spent more years than I prefer to remember in building models on a (then – I bow to Moore!) supercomputer.
Any variance in the output is purely a function of the program and the initial conditions. Anything calculated therefrom is purely a function of the program and the initial conditions.
Which of those is not [an arbitrary] choice by the runner of the program? How is anything calculated from the output not purely a function of the program and the initial conditions? Purely a function of/input by the programmer/runner?
My objection? Circumspice
So, T1 is an estimate of the “true” trend we would get if we could run a zillion runs. VAR1 is an estimate of the “true” variance about the mean we would get if we ran a zillion runs. So, VAR1 is an estimate of a “weather” quantity.
Another way to look at it: VAR1 is an estimate of the repeatability of T1 if the modeling group had run a zillion runs. That is: it’s the estimate of the variance in the trend due to “model weather”. In contrast T1 is an estimate of the trend arising from “climate”.
.
Okay … but what are the three surface records doing up there? My understanding of surface record trend +/- 2sd for short periods like 10 or 30 years is not that this is a well-defined trend with weather noise, but that the trend itself is uncertain. So that, for instance, the GISTEMP 10 year trend 1991:2000 is 0.23C/decade +/- 0.29C/decade. In other words, there is a 95% chance that the actual trend in 1991:2000 is between -0.06C and 0.52C. The sd in this case is not related so much to weather noise as it is to the relative scarcity of data points (10). So I’m not sure how you can slap a surface record trend of 0.08C/decade (or whatever) up there when the “actual” trend could differ by as much as +/- 0.2C (or whatever).
Ron–
I’m going to have to do a post to explain. I was planning to explain it later– but I didn’t expect people to not understand this.
The short answer is if I have an individual model to test, I can show the uncertainties two ways:
1) I can show error bars that are intended to convey the range in which we would find the climate trend. When I do that, I need error bars around both the model mean trend and the observations.
2) I can also show error bars that are meant to convey the region around the multi-model mean in which we expect to find all weather realizations. Then, we can show the weather realization for the earth as a point with no error bars! (Well… unless we get some measurement error. But that’s not what you mean with the ±0.29C/decade.)
You are expecting me to present error bars of type (1). I’m presenting error bars of type (2).
Error bars of type (2) and (1) are related. I’m going to write down how they are related under the assumption that an individual model reproduced the observed weather noise perfectly, and the standard deviation of 10 year model trends for that model is σ, and the model mean trend was computed based on “N” runs.
Given these things, if I made uncertainty bars following the convention for “type (1)” error bars, the uncertainty bars around the model mean would be ±1.96 σ/sqrt(N). This goes to zero as N -> infinity. But that’s ok because these uncertainty intervals enclose the climate trend, not the “weather”. The uncertainty bars around the earth’s observation would be ±1.96 σ. You’d be happy seeing the big error bars around the earth’s observation.
But what if I make uncertainty bars following convention (2)? I now want to know all possible realizations of weather that are consistent with the model mean. I don’t actually need to know the weather for this. I can just draw uncertainty intervals equal to ±1.96*σ*sqrt(1 + 1/N) around the model mean. The 1/N part arises from my being uncertain that I have the correct model mean (owing to having a finite set of runs). The “1” is the ‘weather’ region– and so the part you are expecting to see around the GISS trend.
So, my graph does reflect the bit you expect to find– it’s just not where you are expecting it. (And that’s partly because this is a “z” test, and I’ve estimated the σ from the model– not from the time series of the earth weather.)
But, never fear, I’ll be making some the other way too. I just need to get the time series bit running tomorrow or Friday. 🙂
I think I get the part where you have moved uncertainty from the obs trend to the model trend. It seems to be an odd move, but I follow you. What I don’t get is the part where you state that uncertainty in obs trend is ‘weather.’ The surface of the planet over the last 10 years represents one ‘reality’ of temperatures. Yet we have slightly different trends and variances for each of the surface-records. These differences are not due to weather. There is only one set of actual weather/climate events over this time span to observe. Rather, the differences in the surface records must be an uncertainty due to measurement+method. As long as these differences fall within a reasonable span (overlapping CI) of each other, no harm, no foul. If they start to diverge, it should prompt reexamination. So it is the statements relating variance to weather (variance from climate means) that I am questioning since some fraction seems to me to be related to observational (measurement+methods) uncertainty.
Lucia – given all these questions regarding error bars, auto-correlation and the like, are there non-parametric tests that might resolve some of these questions, even if they are of a lower power than parametric tests?
Ron, you can Monte Carlo the effect of short-period climate fluctuations (I personally wouldn’t call anything that is averaged for longer than about a week to be “weather”), based on past variations of climate, so I do think it’s possible to make an estimate of the effect of this on the uncertainty of the trends.
For example, here’s a stab at it, using the measured spectral variation in GISTEMP over the last 100+years:
Monte Carlo trend
I’m assuming a 0.02°C/year trend + the observed fluctuations from the trend; I generated 1000 realizations of the short period climate “noise”, then fit to a 1-decade period. The red arrow is the “actual” OLS trend (not removing any of the variability first).
The probability that you could have a low-side outlier of this magnitude is around 0.01… In other words, you’d expect a decade to fall this far below the distribution about once in a millennium.
These numbers shouldn’t be taken totally seriously; in particular I haven’t done any sort of verification stats on how well my generated Monte Carlo describes the observed temperature variations. So this exercise should be viewed as an illustration of how one might do the Monte Carlo analysis, and also to suggest that there is something very odd about the last decade that probably isn’t explainable purely by short-period climate fluctuations.
(In playing with this, I discovered a curious oddity with the nonrandomness of the behavior of the UNIX random() function with respect to its initial seed value.)
I agree that comparing the different surface temperature series doesn’t tell you much about this sort of uncertainty—my tired brain may have missed it but I didn’t see where Lucia claimed that either.
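For readers who want to try something along these lines, here is a simplified R sketch of that kind of Monte Carlo. It uses AR(1) noise as a stand-in for the spectral model Carrick actually fit, and the trend, noise parameters, and observed value are all invented placeholders:

```r
set.seed(3)
n_months   <- 120
n_sims     <- 1000
trend_true <- 0.02 / 12   # assumed 0.02 C/yr trend, expressed per month
phi        <- 0.6         # hypothetical AR(1) coefficient for monthly noise
sd_innov   <- 0.1         # hypothetical innovation sd (C)
t_months   <- 1:n_months

# Generate synthetic decades (trend + AR(1) "climate noise"), fit an OLS
# trend to each, and collect the decadal slopes.
sim_trends <- replicate(n_sims, {
  noise <- as.numeric(arima.sim(list(ar = phi), n = n_months, sd = sd_innov))
  y <- trend_true * t_months + noise
  coef(lm(y ~ t_months))[2] * 120   # slope converted to C/decade
})

# Compare a hypothetical observed decadal trend to the simulated distribution.
obs_trend <- 0.03                 # placeholder observed trend (C/decade)
mean(sim_trends <= obs_trend)     # empirical one-sided probability
hist(sim_trends, main = "Simulated decadal trends", xlab = "C/decade")
```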
Stephen Richards,
US$3.50 I think, but still, very inexpensive. I don’t see French wines in the States for that price.
You are spot on of course. My excuse? I was tired. Got $3.50, thought that was too expensive and changed it. Fortunately I don’t sell wines to the US.
Rarm
.
Any variance in the output is purely a function of the program and the initial conditions. Anything calculated therefrom is purely a function of the program and the initial conditions.
Which of those is not [an arbitrary] choice by the runner of the program?
The former. At least partly. The programmer generally hopes to program something that is not completely arbitrary, e.g. he hopes that the numbers computed have some remote and not arbitrary relationship to the reality that the program is supposed to simulate. Different programmers will do different programs but this is not completely arbitrary.
.
However you are completely right for the latter.
This arbitrary choice leads to arbitrary results and the only justification I have ever heard is that “it doesn’t matter”. Magically the statistical properties are postulated independent of this arbitrary choice.
Technically this should be a discussion about ergodicity but this one never takes place.
Once one understands that this is a postulate which can never be proven by experimental evidence because we have only 1 Earth and running models is not experimental evidence, then if one doesn’t agree with the postulate, f.ex because one thinks that initial conditions matter on all time scales, one will stop posting on similar threads.
.
Normally I wouldn’t comment but it seemed to me that you were relatively new to this issue so I wanted to move your learning curve a bit faster 🙂
What Lucia is generally doing are internally self-consistent studies.
She generally doesn’t speculate (much) whether Axioms A, B, C are true or not.
She says if Axioms A, B, C are true, then here is what follows.
So if you happen to reject Axioms A, B, C then obviously you don’t agree with the results but any comment in that direction would be kinda off topic because the purpose is generally not to discuss A, B, C but the validity of a statement “If A, B, C then X”.
.
BobN–
I’m going to write a post explaining. It will use synthetic data. The test is the “z” test– so parametric. The difficulty has nothing to do with parametric vs. non-parametric but with what these uncertainty intervals mean.
Ok. If your question is the uncertainty due to (measurement+methods), then the answer is:
* In my graph, the uncertainty due to method is illustrated only by making the comparison to all three records instead of just one– i.e. HadCrut, GISS and NOAA. That’s it. Otherwise, I guess I’d have to have Zeke set up his temperature trend methods to include random selection of ‘corrections’ and processes, create an “ensemble” of observations based on different methods and show that. I’m not going to do that. (I bet Zeke wouldn’t be interested anyway.)
* the uncertainty due to measurements of single thermometers is thought to be small. If we added that as white noise onto HadCrut, GISS and NOAA themselves, that uncertainty on the trend would be small.
* the uncertainty due to the sparse data set? Or possible trend in the bias of the measurements (i.e. UHI? ) Both those end up covered in the range of methodologies for extracting the monthly values.
Whether or not the data sets differ from each other is an interesting question, but for testing the models, I’m pretty much capturing the uncertainty in method/measurement of observations by making comparisons to the three established measurement records that we have. That’s it.
I didn’t. But originally, I didn’t understand Ron was asking this. I thought he was asking what Kenneth asked. My answer is: That uncertainty in measurements/method is reflected by comparing to all three observational records: NOAA, HadCrut and GISS temp.
Lucia, Carrick – thanks. I’ve been considering what you’ve said. I’ve got two questions: one technical, one .. something else.
.
First, is the variance in the models time-invariant?
.
Second. GISTEMP, HadCRUT, and NCDC are all attempts to measure the overall (area weighted, integrated) warming (cooling) of the earth from some baseline. The fact that each differs indicates that no two of them can be said to be a completely faithful representation of that warming, and probably none of them are.
.
Similarly, one of the parameters of the models that can be measured/recorded/noted is the modeled temperature of a particular grid which in turn can be area weighted and integrated and compared to a baseline to derive a predicted/modeled/forecast future global anomaly. But they are not attempting to directly recreate GISTEMP, HadCRUT, or NCDC.
.
So I still have a sense that there is a comparison gap here. Not apples to oranges – but Golden Delicious to McIntosh.
.
I think my concern is addressed by Lucia here: “I’m going to write down how they are related under the assumption that an individual model reproduced the observed weather noise perfectly.” I doubt that the models reproduce the variance of the various surface records perfectly – and that neither reproduces the variance of the actual temperatures on the Earth’s surface perfectly.
.
No rush to answer. I know that Lucia is working on another post to go into this in more detail.
lucia,
So your error bars are prediction intervals for the data rather than confidence intervals for the trend? Are you using 1.96 rather than the t statistic appropriate for the degrees of freedom or is the difference too small to matter?
Ron:
Some of the newer models designed to capture ENSO might do a decent job.
Regarding ergodicity, I think that is a problem here: If we believe that CO2 forces climate, then increasing CO2 (and other anthropogenic forcings, like particulate emissions and surface land usage changes) likely affects the fluctuations that are observed. So it’s not truly ergodic, though you should be able to model how climate fluctuations are affected by anthropogenic activity.
I think Rarm raises a valid point.
The magnitude of the SD in the model runs, and hence the size of the errors bars, depends on the magnitude of the SD of the initial conditions chosen by the modellers for their different runs.
So suppose I am in charge of model 3. Your graph shows that HADCRU is outside my 95% range. I can fix this by doing more runs with a wider spread of initial conditions which will widen my error bars.
So the answer to your question “What don’t you like about variances being calculated based on model outputs?” is that the variances are determined by choices made by the modellers.
Since we are modelling 2001-2010, the initial conditions ought to be the state of the world’s climate in 2001 as accurately as it is known.
To investigate all this properly, one would need to look very carefully at how the initial conditions were set in the model runs.
Also you’d need to look at how rapidly small changes in the IC’s diverged, in other words how chaotic the models are. (I’m assuming above that the climate models are not sufficiently chaotic over 10 yrs to ‘wash out’ the initial conditions). This is basically Ron’s Question 1 above, which is a very important question – does the model SD increase (chaotic case), decrease (stable damped case) or stay about the same as t increases?
Re: PaulM (Mar 10 08:44),
Except that’s not how it’s done. See lucia’s comment above. You can’t initialize a model for a given year. The data needed to do that doesn’t exist.
There’s every reason to believe that the models aren’t chaotic. When not forced they don’t drift, or at least the runs that are used don’t. I’m pretty sure they have spinup failures where conditions become wildly different from the real world. Those are discarded.
Ron–
Very good question! I’ve assumed this so this clearly needs to be tested. So far, I have done exactly 1 test (mentioned in lucia (Comment#71479) )
The test is a Kruskal-Wallis test. This tests the hypothesis that the median variance for Periods 1-9 is the same for all periods. I either reject or fail to reject that hypothesis. I get “fail to reject”. So… at least so far the only test I’ve done does not contradict that. But I plan more tests, as I think this is an important assumption and your question is a good one.
I agree. So, at a minimum, I have to compare to all 3 sets, and people can judge what they think based on comparisons to all 3, which I do. (If I had “Zeke temp”, I’d add that. I suspect we may soon have “Zeke temp”). Each person might decide how to interpret the information in the graph differently. So, for example, you might favor the methodology in GISTemp because it tries to account for missing data at the poles, and notice GISTemp falls inside the range for all but one model and give that more weight in your mind than the larger number of outliers for Hadley etc.
Mind you: I would like to estimate statistical uncertainty intervals due to methodology, but I firmly believe it can’t be done based on the difference in GISTemp, Hadley and NOAA. The reason why is that, to some extent, the difference in methodologies is in no way “statistically independent methodologies drawn from all possible methodologies”. To at least some extent, each succeeding one exists (and was both funded and publishable) to the extent that a scientist and peer reviewers thought the suggested tweaks in methodology would have a noticeable impact on the results. So, to some extent, I think different methods tend to represent the extremes of the range owing to different methodologies.
If you download all the model data and create a monthly series based on GISTemp’s methodology, you get a much more “granny smith to granny smith” comparison. I haven’t done that.
I think Chad had done that. I haven’t. So… not showing it. (If I did, I’d have to split the graph above into 3 because each “model” result would have a slightly different mean and variance.)
So the answer to that is: No. That’s not there. For now, when judging what the graph tells you, that issue would be “a question”.
I agree. But this situation is not qualitatively different from nearly any comparison between ‘observations’ and ‘theory’ because, like it or not, even in the best possible experiment ‘observations’ are generally imperfect at detecting ‘reality’ for various reasons. Unfortunately, the impact of these differences can’t always be quantified.
I’m happy to quantify the uncertainties that I can possibly quantify. But… there are some that, really, I can’t. It has to be left to the judgement of the individual person who understands what is quantified and what is left out. So… some of what you point out is just in the list of “what is left out”.
Oh… I doubt it too. Moreover, I don’t think anyone believes it. So, in some sense, the test would be what Nielsen-Gammon might call a “void” hypothesis. I’m testing something no one believes.
But in frequentist statistics, hypothesis tests always assume something is true. So, one can make a test and then say what the result is. Maybe you won’t be surprised by the result… but this test makes those assumptions, and I’m actually doing it for a reason.
In other tests I’ve discussed here, I’ve assumed the variance of the earth’s surface temperatures can be obtained by analyzing the time series, using red noise. So, using that assumption, I then can test whether the multi-model mean of a model is consistent with the earth observation.
But one criticism of that test, where I do not assume the variance of the “weather” in the models is correct, is that using the earth weather I get variances that are smaller than in the models! Presumably, the response to that is: “Ok. I’ll repeat the test, but this time using the variance from the models.” That gives the result above. (I anticipate the result above is more favorable to the models than estimating the uncertainty in the “earth” climate trend using red noise. But I haven’t pumped this through R yet, so… we’ll see.)
(There is a third criticism: that I should use variances based on earth weather but not use red noise. Now that I’m up to speed on R, I’ll be showing what I get with a different ‘noise’ model to estimate the uncertainty due to earth weather and also talk about how those methods work if applied to “model data”. As you can see, I have “parts” coming up. Some of the long delay was getting up to speed on R to do all of these.)
Oh– on the measurement error bit. I thought about this a bit last night. I did write an R bit to add “white noise” with a standard deviation of dT to the monthly observations and see how much difference that makes to the determination of the trend. I ran with dT=±0.1C– which I think represents a very conservative upper bound for what to use– and I get roughly ±0.003 C/decade for this uncertainty. But… obviously, I need to lay out the argument for what I select for dT, why I use “white noise” etc. So, for now, I’m just leaving that uncertainty off the graph. (But it’s useful to discuss because people are going to wonder if the uncertainty due to that is large or small.)
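Here’s a sketch of that kind of perturbation experiment in R. The monthly series below is synthetic (a placeholder standing in for an actual observational record), so the numbers it produces are illustrative only:

```r
set.seed(4)
n_months <- 120
dT       <- 0.1              # assumed measurement noise sd per month (C)
months   <- 1:n_months

# Synthetic stand-in for an observed monthly anomaly series.
base_series <- 0.015 * months / 12 + rnorm(n_months, sd = 0.08)
base_trend  <- coef(lm(base_series ~ months))[2] * 120   # C/decade

# Add white measurement noise many times, refit, and look at the spread
# in the fitted trend attributable to the added noise.
perturbed_trends <- replicate(1000, {
  y <- base_series + rnorm(n_months, sd = dT)
  coef(lm(y ~ months))[2] * 120
})
sd(perturbed_trends - base_trend)   # spread of trend due to measurement noise
```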
“Once one understands that this is a postulate which can never be proven by experimental evidence because we have only 1 Earth and running models is not experimental evidence, then if one doesn’t agree with the postulate , f.ex because one thinks that initial conditions matter on all time scales , one will stop posting on similar threads.”
The metaphysical aspects of measurements and comparisons, such as yours here, Lucia, always come up, i.e. climate is chaotic and depends on initial conditions and the earth is simply one single representation of many equally possible outcomes, as are the individual outcomes from climate models. Whether you agree with this view or not, Lucia, would it not be a good starting place to merely show all the model trend outcomes in a vertical row for each model and the observed value from one given data set (the data sets are not independent unless you are using satellite and surface records) and some reasonable error bars?
The other problem with these comparisons is that using a scenario depends on attempting to match the levels of GHGs and other forcings of a given scenario to what actually existed over the time period of interest. A model’s validity has nothing to do with the scenario matching, and the best case comparison would require using the actual conditions that existed for the observed time period. My question is: are actual-condition model runs available and, if not, why not?
The zero SD argument does not hold if we are using the same methods for determining the observed and model trend uncertainties (error bars) which is simply the error of placing a trend on the time series. The error bars for the models should then also be presented with the individual model trends – remembering also to include the effects of autocorrelation on the error bars.
I am guessing that the uncertainties of the observed trends and model outputs over a 10 year period, including the effects of autocorrelation, would be shown to be so large as to make any conclusions about the comparisons impossible – outside a qualitative view that “on average” the models show some bias to the observed. One would also have to come to terms with the scenario and real world differences that can affect the model outcomes.
I have always thought that doing comparisons like Douglass attempted and Santer did (although not over the entire time period for which data were available) got around some of the problems noted above in that the comparison was between (when done properly) two aspects of the same chaotic outcome, i.e. the temperature trends of the surface and the troposphere in the tropics.
TomVonk
I’d describe it more like this:
If Axioms A,B,C are collectively true, then we expect to observe “D falls in some range” (with some probability, generally I use 95%.)
So, we then observe D and see if it fell in that range.
Then: If D does not fall in the expected range, we reject the notion that (A,B,C) are all collectively true. The data are strongly suggesting that at least one of (A,B,or C) are untrue.
Tom,
Well… I can certainly tabulate the data. I’m not a big fan of tables. What do you think you would see from the table rather than a histogram or something?
The test is of a “model” that is presumed to be not only the GCM but a group’s ability to forecast forcings. It’s true my test doesn’t separate these. The “model” could be wrong because the forcings are wrong (for whatever reason) or it could be wrong because the physics are wrong.
With respect to making policy decisions, the fact that a method relying on an AOGCM is wrong because the group that collectively picks forcings can’t do so is just as interesting as the AOGCMs themselves being wrong.
But even if this were not so, the effort to detect why the models’ predictions are wrong isn’t particularly worth undertaking unless we have good reason to believe they are wrong. So, I don’t think the fact that the reason for incorrect results might be forcings and not physics is a reason to avoid the test.
I don’t know what you mean to communicate here.
Replicate runs of models do not have variability that is so large as to make comparison impossible. I don’t know what “autocorrelation” you are concerned about. Do you think when Gavin runs his models and creates a monthly time series, that the trend from 2001-2010 computed for run 1 is correlated with the trend computed for run 2? Or do you mean something else?
You addressed Tom when you were replying to my comment about portraying the model versus observed data. What I suggested was to continue to use your graphs but instead of putting a line representing standard deviations, merely place the model results in a vertical line so we can see the range and scatter of results.
In addition I suggested that you show the error that you would get from any time series where you compute a trend i.e. the error in the trend slope. That error calculation is influenced by the autocorrelation in the series residuals from regression over time and the degrees of freedom need to be adjusted. You could thus show the trend error for each model result.
The fact that we evidently are talking past each other over this matter leads me to believe that either I do not understand what you are attempting to do or you do not understand what I am talking about. Since the former might be more likely perhaps you could show your R code and I can see for myself what you are doing.
In zero SD, I was referring again to computing an estimated error in the trend given the series data and autocorrelation of the regression residuals. This applies to any series – be it observed or modeled data.
Ok. I decided to more carefully read your thread introduction and I see that your model error limits are derived from an estimate of the standard deviation of the model trend means, using an average SD of trends from 9 non-overlapping 120 month time periods. Obviously a model run of 2 does not have a standard deviation. Do you use the method below to calculate the average standard deviation?
Standard deviations of non-overlapping (X ∩ Y = ∅) sub-populations can be aggregated as follows if the size (actual or relative to one another) and means of each are known:
http://en.wikipedia.org/wiki/Standard_deviation
What you are then looking at is the mean result of a model trend and the variation around that mean for a given model, ignoring the estimated individual trend error around the mean. That would make sense if the error around the individual model run trends is small compared to that between model means. Is it?
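For reference, the sub-population aggregation Kenneth links to can be written out in R roughly as below. The groups are synthetic, and whether this matches what the post actually does (a simple average of per-period variances) is exactly the question he is asking:

```r
# Synthetic sub-populations (e.g., trends from different periods).
set.seed(5)
groups <- list(rnorm(4, 0.20, 0.10), rnorm(3, 0.25, 0.12), rnorm(5, 0.18, 0.09))

n_i  <- sapply(groups, length)
mu_i <- sapply(groups, mean)
v_i  <- sapply(groups, function(x) mean((x - mean(x))^2))  # population variances

# Aggregated sd of the pooled population, per the linked formula.
mu_all  <- sum(n_i * mu_i) / sum(n_i)
var_all <- sum(n_i * (v_i + mu_i^2)) / sum(n_i) - mu_all^2
sqrt(var_all)

# By contrast, the post averages the per-period variances about their own
# means and then takes the square root.
sqrt(mean(sapply(groups, var)))
```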
Kenneth
In words, tell me what questions such a graph would answer? I am doing a “z” test and for those, I use information from the standard deviation. That’s why my graphs show sds.
The current discussion has nothing to do with time series. I will be discussing things related to time series later.
I don’t know how your reading my R code would reveal to you the purpose of what I am doing.
You mean I am ignoring the contribution to the spread that arises not from weather but because the deterministic trend is expected to evolve over the 100 years. Right?
That deterministic response is not “weather”, it’s climate, and I want to exclude it. (I need to for my purposes.)
I am trying to estimate a parameter that tells me how repeatable variances defined about the mean are from period to period. I specifically wish to exclude the portion of variability that arises due to the deterministic response of the earth’s climate system to the forcing. So, I need to exclude that variation in the individual trends about the mean trend for the 21st century.