Have you been following the great “SE vs. SD” debate in comments at Climate Audit? It’s discussed in several threads related to “Santer17”. (See 1, 2 etc.) Are you wondering what it all means?
Two issues were debated in comments:
- When testing models for consistency with observational data, did Santer17 use SE or SD?
- Which parameter should be used when testing models for consistency with observational data, SE or SD?
Commenter “beaker” appears to have been promoting the use of SD as correct and SE as incorrect. Unfortunately for less statistically oriented readers, the debate has been going on using terms like “SD” and “SE”– not that the full terms “standard deviation” or “standard error” would help. Analogies to apples and top hats have been used to explain all this. Worse yet, equations have been discussed and posted. So, what are those who know little of statistics to make of this all?
Since so many are confused, I’ve been trying to think of a non-climate example to explain the “SE” vs. “SD” approaches of testing a model (of any sort.) With luck, this will clarify the “apples” and “top hat” issue. (Or, maybe it won’t.)
First: Steve McIntyre resolved the first question: Santer used “SE”. This is reported in Replicating Santer Tables 1 and 3. The tables can only be replicated using SE, not SD. This doesn’t help readers who still don’t have any idea what SE or SD means, but at least we can tell you “SE” was used.
Now that that’s resolved, let me move on to explain the difference between testing model consistency with data using “SD” rather than “SE”.
Begin with a claim.
Let’s say, for some reason, you really, really want to predict my weight. Maybe you are planning to launch me into outer space or something. Maybe you need to figure out how many lead weights to have on hand when I act as copilot in a glider. Who knows?
Now, suppose scientist A claims this:
Claim: We can use the average weight of men to “predict” Lucia’s weight (which for some reason is unknown to you.)
In an analogy with models, “men” are the “models” for reality. I am reality. The feature you wish to predict is my weight.
Now, we want to test the scientist’s claim that statistics collected from men (the models) have predictive power with regard to reality (me).
Based on his belief in his claim, scientist A goes out, finds 10,000 men, and measures them once each month for a year. He computes the average weight for each man, and then averages over those averages. He discovers the average man in his sample weighs 180 lbs and the standard deviation (SD) of their individual average weights is SD=25 lbs. So, the 2 SD confidence interval is 50 lbs, or from 130 lbs to 230 lbs. (Ok… let’s just pretend everything is normally distributed.)
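For readers who like to see the arithmetic spelled out, here is a minimal Python sketch of what scientist A did. The 10,000 men are simulated from a normal distribution; that is purely an assumption for illustration, not anything in Santer or the IPCC.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical sample: 10,000 men drawn from a normal distribution with the
    # population values used in this post (mean 180 lbs, SD 25 lbs).
    mens_weights = rng.normal(loc=180.0, scale=25.0, size=10_000)

    mean_weight = mens_weights.mean()
    sd_weight = mens_weights.std(ddof=1)   # SD of individual men's average weights

    # Rough 95% range for the weight of one randomly chosen man: mean +/- 2 SD.
    low, high = mean_weight - 2 * sd_weight, mean_weight + 2 * sd_weight
    print(f"mean = {mean_weight:.1f} lbs, SD = {sd_weight:.1f} lbs")
    print(f"~95% range for one man: {low:.0f} to {high:.0f} lbs")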
Now things get tricky: Scientist A publishes a table in which he reports the best estimate of my weight based on his multi-model average is 180lbs. He also publishes a histogram that shows the distribution of men’s weights. So readers are informed men’s weights vary.
In his report Scientist A does spend time promoting the idea we should focus on the 180lb “multi-model average” as follows:

Chapter 8 of the thick report contains additional text explaining that the average over all models (i.e. men) is a better predictor of my weight than any individual man. After all, men differ structurally. Some are large; some small. So, the average will be a better guess of my weight. Right?
We the readers are left to interpret what this all means.
Now let’s test
Next scientist B decides he wants to test scientist A’s claim: Can we predict Lucia’s weight based on the average of men’s weights? Scientist B has access to scientist A’s claim and report.
Scientist B comes to my house and weighs me every month for a year. He discovers my average weight is 135 lbs with a standard deviation of 2.5 lbs. So, the 2 SD for my weight is ±5 lbs. (There is no difference between SD and SE for my weight. I’m one person. My weight is what it is on any particular day.)
Do the SD test
Now we apply beaker’s test (also the test at RC). We discover my weight of 135 lbs falls between 130 lbs and 230 lbs, which is the 95% confidence interval for guessing the weight of 1 man.
The reason this is the “SD” test is that we found the standard deviation of the men’s weights is 25 lbs, and the 95% confidence interval is approximately 2 * SD, or 50 lbs. (This assumed men’s weights are normally distributed, and I am also rounding the 1.96 to 2.)
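In code, the “SD” test is nothing more than this check (numbers taken from the example above; whether this is the right test is exactly what is being debated):

    # "SD" test: does Lucia's average weight fall inside the ~95% spread of
    # individual men's weights?  (Example numbers from this post.)
    model_mean, model_sd = 180.0, 25.0   # men ("models")
    observed = 135.0                     # Lucia ("reality")

    passes_sd_test = (model_mean - 2 * model_sd) <= observed <= (model_mean + 2 * model_sd)
    print("Passes SD test:", passes_sd_test)   # True: 135 lies between 130 and 230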
So, using this test we can’t prove there is any problem whatsoever with using men’s weights to predict my weight. Scientist A chortles with glee! His theory is not proven wrong!
This is one way to look at testing the claim, and it is correct in some sense. Still, it’s lucky for the scientist he didn’t suggest he could predict my mother-in-law’s weight. At 4’10” and 110 lbs, he would be proven wrong even by the “beaker” test.
But maybe you think that doesn’t quite make sense, right?
Remember the scientist’s claim:
We can use the average weight of men to “predict” Lucia’s weight.
Also, remember the report discussing the claim, and the estimate of 180 lbs was chock full of text explaining why the average— 180lbs– was a more reliable estimate than any of the individual models. When you read this, what did you think it meant? Did you think it meant:
1) “Lucia’s weight falls within the range of all average weights for men”? This translates to: Lucia’s weight is about 180 lbs, and the 95% confidence interval is 130 lbs to 230 lbs.
2) “The best estimate of Lucia’s weight is the average of all men’s weights”? This translates to: Lucia’s weight is about 180 lbs, and the 95% confidence interval based on measuring 10,000 men is 179.5 lbs to 180.5 lbs.
3) “Who knows precisely what he means? Maybe this is a group consensus document and one author means the first and another means the second?”
Beaker’s test examines the truth of the first claim. The claim is based on the “SD” of the distribution of “models” (i.e. men) and should be tested using an “SD” test.
Far be it from me to suggest the “SD” test is meaningless. It does tell us something important. My weight falls inside the distribution of men’s weights. Some men weigh less than I do. (I’m married to one of them!)
Let’s test the second question
So, now let’s test the second question. If we really want to find better ways to estimate my weight wouldn’t it be nice to know whether estimating based on men might be biased? Maybe, after we discovered the bias, we could find a flaw in our method and fix it. (For example, we could notice I’m a woman and revise the “models” to include some women. Better yet, we might screen out men entirely and use women!)
How do we test the second question?
To test whether using the collection of all men’s weights is biased we need to compare whether the average of men’s weights matches the average of my weight measured over 12 months. To do this test, we use the standard error in the average of 10,000 men’s weights. The standard error (SE) of men’s weights in the example above is 25 lbs/sqrt(10,000) = 0.25 lbs. The 95% confidence interval is then twice that, or ±0.5 lbs.
If you do a test using the SE value based on 10,000 men, you’ll easily discover that I weigh less than the average man. After all, 135 lbs is definitely less than 180 ±0.5 lbs. Heck, even if we account for the uncertainty in my weight on any given day, and compare 135 ±5 lbs to 180 ±0.5 lbs, we’ll still figure out I weigh less than the average man.
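Here is the same comparison as a small sketch, again with the example’s numbers. Combining the two uncertainties in the denominator is a sketch of the general structure of such a comparison of means, not a reproduction of any paper’s exact formula; the role of the observational uncertainty is discussed further in the Diversion below.

    import math

    # "SE" test: is the *average* of the 10,000 men consistent with my average weight?
    n_men = 10_000
    model_mean, model_sd = 180.0, 25.0
    model_se = model_sd / math.sqrt(n_men)     # 0.25 lbs

    obs_mean, obs_se = 135.0, 2.5              # Lucia: average and its uncertainty

    # Combine the two uncertainties, as in a comparison of two means.
    combined_se = math.sqrt(model_se**2 + obs_se**2)
    d = (model_mean - obs_mean) / combined_se
    print(f"difference = {d:.0f} combined SEs")   # enormous: the means clearly differ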
I’ll skip any mathematical details, but if you use the Santer17 method, based on SE, you will discover that my weight is less than that of the average man. (Thank heavens!)
Surely, this is something worth testing. And with regard to IPCC projections of temperatures, surely we might wish to know if their “best estimate” based on the multi-model mean is biased high or low? Or just so wild that during some periods it’s too high and during some periods it’s too low?
Diversion
What if you used the “Douglass method”? Well, the difference between Douglass and Santer is the uncertainty in my weight. Douglass would treat my weight as 135 lbs ±0 lbs. Santer accounted for the ±5 lbs uncertainty in my weight. This has nothing to do with the SD vs SE argument above. In regard to this argument, the Santer method is correct. When comparing two quantities, you must use the uncertainty in both quantities. (That said, it’s easy to show the larger uncertainty dominates the calculation.)
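A hedged sketch of that difference, reusing the weight numbers. This only illustrates the structure of the two denominators, not the exact formula in either paper:

    import math

    model_mean, model_se = 180.0, 0.25   # mean of the "models" and its SE
    obs_mean, obs_se = 135.0, 2.5        # observation and its uncertainty

    # Douglass-style: treat the observation as exact (uncertainty = 0).
    d_douglass = (model_mean - obs_mean) / model_se
    # Santer-style: include the observational uncertainty as well.
    d_santer = (model_mean - obs_mean) / math.sqrt(model_se**2 + obs_se**2)

    print(f"Douglass-style statistic: {d_douglass:.0f}")   # ~180
    print(f"Santer-style statistic:   {d_santer:.0f}")     # ~18; the larger uncertainty dominates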
So…. should we use SD or SE to test a claim?
Both!
Or, more precisely, whether one uses SD or SE depends on the question one wishes to ask. This is true for the claim about weights described above; it is true for climate models.
With regard to the IPCC projections: The AR4WG1 explicitly tells us their projections are based on the multi-model mean. To test whether a multi-model mean of “X” is consistent with the earth’s value for “X”, we use the SE test as Santer did. The SD test advocated by “beaker” is a much weaker test.
It is even possible to observe the following:
- If models fail the SE test, then the average of the ensemble is biased relative to reality. This is like the collection of “men” returning a biased estimate of my weight. I weigh less than the average man.
- If models fail the SD test, the ensemble is, to some extent, pathologically bad. If scientist A had claimed he could estimate my weight based on the weights of an ensemble of NFL linebackers, that claim would fail the SD test. Not only do I weigh less than the average NFL linebacker, I weigh less than 95% of NFL linebackers. (I bet I weigh less than every single one of them!)
What of GMST?
Currently, it appears that trends in GMST since Jan 2001 do not fail the SD test. Models taken as an ensemble may not be pathologically wrong.
Whether the average over all models fails the SE test depends on some assumptions about the statistical model. That’s why there are blog-climate war fights using terms like AR(1), ARMA and “Long Term Persistence”. The choice of statistical model makes a difference in conclusions!
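To give a flavor of why the statistical model matters, here is a hedged Python sketch of a trend fit with one common AR(1) adjustment: an effective sample size n_eff = n(1-r1)/(1+r1) computed from the lag-1 autocorrelation of the residuals. It illustrates the idea; it is not the exact procedure in Santer or in any other specific paper.

    import numpy as np

    def trend_and_se(y, ar1_adjust=True):
        """OLS trend of an evenly spaced series, with an optional AR(1)
        adjustment to the slope's standard error via an effective sample size."""
        n = len(y)
        t = np.arange(n, dtype=float)
        slope, intercept = np.polyfit(t, y, 1)
        resid = y - (slope * t + intercept)

        # Naive OLS standard error of the slope.
        se = np.sqrt(np.sum(resid**2) / (n - 2) / np.sum((t - t.mean())**2))

        if ar1_adjust:
            r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]   # lag-1 autocorrelation
            n_eff = n * (1 - r1) / (1 + r1)                 # effective sample size
            se *= np.sqrt((n - 2) / max(n_eff - 2, 1.0))    # inflate the SE
        return slope, se

With strongly autocorrelated residuals (and even more so with “Long Term Persistence” assumptions), n_eff shrinks and the confidence interval on the trend widens. That is exactly the kind of choice the blog-climate war fights are about.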
Lucia, I think that your analogy needs to include something to reflect the fact that the Santer analysis relies primarily on measurement uncertainty of your weight to make its rhetorical point.
Operationally, they take the position that the average weight of a man is known very precisely, but your individual weight is known only imprecisely and that you have been weighed to an accuracy of 135 +- 60 pounds. Ergo, no one has proven that you don’t weigh 180 pounds.
I can also report that the actual Santer “SD” appears to be the standard deviation of model ensemble trends, which they divide by (M-1), M the number of models – the very operation decried by beaker and others.
It’s a bit of a brain-twister as some of the operations are pretty counter-intuitive but I’m pretty sure that this is the implication of what they actually do.
Excellent, very clear, very instructive post. It was definitely needed, and you’ve done it very nicely.
[Added after seeing SM post, my goodness is this what they are really saying. If so how extraordinary! But it is always the logic of these arguments that is so hard to disentangle, and so decisive when its properly done.]
Steve–
Where in Santer do they assume my weight is 135±60 lbs? Equation 12 would use the estimate of my weight based on weighing me.
Do you mean the graphs? The graphs do show SD uncertainties!
The google ad is hilarious. It’s for weight loss. Clearly, google thinks I need to lose weight.
Saw SM posting after previous comment. Is this really what they are doing? I don’t understand, in that case, which is analogous to which in the analogy, ie is it the models that correspond to the measured weights of men, and the temperature of the planet to the weight of our one individual, LL? Because surely they don’t think the temp of the planet has huge measurement variation?
Just when it was all becoming crystal clear, too!
A little typo:
“Santer used “SE”. This is reported in Replicating Santer Tables 1 and 3. The tables can only be replicated using SE, not SE.”
You probably mean “can only be replicated using SE, not SD.”
I looked at something based on Steve’s note.
The content in my post is based on the tests in sections 5.1.2 and 5.1.1 in Santer; it is worth noting, however, that their figure 6 shows an “SE” attributed to Douglass, and then shows an SD in grey.
At many elevations, the data does seem to be all over the place. But around 700-400 hPa, it does look like the data is well outside the 2 SE region. This should indicate the model trend predictions are biased at that elevation. However, just as some 120 lb men weigh less than I do, some of the models predict values in the range of uncertainty of the data.
The error that everybody is making (including Douglass et al.) is that one should not begin any statistical calculation of models until the “pathologically bad” ones are filtered out. Many should be thrown out. In the papers under discussion all models that disagreed with the observational surface values [criterion: outside 2sd of the observations?] should be disregarded, which is easily half. Others should go for different reasons.
Suppose that there were only one model surviving such filtering and that the set of 6 or 7 observational data sets [with average and sd] were the same. What is the best way to make the comparison?
After this question is answered, then each of the models surviving the filtering should be compared to the observations individually without benefit from a statistical relationship to the others — each model lives or dies alone.
David – I agree with you in principle, but there is one difficulty: We have to identify the pathologically bad ones in the first place.
Ideally, no one includes pathologically bad models in projections. But, to avoid including the pathologically bad ones, they need to test each model individually using some base set of accepted metrics.
What should those metrics be? I don’t know. But presumably, a model’s 30 year average GMST in real temperature units should match the earth’s during the same time period within 2-sigma of some reasonably determined estimate of observational error. Ideally, they could compare from 1960-1990.
Whatever those metrics are, as long as the IPCC makes projections based on a particular set of models, I think it is useful to at least see how those models do individually and collectively. They often don’t do so well!
If one compares all the models one at a time with the observations according to some criteria [which must be specified], then the “bad” models are rejected at this point of the analysis. No statistics. There are just two lists — survivors and non-survivors. This should move the modelling groups to compare their model with the others — not a bad idea.
Typos:
In his report Scientis_t_ A does spend time promoting the idea we should
Some men weigh_t_ less than I do. (I’m married to one of them!)
If we’re aiming at explaining for the layman, then both SD & SE could use a little direct definition.
Alan–
Thanks.
I thought about giving direct definitions of SD and SE. But, I often find that if you include any equations at all, it makes the equation-averse stop reading. Most who aren’t equation-averse already know the definitions of both SD and SE!
It is sad but true.
Thanks, Lucia. That really helps us old guys who have forgotten much of what little statistics they once knew!
I was thinking more along these lines:
Understanding the Standard Deviation (sigma) of a normal, random distribution – a bell curve – can be aided by this figure:
http://en.wikipedia.org/wiki/Image:Standard_deviation_diagram.svg
In our example, the peak of the bell curve (marked mu here) would be 180 lbs. A one Standard Deviation range (the deep blue) covers 68% of the population. Here, that’s marked as the range minus sigma to plus sigma on the graph, or 155 to 205 lbs. in our all male population. Two SD (two sigma) covers 95.4%.
I’d say something about the Standard Error, but beyond “It’s the -estimate- of the Standard Deviation,” I can’t quite elucidate a math-free definition.
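(For the curious, those coverage fractions are easy to check numerically; a small sketch using scipy’s normal distribution:)

    from scipy import stats

    # Fraction of a normal population lying within k standard deviations of the mean.
    for k in (1, 2):
        frac = stats.norm.cdf(k) - stats.norm.cdf(-k)
        print(f"within {k} SD: {frac:.1%}")   # ~68.3% and ~95.4%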
David Douglass is logically correct. The criterion for being a survivor is quite obvious. To survive, the model prediction must be within the 95% confidence level based on the observed values. With this method there is no need to eliminate any runs or models based on any supposed pathological errors. Any model that has gross errors in it will be eliminated anyway. Just make sure that you use the CI calculated from the observations, not the models.
It seems to me that all climate models reduce to: Global Temp. = a(CO2conc.) + b, with a margin of error of +/- 100 C.
Thanks for the great post. Appreciate your time and effort turning an impenetrable subject into something clear.
If I understand the analogy and am applying it correctly… (probably not)
Does this mean there is a benefit (for the modelers) in having a wide range of results from models because that increases the SD and therefore observed reality is unlikely to fall outside the 2 SD threshold?
And do the models predict very different values?
Steve Carson–
If people accept beaker’s way of testing for consistency as “the only” valid test, and we apply it to the collection of models, then yes. The more variable the collection of models the more impossible it gets to show inconsistency!
Keep in mind that there are two SEs shown in Santer’s calculation – only one of which is shown in the above figure – and that the grey bands shown above are NOT used in Santer’s calculation. Santer uses the SE from this figure together with the SE from the observations (not shown in this figure) in his calculations. The grey bands here, while logical enough, are merely illustrated for rhetorical purposes as they are not used in the actual t-tests by either party. That should be clear as mud, no?
Steve—
That’s what I thought. The figure shows SD’s and SE’s in this figure. But the SE in this figure is not the SE used in the denominator for the difference between models and observations. That SE is the square root of sum of the squares of the other two! (It’s also really difficult to illustrate in a figure.)
A nice presentation from one of the key authors (I kinda recall) in chapter 8 of AR4
http://www-pcmdi.llnl.gov/wgne2007/presentations/Oral-Presentations/mon/Taylor_metrics_err_wkshp_1.pdf
here read them all
I like the weight analogy since weight is often thought of as an energy balance (calories eaten, calories burned), but we don’t know all the internal mechanisms.
For models, the choice of “men” is good, implying that the models are somewhat different than the reality. You could go even further and use various ethnic groups to represent different models, each with its own distribution of weights. I would even suggest that since the models are only approximations of reality, they could even be apes, monkeys, gorillas, and so on, possibly even a few turkeys and swans.
For Earth, lucia is a good choice, but in a nod to beaker, what we really have is one lucia out of an infinite number of possible lucia clones. In other words, there is some element of chance affecting lucia’s past and future weight.
The goal of making a model is to predict lucia’s future weight. We have some history of lucia’s past weight. But just because all the clones have the same genetics (nature), that doesn’t mean they would have the same environmental history (nurture), so there is some doubt about how closely the models should be tuned to match the historic weights which may be due to unknown external forcings instead of lucia’s internal nature.
If the models predict the average lucia, there is the question of how far lucia is from the average lucia, both in the past and in the future. Measuring lucia’s recent weight variations does not give us that. lucia is just one run of the lucia model. That’s okay if lucia is a nicely behaved deterministic model, but we don’t know that.
(This is more general than the discussion of Santer) FWIW
It is pretty clear that the choice of the test which allows 0.0C no-warming to 0.5C per decade of warming to be consistent with the models is completely ridiculous.
If there is 0.0C no-warming over 100 years, how can we say the models predicting the warming are accurate?
If there is 5.0C of warming over 100 years, we would have to conclude the models missed the mark big time and they should have warned us much more forcefully about the coming disaster that 5.0C of warming would bring.
When statistics do not match up with obvious logic, the statistics have to be thrown out.
In response to David Douglass shouldn’t one also exclude the ‘pathologically bad’ observational data?
Phil–
I should think comparing anything to pathologically bad data is pointless.
I’m all for comparisons including the uncertainty in the data. If the data are known to be wretched, comparison is pointless.
It is unfortunate for climate science that much of the data appear to be bad. The current system of collecting poor data and then fixing the data after it’s found to differ from predictions is designed to increase skepticism.
People will tend to doubt “proof” models can predict until such time as models predict something that is then borne out by data collected by trusted systems after the data are collected.
Phil,
how did you ever decide there was an approximate 0.2C/decade trend if the data is so bad??
With all the references to baking brownies I would have fudged the prediction up a notch or two …
Also what if the scientist decides that fat guys are more typical and skinny guys just temporary short-term aberrations from the true norm and therefore builds a model that expects the norm (and you) to be 190-195 lbs? If you fall outside the acceptable range for model confirmation would we then have to wait (say, 30 years) to see if you fatten up to match the prediction and thus undo/correct the short-term aberration of non-obesity? The model can never fail if we always assume enough time and enough brownies in the pipeline.
Cheers.
About throwing out the bad models – I’ll bet this was already done prior to releasing AR4, but now time has passed, and the earth didn’t cooperate with the models that weren’t thrown out. Now, many more of them are bad also. If those were thrown out, and another 8 years went by, there would still be more bad models, probably up to the point where there would be no models left to throw out.
Lucia, concerning the scientist B weighing you: the SD is OK, because your weight really changes a bit in time, but isn’t there also a SE (even when you are a single person) which expresses the standard error of weighing procedure itself? That is what puzzles me in the SD/SE debate.
For e.g., taxonomists measure length and width of fungal spores and some of them erroneously give a SE of the mean length, whereas SD value is what should be given, as true informative of the length span.
In case of models I’m truly confused as to which criterion should be used. Are they a “measuring method”? Or are they a “measured value”?
To quote Arte Johnson: “Very interesting, but stupid”
I think Lucia’s example exactly describes why all this statistical hand waving is a waste of time.
Just looking at the experiments’ setups illustrates why…
Your models, as Lucia points out, are “men” which cannot replicate any of the possible activities of the experiment “Lucia”. What if Lucia is pregnant during the calibration? What if Lucia gave birth just before the calibration? During the calibration? What if Lucia is a child? What if Lucia’s parent is overly stressed? What if the global economy is changing? What if…? x n³
None of the models would predict any of these potential impacts regardless of how many times you averaged them together with any degree of certainty.
There’s something wrong here, but I’ll probably not state it correctly, at least the first time. By this measure, if the measurement error of every individual man of the 10,000 is also +/- 2.5 lbs, then almost all of the 10,000 men will also fall outside the SE of the mean whether combined with the measurement variance or not and so the mean is biased as a predictor for men too. All that proves is beaker’s point that the mean +/- 2SE alone does not and cannot predict anyone’s weight. If the measurement variance is small compared to the variance between individuals, then using the measurement error instead of the variance between individuals will cause rejection at too high a percentage.
DeWitt–
I was waiting for someone to ask that. 🙂
Oddly, there is nothing wrong! And you have not proven beakers point. The issue is: What question do you wish to answer?
The average weight over all men is biased compared to most individual men. What the average weight of all men tells you is…. the average weight over the population of all men.
So, if the reason for studying all men was to later order a batch of clothes to have on hand for randomly selected men who arrive at a clinic, then the statistics for all men are relevant.
But, if for some reason, you are interested in one specific man, it is very useful to discover that the average of all men might be larger or smaller than that man. So, for example, when I want to shop for my husband, I don’t keep consulting the statistics for “all men” to estimate his size. I’ve done the comparison– the average for “all men” is biased relative to my husband. No matter how often I look at statistics for all men, my husband still wears size 28″ waist jeans. (He’s been up as high as 29″ and wore 27″ when I married him.)
The SE is what I can use to determine the batch of “all men” is biased compared to my husband, Jim.
lucia,
Indeed.
But the question you ask above is not the question that Santer et al. and Douglass et al. are asking and failing to answer correctly, I think. That question is not the fairly uninteresting: ‘is the model trend or model ensemble average trend greater than or less than the observed trend?’ It’s: ‘can we reject the hypothesis that an individual model trend (H1) or some model ensemble mean trend (H2) comes from the same population as the observed trend?’ If the measurement error of the observed trend is used for this test, then unless the measurement error is large, i.e. comparable to the difference between individual realizations, the test becomes equivalent to the Douglass et al. test and the hypotheses will always be rejected even if the model or model ensemble were perfect (parallel or constructed Earth). For that test, you need the standard deviation of the population of all possible observed trends, the true value of s{bo}, not the error in measurement of an individual trend which may or may not be equivalent to s{bo}.
Then there’s the even more interesting question: ‘Is any model or combination of models useful?’ But neither Douglass et al. or Santer et al. attempt to answer this question.
Dewitt–
What Santer actually say they are testing with H2 is this “… Under H2, we seek to determine whether the model-average signal is consistent with the trend in φ0 (the signal contained in the observations.)
They are quite specifically testing whether the means match.
However, if the means don’t match, then it’s true the models and observations come from different populations. But they are specifically testing whether the mean trend from the models falls in the range consistent with data.
The argument beaker has is not about whether we use SD or SE for the observations. It’s whether we use SD or SE for the models.
There is no argument about the observations because there is only one sample. When you have “N” samples of a thing, the SE on the estimate of the mean is SE=SD/sqrt(N). For the observations, that’s SE= SD/sqrt(1), so SD=SE.
Why do you think this? It’s not true. The mean trend will only be rejected in the Douglass test if the t-test shows it’s outside the 95% confidence interval for the mean of models. That value is not zero– so the mean will certainly not always be rejected.
However, in the limit that the mean of the models does not match the observations, it will be rejected. That’s what you want to happen!
If the models are correct the Santer test works just fine. Because for correct models as N-> infinity, the mean result must fall in the range for the observations. If the observation is known perfectly, then as N->infinity, correct models much match the observation.
If the error in the observations happened to be zero, the Douglass test is correct. It’s the classic single t-test where we compare something with stochastic error to a constant. It works fine.
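For anyone who wants to see what that “classic single t-test” looks like in practice, here is a minimal sketch with made-up trend numbers; these are not values from Santer, Douglass, or any data set.

    import numpy as np
    from scipy import stats

    # One-sample t-test: compare model trends (which have stochastic spread)
    # to a constant "observed" trend treated as known exactly.
    model_trends = np.array([0.18, 0.22, 0.25, 0.15, 0.30, 0.21, 0.19, 0.27])  # made up
    observed_trend = 0.12                                                       # made up

    t_stat, p_value = stats.ttest_1samp(model_trends, popmean=observed_trend)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")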
lucia,
That’s what I’ve been trying unsuccessfully to explain. I understand what you are saying and it’s correct as far as it goes. But you clearly don’t understand what I’m trying to say, which is probably my fault. I think you’ve somewhat misinterpreted beaker as well, but he got lost on the wrong side of the SD/SE problem so his main point was also lost.
Let me try from a different approach. While hindcasting skill is a necessary condition for model validity, the real question is what may happen in the future. To put bounds on this you need to know the distribution of how fast individual realizations diverge from the present value. The measurement uncertainty of the current trend does not, IMO, tell you this. Nor does the model ensemble SE. You need a good estimate of the true value of s{bo}. That would be the model ensemble trend SD if the model were perfect, I think. It’s not about predicting what you weigh now, it’s how much will you weigh next year and ten years from now.
If the test were for only one model, this might be true. That is: If we assume an individual model is perfect, we might be able to make such a claim and test it.
But, the reality is that we have many models all with different parameterizations. This means the spread across models is not a good estimate of s{bo}.
It’s also possible to test things like the “weather noise” for individual models– so that would be testable.
If we accept this argument for an individual model, that’s a version of the H1 test, not the H2 test. Also, in this test, you change the s{bo} for the observations– but you can still perfectly well use SE for the individual model, run N times. Its average should converge. So, with respect to what you use for the estimate of the uncertainty in the model mean, you use the model SE.
FWIW, if you used the model SD to replace the observation SD in the santer H1 test, many individual models will fail. Some individual models have fairly tight SDs. I’d have to look at the numbers— but for some models it may be less than the SD used for the observations!
But is this what beaker was suggesting? In which case, the lack of clarity was in saying he was arguing about SE vs SD. If his point was he wanted to replace the observation SD with the model SD, he should have said that.
IMHO there is one basic question that is crucially important and should be kept in mind:
Is there predictive power in the model ensembles for claiming with 90% confidence that the temperatures will continue to climb at 0.2C per decade as long as the CO2 increases 2ppm a year? It is for CO2 that societies and economies will be further constrained and taxed.
If the standard deviation of the models covers all possible trends it is like predicting the sex of a baby: it will either be a boy or a girl.
When the Douglass paper came out, I saw it as a quantification of what was obvious to the naked eye that the IPCC plots which had claimed tropical tropospheric signature of CO2 in decadal trends were by far off.
The use of statistics to obscure instead to clarify the basic issues is not good.
For example, I would like to see what the temperature projections curves are for the models that are in the statistical tail of 2 sigma overlapping the Douglass et al data. I would bet that the predicted temperatures from those models for 2100 would be low and far from alarming. One cannot use a model in a tail of distribution to “unfalsify” one prediction and not show the predictions for the (NON)catastrophic ( as I guess) warming that the same model would predict on the temperature per year plot.
lucia,
I agree. I’m not trying to argue that any of the current models are any good. What I want to avoid is throwing out a future good (and I’m quite aware of and have some sympathy for the school of thought that there cannot be such a thing) model by too stringent a test. Douglass et al. doesn’t, in principle discard all models, just 80% or so when it should be 5%. Santer et al. are potentially hoist by the same petard when including more data decreases the CI of the observed trend.
Well yes, sort of. But the question still remains, what do you use for the observed SD? You seem to imply that the model average should converge on the current observed trend. It will converge to something (although I still have some small doubts about that given chaotic behavior). But that something doesn’t have to be the current trend or even close to it in terms of the SE of the model average. Does the average weight of all men converge to your husband’s weight as you weigh more and more men? Are there any nuclear families with 2.5 children?
My reading of beaker was: Don’t throw out the baby with the bathwater. I can never remember which is a type I or II error, but beaker, I think, and Santer et al. sort of proved, that Douglass et al. rejects too often. But I think beaker’s proposed solution of using SD instead of SE for the mean of the model trends was incorrect. Santer et al. uses SE for the model mean also. The error is not properly accounting for the variability of the observation, not the model. And that variability must include more than just measurement error. I think. Maybe. I’m still working on understanding all this and could be persuaded that I’m wrong, but you haven’t done it yet.
That’s not what I intend to convey. The model average should converge inside the uncertainty bands for the observed trend.
Forget about climate science, and think about applications where the observation is known very precisely. In applications where the uncertainty bands for observations really is zero (or nearly so) correct models should converge to the correct true value.
So, for example, if you had a model based on first principles, that predicted the acceleration of gravity at sea level. It should give 9.81 m/s^2 ± some very small number, right?
Say, for some reason, you developed some funky model that has, as a feature, a prediction of the acceleration due to gravity at sea level. If you run it a bajillion times, the average should converge on 9.81 m/s^2, right?
If you told me that your model converged to 10.1 m/s^2 with a standard deviation of 1 m/s^2 when you ran N=100,000,000 runs, I would tell you your model is off, but you might reply that 9.81 m/s^2 falls well within the ±2 m/s^2 confidence interval based on the SD for your model. Meanwhile I point out that your SE is 1 m/s^2/sqrt(100,000,000) = 0.0001 m/s^2, and the 95% confidence intervals are ±0.0002 m/s^2. Your mean is way too far off to agree with 9.81 m/s^2. So, if the goal of your model is to predict the acceleration of gravity on earth, it’s off!
Do you really think anyone would believe that we should replace the known teeny-tiny uncertainty in observations with the SD from your model before we can point out that your model’s mean value is wrong?
I think using the uncertainty in the observation is correct. The fact that it’s tiny presents no particular difficulty for testing models. Models that are wrong will be found wrong; models that are right will converge to the correct value.
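The gravity example, translated into a few lines of Python with the numbers from the comment above:

    import math

    # Funky gravity model: mean 10.1 m/s^2, SD 1 m/s^2, over N = 100,000,000 runs.
    n_runs = 100_000_000
    model_mean, model_sd = 10.1, 1.0
    true_g = 9.81                              # observation, known very precisely

    sd_test_ok = abs(model_mean - true_g) <= 2 * model_sd          # "SD" test
    model_se = model_sd / math.sqrt(n_runs)                        # 0.0001 m/s^2
    se_test_ok = abs(model_mean - true_g) <= 2 * model_se          # "SE" test

    print("Passes SD test:", sd_test_ok)   # True: 9.81 is well inside 10.1 +/- 2
    print("Passes SE test:", se_test_ok)   # False: the model *mean* is biased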
Of course not. So, the SE test tells us precisely what we wanted to know: The average man weighs more than my husband!
All Santer showed about the Douglass test was that if one ignores observational uncertainty you get too many rejections.
All that argument says is this: Don’t treat an observation as being more precise and accurate than it truly is. If you do that you will get too many rejections. That’s true.
So, I agree when you say this:
Yes. This question remains: How do we do this? But the answer isn’t to replace the observational uncertainty with the SD from models. This is especially true if the models are a huge collection of different models with reasons for varying not shared by the observations.
lucia,
Again, I agree with your example, but disagree that it applies in this case. That is an example where there is known to be one true value, g. Actually, even that isn’t quite true. g varies slightly from place to place and there are sensitive instruments that can measure the difference. Many years ago I saw one of these, which was used in oil exploration, that could easily see the influence of the moon on the local value of g as the earth rotated. They have much more sensitive instruments now that can measure local g to six significant figures or so.
Is there one true trend from 1979 to 2008 to which a perfect model would converge? That seems to be the assumption of the climate modelers but it remains just a conjecture as far as I can tell. My reading of the posts of Tom Vonk and Gerald Browning is that they don’t believe there is one and only one trend. Of course they don’t believe you can construct a perfect model, or even a very good one, in the first place.
If you started 100 Earth’s from, say January, 1979 initial conditions within the limits of quantum uncertainty for every molecule and atom on the planet, would they all be in the same place today? I don’t know, but I don’t think so. Would our Earth fall within the +/-2SE error band of the average of the 100 Earths? I think the answer to that is no also at least 6 out of 10 times, but very close to 19 times out of 20 the average of the 100 Earths would fall within the +/-2 SD error band of the Earth’s climate where the SD is the SD of the 100 Earths.
Dewitt:
Are you asking specifically for a model?
The idea isn’t that there is “one true trend” any more than there is “one true height” for all men. The idea is that if you ran a model many times, you get a particular average value. The individual trends also have a distribution SD. And, as you run the model more and more times, you can state the average you obtained will be closer and closer to the true average for that particular model given a specific set of externally applied forcing functions. This is likely true for the models. If not, it would be useful for the modelers to run the models many, many times and tell us it does not converge.
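A hedged sketch of what I mean by convergence, with an entirely made-up “model” whose runs scatter around its own true value (none of these numbers come from any real model):

    import numpy as np

    rng = np.random.default_rng(1)

    # Made-up model: each run produces a trend scattered about the model's own
    # true value (0.22) with run-to-run SD 0.10.  The mean of the runs converges.
    true_model_trend, run_sd = 0.22, 0.10
    for n_runs in (4, 40, 400, 4000):
        runs = rng.normal(true_model_trend, run_sd, size=n_runs)
        print(f"{n_runs:5d} runs: mean = {runs.mean():.3f}, SE = {run_sd / np.sqrt(n_runs):.4f}")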
I’m not going to try to speak for Tom Vonk or GerryB. I think each has different reservations.
However as to your question
No. No one has suggested this.
Also, as far as I am aware, no one, anywhere is suggesting any randomly selected single trend (model or earth) would be within ±2SE of 100 observations of the real earth trends all taken on the 100 imaginary planets circling the sun. No one is suggesting this even if we could observe the 100 imaginary planets.
Who do you think suggested this?
lucia,
I give up. I obviously can’t explain my point with sufficient clarity that you at least understand it even if you don’t agree.
Maybe the question is : “in a chaotic system is an average value a stable value, or is it also chaotic” ?
In your examples of your weight and the gravitational constant you are talking of an average over non chaotic systems.
Let us take ocean waves. Does the average height of an ocean wave have a stable value, or is it chaotic also?
Let us take clouds in the sky. Is the average density/thickness a stable value or is it chaotic too?
Anna–
The way chaos is defined mathematically, I don’t think being chaotic precludes the existence of an average. If you run the Lorenz example, there is an average value for any variable in the system.
Depending on the particular chaotic thing, the average value may or may not be particularly meaningful. But, it can exist. It also can be meaningful. (That said it might not be meaningful.)
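As a quick illustration that a chaotic system can still have a well-defined time average, here is a crude integration of the Lorenz example. It is only a sketch: simple Euler stepping and an arbitrary transient cut-off, chosen for brevity rather than accuracy.

    import numpy as np

    def lorenz_mean_z(n_steps=200_000, dt=0.005, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
        """Integrate the Lorenz system with crude Euler steps and return the
        time-average of z over the later part of the trajectory."""
        x, y, z = 1.0, 1.0, 1.0
        zs = np.empty(n_steps)
        for i in range(n_steps):
            dx = sigma * (y - x)
            dy = x * (rho - z) - y
            dz = x * y - beta * z
            x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
            zs[i] = z
        return zs[n_steps // 10:].mean()   # discard an initial transient

    print(f"time-average of z: {lorenz_mean_z():.2f}")

The trajectory itself is chaotic, but the time average of z settles down to a repeatable number. Whether such an average is meaningful for a given application is the separate question raised above.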
The “stable value” question is a good one. The issue of whether or not time averages are the same as ensemble averages for a particular problem falls under the question of something called “ergodicity”.
I’m fascinated by the comment that James H made (#6009).
One theory in finance – that I’m sure many people here know – is that fund managers have no more predictive skill in finding investments to outperform the market than me throwing darts into the back pages of the Wall Street Journal.
Therefore, the theory goes, financial companies running mutual funds start up many different funds, so that the under-performing ones can get killed off (rolled up into the successful ones), and the finance company can quote “out-performed the market by 18% p.a. for the last 5 years”. Well, you start enough off, you are “guaranteed” -strike that – replace with appropriate statistical caveats – to have an over-performer.
So this is a great opportunity for some “skillful models” to be showcased over the next year or two. With 100s of models, there’s “certain” to be some accurate ones. And how their praises will be sung!!
Lucia,
Though I have sat through a course on chaos a while a go, I do not feel confident enough to make pronouncements, just questions, so my posts should be thought as questing thoughts.
Averages of course can always be defined. The crux is stability. The only points stable in all variables, I believe, are called attractors of the theory; the ensemble has to go there while evolving, or to one of them.
Now the average of temperatures, the averages of trends of temperatures etc. should be part of the final evolution. I would think that ice ages and hot ages would be two attractors to which the chaotic climate system would gravitate depending on the push (I suppose what Hansen calls the “tipping point”), and there certainly the averages will be stable.
I would suppose by construction the averages would not be stable until they fall into an attractor state, i.e.depending on initial conditions and path conditions, they will be different.
In our particular application the question is how much different?
My physicist’s intuition tells me that if the average difference in day and night temperatures is 10 degrees C ( or average winter and summer), anything below such a measure should be within the chaotic variability as an upper limit. The observation that there is a stasis in average temperatures and assuming that the GCModels are not too bad in the short term evolution, a lower limit of variability would come from the divergence of .2 per decade. These seem to me to be the logical range of average temperature 10-0.02 C in a year, rather than a stable average temperature. Now I suppose I need a theorem to show chaotic behavior within this range :).
Just having fun :).
sorry, that should of course read “average anomalies of temperatures”, or “temperature differences” in the last paragraph.
DeWitt
“Is there one true trend from 1979 to 2008 to which a perfect model would converge? That seems to be the assumption of the climate modelers but it remains just a conjecture as far as I can tell. My reading of the posts of Tom Vonk and Gerald Browning is that they don’t believe there is one and only one trend. Of course they don’t believe you can construct a perfect model, or even a very good one, in the first place. “
Jerry , Dan Hughes , SpencerUK , myself and some others indeed do not think that a DETERMINISTIC dynamical model of the Earth system (I always try to avoid the very badly defined term “climate”) can be constructed be it stochastical or otherwise .
The reasons are rather different but that is already significant because that shows that there is a whole range of arguments coming to the same conclusion .
Jerry basically says that with the hydrostatic assumption you are unphysical and without it you diverge .
So you are doomed .
I use the chaos theory and say that the only statement you can make about the system is a statement about its attractor but that is not what the GCMs do .
So you are doomed .
Lucia is playing a very different game .
She doesn’t question the models , she takes them at their face value and looks at the results .
That is a respectable empirical approach because if the results disclose an “inconvenient truth” , she cannot be accused that she cheated by changing the rules .
Well actually Schmidt tries but everytime he changes the rules , Lucia changes hers to stay consistent 🙂
I follow the Santer&al debate at CA but even if it is interesting in the purely statistical theoretical way it doesn’t incite me to make any comments because it is based on an , in my view incorrect , assumption namely that an OLS procedure applied on chaotic data allows a physical interpretation of the slope of the regression straight line .
In other words saying that the system is linear and statistically deterministic on some arbitrary multidecadal scale is such an extravagant claim that it would need an EXTREMELY strong proof which of course can’t exist .
I won’t develop much the chaos theoretical arguments (I posted some more technical considerations in http://www.climateaudit.org/phpBB3/viewtopic.php?f=4&t=562) .
Perhaps only 1 remark .
What we deal with in the real Earth system (and in Navier Stokes) is spatio-temporal chaos .
What we deal with in temperature time series , so in Douglas , Santer etc etc is temporal chaos .
Those are two COMPLETELY different ball games .
The attractors in temporal chaos are geometric objects living in the phase space (f.ex Lorenz , diode circuit , pendulum etc) .
The attractors in spatial chaos live in the ordinary space (clouds , jet streams etc)
That’s why all those statistical consideration on temporal “climatic” series need first an operation of space averaging .
They simply need to get rid of the spatial chaos and do it by averaging .
Now this operation is clearly illegal because temporal chaos appears in TRUE local , physical variables – think hamiltonian dynamics with the p,q variables .
As there is no chance in hell that the space averages obey the same differential equations as the TRUE local variables , you lose all the relevant information about the dynamics of the system by doing space averages .
In other words you get only garbage and spurious pseudo stochastical results that may “look” like the real data … until they stop doing so after a certain (totally unpredictable) time .
P.S for AnnaV
If your 10°C is a temporal average of some LOCAL temperature (or humidity or rainfall or wind speed or …) , f.ex at the Eiffel tower in Paris then the theorem allowing to conclude that this parameter is chaotic is known .
Show that the Lyapounov coefficient is positive by examining the appropriate system of ODE at that particular place (assuming that you can establish the ODE system) .
Alternatively gather a huge time series for that place and determine empirically a positive Lyapounov coefficient .
.
Interestingly once you have shown that , you will be able to show with the same token that the temporal average value of 10°C is meaningless .
Technically you will find out that provided a good knowledge of the topology of the attractor , all kind of averages are possible and the value of 10°C only happens when the system visits a particular place of the attractor which happens to be right now when you look at it 🙂
However if your 10°C is some spatial average , I give you no hope to say something intelligible .
AnnaV–
I agree with you about the stability. Also, the question is: If the system is chaotic, is the average meaningful? For some chaotic systems, it’s not. For some it is. In a turbulence course, the question “Is a system ergodic” included a discussion of ice ages. Then, we plowed on, avoiding the more metaphysical aspects and focusing on the classic stuff we all needed to know!
Lucia
I would not like to stray too far from the topic but only for the record .
The ergodicity states that an average of a function taken along a dynamical orbit defined on some measurable manifold is equal to the integral of the same function taken over the manifold .
Now the relevant measurable manifold for dynamical systems is the phase space .
So what the ergodicity says is a statement about orbits in the phase space and it doesn’t talk about the ordinary (x,y,z) space .
Chaotic systems should clearly never be ergodic (in the phase space) because due to the positive Lyapounov coefficients , 2 orbits starting at the same point will have very different averages along their orbit .
Also the ergodicity says directly nothing about ordinary space averages .
I have not checked but I am sure that Ruelle must have written papers about that and similar issues .
But then Ruelle and the ergodic theory are notoriously difficult to read .
Tom–
There’s also an awful lot of “can’t be shown to exist” vs. “shown not to exist” issues in those mathematical Ruelle papers.
But certainly, with regard to climate, there is a difficulty with the glacial and interglacial periods. The modelers are clearly trying to predict something conditioned on some subset of initial conditions with a climate that is somehow similar to the state that existed in the late 19th century or so. So, presumably, the “set” of “possible weather trajectories” are those with IC’s drawn at random from all weather states that are somehow similar to what we think might have existed in the 19th century, but we aren’t defining precisely how similar, and anyway, the 19th century measurements aren’t all that terrific.
So… there is just enough set theory in the whole idea of ensemble averaging to confuse everyone. Those who don’t understand the set theory idea feel the eye-glaze-over effect. Those who think a little more realize the set of infinite earths is not really defined at all.
I’m an engineer. This does not make me uncomfortable. However, I recognize it for what it is. Assumptions we make and get on with.
Tom, I went to your post in CA. I can see your point about spatial chaos and the unknown solutions to unknown differential equations.
I have been harping on this from the other side, that all the use of averages over grids in the spatial approximations of GCMs are really linear approximations to putative perturbative expansions of the true solutions, which are highly non linear, and thus it is inevitable that the GCMs will fail after a number of temporal steps.
Looking at it from the Chaos side, please consider this:
if we observe the earth from far enough, another star let’s say, then there will be one spatial value for each variable, presuming we have some instrument to measure them. These values will be bounded by something similar to the interval I estimated above, even though inside chaos reigns.
If we get closer, we might be able to measure the two hemispheres for the spatial variables, and again, the averages will be bounded by something reasonable. Then closer, three regions, etc., and the bounds will be similar.
Measuring spatial distributions and getting averages might still have meaning within a band of allowed values. If this band were narrow enough, for example during ice ball earth we would be getting a very small band for spatial temperatures, might not the chaotic underlying nature be ignored for practical purposes?
I suppose what one is saying when rejecting a spatial averaging is that the allowed variation of values in space and time, is much larger than any effect that is being measured in time?
As much as I like it, I think your analogy is missing another aspect.
The measured men are NOT random men. They are specifically selected by the study organizers. AND, some men’s weights are counted more than once (i.e., models given extra weight by extra model runs). Thus, the SD and SE of the men’s weights is an artifact of the organizers’ choices, and can be easily manipulated without even changing the sampling population.
re: Clark (Comment#6086)
I’ll go farther and say that we don’t know for sure that only men have been chosen. That is, the different models are not necessarily models of the same Earth. It’s like, Well we’ve got all these numbers, there must be something we can do with them.
It would be interesting to see the effects of a single model passed around among the participants and letting each group decide how to set up and run a specific example. The ICs and BCs would be specified, but all other aspects of the calculation, including the users, would be left up to each group.
I’m trying to specify an analogy to the case of testing the same specimen, or test section, in different laboratories. I think I did.
re: Dan Hughes (Comment#6088)
Well there are a bunch of edits floating around in electron space somewhere.
We clearly do not have a textbook exercise here such as “do the samples from these production runs of machine screws meet the design specifications?” or even “do they match the prototype?”. We don’t have a specification or a prototype, we have a black box.
BTW, see Dan Hughes’s comments on black boxes at http://www.climateaudit.org/?p=4163#comment-307854
How would the answer for the testing question change when the assumptions for the mechanism inside the black box change?
Does it matter if we think the climate system is (for example) linear or chaotic?
For an OLS regression, we have some parameters we can estimate, each with some uncertainty. In a GCM simulation, how many parameters are being estimated (or assumed) and how do those uncertainties add up?
I don’t believe we know all the uncertainties either in the earth system or in the models.
AnnaV
.
Measuring spatial distributions and getting averages might still have meaning within a band of allowed values. If this band were narrow enough, for example during ice ball earth we would be getting a very small band for spatial temperatures, might not the chaotic underlying nature be ignored for practical purposes?
I suppose what one is saying when rejecting a spatial averaging is that the allowed variation of values in space and time, is much larger than any effect that is being measured in time?
.
First is that in your far star example you would not measure spatial averages of the temperature .
For an unresolved pointlike Earth you would only be able to measure some infrared spectrum that you would fit to a blackbody spectrum and say that you observe an isothermal point at – 10°C (yes , bad luck your star was in the Earth axis direction and faced Antarctica) .
You could also compute standard deviation of the time variability and conclude that it is much smaller than your measure incertitude .
In this thought experiment you can indeed neglect everything , chaos included because the data you have is so poor that the only reasonable model you can make is a quasi isothermal planet at – 10°C . The model would be of course completely wrong but anything else would be useless speculation because no available data could falsify the assumptions .
As the resolution increases , you’d observe more and more complex behaviour until you are able to recognise spatio-temporal chaos which can’t be “neglected” once one sees it .
Second is that I am not sure that I understand what is supposed to imply for you the statement that “allowed variation are larger than measured variation” .
Allowed variation in temporal chaos depends on the topology of the attractor . It is whatever it is and it is as “large” as the attractor is . Imagine a 3 dimensional toroidal attractor . The “allowed” variation in 2 dimensions is big (the big radius of the torus) and in 1 dimension small (the small radius of the torus) .
The “measured” variation is a measure of some part of the surface of the torus and is , by definition , smaller than or equal to the torus size . This measured variation is completely dependent on the place where the systems happens to be when you measure , on the “speed” with which the system visits various parts of the attractor and on the duration of your measures .
As during all this time we were in the phase space where the attractors live , the ordinary space (as in x,y,z coordinates) was irrelevant . Ordinary space averages are simply no dynamical parameters in temporal chaos . What relationship is there between the chaotic character of a temperature time series at the top of the Eiffel tower which is a TRUE , LOCAL and LEGITIMATE dynamical variable and the ordinary space average of temperatures at a certain time along a line going from the top of the Eiffel tower to the stratosphere or along any other arbitrary line ?
Right , none .
Lucia
Well as I know that you are familiar with N-S , I know that you’d appreciate that the idea of sets of initial conditions is indeed a powerful mathematical tool for otherwise intractable problems .
Of course that doesn’t mean that they become tractable even with this tool .
Btw the Fields medalist Terry Tao, in his awesome musing about "Why is N-S so difficult", says that the strategy that could have a chance to clarify (oh so slightly!) the N-S problem would be to work with "suitably" partitioned sets of IC.
However he immediately added, if memory serves, that this "suitable" partitioning would probably be pretty exotic and highly non-trivial 🙂
I agree with you that the modellers must have tried things like that .
At least Schmidt has written somewhere that "the models exhibit chaotic behaviour with POSITIVE Lyapunov coefficients", which would be impossible to see without working with sets of IC.
Unfortunately he has not yet come to the insight that the corollary of this statement is that ordinary space averages of dynamical variables are irrelevant and meaningless. But that is bound to come one day too.
Now if you think that I am really going too far from the topic of this thread which is actually the SD/SE problematic , say stop . I am doing right now some work on chaos theory but that is not a reason “to pollute” a thread which is about classical statistics .
Tom–
I agree some aspects of the problem are intractable. However, I find value in testing the premises that appear to be accepted by a sub-set of the field of climate science. That's why I test the models. Basically, the question I ask is "If we ignore the sorts of problems TomV, Dan Hughes and GerryB worry about, does the model/data comparison show agreement?"
The modelers do work with some sets of ICs. For one thing, when they submit multiple runs for the IPCC, each is initiated with weather from a different year of a control run. (They could probably do even less. If they just flipped a bit on day 1, the butterfly effect would probably cause the trajectories to diverge within a month! But starting with different years seems a better option.)
On the straying thread issue– My “rules” are different from SteveM’s. He’s into auditing, and so likes to stay on the audit topic. I don’t mind straying all that much. (Many bloggers don’t mind.)
Tom
"Second is that I am not sure that I understand what is supposed to imply for you the statement that 'allowed variation are larger than measured variation'."
I am considering measurements.
In your torus example, the allowed variation (for measurements) would be "smaller than or equal to the torus size", whereas the measured variation is whatever a statistical analysis (a standard deviation or what not) gives.
I am trying to define for myself when one can legitimately ignore chaos, the way we ignore quantum mechanics if we are talking of measures much larger than hbar, or special relativity for velocities much smaller than c.
I realize of course that chaos is a much more complex issue.
You say to Lucia "I am doing right now some work on chaos theory but that is not a reason 'to pollute' a thread which is about classical statistics."
I think that it is highly relevant to know whether classical statistics can be used meaningfully on spatial averages or not. Otherwise we are back on counting angels on the head of a pin.
AnnaV
.
I am trying to define for myself when one can legitimately ignore chaos, the way we ignore quantum mechanics if we are talking of measures much larger than hbar, or special relativity for velocities much smaller than c.
I realize of course that chaos is a much more complex issue.
.
I can answer this one or at least formulate a similar question that is more rigorous .
First I'll restrict it to temporal chaos. Spatio-temporal chaos is much more difficult and poorly understood.
The definition of chaotic behaviour is a positive Lyapunov exponent, which means exponential divergence of orbits in the phase space.
That's why it is rather binary: either you have no chaos, and one can't neglect something that doesn't exist, or you have chaos, and you can't neglect it because that would mean the parameters are on constant orbits, which they are not.
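For readers new to the jargon: the standard textbook statement (nothing specific to climate models) is that two trajectories starting a tiny distance |d(0)| apart separate roughly as |d(t)| ≈ |d(0)|·exp(λt), and "chaotic" means the largest Lyapunov exponent λ is positive, so the separation grows exponentially no matter how small |d(0)| was.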
.
So perhaps you are asking instead whether there doesn't exist some asymptotic or perturbational theory that would in a sense "ignore" chaos when chaos exists.
The answer is a clear no.
There are of course perturbative treatments in chaos theory.
But they cannot make the fundamental feature of chaos go away, which is the exponential divergence of trajectories.
A chaotic system is fundamentally unpredictable and nothing can make this feature disappear.
.
Now you may also ask whether a stochastic theory could be substituted for the chaos, which would be very different from the case above.
This one is trickier.
The answer is again clearly no for low-dimensional systems.
For example, the 3-dimensional (the dimensions are in phase space) Lorenz system, which describes a simplified version of convecting fluids, obeys no statistical laws.
So you can neither predict the time evolution of the variables nor the time evolution of their averages, of their standard deviations, or of any functionals more exotic than averages.
You cannot compute a probability of presence of the system somewhere either.
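To make the "no statistical laws" point tangible, here is a minimal numerical sketch (my own toy code, a crude Euler integration with the standard Lorenz '63 parameters, illustration only): two trajectories that start 1e-8 apart end up on completely different parts of the attractor within a few dozen time units.

```python
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # One crude Euler step of the Lorenz '63 system (good enough for a demo).
    x, y, z = state
    return state + dt * np.array([sigma * (y - x),
                                  x * (rho - z) - y,
                                  x * y - beta * z])

a = np.array([1.0, 1.0, 20.0])
b = a + np.array([1e-8, 0.0, 0.0])   # almost identical initial condition
for step in range(1, 5001):          # 50 time units at dt = 0.01
    a, b = lorenz_step(a), lorenz_step(b)
    if step % 1000 == 0:
        print(step * 0.01, np.linalg.norm(a - b))
```

No amount of staring at a single run tells you where either trajectory will be later; the separation simply grows until it saturates at the size of the attractor.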
Imagine that you calculate the average and standard deviation of the positions of a planet in an X,Y plane around a star at (0,0) with X and Y measured independently .
You will find that the planet is in average in the middle of the star 🙂
All is mathematically correct but obviously absurd .
Then you have high-dimensional systems (the N-body problem with N large) where the dimension of the phase space is 6N.
Here you can find a stochastic theory if and only if you have some symmetries in the degrees of freedom.
Typically, isotropy and homogeneity hypotheses eliminate a huge number of dimensions and a few collective parameters may emerge.
That's the case of statistical thermodynamics, and it works: the orbit of each molecule is chaotic but the collective parameters like P and T make sense.
.
Last, I could mention turbulence. First, it is spatio-temporal chaos, so it is something other than the cases considered above.
BUT, if you can get rid of the spatial chaos in a legitimate way, you will be back to temporal chaos only.
That goes again with symmetries like isotropy, homogeneity, scaling, etc.
Kolmogorov tried that in his stochastic turbulence theory, where eddies are supposed isotropic, homogeneous and scaling.
Well, when they are (more or less), at very high Reynolds numbers, then it works (more or less); but when they are not, at lower Reynolds numbers, it does not.
On the other hand, Ruelle and Takens brought a rigorous mathematical proof that turbulence was low-dimensional chaos (i.e., it has a low-dimensional attractor).
But here too it works only under certain conditions, and it is not a general answer either.
So the jury is still out.
I personally think that turbulence covers the whole spectrum from low-dimensional chaos to a quasi statistical-thermodynamical regime, and moves among many regimes depending on many things, especially the Reynolds number.
So this very long-winded answer to the question in this particular case would be: "As long as the spatial extension of the system is small and the isotropy at the microscopic level is high, the microscopic chaos can be described by some macroscopic statistical theory. In all other cases it stays simply chaotic and no statistical description is possible."
Thanks, Tom, that you took the trouble to summarize this for me.
I am being educated.
P.S
Btw, even if I understand your interest in the matter, and there indeed IS relevance to those SD/SE issues, it is still irrelevant to Lucia's approach, so it could be "polluting"/distracting.
Indeed, everything that Lucia is doing here uses the philosophy "Whatever issues X and Y have with the models can be ignored IF the target is to compare what the models (right or wrong) say with what Nature says."
It is a very pragmatic engineering approach that I appreciate much because it enables progress.
Of course, if Lucia finds consistency between models and Nature it still says nothing about the models, because a wrong model can still imply a true result; but if she finds inconsistency then there is real progress, because a true model can't imply a wrong result.
So I would have full understanding if she said that discussions straying from HER target are distracting and drowning the purpose of the thread in interesting but irrelevant issues.
TomVonk–
The reason it’s not distracting is that I can just plow ahead and continue with the analyses I have slated.
All bloggers everywhere have the difficulty that readers will want them to delve into other specific topics. Sometimes I will, but I try to avoid getting derailed from the topics that interest me.
By the same token, I like reading broader opinions! So, I prefer not to try to rein comments onto some bullseye target of "focus". As a moderator, my only concern is potential nastiness, ad homs, and pointless repetition. I figured out how to deal with that long ago. (I wrote a plugin that, for the first "N" minutes, displays those comments only to the person who posted them. When those people are ignored, they go away. Heh. heh….. I've used it on exactly two people.)
A couple of questions. One of the points made in the Santer paper is that the autocorrelation has to be taken into account to adjust the effective sample size. If a model is truly representative of the climate temperature, then shouldn’t the autocorrelation of the model output be similar to the observation’s autocorrelation? There are numbers from over .9 to below .02 in Table 1. Shouldn’t the models whose autocorrelations are dissimilar from the observation’s be discarded?
Santer shows an example in his Fig 1 of multiple runs from a single model averaging out the noise to show the actual mean of the underlying trend. Ok, I buy that, but isn’t there an underlying (and possibly unfounded) assumption that the trends produced by these models all come from the same population? How do we have any confidence that given so few runs, that the models would converge to a mean value like the example he gives?
Yes. If we were to discuss this in the frequency domain, the spectra should match.
In principle, yes. But, as always, you need to consider the uncertainty in the model value and the observations.
Uhmm… let me hunt a bit…
In this post, I compare whether the models' "weather noise" is similar from model to model: http://rankexploits.com/musings/2008/on-hypothesis-testing-testing-%E2%80%9Cweather-noise%E2%80%9D-in-models/
I look at the standard error of the residuals. The residuals contradict each other, so at least some are wrong by this metric. I have a bunch of individual analyses I haven't blogged. I'm diverted to Santer because it will be easier to explain findings. Afterwards I'll be going back to discussing model-model comparisons of "weather noise", including the autocorrelations, and extending to model-weather comparisons. (The general result is that if I apply a test, the models mostly fail. But I need to think to see if there is some problem I've overlooked that would make the failure meaningless. So, it takes a bit of time!)
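For readers wondering what the autocorrelation adjustment mentioned above looks like in practice, here is a rough sketch (my own illustration with synthetic data, not Santer's code) of the usual AR(1) correction: fit an OLS trend, estimate the lag-1 autocorrelation r1 of the residuals, shrink the sample size to n_eff = n(1 - r1)/(1 + r1), and use n_eff when computing the standard error of the trend.

```python
import numpy as np

def trend_with_ar1_se(y):
    # OLS trend with an AR(1)-adjusted standard error (rough sketch).
    n = len(y)
    t = np.arange(n, dtype=float)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]     # lag-1 autocorrelation
    n_eff = n * (1.0 - r1) / (1.0 + r1)               # effective sample size
    s2 = np.sum(resid ** 2) / (n_eff - 2.0)           # adjusted residual variance
    se = np.sqrt(s2 / np.sum((t - t.mean()) ** 2))
    return slope, se, r1, n_eff

# Synthetic monthly series: a linear trend plus autocorrelated "weather" noise.
rng = np.random.default_rng(0)
noise = np.zeros(240)
for i in range(1, 240):
    noise[i] = 0.6 * noise[i - 1] + rng.normal(0.0, 0.1)
y = 0.002 * np.arange(240) + noise
print(trend_with_ar1_se(y))
```

A model whose residuals have a very different r1 from the observations gets a very different n_eff, which bears on the question asked above.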
For Lucia blog in climateaudit
11/1/2008 7:41:28 PM Fred Singer
I start with an apology to all, esp to Steve and Lucia. I am not a blog reader and sort of stumbled into this extended discussion without having read most of preceding comments. So with this disclaimer out of the way, let me report to you on three issues that bothered me when coauthoring Douglass et al [IJC 2007] — and have still not been resolved in my own mind. I am now trying to reply to Santer 17 [IJC 2008] – so if you have any responses, would you kindly also e-mail them to me at *****. Thank you.
Issue #1. Finding the "ensemble mean" (EM) of a climate model: A good example is shown in Santer's Fig 1 (but I have other examples). He shows 5 "realizations" ("runs"), all with different trends, and then gets the EM trend of 0.280 by a simple arithmetic average. Is that fair? I would think you would need more runs and keep averaging until you can demonstrate an asymptotic trend. Just imagine the modelers running out of computer time (or money) and stopping after run #1; he would have arrived at a trend value for the EM of 0.024.
Which brings me to issue #2: I am unhappy about the fact that, as everyone else does, we simply averaged over all models and ignored the fact that the number of runs per model ranged from 1 to 10. Shouldn’t one give more weight to a model average based on number of runs? I had suggested an empirical test, as follows: Start with models that have more than, say, 6 runs. Then add the models with more than 4 runs and see if this changes the overall result – and so on. Or, alternately, should one average over runs instead of models?
Issue #3: You may have noticed that the CCSP-1.1 Exec Summary (authored by Wigley, and including our pal Santer) uses the concept of "range" in comparing models and observations (see pp 12-13). I certainly noticed – and we discourse on this misuse in Douglass et al [2007]. I regret, however, that we neglected to point out explicitly this paradoxical fact: The more models one uses, the wider the "range" and the easier it is to persuade the unwary reader that there is no disparity between GH models and observations.
—
Note: I edited out Fred’s email to prevent spambots from reading it. (I have access regardless.)
Beaker has clearly and from the beginning called out the incorrect use of the SE of the mean as an indicator of a distribution. McI has been very passive aggressive on this. He would NOT call out Douglass as wrong (since they are pole-smoking Heartland Institute buddies), nor of course will he support it, since he knows it was wrong. The incessant posts about Santer are a distraction from this basic point.
Fred Singer, the old dodderer, thinks that the readers need to have it clarified that a larger spread of models makes it harder to disprove the set of them. I think the average (smart) person gets this pretty easily. Beaker nailed you guys when he said large claims need large proof. If you want to say models are all over the place, that is a very different thing than saying we can show inconsistency of the set (versus observation). Only people who have gotten so used to conflating everything into a battle of warmer versus colder would miss this. REAL thinkers, real scientists, can disaggregate and analyze. That's why I roast you pussies over the coals. Even though I'm on the anti-AGW "side". I just won't tolerate butt-poor thinking and sophistry along the way.
RE Fred,
The simple averaging over model runs bothered me as well, as I noted in the CA thread on Douglass, especially when we know that the models have different levels of "skill." In particular, as Lucia and I have noted with regard to Santer, some of the models are driven by historically "accurate" volcanic forcing data, while others are not.
TCO–
On your first comment:
Santer used SE. Beaker suggested their equation was a typo. It's not. Could you explain why Santer is incorrect in their choice of SE over SD? Also, could you enlighten us as to what you believe is "the basic point"?
On your second comment:
Who are “you guys” and in what way did Beaker “nail” whoever they are by saying large claims need large proofs? Which claim are you suggesting “you guys” made, where was it made, and what is large about said claim?
Also, please, my dear little marshmallow, please refrain from supporting your arguments with ad hominems like the word "pussies". Else, I will be required to toast you and turn you into the filling for a s'more. (I trained with the Girl Scouts of America, so you know I have the skills.)
Thanks.
Lucia:
Beaker's original main point (in the discussions on CA, going back) was that Douglass misused SE to characterize the spread of a distribution. It's quite possible that on some other recent, downstream points Beaker may have been wrong.
Sorry, I offended you with my potty mouth.
Lucia:
In terms of "you guys" and "the skewering", it happened on CA during the Douglass paper discussions. And the "you guys" was most of the hoi polloi. Plus I lump Steve (and Ross) in there, since he is a passive aggressive p…erson when he sees something wrong but doesn't want to call his "side" out. That to me is the opposite of a real scientist, a real thinker.
Now, feel free to get back to the Santer typo issue. I'm not trying to stop you from running that snake down its hole. I just had the basic point I wanted to make about Beaker and what he has mainly contributed to the ongoing blog conversation. I made it.
TCO– It’s been established beyond any doubt that Santer also used SE just as Douglass did. Are you objecting to the fact that all these posts are distracting from the fact that Santer and Douglass both used SE? And that the peer reviewers of both papers accepted that as a valid application of the classic “t-test”?
As for the idea you have skewered anyone: Though I often read you congratulating yourself on your acumen, I can’t recollect any comments by “TCO” making any useful points, basic or otherwise. Most of the time, I find it difficult to discern any point whatsoever in your comments.
But…you know.. hey…whatever. Carry on. 🙂
That Santer used SE does not change my point: that Beaker is correct in chiding Douglass, and that this was his original main contribution to the CAosphere. You can have those things together, grasshopper. Capisce?
TCO- SE is always used in t-tests to compare means. SD is not. Douglass and Santer both used t-tests to compare means, and so used SE as required.
So you disagree with Beaker's original points on Douglass? And would do so regardless of whether Santer had ever written a paper?
TCO–
What original point? Beaker is not here. I’m not convinced you know what his original point was and there is no way to ask him to clarify. If you have some point you want to make and you want to know if I agree with it, you are a) going to have to state your point directly (not by proxy) and b) then ask my opinion.
I propose to ignore TCO until his manners improve, and carry on a civil discourse with Steve Mosher: Yes, models are in poor shape. They have widely different "climate sensitivities" (= temperature response to GH gas forcing), non-standardized forcing from aerosols, and completely ignore the major forcing from changes in solar activity. Yet some believe that simply averaging available model results will take care of these problems.
The "fingerprint method," properly applied, does help by looking at the PATTERN of each model (or of each model run), i.e. by noting the difference (surface trend minus troposphere trend), as shown in Fig 5.4G (tropics) of CCSP-SAP-1.1 [2006] (or the identical Fig 9a in the NIPCC report [2008] "Nature – Not Human Activity – Rules the Climate" http://www.sepp.org/publications/NIPCC_final.pdf ). The disagreement between model results and observations is fairly clear. Santer17 [2008] now claims (1) that there is something wrong with the observations (which he had subscribed to in CCSP) and (2) that the CIs are so wide that there is no longer a disagreement. (In CCSP the main authors (Karl, Santer, Wigley, who also appear in Santer17) don't even show error bars.)
Fred Singer
When trying to compare models and observations, this is an important issue. The extremely small numbers of runs makes testing models extremely difficult. In particular, type II or “beta” error is enormous with small sample sizes. This is an issue I plan to discuss with respect to model testing.
With only one model run for a particular model, the computed statistical uncertainty in what that model predicts is effectively infinite. With 2 runs, it can still be very high. For quite a few of the model tests, the limiting factor in testing the models is that there are not enough model runs.
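For a sense of scale (a standard t-table fact, not anything specific to these models): with two runs there is only one degree of freedom, so a 95% confidence interval is roughly plus or minus 12.7 standard errors wide, compared with about plus or minus 2 for a large sample.

```python
from scipy import stats

# Two-sided 95% t critical values: 2 runs (1 degree of freedom) vs. a large sample.
print(stats.t.ppf(0.975, df=1))    # ~12.7
print(stats.t.ppf(0.975, df=100))  # ~1.98
```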
I think the answer is “it depends”. Do we believe the spread in trends predicted by models is mostly driven by biases and uncertainties in models, which would result in different predictions between models even after running many cases? If so, then averaging over models may be reasonable.
Or, do we believe the spread in trends is mostly driven by what I've seen called "internal variability" (aka "weather noise")? If it's mostly internal variability, then it makes more sense to weight the cases with more runs more heavily. Those will give an answer closer to converged.
Unfortunately, I don't know which situation holds. However, it appears the IPCC performed an average over models to develop their projections. This might suggest they believe the model biases are the dominant factor driving the spread in predictions across models. (Then again, who knows?)
I’ll read this more fully tomorrow.
But there is a difficulty with the argument that models generally are ok provided that observations fall in the range of not only all models but every single model run from any model. As time goes by, new nations and new agencies jump into the modeling game. The number of models may increase. Some may be great; some may be wretched.
If screening is light and the range of models increases, it becomes impossible to declare even the worst of models poor under this perverse standard for testing models.
Lucia: Have you read Beaker’s original comments on Douglass on CA? Have you thought and abstracted the main points? (Back in the day?)
I’m not going to bother going back to the original, but Beaker repeats, recently:
http://www.climateaudit.org/?p=4101#comments
‘Steve: This Douglass quote is at the heart of their error:
“A more robust estimate of model variability is the uncertainty of the mean of a sufficiently large sample.
It is a bit like saying that a robust estimator of the variability of the weight of apples is the uncertainty of the mean of a sufficiently large sample. If that were true as we approached an infinitely large sample all of the apples would have to have exactly the same weight, which is obvious nonsense.”
The SE simply is not a measure of the variability of a sample, but that is how they use it.’
——————————
Note that Steve doesn’t engage with the point (to agree or disagree). Steve is evasive,
weasely.‘
Edited by moderator for name calling
Noting that the models are all over the place is DIFFERENT than saying that as a class they are inconsistent with observation. All the hoi polloi (and throwing Singer in with them) are
prohibited word-up, by conflating those two things. Edited by moderator.
TCO–
I read many of beaker’s comments on that thread. I don’t think it’s possible to determine which point was his “main” one. But, if it’s important to your question, why are you unwilling to go back to the original?
The sentence beaker quoted from Douglass is poorly worded. However, I have the paper, and it appears Douglass's intent was to compare the best estimate of the models to the best estimate of the data. Based on the balance of the words in the paper, it appears they used the term "range of models" to mean "the range of model means".
If your (and beaker’s) main point is that a sentence in Douglass was poorly worded, yes, it was. It could have used an extra word. It’s a pretty trivial point.
To compare means, one accounts for the uncertainty in the means. To do this, one uses SE, not SD. Santer and Douglass both use SE to do this test. It's a classic test taught to undergraduates. The peer reviewers recognize its utility.
On your other point: Of course noting that models are all over the place is different than saying they are inconsistent with observations. Models have more than one problem. Recognizing the existence of two separate difficulties is not conflating them.
Once again, my little marshmallow: please refrain from the “f” word, and calling people names.
I feel uncomfortable that TCO apparently agrees with me, more or less, but I’ll try one more time. Yes, the SE is used to compare means of two samples containing multiple items to see if one can reject the hypothesis that the items in each sample come from the same population. But while you may be able to calculate a model mean and SE either by averaging multiple runs of the same model or averaging results from multiple models or some combination of both, you only have one realization of the actual climate so you have no obvious way to calculate the SD or SE of the climate, excluding measurement error. Sure you can make assumptions about what’s weather noise and what’s not and come up with some number, but you have no proof that your assumptions are correct and that your calculated climate SD or SE has any meaning at all. Even estimating the measurement error, as done by Santer et al., requires making unprovable assumptions like an underlying linear trend and a particular noise model. So you don’t have two means to compare. You have a mean and an individual item. The SD of the model mean is then the proper way to test if the model(s) and the climate are from the same population, not the SE of the model mean. So both Santer et al. and Douglass et al. are wrong. If you want to talk about skill or utility, that’s different. But neither Douglass nor Santer were about that.
That’s the best I can do. If you don’t agree, and I dont’ really expect that you will, that’s fine and I suggest we leave it at that.
DeWitt–
What you say is correct. We only have one realization. But the fact that both SD and SE are difficult to compute for the observations is a different question from the one being debated.
The one being debated is “Assuming we can compute it, should SD or SE be used”. The answer depends on the hypothesis to be tested.
When testing the difference between means one uses SE. That's what Douglass used for the models. That's what Santer used for the models. They attempted to estimate it for the observations. (However, for the observations SD = SE, because there is only one realization. So, for the observations, the "use SD or SE" debate doesn't arise.)
The SD of the model mean is then the proper way to test if the model(s) and the climate are from the same population, not the SE of the model mean.
The SD of the model mean is never the proper way to test whether the means (or best estimate) of two populations match.
There are many ways to test models and data.
1) You can ask whether the data is an outlier compared to the models. For that test, you can use the SD of the models. This question is equivalent to asking whether my weight is an outlier compared to the sample of all men's weights: it's not. Neither Santer17 nor Douglass asked this question. However, if someone asks it, that's fine. They can use SD.
2) You can ask whether parameters of certain distributions match. For example, you can ask if the means (aka best estimates) match. You can ask if the standard deviations match. For the first test, you do a t-test, which uses SEs. This tests whether the average of men's weights matches mine. You will discover that the average man weighs more than I do.
These are different questions, and so have different answers. Santer and Douglass both say they are testing whether the best estimates of models match the data. So, yes, they are asking question 2. They are asking a sort of skill or utility question.
There is no rule that says because question 1 exists we may not ask question 2. The fact is, question 2 is a more sensitive test. That's why it's widely used in process control in industry. Question 1 is also used to test for outliers.
The problem beaker has is this: He is insisting on a method without being willing to state the question being asked. Both Santer and Douglass are asking the second type of question. So, SE is correct for their analyses.
If TCO or beaker want to ask the other question, that’s fine. But it doesn’t make Douglass or Santer’s use of SE incorrect. It just means TCO and beaker want to ask a different question.
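To make the two questions concrete, here is a toy sketch (synthetic numbers I made up, not the Santer or Douglass data, and ignoring observational uncertainty for simplicity): question 1 compares the observation to the spread of the models using SD; question 2 compares the multi-model mean to the observation using SE.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
model_trends = rng.normal(0.25, 0.10, size=22)   # made-up per-model trends
obs_trend = 0.12                                  # made-up observed trend

mean = model_trends.mean()
sd = model_trends.std(ddof=1)
se = sd / np.sqrt(len(model_trends))

# Question 1: is the observation an outlier relative to the spread of models?
print("z using SD:", (obs_trend - mean) / sd)     # modest z: not an outlier

# Question 2: does the multi-model mean differ from the observed value?
t_stat, p = stats.ttest_1samp(model_trends, obs_trend)
print("t using SE:", t_stat, "p =", p)            # typically a significant difference
```

The same numbers can pass question 1 and fail question 2, which is exactly why the two tests answer different questions.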
I think this may be the point of beaker’s (and my) argument. I (and he, IMO) maintain that Douglass et al.’s use of the term “consistent” means they are in fact answering question 1 with a test constructed for answering question 2. If one wants to discuss skill or utility, one should be more specific. Consistent doesn’t cut it in this respect. Consistent implies to me and apparently to beaker as well that the measurement is or is not an outlier compared to the models, i.e. question 1.
Question 2, if properly constructed, is the interesting question. I think your approach to that is much more logical and is also more interesting than either Douglass or Santer.
DeWitt–
Ok. At least I think we've figured out where we disagree. So, now I have to ask you: why do you think the word "consistent" can only be used when answering question 1 and not question 2?
Here is my view on the meaning of the labels “consistent” and “inconsistent”.
If models are truly, entirely consistent with observations, then all statistical moments of the distribution must match. If these distributions don’t match, then models are not consistent with observations. That is: The distributions are not in agreement or harmony. They differ in some way.
Consistent is a broad term, not a narrow one.
So, using what I think is the normal (that is, broad) definition of "consistent": if the models converge to a mean different from what is consistent with the possible mean for the observations, then the models are inconsistent with observations in some way. Specifically, the distribution of their projections has a different mean from the observations. The models are biased.
This is a form of inconsistency.
It is true that this use of "consistent" is related to skill. But so what? We have lots of nuanced terms in English and science. The fact that the idea of skill exists doesn't preclude also stating the result of a t-test using the commonly used terms "consistent" or "inconsistent".
As for word choice: I think failing test 1 would tell us the models are so pathologically inconsistent that they are utterly worthless. But failing question 2 still warrants the term "inconsistent". They are biased. They may be only a little biased or very biased. There could, however, be some overlap between the predictions and the observations. But there is also some inconsistency, in so far as some predicted events may fall outside the range of what is possible and/or, on average, the predictions are biased. Both are forms of inconsistency.
pah 🙂
you have all got it wrong about SD and SE !
Any biologist knows that, when you are putting ‘error bars’ on a graph, that the SE is smaller, and hence infinitely superior 🙂
per
lucia,
Ok. I’ll buy that. In fact I was going to post something similar. Giving Douglass et al. and by extension Santer17 the benefit of the doubt and/or a broad definition of consistent/inconsistent, does the answer that the models are somewhat, but not pathologically biased tell us anything useful? I would say no because the models will always be biased in some way at some level. That, IMO, is part 2 of beaker’s argument. Santer17 is a band-aid that temporarily fixes the bias problem by incorporating a large apparent uncertainty in the measured data. But that uncertainty, as you have pointed out, may be more apparent than real and will eventually get smaller anyway as passing time increases the length of the data set and decreases the uncertainty in the trend. Or if you use annual average data with low lag 1 auto-correlation to start with as done by Kenneth Fritsch at CA.
That still leaves unanswered the question whether the models are too biased or insufficiently informative to be useful for policy decisions. Going back to your example, does knowing your husband is above or below the mean weight for all men tell you what size long sleeved dress shirt to buy for him?
Lucia, I am referring not just to THAT THREAD, but to the history of Beaker’s comments, in particular to his remarks on the threads about the Douglass paper when it came out. Do you have a block to understanding that? Also, I am capable of reading multiple remarks and threads and coming up with a “main point”. It’s called abstraction. It shows thought and discrimination to try to do so. When you read multiple things, you put them into a pattern. Heck, wording like “the heart of the matter” ought to be a clue. Also, the repetition of Beaker’s comments is a clue. Also, the fundamental nature of the point. Also, me pointing it out to you. But if you are just genetically incapable of this kind of thinking (what is the main point within a set of remarks), then stick to the math analysis of factoids. That still gives value, also.
Oh…and SE of the mean tends to zero with sufficient samples…but there is only one realization of the “trial Earth”.
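A quick numerical check of that arithmetic (toy numbers, reusing the men's-weights analogy): as the sample grows, the SD settles near the population spread while the SE of the mean keeps shrinking like 1/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(5)
for n in (10, 100, 10000):
    weights = rng.normal(180, 25, n)      # toy "weights of men", population SD = 25
    sd = weights.std(ddof=1)
    se = sd / np.sqrt(n)
    print(n, round(sd, 1), round(se, 2))  # SD stays near 25, SE keeps shrinking
```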
Lucia: I am glad that you agree with the problem in conflating two different issues (the inconsistency of model versus observation, and the inherent all-over-the-placeness of the model grouping). The hoi polloi cheering squad scum on CA (Steve's little "plausible deniability" brownshirts) have not been good at differentiating these two. Singer starts down that path as well, in the recent comments here. Maybe you can hit them while I hold them. Figuratively.
Admin: please cut out the name calling, Nazi references and allusions to physical violence.
I say yes; the fact that the difference is statistically significant has meaning. The statistical test is designed to help us avoid fooling ourselves when differences fall within the scatter and if statistical significance is shown, we’ve learned something meaningful.
The steps involved in determining if a difference is real are
a) to estimate how large it is and
b) determine if this is statistically significant.
Showing (b) is important. If that's not shown, then the models' predictions are indistinguishable from perfect. You seem to be arguing (with reference to beaker) that because this particular test doesn't tell us everything, it tells us nothing; that's wrong.
Of course not. Why would it? More importantly, why try to make this an all or nothing question?
The fact that the average weight of all men is a biased predictor of my husband's weight does suggest that I shouldn't use the average of all men's weights to estimate how many lead bricks Jim in particular needs when flying a glider plane. And if I wanted to predict something completely different about Jim, the fact that the average man is a biased indicator of Jim's weight would suggest that I also shouldn't use the average of all men's shirt sizes to estimate his size when buying a shirt.
If, lacking proof, others insist I must order based on the average of all men, then demonstrating that this doesn't work is useful information. I now know I need to come up with a different method to estimate Jim's shirt size.
Back to climate models: showing the models are biased in some important way, and that the bias is statistically significant, tells us something. The fact that a particular test doesn't tell us everything about models doesn't mean we have learned nothing.
The reason there are many types of tests is that there are many questions.
It may tell us something, but is that something new and interesting? I don't think so.
DeWitt–
Well… that's where we differ. I think the SE test tells us something interesting, and the SD test tells us nearly nothing of any interest whatsoever. Or, more precisely, by the time anything fails an SD test, it is so obviously, pathologically wrong that no statistical test is required to determine anything.
The SD test is statistically inefficient when used to determine whether models are off track.
Had an earlier post that disappeared. Got worried about being censored.
Lucia, my main point was about what Beaker's main point was. I do think it's possible to abstract said point. Of course, this is not a mathematical certitude. Yet still, I think it is relevant to thinking about things. I have followed Beaker's remarks from the beginning and have noted words like "the heart of the matter". This helps my faculty of analysis.
On content: The problem with SE is that in its limit, we get certitude. Yet we only have a single sample of the parallel earths reflecting El Nino, the butterfly effect, etc.
Lucia, do you agree that this is wrong? Trivially so? In such a manner that a decent first year student in a stats class could see the logical error?
“A more robust estimate of model variability is the uncertainty of the mean of a sufficiently large sample.”
Response: TCO, I already discussed this in #6223. – Lucia
I understand the GMST estimates from the GCMs to be calculated as follows. For each spatial location (X,Y,Z) of interest for each day in the simulation the max and min temperature are determined and the average calculated to get T_avg = (T_max + T_min)/2. These are then area-weighted to get the Global Average value.
Note that this is significantly different from averaging values that are actually on the local-instantaneous temperature trajectory at the spatial locations. In fact, this T_avg will not be on the trajectory.
If the 'chaotic response' of the temperature is the rationale for using ensemble-averaging, how can that rationale be applied to quantities that are not on the trajectories?
Dan–
In principle, anything that can be predicted can be compared to a measurement. The only requirement is that the measurement correspond to the predicted item.
I'm not sure why you think T_avg = (T_max + T_min)/2 isn't on a trajectory. All a trajectory is is the time series for a particular quantity. In principle, each point on the globe has a Tmax for the day and a Tmin for the day. These can, in principle, be measured and averaged.
So, you can have a trajectory for T_avg(x,y,z) defined as above. You can have a trajectory for its average over the surface of the globe.
Neither the fact that these don't match the instantaneous surface temperature at any point, nor the fact that we don't use these particular averaged values to predict values at the next time step, prevents us from defining a trajectory for them.
Of course, there are difficulties comparing computed averages to measured values if the two aren’t defined the same way. But that sort of difficulty is not unique to climate science. How to measure the thing you want to test is a universal problem in experimental methods.
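For concreteness, here is a toy version of the quantity being discussed (hypothetical random gridded fields, not GCM output): build T_avg = (T_max + T_min)/2 at each grid point, weight by cos(latitude), and the resulting global-mean series is a perfectly well defined trajectory, whether or not any single point ever takes that value.

```python
import numpy as np

nlat, nlon, ndays = 36, 72, 365
lats = np.linspace(-87.5, 87.5, nlat)
w = np.cos(np.deg2rad(lats))[:, None] * np.ones((nlat, nlon))
w /= w.sum()                                     # area weights summing to 1

rng = np.random.default_rng(2)
tmax = 15 + 10 * rng.standard_normal((ndays, nlat, nlon))
tmin = tmax - np.abs(5 + 2 * rng.standard_normal((ndays, nlat, nlon)))

tavg = 0.5 * (tmax + tmin)                       # (Tmax + Tmin)/2 per point per day
global_mean = (tavg * w).sum(axis=(1, 2))        # one number per day: a trajectory
print(global_mean[:5])
```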
Lucia:
I am not Beaker.
I think the Douglass statement is more than poor wording; it's a fundamental flaw. DeWitt thinks so as well. I think if you actually take that wording at face value, it is easy to see it as a flaw. Then when you look at the practice in the paper, which is a claim of consistency, you see the fundamental flaw as well.
If we can at least agree that the statement as worded is WRONG, and fundamentally so, then we can talk about whether it was just poor wording or related to a flaw in the paper itself.
Sorry to belabor it. But this is why I mention that it was Beaker's "main point" or, as he says, the "heart of the matter". There are undoubtedly other nuts to chew… but it amazes me to see Steve McI not even address this. He doesn't want to get tagged as supporting something wrong, but doesn't want to call out a "buddy" either. I am glad that you at least engage on it, even if you are wrong.
[I searched CA for the phrase "the heart of the matter" and "beaker" and could not find beaker using that phrase. If you want to discuss Douglass, discuss Douglass and explain why what they did is wrong. They appear to have compared the observations to the projections. They used the SE for the projections. This accounts for the uncertainty in the best estimate of the projection. They seem to have given short shrift to the uncertainty in the measurements.
This difficulty with the analysis in Douglass has nothing to do with the SD/SE issue beaker discussed, and it has nothing to do with the nit-picky wordsmithing you are dwelling on. This was gone over and over at CA, where, for some reason, you appear to have remained silent rather than jump in and support beaker.
Also, unless you claim psychic powers, please omit your odd theories about Steve's secretly held, unstated opinions. Steve posts his opinions, not yours. I know it must be frustrating to you that you can't figure out how to support your arguments except by use of ad hominems or claims to authority, but that doesn't mean you can force those who can actually think to make your arguments for you. – Lucia]
Hello to all. I don’t normally read this blog but it looks like good fun….
As for this thread, I wanted to make a few observations/comments. Yes, of course you use the SE to see if the means are different, as this is the 'error' of the mean. For the theoretical n-tends-to-infinity Normal distribution the error in the mean is 0, but for a finite sample there will be an SD and an error in the mean itself. Okay, this wasn't my main point though:
With regard to the whole modelling side of things, it appears that many things are backward. The Santer paper makes a stab at showing that there is not that much difference between the models, without stating the obvious: that there is very large error associated with them, hence we shouldn't have a lot of confidence in them. To some extent this is echoed in the measured data.
With regard to the beaker hypothesis, I now understand what he/she is talking about: If you assume that the models use a subset of all possible parameterizations and that this is significantly less than the total set (completely random combination), and you assume that ENSO etc. is noise, then provided the parameterizations are 'reasonable and allowable under physics', the Earth is simply another model, or more precisely should be part of the normal distribution of models. You can and should use the SD, as the Earth will just be part of the group. It's a t-test with a sample of one.
However this assumption has some serious problems:
1) To justify the spread of models, you must have shown in some way that the variation of the parameters lies in a tight set that is reasonable and allowable under physics, and is not a random variation; otherwise you have basically shown the trivial solution of all processes, i.e. that any m-vector space can be bounded by the sum of n other random and independent vectors.
2) The models don't appear to include the variation of CO2 forcing. Has this sensitivity been done? I don't think so, so the models have shown variation but have not shown that there is a significant change (and the SE could be used for this) between models with no CO2 forcing and those with it, both including other parameterization variance.
The Beaker method is similar to lots of other physics methods where you assume that the models have to a large degree captured the underlying principles (in materials science, Neumann's principle is an example of this in extremis) and that an observation should be driven by the same processes, hence you compare measured to model.
I agree with this except that, as I have tried to say on CA (and I saw it was picked up here), to assume that ENSO is noise is a MASSIVE assumption and hence biases the methodology to a serious degree. You then don't know the underlying process and need to compare model to measured.
Add in the fact that the variance of model parameters seems random, and it is no surprise the error bars are so large. This doesn't show anything.
David Douglass's idea that you have to screen model runs by restricting to those runs that agree with certain measured boundary conditions, or even ENSO variation, is a more sensible way to go. The trick is then seeing what range of parameter variations characterises this set. It may bound the models in a better way, or it may be random. At the moment people are discussing a lot of bark, not even the trees and certainly not the wood.
MC (Comment#6351) and others
Additional information relative to my Comment Dan Hughes (Comment#6088) above.
I have taken a zeroth-order rough cut at what can happen when constructing an ensemble a la IPCC here.
All comments will be appreciated.
Hi MC–
I try to have something for everyone. 🙂
I'm curious: why do you think ENSO isn't noise in the sense of "random variability"? Do you think the timing of ENSO events is likely to be correlated across models? Or triggered at some particular time on all possible hypothetical earths experiencing slightly different initial conditions?
Lucia, there are a number of reasons I don't think ENSO is noise alone:
1) Not all physical processes in the weather/climate are understood, nor are their interactions with each other, hence the bold statement that ENSO is noise is a big assumption. I work with plasmas, which are multi-parameter systems, and we see standard acoustic and chaotic behaviours that are not totally understood existing together. Things like Cantor-dust oscillations and the like. We don't fully understand why, and plasmas have been studied for over 60 years. Yet people are so quick to dismiss ENSO in climatology as being just 'noise'.
2) The basic fact that as a planet we are at the will of the Sun, and hence a lot of our weather exists because the Sun is involved. And as someone who studied astrophysics, I know the Sun and its interaction with the planets is continually being evaluated. It seems that people have dismissed this as not relevant based on SST measurements that do not appear to have been taken properly or continually evaluated with enough scepticism that errors might have crept in. The fact that Anthony Watts' research (amongst others) has shown that there could be discrepancies in SST due to station placements requires a careful analysis beyond the shoe-horning of trends to fit that familiar proxy-generated sporting implement. There is a large degree of tunnel vision going on.
3) More importantly, as a scientist and a mathematician, my gut instinct tells me to be careful about dismissing what appears to be a structure as noise. Proper noise, i.e. randomness, is a very powerful concept in physics and maths and requires a much greater level of insight than climatology is offering.
4) Lastly, I take the Richard Feynman approach ‘DISREGARD’ to results alone. Look at the method and assumptions as well. Beware of paradigms.
So as for the hypothetical Earths, this idea falls down on the first assumption. We only have this planet, I'm afraid. If you want to build a hypothetical set of Earths then you must already know a great deal about the dynamics of the climate, and hence a model should be able to predict the actual Earth with good accuracy. Because remember it's not just the troposphere trend; it's also the SST, it's the precipitation over the Rockies, it's the levels of ice in the Antarctic, etc. etc. There are a lot of observations to match and get right before we can say we understand what is going on.
And this takes time and it involves many good data sets and well-performed temperature reconstructions to get on track with the physics and nuances of the models.
So I would rather see a large piece of humble pie in these papers, admitting that we really don't know much, rather than arguing about semantics. I would also like to see the model runs without CO2 forcing. In fact, as I think about it right now, this is the natural extension to Santer et al. They have shown the models have a lot of uncertainty, so does that uncertainty bound the case of no CO2 effect? This may be quite the elephant in the room.
MC–
I agree there is only one earth. But, to test models against observations, we need a "model" (i.e. an idea) of how to test them. The "hypothetical earths" idea is a convenient frequentist notion.
On this:
There are actually quite a few of us around here who dislike the term "weather noise", which seems to be used at climate blogs. I use it because the term is used. However, I should note that in fluid mechanics, no one calls turbulence "noise". Turbulence has structure. The only thing it shares with true "noise" is the tendency for certain features to average out under certain circumstances. (Example: when fluid flows steadily in a pipe, the velocity at any point will vary with time, but exhibit an average over time. We don't call the deviations "noise"; we tend to call them "fluctuations". So, we'll have "mean" and "fluctuating" components.)
There is no doubt ENSO is a structure. The only way in which it is “noise” is that it might average out over “realizations”. (Then again, if it is triggered by something, maybe not.)
Model runs without CO2 forcing are available. They are called control runs.
I agree with some things, Lucia, but on 2 points: proper noise is invariant under frequency and temporal transformation. Equally, in mathematics it means any point will map to any other point within the real-number space, and each new map will have a finite difference from the last point, even tending to infinity. That's what I mean when I say noise is a powerful concept.
For fluid flow, the random variations modelled (Boltzmann distribution usually) are defined within a bounded distribution. Noise is not bounded in the same way, hence in CFD or DSMC it is not exactly random noise in the proper sense. It is fluctuation or a perturbation around a number of fixed states. This is because it matches measurement.
The 2nd point is that hypothetical Earths only work (as in quantum mechanics) when the allowable states are 'renormalized'. This was the biggest problem with QED, as they had to again bound the distribution by employing probability functions. How can we do this with hypothetical Earths? Not all realizations have equal probability, so how do you screen for them?
Lucia:
The actual phrase is "heart of their error". Sorry. I thought you would get it from my previous post, 6218. I will be more exacting in the future.
And I say again, Beaker's main deal, his main interest, the BULK of his remarks on SE/SD, were in the context of the Douglass paper. That is VERIFIABLE, if you go back and read those threads.
This was my simple point: that Beaker's main point was the flaw of calling the SE of the MEAN (which with infinite sampling goes to zero!) the same as the standard deviation (which shows the spread of the distribution of outcomes).
So… your simple point is what? That on a thread discussing Santer, Beaker chose to discuss Douglass, without clarifying which paper he was discussing? And that his main criticism of Douglass was a point about semantics? FWIW: I have no intention of wasting my time hunting down beaker’s comments at CA in order to confirm what he said is as trivial as you suggest. – lucia
Lucia:
It's your blog, and if you give me a direct order not to comment on Steve McI, I will follow your order. I think my remarks are on target–I have a heck of a lot of observation behind them. This is another occurrence of a trait that I have seen before and written about before: a failure to call out his "side", his buddies, for flaws in argument. An earlier instance of the same behavior was with the Loehle paper (which had severe flaws); instead of calling L out explicitly, full stop, he said "IF (emphasis added) Loehle is wrong, so is Moberg." Well, that was bullshit, since Steve had already SAID Moberg was wrong. That little "if" was a sneaky lie. No wonder Steve claims to like Clinton. They were both equivocators. Well, at a military school, intent to deceive is LYING. And you get your ass kicked out by an honor board. The rhetoric games don't save you, since they rationally consider it a lie if you intend to deceive. You can even get booted for non-verbal lies (presenting a fake ID, for instance).
My remarks are also important. We need to weed silliness and dishonesty (even equivocating dishonesty) out of the skeptic world. The point of revealing analysis is to unpeel the onion to understand things.
I don’t think that I should have to post opinions of Steve only on his blog. That is too much “his dojo”. But if you give me a direct order not to comment on him, his writings, I will obey.
My point is that you sound like an idiot when you visit my blog to complain that Steve won't make some mysterious point for you.
If you want to make some point, make it. I have absolutely no idea what point you wish Steve would make. Quite honestly, given your unwillingness even to state your point yourself, I suspect the reason SteveM doesn't advance whatever point that might be is that it's an incoherent, trivial point.
That said, carry on.
I figure I'll just let people read your comments and try to figure out if they have the slightest clue what point you are trying to make about Douglass, Moberg, Loehle, Steve, etc.
-lucia
Lucia:
1. My point is that Beaker’s remarks are better understood in the context of his criticism of Douglass. That paper came out first, and Beaker made the SE/SD point first at that time. I noted that, since I think it’s helpful in understanding the thread of discussion.
What's to understand? SE is the correct thing to use when testing the best estimate of one mean against another best estimate of a mean. That's what Douglass did.
2. Have you read the earlier Douglass threads on CA?
Yes.
3. I disagree that Beaker’s point is semantic. It is fundamental.
SE is the correct thing to use when comparing best estimates of means.
4. Thank you for revising your Steve McI prohibition to allow me to continue to abuse him. I think he works on very interesting things, but that there are some real flaws with his work (for instance, but not restricted to, "not publishing"). He's also had a habit of revising his posts, of refusing to acknowledge errors, etc. All that, while still doing yeoman work on going through code and papers, etc. I think it's important to call out his failings, or other places where our side is dishonest, since the hoi polloi tend to jump on things as "proof". Heck, most of them seem to think that McI has demolished climate science, when the guy has not written a paper for over 3 years and when a slew of his posts are repetitive in content or much longer than needed to display insights (loaded down with asides and adjectives and general personal bullshit).
Well… yeah. You want to post illogical, incoherent comments at various blogs under the cover of at least two pseudonyms. Go right ahead.
But what’s with your “our side”/ “their side” delusion? And so what if you don’t like SteveM’s writing style? And why keep telling us he should bend to your will? Do you think anyone takes you seriously? Sheesh. — lucia
TCO,
I've now had time to find the comment containing the phrase you think is revelatory, and which, evidently, is important to the point you wish to make (but which you are too lazy to link, and thus to reveal to readers what point you are trying to make).
The beaker post with the phrase "heart of their error" is here.
Beaker says this:
Beaker’s analysis is incorrect.
If one accounts for the uncertainty in the best estimate of the mean based on the observation, using SE for the models works out just fine. In fact, use of SE is the classic way to do this.
For this reason, Douglass and Santer both use SE to test the difference in the means, and the peer reviewers evidently had no objection to this classic, well-recognized method called "the t-test". It's discussed in the math book I used sophomore year in college.
lucia,
But then the argument comes back to how you estimate the uncertainty of the observation. It's still not at all clear to me that any within-the-observation-period measure of variability can be shown to be a reasonable estimate of the variability between different observations or realizations or whatever. In the limit of a perfect model, the between-model-runs SD would be the best estimate of the SD of any single observation. I think. Maybe. I don't really know, but I'm uncomfortable with any method that appears to always reject even perfect models in the limit.
Let me add that a method that only rejects pathologically bad models isn’t much good either. Besides, unless you can get the modeling community to pay attention, you won’t accomplish much anyway. Showing that the models and observations fail a test the modelers came up with (Santer17, e.g.) is a good step in that direction even if the test itself may not be all that good in theory for one reason or another.
DeWitt–
I too would be uncomfortable with such a method. However, using SE for the models does not cause this to happen. Beaker claimed it does, but that claim is simply incorrect. It's forgetting to include any uncertainty for the observations that causes this to happen.
These are two different issues.
When using a t-test, either for a single value of a difference between two values, only models that are wrong are rejected in the limit of perfect models. Others are rejected at the rate stated by the confidence intervals. (So, if you pick 95%, you reject 5% of the time.)
If you wanted, you could gin up synthetic data, test this, and see it. The problem with beakers argument was…. Well, it’s hard to say because he was vague. However, it at least appears he was assuming that in addition to using SE, people would also forget to include the uncertainty in the observation.
But, if a model is correct, the best estimate of its mean underlying trend will converge to the true underlying trend. If, when making the comparison to observations, you include the uncertainties in the observation, the problem beaker described doesn’t materialize. This is because
a) we aren’t comparing the models to a single point value for the trend, we are comparing to the range of possible underlying climate trends consistent with the observation and
b) if the models converge to something outside the range of what is physically possible, they are wrong.
Will we have problems if we forget to include the uncertainty from the observation? Absolutely. Then we have the problem discussed by beaker. However, beaker mis-identifies the source of the problem. It’s not the SD/SE for the models, it’s forgetting the uncertainty in the observations.
Will we have problems if we mis-estimate the uncertainty in the observation? Sure. If we overestimate the uncertainty intervals for the observations, we’ll falsify too infrequently. If we underestimate them, we will falsify too frequently.
If you want to discuss how to estimate the uncertainty in the observations, we can do that. (In fact, I’ve posted numerous posts about this.) It actually can be shown that if the residuals are AR(1), ARMA(1,1), white noise, etc., we can estimate the uncertainty in the underlying climate trend based on one observed trend.
If that question is the hurdle for you, we can discuss that. But… it has nothing to do with the SD/SE issue beaker was talking about! 🙂
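To make that last point concrete, the sketch below applies the usual effective-sample-size adjustment for AR(1) residuals to a made-up series. The recipe (shrink n by (1 − r1)/(1 + r1) before computing the trend’s standard error) is the standard textbook one; whether any particular paper implements it exactly this way is an assumption on my part, not a quotation of their code.

```python
# Sketch: uncertainty of a trend from ONE series, assuming AR(1) residuals.
# Standard effective-sample-size recipe; the example series is entirely made up.
import numpy as np

def trend_and_adjusted_se(y):
    n = len(y)
    t = np.arange(n, dtype=float)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)

    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]   # lag-1 autocorrelation of residuals
    n_eff = n * (1 - r1) / (1 + r1)                 # effective number of independent points

    s2 = np.sum(resid**2) / (n_eff - 2)             # residual variance with reduced dof
    se_slope = np.sqrt(s2 / np.sum((t - t.mean())**2))
    return slope, se_slope, r1

# Synthetic "monthly anomalies": a small trend plus AR(1) noise
rng = np.random.default_rng(1)
n = 240
noise = np.zeros(n)
for i in range(1, n):
    noise[i] = 0.6 * noise[i - 1] + rng.normal(0, 0.1)
y = 0.002 * np.arange(n) + noise

slope, se, r1 = trend_and_adjusted_se(y)
print(f"trend = {slope:.4f} per step, adjusted SE = {se:.4f}, r1 = {r1:.2f}")
```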
DeWitt–
The thing is, the test is actually correct in concept. There are some assumptions underlying the Santer test, but they have nothing to do with the issue beaker (and now TCO) is worried about. What beaker says happens just doesn’t happen– provided you correctly estimate the uncertainty in determining the underlying trend based on an observation.
lucia,
I don’t think we’re all that far apart conceptually. I agree that beaker was right about Douglass et al. for the wrong reason and hence failed to analyze Santer et al. correctly. Where I still disagree is that we can always use ‘noise’ models with parameters derived from one limited time series to estimate the variability between different time series. YAFA, it’s like attempting to determine the variability of the weights of the apples in a barrel from the weight of one individual apple. That’s an oversimplification. Better would be estimating the variability of the weights of the individual apples in a barrel of apples if you have the growth curve for an individual apple. Or maybe even better, the variability of the slope of the growth curve for all apples based on a noise model that uses the residuals from a linear model of growth calculated from the growth curve for one apple. That’s probably not right either, but it’s at least somewhat closer to where I think we differ.
DeWitt–
Santer makes an assumption that the noise is AR(1). It’s not entirely possible to test that based on limited data. So, you are at least partly right that we may not be able to estimate variability– we can only do it conditionally, contingent on assumptions.
That said, making assumptions is not unusual. Also, if we can’t make them, we really can’t test anything at all.
But I do think we agree: What we are discussing is not the SD/SE issue beaker discussed.
Dewitt (and TCO) – I followed this discussion on both CA and here and am somewhat flabbergasted that you two still believe that the SD is somehow the right parameter for comparing the mean of the multi-model ensemble to the observed trend. Conceptually, it seems quite simple. According to the modelers (e.g. Gavin Schmidt), the reason for using the multi-model ensemble is that the ensemble average seems to fit observations better than any single model. They aren’t quite sure why the multi-model mean seems to work better but, IIRC, Gavin said something like one model might be top five in matching one thing but bottom five in something else. I guess the averaging gets rid of the highs and the lows and gives you something in the middle. Less than perfect, but overall better than any individual model.
Given that the point of using the multi-model ensemble isn’t to see the range of model outcomes but rather to calculate an average, it makes perfect sense to use the SE of the multi-model mean to compare to the observations. For the observations, as I understand it, the SD and SE are essentially equal.
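In code, the arithmetic behind that point is just this (numbers invented; reading the last sentence as “a single realization, so its SD and SE coincide” is an interpretation, not a quote):

```python
# SE of the multi-model mean vs. SD of the individual models (invented numbers).
import numpy as np

trends = np.array([0.12, 0.31, 0.18, 0.26, 0.09, 0.22, 0.28, 0.15])  # hypothetical per-model trends
sd = trends.std(ddof=1)             # spread of individual models
se = sd / np.sqrt(len(trends))      # uncertainty of the ensemble *mean*
print(f"SD = {sd:.3f}, SE of the mean = {se:.3f}")
# With only one observed realization there is no ensemble to average over,
# so its "SD" and "SE" end up being the same number.
```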
Dewitt – quit worrying about estimating the variability of the weights of apples from one single apple – that is a poor analogy which attempts to put the model output on equal footing with the observations. The observations are the goal which the models attempt to match. In other words, the observations are the only true apple and the models are statues made of styrofoam, clay, wood, plastic, marble, etc. The models are not the apple, but mere attempts at recreating the apple as best as possible, worm holes and all.
Bob North (Comment#6507) November 11th, 2008 at 12:02 am,
I don’t have a problem any more, and haven’t for some time, with the use of the SE of the ensemble mean. beaker was correct about Douglass et al. being wrong, but his analysis of why Douglass et al. was wrong was also wrong. Equation 12 (I think) in Santer17 is the correct way to test model vs. observation.
The problem I have is with estimating the SD of the observation to use in the equation. If models can be improved, at some point (which may never be reached) the SD of a set of runs for an individual model, or the SD of an ensemble of models, will be a better estimate of the SD of the observations than any measure of variability within the single realization of the climate that we have. I’m talking about the SD between climate realizations, not measurement error or variability within the single realization we have. But I don’t think we’re at that point, or maybe even anywhere near it, so assumptions have to be made about the relation of the variability within the observation to the variability between potential observations in order to calculate an observation SD that tests whether the models have any skill at all. We just need to be very clear about what is assumed and the possible errors involved.
Am I correct that models are not initialized to current conditions but rather spun up from some crude state for a century or so without forcing until a more or less stable state is obtained? That would explain the difference in absolute global mean surface temperatures between models and between the models and the real world. That would also seem to me to make it even less likely that the mean of either an ensemble or an individual model would ever converge on the current climate.
DeWitt–
Yes. If the models were perfect, we could use the model SD for the observation SE. But it’s not the classic way to test. One of the main reasons this isn’t done (outside climate science) is that if you compare mean “A” to mean “B” but replace the standard deviation of “B” with the value from “A”, you lose information. So, the statistical test becomes more error prone. This is a general issue and applies even outside climate science. 🙂
But with the models, there is a worse problem: the estimate of “weather noise” in the models is absolutely not robust. The variability is different in each model. Moreover, for those models with enough runs, it’s clear the average results differ. So, you just can’t replace the weather “SE” with the model “SD”. (If someone really proposed that, they could at least do it for individual models. In that case, we could run a test on the “weather noise” itself– which I have done in some posts, though I got a bit sidetracked. Still, some models have nutty “weather noise”, and it’s clear they disagree with each other.)
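As a toy illustration of why that substitution is error prone (again my own construction, with arbitrary numbers): if the models get the weather noise wrong by some factor, a test that borrows the model SD as the observation’s uncertainty drifts away from its nominal 5% level even when everything shares the same true trend.

```python
# Toy check: substitute the models' own SD for the observation's uncertainty,
# while the models mis-estimate the real "weather noise" by some factor.
# Everything shares one true trend, so the nominal rejection rate should be 5%.
import numpy as np

rng = np.random.default_rng(2)
true_trend, obs_se, n_models, n_trials = 0.2, 0.06, 20, 20_000

for noise_factor in (0.5, 1.0, 2.0):    # models under-, correctly, over-estimate noise
    model_sd = obs_se * noise_factor
    rejects = 0
    for _ in range(n_trials):
        models = rng.normal(true_trend, model_sd, n_models)
        obs = rng.normal(true_trend, obs_se)
        model_se_hat = models.std(ddof=1) / np.sqrt(n_models)
        z = (models.mean() - obs) / np.hypot(model_se_hat, models.std(ddof=1))
        rejects += abs(z) > 1.96
    print(f"noise factor {noise_factor}: rejection rate {rejects / n_trials:.3f}")
# Under-estimated noise -> far too many rejections; over-estimated noise -> the
# test almost never rejects. Either way, the level depends on a quantity the
# models don't pin down.
```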
The models are spun up from a period in the past, about a century ago, and are generally run for many years to reach an equilibrium. They reach different equilibriums when run under the same conditions. So, the models actually do get different results from each other.
So… your simple point is what? That on a thread discussing Santer, Beaker chose to discuss Douglass, without clarifying which paper he was discussing? And that his main criticism of Douglass was a point about semantics? FWIW: I have no intention of wasting my time hunting down beaker’s comments at CA in order to confirm what he said is as trivial as you suggest. – lucia
YES! That you are being silly by almost WILLFULLY ignoring the thread of discussion. Even ignoring the places where beaker specifically REFERS to Douglass.
Also, that there’s something slimy about ripping into Santer and not Douglass. You do realize that Santer is essentially a comment on Douglass?
I said Douglass was incorrect not to account for the uncertainty in the observations long ago, both here and at CA.
Your first version of beaker’s main point was semantics: Douglass left a word out of a sentence. The word did not match the tests later done in the paper. Your second version of beaker’s main point, which you provided by dropping a string I could search on google, was an incorrect point. I said that both here and at CA in the thread where he made that point, and which you claim to have read.
If you have a third version of beaker’s main point, please quote and link to the thread at CA so we can read it. — lucia
Lucia: I am comfortable with moving on to discussion of other papers/issues. It’s just that when you have a provocative title like “SD or SE: what the heck are beaker and others talking about?”, well, the answer (if you read the overall conversation across threads… or even the excerpted quote I posted here) is that beaker has talked a lot about Douglass. And if you want to be informed on “what the heck” “he is talking about”, you help yourself by considering the long discussion of Douglass on the SD issue.
TCO– I wrote this particular post because a number of commenters wanted to understand the practical significance of SD vs. SE. Beaker is mentioned because he introduced the whole “SD is right, SE is wrong” idea, kept insisting, and kept speaking in acronyms. Many people wanted to know “what the heck is beaker talking about” with respect to using acronyms for these very simple statistical concepts.
When writing this post, my focus was not to discuss Douglass or Santer specifically. It was to give a concrete example of the sorts of questions we answer by referring to SD and to SE. Most of my readers seem to have grasped that the point of this post was specifically the debate over SD vs. SE.
I permit comments to stray from the topic. If you have something specific you want to discuss about Douglass or Santer, fine. But then make whatever point you intend to make directly here in comments. Or, you can keep trying to prove some unstated point by making vague allusions to comments threads at other blogs. In the second case, no one will have any notion what point you are trying to make. But… carry on…. — lucia
In the one case where I went to the work of finding the specific quote (in this thread), you were still ignorant of it. Don’t send me running for brownies. A general comment like “Beaker has a long history of discussing SE versus SD in Douglass threads, and it is helpful to understanding his current remarks to consider that… and that he even mentions his opinions under that consideration (in the quote which I beavered out for you)” is very helpful.
It is NOT a semantic issue. There are fundamental issues here (the inconsistency remarks, and calling the SE a measure of the distribution) where Douglass was just wrong. He should amend his remarks. Inconsistency with observation (of the class) is a huge bridge to cross. Much larger than saying that the models have a wide spread amongst themselves. Or that realized climate has a large spread of outcomes (butterfly effect). This is as simple as the excluded middle. Saying something has been proven with 95% certainty to be wrong is DIFFERENT from saying it has not been proven with 95% certainty to be correct. Yet I see Douglass and many hoi polloi make this mistake all the time!
What are you talking about? You dropped a searchable string with no link back to the comment. I used google to find the comment containing that string. I had already addressed that particular comment at CA. What beaker said in that comment was incorrect.
You keep harping on the semantic issue. Douglass and Santer both compared the best estimates of means from models. In one sentence in Douglass, the authors expressed themselves poorly and referred to the range of models rather than the range of best estimates of means. To test the uncertainty in the best estimates of means– which both Santer and Douglass test– one uses SE, not SD.
If you want to make some other sort of comparison and explain why it’s correct, fine. Do so. But SE is the correct thing to use for this comparison. It is correct when the hoity toity use it; it is correct when the hoi polloi use it. Because use of SE is correct, the peer reviewers of both Santer and Douglass accepted those analyses and the papers got published. The word “butterfly” doesn’t change that.
BTW: I’m getting a bit busy; you are simply repeating vague complaints over and over. Most likely, I’ll read your next comment, add an inline comment and close the thread. — lucia
6218, Lucia. You did not address the substance of that. Yes, I did not quote it directly again later in this thread, but I had already made 6218 and you had already blown it off.
The inconsistency charge and the SD/SE issues are not semantic. They are basic things, like out of Box, Hunter and Hunter. You can keep calling it semantics, but the language was clear. Douglass went a bridge too far. Now he needs to clarify what he asserts. And probably send in a corrigendum or a comment on his own paper.