The Blackboard

Where Climate Talk Gets Hot!

Model Mean: trend since 2001 rejecting for a year.

25 May, 2009 (16:21) | Data Comparisons Written by: lucia

Guess what? If we run the Santer17 type test on Global Surface temperatures, we now find that the model-mean trend rejects relative to both GISSTemp and Hadley temperatures, and that it also rejected relative to both 12 months ago. Amazing, huh?

You may wonder: Why do we care about what results the test gave if we ignore the most recent 12 months? Well, as you may recall, Gavin got a bit bent out of shape about Pat Michaels’ results making the model mean look bad if we used all current data, and suggested that the result doesn’t mean much because… well… the models didn’t look so bad if we ignored recent data. (For my previous comments on this notion read (1) and (2).)

So, today, I’ll show two graphs. One will show results for trends computed ending in 2009; the other will compare the results from 12 months ago to current results.

Comparison of Trends to Multi-Model Mean.

Below, I show the trends computed from January of various years through April 2009:

Figure 1: Santer Test Normal


As I previously mentioned, the results for different years are not independent of each other; the graph shows that the diagnosis of “rejection” for the models is not strongly influenced by start year. Because projections are based on the SRES, I have been using 2001 as my main “start year”, and will be discussing trends computed from that year.

If you examine the graph above, you’ll see the “model mean” trend from AOGCMs used by the IPCC is shown using a green curve. The ±95% confidence intervals computed as described in Santer17 are illustrated using a blue curve. The 1 standard error range is illustrated with teal curves. When observations fall outside the ±95% confidence intervals, this means that if the model mean is correct then the observed event would have happened in less than 1 in 20 instances of all possible “weather noise”. So, we reject the hypothesis that the model mean is correct at the 95% confidence level.
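The kind of comparison described above can be sketched in a few lines. This is a minimal illustration, not the calculation behind the figure: it assumes the trend uncertainty is inflated for AR(1) noise using an effective sample size n(1 - r1)/(1 + r1), where r1 is the lag-1 autocorrelation of the regression residuals, and it compares the result against a rough threshold of 2 rather than an exact t critical value. The function names, the synthetic series and the model-mean numbers are all hypothetical.

```python
import math
import random


def ols_fit(y):
    """Least-squares fit of y against t = 0..n-1.

    Returns the slope, the residuals, and the sum of squared deviations of t.
    """
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y)) / sxx
    intercept = y_mean - slope * t_mean
    resid = [v - (intercept + slope * t) for t, v in enumerate(y)]
    return slope, resid, sxx


def lag1_autocorrelation(resid):
    """Lag-1 autocorrelation of the residuals (their mean is zero after the fit)."""
    return sum(a * b for a, b in zip(resid, resid[1:])) / sum(e * e for e in resid)


def ar1_adjusted_trend(y):
    """Return (slope, standard error, effective sample size), assuming AR(1) noise."""
    n = len(y)
    slope, resid, sxx = ols_fit(y)
    r1 = lag1_autocorrelation(resid)
    n_eff = n * (1 - r1) / (1 + r1)  # assumed effective-sample-size adjustment
    if n_eff <= 2:
        raise ValueError("too few effective samples for a trend uncertainty")
    resid_var = sum(e * e for e in resid) / (n_eff - 2)
    return slope, math.sqrt(resid_var / sxx), n_eff


def trend_difference_statistic(model_trend, model_se, observed):
    """(model-mean trend - observed trend) scaled by the combined standard error."""
    slope, se, _ = ar1_adjusted_trend(observed)
    return (model_trend - slope) / math.sqrt(model_se ** 2 + se ** 2)


# Hypothetical demo: 100 months of synthetic "observations" with AR(1) noise.
rng = random.Random(0)
noise, series = 0.0, []
for month in range(100):
    noise = 0.5 * noise + rng.gauss(0, 0.1)
    series.append(0.0005 * month + noise)  # made-up trend, degrees per month

d = trend_difference_statistic(model_trend=0.002, model_se=0.0002, observed=series)
print("test statistic:", round(d, 2), "-> reject" if abs(d) > 2 else "-> fail to reject")
```

Because the effective sample size shrinks as the residual autocorrelation grows, the same observed trend gets wider error bars when the noise is more persistent, which is why the choice of noise model matters to the verdict.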

Notice that if we compute trends beginning in January 2001, both Hadley and GISS data fall outside the 95% confidence intervals. You can tell this because both the orange and red symbols denoting GISS and Hadley fall outside and below the 95% confidence intervals for trends computed since 2001.

That means: We reject the notion that the model-mean trend based on AOGCMs used in the IPCC AR4 is true. Or, said another way, we treat the model-mean as false, wrong, disconfirmed or whatever word you like to use to say “very probably not true.”

Of course, we do get somewhat different results depending on start year as has been discussed previously.

What about last year?

Those who were impressed by the notion that we should see whether or not the multi-model mean rejected 12 months ago will be pleased that I added circles to indicate results from last year.

Figure 1: Santer Test shows Multi-model mean rejects relative to Hadley and GISSTemp 12 months ago.

Above, you can see that the multi-model mean since January 2001 rejected relative to both Hadley and GISSTemp when we used data up to April 2008.

It is true that the observations didn’t fall as far outside the 95% confidence intervals this time last year. (Note the confidence intervals computed in April 2008 are wider than those computed in 2009. This is because confidence intervals get tighter when we add 12 more months’ worth of data.)

In case you are wondering: GISSTemp only moved from one side of the 95% confidence intervals to the other from March 2008 to April 2008. So, last month I couldn’t say the multi-model mean had rejected relative to both Hadley and GISSTemp for two Marches in a row. Hadley had been rejecting that long, but GISSTemp, with its larger trend, had not.

But this month, I can report that Hadley and GISSTemp were both outside the 95% confidence intervals computed using the Santer method in April 2008 and in April 2009.

Where will the observed trends since 2001 be relative to the 95% confidence intervals next year? I don’t know for sure. But unless the El Nino kicks in soon, and hard, the HadCrut trend since 2001 will almost certainly still be outside the 95% confidence intervals next year. A moderate El Nino hitting soon could move GISSTemp in enough to be in “fail to reject” territory.

I can’t predict the future any better than anyone else. But the current results, based on data arriving after the SRES used to create predictions were published, say “Reject the IPCC model projections relative to observations”.

If someone says they don’t think the current results “count” because they imagine that things would be different if we ignored 12 months of data… well, no. Back in April 2008, the multi-model mean failed the Santer test for trends computed since 2001. If someone doesn’t like starting in 2001, ask them what year they like. If they pick 1999, ask them why a year 2 years before the SRES were defined is appropriate for testing projections/predictions. Tell me what they say.

Comments

David Gould (Comment#13856)

So, GISS moves outside the 95 per cent confidence interval after adding one month’s worth of data. And if El Nino kicks in – and it looks as though it is going to, based on recent Australian Bureau of Meteorology data – it might move back again after a few months. So, if we rejected the model relative to GISS based on current data, and then accept the model relative to GISS based on data in a year’s time, something seems wrong with the methodology for rejecting/accepting.

Shouldn’t the statistical decision to reject/accept the model match to real data be based on a period of time over which the model is rejected? For example, let us say that a model is in the 95 per cent confidence bounds for the first 10 data points, drifts out for 1, drifts back for 10 and then drifts out again for 3. We expect 1 in 20 data points to be out. But here we have 4 in 24. How significant would that be? Would autocorrelation affect the significance of 4 in a row being outside the bounds, as opposed, say, to 4 scattered?

(And I have used only two years here for simplicity’s sake – I know that the confidence intervals are larger with less data, but hopefully that should not make a difference to the underlying question.)

Chuck Lampert (Comment#13857)

A little (but not too much) OT, but I think many of us would be interested in your take on MIT’s prediction that there is a 90% chance that global temperatures will increase by 9 degrees C by 2100.

MIT Global Warming Prediction Haiku

If predictions fail
Make new ones that are much worse
to scare people more.

lucia (Comment#13858)

David

So, GISS moves outside the 95 per cent confidence interval after adding one month’s worth of data. And if El Nino kicks in – and it looks as though it is going to, based on recent Australian Bureau of Meteorology data – it might move back again after a few months.

GISS moved outside the 95% based on one month’s data in April 2008, not 2009. So, we now have 12 months of data ensconcing GISS pretty well within “falsified” territory assuming AR(1).

It’s going to take a lot of warming to move it back in now. However, I did run some “what ifs”. IF the temperatures for every month for the next 12 months match the maximum observed for that specific month since 1998 (inclusive), GISS will “unfalsify” within a year. So, a whopping El Nino could do it. But even hitting maxes every month for 12 months straight is not enough to unfalsify Hadley.

Amazing, huh?

Chuck,
I have no opinion about MIT’s projection that temperatures will increase by 9 degrees by 2100. I’m going to have to read the basis. Seems a bit odd on the face of it.

David Gould (Comment#13859)

Okay – I missed that. But statistically speaking, how many data points would it take to be – for example – 95 per cent sure that a model was falsified? I understand that this would be a function of the total number of data points and the number of data points outside the 95 per cent confidence interval, but I am wondering what such a function might look like.

Using the example I initially gave, would that be 95 per cent falsified? I guess it comes down to the chances that 4/24 would be less than 1/20. Is that correct?

David Gould (Comment#13860)

I do not think that that was the MIT prediction.

“The researchers predict a 90 % probability that surface temperatures will be 3.5° to 7.4° higher by 2100, under a scenario involving no policies to specifically reduce greenhouse gas emissions. These temperature increases are more than twice those predicted under the previous version of IGSM, which was run back in 2003. ”

From:

http://physicsworld.com/cws/article/news/39207

Chuck Lampert (Comment#13861)

What was actually said was:

Without policy action, the group’s model runs “indicate a median probability of surface warming of 5.2 degrees Celsius by 2100, with a 90% probability range of 3.5 to 7.4 degrees”.

Sorry I did not get it right before. :(

David Gould (Comment#13862)

The predictions had a median of 5.2 degrees.

lucia (Comment#13863)

David–
I’m not sure what you mean. The t-test can only say “If hypothesis is true, the probability that observation X would occur is less than p”.

So, right now, given the observation we have, if the model mean is correct, the observations we see would not be expected to occur in 1 out of 20 times. (In fact, it’s even less likely than that.)

The things to focus on are this though:

1) Does my choice of start date ‘matter’? I picked 2001, and did so because of when the SRES were published, not based on the data. (Picking based on the data is a very, very bad thing.) But even so, I picked 2001 in 2008 when data from 2001-2008 were available. So… is my choice random? Well, I admit to not picking it out of a hat. This is a problem in statistics. Everyone’s analysis of observed GMST is polluted by this problem. So, this is a valid question, and it affects everyone.

2) Is the AR(1) model ok? We can argue this forever. We don’t have enough information to really say. (I could write a lot at the blog. I plan to write a lot in a journal article–but after I figure out some way to get page charges covered by something other than my personal after tax income.)

lucia (Comment#13864)

David, Chuck–
On the MIT predictions: I need to read the actual article. I know they predicted higher. What I have to read is their basis. Do they expect more CO2 than the IPCC did? Do they think the models underpredict for some reason? If so, why?

Basically, I need to read the thing.

I admit that right now, given the recent slow rise in earth temperatures, I find it difficult to believe that we could get 9 C this century (unless they are predicting methane and CO2 oozing out of the earth and ocean.) But without reading the paper, I can’t really say.

David Gould (Comment#13865)

It is to do with a number of things:

“Many changes contribute to the stronger warming; among the more important ones are taking into account the cooling in the second half of the 20th century due to volcanic eruptions for input parameter estimation and a more sophisticated method for projecting GDP growth which eliminated many low emission scenarios. However, if recently published data, suggesting stronger 20th century ocean warming, are used to determine the input climate parameters, the median projected warming at the end of the 21st century is only 4.1°C.”

from:
http://ams.allenpress.com/perl.....1&ct=1

(this is only the abstract)

David Gould (Comment#13866)

Lucia,

I guess what I mean is this:

If the probability that observation X occurs is less than P, and we make Y observations, with Z of them being X, how can we determine the likelihood that the model is false?

lucia (Comment#13867)

David-
Based on their abstract, it appears part of their assumption is that GDP growth will be stronger than assumed in earlier analyses:

a more sophisticated method for projecting GDP growth which eliminated many low emission scenarios

I have absolutely no insight whatsoever into the rate of GDP growth. So, if they are assuming high GDP growth resulting in rip-roaring CO2 growth, then higher warming makes sense physically. But will higher GDP growth occur? I don’t know. Ask Zeke when he re-appears. He’s an economist.

If the probability that observation X occurs is less than P, and we make Y observations, with Z of them being X, how can we determine the likelihood that the model is false?

I will admit to having no clue what you are asking. You have way too many variables.

How about “If we assume Z is true, and under that assumption, we determine the probability that observation X occurs is less than P, and we observe X, what is the probability that Z is true?” The answer is “Less than P”.

There may be other questions involving Y and “Z of the X are Y” is true that are worth asking, but you need to be more concrete for me to answer. I honestly don’t know what you are asking. (Depending on what you ask, some answers may not involve statistics per se.)

lucia (Comment#13868)

Oh, David. I should add: Using frequentist statistics, you don’t determine the probability a model is false. You determine the probability a particular observation could have occurred if a hypothesis is true.

It’s important to remember that a) the model is not the hypothesis itself and b) you never determine the probability that something is false.

What we do is determine whether or not an observation is improbable if a hypothesis is true.

If we see things that are improbable under the assumption the hypothesis is true, we begin to suspect the hypothesis is false.

I know it’s tempting to ask other questions. But if you want to ask them, you will need to take many statistics courses and ask many statistics professors how one might be able to get the answers to the questions you’d rather ask. Many traditional statistics questions seem to be posed “backwards”. This happens because the math is more tractable that way. Some questions you might want to ask can be difficult to figure out how to answer. So, we apply tests to questions we can answer.

David Gould (Comment#13869)

Lucia,

What I am asking is: how many observations outside the 95 per cent confidence interval does it take to falsify a model?

As an example, you would expect that in 100 runs of a model you would get 5 rejections even if it was perfect. Getting 5 rejections does not falsify the model; it is *expected*.

So, if we ran a model 100 times and got 6 rejections, what is the likelihood that it is falsified? What about 7, or 8, or 10?

I hope that explains my question better. :)

David Gould (Comment#13870)

Lucia,

I just read your addition.

So, I guess it becomes: if we had 6 rejections at the 95 per cent confidence level, what is the chance that our hypothesis is correct?

lucia (Comment#13871)

David,

What I am asking is: how many observations outside the 95 per cent confidence interval does it take to falsify a model?

I still don’t know what you are asking.

Which 95% confidence intervals? The ones for all weather in that model? And what kind of model? In advance, what do you expect the model to be able to predict? The magnitude of weather noise? Just the mean? What? (These questions really matter.)

But more importantly, why are you asking about comparing 5 observations to a hundred runs? To a large extent, we only have one observation. The observation may include numerous temperatures over many months, but in some sense it’s only one observation. This happens because we can’t turn back time and rerun the earth. So, as a practical matter, we need to come up with tests that permit us to use the single time series of the earth’s temperature to say something about the model runs.

Maybe you can try to make your question more concrete. For example, are you wondering how to interpret that the GISS observation is not outside the 95% confidence interval in 1999? Do you want to concoct a specific test that involves the result for 4 years in a row? Or… what? The second can be proposed and worked out as a hedge against cherry picking, but it’s idiosyncratic. So, ideally, someone should have proposed the test before looking at this data.

That said, I previously showed that if someone (for example Gavin) wants to imply that we need to look at data for two years in a row, we could formalize that as a test that requires rejection two years in a row, each at a confidence of ‘p’. Once that person describes what they want to impose as a test, we can compute the effective confidence intervals for the two year test relative to the stated confidence of ‘p’. (If we insist on 3 years, we ratchet the confidence up more… and so on.) It turns out that requiring rejections at p=90% two years in a row has a real p of 95% under assumptions of AR(1) with autocorrelations close to what we have.

We could do 3 years… 4 years etc. But, at a certain point, one might begin to suspect that the “criteria” for deciding the one year test is no good is that we got the answer we didn’t like, so we proposed a two year test. Then, if that still got us the answer we didn’t like, we proposed 3, etc. Picking your method based on whether or not you like the answer is a very, very bad thing in statistics, science, and just logic. So you can see where it would be better if the method was proposed before the data were collected!

David Gould (Comment#13872)

In terms of runs, we do have more than one run. Every data point at a specific year is in effect a run against the model prediction.

I guess that I am trying to get a more general answer that can be applied to a lot more than climate. :)

Imagine, though, that I have a model that predicts a series run of 120 monthly temperatures from now until 10 years into the future.

As time goes by, I get more and more comparisons.

As an example, I might be within the 95 per cent confidence interval of my prediction for August 2010, but outside it for November 2014.

How many times out of 120 would I have to miss before the model is rejected?

Statistically, I would expect to be wrong 5 per cent of the time. But what if I was wrong 10 per cent of the time? Would that be sufficient to reject the model?

Anyway, if you still do not understand what I am asking, I will have a think about it and see if I can find better terminology/examples/???

Thanks :)

lucia (Comment#13873)

David– Are these 6 out of 6 rejections at 95%? And are they independent from each other? If you get 6 out of 6 rejections, each with a probability of 5% (i.e. 1-0.95) if the hypothesis is correct, then the probability of that outcome under the assumption the hypothesis is correct is (0.05)^6 ≈ 1.6 * 10^-8 = 1.6 * 10^-6 %.

But… that’s not what you meant to ask, right? You want to know the probability the hypothesis is false.

You might think this is the same as saying the probability the hypothesis is false is 1- 0.000000016, but that’s not so. It could be higher or lower depending on what other things might be possible based on something other than these 6 out of 6 observations.

See why I said statistical statements seem backwards? If you want to ask forward type questions, you need to start doing Bayesian stuff. That’s confusing too because, in that case, you make assumptions about what you believe the probabilities of certain things are before you see the data.
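The arithmetic in this exchange is easy to check, provided the tests really are independent, which is the assumption questioned above and one that autocorrelated temperature data would not satisfy. The sketch below (function names are hypothetical) computes the chance that every one of k independent tests rejects at the 5% level when the hypothesis is true, and the binomial chance of seeing at least k rejections in n independent tests.

```python
from math import comb


def prob_all_reject(k, alpha=0.05):
    """Probability that k independent tests all reject when the hypothesis is true."""
    return alpha ** k


def prob_at_least(k, n, alpha=0.05):
    """Probability of k or more rejections in n independent tests (binomial tail)."""
    return sum(
        comb(n, j) * alpha ** j * (1 - alpha) ** (n - j)
        for j in range(k, n + 1)
    )


print("6 of 6 rejections:", prob_all_reject(6))
print("6 or more of 100:", round(prob_at_least(6, 100), 3))
```

Six rejections out of six has a probability of about 1.6e-8 under the hypothesis, while six or more out of 100 happens more than a third of the time, so the second outcome says very little on its own. Neither number is the probability that the hypothesis is false.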

lucia (Comment#13874)

David–

Statistically, I would expect to be wrong 5 per cent of the time. But what if I was wrong 10 per cent of the time? Would that be sufficient to reject the model?

To some extent, it depends what the model claimed to predict in the first place– seriously! And also: Do you mean just in the forecast?

I’m not trying to be evasive here. Let’s switch off climate to get examples. Let’s look at the average weight of a baby born in Australia over the next century.

Suppose your model only intended to predict the average weight of babies each month. It never intended to predict the “baby weight noise”. So, your results are very, very smooth. You run it 100 times, and every time, you predict the exact same number for every month. For the first 6 months, it looks like this:

Run      1        2        3        4        …   100      data
weight   6 lbs    6 lbs    6 lbs    6 lbs    …   6 lbs    5.89 lbs
weight   6.1 lbs  6.1 lbs  6.1 lbs  6.1 lbs  …   6.1 lbs  6.13 lbs
weight   6.2 lbs  6.2 lbs  6.2 lbs  6.2 lbs  …   6.2 lbs  6.19 lbs
weight   6.3 lbs  6.3 lbs  6.3 lbs  6.3 lbs  …   6.3 lbs  6.28 lbs
weight   6.4 lbs  6.4 lbs  6.4 lbs  6.4 lbs  …   6.4 lbs  6.43 lbs
weight   6.5 lbs  6.5 lbs  6.5 lbs  6.5 lbs  …   6.5 lbs  6.49 lbs

Look at that: You were outside the ±95% confidence interval based on the distribution of runs every single month! After all, the width of the ±95% confidence interval for the distribution of runs is 0 every single month.

Do you think your model is “wrong”? Heck no.

You will recognize the spread of model results is not the same as the ‘noise’ in the data. Why do you get to do this: Because you never thought the spread of your model runs contained any information about the “noise” in the data.

So, now: I ask you: when you ask your question, what do you think the spread in the model results for ‘something’ means? This claim (which you are permitted to and also required to advance) would be your assumption about what the scatter in the model runs means.

Your question is incomplete because you haven’t stated it.

(BTW: Whatever you think it means, remember, I can test that claim too. But, right now, I can’t answer your question, because it’s incomplete. You have to state the assumption you want to make. Otherwise, I don’t know what your question is. (And I can make up all sorts of silly possibilities, like this baby weight problem, which has the opposite noise problem from the one with all weather in all models.))

Alan S. Blue (Comment#13875)

David, you can say “That is broken” with pretty concrete certainty. Demonstrating correctness is something else entirely. If it flips back to “just inside failing to falsify” for awhile – it still isn’t a sign of a robust model.
.
A broken watch can perfectly match a model in 100% of all observations – but the model “It’s always 7pm” is clearly complete baloney.
.
Falsifying global climate models doesn’t say “There is no warming” or “There will be no warming” or “There will be cooling” or any of a number of things. All that it does say is “This particular model isn’t good.”

David Gould (Comment#13876)

Alan S Blue,

What if it flipped back for a while, and then the real data converged on the model data, such that it was within 1 sd from 2010 to 2025. And then it went outside the 95 per cent confidence interval again for a period.

How often would it have to be outside the 95 per cent confidence interval before you could say, ‘This is broken’ with pretty concrete certainty?

David Gould (Comment#13877)

Lucia,

The problem is, I am looking for generalities, not an answer to a specific question.

If a model predicts 1000 data points and after, say, 100 real world results we find that 6 of them are outside the 95 per cent confidence interval, how likely is it that the model is right? (Or wrong, whichever can be answered).

More generally, in a set of X data points, if Y are rejected at the 95 per cent confidence interval when compared to the model, how likely is it that the model is right? (or wrong, whichever can be answered).

Or are these questions way too general, because so many things depend on the specifics?

David Gould (Comment#13878)

There is also the issue of autocorrelation, which is obviously applicable to climate but elsewhere, too.

lucia (Comment#13879)

David–
The problem is that you are trying to be simultaneously too general and too specific. To answer your question, we need to know what type of data points X are, is there serial autocorrelation etc. We need to know what you mean by the 95% confidence intervals based on some criteria etc.

These are the 95% confidence intervals for this statistical test etc. But generally: If the model is right, you can’t have loads of these outside the 95% confidence intervals.

But for people to answer your or any statistical question, you have to pose it more specifically than you are. You need at least the level of specificity that describes the questions asked when determining the confidence levels for a trend obtained using ordinary least squares in the first place. You have to get this specific when asking your question or no one can answer it. If you want to bring in some non-climate data to ask it, do that. But you still need to a) propose an idea for a test, and b) propose a hypothetical batch of data. After that, one could begin to figure out confidence intervals and also discuss whether it’s a good or bad test. (Bad tests are ones with low power relative to existing tests.)

hunter (Comment#13880)

If the MIT study is right, and we are facing 7°C by 2100, then the Hansen/IPCC models are wrong.
Has anyone measured just how many times the AGW promotion industry has claimed that things are going to be much worse than predicted?
Have any of the predictions about AGW being much worse actually happened?

Bill Illis (Comment#13881)

MIT’s numbers can’t be based on anything except a faulty computer program.

I imagine they didn’t produce this chart or even check their own numbers over time or they would have started over.

http://img29.imageshack.us/img.....modelc.png

jack mosevich (Comment#13882)

Lucia: Enjoying summer? It’s cold again in Evanston (Tues). Anyway there are kinks in the graphs around 2005, with the model mean up and GISS, HadCrut down. Any idea why the models have kinks? Interesting too that they are opposite to GISS-HadCrut.

lucia (Comment#13884)

Jack–
The kink around 1990-1995 is Pinatubo. The later kinks in the model mean mean very little.

Even the model mean contains noise because there are a finite number of models. Also, a few of the models have hilariously high weather noise. HILARIOUSLY high!

Andrew Kennett (Comment#13930)

David,
Is this what you are asking?
If we have 10 pairs of data, say a predicted value and an observed value, and we want to test the H0 that there is no difference, or more formally H0: ave diff = 0, H1: ave diff NOT = 0: if the ave diff (predicted - observed) = 0.1 and the Std dev is 0.1 and the n = 10, then the calculated t-val = 3.162, which is less than the critical t-val (99% confidence, 2-sided, 9 degrees of freedom) of 3.250, so we fail to reject H0. But if the ave diff and SD stay the same and the n = 11, then the calc t-val is 3.317, which is greater than the t-crit of 3.169 (10 degrees of freedom), so we reject H0, and 1 sample has made the difference. The tests of the climate model v actual climate are more complex but essentially the same type of thing, so at 2001 don’t reject, at 2002 don’t reject … at 2008 reject, etc. Unless the average difference or the SD change, just adding more data won’t change the rejection.
Make sense?
Andrew
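The paired-difference example can be worked through directly. A minimal sketch: the function name is hypothetical, and the critical values are hard-coded from a standard t table at the 99% two-sided level with n - 1 degrees of freedom, which is the level at which one extra sample flips the verdict for these numbers.

```python
import math

# 99% two-sided critical t values from a standard t table, keyed by degrees of freedom.
T_CRIT_99_TWO_SIDED = {9: 3.250, 10: 3.169}


def paired_t_statistic(mean_diff, sd_diff, n):
    """t statistic for H0: the average (predicted - observed) difference is zero."""
    return mean_diff / (sd_diff / math.sqrt(n))


for n in (10, 11):
    t_value = paired_t_statistic(mean_diff=0.1, sd_diff=0.1, n=n)
    t_crit = T_CRIT_99_TWO_SIDED[n - 1]
    verdict = "reject H0" if t_value > t_crit else "fail to reject H0"
    print(f"n={n}: t={t_value:.3f}, critical={t_crit:.3f}, {verdict}")
```

With the mean difference and standard deviation held fixed, the statistic grows as the square root of n while the critical value falls slowly, so a result that sits just under the threshold can cross it with a single additional pair.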

David Gould (Comment#13933)

Andrew Kennett,

That is useful, but it is not quite what I am asking. (At least not at first glance – I am new to this and a little slow at getting it). From Lucia’s answer, it seems that there is no simple or semi-simple general formula for what I am asking, and that it depends very much on the specifics of the problem.

David Gould (Comment#13935)

Lucia,

Thanks for the answer. It is obviously significantly more complicated than I thought, darn it. :)

Jorge (Comment#13938)

Lucia –

It has taken me a while to realise that what your top graph is showing is that the failure of temperature to rise over the last few years has pulled down the observed trend even with a very early start point. This makes the models look pretty bad over the whole period from 1970 to date.

Would it make any sense to do the same comparison but calculating all the trends starting with 1970 data and showing how they change as new data is added? Presumably one would then have huge error bars on the left of the graph instead of on the right.

I imagine the models would look much better in the early period and then get worse when the period of stagnant temperature is reached at the end. Of course, this might simply show that the models are better at hindcasting than forecasting. :-)

lucia (Comment#13939)

Jorge
Yes. The models are ok at hindcasting. The obvious questions are: Is this because the forcings and parameters are nudged towards agreement with known data? Or is this because the recent data are truly unusual events? Or is this because they guessed the current forcings totally wrong?

I may make the graph you suggest.

David
It’s not that computing anything is complicated. It’s that to compute the probability of something, you need to define the test with some level of specificity. The test can apply to data sets generally, but the test itself must be fully described so that someone can know what the test is and implement it when given a set of data. You are not describing a concrete test so no one could possibly tell you what the probability of the vaguely described outcome is.

The New MIT Climate Study: A Real World Inversion for the Political Moment? — MasterResource (Pingback#13971)

[...] that climate model are predicting global temperatures to be rising at a rate far greater than they actually are, you would think that the model developers would be taking a long, hard look [...]

Ron DeWitt (Comment#14038)

I understand your using the Santer test in your discussions to address the modelers on their own terms. However, I think more consideration should be given to the question of whether that particular test makes any sense at all. I found it helpful to conduct a little thought experiment, applying what I understand to be the Santer test to a situation for which most of us would expect falsification. Let us suppose that we have a bunch of predictions of climate change made by astrologers, and we wish to use the Santer test to answer the question “Does available climatic data falsify the astrological method for predicting climate?” We look for falsification at the 5% level.

Now suppose we start with an ensemble formed by predictions made by an elite subset of astrologers, say those who learned the art from top-flight astrologer Prof. Zoe Raster, who holds the Tycho Brahe Chair of Astrology at Phogbound University (P. U.). (After all, these important matters should only be placed in “the very best of hands”.) All having learned the most rigorous astrological methods, the P. U. alumni produce a tightly packed ensemble of predictions with a mean of 0.05 C/year and a standard deviation of 0.008 C/year. When compared with the independently selected set of observed data, which indicated an observed trend of 0.01 C/year, the P. U. predictions are found to be 5 standard deviations off, a result that is extremely unlikely by chance alone, so we consider the P. U. predictions to have been falsified.

Now, suppose that we wish to be a bit more general in our test, applying the test not only to the P.U. predictions, but to predictions made by “astrologers in general”. To do so we augment our ensemble to include predictions made by a more diverse group—newspaper astrologers, storefront astrologers, carnival astrologers, perhaps even some habitués of the Starlight Lounge, a local gin mill. The new mean of the predictions from this larger ensemble (just by coincidence) turns out to be the same 0.05 C/year, but the new standard deviation is much larger, say 0.03 C/year. With this larger, more dispersed, ensemble of predictions the mean of the predictions would be only 1.33 standard deviations from the observed trend, and because there would be about a 7% probability that chance alone would explain that amount of difference, we cannot say that we have falsified the predictions at the 5% level, and astrology has been found to have survived the test. (Or perhaps it would be better to say that the test has failed to falsify astrology).
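The two comparisons in this thought experiment reduce to a difference divided by a spread. A minimal sketch, assuming the ensemble of predicted trends is normally distributed and the tail is one-sided (both are assumptions made here for illustration, and the function names are hypothetical):

```python
import math


def z_score(predicted_mean, observed, ensemble_sd):
    """Distance between the ensemble mean and the observation, in ensemble standard deviations."""
    return (predicted_mean - observed) / ensemble_sd


def upper_tail_probability(z):
    """One-sided probability of a standard normal value at least as large as z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


for label, sd in (("narrow ensemble", 0.008), ("broad ensemble", 0.03)):
    z = z_score(predicted_mean=0.05, observed=0.01, ensemble_sd=sd)
    print(f"{label}: z = {z:.2f}, one-sided tail = {upper_tail_probability(z):.2g}")
```

Under these assumptions the narrow ensemble sits 5 standard deviations from the observation, and the broad ensemble about 1.33, with a one-sided tail probability near 9%. That is on the same side of the 5% threshold as the figure quoted above, so the conclusion is unchanged: the observation and the ensemble mean are identical in both cases, and only the wider spread keeps the second case from being rejected.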

If it seems paradoxical that the inclusion of questionable predictions in the ensemble somehow has avoided falsification of the astrological method, it is time to refer to Karl Popper. In Popper’s terms, the P. U. predictions were a “strong hypothesis,” definite enough to be falsified, in the spirit of what Popper would have called true science. By including dubitable predictions in the ensemble, and thereby broadening the distribution of the ensemble, the hypothesis to be tested has been “watered down”, approaching what Popper might have called pseudo-science.

The conclusion that I draw from this is that models have to be tested individually, and that it is “weather noise” that should be used to compute the standard deviation with which a difference between the model prediction and observation should be scaled. If my recollection is correct, that was the approach you initially used.

The way matters stand now, the modeling community appears to be relying upon low outlying predictions to avoid falsification while citing high outlying predictions to arouse fear.

lucia (Comment#14042)

Ron–
The Santer test can be applied to models individually, and I have done that from time to time and even presented it at this blog. Some models fail individually. However, because of the nature of the test, it’s actually quite difficult to fail an individual model with very few runs. This is because the t-test used in Santer uses a standard error, which is the standard deviation divided by the square root of the number of runs minus 1. A model with two runs has to be very, very far off from the observations to fail. In contrast, the multimodel mean, which has its standard deviation divided by something like sqrt(26), doesn’t have to be as far off to fail.
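To see why a model with few runs is hard to fail, here is a rough sketch of how the standard error of a mean trend shrinks as runs are added. It follows the square-root-of-(N-1) convention described later in this thread for T-test 2; the trend values are invented purely for illustration and are not taken from any model.

```python
import math

def standard_error_of_mean(trends):
    """Standard deviation across runs (dividing by N), then divided by sqrt(N - 1)."""
    n = len(trends)
    if n < 2:
        raise ValueError("need at least two runs to estimate a spread")
    mean = sum(trends) / n
    sd = math.sqrt(sum((t - mean) ** 2 for t in trends) / n)
    return sd / math.sqrt(n - 1)

# Hypothetical trends in C/decade, chosen only to have a similar spread.
two_runs = [0.15, 0.25]
many_runs = [0.15, 0.25] * 13 + [0.20]   # 27 values with the same mean

se_two = standard_error_of_mean(two_runs)
se_many = standard_error_of_mean(many_runs)

print(f"2 runs:  standard error = {se_two:.4f}")
print(f"27 runs: standard error = {se_many:.4f}")
```

With a similar run-to-run spread, the 27-member collection has a much smaller standard error, so a smaller gap from the observations is enough to reject.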

Ron DeWitt (Comment#14043)

Apparently my understanding of the Santer test was rather less than I thought it to be. I understand your comment to say that if a single model is run multiple times, the variation in the results of the various runs is great enough to constitute a kind of “model run noise” great enough usually to mask the difference between model mean and actual observation. If that is so, then the averaging of the results of the runs of a single model is also a form of weakening the hypothesis, in effect saying “maybe it will be as run 1 indicates or else maybe the way run 2 indicates, etc. etc.” I assume that there is no pretense that the variation of results from one run to the next can be regarded as modeling weather, so would it be necessary to add the variance of the model run noise to the expected variance of weather noise to get the total noise variance against which the difference signal must be detected? Thank you for taking the time to respond to my comment.

lucia (Comment#14044)

Ron–
There are actually several tests in Santer. But, the two that I have been applying permit you to:

1) T-test 1: Compare 1 run to the observations of the earth’s weather using both the “earth weather noise” and the “model weather noise”. In this test, the standard error for the estimate of the trend for both the model run and the earth is computed based on the residuals to a linear fit. I rarely discuss this one, but I have. The total standard error for the t-test is the square root of the sum of the squares of the two standard errors. I could discuss this more, but for now I’ll skip it because I don’t think it’s the test either of us has been discussing.

2) T-test 2: Compare the mean trend from some collection of N runs or models (or model-means) to the observation of the earth’s weather. The standard error for the trend on earth is computed exactly as in the previous test; so this depends on the earth’s “weather noise”. However, this uses the standard error of the collection of runs, computed by taking the standard deviation of the ‘N’ individual trends over the “N” runs and then dividing by the square root of N-1. Once again, to get the standard error for the test, you take the square root of the sum of the squares of the two standard errors. So, if you had an infinite number of runs, the standard error for how well we know the model mean would be driven to zero. The standard error for the earth weather would come from the magnitude and properties of “earth weather noise”. But, if you have very few runs, the standard error of the model mean can be high.
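The two tests described above can be sketched as follows. This is a minimal illustration, not the code used for the figures: it takes the trends and their standard errors as inputs, and does not show how the standard error of an individual trend is estimated from the residuals. All function names and numbers are hypothetical.

```python
import math

def combined_standard_error(se_a, se_b):
    """Square root of the sum of the squares of two standard errors."""
    return math.hypot(se_a, se_b)

def t_test_1(run_trend, run_se, obs_trend, obs_se):
    """One model run versus observations; both standard errors come from fit residuals."""
    return (run_trend - obs_trend) / combined_standard_error(run_se, obs_se)

def t_test_2(run_trends, obs_trend, obs_se):
    """Mean of N run (or model) trends versus observations."""
    n = len(run_trends)
    mean = sum(run_trends) / n
    sd = math.sqrt(sum((b - mean) ** 2 for b in run_trends) / n)
    se_mean = sd / math.sqrt(n - 1)   # uncertainty in the model mean
    return (mean - obs_trend) / combined_standard_error(se_mean, obs_se)

# Hypothetical inputs in C/decade.
stat_single = t_test_1(run_trend=0.20, run_se=0.03, obs_trend=0.0, obs_se=0.04)
stat_mean = t_test_2(run_trends=[0.15, 0.25], obs_trend=0.0, obs_se=0.12)

print(f"T-test 1 statistic: {stat_single:.2f}")
print(f"T-test 2 statistic: {stat_mean:.2f}")
```

As N grows, `se_mean` shrinks toward zero and the denominator in `t_test_2` approaches `obs_se` alone, which is the limiting behavior described in the text.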

Now, when you say this:

…so would it be necessary to add the variance of the model run noise to the expected variance of weather noise to get the total noise variance against which the difference signal must be detected

I don’t know if what you are saying is “right” or “wrong”; it depends what you call the variance of the run noise. What Santer does is find the estimate of the uncertainty in the model mean, which is estimated by taking the standard deviation across N model runs and dividing by the square root of N-1.

This is the normal way to test whether or not two means estimated from many samples are identical. You would find the discussion in many texts describing t-tests to determine whether the mean failure rates of widgets from manufacturer A equal the mean failure rates of widgets from manufacturer B. For these t-tests, you can increase your ability to detect differences in samples as you measure more widgets.

The only complication with the test in Santer is figuring out how to estimate the standard error in the estimate of the trend for the earth noise (or, in the case of testing one earth trend against one run, estimating the uncertainty compared to the trend estimated from 1 run).

Steve McIntyre (Comment#14046)

Lucia, I’ve done quite a bit of work on Santer. Last fall, we observed a similar discrepancy with satellite trends and, as you know, Ross and I submitted a comment to IJC observing that some of Santer’s results were overturned if up-to-date data were used. The referees did not contest any of our calculations but argued that this result was insufficiently “original” to warrant publication. Ross thinks that there’s a crack open to permit resubmission which we’re working on.

Many other Santer results fall apart with up-to-date data.

In reviewing the file, I noticed something very curious that I’d not noticed originally and which I’ve mentioned recently at CA. The predecessor discussions (CCSP, Douglass et al) used the three major surface indices as comparanda (CRU, GISS, NOAA). Santer dropped GISS and NOAA from his comparisons, replacing them with HadISST and two now discontinued ERSST versions, which had lower trends and thus were less inconsistent with satellites.

If GISS, CRU and NOAA are used together with up-to-date data, the comparisons within Santer deteriorate even more.

One reviewer interpreted our submission as general musing on the impact of endpoints and observed that this point had already been made in the literature with more authority. He conceded that Santer’s use of 1999 as an endpoint might be “suboptimal” but took Gavin’s position that 2008 would have an “even more substantial impact” on the results. Including up-to-date data, without simultaneously testing the exclusion of early data, was said to “belie a significant lack of understanding of the satellite data issues.”

The reviewer used the term “disingenuous” in consecutive sentences, which in combination with the term “plain wrong”, gave a Gavin-esque sort of language – which perhaps has become a common turn of phrase in the Community.

lucia (Comment#14057)

SteveM

The referees did not contest any of our calculations but argued that this result was insufficiently “original” to warrant publication.

This reviewer’s objection makes no sense for a comment on another paper submitted to a journal.

Comments are, by definition, “unoriginal” as they are required to restrict themselves to commenting on the already submitted paper. You should be able to find the language of the journal discussing the requirement for comments, and see they are restricted to commenting on the paper in question. So, yes, musing on endpoints, pointing out weaknesses in the argument etc. are “unoriginal”. But they impact the interpretations of the results of the specific paper that was just published.

How is one to comment on a flawed paper if the comment is simultaneously a) required to contain original content by the reviewer and b) forbidden to contain original content by the journal’s rules for comments on papers?

I’m totally puzzled by the notion that 2008 would have an “even more substantial” impact on results as a reason to leave it out; 2008 happened.

Steve McIntyre (Comment#14058)

Lucia, we also applied a Santer-style test to whether the observed trends were statistically significantly different from zero, the sort of issue that you’ve written about extensively, raising an issue that wasn’t in Santer. Here the reviewer stated:

The question of trend significance from zero was not covered by S08 and is a side show. I see no reason why this should remain in any published submission.

A “side show” even. My, my. This looks like a part that the editor cut-and-pasted, as neither reviewer consented to us seeing the original review. I wonder what was in the part of the review that we didn’t see.

lucia (Comment#14059)

Steve,
It strikes me as very odd that the reviewers did not provide comments for you to see. Obviously, you can’t respond to their comments if you are not permitted to read them.

In the end, the editor has a great deal of discretion in deciding what to do. But, I should think that an editor ought to consider ignoring the review from anyone who will not permit the authors to read reviews and substituting a new reviewer. There is just something so inappropriate about a reviewer not permitting an author to read the basis for declining a paper.

Jonathan (Comment#14060)

It is very odd for reviewers to only fill in the confidential comments section. It is rare that I use it at all when I review papers, and when I do I certainly make enough public comments to justify a rejection. I suspect my first move in Steve’s shoes would be to ask for the manuscript to be sent out to new reviewers on the grounds that the current ones are behaving unreasonably.

If I have understood Steve correctly one of the reviewers is also criticising the comment both for being too novel and for being too unoriginal. Reviewers really shouldn’t be allowed to get away with that sort of silliness; they should be forced to make up their minds one way or the other.

My experience with comments, however, has been pretty unhappy. The last one I submitted was rejected outright by the editor without review because the author of the original paper said it was wrong! It wasn’t, and it is now published elsewhere.

 
