
Believe it or not, the purpose of this post is to show this figure and respond to some odd claims in comments at other blogs.
The figure above compares the running 12-month average temperature anomalies (in C) for all runs of all models used in the IPCC AR4 and available at The Climate Explorer with observations. The illustrated mean is the mean of all runs; the ±95% band represents the range of the distribution based on the standard deviation of the population of runs. The baseline for each individual run and for the observations is Jan 1980-Dec 1999.
These are the runs and the baseline I have used since January for analyses in which I actually draw conclusions. (I generally draw conclusions based on trends, not on the anomalies themselves, because trend analysis is quite insensitive to the choice of baseline. I can make the comparison to projected anomalies show better or poorer agreement by selecting different baselines. This full set of runs became available in January; about half were available last fall.)
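For readers who want to see the mechanics, here is a minimal sketch of roughly how such a comparison can be assembled. It assumes the monthly anomaly series have already been downloaded (e.g., from The Climate Explorer) into a pandas DataFrame; the file names, column layout and the 1.96-sigma normal approximation for the ±95% band are my assumptions, not necessarily the exact steps behind the figure.

```python
import pandas as pd

BASE_START, BASE_END = "1980-01", "1999-12"   # Jan 1980 - Dec 1999 baseline

def rebaseline(series):
    """Express a monthly anomaly series relative to the 1980-1999 mean."""
    return series - series.loc[BASE_START:BASE_END].mean()

def running_12mo(series):
    """Trailing 12-month running mean."""
    return series.rolling(window=12).mean()

def multi_run_band(runs):
    """runs: DataFrame of monthly anomalies, one column per model run.
    Returns the multi-run mean and a +/-95% band built from the
    standard deviation across runs (1.96 sigma, normal approximation)."""
    smoothed = runs.apply(rebaseline).apply(running_12mo)
    mean = smoothed.mean(axis=1)
    sigma = smoothed.std(axis=1)
    return mean, mean - 1.96 * sigma, mean + 1.96 * sigma

# Hypothetical usage, assuming CSV exports with a monthly date index:
# runs = pd.read_csv("climate_explorer_runs.csv", index_col=0, parse_dates=True)
# obs = running_12mo(rebaseline(pd.read_csv("hadcrut.csv", index_col=0,
#                                           parse_dates=True).squeeze()))
# mean, lo, hi = multi_run_band(runs)
```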
So, what specifically does the graph rebut?
- One person suggested that when comparing observations to projections, I have somehow selected a subset of models or runs that are biased high. If you examine the sidebar, you will notice the numerous runs I include when computing the multi-model means. I use all easily available models and runs. (My collection includes all but one run based on the A1B SRES.)
I do not throw away runs from low-sensitivity models to obtain uncharacteristically large trends. (In reality, I get smaller changes in temperature from the baseline period to 2011-2020 than those listed in Table 10.5 of the AR4. This may be due to not correcting for model drift, the missing run, or some other feature.)
- Another person suggested that my conclusions are based on “pinning” the projections to the AR4. When I queried, she linked to a single post where, in response to JohnV’s specific request, I showed AR4 trends pinned to 1990 rather than doing what I normally do.
It’s worth providing some context to understand both JohnV’s request and why I created the graph.
I had just criticized Gavin of RC for pinning all runs to 1999 and then saying “You can get slightly different pictures if you pick the start year differently”. I showed that, in reality, the difference was not always slight, and advocated comparing AR4 projections to observations using the baseline in the IPCC AR4 and showed how Gavin’s comparison would look if we applied the IPCC AR4 baseline.
The models didn’t look too terrific, though the earth’s temperature anomaly does not fall outside the range of all weather in all models on this basis.
JohnV nevertheless wished to know how graphs of AR4 projections would look if I pinned specifically to 1990, because Roger Pielke had shown projections from the FAR, SAR, TAR and AR4 referenced to the temperature in that year. (The FAR, SAR and TAR specify 1990, rather than the Jan 1980-Dec 1999 average, as their baseline. So Roger’s choice was to use the baseline selected by the IPCC for three out of the four projections.)
Anyway, I complied with JohnV’s request for the pinned graphs, but I also provided several bullet points explaining what the post would describe. The second said this:
2. Show the graph “pinning” all realizations using a reference of Dec. 1989 to Dec. 1990 average. The only purpose is to show how things happen to look using that baseline. (For discussing the AR4, I prefer to use the average temperature from Jan. 1980-Dec. 1999.)
Readers who click the link to the two referenced posts 1, 2 will see that I do not advocate “pinning” projections from the AR4 to 1990.
I do not use ‘pinning’ to any individual year in posts where I make statements about the statistical significance of any comparison between the AR4 and observations. (In any case, even if I did, the choice of baseline has a negligible effect on the results of trend analyses, which are what I use when assessing models. Though I sometimes look at and display graphs like those shown above, this is not a statistical analysis, and I am not drawing conclusions from that graph.)
What if I did pin the AR4 to 1990?
Having discussed the issue of pinning, I do think it’s fun to note a coincidence. 🙂
See the vertical dashed blue line near 1991 on the graph above?
The graph shows 12 month averages; the average over the months from Jan 1990-Dec 1990 appears at year = 1990+11/12. (Averages ending in January appear exactly on the whole years.)
Notice that the 12 month average over all runs falls very close to the observed 12 month averages. This is a coincidence. However, oddly enough it shows this: If I had pinned the models to the data at 1990 and showed the multi-run mean only, the relationship between model runs and observations would be almost identical to that shown above. (The scatter in model runs and the uncertainty bounds would change.)
But some will wonder, “Almost? So, would the models look a little better or worse?”
Well, if you click on the image to make it larger and then squint, you’ll see the 12 month average model temperature anomaly falls in between the HadCrut and GISS values using data from all months in 1990. So, if I rebaselined everything to the annual average temperature in 1990 (i.e. ‘pinned’), the multi-run mean would look worse compared to GISSTemp, but better compared to HadCrut!
Anyway, for those who think my conclusions are faulty because I
a) Either throw away low sensitivity models to make the models look bad: I don’t throw away any models.
b) Or the results I describe are based on “pinning”: Results I report are not based on “pinning”. I have specifically criticized the practice of pinning. The post in which I pin illustrates that even if we pin to 1990, as suggested by JohnV (who seemed to believe this would make the models look better relative to data), the observations still undershoot the projections.
So, if someone wants to not believe my results or conclusions that’s fine. But these two particular reasons for not believing are groundless.
Update
Lazar asked me to show the graph above with the volcano runs on it. The graph above weights runs equally. I’ll create that graph, extend it to 2030, strip off the individual runs and show it on Tuesday. However, I usually test the mean with models weighted equally, which is the choice made by the IPCC when making projections in the AR4. This compares the volcano and ‘all model’ means over the periods shown above.

It’s acceptable to change the start date in my opinion when you want to test a prediction made in a particular year.
i.e. If the prediction was made in 1990, you want to compare actual temperatures from 1990 to the present.
If you want to test models in general, then check to see if the models perform against a variety of start dates. If they don’t, it’s either the test or the model that’s wrong, and it’s likely to be the model.
Nick–
The difficulty with using a single year as the baseline to compare models to data is that the comparison will look much different if we pick 1998 vs. 1999 etc.
Since models don’t even claim to predict weather, it makes sense to baseline to a longer period. Rather than select my own baseline, I just compare models to data using the baseline the IPCC chose to communicate their results.
What was the last year of available temp data for the AR4 runs?
The models always seem to be able to hindcast reasonably well (reasonably well considering they deal with external forcing only and do not include internal climate variability like an El Nino bump) but once a few years go by, they always seem to be off-track.
Right now, the ensemble mean is over by more than 0.2C which seems to be a lot considering the cut-off for the hindcasted data would have been 2005 or 2006.
(I also think the thickness of the lines should be turned down. I’ve found the difference between the printed version of Excel charts on the second lowest line thickness and the lowest line thickness is shocking – so many other little things show up at the lowest possible line thickness. Just a suggestion).
Lucia, I’m guessing you’re referencing my comments in 1).
I did not suggest your ‘all-model’ group was biased high. I was talking specifically about where you make comparisons with a volcanic-forcing subset, which has higher trends than the all-model group. I stated, “it would be good to understand why the volcanic forcing subset gives a more positive trend than ‘all models’ — is this caused by the additional forcing or because some runs are outliers, or a combination?”
In a subsequent comment I wrote, “NCAR CCSM 3.0 has a very high mean trend with little variability between runs. It contributed six 20c3m runs to the AR4 and included volcanic forcing. I wonder how much this increases the mean trend of Lucia’s 16-model volcanic-only subset above the all-models group?”
So I wasn’t “disbelieving” your trend analysis, just curious as you must be about the discrepancy between the two groups. Do you have a list of the models used in the volcanic group?
Hi Lazar–
I was not referencing you, but someone else who appeared to suggest that I picked models with high climate sensitivities simply because their sensitivities were high.
In recent posts showing the variations in 20 year trends over time, I did compare earth trends to models with volcanic forcing similar to the earth and said so. As I said in those posts, I did this because the earth experienced those volcanic forcings. As it happens, we expect certain types of variation in 20 year trends to occur because of those forcings, particularly the effect of Pinatubo. So, I think comparison to models with volcanic forcings is more relevant than comparison to models that omit those forcings.
It is true that if I compare the rolling 20 year trend ending “now” to the models that omit volcanic forcings, the disagreement looks less bad. This is primarily because the most recent 20 year trends for those models don’t contain the Pinatubo dip in the first half of the data. (1998-1999 is the center of that data, and the big dip from Pinatubo is now in the first half of the 20 year trend.)
To clarify this issue I later added the “all models” trend to the graph. (Of course, this may not have clarified things, but we still see the effect of the “Pinatubo dip” on the trends in the “all models” graph.)
I haven’t ever compared the average climate sensitivity of models that include volcanic forcings to that of models that don’t. However, the running 20-year trends for both draw together once we get well past Pinatubo.
NCAR CCSM 3.0 does indeed have a high trend. That trend appears to have been included in the multi-model mean projection in the AR4. So, I think it’s fair to include it when commenting on IPCC projections. However, if you like, I could run the numbers eliminating it. (Some of my regular readers have requested I eliminate all individual models that fail the “Santer test” and see what we get for the projection over 100 years. I haven’t gotten around to doing it. NCAR CCSM 3.0 fails that test, as do a few others.)
Of course, testing that new projection would not be the same as testing the IPCC projections.
Lucia,
“I think comparison to models with volcanic forcings is more relevant than comparison to models that omit those forcings.”
I agree.
“if you like, I could run the numbers eliminating it.”
I think it would be good to estimate outlier effects and CCSM 3.0 is an extreme example.
Could you also make a plot like the above for the volcanic-forcing subset?
“Of course, testing that new projection would not be the same as testing the IPCC projections.”
True, but I think if you want to examine modelling performance then you will need to test that effect…
… as well as test the performance of each model individually.
Why do we never see UAH and RSS data here?
“I haven’t ever compared the average climate sensitivity of models that include volcanic forcings to that of models that don’t. However, the running 20-year trends for both draw together once we get well past Pinatubo.”
It is also possible that models with the ‘same’ climate sensitivity have different transient response.
Lazar–
CCSM3.0 is definitely out there, but there are other models that are out there too. This shows averages for models with more than one run.
(Click for larger)
Sure. But I’ll strip off the spaghetti from all the models and I’ll extend to 2030. The spaghetti gets cluttered, and it’s worth seeing where the differences are.
Sure. That’s why I tested all models individually as in the post where I showed the figure in this comment.
Lazar–
Yep.
vg–
I showed RSS about a week ago here. That’s the post that got Deep Climate explaining what we’d see if we plotted 20 year trends, trends since 1979 with different end years, and at one point suggesting some subset of these plots spoke for themselves.
I plotted using monthly data and compared to a model projection; what they said seemed unpleasing to Deep’s ears. He then came up with new ways to define trends.
But the main reason I don’t discuss RSS is that I don’t have monthly means for the lower troposphere specifically. So, I can’t do detailed tests of model performance compared to that metric.
If someone created such monthly means, I’d be happy to test. (And maybe a year from now I’ll write a script to download the gigabytes of data to create my own. But for now….no.)
Lucia,
Looking at the long-term trend in that figure, I’d say nine models capture the trend pretty well, two do so-so, and three are rejected. I’m not comfortable testing recent short-term trends when there is a high noise-to-signal ratio that combines with…
a) the ability to select start and end points, data sets, and hence the large number of trends which can be constructed, which is not reflected in the CI’s. It would be good if you could show how that distribution moves around as the start and end points are varied, both individually and simultaneously.
b) differences between modelled and real-world unforced variability may have impacts on significance testing and cause short-term rejections when a long-term trend is right on the mark, which is difficult to assess with only one real-world realization and uncertainties in forcings.
Lazar–
The lower panel tests begin in 1980. T-tests automatically create uncertainty intervals that are proportional to the noise level.
I’ve done both. Here is varying the start date only.
Discussed here.
Why would I vary endpoints? (I’ve done it… but what, precisely, do you think we learn from that? It’s pretty darn unconventional to ignore recent data, or to do analysis based on guessed future data, as Gavin recently did.)
The Santer method is supposed to account for both real-world and model variability, and uses both.
Lucia,
“Why would I vary endpoints?”
Gavin showed the results are highly dependent on endpoint, and without wishing to get into too philosophical an argument, I don’t think it can be convincingly argued that the selection of endpoint is random. 2008 was a cool year relative to the linear trend from 1970. Those trends are all on the upper 95% boundary going right back to 1960, which does not agree with the plot in this post, where the observed trends sit squarely in the center of the model distribution… I would argue that the plot in this post is a better indicator of model performance.
“The Santer method is supposed to account for both real-world and model variability, and uses both.”
Okay.
Lucia,
Thanks for the update.
Using model means is good for comparison between the all-model and volcanic-forcing subsets. It would be useful though to plot the 2 s.d. lines of model runs to compare with observations, as per the first plot. As might be expected, the 1991 Pinatubo dip is greater in the volcanic-forcing subset, although the volcano mean has also been trending above the all-model mean since 1999/2000… any thoughts?
Lazar–
Answering your second comment first:
The graph I showed was a quick update from a previous post where I was showing the 1SD, because that’s what the IPCC happens to show when projecting.
The anomalies themselves are not outside the 1SD and clearly won’t be outside the 2SD. (I usually show ±95% confidence intervals because, given the number of runs we have available, doing the test with ±95% creates wider uncertainty intervals.)
I don’t know. We’ve discussed that a lot here. Theories by different people include:
1) Aerosols wrong in SRES. (I suggested this one.)
2) Model physics are utter crap. (I don’t believe this one.)
3) Modeling is hard, so model physics are slightly off due to difficulties getting parameterizations correct and/or just difficulties associated with the need to parameterize subgrid processes at all. (I think this might be it.)
4) Solar forcing is set to constant in models after 2000 but we are in a solar minimum. (JohnV has been an advocate of this.)
5) It’s due to Leprechauns.
6) Other.
Now on to your first comment:
On Gavin’s post.
I commented on Gavin’s notion, which I think is based on a lack of understanding of the difference between type II and type I error. He is trying to use the existence of high type II error to suggest we can ignore type I rejections. That’s wrong-headed.
Also, I show the model means will reject based on the Santer method if we tack on the two alternate 2009 temperatures. This is what happens if we guess that 2009 ends like 2007:
For details see: http://rankexploits.com/musings/2009/look-i-can-use-made-up-data-just-like-gavin/
and http://rankexploits.com/musings/2009/why-failure-to-reject-doesnt-mean-much-type-2-error/
These graphs are correct. The trends all end in “now” (defined at the time of the post) and the temperature “now” falls below the trend line.
You can inspect the trends before normalizing by the standard error here:
http://rankexploits.com/musings/2009/multi-model-mean-trend-aogcm-simulations-vs-observations/#comment-12062
As you see, if we go further back, the difference between observed and projected trends gets smaller and smaller. However, the standard error and uncertainty intervals also get tighter. (The second is a measure of the effect of noise.)
Because d* is the ratio of the difference between the observed and projected values to the standard error, small deviations in the mean will always trigger “rejections” after a long time.
Ordinarily, if the “projections” were made with no knowledge of the observations, and the projection method is biased, we should see d* explode to ±infinity as we go back in time. However, I’m pretty sure the IPCC AR4 runs were all made after the publication of the TAR, and presumably after the publication of the SRES on which they are based. The SRES were unofficially released late in 2000 and officially published in 2001. (This is why I use 2001 for my short term test. It’s the earliest full year for which I think it’s possible to claim modelers were unaware of temperature data.)
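For concreteness, here is a rough sketch of a d* calculation in the spirit of the Santer et al. (2008)-style trend comparison discussed above. The AR(1) effective-sample-size adjustment and the way the multi-run standard error is formed reflect my reading of that approach; they are not necessarily the exact implementation used for the plots linked above.

```python
import numpy as np

def trend_and_se(y):
    """OLS trend of a monthly series and its standard error, with the
    AR(1) effective-sample-size inflation used in Santer-style tests."""
    y = np.asarray(y, dtype=float)
    n = y.size
    t = np.arange(n)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]   # lag-1 autocorrelation
    n_eff = n * (1 - r1) / (1 + r1)                 # effective sample size
    s2 = np.sum(resid**2) / (n_eff - 2)             # adjusted residual variance
    se = np.sqrt(s2 / np.sum((t - t.mean())**2))
    return slope, se

def d_star(obs, model_runs):
    """obs: 1-D array of observed monthly anomalies.
    model_runs: 2-D array, one row per model run over the same months.
    Returns the trend difference normalized by the combined standard errors;
    |d*| larger than roughly 2 corresponds to rejection near the 95% level."""
    b_obs, se_obs = trend_and_se(obs)
    run_trends = np.array([np.polyfit(np.arange(len(run)), run, 1)[0]
                           for run in model_runs])
    b_mod = run_trends.mean()
    se_mod = run_trends.std(ddof=1) / np.sqrt(len(run_trends))  # SE of multi-run mean
    return (b_obs - b_mod) / np.hypot(se_obs, se_mod)
```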
I agree that comparing least squares trends is a better indicator than comparing anomalies. There are many reasons for this. The two most important are:
1) The agreement between anomalies is greatly influenced by the choice of baseline. If I baseline using 1900-1999, and recreate the graph in this post, you’ll see the models don’t look good. (Similarly, if I show the graph above back to 1900, you’ll see that using this baseline, they don’t predict the past!)
2) If we use a recent baseline (as done in the IPCC AR4), comparing anomalies themselves has very high type II statistical error relative to tests based on trends.
Type II error is the rate at which you fail to reject a hypothesis that is wrong.
It is generally the case that when you have two possible statistical tests to answer the same question with matching type I error, you should prefer the test with the smaller type II error computed for any given alternate hypothesis.
Many of the suggestions that some method is “better” seem to be based on the notion that we should pick the test with larger type II error. This is wrong. That principle increases our total rate of error.
The only reason one would intentionally select a test with unnecessarily high type II error is to lie with statistics. However, most people make the mistake unintentionally because they forget about type II error, and they forget type II error can be quantified provided we specify a type of statistical test and pick an alternate hypothesis.
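Since type II error can be quantified once we specify the test and an alternate hypothesis, here is a toy Monte Carlo that does exactly that for a simple trend t-test. The trend values, noise level and white-noise assumption are illustrative only, not a reproduction of any analysis in the post.

```python
import numpy as np
from scipy import stats

def type2_rate(true_trend, hypothesized_trend, n_months=96, sigma=0.1,
               alpha=0.05, n_sims=5000, seed=0):
    """Monte Carlo estimate of type II error for a trend t-test: generate
    series with `true_trend` (the alternate hypothesis), test the null that
    the trend equals `hypothesized_trend`, and count failures to reject.
    White noise only; autocorrelation would make the true rate higher."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_months)
    misses = 0
    for _ in range(n_sims):
        y = true_trend * t + rng.normal(0.0, sigma, n_months)
        slope, _, _, _, se = stats.linregress(t, y)
        t_stat = (slope - hypothesized_trend) / se
        p = 2 * stats.t.sf(abs(t_stat), df=n_months - 2)
        if p > alpha:
            misses += 1        # failed to reject a false null hypothesis
    return misses / n_sims

# e.g. a flat true trend tested against a hypothesized 2 C/century projection:
# print(type2_rate(true_trend=0.0, hypothesized_trend=2.0 / 1200))
```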
Lazar–
I reread your second comment, and I see the caffeine hadn’t hit my brain.
Actually, my thinking about why the volcano mean remains above the no-volcano mean now comes down to two factors:
1) The baseline is 1980-1999, a period when the volcanoes pulled down temperature in models with volcanic aerosols but did not pull down temperature in models without the aerosols. Because this affects temperature in the 20 year baseline, it will persist in graphs of temperature anomalies (but will drop out of analyses based on comparison of least squares trends; a small numerical illustration follows below this list).
2) Even though volcanoes induce sharp dips, many closely spaced volcanoes followed by periods with no volcanic activity will result in a steady rise for some time. This is discussed by…. believe it or not…. Tamino! http://tamino.wordpress.com/2008/10/19/volcanic-lull/
(I had been harping about this here. But I figure you’re more likely to believe Tamino when he suggests that the mere fact of closely spaced volcanoes followed by no volcanoes can explain a rise. Admittedly, he doesn’t come out and say the principle will apply to any period where we see this, and so the effect might be expected to apply to both the periods in the early 20th century and now. Maybe he figured I’d tell people that? Or not. 🙂 )
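A tiny numerical illustration of point 1), as promised above: a constant baseline shift moves the anomalies but leaves the least squares trend untouched. Purely synthetic numbers.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(240)                            # 20 years of monthly data
y = 0.0015 * t + rng.normal(0, 0.1, t.size)   # synthetic anomaly series

shifted = y + 0.3                             # same series, different baseline
print(np.polyfit(t, y, 1)[0], np.polyfit(t, shifted, 1)[0])   # identical slopes
print(y.mean(), shifted.mean())               # anomaly levels differ by the offset
```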
“If someone created such monthly means, I’d be happy to test. (And maybe a year from now I’ll write a script to download the gigabytes of data to create my own. But for now….no.)”
It’s on my list of many things to do.
Lucia
“He is trying to use the existence of high type II error to suggest we can ignore type I rejections.”
He didn’t say type I rejections should be ignored, nor that there was a high potential for type II errors (shouldn’t that be the other way round?). What he did point out was that Michaels’ methodology was not robust to the inclusion/exclusion of a single point or to changing the dataset. That statement is valid and the implications are the same regardless of the light in which they cast the models…
“[Gavin would have us] conclude that we should ignore the analysis based on 100% real data if the made up cases made the models look not so bad”
… is somewhat unfair to Gavin.
What is the hypothesis being tested?
If it is: ‘if the underlying trends in observations and models are the same over this specific eight-year period, then what is the chance we’d get the observed trend if we pretend the observations were randomly sampled and we ignore the implications of other available data’… then you could conclude with 95% confidence that the observed and modelled underlying trends are different. If we allow other data to influence confidence in our conclusions, thinking in a Bayesian sense, then when it becomes apparent that the model/observational mismatch depends on the inclusion of a single point, robust general conclusions about the accuracy of model physics are not possible, and the probability that the underlying trends are different decreases somewhat. In your response to Gavin I think you should test the effect of excluding 2008. Good point about the volcanoes.
“he doesn’t come out and say the principle will apply to any period where we see this”
… that would be stating the obvious.
“The agreement between anomalies is greatly influenced by the choice of baseline. If I baseline using 1900-1999, and recreate the graph in this post, you’ll see the models don’t look good. (Similarly, if I show the graph above back to 1900, you’ll see that using this baseline, they don’t predict the past!)”
Predicting the past/future is done by comparing the change in anomalies over time, not the absolute difference at a given point, so it’s fairly easy to identify by visual alignment when a baseline is inappropriate, as the type I errors stand out.
To try and summarize my probably rambling thoughts: the sample is not random, and other data suggest that an observed difference is due to an outlier rather than to underlying trends.
Lazar–
There is a high probability of type II errors: that is failing to reject models that are wrong. Gavin is using that as a reason to suggest we should not reject the hypothesis that models are correct. This would mean we are insisting we should increase the probability of committing type II error by setting aside a result that says “the models are probably wrong”.
This is kooky!
On excluding 2008: The projections were published in 2007. So, you are suggesting that we exclude the only data that arrived after the projections were published.
In reality, the data that arrived in 2008 is the only data that is truly randomly acquired vis-a-vis the actual projections after publication in the following sense: When the projections were published we didn’t know what that data would be.
Lazar, you might want to review the meaning of type I errors. Neither type I nor type II errors ever stand out. We can only know the probability of making a type I error when applying a particular statistical method. We can’t know if the error actually occurred.
I prefer trend analysis because that’s what permits us to see how much anomalies have changed over time. I’ll show you the graph going back to 1900 tomorrow or Tuesday. You’ll notice that if we use the baseline in the image above, the models and observations will agree poorly in the past. This is because good agreement was forced in the baseline period.
Lucia,
“This is kooky!”
Maybe I’m being slow but I really don’t see where Gavin is making those claims. Perhaps if you could provide a couple of quotes?
“On excluding 2008: The projections were published in 2007. So, you are suggesting that we exclude the only data that arrived after the projections were published.”
It would be interesting to see the effects.
The multi-model mean trend is on the high side right back to 1960 in your d* analysis, even outside the 95% limits starting around 1960; of course those aren’t short-term trends, so it’s interesting. I should read Santer et al. to understand better. Question… you mentioned Santer et al. assumes AR1, Tamino has done work showing monthly anomalies are not AR1, is this important?
“you might want to review the meaning of type I errors”
I was using the term outside the random sampling context. I agree selection of baseline is somewhat arbitrary and affects conclusions, but so is choice of start and end points. I like plots like the above because you see all the data and a dubious baseline choice is often obvious.
I agree with Gavin’s points on Pat Michaels and short-term trends. Ever since “global warming stopped in 1998” people have run around applying all sorts of different tests to all manner of short-term trends, without an overarching hypothesis guiding sampling or model selection, kinda like distributed data mining, then a cool year comes along and someone claims Eureka… then do we believe that one sample was taken, one test applied, the sampling was random, that conclusions which depend on one data point tell us something about models, and ignore that the observational anomalies track the center of the modelling distribution, and we conclude with 95% confidence that ‘the models are failing’?
Lazar: Gavin has claimed that it is foolish to take seriously the fact that models are running outside the 95% confidence intervals of all weather in all models if we include 2008 data, because this does not happen if we ignore 2007 data or if we add fake 2009 data. You read that article.
He does not specifically discuss type II error, but that idea is based on a misunderstanding of type II error. The idea is kooky.
Maybe, maybe not. The true structure of the noise always matters. However, Tamino’s analysis includes the data that was affected by volcanic eruptions.
We don’t know what weather noise looks like when there are no volcanoes and that’s a large component of the noise. The samples aren’t long enough to tell. Tamino started looking at the structure including volcanic eruptions as exogenous, but stopped. I assume he’s been too busy.
Michaels was applying a test using a notion Gavin suggested at RC! The differences are a) Michaels created a probability distribution for the spread of all weather in all models by looking at many years, b) Michaels did it a little later in time and c) Michaels showed his results were independent of the choice of start year.
When Gavin showed it, he restricted it to one specific year, which, if you download the data, turns out to be one when the standard deviation of all weather in all models has a local maximum. (I have no reason to believe Gavin did this on purpose. But if you were to cherry pick to show the models were ok, that’s what you would do!)
I still don’t see why you object to people using recent data or why you consider restricting to that choice “not random”. No one could know what the 2008 temperatures would be when the projections were made.
As each month trickles in, the results can change. So, yes, maybe they will. But for now, if we don’t concoct reasons to ignore recent data, models are either rejecting or close to rejecting using relatively conventional tests. (Like t-tests, or comparisons to distributions of all weather in all models.)
BTW: Had we done the d* analysis in January 2008 when the Dec. data came in, Hadcrut would have still been outside the 95% confidence intervals. So, we get the same answer. GISS would not. For GISS to reject using d*, we need to start after April 2008. So, if it rejects next month, we will reach the point where it will have been rejecting for a full year using this test.
Lazar:
That answer really makes me suspect you need to go back to a stats book and review the terms “type I” and “type II” errors. Their definitions and usage have nothing to do with being in or out of a random sampling context!
Lucia,
“Gavin has claimed that it is foolish to take seriously the fact that models are running outside the 95% confidence intervals of all weather in all models if we include 2008 data, because this does not happen if we ignore 2007 data or if we add fake 2009 data.”
1) I think you’re reading a binary meaning into Gavin’s post which isn’t there. He’s not saying that using a period ending in 2008 is right or wrong, but that confidence is overstated because the result is not robust.
2) Surely his point is about a high risk of type I errors? You’re claiming he’s using type II as justification. What is the hypothesis? Maybe I’m just being slow…
Lucia, your d* test is between linear trends… are the trends back to 1960 linear?
“Michaels was applying a test using a notion Gavin suggested at RC!”
Who did the test, whose test is better, and motives for doing the testing are irrelevant. The point isn’t Michaels versus Gavin. Every time a sample is taken or a different test applied, the probability of obtaining a result that is statistically significant increases. A large number of people have been looking at a large number of ‘short-term trends’ using different tests for how many years now? If someone comes along with a result claiming significance at 95%, are you happy to conclude (if you think the methodology is ok) with 95% confidence that ‘the models are failing’? I don’t wish to argue. Just would like your opinion. Yes/no.
“I still don’t see why you object to people using recent data”
I don’t. You mean 2008? I don’t either. Showing the effect of including/excluding 2008, making judgments as to robustness, are separate issues from deciding 2008 or any other year ‘should’ be included.
“or why you consider restricting to that choice “not random”.”
Temperature data are known beforehand.
“No one could know what the 2008 temperatures would be when the projections were made.”
The modellers are not the ones doing the testing.
“That answer really makes me suspect you need to go back to a stats book and review the terms “type I” and “type II” errors. Their definitions and usage have nothing to do with being in or out of a random sampling context!”
Meaning is malleable. If you’re not confident guessing meaning from context then asking the person using the word is usually more productive than insisting on standard usage… standards are averages… ‘course it’s tricky and sloppy language does not help discussion. Anyway, as you object I’ll rephrase without “type I”… when comparing anomalies as in the plot at the top of this page, it is often easy to visually identify when anomalies are outside a confidence interval because of a difference in trends or because noisy series have been shifted by the baseline pinning a low to a high.
Why does no one here even consider the fact that temperatures may go down and down and down for the next 10-20 years and all this (models, estimation etc..) will be considered absolute trash by then? The “climate scientists” involved in AGW may or should seriously start thinking about their future prospects if their whole AGW theory comes crashing down. People may actually get very angry with governments and research persons/institutions peddling ideas/data that are not actually happening, when they find their electricity bills etc. were increased under this pretext. This will be especially significant when they find out that the world is cooling and AGW turns out to be absolute drivel. This evidence is now beginning to pile up everywhere, even on IPCC-supporting sites such as cryosphere today (they cannot continue to hide the data/graphs as persons can keep records of past “adjustments” and will continue to do so). For example, mainstream newspapers in Australia that were, say, 5 months ago 100% pro-AGW are now turning pretty dramatically (see recent Sydney M H, Age, Australian etc).
vg,
“Why does no one here even consider the fact that temperatures may go down and down and down for the next 10-20 years and all this (models, estimation etc..) will be considered absolute trash by then? ”
I’ve considered them trash ever since I heard of them.
What we have here are people who cling to their models like they were holding onto a branch overhanging a river. I suppose they don’t know how to swim, otherwise they would let go. Or maybe there are alligators in the water. 😉
Andrew
vg:
There is no downside to excessive gloom and doom predictions.
When the AGW alarmism craze fades because (a) it’s not really happening and (b) the media gets bored, alarmists will still be permitted to bask in the memories of their moral and intellectual superiority. The fact of being wrong and proposing utterly ruinous solutions will have no consequences.
For example, the President’s current science advisor (Holdren) is a successful professional doomsayer and the former sidekick of the ridiculous Paul Ehrlich (of overpopulation fame), whose record of failed dramatic pronouncements is abundant.
But the weird truth is that being spectacularly wrong is of no consequence. Once a doom scenario becomes au courant and PC, there is some unspoken code that says no one is allowed to notice or remember when it proves to be BS.
Lazar–
This is what I think he means to suggest.
If Gavin means what you suggest, and I believe he does, I am saying that is bunk. What would Gavin’s favorite adjective ‘robust’ mean in this context? It would mean that, if we ignore this year’s data, the hypothesis test does not result in “reject”. This is using the known existence of high type II error to ignore type I error. It’s a classic mistake made by people who do not understand type II error.
No. The type I error for a specific test was already set by Michaels. The type II error (failing to reject a hypothesis that is false) is very high when one has very little data, which we do. There is only 1 year of data since projections were published, and roughly 8 years since the SRES were frozen. No one could have known what the as yet uncollected data would be when projections were made, so it is the only ‘randomly sampled’ data. All data prior to projections was deterministic at the time of projections.
Gavin has reduced the amount of random data in the sample, thereby jacking up the Type II error. Then, when he sees a fail to reject (which is highly probable even if the models are dead wrong) he wants to use this to suggest we should ignore the fact that we are getting a ‘reject’ when testing the hypothesis that models match observations. So, he is using the known high level of Type II errors to insist that we must set aside “rejects” when we see them.
Mind you, if he feels that we should not pay attention to “rejects” conclusions when testing the accuracy of models until we have a confidence level of 99%, that’s a valid opinion. In which case, he should just say so. But the argument he posed is silly.
No. And it can be shown that when the underlying trends are not linear, this test will “fail to reject” models that are wrong at too high a rate. This is actually rather well known for t-tests, but I ran some Monte Carlo to estimate the magnitude. I discuss that here.
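To make the mechanism concrete, here is a toy version of that kind of Monte Carlo (not a reconstruction of the one linked above): when the true underlying signal is curved rather than linear, the residuals from a straight-line fit inflate, the standard error widens, and the t-test rejects a trend that is wrong by the same amount less often. All numbers are illustrative.

```python
import numpy as np
from scipy import stats

def reject_rate(curvature, trend_gap=7e-4, n_months=120, sigma=0.1,
                alpha=0.05, n_sims=5000, seed=2):
    """Fraction of simulations in which a linear-trend t-test rejects a
    'model' trend that is too high by `trend_gap` (C per month), when the
    true signal is a linear rise plus a quadratic bump of size `curvature`.
    The quadratic part is orthogonal to the linear term, so the fitted slope
    is unchanged, but the residuals (hence the standard error) inflate."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_months)
    true_slope = 0.0015                        # underlying linear rate, C/month
    tested_slope = true_slope + trend_gap      # the (wrong) trend being tested
    rejects = 0
    for _ in range(n_sims):
        signal = true_slope * t + curvature * (t - t.mean())**2
        y = signal + rng.normal(0, sigma, n_months)
        slope, _, _, _, se = stats.linregress(t, y)
        t_stat = (slope - tested_slope) / se
        if 2 * stats.t.sf(abs(t_stat), df=n_months - 2) < alpha:
            rejects += 1
    return rejects / n_sims

# Curvature lowers the rejection rate even though the tested trend is
# equally wrong in both cases:
# print(reject_rate(curvature=0.0), reject_rate(curvature=1e-4))
```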
Whether happy or not, I would conclude that, based on the results of that test, the hypothesis that the models are correct appears false. After that, we can discuss whether the choice of years is cherry picked or whatever, and I am happy to give my opinion on that. My opinion about Gavin’s examples is this: Ignoring the most recent data is cherry picking. Using it is not.
I think the models look very bad right now. Maybe future data will reverse this. But right now, we can only use the data we have, and the models look bad.
So, when considering “robustness”, did Gavin randomly drop out years? Did he use rand() in Excel to pick which years of data to drop? Why didn’t he just drop 2007? Or 2006? Or guess 2009 would end as 2000? Or replace 2008 with 2005 so as not to decrease the length of the trend? Or do any number of other tests (all just as bizarre as the one he did)?
Gavin’s test for “robustness” is to specifically decide to drop the known cold year at the end, and he did this based on his knowledge of the answer and his preference for the outcome. (There is also the type II error.)
The 2008 temperatures were not known before the projections were made. If you mean the temperatures were known before the analyses were performed, sure. But using the most recent data is an established principle for looking at things like GMST.
So? They made the projections. There is no rule saying people who are supplied predictions can’t test whether they look ok.
The meaning of type I and type II error in statistics is not malleable. They are terms of art and have specific meanings; you are using them incorrectly. So, it might be best if you either review their meaning or avoid using them.
Lucia,
“It would mean that, if we ignore this year’s data, the hypothesis test does not result in “reject”.”
I see what you mean now and agree. However it is still useful to know when a conclusion rests on one year of data.
“insist that we must set aside “rejects” ”
He doesn’t say or imply that. You also ignore his point about robustness to dataset.
Gavin’s conclusion…
“To summarise, initially compelling pictures whose character depends on a single year’s worth of data and only if you use a very specific dataset are unlikely to be robust or provide much guidance for future projections. Instead, this methodology tells us a) that 2008 was relatively cool compared to recent years and b) short term trends don’t tell you very much about longer term ones.”
Lucia,
“After that, we can discuss whether the choice of years is cherry picked or whatever”
They don’t need to be cherrypicked for the above problems to apply.
“Ignoring the most recent data is cherry picking. Using it is not.”
Waiting for a really cool year whilst not reporting analyses which fail significance is not cherry picking? Do you know this wasn’t done?
“Gavin’s test for “robustness” is to specifically decide to drop the known cold year at the end, and he did this based on his knowledge of the answer and his preference for the outcome.”
Would you be equally happy to claim that Pat Michaels embarked on and reported the test based on his knowledge of 2008 and his preference for the outcome? Gavin has no public record for dishonesty. The same cannot be said for Pat Michaels.
“So?”
We’re discussing the data in terms of the test.
“is not malleable”
All meaning is malleable; it varies from person to person, moment to moment, context to context.
“it might be best if you either review their meaning or avoid using them”
I know their standard meaning. I’ll use language in any way which is useful, which depends largely on the person I’m talking with.
A simple question, Lucia – do you think that Michaels would have presented the same analysis in the same form to Congress a year earlier?
Simon, isn’t that rather beside the point? A year earlier Pat M wouldn’t have had anything to report, so he wouldn’t have reported. But now he does, so he did report. Trying to shift the argument back a year is ignoring the data that has come in. You might as well ask whether Pat M would have said that 30 years ago or 100 years ago — why ignore one year of data when you can ignore 100?
Lucia,
On the other hand,
“[Gavin] is using the known existence of high type II error to ignore type I error.”
… presupposes the hypothesis to be false, so I think your descriptions of Gavin’s actions are overreaching. As you said, you don’t know which are type I or II errors.
Andrew,
I specialise in points that are beside the point 😉
Ok, my point is that current data is not the ‘best data’ in terms of considering trends, and that is simply because it isn’t centred. Current data will be more useful in assessing trend in, say, fifteen years time. It’s surely freakin’ obvious that if the past year had been higher than trend expectations then Pat Michaels would not have been making a noise about the fact. If it had been, would that have been cause to assert the certainty of model projections? No (although I’m sure it would have been thus asserted – but let’s be sceptical about this both ways).
I’m unconvinced of the current model projections, btw, but I’m also unconvinced of analyses which do happen to look strong on the back of one year’s recent data.
Simon–
I doubt Michaels would have shown those plots last year, but there are two very innocent reasons for this:
1) Neither Climate Explorer nor PCMDI provided easy access to the data, so he would have had difficulty compiling it. (Chip downloaded from PCMDI.)
2) If they had been available, and he had downloaded the data, he would have had to say that the tests he had done indicated the models were not inconsistent with the data. What is likely is that the congressional rep who invited him would have said “Oh. Never mind. No need to visit.”
Could there be other reasons? Sure. I don’t know them. But some of these analyses are being done by the peanut gallery now because the model data has only become available now.
Are some people also tendentious? Yep. On both sides. When I read Gavin criticizing those particular graphs by Michaels my reaction is “pot kettle black”. Gavin was pleased as punch with very similar comparisons of “all weather in all models” when they looked ok to him last year before 2008 data came in.
BTW: On the other comment, what do you think the data aren’t centered on? (Also, btw, I predict that when next month rolls in, comparisons of HadCrut and GISSTemp will have failed the SANTER17 type tests for 12 months using monthly data.) Michaels’ tests compare trends for observations and models over matched beginning and end points. He’s not comparing to the IPCC’s average over their chosen span.
Lazar– My statement does not presuppose the models are false. The reason the probability of type II error is always high when very little data are collected is that analysts (including me) presuppose the null hypothesis is true. That means all my statements about type I and type II error are based on the hypothesis that the models are true. Other than that being the hypothesis which we test, no assumption is made about whether it really is true or not.
Lazar–
No. The conclusion rests on all the years of data used. As we get more data, the signal to noise increases. So, the change in the conclusion with the addition of the recent data arises because a) the noise has been reduced, resulting in a reduction in the error bars, and b) the data that happened to come in did not happen to decrease the difference between the projections and the observations.
By throwing away the last data point, you are both performing an analysis with a lower signal to noise ratio and cherry picking.
I don’t ignore it. I have explained why his notion of ‘robustness’ is peculiarly odd in the context of statistical analysis. He is finding an excuse to simultaneously cherry pick and rely on tests with reduced signal to noise?
Gavin’s conclusion, which you quote, is kooky, and is based on a lack of understanding of what happens when the probability of type II error remains high. Moreover, his summary misses the point Michaels made. Michaels was not suggesting the short term trend can be used to predict the long term trend. He was suggesting the short term trend suggested the models were off. These are different issues. Based on the things Gavin writes, it appears he doesn’t understand the difference. This is not Pat Michaels’ deficiency. It’s Gavin’s.
On the malleability of terms like “type I” and “type II” error:
If you wish to insist on your own idiosyncratic usages of type I and type II error, I suppose you will. But these are terms of art in statistics. Using them by some mysterious definition of your own while discussing statistics may make people suggest you have no idea what you are talking about.
Simon Evans [13198]
“I specialise in points that are beside the point.” You said it, and you are very good at that, indeed…..
Lucia-
“1) Neither Climate Explorer nor PCMDI provided easy access to the data, so he would have had difficulty compiling it. (Chip downloaded from PCMDI.)”
I’d have to say that Climate Explorer would be a bit nicer if it had an FTP interface, but it’s not difficult to use. PCMDI could use a better FTP interface, but getting the data is no problem. Just time consuming to process when you do it on a 2 GHz processor with only 512 MB of RAM ;(
Chad– What I meant was that both resources are fairly new. (Or they are new as far as I know.) Funding agencies have funded entities to make the data available recently. In the past, it was generally not possible for the peanut gallery to make these tests.
So, the accusation that people just “waited” for temperature to turn down before performing their tests isn’t quite fair. It happens that the temperature downturn coincides with model data being made available. People interested in making comparisons started making them, at least in part, because model data are now available.
Lucia,
Your response to Chad speaks to the fundamentals of the scientific method. The very notion that somebody somehow “waited” for some new data to run a new test tells us that whoever suggests such a thing [and by no means necessarily Chad] doesn’t understand that in science matters are never settled, statements by Gore and Associates to the contrary notwithstanding.
It is a matter of course, that when new data becomes available, the hypothesis is tested anew, especially if the hypothesis [read models] has been refined in the process.
Lucia,
“He is finding an excuse to simultaneously cherry pick and rely on tests with reduced signal to noise?”
When a result depends on using one dataset when there are other valid sets to choose from which throw the result, the suggestion that showing the outcome of using that one dataset and not the others is not cherrypicking, whilst showing how the result is sensitive to the choice of dataset is cherrypicking, strikes me as very up-is-down-black-is-white.
I think you have a good point regarding type II errors. But I also think you overreach in claiming that therefore sensitivity analysis should not be done. It should be done, but with a caveat for type II errors.
Lazar–
Up to now, you have not brought up this point. You have only discussed the issue of using or ignoring data from 2008.
If you are now referring to Michaels showing only Hadcrut, I criticized Michaels for that before Gavin did. You will note that I show results for HadCrut and GISSTemp. (I omit NOAA/NCDC out of laziness. I’m planning to add those now though.)
Yes: One should show all data sets, including those that don’t support your preferred interpretation.
As for sensitivity analysis: If Gavin merely said that the answer changed if we drop one year, I’d say sure. If he suggests it might change next year, sure. If he wants to point out that 95% confidence means rejections are bound to happen 5% of the time, fine. We could be experiencing one of those periods. I’ve said those things myself.
But if he wants to adopt high toned language and call people foolish for looking at the data that we actually have and observing that those rejections are occurring: He is being tendentious to the point of seriously misleading his readers. If anyone is going to call not ignoring data cherry picking, he has stepped into kooky territory.
Needless to say, because this idea is being advanced, next month when the data will probably reject based on the Santer method, as both HadCrut and GISS have since last April, I will post that. I will show that if we throw out 12 months of data, they still reject.
As far as I know, no El Nino popped up during April, so I’m pretty sure the temperatures will not have risen enough to make any difference.
So, when I show this, will your criteria be we have to throw away 2 years worth of data and still reject? 5 years? 10 years? What’s the criteria for your sensitivity analysis? We have to throw away data until the models are saved?
lucia ,
(Comment#13199)
Ok, I accept your points 1 & 2 , and also that people will be tendentious on both sides (said so myself above!).
I would not be suggesting that we should ignore recent data if running the Michaels analysis. However, if the analysis were run to a 1973 end point then the models would ‘look bad’ to the cautious side, would they not? Then three years later, for a 1976 end point, they would ‘look bad’ the other way. Whilst the model mean averages out such outliers, the analysis gives such weight to the end point that it can throw up ‘rejects’, or get close to that, in opposite directions over a short time (I hope I’m understanding this correctly – probably not, so I’ll expect to stand corrected!).
By saying that the most recent data is not centred I simply mean that we have no sure way of judging whether or not it is an outlier against a longer term trend, whereas we can now reasonably say that both 1973 and 1976 were outliers.
Simon–
When do you want to start any comparison of trends ending in 1973? I don’t know specifically for Michaels’ method (which is more flattering to models than the Santer method!). For all I know you are right. You can’t just guess though– the model predictions aren’t for a constant trend since 1900, so you really need to run the predictions and compare to observations.
Models with excess sensitivity should display too much cooling during periods when the models predict cooling and too much warming when models predict warming. So, having negative d* outside the 95% uncertainty intervals could, in some circumstances, be a symptom of excess climate sensitivity (or not.)
We never have a sure way of knowing if something is an outlier. But it is true that if the trends do rise, we will all be pretty sure the current dip is an outlier.
The difficulty with comparing to 1973 is…. Fuego erupted! So, we think we know why the data deviated from a positive trend. The models ‘predict’ cooling due to Fuego.
Lucia,
“But if he wants to adopt high toned language and call people foolish for looking at the data that we actually have”
where does he say that?
Lucia,
“So, when I show this, will your criteria be we have to throw away 2 years worth of data and still reject?”
Wanting to know when a test result depends on one year’s data does not equate with insisting the test does not contain that dependency.
Lazar–
See http://www.realclimate.org/index.php?p=663&langswitch_lang=it#comment-115889
“…. As for the recent testimony, you know as well as I that the new graph you made is just as affected by end point effects as the standard ‘global warming has stopped’ nonsense. While purporting to show 15 independent points, they are not independent at all and will move up and down as a whole depending on the last point. Plot it for 2007, or using the GISTEMP data for instance. If a conclusion depends on one point and one specific data set then it’s still a cherry-pick and one would be foolish to draw conclusions. – gavin].”
1) The point is not cherry picked. It is the most recent data. Gavin is saying that we cannot draw conclusions if we get a different result by ignoring the point from 2008.
2) No one purported to show the 15 points were independent. The purpose in showing the results with different starting points was to show the result is not a product of cherry picking the start point.
3) Gavin’s argument is that we should “not make conclusions” based on results including 2008, but should instead rely on conclusions based on his cherry picked end point of 2007. Worse, that data includes almost no data from after the projections were made!
Lazar–
Then you’ll be very happy in mid May when the d* results show the models are wrong both if we use data up to April 2009 and if we chop it off at April 2008, right?
Or, as I asked before, will the criteria require me to chop off 2 years of data?
Lucia,
The comment is not calling people foolish “for looking at the data” but for drawing conclusions when the result depends on “one point and one specific dataset”, and I agree that is foolish.
Lucia,
“will the criteria require me to chop off 2 years of data?”
Same answer as before.
Lucia,
3) Gavin’s argument is that we should “not make conclusions” based on results including 2008, but should instead rely on conclusions based on his cherry picked end point of 2007.
I cannot agree. The points made apply to this method regardless of which year is chosen. The illustration that different end points would throw up apparently significantly different results was made to show this, not to suggest that the method should be used but any other end point should be preferred “instead”. Gavin’s criticisms are of the method (although he also criticises Michaels for using such a method when the opportunity suits his tendency).
Lazar–
a) The results are not based on 1 data point. The fact that the rejection goes away when you remove 1 data point is not the same as results being based on 1 data point.
b) If we remove the data point he cherry picked to remove and which we have no reason to believe is incorrect, we should not draw conclusions. (Gavin could have programmed opinionated monkeys to remove points at random. They might have thrown out 2007 and kept 2008!) He gives this advice when we have very little data collected in the “post-projection” period, and are including data that those making projections were aware of before the projections were even made. So, the probability of type II error (using the standard statistical definition) is very large. We are very likely to fail to reject models that are wrong. In contrast, we have, at most, a 5% chance of rejecting models that are right. (Including data available before projections were made gives the models an advantage not normally permitted during analyses of this sort. After all, they are “predicting” data that existed before the predictions were made.)
So, if we follow his advice we will be unintentionally elevating our confidence criteria even further beyond 95% while calling it 95%. That’s silly.
So, Gavin’s claim is both perverse and foolish. If he wants to insist on a 99% confidence interval for his beloved models (to limit type I error to 1%) he should just say that’s what he needs to change his mind. But if he wants to tell people they are foolish to conclude models are off track because we are getting 95% rejections, well…. how is that foolish?
Simon– Notably, even if we remove 2008– representing the worst Gavin can do– the models still exceeded the data, and were hovering around the 1 Sigma level. If the models are wrong, we do expect these sorts of oscillations. But, once we start getting rejections two years in a row, it means our confidence they are wrong is higher that the stated 95% confidence level.
It’s never foolish to make conclusions that models are rejected at 95% confidence using the conventional method of not ignoring data. Of course you wouldn’t make this claim in 2007 when it was not true. But that’s not the same as waiting for the right data or cherry picking. Everyone, on everyside is looking at these data constantly. If we are using the 95% confidence level, we should be making type I errors 5% of the time.
That people like Gavin have no issues with papers like Rahmstorf’s, where the authors not only swooped in when the data briefly popped out of the TAR bounds, but did so only because they made up a novel method to determine the temperature in 1990, suggests that collectively, people may make the error of saying the earth is heating up even more often than one would expect based on claims of using 95% confidence levels. (Had Rahmstorf used more common methods of determining the temperature in 1990, the observations would not have looked high. Heck, if he’d used the method Michaels is using, they would probably not have looked too high– though I can’t be sure, because PCMDI only had detailed info about AR4 models.)
If Gavin really wanted to convince people he meant what he said, he might say “Look, Rahmstorf did that prematurely, and see? The temperature dropped.” But… no. (Admittedly, he’d still not convince people, because Rahmstorf announced quickly and did other odd things. At the time, Rahmstorf’s detractors were not using this newly made-up dictate of Gavin’s!)
Lazar
The conclusions don’t depend on one point. They depend on all the data used, which, in no case, was one point.
I always agreed that using 1 data point is ill advised. That said, it’s not “foolish” to note that we are seeing rejections with a particular data set using all data currently available. One can certainly “draw conclusions” about what current data show about models. People do this all the time, and it’s not “foolish”. Can our views change when new data arrive? Sure.
However, in fairness, one should admit that we don’t get them with the other well-respected data set. Michaels did not include GISSTemp or NOAA, and that was no more fair than Gavin throwing away the data point he doesn’t like.
So, you still won’t actually give a specific answer to this question? If the models reject when we throw away 1 year’s worth of data, will your criteria expand to require me to chop off 2 years of data? This has a “yes or no” answer.
Lucia,
“The results are not based on 1 data point. The fact that the rejection goes away when you remove 1 data point is a not the same as results being based on 1 data point.”
Rephrase: the result depends on one data point; if that data point is removed, the result disappears.
“So, the probability of type II error (using the standard statistical definition) is very large. […] So, if we follow his advise we will be unintentionally elevating our confidence criteria even further beyond 95% while calling it 95%.”
The confidence level is set by the alpha criterion for type I error rate. By definition, changing the type II error rate does not change the confidence level.
“If we remove the data point he cherry picked to remove and which we have no reason to believe is incorrect, we should not draw conclusions. […] So, Gavin’s claim is both perverse and foolish”
Repeat; he did not.
Additional highlighting on the logical and…
“‘If a conclusion depends on one point ***and*** one specific data set then it’s still a cherry-pick and one would be foolish to draw conclusions’ — gavin”
… and I fully agree.
Lucia,
“The conclusions don’t depend on one point. They depend on all the data use”
Of course the conclusions depend on other data; that does not exclude dependency on one data point. Having multiple necessary conditions does not mean an individual condition is not necessary.
“I always agreed that using 1 data point is ill advised. That said, it’s not “foolish”…”
I take this as moving toward Gavin’s position… it’s a fine line between “ill advised” and “foolish”, with the caveat that Gavin’s use of the adjective depended also on the sensitivity to dataset choice.
“it’s not “foolish” to note that we are seeing rejections with a particular data set using all data currently available”
‘Noting’ is not the same as drawing conclusions.
If it’s not “foolish”, is it “ill advised”? 🙂
“One can certainly “draw conclusions” about what current data show about models.”
That one can does not mean one should.
“People do this all the time, and it’s not “foolish”.”
I’ll settle for “ill advised” 🙂
“However, in fairness, one should admit that we don’t get them with the other well respected data set.”
Good.
“So, you still won’t actually give a specific answer to this question?”
Same answer as before; I do not “require” you to do sensitivity analysis.
Lucia,
My issue with the Michaels graph has nothing to do with statistical justifications. For sure, once one understands the way it represents the development of a short term trend from year to year then one can take it or leave it as just another means of graphically representing data. However, without that understanding (and I’m pretty sure this must have been the case for most, at least, of those who heard his testimony to Congress) I consider its ‘visual impact’ to be highly misleading. Any representation which in one year could see, apparently, ten years of temperature records comfortably in the middle of modelled projections but which then, in the year following, shows ten years at the bottom edge and beyond of the envelope may be just fine for statisticians (who realise that it’s graphing trends and not temperatures), but for non-specialists it is, IMV, inappropriately inviting of the misunderstanding I’ve deliberately included in this sentence.
It’s not just the graph which invited that misunderstanding, either. I quote from Michaels’ testimony, my bold:
“…in Figure 3 are the observed temperature trends for periods from five to fifteen years from the IPCC history, ending in December, 2008. It is very clear that temperatures are running at the lower limit for the .95 confidence level. In other words, the ensemble of the A1B models is failing.”
So, he jumped from describing trends to a statement implying that over the period he describes, temperatures had been “running at” the lower limit – that’s hardly true, is it, if a year earlier they had been “running” in the middle of the envelope?
Then, in the following paragraph, he did exactly what you have objected to Gavin doing –
“Figure 4 assumes that 2009 mean surface temperatures are the same as 2008, which is a very reasonable assumption at this time.”
– and then, of course, he presents Figure 4, which looks even worse. Now, whereas (I think) Gavin was showing the effect of different end points to demonstrate the potentially misleading nature of this method, Michaels was (I think) intending exactly the opposite impression to be received.
I think that anyone presenting data has a considerable responsibility to ensure that the form of presentation will not mislead the non-specialist. At the top of this thread you present your graph, which gives a fair representation of the data in a way that any engaged observer can understand. Of course, you might wish to graph trends in other ways, but I actually have confidence that you would do so in a manner which sought to avoid misunderstanding rather than to invite it.
I don’t know the Rahmstorf history you refer to, but my view is that ‘responsibility’ must apply to ‘both sides’.
Lucia,
“If he wants to insist on a 99% confidence interval for his beloved models”
Outside the context of this discussion, your work is likely to attract hostile responses. Please do not let that make you angry, or let that anger influence your statements. It will seriously put off people who are interested in science and obscure the good points you make (as with type II errors), as has happened at certain other sites where the bitterness and bias are too much to wade through.
Gavin has done a lot of serious work in modelling… would he prefer that work be shown to be useful? From a basic psychological standpoint, probably. You claim Gavin chose to exclude 2008 because it gave the answer he wanted. I think it’s only fair then to conclude Michaels chose to use HadCRUT because it gave the answer he wanted, or that Michaels did the test and published the results for that reason. Michaels is a known public liar in testimony to Congress precisely for cherry-picking to present models in a bad light compared to observations. That you have offered several innocent reasons as to why Michaels might not publish a conclusion which fails to reject, and refuse to be drawn on what non-innocent reasons might be, whilst you’re not willing to give the benefit of the doubt to Gavin… strikes me as unfair.
[sigh]
The inevitable destination of our recycled journey is here. We once again arrive at the same spot… Climate Science producing a choice between “ill-advised” and “foolish”. What is Useful? Who is UnFair? Why is it Cherry-picking?
Someone is taking this seriously?
If I was a kid again, I’d ask Grandpa to read me a different story next time.
Andrew
PS… The Kentucky Derby is tomorrow. 😉
Simon–
There are at least three issues that keep getting interlaced:
1) Did Michaels overstep or mislead? and
2) How do we interpret what happens when we specifically drop 2008?
3) Did Gavin overstep or mislead in suggesting it’s foolish to make conclusions about models based on analyses including 2008, because of what he finds if he drops 2008 data?
Lazar has been going on about (2), and now people want to switch to (1) and/or (3). The three issues can have different answers, and the difference is important. I think the answer to (1) is that Michaels’ presentation is a bit misleading. I think the answer to (3) is that Gavin’s discussion and comments overstep and are misleading. On (2), the issue Lazar introduced, I think that it’s foolish to attribute much significance to what we see if we ignore 2008. (Will we have even more confidence models are wrong if things reject year after year after year? Sure. But that confidence level will rise above 95%.)
If your criticism is that Michaels’ language was immoderate, and he said “temperatures” instead of “trend”, I agree with you. If you criticize Michaels for not showing GISSTemp when presenting to Congress– I agree, and said so in my first post on the subject.
I did a quick check and showed this:
What we see is that, ending in 2008, HadCrut trends are skirting the “all weather / all models” range. GissTemp trends do not, but they still look pretty low compared to the models, as seen here:
(See http://rankexploits.com/musings/2009/michaels-testimony-do-the-graphs-look-right/. My histogram uses less data than Michaels’, because he estimated noise based on all the trends embedded in 30 years of results and I only did one set of trends. But as you can see, ending in 2001, the models look not-so-good based on GISSTemp either, though they don’t reject.)
Had Michaels shown GISSTemp as well, we would see the trends currently fall well below the mean, but do not skirt the 95% boundary of “all weather and all noise”. Based on GISSTemp, including 2008, the observations seem to fall outside the 80% boundary. So, this choice would have supported Michaels’ story line somewhat, but nowhere near as strongly as using HadCrut alone. By not giving people the opportunity to see GISSTemp, Michaels’ choice was tendentious. (That said, many would still think the models looked bad, so it’s a bit of a mystery why Michaels didn’t show GISSTemp, as it doesn’t hurt his case very much and leaving it off does.)
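For anyone who wants to reproduce the flavor of that quick check, here is a bare-bones sketch. The arrays are synthetic placeholders, not the actual GISSTemp or AR4 series, and my real histogram uses a fitted noise model rather than the raw spread of run trends, so treat this as a sketch of the bookkeeping only:

```python
# Sketch of the quick check: where does an observed trend ending in 2008 fall
# relative to the spread of model-run trends over the same months?
# All series below are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(1)

def ols_trend(monthly_anomalies):
    """Least-squares trend in C/decade for a series of monthly anomalies."""
    y = np.asarray(monthly_anomalies, dtype=float)
    t = np.arange(y.size) / 120.0          # time in decades
    return np.polyfit(t, y, 1)[0]

months = 96                                 # e.g. Jan 2001 - Dec 2008
t = np.arange(months) / 120.0

# Placeholders: swap in the real observed anomalies and the AR4 run output.
obs = 0.10 * t + rng.normal(0.0, 0.15, months)
model_runs = [0.20 * t + rng.normal(0.0, 0.15, months) for _ in range(55)]

obs_trend = ols_trend(obs)
run_trends = np.array([ols_trend(r) for r in model_runs])
lo, hi = np.percentile(run_trends, [2.5, 97.5])   # crude "all runs" 95% range

print(f"observed trend:  {obs_trend:6.3f} C/decade")
print(f"model 95% range: [{lo:.3f}, {hi:.3f}] C/decade")
print("outside the range" if not lo <= obs_trend <= hi else "inside the range")
```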
You criticize Michaels for saying temperatures instead of temperature trends: Yep. That’s a goof. It’s in a written document, prepared for Congress, so it was sloppiness beyond a mere slip of the tongue. (Oddly, I don’t think Gavin actually criticizes him for that.)
Despite this, I think, the result is rather remarkable.
What are we to make of Gavin’s point about 2007? I think the results including 2008 remain remarkable even though we don’t reject at 95% if we ignore 2008. Ignoring 2008 means ignoring most of the data acquired after the projections were published in 2007.
So, how do I interpret what happens if we ignore 2008 (the only full year’s worth of data after the projections were published)?
If we do ignore 2008, the models still look high, just not as high. This would have been evident if Gavin had included the mean on his graphs, rather than adding the min/max extent bounds. (If we had enough model runs, we all know the min/max extent bounds would stretch out toward positive and negative infinity! So, what was the purpose of adding these? See the extra lines added at http://www.realclimate.org/index.php/archives/2009/03/michaels-new-graph/ .)
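Here is a toy illustration of why min/max bounds tell you very little; the run trends below are made up:

```python
# Toy example: as the number of runs grows, a 95% range converges while the
# min/max envelope just keeps widening (for roughly Gaussian spread).
import numpy as np

rng = np.random.default_rng(2)
for n_runs in (20, 55, 500, 5000):
    trends = rng.normal(0.2, 0.1, n_runs)      # hypothetical run trends, C/decade
    lo, hi = np.percentile(trends, [2.5, 97.5])
    print(f"{n_runs:5d} runs: min/max = [{trends.min():+.2f}, {trends.max():+.2f}]"
          f"  95% range = [{lo:+.2f}, {hi:+.2f}]")
```

The 95% range settles down as runs are added; the min/max envelope only widens, which mostly hides how far observations sit from the bulk of the runs.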
So, it’s not as if dropping 2008 makes the models’ projections look low. But worse: are these graphs really much more than a hindcast? Wouldn’t you expect the modelers compared model output to observations before finalizing their decision to base projections on the multi-model mean without throwing out any models? So, when discussing whether or not showing Michaels’ graph is misleading (provided someone knows he means trends), we are arguing about how seriously we should take the fact that the models did not look too bad compared to data available to the modelers when the projections were published!
So, for this reason, and others discussed above, I think Gavin is going well past the bounds of balance when he suggests it is foolish to give much consideration to results including up-to-date data, just because the models don’t look so bad if we look at graphs of GISSTemp plotted ignoring 2008 data. (The models look pretty bad for the three cases of a) HadCrut including 2008, b) GISSTemp excluding 2008, and c) GISSTemp including 2008. This is where the “and” comes into play in the quote where Gavin calls Chip “foolish”.)
On this:
“Figure 4 assumes that 2009 mean surface temperatures are the same as 2008, which is a very reasonable assumption at this time.”
Way back before Gavin wrote anything, I answered your question about that. See: http://rankexploits.com/musings/2009/michaels-testimony-do-the-graphs-look-right/
Had Gavin simply said the fictional graphs using 2009 tell us almost nothing beyond what we learn from 2008, I would agree. But the lecture to Chip about what is “foolish” had nothing to do with 2009, and, in comments here, Lazar was not discussing the fictional 2009 issue. It had to do with what happens if we ignore 2008.
Responding to Lazar’s point about dropping 2008: in my opinion, given the circumstances (including the fact that the projections were published in 2007), dropping 2008 and suggesting that a comparison ending in 2007 makes it “foolish” to draw conclusions from a comparison ending in 2008 is silly. We can draw conclusions about what we see based on the full data. Whether or not everyone will agree is another matter. But people draw conclusions about what current data tell us all the time, and it’s not “foolish” to do so.
Well… usually, if someone asks, I just make the plot! Sometimes it’s too time-consuming and would require too much work… but I usually just do them.
Lazar–
Why do you think I’m angry? I am aware that criticizing Gavin can create hostile responses. But I don’t find that a reason Gavin should be shielded from criticism when he posts utterly tendentious arguments that hide discrepancies between models and observations. Do I think Gavin loves his models too much and loses all objectivity when assessing criticism of those models? Yes. Absolutely.
On this: as you can see above, I criticized Michaels for not including GISSTemp, and always have.
You should be a bit careful here. Michaels did not lie to Congress. As far as I can see, in general, he and Gavin are about equal with regard to promulgating tendentious discussions of models.
You specifically asked me whether or not Michaels would have performed and presented this analysis in 2008. The true answer is: it would have been nearly impossible for him to have done the analysis.
What would Michaels have done if the model data had been available? Who knows. But if you are going to ask me to justify why Michaels didn’t do the impossible, I’m going to point out that it would have been impossible for Michaels to have done what you seem to think he should have done.
Now, as you like hypotheticals, let me ask you this: if Gavin had found the model results still looked bad after dropping 2008, do you think he would have posted? Or do you think he would have deferred posting until an El Niño hits?
Mind you, I have no idea what Gavin would have done. All we know is he did do the analysis dropping 2008, did not do the analysis dropping intermediate years at random, and posted what he found.
Lucia,
Thanks for the response. I don’t think I dispute anything you’ve said, but I am still concerned about the understanding of this (I don’t mean your understanding, but that of policy makers).
I agree that models are not looking so great over this century so far. However, of course, we don’t know why that is. The implication of cloud feedbacks being off is very different to the implication of aerosol inputs being off. And so on.
A problem, I think, in the ‘message’ from Michaels’ testimony is that the current offness of the models can be simply extrapolated into the future. Ironically, this puts faith in the long-term consistency (i.e., consistently ‘wrong’) of the modelling whilst pointing out the short-term inadequacy of it. It just doesn’t follow that factors which may be pulling the temperature trend below expectations now will pull it below by an equivalent degree into the future. Of course, if the physical assumptions are wrong, then that is another matter.
My own (naive, no doubt) hunch is that assessments of equilibrium response may turn out to be good, but that trends may vary up and down between now and then, perhaps over long periods!
I think there’s another issue with language, which is that words like ‘reject’ and ‘falsify’ can all too easily lead some to presume that what is being said is that AGW analysis is altogether rejected or falsified, whereas in fact it’s just a rejection/falsification of a particular trend. I know that’s not your intention, but I’m just mentioning it.
Simon
Absolutely. The issues of “are projections off” and “why” are somewhat separate. They are linked only insomuch as, if the projections are not off, we don’t need to ask “why?”
Possibly. Some will, and do, say that models undershooting right now can mean one and only one thing, and they tell you what that is. Models could be off now because aerosols are wrong, because time constants driven by mixing in the oceans are wrong, etc. It could be that models being off means the situation is worse than if they were on. We don’t know.
Yes. “Reject” and “falsify” can be misunderstood. Unfortunately, it is often those who want to dispute that the models’ projections are off who create the most confusion. I find it is more often those who advocate strong measures to counteract climate change who want to suggest I have claimed to have falsified the entire enhanced greenhouse effect. Then… people read the statistical analysis, listen to those “rebutting”, and come to believe I’ve falsified “everything”. It would take a lot more to falsify “everything” (and I don’t think anyone will ever succeed!)
I just test trends projected by models.
Lazar,
Why should Gavin be given the benefit of the doubt with regards to motives for choosing specific data when he does not extend this courtesy to others? See here for example: http://wmbriggs.com/blog/?p=118&cpage=2#comment-2593
One possibility as to why the projections may be off the mark (too high) is the fact that solar forcing was frozen from the year 2000 on. I did some multiple linear regression and, for 2000–present, held TSI in place at its year 2000 level and found a temperature trend with 2°C/century well within the 95% CI. I didn’t publish because I was sure there are lots of hidden pitfalls with multiple linear regression that I’m not aware of. I could be totally wrong.
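Roughly, the sort of thing I did looks like the sketch below. Everything in it is synthetic stand-in data rather than my actual inputs, the coefficients are invented, and it ignores lags and autocorrelation, which is part of why I am wary of the result:

```python
# Rough sketch of the regression-plus-frozen-TSI idea; all inputs are synthetic
# placeholders, not real CO2/TSI/ENSO/temperature series.
import numpy as np

rng = np.random.default_rng(3)
n = 29 * 12                                  # Jan 1980 - Dec 2008, monthly
t = np.arange(n) / 12.0                      # years since 1980

# Stand-ins for the real predictor and temperature series:
ln_co2 = np.log(340.0 + 1.7 * t)                             # CO2-like growth
tsi    = 1366.0 + 0.5 * np.sin(2.0 * np.pi * t / 11.0)       # crude 11-yr cycle
enso   = rng.normal(0.0, 1.0, n)                             # stand-in ENSO index
temp   = 3.0 * (ln_co2 - ln_co2[0]) + 0.10 * (tsi - tsi.mean()) \
         + 0.08 * enso + rng.normal(0.0, 0.15, n)            # fake anomalies

# Fit the multiple regression on the full record.
X = np.column_stack([np.ones(n), ln_co2, tsi, enso])
beta, *_ = np.linalg.lstsq(X, temp, rcond=None)

# Counterfactual: hold TSI at its year-2000 value from 2000 onward.
i2000 = 20 * 12
tsi_frozen = tsi.copy()
tsi_frozen[i2000:] = tsi[i2000]
fitted_cf = np.column_stack([np.ones(n), ln_co2, tsi_frozen, enso]) @ beta

# Trend implied by the counterfactual fit over 2000-2008, in C/century.
slope_per_yr = np.polyfit(t[i2000:], fitted_cf[i2000:], 1)[0]
print(f"counterfactual 2000-2008 trend: {100.0 * slope_per_yr:.1f} C/century")
```

With the real series, the idea is the same: fit the regression on the full record, freeze the TSI column at its 2000 value, and see what trend the fitted values imply after 2000.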
Chad–
JohnV has been suggesting it’s the solar for a long time.
I actually don’t know if modelers all held the sun at the 2000 level, or if they knocked it down to the average level for the solar cycle and held it there.
What did you fit in your regression? Optical depth for volcanoes? Solar? Etc?
Here’s something Tamino wrote about the solar cycle a year ago: http://tamino.wordpress.com/2008/04/05/stalking-the-elusive-solar-cycletemperature-connection/
I don’t know if he pursued discussions with Camp and Tung, or what he currently thinks about the effect of the solar cycle. If it is as large as Camp and Tung suggest, that could be the reason for the low trend. If it’s much smaller, as quite a few people believe, solar cannot be the reason.
I fit all the usual suspects in the climate system (ln(CO2), TSI, aerosol forcing, ENSO, etc.). I couldn’t get my fitted values to have as much variability, though.