Chad at Scientific Prospective emailed with a hypothesis:
…So I came up with the hypothesis that the models apparently are over-predicting warming because what we are comparing them to does not adequately account for the temperature in polar regions, where warming supposedly should be amplified.
Here’s my plan: HadCM3 has higher spatial resolution than HadCrut, so I will average the data so it has the exact same mesh grid as HadCrut. I will take the spots in HadCrut that are missing values and mask out the corresponding lat/lon points in the HadCM3 data. This way they will both have identical coverage of the globe, month to month. I’ll do the necessary calculations and who knows what we may find. I suspect it will decrease the rate of warming. But we’ll see.
I told him this was an excellent plan. After all: it would be best to compare HadCrut trends to model run trends for surface temperatures defined over the portion of the planet HadCrut actually measures.
Because Chad’s results will be scrutinized, I suggested he should make sure that his script replicates the results at The Climate Explorer and to post the comparison at his blog. That way he can link back when he does post the more exciting posts.
I encourage you all to go have a look, because I think Chad’s next few posts will be exciting indeed.
I can’t predict who they will please. Maybe computing model temperature anomalies based on temperatures over the HadCrut grid will bring the trends down enough to make a difference to the results of hypothesis tests. We’ll see!
Either way, when Chad gets these together, we’ll have a second set of “models projections” to compare against data. That will be fun. 🙂
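For readers who want the flavor of what Chad's script has to do, here is a rough sketch in Python with numpy. The array sizes and values are toy placeholders, not the real HadCM3/HadCRUT grids: coarsen the model field by block-averaging down to the observational grid, then blank out whichever cells the observations are missing, month by month.

```python
import numpy as np

def regrid_block_mean(model, factor):
    """Coarsen a (time, lat, lon) field by averaging factor x factor
    blocks of cells; assumes the dimensions divide evenly."""
    t, ny, nx = model.shape
    return model.reshape(t, ny // factor, factor,
                         nx // factor, factor).mean(axis=(2, 4))

def mask_like_obs(model_coarse, obs):
    """Blank out model cells wherever the observations are missing,
    month by month, so both fields share identical coverage."""
    out = model_coarse.copy()
    out[np.isnan(obs)] = np.nan
    return out

# toy example: 2 months of a 6x12 model grid coarsened to 3x6
model = np.random.rand(2, 6, 12)
obs = np.random.rand(2, 3, 6)
obs[:, 0, :] = np.nan            # pretend the polar band is unobserved
masked = mask_like_obs(regrid_block_mean(model, 2), obs)
```

After this step the masked model field and the observations cover exactly the same cells each month, which is the whole point of Chad's plan.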
I thought that HadCRUT didn’t have data for much of the extreme polar regions.
It was my impression that GISS interpolates data over these regions, and that the resultant warming in the extreme arctic accounts for a non-trivial portion of the difference between GISS and HadCRUT.
If he is worried about HadCrut’s lack of data toward the poles (actually there is a lack of data period…) then why not just use GISS which interpolates arctic warming? That it is an interpolation is hardly a defense, but more a confession that GISS is great for soundbites but a mess for science.
Jason,
HadCrut does not have data for the poles. The model projections in the AR4 include the poles.
I don’t see how data from 10% of the surface can significantly affect trends.
More importantly, the best data from the south pole suggests that it is in a cooling trend which would tend to cancel out the effect of higher trends in the arctic.
Lastly, estimating data does not provide us much useful information. If it did, I am sure HadCRUT would have included it in their datasets. If one really wants to compare apples to apples then one must calculate model GMST using only the data from the areas with HadCRUT coverage.
Raven–
Chad already has a script. He is going to recalculate the model temperatures, omitting the bits HadCrut doesn’t measure. So, pretty soon, we’ll see whether it makes a tangible difference.
I think it was a pretty cool idea Chad had.
very cool idea
Lucia,
Sorry I misunderstood what was being done.
That said, I think it would be worth having a more of a discussion of the significance of this test before the results are known.
i.e. there are 3 possible outcomes: higher trends (i.e. models look even worse), no change or lower trends (i.e. models look better).
I suspect many warmers would be quick to dismiss the results of this test if the trends ended up being higher, while they would loudly trumpet the results if the trends ended up being lower.
This leads me to believe that the test has little ‘power’.
A test that had a lot of ‘power’ would be a test where results that directly contradict the expectation would be considered as significant as results that confirm the expectation.
In my understanding, the models perform very poorly when it comes to representing areas as small as a single grid cell so it is not possible to predict the effect of removing model grid cells based on the availability of station data. For that reason, I would not consider the results interesting even if they did produce higher trends (i.e. a result that would support the lukewarmer case).
Raven–
The test will have the same statistical power as the ones I have been doing. It may turn out to have the same power to change people’s minds also– and that seems to be little.
However, I am grounded in the empirical method, and I think it’s best to compare like to like. So, if, relative to Chad’s new “projections” of temperatures based on HadCrut, the models are ok, I think that’s worth reporting. Or, if they are not, that’s also worth reporting.
I’m almost certain Chad’s new trends will be lower. I don’t know if they will be sufficiently lower to make HadCrut not reject using a Santer test.
I see what you are saying, but I am not convinced that this approach is any more of a like to like comparison, since the HadCrut temps represent the average of one or more points 1m above the surface but the model temps represent an average of a large block of atmosphere.
It seems to me that the criticisms used to reject the satellite temps also apply here. In fact, it would be possible to create a similar “like to like” comparison between the model results and the satellite temps by adding a weighted average of a few vertical blocks into the ground temps.
Raven:
Yes. It would be possible. All that is required is for someone who is interested in this going to the effort to do it.
Right now, SteveM seems to be doing this for the tropical troposphere. But I suspect at some point, someone will make projections to compare to the satellites.
So, does this mean that previous comparisons are apples to pears?
Jonathan–
It’s more like Golden Delicious to Granny Smiths.
HadCrut land/ocean is attempting to observe the surface temperature of the earth. The projections are supposed to be the temperature of the earth.
However, HadCrut does not cover the full globe, while the projections do. There are other issues. But HadCrut is at least supposed to be Hadley’s best attempt to measure ‘that which is projected’.
GISSTemp tries to correct for the missing measurements in polar regions.
Chad is trying to make the comparison closer to Golden Delicious to Red Delicious. These are at least both sweet varieties of apple; as Raven noted, there are still differences between observations and the actual honest to goodness temperature of the earth. But what are you going to do?
If Raven’s argument is used, then we would have to say that it is impossible to compare models to any presently available instrumentation.
I think a potential problem with this is the poor local performance of the models, which presents a whole new slew of arguments about whether such a comparison is even an apples-to-coconuts comparison. My personal inclination at the moment (without having put tons of thought into this) is that the global averages are a more accurate comparison despite the poor instrumental coverage at the poles.
I would definitely be interested in the results, regardless.
At least coconuts are edible fruits!
Clearly, we need a scale:
“Red Delicious to Red Delicious”,
“Red Delicious to Granny Smiths” (i.e. “Apples to apples”),
“Apples to Oranges”
“Apples to Coconuts”
“Apples to Walnuts”
“Apples to tulips”
“Apples to Bunnies”
“Apples to Dinosaurs”,
“Apples to Dirt”,
“Apples to grenade launchers”,
“Apples to Jupiter”.
“Apples to Andromeda”.
etc.
“Apples to Andromeda” by Andrew
Like to dislike comparison
Apples to Dirt
Should I try some soothing smoothing
When my brain starts to hurt?
I liked apples to bunnies because it was so unexpected. 🙂
From Buffy The Vampire Slayer: Once More With Feeling, Bunnies, by Joss Whedon
Couldn’t resist
Dewitt Payne-Haha! All I remember from that show is “Fire bad, tree pretty”-Haha! Hm, the absolute bottom of that scale should be Apples to antimatter apples, which are just as delicious, except they make you explode and take half the solar system with you.
Speaking of Buffy… how about SMG? That’s a three-letter acronym I can dig. 😉
Andrew
Ryan,
Models are designed for a purpose. The current crop of models are designed to produce a semi-realistic estimate of GMST. They are not designed to produce realistic estimates for the average temperature of a grid block. In fact, the model makers get quite upset if someone tries to draw conclusions based on the temperature estimates for a single block.
For that reason, I feel that dropping grid blocks out of the calculation of GMST is the equivalent of dropping random terms from a Fourier series approximating a waveform, i.e. the process will change the waveform but the meaning of the resulting waveform is unknown.
Here is a more concrete reason why the exercise is not necessarily meaningful: Let’s say that a model predicts warming in the poles that is much, much larger than what could be physically plausible given the other data points we have (ice, snow, anecdotes). It is quite possible that removing the poles from such a model would bring the model into alignment with the current data, yet I don’t think anyone would agree that means the model is ‘right’. It just means that the largest errors happened to be in the area that was conveniently excluded.
I’m about 75% done with my analysis. I had to rewrite all of my code because it was grossly inefficient (down from 30 minutes of calculation to 5 minutes).
Raven–
Plenty of modelers look at model projections for regional temperatures. Look at Steig!
Some modelers who blog get upset when you suggest that models are projecting regional temperatures incorrectly. They don’t mind anyone doing the exact same comparison, provided the result is favorable to models.
Chad has identified a reason why a test might not hold up. So, it will be interesting to see what happens when we throw away the arctic data.
Raven:
Sure. But still…. The measurements do exclude that region. It’s useful to consider this.
I think the exercise is worth doing. As with most things in climate science, I think the caveats are often as important as the results – whatever they may be.
That said, I think it would make more sense to do a zonal analysis (tropics, subtropics, poles) instead of simply dropping the blocks that happen to have no coverage.
Raven:
My universal answer to people who suggest some ‘other’ analysis is that the person investigating things as a hobby gets to decide which things interest him. I think this one is interesting. Maybe after Chad does this, he’ll be interested in zonal analysis.
I’m voting he creates monthly data for the troposphere so we can test RSS and UAH! But it’s his choice.
Lucia,
My choice of words was “I think it would make more sense…” not “Chad should do…”. The latter is a suggestion that someone else do work. The former is a critique of the work being done.
It is not clear to me why doing an alternate analysis is a precondition to criticizing someone else’s analysis.
Raven–
Sure. What you said is not necessarily a criticism. You’re just expressing the point of view that you’d be more interested in a different analysis.
The other analyses may be more interesting or not. One disadvantage I can see to testing zonal projections rather than global predictions is that the less area you use, the noisier the data will be. So, you need longer periods of time to reduce the level of type II error.
But, if someone has a specific question about the ability of models to predict zonal temperatures, they would need to test against projections for zonal temperatures. So, for example, those testing tropical troposphere temperatures are looking at that zone.
I have to admit I’m more interested in the ability of models to predict even global temperatures. I don’t concern myself about the other things until they can demonstrate ability to predict global averages. So, I’m more interested in what Chad is doing than in zonal projections.
Is there some particular question about zonal predictions that interests you?
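The noise point above is easy to illustrate numerically: over the same time period, a noisier series gives a less certain trend estimate. A sketch with purely made-up noise levels (the 0.1 and 0.3 standard deviations are illustrative, not real global vs. zonal statistics):

```python
import numpy as np

rng = np.random.default_rng(0)

def trend_spread(noise_sd, n_months=120, n_trials=2000):
    """Spread (standard deviation) of the fitted OLS trend across
    synthetic series sharing one true trend but differing in noise."""
    t = np.arange(n_months)
    slopes = []
    for _ in range(n_trials):
        y = 0.002 * t + rng.normal(0.0, noise_sd, n_months)  # 0.002/month true trend
        slopes.append(np.polyfit(t, y, 1)[0])                # fitted slope
    return np.std(slopes)

# a smaller-area (e.g. zonal) average is noisier than a global one,
# so its trend is less tightly constrained over the same 10 years
spread_global = trend_spread(noise_sd=0.1)
spread_zonal = trend_spread(noise_sd=0.3)
```

Since the trend uncertainty scales with the series noise, the zonal spread comes out roughly three times the global one here, which is why zonal tests need longer records.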
The argument that Chad is making is that the polar zone is not accurately represented by the real data. He plans to provide an analysis which partially excludes the polar zone to compensate for the missing data.
I am arguing that individual grid cells are too small a unit when one is dealing with climate models and that it is misleading to arbitrarily remove individual grid cells.
However, I am also suggesting that an entire zone may be a large enough unit if one wants to investigate the effect that Chad noted, i.e. if Chad’s hypothesis is correct then we would expect to see a good match for the tropical and sub-tropical zones.
Unfortunately, even if we see a good match for the tropical and sub-tropical zones that does not mean the models are correct – it just means that the model errors are largest in the polar regions.
Raven–
Chad isn’t arbitrarily removing cells. He’s removing cells corresponding to locations where temperatures aren’t measured.
Yes. If Chad’s hypothesis is correct, looking at zones is better than looking at the full surface temperature. But why is it better to use zones than to screen by thermometer locations?
Given that the polar zone happens to be the region that is not covered by thermometers, there may be little practical distinction between what Chad is doing and what you suggest. But suppose, for some mysterious reason, the entire continent of Australia had banned thermometers back in 1900 and stuck with that, why not throw out Australia? Basically: Why not throw out data from places that are not sampled? Why would it be better to include Australia just because it falls in a “zone”?
Maybe there are good reasons to screen by geographical zones instead of using the thermometer locations to dictate the screening. But what are they?
Sure. A good match would not ensure models are correct. All we can ever do is either a) show hypotheses about models reject at some confidence level or b) discover that the hypotheses fail to reject.
Still, if the hypothesis that models fail to reject happens often enough, with sufficient amounts of data, and with no identifiable problems in the comparison, then in my opinion, our confidence in models should increase. If they do reject, our confidence in models should decrease.
The fact is, Chad did identify a not-apples-to-apples aspect, and it’s worth addressing it somehow. I think screening based on the location of the thermometers seems a useful way to make the comparison more fair.
Lucia,
We have two datasets that we want to compare. The trouble is one of the datasets is not complete so we need to do something to compensate.
One option is interpolation – i.e. choose a suitable value for the missing data and then compare. When we compare the averages for all available data we are implicitly assuming that all missing data points can be replaced by the average for the dataset. This approach does bias the comparison if we know that the missing data points are significantly larger or smaller than the average.
Dropping data is another option – i.e. leave data out of the first dataset in order to provide the same “coverage”. When we compare averages for only the matching data points we are implicitly assuming that all missing (real) data points can be replaced with the corresponding value from the other (modelled) dataset. This approach also biases the comparison if we know that the missing data points in the first (modelled) set are likely to be wrong.
My suggestion to do zonal comparisons is a slight variation on the first approach that uses the average for the zone to fill in missing data points. This should reduce the bias caused by missing data that happens to be in a region where larger than average anomalies are expected.
I do not feel that any comparison between the models and data that implicitly assumes that the models are correct over the regions where we have no data is reasonable, since we are trying to determine whether the models are correct in the first place. Interpolation (implicit or explicit) is the only option we have.
That said, we could do what GISS does and try to come up with more realistic estimates for the missing regions. However, such an approach depends on the ‘model’ used to estimate the missing temperatures. The gigabytes of text produced by the two Jeffs on Antarctica should illustrate how estimation is a rat’s nest.
For that reason, I feel the naive interpolation done when calculating averages for all available data gives us the most meaningful comparison. The same approach applied to the zones would supplement that and allow us to better understand the bias introduced by the naive interpolation.
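Raven's distinction between the two implicit fills can be made concrete with a toy example (made-up numbers, with the "polar" band warmer than the rest): averaging only the available cells behaves like filling each gap with the dataset-wide mean, while the zonal variant fills each gap with the mean of its own latitude band.

```python
import numpy as np

def fill_with_global_mean(field):
    """Replace missing cells with the mean of all available cells --
    what averaging over only the available data implicitly does."""
    out = field.copy()
    out[np.isnan(out)] = np.nanmean(field)
    return out

def fill_with_zonal_mean(field):
    """Replace missing cells with the mean of their latitude band."""
    out = field.copy()
    for i, row in enumerate(out):
        row[np.isnan(row)] = np.nanmean(field[i])
    return out

# toy anomaly field: warm polar band with a missing cell
field = np.array([[np.nan, 4.0, 5.0],   # polar band, partly missing
                  [1.0,    1.2, 0.8]])  # midlatitude band
g = fill_with_global_mean(field)        # gap filled with 2.4
z = fill_with_zonal_mean(field)         # gap filled with 4.5
```

The zonal fill gives the missing polar cell a warmer value than the global fill does, which is exactly the bias Raven is describing when the missing region is expected to have larger-than-average anomalies.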
Raven
Why do you think this? If it’s missing in the data, you leave it out of the model. There is no assumption that the model values you left out matched the data. That would only happen if you filled in the experimental data with data from the model, then computed the ‘experimental’ data on that basis.
The comparison will be of the trends over the same surface areas. We should see more noise in the models (because we left out area.) We may see different trends.
I think your zonal method introduces a bias into the comparison because it computes the anomalies giving different weightings than the method used for the data. So, if the models are right, and, say, Australia had banned thermometers, then the model trend would include a bias proportional to the difference between Australia’s anomalies and those in its zone. Meanwhile the data would have a bias proportional to just omitting Australia. When comparing, the net bias is the difference between the two biases.
If you just compute the two using the same area weighting, both would be biased compared to the trend for the full surface area weighted anomalies for the earth. But neither would be biased compared to each other.
If Hadley fills in data points using your zonal method to compute their mean, then Chad should fill in using their method before comparing to HadCrut. But if they don’t… then the best comparison for the models is to weight as closely as possible to mimic what Hadley does.
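The like-to-like weighting described above can be sketched as follows: compute cos-latitude area-weighted means for model and observations using one shared monthly mask, so that coverage bias cancels in the comparison. The grids and trend values below are toy placeholders, not real model output.

```python
import numpy as np

def weighted_series(field, lats, mask):
    """Cos-latitude area-weighted mean over only the cells in `mask`,
    applied identically to model and observations."""
    w = np.cos(np.deg2rad(lats))[None, :, None] * np.ones_like(field)
    w = np.where(mask, w, 0.0)
    return (np.where(mask, field, 0.0) * w).sum(axis=(1, 2)) / w.sum(axis=(1, 2))

# toy setup: 24 months on a 3-band grid; polar band unobserved
lats = np.array([-75.0, 0.0, 75.0])
months = np.arange(24)
model = 0.010 * months[:, None, None] + np.zeros((24, 3, 4))  # toy trend
obs = 0.008 * months[:, None, None] + np.zeros((24, 3, 4))    # toy trend
mask = np.ones((24, 3, 4), dtype=bool)
mask[:, 2, :] = False                 # drop the polar band from BOTH fields

m_trend = np.polyfit(months, weighted_series(model, lats, mask), 1)[0]
o_trend = np.polyfit(months, weighted_series(obs, lats, mask), 1)[0]
```

Because the identical mask and weights are applied to both fields, any bias from the omitted polar band affects both trends the same way, so the model-vs-observation difference is clean.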
RE: Raven (Comment#13142)
.
What you stated in this comment is kind of my concern. My earlier comment was about the apparently general notion that models couldn’t be compared to HadCRU or GISS regardless of whether the comparison was whole or partial.
.
It’s not the same situation as Fourier analysis, though. The problem is that if the models do not faithfully reproduce small scale phenomena, cutting out portions for comparison could itself introduce errors.
.
But the basic point – that this type of analysis might not be any more accurate than looking at the aggregate measurement – is a potentially valid one, I believe.
.
With that being said, I still want to see what it looks like. 😉
Lucia,
It really comes down to what the results are claimed to mean.
As a thought exercise, let’s assume that the US continental measurements are the only measurements which can be considered to be ‘reliable’. Based on that assumption one could calculate the mean surface temperature from the models by only using grid cells from the US and do the comparison with the real data from the US.
Based on those results one could make a claim about how the models predict the US temperatures. But one *cannot* make any claims on how well the models predict the global temperatures because the test did not include temperatures for the globe.
Chad’s approach would be ok *if* he always included caveat that the results do not necessarily apply to the entire globe because the entire globe was not included in the test. However, this test is being promoted as a ‘better’ way to test the model’s ability to predict GMST. That is why I am objecting. The test does not tell us anything about the model’s ability to predict GMST unless one *assumes* that the models accurately predict the unknown polar temperatures.
Ryan,
I only used Fourier analysis as a way to illustrate the nature of the problem (i.e. the individual grid cell temps are simply components of the GMST and do not necessarily have physical meaning in themselves).
Raven–
Obviously, the test would be to see if the models can predict the temperature over the regions actually measured. It’s almost the complete globe but not the entire globe.
It’s just as much “GMST” as HadCrut is GMST. So, either we admit neither is GMST or take liberties and call both GMST.
The alternative to recomputing the surface temperature limited to regions covered by HadCrut is to compare projections for the complete globe to incomplete measurements. If the two differ, someone can suggest the issue is that the measurements are over different areas. So, Chad’s proposed method resolves that issue. It’s not in Chad’s power to put thermometers at the north pole, now or retroactively, so this is pretty much Chad’s option if he wants to remove the difficulty that measurements don’t include the north and south poles.