In comments, Sven and Arthur have been arguing about which of the 3 (count them, 3) observational data sets reporting surface temperatures is the “outlier”. Sven votes for GISTemp; Arthur votes for HadCrut. I decided to plot each and perform a t-test on the observational data sets for various time periods. Because the data sets are supposed to measure identical things, I performed the t-test on the differences: highest vs. median and lowest vs. median, testing various start years.
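For readers who want to try this themselves, here is a minimal sketch of the sort of test described, assuming two monthly anomaly series already aligned on the same dates. The data below are synthetic stand-ins, not the actual GISS/NOAA/HadCrut series:

```python
import numpy as np
from scipy import stats

def trend_difference_test(series_a, series_b, dt=1/12.0):
    """OLS trend fit to the difference of two anomaly series, plus a
    t-test of whether that trend differs from zero (white residuals assumed)."""
    diff = np.asarray(series_a) - np.asarray(series_b)
    t = np.arange(diff.size) * dt              # time in years for monthly data
    result = stats.linregress(t, diff)
    return result.slope, result.stderr, result.pvalue

# Synthetic stand-ins for two observational series sharing one true trend
rng = np.random.default_rng(0)
t = np.arange(360) / 12.0                      # 30 years of monthly anomalies
set_a = 0.017 * t + rng.normal(0.0, 0.1, t.size)
set_b = 0.017 * t + rng.normal(0.0, 0.1, t.size)
slope, stderr, p = trend_difference_test(set_a, set_b)
# A large p here would mean the two trends are statistically indistinguishable
```

Because the shared trend cancels in the difference, any trend left over is attributable to measurement differences plus noise.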
The figure below compares the trends since 1980.
I fit a trend line to the difference between GISS and NOAA/NCDC and repeated the comparison for HadNH_SH vs. NOAA/NCDC. If I assume the residuals to this fit are white, the trend in this difference is not statistically significant. Heck, I compared GISS to HadNH_SH and the difference in those two trends is not statistically significant either.
Below, you can compare trends since 1950.
I fit a trend line to the difference between GISS and NOAA/NCDC and repeated the comparison for HadNH_SH vs. NOAA/NCDC. If I assume the residuals to this fit are white, the trend in this difference is not statistically significant. (Note: the difference in trend between HadNH_SH and GISS is statistically significant. That exhausts the number of tests one could possibly do. Bear in mind, it’s not quite kosher to pick the two extremes and pretend your selection was random when applying a t-test. Also, I haven’t tested the residuals to see if they are white.)
Perhaps Sven or Arthur have some basis for suggesting one of the observational data sets is an outlier. I don’t see it. You know what looks like an outlier to me? The multi-model mean.
Update: Sven clarified saying: “I was talking about approximately the last decade and the record year of 1998.” I then focused on the period from Jan 2001-now. I compared the two extremes (Hadley and GISS) to the median (NOAA/NCDC). I computed the trend in the difference and estimated the uncertainty intervals using the red-correction to white noise. Here are the trend and 2 σ uncertainty intervals.

By this assessment the difference in trend between NOAA and GISS is statistically significant and the difference in trend between NOAA and Hadley is also statistically significant. What this means… I do not know. I haven’t looked at the spectral properties any further.
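For reference, one common way to apply the “red correction” mentioned above is to inflate the white-noise trend uncertainty using the lag-1 autocorrelation of the residuals (the effective-sample-size approach). This is a sketch under that assumption; I don’t know the exact correction used in the post:

```python
import numpy as np

def red_corrected_trend(y, dt=1/12.0):
    """OLS trend with a 2-sigma interval inflated for AR(1) ('red') residuals
    using the effective-sample-size factor n_eff = n*(1 - r1)/(1 + r1)."""
    n = y.size
    t = np.arange(n) * dt
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]    # lag-1 autocorrelation
    se_white = np.sqrt((resid @ resid) / (n - 2) / np.sum((t - t.mean()) ** 2))
    inflate = np.sqrt((1 + r1) / (1 - r1)) if r1 > 0 else 1.0
    return slope, 2.0 * se_white * inflate

# AR(1) noise around a known small trend, mimicking a 'difference' series
rng = np.random.default_rng(1)
eps = np.zeros(120)
for i in range(1, eps.size):
    eps[i] = 0.5 * eps[i - 1] + rng.normal(0.0, 0.05)
y = 0.01 * np.arange(eps.size) / 12.0 + eps
slope, two_sigma = red_corrected_trend(y)
significant = abs(slope) > two_sigma   # does the 2-sigma interval exclude zero?
```

With positively autocorrelated residuals the interval widens, which is why a trend that looks significant under a white-noise assumption can become insignificant after the red correction.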
Hmmmm, are we arguing about anomalous anomalies??
I’m too busy right now to look it up, but as I recall, one of the issues we found in the ClimateGate files was discussions in emails of how the three groups were feeding back to one another the raw data sets they were using; the effect was that the data selection was somewhat interdependent. If so, I’m not sure how much information content the correlation has.
Charlie–
There is much information content in the difference. The differences are very small. But there are people on other threads saying that one or another of these nearly indistinguishable sets is an “outlier”. They are all very close. There are only three. How can anyone decree one of the three an “outlier”?
lucia, you are too quick to dismiss the “outlier” issue.
Commenters here pick the flysh*t out of the pepper on data issues like no one else and know that many quatloos may change hands over a few tenths of a degree. It matters. Maybe not for the straights but for the climate-obsessed, it matters.
Hi Lucia – my only basis for the claim was the unusual warmth of 1998 in the Hadley series – perhaps due to their use of a different sea surface record, it may be more sensitive to ENSO variation? I’ll see if I can track down any more info on that.
But yes, I agree, over the long term, the three series are almost indistinguishable, no one of them an outlier.
George–
The term “outlier” implies that one member of a group is well outside the range consistent with the other members. Currently, the discrepancy between the three observational sets is consistent with measurement uncertainty as indicated by the residuals in the differences. So, even if Tilo is correct about the reason for the difference, that mechanism has not made enough of a difference to make any difference in trends distinguishable from noise.
Of course, if Tilo is correct, the discrepancy may eventually stand out from the noise, but it doesn’t yet. Likewise, for Arthur: whatever differences we are seeing are entirely consistent with “noise”. So, calling something an “outlier” just doesn’t seem warranted.
My general guess is that owing to different choices in treating the poles, the different sets span a range of values near the extremes of what different analytical choices might produce. But I couldn’t begin to prove this, and Zeke knows more and may well suggest that my guess is incorrect and that we could get two more extreme data sets even making a reasonable set of choices. But even if I’m right about GISTemp representing an extreme that in some sense maximizes the spread of temperatures we see, I don’t think the two sets representing the extremes can be called “outliers” when they return trends whose difference is statistically indistinguishable.
I think BEST’s work will probably indicate that HadCru is the odd man out, if BEST gets anything close to what JeffId got.
Steven–
It may. But until it does we don’t know!
“Perhaps Sven or Arthur have some basis for suggesting one of the observational data sets is an outlier.”
In the big picture HadCru and GISS tracked fairly well together. From around 97-98 they have been diverging. From that time HadCru seemed to agree with the satellite data better than GISS. If you plot a longer time period you will hide, to some extent, the more recent divergence.
Steven Goddard has an article up about the ice sheets melting “faster than expected”. Anthony has an article about CU sea level data. So I just did something that I’ve been wanting to do with the CU data for some time: I split it into two equal chunks and plotted it. Then I ran a trend line through each chunk. It shows that the rate of sea level rise for the more recent chunk is less than for the earlier chunk. So if we have accelerating ice sheet melt, and if Trenberth’s missing heat is going to the deep oceans, then why isn’t the rate of sea level rise reflecting it? Where is that extra water from the ice sheets going, and why isn’t the heat in the deep oceans causing expansion? Why does it seem to be reflecting just the opposite?
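The split-and-fit exercise described above can be sketched in a few lines, with synthetic data standing in for the CU altimetry record (the trend values below are invented for illustration):

```python
import numpy as np

def half_trends(y, dt):
    """Split a series into two equal halves and return the OLS trend of each."""
    half = y.size // 2
    t = np.arange(y.size) * dt
    early = np.polyfit(t[:half], y[:half], 1)[0]
    late = np.polyfit(t[half:], y[half:], 1)[0]
    return early, late

# Synthetic sea-level series (mm) built to decelerate: 3.5 mm/yr then 2.5 mm/yr
rng = np.random.default_rng(2)
t = np.arange(0.0, 18.0, 0.1)                 # ~18 years of altimetry-like samples
level = np.where(t < 9.0, 3.5 * t, 3.5 * 9.0 + 2.5 * (t - 9.0))
level = level + rng.normal(0.0, 2.0, t.size)
early, late = half_trends(level, 0.1)
# By construction, early exceeds late, mirroring the claimed slowdown
```

Of course, whether the two half-trends differ by more than their combined uncertainty is a separate question; eyeballing two fitted slopes is not a significance test.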
I’ll get around to putting my chart on my web site later.
The real outlier seems to be the average of the models. Who’d have imagined?
Lucia, with trends from 1980 and 1950 you are measuring different things to what we were talking about. I was talking about approximately the last decade and the record year of 1998.
And I don’t really buy the reasoning behind removing the satellite records from the comparison. I think the satellites, though reacting more strongly to ENSO, have much larger and smoother coverage and are a pretty good way to check the surface data. Quite interestingly, they were also quite well in sync with all the surface data sets until the point where GISS started diverging, and until then there did not seem to be any issue of regarding them as totally separate. It seems to me that the reason for “throwing them out” is that HadCrut, RSS and UAH have all been pretty much on a “plateau” since the end of the 1990s, and this seems to be an “inconvenient truth”?… And, after all, according to the theory, is the lower troposphere not supposed to show the greatest warming?
Sven–
The troposphere is not the surface. While one might argue the trends ought to be close, it’s a different thing. So, trying to identify “outliers” for the surface by looking at the troposphere doesn’t work.
Not using them to identify the “outliers” isn’t the same as “throwing them out”. We get information from the satellite sets. It’s just that we can’t use them to determine whether a surface set is an “outlier”.
Oddly, if we apply a t-test to the trends since 2001 treating the residuals as white, Hadley differs from NOAA and GISS differs from NOAA. So one might decree both outliers. But actually, more work is required, as it may simply be that the short-term errors also exhibit some autocorrelation.
I haven’t looked at that any further. Have you? If you haven’t, then you need to do more work before saying you have identified an outlier.
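A crude version of the whiteness check being discussed: detrend, then compare the residuals’ lag-1 autocorrelation against its approximate 2/sqrt(n) significance threshold. This is a rule of thumb, not the only possible test:

```python
import numpy as np

def residuals_look_white(y, dt=1/12.0):
    """Detrend a series and check whether the lag-1 autocorrelation of the
    residuals falls within ~2/sqrt(n) of zero (a crude whiteness screen)."""
    n = y.size
    t = np.arange(n) * dt
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
    return abs(r1) < 2.0 / np.sqrt(n), r1

# Strongly autocorrelated (AR(1), phi = 0.7) input should fail the screen
rng = np.random.default_rng(3)
red = np.zeros(500)
for i in range(1, red.size):
    red[i] = 0.7 * red[i - 1] + rng.normal(0.0, 1.0)
looks_white, r1 = residuals_look_white(red)
# For this input, looks_white comes back False and r1 is near 0.7
```

If the screen fails, a white-noise t-test on the trend will understate the uncertainty, which is exactly the caveat raised above.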
It’s true the measurements contain measurement errors, which is noise above and beyond “weather noise”. But “outlier” is a strong term, and we just don’t have much evidence for that yet.
Tilo Reber (Comment #73904),
The reduction in the rate of sea level rise is probably due mainly to a change in the rate of ocean heat accumulation, which has been very much slower in the last ~8 years (less thermal expansion). Trenberth’s “missing heat” is in fact missing because it has not shown up in the ocean; at least it does not appear in ARGO measurements, nor in published estimates of deep ocean heat uptake. The melting of floating ice (the declining summer ice cover in the Arctic) ought to have no significant impact on sea level. The rate of melting of land-supported ice (both mountain glaciers and ice sheets), which can contribute to sea level rise, is much less certain in magnitude, but probably does continue to contribute significantly. The increased use of irrigation over the last 75 years almost certainly has also contributed a significant amount to sea level rise; some estimates suggest that up to 20% of the recent rise can be attributed to irrigation.
Lucia, we might be talking past each other. There seems to be a need to define again what I mean by “outlier”. Outlier for what, on which indicator? I don’t consider GISS an outlier overall. As I said, out of the four data sets that I was talking about, GISS was the only one a) showing 1998 as only the 6th warmest and b) showing a strong warming trend since 2001. And on this particular issue I noted GISS to be an outlier.
Though, as I was reminded, I wrongly discarded NCDC, which is more on GISS’s side, so at least partly I was wrong (as I was also in my presumption about the origins of the raw data)…
Sven–
I just think you are placing too much significance on noise without doing much analysis to figure out if these differences are entirely consistent with the sort of measurement uncertainties we expect.
How is 1998 being only the 6th warmest markedly dissimilar from the others? I don’t see that as making anything an “outlier”. It’s not as if GISS didn’t show the 1998 spike; it’s just that it shows bigger ones later. There is maybe a small amount of evidence that NOAA/GISS and HadCRUT are diverging a little. But someone has to sit down and see what they think of the noise.
I did, btw, just look at this

These are trends fit to the difference between NOAA and the other two. The red-corrected 2σ uncertainty intervals do not contain ‘zero’, which suggests that both Hadley and GISS are “outliers” relative to the ‘median’ defined by NOAA alone. But I haven’t done anything to figure out if “red” is adequate.
But by this analysis, both are outliers. So one can’t just decree GISS an outlier while saying Hadley is not an outlier. (It’s still weird to discuss outliers when we have only 3. But… still…)
But even if they are, this doesn’t tell us which one might be right and which wrong! Because GISS does account for the poles (rightly or wrongly), its differing from the others could mean either:
a) Tilo is right. The melting results in noticeable bias in trend over decadal scales, or
b) The poles really are warming faster and so the GISS trend is correct.
We can’t know.
Lucia, Sven,
Seems to me the real utility of comparison with the satellite trends is that GISS does show a greater divergence from the satellite data than do the other surface based trends. In light of the difference between how GISS and the other land trends treat sparse coverage regions in the north (1200 Km interpolations, etc), the greater divergence of the GISS trend from the satellites is at least good reason to closely examine the kinds of assumptions that go into the GISS method. Perhaps the GISS method is rock solid, the other surface trends are mistaken, and the divergence represents a real change in the pattern of warming in recent years (more 2-meter altitude surface warming and less tropospheric warming at high latitudes), but perhaps not. It is for sure the kind of divergence one would not expect…. it lights up my BS sensor.
Lucia,
We cross posted. I wouldn’t have bothered had I seen your most recent post first.
SteveF–It would certainly be nice if we had a station at the North Pole reporting promptly and monthly. Even one thermometer would reduce the danger of interpolation of a sort that contains an element of extrapolation. (The interpolation is based on measurements from regions farther south.) Unfortunately, ice moves and sometimes melts, so that thermometer would tend to move. So I guess they would need GPS to keep track of its location.
Sven: “And, after all, according to the theory, is lower troposphere not supposed to show the greatest warming?”
Right. I was going to bring that up. I believe that according to CO2 theory, what UAH and RSS are measuring should be showing more warming than what the surface stations are showing. The fact that it’s the other way around – especially for GISS – makes the surface records even more suspect. Throw in the fact that most proxy records, as far as they go, cannot keep up with the level of rise shown by the surface instrument records, and it gives you an extremely low level of confidence in GISS, which, by the way, I also consider an outlier.
Concerning NCDC, I was trying to find a chart of theirs last night that I once saw that showed what all of their adjustments were. Needless to say, they were virtually all upward. I couldn’t understand why everything yielded upward adjustments. For example, different stations changed the time when readings were taken. Seemingly this would yield some stations being adjusted up and some down. But instead, the overall effect was a significant upward adjustment for changes in reading time. It’s one of those things that makes one uncomfortable. I’m hoping that someone will look at that more closely.
Lucia: “While one might argue the trend ought to be close, it’s a different thing.”
What argument would you make that the trend should be different Lucia? And if you do have an argument, does it say that the satellite trend should be higher or lower? I’m certainly not willing to throw out UAH and RSS when judging for outliers. In fact, when you look at all of the diverse elements that go into making a surface temperature record, from individual station placement, to UHI, to half a dozen sources of adjustment – all of which are possible sources of error – I trust the satellite trend far more than the surface station trend.
Tilo–
You just made one yourself.
On to….
I’m just saying the two are different. It’s true that with sustained warming, theory suggests the TLT should warm faster, and it’s not. That’s independent evidence that people are having difficulty predicting what happens. But it doesn’t contradict my observation that the TLT and surface temperatures measure different things. The two are different.
I think you are too quick to diagnose “outlier”. It may well be that GISTemp will turn out to be wrong, but I don’t think you can judge it an “outlier” yet.
Lucia: “a) Tilo is right. The melting results in noticable bias in trend over decadal scales or
b) The poles really are warming faster and so the GISS trend is correct.
We can’t know.”
I don’t think that it’s one or the other, Lucia. I’ve always believed that there was more warming at the poles. It makes sense that areas where CO2 makes up a greater portion of the total greenhouse gas concentration would show more warming. But when I looked at the GISS divergence since 1998, it just seemed improbable for such a small physical area to introduce that much of a change in the entire global reading. Then the difference in readings for some polar cells that both HadCrut and GISS covered led to further suspicions. We don’t have to speculate about whether the retreating sea ice effect that I’m talking about is real. You only need to go to the temperature records for the coastal Arctic stations and look at what happens when the ice retreats. And the satellite photos from Cryosphere Today can show you when that happens. You can also look at the anomalies for those stations in months of years when the ice retreats early and in months when it retreats late.
Lucia: “The two are different.”
Acknowledging that they are different, the essential piece of information that I care about from them is something for which that difference should make no difference – and that is the trend. And insofar as there could be a difference in the trend, it is, in fact, the reverse of what it should be. This points to GISS being even more of an outlier than you would expect by just comparing the trends of GISS against UAH and RSS. Again, just for clarity, I’m only talking about the trend since 1998.
Tilo–
Are you saying you think the trends for the TLT and surface should be the same? My impression is that they are expected to differ somewhat.
I keep reading your physical explanation of the deviation. But that’s not germane to the point I am making. It’s fine to explain that you think GISS is somehow incorrect, or temporarily biased high relative to the earth’s trend. I get that. I’m neither endorsing nor disputing that issue. So set that physical explanation aside for now and let’s talk about what we think the term “outlier” conveys as a word.
I think “outlier” means something, and I don’t think it means what you intend to convey. It does not mean “is wrong for identifiable reasons”. This is why your physical explanation of why you think it is wrong is beside the point relative to my saying GISS does not appear to be an outlier.
So here’s what matters: if we take the 3 surface trends (because those are supposed to measure ‘the same thing’) as a collection of data, considering it as just a set of data and not considering physics, the GISS trend is on the high side. But for the most part, the GISS trend appears to be consistent with measurement errors or residuals in the various sets. Or, if you think otherwise, you have to do more work to show that the GISS trend falls outside the range of the fuller data set. This isn’t something you can just wave away. The reason you have to do this to call it an “outlier” is that that’s what outlier means.
If you use a method like the one in the updated graph I posted, then Hadley is also an “outlier”, which makes 2 out of 3 observations “outliers”. This is not a meaningful use of the word “outlier” and suggests there is a problem with that method. (There is a problem: ordinarily, one doesn’t run around trying to find statistical outliers when one has only 3 data sets. You have too much uncertainty in the mean and probability distribution to identify an “outlier”.)
Lucia: When I use the word “outlier”, I don’t use it relative to one method of measuring surface temperature. I use it relative to the question, “How much is the surface of the earth warming?” The fact that the three surface data sets use 90% of the same station data is actually a deterrent to determining an outlier. I regard the fact that both data and method are different for the satellite records as a good thing. The more physical and procedural independence the better. Yes, we cannot ignore the fact that there exists a theory that says that the satellites should show more warming than the surface temperature. But since the reverse is true, that means that the only theory to justify a difference in trends points to GISS being an even bigger outlier.
And there is yet another independent source of data about what is really happening to the global surface temperature – and that is the proxies. Now take a look at the hockey team’s chart. Notice that there are no proxies that can match the rise shown by the instrument record.
http://en.wikipedia.org/wiki/File:1000_Year_Temperature_Comparison.png
And keep in mind that proxy series are actually selected based on how well they match the instrument record – mostly by members of the hockey team. In other words, a random sampling of series would be even further below the instrument record. So, yes, I want to consider all of the available information that we can lay our hands on, and that information tells me that Hansen is flying too close to the sun.
Tilo
The problem is that I think this is contrary to the usage of the word “outlier”. So using the word that way results in confusion because people are likely to assume you are using the word according to its traditional definition. We have a perfectly good word for the feature you are describing, it’s called “biased” or “erroneous”. If you start saying “outlier” when you mean “biased”, then you are going to have to coin a new word to describe what “outlier” has been used to mean (and which is actually still its meaning in statistics texts as far as I can determine.)
I swear… I’m going to have to ban the use of the word “outlier” if people are going to start using it interchangeably with “biased” or “in error”.
Lucia:
Wiki: “In statistics, an outlier[1] is an observation that is numerically distant from the rest of the data. Grubbs[2] defined an outlier as:
An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.”
In this case, the samples, as you would like to use them, are surface instrument data sets.
The samples, as I would like to use them, are data sets that tell us about how much warming the surface is experiencing.
Tilo:
GISTemp is not numerically distant from the rest of the data. The rest of the data are: NOAA and HadCrut.
The other members of the sample of observations are NOAA and HadCrut. GISTemp does not deviate markedly from these.
RSS and UAH tell us how much the lower troposphere is warming. Anyway, even if they weren’t measuring a different thing, have you shown GISS differs “markedly” from them?
Tilo–
If I test the trend in the difference between Hadley and UAH, the 1980-now trend is statistically significant. This suggests what we all believe: UAH measures something different from Hadley.
A few thoughts:
1) I don’t know if the recent revision to RSS has changed this, but there are (or were) large zonal differences between UAH and RSS. UAH has much lower tropical trends and larger high-latitude trends, whereas RSS shows the opposite.
2) There are at least 5 analyses of MSU/AMSU data and UAH and RSS show the least amount of warming.
3) There is a large (warm) step change in HadSST2 in 1998 (and thus present within HadCRUT) due to the merging of two data sources. This is apparent after you take the difference between HadSST2 and any of the other SST analyses.
cce–
I’m aware of the step change. That could explain why Hadley might have the highest trend since 1980 and/or 1950 (if it were higher): the step change could raise the temperatures near the end relative to the beginning.
But it appears people are focusing on divergence after 1998. Is there any reason the step change in 1998 would result in low or negative trends for start years after 1998? I can’t think of one, but maybe there is information I am missing.
Since the global temperature is only good to within a couple of degrees per CRU’s own error models, why do you care if the noise in the 3 data sets is correlated or not? Since they all use a significant number of the same modified raw data, how could the noise not be correlated?
Does anyone really think we know the global temp to 0.05°C year over year?
Has anyone ever proved we do? Even the satellite data would not make that claim.
Lucia,
I suspect much of the difference can be found by comparing the proportion of land to ocean in each analysis. GISTEMP and NCDC have more land in proportion to ocean, and since land warms faster than ocean, those indices will warm faster. That would be the case even if the data for each were the same and the only difference between the three was interpolation.
FWIW, I compared HadSST2 to Reynolds OIv2 for the time periods they overlap. The step change appears to happen in June 1997. HadSST2 is warming slightly faster than Reynolds until that month, so that was partially offsetting the faster GISTEMP (land) warming. The difference is about 0.02 degrees per decade which is enough to obscure a difference in warming over land about twice as high.
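The kind of step-change hunt described here (differencing two SST analyses and looking for a mean shift) can be sketched like this, with synthetic data in place of HadSST2 minus Reynolds OIv2:

```python
import numpy as np

def find_step(diff, min_seg=24):
    """Scan candidate breakpoints and return the index giving the largest
    mean shift between the two segments of a difference series."""
    best_i, best_shift = min_seg, 0.0
    for i in range(min_seg, diff.size - min_seg):
        shift = abs(diff[i:].mean() - diff[:i].mean())
        if shift > best_shift:
            best_i, best_shift = i, shift
    return best_i, best_shift

# Synthetic difference of two SST analyses with a 0.1 C step at month 210
rng = np.random.default_rng(4)
diff = rng.normal(0.0, 0.03, 360)
diff[210:] += 0.1
month, shift = find_step(diff)
# month lands at (or very near) the imposed break; shift approximates 0.1 C
```

This brute-force scan is the simplest possible breakpoint estimator; proper change-point methods would also attach a significance level to the detected shift.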
AJStrata
Depends what you call noise. Other than that, you seem to be arguing by rhetorical question (e.g. “why do you care if the noise in the 3 data sets is correlated or not?” ) I have no idea what point you are trying to make. If you suggest answers to your rhetorical question, I might know what you are trying to tell me.
Do you mean the differences in the trends from GISTEMP, NCDC and HADCRUT? If some have an inappropriate amount of land (or ocean), those observations would differ from the true global average – yes. But right now, the models are warming faster, not slower, than the observations suggest.
But also– this isn’t related to the step change in SST issue, right? So I don’t see how this would affect the discussion of “outliers” which people seem to suggest is something that happened after 1998.
If HadCRUT has a greater ocean-to-land ratio than GISTEMP or NCDC, which I think it must, it would warm more slowly than those even if the underlying data were identical.
In the case of GISTEMP specifically, prior to the step change, HadSST2 was warming faster than Reynolds OIv2. Since the step change (i.e., different data), they are about the same. So, I think the near “sameness” of earlier trends was due to the faster-warming ocean offsetting the smaller land proportion. I think this is something that could be tested (Zeke?).
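The weighting argument above can be illustrated with a two-line arithmetic sketch. The land fractions and trends below are invented for illustration, not the actual weights of any data set:

```python
# Identical underlying trends, different land fractions, different global trends
trend_land, trend_ocean = 0.25, 0.12       # C/decade, purely illustrative
more_land = 0.33 * trend_land + 0.67 * trend_ocean     # GISTEMP/NCDC-like weights
more_ocean = 0.28 * trend_land + 0.72 * trend_ocean    # HadCRUT-like weights
# more_land > more_ocean even though land and ocean trends are identical
```

So a few percent difference in land fraction alone is enough to spread the indices by a couple hundredths of a degree per decade, before any data differences enter.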
With respect to the models, do you remove the effect of the solar cycle since January 2001 from observations? Because I doubt the models include a future solar component.
cce–
The observations are the observations, not corrected observations. Most models froze the sun in the future even when they included it in the past. Why they elected to do this instead of continuing an 11-year cycle, I do not know.
The effect of solar cycle is difficult to tease out of the observations and equally difficult to tease out of models even when we know the modelers included it and we have multiple runs!
OK so we have problems with the surface temperature measurements and we have problems with the troposphere measurements.
Neither produces a ‘true’ picture, apparently. So how do we know what the real world situation is?
Dave,
Both have issues, but both have a lot of agreement. The real world is probably similar to any one record (either surface or satellite) +/- 15%.
RE: lucia (Comment #74013)
I believe the 11 year cycle was essentially averaged because thermal inertia of the oceans and estimated lag times of 3-5 years made it relatively difficult to measure in terms of climate change. I generally agree that simplifying the 11 year solar cycle to an average probably makes sense. What was left out, and we are likely experiencing now, are the effects of longer solar cycles that have a much greater measurable effect on climate such as the grand maxima in the last half of the 20th century and the prolonged quiet sun since 2006.
http://solarscience.msfc.nasa.gov/images/Zurich_Color_Small.jpg
Zeke,
Is +/- 15% not a large margin of error then?