Whenever I compare IPCC projections for global mean surface temperature (GMST) to observations, someone suggests I do the comparison a different way. Today, I’ll do the comparison yet another way!
Here’s today’s method
I will:
- Plot the average GMST from 38 model runs from the AR4, with each run "rebaselined" to the average temperature from 1980-1999 for that particular run. (The selection method is described here.)
- Plot observations of GMST based on the average of GISS, HadCRUT, and NOAA/NCDC. Measurements from satellites are not included because the satellites' accuracy during the '80s is in question, making computation of the 1980-1999 average problematic.
- Illustrate 2 C/century passing through a zero temperature anomaly in 1990. (2 C/century represents the trend in the average global mean surface temperature predicted by models during the first two decades of this century. The value is mentioned on page 12 of the summary for policy makers of the WG1 report of the AR4; the linear behavior of the trend can be seen in Figure 10.4 of the WG1 report. Specific numerical values can be estimated from the values in Table 10.5 in Chapter 10. I rounded down a little. Links to report 🙂 )
- Compute the best-fit trend for observations of the GMST from Jan 2001 through July 2008. The best-fit trend will be obtained using ordinary least squares, forcing the fit to pass through a temperature anomaly of 0.22 C in 2001. That choice is consistent with a linear trend of 2 C/century passing through a temperature anomaly of 0 C in 1990. (A rough code sketch of this fit appears just after this list.)
- Compute the uncertainties in the trend, correcting the uncertainty intervals for "red noise" (i.e., I assume the residuals to the trend are AR(1). This is not necessarily a good assumption.)
- Discuss the level of agreement or disagreement between the observations and the projections.
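For readers who want to see the mechanics spelled out, here is a minimal sketch of that pinned fit with an AR(1) correction, written in Python rather than the Excel spreadsheet I actually use. The variable names and the particular "effective sample size" correction are illustrative choices, not necessarily the exact formulas in my spreadsheet.

```python
import numpy as np

def pinned_trend_ar1(t, y, t0, y0):
    """Least-squares trend forced through the point (t0, y0), with the slope
    uncertainty widened for lag-1 ("red") autocorrelation in the residuals.
    A sketch of the general technique, not the spreadsheet itself."""
    x = t - t0                                   # time measured from the pinned point
    z = y - y0                                   # anomaly measured from the pinned value
    m = np.sum(x * z) / np.sum(x ** 2)           # slope of a no-intercept fit
    resid = z - m * x
    n = len(y)
    rho = np.corrcoef(resid[:-1], resid[1:])[0, 1]   # lag-1 autocorrelation of residuals
    n_eff = n * (1 - rho) / (1 + rho)                # effective number of independent points
    s2 = np.sum(resid ** 2) / (n_eff - 1)            # residual variance with reduced dof
    se_m = np.sqrt(s2 / np.sum(x ** 2))              # standard error of the slope
    return m, se_m

# Hypothetical usage with monthly anomalies starting Jan 2001, pinned
# through 0.22 C at 2001.0; gmst_anomaly is assumed, not real data.
# t = 2001.0 + np.arange(len(gmst_anomaly)) / 12.0
# m, se = pinned_trend_ar1(t, gmst_anomaly, 2001.0, 0.22)
# ci95 = (m - 1.96 * se, m + 1.96 * se)   # compare against 0.02 C/yr
```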
Why force the fit through T= 0.22 in 2001?
Before showing the results of the steps outlined above, I need to discuss why I am constraining the trend to pass through a specific point, as I have not done this in the past.
The main reason I am doing this now is that I ran some Monte Carlo tests to check out an idea discussed in comments on an earlier post. When I ran the tests using 90 months of data with statistical properties somewhat similar to those displayed by observations of GMST, I discovered that constraining the fit through a known point specified in the hypothesis can dramatically increase the power of a hypothesis test in instances where the hypothesis is wrong. That is to say: it reduces the probability that we will make the mistake of failing to falsify a null hypothesis that is wrong.
The increase in power occurs with no increase in the probability that we decree a correct hypothesis is false.
All this is a longwinded way of saying: The new method of doing the test appears to increase the probability of getting the correct answer.
Why didn't I do trend analysis this way in the past? There are two reasons. The first is: I didn't know forcing the fit through a known intercept gave the test a lot more power! (I suspected it would have a little more power. But… well, at least for residuals that are "white" or "red", the increase in power is quite large.)
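To give a flavor of what I mean, here is a toy Monte Carlo along the same lines, though not my actual runs: the true trend and noise level are made up, the noise is white rather than red, and the confidence intervals are the naive OLS ones. It only illustrates how pinning the fit through the correct point raises the rejection rate when the null trend is wrong.

```python
import numpy as np

rng = np.random.default_rng(0)

def rejects_null(t, y, m_null, pin=None):
    """True if m_null falls outside a ~95% CI for the OLS slope.
    If pin=(t0, y0) is given, the fit is forced through that point."""
    if pin is None:
        x, z, dof = t - t.mean(), y - y.mean(), len(t) - 2
    else:
        t0, y0 = pin
        x, z, dof = t - t0, y - y0, len(t) - 1
    m = np.sum(x * z) / np.sum(x ** 2)
    se = np.sqrt(np.sum((z - m * x) ** 2) / dof / np.sum(x ** 2))
    return abs(m - m_null) > 1.96 * se

# Toy setup: 90 months, true trend 0.01 C/yr through (0, 0), null trend
# 0.02 C/yr, white "weather" noise with sd 0.1 C.  All numbers illustrative.
t = np.arange(90) / 12.0
free = pinned = 0
trials = 5000
for _ in range(trials):
    y = 0.01 * t + rng.normal(0.0, 0.1, t.size)
    free += rejects_null(t, y, 0.02)
    pinned += rejects_null(t, y, 0.02, pin=(0.0, 0.0))
print("power, free intercept:", free / trials)    # rejects roughly half the time
print("power, pinned fit:    ", pinned / trials)  # rejects far more often
```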
The second reason is that I was concerned that people would quibble over the point I chose to pin the OLS fit.
The AR4 does not specifically quantify the baseline temperature for its projections. Rather, the authors restrict themselves to describing the baseline for anomalies as "relative to 1980-1999" (see, for example, page 762 in Chapter 10 of the WG1 report of the AR4). Readers interested in a numerical value of the baseline shift for the observations, or a precise numerical value of the temperature, would be required to calculate it themselves. Oddly enough, this involves some interpretation of the phrase "relative to 1980-1999".
This slight ambiguity in the meaning of "relative to 1980-1999" permits an analyst to cherry pick methods of determining the "baseline", and this can modify results. How do I interpret "relative to 1980-1999"? I've decided it means the average temperature computed using all months from Jan 1980 to Dec 1999.
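In code, that interpretation amounts to something like this minimal sketch (pandas and the series name are my assumptions; nothing here comes from the AR4 itself):

```python
import pandas as pd

def rebaseline(monthly: pd.Series) -> pd.Series:
    """Subtract the Jan 1980 - Dec 1999 mean from a monthly series indexed
    by dates: one reading of "relative to 1980-1999"."""
    baseline = monthly.loc["1980-01-01":"1999-12-31"].mean()
    return monthly - baseline
```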
Yes, other interpretations of this sort of language have appeared in the literature. Moreover, if we pin the trend through the wrong point, the method does not have more power than the one I usually use. So, if you prefer the method I normally use for that reason, that is perfectly valid. Or if you'd like me to use some other baseline, let me know and I'll do it if the method of determining the baseline is easy to implement.
Now, having “interpreted” the meaning of “relative to 1980-1999”, I will show plots.
Comparison of models to data forcing T = 0.22 C in 2001
Here is a comparison of the observations and projections and an OLS trend forced through T=0.22 C in 2001.

If we examine the plot we see:
- The average of the model projections, shown in red, lies outside the 95% confidence intervals for the range of trends consistent with the earth data, shown in lighter yellow. That is to say: the average projection would be rejected as inconsistent with the observations of the earth's GMST on this basis.
- The observations of the earth’s GMST still falls within the uncertainty range of the model projections.
What does this mean? It means:
- The average temperature projected by the models is hotter than the current observed temperature. This is based on the plain observation that the temperature for August falls below the red curve representing the average of all model runs. No funky statistics required!
- If we imagine the earth's "weather" to be a single weather realization of a physical process that is the result of an unknown deterministic trend, "m", and AR(1) "noise" (a sketch of this noise model appears just after this list), then there is a 95% probability the unknown trend lies between those pale yellow lines. Note that 2 C/century falls outside that range. So, we reject 2 C/century as a hypothesis based on this test. That is to say: though we started out assuming it was true, we now treat it as false at 95% confidence, but
- The current GMST and all values shown in the graph fall inside the 95% confidence range of GMST predicted by the models in any month.
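For concreteness, one standard way to write the noise model in item 2 (see the note in that item) is a linear trend plus AR(1) noise; this is a sketch of the usual setup, not necessarily my exact spreadsheet formulas:

$$T_t = m\,t + b + \varepsilon_t, \qquad \varepsilon_t = \rho\,\varepsilon_{t-1} + w_t ,$$

where $w_t$ is white noise. The confidence interval on the fitted slope $\hat m$ is then widened for the autocorrelation, for example by replacing the sample size $n$ with an effective value $n_{\mathrm{eff}} \approx n\,(1-\hat\rho)/(1+\hat\rho)$ when estimating the residual variance.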
On the basis of 3, some might suggest the average temperature predicted by the models should not be rejected, treated as false, called "unskillful", or said to be incorrect. However, in my opinion, results 1 & 2 are vastly more important than 3.
Of course, I’ll explain why I think that! 🙂
Why is the comparison of trends more important?
Points 1 & 2 are more important because, used as a statistical test, method 3 has very little statistical power.
In statistics, the "statistical power" of a test describes the probability that we will correctly detect that an incorrect null hypothesis is wrong.
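In symbols, the usual definition is

$$\text{power} = \Pr(\text{reject } H_0 \mid H_0 \text{ is false}) = 1 - \beta ,$$

where $\beta$ is the probability of a Type II error: failing to reject a null hypothesis that is actually wrong.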
If you are going to argue at blogs, or interpret whether a scientific theory is consistent with data, it is important for you to understand that some tests have higher statistical power than others. The reason you need to understand this is that you will wish to avoid fooling yourself (or others).
So, how might you fool yourself? Well, suppose you fervently wish to believe your favorite theory but it seems to be contradicted by observations. Human frailty may lead you to hunt down statistical tests that don’t prove your favorite theory wrong.
In your search, you will inevitably stumble across tests with very little power. As contradictory data accumulate, if you look hard enough, you will probably find the “perfect test”: one with nearly 0% statistical power.
If you don’t understand statistical power, you will apply the test. Your theory won’t be proven wrong! You may rejoice at this splendid result.
If you truly haven’t a clue about statistical power, you may insist others must use your powerless test as the “one true test”. If they know nothing about statistical power, they may even believe you.
Unfortunately, your bubble will eventually be burst. It turns out that, in science, engineering, investing and life in general, we tend to prefer statistical tests that give the correct answer at the highest possible rate. For this reason, the convention is to prefer the result obtained using the test with the higher power at a specified confidence level.
It just so happens that comparing the projected trend to the observed trend has higher statistical power relative to the other test.
Now that I've claimed the test comparing trends has higher power than the test where we check whether the data fall outside the confidence intervals for "weather noise", I'm going to have to back that up, right? Well, even though I have a backlog of at least two future blog posts promised to be posted "when time allows", I will now bump the number of "promised" posts up to three, and promise to discuss the relative power of using criterion "3" to "falsify" relative to using criteria 1 or 2.
I’ve written an EXCEL spreadsheet to accompany this promised post, and I’ll try to write it up tomorrow.
Lucia,
Your comment about the low statistical power of the test 3) is interesting because that is the test that the IPCC uses to make its claim that the warming since 1960 is most likely the result of GHGs.
When I eyeball the chart in AR4 Chapter 9 (http://rankexploits.com/musings/wp-content/uploads/2008/08/figure95-ipcc.jpg), I conclude that the observations from 1910 to 1940 fall inside the 95% confidence range of GMST predicted by models. However, the 30-year trend from 1910-1940 likely lies outside the range of trends predicted by the models.
That said, I am not sure how to apply your approach to validating hindcasts. I.e., does the failure to reproduce a 30-year trend over a single interval demonstrate that the hindcasts have no skill? I see both sides of the argument. On one hand, this is a clear case of cherry picking the most adverse interval. On the other hand, 30 years is supposed to be climate and not weather, so I would expect a skillful model to correctly hindcast the 30-year trends 19 times out of 20.
In this case, it looks like as many as 10 out of 80 of the 30-year trends would be outside the range predicted by the models, and one could conclude that the hindcasts do not skillfully reproduce the 100-year temperature record.
I hadn’t heard that before. Where did that come from? Drift, calibration?
I realize this isn’t directly relevant to what you’re doing but
what is the trend of the average of the model runs over the whole span?
How about the averaged GMST?
BarryW– Maybe it's better to say "have been in question"? The satellite measurements were corrected relative to the reported values during early periods of use. They may be fine now, but I prefer not to rely very much on anything during the initial phase of use. So, I'm fine using the satellite data now, but prefer to avoid using it when the test actually requires no drift over the whole span.
Do you mean what is the trend from 1980-now? Or since 1880? What’s the full span? Since 2001, the trend from the models is very close to 2C/century. You can see a few wiggles in the “red” curve, so it varies a little, but those wiggles are small. The average over all models really does have a linear trend of 2C/century during the periods I’m testing.
Raven: Power is specifically related to failing to reject a hypothesis that is actually wrong. The standard of "stayed within the yellow" in the graph you link results in a weak test. The already low power is further reduced because the data were available prior to the creation of the models, and some of the features of the data resulted in people creating hypotheses that are now incorporated into the predictive models!
As a practical matter, I don't know any better way to develop the models, or decide on forcing scenarios. But the problem of circularity still remains.
Eyeballing the graphs, it appears the '30s may have managed to stay just inside the span of model projections — but I'm not sure. Given the correlation in temperatures, you would expect that once GMST data strays outside the 95% bounds, it could stay outside a little while. If the model span really is "weather noise" (which I doubt), the data should poke out about 1/20th of the time.
At some point, I may recreate that graph with results from each model colored in. You'll see that despite the appearance of weather trajectories that cross each other, the reality is the traces for each model tend to "cluster". Some models predict more warming since 1880; some predict less. You don't need fancy statistics to show it– just picking similar colors for each trace is enough.
1) What if one's favorite theory is that "climate change" is not limited to warming but means a generally more dramatic climate–more big highs and big lows–such that more weather noise and more outcomes at odds with the models prove that the generically concerned thinking behind the models was right after all?
It’s not my favorite theory but the malleability of clinical catastrophism is well-documented.
2) In the alternative, since everybody already knows the real trend in degrees per century we could use a serially adjusted uncentered pseudo-principal component analysis to filter out misleading anomalous data and reveal the real climate trend rather than get bogged down with large sets of actual data.
George–
On the “higher high-lower lows” front…. I’ve read that. But I can’t test anything that isn’t quantitative. Also, I prefer to test IPCC stuff because it’s supposed to be consensus.
On 2… Heh. Oh.. did you read SteveM's post? Looks like Tamino got a public spanking for his final article on principal components. Ouch!
lucia:
The collapse of part of Tamino’s defense of Mann was indeed the inspiration for that part of my post. I noticed that in lieu of unleashing his considerable math skills to build a defense of the new hockey stick, Tamino simply pasted passages from Mann et al and nodded approvingly in between block quotes. That departure from his usually more substantive MO was itself a kind of red flag.
Steve McIntyre’s takedown of the new stick is astoundingly detailed. The release of a new hockey stick paper must have been like Christmas morning and a birthday rolled into one for Steve–he is clearly having fun.
For me, the graphic display of wildly disparate data that somehow became a hockey stick was quite striking all by itself.
I meant the trend of the span (1980 til now) of the GMST observations (GISS, etc.) that you were using in your analysis. I was curious as to how that trend differed from the 2 C/century trend. It would seem to be lower than the models' since there is a divergence over the 21st century, but I was wondering how much.
Speaking of trends, you showed the trend based on the average of the models. What is the distribution of the trends of each model run? I suppose you could interpret that as either from 1980 or from 2001. I'm not sure if either or neither is relevant, but would that provide anything useful? Another way of phrasing it might be: do any models' trends fall within the observations' confidence interval on the same time scale?
BarryW–
The trends across individual runs are all over the place. I get the same result Gavin gets in his RC post: there is a standard deviation of about 2.1 C/century.
The model trends do not fall within the observation’s confidence intervals for the same time scale. No.
I'll be posting a comparison of the variability of 8-year trends in models to data from 1914-19?? (I don't remember the precise year). I'll be using an integral measure. If we compute the variability of 8-year trends for individual models by restarting each month and summing over all trends, by that measure the models have more variability than the earth.
So, basically, by measures relevant to hypothesis testing, the models have more variable “weather” than the earth– at least during “volcano free” periods.
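To make "restarting each month" concrete, here is a rough sketch of the kind of calculation I have in mind, written in Python; the exact integral measure I end up using in the post may differ.

```python
import numpy as np

def rolling_trends(y, window=96):
    """OLS slope over every 'window'-month stretch, restarting each month.
    y is a 1-D array of monthly anomalies; slopes come out in units per year."""
    t = np.arange(window) / 12.0
    x = t - t.mean()
    denom = np.sum(x ** 2)
    slopes = []
    for start in range(len(y) - window + 1):
        seg = y[start:start + window]
        slopes.append(np.sum(x * (seg - seg.mean())) / denom)
    return np.asarray(slopes)

# One measure of "8-year weather variability": the spread of these slopes
# for each model run versus the same statistic for the observations, e.g.
#   np.std(rolling_trends(model_run))  vs  np.std(rolling_trends(obs))
```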
George T–
I’m reading SteveM’s stuff.
I know nothing of PCA. I chatted with Jean S a bit by email, and I wondered a bit about the “meaning of the origin” business. This “meaning of the origin” seems to now come to the forefront in the discussion with Joliffe.
Oh well. Whether Mann, Tamino or Gavin like it or not, at this point, large numbers of people will consider the new hockey stick something that needs to be checked out by a known, non-anonymous third party before they accept it as having been a valid, defensible analysis. SteveM is, for some reason, willing to dive into those details.
It is interesting that Joliffe stepped forward at just this particular time!
Lucia,
In line with my comments on the last post, you won’t be surprised that I am unhappy with this new approach. It’s not surprising that it has more power; before you were asking whether a line of slope 0.02 C/yr could be within the population of fitted slopes to a decade of temperatures. Now you’re asking whether it could be within a smaller population of slopes of lines passing through 0.22C in 2000. So of course the answer is more likely to be no.
But in a way that is good, because it makes the basis for disagreement clearer. We’ve argued about whether the IPCC really predicted a 0.02 C/yr trend for 2000-2008, but one thing they certainly didn’t do is predict 0.22C in 2000, so it is unreasonable to expect them to match that. Now you may say that’s absurd – the initial value is not predicted but known. But in drawing Fig 10.4, this knowledge wasn’t used. You can see that there is a spread right from the start, reflecting the fact that the models were initialised some time earlier, not in 2000.
You say that "the linear behaviour of the trend can be seen in Fig 10.4" and that seems to be the basis for choosing this "IPCC central tendency" to test. But is it a trend passing through 0.22C in 2000? That is what you are testing against. And if you had constrained yourself to draw a line through that point (or equivalently, in the GCM data that you rebaselined), would the "central" slope for the first decade have been 0.02 C/yr?
Nick–
Do you mean in figure 10.4? Yes. The line either passes right through 0.22 C in 2001 or close. (It's 0.2 C in 2000. If I said otherwise in the post, that's a typo.)
If you want to check, here’s the figure. Click to enlarge and look:
The line also passes through that 0.2C at 2000. You’ll also get pretty much this result if you download the underlying data for the A1B scenarios, process rebaseline as they describe and plot it. Click on my graph in the post to see the full size.
The same question twice? 🙂 You must really think the temperature doesn’t go through 0.2 C. The answer is still yesss!!! 🙂
(The monthly temperatures do vary a bit. I am rounding. But, the annual averages don’t vary so much, and that’s whats in figure 10.4.)
On this
You make two points: One relates to initialization. One relates to the spread.
The runs were initialized at different times, but all runs were initialized well before 1910. I know… I downloaded them. 🙂
The projections and data in figure 10.4 are based on rebaselining everything relative to jan/1980-dec/1999. The spread in temperatures on figure 10.4 is the spread over the models after rebaselining. That spread is either due to a) “weather noise” or b) real differences in model predictions for the trend over the period 1980-1999.
However, since the baseline period ends in Dec 1999, the spread at 2000 in figure 10.4 is likely mostly "model weather noise".
I’m not sure why you are worried about the “initial condition”. My trend fitting method does not treat 0.22C in 2001 as an “initial condition”.
It treats 0.22 C in 2001 as the "expected value over a large ensemble of realizations", as indicated in the figure illustrating the projections. The model average is kinda-sorta claimed to be an "expected value over a large ensemble of realizations".
The trend fitting method then essentially treats the weather data as being a single realization of a stochastic process with an "expected value" that matches the models' "expected value". None of this has anything to do with initial conditions or assumptions about initial conditions.
The fit in this post does force the trend through the "expected" value for model results in 2001, as illustrated in figure 10.4. So, yes, this method uses even more information from figure 10.4 than the previous method.
FWIW, my figure above shows a similar "spread" to that in figure 10.4 — the pink jagged lines. My 'spread' isn't exactly the same as that in 10.4. The difference is I show the 95% confidence bands for monthly data and the authors of the AR4 show 1 standard deviation for annual average data.
But be aware of this: I don’t set the observed temperature anomaly in 2001 to zero. The observed temperature for the earth in 2001 is lower than 0.22C, since it was lower than that relative to 1980-1999. If you blow up my figure you can see that.
I’ll be writing up a little more, likely on Friday.
For what it’s worth, I don’t actually know if I like this method better. There are oddities you have not touched on! Think of different ways to interpret “relative to 1980-1999″… you might think of an oddity! (I’ve been having an argument with myself over pinning the fit through 1990. I’d argue 2001 is better, but I’m not sure everyone would agree based on the text of the AR4.)
I have a stupid question:
Whenever you measure something there is error. When you measure global temperature through averaging according to (hopefully) spatially equal average temps to obtain a global temp and then plot the global temp according to its difference from an arbitrary mean, every single error gets added together.
First there is the measurement error. Then you take the measurement and average it over time and then a spatial sector. The original error gets multiplied. Then you take that number and its error and average it again. The error grows. Finally, you take a baseline, which has an error, and subtract it from a number which has an error, making the final error even bigger.
What would a true estimate of the final error be, assuming that you can measure temperature to 0.5 percent in the very first measurement?
I am speaking here of the GISS land data set, but it could be applied to any calculation of temperature.
peerreviewer, there are no stupid questions, haven’t you heard? In any case, yours certainly doesn’t fit that bill. What you’re asking about is a variation on what some have called the Morse-Blood Paradox, which essentially states that one can never conceive of error unless one first knows accuracy (or one cannot calculate probabilities without first having some absolute upon which to start calculating), and yet since math, the science of measurement, is inherently inaccurate, one can never actually know accuracy or absolutes. The solution, say some (and I am among them), is in the very premise.
Thinking man, I like your style — almost Stephen Kingish. As for your organic thoughts, I think “they” like the term better than fecal agricultrual technology/carbon anally transmitted. Personally, I’d go with the flow.
Darwin, you throw off my equilibrium with your well-punctuated post — to say nothing of your compliments, the origins of which I hope were not random. Naturally, you select among many possibilities my “organic thoughts” (as you say) upon which to zero in, and that tells me something good about you. Thank you.
Regarding this “fecal-agricultural-carbon-anally-transmitted” business you speak of — I’m afraid that’s all slightly over my head.
But it certainly sounds like no shit. And I do hope the dynamic Ms. Liljegren will permit me that small vulgarity, as well as this long-winded divagation.
TM, you are a fun and interesting read. You don't take any and see right through a lot. I just think of fecal-agricultural-technology/carbon-anally-transmitted by its more easily digested acronym, which is really what it is all about. I believe that there is some appropriate use for it in the humid tropics, where the soil is variably charged and root systems grow out and not down, but not in temperate climates where smart farmers can test their soil and find the proper mix. In those circumstances, FAT CAT is just a cachet to sell at a higher price and support a particular lifestyle. No …. kidding.
Sorry, sweet Lucia. I will now stay silent and simply wait for more analysis.
peerreviewer–
I think there are few stupid questions, but sometimes there are stupid answers! 🙂
On the errors– I’m not sure precisely what you are asking. So, I’m not sure how to answer it.
Of course there are measurement uncertainties in data. If GMST were a simpler “thing”, we could take measurements, and do some sort of calibration to estimate errors. Instead, people do try to estimate the uncertainty based on knowledge of errors in thermometry, lack of spatial coverage etc.
Oddly, individual errors don't quite "add" together; in fact, depending on exactly what's done, they can on average cancel! (For example, measuring the same thing 10 times and taking the average can, in many instances, decrease our uncertainty in knowledge of the measurement. But this depends on whether the errors in the 10 measurements are correlated.)
Thin King Man–
Oddly, your comment got through. Not that I would have moderated it for content, but I was in Wisconsin camping last night and had set WordPress to moderate all comments by people who hadn’t commented before! (Or, I thought I had. . . )
Luckily, no food fights broke out while I was away.
Oh… I never go by Ms. Liljegren. I usually use Lucia, but professionally it’s Dr. and socially it’s Mrs. 🙂
Oh… I googled “Morse-Blood paradox”, and the first and only hit was this blog. So, you may need to explain that to me.
However, it is true that to calculate probabilities we have to have something absolute to go on. One of the difficulties with various analyses is that there are always some assumptions which we treat as absolute before proceeding and doing math.
So, you can see here that Nick is questioning a starting assumption. His questioning of setting T = 0.22 C at year 2001 is valid. In fact, if you read the article, you'll see that the difficulty of defending any particular choice is one of the reasons I didn't select this method in the first place.
But… I do think it’s useful to see what different answers we get using different starting assumptions. To the extent that we get similar answers with different starting assumptions, we might have more confidence that an answer is “correct”. (I think the word used when blog-viating about climate is “robust”. 🙂 )
Darwin,
I have been known to spread composted fecal matter in the garden. It’s about the same price as decent top soil, and the hill I live on is so full of clay I try to add all sorts of organic matter. (Jim built a compost bin and I dig in leaf mold too. In the fall I may try to get the neighbors to donate leaves so I can have LOTS of leaf mold.)
Errors. Many years ago, when I first started measuring things, I was taught errors. Say you measure a room and it is 10 feet plus or minus .1 foot on one side and 12 feet plus or minus .1 foot on the other. If you take the square footage, 120, the range of square footage is 9.9 times 11.9, or 117.81, to 10.1 times 12.1, or 122.21. If you add them, the range is 21.8 to 22.2: the .1 error adds. If you multiply or divide, the error also increases. Any time you make a measurement and do an operation on it with another measurement, you increase the error of it.
To calculate a global temperature, the error increases with every arithmetic operation you do to the number.
For the room example, say I took a measurement and did ten multiplications with it. The range of true measurement would be 9.9 to the tenth to 10.1 to the tenth. So for global warming, if we start with the ability to truly measure a temperature and assume it is 18.1 degrees C plus or minus .2 degrees C, and then do at least 6 operations on it, the error of the measurement will increase with each operation. Another, more complicated way of thinking about it is that everything has a gaussian attached to it, and these gaussians do not go away.
So what significance do I give to a 0.4 degree temp anomaly? About nothing, since the average temp is about 15 to 20, so a .4 degree change means that I can measure temperature to a 2 percent accuracy after manipulating the number multiple times. Just doing the baseline subtraction adds the error, just like adding the lengths of the room. So to end up with 2 percent accuracy you had to start with 1 percent accuracy. And to begin with the global average number of 1 percent accuracy, you would have to be much much better than 1 percent accuracy at each of the geospatial averages, and even better at each site, and that is just impossible.
Lucia,
From time to time I find myself rather bored, so I click on the link you have to Open Mind. If you head over there I am sure you can see the new post which seems to me to be close to the “Grumbine question.”
But today's stupid question isn't about the "Grumbine question." I was wondering if it is common for mathemagicians to support a particular method, which I believe you adopted, but then switch to another method once the first method stops giving the answer they want to hear?
Ok that isn’t really a question so much as a creative linking method. Thought you might be interested if you hadn’t already seen them.
Raphael–
On the post overall– I read the earlier one, and I was waiting to see what Tamino did later. I'm interested in what the various structures can give. Giving the new post a quick look, it appears Tamino uses an amount of "noise" that includes the volcanic eruptions– since he's trying to mimic the period since 1975, when there were plenty. If so, the statistics have become unhinged from the phenomenology.
One of the important points about the current trend is that it happened without a volcanic eruption. So, using statistics where the "weather only" variability is enhanced with volcanic eruptions to give "weather + volcano" noise gives a false impression of the uncertainty. (At least I believe this to be true, and the older data seem to bear this out.)
But, maybe I'm wrong, and Tamino didn't create synthetic data with too much noise for the current period. I'll be taking a look at the numbers Tamino is using later to see how much "noise" they produce for the current period.
On the more specific issue of switching methods:
People do switch methods if they discover a previous method is flawed. There is, however, a problem if they only switch methods when they don't get the answer they like.
The tendency of people to switch based on the "answer" is one of the sources of confirmation bias. As long as the old method gave an answer they liked, they used it. Also, when switching, it is best if they can explain precisely why the new method is either equally good or better than their old method, and show rather detailed proof it's better.
So, for example, since Tamino used to "like" (in the sense that he used it) the Lee and Lund method for proving global warming, and now he's switching to ARMA(1,1), he should show the correlograms, show they are better, show the parameters he's using are statistically significant, etc.
It looks like he’s writing a series, so maybe he will show all the things he needs to show to convince people he didn’t just select his parameters as being something that permits us to create a synthetic data series that creates lots of strings of “flat” bits.
Obviously, if he doesn’t, he will continue to convince the people who always believed him, and others will remain skeptical about whether his method is more accurate than the one he used to apply at his blog.
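For what it's worth, the sort of diagnostics I mean look roughly like the sketch below, written with Python's statsmodels rather than whatever Tamino actually used; the simulated red noise just stands in for detrended monthly anomalies so the script runs.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA

# Stand-in residuals: simulate some AR(1) "red" noise in place of real
# detrended monthly anomalies.
rng = np.random.default_rng(0)
resid = np.zeros(400)
for i in range(1, len(resid)):
    resid[i] = 0.5 * resid[i - 1] + rng.normal(0.0, 0.1)

# Fit the two candidate noise models and compare.
ar1 = ARIMA(resid, order=(1, 0, 0)).fit()
arma11 = ARIMA(resid, order=(1, 0, 1)).fit()
print(ar1.summary())                  # parameter estimates and p-values
print(arma11.summary())
print("AIC:", ar1.aic, arma11.aic)    # lower AIC favors that noise model

# Correlograms of what each noise model leaves unexplained.
sm.graphics.tsa.plot_acf(ar1.resid, lags=36)
sm.graphics.tsa.plot_acf(arma11.resid, lags=36)
plt.show()
```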
Peerreviewer–
I learned a similar thing in grade school or high school. But the approach was modified in college and graduate school.
If you add numbers, round, record, and then do add, subtract, multiply, and divide operations with the rounded numbers, the maximum possible error increases. So, 12.4 + 15.4 = 27.8, but 12 + 15 = 27. Since 27.8 would have rounded up to 28, you are now officially "off".
Obviously, by rounding, you could now be off by 0.5, and if you added enough numbers, you could be off by a great deal.
This is very useful for students working with small numbers of data to know.
However, it turns out that if you add 10 numbers, you very rarely round down every single one. So, the standard deviation of the error resulting from that process doesn't grow as quickly as grade school students are taught: it increases at a rate of the square root of the number of measurements, N. If your measurement method involved taking an average of 10 numbers, you would divide the sum by 10, and your error decreases by the square root of 10! (There are some unstated assumptions about the errors in the previous discussion.)
So, estimating errors that are propagated can be complicated. We need to figure out the nature of the errors, and the method by which they are propagated.
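Here's a quick numerical illustration of that square-root-of-N behavior, with made-up numbers and independent errors assumed:

```python
import numpy as np

rng = np.random.default_rng(1)

# Measure a "true" value of 15.0 with independent errors of sd 0.5,
# then average N readings and see how the error of the average shrinks.
true_value, sd, trials = 15.0, 0.5, 20000
for n in (1, 10, 100):
    averages = rng.normal(true_value, sd, size=(trials, n)).mean(axis=1)
    print(n, "readings -> sd of the average:", round(averages.std(), 3))
# Prints roughly 0.5, 0.16, 0.05: the error shrinks like 1/sqrt(N)
# when the individual errors are uncorrelated.
```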
So, how does propagation of error affect GISS temp? Beats me! It's probably no worse than for NOAA/NCDC or Hadley.
Lucia,
You could comment over there discussing your qualms about his methodology. I’m sure you two could manage to keep things civil :-p
Zeke–
Nope! Not going to comment over there yet.
I’ve gotten an impression based on a cursory inspection. But, I need to run some numbers to see if that impression pans out.
I have specific questions in my mind– they would require computing some values. If I asked Tamino precisely what I want to ask, he would probably have to understand the question and run the numbers. Then, I’d re-do them anyway! So… I think it’s more time effective for everyone if I just run my numbers, and then describe what I find.
It’s possible that when I run the numbers, I’ll agree Tamino picked a decent statistical model. Or not… Either way, I’ll be commenting.
Lucia. Adding carbon to the soil is a good thing. That’s one reason to keep sillage on the land. Leaf mulch is good, too. But one can have too much of a good thing. Finding the right balance is usually best. Escuela de Agriculturo Regionales Tropical Humidas has some very good programs trying to find them.
Hi lucia, Thanks for answering. Sorry for being unclear.
part 1
I am not talking about rounding. I am talking about making a measurement which has a degree of error or indeterminacy and combining it with another number which has indeterminacy. My question is what happens to the error range when you manipulate numbers that have errors. For a concrete example's sake, what happens when you take a globally averaged number for temp, which has some error associated with it, and then take a number that comes from a baseline and do a subtraction or addition to get the anomaly relative to that baseline?
if the global temp is 15 plus or minus .2 and the baseline is 14 plus or minus .2, what is the resultant error for the anomaly?
A lot of the arguments surrounding AGW concern how well we know the temperature is changing and the inability to say how well determined the number is. And I don't have an answer.
part 2
I have a lot of other problems with the surface data. For example, each temperature recording from each site has built-in variance from an imagined "true" temperature. We don't have any way of knowing what the error of the measurement is. It's not the error of the thermometer under laboratory conditions; it's the error related to the entire system: the thermometer, the hut, the paint, the location, the observer, etc. If you have ever taken a class of students and a bunch of thermometers from different sources and had them read a temperature in a classroom, you would see that reading a thermometer to get a temperature is not very precisely done.
If this were a lab, I would take a sample and measure some property using the same machine, and another lab would take the same sample and measure it on the same machine or a different one, and we could compare results. And I would repeat the measurement several times. You can't do this with temperature, which is spatially determined. The next best thing would be to have several measurements with apparently identical devices at one location. We do not have any surface temps in which the local temperature is measured side by side with several temperature-measuring huts, or even 3 or 4 at one location. We cannot compare the differences in local temperature data obtained from that measurement. So we have no idea what the measurement error of each particular data point in the global surface temperature monitoring system is, nor how this error changes with time (and it has to, because different devices and observers have been used to collect the time-dependent temperature series).
The assumption is that because we are taking many measurements, the data will be robust. But we are not taking many measurements. We are taking many independent measurements of different temperatures, each completely separate from the others, and then manipulating the data.
If I were using the global temp data in the lab, I would simply take all the numbers and get an average and a standard deviation, etc. But you can't do that, because obviously we are not measuring the same thing at each site, so we cannot use the population of numbers to define an error. The error has to be determined from a measurement of the error of that hut, and we don't have any numbers for this. Maybe USHCN does. Anthony's blog has certainly shown us the variability of the numbers and has shown, in a single painted-versus-unpainted hut example with multiple measurements, over a degree of difference in the recordings. If you used Anthony Watts' hut example to determine the error of temperature measurement, there would be no debate, because there would be no determination of temperature within .2 degrees.
http://wattsupwiththat.wordpress.com/2007/07/14/the-stevenson-screen-paint-test/
So how do I know how good each measurement is, and how do I propagate this error to the spatially averaged mean?
peerreviewer:
Nothing much! If a constant number is added to or subtracted from a series of numbers to create an anomaly, this doesn't introduce error per se. For example, there is no error converting a number from Celsius to Kelvin.
The error in the anomaly is ±0.2 for nearly all purposes. For example, if you wish to do trend analysis, after subtracting 14 from 15±0.2C the anomaly is 1 ±0.2C. This is because you subtract 14 from every single number.
There are exceptions where you might think the error is larger. For example, if you only have anomalies and you want to reconstitute absolute temperatures, and you forget the original temperature you subtracted to create anomalies (which was 14 above), then you will have a problem getting back the original temperature in absolute units. But this isn't due to adding and subtracting. It's due to not knowing the original value you used to create the anomaly.
So, now if you need to re-estimate the absolute temperature for the baseline and add it back, you might not get 14. You might get 13.9, add that, and your reconstituted absolute temperature is 1+13.9= 14.9. But, at this point, you will also want to begin keeping track of random and bias error separately, because that 0.1 error is the same for every single temperature!
There are ways to estimate errors for derived quantities. If you want to estimate those from adding and subtracting, you can do differential error analysis. I haven’t done it myself for GISSTemp– and I don’t plan to.
But I should warn you that people often make mistakes by assuming that information was “forgotten”. When they do that, they drastically over estimate the uncertainty due to propagation of error.
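Here's a small sketch of the point about constants; the 14.0 baseline, 0.02 C/yr trend, and noise level are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Subtracting the same baseline from every value shifts the series but
# leaves the fitted trend (and the residuals that set its uncertainty) alone.
t = np.arange(120) / 12.0
absolute = 14.0 + 0.02 * t + rng.normal(0.0, 0.1, t.size)   # "absolute" temps
anomaly = absolute - 14.0                                   # constant baseline removed

for label, y in (("absolute", absolute), ("anomaly", anomaly)):
    x = t - t.mean()
    m = np.sum(x * (y - y.mean())) / np.sum(x ** 2)
    print(label, "trend:", round(m, 4))      # same slope both times

# An error in the baseline itself (say 13.9 instead of 14.0) shifts every
# anomaly by the same 0.1 C: it acts as a bias, not as extra scatter.
```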
On part 2:
Yes. There are problems with measurements, and we know specific ones with GISSTemp, NOAA/NCDC, and presumably any and all measurements of GMST.
You are correct that we don’t have a calibration standard for GMST. It’s not possible to create one. You are correct there obviously are measurement errors. This is not unique to GISStemp; all measurements ever taken include uncertainties.
Since real calibration is impossible, we are left with Hadley’s estimate (or any others that may exist). You can find the paper under the heading “references” here.
Many thanks lucia. I think my confusion is thinking that the baseline is, as you refer to it, a constant number. I considered it derived from a temperature series, and therefore having the same error characteristics as the temperature itself, and thus having a built-in error. I will have to think about that part and see what happens when I do a trend fit with known error amounts for the individual temperatures from say 1940 to 1980, to see what the predicted standard error of the trend line is (how fat).
Lucia,
I have been feeling ill, and didn’t have much sleep last night. The good (or is it bad?) thing is I have had plenty of time to think. The bad (or worse?) thing is I am in a semi-coherent state, and I am aware that what seems to be perfectly reasonable to me may be far from it.
That being said, I have a “dumb question”. (which may not be in the form of a question)
When I was young, I received a father to child education about the signal to noise ratio. If we consider the underlying climate trend to be a signal and weather noise to be, well, noise masking the signal, we have a simple to understand (for me at least) analogy of the “Grumbine question”.
This led me to two observations:
1. Grumbine is likely right. 7.5 years is probably too short an interval to reveal a strong signal.
2. I don’t think it matters. I don’t think we need to reveal the signal at all. I think the important question to ask is, “Can this noise mask that signal?” This should be a question that is answered by the same test you are performing.
So, I guess if I were to phrase a dumb question it would be, “What is wrong with my question?”
Just looked at Tamino's new post; he assumes a steadily increasing trend and some noise variables to show, in the linear graph produced, that contrary to the upward trend there will be small time sections showing a downward trend; this is similar to the Keenlyside thesis, whereby the anthropogenic influence is temporarily masked by a contrary natural effect;
http://www.nature.com/nature.com/nature/journal/v453/n7191/full/nature06921.html
A previous post by your good self seemed to address this idea;
http://rankexploits.com/musings/2008/gavin-schmidt-corrects-for-enso-ipcc-projections-still-falsify/
Which is to say, if the contrary, or accentuating, natural cause is removed, the anthropogenic effect should manifest; you showed it didn't. If the trend goes down when nature is removed and the designated anthropogenic cause, CO2, continues upwards, the theory would have to be suspect; unless Tamino has some other suppressing non-anthropogenic factor in mind.
I’ll try again with the Keenlyside link;
http://www.nature.com/nature/journal/v453/n7191/full/nature06921.html
If it doesn’t work; the reference is Keenlyside et al, “Advancing decadal-scale climate prediction in the North Atlantic sector”; Nature 453; 84-88 (1 May 2008)
Cohenite–
The main difficulty I have with Tamino’s analysis is that it does not distinguish between variability caused by the numerous volcanic eruptions from 1975-1994 or so and that caused by more ordinary weather. (That is, weather variability in the climate system that is not experiencing highly variable external forcing.)
Of course any statistical fit with parameters taken from that period will reproduce the variability due to the eruptions of El Chichón, Fuego, Pinatubo, etc. But that doesn't mean that variability would be expected to apply to periods when the volcanoes are not erupting.
I’ve been a little busy with real work, but I’ll be responding at more length later.