How does the start date affect the hypothesis test?

In comments today, “gravityloss” suggested that the results of my method are not “robust”.

I’m not sure what she means by that, but she went on a bit about how the magnitude of trends varies if one picks different start dates. Well… of course it does. That’s why one must pick the start date based on something other than the data itself, and it’s also why the uncertainty intervals are important.

Still, I thought readers might want to see how the result of the hypothesis test would have changed had I happened to select a start date other than Jan 2001.

Even if readers aren’t curious, I am.

So, I selected a range of “start dates” reaching from 13 years ago to 6 1/2 years ago. For each of these start dates I ran 10,000 cases of “simulated weather” using the AR(1)+noise process and the trend of 2C/century discussed yesterday. For each month, I then determined the cut-off separating the lowest 2.5% of trends consistent with 2C/century; this is illustrated by the red curve. I repeated this to find the cut-off for the lowest 5% of trends, and I also show the boundary representing 1 standard deviation below the trend of 2C/century.
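
For anyone who wants to reproduce the idea, here is a minimal sketch of that Monte Carlo in Python rather than my spreadsheet. The AR(1) coefficient and noise level below are placeholders, not the values I actually used; the 2C/century trend, the 10,000 cases, and the percentile cut-offs are as described above.

    import numpy as np

    rng = np.random.default_rng(0)

    def trend_cutoffs(n_months, trend_c_per_century=2.0,
                      rho=0.5, sigma=0.1, n_cases=10_000):
        # Simulate "weather" as AR(1) noise around a linear trend and
        # return the 2.5th and 5th percentiles of fitted OLS trends.
        # rho and sigma are placeholder values, not those from the post.
        months = np.arange(n_months)
        beta = trend_c_per_century / 1200.0      # C/century -> C/month
        trends = np.empty(n_cases)
        for i in range(n_cases):
            noise = np.empty(n_months)
            noise[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - rho**2))
            for t in range(1, n_months):
                noise[t] = rho * noise[t - 1] + rng.normal(0.0, sigma)
            series = beta * months + noise
            slope = np.polyfit(months, series, 1)[0]  # OLS slope, C/month
            trends[i] = slope * 1200.0                # back to C/century
        return np.percentile(trends, [2.5, 5.0])

    # e.g. a start date 90 months before the most recent data point
    print(trend_cutoffs(n_months=90))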

These can be seen below:

What do we learn by examining the graphs?
First: Note that if we selected a start date fairly and the green curve falls below the red curve, we would reject the hypothesis that 2C/century is true at a confidence level of 95%. That is: We would treat it as false under the assumptions used in this analysis.

Currently, based on the start date of Jan. 2001, the HadCrut trend is so close to the boundary of the 95% confidence interval that the results differ each time I run a new set of 10,000 cases. (This was discussed yesterday!) I call a trend falling outside the 95% confidence level “falsified” based on the assumptions of a particular analysis. (If the analysis is correct, we expect to incorrectly call a true hypothesis false 5% of the time.) The HadCrut trend computed starting in 2001 falls distinctly outside the 90% confidence intervals. This would be deemed the “very low confidence” region by the IPCC.

Those are, of course, the results I reported yesterday.

But let’s look at what results we’d get if I’d happened to pick a different time period.

For the start dates shown (which are the only ones I tested), the HadCrut trend appears to fall outside the 90% confidence interval on the low side roughly half the time. I determined this by the “eyeball method”.

Because the data have this characteristic, it is clear that if someone wants to find a trend since 1996 that is not inconsistent with 2 C/century, they perfectly well can. They can, if they wish, compute the trend starting sometime in the middle of 1999 and note that it’s only 1 standard deviation below 2C/century! If 2C/century is correct, we would expect this to happen about 1/6th of the time, so it’s not so bad, is it?
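
The 1/6th figure is just the one-sided normal tail probability for one standard deviation. A quick check, assuming the simulated trends are roughly normal:

    from scipy.stats import norm

    print(norm.cdf(-1.0))    # ~0.159, about 1/6: one sd below the true trend
    print(norm.cdf(-1.645))  # ~0.05: the lowest-5% boundary (90% two-sided)
    print(norm.cdf(-1.96))   # ~0.025: the lowest-2.5% boundary (95% two-sided)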

On the other hand, if someone wants to find the strongest negative trend, they can use the graph and concoct some reason to start the hypothesis test in 2002 or even later. (Good luck coming up with some decent justification for that date!) For some dates, the hypothesis test says “reject” very strongly.

So, what does this show us? Not much. This sort of thing always happens. When a hypothesis is wrong (or even when it’s right) there are some specific choices of data that will “reject” or “fail to reject” at slightly different confidence levels. After all, there is randomness in the weather; there is no specific number of years or start date that entirely eliminates randomness in the results of a hypothesis test. (In fact, if one sets the confidence level at 95%, and the statistical model is correct, one will incorrectly “falsify” hypotheses that are correct 5% of the time. This is true even if we were able to collect a zillion years’ worth of data from a perfectly stationary process.)

What it does show us is that the result of the hypothesis test is affected somewhat by the choice of start date after 1996. As luck would have it, the one I picked gives “intermediate” results. That’s partly because I didn’t pick it based on the data; I picked it based on publication dates.

That said: It is true that if the IPCC schedule had been different, the results I announce would be slightly different. As it stands, my hypothesis test is what it is: A test of the hypothesis that the trend will be 2C/century, based on observations of monthly GMST since Jan. 2001.

The results say that we should have very low confidence in the idea that the “climate trend” is currently 2C/century but masked by “weather noise”. Of course, as always, the results are affected by the choice of statistical model. But this is true of all statistical analyses!

Conclusion

Of course the magnitude of short term trends varies dramatically as we vary the start date.

However, this doesn’t mean we can’t test whether or not the trends are inconsistent with 2C/century. It only means that to fall outside the 90% or 95% confidence intervals, very short trends must disagree dramatically with the projected value of 2C/century; this can be seen by noticing that the blue and red curves in the figure veer away from the 2C/century line as the trend is calculated with smaller amounts of data.

Currently, the HadCrut trend computed with a start date of Jan 2001 falls in a range that says we should have “very low confidence” in the projection of 2C/century. Trends from the other agencies don’t fall as far outside the intervals; values for trends since 2001 were reported earlier this week.

As we get more data, I may revive this chart so we can see whether the results change. Naturally, I’ll continue to use Jan 2001 as my official start date. I won’t be shifting to whatever start date gives “better” results (whatever “better” is supposed to mean!).

Afterword: Why compute trends since 2001?

Since this post discusses how the choice of start date would have affected specific results of my analyses, I want to remind readers why I picked this start date. Here’s why:

Readers will recall that way back then, I insisted that the start date should be based on criteria other than the data itself. The reason for this is that it is always possible to “hunt around” for a start date that is most favorable to one’s preferred result. This drastically biases the results of a hypothesis test. When data are plentiful, ideally, one would select a start date at random– unfortunately, that’s a practical impossibility with the climate data.

So, instead, I selected Jan 2001 as the start date for computing trends to be tested against the IPCC AR4 projections early this calendar year. The main factors in favor of this particular choice are:

  1. I want to avoid testing model “projection/predictions” against data that clearly existed before the method used to make the “projection/predictions” even existed. Clearly, testing “projection/predictions” published in 2007 against data from 1975, long before Hansen’s 1988 testimony to Congress, is absurd.
  2. Insisting on testing only with data arriving after 2007 would make testing projections a practical impossibility. The TAR projection/predictions published in 2001 were superseded when the FAR was published in 2007. The FAR will be superseded on a similar time frame.
  3. The FAR (fourth assessment report) supersedes the TAR, which contained different projections. Work on these projections is nearly continuous, so it can be argued that modeling decisions and choices are somehow “frozen” years before they are published. Nevertheless, I wanted to exclude weather data that was available before the TAR projections were made, on the basis that the FAR methodology for creating projections cannot pre-date the TAR it superseded. The official publication date for the TAR is Jan 2001.
  4. Selection of the Scenarios on which projections are based is part of the process of creating projections. I wanted a start date that begins after the SRES were selected. The SRES were developed from 1996 and evidently finalized in 2000.
  5. Other than the factors above, I wanted the longest possible period of time to test projections in the FAR. Together, the factors above dictate starting in January 2001.

Having selected Jan. 2001 as my “start date”, I now consider it “frozen”.

24 thoughts on “How does the start date affect the hypothesis test?”

  1. Sorry to divert from the theme, but don’t you think inquiring minds would like to see how your small contest on Sea-ice predictions for summer 2008 turned out?

    Cheers!
    Cassanders

  2. Cassanders– I’ve been waiting for the UofI guy to put up the final results. Our contest is based on a figure he puts up– and it may not be up until the end of the fall! But, I’ll do a preliminary and then a “final”!

  3. seek and ye shall find …
    hence the need to vigorously guard against confirmation biases

    thanks as always, lucia

  4. If we are testing the 2C per century prediction, we should be able to use all possible start dates since the prediction is supposed to apply throughout the time period of rising GHGs.

    Perhaps it doesn’t apply for the 1850 to 1900 time period since GHGs were rising slowly at the time, but the logarithmic relationship of GHGs to temperature means it almost applies for this period as well. But we can certainly start measuring from 1945 since GHGs have risen fairly consistently since this date.

    If one starts measuring in 1991 or 1992, one can reach the 2C per century trend. If you start in 1945, the trend is only 1.0C per century. If one starts in 1997, the trend is 0.0C. 1850 is 0.4C per century. 1979 is 1.6C.

  5. Bill–
    You can’t use all possible start dates and then also report a “p”. The reason is that if you pick your start date based on the data, you change the probability computation.

    For example: Suppose I intentionally pick 1999 after looking at a graph like the one above or just because I know before doing the test 1999 has a “low” dip in GMST and will give a higher positive trend.

    Now, precisely because I picked 1999 based on the data, I have altered the probability of a trend given the evidence. What I have now computed is a conditional probability based on having knowingly chosen the year that is most favorable to getting a high trend.

    As I see it (and I’ve said this before), if we impose the rule that we can’t pick start dates based on the data themselves, there are only three plausible “start dates” for testing the trends:

    1) Jan 2000. This is because the AR(4) graphs “end” their hindcasts at this point.
    2) Jan 2001. This is based on actual publications dates.
    3) Jan 2008. This is the beginning of the year after publication of the AR(4).

    The reason I don’t pick 2000 is the IPCC sets all sorts of dates in the document, and I like to avoid data that clearly existed before the TAR was published. The AR(4) projections also give all sorts of possible ‘start dates’, all of which give them a head start on ‘being right’. There are references to 1980 through Dec 1999 for the baseline. But starting any tests with data from 1980 clearly means we would be testing “projection/predictions” for the 2000s using data that was available before the AR(4) generation of AGCM models even existed!

    Still, some could argue that 2000 is good, and if they did, they’d get somewhat different results.

    So, you see, statistical tests tell us something concrete. But, even if you don’t have time to run the test yourself, the real arguments are over the assumptions made before crunching the numbers.
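
    (To put a number on the selection effect described above: a minimal simulation sketch, using white noise rather than the AR(1)+noise model of the posts, and a made-up noise level. It compares the false-rejection rate of an honestly fixed window with the rate you get by hunting for the most favorable of 37 overlapping windows.)

        import numpy as np

        rng = np.random.default_rng(1)
        beta = 2.0 / 1200.0                  # true trend: 2 C/century, in C/month
        sigma, L, n_sims = 0.15, 90, 2000    # made-up noise level

        def trend(series):
            # OLS slope rescaled to C/century
            t = np.arange(len(series))
            return np.polyfit(t, series, 1)[0] * 1200.0

        # Honest null distribution: trend over one fixed 90-month window
        # when the true trend really is 2 C/century.
        null = np.array([trend(beta * np.arange(L) + rng.normal(0, sigma, L))
                         for _ in range(n_sims)])
        cutoff = np.percentile(null, 5.0)    # honest one-sided 5% cut-off

        # Cherry-picked statistic: generate 126 months, then report the
        # lowest trend among the 37 possible 90-month windows.
        hits = 0
        for _ in range(n_sims):
            y = beta * np.arange(126) + rng.normal(0, sigma, 126)
            worst = min(trend(y[s:s + L]) for s in range(37))
            hits += worst < cutoff

        print("honest false-rejection rate: ~5% by construction")
        print(f"cherry-picked rate: {hits / n_sims:.0%}")  # well above 5%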

  6. Hi Lucia,

    I only have time for a quick note, so I’ll try not to pick a fight. 🙂

    Can you explain why you showed the HadCRUT trend but not NOAA, GISS, or average of all three? I only ask because you typically show all of them and seem to prefer the average.

  7. JohnV–
    There are only three reasons:

    a) The only purpose in showing this is to show that picking different start years does make a difference. That can be shown with any case equally easily.

    b) Time constraints. Unfortunately, running all cases for all combinations is time consuming. I haven’t run this for all the cases, but the effect of choosing the start date would be similar for each of them. If you pick 1999 as a “start date” for some unstated mystery reason, you’ll get the highest-magnitude trend of recent years in the band I picked. If you pick 2002, you get a lower one.

    c) My main test uses the start year of 2001. I posted the results using all four datasets on Monday and linked back to that above, so you can see those results there. HadCrut gives the strongest “false” for this case based on starting in 2001.

    If there is one other of the four in particular you would like to see, I’d be happy to make another graph. I just don’t want to spend the time on all four, as I don’t think more than one is necessary to discuss the topic at hand.

  8. Lucia,

    by using the same methodology, but with a much earlier start date (say 1900), you could figure out the confidence intervals for the linear trend during the 20th century.

    Now according to theory, exponential rise of CO2 leads to linear temperature trend. So any linear trend will be partly due to CO2. So far, no other effect is thought to give a linear trend over such a long period. However, very long cyclical effects could appear to have a linear trend during the same period, and contribute, positively or negatively, to the trend. For example, a 200 year cycle (say in and out of a little ice age) could look like a linear trend over a 100 year period. Nevertheless, knowing the confidence interval for the 20th century trend would be useful. If we can then assign potential ranges for long cyclical trends, either via theoretical considerations, or proxy reconstructions (good ones…), then we could circumscribe the correct value for CO2 forcing. In fact, since we would be interested only in cycles longer than 50 years, low resolution proxies would be good enough. One could do a spectral analysis over, say the last 1000 years, excluding the 20th century, and put boundaries on their contribution to the 20th century trend.

    So it seems that extending your methodology to the entire 20th century could make a very useful contribution to estimating the CO2 sensitivity. If you ever feel like doing it, that would make for interesting posts!
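
    (The spectral step Francois suggests might look something like this sketch. The proxy file name is hypothetical, and an annual-resolution anomaly series is assumed.)

        import numpy as np

        # Hypothetical annual proxy reconstruction, years 1000-1899,
        # one anomaly value per year; the file name is made up.
        anom = np.loadtxt("proxy_recon_1000_1899.txt")
        anom = anom - anom.mean()

        freqs = np.fft.rfftfreq(anom.size, d=1.0)          # cycles per year
        power = np.abs(np.fft.rfft(anom)) ** 2 / anom.size

        # Periods longer than 50 years could masquerade as a linear
        # trend over a single century, so bound their power.
        mask = (freqs > 0) & (freqs < 1.0 / 50.0)
        for f, p in zip(freqs[mask], power[mask]):
            print(f"period {1.0 / f:7.0f} yr   power {p:.3g}")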

  9. Francois: The population from 1850-present gives just as good of a match as carbon dioxide, methane, nitrous oxide and CFC gases do. I’m sure charts of industrialization, or cost or power of technology, or urbanization would be good matches also. Probably even number of motor vehicles. Or even sophistication of the temperature sampling devices.

    ——————

    As to picking dates. And playing with numbers. And fluffy orange bunnies or purple cats.

    As I posted over on the CA boards, GHCN-ERSST for 1945 to 2007 shows a “jump” of .2 in 1959 and another in 1992. If you add +.2 to the anomaly for each of those years going forward, the yearly anomalies hover quite well around the 1951-1980 base period 0 line.

    And no, I forget why I picked 1945. I chose it before I saw the resulting graph if that cuts me any slack. 🙂

    http://www.climateaudit.org/phpBB3/viewtopic.php?f=3&t=527&p=11335&hilit=1945#p11335

    Interesting range of -.3 to +.2 in that case.

    But the fun of massaging data aside, the real question is what exactly is the margin of error on that .7 upward trend of anomaly readings from 1880-2007?

    This one, covering 1997-2008 ground data only, might also be interesting.

    http://www.climateaudit.org/phpBB3/viewtopic.php?f=3&t=529&st=0&sk=t&sd=a#p11407

    That’s about .1 a decade.

    Again, other questions, issues and discussions aside, what level of confidence do we place in anomaly readings that are at +.62 in some year, or trending about the same over 130 years?

    I know my answer. The anomaly isn’t a good way to track the energy levels of the planet, so the time period or results don’t matter anyway.

    Long live climate science!

    🙂

  10. Lucia; I just thought of it; how do we get 2 C per century on the trend when it hasn’t even gone past 70% of 1 C over 130 years?

    Or are we postulating some odd situation where (say) the 50 ppmv carbon dioxide rise in the atmosphere over the last 30 years is going to increase in rate, and cause the trend to go up faster than some logarithmic amount?

    What was the lower tropospheric satellite reading over the last 30 years, a trend rise of about .4 or so? Odd how that point four keeps coming up; maybe it’s a conspiracy.

  11. Lucia,

    Could you explain the method you used to calculate the (green) trend values? Depending on the way you have calculated trends in the past, it seems that you find different trends as well as changes to the confidence limits.

    Until recently you mostly used a corrected OLS method or a CO (Cochrane-Orcutt) technique, so I wondered if it was one of these or, maybe, one of the new variants that tries to separate measurement noise from weather noise.

    Sorry to be a pest with silly questions but most of my experience with signal processing was back in the days when we used analog electronics. That was a time when we did all the work on the signal before the analog to digital conversion. 🙂

    I suppose it is easy for electronics people because we usually know what the signal is and anything else is noise. This is a luxury that does not seem to exist in climate time series. Signal is now, by definition, what you keep and noise is what you throw away. It seems a bit backwards to me!

  12. Jorge–These are just ordinary least squares computed using “Linest” in EXCEL, applied to the data. The general point would be the same with any method, though. Because weather is variable, even if there were a constant trend, we’d get some oscillation around it. The result is that there will always be a point we could pick to get the highest trend and a point we could pick to get the lowest one.

    This fact does not invalidate statistical analyses. It’s why a) it’s important to pick a start date based on something other than the data and b) you aren’t allowed to hunt around for your favorite start date.

    I don’t really like the terms “noise” and “signal” the way it’s used on climate blogs. But… it is used! I’d prefer if “noise” were used for measurement errors, since like it or not, the weather is real.
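
    (For anyone without EXCEL: the LINEST computation described above is plain ordinary least squares, and a numpy one-liner reproduces it. A minimal sketch, assuming a monthly anomaly array; the toy data are made up.)

        import numpy as np

        def ols_trend_c_per_century(monthly_anomalies):
            # Equivalent of Excel's LINEST slope: ordinary least squares
            # fit of anomaly vs. month, rescaled to C per century.
            t = np.arange(len(monthly_anomalies))
            slope = np.polyfit(t, monthly_anomalies, 1)[0]
            return slope * 1200.0   # C/month -> C/century

        # toy example: 90 months of made-up anomalies
        rng = np.random.default_rng(2)
        fake = 0.002 * np.arange(90) + rng.normal(0, 0.1, 90)
        print(ols_trend_c_per_century(fake))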

  13. Thanks Lucia,

    I did not think it really mattered to the point you were making so I was just being curious!

    It is things like ENSO/PDO that appear to cause the most trouble, as their low frequency makes it hard to distinguish their upslopes/downslopes from the trends we are looking for.

    Douglass & Christy 2008 recently used the nino3.4 index in a regression with the UAH lower troposphere temperatures for the tropics. With a bouquet of volcano, they found little room for a CO2 induced trend. To be fair, there does not seem to be much of a trend in the tropics anyway, or the southern hemisphere either.

    Actually I was surprised to find that nearly all the warming since 1979 is confined to the NoExtropics. Perhaps it is not global warming at all, maybe, in the words of Harold Macmillan it is just a little local difficulty. 😉

  14. Jorge, what makes you think the effect of CO2 is divorceable from “ENSO”? As the Pacific ocean heats from CO2, do you not expect stronger ENSO events? Think about it.

  15. bender,

    I am not actually sure what causes the ENSO temperature signals in the first place. Anything to do with mass/energy/salinity movements in/around/through the oceans is a complete mystery to me.

    What did you mean by stronger ENSO events? Are you suggesting the index will have greater max/min excursions? For all I know throwing more energy at it, if that is what more CO2 would cause, is just as likely to disrupt the pattern altogether.

    When someone comes up with a physical explanation for these pendulum-like swings we might be able to decide what extra energy might do. My understanding at the moment is that the ENSO index is pretty much a proxy for temperature anyway.

    Even if CO2 causes greater variation in ENSO events it does not mean that it would affect global temperatures averaged over climatic timescales.

  16. Nice analysis. I wondered what this would look like.

    What would the green curve look like if you controlled for ENSO all along?
    (I suspect the undulation would be greatly attenuated.) See Douglass and Christy.

  17. If ENSO is explanatory (and it appears to be) then controlling for ENSO should reduce the oscillations. However, there is the difficulty of properly controlling for ENSO. It’s not entirely separate from the temperature record itself.

  18. I really think it’s bad science if you take 2001 as a starting year (when the solar cycle was at its highest point) and compare it to 2008 (when the solar cycle is at its lowest point) and then say that 2 degrees per century cannot be true.

    1 W/m2 change in solar forcing (comparing highest/lowest points of solar cycle) affects almost as much as human caused CO2 since pre-industrial times (IPCC says it’s about 1,66 W/m2).

    You can’t prove anything until you compare the trend from 2001 to the next year when solar cycle is at the same point. That would be around 2013.

    Also, as already said, ENSO does have a huge effect in yearly temps. Taken also that into account, your graph doesn’t prove much about the temperature trends of the 21st century.

  19. Tuukka, if you check the archives you will see that Lucia tried correcting for ENSO and it didn’t much affect the result that 0.2/decade looks too high an estimate for the current temperature trend. How much does the temperature fluctuate over the solar cycle, in your opinion?
    There are different opinions on that, I think. At Realclimate, for instance, they don’t point to the sun when they try to explain away the current negative trend in global temperatures – they ‘cry weather’, as a commenter put it.

  20. Tuukka
    Both the effect of the solar cycle and the effect of ENSO have been discussed a lot here, and at other blogs. If you think these explain why the flattish trend can be consistent with the projected trend of 2C/century, you should quantify both what you are suggesting about the solar forcings and what you say about ENSO.

    For example, you say this:

    “1 W/m2 change in solar forcing (comparing highest/lowest points of solar cycle) affects almost as much as human caused CO2 since pre-industrial times (IPCC says it’s about 1,66 W/m2).”

    What do you mean by “almost as much”?

    Back when I was still only looking at annual average data, I wrote this up:
    What Does NASA Mean by Solar Variations Don’t Matter?

    In that post, you will find a graph showing the relative magnitude of various forcings:

    Note that, contrary to your claim that “1 W/m2 change in solar forcing (comparing highest/lowest points of solar cycle) affects almost as much as human caused CO2 since pre-industrial times (IPCC says it’s about 1,66 W/m2)”, the variations in solar forcing are teeny-tiny compared to the estimated increase in forcings due to GHGs.

    You can read some comments by Real Climate here: The lure of solar forcing. You can also read Tamino’s discussion of the invisibility of the effect of solar forcing on variations in GMST here.

    I believe JohnV used to visit and say that one of these days he was going to do the analysis to correct for the solar cycle, which the current analysis treats as a “random” effect, changing it into an exogenous variable. However, I think he hasn’t had the time to do so yet.

    Now, on this:

    Also, as already said, ENSO does have a huge effect in yearly temps. Taken also that into account…

    What do you mean by ENSO has a “huge” effect? As in: what’s “huge”? And given that the MEI index has oscillated up and down several times since 2000, and my analysis period begins during a La Nina, why do you think correcting for ENSO will magically give the answer you want?

    The reason the unquantified handwaving at ENSO gets no traction around here is that we’ve all looked at it repeatedly, quantifying each time. Most recently, after Gavin posted what he thinks is a somewhat legitimate correction, I posted this:

    Correcting temperatures for ENSO. I used the AR(1) statistical model. The effect of correcting for ENSO is to narrow the uncertainty intervals, and the 2 C/century remained falsified.

    I haven’t done a similar correction for the ARMA(1,1) model, but it would be perfectly possible to do so. I probably will at some point, since people who now wish to believe things denialists have been mocked for believing want to consider the ENSO factor. (Those two things are: a) it’s all solar, and b) there is always energy in cycles longer than those used to analyze the period.)

    Finally– with respect to both the solar cycle and ENSO: I estimated the uncertainty due to both factors using a 30 year period of data. So, the variance in trends over 7 1/2 years due to these factors is included in the current uncertainty intervals. That is, unless you think the 11 year solar cycle and ENSO stopped during those 30 years.
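
    (One common way to “correct for ENSO”, as a sketch: regress the monthly anomalies on time plus a lagged ENSO index and read the trend off the time coefficient. The 3-month lag and the toy series below are placeholders, not values from the posts linked above.)

        import numpy as np

        def enso_corrected_trend(anomalies, enso_index, lag=3):
            # Multiple regression of anomaly on [1, month, lagged ENSO index].
            # Returns the time coefficient rescaled to C/century.
            y = anomalies[lag:]
            t = np.arange(len(y), dtype=float)
            X = np.column_stack([np.ones_like(t), t, enso_index[:-lag]])
            coef, *_ = np.linalg.lstsq(X, y, rcond=None)
            return coef[1] * 1200.0

        # toy example with made-up anomaly and index series
        rng = np.random.default_rng(4)
        n = 90
        mei = rng.normal(0, 1, n)
        anom = (0.0017 * np.arange(n) + 0.08 * np.roll(mei, 3)
                + rng.normal(0, 0.05, n))
        print(enso_corrected_trend(anom, mei))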

  21. A 1 W/m2 change in the solar constant is 0.25 W/m2 corrected for spherical geometry at the TOA. Now correct for albedo and you only have 0.175 W/m2 forcing at the surface. The equivalent of the IPCC’s 1.66 W/m2 would be a change in the solar constant of 9.5 W/m2, or about 0.7%.
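
    (Spelling out that arithmetic, with round standard values assumed for the solar constant and albedo:)

        solar_constant = 1366.0  # W/m^2, approximate
        albedo = 0.30            # standard round value

        delta_s = 1.0                         # change in solar constant
        toa = delta_s / 4.0                   # spherical geometry: 0.25
        surface = toa * (1.0 - albedo)        # after albedo: 0.175
        print(toa, surface)

        ghg = 1.66                            # IPCC AR4 GHG forcing, W/m^2
        equiv = ghg * 4.0 / (1.0 - albedo)    # solar-constant change needed
        print(equiv, equiv / solar_constant)  # ~9.5 W/m^2, ~0.7%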

  22. DeWitt, if you have surface forcing of -0.175 W/m2 from 2001 to 2008 (7 years) from solar variation, that would mean -2.5 W/m2 if the same trend continued all the way to 2100. That is way more than CO2 has added to radiative forcing since pre-industrial times.

    Also, if Lucia had done this last year or the year before, we would have seen very different values. And if it is done next year or the year after, we’ll probably see values like that again. It’s cherry picking to do a statistical analysis when temperature is momentarily way off trend (the beginning of 2008). Take at least 3 year averages and compare them, not individual years, because individual years can vary so much because of a single weather event.

  23. Tuukka, I just can’t resist recycling Steven Mosher’s excellent comment on this excellent blog from a couple of months ago:

    steven mosher July 12th, 2008 at 7:45 am

    I’m going to look at this in an entirely different way. It would appear that the 8 year time period between 2001 and 2008 will have a negative slope. That is, the 8 year slope will be negative, assuming that the first 6 months of 2008 is a good predictor of the last six months (June/July is a good predictor).
    Anyway, assume that at the end of 2008 we have a negative slope for the 8 year span of 2001-2008.

    What to make of this?

    1. Climate coolists: It’s the end of AGW, AGW is wrong, models are wrong, radiative physics is wrong, we are entering an ice age. That hand is Doyle Brunson’s 10-2.
    2. Climate warmists: It’s the weather. We have no explanation. It happens all the time. There is no information in an 8 year trend. None.

    Is there information in an 8 year trend? Well, one approach to that problem is to bicker about error bars. Another approach is to see in the actual record, let’s say the past 100 years, how often we see a negative trend over an 8 year period. Is it common? Is it rare? If it’s RARE, then information theory tells me it has a HIGH information content. But that’s just my take on things.

    So, I went to look at all 8 year trends from 1900 to 2007 (2008 isn’t done). Here is what you find:

    1. Every batch of them (save one) is associated with volcanic activity: in the early 1900s, in the 60s, in the 70s, in the 80s, in the 90s. If you find an 8 year negative slope in GMST, you had a volcano. This is a good thing. It tells us the science of GW understands things.

    2. The SOLE exception is the batch of 8 year negative trends in the mid 40s. Now, until recently GCMs had not been able to match these negative trends (hmm), BUT now we find that the observation record, the SST bucket/inlet problem, may be the cause of this apparent cooling trend.

    So, from 1900 to 2000, a time when CO2 was increasing, we find that on rare occasions we will see 8 year trends that are negative. The cause: volcanoes, and bad SST data.

    Now, look beyond 2000 at the last 8 years: negative trend. Any volcano? Nope. Any bucket problems? Err, nope. So for the first time in 100 years you have a negative slope that is not correlated with either volcanoes or bad observation data. That looks interesting. Wave your arms and cry weather? That’s not science. That’s like waving your arms and crying weather when it gets warmer. The appeal to ignorance. We have a cooling regime, a cooling regime that is not associated with volcanoes and not associated with data errors. I think that’s interesting and meaningful. Don’t know what it means, but it’s the kind of thing you want to investigate rather than shrug off.
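
    (Mosher’s census is easy to reproduce in outline: fit an OLS slope to every 8 year window and flag the negative ones. A minimal sketch with made-up annual data; a real run would load an observed GMST anomaly series instead.)

        import numpy as np

        def negative_8yr_trends(years, anomalies):
            # Start years of every 8-year window whose OLS trend is negative
            out = []
            for i in range(len(years) - 7):
                slope = np.polyfit(years[i:i + 8], anomalies[i:i + 8], 1)[0]
                if slope < 0:
                    out.append(years[i])
            return out

        years = np.arange(1900, 2008)
        rng = np.random.default_rng(3)
        toy = 0.007 * (years - 1900) + rng.normal(0, 0.1, years.size)
        print(negative_8yr_trends(years, toy))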

  24. Tuukka Simonen

    Yes. I’d get different results if I started last year. There are many reasons for this. One important one is that when very little data is available, the “false negative” error of the test is enormous. That is, we can’t exclude even wildly wrong theories because we don’t have enough data.

    If you are wondering why I didn’t start this test last year: I only began blogging last December, and people asked me to give this a whirl about last Jan or Feb or so. Obviously, I’m going to use all the data available up to today’s date, starting in Jan 2001, for the reasons I discussed.

    Out of curiosity, what do you mean by taking 3 year averages? Three year averages of what? Also, if you are going to create an unusual test, we’ll need to re-figure the deviation that corresponds to a confidence interval of 95%. It would change.

    For what it’s worth, I’ll continue to watch the numbers. However at any given time, it just so happens that the test can be done and the results reported. Each person may weigh the results as they wish– but currently the trend says the projections aren’t so great. It may turn out this is an outlier– we’ll all need to wait and see.

Comments are closed.