Why ‘failure to reject’ doesn’t mean much: Type 2 error.

Today, I’m going to discuss what failure to reject means when we apply a statistical test. In this discussion, I will be using the term “Statistical Power”.

In case you are wondering, this does have implications vis-à-vis Gavin’s critique of Pat Michaels’ testimony to Congress. But I will not be discussing that in detail. Instead, I will limit myself to discussing the statistical power of the tests I showed in a previous blog post.

As some readers know, in several previous posts I applied the t-test described in Santer et al. (2008) to compare simulated and observed trends in global mean surface temperature. The statistical test is designed to test this null hypothesis:

“The model-mean trend and the underlying mean trend for the earth’s surface temperature are equal to each other”.

In a recent blog post I showed that if we compare simulated and observed trends in surface temperature starting in January of various years and ending in Feb 2009, we should strongly suspect the model-mean projected trend is not equal to the trend in observations of the earth’s surface temperature. This result is encapsulated in a graph of the normalized difference between the model-mean trend and the observed trend:

Figure 1: Normalized error in multi-model trends.

(It is worth noting here that the graph above is not intended to imply that the results with different start years are independent of each other. It is designed to illustrate that the conclusion that the model mean trend and the observed trend are not equal, reached using a start year of 2001, is not a “short time” effect; that hypothesis is rejected for many choices of start year.)
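For readers who want to see the shape of the calculation behind Figure 1, here is a rough Python sketch of a Santer-style normalized trend difference, d*. It is not the code used to make the figure: the AR(1) adjustment for serial correlation and the treatment of the model ensemble are simplified assumptions.

```python
# Rough sketch of a Santer-style normalized trend difference d*; not the code
# behind Figure 1. The AR(1) adjustment and ensemble handling are simplified.
import numpy as np
from scipy import stats

def trend_and_adjusted_se(y):
    """OLS trend (per time step) with std. error inflated for AR(1) residuals."""
    t = np.arange(len(y))
    slope, intercept, r, p, se = stats.linregress(t, y)
    resid = y - (intercept + slope * t)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]            # lag-1 autocorrelation
    n_eff = len(y) * (1 - r1) / (1 + r1)                     # effective sample size
    se_adj = se * np.sqrt((len(y) - 2) / max(n_eff - 2, 1))  # inflate the std. error
    return slope, se_adj

def d_star(obs, model_runs):
    """Normalized difference between observed trend and multi-model mean trend."""
    b_obs, s_obs = trend_and_adjusted_se(obs)
    model_trends = [trend_and_adjusted_se(run)[0] for run in model_runs]
    b_mod = np.mean(model_trends)
    s_mod = np.std(model_trends, ddof=1) / np.sqrt(len(model_trends))
    return (b_obs - b_mod) / np.sqrt(s_obs**2 + s_mod**2)
```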

Can we have a confidence level with that?

Mind you, because the test described above is a statistical test, when I say we should reject the null hypothesis that the model mean is equal to the observed mean, I should be reporting the confidence level.

In any statistical test, if we select a confidence level, say p=95%, then rejection of the null hypothesis at p=95% is expected to occur (1-p)=α=5% of the time even when the hypothesis is correct. That is: all statistical tests are designed to give this sort of wrong answer a fraction α of the time.

This sort of wrong answer, where we decree the models are wrong when they are actually correct, is called “Type I” or “α” error.

If we select p=95% and the test is designed properly, we will make this sort of mistake in 5% of all possible trials. This is true whether we have only small amounts of data or large amounts of data. So, those who suggest that a rejection should be ignored on the basis of too little data or “weather noise” misunderstand what the test does: the formalism of the test accounts for the amount of “weather noise” and adjusts the confidence intervals to account for any brevity in the time span.
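If you would like to see this property in action, here is a toy Monte Carlo check in Python. It is not the Santer test: it uses plain white noise rather than autocorrelated weather noise, and the noise level, record lengths, and trial count are invented for illustration. The only point it demonstrates is that a properly calibrated test falsely rejects a true null about α=5% of the time whether the record is short or long.

```python
# Toy sketch (not the Santer et al. 2008 test): check that a two-sided trend
# test calibrated at 95% confidence rejects a TRUE null about 5% of the time,
# regardless of record length. White noise is assumed for simplicity.
import numpy as np
from scipy import stats

def false_reject_rate(n_months, noise_sd=0.1, alpha=0.05, n_trials=2000):
    """Fraction of trials in which 'trend = 0' is rejected when it is true."""
    rng = np.random.default_rng(0)
    t = np.arange(n_months)
    crit = stats.t.ppf(1 - alpha / 2, df=n_months - 2)  # two-sided critical value
    rejects = 0
    for _ in range(n_trials):
        y = rng.normal(0.0, noise_sd, n_months)         # pure noise: null is true
        slope, intercept, r, p, se = stats.linregress(t, y)
        if abs(slope / se) > crit:
            rejects += 1
    return rejects / n_trials

print(false_reject_rate(60))    # short record: roughly 0.05
print(false_reject_rate(360))   # long record: still roughly 0.05
```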

Now, I’m going to bring up a topic no one brought up in comments.

Have any of you wondered what it would mean if the Santer t-test failed to reject? It didn’t fail… but what if it had failed to reject? Would failing to reject mean the models were right?

Ehrm…. The short answer is “not necessarily!” A slightly longer answer would add, “More specifically, in the case of evaluating the trend from 2001 to now, failing to reject would not have suggested the models were correct.”

What would it suggest? It would suggest that either a) the models might be right or b) they are wrong but we don’t have enough data to prove them wrong.

That said, in fairness, it’s important to admit that sometimes failure to reject would provide strong evidence the null hypothesis is probably right. So, how do we know when failing to reject actually suggests the models are right and when it means almost nothing at all?

We can discover how by thinking of other ways in which the results of statistical tests can mislead.

Consider this: Suppose someone decides that “failure to reject” should always be interpreted to mean “the models are right”.

But now, suppose the models are actually wrong (by some amount). Then decreeing they are right would be a mistake.

This sort of mistake is called “Type II” or “β” error. When doing a test, it’s possible to compute the probability that one will make this error; we call this probability β. The statistical power of the test is η=1-β.
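For concreteness, here is a minimal numerical sketch of β and η, assuming (as elsewhere in this post) that the test statistic is normally distributed. The half-width of the acceptance band, the size of the true discrepancy, and the noise level below are made-up numbers, not values from the actual data.

```python
# Minimal sketch: Type II error (beta) and power (eta) for a two-sided test,
# assuming a gaussian test statistic. All numbers below are illustrative only.
from scipy import stats

def type_ii_error(ci_half_width, true_discrepancy, noise_sd):
    """Probability of 'fail to reject' when the alternate hypothesis is true."""
    # We fail to reject when the observed difference lands inside the +/- band;
    # under the alternate hypothesis it is centered on true_discrepancy.
    hi = (ci_half_width - true_discrepancy) / noise_sd
    lo = (-ci_half_width - true_discrepancy) / noise_sd
    return stats.norm.cdf(hi) - stats.norm.cdf(lo)

beta = type_ii_error(ci_half_width=0.02, true_discrepancy=0.01, noise_sd=0.01)
print("beta =", round(beta, 3), "power eta =", round(1 - beta, 3))
```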

I bet you are wondering…. can we calculate the magnitude of β?

As it happens, when designing statistical tests, one commonly selects a value for the confidence level (p) and consequently the value for α=1-p. So, someone more-or-less neutral about models might say:
“I’m willing to believe the model mean doesn’t match the earth’s mean if you do an experiment, perform a t-test, and reject the hypothesis that the two trends match using a confidence level somewhere between p=80% and p=95%.”

Someone who thinks models are totally worthless might substitute p=50%. Someone who thinks models are the bee’s knees might substitute p=99.999% (a level ordinarily reserved for overturning the law of gravity).

I’d been consistently using 2001 as my start year to test the hypothesis that the trends from 2001 to now for the models and observations match. I selected that year based on the date of publication of the IPCC SRES. I like p=95%. Using the trend since 2001 and applying the Santer-like t-test, I find we should reject the hypothesis that the model-mean trend and the observed trend are equal over that period.

But some readers will notice that, had I selected a different start year, d* falls inside the ±95% confidence interval for a few of the start years shown in figure 1 above. That means: had I happened to select one of those years as my “test” year, I would be reporting “fail to reject”.

Some might also suggest that if I pretended 2008 had not happened, I would get a “fail to reject”. FWIW: I think it’s silly to pretend 2008 didn’t happen, so I haven’t checked that.

I concede that for some choices of start year, we get ‘fail to reject’, and I say: Big whip! If you continue, you will see why!

What would it mean to get a “fail to reject”?
So, what does it mean that we would get “fail to reject” for some other choice of start year?

Well… to figure this out, we need to do the following:

  1. Formulate two alternate hypotheses.
  2. Decide what power we require to be convinced that “failure to reject the null hypothesis” really means the “null hypothesis is shown true.”
  3. Compute the power of the test with respect to each alternate hypothesis.
  4. Figure out the years when the trend is long enough to achieve that level of power.
  5. And finally, decide whether we believe those “failures to reject” on the graph shown above mean much.

Alternate Hypotheses
For the purpose of this analysis, I propose two alternate hypotheses, selected for illustrative purposes:

  1. The Lukewarmer Hypothesis: The true underlying trend for the earth’s surface temperature is precisely equal to 1/2 the mean model trend.
  2. The No Warming Hypothesis: The true underlying trend for the earth’s surface temperature is 0.

For the purpose of the exercise, we assume someone named “Lukewarmer” believes the first hypothesis and someone else named “The Viscount” believes the second one. We will not speculate why they believe either hypothesis or whether either is based on anything other than gut instinct. (For the purpose of the discussion, it doesn’t matter. It’s necessary to have at least one alternate, and better to have two.)

Now, let’s further assume Lukewarmer is willing to admit the result of a statistical test contradicts his hypothesis if, when applying the Santer test with p=95%, the result is “fail to reject” and the statistical power of the test, η, is greater than 95%. The level η=95% strikes Lukewarmer as fair, as it matches the p=95% he requires to diagnose a “reject”.

For brevity, we won’t guess what power the Viscount would require to convince him that a “failure to reject” suggests his hypothesis of “no warming” is contradicted by the data.

Calculate the power of the tests as a function of start year.

Once we know the alternate hypothesis, it turns out that it’s pretty easy to compute the power for any given test. Because I computed d* for all start years from 1950 to now, I computed the power under both alternate hypotheses for every case where I computed d*. I’m going to gloss over the details, except to say that to do this I a) compute the ±95% confidence interval in C/year, b) find the difference between each ±95% bound and the trend associated with the alternate hypothesis (i.e., 1/2 the model mean or 0 C/year), and c) normalize that difference by the estimate of the “weather noise” for the observational data set. With these values in hand, using the normal distribution, I find the probability that the observed result would fall outside that range. (So, yes, this involves the same assumptions as the original Santer test, including the assumption that weather noise is gaussian.)
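For those who like to see the arithmetic, here is a rough Python sketch of the sort of power calculation just described. It is not the script used for Figure 2: it folds all of the uncertainty into a single “weather noise” sigma on the observed trend (ignoring the model-ensemble spread), and the model-mean trend and noise level are invented numbers used only to show the mechanics.

```python
# Rough sketch of the power calculation described above, under the same
# gaussian assumptions. Input numbers are illustrative, not from the data set.
from scipy import stats

def power_of_test(model_mean_trend, weather_noise_sd, alt_trend, p=0.95):
    """Probability the test rejects 'model mean = observed' when alt_trend is true."""
    z = stats.norm.ppf(1 - (1 - p) / 2)                # two-sided critical value
    lower = model_mean_trend - z * weather_noise_sd    # acceptance band, C/year
    upper = model_mean_trend + z * weather_noise_sd
    # Under the alternate hypothesis the observed trend is centered on alt_trend
    # with the same weather-noise spread; power = chance it lands outside the band.
    prob_inside = (stats.norm.cdf((upper - alt_trend) / weather_noise_sd)
                   - stats.norm.cdf((lower - alt_trend) / weather_noise_sd))
    return 1.0 - prob_inside

model_mean = 0.020   # C/year, hypothetical multi-model mean trend
noise_sd = 0.007     # C/year, hypothetical weather-noise sigma for the trend
print(power_of_test(model_mean, noise_sd, alt_trend=model_mean / 2))  # Lukewarmer
print(power_of_test(model_mean, noise_sd, alt_trend=0.0))             # No warming
```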

Below is the result of computing the “power” of the test for each particular start year, done for the two different alternate hypotheses:

Figure 2: Power of t-test as a function of start year and alternate hypothesis.

Now, as a hypothetical, let’s consider the green curve indicating the power of the test if the “Lukewarmer” is actually correct. If the Lukewarmer Hypothesis is correct, then the model projections would be fairly far off. So we should expect any decent statistical test to reject the hypothesis that the models are correct, right?

Well… erhmm….

Oddly enough, if we use a confidence level of p=95% to diagnose a rejection, then for tests with start years later than 1996 we are more likely to “fail to reject” the hypothesis that the models are correct than to reject that hypothesis. That means we would conclude the models are right when they are really wrong 50% or more of the time! (The yellow lines indicate a power of 50%.)

For this reason, for start years after 1996, it is much wiser to suspect that “fail to reject” does not mean “accept as true”. Certainly, we shouldn’t “accept as true” if we think models being off by a factor of 2 is an error we would want to detect.

So, under what circumstances does the Lukewarmer have to admit that “failure to reject” means his notion is wrong?

Well, if he had decided to cling to his hypothesis until ‘failure to reject’ occurs with a test having a power of η=95% or greater (equal to the modelers’ requirement of p=95%), then the Lukewarmer is justified in refusing to interpret ‘failure to reject’ as meaning his Lukewarm hypothesis is wrong unless we see these ‘failures to reject’ when comparing trends with start dates before 1986. Why 1986? For start years prior to that year, the power curves show powers higher than 95%, as indicated by the vertical and horizontal red lines.

What about “The Viscount” who believes in “no warming”? Well, interestingly, since his alternate hypothesis is for no warming, if the t-test says “fail to reject” using a start year of 1996, he should pack up his “no warming” hypothesis. He doesn’t have to believe the models are right; he can switch to the Lukewarmer hypothesis… or something.

Notice that, so far, the discussion of the power of the test did not refer to the actual test result shown in figure 1. That’s because the result of the t-test itself is not involved in determining the power. We did not need to know the actual observed trends to compute the power (though we did need an estimate of the magnitude of the “weather noise”).

Now, let’s put lines for the years associated with 95% power on the d* graph.

Now, knowing that “Lukewarmer” gets to
a) crow about any and all rejections at p=95% (provided the method is fair and he didn’t cherry pick his choice of start year) and
b) ignore all “failures to reject” for start years after roughly 1986, let’s look at figure 1 again:

Figure 3: Normalized difference between models and observations with extra information.

I think Lukewarmer is feeling pretty good about now. There are plenty of rejections at p=95%. So, “rejecting” the null hypothesis that the models are right is advisable based on data collected after the scenarios used to drive the models were published (in 2001).

The rejection of “model mean trend = observed mean trend” happens for the majority of start years. This tends to clear Lukewarmer of the accusation of cherry picking.

Though his case is well supported if we include all models, he can suggest it would be wise to ignore model runs that failed to account for the cooling effect of eruptions like Agung or Fuego. If others agree, Lukewarmer’s case looks even better.

Lukewarmer can also note that there are many strong rejections for shorter trends when it would be very, very unlikely to get rejections unless the models are wrong.

He can easily counter anyone who makes the absurd suggestion that, if we pretend 2008 didn’t happen and thereby get a “fail to reject”, we should take that failure seriously. In reality, a failure to reject for such short trends would mean only this: even if the models overpredict warming by a factor of 2, we would expect to get “fail to reject” in 3 out of 4 randomly chosen tests of 98-month trends. So, if we happen to get a failure to reject by pretending 2008 (i.e., the only year of data that came in after the IPCC AR4 was published) didn’t happen: big whip. This does not suggest the models are right, nor does it impugn the “rejection” based on the data we actually have.

In short: Considering the power of this test, the model mean seems pretty well inconsistent with the observations.

Oh… did I lose track of the Viscount? You get to decide what this proves to the Viscount.

Note:

My father is ailing, so I will be leaving for Florida on Monday and returning the following Tuesday. Jim is staying home to mind the home front, but that doesn’t include moderating comments. Comments will be moderated while I am away.

13 thoughts on “Why ‘failure to reject’ doesn’t mean much: Type 2 error.”

  1. Remember GIGO? I think you’ll find that the Continental USA Class 1 stations will show that the GISS/HADCRU and derived Models are completely wrong, and the Viscount is completely correct in that there has been no warming.

  2. Hope for both our sakes the weather is nice when you come down here! And I hope your dad’s okay, too.

    You know, it seems to me that by the way hypothesis testing works in the first place a failure to reject never means much. After all, the claim is weak, merely saying “there is not enough evidence to conclude that the null hypothesis is wrong”. It doesn’t say the null is correct, which would be a strong claim.

  3. The power of the test should be going down to zero as you go backward in time, as we have every reason to believe that the modelers tried thousands of modeling details, selecting their ultimate model by its fit to historic data.

  4. Anyone who grew up in the 50’s and 60’s knows there has been warming. The questions:

    1. has the amount of warming been over-estimated/measured?
    2. Is CO2 the driver?
    3. Have we entered a prolonged period of cooling?
    4. Do we need to take drastic action?

    Damfino.

    Lucia, thoughts and prayers for your father.

  5. To all with best wishes: To clarify… Dad didn’t have any sort of crisis. My aunt Chi-chi phoned and is worried about his memory. So… I’m going down to check on him in person, haul him off to a doctor, make sure he has things arranged in good order, etc. If he looks bad, then we may need to do something (like nag him to move back to Illinois where we can visit him regularly, as we do my in-laws who come for dinner twice a week.)

    But I anticipate he will be mostly ok. (Prayers still welcome!)

  6. Regarding power analysis: if scientists had to show a power analysis for every statistical test they used, many journals would go out of business. It’s virtually unknown in population biology and ecology, although the matter has been brought up in the ecological literature. Sample sizes and test groups are typically just large enough to fill the available logistical resources. As a result, everyone just agrees to ignore the issue entirely.
