Today, I’m going to engage Paul_K’s doubt:
Hi Lucia,
I am a bit suspicious of your assertion that you can form a pooled statistic to make a more powerful test:-
Guess what? Paul is correct. I… ahem… can’t create a more powerful test by creating a pooled statistic. At least, it appears I can’t create a statistic that is more powerful than the more powerful of the two individual statistics. That’s what I’d hoped for. I thought I might be able to do it– but I was… well… wrong.
As some of you know, I partly engaged it in What’s uncorrelated with what? For PaulK. In that post, I managed to show that I was right about something: if we center the ‘time’ data and perform a linear fit, the errors in the estimates of the ‘intercept’ and ‘trend’ are statistically independent. Carrick and Julio confirmed this analytically.
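(For reference, this is just the standard textbook OLS result, not something new to that post: for a fit $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with independent errors of variance $\sigma^2$,

$$
\operatorname{Cov}(\hat\beta_0, \hat\beta_1) \;=\; \frac{-\,\sigma^2\,\bar{x}}{\sum_i (x_i - \bar{x})^2},
$$

so centering the time data ($\bar{x}=0$) makes the covariance vanish, and the fitted intercept is simply $\bar{y}$.)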
So, it turns out some of what I thought was true. But I overlooked something, and Paul_K’s intuition was more correct: I can’t just make a more powerful test.
So, now, to show a few things about relative power.
As some readers know, people often apply a test, report the ‘p’ value, and decree that a result is statistically significant at some value of p– typically 5%. They might also report that a particular observation is not statistically significant. In that case it would often be useful to report the statistical power of the test, so that the reader can gauge whether ‘not statistically significant’ should be interpreted as meaning anything. Unfortunately, the power is rarely reported.
Failure to report power is understandable, however: it depends on many things. Specifically, to compute the power of a test, the analyst needs to make all the assumptions used to compute the ‘p’ value and, in addition, needs to compute power as a function of the value of some parameter in an alternate hypothesis.
For example: I might test a null hypothesis: “Observed warming will occur at a rate of m=0.2C/decade.” I can compare that to data and, making some assumptions about the residuals to a linear fit, report whether the difference between the observed warming and 0.2C/decade is statistically significant. I can compute a ‘p’ value. If it’s less than 5%, I can report the difference was statistically significant at a confidence level of 95%. (Then arguing about my statistical model can begin. Nevertheless, the exercise of putting the numbers through the crank is done.)
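For anyone who wants to put the numbers through the crank themselves, here is a minimal sketch in Python of that sort of test. It is not the code behind my figures; it assumes monthly data and white-noise residuals:

```python
# Minimal sketch: test a fitted trend against a null of 0.2 C/decade,
# assuming white-noise residuals. Illustrative only, not the blog's actual code.
import numpy as np
from scipy import stats

def trend_test(y, null_trend=0.2, alpha=0.05):
    """y: monthly anomalies (C). Returns fitted trend (C/decade), p value, and reject flag."""
    n = len(y)
    t = np.arange(n) / 120.0          # time in decades (120 months per decade)
    t = t - t.mean()                  # center time: intercept and trend errors are independent
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (intercept + slope * t)
    se_slope = np.sqrt(resid.var(ddof=2) / np.sum(t**2))
    t_stat = (slope - null_trend) / se_slope
    p = 2 * stats.t.sf(abs(t_stat), df=n - 2)
    return slope, p, p < alpha        # trend, p value, 'significant at alpha'
```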
But suppose I get the result “not statistically significant”. Someone might want to interpret this as “the warming really is happening at a rate of 0.2C/decade”, or “it’s very probable warming is happening at a rate of 0.2C/decade” or something similar. That’s not what “not statistically significant” means. Mind you: It sometimes means that, but sometimes it merely means “You just don’t have enough data to tell.”
To distinguish the two situations, we can compute the statistical power of the test. The first step is to specify an alternate hypothesis. One possible candidate for the alternate hypothesis might be: “Warming is really happening at a rate of 0.10C/decade.” Once I’ve selected this, I can compute the power by:
Creating ‘N’ months of synthetic data with a trend of 0.1C/decade and ‘noise’ with the properties I’d assumed when testing the null hypothesis of 0.2C/decade, then testing whether the trend for this synthetic data differed from 0.2C/decade and whether the difference was statistically significant at some level ‘p’. I’d then repeat this test a bajillion times and report the rate at which I found the trend was statistically significant. This rate is called the “statistical power” and falls between p and 100%. Note, however, that for completeness the reader needs to know that the numerical value depends on both the alternate hypothesis (i.e. 0.1 C/decade) and the ‘p’ value.
As an example: Suppose I’d run a test and discovered the residuals to a linear fit were white noise with a standard deviation of ±0.1 C. I could generate 120 data points with a trend of 0.1 C/decade and find that, for this type and level of ‘noise’, if the real trend is 0.1 C/decade (i.e. 0.1 C/dec below the null), I should be able to detect the difference in a bit more than 85% of the realizations that actually happen.
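Here is a minimal Monte Carlo sketch of that example (again, a sketch assuming white noise, not the code behind my figure):

```python
# Monte Carlo estimate of power for the example above: white noise with sd 0.1 C,
# 120 monthly points, null trend 0.2 C/decade, true trend 0.1 C/decade,
# two-sided test at p = 5%. A sketch, not the code behind the figure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)

def rejects_null(y, t, null_trend, alpha=0.05):
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (intercept + slope * t)
    se = np.sqrt(resid.var(ddof=2) / np.sum((t - t.mean())**2))
    p = 2 * stats.t.sf(abs(slope - null_trend) / se, df=len(y) - 2)
    return p < alpha

n, sigma, true_trend, null_trend = 120, 0.1, 0.1, 0.2
t = np.arange(n) / 120.0              # decades
trials = 10_000
hits = sum(rejects_null(true_trend * t + rng.normal(0.0, sigma, n), t, null_trend)
           for _ in range(trials))
print("estimated power:", hits / trials)   # comes out a bit above 85%, as described above
```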
I could start to create a graph. In this case, the point I just discussed corresponds to the upper-left most ‘1’ symbol in the graph below:
I could then repeat the computation at -0.09 C/dec below 0.2 C/dec, add the next ‘1’ to the right, and continue. Notice that when I run the test at 0.0 C/dec below 0.2 C/dec, I have a ‘power’ of 5%. This is the false positive rate. That is: this is the rate at which I ‘reject’ 0.2C/decade even though it is right. That’s what the ‘p’ value of 5% means!
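If you’d rather skip the Monte Carlo, a normal-approximation cross-check (my own shortcut, not the method behind the figure) gives essentially the same curve for this white-noise example:

```python
# Normal-approximation power of the two-sided trend test as a function of how far
# the true trend sits below the 0.2 C/decade null, for the same white-noise example.
import numpy as np
from scipy import stats

n, sigma, alpha, null_trend = 120, 0.1, 0.05, 0.2
t = np.arange(n) / 120.0
se = sigma / np.sqrt(np.sum((t - t.mean()) ** 2))   # standard error of the fitted trend
z_crit = stats.norm.ppf(1 - alpha / 2)
for delta in np.arange(-0.10, 0.001, 0.01):         # true trend minus null (C/decade)
    power = stats.norm.cdf(abs(delta) / se - z_crit) + stats.norm.cdf(-abs(delta) / se - z_crit)
    print(f"true trend {null_trend + delta:.2f} C/decade: power ~ {power:.0%}")
# at delta = 0 the 'power' equals alpha = 5%: the false positive rate described above
```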
Now, I’m pretty sure some of you have gathered that the ‘1’ symbols indicate the statistical power to reject trends in this particular numerical experiment. (Please bear in mind, I did not use a noise model that describes the residuals of observations. So, the curve is purely qualitative.)
Some of you will also notice traces ‘2’ and ‘3’. Trace ‘2’ is the statistical power I get if I test whether the 120 month mean differs from the ‘baseline’ created from the 240 data points immediately preceding it. Notice that in all cases trace ‘2’ has more power than trace ‘1’. This means it’s a more powerful test– and so I think it’s a better test to use!
(The reason I had not been using it is that the test also involves data from the baseline, which was collected prior to the forecasting period. But I’m leaning toward thinking that is not a good reason to favor the test of trends.)
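For concreteness, here is a sketch of how a test like ‘2’ could be set up. This is my reconstruction, assuming white noise and assuming the null trend applies across both the baseline and forecast windows; the exact construction behind the figure may differ:

```python
# Sketch of test '2' as described above (a reconstruction, not the blog's code):
# does the mean of the forecast months differ from the baseline mean by the amount
# a 0.2 C/decade trend would predict? Assumes white noise and assumes the null
# trend applies across both the baseline and forecast windows.
import numpy as np
from scipy import stats

def mean_vs_baseline_test(baseline, forecast, null_trend=0.2, alpha=0.05):
    """baseline, forecast: numpy arrays of monthly anomalies (C)."""
    n_b, n_f = len(baseline), len(forecast)
    # expected shift in means under the null: trend times the gap between the
    # midpoints of the two windows (in decades; 120 months per decade)
    expected_diff = null_trend * (n_b / 2 + n_f / 2) / 120.0
    observed_diff = forecast.mean() - baseline.mean()
    # pooled white-noise estimate of the standard error of the difference in means
    s2 = np.concatenate([baseline - baseline.mean(),
                         forecast - forecast.mean()]).var(ddof=2)
    se = np.sqrt(s2 * (1.0 / n_b + 1.0 / n_f))
    t_stat = (observed_diff - expected_diff) / se
    p = 2 * stats.t.sf(abs(t_stat), df=n_b + n_f - 2)
    return p, p < alpha
```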
Now for the part where I reveal how we know Paul_K was right to doubt my suggestion that I could make a more powerful test by combining the parameters used in tests ‘1’ and ‘2’. The power of the combined test is shown with symbols ‘3’. Note that it’s almost as powerful as ‘2’, but its power always lies between that of ‘1’ and ‘2’. So, based on power, ‘2’ is better.
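The exact pooled statistic isn’t spelled out above, but if it amounts to a linear combination of the forecast-period mean $\bar{y}$ and the fitted trend $\hat{m}$ (Paul_K’s comment below reads it that way), then the independence of their errors (centered time, white noise) means the variances simply add:

$$
\operatorname{Var}(a\,\bar{y} + b\,\hat{m}) \;=\; a^2\,\operatorname{Var}(\bar{y}) + b^2\,\operatorname{Var}(\hat{m}).
$$

Whether such a combination gains power depends on how the shifts under the alternative line up with those two error variances; the curves show that, here, it doesn’t beat test ‘2’.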
I’ve said in the past that one should favor the more powerful test– unless one can identify a very good reason to favor another test. This strongly suggests that for testing short-term trends I should switch to the test of the ‘N month means’– unless I can think of a good reason not to. I’m pondering a bit.
Reasons I can think of to stick to testing trends:
- Testing trends is always possible. Testing the means can only be done if the test involves a series of data outside the baseline. So, I can test IPCC projections described relative to the 1980-1999 baseline if I limit tests to start dates after 2000. But I can’t use that baseline for tests with earlier start dates. (This is not a big deal, as I consider the forecast periods to start in 2001. But it matters if someone wants to know how the answer changes if I use an earlier start date.)
- I’ve been testing trends. So, people might wonder if I’m picking a test that gives an answer I “like” better. (Let’s face it, given the range of people out there, I’ll get this from one ‘side’ or the ‘other’.)
The reason I can think of for favoring the combined metric: it is a bit more robust to cherry-picking the spot in the ENSO cycle. I know that if I pick a start date during a La Nina (i.e. 2000), I get a test that gives a lower ‘N month’ mean. If I choose a start date of 2001, I get a higher trend. So, if I “want” to reject, I use a start date of 2000 for the “N month mean” test and a start date of 2001 for the ‘trend’ test. If I “want” to fail to reject, I make the opposite choice. The combined test falls in between. Since this test is almost as powerful as the ‘mean’ test, it might be a useful happy medium.
Of course, as we get more data, which test is chosen no longer matters. Nevertheless, the general rule is to pick the more powerful test. This reduces the overall error rate given the available data. As for what I’m going to do: same thing I was planning to do anyway: report the results of all three tests for a while. I’m planning to start making tables showing results with start dates of 2000 and 2001. 🙂
Oh. And to remind people, Paul_K was right. I can’t create a more powerful statistic by pooling two statistics. Darn! Still, I think I have a useful statistic, and you’ll see it reported.

“I’ve said in the past that one should favor the more powerful test– unless one can identify a very good reason to favor another test.”
The issue is that each test is, to some extent, making a different claim (or rather a claim about a claim about a hypothesis). The difference in meaning might be subtle, but at times you need to choose a less powerful test because it is closer to what you are trying to say about the data.
I’m trying to think of a good non-climate example.
Nyq–
Of course you have to pick a test that addresses the question you are interested in addressing. I assume that’s a given and admit I didn’t say it. So, yes, if one test addresses a question you are interested in and another test does not, then you pick the test that addresses the question of interest. This trumps any argument about power. I’m sure if you hunt around you can find examples.
But I think in this case, if the true, honest-to-goodness question is “Is the temperature anomaly rising at a rate equal to the projections of IPCC models?”, both the ‘N month mean’ test and the ‘N month trend’ test address that general question. Given this, selecting between the two based on power makes sense.
Surely, a priori, you don’t know what slope you are looking for? As I understand it, there are a large number of estimates as to what the slope should be.
Is there not information in the ‘noise’? Whenever I fit, I always look at the residuals, both naked and as the sum of squares. If you can see information in the noise, then your fit has failed.
Doc,
What do you mean by “what slope you are looking for?” I’m not looking for any slope. But if I test a hypothesis, I know what null hypothesis I am testing. I know because I pick the theory I am going to test. So, clearly, I “know” what my null is. The null is the null and that’s unchanged by the “slope I am looking for.”
Sure. But I don’t need to delve into figuring out that information to state the power of a test.
What in the world does this have to do with figuring out the power, contingent on the fit being good?
Hi Lucia,
Sorry for the delay, but I’ve been off the air for a couple of days. Thanks for the nice compliments, but you are overestimating my prescience.
The bad news is that I still think that your combined test is conceptually challenged.
We are agreed that (for simple OLS of y on x and white noise) there is no correlation between errors in the intercept and errors in the gradient when the data are mean-centred on the x’s. However, this is not the same as saying that the two RVs are not correlated. Indeed, given the nature of the specific data it is apparent that there must be direct correlation between ybar and the gradient m.
If you start off with the same historic data so that the temperature at the start of prediction (time t=0) is nailed down, then a large positive value of m (in the observed data) implies a large value of ybar (in the observed data) and vice-versa. Hence, at the least your calculation of variance of (ybar plus gradient) needs to account for the covariance between them.
However, I don’t think that the problem stops there. I think that the distribution of the new RV (ybar plus gradient) is asymmetric and therefore not appropriate for a t-test outside certain bounds.
I’ll try to find a little time this evening to clarify the above.
Paul
Paul_K
Of course it’s not the same. First: I assume by RV you mean “real values”? As far as the concept goes: if the deterministic component of temperature rise is linear in time, the RV’s are perfectly correlated. That means these are not independent tests except insofar as the errors are uncorrelated. That is: if we collected an infinite amount of data, we expect all three tests to give the same answer. But when the amount of data is finite, they may give different answers owing to the difference in the errors.
I only assume the errors are uncorrelated and in fact assume the RV’s are correlated– possibly perfectly so. If the RV’s and errors did not have these properties, I would expect pooling to have zero utility.
Yes. Were this not so, it would make no sense to pool the tests. I’d just report independent tests, each of which answered a different question. The question I am looking into by comparing the power of the three tests is: of these three possible tests– all of which to some extent test exactly the same thing– which is most powerful? If the RV’s were not strongly correlated, then it wouldn’t make sense to ask which is most powerful. It wouldn’t make sense to try to pool the parameters because, after pooling, I wouldn’t know what was being tested.
What do you mean by “asymmetric”? Anyway, assuming I’ve correctly interpreted what you mean by RV, the RV’s are deterministic. So they don’t have probability density functions or distributions.
I guess the main issue I don’t understand is why you think the fact that the RV’s test the same thing (and so are correlated) is a problem that precludes pooling. I not only don’t think it’s a problem, I think I can’t pool unless the RV’s test the same thing. Ideally, I want the RV’s to be perfectly correlated, but I also want something that reduces the total error in the parameter I test. Turns out pooling doesn’t do that– but it’s not the correlation between the RV’s that causes the problem.
Hi Lucia,
RV = Random Variable. Sorry for the UFA – Unidentified Flying Acronym.
Actually please ignore my previous message entirely. I found an error in my logic. (Gulp.) Sorry for the misdirection.
Paul
Ahhh! Ok. Well… anyway, you were right. I couldn’t get more power… 😉