{"id":15767,"date":"2011-06-24T15:50:30","date_gmt":"2011-06-24T21:50:30","guid":{"rendered":"http:\/\/rankexploits.com\/musings\/?p=15767"},"modified":"2011-06-24T15:50:30","modified_gmt":"2011-06-24T21:50:30","slug":"relative-statistical-power-of-3-tests","status":"publish","type":"post","link":"https:\/\/rankexploits.com\/musings\/2011\/relative-statistical-power-of-3-tests\/","title":{"rendered":"Relative Statistical Power of 3 tests."},"content":{"rendered":"<p>Today, I&#8217;m going to engage <a href=\"http:\/\/rankexploits.com\/musings\/2011\/noaa-may-cooler-than-april\/\">Paul_K&#8217;s doubt:<\/a><\/p>\n<blockquote><p>Hi Lucia,<br \/>\nI am a bit suspicious of your assertion that you can form a pooled statistic to make a more powerful test:-<\/p><\/blockquote>\n<p>Guess what? Paul is correct. I&#8230; ahem&#8230; can&#8217;t create a more powerful test by creating a pooled statistic.  At least, it appears I can&#8217;t create a statistic that is more powerful than the more powerful of the two statistics. That&#8217;s what I&#8217;d hoped for. I thought I might be able to do it&#8211; but I was&#8230; well&#8230; wrong. <\/p>\n<p>As some of you know, I partly engaged it in <a href=\"http:\/\/rankexploits.com\/musings\/2011\/whats-uncorrelated-with-what-for-paulk\/\">What&#8217;s uncorrelated with what? For PaulK<\/a>. In that post, I managed to show that I was right about something: that if we center the &#8216;time&#8217; data and perform a linear fit, the errors in the estimates of the &#8216;intercept&#8217; and &#8216;trend&#8217; are statistically independent.  Carrick and Julio confirmed this analytically.<!--more--><\/p>\n<p>So, it turns out some of what I thought was true. But it turns out I overlooked something, and Paul_K&#8217;s intuition was more correct. 
I can&#8217;t just make a more powerful test.<\/p>\n<p>So, now, to show a few things about relative power.<\/p>\n<p>As some readers know, people often apply a test and report the &#8216;p&#8217; value, and decree that a result is statistically significant at some value of p&#8211; typically 5%.  They might also report that a particular observation is not statistically significant. In this case it would often be useful to report the <i>statistical power<\/i> of the test, so that the reader can gauge whether &#8216;not statistically significant&#8217; should be interpreted as meaning anything.  Unfortunately, the power is rarely reported.<\/p>\n<p>Failure to report power is understandable, however: it depends on many things.  Specifically: to compute the power of a test, the analyst needs to make all the assumptions used to compute the &#8216;p&#8217; value and, in addition, needs to compute power as a function of the value of some parameter in an <i>alternate<\/i> hypothesis.  <\/p>\n<p>For example: I might test a null hypothesis: &#8220;Observed warming will occur at a rate of m=0.2C\/decade.&#8221;  I can compare that to data and, making some assumptions about the residuals to a linear fit, report whether the difference between the observed warming and 0.2C\/decade is statistically significant. I can compute a &#8216;p&#8217; value. If it&#8217;s less than 5% I can report this was statistically significant at a confidence level of 95%.  (Then arguing about my statistical model can begin. Nevertheless, the exercise of putting the numbers through the crank is done.)<\/p>\n<p>But suppose I get the result &#8220;not statistically significant&#8221;.  Someone might want to interpret this as &#8220;the warming really is happening at a rate of 0.2C\/decade&#8221;, or &#8220;it&#8217;s very probable warming is happening at a rate of 0.2C\/decade&#8221; or something similar.  That&#8217;s not what &#8220;not statistically significant&#8221; means. 
Mind you: It <i>sometimes<\/i> means that, but sometimes it merely means &#8220;You just don&#8217;t have enough data to tell.&#8221;<\/p>\n<p>To distinguish the two situations, we can compute the <i>statistical power<\/i> of the test.  The first step is to specify an <i>alternate<\/i> hypothesis.   One possible candidate for the alternate hypothesis might be: &#8220;Warming is really happening at a rate of 0.10C\/decade.&#8221; Once I&#8217;ve selected this, I can compute the power by:<\/p>\n<p>Creating &#8216;N&#8217; months of synthetic data with a trend of 0.1C\/decade and &#8216;noise&#8217; with the properties I&#8217;d assumed when testing the null hypothesis of 0.2C\/decade, then testing whether the trend for this synthetic data differed from 0.2C\/decade and whether the difference was found to be statistically significant at some level &#8216;p&#8217;.  I&#8217;d then repeat this test a bajillion times and report the rate at which I&#8217;d found the trend was statistically significant. This rate is called the &#8220;statistical power&#8221; and would fall between p and 100%.  Note however that, for completeness, the reader needs to know that the numerical value depends on both the alternate hypothesis (i.e. 0.1 C\/decade) and the &#8216;p&#8217; value. <\/p>\n<p>As an example: Suppose I&#8217;d run a test and discovered the residuals to a linear fit were <i>white noise<\/i> with a standard deviation of &#177;0.1 C. I could generate 120 data points with a trend of 0.1 C\/decade and find that, for <i>this<\/i> type and level of &#8216;noise&#8217;, if the real trend is 0.1 C\/decade (i.e. 0.1C\/dec less than the null), I should be able to distinguish it from the null in a bit more than 85% of realizations.  <\/p>\n<p>I could start to create a graph.  
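<\/p>\n<p>For anyone who wants to put numbers like that through the crank themselves, the Monte Carlo procedure just described can be sketched in a few lines of Python. This is a minimal sketch under the assumptions stated above (white noise with standard deviation 0.1 C, 120 monthly points, null trend 0.2C\/decade); the function name and the normal approximation to the 5% threshold are my choices for illustration, not anything from the original analysis:<\/p>\n
```python
import numpy as np

# Monte Carlo estimate of the power of the OLS trend test.
# Assumed setup (illustrative, matching the example above):
# white noise, sd 0.1 C; 120 monthly points; null trend 0.2 C per decade.
def trend_test_power(true_slope, null_slope=0.2, n_months=120,
                     sigma=0.1, n_sims=4000, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(n_months) / 120.0   # time in decades
    x = t - t.mean()                  # centered time
    sxx = (x ** 2).sum()
    rejections = 0
    for _ in range(n_sims):
        y = true_slope * t + rng.normal(0.0, sigma, n_months)
        b = (x * y).sum() / sxx       # OLS slope (x is centered)
        resid = y - y.mean() - b * x
        se = np.sqrt((resid ** 2).sum() / (n_months - 2) / sxx)
        # two-sided test of H0: slope == null_slope at roughly 5%
        if abs(b - null_slope) / se > 1.96:
            rejections += 1
    return rejections / n_sims
```
\n<p>With these settings the routine reports power a bit above 85% for a true trend of 0.1C\/decade, and about 5% when the true trend equals the null&#8211; which is just the false positive rate.<\/p>\n<p>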
In this case, the point I just discussed corresponds to the upper-left most &#8216;1&#8217; symbol in the graph below:<\/p>\n<p><a href=\"http:\/\/rankexploits.com\/musings\/wp-content\/uploads\/2011\/06\/StatPower_White.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/rankexploits.com\/musings\/wp-content\/uploads\/2011\/06\/StatPower_White-500x500.png\" alt=\"\" title=\"StatPower_White\" width=\"500\" height=\"500\" class=\"aligncenter size-medium wp-image-15772\" srcset=\"https:\/\/rankexploits.com\/musings\/wp-content\/uploads\/2011\/06\/StatPower_White-500x500.png 500w, https:\/\/rankexploits.com\/musings\/wp-content\/uploads\/2011\/06\/StatPower_White-300x300.png 300w, https:\/\/rankexploits.com\/musings\/wp-content\/uploads\/2011\/06\/StatPower_White.png 1008w\" sizes=\"auto, (max-width: 500px) 100vw, 500px\" \/><\/a><\/p>\n<p>I could then repeat the computation at 0.09C\/dec below 0.2C\/dec, add the next &#8216;1&#8217; to the right, and continue. Notice that when I run the test at 0.0C\/dec below 0.2 C\/dec, I have a &#8216;power&#8217; of 5%. This is the false positive rate. That is: this is the rate at which I &#8216;reject&#8217; 0.2C\/decade <i>even though it is right<\/i>. That&#8217;s what the &#8216;p&#8217; value of 5% means!<\/p>\n<p>Now, I&#8217;m pretty sure some of you have gathered that the &#8216;1&#8217; symbols indicate the statistical power to reject trends in this particular numerical experiment. (Please bear in mind, I did <i>not<\/i> use a noise model that describes the residuals of observations. So, the curve is purely qualitative.)<\/p>\n<p>Some of you will also notice traces &#8216;2&#8217; and &#8216;3&#8217;.  Trace &#8216;2&#8217; is the statistical power I get if I test whether the 120 month mean differs from the &#8216;baseline&#8217; created from the 240 data points immediately preceding it.  Notice that in all cases, trace &#8216;2&#8217; has more power than trace &#8216;1&#8217;.  
This means it&#8217;s a more powerful test&#8211; and so I think it&#8217;s a better test to use! <\/p>\n<p>(The reason I had not been using it is that the test also involves data from the baseline, which was collected prior to the forecasting period. But I&#8217;m leaning toward thinking that is not a good reason to favor the test of trends.)<\/p>\n<p>Now for the part where I reveal how we know Paul_K was right to doubt my suggestion that I could make a more powerful test by combining the parameters used in tests &#8216;1&#8217; and &#8216;2&#8217;. The power of the combined test is shown with symbols &#8216;3&#8217;.  Note that it&#8217;s <i>almost<\/i> as powerful as &#8216;2&#8217;, but its power always lies between &#8216;1&#8217; and &#8216;2&#8217;.    So, based on power, &#8216;2&#8217; is better.<\/p>\n<p>I&#8217;ve said in the past that one should favor the more powerful test&#8211; unless one can identify a <em>very<\/em> good reason to favor another test.  This strongly suggests that for testing short-term trends I should switch to the test of the &#8216;n month means&#8217; &#8212; unless I can think of a good reason not to.  I&#8217;m pondering a bit. <\/p>\n<p>Reasons I can think of to stick to testing trends:<\/p>\n<ol>\n<li>Testing trends is <i>always<\/i> possible. Testing the means can only be done if the test involves a series of data outside the baseline. So, I can test IPCC projections described relative to the 1980-1999 baseline if I limit tests to start dates after 2000.  But I can&#8217;t use that baseline for tests using earlier start dates. (This is not a big deal, as I consider the forecast periods to start in 2001. But it matters if someone wants to know how the answer changes if I use an earlier start date.)<\/li>\n<li>I&#8217;ve <i>been<\/i> testing trends. So, people might wonder if I&#8217;m picking a test that gives an answer I &#8220;like&#8221; better. 
(Let&#8217;s face it, given the range of people out there, I&#8217;ll get this from one &#8216;side&#8217; or the &#8216;other&#8217;.)<\/li>\n<\/ol>\n<p>The reason I can think of for favoring the combined metric: It is a bit robust to cherry-picking the spot in the ENSO cycle.  I <i>know<\/i> that if I pick a start date during a La Nina (i.e. 2000), I get a test that gives a lower &#8216;N month&#8217; mean. If I choose a start date of 2001, I get a higher trend. So, if I &#8220;want&#8221; to reject, I use a start date of 2000 for the &#8220;N month mean&#8221; test and a start date of 2001 for the &#8216;trend&#8217; test. If I &#8220;want&#8221; to fail to reject, I make the opposite choices. The combined test falls in between.   Since this test is almost as powerful as the &#8216;mean&#8217; test, it might be a useful happy medium.<\/p>\n<p>Of course, as we get more data, which test I choose no longer matters.  Nevertheless: the general rule is to pick the more powerful test.  This reduces the overall error rate given the available data.   As for what I&#8217;m going to do: Same thing I was planning to do anyway: Report the results of all three tests for a while.  I&#8217;m planning to start making tables showing results with start dates of 2000 and 2001. \ud83d\ude42 <\/p>\n<p>Oh. And to remind people, Paul_K was right.  I can&#8217;t create a more powerful statistic by pooling two statistics. Darn! Still, I think I have a useful statistic, and you&#8217;ll see it reported. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today, I&#8217;m going to engage Paul_K&#8217;s doubt: Hi Lucia, I am a bit suspicious of your assertion that you can form a pooled statistic to make a more powerful test:- Guess what? Paul is correct. I&#8230; ahem&#8230; can&#8217;t create a more powerful test by creating a pooled statistic. 
At least, it appears I can&#8217;t create &hellip; <a href=\"https:\/\/rankexploits.com\/musings\/2011\/relative-statistical-power-of-3-tests\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Relative Statistical Power of 3 tests.<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-15767","post","type-post","status-publish","format-standard","hentry","category-statistics"],"_links":{"self":[{"href":"https:\/\/rankexploits.com\/musings\/wp-json\/wp\/v2\/posts\/15767","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rankexploits.com\/musings\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rankexploits.com\/musings\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rankexploits.com\/musings\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/rankexploits.com\/musings\/wp-json\/wp\/v2\/comments?post=15767"}],"version-history":[{"count":0,"href":"https:\/\/rankexploits.com\/musings\/wp-json\/wp\/v2\/posts\/15767\/revisions"}],"wp:attachment":[{"href":"https:\/\/rankexploits.com\/musings\/wp-json\/wp\/v2\/media?parent=15767"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rankexploits.com\/musings\/wp-json\/wp\/v2\/categories?post=15767"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rankexploits.com\/musings\/wp-json\/wp\/v2\/tags?post=15767"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}