Pooled Method of Testing a Projection

The purpose of this post is to discuss a method of testing a projection that I think has greater power than testing trends alone. (Yes, Paul_K. We discussed this before. I now think it can be done!! 🙂 ) To illustrate, I’m going to walk through how I might test a hypothetical prediction. Suppose I’ve been winning 0 quatloos for a long time, and there is no trend in my quatloo winnings. If I do nothing, most people assume I’m going to continue to win nothing. Below, to the left of the vertical red line I have plotted my quatloo winnings in each bet (blue circles), along with the mean value and the trend in time (solid blue line):


The dashed lines represent the standard error in the mean quatloo winnings per bet.

To the right of the red line is my prediction of the number of quatloos I will earn in the future if I don’t come up with some nifty system to improve my betting performance.

As you can see, in this example, I am predicting the future “Quatloo anomaly” will be zero. The uncertainty intervals are the uncertainty in the mean over the baseline (left) and the uncertainty in the mean over the forecast period which has not yet occurred.

We’ll now pretend I concoct a ‘system’ that I think will improve my ability to win Quatloos. It will take practice, so I think my winnings will tend to increase at a rate Q(i) = m*i, where ‘m’ is the trend. Of course, my null hypothesis remains m=0, and I don’t really know how large ‘m’ is. I merely hope it’s positive.

I’m going to use my system for 120 plays. At the end of that time, I want to test the null hypothesis that

  1. I will earn 0 Quatloos each month in the future period. That is E[Q]=0
  2. The trend m is zero. That is m=0.

I can test either using an appropriate ‘t’ test. Each would involve finding a ‘t’ value for the data in the upcoming period and comparing that to a Student’s t-distribution. I’ll call the two t values t_mean and t_trend.

I now have two possible tests. Strictly speaking, I am testing two separate hypotheses, both of which are related since they amount to “the system doesn’t work”. So, I really only want one test.

If I’m limited to one test or the other, what I’d really like is to pick the more statistically powerful of the two. That is: when applied at the same statistical significance (that is, a pre-defined type I error rate), I’d like the test that results in the greatest statistical power (which is the same as the lowest type II error rate).

Oddly enough, I also know that I can show the errors in both tests are independent. That is, if the null hypotheses are true and both m=0 and E[Q]=0, then the slight deviations in the actual ‘m’ and average quatloo winnings I will observe during the upcoming period are statistically independent. (This was discussed here.)
So the fact that there are two methods and the errors in the two methods are statistically independent leads me to an interesting situation.

There should be some method of combining the tests to create a pooled test that is more powerful than either alone. That is, there should be a method of creating a pooled

t_pooled = w_trend * t_trend + w_mean * t_mean

with weights selected such that the standard deviation of the pooled t is 1 and the statistical power of the test is maximized. (Note that the standard deviations of t_trend and t_mean are one by definition. That’s a requirement for t values.)

So, I concocted the following rule, which seems to work (a rough R sketch follows the list):

  1. Define an effect size for an alternate hypothesis: that is, the hypothesis that might be true instead. This can be any value other than zero since the magnitude cancels. I pick m_e = 0.1 quatloo/decade. If that hypothesis is true, my winnings will have risen to Q = m_e * (120 months) by the end of the forecast period.
  2. Based on data available prior to the forecast period, estimate the t_e,trend and t_e,mean that would exist if an effect of this size were observed during the upcoming prediction period. To estimate these I need to estimate the variability of 120-month trends given data prior to the forecast period, and I need to estimate the variability in the difference between the mean in the prediction period and the mean in the baseline period. (Those are shown in the figure and are based on the noise prior to the forecast.)
  3. Compute relative weights for each test variable ‘i’ as w_i = t_e,i^2 / sum(t_e,j^2). (Notice the effect size cancels out when the t’s are normalized.)
  4. Collect data during the sample period. Afterwards, compute the t values based on the sample data (using normal methods). Then create a pooled t using t_pooled = w_trend * t_s,trend + w_mean * t_s,mean. Here the ‘s’ denotes sample data.
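
Here is a rough sketch in R of the four steps (it is not the script I uploaded; it assumes white noise, a 120-month baseline and a 120-month forecast with no gap, illustrative noise and effect sizes, and it leaves the exponent on the expected t’s as a parameter because comments below revise it from 2 to 1):

    set.seed(1)
    N <- 120                 # baseline length in months (assumed)
    n <- 120                 # forecast length in months
    sigma <- 1               # noise sd; in practice estimated from the baseline
    m_e <- 0.1/120           # step 1: assumed effect size per month (0.1/decade); magnitude cancels

    x_f <- 1:n               # forecast-period time index, counted from the end of the baseline

    # Step 2: expected t's if an effect of size m_e were realized during the forecast.
    se_trend  <- sigma / sqrt(sum((x_f - mean(x_f))^2))   # SE of an n-point trend
    se_mean   <- sigma * sqrt(1/n + 1/N)                  # SE of (forecast mean - baseline mean)
    t_e_trend <- m_e / se_trend
    t_e_mean  <- m_e * mean(x_f) / se_mean                # expected mean anomaly over its SE

    # Step 3: relative weights, normalized so the pooled t keeps a standard deviation of 1.
    weight_power <- 2        # the post uses 2; comments 16-17 below argue it should be 1
    w <- c(t_e_trend, t_e_mean)^weight_power
    w <- w / sqrt(sum(w^2))

    # Step 4: collect data, compute sample t's, pool them.
    y_base <- rnorm(N, 0, sigma)
    y_fore <- rnorm(n, 0, sigma) + m_e * x_f              # synthetic forecast data with the effect present
    fit <- lm(y_fore ~ x_f)
    t_s_trend <- summary(fit)$coefficients["x_f", "t value"]
    sigma_hat <- sd(y_base)                               # noise estimate from the baseline
    t_s_mean  <- (mean(y_fore) - mean(y_base)) / (sigma_hat * sqrt(1/n + 1/N))
    t_pooled  <- w[1] * t_s_trend + w[2] * t_s_mean
    round(c(t_trend = t_s_trend, t_mean = t_s_mean, t_pooled = t_pooled), 2)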

I’ve done a number of tests, and this weighting does appear to always result in a more powerful test than either the trend or mean test alone, provided the weights are based on the effect size during the forecast period and the individual t-tests properly account for all errors. For the current case, I collected data on Quatloo winnings and the results are as below:

The really important part of the graph is the information in the lower left. What I want you to notice is that, based on the synthetic data I created, the ‘t’ for the mean anomaly test was 18.44. The t for the trend test was 12.57. I weighted the two t’s using relative weights of (0.66, 0.34)/sqrt(0.66^2+0.34^2). This resulted in a t_pooled of 22.13. The significant feature is that 22 is larger than either 18.4 or 12.6. This isn’t important for the synthetic data I show, which has so much power it’s ridiculous. But the feature is very important if we have just barely enough data to generate sufficient power to hope for “fail to rejects”.

Synthetic tests using 10,000 repeat trials indicate that this pooled t has the proper characteristics and is– on average– larger than either of the two other ‘t’s. That means the resulting test has higher power than either the test comparing trends or the test comparing anomalies alone. Consequently, it is preferred.

So, this is sort of a “goldilocks” test!

I know this is sketchy. I’ll be happy to answer questions. But this post is mostly so I have a placeholder to remember what I did. (I uploaded the really badly organized script with embarrassing, incomprehensible notes and a bunch of side tests.) And… yes… this can be applied to model tests. I haven’t done it yet because I hadn’t verified I knew how to weight the two tests. I did model tests with weights of (0.5, 0.5)/sqrt(0.5). Now I need to repeat that with the correct weights. 🙂

==========
Note: I’ve discussed this before and Paul_K was dubious. So, this result is something of a turnabout, because while the first test I thought up way back then did not seem to optimize, this one does. I didn’t upload that script, so I’m not sure if I applied the test entirely properly in that post. I may have forgotten to account for the ‘running start’ aspect of testing the means, making the test of the means appear more powerful than it ought to.

85 thoughts on “Pooled Method of Testing a Projection”

  1. Lucia,
    It looks like a superb trick if you can validate it. I will go through it in some detail tomorrow. I still find the result counterintuitive – which makes it more interesting.
    It has wide potential applicability – worthy of a short article in one of the stats journals, I would think.

  2. lucia, you say:

    Oddly enough, I also know that I can show the errors in both tests are independent.

    Is this limited to white noise, or does it hold more generally?

  3. Paul_K–

    Lucia,
    It looks like a superb trick if you can validate it. I will go through it in some detail tomorrow. I still find the result counterintuitive – which makes it more interesting.
    It has wide potential applicability – worthy of a short article in one of the stats journals, I would think.

    I know! But yes– if I haven’t pulled a boner– yes. It would be useful.

    It isn’t a better test of the trend itself, and it isn’t a better test of the mean. But the errors are orthogonal. So… should work!

  4. Humm.. The mean of 18.44 and 12.57 is 15.5. Times 2^0.5 is 21.93; very close to 22.13. Just a coincidence, or does the power increase as N^0.5, where N is the number of independent hypotheses?

  5. Lucia,

    One doubt: doesn’t the trend already include all the information to calculate the mean? It almost sounds too good to be true.

  6. SteveF–
    If the trend is known, it includes all the information required to compute the mean. However, neither is known.

    The issue:
    1) We are testing a null which specifies the (trend / mean rise over the period). Those are clearly connected.
    2) Whatever method we use, it needs to be such that it would reject a correct null at the specified rate. So, for this example, if m=0, we want to reject at the defined rate α = 5%.
    3a) We can come up with a method to test the trend “m=0” based on data and organize it so that we make false positive errors at rate α. This involves a t_trend.
    3b) We can also come up with a method to test the mean rise (m*n/2) over the period. This involves a different t_mean.
    4) If m=0– as in the null– then the mean value of t_trend and the mean value of t_mean are zero over a bunch of experiments. The only reason the ‘t’ values differ from zero is “noise” or “error” or “uncertainty”. The way the t’s are defined, they have standard deviations of 1. And both are simply random variables. With respect to the “mean” value– zero– they are just “errors” or “uncertainty” (depending on how you want to look at it). We could show this by doing a whole bunch of tests with synthetic data, etc. The standard t test lets us figure out the value of t that results in false positives at rate α.
    5) Now, the very odd thing about these two t’s is that they are orthogonal. That is: if the null is true and you find the two t’s and make a scatter plot, it will look like this:

    See: what’s uncorrelated with what.

    So… if they are orthogonal… then if we add

    t_pooled = a * t_mean + b * t_trend

    Its variance will be

    σ_pooled^2 = (a * σ_mean)^2 + (b * σ_trend)^2

    So… under the null hypothesis, any such linear combination of the two t’s will have a mean of 0 and a standard deviation of 1 if we pick a and b such that a^2 + b^2 = 1.
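
    A quick Monte Carlo sketch of what I mean (not my actual script; a single period, no baseline): the t for the trend and the t for the mean of the same white-noise series come out uncorrelated, each with sd near 1, and any combination with a^2 + b^2 = 1 also has sd near 1.

        set.seed(2)
        n <- 120
        one_draw <- function() {
          y <- rnorm(n)                     # white noise under the null
          x <- 1:n
          t_trend <- summary(lm(y ~ x))$coefficients["x", "t value"]
          t_mean  <- mean(y) / (sd(y) / sqrt(n))
          c(t_trend = t_trend, t_mean = t_mean)
        }
        tt <- t(replicate(5000, one_draw()))
        round(cor(tt[, 1], tt[, 2]), 3)           # near 0: the errors are orthogonal
        round(apply(tt, 2, sd), 3)                # each near 1
        a <- 0.66; b <- sqrt(1 - a^2)
        round(sd(a * tt[, 1] + b * tt[, 2]), 3)   # pooled combination also near 1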

    (to be continued.)

  7. So this pooled t
    a) exists and
    b) could also be used in a test. Just like the other two.

    That means we have an infinite number of choices all of which give the correct false positive rejection rate.

    But now: The whole purpose of testing the null is that we think there is a possibility it is wrong. So, what we are thinking is that m might not be equal to 0. It might take on some other value. We don’t know that value– but let us speculate it is a value m_e. This can assume any numerical value except 0.

    Now: pick a value for m. Call m = m_e the alternate hypothesis. IF this alternate hypothesis is true, then the mean values of the two t’s are not zero. They would be

    t_trend,e = m_e / (standard deviation in the trend over the test period).
    For a period with n=120 points (y_i, t_i), where y are the data and t are the times, that happens to be

    (standard deviation in the trend over the test period) = σ / (sd(t_i) * sqrt(n)), where σ is the standard deviation of the residuals from a straight-line fit.

    So, under this alternate hypothesis, the expected value E[t_trend,e] is no longer zero. But it still has a standard deviation of 1. So… it’s still a “t” value.

    We can do a similar thing for the mean rise– but the formula for the standard error in the mean rise is a bit different (and for baselined values we need to account for the uncertainty in the baseline.)

    We also know in this particular case that if the null is wrong, the expected values of both t’s shift away from zero.

    But now we have these two t’s. We have figured out that a properly normalized linear combination of t’s is also a t– with a standard deviation of 1. So any of them can permit a t-test.

    The way power works in a t-test, the higher the expected value of the ‘t’ under the alternate hypothesis, the higher the power of the test. It’s true you won’t realize the expected value– but the power is found based on the expected value of the t.

    So the question is: Is there a choice of “a” and “b” that will maximize the expected value of t_pooled?
    And it turns out there is. You can do the algebra, and the values of a and b that maximize the expected value of t_pooled have a ratio a/b = E[t_e,trend] / E[t_e,mean].

    This has the interesting property that the pooled t is dominated by the higher-power test, but it gains just a little from the lower-power test.
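
    A small numerical check (a sketch; the two expected t’s are illustrative numbers only): sweep all weight pairs with a^2 + b^2 = 1 and compare.

        mu_trend <- 1.93; mu_mean <- 1.59            # assumed E[t_e,trend], E[t_e,mean]
        theta <- seq(0, pi/2, length.out = 2001)     # all (a, b) with a^2 + b^2 = 1, both positive
        a <- cos(theta); b <- sin(theta)
        E_pooled <- a * mu_trend + b * mu_mean       # expected pooled t for each weighting
        best <- which.max(E_pooled)
        c(a_over_b = a[best] / b[best], mu_ratio = mu_trend / mu_mean,   # these two agree
          max_E = E_pooled[best], sqrt_sum_sq = sqrt(mu_trend^2 + mu_mean^2))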

    When I previously did an example for Paul, I had picked “1/2” and “1/2” which isn’t right. But I also suspect I set things up a bit wrong in another way– I wasn’t testing the “mean” correctly. I did it in a way that gave it a “false start”– which wasn’t good. (My more recent script has a whole bunch of things trying to sort out that issue. I would need to create a 2nd cartoon to explain what I did and why it went wrong. But this way seems right.)

    So this pooled t is not technically a measure of how far off the trend is, nor of how far off the mean is. But it does detect whether the linear + mean prediction was “too warm” or “too cold”.

    For the purposes of testing models– it’s actually a good thing.

    Unless… I’ve just done something really stupid. But it really does seem to work. And… I ran a shit wad of monte-carlo! But you have to set it up right!

  8. SteveF–

    Just a coincidence, or does the power increase as N^0.5, where N is the number of independent hypotheses?

    Oddly, the numerical values you picked are to some extent a coincidence… but not much of one, because I made an example with very high power. Those t’s are much larger than 2!

    When the null hypothesis is wrong (as it is in the right-hand part of the figure), the average value of ‘t’ you get in a test will increase as N^(3/2). The power increases in a more complicated way. You need to look at how two gaussian distributions overlap to see what happens.
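
    A sketch of where the N^(3/2) comes from, assuming white noise of sd sigma and a fixed trend m: the expected trend t is roughly m / SE(trend), and SE(trend) = sigma / sqrt(sum((x - mean(x))^2)) shrinks like N^(-3/2).

        m <- 0.02; sigma <- 1                      # illustrative trend and noise level
        expected_t <- function(N) { x <- 1:N; m * sqrt(sum((x - mean(x))^2)) / sigma }
        round(sapply(c(120, 240, 480), expected_t), 2)
        2^1.5                                      # each doubling of N multiplies E[t] by about this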

  9. SteveF

    It almost sounds too good to be true.

    BTW: I do think it seems almost too good to be true. But on the other hand, if a number of methods exist, one of them has to be the best method. Right?

    What will seem really too good to be true is if the degrees of freedom for the t-distribution increase. I don’t know what the degrees of freedom should be… and doing the math, I can’t think of any reason why they don’t. In some sense, once we get to the ‘t’, it’s just a parameter of the t-distribution. It just happens to be related to the number of degrees of freedom in the data set. But… it really will boggle my mind if the degrees of freedom increase. Oddly though… it might. I may have to create a whole bunch of distribution charts with a HUGE number of t’s to see how the tails behave. Very weird notion.

  10. Lucia,
    What’s your argument that t_pooled follows a t-distribution? What dof do you use?

    Another complication – the multipliers w also have distributions.

  11. Lucia,
    I think I see the problem. Your increased power is based on testing as if you were adding together twice as many iid’s, one lot weighted according to the trend formula, and the other according to the intercept formula (or mean).

    But you don’t have twice as many iid’s. You have the same number added in twice, and the combined variance is not the sum of the variances, but the variance formed from the sum squares of the combined (weighted added) coefficients. Which will be a lot higher – you can’t get ahead this way. That higher variance is what goes in the denominator of t_pooled.

  12. Lucia,
    let me put it this way – suppose you were testing whether a mean was different from zero. On this arithmetic you could get a more powerful test by forming a weighted sum of the mean with itself.

  13. Nick,
    “On this arithmetic you could get a more powerful test by forming a weighted sum of the mean with itself.”

    I don’t think so.
    Say you defined a statistic
    X’ = w1*xbar + w2*xbar
    The mean of X’ is self-evident.
    The sample variance of X’ is given by (w1^2 *Var(x) + w2^2*Var(x))

    If you set both weights at unity, then the variance of X’ becomes 2*var(x). You gain nothing on the t-test of X’.

    Now look at the choice of weights that Lucia has picked.
    Note that w1 + w2 = 1, and that, for your example, the sample variance of X’ is given by var(x). So you gain nothing by doing the combination you suggest.

    Ha, but!!!
    Lucia,
    In writing the above, I realise that one of the things you have done is used up one additional df in the calculation of the weights. It must be accounted for.

  14. Nick

    I think I see the problem. Your increased power is based on testing as if you were adding together twice as many iid’s, one lot weighted according to the trend formula, and the other according to the intercept formula (or mean).

    I am adding as if I took two tests which test slightly different– but related things–whose errors are orthogonal and created a pooled test whose errors are orthogonal. The new test has greater power. I don’t think this is a “problem”. I think it is a “desirable feature”.

    What’s your argument that t_pooled follows a t-distribution?

    The sum of two orthogonal t’s is t distributed. These are two orthogonal t’s. So, that should hold– but I admit it’s worth checking. It’s easy to check.

    What dof do you use?

    That has me scratching my head a bit. A formula exists for the “dof” and I could calculate it. (Formula in this article: http://en.wikipedia.org/wiki/Student%27s_t-test ) But I’m sure as heck going to check whether using it gives proper results. It looks like the limiting behavior isn’t going to be too screwy. I think I’m going to lose dofs. Given that, strictly speaking, the weights should be rejiggered to maximize the power– not the t itself. But the effect of the size of the t is generally going to be larger than the loss of dofs.
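
    For reference, the dof formula I’m pointing at is presumably the Welch–Satterthwaite approximation in that article; a sketch of it for the plain two-sample case (whether and how it carries over to the weighted pooled t is exactly what I need to check):

        welch_df <- function(s1, n1, s2, n2) {
          v1 <- s1^2 / n1; v2 <- s2^2 / n2
          (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
        }
        welch_df(s1 = 1, n1 = 120, s2 = 1, n2 = 120)   # equal variances: 2*(n-1) = 238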

  15. Nick

    On this arithmetic you could get a more powerful test by forming a weighted sum of the mean with itself.

    I have absolutely no idea what you mean. If there is only one test, the power is maximized by taking the weighted sum of that test in one sense. It’s the following:

    t_pooled = 1 * t_(only test that exists) + 0 * t_(non-existent test, which therefore has no power and hence a t of zero)

    So… the most powerful test is the only test you have. You don’t increase the power by weighting it by one. But it is the maximum power. (And it’s sure as heck better than giving any weight to the “no test” option!)

  16. PaulK

    In writing the above, I realise that one of the things you have done is used up one additional df in the calculation of the weights.

    Hmm… maybe. Except it might be a little more complicated than that. They can be computed before I collect forecast data.

    I think I made an error that will make nearly zero difference to the numerical results. I think the weights don’t go as t^2 based on the effect we wish to detect. I think they go as t^1. It makes very little difference because in this case those two t’s are roughly even in size. But it would make a difference in hypothetical cases where one test was much more powerful than the other. (And I do think this method of combining is more general than what I’ve shown above. It would work for some hypothetical experiments a person might do if they had different teams who took different approaches to test something.)

  17. Yep. Algebra error. Here’s a synthetic Monte Carlo test with 100,000 cases, using the correct formula, with T representing E(t):
    T_mean=1.5944
    T_trend=1.9300
    T_opt(wrong)= 2.4931
    T_opt(right)=2.5034

    I’d done it with 10,000 cases and got similar results. So… I cranked up to 100,000 and waited (20 minutes? something.) Maybe I’ll set it to 1,000,000 and go exercise or something….

    After that, I can save all the individual t’s and make histograms so I can check whether the pooled really looks t distributed and I can compare to the distribution with the ‘proper’ dof. I think it should be t.

    Oh PaulK–
    It occurs to me: For this particular case, going through the algebra, there is no loss of dof when estimating the weights because all parameters estimated based on data cancel. The final formula for the weights can be computed based on the “times” at which data were collected. Those are deterministic!

  18. Nick

    You have the same number added in twice, and the combined variance is not the sum of the variances, but the variance formed from the sum squares of the combined (weighted added) coefficients. Which will be a lot higher – you can’t get ahead this way. That higher variance is what goes in the denominator of t_pooled.

    Just to be clear: My pooled t has a variance of 1. I normalized to make it so.

  19. Lucia,
    “These are two orthogonal t’s.”
    I’m not convinced they are. If e_i is a set of iids, then your intercept model, for large dof, is going to be Σ (A_i+a_i*e_i) for some coefs A,a.
    And the trend model is going to be Σ (B_i + b_i*e_i). But from the regression model, these are not independent e_i. They are the same.

    So when you form your weighted sum, you’ll have
    Σ (W1*A_i+W2*B_i) + Σ(W1*a_i+W2*b_i)*e_i

    And the variance of this is Σ (W1*a_i+W2*b_i)^2.

    But I think you are implicitly using Σ( (W1*a_i)^2+(W2*b_i)^2).

    Are they orthogonal? – the covariance is Σ a_i*b_i. Maybe this sum has to be zero – I’m not sure. It’s the key question.

    (What I mean by getting a more powerful test combining the mean with itself – try doing the same calc with .707*mean + .707*mean).

    If the orthogonality does work, you’ll have the combined t as
    Σ (W1*A_i+W2*B_i)/√N/√(Σ(W1*a_i)^2+Σ(W2*b_i)^2)
    It seems to me the combined t has to be between the two constituent t’s.

  20. Paul_K
    “If you set both weights at unity, then the variance of X’ becomes 2*var(x). You gain nothing on the t-test of X’.”

    Yes you do. The mean doubles, the sd increases by √2, as then does t.

  21. Whew. That took a while:
    numMonte=10^6 # 10^6 takes a long time. Go exercise or something.
    > ts=matrix( unlist(lapply(rep(N_tot,numMonte),fit_tTests,n=n,mult=0.5)),ncol=10,nrow=numMonte,byrow=TRUE)
    > mean_Ts=apply(ts,2,mean);round(mean_Ts,4)
    [1] 1.5921 1.9282 2.4903 2.5006 240.0000 120.0000 0.4060 0.5940 0.4526 0.5474
    > sd_Ts=apply(ts,2,sd);round(sd_Ts,2)
    [1] 1.01 1.00 1.01 1.01 0.00 0.00 0.00 0.00 0.00 0.00

  22. Nick Stokes

    I’m not convinced they are.

    They are. This isn’t even controversial.

    But from the regression model, these are not independent e_i. They are the same

    I didn’t say the e_i’s are independent. I didn’t even say the t’s are independent.

    I said the t’s are orthogonal. Independence guarantees orthogonality. But orthogonality does not require independence.

    your intercept model

    I didn’t say the intercept. I said the mean. That was discussed in the post I linked– because it’s common for people to misinterpret “the mean” and “the intercept”. They are the same only in the case where you define time=0 for the time series such that the average of “time” over your data is 0. The definition of the time when t=0 is a deterministic quantity and you can define t=0 anywhere you like with no loss of generality. The mean will remain unchanged, but the intercept at t=0 will change.

    Maybe this sum has to be zero – I’m not sure. It’s the key question.

    That sum– for the t’s for the mean and trend– is zero when the errors are white. (They don’t need to be gaussian. Just white.) The errors are not necessarily the residuals. The residuals are what’s left over after the fit. But the errors are what you start with, defined based on the “true” values of trend and intercept.

    The proof is easy. I just did it– but I’ll leave it to you to do it. To simplify the math, I advise defining time as
    x_i = c(1:N) - N/2. That way the mean of the x_i will be (essentially) zero so you can drop a lot of terms right at the outset. Then, you’ll need to recognize that E(y_i y_j) = 0 if i≠j. Then… you’ll be done!
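
    A compact way to see it (a sketch of the argument, not my scratch work): the mean estimate is sum(y_i)/N and the trend estimate is sum(c_i*y_i) with the usual least-squares coefficients c_i; with white noise their covariance is σ^2 times the dot product of the two coefficient vectors, which is exactly zero because sum(x_i - mean(x)) = 0.

        N <- 120
        x <- 1:N - N/2                                    # the suggested centering (any shift works)
        c_trend <- (x - mean(x)) / sum((x - mean(x))^2)   # OLS slope coefficients
        c_mean  <- rep(1/N, N)                            # mean coefficients
        sum(c_trend * c_mean)                             # 0 up to floating point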

    But doing the proof, I revise my answer to brandon

    Is this limited to white noise, or does it hold more generally?

    The proof the errors are orthogonal is for white noise. So my current method of estimating the weighting factors applies to white noise. For autocorrelation, I would need a model for the ACF, and then compute the optimum weights based on that model. (The optimum values for coefficients will still exist. They’ll just differ in magnitude from the ones based on assuming the errors are white.)

    The proof that the errors are orthogonal when the errors are autocorrelated is also possible if you define time so the average of time over your data points is 0 and x_i = -x_(N/2-i), which can also be arranged. As previously mentioned, the first is always possible– it’s an arbitrary shift in the definition of the time when t=0 and involves no loss of generality. The second happens for our time series data.

    So… answer to Brandon’s question: The result that the trend and mean (not intercept) are orthogonal is general provided the data are sampled evenly with no gaps.

  23. Guys– Thanks for asking questions.
    Nick– Also, thanks for those questions. It motivated me to scratch out the proof which I don’t want to type. It’s not hard– once you understand I mean the mean and not the intercept, you should be able to do it rather quickly. Obviously, the error in the intercept not only is not orthogonal to the error in the trend– it isn’t even a value you can define. It changes based on your definition of time=0. I don’t even like to think of errors in intercepts because in a huge number of instances, they are stupid confusing things. But.. they are in all the books and programmed into all routines.

  24. Nick

    Paul_K
    “If you set both weights at unity, then the variance of X’ becomes 2*var(x). You gain nothing on the t-test of X’.”

    Yes you do. The mean doubles, the sd increases by √2, as then does t.

    I never ultimately set both weights to unity. if both tests had equal t’s under the alternate hypothesis the pooled t would be:

    t_pooled= (1* t_1 + 1* t_2)/√2 .

    Whatever the weights are, they must be chosen so that the standard deviation of t_pooled = 1. That’s a constraint on the choice. The relative magnitude of the t’s under the alternate hypothesis dictates the ratio of the two weights– but they get normalized to ensure that the standard deviation of t_pooled = 1. (And I check this in my scripts. Otherwise, I live in fear of the dreaded but ubiquitous “failed to take square root of 2” error, which ranks up there with the dreaded and equally ubiquitous “failed to take square root of π” error.)

  25. YFNWG (Comment #105226)

    Don’t worry. The exciting graphs are going to be the series where I pick a whole bunch of start years to check the multi-model mean and the result is (reject, reject, reject, reject, reject….)

    Or… I think it will be. I did this with wrong weights. One of the reasons for this post is to sit down and really think out what they needed to be– and what the “mental model” for saying what they ought to be is.

    The “mental model” is that
    a) Null hypothesis in ‘idea’ form: expected value of model temperature at any time is equal to the expected value of the earth temperature.

    b) The alternate hypothesis is: the expected value of model temperature is correct during the baseline. After the end of the baseline, the expected value of temperature in the model deviates from the expected value for the temperature of the earth, and that deviation is of the form (trend * time since end of baseline).

    Under this, we can:
    1) test difference in trend specifically.
    2) test difference in mean anomaly.
    3) create weights to create pooled test that has greater power to detect “model is off track”.

  26. Nick, lucia, ignore my stupid input.
    The variance of 2*x is always 4*var(x). The variance of x+y, if they have the same variance and are independent, is twice the variance of x. The difference is the independence between the variables.

  27. Lucia,
    In your Monte Carlo’s, do you actually verify that t_pooled follows a t-distribution? In particular, has sd 1? You said that the sum of two orthogonal t’s is a t, but you’ve multiplied by weighting factors. Now even multiplying a single t by a non-unit leads to a non-t variable, though it can of course be rescaled.

    It seems to me that you have to verify that t_pooled has unit sd. Otherwise a “t-value” doesn’t mean what you think it means.

  28. Nick

    In your Monte Carlo’s, do you actually verify that t_pooled follows a t-distribution?

    No. As I said above, that’s one of the things I need to do. But it took about 2 hours to run the million-case monte-carlo to verify I’d done something not quite right. (Makes no difference qualitatively, but the weights were slightly wrong. Still gave sd’s of 1.)

    In particular, has sd 1?

    Abso-frickin’ lutely sd=1.

    See output with 10^6 cases.
    > sd_Ts=apply(ts,2,sd);round(sd_Ts,2)
    [1] 1.01 1.00 1.01 1.01 0.00 0.00 0.00 0.00 0.00 0.00

    The first 4 are sds. The first is for the trend, 2nd for the mean, 3rd for the incorrectly calculated pooled, fourth for the pooled. Takes a while to get precision– but 1.01 is pretty close to 1.

    Now even multiplying a single t by a non-unit leads to a non-t variable, though it can of course be rescaled.

    What do you mean by “multiplying a single t by a non-unit”? Do you mean multiplying it by something other than a constant? For this example, I multiply each t by a constant that can be determined without regard to anything other than the times at which data are observed. The weights are actually identical for every single realization.

    (I realize the discussion reads as if the weights depend on the observations. That’s because it’s easier to explain that way. But it turns out that the factors that depend on the observations all cancel when things are normalized. For white noise, the final weights depend only on the time at which data are observed. If you know the number of points in the baseline “N” and the number of points in the forecast period ‘n’ and you define a spacing between the end of the baseline and the beginning of the forecast period– which is zero in the example– those three numbers are sufficient information to compute the weights. They are just numbers.)

    It seems to me that you have to verify that t_pooled has unit sd.

    I agree. That’s why my method of determining the values involves the requirement that the pooled t have a standard deviation of 1. This sd is imposed algebraically.

    That’s also why I have verified by having my script spit out the values — so that I can make sure I haven’t introduced a bug that results in not getting an sd of 1.

  29. Lucia,
    I’m just looking at the algebra. You take t1 and t2, and form
    t0=w1*t1+w2*t2
    Now already w1*t1 isn’t t-distributed. Nor is w2*t2. It’s just scaling, but still.

    Then you add them, and I gather, implicitly rescale (by choice of w1, w2) so that sd is 1. Maybe that restores t-ness – it’s not clear.

    In fact, if t1 and t2 were unit normal distrib, the products w*t aren’t (unit), and adding two variables normally distributed but with different sd makes something that isn’t normal.

  30. Nick–
    I’m running other modules, so the test of ‘t-ness’ of the distribution will be done tomorrow. But I have a question for you. I know you are familiar with a t-test done to test whether the means of two populations with gaussian errors about the mean differ. In that test, you sample population 1 and estimate the mean. If our test involved N bits of data, we know the sample mean is an estimate of the mean, and the error in the sample mean can be described using a t_1 where t_1 = (sample mean - true mean) / (standard error in the mean).

    The standard error in the mean is the ratio of the sample standard deviation to the square root of the number of data points. This t has N-1 degrees of freedom. So

    t_1= error_1/(s_1/sqrt(N-1)) is a t.

    Then if someone else goes out and measures another object “2”, say they test out n samples. They can get
    t_2 = error_2/(s_2/sqrt(n-1))

    We can’t know the value of these ts- but they exist.

    And, if we subtract the sample means we can get
    sample_mean_1 - sample_mean_2 = (true_mean_1 - true_mean_2) + (t_1 - t_2)

    And it’s known that the difference (t_1 - t_2), properly scaled to have a standard deviation of 1, is itself a t.

    And, moreover… since -t_2 is also a t, that means the sum of two t’s with different standard deviations, properly scaled to have an sd of 1, is itself a t.

    So… we know that if we pool evenly as (t_1+t_2)/sqrt(2) that’s going to be a t. And we know (1* t_1 + 0 * t_2 ) is a t as is ( 0* t_1 + 1 * t_2 )

    Other than possible issues with where we might need to have “-1” in things like n-1 etc required to keep things straight, do you think the weighted sum of two t’s isn’t going to be a t? I realize that I need to find the proof and I need to sort out where the -1’s might go.
    But do you honestly anticipate the general case is not going to have the form of a t distribution?

    In the end, whether it does or does not doesn’t make my method fail. If it turns out not to be a t, I can always resolve the issue by estimating the optimum weighting this way and running monte-carlo to show that the results do have higher power– and are close to the optimum. I’m pretty confident the pdf of the pooled t is not going to suddenly take on some weird-ass shape. It’s clearly a ‘t’ at the two extreme weightings and at the middle one. But I’m still interested in knowing if you really think this is going to have a distribution that is dramatically different from a t.

  31. Nick

    w1*t1 isn’t t-distributed. Nor is w2*t2

    Of course not. I never said they were. I say: t1 is a t and t2 is a t. If you do a paired t test, you take the difference t1/(some number) – t2/(some number). Obviously, t1/(some number) is not a t and t2/(some number) is not a t. But t1 and t2 are.

    and adding two variables normally distributed but with different sd makes something that isn’t normal.

    On what planet?

    I know you don’t take wikipedia as an authority on that. But in this case, you are wrong and they are right:

    http://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables

    If X and Y are independent random variables that are normally distributed (not necessarily jointly so), then their sum is also normally distributed.

    This is rather well known.

    Obvious example: If X is a normally distributed random variable with sd=σ, and you collect X_1 and X_2, then X_1+X_2 is a normally distributed random variable. Moreover, if you take Z = (X_1+X_2)/2, it has a standard deviation of σ/sqrt(2). Now, if you take a third sample X_3 and form Z+X_3, that’s still normally distributed even though the standard deviation of Z is not equal to the standard deviation of X_i.

    This works as long as the variables share a mean.
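
    A quick sanity check of this for independent normals (a sketch, not a proof):

        set.seed(3)
        z <- 0.3 * rnorm(1e5, sd = 1) + 0.8 * rnorm(1e5, sd = 2)
        z <- z / sd(z)                          # rescale to sd 1
        round(c(mean = mean(z), sd = sd(z)), 3)
        qqnorm(z); qqline(z)                    # quantile plot falls on a straight line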

    The only issue I see with my pooled ‘t’ is whether I need some ‘-1’s etc. to deal with sample size when pooling. With respect to the magnitude of the answers, that issue affects numbers somewhere around the 3rd sig fig. Not that that can’t matter in some sense– but the resolution is either: do a little bit more algebra, or run montecarlo to show that the weights are very close to optimal.

  32. Lucia,
    “But do you honestly anticipate the general case is not going to have the form of a t distribution?”
    Yes, I do. I’m trying to work out the large dof limit, where t is just unit normal. But as I said, if you add two variables that are normal but with different sd (as in w*t) then the result is very much not normally distributed. In particular, and relevant here, the tail behaviour is entirely that of the variable with larger spread (w).

    I’m trying to work out the tail probability of your 22.13 assuming the component t’s are unit normal.

    Update – of course, if they are independent the sum is normal. But that’s not what we have here.

  33. Hmmm… I think Nick is correct that the pooled “t” is not a t. Sums of two Normals with equal means are Normal though. And in the case where the t’s have large numbers of degrees of freedom, these are going to be close to normal. So the pooled concept has to work in the limit of large amounts of data in the time series.

    So… testing the pooled variable will have to be monte-carlo to get the 5% cut-offs. So… I’ll do that tomorrow.
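
    A minimal sketch of how the cut-offs can be read off a null simulation (simplified: no baseline, illustrative equal weights, not my actual setup):

        set.seed(4)
        sim_pooled_t <- function(n = 24) {          # one pooled value simulated under the null
          y <- rnorm(n); x <- 1:n
          t1 <- summary(lm(y ~ x))$coefficients["x", "t value"]
          t2 <- mean(y) / (sd(y) / sqrt(n))
          (t1 + t2) / sqrt(2)                       # equal weights, a^2 + b^2 = 1
        }
        null_ts <- replicate(1e4, sim_pooled_t())
        quantile(null_ts, c(0.025, 0.975))          # empirical two-sided 5% cut-offs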

  34. Nick

    But as I said, if you add two variables that are normal but with different sd (as in w*t) then the result is very much not normally distributed.

    You said that but it’s wrong. If two normals have the different standard deviations but share a mean, their sum is normal.

    But you are right that the sum of two t’s is not a t. To the extent that the t’s have high dof’s they are going to look like normals– so the sum will look normal. But to the extent that they aren’t normal, that’s going to make a difference in the tails which can matter.

  35. Lucia,
    “You said that but it’s wrong. If two normals have the different standard deviations but share a mean, their sum is normal.”
    It might be – I’m still not sure. That’s certainly true if they are independent.

    I keep thinking of your original example. Two t values of about 12 and 18, and a combined t of 22.13. As you say, all ridiculously significant. And if the 12 and 18 independently happened by chance, even more improbable, and I think the 22.13 reflects this. But I don’t think anyone really thinks they did.

  36. Even better wiki link

    Lucia, when I encounter problems like this I admit I peek: I run a Monte Carlo and e.g. compare my predicted false alarm rate against the simulated rate.

  37. Nick–
    They are t’s under the null hypothesis. Of course they become high if the null hypothesis is wrong– as it is in this example. It’s the errors that are random– not the signal.

    Lucia, when I encounter problems like this I admit I peek: I run a Monte Carlo and e.g. compare my predicted false alarm rate against the simulated rate.

    I plan to do that. 🙂

  38. Carrick,
    I’m very familiar with that sum of independent variables. But these are different summations of the same variables.

    And yes, it may still be that they are orthogonal – zero correlation.

    The result is counter-intuitive. Getting a much more powerful test out of the same data. And it looks to me as if the extra power comes from testing as if there were twice as much data.

    But … still thinking.

  39. Lucia,
    Let me put my discomfort with independence/orthogonality another way, starting from this quote:
    “I now have two possible tests. Strictly speaking, I am testing two separate hypotheses, both of which are related since they amount to “the system doesn’t work”. So, I really only want one test.”

    You have one t-value that tests mean=0, and one that tests trend=0. But what does the combination test?

  40. it may still be that they are orthogonal

    Under the null hypothesis, the t’s are orthogonal.

    Getting a much more powerful test out of the same data

    That some tests are more powerful and others less powerful is not counterintuitive. I don’t know why one would “intuit” that one of the two specific tests from which I pooled must be the most powerful one possible.

    And it looks to me as if the extra power comes from testing as if there were twice as much data.

    Well… twice that would mean the pooled t was 26.0781. Which it’s not. But anyway, if it’s possible to degrade power a lot by doing things wrong– and it is– I don’t know why it isn’t possible to notice that the method that was previously being used is less powerful than optimal and then find a method that is much more powerful.

    The question is just: How much more powerful is this? I’m sure it’s more powerful.

  41. Nick

    But what does the combination test?

    It can only test a general “too warm” or “too cold” vs “on track”.

  42. Nick
    I’m not sure what you are asking. Tests under frequentist statistics create test statistics and diagnose the rate at which that statistic would happen by random chance.

    The principle of maximizing the t requires both weights to share a sign– so I pick both positive rather than negative. A combination of a high t_mean and a low t_trend would result in a pooled t whose magnitude was lower than either. It could hypothetically result in a pooled t that triggers a false positive of “not on track” using the criterion. α=5% means we get 5% false positives. I’m not sure why this particular combination would be a “worse” issue than other false positives.

  43. I think what may be happening is that you have a more powerful test, but of a weaker result. It’s more likely to deny the null hypothesis, because you’ve allowed more things that could go wrong. Another dof. And you’ve selected the worst case.

    Here’s a simplified example. Toss a coin once a minute, 1000 times. t-value of mean (relative to what a coin should do) t1=1. t value of trend t2=2. Neither hugely improbable – trend about 4.5% (two-sided).

    Your optimal combination t is √5=2.236.. Probability halved – about 2.2% But not of mean or trend, but that optimal combination.

  44. NIck you said above:

    But as I said, if you add two variables that are normal but with different sd (as in w*t) then the result is very much not normally distributed.

    Which is very clearly a different statement than:

    I’m very familiar with that sum of independent variables. But these are different summations of the same variables.

    At this point I have no f**king clue what you are even trying to say.

    Are you trying to say anything?

  45. Carrick,
    “very clearly a different statement than:”
    Read the first bit again, down to where it says:
    “Update – of course, if they are independent the sum is normal. But that’s not what we have here.”
    It’s not different.

    Do you think one can get a much more powerful test (of something) just by testing different combinations of the same data?

  46. Nick:

    Do you think one can get a much more powerful test (of something) just by testing different combinations of the same data?

    Of course.

    See DFT.

  47. Suppose you have two t-vars t1 and t2. Then Lucia’s optimum combination to test is √(t1^2+t2^2). Now, this doesn’t look t-ish any more. The mean is not zero. For large dof, t1^2+t2^2 has a chi-square distro of dof 2.

    In fact, this (√) seems to be (for large dof) a Rayleigh distribution. It comes from the problem – if the wind speed in any direction is normally distributed (same σ), what is the distribution of total wind speed?

    Now the RD is only a slightly modified normal, and has similar asymptotic behaviour to a normal, but with, in this case, twice the variance. Which matters for the significance test.

    So I think SteveF was on to something.

    Update – not so sure about doubled variance – checking.

  48. Well if you want to restrict yourself to the question you asked me, it’s clearly possible to come up with more optimal testing (in some sense) of a set of data than introductory concepts like comparing trends or means.

    Linear processing with white noise is easy to study and understand, but it typically represents the lower threshold on what is possible. Bear in mind, for signal detection, signal + white gaussian noise is the worst case because it means the “noise” has no information in it. Correlation can be exploited to improve the detection threshold. (See “fluctuation based processing” for example.)

    Anyway these issues are easily sorted out with Monte Carlo studies and posting code, which is how I would approach it at this point. Coming up with ideas to test with simulation seems more useful than arguing over the characteristics of a distribution.

    The sort of test Lucia is using reminds me of something I’ve seen from high energy experimental physics (where their bread and butter involves optimal projections and robust signal detection methods). It may already be a “known result” if it is correct.

  49. Nick

    Here’s a simplified example. Toss a coin once a minute, 1000 times. t-value of mean (relative to what a coin should do) t1=1. t value of trend t2=2. Neither hugely improbable – trend about 4.5% (two-sided).

    Your optimal combination t is √5=2.236.. Probability halved – about 2.2% But not of mean or trend, but that optimal combination.

    I can’t even begin to guess what point you are trying to make. If a test is done with α=5%, it is always the case that α=5%. That’s what we want in a test.

    Do you think one can get a much more powerful test (of something) just by testing different combinations of the same data?

    If the test you were previously doing was not the most powerful one possible with a given set of data, then yes, you can get a more powerful test by choosing a different test. On the other hand, if you are already doing the most powerful test possible with a given collection of data, you can’t do a more powerful test. One should try to pick the most powerful test to detect an effect.

    Look: All you need to do is monte carlo to determine which test is most powerful. This attempt at showing that power can’t be improved by posing what appear to be silly rhetorical questions is pretty lame.

    Carrick

    It may already be a “known result” if it is correct.

    If you can remember the test and we can compare,that would be great. It would be much better to just cite an existing test than explain all the properties!

  50. Nick

    .Then Lucia’s optimum combination to test is √(t1^2+t2^2).

    Nooooooooo!!!!! Absolutely not.
    They aren’t the same “t”. The weights are best thought of as determined by examining t’s under a hypothetical effect one wishes to detect. No data from the test period are involved in determining their size. None.

    The ‘t’s for the test are based on the data during the test period. Please re-read.

  51. Nick==

    Your optimal combination t is √5=2.236.. Probability halved – about 2.2% But not of mean or trend, but that optimal combination.

    I would say I have no idea where you get that √5, but it appears to be based on the complete misunderstanding of the thing that made me write “nooooo” above.

    The weights cannot be obtained from the result of the test.

  52. Lucia,
    “Nooooooooo!!!!!”
    You said
    “That is, there should be a method of creating a pooled
    t_pooled=w_trend t_trend+w_mean t_mean
    with weights selected such that the standard deviation of the pooled ts is 1 and the statistical power of the test is maximized.”

    Well, the maximum comes when w is a unit vector parallel to the t pair, and its value is the distance. You’ve estimated from a different period, but that’s still likely to come up with a similar result. Your combination of t=18.44, t=12.57 gave 22.13; the Euclidean is 22.32.

    I guess you could say that it wouldn’t have come up with a similar result if the null hypothesis (randomness) was true.

  53. For what it’s worth, Nick, on the coin flip problem– assuming you are testing a null of “mean = 0” with heads +1 and tails -1, and the null is that the person who is taking coin-flipping lessons will be unable to improve their ability to flip coins– but he thought he could– you might have set up the alternate hypothesis that, owing to coin-flipping lessons, he can start flipping more heads with his performance improving linearly over the first 1000 flips. If the plan was for him to test over 1000 flips, the weight on the outcome of the trend test would be 0.5001874 and the weight on the mean would be 0.865917.

    This is obtained by considering the outcome under the hypothetical that you will improve at a linear rate. The ratio of the optimal weights is

        ratio = 2*sd(c(1:1000))/10^3    # 0.5776389

    Normalizing, we get

        weights = c(1, ratio)
        weights/sqrt(sum(weights^2))
        [1] 0.8659172 0.5001874

    If the t for the trend was 2 and the t for the mean was 1, the pooled t would be 1.866292. So, it happens to fall in between the two cases and you would decree the coin flipping lessons did not work.

    You don’t change these weights based on the outcome of the flips. They can be computed based on knowing that the test will involve 1000 flips, and that you both agree the coin is fair– so you don’t need a “baseline” for the coin. (If you need a baseline to identify the “mean” outcome for that coin, the weights change and also depend on the number of flips you used to figure out the mean outcome for that particular coin.)

    Whether or not anyone would do such a test on honest to goodness coinflips would likely depend on whether someone had proposed that particular alternate hypothesis as being the rival to the null.

    If I were someone taking coin flipping lessons and I believed that through practice I could improve my coin flipping more or less linearly over time, I would want to bet based on the pooled t because it represents the maximum probability that I would win if the lessons worked. (Note: the best weights depend on the form of the alternate hypothesis.)

    Meanwhile, I’d have a 5% chance of losing if they did not work.

  54. Nick–

    Well, the maximum comes when w is a unit vector parallel to the t pair,

    But that’s the incorrect t-pair to use when maximizing the w’s. I’ve said you don’t use the pair from the data numerous times in comments, and the process is described in bulleted form in the post. Your substituting some completely different method isn’t really going to get you or anyone else anywhere.

  55. Nick

    I guess you could say that it wouldn’t have come up with a similar result if the null hypothesis (randomness) was true.

    What I would say is that it would very, very rarely come up with a similar result if the null hypothesis was true, and that would be a correct statement. Read the discussion of how the weights are obtained. That’s how they are obtained.

    Also: You can’t pull the weights off that figure because– I’m assuming– in the coin flip you are testing if it’s fair? If so, you know the mean for the null =0 (if heads is +1 and tails -1). My weights involve rebaselining– and that changes things.

  56. Lucia,
    I see that I haven’t focussed adequately on the prediction aspect of what you’re saying, so I’ll think more about that. My view is still that you’ve a powerful test for a weaker result. You said:
    “At the end of that time, I want to test the null hypothesis that
    I will earn 0 Quatloos each month in the future period. That is E[Q]=0
    The trend m is zero. That is m=0.”

    ie both true. That’s a stronger null hypothesis, so negating it is a weaker result.

    Here’s a line of thinking which may not align with yours, but I’m thinking about. Your results, predicted in advance or not, are modelled by a set of unit iid’s (ε_i). Looking for non-random patterns. So you’ve taken a dot prod with a unit vec (1,1,1,1,1…) normalised, to get the mean. Then one with (…,-1,0,1,…) (normalised) to get the trend. These are also unit random vectors, and the t’s that you’re testing are the coefficients in the orthonormal expansion with these 2 basis vectors.

    You could go on, with more orthonormal vectors. The SS of the coefs follows χ-square, and I think the more general test is χ-sq on a larger number of those coefficients.

    But it’s the same deal. You’re making the null hypothesis more restrictive so you can get more powerful tests, but then negating the null hypothesis means less.

  57. Nick

    You’re making the null hypothesis more restrictive so you can get more powerful tests, but then negating the null hypothesis means less.

    a) I’m not sure I am making it more restrictive (whatever that means).
    b) Null hypotheses are required to be specific– which I suspect is what you mean by “restrictive”. One is supposed to create nulls that are specific.
    c) Even if it’s somehow “more restrictive”… so?

    As far as I can see, the idea we want to test is
    Expected Value of [ Error In Prediction(time period) ] = 0 for all times in the period of interest. (That happens to be the forecast period.)

    That’s the actual feature of interest we want to learn something about.

    If this is true, the mean error should be 0. The trend in the error should be zero. So, both properties should be true.

    It’s not “worse” to create a pooled test that lets us detect that
    Expected Value of [ Error In Prediction(time period) ] = 0
    is not true when it is false, while holding false positives constant. You can concoct a test by looking at one property alone (mean during period = 0, [exclusive of] trend during period) or you can concoct a test by looking at two properties in a pooled way. None of these tests “means more” (or less) than the others.

    The null hypothesis isn’t “negated less” by looking at a pooled test!

    The judgement on this does not involve discussing “dot products”. It can be– and is best– made by discussing the null that is being tested in words and seeing whether it lines up with the “fact” that one is trying to test. We are trying to test whether there is a discrepancy between model projections and observations. And the pooled test tests this just as well as testing the trend only or the mean only.

    Anyway, your current line of argument is silly. The fact of the matter is that when I show trends mis-match, I have sometimes had people reply that it can’t mean anything because the means don’t fail. And I have every reason to believe that if the trend didn’t fail but the mean did, they would give the opposite response.

    Well…. we could perfectly well show that if someone requires both tests to fail, each with a false positive rate of 5%, that creates a test that has a false positive rate of 0.25%. The cure for this is to recognize that someone is going to say they need to see both fail and, to use the information in a “binomial” way, we need to reduce the individual tests to a confidence level of p=77.6%. Then if our test is “both fail”, the false positive rate for that is 5%– which is our claimed level. (If someone wants to use a different false positive rate– that’s fine. But they need to say that explicitly. Not hide it with a pile-up of requirements in mysterious ways.)
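
    The arithmetic behind those two numbers, treating the two tests as independent (a sketch):

        0.05^2             # require BOTH to fail at 5% each: joint false positive rate = 0.0025 (0.25%)
        sqrt(0.05)         # per-test alpha needed for a 5% "both fail" rate: about 0.224
        1 - sqrt(0.05)     # i.e. a per-test confidence level of about 77.6%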

    Meanwhile, if we are allowed to deem models based on fishing for the case we “liked” and presenting only that (i.e. cherry picking), then– because there are two possible tests (mean and trend)– if we can just get away with running both in our little offices and showing the one we “like”, we can dupe people into thinking our false positive rate is 5% when it’s really 78%. So, if we are allowed to “fish” and present only one (concealing the other), we should be using Bonferroni.

    Because two tests exist whether we like it or not, it makes more sense to talk about how we should use them together. And whatever way we use them together, the principle should be that we end up with a test whose overall false positive rate is what we claim it is (say 5%) and has the maximum power to reject the null hypothesis when it is wrong.

    This is the principle that causes us to make correct diagnoses at the highest possible rate. It’s the way we should organize tests. No amount of waving around “unit iids” changes this concept. The pooled test idea implements the basic logical notion properly: we use the data we have in a way that maximizes the likelihood we will detect that the error in the model is not 0.

  58. Carrick/Nick–
    One fruitful thing. This conversation has clarified something in my mind. I already knew that when I added the estimate for measurement error to my test of models based on a “model weather” estimate, the new test variable was not a ‘t’– because the estimate for the measurement error is– if anything– gaussian.

    And, I could tell that my synthetic experiments failed to reject at slightly too low a rate– especially if I ran them in the limit of lots of experimental error. I’d reject at rates of 4.5% or so when I was claiming false positive rates of 5%. Not a big deal for the most part–but still… Obviously, having a significance of 4.5% when you are only claiming a significance of 5% is an unnecessary erosion of power to detect when the null is false– so it’s best to have α=5% when you claim 5%.

    The discussion of t-distributions has confirmed my suspicion of why this happened. So… I think I can fix that up a bit. (I’d planned to anyway. My longer script has all sorts of comments with ideas of what to try. )

59. Ok… I ultimately agreed that the ‘t_pooled’ is not t-distributed– but I think it must be close. So, as Carrick suggested, I just peeked at the false positive rejection rates. These should be 5% if the method is “right” and I use an infinite number of runs. I used numMonte = 10^5 runs. (10^6 takes way too long…)

To test the effect of the statistic not really being a ‘t’, I used a smallish number of data points for the individual tests. That is: the baseline is 12 months long, and the forecast period is 12 months long.

# to test issues with 't', use a small number of dof's
N_tot = 24
n = 12

    I check the mean value of ‘t’s over the 10^5 tests. These should be zero ± some small error.
    > # mean t’s. Should be 0 for false positive test.
> round(1/sqrt(numMonte),3); # standard error in the mean, based on the number of simulations.
    [1] 0.003
    > round(mean_Ts[1:3],3); # ts
    trend mean pooled_alt
    0.001 -0.003 -0.001
    That looks about right.

    I check the rejection rates
    > # REJECTION RATES: False positive in %
    > round(100*sqrt(alpha*(1-alpha)/numMonte),2); # estimated standard error in rejection rate.
    [1] 0.07
    > round(100*mean_Ts[4:8],2); # mean rejection rates. want 5%.
    rej_trend rej_mean rej_pooled_lenient rej_pooled_strict rej_pooled_gaussian
    4.94 5.10 5.62 4.31 7.10

As expected, the rejection rates based on the trend and the mean are within 2 standard errors of 5% (that is, within ±0.14).

I also looked at the pooled rejection rates three possible ways– none of which is entirely methodologically sound. 🙂 (First, note the numbers of dof’s for the trend and mean tests are different because the means are baselined. The dof’s for the mean test follow a complicated-ish formula based on the number of points in the baseline and the number in the forecast period; the result is larger than the number of dof’s for the trend.)

“pooled_strict”: I pretended the pooled t is a t with the dof’s from the trend test (the smaller number).
“pooled_lenient”: I pretended the pooled t is a t with the dof’s from the mean test (the larger number).
“pooled_gaussian”: I pretended the pooled t is gaussian.

All are wrong, but I anticipated that ‘pooled_gaussian’ would give the most false positive rejections and ‘pooled_strict’ the fewest. The goal is to compare these to 5%. (I can fix this up in a more sensible way later. For now I want to gauge the order of magnitude of the numerical error involved, not fully deal with the ‘non-t’ aspect of the ‘pooled t’.)

    Anyway, I found:
rej_pooled_lenient = 5.62%
rej_pooled_strict = 4.31%
rej_pooled_gaussian = 7.10%

All differ from 5% by more than 2 standard errors. I’m reasonably confident I can say that if I take the expedient– but not precisely correct– path of using the dof’s from the test with the smaller number of dof’s, I create a test that is ‘too strict’. That is, I will make fewer false positive errors than I claim. (I know that the rate will approach 5% as the number of dof’s increases because, in that case, everything starts to look gaussian– and the pooled t becomes the sum of two gaussians, which is gaussian.)

    On the other hand, if I use the dof’s from the test with the larger number of dof’s, it will be “too lenient”.
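For anyone who wants to poke at this numerically, here is a much-simplified sketch of the sort of Monte Carlo check described above. It is not my actual script: the pooled statistic below uses placeholder equal weights, and the mean test ignores the baseline’s own uncertainty, so its numbers won’t reproduce the table above.

# Simplified Monte Carlo check of false positive rates under the null (white noise).
# Placeholder equal weights -- NOT the weights from the post.
set.seed(42)
numMonte <- 1e4                      # the post used 1e5; fewer runs here for speed
n        <- 12                       # months in the baseline and in the forecast
alpha    <- 0.05
tc       <- (1:n) - mean(1:n)        # centered time: the intercept is the period mean

rej <- replicate(numMonte, {
  base <- rnorm(n)                   # baseline under the null
  fore <- rnorm(n)                   # forecast period under the null
  y    <- fore - mean(base)          # forecast anomalies relative to the baseline mean
  fit  <- summary(lm(y ~ tc))$coefficients
  t_mean  <- fit["(Intercept)", "t value"]
  t_trend <- fit["tc", "t value"]
  t_pool  <- (t_trend + t_mean) / sqrt(2)   # roughly unit variance if the t's are independent
  c(strict   = abs(t_pool) > qt(1 - alpha / 2, df = n - 2),  # treat it as a t with the smaller dof
    gaussian = abs(t_pool) > qnorm(1 - alpha / 2))           # treat it as gaussian
})
round(100 * rowMeans(rej), 2)        # empirical false positive rates (%) vs. the nominal 5%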

Obviously, because it’s easy, I’m going to see what happens if I find the dof’s for the pooled t using the average. That takes two lines of code… and an hour to run. (Why didn’t I do that before?!)

  60. Will Nick Stokes find the fatal flaw in the method, or will Lucia emerge victorious with a more powerful method to declare climate model projections… err…. wrong? 😉
    Joking aside, it will be interesting to see if anyone can poke holes in it.
    .
Of course, the alternative is to wait a couple of years until nobody (not even Sir Nick the Resolute, descendant of brave Sir Robin of Python) can argue that GCM model projections and reality are plausibly congruent.

  61. couple of years until nobody

    Possibly only months.

One of the motivations here is that *I strongly anticipate* that if I submit the rejections based on trends using “model weather”– which are just borderline now– someone will say “but the means aren’t outside”. I anticipate this because, over the course of complying with various niggling requests on a Knappenberger et al. paper, at some point someone wanted to suggest that if the means are in and the trends are out, we need to wait for *both* to fail. But of course, insisting on *both* amounts to a much stricter test than the claimed significance level. That particular paper was already a bit complicated in structure– making it a bit difficult to pinpoint which comments went with which bits of the analysis. But in any case, I was pretty darn sure that– given power estimates based on models– if the models are wrong we’d see this again, and we will begin to approach the point where results might be “robust” to “new weather”. (Of course, that depends on how far off the models are– but I think we are getting near that.)

I know that I need to include a bit that discusses means, absolutes, and the method– and I think I need to apply both to test the null of “models are right” and the null of “no warming”. (The latter is actually harder *when the weather noise is based on models*. Heck, it’s harder anyway– because there is no person or body with a concrete statement of what they mean. Do they mean “warming stopped”? Do they mean “yes, warming, but due to the sun”? Whaaaa? What to test differs. So… difficult.)

  62. SteveF:

    Joking aside, it will be interesting to see if anyone can poke holes in it.

Note he was right that my “pooled t” is not a t-distributed thingy. But… it ends up not mattering. If I just say that, *as an approximation*, I will treat it as a t and use the dof that is the minimum of the one for the trend and the one for the intercept, I get a method that is, if anything, ‘too lenient’ to the models. So no one can say that a diagnosis of “reject” came from a test that was too eager to reject. Statistical analyses can never achieve perfection– so that should be sufficient.

I could run Monte Carlo to make the level of ‘lenient/strict’ just right. But… that would result in more rejections of models, not fewer.

  63. Lucia,
    “Nooooooooo!!!!!”
    OK, I’ve looked again at your prediction, and I’ll push back against this. I don’t think there is any data-based prediction.

    You’ve simply postulated a trend of m=0.1 Q/dec. And your synthesiser generates a level of noise, which you measure while things are flat. Let’s say measured sd=σ Q.

    Then you get a t_m. Where from? Well, the regression formula says that the estimate is m and the sd is σ*√(12/N) (I’m using large N approx; I think about N+1/2). So your
t_m = 12.57 = m*√(N/12)/σ

Then t_mean. Well, the predicted mean is 0.5*m – half the trend over the decade. And uncertainty? Well, the uncertainty of the estimated 120-month mean would be σ/√N, giving
t_mean = 18.44 = (m/2)*√N/σ

So you said that it was OK to posit m=0.1 because it cancels out. But it cancels out with σ. This (the sd of the noise) can be figured from t_mean as 0.0295 Q and from t_m as 0.0252 Q.

    Put another way, the only data-based number is the ratio of observed noise to posited trend. That’s any number you want to posit.

So you’re not finding weights based on past data, or shouldn’t be. The ratio t_mean/t_m should be √3, and the optimal weights (1,√3)/2, i.e. cos(π/3), sin(π/3). Your numbers vary slightly from this, and you may have calculated the t-denominators slightly differently. But it’s all based on the ratio of measured noise to posited trend.
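(As a quick numerical check of that ratio– the values of m and σ below are placeholders; they cancel when the two formulas above are divided:)

# The ratio of the two approximate t formulas above is sqrt(3) for any m and sigma.
N <- 120; m <- 0.1; sigma <- 0.03     # placeholder values; they cancel in the ratio
t_m    <- m * sqrt(N / 12) / sigma
t_mean <- (m / 2) * sqrt(N) / sigma
c(ratio = t_mean / t_m, sqrt3 = sqrt(3))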

    But I still maintain that your “more powerful” test tests a less interesting (less plausible) null hypothesis, with a weaker result. And I think the test of that is – it’s hard to say what the result actually is.

  64. Nick

    You’ve simply postulated a trend of m=0.1 Q/dec.

    Actually, no. I haven’t.

    So you said that it was OK to posit m=0.1 because it cancels out. But it cancels out with σ.

The numerical values of m and sigma both cancel out algebraically. You can pick any values other than zero you like– you get the same weights.

    Put another way, the only data-based number is the ratio of observed noise to posited trend. That’s any number you want to posit.

    There is no posited trend. And the weights do not depend on either of these two values.

    But it’s all based on the ratio of measured noise to posited trend.

No. The truth is the numerical values of the weights in this problem are functions of the number of points in the baseline and the number of points in the forecast. I’ve told you this several times now.

I have no idea how the null “models predict the data correctly” is a less interesting null hypothesis than “models get the trend right”. But… whatever.

  65. Lucia,
    “Nick
    ‘You’ve simply postulated a trend of m=0.1 Q/dec.’
    Actually, no. I haven’t.”

    Well, you say
“This can be any value other than zero since the magnitude cancels: I pick m_e=0.1 quatloo/decade.”

    But I’m now rather at a loss to see in what sense you are testing a hypothesis at all. The mean that you’re testing is constrained to be half the trend times decade. Maybe with some fuzz for initial uncertainty. Mean and trend are far from independent, so the null hypothesis that they are both zero is effectively a test that just one of them is, since their ratio is constrained.

    How is this a stochastic problem?

  66. Nick–
To explain the idea, I pick a value. But it cancels out. I could pick 0.1, I could pick 0.3, I could pick 100000. It cancels out. That is what I meant when I wrote “This can be any value other than zero since the magnitude cancels.”

If your “issue” is that one cannot discuss power without the notion of an “alternate” hypothesis, then you simply have issues with the notion of statistical power, because statistical power cannot be discussed without postulating the existence of an alternate hypothesis.

    you are testing a hypothesis at all
I’m mystified why you can’t see how creating a test makes sense when someone claims the expected value of quatloos is zero at all times during the forecast period– that is, when someone claims E[Q(t)]=0.

Statistical tests need to be more limited, and it’s pointless to do a test with zero power. That’s why, even though the thing people really want to test is E[Q(t)]=0 for all t, they will organize things into a test like:

    Assume:
    Q_i=m*t_i+b+e_i

    Make the null hypothesis m=0. The alternate is m≠0.
Then do a statistical test on “m”. If they find that m≠0, they will learn what they are actually interested in, which is that “E[Q(t)]=0 for all t” is not true. (A toy example is sketched below.)
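For instance, with made-up numbers:

# Toy data generated under the null (m = 0); test the slope with the regression t.
set.seed(7)
t_i <- 1:120
Q_i <- 0 * t_i + rnorm(120, sd = 0.03)
summary(lm(Q_i ~ t_i))$coefficients["t_i", c("Estimate", "t value", "Pr(>|t|)")]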

I think you have gotten yourself seriously confused if you think the test of the trend is “the more interesting thing”. The trend is only a diagnostic for what we really want to know, which is: does the expected value from the model runs equal the expected value for the earth under matched conditions?

  67. Nick

    Mean and trend are far from independent, so the null hypothesis that they are both zero is effectively a test that just one of them is, since their ratio is constrained.

    Sigh…
1) The errors– i.e. the uncertainties– in the sample mean and the sample trend are independent.
2) The true mean and the true trend are related.

    The method assumes both (1) and (2) are true. So you need to get the difference between “signal” and “noise” sorted out in your thinking. Because your confusion is seriously showing here.

  68. Lucia,
    “To explain the idea, I pick a value.But it cancels out. I could pick 0.1, I could pick 0.3, I could pick 100000. It cancels out. “

    It’s actually a scaling factor on your t estimates. If you pick 0.1 Q/d, you get t_mean=18.44. If you pick 0.3 Q/d, with the same noise in baseline, you get t_mean=55.32. And 10000, well, let’s not. But then in terms of significance, what do these values mean?

    A big t is meant to mean that there is only a small chance that the numerator (here mean) could go to zero with a fluctuation in the random component. The value you pick has a big effect on that. But what does it mean?

    Your t_m and t_mean are in a constrained ratio; that ratio is determined just by the number of baseline and observation months. But you want a test based on whether they both could be zero?

    There’s a place for dimensional analysis here. Your input observation data is noise, in quatloos. You posit a trend, in quat/decade. You might as well posit it in quat/timestep. No surprise that you find the numbers depend on that ratio, and quatloos have gone. It’s become a dimensionless problem. But what can that mean? Can you even state a dimensionless hypothesis?

  69. Nick

    It’s actually a scaling factor on your t estimates.

    Uhmm… no. It is not a “scaling factor” in the estimates. It is a feature of the data in the “forecast” period. It’s a “thing” the test is trying to detect.

    I think you are very confused Nick. Very confused.

  70. Nick–

    ?? You just said it’s a number you can pick at random.

    Either we are talking at cross purposes or you are changing subjects. When you wrote this:

You’ve simply postulated a trend of m=0.1 Q/dec. And your synthesiser generates a level of noise, which you measure while things are flat. Let’s say measured sd=σ Q

    I assumed you were discussing this choice

I pick m_e=0.1 quatloo/decade.

That is the only reference to 0.1 quatloo/dec in the post. I give a numerical value for purposes of explaining the idea used to determine the weights– but the resulting numerical values of the weights are unaffected by this choice. It cancels. The value of “noise” I pick in my fit also does not affect the numerical values of the weights. I would get the same weights no matter what.

This is the only “new” thing in the post: a discussion of how to get the weights that will be used in

t_pooled = a*t_m + b*t_mean. That’s it. It defines a and b. It does not affect the magnitude of t_m and t_mean one iota.
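(For what it’s worth: if t_m and t_mean are independent and each has roughly unit variance, then any weights with a^2 + b^2 = 1 leave the pooled statistic with roughly unit variance. For example:)

# e.g. the (1, sqrt(3))/2 weights quoted elsewhere in this thread satisfy the constraint
a <- 1 / 2; b <- sqrt(3) / 2
a^2 + b^2      # = 1, so var(a*t_m + b*t_mean) stays ~1 when the two t's are independent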

    You are then switching to this

    It’s actually a scaling factor on your t estimates. If you pick 0.1 Q/d, you get t_mean=18.44. If you pick 0.3 Q/d, with the same noise in baseline, you get t_mean=55.32. And 10000, well, let’s not. But then in terms of significance, what do these values mean?

    You seem to now be discussing the t’s obtained from the data in the forecast periods (in orange to the right of the red line.)

These are not affected by the hypothetical trend I discussed while computing the weights. The t_mean and t_m are computed in the ordinary way one computes them when testing a mean and a trend:
1) t_m is the ‘t’ for the trend that would be spit out if you inserted the data indicated in orange into R and fit it with a linear regression.
2) t_mean is the ‘t’ for the intercept that would be spit out if you first centered the time data so it had a mean of zero and then stuffed it into R.

You could also use EXCEL or a programmable calculator to get these t’s if you stuffed in the “orange” data from the “forecast” period. A minimal R sketch is below.
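(With made-up forecast-period data, and ignoring the baseline uncertainty that my actual test folds in:)

# Get t_m and t_mean from the forecast-period data via lm(), with centered time.
set.seed(1)
n     <- 120
month <- 1:n
Q     <- (0.1 / 120) * month + rnorm(n, sd = 0.03)   # toy "orange" data
tc    <- month - mean(month)                         # center time so the intercept is the mean
fit   <- summary(lm(Q ~ tc))$coefficients
c(t_m = fit["tc", "t value"], t_mean = fit["(Intercept)", "t value"])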

These are not affected by anything involved in the computation of the weights. They are not anything new. They are what you get if you stuff data into R, and they mean precisely what they mean when they come out of R.

Is the magnitude of these values affected by the trend and intercept of the orange data and the amount of noise in the data? Of course. They are supposed to be.

    Naturally, to show the magnitude of my pooled t relative to the two from which it is pooled, I used an example. So if your worry is that the numerical results in my example spring from the choices I made to create the example: Yes. Of course. Duh.

    As it happens: in this example, what we see is that:
    1) I created data where we know the correct answer should be “the trend is statistically significant”.
    2) Both the mean and trend test gave the correct answer: It is significant.
3) The pooled test did also. In the case of the pooled test, we got an answer that one would interpret as “even stronger”.

    My goal is precisely to get a pooled variable with more power.

To show it has more power, I would need to do many, many similar examples– all with specified values of m/σ, where m is the trend in the data in the forecast period (not the arbitrary value used to compute the weights)– and create a graph showing power as a function of m/σ. This m/σ would be a property of the data in the forecast periods.
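(A rough sketch of what that would look like– simplified, with placeholder equal weights rather than my weights, and no baseline:)

# Empirical power vs. a chosen m/sigma in the forecast data, at alpha = 0.05.
set.seed(3)
power_at <- function(m_over_sigma, n = 120, reps = 2000, alpha = 0.05) {
  tc   <- (1:n) - mean(1:n)
  crit <- qnorm(1 - alpha / 2)              # gaussian critical value, for simplicity
  rej <- replicate(reps, {
    y   <- m_over_sigma * (1:n) + rnorm(n)  # sigma = 1 without loss of generality
    fit <- summary(lm(y ~ tc))$coefficients
    t_trend <- fit["tc", "t value"]
    t_mean  <- fit["(Intercept)", "t value"]
    t_pool  <- (t_trend + t_mean) / sqrt(2) # placeholder equal weights
    c(trend  = abs(t_trend) > crit,
      mean   = abs(t_mean)  > crit,
      pooled = abs(t_pool)  > crit)
  })
  rowMeans(rej)                             # fraction of runs that reject = empirical power
}
power_at(0.005)                             # e.g. a per-month trend of 0.005 sigma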

    FWIW: You seem to be complaining that the magnitude of ‘t’s I’m getting are affected by the signal/noise ratio of the data. Well… yah-huh. That’s the desired feature in a statistical metric meant to detect whether noisy data contains a signal. So yes: When I pick examples with high signal/noise ratios, the t’s will be large. Because that’s the feature that makes t’s useful for detecting signal in noisy data. That’s why they are used.

I really think you need to go pull out a chapter on statistical power and review it. Do an example problem for the simple case of detecting the difference in two means. It might help you focus on the conventions of what is done.

  71. Lucia,
    Humm…
    76 comments and all has gone dead. I think you have scared away the natives (infidels?). I think maybe you hurt Nick Stokes’ feelings, if that is possible. 😉
    Still, if you can show that your new method declares a known (toy) trend “false” or “true” before the standard methods, that should be good enough.

72. Or bad enough. My plan going forward is to show the analysis two ways:
    1) pooled (which is better)
    2) trends only (which for models is 2nd best.)

Even though I think pooled is better, I don’t think I’d get a paper accepted if it were the *only* way I was getting rejections. People don’t “want” the models rejected, so if I’m only getting rejections with a ‘new’, nonstandard method, it’s not going to pass. But I know that I will need to discuss how we consider the mean– since in a few months the multi-model trends are going to be out *using models as “weather”*. And at least some reviewers are then going to say “but the means aren’t”. So… I need to discuss that.

73. Isn’t the additional information in the trend the autocorrelation at more than one time step? I.e., the reason it’s a more powerful negative is that it shows that the expected value is zero at one time step, and also that there is no autocorrelation– at least at the trend horizon– that would change the null hypothesis….

  74. OK if there is no autocorrelation then I will humbly bet you a Quatloo that this method can’t possibly have any value when you figure out what it really does. Where do you suspect the additional “information” is coming from?

  75. CPV–

    when you figure out what it really does.

    When? I already know what it really does. It has higher power than two other methods.

    Where do you suspect the additional “information” is coming from?

What additional information? It has higher power– it makes fewer type II errors than other methods of processing the same data. There are lots of statistical tests in books. No one claims the higher-power ones got “more information”. They have higher statistical power based on how you handle the same amount of information. That’s what’s good about some methods vs. others.
