Correction for ARIMA out of the box confidence intervals.

The problem: False positive rate is too high.

Once we state our assumptions for the statistical model (i.e. linear trend (m) + ARMA(1,1) 'noise'), we want our method of testing a null hypothesis about the magnitude of the trend (i.e. m) to reject that hypothesis at the stated rate. So, if we say we state our results at a significance level of α=5% (i.e. 95% = 1−5% confidence intervals), we want our method to reject an absolutely true null hypothesis with probability 5%. We can test whether a proposed method does this by creating synthetic data that matches the null hypothesis and the assumptions of our model and testing over and over again. I tested the methods on trendless ARMA noise with ar=0.698 and ma=-0.322, applying both the "Tamino-SRJ 'green' method" (a method of estimating uncertainties described by Tamino, but applied using trends and AR, MA coefficients estimated over the same period) and the confidence intervals from the R arima method out of the box. The results with 10,000 attempted realizations are shown below:

Replacement figure (after fixing bug.)

(Note: The reason I say "attempted" is that when one generates noise, sometimes that noise doesn't look like ARMA(1,1), or it doesn't look stationary, or it has other features that make the method of testing the trend fail. My code detects those cases and skips them, so they are not included in the results. FWIW: the Tamino-green method 'fails' in this way more often than ARIMA. The main reason for 'failure' is that, owing to the specific values of the noise in a realization, the lag-2 coefficient is sometimes larger than the lag-1 coefficient.)
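To make this concrete, here is a minimal R sketch of the kind of Monte Carlo check described above. It is an illustration, not the code used for this post: the series length, the number of realizations, the Student-t critical value with N−2 degrees of freedom, and the bare tryCatch standing in for the 'detect and skip' logic are all assumptions.

set.seed(1)
M        <- 2000                  # Monte Carlo realizations (10,000 in the post)
n.months <- 120                   # length of each synthetic series (an assumption)
alpha    <- 0.05
ar.true  <- 0.698
ma.true  <- -0.322
X <- cbind(trend = seq_len(n.months) / 12)   # time in years, used as the trend regressor

reject <- rep(NA, M)
for (i in 1:M) {
  # trendless ARMA(1,1) noise matching the null hypothesis
  y <- arima.sim(model = list(ar = ar.true, ma = ma.true), n = n.months)
  fit <- tryCatch(arima(y, order = c(1, 0, 1), xreg = X),
                  error = function(e) NULL)
  if (is.null(fit)) next          # an "attempted" realization: skip fits that blow up
  m.s  <- fit$coef["trend"]                        # sample trend
  sm.s <- sqrt(fit$var.coef["trend", "trend"])     # its out-of-the-box standard error
  # out-of-the-box test: compare |t| to a Student-t critical value
  reject[i] <- abs(m.s / sm.s) > qt(1 - alpha / 2, df = n.months - 2)
}
mean(reject, na.rm = TRUE)   # empirical false-positive rate; values well above 0.05 are "the problem"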

Above, you can see that when an analyst applies either method to data that really is ARIMA(1,0,1) with ar=0.698 and ma=-0.322, they will "reject" a true null hypothesis too often. The rejection rate is a bit worse when ARIMA(1,0,1) is used, but neither method results in acceptable rejection rates, so both would require fixing. Because I think it's a bit silly to 'fix' a method whose motivation is to give one something to use without access to more powerful tools (like R, Monte Carlo, etc.), I proceeded to worry only about adjusting the out-of-the-box ARIMA-R method.

I recognized that the reason the out-of-the-box ARIMA-R method is imperfect is that it estimates the true magnitudes of the parameters (art, mat, σt) from the sample values obtained in each realization (i.e. Monte Carlo run). That is, the 'method' programmed in R estimates the standard error in the trend, sms, using (ars, mas, σs), where the 's' stands for 'sample'. One then accepts or rejects a null hypothesis that the true trend is mt based on whether the ratio ms/sms is greater than a particular value called "tcritical". I had been estimating "tcritical" using the Student t distribution with the appropriate number of degrees of freedom. There are reasons why this is likely not correct: one is that the Student-t distribution was developed for testing when residuals are normally distributed and statistically independent. By definition, ARMA(1,1) residuals are not statistically independent.

Nevertheless, it is useful to reject based on the magnitude of the ratio ms/sms. The question is: what tcrit should I use? I decided to estimate the best value empirically by running "M" (10,000 here) cases, computing the "t" value for each case, and then finding the value of t that rejected 5% of trends.
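As a sketch of that step (assuming t.ratios is a vector holding the ms/sms values collected from a loop like the one above, and n.months the series length), the empirical critical value is just the 95th percentile of |t|:

t.ratios  <- t.ratios[is.finite(t.ratios)]
t.crit    <- quantile(abs(t.ratios), probs = 0.95)   # the |t| that rejects exactly 5% of true nulls
t.student <- qt(0.975, df = n.months - 2)
t.crit / t.student                                    # the ratio examined next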

The ratios of tcrit/tstudent found for a range of months are shown below as open purple circles:

Recognizing that each case contains noise, I decided to find the best-fit line through log(tcrit/tstudent − 1) = a·Months + b, and then used tcrit(Months) as the best estimate of the critical value in a t-test. (This choice may or may not hold up long term. I would need to run a larger number of Monte Carlo cases to know whether that functional form really works.)
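A sketch of that fit, assuming months holds the series lengths tested and ratios the corresponding tcrit/tstudent values from the Monte Carlo:

fit.ratio <- lm(log(ratios - 1) ~ months)     # fits log(tcrit/tstudent - 1) = a*Months + b
a <- coef(fit.ratio)["months"]
b <- coef(fit.ratio)["(Intercept)"]
# back-transform to get a smoothed critical value for any number of months
t.crit.fn <- function(Months) (1 + exp(a * Months + b)) * qt(0.975, df = Months - 2)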

Of course, if I applied these t values to the same Monte Carlo data they came from, the method would work perfectly. To get a test of repeatability, I ran it on freshly generated series. The resulting new rejection rates using the tcrit values are shown below:

Image after revision:

Caveat:
It's worth noting that this way of testing the method is imperfect, but the imperfection is not apparent in my test. The reason is that the tcritical I have found applies to ar=0.698, ma=-0.322. It is, in some sense, "perfect" for that case. However, when I apply this method to test the hypothesis using observational data, I can never know the correct values of ar and ma for the "true" process. I will have estimates of ar and ma based on the observations. I can then estimate tcritical based on the estimates of ar and ma. To the extent that the estimated values of ar and ma are 'not true', my value of tcritical will be slightly off.

A second order correction for this error could be ginned up. (While writing this paragraph, I thought of how to do it. It would require a bit more coding. I’m going to think about whether my idea is right before coding though.) For now, I’m assuming that the larger portion of the error is accounted for by the method I am using. In my next post, I’ll show how application of this method affects the uncertainty intervals for GISTemp data at short time periods.

(Then, I'll announce the ice bet winners! I'm just waiting a bit 'just in case' NSIDC suddenly announces an error or something. I don't think they will– but I don't know whether slight revisions are unprecedented or not.)

Update on Caveat
I wanted to estimate whether the values of 't_crit' above might differ from the 'correct' values. Truly correct values would be based on knowledge of the 'true' ar, ma parameters, but we will always estimate based on sample values. To estimate the size of the difference, I did the following (a sketch in R follows the list):

  1. Create a time series with ‘N’ months based on the “true” values of ar and ma.
  2. Estimate ars and mas from that series.
  3. Using ars and mas create a new time series. Find t_2=m/sm for this series. I call this t_2 to denote it’s the “second correction”. (t_1 would be the ratio of m/sm computed using synthetically generated series with the ‘true’ ar, ma values.)
  4. Repeat M times.
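Here is a rough sketch of those four steps; the series length, the loop count, the use of a linear trend regressor in the second fit, and the simple skipping of failed fits are assumptions for illustration.

M <- 5000; n.months <- 120
ar.true <- 0.698; ma.true <- -0.322
X <- cbind(trend = seq_len(n.months) / 12)
t2 <- rep(NA, M)
for (i in 1:M) {
  # step 1: a series generated with the "true" ar, ma
  y1   <- arima.sim(model = list(ar = ar.true, ma = ma.true), n = n.months)
  fit1 <- tryCatch(arima(y1, order = c(1, 0, 1)), error = function(e) NULL)
  if (is.null(fit1)) next
  # step 2: sample estimates ars, mas from that series
  ar.s <- fit1$coef["ar1"]; ma.s <- fit1$coef["ma1"]
  if (abs(ar.s) >= 1) next                    # arima.sim needs a stationary AR part
  # step 3: a fresh series generated from the sample estimates; t_2 = m/sm for this series
  y2   <- arima.sim(model = list(ar = ar.s, ma = ma.s), n = n.months)
  fit2 <- tryCatch(arima(y2, order = c(1, 0, 1), xreg = X), error = function(e) NULL)
  if (is.null(fit2)) next
  t2[i] <- fit2$coef["trend"] / sqrt(fit2$var.coef["trend", "trend"])
}
# step 4: after M repeats, compare the distribution of t2 to the t values obtained
# from series generated with the "true" (ar, ma), e.g. via their 95th percentiles
quantile(abs(t2), probs = 0.95, na.rm = TRUE)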

I compared the distribution of ‘t’ obtained this way to those obtained from the ‘true’ (ar, ma) used in step 1. These t_2 are roughly 1% smaller. The ratio of t_1/t_2 computed using 5000 ‘runs’ is shown below:

From this I infer that the t_criticals estimated by the method described in the blog post above are too small and should be increased by roughly 1%. This is a small difference, and I might be justified in not doing the computations to account for it. For now, I'm going to carry along the computational overburden of computing the adjustment, and use it only if the 2nd-order correction suggests I widen the uncertainty intervals.

Code: SDrunsV_ARMA11_forKen2 (unless I "saved" it wrong when converting the code to .txt).
Update 10/3/2012: The code above does contain a bug in the “create runs” portion. For some reason, I included an unnecessary spin-up period and failed to remove the extra months. Oddly, owing to later operations, the bug made less of a difference than I anticipated. The images have been replaced. There are no changes in the narrative.

23 thoughts on “Correction for ARIMA out of the box confidence intervals.”

  1. The “Tamino-SRJ” green method is what in my plots are called “arma(1,1) estimated over trend period” or as we call in in the other thread “Taminos method with unfrozen arma-parameters”, right?

  2. The “Tamino-SRJ” green method is what in my plots are called “arma(1,1) estimated over trend period” or as we call in in the other thread “Taminos method with unfrozen arma-parameters”, right?

    Yes. Sorry– I didn’t make that clear. (I was planning to show “results” first– and then it would have been clearer. But then I thought for logic, it would be better to show the method first.)

    I can’t really do the other two methods here. (I don’t think. Anyway, my code isn’t set up to do them.)

  3. To do it with frozen arma-parameters you just need to move the estimation of nu_giss out of the loop. Here is only that part of my code:
    ## get arma parameters from 1975
    shortstart <- 1975
    grecent <- subset(gtemp, yr_frac >= shortstart)
    gm1 <- lm(Anomaly ~ yr_frac, data = grecent)
    res <- residuals(gm1)
    q = acf(res,plot=FALSE)$acf[2:3]; q
    phi = q[2]/q[1]; phi
    nu.giss_1975 = 1+2*q[1]/(1-phi); nu.giss_1975

    # regression loop
    for(i in 1:((nr-1)/12) ) {
    Start <- gtemp[(1+12*(i-1)),1]
    est[i,1] <- Start; est[i,2] <- end
    # pick out a subset of data
    grecent <- subset(gtemp, yr_frac >= Start & yr_frac <= end)
    n <- nrow(grecent)
    ## fit a linear model through these recent data
    gm1 <- lm(Anomaly ~ yr_frac, data = grecent)
    # standard error corrected with frozen arma parameters
    # nu.giss_1975 is calculated outside the loop
    est[i,7]=summary(gm1)$coef[,2][2]*sqrt(nu.giss_1975)
    }

  4. SRJ–
    It’s not that I don’t know how to do it. With respect to my montecarlo, to “freeze” based on 36 years (or whatever) I need to:

    Create a 36 year series with known AR, MA, sigma.
    Find AR_long, MA_long, sigma_long for those.
    Then trim it to the "M" months of the series of interest.
    Then do whatever….

    It's doable. But I just started by generating "M" months.

    The main reason I’m not going to add that to the monte carlo bit is that the purpose of the monte carlo is to figure out how to correct the “out of the box ARIMA-R” method.

    I don't think there is much point in "correcting" the Tamino method. The main reason I don't has nothing to do with whether it is good, bad or indifferent. It's that that method was proposed specifically as something one could do without lots of computational power. That's discussed in the Lee and Lund paper, where all the math explaining why it would work (in the limit of large N) is provided. So, basically, it's a method that might be useful if you are stranded on a desert island with a long time series and a programmable calculator preprogrammed with linear regression and the ability to compute correlation coefficients, and not much else.

    In that case, you aren't going to run Monte Carlo to correct it. So… I figure I might as well just correct ARIMA-R. (Especially since, if I were to show the "success rate" of even applying the method, ARIMA-R blows up less often. In the truly short number-of-months limit, the "Tamino SRJ Green" method blows up fairly often.)

  5. The comments type size was dramatically reduced on my computer image. Is it an error on my end, or are you attempting to filter out old eyes? I did want to comment on this thread.

  6. Kenneth–
    It might be your eyes. I can’t tell…. I set firefox to override any defaults and show me BIG FONTS all the time.

    Does the small font start at some particular location? If it does, I can look for a tag that might not be closed.

  7. lucia (Comment #104444)

    “critical based on the estimates of ar and ma. To the extent that the estimated values of ar and ma are ‘not true’, my value of tcritical will be slightly off.”

    Lucia, it starts right there with the beginning of that excerpt.

  8. Ken, it should be fixed. I had closing italics tags where I needed closing subscript tags. If it's not fixed, let me know.

  9. Lucia, I am well into my analysis, and the approach you have for determining CIs – without the use of Monte Carlo analysis – is, I think, to use the se from the arima function to determine those intervals. I also assume that when you consider the limitations of using smaller segment lengths, the limitation would apply to both cases, i.e. freezing and not freezing the arima model coefficients.

    I think I will be able to show that the wider CIs obtained from freezing the arima model coefficients arise because those coefficients are larger for the entire series than for the smaller segments towards the end of the series, and larger coefficients (at least for the ar coefficient) will cause the CI range to expand.

    My Monte Carlo analysis with 10,000 simulations finds the correct mean trend very closely over all time segments, with the upper and lower CIs obviously varying with segment length. The means of the ar (ma) coefficients in my simulations are biased low (high), with the biases increasing at smaller segment lengths. I will need to determine why this is occurring. I'll look at cases where there is no trend and where the SD is smaller.

    I am tracking the se values (sqrt(var.coef)[4,4]) along with the trends and ar,ma and sd of the residuals in my Monte Carlo replications and thus far the mean of the se values tracks well, but not exactly, the CIs. Of course, with the Monte Carlo I can put CIs on the se values also.
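    For reference, here is a minimal sketch of how that se can be extracted (an illustration assuming an arima(1,0,1) fit with one trend regressor, not Kenneth's actual script): with coefficients ordered ar1, ma1, intercept, xreg, element [4,4] of var.coef is the variance of the trend coefficient.

    X   <- cbind(trend = seq_along(y) / 12)       # y is a monthly anomaly series (assumed)
    fit <- arima(y, order = c(1, 0, 1), xreg = X)
    se.trend <- sqrt(fit$var.coef[4, 4])           # same as sqrt(fit$var.coef["trend", "trend"])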

  10. On the caveat: I'm finding that the 'adjustment' for the t_critical is off by roughly 1-2%. I would need to inflate by about that amount. It takes more computational time to compute it– but since the adjustment requires widening the error bars, I'm going to do it. I can always do laundry, vacuum, etc. while letting R run. 🙂

  11. Lucia, I have linked some of my results from 10,000 simulations of 2 arima models using time periods from 119 to 443 months. I used the same arima coefficients for all time periods in order to see what happens with the ar and ma coefficients, the trend slopes, the standard deviations of the residuals, and the associated CIs as one progresses to shorter time periods. I also determined the mean se using sqrt(var.coef[4,4]) from each simulation. I used arima.sim to obtain 10,000 simulations of each of the 2 arima models (as described in the linked tables below) and used the arima function to obtain the parameter values presented in the tables in the first two links below.

    From the tables in the first two links below it is apparent that the mean ar coefficient changes to smaller values as you progress to shorter time segments, and the ma coefficient changes to less negative values with the same progression. The mean of the trend slope remains nearly constant in this same progression. Comparing the trend slope CIs and the mean of the 2se values shows that these values track fairly well, but that for the shorter time periods the se would provide somewhat more narrow CI ranges.

    In order to allay my concerns with the changes in ar and ma coefficients on going to shorter time periods, I looked at simulations without trends, smaller standard deviations, using a burn in period and finally with using models with only an ar coefficient. The determining factor was obvious when it was found that models with a single arima coefficient, as in the difference between an arima(1,0,1) and arima(1,0,0), had very little change in the ar coefficient when used with shorter segments. I think this is a result of the arima function compensating the less positive ar coefficient by pairing it with a less negative ma coefficient. What is important in this analysis is realizing that the mean trend is very stable and that would not occur if there was not some compensating effect occurring – I think.

    To further analyze this concern I used a short segment of 143 months with the arima model as described in the table in the third link below and produced 10,000 arima simulations which with the arima function were used to produce the results in the first two tables linked below, but in this analysis the 10,000 results were placed into subsets by the ar value. The table shows the compensating effect of the ma coefficient on the changing ar coefficient. It also shows that the trend means and CI ranges are stable with the ar/ma combinations. It also should be noted that the CIs that would be produced from the mean 2se values are not stable when the ar/ma coefficients are further from the mean condition.

    Finally, I wanted to reassure myself that a larger ar coefficient, as we see over the entire series from 1975-2011 and the longer segments of the series, leads to larger CIs for the trend slope, and that is why I used two different arima models, with the first derived from the shortest segment of the 1975-2011 series and the second from the longest segment. In other words, I wanted to confirm that the wider CI limits obtained for the shorter series segments when the arima coefficients were "frozen" were the result of using the arima coefficients for the entire series versus those resulting from arima models of the shorter segments. In comparing the CI and se values between the two tables in the first two links below it is readily apparent that the two models have very different CI ranges over all segment lengths. Therefore, I think we can say there are two countervailing effects at work in changing CI ranges in progressing to shorter segments: on one hand, the shorter segments produce the expected larger CI ranges from fewer degrees of freedom, but that effect is somewhat counterbalanced by the shorter series having arima models with coefficients that would produce smaller CI ranges if all other effects were equal.

    http://imageshack.us/a/img37/4241/simulationsarimaghcn1.png

    http://imageshack.us/a/img839/513/simulationsarimaghcn2.png


    http://imageshack.us/a/img33/4527/simulationsarimaghcn3.png

    I cannot see any reasons from this analysis for not using the ar/ma coefficients for each series segment in determining the CI limits for the trends of the segments. It appears that a Monte Carlo treatment is required for determining trend CIs until other methods can be shown to produce similar results. I am not entirely satisfied with my understanding of the changing ar/ma combinations on going to shorter segments. I have done some preliminary searches of the literature on this phenomenon without finding a good explanation. I would therefore have to admit that my Monte Carlo approach may have some problems/weaknesses.

  12. Kenneth

    the trend slopes and

    Could you define “trend slope”? Do you mean when you ran your arima, you had a known trend, and that’s the trend you got from your fit? Or do you mean something else?

    I’m going to take the liberty of moving your image links so they are near the text that refers to them. That way I’ll be able to read words and glance at the image simultaneously instead of scrolling back and forth.

  13. Ok…

    but that for the shorter time periods the se would provide somewhat more narrow CI ranges.

    Yes. This is what constitutes “the problem”.

    I have done some preliminary searches of the literature of this phenomena without finding a good explanation. I would therefore have to admit that my Monte Carlo approach may have some problems/weaknesses.

    AR coefficients for red noise are also biased low for short series. This has been discussed by Annan at his blog, in the paper discussing Schwarz, and I think in Santer's paper. That these are biased is known. I think there is a closed-form estimate for the bias with red noise. The difficulty is that you can only estimate the bias based on a sample value– and the math tells you the bias based on the known AR value.

    But… if you are concluding we need to do Monte Carlo, that's the same thing I've concluded. (I think you've concluded the trend is unbiased– which is something I wanted to know but didn't test. I just ran Monte Carlo with zero mean.)
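    (A small sketch of the AR-coefficient bias mentioned above; the (1 + 3·ar)/N term is the standard first-order approximation for the bias and is my guess at the 'closed form estimate' referred to:)

    rho <- 0.698; N <- 120; M <- 2000
    ar.hat <- replicate(M, {
      y <- arima.sim(model = list(ar = rho), n = N)   # red (AR(1)) noise
      fit <- tryCatch(arima(y, order = c(1, 0, 0)), error = function(e) NULL)
      if (is.null(fit)) NA else unname(fit$coef["ar1"])
    })
    mean(ar.hat, na.rm = TRUE)       # typically falls below rho: the estimate is biased low
    rho - (1 + 3 * rho) / N          # first-order (Kendall / Marriott-Pope) approximation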

  14. lucia (Comment #104505)

    Paul_K used the polite expression on another thread that my descriptions tended to use “iconoclastic terminology” – which I would not deny. I use trend slope and I know when I do that others don’t. I think a trend has a slope and an intercept and thus I distinguish. Trend slope is what we have been referring to as trend and is that thingy for which we are attempting to find CIs.

    I think we are in essential agreement in what is required and acceptable in estimating the trend CIs. If you had a convenient link to the Annan discussion, I would appreciate it, but no biggy as I can look on my own. What I found interesting in the simulations was that when using arima(1,0,0) the mean ar coefficient changes only a minor amount on going to smaller segments while for arima(1,0,1) the ar and ma coefficients change by considerably larger amounts.

    I have not had time to compare the CIs that you have estimated to those that I previously linked. My initial interest in this topic was looking for the best arima model to fit the temperature data, and the model I found was arima(2,0,0) and trend stationary. I think we agree that the difference between the models one might select matters mainly at the margins, but in that margin, counting which model using unfrozen arima coefficients has the most short-segment upper CIs below the 0.2 trend line, I found the counts for (2,0,0), (1,0,2) and (1,0,1) were 6, 5 and 5 respectively. The counts using frozen coefficients were, respectively, 3, 4 and 0.

    Regardless of whoever prevails in these analyses, what I find most interesting is that it appears that Tamino selected conditions/measurement method(s) that can be shown to give the maximum upper trend CI and thus be least likely to show significant differences for a downward-trending slope, while other apparently legitimately selected conditions/methods can be found that show a much lower upper CI and thus significant differences.

  15. Regardless of whoever prevails in these analyses, what I find most interesting is that it appears that Tamino selected conditions/measurement method(s) that can be shown to give the maximum upper trend CI and thus be least likely to show significant differences for a downward-trending slope, while other apparently legitimately selected conditions/methods can be found that show a much lower upper CI and thus significant differences.

    Well…. yes. And I think finding these is not difficult. The phrase "cherry picking" comes to mind. Some people seem to think that the only possible cherry picking is the start year. But that isn't so. Given a sufficiently vast assortment of plausible-sounding methods and a tendency to refuse to explain the advantages or disadvantages of one's sources, lots of cherry picking can go on.

    I’ve said it before. I see nothing to make me back away from saying that Tamino has cherry picked for large uncertainty intervals.

  16. BTW: I read the link on control charts. I have no idea how a control chart would apply to testing whether the multi-model mean is correct. It might be useful to detect whether the current temperatures are "out of control"– meaning the trend has changed. I don't think it has– and I have never claimed it has. So, I'm not going to spend my time on that particular test, because the question it addresses has, in my view, not even the tiniest chance of resulting in the answer "yes". But if Steven Goddard wants to use it, maybe he will. 🙂

    But if your general point is that to resolve this we might need a multi-year test– yes. Because there is no way to get anyone to agree on a specific standard year that is the "right" one to use to test forecasts. But one thing I do know: you don't test the forecast using the hindcast. Counting "fail to rejects" of trends computed using temperatures from the hindcast is inappropriate if you want to test the forecast. The reason is that you are basically doing something like this:

    "I hunted through a large archive of pre-tested coins. After searching high and low, through countless coins, I found a coin that had 70 heads and 30 tails in its first 100 flips. I predict this will continue."

    Then, someone flips the coin. In the first 20 flips they get 11 tails and 9 heads and tell you that no way does this coin get 70% heads. And then you tell them that they have to count the 100 flips you claim to have done in the past.

    But that’s wrong. They can– and should– start their test to see if this magical coin gets 70% heads from scratch.

  17. Control charts would only be useful to say (with certainty) that global average temperatures are “not in statistical control”… They are auto-correlated like crazy, with negative and (mostly) positive trends that cover a wide range of data points, and with apparent oscillatory behavior at multiple time scales.
    All a control chart would say is that you have lots of “special causes” for temperature variations which need to be identified and eliminated if you want to have the global average temperature “in control”.
    But we already knew there are lots of special causes… some identified and some not.
