In comments on my earlier post, we discussed the result that if we use a linear trend + ARMA(1,1) 'noise' model for the GISTemp earth temperature series, we are currently seeing rejections of the multi-model mean of 0.2 C/decade projected in the AR4 as falling outside the 2-σ uncertainty bands. That is: that analysis is suggesting that the models are warm relative to the true trend and the difference is statistically significant. However, you will note I didn't quite say that analysis shows this. Rather, I ended the post with:
I'd also note: I suspect that if I were to run Monte Carlo cases for ARIMA(1,0,1), this method slightly under-estimates the uncertainty intervals. Note that I am not using this method in my comparison post. The uncertainty intervals I have been showing are nearly always wider than those computed using ARMA(1,0,1). But I'm under the impression the method itself is standard enough provided one is confident the 'noise' is ARMA(1,1).
It is in fact the case that if I estimate the uncertainty in repeated samples of trends using ARMA(1,0,1), the analysis gives the correct size for the uncertainty intervals as the number of months goes to infinity, but the intervals are a bit too small when we have too few months. What this means is that one needs to check for this and correct for it.
Checking and correcting is, in some sense, equivalent to recognizing you need to use the 'student t' instead of the gaussian distribution when estimating uncertainty bands on data with normally distributed errors. It's not necessarily a big deal, but just as one is required to state assumptions, correcting for bias in the false positive rate under those assumptions needs to be done.
In this post, I'm going to discuss how we can recognize that we need a correction to a method for testing whether we should reject or accept a hypothesis that the trend in global temperatures is some value m = x C/decade, and how we might correct the method we use. That discussion will be based on results of M=1000 'realizations' (i.e. runs) of N-month trendless time series that were generated using R's arima.sim with ar=0.4 and ma=0.2. (These numerical values should not be taken as representing earth values.) I'll also discuss some features of the method. The final graph shows a 'rejection' rate and represents "the problem that needs to be corrected".
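For concreteness, here is a minimal sketch of the kind of synthetic realizations I mean. The exact call is illustrative rather than a copy of my script; it just assumes the ar=0.4, ma=0.2 values and unit innovations mentioned above:

# Generate M trendless ARMA(1,1) realizations, each N months long
# (ar = 0.4, ma = 0.2, unit innovations; illustrative values only).
set.seed(123)
M <- 1000
N <- 120
realizations <- replicate(M, arima.sim(model = list(ar = 0.4, ma = 0.2), n = N))
dim(realizations)   # N x M matrix: each column is one synthetic monthly series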
Because the earlier post involved a discussion of a Tamino analysis, I will also be comparing features of the method I used to one commenter SRJ found in a 'code snip' related to Foster and Rahmstorf. I'll be calling this the "Tamino-code-snip" method. (I should probably call it the modified Tamino-code-snip method. To run the Monte Carlo I made a few modifications that prevented the method from dividing by zero or taking the square root of negative numbers. The R code is here.)
This post is intended mostly as a supplement to conversation in comments at the other post, so it's not necessarily formally organized. I'll now simply discuss some features we want in a "good" statistical method, and compare the "arima" method and the "tamino-code-snip" method based on how well they seem to achieve these goals.
Unbiased: Ideally, a statistical method should be unbiased. For example: if we are testing trends, we would prefer a statistical method which, if applied to an infinite number of realizations, gave the correct trend on average. The following graph shows the estimated bias in the trend from each of the two methods. To test this, I computed a 't' statistic by taking the ratio of the mean trend computed over the M=1000 realizations to the standard error, computed as the standard deviation of those trends divided by the square root of M:

The horizontal blue traces show the 95% confidence intervals for the 't' statistics. None of the cases tested fell outside those. (This is a bit of a surprise. There are 13 test cases; it would probably be ok if 1 test case fell outside.)
According to this test, neither method appears biased when applied to trendless ARMA(1,1) noise with the chosen parameters. Bear in mind: further tests might be required, since it's at least hypothetically possible that bias only exists when a trend exists or for certain ar and ma values. But based on the present test, both methods appear to give unbiased estimates of the trend.
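For reference, a rough sketch of how this bias check can be done for the arima branch (my illustration; the code-snip branch would swap in its own trend estimator, and arima occasionally fails to converge, which I simply skip here):

# Regenerate the trendless realizations if needed (illustrative ar/ma values).
realizations <- replicate(1000, arima.sim(list(ar = 0.4, ma = 0.2), n = 120))
# Trend (per month) from a linear trend + ARMA(1,1) fit.
trend.arima <- function(y) {
  fit <- try(arima(y, order = c(1, 0, 1), xreg = seq_along(y)), silent = TRUE)
  if (inherits(fit, "try-error")) return(NA)
  as.numeric(coef(fit)["xreg"])
}
trends <- apply(realizations, 2, trend.arima)
# Bias 't': mean trend over the realizations divided by its standard error.
t.stat <- mean(trends, na.rm = TRUE) /
          (sd(trends, na.rm = TRUE) / sqrt(sum(!is.na(trends))))
abs(t.stat) > qnorm(0.975)   # TRUE would suggest bias at roughly the 5% level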
Smallest possible Least Square Error (i.e. variance): All other things being equal, a method that provides smaller variance in the parameter being estimated is 'better'. Lower variance is "better" because it means that we will generally get an estimate that is closer to the correct value.
Below I have plotted the ratio of the variability in trends using the “tamino-code-snip” method vs. using the standard arima method precoded into R. Note that if the ratio is greater than 1, it means the precoded arima in R is “better” on this metric.
(Update: Re-ran with 5,000 realizations.)
The graph tells us that according to this test, the precoded arima method is a little "better" than the "tamino-code-snip" method because it tends to give answers that are closer to the correct value: that is, the mean square error in the estimate of the trend is smaller. Other ways to look at this are:
- If we used ARIMA and were able to estimate the uncertainty intervals correctly, the correct error bars for this method would be smaller than the correct ones using the "tamino-code-snip" method. This is purely a feature of the method, not the data.
- If we want to test a hypothesis (like m=0 or m=0.2 C/decade) and that hypothesis is wrong, we would expect to detect this with fewer months of data if we use the ARIMA method rather than the "tamino-code-snip" method.
Once again: This answer might change if I ran a different test. (In fact, I'm sure that either the answer on the bias or the variance must change under some circumstances because the Tamino-code-snip method uses OLS which is "BLUE", i.e. the "best linear unbiased estimator". So, there must be some circumstance where arima loses.)
Fidelity on confidence intervals.
Another feature we want from a test is that if we say our test would reject a correct null hypothesis 5% of the time, it actually rejects that correct null 5% of the time. We don't want it to reject 6% of the time (too many false positives) nor 4% (too few false positives).
To test whether the methods gave too few or too many false positives, I used both the ARIMA method and the "tamino-code-snip" method to test m=0, which in the case of this synthetic data is true. When doing this, I followed the practice I think SRJ did: I used qt(0.975,Neff) where Neff is the "effective number of degrees of freedom" as defined in the "tamino method". For arima, I used the number of degrees of freedom from ARIMA.
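A sketch of that rejection-rate check for the arima branch, reusing the realizations matrix from the sketches above (the critical value here is the approximate 1.96; the exact degrees-of-freedom choice is part of what is under discussion):

# For each trendless realization: fit trend + ARMA(1,1) with arima, form
# t = m / sm, and count how often |t| exceeds the critical value. The true
# trend is zero, so the long-run rejection rate "should" be 5%.
reject.m0 <- function(y, crit) {
  fit <- try(arima(y, order = c(1, 0, 1), xreg = seq_along(y)), silent = TRUE)
  if (inherits(fit, "try-error")) return(NA)   # skip cases where arima tanks
  m  <- coef(fit)["xreg"]
  sm <- sqrt(fit$var.coef["xreg", "xreg"])
  abs(m / sm) > crit
}
rejections <- apply(realizations, 2, reject.m0, crit = qnorm(0.975))
mean(rejections, na.rm = TRUE)   # empirical false-positive rate for this N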
Here are the rejection rates for both cases, with the "tamino" method in red and the arima method in black. The horizontal trace shows the "correct" rejection ratio:
Clearly, both methods reject a correct null too frequently for small numbers of months. Eyeballing, I'd say the Tamino (red) method appears to over-reject a bit less and does look better on this metric. I'd need to run more than 1000 cases and do some tests to know for sure. But it looks a bit better. (Note: the update has 5000 realizations and it is clear the Tamino method rejects at closer to 5%.)
However, it is worth noting that in principle, this issue of improper rejection rate should be largely correctable. In fact, one of the reasons the Tamino method is "better" is that he's corrected for the number of degrees of freedom while the arima method did not. This widened his error bars, and it widened them enough to more than compensate for the fact that the uncertainty is larger under the "tamino method" (as seen in the 2nd figure).
I’ve discussed correcting the error bars in the past. Ideally, the method of correction should be systematic and not permit a large measure of … uhmmm….”discretion”. Too much “discretion” is a temptation to cherry pick to get results one “prefers” or– failing that– can make people suspect cherry picking. Ideally, it should also be possible to demonstrate the properties of the corrected method using montecarlo. One also hopes that the method is sufficiently systematic that a future analyst would be able to apply “the method” to other problems without being required to apply his own unique “discretion”.
How can this method be corrected?
I observe that both methods already assume that the "noise" is ARMA(1,1); that assumption will remain unchanged.
Also: I think Tamino's method was to recognize that part of the problem lies in the fact that our estimates of the magnitude of the ar and ma parameters (and for that matter, the mean square of the residuals) are uncertain. In principle, if one was sure that the "noise" is ARMA(1,1), but otherwise the data are homoskedastic and the forced trend is constant through the period, it would be tempting to just compute the ar and ma coefficients using a longer time period. In that case one would just use the entire time series back to 1900 or so. However, if one thinks the forced trend is not linear and/or the errors are heteroskedastic and so forth, one can't use this method. When data are heteroskedastic and one thinks that the properties of the forced trend changed during the period of analysis, I have "issues" with the notion that the proper method is to have the analyst "look" at the data and decide that in some circumstances non-linearities in the forced trend introduced by volcanic eruptions can be treated as unforced noise.
So, I reject that method of "looking" at the data and picking a date range that includes non-linearities in the forced trend and treats those as part of the unforced "noise" when computing the coefficients for what one might postulate is the "universal noise process for the earth's weather system".
I would propose that the best way to correct the similar over-rejections for small numbers of months in either method shown above is to do something similar to using a student 't' when the residuals to the fit are white noise. I propose that finding the magnitude of the equivalent 't' should involve Monte Carlo done as follows:
1) Pick the time series we wish to analyze and the number of months over which we wish to apply the test. Let's say that is 120 months (i.e. 10 years.)
2) Find the best ARMA(1,1) for that series using arima. Record the parameters: arsample, masample and record the number of months “N”.
These (arsample, masample and “N”) are now ‘frozen’ and used in the following:
3) Create M synthetic ARMA(1,1) series with N months and parameters arsample, masample. Find the value of "teff,N" that, when used as the critical value in the t-test applied in all M cases, will result in the correct rejection rate. Note that teff,N is a function of the ar and ma and also of the number of months. As the number of months increases, this "teff,N" will approach the value from the t distribution (and so also the gaussian distribution). A minimal R sketch of this step follows below.
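This sketch assumes ARMA(1,1) noise and a two-sided test on the arima trend; it's an illustration of the idea rather than a finished tool:

# Step (3): calibrate t_eff for a given (arsample, masample, N) so the two-sided
# test of the true null m = 0 rejects exactly alpha of the synthetic cases.
calibrate.teff <- function(arsample, masample, N, M = 1000, alpha = 0.05) {
  t.stats <- replicate(M, {
    y   <- arima.sim(list(ar = arsample, ma = masample), n = N)
    fit <- try(arima(y, order = c(1, 0, 1), xreg = seq_len(N)), silent = TRUE)
    if (inherits(fit, "try-error")) return(NA)
    coef(fit)["xreg"] / sqrt(fit$var.coef["xreg", "xreg"])
  })
  # The critical value that rejects exactly alpha of these cases (two-sided).
  as.numeric(quantile(abs(t.stats), probs = 1 - alpha, na.rm = TRUE))
}
# e.g. calibrate.teff(0.4, 0.2, 120) should come out larger than qt(0.975, 118)
# if the usual critical values reject too often at short N.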
This method would bring rejection rates down closer to their proper values, though testing would be required to ensure that the final rate is bang on correct. It can be used in statistical applications where the time series is really, truly short, where the method devised by Tamino cannot be used. This method does not permit the analyst a whole lot of "discretion": that is, he can't ponder the advantages of expanding the time window to 30 years vs. 35 vs. 40 vs. the full 100 years.
Finally, once we have the rejection rates down to their proper values, it becomes possible to explore the power of two unbiased methods, both of which give appropriate false positive rejection rates, and select the one with the lower false negative rate (i.e. the one with higher power).
If someone is interested, I can code this method up and on Monday I can see how much this will widen the uncertainty intervals for short time periods. I don't think it will be huge, but we'll be able to get a better estimate of the correct uncertainty intervals for trends if the earth's weather noise is "ARMA(1,1)".

Hmmmm… the second I decided to code this, I realized my first thought is computationally inefficient. My first crack is going to be to estimate the t based on the ratio of the "true" spread of trends to the "estimated" ones at similar ar, ma. I'll do that with synthetic data first. (I always prefer to do it with synthetic data first.)
Lucia, I wanted to look at 1000 simulations two ways: one with the ARIMA (1,0,1) model determined for each of the 28 segments with start years from 1975 to 2002 and all with end year 2011 and the other using the entire series for fitting an ARIMA(1,0,1) model and using that model for each of the segments. What I show in the link below is the trend with CIs as estimated from the Monte Carlo simulations.
The first approach gives results similar to what you and I obtained with our codes using 28 separate ARIMA(1,0,1) model fits and shows 5 later segments with the upper CI falling below the 0.2 degree C per decade trend. The second approach shows 3 later segments with the upper CI close to the 0.2 trend line but not falling beneath it. The second approach also shows the upper CI limit lower than we see in the Tamino graph.
My question now would be how much of the change in ar coefficients is due to changes in the series towards the end (as in changing the ar coefficient) and how much is due to what you have shown, I think, about the fitting efficiency with the length of the series segment with a known ARIMA(1,0,1) model.
I have not read your post here in detail and hope I am not merely replicating part of what you have done.
http://imageshack.us/a/img811/3434/trendsarima1012ways.png

So– in both cases, you used Monte Carlo to find the uncertainty intervals? But in one case, you used the ar, ma and presumably size of innovations from 1975-now. (Tamino way)
In the other you used the ar, ma and presumably size of innovations that match the period of the trend? Is that what you mean? And then, using the Monte Carlo, you reject the outer 5% (two sided), right? (That method always rejects 5%, which is the goal, so it should be "fair".)
You get smaller uncertainty intervals when you base things on the period where you match the trend, right? That's what I thought you'd get, but I want to make sure I understand you.
(If you are doing what I think, I think you have the 'fair' method. For some reason, I was thinking in terms of wording with 't_effective', but you've deconvoluted the verbiage. Your method rejects 5%, so it meets that criterion.)
Shoot– I went to the grocery store and I think I remember why I think we need to do something more involved….. I’ll do it (probably Monday.)
Or, you can partially do it. Take the uncertainty intervals you computed using the AR, MA and residuals from your 2nd figure. Multiply by the ratio of the *standard deviation of the trends in your 1000 cases* to the *square root of the pooled variance from all 1000 cases*. I think that will be closer to "right". Explanation later… I'm going to go cook.
Lucia, you have described what I did in my post above. For completeness I wanted to do the same Monte Carlo runs on the other ARIMA models that I had found fit the 1975-2011 GHCN mean global monthly temperatures well. Those were ARIMA(2,0,0), which had the best AIC score and also had by far the best portion of top-ranked fits in my Monte Carlo exercise, and ARIMA(1,0,2), which was a leading candidate along with the ARIMA(1,0,1) model that has already been analyzed.
The results are linked below and show that with the ARIMA(2,0,0) model, using separate model fits for each of the 28 segments, 6 of the upper CI trends for the shortest segments fall below the 0.2 trend line, while using the model derived from the entire series for the separate segments, 3 fall below the 0.2 trend line for the shortest segments.
Similarly, these numbers for the ARIMA(1,0,2) model are 5 and 4, respectively. All the ARIMA models have nearly the same CI spread over the longest segments of the series for both methods used in the Monte Carlo exercise.
With the question of whether the shorter-segment trend upper CI falls below 0.2 or 0.17 being marginal, it appears that the choice of method and ARIMA model can make a difference in the final result when considering p <= 0.05 significance.
http://imageshack.us/a/img607/5696/trendsarima2002ways.png
http://imageshack.us/a/img145/9627/trendsarima1022ways.png
There is a question or two that I would like to look at, with the main one being: do the shorter segments in this series actually change such that different ARIMA ar and/or ma coefficients are required, or is it merely all accounted for in the analyses you have made showing the number of months used for modeling and the degree of fit?
Kenneth–Yes. But to correct for the "problem" associated with a short number of months, I think we need to widen the error bars you are getting with Monte Carlo. The widening exercise will involve… Monte Carlo.
Here is why we need to widen.
When arima computes the variance returned in var.coef[4,4] (or some such) it is doing the following:
1) Using the standard deviation of the residuals (sigma) of the fit it got *for that sample*.
2) Using the AR and MA coefficients it got *for that sample*.
Then, assuming these are “true”, it estimates the variability of trends you would get if you ran montecarlo for *that set of variables*. So, if you run montecarlo *for that set of variables* the standard deviations of trends will match the one ARIMA estimates.
That does not correct for the "small sample size" problem because the *problem* arises because the estimated values of AR, MA and sigma are estimates and not "true". If they were true, the linear algebra would have worked. Monte Carlo using the estimated values doesn't "fix" the problem because if you run enough cases, it will just give you the same estimate the linear algebra gave.
So… my idea is that what you do is :
1) Run Monte Carlo with a trend of 0 to create series using the *estimates* of AR, MA and sigma you got fitting to GISTemp. (When you fit each case, you will get a *different* MA, AR, and sigma relative to the "true" ones you used to create the series.)
2) For each case, obtain the *estimate* of the variance of the trend from that case, sm^2, and record the trend, m. (Don't worry about the AR, MA, sigma values, which will be off. But we aren't actually going to try to fix those yet.)
3) After running "Num = 1 shitwad" worth of cases, find the variance of the "m" over Num. Also find the average of the estimated variances sm^2.
To correct for the small-number effect, inflate your uncertainty intervals by the ratio of the standard deviation of the trends to the square root of the average estimated variance from arima.
This will probably be approximately correct. It won't be *exactly* correct; we can't be "exactly correct" in statistics ever. The rescaling might still be a little biased because the "correct" way to do this is to *know the true AR, MA* and use that information when running the Monte Carlo. But we can't know those, so this method should reduce the bias, but not perfectly. (I think no closed form solution exists, so this is the only computationally quick thing I can think of. I can think of more computationally intensive things, but I don't know that they would actually be better. So… I'm just going to do this and then run the Monte Carlo.) A rough sketch of the idea follows.
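Here ar.hat, ma.hat and sd.hat stand in for the values estimated from the GISTemp fit; treat it as an illustration of the recipe, not a finished script:

# Run Monte Carlo at the *estimated* AR, MA and sigma, then compare the true
# spread of the simulated trends to the average of arima's own variance estimates.
inflation.factor <- function(ar.hat, ma.hat, sd.hat, N, Num = 1000) {
  m  <- rep(NA, Num)   # trend from each case
  vm <- rep(NA, Num)   # arima's estimated variance of the trend, sm^2
  for (i in 1:Num) {
    y   <- arima.sim(list(ar = ar.hat, ma = ma.hat), n = N, sd = sd.hat)
    fit <- try(arima(y, order = c(1, 0, 1), xreg = seq_len(N)), silent = TRUE)
    if (inherits(fit, "try-error")) next
    m[i]  <- coef(fit)["xreg"]
    vm[i] <- fit$var.coef["xreg", "xreg"]
  }
  # Multiply the reported uncertainty interval by this ratio.
  sd(m, na.rm = TRUE) / sqrt(mean(vm, na.rm = TRUE))
}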
I'm looking at two ways of doing this: by scaling as I described and by just finding the 't_effective' value from Monte Carlo. I'll explain in a post, but after I also check to see how much of the bias remains after fixing. (I'm "betting" it will catch about 90% of the bias in the size of the uncertainty intervals. That's a wild-ass guess. 🙂 )
Hmm… unless I have a bug, I've done enough to know that if we want a fiddle factor we need to look at the quantiles of 't = m/sm', where m is the trend we got from run 'i' and 'sm' is the estimated variability. These t's have fatter tails than the student t. 🙂
Lucia, just to be clear: when I run a Monte Carlo I run the arima function on the segment of the series of interest, using the model, e.g. ARIMA(2,0,0), and obtain the ar coefficients, the slope and intercept of the regression with time, and the sd of the residuals. I then plug these values into the arima.sim function and obtain 1000 realizations from which I extract the 95% CIs and mean trend using the lm function on each realization. When I use the whole series as a model I do the same as above except now I enter the ar coefficients for the entire series for all 28 segments.
My R code is below where GNLO is the GHCN global monthly mean temperature series:
CIM=matrix(data=NA, ncol=3, nrow=28)
for(k in 1:28){
TW=window(GNLO,start=c(k+1974,1), end=c(2011,11), frequency=12)
Arm=arima(TW,order=c(2,0,0), xreg=time(TW))
AR1=as.vector(Arm$coef[1])
AR2=as.vector(Arm$coef[2])
LM=lm(TW~time(TW))
Intercept=coef(summary(LM))[1,1]
Slope= coef(summary(LM))[2,1]
Len=length(TW)
SD=sd(residuals(LM))
MatN=matrix(data=NA, ncol=Len, nrow=1000)
for(i in 1:1000){
Arm=arima.sim(n=Len,list(ar=c(AR1, AR2)),sd=SD)+as.vector(time(TW)*Slope+Intercept)
MatN[i,]=Arm
}
Q1=rep(NA, 1000)
for(j in 1:1000){
LM=lm(MatN[j,]~c(1:Len))
Q1[j]=as.numeric(coef(LM)[2])
}
CIM[k,1]=mean(Q1)*120
CIM[k,2]=quantile(Q1,probs=0.025)*120
CIM[k,3]=quantile(Q1,probs=0.975)*120
}
plot(1975:2002,CIM[,1], main="1000 Simulations of ARIMA(2,0,0) Model
for Start Year on X-Axis and 2011 End Year", sub="ARIMA Model used for Segments was Fitted to That Segment", xlab ="Start Years", ylab="Trend in Degrees C Per Decade", type="b", ylim=c(-0.2,.3))
lines(1975:2002,CIM[,2], type="b",col="red")
lines(1975:2002,CIM[,3], type="b",col="red")
abline(h=0.2,col="green")
As an aside, I found the mathematical basis of the adjustment that Tamino purportedly used for degrees of freedom in calculating the CIs for the trends in the link below. I still do not know exactly how he made the calculations leading to his graph because the code snippet indicated to me that he was dealing with each segment on an individual basis. He merely talks about the ARMA(1,1) model so as to justify his using the particular adjustment that was applied.
http://biomet.oxfordjournals.org/content/91/1/240.full.pdf+html
By the way, I have to read in better detail your latest posts before commenting.
Kenneth–
There is a blog post where Tamino describes how he gets his ar and ma coefficients using the first two acf coefficients. It’s based on the principle that for ARMA, the acf will look like
r_0=1 (by definition)
r_1=r_1
r_2= r_1*(theta)
r_3= r_1*(theta)^2
r_i=r_1*theta^(i-1).
He does an infinite sum to get his formula for N_eff in terms of r_1 and theta. So, you see theta=q[2]/q[1] in the code snip. He never has to compute the actual values of ar or ma (though one can if one wishes).
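If it helps, here is a quick numerical check of that geometric-series sum (my sketch, using the same illustrative ar=0.4, ma=0.2 values as in the post, not Tamino's code):

# For acf decay r_i = r_1 * theta^(i-1), summing the series gives
#   nu = 1 + 2*sum(r_i) = 1 + 2*r_1/(1 - theta),
# which is the nu = 1 + 2*q[1]/(1 - phi) in the code snip, with phi = q[2]/q[1].
set.seed(1)
y     <- arima.sim(list(ar = 0.4, ma = 0.2), n = 5000)
q     <- acf(y, plot = FALSE)$acf[2:3]          # r_1 and r_2
theta <- q[2] / q[1]                            # the decay ratio ("phi" in the snip)
nu.closed <- 1 + 2 * q[1] / (1 - theta)         # closed form
nu.summed <- 1 + 2 * sum(q[1] * theta^(0:200))  # truncated explicit sum
c(nu.closed, nu.summed)                         # the two should agree closely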
With respect to the code snippet, bear in mind that's from a paper and the graph is in a blog post. The blog post merely tells you that he's done this sort of thing before. He has, but I'm pretty sure he's tweaked the method. (I think, though I am not sure, that he sometimes gets theta=q[2]/q[1], and sometimes fits to the first several acf's and so on.)
So, don’t assume that the code snippet matches the blog post perfectly.
Hi Lucia
That looks interesting. I do not have the time to go into detail now, so just one quick comment about qt(0.975,Neff). In my plots in the other post, I did not use that, I just went with 2-sigma error bars. Mainly because I had forgotten how easy it was to use the qt function in R, and when I remembered that, I decided to stick with 2 sigmas in order to make comparison with your plots easier.
When I use the qt() function I have always used N as the actual number of data points in the regression, not N_eff, because I used to believe that correcting the standard errors by replacing N with N_eff was enough. Now I am not sure what the correct way is.
SRJ–
I found some t_effectives… and I think my method works now! I'll have to slap the "blue" Tamino ones on the same graph. I actually don't know if mine are wider or if his are wider. I just know I can show mine work with synthetic data, and I'll be writing a bit more.
After that, I can estimate power. I can compare the size of the intervals to models with repeat runs. (I really have been trying to find something that works for that! But.. there are some pesky issues. Including the fact that the control runs don't seem to necessarily be steady in all instances, which makes estimating the variability of *long* trends difficult!)
SRJ– I wanted to add Tamino-like error bars. Did you freeze the AR and MA only? Or the "sigmas" too? Tamino seems to have had several tweaks over time– for example, here he discusses getting all three:
http://web.archive.org/web/20081020062832/http://tamino.wordpress.com/2008/09/16/alphabet-soup-part-3a-parameters-for-arma11/
Here he reports values he used for all 3:
http://web.archive.org/web/20081006110819/http://tamino.wordpress.com/2008/09/16/alphabet-soup-part-3b-parameters-for-arma11-continued/
And here
http://web.archive.org/web/20081029173848/http://tamino.wordpress.com/2008/09/12/dont-get-fooled-again/
he creates 7 year trends and so forth– using evidently all 3 including the σ .
Which of these is done could "matter" both to the size of the error bars and to the degree to which I think his method is flawed. (You can see on one of the posts that I comment on the volcano issue.)
Hi Lucia
In those plots I showed in the other post I used the method from Foster & Rahmstorf, i.e. I estimate nu from phi (the AR term) and the lag 1 autocorrelation. And then I freeze nu and use that for all trends. I do not calculate the sigma_w.
These are the relevant parts of the code; it does both the Tamino way with frozen and unfrozen arma parameters and the arima way:
for(i in 1:((nr-1)/12) ) {
Start <- gtemp[(1+12*(i-1)),1]
est[i,1] <- Start; est[i,2] <- end
# pick out a subset of data
grecent <- subset(gtemp, yr_frac >= Start & yr_frac <= end)
n <- nrow(grecent)
## fit a linear model through these recent data
gm1 <- lm(Anomaly ~ yr_frac, data = grecent)
res <- residuals(gm1)
q = acf(res,plot=FALSE)$acf[2:3]; q
phi = q[2]/q[1]; phi
lambda=q[1]/phi
nu.giss= 1+2*q[1]/(1-phi); nu.giss
coef.gm1 = coef(gm1);
se.coef.gm1 = summary(gm1)$coef[,2]
# arma(1,1) correction with unfrozen arma
se.gm1 = se.coef.gm1*sqrt(nu.giss);
est[i,3]=coef.gm1[2]; est[i,4]=se.gm1[2]
# using arima
temp.ts <- ts(grecent[,3], start=Start, freq=12)
xreg=(grecent[,1]-mean(grecent[,1]))/sd(grecent[,1])
arma <- arima(temp.ts, order = c(1, 0, 1), xreg=xreg)
est[i,5] <- coef(arma)[4]/sd(grecent[,1])
est[i,6] <- sqrt(arma$var.coef[4,4])/sd(grecent[,1])
# standards error corrected with frozen arma parameters
# nu_giss_1975 is calculated nu_giss above, just using data
# 1975 to recent instead of changing periods
est[i,7]=summary(gm1)$coef[,2][2]*sqrt(nu.giss_1975)
}
est is a matrix to hold the estimated values, gtemp is the giss data, end is a fixed endtime that can be specified.
I think you only need sigma_w when you want to simulate an arma-process. I don't want to do that in this code, so I don't estimate sigma_w.
SRJ–
Thanks.
Technically, you only need sigma_w if you run Monte Carlo, as Tamino did in "Don't get Fooled again". When he did that he froze the triple (AR, MA and sigma_w) from 1976. I objected to that, and that's part of the reason I remember that I really objected to his choices.
Doing this:
se.coef.gm1 = summary(gm1)$coef[,2]
is equivalent to not freezing sigma_w. You aren't conscious of the "sigma_w" (magnitude of residuals) being recalculated on the basis of the fit, but R does that to estimate the uncertainty in the trend under the assumption of white noise. (Or it does something equivalent. It's involved in the process of estimating it.)
So your attempt to replicate Tamino (which you showed in blue) appears equivalent to estimating the uncertainty starting in 'year n' by:
1) Frozen: Using AR and MA based on time series since 1976.
2) Not Frozen: Using residuals from the fit to the time series since 'year n' to get the 'ols' estimate of the uncertainty in 'm' (sigma_w not frozen). Use this to estimate the uncertainty in 'm' (sigma_m). The formula which appears in undergrad text books is just sigma_m = sigma_w / sqrt((N-2)*var(time_i)).
3) Using the frozen values from (1) to rescale the "sigma_m" (unfrozen) which R calculated from sigma_w (unfrozen).
If this matches what’s in his recent figure, that’s likely what he did do to create the blog post.
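Put as code, I read those three steps roughly as follows (names borrowed from SRJ's snippet above; nu.frozen stands for the nu computed once from the long post-1976 fit, so treat this as a paraphrase rather than Tamino's actual script):

# (1) nu.frozen computed once, from the acf ratio of the long (post-1976) fit.
# (2) OLS fit from 'year n' onward; sigma_w, and hence sigma_m, are NOT frozen.
gm   <- lm(Anomaly ~ yr_frac, data = grecent)
se.m <- summary(gm)$coef[2, 2]          # sigma_m from the unfrozen residuals
# (3) Rescale the unfrozen sigma_m with the frozen nu.
se.m.corrected <- se.m * sqrt(nu.frozen)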
I’m just trying to sort things out because it seems that different things have been frozen at different times. To me it looks like:
Old “don’t get fooled” blog post: has “frozen/frozen”,
Later post you tried to replicate may have “frozen/unfrozen”,
Peer reviewed paper may have “unfrozen/unfrozen” for steps 1 and 2.
FWIW: I get much bigger error bars for shorter trends if I freeze all three. And that’s the method that I really hated way back — my comments actually got captured in the wayback version. And much of my “cherry picking” accusation stems from the full set of choices.
Lucia,
“And here
http://web.archive.org/web/200…..led-again/
he creates 7 year trends and so forth– using evidently all 3 including the σ . ”
I think he here uses the script he gave in a comment on the “How Long” post. That is a script that simulates arma-processes. The script is here:
##########################
# generate ARMA(p,q) noise, script by Tamino
##########################
armanoise = function(n.points=1000,ar.coef=0,ma.coef=0,wn.sig=1){
p = length(ar.coef)
q = length(ma.coef)
z = numeric(n.points+200)
x = numeric(n.points+200)
n0 = max(p,q)+1
z[1:n0] = rnorm(n0,0,1)
x[1:n0] = z[1:n0]
for (n in (n0+1):(n.points+200)){
z[n] = rnorm(1,0,1)
x[n] = z[n]
if (p>0){
for (j in 1:p){
x[n] = x[n] + ar.coef[j]*x[n-j]
}
}
if (q>0){
for (j in 1:q){
x[n] = x[n] + ma.coef[j]*z[n-j]
}
}
}
x = wn.sig*x[201:(n.points+200)]
x
}
About not freezing sigma_w:
Yes that is not frozen since it is given as
sigma_w = sigma_residuals/sqrt(1+(phi+theta)^2/(1-phi^2))
When doing regression over the n different time spans, we will have a new value of sigma_residuals each time. And then if we freeze the arma’s, we are dividing with the same number each time, so for each value of n we get a different value of sigma_w.
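That relation is easy to spot-check numerically, by the way (a sketch with arbitrary phi, theta and sigma_w, not values fitted to any temperature series):

# For ARMA(1,1): var(x) = sigma_w^2 * (1 + 2*phi*theta + theta^2) / (1 - phi^2)
#                       = sigma_w^2 * (1 + (phi + theta)^2 / (1 - phi^2)),
# so sigma_w = sd(series) / sqrt(1 + (phi + theta)^2 / (1 - phi^2)).
phi <- 0.4; theta <- 0.2; sigma_w <- 1.5
y <- arima.sim(list(ar = phi, ma = theta), n = 1e5, sd = sigma_w)
c(empirical = sd(y),
  predicted = sigma_w * sqrt(1 + (phi + theta)^2 / (1 - phi^2)))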
SRJ–
I’m planning to show how a variety of methods stack up.
The black uncertainty intervals are using ARIMA. We know those are too tight– but I’ll be showing how I correct them … probably tomorrow.
The blue are what you think replicates the Tamino figure (freeze AR, MA from 1976). (You chose blue for this before.)
The red is freeze AR, MA, sigma from 1976. That seems to be what Tamino did in “Don’t get fooled again.” (I have issues with that choice. Yada, yada…)
I’m missing your “green” which I suspect will fall rather near the black. I plan to show this in green because you chose that.
Right now these are all 2-sigma… I guess I need to tweak to use qt. But that’s not going to make much of a difference.
I need to show my correction. I need to pick a color… Purple? 🙂 (Adding mine requires running Monte Carlo. I've done it, but I didn't have the other traces on there. So… I don't actually know if my correction is going to be inside or outside your blue.)
SRJ– I guess he hadn’t learned of the existence of arima.sim!
He probably didn't live in fear of Mosher laughing at his using loops. 🙂
(Mosher has been a big help to me. Explaining that some things are better with vectors. Which is true. That said, sometimes vectors suck. It's fine to create an NxM array of temperatures just so I can write all the Monte Carlo stuff as two vector operations. But after I have it working, I suddenly want to do something like make N "1,000,000". The poor mac goes "Please…. Stopppp… My brain hurts!!!" (The little wheel just spins and spins.) Plus, sometimes when you have a code that works, you don't necessarily want to read how it could have been more efficient if you just learned the existence of 1 line utility blah! 🙂 )
I think it just shows that he likes to do things explicitly. Citing from Alphabet Soup, part 3a:
"Or, you can fit the ARMA(1,1) model directly using a stats program like R (yes, I've recently started using R and I like it very much). That would save a lot of work! But at least, with these posts, we can see how the numbers can be estimated from observed parameters — and I always want to know how things are computed, I'm never satisfied with numbers that 'magically' appear out of a 'black box.'"
And yep, R can be very frustrating to learn.
SRJ–
I have no objection to Tamino doing that. It’s what I tended to do until Mosher would say “why didn’t you use….?” After that I learned to search for what I might want.
R not only lets you fit using arima, it lets you generate using arima.sim, which does what Tamino is doing by hand in that code snip. You can easily verify that R does it right by checking acf's, residuals and so on. But if he wants to write his own code, that's fine.
Also, just because a package exists doesn't mean someone who uses it has to use it as a "black box". With R, you can always read the code itself. You can also check the results for cases you know how to do. arima.sim does generate ARMA(1,1). I tend to be compulsive about checking what comes out of black boxes, and the stuff it returns does have the correct ar, ma parameters, the correct residuals and so on. If I wrote my own generator, I'd still check. So, for me, it was easier to use the package and check than to write my own code and check.
I agree R can be frustrating to learn.
I think it’s pretty cool actually that he writes his own versions. Given that he developed his own variant on the DFT, it makes sense he would be interested in the mechanics.
Not “black boxing” is one of the requirements I have for my Ph.D. students. Even if they end up using a standard package (e.g., fftw), I expect them to write their own tools initially, so they have some idea what is “under the hood.”
I have no such expectation for blog posts of course, and am grateful when I can find a short-cut there. (Unless the purpose is to teach myself a method, in which case the black box is nice for “ground truth” for my own code.)
@Carrick: I like that someone would understand the DFT by writing their own. At the same time, I dislike that someone would use their own DFT for something important rather than widely-used libraries. Especially someone who has not been a professional programmer. Too much room for subtle bugs to creep in, and only one pair of eyes looking at it.
It’s similar to how the climate field has quite a few parochial, ad hoc methodologies when they could have used methods from economics and statistics. (And that most of the software is written from scratch in Fortran, and often (usually?) by non-programmers, making it hard to follow.)
Lucia, I see from the above comments that the issue foremost in my mind is being discussed, and that is whether to freeze (as you have aptly termed it) or not freeze the arima coefficients. I have been attempting to do a brute force Monte Carlo analysis of what might be wrong with not freezing the arima coefficients, as I think intuitively that not freezing is the proper approach.
I can see where a given arima model with one set of coefficients could best fit the entire series and another set (and even maybe another model) would better fit a segment. In fact I can see this where a model and set of coefficients might be used to simulate a long series and then one went back and modeled a portion of that series and obtained different coefficients.
I’ll link my simulation results when I have done what I think is sufficient to satisfy the questions that have arisen in my mind.
I suspect that if I had the mathematical skills and fortitude I could answer or attempt to answer my questions deductively, but with a reasonably capable computer and some R functions to use and abuse I have already learned a few interesting aspects of arima modeling and am merrily on my way to inductively approaching the problem.
I think Paul_K and perhaps others here have pointed to potential reasons for not looking seriously at shorter trend periods and I would tend to agree, but once you start looking at some of these problems and checking the results of the likes of a Tamino it is almost too much fun to stop.
Kenneth–Tamino seems to have "frozen" at least two different ways. One in his blog post "Don't get fooled again" and a different way in the more recent blog post. Whatever its merits, it appears that in the peer reviewed F&R the method was applied with no freezing.
I think that if there are any questions about cherry picking or there are any questions about the longer period having non-linearities in the forced response– absent in the shorter period, then the correct method is not freezing and correcting for error in the rejection rate due to a short period.
I think the appropriate correction can be estimated fairly well using Monte Carlo. I'll be showing that. I'm just sorting out legends, numbers, matching colors (to avoid confusing changes. I want "green" for the thing SRJ showed in green, blue for his blue and so on.)
I should add that the arguments for not freezing and correcting for the short periods are pretty strong even if you think the longer trend is similar to the shorter one. One reason is that the analysis simply requires fewer assumptions. You make an assumption about the shorter period only. You have no additional ones about the longer period. So, the correcting with montecarlo method should work under both sets of assumptions.
When you get different results by adding an assumption… well… the additional assumption might be wrong!
Carrick– I did write my own generators for AR and ARIMA back before I learned R. I did that in excel and in fortran. It’s not — after all- remotely difficult.
I wasn't so much commenting on Tamino's desire to not black-box, though I did mention he might not have been aware of arima.sim. I'm also noticing the "for" loops used in cases where "apply", "lapply" and "tapply" would work. Mosher gave me something of a hard time, mostly encouraging me to change over. He's right to encourage that, but sometimes, I still have big loops (for good reasons.)
As I’ve gotten more used to R, lots of loops get replaced with “apply”, “lapply” etc. I suspect the same would apply with Tamino.
I do, btw, know my corrected uncertainty intervals are smaller than Tamino’s now. I’m just getting my graphs sorted out– adding “helpful” guides that let people with astigmatism read off useful values more easily — and running with a larger number of realizations.
Lucia
“Whatever its merits, it appears that in the peer reviewed F&R the method was applied with no freezing.”
What do you base this on? The code for figure 6 F&R (which is a figure like the ones we are making) is not available.
Another thing: I have several times referenced Tamino's post "How Long" from 2009. I just saw that he made a new post with the same title in July this year. The original "How long" is here:
http://web.archive.org/web/20100104074329/http://tamino.wordpress.com/2009/12/15/how-long/
In the new post he derives an estimate for the standard error of the trend rate in the GISTEMP series as a function of trend length. He conservatively estimates this as:
sigma_b ≈ 0.5 / sqrt(T^3)
where sigma_b is the standard error of the trend estimate and T is the time span in years.
Here is the post:
http://tamino.wordpress.com/2012/07/06/how-long/
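Taking that expression at face value (sigma_b in degrees C per year, T in years), a quick evaluation shows roughly how it behaves. This is just my arithmetic on the quoted formula, not anything from the post itself:

# sigma_b ~ 0.5 / sqrt(T^3): approximate 2-sigma band on the trend, in C/decade.
T.years <- c(10, 15, 20, 30)
sigma_b <- 0.5 / sqrt(T.years^3)                      # C per year
round(cbind(T.years, two.sigma.C.per.decade = 2 * 10 * sigma_b), 3)
# e.g. at T = 15 yr the 2-sigma band is about 0.17 C/decade, which is roughly
# the size of trend the post asks how long it takes to distinguish from zero.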
Wayne2:
DFTs aren't a particularly complicated algorithm, so I wouldn't be particularly concerned by a person implementing their own version of that.
It's my experience that as many errors creep in by misusing the fairly cumbersome general packages as creep in through one's own coding.
Regardless of your approach, I think it’s important to do some form of regression testing.
My approach with DFTs is to do a straightforward computation of the Fourier amplitudes at the frequency(ies) of interest, then compare the output of that code with my production code.
If you're only interested in one frequency, then an intelligent implementation of the Fourier coefficient at that frequency is faster and more robust than a DFT in any case. For example, there aren't any issues with picking the right bin or the right window size etc, and of course you can compute the Fourier coefficient for an arbitrary frequency (with the DFT it's limited by window size and sampling rate).
Anyway, even in this simple example, there are places where using canned software isn't the right programming choice.
The real problem with "amateurs" is they don't know to test their code using multiple algorithms, or how to structure code in a modular fashion so more complete testing can be designed. Using canned software will never fix incompetency.
The code for foster and rahmstorf is here:
http://tamino.wordpress.com/2011/12/15/data-and-code-for-foster-rahmstorf-2011/
You need to download and unzip.
But I admit I’m going by reading the narrative of the paper. They discuss the method they use and I read no discussion of applying the method by using values fit to any particular year. I’m assuming they would describe freezing the values if they did that. That’s sort of the SOP when discussing methodology in papers.
Lucia,
Ya, sometimes loops are unavoidable in R and sometimes it's just plain easier to loop. Usually I start with a loop just to test things out and then I try to vectorize the solution. Of course the wizards on the help list (some quite famous) can vectorize in their frickin heads. Some days I stare at these one line wonders they come up with and it's quite humbling.. even for a prick like me.
in case you are wondering arima.sim looks like this on the inside
function (model, n, rand.gen = rnorm, innov = rand.gen(n, …),
n.start = NA, start.innov = rand.gen(n.start, …), …)
{
if (!is.list(model))
stop("'model' must be list")
p <- length(model$ar)
if (p) {
minroots <- min(Mod(polyroot(c(1, -model$ar))))
if (minroots <= 1)
stop("'ar' part of model is not stationary")
}
q <- length(model$ma)
if (is.na(n.start))
n.start <- p + q + ifelse(p > 0, ceiling(6/log(minroots)),
0)
if (n.start < p + q)
stop("burn-in 'n.start' must be as long as 'ar + ma'")
d <- 0
if (!is.null(ord <- model$order)) {
if (length(ord) != 3L)
stop("'model$order' must be of length 3")
if (p != ord[1L])
stop("inconsistent specification of 'ar' order")
if (q != ord[3L])
stop("inconsistent specification of 'ma' order")
d <- ord[2L]
if (d != round(d) || d < 0)
stop("number of differences must be a positive integer")
}
if (!missing(start.innov) && length(start.innov) < n.start)
stop(gettextf("'start.innov' is too short: need %d points",
n.start), domain = NA)
x <- ts(c(start.innov[1L:n.start], innov[1L:n]), start = 1 -
n.start)
if (length(model$ma)) {
x <- filter(x, c(1, model$ma), sides = 1L)
x[seq_along(model$ma)] <- 0
}
if (length(model$ar))
x <- filter(x, model$ar, method = "recursive")
if (n.start > 0)
x <- x[-(1L:n.start)]
if (d > 0)
x <- diffinv(x, differences = d)
as.ts(x)
}
so in the end if you want to write your own version you can just steal this code or steal any of the code for any function and adapt it as you see fit. When you are done, let me know and I can build a package for your code and then you can publish it on CRAN.
easy peasy.
SRJ:
True variability dropping as 1/T^(3/2) happens for anything and everything that is stationary with ARIMA(p,0,q). That includes ARIMA(0,0,0) (white noise), ARIMA(1,0,0) (red noise) or ARIMA(1,0,1) (which can either be the monthly average of a Wiener process or the sum of red and white noise.)
So, that doesn't tell us much of anything about what he froze nor the model he used. It just tells us he thinks the process is stationary (which I agree with) and that it's not fractionally differenced (which I'm not absolutely sure about.)
The noise model one applies to data affects the estimate of “sigma_b”. And oddly, what is appropriate when forecasting how long we might have to wait can differ from what is appropriate when estimating variability given known forcings. The reason is that if I have to estimate the uncertainty in the upcoming 10 year trends, I don’t know if a volcano will erupt. But if I try to estimate the uncertainty given knowledge that none erupted, I can take that part of the uncertainty out.
Lucia
”
The code for foster and rahmstorf is here:
http://tamino.wordpress.com/20…..torf-2011/
You need to download and unzip.
But I admit I’m going by reading the narrative of the paper. They discuss the method they use and I read no discussion of applying the method by using values fit to any particular year. I’m assuming they would describe freezing the values if they did that. That’s sort of the SOP when discussing methodology in papers.”
I know, I linked to it in the other thread. I have gone through the code line by line, the code for making figure 6 is not there, i.e. it is not available. Unless it was added since I downloaded the zip folder almost a year ago.
I think that they froze the arma-parameters, at least that seems to be what Tamino normally does when doing these kinds of plots, and it fits with his point about “using lots of data to estimate the arma parameters”.
Btw, it is quite impressive, all this Monte Carlo stuff you have made on this subject.
edit to add this in response to Lucia # 104429
Yep, that tells us nothing about what he froze or not. But from reading the post he arrives at the simple expression I quoted from a more complicated one; along the way he inserts nu = 10.6 (number of data points per effective degree of freedom). I get that value of nu for giss data from 1975-2012. So in our terminology I think we should say that this expression is based on frozen armas from 1975.
SRJ–
I only have what you pointed to.
I know in blog posts he normally freezes. Maybe he froze in the paper. Maybe not. I was just assuming some text in the narrative would mention the concept if he’d done it. But… maybe not.
Monte-carlo is easier than theory. You just run it.
Despite the choppy jump into this post, I'm glad I did this first, because I realized I do want to look at the "2nd order correction" to see if the first correction probably caught most of the problem that leads to rejecting too much.
That way, I can show final corrected GISTemp graphs once instead of first ones which might need to be corrected again!
response to Lucia # 104429
Yep, that tells us nothing about what he froze or not. But from reading the post he arrives at the simple expression I quoted from a more complicated one; along the way he inserts nu = 10.6 (number of data points per effective degree of freedom). I get that very same value of nu for giss data from 1975 to June 2012 (the same time span as was available when he wrote the post). I also get the same noise variance as he does.
So in our terminology I think we should say that this expression is based on frozen armas from 1975, since he uses the estimated nu to derive the equation I quoted. That equation he then uses to investigate how many years of data are needed to obtain a trend statistically different from 0.
@Carrick: True, using canned software won’t fix incompetence. But writing your own from scratch — and I agree that a DFT is a poor choice to make my point — requires an additional level of competence and conscientiousness above and beyond that needed to understand the data and use the canned software properly.
I’m not saying that programming is a job best left to professional programmers. (And pretty much anything you do in R, for example, is programming to some degree, no matter how many packages you rely on.) I’m just nervous when data analysis becomes more opaque due to novel tools, novel methods, novel terminology, etc. Using “canned” software properly removes layers, in my opinion.
SRJ–
OK. Then you are probably right.
It can’t be 1975 because UAH and RSS didn’t start then. So, maybe 1980? Same idea though. Maybe he used “frozen, frozen, frozen” in the paper too. (Or “frozen, frozen, unfrozen”)?
Probably one would have to ask him. I can’t. I’m banned due to “the 2nd law of thermodynamics incident”.
It can. But once I verified arima.sim does what I want it to do, I figure it's a lot easier just to use it. I don't need to maintain the library and, more importantly, the error and warning handling is built into some of these packages. If you look at the Monte Carlo code for the next post you'll see it was nice that arima throws out stuff that lets me know when it tanks and I can use 'try' etc.
Obviously, canned stuff has its plusses and minuses. I think the main reason students need to write some stuff is to some extent to learn that you also need to check stuff. I mean… students really ought to learn how to reality check *Excel*'s curve fitting, if only to learn that you should *know* whether it does what you think it does. (I'm pretty sure very early versions of Excel forgot to divide by N-1 when computing sample standard deviations. And… really, 99% of the time, that's what students want. It's pretty rare for anyone to sample the *entire population* so that the sample size N equals the population size.)
@lucia: Yes, that’s what I mean. Once you know that arima.sim does what you want to do, it’s easier since someone else documents and maintains it, hundreds of people around the world probably use it (and so test it to some extent), and I can easily locate and use it without depending on you. They’ve also been more likely to make sure it works with other R packages than your own code. (And I’m generously assuming you’d code in R… if you coded it in Fortran, say, that’s a whole other level to work through.)
So it removes a layer of opaqueness from your investigation, which I say is a good thing.
If you’re depending on your own code, you may have a bug, I’d have to obtain it, I may not be certain it does what you think it does, a year from now it may be missing, etc, etc. So instead of focusing on your investigation, there’s an extra layer — your programming — which also has to be investigated/reproduced/etc and I think it’s a good idea to avoid that kind of thing. Especially when the “canned” alternatives are so handy.
Lucia
"It can't be 1975 because UAH and RSS didn't start then. So, maybe 1980? Same idea though. Maybe he used 'frozen, frozen, frozen' in the paper too. (Or 'frozen, frozen, unfrozen')?"
(We are talking past each other, maybe the following will clear things up, rather than confuse more)
In that comment I was not talking about the paper but the How Long 2012 version. In your Comment #104429 you were discussing an equation I quoted from that blog post, and you wrote:
“So, that doesn’t tell us much of anything about what he froze nor the model he used.”
I assumed you talking about the blog post, but I now realize that you were referring to F&R 2011.
In my Comment #104433, I was talking about the blog post from July 2012. In that post he uses GISS data not UAH or RSS. And I replicate the same values of the parameters as he does in that blogpost using GISS data from 1975-june 2012.
What he did in F&R figure 6 I don’t really know. I think he estimated arma-parameters for each series and froze ar and ma (but not sigma_w) since 1979 to make the plot.
Wayne2:
You’re assuming that R is bug free. It’s not, and there have been changes in default behavior in packages from version to version.
If you don’t keep up with that, the defaults can change, and your script, that once worked perfectly, will suddenly be producing errors.
I recall hearing of at least one paper that got withdrawn after the authors had realized that they had been bitten by this problem (using one of the R packages).
There is no perfect solution. You have to verify commercial software just as if it were your own.
As to opaqueness… I find packages like LAPACK to be extremely opaque and often very inefficient compared to specialized code I write for a particular problem (e.g., a PDE solver).
For example I know of an example where somebody used LAPACK in C++ for a beamfinder code, so they had LAPACK, LAPACK++, BLAS, ARMADILLO, GSL, and a bunch of other dependent libraries.
I could have written the entire problem out (it was a fixed matrix size problem, low dimension) in under 1000-lines of code.
I can point to nearly as many badly written packages that lean heavily on “standard” libraries that end up being nearly impossible to port to even a more recent version of the same operating system, as I can poorly written “roll your own” programs.
There are certainly examples where I lean heavily on canned software, but even there, I treat it as part of the solution, not the full solution by itself.
I think this situation with Lewandowsky’s paper is a perfect example of the dangers in using canned software without understanding the limitations of what you are doing. Simply because you know the accepted order in which to push the buttons on a canned program, and even if you are feeding something besides garbage in, that doesn’t mean you won’t get garbage out, if one or more of the assumptions of the method aren’t met.
A lot of these problems get fixed if you know how to Monte Carlo to verify your analysis methodology isn't flawed. Some others of them get fixed if you can derive the method and implement it yourself, because hopefully by then, you know the limits and caveats needed to use the method.
For naive users, simply saying “I followed established protocol for analysis of this data” is equivalent to verifying that the method actually works on your data. For them, there is resistance to the very notion that they have to do anything outside of what they learned in Stats 501 or equivalent course.
(This last applies equally to people who do DSP work on digitized signals for which noise is always white and uncorrelated as it does sociologists & psychologists, in case anybody is sensitive about a too narrowly focussed criticism here.)
Carrick–
Oddly enough, setting up the Monte Carlo to do anything remotely complicated in R can require someone to engage quite a few of the same thoughts as writing the initial packages you might use.
For example: Say you read Tamino’s method and liked it. You think it might come in useful at some point when people “out there” aren’t going to have R handy– they’ll have EXCEL. So, you want to program it into EXCEL. Could be useful ‘out in the field’ thing.
But now… you need to anticipate everything that might go wrong in the field. You could sit down and look at the method and see that there is a problem if you ever divide by zero and/or a different problem if the computed “theta” is ever 1. That will crash the program. But… maybe you figure that’s rarely going to happen.
But now there are non-program crashing issues. What if “theta” is greater than 1? Strictly speaking, the method won’t work. Other things happen.
If you code the Monte Carlo these issues will come up. Then you do have to code in "what to do". (I just return 'null' because… well… I don't think an analyst would use this method if he actually got data that looked like that. He would decide the data are telling him that, contrary to his assumption, the data aren't ARMA(1,1) and so forth. But since I can't know what he would do, I don't want to add that to either the rejections or the fail-to-reject tally.)
carrick
“As to opaqueness… I find packages like LAPACK to be extremely opaque and often very inefficient compared to specialized code I write for a particular problem (e.g., a PDE solver).”
ya, and imagine the lovely problem I hit when R calls LAPACK (yup, R calls Fortran LAPACK) and LAPACK throws a stupid error code that is untraceable. ya, the LAPACK code appears to go through all these machinations to document various error codes and then munges them when it reports them back. I spent one week on that shit.. argg in Fortran.
@Steven Mosher: Speaking of Fortran, have you looked at or worked with GISS’s mkTsMap.f at all? I’m trying to use it and I believe that when I output 2-degree cells as monthly anomaly time series, the mean of each cell over the baseline time period should be essentially zero. But more than 4% of the means of the cells with data (i.e. land and near-land) are outside +/- 0.05.
(These non-zero means are all over the globe, though concentrated in places like central South America, central Australia, and southern Africa, with only three such cells in the continental US.)
Maybe I’m misunderstanding, maybe using R to read the Fortran raw data has an issue, maybe there’s an issue with gfortran on the Mac, I really don’t know.
@Carrick: Using your Lewandowsky example, imagine if he had coded his analysis in Fortran instead of using Mplus? In some sense, it might not matter, since we can take his description of the problem and code it using an R package or Mplus ourselves, I guess. But it’s another layer in his analysis, beyond understanding how to set up an experiment, check the data, and utilize the proper technique (including knowing if the technique’s assumptions are met).
He happens to have failed at a more foundational step, but even if he’d managed to navigate beyond that, coding it in Fortran himself would be another opportunity for him to fail and for anyone following him to get lost in a maze of passages, all alike.
Wayne2, let’s explore that: if he had written the code himself, perhaps the assumption that ordinal numbers have the same properties as numbers on a metric scale would have struck him as problematic.
If for no other reason than that, as I suggest, you should test your code using simulated data. At least until you’ve firmly convinced yourself that the method applies (in the case of languages like R where you get automatic updates by default, as I pointed out, code that once worked can mysteriously start bletching out junk on you without you touching a line of code).
Also, as Lucia points out, the very process of Monte Carlo’ing forces you to deal with fundamental, data-related issues. I don’t see writing PCA code on top of that as particularly onerous (nor do you really have to here); my real objection is the failure to verify.
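As an aside, the “test it against simulated data” check is cheap to do. A minimal sketch, with arbitrary illustrative noise parameters and a deliberately naive OLS trend test standing in for whatever method is actually being vetted:

set.seed(1)
n.months <- 240
n.reps   <- 1000
reject   <- logical(n.reps)

for (i in seq_len(n.reps)) {
  y <- arima.sim(n = n.months, model = list(ar = 0.5, ma = 0.3))  # trendless ARMA noise
  t <- seq_len(n.months)
  p <- summary(lm(y ~ t))$coefficients["t", "Pr(>|t|)"]
  reject[i] <- p < 0.05
}

mean(reject)  # should be near 0.05 if the uncertainty intervals are honest;
              # for autocorrelated noise and naive OLS it comes out far higher

If the empirical rejection rate on trendless data isn’t close to the nominal 5%, the method (or the code) needs fixing before it touches real data.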
If I were to do this project from scratch, I’d probably use code from Numerical Recipes to do the heavy lifting. A lighter (in borrowed code) solution like that is pretty easy to read through and dissect. Code like LAPACK, which has been heavily optimized for large matrices, really not so much.
At a more fundamental level, the problem wasn’t that he used canned software, but that he used an established protocol in his field for analyzing data uncritically.
He seems to assume that because he follows the protocol, sort of like a medical doctor in the emergency room, that he’s covered in case of failure. In medicine (and clinical psych to a lesser degree) you get exposed legally if you deviate from an established protocol. In science you get exposed academically as a sheep when you don’t use your brain.
Carrick
I think that what you are saying is that the user has to know what his/her tools actually do. That holds in all fields. Who is to blame for the fact that the second word in the Arden edition of Shakespeare reads “hse”? Is it a proof-reading error, something that has survived since the First Folio, or an example of Shakespeare’s gender-tolerance? (I made up the example before anyone starts hassling the Folger Library.)
@Carrick: You assume that Lewandowsky would do a better job because he had to do it himself. That’s what you would do, because you’re conscientious, inquisitive, and skilled. My guess is that he would have simply added an additional set of errors and obfuscation to his work.
Let me come at it from the opposite tack. I’m working on a paper that addresses Hansen’s “Loaded Dice” paper. In my first cut at investigation, I decided that there are odd things going on in the GISS homogenization and aggregation, so I decided I’d use actual GISS station data (yes, that doesn’t bypass the homogenization issue), but would then use thin-plate spline interpolation to come up with my own gridded set of data.
Eventually, I realized that if I took that approach, I’d get bogged down in arguments about my gridding and never get to my main point. First they’d argue if TPS was appropriate at all, then if it was appropriate for extreme data, if it was appropriate for correlated data, and on and on. So I decided to go with GISS gridded data, and in fact the exact same data as Hansen used. If I could make my point with that data, I’d skip a whole bunch of up-front disagreement.
Imagine I’d still gone with the TPS idea, and had decided to roll my own. Now that would become an issue: did I do it right or not? On the other hand, if I’d used (which I did) the Tps function from R’s fields package, I could note that I’d done: '1920.1949 <- Tps (coords (m.1920.1949), marks (m.1920.1949), lon.lat=T, scale.type="unscaled")'. You'd be able to replicate what I did, and you'd be able to look up the documentation and note that I hadn't specified a lambda, so it was determined by GCV. You'd be able to ask me what lambda was chosen, and what equivalent degrees of freedom that corresponds to.
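For instance (a sketch only, with made-up object names: ‘pts’ an n x 2 matrix of longitude/latitude and ‘z’ the values being interpolated), those follow-up questions could, if I recall the fields object correctly, be answered straight from the fit:

library(fields)
fit <- Tps(pts, z, lon.lat = TRUE, scale.type = "unscaled")

fit$lambda    # smoothing parameter chosen by GCV, since none was specified
fit$eff.df    # the corresponding effective degrees of freedom
summary(fit)  # reports both, along with the GCV search details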
If I rolled my own, it might not be standard; it might not be parameterized with lambda at all. At best, you'd throw out anything to do with what I did and decide you'd have to simply try to replicate my now-black-box results as best you could.
There’s no guarantee that using something “canned” will be better, or will make you smarter, or even make your teeth whiter. But rolling your own introduces another layer to be slogged through and validated, with less visibility than a canned approach offers.
Someone like yourself or Lucia can use a canned approach or roll their own entirely from first principles or anything in between. But you're the exceptions, and I'm thinking of the rule.
Wayne2:
My point is that if he’s not capable of doing this himself, he shouldn’t be doing it. That’s like letting an untrained janitor analyze your data for you. If he’s not capable of doing more than following a set of rules he learned in first-semester grad school, he’s got no business touching data or writing papers, at least not without a strong stats person working that part for him. (I’m not talking about somebody who spends his time generating reports for government entities here either. Those don’t ever get tested for validity by third parties.)
I would typically do both (see our example about DFT versus Fourier coefficients). I might end up using the commercial package if there were political concerns with using my own. Regarding gridding… plenty of people use their own algorithms and nobody asks questions. (I do, for example; Nick Stokes is another. If you haven’t looked through his code, since you’re interested in this issue, I’d recommend it.)
This is the way I see this: The first step in science is to convince yourself you’ve got the right answer. Code replication, Monte Carloing are key to that here.
The second is figuring out how to make a forceful argument once you’ve done that.
@Carrick: “My point is if he’s not capable of doing this himself, he shouldn’t be doing it.” I agree 100%. But the fact is he may choose to do it even though he’s not capable, and even if he’s capable he may do it incorrectly. We, as readers, come into the discussion after that choice has been made by the author.
“The first step in science is to convince yourself you’ve got the right answer. Code replication, Monte Carloing are key to that here.
The second is figuring out how to make a forceful argument once you’ve done that.”
Very good distinction, and perhaps I see the light now. I believe your two steps are those of the author. I’m looking at those two steps from the perspective of the consumer and adding a third step: “The third step is for readers to convince themselves that your forceful argument is well-grounded and correct.”
As I mentioned about the paper I’m working on, I’ve redone my analysis three or four times now. The original analysis persuaded me that I had a point, but as I’m thinking about persuading others I am simplifying and standardizing my procedure so I can make a stronger argument and not get bogged down in arguments about my data/procedure rather than my main point.
The paper (the forceful argument) does contain unique elements that must work well so as not to obscure the analysis. In that sense, it’s like a window into the analysis. But the analysis itself also should lend itself to clarity and common understanding, and thus your two steps are more like a cycle that starts with step one, but may iterate from step two back to step one.
As an example, if you use non-standard terminology in step two, you will make your point more obscure: you’re introducing an additional step for your readers who now have to translate from your quirky vocabulary to something they understand. And I think this is analogous to doing step one with a 3000 line Fortran program where 100 lines of R would accomplish the same thing more understandably to the audience, and more easily replicated.
Perhaps you always take a complete black box approach to validating something you read. Just get the inputs, assumptions, and outputs and skip their “forceful argument” entirely. Do the analysis as you would do it, and if their answers are consistent with yours, they are probably right, otherwise they are probably wrong. In that case, my ideas make no sense. You don’t care if they programmed machine code into their Xbox or consulted with dolphin oracles, it’s a black box to be tested.
Perhaps I’m naive in thinking that simplification and commonality make it easier for me to follow/replicate/question a forceful argument. Regardless, an argument that forces me to treat it as a black box is not forceful in my book.