Recently, Tamino described a statistical model for “weather noise” using an ARMA(1,1) model. He computed his ARMA(1,1) “weather noise” using GISS observations since 1975. I think this choice is inappropriate because the statistical model and the analysis period fail to account for an important observation about the variability of observed “weather”.
That is: Tamino’s analysis method does not distinguish between the variability in Global Mean Surface Temperature (GMST) during periods with no major volcanic eruptions and the variability during periods soon after large stratospheric volcanic eruptions. The consequence is that his analysis overestimates the frequency of 0C/century trends during periods when the stratosphere is, and remains, clear of aerosols injected by major volcanic eruptions.
So, in response, I will repeat the analysis using a time period that, while still imperfect, I think is more appropriate. Today, I will
- Describe a time frame I consider better suited to obtaining ARMA(1,1) parameters describing “weather noise” that might possibly apply to observation of GMST since 2001.
- Explain how I get ARMA(1,1) parameters based on a period of data.
- Obtain the ARMA(1,1) parameters for HadCRUT3.
- Show the distribution of 92 month OLS trends one might anticipate based on this type of “weather”.
- Show where the current HadCRUT3 case falls in the distribution.
Statistical Notion of Weather: Oranges vs. Fruit Salad
Weather data during certain periods can be described by a process consisting of a trend and noise. The trend, interpreted as a climate variable, might vary depending on the magnitude of excess forcing during an era. In contrast, the specific temperatures on the surface of the earth would be a realization of “weather.”
We can also define climate parameters that define the statistical distribution of the “weather noise”. So, for example, one might expect that if the “weather noise” arises from the non-linearities in the equations governing conservation of mass, momentum and energy, the variability will have features we can characterize. So, we might be able to observe the noise to see if it is “white” (meaning weather today is uncorrelated with weather yesterday), “red” (meaning weather has some memory, but of a specific form) or some other form.
However, while we can discuss a wide range of statistical models for “weather noise”, there is an important principle that is not strongly related to the algebraic manipulations involved in implementing the logic underlying our choices.
The logic says this: We must compare like to like. No matter how much math we do, if we start with apples, oranges, bananas and cherries, then slice, dice and squeeze, in the end, we will have a product that tells us something about variations across different fruit. But, if we wanted to learn specifically about variations across different oranges it would be better to start the analysis based on oranges only. In so far as it is possible, leave out the apples, bananas and cherries!
What does this fruit-salad digression have to do with Tamino’s analysis and ARMA(1,1)? Well, I think years with nearly constant stratospheric aerosols are the “oranges”. Years with large eruptions are the “apples, bananas and cherries”. In the analogy, Tamino’s analysis made a fruit salad of “weather”. I’m trying to focus on the “100% orange juice years”.
Why does the choice of years matter?
The period from 1975-1994 contains large variations in external forcing due to stratospheric volcanic eruptions. Since that time, the stratosphere has cleared; we might expect the variability in “weather noise” to have calmed down to the level typical of periods when Pinatubo, El Chichón and Fuego are not erupting.
Below, I include a graphic illustrating the forcing estimated from data available at GISS. (For more details read a previous post.)
Given the known impact of these volcanic eruptions, one might suspect the current “weather noise” is more likely similar to the weather noise during periods with relatively modest variations in stratospheric aerosols. During these periods, the earth’s weather is subject to ENSO, the PDO, the AMO, etc., but it is free from the influence of dramatic variations in stratospheric aerosols.
Since I believe the influence of the volcanic eruptions is to strongly affect the variability of medium-term trends, I will now repeat the analysis Tamino recently performed. I will estimate the ARMA(1,1) parameters following the method Tamino described in comments, but using data starting in November 1913 and ending in December 1943.
Why these dates?
The start date is selected to exclude the volcanic eruptions in the early period of the 20th century. The general idea is described here, but I relaxed my cut-off, raising it to 0.5 W/m2 to create a 30-year string of data that is as close to “volcano free” as possible in the observational record. (Note: this choice permits more external forcing due to variations in volcanic aerosols than exists in the current era. For this reason, we might expect my ARMA(1,1) parameters to overestimate the size of the uncertainty intervals relative to the relatively rare historic periods when the stratosphere clears. The result is, I could “fail to falsify” when I should “falsify”, and the reason is that some “volcano years” are included in my analysis.)
The end date is selected to avoid the “jet-inlet, bucket” errors that are suspected to have caused the steep dip in sea surface temperatures during 1945. Since it is difficult to discern precisely when individual errors might have occurred, I eliminated both 1944 and 1945, thus giving myself a buffer before the dip thought to be caused by the transition in measurement techniques that happened during WWII, while still providing 30 years’ worth of data to estimate the ARMA(1,1) parameters.
Now, on to the boring bits: How I got the ARMA(1,1) parameters. These will be used in a later post to determine whether the trend associated with the 92 months of “weather” since 2001 is inconsistent with 2C/century.
General Equation for ARMA(1,1) noise.
Tamino described general ARMA noise in a series of posts at his site. For my purposes, I simply need the form of the ARMA(1,1) equation that will be used to determine whether the trend since 2001 is inconsistent with 2C/century.
If the variable ‘x’ represents “noise” (that is, a deviation from a trend) in a weather process, and is assumed to obey an ARMA(1,1) process, then the value of the “nth” data point for x is said to be described by an equation of this form:

(1) xn = φ xn-1 + wn + θ wn-1

where φ and θ are parameters describing the ARMA(1,1) process and “w” is a white noise process with a variance of σw² = 〈wn wn〉.
From a statistical modeling point of view, the parameters φ, θ and σw are the climate parameters describing the variability of the “weather noise”.
Our goal is to obtain numerical values for these three parameters based on weather data. In comments at his blog, Tamino suggested this could be done by determining three parameters from a string of 30+ years of weather data and then using algebra. That I can do.
Obtaining values of parameters
Ok, so how do we obtain the parameters? First, we must find a suitable 30 year (or greater) string of data and compute three parameters:
- The standard error in the residuals: sy.
- The lag 1 autocorrelation: ρ1.
- The lag 2 autocorrelation: ρ2.
Using HadCrut3 data from November 1913 through Dec 1943 inclusive and downloaded a month ago, and EXCEL’s Linest and CORREL functions, I obtain: sy = 0.13, ρ1 = 0.57 and ρ2 = 0.47. (Values are rounded for blogging purposes, but not rounded in EXCEL. 🙂 )
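For readers who prefer code to spreadsheets, here is a minimal sketch of these three computations (detrend with an OLS line, then take lag-1 and lag-2 autocorrelations of the residuals). The function name and the synthetic check series are mine, not part of the original analysis; the real input would be the HadCRUT3 monthly anomalies for Nov. 1913–Dec. 1943.

```python
import numpy as np

def weather_noise_stats(y):
    """Fit an OLS trend to a monthly series, then return the standard
    error of the residuals (like LINEST) and the lag-1 / lag-2
    autocorrelations of those residuals (like CORREL)."""
    n = len(y)
    t = np.arange(n, dtype=float)
    b, a = np.polyfit(t, y, 1)          # slope, intercept
    resid = y - (a + b * t)
    s_y = resid.std(ddof=2)             # 2 fitted parameters

    def lag_corr(r, k):
        return np.corrcoef(r[:-k], r[k:])[0, 1]

    return s_y, lag_corr(resid, 1), lag_corr(resid, 2)

# Quick exercise on a synthetic AR(1)-ish series plus a small trend
# (362 months, the length of Nov. 1913 - Dec. 1943), illustration only.
rng = np.random.default_rng(0)
x = np.zeros(362)
for i in range(1, 362):
    x[i] = 0.6 * x[i - 1] + 0.1 * rng.standard_normal()
sy, r1, r2 = weather_noise_stats(x + 0.002 * np.arange(362))
print(sy, r1, r2)
```

For a persistent noise process the lag-1 correlation should exceed the lag-2 correlation, which is what the synthetic check shows.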
So, these are now just numbers. We can now use them to obtain the parameters for the ARMA(1,1) process. Tamino suggested I could do this with algebra. So, I’ll go ahead and do that.
Obtaining φ
It’s easiest to begin by obtaining an equation for φ by multiplying the definition of the ARMA(1,1) noise process by xn-2 and finding the expected value, represented by 〈 〉 brackets. The first step is:

(2) 〈xn xn-2〉 = φ〈xn-1 xn-2〉 + 〈wn xn-2〉 + θ〈wn-1 xn-2〉
The properties of white noise, “wn”, are such that the expected value of the product of wn and anything that occurs before “n” is zero. So, 〈wn xn-2〉 = 0 and likewise 〈wn-1 xn-2〉 = 0. That leaves:

(3) 〈xn xn-2〉 = φ〈xn-1 xn-2〉
If the process is stationary, we can use the definition of the lag 1 and lag 2 autocorrelation coefficients to obtain:

(4) ρ2 = φ ρ1, so φ = ρ2/ρ1
Naturally, I entered that formula into EXCEL, and I get φ=0.8168. This is now just a number. 🙂
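As a sanity check on the ratio step, here it is with the rounded values quoted above; the small difference from 0.8168 is just because the post’s number comes from the unrounded spreadsheet values.

```python
# Stationarity gives rho2 = phi * rho1, so phi = rho2 / rho1.
# Rounded inputs from the post; the spreadsheet used unrounded values.
rho1, rho2 = 0.57, 0.47
phi = rho2 / rho1
print(round(phi, 4))
```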
Obtaining θ
Determining the value of θ is slightly more painful. It turns out we need to remember the quadratic equation from 8th grade! But, the process is similar to the one described above. I first multiply equation (1) by xn, apply the average, work through a few more steps, and obtain:

(5) 〈xn xn〉 = φ〈xn xn-1〉 + [1 + θ(φ + θ)]σw²
and then I multiply equation (1) by xn-1, do some other steps, and obtain:

(6) 〈xn xn-1〉 = φ〈xn-1 xn-1〉 + θ σw²
Combining equations (5) and (6), and doing some 8th grade algebra, results in a quadratic equation for θ in terms of known values:

A θ² + B θ + C = 0

with A = C = (ρ1 − φ) and B = 2φA − (1 − φ²).
For Hadley data from Nov. 1913–Dec. 1943, using the already known values for ρ1 and φ, we get two solutions. But, the definition of the ARMA process requires us to pick the value between -1 and 1, so the proper choice is θ = -0.3844.
So… θ is just a number now! 🙂
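A minimal sketch of that root-finding step in Python rather than EXCEL, taking A = C = (ρ1 − φ) and B = 2φA − (1 − φ²), the combination that yields a real root inside the unit circle for these values. The rounded inputs give roughly -0.385; the difference from -0.3844 is just rounding.

```python
import numpy as np

def arma11_theta(rho1, phi):
    """Solve A*theta^2 + B*theta + C = 0 for the ARMA(1,1)
    moving-average parameter, keeping the root with |theta| < 1
    as the process definition requires."""
    A = C = rho1 - phi
    B = 2.0 * phi * A - (1.0 - phi ** 2)
    roots = np.roots([A, B, C])
    return float(next(r.real for r in roots if abs(r) < 1.0))

theta = arma11_theta(rho1=0.57, phi=0.8168)
print(round(theta, 4))
```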
Obtaining σw
Once all other values are known, the simplest way to obtain σw is to use equation (6), recognizing that 〈xn-1 xn-1〉 = sy² and 〈xn xn-1〉 = ρ1 sy², where sy was obtained by fitting the 30 years’ worth of data using Excel’s Linest. Rearranging gives σw² = (ρ1 − φ) sy²/θ. For Hadley data with the known ρ1 we obtain σw = 0.1071.
These are now just numbers!
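That last step can be checked numerically. From equation (6), θ σw² = (ρ1 − φ) sy². A sketch with the rounded values (the post’s 0.1071 comes from the unrounded spreadsheet values, so this lands slightly lower):

```python
# sigma_w from equation (6): theta * sigma_w^2 = (rho1 - phi) * s_y^2.
# Rounded inputs; both (rho1 - phi) and theta are negative, so the
# ratio under the square root is positive.
rho1, phi, theta, s_y = 0.57, 0.8168, -0.3844, 0.13
sigma_w = ((rho1 - phi) * s_y ** 2 / theta) ** 0.5
print(round(sigma_w, 4))
```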
Monte Carlo to obtain distribution of trends.
Now that we have the values of φ, θ and σw, it’s possible to generate strings corresponding to 92 months of “weather data” based on a statistical model that assumes there is a trend of 2C/century and “weather noise” around that trend. I ran 2000 of these, and computed the OLS trends for each of the 2000 cases. Here’s the distribution:

For those who like numbers, the mean trend for all 2000 cases was 2.03 C/century, and the standard deviation of trends was σT,92 = 1.61 C/century. (Tamino’s weather noise gave about σT,90 = 2.0 C/century.)
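A sketch of that Monte Carlo, assuming the HadCRUT-based ARMA(1,1) parameters derived above and a built-in 2C/century trend. The seed and spin-up length are mine, so the summary numbers will differ slightly from the post’s, but the mean should land near 2 C/century and the spread near σT,92.

```python
import numpy as np

rng = np.random.default_rng(42)
phi, theta, sigma_w = 0.8168, -0.3844, 0.1071  # HadCRUT-based fit above
months, spin = 92, 200
trend_per_month = 2.0 / 1200.0  # 2 C/century expressed in C/month
t = np.arange(months, dtype=float)

trends = np.empty(2000)
for k in range(2000):
    # Generate ARMA(1,1) noise, discarding a spin-up stretch so the
    # retained 92 months are approximately stationary.
    w = sigma_w * rng.standard_normal(months + spin)
    x = np.zeros(months + spin)
    for i in range(1, months + spin):
        x[i] = phi * x[i - 1] + w[i] + theta * w[i - 1]
    series = trend_per_month * t + x[spin:]
    # OLS slope in C/month, converted to C/century.
    trends[k] = np.polyfit(t, series, 1)[0] * 1200.0

print(trends.mean(), trends.std())
```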
Does 2C/century falsify based on the ARMA fit to Hadley?
Are you wondering whether the Hadley trend falls outside the 95% confidence interval using this statistical model? Nope.
The edge of the 95% confidence interval is -0.0109 C/year, and the trend based on 92 months of data is -0.0103 C/year. So, using the two-tailed criterion, this just squeaks inside the 95% confidence bounds. The way frequentist statistics work, that means if you accept this method as a valid way to test for significance, and use the 95% confidence interval (two-tailed), then 2C/century is not currently falsified.
Later this week I’ll repeat this for: GISS, NOAA and the Average of GISS, NOAA and HadCrut. (I can’t use the satellite data because they were not operating from 1914-1944.) I need to run monte-carlo for each individually, and haven’t done so yet. But based on the magnitude of the trends I anticipate the results will be GISS and NOAA won’t falsify 2C/century; the average of all three will probably falsify 2C/century. (But.. I do need to run the monte-carlo.)
How could the average falsify when the individual ones don’t? Well, if the errors in the measurements were statistically independent, we’d expect the average to be the one with the least measurement uncertainty, and it would usually “falsify” incorrect theories a little before the others. Readers will recall that, as it happens, I’ve always considered the averaged sets my standard, and the possibility of averaging away some errors is my reason. At the same time, we would obviously have more confidence in the diagnosis if the results from all measurement groups agree with each other.
Points to consider when interpreting these results.
There are a few points worth considering when interpreting these results.
- I used data from Nov. 1913-Dec. 1943. The choice of years used to obtain the ARMA(1,1) coefficients does affect the results. I picked these to find a 30-year period with minimal variations in forcings due to variable loading of volcanic aerosols. However, the period since 2001 has lower variability in forcing due to volcanic aerosols than the analysis period. So, this means the estimate of the variability of trends I obtained using monte carlo may be too large.
Unfortunately, I can’t get a 30 year period with a truly clear stratosphere.
- It’s possible that using the early period suffers from an additional difficulty that makes the uncertainty intervals too large. That is: past measurement periods may have had larger measurement uncertainty due to less surface coverage or less precise instrumentation. I suspect this issue is less important than the issue related to volcanism, but it still exists. If measurements were previously less precise, then this analysis overestimates the variability of OLS trends. This would mean I would “fail to falsify” hypotheses that are false too frequently.
- The end date cuts out periods after Jan. 1944. This is due to the jet-inlet/bucket transition that introduced a great deal of uncertainty in the measurements. The effect of that unusual amount of measurement uncertainty would plausibly be to increase the apparent variability of computed 92 month trends. The cut-off date is rather arbitrary, except that we know the major red-flag appears in 1945.
- The analysis simply imposes ARMA(1,1) as “truth”. I have done no tests to determine whether another statistical model might better represent the data. For example: maybe AR(1) with white measurement noise works better. Or maybe some altogether different model works better. To the extent that choosing one model over another results in different uncertainty estimates, these uncertainty intervals may be either too large or too small. That is the way of statistics.
As I said: Later this week, I’ll show results based on the other data sets. I’ll be repeating this as we accumulate more data. We can see if, over time, the results of this method line up with the results of the “Red-Weather noise + White measurement noise” method discussed in July (which, by the way, I still haven’t explained how to implement, and which I still like). 🙂
Update: Sept. 18.
I noticed Tamino posted a discussion of the precise method he used to obtain his fit. To obtain φ, he fit the natural log of the auto-correlation to the lag. (I used this method last January when discussing the effect of white measurement noise on the correlation for an AR(1) process.)
Since there are several ways to obtain the value of φ, I generated 400 months of data using Tamino’s parameters, and repeated 20 times to obtain 20 realizations. Then I computed the value of φ using three methods: 1) the simple ratio method described and used above, 2) the fit of ln(ρn) vs. n, and 3) a weighted least squares method that assumed the error in the individual ρn is constant as a function of “n”.
The results surprised me. Months ago, I assumed using method 2, based on the log fit, would give the best results on average. That appears to be incorrect. The simple ratio method worked best!
For a known theoretical value of φ=0.8493, the averages over 20 realizations of 400 months were
- φ = 0.8252 and sφ = 0.0477 for the simple ratio method,
- φ = 0.8142 and sφ = 0.0494 for the log method,
- φ = 0.9242 and sφ = 0.1417 for the weighted least squares.
So, for this specific test, the average over 20 computations using the ratio method gave the answer with the least bias and also had the smallest standard deviation. I’d have to run more than 20 cases to determine whether that method is really any better than the “log” method suggested by Tamino. However, both the method Tamino used and the one I picked before I read his post are clearly better than the weighted least squares I dreamed up.
Since the ratio method I described here is easier to implement and appears to give as good or better results than the log method, I’ll use the ratio method. (I’ll run a few more quick monte-carlo tests to figure out how the roughly 3% bias ultimately affects the reject/accept diagnosis at the 95% confidence level.)
I must say I rather like this method.
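For anyone who wants to reproduce this kind of comparison, here is a sketch of the ratio and log-fit estimators applied to simulated ARMA(1,1) data. Only φ is quoted above, so the θ and σw used here are assumptions for illustration (I reused the HadCRUT-based θ); the weighted least squares variant is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
phi_true, theta, sigma_w = 0.8493, -0.3844, 0.1  # theta, sigma_w assumed

def simulate(n, spin=100):
    """Generate n months of ARMA(1,1) noise after a spin-up stretch."""
    w = sigma_w * rng.standard_normal(n + spin)
    x = np.zeros(n + spin)
    for i in range(1, n + spin):
        x[i] = phi_true * x[i - 1] + w[i] + theta * w[i - 1]
    return x[spin:]

def acf(x, k):
    return np.corrcoef(x[:-k], x[k:])[0, 1]

ratio_est, log_est = [], []
for _ in range(20):
    x = simulate(400)
    r = np.array([acf(x, k) for k in range(1, 6)])
    ratio_est.append(r[1] / r[0])       # method 1: rho2 / rho1
    # Method 2: for ARMA(1,1), rho_k = phi^(k-1) * rho_1 for k >= 1,
    # so ln(rho_k) is linear in k with slope ln(phi).
    slope = np.polyfit(np.arange(1, 6), np.log(np.abs(r)), 1)[0]
    log_est.append(np.exp(slope))

print(np.mean(ratio_est), np.mean(log_est))
```

With only 20 realizations the two means bounce around, which is consistent with the caveat above that more cases would be needed to rank the methods definitively.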
But I have a (semi) hypothetical question. Say the historical data in a particular dataset has been known to be altered from time to time. How much could minor changes to the monthly data affect the modelling of the “low volcanic” weather noise?
Raphael–
If the data set is altered tiny amounts, the parameters would likely change tiny amounts.
The difficulty is if the data from 1914-1943 were changed a lot. The reason I’m not using data after Jan 1944 is that that could happen. (This is unfortunate, as that is a volcano-free period and the temperature dropped.)
My main problem with this analysis is that 1914-1922 or so includes an eruption. It’s not Krakatoa, but it’s there. Also, the dust was still high from the previous volcanic eruptions during that period. So, the relatively fast “up” trend at the beginning of the period could be partly due to volcanic dust clearing. But…. there is no way to avoid that and still do an analysis that shows whether, if we avoid including years with Krakatoa, Pinatubo or El Chichón going off, we get less variability.
Honestly, I don’t know what qualifies as “tiny amounts” vs “changed a lot”.
Are the changes in the GISS data from Jan 02 tiny when compared to the current data or have they been changed a lot?
Raphael–
To give a specific answer to how much the parameters would change based on a specific updates to GISS would require me to run the actual numbers. The parameters I gave are for data I downloaded in August. That’s because the new data for Sept. weren’t yet available when I did the numbers, and I ran the monte carlo using the previous ones.
Generally, the data for 1914-1944 don’t change wildly from update to update, so the parameters would probably be almost exactly the same whether calculated using the August or September downloads.
If I decide to do this again, I’ll use freshly downloaded values, and then I can report whether the parameters changed.
As for the trends since 2001– the updates do affect these. If I recalculated the 2001–July 2008 trend using the Sept. data and compared it to the result using the August data, it would change a little. This is because the most recent data are never entirely stable, and they matter.
Sorry about all that. I didn’t intend to suggest more work for you. You do quite a bit of it with little thanks. (BTW, Thank you!) I realize we have to make the best of the data provided.
Raphael–
I know you weren’t intending to suggest more work. I just wanted to explain why I couldn’t give a more specific quantitative answer.
When I do post the ARMA(1,1) for the other data sets, you’ll see the range of parameters we get with the different data sets. I’ll make a table, and that may help you see the magnitude of differences. Each measurement system gets slightly different sets of parameters.
Presumably, if one agency — say GISS — suddenly updates, the change would be small compared to the current difference between HadCrut and GISS parameters.
(One of the reasons I like the average of GISS/ HadCrut and NOAA is that if one agency suddenly discovers some error, that change doesn’t affect the trend for the average data set!)
Lucia,
No problem.
BTW, was your comment over on William Briggs’ blog in his genius post, “On the other hand, if we define a genius as someone in the top 1% of IQ scores on a test administered by Mensa, the probability is 1%– by definition,” supposed to be as funny as it is? (I’d ask there but I think I’ve filled that thread with enough off topic posts.)
Raphael–
Of course it’s supposed to be funny. Did you like my choice of geniuses? 🙂
I am sure it was supposed to be funny, most people find mensa funny. But to be precise, I am dying to know if excluding half of the people eligible to join mensa was intentional. 🙂
I wouldn’t fault either of your choices, even though I really don’t believe in genius. 🙂
I was once dragged to a Mensa party. Rest assured that at least half those people should be excluded. (From what, I shall not say.)
Nor would you need to say. I know what you mean.