Voted Early!

I also promised not to vote often!

Literally: I had to fill out and sign a form saying I knew voting early means I can’t vote at my normal polling place on election day.

I’ve worked the polls before and I know that in the past the procedure for absentee ballots was to check that those voters had not cast ballots during normal voting. Presumably DuPage County will have some specific procedure to verify I don’t break my promise and vote twice.

I encourage anyone who thinks they might be rushed on election day to check into early voting. And remember: there is more than the Presidential election at stake. In addition to casting my vote for President, congressman, local judges and so on, I voted on three other issues. These were:

  1. Amendment to the Illinois constitution touching on pensions:

    If you believe the Illinois Constitution should be amended to require a three-fifths majority vote in order to increase a benefit under any public pension or retirement system, you should vote YES on the question. If you believe the Illinois Constitution should not be amended to require a three-fifths majority vote in order to increase a benefit under any public pension or retirement system, you should vote NO on the question. Three-fifths of those voting on the question or a majority of those voting in the election must vote YES in order for the amendment to become effective on January 9, 2013.

    Given the state of Illinois finances, and other existing constitutional provisions protecting benefits already promised, I voted YES.

  2. In my county we also voted on DuPage County Public Office Advisory Question:

    Should Illinois law permit an individual to hold two or more public elected offices simultaneously?

    I voted no.

  3. In Lisle we cast rather irrelevant votes on the Lisle Township Advisory Individual Rights Amendment Question:

    Should the United States Constitution be amended to clearly state that only individual persons, and not corporations, associations, or any other organizational entities, are entitled to the rights enumerated in the Constitution?

    This question’s appearance on ballots only in my town is rather odd. But I guess if the township wants to know what I think, I’d be happy to answer the real question: I think corporations should be able to donate money to campaigns. I advised the Township I did not want anyone to amend the US Constitution in this way.

Have fun voting. Today. Tomorrow. Or on or before Nov. 6!

Mike’s Nobel Embellishments.

It has been widely reported the Nobel Prize Committee is of the opinion Michael Mann was never awarded the Nobel Prize. Mark Steyn, Roger Jr., and Judy Curry have all commented. One thing I wanted to know, though: precisely which text on the photo is being referred to in the following

3) The text underneath the diploma is entirely his own. We issued only the diploma to the IPCC as such. No individuals on the IPCC side received anything in 2007.

Is the certificate the part outlined in black? The language on that bit seems to be Norwegian. Did Mann mount this and add everything else? Or did the IPCC? Does anyone know? (Feel free to continue discussing the libel case here. 175 is enough comments on the other thread. I’m going to close it! I’ll also shift some of the more recent comments from that thread to this post.)

Pooled Method of Testing a Projection

The purpose of this post is to discuss a method of testing a projection that I think has greater power than testing trends alone. (Yes, Paul_K. We discussed this before. I now think it can be done!! 🙂 ) To do this, I’m going to discuss how I might test a hypothetical prediction. Suppose I’ve been winning 0 quatloos for a long time, and there is no trend in my quatloo winnings. If I do nothing, most people would assume I’m going to continue to win nothing. Below, to the left of the vertical red line I have plotted my quatloo winnings in each bet (blue circles), along with the mean value and the trend in time (solid blue line):


The dashed lines represent the standard error in the mean quatloo winnings per bet.

To the right of the red line is my prediction of the number of quatloos I will earn in the future if I don’t come up with some nifty system to improve my betting performance.

As you can see, in this example, I am predicting the future “Quatloo anomaly” will be zero. The uncertainty intervals are the uncertainty in the mean over the baseline (left) and the uncertainty in the mean over the forecast period which has not yet occurred.

We’ll now pretend I concoct a ‘system’ that I think will improve my ability to win Quatloos. It will take practice, so I think my winnings will tend to increase at a rate of Q(i) = m*i, where ‘m’ is the trend. Of course, my null hypothesis remains m=0, and I don’t really know how large ‘m’ is. I merely hope it’s positive.

I’m going to use my system for 120 plays. At the end of that time, I want to test the null hypothesis that

  1. I will earn 0 Quatloos each month in the future period. That is E[Q]=0
  2. The trend m is zero. That is m=0.

I can test either using an appropriate ‘t’ test. Each would involve finding a ‘t’ value for the data in the upcoming period and comparing that to a Student t-distribution. I can call the two t values t_mean and t_trend.
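
For concreteness, here’s a minimal sketch of how the two t values could be computed from the forecast-period winnings. This is illustrative Python rather than my actual script; the synthetic ‘winnings’ array and the variable names are placeholders.

```python
# Sketch: compute t_mean and t_trend for the forecast-period quatloo winnings.
# 'winnings' is a placeholder array of synthetic per-bet winnings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
winnings = rng.normal(loc=0.0, scale=1.0, size=120)   # 120 bets of white noise
bets = np.arange(len(winnings))

# t for the mean: is the average winning different from the null value of 0?
t_mean = stats.ttest_1samp(winnings, popmean=0.0).statistic

# t for the trend: is the OLS slope different from 0?
fit = stats.linregress(bets, winnings)
t_trend = fit.slope / fit.stderr

print(t_mean, t_trend)
```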

I now have two possible tests. Strictly speaking, I am testing two separate hypotheses, both of which are related since both amount to “the system doesn’t work”. So, I really only want 1 test.

If I’m limited to one test or the other, what I’d really like is to pick the more statistically powerful of the two tests. That is: when applied at the same statistical significance (that is, a pre-defined type I error rate), I’d like the test that results in the greatest statistical power (which is the same as the lowest type II error rate).

Oddly enough, I also know that I can show the errors in both tests are independent. That is, if the null hypotheses are true and both m=0 and E[Q]=0, then the slight deviations in the actual ‘m’ and average quatloo winnings I will observe during the upcoming periods are statistically independent. (This was discussed here.)

So the fact that there are two methods and the errors in the two methods are statistically independent leads me to an interesting situation.

There should be some method of combining the tests to create a pooled test that is more powerful than either alone. That is, there should be a method of creating a pooled

t_pooled = w_trend*t_trend + w_mean*t_mean

with weights selected such that the standard deviation of the pooled t is 1 and the statistical power of the test is maximized. (Note that the standard deviations of t_trend and t_mean are one by definition. That’s a requirement for t values.)

So, I concocted the following rule, which seems to work:

  1. Define an effect size for an alternate hypothesis: that is, the hypothesis that might be true. This can be any value other than zero since the magnitude cancels: I pick m_e = 0.1 quatloo/decade. If that hypothesis is true, I find I will win Q = m_e*(120 months) quatloos.
  2. Based on data available prior to the forecast period, estimate the t_e,trend and t_e,mean that would exist if an effect of this size were observed during the upcoming prediction period. To estimate these I need to estimate the variability of 120-month trends given data prior to the forecast period, and I need to estimate the variability in the difference between the means in the prediction and baseline periods. (Those are provided in the figure and are based on the noise prior to the forecast.)
  3. Compute relative weights for each test variable ‘i’ as w_i = t_e,i^2 / sum(t_e,j^2). (Notice the effect size cancels out when the t’s are normalized.)
  4. Collect data during the sample period. Afterwards, compute the t values based on the sample data (using normal methods). Then create a pooled t using t_pooled = w_trend*t_s,trend + w_mean*t_s,mean. Here the ‘s’ denotes sample data. (A minimal sketch of steps 2-4 appears just after this list.)
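
Here is a minimal sketch of steps 2-4 in Python. It is illustrative only: the pre-forecast variability estimates (sd_trend, sd_mean) and the convention for converting the effect size into an ‘expected’ mean winning are placeholders, not values or choices taken from my script.

```python
# Sketch of the weighting rule: compute 'expected' t values for an assumed effect
# size, turn them into weights, then pool the sample t values.
# sd_trend and sd_mean are placeholder pre-forecast variability estimates.
import numpy as np

n = 120                     # length of the forecast period, months
m_e = 0.1 / 120.0           # assumed effect size: 0.1 quatloo/decade, expressed per month

sd_trend = 0.002            # placeholder s.d. of 120-month trend estimates (quatloo/month)
sd_mean = 0.15              # placeholder s.d. of (forecast mean - baseline mean) (quatloos)

# Step 2: 'expected' t values if the alternate hypothesis were true.
# Under a pure linear trend, the mean winning is roughly m_e*n/2 (one possible convention).
t_e_trend = m_e / sd_trend
t_e_mean = (m_e * n / 2.0) / sd_mean

# Step 3: relative weights; the effect size m_e cancels out of the ratio.
w = np.array([t_e_trend, t_e_mean]) ** 2
w = w / w.sum()
# Rescale so the pooled t has unit standard deviation under the null
# (the two sample t's are independent and each has s.d. 1).
w = w / np.sqrt((w ** 2).sum())

# Step 4: pool the sample t values computed from the forecast-period data.
def pooled_t(t_s_trend, t_s_mean):
    return w[0] * t_s_trend + w[1] * t_s_mean

print("weights (trend, mean):", w)
```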

I’ve done a number of tests, and this weighting does appear to always result in a more powerful test than either the trend or mean test alone, provided the weights are based on the effect size during the forecast period and the individual t-tests properly account for all errors. For the current case, I collected data on Quatloo winnings and the results are as below:

The actually important part of the graph is the information in the lower left. What I want you to notice is that, based on the synthetic data I created, the ‘t’ for the mean anomaly test was 18.44 and the t for the trend test was 12.57. I weighted the two t’s using relative weights of (0.66, 0.34)/sqrt(0.66^2+0.34^2); this resulted in a t_pooled of 22.13. The significant feature is that 22 is larger than either 18.4 or 12.6. This isn’t important for the synthetic data I show, which has so much power it’s ridiculous. But the feature is very important if we have just barely enough data to generate sufficient power to hope for “fail to rejects”.

Synthetic tests using 10000 repeat trials indicate that this pooled t has the proper characteristics and is, on average, larger than either of the two other ‘t’s. That means the resulting test has higher power than either the test comparing trends or the test comparing anomalies alone. Consequently, it is preferred.
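
The kind of synthetic check I mean looks roughly like the sketch below. It uses plain white noise and fixed placeholder weights rather than the noise model and weights from my script, but it shows the idea: the false positive rates should stay near the nominal level, while the pooled test rejects a false null more often than either individual test.

```python
# Sketch of a 10,000-trial synthetic check: generate series with and without a trend,
# compute t_trend, t_mean and the pooled t, and compare one-sided rejection rates.
# White noise and the fixed weights below are placeholders for the real setup.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, trials = 120, 10000
months = np.arange(n)
w = np.array([0.34, 0.66]) / np.sqrt(0.34**2 + 0.66**2)  # placeholder (w_trend, w_mean)
crit = stats.norm.ppf(0.95)                              # one-sided 5% critical value

def rejection_rates(true_slope):
    hits = np.zeros(3)   # rejection counts for (trend test, mean test, pooled test)
    for _ in range(trials):
        y = true_slope * months + rng.normal(size=n)
        fit = stats.linregress(months, y)
        t_tr = fit.slope / fit.stderr
        t_mn = stats.ttest_1samp(y, 0.0).statistic
        t_pool = w[0] * t_tr + w[1] * t_mn
        hits += np.array([t_tr, t_mn, t_pool]) > crit
    return hits / trials

print("false positive rates:", rejection_rates(0.0))    # all should be near 0.05
print("power:", rejection_rates(0.003))                 # pooled should be the largest
```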

So, this is sort of a “goldilocks” test!

I know this is sketchy. I’ll be happy to answer questions. But this post is mostly so I have a place holder to remember what I did. (I uploaded the really badly organized script with embarrassing incomprehensible notes and a bunch of side tests. And… yes… this can be applied to model tests. I haven’t done it yet because I hadn’t verified I knew how to weight the two tests. I did model tests with weights of (0.5, 0.5)/sqrt(0.5). Now I need to repeat that with the correct weights. 🙂)

==========
Note: I’ve discussed this before and Paul_K was dubious. So, this result is something of a turnabout, because while the first test I thought up way back then did not seem to optimize power, this one does. I didn’t upload that script, so I’m not sure if I applied the test entirely properly in that post. I may have forgotten to account for the ‘running start’ aspect of testing the means, making the test of the means appear more powerful than it ought to.

Pinatubo Climate Sensitivity and Two Dogs that didn’t bark in the night

Gregory (Scotland Yard detective): “Is there any other point to which you would wish to draw my attention?”

Holmes: “To the curious incident of the dog in the night-time.”

Gregory: “The dog did nothing in the night-time.”

Holmes: “That was the curious incident.”

From The Memoirs of Sherlock Holmes (Arthur Conan Doyle)

 

Pinatubo was hailed by many climate scientists as a unique opportunity to test climate sensitivity.  It was the first major volcanic eruption during the satellite era.  For the very first time, we had top-of-atmosphere (TOA) measurements of radiative flux changes (from the ERBE) during a major eruption.

I would like to consider four papers here which deal with the estimation of climate sensitivity from the Pinatubo data; they are:

 

Douglass and Knox 2005  (“DK2005”);

Wigley et al 2005  (“Wigley2005”);

Forster and Gregory 2006  (“FG2006”);

Soden et al 2002 (“Soden2002 “) .


Rose v. Met Office Kerfuffle

David Rose of the Daily Mail wrote an article with the title Global warming stopped 16 years ago, reveals Met Office […], which has set off quite a snit. In turn, the Met Office posted their answers to questions David Rose asked the Met Office. Naturally, this spawned quite a kerfuffle; Anthony has posted twice!

I was in the middle of a “script-reorganization” so I couldn’t comment quickly. I’m at a stopping point where I can discuss things, so I’ll pause to comment a bit on claims in Rose’s article and statements in the Met Office’s response to David’s questions. I’ll probably post again next week to engage the issue of “cherry picking”. (That discussion requires the planned script functionality extension which motivated the “reorganization” activity.)

First: I want to state claims a reader coming across Rose’s article might infer he is making. Three of these claims appear to be:

  1. Global warming has occurred. He writes: “… global warming is real, and some of it at least has been caused by the CO2 emitted by fossil fuels”.
  2. Warming may be slower than the level catastrophists (whoever these people might be) have claimed. He writes, “the evidence is beginning to suggest that it may be happening much slower than the catastrophists have claimed”
  3. Global warming stopped 16 years ago. This claim is in the title presented in bold to anyone who comes across the article.

Given the kerfuffle, it’s worth noting that the three claims taken collectively are entirely in agreement with this “clarifying” graphic the Met Office shows:

Unfortunately, if presented as a rebuttal or response to any question Rose actually asked, that graph is an utterly irrelevant red herring.

Because the Met Office graph is irrelevant as any sort of “rebuttal” from the Met Office, I’ve pondered a bit on how to organize my comments on these claims. I’ve decided the best thing to do might be to simply show the answers I would have given to Rose’s first two questions. The third is such a strangely worded leading question that I don’t know how it really can be given a good answer. (I think the Met Office’s answer is rather silly– but I’ve developed some sympathy for that after I tried to think of a good answer to it. Possibly, the best answer would be “I don’t understand what you are trying to ask. Could you clarify?”)

Now, on to my answer to Rose’s two questions:

Q.1 “First, please confirm that they do indeed reveal no warming trend since 1997.”

I would have understood this to mean no real warming trend since Jan 1997. So my answer would have been: No, I cannot confirm “no warming trend since 1997”. Using HadCrut4, the trend from January 1997 – Aug 2012 is positive. That is: There has been a slight amount of warming during that period.
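
For anyone who wants to check that sort of number, the calculation is just an ordinary least-squares fit to the monthly anomalies. A sketch, assuming you’ve saved the monthly global anomalies to a two-column CSV (the file name and column names are placeholders, not any official format):

```python
# Sketch: OLS trend (C/decade) over Jan 1997 - Aug 2012 from a CSV of monthly
# anomalies with columns 'date' (YYYY-MM) and 'anomaly' (C). This gives only the
# central trend estimate; it makes no attempt at the uncertainty calculation.
import numpy as np
import pandas as pd

df = pd.read_csv("hadcrut4_monthly.csv", parse_dates=["date"])   # placeholder file name
sub = df[(df["date"] >= "1997-01-01") & (df["date"] <= "2012-08-31")]

t_years = (sub["date"] - sub["date"].iloc[0]).dt.days / 365.25
slope_per_year = np.polyfit(t_years, sub["anomaly"], 1)[0]
print(f"trend: {10 * slope_per_year:+.3f} C/decade")
```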

I might have supplemented that with a graph. The one below shows the trends based on HadCrut4 and two other agencies:

(Note: circles show the expansion to include measurement uncertainty. I haven’t updated the measurement uncertainty for HadCrut4; it uses HadCrut3 info.)

The HadCrut4 trend is shown in purple. At +0.05C/dec, it hovers above the solid black line indicating “no trend”. So, it shows warming. Slight warming– but still, warming.

No more blathering required for this question.

Q.2 “Second, tell me what this says about the models used by the IPCC and others which have predicted a rise of 0.2 degrees celsius per decade for the 21st century. I accept that there will always be periods when a rising gradient may be interrupted. But this flat period has now gone on for about the same time as the 1980 – 1996 warming.”

My answer to this would be that if Jan 1997 was not a cherry pick, and if HadCrut4 contains no error, then the trend since Jan. 1997 suggests temperatures are rising more slowly than suggested by the multi-model mean in the AR4. That is: the mean warming in the AR4 models is faster than is consistent with the earth’s weather since Jan 1997.

I diagnose this using the trace circled in red below:


The trace outlined in red shows the multi-model mean over models that used A1B to project into the 21st century. The blue error bars indicate the ±95% uncertainty in trends estimated using a “model centric” method. That is: examining each model with more than 1 run projecting into the 21st century, I estimated the variance in 188-month trends across that model’s runs. I then found the average variance over the 11 models that permitted that operation, took its square root, and used that as the “multi-model” estimate of the standard deviation of 188-month trends in a particular model. The ±95% spread in weather is illustrated by the span of the dark blue uncertainty intervals inside the red oval trace. Notice the bottom of that span lies above the HadCrut4 188-month trend.
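
For readers who want the bookkeeping spelled out, here is a minimal sketch of the “model centric” calculation. The numbers in trend_runs are invented for illustration; the real calculation uses the 188-month trends from each model’s individual runs.

```python
# Sketch of the "model centric" bookkeeping. 'trend_runs' is a placeholder: for each
# of the 11 models with more than one run, the 188-month trends (C/decade) from its
# individual runs. The numbers here are invented for illustration.
import numpy as np

trend_runs = {
    "model_a": [0.21, 0.27, 0.18],
    "model_b": [0.30, 0.24],
    # ... one entry per model with more than one run
}

# Variance of trends within each model, averaged over models, then square-rooted:
# the pooled "weather" s.d. of 188-month trends for a typical model.
within_var = [np.var(v, ddof=1) for v in trend_runs.values()]
sd_weather = np.sqrt(np.mean(within_var))

# +/-95% spread of "weather" trends about the multi-model mean
mm_mean = np.mean([np.mean(v) for v in trend_runs.values()])
lo, hi = mm_mean - 1.96 * sd_weather, mm_mean + 1.96 * sd_weather

obs_trend = 0.05   # placeholder observed trend, C/decade
print("observed trend below the spread of model weather?", obs_trend < lo)
```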

This means that, based on that particular choice of start year, when compared to HadCrut4, on average the AR4 models are running warm. This is true even though HadCrut4 shows a warming trend.

At this point, I might have discussed cherry picking a bit. A case can be made that Jan 1997 is a rather inapt start point. But… if I were the Met Office suggesting that, I know perfectly well the next question would be:

I’m asking about 15 years because that’s a claim in a peer reviewed paper. What about that old claim that 0C/dec could last up to 15 years?

These papers use a method that permits the Met office to write misleading things like this:

So in that sense, such a period is not unexpected. It is not uncommon in the simulations for these periods to last up to 15 years, but longer periods are unlikely.

The analysis method used to justify claims like that involved taking the standard deviation of trends over all model runs, ignoring the fact that different models show different trends. Using that standard deviation around the multi-model mean shows the HadCrut4 trend falls inside that spread of runs; it also shows 0C/dec very near the lower boundary of the ±95% confidence intervals. It is on this basis that (some) people justify stating that we should “expect” to see 1 in 40 “earth” trends as low as 0C/dec even if the underlying trend matches the multi-model mean.

As I observed at the time: if you are trying to detect whether the ensemble is biased, that method is wrong. Moreover, if you are trying to estimate the variability of earth trends (or even the spread due to weather in a typical model), that method is just wrong.

It’s wrong because, with respect to deciding whether the difference between the multi-model mean and the earth realization is due to weather, we need an estimate of variability due to “weather” or “internal variability”. And real differences in mean warming from model to model have nothing to do with “weather”, or “internal variability”, or anything of the sort. Those are structural differences in mean warming in different models. Not. Weather.

An estimate of the ±95% spread of trends inside “weather” enhanced by the structural uncertainty in the model mean trends is shown to the right of the uncertainty intervals I circled above. That is: this is the spread of all “weather in all models”. Note that this spread is larger than the spread of “weather” in a typical model; the bottom of that trace just grazes 0C/dec, and HadCrut4 lies inside this trace.

It is that spread grazing zero that justified the claim of “15 years” in the papers that made that claim.

So what does the earth trend falling outside the spread of “weather” but inside the spread of “weather + structural uncertainty” mean? It means

a) if we estimate weather based on models, that HadCrut4 trend is outside the range consistent with the multi-model mean trend but
b) the earth trend is inside the full range of weather enhanced by the spread due to structural uncertainty.

How can this happen? Some models show less warming than the multi-model mean; others show more warming than the multi-model mean. Given this situation, it is possible for an earth trend to fall inside the range of weather consistent with “all weather in all models” because it falls inside the spread of weather associated with models that are– on average– warming more slowly than the multi-model mean.

Now, before I end this, I’d like to observe a few things:

  1. In this particular case, 15 years is a cherry pick. We know that 1998 was one of the largest El Ninos seen in a long, long, long time. So, in some sense, we know that’s not a “good” year to really start. I’m not going to show results over a range of start years– if I did, I would not reject the multi-model mean for a large variety of start years. (It’s close– but no cigar.)

    Rather than show a bunch of different start years, I am extending my script to create a method that is more “robust” against cherry picking. That is: It will compare the deviation in absolute anomalies along with the deviation in trends. That metric is harder to cherry pick because if you pick the start date in a “warm” year, the trend is low, but the mean difference is high. In contrast, if you pick the start date in a “cool” year, the trend is high, but the mean is low. I’d show that, but it requires more coding. I’ll be showing this a bit later this week. I don’t know if using the “new” method will result in a reject or accept since 1997. But since cherry picking is rampant (on all sides), I want to budget time on creating a “harder to cherry pick method”.

  2. My focusing on Rose’s questions means I’ve failed to slam him for “global warming stopped”. There is no evidence global warming stopped in any meaningful sense.

    First: The 15 year trend is positive. That’s “slow warming” but not “stopped”.

    Second: we know the trends over the past 50 years and the past century are both positive. Rose says so himself. And we know temperatures are variable. So, there is ample evidence that we have been in a warming trend; it should take quite a bit of cooling to decree we have evidence it actually stopped. Specifically, if we are estimating weather using the “model centric” method I use here, and we took the uncertainty intervals from models and placed them around ‘0’, we would need to see a 188-month trend of approximately -0.13 C/decade before we could reject a hypothesis that warming is positive. That level is the ±95% window (one sided) that would correspond to a forced trend of 0C/decade. A +0.04C/dec is certainly well above -0.13C/dec. (The arithmetic behind that threshold is sketched just after this list.)

    To be sure, it may be that when he wrote “stopped” Rose meant “paused” or “slowed”. But generally, “X stopped” gives the impression X is not about to resume at any moment. When applied to random variables, it gives the impression one believes the expected value for X has “stopped” being what it was before. The fact is: there is zero evidence that we should expect the trend in the upcoming 15 years to not be positive. Even without any consideration of models — merely resorting to extrapolation– there is every reason to expect a positive trend over the upcoming 15 years.
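
The arithmetic behind that -0.13 C/decade figure is just a one-sided bound applied to an assumed spread of 188-month trends. A minimal sketch; the standard deviation below is a placeholder chosen to reproduce the number, not a value taken from my script:

```python
# One-sided 95% threshold for rejecting "the underlying warming trend is >= 0".
# sd_trend is an assumed/illustrative model-centric s.d. of 188-month trends.
from scipy import stats

sd_trend = 0.08                                   # C/decade, placeholder
threshold = -stats.norm.ppf(0.95) * sd_trend      # roughly -1.645 * sd
observed = 0.04                                   # C/decade, the HadCrut4-ish value above
print(f"threshold: {threshold:.2f} C/decade; warming 'stopped'? {observed < threshold}")
```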

This was pretty rambling. But the highlights:

1) If picking 1997 was “fair”, the HadCrut4 trend falls outside the ±95% confidence window defined by the multi-model mean and “typical model weather”.
2) This suggests the multi-model mean is warmer than consistent with earth weather.
3) The earth trend falls inside the ±95% spread of “weather + structural uncertainty” of models. This suggests some individual models are not too warm.
4) The Met Office conflates (2) and (3) in a misleading way. To be fair, I think they likely aren’t doing it on purpose. They are collectively a bit confused and they have misled themselves. Nevertheless, both (2) and (3) are true.
5) There is no evidence “global warming stopped”– or at least there isn’t any evidence if “stopped” is defined as “stopped and we should not expect it to resume”.
6) There’s a good chance the Met Office is going to continue to have difficulty defending the model mean. After all, if the anticipated El Nino is either weak or non-existent, next year Rose may well announce that we’ve had 16 years of 0C/dec!
7) It really would help if they “discover” that the multi-model trend is biased high and stop trying to suggest the discrepancy can be explained by “weather” (or worse, volcanoes!).

Adding Multi-Model Means to Model v. Observations Graphs.

Currently, I’m just adding tests to my graphs. For that reason, the discussion will be brief, merely explaining the additional information added to one of the graphs from Observations V. Models: “Model Weather”, where I discussed a method to see whether the earth weather trend falls inside the spread of “weather trends” using the estimate of variability of model trends based on repeat runs for matched periods from that model.

I told readers I’d add the multi-model mean. I’ve actually added 3 versions of the multi-model mean shown in bold below. My “preferred” version is labeled “MT”.


(Careful readers will note that for convenience I am drawing data from a 2nd hand source, and these trends end in June or July.)

The three versions of the comparison to the multi-model mean are:

  1. M1: This trace represents the 95% range of weather for a forecast data point assuming the 11 models used to create the forecast all replicate the probability distribution of “earth weather”. I computed the multi-model mean for the test period over the 11 models with more than 1 run. I estimated the spread due to the “weather” in each model by taking the square root of the mean variance. (I’ll call this the pooled mean s.d.) That weather is shown in slate blue. I estimated the uncertainty in the model mean as the ratio of the standard deviation of the model mean trends (over the 11 models) to the square root of 11; this is shown in grey. I then extended the “weather” uncertainty intervals by pooling the 95% spread of weather and the standard error in the sample mean. These are shown in slightly lighter blue. If earth weather were measured perfectly, and that range were a forecast, we would expect trends in the forecast period to have a 95% probability of falling within that range.

    Carrick correctly observed that a full comparison requires inclusion of measurement uncertainty in the earth trend. I estimated that and added it to the uncertainty intervals. The lower bounds of the trend including measurement uncertainty are shown with open circles. (See the illustration below for more details; a sketch of how these uncertainties are pooled appears after this list.)

  2. MT: This trace represents the 95% range of weather for an as yet unseen realization under the assumption that all models used to compute the A1B SRES replicate the probability distribution of “earth weather” but we were only able to estimate the standard deviation of trends due to “weather” using the 11 models with repeat runs. The only difference between this and the previous test is the multi-model mean is a bit higher and “22” was used to compute the standard error in the mean. This is the multi-model mean actually used in the AR4, and for that reason, I consider it the “preferred” mean to test.

    When the earth trend falls below the open circles, this indicates that the earth trend, treated as “weather”, falls outside the range that is consistent with the multi-model mean– assuming we accept the pooled variance of weather in models as the best estimate.

  3. ME is called “M enhanced”. This is the result one would get if one estimated the uncertainty using the spread of trends in each model (i.e. what the models actually estimate as uncertainty in trends due to “weather”) and then added the spread in the mean trends (i.e. the scatter in the location of the open circles) into the estimate of the “weather trend”. This could be justified if the difference in the model trends were due entirely to statistical error and not due to any real difference in trends. In that case, the uncertainty intervals computed under ME would on average be equal in size to those computed under M1, but we would have a larger number of samples and so get a more precise result. However, in this case, it’s actually pretty easy to show that trends in the forecast region differ from model to model. (I have not shown this test, but I’ve done it.) So, this method results in a spread that is not an estimate of the variability in 138 month trends due to “earth weather”. Rather, it is simply an estimate of the ±95% range of runs including both the effects of “weather” and the fact that some models show more warming and some show slower warming. (This is, btw, the set of assumptions that published papers used to insist the current trends are consistent with the models. They may be consistent in the sense that some models show less warming than the mean and/or are very variable. But this method does not contradict any diagnosis that the ensemble is biased warm.)
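
A minimal sketch of the pooling used for the M1/MT style intervals appears below. Every number in it is a placeholder; the point is only the order of operations: pool the “weather” s.d. with the standard error of the multi-model mean in quadrature, then pool in the measurement uncertainty for the open circles.

```python
# Sketch of the pooling for the M1/MT style intervals. All numbers are placeholders.
import numpy as np

sd_weather = 0.12        # pooled s.d. of forecast-period trends due to "weather" (C/decade)
model_means = np.array([0.18, 0.22, 0.25, 0.20, 0.28, 0.17,
                        0.23, 0.26, 0.19, 0.24, 0.21])      # mean trends of the 11 models
sd_measure = 0.03        # estimated measurement uncertainty in the observed trend

# Standard error of the multi-model mean: use sqrt(11) for M1, sqrt(22) for MT
n_for_se = 22
se_mean = np.std(model_means, ddof=1) / np.sqrt(n_for_se)
mm_mean = np.mean(model_means)

# Pool independent uncertainties in quadrature
sd_weather_plus_mean = np.hypot(sd_weather, se_mean)   # weather + s.e. of the mean
sd_full = np.hypot(sd_weather_plus_mean, sd_measure)   # plus measurement uncertainty

lower_tick = mm_mean - 1.96 * sd_weather_plus_mean     # lighter blue interval bound
lower_circle = mm_mean - 1.96 * sd_full                # open circle (full comparison)
print(lower_tick, lower_circle)
```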

Note: All graphs assume probability distributions for trends are Gaussian (i.e. Normal).

Guide
There were questions on reading the graph in prior posts. This might help:

Variability of 150-Month means: Model Values v. ARMA estimates.

In the previous post, SteveF asked

There are big differences in the uncertainty range for different models. Is this mainly because some models have fewer runs (which adds to uncertainty) or because some models have more run-to-run variability?

I answered, but I also thought I would create a graphic that shows the standard deviation of 150-month trends as computed over repeat runs of models forced using the A1B SRES, and compares those to the spread estimated using ARMA11. Below, the standard deviations of 150-month trends computed from repeated samples of models with more than 1 run are shown in ‘slate blue’ as “95% confidence intervals” centered on zero rather than on their respective multi-model means. The ARMA estimates for GISTemp (red), HadCrut3 (purple) and NOAA (green) are shown by horizontal lines. The 95% confidence interval based on the “pooled standard deviation” for all models is shown to the far right in black. (A pooled standard deviation is based on taking the square root of the average of the variances.)
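
For those who want to see how such a comparison can be set up, here’s a sketch comparing the empirical spread of 150-month trends over repeat realizations with the spread implied by fitting “trend + ARMA(1,1)” to a single realization. The ARMA parameters and the forced trend are placeholders standing in for actual model runs.

```python
# Sketch: two estimates of the s.d. of 150-month trends --
# (a) empirical spread over repeat synthetic "runs" (standing in for repeat model runs),
# (b) the spread implied by an ARMA(1,1) fit to the residuals of a single run.
# The AR/MA parameters and the forced trend below are placeholders.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import ArmaProcess

n = 150
months = np.arange(n)

def trend_per_decade(y):
    return 120.0 * np.polyfit(months, y, 1)[0]

# (a) Synthetic "runs": linear forced trend + ARMA(1,1) noise, repeated 40 times
proc = ArmaProcess(ar=[1.0, -0.7], ma=[1.0, 0.3])
runs = [0.002 * months + proc.generate_sample(nsample=n, scale=0.1) for _ in range(40)]
sd_empirical = np.std([trend_per_decade(y) for y in runs], ddof=1)

# (b) ARMA(1,1) estimate from a single run: detrend, fit, then Monte Carlo the noise
y = runs[0]
resid = y - np.polyval(np.polyfit(months, y, 1), months)
fit = ARIMA(resid, order=(1, 0, 1), trend="n").fit()
phi, theta, sigma = fit.params[0], fit.params[1], np.sqrt(fit.params[2])
fitted = ArmaProcess(ar=[1.0, -phi], ma=[1.0, theta])
sd_arma = np.std([trend_per_decade(fitted.generate_sample(nsample=n, scale=sigma))
                  for _ in range(2000)], ddof=1)

print("empirical s.d. of trends:", sd_empirical, " ARMA(1,1) estimate:", sd_arma)
```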

Note that, generally speaking, the variability over repeat samples of model runs is larger than estimated using ARMA applied to the model data. If one thought all models agreed and created the correct spectral properties of “weather”, then one might explain the discrepancy by saying that “weather noise” is not ARMA11. (And this might even be so.) But, oddly enough, treating the noise in models as ARMA11 gives more or less correctly sized uncertainty intervals for 150 month means in the aggregate, though not for individual models. For some models, it would be too high; for others, too low. All this means is that the weather noise in models is not “ARMA11” (which would be evident if we examined their residuals in any case). It is also the case that the weather noise in individual models does not have identical spectral properties. (No, I’m not going to show them for every model. But I’ll simply point out that the slope-o-grams for models differ dramatically; see slopeograms.)

Nevertheless, it is useful to look at a feature that strongly affects the estimate of the variability of 150 month trends. This is the standard error of residuals to a linear trend. When computed during periods with no fast variations in the forced trend (i.e. no Pinatubo-like eruption), this quantity gives some measure of the short term variability due to weather in models relative to that seen on planet earth. The pooled means of the ‘se’ of residuals over repeat periods during the forecast period are shown with slate blue circles, the pooled mean over all models is shown with black circles, and the ordinary mean over all models with grey circles. The associated standard deviations of the ‘se’ over the periods in the forecast period are also shown. The ‘se’s associated with the single 150 month trend for observations from 2000 to now are shown with red, purple and green lines for GISTemp, HadCrut and NOAA/NCDC respectively:

Using the “method of the eyeball”, it seems that on the whole, the models tend to have larger short term variability than seen on the real earth. The model variability tends to be higher than the observed variability, and this is true despite the fact that the standard error of residuals for the earth includes contributions from both ‘earth weather’ and measurement uncertainty.

This doesn’t necessarily tell us whether the models over- or under-estimate the longer term variability, or even the variability of 150 month trends. But I think it’s worth noting that despite the lack of “ENSO”, some models contain a surfeit of variability at time scales below 150 months– and I would say that some of that variability is both (a) wild and (b) clearly wrong. (Honestly, I’m not sure weather is sufficiently predictable to permit agriculture on planets ECHAM or FGOAL, also known as “the wild-crazy weather models used to claim the recent earth trends fall inside the range consistent with models”.)

Hmm… I guess I should also compare the average lag 1 auto-correlation. Those are measurable statistics. I haven’t done that recently and if I’ve done it in the past, I don’t recall the result I got. I’ll add that.

Observations V. Models: “Model Weather”.

Previously, I showed that if we model the observations of Global Surface Temperature using “linear forced trend + ARMA11-noise”, we would reject the null hypothesis that the trend in the multi-model mean temperature for AR4 models corresponding to the A1B SRES matches the trend corresponding to three observational data sets (GISTemp, NOAA/NCDC and HadCrut3) using a number of choices of start years– though not all possible choices. (See Arima11 Test: Reject AR4 Multi-Model Mean since 1980, 1995, 2001,2001,2003.) I leave it to readers to decide what the result means– and merely report that the rejections are occurring.

Of course, when interpreting what the rejections mean, one might observe that for some periods analyzed, the forced trend may not be linear and, even when it is, the noise may not be ARIMA. So, it is useful to look at the result a different way. Today, I’ll show what we get if we limit our tests to comparing least squares trends from simulations to those from observations as “test statistics” and estimate the variability of trends over independent realizations of “weather” from the spread of trends in an individual model. This test can be applied to individual models– which I will do. (It can also be applied to the multi-model mean if we make additional assumptions. When the assumptions apply, I strongly suspect that test has greater statistical power. I will defer that comparison.) There are 11 models with more than 1 run; results will be shown for each of these 11 models.

I’ll provide the main result first, and explain the test afterwards. The main results are encapsulated in the graphs below:

The main findings are:

  1. If we test whether the observed trend in Global Mean Temperature computed starting in Jan 2000 falls within the spread of “weather” characteristic of a particular model, we reject the ‘null’ hypothesis in (2, 5, 4) cases based on comparisons to (GISS, HadCRUT3 and NOAA) respectively. If we test the same null hypothesis using trends that begin in Jan 2001, we reject the null in (4, 6, 5) cases when observations are based on (GISS, HadCRUT3 and NOAA) respectively.
  2. Using monte-carlo analysis simulating the above test 10^4 times to account for the fact that model results are independent of each other but all cases are compared to the same observation, I find that if this test were applied to a system of models which reproduced both the mean trend and the variability of mean trends correctly, we would expect the false positive rate for the number of rejections seen for trends starting in 2000 to be (10.0%, 4.3%, 5.7%) for (GISS, HadCRUT3 and NOAA) respectively. For the number of rejections seen when the test is applied to trends beginning in 2001, the false positive rate would be (6.0%, 3.3%, 4.6%) for (GISS, HadCRUT3 and NOAA) respectively.

    If we use the significance level of 5% (i.e. 1-95%) as our criterion for significance, this test would guide us to reject the hypothesis that HadCrut or NOAA trends are consistent with weather in all models, and to conclude it is unlikely that the mis-match in trends since 2001 occurred by chance. However, comparison to GISS suggests that even if models correctly replicate the earth trend and variance in trends, the number of rejections observed could have arisen due to chance. Using the same significance level but applying the test using trends since Jan 2000, the numbers of rejections for both NOAA and GISTemp are not sufficiently large to reject the null hypothesis that model means match the observations, but the number of rejections is sufficient to state that HadCrut3 does not match the ‘weather’ in at least some models.

    (Note: I am assuming I would reject models as biased if we had too many rejections on either extreme with zero rejections in the other direction. So, the rate at which we would see rejections with models all “too warm” is one half that indicated.)

Those are the main results. Once again: what these statistically significant mismatches for some start years and some observational data sets mean could be a matter of some debate. For example: among other things, I will need to change the script to use HadCrut4 when it is reporting reliably. (The report through Aug 2012 is available now, but last month, data were only available through Dec. 2010. I have not yet updated my script to read it.)

But, no matter what they mean, I think it’s useful to show the relative agreement or disagreement between models and data. (I will later be adding the multi-model mean. I just haven’t done so yet.)

For those interested in how things were computed, I’m going to give a very cursory description.

Trend for period of interest
The observed trends were computed over the 150 (138) months beginning in Jan 2000 (2001). These are indicated with horizontal lines above (Red=GISTemp, Green=NOAA/NCDC, Purple=HadCrut3). The mean trend for each model was computed for the same period of time by first computing the trends over the N available runs, and then averaging over all N runs. Each mean trend is indicated with a brown open circle.

Estimate of standard deviation of “model trends”
For each model with more than 1 run, the variance of model trends was estimated by computing the average of the variance over the ‘N’ runs for 12 non-overlapping 150 (138) month periods of simulation. This corresponds to the estimate of the variability in trends we would expect to arise from “weather” if a particular model is correct. The ±95% range for the spread of weather around the model mean is indicated by the vertical blue lines with the innermost horizontal tick.

Uncertainty in model mean trend.
The sample mean trend for each model is an estimate of the mean that would be obtained in the limit that the number of runs was infinite. However, this quantity is uncertain. The standard error for this estimate was computed as the ratio of the standard deviation to the square root of the number of runs. This quantity was pooled with the standard deviation for model trends to create the ±95% range for “weather trends” we would expect given the additional uncertainty in our estimate of the mean trend. That range is indicated by the outer blue tick.

Note that in principle, if the model is correct in the sense that the model mean over an infinite number of runs would reproduce the expected value of the earth’s mean trend and the expected value of the variance in the earth’s trends if– hypothetically– we could keep rerunning the earth weather, an individual realization of earth’s weather should fall inside the outer tick of the blue line with probability 95%. However, the observations of earth weather also contain measurement noise while model output does not. When we compare an individual realization of the earth’s trend to “model weather”, we must include this measurement error. (Note: We don’t need to include it when we estimate uncertainty in trends based on the residuals to a linear fit in a time series.)

Estimate of measurement uncertainty for observations
To estimate this quantity, I first obtained estimates of the uncertainty in annual average temperature data for each observational data set. I then created annual average temperature series for each data set, added “red noise” with a specified lag 1 correlation coefficient to the published data, with an amplitude that reproduced the estimated errors in annual average temperatures, and computed the least squares trend. For each choice of lag 1 correlation coefficient, I repeated this 100 times and computed the standard deviation of the resulting trends.

This process was repeated for lag 1 correlation coefficients from r1=0 to 0.9 at intervals of 0.1. I then selected the largest estimate of uncertainty due to measurement errors corresponding to estimated values of uncertainty in individual annual average temperatures. (The maximum occurred with lag 1 coefficients of about r1=0.7 or 0.8. It’s likely the r1 coefficient for measurement errors is a lower value; my estimate likely represents an upper bound on the possible magnitude of the uncertainty due to measurement error.)
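
A sketch of that exercise for a single data set and a single lag 1 coefficient is below. The series length, the assumed annual uncertainty and r1 are placeholders; the real calculation loops r1 from 0 to 0.9 and keeps the largest answer.

```python
# Sketch: contribution of measurement error to trend uncertainty, for one data set
# and one lag-1 coefficient. Add AR(1) "red" noise, scaled to the assumed 1-sigma
# uncertainty of annual averages, to the annual series; repeat; take the s.d. of trends.
# Series length, sigma_meas and r1 are placeholders.
import numpy as np

rng = np.random.default_rng(3)
n_years = 13
years = np.arange(n_years)
annual = 0.02 * years + 0.4      # placeholder annual anomaly series (C)
sigma_meas = 0.05                # assumed 1-sigma uncertainty of an annual average (C)
r1 = 0.7                         # assumed lag-1 correlation of measurement errors

def ar1_noise(n, r1, sigma):
    # Stationary AR(1) noise with marginal standard deviation 'sigma'
    e = np.empty(n)
    e[0] = rng.normal(scale=sigma)
    for i in range(1, n):
        e[i] = r1 * e[i - 1] + rng.normal(scale=sigma * np.sqrt(1.0 - r1**2))
    return e

trends = [np.polyfit(years, annual + ar1_noise(n_years, r1, sigma_meas), 1)[0]
          for _ in range(100)]
print("s.d. of trend due to measurement error:", 10 * np.std(trends, ddof=1), "C/decade")
```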

Estimates of the observational error were computed for each observational data set. These were then pooled with the uncertainty due to “model weather”. The extended lower bounds for observations consistent with the model data are shown with open circles (red=GISS, purple=HadCrut, blue=NOAA).

Diagnosis for individual models.
In this method, we diagnose that an observation falls outside the range consistent with “model weather” if the trend associated with the observation (shown with a horizontal line) falls below the open circle corresponding to the pooled uncertainty resulting from: 1) “model weather”, 2) uncertainty in the determination of the model mean, and 3) measurement uncertainty.

Diagnosis for models collectively.
Because we are testing 11 models against observations, we would expect at least 1 “false positive” rejection at p=95% to occur more than 1 in 20 times. However, determining how often we might see several at a time is complicated by the fact that the tests are not independent: every model is compared to the same earth trend.

To determine the rate at which ‘m’ models reject, I replicated the methodology used to create the full graph above under a “null-set” assumption that a) all models and the earth share the same mean trend and b) all models and the earth share the same variability of ‘N month trends’. That is: I find the false positive rate (i.e. the rate at which we would reject a “correct null”) of rejecting m out of 11 models under the assumption all models are “right”. If this false positive rate falls below 5%, we conclude it is unlikely that an event of this magnitude would happen through random chance. That is: we can reject the notion that the earth weather is consistent with the weather in all of the models.
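
A sketch of that null-set simulation, stripped down to standardized deviates, is below. The number of runs per model and the observed rejection count are placeholders; the real calculation works with actual trends and the full uncertainty bookkeeping described above.

```python
# Sketch of the "null set" Monte Carlo: how often do m (or more) of the 11 models
# reject when all models and the earth share the same mean trend and the same
# trend variability, given that every model is compared to the same earth realization?
# Trends are reduced to standardized deviates (weather s.d. = 1); n_runs and
# m_observed are placeholders.
import numpy as np

rng = np.random.default_rng(4)
n_models, n_runs, trials, m_observed = 11, 5, 10000, 5

halfwidth = 1.96 * np.sqrt(1.0 + 1.0 / n_runs)   # pooled: weather + s.e. of the model mean
count = 0
for _ in range(trials):
    earth = rng.normal()                                     # shared earth "weather" deviate
    mean_err = rng.normal(size=n_models) / np.sqrt(n_runs)   # error in each estimated model mean
    rejects = earth < (mean_err - halfwidth)                 # earth below each lower bound?
    count += rejects.sum() >= m_observed
print("chance of >=", m_observed, "rejections if all models are 'right':", count / trials)
```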

Bet on October GISTemp

This month we will be betting on the GISTemp Land Ocean October monthly anomaly that will be published at
http://data.giss.nasa.gov/gistemp/graphs_v3/Fig.C.txt

when it first appears in November, 2012.

Note: We are not betting on the value of the October anomaly that will appear in December, 2012. We all know that value will likely update, both because data from some stations arrives late and also because the GISTemp method causes historical temperature anomalies to change as later data arrives. So: you are betting on the value that will be posted in November.

GISTemp is replacing UAH this month because I am uncertain about whether UAH will announce an anomaly promptly in November and I suspect GISTemp will be the first of the surface based groups to announce their anomaly. When UAH returns to reporting promptly we will revert to that series.

The betting script is below:

[sockulator(../musings/wp-content/plugins/BettingScripts/UAHBets5.php?Metric=October GISTemp: Land and Ocean TTL?Units=C?cutOffMonth=10?cutOffDay=15?cutOffYear=2012?DateMetric=September, 2012?)sockulator]
Bets Close 10/14/2012

For those wondering how GISTemp trends are comparing to the multi-model mean projections from the AR4, trends computed for various start years are shown below:

The “heavy slate blue” shows trends computed based on the model mean monthly anomalies. (I need to add this to the legend.) The black is the trend computed based on GISTemp monthly anomaly values. The heavy purple is the ±95% uncertainty (two sided) in the trend computed using GISTemp, estimated assuming the data can be modeled as (linear trend + ARMA(1,1) ‘noise’); the bias toward too-tight uncertainty intervals was corrected using monte-carlo runs.

The very light slate-blue trace with open circles is the full uncertainty for evaluating the agreement between observed trends and the trend for the multi-model mean. It is computed by taking the pooled confidence intervals based on the standard error in the multi-model mean trend (computed over 22 models) and the uncertainty in the trend associated with the GISTemp observations.

For years when the open slate-blue circles representing the upper confidence interval lie below the closed slate-blue circles, the statistical model suggests we “reject” the hypothesis that the multi-model mean is consistent with observations at a confidence of 95%, and we conclude the multi-model mean is warming at a faster rate than the observations. These rejections do not tell us why the multi-model mean is warming too fast, but they indicate that it probably is warming faster than the earth is warming.

Does this information help when betting on October values? Not really. When betting on October values, it’s more useful to know that August’s published value is 0.56C and that ENSO has been on the higher end of “La Nada” over the past few months. Not El Nino. Not La Nina. September’s anomaly would be useful to know but has yet to be published; it may or may not be published before the betting tables close.

Good luck!