Carrot Eater’s Challenge: Rate of Rejections when applied to simulations, pt. 2.

Today, I’m going to continue discussing “The Carrot Eater Test” and briefly explain why the roughly 9%-10% rejection rate falls well inside the range of what we would expect if my test of the model mean works as I claim it does. I discussed the main issue in comments on the previous Carrot’s Challenge post. Oddly, I discovered a second issue that also bears on the situation. The two issues are:

  1. If the spread of trends over the 55 runs is partly due to variance in the mean trends of the 22 different models, then the rejection rate in The Carrot Test is not the false positive rate for my test. It is the sum of the false positive rate plus some number of true rejections that detect the fact that the mean trend from some models really does differ from the multi-model mean.
  2. The Carrot Test was applied to a finite-size sample. That is, the sample consists of only 55 runs from 22 models.

Issue of Differences in Model Mean Trends

The notion that applying a hypothesis test of a trend of 0.2C/decade to all 55 runs in the ensemble Carrot suggests should result in a 5% rejection rate is based on the implicit assumption that if we ran model A a million times, the average trend over all those runs would be 0.2C/decade; that if we ran model B a million times, the average trend over all those runs would also be 0.2C/decade; and that the same would hold for all models. That is: it assumes that if modelers’ budgets were sufficiently large, all models would agree about the mean trend over the time period being tested.

No one truly thinks this is so.

To the extent that the average over a million runs from Model A might differ from 0.2C/decade, we expect that a test to determine whether individual model realizations are consistent with 0.2C/decade will result in more than 5% rejections.

So, when applying “The Carrot Eater Test” to each of the 55 runs from 22 models, we should expect the rejection rate to be larger than the false positive rate for my test of an individual run.

The obvious question is: How much larger might we expect the rejection rate to be?

My answer is: using a poor man’s EXCEL-based Monte Carlo analysis, and an estimate of the variance due to model spread from an ANOVA computation, which indicates the model spread contributes roughly 19.5% of the variance in the 100-month trends, I conclude that the expected mean rejection rate for “The Carrot Test” is about 8%. That is to say: if the modelers had conveniently produced 55 runs using 22 models, provided them to me, and let me run the test; and then did it again, and I ran the test again; and so on until I had done this a zillion times, I estimate that the rejection rate averaged over all those tests would be about 8%. This is somewhat lower than the 9.1% rejection rate I obtained comparing the 55 trends to the multi-model mean for the actual sample of 55 runs.
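For readers who would rather see where the 8% comes from without opening a spreadsheet, here is a quick back-of-the-envelope cross-check in Python (this is not my actual EXCEL Monte Carlo). If model spread accounts for a fraction f = 0.195 of the trend variance but the test statistic is scaled by the internal-variability spread alone, the statistic is inflated by a factor of sqrt(1/(1-f)), and a nominal two-sided 5% test rejects more often. This sketch also simplifies by treating the internal-variability spread as known exactly rather than estimated from each run:

```python
from scipy.stats import norm

f = 0.195                             # fraction of 100-month trend variance due to model spread
inflation = (1.0 / (1.0 - f)) ** 0.5  # inflation of the test statistic's standard deviation
# Nominal two-sided 5% test (|z| > 1.96) applied to the inflated statistic:
p_reject = 2.0 * norm.sf(1.96 / inflation)
print(f"expected rejection rate: {p_reject:.1%}")   # roughly 8%
```

Note that the overall scale of the trend spread cancels out; only the 19.5% fraction matters for this rough estimate.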

However, I applied the test to only one set of 55 runs from the AR4. I don’t expect to achieve an 8% rejection rate every single time I run the test on 55 runs. To estimate the ±95% range for rejections, I applied the test to repeated batches of 55 synthetically generated time series with ARMA(1,1) noise, structured as 55 runs from 22 models. The lowest rejection rate over those batches was 0%; the highest was 18.1%. These are very crude estimates of the 95% confidence interval, but they suggest we could expect up to 18% rejections in any one test even if my method works.
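For those who would rather read code than poke at spreadsheet cells, here is a rough Python sketch of this kind of Monte Carlo. The ARMA(1,1) parameters below are made-up placeholders, not values fitted to anything, and the sketch uses a single calibrated standard error for the trend rather than estimating ARMA(1,1) parameters from each synthetic run the way my actual test does, so the numbers it prints won’t reproduce mine exactly. It does show the structure: 55 runs from 22 models per batch, model means spread so they account for 19.5% of the trend variance, and batch-to-batch scatter in the rejection rate.

```python
import numpy as np

rng = np.random.default_rng(2024)

N_MONTHS, N_MODELS, N_RUNS, N_BATCHES = 100, 22, 55, 500
TARGET = 0.2 / 120.0                  # 0.2C/decade, expressed per month
PHI, THETA, SIG_E = 0.85, -0.4, 0.1   # placeholder ARMA(1,1) noise parameters
VAR_FRAC_MODELS = 0.195               # share of trend variance attributed to model spread

t = np.arange(N_MONTHS)
model_of_run = np.arange(N_RUNS) % N_MODELS   # assign the 55 runs to the 22 models

def arma11_noise(n):
    """One ARMA(1,1) noise series: x[i] = PHI*x[i-1] + e[i] + THETA*e[i-1]."""
    e = rng.normal(0.0, SIG_E, n + 1)
    x, prev = np.empty(n), 0.0
    for i in range(n):
        prev = PHI * prev + e[i + 1] + THETA * e[i]
        x[i] = prev
    return x

def ols_slope(y):
    """Ordinary least-squares trend per month."""
    return np.polyfit(t, y, 1)[0]

# Calibrate the sampling spread of the OLS trend under the (known) noise model.
# My real test estimates ARMA(1,1) parameters from each run instead; this
# shortcut just keeps the sketch short.
se_slope = np.std([ols_slope(arma11_noise(N_MONTHS)) for _ in range(5000)])

# Between-model spread of true trends implied by the 19.5% variance fraction.
sigma_models = se_slope * np.sqrt(VAR_FRAC_MODELS / (1.0 - VAR_FRAC_MODELS))

rates = []
for _ in range(N_BATCHES):
    model_trends = TARGET + rng.normal(0.0, sigma_models, N_MODELS)
    slopes = np.array([ols_slope(model_trends[m] * t + arma11_noise(N_MONTHS))
                       for m in model_of_run])
    rates.append(np.mean(np.abs(slopes - TARGET) / se_slope > 1.96))
rates = np.array(rates)

print(f"mean rejection rate over batches: {rates.mean():.1%}")
print(f"lowest / highest single batch:    {rates.min():.1%} / {rates.max():.1%}")
```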

This result does depend on my setting the spread of trends across models to explain 19.5% of the variance in the full spread of trends. (I can justify other values, but I will defer that until later, when I explain how I came up with 19.5%. I did quite a few companion tests while estimating that value. I estimated it three different ways, and this method gave the lowest estimate for the variability due to models; I also think it’s the most reasonable method. Using higher variability for the model spread would lead me to anticipate more than 8% rejections on average.)

Simple uncertainty

Oddly enough, I have an even easier reason why the 9.1% rejection rate obtained from one sample of 55 runs doesn’t mean my test based on a single realization is inconsistent with a rejection rate of 5%!

To get the absolute lower bound on the rejection rate I might expect if the spread in trends due to model bias were zero, I repeated the test described above, attributing the full spread in trends to internal variability in the models. That is: I assumed all models agree on the mean trend. Over 40 repetitions of this test, the rejection rate ranged from 0% to 10.9%; the high value was achieved twice. Oddly, this means that even if someone believes each individual model would return the same mean 100-month trend, with that value equal to the sample multi-model mean for the collection used in the AR4, the previously reported rejection rate of 9.1% falls inside the ±95% confidence intervals! So, I would report that the rejection rate of 9.1% over 55 runs was not inconsistent with a rejection rate of 5%.
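In terms of the ARMA(1,1) sketch above, this lower-bound case is just the same loop with the model-spread fraction set to zero, so every synthetic run shares the 0.2C/decade mean trend and the scatter in rejection rates comes entirely from each batch holding only 55 runs:

```python
VAR_FRAC_MODELS = 0.0   # attribute the full spread of trends to internal variability
# sigma_models becomes zero, every run's true trend equals TARGET, and re-running
# the batch loop gives a mean rejection rate near the nominal 5%, with individual
# 55-run batches scattering well above and below it.
```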

Oddly enough, this second result can be explained analytically, by recourse to the binomial distribution: 6 rejections out of 55 tests represents the upper bound of the ±95% confidence interval applied two-tailed. (Five or fewer rejections out of 55 occur with a cumulative probability of about 94%; six or fewer occur with a cumulative probability of about 98%. The cutoff for ±95% on a two-tailed test is ordinarily 97.5%, but since this test is discrete, we consider 6 out of 55 to fall inside those limits.)
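For anyone who wants to check that arithmetic, the cumulative binomial probabilities are a one-liner (shown here with scipy, though any binomial CDF will do):

```python
from scipy.stats import binom

n, p = 55, 0.05               # 55 independent tests, nominal 5% rejection rate
print(binom.cdf(5, n, p))     # P(5 or fewer rejections) ~ 0.94
print(binom.cdf(6, n, p))     # P(6 or fewer rejections) ~ 0.98
```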

However, I would never hang my hat on this argument, for a very simple reason. I plan to repeat this analysis using an honest-to-goodness script that doesn’t run butt-slow. When I do, I can apply The Carrot Test to data from simulations over sequential periods of 104 months. I think I can do this for at least 9 batches from the 21st century. Because the individual models really do display different trends during different periods, and that really should elevate the rejection rate above 5%, I expect that if my method works, I will get rejection rates higher than 5%, that is, closer to 8%.

Wrap up

I actually want to thank Carrot for this test. Even though I think some work is required to relate the rejections to the properties of the sample we are testing (so as to explain why the rejections should be closer to 8% than 5%), I think it will represent a very useful fiducial test of whether the ARMA(1,1) method I use gives an appropriate number of rejections when applied to simulated data. Currently, I think the test results are consistent with my method having a suitable level of ‘strictness’ when applied to temperatures from climate model simulations. However, it remains to be seen whether this result holds up when I apply it to 9 decades’ worth of 100-month trends. (I’m not doing that in EXCEL.)

The purpose of doing this in EXCEL is to help people with very limited programming skills see what is going on. For those who want to do that, here’s the spreadsheet for the Monte Carlo part. It’s s_l_u_g_g_i_s_h. AnomsCarrot