Observations V. Models: “Model Weather”.

Previously, I showed that if we model the observations of Global Surface Temperature as “linear forced trend + ARMA11 noise”, we would reject the null hypothesis that the trend in the multi-model mean temperature for AR4 models corresponding to the A1B SRES matches the trend corresponding to three observational data sets (GISTemp, NOAA/NCDC and HadCrut3) for a number of choices of start years– though not all possible choices. (See Arima11 Test: Reject AR4 Multi-Model Mean since 1980, 1995, 2001, 2001, 2003.) I leave it to readers to decide what the result means– and merely report that the rejections are occurring.

Of course, when interpreting what the rejections mean, one might observe that for some periods analyzed, the forced trend may not be linear and, even when it is, the noise may not be ARIMA. So, it is useful to look at the result a different way. Today, I’ll show what we get if we limit our tests to comparing least squares trends from simulations to those from observations as “test statistics”, and estimate the variability of trends over independent realizations of “weather” from the spread of trends across runs of an individual model. This test can be applied to individual models– which I will do. (It can also be applied to the multi-model mean if we make an additional assumption. When that assumption applies, I strongly suspect that test has greater statistical power. I will defer that comparison.) There are 11 models with more than 1 run; results will be shown for each of these 11 models.

I’ll provide the main result first, and explain the test afterwards. The main results are encapsulated in the graphs below:

The main findings are:

  1. If we test whether the observed trend in Global Mean Temperature computed starting in Jan 2000 falls within the spread of “weather” characteristic of a particular model, we reject that ‘null’ hypothesis in (2, 5, 4) cases based on comparisons to (GISS, HadCRUT3 and NOAA) respectively. If we test the same null hypothesis using trends that begin in Jan 2001, we reject the null in (4, 6, 5) cases when observations are based on (GISS, HadCRUT3 and NOAA) respectively.
  2. Using a Monte Carlo analysis simulating the above test 10^4 times to account for the fact that model results are independent of each other but all cases are compared to the same observation, I find that if this test were applied to a system of models which reproduced both the mean trend and the variability of trends correctly, the false positive rate for the number of rejections seen for trends starting in 2000 would be (10.0%, 4.3%, 5.7%) for (GISS, HadCRUT3 and NOAA) respectively. For the number of rejections seen when the test is applied to trends beginning in 2001, the false positive rates would be (6.0%, 3.3%, 4.6%) for (GISS, HadCRUT3 and NOAA) respectively.

    If we use a significance level of 5% (i.e. 1-95%) as our criterion, this test would guide us to reject the hypothesis that the HadCrut or NOAA trends are consistent with “weather” in all models, and to conclude it is unlikely that the mismatch in trends since 2001 occurred by chance. However, the comparison to GISS suggests that even if models correctly replicate the earth’s trend and the variance in trends, the number of rejections observed could have arisen by chance. Using the same significance level but applying the test to trends since Jan 2000, the number of rejections for both NOAA and GISTemp is not sufficiently large to reject the null hypothesis that model means match the observations, but the number of rejections is sufficient to state that the HadCrut3 trend does not match the ‘weather’ in at least some models.

    (Note: I am assuming I would reject models as biased if we had too many rejections at either extreme with zero rejections in the other direction. So, the rate at which we would see rejections with the models all “too warm” is one half that indicated.)

Those are the main results. Once again: what these statistically significant mismatches for some start years and some observational data sets mean could be a matter of some debate. For example: among other things, I will need to change the script to use HadCrut4 when it is reporting reliably. (The report through Aug 2012 is available now, but last month data were only available through Dec 2010. I have not yet updated my script to read it.)

But, no matter what they mean, I think it’s useful to show where the relative agreement or disagreement between models and data lies. (I will later be adding the multi-model mean. I just haven’t done so yet.)

For those interested in how things were computed, I’m going to give a very cursory description.

Trend for period of interest
The observed trends were computed over the 150 (138) months beginning in Jan 2000 (2001). These are indicated with horizontal lines above (Red=GISTemp, Green=NOAA/NCDC, Purple=HadCrut3). The mean trend for each model was computed for the same period of time by first computing the trend in each of the N available runs, and then averaging over all N runs. Each mean trend is indicated with a brown open circle.
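
For concreteness, a minimal R sketch of the trend calculation is below (the anomaly series here is randomly generated stand-in data, not any of the actual observational sets):

    set.seed(1)
    months  <- seq(as.Date("2000-01-15"), by = "month", length.out = 150)
    t_dec   <- as.numeric(months - months[1]) / 365.25 / 10   # elapsed time in decades
    anomaly <- 0.2 * t_dec + rnorm(150, sd = 0.1)              # stand-in monthly anomaly series
    fit     <- lm(anomaly ~ t_dec)                             # least squares fit
    coef(fit)[["t_dec"]]                                       # OLS trend in C per decade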

Estimate of standard deviation of “model trends”
For each model, the variance of model trends was estimated by computing the average of the variance of trends across the “N” runs over 12 non-overlapping 150 (138) month periods of simulation, for models with more than 1 run. This corresponds to the estimate of the variability in trends we would expect to arise from “weather” if a particular model is correct. The ±95% range for the spread of weather around the model mean is indicated by the vertical blue lines with the innermost horizontal ticks.
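
A sketch of one way to code that estimate is below (run_matrix is a hypothetical months × runs matrix of monthly anomalies for a single model, spanning at least 12×150 months):

    trend_of <- function(y) {                              # OLS trend of a monthly series, per decade
      t_dec <- (seq_along(y) - 1) / 120
      coef(lm(y ~ t_dec))[["t_dec"]]
    }
    n_per  <- 150                                          # window length (138 for the Jan 2001 start)
    starts <- seq(1, by = n_per, length.out = 12)          # 12 non-overlapping windows
    var_by_window <- sapply(starts, function(s) {
      trends <- apply(run_matrix[s:(s + n_per - 1), , drop = FALSE], 2, trend_of)
      var(trends)                                          # variance of trends across the N runs
    })
    sd_weather <- sqrt(mean(var_by_window))                # estimated sd of "weather" trends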

Uncertainty in model mean trend.
The sample mean trend for each model is an estimate of the mean that would be obtained in the limit that the number of runs was infinite. However, this quantity is uncertain. The standard error for this estimate was computed as the ratio of the standard deviation to the square root of the number of runs. This quantity was pooled with the standard deviation of model trends to create the ±95% range for “weather trends” we would expect given the additional uncertainty in our estimate of the mean trend. That range is indicated by the outer blue ticks.
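
In code, with sd_weather and run_matrix from the sketch above, the inner and outer half-widths work out to roughly this:

    N          <- ncol(run_matrix)                          # number of runs for this model
    z_crit     <- qnorm(0.975)                              # ~1.96
    se_mean    <- sd_weather / sqrt(N)                      # standard error of the sample mean trend
    inner_half <- z_crit * sd_weather                       # inner ticks: "weather" spread only
    outer_half <- z_crit * sqrt(sd_weather^2 + se_mean^2)   # outer ticks: weather pooled with mean uncertainty
                                                            # equivalently z_crit * sd_weather * sqrt(1 + 1/N)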

Note that in principle, if the model is correct in the sense that the model mean over an infinite number of runs would reproduce the expected value of the earth’s mean trend and the expected value of the variance in the earth’s trends if– hypothetically– we could keep rerunning the earth’s weather, an individual realization of earth’s weather should fall inside the outer tick of the blue line with probability 95%. However, observations of earth’s weather also contain measurement noise, while models do not. When we compare an individual realization of the earth’s trend to “model weather”, we must include this measurement error. (Note: We don’t need to include it when we estimate uncertainty in trends based on the residuals to a linear fit in a time series.)

Estimate of measurement uncertainty for observations
To estimate this quantity, I first obtained estimates of the uncertainty in annual average temperatures for each observational data set. I then created annual average temperature series for each data set, added “red noise” with a specified lag-1 correlation coefficient to the published data, scaled so that the residuals reproduced the estimated errors in annual average temperatures, and computed the least squares trend. For each choice of lag-1 correlation coefficient, I repeated this 100 times and computed the standard deviation of the resulting trends.

This process was repeated for lag-1 correlation coefficients from r1=0 to 0.9 at intervals of 0.1. I then selected the largest estimate of uncertainty due to measurement errors corresponding to the estimated values of uncertainty in individual annual average temperatures. (The maximum occurred with lag-1 coefficients of about r1=0.7 or 0.8. It’s likely the r1 coefficient for measurement errors is a lower value; my estimate likely represents an upper bound on the possible magnitude of the uncertainty due to measurement error.)
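
A sketch of that Monte Carlo for one data set is below; the annual series and the 1-sigma error are made-up placeholders rather than the published values:

    set.seed(2)
    annual  <- 0.45 + rnorm(13, sd = 0.1)        # placeholder annual anomalies, e.g. 2000-2012
    sigma_a <- 0.05                              # assumed 1-sigma error of an annual average
    yr_dec  <- (seq_along(annual) - 1) / 10      # years expressed in decades
    sd_trend_by_r1 <- sapply(seq(0, 0.9, by = 0.1), function(r1) {
      trends <- replicate(100, {
        e <- as.numeric(stats::filter(rnorm(length(annual)), r1, method = "recursive"))
        e <- e * sigma_a / sd(e)                 # scale the "red" noise to the published annual error
        y <- annual + e
        coef(lm(y ~ yr_dec))[["yr_dec"]]
      })
      sd(trends)
    })
    meas_sd <- max(sd_trend_by_r1)               # conservative (largest) measurement-error contribution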

Estimates of the observational error were computed for each observational data set. These were then pooled with the uncertainty due to “model weather”. The extended lower bounds for observations consistent with the model data are shown with open circles (red=GISS, purple=HadCrut, blue=NOAA).

Diagnosis for individual models.
In this method, we diagnose that an observation falls outside the range consistent with “model weather” if the trend associated with the observation (shown with a horizontal line) falls below the open circle corresponding to the pooled uncertainty resulting from: 1) “model weather”, 2) uncertainty in the determination of the model mean and 3) measurement uncertainty in the observations.
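
In code form, the diagnosis for one model and one observational series boils down to the sketch below; model_mean_trend and obs_trend stand for the quantities plotted above, and the other terms are as defined in the earlier sketches:

    pooled_sd   <- sqrt(sd_weather^2 + se_mean^2 + meas_sd^2)  # weather + mean uncertainty + measurement error
    open_circle <- model_mean_trend - z_crit * pooled_sd       # lower bound plotted as the open circle
    reject      <- obs_trend < open_circle                     # TRUE: observation outside "model weather"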

Diagnosis for models collectively.
Because we are testing 11 models against observations, we would expect a single “false positive” rejection at the 95% level to occur more often than 1 time in 20. However, determining how often we might see several at a time is complicated by the fact that the tests are not independent: every model is compared to the same earth trend.

To determine the rate at which “m” models reject, I replicated the methodology used to create the full graph above under a “null-set” assumption that a) all models and the earth share the same mean trend and b) all models and the earth share the same variability of “N month trends”. That is: I find the rate at which we would reject m out of 11 models under the assumption that all models are “right” (i.e. the false positive rate, the rate at which we would reject a “correct null”). If this false positive rate falls below 5%, we conclude it is unlikely that an event of this magnitude would happen through random chance. That is: we can reject the notion that the earth’s weather is consistent with the weather for that particular model.
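
A stripped-down sketch of that Monte Carlo under the null-set assumption is below. The run counts, the common weather sd and the observed number of rejections are placeholders, and measurement error is ignored for brevity; the real calculation uses each model's own values:

    set.seed(42)
    n_models   <- 11
    N_runs     <- c(4, 8, 9, 5, 2, 5, 3, 3, 3, 4, 3)   # placeholder number of runs per model
    sd_weather <- 0.12                                  # assumed common sd of 150-month trends (C/dec)
    z_crit     <- qnorm(0.975)
    m_observed <- 5                                     # e.g. rejections actually seen against one data set
    rejects <- replicate(10000, {
      earth       <- rnorm(1, 0, sd_weather)                          # one shared "earth" trend
      model_means <- rnorm(n_models, 0, sd_weather / sqrt(N_runs))    # sample means over N runs
      lower       <- model_means - z_crit * sd_weather * sqrt(1 + 1 / N_runs)
      sum(earth < lower)                                              # rejections in this realization
    })
    mean(rejects >= m_observed)                         # false positive rate for seeing >= m rejections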

48 thoughts on “Observations V. Models: “Model Weather”.”

  1. Unless I were Nick Stokes, I would conclude the models are biased high. Nick Stokes would probably argue your test is biased low.

  2. diogeneses–
    This test is not biased low. 🙂

    There are two problems with this test:
    1) With respect to detecting whether the mean is too low, it has lower power than would be optimal. That is: the test could be tweaked to reduce the rate of incorrect “fail to rejects” when the models are wrong. (Nick is unlikely to complain about this.)

    2) One might “reject” because the “weather noise” in models results in smaller estimates of the variability of 150-month trends than would be true for “real earth weather”. However, it’s worth noting that the “weather noise” estimates based on weather observations using ARMA11 suggest the variability of 150-month trends is lower than the magnitude suggested by most models. So, we end up with a situation where, logically, to reject this finding someone has to say the variability of trends is larger than both the variability we see in models and the variability we see for earth trends.

    For those wondering: I’m obviously gearing up for a large comparison with tests using different methods to estimate the variability of “earth weather”. I need to group things to do a test that shows that the results are not cherry picked– and that we are rejecting with a wide range of reasonable analytical choices.

  3. SteveF: 1. Yes. I just didn’t get around to fiddling with creating the legends and so on. The names are in the column headings– but “period” is the first heading. So model 1 is ncar_pcm1. The others follow.
    [1] “period” “ncar_pcm1” “ncar_ccsm3_0” “giss_model_e_r”
    [5] “giss_model_e_h” “giss_aom” “cccma_cgcm3_1_t47” “echo_g”
    [9] “miroc3_2_medres” “mri_cgcm2_3_2a” “mpi_echam5” “iap_fgoals”
    >

  4. Lucia,

    There are big differences in the uncertainty range for different models. Is this mainly because some models have fewer runs (which adds to uncertainty) or because some models have more run-to-run variability?

  5. SteveF–
    For this test, the main factor affecting the length of the “blue” line is run-to-run variability in models (i.e. “weather”). Specifically:
    If you look at the blue vertical line with pairs of horizontal “-” at the top and bottom, the inner length is dictated purely by run-to-run variability (i.e. “model weather noise”) and the multiplier for the 95% bounds (Z_crit ~1.96). So call that Z_crit*sd_weather, where sd_weather is the estimate of the variability of “m-month” trends.

    But you see there are two pairs of horizontal bounds on each vertical blue line. The outer pair is required because, while the inner pair represents the bounds for “weather in the model”, we don’t know the “true” location of the mean trend (i.e. the center circle). That is: suppose a model has 2 runs. I can estimate the mean one would get if we had one bajillion runs by averaging the 2 trends. But… that will be an estimate. So, the standard error in that is se_mean = sd_weather/sqrt(2).

    In the case of 2 runs, the length between the outer horizontal slashes scales as sd_weather*sqrt((1+2)/2). (For 3 runs this would be sd_weather*sqrt((1+3)/3).) But as you can see, the most the number of runs can widen things is sqrt(1.5) ~ 1.22. So, most of the length of the blue line in this graph is “weather”.

  6. While Diogenes concludes that the models are biased high, this could equally mean that we simply happen to be in an unusually cool period at present and that this will change in the near future (i.e. the world is biased low).

    However, instead of continuing to use a multi-model-mean as the best guess for future climate, are we not now in a position to outright reject a number of the models since the real temps are outside the range they consider possible (to 95%)?

    The remaining models can contribute to the MMM, unless/until of course they also move outside the range of acceptability.

    It’s possible that one of the rejected models was in fact the most accurate but was fed dodgy data. Should rejected models be allowed back in if they can be “tweaked” to get the right numbers now?

  7. Lucia,

    Just so that I can get this straight in my head, what your tests show is that every single model tested has its mean trend higher than the highest observed trend shown. Is that correct?

  8. If a model is unbiased it should have an equal chance of running hot or cold, as a coin has an even chance of falling heads or tails. The chance of flipping 11 heads is fairly small (0.5^11).

    Perhaps this is an unfair comparison, since ten years’ cool weather would end up confounding all the models “hot,” even if they all accurately reflect reality on longer time scales.

    One would ask then what peculiarity of the last ten years is not built in to the various models.

    Maybe someone should build a “natural selection” algorithm where the worst-performing models are thrown out, the better performers combined and bred and the next generation similarly weeded…

    Indeed entire departments could be weeded out by how far off reality their models are! Now that would be pressure!

  9. However, instead of continuing to use a multi-model-mean as the best guess for future climate, are we not now in a position to outright reject a number of the models since the real temps are outside the range they consider possible (to 95%)?

    Precisely. That’s why you can’t test the probability of “N out of M” models rejecting using a simple binomial test, which would apply only if each test were uncorrelated with the others. If you think of the earth’s trend as “forced + noise”, it could simply be that the current “noise” is in the cold direction. In that case, if models are right we should see:

    a) All model means above the earth trend (which is just what we see) and
    b) usually the earth’s trend would fall inside the ±95% spread for most models.

    Arfur Bryant
    Yes. But in and of itself, the fact that every model mean is above the earth trend doesn’t mean much. The earth trend is equivalent to a single model run, and not every model run is above the earth trend.

  10. Jit
    The coin toss isn’t fair– that’s why my test to see how often 3/11 models would reject using this test requires some lines of code. (I wrote it before I was comfortable with apply, lapply etc. So… more lines of code than necessary. 🙂 )

    If you want to do a gambling analogy think of it like this:

    Suppose “the house” periodically posts “winning” numbers based on rolls of 2 house dice. These dice have been secretly concealed and may or may not have 6 sides and the numbers painted on the faces may or may not be 1-6. These dice may or may not be “fair”.

    eg. (http://www.gameparts.net/polyhedral_dice.htm)

    But we do know that if the house rolled their dice over and over and over again, there would be an average outcome and a spread of outcomes around the average.

    Now, to “predict” the upcoming house roll, suppose 11 modelers claiming to predict “the house” create dice they believe mimic the house dice. Each modeler rolls his dice 10 times. Each modeler publishes the results of his 10 rolls. From this, I could compute the mean roll and the standard deviation for each modeler’s rolls. I could plot the mean using an orange circle and the spread around the mean with vertical blue lines as in the graph below:

    Next: the house (GISTemp) rolls. They post their number. We can now put a red line on the graph. That’s the value GISTemp posted. It also represents one roll of the house dice.

    Now: in this analogy, we see GISTemp is lower than all the model means. But that fact alone wouldn’t tell us that the models don’t properly simulate the house roll, because the house might just have gotten a low roll.

    But it is possible to see whether some models might be too high. For example: in that graph, all rolls for ‘model 2’ are too high. We diagnose that because the bottom of the blue line for ‘model 2’ lies above the red line indicating “GISTemp”. (My analogy doesn’t include error in reading the numbers on the house dice– the graph has open circles to account for that.)

    What the comparison of the bottom of the blue line for “model 2” and the red line tells us is whether the “house roll” is outside the range of rolls for “model 2”. There are other tests– but this one is a “weather type” test, and for model 2, it says it’s too warm.

  11. Jit,

    One would ask then what peculiarity of the last ten years is not built in to the various models.

    That is the trillion dollar question. Foster and Rahmstorf say the last 10-12 years have been cooler than the models suggest because of a large assumed solar effect (~+/-0.1C over the solar cycle, with no time lag!) and because La Nina has dominated recent years (wait, wouldn’t it be easy to include solar variation TOA in the models?). The F & R analysis is silly of course, but they are at least looking for an explanation for the divergence between models and reality. The truth is curve fitting disconnected from rational physical causality/plausible mechanisms is just, well, silly curve fitting.
    The other great unknown is whether part of the increase in temperatures between ~1975 and ~2001 was caused by a slower natural pseudo-periodic variation (like the AMO); if so, then the slower rate of temperature increase since 2001 could be the result of being on the downward side of the AMO. In this case the models (which don’t show any AMO-like behavior), are way wrong…. and I personally suspect that is what is causing much of the divergence. Only time will tell.

  12. Could you link to text versions of your model means and errors (or even an html table)?

    Also, for the three temperature series, you should be showing their uncertainty bounds too. Do you have estimates of their uncertainties and, if so, could you also link or post their central values and errors?

    “If this false positive rate falls below 5%, we conclude it is unlikely that an event of this magnitude would happen through random chance. That is: we can reject the notion that the earth’s weather is consistent with the weather for that particular model.”


    To put the result in context. Doesn’t this assume you have run this test over a randomly selected section of history? Or to put it another way: there may only be a 5% chance of seeing this trend now, but it is likely you will see it more than once over, let’s say, a 50 year period if the secular trend is indeed 0.2C/dec?

  14. Carrick-

    Could you link to text versions of your model means and errors (or even an html table)?

    I guess I’d have to make one and upload it first. (I can do that.)

    Also, for the three temperature series, you should be showing their uncertainty bounds too. Do you have estimates of their uncertainties and, if so, could you also link or post their central values and errors?

    Those are on the graph– just shown weirdly to make it easier to “eyeball” the results. Instead of showing them around the observed trend, I pooled them with the *model spread*. So, the open circle below the model is located sqrt(total_model_sd^2+ measurement_uncert^2) below the circle for the model mean.

    The alternative is to show a whole bunch of horizontal lines and have people “eyeballing” to guesstimate the answer. But for the probability test you need to take the square root of the sum of the squared errors, so you can’t just decide by “overlap/no-overlap”.

    I’m trying to puzzle out the best way to create graphs that show “everything” of interest. I’m not sure there is one.

  15. Re: SteveF (Oct 12 08:38),

    then the slower rate of temperature increase since 2001 could be the result of being on the downward side of the AMO.

    Not yet. A simple sinusoidal curve fit (66.6 year period) to the AMO data says we’re at the peak this year. The downward slope won’t be very high for another 9 years or so. Of course what that implies is that the climate sensitivity to ghg forcing isn’t very high.

  16. DaveE

    To put the result in context. Doesn’t this assume you have run this test over a randomly selected section of history?

    Yes and no. Because it depends on how you define “history”.

    You don’t want cherry picking. So, for example: You can’t have the analyst just hunt for a period in the population of all periods that meet the criterion for the test until you find a “good” or “bad” one and then do the analysis that shows what you prefer to show.

    But now the question is: Which historical periods are eligible to test forecasting ability?

    In the present case, the hindcast– when models were tuned– is the 20th century. The forecast is the period for which data were not available for tuning. You can’t test the forecast by comparing models to observations from 1920-1930 and so on. That’s just not a forecast. That period– and many other periods– are just not in the population of eligible periods for testing the IPCC AR4 forecast.

    So, even if you apply the concept of “picking randomly from all qualifying periods” to test the forecast, you can only pick periods in the forecast.

    And to get statistical power, you would test the forecast using the maximum amount of data available– that is, always ending “now”. So, with respect to the appropriate period of history to select from, there is only 1: the longest possible period of time in the “forecast period”. You can randomly select any period you like from the “appropriate” population of periods– and that population contains exactly 1 period.

    The only real question is: what’s the correct precise division for the ‘forecast’ period? I think 2001 is the most reasonable one because the SRES were frozen then. (Moreover, the IPCC published the TAR in 2001. So, I think it really makes very little sense to suggest the AR4 models “forecast” from a period before the TAR was even published.) But many advocate using 2000, mostly because of the way the data are stored by modelers and used by the IPCC. In the AR4, there is a break between Dec 1999 and Jan 2000 where the authors computed a “hindcast” through Dec 1999 and a “forecast” afterwards. This break is not entirely related to when the modeling groups switched from 20th century runs to SRES– but it’s imposed by the formality of what the IPCC authors did when creating their projections and hindcast.

    So, I’m showing results based on these two choices of start year. Right now, I’m pulling observation data from Process Trends (for convenience). He’s only updated through June. But I’m trying to get all the scripts organized and so on. I’ll read fresh data later on. (I have the scripts for that mostly in place– but I haven’t coded HadCrut4, which seems to just this month be up to date! So, for this set of scripts, I’m deferring coding to read the data from the individual groups until I see HadCrut4 updating.)

  17. Jit

    Maybe someone should build a “natural selection” algorithm where the worst-performing models are thrown out, the better performers combined and bred and the next generation similarly weeded…

    There is a difficulty with this idea. On the one hand, there is some sense to throwing away the obviously bad models. But on the other hand: we only have 1 realization of earth weather for the “forecast”. That earth weather will either be on the warm or the cold side of its “average” for the historical forcings that occurred.

    Suppose some models are too warm and some too cold.

    If the earth weather was very cold ‘for the earth’, we will tend to throw out models that are “too warm”. But we probably won’t detect any of the models that are “too cool”– and some may be “too cool”. Because we keep the models that are ‘too cool’, we might then tend to have a cool bias in our predictions of the future.

    As for the future: I’m curious to see whether NCAR is still in the AR5, and whether it is still “hot-hot”. 🙂

  18. Lucia,

    So, the open circle below the model is located sqrt(total_model_sd^2+ measurement_uncert^2) below the circle for the model mean.

    Ok that makes sense. It’s all that’s needed for making the comparisons. Thanks.

  19. Re: lucia (Oct 12 10:44),

    The real problem is that using one metric like global average temperature throws away a tremendous amount of information. The models could have hindcasted the twentieth century global average temperature perfectly and still be useless for forecasting because they may get the right answer for the wrong reasons. Someone should be looking at things like latitudinal cloud distribution. If that’s way off compared to the real world weather, then what we have is curve fitting, not a real physical model. Curve fits generally cannot be projected very far beyond the data limits.

  20. Lucia,

    Your range choices and assumptions are reasonable, but I would differ. I think the advantage of including the 1990s in the trend analysis, with otherwise so little data, outweighs the advantages of the more conservative 2001 start year, because the 2001 data set is very limited and at least seemingly selective. Particularly (and I know this should be covered in your noise assessment) a range which has a cooling bias (a 2:1 ratio of El Ninos to La Ninas in the first 5 years and a 1:2 ratio in the last 5 years).

    IMO using the maximum forecast data from 1990 is a good test of the models for the following reasons.

    1. The modeling (AR4 & TAR) began in the late 1990s and I think hindcasting is also a legitimate test of a model. A model that does well at hindcast and forecast is stronger than one that succeeds at only one. The A1B modeled trend did drop a bit for AR4, but A2 (a better match for the 2000 and 2010 CO2 concentrations) is the same for TAR and AR4, i.e. 0.19C/dec.

    2. Circa 1990 is the reference year for the models, so it is appropriate to use it and the post-2000 projections to calculate slopes from the reference year.

    I would also add that while, at this point, it looks more likely than not that the GISS trend to 2025 will be less than the multi-model mean of 0.2C/dec for A1B, it also looks likely to fall within the projected range of 0.16-0.24C/dec, which is an appropriate test of the model IMO. After all, even if the secular trend were measured at 0.195C/dec in 2023, as we got closer to 2025 the chances of hitting 0.200C/dec would mathematically diminish to zero, but one would hardly reject the model on those numbers.

    I guess the most relevant stat to me would be: what is the most likely trend we will see for 1990-2025, so far, and what is the statistical probability at this point that the 0.16-0.24 projection range is wrong– if you insist on the A1B scenario.

  21. DeWitt–
    I don’t disagree with you. I’m only noting a feature that arises if we throw away models based on how well they forecast the period immediately after the forecasts were made. The same would happen if you applied the principle to hindcasting.

    Of course you also have the difficulty that you keep models that might have been right for the wrong reason, and those would then veer off in the future. But the notion is based on the hope that at least getting rid of the ones that were clearly wrong would result in better predictions. Presumably the ones that hindcast badly or predicted the first bit of the forecast period incorrectly did so for a reason, and so you would not wish to keep them.

    But there is a difficulty in that we only have 1 realization of the forecast period to test which models are “bad”. So, the idea of sifting based on a model being ‘bad’ in the forecast period is fraught, and liable to remedy the bias issue by just creating a bias in the direction opposite the “weather noise” that occurred during that period.

    Anyway, by the time a forecast can be tested, the modelers tweak their models, rerun them and the brand-spanking-new models are likely to hindcast well. But this time, the hindcast will include years that were in the forecast periods of the previous forecast.

    So… I anticipate that in the AR5, anything that disagreed too violently during the period that was the ‘forecast’ for the AR4 is likely to be tweaked out of violent disagreement in the AR5 runs.

  22. DaveE–
    There is nothing wrong with comparing AR4 model results to observations during the 90s. But claiming that is a comparison of a forecast is frankly just bullshit. There is no other way to put it. This post is about testing the forecast in the AR4. So that comparison is irrelevant.

    because the 2001 data set is very limited and at least seemingly selective.

    How is it “seemingly selective”? If the way it is selected is to insist that one must test a forecast using data in the forecast period: yes. That criterion is applied. In other words: it was selected properly based on the question we are trying to answer. That question is: can the models forecast? It is absolutely proper to select based on what one wishes to learn.

    I discussed why one might pick 2000 vs. 2001. I show both results. These are the right years to “select”. Simply decreeing they were “seemingly selected” and insinuating that “selection” is somehow wrong is odd. Either these years divide the forecast from hindcast or they don’t.

    As for “limited”: of course the forecast period is shorter than the hindcast period. All that means is that the uncertainty intervals are wider. It doesn’t mean we can’t compute them, and it doesn’t mean we can’t diagnose things about the models based on the shorter time periods.

    The modeling (AR4 & TAR) began in the late 1990s

    First: it appears you are trying to change the subject from “testing the forecast from the AR4” to “testing models in general– including the TAR”.

    Second, you seem to want to base the years for testing two entirely different things– the AR4 and the TAR– on the history of something entirely different: that of “modeling”. And your notions of when modeling began may be muddled. In reality, both modeling and climate modeling predate even the First Assessment Report (FAR)– and so began well before the late 90s! Heck… my husband worked on DOE’s ARM program in 1990– and the goal for that was then– and is now– to provide data to improve “modeling” efforts.

    The idea that “modeling” started in the late 90s is both
    a) absurd and
    b) irrelevant to determining the start date for testing the forecast for models in the AR4. That is a specific forecast based on specific models used in a specific way.

    As for any claim that it makes sense to test forecasts from the AR4 based on a start year of 1990 because you think modeling began in the “late 90s”:
    a) Starting in the late 90s would not put the early 90s into the “forecast periods” for the AR4.

    b) Moreover, it’s worth noting that both the rise and fall of volcanic aerosols from Pinatubo are used as forcings in many model runs in the AR4 and– moreover– the rough temperature response to those forcings was known to modelers before they used them to drive models in the AR4. So, that post-Pinatubo period is in the hindcast. It is ridiculous to argue for including any start period before Pinatubo’s eruption– or even the temperature rebound after the eruption– in the “forecast” period for models in the AR4. This is a very important event with respect to the tuning of models used in the AR4.

    If you have some year in the late 90s to suggest as a start year that might be an alternative to 2000 or 2001 for testing the forecast in the AR4, suggest that year and say why that specific year divides forecast from hindcast. But 1990 is clearly nutso for testing the forecast in the AR4 because it is before the eruption of Pinatubo– and the temperature drop and recovery subsequent to that eruption were both known to modelers and used to refine, check or tune the aerosol parameterizations.

    Circa 1990 is the reference year for the models

    Huh? Which models are “the” models? And what does “circa 1990” even mean?

    The AR4 does not provide projections relative to “1990”. It provides projections for the 21st century with numerical values stated relative to a baseline that is the average from Jan 1980-Dec 1999. If you want to claim you are using what they used, you should use what they used. They did not make forecasts relative to 1990. And even if they had, that would simply specify the baseline.

    A baseline is required to define where “T=0” is, but it does not define the boundary separating the forecast from the hindcast.

    BTW: I do periodically show the temperatures on the precise baseline stated in the AR4 and I also show and test trends from 1980. I’ve also shown trends over a range of start years. Obviously, I can’t show every possible thing in every post. But you can see comparisons of models to GISS here:
    http://rankexploits.com/musings/2012/arima11-test-reject-ar4-multi-model-mean-since-1980-1995-200120012003/

    The models don’t look so hot with a variety of choices of start year in that period. But I don’t call the tests of trends that include periods in the hindcast a test of the forecast. Because they aren’t.

    0.16-0.24C/dec which is an appropriate test of the model IMO

    I have no idea why your “IMO” is any sort of justification for the appropriate test of “models” (whichever models you even mean). This post isn’t about testing “models”. It’s about testing the AR4 forecast.

    On this post, I request that you limit yourself to a discussion of what one might do to test the forecast in the AR4. Not some notion about testing “the models”, including those in the TAR. That’s just totally off topic. If you want to discuss testing the TAR or “the models” (in the TAR)– you are free to do so. The world is large and there are many forums.

    But stop trying to change the subject in this post, where the discussion is testing AR4 forecasts. If you think that subject is too narrow– maybe so. But this is a blog post and it’s about a particular topic.

    If– in future posts– you once again try to change the subject to the TAR, or use the dates of TAR forecasts to explain what choices we should make when testing the AR4, I will moderate you. After moderating, I will not clear your comments if they are trying to change the subject to how we should test the TAR. Though I discuss the TAR from time to time– it’s rare. I’m not especially interested in it. That is OT on this post because it is largely irrelevant to testing the AR4.

  23. Lucia,
    Nice post.
    Mmm. So the models don’t work if you calculate the variance from the “true” weather noise and they don’t work if you calculate the variance from the models’ own (arbitrary) variations.
    In any engineering environment that should be sufficient to say that we should not be using the model results for decision making. Period.
    In the business world, if there were a meeting of commercial and technical people evaluating a major investment on the basis of these data, they would conclude that it was necessary to exclude the model data as unreliable and they would test the remaining data to see if a sensible decision could be made. Anyone trying to use a statistical sleight of hand to show that the uncertainty in the observations could be expanded to include the model predictions (within say an 80% confidence interval) or that the models’ uncertainty could be similarly expanded would receive very short shrift.
    The onus is (and should be) always on the modeler to demonstrate that he can add predictive value – never on the consumer to prove that a particular model is unhelpful, when it is demonstrably failing to get the mean prediction correct.

  24. Could DaveE have that affliction that Nick Stokes suffers from: the inability to conceptualize the statistically critical differences between hindcast and forecast? If so Lucia you might not want to be so hard on him but rather allow him to be ignored.

  25. Paul_K

    Mmm. So the models don’t work if you calculate the variance from the “true” weather noise and they don’t work if you calculate the variance from the models’ own (arbitrary) variations.

    I will be summarizing eventually. I will show:
    1) comparison if we use ARMA11
    2) comparison if we use weather noise from model to test individual model mean.
    3) comparison if we use average weather noise from the multi-model mean to test individual means and multi-model means and, finally, the one that seems to be deemed “right”:
    4) test by seeing if earth is in weather noise, where “weather noise” includes both the average weather in models and is enhanced by the spread in the forced response of models.

    (4) would only be better than (3) if a) the model mean trends agree with each other in the forecast period. In such a case (4) should get the same answer as (3) but be more powerful.

    However, if (a) is violated…. (4) isn’t correct because the spread in runs is not the best estimate of the variability of “weather”; it is “weather” plus “uncertainty due to structural uncertainties in the modeling process resulting in different forced responses”. The latter is not “weather” in any way.

    Oh… and I will show (a) is violated. Then I’m going to have to wrap up and write things up.

    Kenneth
    Were it not for the time-wasting repeated tendency to create confusion by discussing the history and timing of the TAR when trying to explain how we should test forecasts for the AR4, or his desire to explain how “the models” might be right because… well… there is the TAR, I would not be threatening DaveE with moderation. He can discuss his theories about how one might test whatever he wants to test about the TAR somewhere else. The appropriate choices for testing the TAR forecasts and the results of any tests of the TAR forecasts do not belong on this thread.

  26. DaveE suffers the unbearable lightness of thinking which seems so painfully common in those who are terribly concerned about global warming. Of course hindcasts are not forecasts, and of course the models were ‘tuned’ to match pre-2000 data. As he has several times before demonstrated, DaveE is unwilling to consider any test of model predictions, since those predictions grow ever further from reality. Changing the subject to obfuscate and sidetrack is the only avenue that remains open to him to defend ‘the consensus’. Bad news DaveE: the consensus seems unable to make accurate predictions about the future.

  27. DeWitt Payne (Comment #104836),
    “Not yet. A simple sinusoidal curve fit (66.6 year period) to the AMO data says we’re at the peak this year. The downward slope won’t be very high for another 9 years or so.”
    Fair enough, but I think there is enough uncertainty in where the AMO peak lies to not exclude the possibility of some downward influence already. (http://en.wikipedia.org/w/index.php?title=File:Amo_timeseries_1856-present.svg&page=1 and http://en.wikipedia.org/w/index.php?title=File:Atlantic_Multidecadal_Oscillation.svg&page=1)
    In any case, if the AMO really is influencing average temperatures, the recent years are near the top of the ‘curve’ where the slope, positive or negative, is lower in magnitude than it was pre-2000. And yes, any AMO influence means less causal influence from GHG’s and suggests lower climate sensitivity to GHG’s. Even an eyeball evaluation indicates an influence of +/-0.1C on average temps… and suggests ~1/3 of warming from the early 1970’s to early 2000’s was unforced.

  28. SteveF, one of the things that I find really amusing about all of this, and it’s not just DaveE who falls in this trap, is the idea that everything just has to be perfect and if you find any problems, then you’ve made a mistake.

    Really, forecasting 0.2°C is a bit like forecasting “sunny and mild” with a weather model. Plenty of us have planned picnics around this type of forecast to have the weather turn out to be “cold and rainy”.

    That doesn’t mean you throw out the weather model. There are plenty of reasons why that weather model could end up being off, just like a GCM could get the slope wrong.

    What’s important with wrong GCM forecasts isn’t that you have to throw out the model, but that the GCMs, for whatever reason, aren’t useful for forecasting. Of course it could also mean the physics of the GCMs, as implemented, is wrong, but there are plenty of reasons why a forecaster might get a forecast wrong (e.g., personal bias), or, heh, maybe just understate the uncertainty of the modeled forecast.

  29. Lucia,

    Wow, must have hit a nerve!
    The modeling for the TAR forecasts began in the late 1990s, not all modeling of course – don’t be ridiculous.

    The AR4 multi-model mean forecast is with respect to the 1980-1999 mean. Agreed? So why would you disregard that information from the model data when testing model trend vs observed trend?

  30. DaveE —
    “The AR4 multi-model mean forecast is with respect to the 1980-1999 mean.”
    By this, you presumably mean that the anomalies are computed with respect to means over the 1980-1999 period. Could well be. But this has nothing to do with what years were *forecast*. Lucia has said this about three different ways, I’m not sure why you don’t grok the difference between a forecast and a hindcast.

  31. The AR4 forecast is with respect to the 1980-1999 mean (or, in my shorthand, circa 1990). Therefore the first forecast point, 2000, implicitly has slope information in it. That is not a hindcast.

    The 0.16-.24 C/dec range for A1B projection is not my opinion, it is the AR4 projection for A1B for 1990-2025. So to reject the forecast you need to test against the forecast, not just against a nominal 0.2C/dec and trends excluding the baseline information. Of course you can test whatever you want, but then you can’t say you are rejecting the AR4 forecast.

  32. DaveE —
    I think you need to review what anomaly means. There is no slope information in the assignment of an anomaly period, or equivalently, in the placement of the vertical zero of an anomaly graph.

    As has been mentioned before, the AR4 projection for A1B over 2000-2030 is not “0.16-0.24 C/dec”. The multi-model mean has a trend of around 0.23 C/dec. Look at Fig 10-4 or 10-5. [You have to enlarge the pictures to make an accurate estimate.]

  33. There’s an easier way to estimate the trend for the multi-model mean than enlarging AR4 graphics. Use KNMI Explorer.
    Click on “Monthly CMIP3+ scenario runs”
    Select multi-model mean, tas; click “Select”
    Enter latitude range -90 to +90; longitude range -180 to +180; click “Make time series”
    Download raw data for anomalies, compute OLS trend.
    I get 0.225 C/dec for the trend 2000-2030.
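
    For what it's worth, once the file is downloaded the trend step is only a few lines of R. (The file name and the two-column year/anomaly layout below are assumptions for illustration, not a description of the actual KNMI export format.)

      dat <- read.table("cmip3_multimodel_mean_tas.txt",   # hypothetical download
                        col.names = c("year", "anom"))     # assumed: decimal year, anomaly
      sel <- dat$year >= 2000 & dat$year < 2031            # Jan 2000 through Dec 2030
      dat$dec <- dat$year / 10                             # decimal year in decades
      fit <- lm(anom ~ dec, data = dat[sel, ])
      coef(fit)[["dec"]]                                   # trend in C per decade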

  34. DaveE-
    You didn’t hit a nerve. I’ve told you before that these posts aren’t about the TAR, and you are persisting in trying to explain how we evaluate AR4 forecasts based on claims (which may or may not be correct) about the TAR. Those claims are irrelevant– this is not about the TAR.

    The AR4 multi-model mean forecast is with respect to the 1980-1999 mean. Agreed? So why would you disregard that information from the model data when testing model trend vs observed trend?

    Simple: The baseline is not part of the forecast. If you are testing the forecast, you test the forecast.

    The only thing the baseline does is define the location of “0”. This has nothing to do with which period is the forecast. To test forecast, you use data from the forecast. This is very simple.

    The 0.16-.24 C/dec range for A1B projection is not my opinion

    So, now we have your opinion on what the range is, given with no other basis. Interestingly, there is no statement about the probability distribution for that range. It’s just stated flatly as “a range”. (Is it 1 sigma? 2 sigma? On ‘weather’? Or what?) Giving no details is pretty weird if one is going to discuss statistics. That makes it utterly untestable.

    But that range is irrelevant to determining whether the ensemble is biased high or low. And knowing whether the ensemble is biased high is– in my opinion– important. That’s what I’m testing and I’m reporting on that.

    If you want to consider other questions about properties of the forecast– fine. There are a zillion properties one might wish to explore. But it’s rather irrelevant to this post and does not affect the answer to the question I am testing.

    Of course you can test whatever you want, but you can’t say you are rejecting the AR4 forecast.

    If you are discussing anything about statistics or claiming to do anything with statistical tests, you ought to know perfectly well that statistical tests are very specific and apply to very specific properties. Changing the subject to a different question is pointless. The fact is: this test shows the multi-model mean is likely biased high– subject to the assumptions of the test.

    And yes, I am testing a specific question about a specific property of the forecast. I can perfectly well say the multi-model mean– a property of the forecast– is biased high. That’s the conclusion of the test I applied. You can want to ask different questions, but that doesn’t mean I can’t observe the result of a particular test or state what it shows about a particular property of the forecast: The forecast is biased high.

  35. DavidE:

    Here’s a suggestion. If you want to test models using trends from 1990-now, then when you compute your uncertainty intervals, you cannot base those on the assumption that the data from “1990 to the beginning of the forecast period” is unknown. You need to redo the statistical test based on conditional probability, conditioned on knowledge of the observations before the forecast period.

    It is possible to do this, but it is tedious. Just fitting a trend to all the data and treating the data in the hindcast period as “unknown” when testing the null does not result in the correct false positive rate of rejection.

    This could easily be shown using Monte Carlo, and you can learn the principle using white noise. Just do this with a zero trend (because that’s easiest).

    1) Pick a dividing date for the forecast. I suggest Jan 2000 as an illustrative example.
    2) Use a random number generator to create data for each month from 1990-Dec 1999. Freeze that for all tests. It’s the ‘previously known data”.

    Now: Start montecarlo as follows. For each realization “i”:
    a) generate monthly data for Jan 2000-now.
    b) tack that onto the “previously known data”.
    c) compute the trend for the string of 1990-now data just created. Save that m_i.
    d) also compute the estimated uncertainty sm_i for that trend.
    e) Also apply a “t” test to determine whether you “reject m=0” based on that fit with a significance of 5%. Record r_i = 1 if you reject; record r_i = 0 if you “fail to reject m=0”.

    Repeat a-e N=1000 times.

    Using the 1000 results, compute the pooled average sm_ave = sqrt(mean(sm_i^2)).

    Compute the standard deviations of the ms: sigma_m=sd(m_i).

    Compute the average of “r_i”

    If your method “works”, the average of r_i will be 5%. That’s the claimed false positive rate. If you get a different average, your method doesn’t work. Compare sigma_m and sm_ave. Ponder that. Then…. after you have done this, get back to us. Tell us what average r_i you get. (Remember, we can check.)

    Maybe you will have learned that you can’t apply the test you are applying to test the “forecast” using data that was known before the “forecast”– or maybe you will at least have learned that you can’t apply a regression and believe that the uncertainty estimates for the trend, computed as if the data prior to the forecast were unknown, describe the statistical uncertainty in the forecast.

    Or maybe you won’t have. Possibly because you so love the idea of testing whether you can predict the future by including predictions of the past. But the fact is: You can’t do that. Or if you do do it, you need to modify your test to account for the fact that the past data was not uncertain; it was known.
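
    For anyone who wants to actually run the exercise, here is a minimal R version (the “now” endpoint and the unit noise sd are arbitrary choices for illustration):

      set.seed(123)
      n_hind <- 120                                  # Jan 1990 - Dec 1999: frozen "previously known" data
      n_fore <- 153                                  # Jan 2000 onward (an arbitrary "now")
      t_all  <- seq_len(n_hind + n_fore) / 120       # time in decades
      known  <- rnorm(n_hind)                        # step 2: generate once, then freeze
      one_rep <- function() {
        y   <- c(known, rnorm(n_fore))               # steps a, b: fresh forecast-period data
        fit <- summary(lm(y ~ t_all))                # step c: trend over the whole 1990-now string
        m   <- fit$coefficients["t_all", "Estimate"]
        sm  <- fit$coefficients["t_all", "Std. Error"]                   # step d: claimed uncertainty
        r   <- as.numeric(fit$coefficients["t_all", "Pr(>|t|)"] < 0.05)  # step e: reject m = 0 at 5%?
        c(m = m, sm = sm, r = r)
      }
      res     <- replicate(1000, one_rep())
      sm_ave  <- sqrt(mean(res["sm", ]^2))           # pooled average of the claimed uncertainties
      sigma_m <- sd(res["m", ])                      # actual spread of the fitted trends
      mean(res["r", ])                               # observed rejection rate; compare to the claimed 5%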

  36. BTW: Before sharing your theory of what that means:

    Agung erupted in 1963.
    Fuego erupted in 1974.
    El Chichon erupted in 1982.
    Pinatubo erupted in 1991.

    All four events are thought to have injected aerosols into the stratosphere (few eruptions manage to do this) and all are simulated by those AOGCMs that accounted for volcanic eruptions during the 20th century. Find the peaks.

    The eruption of Fuego also
    a) precipitated panicked calls between our house and the Ramirez household in Guatemala (my sister’s god-mother’s daughter was staying with us at the time), and
    b) resulted in the loss of my sister’s baptismal records.

  37. HaroldW

    As has been mentioned before, the AR4 projection for A1B over 2000-2030 is not “0.16-0.24 C/dec”. The multi-model mean has a trend of around 0.23 C/dec. Look at Fig 10-4 or 10-5. [You have to enlarge the pictures to make an accurate estimate.]

    Table 10.5 (http://www.ipcc.ch/publications_and_data/ar4/wg1/en/ch10s10-3-1.html) says the average temperature for 2011–2030 will be 0.69C relative to a baseline of 1980-1999. If we compute a trend per decade by comparing the centers of those two periods, we get 0.69/3.1 = 0.22C/dec (the arithmetic is written out below). I have no idea what he means by decreeing that the forecast in the AR4 for that period is precisely the value he gives. It may be some post-publication estimate of a trend one might get by fiddling with some numbers. But I don’t have any idea where DaveE thinks the authors of the AR4 specifically gave the range he claims “is” the projection.
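
    The center-to-center arithmetic, written out:

      center_forecast <- mean(c(2011, 2030))                         # 2020.5
      center_baseline <- mean(c(1980, 1999))                         # 1989.5
      decades_apart   <- (center_forecast - center_baseline) / 10    # 3.1 decades
      0.69 / decades_apart                                           # ~0.22 C per decade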

    Moreover, it’s rather odd that his upper bound of 0.24C/dec is rather close to 0.22C/dec– the numerical value that is actually consistent with Table 10.5– while his lower bound of 0.16C/dec is well below this value.

    Well… when the time comes, 0.16C/dec may turn out to be true. In which case, we will know the mean was high. If that happens, I have no idea why anyone would think we couldn’t observe that 0.16 < 0.22 and state that the mean turned out to be higher than the value in the table.

    Right now, comparisons of the data and the observations suggest the 0.69C– the value actually printed in Table 10.5 of the AR4– will likely end up high relative to the observations.

    Whoops failed to notice it’s 31 years. edited…

  38. Lucia:

    I might have known that the coin toss analogy wasn’t fair. I just find it hard to see the real record as one of many possible outcomes.

    I also thought of a flaw in the “natural selection” of models idea after returning to hoovering the dining room. This is that you might get a better fit, but you don’t get something closer to modelling “real” processes. You said something similar I believe.

    As for “R”, well, say it like a pirate and it’s about right. It’s a single letter that causes me great distress. Why didn’t they call it “K” or something to cheer me up when I’m getting inexplicable errors in what looks like perfectly respectable code to me?!

  39. Jit

    I just find it hard to see the real record as one of many possible outcomes.

    If we are going to discuss probabilities, it’s necessary to view the earth trend that way. Of course, as a practical matter, we can’t collect multiple replicate samples of “earth” weather around the “forced earth trajectory” for any period. But to understand how to compare things, you do have to think of the earth realization as if the earth is some sort of “preferred model” that all the other models are trying to predict, and as if you could run many “earth realizations”.

    Oddly– we actually get a lot of confusion on the issue of how to treat the earth in all sorts of directions. Some people want to count “weather” twice during a comparison; some want to count it 0 times. Or they remember to count it in one part of the analysis and forget during another. And so on.

    I’ve been known to swear at R. My method of using it has also progressed.
