On Hypothesis Testing: Testing “weather noise” in models.

One of the ongoing blog disputes related to testing the IPCC AR4 projections of Global Mean Surface Temperature (GMST) is whether one should estimate the variability of 8 year trends for earth weather based on observations of the earth’s weather or based on the variability of model trends.

Of course, if the statistical properties of “model weather noise” in the GCMs used by the IPCC had been shown to agree perfectly with those of “earth weather noise”, there would be no issue. In such a case, many realizations of model predictions would provide accurate descriptions of earth weather, and the large number of realizations would provide more precise estimates.

The difficulty is that the average of an infinite number of realizations of “model weather” will not match that of “earth weather” if the models don’t reproduce earth weather. To use a dice analogy: if you roll loaded dice an infinite number of times, the “average” result will not match the average for a fair die.

So, the argument about replacing the variability of 8 year trends estimated from observations of the earth with the variability observed in models has nothing to do with the number of realizations. There are certainly more realizations for models.

The argument is over this question: Do individual model realizations reproduce the statistical properties of realizations of “earth weather”? Or are realizations of “model weather” drawn from some process that is different from “earth weather”?

What do I think about “model weather”?

I think the statistical characteristics of “model weather” do not match those of earth weather in some important ways. I think statistical tests applied to a few easily computed statistics from models can show this. (That said, I could be wrong.)

Oddly enough, though my ultimate focus is to show models do not reproduce the statistical properties of “earth weather”, I will begin by comparing weather in models to each other. Obviously, if the models collectively predict earth weather, they must at least agree with each other. So, showing they don’t even agree with each other means they cannot all reproduce earth weather, and we cannot simply trust any individual model to do so.

However, that’s not the main reason to show the models disagree with each other. The main reason is that some statistical questions and tests comparing earth weather to model weather require us to know whether all the models provide similar statistics for “weather noise”. If they do, then we can later treat all model runs as individual realizations of “the same thing”. Otherwise, each model is a realization of its own “planet weather”, and we may need to do different tests, or discuss different issues.

In the next few posts, I’ll examine features of weather noise in models, usually focusing on the five 90 month periods from 1914-1951, which are more or less unaffected by variations in stratospheric aerosols due to volcanic eruptions.

Why this time period?

The absence of variations due to this factor is important because stratospheric aerosols from volcanic eruptions are known to markedly affect GMST. Some GCMs used in the IPCC AR4 did not include these effects, while others did. So, one might expect the variations in GMST during volcano-affected periods to differ from model to model for that reason alone. In contrast, differences during volcano-free periods are more likely due to parameterizations within the models.

I have in the past used the period from 1921-1947; you can read the reasoning here. In a later post, on some tests to estimate the variability of the earth’s “weather noise” during periods free of volcanic eruptions, I commented that there is now ambiguity related to the time period near 1945; this is due to the “jet inlet-bucket” uncertainty in measurements discussed here. This ambiguity still exists, but it does not affect our ability to compare “weather noise” in one model to that in another.

So, for the purpose of comparing “model weather” to “model weather”, I expanded the range for a number of reasons. The principal ones are: a) I know people would want to see whether increasing the window for comparison changes the outcome of the tests, b) the bucket issue isn’t important for model-to-model comparisons, and c) if models agree during the tighter “volcano free” period but disagree during “volcano” periods, we will not consider that a fault of the models.

Why 90 month periods?

I am focusing on features of “weather noise” that can be observed during 90 month periods because this is approximately the time period I have been using to test the IPCC AR4 projections. The most recent test used 91 months—but 90 months gives a nice round number.

What metric will I examine today?

Today, I will examine the following metric, which is related to “weather noise”: the sample standard deviation of the residuals of a line fit using Ordinary Least Squares (OLS) to 90 months of data. I’ll refer to this sample standard deviation as “sT”. The residuals happen to provide a measure of the inter-annual variability, and the standard deviation of the residuals for 90 months of data will be treated as “an observable”.

What’s an observable? Something we can observe, and quantify! In this case, the only claims I make about this observable are:

  1. if the models reproduce the correct statistical distribution for the earth’s “weather noise,” the standard deviations of residuals to OLS fits (i.e. sT) should have the same statistical distribution as the corresponding quantity computed from the earth’s “weather noise”.
  2. if the GCMs used in the AR4 individually reproduce the earth’s weather noise, then the statistical distribution of sT in one model should not disagree with the distribution in another model.

The second point is not to say that “sT” will be identical in each and every realization (or model run). Rather, when multiple runs are available for “model A” and “model B”, the results should not differ on average.

The Hypothesis Tests

For the purpose of testing models, I will now assume and test the following: “The median sample standard deviation of residuals to a linear fit to ‘model weather data’ is identical for all ‘N’ models tested.” (The number ‘N’ will vary from test to test, as described in the results.)

What does this mean, in non-statistical terms? Well, if all GCMs reproduce “earth weather noise”, one would anticipate that the standard deviations of residuals to 90 month OLS fits for runs from all models will be drawn from the same statistical distribution. This would, presumably, be the distribution for the standard deviation of residuals to OLS fits of the earth’s weather noise.

To test this hypothesis, I:

  1. Downloaded “model data” for 19 IPCC models. (The selection method is described here.)
  2. Computed the OLS trend for five non-overlapping 90 month periods. These begin in Jan 1914, Jun 1921, Jan 1929, Jun 1936 and Jan 1944. (This standard method is available using LINEST in EXCEL; it is applied to each 90 month period.)
  3. Computed the standard deviation of the residuals. (This can also be done using LINEST in EXCEL. However, due to the nature of EXCEL, I calculated the residuals themselves and determined the standard deviations. Those who check my numbers will notice my values differ from those in EXCEL by a factor of sqrt(89/90); this constant factor makes absolutely no difference to the later test.)
  4. Applied the Kruskal-Wallis test to determine whether the median ranks are all identical. (This test is described on page 410 of Ott and Longnecker, and assumes the distribution functions are identical for all cases. This assumption may be incorrect. However, the distribution function for the residuals should be identical for all models if the models all reproduce the statistical distribution of earth weather.) A code sketch of steps 2-4 appears below.
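
For readers who would rather follow the steps in code than in EXCEL, here is a minimal sketch of steps 2-4 in Python. It is my reconstruction, not the actual spreadsheet: `monthly_gmst` is a hypothetical dictionary mapping each model name to a list of runs, each run a 1-D array of monthly GMST anomalies beginning in Jan 1914, and step 1 (downloading the data) is not shown.

```python
import numpy as np
from scipy import stats

def residual_std(y90):
    """Step 3: sample standard deviation of the residuals to an OLS line
    fit over one 90 month block of monthly GMST anomalies."""
    t = np.arange(len(y90))
    slope, intercept = np.polyfit(t, y90, 1)   # step 2: OLS trend
    resid = y90 - (slope * t + intercept)
    return np.std(resid, ddof=1)               # sample std (n - 1 denominator)

def sT_groups(monthly_gmst, start=0, window=90):
    """One list of sT values per model, for the block starting at `start`."""
    return [[residual_std(run[start:start + window]) for run in runs]
            for runs in monthly_gmst.values()]

def test_period(monthly_gmst, start=0):
    """Step 4: Kruskal-Wallis test that all models share a median sT."""
    h, p = stats.kruskal(*sT_groups(monthly_gmst, start))
    return h, p
```

Note that scipy’s `kruskal` reports a p value from the same Chi-square approximation discussed in the caveats below.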

Caveats:
To apply the Kruskal-Wallis test to determine whether the median rank of sT is equal for all models, the individual observations of sT from each model must be independent. This assumption should apply to any two observations of sT from the same time period but from entirely different model runs. It is not clear it applies to sT observed in adjacent 90 month periods of the same run.

Also, I will be estimating the ‘p’ values for the Kruskal-Wallis test using the Chi-square distribution function as implemented in the template available here. Using the Chi-square distribution is an approximation that holds when there are many samples. Unfortunately, there are actually very few runs for each model, so use of the Chi-square distribution gives only approximate ‘p’ values. This means this blog post shows the answer we get if the ‘p’ values are determined using the Chi-square distribution; for this reason, the results are provisional.
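
For reference, the statistic behind these ‘p’ values is the usual Kruskal-Wallis H statistic; this is the textbook form, not anything specific to the template:

$$ H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1) $$

Here k is the number of models, n_i is the number of sT values for model i, N is the total number of sT values, and R_i is the sum of the ranks of model i’s values in the pooled sample. Under the null hypothesis, H is approximately Chi-square distributed with k-1 degrees of freedom; it is this approximation that degrades when the n_i are small, which is the reason for the caveat above.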

If anyone knows where I can look up more precise ‘p’ values for the H distribution, let me know and I will re-do this. (Meanwhile, I will be hunting them down.)

Results

For the first hypothesis test, I compared the standard deviations of residuals computed for the eight models with more than one run during the period of interest. These were: CGCM3.1 (T47): 5 runs; GISS AOM: 2 runs; GISS EH: 3 runs; FGOALS g1.0: 3 runs; MIROC3.2 (medres): 3 runs; ECHO G: 3 runs; MRI CGCM 2.3.2: 5 runs; ECHAM5/MPI-OM: 3 runs.

For these runs, I found the following standard deviations of the residuals, sT:

  1. CGCM3.1 (T47): 0.12 C, 0.14 C, 0.12 C, 0.14 C, 0.12 C.
  2. ECHAM5/MPI-OM: 0.13 C, 0.27 C, 0.16 C.
  3. ECHO G: 0.16 C, 0.14 C, 0.14 C.
  4. FGOALS g1.0: 0.25 C, 0.28 C, 0.26 C.
  5. GISS AOM: 0.07 C, 0.09 C.
  6. GISS EH: 0.09 C, 0.10 C, 0.12 C.
  7. MRI CGCM 2.3.2: 0.14 C, 0.16 C, 0.11 C, 0.14 C, 0.12 C.
  8. MIROC3.2 (medres): 0.08 C, 0.10 C, 0.10 C.

Mere mortals examining this list might notice that the minimum standard deviation of the residuals for the three runs of FGOALS (aka “Planet AC Current”) is more than twice as large as the maximum standard deviation for GISS EH, GISS AOM or MIROC3.2 (medres). Based on that, they might suspect that maybe, just maybe, the “weather noise” across the models doesn’t represent different realizations of “weather” with identical median values for this particular observable. Just maybe FGOALS weather is “noisier” than the weather for GISS AOM.

We could test whether the median for FGOALS is the same as for GISS EH, or GISS AOM, or any other individual model. However, that invites cherry picking. After all, I mentioned FGOALS precisely because it has the largest residuals and GISS AOM because it has the smallest residuals (i.e. the least noise).

Also, for some models, like GISS AOM, we have only two runs; this makes it difficult to do a test because the power is low. After all, it’s possible the median standard deviation of residuals for GISS AOM matches that of FGOALS, and that both GISS AOM runs came in smaller than all the FGOALS runs by pure random chance. With only two runs, that could happen, right? It’s also possible the median for GISS AOM matches those of all the other models, but the two runs just happen to be the lowest and 3rd lowest standard deviations of all cases by pure random chance. But your gut is telling you the second is less likely, right? Well, your gut is right! 🙂

So, of course, our gut feeling still needs systematic testing.

That’s where Kruskal-Wallis comes in. I’m not going to explain the test; you can read about it in Ott and Longnecker (or any number of statistics texts available at your public library).

It turns out that if we use Kruskal-Wallis to test the hypothesis that the median sT for the 90 months beginning in Jan 1914 is the same for all models, the test returns p = 0.4%. The value of 0.4% is less than 5%, so we reject the idea that the sT for these models share the same median.
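
For anyone who wants to check that number, here is a sketch of the same test applied to the sT values listed above. I am assuming those listed values are the ones for the Jan 1914 period, and scipy’s tie-corrected Chi-square approximation may not reproduce the template’s p value to the digit:

```python
from scipy import stats

# sT values (in C) as listed above, one group per model
sT = {
    "CGCM3.1 (T47)":     [0.12, 0.14, 0.12, 0.14, 0.12],
    "ECHAM5/MPI-OM":     [0.13, 0.27, 0.16],
    "ECHO G":            [0.16, 0.14, 0.14],
    "FGOALS g1.0":       [0.25, 0.28, 0.26],
    "GISS AOM":          [0.07, 0.09],
    "GISS EH":           [0.09, 0.10, 0.12],
    "MRI CGCM 2.3.2":    [0.14, 0.16, 0.11, 0.14, 0.12],
    "MIROC3.2 (medres)": [0.08, 0.10, 0.10],
}

h, p = stats.kruskal(*sT.values())
print(f"H = {h:.2f}, p = {p:.4f}")   # the post quotes p = 0.4% for this period
```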

By this test, the “weather noise” for these 8 models is not drawn from the same population. In other words: different models have different amounts of “weather noise”.

It’s possible that one particular model reproduces the same “weather noise” as another particular model. But collectively, they each create their own “weather noise”.

More importantly from a model testing point of view, while some models may reproduce the magnitude of earth weather noise, at least some don’t. Bear in mind also, the test isn’t simply scoring a statistical “gotcha” against the models. Simple inspection shows that, by this particular measure, the “weather noise” in some models is twice as variable as in other models.

What about other years?

I bet some people think the main reason I expanded the range of “no-volcano” years was a “trick”. After all, in the past, I said 1914 isn’t that far from a volcanic eruption. So, maybe, I included it to increase the variability across models by having some models affected by volcanic eruptions and some not. (Some models ignore all volcanic eruptions in the hindcast.)

Well… no….

If I repeat the test for the other groups of years, the hypothesis that the models share the same median sT is rejected in every case. The p values are 0.7%, 0.2% and 0.4% for the 90 month periods starting in Jun 1921, Jan 1929 and Jun 1936 respectively; the p value for the period starting in Jan 1944 is also below the 5% threshold.

Why didn’t you use all the models?

I bet you think not using all the models is a trick, right? That if I added the runs from models with only one run, I’d get a different result?

Well… that’s not why I ignored results from some of the models.

The real reason is that the test only makes sense if we have at least 2 runs for each model included in the test. Ideally we’d have even more; I’d like five. (The reason I say the results are provisional is that the Chi-squared distribution is only an approximation when we have too few runs per model.)

However, if we assume that the results for the 90 month period starting in 1914 are uncorrelated with those starting in 1921 and so on, then we can run the test using every single model. Also, we end up with at least 5 values of sT per model. (This was one of the reasons I broke the data into 90 month periods. Though it’s not the main reason.) A sketch of this pooled version of the test appears below.
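
Here is a sketch of how that pooled version of the test could be set up, under the independence assumption just stated. Again, this is my reconstruction; `monthly_gmst` is the same hypothetical {model: list of runs} structure used in the earlier sketch, and the block boundaries should be aligned to the five start dates listed above.

```python
import numpy as np
from scipy import stats

def residual_std(y):
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)
    return np.std(y - (slope * t + intercept), ddof=1)   # sT for one block

def pooled_sT_groups(monthly_gmst, n_blocks=5, window=90):
    """One group per model: an sT value for every (run, block) pair, so even
    a single-run model contributes five observations."""
    return [[residual_std(run[k * window:(k + 1) * window])
             for run in runs for k in range(n_blocks)]
            for runs in monthly_gmst.values()]

# h, p = stats.kruskal(*pooled_sT_groups(monthly_gmst))
```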

The difficulty with this approach is that it’s not clear the standard deviations of the residuals for adjacent 90 month blocks of the same run are uncorrelated. I could make some arguments suggesting the correlation is weak. However, I don’t have enough data to test the assumption, and it may be incorrect.

So, if you don’t wish to believe the result for that reason, that’s valid. The main reason for showing this result is to demonstrate what the test would reveal if it were performed.

It turns out that if I run the test treating different time periods of the same run as uncorrelated, we must reject the hypothesis that the sT’s share a median, with p = 7 x 10^-20 %. This is somewhat less than 5%. So, if we accept this test as valid, we reject the premise that the models create “weather noise” of the same magnitude, as measured by the standard deviation of residuals from an OLS fit. However, as I noted before, the values of sT in adjacent periods may be correlated.

If they are correlated, this test, with its low p value, means little.

However, the previous tests are unaffected, and they also say: the models have different magnitudes of “weather noise”, provided we accept the Chi-squared distribution as a valid approximation for estimating the ‘p’ values.

Main result

Using the standard deviation of the residuals to an OLS fit based on 90 months of data, applying the Kruskal-Wallis test to data from the 20th century “volcano free” or “volcano-lite” period, and estimating the ‘p’ values using the Chi-squared distribution, the premise that realizations of “model weather” are drawn from the same statistical distribution is rejected. That is: according to the test, the hypothesis that the models all produce the same “weather noise” as each other is rejected at the 95% confidence level.

However, recall there is an important caveat: I used the Chi-square distribution to estimate the p values. All were well below the 5% threshold required to declare the null hypothesis false. However, I need to find tables of the “H” distribution to further refine this.

The resources I’ve found don’t tell me whether using the Chi-square approximation results in excess false positives or false negatives, so I don’t know the direction of the potential error. What I do know for sure is that now that I’ve done the test, there is someone “out there” in the world who knows where to find tables for the “H” distribution. If using the H tables instead of the Chi-square distribution gives a different result, I’m sure they’ll tell us. 🙂

What about other metrics?

You’ll see results of tests with other metrics soon. I picked metrics that are easy to compute, familiar to people, and involved in the hypothesis test I have been reporting each month.

I will be showing the results for two other metrics: the lag 1 autocorrelation calculated over 90 month periods and the observed variability of 8 year trends. I’ll also be comparing to “earth” values later on, and discussing how the values for this early “non-volcano” period compare to the properties of weather since 2001.

But the comparison to “earth” weather will come later. I will, however, mention that based on the measures I have examined, “earth” weather tends to be less variable than “model” weather, at least during periods without volcano eruptions. ☺

18 thoughts on “On Hypothesis Testing: Testing “weather noise” in models.”

  1. Lucia,
    On this question of testing a projection vs weather noise or model noise – surely both are relevant and should be considered. But there may be overlap.

    Suppose a coin is tossed 100 times and someone makes a projection with a central tendency of 50 heads. Then in testing you’d allow for ±5 “tossing noise”. But if they had already projected 50±5 on that basis, you don’t make an extra allowance.

    But suppose it is known that coins can be biased ±20% (and you can’t tell in advance), and the projector had said, on that basis, 50±10 heads. That doesn’t include tossing noise, so you should add that error for the test. If they had included that in quoting the projection (say 50±12), then again you don’t have to add anything.

    In climate, the models probably include some structural uncertainties and some allowance for weather noise. So if someone makes a projection using models and quotes an uncertainty, you could use that.

    What if you think the models have underestimated weather noise, as is often said? Then if you’re really trying to get the evaluation right, you should try to adjust for the difference, because you are probably less interested in whether they are getting the noise right than in whether they are getting the mean right.

    But what I think can’t be right is to ignore quoted model uncertainty altogether. It includes things other than weather noise that just can’t be ignored. That’s why I continue to be uncomfortable with your citations of an “IPCC central tendency of 2C/century” as a suitable object to test over a decade. If the IPCC had really made a serious projection for this period, they would, or should, have quoted a model-based uncertainty.

  2. Nick–
    If the person making the prediction says 60 ± 20, and we figure out based on data that it’s 30 ± 15, we can still observe that 60 is out of the range. Of course, they can also point out that the upper bound of 30 ± 15 is 45, which is still inside 60 ± 20, because the lower bound of 60 ± 20 is 40, and 40 is less than 45.

    But that doesn’t prevent anyone from pointing out that 60 is outside the range of what could be true, while admitting that 45 is still possible. And if that means the modelers are happy because 40-45 was inside the range of predictions and it may still pan out, well and good!

    But, 60 is still out, and there is no reason someone can’t point that out. Of course, some people can feel pointing this out is not useful. However, it’s still true and other people think pointing it out is useful.

    Having said all that….Out of curiosity, are you conceding the central tendency of 2C/century does fall outside the range consistent with the data? And that the only way for the IPCC projections to be “right” is for the lower portion of the uncertainty interval to still fall inside the range consistent with the recent earth weather?

    On another point you bring up: Like you, I have read some papers suggesting the models underestimate weather noise.

    However, comparing the earth’s weather noise to the model runs I downloaded, I don’t understand precisely how that statement is justified. At least, I don’t see how it applies to the models underlying the AR4.

    If I compare metrics describing the variability of “weather” during periods with no stratospheric volcanic eruptions, the earth appears to have noticeably less “weather noise”. In fact, it has roughly 40% less interannual variability.

    Were I to “adjust” the models for this, I would divide the “model weather noise” by a factor of 1.4.

    If the IPCC had really made a serious projection for this period, they would, or should, have quoted a model-based uncertainty.

    Oh? So, are you suggesting figure 10.4, reproduced in the technical summary, was not meant to be taken seriously? Are you suggesting the uncertainty intervals in that figure aren’t model based? The text describes the uncertainty interval as the ±1 sigma range for the average temperature projected by the individual models. I would call those “model uncertainties”. It’s a bit difficult for a reader to determine the precise magnitude of the uncertainty or the trend from that graph; one has to blow it up. But it sure looks like the IPCC authors meant to communicate precisely the information in that graph. They show it rather prominently!

  3. Lucia,

    To answer your first question, yes, you’ve established that a prediction that temperatures would follow a linear trend (plus weather noise) of 0.2 C/decade over the last decade (well, eight years, and subject to insecurity about whether the residuals are normally distributed) is falsified. But Fig 10.4 didn’t say that. It just said that, subject to some scenario assumptions, future temperatures would be found within a particular range (colored), and for each future time it marked the mean of that range. That projection includes weather noise and model uncertainties. Now the direct way to falsify that is to show that temperatures are falling outside the range.

    What you’ve done is to draw a tangent to the mean curve and say that is the “IPCC central tendency”. But there are various other lines that could have been drawn for this eight year period that would have fitted within the Fig 10.4 colored area, and following those could not be regarded as a falsification of that projection. So even with this method, there is implied model error in the trend to take account of. There is the further issue that the models don’t predict a linear rate of change, and while you can fit a line to a model prediction, you can’t really regard the discrepancies as random variables.

    I don’t have a particular view of whether the models correctly estimate weather noise; I’m just saying that if you think they have it wrong, it would be appropriate to make an adjustment.

  4. Nick–
    I don’t know why you think Figure 10.4 doesn’t show a linear trend for the central tendency of their projection. It’s quite obvious the central tendency is linear in that figure. As for the scenario assumptions: a) for the first 30 years or so, the projections are insensitive to the three scenarios illustrated and b) it is believed we followed them. This turns the projection into a forecast for all practical purposes. Had a volcano erupted or Mars attacked, I would concede your point about the “scenarios” issue.

    Now the direct way to falsify that is to show that temperatures are falling outside the range.

    This is one direct way to test. Testing the trend is also perfectly direct. It’s certainly not “indirect”.

    Unfortunately, the method you suggest, showing temperatures fall outside the range, is a poor one. Specifically, that method has no statistical power. What this means is that if the true trend is not 2C/century, your method gives the wrong answer at a greater rate than testing trends directly, as I do.

    I ran some tests using synthetic data in EXCEL and plan to show this a bit later. But basically, if the weather noise were white with a standard deviation of 0.1 C, the trend were 2 C/century, and we collected 90 months of data, we could test the trend in either of the following two ways:

    1) See if the 90th month of data fell outside the 95% confidence limits.
    2) See if the OLS trend fell outside the red-noise corrected 95% confidence limits.

    Both methods would result in roughly 5% false positives, as imposed by the choice of 95% confidence. That is to say: if the true trend were 2C/century, we’d incorrectly reject at a rate of 5%. (The Tamino Lee&Lund method would actually result in about 4.5% false positives because it makes the uncertainty intervals too small.)

    So far, both methods might seem equivalent. After all, we get false positives at the same rate. But false positives aren’t the only thing that matters. They certainly aren’t the only thing if you are trying to discover the truth or test theories. We need to look at the false negative rate.

    If, in the example above, the true trend were really 0 C/century, the OLS method would reject 2C/century in 98% of cases. So, it would be right 98% of the time and wrong 2% of the time.

    In contrast, the “see if the data fall outside the range” method (which appears to be what you suggest) would reject 2C/century 33% of the time. That is to say, it would tell us the wrong answer 67% of the time and we’d only get the right answer 33% of the time!

    This is referred to as having power of 33% with respect to the hypothesis of 0C/century. The power of 98% obtained by testing trend directly is superior.
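
    For what it’s worth, here is a minimal sketch of this kind of synthetic-data comparison in Python rather than EXCEL, under the assumptions stated above: white noise with a standard deviation of 0.1 C, a true trend of 0 C/century, a hypothesized trend of 2 C/century starting from a zero anomaly in month 0, and a plain 1.96-sigma criterion (no red-noise correction is needed for white noise). The exact percentages will wobble a bit from run to run.

```python
import numpy as np

rng = np.random.default_rng(0)
n_months, sigma = 90, 0.1
b_hyp = 2.0 / 1200.0                 # hypothesized trend: 2 C/century in C/month
t = np.arange(n_months)
sxx = np.sum((t - t.mean()) ** 2)

n_trials = 20_000
reject_trend = reject_last = 0
for _ in range(n_trials):
    y = rng.normal(0.0, sigma, n_months)         # true trend = 0 C/century
    # Method 2: test the OLS trend against the hypothesized trend
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    se_slope = np.sqrt(np.sum(resid ** 2) / (n_months - 2) / sxx)
    if abs(slope - b_hyp) > 1.96 * se_slope:
        reject_trend += 1
    # Method 1: does the 90th month fall outside the hypothesized 95% band?
    if abs(y[-1] - b_hyp * t[-1]) > 1.96 * sigma:
        reject_last += 1

print(f"power of trend test:      {reject_trend / n_trials:.0%}")  # roughly 98%
print(f"power of last-point test: {reject_last / n_trials:.0%}")   # roughly 33%
```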

    I have repeated this with AR(1) noise, and the “see if the data fall inside the uncertainty bounds” method still has lower power than the OLS trend test. This is why the method you are describing is not the standard method discussed in textbooks. It is a method that more often gives the wrong answer when we could have determined the correct answer by choosing a test method with more power.

    As for drawing various lines– I could do that too. But the lines that are consistent with the data don’t happen to be consistent with the climate projections.

    There is the further issue that the models don’t predict a linear rate of change, and while you can fit a line to a model prediction, you can’t really regard the discrepancies as random variables.

    Your argument is based on interpreting the projections as weather forecasts. I have never said the models predict linear trends for weather. I treat the model projections as projections of climate and compare them to the range of climate trends consistent with the weather we have obtained.

    Why those who defend models wish to insist we treat Figure 10.4 as a weather forecast mystifies me.

    I’m just saying that if you think they have it wrong, it would be appropriate to make an adjustment.

    If the modelers come to believe the weather noise is wrong, they will adjust their models to try to get correct weather noise. I’m not going to do that for them.

    Currently, I’m not going to “adjust” the IPCC projections and then test the adjusted projections. I’m testing the projections they published. Who in the world cares if I change the projections and then test my own changed version?

    However, based on the current analysis, their weather noise looks too large. The ‘adjustment’ would be to tighten their error bars with respect to this feature. Their projections would then be even more wrong. If you want to test a modified projection: I currently find the standard deviation of the residuals for models exceeds the value for the earth by 40%. So, if you know how to use that information to reduce the IPCC uncertainty intervals as you suggest, go ahead. Then tell us your answer, and I’ll test your modification of the IPCC projections.

  5. Lucia,

    I am still being driven by the “Grumbine question”.

    Is there some length of time over which the model “weather” is consistent with the real world “weather”? As in… should we use the model’s “weather” even if we were to test 30 years of real world “weather”?

  6. Raphael–
    Actually…. looking at the “model weather”, it looks like the interannual variation is too large during periods without volcanic activity, and that it would remain too large no matter how long we ran the models!

    This would suggest that we shouldn’t use model weather to test hypotheses about trends no matter how little earth weather we have, and no matter how much model weather we have!

    Or at least, we shouldn’t use model weather if we want to maximize the likelihood we’ll get the correct answer to the hypothesis test.

  7. Lucia,

    Testing the trend is also perfectly direct. It’s certainly not “indirect”.

    Yes, it must be indirect, because Fig 10.4 did not cite a trend. You had to create one.
    What it does do is show a region in which temperatures may fall, and draw a path which is the mid-point of that region. Now when you say “It’s quite obvious the central tendency is linear in that figure”, to me it isn’t at all obvious. The path they have drawn isn’t linear. You can fit a line, but that is what I mean by indirect. You’re testing something that you have derived. Further, you’re introducing an assumption not present in Fig 10.4, which is an expectation of linear growth. In fact the models, of which Fig 10.4 describes an envelope, do not envisage linear growth in temperature.

    BTW, I was not making an issue about scenarios – I agree that they have minor effect so far.

    On “adjustments”, I’m not suggesting adjusting any prediction, just saying that you should figure out what is the right noise to use for statistical significance. It’s actually academic, because you currently use only noise from weather observations. The projection in 10.4 shows their version of noise, and it is made up of model uncertainty and their version of weather noise. You really ought to include that somehow, and in doing so you would have to reconcile their allowance for projected weather noise with your inclusion of observed noise.

    Direct testing of whether the temperature falls in the specified range need not lack power. Obviously if you use a single straying as a falsification, that will overdo it. But there has to be some reasonable measure of when going beyond the limits is significant. QC people do this a lot.

    I think the argument for this test as having more power is a bit like looking for your lost keys under the lamp, instead of in the darker place where you most likely did lose them. The power is not helpful unless you are testing the right thing.

  8. Nick–
    Once again, you are insisting that I treat the IPCC projections as being intended to predict “weather noise” rather than average climate behavior.

    Figure 10.4 shows a trend for the climate. Or are you suggesting the figure shows trendless variations in GMST?

    The summary for policy makers discusses the anticipated trend for the climate — saying 0.2 C/decade. Table 10.5 discusses the change over three decades. The document discusses trends.

    The trends I obtain using OLS are, by definition, averages. As such, they are estimates of the “climate signal” based on a recognized method. The uncertainty bounds are intended to quantify our uncertainty in estimating that average trend, or signal.

    You appear to continue to insist I test this using methods appropriate to determining whether or not climate models can be used to predict “weather noise”. That is to say: you are, for some reason, insisting that I test the wrong thing. Or at least, you don’t want me to test the hypothesis about the climate signal, which is determined using an averaging method applied to the observations.

    That said: if you are suggesting a specific test for whatever it is you think I ought to test, name the test. That way, we can all look it up, determine what the unnamed method these QC people use actually is, and decide whether this method is appropriate for determining whether the IPCC correctly predicted the climate signal.

  9. Lucia,
    You can certainly say that there is a long term rising trend in the projection of Fig 10.4, which does run for 300 years. And there may be some merit in trying to fit a line to parts of the range. But it’s your line; it’s not in the graph they presented.

    I agree they have made references to a trend in the text without quoting uncertainties, and could be criticised for sloppiness. But there isn’t much point in just going ahead on the basis that if they didn’t quote an error range, it can be assumed to be just weather noise. Explicit projections like 10.4 do indicate an error range, and testing should take account of it.

    I don’t know a lot about the QC kinds of tests, and I am personally doubtful of the utility of testing projections over such short intervals. Part of the reason is this. GCM’s are a bit like turbulence models. You can run them in the expectation that they will reasonably describe averages in the medium to long term. But you don’t have an initial condition, and so you have to run them for a while to let the turbulence parameters diffuse and settle. GCM’s also don’t have adequate starting data, so you have to run for a period with historical data. But you can’t go too far back, because you start introducing more error as the data deteriorates. So there’s actually uncertainty even about the state in 2000 (which can’t be called weather noise).

    Now it’s true that this initial uncertainty should have been included in Fig 10.4, and it may have been. A practical issue is in reading the graph to sufficient accuracy, since it does have a 300+ year range.

    Gavin’s approach of looking at the range of what individual models predicted for the period at least was using better characterised data.

  10. Nick–
    On the “that’s not ‘their’ trend” point:
    If you think the fact that the IPCC doesn’t draw the line through their own curve is a major point, ok. But I consider that graph to communicate their projections. Right now, the portion of it that shows the central tendency of 2C/century for the current period is outside the range consistent with the data.

    As for the practical uncertainty in reading the graphs: A) It’s easy to blow up in Adobe. Use the zoom tool. That graph is in PDF. There are LOADS of pixels. B) The values can be replicated by downloading the “model data”, which I have done. The trend is 2C/century. C) The authors provide some quantitative numbers in tables; the trend is 2C/century.

    So, I guess for now: Those who think it’s interesting to see whether 2C/century is consistent with the data think it’s interesting. Evidently, you don’t.

    On uncertainty intervals
    I agree with you that the tests would benefit from discussion of the uncertainty intervals. I’ve never disagreed.

    I’d discuss the uncertainty intervals on the graph further, but those are difficult to read. Moreover, the authors are not precise in their discussion of which features contribute to the uncertainty band. (There are quite a few paragraphs of discussion, and it’s clear they don’t think it’s purely weather noise. It is a combination of internal variability in the models (which may or may not correspond to earth weather), cross-model variability due to different parameterizations, and the forcings chosen by the modelers. I can get the paragraphs for you if this is important.)

    The only thing we know is that the values of the uncertainty intervals are computed by calculating the standard deviation of model average predictions. In the limit of many runs, this would absolutely positively not be “weather” noise. The only reason “weather” noise contributes to those bands is that there are only a few runs per model.

    In any case, if those uncertainty intervals were weather noise, as you suggest, the method I am using is perfectly appropriate. I don’t need to consider them when testing the trend against earth data; I can use the earth’s weather noise.

    Anyway, on my most recent post, you’ll see the 95% spread for the model runs for each month. So, even though I think the method you seem to describe likely has no power, we will soon see if the data busts out of the 95% spread. (It touched it in January.)

    (Oh. I should note: if the data and model projections used real temperatures instead of anomalies, the “projections” would already be way off using this method. The reason the method has no power is related to the fact that the data and models are baselined so that both have zero mean anomaly over 1980-1999.)

    On GCM’s being a bit like turbulence models
    Of course they are!

    So, aren’t you perpetually surprised that when Tom Vonk posts doubts about climate models at RC, explaining them using analogies with turbulence models, the analogy to turbulence is waved away as unimportant? 🙂

    I don’t know why you are specifically concerned about the fact that models don’t have good IC’s at the outset, at least with regard to my testing them. Lack of knowledge of IC’s is a feature shared by flows affected by turbulence. Poor knowledge of IC’s never precludes testing whether the model predictions for averages, moments or spectra match observations.

    It certainly doesn’t prevent us from testing whether models match each other!

    All it means is the modelers need to understand the need to “spin up” their runs before making projections. (This is qualitatively similar to experimentalists knowing that “fully developed flow” only exists far downstream of bends. There are zillions of features like that. None preclude testing published predictions of models.)

    In any case, I am testing model predictions that, in the judgement of the modelers, were sufficiently far from the “spin up” period to mostly resolve the IC issue. I am testing averaged values. I am not testing them as “weather”. So, why are you suggesting a lack of initial conditions affects my ability to test whether published predictions agree with observations?

    Of course the IC issue doesn’t preclude comparing to see if the predictions agree!

    The IC issue could, hypothetically, explain why models are wrong. If they are wrong because models can’t predict behavior after 2000 due to poor IC’s in 1900 (or incorrect estimates of forcing from 1900-2000), that is a shortcoming related to predicting climate using models.

    That would be an interesting thing to learn about climate, wouldn’t it?

    But, if that is the case, it means the tests I am doing are useful!

    On short time period tests for turbulence models:
    That you are doubtful that short periods, like seven years, can be used to test models is fine as a feeling.

    However, it might be more convincing to those who don’t doubt if you could explain the basis of your doubt. With turbulent flows, it is entirely possible to take measurements (or run models) and estimate the uncertainty in the measurement (or prediction) based on the observed statistical distribution of the turbulent flow itself.

    These sorts of measurements and tests are done routinely in studies of turbulent flow and the criteria are explained and understood in text like “Bendat and Piersol”.

    When deciding whether the time period is sufficient, we use words like “integral time scale”, “energy spectral density function” etc. If you have any concrete information suggesting there is a large “spike” or “bulge” of energy at some long time scales, why don’t you provide it?

    I frequently ask those who simply “doubt” 7 year periods to explain the basis for the 30 years, and (other than Gavin) all I get appears to be “tradition”. That is obviously a poor explanation, as it has no physical basis. (This might be fine if proper explanations weren’t routine in other areas of investigation.)

    (FWIW: I disagree with Gavin about using the range of spread of model predictions, but at least his argument isn’t “tradition”. )

    On the QC tests: you admit you don’t know a lot about them. You also don’t name any beyond calling them QC types of tests. So I can’t look them up. If you can learn a name, then I’ll look into it. 🙂

  11. Lucia,
    On QC testing, cusum was all the rage in the ’90s.

    I see you have put up a new thread, which is probably the place for me to put any further responses.

  12. Nick–

    I’m reading about CUSUM. It’s supposed to let me detect changes in results of processes. Like say, I make beer, I’m supposed to be able to detect that the batches are now going bad. Or something.

    I read a bit at NIST. All I can figure is one might be able to assume the average GMST in the projection is the “in control” process, subtract it from the observations, and form the cumulative sum. Then…. Well, if we actually applied a “p” value of some sort, we’d get the same result as for the trend analysis you don’t like.

    So, do you have any other hints for how cusum would be used?

  13. Lucia,
    Yes, you have a process with a measure fluctuating about a mean. Maybe the pH of the beer. You chart samples and typically draw max and min values based on observed standard deviation. You don’t want to stop the process if there is a single excursion, so you compute a cusum and set a criterion based on that.

    Here it is more complicated, because the temperature plot is not even; the limits are growing. But the aim is somewhat similar – after what excursion pattern do you send the GCM’s back for repair?

    As I say, I’m no authority on this. I spent time in the 90’s evading efforts by TQM people to take over my life, and I listened to talks on this in an unsympathetic frame of mind.

  14. Nick–
    I understand it as a process control issue. But… in this instance, what’s the analog to the pH in your mind?

    As far as I can see, the analog to the pH is the average of the predicted temperature anomaly from the models. I can find that each month; I’ve already done it. Since I have that, I can compare the deviations, and then see when the cusum is “out of whack”.

    The difficulty is I’m pretty sure that in this instance, if I set a criterion that gives 5% false positives for declaring the GCM’s wrong, I will simply be replicating the trend analysis! I’m not positive, but I think in this instance it’s basically the same!

    (In cases where the trend was NOT linear, I’d get something different. But whether you think I’m just slapping the linear trend up there or not, the fact is that over the first 20-30 years of this century, it is dang linear.)

    But, if you think the average temperature predicted by the models is the equivalent of ‘pH’ in your example, I’ll be happy to do this and we can see what we get.
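
    For concreteness, here is a minimal sketch of a standard two-sided (tabular) CUSUM set up the way I describe above. The function, the default k and h, and the variable names are mine for illustration, not anything taken from the QC literature you have in mind.

```python
import numpy as np

def cusum_alarm(observed, model_mean, sigma, k=0.5, h=5.0):
    """Two-sided tabular CUSUM: accumulate deviations of the observed monthly
    anomaly from the multi-model mean anomaly, and flag the first month either
    cumulative sum exceeds h*sigma. k (reference value) and h (decision
    interval) are in units of sigma; the defaults are illustrative, not tuned
    to a particular false-alarm rate."""
    dev = np.asarray(observed, dtype=float) - np.asarray(model_mean, dtype=float)
    s_hi = s_lo = 0.0
    for month, d in enumerate(dev):
        s_hi = max(0.0, s_hi + d - k * sigma)   # drift above the projection
        s_lo = max(0.0, s_lo - d - k * sigma)   # drift below the projection
        if s_hi > h * sigma or s_lo > h * sigma:
            return month                        # chart signals here
    return None                                 # no signal
```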

  15. Lucia,
    Well, I think the observed temperature (GISS etc) is the analogue of measured pH, and the colored temperature region in fig 10.4 is an analogue of the control region. As I say, the control limits are usually two parallel lines, so I’m not sure how the test is done for diverging irregular boundaries.

  16. Nick–
    In the documentation I’ve run across, the control region is a “V-mask”. The lines aren’t parallel. The origin of the “V-mask”, which is near the “point”, is placed on the most recent datum.

    The difficulty is the documentation indicates that this is suited towards detecting “changes” in an existing trend. I’m not looking for “changes”. I’m trying to test whether predictions work.

    Also, the documentation isn’t done in terms of p-values.
    So… that’s why I’m trying to find other documentation.

    Oh… if your “control” region was the shaded region in Figure 10.4, we’re touching it (based on downloaded runs). Here’s the annual average plot.

    The monthly data was “out” in Feb, and jumped back in. (But 10.4 is an annual average. That’s why I smoothed. Still, the Jan-Dec average is the “official” data. So…. hold your breath, and wait to see if it warms or cools between now and Jan 1. The weather had to warm above last year’s Aug-Dec values for that lagging average not to go down.)

  17. Lucia,
    Basically, cusum is a hypothesis test. The null hypothesis is that the process is working as it should. If you watch it for long enough there will be random excursions beyond the limits; cusum is trying to decide whether a pattern of excursions could be random, or indicates that the process is not working as expected. If you like, our process is the monthly production of observed temperatures, and the projections are the description of how it “ought” to work. If cusum says there is something significant, then the process is not conforming. The GCM’s are the part of the “production” process that can be changed.

    The plot you showed above illustrates this well. The plot has come close to the “control limits” in the past, and is doing so now. If it goes beyond, that could, in the random model, be just chance; cusum is a way of looking at the history to help decide.
