Bayesian Projection of RCP4.5.

I’ve been threatening/promising to post a Bayesian weighted projection based on the rcp4.5 models I’d downloaded from the Climate Explorer. Today, I’m going to explain how I concocted the projection and compare it to the projection I would obtain if I simply averaged over model means from models with more than 1 run in the forecast period.

A model mean projection created by computing the multi-model mean over models with more than 1 run in the rcp4.5 scenario is shown in the R color ‘tomato’ below. The solid line is the mean projection. The dashed lines show the 2-sigma spread from the combined uncertainty: the uncertainty in the model mean (over hypothetical models that might have been included in the set), the uncertainty in which model is correct, and the ‘model weather’ creating the spread in the runs for each model. Model runs are shown in various ‘rainbow’ colors, with higher trends in ‘warmer’ colors and lower trends in cooler colors.
[Figure: MeanProjection]

One of the difficulties with the projections above is that we already know that:

  1. Observed temperatures are lagging projections after 2000. This suggests the models may be running too warm, but does not provide an explanation why.
  2. The residuals to trends fit over 30 years are consistently high for models compared to observations. This suggests that at time scales shorter than 30 years the ‘weather noise’ in models is too large.

Given this I wondered: what projection would I obtain if I assigned weights by initially giving all the models above equal weight, then computing the relative probabilities of each model in the following way:

  1. Assume that one of the models is “correct” and that initially all models are equally probable. This is my prior and will be written P_o(j = correct model), where the ‘o’ subscript indicates ‘prior’.
  2. Divide the 90-year period ending in Dec. 2012 into three 30-year periods. These will be denoted with a subscript ‘k’, one for each period.
  3. For each period ‘k’, compute (a) the linear trend m_obs,k and (b) the rms of residuals rms_obs,k for an observational data set. (I chose GISTemp here.)
  4. For each model ‘j’, compute the rms of residuals for each of its ‘n’ runs, then compute the mean and standard deviation of the rms of residuals over these ‘n’ runs: rms_mean,j and sd_rms,j. Then, assuming the rms of residuals is normally distributed about the model mean, find the probability that the observed value would occur given model ‘j’ is assumed correct. That is, P(rms_obs,k | j = correct model) is the probability we would observe rms_obs,k during period ‘k’ if ‘j’ is the correct model. This probability is defined by my assumption that the distributions are Gaussian, the model mean rms for the period, and the standard deviation of the rms values over runs (plus a slight addition to account for the finite-size effect due to having a finite number of runs).

    Note that given the prior, one would expect the probability of observing rms_obs,k to be the sum over all ‘j’ models of the product of the probability of observing this value given model ‘j’ is assumed true and the probability that model ‘j’ is thought to be true:

    P_o(rms_obs,k) = sum over j of [ P(rms_obs,k | j = correct model) * P_o(j = correct model) ]

    This provides a relative weighting for each model at this step. Using the relative weighting and Bayes’ law, I computed the posterior probability that model ‘j’ is the correct model, based on its agreement with the ‘rms’ data during period ‘k’, as:

    P_1(j = correct model)
    = P_o(j = correct model) * [ P(rms_obs,k | j = correct model) / P_o(rms_obs,k) ]

    I then repeated this for all ‘k’ periods, each time using the posterior from the previous period as the current prior.

    This step tended to pick out models whose sub-30-year ‘weather noise’ was closest to that in the observations. (As a practical matter, it tended to pick out models with runs that had less ‘noise’ than the typical model.)

  5. After weighting by ‘noise’, I repeated the above procedure but using the linear trends during each of the three periods: for each model I computed the mean and standard deviation of the trend over its ‘n’ runs (m_mean,j, sd_m,j) and assumed the distribution of trends was Gaussian. (A minimal sketch of the full updating follows below.)

When this process was completed, I had a posterior distribution over models giving the probability that each model is “the” correct model, with the weighting based on agreement with GISTemp according to the criteria defined above.
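To make the procedure concrete, here is a minimal R sketch of the updating in steps 1-5. Everything in it is a toy stand-in: the matrices rms_mean, sd_rms, m_mean and sd_m and the vectors rms_obs and m_obs are hypothetical placeholders rather than the actual values behind the figures, and the small finite-sample correction mentioned in step 4 is omitted.

    # Toy stand-ins: rows are models, columns are the three 30-year periods.
    set.seed(1)
    n_models <- 10; n_periods <- 3
    rms_mean <- matrix(runif(n_models * n_periods, 0.05, 0.15), n_models)  # mean rms of residuals over runs
    sd_rms   <- matrix(runif(n_models * n_periods, 0.01, 0.03), n_models)  # sd of rms over runs
    m_mean   <- matrix(runif(n_models * n_periods, 0.00, 0.25), n_models)  # mean 30-year trend over runs
    sd_m     <- matrix(runif(n_models * n_periods, 0.02, 0.06), n_models)  # sd of trends over runs
    rms_obs  <- c(0.08, 0.09, 0.07)   # observed (e.g. GISTemp) rms per period
    m_obs    <- c(0.05, 0.10, 0.15)   # observed trend per period

    bayes_update <- function(prior, stat_mean, stat_sd, stat_obs) {
      like <- dnorm(stat_obs, mean = stat_mean, sd = stat_sd)  # P(obs | j = correct model)
      prior * like / sum(prior * like)                         # Bayes: normalize by P_o(obs)
    }

    p <- rep(1 / n_models, n_models)          # equal prior P_o(j = correct model)
    for (k in 1:n_periods)                    # pass 1: weight by rms of residuals
      p <- bayes_update(p, rms_mean[, k], sd_rms[, k], rms_obs[k])
    for (k in 1:n_periods)                    # pass 2: weight by 30-year trends
      p <- bayes_update(p, m_mean[, k], sd_m[, k], m_obs[k])
    round(100 * p)                            # posterior weights on a 1-100 scale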

Once completed, I plotted out the new projection, which is shown in blue below:

[Figure: ReweightedProjection]
The more probable models, with weights given on a scale of 1-100, are indicated; the others have been dropped from the graph. As you can see, the reweighted trends are lower; I think this is principally because the final observed 30-year trend is much lower than the model trends, so the models with the lowest trends are given higher probabilities.

I’m not going to discuss this too much because I suspect people are going to want to focus on key features of the analysis more than on the actual results. Two very important features are:

  1. The Bayesian projection is clearly affected by the prior chosen to describe which models we thought probable in the first place. In this case, one prior assumption is that the set of models I am including in the analysis does contain the “truth”. If you do not believe the models in any way, shape, or form, the Bayesian projection ought not to be seen as “better” (even if you ‘like’ it better).
  2. The shift in the posterior is clearly affected by which observational data sets we use to compute transition probabilities for the priors. My choice was dictated by wanting to use metrics based on surface temperature, which is what I want to predict. Of these, I wanted to select easy-to-compute metrics. I used the agreement with the observations of a metric whose magnitude is affected by ‘weather noise’ at time scales shorter than 30 years, and the agreement with ‘30-year trends’ in surface temperatures. Other people could pick other metrics (e.g. ability to predict tropical cloudiness). There are many metrics one could pick, and I might note there is a vast choice to pick (or cherry pick) from!

For now: The blue curve is “Lucia’s Bayesian for rcp4.5”. Will it pan out? Beats me!

Update (July 4, 2013; placeholder so I remember the relevance!): Saw this paper on Twitter: Climate Models, Calibration and Confirmation. I need to read it further in light of thoughts I had while doing the above calculation and thinking about comparing my “Bayesian” optimum from the models to “no warming” models and so on.

70 thoughts on “Bayesian Projection of RCP4.5.”

  1. Eyeballing, it seems that “Lucia’s Bayesian” (blue curve) is closer to the observations from about 1998 to present, but before that the unweighted average (red line) seems to be better most of the time. Why is that?

  2. Lance
    The one thing I did not use was absolute distance between observed anomaly and model anomaly. So the match is on (a) 30 year trend and (b) variability.

    As for which matches better earlier: I don’t actually know. First: it is impossible for either match to be ‘better’ on average from 1961-1990. All runs are mathematically forced to show an average of 0 during that period. So if you think you see an average difference over that span, it’s an illusion. The most you could see is better “wiggle or trend matching”.
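    A minimal sketch of that re-baselining, assuming a series x with a matching year vector (hypothetical names, not the script used for the plots): each series is shifted so its own 1961-1990 mean is exactly zero.

      # Convert a series to anomalies relative to its own 1961-1990 mean
      to_anomaly <- function(x, year, base = c(1961, 1990)) {
        x - mean(x[year >= base[1] & year <= base[2]])
      }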

    In the first 30-year period (1923-1952), the blue line is sometimes closer and the ‘tomato’ line is sometimes better. (Tomato is the official ‘R’ color name for that particular red.) But I think it’s clear neither line is better or worse during that period.

    For the 2nd period (1953-1992), the ‘blue’ seems to respond to Pinatubo more than the earth did. OTOH, the earth responds to an earlier eruption more than the models (blue or orange). Given that eruptions are overlaid on ENSO, seeing the models have heavier and lighter post-eruption dips might not be a bad thing for the models. Also: ENSO wiggles are missing owing to the 36-month smoothing. So you cannot necessarily see the ‘rms’ distance from a straight-line fit during that period. That’s what I used to test, and you can’t see it.

    One difficulty: I arbitrarily chose 30 year lengths as a ‘climate’ period which makes the final period start immediately after Pinatubo. This can be an issue because this is the period with a large mis-match in mean trend relative to the observation. Given the shape of a Gaussian, a mis-match in trend in this period may have a quite heavy influence on the selection of posterior weights.

    I haven’t explored sensitivity to picking different start periods, but that could affect which model is seen as ‘most probable’. To some extent, this is the answer you get with these assumptions about how to weight. I don’t know that there are better assumptions about which parameters one could use to find the “best” model. I seem to recollect Trenberth tried something with “which models fit something or other in the tropical troposphere best”. (Anyone remember?) The thing is: at this point, I know if I go hunting, I can probably cherry pick to get “Bayesian” projections anywhere from the range of the lowest-warming model to the highest-warming one. The question would then be: if we are going to downselect, what way seems best? (Of course, comparing to future data not yet available would be best. But we can’t do that. If we could, we wouldn’t need a forecast!)

  3. Thanks for doing this analysis. I’m not sure I am properly parsing the sentence where you describe the 2-sigma, dashed lines, though. Is this correct: for each time period, your 2-sigma interval combines the variability in each model’s multiple runs (“weather”?), the uncertainty in the ensemble mean, and the uncertainty from your which-one-is-correct weighting scheme, assuming all of these are Gaussian?

  4. Lucia (#117472):
    “For the 2nd period (1953-1992), the ‘blue’ seems to respond to Pinatubo more than the earth did.”
    I’m confused. As the periods are 30 years, the 2nd period should be 1953-1982, which doesn’t include Pinatubo. Although it does end with El Chichón. Is that just a typo on your part?

  5. HaroldW–
    You’re right. I’m wrong. 1982/83 is the division. It was more than a typo: mis-thinking on my part.

  6. Wayne:
    Let’s start with unweighted.

    We imagine that the earth’s trajectory would fall inside the range of “weather” for whichever of the models is “right”. (And either it’s one of these specific models, or each is an estimate of the ‘right’ model. Depends on how you think of it.)

    We have a location of the mean: That’s a sample value.

    But one would at least imagine that if one modeling group had been defunded and another existed, one could have had a slightly different combination of AOGCMs. So there is uncertainty in the “mean” from the universe of “all possible AOGCMs” that are somehow possible in the current era of computational power and understanding. So, that’s one contribution.

    Then, even if we had the same models, each modeling group ran only certain runs. That also contributes to the uncertainty in our knowledge of the mean. (Computationally, it’s not possible to separate that contribution from the one above.)

    But that mean might not be the correct mean for the “true” model. So, in fact, the mean for the “true” model is thought to fall within the spread of all the model means. That will be the standard deviation of the means, and this spread is added to the bit above. (The bit above is smaller, btw.)

    Then, since we expect the earth’s trend to fall around the mean of the “true” model by an amount of “weather”, we need the uncertainty for the “weather”.

    These three contributions are added together to get a variance that contains the “weather” given the uncertainty of the forecast.

    Weighting isn’t much different; I just weight to get the averages and variances.
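    A rough R sketch of one way those pieces could be combined into a 2-sigma envelope (toy numbers and a hypothetical function; the exact bookkeeping behind the plotted envelope isn’t shown in the post, so treat the decomposition as illustrative):

      # trends: one trend per run; model_id: which model each run belongs to;
      # w: per-model weights (equal for unweighted, posterior for the Bayesian case)
      combined_spread <- function(trends, model_id, w) {
        model_means <- tapply(trends, model_id, mean)        # per-model mean
        runs_per    <- tapply(trends, model_id, length)
        within_var  <- tapply(trends, model_id, var)         # 'model weather' spread
        within_var[is.na(within_var)] <- 0                   # single-run models
        w <- w / sum(w)
        grand_mean  <- sum(w * model_means)                  # (weighted) multi-model mean
        var_mean    <- sum(w^2 * within_var / runs_per)      # uncertainty of the mean itself
        var_which   <- sum(w * (model_means - grand_mean)^2) # which-model-is-correct spread
        var_weather <- sum(w * within_var)                   # weather about the 'true' model
        c(mean = grand_mean,
          two_sigma = 2 * sqrt(var_mean + var_which + var_weather))
      }

      # toy example: 5 models, 2 runs each, equal weights
      set.seed(2)
      trends   <- rnorm(10, mean = 0.2, sd = 0.05)
      model_id <- rep(1:5, each = 2)
      combined_spread(trends, model_id, rep(1, 5))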

  7. Oh BTW: The 2sigma is 2sigma whether or not it’s Gaussian. It’s the conversion to p values that’s affected by the assumption of Gaussian.

  8. Lucia: Thanks, that makes it clear.

    A further question of clarification: how are multiple runs of a model set up: different priors, perturbed hyperparameters, different random seed, etc? Is this consistent among modeling teams, or is it arbitrary?

    I just have this doubt about “weather” in the real-world sense being well-characterized by differences among and within models that still don’t handle clouds and cloud-formation well.

  9. Wayne:
    I think the method is:

    1) Run a very long control run past a point they consider “quasi-steady”. (This is diagnosed by time averaged properties being fairly stable.) In these runs, forcings are constant at a level thought to be ‘pre-industrial’. Run a bit longer….

    2) Pick some point after the control run is quasi-steady as initial conditions. Call that time zero, start varying forcings as in the industrial period… and so on. These are the ‘historic’ runs, and they can continue with forecast scenarios.

    3) For a 2nd run, pick a different start point from the control run. For a 3rd run… a different one yet.

    Mostly, they try to space the start points out to ‘hit’ different points on ENSO/PDO and so on. That way, you try to explore starting at different points in cycles of ‘oscillations’ that might exist.

    I just have this doubt about “weather” in the real-world sense being well-characterized by differences among and within models that still don’t handle clouds and cloud-formation well.

    If the models don’t handle these well… then yes, they should end up with difficulties mimicking earth properties. This is somewhat separate from the question of how to get a distribution of runs that explores the behavior of the model. So: you can get a distribution of “model weather” from a bunch of model runs. Whether they match earth weather is a different question. But at least it’s worth knowing how to sample to discover the statistical properties of model weather.

  10. lucia writes “There are many metrics one could pick and I might note there is a vast choice to pick ( or cherry pick ) from! ”

    You could keep yourself busy for months looking at the various combinations 🙂

    I think it would be useful in general to have full statistics for each model published showing how well many aspects of the climate had been modelled compared to the measurements. I suspect it would be quite an eye opener.

  11. TimTheToolMan

    I think it would be useful in general to have full statistics for each model published showing how well many aspects of the climate had been modelled compared to the measurements. I suspect it would be quite an eye opener.

    Maybe. But a system of documentation by peer-review in the literature is precisely one where this will never happen. It’s the sort of thing that if desired needs to be a project at a national lab or agency that ends up documented in a big, thick, very boring lab or agency report.

  12. Hi Lucia,

    I’m going to have to think about this for a while.

    My first problem is (like TimTheToolMan) the choice of metric. It is well established that the trend in temperature can be matched with a large energy imbalance and large ocean heat uptake or with a small energy imbalance and a smaller ocean heat uptake. The energy imbalance can be controlled by adjusting forcings (esp aerosols) or adjusting the ocean model. So the fact that the historic trends in temperature in model A are better than in model B does not per se demonstrate that model A has better predictive ability – even in temperature. It may therefore be more reasonable to apply weights based on the summed residuals from both ocean heat gain and temperature, or indeed on the summed residuals of a series of vectors which are deemed to be “important to a good model match”.

    But then I am left with a conceptual question. Suppose that none of the models actually match the important matching vectors. (In fact I am certain that this is true if the number of matching vectors is expanded.) Then what does the weighted average actually mean? In your specific case, you have elected to use solely temperature as the matching vector. Fair enough. The weightings indicate that in reality only one model, the GISS-E2-R, is important – and that is because it exhibits a lower late trend than the other models. However, the average trend from this model is still above the observational trend, and this average GISS-E2-R trend is then nudged even higher in your final result by the (admittedly small) weighting granted to the other models with higher late trends.

    I think therefore that you can only conclude the obvious (with no disrespect to your demonstration of the methodology) – that if a bunch of predictions all sit to one side of a target, then no weighting algorithm can yield the target value.

  13. Lucia,
    Interesting approach. The match of your selected/weighted models to actual temperatures in response to Pinatubo remains not so good. Is the response to volcanoes part of “weather noise”?

  14. Lucia,
    Interesting approach. The match of your selected/weighted models to actual temperatures in response to Pinatubo remains not so good. Is the response to volcanoes part of “weather noise”? The match to “normal weather noise” versus volcanic noise seems to me not necessarily telling us the same thing. Volcanic noise would reveal more about fidelity to a significant perturbation in energy, while normal weather noise reveals more about the internal dynamics. The two may be related, but the response to a reasonably well-defined external forcing seems to me a better test.

  15. Humm, I tried to edit a comment, but ended up with two. Don’t know how that happened.

  16. SteveF–

    Interesting approach. The match of your selected/weighted models to actual temperatures in response to Pinatubo remains not so good. Is the response to volcanoes part of “weather noise”?

    It’s not really “weather noise” — or that sort of depends on what word someone wants to use. It’s not ENSO/PDO etc. anyway.

    The difficulty with a match in the “temperature” space is that no matter what you have, ENSO/PDO etc. continue. So if an eruption happens during La Nina vs. El Nino, that’s going to modulate the response. I don’t think there is any way to mentally separate the effects of both in the actual time series for temperature. But it has to be thought about when deciding whether one thinks the models’ response might be “right” or “wrong.”

    The match to “normal weather noise” versus volcanic noise seems to me not necessarily telling us the same thing.

    100% agreement here. The difficulty with matching residuals is it’s ambiguous. We can think of two limits:

    1) Weather has no ENSO/PDO/oscillation noise. In that case the variability from a linear trend would be due only to external features like “sun/volcano” etc. A good model should get the correct rms, so a mismatch is a “bad sign”. (In this case a perfect model should even match the timing. So if this were the case, I could come up with an even better comparison involving ‘wiggle matching’. That seems to be your thinking.) There are at least two difficulties for using this in the “probability” weighting:
    (a) We don’t have numerous volcano eruptions that are significant in size relative to ENSO/PDO. If we had dozens, this would be easy. and
    (b) I’m not confident the modelers know the forcing for the eruption. So, two models could be equally good, but one mismatches because the forcing files drive things badly. This might be a more short-term effect and not a sign the model is really overly sensitive or deficient in physics.

    So that makes the “exact wiggle matching” after volcanoes a bit dicey if used in probability for weighting which models seem more reliable.

    2) There is no “sun/volcano” stuff, but only “ENSO/PDO/oscillations”. In that case, a “good” model should match the correct rms, but you don’t expect it to match the timing of the ups and downs. So, a mismatch is a “bad sign”. And I actually think it is a very bad sign if a model can’t get the base level of this sort of ‘weather noise’ right.

    In the intermediate case, the rms is influenced by both. So a “good” model should tend to match it, but a bad one won’t. In the probability weighting I can ding the ones that don’t match. But the test absolutely does not distinguish which thing the model missed, nor can it tell me why.

    So, I thought the general level of rms as a metric to estimate the relative probability of one model relative to another is probably ok, since getting it wrong is generally bad. (That is, assuming the modelers all included solar/volcanic forcing and so on. If they leave it out altogether, that’s a problem.) It’s not a good diagnostic to figure out what the model is missing nor why. You need other methods to discriminate.

  17. Interesting analysis, Lucia. I suspect you will obtain and have obtained feedback on other approaches, but in the end I think these analyses are instructive with regards to determining the capability of the models and the limitations of using the model outputs for forecasting – and hindcasting for that matter.

    My current concern is whether I can determine whether there are modeling differences between the historical part of the models and the forecasting part. Obviously the inputs for modeling the historical part are fairly certain while going forward an uncertain scenario has to be selected.

    The “weather” noise can get in the way of looking at shorter term trends, but use of ensemble averages over an entire scenario, e.g. RCP4.5, helps average the noise out. I think the underlying trends should be the more important metric in evaluating model performance.

    My calculations of the standard deviations of the regression residuals of linear segments of the model series versus the observed series (HadCRU4, GHCN and GISS) do show that the models in general have higher standard deviations. My lying eyes had also wrongly concluded that the variation in the models was generally less in the forecasting periods than in the historical part of the series. Noise on a sloping trend line, I think, looks smaller than when the trend line is flatter. The standard deviations of the regression residuals do vary from model to model and also among the three observed series. There are some variations in some models from the historical to the forecast periods, but on average over an entire scenario series the standard deviations do not change much going from the historical to the forecast periods.
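    A minimal R sketch of that residual-spread calculation on a toy series (made-up data and breakpoints, not Kenneth’s actual code):

      # sd of residuals from separate linear fits over specified segments
      segment_resid_sd <- function(x, time, breaks) {
        seg <- cut(time, breaks = breaks, include.lowest = TRUE)
        sapply(split(data.frame(x, time), seg), function(d)
          sd(residuals(lm(x ~ time, data = d))))   # detrend each segment linearly
      }

      set.seed(3)
      time <- seq(1916, 2012.99, by = 1/12)         # toy monthly series
      x    <- 0.006 * (time - 1916) + rnorm(length(time), sd = 0.1)
      segment_resid_sd(x, time, breaks = c(1916, 1970, 2013))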

  18. Kenneth

    The “weather” noise can get in the way of looking at shorter term trends, but use of ensemble averages over an entire scenario, e.g. RCP4.5, helps average the noise out. I think the underlying trends should be the more important metric in evaluating model performance.

    That depends on which aspect of model performance you are trying to evaluate. These models fall in the class of “simulators” which are supposed to simulate the earth’s climate, including recreating general patterns of the ‘weather’. To the extent that any descriptive statistics of the weather do not match those of the earth, that suggests a lack of fidelity to their goal (and likely a problem with the ‘physics’ in the ‘model planet’).

    Because the reason we are supposed to have faith in their ability to predict long term trends is that they do simulate the earth’s weather/climate system, any lapse in the ability to simulate the shorter term events casts doubt on their ability to forecast long term trends. (Or it at least casts doubt on their ability to forecast any better than extrapolations based on time series analyses.)

    So, I think it is quite valuable to look at the ability of models to predict shorter-time-scale features which repeat over and over again. If they fail to reproduce those, that suggests a flaw, and if there is a flaw, one might have less confidence in that model.

    This is not to say that the flaws might not turn out to be irrelevant to the long-term forecast. But really, what basis can we use other than waiting to see if the forecast worked? And moreover, by the time data arrives to test a forecast, the modeling community will have a whole new set of models. AR4 was in 2007; AR5: now. If we want to try to gauge our expectation for the future climate now, we have to work with what we have.

    The standard deviations of the regression residuals do vary from model to model and also among the three observed series. There are some variations in some models from the historical to the forecast periods, but on average over an entire scenario series the standard deviations do not change much going from the historical to the forecast periods.

    For models, I think that’s clearly “model physics”, and the difference is large. I’ve seen before that it doesn’t change much. My general impression is that the models have too much short-term noise. This risks giving modelers the false impression that 10-year trends caused by things like “ENSO & shorter term” fluctuations are larger than they are for the real earth. The result would be to make it less likely to detect a true deviation from a model trajectory when it occurs. (BTW: a model having too much shorter-term noise does not exclude the possibility of having too little long-term noise. But it is very difficult for people to determine the amount of long-term noise in observations, as it requires either paleo-reconstructions or a much longer time frame for the thermometer record.)

    For the observations: different sets will introduce different amounts of ‘measurement noise’. This would be expected because each uses a different algorithm.

  19. Lucia,

    I don’t think there is any way to mentally separate the effects of both in the actual time series for temperature. But it has to be thought about when deciding whether one thinks the models’ response might be “right” or “wrong.”

    I think a reasonable estimate of the ENSO influence on temperatures can be ‘removed’ from the historical data to get a better idea of the Earth’s response to Pinatubo. For Pinatubo, the impact on solar intensity was actually measured, rather than guesstimated, so the uncertainty in the applied forcing is lower. The adjusted Pinatubo temperature influence still seems to me somewhat low compared to the models. Doing that with the Hadley data, using the ENSO influence I got from the regressions in my last guest post, and then smoothing over 36 months (like above), the peak Pinatubo temperature response was -0.20C, while my eyeball estimate of the selected-model response in your second graphic is about -0.26C or -0.27C.
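    A schematic R version of that sort of adjustment, using entirely made-up data (a fake ENSO index and an artificial eruption dip) rather than the Hadley series or the regression coefficients from the guest post:

      set.seed(4)
      months <- seq(as.Date("1980-01-01"), as.Date("1999-12-01"), by = "month")
      enso   <- as.numeric(arima.sim(list(ar = 0.9), length(months))) * 0.5   # fake ENSO index
      temp   <- 0.1 * enso + rnorm(length(months), sd = 0.1)                  # toy temperature
      dip    <- months >= as.Date("1991-07-01") & months <= as.Date("1993-06-01")
      temp[dip] <- temp[dip] - 0.3                                            # artificial 'eruption'

      beta     <- coef(lm(temp ~ enso))["enso"]                      # regression-estimated ENSO influence
      adjusted <- temp - beta * enso                                 # ENSO-removed series
      smoothed <- stats::filter(adjusted, rep(1/36, 36), sides = 2)  # 36-month running mean
      min(smoothed[months >= as.Date("1991-06-01") &
                   months <= as.Date("1994-12-01")], na.rm = TRUE)   # post-eruption minimum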

  20. SteveF-

    I think a reasonable estimate of the ENSO influence on temperatures can be ‘removed’ from the historical data to get a better idea of the Earth’s response to Pinatubo.

    Possibly. But if the models are ‘right’, ENSO will also be in the model runs. Removing ENSO from a model run requires pulling out an awful lot of data, exploring for correlations (if they exist) and so on. So, for what I did here, it’s a problem. If I were going to do that to gauge the probability that a particular AOGCM has ‘good’ physics, it would probably be better just to compare whether the correlation between some ENSO parameter and surface temperature exists and is similar to what we see in the data.

    I agree Pinatubo in models looks big. Picking ones with smaller volcanic responses might pull out the less sensitive ones. (The sensitivity to choice of metric for ranking is something that ought to be done if something like this is to be taken seriously. I haven’t explored that.)

    For my first cut, I just chose not to use ‘wiggle matching’ in the probability. That means we can see that, picking the other two, I get ‘not such great wiggle matching’. Presumably if I picked ‘wiggle matching’, I’d get either worse ‘trend fitting’ or worse ‘matching of rms residuals’.

    I’m not making big claims for what we have here; I wanted to do it as much to think about it as anything else. Some of my main thoughts are: “There are a zillion things we could use to ‘rank’ the models. I bet if I tried hard enough and programmed a method to intelligently ‘hunt’, I could probably create nearly any forecast I ‘like’!” (The one caveat: if I pick one that doesn’t match the surface temperatures we’ve already observed, people are going to see that at this point.)

  21. I am currently thinking of the models’ attempts to simulate the historical temperature record as a two-part effort: the first part of the historical series is dominated by the “natural” effects before the major effects of GHGs occurred around 1970, and the second part of the series is forced by the GHGs after 1970. I want to aim my analysis at how well the various models can simulate the historical record before and after 1970.

    My analysis needs some metrics to do what I propose above. What I have found in a general visual inspection of my segmented trend lines (and my lying eyes have actually lied to me previously) is that the models give rather mixed simulations of the historical temperatures prior to 1970 that might get the overall trend correct, but by very different routes. After 1970, nearly all the models have an upward trend that is greater than the observed series show. This implies to me that the models can reasonably well simulate the noisy intermediate structure of the observed climate, but not the timing of the variations, and that the models overestimate the forcing due to increasing GHGs.

    I think the 1970-current time period is the critical one for how well the models can predict longer-term trends due to GHGs, and the pre-1970 period points to where Trenberth can note the randomly occurring 10-15 year trends that can occur naturally.

  22. In analyzing the 42 CMIP5 RCP45 models and the observed monthly mean global temperature series, I settled on comparing the linear trends for the following 9 periods of time: 1916-1970, 1971-2013May, 1916-1928, 1929-1942, 1943-1956, 1957-1970, 1971-1984, 1985-1998, 1999-2013May. I selected 2 longer periods and then 7 periods of 13 to 14 years. The longer periods should be less affected by the “weather” noise than the shorter ones. The model trends were for individual models, and a mean trend was used where a model had multiple runs. The Observed series trends were an average of the HadCRU4, GHCN and GISS trends.

    I summarized the results all into a single graph which I have linked below. The graphed results show that the Observed trends are biased away from the model trends while the model trends are spread in a fashion that would appear to be random.

    In order to better quantify the observations from above, I ranked the models based on how close their trends came to the Observed trends for the 9 periods. I then summed those rankings over all 9 periods to provide a score for each model. Next I did a 10,000-replicate Monte Carlo to determine whether any of the model rankings were better/worse than would be expected by chance. None of the models were outside the 95% CIs.

    In order to determine whether the Observed (or any of the model) scores were biased towards the extreme ends of the distribution, I ranked all the models and the Observed trends for all 9 time periods by how close the trend came to either end of the distribution for the 9 time periods. The rankings were then summed to give an overall score for all 42 model trends and the Observed trends. I then did a 10,000-replicate Monte Carlo to determine the probability of any of these scores resulting from chance. The probability of the Observed score occurring by chance was approximately 1 in 5000. Two models had score probabilities of around 1 in 100, which might not be too unexpected given the 43 items in the calculation.
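    A stripped-down R sketch of that rank-sum Monte Carlo on a toy trend matrix (not Kenneth’s code; the null here treats each model’s closeness rank in each period as uniform on 1..42):

      set.seed(5)
      n_mod <- 42; n_per <- 9
      trends <- matrix(rnorm(n_mod * n_per, 0.10, 0.05), n_mod, n_per)  # toy model trends
      obs    <- rnorm(n_per, 0.08, 0.05)                                # toy observed trends

      # score = sum over periods of each model's closeness rank to the observed trend
      score <- function(trends, obs)
        rowSums(apply(abs(sweep(trends, 2, obs)), 2, rank))
      model_scores <- score(trends, obs)

      # Monte Carlo null: score when ranks are assigned at random
      null <- replicate(10000, sum(sample(1:n_mod, n_per, replace = TRUE)))
      ci   <- quantile(null, c(0.025, 0.975))
      which(model_scores < ci[1] | model_scores > ci[2])   # models outside the 95% interval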

    With this simple-minded analysis I find that the models appear to be characterized by rather random trend results over the time periods studied, while the Observed trends over these periods appear to be biased to one end or the other of the distributions of trends. The random distribution of model trends in the 7 time periods of 13 to 14 years might support the Trenberth observation of short-duration trends in models, but the biasing of the Observed trends relative to the model trends puts a whole different perspective on what that means vis-a-vis Observed versus modeled temperatures.

    http://imageshack.us/a/img812/7659/i9x.png

  23. Kenneth– I don’t understand your graph. Are the colored dots observations? (They don’t seem to be.) If so, they seem to be rather spread around inside the model runs? (Or are the colors model means? Or what?)

    On another issue: clearly, these comparisons over these periods are not independent. 1916-1928 is contained in 1916-1970.

    Another issue: How did you decide what periods to test trends?

  24. Kenneth,
    That would be expected if the data is being driven away (higher or lower) from a long term secular trend by one or more factors not simulated by the models.

  25. Lucia:

    The open black circles are the trends from the 42 models, and the green dots are the Observed trends (average of HadCRU4, GHCN and GISS), while the red and yellow dots are the trends for the worst- and best-scoring models, respectively.

    I selected the periods by attempting to partition the 2 longer-term periods into approximately equal time periods that would be somewhat representative of “weather”, to see whether the models would randomly produce short-term trends and how the observed trends over those shorter time periods relate to them.

    The 1916-1970 period was selected because going back further with the historical record puts much more uncertainty into the data, and the 1970 partition was chosen in consideration of when GHGs would have begun influencing the observed temperatures in earnest.

    Obviously the longer-term trends include the shorter-term ones and are not strictly independent, but with the model short-term trends apparently randomly jumping around I wanted to see how the models performed longer term with less influence of the weather noise. I conjecture that the shorter-term trends in the models might be driven by factors different than the longer-term ones. When I said a simple-minded analysis I meant that, amongst other things, your observation on independence might well affect the CIs I calculated using all 9 time periods, but I wanted to get a feel for the models’ performance beyond the general statements one might obtain from the IPCC that the models are in good agreement with the historical record.

    SteveF:

    Your surmise may well be explanatory here. Perhaps the Observed and model trends are out of phase with one another.

    What surprised me is that the biases of the Observed periodic trends within a rather random distribution of model trends would show so dramatically in the historical period, where the models could be fit to the Observed. That observation indicates that the models were perhaps not that much overfitted to match the historical record.

    I think the best we can hope for the model performance at this point in time is that models get the longer term trends correct – and the 1971-2013 May period comparison of models to Observed puts that proposition in doubt.

  26. Kenneth,

    I conjecture that the shorter term trends in the models might be driven by factors different than the longer term ones.

    Short term trends are heavily influenced by where their “model” might be inside whatever “ENSO/PDO” etc cycle it has. So we expect those to look more “noisy”. Longer term trends should be more heavily influenced by external forcings.

    I think the best we can hope for the model performance at this point in time is that models get the longer term trends correct – and the 1971-2013 May period comparison of models to Observed puts that proposition in doubt.

    That’s the only type of trend AOGCMs as used are supposed to get close to right. There is no attempt to match every wiggle in the earth’s ‘weather’ record.

    The only way in which comparing short-term trends makes sense is to see if the earth’s observation falls inside the spread of trends the models would exhibit. (This is the only way testing long-term trends makes sense too, but there the effect of ENSO/PDO etc. narrows relative to the response to external forcings.)

  27. Kenneth,
    Perhaps it would be constructive to look at the frequency power spectrum for models versus data. My guess is that the models have too much power in the higher frequencies and too little in the low frequencies. Accurate, relatively long-term variation due to oceanic variation (e.g. changes in thermohaline circulation rate) would require a very realistic ocean model, as well as an accurate treatment of the atmosphere, including cloud effects. It is pretty clear the models aren’t yet close to that.
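    A minimal base-R sketch of that kind of comparison, using spectrum() on two toy AR(1) series with different ‘redness’ rather than any actual model output:

      set.seed(6)
      n   <- 1164                                              # ~97 years of monthly values
      mod <- as.numeric(arima.sim(list(ar = 0.5), n)) * 0.10   # 'model-like': relatively whiter
      obs <- as.numeric(arima.sim(list(ar = 0.9), n)) * 0.05   # 'obs-like': redder

      sp_mod <- spectrum(mod, spans = c(9, 9), plot = FALSE)   # smoothed periodograms
      sp_obs <- spectrum(obs, spans = c(9, 9), plot = FALSE)

      plot(1 / sp_obs$freq, sp_obs$spec, type = "l", log = "xy",
           xlab = "period (months)", ylab = "spectral density")
      lines(1 / sp_mod$freq, sp_mod$spec, col = "tomato")
      legend("topleft", legend = c("obs-like (redder)", "model-like (whiter)"),
             col = c("black", "tomato"), lty = 1)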

  28. lucia (Comment #117567)

    Agreed.

    SteveF (Comment #117568)

    I am a novice when it comes to doing and interpreting spectral analyses, but that was my next approach. I am fairly familiar with what R has to offer for those analyses.

  29. Re: SteveF (Jul 5 15:45),

    Accurate relatively long term variation due to oceanic variation (eg. Changes in thermohaline circulation rate) would require a very realistic ocean model….

    Good luck with that. The time constants of ocean vs atmosphere are different by several orders of magnitude. According to what I’ve read, there’s no way that you can actually spin up a realistic ocean model in the same number of cycles you use to spin up the atmosphere. You have to play even more games with dissipation and viscosity in the ocean than the atmosphere during the spin up and then go back to more realistic conditions after spin up. Yet more opportunity for tuning kludges.

  30. DeWitt,
    If it was easy we wouldn’t be making the big bucks from payoffs from the fossil fuel industry, right? 😉

  31. SteveF (Comment #117568)
    July 5th, 2013 at 3:45 pm

    “Kenneth,
    Perhaps it would be constructive to look at the frequency power spectrum for models versus data. My guess is that the models have too much power in the higher frequencies and too little in the low frequencies.”

    Well, the easiest way to split any signal (power-wise), temperature or otherwise, is to throw it at a bandpass splitter/cascaded low-pass filter and see where in the spectrum the energy is (as RMS of the various bands) and what is left, if anything, to assign.

    Just my way of looking at things 🙂

  32. SteveF (Comment #117568)
    July 5th, 2013 at 3:45 pm

    “Kenneth,
    Perhaps it would be constructive to look at the frequency power spectrum for models versus data. My guess is that the models have too much power in the higher frequencies and too little in the low frequencies.”

    Well the easiest way to split any signal (power wise), temperature or otherwise, is to throw it at a bandpass splitter/cascaded low pass filter circuit and see where in the spectrum the energy is (as RMS power in the various bands) and what is left, if anything, to assign.

    That should allow you to compare between the various temperature sources and decide if they match and by how much.

    Just my engineering way of looking at things 🙂

  33. Richard LH,
    OK So use any filter function you think is suitable. Please, don’t hold back, show us the way.

  34. Just noticed a paper using an ensemble of CMIP models to estimate aerosol contributions. link.

    One of the authors, Elli Highwood, has a post up on it, though I personally didn’t find her discussion very helpful.

  35. Re: Carrick (Jul 6 09:12),

    So if the models include the aerosol indirect effect, which some folks in the biz don’t think is significant, they fit the weather in the mid twentieth century better. Can you say “tuning”? There were major droughts and low temperatures in North America in the 1870’s-1880’s. Was that anthropogenic aerosols too?

  36. DeWitt, to be clear, I’m not linking this paper to endorse it, just trying to fully understand it and its methodology.

    It does seem to me there is a selection problem that “warmer” models will favor larger aerosol effects to compensate for periods of unexplained cooling. I suppose the argument is that they are employing a multivariate analysis that includes precipitation.

    I would like to see a similar analysis that only uses the 20-year period where we have usable aerosol data.

    I’d also like to have them look at model predictions of the effect of aerosols on temperature vs precipitation, to see whether the effects on precipitation and temperature from aerosols (I’m worrying mostly about the sign of the correlation) are the same as expected from the models.

  37. DeWitt Payne (Comment #117575)
    July 5th, 2013 at 10:55 pm

    Grandma may have sucked eggs but she did not wear army boots.

    I am planning to use the spec and simspec functions from R library(seewave) to compare the model spectra to those of the observed series.

  38. The Elli Highwood post is relevant to the discussion here on spectral analysis.

  39. Carrick,
    I found the paper’s argument completely unconvincing. Cooling from any cause would lead to less total rainfall. As far as I can tell, it is more of the same tripe that is always on offer: aerosols explain everything. Another in the long line of “you can never show the models are wrong” papers…. papers which just happen to show up when the models make poor projections. There is no aerosol data, just arm waves and models. As usual, it’s only models, all the way down.

  40. SteveF, I don’t buy into your argument that “Cooling from any cause would lead to less total rainfall”.

    As a counter example, increased cloud cover associated with increased aerosols could lead to both cooler temperatures and increased precipitation.

  41. Carrick,
    Increasing the number of droplets does not necessarily increase rainfall; it may do the opposite by extending the time required for droplets to grow to a size where accretion generates rain. Increasing the number of droplets certainly can increase albedo, but even the magnitude of that influence is contested (AR5 SOD). I am much more persuaded by projections than by hind-cast ‘tuning’ of models to fit the known data.

  42. SteveF:

    Increasing the number of droplets does not necessarily increase rainfall; it may do the opposite by extending the time required for droplets to grow to a size where accretion generates rain.

    Yes, it “could”, but it also “could” increase rainfall. I was objecting to your word choice “would” in “Cooling from any cause would lead to less total rainfall”. I don’t think the processes are well enough understood to use “would” here in either direction.

    As to the AR5, this is a case where democracy doesn’t work very well. The question isn’t whether you can find meteorologists who believe a particular thing; rather, the question should be what the best meteorologists think. With the IPCC, it seems that politics plays a big role in selecting whose results to “believe”, and the role that politics is playing just gets worse with each new round of reports.

  43. Carrick,
    I think there is broad agreement that warming of the oceans will increase rainfall. Models and observations (Wentz et al 2007, Science) agree on that, though observations apparently indicate a greater increase than models. Conversely, cooling ought then to lead to less rainfall.
    .
    My objection is to the paper’s unsupported claim that aerosols were responsible for the observed cooling (and the corresponding drop in rainfall). I wonder if the deadline for publication for AR5 has passed, since this paper seems to me tailor made to defend the ‘honor’ of the models against papers which suggest aerosol influences have been overstated in the models.

  44. Carrick,
    That paper is too late for WGI AR5; looks like acceptance by March 15, 2013 is required, and that paper was accepted May 20. Of course, there have been exceptions to the deadline rules made before, so it might end up in WGI anyway. IIUC, it would still qualify for WGII references.

  45. SteveF:

    I think there is a broad agreement that warming of the oceans will increase rainfall.

    But that’s a different cause-and-effect argument, one I agree with in general. (Meaning it’s a statement that holds for global, rather than regional, temperature & precipitation.) The issue here is the effect of aerosols on precipitation, not the effect of temperature on precipitation.

    I think the problem is there isn’t broad agreement about how big an effect it is, nor do I think the GCMs have enough fidelity to accurately model this sort of effect.

    I think the real problem with the study is that it is relying on model output, where the model output isn’t useful and the models have been partially tuned with a particular temperature and aerosol history to reproduce observed patterns in the data, to try and infer the historical aerosol forcings.

    I’d expect you to recover the original assumptions used in the model in this case, rather than anything new.

  46. Carrick,

    I’d expect you to recover the original assumptions used in the model in this case, rather than anything new.

    On that we agree. It is what kludges do! 😮

  47. Re: Carrick (Jul 7 11:07),

    I was objecting to your word choice “would” in “Cooling from any cause would lead to less total rainfall”. I don’t think the processes are well enough understood to use “would” here in either direction.

    If you do a simplistic radiative balance calculation, cooling would decrease total precipitation because DLR decreases more slowly than OLR at the surface as the temperature decreases. The converse is also true. A smaller difference implies less convective heat transfer. So unless the decrease in convection is all in the sensible heat transfer (highly unlikely), a smaller difference means less latent heat transfer, which is equivalent to less precipitation by definition.

    MODTRAN, US 1976 Standard Atmosphere, clear sky, 0 km:

    Ts     Up       Down     Δ (W/m²)
    288.2  360.472  258.673  101.799
    287.2  356.076  255.722  100.354

    The rate of change of global precipitation with temperature is another thing that distinguishes low climate sensitivity models from high sensitivity models.

    Edit: Forget that. I was using constant water vapor pressure rather than constant RH. There’s very little difference with constant RH and it’s in the other direction, from 100.354 to 102.203 W/m²

  48. DeWitt, to be clear, we all (I think) agree that precipitation tracks with temperature. The question here isn’t the effect of temperature on precipitation, but of aerosols on precipitation (and temperature).

    In other words, precipitation P is a function P(T,A), where “A” is aerosol concentration, and it’s extremely probable that P(T,A) = P0, where P0 is a fixed rate of precipitation, is satisfied by multiple values of temperature and aerosol concentration. (I was going to use the variables instead of their meanings but it turns out to not be safe for work.)

  49. Re: Carrick (Jul 7 15:47),

    Yes, but how strong is the effect of A compared to T? My impression is that evidence from cloud seeding experiments shows that it isn’t strong except under very special circumstances. The Chinese tried to use cloud seeding to remove pollution during the Beijing Olympics in 2008 and to prevent rain on the opening and closing ceremonies. Whether it worked or not is disputed.

  50. Carrick, DeWitt,
    I think it is interesting that Stephen Schwartz (aerosol specialist) seems a leading advocate for lowering aerosol offsets. His efforts to estimate climate sensitivity independently of models using autocorrelation of temperatures (he concluded low equilibrium sensitivity) led to a very hostile response from different modeling groups. No fool he, Schwartz understands that large aerosol offsets are all that stand between the models and armageddon…. er, make that between the models and a required change in cloud parameterizations… and substantially lower sensitivity estimates (AKA armageddon).

  51. SteveF (Comment #117574)
    July 5th, 2013 at 8:04 pm
    Richard LH,
    OK So use any filter function you think is suitable. Please, don’t hold back, show us the way.

    DeWitt Payne (Comment #117575)
    July 5th, 2013 at 10:55 pm
    RichardLH,

    Or to put it another way: Teach your Grandmother to suck eggs.

    Ok. Well I’ll do my best 🙂

    I have been discussing my suggestions with Tamino on his site, and this is my summary of my position, as best I can state it, to date.

    If you were to take an ideal model of a constant input power source and apply it to the mixed reflector/absorption surface of a rotating sphere:
    To measure the temperature time response (from the ‘daily’ input rotation/periodic function) in such a model, it is easy to split it into time bands/frequencies in the natural temperature response.
    From daily, through yearly, to the 1461-day true ‘solar year’.
    If you wish, use FT to spit out any bands that contain energy (or the circuit above).
    Describe both the input RMS and the bandpass output RMS for each period/frequency band. Note that an ‘end-around’ sum can be performed to check validity and show any remaining RMS that is yet to be assigned.
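    For what it’s worth, a small R sketch of the cascaded low-pass idea as I read it, using simple centered running means as the low-pass stages and reporting the RMS in each band (the window lengths here are arbitrary; this is an illustration, not the poster’s own circuit):

      # Split a monthly series into bands by differencing successive running means,
      # then report the RMS in each band plus the low-frequency remainder.
      band_rms <- function(x, windows = c(12, 36, 120)) {
        lowpass <- function(x, w) as.numeric(stats::filter(x, rep(1 / w, w), sides = 2))
        stages  <- c(list(x), lapply(windows, function(w) lowpass(x, w)))
        bands   <- Map(`-`, stages[-length(stages)], stages[-1])   # band = stage_i - stage_{i+1}
        names(bands) <- paste(c(1, windows[-length(windows)]), "-", windows, "mo")
        c(sapply(bands, function(b) sqrt(mean(b^2, na.rm = TRUE))),
          residual = sqrt(mean(stages[[length(stages)]]^2, na.rm = TRUE)))
      }

      set.seed(7)
      x <- as.numeric(arima.sim(list(ar = 0.8), 1200)) * 0.1   # toy monthly series
      band_rms(x)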

  52. And a suggestion that FT may not be what you are looking for.

    RichardLH | July 8, 2013 at 10:21 am | Reply This is also logical. Pure tones are mostly found in rigid or organised structures. Fluid or gas structures in nature are much more chaotically dominated than pure tones.
    Easy to get a set of nice tones out of a taut string from branch to ground in a breeze. Less easy to see if the lower end is cut.

  53. And if no-one else out there sees a ‘matching’ function in the surface response……

  54. I mean, how else would you model an irregular natural waveguide when fed with a sine wave?

  55. I’m getting bored with cross-posting. I’ll just add this and go and watch TV.

    RichardLH | July 8, 2013 at 7:11 pm | Reply Your comment is awaiting moderation.

    KR | July 8, 2013 at 2:51 pm | Reply

    “Digital frequency filtering is just that – a method of amplifying or reducing particular frequencies.”

    Actually this is digital bandpass filtering. All of the energy will be correctly assigned into a ‘bin’ which is a frequency/period band.

    Think of it like histograms for energy in time.

  56. Re: RichardLH (Jul 8 13:13),

    You’re bored!

    You are aware that any filter function can be digitally implemented on the FT of a signal, aren’t you? It’s pretty much the basis for digital signal processing. And the power spectral density of a signal is not limited to single frequencies.

    As Carrick might say: You don’t grok frequency domain analysis.

  57. Carrick,

    In other words, precipitation P is a function P(T,A) where “A” is aerosol concentration, and it’s extremely probably that P(T,A) = P0, where P0 is a fixed rate of precipitation, is satisfied by multiple values of temperature and aerosol concentrations.

    Sure, but aerosols in general cool (direct and indirect effects), and automatically reduce temperature, so P(T, A) = P0 implies that increasing aerosols increase precipitation more than the expected reduction in rainfall due to cooling effects. I think that strains credulity. The only way to connect increasing aerosols to increases in rainfall is via influence on cloud nuclei, and as I noted above, connecting smaller cloud droplet size with increased rainfall is not simple. I have not seen any evidence that is really the case.

  58. DeWitt Payne (Comment #117602)
    July 8th, 2013 at 2:11 pm
    Re: RichardLH (Jul 8 13:13),

    “You’re bored!

    You are aware that any filter function can be digitally implemented on the FT of a signal, aren’t you? It’s pretty much the basis for digital signal processing. And the power spectral density of a signal is not limited to single frequencies.

    As Carrick might say: You don’t grok frequency domain analysis.”

    I think that you may need to study more on demodulator and noise-reduction circuits in analogue electronics (and the well-known record-length limitations of the FT).

    You know, the sort of things that power most of the instruments that you collect data with in the first place.

    Then turn it to digital as described and use it on the end of the data as well as the start. That’s all.

    Just give it a think for a second and reply.

    If you were to take an ideal model of a constant input power source and apply it to the mixed reflector/absorption surface of a rotating sphere:
    To measure the temperature time response (from the ‘daily’ input rotation/periodic function) in such a model, it is easy to split it into time bands/frequencies in the natural response.
    From daily, through yearly, to the 1461-day true ‘solar year’.
    If you wish, use FT to spit out any bands that contain energy (or the circuit above).
    Describe both the input and bandpass outputs. Note that an ‘end-around’ sum can be performed to check validity and show any remaining RMS that is yet to be assigned.

  60. Oh, and whilst you are at it, walk down the corridor to the RF lab and ask about ‘waveguide matching functions’ and ‘acceptance/reflections of RF in irregular, dynamic surfaces in the natural world’ from a constant, high-power source.

    Then do a BIG moving average instead, because all that other maths is way too complicated. Just like the circuit does and as I have described.

  61. Using FT is like trying to work out the frequencies in a set of ropes hanging from the branches of a tree and finding that you can only easily ‘see’ the ones that have their tips caught on the ground.

    Gives nice sine waves then. Otherwise, just too much organised chaos.

    The film is short though, could really do with a longer recording to work it all out properly. 🙂

  62. As a continuation of my comparison of the CMIP5 model temperature series with the three major Observed series, I did some calculations using R functions for spectral analyses. I used the mean monthly global temperature series from KNMI, and, for this particular analysis, I used the historical part of the RCP45 scenario series.

    In order to avoid complications in the analysis due to the trends in these series I divided the series into 2 parts from 1916-1970 and 1971-2013 May, did a time regression on those parts separately and then recombined the residuals to obtain a residual series for the entire 1916-2013 May period for the 3 Observed and 106 RCP45 series. I could have used a breakpoint analysis and obtained more linear segments for time regression and recombination of the residuals, but at some point I think one has to decide how much of the cyclical nature of the series would be obscured by that approach. Anyway I merely wanted to obtain an initial look at the spectral comparisons of the Observed and modeled temperature series here.

    I did a three-part analysis of the residual series. In the first part I smoothed the residual series using the loess function in R with span=0.3. The 3 Observed and 9 typical modeled smoothed series are shown in the first link below. Obviously the smooth removes most of the higher-frequency structure and leaves the lower-frequency spectral information. It can be seen that the Observed series tend to show lower-frequency cycles while the models show more middle-frequency cycles.
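    A toy R sketch of that detrend-and-recombine step followed by the loess smooth (made-up monthly series, not the KNMI data; the seewave-based spectral steps are not reproduced here):

      set.seed(8)
      time <- seq(1916, 2013 + 4/12, by = 1/12)                 # toy monthly axis
      x    <- 0.003 * (pmin(time, 1971) - 1916) + 0.015 * pmax(time - 1971, 0) +
              as.numeric(arima.sim(list(ar = 0.6), length(time))) * 0.05

      part <- time >= 1971                                      # break at 1970/71
      res  <- c(residuals(lm(x[!part] ~ time[!part])),          # detrend each part,
                residuals(lm(x[part]  ~ time[part])))           # then recombine residuals

      low_freq <- predict(loess(res ~ time, span = 0.3))        # loess smooth, span = 0.3
      plot(time, res, type = "l", col = "grey")
      lines(time, low_freq, lwd = 2)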

    The second part of the analysis involved using the spec function from R library(seewave) to obtain the relative amplitude of the frequency content of the residual series. I converted the frequency to periods in months and show the ranked 5 highest amplitude periods for the Observed and modeled residuals in the second link below. The table in that link shows that the observed series are dominated by longer periods (lower frequencies) while the modeled series have higher rankings for shorter periods (higher frequencies).

    In the third part of the analysis, I used the R function simspec from library(seewave) to estimate the similarities between the Observed GHCN residual series and the Observed HadCRU4, GISS and 106 RCP45 residual series. The results are shown in the table in the third link below. The table shows that, while there are differences amongst the Observed series, there are larger differences between the Observed and modeled series – as would be expected given the results of the second part of the analysis.

    http://imageshack.us/a/img812/256/o4m1.png

    http://imageshack.us/a/img14/8186/wd9g.png

    http://imageshack.us/a/img692/1382/z4pc.png

  63. Ok. So I know you all think I am a prat. Sorry about that.

    It has been so much fun sparring with Tamino and Eli. Bit unfair really. Like shooting fish in a barrel now, I think.

    But hey, this is just one man’s viewpoint. Maybe I am just seeing patterns in the mist.

    I AM seeing patterns in the data as well though. Patterns that appear to relate to the orbital patterns of the three main bodies involved.

    This may all be down to Gravity and orbit. Nothing more.

    It cannot, just cannot be a co-incidence. When you see four points in a line it MUST mean something. Pure chance is low on the list. 3 maybe, 4 unlikely. Especially when they relate to orbits.

    And it just comes from playing with 1.3371… and its integer series.

    When you digitally sample data, you must remember that you have changed to that world.

    As it happens we did not choose the sampling period. The orbit does that. We HAVE to honour that.

    Hence the 1,2,3,5,7,9,12…..

    For that sequence to have the patterns in it the relative orbital positions of the Sun, Earth, Moon with

    1,28,12,37 and 48/9 being there. That’s beyond chance.

    So having honed the knife over there, I give you my position to date. Please feel free to comment (or just say ‘Go Away’).

    It’s been fun playing.

  64. Last posts at Tamino’s:

    RichardLH | July 9, 2013 at 6:44 pm | Reply Eli Rabett | July 9, 2013 at 5:19 pm | Reply

    So there is just pure Numerology in the sequence, 1,2,3,4,5,7,9,12,… then?

    No bearing on Climate at all?

    RichardLH | July 9, 2013 at 6:46 pm | Reply [Response: What the hell is this about? Are you accusing me — or anyone else — of partiality? You are out of line.]

    I do hope NOT.

    This was no attempt to call any names. That would be very wrong and improper. I do not believe I have ever done so.

    I was just attempting to make this a ‘double blind’ test.

    That sort of impartiality, only.

    tamino | July 9, 2013 at 7:31 pm | Reply To RichardLH: you have had plenty of opportunity to express your opinion. It’s time for you to find another outlet.

    To others: please do not argue with him, it’s pointless.

    RichardLH | July 9, 2013 at 7:58 pm | Reply Your comment is awaiting moderation.

    Ok. I did not come here for argument. I came for discussion. I leave, for now anyway, with this thought.

    It looks like the integer digital series based on the multiplier 1.3371…, when used on digitally sampled data (such as those in climate temperature research), contains as part of that series all the major frequencies, and their sub-harmonics, of the masses and rotations of the three major bodies involved.

    Then this apparently means that Climate = round(1.3371…), with the 48/49-month mismatch to the integer main cycle being the inevitable output of nature’s usual trick of mixing 1/2 cycles (both plus and minus) in groupings that reflect longer, non-integer cycles to disperse power out to ‘zero’/‘ground’.

    That half cycle mix (rather than pure single cycle tones) makes examining the data we have very difficult.

    Just my position at present.

    I thank you for allowing me to think, and thus present it correctly.

Comments are closed.