I continue to be rather interested in the possible ways one might create a reconstruction of historic temperatures based on a time series of temperatures T measured over a finite calibration period of N years and a finite number of proxies M that are thought to respond to temperature. There appears to be an assortment of methods, which vary in at least the following ways:
1) The method of regressing proxy response against temperature during the calibration period,
2) the method of selecting proxies in the first place,
3) the method of down-selecting proxies from an initial selection,
4) the method of calculating the reconstructed historic temperature based on the down-selected proxies.
We've flung around a bit of math to discuss how various methods might be biased, but I finally decided just to create a script that will permit me to compare results of different "methods of calculating the reconstructed historic temperature based on the down-selected proxies" (i.e. 4) and see how the bias and noisiness of each method can be affected by the method of "down-selecting proxies from an initial selection" (i.e. 3).
I think the examples I've chosen are illustrative– but readers should be cautioned that these were chosen to highlight biases and noisiness that might occur, along with the differences between the various methods. Figuring out how the potential biases might affect an honest-to-goodness reconstruction would require numerical experiments. (And of course any estimates of potential errors in a reconstruction would be based on the assumption that proxies really do respond to temperature, that the temperature response persists in the historic period, and so on.)
With this in mind, I’m going to give a cursory description of the methods.
1) The method of regressing proxy response against temperature during the calibration period
All methods will use a forward equation fitting some proxy value P to some temperature, T. Generically, it will be assumed that the expected value of a proxy variable "P" at location 'i' is a linear function of the temperature anomaly. That is:
$latex E[P_{i}] = \lambda_{i} T $
where $latex \lambda_{i} $ depends on proxy 'i' and T is some temperature.
In "the litrachure" some fit proxy values to a regional value (i.e. Northern Hemisphere temperature or Southern Hemisphere temperature); others fit proxy values to an estimate of the temperature near the proxy itself. This choice can affect potential bias and noisiness, so to explore this, my synthetic experiments will use two types of proxies called "global" and "local".
M “Global” proxies $latex P_{G,i} $ will be generated from a ‘known’ global temperature baselined to the calibration period (which is the item we really wish to reconstruct) using an equation of this form:
where $latex T_{G} $ represents the 'global' temperature, which will have an imposed known value, $latex w_{G,i} $ is Gaussian white noise with standard deviation 1, and $latex s_T $ is the sample standard deviation of the temperature over the calibration period. In the current exercise, our goal is to examine how well different methods reproduce this global temperature outside the calibration period, so this temperature will also be referred to as the "Target" temperature.
In the current synthetic experiments all global proxies will share the same value, $latex \lambda_{G,i} = \lambda_{G} = 0.25 $, over the calibration period of N years.
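Since the generating equation and the linked script aren't reproduced here, the following is a minimal sketch, in Python with hypothetical names, of one way to produce the global proxies as described. The flat-then-ramp target (described further below) and the scaling of the unit-s.d. noise by $latex s_T $ are assumptions, not necessarily what the actual script does.

```python
import numpy as np

rng = np.random.default_rng(0)

N_rec, N_cal = 80, 80        # reconstruction and calibration lengths (years)
M = 5000                     # number of proxies
lam_G = 0.25                 # shared responsiveness of the "global" proxies

# Target (global) temperature: constant for 80 years, then an 80-year ramp
# (height 1 is arbitrary), baselined to the calibration period.
shape = np.concatenate([np.zeros(N_rec), np.linspace(0.0, 1.0, N_cal)])
T_G = shape - shape[-N_cal:].mean()

# s_T: sample standard deviation of the target over the calibration period.
s_T = T_G[-N_cal:].std(ddof=1)

# Each row is one "global" proxy: responsiveness times the target plus
# Gaussian white noise of s.d. 1, scaled by s_T (that scaling is assumed).
w_G = rng.standard_normal((M, N_rec + N_cal))
P_G = lam_G * T_G + s_T * w_G
```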
The second type of proxy will be generated based on “local” temperatures using a very similar equation.
where $latex T_{L,i} $ represents the 'local' temperature at proxy location 'i' and $latex w_{L,i} $ is Gaussian white noise with standard deviation 1. (Note: the sample standard deviation of these proxies will equal the sample standard deviation of the local temperature. However, the s.d. at proxy 'i' will not match that at proxy 'j'. This will be important later. I would also suggest that this is the right way to weight for what I call "method 1" later on. An alternate way that I think may sometimes be done would be totally c*appy.)
The local temperature $latex T_{L,i} $ will be generated such that, on average, the correlation between the local temperature $latex T_{L,i} $ and the global (or target) temperature $latex T_{G} $ is 0.7. (Note: 0.25/0.7 = 0.357.) However, each local temperature will also differ from the global (or target) temperature. I think it's useful to clarify this with this graph showing the synthetic temperatures that will be used:
The global (i.e. target) temperature is illustrated in red. This temperature will be constant for 80 years and then exhibit a ramp up for 80 years. The expected value of the local temperature at each proxy will have a qualitatively similar shape, but with a different trend during the final 80 years; these are shown with dashed blue lines. The actual temperature at each proxy will be generated by adding noise to the expected value– these actual temperatures are shown in grey. The point of all of this is that the average temperature at proxy 'i', $latex T_{L,i} $, will differ from the average temperature at proxy 'j' when i≠j.
Before proceeding: In case you haven't guessed, the final 80 years will be the calibration period; the first 80 years will be the reconstruction period.
In the current synthetic experiments each proxy, i, will be assigned a unique value of $latex \lambda_{L,i} $. These values will be drawn from a Normal distribution with mean $latex E[\lambda_{L}] = 0.357 $ and standard deviation of $latex \sigma_{L} =0.1$.
Moreover, to show a potential bias in some of the methods of creating proxies, I will assign the $latex \lambda_{L,i} $ such that they are negatively correlated with the average temperature trend at proxy 'i' during the calibration period, with the correlation set to -0.5. What this means is that, in the figure above, the proxy whose temperature corresponds to the lowest dark blue trace is likely to have a lower than average value of $latex \lambda_{L,i} $ while the proxy whose temperature corresponds to the highest blue trace is likely to have the highest value of $latex \lambda_{L,i} $. Language corrected because I forgot the script sets the correlation based on the trend, not the value during the reconstruction period. See blog comments.
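Continuing the sketch above, the local side can be mocked up the same way. The spread of local trends, the local-noise level (a placeholder aimed at an average correlation near 0.7 with the target), and the proxy-noise scaling are all assumptions; the draw of $latex \lambda_{L,i} $ with mean 0.357, s.d. 0.1 and a roughly -0.5 correlation with the local calibration-period trend follows the description above.

```python
# Continues the earlier sketch (M, N_rec, N_cal, rng, T_G already defined).
import numpy as np

# Expected local temperature at each location: same flat-then-ramp shape,
# but each location gets its own ramp slope (the 0.3 spread is a placeholder).
ramp_scale = 1.0 + 0.3 * rng.standard_normal(M)
T_L_expected = np.zeros((M, N_rec + N_cal))
T_L_expected[:, N_rec:] = np.linspace(0.0, 1.0, N_cal) * ramp_scale[:, None]

# Actual local temperature: expected value plus local noise, then baselined
# to the calibration period.  The 0.35 noise level is a placeholder chosen
# so that corr(T_L, T_G) lands near 0.7 on average.
T_L = T_L_expected + 0.35 * rng.standard_normal((M, N_rec + N_cal))
T_L -= T_L[:, N_rec:].mean(axis=1, keepdims=True)

# Calibration-period trend at each location, standardized across locations.
trend = np.polyfit(np.arange(N_cal), T_L[:, N_rec:].T, 1)[0]
z = (trend - trend.mean()) / trend.std(ddof=1)

# lambda_L,i ~ N(0.357, 0.1), built so its correlation with the local
# calibration-period trend comes out near -0.5.
rho = -0.5
lam_L = 0.357 + 0.1 * (rho * z + np.sqrt(1 - rho**2) * rng.standard_normal(M))

# Local proxies: responsiveness times local temperature plus white noise
# scaled by each location's calibration-period s.d. (scaling is assumed).
s_L = T_L[:, N_rec:].std(axis=1, ddof=1)
P_L = lam_L[:, None] * T_L + s_L[:, None] * rng.standard_normal((M, N_rec + N_cal))
```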
Given the distributions of $latex \lambda_{G} $, $latex \lambda_{L,i} $, $latex T_{G} $, $latex T_{L,i} $, $latex P_{G,i} $ and $latex P_{L,i} $ above, I will use three basic methods of estimating the temperature during the first 80 'years' based on calibrating proxies to temperature over the final 80 'years' in the figure above. Each basic method will then be 'tweaked'!
The basic methods are:
- Method I: Compute the reconstructed temperature using $latex T_{rec,I} = \sum P_i/ \sum m_i $ where the sums are over the "M" proxies and $latex m_i $ are the best-fit coefficients for $latex P_i = m_i T_i $, with $latex (P_i,T_i) $ from the calibration period (i.e. the final 80 years). This method can be applied with either local or global proxies fit to the corresponding $latex T_i $. (I'll show global proxies fit to global temperature and local proxies fit to local temperature.)
- Method II: Compute the reconstruction using $latex T_{rec,II} = \sum [ P_i/ m_i ] $. Here, the $latex m_i $ are exactly as above.
- Method III: Compute the reconstruction using $latex T_{rec,III} = \sum [ P_i]/ m_G $ where $latex m_G $ is computed by fitting $latex \sum [ P_i]= m_G T_G $. That is: we compute the sum over all proxies and then find the best-fit trend against the global temperature. (A sketch of all three estimators appears just after this list.)
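To make the three estimators concrete, here is a minimal sketch of Methods I-III as written above (not the linked script): the slopes are fit through the origin because the fit is stated as $latex P_i = m_i T_i $, and Method II is normalized by 1/M so all three land on the temperature scale; both choices are assumptions.

```python
import numpy as np

def fit_slope(P, T):
    """Least-squares slope m in P ≈ m*T (no intercept), over the last axis."""
    return (P * T).sum(axis=-1) / (T * T).sum(axis=-1)

def reconstruct(P, T_fit, T_G, cal):
    """Methods I-III applied to an (M, years) proxy array P.

    T_fit : temperature each proxy is regressed against (shape (years,) for
            a global fit, or (M, years) for local fits)
    T_G   : the target (global) temperature, used by Method III
    cal   : boolean mask selecting the calibration years
    """
    m_i = fit_slope(P[:, cal], np.atleast_2d(T_fit)[:, cal])

    # Method I: sum of proxies divided by sum of fitted slopes.
    T_I = P.sum(axis=0) / m_i.sum()

    # Method II: per-proxy inversion P_i/m_i, averaged (the 1/M is assumed).
    T_II = (P / m_i[:, None]).mean(axis=0)

    # Method III: regress the proxy sum against the target, then invert.
    P_sum = P.sum(axis=0)
    T_III = P_sum / fit_slope(P_sum[cal], T_G[cal])
    return T_I, T_II, T_III

# Usage, with the arrays from the earlier sketches:
# cal = np.arange(N_rec + N_cal) >= N_rec
# recs_global = reconstruct(P_G, T_G, T_G, cal)   # global proxies, global fit
# recs_local  = reconstruct(P_L, T_L, T_G, cal)   # local proxies, local fits
```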
Results!
I'll now show reconstructions using M=5000 proxies. This number is selected to demonstrate what we might hypothetically get if we have a fairly short calibration period (~80 years) but could somehow– quite miraculously– obtain many, many proxies. (I'd use even more but I don't want to take 10 minutes to generate a graph!) The reality is we are stuck with very few proxies. But for now, my goal is merely to see whether the results of a method would converge on the correct value if only we could get enough decent proxies. (Bo will want me to look at fewer proxies and I can do that afterwards.)
I am going to let most images speak for themselves. (I’ll admit some of the later ones will require some discussion– but I want to let people look at them and interpret them for themselves first.)
Fit global proxies to global temperatures
Reconstructions using Methods I-III with all trends obtained using fits to global temperature (i.e. the Target) and with no proxies removed for low correlation:
Reconstructions using Methods I-III with trends obtained fitting global proxies to global temperature and eliminating those proxies whose "R" values were not significant at the 95% level. (Note this eliminates roughly 40% of proxies.)
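For reference, here is one way to implement the significance screen described here; whether the actual script uses a one- or two-sided test isn't stated, so scipy's two-sided test on the calibration-period correlation is used below as a stand-in.

```python
import numpy as np
from scipy.stats import pearsonr

def screen(P, T, cal, alpha=0.05):
    """Boolean mask keeping proxies whose calibration-period correlation
    with T is significant at the 95% level (two-sided test assumed)."""
    keep = np.empty(P.shape[0], dtype=bool)
    for i in range(P.shape[0]):
        _, p = pearsonr(P[i, cal], T[cal])
        keep[i] = p < alpha
    return keep

# keep = screen(P_G, T_G, cal)
# recs_screened = reconstruct(P_G[keep], T_G, T_G, cal)
```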

(Note: the rms and average difference in the legend are based on comparison between the reconstruction and the target during the reconstruction period only.)
Fit to Local Temperatures
Reconstructions using Methods I-III with trends obtained fitting local proxies to local temperature:

(Note: I think method II is noisier than previously because now the individual proxies display a range of responsiveness to temperature.)
There is a lot going on in the graph above. But I’m going to yank my fingers away from the keyboard and let people have fun looking at it and explaining the offsetting biases. That will help people understand just how pesky picking the best method to create the reconstruction might be!
The code: HockeyStick_Lines_July9



In the prior toy problems, Lucia gave us "actual temperatures" during the reconstruction periods that had a sinusoidal shape. Weird, but that usefully highlighted the effect that many methods had of attenuating (dampening) the signal, such that reconstructions made the past look more placid than it actually was.
Most readers (not all) thought that it was very useful to explore this defect of reconstruction methods that are commonly used in the actual, non-toy literature.
The current spherical cow has a flat “actual temperature” during the entire pre-instrumental period. Thus, we will have a harder time seeing which of the methods give misleadingly flattened profiles.
Amac
Yes.
Oddly, with the sinusoids, someone said they always thought of "bias" as affecting the net change over the period. I tried to explain that if you grok the math, you'll know that the attenuation of the sinusoid meant that the net change from past to current would be flattened!
If you grok the math, the sinusoids and the current graphs communicate exactly the same thing: Temperatures far from zero are either compressed toward or expanded away from zero relative to what was "real".
The advantage of the current trace is you can easily see which problems could potentially be fixed by temporal smoothing and which could not.
So for example, on this one
All three colored traces lie above -100, which is the "real past". That means in all three cases, the amplitude of a sinusoid would be attenuated. In the "red" case (Method 1) the amplitude would be attenuated 27%, as indicated by "ave dif". That's quite a bit.
I was going to explain some more… but I think I might have a sign error… gotta go hunt! Grrr!
Ahhh! I do need to change the sign on the covariance! The covariance is actually computed between the trend during the calibration and the value of γ (which was not what I intended. I need to fix up some text in the post.)
The cov = -0.1262 in the right hand notes is the (-1)* covariance between the proxy temperatures in the historic period and the 'responsiveness' γ during the calibration period. (This covariance could be non-zero if, for example, all 'instrument/proxies' have similar responsiveness, but you compute different correlations because the temperature did not vary enough to get a good correlation at some proxy locations.)
Because this covariance exists, when the ‘method’ got to the point where we downselected proxies by picking the ones whose trend with temperature is statistically significant, we picked the ones with lower trends. The result is that the method “used” proxies all of which were sitting at locations where the temperature rise was lower than happened on average.
That effect can be seen with the dashed brown line (just between the jittery blue and jittery green lines).
Then the red line is even higher above that because (as Bo pointed out to me) when this correlation exists, method 1 is biased relative to the measured average from the collection on which it is based.
Plus… all three cases are higher than the real trend due to the trend issue discussed before. That is: even without the existence of this correlation, we tended to pick proxies that appear more responsive than they really are (owing to the short time period of calibration).
The situation is sufficiently complicated that when you screen, you can’t necessarily know in advance whether you are going to attenuate or expand variability. (You could estimate which happened based on your data– but you need to be aware that this should be done otherwise you aren’t going to look for the problem and correct it!)
BTW: I can concoct much worse methods by scaling all proxy values to have the same standard deviation during the calibration period. (My impression based on — I think Nick’s– past criticisms of my failure to do the scaling is that such scaling is done. It can lead to disaster. But I don’t know if it’s really done. So I didn’t show those really, really bad graphs! )
.
Can you point out a high-profile, screening-based paper where this isn't done?
I do fear that this post will lead to more requests for the exact definition of 'screening fallacy'!
toto–
I have no idea. As far as I can tell, none looked for these specific issues. But maybe they all looked for the bias that arises due to screening, or the specific method. Since you brought the issue of specific papers up, can you point to one where these specific biases are mentioned and corrected for? Because merely running things with "pseudoproxies" doesn't automatically find these things unless one is looking for them. Algebra works better.
Toto,
Was that an attempt at levity?
SteveF
Of course. Because — as we all know– there is a law of the universe that says if we do not nail down a precise and extremely narrow definition of “the screening fallacy”, we are not allowed to discuss the vast assortment of biases that might be introduced by screening.
Similarly, if we do not define the word fruit down to meaning "organically grown granny smith apples from trees grafted onto the roots of a crab apple tree" we cannot discuss the nutritive qualities of fruit. The only appropriate response to any such attempt to discuss fruit would be "I still haven't heard a definition of the word 'fruit'". 😉
toto (Comment #99191)
> Can you point out a high-profile, screening-based paper where this isn't done?
[“This” being accounting for the artefactual attenuation of the signal in the reconstruction period, that results from the proxy-based reconstruction methods used.]
As far as I know, Mann08 would qualify as a high-profile screening based paper where that wasn’t done.
SteveF (Comment #99192)
> I do fear that this post will lead to more requests for the exact definition of ‘screening fallacy’!
Discussing a completely different subject, Randall Parker notes a catchy phrase to describe what has gone wrong with American intellectuals: the war against pattern recognition.
Amac–
The fact that von Storch and Zorita wrote a paper discussing the existence of the attenuation at all, after some of the high-profile papers were published, and that V&Z's paper's contribution is pointing out something overlooked, would tend to suggest that no paper before V&Z recognized this issue and corrected for it. Whether the papers afterwards recognize the issue and adequately account for it, I don't know. But really, even with von Storch's paper, I'm not sure people truly understand that many things can introduce bias, that screening can do so in a variety of ways, and that you need to specifically look for biases.
Monte Carlo is great. But unless you are looking for bias and run a heck of a lot, bias can hide quite nicely inside 'noise'.
Toto:
Let’s flip this and reverse it, Toto.
How about you name one “high-profile, screening-based paper where this is done”?
(I’m sure it should be easy.)
It would be really nice to look at a plot of n (the number of proxies) vs. the sum of squares in the reconstruction period.
I am a bit shocked how noisy it still is with 5,000 samples.
Doc–
It might be nice, but it would involve coding I’m not going to do. Why it’s so noisy has been discussed on other threads and has to do with R being low. (I have 0.25.) Also, the inherent spread of 0.1 makes things bad too.
Carrick
Christiansen and Lundquist at least try to address the issue of loss of variability and show some methods lose a lot. Their LOC is designed to lose less. (I still need to get the details of the ones they compared LOC to in order to show it. In this post, LOC doesn't look good– but I think that's likely because RegEM and the other one they tested differ in key ways from my methods 1 and 3 above. There are implementations of method 1 that can be really, really, really bad. I didn't show those!)
The fact is though: algebra is very useful to help guide the Monte Carlo runs and to help pick which things to vary in your pseudo-proxies. I overlooked the correlation issue that results in bias in method 1– Bo pointed it out to me after I emailed him.
Lucia,
I think the reason why your method 2 has big noise is that it divides each proxy by m_i, and the range of these includes zero. Screening works well there to raise the range of m away from zero. But that also creates bias.
Methods 1 and 3 had much less noise because the denominators are single averaged values, so dividing by near zero isn't an issue.
When you introduce local temperatures, you are allowing variable sensitivity. It also adds noise. Method 1 performs badly because the m’s in the denominator are for local and not global. Method 3 does well because you do compute the denominator m from regression with target T.
The effect on method 2 is the most interesting. The m values in the denominator make more sense because they are indirectly a regression on target. Two things there:
1. The noise reduces, but not as much (as G). The reason is that selection is less effective, because of the extra noise (between local temp and target).
2. The bias increases, but not as much. This is also related to the diminished effectiveness of selection. But there is another aspect. Your choice of uniform λ_G is the worst case for selection. There’s nothing to gain in terms of getting more sensitive proxies, and the selection is then entirely on noise alignment with temperature, which maximises the bias effect. Variable λ would have less bias whether global or local.
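A tiny numerical illustration of Nick's point about Method II (the numbers are illustrative only, not taken from the post's script): with a short calibration period the fitted slopes scatter around the true value, some land near zero, and their reciprocals become wildly noisy.

```python
import numpy as np

rng = np.random.default_rng(3)

true_m, n_cal, trials = 0.25, 80, 50_000
T = np.linspace(0.0, 1.0, n_cal)
T_c = T - T.mean()
noise_sd = T.std(ddof=1)            # noise comparable to the temperature swing

# Fitted slope (with intercept) for many synthetic proxies at one location.
y = true_m * T + noise_sd * rng.standard_normal((trials, n_cal))
m_hat = ((y - y.mean(axis=1, keepdims=True)) * T_c).sum(axis=1) / (T_c**2).sum()

print("fraction of fitted slopes within 0.05 of zero:",
      (np.abs(m_hat) < 0.05).mean())
print("5th/95th percentiles of 1/m_hat:",
      np.percentile(1.0 / m_hat, [5, 95]))
```

Screening on significance pushes the surviving slopes away from zero, which tames the reciprocals but preferentially keeps the slopes that came out high; that is the bias side of the trade-off described above.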
Nick
Yes. That’s why.
Yes. Method 2 is biased both when screened and when not screened.
Yes. In my synthetic cases. In a real world case, we’d have variable sensitivity using global proxies as well, but I used a single value for global precisely to let people see this. When sensitivity varies across proxies, the bias issues in method 2 get worse.
In which case do you think method 1 performs badly? Do you mean when it’s screened against local? Or unscreened?
The problem with method 1 is not that the proxies are 'local' rather than global. The problem with method 1 arises when I introduce a non-zero correlation between the proxy response γi and the temperature that exists at the location of proxy Pi. To make this problem visually apparent, I set the correlation to 50%– which is fairly high. I don't know how high it would be in real proxies.
Method 3 doesn't suffer from this. Note that when screened, method 3 still has a problem. So screening must be done with care. (I have a "light screening" portion in the code, but I haven't plotted those. I'd generally prefer "light screening" if we really thought there were clunkers in the distribution.)
.
We’ve been there before. All of the recent Mann papers, and apparently the Gergis un-paper that started this series of posts, have some “validation” / test set that is distinct from the “calibration”/screening/training set.
.
This is done precisely to estimate how reliable (or not) the whole process is at estimating the quantity of interest outside the training period. Any of the biases that you have discussed over this series will be apparent there. So will many other possible biases, introduced either by the dataset or the methods. People use test sets precisely to estimate the compound effect of such biases.
.
I take it you don’t like the way they do their validation? That’s fine, many people don’t. But it’s still quite different from suggesting that they didn’t do any!
toto
'some "validation" / test set that is distinct from the "calibration"/screening/training set.'
Is not the same as testing for these specific things. If you want to claim those papers did things appropriately, you should explain precisely what they did and why you think that achieves the goal of checking this.
Explain precisely what feature of the validations would reveal these biases.
It is simply not true that running monte carlo on pseudo proxies will reveal everything. It can only do so if the pseudo-proxies and the tests were designed in a way that could possibly reveal the issue.
Why do you assume that? I've said nothing specific about them. I prefer to look at the general issue and then move to specifics. But if you are going to claim these issues have been addressed, then show how and why you think they have.
I didn't say they did nothing. I have no idea why you want to try to put such words in my mouth. But if you have some insight showing that some paper somewhere has specifically addressed these potential things, tell us. You're the one who is trying to connect this to specific papers– not me. So if you are going to make a claim, back it up. Don't sit here whining that I am not bringing forward evidence to support claims I've neither made nor 'insinuated'!
toto (Comment #99223) —
> All of the recent Mann papers… have some "validation" / test set that is distinct from the "calibration"/screening/training set.
toto, are you including Mann08 and Mann09 in “all the recent Mann papers”?
I agree that they include test sets and discuss validation, using data that are distinct from the calibration/screening/training set.
However, that wasn’t the question you posed upthread at #99191:
> Can you point out a high-profile, screening-based paper where this isn't done?
The antecedent for this is in Lucia’s #99189:
So, “this” refers to “attenuate or expand variability” — attenuate, for the most part. That’s what the toy problems of the past few threads have been exploring.
To the best of my recollection, the authors of Mann08 and Mann09 display no awareness that attenuation of signal in the reconstruction period is a possible or likely pitfall of their methods.
If you disagree, could you steer me to the place in the paper or SI where they tackle this issue?
Lucia:
.
There seems to be a misunderstanding. The “test set” I’m talking about has nothing to do with the monte carlo experiments on pseudo-proxies (these are a nice complement, but it’s something else entirely).
.
It consists in looking at how a reconstruction screened/calibrated/trained using period P1 performs at matching the actual record on period P2.
.
Which is exactly what you do on the graphs for this series of posts!
.
.
Because if these or other biases occur, as they are likely to do for any method that weights different proxies differently, the reconstruction will differ from the original on the validation/test period more than it does on the training/calibration period. This is expected to occur to some degree, but the question is: how much? This can be estimated by looking at the difference between reconstruction and record in the test period, or even better, by comparing with a red-noise-proxy reconstruction.
.
As I said, all of Mann’s recent papers do this. So, I’m told, did the Gergis paper.
Amac
.
1) the specific problem of attenuation is I think what they call “loss of variance”, following established usage.
.
2) The whole point of the machine learning approach is that, as soon as your method starts weighting inputs differently, things can go wrong in unpredictable ways. You can lose variance. You can also exaggerate it. You can also generate complete garbage that just happens to have the same variance as the original. As usual in engineering, things can go wrong in more ways than you can think of!
.
E.g. imagine that my method is simply, “take the proxy that ends up with the highest slope upwards/downwards in the past, and expand it infinitely”. This is a well-specified algorithm. It also has close to zero information about actual temperatures outside the training period. But its total variance will probably be largely above that of the real temperatures!
.
By using a test set separate from your training set, you do not just catch attenuations or amplifications of variance – you catch these, but also all the ways in which your method actually diverges from the original signal once you leave the dataset on which you trained it. Which, if you think of it, is what you are really interested in, rather than the effects of one specific potential bias.
.
Of course this does not solve all problems. For one, the test set must be fully separate and actually challenging (I understand some people object to Mann’s version on these grounds). Also, you are still dependent on some assumption of uniformity of proxy behaviour in the distant past – as is any reconstruction method. If this is violated, then testing on part of the historical record may not inform you about your reconstruction’s behaviour in the distant past. But IIUC this is not what Lucia is talking about in these posts.
toto–
Go get the text and graphs for what you claim was done. Because we’ve discussed this before and as I recall, the examples you brought up did not do exactly what I have done. I’m looking at the Gergis paper and I don’t see any figure showing how a reconstruction obtained using period 1 performed at matching period 2.
If you know it exists in some paper, presumably you can find it and present it. I’ve already said I’m looking at hypotheticals first. I’ve skimmed papers and I’m going to look at them in details afterwards. Meanwhile, I’ve made no claims about what is or is not specifically in them.
If it is important to you to let the world know what's in them this instant and you have this information and want to show it to me now, do so. But so far, you haven't presented anything. You are just telling me that something you think you recall is probably "exactly" the correct test. I'm not changing the order of my approach based on that. But if you want to do the work to hunt down that which you believe exists and bring it to my attention, go ahead and do it. Please bring figures, text etc.
BTW: I have Gergis’s paper. I see some words about calibration and verification, but I don’t see the specific comparisons that are comparable to what is in my graphs. But maybe I’m missing something. I’m only skimming — and I plan to continue skimming for now. Because my posts aren’t about “what did Gergis do” yet. I prefer to work on synthetic data first and look at what papers did later. I’ve said this — and believe it or not, that’s the order I prefer.
toto-
Just bring by the figures, tables and text where a test was done instead of just telling us you have a fuzzy recollection it was done somewhere. Surely you can quote, take screen shots and compose paragraphs just like everyone else here rather than trying to assign wild goose chases to others.
toto —
If I’m reading you right, you are saying that a two-part screening/validation strategy as used in Mann08 is sufficient to deal with attenuation problems during the reconstruction period. Is that right?
Lucia:
.
Sure. Mann et al. 2008, section “Validation exercises” in “Results”. Figures 2, S2, S3, supplementary Excel files.
.
And just for the fun of it – Mann Bradley Hughes 1998, section “Verification” in “Methods”, and Supplementary section “Results of Calibration/Verification Exercises”.
.
For Gergis, as I said, I am relying on other people’s telling me that “in each [reconstruction] there is a calibration period and a verification period”. If you’re saying that’s wrong, then my questions are answered, but apparently you’re not:
.
.
Fair enough, but then it seems a bit confusing (to me at least) to start off this series with Josh’s cartoon of scientists actually sieving for hockey sticks?
.
Amac:
.
That's a bit ambiguous – if the attenuation is caused by non-uniform proxy behaviour in the distant past, then testing on the recent record cannot catch that. But as for "pure" screening effects, yes, it should deal with that – and a host of other problems as well (more precisely, it allows you to estimate these effects). Some people have problems with the particulars of Mann's implementation, but that's another matter.
Toto–
Let's step through Mann. As I said before: I have made no claims about what is done in this (or most) papers. But you are representing them as having done 'exactly' what is required– and on skimming I don't necessarily see that. Of course, I'm skimming– but just looking at the figures is not enough to know whether what they did amounts to a real check. So, I'm going to make you do work by asking you questions– and they will be presented little by little.
First, I’m trying to understand precisely how they screened.
In the section on "Data" I read:
This at least seems to suggest that the screening is done on the 1850-1995 interval. I clicked to get the supplemental material. It read:
Once again, this suggests the screening is done over the full 1850-1995 calibration interval.
Am I mistaken about this? Was screening over some other interval? After I know the answer to this, I might be able to go further.
Toto–
Gergis screens. The “screening fallacy” is generally, the failure to recognize what might go wrong if one “screens”. Gergis happened to make an error when implementing screening and that’s how discussion of screening got revived on blogs.
I didn't plan a "series". My posts were triggered by things commenters claimed in the discussions of Josh's cartoon. So, it's not as if I had a "series" planned and picked the cartoon to set some sort of theme. This isn't a high-priced organized PBS series– it's a blog. That means things aren't really planned or mapped out in advance.
Anyway, let me know if I have mis-identified the time periods that Mann used for screening. With respect to the topic of my blog posts, this feature ‘matters’.
toto:
Yes we have been there before and you’re still stuck on zero.
Mann’s papers don’t correct for screening bias. Neither did Gergis.
This was demonstrated very nicely by von Storch (in the case of Mann), though apparently you’d rather take Mann’s word for it than go through the math to figure out why he’s wrong.
toto:
See comment about zero and being stuck on it.
The uncentered PC method used by MBH screens for hockey sticks. It’s another one of those points that is quite easy to understand if you would bother with sitting down and going through the math.
[Of course it “doesn’t matter.” No mistake by Mann ever matters. That’s a tautology.]
toto:
As is well known, this verification was done incorrectly, adverse results involving correlation were withheld, and in any case is largely irrelevant to the question of screening bias. It’s not only a strawman, it’s a “caught on fire, fell over, then sank into the swamp” strawman.
It's telling that you choose one of the worst-quality academic papers of the entire series of paleoclimate reconstructions as your exemplar of how it "gets done right."
Carrick–
I'm just going to start with the first one he thinks has done something that I should know about and try to ask questions to figure out what he thinks was done. I have no idea what toto thinks was done in "validation" or "verification" and how it interacts with what is in my blog post. Maybe he has recognized something real, but getting him to explain what he thinks was done in his own words is like pulling teeth.
So…. first, I want him to tell me how he thinks things were screened. Then we can move on to what he thinks was done in "validation" and "verification". Meanwhile, I'm reading the paper to try to figure out what, precisely, is even plotted in "Figure 2a". I know the black line is CRU. But what's the blue line? The red? The orange? There is no legend. Am I supposed to guess or hunt down the legend in the excel spreadsheets?
Lucia:
For the test/validation, screening is done only over the (separate) calibration period(s):
Also note this bit:
Again, people argue about whether Mann’s particular implementation of the training/testing procedure is valid, but to me this suggests that scientists doing multi-proxy reconstruction are very much aware of the possible biases that screening might induce.
Yes, but the question is: does she also screen using the validation period when she does the validation (i.e. is she using her training set as a test set)?
I guess I should stop relying on hearsay and check by myself, so if anyone has a link to Gergis’ preprint, I’d be grateful!
Carrick: The “fun of it” bit was at least in part thinking about you. 🙂
toto:
LOL. Yep.
toto (Comment #99367) —
I think it’s a brave choice on your part to nominate Mann08 for this exercise; I’m very interested in looking over your shoulder as you and Lucia work through the math. The paper is defective and slipshod in many ways, but that doesn’t mean — at all — that the authors failed to account for attenuation in the recon period. If they did, that should be recognized and applauded. It would go a long way towards making up for those other deficiencies, IMO.
.
> Again, people argue about whether Mann’s particular implementation of the training/testing procedure is valid
Well, I don’t want to “argue about it” — I want to discuss it, so as to understand what Mann et al did right. And I hope we can all agree to focus on the paper’s handling of the attenuation problem with its screening/validation strategy. Rather than rehashing lakebedseds and the rest of that laundry list.
.
> to me this suggests that scientists doing multi-proxy reconstruction are very much aware of the possible biases that screening might induce.
Well, I don’t see it that way. The authors’ awareness is of as little interest to me at this point as how much they tip when they eat at Applebee’s. I’m keen to know what they did, and whether it satisfies the concerns raised in these threads with respect to attenuation of signal during the recon period.
toto–
1) The highlighted periods comparing data to the calibration and hold-out intervals in the various figure 2 panels are 1850-1900 and 1950-something, not
"the shorter calibration intervals (1896–1995 and 1850–1949)".
2) (1896–1995 and 1850–1949) overlap substantially.
3) The text says "based on both the early (1850–1949) calibration/late (1950–1995)"
So, could you hunt down precisely what the heck was done here? It actually matters which lines were created with which sort of screening. This may all be very clear to you. Maybe you know from the excel spreadsheets. But if so, tell me what you think was done.
The “also note” bit about an argument is irrelevant to our discussion.
Look– I think it's better to stick with one paper at a time. Right now, it looks like there is some bizarre overlapping, spacing and mismatching of calibration and screening periods. Unless you can explain what was actually done, I can't see that this verification/validation test did much to detect any problems that could be introduced by screening.
It would be nice if we could figure out if the first paper you present as having done something useful did it. I’d rather look at this before moving on.
toto:
I don’t really think many people argue about the validity of the MBH implementation. I think it is widely recognized as flawed. I agree they understood the importance of verification testing, let’s just say it’s a bit puzzling they didn’t report on correlation as a verification statistic [although Gergis does do so, appropriately], and of course they screwed up the alternative test they used.
Beyond that, it's pretty clear they didn't understand the effect of screening on bias; it wasn't until von Storch 2004 that it was generally recognized as a problem (he refers to it as "loss of variance", but it's the same general phenomenon).
[Also the issue of loss of variance was not addressed in Mann 2008 either, as was pointed out by von Storch. I would say, generally there is no way to use correlation as a method of weighting without the introduction of bias, unless you know the answer to start with, or make a very very lucky guess.]
Amac
The math? I'm trying to figure out the structure of what was done. Since toto says he knows, I'm just going to ask him rather than reading a paragraph with an ambiguous discussion with a link to supplemental materials that make me still wonder what was done.
I'm hoping toto can explain precisely what was done to validate and verify and which lines were which on the graphs with no legends. I know maybe I could find the stuff. But he says he's going by these particular figures, so I figure it's easier just to ask him.
lucia, I believe I can answer your confusion about what was done. Mann did his screening by using the correlation of proxies over the 1850-1995 period.
He then sought to “verify” his screened reconstruction by doing screening on two other periods, one early, one late. For the former, he used the 1850-1949 period. For the latter, he used the 1896–1995 period. These two overlap with each other, but that’s “okay” since they’re designed to test the effect of screening over the full 1850-1995 period, not each other. In other words, look at each test as a separate comparison of the full reconstruction to a reconstruction made using a hold-out period.
As for what Figure 2 shows, the legend actually is adequate if you look closely enough. The lines seen in each graph are different reconstructions, each starting a different period in time (so that they could make use of different proxy sets). You can match the colors of the lines to the color bar to tell what starting date they represent.
Edit: If I wasn’t clear about something, or you need more information about whatever, feel free to ask. I’ve spent quite a bit of time on this topic, and I can probably answer most questions.
Now then, for my personal gratification, I'll discuss an aspect of screening I find amazing. It's somewhat off-topic, but bears mentioning.
484 proxies passed Mann’s screening. This is 40% of the 1209 proxies used. Mann claims only ~13% would pass by chance (a false claim, but I’ll get back to it).
True temperature proxies should pass screening not only over the whole interval but also over shorter intervals. That’s not the case. Of those 484 proxies, only 342 pass both subperiods used for validation testing (1850-1949 and 1896-1995). Of those 342, 71 are Luterbacher “proxies.” There are all sorts of things which can be said about the Luterbacher “proxies,” but the biggest issue is they are made by including modern temperature data! Of the remaining 273 proxies, one has a positive correlation in one subperiod, but a negative correlation in the other.
And finally, of the other 272, at least 93 passed after being truncated at 1960, then infilled with new data…
Brandon,
” In other words, look at each test as a separate comparison of the full reconstruction to a reconstruction made using a hold-out period.”
If he used all the data for the screened reconstruction, then it does not appear that anything was "held out". Can you explain what is being held out?
Oh, there’s another thing to consider about this screening lucia. For screening, Mann didn’t compare each proxy to the nearest local temperatures. Instead, he compared each to the nearest *two* sets of local temperatures, and if the proxy showed significant correlation with either, he kept it.
As for what all this means, I don’t know as far as toto’s claims. I can’t see what any of this has to do with avoiding variance attenuation. Screening over one period rather than another won’t change the fact screening itself deflates variance.
SteveF:
Data was held out for the validation testing, not the main reconstruction. In other words, he created one reconstruction with no hold-out data, then he created two more reconstructions, each with different hold-out data.
(Technically, he created many reconstructions for each of those three as he created many reconstructions with different starting periods for each. It’s not relevant to your question, but it is important for understanding some of his figures.)
Brandon
My impression is that toto is suggesting comparing “early calibration/late validation” vs. “late calibration/early screening” to each other tells us something about bias from the screening. But I don’t see how that is so– because the calibration periods largely overlap.
But toto seems to think they do– so I’d like him to explain why he thinks they do.
I'm still confused. Are all the colored traces in graphs 2A-2D based on proxies screened during the full calibration period (1850–1995)?
If yes, what’s supposed to be communicated by the band highlighting the 1950-1990 and 1850-1900?
Toto said I should look at those figures. And right now, I want to know what it is about those figures toto thinks shows us that the method has been checked for bias. To do that, I need toto to write several paragraphs.
Brandon
If so, which figures are the reconstructions made with the hold-out period? (Also, since even with the hold-out period, one is inside the other, this would hardly be good proof that screening didn't bias. But I want to verify your notion is correct, and if it is, the comparison of the reconstruction to those made with the hold-out periods must exist. Pointing to them would clarify a lot.)
I haven’t tested that. But it seems to me if I tweaked my code this would tend to aggravate any potential for screening bias rather than mitigate it. Of course, that depends on whether the correlation with one or other local temperature really is better or whether you’re just getting more chances to pick a trend that is higher than the ‘true’ value.
I have no idea why he thinks that. I don’t even think it’d be true if the two subperiods didn’t overlap so much.
Nope. They’re different reconstructions created by using different starting periods (match the color of the line to the color bar on the right to tell which line goes with which period). The colored traces on the left are the ones created using a late calibration period. The colored traces on the right are ones created using an early calibration period.
In other words, the two sections with colored lines are showing the “validation” reconstructions from his subperiod testing.
Where are the other two?
Brandon
We are not communicating. Let’s talk about individual traces in figure 2A.
Start with the dark blue 1: That is a reconstruction using proxies that began sometime around 300. (I can't tell each blue that well.) Fair enough.
What was the screening period for this reconstruction?
What was the calibration period for this proxy?
Are both 1850–1995?
Now the red line: That was created using proxies that began shortly before 1800. Were those proxies screened against 1850–1995? Was the calibration period 1850-1995?
The paragraph discussing that figure is talking about several validation/verification periods. “the early (1850–1949) calibration/late (1950–1995) validation and late (1896–1995) calibration/early (1850–1895) validation”. Are any of the traces in figure 2A based on “the early (1850–1949) calibration/late (1950–1995) validation and late (1896–1995) calibration/early (1850–1895) validation”.?
Yes? No?
For now, let’s just stick with figure 2A. Then we can move to 2B.
lucia:
I’m not sure what the effect on the final reconstruction would be since I think that would depend on the distribution of local temperatures. However, one thing it will certainly do is increase how many proxies pass screening. If the amount of variance attenuation is impacted by the number of proxies included, the amount of attenuation would also be affected.
Another thing it will do is increase the number of proxies we’d expect to pass by chance (and any signal/noise ratios we’d calculate using that value). I don’t think that’s relevant to the current discussion though.
lucia:
My apologies. Specific questions should clarify things:
First, it may be causing confusion that the periods are called “calibration periods.” In this paper, “calibration period” and “screening period” are always the same.
With that clarified, it’s important to note there are actually two dark blue lines. For the one on the left side, the screening period is 1896–1995. For the one on the right side, the screening period is 1850–1949.
The same as above. Make sure to note it includes not only the “proxies that began shortly before 1800,” but also all the proxies that began earlier.
All of them, as explained above. The traces shown in the graph represent the “validation period” for each of the two you list here.
BTW:
Even more interestingly, this decline outside the calibration period is exactly what you would expect from the bias due to screening "all good" proxies along with some correlation in the 'noise' in proxies. But I want to avoid interpreting it this way until after I read toto's explanation for how we know how Mann's method in this paper prevents this bias from affecting his historic reconstruction.
lucia:
Indeed. I forgot about that part of the paper. For what it’s worth, I think your interpretation is completely right, and toto just isn’t making sense.
Brandon
Ok. So that explains why the agreement looks fairly crappy in both areas. (Doesn't mean no skill…. but let's face it, not the sort of agreement you expect in the calibration period.)
Now, assuming you are correct, there is some evidence that the CPS cases are biased.
In the left hand side of 2A, the reconstructions tend to be too warm relative to CRU. That's what you would expect if they were screened against data from the warmer part of the 'black trace' (CRU). On the right hand side of 2B, the reconstructions tend to be too cool relative to CRU. That's what you would expect if they were created based on screening during the cooler part of CRU.
Unfortunately, it’s difficult to quantify the likely magnitude of this bias using the “method of the eyeball” because (unlike my toy cases) the noise in the proxies may have temporal autocorrelation. So you won’t see the sudden “jump” at the end of the screening period. Instead, you’ll see a slow slide into the ultimate amount of bias.
So… if your description of 2 is correct, their appearance seems to suggest that Mann’s CPS is biased!
FWIW– The same issue with 2A would hold with 2B.
On the other hand, it's not suggested as strongly with EIV. There are mixed issues there.
lucia:
Indeed!
By the way, I understand not just taking me just at my word, and if I were you, I’d want to wait for toto’s next comment before reaching any conclusions as well. But for what it’s worth, there’s been a lot of discussion over this “validation testing,” as well as examination of the code (and even replication of results in R). Whatever toto may be trying to say, he’s saying it about well-covered topics.
By the way, I forgot to point out one other fact about Mann’s screening I find incredible. Not only does Mann match correlations against the two nearest local temperatures, he also uses the absolute value of the correlations. His 13% value is based on a one-sided test when in reality he did a two-sided test. Once you account for that, you’d expect a quarter of the proxies to pass by chance alone!
Actually, something occurs to me. One would expect a quarter of proxies to pass Mann’s two-sided screening (discounting the fact he matches against two temperature records) by chance alone. However, one would not expect a quarter of *his* proxies to do so. Remember, he uses proxies which have been truncated because they’re “divergent” (they were then infilled with data to “fix” them).* He also used proxies which included instrumental data in them. Both of these facts would reduce the number of proxies we’d expect to pass with negative correlations.
Apparently some of Mann’s crazy choices mitigate some of his other crazy choices. Unfortunately, that doesn’t make it any easier to figure out what biases exist in his work.
*I don’t recall the exact number off-hand, but it is at least several hundred.
I’m not so worried about that. Having only 40% pass would bias many methods whether he’s accepting proxies that are pure noise or rejecting some weak proxies that contain signal.
I think CPS is similar to "method 3". But I have no idea what EIV is similar to. So, I don't know how screening would affect that anyway.
Brandon–
Rereading, it does seem the figure captions match your description.
I keep looking at figures 2A and 2B and it seems to me that those look exactly as we would expect if bias was introduced by screening. I'm pretty sure 'CPS' is similar to "method 3" (but it might be something I would have called 'method 4' but didn't show, and which matches method 1 a bit more).
So, figure 2 in Mann seems to suggest that his CPS results are biased. More would have to be done to estimate the magnitude– but that figure looks like CPS is biased in precisely the way I would expect it to be given the math and the screening.
I reserve judgement on EIV. I haven’t done any toy problems with it– and also, it doesn’t look as biased as CPS. (There are issues on the ‘older’ part but the ‘newer part’ isn’t consistently too cool. So… I’m not sure.)
Toto– Do you concur? Figure 2 strongly suggests Mann’s CPS is biased?
lucia:
Glad to hear it! I know I often think I’m clearer than I really am.
I’m afraid I don’t remember what you assigned to each number, and I don’t remember which thread it was on. However, CPS is a very simple process. You: 1) Standardize the proxies, 2) Average the proxies, 3) Rescale the result. Spatial weighting adds a bit of complexity, but it’s a simple process overall.
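For readers following along, a bare-bones sketch of those three steps in code. This is a generic composite-plus-scale, not Mann's implementation: no gridcell averaging, spatial weighting, or smoothing is attempted, and the standardization target is simply the hemispheric series.

```python
import numpy as np

def cps(P, T, cal):
    """Generic composite-plus-scale: standardize, average, rescale.

    P   : (M, years) array of (already screened) proxies
    T   : target temperature series, e.g. a hemispheric mean
    cal : boolean mask selecting the calibration years
    """
    # 1) Standardize each proxy over the calibration period.
    mu = P[:, cal].mean(axis=1, keepdims=True)
    sd = P[:, cal].std(axis=1, ddof=1, keepdims=True)
    Z = (P - mu) / sd

    # 2) Average the standardized proxies into a composite.
    comp = Z.mean(axis=0)

    # 3) Rescale the composite to the target's calibration-period mean and s.d.
    scale = T[cal].std(ddof=1) / comp[cal].std(ddof=1)
    return (comp - comp[cal].mean()) * scale + T[cal].mean()
```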
Of course, some of what can be said about CPS may not directly apply to Mann’s implementation of CPS as there are all sorts of issues with how he implemented it. Don’t get me started on the smoothing issues (including the smoothing of some data 15+ times), or the checking correlation of infilled temperature series with infilled proxy data, or…
Anyway, yeah, that figure definitely does show what we’d expect from the bias we’d expect to see.
D’oh. Stupid blockquote tags will be the death of me.
Anyway, as a last comment for the night, I highly recommend anyone interested in the issue of screening/calibration and bias take a look at Figure S11 (S means it’s in the Supplementary Information). It has two graphs (one for CPS, one for EIV), each showing the effect of using the full calibration period versus the early/late calibration periods. Unlike the Figure 2 which has been discussed so far, these graphs cover ~1600 years.
One interesting aspect of EIV is that during the calibration period, the results will be identical to what is being calibrated against (and thus have no meaning). Part of what causes this is that the orientation of the input is irrelevant to it. Series with a negative correlation to the temperature record will be flipped so they match it. You could have a dozen proxies sloping downward in modern times, add some noise, run them through EIV, and they will almost all flip over.
My guess is EIV causes variance attenuation, but I don't think that's the only type of bias it creates. Take a look at the late calibration reconstruction, and you'll see why.
Brandon
Well… you're editing for brevity here. But in fact, Mann discusses three possible ways to "standardize proxies" in his supplemental material, and after averaging proxies I can think of two ways to "rescale the result". Mann's supplemental material cites a previous Mann paper which I have not chased down. Without reading it, I can't know which of my two guesses (if either) is correct. (Of the three ways he discusses standardizing proxies, I think he chose one that seems ok and matches my "method 3".)
Reading Brandon Shollenberger’s explanations of Mann08’s methods brings to mind Einstein’s famous quip, “Make everything as simple as possible, but not simpler.”
As if.
This paper’s authors had oversize ambitions: be the first to construct a 2,000 year global reconstruction based on all types of proxies, and based on calibrating proxies to “local” temperatures, rather than to a global average temperature.
Looking back to the accompanying press release and subsequent (continuing) over-the-top behavior at RealClimate and elsewhere, I think it's also fair to say that these authors suffered from confirmation bias: the paper was designed to prove (1) that the hockey stick is real, and readily derived from multiple avenues of investigation, and (2) that tree-ring records are a reliable guide to past temperatures, with their results affirmed by other types of proxies.
That’s a lot of ground for a six page paper to cover.
A better publication process would have put some upper bounds on the authors’ ambitions, guiding them to scale back their claims to what they could plausibly support — and to what their audience would be able to comprehend and evaluate. Put another way, Mann08 is a poster child for the defects of academic peer review, perhaps especially for the slow-pitch option that PNAS has traditionally offered to National Academy members. (PNAS got lots of brickbats over this issue, and they instituted some reforms; I suspect but don’t know that Lonnie Thompson submitted the manuscript via this route.)
The paper has these components:
1. Produce 2,000 year temp anomalies by the CPS (composite plus scale) method for the Northern and Southern hemispheres and the globe.
2. Produce 2,000 year temp anomalies by a CFR (climate field reconstruction) method (EIV, or error-in-variables regularized expectation maximization) for the Northern and Southern hemispheres and the globe.
3. Show that both methods are valid, and consistent with one another.
4. Make maximum use of all proxies, even though they are of varying length.
5. Compile a list of all suitable proxies of all types.
6. Develop this list by screening proxies against the entire instrumental record, then by using two variants of a two-step calibration/validation procedure on those proxies that passed: an early (1850–1949) calibration/late (1950–1995) validation and also a late (1896–1995) calibration/early (1850–1895) validation.
7. Establish 90% and 95% significance levels for both CPS and EIV by a set of Monte Carlo experiments.
8. Demonstrate that the fundamental finding of both CPS and EIV recons — the 20th-century hockey stick — confirms the findings of earlier work by Esper et al (2002), Mann and Jones (2003), and Moberg et al (2005) (Fig. 3 panel 1), and by Jones et al (1998), Mann et al (1999), Crowley and Lowery (2000), Huang et al (2000), Briffa et al (2001), Mann et al (2003), and Oerlemans (2005) (Fig. 3 panel 2).
There’s more, but I’ll stop here — I haven’t even gotten to the SI.
It would have been very interesting to have read the three or so peer reviews of the earlier drafts of this manuscript, and the editor’s comments to the authors. I strongly suspect that this commentary would have demonstrated only the most superficial understanding of the authors’ methods. If reviewers did give good advice to improve this mess, it must have gone mostly unheeded. This view is supported by the multiple glaring errors that sashayed through the rigorous [sic] vetting process, only to be uncovered in the days and weeks after publication by genuinely-critical readers.
Rather than following Einstein’s directive, Mann08’s authors instead took a page from W.C. Fields’ formula for entertaining one’s audience.
“Rather than following Einstein’s directive, Mann08′s authors instead took a page from W.C. Fields’ formula for entertaining one’s audience.”
Einstein got things wrong, too.
bugs +1
lucia:
What “three possible ways” are you referring to? I know he standardizes all the proxies, averages any in the same gridcell, standardizes them again over the calibration period (this time using the gridcell temperature record as a target), spatially averages the results, and finally standardizes them with the hemispheric temperature record as the target. That’s the same process each time, just with different standardization targets.
Are you perhaps thinking of the standardization he did as preprocessing of data? He does use a “spline approach” on two proxies and an RCS approach on three more, but that was to get the series in the same “form” as other proxies being used (it is generally done by the people who provide proxy series). It was a part of preparing his dataset for analysis, not a part of the CPS methodology (those proxy series are used the same way in the EIV reconstruction).
Or am I perhaps just missing something?
After the final averaging, Mann just rescales the data to have the same mean and standard deviation as the modern temperature record for the hemisphere. I’m not sure what other way you might have in mind.
So, he discusses 3 methods and mentions the ‘sophisticated procedure’ one by Hegerl. I think that means there are at least 4 ways of standardizing in CPS. Mann used a particular one. But the way he describes it, CPS as a term might permit some choice of standardization method. To the extent that the standardization method is left unspecified, the term alone doesn’t tell you what was actually done.
One might rescale using average temperature observed at proxies during the calibration period rather than the NH CRU data. So, that’s a choice.
Alternatively, one might rescale by rebaselining the temperature to have zero mean, and then using Trec = mean[P] / m, where m is the slope from fitting mean[P] = m * mean[T] over the calibration period. In that case, the proxies will have a larger standard deviation than the temperatures. (In my toy model the fact that the proxies will generally have a larger standard deviation than temperature is obvious from the fact that they contain some ‘chatter’ type noise while mean[T] is smooth.)
If Mann does rescale by making the standard deviation of mean[P(t)] match the standard deviation of T over the calibration period, then I think this method of standardization will introduce some weird biases in my ‘toy’ cases.
I don’t know if anyone does it the alternative ways I describe, but each is at least hypothetically possible if all you have is mean[P] and T_target values in a calibration period. Unless the rescaling is spelled out, a person reading that the method called “CPS” was applied and told “then rescale” can’t know which choice was made.
I certainly wouldn’t have guessed “rescales the data to have the same mean and standard deviation as the modern temperature record for the hemisphere”!
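To make the choices concrete, here is a hedged Python sketch of the three rescalings I have in mind, applied to a composite mean-proxy series over a calibration mask `cal`. Only the first one is my reading of what Mann does; the other two are the hypothetical alternatives above, and every function name is mine.

```python
import numpy as np

def rescale_sd_match(P_bar, T_target, cal):
    """Shift/stretch P_bar so its calibration-period mean and s.d. match T_target's."""
    p, t = P_bar[cal], T_target[cal]
    return (P_bar - p.mean()) / p.std() * t.std() + t.mean()

def rescale_to_local_mean(P_bar, T_local_mean, cal):
    """Same idea, but targeting the mean of temperatures observed at the proxy sites."""
    p, t = P_bar[cal], T_local_mean[cal]
    return (P_bar - p.mean()) / p.std() * t.std() + t.mean()

def rescale_trend_fit(P_bar, T_target, cal):
    """Rebaseline T to zero mean, fit mean[P] = m * mean[T], then Trec = mean[P] / m."""
    t = T_target[cal] - T_target[cal].mean()
    p = P_bar[cal] - P_bar[cal].mean()
    m = np.polyfit(t, p, 1)[0]  # slope of the composite on T over the calibration period
    return (P_bar - P_bar[cal].mean()) / m
```

Here `cal` is just a boolean mask (or index array) picking out the calibration years.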
Let me step through what Mann describes:
I don’t know if I need to look there for details. I thought I did; maybe I don’t. It seems the following may be complete:
For proxies with an annual time scale, he accepted all with |r| > 0.11 against at least one of two nearby temperature measurement sites. For those with decadal time scales, he required |r| > 0.34 over the full calibration period. He tossed the other ones.
He smoothed using a low-pass filter. (I assume for P only.)
He scaled all individual P’s and T’s to have zero mean and unit standard deviation over the calibration interval.
He collected all the proxies in the same grid box and averaged them, after which he scaled them to the same mean and standard deviation as the smoothed surface temperature, T, for that grid box. So this is done locally.
(Did he really do this? Did he average first, including averaging tree rings with speleothems and whatnot, and then scale the average? Or scale the individual proxies first and then average? The order of the words in the sentence matches the first interpretation, but the second makes more sense.)
I assume he weighted by the size of the grid boxes that contained proxies and then computed the weighted average to get the average over the target hemisphere. (Are some grid boxes larger than others? Does this mean one large box holding one proxy might, hypothetically, outweigh ten proxies in a small grid box? Did some boxes get left out? This would only bother me a little; I’m mostly just wondering. I assume some boxes might be empty.)
So… he stretched the above average to have the same mean and standard deviation as the CRU data over the calibration period? Right?
This seems to be a caveat on the grid box issue. He broke some boxes down into smaller boxes.
After that he discusses other ways to assemble this.
Have I got what he did right? I can try to code a “method 4” that contains elements of this. But I would prefer to match the order of the “averaging/scaling” operations.
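Here is roughly what I’d code as a toy, assuming the screening and smoothing have already happened upstream. It is not Mann’s code: I’ve taken the scale-each-proxy-then-average reading inside a grid box, the order question is flagged in a comment, and all the names are mine.

```python
import numpy as np

def standardize(x, cal):
    """Zero mean, unit s.d. over the calibration mask."""
    return (x - x[cal].mean()) / x[cal].std()

def scale_to(x, target, cal):
    """Give x the calibration-period mean and s.d. of target."""
    return standardize(x, cal) * target[cal].std() + target[cal].mean()

def toy_cps(proxies_by_box, T_box, box_weights, T_hemi, cal):
    """proxies_by_box: {box: list of screened, smoothed proxy arrays};
    T_box: {box: smoothed gridbox temperature}; box_weights: {box: area weight}."""
    weighted, weights = [], []
    for box, proxies in proxies_by_box.items():
        # Scale each proxy to the gridbox temperature, then average within the box.
        # (This is the order questioned in the parenthetical above; flipping it would
        # mean averaging the standardized proxies first and scaling the composite.)
        scaled = [scale_to(p, T_box[box], cal) for p in proxies]
        comp = np.mean(scaled, axis=0)
        weighted.append(box_weights[box] * comp)
        weights.append(box_weights[box])
    # Area-weighted average over the occupied boxes, then stretch to the hemispheric record.
    hemi = np.sum(weighted, axis=0) / np.sum(weights)
    return scale_to(hemi, T_hemi, cal)
```

Flipping the averaging and scaling order would just mean moving the `scale_to` call from the individual proxies to the box composite.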
lucia:
I definitely agree there are lots of ways to rescale the final results. In fact, there’s an infinite number (though they won’t all qualify as CPS). But those are all part of step 3, the rescaling. The fact they can “standardize” the results in different ways doesn’t have any bearing on how they can standardize the proxies.
Fair enough. I left it at, “Rescale the result” because I was trying to keep the description brief.
While CPS isn’t that specific about what to rescale to, it does seem to require one rescale to the instrumental temperature record. Mann’s first reference on CPS says this:
Assuming that’s accurate, CPS requires one rescale to the temperature record, but it allows one leeway on just how to do that rescaling.
lucia:
I think that reference might be necessary to find out what CPS allows, as in whether Mann had other options he could have used. But aside from that, it shouldn’t be necessary.
Your description seems to cover what he did correctly, but I’ll have to look at the details in a bit. I’m heading out for lunch now.
Brandon– You’re right. Those are methods to rescale the results. Not methods to standardize. My mistake.
Ok. So my “method III” rescales to the same temperature record as CPS. But it does so differently from Mann. I use the best-fit trend. He made the standard deviations of P and T match. Right?
No hurry. I’m trying to unpack the procedure to see if I can make a ‘toy’ out of it. The whole ‘grid box’ etc. issue is a bit thorny in terms of ‘toy’.
Brandon–
I’m uncertain about the exact order of operations.
I assume he did what made sense:
1) Smoothed the temperature in each CRU grid box by passing it through the same f = 0.1/year filter he used to smooth the individual proxies.
2) Scaled the individual, already-smoothed proxies to have the same mean and standard deviation as the smoothed CRU temperature.
3) Averaged over the individual proxies.
Since the proxies aren’t even the same thing, it would not make sense to do (3) before (2). If you did, you’d get different results by measuring ring widths in “meters” vs. “millimeters”.
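Just to check the units point with a few lines of Python (toy numbers, nothing from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
ring_mm = rng.normal(1.0, 0.2, 100)   # ring widths in millimeters
d18o = rng.normal(-5.0, 1.0, 100)     # a second proxy on a completely different scale

def standardize(x):
    return (x - x.mean()) / x.std()

# Average the raw series first: the answer depends on the choice of units.
avg_raw_mm = np.mean([ring_mm, d18o], axis=0)
avg_raw_m = np.mean([ring_mm / 1000.0, d18o], axis=0)   # same widths, now in meters
print(np.allclose(avg_raw_mm, avg_raw_m))               # False: units matter

# Standardize each series first: the units drop out before averaging.
avg_std_mm = np.mean([standardize(ring_mm), standardize(d18o)], axis=0)
avg_std_m = np.mean([standardize(ring_mm / 1000.0), standardize(d18o)], axis=0)
print(np.allclose(avg_std_mm, avg_std_m))                # True: units don't matter
```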
lucia:
No prob. It’s easy to mix up things like that when you first start examining something. I don’t want to say how much time I’ve put into studying this paper, and I still don’t know everything there is to know about it.
Yup. Well, technically he uses the “decadal standard deviation.” I guess that might matter some, but honestly, with all the smoothing he does, I wouldn’t even bother to check.
Why? 😛
I’m not sure how much of an effect it would have since all proxies are normalized as the first step of CPS, but he does do (3) before (2).
Brandon
Oh. Yes. You’re right. Either order should be ok then.
I’ve been trying to think through the pile-up of rescalings a bit. And also this method…. I can see that in the limit of an infinite calibration period, it returns the correct value of T. But…. otherwise…. I think it’s … uhmmm…biased…
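For what it’s worth, here is a minimal Monte Carlo sketch of the sort of thing I mean: a toy target, noisy linear proxies, and the sd-matching rescale over a finite calibration window. Every number here is a toy assumption, and the proxy model is just “linear response plus white noise”, not anything taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years, n_cal, n_prox, lam, noise_sd = 1000, 100, 20, 0.25, 1.0

# "True" target temperature: flat at zero, then a ramp during the calibration period.
T_true = np.concatenate([np.zeros(n_years - n_cal), np.linspace(0.0, 1.0, n_cal)])
cal = np.arange(n_years) >= n_years - n_cal

recon = np.zeros(n_years)
trials = 200
for _ in range(trials):
    proxies = lam * T_true + rng.normal(0.0, noise_sd, (n_prox, n_years))
    comp = proxies.mean(axis=0)
    # Rescale the composite to match the calibration-period mean and s.d. of T.
    T_rec = (comp - comp[cal].mean()) / comp[cal].std() * T_true[cal].std() + T_true[cal].mean()
    recon += T_rec / trials

# Average reconstructed level in the pre-calibration period vs. the true value (0.0):
print("mean pre-calibration recon:", recon[~cal].mean())
```

With these toy settings the pre-calibration level should come out above zero, because the composite’s calibration-period standard deviation is inflated by the noise term; how big the effect is depends on the number of proxies, the noise level, and the calibration length.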
lucia:
That was my impression, but I hadn’t put enough thought into the matter to be sure.
I wonder what would happen in this case if one used a different period (on the proxies) for scaling than for screening.