Comments on the previous thread have wandered off into a discussion of Christiansen & Ljungqvist 2012. Today’s cartoon problem is motivated by that discussion in comments. As such, this computation represents a “swag” or “back of the envelope” sort of computation to see whether, based on a quick read of the numerical values in C&L 2012, I think their estimate for the peak temperature in the MWP could– possibly– be biased low. My main conclusion is that given the numerical values that can be picked out, a back of the envelope calculation suggests it… might… be… low…. However, before anyone who “wants” the MWP to be warmer gets too excited: C&L did a bunch of calculations and the only thing the back of the envelope calculation means is that I need to dig more. (This is annoying because I have always tried to stay out of the details of reconstructions. I don’t really want to learn the ins and outs of Bristlecone pines. I’ve discussed this with Steve McIntyre and he says he can totally understand that…. 🙂 )
By way of background, I’ll recap the discussion. Carrick started it off with
Here’s a paper which uses correlation-based screening that claims to avoid the loss of variance present in the usual method: Christiansen & Ljungqvist 2012.
Although I always read comments and Carrick’s in particular, I did not click the link. I was busy discussing bias that happens as a result of screening, and mostly going back and forth with Nick. I was also in the process of writing a script using what I will call “Method 2” and looking at numbers.
In a response to Ken (who was apologizing for going OT) I wrote:
If I understand him correctly, Nick has been suggesting that the bias I’ve discussed only happens if we do the order like this:
1) Compute fitting parameters ‘m’ and ‘b’ for each of the N proxies under the assumed model w=mT+b.
2) Compute the mean calibration based on the N proxies. (Creates a curve.)
3) Compute <m> based on the N proxies and estimate the temperature curve as <w>/<m>.

But we wouldn’t get the bias if we
1) Compute fitting parameters ‘m’ and ‘b’ for each of the N proxies under the assumed model w=mT+b. This gives us N (m,b) pairs.
2) For each of the N proxies, scale to estimate Test=(w-b)/m with b and m being the ones computed for that proxy.
3) average over all Test to get the reconstruction for T as a function of time.

I’m finding
a) The second method is really weird.
b) If you screen heavily, it is biased in the way I’ve been showing for the first method.
c) If you don’t screen, it is biased making the variability too high, but that is trivial to correct using a scaling factor which is easy to compute based on the unscreened data.
d) Corrected, it results in a method that is unbiased but horribly noisy– and so worse than the first method.

I can see very little advantage to the 2nd method. Maybe after I blog about it tomorrow, Nick will be able to explain what the advantage of the 2nd method is, but right now I’m inclined to call it “The totally insane method that results in ridiculously noisy reconstructions that is always biased in one way or another.” In fact, as a method it strikes me as so horribly insane that I can’t imagine I’ve understood what the process is supposed to be. Because relative to the first method, it looks…. well, absolutely horrible!
In the following comment, Nick responded
Lucia,
I’m not sure if I advocated it, but your method 2 is I think almost exactly what is described in the C&Lundquist paper that Carrick linked. I’m experimenting with that. I’m surprised that it’s considered new.
Ok… so it looks like “Method 2” may be what C&L 2012 call “LOC”. Unless I am misunderstanding something, it is a very weird method in the sense that its properties are… well.. odd. At the risk of showing I truly don’t understand it, I’m now going to post, and also run a ‘back of the envelope’ type computation that roughly matches some of the numerical values that can be picked out of C&L 2012. (Picking out more numerical values would require careful attention– so this is “the cartoon”.)
The Method
As described, C&L’s LOC seems to be no more than this (pages 769-770):
The LOC reconstruction method is introduced and motivated in Christiansen (2011), and additional details and discussions can be found in Christiansen (2012), Tingley and Li (2012), and Christiansen and Ljungqvist (2011). Here we only give
a brief description. The method requires calibration periods with simultaneous values of proxies and local temperatures. We assume that both proxies and temperatures are centered to zero in the calibration period. The LOC reconstruction method relates proxies to local temperatures and consists of three steps: (1) The proxies are screened and only proxies with a statistically significant relation to the local temperature in the calibration period are preserved. (2) Each of the proxies that passed the test is related linearly to the local temperature; P = λT + ξ, where the noise and the local temperature T are considered independent. It is important here that the proxy is chosen as the dependent variable. The regression coefficient is determined from the data in the calibration period. The local temperature is then reconstructed by T = P/λ. (3) The reconstructed local temperatures are averaged to form the reconstructed large-scale (here extratropical NH) mean temperature
This is essentially what I wrote in my comment above but with different symbols used. I used w where C&L used P, I use m for their λ, and I permitted a non-zero intercept in my fit. To recover theirs I can just set b=0. Same method.
The math is easier with their method. So, I’m going to go ahead with theirs with one change. I’m going to add a subscript to their (3) and say, “For proxy i, the reconstructed temperature is estimated using the estimate of λ based on the sample data, λi,est.”
The local temperature is then reconstructed by
$latex T_{i,est} = {P_i \over \lambda_{i,est}}$ … (3a)
This clarifies that the temperature reconstruction for each proxy is based on an estimate of λ which must be understood if we are to tease out whether a method is biased.
In their next step, C&L say ” The reconstructed local temperatures are averaged to form the reconstructed large-scale (here extratropical NH) mean temperature”
In my notation, this means they do
$latex T_{rec} = {1\over N} \sum_{i=1}^{N} T_{i,est} = {1\over N} \sum_{i=1}^{N} {P_i \over \lambda_{i,est}}$
It is worth noting that λest≠λ. Rather, because λ is estimated based on a finite amount of data, λest = λ + λ’ where λ’ is an error in the determination of the true value λ. Under the assumptions of this analysis mean[λest] = λ and P = λT + ξ, where ξ is scaled noise.
So
$latex T_{rec} = {1\over N} \sum_{i=1}^{N} {\lambda T + \xi_i \over \lambda + \lambda'_i}$
If the noise is gaussian white noise (as it will be in my current cartoon) this becomes, on averaging over realizations,
$latex {\rm mean}[T_{rec}] = T \cdot {\rm mean}\left[{\lambda \over \lambda + \lambda'}\right]$ … (6)
No further simplification is permitted because even when mean[λ’]=0 there is nothing to guarantee mean[1/(λ+ λ’)] = 1/mean[λ]; in the general case the inequality holds.
When the quantity mean[λ/(λ + λ’)]=1, the method is unbiased in the sense that if we reconstruct a known signal that is a sinusoid, the peak-to-trough amplitude of the sinusoid will match that of the original signal with “noise” superimposed on it. If mean[λ/(λ + λ’)] < 1, outside the calibration period the amplitude of the sinusoid in the reconstruction will be smaller than in the original. If mean[λ/(λ + λ’)] > 1 the amplitude of the sinusoid will be larger than in the original. (In all cases, high frequency white noise will be superimposed on the reconstruction.)
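To make this concrete before getting to the cartoon, here is a minimal Monte Carlo sketch (my own illustration, not C&L’s code) of the inequality: even with mean[λ’]=0, the average of λ/(λ+λ’) comes out above 1. The λ=0.32 matches the cartoon below; the spread of 0.05 for λ’ is an arbitrary assumption kept small so the denominator stays well away from zero.

import numpy as np

rng = np.random.default_rng(0)
lam_true = 0.32                                  # "true" slope, as in the cartoon below
lam_err = rng.normal(0.0, 0.05, size=100_000)    # zero-mean errors in the estimated slope

f = np.mean(lam_true / (lam_true + lam_err))     # mean[lambda / lambda_est]
print(f"mean[lambda'] = {lam_err.mean():.4f}")                 # ~ 0
print(f"mean[lambda/(lambda+lambda')] = {f:.3f}")              # > 1 even though mean[lambda'] = 0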
The Cartoon Case
I’m going to apply what I think is LOC to my “cartoon” case, which draws some numerical values from C&L 2012 (though certain details surely do not match; I don’t know how important these other details might be). In my cartoon case, I will create 1000 synthetic proxies with identical properties to reconstruct a ‘local’ temperature that is sinusoidal with a period of 160 years. The known true correlation between each proxy variable P and temperature will be R=0.32 based on the full temperature range for the synthetic temperature, and the ‘noise’ in the proxies will be gaussian white noise. The calibration period will be the final 80 years in the period, which will end with an uptick– but slightly before the peak of the sinusoid. I will then compute a proxy reconstruction using the method described by C&L 2012; in one case I will not screen, in the other I will screen and base my reconstruction on proxies whose sample correlation resulted in a screen that retained 60.43% of the proxies. Rscreen= 0.29.
Note: My choices are pulled from C&L 2012. The 80 years was chosen because C&L use 1880-1960 in at least some of the reconstructions presented in their paper. For my cartoon case where all proxies have the same R, when I used an 80 year calibration period a choice of R=0.32 results in keeping roughly the same number of proxies at the same cutoff R. In their paper, I think they screened and retained 55/91=60.43% proxies using a cut off R= 0.29.
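For readers who want to play along, here is a sketch of how I read this setup. The record length (480 ‘years’), the phase of the sinusoid and the random seed are my own illustrative assumptions, so the numbers it prints will be close to, but not identical to, the ones quoted in this post.

import numpy as np

rng = np.random.default_rng(1)
n_years, n_cal, n_proxies, r_true = 480, 80, 1000, 0.32
t = np.arange(n_years)
# Sinusoidal 'true' local temperature with a 160-year period; the phase offset is chosen so the
# record ends on an uptick slightly before a peak, roughly as described above
temp = np.sin(2 * np.pi * (t + 30) / 160.0)
# Proxy model P = r_true*T + white noise, with the noise scaled so the full-record correlation is r_true
proxies = r_true * temp + temp.std() * np.sqrt(1 - r_true**2) * rng.normal(size=(n_proxies, n_years))
cal = slice(n_years - n_cal, n_years)            # final 80 'years' = calibration window

# Sample correlation of each proxy with local temperature over the calibration window
Tc = temp[cal] - temp[cal].mean()
Pc = proxies[:, cal] - proxies[:, cal].mean(axis=1, keepdims=True)
r_sample = (Pc * Tc).sum(axis=1) / np.sqrt((Pc**2).sum(axis=1) * (Tc**2).sum())

# Screen so the same fraction (55/91) is retained as in C&L; the cutoff is then a random variable
cutoff = np.quantile(r_sample, 1 - 55 / 91)
print(f"mean sample correlation = {r_sample.mean():.2f}, screening cutoff = {cutoff:.2f}")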
I will discuss some results based on both screened and unscreened cases.
The proxy distribution
In the first step, I generated 1000 proxies with R=0.32, then computed the correlation between each synthetic proxy and the true temperature. This resulted in a distribution of Rsample shown below:
Note that even though the “true” correlation between every single one of the proxies and temperature is 0.32, the estimate from each of the 1000 proxies, based on data over 80 years, exhibits a range (shown below). The mean correlation for the proxies is 0.31, which is biased a bit low (as expected for correlations). The vertical red line separates those proxies with sample correlations below the screening limit from those above the screening limit. (I screened to retain the same ratio retained in C&L, making the screening limit a random variable. In this instance it was 0.288.)
Though not required to implement C&L’s LOC, I then computed the value of the trend (i.e. the regression slope) for each of the N proxies. Because these are synthetically generated, the ‘true’ value should be λ=0.32. The 1000 estimates of λest based on 80 ‘years’ of data are shown. The ‘true’ value λ=0.32 is shown by a vertical yellow line; the mean value over the 1000 proxies is shown in blue; the two are very similar in magnitude.

I repeated the same exercise for the screened proxies. Notice the mean estimate for λest now differs from the true value, λ=0.32. For this particular set of proxies mean[λest] = 0.39.
Before proceeding, I want to discuss a consequence of two things: a) the presence of a distribution in λest and b) the bias in λest relative to the true value after screening. Recall equation (6): the reconstruction amplitude is set by mean[λ/λest].
Because there is a distribution in λest, and in this particular case using synthetically generated data we know the true value of λ, we can compute the sample average of f = λ/λest based on the 1000 proxies and the roughly 600 screened proxies. In the unscreened case, I get f=1.22, which is greater than 1. This means that if I use LOC without screening, the peak to trough amplitudes of my sinusoidal temperature will be inflated on reconstruction. In the screened case, I get f=0.85, which is less than 1. This means that under the screening choice here, the peak to trough values in the reconstruction will be biased low. (This latter bias is of the type discussed in my previous posts and by Roman at CA.)
It’s worth noting that I can compute the values of inflation or deflation I anticipate will arise with the method only because my proxies are synthetically generated and I know the true value of λ for each proxy. Estimating the inflation based on real data would require making assumptions about the distribution of the true values of λ which would be difficult to test given the large spread in λ one anticipates to exist when the value is computed based on only 80 years.
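Here is the corresponding step in the sketch (same caveats as before: the seed, record length and phase are my assumptions, so the printed f values will only be in the neighborhood of the 1.22 and 0.85 quoted above, and the unscreened value is itself quite noisy because a few λest land near zero).

import numpy as np

rng = np.random.default_rng(1)
n_years, n_cal, n_proxies, lam_true = 480, 80, 1000, 0.32
t = np.arange(n_years)
temp = np.sin(2 * np.pi * (t + 30) / 160.0)
proxies = lam_true * temp + temp.std() * np.sqrt(1 - lam_true**2) * rng.normal(size=(n_proxies, n_years))
cal = slice(n_years - n_cal, n_years)

Tc = temp[cal] - temp[cal].mean()
Pc = proxies[:, cal] - proxies[:, cal].mean(axis=1, keepdims=True)
lam_est = (Pc * Tc).sum(axis=1) / (Tc * Tc).sum()            # per-proxy OLS slope of P on T
r_sample = (Pc * Tc).sum(axis=1) / np.sqrt((Pc**2).sum(axis=1) * (Tc**2).sum())
keep = r_sample >= np.quantile(r_sample, 1 - 55 / 91)        # retain the same fraction as C&L

# f = mean[lam_true/lam_est]: the expected stretch (f > 1) or damping (f < 1) of the reconstruction.
# This is computable only because the proxies are synthetic and lam_true is known.
print(f"f unscreened = {np.mean(lam_true / lam_est):.2f}")          # typically > 1, and itself noisy
print(f"f screened   = {np.mean(lam_true / lam_est[keep]):.2f}")    # typically < 1 with this screening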
That ends this digression from applying the method for now; later I will draw the reader’s attention to these estimates of the inflation or deflation of the peak to trough values.
I will now continue on and compute the reconstruction by applying (3a) and then averaging over the proxies. Below, the reconstruction for screened proxies is shown in dark green and compared to the known true temperature shown in black.

Note the peak to trough amplitude of the dark green (reconstructed) oscillation appears to be damped relative to the black oscillation. I previously estimated that the amplitude of the reconstruction would be damped by a factor of f=0.85. I divided the reconstructed temperatures by this factor to create the ‘corrected’ reconstruction; this is shown in blue.
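Continuing the sketch, the screened LOC reconstruction and its ‘corrected’ version can be produced like this (again, my illustrative setup rather than the exact code behind the figures):

import numpy as np

rng = np.random.default_rng(1)
n_years, n_cal, n_proxies, lam_true = 480, 80, 1000, 0.32
t = np.arange(n_years)
temp = np.sin(2 * np.pi * (t + 30) / 160.0)
proxies = lam_true * temp + temp.std() * np.sqrt(1 - lam_true**2) * rng.normal(size=(n_proxies, n_years))
cal = slice(n_years - n_cal, n_years)

Tc = temp[cal] - temp[cal].mean()
Pc = proxies[:, cal] - proxies[:, cal].mean(axis=1, keepdims=True)
lam_est = (Pc * Tc).sum(axis=1) / (Tc * Tc).sum()
r_sample = (Pc * Tc).sum(axis=1) / np.sqrt((Pc**2).sum(axis=1) * (Tc**2).sum())
keep = r_sample >= np.quantile(r_sample, 1 - 55 / 91)

# LOC: convert each retained proxy to a local temperature with its own slope, then average.
# Everything is handled as anomalies relative to the calibration-period mean, as C&L describe.
T_anom = temp - temp[cal].mean()
P_anom = proxies - proxies[:, cal].mean(axis=1, keepdims=True)
recon = (P_anom[keep] / lam_est[keep][:, None]).mean(axis=0)    # screened LOC reconstruction

# 'Correct' by dividing out f, known here only because the data are synthetic
f = np.mean(lam_true / lam_est[keep])
recon_corrected = recon / f
print(f"f = {f:.2f}; std(recon)/std(true) = {recon.std() / T_anom.std():.2f} (roughly f, plus a little noise)")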
(For those who are only reading this because they want to say “Whooo hooo. The medieval period might be warmer than shown in C&L”, the difference between the light blue and the dark green graphs is the key feature. But don’t shout “whoo hoo” too loudly because there are other details in C&L I have not yet considered and which might mitigate this. All this really says is: We need to look at the details about the individual proxies and so on. But this is the end of the back-of-the-envelope part that shows C&L might be biased and suggests the reviewers need to require people using the LOC method to explain more about their method.)
Unscreened LOC
I repeated the above using the 1000 unscreened proxies. The reconstruction based on the unscreened proxies is shown in purple. Notice the peak to trough amplitude exceeds the amplitude for the ‘true’ temperature shown with a black trace:

Recall in the digression I anticipated the proxy reconstruction based on unscreened LOC would have f=1.22 for this case. I deflated this to create the adjusted reconstruction shown in violet.
I would also like readers to apply the “method of eyeballs” and note that the unscreened reconstructions are noisier than the screened reconstruction. This noise problem can rise to absolutely pathological levels if one uses individual proxies with weaker correlations.
I will now summarize a few things based on what we see here. Using this ‘cartoon’ analysis it appears:
- Unscreened LOC ‘stretches’ the amplitude of oscillations in a reconstruction. The consequence of this is that if an analyst were unaware of the feature and did not estimate or correct for the effect, their proxy reconstructions would make historic temperature appear to have larger low frequency oscillations than actually occurred and possibly lead to people over-estimating the peak temperatures in the MWP.
- Screening ‘compresses’ the amplitude of the oscillations of the reconstruction relative to the values one would get from the unscreened case. The amount of compression will depend on features in the proxies themselves and the cut-off level for screening. If heavily screened the result can be that the proxy reconstructions show net damping. If used to create a historic reconstruction of earth temperatures, this could make historic temperature appear to have smaller low frequency oscillations and possibly lead to people underestimating the peak temperatures in the MWP.
- It is possible that somehow, by accident, the amount of compression could exactly counterbalance the stretching inherent in unscreened LOC, but unless estimated one cannot know whether the bias will ‘matter’.
- If the inherent correlation between the proxy and temperature is low, the unscreened LOC method will be very, very noisy and also biased.
Currently, at least based on this toy model, I’m not seeing much to recommend LOC. Also, as this example pulls correlation values and screening values that fall in the ballpark of those in C&L 2012, the results suggest a potential for bias in their results. That suggests that we need to look at that paper more carefully to figure out whether they recognized and adjusted for this bias. Or even whether the bias would remain for their specific proxy values. I just pulled values in the ball park. It may be that on further review, I’ll conclude the potential for bias didn’t result in any real honest to goodness bias.
Worse… I could end up with egg on my face. I’m trying to implement the method as it seems to be described. Possibly, there is some feature that I am missing. If anyone can find that, I’d welcome an explanation of how the LOC is really implemented.
Comments
I now invite comments. Note: On this thread, I want people to restrict their comments to problems that arise when we assume the proxies are correlated with temperature, the correlation is linear and the true, underlying correlation does not change over time. Those other problems are important, but I want to focus on cases where the assumptions about behavior of proxies over time are correct and difficulties arise in application of a methodology.

Had to hit refresh. Really the first comment? Anyway, I have avoided going into much detail on this – I’d rather have smart people like Lucia, Carrick, etc. tell me what to think ( 😉 , appeal to authority) – the situation strikes me as analogous to fitting a curve using sum of squares regression vs. some kind of sum of square roots (ignoring complex numbers, say sqrt(abs(x))). So we wind up overweighting fluctuations then scaling down, which seems better than underweighting fluctuations (reducing variance?). I’m not sure of this analogy so I’ll stop.
No. It’s something worse. I think anyone who believes that LOC is not inherently biased has likely forgotten:
a) that the trends from a linear regression are themselves estimates. So they are random variables.
b) that for a random variable, y:
mean[y] ≠1/mean[1/y].
(or at least equality will not hold generally.)
The C&L paper seems to just glide past this assumption with no remark. Maybe the remark is buried somewhere in an appendix or supplemental materials, but if so, this is remarkably poor organization.
Those who think screening doesn’t introduce bias
c) forget that most of the shortcuts and assumptions they frequently like to make assume that samples are independent of each other and drawn randomly from the full population of possible samples. These cannot be made after screening.
Because of the weirdness of this method, extreme care would be required to identify whether bias was introduced in a particular instance. If the mechanisms whereby bias is introduced are not recognized, one might accidentally run pseudo-proxy studies in a ‘sweet spot’ where you might not see the bias. Because I haven’t explored the feature, I’m not sure how my ‘f’ parameter changes with the screening cutoff for white noise. Also, I don’t know how it changes if the various proxies have a distribution of λtrue. But you would really need to be careful not to trick yourself into not noticing the problems that might exist in your analysis.
Lucia –
Post-hoc scaling would not work if the bias is asymmetric? (note I am used to bias implying a consistent up or down, here in the cyclical example I am seeing mostly references to damping or exaggerating amplitudes)
BillC–
What do you mean by “post-hoc scaling”?
I think that’s because you are used to seeing reconstructions of earth temperature which tend to start during a colder period and end during a warm one. On a graph with a temperature ramp, the change in amplitude would manifest itself as a change in the difference between the initial and final temperatures.
I consider anything where, in the limit that you had an infinite number of proxies, Test,N→∞ ≠ Ttrue to be biased.
I consider the standard deviation in (Test,N − Test,N→∞) to provide an estimate of the ‘noise’.
My cartoons are using N which is not infinite. Probably because of the way I am coding my mac is grinding to a halt somewhere between N=1,000 and N=10,000.
post-hoc scaling: from your post
I didn’t see in the post what the LOC stands for. This paper may shed light. From the abstract:
Note: the paper I linked to above would be an instance of a Tingley and Li 2011, not sure if the same as the Tingley and Li 2012 referenced by C&L as per your post.
BillC–
Ahh.. that was a comment when I first started looking at this. The scaling is easy to compute in a synthetic case using unscreened data. It might not be so easy to compute in a non-synthetic case. But it definitely cannot be computed based on screened data. The issue is unrelated to symmetry. The ease or difficulty has to do with how well you know the distribution of Rest in unscreened samples of proxies and the distribution of Rtrue for the proxies themselves.
For synthetic data, you know both distributions.
For real data, you have Rest for the unscreened samples, and you might make an assumption that lets you infer Rtrue. So, you can estimate f with some possibility of success.
But if you screen, you no longer know the distribution of Rest for the unscreened samples and so can’t infer the distribution of the true R’s. (That is: You can’t do it unless you do side calculations based on the unscreened samples!)
BillC
I approve of these assumptions. 🙂
That’s a very brief way of saying what I elaborate on by introducing equation (3a) and continuing through (6). The problem is that the person writing that doesn’t come right out and say that the failure to propagate uncertainty hides the fact that the method will be biased.
In comments on the other thread, you’ll see I say it’s ‘hella noisy’.
I haven’t looked to see if Li’s method fixes the problems.
But LOC… I have a hard time seeing this as a good method!
BillC
From Tingley and Li:
They are discussing unscreened applications. The bolded statement is equivalent to saying that the “fs” I computed above will be very large if the ‘true R’ is near 0. That’s true and it’s what I get.
They say this:
I’d go further. As far as I can see the Christiansen and Ljungqvist papers don’t discuss the existence of the bias in any meaningful way. (It’s possible I just haven’t found those parts yet.)
just a quick note: this is a preprint of the “original” loc-paper by bo christiansen. Bo and Fredrik are certainly happy to discuss the method. This is Bo’s preprint-reply to the comment by Tingley and Li (linked above in #98821)
There are “identical” “superficial” loc-posts at the Klimazwiebel and the Air Vent.
Maybe I missed something, but the goal of Bo (and Fredrik) was to preserve low-frequency variability and to minimise biases. Christiansen (2011) accepts an overestimation of (interannual) variance to achieve the “correct” trans-decadal variance. I think that’s discussed rather conclusively in Christiansen 2011 and in his reply (both linked above).
ob–
I read the paper you linked last night. In your own words, can you tell me what about the discussion in Christiansen 2011 is conclusive with regard to the issues raised by Tingley and Li?
FWIW–
Figure 1 in the preprint-reply to Tingley and Li has bias in Pnew even in their method (green). Christiansen calls it “small” — but that would seem relative to the method shown in blue. He’s using correlations of about λ=0.4 (which is close to my cartoon). By the method of “the eyeball”, his bias for the unscreened LOC would appear to be about f=1.15 while I have about 1.22 in my post. But that could be explained by my R=0.32, for which f will be higher. (It’s easy to explain why.)
So, his reply is showing the same thing I am.
This makes it sound like they are discussing the bias, and may have accounted for it. But I haven’t found those details in their papers. (Maybe they are somewhere– I just can’t see it.)
I really like your approach of using synthetic data and then actually testing how the reconstruction methodology will actually work.
I would like to know why you used gaussian white noise in the synthetic data though.
ob–
The interesting-to-me part of Christiansen 2011 is figure 3 and the discussion of the pseudo-proxies. On the other thread, I groused a bit about the brevity in describing those pseudo-proxies. They say only
Which means I have to find whatever it is in past papers. ( Ok… not such a horrible thing. But still… I can’t really decide what I think without the other papers.)
Doc–
Simplicity. Fewer parameter choices. If I picked Red noise, someone is going to say “Why R=0.2 instead of R=0.9”. If I pick ARMA(1,1) further questions. And so on.
I also don’t want anyone to think that the bias I am seeing is somehow inextricably linked to the choice of noise model. It’s not.
The magnitude of bias is going to be affected by those things. The fact of it will not.
Also, it is precisely because all sorts of people get so wrapped up into their ‘favorite’ noise model that when I am discussing something that has nothing to do with the choice of noise model I prefer to pick the simplest possible. That’s white noise.
Ok… I was getting the impression ‘CPS’ was what I called ‘method 1’. CPS is evidently this
This is not my “method 1”, either screened or not screened. Maybe it could be that I’m wrong, but I think the above has the potential of being an even worse method than the first two.
This is really amazing! My first thought of how to do this was: “Oh. Method 1 would be easy and if not screened at least not biased.” Then, due to the suggestion of applying the correlation to each proxy…. I did Method 2, which seems to be Christiansen and Ljungqvist. I have yet to see an advantage of Method 2 over method 1. (Both are about equally difficult or easy to implement depending on how you look at it.)
But it looks like CPS is another variation– and my first impression is that it’s going to be biased. It’s potentially noisy in a very creative way too. I guess we’ll see when the time comes.
Lucia,
I think standard CPS would not weight by correlation – just an ordinary average. I think that weighting is a good idea, though; it’s my next plan. I’ll look up the Hegerl paper; they might be using Deming regression too (EiV).
Nick– Do they compute the fit on the weighted proxy? That’s different from what I’m doing, but I don’t think it’s ‘important’ with respect to bias. It could matter for noise.
I’m trying to figure out why Christiansen gets quite as much loss in variance as he does for CPS in one of his papers. Maybe he’s got the fit the wrong way around? (There has to be a reason!)
Lucia:
When $latex |\lambda’| \ll \lambda$, and $latex \lambda’$ may be treated as uncorrelated white noise, isn’t this statement true?
$latex {\rm mean} \left[{\lambda \over \lambda + \lambda’} \right] = 1 + \sqrt{E[\lambda’^2]\over N \lambda} \ge 1$.
If the noise is correlated with the correlation coefficient $latex \rho$, you replace this with
$latex {\rm mean} \left[{\lambda \over \lambda + \lambda’} \right] = 1 + \sqrt{E[\lambda’^2]\over N’ \lambda}$,
where
$latex N’ = {\rho – 1\over \rho +1} N$
Or am I wrong-headed here? (There may be a factor of 1/2 in there too. I did this in my head after *finally* finishing the tile project, so….YMMV).
Nick–
I computed weighted by proxy following the brief discussion of CPS from C&L. It’s biased. 🙂
Of course, if we aren’t sure that’s CPS, then that observation doesn’t apply to CPS. But it applies to ‘weighted by correlation’. I could try ‘weighted by r^2’. But it will be biased too!
Lucia,
The Hegerl paper is here. Yes, I think they do as you describe. Normalize proxies, weight by correlation and average, regress average on T using Deming, though they call it TLS with estimated variance ratio.
Totally flubbed it.
Second try, this time from scratch:
$latex {\lambda \over \lambda + \lambda’} = 1 – {\lambda’\over\lambda} + {\lambda’^2 \over\lambda^2} + \cdots$
Take the expectation value:
$latex E\left[{\lambda \over \lambda + \lambda’} \right] = 1 +{E[\lambda’^2]\over \lambda^2} + \cdots$
Take the mean of this and you get:
$latex {\rm mean} \left[{\lambda \over \lambda + \lambda’} \right] = 1 + {E[\lambda’^2] \over N \lambda^2} + \cdots$
Make more sense?
Carrick, You’re right. In the case of Gaussian noise we can estimate using a series expansion. But the problem is that for many of these cases λ’ is not much smaller than λ.
That’s definitely a good way to show the ratio is greater than 1 when noise is gaussian.
Nick
I suspect the key reason the CPS don’t show the MWP in Christiansen’s paper while his LOC is this difference:
For CPS:
for LOC:
The latter is bound to give higher correlations on average than the former. After all, if you had a local thermometer in my back yard and found the correlation with NH mean temperature, that’s going to be less than 1. And whatever it responds to, a tree in my back yard is more likely to respond to temperature in my back yard than to NH mean temperature. Same for lake varves, isotopes etc.
So, I suspect it’s the “LOC” feature that makes the big difference in Christiansen’s lower bias relative to CPS.
This presents the question: Why would anyone who understood how low R affects the amount of noise and bias in these methods take correlations to NH mean temperature instead of local temperatures when dealing with something like a tree, lake varve, isotope etc. It’s nuts. Just nuts.
Carrick,
I don’t think you can do that. The λ’s are quite scattered and can be negative. The random variable in your denominator includes zero in its range, and often not as an outlier.
I’d like to get the expectation you write down for the Deming regression, but have that problem. I think you can only do it for the CCE case, where T is regressed on P.
Lucia:
Once in a place far-far away from here (about 560 miles) in a time long-long ago (about 20 years), I studied this exact problem.
We were interested in computing the harmonic mean:
$latex {1\over x_H} = {1\over N} \sum_{n=1}^{N} {1\over x_n}$
and the error of the mean associated with it.
In that case $latex x_n > 0$, which is not the case here and is probably why you get wild behavior sometimes.
This turns out to be a better estimate of the “central tendency of a distribution” than arithmetic mean in cases where you have a subset of data that are outliers. In cases where you don’t have outliers (every data point belongs to the same normal distribution), mean works as well or better. With outliers, it can perform far worse.
There are actually papers out there on this distribution, but at the time I knew how to compute the distribution from scratch. Good luck for me finding those neurons now. I suspect they jumped ship and are partying in Detroit now. 😉
It’s an important point that for this method to work properly you need a positive definite quantity. Without screening out small or negative slopes, I don’t see how it would give anything useful here.
Nick:
I agree the conditions I assumed aren’t met here, but I think it works in the case where λ’ is small, right? That’s why I specified the derivation for the case |λ’| << λ.
BTW, since I'm never evaluating E[λ’^2], the assumption of normalcy is not being used. If you have correlation in the data you compute N' based on the type of statistic you are assuming, where N' ≤ N. I think it's a general statement that λ/(λ + λ’) ≥ 1 as long as |λ’| < λ.
Nick–

I don’t know what their names are, but here are my method I’s (screened and unscreened) and “CPS” as described by C&L applied to the toy problem:
Weighting by R comes out between unscreened and screened. So, by having weights between -1 and 1, the bias from screening is less than if you screen using 0 or 1 only.
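For what it’s worth, here is a sketch of the kind of correlation-weighted composite I have in mind when I say ‘weighting by R’; I am not claiming this is exactly what C&L call CPS (as discussed above, we’re not sure), and the setup values are the same illustrative ones as in the earlier sketches.

import numpy as np

rng = np.random.default_rng(1)
n_years, n_cal, n_proxies, r_true = 480, 80, 1000, 0.32
t = np.arange(n_years)
temp = np.sin(2 * np.pi * (t + 30) / 160.0)
proxies = r_true * temp + temp.std() * np.sqrt(1 - r_true**2) * rng.normal(size=(n_proxies, n_years))
cal = slice(n_years - n_cal, n_years)

Tc = temp[cal] - temp[cal].mean()
Pc = proxies[:, cal] - proxies[:, cal].mean(axis=1, keepdims=True)
r_sample = (Pc * Tc).sum(axis=1) / np.sqrt((Pc**2).sum(axis=1) * (Tc**2).sum())

# Standardize each proxy on the calibration window, weight by its calibration correlation,
# average into a composite, then scale the composite to the calibration-period temperature variance.
P_std = (proxies - proxies[:, cal].mean(axis=1, keepdims=True)) / proxies[:, cal].std(axis=1, keepdims=True)
composite = (r_sample[:, None] * P_std).sum(axis=0) / np.abs(r_sample).sum()
recon_w = composite * (temp[cal].std() / composite[cal].std()) + temp[cal].mean()
print(f"std(recon)/std(true) outside calibration = "
      f"{recon_w[:n_years - n_cal].std() / temp[:n_years - n_cal].std():.2f}")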
Lucia,
As far as varves go – I could be wrong – they might respond more to hemispheric temperature than local temperature, in some cases. If they are linked more to precipitation than temperature in general. That may be true for other proxies.
Carrick – congrats on the tile.
A few details come back re harmonic mean.
We were looking for robust methods of computing spectra, so if A(f) is the complex amplitude of the discrete Fourier transform of the series x(t) = s(t) + n(t), where s is signal and n is measurement noise of course, then
A(f) = S(f) + N(f)
where S(f) is the transform of s(t) and N(f) is the noise introduced by measurement error.
Given N(f) = x_n + i y_n, it can be shown if n(t) is white gaussian noise, then so are x_n and y_n (it’s simple really, the sum of a series of gaussian numbers, with any non-trivial weighting, is itself a gaussian number, and the DFT is a special example of the sum of a series of numbers).
Anyway, where this gets interesting is |N(f)| is given by a Rayleigh distribution and |N(f)|^2 by the exponential distribution (which is much simpler to integrate).
What we were doing was writing x_n(f) = |A_n(f)|^2 where A_n(f) refers to the complex discrete Fourier coefficient corresponding to frequency “f” in window “n”, then trying different types of processing on the series x_n to minimize the impact of “out of band” noise. Note that x_n(f) is the power in frequency bin “f”.
Clearly for our case x_n > 0 (since measurements are never noiseless), and if you restrict yourself to the exponential distribution, the math is much simpler. For my case, x_n gets very large when you have “out of band noise” (e.g., a car driving by in the case of outdoor sound), so it makes a very small contribution to the sum 1/x_n, which is what you are looking for (a method for automatically selecting only for measurements that belong to the distribution we’re interested in and getting a central-value estimate of).
What they are doing seems conceptually similar, but taking mean[1/(1+λ’/λ)] seems to me like a recipe for disaster, as long as λ’/λ is unbounded.
Carrick
I think to some extent, that’s why they screen. But that could all be avoided by just starting with rational selection (which is how they get their 91 proxies in the first place) and then using my ‘method 1’.
Light screening would be possible in case one was worried about proxies that really are not temperature sensitive. These introduce no bias but do make things noisy. But how, I ask you, is using a method that is fundamentally hella-noisy a “cure” for the fact that including some “non proxies” and a not-noisy method is a little noisy?
Why?
That may be. But that problem is separate from the problems based on the method alone.
BillC:
Thanks! It feels good to be doing something a little less back breaking for a few days.
Lucia the paper has interactive comments.
http://www.clim-past-discuss.net/7/3991/2011/cpd-7-3991-2011-discussion.html
The comment and response to review #2 may be of interest to you.
Christiansen also refers to a reply they produced to a 2011 paper that has relevance to the 2012 paper.
http://web.dmi.dk/fsweb/solar-terrestrial/staff/boc/reply_to_moberg.pdf
I can’t exactly say if this will help you as it’s over my head but section 2 and 3 seem relevant. Note the second link is different to the reply link provided by ob.
Here’s a link on the relationship between arithmetic and harmonic means.
Basically, $latex x_h < x_a$ where $latex x_h = n \times [1/x_1 + 1/x_2 + \cdots + 1/x_n]^{-1}$ and $latex x_a = [x_1 + x_2 + x_3 + \cdots + x_n]/n$ for positive $latex x_1, x_2, \cdots$ unless $latex x_1 = x_2 = x_3 = \cdots = x_n$ (in which case they’re equal).
This only applies if all members of series are positive. OTH, if you have enough noise that you end up getting slopes with the wrong sign, I’m not sure there’s too much of a point in attempting a reconstruction in any case.
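A tiny numerical illustration of the point (arbitrary made-up numbers, just to show the behavior):

import numpy as np

x = np.array([1.0, 1.1, 0.9, 1.05, 0.95])
x_outlier = np.append(x, 50.0)                   # one wild, positive outlier

def harmonic(v):
    return len(v) / np.sum(1.0 / v)              # defined here for positive values only

for label, v in [("clean", x), ("with outlier", x_outlier)]:
    print(f"{label:>12}: arithmetic = {v.mean():.2f}, harmonic = {harmonic(v):.2f}")
# The harmonic mean is <= the arithmetic mean, and the single large outlier moves it far less
# than it moves the arithmetic mean -- the robustness property described above. All values must be positive.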
How to make Money in Climate Science
1. find a major unanswered question.
2. question top scientists to see what they will accept as an answer
3. cherry pick data and methods to arrive at that answer
4. re-label this technique “training” – it makes it sound intelligent.
5. publish the result.
The results will seem correct to fellow scientists, especially those at the top, so they won’t bother to check the math. Everyone will be impressed you have answered the hard question. More so because you will have proven their best guess correct and made them look good in the process. You will advance in your career in science. Fame and fortune will follow.
If anyone does question the results:
6. stall – play for time – they will likely give-up.
7. if not – lose the data and methods
BillC (Comment #98847)
> [Varves] might respond more to hemispheric temperature than local temperature, in some cases.
My view is that to understand what varved sediments represent, one must study the context and natural history of the particular record under consideration.
For instance, there are German (IIRC) lakebed varve records that begin prior to the Younger Dryas (Wikipedia, 12,800 – 11,500 years BP). Varves are thick during that interval, which was cold and dry in that part of the world. Winds whipped up dry rock flour left by the departed glaciers into dust clouds; this material (loess) constituted most of those years’ thickness.
Obviously, this mechanism was not at work in other, more recent varved sediments.
AMac,
Oh I don’t doubt it. My point was simply that proxies are not necessarily more sensitive to local temperature than regional or hemispheric temperature if the relationship to temperature is indirect. Which is kind of OT for this thread anyway.
BillC, AMac, the issues you guys raise aren’t really relevant to the difference between lucia’s method one and two, but they are quite relevant in general. People doing reconstructions often decide to use a series because someone published saying it was a temperature proxy. That overlooks all sorts of issues, including the issue you raised.
If a proxy tracks temperature on a scale not used by a reconstruction’s methodology, any correlation found between it and the temperature it’s compared to will be spurious.* This means you could have a good proxy show good correlation by chance alone because it isn’t a proxy for what the authors use it for. You usually see the issue arise with local temperature proxies being matched against hemispheric/global temperatures, but it’s no better when inverted.
*Technically, the correlation may not be spurious as the two scales could also be correlated. However, without showing the strength of correlation between the two scales, there’s no way to estimate how much of the proxy’s correlation is spurious.
BillC–
Oddly, it’s not OT, because one of the features of LOC is that it uses fits between local temperature and the proxy response variable. In contrast, CPS evidently fits to NH temperature.
Which fit should be better depends on what makes sense. That’s why I asked you why varves should be more responsive to NH temperatures rather than local temperatures.
Ideally, when we fit
P=RT+ noise
We would like to use a temperature that affects P. If it turns out the ‘true’ relationship based on some sort of understanding of phenomenology of the proxy (e.g. tree, varve, isotope, whatever) is:
P=R_g T_g + R_l T_l + noise
where ‘g’ is global and ‘l’ is local
that would be something we might want to know. (If T_l and T_g co-vary, it’s going to be difficult to tease out R_g and R_l accurately, but still…)
That said, if for some reason we are forced to correlate with only one variable and R_g << R_l, we would prefer to correlate against local temperatures. If R_g >> R_l we would prefer to correlate against global.
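If one did have both a local and a large-scale temperature series over the calibration period, the two-coefficient fit is just an ordinary least-squares problem. A sketch (entirely synthetic, made-up series; the point is only the mechanics, and as noted above, co-variation between T_l and T_g will make the separate coefficients poorly determined):

import numpy as np

rng = np.random.default_rng(3)
n = 80                                            # a calibration period of 80 'years'
T_g = rng.normal(size=n)                          # synthetic 'global' temperature
T_l = 0.7 * T_g + 0.7 * rng.normal(size=n)        # synthetic 'local' temperature, co-varying with global
P = 0.1 * T_g + 0.4 * T_l + 0.5 * rng.normal(size=n)   # proxy responding mostly to local temperature

# Fit P = R_g*T_g + R_l*T_l + noise by ordinary least squares
X = np.column_stack([T_g, T_l])
(R_g, R_l), *_ = np.linalg.lstsq(X, P, rcond=None)
print(f"estimated R_g = {R_g:.2f}, R_l = {R_l:.2f} (true values 0.1 and 0.4)")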
Here’s a WUWT article on ice cores.
“‘We don’t believe the ice cores can be interpreted purely as a signal of temperature’”
It appears they also respond to moisture (where the moisture comes from). Which brings us back to the question, is there such a thing as a “true temperature proxy” or should proxies just be used as mechanism for tracking climate change?
This problem gets messy the more you look at it.
Carrick–
Reading that article, it looks like they think sometimes the ice core responds mostly to relatively local temperature (at least the Atlantic ocean) and some times to temperature far, far away (the Pacific Ocean!)
It makes you wonder about any temperature reconstructions based on ice cores anywhere.
Generally I think varves are more related to precip than temps. Based on statements by Isaac Held and others I believe precip is *more* related to regional or global temps than local ones, since precipitation systems move long distances. There are some good recent discussions of varves at CA. AMac’s example is a sort of regional/hemispheric lack of precip, so it jives, while also illustrating the need to be specific.
Rule of Accuracy: When working toward the solution of a problem, it always helps if you know the answer.
For those that think the answer is hockey sticks (without doubt), then the solution will always be hockey sticks.
For Nick, a precision and accuracy joke I heard last month:
A gentleman accosts a guard in the museum and asks, “How old is that mummy?”
The guard responds, “I believe it is 2005 years old.”
The gentleman asks “Why that particular age, 2005?”
The guard responds “Well, I was told it was 2000 years old when I started work here five years ago.”
Lucia, Carrick,
It seems to me that when you talk of local and global temperature dependence, you need to consider what you have measured, and what you want at the end. One can think about whether ice cores are more dependent on local temp, but if you don’t have that measured (usually not well) then it’s academic. If they are dependent on unrecorded local, you’ll only get correlation with global through covariance, and the discrepancy between local and global just adds to the noise.
A target temp is reconstructed, and at some stage everything has to be aggregated to that level. I see different classes of algorithm:
1. CPS and like – aggregate the proxies at the start, by normalizing and averaging. This reduces noise but is somewhat arbitrary.
2. Hegerl et al, Gergis (I think) and like – do individual regressions on the target temp and then average the resultant temps. Individual regressions are noisy; need care so this doesn’t bias the aggregation. Noise damped through screening and weighting.
3. LOC – regresses individual proxies on local temp, and then spatially aggregates temps to target. (I think our variant regresses on target, so more like 2).
“It appears they also respond to moisture (where the moisture comes from). ”
At AGU 2010 I sat in on Phil Jones Presentation where he
made this point about some greenland cores reconstructing
a grid cell over by iceland cause that is where the moisture
came from.
Hmm. I suppose evidence in the form of a consistent
weather pattern or “tracer’ studies could support that.
I can’t imagine that these claims are made solely on the basis
of correlation. dunno.
Steven Mosher (Comment #98869)
June 26th, 2012 at 9:02 pm
“At AGU 2010 I sat in on Phil Jones Presentation where he
made this point about some greenland cores reconstructing
a grid cell over by iceland cause that is where the moisture
came from.”
The trouble with this (in part) is that the source of the moisture is itself obviously going to vary. Climate changes may significantly change the hydrological cycle, after all, shifting precipitation patterns. There was a story on WUWT about how it’s speculated the Greenland ice cores exaggerate the Younger Dryas cooling due to such an effect. Personally it struck me that their confidence in the ice core records outside that period and in other locations should logically also be suspect, but the study authors seemed to focus solely on the Younger Dryas in Greenland because it is difficult for models to explain.
ya andrew. when I read that I thought of jones talk.
I’m curious about something. Is there a (relatively) simple way of telling what effect there’d be if screening was inconsistent? For example, if 30 series were screened for R = .1, 30 for R = .29, 30 for R = .50 (and the 91st was discarded for not meeting the data requirements laid out in the paper). If you could estimate the bias of screening at each of those levels, could you just average the three? If the sample sizes weren’t equal, could you weight your averaging by number screened?
Obviously, one wouldn’t expect to see that sort of process described in the methodology section of a paper, but it does hold relevance for how input for a paper is picked. When authors say they only use series which are believed to be temperature proxies, they’re using a form of pre-screening (which isn’t inherently bad). However, the standards used to decide what is and isn’t a temperature proxy vary, meaning the pre-screening isn’t uniform.
I know the example I gave and the issue I’m curious about aren’t directly translatable, but I think the former could give some insight into the latter. It might even be possible to figure out a way to estimate how proxy selection impacts the results of reconstructions.
Brandon
If you want to estimate the specific effect in a specific result presented, you have to go through all the details in that paper.
Otherwise, all you can do is a back of the envelope estimate.
Consider: Even the current method where I assume everything shares one “true R” I would need graphs and curves to show how a change in the ‘true R’ and choice of screens affects potential bias. It wouldn’t be difficult to make those– but it would take some time to run the code and organize the presentation to explain to others. (And is it worth it? Probably not.)
I think trying to find the effect of pre-screening the 91 proxies in Gergis would be conceptually easy but time consuming and tedious. You would have to dig up information on each of the 91 proxies, figure out on what basis those authors screened or selected materials from others that might have matched the criteria to create their proxy– and worse, figure out if there were even more proxies that got shunted to the side. All this would take quite a bit of time. Ideally, you would want to assemble all the data that the authors looked at (including the discarded stuff) before reporting each of their proxies.
Likely much of this information does not exist.
Once you had all the required information for all 91 proxies assembled, the math would be easy. To the extent information was incomplete, estimating the effect would require making assumptions based on the data you have.
If you are making the point that one can only work with data one has, then it’s true: you have no choice. If owing to lack of data, you can’t choose to apply the better method, then you can’t use the better method. No one is going to disagree with that.
But that’s not really the situation in most of these reconstructions. There are thermometers on Greenland etc.
If an ice core was dependent on local values, and you had relatively local temperatures, then it would be better to use those. For example: at least using temperatures in the general vicinity — like someplace in Antarctica should get you a better correlation with the core than using ones for the total surface of the earth. (That is: If the proxy really is a local detector.) At least some temperatures have been measured in Antarctica.
Where you screen is important because screening does introduce bias. If the data do respond to local temperatures it very easy to show that it’s better to screen based on local correlation than to screen based on correlation with global temperatures. (Yes. I’ve already run the toy cases….)
Sure. But if you screen based on correlation using noisier (low correlation) data, you will introduce a greater bias during the screening process. If you are stuck with this because you have no choice, you are stuck with it. But it seems to me many choices exist– including not screening.
It’s not any less arbitrary than methods 2 and 3 that follow. (Because: the principle of using the method that introduced no bias or less bias is not generally considered ‘arbitrary’. That’s considered good practice in statistics.)
If you call CPS ‘arbitrary’, this is just as ‘arbitrary’ as the previous. But I would say it’s worse than “arbitrary”. If they screen, this introduces bias. And the principle of avoiding bias makes these methods less attractive.
Please try to remember: unbiased noise in individual regressions does not bias the resulting reconstruction. It just introduces unbiased noise into the reconstruction. Screening and weighting are the steps that bias. It’s also not at all clear uncertainty is reduced by screening. The appearance of noisiness is reduced, so that the result looks less noisy to those using “method of eyeball”. But that’s not necessarily the same thing.
Do you mean my cartoon? Yes. The way it’s set up it currently aggregates on target because I started off with something simple to demonstrate the qualitative effect of screening.
I’ve started re-organizing to aggregate on ‘local’. What we can discover is the obvious which is that if a proxy responds to local temperatures and you are going to screen you should screen based on correlation to local temperatures.
Also, if you are going to convert each proxy to temperature individually and then average those (a method that introduces bias) and those proxies are thought to be local responders, you should screen based on local temperatures. Screening on global temperatures results in greater bias.
So, the only open issue is: Is a particular proxy local or global? If it’s local it’s best to do any screening using local regression. (Or at least as local as possible.) Basically: Ideally, to minimize bias introduced by screening, do the screening against the specific temperature to which the proxy responds. Not some other one.
lucia:
I don’t think I’d actually want to do it, but I am interested in the concept. As it stands now, even if I had all the data needed to do it, I’m not sure how it would be done. It seems to me it should be simple enough, but I’ve never tried a problem like it before. That’s why I asked. And I’m happy to hear you say “the math would be easy.”
Indeed. I wouldn’t expect any specific calculations to come from it, but back of the envelope estimates can help one get a feel for what the fully calculated results may be.
The topic mostly interests me because I know of a number of cases where one proxy (or even just one version of a proxy) was used over another without any justification.* I know that’s not right, and I know that will affect results, but most of the time, I don’t know how it will affect the results. I think it would be nice to be able to have an idea of that without having to redo all the calculations.
*At least, no justification is offered. In each case, the actual justification is that it gives the “right” answer.
Brandon
Anytime someone eliminates a potential proxy that otherwise meets criteria while keeping another one whose meta-data seem indistinguishable, that creates a warning flag.
Ordinarily, this is the sort of thing a reviewer ought to catch and ask about. But the number of reviewers is small and even if they are good, there are bound to be things they don’t catch. Or there might be shared knowledge. So, the choice may be justified, but other readers aren’t clued in. The latter is often the claim about what happens. But it’s not at all clear that the ‘shared knowledge’ is necessarily correct. That’s where forcing someone to state the reason is useful– but page limits work against this.
Assuming that’s true, that’s the sort of thing that makes outsiders deeply suspicious of results. That’s why if there is a reasonable justification it ought to be stated.
By the way, for the quantity
$latex {}\qquad \alpha ={1\over K} \sum_{k=1}^K {\lambda \over \lambda + \lambda’}$,
with $latex \lambda’$ a normally distributed error, the average over $latex K$ measurements of $latex \alpha$
$latex {}\qquad E_K[\alpha] = {1\over K} \sum_{k=1}^K \alpha_k \rightarrow \infty\; \hbox{as}\; K\rightarrow \infty$.
That is $latex E\left[{\lambda \over \lambda + \lambda’}\right]$ is undefined. The divergence is logarithmic, which is why it isn’t readily apparent in Monte Carlos that it doesn’t converge.
You have to require that $latex \lambda + \lambda’ > \varepsilon > 0$ in order for $latex E[\cdots]$ to remain finite.
In the case of $latex \lambda + \lambda’ > 0$ (as with the problem I was considering where $latex \lambda’$ was the measurement error in power, a positive definite quantity), there is no issue with convergence.
I think that explains why $latex E_K[\cdots]$ is “hella noisy” in this case.
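A quick numerical illustration of why the sample mean misbehaves when λ+λ’ can wander near zero (the spread of 0.12 is an arbitrary value roughly in the ballpark of the cartoon’s slope errors, and the exact numbers printed depend entirely on the random seed):

import numpy as np

rng = np.random.default_rng(2)
lam, sd = 0.32, 0.12                     # slope and an assumed spread for the slope error

draws = lam / (lam + rng.normal(0.0, sd, size=1_000_000))
running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)

# The running mean never settles down: occasional draws with lam + lam' near zero produce
# enormous ratios (of either sign), so the sample mean keeps jumping as K grows.
for k in (10**3, 10**4, 10**5, 10**6):
    print(f"K = {k:>7}: running mean = {running_mean[k - 1]: .3f}")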
Carrick–
I think that’s why you must screen with this method. By screening, they ensure that λ+λ’>0 during the calibration period for all retained proxies. But of course, this doesn’t solve the bias problem, which exists as long as there is any chance the things you are eliminating actually have λ>0. But you don’t know λ. You only know λest = λ+λ’ during the period where you estimated the trend. So you can’t screen on the real value of λ!
lucia:
I couldn’t agree more. Of course, sometimes what gets screened out would have been screened out even without the pre-screening, so some would probably argue it “doesn’t matter” if they do this.
The biggest problem is if it happens once, the issue can propagate through dozens of other papers simply because they trust the work of their colleagues. People often make comments about how an issue in one paper doesn’t matter because it is just one paper, but uncorrected (and even some corrected) errors get made over and over.
I know you’re probably not interested in delving into the various proxies reconstructions use, but I want to give an example. In the first C&L paper linked to in this thread, one of the proxies is the Avam-Taimyr series. Prior to a 2008 paper by Keith Briffa, the series was just the Taimyr series, and it did not have a hockey stick. Then, the series got updated by adding tree ring data from the Bol’shoi Avam site, several hundred kilometers away. The 2008 paper reported this addition (though it didn’t explain why that data, rather than other data, was added). However, it did not report the addition of other data from a third site.
After this update, the series had something of a hockey stick. Had data from other nearby (in fact, closer) sites been added instead, the resulting series wouldn’t have had a hockey stick.
This is just one example, but if we know one bad series gets used, how can we trust the rest? And how do we estimate the effect this sort of issue has? I’m not sure we can.
I’m never sure how much effort I should put into getting quotes and citations when discussing things on a blog like this. However, I should point out the full Briffa 2008 paper I mention is available here.
Anyone can read it and see if an adequate explanation is given as to why the particular data added was added rather than some other data from the same region (or in fact, from closer sites).
“If you are making the point that one can only work with data one has, then it’s true: you have no choice, you have no choice.”
It’s more than that – it’s what you’re trying to do with the data too. Though finding enough local data (e.g. ice cores) to actually calibrate with is hard.
Take CPS. There you average the proxies before you even look at temperature. There is no place for local temperatures in the algorithm (what’s local for an average?). And again, if local T isn’t in your algorithm, there’s no point in worrying about whether it is a better predictor.
My second category (Hegerl et al, Gergis) average immediately after T-P regression. That only works if regression produces a common output. Otherwise you’re back with Cat 3 – LOC, where you have to spatially aggregate the fitted local T’s.
Nick–
Presumably “what you are trying to do” is get an unbiased proxy reconstruction with the lowest noise possible from available or obtainable data. The ideal in statistics is BLUE. If you can’t do that, you want to minimize bias or know what it is, and/or minimize noisiness. But for what you want to learn in most proxy reconstructions related to AGW, smoothing a reconstruction can deal with ‘noise’ provided that the reconstruction isn’t biased.
But if “what you are trying to do” is “I am trying to do CPS because that’s what my teacher assigned me to do so I can learn CPS”, then of course the relative merits of CPS, LOC or any other method one might propose are irrelevant. You are trying to do what you need to do to get 5 points on your homework.
But what’s the point of observing this? I thought the conversation is about which methods would work better. The fact that CPS doesn’t use local data to calibrate only means CPS doesn’t use local data to calibrate. It doesn’t mean that one couldn’t come up with better methods if one would decide to use local data. And certainly the fact that CPS doesn’t use local data doesn’t magically mean that methods that use local data aren’t inherently better than ones that don’t.
Well…. no. Not using CPS doesn’t mean you’re back to Cat 3-LOC.
Christiansen proposed LOC and his LOC is done a certain specific way. That doesn’t mean the only way to use local data is to spatially aggregate the fitted local T’s. You could propose to (see the sketch after this list):
a) fit trends to local T’s, to get N trends, m.
b) compute the mean trend ‘m’ over all N.
c) find the mean Proxy value over N.
d) divide the mean proxy value by the mean trend.
If you felt the need to screen, you could screen on local trends.
This process wouldn’t be LOC. It would be a different method. (Evidently a new one!)
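Here is a sketch of steps (a)–(d) applied to the same cartoon setup used in the post. In the cartoon every proxy shares the same ‘local’ temperature, so the question of how to aggregate different local fits spatially doesn’t arise, and all the specific values (record length, seed, phase) are the illustrative ones from the earlier sketches. Unscreened, this comes out essentially unbiased, consistent with my comments about ‘method 1’ earlier.

import numpy as np

rng = np.random.default_rng(1)
n_years, n_cal, n_proxies, lam_true = 480, 80, 1000, 0.32
t = np.arange(n_years)
temp = np.sin(2 * np.pi * (t + 30) / 160.0)
proxies = lam_true * temp + temp.std() * np.sqrt(1 - lam_true**2) * rng.normal(size=(n_proxies, n_years))
cal = slice(n_years - n_cal, n_years)

# (a) fit each proxy against local temperature over the calibration window
Tc = temp[cal] - temp[cal].mean()
Pc = proxies[:, cal] - proxies[:, cal].mean(axis=1, keepdims=True)
slopes = (Pc * Tc).sum(axis=1) / (Tc * Tc).sum()

# (b) average the N slopes, (c) average the (centered) proxies, (d) divide the mean proxy by the mean slope
m_bar = slopes.mean()
P_bar = (proxies - proxies[:, cal].mean(axis=1, keepdims=True)).mean(axis=0)
recon = P_bar / m_bar + temp[cal].mean()
print(f"std(recon)/std(true) = {recon.std() / temp.std():.2f}")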
Just because a method isn’t what LOC or CPS do doesn’t mean one can’t do it. It might mean you have to provide a longer explanation of the method to get your stuff published– but it doesn’t mean you can’t do it. More importantly, it doesn’t mean the new method isn’t better than either LOC or CPS. As it can be done, the questions are:
1) What is the bias of this new method relative to the other ones and
2) How noisy is it.
3) Are data available to permit it.
But wherever you could use LOC or CPS, you could also do things in this order.
Lucia:
Oddly that may be one of the problems with screening only for r ≥ r0. Too high of a correlation may be evidence for spurious correlation. It might be best to try the typical SNR in order to estimate a range of r values r0 ≤ r ≤ r1 to retain.
This would avoid the problems with failure to converge and could substantially reduce the amount of bias introduced by screening.
Nice example of a potential screening fallacy today, Tamsin Edwards @flimsin on twitter:
“Willy Aspinall just gave one of his interesting talks on expert elicitation: collecting expert opinions to forecast volcanic eruptions.
They test the experts’ ability to estimate uncertainty with questions they know the answer to. Then ask them for predictions.
Experts are downweighted if their answers to test Qs are bad: overconfident, or too far from the truth.
Apparently it’s impossible to game the system. You get the best weight if you give your true belief, not what you think best test answer is. Which is cool. ‪#expertelicitation‬
This talk was part of a Royal Statistical Society meeting on natural hazards. Interesting day.”
That looks much too long to be a tweet. Did she put out multiple tweets?