CPS with ‘variance matching’: (Will be method IV)

When reading Mann08, I realized I need a “method IV” in my previous graphs. In method IV, I will rescale the proxy composite (the mean over proxies) so that its variance matches the variance of the temperature during the calibration period. This will (to some extent) mimic the portion highlighted below.

All proxies available within a given instrumental surface temperature grid box were then averaged and scaled to the same mean and decadal standard deviation as the nearest available 5° latitude by longitude instrumental surface temperature grid box series over the calibration period. […] The gridded proxy data were then areally weighted, spatially averaged over the target hemisphere, and scaled to have the same mean and decadal standard deviation (we refer to the latter as ‘‘variance matching’’) as the target hemispheric (NH or SH) mean temperature series.

I say to some extent because for the time being I am not adding the decadal smoothing (which will matter for one of the features in the figures below.)
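
In code, the rescaling step amounts to something like the sketch below (Python/numpy; the function and variable names are just for illustration, and since I am skipping the decadal smoothing for now, the standard deviations are computed on unsmoothed values):

```python
import numpy as np

def variance_match(composite, target, calib):
    """Shift and scale the proxy composite so that, over the calibration
    indices `calib`, it has the same mean and standard deviation as the
    target temperature series."""
    scale = np.std(target[calib]) / np.std(composite[calib])
    return (composite - np.mean(composite[calib])) * scale + np.mean(target[calib])
```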

I quickly slapped together a script to illustrate that this method is also biased under screening. Because it can be shown to have a second type of bias that exists only when there are a small number of proxies, I am showing one graph created starting with 50 proxies and another with 484. Blue is used for all proxies; green for the reconstructions when we screen those proxies. Screening is done against global temperatures. The calibration period is 80 years.
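
For anyone who wants to play along, here is a rough sketch of the sort of Monte Carlo experiment I mean (the toy temperature series, noise level, proxy model and significance threshold are placeholders, not the exact choices in my script):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_years, n_calib = 580, 80            # total length; last 80 years = calibration
n_proxies = 50                        # try 50, then 484
noise_sd = 3.0                        # proxy noise level (placeholder)

# Toy "temperature": AR(1) variability plus recent warming, so the
# calibration period is warmer than the reconstruction period.
temp = np.zeros(n_years)
for t in range(1, n_years):
    temp[t] = 0.9 * temp[t - 1] + rng.normal(scale=0.3)
temp[-n_calib:] += np.linspace(0.0, 1.0, n_calib)

calib = np.arange(n_years - n_calib, n_years)   # calibration indices
recon = np.arange(0, n_years - n_calib)         # reconstruction indices

# Each proxy = temperature signal + white noise.
proxies = temp[None, :] + rng.normal(scale=noise_sd, size=(n_proxies, n_years))

def cps_variance_match(prox, target, calib):
    """Average the proxies, then scale the composite to the same mean and
    standard deviation as the target over the calibration period."""
    comp = prox.mean(axis=0)
    scale = target[calib].std() / comp[calib].std()
    return (comp - comp[calib].mean()) * scale + target[calib].mean()

# Unscreened ("blue") reconstruction.
rec_all = cps_variance_match(proxies, temp, calib)

# Screened ("green"): keep proxies with a positive, "significant" correlation
# with the target over the calibration period.
keep = []
for i in range(n_proxies):
    r, p = stats.pearsonr(proxies[i, calib], temp[calib])
    if r > 0 and p < 0.05:
        keep.append(i)
rec_scr = cps_variance_match(proxies[keep], temp, calib)

true_mean = temp[recon].mean()
print("bias, unscreened:", rec_all[recon].mean() - true_mean)
print("bias, screened:  ", rec_scr[recon].mean() - true_mean)
```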

In the graph below, red is the target value the reconstruction should match:

If you examine the “blue” (unscreened) trace, you can see that it is noisy because there are only 50 proxies. More importantly, the mean value over the reconstruction period is biased high, and the bias is statistically significant. The green trace illustrates what happens if we screen, selecting only those proxies that exhibited a statistically significant correlation with the target temperature during the calibration period. The bias increases.

In the following graph, I increased the number of proxies to 484.

Notice that the bias for the blue (unscreened) result is now non-zero but small. This is because CPS that scales using ‘‘variance matching’’ is biased when few proxies are used, whether or not we screen, and that bias vanishes as we increase the number of proxies. In contrast, the green (screened) trace remains biased, because the screening bias does not vanish when we increase the number of proxies.

Because this method is used in practice, I’ll be adding this to my other tests using “toy” temperatures, calling it “method IV”. That way we can compare the relative bias for different numbers of proxies, calibration periods, screening, and other analytical choices a reconstruct-o-logist might make.

34 thoughts on “CPS with ‘variance matching’: (Will be method IV)”

  1. Humm…. I do believe that you will soon know a lot more about selected proxy reconstructions than the Mann himself. Maybe you can publish your work and become a leader in the field.

  2. SteveF–
    I do want to get a bunch of these down. I’m pretty sure that one thing that needs to be advocated for is reporting a particular skill metric. These bias errors are all of the form

    T_rec/T_true = A where A is some value. So… there may need to be some check on the magnitude of these ratios in a validation period. For unbiased, A=1. Otherwise… not.

    Maybe they already look at this… or not. I don’t know. But this is a metric that would help people detect these biases, whereas some others whose definition I read might detect a little, but not specifically bias. (Of course, other ways could detect it too.)
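
    Roughly, the check I have in mind looks like this (a sketch; the names and the ratio-of-means form are just illustrative):

    ```python
    import numpy as np

    def bias_ratio(t_rec, t_true, valid):
        """A = mean(reconstruction) / mean(truth) over a validation period.
        A near 1 suggests little mean bias; A far from 1 flags one."""
        return np.mean(t_rec[valid]) / np.mean(t_true[valid])
    ```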

  3. I may be wrong, but one clear message seems to be that unscreened (by correlation with temperature) proxies don’t often generate biases in the reconstruction period, while screened proxies usually do. Given that fact, choosing to use screening based on correlation with instrument temperature would seem at best dubious, and at worst very unwise. A minimum requirement if screening is used would be to show that the reconstruction method does not generate a bias using synthetic data.

  4. Isn’t this more like method 3B than method IV? There are an infinite number of ways to rescale one’s results after averaging. It seems silly to label each as its own method. Heck, you could rescale the results given by any method in the same ways.

  5. Brandon

    Isn’t this more like method 3B than method IV? There are an infinite number of ways to rescale one’s results after averaging.

    Maybe. But I don’t intend to have a complex numbering system that tries to communicate anything about the relationship between method “i” and method “j”. In my script and posts it will be called “method IV”. Otherwise, we’ll start having things like “IIIb,ix” to mean “CPS with variance matching applied to proxies that were decadally smoothed in the frequency domain”, etc.
    For now, as I add any, I’ll just number I, II, III, IV and so on.

  6. Regarding screening, it seems to me that you have an informational advantage in your simulations that actual researchers do not: you know that all your proxies actually reflect the underlying data, because you built them that way. Any screening on any metric must therefore (in your simulations) exclude good data. So it’s not too surprising that a reconstruction that includes all good data is more skillful than one that does not.

    But does a researcher always know this? Species X might have different metabolic pathways than species Y, and therefore might or might not be a better or worse proxy. In a large data universe, how do we know which datasets are or are not effective temperature proxies, unless we test? And if a particular dataset fails a test, isn’t it appropriate and correct to exclude the dataset?

    I’d like to propose therefore that you include in your list of proxies some proxies which (by design) do not necessarily reflect the underlying data (e.g., totally random proxies), and see if the negative effects of screening still exist. My guess is there is a happy medium in there somewhere, and it would be useful to find out where it is.

  7. KAP–
    Researchers can test their methods against synthetic data. It’s a pretty standard thing to do.

    But does a researcher always know this?

    No. But the problem that arises when you screen data is well understood. It’s worse than this problem!

    And if a particular dataset fails a test, isn’t it appropriate and correct to exclude the dataset?

    Is this a rhetorical question?

    If you know that a certain result is what you would get when you screen noise, and you get that result after screening, it makes no sense to believe the result is “signal”. If you know a certain thing happens when you screen good data and your answer appears to have that feature, it makes no sense to conclude the feature is in the data.

    I’d like to propose therefore that you include in your list of proxies some proxies which (by design) do not necessarily reflect the underlying data (i.e., totally random proxies), and see if the negative effects of screening still exist. My guess is there is a happy medium in there somewhere, and it would be useful to find out where it is.

    I’ve done some of that. But not much. I agree it is something that needs to be thought about. But do consider: In the limit that you have 100% noise, screening can create hockey sticks out of noise. In the limit that you have 100% proxies, each containing some signal plus some noise, screening can exaggerate hockey sticks (through loss of variance).

    In these two limits, not screening won’t cause harm. In the case of all pure noise proxies you learn nothing but at least you don’t trick yourself. In the case of 100% meaningful proxies, you will get the correct answer– provided you have lots of proxies.

  8. In the limit that you have 100% noise, screening can create hockey sticks out of noise.

    Well, not really. Screened data is always non-random data, i.e., screening creates a signal out of noise. That’s why evolution works.

  9. I don’t see why you think that screened data is always non-random.

    It may have correlation in it, but that could be introduced by the screening process, and may have nothing to do with the underlying data or the process you are trying to study.

    I don’t think evolution works by that mechanism at all. Perhaps you can explain why “that’s why evolution works” makes any sense in that context.

  10. Screened random noise is still random, it’s just correlated. Having a nonzero correlation between samples doesn’t make it non-random.

  11. @lucia

    If you know that a certain result is what you would get when you screen noise, and you get that result after screening, it makes no sense to believe the result is “signal”. If you know a certain thing happens when you screen good data and your answer appears to have that feature, it makes no sense to conclude the feature is in the data.

    But how big is the signal compared to the expected noise?

  12. I don’t see why you think that screened data is always non-random.

    Because it is, except in the special case where your screening is also random. But in fact, screening is never random, because that’s why you screen: you have some criterion on which to screen. Therefore only certain datasets pass the screen, and that insures there is signal in the screened data, regardless of the datasets’ origins.

    Simple case: roll a die and record the results. The resulting dataset is random. Now screen and look at only the sixes. A dataset of sixes is the result, and that is a non-random dataset: even though it was generated randomly, the screen insures non-randomness.

    More complex case: Mutations are random. Natural selection is the screen. Complex organisms are the highly nonrandom result.

  13. Is this a rhetorical question?
    If you know that a certain result is what you would get when you screen noise, and you get that result after screening, it makes no sense to believe the result is “signal”.

    Hmmm, I think we’re dancing around a semantic difference here rather than a mathematical one. So I guess it’s a rhetorical answer … 🙂 But yes, I would call it a “signal” in the data. If you didn’t know the data was screened, you would detect it as such.

    If you know a certain thing happens when you screen good data and your answer appears to have that feature, it makes no sense to conclude the feature is in the data.

    But the feature IS in the data: you insured it was there with the screening process!

    The rhetorical difference we seem to be having is not whether the feature is in the data (it is), but whether the feature should be labelled “random” or “noise” rather than “non-random” or “signal”. N’cest pas?

  14. Screened random noise is still random, it’s just correlated. Having a nonzero correlation between samples doesn’t make it non-random.

    Nonsense. If you want truly random data, there is no reason to screen.

    Why then do you screen? To create non-randomness out of randomness.

  15. KAP

    Well, not really. Screened data is always non-random data, i.e., screening creates a signal out of noise. That’s why evolution works.

    When screening noise, the output is the shape of the screen. If you interpret this as telling you something about anything other than the screen, you will make an error.

    Why then do you screen? To create non-randomness out of randomness.

    Uhmm… that may be a purpose when screening mud to separate gold from “not gold”. It’s not the intended purpose when doing a proxy reconstruction.

  16. When screening noise, the output is the shape of the screen. If you interpret this as telling you something about anything other than the screen, you will make an error.

    Absolutely correct. Thus, if screening proxies with the instrumental record, a set of screened proxies will have a “blade” of a hockey stick, because the instrumental record has that. A proxy that does not reflect the instrumental record is not skillful.

    But that tells us nothing at all about the shape of the pre-instrumental portion of the reconstruction, nor what that should look like in the slightest.

  17. Uhmm… that may be a purpose when screening mud to separate gold from “not gold”. It’s not the intended purpose when doing a proxy reconstruction.

    Well, proxies are not supposed to be random in the first place. But when you screen proxies, you do so to insure that the proxy reflects a known signal in the instrumental record. So yes, you’re screening to insure that there is a signal there; or at least, that is the effect of the screen.

  18. KAP

    So yes, you’re screening to insure that there is a signal there; or at least, that is the effect of the screen.

    This is not the effect of the screen. It is what people who don’t understand screening imagine the effect to be. Unfortunately, they are mistaken.

  19. This is not the effect of the screen. It is what people who don’t understand screening imagine the effect to be. Unfortunately, they are mistaken.

    So what then is the effect of the screen, in your view?

  20. When the proxies consist of all noise, the screen picks out time series whose noise happens to match the screen during the calibration period. This does not turn noise into “signal”. Moreover, the noise that accidentally matched the screen during the calibration period remains noise during the reconstruction period. So, we can’t learn anything about the reconstruction period from analyzing the noise in that proxy.

    This is easy to show using Excel spreadsheets or a bit of Monte Carlo. It’s been shown here and at other places, and is accepted as true.
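
    For instance, something along these lines (a sketch; the noise model and significance threshold are placeholders):

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_years, n_calib, n_proxies = 580, 80, 1000
    calib = np.arange(n_years - n_calib, n_years)
    recon = np.arange(0, n_years - n_calib)

    target = rng.normal(size=n_years)                 # stand-in "temperature"
    proxies = rng.normal(size=(n_proxies, n_years))   # pure noise, no signal at all

    # Screen: keep noise series with a positive, "significant" calibration correlation.
    keep = []
    for i in range(n_proxies):
        r, p = stats.pearsonr(proxies[i, calib], target[calib])
        if r > 0 and p < 0.05:
            keep.append(i)

    comp = proxies[keep].mean(axis=0)
    print(len(keep), "of", n_proxies, "noise proxies pass screening")
    print("composite vs. target, calibration period:   ",
          np.corrcoef(comp[calib], target[calib])[0, 1])
    print("composite vs. target, reconstruction period:",
          np.corrcoef(comp[recon], target[recon])[0, 1])
    ```

    In runs of this sort, the screened composite correlates with the target during the calibration period (the shape of the screen) and has essentially no relationship to it during the reconstruction period.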

  21. KAP, excuse my french but if you are going to use the language it’s “n’est-ce pas?” and not “N’cest pas?”

  22. When the proxies consist of all noise, the screen picks out time series whose noise happens to match the screen during the calibration period. This does not turn noise into “signal”.

    Once again our difference seems to be semantics rather than math. For my part, numbers are numbers, and a dataset that matches a signal contains signal, regardless of the source of the dataset.

    Moreover, the noise that accidentally matched the screen during the calibration period remains noise during the reconstruction period.

    I assume that by “reconstruction period” you’re referring to the non-calibration period. A sculptor can look through a landfill and select junk that he turns into valuable art. If you look at the result and see junk, I won’t argue too much; I just prefer to see the art. The act of selection doesn’t change any individual item, but it does change the aggregate.

    So, we can’t learn anything about the reconstruction period from analyzing the noise in that proxy.

    Exactly right, for the reconstruction period. But during the calibration period, it seems (to me, at least) misleading to believe that the “blade” of the hockey stick is random, when actually it has been imposed by the researcher via the selection screen.

  23. A sculptor can look through a landfill and select junk that he turns into valuable art.

    We might have fewer semantic issues if you used the terms “noise” or “signal” the way other people do and didn’t turn to metaphors about art.

    when actually it has been imposed by the researcher via the selection screen.

    In some cases, the blade has been imposed by screening.

    Honestly, I am finding it difficult to figure out what idea you are trying to express.

  24. Honestly, I am finding it difficult to figure out what idea you are trying to express.

    Not surprising because KAP is talking BS. References to gold, mutations and art out of junk are just nonsense. According to him/her everything that passes “screening” is a meaningful proxy.

    I would bet that this person has not had any statistical training beyond a superficial level. Airy-fairy takes on a whole new meaning…

  25. Landfills? Art? Whoa… things are getting pretty strange! 🙂
    At least KAP demonstrates why it is unwise to enter in the middle of a long technical thread, guns ablaze, making a perfect fool of yourself. For your own sake KAP, read, think, understand, and THEN comment. Jeesh…..

  26. Kap:

    A sculptor can look through a landfill and select junk that he turns into valuable art.

    But it has nothing to do with “signals”. Try this definition: “A fluctuating quantity or impulse whose variations represent information.” Generally a “signal” characterizes a variation in a measured quantity P that covaries with some known “independent” quantity T.

    It implies the presence of correlation between P and T, but the implication is that this correlation is stable and reproducible. The implication of this in turn is that it is beholden upon the scientist to demonstrate that an observed correlation between P and T is a “signal” in that it meets the criteria of being stable and reproducible. One doesn’t simply assume these ab initio, unless you like getting laughed at by the larger scientific body, even if, like some paleoclimatesciencemunchkins, you think this is an appropriate thing to do.

    Trash that somebody finds interesting in a junkyard has no relevance to signals, just to patterns of junk (noise) that people find interesting. It is, in fact, the quintessential example of retrospective analysis of data to spot patterns, then publish them, without bothering to see if they are replicable on non-cherry-picked data.

  27. I’ll just ask my question again.

    bugs (Comment #99953)
    July 25th, 2012 at 2:49 am

    @lucia

    If you know that a certain result is what you would get when you screen noise, and you get that result after screening, it makes no sense to believe the result is “signal”. If you know a certain thing happens when you screen good data and your answer appears to have that feature, it makes no sense to conclude the feature is in the data.

    But how big is the signal compared to the expected noise?

  28. Carrick 100003,
    Well summarized. I particularly like the “beholden upon”; that is one of the factors which separates pseudoscientific mumbo-jumbo from serious science.

  29. bugs:

    But how big is the signal compared to the expected noise?

    Part of the philosophy of measurement theory is “you don’t know”.

    This is why experimental/observational design is important in teasing out this information.

    There are some measurements for which the answers can’t be satisfactorily provided, for which we are forced to conclude that true measurement of the quantity cannot (yet) be made unambiguously.

    In most fields of science, this is accepted, and people work to improve the methodology so that they can lay claim to “being first” in taking a rigorous measurement of the quantity in question. In other fields, people aren’t as well trained, and you end up with a sloppy mess of people equating the uniformity principle (which as I’ve said relates to physical law, not to observed correlations between measured quantities) with the necessary conditions for establishing that you have a “signal.”

    It may be that for climate the best we can do in paleoreconstructions is have some non-dimensional “optimality index”. This is great if you want to look at how climate varies over time (e.g. influence of ENSO), but aren’t particularly concerned about whether this “optimality” is related only to temperature or precipitation. (Shades of Soon and Baliunas, maybe without the unneeded polemics at the end of their paper that got them in hot water).

    True temperature reconstructions are I think possible. At the moment, I’m equally convinced that they haven’t been fully achieved, but if you look at the data in this light, they are still interesting.

  30. SteveF:

    Well summarized. I particularly like the “beholden upon”; that is one of the factors which separates pseudoscientific mumbo-jumbo from serious science.

    Yes indeed, this is the criterion for separating science from pseudo-science.

    I haven’t gotten into this, but there are other fields (pharmacology and nutrition) that suffer from similar aspects of pseudoscientific treatment of data. I say this because there are some who are supersensitive to any criticism of climate science (even though being able to withstand criticism is another measure of the robustness of a result) who might think I pick on climate science singularly and unduly.

    Like climate, there are substantial policy and economic incentives in these other fields to take short-cuts, with the biggest distinction in behavior being that incidents of misbehavior (withholding of adverse results being an example) and outright fraud usually get called out rather than being papered over or whitewashed.

    Unlike climate science, where equating people who disagree with you with insects seems to be the norm… see, for example, this guy, who by the way thinks he is an expert on communicating with the public.

  31. Re: bugs (Jul 26 03:42),
    A thought experiment.
    Let’s assume you have an old, analog TV receiver and you’ve tuned it to channel 3. The TV receiver is functioning properly and it is filtering for channel 3, which means it is reducing the signal from other channels and amplifying the band where the broadcast should be. In your receiving area there are no analog TV transmitters, so there is no signal, but of course there is random noise.

    What will you see? How long will you need to wait to see the re-run of MASH episode 1?

    Lucia has defined the problem thus and the result is the same.
    bob
