All posts by Brandon Shollenberger

The Saga Continues

As readers will remember, the recent Cook et al paper sought to examine the “consensus” on global warming but failed to state what that consensus actually is. We’ve tried to figure it out, and the current theory is there is no definition for it. I’ve recently pursued this matter at Skeptical Science in the hopes of clarifying things. After about forty comments, Tom Curtis stood up and tried to address the issue. I won’t quote his entire comment as it is rather long, but his claim boils down to:

For consistency, therefore, endorsing AGW in (1) must mean endorsing AGW as equal to or greater than 50% of the cause of recent warming. It follows, on the grounds of consistency that that is the meaning of “endorses AGW” whereever it occurs in the paper.

Ergo though John Cook may have lacked an explicit definition of endorsement, he and the raters had an implicit definition which is in the paper. What is more, that implicit definition is, or is very close to the tacit definition actually used by raters in rating abstracts.

Curtis says there is a specific definition. He says John Cook and Dana Nuccitelli were both wrong to say otherwise. Curtis participated in this project, so we have one participant saying the creator of the rating system and the person running the project were both wrong about it. That’s an odd claim, but let’s consider it.

The second category is listed as, “Explicitly endorses but does not quantify or minimize.” If we replace “endorses” with the phrase Tom Curtis says it means, we get, “Explicitly endorses AGW as >50% but does not quantify or minimize.” That makes no sense. Even if we ignore the fact it’s inherently contradictory, if category two endorses AGW as >50%, there is no difference between it and category one. Clearly, Curtis’s answer is wrong. I tried to discuss this issue with him, but my comment got deleted.

Which brings us to a continuation of a subject we’ve previously discussed, the Skeptical Science group’s usage of words in… unusual ways. Skeptical Science has a Comments Policy which forbids a thing they call sloganeering. They define it as:

No sloganeering. Comments consisting of simple assertion of a myth already debunked by one of the main articles, and which contain no relevant counter argument or evidence from the peer reviewed literature constitutes trolling rather than genuine discussion. As such they will be deleted. If you think our debunking of one of those myths is in error, you are welcome to discuss that on the relevant thread, provided you give substantial reasons for believing the debunking is in error. It is asked that you do not clutter up threads by responding to comments that consist just of slogans.

To qualify as sloganeering, a comment must include an “assertion of a myth already debunked by one of the main articles.” The entire point I’ve been making is that the Skeptical Science group has never defined the “consensus” they sought to study. The definition has never been discussed by them, much less debunked in one of their posts. That means nothing I said could possibly qualify as sloganeering. Despite this, a moderator told me I was engaging in sloganeering. I responded by observing that nothing I had done could possibly qualify as sloganeering, and I asked what I was supposed to stop doing. That comment got deleted as a “moderation complaint” despite containing no complaint at all.

The stated rules of Skeptical Science have little bearing on the actual rules of Skeptical Science. This apparently stems from the fact that to them, words mean just what they choose them to mean. It may be convenient for them, but it is extremely inconvenient for anyone who has the audacity to ask a simple question like, “What is the consensus?” Not only have Cook et al failed to answer that question, they’re now refusing to answer it and abusing those who dare to ask.

It’s remarkable really. For all the attention the media has given this paper, nobody has bothered to ask, “What is the consensus?” That needs to change.

Climate Science P0rn

There’s no way to define it. We just know it when we see it. That’s how Supreme Court Justice Potter Stewart described p0rn. And that’s what John Cook and pals were looking for.

I kid you not. After all our discussion of what the “consensus” is, or what it even means, it turns out the answer was in the leaked Skeptical Science forum all along. In a topic titled, “Defining the scientific consensus,” John Cook explains:

Okay, so we’ve ruled out a definition of AGW being “any amount of human influence” or “more than 50% human influence”. We’re basically going with Ari’s p0rno approach (I probably should stop calling it that 🙂 which is AGW = “humans are causing global warming”. Eg – no specific quantification which is the only way we can do it considering the breadth of papers we’re surveying.

That’s right. They were completely aware of the two possible consensus positions. They made dozens of posts talking about them. Dana Nuccitelli even wanted to discuss both.

The way I see the final paper is that we’ll conclude ‘There’s an x% consensus supporting the AGW theory, and y% explicitly put the human contribution at >50%

But instead, they rejected any sort of scientific process in favor of hand-waving value judgments. They intentionally rejected all explicit or clear definitions for their “consensus” in favor of the “p0rno approach.”

So now we know. 97% of climate science is p0rn. And Barack Obama likes it 😉

Why Symmetry is Bad

Today’s post is going to be a discussion of a technical point. It assumes you’re familiar with the ongoing discussion of Cook et al and the ratings used in that paper.

Cook et al’s study was guaranteed to be bad from the start because of an incredibly basic mistake: They tried to force their ratings to be symmetrical. That was a terrible idea. To understand why, you first have to realize Cook et al examined two different “consensus positions”:

1) Humans cause some global warming.
2) Humans cause most global warming.

If they had studied only one of these positions, a simple, symmetrical scale would have been appropriate. However, the two positions overlap each other, and that means the resulting scale cannot be symmetrical.

It will be clearer with a demonstration. For this demonstration, we’ll discard the distinction between explicit and implicit, and there will be no neutral papers. That means we’ll want to look at endorse/reject and +/-50%. That gives us four possible combinations.

1) Endorse, +50%
2) Endorse, -50%
3) Reject, -50%
4) Reject, +50%

There’s no way to reject and quantify something at the same time, so we discard the second part of 3) and 4). That makes the two identical, leaving us with just three categories:

1) Endorse, +50%
2) Endorse, -50%
3) Reject

As you can see, there is no symmetry. That is what Cook et al’s rating system should have been like. Specifically, it should have been:

1) Explicitly Endorse, +50%
2) Implicitly Endorse, +50%
3) Explicitly Endorse, -50%
4) Implicitly Endorse, -50%
5) Explicitly Endorse
6) Implicitly Endorse
7) Neutral
8) Implicitly Reject
9) Explicitly Reject

You’ll note the lack of symmetry in no way hurts this rating system. The categories are all disjoint, and we can examine the “consensus” on either position 1) or 2) just by adding different categories together. This is different from Cook et al’s rating system where abstracts could fit multiple categories. That’s important because the overlapping categories made it impossible to extract actual values for position 1), and it made it impossible to do “apples-to-apples” comparisons between any categories.
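To make the arithmetic concrete, here is a minimal sketch of how disjoint categories let you compute either “consensus” with simple addition. The category counts below are invented purely for illustration; they are not values from the paper or from any rating exercise:

```python
# Hypothetical counts for the nine disjoint categories listed above.
# The numbers are made up for illustration only.
counts = {
    "explicit_endorse_gt50": 60,
    "implicit_endorse_gt50": 40,
    "explicit_endorse_lt50": 30,
    "implicit_endorse_lt50": 20,
    "explicit_endorse_unquantified": 900,
    "implicit_endorse_unquantified": 2500,
    "neutral": 8000,
    "implicit_reject": 50,
    "explicit_reject": 30,
}

total = sum(counts.values())

# Position 1 ("humans cause some global warming") is endorsed by every
# endorsement category, quantified or not.
position1 = sum(v for k, v in counts.items() if "endorse" in k)

# Position 2 ("humans cause most global warming") is endorsed only by the
# categories that put the human contribution above 50%.
position2 = sum(v for k, v in counts.items() if "gt50" in k)

print(f"Position 1 endorsement: {position1 / total:.1%}")
print(f"Position 2 endorsement: {position2 / total:.1%}")
```

Because no abstract can sit in two categories at once, either percentage is a straightforward sum, and the two positions can be compared directly.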

I have no idea why John Cook would have thought his scale was a good choice. My best guess is he simply thought the scale should be symmetrical. It’s an amateur mistake that shows he had no real idea of how to do this sort of thing, and he didn’t bother to talk to someone who did.

On the Consensus

John Cook, proprietor of the website Skeptical Science, recently published a paper with the help of members of his site. They describe their study as examining the abstracts of “over 12,000 peer-reviewed climate science papers” and finding “a 97% consensus in the peer-reviewed literature that humans are causing global warming.” This study has received media fanfare, and even Barack Obama, the President of the United States, tweeted about it.

We’ve been having fun on this site with this study, but what I say next I cannot say with any humor. It is simply too serious. Skeptical Science recently invited people to rate the 12,000+ abstracts via its interactive rating system so they could “measure the climate consensus” themselves. An additional feature of the system allows users to view the abstracts, as well as the ratings given by the people behind the paper.

The guidelines for rating these abstracts show that only the highest rating category blames the majority of global warming on humans. No other rating says how much humans contribute to global warming. The only time an abstract is rated as saying how much humans contribute is when it mentions:

that human activity is a dominant influence or has caused most of recent climate change (>50%).

If we use the system’s search feature for abstracts that meet this requirement, we get 65 results. That is 65, out of the 12,000+ examined abstracts. Not only is that value incredibly small, it is smaller than another value listed in the paper:

Reject AGW 0.7% (78)

Remembering AGW stands for anthropogenic global warming, or global warming caused by humans, take a minute to let that sink in.  This study done by John Cook and others, praised by the President of the United States, found more scientific publications whose abstracts reject global warming than say humans are primarily to blame for it.
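The arithmetic is simple. As a back-of-the-envelope check, using the 11,944 abstracts the paper reports analyzing:

```python
# Quick check using the 11,944 abstracts reported in Cook et al.
total_abstracts = 11944

explicit_gt50 = 65   # abstracts the search returns as attributing >50% of warming to humans
reject_agw = 78      # abstracts the paper lists as rejecting AGW ("0.7% (78)")

print(f">50% endorsements: {explicit_gt50 / total_abstracts:.2%}")  # about 0.54%
print(f"Rejections:        {reject_agw / total_abstracts:.2%}")     # about 0.65%, which rounds to 0.7%
```

Both numbers are a fraction of one percent, and the rejections come out ahead.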

The “consensus” they’re promoting says it is more likely humans have a negligible impact on the planet’s warming than a large one.

I Do Not Think it Means What You Think it Means

John Cook, with the help of volunteers from Skeptical Science, recently published a paper seeking to quantify the consensus on global warming. There is much to be said about it, but the most interesting part may be the fact that the authors use what are, to put it charitably, novel definitions for words:

Each abstract was categorized by two independent, anonymized raters.

What would you consider “independent”? Would you consider raters independent if they participate in the same, small forum? How about if they are moderators for the same site? How about if they’ve published papers together in the last six months? Those are all true of “independent” raters in this project.

But how about this? What if the raters talked to each other about their ratings? Surely we can’t say people who work together to produce results are independent of each other. Nobody would call that independent. Just look at what Glenn Tamblyn said in the leaked SKS forums:

So I think now the Cone of Silence should descend while the ratings are done. Cheer each other on as far as the count is concerned, but don’t discuss ratings at all. If a reviewer finds an abstract to hard to classify, skip it and those ones can be dealt with at a later stage.

That makes sense. What doesn’t make sense is that people would make topics in the SKS forum like:

Does this mean what it seems to mean?
second opinion??
how to rate: Cool Dudes: The Denial Of Climate Change…

That’s right. The “independent” raters talked to each other about how to rate the papers. This must be some new form of independence I’ve never heard of. I’m not the only one thrown off by this. Sarah Green, one of the most active raters, observed the non-independence:

But, this is clearly not an independent poll, nor really a statistical exercise. We are just assisting in the effort to apply defined criteria to the abstracts with the goal of classifying them as objectively as possible.
Disagreements arise because neither the criteria nor the abstracts can be 100% precise. We have already gone down the path of trying to reach a consensus through the discussions of particular cases. From the start we would never be able to claim that ratings were done by independent, unbiased, or random people anyhow.

One must wonder at the fact an author of the paper calls the work independent despite having said just a year earlier, “we would never be able to claim” it is independent. Perhaps there is some new definition for “never” I’m unaware of.

Surely things can’t be any worse, right? I mean, you can’t get much more non-independent than talking to each other about what answers to give. About the only way you could be less independent is if you actually compared answers then changed the ones that disagreed so that they would match. And nobody would do that, right? I mean, John Cook would never suggest:

Once all the papers have been rated twice, I will add a new section to TCP: “Disagreements”. This page will show all the instances where someone has rated a paper differently to you…
What I suggest happens here is we all look through all the instances where we disagree with another rating, see what ratings/comments they have. If we agree with their ratings (perhaps it was an early rating back before some of our clarifying discussion or just a mistake), then we upgrade our rating to make it consistent with the other rating and it disappears from the list.

Oh… Um… At least the raters were anonymized? It’s not like John Cook published graphs showing the progress made by various raters, with their names listed or anything. Oh wait. He did.

But hey, at least the names were mostly user handles. It’s not like everyone knows who those people are or anything. I mean, most of those people wouldn’t be the nine authors of the paper or anything… right?

Oh for God’s sake!

I Tried

I tried. I tried to be generous. I tried to find some technical issue for why John Cook’s latest survey would not produce a random sample of the 12,000+ papers in his database. I tried to find some innocent programming mistake we could all understand.

John Cook insists that isn’t the case. In response to an e-mail I sent him, he said:*

Q1 & 2: I use an SQL query to randomly select 10 abstracts. I restricted the search to only papers that have received a “self-rating” from the author of the paper (a survey we ran in 2012) and also to make the survey a little easier to stomach for the participant, I restricted the search to abstracts under 1000 characters. Some of the abstracts are mind-boggingly long (which seems to defeat the purpose of having a short summary abstract but I digress). So the SQL query used was this:
SELECT * FROM papers WHERE Self_Rating > 0 AND Abstract != '' AND LENGTH(Abstract) < 1000 ORDER BY RAND() LIMIT 10

In other words, when John Cook e-mailed people claiming to invite readers “to peruse the abstracts of” “around 12,000 papers,” it wasn’t true. When he posted on Skeptical Science saying people are “invited to rate the abstracts of” “over 12,000 papers listed in the ‘Web of Science,'” it wasn’t true. And when the survey says:

You will be shown the title and abstracts (summary) for 10 randomly selected scientific papers, obtained from a search of the ISI ‘Web Of Science’ database for papers matching the topic ‘global warming’ or ‘global climate change’ between 1991 to 2011.

It isn’t true. In fact, it seems John Cook lied every time he’s told people what the survey is. Only, he doesn’t seem to believe he’s lying. Even after telling me about how the samples are not randomly chosen from the 12,000+ papers, he says, “The survey’s claim of randomness is correct.” It’s like saying:

I have a hundred papers here. I’m going to have you read one at random.

(This randomly chosen paper will be selected only from the papers whose numbers are prime.)

Except you leave off the parenthetical. John Cook apparently doesn’t believe that is lying. I think most people would disagree. But even if we don’t call it a lie, we must call it extremely deceptive.
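To make the distinction concrete, here is a minimal sketch of the difference between a random draw from all papers and a random draw from the restricted subset the quoted SQL describes. The paper records below are invented placeholders, not Cook’s actual database:

```python
import random

# Hypothetical stand-ins for the database rows; none of this is Cook's actual data.
papers = [
    {"id": i,
     "self_rating": random.choice([0, 0, 1, 2, 3]),     # 0 = no author self-rating
     "abstract_len": random.randint(200, 3000)}
    for i in range(12000)
]

# What the announcements describe: 10 papers drawn at random from all ~12,000.
full_sample = random.sample(papers, 10)

# What the quoted SQL query actually does: draw only from papers with an author
# self-rating and a non-empty abstract under 1000 characters.
eligible = [p for p in papers
            if p["self_rating"] > 0 and 0 < p["abstract_len"] < 1000]
restricted_sample = random.sample(eligible, 10)

print(f"Eligible pool: {len(eligible)} of {len(papers)} papers")
# The restricted draw is random *within the subset*, but it is not a random
# sample of the full 12,000-paper database.
```

The second draw is perfectly “random” in one narrow sense, which is presumably why Cook defends the claim, but it does not match what participants were told.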

 

*I don’t agree that John Cook’s code does exactly what he claims. For example, I’ve previously shown results where some samples were sorted and others were not, and his description doesn’t explain that. But I’ll leave discussion of technical issues like that to the comments of the previous post.

A Random Failure

John Cook, proprietor of Skeptical Science, recently asked people to host links to a survey he’s running. lucia has made several posts on this, and I’ve taken a particular interest in one aspect of the survey: its supposed randomness. Participants are told they’re being shown 10 randomly selected abstracts to review. They aren’t. The selection isn’t random.

That’s a strong conclusion. It’s possibly a damning conclusion. As such, I’m going to take a minute to explain where the idea originated. If you aren’t interested, just skip to the end of the post.

Believe it or not, the idea came from Skeptical Science itself. Specifically, it came from something a commenter there (chriskoz) said:

Interestingly, my sample included exactly the same “Agaves” (“Neutral”) paper that Oriolus Traillii mentioned @5, how probable is it? John, please make sure that the survey selects truly random selection for all participants (i.e. check your random generator).

He was referring to a comment ten hours earlier where a user (Oriolus Traillii) mentioned an abstract that appeared in his survey. I thought the fact both of them saw the same abstract was interesting, but I didn’t make much of it until I opened up the survey. To my amazement, I was given the exact same abstract as well.

That seemed incredible. Only a handful of people had discussed their results by this point, and two saw the same abstract I saw. There are over 12,000 abstracts to draw from, and each person is only shown 10. The similarity in our surveys was too much of a coincidence for me. I had to look into things.
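How much of a coincidence? A quick calculation, assuming for the sake of argument that each survey really were 10 abstracts drawn uniformly at random from roughly 12,000, suggests any two surveys should share even one abstract less than one percent of the time:

```python
from math import comb

N, k = 12000, 10  # roughly 12,000 abstracts, 10 shown per survey

# Probability that a second random 10-abstract survey avoids all 10 abstracts
# shown in the first survey.
p_no_overlap = comb(N - k, k) / comb(N, k)
print(f"P(two surveys share at least one abstract) = {1 - p_no_overlap:.4f}")
# Roughly 0.008, i.e. under a 1% chance for any given pair of surveys,
# and sharing one *specific* abstract is rarer still.
```

Three people seeing the same abstract among only a handful of reported surveys was well worth looking into.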

Now we’ll fast forward a few days (and skip past several wrong ideas) to reach last night. Last night, I examined the HTTP requests/responses used when communicating with the server hosting the survey. I was immediately struck by an idea I didn’t want to believe. It turns out John Cook used something called a “session ID” to help control what papers were presented to people. That was a huge mistake.  You don’t need to know the details (but go here if you want them). All you need to know is the session ID is stored on your computer. That means you can change it. That means you can change what papers you’re shown. You can do things like pull up a dozen sets of papers then pick one and go back to it.
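I don’t have Cook’s server code, so this is only a hypothetical illustration, but the design flaw is easy to sketch: if the 10 papers are derived deterministically from a session ID stored on the client, anyone can replay or shop among paper sets simply by editing that ID. Here the session ID seeding a standard generator stands in for whatever the server actually does:

```python
import random

papers = list(range(12000))  # stand-ins for the 12,000+ abstracts

def papers_for_session(session_id: str) -> list[int]:
    """Hypothetical server logic: derive the 10 'random' papers from the
    client-supplied session ID. Not Cook's actual code."""
    rng = random.Random(session_id)   # the session ID fully determines the draw
    return rng.sample(papers, 10)

# Because the session ID lives on the user's computer, the user controls it:
set_a = papers_for_session("1")
set_b = papers_for_session("2")
assert papers_for_session("1") == set_a   # replaying an ID reproduces its paper set

# So a participant can pull up a dozen different sets, pick the one they like,
# and go back to it whenever they want.
```

Any scheme where the client-side session ID determines the sample has this property, whatever generator sits behind it.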

I was shocked to find out such an obvious design flaw existed in the survey. But I was also grateful. This provided me an opportunity to try to test the random number generator (RNG) used to pick papers to display to people taking the survey. I spent a couple hours doing so. After the first few minutes, I knew something was wrong. After a few hours, I knew I could prove it. So I made this:

[Figure RNG_Test: papers returned by 16 consecutive surveys run with similar session IDs]

That is 160 numbers corresponding to 160 papers I got from 16 consecutive surveys. As you can see, 2 papers showed up three times in 16 surveys. Another 27 showed up twice. That is clearly not random. To be fair, it is also a lot less random than what most people have likely experienced. The reason is I picked session IDs I knew would emphasize the lack of randomness in the RNG. In other words, I exploited a flaw in the RNG.

That flaw is related to a concept called entropy, and it’s really quite simple. The sixteen session IDs I chose were very similar. My first session ID was 1. The next was 11. The next was 111. And so on. The similarity in these session IDs carried through the RNG and into the results. It shouldn’t have. The fact the RNG let that happen proves the RNG is flawed. If it weren’t, I would have gotten results more like:

[Figure RNG_Normal: the distribution of papers expected from a properly functioning RNG]

The difference in these two graphs proves Oriolus Traillii, chriskoz and I did not see the same paper purely by chance. It happened because the “10 random abstracts” shown to us were not randomly chosen. That doesn’t mean the survey is complete garbage, but it does mean the survey cannot be said to be random.  And it does mean any claims based upon the survey being random cannot be supported until this issue is dealt with.
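For those who want to see the entropy point in miniature, here is a deliberately bad toy generator. It is not the survey’s actual code; it simply shows how a weak seed-to-selection mapping lets near-identical session IDs collapse onto the same papers:

```python
def toy_papers_for_session(session_id: str) -> list[int]:
    """A deliberately weak toy: the seed is reduced to a tiny internal state,
    so most of the 'entropy' in the session ID is thrown away.
    An illustration of the concept, not the survey's actual RNG."""
    state = int(session_id) % 50              # only 50 possible states survive
    return [(state * 241 + j * 1201) % 12000 for j in range(10)]

# Session IDs 11, 111 and 1111 all reduce to the same internal state (11),
# so these "different" sessions are shown exactly the same ten papers.
for sid in ["1", "11", "111", "1111"]:
    print(sid, toy_papers_for_session(sid)[:3])
```

A well-designed generator mixes its seed so thoroughly that no such structure survives; repeated papers across surveys run with similar session IDs are the signature of one that does not.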

For those interested in more information, I’ll discuss additional details and problems of the RNG in the comments section.

It’s “Fancy,” Sort of…

Update: This blog post contains mistakes because of a programming error.  The mistakes will be addressed as soon as I am capable of doing so, but in the meantime, I direct anyone who hasn’t been following the comments section to read this comment.  My apologies – Brandon Shollenberger

I have a confession to make. I was wrong. Tamino was right.

Yesterday, I happened across a blog post Tamino recently wrote. I had never heard of any of the people or groups being discussed in it, so I only skimmed at first. I was barely paying attention until I reached the second figure:

I immediately did a double take. I was sure there was no way I saw a smoothed line go up while the unsmoothed values went down. Only, I had. Caught off guard, I began reading the post more closely. Immediately, I read:

Lest you object that I’ve applied some fancy smoothing method designed to get what I wanted, let’s apply the simplest “smoothing” method of all. Let’s compute 5-year averages rather than 1-year averages. We get this:

Again, I was shocked. This figure showed an even greater uptick than the last! My mind couldn’t understand how the data could be going down while Tamino’s derived lines were going up. I quickly read the rest of the post, saw no explanation, and made a comment on this site asking people whether those figures made sense to them. A few comments were exchanged, and then I realized I could just ask Tamino:

Tamino, I was wondering if you could explain what kind of smoothing you used for your figures. Your smoothed line extends as far as the data does, implying none is excluded, yet it goes up while at the end, your data goes down. The last four points are equal to or lower than the ten points before them, so I can’t figure out why the smoothed line shows a steady increase.

Incidentally, it’s kind of funny you mention the possibility of someone objecting “that [you]’ve applied some fancy smoothing method designed to get what [you] wanted” as I’m sure some would get the impression you did just that as your smoothed line goes up while the data goes down. The same is true for your five year average graph which has a different visual impact because of the periods you used.

Mind you, none of this has any real relevance to the point you make in your blog post. I’m just curious about how that smoothed line was made.

As I noted in my comment on his site, I didn’t think the issue was hugely important. It was just an oddity that caught my eye. Tamino promptly answered, explaining:

The smoothing method was a modified lowess smooth (I use a different weighting function than the usual tricube) applied to the monthly data (rather than annual averages). Therefore it includes the most recent 4 months, which is astoundingly warmer than any preceding third-of-a-year

I saw he said he had used a (modified) LOWESS smooth, and there were four months of data which hadn’t been displayed in his second figure. Knowing what a LOWESS smooth is like, I assumed the extra data was responsible for what I saw. There was a bit more back-and-forth over other points, then I first expressed that understanding:

I have to disagree with you saying the “intuitive result” was wrong. The reason I was thrown off is I didn’t know different amounts of data were being used in the two lines. Had I known you were including more (and higher) data in one, the curve wouldn’t have seemed unintuitive.

The only way intuition was wrong was it said two lines representing the same data would represent the same data.

Tamino again promptly responded, telling me I was wrong:

Computing the smooth without the first four months of 2012 would not have sensibly changed its result. It contradicts the intuitive result because the intuitive result is wrong, not because data from 2012 contradict it. The fact that you fail to understand this, confirms how important it is to emphasize that noisy data can fool people into drawing the wrong conclusion.

I responded again, telling Tamino I found his response “nearly impossible to believe.” When he responded, he reiterated I was wrong, but this time with a great deal of virulence. I decided his petulant behavior made it not worth posting on his blog, and I planned to stop there.

But I couldn’t get rid of my curiosity. Tamino had seemed very sure of what he said, and I didn’t have any sort of evidence on my side. That bothered me, so I decided to test Tamino’s claim that I’d get “much the same thing” if I tried to replicate his work. A little while later, I had this graph:

You’ll note in my figure, the smoothed line goes down at the end, exactly as I would expect, and exactly as my intuition told me.

As I compared my result to his, I realized the only differences were at the beginning and end of the lines. The middle ~80% was identical. Then I saw what was going on. Tamino had said he used “a modified lowess smooth” with a different weighting function. That was to blame, not the extra four months of data as I had thought (you can see the difference adding the four months from 2012 makes in blue below):

So as I said, I was wrong. Tamino was right. It was not the addition of four months of data that made his endpoint rise despite the data falling.

But that’s not the end of the story. Why did Tamino get such a radically different end to his series? The data went down. When I used a LOWESS smooth, my line went down. So why did Tamino’s go up?

There’s only one answer. For some reason, Tamino’s modified version of LOWESS gives extremely low weight to the data near the ends of the series. Why? I don’t know. He doesn’t explain it in his post. In fact, he doesn’t even say what he did in his post. It would be completely impossible for anyone reading his blog to know what he did, much less why he did it.
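For anyone curious what the stock behavior looks like, here is a small sketch using the standard LOWESS implementation in the Python statsmodels package on synthetic monthly data with a clear dip at the end. The series is invented for illustration; it is not Tamino’s data:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic monthly series: a gentle upward trend with a clear downturn in the final year.
rng = np.random.default_rng(0)
x = np.arange(240)                            # 20 years of monthly data
y = 0.002 * x + 0.03 * rng.standard_normal(240)
y[-12:] -= 0.3                                # force a clear dip in the last year

# Standard LOWESS uses tricube weights, w = (1 - |d|^3)^3 for |d| < 1.
smoothed = lowess(y, x, frac=0.2, return_sorted=True)

print("Smooth one year before the end:", round(smoothed[-13, 1], 3))
print("Smooth at the endpoint:        ", round(smoothed[-1, 1], 3))
# With the standard weighting, the smooth turns down with the data at the end
# instead of rising through it.
```

A modified weighting function that discounts the points nearest the ends can behave very differently at the boundaries, which is why disclosing the modification matters.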

Of course, Tamino offered a second approach in case anyone accused him of using “some fancy smoothing method designed to get what” he wanted. Naturally, I suspected that graph was questionable too, so I decided to test something. What if when calculating the five-year averages, we don’t include 2012, which only had four months of data?

I was speechless. If I took the four months from 2012 as a whole year and averaged them with the data for 2008-2011, I got Tamino’s results. If I took the average from 2007-2011, I got a result nearly half a degree lower. Not only does this have a huge visual impact, it means Tamino used one data set in three different ways:

1) Annual averages, excluding data from 2012.
2) Monthly smoothed, using all data.
3) Five-year averages, using four months from 2012 as a whole year.

Not only is it highly peculiar to use data in so many different ways, Tamino didn’t tell anyone what he had done. Nobody reading his post would realize how different his graphs could look with different choices that were at least as good as, if not better than, his undisclosed ones.
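To see how much the binning choice alone can matter, here is a toy illustration with made-up numbers. Nothing below is GISS or USHCN data; it only shows the mechanism:

```python
# Entirely hypothetical annual anomalies (degrees C), invented to show the mechanism only.
annual = {2007: 0.55, 2008: 0.40, 2009: 0.52, 2010: 0.60, 2011: 0.48}
early_2012 = 0.95  # pretend the first four months of 2012 were unusually warm

# Choice A: count the four warm months of 2012 as if they were a full year.
bin_a = (annual[2008] + annual[2009] + annual[2010] + annual[2011] + early_2012) / 5

# Choice B: use the five complete years 2007-2011 and leave 2012 out.
bin_b = sum(annual.values()) / 5

print(f"Final five-year bin, partial 2012 counted as a year: {bin_a:.2f}")
print(f"Final five-year bin, complete years 2007-2011 only:  {bin_b:.2f}")
# Counting a warm partial year as a whole year lifts the final point of the graph.
```

The size of the effect depends entirely on the data, but the direction of the effect is exactly the issue described above.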

Does any of this have any impact on the point of his blog post? No. To be honest, I still know nothing about the Tom Harris his post criticizes. The point is that people promoting global warming concerns should make it possible to understand their work without readers having to figure out undisclosed details based on arbitrary and unexplained decisions. In fact, everyone should.

Finally, here is what Tamino’s GISS/USHCN comparison would have looked like if he had used a standard implementation of LOWESS (GISS in blue, USHCN in red):

Update 2 (5/11/2012): After some consideration, I’ve decided to leave this post as is, since its basic point is unaffected by the errors in it. The errors will be left for posterity, and anyone interested in the corrected version of the technical details can read the comments section. Moreover, after some more time for discussion, I’ll collect the details into a single, simple explanation and update this post with a link to it.