UPDATE: This analysis is continued in Steig’s Antarctica: Part Three – Creative Mathemagic.
Earlier I had said that I didn’t think Harry would make much of a difference to Steig’s results. I added the caveat that I had concerns about the paper – they just weren’t with Harry. This post describes a portion of them.
ant_recon_names1 (save this as “ant_recon_names.txt”)
Steig’s Antarctica Part Two: AVHRR vs. AWS Reconstructions
My understanding of Steig’s analyses:
#1 – The main reconstruction: the AVHRR (TIR) satellite data combined with the manned surface station records via RegEM (1957-2006).
#2 – A reconstruction using a reduced set of predictor stations.
#3 – A PCA-based reconstruction.
#4 – A reconstruction using the AWS data in place of the AVHRR data.
The purpose of #2 seems to be an effort to show that the results are not overly sensitive to the stations used for the reconstruction. The purpose of #3 seems to be an effort to show that the results are independent of the method (i.e., both RegEM and PCA reconstructions yield similar results). The purpose of #4 is to show that the late period (1982-2006) results can be confirmed by AWS data.
The reason for these is to show that processing the AVHRR data in the manner described in the paper has physical validity. #2 – #4 provide the benchmarks against which the main reconstruction is compared, inasmuch as actual ground temperatures are a direct measurement of surface air temperature while the infrared satellite data are the proxies.
Without the reconstruction data, I have no means of analyzing #2 and #3, and Steig did not spend much time discussing those results. He provided some maps that visually show a degree of correlation with the main reconstruction. However, Steig did provide the reconstruction data for #4. He also spent the most time talking about #4 in the Supplemental Information. From this, we might choose to infer that the AWS reconstruction provided the benchmark that best matched the TIR data. With that in mind, let’s look at how well the satellite reconstruction holds up against the AWS reconstruction.
In order to compare, we need to find the grid location from the full reconstruction (ant_recon.txt) that contains the AWS station. This is accomplished by converting the latitude difference to kilometers (1.852 km per minute of latitude) and the longitude difference to kilometers ((pi/180)*cos(latitude)*6398 km per degree), squaring them, summing them, taking the square root, and sorting the distances. The ant_recon_names.txt file has these collated: station name, closest full recon column number, and distance in km.
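For those who want to follow along, here is a minimal R sketch of that nearest-grid-point search. The grid data frame (one row per reconstruction column, with lat/lon in degrees) and the reading of the 1.852 km factor as kilometers per minute of latitude are my assumptions, not code from the paper:

# Sketch: find the ant_recon column closest to an AWS station.
# 'grid' is an assumed data frame with one row per reconstruction column
# and 'lat'/'lon' in degrees; station coordinates are passed in directly.
nearest_grid <- function(stn_lat, stn_lon, grid) {
  dlat_km <- (stn_lat - grid$lat) * 60 * 1.852                       # 1.852 km per minute of latitude
  dlon_km <- (stn_lon - grid$lon) * (pi / 180) * cos(stn_lat * pi / 180) * 6398
  dist_km <- sqrt(dlat_km^2 + dlon_km^2)
  c(column = which.min(dist_km), dist_km = round(min(dist_km), 1))   # closest column and its distance
}
nearest_grid(-77.5, 167.0, grid)                                     # hypothetical station coordinates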
SOURCE ANALYSIS
Subtitle: What made it into the paper (and what didn’t)?
The first thing we note is that, of the 63 AWS stations used in ant_recon_aws.txt, only 26 appear in Table S1. The caption explains that the remainder either had insufficient data or demonstrated poor verification skill.
TABLE S1 EXCLUSION CRITERIA
According to the caption, stations that do not have enough calibration information (less than 40% complete) or demonstrate insufficient verification skill are not included. Sidebar: though it seems reasonable on the surface, there is no additional discussion about why the cutoff points were set at these levels. If results are to be discarded, some justification ought to be provided. At the very least, those results that almost made it should be mentioned to allow the reader to judge how sensitive the final results are to the cutoff criteria. However, a closer examination shows the completeness criterion to be applied loosely at best:
Just to cover all bases, since Steig stated that there were both early (1982-1994.5) and late (1994.5-2006) calibration/verification periods, we should see if the caption was referring to the need to be >40% complete in just the early period, just the late period, or both. No matter how we look at it, some stations that do not meet the criteria are included, and some that do are excluded (verification skill notwithstanding).
Early period, less than 40% complete, included in Table S1: Cape Ross, Elaine, Enigma Lake, LGB20, LGB35, Larsen Ice Shelf, Marilyn, Nico, Pegasus South, Tourmaline Plateau. Of special note, LGB35 has a total of 6 months of data in this period and 3 other sites are less than 20% complete.
Early period, greater than 40% complete, excluded from Table S1: Byrd, Uranus Glacier. 11 other stations are greater than 20% complete – in other words, more complete than many stations that were included.
Late period, less than 40% complete, included in Table S1: D_10 (25% complete).
Late period, greater than 40% complete, excluded from Table S1: 17 stations. 10 other stations were excluded that were more complete than D_10.
A complete listing of the stations, including the number of months present for the overall period and the sub-periods, is contained in “Table S1 Inconsistencies.txt”.
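A minimal sketch of the completeness check, assuming aws is a matrix of monthly READER values (rows = months, columns = the 63 stations) and yr holds the decimal year of each row; these object names are mine, not Steig’s:

pct_complete <- function(x, yr, lo, hi) {
  rows <- yr >= lo & yr < hi
  round(100 * colMeans(!is.na(x[rows, , drop = FALSE])), 1)   # percent of months with data
}
early <- pct_complete(aws, yr, 1982, 1994.5)                  # early calibration/verification period
late  <- pct_complete(aws, yr, 1994.5, 2007)                  # late calibration/verification period
sort(early[early < 40])                                       # fails the stated 40% cutoff, early period
sort(late[late < 40])                                         # fails the stated 40% cutoff, late period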
MISSING DATA POINTS
Since RegEM does not replace actual data with imputed values, we should be able to tell if any data from any time series was not included by subtracting the reconstruction from the actual values. Because the reconstruction is in anomalies, we first must make anomalies out of the actual data.
The easiest way to do this is simply to find the average value for each month where data is present and subtract that value. Unfortunately, when this is done, we find the following:
This means that – for whatever reason – Steig did not use all available data to calculate the anomalies. These offsets will result in noise when doing point-by-point comparisons between actual and reconstructed data.
To fix this, instead of using the monthly averages for the anomalies, we will subtract RECON from ACTUAL and use the mode of the resulting differences for each calendar month as the offset. That method yields anomalies that match the reconstruction exactly.
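A sketch of both anomaly methods, continuing with the same aws matrix plus a matching recon matrix of AWS-reconstruction values and a month vector (1-12, one entry per row). The mode-based offset recovery is my reading of the approach described above:

# Method 1: simple monthly-mean anomalies (does NOT reproduce Steig's anomalies).
simple_anom <- apply(aws, 2, function(x) x - ave(x, month, FUN = function(v) mean(v, na.rm = TRUE)))

# Method 2: recover the offsets actually used as the per-month mode of ACTUAL - RECON.
monthly_mode <- function(d) {
  d <- round(d[!is.na(d)], 2)
  if (length(d) == 0) return(NA)
  as.numeric(names(which.max(table(d))))                      # most common difference = the offset
}
offsets <- sapply(1:ncol(aws), function(j) tapply(aws[, j] - recon[, j], month, monthly_mode))
matched_anom <- sapply(1:ncol(aws), function(j) aws[, j] - offsets[month, j])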
Now that we are using the same anomalies as the reconstruction, let’s take a quick look at how the READER data compares to the AVHRR reconstruction:
Some of the stations have very little data, so determining trends is not without substantial error. However, it does appear that the READER data does not match the AVHRR reconstruction well. For now, we’ll leave it at this. We may look at the actual data vs. the AVHRR reconstruction in more detail later.
Next, we want to see if any data is missing. Because our anomalies match exactly with Steig’s, this is simple. We just look for any points on a plot of ACTUAL – RECON that do not lie on zero. After doing this, we come up with the following list:
Stations with a significant number of missing points: Bonaparte Point (~half) and D_47 (14 points).
EDIT: The issue with Bonaparte Point arises because the BAS data set changed between the time Steig did his analysis and the time I downloaded the sets. BAS did not post a correction notice for this.
Stations with 4 – 10 missing points: Cape Denison, Hi Priestley Gl, and Marble Point (which appears in S1).
Stations with 1 – 3 missing points: Butler Island, Clean Air, D_10, D_80, Dome CII, Elaine, Ferrell, GEO3, Gill, Harry, Henry, LGB20, Larsen Ice Shelf, Manuela, Marilyn, Minna Bluff, Mt. Siple, Nico, Pegasus North, Pegasus South, Port Martin, Possession Island, Racer Rock, Relay Station, Schwerdtfeger, and Whitlock.
That is a total of 31 stations (2 + 3 + 26) where some data points were excluded from the analysis. Of note, the Harry, Elaine, and Gill points that were excluded are present in both the pre- and post-corrected READER sets, so the corrections did not affect that. Additionally, the Ferrell points that were excluded were all late-time points that were significantly cooler than earlier points. The 2005 Clean Air data that was excluded was later removed by BAS as being errant.
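For reference, a sketch of the missing-point check used to generate the list above, continuing from the offsets computed in the earlier sketch:

# Flag months where READER has a value but the anomaly minus the reconstruction
# is non-zero (beyond rounding), i.e. points apparently excluded from the analysis.
dropped <- lapply(1:ncol(aws), function(j) {
  d <- (aws[, j] - offsets[month, j]) - recon[, j]
  which(!is.na(aws[, j]) & abs(d) > 0.005)                    # small tolerance for rounding
})
names(dropped) <- colnames(aws)
sort(sapply(dropped, length), decreasing = TRUE)              # excluded points per station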
While there is nothing unusual about removing points that seem to be errant, this highlights the importance of Steve McIntyre’s quest to have data provided AS USED. In some cases with the AWS stations, the entire series has ~20 data points. Removal of 2 points is removal of 10% of the data and quite conceivably could significantly impact verification skill for that station.
The other thing this indicates is that Steig did actually go through the raw data and remove outliers despite having implied at RC that such a quality control task was too arduous to be practical. Almost half of the station data sets were modified in some fashion from what is posted on the READER site – including Harry. Rather than assume that Steig was deliberately manipulating data to achieve the desired result, a more likely and reasonable assumption would be that he succumbed to the all-too-human tendency not to look quite as critically at data that fit his view of warming in Antarctica. This, too, highlights the importance of posting data sets AS USED – especially prior to the paper being published. This gives the researcher an opportunity to correct mistakes and other things that were overlooked prior to the words being permanently emblazoned on paper.
Sidebar: my personal opinion is that there is nothing untoward here – the Ferrell points do indeed look odd and I probably would have excluded them myself. I haven’t yet looked at the rest of the actual data to see if the removed points truly seem to be outliers. Regardless, the number of removed points is small, and while it may affect verification scores for some stations with little data, I seriously doubt it affects the overall reconstruction in any observable way.
TABLE S1 vs. S2
Since Byrd, pre-correction Harry, and Mt. Siple do not appear in Table S1 yet had more than 40% of their data complete, they must have failed to show sufficient verification skill. However, all three (along with Siple) are later used in Table S2 to show good correlation to the 15-predictor reconstruction. If they could not show sufficient verification skill within their own reconstruction, then the comparison to a different reconstruction is meaningless. Pulling pieces that DO NOT pass the verification criteria and using them to show a correlation is tantamount to admitting that the verification criteria have no power.
The other obvious question is why were other stations not included in the comparison? Elaine, Gill, Lettau, and Schwerdtfeger all passed verification and are in the same area as Harry and Byrd. For the peninsula, Uranus Glacier has more data over a longer period than Siple. The reader is left to wonder at the reasons behind station selection.
TREND ANALYSIS
Subtitle: What the AVHRR/AWS comparison tells us (and what it doesn’t)
Let us start this off with a couple of graphs. On the top are the 1957-2007 trends for the 26 AWS stations in Table S1. On the bottom are the 1957-2007 trends for the corresponding grid points in the AVHRR reconstruction. (Note that there is a typo in the caption for Table S1: it says the trends listed correspond to the full AVHRR reconstruction, but they do not – the trends are for the AWS reconstruction.)


They look pretty similar. The AWS data has a mean trend of 0.172 deg C/decade and the AVHRR data has a mean trend of 0.127 deg C/decade. Fairly close. Let us do a paired t-test to see whether the difference in means is statistically significant:
t = 2.0415, df = 25, p-value = 0.05189
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0003933995  0.0895627379
sample estimates:
mean of the differences
             0.04458467
Note that the 95% confidence interval includes zero, so we cannot say at the 95% confidence level that there is a statistically significant difference in the means.
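For those following along, a sketch of the trend and t-test calculation. aws_recon and avhrr_recon are assumed matrices of monthly anomalies (1957-2006) for the 26 Table S1 stations and their matching grid points, with yr the decimal year of each row:

decadal_trend <- function(x, yr) 10 * coef(lm(x ~ yr))[2]     # deg C per decade
aws_trends   <- apply(aws_recon,   2, decadal_trend, yr = yr)
avhrr_trends <- apply(avhrr_recon, 2, decadal_trend, yr = yr)
mean(aws_trends); mean(avhrr_trends)                          # the mean trends quoted above
t.test(aws_trends, avhrr_trends, paired = TRUE)               # paired t-test on the station trends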
Just for fun, let’s look at all 63 stations:

AWS mean: 0.138 deg C/decade. AVHRR mean: 0.125 deg C/decade. T-test:
t = 0.7398, df = 62, p-value = 0.4622
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.02202794  0.04791103
sample estimates:
mean of the differences
             0.01294154
An interesting thing happened. The AVHRR trend is relatively insensitive to the inclusion/exclusion of grid points, while the AWS trend is highly sensitive to station inclusion/exclusion. This is another reason why justification for where the cutoff points were set should be required.
Now let’s dig a little deeper. We should remember that actual AWS data only exist after 1980 and actual AVHRR data only after 1982. Prior to that, both reconstructions are driven by the manned surface stations. So if we intend to use the AWS reconstruction to show concordance with the AVHRR reconstruction, we would be better served to break up the trends into pre-1980 and post-1982 buckets:
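Splitting the same comparison into the two windows is straightforward (sketch, same assumed objects as above; the exact boundary dates follow the text):

pre  <- yr < 1980                                             # manned-station-driven portion of both recons
post <- yr >= 1982                                            # period where AVHRR (and AWS) data actually exist
t.test(apply(aws_recon[pre, ],   2, decadal_trend, yr = yr[pre]),
       apply(avhrr_recon[pre, ], 2, decadal_trend, yr = yr[pre]),  paired = TRUE)
t.test(apply(aws_recon[post, ],   2, decadal_trend, yr = yr[post]),
       apply(avhrr_recon[post, ], 2, decadal_trend, yr = yr[post]), paired = TRUE)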

The AWS recon (mean: 0.209 deg C/decade) shows much greater warming than the AVHRR recon (0.117 deg C/decade). To see how robust the difference in means is, let us do a paired t-test on the trends. We get:
t = 3.0137, df = 25, p-value = 0.005843
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.02924094 0.15547439
sample estimates:
mean of the differences
             0.09235766
Note the small p-value and that the confidence interval no longer includes zero – the means are significantly different at the 95% level.
Now let’s look at post-1982:
The AWS recon (mean: 0.0078 deg C/decade) shows much less warming than the AVHRR recon (0.260 deg C/decade). Our t-test tells us:
t = -6.35, df = 25, p-value = 1.202e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3338020 -0.1703029
sample estimates:
mean of the differences
             -0.2520525
Curiouser and curiouser.
Just to see, if we look at the post-1982 trends for all stations (not just the ones in Table S1), we see:

The AWS mean is 0.039 deg C/decade and the AVHRR mean is 0.255 deg C/decade. Our t-test says:
t = -8.8007, df = 62, p-value = 1.634e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2660813 -0.1675805
sample estimates:
mean of the differences
             -0.2168309
Pretty much the same as the 26 station test.
While the AWS recon and the AVHRR recon have a similar 1957-2007 linear trend, how they get from 1957 to 2007 is wholly different. The AVHRR recon shows fairly steady warming throughout, while the AWS recon shows strong warming to 1980 and a flat trend afterwards. Our t-tests indicate that in the early and late periods the trends come from different populations.
So now we ask ourselves: does the AWS reconstruction provide any meaningful “reality check” on the AVHRR reconstruction? The statistics tell us that it does not.
CORRELATIONS
The last thing we will look at (for the moment) is the correlation between the data sets. To start, let us look at a scatterplot of the AVHRR grid points vs. the 26 AWS stations in the AWS recon:
t = 128.6379, df = 15598, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7097751 0.7250059
sample estimates:
      cor
0.7174762
At first glance, it appears as if there is a decent correlation between the two sets, with the range on the AWS set being about twice the range on the AVHRR set. Now let’s look at the period from 1957-1980:
t = 154.6585, df = 7798, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8628489 0.8737656
sample estimates:
      cor
0.8684124
The correlation here appears much stronger. Now 1980-2007:
t = 64.8594, df = 7824, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5767256 0.6055481
sample estimates:
      cor
0.5913256
The correlation here is much weaker. In fact, as time progresses, the correlation continues to degrade. Here’s 2000-2007:
t = 28.1629, df = 2624, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4518528 0.5106172
sample estimates:
      cor
0.4817765
Note that the correlation visually seems to be dominated by a few series. I have not yet had time to see which ones are dominant. Additionally, none of this takes into account the autocorrelation reported at CA, which was significant (see Roman’s Deconstructing thread).
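For reference, a sketch of the pooled correlation tests (same assumed matrices as above; the sub-period boundaries shown are illustrative, the actual cuts follow the dates in the text):

cor.test(as.vector(aws_recon), as.vector(avhrr_recon))                              # all station-months pooled
cor.test(as.vector(aws_recon[yr < 1980, ]),  as.vector(avhrr_recon[yr < 1980, ]))   # early period
cor.test(as.vector(aws_recon[yr >= 1980, ]), as.vector(avhrr_recon[yr >= 1980, ]))  # late period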
CONCLUSION
(Sort of)
Superficially, the AVHRR recon and AWS recon appear to match reasonably well. The match is driven primarily by the 1957-1980 timeframe – where neither AWS data nor AVHRR data are present. It degrades significantly post-1980. This result should not be surprising since the 1957-1980 timeframe utilizes manned surface station data in both reconstructions.
This does not mean that Steig’s AVHRR reconstruction is an inaccurate picture of Antarctica, but it does mean that the provided benchmark does not match within the certainty needed to call the AVHRR trends accurate. Since they differ substantially from each other, one or the other (or both) must differ substantially from reality. They happen to end at about the same value in 2007 – which gives reasonably well-matched linear trends from 1957 – but they approach the end value in very different ways.
NEXT
If I have time, there will be a Part Three to this. I am intending to:
And a few other odds and ends (like seeing which series dominate the positive correlations).
Cheers! 🙂