GHCN Version 3! (beta)

Via Nick Stokes’ blog (who himself found it in a comment by CCE over at our sister blog The Whiteboard), I learned today that a beta version of GHCN v3 had been released. The files and read-me are available here, and contain two files of interest:

I’ll take a detailed look in this post at exactly what changed, but the short answer is not too much. Version 3 added about 500 new stations (> 1000 post-2006), so no huge new data update quite yet. The big news in GHCN v3 is the new and improved inhomogeneity algorithm that draws on the work that Menne et al did on USHCN. Interestingly enough, the new algorithm appears to increase the trend a bit less than the old v2 adjustments.

Let’s start by examining the number of station records available by year in each series. I’m not distinguishing between adjusted and unadjusted series here, because the differences in station counts in any given year are too small to see on the chart. (It’s worth recalling the correct way to generate GHCN v2 adjusted data: v2 ships its adjustments as a change-log that must be combined with unadjusted data for missing months, whereas v3 takes a new approach and provides a full series for the adjusted data.)
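The v2 recipe (adjusted change-log merged with unadjusted data for missing months) can be sketched as follows. This is purely illustrative, assuming each series is stored as a dict keyed by (year, month); it is not NCDC’s actual code:

```python
def merge_adjusted(unadjusted, adjusted):
    """Build a complete v2-style adjusted series: use the adjusted value
    where the change-log provides one, and fall back to the unadjusted
    value for months it does not cover. Each series maps
    (year, month) -> temperature."""
    return {key: adjusted.get(key, raw) for key, raw in unadjusted.items()}
```

Under v3 this merge step disappears, since the adjusted file is already a full series.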

Here we see that in most cases GHCN v3 adds stations: ~200 prior to 1950, ~500 from 1950-2006, and ~1000 post-2006. Prior to 1895, however, GHCN v3 appears to have between 100 and 500 fewer stations. I’m not sure why these have been excluded, but we shouldn’t complain quite yet given that this is still an early beta release.
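Counting stations per year is easy to reproduce. A minimal sketch, assuming the GHCN v3 fixed-width .dat layout (an 11-character station ID, 4-character year, 4-character element code, then twelve 8-character value-plus-flags groups with -9999 for missing):

```python
from collections import defaultdict

def stations_per_year(path):
    """Count distinct stations reporting at least one monthly value in
    each year. A sketch assuming the GHCN v3 fixed-width .dat layout
    described above; check the beta README before relying on it."""
    counts = defaultdict(set)
    with open(path) as f:
        for line in f:
            station, year = line[:11], int(line[11:15])
            # Each month occupies 8 characters; the first 5 are the value.
            monthly = [int(line[19 + 8 * m:24 + 8 * m]) for m in range(12)]
            if any(v != -9999 for v in monthly):
                counts[year].add(station)
    return {year: len(stations) for year, stations in sorted(counts.items())}
```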

Another noteworthy change is the apparent decision to retire the imod/station_id distinction that caused quite a bit of confusion in v2. In the new version there are only station ids, and each individual imod is considered to be a unique station. Similarly, duplicates appear to be removed altogether. I’m not entirely sure how this was done in the case of largely non-overlapping duplicates: whether they were combined into a single series or treated as separate stations (e.g. the duplicates we saw in the Kathmandu case).
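For largely non-overlapping duplicates, one plausible combination rule (purely illustrative; nothing here establishes what NCDC actually did) is to average where the duplicate records overlap and splice them together otherwise:

```python
def combine_duplicates(series_list):
    """Merge duplicate records for a single station: where duplicates
    overlap, average them; where only one reports, take that value.
    Each series maps (year, month) -> temperature. A sketch of one
    possible approach, not necessarily NCDC's."""
    combined = {}
    for series in series_list:
        for key, value in series.items():
            combined.setdefault(key, []).append(value)
    return {key: sum(vals) / len(vals) for key, vals in combined.items()}
```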

Next, let’s compare unadjusted GHCN v2 data to GHCN v3 data:

As expected, the series appear quite similar, and most of the difference is likely due to the additional stations in GHCN v3. The trends in each are shown below.

1880-2009 (C per decade):

  • GHCN v2 – 0.068
  • GHCN v3 – 0.065

1950-2009 (C per decade):

  • GHCN v2 – 0.180
  • GHCN v3 – 0.176

1975-2009 (C per decade):

  • GHCN v2 – 0.302
  • GHCN v3 – 0.299

In all cases the trends in v3 are very slightly lower than those in v2.
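The trends above are straightforward least-squares fits over each period. A minimal sketch, with a synthetic anomaly series standing in for the actual GHCN global means (whose construction involves gridding and averaging first):

```python
import numpy as np

def trend_per_decade(years, anoms, start, end):
    """Least-squares slope of the anomaly series over [start, end],
    converted to degrees C per decade."""
    mask = (years >= start) & (years <= end)
    return np.polyfit(years[mask], anoms[mask], 1)[0] * 10.0

# Synthetic stand-in: a 0.068 C/decade warming plus noise.
years = np.arange(1880, 2010)
anoms = 0.0068 * (years - 1880) + np.random.default_rng(0).normal(0, 0.1, years.size)
```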

Now the adjusted data:

The adjusted data is also quite similar, though it differs a bit less in the middle and a bit more on both ends. The trends are:

1880-2009 (C per decade):

  • GHCN v2 – 0.079
  • GHCN v3 – 0.074

1950-2009 (C per decade):

  • GHCN v2 – 0.197
  • GHCN v3 – 0.188

1975-2009 (C per decade):

  • GHCN v2 – 0.322
  • GHCN v3 – 0.310

Here we see that the new v3 adjusted data has a notably lower trend than the old v2 adjusted data. This suggests that the net effect of the adjustments is smaller in v3, something borne out when explicitly comparing them:
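One way to quantify the net effect of the adjustments is the trend of the adjusted-minus-unadjusted difference series. A sketch, taking hypothetical arrays rather than the actual GHCN series:

```python
import numpy as np

def net_adjustment(years, unadjusted, adjusted):
    """OLS trend of (adjusted - unadjusted), in degrees C per decade:
    the net effect of homogenization on the trend."""
    diff = np.asarray(adjusted) - np.asarray(unadjusted)
    return np.polyfit(years, diff, 1)[0] * 10.0

# Sanity check against the 1880-2009 trends quoted above:
# v2 net effect: 0.079 - 0.068 = +0.011 C/decade
# v3 net effect: 0.074 - 0.065 = +0.009 C/decade
```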

Overall, nothing too earth-shattering here. It’s somewhat unfortunate that 3.0 doesn’t contain a thorough station data update, though word is that it is coming in GHCN 3.1. Also, please bear in mind that this is still a beta product, so all early analysis should be taken with a small boulder of salt.

25 thoughts on “GHCN Version 3! (beta)”

  1. @cce
    “The actual h/t goes to commenter pd at clearclimatecode.org”

    Yeah, it was me 🙂 There was a “v3” folder before September 2nd, but there was no data in it then.

    Unfortunately, GHCN v3 has the same problem with the Feb 1991 Polish stations as GHCN v2 (a missing minus sign), so I think the other possible errors have not been corrected either.

  2. Zeke,

    I’ve downloaded the data and have started looking at the ‘unadjusted’ station inventory file ghcnm.v3.0.0-beta1.20100917.qcu.inv, and have been comparing it with the GHCN v2 equivalent v2temperature.inv.

    Have you done any basic station counts on this file? I make it that there are exactly 7280 records in both the V3 and V2 station inventory files, which would appear to contradict your statement ‘Version 3 added about 500 new stations (> 1000 post-2006), so no huge new data update quite yet’, as I can’t see any evidence straight off that any stations have been added or deleted (unless some of those in V2 have been replaced by the exact same number of new stations in V3).

    I’ve also done a cross-tabulation query of the number of stations (records in the station inventory file) grouped by country/country code, and again from what I can see each country has exactly the same number of stations (records) in the V3 file as it has in the V2 file. For example, there are 1921 ‘UNITED STATES OF AMERICA’ stations in both files and 847 ‘CANADA’ stations in both files.

    What I have noticed is that for about 2/3 of the US stations the WMO station code/imod combination (which represents a unique station record in the inventory file) has been replaced with ‘IDs’ that are not in the 70,000 range (as they all are in the V2 file). Is this because they’ve used the USHCN V2 station IDs for a lot of the US stations? I ask because I’m nowhere near as familiar with the USHCN V2 dataset as I am with the GHCN V2 dataset.

    I can only assume that the differences in the number-of-stations-by-year chart for V2 versus V3 above are due to additions of further monthly average temperature data for EXISTING stations in the station inventory file, and NOT due to additional stations as your statements seem to imply. Could you recheck and see if you can confirm my findings, please, Zeke, as this is an important point.

    You’ve probably already worked out that I’m looking to compare the changes/additions to data on an individual station basis, as my main interest is in looking at how the changes/additions to the dataset have affected the warming/cooling trends for individual stations. I’m particularly interested to see whether or not NCDC have made any significant changes to how they adjust raw data for individual stations, as a great many of the individual station V2 adjustments had no physically justifiable explanation IMO. Let’s see if things have remained much the same, or for that matter have gotten even worse in this respect, in going from V2 to V3. I somehow doubt that things have improved, but let’s wait and see. I suspect it won’t be long now before Willis E has an updated thread on Darwin adjustments up on WUWT. I might even be able to beat him to the punch, you never know.

  6. Kevin,

    I believe that you are correct, that the total number of stations didn’t change, but the number of stations available for different months has changed with the addition of more monthly data for existing stations.

    I should have said that v3 added 500 new records for the average month, rather than 500 new stations per se.

    As far as adjustments go, they will tend to be normally distributed around a mean. Looking at the mean and distribution is much more interesting than cherry-picking extremes.

  7. Zeke

    Given that NOAA now admits that the US temp record (supposedly the best in the world) is flawed, and that ROW records are worse, why are you getting so excited about this?

  8. Dave Andrews,

    I’m not sure what you are talking about. If “is flawed” means that raw measurements need to be adjusted for site changes, instrument changes, time-of-observation changes, and other inhomogeneities, they admitted that about three decades ago 😛

    It’s somewhat the nature of the beast when using historical weather data to evaluate climate changes. The challenge is to quantify and test the magnitude and effect of the various factors that can bias the temperature record, and adjust accordingly.

  9. Zeke,

    If the original data on which you are trying to assess bias and then adjust is flawed, and data in the ROW is often non-existent or unavailable, how are you able to come to any valid conclusion about surface temperature?

  10. Dave Andrews,

    Because we can derive tests to see how big biases are and adjust as appropriate. Climate data isn’t perfect, especially on regional levels, but globally the temperature record paints a pretty clear picture that is remarkably resilient to being sliced in different ways (e.g. only long-lived rural stations, only well sited stations, etc.).

  11. Dave Andrews:

    If the original data on which you are trying to assess bias and then adjust is flawed, and data in the ROW is often non-existent or unavailable, how are you able to come to any valid conclusion about surface temperature?

    Another version of Zeke’s comments is you can’t logically just throw away data because it has potential problems. You have to first show that the flaws fatally cripple the utility for which the data were collected; i.e., you need a model that you can then test analytically. Zeke has given you some examples of how we know the warming signal in the data is real; I can provide others if you’re really interested.

    At the moment, the signal from global warming seems to be at least 10x the magnitude of any corrections found, and likely the actual systematic error contributes no more than about 5% to the observed warming.

  12. “Another version of Zeke’s comments is you can’t logically just throw away data because it has potential problems.”

    Carrick,

    Are you saying you can never throw away data because the data has problems?

    Andrew

  13. Andrew:

    Are you saying you can never throw away data because the data has problems?

    No of course not…

    Problems with data are unavoidable, so in practice what you do is use the data if the magnitude of the problem can be assessed and if it doesn’t affect the proposed use of the data. In this sense, there are certainly data that are “unusable”.

    The idea that we’ll ever have perfect data is a bit unrealistic, especially if the data are being used to analyze a question they weren’t originally gathered to address. The real world has warts.

    Thanks, Carrick. I was just looking for some clarity, because in following climate news the last several years, I’ve noticed that in climate science “just throwing away data” seems to be common practice. Like dropping temperature measurement stations.

    Andrew

  15. Andrew, you usually need solid reasons to throw away data, at least while adhering to recognized standards of practice in measurement.

    More commonly, the questionable data are all initially included, and you do a “with and without” version (e.g., Tiljander). If a solid argument can be mounted to reject the data, they should not be destroyed, just not used in the study. I keep everything, including channels with dead microphones.

  16. The blurb for the recent Exeter meeting on surface temperatures said

    “These datasets were adequate for assessing whether climate was changing at the global scale” but “They do not constitute a sufficiently large sample to truly understand our uncertainty at regional scales”

    How can they possibly be adequate at a global scale if they are far from adequate at a regional scale?

  17. The problem with the minus sign in the Polish stations for Feb 1991 is now fixed. I’ve also noticed that there are some new data (e.g. 1990-2000 for Zielona Gora (63512400); the data for that station are correct).

Comments are closed.