All posts by Steve Mosher

Population II

The last post brought up some interesting issues that will help guide the exploration of the population at stations over time. Rather than add to the first post, I'll do another post. If folks raise interesting comments and I can quickly put up clarifying graphics, I will append them to the post and announce that in the comments. First, a little roadmap to where I hope to go. I'm putting together a repository of some new code for working with Berkeley data, so this project is a way of testing that out and adding the functionality one needs to do an actual project. In general terms I want to test out some new metadata and some different ways of classifying stations and looking for UHI. In the end I want to see how we end up adjusting, or not adjusting, places that we would classify as urban. How well, if at all, does the adjustment process work on this problem? We of course have a global answer, but how does it look in detail, station by station? And are there ways to improve it?

From 100K feet the plan is to build a filter that can hopefully divide urban from rural, or provide some categories or a continuous measure of urbanity. In the past I've built filters that just contained everything: population density, built area, nightlights, airports. And I've also experimented with "buffers" around these elements to ensure that rural stations are truly isolated. You can think of the filter as having two stages. In the first stage we try to classify sites where we have good observational reasons for suspecting UHI. From observational studies we know that large cities with tall buildings and many people suffer from the most UHI. The combination of tall buildings, built surfaces, and human activity creates UHI. As we move away from the city toward a more pristine environment, some of the causes of UHI (namely tall buildings, dense impervious surfaces, and human activity) diminish, and it should follow that UHI diminishes.

At the limit, as the number of buildings shrinks and the surface transformations become smaller, our concerns become microsite concerns. Or we could refer to three different scales: the city scale (mesoscale), the neighborhood scale, and finally the backyard or microscale. The approach I want to take is to first categorize the easy-to-categorize stations using population. After the first filtering we end up with two piles: one pile that is clearly urban and has all the causes of UHI present, and another pile that is necessarily going to be more arguable. The second filtering will be applied to these remaining stations, using some new high-resolution satellite products. Another way to look at this is that in the first pass we will build a pile that has all the known causes of UHI (tall buildings, many people, dense development), and in the second pile we will have far fewer people and little development. At least that's the plan. As always, if interesting sidelines come up I will take a look at them, but some things may get naturally deferred to the second stage.

The first post raised some interesting questions, namely about the number of stations, the number of stations over time, and what to do about stations that are close to urban cores, or stations in the transition zone (TZ). A good example of a TZ station is De Bilt (http://onlinelibrary.wiley.com/doi/10.1002/joc.902/abstract). Until recently this was one of the few empirical studies of UHI at the outskirts of cities. A more recent study is here: http://onlinelibrary.wiley.com/doi/10.1002/qj.2836/pdf. More on those two studies toward the end.

 

Stations:

The stations used come from Berkeley Earth's data set, which ingests data from 14 or so different source decks. Many of these sources contain duplicate stations, stations lacking metadata, or shorter series not collated in the usual inventories like GHCN-Monthly. I'll point out some of that as we look at the maps. For the most part, series like GISS and HADCRUT depend upon an anomaly period (1951-80, 1961-90), such that if a station doesn't have data during that period it isn't used. After merging our source decks we end up with ~43K different stations. This is prior to any "slicing". Many of the stations are shorter series; for example CRN, the gold standard in the US, starts around 2005, and other data products don't use this data. What that means is that stations come and go over time, such that there is no time at which all 43K are present. Below I've collated all the stations that appear within a given 30-year period. Note, this doesn't mean they are all at least 30 years long. The "Pre" period is stations that exist before 1850. For reference, the current station count (May 2016) is ~19K. A sketch of the binning follows the figure.

[Figure: StationsPerPeriod — station count per 30-year period]
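
For the curious, here is a minimal R sketch of that binning. The `stations` data frame and its `min_year`/`max_year` columns are placeholders for whatever the repository actually uses; a station counts toward a period if its record overlaps the window.

```r
# A minimal sketch of the period binning, assuming a hypothetical data frame
# `stations` with columns min_year and max_year (first and last year with data).
periods <- data.frame(start = seq(1850, 2000, by = 30))
periods$end <- periods$start + 29
periods$count <- sapply(seq_len(nrow(periods)), function(i) {
  sum(stations$min_year <= periods$end[i] &
      stations$max_year >= periods$start[i])
})
periods
# the "Pre" bucket: stations with data before 1850
sum(stations$min_year < 1850)
```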

Geographically the stations are distributed like this:

[Maps: 1850–1880 and 1880–1910]

[Maps: 1910–1940 and 1940–1970]

[Maps: 1970–2000 and 2000–2016]

With these 43K stations, the next step is to geolocate them in the HYDE 3.1 population density grids. As we discussed, the HYDE dataset goes back to the beginning of our data and comes in 5-minute grids. For what follows we will only be looking at post-1850 data, although we can go back earlier if there is a relevant question. Extracting the data left roughly 3K stations on the floor: they had no population data. Some of this is due to the HYDE data set lacking population for Antarctica, and some of it is due to the fact that we also have data from buoys, oil platforms, small islands, atolls, and stationary ships. Some of it is due to location errors, where a coastal station has a latitude and longitude that is in the water. For now I set these aside, so we are working with ~40K stations.
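
A minimal sketch of the extraction step. The HYDE file name is illustrative and the `stations` table (with lon/lat columns) is a stand-in for the Berkeley inventory:

```r
library(raster)  # also loads sp
# File name is illustrative; HYDE 3.1 ships decadal ASCII density grids.
hyde <- raster("popd_2005AD.asc")
pts <- sp::SpatialPoints(stations[, c("lon", "lat")],
                         proj4string = sp::CRS("+proj=longlat +datum=WGS84"))
stations$popd <- extract(hyde, pts)       # density of the 5-minute cell under each station
dropped  <- subset(stations, is.na(popd)) # the ~3K stations with no population data
stations <- subset(stations, !is.na(popd))
```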

After collating the population I decided to recode it into bins for display. Since the log of population confused some folks, I decided a descriptive binning might aid the discussion. So I used a slightly modified approach from the RPA, which sets up the following categories (a sketch of the recoding follows the list):

Natural: < 25 people per sq mile
Rural: 26–150
Exurban: 151–500
Sprawl: 501–2,500
Dense: 2,501–10,000
Urban Core: 10,000+
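
Here is roughly what that recoding looks like in R, assuming a hypothetical `popd_sqmi` column in people per square mile. Note the break at 5,000 rather than 10,000, for the reason explained just below.

```r
# Density assumed in people per square mile to match the RPA breakpoints;
# the Urban Core threshold is lowered from 10K to 5K as explained below.
breaks <- c(-Inf, 25, 150, 500, 2500, 5000, Inf)
labels <- c("Natural", "Rural", "Exurban", "Sprawl", "Dense", "Urban Core")
stations$class <- cut(stations$popd_sqmi, breaks = breaks, labels = labels)
table(stations$class)
```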

Since there are so few urban core sites in the data, I lowered the threshold from 10K to 5K. To give you an idea of how the station locations change over time, I've sampled them at 1850 and at 2005. This doesn't mean these locations were all populated with stations in 1850; rather, it just shows the population "class" of the grid cells that over time will have stations in them. For example, in 1850, 35K of the locations have populations less than 25 people per sq mile (~10 people per sq km). Over time those locations get developed and transition to other classes. In 2005, 25K of the locations are still "Natural" by this classification scheme. Note that doesn't mean humans haven't altered that landscape; it suggests, however, that they haven't turned it into NYC. Also, in 1850 there are a small number of locations that have dense populations ("Sprawl" class).

[Figure: GridPopulationClass]

At this point we could simply divide our stations into two piles: one pile of Urban Core, Dense, and Sprawl, and a second pile of Exurban, Rural, and Natural, and then start to look at higher-resolution datasets for other changes to the surface. Conceptually, a pile of "urban" sites, where we know most of the causes of UHI are present, and a second pile where the changes to the surface are much less dramatic; let's call it VHI (village heat island) and potentially microsite (which can happen anywhere).

Or we could make some refinements to the population screen. One concern is that rural and natural areas can occur adjacent to urban areas; UHI doesn't know about city borders or grid cell borders. There are a couple of approaches to handle this. One would be to start with all known cities and build buffers around the cities. The other approach is to start with the locations and determine if they are adjacent or close to any urban areas. The question, of course, is how big should we make these buffers? And do we have any observational evidence that helps us set the buffer? Because the HYDE data is 5 minutes, there is already a buffer of sorts built in, but it is at most 4-8km. Modelling of UHI can provide some guidance, and there are a couple of relevant studies that give empirical guidance: the De Bilt study and the recent BUCL study.

In the Birmingham study a dense network of stations was studied for 20 months to determine how UHI can spread from the urban core to surrounding areas. Birmingham is the second-largest city in England, with a population of 1.1M; there are ~250 cities in the world the same size or larger. The study is listed below. One important takeaway: for a city of this size, the data indicated that "rural" sites 12km away could be affected.

Below is a map of Birmingham population and three station locations. Two stations are in the urban core and a third is located outside the core.

[Figure: Birmingham2]

So we can add a condition to our filter and "cull out" similar situations. To do this culling, a buffer was created around every station. The population density of Birmingham was used as a guide, and the population classes were recoded to indicate if a station was close to a dense population zone. The BUCL study seems to indicate that in worst-case conditions (wind dependent) rural sites 12km away could be affected; to be on the safe side I used a 20km buffer. Below, stations have been recoded ("_U") to indicate whether they are within 20km of an urban core that is as dense or denser than Birmingham.

 

[Figure: Adjacent2]
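
A sketch of the adjacency scoring. `dense_cells` is a hypothetical table of grid-cell centers at Birmingham-like density; `spDists` returns great-circle distances in km, and a full 40K-station distance matrix may need to be chunked in practice.

```r
library(sp)
# dense_cells: lon/lat of grid cells at Birmingham-like density (~5K+ per sq km); hypothetical.
d <- spDists(as.matrix(stations[, c("lon", "lat")]),
             as.matrix(dense_cells[, c("lon", "lat")]), longlat = TRUE)  # km
stations$near_core <- apply(d, 1, min) <= 20
stations$class_u <- ifelse(stations$near_core,
                           paste0(stations$class, "_U"),
                           as.character(stations$class))
table(stations$class_u)
```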

Of the ~25K "Natural" sites, 66 locations were within 20km of an urban core. Of the roughly 8K Rural sites, ~200 were close to urban cores. Of the roughly 4K Exurban locations, around 10% were within 20km of urban cores. And lastly, roughly a third of the 2,500 "Sprawl" sites were close to cores the size of Birmingham.

The other empirical study is the De Bilt study. Below is the population grid for the De Bilt area:

[Figure: DeBiltPop]

The study is listed below. The population density here is somewhat lower: Utrecht has on the order of 260K people and Zeist has roughly 60K. They indicate that over 100 years, the UHI at De Bilt amounts to 0.1C ± 0.06C, or roughly 10% of its trend over time. (For reference, our adjustment code lowers the trend at this site.) With a population of 60K people, Zeist is roughly 4-5km from the site. The next step will be to see what using these figures as a further filter does. As a quick look at the issue, I took the locations and populations of 66,000 cities and villages around the world, reduced these to the cities with a population of 50K or more, and then calculated distances from every site, finding which sites are close to cities of this size. In the next step I'll apply this filter as well. The following gives you an idea of how many sites are close to smaller population centers. Four distance classes are used for display purposes only (0-20km, 20-50km, 50-100km, 100+ km). For example, there are ~4,000 locations that are "Natural" and located 20-50km from any city of 50K or more.

[Figure: Distance&PopClass]
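
A sketch of that city-distance screen, assuming a hypothetical `cities` table with lon, lat, and pop columns; only places of 50K or more are kept before computing the nearest-city distance.

```r
library(sp)
# cities: hypothetical table of ~66,000 places with lon, lat and pop.
big <- subset(cities, pop >= 50000)
dmin <- apply(spDists(as.matrix(stations[, c("lon", "lat")]),
                      as.matrix(big[, c("lon", "lat")]), longlat = TRUE), 1, min)
stations$dist_class <- cut(dmin, breaks = c(0, 20, 50, 100, Inf),
                           labels = c("0-20km", "20-50km", "50-100km", "100+km"),
                           include.lowest = TRUE)
table(stations$class, stations$dist_class)   # e.g. Natural x 20-50km
```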

Discussion.

I'd like to keep the discussion focused on the issues of population and leave the discussion of adjustments for later.

 

Reading suggestions

Stewart ID. 2011. A systematic review and scientific critique of methodology in modern urban heat island literature. International Journal of Climatology 31: 200–217. doi:10.1002/joc.2141.

De Bilt study: http://onlinelibrary.wiley.com/doi/10.1002/joc.902/abstract

BUCL study: http://onlinelibrary.wiley.com/doi/10.1002/qj.2836/pdf

Population

A few readers here at rankexploits have asked questions or made claims about population as it relates to UHI, so I took some time to start a new project looking at population over time at the various stations in the Berkeley Earth dataset. When it's done, the code will be on GitHub for folks who want to see what I've done. Today we will just scratch the surface, and depending on your questions I may go in the directions that are brought up in conversation.

Population is often considered as a regression variable for UHI. This started with Oke, who related population to the maximum UHI detected at various cities. Over time, however, population has been supplemented as an explanatory variable for UHI by other variables that relate to the urban energy balance. Population still plays a role here insomuch as it contributes to the anthropogenic heat flux in energy balance approaches, but on its own it doesn't do a very effective job of explaining UHI; other factors matter more. We can discuss that in the comments, but for now we will just start by collecting what data there is on population and showing how it varies station to station and over time.

The source for the data is the HYDE 3.1 dataset, a 5-minute (roughly 8km) grid. I've compiled the dataset back to 1700. There are various metrics we could use: total population count, urban count, rural count, and population density. I selected density. The data exists for every decade up to 2000, and then 2005. City size follows Zipf's law, or a power law, and if we look at a histogram of population by grid cell we will find that large cities are very rare and that most of the world is undeveloped. If you want to have some fun reading about this topic, start here for an easy introduction.

I mention this because people are often fooled when commentators select certain cities to demonstrate UHI. They tend to select the large cities, and of course there are very few of those. The vast majority of the planet is unpopulated, and as we will see, only a tiny slice of station data comes from these large cities. The vast majority of stations are not in areas that you would call urban. In the end, however, we can just dispense with the terminology "urban" and "rural" and substitute objective metrics that capture the causes of UHI.

To make the data somewhat easier to visualize, I've taken the density and re-coded it. To keep it simple I've used the base-10 log of the population density, so on the following map values of 0 represent no population (0 to 1 person per sq km), 1 represents 1-10 people, 2 = 11-100, 3 = 101-1000, 4 = 1001-10000, and 5 is 10K+. And we see what we all know: the earth is largely unpopulated.

[Figure: DensityLog10]
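
For reference, a sketch of that recoding applied to the density grid (the `hyde` raster from the first post; exactly where the boundary at 1 person per sq km falls is a judgment call):

```r
library(raster)
dens <- values(hyde)   # people per sq km from the HYDE density grid
code <- ifelse(dens <= 1, 0, pmin(ceiling(log10(dens)), 5))
logmap <- raster(hyde)
values(logmap) <- code
plot(logmap)           # 0 = 0-1 person, 1 = 1-10, ..., 5 = 10K+
```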

One concern over UHI is that when we sample the earth, we generally sample it where people live, and thus we may have a sample that is infected by UHI. To give you a sense of this I have divided the 43K stations into two classes. Since the HYDE data ends in 2005, I defined one set of stations as "active" if they had temperature data after 2004; stations that ended before 2004 are "not active". In the future I'll develop a population history for every station depending on its start date and end date. But for now, let's just look at what kind of population density we see at active stations:

[Figures: PopulationHistogramBoth and CDF]
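
Roughly how the split and the CDF comparison could be produced, with `max_year` again standing in for whatever column holds the last year of data:

```r
# max_year is a hypothetical column holding the last year with temperature data.
active   <- stations$popd[stations$max_year >= 2005]
inactive <- stations$popd[stations$max_year <  2005]
plot(ecdf(active), main = "Population density at station locations",
     xlab = "people per sq km")
lines(ecdf(inactive), col = "red")
legend("bottomright", legend = c("active", "not active"),
       col = c("black", "red"), lty = 1)
```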

A couple of points before we move on. First, note that in the active data roughly 15% of the stations are located in areas that have zero population within 5 arc minutes of the station location. Roughly 90% of the data comes from stations that have less than 1000 people per sq km. For reference, Hansen defines rural as roughly less than 10 people per sq km.

Those numbers are hard to visualize, so I selected random locations in the US that represent zero people, 10 people per sq km, 100, 1000, 10000, and then the maximum.

Below is a random US location with zero people

[Figure: ZeroPopmap]

10 people per sq km is pictured below

[Figure: Tensqkm]

100 people per sq km is depicted below

[Figure: Hundredsqkmdetail]

One thousand per sq km is depicted below; roughly 90% of stations have population densities less than that. A couple of points here. If you live in an area where lot size is roughly an acre, and you have the average number of people living in your house, then guess what number of people per sq km that represents? Or if you live in an area with a standard 0.35-acre plot? (A worked example follows the figure.)

[Figure: ThousandsqkmDetail]
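
To answer the lot-size question, a back-of-envelope calculation (one house per lot; a household size of 2.6 is an assumption):

```r
acres_per_sqkm <- 247.105   # 1 sq km = 100 hectares = 247.105 acres
household <- 2.6            # rough average household size (assumption)
people_per_sqkm <- function(lot_acres) (acres_per_sqkm / lot_acres) * household
people_per_sqkm(1)          # ~640 people per sq km for one-acre lots
people_per_sqkm(0.35)       # ~1,840 for 0.35-acre lots
```

So one-acre suburbia lands around 640 people per sq km, well under the 1000 line, while 0.35-acre lots land around 1,800, above it.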

10 thousand people per sq km is shown below. One question to ask here is how we get to densities this large.

[Figure: TenThousandsqkm]

 

As you can see, once we get to 10K people per sq km we have a landscape that is substantially transformed; and it's the transformation of the landscape, not so much the number of people living there, that matters. We can discuss that going forward, but for now it's important to understand that 90% of the sites used by Berkeley Earth are not urban. Whether we use somewhat arbitrary population cutoffs (1000 people per sq km is suburban) or measures of how much the landscape is transformed, the vast majority of stations are placed in areas that are not urban. For the most part you don't find UHI in the global record because 90% of the stations are not located in areas that have the surface features that cause UHI.

Lastly, below is a picture of the densest area in the US.

 

[Figure: MaxUSdetail]

Oops. That was zoomed in on Central Park. Here we are at altitude.

[Figure: MaxUS]

I switched those last two to make a point. When we say "urban" we have to remember that the urban landscape is heterogeneous, as is UHI. It may matter where exactly the station is located.

Finally, when we talk about UHI studies, the vast majority are done for locations that have densities much greater than 1000 people per sq km. There is very little study of UHI in areas that have population densities less than 1000 people per sq km, and the reason should be obvious from the photos: those locations are not urban.

 

BUCL

BUCL Study Area

[Figure: BUCLStudy]

BUCL Site W07

[Figure: Wo7hires]

BUCL Site W04

[Figure: W04]

BUCL Site W04, Hi-res

[Figure: W04hires]

Paradise Detail

[Figure: ParadiseHires]

Statistics for Non-Adjacent Stations

Stations are scored by their adjacency to urban areas, here using the BUCL parameters as a guide. The city center has a population density of ~5,000 people per sq km. Below, the population density is shown for the study area; Paradise, W04 and W07 are shown for illustration.

[Figure: Birminghampop]

The filtered stations include only: A) stations active in 2005, and B) stations not adjacent to grid cells with population densities of 5K per sq km or higher. Adjacency is determined at a 20km range.
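
As a sketch, the filter is just two conditions (column names hypothetical):

```r
# Hypothetical columns: max_year (last year with data) and dist_to_5k
# (km to the nearest grid cell with >= 5K people per sq km).
isolated_active <- subset(stations, max_year >= 2005 & dist_to_5k > 20)
nrow(isolated_active)
```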

 

Isolated & Active

 

[Figure: IsolatedActiveCDF]

 

 

Berkeley Earth & AIRS

In previous comments Robert Way referred to some AIRS work that I've been doing, so I thought I'd share that with Lucia's readers. Some background first. Over the course of the last six months or so I've been collating a variety of satellite data sets. There are several projects I'm working on; some are related to testing R packages, but for the most part I am looking at ways to improve our temperature products over the 1979-to-present period. As you know, we solve the temperature field by first regressing out latitude and altitude and then kriging the residual, or "weather". There are two primary concerns: how high we can push the resolution (we are at ¼ degree) and whether we are over-smoothing the field. To improve the resolution I've been looking at regressing out other geographical features that drive temperature, some of which change over time. Here is a short list of some of the geographical features we've been looking at: distance from coast, percent of water in the grid cell, terrain aspect and slope, cold air drainage, local vegetation, albedo, and impervious area. More long time series of these features are coming on line; for example, for impervious area I should be able to use Landsat and create a monthly global gridded series, or use monthly albedo (white sky or black sky). The issue of local smoothness is related, and in our AGU poster we compared the smoothness of the field to other products: PRISM, UAH, RSS, and several reanalyses. The problem with UAH and RSS is that their resolution is 2.5 degrees. I swapped mails with Roy about the possibility of re-gridding down to 25km, but it would be a lot of work. At AGU I ran into the AIRS team, and they had some products at 1 degree and another product at 45km. So I got curious.
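
For readers who want the flavor of the regress-then-krige step, here is a minimal gstat sketch. This is not the Berkeley code: the column names, the prediction grid, and the exponential variogram with those starting values are all assumptions.

```r
library(sp)
library(gstat)
# stations: hypothetical data frame with lon, lat, elev and a monthly anomaly `temp`;
# newgrid: hypothetical SpatialPoints/SpatialPixels of prediction locations.
fit <- lm(temp ~ lat + elev, data = stations)  # regress out latitude and altitude
stations$res <- resid(fit)
coordinates(stations) <- ~ lon + lat
vg <- variogram(res ~ 1, stations)
vm <- fit.variogram(vg, vgm(psill = 1, model = "Exp", range = 500))
kr <- krige(res ~ 1, stations, newdata = newgrid, model = vm)
# final field = lm prediction (the climatology) + kriged residual (the "weather")
```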

The AIRS data is in a nasty format called HDF. A single month of AIRS data comes in a 500MB HDF file, and that file has over 700 SDS (scientific data sets). As luck would have it, the guys I worked with on the MODIS project in R were also working on a new utility (gdalUtils) that makes reading and unpacking HDF in R really easy. So I started working on AIRS. To give you a hint, with AIRS I've got cloud layers at 12 different pressure levels (1018 hPa up to 22 hPa, or something like that), which allows me to look for effects on low clouds in conjunction with changes in GCR. Plot spoiler: no effect. When Cowtan and Way came out and folks wanted to question their use of UAH, my first thought was to look at AIRS, as it gives me temperatures from the skin up through multiple layers. And if I really want to push the envelope, they have a swath-level product (45km) at 100 pressure levels. Working with swath data would require another terabyte drive, so for this preliminary probing I stuck with the gridded 1-degree data. It's days of download time if anybody wants the code.
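
The unpacking step looks roughly like this. The file name and the SDS name pattern are assumptions, but `get_subdatasets()` and `gdal_translate()` are the actual gdalUtils calls:

```r
library(gdalUtils)
library(raster)
f <- "AIRS.2010.01.L3.RetStd_IR.v6.hdf"               # illustrative file name
sds <- get_subdatasets(f)                              # lists the 700+ SDS inside the file
skin <- grep("SurfSkinTemp_A", sds, value = TRUE)[1]   # SDS name is an assumption
gdal_translate(skin, "skin_temp.tif")                  # unpack one SDS to GeoTIFF
r <- raster("skin_temp.tif")                           # the gridded 1-degree field
```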

The data set I've selected is AIRS Version 6, Level 3 data. In particular, I've selected a few interesting files from the over 700 climate data files that sensor delivers. I showed a few charts in my post at Judith's; I'll repeat those here:

 

[Figures: Berkeley Earth Surface Temperature (Figure 4D) and AIRS Surface Air Temperature]

 

Given the differences in methodologies for each of these products, and given that they measure slightly different parameters, you can't simply compare them. Below I show what the differences look like. AIRS skin temperature estimates the temperature of the surface, whereas Berkeley measures the air above the surface; consequently, that difference shows up in the mean difference maps.

[Figure: BerkversusAirSat]

 

[Figure: BerkVersusAirsSkinmap]

For a sense of whether using these satellite retrievals to inform our surface estimation will work or not, I look at the correlation. The effect of clouds is pretty evident, but over land we have good correlations:

[Figure: SATCorrelationFinal]
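
A sketch of how such a per-cell correlation map can be computed, using hypothetical brick names and a crude minimum-overlap rule:

```r
library(raster)
# berk and airs: hypothetical RasterBricks on a common grid, one layer per month.
vb <- values(berk)
va <- values(airs)
cc <- sapply(seq_len(nrow(vb)), function(i) {
  ok <- complete.cases(vb[i, ], va[i, ])
  if (sum(ok) > 24) cor(vb[i, ok], va[i, ok]) else NA  # require 2+ years of overlap
})
cormap <- raster(berk)
values(cormap) <- cc
plot(cormap)
```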

 

In terms of trends, it looks like this:

[Figure: GlobalAnomalies]

The Berkeley average falls in between the Skin trend and the SAT trend, as we would expect.

Finally, I'll show the results for the Arctic (60N-90N). How well does our baseline estimation of the Arctic compare with AIRS Skin and AIRS SAT? Here is what the correlation looks like for SAT, followed by the anomalies.

[Figure: Corairssatarct]

[Figure: ArcticAnomaliesCompare]

And this is what the difference series looks like. The spike in late 2003, I'm betting, will be cloud-related.

[Figure: ArcticAnomalyDiff]

At this stage what I'm seeing is that over the 2002-to-present period our method captures the trend in the Arctic pretty well, and the groundwork here would get passed on to others to see if it was helpful or held up under examination. I haven't done any comparisons with Cowtan and Way or HADCRUT. From my standpoint it made more sense to pass this data on to Robert and Kevin and let them delve into it deeper. However, I've yet to see any data from any source that would make me change my assessment that HADCRUT is biased low; I suppose one could argue it is the lower bound. In addition to this data, there will also be some ice surface temperature series coming available that cover the entire 1980ish-to-present period.

My concluding remarks come down to this: Cowtan and Way, Berkeley Earth, and GISS all estimate what the non-sampled areas in the Arctic would look like. The way to test these predictions is by looking at out-of-sample data where we can find it (buoys, for example) and by comparing the trends we see from different products, reanalyses, and satellite data sets. AIRS is one more data set to look at, and the answer seems pretty clear.

The last note I will make is that the similarity in trends between Berkeley and AIRS should give one pause about claiming large UHI effects over the time period covered.

 

If folks want the time series, just say so in the comments and I'll upload them to Google Docs.