The last post brought up some interesting issues that will help guide the exploration of population at stations over time. Rather than add to the first post, I’ll do another post. If folks raise interesting comments and I can quickly put up clarifying graphics, I will append them to the post and announce that in the comments. First, a little roadmap of where I hope to go. I’m putting together a repository of some new code for working with Berkeley data, so this project is a way of testing that out and adding the functionality one needs to do an actual project. In general terms I want to test out some new metadata and some different ways of classifying stations and looking for UHI. In the end I want to see how we end up adjusting, or not adjusting, places that we would classify as urban. How well, if at all, does the adjustment process work on this problem? We of course have a global answer, but how does it look in detail, station by station? And are there ways to improve it?
From 100K feet the plan is to build a filter that can hopefully divide urban from rural, or provide some categories or a continuous measure of urbanity. In the past I’ve built filters that just contained everything: population density, built area, nightlights, airports. I’ve also experimented with “buffers” around these elements to ensure that rural stations are truly isolated. You can think of the filter as having two stages. In the first stage we try to classify sites where we have good observational reasons for suspecting UHI. From observational studies we know that large cities with tall buildings and many people suffer from the most UHI. The combination of tall buildings, built surfaces, and human activity creates UHI. As we move away from the city toward a more pristine environment, some of the causes of UHI (namely tall buildings, dense impervious surfaces, and human activity) diminish, and it should follow that UHI diminishes.
At the limit, as the number of buildings shrinks and the surface transformations become smaller, our concerns become microsite concerns. Or we could refer to three different scales: the city scale (mesoscale), the neighborhood scale, and finally the backyard or microscale. The approach I want to take is to first categorize the easy-to-categorize stations using population. After the first filtering we end up with two piles: one pile that is clearly urban and has all the causes of UHI present, and another pile that is necessarily going to be more arguable. The second filtering will be applied to these remaining stations. It will use some new high resolution satellite products. Another way to look at this is that in the first pass we will build a pile that has all the known causes of UHI (tall buildings, many people, dense development), and in the second pile we will have far fewer people and little development. At least that’s the plan. As always, if interesting sidelines come up I will take a look at them, but some things may naturally get diverted to the second stage.
The first post raised some interesting questions, namely about the number of stations, the number of stations over time, and what do we do about stations that are close to urban cores, or stations in the transition zone (TZ). A good example of a TZ station is de Bilt (http://onlinelibrary.wiley.com/doi/10.1002/joc.902/abstract). Up until recently this was one of the few empirical studies of UHI at the outskirts of cities. A more recent study is here: http://onlinelibrary.wiley.com/doi/10.1002/qj.2836/pdf . More on those two studies toward the end.
Stations:
The stations used come from Berkeley Earth’s data set, which ingests data from 14 or so different source decks. Many of these sources contain duplicate stations, stations lacking metadata, or shorter stations not collated in the usual inventories like GHCN_Monthly. I’ll point out some of that as we look at the maps. For the most part, series like GISS and HADCRUT depend upon an anomaly period (1951-80, 1961-90), such that if a station doesn’t have data during that period it isn’t used. After merging our source decks we end up with ~43K different stations. This is prior to any “slicing”. Many of the stations are shorter series; for example CRN, the gold standard in the US, starts around 2005. Other data products don’t use this data. What that means is that stations come and go, so when we look at them over time we will see that there is no time at which all 43K are present. Below I’ve collated all the stations that appear within a given 30 year period. Note, this doesn’t mean they are all at least 30 years long. The “Pre” period is stations that exist before 1850. For reference the current station count (May 2016) was ~19K.
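To make the collation concrete, here is a minimal Python sketch of counting stations by 30 year period. It is an illustration only, not the code behind the figures: the function names, the `(first_year, last_year)` record layout, and the choice of period boundaries starting at 1850 are all assumptions.

```python
from collections import Counter

def period_label(year, start=1850, width=30):
    """Return a label such as '1850-1879', or 'Pre' for years before `start`."""
    if year < start:
        return "Pre"
    lo = start + ((year - start) // width) * width
    return f"{lo}-{lo + width - 1}"

def stations_per_period(stations, start=1850, width=30):
    """Count the stations that report at any time within each period.

    `stations` maps a station id to (first_year, last_year), inclusive.
    This layout is hypothetical. A station is counted once in every
    period its record touches, no matter how short the record is.
    """
    counts = Counter()
    for first, last in stations.values():
        labels = {period_label(y, start, width) for y in range(first, last + 1)}
        counts.update(labels)
    return counts
```

Because a station is counted in every period it touches, the per-period counts sum to more than the number of distinct stations, and a short record such as a CRN station only shows up in the most recent period.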
Geographically the stations are distributed like this
With these 43K stations the next series of steps is to geolocate them in the Hyde 3.1 population density grids. As we discussed, the Hyde dataset goes back to the beginning of our data and comes in 5 arc-minute grids. For what follows we will only be looking at post 1850 data, although we can go back earlier if there is a relevant question. Extracting the data left roughly 3K stations on the floor. They had no population data. Some of this is due to the Hyde data set lacking population for Antarctica, and some of it is due to the fact that we also have data from buoys, oil platforms, small islands, atolls and stationary ships. Some of it is due to location errors, where a coastal station has a latitude and longitude that is in the water. For now I set these aside, so we are working with ~40K stations.
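The extraction step amounts to turning a station’s latitude and longitude into a row and column of the grid and reading the value there. The sketch below shows the idea in Python. The grid orientation (row 0 at the north edge, column 0 at 180W) and the `-9999.0` no-data value are assumptions for illustration, so check them against the header of the actual grid files before relying on this.

```python
import math

CELLS_PER_DEG = 12              # a 5 arc-minute grid has 12 cells per degree
N_ROWS = 180 * CELLS_PER_DEG    # 2160 rows, pole to pole
N_COLS = 360 * CELLS_PER_DEG    # 4320 columns around the globe

def grid_index(lat, lon):
    """Map degrees of latitude/longitude to (row, col).

    Assumed layout: row 0 touches the north pole, col 0 touches 180W.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    # min() keeps lat = -90 and lon = 180 inside the last row/column.
    row = min(int(math.floor((90.0 - lat) * CELLS_PER_DEG)), N_ROWS - 1)
    col = min(int(math.floor((lon + 180.0) * CELLS_PER_DEG)), N_COLS - 1)
    return row, col

def extract_population(grid, lat, lon, nodata=-9999.0):
    """Return the density in the station's cell, or None if the cell is empty.

    `grid` is anything indexable as grid[row][col]; `nodata` is a
    hypothetical sentinel for ocean and other unpopulated cells.
    """
    row, col = grid_index(lat, lon)
    value = grid[row][col]
    return None if value == nodata else value
```

Stations that come back as `None` are the ones left on the floor: buoys, platforms, and coastal stations whose coordinates fall in a water cell.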
After collating the population I decided to recode it into bins for  display. Since the log of population confused some folks I decided a  descriptive binning might aid the discussion. So I used a slightly modified  approach from RPA. They set up the following categories.
Natural: < 25 per sq mile
Rural: 26 to 150
Exurban: 151 to 500
Sprawl: 501 to 2,500
Dense: 2,501 to 10,000
Urban Core: 10K+
Since there are so few urban core sites in the data, I lowered the threshold from 10K to 5K. To give you an idea of how the station locations change over time, I’ve sampled them at 1850 and at 2005. This doesn’t mean these locations were all populated with stations in 1850; rather, it just shows the population “class” of the grid cells that over time will have stations in them. For example, in 1850 35K of the locations have populations of less than 25 people per sq mile (roughly 10 people per sq km). Over time those locations get developed and transition to other classes. In 2005 25K of the locations are still “Natural” by this classification scheme. Note that this doesn’t mean humans haven’t altered that landscape; it suggests, however, that they haven’t turned it into NYC. Also, in 1850 there are a small number of locations that already have dense populations (“Sprawl” class).
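The recoding into classes can be sketched in a few lines of Python. This is a hedged illustration rather than the code used for the figures: the names are made up, the upper bound of each class is treated as inclusive (the published bins leave the exact edges open), and the Urban Core threshold is set to 5K as described above. The conversion factor is included on the assumption that the grid reports people per sq km, while the classes are defined per sq mile.

```python
from bisect import bisect_left

SQ_KM_PER_SQ_MILE = 2.589988    # 1 sq mile is about 2.59 sq km

# Inclusive upper bound of each class, in people per sq mile.
# The last class, Urban Core, is everything above 5,000 (lowered from 10K).
UPPER_BOUNDS = [25, 150, 500, 2500, 5000]
CLASSES = ["Natural", "Rural", "Exurban", "Sprawl", "Dense", "Urban Core"]

def per_sq_km_to_per_sq_mile(density_per_sq_km):
    """Convert a density from people per sq km to people per sq mile."""
    return density_per_sq_km * SQ_KM_PER_SQ_MILE

def classify(density_per_sq_mile):
    """Return the population class for a density in people per sq mile."""
    if density_per_sq_mile < 0:
        raise ValueError("density cannot be negative")
    return CLASSES[bisect_left(UPPER_BOUNDS, density_per_sq_mile)]
```

For example, `classify(per_sq_km_to_per_sq_mile(400))` treats a cell with 400 people per sq km as roughly 1,036 per sq mile, which lands in the “Sprawl” class.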
At this point we could simply divide our stations into two piles: one pile of Urban Core, Dense Suburb, and Sprawl, and a second pile of ExUrban, Rural and Natural, and then start to look at higher resolution datasets for other changes to the surface. Conceptually, that gives a pile of “Urban” sites, or sites where we know most of the causes of UHI are present, and a second pile where the changes to the surface are much less dramatic, let’s call it VHI (village heat island), and potentially microsite (which can happen anywhere).
Or we could make some refinements at the population screen. One concern is that rural areas and natural areas can occur adjacent to urban areas. UHI doesn’t know about city borders or grid cell borders. There are a couple of approaches to handle this. One would be to start with all known cities and build buffers around the cities. The other approach is to start with the locations and determine if they are adjacent or close to any urban areas. The question, of course, is how big should we make these buffers, and do we have any observational evidence that helps us to set the buffer? Because the Hyde data is on a 5 arc-minute grid there is already a buffer of sorts built in, but it is at most 4-8km. Modelling of UHI can provide some guidance, and there are a couple of relevant studies that give empirical guidance: the De Bilt study and the recent BUCL study.
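To get a feel for how large the built-in buffer of a 5 arc-minute cell is, the sketch below computes the cell’s north-south height and east-west width at a given latitude. It assumes a spherical Earth with a 6371 km radius, and the function name is hypothetical. The height is the same everywhere, while the width shrinks with the cosine of latitude.

```python
import math

EARTH_RADIUS_KM = 6371.0    # mean radius, spherical approximation
CELL_DEG = 5.0 / 60.0       # 5 arc-minutes expressed in degrees

def cell_size_km(lat):
    """Return (height_km, width_km) of a 5 arc-minute cell at latitude `lat`."""
    km_per_deg = math.pi * EARTH_RADIUS_KM / 180.0   # about 111.2 km per degree
    height = CELL_DEG * km_per_deg
    width = height * math.cos(math.radians(lat))
    return height, width
```

How much separation a cell really provides depends on where in the cell the station sits, which is one reason an explicit buffer is worth adding on top of the grid.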
In the Birmingham study a dense network of stations was studied for 20 months to determine how UHI can spread from the urban core to surrounding areas. Birmingham is the second largest city in England, with a population of 1.1M. There are ~250 cities in the world the same size or larger. The study is listed below. One important takeaway: for a city of this size the data indicated that “rural” sites 12km away could be affected.
Below is a map of Birmingham population and three station locations. Two stations are in the urban core and a third is located outside the core.
So we can add a condition to our filter and “cull out” similar situations. To do this culling, a buffer was created around every station. The population density of Birmingham was used as a guide, and the population classes were then recoded to indicate whether a station was close to a dense population zone. The BUCL study seems to indicate that in worst case conditions (wind dependent) rural sites 12km away could be affected. To be on the safe side I used a 20km buffer. Below, stations have been recoded (_U) to indicate whether they are close to (within 20km of) an urban core that is as dense as or denser than Birmingham.
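One way to picture the recoding is a distance test between each station and the centers of the grid cells that qualify as urban core. The Python sketch below is only an illustration under stated assumptions: the record layouts and function names are hypothetical, great-circle distance on a spherical Earth stands in for whatever buffer construction was actually used, and selecting which cells count as “as dense as Birmingham” is left to the caller.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def recode_near_core(stations, core_cells, buffer_km=20.0):
    """Append '_U' to the class of every station within `buffer_km` of a core.

    `stations` is a list of (station_id, lat, lon, pop_class) tuples and
    `core_cells` a list of (lat, lon) cell centers; both are hypothetical
    layouts. Returns a dict of station_id -> recoded class.
    """
    recoded = {}
    for sid, lat, lon, pop_class in stations:
        near = any(
            haversine_km(lat, lon, clat, clon) <= buffer_km
            for clat, clon in core_cells
        )
        recoded[sid] = pop_class + "_U" if near else pop_class
    return recoded
```

This brute-force loop compares every station with every core cell, which is simple but slow for tens of thousands of stations; a spatial index would be the natural refinement.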
For the “Natural” sites (~25K total), 66 locations were within 20km of an urban core. Of the roughly 8K Rural sites, ~200 were close to urban cores. Of the roughly 4K ExUrban locations, around 10% were within 20km of urban cores, and lastly roughly 1/3 of the 2500 “Sprawl” sites were close to cores the size of Birmingham.
The other empirical study is the De Bilt study. Below is the population grid for the De Bilt area
The study is listed below. The population density here is somewhat lower: Utrecht has on the order of 260K people and Zeist has roughly 60K. They indicate that over 100 years the UHI at De Bilt amounts to 0.1C ± 0.06C, or roughly 10% of its trend over time. (For reference, our adjustment code lowers the trend at this site.) Zeist, with a population of 60K people, is roughly 4-5km from the site. The next step will be to see what using these figures does as a further filter. As a quick look at the issue, I took the locations and populations of 66,000 cities and villages around the world. This was reduced to those cities that had a population of 50K or more. Then I calculated distances to every site, finding which sites are close to cities of this size. In the next step I’ll apply this filter as well. The following gives you an idea of how many sites are close to smaller population centers. Four distance classes are used for display purposes only (0-20km, 20-50km, 50-100km, 100+ km). For example, there are ~4000 locations that are “Natural” and located 20-50km from the nearest city of 50K or more.
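The city-distance tabulation can be sketched as follows in Python. It is a rough illustration with assumed names and record layouts: cities are filtered to 50K or more, each site is assigned the distance to its nearest remaining city, and that distance is dropped into one of the four display classes. Treating the lower edge of each class as inclusive is an assumption.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0   # mean radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

DISTANCE_EDGES_KM = [20.0, 50.0, 100.0]
DISTANCE_LABELS = ["0-20km", "20-50km", "50-100km", "100+km"]

def distance_class(d_km):
    """Return the display class for a distance in km."""
    for edge, label in zip(DISTANCE_EDGES_KM, DISTANCE_LABELS):
        if d_km < edge:
            return label
    return DISTANCE_LABELS[-1]

def tabulate_by_city_distance(sites, cities, min_pop=50_000):
    """Count sites per (population class, distance class).

    `sites` is a list of (lat, lon, pop_class) and `cities` a list of
    (lat, lon, population); both layouts are hypothetical. Only cities
    with at least `min_pop` people are considered.
    """
    big = [(lat, lon) for lat, lon, pop in cities if pop >= min_pop]
    if not big:
        raise ValueError("no city meets the population threshold")
    table = Counter()
    for lat, lon, pop_class in sites:
        nearest = min(haversine_km(lat, lon, clat, clon) for clat, clon in big)
        table[(pop_class, distance_class(nearest))] += 1
    return table
```

Reading the result as a table of population class by distance class gives the kind of cross-tabulation described above, for instance the count of “Natural” sites in the 20-50km class.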
Discussion.
I’d like to keep the discussion focused on the issues of population  and leave the discussion of adjustments for later.
Reading suggestions







































