Calculating Annual Average Temperatures: Conditional Averages

[Figure: Annual averaged daily high temp vs. year, VaexJoe]

Today’s “how to teach yourself R” lesson involves calculating annual average temperatures from a string of daily temperatures.

The annual average temperatures, which, I must observe, look fairly trendless, are plotted to the right. That said, there may be a trend buried in there. If so, it’s sufficiently small to require teasing out using statistics.

Now, for those interested in learning R, I will describe what I did in some detail. I will then make a few comments in the final summary.

Background: Recall that, having obtained the large file containing daily high temperatures recorded at VaexJoe, Sweden, I wanted to figure out how to calculate the annual average daily high using functions available through the R project.

Main steps to calculating the average temperatures

It turned out these averages were easy to calculate using the ‘zoo’ package.

The steps I performed to calculate the annual average were:

  1. Load the zoo library.
  2. Create a ‘zoo’ class variable with temperatures indexed by date.
  3. Use ‘aggregate’ to calculate the means as a function of some condition. In my case, the ‘condition’ is the year in which data are collected.

Naturally, after calculating these things, I plotted the results, examined a histogram, and looked at some simple statistics.
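Put together, those steps boil down to something like the compact sketch below. The full session, with R’s output and my running commentary, follows; the file path and column names are the ones from my own data set.

# compact sketch of the whole calculation (full session with output below)
library(zoo)                                                   # attach the zoo package
vaexJoe <- read.table("file:///Users/lucia/Desktop/global/R_Data_Sets/VaexJoe.txt",
                      sep = "\t", header = TRUE)               # read the tab-delimited file
z <- zooreg(vaexJoe$Temp_19, start = as.Date("1918-01-01"))    # daily series indexed by date
means <- aggregate(z, as.numeric(format(time(z), "%Y")), mean) # one mean per calendar year
plot(means)                                                    # annual average daily high vs. year
hist(coredata(means))                                          # histogram of the annual averages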

R project Session Details

I typed the commands shown in blue. When ‘R’ has processed a command, it either spits back the ‘>’ prompt or says something and then spits out the ‘>’ prompt. Lines starting with # are comments I typed into ‘R’.

>#read in my file. (For details see Read Data into R.)

>vaexJoe<- read.table("file:///Users/lucia/Desktop/global/R_Data_Sets/VaexJoe.txt",sep="\t",header=TRUE);

># view the column names of the data frame
>names(vaexJoe)
[1] "SOUID" "DATE" "TX" "Q_TX" "Temp_9" "DAY"
[7] "Temp_19" "EarlyHalf" "LateHalf" "Record" "Day" "Year"

># recall that "Temp_19" contains the cleaned-up daily temperature data, with "NA" (not available) listed wherever the raw file indicated an error.
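(Aside for anyone who missed the earlier post: I’m not repeating how I built ‘Temp_19’ here, but a rough sketch of one way such a cleaned-up column could be made from the raw ECA&D fields is below. The assumption that Q_TX == 0 marks a valid reading and -9999 marks a missing value is mine for this sketch; check the data documentation.)

# hypothetical sketch only (not necessarily how Temp_19 was actually built):
# keep TX where the quality flag says the reading is valid, otherwise use NA
vaexJoe$Temp_19_sketch <- ifelse(vaexJoe$Q_TX == 0 & vaexJoe$TX > -9999,
                                 vaexJoe$TX, NA)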

> # load the zoo library so I can use it. (Details: Attach an r-project package.)
> library(zoo)
># after doing this, R spits out stuff starting with Attaching package: ‘zoo’

># create a ‘zoo’ object with temperatures in column ‘Temp_19’ and dates starting with January 1, 1918, which is the first date in the data file.
>z <- zooreg(vaexJoe$Temp_19, start = as.Date("1918-01-01"))
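(An aside, not part of my session: zooreg assumes the observations are regularly spaced, one per day, starting at that first date. If the rows weren’t guaranteed to be one per day, an alternative sketch would build the index from the DATE column itself, assuming DATE holds integer dates in YYYYMMDD form, as ECA&D files usually do.)

# alternative sketch: index by the DATE column rather than assuming regular daily spacing
# assumes DATE is stored as an integer like 19180101 (YYYYMMDD)
z_alt <- zoo(vaexJoe$Temp_19,
             order.by = as.Date(as.character(vaexJoe$DATE), format = "%Y%m%d"))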

># Ask R to provide a summary because I have no confidence in commands I have never used before.
> summary(z)

     Index                   z
 Min.   :1918-01-01    Min.   :-211.0
 1st Qu.:1939-01-01    1st Qu.:  32.0
 Median :1960-01-01    Median : 102.0
 Mean   :1960-01-01    Mean   : 104.5
 3rd Qu.:1980-12-31    3rd Qu.: 177.0
 Max.   :2001-12-31    Max.   : 344.0
                       NA's   :  57.0

># I notice the first and last dates match, the temperature ranges look right, etc. So, the zoo object seems ok.
># notice there are 57 "NA" values in the full ‘z’ column. Those correspond to the bad data. Previous inspection indicated most occurred during the ’20s and one occurred during WWII.
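(A quick check I could have run to confirm that, shown here as a sketch rather than as part of the session above:)

# sketch: count the missing days and see which years they fall in
missing_days <- time(z)[is.na(coredata(z))]   # dates of the NA observations
length(missing_days)                          # should match the 57 NA's in the summary
table(format(missing_days, "%Y"))             # number of missing days per year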

># Calculate means using the “aggregate” function
>aggregate(z, as.numeric(format(time(z), "%Y")), mean)

># R prints out the mean temperatures for the years from 1918 to 2001. I notice "NA" appears in every year with missing data. At some point, I might care about that, but not today.

># Today, I decide to stuff the aggregate data into a variable called “means” which will be calculated for each year. (Maybe this is a frame or a class? I don’t get R project lingo yet.)
>means<-aggregate(z, as.numeric(format(time(z), "%Y")), mean)
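(If I did decide to care about those NA years, a sketch of one fix: pass aggregate a function that drops the NAs before averaging, so a year with a few missing days still gets a number. A year with no valid data at all would come back as NaN.)

# sketch: average over whatever data a year does have, instead of returning NA
means_narm <- aggregate(z, as.numeric(format(time(z), "%Y")),
                        function(x) mean(x, na.rm = TRUE))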

># Eyeball the plot of average temperature vs. time.
>plot(means)

[Figure: Annual averaged daily high temp vs. year, VaexJoe]

So, what did I learn?

Recall, currently, I’m trying to learn how to use R. I succeeded in

  1. calculating average yearly temperatures from a string of daily temperatures.
  2. learning how R handles missing data. When even one data point is missing from a year’s worth of data, the aggregate function returns “NA” for that full year.
  3. plotting and eyeballing the averaged data. I see that there are a fair number of “NA”s during the ’20s and ’30s and one during the ’40s. I also see that the variability is rather large compared to any trend with time.

What about Global Climate Change?

I know the two or three readers who will visit are going to ask “What about ‘the ultimate’ question? Has Vaexjoe warmed due to Anthropogenic Global Warming?”

Based on what I’ve done so far, the technically correct answer is “Beats me!”

In fact, we haven’t even learned much about the testable question Tamino posed, which was: Was the average Vaxjoe temperature between 1985 and 2001 warmer than the average temperature between 1977 and 1984, and is that difference statistically significant at the 95% confidence level?
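(That comparison isn’t today’s lesson. But for the curious, a rough sketch of how it might eventually be set up in R, using the ‘means’ variable created above: pull out the two periods and run a two-sample test. This glosses over the NA years, which t.test simply drops, and over any serial correlation, which a careful version would need to address.)

# rough sketch of Tamino's comparison -- not a careful analysis
early <- coredata(window(means, start = 1977, end = 1984))   # 1977-1984 annual averages
late  <- coredata(window(means, start = 1985, end = 2001))   # 1985-2001 annual averages
t.test(late, early, alternative = "greater", conf.level = 0.95)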

For now, we can eyeball the data and decide whether or not the answer seems obvious just from looking at the plot.

I’d suggest the answer is no, the answer isn’t obvious. Any trend is small; the scatter is large.

Still, as I once told my students: when the answer is so obvious you can tell by looking at a graph, you usually don’t need to do any statistical tests to convince anyone the trend exists. (Or… you don’t need to do the statistical tests unless they are mandated by law or corporate policy. )

But in this data set, the trend is small compared to the scatter. So, we can’t feel confident a trend exists without doing some statistical tests, and doing them correctly.
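(For anyone who wants to eyeball a fitted line on top of the scatter, here is a sketch. The NA years are dropped automatically by lm, and a slope by itself says nothing about significance.)

# sketch: overlay a least-squares line on the plot of annual averages
plot(means)
fit <- lm(coredata(means) ~ index(means))   # simple linear trend in year
abline(fit)
summary(fit)                                # slope, standard error, etc.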

11 thoughts on “Calculating Annual Average Temperatures: Conditional Averages”

  1. Lucia,

    I similarly downloaded these data when Tamino wrote up his thread, although I use MATLAB rather than R. Your data set seems to have more missing data than mine? I think I only lost one year (1947) for having a substantial number (30-odd?) of missing data points. Might be my dodgy coding though!

    Using the complete data set, and prompted by your initial observations regarding non-normality, I tried the Lilliefors and Bera-Jarque tests for normality. You might be amused by the results! They seem to confirm your initial suspicions.
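    (For anyone who wants to try tests along those lines in R rather than MATLAB, a rough sketch, assuming the add-on packages ‘nortest’ and ‘tseries’ are installed; this is not the code Spence_UK ran.)

    # sketch: normality checks on the daily values (install the packages first)
    library(nortest)                  # provides lillie.test (Lilliefors)
    library(tseries)                  # provides jarque.bera.test (Jarque-Bera)
    x <- na.omit(coredata(z))         # daily highs with the NA days removed
    lillie.test(x)
    jarque.bera.test(x)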

  2. Spence_UK: There are two groups of missing data. One is utterly missing; that happened only in 1947. There is also “suspect” data. You’ll find those occurring in the ’30s. The issues in the ’30s affect very few samples. The issue in the ’40s lasted a full month.

    I created one vector that puts NA in only where the data are utterly missing, and one that puts NA in for both.

    Once I manage to learn R better, I plan to do analyses for both when long term data is useful.

    On the missing data:
    I have developed a romantic theory that sometime during WWII, a thermometer was required by one of the valiant Norwegians who skied over and wiped out the Nazis’ heavy water factory. In this romantic vision, the thermometer in Vaexjoe was the only one available in all Scandinavia, so he took it, fulfilled his mission, then returned the thermometer one month later.

    Do not tell me my history is deficient because the dates are wrong. I like this theory, and I’m sticking to it. (No statistical tests will convince me of the contrary.)

  3. I had a closer look at my data, and I had been ignoring the quality flag (tut!). But that only wipes out a couple of points in my series – 2 days in 1942. The 2 days in 1942 are very cold (perhaps triggering an outlier condition).

    Note also (just for confusion), ECA have two data sets; one “blended” and one “unblended”. I tried both of these, still not getting a good match to your data though. The only criteria I use for quality are:

    Q_TG of zero
    Temperature measurement field more than -9999

    Using these rules I get the 1920s and 1930s fully filled.

    The missing dates in 1942 don’t seem too far from Operation Grouse 🙂 Just a bit before; perhaps it was a rehearsal!

  4. “Was the average Vaxjoe temperature between 1985 and 2001 warmer than the average temperature between 1977 and 1984, and is that difference statistically significant to the 95% confidence level?”

    There’s more to this question, yes.

    What’s the margin of error in the measurements?

    Have there been any attempts to correlate temperature to days of sunlight, humidity levels or the like?

    Are the measurements over time the same to the extent we can compare 1977 to 1985 to 2000? Or have conditions changed making them not comparable? As an example, if we know the thermometers have been consistently off by 2.2C, it doesn’t matter. But if in 1977 they were off .7C and grew to 2.2C by 2000, or were off 2.2C in 1977 shrinking to .7C by 2000, then we have other issues to overcome, or we can’t use the data. (This could be due to either the calibration of the thermometers, replacements with more or less accurate ones, or external forces introducing a bias.) Do we know, and can we adjust for, these types of issues reliably?

    Is “the average Vaxjoe temperature” a quantifiable thing in the first place? Don’t we really mean the derived tMean of the air temperature at the monitoring stations averaged over time to establish a variance from the usual value during the base period?

  5. D’OH

    Just realised, I downloaded the mean temps, you’ve got max temps. That would explain the difference.

    Still fails standard normality tests though 🙂

  6. Neil– These data are from VaexJoe, Sweden. That’s a bit inland, and southish as Sweden goes.

    I’m still not making much progress with R! (I guess I prefer lumped parameter models.)

  7. #658, lucia:

    So there’s no particular reason to expect a trend from a local measurement.

    I guess you’re just trying to see if you can get the application to work?

    (btw: It’s “Neal”, not “Neil”.)

  8. Neal– Yes. That’s my major motivation. Tamino happened to pick this location to discuss doing some data manipulations.
    Sorry about the spelling!

  9. Lucia, if you were a real, bona fide scientist, you would know about data quality and homogeneity problems, and you would also know that global means global and cannot be estimated with one data point.
    It is too bad that the internet provides such an easy way of spreading bullshit science.
    I came to your blog looking for an “R” solution and, by the way, your R skills are poor too.
    Regards
    Enric

  10. Enric–
    Yes. My R skills are pathetic. I was trying to learn them, and these were my self-lessons, but I decided it wasn’t worth it because I can do what I want other ways. Sorry Google sent you here. But, that’s Google for you. Try the email list server.

    Send R-help mailing list submissions to
    r-help@r-project.org

    To subscribe or unsubscribe via the World Wide Web, visit
    https://stat.ethz.ch/mailman/listinfo/r-help
    or, via email, send a message with subject or body ‘help’ to
    r-help-request@r-project.org

    You can reach the person managing the list at
    r-help-owner@r-project.org

    When replying, please edit your Subject line so it is more specific
    than “Re: Contents of R-help digest…”

    I have never suggested that one can make conclusions about global warming based on 1 data point.

Comments are closed.