This post describes what I did to make the very first data file I downloaded readable by R, the statistics package from the R Project. If you obtain a file from some database, it is likely it will not be in a form suitable for reading into R, and this may help you. Because I never feel confident I’ve read data unless I can see it, I’ll also show a histogram of the data in the file.
What did the downloaded file look like?
The first file I downloaded happened to contain temperature data for VaexJoe, Sweden. When I opened it, the top of the file looked like this:
THESE DATA CAN BE USED FREELY PROVIDED THAT THE FOLLOWING SOURCE IS ACKNOWLEDGED:
Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface
air temperature and precipitation series for the European Climate Assessment.
Int. J. of Climatol., 22, 1441-1453.Data and metadata available at http://eca.knmi.nl
FILE FORMAT (MISSING VALUE CODE = -9999):
01-06 SOUID: Source identifier
08-15 DATE : Date YYYYMMDD
17-21 TX : daily Maximum temperature in 0.1 °C
23-27 Q_TX : quality code for TX (0='valid'; 1='suspect'; 9='missing')
This is the blended series of location SWEDEN, VAEXJOE (location-ID: 1).
Blended and updated with sources: 100002
See file source.txt and location.txt for more info.
SOUID, DATE, TX, Q_TX
100002,19180101, 0, 0
100002,19180102, -40, 0
100002,19180103, -75, 0
What does R want to read?
All the information in the top of the file will ultimately be important, but, having read the manual, I knew that to import this data into R, I needed a file that looks more like this:
SOUID, DATE, TX, Q_TX
100002 19180101 0 0
100002 19180102 -40 0
100002 19180103 -75 0
How do we fix that?
To “fix” the file, I opened it in a text editor and simply deleted the information at the top of the file. Then, I imported the file into Excel and saved it as a (tab-delimited) text file. That stripped out the commas.
I then dropped the file into a folder inside a folder on my desktop (hoping for the best.)
I then opened “R” from Applications, moused over some icons, and clicked the one I circled in green below:
That opened a dialog box, and I did what came naturally, and found the file I’d saved.
The “R” screen then said:
> source("/Users/lucia/Desktop/global/R_Data_Sets/VaexJoe.txt")
Error in parse(file, n = -1, NULL, "?") : syntax error at
1: SOUID DATE
This was not at all encouraging. However, not to be deterred, I did a little more reading, and then typed this next to the “>”:
>vaexJoe<- read.table("file:///Users/lucia/Desktop/global/R_Data_Sets/VaexJoe.txt",sep="\t",header=TRUE)
I waited until another “>” appeared and then typed
>names(vaexJoe)
This appeared:
[1] "SOUID" "DATE" "TX" "Q_TX"
Notice these match the header line of the file I intended to read in. They are labels for the columns of data.
I had read the file!
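For anyone following along, a couple of other quick checks would also confirm the read worked; this is just a sketch of standard R commands, not something I typed at the time:
> head(vaexJoe) # show the first few rows of the data frame
> str(vaexJoe)  # show each column's name, type, and a few sample values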
But… what does the file tell me?
What did they mean? Well, the significance of the labels was described in the portion of the file I’d cut out.
- “SOUID” is a station identifier. The value in the file corresponds to VaexJoe, Sweden.
- “DATE” is the calendar date associated with the data, in YYYYMMDD format.
- “TX” is the maximum temperature recorded in VaexJoe on that day. The values are given in 0.1 °C, which means a reading of 12 corresponds to 1.2 °C (see the short conversion sketch after this list).
- “Q_TX” are quality flags. “0” means the data are considered reliable by whoever created this file. “1” means the data are suspect. “9” means the data are total crap.
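Since the stored values are in tenths of a degree, converting to degrees Celsius is just a division by ten. A minimal sketch (the column name “TX_degC” is my own invention, not part of the file):
> vaexJoe$TX_degC <- vaexJoe$TX / 10 # a stored value of 12 becomes 1.2 degrees C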
Look at a histogram
Now that I’d read in data, I wanted to examine the histogram of the data. To simplify typing, I had discovered I could type:
> attach(vaexJoe)
This makes “vaexJoe” the default data frame for the time being. I will often now be able to refer to the data by the labels.
To examine a histogram of the temperatures I typed:
> hist(TX)
Here’s the histogram of raw Temperature, TX:

Hmmm… That looks odd! There’s a whole bunch of data near 0 °C, and then a tiny bit of data near -1000 °C. What does that mean?
What this means is there are some errors in the data.
Errors! Oh Crap. What do I do?
Recall the prefatory material indicated there were error flags “1” and “9”. It turns out that when the flag is “9”, the data value itself also reads -9999. This is to prevent future boneheaded analysts from taking this data seriously.
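A quick way to see how many flagged values there are is to tabulate the quality codes and count the -9999 placeholders. A sketch, assuming the data frame is still attached as above:
> table(Q_TX)      # how many 0 (valid), 1 (suspect) and 9 (missing) flags
> sum(TX == -9999) # how many -9999 placeholder values are in TX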
(Curious readers may be interested to learn that “-9999” bad data correspond to measurements made during August 1946. So, even though Sweden did not go to war, maybe we can blame this on Hitler or Stalin.)
No matter who or what is to blame for this lack of data, I needed to clean up the data and get the “-9999” values out of the record.
How I cleaned up the data.
There are wonderful and glorious ways to clean up data in R. However, I don’t know them yet. I’m just starting to teach myself R. So, to get on with it, I turned to Excel!
I read the R manual and discovered if I changed the “-9999” to “NA”, I would have access to all sorts of commands that told R to ignore the “NA”s. So, I opened the txt file in Excel, and used an Excel command to create a new column that copied the values in TX, but replaced all “-9999” with “NA”.
I labeled the new column “Temp_9”.
I saved the file as text.
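For the record, the same cleanup could have been done without leaving R. This is only a sketch of what I now understand would work, using the same column name “Temp_9” I used in Excel:
> vaexJoe$Temp_9 <- vaexJoe$TX                    # copy the TX column
> vaexJoe$Temp_9[vaexJoe$Temp_9 == -9999] <- NA   # replace the -9999 placeholders with NA
> attach(vaexJoe)                                 # attach again so the new column can be referred to by name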
Then, I relaunched “R” and repeated my steps. This time, the label “Temp_9” appeared.
> hist(Temp_9, breaks=200) # draw histogram of Temp_9, using about 200 breaks
The “breaks=200” lets me bin the data in a variety of ways. I had fiddled with that number a little, testing values like 1000, 50, etc. The text after # is a comment and does nothing. (It lets me remember what I did.)
What does the clean data look like?
Here’s the histogram of temperatures after excluding the -9999s. (I’ll deal with the decision about excluding the suspect data at a later point.)

That’s it for learning how to read data and how to create a histogram.
However, I do want to discuss the histogram itself.
What can we learn from this histogram?
Every time you look at data, it’s worth asking whether or not the image you see means anything. Sometimes the answer is “no”, but sometimes, you learn a lot.
Bimodal Distribution
This histogram shows the data are bimodally distributed; this observation is important because most statistical tests involve assumptions about the distribution of the data. The normal distribution is a common assumption, and here we know for certain: these data are not normally distributed.
Do we need to give up all hope of doing statistical tests? No. It may turn out that, after some processing, we can do tests on something that is normally distributed. However, after seeing this, a careful analyst knows to always check the distribution in some way. At a minimum, they create histograms and look at them.
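Another minimal check, beyond a histogram, is a normal quantile-quantile plot. A sketch with the cleaned data (the name “tclean” is just a throwaway helper of mine):
> tclean <- na.omit(Temp_9) # drop the NA values
> qqnorm(tclean)            # plot data quantiles against normal quantiles
> qqline(tclean)            # reference line; strong curvature means the data are not normal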
One peak near the freezing point of water.
Notice another feature: one of the peaks is near 0 °C. That happens to be the freezing point of water. That may turn out to be important to later analysis. (Or not.)
Weird jaggedy “stuff”.
There is another interesting feature. Notice there is a dark “black” bimodal distribution, a dark grey one, and then a jaggedy bit above it.
What could that mean?
It’s fairly well known that technicians, grad students, and especially elite scientists often have limited vision. When they do experiments, they tend to round readings to certain values, like 1 °C, 0.5 °C, or similar values.
When a technician rounds to 0.5 °C, you tend to see lots of measurements near 0.0 °C, 0.5 °C, and 1.0 °C, but none near 0.4 °C.
The exact level of rounding may vary from person to person, and it will definitely change as instruments are changed. We’ll see later that technicians rounded more during the early record, and less later on.
I suspect this was due to a change in instruments, but I won’t sweat that too much right now.
The fact that real data are rounded rarely presents serious difficulties for the analyst. However, it’s generally worthwhile to be aware of any obvious changes in instrumentation or data quality in a data record, as these may affect conclusions later on. (They also may not; I haven’t yet done any hypothesis tests, so I have no idea what will arise in the tests I plan to perform.)
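One quick way to see this sort of rounding in the numbers themselves is to tabulate the last digit of the stored values; remember they are stored in tenths of a degree, so a reading rounded to 0.5 °C ends in 0 or 5. A sketch:
> table(abs(Temp_9) %% 10) # counts of final digits 0-9; spikes at 0 and 5 suggest rounding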
Next steps
Learn to make other plots, and make some decisions about how I might process these data.

I came over here from climateaudit to see what you were blogging on. Please do continue your R lessons. I acquired the R package a year and a half ago. I ran immediately into the problem of getting data into usable form. I was easily discouraged. I was quite proficient with SAS at one time but no longer have access to the software.
Might find it helpful to learn a bit about regular expressions for text manipulation. If you use Linux, awk is very powerful; otherwise, Perl. A good text editor will normally support regex. The best book on awk is by Aho et al., only available used. There’s quite a bit on the net. You can do almost any text manipulation you need with awk and Perl. There are also free regular-expression manipulators with graphical interfaces. If you search for “regex linux” you’ll find one. Regex Coach is a free one. RegexBuddy is a very good Windows package, with a great manual, paid.
What an interesting project. I came via CA.
About a million years ago I did most of a thesis in econometrics/political stats. Played with fairly serious regression analysis. In SPSS in Fortran.
I like what you are doing, but I am wondering about a couple of things: what is the hypothesis being tested? What would a negation look like, and what would a confirmation look like?
There is much to be said for just looking at the data, but to make it science I really think you need to test a particular model.
So try.
Email me if you want very rusty help.
@Jay,
I haven’t formulated the precise hypothesis to be tested yet! This seems odd, but there is a reason for it. I’m partly motivated to learn R; I’m also partly motivated to see what answer you get to the questions Tamino asked in this blog post, and I’m partly interested in making sure I test a hypothesis that relates to a meaningful underlying question.
You’ll note that Tamino states “The data were not chosen to establish any particular result or behavior.”, which may well be true. However, it’s fairly well recognized that, when testing a hypothesis, it’s important to understand what you are trying to learn.
I will ultimately repeat the hypothesis test comparing average temperatures from 1977-1985 to those from 1985-2001.
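When I get to that point, the comparison itself should only take a few lines. A rough sketch of the sort of thing I have in mind (assuming the cleaned temperatures are in Temp_9 and DATE is the YYYYMMDD integer; not the final analysis):
> year <- vaexJoe$DATE %/% 10000                       # pull the year out of YYYYMMDD
> early <- vaexJoe$Temp_9[year >= 1977 & year < 1985]  # first period (split at 1985 so no year is counted twice)
> late <- vaexJoe$Temp_9[year >= 1985 & year <= 2001]  # second period
> t.test(early, late)                                  # compare the two period means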
@Peter,
I don’t use Linux; I’m on a Mac. I know a bit about regular expressions, but I don’t know how learning more would help me with this problem. I’ve asked a few questions on the R email list, and I’m fairly confident I can do this with R alone, and assigning myself the task of learning awk would likely make all this take longer.
However, I may at some point ask people how some issues are addressed, and if using a regex string helps, I’d be happy if someone suggested an appropriate regex string!
Right now, I’m figuring out how to import packages to do some of the analyses I anticipate doing with R.
@Mike: One of the reasons I’m going to discuss how I did things with R in ridiculous detail is that I found the R documentation is difficult for anyone self-teaching. I also immediately noticed their web site s*cks.
Why did they use frames? Don’t they know that makes it impossible for search engines to index properly, and consequently it’s impossible to google for your answer? Don’t they know that if your software is called “R” you need to type “The R Project” on every single friggin’ page if you want searchers to have a string they can enter into search queries?
I predict that even though this will be a boring blog, it will gain the #1 search position for R quickly.
The read.table() function has a skip option to specify the number of lines to ignore at the beginning of the file (which looks like twenty to me). If you had used this, you wouldn’t have needed to edit the file to remove the header. e.g.
vaexJoe <- read.table("file:///Users/lucia/Desktop/global/R_Data_Sets/VaexJoe.txt", sep="\t", header=TRUE, skip=20)
Try this at the R command line to learn more:
help(read.table)
You’d probably be able to use the optional na.strings argument to tell R to treat the data points “-9999” as NA, i.e.
vaexJoe <- read.table("file:///Users/lucia/Desktop/global/R_Data_Sets/VaexJoe.txt", sep="\t", header=TRUE, skip=20, na.strings="-9999")
I’m sure you’ll pick up more tricks with time 🙂