Comparison of variance estimates for trend: Part 1 (synthetic data.)

Those reading comments are aware that I have been squirreled away swearing at the R interpreter. I’m not going to discuss the things I’m trying to do with R quite yet, because it will be clearer when I can discuss them along with results. But I’ve reached a point where it is useful to post something, even though that something is rather trivial in the sense that people familiar with statistics already know these things. The result is a post that shows nothing novel; the post will ramble and just suddenly end. This befits its motivation, which is merely to have material to link back to when questions come up on a future post. It’s sort of like “an appendix”. 🙂

(The questions will be things like, “But that distribution is skewed; how do you know you can do a t-test?” or “Why do you think the variance in the trend ‘m’ estimated from a fit to a time series is even supposed to be an unbiased estimator of the variance in the actual trends?” As opposed to comparing medians, or standard deviations, and so on. I asked myself these things, and so I figure others are likely to ask them.)

The post and several follow-on posts will discuss some observations one makes when fitting ordinary least squares trends to a time series of “N” monthly data points when the monthly noise really is Gaussian white noise.

The noise used to create the time series

I will begin by making sure I know how to generate Gaussian white noise in R. To convince myself I’ve found the proper R command (rnorm), I created a long series of white noise and then plotted the histogram. The R code reads like this:

# create Gaussian white noise.
manyCases=10000 # length of the "noise" series I am going to create.
m_noise=rnorm(manyCases) # generate "manyCases" values of white noise
var_p=var(m_noise) # compute the variance
format(var_p,digits=3, nsmall=3) # format.
#
Title=paste("PDF of noise: variance=",format(var_p,digits=3, nsmall=3))
xname="noise"
hist(m_noise,main=Title, xlab = xname) # look at histogram

The final command displays a histogram with a title and x-axis label of my choice, but no legends. I then wrote some R code to slap some info into legend boxes.
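
Something along these lines, though the labels here are placeholders rather than the exact ones on my plot:

# add a small legend box with summary info (placeholder labels)
legend("topright",
       legend = c(paste("N =", manyCases),
                  paste("variance =", format(var_p, digits = 3))))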

Using the “eyeball test”, I see the distribution looks bell-shaped, which is a pretty good sign it’s Gaussian. But, while hunting around for R plotting features, I also discovered I can make other, more useful plots using “qqnorm”.

To make nice quantile-quantile plots, I use the R command qqnorm, add a title and a label to the x axis, and use the R command qqline() to add a straight line. The code snippet is:

Title="Cumulative distribution of white noise"
qqnorm(m_noise,ylab="Quantiles", xlab="white noise",main=Title)
qqline(m_noise)

Examining the graphic to the left, I see that the synthetic data all fall on the straight line and conclude that the “noise” do seem to be Gaussian. (I don’t bother to check if they are independent.)
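
(If I did want a quick check on independence, the correlogram would do it. A sketch I did not run for this post:)

# for white noise, autocorrelations at every lag beyond 0 should sit inside the confidence band
acf(m_noise, lag.max = 40, main = "Correlogram of the synthetic noise")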

Fitting Ordinary Least Squares to synthetic time series

Now that I’m pretty confident rnorm does generate Gaussian noise, I can create many time series of “N” months that follow the relationship

(1) $latex \displaystyle y_{i}=m t_{i}+u_{i} $
where $latex u_{i} $ is Gaussian white noise, $latex m $ is a deterministic linear trend with time, and the subscript “i” represents the month.

For the purpose of today’s post, the magnitude of the trend is unimportant, so I will be generating series with zero deterministic trend (i.e. m=0).

After I generate an individual series of “N” months, I can apply Ordinary Least Squares (OLS) to that series and compute the trend OLS returns for that series. Owing to the noise, the computed trend for that particular series will rarely be zero. It’s known that these individual trends will be normally distributed, and the variance of that distribution can be estimated from the standard error in the trend, sm, returned by any statistical package that provides OLS fits.
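
As a sketch of that step (the names t_i, y_i, m_j and sm_j are mine, just for illustration), fitting one synthetic series with lm() and pulling out the trend and its standard error looks roughly like this:

N    <- 100                       # months in one series
t_i  <- 1:N
y_i  <- 0*t_i + rnorm(N)          # m = 0 plus Gaussian white noise
fit  <- lm(y_i ~ t_i)             # ordinary least squares fit
m_j  <- coef(summary(fit))["t_i", "Estimate"]     # fitted trend
sm_j <- coef(summary(fit))["t_i", "Std. Error"]   # standard error of the trend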

I’m now going to demonstrate something well known: if I generate “M” monthly series, and create M pairs of trends and standard errors in the trend, $latex (m_{j},sm_{j}) $, both computed from the OLS fit, then the variance of the “m’s” will equal the mean of the square of the “sm’s”. That is: the square of the standard error in “m” is an unbiased estimator of the variance in the trends. I can show this happens by generating 10,000 time series, computing 10,000 pairs of $latex (m_{j},sm_{j}) $, and then computing the variance over all 10,000 m’s and the mean of the squares of the sm’s, which I will write as $latex \langle sm^2 \rangle $.
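
A minimal sketch of that demonstration (reusing the illustrative names from above; the idea, not my exact script):

M    <- 10000                     # number of synthetic series
N    <- 100                       # months per series
t_i  <- 1:N
m_j  <- numeric(M)                # preallocate
sm_j <- numeric(M)
for (j in 1:M) {
  y_i     <- rnorm(N)             # zero trend plus Gaussian white noise
  cf      <- coef(summary(lm(y_i ~ t_i)))
  m_j[j]  <- cf["t_i", "Estimate"]
  sm_j[j] <- cf["t_i", "Std. Error"]
}
var(m_j)      # variance of the fitted trends
mean(sm_j^2)  # mean of the squared standard errors; should be very close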

To the right and above I’ve displayed the histogram of the 10,000 individual trends. The title indicates the magnitudes of the variance of the trends and the average of the square of the estimated standard errors. Inspecting, you will see these are very close. If I ran more than 10,000 trends, I could create graphs with smaller differences between $latex \langle sm^2 \rangle $ and $latex var(m) $.

While here, it is worth examining the distributions of both $latex var(m) $ and $latex \langle sm^2 \rangle $. While the means of these two are equal, the distributions are dissimilar. The two distributions are shown below:

median(var(m)) = 8.770743e-06   median($latex \langle sm^2 \rangle $) = 1.198514e-05
mean(var(m)) = 1.199416e-05   mean($latex \langle sm^2 \rangle $) = 1.201076e-05

 

Note the distribution of var(m) is highly skewed while the distribution of $latex \langle sm^2 \rangle $ looks nearly Gaussian. Also note that while the means of the two distributions match, the median of var(m) is quite a bit lower than the mean computed over the same pairs. Meanwhile, the median of $latex \langle sm^2 \rangle $, computed from the two estimates of the variance from fits to two time series of 100 months, is very close to the mean of $latex \langle sm^2 \rangle $ from the same pairs of time series.

This happens because each estimate of the variance, sm^2, based on a time series uses information from 100 data points, while the variance var(m) is based on only the 2 values of m. Because there are fewer data points in the latter, its distribution is much more skewed.
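
To put that in terms of the distributions involved (a sketch of the standard results, not a derivation): if $latex \sigma_m^2 $ is the true variance of the trend, then for Gaussian white noise each sm^2 from a fit to N months satisfies

$latex \displaystyle (N-2)\,\frac{sm^2}{\sigma_m^2} \sim \chi^2_{N-2} $

while a variance computed over n individual trends satisfies

$latex \displaystyle (n-1)\,\frac{var(m)}{\sigma_m^2} \sim \chi^2_{n-1} $

With N=100 and n=2, the first carries 98 degrees of freedom and looks nearly Gaussian, while the second carries only 1 and is badly skewed.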

At this point, some of you might wonder whether I’m going to do a t-test to compare the mean of $latex \langle sm^2 \rangle $ and the mean of var(m). The answer is: Nope. The reason is that to apply a paired t-test, I need the distribution of the difference in the two quantities to be Gaussian (i.e. normal). Because the distribution of var(m) is so heavily skewed while that of $latex \langle sm^2 \rangle $ is not, I can’t apply such a t-test to compare these with the amount of processing done so far.

But… but… I will also reveal that my ultimate goal is to apply a t-test to compare the $latex \langle sm^2 \rangle $ estimated from fits to 100 (or so) months of GCM data to the variance of the trends, var(m), from a collection of the same runs. (I will not be doing ordinary least squares; I’ll be estimating the $latex \langle sm^2 \rangle $ with a correction for red noise.) So, either tomorrow or in a few days, I’ll explain what I do to apply this to the “simple” case of trends with Gaussian white noise residuals, and afterwards I’ll apply it to data from a collection of models from the AR4.

R-Script: (In progress; used to check lots of junk. Includes non-working bits and is disorganized.)

72 thoughts on “Comparison of variance estimates for trend: Part 1 (synthetic data.)”

  1. Thanks for this Lucia. I need a little more time to work through your sequence, and will do so later. I found myself musing about related stuff after PaulK1’s comment about UAH’s slow response to the La Nina and SSTs. I plotted a version of the Reynolds SST time series – rescaled and offset – for an apples-to-apples comparison alongside the UAH time series. I also plotted a scatterplot of the Reynolds ocean data wrt UAH and fitted the 29-year linear model to it (for some reason, I can’t get the linear fit to plot alongside the scatterplot). Since there is a lagged tropospheric response to a change in SST, I compared the scatters of various lag times for the best visual correlation. A 1 month UAH lag seemed to produce the best correlation, but much larger lags still showed strong correlation. I’m still in the middle of all this and need to confirm, but it looks to me like Paul is correct: the data point for the most recent value falls more than 1 sd outside of the predicted value suggested by the linear model.

    I have not gotten this far yet, but it could be argued that the linear prediction for November should be offset by a value which considers the seasonal tendencies of the November UAH values. I am also not sure that the SE for the linear model is a proper confidence boundary, as it seems a case could be made that the UAH response time to a change in SST might not have homogeneous lag times.

  2. Layman–
    Even if you have a good predictor for a random process, points will fall outside ±1sd fairly frequently.

    But yes, when predicting UAH, which has been subject to instrument changes that can introduce non-physical artifacts into the anomalies, you might need to consider the seasonal tendencies in the anomalies when trying to forecast or interpret next month’s values.

    (You might need to do this for all the metrics, since AGW should also modify the “best guess” for monthly average temperatures in a way that results in a detectable monthly cycle relative to some previous baseline.)

  3. Even if you have a good predictor for a random process, points will fall outside ±1sd fairly frequently.

    Furthermore, if Paul is making the case that there has been a change involving a dampened response or increased lag or whatever, there is nothing but one (maybe two?) stray data points to support that. As is usual in such situations, there is a reversion to the mean.

  4. Hi Lucia – please can I check something? If I’ve understood correctly, the first part of your testing is to check that the output of the rnorm function is white noise. You do this by showing an approximately normal distribution of the generated values, which you check with the qqnorm and qqline functions. This establishes pretty well that you have a normally distributed data set. The thing I’m not sure about is whether that means you have white noise. I checked on Wikipedia and the article on white noise states:

    “It is often incorrectly assumed that Gaussian noise (i.e., noise with a Gaussian amplitude distribution — see normal distribution) is necessarily white noise, yet neither property implies the other. Gaussianity refers to the probability distribution with respect to the value i.e. the probability that the signal has a certain given value, while the term ‘white’ refers to the way the signal power is distributed over time or among frequencies.”

    I’m wondering if the normal distribution of OLS trends you refer to for multiple series shows/supports uncorrelated data within each series but I’ll not go on as I’m on thin ice in terms of my knowledge. Apologies if I’m way off point here or have misunderstood but thought it worth checking. Thanks.

  5. Re: curious (Dec 7 14:41),

    A time series created using Gaussian noise should be white. But it would be nice to see the power vs. frequency spectrum. Then you can compare white noise to pink or red noise. Audio systems are usually tested with pink noise, where the power per octave is constant. On a power vs. linear frequency plot, pink noise power would have a negative slope, but would have a slope of zero for a log frequency plot (IIRC).
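
    (A minimal sketch of what that check might look like, using the m_noise vector from the post:)

    # white noise: roughly flat power across frequency
    # red/pink noise: power rises toward the low frequencies
    spectrum(m_noise, main = "Periodogram of the synthetic noise")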

  6. Curious– You are correct. Since I’m not used to R, my intention is to generate Gaussian white noise, then check. I checked the “Gaussian” aspect but not the “white” aspect because I didn’t check the correlogram.

    I could check. I just didn’t. That’s what I meant when I wrote this:

    Examining the graphic to the left, I see that the synthetic data all fall on the straight line and conclude that the “noise” do seem to be Gaussian. (I don’t bother to check if they are independent.)

    If each individual “noise” entry is statistically independent of all the others, then the noise will be “White”. Otherwise, it will have some color. (Could be pink… blue… red… whatever.)

    rnorm is supposed to generate Gaussian white noise and, for now, I’m taking R’s word on the whiteness aspect.

    Later on, I will be generating Gaussian red noise, and FWIW, I’ll also be doing some tests where I check for correlation in residuals– but not in this post. So, you are correct it is worth checking, but I’m not discussing that in detail here.

  7. DeWitt–
    If you happen to know the R command to quickly show the frequency spectrum, I’ll throw that into the script and show them. For now, some of what I show is limited by my hunting for the R commands and also by whether or not the point seems to address a question I am pretty certain will come up in… oh… a week, when I show results for GCMs.

  8. Lucia, DeWitt – thanks. I thought it worth checking, especially as I have seen some debate from times gone by where there have been heavy comments based on the “quality of noise” used in analyses. I wasn’t following back then, and it is outside my limited knowledge area, but as you seemed to be fielding this post to pre-empt criticism I thought it relevant. Looking forward to more in due course.

  9. curious–
    It’s not so much to pre-empt criticism as to have something to point back to when certain questions I am sure will be asked get asked. I don’t so much expect people to ask if rnorm really generates Gaussian white noise. But I am going to apply a t-test to some results that are not Gaussian. T-tests formally apply to data that are Gaussian. But, in reality, the t-test gets applied to data that aren’t Gaussian all the time, and the question when quibbling over this is: how much does the deviation from Gaussian “matter”?

    So… it happens that right now, I’m worrying about the non-Gaussian nature.

  10. Lucia,

    I think your R code isn’t running as fast as it could. This snippet of code:

    # -------- First create white noise and look at variances. They will most certainly not be normally distributed.
    numNoise=2
    for(i in c(1:manyCases)){
    series_white=rnorm(numNoise) # create numNoise values, normally distributed.
    if(i==1){ var_p =var(series_white)
    }else{ var_p <-c(var_p,var(series_white))
    }

    }

    Can be compressed into:

    numNoise=2
    var_p = rep(NA, manyCases)
    for(i in 1:manyCases){
    series_white = rnorm(numNoise)
    var_p[i] = var(series_white)
    }

    This code is faster because you’re preallocating space for var_p. The other way, R has to rebuild the array every time you concatenate to it. In your for loop, the indices don’t need the c() concatenator. If you want to save more time, you can write your own OLS function. If you’re just interested in the slope and standard error, then you can easily derive the equations and only calculate what you require, because lm() computes other statistics which aren’t necessary for your analysis (at least at this stage) and slow the calculations down. I have a feeling that you and R are going to become best friends! I can’t wait to see the look on your face when you load Fortran code into R for the first time and see your calculations fly out at lightning speed!
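
    (As a rough sketch of that last idea; the function name and layout here are just illustrative:)

    ols_slope_se <- function(y, t = seq_along(y)) {
      tc  <- t - mean(t)                  # centered time index
      m   <- sum(tc * y) / sum(tc^2)      # OLS slope
      res <- y - mean(y) - m * tc         # residuals about the fitted line
      s2  <- sum(res^2) / (length(y) - 2) # residual variance
      sm  <- sqrt(s2 / sum(tc^2))         # standard error of the slope
      c(slope = m, se = sm)
    }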

  11. Lucia,

    If you happen to know the R command to quickly show the frequency spectrum

    There’s the spectrum function. Type ?spectrum for details. There’s also fft. I had some beautiful code that made using this function easier than it is; unfortunately, it got lost when I switched to Ubuntu.

  12. lucia,

    If you’re going to do fft (and maybe spectrum too) then you need to have the number of samples in your time series be a power of 2, I think. So instead of 10,000 use 8,192 or 16,384.

  13. DeWitt,

    If you’re going to do fft (and maybe spectrum too) then you need to have the number of samples in your time series be a power of 2, I think.

    If I recall, that’s not a requirement. If the number of samples is a power of 2, the algorithm runs much faster.

  14. Chad–
    Thanks!
    Right now I’m not worried about saving time – but I will be eventually. If you were to see my current codes, you’d find they include comments like: “There has got to be a better way to do this!” and “Gosh, why did I do this such a stupid way? Recode this later….”

    Still, there are a few things that are just going to be left as crappy code. Sometimes, once I have the answer I seek, I don’t care to improve the code. But there are other places where… well… you should see one of the codes. Talk about horrible stupid verbose ways to do things! (But that’s what happens when you’re hunting for commands and trying to get an answer at the same time. If the suboptimal way works…well… it works!)

  15. Is there a reason why you create 10,000 samples of length two in a loop?

    series <-matrix(rnorm(20000),ncol=2)
    V <- apply(series,1,var)

    or more generally:
    length_of_series <-2
    series_count <- 10000
    Series <- matrix(rnorm(length_of_series*series_count),ncol=length_of_series)
    V <- apply(Series,MARGIN=1,FUN=var)

    *apply*

    apply the FUNCTION var, to SERIES, by the row(1) MARGIN

  16. It seems that you are summing chi-square variables, and the sum tends to a normal distribution when the dof is large. One is chi-squared with small dof and the other with large dof (which is then divided by N).

  17. UC– Absolutely! The one where I compute the variance over 2 runs is the small dof. The point of the computation will merely be to show that I’ve summed over enough to make the remaining non-normality “not matter” for a t-test.

    The demonstration will “matter” later because I know some who prefer to not like the results will latch onto that aspect.

  18. SteveMosher–

    is there a reason why you create 10,000 samples of length two in a loop?

    Yes. It’s called “not knowing R”.

  19. BTW steve–
    I have no idea what this means:

    apply the FUNCTION var, to SERIES, by the row(1) MARGIN

    I suppose I can try to use “help”…..

  20. Lucia, I have not had a chance to go over your thread introduction in detail, but what you are doing is something that I am very much interested in: using R, seeing the problems that come up, and getting great advice from the “experts”.

    I have found in learning and using R that at first it was easier to attempt to apply Excel, but one quickly finds the limitations of Excel compared to R. The more you use R, the easier it becomes, as you no longer have to look up even simple programming steps.

    I still do some cussing at R, but less now, and most of it is when I get an error message and then think I made a fundamental programming error when in fact it is almost always some little typing error. It is easier now because I look for my dumb errors instead of fundamental ones and thus have taken to cussing at myself.

    Another problem I had originally was finding R routines for simple things like adding red and white noise to a time series (and yeah I read that Gaussian noise is not necessarily white noise), but that has become easier with experience also.

    What you do best at your blog is go to the black board and show what you are doing and are willing to take advice and comments from the “students”. There are some egos floating around the internet that appear incapable of, or at least uncomfortable in, that role.

  21. still do some cussing at R, but less now and most of it is when I get an error message and then think I made a fundamental programming error when in fact it is almost always some little typing error.

    The rate of swearing dropped when I discovered that, quite often, R cries “syntax error” a bit late. I now know to scan up a bit further than the point where it claims I had the syntax error.

    There are other issues that cause more swearing when self teaching R compared to C, php, perl ….. But, it’s going ok now.

    I think showing bits of code was useful. I learned the solution to one of the ‘problems’ I’d groused about in comments in my R codes. (All the “ifs” etc.)

  22. “apply the FUNCTION var, to SERIES, by the row(1) MARGIN”

    I used to think the same as you, and then you begin to see that the meaning is much less complicated and more straightforward than you might expect.

    You get some commands like colMeans and colSums in R, but when it comes to standard deviation you need to use the apply function. The MARGIN argument simply tells apply whether to apply the sd function, or a function that you might construct, to the rows (1) or to the columns (2).
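
    For instance (a tiny made-up example):

    X <- matrix(rnorm(30), nrow = 10, ncol = 3)   # made-up data
    apply(X, 1, sd)   # sd of each row    (MARGIN = 1)
    apply(X, 2, sd)   # sd of each column (MARGIN = 2)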

  23. Re: lucia (Dec 8 11:29),

    In R you should avoid (in most cases) using loops. In most cases your code
    should run faster and it will be less prone to stupid errors.

    You should use functions like apply(), lapply(), mapply(), Vectorize(),aggregate()
    etc. .

    the apply() call list goes like this

    apply(X,MARGIN,FUN)

    X is the data structure ( a matrix ) that you want to apply a function
    to.

    MARGIN is the index ( row or column or other dimension) 1= row, 2=column.

    FUN is the function you want to apply.

    So: I build a matrix of 20,000 random numbers.
    2, columns and 10000 rows

    Series <- matrix(rnorm(20000),ncol=2)

    Now, in this format every row is a series. every column would be a time
    step.

    Now I want to calculate the variance BY ROW

    two ways

    for(i in 1:nrow(Series)){….}

    OR use apply

    V <- apply(Series,MARGIN=1,FUN=var)

    or I can do it in one line

    V <- apply(matrix(rnorm(20000),ncol=2),1,var)

    tada!

    So, apply.. applies a FUNction to Matrix by the Margin

    As Kenneth notes there are some other functions (wrappers for basic functions)
    like rowMeans, colMeans, etc.

    Which is the same as apply(x,1,mean) or apply(x,2,mean)

    You can even define the function within the apply call

    apply(X,1, function(x) …) slick.

  24. You guys should see all my stupid questions to the R list. Saved forever.

    Basically, the community is there to help everyone who is willing to ask for help. But we don’t do homework assignments and we don’t teach stats.

    I’ve had some of the top experts in the field (top graphics gurus) write code for me merely by my asking a dumb question. There are sections of my code that used to be dozens of lines long that the time series gurus reduced to a couple of lines.

    Anyway it’s a good idea for you to post code here and you’ll learn to be an R genius in no time.

    BTW, what platform are you using and what editor?

  25. Ken

    “apply the FUNCTION var, to SERIES, by the row(1) MARGIN”

    I used to think the same as you, and then you begin to see that the meaning is much less complicated and more straightforward than you might expect.

    I wasn’t imagining the meaning was complicated. I’m just saying I have no idea what he means.

    Steve–
    Thanks for the further explanation. I now have the actual names of commands that I can look up using “help”. “help” is great if you already know the command. Otherwise, it sucks. Similarly, most examples on the web, and even in tutorials, are poor: they assume you already know lots of “R” in order to understand how to use the item you actually wanted “help” on, and/or they include practically no comments to help the beginner.

    I’ll try the various applies out now. 🙂

  26. BTW, what platform are you using and what editor?

    I’m running on a Mac. I type into Xcode and then just paste stuff into the R console.

  27. V <- apply(Series,MARGIN=1,FUN=var) This I like.

    or I can do it in one line

    V <- apply(matrix(rnorm(20000),ncol=2),1,var)

    This example exhibits a feature I hate about R. You are permitted to like it, but, I hate it. I will always hate it.

  28. Re: lucia (Dec 8 13:13),

    Some guides:

    ??thingy_you_want help on.

    this searches help using agrep() to look for partial matches.

    Quick reference guide:
    http://cran.r-project.org/doc/contrib/Short-refcard.pdf

    http://www.ats.ucla.edu/stat/r/

    http://www.r-bloggers.com/r-tutorial-series-r-beginners-guide-and-r-bloggers-updates/

    Rather than cutting and pasting code from Xcode into R, you can also
    try TextMate and TextWrangler.

    Both allow you to highlight code and SUBMIT it to R. Very slick.

  29. “This example exhibits a feature I hate about R. You are permitted to like it, but, I hate it. I will always hate it.”

    Now, Lucia, that does not sound very reasonable. I put together R routines with fairly simple statements because that is how I think. If I wanted to condense my program, I would go back and combine and abbreviate steps like Steve Mosher did in this one. What is there to hate about that, or am I not getting your point again?

    Ask Steve if he puts together compound statements on his first go around.

  30. steven–
    ??thingy only helps if you know the name of the thingy. I’ve asked for all sorts of help and R tells me it doesn’t find anything.

    Thanks for the reference cards. Those were escaping my attempts to use my limited google-fu to find them. I’d found the r-bloggers. They use a sufficient number of words in their posts to make google know what they are talking about.

    Now, Lucia, that does not sound very reasonable. I put together R routines with fairly simple statements because that is how I think. If I wanted to condense my program, I would go back and combine and abbreviate steps like Steve Mosher did in this one. What is there to hate about that, or am I not getting your point again?

    By removing FUN and MARGIN in the second example, Steve stripped out all indications of what the passed variables might refer to. In R examples (even those used on pages representing themselves as “tutorials”), this frequently makes utterly uncommented code even less commented.

    The fact that R seems to permit some flexibility in the order of variables can make it difficult for a beginner to unpack what was sent.

    I think most code suffers from too few comments already. But… of course.. that might just be me.

    Anyway, I don’t know how it’s unreasonable for me to hate something you happen to like while admitting that it’s perfectly fine with me if you like it. I’m pretty sure that going forward I will be leaving things like “FUN” and “MARGIN” in my codes, even if R permits people like Steve, you, and even 99.99% of R users to revel in leaving out those words.

  32. Re: Kenneth Fritsch (Dec 8 14:49),

    I do not put compound statements together on my first go.

    I write a single line and test it. I add a line. I test from line 1.
    I will rarely write more than a couple lines without testing.

    That’s why an editor that can SUBMIT code to R is useful.

    sometimes I work at the console and copy to the editor.
    that way I know I have working code.

    In the beginning I used loops. Then I would post working code to
    the R help list and ask ” how do I get rid of this loop”

    After a while you write more and more compressed code. Remember that R loves
    vectors, so after a while you start to think in vectors.

  33. Steven,
    I like the compressed lines. What I don’t like is the compressed lines with words like “FUN” and “MARGIN” left out. That’s because I see words like FUN and MARGIN as being “comment-like”, and I think code should be commented. This is especially true if you are going to share the code and/or need to come back to it months later.

    Having read lots of other people’s code, it seems some people think compressed code is “self commenting”… well.. no.

  34. “Remember that R loves vectors so after a while you start to think in vectors.”

    That was the first thing and most important thing I learned about R.

    OK Lucia, now I know what irks you and you are probably right from a scientist/engineer standpoint and I would tend to be like you. It seems almost like the writer of abbreviated code (not compounded) is more showing off their more intricate knowledge of R than being informative for others to understand what they wrote. I am not and never was a code writer so I tend not to go back and compound my statements, although I must admit when I do see a single abbreviated statement that took me several statements to do I have to say wow that is neat.

    Those abbreviations work in R only because they can be understood to mean only one thing, like an abbreviation after a comma when the only other arguments would need an equal sign or some other treatment. I guess some arguments can be sequence sensitive also.

  35. Kenneth–

    Those abbreviations work in R only because they can be understood to mean only one thing, like an abbreviation after a comma when the only other arguments would need an equal sign or some other treatment. I guess some arguments can be sequence sensitive also.

    Except that, from fiddling, it seems there are plenty of exceptions where you can’t automatically tell what was sent based on the placement in the call. So, it might be (x,y,z,….) with y being optional, but somehow R can tell you only sent x, skipped (or included) y, and are now sending z. At least that’s what fiddling seems to suggest.

    So, the initiate may be able to figure it out– or maybe they break open their R console and enter it to figure it out…. or something. But it’s really not an easy read.

  36. “Anyway, i don’t know how it’s unreasonable for me to hate something you happen to like while admitting that it’s perfectly fine with me if you like it.”

    Sorry Lucia, what can I say except I guess I was being unreasonable – and at the same time misunderstanding your point. I think the true comments outside the R statement are most important, but as to what you say about the abbreviated statements, I guess I would agree.

    Steve Mosher, what’s the matter with you? Do you not have any compassion for those of us who have not reached your level of competence with R?

  37. Re: lucia (Dec 8 16:49),

    Lucia,

    Optional parameters.

    myFun <- function(a, b, c="HELLO")

    a MUST BE SPECIFIED
    b MUST BE SPECIFIED
    c has already been given a default, but you can change it.

    There is an unwritten rule that you do it like this

    rnorm(n, mean = 0, sd = 1)

    ALL unspecified variables come first.
    followed by all PRESET variables

    rnorm(mean = 0, sd = 1,n)

    would be BAD R, because n is required and must be specified by the caller.

    rnorm(mean = 0, n, sd = 1)

    ALSO BAD. variables that need to be set come first.
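
    (To see how R matches things up, a small illustration using the myFun example above:)

    myFun <- function(a, b, c = "HELLO") paste(a, b, c)
    myFun(1, 2)            # positional: a=1, b=2, c falls back to "HELLO"
    myFun(1, 2, "BYE")     # positional: c becomes "BYE"
    myFun(b = 2, a = 1)    # named arguments can come in any order
    # myFun(1) fails: b is required and has no default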

  38. Kenneth,

    Ha, the other day I had to help some grad student with Rasmus Benestad’s package, so I have compassion.

    I love helping people with R; I just wish I knew it better. There are some wicked good guys on the list.

    I’m officially kind of a contributor to one package (raster), although I’ve only submitted bugs and code suggestions. The guy who maintains the package is so frickin’ good it’s scary.

    For Lucia’s simulation of time series, folks might want to play around
    with simulate().
    When I get a chance I’ll see if I can make it work.

  39. Steve Mosher

    rnorm(mean = 0, sd = 1,n)

    would be BAD R, because n is required and must be specified by the caller.

    It’s bad R– but does it make R choke?

    And if you send rnorm(n,3,4) does it work? That sort of thing sends the newbie running to help(rnorm), and if done throughout the entire code it means newbies have to constantly resort to reading “help”. So… I don’t like it.

    If R doesn’t force them to write rnorm(n,mean=3,sd=4), I can’t. And if a code writer prefers to preserve his fingertips, I guess that’s his prerogative. But… do… not… like… I… do… not…..

    That said, I’m not going to lose sleep over it. All languages have features, and coding communities have some practices I disprefer. I’ll mention it sometimes– but excessively compact to the point of non-commented and cryptic is something I disprefer.

  40. Steven Mosher,

    Does it execute faster if it’s one line? If it doesn’t, then writing it as one line is showing off.

  41. Dewitt,

    I’m not sure if it runs faster

    Having a separate matrix statement means you will allocate and store a matrix.

    M<-matrix(rnorm(20000),ncol=2)

    V <-apply(M,1,var)

    Leaves "M" hanging around in your memory space, when its merely used temporarily. So, when I work with Big data I like to keep these temp objects
    to a minimum.

    “Leaves “M” hanging around in your memory space, when it’s merely used temporarily. So, when I work with big data I like to keep these temp objects to a minimum.”

    I’ve had a problem or two with those hanging variables: ones I used once and then used the same name again after forgetting the first use. I then do a for statement with the forgotten variable outside the for statement brackets and come up with a vector or matrix with all the same values. I am getting good at recognizing the source of this kind of error. Of course, if the variable originates in the for statement it stays there – kinda like Las Vegas.

  43. steven–
    Can you unset variables? That results in more lines, but.. still.. can make code closer to “self documenting”.

  44. “Can you unset variables? That results in more lines, but.. still.. can make code closer to “self documenting”.”

    I can.

  45. Re: lucia (Dec 8 18:53),

    Unset variables? What do you mean?

    take rnorm(n,mean=0,sd=1)

    you can call it like rnorm(n=100,mean=50,sd=4)
    or rnorm(100,50,4)

    mean=0 in the call means that if the user specifies nothing the function
    will use mean=0 as the default.

  46. Kenneth,

    Yes, leaving those variables hanging around is a problem. I’ve been bitten many times by that. So I avoid short names for things.

    There are some good style guides that help prevent some of those issues.

  47. Steve,
    You wrote this:

    Having a separate matrix statement means you will allocate and store a matrix.

    Leaves “M” hanging around in your memory space, when it’s merely used temporarily.

    In lots of languages you can tell your code to unset (i.e. forget) M after the V <- apply(… bit.

    In php you use unset

    So, M is hanging around a little while, but then you make it go away after you aren’t going to use it anymore. This frees up memory.

    Can you do that in R? Or are you stuck with M hanging around forever and ever and ever and ever?

  48. Nick–
    Thanks. I figured there had to be such a thing for precisely the reason Mosher complained of. You don’t want huge matrices used temporarily hanging around.

  49. Lucia you can also just set them to NULL

    Also helpful is gc()

    garbage collect

    R for some reason has very slow memory management

    In some cases I will just divide a program up into sub-scripts. Also,
    with the raster package we (Robert and us testers) are now testing multi-processor stuff. At some point I suppose I will move to look at R on GPUs. (Using graphics engines to do math and physics has been an abiding hobby/job of mine since 1990; it’s come a long way since then… someday I will tell you how the GPU got named the GPU.)

  50. Lucia, you can use ls() to find all objects in a given R environment and, as Nick S noted, rm to remove a particular one or two of them.

    ls() comes in handy when you load an R file and you do not know what objects it contains.
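
    (Putting the pieces from this thread together, something like this should drop the temporary matrix:)

    M <- matrix(rnorm(20000), ncol = 2)    # temporary matrix
    V <- apply(M, MARGIN = 1, FUN = var)
    rm(M)   # remove M from the workspace
    gc()    # ask R to give the memory back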

  51. Thanks for ls() Kenneth. Thanks for rm() Nick.

    I re-wrote my loops the way Mosher suggested. I’m not really sure things are going faster– I’d have to sit here with a clock for that and I’m not going to.

    I think the code is a little shorter and possibly more readable. In some cases, I had to break functions with nested loops into two functions, which, on the one hand, can be a bit longer, but on the other hand turns out to be a bit more readable since each function only does one operation.

    I do think this vector “apply” idea can reduce the rate of screwing up putting the wrong index inside [] brackets. (Though, I think my paranoid constant checking of results compared to theory helps more…)

    Anyway… I’m sitting here waiting for R to do 20,000 t-tests to see if I get a rejection rate of 5% when I reject things at p=5%. It occurred to me that since the distribution I apply the t-test to is skewed, the rejection rates may be asymmetric, so I’m also doing single-sided tests at 2.5% and seeing what I get.

    (My first try doing the t-test evaluation with 1,000 cases says I reject $latex \langle \langle var_m \rangle \rangle = \langle sm^2 \rangle $ against the alternative hypothesis $latex \langle \langle var_m \rangle \rangle \neq \langle sm^2 \rangle $ at approximately 6% when p=5%, but a disproportionate number of those rejections occur when $latex \langle \langle var_m \rangle \rangle < \langle sm^2 \rangle $. It’s sufficiently asymmetric that that has to be considered when interpreting results when applied to models later. Though… only if I get a rejection… which so far I haven’t!)
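
    (The skeleton of that rejection-rate check looks roughly like this; a sketch with placeholder samples, not my actual script:)

    nTests <- 20000
    pT <- numeric(nTests)
    pW <- numeric(nTests)
    for (k in 1:nTests) {
      # placeholder: two small skewed samples drawn from the same distribution
      a <- rchisq(10, df = 2) / 2
      b <- rchisq(10, df = 2) / 2
      pT[k] <- t.test(a, b)$p.value
      pW[k] <- wilcox.test(a, b)$p.value
    }
    mean(pT < 0.05)   # t-test rejection rate at p = 5%
    mean(pW < 0.05)   # Wilcoxon rejection rate at p = 5%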

  52. steven– well… in real time…. s_l_o_w….

    Mind you, I’m not going to try to do 2 t-tests and 1 wilcox test on 20,000 samples of “synthetic” groups of 10 models, with 2 runs each over 9 periods very often. But…. well.. still…. running…

    When I did this with “loops” I decided to exercise in front of the TV and just check back from time to time. It’s noon. I think I could choose between “Springer”, “All My Children” and various court shows.

    (I did finally blog about the free sweetener – which I liked. I figured I should blog for them since they were nice enough to send me free stuff, and it was actually good. Now… maybe I should read everything everyone is saying about O’Donnell and figure out what I think is worth saying. That’s a toughie since I know nothing about principal components.)

  53. Ok… the slow process finished. I’m just going to paste the results of the t-test on the synthetic data here:

    # theoretical for "good" method
    > ManyCases
    [1] 20000
    >
    > # single sided (to see if roughly symmetric). Should be 1/2 theoretical
    > # theory for single sided
    > paste(format(100*p_cutoff/2,digits=2, nsmall=2),"% ± ",format(100*sdBinom_half,digits=2, nsmall=2),sep="","%(sd)")
    [1] "2.50% ± 0.11%(sd)"
    > # got
    > paste(format(100*rejRateTLess,digits=2, nsmall=1),sep="","%")
    [1] "4.9%"
    > paste(format(100*rejRateTMore,digits=2, nsmall=1),sep="","%")
    [1] "1.2%"
    >
    > # two sided – if the distribution were symmetric, should be theoretical
    > paste(format(100*p_cutoff,digits=2, nsmall=2),"% ± ",format(100*sdBinom,digits=2, nsmall=2),sep="","%(sd)")
    [1] "5.00% ± 0.15%(sd)"
    > # t-test
    > paste(format(100*(rejRateTMore+rejRateTLess),digits=2, nsmall=2),sep="","%")
    [1] "6.05%"
    >
    > # wilcox (should reject more often.)
    > paste(format(100*rejRateW,digits=1, nsmall=2),sep="","%")
    [1] "5.80%"
    >
    >
    Now, I can write up a bit about what we might say about t-tests on “similar” things (under some assumptions which might not apply to model data; but I did at least need to know what I would get if the assumptions *did* hold.)

  54. Lucia,
    I have a little utility function called timer() which I use in my codes. You start with time0=timer(0), and then every call to timer(time0) returns the elapsed time in seconds, which I print when I expect that I might be getting impatient.

    timer = function(time0=0){
    a=strsplit(format(Sys.time(),"%H %M %S")," ");
    b=as.numeric(a[[1]]);
    tm=b %*% 60^(2:0);
    if(time0>0){tm=tm-time0;}
    tm;
    }

  55. Lucia, I’m coming into this late so I hope you see the message and I’m not telling you something you already found out. You can create and edit code right in the Mac GUI. Under File you will find New Document. This will allow you to create a new file. You can write code in there and edit it right in R. You can execute the code right from the file just by highlighting it and hitting cmd-return. You can also save it and run it by selecting Source File from the File menu. The cmd-return method is nice because you can execute subsections one at a time.

  56. BarryW–
    There must be some trick to editing in the Mac GUI because I started out trying it that way and it resulted in LOTS of screaming at the screen. The problem was I was constantly running commands as I entered them and hit return, and most things I typed were typos. So, I switched to editing in something that *can’t* run the R, and pasting in.

    How do you avoid running when you hit return?

  57. Do you have just a command line R or do you have one configured for Mac? If it’s the Mac Gui then you should see something like the following.

    when you start up the command line window should show a line that says:

    [R.app GUI 1.34 (5589) i386-apple-darwin9.8.0]

    or something similar. Do you have a regular Mac type menu bar at the top of your screen? If you do look under the apple at the About R entry. you should see something like

    R for Mac OS X Cocoa GUI written by:
    Simon Urbanek
    Stefano M. Iacus

    R: Copyright © 2004-2010
    The R Foundation
    for Statistical Computing
    http://www.R-project.org

    If you do, then you’ve got the Mac GUI version. If not, you’re running the command line version. Get the Mac version! The last update I have is from May and the newest package is at

    http://cran.r-project.org/bin/macosx/

    If you are, then don’t edit in the GUI at the command line prompt. It’s standard in R that every line you enter on the command line is executed. The trick is to make an R document. Look under File and you’ll see a New Document command (cmd N); hit that and you should get a new window which will say “untitled”. It’s just a text file, so you can type in there without executing anything. You can save those files too and execute the code (under the File menu item that says Source). You can also bring them back and execute all or part of the file or re-edit it.

  59. Barry–
    When I start up, it reads like this:
    R version 2.4.1 (2006-12-18)
    Copyright (C) 2006 The R Foundation for Statistical Computing
    ISBN 3-900051-07-0

    The “about info” says this:
    R for Mac OS X Aqua GUI written by:
    Simon Urbanek
    Stefano M. Iacus

    R: Copyright © 2004-2006
    The R Foundation
    for Statistical Computing
    http://www.R-project.org

    Acknowledgments:
    Cocoa Quartz device derived
    from a Byron Ellis’ work

    Preferences pane based on
    AMPreferencePane classes by
    Andreas Mayer

    Contributors:
    Rob J Goedman

    Translators:
    Anestis Antoniadis (fr)
    Alexandre Villoing (fr)
    Masafumi Okada (jp)
    Simon Urbanek (de)
    Stefano M. Iacus (it)
    Aaike De Wever (nl)

    Requires:
    R 2.4.x Framework

    Please send feedback to:
    R-SIG-Mac mailing list
    (see https://stat.ethz.ch/mailman/listinfo )
    also see R webpages for details

    Latest version is 2.12.1; that one’s 4 years old.

    I’d get the new one, but the GUI should be similar anyway. Did you see the New Document item in the File menu?

    You can email me if you want to continue this off line or we can continue this here. I’ll be happy to help either way.
