Using US data: Looks like 1 in anything from 2,000 to 166,667.

Yesterday, I posted on the 1 in 1,594,323 weather event characterization and said it was way too high. I then estimated the probability that 13 months' worth of temperatures in the lower 48 states of the US would remain in the top 1/3rd of the temperature distribution of all temperatures observed since 1895, using the autocorrelation of global temperature. I thought that would give me a reasonable value for the lag 1 correlation and didn't check. Well… I was wrong on that.

But the concept that we need to include autocorrelation is sound. And when I wrote that post I said I'd repeat the calculation using the US temperatures after I found them. I found them, and today I'll recompute using the US temperatures. Because of the structure of the correlation, I'm going to perform the estimate several ways. I'm doing this because a) the regional data are noisier and b) because the correlation is lower, we get a wider range of estimated probabilities depending on the assumptions about the noise structure.

The data for the US lower 48 are available here: ftp://ftp.ncdc.noaa.gov/pub/data/cirs/drd964x.tmpst.txt. I pulled out the temperatures for the lower 48 from January 1895 through June 2012, turned them into anomalies by subtracting the mean value computed for each month, and verified that the past 13 months are in the hottest 1/3rd based on their temperature anomalies. To illustrate this, I've plotted the temperatures below and added red traces to separate the regions in which the top, middle and lower thirds of the temperature anomalies fall. The most recent 13 months are to the right of the blue trace:

Using the NOAA/NCDC data, I computed the lag 1 autocorrelation using data from January 1895 through June 2012. This was R = 0.164. Under the assumption that climate had not changed and the variations can be described by 'red' noise, I found the probability of observing 13 months in a row in the top 1/3rd of the population to be 1 in 166,667. (The computation was agonizingly slow because I needed to run 1,000,000 cases to get anything near halfway decent sensitivity for this. The values are p = 6×10^-6 ± 2.5×10^-6.)
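For anyone who wants to reproduce the bookkeeping, here is a minimal sketch in R of the sort of calculation I mean. It is not the exact script I ran: it assumes a vector `temps` already holds the lower-48 monthly means ordered January 1895 through June 2012 (parsing the fixed-width NCDC file is left out), and it takes the top-1/3rd cutoff from each synthetic series' own distribution.

    # Sketch: anomalies, lag 1 correlation, and an AR(1) ("red" noise) Monte Carlo
    n_obs <- length(temps)
    month <- rep(1:12, length.out = n_obs)
    anom  <- temps - ave(temps, month, FUN = mean)   # subtract each month's mean

    # Check: are the most recent 13 months all in the hottest 1/3rd?
    all(tail(anom, 13) > quantile(anom, probs = 2/3))

    # Lag 1 autocorrelation (about 0.164 for the full record)
    r1 <- acf(anom, lag.max = 1, plot = FALSE)$acf[2]

    # Monte Carlo: how often do the final 13 values of an AR(1) series of the
    # same length all land in the top 1/3rd of that series' distribution?
    set.seed(1)
    n_sim <- 1e6
    hits  <- 0
    for (i in 1:n_sim) {
      x <- arima.sim(model = list(ar = r1), n = n_obs)
      if (all(tail(x, 13) > quantile(x, probs = 2/3))) hits <- hits + 1
    }
    hits / n_sim   # came out near 6e-6 in the run described above

Re-running the same loop with the larger lag 1 correlation computed from only the second half of the record gives the next estimate.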

Troy_CA observed that the correlation was much larger if computed based on the 2nd half of the data. This can occur for two reasons: 1) the computed correlation is higher because the trend is real, or 2) there is no trend, but the computed correlation is higher because measurements were noisier in the past. Both reasons likely contribute, but the latter falls under the assumption that "climate has not changed". So to address this possibility, I recomputed the probability of the extreme weather event using the lag 1 correlation estimated from the 2nd half of the data, again assuming the noise is "red". Under the assumption that climate had not changed and the variations can be described by 'red' noise, I found the probability of observing 13 months in a row in the top 1/3rd of the population at 1 in 111,111 (i.e. p = 9×10^-6 ± 3×10^-6); this estimate is based on 1,000,000 synthetic tests.

Because David F has been asking questions about the red noise assumption, I repeated the exercise using only the 2nd half of the data and described the noise using the best-fit ARIMA model. This was an ARIMA(2,0,5) model. For this case, I ran 10^5 realizations of synthetic data and found a 1 in 5,555 chance the event would occur if no warming had happened. This is still low, but it also accounts for the fact that, under the "no warming" assumption, the data does not look like red noise.
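Schematically, the ARIMA version of the test looks like the sketch below. Again, this is the shape of the calculation rather than my exact script; `anom2` is assumed to hold the anomalies for the second half of the record.

    # Sketch: fit an ARMA model to the (not detrended) second-half anomalies,
    # then draw synthetic "no warming" records from that fit.
    fit <- arima(anom2, order = c(2, 0, 5), include.mean = TRUE)

    set.seed(2)
    n_sim <- 1e5
    hits  <- 0
    for (i in 1:n_sim) {
      x <- arima.sim(model = list(ar = coef(fit)[1:2], ma = coef(fit)[3:7]),
                     n = length(anom2), sd = sqrt(fit$sigma2))
      if (all(tail(x, 13) > quantile(x, probs = 2/3))) hits <- hits + 1
    }
    hits / n_sim   # about 1 in 5,555 in the run described above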

But of course, no discussion about the possibility of underestimating the probability of an event is complete without touching on "Hurst/Kolmogorov" effects, so I decided to entertain David Hagen's suggestion of HK. I dug out the "fracdiff" library in R and stuffed the temperature series for the 2nd half of the temperature record into the model. Because I know the fans of Hurst/Kolmogorov are generally rooting for values of d very near 0.5, I then set the minimum 'd' to 0.45 and re-estimated the fractional differencing coefficients under this constraint. That resulted in estimates of (ar, d, ma) = (0.5511084, 0.4500553, 0.8116391). (Note: this is not the best value of "d" suggested by the data set; it's the best value if I force it to be greater than 0.45.)
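For anyone playing along at home, this is the sort of thing the fracdiff package lets you do directly; the sketch below uses its drange argument to impose the d > 0.45 floor, with `anom2` again standing in for the second-half anomalies.

    # Sketch: ARFIMA(1,d,1) fit with the fractional-differencing parameter d
    # forced to stay above 0.45.
    library(fracdiff)
    fit_hk <- fracdiff(anom2, nar = 1, nma = 1, drange = c(0.45, 0.5))
    c(ar = fit_hk$ar, d = fit_hk$d, ma = fit_hk$ma)

    # Synthetic "no climate change" records can then be drawn with fracdiff.sim
    # and run through the same 13-months-in-the-top-1/3rd counting loop as above.
    x <- fracdiff.sim(n = length(anom2), ar = fit_hk$ar, ma = fit_hk$ma,
                      d = fit_hk$d)$series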

Through this valiant effort of enforcing a high value of 'd', I was able to concoct a 'no climate change' noise model based on characteristics of the thermometer record that says the likelihood of seeing 13 months in the top 1/3rd under 'no climate change' was 1 in 2,000 (exactly!), estimated from 10,000 simulations. I'd go for more precision, but I need to visit my mother-in-law in the old-folks home, so I don't have time to run a simulation with 1,000,000 cases.

The conclusion of this exercise: depending on the noise model I use to account for the existence of autocorrelation in the thermometer record, and on whether I use the full data record or only the final 50 years, I can estimate the probability of the 13 months of US temperatures in the top 1/3rd as occurring somewhere between 1 in 2,000 and 1 in 166,667 under the assumption of "no climate change". None of these estimates say the probability is as low as 1 in 1.6 million, but then, none are near the 1 in 10 I posted yesterday either!

All of the probabilities estimated based on features of the data indicate the event is fairly improbable. So it's fair to say that, based on the characteristics of the weather we see, 13 months of US temperatures in a row falling in the top 1/3rd of the full distribution would have been unusual absent warming.

I could probably dream up other noise models. But I think I’ve fiddled enough to find the highest possible probability one could claim for the event had there been no secular warming — that’s 1 in 2,000. If you can come up with some way of claiming it’s more probable, let me know!

That said: as Andrew_FL noted, we are looking at a region that covers 2% of the earth's surface. So, at any given time, there are 50 possible non-overlapping regions of similar area. If the variability in each area is independent of the others (which it's not) we would conclude that the probability that at least 1 will exhibit precisely the "13 months in the top 1/3 of the distributions" event might be as high as roughly 2.5%. (This is calculated as p = 1 - (1 - 1/2000)^50.) Of course these probabilities are correlated, so the real probability of seeing that event would be lower. But if we can check each month to find any conceivable event that seems unlikely, and compute the probability we would have assigned to that event had we been asked to predict it before it happened, we are likely to be able to identify many apparently improbable events with great regularity. This possibility often allows people to fall into the "Texas Sharpshooter" fallacy. More amusingly, it's what baseball announcers often seem to do. It seems they can always find some truly unique accomplishment for nearly any player who comes to bat. (Example: "Did you know the current batter has the highest RBI of any rookie who bats right-handed playing in the American League this year?!")
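For the record, that figure is just the standard "at least one success in 50 independent tries" calculation:

    # Chance that at least one of 50 independent regions shows the 13-month run,
    # taking 1/2000 as the per-region probability under the most generous noise model
    1 - (1 - 1/2000)^50   # roughly 0.025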

Given the possibility of hunting for records, it's not clear that we can decide what it means when someone finds some non-standard event and reports the probability that this event would be found. For this reason, people often look askance at any 'proofs' that something unusual is happening based on someone mining data sets to find an unusual event and then computing the probability after it has been observed, without also considering all the other events that did not happen but would have constituted "proof" had they occurred.

That said: in the presence of persistent warming– even very slow persistent warming– seeing temperatures in the upper parts of the historic distribution will not be rare. Even seeing quite long streaks of temperatures in the upper 1/3rd of the distribution of weather will not be rare. It’s quite likely we’ll see more. How quickly will we see them? That depends on how fast it’s warming!

45 thoughts on “Using US data: Looks like 1 in anything from 2,000 to 166,667.”

  1. Hi Lucia

    I recommend that you perform the same analysis using the MSU-obtained lower tropospheric temperature data for the continental USA. [and also use the North American Regional Reanalysis; the data are available up to near the present].

    These analyses do not go as far back in the past, but they do not suffer from an apparent warm bias in the trends in the NCDC homogenized analyses; e.g. see

    Klotzbach, P.J., R.A. Pielke Sr., R.A. Pielke Jr., J.R. Christy, and R.T. McNider, 2009: An alternative explanation for differential temperature trends at the surface and in the lower troposphere. J. Geophys. Res., 114, D21102, doi:10.1029/2009JD011841. http://pielkeclimatesci.wordpress.com/files/2009/11/r-345.pdf

    Klotzbach, P.J., R.A. Pielke Sr., R.A. Pielke Jr., J.R. Christy, and R.T. McNider, 2010: Correction to: “An alternative explanation for differential temperature trends at the surface and in the lower troposphere. J. Geophys. Res., 114, D21102, doi:10.1029/2009JD011841”, J. Geophys. Res., 115, D1, doi:10.1029/2009JD013655. http://pielkeclimatesci.wordpress.com/files/2010/03/r-345a.pdf

    We have a new paper in press that provides further reasoning for this warm bias.

    Roger Sr.

  2. I was looking at the state data, from the same website, just before my computer crashed.
    I was looking at states from largest to smallest, and also having a look at whether precipitation correlated with temperature (it doesn't).

    I bet you that you will find more auto-correlation with individual states like Texas or California than with Rhode Island.
    The lower 48 states have a Canada on top, two oceans on either side, a shallow pool at the bottom, and a pair of huge mountain ranges running north to south.
    It appears that all these locales are quite different.

    “we are looking at a region that covers 2% of the earth's surface. So, at any given time, there are 50 possible non-overlapping regions of similar area. If the variability in each area is independent of the others (which it's not) we would conclude that the probability that at least 1 will exhibit precisely the "13 months in the top 1/3 of the distributions" event might be as high as roughly 2.5%. (This is calculated as p = 1 - (1 - 1/2000)^50.)"

    It’s true that there is spatial auto-correlation, which probably reduces the effect I’m concerned with. But there is another problem:

    Arguably, a person hunting around for records need not restrict himself to predefined non-overlapping areas of 2% of the Earth's surface. Unless the data came pre-divided so that there were exactly 50 areas he could look at, the rare-event hunter may go looking for any 2% of the Earth he wants. In practice, Masters doesn't even seem to limit himself to the criterion that events cover a fixed portion of the Earth's surface: any extreme event and he highlights it, and through innuendo, links said event to AGW. It's his MO.

  4. Andrew:

    Arguably, a person hunting around for records need not restrict himself to predefined non-overlapping areas of 2% of the Earth's surface.

    Exactly the problem here.

    Top 1/3 for 13 months is a contrived record. The odds of finding a contrived new record are right at 100%.

    As I said in the other thread: “This illustrating both the dangers inherent in data mining and the non-probative value of retrospective analysis. You can use it to construct hypotheses, you can’t use it to test them.”

  5. ARIMA (2,0,5) seems far fetched. Was this the model that gave AR coefficients greater than 1 (i.e. non-stationary)? If so, that’s probably why you need 5 MA coefficients to tame the exponential growth from the high AR coefficients.

  6. “It has been pointed out to me that the calculation of a 1 in 1.6 million chance of occurrence (based on taking the number 1/3 and raising it to the 13th power) would be true only if each month had no correlation to the next month. Since weather patterns tend to persist, they are not truly random from one month to the next. Thus, the odds of such an event occurring are greater than 1 in 1.6 million–but are still very rare. I appreciate hearing from those of you who wrote to point out a correction was needed.”

    Jeff Masters

    Again you’ve been owned. >1:1.6 Million

  7. DeWitt

    ARIMA (2,0,5) seems far fetched.

    Agreed. The difficulty is this:

    The null hypothesis is that climate has not changed. Also: this is not a test on a trend. This means that to do the test we need to fit the "noise" model to data that has not been detrended. In contrast, when we test for a trend, we fit a noise model to the residuals from a linear fit.

    If the trend exists, fitting noise to the not-detrended data involves a mis-specification error. There should be a trend. The result is that the "fits" will tend to be weird. The reason is that you are hunting for those fits that permit the data to exhibit a trend as high as we see in the data (while also fitting the finer-grained structure). It's not too surprising that you end up with far-fetched fits given the constraints imposed by the test!

    But if we are going to test this hypothesis against the null of no warming we do have to pick a fit to not-detrended data. None will fit well.

    Ordinarily, the correct way to “fix” the problem is to admit the trend exists and fit a noise model to detrended data!

    The fact that you can't do the "records" test in a rational way is one of the reasons I hate it! (That said: when you show a record is improbable, you've pretty much bent over backwards giving people who want to say it was likely a big out. The method permits someone to justify almost any loony description of "noise" conceivable to homo-statisticianus!)

    Was this the model that gave AR coefficients greater than 1 (i.e. non-stationary)?

    The arima fit procedure in R lets you specify that you only want stationary choices. So, yes, it's stationary. It might be borderline, but it's stationary.

  8. Hi all. Time for another basic question that I'm surprised I don't know the answer to. I figure if I keep it off the Mann thread it won't be as disruptive.

    The chart I see of CO2 concentrations from Mauna Loa is one of the most regular datasets I have ever seen. In fact it really doesn’t look natural.

    On the other hand, our emissions of CO2 are not nearly as regular. They have gone both up and down by very different amounts.

    Is this well covered in the literature and should I be embarrassed for not knowing why global CO2 concentrations don’t move in tandem with human CO2 emissions?

    Thanks

  9. This all gives me a giant headache. The odds, be they 1 in 1.6 million or be they 1 in 100,000, simply means those are the odds if one had, in June 2011, decided to calculate the odds that the next 13 months would each be in the top 1/3. Right? And if one had started in 1895 and decided to calculate the odds that one would see at least one 13 month stretch by June 2012 that had every month in the top 1/3- what would those odds be?

  10. Hi Lucia

    Here is the paper I referred to in my comment #99356.

    McNider, R. T., G.J. Steeneveld, B. Holtslag, R. Pielke Sr, S. Mackaro, A. Pour Biazar, J. T. Walters, U. S. Nair, and J. R. Christy (2012). Response and sensitivity of the nocturnal boundary layer over land to added longwave radiative forcing, J. Geophys. Res., doi:10.1029/2012JD017578, in press. [for the complete paper, http://pielkeclimatesci.files.wordpress.com/2012/07/r-371.pdf]

    which Dick McNider presented a guest post on at

    http://pielkeclimatesci.wordpress.com/2012/07/12/guest-post-by-richard-mcnider-on-the-new-jgr-atmosphere-article-response-and-sensitivity-of-the-nocturnal-boundary-layer-over-land-to-added-longwave-radiative-forcing-2012/

    and Anthony Watts has an informative summary and figure in his post

    http://wattsupwiththat.com/2012/07/12/important-new-paper-on-the-nocturnal-boundary-layer-mixing-and-radiative-forcing-as-it-applies-to-ghcn-weather-stations/

    Roger Sr.

  11. Assuming we live in a world with no color – what are the odds of seeing a multi-colored rainbow? Because I saw one today!

  12. Tom Fuller,
    The data with the seasonal signal removed shows some variation around the secular trend. Most of this variation is related to changes in average sea surface temperature. Some portion of the variation probably is due to changes in the emissions of CO2, but you can't see it because the natural variation swamps that signal. Let's say that the average year-on-year increase due to emissions is 2 ppm. If there is a terrible global recession, then we might expect to see a short-term 5% to 10% drop in emissions, corresponding to about 0.1 to 0.2 ppm per year less growth during the recession, but the natural variation is probably >5 times larger. So the signal is there, but hard to see.
    .
    If you adjusted the Mauna Loa trend based on how the sea surface temperature changes over time (I did this once a long time ago) then it might be possible to see the influence of things like recessions.

  13. ‘Yancey Ward (Comment #99442)
    July 12th, 2012 at 2:29 pm

    “if one had started in 1895 and decided to calculate the odds that one would see at least one 13 month stretch by June 2012 that had every month in the top 1/3- what would those odds be?”

    One. The first 13 month period would be both a high and low record by definition.

  14. Alan–

    One. The first 13 month period would be both a high and low record by definition.

    That said, I don't think you can define the top 1/3rd of the distribution over 0 years. The way this is worded, you can't really do the problem until month 16. In month 16, you have 3 months prior to the final 13 months. So then the question is: what are the chances that months 4-16 will be higher than two of the months in (1-3)?

    OK… maybe you can start in month 14. In that case, the question is: what are the odds month 1 will be lower than every month from (2-14)? Or, put another way, what are the odds month 1 will have been the lowest temperature out of 14? If the noise is white, that's 1 in 14.
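
    A quick white-noise sanity check of that, for anyone who wants it:

    # White noise: how often is month 1 the lowest of 14?
    set.seed(4)
    mean(replicate(1e5, which.min(rnorm(14)) == 1))   # should come out near 1/14 = 0.071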

  15. Why my head hurts.

    The latest 13 month streak beat what odds over the last 116 years?

  16. This is still fundamentally an abuse of probability due to the mass of information you leave out, as I pointed out on another thread.

    For example, solve the following problem:

    What is the probability that someone will win the lottery? Answer, say, 1 in 40 million.

    What is the probability that someone will win the lottery AND 1 million people will not? Pretty low.

    The context of this problem is of the latter type. Your work and that of others commenting on it is fatally incorrect for that reason.

    WM Briggs is the only blogger who seems to understand this.

  17. Gekko

    WM Briggs is the only blogger who seems to understand this.

    Please read the paragraph beginning

    Given the possibility of hunting for records, it's not clear that we can decide what it means when someone finds some n…

    It’s referring to the effect you are pointing to.

  18. Lucia,

    But you persist in stating a (range) of quantified probabilities.

    The effect I (and, I will accept, you) are talking about means I could equally validly state the probability is 1 in 2.

    It is exactly the same trap that creationists fall into when they attempt to (ab)use probability to prove their model is correct. A treatment, identical to yours, applied to the evolution of intelligent life would come up with probabilities of 1 in gazillions and they then claim such an unlikely event under the evolution model means there must be a creator.

    We know, however the probability is 1 ( as we exist) and we know that ex ante the probability of it happening again is unquantifiable, but much much higher than estimated because of the information left out of the calculation (eons of time, billions of galaxies, gazillions of stars etc.)

    It’s a Bayesian thing. The best bloggers have refused to play the number game and pointed out the fatal conceptual flaw in even trying to quantify probabilities in this case.

    I wish you had.

  19. “We know, however the probability is 1 ( as we exist)”

    That doesn’t sound right.

  20. Bugs,

    It happened hence 1.

    The probability of it happening again is something else

  21. Gekko

    But you persist in stating a (range) of quantified probabilities.

    Sure. There’s nothing wrong with the exercise.

    The probability of it happening again is something else

    Yes. And now we've done the calculation, so when it happens again in a few years, it won't be retrospective cherry-picking. That's why there is nothing wrong with the exercise.

    There isn't even anything conceptually wrong with trying to estimate the probability of evolution. One might want to estimate the chances we'd find life on other planets and so on. It's difficult to do the problem right, but that doesn't make it wrong to try.

    A treatment, identical to yours, applied to the evolution of intelligent life would come up with probabilities of 1 in gazillions and they then claim such an unlikely event under the evolution model means there must be a creator.

    Other non-creationists get different numbers. There is nothing wrong with the idea of getting an estimate. The problem is– how do you identify what must happen and how do you even begin to estimate the probability of individual sub events? It’s hard.

  22. Lucia, I think Gekko is generally right here.

    If you do the analysis retrospectively, you can always find events that are at the +2σ level, by chance even +3σ or higher, if you make N large enough.

    Let’s look at it this way, the US is 2% of the world’s surface area. That means there’s 50 “United States” in the world in terms of surface area.

    Climate mitigation advocates are willing to use extreme weather from a single month to point to a particular outcome. And how many different variables are there that they examine? Temperature (really two outcomes, because cold apparently is evidence for global warming climate disruption), precipitation (two again, drought and flooding); that is, we use the double-sided distribution for temperature and precipitation. Stopping there to make the point… "4" outcomes from two variables. So in a year N = 12*50*2 = 1200… there will be 30 +2σ extreme events (remember we're using the double-sided distribution here) and 4 +3σ events in any given year, and 300 +2σ, 40 +3σ, and roughly 1 +4σ event in a decade.

    Now I agree with you, it's useful to evaluate the probability of it happening again, and we can conclude that by chance alone it won't happen again in our lifetime. But the statistic itself came from data mining, and the chance of finding an anomalous result is virtually 100%. Next time, it'll just be a different statistic.

    The take-home is that using extreme events in the tail of the distribution without a model that makes a specific prediction is a very difficult way to prove a point. You need to not only observe points on the tail, you have to make the case the tail is getting "fatter". Why not use the mean value instead? Or the variance?

    It's warmer (this record is a convoluted way of telling us that), part of the arctic ice cap has melted (the 2007 record is a convoluted way of saying that), and whether it is getting wetter or drier… that is complicated by the fact that the models completely fail to predict precipitation rates, let alone changes in precipitation.

    Figure.

  23. But the statistic itself came from data mining, and the chance of finding an anomalous result is virtually 100%. Next time, it'll just be a different statistic.

    I agree on the "next time it'll just be a different statistic" point. That's why I think hunting for records is frowned on as a method of "proving" things. It's much better to test specific predictions that are actually made.

    Still, I think it's sometimes interesting to look at how often runs happen. (My grandmother was 1 of 8 girls. That doesn't happen in many families, and so the statistic is interesting to compute. But of course, over the entire planet, looking for all families with 8 girls, the chance of finding one with all 8 girls was certainly greater than 1 in 2^8.)

  24. Lucia, I know you know this, but the value of retrospective analysis is in hypothesis building, especially if you have a model that you recognize retrospectively would predict this observed phenomenon.

    So what you do is go back and observe for the same period of time. In the case of screening trees, you say “this tree is a temperature proxy” and use other trees from the same grove as temperature proxies (you use correlation for verification from that sampling, rather than for selection).

    In this case, the point of the record seemed to be “gee it’s getting warmer.” My point about this is there’s a much easier way of demonstrating that.

    That said I do find calculating the odds interesting.

    Here’s one for you: How many people do you need in a party before the chance is 50% that two will have the same birthdate? (I learned that one from an Isaac Asimov article.)

  25. Since creationists, evolution and math are all being discussed, I can’t resist pointing something out:

    It is exactly the same trap that creationists fall into when they attempt to (ab)use probability to prove their model is correct. A treatment, identical to yours, applied to the evolution of intelligent life would come up with probabilities of 1 in gazillions and they then claim such an unlikely event under the evolution model means there must be a creator.

    We know, however the probability is 1 ( as we exist) and we know that ex ante the probability of it happening again is unquantifiable, but much much higher than estimated because of the information left out of the calculation (eons of time, billions of galaxies, gazillions of stars etc.)

    The part I made bold is wrong. Our existence does not prove the evolution of intelligent life. We can’t even prove we actually are intelligent life (or that we exist, but nobody wants to go there). We certainly can’t prove that we evolved. For all we know, some deity created the entire universe just five minutes ago!

    For a more serious examination of the subject, we can accept our perceptions of the universe are “right,” humans are intelligent, and humans evolved in the way evolution generally describes. None of that implies intelligent life evolved. Nobody knows what intelligence actually is, much less how to find it, prove its existence or understand how it came about. If humans are intelligent, we have to accept that we don’t know where that intelligence came from. We can guess or posit it came from evolution, but there’s no way to assign a probability.

  26. Here’s one for you: How many people do you need in a party before the chance is 50% that two will have the same birthdate? (I learned that one from an Isaac Asimov article.)

    It’s pretty low! Today I’m not going to do it– but I seem to remember it was something like….12? People tend to think of it as the odds someone will share my birthday, but if it’s anyone can share anyone else’s, it’s a pretty low number.

  27. lucia:

    It’s pretty low! Today I’m not going to do it– but I seem to remember it was something like….12? People tend to think of it as the odds someone will share my birthday, but if it’s anyone can share anyone else’s, it’s a pretty low number.

    I don’t feel like doing the math right now either, but I believe 12 people gives you something closer to 18%. The value I gave before, 23, is the right answer with a probability of ~51%.

  28. lucia (Comment #99643)

    Other non-creationists get different numbers. There is nothing wrong with the idea of getting an estimate. The problem is– how do you identify what must happen and how do you even begin to estimate the probability of individual sub events? It’s hard.

    What usually happens is that you replace a hard estimate with a wild guess, and then you take the expected value over eons, billions, and gazillions. 🙂

  29. Like a lot of these problems, you compute this one by computing the probability P'(p) that no two people out of a group of p people have a birthday on the same day, then you compute the probability that two people have the same birthday by writing

    $latex P(p) = 1 - P'(p)$

    For 1 person, the probability P’ = 1 and P = 0 of course.

    For 2 people, P'(2) = 1 * (1 – 1/365)

    For 3 people, P'(3) = 1 * (1 - 1/365) * (1 - 2/365)

    and so forth

    This generalizes for p people to $latex P'(p) = \prod_{k=0}^{p-1} (1 - k/365)$.

    This result can be simplified, but this formula is the easiest to compute (the other version you end up with factorials that you have to express as log gamma functions).

    Here’s the awk code for how you implement this efficiently:

    function birthday(p)
    {
        # probability that no two of p people share a birthday (uniform days assumed)
        prod = 1;
        for (n = 0; n < p; n++) {
            prod = prod * (1 - n/365);
        }
        # complement: probability that at least two share a birthday
        return 1 - prod;
    }

    “It is exactly the same trap that creationists fall into when they attempt to (ab)use probability to prove their model is correct. A treatment, identical to yours, applied to the evolution of intelligent life would come up with probabilities of 1 in gazillions and they then claim such an unlikely event under the evolution model means there must be a creator.

    We know, however the probability is 1 ( as we exist) and we know that ex ante the probability of it happening again is unquantifiable, but much much higher than estimated because of the information left out of the calculation (eons of time, billions of galaxies, gazillions of stars etc.)”

    While I disagree with the jump from the low probability of intelligent life in the form of human beings evolving to proof of a super natural creation of the human form, the probability that the proper conditions would exist from the start of evolution to when humans evolved is considered quite low by most scientists I have read.

    I do not believe I have seen a good estimate made that would relate those odds here on earth to billions of other planets in the universe and the estimated time of our universe being existence. I have seen estimates made but always with major assumptions.

  31. Carrick
    Your solution assumes birthdays are uniformly distributed throughout the year. In fact they are not. When you use the actual distribution of birthdays I believe you get a number closer to Lucia’s 18.

    But you are correct. Most people assume a much bigger number because they do not understand the actual problem.

  32. Here’s what the reduced version of P'(p) looks like:

    $latex P'(p) = { \displaystyle 365! \over 365^p (365-p)! }$

    and here’s the C-code for computing P(p) using log-gamma functions:


    double birthday(int p)
    {
        int n = 365;
        return 1 - exp(lgamma(n+1) - lgamma(n-p+1) - k * log(n));
    }

    Note:

    $latex 365! \approx 3.64 \times 10^{778}$, which is a big number, and why you have to employ log-gamma functions here, if you're going to be stubborn about "purity of code."

  33. John:

    Your solution assumes birthdays are uniformly distributed throughout the year. In fact they are not. When you use the actual distribution of birthdays I believe you get a number closer to Lucia’s 18.

    Thanks John! I forgot to mention this assumed a uniform distribution.

    Do you have a preferred distribution for birthdays? This would make an interesting Monte Carlo toy.
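
    Something along these lines, say; the weighting below is a made-up seasonal bump standing in for real birth-frequency data:

    # Monte Carlo toy: chance that at least two of k people share a birthday when
    # birthdays are drawn from a non-uniform distribution (placeholder weights).
    birthday_mc <- function(k, n_sim = 1e5,
                            w = 1 + 0.1 * sin(2 * pi * (1:365) / 365)) {
      prob <- w / sum(w)
      mean(replicate(n_sim, any(duplicated(
        sample(1:365, k, replace = TRUE, prob = prob)))))
    }
    birthday_mc(23)   # compare with ~0.507 for the uniform case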

  34. Somewhat OT:

    Remember climacograms? I believe the rest of the world calls them Allan variance plots and there is an R package ‘allanvar’ to do them. It turns out that sort of information is important if you’re building an inertial navigation system. Considering that you have to integrate the accelerometer output twice to get position, even a small amount of white noise causes a lot of drift (back to unit roots again). If I remember correctly, the original inertial guidance units in submarines and missiles used three hand lapped beryllium spheres rotating on orthogonal axes. I think they use ring lasers to do it now. The MEMS gyros and accelerometers in your cell phone are way too noisy for inertial navigation. That’s why they have GPS and triangulation from cell towers, etc.

  35. I looked at the birthday data that I could find, and it’s pretty interesting. From this source, here is birthdate by year:

    Figure.

    Surprisingly, what dominates the birthday rate within a year is not the day of year, but whether you were born during a period of increasing, stable, or declining birth rate. If the rate is fixed, of course, the assumption of uniform distribution is a good one.

    At first blush, the simplest extension to a non-uniform population would have to specify the range of ages of people at the party (unless it's a very awkward party, usually they're pretty close in age), and from there you can work out the slope of the birthrate over the entire year. A simple linear trend in birthrate would probably suffice.

    The initial dataset I found was (it’s my impression) for younger people located here for which a uniform assumption is a good one.

    [ADDED: Thanks for the other link too, John.]

  36. I know this is obvious, but if the distribution is non-uniform, what day of the year you’re born matters.

  37. Carrick:

    and here’s the C-code for computing P(p) using log-gamma functions:

    Where did the ‘k’ variable come from?

    is a big number, and why you have to employ log-gamma functions here, if you’re going to be stubborn about “purity of code.”

    I’m not sure what you mean by “purity of code,” but if you’re willing to use a longer code segment, it’s easy to simplify the numbers before doing any calculations. I believe that should let the code run so long as you don’t pass too large a value of p.

    Thanks John! I forgot to mention this assumed a uniform distribution.

    I didn’t! The reason I posted the remark about caveats is I don’t think there’s any real way to calculate the odds otherwise. There is too much demographical information which could be specified.

  38. Brandon:

    Where did the ‘k’ variable come from?

    Sorry I changed variables from “k” to “p” and missed one. I was trying to get it to match my comments.

    Here’s the “real” version of the code as I implemented it:

    double birthday(int n, int k)
    {
        return 1 - exp(lgamma(n+1) - lgamma(n-k+1) - k * log(n));
    }

    [Yes I wrote it for the general case where the number of days on the planet are different than 365. 😉 ]

    As I usually do, I tested it against a Monte Carlo (uniform distribution assumption) to verify my code worked.

    I’m not sure what you mean by “purity of code,” but if you’re willing to use a longer code segment, it’s easy to simplify the numbers before doing any calculations. I believe that should let the code run so long as you don’t pass too large a value of p.

    If you knew some of my math friends, you’d understand. They want to start from the “human readable” version then code that up, to look *exactly* (within reason) to what the reduced expression would look like. That’s what I mean by “purity”.

    There’s no reason that I can find, for this problem, in not using the multiplicative algorithm. The errors introduced by differencing are small compared to the effects of other assumptions, and of course run time isn’t an issue.

    If you rewrite the numerator to get rid of the 365! (and have it not blow up for small p) it reduces to the multiplicative expression I gave above.

    I didn’t!

    Frankly I was riding on the fact it had already been brought up as not needing to repeat it.

  39. Carrick:

    If you knew some of my math friends, you’d understand. They want to start from the “human readable” version then code that up, to look *exactly* (within reason) to what the reduced expression would look like. That’s what I mean by “purity”.

    Ah, gotcha.

    There’s no reason that I can find, for this problem, in not using the multiplicative algorithm. The errors introduced by differencing are small compared to the effects of other assumptions, and of course run time isn’t an issue.

    I think either approach will work just as well for this problem. There’s some difference in run time and accuracy, but not enough to care.

    If you rewrite the numerator to get rid of the 365! (and have it not blow up for small p) it reduces to the multiplicative expression I gave above.

    Basically, yeah. Your approach generates P-false then subtracts it from one. The other approach would just immediately return P. I think your approach gives faster run time because the calculations are easier.

    As an aside, you can actually simplify it a little more. In your For loop, there’s no reason to set n to zero. If you set it to one, you’ll shave off an iteration. Mathematically, that’s equivalent to recognizing 365! and 365^p will always share 365.

    Edit: Or does the For loop increment n at the start of the loop, not the end?

Comments are closed.