Because something tried to load the main page of my domain nearly 30,000 times this morning, I hunted down the API that lets me find the underlying IPs that hit my blog recently. The API gives me limited information: I can’t see which pages they hit using that API. Still, it does let me specifically examine the IPs Cloudflare considers “threats”. Here are the “threat” IPs that hit over 20 times in 12 hours.
I’ve highlighted two interesting ones:
{"response":{"ips":[{"ip":"188.138.86.35","classification":"threat","hits":398,"zone_name":"rankexploits.com"},
{"ip":"128.2.207.79","classification":"threat","hits":396,"zone_name":"rankexploits.com"},
{"ip":"184.107.248.202","classification":"threat","hits":88,"zone_name":"rankexploits.com"},
{"ip":"128.156.10.80","classification":"threat","hits":53,"zone_name":"rankexploits.com"},
{"ip":"89.22.206.178","classification":"threat","hits":49,"zone_name":"rankexploits.com"},
{"ip":"193.205.203.3","classification":"threat","hits":35,"zone_name":"rankexploits.com"},
{"ip":"158.234.251.71","classification":"threat","hits":26,"zone_name":"rankexploits.com"},
{"ip":"164.92.9.21","classification":"threat","hits":23,"zone_name":"rankexploits.com"},
{"ip":"94.228.34.207","classification":"threat","hits":23,"zone_name":"rankexploits.com"},
{"ip":"195.27.12.230","classification":"threat","hits":20,"zone_name":"rankexploits.com"},
{"ip":"159.54.131.7","classification":"threat","hits":20,"zone_name":"rankexploits.com"},
{"ip":"204.136.242.10","classification":"threat","hits":20,"zone_name":"rankexploits.com"},
{"ip":"128.147.28.69","classification":"threat","hits":19,"zone_name":"rankexploits.com"},
128.2.207.79: Someone at Carnegie Mellon University, Pittsburgh, United States
More interesting: who is 128.156.10.80? It’s Cleveland…:
OrgName: National Aeronautics and Space Administration
OrgId: NASA
Address: IS05/Office of the Chief Information Officer
City: MSFC
StateProv: AL
PostalCode: 35812
Country: US
I don’t know what criteria Cloudflare uses to decide a particular IP is a threat. But if you are an IT person at NASA CERF or at Carnegie Mellon, I suggest you figure out what’s going out of your server. Because Cloudflare sure doesn’t like you!
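The filtering described at the top of this post (keep only “threat” IPs with enough hits) can be sketched in a few lines of Python. This is a minimal sketch, not the script I actually ran: it assumes the response has the shape shown above (`response`, then `ips`, each entry carrying `ip`, `classification` and `hits`). The function name `threat_ips` and the `min_hits` parameter are made up for illustration, and the sample is just three entries copied from the list and wrapped so the JSON is complete.

```python
import json

# Three entries copied from the output above, wrapped so the JSON is complete.
SAMPLE = '''{"response":{"ips":[
{"ip":"188.138.86.35","classification":"threat","hits":398,"zone_name":"rankexploits.com"},
{"ip":"128.156.10.80","classification":"threat","hits":53,"zone_name":"rankexploits.com"},
{"ip":"128.147.28.69","classification":"threat","hits":19,"zone_name":"rankexploits.com"}
]}}'''


def threat_ips(raw_json, min_hits=20):
    """Return (ip, hits) pairs classified as 'threat' with at least
    min_hits hits, sorted from most hits to fewest.

    threat_ips and min_hits are hypothetical names, not part of any API.
    """
    entries = json.loads(raw_json)["response"]["ips"]
    kept = [
        (entry["ip"], entry["hits"])
        for entry in entries
        if entry["classification"] == "threat" and entry["hits"] >= min_hits
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for ip, hits in threat_ips(SAMPLE):
        print(f"{ip}\t{hits}")
```

Lowering `min_hits` to 19 would also keep the last entry in the list.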
This sure is weird. Why are they trying to compromise your blog? Could someone be masking or hacking through their system? And even so, why go through all that trouble to disrupt your blog? Maybe someone wrote a really badly done bot for trying to skim blogs for climate change debate analysis information (which you are a marvelous source for), and doesn’t realize their code has gone sour.
Talk about a bizarre situation!
I’d suspect that their machine is compromised and contact them.
Lucia,
I agree with Mosh. It’s possible that someone is illegally using various servers as proxy servers to mask their own IP. It may seem deliciously ironic to use the NASA server in this context. You should contact them directly to ask them to run a source check.
Paul_K–
I alerted both NASA and Carnegie Mellon that they were listed as a threat on cloudflare. What does “running a source check” mean?
ftnchek – a Fortran utility similar to “lint” for C.
That’s for F77 that I learnt at college in the 70’s. That was referred to as “running a source check”.
NASA may well still use it :)
steveta_uk–
But what does the utility do? That is: by asking for a “source check” what would I actually be asking NASA to figure out? I’m not looking for an answer in “code-glish” that tells a programmer what utilities they might use (or lets a person who already knows the utilities understand because they know what the utility does). I was wondering what a “source check” is. What would NASA be checking their servers for? Stuff coming into their server? Stuff going out? What in the various log files would they be checking?
Sorry, I was being facetious. A source check utility is used to validate the program source code – it has no use whatsoever in tracking down a compromised system that they might have.
NASA need to determine the source of the traffic – the IP address is a pretty good clue.
By the way, the answer from Carnegie Mellon is
No answer from NASA.
This sort of answer is why I often just block.
Since CMU invented some of the original web search engine programs (e.g. Lycos -> Yahoo), with lots of nerdy programmers running around, isn’t it certainly possible that some old crap is still chugging away over there and getting hung up by modern bot checks?
Why not NASA too? Ghosts in the machine. Betcha quatloos it’s not malicious (of the human sort anyway).
BillC–
I have no idea what’s possible. But if the IP for my domain or the IP I’m assigned by my ISP was showing up as a threat on various agencies, I’d suspect some sort of worm, virus or something was on my machine.
My IPs from the ISP once did come up on threat lists. I had dynamic IPs, and Comcast yanked the email spammer’s account and got the IP off the RBL lists. Things don’t end up on RBL lists without something bad going out.
I suspect the NASA and CMU hits are not malicious in a “human-human” sort of way. Of course initially a human wrote a script. But then it just gets unleashed.
Oh, Dreamhost was down most of yesterday afternoon. You can read the two status messages for Jan 18 here: http://www.dreamhoststatus.com/
Lucia,
Re “source check”, I just meant that the NASA Administrator should be able to check on the source of the messages to your site, and may be able to determine whether they come from a legitimate user, a Trojan in the NASA server or a hacker using the NASA server as a proxy.
Paul
Paul_K– So you mean they should check their outgoing server logs? That would make sense. But I suspect they should check more generally because I don’t think cloudflare would mark them as a threat for hitting my blog only.
That said: I emailed them. I’ve heard nothing. It’s in their lap. On my side I can just ban.
Lucia,
I’m the person responsible for the web crawler that you say is hammering your site. I’m a prof at CMU who is conducting a large crawl to create a web dataset that will be used by a broad community of researchers to develop better search engine, language understanding, and text mining technology. See http://lemurproject.org/clueweb09/ for information about an earlier dataset that we created.
I’m sorry and surprised that our crawler is causing you problems. We are using the Internet Archive’s Heritrix crawler – it’s a standard crawler used by IA and other researchers. It is configured to obey robots.txt and various crawling niceness protocols. We intend to be a good citizen, not a nuisance. We are looking into what might have gone wrong with your site.
The crawler is configured to identify itself in your web server log files and provide a url to our project, so that you can contact us in case of problems. Because you didn’t see it, you emailed the wrong person. It took a while for your email to be relayed to me.
It would be helpful if we could have an offline conversation to get additional details about what went wrong. The snippet shown above says 396 hits from our crawler, not 30,000. I’m not doubting your word – it would just be helpful for us to have more details about how many hits, what page(s), and timestamps, if you have that information. callan@cs.cmu.edu
Thanks for calling this to our attention. Once again, I’m sorry about any inconvenience. We’ll try to get it sorted out ASAP.
Jamie Callan
Face it Lucia, the Men in Black Hats are after you! Now what is it that you know that you’re not telling us? Have you figured out the identity of FOIA? :)
Jamie,
Hi. I don’t think your bot did the 30,000 hits. I think I was a bit unclear. My site got hammered, which motivated me to look at what Cloudflare tells me hit my site. Then I noticed your IP is listed as hitting 396 times, and is also identified as a threat by Cloudflare. Because Cloudflare results in some opacity with respect to my server logs, I couldn’t go and see the hits, so I can’t tell whether you hit robots.txt, obeyed it, disobeyed it, or anything.
But I think you might need to ask Cloudflare why you are listed as a threat. I’ve looked around their site and I haven’t figured out what metric they use to assign threat numbers, but that’s how they list your IP. I do get access logs; it’s just that the IPs are all Cloudflare’s. I’ll go look for your crawler UA and let you know if it misbehaved in any way.
LTI is the Language Technologies Institute, and “boston-cluster” simply seems to be a name assigned to a cluster that is apparently in Pittsburgh, as I don’t see any additional latency beyond Pittsburgh to the cluster itself. If the cluster were somewhere else, there would be another several milliseconds of network latency between Pittsburgh (which is how the traffic is routed from my location) and the cluster itself. It all appears to be in Pittsburgh.
What is also interesting to me is the 128.147.28.69 address. That is University of Pittsburgh Medical Center.
Someone running a bot-net might have decided to target you or something.
boston-cluster is a small computer cluster used by my research group at the Language Technologies Institute. The Language Technologies Institute is a graduate computer science department in Carnegie Mellon’s School of Computer Science that studies information retrieval (search engines), machine translation, speech recognition, computer assisted language learning.
173.245.55.60 - - [17/Jan/2012:01:04:17 -0800] "GET /musings/2009/lets-watch-sea-levels/feed/ HTTP/1.1" 200 9509 "http://rankexploits.com/musings/2009/lets-watch-sea-levels/" "Mozilla/5.0 (compatible; lemurwebcrawler admin@lemurproject.org; +http://boston.lti.cs.cmu.edu/crawler_12/)"
I sent you an email with a file of all the lemur hits from the 17th (299), with the last around 10 am. The logs show none on the 18th.
Thinking as I write this: The 299 seems low since Cloudflare thinks your IP hit 396 times, and I was asking for hits in the 12 hours prior to my blog post. So, the 299 in the list shouldn’t even be counted in the 396! (I can rerun various hours, but I have to do it soon because Cloudflare has an upper limit on how many hours I can check.)
(In the last 48 hours your IP hit: {"ip":"128.2.207.79","classification":"threat","hits":729,"zone_name":"rankexploits.com"})
In the 299, I don’t see anything that would look threatening. But that still potentially leaves at least
…
I don’t see anything that would have bothered me. But I don’t see any hits to robots.txt! (Maybe you hit it on the 16th? That access file is zipped. I can look for you if you like.)
According to the access logs lemur had zero hits on the 18th.
For what it’s worth, the 30,000 hits came from
173.245.63.135 - - [18/Jan/2012:05:59:53 -0800] "GET / HTTP/1.0" 301 460 "http://rankexploits.com/musings/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.2; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; OfficeLiveConnector.1.4; OfficeLivePatch.1.3; .NET4.0C; .NET4.0E)"
The 173.245.63.135 is a Cloudflare IP that happens to match none of the IPs lemur hit with. But I don’t know how Cloudflare assigns their IPs. On the 17th lemur came in on 173.245.55.60, 173.245.55.64, 173.245.55.71 etc. All seem to be 173.245.55.xx.
I still think you want to learn why cloudflare thinks you are a threat. (I’m wondering if their threat designation is reliable?)
Lucia
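The log checks described above (count the hits from one user agent, see whether it ever asked for robots.txt, and note which Cloudflare IPs it arrived through) could be sketched like this. It is only a sketch under assumptions: it assumes the access log is in the Apache “combined” format that the two quoted lines appear to follow, `summarize_agent` is a hypothetical helper name, and the second sample line has its user agent shortened to keep the example readable.

```python
import re

# Apache "combined" log format, which the quoted log lines appear to follow:
# ip ident user [time] "METHOD path PROTOCOL" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Two lines taken from the comment above (second user agent shortened).
SAMPLE_LINES = [
    '173.245.55.60 - - [17/Jan/2012:01:04:17 -0800] '
    '"GET /musings/2009/lets-watch-sea-levels/feed/ HTTP/1.1" 200 9509 '
    '"http://rankexploits.com/musings/2009/lets-watch-sea-levels/" '
    '"Mozilla/5.0 (compatible; lemurwebcrawler admin@lemurproject.org; '
    '+http://boston.lti.cs.cmu.edu/crawler_12/)"',
    '173.245.63.135 - - [18/Jan/2012:05:59:53 -0800] "GET / HTTP/1.0" 301 460 '
    '"http://rankexploits.com/musings/" '
    '"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.2; Trident/4.0)"',
]


def summarize_agent(lines, agent_substring):
    """Summarize log lines whose user agent contains agent_substring.

    Hypothetical helper: returns the hit count, whether /robots.txt was
    requested, and the sorted set of connecting (proxy) IPs.
    """
    hits = 0
    robots_txt_hit = False
    proxy_ips = set()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None or agent_substring not in match.group("agent"):
            continue  # unparseable line, or a different user agent
        hits += 1
        proxy_ips.add(match.group("ip"))
        if match.group("path") == "/robots.txt":
            robots_txt_hit = True
    return {
        "hits": hits,
        "robots_txt_hit": robots_txt_hit,
        "proxy_ips": sorted(proxy_ips),
    }


if __name__ == "__main__":
    print(summarize_agent(SAMPLE_LINES, "lemurwebcrawler"))
```

Because the connecting IPs in these logs are Cloudflare's, the user agent string is the only handle on who the visitor is, and a user agent can be faked, so a summary like this says what a client claimed to be, not what it was.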
In the last 46 hours I get
{"ip":"128.2.207.79","classification":"threat","hits":648,"zone_name":"rankexploits.com"}
48 hours:
{"ip":"128.2.207.79","classification":"threat","hits":729,"zone_name":"rankexploits.com"}
This is the last 30 hours:
{"ip":"128.2.207.79","classification":"threat","hits":141,"zone_name":"rankexploits.com"}
28 hours:
{"ip":"128.2.207.79","classification":"threat","hits":1,"zone_name":"rankexploits.com"}
29 hours:
{"ip":"128.2.207.79","classification":"threat","hits":72,"zone_name":"rankexploits.com"}
(I’m doing this so I can keep track of time stamps!)
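Since each query returns a running total over a look-back window, subtracting neighbouring totals shows roughly when the hits happened. Here is a minimal sketch, reading the counts above as 28 hours: 1, 29 hours: 72, 30 hours: 141, 46 hours: 648, 48 hours: 729. It assumes all the queries ended at about the same moment, which is only approximately true since they were run one after another, and `hits_per_interval` is a name made up for this example.

```python
# Cumulative hit counts reported for 128.2.207.79, keyed by how many
# hours back the query reached (as read from the results above).
WINDOW_HITS = {28: 1, 29: 72, 30: 141, 46: 648, 48: 729}


def hits_per_interval(window_hits):
    """Turn cumulative look-back totals into per-interval counts.

    Returns (newer_edge_hours_ago, older_edge_hours_ago, hits) tuples.
    Assumes every window ends at the same moment.
    """
    intervals = []
    previous_hours, previous_total = 0, 0
    for hours in sorted(window_hits):
        total = window_hits[hours]
        intervals.append((previous_hours, hours, total - previous_total))
        previous_hours, previous_total = hours, total
    return intervals


if __name__ == "__main__":
    for newer, older, hits in hits_per_interval(WINDOW_HITS):
        print(f"{newer:>2} to {older:>2} hours ago: {hits} hits")
```

On that reading, only 1 hit falls in the most recent 28 hours, and the bulk (507) falls between 30 and 46 hours back.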
what’s my time stamp at the blog?
George (Comment #88798)
I kept that last one in the list on purpose. :)
Even though I’m in the process of writing stuff to interrogate what I get from Cloudflare’s API, I hadn’t written it yet. But I noticed 128.x.x.x popped right out. I think something might be going on in Pennsylvania!
On the one hand: Jamie’s lemurwebcrawler bot itself looks very, very well behaved. Very. Everything that leaves the lemurwebcrawler user agent is very nice. But I don’t think lemurwebcrawler accounts for the CMU hits in the 12 hours prior to my 14:XX Jan 18 blog post. lemurwebcrawler lost interest in rankexploits.com at 10 am Pacific Standard Time (noon CST) on Jan 17, and hasn’t re-appeared.
If I’m not mistaken, none of the hits in Cloudflare’s tally for CMU in the 12 hours prior to my blog post are from “lemurwebcrawler”.
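The “128.x.x.x popped right out” observation is easy to check mechanically by grouping the listed IPs on their first octet. This is a rough sketch using the thirteen addresses from the post; `group_by_first_octet` is a hypothetical helper, and sharing a first octet says nothing by itself about who owns an address or where it is, so a whois lookup is still needed for that.

```python
from collections import defaultdict

# The thirteen "threat" IPs listed in the post, in their original order.
THREAT_IPS = [
    "188.138.86.35", "128.2.207.79", "184.107.248.202", "128.156.10.80",
    "89.22.206.178", "193.205.203.3", "158.234.251.71", "164.92.9.21",
    "94.228.34.207", "195.27.12.230", "159.54.131.7", "204.136.242.10",
    "128.147.28.69",
]


def group_by_first_octet(ips):
    """Group dotted-quad IPv4 strings by their first octet (hypothetical helper)."""
    groups = defaultdict(list)
    for ip in ips:
        first_octet = int(ip.split(".")[0])
        groups[first_octet].append(ip)
    return dict(groups)


if __name__ == "__main__":
    groups = group_by_first_octet(THREAT_IPS)
    # Print only the octets that occur more than once.
    for octet, members in sorted(groups.items()):
        if len(members) > 1:
            print(octet, members)
```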
The thing is that bots often spread using a virus and one thing many of them do when spreading is to begin scanning ip addresses in a range. One of the indications one can sometimes get from malware spreading through a network is that it is attempting to rapidly scan through IP address ranges. Some of the more sophisticated ones will scan pseudo-random IPs at varying intervals, some of the more “brute force” scanners simply go sequentially through “blocks” of IPs. So it isn’t unusual to find “clusters” of infections around certain IP blocks. 128 is an interesting number because it is right in the middle of the IP address range (0-255 with 127 being reserved for the local machine’s internal use). The middle is often a good place to start.
So in the initial phases of malware spreading through a network, one might find hosts around a certain address range being infected first, with the infection spreading across other ranges as time elapses. The infection might even change the ranges where it starts scanning according to the generation of the virus, or it might not. It depends on how much attention to detail the author has in actually wanting to infiltrate systems. There is enough low-hanging fruit out there that one doesn’t have to be too sophisticated to get a lot of bots running. Some bot nets have literally millions of machines under their control. Wikipedia estimates that 15% of machines world-wide are, to some degree, under the control of bot-nets.
lucia you are officially a geek. as if there was any doubt
mosh,
It ain’t official until there’s a certificate suitable for framing.
Just sayin :)
Probably showing my total computer/internet ignorance here, but could it have been related to the Anonymous attack on government and other websites after the shutdown of Megaupload?
Here is an article on how they used others to unwittingly help them.
http://news.cnet.com/8301-27080_3-57363103-245/anonymous-tricked-people-into-joining-web-site-attacks/
true that.
Maybe Lucia is getting her bona fides in order to join Anonymous
Undoubtedly NOT a server issue, but rather virus-infected PCs being used in a DDoS attack.
steven– Not planning to join Anonymous! It seems that article is discussing something that happened the 19th. The hammering here was on the 18th.
I wish I’d been logging the IPs hitting the root! Dang!