I’ve managed to ban enough “cracker”-type bots that a significant fraction of the remaining bot load is from comment spam bots that do the following stupid things (in order of frequency):
- Claim they are the googlebot by spoofing the user agent.
- Claim they were referred to “wp-comments-post.php” from the home page of my blog, from the domain root, or from nowhere at all.
- Give no user agent.
In contrast, honest-to-goodness commenters hitting the comments script provide a referrer that points to a blog post (e.g. “http://rankexploits.com/musings/2012/bugs-may-find-this-sad/”), not the home page (“http://rankexploits.com/musings/”) or root (“http://rankexploits.com/”).
Honest-to-goodness commenters do not claim to be the googlebot, which is never actually inclined to comment. By the way: most of the bots claiming to be the googlebot are from Brazil. They do manage to get a few comments into the database. Akismet keeps you from seeing them, but they are sufficiently not-stupid that I have to empty the spam bin.
I was catching these comment spammers, especially the Brazilians, by logging all hits to WordPress in 15-minute-long files and then running a cleanup script. But that Brazilian bot manages to dance quite a few sambas before I get it. So, I’m now forbidding these in .htaccess. (I later send all 403s to a script; one of the things this does is report things spoofing the googlebot to cloudflare in real time. So this will trim the Brazilian comment spammer’s dance card.)
For WordPress bloggers who merely want to reduce the CPU and memory load on their servers, I recommend the following bit of code in .htaccess:
# comment controls
# if referrer is root, homepage or not my blog.
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?rankexploits\.com/?(musings/?)?$ [nc,or]
# is not from the blog itself.
RewriteCond %{HTTP_REFERER} !^http://(.+\.)?rankexploits\.com/musings [nc,or]
# claims it's a spider or has a blank user agent.
RewriteCond %{HTTP_USER_AGENT} (google) [nc,or]
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule wp-comments-post\.php$ - [F,L]
# end comment controls
You’ll need to tweak the first RewriteCond line above to use it at your blog, in the following way (and swap your own domain and path into the second RewriteCond line as well):
- If hosted at the top of your domain, replace the first line with:
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?mydomain\.com/?$ [nc,or]
- If hosted in a subdirectory (e.g. “blog”), replace the first line with:
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?mydomain\.com/?(blog/?)?$ [nc,or]
where “mydomain.com” is your domain name and “blog” is your subdirectory.
Should the Brazilian bots start to pretend they are bing in addition to google, you can change (google) to (google|bing). That will also forbid commenting by the equally silent bing bot.
Regular visitors should see no particular change in commenting. Comments might post slowly, but that’s an (annoying) feature, not a bug. (Also, it’s mostly due to other spam filters, not the .htaccess.)
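For anyone who wants to check the patterns before touching a live .htaccess, the rule can be mimicked offline. What follows is a minimal Python sketch, not something taken from the setup described above: `build_rules` and `is_forbidden` are hypothetical helper names, the second pattern for a top-of-domain install is an assumption (the post only spells out the first line for that case), and Python’s `re` module only approximates what mod_rewrite does with the same expressions.

```python
import re


def build_rules(domain, subdir=""):
    """Compile referrer patterns modelled on the two HTTP_REFERER conditions.

    Returns (home_re, blog_re):
      home_re matches the domain root or the blog home page,
      blog_re matches any URL that lives under the blog.
    """
    host = r"^http://(.+\.)?" + re.escape(domain)
    if subdir:
        home = host + r"/?(" + re.escape(subdir) + r"/?)?$"
        blog = host + "/" + re.escape(subdir)
    else:
        # Assumption: for a blog at the top of the domain, any URL on the
        # domain counts as "from the blog".
        home = host + r"/?$"
        blog = host
    return re.compile(home, re.I), re.compile(blog, re.I)


def is_forbidden(referrer, user_agent, domain, subdir="", spiders="google"):
    """Return True if a POST to wp-comments-post.php would get a 403.

    The four tests are OR-ed together, like the [or] flags in the rule.
    A missing header is represented by an empty string.
    """
    home_re, blog_re = build_rules(domain, subdir)
    return bool(
        home_re.search(referrer)                    # root or home page
        or not blog_re.search(referrer)             # not from the blog
        or re.search(spiders, user_agent, re.I)     # claims to be a spider
        or user_agent == ""                         # blank user agent
    )
```

Calling `is_forbidden` with a real post URL and an ordinary browser user agent should come back `False`; the home page, the root, an empty referrer, an empty user agent, or anything with “google” in the user agent should come back `True`. Passing `spiders="google|bing"` mirrors the (google|bing) change.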
I’ll come clean. I am spamming your site. Like the Omnidroid 9000, I have programmed a spambot that operates in many ways. One of these is to have the spambot read your posts describing your anti-spam procedures and adjust accordingly.
I have run into problems many times with email servers in Brazil being blocked by my local server because they have become spam zombies. There is also a huge amount of fraudulent phishing going on in Brazil. The phishers often create a fake “bank site” that is identical in appearance to a real site, and try to get victims to reveal account information, pass codes, etc. These sites can appear and disappear quickly (hours) to stay ahead of the police, and usually involve insiders who have access to bank customers’ email addresses; Brazil has a lot of illegal internet activity. So I guess the bot spoofing you see is consistent with this.
MikeN– Heh.
Unfortunately, over time, spambots do adjust. However, to the extent their goal is to post spam or hack in, they can’t be entirely undetectable.
In the case of comment spam, the bots could in principle become undetectable. All they need to do is look exactly like a human posting a comment. Strictly speaking this means:
1) Provide a user agent a human uses. That’s easy: just say you are using a popular browser like Internet Explorer or Firefox. Get a usable UA string by copying it from a popular browser. Copy/paste would work like a charm.
2) Spoof a referrer from a real post. In principle this is easy. You only need to know the referrer for a real blog post. In practice you need to send a crawler to learn the real post… and… This requires more lines of code, more cpu and memory for your little script.
3) Better yet, spoof a referrer for a real post with open comments. (Oddly enough, I left out the line that will catch this at my blog. The reason: at my blog, I close comments. I know a feature that all valid referrers share at my blog. But it’s not general for other blogs. That feature is that the referrer contains “2012” or is the ONE old post that has comments left open forever.)
After that, you need to get past spam filters like AKISMET, which track IPs etc., or other filters that check to make sure your browser does javascript. But my .htaccess redirects are catching things that don’t even do (1) & (2). I’m not quite sure why AKISMET doesn’t “know” that comments that claim to be from the “googlebot” are soooooo spammy they shouldn’t even go to the spambox; just disappear them.
But I want to do what AKISMET can’t do– ban these at cloudflare.
Anyway, the reason the comment spam bots are “stupid” is that it costs too much human time to program a bot that is smart enough to look like a human.
SteveF–
As you noted the phishing sites open and close. Similarly, the IPs associated with the spamming/cracking or other stuff also change. Owing to the fact that these things change IPs, I need to write the “unban” at cloudflare script today. I’ll be unbanning everything more than 1 week old.
Unfortunately, cloudflare does NOT make it easy to unban. I’d like something that just let me “unban” everything more than 1 week old. But what I need to do is
a) keep track of everything I banned.
b) one week later, read the file and unban it.
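Steps (a) and (b) might be sketched roughly like this in Python. This is only an illustration under stated assumptions: the function names and the JSON log file are hypothetical, and the actual call to cloudflare is deliberately left as an `unban` callback supplied by the caller, since nothing here says what that request looks like.

```python
import json
import time

WEEK_SECONDS = 7 * 24 * 60 * 60


def record_ban(log, ip, now=None):
    """Step (a): remember when an IP was banned (log maps ip -> timestamp)."""
    log[ip] = time.time() if now is None else now


def expire_bans(log, unban, max_age=WEEK_SECONDS, now=None):
    """Step (b): call unban(ip) for every entry older than max_age.

    `unban` is a caller-supplied function (hypothetical here) that does the
    real work of lifting the ban. Expired entries are removed from the log.
    Returns the list of IPs that were released.
    """
    now = time.time() if now is None else now
    released = []
    for ip, banned_at in list(log.items()):
        if now - banned_at > max_age:
            unban(ip)
            del log[ip]
            released.append(ip)
    return released


def save_log(log, path):
    """Write the ban log to disk as JSON."""
    with open(path, "w") as fh:
        json.dump(log, fh)


def load_log(path):
    """Read the ban log back; a missing file means nothing is banned yet."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}
```

Run from cron once a day, the flow would be `load_log`, then `expire_bans`, then `save_log`. Anything banned by hand and never passed to `record_ban` is simply never in the log, so it stays banned.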
I can keep certain large blocks banned for a long time. (For example: lots of image scrapers are just in the business of scraping. So, I can ban large ranges associated with tineye, picscout etc manually. Then, I don’t “unban” automatically later.)
Lucia, In the UK I see quite a lot of spoof/fake Baidu bots. Do you not get a lot of them in the US of A?
Just checked on a WP blog that gets about 1500-2000 hits a day, I’ve got 2 ‘baidu spiders’ with 7 connections open between them..
Chuckles–
Both real and fake baidu bots are PITAs. I ban both the real and fake Baidu bots at cloudflare. Also, neither the fake nor real baidu are trying to submit comment spam. So I don’t need that in this .htaccess rule. If I start seeing it, I’ll reroute (google|baidu) for user agents.
I suspect the Brazilian comment spammers were finding spoofing google got around a lot of blocks while spoofing baidu didn’t. Lots of people live in fear of blocking the google bot. Some will recall I did block it a month ago and I lost rank on google search results. I unblocked and whitelisted a whole boatload of known google IPs so those IPs will never be blocked at cloudflare. Now I’m fine.
The Brazilians can spoof the UA, but they still present a Brazilian non-google IP. So, I can ban them. I just wasn’t doing it in real time. During the 15-minute wait they were dancing all over the server logs!
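The “claims google but isn’t on a google IP” test can be sketched in a few lines of Python. This is a rough illustration rather than the script actually in use: the function names are hypothetical, and the whitelist below holds a placeholder documentation range, not real google addresses, which would have to be filled in from a trusted source.

```python
import ipaddress

# Placeholder only: 192.0.2.0/24 is a reserved documentation range,
# NOT a real google range. Replace with the whitelisted google IPs.
WHITELIST = [ipaddress.ip_network("192.0.2.0/24")]


def claims_google(user_agent):
    """True if the user agent string mentions google (case-insensitive)."""
    return "google" in user_agent.lower()


def is_spoofed_googlebot(ip, user_agent, whitelist=WHITELIST):
    """True if the UA claims to be google but the IP is not whitelisted."""
    if not claims_google(user_agent):
        return False
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in whitelist)
```

A hit for which `is_spoofed_googlebot` returns `True` is a candidate for an immediate ban; a hit from a whitelisted range is left alone no matter what it claims to be.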
New project. Send lucia hidden messages by spoofing UAs/referrers with customized text.
The alarmists are apparently finishing construction of a diabolical cyber-weapon. When complete they intend to unleash it on the rational blog world. It is the fearsome GleickBot. It pretends to be a real user by copying not only the HTTP agent of actual posters but also their name. It can be detected though, mostly because it always asks for documents and claims to have a new email address. It also inserts random punctuation into its posts and writes reviews of books it hasn’t crawled.
Brandon– It would probably work right now!
I bot come chew lucia’s googly leg. Yum. Climate Sensitivity up sea level by 0.73m by 2100. So Gleick is fake Anteros now.
nice mark. and it uses heartlandinsider as email address