As some readers recall, I’ve been beavering away at reducing the ridiculous server load caused by various bots making constant heavy requests on the blog, including excessive requests for blog posts and for images. I am particularly sensitive to the image issue because I received a “Getty Demand Letter”. While I believe the Ninth Circuit’s ruling in “Perfect 10 v. Amazon” applies and hotlinking is not a copyright violation under US copyright law, I also think many people would be well advised to interfere with image scraping at their blogs. Interfering with image scraping will neither protect you from a copyright suit if you are violating copyright nor eliminate the possibility that a copyright troll will incorrectly come to believe you have stepped on their copyright. But it does have the potential to increase the cost of operation for entities like Picscout (owned by Getty Images) and Tineye (owned by Idee), and so reduce the likelihood that someone like Getty Images, Masterfile, Corbis, Hawaiian Art Network (HAN), Imageline or even the now-defunct RightHaven will show up demanding money in exchange for a promise not to sue. (For more on these pests see ExtortionLetterInfo.)
More importantly: preventing image scraping will reduce your hosting costs and the frequency with which your blog crashes because some entity requested zillions of images in a very short period of time.
For those uninterested in this topic: Comments will be treated as an open thread. But for those who might want to know how to reduce scraping now, I’ll post some .htaccess code you can use.
————-
The heart of my scheme to prevent image scraping is a series of blocks in .htaccess which divert
certain image requests to a .php script. Those reading will see that all blocks terminate with:
RewriteRule .*\.(jpe?g|png)$ http://mydomain.com/imageXXX.php?uri=%{REQUEST_URI} [L]
This command takes all requests ending with .jpg, .jpeg or .png and sends them to a script located at http://mydomain.com/imageXXX.php. It also tacks on the URI of the image requested. (Note: I do not filter access to .gifs.)
Those who have not yet written an http://mydomain.com/imageXXX.php script can simply forbid access to these images by changing the rule to
RewriteRule .*\.(jpe?g|png)$ - [F]
The ‘- [F]’ forbids access rather than sending the requests through a filter.
I use the more complicated command because I want to log, filter and sometimes permit access. But if you have not yet written a script to log or filter and are noticing massive image scraping, ‘- [F]’ is a wise course. (FWIW: Initially, I did use ‘- [F]’. Though I did not collect data, it seems to me that many bots just kept requesting images after being forbidden. In contrast, as soon as I began diverting to a .php file, many bots vanished the moment they were diverted. The ‘YellowImage.jpg’ experiment and a few others were rather enlightening in this regard.)
What does imageXXX.php do?
As I mentioned: initially you can just forbid access to certain requests, but diverting to a file ultimately works better. Since I am diverting rather than forbidding, the .php file does this (a stripped-down sketch follows the list):
- Because I am using Cloudflare, it pulls out the originating IP and country code.
- Logs the request to a 15-minute image log file and to a daily image log file.
- A cron job using a different script (called ‘checkfornasties’) checks the 15-minute log files and, among other things, counts the number of hits from each non-whitelisted IP and the number of user agents used by that IP during the 15-minute span. If either is excessive, that IP is banned at Cloudflare. For those at universities or in IT worrying that they will log on with their PC or Mac and then turn around and use the browser on their workstation: my theory is Julio did just that yesterday while I was implementing the excess-hits script, so I decided using two user agents in 15 minutes is not excessive. 🙂
This cron job also checks for known nasty or image-stripping referrers and user agents (i.e. UAs), and bans requests using those.
- I manually scan through the image log file from time to time to determine whether I should tweak the .htaccess file.
- (I also manually scan through raw server logs, but this has nothing to do with my php file.)
- Runs the request through a series of checks to decide whether it should serve the image as I sometimes wish to do. (Coding the image script to sometimes serve files is absolutely necessary if you use Cloudflare. Those who helped by answering questions about the “Yellow” and “Lavender” images, thanks! You contributed mightily to this. Especially the one who used the really unusual user agent!)
- Requests with referrers on a whitelist are given an ‘ok’ and are served the image if they survive the final step. Lots and lots and lots of you have been served images that passed through the script with no untoward effects. (Owing to screwups, some of you did briefly experience untoward effects when you tried to look at “YellowImage.jpg”. By the way: I will not be publishing my whitelist. 🙂 )
- Recent images are given an ‘ok’ and will be served if the request survives later steps.
- All requests, whether they are given an ‘ok’ or ‘not ok’, are sent through ZBblock, which blocks a lot of nasty things locally and either shows people ‘the scary message’ or a ‘503’. The ‘503’ is a lie; the server is not down. It just saves processing time relative to delivering ‘the scary message’, and it puts the request in a local blacklist and blocks that IP from further connections. (ZBblock blocked Anteros this morning; he emailed me, and I fixed the issue, which was likely blocking all sorts of people on a large ISP in a particular part of the world. I cleared all IPs out of the local blacklist.)
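To make that list concrete, here is a stripped-down sketch of what a script along these lines can look like. To be clear: this is not my actual imageXXX.php. The log paths, the whitelist entries and the “recent months” test are placeholders, and the real file does more (the ZBblock step, for instance, only appears here as a comment).
<?php
// imageXXX.php -- stripped-down sketch, not the full script.
// Paths, whitelist entries and the "recent" test below are placeholders.

// 1. Work out the visitor IP and country. Behind Cloudflare the connecting
//    IP belongs to Cloudflare, so use the headers Cloudflare adds.
$ip      = isset($_SERVER['HTTP_CF_CONNECTING_IP']) ? $_SERVER['HTTP_CF_CONNECTING_IP'] : $_SERVER['REMOTE_ADDR'];
$country = isset($_SERVER['HTTP_CF_IPCOUNTRY']) ? $_SERVER['HTTP_CF_IPCOUNTRY'] : '--';

$uri      = isset($_GET['uri']) ? $_GET['uri'] : '';
$referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$agent    = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// 2. Log to a 15-minute bucket and to a daily log.
$line   = date('c') . "\t$ip\t$country\t$uri\t$referrer\t$agent\n";
$bucket = floor(date('i') / 15); // 0..3 within the hour
file_put_contents('logs/images-15min-' . date('YmdH') . "-$bucket.log", $line, FILE_APPEND);
file_put_contents('logs/images-daily-' . date('Ymd') . '.log', $line, FILE_APPEND);

// 3. Decide whether the request looks 'ok'.
$ok = false;
if (preg_match('#^https?://(www\.)?(mydomain\.com|someblogilike\.com)/#i', $referrer)) {
    $ok = true; // referrer whitelist (placeholder entries)
}
if (preg_match('#/wp-content/uploads/2012/(03|04)/#', $uri)) {
    $ok = true; // "recent" images get a pass (placeholder months)
}

// 4. Serve the image or refuse it. (The real script also runs the request
//    through ZBblock before anything is served.)
$file    = realpath($_SERVER['DOCUMENT_ROOT'] . $uri);
$uploads = realpath($_SERVER['DOCUMENT_ROOT'] . '/wp-content/uploads');
if ($ok && $file !== false && $uploads !== false && strpos($file, $uploads) === 0 && preg_match('/\.(jpe?g|png)$/i', $file)) {
    header('Content-Type: ' . (preg_match('/\.png$/i', $file) ? 'image/png' : 'image/jpeg'));
    header('Content-Length: ' . filesize($file));
    readfile($file);
} else {
    header('HTTP/1.1 403 Forbidden');
    echo 'Image not available.';
}
The ?uri= parameter is whatever the RewriteRule handed over, so treat it as untrusted input; that is what the realpath() check is doing before anything is read off disk.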
Note: If a blog runs behind Cloudflare, you must create a rule to prevent Cloudflare from caching any requests to ./imageXXX.php. (Some of you recall the experiment with the yellow and lavender images. I ran that when I couldn’t figure out what the heck was going on. It turns out Cloudflare caches things, and if a user at time zero was sent to “imageXXX.php”, all other users in the geographic vicinity of that user were sent the same image!)
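For what it’s worth, the script itself can also announce that its responses should not be cached. I would not rely on this alone (how Cloudflare treats origin cache headers is its own topic, so the rule is the reliable fix), but sending headers like these near the top of imageXXX.php costs nothing:
<?php
// Discourage caching of anything imageXXX.php sends back. This is a
// supplement to the Cloudflare rule described above, not a replacement.
header('Cache-Control: private, no-cache, no-store, must-revalidate');
header('Pragma: no-cache');
header('Expires: Thu, 01 Jan 1970 00:00:00 GMT');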
So, basically: imageXXX.php logs all requests. It sometimes serves the image. It sometimes bans you locally. And if you request a ridiculous number of images in a very short amount of time and start changing your user agent to figure out whether your user agent is the reason you can’t see images, you will be banned at Cloudflare.
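For the curious, the counting part of that 15-minute check is conceptually simple. The sketch below is not ‘checkfornasties’ itself: the real script does more (including the actual banning at Cloudflare), and the hit threshold, whitelist and file name here are made up. It only shows the kind of bookkeeping involved.
<?php
// Sketch of the counting step in a 'checkfornasties'-style cron job.
// It reads one 15-minute log (tab separated: time, ip, country, uri,
// referrer, agent) and reports IPs that requested too many images or
// juggled too many user agents. Thresholds and paths are illustrative.

$logfile   = 'logs/images-15min-current.log'; // placeholder path
$whitelist = array('203.0.113.7');            // example IP, not my real whitelist
$max_hits  = 120;                             // images per 15 minutes (made up)
$max_uas   = 2;                               // two user agents in 15 minutes is not excessive

$lines = file($logfile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
if ($lines === false) {
    exit("Could not read $logfile\n");
}

$hits = array();
$uas  = array();
foreach ($lines as $line) {
    $fields = explode("\t", $line);
    if (count($fields) < 6) { continue; }
    list(, $ip, , , , $agent) = $fields;
    if (in_array($ip, $whitelist)) { continue; }
    $hits[$ip] = isset($hits[$ip]) ? $hits[$ip] + 1 : 1;
    $uas[$ip][$agent] = true;                 // set of distinct agents per IP
}

foreach ($hits as $ip => $count) {
    if ($count > $max_hits || count($uas[$ip]) > $max_uas) {
        // The real job bans the IP at Cloudflare at this point;
        // the sketch just reports it.
        echo "$ip looks excessive: $count hits, " . count($uas[$ip]) . " user agents\n";
    }
}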
Is imageXXX.php available to others?
Not yet. It will be eventually. Now that I know the method works, I need to organize this so people with IT skills even lower than mine can easily use it without my needing to provide a tutorial for each and every person who wants to give it a whirl. (People with great IT skills probably don’t even need my program!)
The next question people are likely to have is: which requests get sent to this file? There are three basic ways to get sent to the file. Because I suck at .htaccess, I wrote three separate blocks of code. (Brandon, Kan, or anyone who can tell me how to make these shorter, please do. I know it can be done, but in this first phase I was hunting for ‘effective’, not ‘CPU-efficient’.)
The three blocks of code are described below:
Block I:
# bad image referrers
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?mydomain\.com/$ [nc,or]
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?mydomain\.com$ [nc,or]
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?mydomain\.com/musings$ [nc,or]
RewriteCond %{HTTP_REFERER} ^http://(.+\.)?mydomain\.com/musings/$ [nc,or]
RewriteCond %{HTTP_REFERER} index.php$ [nc,or]
RewriteCond %{HTTP_REFERER} ^feed [or]
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{REQUEST_URI} /wp-content/uploads/
RewriteCond %{REQUEST_URI} !(2011/12|/2012/01|/2012/02|2012/04)
RewriteRule .*\.(jpe?g|png)$ http://mydomain.com/imageXXX.php?uri=%{REQUEST_URI} [L]
Motivation: Way back in December, when I first began working on getting the hammering of the site to stop, I noticed that image scrapers were persistently trying to load every single image hosted at this site going back to 2007. Requests came in at a rate faster than one per second, from a range of IPs (including some really weird ones, like a prison system in Canada). Many, many, many of these requests came with referrers that were clearly missing, probably fake or, even worse, certainly fake. For example: some came from referrers at the top of my domain (i.e. http://mydomain.com/ or http://www.mydomain.com/ ) with no blog post listed. This was very odd because if you hack back to the top of my domain (http://mydomain.com/) you will see the index page has no images; any image request presenting that referrer is faking the referrer. The conditions down through “RewriteCond %{HTTP_REFERER} index.php$” all relate to fake referrers. The condition “RewriteCond %{HTTP_REFERER} ^$” matches a request with no referrer (which can be legitimate). In contrast, a request for an image from a referrer containing “feed” is probably fake. Any request for an older image in my ‘/wp-content/uploads/’ directory with those referrers needs to be logged and scrutinized.
As discussed above, the script is coded to sometimes send images. In the case above, the algorithm never sends the image. That means if you request an older image from any site that includes the word ‘feed’ in the URL, you will not be served the image. (I may tweak this if I see legitimate requests in the logs.)
Initially, I thought Block I would be enough to handle my problem. In fact, initially, I thought just blocking requests containing ‘feed’ in the referrer would be enough. But it seems to me that as I began to block, new “methods” were being developed in parallel.
I began to notice ridiculous attempts to access zillions of images from a variety of IPs in weird places (like a prison system in Canada; I kid you not). I also noticed these often had unusual user agents like “traumaCadX”, which is used to process X-rays. A request using this user agent came from an IP that seemed to correspond to a group specializing in providing hotspots to airports, and it was scraping images. Weird.
These requests were sending referrers that many would not wish to block; some contained “google” or other search-engine strings. To catch everything using “weird” user agents regardless of referrer, I added another block:
Block II
# catch known image user agents and google through imageXXX.
# note: this does catch the google image bot.
RewriteCond %{HTTP_USER_AGENT} (image|pics|pict|copy|NSPlayer|vlc/|picgrabber|psbot|spider|playstation|traumaCadX|brandwatch.net|search|CoverScout|RGAnalytics|Digimarc|java|getty|cydral|tineye|clipish|Chilkat|web|Webinator|panscient|CCBot|Phantom|sniffer|Acoon|Copyright|ahrefs) [nc]
RewriteCond %{REQUEST_URI} /wp-content/uploads/
RewriteRule .*\.(jpe?g|png)$ http://mydomain.com/imageXXX.php?uri=%{REQUEST_URI} [L]
Note that I use [nc] in the list of user agents to block.
This is because various companies’ capitalization conventions seem to change over time. Letter strings like “image”, “pics” and “pict” appear in various image scrapers like ‘picscout’, ‘picsearch’ and ‘pictobot’. “Search” also often appears in various bots, but luckily not in the Google bot, the Bing bot or any bot I want to permit to visit. Some of these strings appear in mystery bots whose documentation I could not find.
Also, I am currently not too concerned about efficiency. I have not double-checked the list to eliminate, say, “Webinator” on the grounds that it is already covered by “web”. The reason is that I continue to check the logs manually to determine whether a short version might be over-inclusive. If a short version turns out to be over-inclusive and has to go, I don’t want to have forgotten that I still need “Webinator”.
Requests from these user agents go through imageXXX.php. Sometimes they are served images, which is important because the appearance of ‘image’ in the list means Googlebot-Image requests do go through the script. I can’t remove “image” from Block II because scrapers can fake user agents; some do try to pass as “Googlebot-Image” when scraping. (Fortunately, ZBblock will catch some of those, and my script that checks for ridiculous numbers of requests in a 15-minute window catches others. Also: the script does not necessarily send Googlebot-Image the images.)
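On the subject of fakes: none of the blocks above try to verify Googlebot-Image directly, but for anyone who wants to, the standard trick (the one Google itself documents) is a reverse-then-forward DNS check on the requesting IP. A sketch:
<?php
// Reverse-then-forward DNS check for a request claiming to be Googlebot-Image.
// Returns true only if the IP resolves to a *.googlebot.com or *.google.com
// host name that resolves back to the same IP. A scraper faking the user
// agent from its own IP fails this test.
function looks_like_real_googlebot($ip)
{
    $host = gethostbyaddr($ip);          // reverse lookup
    if ($host === false || $host === $ip) {
        return false;                    // malformed IP or no PTR record
    }
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;                    // PTR points somewhere else entirely
    }
    return gethostbyname($host) === $ip; // forward lookup must match
}

// Example with a hypothetical IP:
// var_dump(looks_like_real_googlebot('66.249.66.1'));
In practice you would cache the answer per IP; doing a pair of DNS lookups for every image request would be slow.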
Unfortunately, the two previous blocks aren’t enough. Bots can spoof user agents and spoof referrers. What if a bot tells me it’s using “Mozilla something or other” and comes from “http://joesblog.com” while requesting an old image? Did I start to see this? Yes I did. I don’t mind if the requests are for new images, or if I can verify that they come from a blog that does link my images. So, I added this:
Block III
# catch almost anything looking for an old image and send it through imageXXX.
RewriteCond %{REQUEST_URI} /wp-content/uploads/
RewriteCond %{REQUEST_URI} !(2011/12|/2012/01|/2012/02|2012/04)
RewriteCond %{HTTP_REFERER} !(Whitelisted_domains|whitelisted_blog_posts) [nc]
RewriteRule .*\.(jpe?g|png)$ http://mydomain.com/imageXXX.php?uri=%{REQUEST_URI} [L]
This block sends requests for all old images through the script unless those requests come from blog posts I have verified do link my images. I can manually verify that a blog post links my images and add it to the “RewriteCond %{HTTP_REFERER} !(Whitelisted_domains|whitelisted_blog_posts)” list. (Should I notice scrapers trying to take advantage of this whitelist, I can eliminate that line, send the blogger a note, and request they copy and host my images themselves. But for now, whitelisting some blog posts that send a lot of traffic saves CPU.) Also, I need to edit ‘RewriteCond %{REQUEST_URI} !(2011/12|/2012/01|/2012/02|2012/04)’ from time to time, because when 2013 rolls around, images containing “2011/12” in the URL will be old.
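As an aside: inside the script, ‘recent’ does not have to be a hard-coded list of months the way it is in that .htaccess line. The standard WordPress upload path already carries the year and month, so a script can work it out on the fly. A sketch of one way to do that (the four-month window is arbitrary):
<?php
// Decide whether an uploaded image counts as "recent" by reading the
// year/month out of the standard WordPress upload path, e.g.
//   /wp-content/uploads/2012/04/some-image.jpg
function image_is_recent($uri, $months = 4)
{
    if (!preg_match('#/wp-content/uploads/(\d{4})/(\d{2})/#', $uri, $m)) {
        return false; // not a dated upload path
    }
    $image_month  = mktime(0, 0, 0, (int)$m[2], 1, (int)$m[1]);
    $cutoff_month = mktime(0, 0, 0, (int)date('n') - $months + 1, 1, (int)date('Y'));
    return $image_month >= $cutoff_month;
}

// Examples, assuming "now" is April 2012:
// image_is_recent('/wp-content/uploads/2012/02/plot.png');  // true
// image_is_recent('/wp-content/uploads/2007/06/old.jpg');   // false
The .htaccess exclusion still needs the occasional hand edit, but at least the test inside the script can take care of itself.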
Once again: imageXXX.php sometimes just sends the image. In fact, the referrer whitelist inside imageXXX.php is much wider than the one in .htaccess. Many of the requests diverted by this block are shown the image. But because they are logged, I can catch image scrapers trying to race through images rather quickly.
Questions
For those who read this far: can you immediately see a huge hole in the strategy, one that a person who works at a company that makes money by scraping images, and who is strongly motivated to scrape, might exploit? I tried to think of some, but it’s always better to ask people. Also, if you can tell me how to rewrite Block I to eliminate at least two lines, let me know. And if there is something obviously stupid about having three blocks in .htaccess, let me know that too, and tell me the solution!
Meanwhile, for others: open thread. And warning: you will be seeing what I do to get rid of the cracker-bots and referrer spammers! I’m doing this precisely to get feedback from the IT people who visit and know more than I do. But I can report: cracker-bots and referrer spam are way down. (And since the rate of both is entirely independent of the rate of real visits, this is not merely because traffic is down due to light blogging.)