About every 3 months or so, I find I need to do a “blog clean-up”. Afterwards, I generally mention the things I did to clean things up. Although most of my visitors aren’t interested, and the “pros” laugh at the naive methods I describe, these posts sometimes help other bloggers “out there” who google and find the things I noticed. So, this will be in the form of a “list” of things you can do to “clean up” blog inefficiencies, especially if you are noticing Dreamhost has become “slow”. The problem could be you, not Dreamhost.
I always write these with advice geared toward total noobs who know almost no programming and do not want to learn unix just so they can blog. Programmers who know how to do things efficiently don’t need this advice at all. Noobies who started a blog using Dreamhost’s “one-click installation” don’t want to be told “first learn unix” when they are trying to fix any little thing that might go wrong. So, they need advice that lets them do things using tools easily available at Dreamhost. In my case, I generally have to clean up because the blog has started using either too much CPU or too much memory.
In this case, my blog was using too much memory. So, I decided to do things to figure out if the memory issue was due to honest to goodness traffic (in which case, the solution is to pay more for hosting) or to my blog doing something inefficient. Turns out my problem was the latter.
Here are the things I checked and fixed.
- Check that caching is working properly. If not, fix this, stat!
Problem: Caching is one of the best ways to reduce server load on shared hosting. I was using WP Super Cache, and I discovered it was misbehaving.
How I discovered it: First, WP Super Cache tells you to view the page source and find a message in the footer. My message said
Why not?
On the admin side, by checking the setting in WP Super Cache that shows the list of cached pages, I also noticed the plugin was malfunctioning in a resource-intensive way. It used to work, but for some reason it was assigning improper cache addresses like “http:/rankexploits.com/musings/musings/name_of_blogpost”. This meant the plugin a) used memory creating the cache but b) couldn’t find the cached copy when someone visited a page that already had one in store. Not good.
Solution: I tried switching all sorts of settings, deactivating, reactivating, etc. I then decided it was more time-efficient to give up and find another plugin. I downloaded and installed W3 Total Cache, which is now happily caching pages.
For many blogs using too much memory, this alone would fix the major inefficiency. Just make sure you are caching and that the caching is actually working. Yes, I mean you, Keith. Right now your blog footer reads:
- Instruct good ‘bots: robots.txt.
Problem: Good ‘bots like those from search engines can sometimes swamp a low-resource hobby blog. My server logs showed ‘bots and crawlers were racing through the site, sometimes loading 60 pages/second for 5 minutes at a time. They were also loading pages they really don’t have any business loading. Combined with WP Super Cache misbehaving, this was causing quite a massive memory draw as multiple pages were requested simultaneously from my server at Dreamhost.
How I found it: I looked at my server logs and noticed ‘bots crawling. You can look at your server logs by visiting your Dreamhost account, clicking “manage” under “domains” in the left sidebar, then finding “ftp” under your domain name. This is Dreamhost’s webftp GUI (which is currently not a bit addled). After you click “ftp” you should arrive in your “root” directory. You’ll see “logs”; click it, then click your domain name, then click the mysterious http.bunch_of_number. Now, click “access.log”– just the top one.
You will see a file full of something a noobie will consider “gibberish”. These are server logs showing everyone who accessed pages without triggering an error. Right now, you want to identify good ‘bots.
There are lots of ways to tell they are ‘bots. Search ‘bots generally self-identify with names like “googlebot”. For example, the following are search engine ‘bots:
crawl-66-249-67-234.googlebot.com - - [11/Oct/2011:00:59:08 -0700] "GET /musings/page/16/?action=post%3Btopic%3D210.0%3Bnum_replies%3D1http%3A%2F%2Frankexploits.com%2Fmusings%2Fpage%2F62%2F%3Faction%3Dpost%3Btopic%3D210.0%3Bnum_replies%3D1 HTTP/1.1" 200 543 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
spider83.yandex.ru - - [11/Oct/2011:01:11:28 -0700] "GET /musings/2009/tricking-yourself-into-cherry-picking/comment-page-3/ HTTP/1.1" 200 462 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
But there are lots of different ‘bots, and some have obscure names. (Had you ever heard of the Yandex ‘bot from .ru? Bet not.) For the not-very-computer-literate, one way is to notice something loading zillions of pages in the archives. So, I could see something visiting lots of pages with names including “/musings/2009/…” or other years in the title. Those are old. Anything reading lots and lots of posts from 2009 is a ‘bot. If you look at the gibberish, you will notice it is actually very easy to tell which things were loaded by ‘bots. But really, you just need to look.
One of the things you will notice is that ‘bots will race through. So, you’ll see a string of hits that starts like this:
crawl-66-249-67-234.googlebot.com - - [11/Oct/2011:00:59:08 -0700] "GET /musings/page/16/?action=post%3Btopic%3D210.0%3Bnum_replies%3D1http%3A%2F%2Frankexploits.com%2Fmusings%2Fpage%2F62%2F%3Faction%3Dpost%3Btopic%3D210.0%3Bnum_replies%3D1 HTTP/1.1" 200 543 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
crawl-66-249-67-234.googlebot.com - - [11/Oct/2011:00:59:09 -0700] "GET /musings/2008/effect-of-including-volcanic-eruptions-on-hindcastforecast-of-gmst/ HTTP/1.1" 200 11781 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
crawl-66-249-67-234.googlebot.com - - [11/Oct/2011:00:59:12 -0700] "GET /musings/page/16/?cid=17503http://rankexploits.com/musings/page/62/?cid=17503 HTTP/1.1" 200 9077 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Notice the googlebot visited at [11/Oct/2011:00:59:08 -0700] and then one second later at [11/Oct/2011:00:59:09 -0700]. If you saw the full log, you’d see this continued for several minutes. Because I have limited memory allocated by Dreamhost, I need to slow this down.
Solution: To slow the crawl rate I added “Crawl-delay: 600” to my robots.txt. (If you know nothing about robots.txt, you should read Google’s webmaster advice. Make one; add it to the top of your root directory.) According to many resources, the “Crawl-delay: 600” directive is supposed to tell the ‘bots to wait 600 seconds between page loads; Google doesn’t document it, though.
If Crawl-delay works as advertised, ‘bots that recognize the directive will still visit, but they won’t ask for a zillion pages all at once, crashing my blog. I’ll be lowering the time interval to 60 seconds later on; for now, I just want to see whether the ‘bots really obey this. I’ll look at my server logs next week and see if the ‘bots seem to have slowed down.
To further reduce load, I also told ‘bots to stay off ‘tag’ pages and assorted other pages people rarely visit. I want Google to send people to the ‘main’ address for posts anyway, so every one of those posts should still get indexed even if I keep Google off the ‘tags’. To tell the ‘bots to stay away from tag pages, I included Disallow: /musings/tag/ in the robots.txt file.
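Putting those two pieces together, the relevant part of my robots.txt now looks roughly like this minimal sketch (my actual file has a few more Disallow lines, so treat this as an example, not a copy):

User-agent: *
Crawl-delay: 600
Disallow: /musings/tag/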
- Ban bad bots: .htaccess & Project Honey Pot
Problem: Bad ‘bots, including email harvesters, comment spammers and various sorts of crackers, also race through the site. In contrast to good ‘bots, which you may just want to slow down or keep off pages that haven’t been updated and so don’t need their cache refreshed, we’d all just like these ‘bots to stay away. If your host uses Apache, you can get rid of lots of them. (Otherwise, I have no advice on what to do!)
How to detect: The first step in keeping bad ‘bots away is to recognize them. I’m going to focus on how a know-almost-nothing sort of blogger can detect these. As with good ‘bots, if you look at your server logs, you’ll see these ‘bots race through and load all sorts of odd pages.
Most bad ‘bots don’t obey robots.txt. So, even weeks after you edit robots.txt to tell them to stay off a particular page, they will still visit it. That’s a bad sign. If you notice a ‘bot racing through the site and it doesn’t seem to be a known good ‘bot, and you become suspicious, you’ll want to confirm it’s a bad ‘bot.
A fairly effective way to confirm they are ‘bad’ is to visit Project Honey Pot and enter the IP listed in your server logs into Project Honey Pot’s “search IP” tool. If you are a little bit more sophisticated about programming, you can also use Project Honey Pot’s blacklist (http:BL).
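For the more programming-inclined: the blacklist works by DNS lookup. You sign up for a free http:BL access key, then query a hostname built from your key plus the visitor’s IP written backwards, under dnsbl.httpbl.org. A rough PHP sketch of the idea follows; the access key below is a made-up placeholder, and you should check Project Honey Pot’s documentation for the exact meaning of the response octets.

<?php
// Rough sketch of an http:BL lookup. The access key is a placeholder;
// consult Project Honey Pot's http:BL documentation for the details.
$access_key = 'abcdefghijkl';                          // placeholder: your http:BL key
$visitor_ip = $_SERVER['REMOTE_ADDR'];                 // e.g. 203.0.113.7
$reversed   = implode('.', array_reverse(explode('.', $visitor_ip)));
$query      = $access_key . '.' . $reversed . '.dnsbl.httpbl.org';
$answer     = gethostbyname($query);                   // returns $query unchanged if the IP is not listed
if ($answer !== $query) {
    // Listed: the response octets encode how recently the IP was seen,
    // its threat score, and its type (harvester, comment spammer, etc.)
    error_log('http:BL flags ' . $visitor_ip . ' as ' . $answer);
}
?>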
Chances are, if you’ve done nothing to block these ‘bots in the past and you are checking for bad ‘bots manually rather than using a script, you will be spending lots and lots of time trying to identify them. This is not time-efficient.
Over the long haul, a more time-efficient approach for a blogger who does not want to constantly look at server logs is to a) get a honeypot script from Project Honey Pot and b) install it on your server, recording the name of your honeypot. It will be something like ‘some_weird_name.php’. Put this in your root directory. Then add:
Disallow: /some_weird_name.php to your robots.txt file. (You should have one at the root of your domain by now.)
A few days later, using a text editor, edit ‘some_weird_name.php’ by adding a bit of email-notification code at the end of the file.
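The exact snippet I use isn’t reproduced here, but a minimal sketch along these lines, using PHP’s built-in mail() function, does the job. (Depending on whether the honeypot file already ends inside a PHP block, you may need to drop the opening and closing tags.)

<?php
// Appended at the end of some_weird_name.php (the honeypot script):
// email me the visitor's IP address whenever the honeypot gets loaded.
$visitor_ip = $_SERVER['REMOTE_ADDR'];
$subject    = 'Honeypot hit from ' . $visitor_ip;
$message    = 'some_weird_name.php was loaded by ' . $visitor_ip
            . ' at ' . date('r')
            . ' with user agent: ' . $_SERVER['HTTP_USER_AGENT'];
mail('youremailaddress@wherever.com', $subject, $message);
?>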
Provided you change “youremailaddress@wherever.com” to your honest to goodness email address, this bit of code will send you an email message when anything loads the file called “some_weird_name.php”, i.e. your honeypot. The message will report the IP address. You can then check that IP at Project Honey Pot and identify the ‘bots. If Project Honey Pot says the IP belongs to a spammer, you now know this is a bad ‘bot and you’ll want to take action. (I added these lines and I’m now getting about 2 emails a day.)
I also added similar code to my wp-login.php file– but there, I wrapped it in a bit of code to avoid getting an email every time I log in.
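My exact wrapper isn’t shown here, but one simple way to do that kind of wrapping (assuming you connect from a more-or-less fixed IP address) is to skip the email when the request comes from your own IP:

// Added near the top of wp-login.php, inside the existing <?php block:
// only send the alert when the visitor is NOT coming from my own IP.
$my_own_ip  = '203.0.113.45';   // placeholder: replace with your own IP address
$visitor_ip = $_SERVER['REMOTE_ADDR'];
if ($visitor_ip !== $my_own_ip) {
    mail('youremailaddress@wherever.com',
         'wp-login hit from ' . $visitor_ip,
         'wp-login.php was loaded by ' . $visitor_ip . ' at ' . date('r'));
}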
What to do: After identifying bad ‘bots, you need to block them. Otherwise, identifying the ‘bots is pointless.
To block the ‘bots, you will need to learn about .htaccess files and add a few lines of code to the appropriate .htaccess file. If you self-host and your server runs Apache, you can do this. (This means you, Keith Kloor. But be careful before doing this. It’s easy, but typos can take your whole site down. Maybe set aside some time with a friend who can hold your hand if necessary.)
What you will want to do is add code to the pre-existing .htaccess file. WordPress will have created one which contains the rules that make your permalinks work. If you screw that up, your blog will not function. So, before you try to change your .htaccess file, save a copy of the one that currently works, giving it a different name. (Files whose names start with a period ‘.’ are invisible on many systems, so calling the copy something like ‘htaccess-orig’ might be useful.)
Then, with a text editor, add code that looks like this:
order allow,deny
#illya
deny from 69.162.68.138 # project honey pot Oct 9
deny from 65.52.109.19
deny from 74.201.255.215 # project honey pot Oct 10
deny from 72.9.105.66 # Oct 11
deny from 209.124.55.53 #
#
deny from 178.216.8.25 # Oct 11 wp-admin attempt-- honey pot.
deny from 95.168.191.160 # Oct 11 wp-admin attempt-- honey pot.
deny from 95.166.24.156 # Oct 11 -- honey pot.
deny from 212.113.36.85 # Oct 11 -- honey pot.
deny from 213.155.9.156 # Oct 11 -- honey pot.
deny from 188.143.233.34 # Oct 11 wp-admin attempt-- honey pot.
deny from 91.226.212.33 # Oct 11 wp-admin attempt-- honey pot.
allow from all
The key things are: the code begins with “order allow,deny”; after that, you list all the IPs you want to deny; then you end with “allow from all”. This blocks the listed IPs. You will want to tailor this by deleting the IPs in my example and adding the IPs that have been hitting your blog. (There are thousands of spammers out there. Oddly, some find blog A and hammer it while leaving blog B alone. Once a particular IP is played out, the spammer abandons it. So, you really do want the “deny from xxx.xxx.xxx.xxx” lines to match the IPs hitting your blog.)
The # character starts a comment. I’m writing myself notes so I know why certain IPs are in .htaccess. (Note: do not put any HTML in the comments. I tried to put a link to a web resource and that caused a fatal error.)
After you add lines, save the new file, upload it to your server and name it ‘.htaccess’. Then immediately try to load your blog. If you get an ‘internal server error’, delete ‘.htaccess’ and replace it with the original file that was working before you fiddled.
Each time you detect a new “spam IP”, add a “deny from xxx.xxx.xxx.xxx” line to your .htaccess file. From time to time, delete IPs you identified long ago.
If you’ve never fiddled with .htaccess, I recommend testing out your ‘.htaccess’ file in a subdirectory first, since you can place a separate .htaccess file in each subdirectory.
For those (like you Keith) who might want to implement this: feel free to ask questions. I or others here can probably help!
Lucia, this too is fascinating. If you run over your memory allocation, does whatever it is go to swap? How about CPU consumption? Is that your portion of a single machine? Is it possible to briefly describe the architecture of the host system and the portion assigned to you?
Is the software you require resident including the plug-ins or do you have your own virtual “root” where you can modify the system (within your allocation) as your needs vary?
It’s good of you to share these observations.
john
j ferguson–
I don’t know the details of what happens at Dreamhost. From my customer point of view, I pay $x. Once I set up an account and pay $x/month, Dreamhost guarantees a certain amount of memory. They actually sometimes let me draw more. But if my account is drawing too much, it risks taking down the server, so their load limiter caps how much my VPS account draws and my blog s_l_o_w_s. It also starts to drop comments, etc.
At any time, I can change my level of service up or down, but then my billing changes. I sometimes do increase my service to keep the site from going down. But after I do, I generally check for inefficiencies. I don’t mind paying if I need “X” memory, but I don’t want to pay extra merely because I’m too lazy to see that a WordPress plugin is malfunctioning and doing something nuts.
Owing to the way WordPress is programmed and the relative limits at Dreamhost, I hit too much memory long before I hit too much CPU. 3 years ago, it was the other way around. I don’t know what details of the architecture you want to know– I don’t delve into that. I have an account that guarantees me a certain amount of memory.
I’m not sure what you mean by this:
By software, do you mean WordPress? Or PHP? Or Apache?
WordPress is written by the WordPress developers, and I upload that. Dreamhost happens to make it trivially easy to upload and upgrade, but if they didn’t, I could upload it by ftp.
Dreamhost maintains the server and installs stuff like PHP. I don’t worry about those.
Thanks, you answered my question.
j ferguson– Good! I’ve fiddled with more blog clean-up stuff today. I’m hoping something I did helps me keep lots of ‘bots out. (I should make it a plugin, but it’s easier to just hack it into wp-login.php.)
We may have sparred in the past, but this is a lovely vade mecum.
Some good points, Lucia; the spammers and ‘bots can try the patience of a saint at times. I presume your references to ‘memory’ were to RAM (program memory) rather than disk space?
When WP RAM utilisation starts climbing, it’s often due to a need for a bit of MySQL or Apache tweaking. Since ALL of WordPress is constantly hauled out of the DB, a default MySQL install is not good, and the numbers need to be upped a bit.
On the Apache side, I had a blog website that was locking up badly every 1-2 weeks, requiring a reboot. As you describe, memory utilisation was climbing till everything started swapping, thrashing and dying.
I traced it to the default Apache MaxRequestsPerChild setting, which never ‘recycles’ child threads/processes, so their memory usage slowly creeps up till swapping starts. By setting a fairly low value of 200 or 500 or so, the problem went away. No memory problems or reboots in the 6+ months since then.
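For anyone wanting to try the same tweak, the relevant bit of Apache config (shown here for the prefork MPM; the value of 500 is only an example, tune it for your own traffic) looks something like this:

# In httpd.conf: recycle each Apache child after it has served 500 requests,
# so any memory it has slowly leaked gets handed back to the system.
<IfModule mpm_prefork_module>
    MaxRequestsPerChild 500
</IfModule>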