A different kind of blacklist: Spam.

This post is triggered by 1) a desire for some feedback (see questions at the end), 2) questions about “karma points”, and 3) a bleg for tips on php scripts to create plots. Because some of the feedback questions build on how SpamKarma evaluates comments, I’ll discuss that first. I’ll finish by asking for feedback.

More about karma points than you ever wanted to know

The main spamfilter I use is called SpamKarma. It’s no longer maintained by its originator, but continues to be compatible with WordPress and continues to do a very good job at screening out spam left by robots.

A number of regular site visitors wonder what karma they have and what they might do to risk being sent to spam.

I can’t answer that because SpamKarma doesn’t associate karma points with specific visitors. SpamKarma doesn’t care about “people”, and so doesn’t try to identify specific commenters. Instead, it diagnoses comments by looking for specific “symptoms”. The originator devised a method that relies on noticing things spambots do that people don’t do, and then also coming up with ways to detect people. Describing precisely what it does ends up being complicated because it does a lot of checks.

Here are some examples of what SpamKarma’s default modules do:

  • [[User Level]]: This check does recognize “people”. When I log in, I’m “admin”; that’s the highest user level. When Zeke logs in, he is “author”. WordPress sets and reads a cookie, to recognize me. So, I’m nearly always logged into the admin when I comment and I get positive Karma points. If you are not logged in on the admin side of the blog, your user level is “visitor” and you get no points one way or the other.
  • [[Blacklist]]: This is the other way in which “people” can gain or lose points.

    The blacklist tracks IP and domain name information and tags it “black”, “white” or “grey”. The overwhelming majority of IP and domain name entries are added by SpamKarma. I can also add and delete IPs and domain names manually.

    The IP black/white lists keep a log of IPs and count the number of times a comment with that IP was approved or sent to spam. So, if you have commented a lot using the same IP, your comments will get up to positive karma points for this; the number of points seems to depend on the number of comments associated with that IP. On the other hand, if your comment was sent to spam, you will get between negative karma points for that.

    The domain list is similar. SpamKarma reads urls in comments and adds them to the white or blacklist. If your “author url” or the content of your comment includes a white or blacklist domain, your comment either gains or loses karma points. Once again, the number of points depends on the number of times that url has been included in an approved or denied comment. (By the way: This can be more than 5 karma points. If you have a blog or a favorite web page, and you often comment, adding it as the “author url” on a comment that is accepted will ensure future comments are given extra positive karma points. This is the single best thing you can do to give yourself “good” karma– even good enough to be able to include “spamkarma words” in your comments.)

    I can also add words like “denier” or “denialist” to a blacklist or whitelist. I rarely add words and SpamKarma never does. I add words to the whitelist if SpamKarma is dinging someone and I can’t figure out why. If your comment contains a black-listed word, you lose 5 spam karma points. (BTW: The number of points you are dinged is a bit confusing for the admin because on the admin side, I enter 100. SpamKarma takes that to mean I want it to ding you 100% of 5 points.)

    Wondering what the spamkarma “words” are? This is the complete list:

    Notice Knappenberger is on the whitelist? That’s because he was having trouble commenting a while back. I couldn’t figure out why, so I just manually whitelisted his name, IP, author_url and other features.

    Carrot Eater was disappointed he didn’t get blacklisted by adding a whole bunch of nasty “Godwin” words. Nope. They aren’t on the list. I’ll add more blacklist words if they are used repeatedly in long running blog-spats. Otherwise adding words is a PITA and not very useful. (I could also tweak troll control to include a real blacklist of words. But I don’t think it’s worth the CPU.)

  • [[Javascript Payload]]: Most spambots are not javascript enabled. Spamkarma places a code in the comment which can only be returned if your browser is javascript enabled. If your browser is not javascript enabled or you turned javascript off because of privacy concerns, you will get dinged karma points.
  • [[Encrypted Payload]]: This is similar to above, but this one involves adding some encrypted info to the comments to force bots or people to actually load the comment form at the blog.
  • [[Link Counter]]: Spammers often include loads of links. If you add a lot of links, SpamKarma dings you some karma points.
  • [[Stopwatch]]: If you click submit too soon after my server sends you the web page, SpamKarma will dub you “Flash Gordon” and ding you points. How short a time? It’s a small fraction of a second. Normal humans are never dinged by this.
  • [[Entities Detector]]: This has to do with checking for improper use of html entities to avoid the blacklist filter. Normal humans are rarely dinged by this, but an html < sign can be dangerous to those who have not accumulated any positive karma.
  • [[Snowball Effect]]: This hangs up new visitors or established visitors who have brand new IP addresses, especially if they appear and post 8 comments after 10 pm while I am not here to clear their comments.

    What happens is this: Someone SpamKarma does not recognize arrives and posts a few comments in an hour or two. They get in a blog-argument, and suddenly all their zingers in response to the person they were talking to are in moderation. Arghh!!!

    The new visitors always think I did this. I did not. The problem is that this behavior is indistinguishable from that of spambots, which often find a blog and return if they were successful at leaving comments. (The addition of moderating first comments will actually minimize this frustration for brand-new visitors because by the time they can get into a rapid-fire blog argument, SpamKarma will already notice an approved comment with a date that is not extremely recent. The snowball effect module will take this into account and be less likely to think you are a spammer.)
    Update: After writing this, I figured it would be nice to fix this a little. I checked the ‘advanced’ settings on SpamKarma and tweaked this. I told SpamKarma to apply “snowball” only if someone posts more than 15 comments in 1/2 a day; the criterion used to be 15 comments in a day. This will spank enthusiastic new visitors less frequently. Based on line 113 of the code posted here, I think a higher “coefficient” will also apply the snowball penalty to fewer comments. So, I increased the “coefficient” from 3 to 4; this should tend to spank people who have some stored comments less violently.

  • [[TrackBack Referrer Check]]: This checks trackbacks left by blogs who included links to my blog in their posts. Some of you will recall I reminded Anna Haynes that trackbacks, not email or leaving a comment, are a standard method of blog-to-blog communication. It is sufficiently standard that SPLOGs or spammers who don’t even have blogs will send trackbacks to try to get links to their SPLOG on my blog. The referrer checks that the trackback at least comes from a real blog that left a link to my blog on their web page.
  • [[Post Age and Activity]]: If you leave a comment on a very old post that hasn’t gotten a comment in weeks, SpamKarma gives the comment negative points.
  • [[Captcha Check]]: If you get a fairly bad, but not horrific, SpamKarma total, SpamKarma will ask you to fill out a Captcha. If you pass, it will either approve or moderate the comment. It won’t send you to spam. If your initial score was horrific, it won’t present you with a Captcha.
  • [[Anubis]]: This pretty much just adds up all the karma points and decides what to do with your comment.
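
To make the pile-up of checks concrete, here is a minimal sketch of how a few of the modules above could each hand back karma points that a final step adds up and acts on. It is written in Python purely for illustration (SpamKarma itself is a PHP plugin) and is not SpamKarma's code. The only number taken from this post is the 5-point base for a blacklisted word, scaled by the percentage entered on the admin side. Every other value, threshold, and name (`Comment`, `reputation`, `decide`, and so on) is a made-up assumption.

```python
from dataclasses import dataclass

# Placeholder values chosen for illustration only. The 5-point base for a
# blacklisted word comes from the post; everything else is hypothetical.
WORD_BASE_POINTS = 5.0
FREE_LINKS = 2                # hypothetical: links allowed before any penalty
POINTS_PER_EXTRA_LINK = 1.0   # hypothetical
MIN_SECONDS = 0.5             # hypothetical "small fraction of a second"
TOO_FAST_PENALTY = 3.0        # hypothetical
APPROVE_AT = 0.0              # hypothetical threshold
SPAM_BELOW = -10.0            # hypothetical threshold


@dataclass
class Comment:
    text: str
    link_count: int = 0
    seconds_to_submit: float = 60.0
    reputation: float = 0.0   # stand-in for IP/domain whitelist karma


def word_check(c, words=("denier", "denialist"), percent=100):
    """Lose percent% of the base points if any listed word appears."""
    hit = any(w in c.text.lower() for w in words)
    return -WORD_BASE_POINTS * percent / 100 if hit else 0.0


def link_check(c):
    """Penalize each link beyond the free allowance."""
    extra = max(0, c.link_count - FREE_LINKS)
    return -POINTS_PER_EXTRA_LINK * extra


def stopwatch_check(c):
    """Penalize submissions that arrive implausibly fast."""
    return -TOO_FAST_PENALTY if c.seconds_to_submit < MIN_SECONDS else 0.0


def reputation_check(c):
    """Positive karma earned from previously approved comments."""
    return c.reputation


CHECKS = [word_check, link_check, stopwatch_check, reputation_check]


def total_karma(c):
    """Add up the points from every check (the 'Anubis' step)."""
    return sum(check(c) for check in CHECKS)


def decide(total):
    """Map a karma total to an action using the hypothetical thresholds."""
    if total >= APPROVE_AT:
        return "approve"
    if total >= SPAM_BELOW:
        return "captcha"   # fairly bad, but not horrific
    return "spam"          # horrific: no captcha offered


newbie = Comment("You are a denier")
regular = Comment("You are a denier", reputation=8.0)
bot = Comment("buy now", link_count=12, seconds_to_submit=0.1)
for c in (newbie, regular, bot):
    print(total_karma(c), decide(total_karma(c)))
```

With these invented numbers, the same sentence is approved for a commenter with stored positive karma and held for a brand-new one, while a link-stuffed instant submission goes straight to spam.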

I also use the plugin add-in to send comment information to Akismet, which reports back whether it thinks the comment is “spam” or “ham”. If Akismet thinks you are “spam” you are dinged karma points. (Also, if I tell Akismet you were ‘spam’, Akismet logs that information and it might ding you at other WordPress blogs. So, oddly, if I tell Akismet you are a spammer, Tamino, Anthony, JeffId or other WP based bloggers may discover it thinks you are a spammer when you comment over there. However, Akismet is smart enough to wait until at least a few bloggers have told it you were a spammer before doing this to you. Because of this I do try to fish people out of spam if they land in there; SpamKarma then tells Akismet I fished you out. I also send duplicate comments to “trash” not “spam”.)

Why can some people submit comments that use the word “denialist”?

Notice that the way the system works, a person who gets positive karma based on their IP, their “author_URL” and various other tests can survive the negative 5 karma points for including the word “denier”. Notice also that I failed to include the plural versions. (D’oh!)

On the other hand, a new visitor who uses a blacklisted word in their first comment will almost certainly get moderated or sent to the spam bin.
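
On the missing plurals: a commenter below points out that the word-list entries carry regex types, which suggests the singular entry may already catch the plural. The sketch below shows that behavior in Python, assuming each entry is applied as an unanchored, case-insensitive search. That is an assumption about how SpamKarma (a PHP plugin) applies its patterns, not something confirmed here, and `ENTRIES` and `matches_word_list` are hypothetical names.

```python
import re

# Hypothetical stand-ins for the two word-list entries mentioned in the post.
ENTRIES = ["denier", "denialist"]

# Assumption: each entry is used as an unanchored, case-insensitive pattern.
PATTERNS = [re.compile(entry, re.IGNORECASE) for entry in ENTRIES]


def matches_word_list(text: str) -> bool:
    """Return True if any entry occurs anywhere in the text."""
    return any(p.search(text) for p in PATTERNS)


print(matches_word_list("Those Deniers are at it again"))  # True
print(matches_word_list("A room full of denialists"))      # True
print(matches_word_list("I deny everything"))              # False
print(bool(re.search("goose", "a flock of geese")))        # False
```

Under that assumption, a plural is caught only when it contains the singular letter for letter, which is why an irregular plural like “geese” slips past an entry for “goose”.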

Are you worried you might get hung up using a “bad” url?

Just in case you might accidentally include a bad url in a comment, here’s a descending list:

Blacklisted Domains

I bet none of you are worried you will accidentally include any of those domains in a comment.

If you scan to the right you will see that almost all of these were added by “sk2_blacklist_plugin”. These were added by SpamKarma after their comments were diagnosed as being spam. I have, on occasion, added urls manually. More frequently, I remove urls that happened to be included incorrectly. I removed goggle.com and youtube.com today.

What if you get moderated anyway?

Some of you have noticed you have been sent to spam despite doing nothing to anger spamkarma. Yep. This happens. Right now it is happening when I have some server glitches.

I am trying to track the cause of these glitches down. They were evidently due to my running scripts that drew a lot of memory–something I was doing on a personal blog I set up to tally my diet progress for myself. It has no real traffic, but anything to do with dieting gets lots of ‘bot traffic. I had thought that was a good place to test a script to let me auto-generate graphs inside a WordPress blog.

Well… it was a good place to test.

But the test seems to suggest that the memory required to load the php to create graphs exceeds the amount my hosting service thinks I should use. That means it would be a waste of time to try to create that script to run here. This is disappointing because I was going to try to autogenerate graphs here. Well, I’m obviously not going to do that!

Anyway, I no longer generate graphs at my diet blog, and now this blog seems to be throwing fewer server errors. Unfortunately, the number is not zero, so I may need to figure out whether another one of my scripts is a memory hog.

Suggestions

  • If you have suggestions on existing free php scripts that I could use to auto-generate graphs– let me know.
  • If you have suggestions on “spamwords”, let me know.
  • If you are getting “Server errors” and ending up in the spam bin, let me know.
  • If you are having other problems commenting and you are not “knittingintl/phinniethewoo/heerbommel”, let me know.

30 thoughts on “A different kind of blacklist: Spam.”

  1. I may have to learn how to post within a microsecond.

    If that’s a known criterion, you’d think spammers would just add a delay.

  2. If that’s a known criterion, you’d think spammers would just add a delay.

    It’s a cost/benefit balance. What happens is someone writes a simple script and then it gets passed around to many spammers. Many script writers didn’t think to include a delay; the test works for those ‘bots.

    The same thing holds for the javascript test and the encryption payload tests. There are spambots that can pass that test, but most don’t. That’s why the complicated SpamKarma pile-up of tests tends to work pretty well. Some stuff still sometimes gets through, but most doesn’t.

    Akismet catches most spam too– but it often can’t catch spam when a spammer is using a new host with new IP addresses. But since everyone on WordPress reports those, Akismet learns the current spam IP addresses really quickly. I suspect people who pre-screen with SpamKarma help Akismet catch stuff faster because SpamKarma diagnoses those guys while I’m sleeping and reports them to Akismet.

  3. This is the sort of black list we can all get behind! 😉

    Seriously though, I sometimes wish my blog was big enough that I had to be concerned with spam.

    But only sometimes.

    BTW Zeke, I swear that’s the first time I’ve ever read a comment here that has made me laugh! I’m used to getting far too deep in to far too serious conversations. So thanks, I needed that!

  4. Oh, a thought: what about sort of, um, phonetic spelling?

    Like “Den-I-Ya!”

    I guess you’d have to be really trying hard to get around the rules to try that, though.

  5. I don’t think you need to worry about adding the plurals of those words. In the word list, the type is e.g. “regex_black” implying that SpamKarma is using a regular expression checker, which should trigger on either the singular or plural variation. At least for those words, where the plural form contains the singular lexically. Now if you wanted to avoid the word “goose”…

    By the way, I’m mildly curious about the difference between types “regex_black” and “regex_content_black” in the word list. Any ideas?

  6. I don’t think you need to worry about adding the plurals of those words.

    You’re right; I shouldn’t have to.

    I’m mildly curious about the difference between types “regex_black” and “regex_content_black” in the word list.
    Well, I can tell “regex_content_black” finds things in the content of the post. “regex_black” does not. Maybe regex_black looks at the “author domain” only? That would be a very important functionality for spam because the “author domain” is the place spammers drop links and those often contain viagra, cialis, adipex etc. But my visitors don’t put words like “denier” in their author url.

    The admin panel on SpamKarma is not very explicit on this.

  7. Lucia.. Brief explanation of regex here-

    http://wp-plugins.net/doc/sk2/sk2-user-guide/sk2-modules/blacklist/

    “:*”’Regex Blacklist”’: Applies to the URL field, and can be thought of as text snippets that are not wanted on your blog. ”

    So as I understand it, one’s for catching words in URL’s, the other is for filtering inside message content, so you’re right I think about what it does to block author domains.

    Personally, I want to find myself on the ‘entities’ list, but then may find I’m sharing it with Kibo.

  8. Yes. I deny that the Earth is spherical. It is, in fact, an oblate spheroid.

    Actually, even that is only approximately true, since the surface isn’t perfectly smooth.

  9. Actually that’s just the deviation of the geoid (surface of constant “g”=gravitational equipotential at “sea level”). The actual topographic deviation from the geoid is -11 km (Mariana Trench) to +8 km (Everest)

    Compare this to the variation of the geoid from equator (6378 km) to pole (6357 km), which is 21 km, and you see them to be of the same order of magnitude. Calling the Earth an oblate spheroid is pretty much at the same level as the spherical cow approximation.

    (As an aside, when you look at the Earth in terms of its topography rather than gravitational potential, you find that actually the farthest point from the center of the Earth is not Everest but rather Chimborazo in South America.)

  11. It’s not that bad. Way to try too hard to say I’m wronger than I am. Yes, I’m wrong. But not extremely wrong.

    Hm…what if you consider the atmosphere as part of the Earth when it comes to its shape?

  12. How would George Carlin and his 7 words fare with the spam catcher? Not that I would try…..

  13. Andrew, of course I was just playing (I’m fond of the phrase “spherical cow”, can you tell?)

    It’d be interesting to see how the shape of the atmosphere (I’d use the tropopause for that) changes with longitude, latitude and season.

  14. Carrick (Comment#47667): I think that TOA is more normal, which in meteorology is where the pressure is 0.1 mb, roughly 70 kilometers on average. Curiously, for the radiation budget apparently 20 kilometers is used, and for spacecraft re-entry, 120 kilometers.

    http://mynasadata.larc.nasa.gov/glossary.php?&letter=T

    The weird thing is that in spite of the statement that it is 20 kilometers for the TOA for radiation budget data, NASA also here:

    http://earthobservatory.nasa.gov/IOTD/view.php?id=7373

    claims that it is defined as roughly 100 kilometers. I think the author of that article is confusing TOA with the Kármán Line, an arbitrary “boundary” where space “begins” according to Fédération Aéronautique Internationale.

  15. Trying to separate people making an actual contribution to the discourse from people spamming the discussion with worthless, self-interested noise.

    I’d say it’s exactly the same kind of blacklist.

  16. carrot eater, a question for you.

    Why is it that you don’t use the term “septic” to describe people who are skeptical of the AGW Consensus position at this blog, but you do at other places?

    Is it an accurate, helpful, and useful moniker? Or just harmless fun?

Comments are closed.