Improved Captcha: Thanks Readers.

I know some people don’t like the discussion of Captchas etc., so I’ve made Zeke’s post sticky; this post will show below it. Meanwhile, the discussion of captchas is helping me make a captcha that is probably not broken before it is even created.

I know some people don’t like Captchas, but the alternative to using something to slow down bots is to provide no method for humans to unban themselves. Lots of people provided interesting alternatives to captchas, and I’m not going to discuss each method. But right now, as a practical matter, I think a halfway decent captcha is the most practical choice: it balances my coding time, limited crackability and low server cpu and memory load, while requiring a ‘bot programmer to invest more than it’s worth to crack it. (Also: the unban page provides an email address for those who cannot solve the captcha for any reason, including but not limited to visual impairment.)

And for anyone who doubts the unbanning capability is worth it, it is: since the time I polled people on a previous sample Captcha, Carrick got banned and unbanned himself. The logs also suggest a few other people successfully navigated the script– though I don’t know if they were already banned. So, in the future, should you get banned, remember to find the blog on Google, read the cache and click the link to the unban page.

Meanwhile, I took much of the discussion of making good Captchas to heart; I think I made a better one. In this case, “better” means that I think it is both more easily readable and, likely, not pre-broken. That is to say, I don’t think the bots that already exist are likely to be roving around cracking my current captcha, though I admit I can’t be sure.

If breaking a captcha requires custom programming, then it works for the time being. That’s sufficient for things like spam control. In this particular case, the function of the captcha is to make it difficult for any existing roving bot to exhaust the 150 API call/hour budget Cloudflare allots me. This means its use is not high security– it is flood control!

Now, for those who want to know details about the new captcha, here are the steps I followed this time.

Based on the discussion, it appears OCR is often used as the final step to break a captcha. It occurred to me that it might be useful to find a font that can’t be interpreted by online OCR readers. So I hunted down some free fonts (carefully reading the licenses). I downloaded several and found 3 Dumb Fonts, which I thought looked like it might be hard for OCRs to read.

I uploaded that font and created an image of all letters and numbers I plan to base my captcha on:

I then uploaded that image to an online OCR. According to that OCR reader, the translation is:

SMOH/SIdW41WmiriceirDINISkill,A,CLAWE'N.STUVOKYN

I tested at a 2nd site which translated it into:
25%§@@§wF&%mv€yAF@@£JL@fl@;M1lR1P \§‘£?\‘Id%§7Z
Both translations are sufficiently far off that I decided this font is challenging to existing online OCRs.
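
(For anyone who wants to repeat this kind of check locally rather than through a web form, something like the following works, assuming Tesseract and its pytesseract wrapper are installed; the file name is just a stand-in for the alphabet image above.)

from PIL import Image
import pytesseract

# "3dumb_alphabet.png" is a hypothetical export of the alphabet image shown above
alphabet = Image.open("3dumb_alphabet.png")
print(pytesseract.image_to_string(alphabet))  # gibberish output suggests the font resists stock OCR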

Having read a bit about captcha breakers, it’s clear that a font being unreadable by OCRs is useful but not sufficient for a not-yet-broken captcha. A number of people who crack these things don’t use standard OCRs. What they do instead is find a popular captcha, download a bunch of its captchas and create a “library” containing images of the various letters in that particular captcha.

So, if motivated to break my captcha, a captcha breaker could visit, download a whole bunch of captchas and figure out what my lower case “a”, “b”, “c” and so on look like. Once they have a library, when presented with a captcha they wish to solve, they compare the letters to those in their library and “bingo!”.

To thwart this, we need to do at least some of the following:

  1. Add some sort of random noise.
  2. Make it difficult to separate one letter from the other by adding noise that runs across letters, varying spacing between letters, rotating individual letters or distorting individual letters.
  3. Distort the entire string.

I’d previously done all the above except distortion. My reading suggested distortion, overlapping letters and drawing lines that connect letters are supposed to be more effective than rotating letters, changing elevation of the baseline of the letters or adding pixel noise.

Distortion appears to be especially good at preventing people from building up nice libraries of characters to make custom captcha breakers. I decided it would be a very good idea to distort, so I found a script online that lets me distort the string slightly. This resulted in:


The translation returned by the first OCR service became:

L'Agwela-44-*mihitKraT-4>aact*,r,VialL4,0{,IL..,,T1t),R0-,;VY,
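
The distortion script I found isn’t reproduced here. Just to illustrate the idea, a slight distortion of this kind can be done along the following lines (a sketch in Python/PIL, not the actual code, which is PHP; the amplitude and wavelength numbers are arbitrary):

import math
import random
from PIL import Image

def wave_distort(img, amplitude=3, wavelength=40):
    # shift each column of pixels up or down along a random-phase sine wave
    w, h = img.size
    out = Image.new(img.mode, (w, h), "white")
    src, dst = img.load(), out.load()
    phase = random.uniform(0, 2 * math.pi)
    for x in range(w):
        shift = int(round(amplitude * math.sin(2 * math.pi * x / wavelength + phase)))
        for y in range(h):
            sy = y + shift
            if 0 <= sy < h:
                dst[x, y] = src[x, sy]
    return out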

After doing these two checks, I then tweaked my previous captcha script to do things in this order (a rough sketch follows the list):

  1. Create a word string using the new font. (The previous font was breakable.)
  2. Write the image of each character, varying the elevation of the baseline, the spacing between letters and the rotation of each character.
  3. Superimpose little “ellipses” of noise, in both the background and text colors, on the characters.
  4. Distort the now-noisy image. (This is new.)
  5. Draw some thin broken lines over the characters.
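
The actual script is PHP, but as a rough sketch of those five steps (in Python/PIL for illustration, reusing the wave_distort idea sketched above; the font file name, sizes and colors are all stand-ins):

import random
from PIL import Image, ImageDraw, ImageFont

BG, INK = "#f2ead8", "#27408b"      # stand-in colors: vanilla background, blue text

def make_captcha(text, font_path="3dumb.ttf"):
    font = ImageFont.truetype(font_path, 48)                 # 1. the new font
    img = Image.new("RGB", (60 * len(text) + 30, 90), BG)

    # 2. write each character with its own baseline, spacing and rotation
    x = 10
    for ch in text:
        glyph = Image.new("RGBA", (70, 80), (0, 0, 0, 0))
        ImageDraw.Draw(glyph).text((10, 10), ch, font=font, fill=INK)
        glyph = glyph.rotate(random.randint(-15, 15), expand=True)
        img.paste(glyph, (x, random.randint(0, 15)), glyph)
        x += random.randint(35, 50)

    # 3. little "ellipses" of noise in both background and text color
    draw = ImageDraw.Draw(img)
    for _ in range(40):
        cx, cy = random.randint(0, img.size[0]), random.randint(0, img.size[1])
        r = random.randint(1, 2)
        draw.ellipse((cx - r, cy - r, cx + r, cy + r), fill=random.choice([BG, INK]))

    # 4. distort the noisy image (the new step), e.g. with the wave_distort sketch above
    img = wave_distort(img)

    # 5. thin broken lines over the characters
    draw = ImageDraw.Draw(img)
    for _ in range(3):
        y = random.randint(15, img.size[1] - 15)
        for x0 in range(0, img.size[0], 14):                  # the gaps keep the lines "broken"
            draw.line((x0, y + random.randint(-3, 3), x0 + 9, y + random.randint(-3, 3)),
                      fill=INK, width=1)
    return img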

Ok… then after testing, I decided I needed a different color scheme. So I tweaked and rechecked that the OCR still couldn’t solve the undistorted font. The new captchas look like this:

I find these more readable than the previous captchas. I could change to black and white if readers find that more readable. Since the current font doesn’t seem to be widely used, isn’t readily broken by the two online OCR readers, and I’ve added noise, varied the inclination of the lines and distorted the string, I suspect this captcha is not broken yet. I also doubt the “prize” of getting temporarily unbanned is going to motivate anyone to write a custom bot to break my custom captcha.

That said: some people who specialize in breaking captchas might try just to show they can do it. Likely as not, that sort of person will succeed. Of course, that’s not a huge problem for me. If, after thwarting my captcha, they return to the blog and try to post comments while pretending to be the Googlebot, they’ll just get banned again. If they try a variety of XSS attacks, they get banned again. Oh. Well. (BTW: I am logging unban activity.)

All in all, I think this captcha is probably a pretty good captcha. For the time being. Possibly forever.

Thank you all for the various tips and discussion that helped point me in the direction of finding discussions of how captchas are broken. 🙂

95 thoughts on “Improved Captcha: Thanks Readers.”

  1. I don’t know if this link will work, but if it does, can anyone tell me what is under the noise?

    I think it’s 98gV*J but I really can’t tell what the * is. Perhaps different monitors handle the noise differently.

    http://theknittingfiend.com/unban/CaptchaSecurityImages.php?decrypt=1&code=KRL35d8NaF%2Fs1ydhTajBi0Io9cSYKFEwo%2FzErdKOgtA%3D&now=1332418773&angle=10

    This next one seems to end with TZ2, but I cannot guess the start.

    http://theknittingfiend.com/unban/CaptchaSecurityImages.php?decrypt=1&code=ObkXmG0zcrNEbwHyem87zFQbncz4zxmMMKZ%2FFv%2BMmoA%3D&now=1332419069&angle=10

    (it seems the images disappear after a few minutes, so I guess the above didn’t work).

  2. Steveta_uk–
    There are no “*”. Anything that seems to be a “*” is noise. The Captchas are numbers and letters only.

    WordPress seems to break the links turning the ‘&’s into longer stuff. Also, the script doesn’t create (or display) images that are more than 4 minutes old. And even if it did, the images would be slightly different each time. So if you want to show an example of one that was impossible, you have to take a screen shot.

    In the version of the first one I created (by commenting out the time limit and fixing the link) the letters are 98gVnJ. Every now and then I can’t read one– then I refresh and create a new one. Also, on the theory that people can usually reason out which letter they got wrong, I let people go back and retry 3 times.

    Still, if the * was confusing, I’ll go tone down the pixel noise. (I was thinking that with this font, the “noise” tends to be darker than the letters. So I do think I need to do something with that. I think I know what to do. )

  3. Humm…
    An experienced PhD engineer is worth at least $50 per hour. How many hours have the bots cost you? 😉

  4. SteveF–
    If you look at it that way, a lot. But that’s not necessarily the way to look at it. After all, if you look at everything that way, watching TV costs you money. So does time spent on walks in the park etc. 🙂

  5. Lucia,
    “watching TV costs you money”
    One of the reasons I seldom watch TV! 🙂
    I guess then that you find defeating the bots is at least mildly entertaining!

  6. SteveF–
    Yes. Now that I’m actually mostly banning what I want to ban and can unban, I think I can turn this into a plugin too. I am aware of other customers (mostly bloggers) at Dreamhost (and other places) perplexed that they can’t keep their blogs/sites etc. up and running and who don’t understand why. Running access to Dreamhost (or another hosting site) through Cloudflare and autobanning bots for a period of time works. Using .htaccess, running a php script etc. does not.

    For this to be effective, I needed to
    a) Get it to work here. Working means both banning what I want to ban while rarely banning people and observing that it made a difference. (It has!)
    b) Make it possible for those people who get banned to get themselves unbanned. Otherwise, their only method of getting unbanned is to guess my email and contact me. The latter method has all sorts of flaws. Some are related to the fact that I’m not here 100% of the time to unban people, that doing it is a 3 minute interruption that happens at unpredictable times, and that people often don’t provide useful information. (Specifically, many don’t tell me their IP. They think I can guess it or spend the time to read it from their email headers etc.)
    c) Make sure it works.

    Now… I need to make this a plugin. If I can also persuade Cloudflare to let me post the link to the unban page on their ban page, things will be perfect. But I don’t have control of that.

  7. I was curious how effective this CAPTCHA is, so I decided to test it a little. After 15 minutes, images went from looking like this
    to looking like this.

    What that shows is the noise is basically meaningless. In fact, the lines you added ought to have been completely removed by my code. I can see numerous pixels the approach I’m using should remove, but for some reason doesn’t. I probably just made a typo in my code somewhere (I hate Python). Regardless, once that’s fixed, any of the pixels you see floating around on their own, as well as most of what’s left of the lines, will be removed.
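
    (The code itself isn’t worth posting, but as a simplified sketch, not what I actually ran, the clean-up amounts to something like this: threshold to black/white, then drop any dark pixel with too few dark neighbours.)

    from PIL import Image

    def remove_specks(path, threshold=128, min_neighbours=2):
        # keep a dark pixel only if enough of its 3x3 neighbours are also dark
        img = Image.open(path).convert("L")
        w, h = img.size
        px = img.load()
        dark = [[1 if px[x, y] < threshold else 0 for y in range(h)] for x in range(w)]
        out = Image.new("L", (w, h), 255)
        opx = out.load()
        for x in range(w):
            for y in range(h):
                if not dark[x][y]:
                    continue
                neighbours = sum(dark[nx][ny]
                                 for nx in range(max(0, x - 1), min(w, x + 2))
                                 for ny in range(max(0, y - 1), min(h, y + 2))) - 1
                if neighbours >= min_neighbours:
                    opx[x, y] = 0
        return out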

    Mind you, this is just one step in the process of breaking the CAPTCHA. Still, it only took me 15 minutes to write that code from scratch. I don’t have too high of hopes.

    I’m pretty sure the other one was better.

  8. The battle lines of CAPTCHA GOT CHA appear to have been drawn. I am counting on you, Lucia, to kick Brandon’s butt before this is over.

  9. Brandon–
    The first image seems to be in a private image area. But I assume the first is the blue on vanilla background.

    If the lines are meaningless to a Captcha breaker, I’ll remove them. There is no point to a feature that makes it harder for a human but doesn’t thwart a bot.

    Oddly, the dots are staying because
    a) they involve practically zero computational overhead
    b) some very near the letters are by-products of the font which, instead of creating pixels that are all exactly the same color, has a range of pixel colors. So getting rid of them would require more cpu and lines of code, not less
    and
    c) I don’t think the few I add really hurt human readability in the original. (The ones far from the letters are added during the “distortion” step. These are just slightly darker than the background. My thought was that depending on the cut-off you use to take out the background, a few of those added dots will end up near the letters and, possibly, by being confused with the ones associated with the “real” letter, might result in a slight decorrelation for the “image” of the letter. Does this help? Well.. someone needs to break the letter for me to know!)

    Still, even if they don’t help, I’ll leave those all in.

    On the lines: I suspect that generally using heavier lines might help. The problem is I’m down to 1 px wide, and that’s likely pointless. But with this font, heavy lines reduce readability a lot. For other fonts I could use heavy lines.

    I’m thinking what could really help is, inside the distortion loop, to recognize when a “background” pixel is near a bunch of “font color” pixels and tweak some fraction of those background color pixels to turn into “font color”. Because they slam up against font-color pixels, they would be hard to recognize as “noise” during screening and consequently might be kept. Then, when the breaker moves on to the “recognize” stage, there is a loss of correlation during the later comparison process.

    As this is all about loss of correlation… that seems useful.

    What do you think of the general idea? Because the implementation wouldn’t be very hard. I just
    a) see if a pixel is “background color” in the undistorted image. If the answer is “yes”,
    b) look at the neighboring pixels in the undistorted image. If a sufficient number are “font color”, sometimes make the corresponding pixel in the distorted image “font” color. Sometimes … don’t… !

    With this particular font you could probably drive the correlation between “noisy J” and “noiseless J” down to 80% with little impact on readability!
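
    (Something along these lines, as a sketch only; the real script is PHP and the colors and thresholds here are stand-ins.)

    import random
    from PIL import Image

    def smudge_edges(img, font_rgb, bg_rgb, needed=3, chance=0.3):
        # (a) find background pixels; (b) if enough neighbours are font colored,
        # sometimes flip them to font color... and sometimes don't
        w, h = img.size
        px = img.load()
        out = img.copy()
        opx = out.load()
        for x in range(1, w - 1):
            for y in range(1, h - 1):
                if px[x, y] != bg_rgb:
                    continue
                near_font = sum(px[nx, ny] == font_rgb
                                for nx in (x - 1, x, x + 1)
                                for ny in (y - 1, y, y + 1))
                if near_font >= needed and random.random() < chance:
                    opx[x, y] = font_rgb
        return out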

  10. Hmm… seems to me I might also know how to make the lines more difficult to remove without reducing readability. It seems to me you probably converted to grey scale. Then you threw away everything below some particular level. That kept some of the very “light-grey” pixels associated with the letters, keeping them as “letter”, and turned all retained pixels to black.

    So… if my lines include parallel “light grey” pixels, your cleaned up image would show a darker line than my original!!!

    You know… if we both do this, I can probably make a pretty good captcha!

  11. Kenneth–
    I expect Brandon to win. It’s true that I can make a captcha he can’t decipher. All I need to do is turn the distortion up to “ridiculously high”. Or use a huge number of thick curved lines.

    The difficulty is that I need to make a captcha humans can still read!

  12. Oh… I also have other font choices. But I think if Brandon is game, it’s best to see if he can beat this one. Meanwhile, if he beats it, I can see if I can thwart him.

    In the process we learn which tweaks are robust.

    (I’m going to go add the parallel lines in a lighter color. 🙂 )

  13. lucia:

    The first image seems to be in a private image area. But I assume the first is the blue on vanilla background.

    Yeah, sorry about that. I copied the wrong link. The entire album is private, but you can access any image on it if you have a direct link to it. This should work.

    What do you think of the general idea? Because the implementation wouldn’t be very hard. I just

    I think I’d have to see it/look into the issue before I was sure. It would definitely make the noise harder to filter out, but it’s hard for me to say how much of an impact it would have on top of the other distortions you use. I suspect some approaches to breaking would be affected more than others.

    Hmm… seems to me I might also know how to make the lines more difficult to remove without reducing readability. It seems to me you probably converted to grey scale. Then you threw away everything below some particular level. That kept some of the very “light-grey” pixels associated with the letters, keeping them as “letter”, and turned all retained pixels to black.

    Turning it to grey scale isn’t actually necessary as the same filtering can be done regardless of the color scale, but you’re right about the idea. As for turning everything to black, that’s just a symptom of using binary analysis. You can design a system to break a CAPTCHA which considers many colors/shades per letter, but that’s beyond anything I want to even think about designing.

  14. Kenneth Fritsch

    The battle lines of CAPTCHA GOT CHA appear to have been drawn. I am counting on you, Lucia, to kick Brandon’s butt before this is over.

    This is akin to rooting for a guy with a tractor to move a pile of dirt before a guy with a shovel moves his. Sure, we both might have a chance, but it isn’t anything resembling an even competition.

    lucia:

    I expect Brandon to win.

    I wouldn’t. With the appropriate steps, you can design a CAPTCHA I’ll never break. Part of this is CAPTCHAs can be made unbreakable (in the same sense encryption can be). Another part is the effort to break a CAPTCHA far exceeds the effort to make one. Even if I could win, I’d likely have to spend several orders of magnitude more effort.

    Incidentally, what would even count as a “win”? How high a success rate is needed? How high a failure rate can be tolerated (wrong guesses are different than simply refreshing)?

    But I think if Brandon is game, it’s best to see if he can beat this one.

    I can already tell you I could beat this one. What I can’t tell you is how long it would take me to manage it.

  15. lucia, Brandon,

    I know nothing about this subject and your comments are very interesting to me, interesting for me to try to tease out meanings from the words you post.

    My low value thought, .02 quagloos. Can you invert or reverse the text characters of the quire? Humans don’t have a problem reading with mirrors, or headstands. Can OCRs do that?

  16. jim, you can do that, but it doesn’t make things too much more difficult for a program to recognize characters. It can simply run the same comparisons as before, but flipped/inverted. It requires the program be coded to do so, and it increases the chance of false positives, but it isn’t a huge obstacle.

  17. jim, bots basically know whatever it is they are programmed to know. Something like you described would defeat pretty much all programs one might have already made to beat CAPTCHAs, but it’d be very easy to write code to change that. All one has to do is make a list of the letters in order, then tell the program to shift what it reads by however much.

    As a general rule, the simpler (not easier) a task, the better a bot will be at doing it. If you want to beat a bot, you’ll want complexity.

  18. “If you want to beat a bot, you’ll want complexity.”

    Brandon, that is what I was thinking of… something that humans can do without thinking. 😉

    Secondarily, what climate blogs need is a GOTCHA that can filter out the dragon slayers. Knee jerk questions that the dragon spammers can’t resist…

  19. jim, what you describe is definitely the right sort of approach. It just takes more than that to beat a bot.

  20. Brandon, can you, for the CAPTCHA graphic, superimpose Tahoma text and a coarse halftone filter? Apply a very coarse halftone to a raster text? So the result is a Rorschach text, spots and dots, a fill-in-the-blanks text that OCR can’t reliably decrypt?

    lucia, I compliment you for using Tahoma text in this web site. Tahoma, CA, 96142, is one of the beautiful places in America! Sugar Pine Point State Park and Rubicon.

  21. I wasn’t intending to actually try to break anything here; I was just curious about how effective the noise was. However, I just had an idea which seems like it might be very effective, and the only way to test it would be to actually try to break the CAPTCHA.

    I hate scope creep.

  22. jim, I don’t normally work with images, so I may be missing some relevant detail in the terminology, but I don’t think what you describe would be effective. There are many tools for interpolation and extrapolation, but they shouldn’t even be necessary. As long as enough of the character is on the screen, you can match a pattern to it.

    If you want to make pattern matching hard, you don’t just remove parts. A gap is simply a lack of information. Unless you remove a lot, you won’t create situations where two or more things get matched (and then the reader will have as much trouble). It’s only when you start adding things to the image that pattern matching becomes difficult.

  23. Brandon, I was thinking of the word puzzles where enough of each of the letters is missing that each letter is ambiguous in itself, but the obscured letters, in order and context, spell a word, so the mind fills in the blanks. I was just trying to think of a code-able version of that puzzle. A graphic ‘filter’ applied to the coded text…

  24. It sounds like I was thinking the same thing as you then. In that case, no, there isn’t really a way to do that where a human can read it easily while a bot cannot. The only way to make it difficult to beat a bot is to make it so difficult a human would have trouble too. The problem there is you’re just subtracting from the image, and that’s a simple operation.

    And I can’t take credit for the phrase “scope creep.” It’s a common phrase when dealing with projects.

  25. “And I can’t take credit for the phrase “scope creep.” It’s a common phrase when dealing with projects.”

    Fallen idols and feets of clay. Disillusionment. 😉

  26. “In that case, no, there isn’t really a way to do that where a human can read it easily while a bot cannot.”

    Then all of us who are meat, meat through-and-through, thinking sentient meat, have been defeated by the Chinese Room! Drat and dang it! Chinese (Room) bots!

    (sentient meat was a counter argument to John Searle)

  27. Damn the Chinese Room bots! Damn the Chinese Room bots!

    makes me think of Gary Larson -> Cow Poetry -> Distant Hills -> “Damn the electric fence! Damn the electric fence!”

  28. jim– It’s always possible to make something a bot can’t decipher. If you lose enough information and it’s noisy enough, a bot can’t read it.

    But to work as a Captcha, it has to be human readable. But worse than that, many humans will grouse if the solution involves any effort on their part. Have you noticed that my comments don’t have visible captchas? I use two official spam filters and two .htaccess rules.

    The htaccess rules are:
    1) block POST to wp-comments-post.php when the referrer is clearly deficient.
    2) block things pretending to be google-bot from commenting.

    I also ban any IP exhibiting this at cloudflare. The ban lasts at least 7 days. (If that IP is in a blacklist range it lasts forever.)

    If I go through logs these two rules by themselves block about 100 requests a day — and it’s that low because the IPs get banned at cloudflare. Prior to creating the .htaccess rule, about 20 a day were sneaking through to the spam bin– so they were hitting.

    Those two rules are not enough. So, I also use a filter that creates a hidden input field in each comment form. Some bots are badly programmed and fill that field. If they do, they are marked as spam. When first introduced, that filter used to catch roughly 50 spams a day. But lots of bloggers use that trick now. It was easy for a programmer to rewrite the bot not to fill in the fake field. So it doesn’t really work anymore. Sigh….

    But if you use WordPress software you can use the Akismet comment filter for free. So I do, and that catches some spam every day. But really, the .htaccess rules and banning at Cloudflare have been very effective at reducing spam that appears on the admin side. (You would never see that. Akismet is good at diagnosing spam that is submitted. But the rules prevent it from being submitted in the first place.)

    Unfortunately, I can’t easily use either at my unban page. And no matter what, I need a custom script because the unban page does a custom thing. So there is no getting around programming something.

  29. Brandon

    Incidentally, what would even count as a “win”? How high a success rate is needed? How high a failure rate can be tolerated (wrong guesses are different than simply refreshing)?

    I can tolerate a fairly high failure rate because a bot getting through is not “full failure”. Full failure is more like:

    1 the number of “clickable unban links actually clicked by bots” exceeds 25 per hour or

    2 the number of actual bot IPS unbanned per hour exceeds 5 or

    3 humans can’t succeed.

    (1) and (2) differ because, bots being what they are, it’s possible that a bot would just arrive and enter random IPs that are not banned. The Cloudflare API doesn’t let me check if an IP is actually banned– so the bot could create a clickable link to unban an IP that is not banned. Cloudflare limits the number of API calls per hour whether or not the API call did anything meaningful. A call is a call. So a bot that created a whole bunch of stupid pointless links and clicked them could “shut down” both auto-banning and unbanning. I don’t want this to happen more than 25 times/hour.

    Meanwhile, I prefer actual bots that were banned for being bad to not be able to unban themselves. So, I would prefer fewer than 5/hour.

    The captcha is not the only provision for preventing these failures. In addition to the captcha, a single IP won’t be granted more than 3 unban links a day, and a single IP can’t build up a whole bunch of unclicked links to be clicked later from a “queue”. I also have a perma-ban list of IPs for servers known to host bots.
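
    (The unban page is PHP, but the 3-links-a-day bookkeeping is nothing fancy; stripped of the details it amounts to something like this illustrative sketch.)

    import time

    MAX_LINKS_PER_DAY = 3
    _issued = {}                      # ip -> timestamps of links granted in the last day

    def may_issue_unban_link(ip, now=None):
        now = now if now is not None else time.time()
        recent = [t for t in _issued.get(ip, []) if now - t < 86400]
        if len(recent) >= MAX_LINKS_PER_DAY:
            return False              # this IP already used up its links for today
        recent.append(now)
        _issued[ip] = recent
        return True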

    But humans arrive, focus on the captcha and tend to think “that’s THE security measure”. They also tend to think that I’m counting on it being 100% effective. Neither is true.

    I can already tell you I could beat this one. What I can’t tell you is how long it would take me to manage it.

    I suspect the need to spend 15 minutes might be enough to make the captcha sufficiently effective until such time as I observe a problem!

    But I am going to tweak the lines in the way I described.

    If you zoom in on the full color captcha, you might understand what I mean about intermediate colors near the letters. I think adding some to the line would make the line “look more” like the strokes in the font. The more the lines “looks like” the strokes that convey information the harder it will be for you to eliminate it while retaining information. So… I’m going to go add…

  30. “This is akin to rooting for a guy with a tractor to move a pile of dirt before a guy with a shovel moves his. Sure, we both might have a chance, but it isn’t anything resembling an even competition.”

    Brandon, I am a Chicago Cubs fan and that might explain my rooting choices. Actually the battle as it is proceeding is not fair, as you get to devise a method to break Lucia’s CAPTCHA after the fact. Let’s see you devise a breaking code first. In a fair battle I say Lucia kicks your butt.

    Seriously, in a real life situation could not the CAPTCHA deviser merely have a series of CAPTCHA designs that are randomly used? Or Brandon, can you devise a CAPTCHA breaker that applies generally? I am late to this thread, so if my questions were covered elsewhere just direct me there.

  31. I have failed reading the CAPTCHA before, but as I recall I was given a second and third chance and never ultimately failed. Do the multiple chances at the CAPTCHA help filter bots from humans?

  32. I can report that “the system” thwarted a bot at 05:17:01 this morning.

    Here’s the data. The bot visited:
    46.201.254.164 - - [23/Mar/2012:05:17:01 -0700] "GET /unban/unbanRequest.php HTTP/1.1" 200 2060 "http://rankexploits.com/musings/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

    $ax += (rmatch($hoster,"pool.ukrtel.net","ukrtel, forum spambots. "));
    That IP is from pool.ukrtel.net, a recognized forum spam bot range. It was probably looking for the code that matches a forum comment submission, didn't find it and left.

    Evidence for that theory is that the Ukrainian bot did nothing. In fact, the bot didn't even load the captcha images -- a human visitor would have. Due to the economics of spamming, lots of bots don't load images at all.... They load only what their script wants to process and then leave. Each bot will do something different. But this bot decided the page was boring and left.

  33. “But this bot decided the page was boring and left.”

    Damn, Lucia, was that right after I posted.

  34. Do the multiple chances at the CAPTCHA help filter bots from humans?

    I don’t know. But as a human, I really appreciate being able to go back, check and fix a typo. You don’t get an infinite number of tries.

    I don’t know how permitting a small number of retries affects bots. I’ll watch the logs. What I do figure is that bots might not know if they are way off nor which letter was “hard”. When I’m off, I generally mis-guessed a letter I knew I was guessing.

  35. Kenneth Fritsch:

    Brandon, I am a Chicago Cubs fan and that might explain my rooting choices. Actually the battle as it is proceeding is not fair, as you get to devise a method to break Lucia’s CAPTCHA after the fact. Let’s see you devise a breaking code first. In a fair battle I say Lucia kicks your butt.

    That’s like saying a burglar has an unfair advantage because the security system is installed before he plans how to break it. What, should a safe cracker know how he is going to crack a safe before he even knows which one the owner picked?

    Seriously, in a real life situation could not the CAPTCHA deviser merely have a series of CAPTCHA designs that are randomly used? Or Brandon, can you devise a CAPTCHA breaker that applies generally? I am late to this thread, so if my questions were covered elsewhere just direct me there.

    If you make a series of CAPTCHAs, it doesn’t change much. It’d be pretty easy to tell which CAPTCHA is being used if you use ones that are notably different. Given that, all the bot has to do is reload the page until it reaches a CAPTCHA it knows how to beat. This means the weakest CAPTCHA you use is all they have to beat. And if all your CAPTCHAs are equally secure, why bother having more than one?

  36. By the way, one way to make sure I can’t break the CAPTCHA is to not let me access the page. Even if that wasn’t intentional lucia.

  37. Are you unable to access, Brandon? One visitor sent a note that suggested Dreamhost was down a while. If so, for that period no one could access. But if it’s something else, I want to know. (Some IPs are banned– and I did add to the list. I don’t think it should include you… but.. I’ve been known to commit typos. I could go look at the access logs and see.)

    Kenneth– But I get to tweak my captchas as Brandon reveals what he does. So it seems fair enough to me.

    This means the weakest CAPTCHA you use is all they have to beat.

    Yep. I thought about that. It’s not necessarily bad to cycle through different captchas, but all would have to be difficult to solve. Also, I think that for humans, suddenly changing fonts makes things more difficult. I know that after reading the font I’ve chosen many times, I decipher it more easily than before.

  38. I’m not having any trouble like that. I was just surprised when I went to check the changes you made because the page didn’t load. I had just loaded this page a few minutes before, so the timing of your site going down threw me off.

    By the way, you being able to make changes doesn’t make things fair. It gives you a massive advantage. If you make a change to your CAPTCHA, any images I’ve downloaded are basically useless. A few lines of code for you would involve rebuilding entire “databases” of images for me.

  39. Brandon–
    Hmmm… Yes.

    I did add a parallel line in a “mid” color above each of the noise lines. In the previous captcha, those lines had been darker. But this font is so thin that thick lines made things unreadable. Oh… and I did something to concentrate pixel noise near the characters but not all the way out in the background.

    I know you’re going to get rid of isolated dark pixels next — and that has to involve seeing that nothing is “near” a pixel.

    I didn’t change anything about how the letters or digits are laid down.

    Since I actually WANT to see if you can break it– and after you break it, see if we can tweak– I obviously have to solve the filling-the-database problem.

    But if it helps, I can make it easy for you to load your database perfectly automatically! If you look at the settings in the uri, you’ll see decrypt=1. Turn that off and just send the code you want to see. (This… of course… means that as a “real” method, the method of creating captchas can’t permit the person developing a method to break the captcha to turn off decryption.)

  40. Bleh. Offhand I don’t know any methods for removing lines which will work with the angles your lines can wind up having. I’ll have to think about that today. Removing lines won’t be necessary to break the CAPTCHA (especially not with this font), but it should greatly increase the success rate (clean images are especially helpful to “train” against).

    But today is the first day I haven’t had a fever in a while, so I’m going to go enjoy the outside for now.

  41. Brandon—
    Why is this font easier? Obviously, I can change it– I found others that online OCR readers can’t read yet. But this was one of them, while the previous font was immediately read by OCR.

    I won’t change this one while you are breaking it. I think having you break it is kind of fun. (That’s also why I’ll keep it so you can easily fill your database with samples with known letters!)

    Get over your fever and enjoy the outdoors.

  42. BTW: I’m glad to learn my choice of angles was good!
    (What I’d like is to figure out which simple choices makes things harder or easier to break. )

  43. Lucia,

    It seems to me that the edges of your characters have excellent contrast. My naive mind supposes that making things fuzzy would make it harder for the BrandonBot to discern where the edge is. Since you’re posting the Captcha image as a JPEG anyway, what might you gain by applying a random quality value to the image, resulting in distortion along high contrast edges?

    You know, since it’s a sick day I can do the work myself and get back to you. Follow my own rabbit trail, so to speak.

  44. Earle–
    Every image processing step involves some cpu, memory and special coding. So, I want to know for sure what features make a font “easy” or “hard” to OCR before coding to do things like applying a random quality to the image.

    You know, since it’s a sick day I can do the work myself and get back to you.

    Do you mean you are going to get back and figure out whether excellent contrast actually makes the font bot friendly? Or something else?

    The fonts are already slightly distorted, their positions are slightly random, some overlap a little and the angle varies.

  45. Lucia,
    You have an interest in captchas that has a practical application. Eventually character based captchas will be broken by elaborate OCR. I’ve guessed that one that asks humans to distinguish visible features of a picture would be harder for machines.

    Imagine a picture with a lush jungle background that shows some number of monkeys, babies, cartoon faces, dogs, cats … Ask the human to count the number of e.g. babies. I’d guess the more the spatial frequencies of the bkgd matched the spatial frequencies of the faces, the harder it would be for a machine to pick out the faces. I don’t think any picture of monkeys and babies would pose a problem for people with reasonable visual acuity.
    Regards,
    Bill Drissel
    Grand Prairie, TX

  46. Bill–

    Imagine a picture with a lush jungle background that shows some number of monkeys, babies, cartoon faces, dogs, cats … Ask the human to count the number of e.g. babies.

    I’m imagining a human asked to count the number of babies in an image containing more than 10 figures. I imagine that human will leave the site.

    It’s true that a motivated human would likely do a better job on this if you presented a bot and a human a fresh picture neither has ever “seen”. Still, I bet such a system would be very frustrating for people and, as a practical matter, easy for bots.

    First: No human wants to count more than 10 babies just to post a comment. None.

    So, whatever the question, a bot can just randomly enter 0-10 and get spam into the system 10% of the time. If it’s wrong, it will just try again. This is nothing to the bot. If the “count babies” system was the only system used to filter spam, a forum or blog would be overwhelmed by spam in no time and human visitors would be just as grouchy as if they had to read most captchas.

    Next: How many pictures are you going to place in your database? If you only have 10 pictures, 1 for each of the answers 1-10, the person programming the bot just stores each image in a database along with the correct answer for that image.

    When presented with an image from the captcha, the bot just compares pixel by pixel. Then it looks up the answer, fills it in. The success rate for the bot will be 100%. (The bot could do the same thing with little baby faces that might be superimposed in random locations on the lush jungle background.)

    The bot could easily be programmed to download and “learn” a huge number of images and break that captcha pretty much all the time.
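
    (In rough Python terms, the whole “attack” is about this complicated; illustrative only, since any real bot would be more elaborate.)

    import hashlib

    answers = {}    # hash of the raw image bytes -> number of babies in that picture

    def learn(image_bytes, count):
        answers[hashlib.sha1(image_bytes).hexdigest()] = count

    def solve(image_bytes):
        # exact pixel-for-pixel match against everything the bot has already "learned"
        return answers.get(hashlib.sha1(image_bytes).hexdigest())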

    It’s true that bots won’t be interpreting the question or locating the answer using human algorithms. But all you need is to recognize that a bot can be programmed to break a particular system. If something about the images involves a finite number of possibilities, the bot will break it.

  47. Fighting spotty ISP service at the moment, sending this from my phone.

    I’m testing OCR interpretability of a text image at various quality settings and will report back my results. It’s just a tad difficult to use the online OCR when your system isn’t online! 😉

  48. I found the problem in my code from before. In one of the spots where I had meant to type 230, I actually typed 30. Once fixed, the output I got was this. Some signal got lost in my filtering, but as you can see, the letters are almost completely present. There would certainly be no problem getting a program to read them.

    Now then, that was with 28 lines of code. That’s without code for segmentation and the actual OCR step, but as you can imagine, it means the noise was practically useless. Fortunately,* the new lines are much harder to remove without taking out chunks of the letters.

    As for this font, the reason it is easier to read than others is that the letters in it have a far stronger signal than the letters in most fonts. For example, the difference between an E and an F in most fonts is a few pixels. In this font, the difference is much larger. The increased signal strength means it’s harder for noise to disrupt the signal, and thus a bot is likely to have an easier time.

    *That is, for anyone but me. If not for the change in the lines, I’d be able to write the code for segmentation right now.

  49. Brandon–
    Interesting. Oddly, this font did not translate well using the online OCR, but the font I used previously did. (There are other fonts that aren’t recognized by the online OCR, so I would have other choices. But clearly, merely “isn’t translatable by available online OCR” shouldn’t be the top priority. I knew it wasn’t the only criterion– since the step where you fill a database with the letters means the font doesn’t have to be recognized by current OCR. But still, I *thought* it seemed better than something that is already in Tesseract!)

  50. My first test of JPEG compression as a signal distorter was using Lucia’s first image in the post. After a moment’s thought, I realised that I can’t test my hypothesis when the OCR programs can’t even read the undistorted text. So I went about finding a captcha image that the OCR web sites could at least have partial success in reading.

    I went to http://www.captchacreator.com/v-customize.html# and created this image, saved in the lossless PNG format:

    https://picasaweb.google.com/114220722563604051694/March262012#5724297283724618754
    (hopefully it is readable by someone other than me)

    The text in the image is “UFTNKX3H”. The OCR scanner at http://www.newocr.com/ interpreted the PNG file as “0 FT MK 76%”.

    Using GIMP 2.6.12, I saved the image at differing quality settings and tested them at http://www.newocr.com/. The results are:

    Quality 50 – “0 FT MK KM”
    Quality 25 – “0 FT MK KM”
    Quality 10 – “0 FT MK 7<3\*r"
    Quality 5 – "Error! Text can not be recognized."

    The quality 5 image is at: https://picasaweb.google.com/114220722563604051694/March262012#5724297285229867186

    So, application of the JPEG quality setting distorts the text image as compression and loss increases. Presumably captcha-breaking bots would also have difficulty parsing an image with low quality.

    From a coding perspective, using the quality setting should essentially be free, since you're already saving the image as a JPEG somewhere in your script. All that would be required is to add the quality parameter. In python, you would add a "quality=n" option to the image.save() function.

    Whether or not this adds a significant CPU burden, I don't know. The save() is doing some compression as currently coded, just using a default setting that doesn't result in visible loss. So this could be another instance of "free" distortion, or it could crank up the processing time. Dunno, guess I need to gin up a script and test it. And then assume that my results for language and platform would hold for the platform you're using.

    One upside though, the files size gets reduced pretty significantly. At a quality setting of 10, the file is reduced by about a factor of 4.

  51. My first test of JPEG compression as a signal distorter was using Lucia’s first image in the post. After a moment’s thought, I realised that I can’t test my hypothesis when the OCR programs can’t even read the undistorted text. So I went about finding a captcha image that the OCR web sites could at least have partial success in reading.

    I went to http://www.captchacreator.com/v-customize.html# and created this image, saved in the lossless PNG format:

    https://picasaweb.google.com/114220722563604051694/March262012#5724297283724618754
    (hopefully it is readable by someone other than me)

    The text in the image is “UFTNKX3H”. The OCR scanner at http://www.newocr.com/ interpreted the PNG file as “0 FT MK 76%”.

    Using GIMP 2.6.12, I saved the image at differing quality settings and tested them at http://www.newocr.com/. The results are:

    Quality 50 – “0 FT MK KM”
    Quality 25 – “0 FT MK KM”
    Quality 10 – “0 FT MK 7<3\*r"
    Quality 5 – "Error! Text can not be recognized."

    The quality 5 image is at: https://picasaweb.google.com/114220722563604051694/March262012#5724297285229867186

    So, application of the JPEG quality setting distorts the text image as compression and loss increases. Presumably captcha-breaking bots would also have difficulty parsing an image with low quality.

    From a coding perspective, using the quality setting should essentially be free, since you're already saving the image as a JPEG somewhere in your script. All that would be required is to add the quality parameter. In python, you would add a "quality=n" option to the image.save() function.

    Whether or not this adds a significant CPU burden, I don't know. The save() is doing some compression as currently coded, just using a default setting that doesn't result in visible loss. So this could be another instance of "free" distortion, or it could crank up the processing time. Dunno, guess I need to gin up a script and test it. And then assume that my results for language and platform would hold for the platform you're using.

    One upside though, the files size gets reduced pretty significantly. At a quality setting of 10, the file is reduced by about a factor of 4.

    [2nd attempt at posting, please delete if this is a dupe comment]

  52. Lucia,

    I might have a lengthy post in the moderation queue, or am I perhaps relegated to “small talker” status? 😮

    I submitted a 2288 character comment and didn’t see any indication that it had been received, even for moderation.

    *scratch head*

  53. Earle–
    If just reducing resolution on saving helps, I’d do it!! 🙂

    I couldn’t see your images at Picasa though. Maybe https: is the issue?

  54. OK, need to condense perhaps. Will strip out links…

    My first test of JPEG compression as a signal distorter was using Lucia’s first image in the post. After a moment’s thought, I realised that I can’t test my hypothesis when the OCR programs can’t even read the undistorted text. So I went about finding a captcha image that the OCR web sites could at least have partial success in reading.

    I went to captchacreator.com and created this image, saved in the lossless PNG format:
    https://picasaweb.google.com/114220722563604051694/March262012?authuser=0&authkey=Gv1sRgCLjVhZrGn5KGfg&feat=directlink
    (link to Picasa gallery, first image shown is the PNG)

    The text in the image is “UFTNKX3H”. The OCR scanner at newocr.com interpreted the PNG file as “0 FT MK 76%”.

    Using GIMP 2.6.12, I saved the image at differing quality settings and tested them at newocr. The results are:

    Quality 50 – “0 FT MK KM”
    Quality 25 – “0 FT MK KM”
    Quality 10 – “0 FT MK 7<3\*r"
    Quality 5 – "Error! Text can not be recognized."

    So, application of the JPEG quality setting distorts the text image as compression and loss increases. Presumably captcha-breaking bots would also have difficulty parsing an image with low quality.

    From a coding perspective, using the quality setting should essentially be free, since you're already saving the image as a JPEG somewhere in your script. All that would be required is to add the quality parameter. In python, you would add a "quality=n" option to the image.save() function.

    Whether or not this adds a significant CPU burden, I don't know. The save() is doing some compression as currently coded, just using a default setting that doesn't result in visible loss. So this could be another instance of "free" distortion, or it could crank up the processing time. Dunno, guess I need to gin up a script and test it. And then assume that my results for language and platform would hold for the platform you're using.

    One upside though, the files size gets reduced pretty significantly. At a quality setting of 10, the file is reduced by about a factor of 4.

  55. Oops, now I am a spammer! I didn’t see my original up the comment tree. Mea culpa maxima!

    Can you delete those first two comments? The third one should have a functional Picasa link.

    Thanks!

  56. For what it’s worth, I could break a CAPTCHA using those images at any quality setting in about 30 minutes (starting from scratch). The noise you’ve added doesn’t distort the signal at all.

  57. Brandon,

    On those images, I’d agree wholeheartedly. Where the letters overlap, and there is existing noise already, I at least wonder. Anyhoo, the file size vs. computation time may make it worth the effort to incorporate it.

  58. Earle Williams, it would be worth testing the process on images like the ones lucia uses, ones where letters aren’t a single color. In cases like that, the distortion may matter. However, if you use solid color letters, the quality you save at won’t change anything.

    Of course, it could still be worthwhile for other reasons.

  59. Just following up on my commitment to grind out the script…

    Running Python 2.7.1 on Ubuntu 10.04, with the following script

    # test of JPEG compression calculation

    import os, time, random
    import Image, ImageDraw, ImageColor

    fname = "/home/earle/test.jpg"
    for q in (85, 70, 50, 30, 10, 5):
        start_time = time.clock()
        loops = 2000
        circles = 10
        for i in range(1, 1 + loops):
            # build a 200 x 50 image and scatter random colored circles on it
            im = Image.new("RGB", (200, 50), None)
            draw = ImageDraw.Draw(im)
            for j in range(1, 1 + circles):
                c_x = random.randint(10, 190)
                c_y = random.randint(5, 45)
                r = random.randint(1, 30) + 10
                c_color = (random.randint(5, 250), random.randint(5, 250), random.randint(5, 250))
                draw.ellipse((c_x - r, c_y - r, c_x + r, c_y + r), fill=c_color)
            del draw
            # save at the quality level being tested on this pass
            im.save(fname, "JPEG", quality=q)
        end_time = time.clock()
        secs = end_time - start_time
        print str(loops) + " tests at quality=" + str(q) + " in " + str(secs) + " seconds."

    Yielded the following output:

    2000 tests at quality=85 in 2.76 seconds.
    2000 tests at quality=70 in 2.76 seconds.
    2000 tests at quality=50 in 2.75 seconds.
    2000 tests at quality=30 in 2.76 seconds.
    2000 tests at quality=10 in 2.76 seconds.
    2000 tests at quality=5 in 2.77 seconds.

    I make no claim that this is a definitive test, but as a first order approximation I’d expect no difference in CPU load for saving a JPEG file at any quality setting.

  60. Yay, I just unbanned myself!

    By the way, I missed the 125 second cutoff on my smart phone by 13 seconds. Second time through was easier as Firefox remembered my IP and email.

    Please accept my most humble apologies if the 503 error I received was due to my trying to edit my prior comment.

    Having faced the wrath of Cloudflare firsthand, I have to say that the unban script is pretty darn slick. Thanks Lucia for making it possible to get back to the Blackboard and post annoying comments. 🙂

  61. Earle–
    If you got a 503, that means you got banned at cloudflare by way of being banned by ZBblock. I could try to look it up in killed.txt.

    I consolidated so everyone gets unbanned at cloudflare. (Previously, there were just too many possible bans, and it was very confusing!)

    One of the reasons for mega radio silence is that today I’ve actually been working on the plugin. (As with all things written to fit into a particular code base, that means reading to figure out where all the hooks/actions etc. are and figuring out how to lay them into WordPress.)

    3 extra seconds is a lot for someone to wait if the captcha is for comment spam. So it would be too long as a “spam” solution.

    Those waiting to be unbanned from cloudflare would wait that long, especially since getting banned at cloudflare should be rare.

    If you’re banging out code, you must be enough of an engineer to understand this. (Those who don’t do any engineering at all often don’t really “get” this issue– though they still would get antsy if they had to wait 3 seconds for a captcha to load every time they submitted a comment.)

  62. Out of curiosity, I decided to see what would happen if I tried reading characters with the code I have right now. I’ve only built a library of images for nine characters so far, so this shouldn’t be taken as meaning too much.

    What I did was reload the Unban page looking for images with characters I had stored copies of. When I found one, I saved the image. After collecting several images, I went through and pulled seven characters out. I then placed all seven characters into a single image file, saved it, and ran my code. Two of the seven characters matched.

    I then looked at my results and saw 2 was being selected a lot. To test things further, I removed 2 from the collection and tried again. This time, four out of seven characters matched. I changed which images of 2s I was using to higher quality ones, and the results stayed the same.

    Four out of seven isn’t a good success rate, and it is based on only nine characters out of the ~50 ones the CAPTCHA uses, but it’s still an interesting result. That’s especially true since I didn’t erase the lines before doing the comparisons. I filtered out the basic noise, and I obviously did the segmentation manually, but that’s it. Given that, I think I should be able to beat this CAPTCHA with some more work.
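
    (For anyone curious, the comparison step itself is conceptually simple. This is not my actual code, just a stripped-down sketch of the sort of library matching I’m describing.)

    from PIL import Image

    def match_score(candidate, reference, threshold=128):
        # fraction of pixels that agree after thresholding both images to black/white
        a = candidate.convert("L").resize(reference.size).point(lambda p: p < threshold)
        b = reference.convert("L").point(lambda p: p < threshold)
        pa, pb = a.load(), b.load()
        w, h = reference.size
        agree = sum(pa[x, y] == pb[x, y] for x in range(w) for y in range(h))
        return agree / float(w * h)

    def classify(segment, library):
        # library: dict mapping each character to a list of stored reference images
        scores = [(match_score(segment, ref), ch)
                  for ch, refs in library.items() for ref in refs]
        return max(scores)[1]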

  63. Oh, the 2.7 odd seconds was to do 2,000 iterations of my test, which places ten circles randomly into a 200 x 50 image then saves it as a JPEG. The significance of my output is that there is no appreciable time difference saving to JPEG at near-max quality of 85 compared to minimal quality of 5. There may not be much benefit from having a smaller file or applying JPEG induced distortion, but at least there’s no added computation time.

    Of course you’re probably not actually saving the captcha image to a file, just piping out to the data stream. I’ll stick my neck out and say that you’ll see no increase or delay by applying an arbitrary quality setting for the image versus what the default currently is.

  64. Earle– Ahh! Then no problem! 🙂

    I don’t save the captcha files. I display then destroy.

    I agree with you the difference won’t be time.

    It will only be a question of code maintainability. I don’t know about other people, but if I write code with extra lines, later when looking for bugs, I have to look at all that extra stuff and it takes me more time to fix bugs. (Unfortunately, I am NOT a talented programmer.)

    So, now we’ll wait to see if Brandon thinks that extra fiddling does anything to thwart a bot. If it doesn’t, it’s not worth doing (even if it takes no time).

    Do you agree? Or is there something I’m overlooking?

  65. Lucia,

    The bots may not be as resourceful as Brandon, so there could be some bot-proofing benefit to applying a lower quality setting to your captcha images.

    The resulting images are smaller, so there will be some bandwidth-reduction benefit to applying a lower quality setting to the captcha images.

    My assumption regarding your code is that there currently is some call where you display the image as a jpeg, akin to the save() function in my script. My second assumption is that this function call will take a parameter for the quality setting. If both my assumptions hold, then all you need to do to your code is add the optional parameter for the quality you desire.

    It seems like a lot of effort expended for what may be a trivial return, but I followed down the rabbit trail because it seemed that you could make use of the JPEG quality with no computational overhead and just a de minimis amount of added code.

    If my assumptions about the image library utilized by your code are incorrect, then it probably isn’t worth your effort to implement the JPEG quality.

  66. By the way, your comment about code maintainability and extra lines reminded me of this scene from Amadeus:

    http://www.youtube.com/watch?v=Q_UsmvtyxEI

    “My dear young man, don’t take it too hard. Your work is ingenious. It’s quality work. And there are simply too many notes, that’s all. Just cut a few and it will be perfect. ”

    🙂

  67. Earle

    The bots may not be as resourceful as Brandon,

    More especially, the authors of the bot aren’t currently motivated to break my particular captcha. That’s why in the short term almost anything unique works to repel stray bot attempts almost anywhere. Moreover, having a large variety of fairly effective measures “out there” raises the cost of operation for spammers.

    The key thing for someone implementing a measure to understand is whether the measure is simply meant to reduce something or to actually secure something. Then, you design as necessary. Right now– today– my captcha is likely secure enough for me because — quite likely– no bot programmer is visiting trying to break it.

    Mind you– as I am writing a plugin for wordpress right now, I know that I eventually want to provide those who use it a means for their users to unban themselves. If the plugin became popular, then some spammers might be motivated to spend time figuring out how to break the captcha so they could systematically unban themselves at any blog that banned them at cloudflare.

    So, in that event, to prevent a flood of unbannings, if my main method of controlling floods of unbannings was the captcha, I would need a secure captcha, because then bot programmers would be spending the time to break “my” captcha. (Although some “Mozilla/5.0 (compatible; RSSMicro.com RSS/Atom Feed Robot)” at “99.108.10.22” has found the page and it returns pretty frequently.)

    Interestingly enough, “RSSMicro.com RSS/Atom Feed Robot” doesn’t even load the captchas. So that bot must be programmed not to load things inside image tags, which in this case is nice behavior because it means it doesn’t run the PHP that creates the image. But I may need to program that page to block anything that admits to being a robot and/or is looking for a feed. (I’m not worried about that today though.)
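    If I ever do add that, it would probably be something as simple as this (just a sketch; the list of patterns is a guess at what to match):

        <?php
        // Hypothetical check for the unban page: refuse anything that admits
        // to being a robot or a feed reader in its user-agent string.
        $ua = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
        if ($ua === '' || preg_match('/(bot|robot|crawler|spider|feed|rss)/', $ua)) {
            header('HTTP/1.1 403 Forbidden');
            exit('Automated clients are not served the unban form.');
        }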

  68. Earle–
    On the Mozart comparison: It all depends on whether there are or are not too many notes. 🙂

    I admit that when sitting through The Magic Flute, I wouldn’t have minded if it were 10% shorter. But that’s the only Mozart piece I’ve ever felt that way about, and it has more to do with wrapping up the story than with any song needing fewer notes.

    Now, for operas that one does decide to leave rather than spend any more time watching, I give you “Die Frau ohne Schatten”. During that opera, my sister and I left during the 2nd intermission. Half the audience left after the 1st intermission. The music was nice. The singers were fine. But the story and pacing….. deadly.

    Back to code: Really….. it’s one thing if a customer just gets code someone else wrote and uses it. From their point of view, the ‘extra code’ is nothing. But if a person writes their own stuff and has to go back later– intermittently– and remember what it’s doing… well, 6 months from now, if I look at the code because– possibly– the Captcha has been broken, I’m going to see a bit of code that tweaks some aspect of the captcha. I’m going to have to remember what those lines did and why I included them. I’m going to have to think about whether or not their intended function added value to the captcha. And I’ll have to think about it when the reason they were included has become a dim recollection.

    I’ve got enough crud like that in various codes!

    So it matters to me whether those extra lines of code made the captcha harder to break or not. If they didn’t make it harder to break but were only added because someone imagined they might, it’s better to leave them out, even if running that bit of code isn’t too CPU intensive. (Or at least I think so!)

  69. lucia:

    More especially, the authors of the bot aren’t currently motivated to break my particular captcha.

    My problem with this perspective is people don’t need to write code specifically for your CAPTCHA. The same code they’d use to break a dozen other CAPTCHAs could be used to break yours. Targeting yours might let them optimize their code, and thus break your CAPTCHA more efficiently, but it isn’t necessary.

    Granted, I don’t know a lot about the demographics of spammers, so I don’t know how developed any code they might use to break CAPTCHAs is. Still, even if they don’t have the code to do it right now, all it would take is for the right person to write/find it, and then suddenly tons of spammers would have it. I know I’ve seen at least three different projects with the level of sophistication to break ~75% of the CAPTCHAs I’ve seen, without any tweaking.

    But if a person writes their own stuff and has to go back later– intermittently– and remember what it’s doing… well, 6 months from now, if I look at the code because– possibly– the Captcha has been broken, I’m going to see a bit of code that tweaks some aspect of the captcha. I’m going to have to remember what those lines did and why I included them.

    There’s a lot of value in well-documented code for this exact reason. With proper documentation, you shouldn’t need to remember anything about the code (indeed, it should be clear to someone who has never seen the code before).

  70. Oh, I should give an update. I’ve increased my library of images to cover 22 characters (I need to figure out a directory scheme to handle lower/uppercase letters before I can expand it much more). At first that lowered the accuracy, but after I cleaned up the images I’m using (mostly removing lines), that was no longer the case. I’m now correctly matching ~50% of those 22 characters when I manually segment them. That value will be much higher if I build a better library. I could also switch to using a neural network, but given I have no experience with those, I don’t want to try to code one. I also don’t want to switch to a polynomial kernel for my SVM implementation, even if it would be more effective, because of the increased complexity of the code.

    Incidentally, I’ve decided to try to adopt a segmentation process that doesn’t depend on white space, and thus won’t be impacted much by lines. It has promise, but I don’t know how much of a pain it’d be to code. If it’s as easy to implement as I hope, I think I can get fairly accurate segmentation in a day or two.

    You know, if I put enough effort into this, I could probably sell it to some spammers. That makes me feel a little dirty.

  71. For your library-building efforts:

    http://theknittingfiend.com/unban/CaptchaSecurityImages.php?code=A&maxAngle=10

  72. Brandon–
    The guys who write pages on breaking Captchas like to mention they might be helping the visually impaired. That said– they probably help spammers more.

    But really, if we can find something that’s hard to break, that would be useful. We can draw the irregular circles etc.

  73. Confession. I’m being incredibly lazy right now. I’ve had a lot of trouble finding motivation to work on this project when I’ve already established (for myself at least) what the results would be. What’s the benefit of actually “completing” the program I’m working on? I don’t want to post the code since that would only help spammers, and I don’t have any use for it myself.

    On the other hand, I hate the idea of sharing information about a project with people then not finishing it. It feels rude, like I’m letting people down.

    Okay, I’m at least going to write the segmentation code. I’m confident it will work, but since I’ve never dealt with the approach I’ll be using, I can’t be sure. That’s an interesting enough issue to want to resolve.

  74. Brandon– Don’t feel bad or obliged in any way. You’re not being rude if you drop it. Do it while it remains fun. Otherwise… no!

    Right now my approach is going to be to monitor the file to see if anything starts getting through. While nothing does, I know that it’s “safe enough” (for what it is).

    I’ve got a lot of the ban/unban code ‘pluginified’ now and running that way. Some features I had with the “not plugin” version are gone because I can’t figure out how to do them conveniently in a plugin.

    Plugins are useful for sharing with bloggers, most of whom are on shared hosting and want to program as little as possible. The ideal plugin just works after two steps: a) upload, b) click activate. The next best is a) upload, b) click activate, and c) after reading a note, find an API key and enter it. At its most complicated, you enter things in a form you open in the WordPress dashboard.

    I’m trying to figure out how best to get the plugin to ban really nasty stuff at Cloudflare in real time rather than waiting around for a cron job to trigger. (I don’t mind waiting for a cron for the spambots. But for the really aggressive, repetitive scans, including XSS attacks etc., I’d like to ban those IPs at Cloudflare when they come in. I know how to do that for me, but I haven’t figured out a “good” way to do it in a plugin!)

    Oh… well… ZBblock does at least stop them.

  75. lucia, I can’t help it; it’s just the way I am. Fortunately (or perhaps unfortunately), how bad I feel often doesn’t overcome how lazy I feel!

    Random side note, it would appear Skeptical Science supports Mann’s 2008 hockey stick (read the comments). You’d think when even Gavin Schmidt backs away from it, they’d drop it, but…

  76. I’ve got the first stage of my segmentation code finished, and it works quite well for how unrefined it is. The idea is that after you filter out noise, you scan each column of pixels in the image and count how many dark pixels it contains. The higher the value, the more likely that column is part of a letter.

    Once you’ve done that, you can figure out what “rules” you want to use to separate letters. In my case, I’ve set it to treat any column with a value lower than four as a break. In addition, I treat any difference between adjacent values exceeding five as a break, provided at least 10 columns have been stored in the current letter.

    Obviously, those rules are extremely crude. They could be improved greatly, especially if I used multiple passes or coded the solution with side-channel information (specifically, knowing how many characters can be in the CAPTCHA). Even so, this seems to be almost completely bypassing the lines.

    I just tested a handful of CAPTCHAs, and my results show at least 60% were correctly and completely segmented. I think that works as proof of concept.
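    For anyone curious, the column-counting idea is roughly the following (a sketch in PHP/GD since that’s what lucia’s captcha uses; it is not my actual code, and it assumes the noise filtering has already happened):

        <?php
        // Illustrative only. Assumes a dark-text captcha image already cleaned of noise.
        $img = imagecreatefromjpeg('captcha.jpg');   // hypothetical file name
        $w = imagesx($img);
        $h = imagesy($img);

        // Count the "dark" pixels in each column.
        $counts = array_fill(0, $w, 0);
        for ($x = 0; $x < $w; $x++) {
            for ($y = 0; $y < $h; $y++) {
                $rgb  = imagecolorat($img, $x, $y);
                $gray = (($rgb >> 16) & 0xFF) + (($rgb >> 8) & 0xFF) + ($rgb & 0xFF);
                if ($gray < 3 * 128) {               // darkness threshold is a guess
                    $counts[$x]++;
                }
            }
        }

        // Apply the two rules described above: fewer than 4 dark pixels is a gap,
        // and a jump of more than 5 between adjacent columns closes the current
        // letter once it is at least 10 columns wide.
        $letters = array();
        $current = array();
        for ($x = 0; $x < $w; $x++) {
            $jump = $x > 0 && abs($counts[$x] - $counts[$x - 1]) > 5 && count($current) >= 10;
            if ($counts[$x] < 4) {
                if ($current) { $letters[] = $current; $current = array(); }
            } elseif ($jump && $current) {
                $letters[] = $current;
                $current = array($x);
            } else {
                $current[] = $x;                     // columns belonging to this letter
            }
        }
        if ($current) { $letters[] = $current; }
        // $letters now holds one array of column indices per (hopefully) one character.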

  77. As a follow up to my comment about Skeptical Science, and so I have a record of sorts, I just had a comment deleted from that thread. This is what I said:

    Tom Curtis, I’m confident anyone with an open mind will see through your response, but for the record, I refuse to respond when I’m told by one moderator I should refrain from “personal characterizations” while another moderator implicitly accuses me of intellectual dishonesty. If snide and baseless comments can be used to criticize me, yet I cannot call using uncalibratable data (upside down, no less) “nonsensical,” it’s clear fairness is not available.

    If and when you stop insulting me and try to have a reasonable discussion, I’ll respond again.

    I’m sure the reason I’d be told this was deleted was it doesn’t discuss the science, but given they’re happily allowing Tom Curtis to insult me, that rings hollow. It would appear my previous views of Skeptical Science’s moderation were spot on.

    Edit: I just realized there is an actual Open Thread, and this isn’t it (I’m used to discussing technical stuff in open threads here). Sorry for placing my comments about SkS here.

  78. Brandon–
    Interesting. I can’t help but think that with denser letters your method would work even better at segmenting.

  79. lucia, probably, but I don’t think it would make as much of a difference as other things. If I refine the rules a little, I think it will work on 90%+ of your CAPTCHAs. The biggest thing is I can avoid almost all false positives because I know your CAPTCHAs have a few possibilities for number of characters. With that knowledge, about the only trouble I’d really have is occasionally including bits of characters in the wrong segment.

    As an aside, I think using a similar approach to trim the space above and below characters might make the pattern matching more effective. That’s worth looking into.
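    The trimming itself would be simple. Something like this (again, only a sketch, not code I’ve actually run; $letterImg stands for a GD image holding one segmented letter):

        <?php
        // Sketch: drop the blank rows above and below a single segmented letter.
        function trimVertically($letterImg, $minDark = 1) {
            $w = imagesx($letterImg);
            $h = imagesy($letterImg);

            // Count the dark pixels in row $y.
            $rowCount = function ($y) use ($letterImg, $w) {
                $n = 0;
                for ($x = 0; $x < $w; $x++) {
                    $rgb  = imagecolorat($letterImg, $x, $y);
                    $gray = (($rgb >> 16) & 0xFF) + (($rgb >> 8) & 0xFF) + ($rgb & 0xFF);
                    if ($gray < 3 * 128) { $n++; }
                }
                return $n;
            };

            $top    = 0;
            $bottom = $h - 1;
            while ($top < $bottom && $rowCount($top) < $minDark)    { $top++; }
            while ($bottom > $top && $rowCount($bottom) < $minDark) { $bottom--; }

            // Copy only the rows that contain ink into a new image.
            $trimmed = imagecreatetruecolor($w, $bottom - $top + 1);
            imagecopy($trimmed, $letterImg, 0, 0, 0, $top, $w, $bottom - $top + 1);
            return $trimmed;
        }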

  80. I know your CAPTCHAs have a few possibilities for number of characters

    Yes. And I kept as many upper and lower case letters as possible and included numbers. But there is simply a limit. You can’t ask people to start entering nonstandard symbols (or you can, but they’ll generally just reload another captcha. At least I do that when I get a screwy one from Google.)

  81. lucia, there’s actually some indication there is no meaningful benefit in having both capital and lowercase letters in CAPTCHAs. I don’t know if it’s true, but I know a number of researchers have claimed it.

  82. Brandon–
    It seems to me that if a bot can recognize the letters, there should be no benefit to upper and lower case. If they are just guessing at random, there should be a difference. That said, if the bot is guessing at random, upper or lower only is likely sufficient.

    I guess the issue is: Does having more possibilities (i.e. 26*2+10 vs 26*1) make it more difficult for a captcha breaker to recognize the individual letters? If not, having more letters doesn’t matter. If yes, then it helps.
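    Just to put rough numbers on the guessing case (assuming, say, a 5-character captcha):

        <?php
        // Chance of one blind guess succeeding, for a hypothetical 5-character captcha.
        $singleCase = pow(1 / 26, 5);   // 26*1 symbols: about 8.4e-8 per guess
        $bothCases  = pow(1 / 62, 5);   // 26*2+10 symbols: about 1.1e-9 per guess
        printf("26 symbols: %.1e   62 symbols: %.1e\n", $singleCase, $bothCases);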

  83. The only time having more possibilities helps is when the program isn’t sure which letter it is and has to guess. If it can’t get a good match, having more possibilities helps. If it can get one, the extra possibilities don’t.

    Of course, having more possibilities also means it takes more effort to build a library, and it makes the bot (slightly) more computationally intensive. That might make it somewhat more worthwhile.

  84. I just encountered a rather unique CAPTCHA. It was plain text, but you had to download the file it was in.

Comments are closed.