
Was curious about the "text captcha" service. It's a collection of questions with MD5 sums of acceptable answers.
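For anyone who hasn't looked at it, checking an answer against such an entry is only a few lines. This is a rough sketch, not the service's actual code: the entry layout, the `is_correct` helper, and the lowercase-and-strip normalization before hashing are all my assumptions.

```python
import hashlib


def md5_hex(text):
    """Hex MD5 digest of a UTF-8 string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def is_correct(user_answer, accepted_hashes):
    # Assumption: answers are trimmed and lowercased before hashing.
    normalized = user_answer.strip().lower()
    return md5_hex(normalized) in accepted_hashes


# Hypothetical entry: a question plus the MD5 sums of its acceptable answers.
entry = {
    "q": "What is seven plus two?",
    "a": {md5_hex("9"), md5_hex("nine")},
}

print(is_correct(" Nine ", entry["a"]))
print(is_correct("8", entry["a"]))
```

Nothing here needs a server round trip, which is the point: the whole thing would work from a static file.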

They provide an API, but I think this is a case of a project being a "service" to keep the database of questions from being free. There's no technical reason for this to be a service, and it's not a terribly complicated product that would be difficult to scale. It's a static database!

Might be neat to create an open-source bank of these CAPTCHA questions. Maybe I'll throw something together this weekend.

Wouldn't an open-source bank of CAPTCHA questions open the door for an open-source bank of answers to these questions?

Yeah, exactly. I think this is why it has to be a service, and I'd assume it's not a static database for that reason. If you have a fixed list of questions, it's easy to get the answers once and never have to do it again. Again, I think it's a safe assumption that these are in some way generated on the fly.

If you were able to analyze the sentence structure of all 180 million questions, how many different sentence structures would there be? This all points to the fact that you can build algorithms to guess the answers eventually.

Not even just guess them but accurately determine them.

A few years back I was hired by a third party to build a system to break the CAPTCHA on a popular site for various evil deeds. Morals aside, the money was good and I had a wedding to pay for. A CAPTCHA system becomes quite breakable when it becomes predictable. The system in question used an image-based CAPTCHA with the same (albeit annoying) font for each image, plus a static distortion overlay and a second layer of random distortion. By extracting a thousand sample images I was able to build a system in Perl that could determine the text with an estimated 98% success rate, and when it failed you would just request a new CAPTCHA.

My solution would be to mix images with logic. For example:

In the following list of images, which image number contains the green animal: {pic of zebra}, {pic of frog}, {pic of giraffe}

This would require image recognition as well as logic.

Interestingly enough, WolframAlpha can generate a CAPTCHA image of each of these text questions, so as to make it harder for a bot to decode AND answer the question! Check it out: http://www.wolframalpha.com/input/?i=CAPTCHA+What+is+seven+h...

It can't work the way you explained it: I just solved your CAPTCHA with 33% success rate (waaay too high for a useful CAPTCHA). Perhaps if you ask for "the three pictures of X animal out of those 9", and you had a database with which animal has which property (and you also ran some fuzz over the images so no two images would ever be the same). I'm still skeptical...
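Rough numbers for the two variants, assuming a bot that picks uniformly at random and a challenge with exactly the stated number of correct pictures (both are my assumptions; `random_guess_rate` is just an illustrative helper):

```python
from math import comb


def random_guess_rate(n_images, n_correct):
    """Chance that a uniformly random pick of n_correct images is exactly right."""
    return 1 / comb(n_images, n_correct)


one_of_three = random_guess_rate(3, 1)   # pick the one green animal out of 3
three_of_nine = random_guess_rate(9, 3)  # pick all three matches out of 9

print(f"{one_of_three:.1%} vs {three_of_nine:.1%}")  # 33.3% vs 1.2%
```

So the 3-of-9 version cuts blind guessing from 1 in 3 to 1 in 84, before any image recognition enters the picture.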

That would assume that it is multiple choice - however if it's still free form text input, requiring the input to equal "frog" would solve that issue. Text + images + logic would offer a lot more hurdles than just any single one of those.

No, that's the reason the answers are hashed. You can't get the answer from the hash, since a hash is a one-way function. This is the same reason you never store passwords in your database in plaintext, but rather hash them first.

As elliotcarlson pointed out, the issue isn't that answers can be determined automatically, it's that the cost of determining the answers can be amortized over all uses. It's the same vulnerability as rainbow tables. With rainbow tables you spend a lot of (automated) effort computing hashes for password guesses; the key advantage of this tactic is that the result applies to every naive use of that hash function.

The amount of effort for a human to go through the list of questions and come up with answers may be non-trivial, but once completed it's applicable to every single use of the plain-text captcha system. That's bad.
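To make the amortization concrete, here is a toy sketch. `make_cached_solver` and `solve_once` are made-up names; `solve_once` stands in for whatever expensive step produces an answer (a human, Mechanical Turk, an algorithm).

```python
def make_cached_solver(solve_once):
    """Wrap an expensive solver so each distinct question is solved only once."""
    answer_key = {}          # question -> answer, shared across every site
    stats = {"solved": 0}    # how many times the expensive step actually ran

    def solve(question):
        if question not in answer_key:
            stats["solved"] += 1
            answer_key[question] = solve_once(question)
        return answer_key[question]

    return solve, stats


# Stand-in for the expensive step; a real attacker would plug in human labor.
solve, stats = make_cached_solver(lambda question: "red")

for _site in range(1000):  # the same question served on 1000 different sites
    solve("Which of sock, library, cake or red is a colour?")

print(stats["solved"])  # 1
```

With a fixed question bank the expensive step runs once per question, no matter how many sites deploy the CAPTCHA.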

Mechanical Turk

Still, almost all (or all?) of the answers are excerpts of the question. So just test all short excerpts against the hashes and, voila, an answer key.
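A quick sketch of that attack, assuming answers are lowercased before hashing and are at most three words long (both assumptions on my part; `excerpts` and `crack` are made-up names, and the sample question is invented for illustration):

```python
import hashlib
import re


def md5_hex(text):
    """Hex MD5 digest of a UTF-8 string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def excerpts(question, max_words=3):
    """Yield every run of 1..max_words consecutive words from the question."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])


def crack(question, accepted_hashes, max_words=3):
    """Return the excerpts whose MD5 matches one of the published hashes."""
    return sorted(
        {c for c in excerpts(question, max_words) if md5_hex(c) in accepted_hashes}
    )


question = "Which of sock, library, cake or red is a colour?"
published = {md5_hex("red")}  # what the question bank would expose

print(crack(question, published))  # ['red']
```

A ten-word question yields fewer than thirty candidates at this length, so the hashes give essentially no protection for answers of this kind. Answers that are not excerpts (e.g. the result of arithmetic) would need a different candidate list.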

That's not the issue. As soon as you make a list of questions available to the world, all it takes is one spammer to create a matching list of answers and they can go to town. By providing that list of questions as open source you are making it easier for someone to create the counterpart answer database.

Captchas are not restricted to reddit and social news sites, despite what your link claims.

Oh, that was just to make fun of xkcd. I don't think xkcd links are really that good an addition to a discussion here.


Ha! Valid point!


Just tell me if you need the source code ;P

Also think about the way algorithms (like WolframAlpha) interpret the structure of the questions. As some of the other commenters noted, switching some words around makes WolframAlpha fail.

It might be interesting to come up with a methodology for question structure that is harder for algorithms to interpret...?
