I broke textCAPTCHA with Python (textcaptchabreaker.appspot.com)
135 points by Xk on Nov 12, 2010 | 54 comments

Wow very impressive, just tried it:

Q:Which word from list "duckweed, commendations, civilised, receptionists, flours" contains the letter "v"? A: FAILURE

Q: What word from "interdisciplinary, disorientating, decamps, moan" begins with "i"? A: FAILURE

Q: Egg, pink, white, green and pub: the 1st colour is? A: pink

Q: The day of the week in dog, Wednesday, penguin, trousers, hotel or lion is? A: wednesday

Q: Hand, jelly or Jennifer: the person's name is? A: jennifer

Q: What is fifty three thousand seven hundred and one as digits? A: 53701

Q: Cake, rice, snake, head and jelly: how many body parts in the list? A: 1

Q: If yesterday was Tuesday, what day is today? A: wednesday

Q: Which of these is a body part: lion, jelly, church, butter or nose? A: nose

Since some people are asking how it works, I'll explain how some of those work/fail.

Q: Which word from list "duckweed, commendations, civilised, receptionists, flours" contains the letter "v"? A: FAILURE

This one fails because it recognizes 'duck' as a word and doesn't know what the trailing characters are, so it fails to parse. Letting a word be followed by extra characters makes this question pass, but it also makes the parser twice as slow, so I didn't do that.

Q: Hand, jelly or Jennifer: the person's name is? A: jennifer

This is one of the tricky cases, because I convert to lower case when I pass to the parser. So if parsing fails and the word 'name' is in the string, then I return the first capitalized word and hope for the best.
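A minimal sketch of that fallback, not the author's code. The function name `guess_name` is hypothetical, and skipping the sentence-initial word is an assumption (without it, 'Hand' would be returned for the example above):

```python
import re

def guess_name(question):
    """Fallback for 'name' questions: return the first capitalized word
    that is not the first word of the question, lower-cased."""
    if "name" not in question.lower():
        return None
    words = re.findall(r"[A-Za-z]+", question)
    # Assumption: the first word is capitalized only because it starts
    # the sentence, so it is skipped.
    for word in words[1:]:
        if word[0].isupper():
            return word.lower()
    return None

print(guess_name("Hand, jelly or Jennifer: the person's name is?"))  # jennifer
```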

Q: What is fifty three thousand seven hundred and one as digits? A: 53701

This one was an interesting exercise in parsing, as it is grouped ((fifty three) thousand) (seven hundred) and (one). I assume that all numbers are of the correct form, because it will parse 'one one and two' as the number 4 (1+1+2=4).

Q: If yesterday was Tuesday, what day is today? A: wednesday

This was just putting the days of the week in order and doing a lookup into the list.
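The lookup could look roughly like this. It is a sketch only: the `OFFSETS` table and the function name `day_today` are hypothetical, and only the "yesterday"/"tomorrow" phrasings are covered:

```python
DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]

# How far "today" is from the day named in the question (hypothetical table).
OFFSETS = {"yesterday": 1, "tomorrow": -1}

def day_today(relative, day):
    """Given e.g. ('yesterday', 'Tuesday'), return today's day name."""
    index = DAYS.index(day.lower())
    # Modulo wraps around the end of the week in either direction.
    return DAYS[(index + OFFSETS[relative]) % len(DAYS)]

print(day_today("yesterday", "Tuesday"))  # wednesday
```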

I had yesterday off and so spent four or so hours writing this.

I haven't made it recognize all of the cases. It still gets a few wrong, but usually it'll just say FAILURE when it can't solve it.

As the page says, it's just a ~300 line grammar file and a parser-generator with syntax directed translation to solve the questions.

Chicken, egg: which came first?


sigh So close.

the rooster? FTW!

care to publish the src?

I replied to another comment, but since this one is getting upvoted I'll reply here too: I do not want to publish the grammar file yet. Later on, yeah, I can do that. But for now I want to leave it up to the people who run this service to improve it. I simply want to demonstrate that their scheme is not effective, I do not want to cause harm to anyone who uses it.

People using brand-new untested technology for security purposes are doing it for fun, not because it's a good idea. So releasing the source code to break it should not cause any problems.

src or it didn't happen. ;-)

Can I ask what you're using to do the parsing? Or did you write your own?

This is definitely the kind of thing that I could see doing with REBOL's parse operator, but Python doesn't (seem to) come with such a tool built in.


For a class in college I had to write a parser-generator, so I'm using mine.

Very nice. Any chance you'll release the code (or the grammar)?

I'd like to wait a little bit to see if Rob (who it seems runs the site) will make it a little better. There are people who use this service and I don't think it would be right to let spammers attack them.

Granted, it wouldn't be hard for someone else to do the same as me, but I just don't want my work used to do harm to someone else.

You might as well release it. The idea is fundamentally flawed, advantage attacker, and there is nothing they can do about it. You won't be hurting them, you'll sort of be doing them a favor. They've already sunk time into the idea, this is a chance for them to cut their time losses and run.

Though they may take it wrong and try to start an arms race with your code, to which I'd say: An arms race in the CAPTCHA space is already a win for the attackers. In case of tie, spammers win.

I think you should release your wonderful algorithm not directly as a way to hack text-captchas but in some other useful forms.

Good job on the code and kudos for not just releasing the code without giving the author some time.

It's totally hilarious to me that you had to protect your captcha breaker with a captcha.

Haha, yeah. I did it so that people wouldn't use it as a spamming service: I wanted to show that those captchas can be broken, but without any harm to anyone else.

You should have protected it with one of the text captchas - so to use your service to bypass it they'd have to have broken it already :)

Yeah :)

But then it occurred to me that someone could solve one by hand, and then use that answer to solve five more captchas that I asked for, and then recurse, eventually solving thousands of a site's captchas.

It makes perfect sense why you did it, but still funny as hell. (to me anyway)

Great job. Seeing this solve 10/10, it almost feels like artificial intelligence. You should send a resume to the Wolfram Alpha guys (their hit rate was significantly lower: http://news.ycombinator.com/item?id=1891375)

Yeah, but I have the advantage of knowing all the forms of the questions. If you were to change one word then mine would break. Or misspell anything.

Sorry, but what do you mean by "knowing all the forms of the questions"? It seems like maybe you mean that all the questions are formatted via one of several formulas like: name <x> of list<a..z>. That would imply your tool simply identifies the format and then applies some specific logic to solve the question. If that is what you mean, I'm unclear on how you know all the forms of the questions. textCAPTCHA purports to have over 180 million questions, so isn't it likely you have not found all the possible formats? Or perhaps you know because you have a relationship with the textCAPTCHA creator and they provided that info to you? Or, better yet, did your comment mean something entirely different?

Just curious, thanks.

You are correct, I look for a pattern and apply very very basic logic.

I just got 1,000 questions from the demo page and looked how they were similar. For example, there are many questions of the form "[list of something]: the [ordinal number] [type] is?" so I put that as a rule in the grammar file. I did this for about twenty different types of questions, each of which has three or so different phrasings.
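As an illustration of one such rule, here is a rough sketch that uses a regular expression in place of the grammar file, which has not been published. The vocabularies, `RULE`, and `solve_ordinal` are all made up for the example:

```python
import re

# Tiny illustrative vocabularies; the real word lists are not public.
CATEGORIES = {
    "colour": {"pink", "white", "green", "red", "blue"},
    "body part": {"nose", "head", "hand"},
}
ORDINALS = {"1st": 0, "2nd": 1, "3rd": 2, "4th": 3}

# "[list of something]: the [ordinal number] [type] is?"
RULE = re.compile(r"^(?P<items>.+): the (?P<ordinal>\w+) (?P<type>.+) is\?$")

def solve_ordinal(question):
    match = RULE.match(question.lower())
    if match is None:
        return "FAILURE"
    vocab = CATEGORIES.get(match.group("type"))
    index = ORDINALS.get(match.group("ordinal"))
    if vocab is None or index is None:
        return "FAILURE"
    # Split the list on commas and on the joining words "and" / "or".
    items = re.split(r",\s*|\s+and\s+|\s+or\s+", match.group("items"))
    hits = [item for item in items if item in vocab]
    return hits[index] if index < len(hits) else "FAILURE"

print(solve_ordinal("Egg, pink, white, green and pub: the 1st colour is?"))  # pink
```

A question phrased any other way, such as "Name the third item in the list: book, fork and spoon.", does not match the pattern and returns FAILURE.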

And yes, they may have 180 million questions, but in reality they have only a hundred or so modulo the specific words they pick (for the items in the list, or for the day of the week, or for the letter that the word starts with).

  > I put that as a rule in the grammar file
What are you using to process the grammar file? What type of grammar?

I wrote a parser generator, so I'm using that.

It's BNF-like.

I just typed in similar questions without following the format and it performed poorly. But at least it's a good proof-of-concept! :)

Name the third item in the list: book, fork and spoon.


Q: What is two times 8?


Which is a number: yellow, seven, book, flag?


They don't give questions formatted in other ways, so I didn't include that in the logic. As someone else pointed out, even deleting the question mark would break it. (However, making the question mark optional would be a two second fix.)

It's basically an arms race. They could inject a few misspellings and random noise here and there to throw your system off.

Fuzzy pattern matching would probably solve that.
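One way this might be done, sketched with `difflib` from Python's standard library: snap each misspelled word to the closest known word before parsing. The `VOCAB` list, the 0.8 cutoff, and the function names are assumptions for illustration, not part of the actual solver:

```python
import difflib

# Hypothetical vocabulary of words the grammar already knows.
VOCAB = ["which", "colour", "wednesday", "yesterday", "today",
         "what", "day", "is", "was", "if"]

def closest_known(word, cutoff=0.8):
    """Return the most similar vocabulary word, or None if nothing is close."""
    close = difflib.get_close_matches(word, VOCAB, n=1, cutoff=cutoff)
    return close[0] if close else None

def normalize(question):
    """Replace each word with its closest vocabulary word when one exists."""
    fixed = []
    for word in question.lower().rstrip("?").split():
        fixed.append(closest_known(word) or word)
    return " ".join(fixed)

print(closest_known("yestreday"))  # yesterday
```

The cutoff is a trade-off: too low and unrelated words get rewritten, too high and real misspellings slip through.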

Great job! I wrote up that post about WolframAlpha and wonder if there's an API so you could integrate WA to make it more robust?

Right now there's no API or anything -- I wrote this just to see if I could. I'm not sure if I'll be putting up an API, because I figure by the time I'd do that I can just release the source for it. But if you really wanted, I could probably provide an API in a couple of days.

The world orange has which letter in the penultimate position? A: FAILURE

Aircraft fly through the? A: FAILURE

Q: A beetle has how many legs? A: a

It does well when it recognizes the questions, but there's still plenty of room for writing new ones. It does show that it's an arms race, as someone else noted.

Very true. The person writing the questions has the advantage in this one because they can always throw in more variations and I would have to figure out all of the new ones.

There is, however, an inherent limitation to question-asking: the question must be understandable to a general population. So I would switch from using a parser to just looking for key words and then matching from there. (Which is, for example, how I find names in a list.)

Even if they made it so I only get one fifth the number of correct answers, a 20% success ratio would be high enough for concern. The attackers can just keep trying and trying and if 20% of the time they can register a new bot, then they win. The defender must win a much larger percent of the time.
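To put rough numbers on that, assuming each attempt is independent and succeeds with the same probability (a simplification; the function names are illustrative):

```python
def chance_of_success(p, attempts):
    """Probability of at least one success in the given number of tries."""
    return 1 - (1 - p) ** attempts

def expected_attempts(p):
    """Average number of tries until the first success (geometric distribution)."""
    return 1 / p

print(expected_attempts(0.2))                # 5.0
print(round(chance_of_success(0.2, 10), 3))  # 0.893
```

So at a 20% solve rate a bot needs about five tries per registration on average, and ten tries succeed at least once roughly 89% of the time.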

First question could be "air" or "sky". I would guess 6 or 8 to the second one.

I would probably fail both, so I guess this isn't the kind of question that would realistically appear.

Given the recent wave of reinventing/breaking captchas, I saw the captcha at the end, and thought: "what if this free app is just a very clever way to get a bunch of intelligent people to answer captchas for a bot?"

Then I realized if somebody went through all this trouble to design a new app like this, they deserve to have me answer captchas for a bot somewhere.

Q: What nonsense word do the letters G-B-R-D spell? A: gbrd

Q: Which three letter word starts with the letters TH and ends with the letter E? A: th

Pretty neat.

Haha, no, I don't have this answering questions for any bot anywhere.

Actually, the reason those are solved isn't exactly what you'd expect. When it fails to find an answer, it checks to see if any word is in upper case. If one is, it cuts out all non-letters (to remove the comma at the end, for example) and then returns that word. It does this because textCAPTCHA will occasionally ask questions such as "Which word IN this sentence is in all caps." So that's why G-B-R-D gets turned into 'gbrd', and why it says 'th'.
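That fallback is simple enough to sketch. The function name `uppercase_fallback` is hypothetical, and the question is assumed to arrive without the "Q:" prefix:

```python
import re

def uppercase_fallback(question):
    """Return the first all-caps token, stripped of non-letters and lower-cased."""
    for token in question.split():
        # str.isupper() is True when every cased character is upper case,
        # so "G-B-R-D" and "TH" qualify but "Which" does not.
        if token.isupper():
            return re.sub(r"[^A-Za-z]", "", token).lower()
    return "FAILURE"

print(uppercase_fallback("What nonsense word do the letters G-B-R-D spell?"))  # gbrd
print(uppercase_fallback(
    "Which three letter word starts with the letters TH and ends with the letter E?"))  # th
```

Under this sketch a one-letter word such as "A" also counts as all caps, so "A beetle has how many legs?" returns 'a', the same answer reported elsewhere in the thread.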

As soon as I saw the story about textCAPTCHA I knew somebody would tackle it and solve it.

There have been so many 'captcha alternatives' on HN recently, and all of them forget that there is a very good reason why current captchas are extremely distorted images.

This is a variant on Schneier's fundamental theorem: Any fool can design a CAPTCHA for which they cannot program a solver. Just like in crypto, the only true test of a CAPTCHA is whether it survives the test of time after having been attacked again and again, which is why it's very dangerous to jump on the bandwagon of a new CAPTCHA scheme (or worse yet -- design your own).

This is one reason why I'm partial to reCAPTCHA: there is a lot of experience in OCR systems, and we know what the current state of the art is -- and we know what kind of things foil it.

The only problem is you get things like google's captcha system that make you question whether or not you are human.

Tried: Q: What does Python prompt look like ?

Desired Answer: >>>

Answer Given: None.

I like this sort of approach for community oriented captcha. Ask a question that anyone within the community would know or could easily find out but a general purpose spam bot or average person would be unable to solve.

If the community is profitable enough or becomes big enough some spammer will spend 15 minutes or 50 dollars on mechanical turk and find all the answers. As others have mentioned it is an arms race in the end. Back and forth each side upping the challenge.

Care to give a little more explanation on how it works?


I'm assuming you know what parsing is, and what syntax-directed translation is. If not, the wikipedia articles (http://en.wikipedia.org/wiki/Parsing and http://en.wikipedia.org/wiki/Syntax-directed_translation) offer a reasonable explanation.

The grammar file is generally laid out as follows:

Question ::= Phrase1 | Phrase2 | Phrase3 ...

Phrase1 ::= 'if' 'the' Noun 'is' Color 'what' 'color' 'is' 'it' [and now return the word which matched Color]

Noun ::= [sequence of characters]

Color ::= 'red' | 'blue' ...
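To make the idea concrete, here is a toy matcher for the Phrase1 rule with its "return the word which matched Color" action. It is a hand-written sketch, not the author's parser-generator: the names are invented, the color list is a small subset, and Noun is simplified to a single token:

```python
COLORS = {"red", "blue", "green"}  # illustrative subset

# Plain strings are literal tokens; NOUN matches any one token,
# COLOR matches a token found in COLORS.
NOUN, COLOR = object(), object()
PHRASE1 = ["if", "the", NOUN, "is", COLOR, "what", "color", "is", "it"]

def match_phrase(rule, tokens):
    """Return the token bound to COLOR if tokens match rule, else None."""
    if len(tokens) != len(rule):
        return None
    answer = None
    for symbol, token in zip(rule, tokens):
        if symbol is NOUN:
            continue
        if symbol is COLOR:
            if token not in COLORS:
                return None
            answer = token  # the semantic action: remember the color
        elif symbol != token:
            return None
    return answer

def solve(question):
    tokens = question.lower().replace(",", " ").replace("?", " ").split()
    for rule in (PHRASE1,):  # Question ::= Phrase1 | Phrase2 | ...
        answer = match_phrase(rule, tokens)
        if answer is not None:
            return answer
    return "FAILURE"

print(solve("If the car is red, what color is it?"))  # red
```

Because every literal must match exactly, the same question spelled with "colour" returns FAILURE unless another alternative is added to the rule list.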

The trickiest part of it was getting it to correctly interpret numbers like 'twenty one thousand eight hundred and ninety nine'

That part is basically laid out as

Number ::= Number QuantifiedNumber | Number 'and' QuantifiedNumber | QuantifiedNumber

QuantifiedNumber ::= NumberGroup | NumberGroup Quantifier

NumberGroup ::= SingleNumber SingleNumber | SingleNumber

SingleNumber ::= 'one' | 'two' | ... 'twenty' | 'thirty' | ...
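The number rules above can be read as: add the words inside a group, multiply the group by its quantifier, then add the groups together. A hand-written sketch of that reading follows (not the generated parser; `parse_number` and the tables are illustrative, and like the grammar it does not reject malformed input):

```python
ONES = ("one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

SINGLE = {word: value for value, word in enumerate(ONES, start=1)}
SINGLE.update({word: 10 * value for value, word in enumerate(TENS, start=2)})
QUANTIFIER = {"hundred": 100, "thousand": 1000, "million": 1000000}

def parse_number(text):
    tokens = text.lower().split()
    total, pos = 0, 0
    while pos < len(tokens):
        if tokens[pos] == "and":  # Number 'and' QuantifiedNumber
            pos += 1
        # NumberGroup ::= SingleNumber SingleNumber | SingleNumber
        group = SINGLE[tokens[pos]]  # KeyError on an unknown word
        pos += 1
        if pos < len(tokens) and tokens[pos] in SINGLE:
            group += SINGLE[tokens[pos]]
            pos += 1
        # QuantifiedNumber ::= NumberGroup | NumberGroup Quantifier
        if pos < len(tokens) and tokens[pos] in QUANTIFIER:
            group *= QUANTIFIER[tokens[pos]]
            pos += 1
        total += group
    return total

print(parse_number("twenty one thousand eight hundred and ninety nine"))  # 21899
print(parse_number("one one and two"))  # 4
```

The second call shows the looseness of the grammar: any two single numbers form a group, so 'one one and two' is accepted and summed.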

These questions didn't work for me

Q: How many thousands is a million?

Q: How many millions is a billion?

Q: There is a dog, cat and a goat: First animal?

Q: There is a dog, cat and a goat: How many humans?

Q: Black, White: Which is dark?

There are a few types of questions that it doesn't solve.

I only had so much time yesterday and so I didn't put in all the different types of questions. Maybe later I'll go back and get all the questions that it can't solve.

Nice little app, but here are two I've tried that it fails on:

Please join these two "words" together (without spaces): zrtvyoav and ekozuarn A: FAILURE

What starts with "bow" and ends with "ser"? A: FAILURE

Very impressive! 4/5 for me.

Here's the only one that broke:

          The 6th letter in "aviator" is?
          A: FAILURE

Just letting people know I dropped it from 10 questions down to 5 because right now it's a free google app and I'd rather not pay for CPU-time.

Failed on "The blue rainjacket is what color?". You might want to allow for both 'color' and 'colour'.

I noticed the form matters: not including a question mark breaks it, as in "two plus fifteen equals?"

* What is 3 + 12 / equal too? FAIL

Congrats, I suspected a higher cost of defeating this test.

