
I broke textCAPTCHA with Python - Xk
http://textcaptchabreaker.appspot.com/
======
bjonathan
Wow very impressive, just tried it:

Q:Which word from list "duckweed, commendations, civilised, receptionists,
flours" contains the letter "v"? A: FAILURE

Q: What word from "interdisciplinary, disorientating, decamps, moan" begins
with "i"? A: FAILURE

Q: Egg, pink, white, green and pub: the 1st colour is? A: pink

Q: The day of the week in dog, Wednesday, penguin, trousers, hotel or lion is?
A: wednesday

Q: Hand, jelly or Jennifer: the person's name is? A: jennifer

Q: What is fifty three thousand seven hundred and one as digits? A: 53701

Q: Cake, rice, snake, head and jelly: how many body parts in the list? A: 1

Q: If yesterday was Tuesday, what day is today? A: wednesday

Q: Which of these is a body part: lion, jelly, church, butter or nose? A: nose

~~~
Xk
Since some people are asking how it works, I'll explain how some of those
work/fail.

Q: Which word from list "duckweed, commendations, civilised, receptionists,
flours" contains the letter "v"? A: FAILURE

This one fails because it recognizes 'duck' as a word and doesn't know what
the trailing characters are, so it fails to parse. Letting a word be followed
by extra characters would make this one pass, but it makes the parser twice as
slow, so I didn't do that.

Q: Hand, jelly or Jennifer: the person's name is? A: jennifer

This is one of the tricky cases, because I convert to lower case when I pass
to the parser. So if parsing fails and the word 'name' is in the string, then
I return the first capitalized word and hope for the best.
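
A minimal sketch of that fallback, not the actual code from the app. The
function name is made up, and skipping the sentence-initial word is an
assumption (otherwise 'Hand' would be returned in the example above):

```python
import re

def guess_name(question):
    """Fallback for 'name' questions: return the first capitalized word."""
    # Only apply the heuristic when the question mentions a name.
    if "name" not in question.lower():
        return None
    words = re.findall(r"[A-Za-z']+", question)
    # Assumption: skip the first word, which is capitalized only because
    # it starts the sentence.
    for word in words[1:]:
        if word[0].isupper():
            return word.lower()
    return None

print(guess_name("Hand, jelly or Jennifer: the person's name is?"))
```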

Q: What is fifty three thousand seven hundred and one as digits? A: 53701

This one was an interesting exercise in parsing as it is grouped ((fifty
three) thousand) (seven hundred) and (one). I assume that all numbers are of
the correct form, because it will parse 'one one and two' as the number 4
(1+1+2=4).

Q: If yesterday was Tuesday, what day is today? A: wednesday

This was just putting the days of the week in order and doing a lookup into
the list.
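
Roughly, the lookup could look like the sketch below. The names DAYS,
OFFSETS and relative_day are hypothetical, not taken from the real grammar
file:

```python
# Days in order, so moving forward or back is just index arithmetic.
DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]

# Hypothetical offsets for the reference words used in the questions.
OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}

def relative_day(known_ref, known_day, wanted_ref):
    """Given e.g. ('yesterday', 'Tuesday', 'today'), return the wanted day."""
    index = DAYS.index(known_day.lower())
    shift = OFFSETS[wanted_ref] - OFFSETS[known_ref]
    # Wrap around the week with modulo 7.
    return DAYS[(index + shift) % 7]

print(relative_day("yesterday", "Tuesday", "today"))
```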

------
Xk
I had yesterday off and so spent four or so hours writing this.

I haven't made it recognize all of the cases. It still gets a few wrong, but
usually it'll just say FAILURE when it can't solve it.

As the page says, it's just a ~300 line grammar file and a parser-generator
with syntax directed translation to solve the questions.

~~~
jasonlotito
Chicken, egg: which came first?

A: FAILURE

 _sigh_ So close.

~~~
amanuel
the rooster? FTW!

------
bockris
It's totally hilarious to me that you had to protect your captcha breaker with
a captcha.

~~~
Xk
Haha, yeah. I did it so that people wouldn't use it as a spamming service: I
wanted to show that those captchas can be broken, but without any harm to
anyone else.

~~~
ldite
You should have protected it with one of the text captchas - so to use your
service to bypass it they'd have to have broken it already :)

~~~
Xk
Yeah :)

But then it occurred to me that someone could solve one by hand, and then use
that answer to solve five more captchas that I asked for, and then recurse,
eventually solving thousands of a site's captchas.

------
chaosmachine
Great job. Seeing this solve 10/10, it almost feels like artificial
intelligence. You should send a resume to the Wolfram Alpha guys (their hit
rate was significantly lower: <http://news.ycombinator.com/item?id=1891375>)

~~~
Xk
Yeah, but I have the advantage of knowing all the forms of the questions. If
you were to change one word then mine would break. Or misspell anything.

~~~
techiferous
I just typed in similar questions without following the format and it
performed poorly. But at least it's a good proof-of-concept! :)

Name the third item in the list: book, fork and spoon.

A: FAILURE

Q: What is two times 8?

A:

Which is a number: yellow, seven, book, flag?

A: FAILURE

~~~
Xk
They don't give questions formatted in other ways, so I didn't include that in
the logic. As someone else pointed out, even deleting the question mark would
break it. (However, making the question mark optional would be a two second
fix.)

------
harshpotatoes
Given the recent wave of reinventing/breaking captchas, I saw the captcha at
the end, and thought: "what if this free app is just a very clever way to get
a bunch of intelligent people to answer captchas for a bot?"

Then I realized if somebody went through all this trouble to design a new app
like this, they deserve to have me answer captchas for a bot somewhere.

Q: What nonsense word do the letters G-B-R-D spell? A: gbrd

Q: Which three letter word starts with the letters TH and ends with the letter
E? A: th

Pretty neat.

~~~
Xk
Haha, no, I don't have this answering questions for any bot anywhere.

Actually, the reason those are solved isn't exactly what you'd expect. When it
fails to find an answer, it checks to see if any word is upper-case. If it is,
it cuts out all non-letters (to remove the comma at the end, for example) and
then returns that word. The reason it does this is that textCAPTCHA will
occasionally ask questions such as "Which word IN this sentence is in all
caps." So that's why G-B-R-D gets turned into 'gbrd', and why it says 'th'.
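
As a rough sketch of that fallback (the function name is hypothetical and
the real code may differ in details, such as how single letters are
treated):

```python
import re

def uppercase_fallback(question):
    """Return the first all-caps word, stripped to letters and lowercased."""
    for word in question.split():
        # Cut out everything that is not a letter (hyphens, commas, '?').
        letters = re.sub(r"[^A-Za-z]", "", word)
        # str.isupper() is True only if the word has letters and all are caps.
        if letters and word.isupper():
            return letters.lower()
    return None

print(uppercase_fallback("What nonsense word do the letters G-B-R-D spell?"))
print(uppercase_fallback(
    "Which three letter word starts with the letters TH and ends with the letter E?"))
```

The second call returns 'th' because 'TH' is the first all-caps word in the
question, which matches the wrong answer reported above.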

------
bl4k
As soon as I saw the story about textCAPTCHA I knew somebody would tackle it
and solve it.

There have been so many 'captcha alternatives' on HN recently, and all of them
forget that there is a very good reason why current captchas are extremely
distorted images.

~~~
moshezadka
This is a variant on Schneier's fundamental theorem: Any fool can design a
CAPTCHA for which they cannot program a solver. Just like in crypto, the only
true test of a CAPTCHA is whether it survives the test of time after having
been attacked again and again, which is why it's very dangerous to jump on the
bandwagon of a new CAPTCHA scheme (or worse yet -- design your own).

This is one reason why I'm partial to reCAPTCHA: there is a lot of experience
in OCR systems, and we know what the current state of the art is -- and we
know what kinds of things foil it.

~~~
wwortiz
The only problem is you get things like google's captcha system that make you
question whether or not you are human.

------
l0nwlf
Tried: Q: What does Python prompt look like ?

Desired Answer: >>>

Answer Given: None.

~~~
jmatt
I like this sort of approach for community oriented captcha. Ask a question
that anyone within the community would know or could easily find out but a
general purpose spam bot or average person would be unable to solve.

If the community is profitable enough or becomes big enough, some spammer will
spend 15 minutes or 50 dollars on Mechanical Turk and find all the answers. As
others have mentioned, it is an arms race in the end, with each side upping
the challenge back and forth.

------
philfreo
Care to give a little more explanation on how it works?

~~~
Xk
Sure.

I'm assuming you know what parsing is, and what syntax-directed translation
is. If not, the Wikipedia articles (<http://en.wikipedia.org/wiki/Parsing> and
<http://en.wikipedia.org/wiki/Syntax-directed_translation>) offer a reasonable
explanation.

The grammar file is generally laid out as follows:

Question ::= Phrase1 | Phrase2 | Phrase3 ...

Phrase1 ::= 'if' 'the' Noun 'is' Color 'what' 'color' 'is' 'it' [and now
return the word which matched Color]

Noun ::= [sequence of characters]

Color ::= 'red' | 'blue' ...
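
To make the "match a rule, then return the word that matched" idea
concrete, here is a hand-written sketch of Phrase1 in Python. It is not the
real parser-generator output: the color list beyond 'red' and 'blue', the
single-word Noun, and the function names are all assumptions.

```python
import re

# Color ::= 'red' | 'blue' ...   (entries past 'blue' are assumed)
COLORS = {"red", "blue", "green", "yellow", "pink", "white", "black"}

def parse_phrase1(tokens):
    """Phrase1 ::= 'if' 'the' Noun 'is' Color 'what' 'color' 'is' 'it'"""
    if len(tokens) != 9:
        return None
    if tokens[:2] != ["if", "the"] or tokens[3] != "is":
        return None
    if tokens[5:] != ["what", "color", "is", "it"]:
        return None
    # Semantic action: return the word which matched Color.
    color = tokens[4]
    return color if color in COLORS else None

def solve(question):
    """Question ::= Phrase1 | ...   (only Phrase1 is sketched here)"""
    tokens = re.findall(r"[a-z]+", question.lower())
    for rule in (parse_phrase1,):
        answer = rule(tokens)
        if answer is not None:
            return answer
    return "FAILURE"

print(solve("If the cat is red, what color is it?"))
print(solve("Chicken, egg: which came first?"))
```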

The trickiest part of it was getting it to correctly interpret numbers like
'twenty one thousand eight hundred and ninety nine'

That part is basically laid out as

Number ::= Number QuantifiedNumber | Number 'and' QuantifiedNumber |
QuantifiedNumber

QuantifiedNumber ::= NumberGroup | NumberGroup Quantifier

NumberGroup ::= SingleNumber SingleNumber | SingleNumber

SingleNumber ::= 'one' | 'two' | ... 'twenty' | 'thirty' | ...
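
A small Python sketch that follows the shape of the Number grammar above,
written by hand rather than generated. The word tables and the name
parse_number are assumptions, and 'million' is included only as a guess at
what Quantifier covers:

```python
# SingleNumber ::= 'one' | 'two' | ... 'twenty' | 'thirty' | ...
SINGLE = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
    "seventy": 70, "eighty": 80, "ninety": 90,
}
# Quantifier (assumed contents)
QUANTIFIER = {"hundred": 100, "thousand": 1000, "million": 1000000}

def parse_number(text):
    """Sum a sequence of QuantifiedNumbers, optionally joined by 'and'."""
    tokens = text.lower().replace("-", " ").split()
    pos, total, parsed_any = 0, 0, False
    while pos < len(tokens):
        # Number ::= Number 'and' QuantifiedNumber
        if parsed_any and tokens[pos] == "and":
            pos += 1
        # NumberGroup ::= SingleNumber SingleNumber | SingleNumber
        group, count = 0, 0
        while count < 2 and pos < len(tokens) and tokens[pos] in SINGLE:
            group += SINGLE[tokens[pos]]
            pos += 1
            count += 1
        if count == 0:
            return None  # not a number phrase
        # QuantifiedNumber ::= NumberGroup | NumberGroup Quantifier
        if pos < len(tokens) and tokens[pos] in QUANTIFIER:
            group *= QUANTIFIER[tokens[pos]]
            pos += 1
        total += group
        parsed_any = True
    return total if parsed_any else None

print(parse_number("twenty one thousand eight hundred and ninety nine"))
print(parse_number("one one and two"))
```

Like the grammar, this assumes well-formed input: 'one one and two' is
accepted and summed rather than rejected.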

------
finemann
These questions didn't work for me

Q: How many thousands is a million?

Q: How many millions is a billion?

Q: There is a dog, cat and a goat: First animal?

Q: There is a dog, cat and a goat: How many humans?

Q: Black, White: Which is dark?

~~~
Xk
There are a few types of questions that it doesn't solve.

I only had so much time yesterday and so I didn't put in all the different
types of questions. Maybe later I'll go back and get all the questions that it
can't solve.

------
Jach
Nice little app, but here are two I've tried that it fails on:

Please join these two "words" together (without spaces): zrtvyoav and ekozuarn
A: FAILURE

What starts with "bow" and ends with "ser"? A: FAILURE

------
binarymax
Very impressive! 4/5 for me.

Here's the only one that broke:

    
    
              The 6th letter in "aviator" is?
              A: FAILURE

~~~
Xk
Just letting people know I dropped it from 10 questions down to 5 because
right now it's a free google app and I'd rather not pay for CPU-time.

------
chacha102
Failed on "The blue rainjacket is what color?". You might want to allow for
both 'color' and 'colour'.

------
c4urself
I noticed the form matters: so not including a question mark breaks it in "two
plus fifteen equals?"

------
amanuel
* What is 3 + 12 / equal too? FAIL

------
9ec4c12949a4f3
Congrats, I suspected a higher cost of defeating this test.

<http://news.ycombinator.com/item?id=1891026>

