I replied to another comment, but since this one is getting upvoted I'll reply here too: I do not want to publish the grammar file yet. Later on, yeah, I can do that. But for now I want to leave it up to the people who run this service to improve it. I simply want to demonstrate that their scheme is not effective; I do not want to cause harm to anyone who uses it.
I'd like to wait a little bit to see if Rob (who it seems runs the site) will make it a little better. There are people who use this service and I don't think it would be right to let spammers attack them.
Granted, it wouldn't be hard for someone else to do the same as me, but I just don't want my work used to do harm to someone else.
You might as well release it. The idea is fundamentally flawed, advantage attacker, and there is nothing they can do about it. You won't be hurting them, you'll sort of be doing them a favor. They've already sunk time into the idea, this is a chance for them to cut their time losses and run.
Though they may take it wrong and try to start an arms race with your code, to which I'd say: An arms race in the CAPTCHA space is already a win for the attackers. In case of tie, spammers win.
Since some people are asking how it works, I'll explain how some of those work/fail.
Q: Which word from list "duckweed, commendations, civilised, receptionists, flours" contains the letter "v"? A: FAILURE
This one fails because the parser recognizes 'duck' as a word and doesn't know what to do with the trailing characters, so the parse fails. Allowing a word to be followed by extra characters would make this one pass, but it also makes the parser twice as slow, so I didn't do that.
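For readers who want to see the shape of this question type, here is a minimal sketch that sidesteps the word-recognition problem by using a regular expression instead of a grammar. This is not the solver described above; the pattern and the function name `solve_contains_letter` are assumptions made for illustration.

```python
import re

# Hypothetical pattern for one phrasing of the "contains the letter" question.
PATTERN = re.compile(
    r'which word from list "([^"]+)" contains the letter "(\w)"',
    re.IGNORECASE,
)

def solve_contains_letter(question):
    """Return the first listed word containing the letter, or None."""
    match = PATTERN.search(question)
    if match is None:
        return None
    words = [w.strip().lower() for w in match.group(1).split(",")]
    letter = match.group(2).lower()
    for word in words:
        # No dictionary lookup is needed, so 'duckweed' is just a string here.
        if letter in word:
            return word
    return None
```

Because the words are treated as opaque strings, an unknown word such as 'duckweed' cannot break the match, at the cost of only handling this one phrasing.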
Q: Hand, jelly or Jennifer: the person's name is? A: jennifer
This is one of the tricky cases, because I convert to lower case when I pass to the parser. So if parsing fails and the word 'name' is in the string, then I return the first capitalized word and hope for the best.
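A rough sketch of that fallback, for illustration only. One detail is an assumption: the sentence-initial word ('Hand' in the example) is always capitalized, so this version prefers a capitalized word that is not first and only falls back to the first word if nothing else qualifies. The function name `guess_name` is hypothetical.

```python
import re

def guess_name(question):
    """Fallback for 'name' questions: pick a capitalized word, lowercased."""
    if "name" not in question.lower():
        return None
    words = re.findall(r"[A-Za-z]+", question)
    capitalized = [i for i, w in enumerate(words) if w[0].isupper()]
    if not capitalized:
        return None
    # Assumption: skip the sentence-initial word when another candidate exists.
    later = [i for i in capitalized if i > 0]
    pick = later[0] if later else capitalized[0]
    return words[pick].lower()
```

This is a heuristic and will guess wrong whenever the name happens to be the first item in the list and another word is capitalized.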
Q: What is fifty three thousand seven hundred and one as digits? A: 53701
This one was an interesting exercise in parsing, as it is grouped ((fifty three) thousand) (seven hundred) and (one). I assume that all numbers are well formed, because otherwise it will happily parse 'one one and two' as the number 4 (1+1+2=4).
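The grouping can be sketched with a simple running accumulator rather than a grammar. This is an illustrative stand-in, not the parser described above; the names `SMALL`, `SCALES`, and `words_to_number` are made up for the example. It shares the same leniency: malformed input is summed rather than rejected.

```python
# Values for zero..nineteen, then twenty..ninety.
SMALL = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
SMALL.update({w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)})
SCALES = {"thousand": 1_000, "million": 1_000_000}

def words_to_number(text):
    """Convert English number words to an int, assuming well-formed input."""
    total = 0    # completed groups, e.g. (fifty three) thousand
    current = 0  # the group being built, e.g. (seven hundred) and (one)
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in SMALL:
            current += SMALL[word]
        elif word == "hundred":
            current *= 100
        elif word in SCALES:
            total += current * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current
```

With this sketch, 'fifty three thousand seven hundred and one' gives 53701, and 'one one and two' gives 4, mirroring the behaviour mentioned above.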
Q: If yesterday was Tuesday, what day is today? A: wednesday
This was just putting the days of the week in order and doing a lookup into the list.
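A minimal sketch of that lookup, assuming the question has already been reduced to a known day and an offset (the parsing step is not shown, and `shift_day` is a hypothetical name):

```python
DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]

def shift_day(day, offset):
    """Return the day that is `offset` days after `day`, wrapping around."""
    return DAYS[(DAYS.index(day.lower()) + offset) % len(DAYS)]
```

"If yesterday was Tuesday, what day is today?" then becomes `shift_day("Tuesday", 1)`.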
Great job. Seeing this solve 10/10, it almost feels like artificial intelligence. You should send a resume to the Wolfram Alpha guys (their hit rate was significantly lower: http://news.ycombinator.com/item?id=1891375)
They don't give questions formatted in other ways, so I didn't include that in the logic. As someone else pointed out, even deleting the question mark would break it. (However, making the question mark optional would be a two-second fix.)
Sorry, but what do you mean by "knowing all the forms of the questions"? It seems like maybe you mean that all the questions are formatted via one of several formulas, like: name <x> of list<a..z>. That would imply your tool simply identifies the format and then applies some specific logic to solve the question. If that is what you mean, I'm unclear how you know all the forms of the questions. Text capture purports over 180 million questions, so isn't it likely you have not found all the possible formats? Or perhaps you know because you have a relationship with the text captcha creator and they provided that info to you? Or better yet, did your comment mean something entirely different?
You are correct, I look for a pattern and apply very very basic logic.
I just got 1,000 questions from the demo page and looked at how they were similar. For example, there are many questions of the form "[list of something]: the [ordinal number] [type] is?" so I put that as a rule in the grammar file. I did this for about twenty different types of questions, each of which has three or so different phrasings.
And yes, they may have 180 million questions, but in reality they have only a hundred or so modulo the specific words they pick (for the items in the list, or for the day of the week, or for the letter that the word starts with).
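To illustrate the "modulo the specific words" point, here is a small sketch that collapses questions into templates by replacing the variable parts with placeholders and counting what remains. The placeholder rules and the sample variants are assumptions for the example, not the actual analysis that was done.

```python
import re
from collections import Counter

DAY = r"\b(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b"

def template_of(question):
    """Replace the variable parts of a question with placeholders."""
    t = question.lower()
    t = re.sub(r'"[^"]*"', '"<X>"', t)  # quoted lists and letters
    t = re.sub(DAY, "<DAY>", t)         # day names
    t = re.sub(r"\d+", "<N>", t)        # digits
    return t

# Two real phrasings from this thread plus two made-up variants.
questions = [
    "If yesterday was Tuesday, what day is today?",
    "If yesterday was Friday, what day is today?",
    'Which word from list "duckweed, flours" contains the letter "v"?',
    'Which word from list "civilised, hand" contains the letter "c"?',
]
counts = Counter(template_of(q) for q in questions)
```

Here four questions collapse into two templates; run over a large sample, the number of distinct templates is what matters, not the number of questions.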
Right now there's no API or anything -- I wrote this just to see if I could. I'm not sure if I'll be putting up an API, because I figure by the time I'd do that I can just release the source for it. But if you really wanted, I could probably provide an API in a couple of days.
Very true. The person writing the questions has the advantage in this one because they can always throw in more variations and I would have to figure out all of the new ones.
There is, however, an inherent limitation to question-asking: the questions must be understandable to the general population. So I would switch from using a parser to just looking for key words and then matching from there. (Which is, for example, how I find names in a list.)
Even if they made it so I only get one fifth the number of correct answers, a 20% success ratio would be high enough for concern. The attackers can just keep trying and trying and if 20% of the time they can register a new bot, then they win. The defender must win a much larger percent of the time.
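The arithmetic behind that can be sketched as follows, assuming each attempt succeeds independently with probability p (a simplification; rate limiting or IP blocking would change the picture):

```python
def p_at_least_one(p, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p) ** attempts

def expected_attempts(p):
    """Average number of tries needed per successful registration."""
    return 1 / p
```

At p = 0.2 a bot needs five tries per account on average, and ten tries succeed at least once roughly 89% of the time.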
But then it occurred to me that someone could solve one by hand, and then use that answer to solve five more captchas that I asked for, and then recurse, eventually solving thousands of a site's captchas.
Given the recent wave of reinventing/breaking captchas, I saw the captcha at the end, and thought: "what if this free app is just a very clever way to get a bunch of intelligent people to answer captchas for a bot?"
Then I realized if somebody went through all this trouble to design a new app like this, they deserve to have me answer captchas for a bot somewhere.
Q: What nonsense word do the letters G-B-R-D spell?
Q: Which three letter word starts with the letters TH and ends with the letter E?
Haha, no, I don't have this answering questions for any bot anywhere.
Actually, the reason those are solved isn't exactly what you'd expect. When it fails to find an answer, it checks to see if any word is in upper case. If one is, it cuts out all non-letters (to remove the comma at the end, for example) and then returns that word. The reason it does this is that the service will occasionally ask questions such as "Which word IN this sentence is in all caps." So that's why G-B-R-D gets turned into 'gbrd', and why it says 'th'.
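A minimal sketch of that fallback, assuming whitespace tokenization and taking the first all-caps token (the function name `uppercase_fallback` is hypothetical):

```python
def uppercase_fallback(question):
    """Return the first all-caps token, letters only and lowercased, or None."""
    for token in question.split():
        # str.isupper() ignores punctuation, so 'G-B-R-D' and 'IN,' both count.
        if token.isupper():
            letters = "".join(ch for ch in token if ch.isalpha())
            return letters.lower()
    return None
```

On the two questions above this yields 'gbrd' and 'th', which happen to be accepted even though the fallback was written for the "all caps" question.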
This is a variant on Schneier's fundamental theorem: Any fool can design a CAPTCHA for which they cannot program a solver. Just like in crypto, the only true test of a CAPTCHA is whether it survives the test of time after having been attacked again and again, which is why it's very dangerous to jump on the bandwagon of a new CAPTCHA scheme (or worse yet -- design your own).
This is one reason why I'm partial to reCAPTCHA: there is a lot of experience in OCR systems, and we know what the current state of the art is -- and we know what kind of things foil it.
I like this sort of approach for community oriented captcha. Ask a question that anyone within the community would know or could easily find out but a general purpose spam bot or average person would be unable to solve.
If the community is profitable enough or becomes big enough, some spammer will spend 15 minutes or 50 dollars on Mechanical Turk and find all the answers. As others have mentioned, it is an arms race in the end. Back and forth, each side upping the challenge.