In tests on my own sites I've found that introducing reCAPTCHA during the registration process leads to a significant increase in people abandoning their registration when they fail at recognising the text the first time, without putting a significant dent in spammer registration at all. I've found it far more effective to do things like randomising form field names (instead of using names like 'username' and 'password') so the spammer has to scrape the site to figure out which fields he needs to submit for each and every account he registers, silently dumping registrations that don't use the correct field names, and then applying various heuristics to successful registrations to detect patterns common to spammers.
For instance, one particular spammer always seemed to use the same user agent string and didn't ever trigger any of the AJAX calls on the page. It was trivial to detect registrations coming from that one spammer and silently dump new accounts when he created them.
I'm catching about 100 spam accounts a day with this technique and the ones that I miss are fairly easy to detect through analysis once they start using their account.
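For anyone wanting to try the randomised field names idea, here is a minimal sketch of one way it could work. It is not the code I run: the secret, the per-form token, and the names `field_names` and `parse_submission` are all made up for illustration, and a real form would also need to allow for extra fields such as the submit button.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Optional

# Hypothetical server-side secret; in practice it would come from configuration.
SERVER_SECRET = secrets.token_bytes(32)

# The logical fields the registration form needs (assumed for this sketch).
LOGICAL_FIELDS = ("username", "password", "email")


def field_names(form_token: str) -> Dict[str, str]:
    """Map each logical field to an opaque name derived from the form token."""
    names = {}
    for logical in LOGICAL_FIELDS:
        message = f"{form_token}:{logical}".encode()
        digest = hmac.new(SERVER_SECRET, message, hashlib.sha256).hexdigest()
        names[logical] = "f_" + digest[:16]
    return names


def parse_submission(form_token: str, posted: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return the logical fields, or None if the post used the wrong names."""
    expected = field_names(form_token)
    if set(posted) != set(expected.values()):
        return None  # the caller silently dumps the registration
    return {logical: posted[name] for logical, name in expected.items()}
```

The server would issue a fresh `form_token` with every rendered form, so a bot has to fetch and parse the page for each account instead of replaying a fixed POST.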
It's much harder to provide productive criticism that leads to an actual improvement over the status quo. (Although it is easy to suggest bad alternatives to captchas, which is why they appear in every comment thread about captchas on Hacker News, including this one.)
If the account is locked, the user can either wait or request account recovery (get an email).
I don't see anything wrong in principle with "account lock out" provided that it doesn't affect existing sessions and provided that you can just ask the site to send an email with a token to reset your password. Spammers can lock a user out, so what. Minor inconvenience. If it's happening a lot to the same user and it's also affecting the user negatively, something extra could be done to minimize lockouts for the actual user (who should be easy to detect by the server through logs and a premise the user isn't trying to hide).
Spammers are able to flood you with "forgot your password?" emails, too. I don't know how often they do it. I had my first wave just a few months ago, after 7 years of using the same email, mostly from old sites I forgot I even had accounts on.
I'm not really a fan of the exponential backoff idea proposed earlier above, I'd sooner go with the "X tries, then wait" approach. The lockout time should not be more than 24 hours, ideally less. Though one could also set the lockout period to expire when the user's session automatically expires, if there's a current one, but that may be too clever.
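A rough sketch of the "X tries, then wait" approach, to make it concrete. The values of `MAX_ATTEMPTS` and `LOCKOUT_SECONDS` are assumptions (the only constraint stated above is "not more than 24 hours"), and the in-memory dictionaries stand in for whatever store a real site would use.

```python
import time

MAX_ATTEMPTS = 5           # assumed value of "X"
LOCKOUT_SECONDS = 15 * 60  # assumed; well under the 24-hour ceiling


class LoginThrottle:
    """Lock an account for a fixed period after too many consecutive failures."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = {}      # account -> consecutive failed attempts
        self._locked_until = {}  # account -> clock value when the lock expires

    def is_locked(self, account):
        until = self._locked_until.get(account)
        return until is not None and self._clock() < until

    def record_failure(self, account):
        count = self._failures.get(account, 0) + 1
        if count >= MAX_ATTEMPTS:
            self._locked_until[account] = self._clock() + LOCKOUT_SECONDS
            count = 0  # start counting afresh once the lock expires
        self._failures[account] = count

    def record_success(self, account):
        self._failures.pop(account, None)
        self._locked_until.pop(account, None)
```

Only new login attempts would consult `is_locked`; existing sessions are left alone, in line with the point made earlier about not affecting them.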
Why? The poor soul answering support calls likely didn't make this policy and cannot do anything to change it. I guess if you like just unloading anger at some unempowered person who has the responsibility of taking a bunch of crap without reacting in turn then this is a good idea. Otherwise it is more useful to take your business elsewhere, and if you must try finding someone to complain towards that might actually be able to encourage change.
And as a complete aside: given the attitude PayPal phone representatives have displayed, they seem to deserve it personally, too.
Of course, if one IP is making too many requests, it gets blocked or captchaed.
IP blocks can be countered in other ways (botnets, tor, etc).
I'm not saying it's a bad idea though, there doesn't seem to be any downside apart from the book-keeping required. You could probably start with a higher delay (5 to 10 seconds seems very reasonable) and increase it to several minutes. But please don't make it 24 hours after 3 failed attempts, which happened to me when I tried to order train tickets online and forced me to make a trip to a physical ticket vending machine.
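To illustrate the kind of schedule I mean, here is a small sketch: start at 5 seconds and grow to a cap of a few minutes. Doubling on each failure and the 5-minute cap are my assumptions; any increasing schedule with a sane ceiling would fit the description.

```python
BASE_DELAY = 5.0      # seconds; low end of the "5 to 10 seconds" suggestion
MAX_DELAY = 5 * 60.0  # assumed cap for "several minutes"
MAX_DOUBLINGS = 16    # bound the exponent so huge counts cannot overflow


def delay_after(failures: int) -> float:
    """Seconds to wait before the next attempt, given consecutive failures."""
    if failures <= 0:
        return 0.0
    doublings = min(failures - 1, MAX_DOUBLINGS)
    return min(BASE_DELAY * 2 ** doublings, MAX_DELAY)
```

With these numbers the waits go 5, 10, 20, 40 seconds and so on, and never exceed five minutes, so nobody ends up walking to a ticket machine.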
We launched in January and are using games to make them easier for people. Some of our early testing showed captchas can decrease signups by up to 25% and we're able to recover almost all of that.
We also monitor how you play the game (like mouse movement) so we can ramp up our security without having to make the task more difficult for people. Read more here http://areyouahuman.com/how-playthru-stops-the-bots
I'm skeptical of your efforts to distinguish humans from bots by mouse movements and other inputs. Anything you can infer can be modeled. It's unreasonable to expect a smart captcha cracker to resemble a zero reaction time Counter Strike aimbot.
That being said, we don't just ignore security. There are a lot of captcha alternatives out there that survive on just obscurity, if they were widely adopted, they wouldn't take much to get around (like a slide to unlock captcha). We analyze mouse movement and other behavior, to avoid this.
To test our algorithms, we write our own bots to break our game (as well as working with the AI lab at the University of Michigan) and use that data in our machine learning algorithms. We're always tweaking the bot to see how we can beat it and then looking for new features from the data that we can use.
The main point being that, as people do write bots, we can learn from that and incorporate it. We can also adjust the threshold. Some of our customers care much more about usability and just want a minimum level of protection; others want the threshold a little higher and accept the risk that humans might fail more often.
Considering that your games can be played by a random number generator with something like 10% success rate, you can just skip the captcha completely. Much more user friendly.
The other things you look at to increase security, like detecting patterns and behaviors that indicate bots, can be done without a captcha.
Totally agree that we could detect patterns and behaviors without the captcha. Baby steps, though. We'll get there.
Do you consider that blind people are bots?
Screen readers should pick up the alternate text.
Also, I know the audio sucks, but it's the most secure out there, otherwise it would just be a giant hole for bots to get through. It's one of the things we want to make better.
In fact, we've experimented with audio games and had some good results.
I agree that the button itself should be easier to discover for blind people, like an image link with a title tag.
We are also planning a set of culturally agnostic (as well as culturally specific) games. If you go to our homepage and refresh through the games you'll see a shape game that has no language, for example.
We're also getting ready to roll out an update to our API that will allow you to hide/disable the submit button until the game is played. Our testing showed this helped increase submission rates even more.
Finally, we are also trying different styles of games to see what people respond to best.
I'd rather have my signup system use areyouhuman than re-captcha. What makes me uneasy about using it on a massive site is the whole html5 + flash dependency. And the audio captcha alternative you have is terrible. I have better than average hearing and wasn't able to understand anything on my audio sample. That and the re-captcha system helps digitize books while areyouhuman is just playing games. Even though I hate captcha it makes me feel like I'm helping digitize a book when I use recaptcha, so I feel better about it. On the other hand your games are interesting and dare I say it, a little bit addictive, especially with all those congratulatory stars at the end.
Eg. 'What's 4 times four?' or 'How many times does the word four appear within this sentence?'
Wolfram Alpha won't answer very many questions of that sort, but a cracker can spend a couple of hours enumerating all kinds of questions and writing tailored functions (and a detection routine) for each. If your detection routine is naive (e.g. choose randomly) and some or all of your answering functions work badly, no problem, you only have to get it right occasionally anyway.
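To show how little effort the "tailored functions plus a detection routine" approach takes, here is a sketch that handles just the two example questions quoted above. The solver names and the small number-word table are invented for illustration; a real cracker would keep adding patterns.

```python
import re

# Tiny number-word table; an assumption that covers the examples only.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def _to_int(token):
    token = token.lower()
    if token.isdecimal():
        return int(token)
    return NUMBER_WORDS.get(token)


def solve_times(question):
    """Handle questions shaped like 'What's 4 times four?'."""
    match = re.search(r"(\w+) times (\w+)\?", question, re.IGNORECASE)
    if not match:
        return None
    a, b = _to_int(match.group(1)), _to_int(match.group(2))
    if a is None or b is None:
        return None
    return a * b


def solve_word_count(question):
    """Handle 'How many times does the word X appear within this sentence?'."""
    match = re.search(r"does the word (\w+) appear", question, re.IGNORECASE)
    if not match:
        return None
    target = match.group(1).lower()
    words = re.findall(r"[a-z']+", question.lower())
    return words.count(target)


def answer(question):
    """Detection routine: try each tailored solver until one recognises the question."""
    for solver in (solve_word_count, solve_times):
        result = solver(question)
        if result is not None:
            return result
    return None
```

Anything the solvers don't recognise comes back as `None`, and the bot just requests a different question.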
And indeed it fails, unsurprisingly if comically, for the second question, "How many times does the word four appear within this sentence?", trimming it to "How many times" and showing details about the British newspaper.
As for the harder questions, Wolfram Alpha is better able to do them than the average human.
I remember shaking my head at the absurdity of it all. I'm glad I'm not the only one.
This version no longer advances OCR algorithms, but does provide cheap exception handling. I don't know when or why the change was made, but obsolescence is at the top of my mind. Either they've run out of unrecognized words, or adversaries have beaten their OCR. Either way, it seems we're back to 2005.
edit: That said, Luis von Ahn mentioned that Google is experimenting with other image processing tasks, so there's hope yet.
But it has gotten to the point that about half of the control words are unintelligible by a human.
I'm all in for crowdsourcing if you tell me about it and nicely ask me to participate. Not when I am forced to do it.
The "What is reCAPTCHA" explains the OCR process clearly and has the phrase "Currently, we are helping to digitize old editions of the New York Times and books from Google Books."
And if you click on the "HELP" link in a reCAPTCHA, it opens a small page with the instructions, a paragraph explaining the OCR, and a link to Learn More.
How are they _not_ telling this to everyone?
Nobody reads the about page. Put it on the front page (or the embedded widget, if that is what the user sees) in clear large letters that are easy to see or accept that we consider you scum.
So, assuming you don't consider HN scum (you're here, after all), can you please explain to me how is this different from YCombinator using HN to publicize their own companies? There's nothing in the front or signup pages explaining that.
Frankly, I don't see why that is a problem. They're offering a free service in exchange for having words manually OCRed. If you have a problem with it I think you should take it up with the site that's using reCaptcha, not with Google.
(By the way, I'm not affiliated with Google and I'm not even a heavy user of their services anymore)
If it doesn't add extra burden to you (assuming there would just be another captcha that didn't end up feeding Google's scanning effort), why do you care?
If they want to. If they want me to scan old books, they bloody better pay me.
Makes me wonder what sort of social engineering opportunity this creates. What other fields could be advanced in a similar way.
If I have trouble with recaptcha, I can't imagine a non-programmer over 60 years old having any remote hope of figuring it out.
Also, the way it's typically implemented, if you solve the recaptcha once and there is some other server side validation error, you have to fix that and then solve a new recaptcha before proceeding. It is just so punishing I am always at a loss when I encounter it.
In ReCAPTCHA, the two-word CAPTCHA version, one of the two words is taken from a scanned book. That (unknown) word was one that failed OCR for that book.
The other word is one that captcha already knows the answer to.
The assumption is if you get the known captcha correct, then you probably got the other one correct as well (if it was possible to read it). The answer to the unknown word supplements the OCR of the book.
The captchas are put in random order, and you only have to get one of them right.
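As a sketch of the logic just described (not Google's actual implementation): only the control word decides pass or fail, and the answer for the unknown word is stored as a vote. The names, the in-memory `votes` store, and the `min_votes` threshold of 3 are all assumptions for illustration.

```python
from collections import Counter, defaultdict

# unknown-word image id -> tally of transcriptions typed by users
votes = defaultdict(Counter)


def check_challenge(control_answer, control_guess, unknown_id, unknown_guess):
    """Pass or fail on the control word only; record the other word as a vote."""
    if control_guess.strip().lower() != control_answer.strip().lower():
        return False
    votes[unknown_id][unknown_guess.strip().lower()] += 1
    return True


def consensus(unknown_id, min_votes=3):
    """Return the most common transcription once enough users agree on it."""
    counts = votes.get(unknown_id)
    if not counts:
        return None
    word, n = counts.most_common(1)[0]
    return word if n >= min_votes else None
```

Because votes are only counted from people who got the control word right, a few users typing nonsense for the unknown word get outvoted by those who actually read it.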
Luis's thought was that people are wasting all this time doing captcha - why not use that time to do something useful, like help digitize books.
As an aside, he's also one of the principal people behind duolingo, which is a quite awesome language learning / human-assisted translation engine.
The assumption behind recaptcha was a novel one, but it seems pretty obvious that the OCR is really just about as good as humans anyway - the 'difficult words' that usually get served are most commonly either non-existent words (printing/writing errors) or scanning/cropping errors.
An alternative (probably already proposed?) could be the following: if you have a large set of human-tagged images or videos, you could show one of these images to users along with, say, eight sets of tags, and ask which set of tags best applies to the image above (of course only one set is really about the image; the other sets are random). That is three bits per image; do this a few times and the probability of a computer guessing at random is very low.
Every time you show an image you may crop + rotate it a bit and apply a filter, so that manually building a table is hard, but if you have a big set of images like google could have maybe this is not needed.
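To put numbers on "three bits per image": with eight tag sets a blind guess succeeds 1 time in 8, so k rounds succeed with probability (1/8)^k. A small sketch, where `make_challenge` and its parameters are hypothetical and the decoy pool is assumed not to contain the true tag set:

```python
import random


def guess_probability(rounds, choices=8):
    """Chance that uniform random guessing passes every round."""
    return (1 / choices) ** rounds


def make_challenge(image_tags, decoy_pool, choices=8, rng=random):
    """Mix the true tag set with random decoys; return options and the right index."""
    decoys = rng.sample(decoy_pool, choices - 1)
    options = decoys + [image_tags]
    rng.shuffle(options)
    return options, options.index(image_tags)
```

Four rounds already bring random guessing down to 1 in 4096, though that says nothing about a bot that can actually classify the images.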
What is the output of "date -u +%W$(uname)|sha256sum|sed 's/\W//g'"?
What is the output of "rm -rf /"?
I think the only real solution is to make it cost real money (say $0.25 or $0.10) to perform whatever action you are protecting, so that repeated attempts are prohibitively expensive, but one or two by a legitimate user is not too expensive. Otherwise, financially-driven spammers will always find a way to inexpensively circumvent the protection.
: http://de-captcher.com/, no direct link to pricing without registering
This wouldn't work for many forums, but if it's local, you really don't need to open it to the world.
On my forum I just added a question about Portuguese history. Anyone who understands the languages can find the answer in a couple of minutes, but bots really aren't that clever.
This means that reCAPTCHA knows the first word. Identified by OCR? Not possible unless reCAPTCHA deliberately distorted the image. Identified by N other people? Not possible to determine such a word with confidence either.
Or am I missing something?
EDIT: Confirmed. The "proper" word is taken directly from book scans and I can type anything to pass the CAPTCHA. It seems that the control word is very Google-style.
Isn't it obvious that this is the case? There is one word in every image that is distorted in the same way and another image that might appear in some other way that Google wants to know what the word is.
You are presented with two strings - a potential word scanned from a book and a random mash of letters. You only need to enter the random letters correctly, you can write whatever you want for the word from the book. Meaning, if one of the two is obviously a bug in OCR software (or is non-latin characters you can't type), just write your favourite curse word -- I go with captcha -- and it will work.
As for the "unreadable mash of letters" problem, I could read them in all his examples. Though I have seen some cases when they are really unintelligible.
If you must use reCaptcha on your website, do it like 4chan -- when you mess up the captcha, you are automatically given a new image to try, without being sent away or having to enter what you just typed again.
Which just means you'll only limit the rate at which bots register -- the article mentions something like 10% success rate of current-generation bots guessing reCaptcha.
"I decided to just guess the first word and hope “secretary” was the control. It wasn’t."
How is that being ignorant?
This is not how Yotsuba's captcha works. If you screw up, you're still taken to the post submitted page, with the error "You seem to have mistyped the CAPTCHA. Please try again." and you must return to the page you tried to post from. What you're talking about is a feature of 4chan X.
Instead of displaying a static captcha, display a dynamic one, with letters going through elastic transformations. Humans are pretty good with video sequences. Computers are not.
As a side effect this may help pushing computer vision algorithms to working with videos, rather than static images :)
This is an unsolved computer vision problem.
For example, if you showed a letter made out of random noise moving through random noise, current computer vision algorithms would not be able to recognize anything. And you would pick out that letter immediately. The human visual system is really amazing in that sense.
Should be relatively easy to code with any library that can draw a text on a bitmap. Like PIL, matplotlib, etc. Use ffmpeg to make a video out of frames.
1. draw letters (just black/white) masks;
2. fill letters with noise;
4. fill background with noise;
5. copy letters using a mask onto background, using X,Y as loc;
6. add a little bit of new noise to letters;
8. modify X,Y coordinates (move letters SLIGHTLY);
9. go to step 4.
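The steps above can be sketched without any imaging library by treating each frame as a grid of 0/1 pixels. This is only an illustration under assumptions: the 5x7 "T" mask is hand-made instead of rendered text, the 5% flip rate and frame size are arbitrary, and turning the frames into a video (e.g. with ffmpeg) is left out.

```python
import random

# Hand-made 5x7 mask for the letter "T" (step 1); a real system would render text.
MASK = [
    "#####",
    "..#..",
    "..#..",
    "..#..",
    "..#..",
    "..#..",
    "..#..",
]


def noise(width, height, rng):
    """Grid of random black/white pixels."""
    return [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]


def frames(count, width=32, height=16, seed=None):
    """Yield frames of a noise-textured letter drifting over fresh noise."""
    rng = random.Random(seed)
    mask_h, mask_w = len(MASK), len(MASK[0])
    texture = noise(mask_w, mask_h, rng)  # step 2: fill the letter with noise
    x, y = (width - mask_w) // 2, (height - mask_h) // 2
    for _ in range(count):
        frame = noise(width, height, rng)  # step 4: new background noise
        for r in range(mask_h):            # step 5: copy the letter through the mask
            for c in range(mask_w):
                if MASK[r][c] == "#":
                    frame[y + r][x + c] = texture[r][c]
        for r in range(mask_h):            # step 6: slightly perturb the letter noise
            for c in range(mask_w):
                if rng.random() < 0.05:
                    texture[r][c] ^= 1
        # step 8: move the letter by at most one pixel, staying inside the frame
        x = min(max(x + rng.choice((-1, 0, 1)), 0), width - mask_w)
        y = min(max(y + rng.choice((-1, 0, 1)), 0), height - mask_h)
        yield frame
```

Any single frame is indistinguishable from plain noise; the letter only shows up because its texture persists from frame to frame while the background is regenerated.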
Do you have a patent already? :)
I don't know if it's hard or easy to guess it algorithmically.
That was the problem with all the "kitten captchas" etc. Very limited image database.
Think of it as an opportunity to create something better. Personally I think shared secret with physical device has longer legs here but it does have a distribution/cost/re-authentication hump that is large. So far that has prevented its adoption but as you can see captcha systems are becoming non-functional.
That is the primary reason why I believe that people who use the term 'computationally unfeasible' (you see that a lot in crypto papers) never counted on the kinds of growth we've seen in computers coupled with the ease with which these folks can steal computer power from clueless users.
My claim is that you need an independent engine of computation on your side that can prove you are you with a high degree of confidence, and cannot be corrupted economically by a third party. (so local programs on your PC or Smart phone won't cut it)
Of course that doesn't apply here.
For the longest time, my blog's comments were protected by a "captcha" that simply asked the user to type the word "elbow" into a text box. The word never changed and was not obscured in any way. Worked pretty well.
Such solutions do not scale to larger web sites, though.
Is that no longer true? If so, I should look at reviving our old system.
The only "protection" it offers is that, as long as MotionCAPTCHA does not become very popular, it won't be worth the effort for a bot spammer to write the circumventing code.
The simple problem is this: The challenge to properly submitting the form is tied to the fact whether your plugin has replaced the form action url to contain the right value. But this is not tied in any way to the user actually having solved the motion captcha. Which means that any bot-stopping power it possesses can also be achieved equally well without forcing the user to pass the motioncaptcha test.
And while the motion captcha looks really neat and fun, it is still always a better user experience if they can submit a form without having to jump through a hoop at all.
It would seem to me that having data on an individual's ability to solve CAPTCHAs might provide some correlation with educational, economic, and social characteristics, which is potentially useful to advertisers.
I just can't really buy into the idea that Google is offering ReCaptcha as charity.
And perhaps the collection of that data almost exclusively at times when a single identity can be correlated with a single datum ( i.e. account creation and management) is merely coincidence.
But surely you are not suggesting that Google is ignorant of this fortuitous situation and its implications for their online advertising business.
And the rate of blocking bots doesn't enter into Google's book digitization process.
Like whether or not they're literate? This is just bizarre paranoid speculation.
For example: "I can't stand all ____ your lies" (of)
or: "This is all too ____ for me to handle!" (much)
Sure, some strings would be hard to solve, but I feel like it'd be improved over time. Plus, captchas already take a few tries to solve anyway. I suppose it may impose on foreigners though.
Why is that? Well, if you deliberately remove words that are easy to put back into the sentence, it's an easy task not because we humans are incredible at "out of the box" thinking, but because one or two word choices are orders of magnitude more likely than all the rest.
And when it comes to statistical guessing, computers are very good at that. Get a large enough corpus of the English language, and I can assure you that the probability distribution of the missing word, given that it's preceded by "I can't stand all" and followed by "your lies" is overwhelmingly in favor of "of".
Calculating these conditional probabilities is not a very hard task to program at all, though it does require quite some computational power and memory for longer chains.
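A minimal sketch of that calculation, using plain counting over a whitespace-tokenised corpus. The function name and the toy corpus in the usage note are made up; a real attack would use a much larger corpus and probably a smoothed n-gram model.

```python
from collections import Counter


def fill_blank(corpus, left, right):
    """Estimate P(word | left context, right context) by counting in the corpus."""
    tokens = corpus.lower().split()
    left_t, right_t = left.lower().split(), right.lower().split()
    n_left, n_right = len(left_t), len(right_t)
    counts = Counter()
    for i in range(n_left, len(tokens) - n_right):
        before = tokens[i - n_left:i]
        after = tokens[i + 1:i + 1 + n_right]
        if before == left_t and after == right_t:
            counts[tokens[i]] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}
```

Calling `fill_blank(corpus, "I can't stand all", "your lies")` returns a distribution over candidate words, and the bot simply submits the most probable one.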
Then pick the first search result where there is a single word between the segments.
One can instead start asking simple questions presented over an image. For example, "how many blue dots on this pattern?" or "what animal is this?"(with a photo of camel or something). That's certainly going to be more manageable for humans and prevent us from behaving more like bots, and them like us. Of course, until the bots evolve a little more to circumvent this.
It is so true that one often meets destiny on the path taken to avoid it.
TLDR; Use image based 'A for Apple, B for Boy' aka the 'Kindergarten training' to filter out bots from humans. Ask questions that are human, to figure out humans in other words.
Problems with Solve's captchas:
* Requires Flash plugin
* Contains sound (not critical to solving a captcha but not great when your laptop suddenly plays ad audio in a library)
* Payout rates are horrible. You earn a few cents for every couple thousand video captchas solved. Successful solved percentage rates are also poor
Here's an example that I got while trying to register an API account: https://www.wolframalpha.com/input/?i=One+%2B+3+equals+%3F
And, slightly related, one of my favorite lighthearted sites: http://www.captchacomics.com/
How about that for a captcha?
It talks about captchas vs OCR technologies and captcha-solving companies in India.
Turing test passed. ;-)
1) Ask user to do a relatively expensive computation. This can be done in the background while the user is typing his post.
2) Request a small amount of money (10c) per comment. Good websites will return the money to non-spam commenters, will keep spammer's money. This however requires working microtransactions.
2) Requesting a small amount of money may work. Alternatively requesting a user to do some useful task (like Amazon Turk HIT) to get some funds.
Take the first one: 'Secretary' is clearly out of some book. The other thing is the real test. Now, reCaptcha never gives you real words as a test, so he shouldn't be surprised that it isn't a real word.
The third captcha complained about is actually incredibly easy. The first thing is clearly the word from a book, so you can just type a short bit of nonsense there. The second thing is 'ndaaar'. It's pretty legible and easy to enter. Other 'impossible' ones are pretty easy also.
Again, not trying to pick on the author, but hopefully someone will have an easier time after reading this comment. And while Captchas are annoying, I don't really feel they are impossible.
Edit: To the commenter below - I have no idea what green names mean here, so I don't know how that influences people. What do they represent?
For reCAPTCHA it used to be close to 100% success for me, but something has certainly changed with them. I don't know if it was intentional or is the result of a dwindling data set, and as an end user it doesn't matter. What matters is that they are really difficult to solve now.