Captchas Are Becoming Ridiculous (andrewmunsell.com)
301 points by andrewmunsell on July 29, 2012 | 228 comments

CAPTCHA seems relatively pointless at stopping spammers since there are dozens of online services that use human labour to solve them for a dollar per thousand.

In tests on my own sites I've found that introducing reCAPTCHA during the registration process leads to a significant increase in people abandoning their registration when they fail at recognising the text the first time, without putting a significant dent in spammer registration at all. I've found it far more effective to do things like randomising form field names (instead of using names like 'username' and 'password') so the spammer has to scrape the site to figure out which fields he needs to submit for each and every account he registers, silently dumping registrations that don't use the correct field names, and then applying various heuristics to successful registrations to detect patterns common to spammers.

For instance, one particular spammer always seemed to use the same user agent string and didn't ever trigger any of the AJAX calls on the page. It was trivial to detect registrations coming from that one spammer and silently dump new accounts when he created them.
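A minimal sketch of the field-name randomisation described above, assuming the server derives the names from a per-form token with an HMAC. `SERVER_SECRET`, `field_names` and `decode_submission` are hypothetical names for illustration, not what the commenter actually runs.

```python
import hashlib
import hmac
import secrets
from typing import Optional

# Hypothetical per-deployment secret; in practice it would be loaded from config.
SERVER_SECRET = secrets.token_bytes(32)

LOGICAL_FIELDS = ("username", "email", "password")


def field_names(form_token: str) -> dict:
    """Map each logical field to an opaque name derived from the form token."""
    names = {}
    for logical in LOGICAL_FIELDS:
        message = f"{form_token}:{logical}".encode()
        digest = hmac.new(SERVER_SECRET, message, hashlib.sha256).hexdigest()
        names[logical] = "f_" + digest[:16]
    return names


def decode_submission(form_token: str, posted: dict) -> Optional[dict]:
    """Return the logical fields, or None when the expected names are missing."""
    expected = field_names(form_token)
    if not all(name in posted for name in expected.values()):
        return None  # the caller can silently dump the registration
    return {logical: posted[name] for logical, name in expected.items()}
```

A bot that posts the literal names 'username' and 'password' gets `None` back, so it has to fetch and parse the form for every account it registers.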

Randomizing the field names is a great idea but as you said they would just need to scrape the HTML each time they wanted to register. Have you considered sprinkling in random bits of markup to throw off the people using regex and other lazy parsing methods? That might make it a real pain to scrape your forms depending on how the spammer parses your page.

I still have 'username', 'email', and 'password' fields in the form but I hide those elements with CSS, which no scraper is going to bother parsing. When the registration form is submitted the account is essentially hellbanned, they can 'activate' the account via the normal email confirmation process but anything they post disappears into the ether.

I'm catching about 100 spam accounts a day with this technique[1] and the ones that I miss are fairly easy to detect through analysis once they start using their account.

[1] http://i.imgur.com/kdp7Q.png

What happens if someone uses something like LastPass, RoboForm, or any of the other automatic form fillers to legitimately sign up for your website? I would imagine that these would "guess" that username means username and email means email, which may lead to false positives for real users.

One way to avoid this issue might be to plop a hidden input box on the page: if this input has text in it when the form is submitted, you silently drop the registration.
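A rough sketch of that hidden-input (honeypot) check, assuming the submitted form arrives as a plain dict. The field name `website` and the function name are made up for the example.

```python
# Hypothetical name of the input that is hidden from humans with CSS.
HONEYPOT_FIELD = "website"


def should_drop_registration(posted: dict) -> bool:
    """True when the hidden field was filled in, which a human should never do."""
    value = posted.get(HONEYPOT_FIELD, "")
    return bool(value.strip())
```

Picking a name that form fillers don't recognise (so not 'username' or 'email') is what keeps LastPass-style tools from tripping it, though that is an assumption about how those tools guess.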

You could even allow these accounts to exist but then have the content created by them invisible to anyone else.

It's easy to criticize captchas. That's why there are so many articles like this, which we all knowingly nod along to as we read.

It's much harder to provide productive criticism that leads to an actual improvement over the status quo. (Although it is easy to suggest bad alternatives to captchas, which is why they appear in every comment thread about captchas on Hacker News, including this one.)

The best anyone else has come up with is, "You've typed your password wrong one time. Your account is now disabled, please call a customer service representative at 1-900-TIME-WASTE between the hours of 10:30 AM and 3:30 PM Indian Standard Time. Please note that we are closed on Monday, Tuesday, Wednesday, Thursday, and Sunday, and that we have lunch between 11:00 AM and 3:00 PM."

What about just "You've typed your password wrong N times; you must now wait 2^N seconds before trying again"?
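Something like this, as a sketch only. The 24-hour cap and the `LoginThrottle` class are my own assumptions, and the clock is passed in so the logic is easy to check.

```python
def lockout_seconds(failed_attempts: int, cap: int = 24 * 60 * 60) -> int:
    """Wait 2^N seconds after N failures, capped (the cap is an assumption)."""
    if failed_attempts <= 0:
        return 0
    return min(2 ** failed_attempts, cap)


class LoginThrottle:
    """Tracks consecutive failures per account; `now` is seconds on any clock."""

    def __init__(self):
        self._failures = {}  # account -> (failure count, time of last failure)

    def record_failure(self, account: str, now: float) -> None:
        count, _ = self._failures.get(account, (0, 0.0))
        self._failures[account] = (count + 1, now)

    def record_success(self, account: str) -> None:
        self._failures.pop(account, None)

    def may_attempt(self, account: str, now: float) -> bool:
        count, last = self._failures.get(account, (0, 0.0))
        return now - last >= lockout_seconds(count)
```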

Well, how do you prevent someone from locking you out of your account?

You don't.

If the account is locked, the user can either wait or request account recovery (get an email).

Account gets locked just for that computer.

Oh, you mean just for that IP? Torify. Cookies? Bots don't need to accept them. Do you have something better?

Would it work to do that by IP, and allow only X different IPs for an account to try to login on a single day? e.g. if you've tried to login with 10 different IPs on that day, you will no longer be able to login that day. (Of course this would mean saving some extra data.) The biggest problem I see is that this means people can lock you out of your account, which is probably unacceptable.
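A sketch of that per-account IP budget, under my reading that the account stays locked for the rest of the day once more than X distinct IPs have tried it. `DailyIpLimiter` is a hypothetical name and the state lives in memory only.

```python
import datetime
from collections import defaultdict


class DailyIpLimiter:
    """Allow at most `max_ips` distinct IPs per account per calendar day."""

    def __init__(self, max_ips: int = 10):
        self.max_ips = max_ips
        self._seen = defaultdict(set)  # (account, day) -> set of IPs seen

    def allow(self, account: str, ip: str, day: datetime.date) -> bool:
        ips = self._seen[(account, day)]
        if len(ips) > self.max_ips:
            return False  # already tripped today, locked for everyone
        ips.add(ip)
        return len(ips) <= self.max_ips
```

It also shows the downside mentioned above: once an attacker burns through the budget, the real owner's own IP is refused too.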

What my bank does (and PayPal too if I remember correctly) is keep track of my IP, and if it changes then it forces me to enter additional data about my account before letting me continue. (This assumes I got the password correct.) I think one or both also may use some cookie(s) to mitigate changing IPs. They may also make use of leaky browser data (like user agent strings etc.) to help identify me; they have the potential to see a lot since I'm not trying to hide from them.

I don't see anything wrong in principle with "account lock out" provided that it doesn't affect existing sessions and provided that you can just ask the site to send an email with a token to reset your password. Spammers can lock a user out, so what. Minor inconvenience. If it's happening a lot to the same user and it's also affecting the user negatively, something extra could be done to minimize lockouts for the actual user (who should be easy to detect by the server through logs and a premise the user isn't trying to hide).

Spammers are able to flood you with "forgot your password?" emails, too. I don't know how often they do it. I had my first wave just a few months ago, after 7 years of the same email, mostly from old sites I forgot I even had accounts on.

I'm not really a fan of the exponential backoff idea proposed earlier above, I'd sooner go with the "X tries, then wait" approach. The lockout time should not be more than 24 hours, ideally less. Though one could also set the lockout period to expire when the user's session automatically expires, if there's a current one, but that may be too clever.

I feel that there are really two pieces of advice to give on dealing with spammers for the general case... Advice for low-traffic sites and advice for high-traffic sites. I don't have any advice with high-traffic sites since I have no experience with spam at that level (and by high-traffic I mean thousands to millions of uniques per hour), though I don't think the status quo is good enough. With low-traffic sites spam behavior is easy to detect and create a custom solution against. Custom solutions are often better than the popular stuff just by virtue of not having anyone targeting them specifically, and even if that's the case it's still easier to cat-and-mouse if the main options against spam aren't acceptable. Something as dead-simple as loading your form with javascript (or dynamically changing the URL endpoint when the submit button is clicked to something different than what's reported by the form's html attribute...) stops a lot of bots regardless of a captcha, even though you sacrifice the Lynx users. And in my own anecdotal experience I've had more success (less spam bots getting through and leaving a message) with a captcha like "Please join these two "words" together (without spaces): taeiswovd and brhpugqc" than with ReCaptcha even though it'd take less than a minute to add a parser for mine in a bot program. I used to use an arithmetic question but even the dumb bots are on to that one these days--at least the ones after my comment boxes. (I don't even think they added, they just tried numbers 0-99 and sometimes got lucky.)
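For what it's worth, the "join these two words" captcha mentioned above is only a few lines. This is a sketch with made-up function names, assuming the expected answer is kept server-side (for example in the session).

```python
import secrets
import string

_rng = secrets.SystemRandom()


def make_challenge(length: int = 8):
    """Return (prompt, expected_answer) for a join-two-words challenge."""
    words = [
        "".join(_rng.choice(string.ascii_lowercase) for _ in range(length))
        for _ in range(2)
    ]
    prompt = (
        'Please join these two "words" together (without spaces): '
        f"{words[0]} and {words[1]}"
    )
    return prompt, words[0] + words[1]


def check_answer(expected: str, submitted: str) -> bool:
    """Forgive case and surrounding whitespace, but not spaces inside."""
    return submitted.strip().lower() == expected
```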

PayPal doesn't ask you for some extra data. It blocks ("limits") your account so you can't actually use it until you provide them with a copy of your utility bill or something. Terrific when you're on vacation and need to make a PayPal payment. That had me so pissed off that I closed my account (which isn't possible whilst it's "limited" unless you manage to get them on the phone -- good luck with that. -- at least you can then tell the person on the phone that you think they're full of )

>at least you can then tell the person on the phone that you think they're full of

Why? The poor soul answering support calls likely didn't make this policy and cannot do anything to change it. I guess if you like just unloading anger at some unempowered person who has the responsibility of taking a bunch of crap without reacting in turn then this is a good idea. Otherwise it is more useful to take your business elsewhere, and if you must try finding someone to complain towards that might actually be able to encourage change.

By "them" I mean PayPal.

And as a complete aside: given the attitude PayPal phone representatives have displayed, they seem to deserve it personally, too.

Yeah, PayPal does a lot of crap... But I was buying some stuff on a different-than-usual computer and place just a few weeks ago, and it made me provide my bank account number as verification before it would let me send the payment. (Had to sign into my bank account to find a scan of an old check to read it off of..) In a similar vein, Amazon requires reentering your credit card every time you send to a new shipping address.

Why do you call that crap? A normal bank would do something similar. Try traveling to another country and using your ATM card, it likely won't work unless you call your bank ahead of time and tell them of your travel plans.

Having to wait 1s between attempts (per user) could be enough to prevent brute force attacks

Of course, if one ip is making too many requests, it gets blocked or captchaed

That prevents brute force attacks against one specific account, but not against an attack against all users of a site. If you have the capability to make, say, 10k login attempts per second (no idea if that's realistic), you don't slow down to 1 attempt/second, you just attack 10k different accounts once every second.

IP blocks can be countered in other ways (botnets, tor, etc).

I'm not saying it's a bad idea though, there doesn't seem to be any downside apart from the book-keeping required. You could probably start with a higher delay (5 to 10 seconds seems very reasonable) and increase it to several minutes. But please don't make it 24 hours after 3 failed attempts, which happened to me when I tried to order train tickets online and forced me to make a trip to a physical ticket vending machine.

The site should have the capability to detect two cases. Multiple login attempts: show a captcha. Login attempts on the same account from multiple IPs: show a captcha. With the above, a simple wait before the next attempt could solve the issue for most legit attempts, right?

How would you login without cookies anyway?

I believe a cookie is a datum given by the server to the client so that the client may present it to the server when making requests later. This is entirely voluntary on the part of the client; the client is free to pretend it has never received any cookies, or to present a made-up cookie (though the server would likely recognize that as suspicious). Browsers are likely to mindlessly accept cookies from servers and to obey servers' instructions as to when to present the cookies back to the server; but obviously a bot is likely to be more sophisticated, and to not present any cookies other than ones that say "You have successfully authenticated yourself as user xxx".

you put the session id inside the url params :3

But then the session id can leak via referer header :(

He was supposed to give a solution to the parent's question, not necessarily a secure one.

Note that captchas are not (only) to prevent brute forcing passwords, but (also) to prevent bots from signing up for accounts. This would not work against that.

"Please call us between the hours of 9 AM and 5 PM to talk to an account creation specialist."

Take a look at our product at areyouahuman.com

We launched in January and are using games to make them easier for people. Some of our early testing showed captchas can decrease signups by up to 25% and we're able to recover almost all of that.

We also monitor how you play the game (like mouse movement) so we can ramp up our security without having to make the task more difficult for people. Read more here http://areyouahuman.com/how-playthru-stops-the-bots

Do you have any response to claims that your product is easily broken? http://news.ycombinator.com/item?id=4025791

I'm skeptical of your efforts to distinguish humans from bots by mouse movements and other inputs. Anything you can infer can be modeled. It's unreasonable to expect a smart captcha cracker to resemble a zero reaction time Counter Strike aimbot.

Thanks for asking. First, our main focus is on making something more usable for people. We also think captchas are only part of the solution and should be employed with other things (rate limiting, keyword filtering, etc)

That being said, we don't just ignore security. There are a lot of captcha alternatives out there that survive on just obscurity, if they were widely adopted, they wouldn't take much to get around (like a slide to unlock captcha). We analyze mouse movement and other behavior, to avoid this.

To test our algorithms, we write our own bots to break our game (as well as working with the AI lab at the university of michigan) and use that data in our machine learning algorithms. We're always tweaking the bot to see how we can beat it and then looking for new features from the data that we can use.

The main point being that as people do write bots, we can learn from that and incorporate it. We can also adjust the threshold. Some of our customers care much more about usability and just want a minimum level of protection; others want the threshold a little higher and accept the risk that humans might fail more often.

> our main focus is on making something more usable for people

Considering that your games can be played by a random number generator with something like 10% success rate, you can just skip the captcha completely. Much more user friendly.

The other things you look at to increase security, like detecting patterns and behaviors that indicate bots, can be done without a captcha.

I'm not sure you can get a 10% success rate; if you have, let us know, we'd love to hear about it. Note that our demo page has the threshold set to almost nothing and other security features disabled.

Totally agree that we could detect patterns and behaviors without the captcha. Baby steps, though. We'll get there.

And it still identified me as a bot in one out of three tries. And now I shall put the food in the refrigerator when the only items left are a microphone, a stapler and a bottle of household cleaner.

This is so stupid. Your CAPTCHA is unusable by blind people, and your audio CAPTCHA is inaccessible (you have to see the <canvas> element to know where to click to access to the audio CAPTCHA).

Do you consider that blind people are bots?

Actually, we should be ADA compliant and have worked to make this accessible.

Screen readers should pick up the alternate text.

<a href="javascript:AYAH_fallback()" style="position:absolute;left:-10000px;top:auto;width:1px;height:1px;overflow:hidden;">Please complete this audio captcha before submitting the form</a>

Also, I know the audio sucks, but it's the most secure out there, otherwise it would just be a giant hole for bots to get through. It's one of the things we want to make better.

In fact, we've experimented with audio games and had some good results.

Look at the "download MP3" link's URL, the audio captcha comes from google.com/recaptcha.

I agree that the button itself should be easier to discover for blind people, like an image link with a title tag.

If you want to download or listen to the audio CAPTCHA, you have to click on a <canvas/> element, which is totally unusable by a blind person. It's like saying “Look, we've made an elevator, but you have to climb some stairs to take it”.

Given the nature of the games, you may want to consider renaming to areyounorthamerican.com...

Ha. Looks like the domain is available. In reality we were initially targeting users in the US, but have had a lot of interest internationally. Most just use it as is, but we have had others pay for translation or custom games.

We are also planning on a set of culturally agnostic (as well as culturally specific) games. If you go to our homepage and refresh through the games you'll see a shape game that has no language, for example.

This is a neat idea, but these game captchas have some high level of similarity with in-banner-games (at least to me). Not sure if these "catch the monkey" banners are still around but they were extremely annoying. I hope you are doing some testing to make absolutely sure that this great idea is not mistakenly experienced as some form of banner ad.

Thanks. We certainly don't want to become the equivalent of 'catch the monkey'. We constantly do UX testing. Early on it was an issue that people thought we were just an ad on the page, but we updated our start screen and design. By default, we also show our games in a modal window after people click submit, so they know they have to do it. With these two things we saw registrations increase by 40% over recaptcha.

We're also getting ready to roll out an update to our API that will allow you to hide/disable the submit button until the game is played. Our testing showed this helped increase submission rates even more.

Finally, we are also trying different styles of games to see what people respond to best.

Damn, people were kind of hard on you. I think it's very clever. Even though sometimes our inventions don't work as perfectly as we expected them to. Your captcha system is new and fresh and many bots are not prepared for it. I'm sure you'll improve over time, the bots will improve over time, you'll counter, so on and so forth.

I'd rather have my signup system use areyouhuman than re-captcha. What makes me uneasy about using it on a massive site is the whole html5 + flash dependency. And the audio captcha alternative you have is terrible. I have better than average hearing and wasn't able to understand anything on my audio sample. That and the re-captcha system helps digitize books while areyouhuman is just playing games. Even though I hate captcha it makes me feel like I'm helping digitize a book when I use recaptcha, so I feel better about it. On the other hand your games are interesting and dare I say it, a little bit addictive, especially with all those congratulatory stars at the end.

Thanks. The audio challenge does suck and we're working on making it better, but still secure.

I wonder if a good approach would be to provide an audio-only website if you use the audio-only CAPTCHA? Cracking one word is one thing, but if the website then goes on with "Press 1 for your account balance..."

I don't disagree, but what is your point? That one should never say "X is broken" without inventing a Y that is better than X?

It's fine to say "X is broken", but I doubt you can find anyone on Hacker News who didn't read an article like this years ago and thus already knows X is broken. This article is beating a dead horse. The fact remains, captchas are the worst technique to prevent bots from filling out a form, except for all the other ones.

Yes, though, to be fair, this particular horse does seem to find a way to get deader as the years go on.

When a horse dies, it doesn't stay in a constant preserved state forever, it slowly decays :)

Right now I think the best solution is no captcha. HN doesn't use one for the sign up form. Yes, the HN community is flooded, but for the most part I don't think a captcha would change that.

HN is a (largely) community moderated content aggregation site. What if you don't have the benefit of a community to moderate? If you're an email service and want to stop bots from signing up and spamming the world at large, how would you do that?

No, it is easy just to throw in a CAPTCHA instead of trying to solve the problem without abusing your users. I've always said that captchas were a wrong idea and should not be there at all.

Why not pose questions with answers that are obvious to humans but difficult for AI?

Eg. 'What's 4 times four?' or 'How many times does the word four appear within this sentence?'

Natural Language Processing has become better in recent years.

> http://www.wolframalpha.com/input/?i=what%27s+4+times+four

I'm not sure you'd want to involve NLP. I think it's more down to the fact that it's relatively easy to reverse engineer the function generating the question and get the parameters that way.

Wolfram Alpha won't answer very many questions of that sort[0] but a cracker can spend a couple of hours enumerating all kinds of questions and writing tailored functions (and a detection routine) for each. If your detection routine is naive (e.g. choose randomly) and some or all of your answering functions work badly, no problem, you only have to get it right occasionally anyway.

[0] And indeed it fails unsurprisingly if comically for the second question How many times does the word four appear within this sentence?, trimming it to How many times and showing details about the British newspaper.

An even more outrageous but possible solution has already been implemented here.


That one is actually pretty easy to beat - the OCR is easy enough, and you can refresh the page until you get an easy question. Some of them are very simple integer arithmetic exercises.

As for the harder questions, Wolfram Alpha is better able to do them than the average human.

I'm not sure about which OCR you're talking about but I've tried ABBYY and because of the fractions ie top and bottom halves the thing craps out. Granted, if you get a single line problem it could easily be solved. Definitely there's the problem of application as well. Since this is a quantum bit service Site, you can expect people to know a minimum of integration. But I don't expect Facebook login to have anything even close to 8th grade mathematics.

I always seem to strike it lucky with that site. I can remember it coming up twice on HN (once now, once a while ago - http://news.ycombinator.com/item?id=2290466), and both times I got a one-line equation. I didn't save the equation I got this time, but it was maybe five terms long and most of the atoms were zeros.

Did you mean for humans or for English speaking humans?

Given that reCAPTCHA uses English words and Roman lettering, it's already pandering to a particular audience.

I remember it was demonstrated that these questions can be easily answered with a high degree of accuracy using Wolfram Alpha and Google search.

So, try manually answering one with WA and Google until you find one which can't be answered?

Timeouts of a few seconds are a decent alternative to captchas. They can't be beaten.

I wrote an OCR for the previous generation of recaptcha (the one that preceded this one) that cracked it with a 92% success rate. The same day I finished the OCR, I watched my 50 year old mom try to input recaptcha for the registration of some website to view family photos. She eventually gave up, and that website lost another potential user.

I remember shaking my head at the absurdity of it all. I'm glad I'm not the only one.

Isn't that a result of the massive deployment of reCaptcha? It appears that all the easy words have already been solved with enough confidence, so there's only garbled scans left. Add more books?

That said, there are plenty of alternative solutions with good success rates (and far lower abandonment rates), like requiring the answer to a simple question (not math), photo captchas, randomizing inputs, using javascript techniques and honeypot forms. Captchas are so popular because they are easy to implement.

That's what I was thinking. Originally reCAPTCHA had the control word be a scan as well, which meant that an attacker had to beat their OCR. Now that the control word is computer generated, the system has devolved into a regular CAPTCHA that further asks humans for recognition task work.

This version no longer advances OCR algorithms, but does provide cheap exception handling. I don't know when or why the change was made, but obsolescence is at the top of my mind. Either they've run out of unrecognized words, or adversaries have beaten their OCR. Either way, it seems we're back to 2005.

edit: That said, Luis von Ahn mentioned that Google is experimenting with other image processing tasks, so there's hope yet.

One nice thing about recaptcha is that you know those cut off words don't matter (badly OCR'd) and can just enter gibberish for them.

But it has gotten to the point that about half of the control words are unintelligible by a human.

Sure, but you shouldn't undermine the mission of successfully recording words that can't be identified. It's a pretty noble goal, actually.

Their goal isn't more important than my goal of resetting my password or logging in.

I, for one, refuse to be used as the source of free labor by Google just because I want to sign up for a website.

That's ridiculous. reCAPTCHA isn't making you do more work, it's just making work you have to do anyway useful. The benefit Google gets from the OCR work is microscopic, and the world gets old books and New York Times articles.

See, I would have much less of a problem with this if Google actually told this to everyone. Instead, they silently use you for free labor. That's dishonest.

I'm all in for crowdsourcing if you tell me about it and nicely ask me to participate. Not when I am forced to do it.


The "What is reCAPTCHA" explains the OCR process clearly and has the phrase "Currently, we are helping to digitize old editions of the New York Times and books from Google Books."

And if you click on the "HELP" link in a reCAPTCHA, it opens a small page with the instructions, a paragraph explaining the OCR and a link to Learn More.

How are they _not_ telling this to everyone?

I'm actually wrong on this one, I see. Might be because I don't even really look at the widget itself anymore, I just type it.

Ask twenty regular users who have been forced to fill that crap out.

Nobody reads the about page. Put it on the front page (or the embedded widget, if that is what the user sees) in clear large letters that are easy to see or accept that we consider you scum.

(They do put it on the front page, I already pointed that out)

So, assuming you don't consider HN scum (you're here, after all), can you please explain to me how is this different from YCombinator using HN to publicize their own companies? There's nothing in the front or signup pages explaining that.

Frankly, I don't see why is that a problem. They're offering a free service in exchange for having words manually OCRed. If you have a problem with it I think you should take it to the site that's using reCaptcha, not with Google.

(By the way, I'm not affiliated with Google and I'm not even a heavy user of their services anymore)

It says right there on the reCAPTCHA widget "Stop spam, read books". How is that dishonest?

By that logic, I think people should also shred newspapers to make sure someone isn't reading them for free after they have finished them, and then soak them in water and add a little sand and oil to make them impossible to recycle and unusable as fuel in furnaces. And stop releasing source code, at least under any sane license.

If it doesn't add extra burden to you (assuming there would just be another captcha that didn't end up feeding Google's scanning effort), why do you care?

Everyone entering something different is just as valid as entering nothing at all for determining whether a word is identifiable. Even more so actually, since a null response has at least four distinct meanings that I can think of. (user took too long to enter captcha, user couldn't read OCR word, user couldn't read control word, and user didn't feel like entering OCR word)

If I encounter a broken captcha that I have to work around, it's not me who's undermining the mission.

How do you work around a broken captcha? You don't; you just hit the little refresh icon most have and get a new one.

I would destroy it if I could.

If they want to. If they want me to scan old books, they bloody better pay me.

It's funny that we are advancing the science of OCR and computer vision by funneling grey/blackmarket dollars into the field to break captchas.

Makes me wonder what sort of social engineering opportunity this creates. What other fields could be advanced in a similar way.

"Before we accept your comment we ask you to fold this protein"?

object recognition? As in, point out object X in this image.

And they are becoming less and less effective. Since they are so ubiquitous, users don't think twice about completing them. Therefore, infected users are going around unwittingly solving captchas served to them by their associated bot nets.

Wow, I've never heard of that happening before. That would definitely be an issue... Unfortunately I'm not sure how we would end up solving that issue-- there's always going to be the users that don't realize they are being exploited, to solve captchas or otherwise.

I find recaptcha so frustrating that I will always abandon the account unless it's something that I really really care about.

If I have trouble with recaptcha, I can't imagine a non-programmer over 60 years old having any remote hope of figuring it out.

Also, the way it's typically implemented, if you solve the recaptcha once and there is some other server side validation error, you have to fix that and then solve a new recaptcha before proceeding. It is just so punishing I am always at a loss when I encounter it.

There was an interesting talk that mentioned CAPTCHA by one of its creators, Luis von Ahn, at the AAAI-12 conference on AI and robotics this past week.

In ReCAPTCHA, the two-word CAPTCHA version, one of the two words is taken from a scanned book. That (unknown) word was one that failed OCR for that book.

The other word is one that captcha already knows the answer to.

The assumption is if you get the known captcha correct, then you probably got the other one correct as well (if it was possible to read it). The answer to the unknown word supplements the OCR of the book.

The captchas are put in random order, and you only have to get one of them right.
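As a rough sketch of the two-word scheme described above (not Google's actual implementation): the control word gates the response, and the other word is kept as a vote towards the book transcription. The function names and the three-vote agreement threshold are assumptions.

```python
from collections import Counter


def grade_response(typed_words, control_position, control_word):
    """Return (passed, vote); vote is the typed guess for the unknown word."""
    if len(typed_words) != 2:
        return False, None
    if typed_words[control_position].lower() != control_word.lower():
        return False, None
    return True, typed_words[1 - control_position]


def consensus(votes: Counter, min_votes: int = 3):
    """Accept a transcription once enough users agree (threshold is assumed)."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None
```

Only the control word is checked, which is why a reasonable guess (or gibberish) for the scanned word still gets you through.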

Luis's thought was that people are wasting all this time doing captcha - why not use that time to do something useful, like help digitize books.

As an aside, he's also one of the principal people behind duolingo, which is a quite awesome language learning / human-assisted translation engine.

Yeah. The actual problem as I see it is that people have been trained that you have to get captchas "right", where with these recaptchas all you really need is a reasonable guess because there is no 'right' (and nowhere does it say that).

The assumption behind recaptcha was a novel one, but it seems pretty obvious that the OCR is really just about as good as humans anyway - the 'difficult words' that usually get served are most commonly either nonexistent words (printing/writing errors) or scanning/cropping errors.

Luis von Ahn is one of the very few tech entrepreneurs from Guatemala. He's killing it.

There is no stupid "hey I've an idea!" comment in this thread, so I'll offer one.

An alternative (probably already proposed?) could be the following: if you have a large set of human-tagged images or videos, you could show these images to users along with, like, eight sets of tags, and ask: which set of tags best applies to the image above? (Of course only one set is really about the image; the other sets are random.) That is three bits per image; do this a few times and the probability of a computer guessing at random is very low.

Every time you show an image you may crop + rotate it a bit and apply a filter, so that manually building a table is hard, but if you have a big set of images like google could have maybe this is not needed.
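To put numbers on the "three bits per image" idea, here is a small calculation, assuming each round is an independent pick among eight equally likely tag sets.

```python
import math
from fractions import Fraction


def random_guess_probability(choices: int = 8, rounds: int = 1) -> Fraction:
    """Chance that uniform random guessing passes every round."""
    return Fraction(1, choices) ** rounds


def bits_of_evidence(choices: int = 8, rounds: int = 1) -> float:
    """log2(choices) bits per round, e.g. 3 bits for eight tag sets."""
    return math.log2(choices) * rounds
```

With three images in a row a random guesser passes 1 time in 512, and with four images 1 time in 4096.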

At least the archlinux.org forum has the geekiest "Captcha" system I've come across so far:

    What is the output of "date -u +%W$(uname)|sha256sum|sed 's/\W//g'"?
(This is not to prevent bots, but to prevent human spammers)

    What is the output of "rm -rf /"?

That's a good question! I think I know but I'm not going to try it.

I think that the problem that we will always run into, is that any human-performable task will either be cracked by someone writing a bot, or made trivially inexpensive by apps that charge as little as $0.00139 per solved captcha (see [1]). Microsoft has implemented a tagging-type captcha, ASIRRA (see [2]), for which you can hire out the results for $0.004 per solved captcha[3].

I think the only real solution is to make it cost real money (say $0.25 or $0.10) to perform whatever action you are protecting, so that repeated attempts are prohibitively expensive, but one or two by a legitimate user is not too expensive. Otherwise, financially-driven spammers will always find a way to inexpensively circumvent the protection.

[1]: http://www.deathbycaptcha.com/user/order

[2]: http://research.microsoft.com/en-us/um/redmond/projects/asir...

[3]: http://de-captcher.com/, no direct link to pricing without registering

I wondered if it would be possible to use something like Hashcash to require a certain level of CPU usage from the user agent? It doesn't stop spam, but it's unobtrusive to the user and would slow spammers down.


Once I found a nice captcha replacement (can't remember where it was, tho). It worked like this: at the end of the form, there were 6 to 9 playing cards and the text "Click on the seven of spades". This fitted the theme of website which had something to do with poker or magic tricks (can't really remember what it was) but this can be done with a lot of other stuff that is well known by people and not by robots (e.g. 6 photos of animals and the question "Click on the dog", etc.)

A forum I run in Ireland doesn't have a need for Captchas. The forum is very regional and niche - no genuine foreign visitors - so we blocked every country apart from Ireland and the UK from posting comments. Result - no spam.

This wouldn't work for many forums, but if it's local, you really don't need to open it to the world.

So, if I was an Irish guy travelling (or even an emigrant), I'd be blocked? That seems unfortunate.

On my forum I just added a question about Portuguese history. Anyone who understands the language can find the answer in a couple of minutes, but bots really aren't that clever.

I frequent a couple of niche forums where the rule is that you have to write an introductory post of at least X sentences (or come up with original witty answers to a list of questions etc.) before you are allowed to post anything else. All introductions are read by admins or volunteers and approved as long as they seem to be written by a human. No spammers and very few trolls.

Wow - such a nice and simple approach, bet this would work for many similar small forums..

Similarly, if you run the forum on a LAN instead of the Internet, you also don't get any spam! :-/

that's great

The paranoid, tin foil hat wearing part of me has always put post Google acquisition Recaptcha in the must-be-part-of-the-long-arm-of-Google-tracking category of services encouraging me to do my very best to avoid allowing it to run in my normal browser session.

I can't seem to understand the "Onightsl secretary" CAPTCHA. We all know that the first word is extremely hard to identify, and according to the author "secretary" wasn't the control word.

This means that reCAPTCHA knows the first word. Identified by OCR? Not possible unless reCAPTCHA deliberately distorted the image. Identified by N other people? Not possible to determine such a word with confidence either.

Or am I missing something?

Also it seems that there is exactly one distorted word and exactly one "proper" word. I would assume the distorted word is the control word. I should try a few reCAPTCHAs to see if this is a correct assumption.

EDIT: Confirmed. The "proper" word is taken directly from book scans and I can type anything to pass the CAPTCHA. It seems that the control word is very Google-style.

> unless reCAPTCHA deliberately distorted the image.

Isn't it obvious that this is the case? There is one word in every image that is distorted in the same way and another image that might appear in some other way that Google wants to know what the word is.

He seems a bit ignorant on how reCaptcha works nowadays.

You are presented with two strings - a potential word scanned from a book and a random mash of letters. You only need to enter the random letters correctly, you can write whatever you want for the word from the book. Meaning, if one of the two is obviously a bug in OCR software (or is non-latin characters you can't type), just write your favourite curse word -- I go with captcha -- and it will work.

As for the "unreadable mash of letters" problem, I could read them in all his examples. Though I have seen some cases when they are really unintelligible.

If you must use reCaptcha on your website, do it like 4chan -- when you mess up the captcha, you are automatically given a new image to try, without being sent away or having to enter what you just typed again.

Which just means you'll only limit the rate at which bots register -- the article mentions something like 10% success rate of current-generation bots guessing reCaptcha.

This is common knowledge for people like you and me (that only one of the two words counts), but the vast majority of "average" visitors do not know this and will be unable to solve the captcha.

From the article: "It’s important to note the way reCAPTCHA works. Each user (or bot) is presented with a control word, and a word unrecognized by OCR. This control word is already known to Google (who runs reCAPTCHA). If you get this first word right, it is assumed that you get the second word correct as well. So, in reality, you only need to guess the key word correctly.

I decided to just guess the first word and hope “secretary” was the control. It wasn’t."

How is that being ignorant?

It's ignorant because he then goes on to refresh recaptchas that are "impossible" to solve where the OCR word was gibberish (cut-off, etc.) but the key word was discernible.

> If you must use reCaptcha on your website, do it like 4chan

This is not how Yotsuba's captcha works. If you screw up, you're still taken to the post submitted page, with the error "You seem to have mistyped the CAPTCHA. Please try again." and you must return to the page you tried to post from. What you're talking about is a feature of 4chan X.

I see a possible solution that may work in a short term:

Instead of displaying a static captcha, display a dynamic one, with letters going through elastic transformations. Humans are pretty good with video sequences. Computers are not.

As a side effect this may help pushing computer vision algorithms to working with videos, rather than static images :)

Do OCR on every frame, perform majority voting on the result. You just made the spammer's task easier.
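A rough sketch of the voting step, assuming some OCR engine has already produced one guess per frame (the readings below are invented, and no real OCR library is called). It keeps the readings of the most common length and votes character by character.

```python
from collections import Counter

def majority_vote(frame_readings: list[str]) -> str:
    """Combine per-frame OCR guesses by majority vote at each character position."""
    # Keep only readings of the most common length so positions line up.
    length = Counter(len(r) for r in frame_readings).most_common(1)[0][0]
    usable = [r for r in frame_readings if len(r) == length]
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*usable)
    )

# Hypothetical noisy OCR output for five frames of the same word.
readings = ["hcuse", "house", "hovse", "house", "h0use"]
print(majority_vote(readings))  # house
```

Each frame is wrong in a different place, so the errors cancel out. That is the sense in which more frames help the attacker.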

Here is an example of a problem that would be hard for a computer to solve:


How so? The fixed part can be easily extracted. If it also moved (while morphing) then I guess it would be hard, but fixed dots in a moving background would take just a few frames for a computer to solve.

You just diff each frame and keep the parts that don't change much. Very simple to solve.
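A toy sketch of that attack, assuming binary frames (1 = black) in which the glyph stays fixed and everything else is random noise. The 3x3 glyph and the frame generator are made up for illustration.

```python
import random

def stable_black_pixels(frames):
    """Return a mask that is 1 only where a pixel is black in every frame."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [int(all(frame[y][x] == 1 for frame in frames)) for x in range(width)]
        for y in range(height)
    ]

def make_frame(glyph, rng):
    """Glyph pixels stay black; every other pixel is fresh random noise."""
    return [
        [1 if cell else rng.randint(0, 1) for cell in row]
        for row in glyph
    ]

# Hypothetical 3x3 "H" glyph hidden in noise.
glyph = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
]
rng = random.Random(0)
frames = [make_frame(glyph, rng) for _ in range(40)]
recovered = stable_black_pixels(frames)
print(recovered == glyph)
```

A background pixel survives only if it happens to be black in all 40 frames, which is vanishingly unlikely, so the fixed part falls out almost immediately. This breaks down once the glyph itself moves or morphs between frames.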

Just move letters slightly. And make them morph a bit. Would still be obvious to a human, but a computer trying to average anything would fail miserably.

This is an unsolved computer vision problem.

which pixels stay black in all frames?

Captchas are hard because there's only so much an algorithm can extract from spatial information. Computers are excellent with temporal data, given essentially unlimited memory for past video frames. Computers benefit more from increased information than humans do.

This is computer vision, and I'm somewhat of an expert in the area. I can tell you that video sequence recognition is a _much_ harder problem than image recognition.

For example, if you showed a letter made out of random noise moving through random noise, current computer vision algorithms would not be able to recognize anything, while you would pick out that letter immediately. The human visual system is really amazing in that sense.

It should be possible to do this with an animated GIF. Do you have any references/examples I could use as a starting point?

Oh. I remember reading some vision paper, and in the supplementary materials there were a couple of videos with letters moving. I doubt I'll be able to find it that easily.

Should be relatively easy to code with any library that can draw a text on a bitmap. Like PIL, matplotlib, etc. Use ffmpeg to make a video out of frames.

1. draw letter masks (just black/white); 2. fill letters with noise; 3. fill background with noise; 4. copy letters using a mask onto background, using X,Y as loc; 5. add a little bit of new noise to letters; 6. modify X,Y coordinates (move letters SLIGHTLY); 7. go to step 3.

This is simple, brilliant. The best solution I have seen ever.
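A stdlib-only sketch of that recipe, assuming a hard-coded letter mask instead of text drawn with PIL, and plain lists of 0/1 pixels instead of image files. The mask, frame size and `flip_prob` are invented for illustration; turning the frames into a GIF or video is left out.

```python
import random

# Hypothetical 5x5 mask for the letter "T"; 1 marks letter pixels.
MASK = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

def noise(rng, height, width):
    """A height x width grid of random black/white pixels."""
    return [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]

def make_frames(mask, width, height, count, seed=0, flip_prob=0.05):
    rng = random.Random(seed)
    mh, mw = len(mask), len(mask[0])
    texture = noise(rng, mh, mw)           # letter filled with noise
    x, y = 0, (height - mh) // 2           # starting location of the letter
    frames = []
    for _ in range(count):
        frame = noise(rng, height, width)  # fresh background noise each frame
        for j in range(mh):                # copy the letter through the mask
            for i in range(mw):
                if mask[j][i]:
                    frame[y + j][x + i] = texture[j][i]
        frames.append(frame)
        for j in range(mh):                # add a little new noise to the letter
            for i in range(mw):
                if rng.random() < flip_prob:
                    texture[j][i] ^= 1
        x = min(x + 1, width - mw)         # move the letter slightly
    return frames

frames = make_frames(MASK, width=20, height=9, count=10)
print(len(frames), len(frames[0]), len(frames[0][0]))
```

Any single frame is just noise, because the letter's texture is statistically identical to the background. The letter only shows up through its coherent motion across frames.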

Do you have a patent already? :)

I had just noticed that captchas were getting worse. I always thought sites should just make the user perform an operation. Display a hard to read but still legible "type the second letter of the third word". 'e' would be the correct response. There's got to be a reason this isn't done already, does someone know why?

Parsing and responding to a question like that isn't difficult when the questions are algorithmically generated from a finite number of human-designed types.

Bots would spam your site once every 26 tries.

I was going to say don't most modern sites limit multiple attempts in succession? Then I realized that spammers have thousands of computers at their disposal so the successive attempts would not come from the same IP. Shoot this is a hard problem...

Spammers have thousands of machines.

More than that, since they would guess one of {e,t,o,i,n}.

It assumes that the user reads English fluently.

Or how about short stories, something like; "Brian came home very late. He couldn't find Mr. Hat. While he was thinking where his cat might be, he noticed the window was wide opened." What's the name of the cat?

I don't know if it's hard or easy to guess it algorithmically.

The problem is generating enough unique questions. If you only have 10 questions, they can just hard-code them.

That was the problem with all the "kitten captchas" etc. Very limited image database.

I just thought it was me. There's hardly ever a captcha now, that I can get on the first try. I loathe captchas.

They are ridiculous, but they are so because what they are trying to achieve can no longer be easily achieved by solving the "visual acuity" problem.

Think of it as an opportunity to create something better. Personally I think shared secret with physical device has longer legs here but it does have a distribution/cost/re-authentication hump that is large. So far that has prevented its adoption but as you can see captcha systems are becoming non-functional.

I'd be curious to see the ability for a machine to solve picture based captcha systems. For example, given a lineup of 10 pictures of pets, choose the three that are cats. I've seen them before, just not widely implemented.

Well Google just talked about their code which identified kittens in Youtube videos. (http://www.nytimes.com/2012/06/26/technology/in-a-big-networ...)

That's actually pretty interesting. With time the number of computers/processors required to do these tasks will go down, but for now and based on that experiment, it almost seems more efficient to use picture based captchas.

I don't think you 'get it' Andrew :-) The folks who bust captchas, 16,000 machines is chump change, they run botnets of hundreds of thousands of machines, they dynamically buy EC2 instances, they make a lot of money.

That is the primary reason why I believe that people who use the term 'computationally unfeasible' (you see that a lot in crypto papers) never counted on the kinds of growth we've seen in computers coupled with the ease with which these folks can steal computer power from clueless users.

My claim is that you need an independent engine of computation on your side that can prove you are you with a high degree of confidence, and cannot be corrupted economically by a third party. (so local programs on your PC or Smart phone won't cut it)

Exponential CPU growth rates are factored into cryptographic protocols. As long as growth rates don't become super-exponential they're safe.

Cryptosystems that require 2^128 tries to crack aren't going to be much easier even if you have a billion machines; you really need a theoretical breakthrough. :)

Of course that doesn't apply here.

Even just randomly picking 3 images will be successful about 1% of the time which is more than enough for a bot.

Good point. 10 choose 3 = 120.
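The arithmetic, as a quick sketch. It assumes exactly three of the ten pictures are cats and that the bot knows to pick three at random; the function name is made up.

```python
from fractions import Fraction
from math import comb

def random_pass_rate(total_images: int, target_images: int) -> Fraction:
    """Chance of picking exactly the right subset by uniform random choice."""
    return Fraction(1, comb(total_images, target_images))

rate = random_pass_rate(10, 3)
print(comb(10, 3), rate, f"{float(rate):.2%}")
```

One success in 120 tries is a little under 1%, which matches the estimate above.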

Many months into this experiment I barely get 2 spam comments a month on my blog, I totally respect reCAPTCHA, but demanding JavaScript, doing some minimal tests and perhaps an amount of computation (1 sec on iPad) is the FIRST thing we should be doing http://samsaffron.com/archive/2011/10/04/Spam+bacon+sausage+...

Almost any unique solution will work well for a small site, because it's not worth the effort to program a bot specifically for your system.

For the longest time, my blog's comments were protected by a "captcha" that simply asked the user to type the word "elbow" into a text box. The word never changed and was not obscured in any way. Worked pretty well.

Such solutions do not scale to larger web sites, though.

sure they do, you randomize algorithms or add proof of work.

But then it's not "such" a solution anymore.

So back when I was a PhD student, we had an idea for a Captcha system that would be better than the current system and all alternatives by quite a margin, but we talked to some experts in the area, and they said that none of this matters because captchas are normally broken by feeding them to mturkers or people wanting to get into porn sites.

Is that no longer true? If so, I should look at reviving our old system.

What was your idea?

Though I agree those would be hard captchas if you tried to complete them properly, it's not overly hard to guess which one is the control: the control is always bold, very likely a nonsense word, and never cut off or partially visible, because the control is generated by an algorithm rather than a scanner. The captcha isn't really designed to be regenerated until both words make sense.

Every time I read something like this I pine for never having the opportunity to turn MotionCAPTCHA[0] into a real (secure) system. Everyone said it couldn't be done, but I know it can!

[0] http://www.josscrowcroft.com/demos/motioncaptcha/

MotionCAPTCHA doesn't actually solve the CAPTCHA problem, it just adds a hurdle for human users to your submission form. There is nothing in your "roadmap" that prevents bot spammers from trivially circumventing this hurdle.

The only "protection" it offers is that, as long as MotionCAPTCHA does not become very popular, it won't be worth the effort for a bot spammer to write the circumventing code.

However, you can get exactly the same amount of protection without inconveniencing your users by obfuscating field names and doing the same JavaScript form action replacing trick you already do.

The simple problem is this: The challenge to properly submitting the form is tied to the fact whether your plugin has replaced the form action url to contain the right value. But this is not tied in any way to the user actually having solved the motion captcha. Which means that any bot-stopping power it possesses can also be achieved equally well without forcing the user to pass the motioncaptcha test.

And while the motion captcha looks really neat and fun, it is still always a better user experience if they can submit a form without having to jump through a hoop at all.

Clever. Not knowing the details of how computers beat CAPTCHAs, can you explain why it would be difficult to create a program to trace the path?

I've done some research in this area with examining user input mechanics. This particular example would be solvable using B-splines [1], as one possible approach. The "path CAPTCHA" even provides some great computational hints, by indicating the starting point as a circle!


I don't think it would, really...

Your article's title should be: Captchas are sometimes so ridiculous, that they take 1 second longer to solve because you need to refresh them which is besides the point that is that we are due for a new way of thinking about authenticating human presence on the web.


ReCaptcha is free from Google. I would hazard to guess that it is free because the value of the data Google collects is greater than the cost of providing the service. What I notice is that user convenience and blocking bots don't directly enter into the value Google derives, they are only marketing points - i.e. what matters to Google is that the perception of ReCaptcha is sufficiently positive that a stream of relevant data is provided.

It would seem to me that having data on an individual's ability to solve Captchas might provide some correlation with educational, economic, and social characteristics which is potentially useful to advertisers.

I just can't really buy into the idea that Google is offering ReCaptcha as charity.

You know they use it to digitize books, yes?

Yes of course (and it is not a charity project, either). The fact that you are linking ReCaptcha to a valuable data stream for Google seems to support my general claim that Google is deriving value from the data stream.

And perhaps the collection of that data almost exclusively at times when a single identity can be correlated with a single datum ( i.e. account creation and management) is merely coincidence.

But surely you are not suggesting that Google is ignorant of this fortuitous situation and its implications for their online advertising business.

No, I'm not. From the wording of your question I thought you didn't know they digitized books, which is a revenue stream already. They might use it for other data indeed.

There wasn't a question in my post.

And the rate of blocking bots doesn't enter into Google's book digitization process.

"It would seem to me that having data on an individual's ability to solve Captchas might provide some correlation with educational, economic, and social characteristics which is potentially useful to advertisers."

Like whether or not they're literate? This is just bizarre paranoid speculation.

I also have an idea. Since reCAPTCHA has so much traffic, why not just pair off users and get them to verify each other as fellow humans? Not quite sure how you'd do that but it is one approach that could let you break out of the pattern recognition arms race.

I have no expertise in this area, but how about patterns? Since we're already parsing so much text, why not take strings of legible text (with a minor visual distortion), and strip a connecting word such as a preposition? The user is then prompted to enter the connecting word. Synonyms could be matched.

For example: "I can't stand all ____ your lies" (of)

or: "This is all too ____ for me to handle!" (much)

Sure, some strings would be hard to solve, but I feel like it'd be improved over time. Plus, captchas already take a few tries to solve anyway. I suppose it may impose on foreigners though.

Well, surprisingly, that would only be as strong as doing OCR to get the words.

Why is that? Well, if you deliberately remove words that are easy to put back into the sentence, it's an easy task not because we humans are incredible at "out of the box" thinking, but because one or two word choices are orders of magnitude more likely than all the rest.

And when it comes to statistical guessing, computers are very good at that. Get a large enough corpus of the English language, and I can assure you that the probability distribution of the missing word, given that it's preceded by "I can't stand all" and followed by "your lies" is overwhelmingly in favor of "of".

Calculating these conditional probabilities is not a very hard task to program at all, though it does require quite some computational power and memory for longer chains.
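A toy sketch of that calculation, assuming a four-sentence stand-in corpus and looking only at the one word on either side of the blank. A real attack would use a far larger corpus and longer contexts.

```python
from collections import Counter

def fill_blank(corpus_sentences, left, right):
    """Rank words w by the estimated probability of seeing 'left w right'."""
    counts = Counter()
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        for i in range(1, len(words) - 1):
            if words[i - 1] == left and words[i + 1] == right:
                counts[words[i]] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    return [(word, count / total) for word, count in counts.most_common()]

# Made-up miniature corpus.
corpus = [
    "i am tired of all of your excuses",
    "all of your friends are here",
    "we ate all of your cake",
    "all in your head",
]
print(fill_blank(corpus, "all", "your"))  # [('of', 0.75), ('in', 0.25)]
```

Even with this little data, "of" dominates the slot between "all" and "your".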

Finding dropped words in coherent sentences is trivially easy given a large enough corpus of the language. Just looking at the occurrence probabilities of the surrounding words would be enough for most cases.

My Android already does this, guessing the next logical word as I'm typing. This solution to Captcha has already been broken by AI.

https://www.google.co.uk/search?q=%22I+cant+stand+all%22+%22... https://www.google.co.uk/search?q=%22This+is+all+too%22+%22f...

Then pick the first search result where there is a single word between the segments.

What stops captchabots from just brute-forcing it with common words? Or, if they wanted to get fancy, something like a Markov model of English text could really help narrow down the likely missing words.

This is remarkably easy to crack. Much easier than captchas.

This is all too complex for me to handle! (And "strange", and "weird", and "crazy" and ...)

Tweaking that a bit, I think Captchas really need to change. But not with cut-off of grammar, which can be probabilistically (brute-force) broken.

One can instead start asking simple questions presented over an image. For example, "how many blue dots on this pattern?" or "what animal is this?"(with a photo of camel or something). That's certainly going to be more manageable for humans and prevent us from behaving more like bots, and them like us. Of course, until the bots evolve a little more to circumvent this.

It is so true that one often meets destiny on the path taken to avoid it.

TLDR; Use image based 'A for Apple, B for Boy' aka the 'Kindergarten training' to filter out bots from humans. Ask questions that are human, to figure out humans in other words.

I will say up front that I have no good solution. That said, as a human, I hate captchas or anything else that makes me jump through hoops to do something that should be simple. The industry is punishing people to avoid bots or (whatever they are using them for in the instance), and yet it doesn't even work to accomplish that goal. Eliminate it all. Find something new. I wish I knew what to suggest. But it won't be anything at all like a captcha, stupid game to play, security questions, or anything like that.

For an interesting alternative to traditional captcha check out Solve Media - they monetize your captcha for you by making you input brand messages (that are retained better than just seeing a banner ad). I thought it was an interesting idea when I met the founder Max & team (although I was just slightly skeptical of the level of security provided). They seem great though, so it's worth checking out.

I have tried Solve Media's service and it's a good idea in theory but execution is poor.

Problems with Solve's captchas:

* Requires Flash plugin

* Contains sound (not critical to solving a captcha but not great when your laptop suddenly plays ad audio in a library)

* Payout rates are horrible. You earn a few cents for every couple thousand video captchas solved. Successful solved percentage rates are also poor

I'm a big fan of logic captchas. Not only are they way more accessible than image captchas, but the frustration factor is not nearly as high.

Wolfram Alpha does solve some of them. It seems easy to detect the type of question and manually implement a way to parse each.

Here's an example that I got while trying to register an API account: https://www.wolframalpha.com/input/?i=One+%2B+3+equals+%3F

Same. Very easy to identify human vs. non human.

Do give an example.

Seems like it would be fairly easy to parse many of these questions automatically. The only reason they work is probably that they're not widely used enough for spammers to care about.

Randomly returning a word from the question will get my bot in about 3% of the time which is plenty.

I do agree they are becoming ridiculous.. I often have a hard time figuring them out. The real question is what the next step will be. Where do we go from here? What experiments are out there for new captchas?

And, slightly related, one of my favorite lighthearted sites: http://www.captchacomics.com/

Why is it that there's all this software that can decode horrifically mangled text, and yet nothing to OCR a handwritten letter?

Filters are coming to CSS so I guess it will be possible to show an image with hidden numbers (or words) and let the user pick a color filter that applied to the image will show the numbers and submit the color and the number to the server for verification. Like color blindness tests.

How about that for a captcha?

What about that sounds hard to automate?

Google's captchas actually give me a headache looking at them, there's just too much curliness going on there.

With ReCAPTCHA, you only have to get one of the two words right. I've never had it be the illegible word (or my terrible guesses match the "correct" answer), as one might expect given the one you need is necessarily the one for which they have the answer.

People interested in captchas may also like this episode of the Hacker Medley podcast http://hackermedley.org/humans-only/

It talks about captchas vs OCR technologies and captcha-solving companies in India.


Given that any type of human-solvable captcha can be easily outsourced for cheap, I think the next best alternative is some computationally expensive operation per form submission.

The known word is the one that is bent. This is the only word you have to type. You can enter "dogman" or "foo" for the cut-off (scanned) word and it will let you through.

Simply put, we need a better way. Some of these are just unintelligible and require multiple attempts. Someone please come up with a model that makes sense!

Handwritten captchas? I'm sure they can be crowdsourced.

I've noticed a steep rise in the difficulty of reCaptcha captchas too. There isn't anyone I can contact about it, is there?

The captcha 'arms race' will continue until reCAPTCHA cannot distinguish any more between the bot and a human.

Turing test passed. ;-)

HTML5 should introduce something better than captchas, I wonder why it was not made part of the forms elements ..

Several times I've gotten math symbols in captchas instead of words. How am I supposed to write them?

Anything you write for them is accepted. In fact of these captchas you only need to get one word correct (the non-maths one). IIRC the other is scanned from a book somewhere and the computer used to scan it couldn't figure out what word it was.

Neat demo last week of an alternative to word captchas: a cartoon captcha, e.g. drag a hat on top of a head, put eyeglasses on a face. It's from Mitsuo Okada of Osaka University, at Founders Institute; he's at mitsuookada@gmail.com

I guess OCR software must be getting really good now!

With increased computing power, we are really able to do some amazing things. Unfortunately this also means that spammers also have access to this increased speed and ability to crack OCR based puzzles. But on the bright side, this also allows us to digitize books without humans more accurately (which is one of the primary purposes of reCAPTCHA)

Is there a way to report CAPTCHAs for illegibility?

I wish, but as far as I know, there isn't. I'd love to know if there was!

There's a recycle button that will generate a new one.

I guess the issue is, does Google know when someone recycles a captcha, and if they do, are they doing anything about it? It's useless if the captcha isn't flagged and is just given to another user.

The tricky thing is that there are hordes of automated captcha-breakers that will recycle captchas until they get something that's easy to OCR

This is Google, so I'd imagine they keep tallies.

Yeah, and a couple days ago I had to hit that button about 20 times before I got one I could decipher. I was about 1 or 2 clicks from giving up.

Isn't that implied by refreshing?

I see two possible long-term solutions to the captcha problem

1) Ask user to do a relatively expensive computation. This can be done in the background while the user is typing his post.

2) Request a small amount of money (10c) per comment. Good websites will return the money to non-spam commenters, will keep spammer's money. This however requires working microtransactions.

1) Asking a computer to do a computation is not such a great idea. Low powered devices (think iPads) running javascript would be at a great disadvantage to highly efficient botnet clusters that spammers own.

2) Requesting a small amount of money may work. Alternatively requesting a user to do some useful task (like Amazon Turk HIT) to get some funds.

Some computations just cannot be parallelized. (Yet. It'd be ironic if spammers advanced the field of parallel computing.) The speed difference for single processors remains, but that's a single order of magnitude, except in extreme cases.
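One commonly cited example of such a computation is an iterated hash chain, where each step needs the previous step's output. A minimal sketch, assuming SHA-256 and an arbitrary iteration count; as far as is known there is no shortcut, but that is an assumption rather than a proven fact.

```python
import hashlib

def hash_chain(seed: bytes, iterations: int) -> bytes:
    """Apply SHA-256 repeatedly; step n cannot start before step n-1 finishes."""
    value = seed
    for _ in range(iterations):
        value = hashlib.sha256(value).digest()
    return value

# Hypothetical per-comment seed issued by the server.
result = hash_chain(b"comment-form-seed", 100_000)
print(result.hex()[:16])
```

The catch is that checking the answer costs the server as much as computing it, unless a scheme with cheap verification is used instead. Extra cores also still let a spammer work on many chains at once.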

But you can always parallelize the spamming itself.

I feel like the author of this article is still slightly misunderstanding the reCaptcha. Not to criticize him, but it's almost immediately clear which word you are actually being tested on, because it has the same general 'look' to it each time.

Take the first one: 'Secretary' is clearly out of some book. The other thing is the real test. Now, reCaptcha never gives you real words as a test, so he shouldn't be surprised that it isn't a real word.

The third captcha complained about is actually incredibly easy. The first thing is clearly the word from a book, so you can just type a short bit of nonsense there. The second thing is 'ndaaar'. It's pretty legible and easy to enter. Other 'impossible' ones are pretty easy also.

Again, not trying to pick on the author, but hopefully someone will have an easier time after reading this comment. And while Captchas are annoying, I don't really feel they are impossible.

Edit: To the commenter below - I have no idea what green names mean here, so I don't know how that influences people. What do they represent?

I feel like you didn't read the article, because the author already addresses what you said and had you read the article you'd realize you made incorrect assumptions that, again, he already addressed.

I disagree: "The captchas were not only difficult for a computer to read, but impossible for a human." He goes on to quote that computers can guess captchas at 10%. My point is that if you understand captchas, you can get them right almost all of the time. (Not talking about the audio ones here, since the visual ones seemed to be the focus of the article.)

I understand captchas and so does the original article writer (if you read the article he clearly explains how reCAPTCHA works). While I would have agreed with you a couple of years ago, the author's point that reCAPTCHA has reached a point where it isn't working nearly as well is spot on. I myself fail on them about 75% of the time now -- quickly approaching the failure rate of computers.

For reCAPTCHA it used to be close to 100% success for me, but something has certainly changed with them, I don't know if it was intentional or is the result of a dwindling data set and as an end user, it doesn't matter, what matters is that they are really difficult to solve now.

The author's blog post is retarded; "herp derp I don't know how recaptcha works and it's too hard for me." He has a pretty blog and a green name on HN, so everyone upvotes it. Pretty sad IMO.

Green names mean the opposite of what you think they mean. It's for accounts which are less than 5 days old.
