reCAPTCHAs are finally readable by normal humans (arstechnica.com)
54 points by bgtyhn on Oct 25, 2013 | 37 comments

I have a feeling this won't last very long now that they've publicized the fact that they profile a user's interaction with the system. Things like the rate of hitting captchas, mouse movements, characters typed before pressing send, etc., are all easy to mimic and control if you know they're analyzing that info.

Well, since all this information has to be collected client-side, I'm sure people working on captcha solvers would have figured it out anyway.

Plus, there's a huge difference between them publishing what information they're collecting and actually knowing how they're using that data, IMO.

I thought maybe I was the only one who struggled with recaptcha captchas. On some sites I have literally tried 10-15 times before finally becoming too frustrated and giving up solving the captcha because whatever it was (form to fill out for a download, fill out a comment on a site), just wasn't worth the time I was wasting. Hugely irritating.

Initially, I was using reCAPTCHA on my site. But after hearing from irate users, and especially because it was preventing new users from registering, I decided to disable it and use a simplified CAPTCHA. Losing potential new users to a CAPTCHA wasn't acceptable.

Perhaps in some glorious future utopian society, humans won't have to see CAPTCHAs at all.

They don't. You can implement basic defenses like hidden input forms or checkboxes that are masked by CSS rules or served on the client side with JavaScript to weed out bots from human users, among other less intrusive techniques.

Of course some bots may make use of high-level browser engines (such as those provided by acceptance testing frameworks) to try and get around this, plus you'll always have cheap human labor. But ultimately, anti-spam is an arms race and simple tactics like this will get rid of most unwanted agents.

We tried a ton of those techniques (hidden fields, fields filled out by JavaScript, anti-replay tokens) and none of them worked - the spammers appeared to be using botted IE installs.

What worked in the end was a points system for spammy behavior: First post has URL in it? +1 point. User fills out linkedin field on profile? +1 point (seriously, none of our legit users did this...). User posts a word on the blocklist? (viagra, cialis, cvv2, etc) +1 point. User Agent is IE? +1 point (we're a Mac site). After a certain number of points, the user was banned and all their generated content deleted. After a certain number of posts without triggering the ban, they're greenlighted. Spammers quickly noticed their posts disappeared instantly and left the site.
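For anyone who wants to try something like the points system above, here is a minimal Python sketch. The four rules come from the comment; the `Post` fields, the `BAN_THRESHOLD` value, and the user-agent substring check are assumptions made for illustration, and this version scores a single post instead of accumulating points per user over time.

```python
from dataclasses import dataclass

# Words taken from the comment above; a real blocklist would be longer.
BLOCKLIST = {"viagra", "cialis", "cvv2"}
# Assumed threshold; the comment only says "a certain number of points".
BAN_THRESHOLD = 3


@dataclass
class Post:
    text: str
    is_first_post: bool
    user_agent: str
    linkedin_filled: bool  # hypothetical flag for the profile field


def spam_points(post: Post) -> int:
    """Add one point for each spammy signal found on the post."""
    lowered = post.text.lower()
    points = 0
    if post.is_first_post and ("http://" in lowered or "https://" in lowered):
        points += 1  # first post contains a URL
    if post.linkedin_filled:
        points += 1  # no legit user on that site filled this in
    if any(word in lowered for word in BLOCKLIST):
        points += 1  # blocklisted word
    if "MSIE" in post.user_agent or "Trident/" in post.user_agent:
        points += 1  # Internet Explorer on a Mac-focused site
    return points


def should_ban(post: Post) -> bool:
    """Ban once the score reaches the assumed threshold."""
    return spam_points(post) >= BAN_THRESHOLD


spammy = Post(
    text="Cheap viagra at http://spam.example",
    is_first_post=True,
    user_agent="Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)",
    linkedin_filled=True,
)
legit = Post(
    text="Anyone tried the new MacBook?",
    is_first_post=True,
    user_agent="Mozilla/5.0 (Macintosh) AppleWebKit/537.36 Safari/537.36",
    linkedin_filled=False,
)
```

Each rule is weak on its own, so the threshold is what keeps a single false positive (say, a legit user on IE) from triggering a ban.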

If only my blogspammers were that quick on the uptake. I've got moderation turned on, so none of their posts ever see the light of day, and I still get tens of them every day.

> You can implement basic defenses like hidden input forms or checkboxes that are masked by CSS rules or served on the client side with JavaScript to weed out bots from human users, among other less intrusive techniques.

Umh, anything that will end up in the POST request will be reproduced by a bot. I don't even actually look at the page when implementing screen-scraping modules, just at the Network tab of the Chrome Dev Tools.

What I think has the potential to remove the need for conscious CAPTCHA solving is what Google is supposedly doing here: machine learning on behavioral patterns in the user's interaction with the form (instead of just with the CAPTCHA).

No no, the hidden input fields would not be filled in by a human (because they're positioned off screen by CSS, for example), whereas a bot would fill them in because the bot is just scraping the HTML for all input fields and filling them in with something.

It seems like you had that backwards -- hope that clears it up.
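To make the honeypot idea concrete, here is a rough server-side sketch in Python. The field name `website_url` and the shape of `form_data` are hypothetical; the only assumption about the page is that this input is hidden from humans with CSS, so a non-empty value suggests a bot that fills in every field it finds.

```python
# Hypothetical name of the input that is positioned off screen by CSS.
HONEYPOT_FIELD = "website_url"


def looks_like_untargeted_bot(form_data: dict) -> bool:
    """Return True when the hidden honeypot field was filled in."""
    value = form_data.get(HONEYPOT_FIELD, "")
    return value.strip() != ""


# A human never sees the field, so it arrives empty.
human_submission = {"name": "Alice", "comment": "Nice post", "website_url": ""}
# A generic form-filling bot puts something in every input.
bot_submission = {
    "name": "x",
    "comment": "buy now",
    "website_url": "http://spam.example",
}
```

A bot written for one specific site can simply leave the field empty, so this check only filters out the generic crawlers.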

Why would I have my bot fill them in? One glance at the site and I know that my bot should skip those fields.

It's meant as a defense against non-targeted bots—the ones that roam the web looking for forms to fill out.

I think that's where the issue lies: these tools do different things. Honeypot fields defend against general bots. Captchas defend against specific bots, too, but also have greater friction, so are only used when specific bots are an issue.

I did my mom's website for her small law firm and I was getting tons of bots even with hidden form fields, and they didn't look targeted. Captcha helped a lot, but the spam didn't stop until I used both combined with special rules, like requiring the phone number field to contain a certain number of digits.
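A rule like the phone number check could look roughly like this in Python. The minimum of 10 digits is an assumption (it fits US numbers); the comment does not say what count was used.

```python
import re

# Assumed minimum; adjust for the phone formats you expect.
MIN_DIGITS = 10


def phone_has_enough_digits(raw: str, min_digits: int = MIN_DIGITS) -> bool:
    """Strip everything that is not a digit and check how many remain."""
    digits = re.sub(r"\D", "", raw)
    return len(digits) >= min_digits
```

For example, `phone_has_enough_digits("(555) 123-4567")` passes, while a field stuffed with a URL or free text does not.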

Ok, I thought it was intended as a counter-measure to targeted bots.

If you own a website and use reCAPTCHA for human verification, you know it's been broken for years, whether because it's too hard for humans, too easy for bots, or bypassed using third-party labor for pennies per solve.

I wonder if a JavaScript-based proof-of-work system could be used as an alternative to CAPTCHAs. It wouldn't stop spam entirely, but it would rate-limit it and possibly render some forms of spam unprofitable. As a bonus, the proof-of-work could be tied to something useful (Folding@home?) just like reCAPTCHA's CAPTCHAs are useful for digitizing books.
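As a rough illustration of what such a scheme involves, here is a hashcash-style sketch. It is written in Python to keep it short; in the proposal above the `solve` step would run as JavaScript in the browser and only `verify` would run on the server. The challenge string format and the difficulty value are assumptions.

```python
import hashlib
from itertools import count


def _hash_value(challenge: str, nonce: int) -> int:
    """SHA-256 of 'challenge:nonce', read as a big-endian integer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big")


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Cheap server-side check: the hash must have enough leading zero bits."""
    return _hash_value(challenge, nonce) < (1 << (256 - difficulty_bits))


def solve(challenge: str, difficulty_bits: int) -> int:
    """Expensive client-side search: try nonces until one verifies."""
    for nonce in count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce


# The server would issue a fresh challenge per form; this one is made up.
challenge = "form-token-123"
nonce = solve(challenge, difficulty_bits=12)  # about 4096 hashes on average
```

Each extra difficulty bit doubles the expected work for the client, while verification stays at a single hash.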

Nope. This was tried with Bitcoin back in the day. JavaScript is far too slow, and a spammer could just blast out thousands of times the work with a single CPU running something native. Mobile users would suffer greatly too.

Does this mean that digitizing of books through reCAPTCHAs will be done at a much slower rate or not at all?

I find it difficult to believe the captchas served over the last year or so were actually scanned from books: they were so completely illegible and nonsensical. I usually had to click refresh about half a dozen times before I could even find a sample that I could read correctly.

Half of the captcha (the illegible nonsense) is the actual test. The other half is usually easy to read; that's the scan from the book. You can actually answer anything for that part and still pass the test, although obviously if you do, you're not helping digitize books.

Are they even digitizing books anymore? I seem to always get a house number. The house numbers make it really easy to know I don't actually have to type that part.

I figured out a while ago that you only ever need to type the nonsensical string.

I think it's pretty clear the reading-books bit was abandoned long ago. I never get non-test words that are in any way a struggle for a competent OCR system. And on the occasion that I do, it's impossible for me to read either. If they provided context it would be much more helpful.

As an aside, if you've ever had to solve one of these through Tor and you happen to be running through some Eastern European countries... good god, those are the most frustrating captchas I've ever seen. Long strings of "mnnmrnrmnm" with contrasting colors and JPEG artifacts... a few attempts at solving those makes me want to kill someone. I feel bad for people trying to do anything on the internet from those countries. I wonder what the rationale is for making captchas nearly impossible to solve in specific regions.

Tor + Eastern Europe has probably triggered the heuristic that you're a probable bot, and it's giving you a test that will further amplify its confirmation bias.

Welcome to the preview of the day where all the networked, statistically self-optimizing IDSes simultaneously turn on us and clean the messy humans out of their technological world.

> I wonder what the rationale is for making captchas nearly impossible to solve in specific regions.

Google will captcha-block IP addresses (not just Tor) if they get too many queries from them in a short period of time, so that bots can't crawl the search results.[1]

1: https://www.torproject.org/docs/faq.html.en#GoogleCAPTCHA

It's likely the case that the house numbers are from Google Street View, and they are using them to improve the addressing for Google Maps.

I'm a bit puzzled by this update, though. I have reCAPTCHA on a wiki that I maintain, and I still see the traditional text based ones, not anything like these new number based ones. Are they rolling this out slowly?

It makes the slogan pretty disingenuous.

It still says "Stop spam. Read books." But the book-reading part is over, and now it's "Stop spam. Do Google's work for them."

While it's nice that they're trying to design the captchas to be more user friendly, these new number captchas are quite easy to crack. This might lead to an uptick in spam as OCR-based reCAPTCHA solvers become easier to create.

You are missing the whole point. The new "easier" captchas are shown only to people that they are already sure are human. I suspect they are looking at the number of previous captchas you have solved correctly, whether you are logged into Gmail, etc.

You're correct. Refreshing the page a couple of times reverts the captcha to the old one with letters. So I guess they keep track of how many times they serve captchas to your IP address in an hour, and if you go over some limit they start serving you the harder captcha.
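If that guess is right, the selection logic might look something like the Python sketch below. This is speculation about the behavior described in the comment, not Google's actual implementation; the class name, the limit of 5, and the one-hour window are all assumptions.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # "in an hour", per the guess above
EASY_LIMIT = 5         # assumed number of easy captchas per window


class CaptchaSelector:
    """Serve easy captchas until an IP exceeds a per-window limit."""

    def __init__(self, limit: int = EASY_LIMIT, window: float = WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        self._served = defaultdict(deque)  # ip -> timestamps of served captchas

    def choose(self, ip: str, now: float = None) -> str:
        now = time.time() if now is None else now
        history = self._served[ip]
        # Forget captchas served outside the sliding window.
        while history and now - history[0] >= self.window:
            history.popleft()
        history.append(now)
        return "easy" if len(history) <= self.limit else "hard"


selector = CaptchaSelector(limit=2, window=60)
results = [selector.choose("203.0.113.7", now=t) for t in (0, 1, 2)]
```

Passing `now` explicitly is only there to make the behavior easy to check without waiting for real time to pass.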

Thanks :)

That's going to end badly for 4chan, as every post requires you to fill in a captcha.

The daily routine for a 4chan poster will constitute solving that captcha 20 to 30 times.

Google knows your searches, and I am pretty sure google knows every site you visit, via google analytics, which most sites run.

Combine this with browser fingerprinting (your browser's fingerprint is incredibly unique), and the fact that you probably have a google account. There is a high probability they know who you are. From your history they can determine if you're human or not.

I cannot believe that in 2013 captchas remain the most effective way to achieve whatever they are used for. I bet Google could bake something into Chrome to avoid captchas.

We should feel lucky that at least captchas can help. In the future, probably, there will be no chance you could separate humans from computers.

Very true. So we need to find a way to prevent spam that doesn't involve differentiating between a human and a bot.

I find your statement weirdly amusing. You are deploring a lack of progress regarding something (it is increasingly difficult to tell humans and computers apart) that you would probably view as progress if stated differently (computers are getting increasingly smarter). I feel your pain, but this is a very tough problem and won't get any easier soon.

Personally I have an easier time typing the letter/word based captchas than the digit ones.

...and why, pray tell, are they still providing turing tests to humans, if they already know who the humans are, you ask?

Well! Very obviously, any human can behave just as maliciously as a bot might! So captchas are there to slow us down. Point blank. They are flood control. They prevent spam, be it from bot or human.

The real question is, why would Ars Technica be so chicken-shit, that they can't come out and say that?

If you want to slow someone down you add in a delay. Like many forum software already does.
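A forum-style posting delay can be sketched in a few lines of Python. The 30-second default and the in-memory dictionary are assumptions for illustration; real forum software would keep the last-post time in its database.

```python
import time


class PostThrottle:
    """Reject a post if the same user posted less than min_interval seconds ago."""

    def __init__(self, min_interval: float = 30.0):
        self.min_interval = min_interval
        self._last_post = {}  # user -> timestamp of last accepted post

    def allow(self, user: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        last = self._last_post.get(user)
        if last is not None and now - last < self.min_interval:
            return False  # too soon; the rejected attempt does not reset the timer
        self._last_post[user] = now
        return True


throttle = PostThrottle(min_interval=10)
first = throttle.allow("alice", now=0)    # accepted
second = throttle.allow("alice", now=5)   # rejected, only 5 seconds later
third = throttle.allow("alice", now=10)   # accepted, 10 seconds after the first
```

This limits every account equally, human or bot, which is the flood-control behavior the parent comment describes.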
