Plus, there's a huge difference between them publishing what information they're collecting and actually knowing how they're using that data, IMO.
Of course some bots may make use of high-level browser engines (such as those provided by acceptance testing frameworks) to try and get around this, plus you'll always have cheap human labor. But ultimately, anti-spam is an arms race and simple tactics like this will get rid of most unwanted agents.
What worked in the end was a points system for spammy behavior:
- First post has a URL in it? +1 point.
- User fills out the LinkedIn field on their profile? +1 point (seriously, none of our legit users did this...).
- User posts a word on the blocklist (viagra, cialis, cvv2, etc.)? +1 point.
- User agent is IE? +1 point (we're a Mac site).
After a certain number of points, the user was banned and all their generated content deleted. After a certain number of posts without triggering the ban, they were greenlighted. Spammers quickly noticed their posts disappeared instantly and left the site.
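The scoring described above could be sketched roughly like this. The field names, the ban threshold, and the probation count are all made-up illustrations, not the actual system:

```python
# Hypothetical sketch of the points system; field names, threshold,
# and probation count are invented for illustration.
BLOCKLIST = {"viagra", "cialis", "cvv2"}
BAN_THRESHOLD = 3
PROBATION_POSTS = 10  # clean posts before a user is greenlighted

def spam_points(user):
    points = 0
    if user.get("first_post_has_url"):
        points += 1
    if user.get("linkedin"):  # legit users left this blank
        points += 1
    if BLOCKLIST & set(user.get("post_words", [])):
        points += 1
    if "MSIE" in user.get("user_agent", ""):  # IE on a Mac site is suspect
        points += 1
    return points

def should_ban(user):
    # Users past the probation period are greenlighted regardless of score.
    if user.get("clean_posts", 0) >= PROBATION_POSTS:
        return False
    return spam_points(user) >= BAN_THRESHOLD
```

The nice property of additive scoring is that no single signal bans anyone; only a combination of them does.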
Umh, anything that ends up in the POST request can be reproduced by a bot. When I implement screen-scraping modules I don't even look at the page, just at the Network tab of the Chrome Dev Tools.
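To make that concrete, here's a minimal sketch of a bot replaying a form POST copied from the Network tab. The endpoint URL and field names (including the hidden token) are hypothetical:

```python
# Sketch: reproducing a form POST observed in the browser's Network tab.
# The URL and field names are hypothetical examples.
from urllib.parse import urlencode
from urllib.request import Request

fields = {
    "username": "bot",
    "comment": "hello",
    "hidden_token": "abc123",  # copied verbatim from the observed request
}
req = Request(
    "https://example.com/comment",  # hypothetical endpoint
    data=urlencode(fields).encode(),
    headers={"User-Agent": "Mozilla/5.0"},  # mimic a real browser
)
# urlopen(req) would then submit it exactly as the browser did.
```

Any hidden field, token, or "clever" client-side value just becomes one more key in that dict.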
What I think has the potential to remove the need for conscious CAPTCHA solving is what Google is supposedly doing here: machine learning on behavioral patterns in how the user interacts with the form (instead of just with the CAPTCHA).
It seems like you had that backwards -- hope that clears it up.
I think that's where the issue lies: these tools do different things. Honeypot fields defend against general bots. Captchas defend against specific bots too, but they add more friction, so they're only used when specific bots are an issue.
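For anyone unfamiliar with the honeypot technique: the form includes a field that's hidden from humans via CSS, so only a bot that blindly fills every field will populate it. A server-side check could look like this (the field name "website" is just an example):

```python
# Sketch of a server-side honeypot check, assuming the form contains a
# field (here called "website") hidden from humans with CSS.
def is_probably_bot(form):
    """Return True if the hidden honeypot field was filled in.

    Humans never see the field, so any non-empty value means a
    general-purpose bot submitted the form.
    """
    return bool(form.get("website", "").strip())
```

A targeted bot that inspects the specific form will simply skip the field, which is exactly why honeypots only stop general bots.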
I think it's pretty clear the reading-books bit was abandoned long ago. I never get non-test words that are in any way a struggle for a competent OCR system, and on the occasions that I do, it's impossible for me to read them either. If they provided context it would be much more helpful.
As an aside, if you've ever had to solve one of these through Tor and you happen to be routed through some Eastern European countries... good god, those are the most frustrating captchas I've ever seen. Long strings of "mnnmrnrmnm" with contrasting colors and JPEG artifacts... a few attempts at solving those makes me want to kill someone. I feel bad for people trying to do anything on the internet from those countries. I wonder what the rationale is for making captchas nearly impossible to solve in specific regions.
Welcome to the preview of the day where all the networked, statistically self-optimizing IDSes simultaneously turn on us and clean the messy humans out of their technological world.
Google will captcha-block IP addresses (not just Tor) if they get too many queries from them in a short period of time, so that bots can't crawl the search results.
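That kind of per-IP throttling is usually a sliding-window counter; something like this rough sketch, where the window size and query threshold are arbitrary illustrative numbers, not Google's actual limits:

```python
# Rough sketch of per-IP rate limiting that falls back to a CAPTCHA.
# WINDOW and MAX_QUERIES are invented numbers for illustration.
import time
from collections import defaultdict, deque

WINDOW = 60.0      # seconds
MAX_QUERIES = 100  # queries per window before the CAPTCHA kicks in

_hits = defaultdict(deque)

def needs_captcha(ip, now=None):
    """Record a query from `ip` and report whether to show a CAPTCHA."""
    now = time.monotonic() if now is None else now
    q = _hits[ip]
    q.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while q and now - q[0] > WINDOW:
        q.popleft()
    return len(q) > MAX_QUERIES
```

The reason Tor exit nodes get hammered with captchas falls straight out of this design: thousands of users share one exit IP, so the shared counter trips constantly.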
I'm a bit puzzled by this update, though. I have reCAPTCHA on a wiki that I maintain, and I still see the traditional text based ones, not anything like these new number based ones. Are they rolling this out slowly?
It still says "Stop spam. Read books." But the book-reading part is over, and now it's "Stop spam. Do Google's work for them."
The daily routine for a 4chan poster will involve solving that captcha 20 to 30 times.
Combine this with browser fingerprinting (browser fingerprints are surprisingly unique) and the fact that you probably have a Google account, and there's a high probability they know who you are. From your history they can determine whether you're human or not.
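A toy illustration of why fingerprints are so identifying: hash together a handful of attributes any site can observe, and the combination is already close to unique. Real fingerprinting uses many more signals (canvas rendering, installed fonts, plugins); the attributes below are just examples:

```python
# Toy browser-fingerprint sketch: hash a few observable attributes.
# Real fingerprints combine far more signals; these are illustrative.
import hashlib

def fingerprint(attrs):
    """Produce a stable hash from a dict of browser attributes."""
    canonical = "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Each attribute alone narrows you down only a little, but the joint combination of user agent, language, screen size, timezone, and so on quickly distinguishes one browser from millions of others.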
Well! Very obviously, any human can behave just as maliciously as a bot might! So captchas are there to slow us down. Point blank. They are flood control. They prevent spam, be it from bot or human.
The real question is, why would Ars Technica be so chicken-shit, that they can't come out and say that?