Hacker News
Why can’t a bot tick the 'I'm not a robot' box? (quora.com)
1002 points by grzm on Feb 13, 2019 | 618 comments

I’ve had a number of reCAPTCHA incidents where I could not pass the test across tens of images; it was a very frustrating experience. Please do not use reCAPTCHA.

The items one is supposed to select often overlap the grid cells, so it becomes a kind of Keynesian Beauty Contest[1] at that point. I assume they validate based on how closely you align with previous answers, so it becomes a problem of “Which nearby cells would a reasonable person select when there’s overlap?” Or you’re tasked with selecting <some_item>, you see <some_item> in the distant background of the image you’re supposed to classify, and you have to decide “How visible would <some_item> need to be for a reasonable person to say it’s in this image?”

On top of this, when you’re clicking through image after image after image, the additional frustrating thing is that you’re helping train their algorithms; you’re doing work because their service isn’t smart enough to know you’re human, or, you’re seen as a marginal customer they can piss off by forcing you to work for them for free.

It’s a frustrating experience when it fails, and my current strategy is to leave the website when the ‘I’m not a robot’ checkbox fails.


You’re overthinking it. I have had a similar experience. The worst were the “stop light” questions. Does it mean the pole, the light, the tiny corner overlapping another square? I used to try to include everything as it was technically true. Very frustrating – until I finally started to not care. Click on the most obvious pictures. Click click click. Done. Get it wrong? Click click click on the next one. Way faster with a much better success rate doing so. It usually only takes a couple of tries now.

"Store front" always makes me squint at garages, driveways, sidewalks and front doors.

Just a suggestion not to say "You're overthinking it"; it can come across as pretty dismissive.

Maybe a cultural thing? For me, "you're overthinking it" is a familiar/friendly way of expressing an idea, similar to how you would approach a friend. It was not intended to be dismissive.

You are right, it is a cultural thing. That being said, the advice is sound. The internet is a mishmash of cultures :) If I brought my culture to the internet, everyone would think I'm a troll or a horrible person, just because people from my culture are a bit direct and cynical :D If you'd like people to respect and take into account your culture, it's good to do the same for theirs.

now that's over thinking :D

You're overthinking that.

Hell I try to figure out what wrong answers to give to fuck up their training data.

That would only work if everyone who got the same image gave the same wrong answer. Otherwise, the system just keeps sending the same image to more random people until it's satisfied with the consensus. Also, it keeps sending YOU more random images if you obviously don't comply with previously observed consensus.
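That consensus loop can be sketched in a few lines. This is purely illustrative: the thresholds are made up, since Google's actual criteria aren't public.

```javascript
// Keep serving an image to new users until one label clearly
// dominates the votes collected so far. Thresholds are invented.
function needsMoreVotes(votes, minVotes = 3, agreement = 0.8) {
  if (votes.length < minVotes) return true;
  const counts = {};
  for (const v of votes) counts[v] = (counts[v] || 0) + 1;
  const top = Math.max(...Object.values(counts));
  return top / votes.length < agreement;
}
```

Under a scheme like this, a lone wrong answer just gets out-voted and the image goes back into rotation; only many people giving the same wrong answer could poison the label.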

reCAPTCHA is horrible. I am almost certain that some tests are intentionally marked as failed even if they are correct, simply so they can get more training in. And if you take any anonymizing measures, reCAPTCHA makes much of the web nigh impossible to use.

It's easily the worst UX I encounter online. I can't describe the relief I feel when I come across a "normal" CAPTCHA. It's coming to the point that if I could pay a few cents to outsource every reCAPTCHA, I gladly would simply to avoid this atrocity.

The tests are often poorly defined. I've been caught in reCAPTCHA hell a few times (but not recently), and it's often the case that I'm guessing... is this a bus? I'm not sure.

It's interesting to find out that this might have nothing to do with my ability or inability to recognize objects in blurry photos.

Are reCAPTCHA images used to train or classify data, like Mechanical Turk? I thought at one point the word ones were used that way, and that this was a major incentive for their usage.

Original reCAPTCHA images were used to transcribe text for Google Books, and later to transcribe house numbers for Google Street View.

I don't think Google's publicly said what they do with any data derived from the new "I'm not a robot" reCAPTCHA, but, given the content it usually uses, it seems likely that they're still using it for image classification in Street View, or for their self-driving car projects.

They are almost certainly being used to train Google's self-driving/Maps/etc recognition.

You probably share your IP with others, e.g. through Tor or a single egress IP at a big company.

Lately I've been getting reCAPTCHA prompts all the time even though I'm not browsing in incognito mode and haven't cleared cookies. All I'm doing is running a very basic ad blocker, using Safari (which blocks third-party tracking), and very rarely loading a Google site. The most interaction I have with Google is when I end up having to use my corporate Google account as SSO for some other site.

Given that I'm not doing anything unusual, it really feels to me like reCAPTCHA, for all its complexity, boils down to "what's your history using Google software? Oh you rarely use it? I'm gonna give you a captcha". It didn't used to be this aggressive, but it's really ramped up in the past few weeks.

I recently replaced a bunch of securimage captchas with reCAPTCHA v2. During testing I had to shut it off because it became increasingly complicated with every page reload. First it was just one page of traffic lights, but 20 minutes later I was having to click through 5-6 pages of images. This worries me that users might get pissed off. I'd really like to know if I've made life harder for my users in an attempt to stop the spam that got through the horribly broken securimage captcha.

I cancelled and deleted my Spotify account because of that. Now any site that asks me to fill a Google recaptcha is met with a swift click on the Back button.

Well. You showed them, then. I'm sure someone will notice.

Really? You gave up one of the best streaming music services because you had to use a captcha every now and then? Huh. Good data point. But you're probably not our target audience if you give up that easy, we sell engineering tools for solving difficult problems.

You said

> During testing I had to shut it off because it became increasingly complicated with every page reload. First it was just one page of traffic lights, but 20 minutes later I was having to click through 5-6 pages of images. This worries me that users might get pissed off.

Why are you now badmouthing someone else for deleting Spotify over this exact same issue?

Anecdotally, I've gotten way more reCAPTCHA prompts since disabling third-party cookies and installing Cookie AutoDelete, so I suspect you are correct.

Same here.

A randomized user agent also seems to trigger reCAPTCHA without real cause. That should be an ADA violation.

ADA? Americans with Disabilities Act? Why would that have any bearing on user agent randomization?

I think the argument boils down to "how do you fill out the captcha if you use a screen reader"?

They already have an accommodation for the visually impaired with the audio captchas.


If you score low enough on their automatic checks, they refuse to serve you the audio challenge.

What about the deafblind?

What's their solution for any other website? It seems like they'd have a very difficult time accessing ANY site.

In a quick search, it seems like NoCaptcha is the accessible answer to the issues with regular captchas. For the most part it seems to work; most of the complaints here seem to stem from people actively blocking some of the evaluation metrics used by the checkbox (cookies, JavaScript, user agent strings, fingerprinting, etc.), which makes them look very different from normal traffic and, almost by necessity, a lot more like bots.


>which makes them look very different from normal traffic and, almost by necessity, a lot more like bots.

But if they are doing so because they are disabled, and the difference means they receive a worse experience, that may result in an ADA complaint (especially if a government service falling under Section 508 is involved).

Braille interfaces are a thing.

I think he means that having people fill out recaptcha all the time because they don't allow Google to track them should be an ADA violation.

Again I still don't see how not wanting Google to track you should qualify as a disability under the ADA.

Nonono, disabled people might be barred from using a service because of excessive recaptcha. That's what he means. I also think he really meant CVAA rather than ADA though.

Looks like they provide an appropriate alternative for people with vision disabilities: https://support.google.com/recaptcha/answer/6175971?hl=en

That only really leaves blind deaf people out, at which point we might be reaching the limits of any technology to provide access to everyone without a tooooon of work.

As someone already pointed out in another reply, you don't get this option if you score low enough.

Yeah but most people won't trigger that. Seems like most of the complaints here about triggering it often are from people who are blocking js/cookies/randomizing user strings. The NoCaptcha check box itself is better than the old system where everyone had to do the Captcha at least.

It happens to me occasionally and I am basically blacklisted from the internet. I have to solve 5 in a row and if I screw up its idea of what a streetsign is, I have to start over. It has made me cut out usage of most sites that use this broken and abusive tech. Welcome to the digital ghetto.

I have noticed this too. I've switched to DuckDuckGo for everything and I haven't changed my habits. Started getting more captchas a couple weeks ago and I know I answered several of them correctly (I'd get tested 3 times in a row).

Possible fingerprinting?

It is also plausible that, because Google Analytics runs on so many sites, they could do something shady like put you in a pester segment if they see you frequently coming from DuckDuckGo to other sites. It is not hard to imagine using reCAPTCHA as a nuisance against other search traffic providers.

I doubt that many people would make a connection between their search engine and seeing captchas on other sites. So limited gain for, if anything, many unnecessary complaints.

This is why I love container tabs in Firefox. I like putting all of the recaptcha stuff in one container so it can't snoop on my other stuff (I'm too lazy to look into what it's doing with cookies and whatnot).

But honestly, I wish it would just die.

Technically it's a good strategy. It would be much more complex for bots to leave a trail consistent with that of a real user.

Some financial and government benefit web sites query web trackers as an extra factor in the enrollment process.

It'd be a good strategy if we were aiming for a totalitarian Google-sponsored police state..

Making (online or offline) life more difficult for people who don't want to use company X products could escalate to the point where you either accept the yoke and are admitted to the walled garden of "society" which company X has firmly cemented themselves under -- or you say no and find yourself unable to drive/fly/get a job/go to college/buy groceries in your town. It sounds like a big leap to make right now, but is a real possibility if Amazon/Google/FB don't get broken up soon.

Using a verified human Google history to allow people who would otherwise be flagged as potentially a bot to skip the CAPTCHA is justifiable. Setting up your reCAPTCHA such that the lack of a verified Google history is used as a "probably bot" signal is really quite awful.

I use Safari on a Mac with a simple ad blocker, but I use a lot of Google products, and I also get a captcha most of the time.

So perhaps it's Safari and/or the ad blocker that are to blame? Hard to say, though.

This is likely due to the new canvas fingerprinting protections introduced in iOS 12 and Safari for Mojave. Google's NHT analyzers probably don't take well to these measures that attempt to defeat canvas fingerprinting.

Same here. I have an alternative browser with no ad blocker, no tracking blocker, and sometimes I just copy the website from Safari to that other browser to avoid CAPTCHA.

Some time ago we had a provider that gave us a different IP every day. Some days we were just not able to do anything without captchas. It seems that some IP addresses were in some sort of spam database.

Slightly related, but I have a fun conspiracy to share:

I'm convinced that part of the reason Google released headless Chrome is as a honeypot for bot authors to use. The idea is that instead of going through the effort of fingerprinting and identifying new bot software, release something that bot authors will use instead that you have a capability to detect.

Somewhere inside of headless Chrome, there's one or more subtle changes that make it so Google can detect whether you're using headless Chrome or normal Chrome. There's no limit to how subtle the indicator could be - maybe headless Chrome renders certain CSS elements slightly slower than normal Chrome, etc.

It sounds pretty crazy/complicated, but I could definitely see it being worth it if it means detecting $X,000,000 worth of ad fraud every year.

It's actually not that complicated. Most headless browser drivers have some global JavaScript functions in the `window` namespace that immediately identify themselves.
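As a hedged sketch of that idea: the property names below (`_phantom`, `__nightmare`, `navigator.webdriver`, and so on) are commonly reported automation tells, not a confirmed list of what any particular detector checks.

```javascript
// Looks for well-known automation markers on a window-like object.
// Takes the object as a parameter so it can be exercised outside a browser.
function looksAutomated(win) {
  const markers = [
    "_phantom", "callPhantom", // PhantomJS
    "__nightmare",             // Nightmare.js
    "_selenium", "__webdriver_evaluate", "__driver_evaluate", // Selenium shims
  ];
  if (markers.some((m) => m in win)) return true;
  // Automation-driven Chrome sets navigator.webdriver to true.
  return Boolean(win.navigator && win.navigator.webdriver);
}
```

In a real page this would be called as `looksAutomated(window)`; the stack-trace trick works on the same principle, fishing for driver-injected frames.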

I once ran into a piece of code from the scammy advertising world that tried to redirect users to a phishing site. They cleverly tried to hide themselves from the automated quality checks some ad networks do, by checking for these functions and appearing benign if they saw them. One of the checks even created an exception and then inspected the stack trace for certain flags that apparently are only there on some type of headless browser. Clever!

Interesting idea :)

I don't think spambots are currently using Chromium or even running JavaScript. Using simple spam filters in JavaScript still works fine on my setups.

Most modern credential stuffers use headless browsers with all the bells and whistles, html5, javascript, etc.

Login attempts are usually spread over a massive botnet of residential IPs as well, where they'll only use each IP for one or two login attempts before moving on to the next.

It's a very fascinating problem space.

In my experience, the botnet didn't upgrade their JVM...it was 18-24 months out of date. THAT was what we filtered on at the F5 to blunt the attack.

Does that mean you're breaking the experience for users who deliberately disable JS by default? Can I ask you not to do that? The modern web is unusable if you allow JS on every webpage.

Every time I fill one of these out I get the picture test, and I answer them correctly...but am asked 3-5 times to identify which blocks contain a school bus or stop light. It's very annoying.

I think it's speculated that you're recorded as being a useful classifier if you answer correctly on initial test captchas, so you get given Google's datasets for machine learning. It would explain why you get picture tests even after you should definitely pass the check.

If that is true that is a huge breach of trust. Are these practices ever audited?

By whom? For what? (Meaning - probably not, unless you count a few Googlers sanity checking launches)

You mean, they still haven't managed to classify those storefronts correctly, after 3 years or so?

Yes - I feel at this point that our labor is simply being used to help provide training data for Google's algorithms.

I mean, this has been exactly the situation since recaptcha was invented.

This is not how it is supposed to be.

In a parallel universe of fluffy niceness we willingly provide our help and in that way we get all those old books converted to ASCII and available for us to read online. Our efforts are for the good of mankind. Similarly with the newer challenges, we help the maps be up to date and again this is for the good of mankind and those needing help getting around.

Clearly this doesn't work in an era where the 'don't be evil' mantra is long forgotten and people only see Google as some advertiser friendly capitalist monopolist beast.

Google needs to work on its relationship with its customers, to be a benevolent dictatorship of sorts. They are lousy at customer service, and there are other pain points they are oblivious to. I don't see how this helps.

> Google need to work on their relationship with their customers

Google's customers are those who buy advertising. The rest of us are just cannon fodder.

The "hills" category is the worst.

Have come to the conclusion it really means any patch of grass.

How to label your ML data for free.

By offering a free service to website operators that mitigates the ever-growing challenge of abuse on the internet that they and their users have to deal with.

Seems pretty bilateral.

By using highly trusted users who think they are just proving their honesty.

At SerpApi.com, we built a bot to check these boxes and an AI to solve the actual CAPTCHA.

Checking the box is actually not that hard. There are no advanced measurements of your mouse and touch speed; that's an Internet myth. It's more a game of cookies, making them age well, and having an organic set of headers.
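To illustrate what "an organic set of headers" might mean: real browsers send certain companion headers together, so a bare User-Agent with nothing alongside it stands out. The header names below are real HTTP headers, but the scoring logic is invented for this sketch; it is not SerpApi's or Google's actual method.

```javascript
// A client claiming to be Chrome normally sends Accept-Language,
// an HTML-capable Accept header, and (in recent versions) sec-ch-ua.
function looksOrganic(headers) {
  const h = {};
  for (const [k, v] of Object.entries(headers)) h[k.toLowerCase()] = v;
  const ua = h["user-agent"] || "";
  if (!ua) return false;
  const hasLang = "accept-language" in h;
  const acceptsHtml = (h["accept"] || "").includes("text/html");
  const claimsChrome = ua.includes("Chrome/");
  const hasClientHints = "sec-ch-ua" in h;
  return hasLang && acceptsHtml && (!claimsChrome || hasClientHints);
}
```

A default `python-requests` or `curl` request fails a check like this immediately, which is why scraping setups copy a full, self-consistent browser header profile.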

The misconceptions about nonhuman/invalid traffic (NHT) seem like a problem brewing. There's an arms race between the NHT guys (e.g. ad fraud networks) and the guys trying to detect it (e.g. ad providers). Meanwhile, a lot of people are using analytics to inform decision making assuming most traffic is legitimate (e.g. news organizations, anyone doing A/B testing). The guys naively using analytics with weak feature detection may be totally unprepared to deal with nonhuman traffic from repurposed networks which have been optimized to defeat the more advanced countermeasures =/

Aren’t you afraid of being sued by Google for selling their search results?

My first thought as well, looks like the business is 100% based on breaking TOS.

From their site: Is scraping legal? In the United States, scraping public resources falls under the Fair Use doctrine, and is protected by the First Amendment. See the LinkedIn Vs. hiQ scraper ruling for more information. This does not constitute legal advice, and you should seek the counsel of an attorney on your specific matter to comply with the laws in your jurisdiction.

ROFL. I guess if you are able to ignore the layers of other issues (the TOS, the circumvention of technology specifically built to exclude your use case, etc.) and are only willing to apply some very tangential case law to your reasoning, it is "legal".

The captcha always reminds me of The Stanley Parable:

> Employee #427's job was simple: he sat at his desk in room 427 and he pushed buttons on a keyboard.

> Orders came to him through a monitor on his desk, telling him what buttons to push, how long to push them, and in what order.

Isn't this a typical Quora answer? Full of filler and shitty hard-to-verify details that provide no value to the answer ("the language is encrypted twice", what the hell), and very little effort on answering the actual question (what is the purpose of CAPTCHA).

And the community rules try to block people from writing firm "you're full of shit"-like answers, even though every other answer of Quora is full of lies like "Linux is fast, because it was designed for 16-bit computers".

I had my “wow, this place might not be that good” experience with Quora yesterday when I was googling to evaluate AWS WorkMail.

Quite a lot of the extremely good-looking answers on Quora straight up said that you couldn’t do e-mail in AWS. These were answers from after WorkMail was a thing, by the way.

So I started looking at other Quora answers on topics I wouldn’t normally need an answer for, and it’s frightening how often completely wrong answers look correct.

Don’t get me wrong, there are a lot of truly amazing answers as well, and it’s entirely possible that I just suck at it, but I don’t think I can always tell the amazing answer from the completely wrong one.

My experience with Quora has been that more often than not the older the answer, the better it is. I find that answers in history, are often better than in tech. It always seems like the community that initially built Quora, stopped building it further several years ago and now it's floating out in space Wile E. Coyote style.

Quora went significantly downhill a few years back. It was a combination of hordes of new users, bad moderation, and bad incentives (order in which answers get shown etc).

I was going to say the same. I don't use Quora at the moment, but when I did, I was more interested in subjective questions (history, geography/travel, and more philosophical questions) than in objective and technical questions and answers, as I've found many of the latter to be simply untrue.

Sounds like a form of the Gell-Mann Amnesia Effect.


The answer is not just fluff. It for example links to https://github.com/neuroradiology/InsideReCaptcha where you can read more.

I looked at that and it’s pretty nifty. It actually looks like Google did encrypt the client-side code and implement their own JavaScript VM, and the decryption key for the source is derived from variables inside the VM during execution (some kind of state) and from other properties of the webpage (CSS is mentioned). It all falls into the realm of obfuscation.

> the actual question (what is the purpose of CAPTCHA)

This assessment isn't quite right. The actual question is about how the captcha differentiates between a human and a bot at the box checking stage.

You're right, though, that it is both full of filler and also doesn't address the question as posed at all.

> Why can’t a bot tick the 'I'm not a robot' box?

It can, by taking over the mouse...except...

LUCKILY, the top answer (on my screen at least it's https://www.quora.com/Why-can-t-a-bot-tick-the-Im-not-a-robo...) does actually try to answer the question.

I feel like the OP submission might have just been some sort of submarine self-promotion for the "CEO of <redacted>".

That answer talks about mouse movements. Most captchas I do these days are mobile. So I don't trust this as a very solid answer.

Mobile device movement variability (IMU, compass, soft vs hard press, and so on) and mouse movement variability are relatively analogous, and both can be measured and analyzed in very similar ways.

It also links to a patent describing a novel mobile captcha invented by the author, so they might have some knowledge about the domain.

The box has made browsing using TOR insufferable! It fusses and makes me click storefronts and traffic lights until I run out of patience and close out of whatever webpage I was trying to visit. I assume it has to do with a lack of Google cookies on the browser, essentially punishing me for trying to protect my privacy.

This might surprise you, but it actually has to do with what traffic coming out of TOR looks like. Well in excess of 90% of traffic coming out of TOR is spam, bots, malicious, or some combination!

Google isn't going out of their way to punish you for trying to protect your privacy. They're trying to stop unwanted traffic. By unfortunate happenstance, you appear to be disguising yourself in the exact same way a shocking amount of bad traffic is.

Not just for Tor.

I use Firefox with a few basic extensions (Privacy badger, uBlock, Google Container) yet every time I am presented with having to pick out traffic lights over and over and over again. I usually have about 5 or 6 "challenges" before I give up and use another site.

My timezone has not changed, my IP address and rough location has not changed, my screensize has not changed, my broadband speed has not changed, and my general computer dexterity has not changed, yet I am relentlessly targeted. On chrome I never saw these challenges, but on firefox with the privacy plug-ins I am always always always challenged.

At this stage I think the only signal it is using is "is there a google cookie in this browser? and if so has the google cookie got some 'normal' looking activity logged against it?" I.e. they are checking their server-side logs for a given cookie ID and seeing if that looks normal or not (i.e. seen on google search, seen on youtube, seen ads from a variety of third parties on various different sites, mixed up with time of day and speed of viewing etc etc).

Since I have got Google in a container in Firefox, I am guessing that my google cookie is not present when the captcha loads (due to the containers and privacy badger et al) so there is no identity back in the mothership to compare me against.

For Google, you are the enemy, not even the bots.

Captcha is Google's master blow against ad blockers.

A regular user, on whom they have all the info, earns them dollars per ad impression. You, with your Do Not Track (ha! that was a joke) and privacy add-ons, make them only cents per ad impression.

You are Google's enemy. Remember this when you get stuck in captcha hell (and consequently censored from most sites until changing device/IP).

IDK. I run Firefox on many OSes, everywhere with uMatrix that blocks known trackers, ad networks and such. I don't see most ads (if any).

I rarely see the "I am not a robot" box, and hasn't seen image recognition tasks for a long-long time.

That also heavily depends on what kind of/which sites you visit.

"that also depends if you have something to hide" was said of every police state and censorship scheme.

If you were really blocking all trackers, the captcha wouldn't even work. Firefox's help page for their new tracker-blocking feature even says so.

They're on a lot of sites that I frequent.

Yup. It's insufferable. Even on sites where I'm a paying customer, I have to go through captcha garbage.

If you're a paying customer, complain to the company. Let them know their site is annoying and frustrating to use because of this.

If they lose enough customers over this, they will probably remove the captcha.

I think the Quora answer overstates what Google looks at by a wide margin. Just try to access a captcha in incognito: they won't have access to as much info as they normally have on you, and yet you're still presented with the same level of captcha (if not more of them, which is to be expected).

Sometimes just checking the checkbox is enough. Sometimes you need to identify cars and store fronts. I think the better Google knows who you are, the more likely just the checkbox is going to be enough. If you go incognito, you have to train their neural nets, if you give up your privacy, you get in for free.

The clever part from Google's perspective is that you have to trade one of these things to Google in order to get access to sites that do not belong to Google at all. Google convinced site owners to have their users pay a tax to Google.

There are many services out there that can solve Google's reCAPTCHA for fractions of a penny. When someone puts one up, they can make things more expensive, and perhaps sometimes uneconomical, but in general the cost is low (~$2.00 for 1,000 reCAPTCHAs).

When someone uses a recaptcha, they should think about why they are doing so. It's one thing to use it to save a business model, but it's another to use it to protect information that should be free anyway. The elephant in the room is government data. Many government agencies think that selling their data can be a nice source of side revenue, and a recaptcha is a good way of enforcing it. In reality, they just increase the costs for everyone, and those with means can obtain the information while those without means cannot.

Governments need to release their data, freely, without captchas or fees for single users and bulk users, no exceptions.

I've actually been pleasantly surprised at how much data /is/ available, and how much of it is available through common formats like Socrata Open Data API (for use with tools like https://github.com/xmunoz/sodapy)

The counter-argument is that they do a great job with trivial stuff like registered dogs' names, and less well with sensitive/important issues like policing.

What's the right way to leverage the platform developed for the first into the second?

> Governments need to release their data, freely

Totally agree. Fortunately the Dutch government is trying to make as much data open as they reasonably can, and regularly organise events to encourage developers to use their open APIs.

> My timezone has not changed, my IP address and rough location has not changed, my screensize has not changed, my broadband speed has not changed, and my general computer dexterity has not changed, yet I am relentlessly targeted. On chrome I never saw these challenges, but on firefox with the privacy plug-ins I am always always always challenged.

That's because Google isn't just profiling "Tor users". They're going after anyone who values privacy in any way or technology.

Simply put, you're being punished for ensuring privacy. And anybody who uses Google's captcha services is an accessory to that.

There is no Google "punishment algorithm". It's just computers being dumb.

Somebody made those computers dumb in that exact way. That's the complaint.

I think that Google is more than happy to punish people for protecting their privacy. That may or may not be the main goal, but it doesn't appear to be something Google considers a downside.

Sometimes people intentionally make computers dumb.

Same thing happens to me, same extensions involved, mostly browse incognito. I bet your suspicion is spot on.

I use chrome with Privacy Badger + uBlock Origin and I have to solve the CAPTCHAs every single fucking time. I even have to solve them multiple times. At this point I just leave a page if they have one of those captchas.

>This might surprise you, but it actually has to do with what traffic coming out of TOR looks like.

That's a massive load of bullshit. Google has a captcha challenge that only humans can solve. That alone is already sufficient to prevent unwanted traffic. That is how every captcha system works. However, Google is an exception. If you're logged in to a Google account or are using Chrome, then Google can use that information to track your captcha history. Privacy-minded people avoid Google like the plague and therefore cannot be tracked.

> Google isn't going out of their way to punish you for trying to protect your privacy.

Except this is exactly what happens. It's not "unfortunate". It works like this by design.

If Google cannot track you, the captcha will force you to do something that no other captcha system does: give you even more challenges even if you have solved them correctly. You will spend the next 5 minutes solving captchas correctly, and at the end it will tell you you've failed. This, again, is unique to Google: correct answers lead to failure. The problem immediately goes away if you let Google track you; it doesn't matter how bot-infested the network is. No other captcha system works this way.

Google is clearly doing this to get free labour for labelling their datasets, to force people to have a Google account, and to encourage them to use Chrome.

If you are using TOR, and not accepting cookies, they are going to have no way of knowing that you are the same user who just solved the CAPTCHA. Every request is going to appear to be from a new user.

If you do everything you can to prevent google from knowing who you are, don't be surprised when they behave like they don't know who you are.

Tor Browser accepts session cookies. It won't have an established google identity, but it fully supports a temporary "solved the captcha" identity.

> The problem immediately goes away if you let google track you

I took that to mean they were blocking cookies

What prevents a botnet from sharing that same session cookie?

A botnet doesn't need Tor in the first place. And you can limit the use of a single captcha solution; it's not much different from the problem of a legitimate Google account being borrowed by a bot.
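
Limiting the use of a single captcha solution can be as simple as issuing a signed, expiring token after one solve and counting redemptions server-side. A sketch of that idea (the names, limits, and in-memory store are all hypothetical; a real service would persist counts and use a standard token format):

```python
import base64
import hashlib
import hmac
import json
import os
import time

SECRET = os.urandom(32)   # server-side signing key (hypothetical)
MAX_USES = 30             # cap on reuses of a single captcha solution
_uses = {}                # token id -> redemption count (in-memory for the sketch)

def issue_token() -> str:
    """Hand out after a successful captcha solve."""
    payload = json.dumps({"id": os.urandom(8).hex(), "exp": time.time() + 3600})
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(f"{payload}|{sig}".encode()).decode()

def redeem(token: str) -> bool:
    """Accept the token only if its signature checks out, it hasn't expired,
    and it hasn't been redeemed more than MAX_USES times."""
    try:
        payload, _, sig = base64.urlsafe_b64decode(token).decode().rpartition("|")
    except Exception:
        return False
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    data = json.loads(payload)
    if time.time() > data["exp"]:
        return False
    _uses[data["id"]] = _uses.get(data["id"], 0) + 1
    return _uses[data["id"]] <= MAX_USES
```

A bot sharing one session cookie would burn through the cap quickly, which is the point.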

The tile fade-in is also egregious. The only reason that exists is to punish humans.

Well, it punishes bots in an equal amount, in that the bots have to wait longer before they can retry.

It punishes humans more than computers because computers are more efficient multitaskers. A computer can find a productive way to use the second between each tile fade in, but a human has no realistic way to productively use that second. The human sits there staring at the screen waiting, while the captcha-solving computer does other things (perhaps solve other captchas given to it through other connections.)

Slight nitpick but past captcha successes are a characteristic of cyborg accounts, which still act as a bot most of the time.

A lot of the behavior that the captcha exhibits is partly a function of feature analysis from ML models: features that may seem ridiculous to lay humans but make sense to a neural net plugged into the data.

> That's a massive load of bullshit. Google has a captcha challenge that only humans can solve. That alone is already sufficient to prevent unwanted traffic.

It's not bullshit, it just depends whether your website is being targeted directly or not. We're targeted directly and the robots hitting us are getting the CAPTCHAs solved, presumably with human help.

This sounds like it should be illegal.

I think you are partially wrong. Maybe Google is not doing this intentionally, but it also doesn't happen just because traffic is coming out of a Tor node. I am using Firefox with some of the recommended extensions from https://www.privacytools.io/ and I get asked to pick out traffic signs all the time. And yes, I am logged into Google.

I think what OP is describing is Cloudflare's decision, not Google's. Google provides the CAPTCHA API, but Cloudflare decides to flag nearly all Tor traffic and make it go through the CAPTCHA.

In the case of Cloudflare specifically, they support Privacy Pass[0], an extension that allows solving one captcha to allow you through to multiple sites without de-anonymizing or reducing the security properties that tor provides.

[0] https://blog.cloudflare.com/cloudflare-supports-privacy-pass...

Cloudflare is a good actor here; they offer the Privacy Pass extension, which generates 30 auth tokens from one CAPTCHA challenge and then uses those until it needs new tokens. Sadly, the overwhelming majority of sites don't use CAPTCHA through Cloudflare but directly through Google, rendering Privacy Pass moot.

Cloudflare is not a good actor in this: they have shown that they do not care about encryption (allowing non-HTTPS backends while showing HTTPS to the end user), and they embed trackers in verification pages (the CAPTCHAs on random pages).

Cloudflare is the scum of the internet. They've put a crazy amount of effort towards making wide swathes of the internet unusable for people trying to protect their identity and privacy. I wouldn't trust their implementation of Privacy Pass.

Sigh. We changed this so long ago yet people repeat this over and over again. Do you use the Tor Browser? Please show me a site on Cloudflare which uses CAPTCHA on Tor.

I don't know about TOR, but a couple of years ago we had a site on Cloudflare that had the CAPTCHA come up for visitors from mainland China - where the great firewall blocked the requests to Google. Chinese users were effectively locked out. We contacted Cloudflare about this and got dismissive replies.

I ended up removing a chrome extension that randomises user-agents because of this. It dramatically cut down google captchas.

Another thing that sets it off is virtual-machine usage. I can be logged into Chrome and Gmail on the same residential IP for hours, but the moment I try to search Google for a problem inside a VM, it's a minute of slow-loading captchas.

I've moved to Bing instead; that sort of wasted productivity is a burden.

This. The reality is, Google (and Cloudflare, and everyone else trying to block scrapers and malicious traffic) use heuristics that boil down to "99% of our traffic behaves like this". If you go out of your way to fall into the 1%, e.g. using Tor, disabling Javascript, randomizing your user-agent, etc., you're going to get CAPTCHAed.

Yeah, blending in seems to work better in many cases. Remember the guy who sent a bomb threat over TOR? The only reason he was caught so quickly was because he's the only person on the organisation's network to have accessed TOR before the incident.

> Google isn't going out of their way to punish you for trying to protect your privacy. They're trying to stop unwanted traffic. By unfortunate happenstance, (...)

This does not agree with my experience. I browse without cookies and severely limited javascript (using umatrix), and I also encounter the myriad of ridiculous inconveniences that the OP was referring to. On the good side, however, the web is much faster and generally less annoying.

> I browse without cookies and severely limited javascript (using umatrix), and I also encounter the myriad of ridiculous inconveniences that the OP was referring to.

Isn't this also something that many bots do (don't run javascript and don't have realistic cookies)? It seems like just another instance of reducing your distance from the "bot" cluster in agent-space.

These are exactly the kinds of behaviors that bots sometimes engage in.

It's often the website provider redirecting users to a captcha based on certain conditions.

As a webmaster I can confirm that I hard block all TOR traffic for this exact reason. 90% of this traffic is malicious robotic junk of some form.

Also, I’m just not interested in the remaining 10% of "legit" traffic from people who are aggressively paranoid about their privacy. Almost all of them ended up being dickheads who were using TOR to abuse other members of our community.

To the people who think every website should treat TOR users with respect, please understand that you are intentionally making yourself indistinguishable from the mountain of robotic junk, abuse and human dickheads. It's not my fault that you have chosen to do this, and it's not my job to provide you with tools to prove you're not a dickhead.

To the people voting me down, please understand that I am relaying factual information about my specific experience as webmaster of various large-ish regional websites. If you don't like the facts, voting them down won't change them.

...or maybe voting me down will change the facts.

Yeah, that's totally going to work.

Google seems to do the same even if you check the box while in an incognito window; I doubt the issue is TOR itself, but rather the lack of tracking data that Google has on that particular session.

Have you used the captcha on Tor? It really does feel like they're trying to punish you. They give you about four pages of "identify the traffic light", all of which are difficult for humans, then reject them and give you another four pages. Or that thing where each image fades out for about seven seconds before you can click the next one, and then another seven seconds.

Google used to HATE my VPN; I couldn't do anything on Google through it without a dozen damn pics to choose from.

My VPN must have gotten whitelisted (or cracked down on some of their traffic patterns), because that stopped.

Or did they just get better at fingerprinting us?

Nothing would surprise me.... I mostly experienced it on an android phone....

If Google wants to do that, that’s their prerogative. What pisses me off is when a bank or similar “secure” type of service forces me to train Google’s ML models in order to access my stuff. I didn’t agree to provide unpaid labor to Google.

Running uBlock Origin seems to trigger the same thing, even on a static IP. It feels an awful lot like punishment to me.

This is a horrible argument. What gave Google the right to be the moral authority of the internet? (Well, we did.) Even if 99% of exits from Tor nodes are malicious, Google should have no capability to throttle this traffic. Unless you claim most of the traffic on Tor is from bots, your argument doesn't make any sense. Captchas serve two purposes: slowing down bots and annoying humans. By putting a captcha on Tor exits, Google not only slows down a minuscule number of bots but also annoys human traffic (good or bad). It is by no means a "good" thing that Google is capable of this.

I have exactly the same experience without using Tor, living in Germany...

I personally don't care too much about the hassle, but I really don't like the idea that I'm basically playing artificial "intelligence", doing clickwork for the not-so-community-oriented efforts of Google.

> Well in excess of 90% of traffic coming out of TOR is spam, bots, malicious, or some combination!

Do you have any data on this?

An excellent, wise, and cogent question! In fact I do have data. You can find it here: https://blog.cloudflare.com/the-trouble-with-tor/

> On the other hand, anonymity is also something that provides value to online attackers. Based on data across the CloudFlare network, 94% of requests that we see across the Tor network are per se malicious. That doesn’t mean they are visiting controversial content, but instead that they are automated requests designed to harm our customers. A large percentage of the comment spam, vulnerability scanning, ad click fraud, content scraping, and login scanning comes via the Tor network.

The obvious caveats apply, of course. It's entirely possible that what Cloudflare saw at the time is no longer true and Tor is no longer mostly spam. It's equally possible that the traffic Cloudflare sees is wildly unrepresentative of what Tor traffic actually looks like, and it's mostly people worried about their privacy. This is just the data we have at the moment.

A small number of bad actors using automation can produce a lot of traffic. So although it may be true that a large portion of the requests coming from Tor exit nodes is malicious, it would be unwise to conclude that most users of Tor have bad intentions.

True, but from the perspective of an org like CloudFlare, that doesn't matter. They don't know (or care) about the user breakdown coming from Tor; they just know that the vast majority of traffic coming from it is malicious. And since part of the point of Tor is to make it hard to determine who's who, the good traffic gets binned with the bad.

I don't think anyone is concluding that.

I think a lot of people come to exactly that conclusion.

Why would they? Most people cognizant of these terms know a bot generates more traffic than a human; that's the point of most bots.

Cloudflare's documented experience aligns closely with mine; I've been limiting or blocking TOR ever since 2008 because over 90% of the traffic was malicious bots, and the majority of the remainder was malicious humans.

And when you have malicious traffic swimming in an anonymous pool, there's no practical alternative but to block all of it.

Isn't Cloudflare the org that "doesn't censor under any circumstances", and then turned around and censored white supremacists? Not that I agree with them (I DON'T!), but it was a full 180.

And isn't Cloudflare also the one that allowed booters and stressers to be online behind CF, ones that used stolen CCs, to boot?

The Tor decisions that screw users over are just the cherry on top. Especially egregious is when a captcha is demanded on even a simple static page. It seems pretty obvious what's going on here.

Everyone should censor and shun white supremacists. They have no place in modern society. When they shed their noxious views, we can all welcome them back with open arms.

OK, so you've decided that being a white supremacist is bad. I can agree with you on that, but the question remains: who gets to decide what has a place in modern society? Who decides what "modern society" even is? Today Google might decide to censor white supremacists; tomorrow it could be human-rights advocates. I think that allowing any type of censorship, even for such a noble cause as fighting racism, is a slippery slope, especially when done by a private company that is outside of our control (and governments are only marginally better).

You're trying to generalize a useful rule ("shun white supremacists") but it doesn't work in this case. I don't think we need to, either.

We're not robots. We can shun white supremacists and leave everyone else alone. This isn't a slippery slope, it's just good sense (no more white supremacists, hey!). Humankind will get along just fine if we tack on that one extra rule and all follow it.

Good thing the definition of white supremacist is commonly agreed upon and noncontroversial and absolutely isn't subject to definition creep :)

> Good thing the definition of white supremacist is commonly agreed upon and noncontroversial and absolutely isn't subject to definition creep :)

The definition is commonly agreed upon, and what "white supremacist" means is not at all controversial to most people. It certainly isn't so arbitrary as to be meaningless.

Now, the term may be misapplied at times, as any term may, but for it to be misapplied it has to have an accepted application to begin with. A term without a definition can't be subject to definition creep, and the possible creep of a term like "white supremacist" isn't that wide to begin with.

What isn't subject to definition creep?

Murder now, for some, includes abortion.

Censorship now includes, for some, private companies removing bad actors from their private systems.

Come on, it's lazy to dismiss it that way when society literally changes all the time.

mamon [banned] on Feb 14, 2019 [flagged]

Offtopic, but regarding abortion:

Whether or not abortion is murder is not about the definition of "murder"; it is about the definition of "human being".

There's no doubt that abortion involves killing a living creature; the whole pro-choice vs. pro-life debate is basically about one simple question: "is a fetus a human being?" If you answer that with "yes", then every abortion becomes a murder, plain and simple.

This also explains why there will never be a compromise between two crowds: it is logically impossible to compromise on yes/no questions.

Isn’t the compromise position essentially ‘after X weeks’, where the value of X is highly contested? (And on the binary yes/no question there’s nuances too which get debated eg if continuing the pregnancy would be a significant threat to the mother’s life)

I think they do exactly that. For example, disabling browser fingerprinting in Firefox and not being logged into Google causes the majority of sites to display the captcha, especially when using a VPN.

They could use a memory-hard hashing function like Argon2 for proof of work; it would make spamming much harder.

Not really, because spam isn't done on the spammer's hardware. Not to mention, an expensive hashing function is precisely something bots can do but humans cannot.

If you're putting constraints on Tor traffic, it's not because of raw throughput. It's because it's extremely poor quality traffic.

I see. The goal of Argon2 is not to be expensive but to be hard to parallelize. Anyway, the other points you wrote make sense.
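
The proof-of-work idea under discussion can be sketched with any memory-hard KDF. Here Python's stdlib `hashlib.scrypt` stands in for Argon2, with toy parameters (a real deployment would raise both the memory cost and the difficulty):

```python
import hashlib
import itertools

def memory_hard_hash(data: bytes, salt: bytes) -> bytes:
    # scrypt with n=2**10, r=8 touches about 1 MiB per call; real settings
    # would be much larger to make GPU/ASIC parallelization expensive
    return hashlib.scrypt(data, salt=salt, n=2**10, r=8, p=1, dklen=32)

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Search for a nonce whose hash has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = memory_hard_hash(nonce.to_bytes(8, "big"), challenge)
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    # Verification costs one hash; solving costs ~2**difficulty_bits hashes
    digest = memory_hard_hash(nonce.to_bytes(8, "big"), challenge)
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))
```

Note the counterpoint raised above still applies: when the work runs on a botnet, the cost falls on hardware the spammer doesn't pay for.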

You're absolutely right! It could even be integrated meaningfully into browsers to make it easier to work with. Something like Cloudflare's Privacy Pass (https://support.cloudflare.com/hc/en-us/articles/11500199265...) could work.

It looks really nice.

It should be the default for the Tor Browser for sure; if only a few people use it, it shrinks the anonymity set.

Nah, it was released back in 2017. I've seen it discussed periodically ever since.

The issue with just doing memory-consuming work client-side is that it only marginally slows down spamming. Spammers tend to use compromised machines they don't own. Unless you can make it prohibitively expensive to calculate something using machines you don't pay for - perhaps not a trivial ask - you wind up needing a different set of tools. This is why Google tends to look at things that will exhibit human variation rather than pure computation.

It's not that your ideas aren't good. I'm sure ARGON2 has a use here! It's that this might not be a problem easily solved by consuming more resources.

Cool, I'll try it out the next time I have a problem using Tor. You're right that Argon2 doesn't help if CPUs and RAM are free; it just makes parallelization hard.

Parallelization is easy if you have a botnet of millions of machines owned by others to draw on.

Those images are infuriating!

Click all boxes with traffic lights. Ok, well, this one box just barely contains the bottom right corner of the traffic light. Click. Nope, that little corner didn't count. Try again. Ok, well on this one, the right side of the traffic light is only barely over the line, so I won't click it. Nope, that sliver of the light mattered this time. MF!

Heh, maybe one day they can show a bunch of pictures of sand, where each subsequent pic has a grain removed, with the instructions "click on all the heaps".

Spambots will solve the Sorites paradox!

"Click all the ships of Theseus"

"Is there no ship? Close the browser window."

click on each star that is currently visible out your window :-/

I actually have a few screenshots where the task was impossible since the data was mislabeled. The latest example was "click all of the buses". It wouldn't let me continue because I wouldn't select the fire truck.

My naive assumption is that you should click the "refresh" button in these cases.

Just click whatever you suspect is needed to pass. Don't go above and beyond trying to give the actual right answer; you're just feeding some proprietary database owned by Google. QA for it is their problem.

There is some alternate (or future) reality where a google self driving car accident is blamed on bad training data from CAPTCHAs.

> " It wouldn't let me continue because I wouldn't select the fire truck."

Another one is "click the mountains". It typically won't let you through unless you click anything with trees on the horizon, even if the terrain is clearly flat. Google's robot thinks mountains are made out of wood, and any human who disagrees is labeled a robot. It's insanity.

I've recently gotten caught in one of these, where it was "click all of the bicycles" and after a few clicks (it was one of those which fade out to present a new picture) the only "bicycle" left was a bicycle-shaped street decoration. It wouldn't let me proceed unless I clicked on something, so I had to refresh to get a new task.

I assumed the infuriating ambiguity is intentional: in order to train some algorithm, they need to know what the prevailing human judgement is in dicey situations.

I don't think it's intentional -- it probably just emerges from the training process.

I'm guessing they do something like load up a batch of images and once N people agree on one, record the answer and remove it from the rotation. You end up left with the ambiguous images where people couldn't agree.
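
That guessed scheme is easy to sketch: keep each image in rotation until N users agree on a label, and retire it once they do. Images where users never agree (the ambiguous ones) are exactly what's left circulating. All names and the agreement threshold here are hypothetical:

```python
from collections import Counter, defaultdict

AGREEMENT = 5                  # hypothetical: votes needed to retire an image

votes = defaultdict(Counter)   # image id -> {label: count}
labeled = {}                   # image id -> consensus label
rotation = {"img1", "img2"}    # images still being served in captchas

def record_answer(image_id, label):
    """Tally one user's answer; retire the image once consensus is reached."""
    if image_id not in rotation:
        return
    votes[image_id][label] += 1
    top_label, count = votes[image_id].most_common(1)[0]
    if count >= AGREEMENT:
        labeled[image_id] = top_label
        rotation.discard(image_id)
        # images where users never agree stay in rotation: the ambiguous ones
```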

Then why do I keep seeing the same g-damn FIRE HYDRANT! :)

>Those images are infuriating!

And, does the pole count?

The whole thing is way more stressful than it needs to be for what it is.

I'm convinced the ambiguity is intentional. What I don't get is what answer they expect in those scenarios.

I always figure they're looking for a population consensus. They're doing image recognition at scale and these are clearly ambiguous, hard images to classify. They could easily have a few people at Google say, "I determine this is a storefront" and make that the "correct" answer, but I think they're more interested in a consensus of what most "normal" people would classify as a storefront, especially in potentially-volatile classifications where real humans might argue over the answer. They can skip the argument and just know which side will win it.

What they're actually getting, though, is the population consensus of what normal people believe Google's image classifier believes. The system incentivizes users to reinforce whatever misconceptions the classifier has.

Does this look like a mountain to you? https://0x0.st/zzvr.jpg

Google's image classifier would think that's a mountain. If you disagree, google will classify you as a robot. After failing these sort of challenges a few times the user decides to play along and tell google what they think google wants to hear, rather than the truth.

What makes you think Google's image classifer would think that's a mountain?

Especially if this is all used for learning, enough people saying "that is clearly not a mountain" would reinforce that it's, in fact, probably not a mountain. Even if I got classified as a robot, I'm not sure I would think "oh, a system designed to classify images would think this not-a-mountain is a mountain", so I definitely wouldn't double down and keep marking it as a mountain. I'd, well, not. And assume the system is at least as good at classifying the images it chooses to use as I am.

> "What makes you think Google's image classifer would think that's a mountain?"

Because every single time it asks me to classify mountains it rejects my answers if I don't click on trees on the horizon (and often trees on the horizon are the only "mountains" presented) and every single time it accepts the answer that such trees are mountains. I've gotten the mountains challenge dozens of times, the results are very consistent. If there is a group of trees on the horizon, that is asserted to be a mountain.

> "enough people saying "that is clearly not a mountain" would reinforce that it's, in fact, probably not a mountain."

Totally irrelevant because if I am trying to get through a google captcha, it's because that captcha is standing in the way of me doing something. My interest is in passing the captcha, not correcting Google's shitty image classifier. So I have absolutely no incentive to make my life harder by insisting on correct answers, and every incentive to tell Google what they want to hear.

>So I have absolutely no incentive to make my life harder by insisting on correct answers, and every incentive to tell Google what they want to hear.

I guess this is where the misunderstanding is. You don't think Google wants to hear the correct answer?

Trying to guess at what the daily/monthly flavor of "correct" is seems like it'd do more harm than good, resulting in some kind of nondeterministic guessing game of "well, trees on the horizon are probably assumed to be a mountain" that never settles on actually-correct answers (and, I'd wager, is often more inconvenient to the user than just answering correctly would be, because now there's a layer of indirection on what they think a system thinks of an image, rather than just what they think of that image).

If everyone just answered "no, that's trees" instead of a hand-wavy "I think you think it's a mountain", I feel like this captcha would be significantly easier for us humans (because we could actually give real answers), as well as less inconvenient for people who just want to pass on through and get on with whatever they were doing before a site wanted to verify they weren't a bot (because they can just, well, identify images instead of playing a game of "what does the machine think?").

> "You don't think Google wants to hear the correct answer?"

They may want it but they don't reward it. I don't care what sort of answer they want, I only care what sort of answer they accept. I'm not going to donate my time to these bastards by doing anything more than what's necessary to pass their captcha.

> "If everyone just answered 'no, that's trees' instead of a hand-wavy 'I think you think it's a mountain'"

That's just not going to happen: https://en.wikipedia.org/wiki/Prisoner%27s_dilemma

It is just a consequence of other humans also having problems with these cases. They do not mind that you have to make multiple attempts, it is just more yummy data for their bots (their machine learning algorithms are trained on this stuff).

I'm pretty convinced they're not really using these for ML training; rather, their ML algorithms have already run on these images, and they already know the images are difficult (read: ambiguous) enough to make you give up. These cases specifically only come up when they seem to think you're probably a bot (based on cookies or IP or whatever). They seem to deliberately place the photo boundaries so that they slice through whatever object they want you to look for, and they intentionally make the delays extremely long. None of this happens when they think you're probably a human and just want to throw up an extra hurdle (like when you're Googling a little too frequently from your usual browser/location).

this on so many levels.

Thankfully they'll eventually fall back to the "click the images of _object_ until there are no pictures left with a(n) _object_" style, but the ones that slice a single picture into clickable blocks are super frustrating.

It goes like this: "so you want to be anonymous and won't let us track every single thing you do? ok, then you'll help us train our AI so we can improve our self-driving cars and improve how Google Maps extracts information from Street View images"

I always figured the main point was behavioral profiling on mouse and keystroke trajectories.

They seem to keep adding categories though, which makes me suspect that it is all about ML training. Recently it's chimneys and bridges (although that one may be older).

It's always annoying though.

The likely answer is a bit of both. They use the image tests because they're still somewhat hard for computers, and then use a small percentage of the boxes as unknown probes to improve some ML model. Unfortunately, as computer vision has gotten better, they've had to make the challenges harder, to the point where they're quite low quality and sometimes count very small features as qualifying an image. My least favorite is labeling "cars", because it can be hard to tell if it wants to count cars way off in the distance through the adversarial noise they add to the images.
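
The "mostly known tiles, a few unknown probes" design described above could look roughly like this: grade the user only on tiles with known labels, and harvest clicks on the unknown tiles as training labels. Names and the pass threshold are made up for illustration:

```python
def grade_and_collect(tiles, user_clicks, known_labels, pass_ratio=0.9):
    """Grade the user only on tiles whose labels are known; harvest the rest.

    tiles: tile ids shown in the grid
    user_clicks: set of tile ids the user clicked
    known_labels: tile id -> bool (object known to be present or absent)
    Returns (passed, new_training_labels).
    """
    training = {}
    correct = graded = 0
    for t in tiles:
        clicked = t in user_clicks
        if t in known_labels:
            graded += 1
            correct += clicked == known_labels[t]
        else:
            training[t] = clicked  # unknown tile: the click itself becomes a label
    passed = graded > 0 and correct / graded >= pass_ratio
    return passed, training
```

This is also why a "wrong" click on an ambiguous tile can't always fail you: the system may not know the answer to that tile either.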

The bot-vs-human is distinguished by profiling mouse and keystroke patterns.

The image classifications that you do, however, are used to train the computer vision system.

The worst thing is that Cloudflare is using reCAPTCHA, and it's everywhere. The internet is broken at this point.

Cloudflare at least uses a scheme where you only have to solve a reCAPTCHA once and can then cryptographically prove you did, without compromising anonymity.
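
The underlying trick is a blinded token: the server signs a token it never sees, so redeeming it later can't be linked back to the captcha-solving session. Privacy Pass actually uses an elliptic-curve VOPRF, but the flavor can be shown with textbook RSA blinding (toy key sizes, purely illustrative):

```python
import secrets
from math import gcd

# Toy RSA key; real keys are 2048+ bits (and Privacy Pass uses an EC-VOPRF, not RSA)
p, q = 61, 53
n = p * q                                  # modulus
e = 17                                     # public exponent
d = pow(e, -1, (p - 1) * (q - 1))          # private exponent

token = 42                                 # client's secret token (a random hash, in practice)

# Client picks a blinding factor coprime to n
r = secrets.randbelow(n - 2) + 2
while gcd(r, n) != 1:
    r = secrets.randbelow(n - 2) + 2

blinded = (token * pow(r, e, n)) % n       # client sends this after solving one captcha
blind_sig = pow(blinded, d, n)             # server signs without ever seeing `token`
sig = (blind_sig * pow(r, -1, n)) % n      # client unblinds the signature
assert pow(sig, e, n) == token             # later, anyone can verify without linkage
```

Because the server only ever saw the blinded value, it can verify `(token, sig)` at redemption time without being able to tell which issuance it came from.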

PrivacyPass does not work.

Why not?

At least not in China.

> punishing me for trying to protect my privacy.

TOR doesn't protect your privacy, it just lumps you in with—and makes you indistinguishable from—the worst crap on the internet. If you don't want to be treated like crap, don't try to blend in with the crap.

I get the feeling that 90% of the check is whether you are signed into a Google account; otherwise you're going to click some images. I've noticed this a lot in incognito mode, where I will almost always have to do a captcha.

I don't think so. When I was traveling in Malaysia a few months ago I was always signed into my Google account, but constantly needed to fill in captchas and even got suspended from Google Scholar for a few hours for "suspicious traffic".

I guess Mozilla hasn't noticed that one yet. They've been removing captcha-bypassing add-ons from their site. And because all Firefox versions that aren't buggy require add-ons to be signed by Mozilla, distributing them through other channels is rather tedious.

what moz should really be doing is removing captchas from websites, not the captcha-bypass add-ons ;)


A very good addon against the shit from Cloudflare and Google.

Well, OK, for Tor it is understandable. But I got the same thing every time I launched GTA V.

That’s because Rockstar had/has a huge problem with stolen accounts from bots using wordlists to brute force passwords.

The difficulty is probably cranked all the way up

That's good. Tor traffic should not leak onto the open web; that just diminishes the Tor network and causes headaches for node operators.

If you care about all that, run a node without an internet exit, and also strive to make your sites available on Tor (I hate the "hidden service" nomenclature).

Have you considered an out-of-country VPN? The privacy protection may be similar for most common browsing.

Too bad more and more services block VPNs.

It's insufferable even without using Tor. And it only gets more insufferable every few months, it seems.

As a user who's constantly clicking on the crosswalk or storefront images, you can't help but think that you're essentially working for free, training Google's machine-learning models by providing them with supervised data points.

That's what reCAPTCHA always was: originally it was one known text blurb and one unsure one from scanned books and documents. Now it's street signs etc.

It's actually pretty interesting to see how the captchas have evolved as, presumably, Google decides "OK, we have enough data to consistently identify this thing" and moves on to the next challenge.

I recall the modern (non-text) captchas used to be cars pretty much every time. Then, the images started getting grainier as they apparently wanted to improve their recognition in different conditions. Then crosswalks and store fronts became quite common, eventually with the same kinds of noise distorting images. Now I've started seeing things like buses, bridges, motorcycles, bicycles, etc. It feels like they've finished getting enough data for improving Google Maps and have begun moving towards collecting data for their self-driving car projects.

Google: spins off robotic car company Waymo, announcing public-facing robotaxi service launch in 2018

Also Google: "Our standard for 'what a machine couldn't possibly do' is identifying a stop sign."

Machines have been able to beat captchas for years and years. The point is that it costs money and that's good enough to prevent free and scalable abuse.

Isn't the captcha data used specifically to train that AI? That's what they did with the old one at least where it was used to train machines to read.

Those really grainy images always seemed like they had noise added on purpose. It looked like really badly processed film grain, but if the images were enlargements, wouldn't they be pixelated instead?

Stuff you get now often requires cultural information. "Sidewalk" isn't a cross-cultural name; I'd guess almost everyone knows it, but meh. What counts as a store: is a lawyer's office a store? Also, I seem to recall I had "click on all minivans"? Not sure what one of those is, nor really what is classed as a car in the USA. Is an MPV a car? (I'd guess that's what a minivan is.) Do pedestrian crossing lights (green/red man) count as [part of] traffic lights? I've often wanted a short description of the scope of the terms they're using. Of course it never tells you if you failed, just gives you a further captcha, which it might have done anyway.

The fire hydrant one always annoyed me - in many countries stand-up fire hydrants are nonexistent or at least look very different. I guess they're banking on people having seen enough American movies/TV shows?

It's not like the words "fire hydrant" exist in those countries either, so some kind of knowledge/learning/look-up is involved anyway, no?

In the UK we call them fire hydrants and they’re typically placed underground with a small metal cover in the middle of the road and a sign on the pavement (sidewalk for Americans) saying where it is.

Remember this belongs to Google https://patents.google.com/patent/US9407661B2/en

Are you sure? On the page it says:

    Current Assignee: Juniper Networks Inc
Was it owned by Google at some point in the past?

Yeah, a lot of the storefront ones are just plain "hit things randomly until it lets you through" - how am I supposed to know if a building with some writing on it in Korean is a store or something else? Are you supposed to include the poles in traffic lights or not?

The V2 was just annoyingly badly designed because the questions were badly put.

North American food is what amused me; I was wondering how many people outside North America are going to get that one right.

> Then, the images started getting grainier as they apparently wanted to improve their recognition in different conditions.

I'd always assumed that was noise carefully tuned to throw off one machine learning model or another that was being used to beat the captcha, sort of like this: https://www.theverge.com/2017/11/2/16597276/google-ai-image-...

I think it might be the same when they switch to other types of objects (like crosswalks or bikes). Someone's model got too good, so they had to change to something else. I also get the impression that they add delays to the tile refresh before they do that.

I suspect Google now uses robots to generate captchas for humans, under the assumption their image recognizers are far better than anyone else's. They already have some very good ones for other products (self driving cars, street view) and lots of street-level city imagery. That would explain why their captchas are so difficult for humans to solve -- they're testing if you see things like their "AI," not like other humans.

>I'd always assumed that was noise carefully tuned to throw off one machine learning model or another that was being used to beat the captcha

I was thinking it was trying to dirty up the image just like the lenses on cameras get dirty. What happens to the image recognition when there's water spots, dirt, mud, etc on the lens that keeps parts of the image obscured?

The images get a lot grainier when you browse over VPN or fail the first round. So I suspect it's meant to actually throw off captcha-solvers.

What would be the training-data value in getting humans to classify images you have clean versions of, after applying simulated noise?

If you have the clean version of the image, you need to get that classified by a human - then you can throw noisy versions of it into the training set for your AI. You don’t need to ask a human, hey, I added noise to a picture of a yield sign. Is it still a picture of a yield sign?


With the possibility of almost uniquely identifying us on the web through fingerprinting... Google, of all companies is in the perfect position to know that my web request was made by me... And therefore I'm not a robot.

You can only conclude that recaptcha is a ml training exercise.

>You can only conclude that recaptcha is a ml training exercise.

They're not secretive about it


> With the possibility of almost uniquely identifying us on the web through fingerprinting... Google, of all companies is in the perfect position to know that my web request was made by me... And therefore I'm not a robot.

The article explains that this is part of what reCAPTCHA does, e.g.:

> Finally they combine all of this data with their knowledge of the person using the computer. Almost everyone on the Internet uses something owned by Google – search, mail, ads, maps – and as you know Google Tracks All Of Your Things. When you click that checkbox, Google reviews your browser history to see if it looks convincingly human.

But your point is otherwise right in that it's used for ML training, which Google admits as another commenter pointed out.

> Finally they combine all of this data with their knowledge of the person using the computer. Almost everyone on the Internet uses something owned by Google – search, mail, ads, maps – and as you know Google Tracks All Of Your Things.

Human [n]: Entity that uses Google®-brand services.

— Google Dictionary, 2020 edition

Maybe, maybe not. From time to time google forces me to prove I'm not a robot because of unusual search queries. Either my queries are unusual or I have an infected computer. The thing is, it also happens to me on my iPhone.

Or, maybe they feel I'm not pulling my own weight, seeing as I rarely ever click on ads. They probably need more monkeys to feed the beast, so I get selected to train their AI.

Edit: It could also be that I'm always running on incognito mode.

I wouldn't be surprised if soon enough you'll have to play a mini-match of Starcraft 2 as part of REcaptcha to train DeepMind.

And I find it equally interesting to think about what happened to the Google Books project after half the world spent perhaps close to a decade entering text into captchas. Is the project still going on?

Eventually you will have to park a self driving car in a tight spot for captcha.

Helping scan all the world's books as part of a plan to make them freely available to all is much easier to get behind than tuning map objects for their self driving cars. Particularly when most books that were scanned never, ever showed up on Google books.

> Particularly when most books that were scanned never, ever showed up on Google books.

If one looks at the history of Google Books one can see that they started with big ambitions, but hit copyright in quite intensive ways. That also changed their approach to other projects. Clearing all rights internationally isn't easy.

OTOH, succeeding at self-driving cars has the potential to save lives by the thousands.

And if they succeed too much they can monopolize transportation and thus mobility. Not a power I want to have in a private company. Especially not in a company from outside my country's jurisdiction. Where I can't have an impact via democratic law making process.

> Especially not in a company from outside my country's jurisdiction. Where I can't have an impact via democratic law making process.

This is false. EU governments have already placed significant restrictions and fines upon US tech companies in the past. There is no reason to believe that they won't be able to again.

Those fines are the perfect representation of the EU's lack of control over US tech giants. They are essentially opaque and impossible to influence through normal regulatory and political channels, so the only options are the big guns. You can be sure Google is not going to heavily invest in the EU tech sector under such an adversarial set-up. There's a subtle blackmail here: we push the envelope as far as it goes, and the EU can choose to submit or risk technological backwardness.

It's a great situation for the US economy but a very bad strategic position for Europe.

Don't forget the bonus: all your forum accounts can be linked to your google account, just in case you didn't use your gmail as the email address.

That's what really kills me here. This is lock-in at least as bad as the Equifax situation.

Self-destructing cookies is a solution

Not if Google has collected enough other identifying information about you to determine whether or not you're a robot.

But I'm kinda hoping that the reason I keep having to identify cars and store fronts is that my refusal of third-party cookies is causing them to have no idea who I am. But that might be the optimistic view.

In any case, I wouldn't mind if sites stopped using recaptcha.

Identifying text for public domain works that Google is making available online is much different than forcing the web to train their models for their profit.

They stopped using the text because some forum campaign promoted typing curse words instead of the unknown word. They probably started showing curse words in the rendered search highlights on Google Books.

there was a decent write up from a whitehat showing the damage, but I can't find it


I can't imagine some forum has enough traffic to meaningfully screw up their data, and they don't tell you which of the two words is the unknown word, so you're just going to fail a lot doing that.

It was very obvious which was which and you had a 50/50 chance. I can confirm this used to work and I always used a curse word plus the known word.

It was pretty big on 4chan at least at some point in the early 2010's. And the unknown word was always the hard one to read iirc.

This was extremely common on 4chan maybe 7 or so years back (I can't remember the exact year), and it worked for a very long time before anything was done about it; everyone knew to do it. Google asked for two words and presented them in two different fonts. The real word that needed transcribing was always identifiable, and you could write whatever you wanted as long as you got the second test word correct. Much fun was had.

Sounds like something 4chan would advocate.

More importantly, it's trivial to filter out everyone putting in the same swear.

4chan was doing it specifically with a six letter racial slur that's unrepeatable.

Since you're so afraid to use the word 'nigger' here's a book by a Harvard law professor you should probably read: https://en.wikipedia.org/wiki/Nigger:_The_Strange_Career_of_...

I guessed correctly: the author belongs to the one racial group for whom it's politically correct to use the word.

Better safe than sorry, there are enough people who think context doesn't matter.

So what's going to happen if you write "nigger" that you're so unwilling to write it? It's not like you'd be calling someone a nigger, you'd be using it in a clearly informative manner.

My understanding is that some folks have gotten in trouble anyway. Good luck to you!

I don't care to find out.

not really. it started with the word "penis" for the original "campaign" but since a single word was obviously ineffective the meme improved.

> they stopped using the text because some forum campaign that promoted typing cursewords instead of the unkown word. they probably started showing cursewords on the rendered search highligths on google books.

If I remember correctly, Google later on also sometimes showed two "known" words or, if they had actual other evidence that you are human, two unknown words.

Isn't there a good OSS alternative for ReCaptcha?

Captchas inherently require some security by obscurity; they're not a good fit for OSS solutions.

I'm doubtful about any good solution for having computers decide whether we are sufficiently human.

> I'm doubtful about any good solution for having computers decide whether we are sufficiently human.

Two questions:

- Couldn't a computer just temporarily hire a human to prove there is a human involved?

- Why are we using recaptcha or verifying humanity anyway? I understand stopping spam, scams, and fraud, but scraping already public data doesn't present significant harm.

Yeah, it shouldn't be used to limit public data. The main use is to prevent spammers from spamming fora or registering thousands of throwaway email addresses.

The Turing test. Have one of your ops chat with them to see if they’re a bot.

Cheaper: have a robot chat with them ...

Google Duplex?

I've been thinking about this a lot lately. Where is our compensation? It's our time and brain power training Google's AI that will one day be sold back to us. I'm really not into this.

Because Google can extract value from captchas, it makes world-class captchas and bot detection AI available to every webmaster for free. I don't know what that level of service would otherwise cost, but it almost certainly wouldn't be affordable for low-traffic blogs and the like, which would end up vulnerable using weaker captchas or trying to roll their own. Everywhere else the cost would just get passed on to users.

I don't love the compromise of paying for things with my data or by training Google's AI, but it's hard to say users aren't getting anything out of it. That said, I do miss the old reCaptcha.

> it almost certainly wouldn't be affordable for low-traffic blogs and the like

Very few low-traffic blogs that I see use (or need) CAPTCHAs. I know that the ones I run don't.

> I don't love the compromise of paying for things with my data or by training Google's AI, but it's hard to say users aren't getting anything out of it.

I don't think they are getting much, if anything out of it -- aside from being increasingly punished for defending themselves against being spied on by Google.

My personal blog has a spam filter for comments.. it's either that or captcha.. or sign in with Google/Facebook.

Often a trivial non-standard thing like "what's the name of the author?" works well enough, especially outside the English language. Spammers won't spend the time to adapt their scripts for that.

If this simple thing comes from a popular WordPress plugin, the equation for the spammer changes, of course.
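A minimal sketch of that idea (hypothetical Python; the question, the blog, and the accepted answers are made up for illustration, this isn't any particular plugin's API):

```python
# Site-specific anti-spam question: a generic spam script that fills
# forms blindly won't know the answer, while any regular reader will.
CHALLENGE_QUESTION = "What's the first name of this blog's author?"
ACCEPTED_ANSWERS = {"alice"}  # hypothetical author name

def passes_challenge(answer: str) -> bool:
    """Accept the comment only if the challenge answer matches."""
    # Normalise case and whitespace so "Alice", " alice " etc. all pass.
    return answer.strip().lower() in ACCEPTED_ANSWERS
```

The whole point is that the check is unique to one low-traffic site, so there's no payoff in automating around it.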

There's certainly a period of time where that solution is sufficient as it stops the lowest level of drive-by <form> spam.

But it also sucks the first day you get an attacker who solves it once and then spams you thousands of times.

Modern spam tools are pretty impressive these days and minimize the targeted work the human spammer needs to do in these cases. In the early 2000s, you could set a custom question and then assume no attacker is going to manually code for your little blog.

But even in 2008 I was using spam software (out of curiosity) where you could import a massive blog list, and it would pause spamjobs with failed comment submissions, let you pencil in a value for this unknown field, and then click resume.

You could also choose other actions for that field like "prompt me each time" and sit at your computer multiplexing your labor across hundreds of blogs. And that was pretty polished ten years ago.

> If this simple thing comes from a popular WordPress plugin the equation for the spammer changes, of course.

Exactly :)

My sites use a spam filter as well. I find that it's perfectly adequate.

It's the same with email for example. I've helped a friend roll out his own server because he doesn't want Google reading his emails.

Fair enough, but you won't get Google's spam filter or availability either, which your privacy was paying for.

I do this. Funnily enough one of the reasons I did it was because Google’s spam filters gave me too many false positives and my gmail account attracted enough spam that sorting through manually was a pain.

Has it been a good experience for you? What are you using?

My point was just that even if something is provided to the customer for free, doesn't mean it's easy to produce. That causes a lot of the issues my non-tech friends have with understanding the scope of work. Just because social media is free and easy to set up as a customer doesn't mean developing a social media is easy at all.

I’ve been using exim and dovecot with rspamd for spam filtering. Have two VPSs on different providers to provide MX backup properly (they’re cheap these days and for low traffic I don’t need much more than the smallest VPS). I do DKIM and SPF but not DMARC and it gets through gmails spam filter fine and passes the various other tests you can find online. Took a while to set up right (in the end I found the best route to predictability to be writing my own exim config file rather than using someone else’s template) but pretty simple to run after that - there’s some effort to make sure I keep up to date with security patches and monitor log files for anything untoward but it’s relatively small. Using letsencrypt certs so email clients have been relatively simple to set up.

Overall it’s been a good experience. I run into a few sites which when I send to them classify my email as spam or grey list my sending IP so mail doesn’t get through quickly but then I used to have the same spam problem with some sites running my own domain through google apps.

Gavin Newsom, the governor of CA, spoke about this in his State of the State yesterday: https://www.sacbee.com/news/politics-government/capitol-aler...

This book offers one set of proposals for "Data as Labor", inspired by Jaron Lanier: http://radicalmarkets.com/chapters/data-as-labor/

And there's going to be a lot of discussion of the idea at the RadicalxChange conference in March (https://radicalxchange.org/), including with Jaron himself as well as the book authors. (Disclosure: I do the conference website as a volunteer).

Are you kidding? Your compensation is all the free apps you get (Gmail, Maps, etc.) You’ll rarely see a Captcha for a paid product once they have your cc info.

I get most of my captchas while attempting to access products I've paid for already (very few at purchase time).

Imagine if those sites had to build their own captcha service instead. How much more expensive would they be?

It's actually not that hard, assuming you control your form generation. Bots usually fill in fields using the actual field name - not the label the user sees. So provide a field labelled "Age" but named "email", and simply check it contains digits. If it's got an email address in it, it's a bot.

Labels can also be obfuscated with javascript, replacing the raw HTML "Email" with "Age" on page load. Getting this right will require the bot to parse both HTML and JS, and we can force them to handle CSS too. Add a "zip" field, and hide it with complex CSS rules. If it contains a zip code, it's a bot.

If you're really paranoid, randomise combinations of distinguishable fields (name, email, phone, age and hidden fields) every time you generate the form, so even if a bot herder manually maps names to fields one time, it'll fail the next. At this stage it'll be cheaper for the bot herder to use Mechanical Turk, after which even Google's captcha is compromised.
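The server-side half of those honeypot checks can be sketched roughly like this (hypothetical Python; the field names "email" and "zip" are the made-up examples from above, not any framework's convention):

```python
def looks_like_bot(form: dict) -> bool:
    """Flag a form submission that trips one of the honeypot rules."""
    # The field *named* "email" is labelled "Age" to the user, so a
    # human types digits; a bot keying off the field name types an
    # email address.
    age_value = form.get("email", "")
    if "@" in age_value or not age_value.strip().isdigit():
        return True

    # The "zip" field is hidden with CSS, so humans never see it;
    # any non-empty value means a script filled it in.
    if form.get("zip", "").strip():
        return True

    return False
```

Note the trade-off mentioned in the replies: screen readers expose field names too, so a blind user can trip these rules just like a bot would.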

>So provide a field labelled "Age" but named "email", and simply check it contains digits. If it's got an email address in it, it's a bot.

Or a blind user who might actually rely on both labels and names. That's a bit like what arxiv does, they have hidden links that ban your ip when you crawl, but the links aren't hidden for AT users. I got myself banned that way once.

This is very hostile to people who use screen readers.

I've found that the tech industry often is. Trying to get managers to set aside time to iron out accessibility issues is like pulling teeth. Trying to get other developers to take it seriously is almost as bad. Often you count yourself lucky if the bare legal minimum is implemented.

Accessibility is very important, and if accessibility features are implemented well they'll often be useful even to people without disabilities, but do any CS/SE or code bootcamp programs take the topic seriously? I'm sure it must be taught somewhere, but it doesn't seem to be common at all. Can you even imagine 21st century university architecture department that didn't cover ADA compliance? That'd be unthinkable.

> Can you even imagine 21st century university architecture department that didn't cover ADA compliance? That'd be unthinkable.

I can easily imagine it: architecture departments from universities in other countries don't necessarily have to cover compliance with USA laws.

Wonder if someone is collecting actual data instead of listening to udfalkso, the google sales rep here.

Maybe to save a few $ on bots and spam (bandwidth and storage are very cheap today) they might be losing new users by the thousands (and traffic acquisition is far more expensive than the former).

Or they could just use one of the many free CAPTCHA applications that are around.

Off the shelf spam software like Xrumer[0] has been cracking those captchas for 10+ years.

Recaptcha isn't obnoxious for fun, it's obnoxious because this is the state of the arms race right now. There's also the challenge of creating a captcha that allows blind people in.

[0]: https://en.wikipedia.org/wiki/XRumer

Yeah, that is what annoys me. “Thanks for paying us to use our product. Now do free work for us for the privilege of using the product you already paid for!”

But I don't use those services, so that can't be my compensation.

Your compensation is access to the website that uses Recaptcha and the fewer abuse/bots that you deal with on that platform.

For example, since you're here and HN uses Recaptcha on its register/login form, it seems like the compensation was adequate.

> Your compensation is access to the website that uses Recaptcha

Which is one of the reasons why the presence of reCAPTCHA is strong push to avoid that site.

> since you're here and HN uses Recaptcha on its register/login form, it seems like the compensation was adequate.

Perhaps so. I don't remember doing a CAPTCHA to sign up, but I don't dispute that I did it. However, I've never been presented with one after signup. If I was, I wouldn't be here.

> I've been thinking about this a lot lately. Where is our compensation?

You give Google training for ML models.

Google gives the site provider the service of excluding bots from submitting the form.

The site provider gives you whatever was provided by the form you were trying to submit.

No one is uncompensated.

You might be interested what https://hcaptcha.com is doing.

I don't really understand the case where you'd use this.

First, it seems tacky scrounging for peanuts from the users' captcha work. Or it's like a product/services website showing Adsense ads. It's a cheapening message to send.

Second, since you make more money from more captcha volume, you're incentivized to maximize your use of captcha which is at odds with every complaint in this comments section about captcha. Most sites only use captcha to gate low-volume actions like register/login (e.g. HN).

They created their own Ethereum token too which always puts a bad taste in my mouth these days.

Finally, it doesn't address the upstream complaint that someone else is profiting off the user's "work" rather than the user. Though I don't find that complaint very reasonable. And a tiny fraction of a cent sounds about right. The truth is that users benefit from anti-abuse systems. The number of bots that HN's recaptcha on register/login has stopped is worth that tiny fraction of a cent to most users.

Is it somehow less tacky to give that value away to Google for free?

Sites can set the difficulty level necessary for their application. Some are under continual targeted attack, others are mainly keeping out rogue automated spambots from their comments section.

The user is typically getting a free service, a better site experience due to less bot traffic, or both. I think sharing the value of their work with the website is a fair deal.

As for using blockchain tech for ledger functions, that is all under the hood: websites can cash out to dollars as they prefer.

(disclosure: work on bot detection at hCaptcha.com)

> Is it somehow less tacky to give that value away to Google for free?

Yes, mainly because we're talking about fractions of cents. Also, it's not for free; the website and its users get a good anti-abuse measure in return.

There's a big difference between something that cannot make money and something that makes pennies for the site. But, to be fair, 99.9% of users aren't going to notice the difference in captcha branding either way unlike my example of a banner ad on a retail site.

My main reaction is that the UX incentive to minimize user exposure to captchas seems to work against the primary pull of using hcaptcha in the first place.

Though one site I can think of that has a captcha behind every action (every post) is 4chan. Maybe you can get them on hcaptcha one day. It would at least help you test your tagging system against vandalism. :)

Google's entire business model is leveraging of fractions of a cent on a mass scale

I assume you're talking about ads. But you're bidding at least cents on Adwords and making at least cents on Adsense. In Adsense's heyday, I had relatively low-volume sites making over $1 per click and paying my rent in lucrative niches.

I didn't find any pricing examples on hcaptcha's website. For all I know, people are bidding 5 cents per image.

Anyways, I definitely want to see more serious contenders in the captcha space so that we all aren't contributing to Google's middle-manning of the entire internet, and I'd like to try hcaptcha even out of curiosity.

> it doesn't address the upstream complaint that someone else is profiting off the user's "work" rather than the user

If it means no/fewer ads to support a site then the user benefits because they don't have to pay real money to keep the site up.

Wow thanks for this, I added myself to the waiting list.

If you use my referral URL I get a bump in the queue:


Can we get a hcaptcha pyramid scheme going, https://hCaptcha.com/?r=d6064d5f22b7 ...?

Your compensation is that you get to use the website, and the website's compensation for putting up a gate for their users is that they get to keep the bots out.

> Where is our compensation?

When you search for an address on google maps, that little tiny house number on the house was once a captcha image and now google knows that number so it can take you to the exact location on a map when you search for that number.

Everyone helps train the machine so when they want something from the machine then the machine is better at finding what they asked for. That seems pretty democratic to me.

reCAPTCHA provides protections to site owners for free. By using reCAPTCHA, site owners pass the cost of said protection on to their users.

disclaimer: work for google, nothing related to reCAPTCHA though. opinions are my own, etc.

So you want compensation because your data is used, along with millions of others', to train an algorithm to distinguish a bot from a real human, in order to provide a service to you? Nice.

Yes if the data is of value. They don't give this data out publicly. Open source the data or pay.

It isn’t to train their bot detection algorithm. It is to train their other efforts (e.g. self driving cars, mapping, etc.)

20% of these companies could be considered public property, because it is the public that has been feeding the algorithms their training data.

A dividend on this could probably provide for a basic income.

1e-10 cents.

Google's services, of course.

Is it worth it?

Yes, especially when they show you about 15 sets of images in a row taking 2 minutes to complete, clearly going beyond demonstrating you're human.

The Cascade Bicycle Club in the Pacific Northwest threw me into one of these multi-minute Captchell vortexes when I tried to log in to my years-old account to renew my membership and register for one of their organized rides. I was already on the fence as to whether it would be worth paying to do these rides that I have already done several times over the years. That (ironically) dehumanizing experience pushed me over the edge. I didn't complete the Captchas, didn't log in, didn't renew my membership, and didn't register for anything this year.

More and more frequently when presented with a captcha, I've been deciding I don't care enough about whatever it was to want to exchange robot training for access. Especially if they pull that shit after I've spent effort on something (a comment, say) - I will absolutely walk away and not come back.

Manipulative user-hostile websites can rot.

I think they only do that if you get the first set wrong.

Doesn't reflect my experience.

One solution is to remove Google out of your life as best as possible.

Personally I now use...

- iCloud.com instead of Gmail

- DDG for search though I do have to !g like 20 to 30% of the time for things like driving directions (from X point to Y point), local movie times nearby and flights.

I still use

- YouTube

- Google Maps some as its great for getting distance between X and Y

- Google News (is there a better substitute?)

- Google Photos (is there anything that compares?)

Hoping in time to rely a ton less on Google products.

I'm in the same boat, friend. Expelling Google from my life.

Apple Maps works for me. I appreciate that's not the case for everyone, but it's come a LONG way. I sincerely use Apple news (on iOS) and have been loving it, but appreciate it's not for everyone's use case.

Google photos.. yeah wow. There really isn't much like it. I've resigned to storing my photos myself on a private server and slowly making albums/things come together. But I have to NOT use google photos. It's too scary.

Gmail was easy

Youtube I use a fake gmail account that's not linked to me in the slightest and only use it on 1 iPad, else not logged in.

It's a quest. But I'll get there. Someone really ought to make a Google Photos competitor though, there's nothing that has the same level of polish right now.

Not a solution in this case - the problem is that companies like Cloudflare use these captchas to slow down your visit if you are using Tor or a VPN. In fact being logged in to google probably helps you bypass these checks

What is it in particular about GPhotos you find incomparable?

Do you have any alternatives? Something that works on most platforms and allows you to automatically backup photos as they are taken.

No, I don't, I was just curious -- I used to use Picasa and its facial recognition was far superior to anything else I could find.

Flickr app has auto upload from Android at least, I'd guess Flickr as Google photos closest competitor?

Dropbox has had automatic photo backup for ages and is available on most platforms.

I suspect you could replace the entire mess described in the post with: what's the logged-in account's spaminess? Followed by: what's the doubleclick cookies' spaminess?

You could further approximate that with: "How much does Google's AI think this human's time is worth in future revenue?"

I for one intentionally inject errors into their image classifier until it lets me in anyway.

I always try and click one or two extra boxes that are wrong. Sure I sometimes have to go through and confirm extra images, but you know what? I don't work for free and so I'm doing my small part to bugger up Google's data set.

I've recently made a discovery that pleases my petty side.

You know how they usually give you several questions to solve, even if you're quite convinced you solved a question correctly?

Turns out if you click randomly, they keep showing you new questions as well. If, after a handful of purposely wrong answers, you answer one correctly, they let you through.

I now purposely mess up the answers a few times. It seems neither slower nor faster than actually taking the time to do it right, but it takes less mental load, and it makes me not feel like doing slave labour for a machine.

I'm now wondering if the fact that I block third-party cookies is the reason I always have to identify cars and store fronts. Maybe I should just avoid sites that use recaptcha.

For a while there they kept saying I was wrong and it would effectively lock me out of some accounts that required it. Recently it’s stopped but it’s super annoying how ambiguous it is.

There's an xkcd for that https://xkcd.com/1897/

not trying to be facetious, isn't that (half of) the point?

Who says it's for training? ;)
