Ask HN: Why are we expected to train Google's NN for free through captchas?
46 points by mstipetic on Feb 10, 2021 | 57 comments
I keep seeing more and more captchas whose obvious purpose is to train their self-driving NN, and they're getting out of hand. Sometimes I have to work through 5+ images. How is it legal that they can just interrupt me on a whim and ask me to do work for them? Is there an alternative solution available?



Take a step back and consider the owners of the websites you are visiting. They need a way to filter out spammers and decided to use captchas, which is their choice.

As a consumer, it’s your choice to not support them if you feel it has become too burdensome to fill out the captcha.

The legality of this shouldn’t even be a question. No one is forcing you to use these sites.


This argument falls apart as soon as you think about the power imbalance between you as a person and Google as the behemoth tech company that has tens of thousands of websites using its captcha system.

This is why we have consumer protection laws


Google isn’t forcing these websites to use ReCaptcha. There are plenty of alternative captcha systems out there. Honestly, I don’t see how Google factors into this equation, and I’m fairly anti-Google.


Google isn't forcing anyone to use ReCaptcha, but in my state many government websites use it. So I actually am forced to use it. This is even more true with Covid, where many in-person services have transitioned to online-only.

What's worse is that ReCaptcha isn't an honest CAPTCHA. Instead, ReCaptcha relies in large part on building a tracking profile. If you take privacy-enhancing measures, such as clearing cookies and using a VPN, then you can't pass ReCaptcha. It will present you with an endless loop of "please try again" no matter your answer.


You can roll your own captcha if you want. It's just that website owners find it convenient to use reCAPTCHA or hCaptcha instead. Google hosts a service website authors find useful and at the same time benefits from it; it's a win-win really.


No it doesn't.

This doesn't harm consumers. They are very slightly inconvenienced, likely less so than if websites were forced to use other methods to achieve the same thing.


It's not really Google's power imbalance that's the issue here because Google is providing a service to the website operator. The power imbalance is because the website that uses Google's captcha system doesn't have an alternative and forces you to transact with what is essentially a 3rd party subcontractor.


> The legality of this shouldn’t even be a question. No one is forcing you to use these sites.

That argument would work if government sites didn't use reCAPTCHA in their online services.

Yes, you could probably complete the process by phone, but that's akin to telling people to boycott Facebook when it's the place to find out about events and participate in community discussions.

reCAPTCHA is EVERYWHERE. There are no viable alternatives. Asking people to boycott reCAPTCHA is asking them to boycott a huge chunk of the net.

Besides, this 'think of all the poor website operators' argument is bunk. Bots are going to happen whether the operator likes it or not. Better to lean in than out.


hCaptcha is very good. It’s trustworthy. For example, Cloudflare uses it.

https://www.hcaptcha.com/


hCaptcha makes money by using the website user as free labor.

In a way, if a website is doing this instead of ads, it's better. Still, I would prefer more control over how I repay a website for content. Less acceptable cases exist as well: I ended up cancelling auto-payment for a service that recently started putting up an hCaptcha for every login attempt.

Here's the relevant section from their website:

hCaptcha is designed to solve the most labor intensive problem in machine learning: labeling massive amounts of data in a timely, affordable, and reliable way.

More data generally produces better results in training machine learning models. The recent success of deep models has led to increasingly large datasets, almost always with some human review. However, creating large human-reviewed datasets via Mechanical Turk, Figure Eight, etc. is both slow and expensive.

hCaptcha allows websites to make money serving this demand while blocking bots and other forms of abuse.

Powered by HUMAN Protocol

The hCaptcha marketplace is powered by the HUMAN Protocol, an open decentralized protocol for human review that runs on the Ethereum blockchain. Websites earn Human Tokens (HMT) whenever users use the hCaptcha widget on their site, and machine learning companies pay Human Tokens to get their data labeled.


Maybe I should have been more clear in the question. I'm OK with filling out a captcha, I understand the internet is a messy place. But in this case we're literally being forced to train their neural networks for self-driving cars. I'm OK with silly captchas.

Now you're opening up a bigger question: why are these massive companies, which have hit on a basically infinite source of money, allowed to artificially subsidise unrelated industries and kill off any potential for innovation there by offering things below cost or for free?


So you’d be okay to fill in a captcha as long as you only entered irrelevant data that didn’t have any positive meaning or effect?


Yes.

Also, come to think of it, I'd like some assurance that I'm not being bothered more if I'm using privacy enhancing tech and basically being corralled into behaving a certain way.


You’re free to develop a better alternative and market it.


Some work-related and paid services have that, e.g. Discord, Nintendo.


If you think about it a bit, it should be obvious that nothing is being trained on your results. You're not being exploited for training data, because computers are already better at identifying traffic lights from street view images than you are.

The image recognition task has not changed for a very long time. How many captchas do you think have been solved by users in the last 5 years, all the same kind of recognition problems on the same kind of data? Trillions? It would be way past any kind of diminishing returns for this task to get more labels. If there was value there, they would have at least switched to a different kind of task.

In addition to that, they're trying to get sites to move to Recaptcha v3, which does not have the user solve any kind of challenge. Why would that be if the answers had any kind of value?

The fundamental problem with using captchas for training is that it undermines their value as a security tool, since it sets up conflicting incentives. And security is what Google is selling. (Yes, selling. Recaptcha Enterprise costs money to use at scale, and the list price seems to be $1 / 1000 calls.) The people paying for that service are using it to block attackers and let good users through, not to challenge the good users just to get some more labeling data from them.


"computers are already better at identifying traffic lights from street view images than you are" -- I'm still better than they are at identifying traffic lights IRL though :-)


First, computers are not better than us at identifying those things. Our brains evolved for that, and it is a massively complex problem for a computer.

Sure, we have models that can do it on static images while running on a server, but the problem they are trying to solve is not Street View. The problem is training ML models capable of doing this in real time in an autonomous car.

Taking a step back, identifying things in images is a very complex problem for a computer. The best way in this day and age is to use ML, but you need to train it, and for that you use a bank of known images. That is what they use reCAPTCHA for: to build the image banks that they will later use to train and evaluate ML models.

Second, they can use reCaptcha for that and still provide all those things that you mentioned.

Another thing: they are not using the same set trillions of times. They are growing the image set over time, and reCAPTCHA is what they use to classify the new images. They dump new images into the set, and every time you see a reCAPTCHA it shows you a few known images and a new, unknown one that needs to be classified.

Your input alone says nothing about that image. But when hundreds of people say, for example, that there is a traffic light in an image, while getting all the other known traffic light images right, there is a high likelihood that the picture contains a traffic light. Once they have high confidence in the image's classification, after hundreds or even thousands of captchas, they move it to the known image set.
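For readers who want to see the mechanism described above in concrete form, here is a minimal Python sketch of consensus labeling. It is only an illustration of the commenter's description, not how reCAPTCHA is actually implemented: the `LabelPool` class, the `MIN_VOTES` and `MIN_AGREEMENT` thresholds, and the image IDs are all hypothetical.

```python
from collections import defaultdict

# Hypothetical thresholds; the values any real service uses are unknown.
MIN_VOTES = 100        # votes required before an image can be promoted
MIN_AGREEMENT = 0.9    # share of votes that must agree on the label


class LabelPool:
    """Tracks known labels and collects votes on unknown images."""

    def __init__(self, known):
        self.known = dict(known)                  # image_id -> bool label
        self.votes = defaultdict(lambda: [0, 0])  # image_id -> [no, yes]

    def submit(self, answers, unknown_id):
        """Process one solved challenge; return True if the user passed.

        answers maps image_id -> bool for every image shown.
        """
        # A vote is trusted only if every known image was answered correctly.
        passed = all(
            answers[image_id] == label
            for image_id, label in self.known.items()
            if image_id in answers
        )
        if not passed:
            return False
        if unknown_id in answers and unknown_id not in self.known:
            self.votes[unknown_id][int(answers[unknown_id])] += 1
            self._maybe_promote(unknown_id)
        return True

    def _maybe_promote(self, image_id):
        """Move an image to the known set once enough voters agree."""
        no, yes = self.votes[image_id]
        total = no + yes
        if total < MIN_VOTES:
            return
        if max(no, yes) / total >= MIN_AGREEMENT:
            self.known[image_id] = yes > no
            del self.votes[image_id]
```

A single answer never decides anything here; only a large, consistent majority from users who also got the known images right moves an image into the known set.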


A relevant blog post titled "You (probably) don’t need ReCAPTCHA" [1] was shared recently.

[1]: https://nearcyan.com/you-probably-dont-need-recaptcha/


They provide an otherwise complicated service for free to the owner of the page you are visiting.

Blame the pages you go to.


Though I agree with the sentiment, I've encountered some paid sites that still have captchas, bizarrely.


I can’t stand CAPTCHAs. I assumed I’ve been seeing them more and more because I use a password manager and websites flag this as a possible bot. Anyone else?


I assume it’s because I’m using macOS/iOS content blocker and it’s blocking stuff the websites were using to identify me.


I use first party isolation in Firefox so I trip the worst possible challenge for every captcha.

I think it's awful we've allowed Google to trick us into placing trackers on every site in existence that also force us to do free work for them. There has to be a better solution.


Also, they're getting significantly harder/adversarial as bots get better. I've never failed these picture Captchas as often as I have the last three weeks.


I noticed they also get harder on a VPN and I encounter more of them with non-chrome browsers.


How would a website know you use a password manager?


I'd guess a website can trigger off how fast a password is entered in the input field.

That said, I'd expect a large chunk of users to use their browser's password manager, if nothing else.


Password managers can also have a "go & fill" option that opens a new tab with the applicable URL, fills the login credentials, and submits. Even without specifically tracking how fast a password is entered into the password field, a website could trigger for all instances when a user manages to log in within 1 second of the first request.
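As a rough illustration of the timing idea above, a site could record when a session first requested the login page and compare that against the time of the login submission. This is a sketch under assumptions: the `LoginTimer` class and its method names are hypothetical, and the one-second threshold is simply the figure from the comment.

```python
import time

FAST_LOGIN_SECONDS = 1.0  # threshold taken from the comment above


class LoginTimer:
    """Flags logins submitted very soon after the first page request."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the logic can be tested
        self._first_seen = {}    # session_id -> time of first request

    def page_requested(self, session_id):
        # Keep only the first request time for each session.
        self._first_seen.setdefault(session_id, self._clock())

    def is_suspiciously_fast(self, session_id):
        started = self._first_seen.get(session_id)
        if started is None:
            # The form was submitted without ever being loaded.
            return True
        return self._clock() - started < FAST_LOGIN_SECONDS
```

A heuristic like this would also flag legitimate password-manager users, which is consistent with the experience described in this subthread.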


Interesting: So does Chrome. Sometimes. I've never tracked down what causes it to do this, but there's at least one site I use it with where it clicks "submit" (or some moral equivalent) on its own when filling out the login page.

Probably something the website opted into. No clue.


I wonder what is different between your online fingerprint and mine, in generalities.

I have not seen a captcha in months if not longer, I don't remember doing one anytime recently.

I have browser and network level ad and malware blockers in place and sometimes come upon sites I cannot access at all, but never unusable ones due to no captcha.

But, I also have a Google account linked to chrome that stays logged in, and I wonder if that allows me to avoid them.


IME, using Firefox (with or without a Google account) is a sure fire way to get shown a ReCaptcha.


A while back there was a concept for a captcha I liked, but it seemed to vanish. It was called something like "StupidFilter" and asked a free form question. If you answer the question correctly, you can then post data on the site. That doesn't seem complicated to me and if each site had their own unique set of questions it might be harder to make a generic bot to bypass it. The reason I like this method is that you can tailor it to the audience of your site. PC modding site? Have entry-level PC modding questions. Bots would have to become domain content aware. Not perfect, but I don't need perfect.

Asking the web developers on here, how difficult would it be to make something like that? Or does that already exist?
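On the "how difficult" question, the core of such a filter is small. Below is a minimal Python sketch, assuming a hand-written question bank for a PC modding site; the `QUESTIONS` entries and the function names are made up for illustration, and a real deployment would still need session handling and rate limiting.

```python
import secrets

# Hypothetical question bank for a PC modding site: id -> (prompt, answers).
QUESTIONS = {
    "psu": ("What does PSU stand for?", {"power supply unit"}),
    "paste": ("Which chip does thermal paste usually go on?",
              {"cpu", "the cpu", "processor", "the processor"}),
    "ram": ("What does RAM stand for?", {"random access memory"}),
}

_rng = secrets.SystemRandom()


def pick_question():
    """Return (question_id, prompt) for a randomly chosen question."""
    question_id = _rng.choice(sorted(QUESTIONS))
    return question_id, QUESTIONS[question_id][0]


def normalize(text):
    """Lowercase and collapse whitespace so minor typing differences pass."""
    return " ".join(text.lower().split())


def check_answer(question_id, answer):
    """Return True if the free-form answer matches an accepted answer."""
    entry = QUESTIONS.get(question_id)
    if entry is None:
        return False
    return normalize(answer) in entry[1]
```

The site would store the question ID in the visitor's session when rendering the form and call `check_answer` on submission. With a small fixed bank, a bot author only has to learn each answer once.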


Once upon a time I hosted a very small phpBB with a couple hundred members and a few posts a day. We had domain specific questions for our captcha, and they cycled between 3-4 different questions. Within a month, the bots had been programmed to answer all the questions on our registration. Being obscure and domain specific did not help.


How about domain-specific pictures? "What is in the picture below?" Then put up a picture of a CPU heat sink, using the previous example, and have many pictures to cycle through. Maybe change them up every couple of months. Or number the pictures/questions and have the script use a different batch every week. You just need 52 batches and use ImageMagick to change the checksum and quality randomly. Then change them out once a year. I mean, they can update the bot every week, but there must be some really amazing reason to go through that effort if your site is attracting such people. I could see that happening to a gambling site, but probably not the average site.
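The weekly rotation described above could be sketched roughly as follows in Python. The function names and the quality range are assumptions for illustration. The command uses ImageMagick's `-quality` option and is only built here, not executed; on ImageMagick 6 installs the binary is `convert` rather than `magick`.

```python
import datetime
import random

BATCH_COUNT = 52  # one batch of pictures per week, as suggested above


def batch_for(day):
    """Map a date to a batch index in the range 0..51 via its ISO week."""
    iso_week = day.isocalendar()[1]      # 1..53
    # Some years have an ISO week 53, which wraps around to batch 0.
    return (iso_week - 1) % BATCH_COUNT


def reencode_command(src, dst, rng=random):
    """Build an ImageMagick command that re-encodes at a random quality.

    Re-encoding changes the file bytes, so the checksum differs each time.
    """
    quality = rng.randint(70, 95)        # hypothetical range
    return ["magick", src, "-quality", str(quality), dst]
```

A cron job could call `batch_for(datetime.date.today())` once a week and re-encode that batch's pictures with `subprocess.run(reencode_command(src, dst), check=True)`.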


The thing that really ticks me off about those captchas -- besides that it seems I'm more likely to have them forced on me when I'm on a VPN, so presenting an AWS IP to their server -- is that sometimes they think I'm wrong about, e.g., which square has an image with a bicycle. No! It's YOU, captcha algo, that are incorrect! Your images are wrongly labeled! Grrr, so annoying :-(


I find the requests are often vague. "Check each square with a traffic light" -- does that mean just the light itself, or does it also mean the supporting structure? Also sometimes the images are lacking context. "Click each square with a train" and one of the images is a close up of some metal carriage with a window. Could be a bus, could be a train.


This just seems self-centered. Why do we expect access to everything on the internet for free? Captchas are hardly a significant price.


The price being letting one central entity monitor which sites everyone visits, and block access if they like? For example, if you try to avoid their spying by using Tor, you find that half the web is now blocked.

Or did you mean the 30 seconds-1 minute of time to solve the captcha (assuming it doesn't infinitely keep going)?


And it does that, sometimes.

No bigger middle finger than an indefinite unsolvable captcha!


I wasn't accessing Google's content when being interrupted. I'm happy to solve a meaningless captcha, but all the images presented are literally "find the traffic lights, find the crosswalks, find the bicycle riders." It's obvious what their purpose is.


Why are you angry at Google when you weren't on Google's page? The owner of that page decided to use the captcha there, if Google didn't offer it they would use another service. This is like being angry at your internet provider because you saw ads on the internet.

BTW I'm happy people aren't spending their time on meaningless tasks, at least it's useful for something.


I've run into captchas on sites that I pay for.


Me too. deezer.com treats their paying customers like sh* :(


I wonder about the quality of the data they're getting from all the random people training them. A few days ago I had to answer one by "selecting all the parking meters" and it wouldn't let me continue without also selecting a picture of what was clearly a rural mailbox on a post.


You're not required to do anything - you can simply close the page. It's as legal as any other code on that page the owner decided to put there. You aren't entitled to use the page in your way.


For free sites, I largely agree. I do find it odd, however, if this also happens on a site where I pay for the account or for a service. At the very least that site should recognise me with some confidence and not have reCAPTCHA rely on all the tracking around the world. Sometimes I need to fill in 7-8 sets of those puzzles in order to be able to log in to order groceries...


Then I recommend you stop paying for that service.


Yep. I've done this. I explained to the CSR when asked why I wanted to cancel that it was because I found the website's captcha unnecessary and too hard to use.

I'm sure they didn't care, or just thought I was tilting at windmills, but if I'm one of many that cancel for that reason, the service might understand it's costing them money.


A few days ago I thought of building my own captcha system. Was wondering, how would you test a captcha system? Like, how can I get a few clever bots to try and break it to see if it works?


Why do you expect to have a say in it? They have power, you don't.


To commenters asking why do paid sites use them: because fraud is real and the fact that you pay doesn't mean you aren't a bot.


Why would this be illegal?


It is a way to prove you’re human and get useful work out of it. Is your issue that you’re giving free labor to Google specifically or would you have an issue irrespective of the company?


The non-google ones usually waste more of my time, so by comparison I am pretty chill with the Google ones.


My issue is doing work for free for a company. I see no reason to help them in their chase of a 100B+ market opportunity for free. Some kinds of "silly" captchas I might be OK with, if a website chooses them. But sometimes they interrupt (at least while I was using Chrome) in the middle of browsing and block off a part of the internet for me until I do as they ask.


This sounds more like you have some kind of malware that's getting you to solve captchas so they can sell the results?



