There is no arms race. 99.9% of the time Google knows if you're a robot based on your browser state. They make you label the images because it's a free way to get training data.


I see this argument a lot and used to consider it the case too, but I have to wonder if it really is. Wouldn't Google be empowering people to really mess up their training data?

If I'm trying to automate a system to fool their captcha, I'm probably producing a lot of bad answers. Or I could just intentionally feed them bad data: the fact that failing the captcha keeps letting me submit more and more inputs would let someone do that for as long as they like.

I don't know, maybe I'm missing something.


I believe they cross-check between different users. It used to work something like this on the old reCAPTCHA system (the one with the words or house numbers): they show you two words, one known to reCAPTCHA and one that isn't. If you enter the known one correctly, you're let through. The unknown one is presented to other users, and once there is enough consensus amongst users it is promoted to a known one. So it's hard to mess with the system as an individual.
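Roughly, that promotion loop could be sketched like this (a toy Python sketch; the IDs, threshold, and data structures are invented for illustration, not Google's actual code):

    from collections import Counter

    KNOWN_WORDS = {"challenge_17": "house"}   # words with trusted answers
    pending_votes = {}                        # answers collected for unknown words
    CONSENSUS_THRESHOLD = 5                   # votes needed to promote an answer

    def submit(known_id, known_answer, unknown_id, unknown_answer):
        """Gate entry on the known word; harvest the unknown word's answer."""
        if known_answer != KNOWN_WORDS.get(known_id):
            return False                      # failed the trusted check, no entry
        votes = pending_votes.setdefault(unknown_id, Counter())
        votes[unknown_answer] += 1
        answer, count = votes.most_common(1)[0]
        if count >= CONSENSUS_THRESHOLD:
            KNOWN_WORDS[unknown_id] = answer  # enough consensus: promote it
            del pending_votes[unknown_id]
        return True                           # user is let through

Note that an attacker who fails the known word never gets a vote counted, and an attacker who passes it contributes only one vote among many on the unknown word.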


> I don't know, maybe I'm missing something.

The thing you're missing is volume. Even if you assume the vast majority of people will attempt to mess up your data, when you have enough people answering you can look at the responses in aggregate and, based on patterns, disregard the bad data. It might be "expensive", but it's still worth it.
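As a toy illustration of how volume washes out sabotage (the vote counts and agreement thresholds below are made-up numbers, not anything Google publishes):

    from collections import Counter

    def aggregate_label(votes, min_votes=10, min_agreement=0.8):
        """Accept a label only when enough independent users agree on it."""
        if len(votes) < min_votes:
            return None                       # not enough data yet
        label, count = Counter(votes).most_common(1)[0]
        return label if count / len(votes) >= min_agreement else None

    # A handful of saboteurs drowns in honest volume:
    votes = ["crosswalk"] * 40 + ["banana"] * 3 + ["fire hydrant"] * 2
    print(aggregate_label(votes))             # -> "crosswalk"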


I've seen bot-detection arms races in video games first hand, and there is a lot more to gain on websites. Surely there are engineers smart enough to combat Google's bot detection, but with massive sets of labeled data Google stands a fighting chance.


Then why does Google use the world's worst captcha for their own services?


How does that work? Could we create an open source lib to replace Recaptcha?


Google has basically your entire browsing history and search history. It compares that with what is considered normal for a human, and lets you in if you pass that check.
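In spirit it's a risk score compared against a threshold. A hypothetical sketch; the signal names, weights, and threshold are pure guesses at what such a check might weigh, not Google's actual model:

    def risk_score(signals):
        """Weighted sum of 'humanness' signals (weights are invented)."""
        weights = {"account_age_days": 0.002,
                   "search_history_depth": 0.01,
                   "mouse_entropy": 0.5}
        return sum(weights.get(k, 0.0) * v for k, v in signals.items())

    def needs_challenge(signals, threshold=1.0):
        """Only show a captcha when the request looks insufficiently human."""
        return risk_score(signals) < threshold

    print(needs_challenge({"account_age_days": 900, "mouse_entropy": 1.2}))  # False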

Of course, if you run IRC bots that scrape Google with a headless browser to implement a .search command and offer link titling in IRC, and you use a separate bucket of cookies and a separate IP for every IRC bot, your bots now also have a human-looking search and browsing history, and will pass all reCAPTCHAs...
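The "separate bucket of cookies and IP per bot" part is easy to sketch with Python's requests library (the proxy addresses below are placeholders; a real setup would scrape via the headless browser instead):

    import requests

    def make_bot_session(proxy_url):
        """One isolated identity per bot: its own cookie jar and exit IP."""
        s = requests.Session()                    # Session keeps a per-instance cookie jar
        s.proxies = {"http": proxy_url, "https": proxy_url}
        s.headers["User-Agent"] = "Mozilla/5.0"   # look like an ordinary browser
        return s

    # Three bots, three cookie buckets, three (placeholder) proxies:
    bots = [make_bot_session(f"http://10.0.0.{i}:8080") for i in range(1, 4)]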



