> On an unrelated note, I have found captchas that don't even work and reprompt ...

nickcw · 2023-08-17T16:59:00

> This is by design. The ones where you have to identify a bus, crosswalk, etc. are all used to train ML models. Your results are checked against other people's answers for the captcha, but sometimes you are the first person to see the image and there’s no way to check your answer, so you’ll always get served another.

Do you have a reference for this? I wouldn't have thought a process like that would be needed nowadays for training ML models.

thebruce87m · 2023-08-17T17:05:04

Not sure if there is anything directly from the horse’s mouth, but there are lots of articles describing it: https://www.techradar.com/news/captcha-if-you-can-how-youve-...

Human labels are absolutely still needed, for now at least.

mcast · 2023-08-17T17:05:33

Google was using CAPTCHA data to train its models to read addresses and street names in Street View photos back in 2012*.

https://techcrunch.com/2012/03/29/google-now-using-recaptcha...

est31 · 2023-08-18T01:22:53

Training ML models has been the thing for CAPTCHA services ever since Google bought reCAPTCHA in 2009. It was originally used to provide training data for OCR, but then switched to other data. https://en.wikipedia.org/wiki/ReCAPTCHA

And even though you can do a lot unsupervised these days, labeled data is still really useful for training ML models (often in combination with larger unsupervised corpora).

jacurtis · 2023-08-17T20:39:44

So it is actually a little different than what is noted above. I actually was told this directly from the mouth of someone who worked on this project. I don't believe it is that secret. But this is how it works.

The Captcha presents you with 9 squares. It selects an identification test at random (crosswalks, trains, buses, stoplights, etc). For this example let's say the identification test is to identify crosswalks. The squares are then filled as follows:

1) Two of the squares are requested that pass the identification test at an alpha value p < 0.05 (meaning it is more than 95% confident it IS a crosswalk).

2) One square is requested that passes the identification test at an alpha value of p < 0.01 (meaning 99%+ confident, effectively certain it IS a crosswalk)

3) One square is requested that fails the identification test at an alpha value of p < 0.01 (it is almost certainly NOT a crosswalk)

4) Two squares are requested that fail the identification test at an alpha value of p < 0.05 (it is 95% confident that it is NOT a crosswalk)

5) Three squares are requested with low confidence, p > 0.05 (it is not confident either way)
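
As a rough sketch of that 1/2/1/2/3 split (not Google's code, just my reading of the list above): I'm assuming each candidate image carries a single confidence score `p_match` between 0 and 1, so "p < 0.01" becomes "score of at least 0.99 or at most 0.01". The `Tile` type, the bucket names and `compose_grid` are all made up for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Tile:
    image_id: str
    p_match: float  # hypothetical confidence that the image shows the target


# Squares requested per confidence band, following the list above.
QUOTA = {"certain_yes": 1, "likely_yes": 2, "certain_no": 1, "likely_no": 2, "unknown": 3}


def bucket(tile):
    """Map a confidence score to one of the five bands (assumed thresholds)."""
    p = tile.p_match
    if p >= 0.99:
        return "certain_yes"
    if p >= 0.95:
        return "likely_yes"
    if p <= 0.01:
        return "certain_no"
    if p <= 0.05:
        return "likely_no"
    return "unknown"


def compose_grid(pool, rng):
    """Pick 9 tiles from the pool so that each band fills its quota."""
    by_bucket = {name: [] for name in QUOTA}
    for tile in pool:
        by_bucket[bucket(tile)].append(tile)
    grid = []
    for name, count in QUOTA.items():
        # rng.sample raises ValueError if a band has too few candidates.
        grid.extend(rng.sample(by_bucket[name], count))
    return grid
```

Calling `compose_grid(pool, random.Random())` on a large enough pool gives back 9 tiles, 6 with a known expected answer and 3 without.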

The captcha then shuffles these 9 images at random and offsets each one a little by altering the crop slightly, to prevent memorization by bots. Then it presents these 9 squares to the user, asking them to identify squares according to the identification test.
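
The shuffle-and-jitter step could look something like the following. This is only a guess at the mechanics: the crop is nudged from the image center by a few pixels and clamped to the image bounds, and the function names, tile size and offset range are my own assumptions.

```python
import random


def jittered_crop(width, height, tile_size, max_offset, rng):
    """Return a (left, top, right, bottom) box of tile_size pixels,
    nudged from the image center by up to max_offset pixels per axis."""
    left = (width - tile_size) // 2 + rng.randint(-max_offset, max_offset)
    top = (height - tile_size) // 2 + rng.randint(-max_offset, max_offset)
    # Clamp so the box never leaves the source image.
    left = max(0, min(left, width - tile_size))
    top = max(0, min(top, height - tile_size))
    return (left, top, left + tile_size, top + tile_size)


def shuffled(image_ids, rng):
    """Return the image ids in a random order without mutating the input."""
    order = list(image_ids)
    rng.shuffle(order)
    return order
```

Because the crop box changes on every request, the same source image never produces exactly the same pixels twice, which is the point of the offset.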

The captcha scores the user based on their selections on the 6 known squares. The response you give on the 3 low-confidence squares has zero impact on whether you pass or fail the test. From what I was told, you must correctly identify both of the 99% squares (one that passes the ID test and one that doesn't). That is a hard pass/fail. From there, the captcha compares your responses on the 95% squares to the expected values. It weighs that alongside other variables such as how quickly you answer, the movement of the cursor, and other behavior (such as selecting, deselecting, etc). It also uses your IP address and Google session data as part of determining the likelihood that the user is human. My understanding is that it is moderately forgiving. If the user is determined to be human based on those responses, then your responses are fed back into the confidence intervals for all of the images presented (other than the two "known" squares). Data from users who fail the captcha is discarded so it doesn't feed into the confidence metrics of the images presented.

From what I was told, you can actually incorrectly identify 2 squares and still pass the captcha. The IP address and mouse movement play a significant role in the outcome, as does your ability to identify the two known squares.
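
To make the scoring concrete, here is a minimal sketch of how I understand it. Everything here is hypothetical: `behavior_score` is a stand-in for the timing, cursor, IP and session signals collapsed into one number, and the `min_behavior` cutoff is invented. Only the hard fail on the two near-certain squares, the tolerance of 2 mistakes, and the discard-on-fail rule come from what I was told.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Square:
    image_id: str
    kind: str                 # "certain", "likely", or "unknown"
    expected: Optional[bool]  # None for the low-confidence squares


def passes(squares, selected, behavior_score, max_mistakes=2, min_behavior=0.5):
    """Decide pass/fail from the clicked image ids plus a behavior score."""
    mistakes = 0
    for sq in squares:
        if sq.kind == "unknown":
            continue  # low-confidence squares never affect the result
        correct = (sq.image_id in selected) == sq.expected
        if not correct and sq.kind == "certain":
            return False  # hard fail on either near-certain square
        if not correct:
            mistakes += 1
    return mistakes <= max_mistakes and behavior_score >= min_behavior


def collect_labels(squares, selected, passed):
    """Votes to feed back into image confidence; failed attempts are discarded."""
    if not passed:
        return {}
    return {sq.image_id: sq.image_id in selected
            for sq in squares if sq.kind != "certain"}
```

So a user who clicks both "likely" squares wrong but gets the two near-certain ones right can still pass, and only then do their answers on the other 7 squares count as votes.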

Three of the squares are entirely unknown to the bot. You are purely feeding the confidence on those images for future use in the CAPTCHA and other Google products. But there is no test where you are "guaranteed to fail" as mentioned above. Every test presented to you can be passed. There are 2 known squares which you MUST answer correctly. Your behavior, your computer data, and your answers on the mid-confidence squares are what further affect your pass/fail determination. The three unknown squares never affect your pass rate. They are filler: the captcha only watches how you interact with the filler squares, not what you actually answer.

IIsi50MHz · 2023-08-18T22:40:14

> But there is no test where you are "guaranteed to fail" as mentioned above. Every test presented to you can be passed.

Sadly, when a site with captcha has decided to fail your every attempt, there is no feedback on why. You can request new sets as much as you want, you can submit perfect results as much as you want, and you can try the alternate methods offered (such as audio captcha) as much as you want. Typically, when this happens, the following don't help: incognito/private mode, toggling extensions, adding some less direct mouse movement, force-reloading the page. Occasionally, switching to another browser, other hardware (phone, tablet, desktop), or another OS (Windows/Linux/Mac), or changing the user-agent string, might help.

dandellion · 2023-08-17T17:30:19

They're not very smart, because I've been answering them all wrong, on purpose, as a joke for years and I always get through.

ljlolel · 2023-08-17T19:00:31

Based on the captchas this obviously hasn’t been true for years