
I’m not a human: Breaking the Google reCAPTCHA [pdf] - louis-paul
https://www.blackhat.com/docs/asia-16/materials/asia-16-Sivakorn-Im-Not-a-Human-Breaking-the-Google-reCAPTCHA-wp.pdf
======
kajecounterhack
[http://arxiv.org/abs/1602.02697](http://arxiv.org/abs/1602.02697)

I don't work on reCAPTCHA but I imagine they could easily + significantly beef
up image captcha by adding adversarial image examples that trip up would-be
automators.

Note that _adversarial examples generalize across models._ Meaning a good
adversarial example will work against SVMs, convnets, etc just the same.

If you're interested in the basics of how it works, Julia Evans has a great
post: jvns.ca/blog/2015/12/24/how-to-trick-a-neural-network-into-thinking-a-panda-is-a-vulture

The key is that deep neural nets are actually very (piecewise) linear and that
makes them susceptible to adversarial examples.
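
To make the linearity point concrete, here is a minimal sketch of a fast-gradient-sign perturbation against a toy logistic-regression classifier. The weights, the input, and the epsilon value are all made up for illustration; this is not the attack from the linked arXiv paper, just the simplest linear case of the idea. Each input value moves by only 0.04, but the shifts add up across 100 dimensions and flip the prediction.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def sign(g):
    return (g > 0) - (g < 0)

def fgsm(weights, bias, x, label, epsilon):
    """Shift every input value by +/- epsilon in the direction that raises the loss."""
    p = predict(weights, bias, x)
    # For cross-entropy loss, the gradient w.r.t. the input is (p - label) * w.
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model: 100 "pixels" with weights alternating +1 / -1.
weights = [1.0, -1.0] * 50
bias = 0.0
# An input in [0, 1] that leans slightly towards class 1.
x = [0.5 + 0.02 * w for w in weights]

x_adv = fgsm(weights, bias, x, label=1, epsilon=0.04)

print(round(predict(weights, bias, x), 2))      # about 0.88, class 1
print(round(predict(weights, bias, x_adv), 2))  # about 0.12, class 0
```

The per-pixel change is tiny, but the change in the model's score grows with the number of input dimensions, which is why high-dimensional image models are easy targets.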

~~~
olalonde
Or use those images: [http://imgur.com/a/K4RWn](http://imgur.com/a/K4RWn) :)

------
bluedino
Using Google's own reverse image search to break an image-based captcha.
Simple, but brutal.

~~~
neikos
Thought the same thing, that's ironic /and/ smart

~~~
proksoup
Perhaps in response Google could detect when an image it just served from
reCAPTCHA is being looked up on Google image search, and throw that search
into a captcha jail? .... That assumes they have a large enough pool of images
to be able to make that correlation, and that Google search and reCAPTCHA are
capable of sharing that level of intimate detail ... probably a lot of
problems with that, though.

------
tyingq
_" We ran our captcha-breaking system against 2,235 captchas, and obtained a
70.78% accuracy"_

That's more impressive than it sounds. I'm pretty sure 70.78% is more accurate
than I am with reCAPTCHA manually. A lot of the captchas presented are very
fuzzy, or have ambiguous questions, etc.

~~~
gboudrias
Indeed. It's good that these guys are white-hats because they could have made
a _killing_ selling to spammers (for as long as they could go undetected,
which could've been a while).

~~~
etrain
Going rate for humans breaking captchas is like $1/1000 captchas solved. A
fully automated service could make some money, sure, but not exactly a
_killing_.

~~~
totony
From the paper, "Assuming a selling price of $2 per 1,000 solved captchas, our
token harvesting attack could accrue $104 - $110 daily, per host (i.e., IP
address). By leveraging proxy services and running multiple attacks in
parallel, this amount could be significantly higher for a single machine."
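
A quick back-of-the-envelope check of what those figures imply, as a sketch. The only inputs are the numbers quoted above ($2 per 1,000 and $104 - $110 per day) plus the $1 per 1,000 rate mentioned upthread; the function name is just for illustration.

```python
def solved_per_day(daily_revenue_usd, price_per_1000_usd):
    """Number of captchas that must be sold per day to earn the given revenue."""
    return daily_revenue_usd / price_per_1000_usd * 1000

low = solved_per_day(104, 2)   # 52000.0
high = solved_per_day(110, 2)  # 55000.0

# The same daily volume sold at $1 per 1,000 instead of $2 per 1,000.
revenue_at_1_usd = (low / 1000 * 1, high / 1000 * 1)  # (52.0, 55.0)

print(low, high, revenue_at_1_usd)
```

So the quoted estimate corresponds to roughly 52,000 to 55,000 sold solutions per host per day, and about half the revenue at the lower going rate.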

------
kauegimenes
Am I the only one who thinks Google reCAPTCHA is just a tool that Google uses
to train machine learning algorithms? First it was used to help Google learn
how to read, now it's learning to detect objects, landscapes, ...

~~~
kevincox
I'm pretty sure this is the advertised purpose of it. It stops bots AND helps
machine learning.

~~~
kuschku
*AND helps a company make profit by training their proprietary models.

If they were open, it would be a lot better (considering all the training is
done by volunteers, too)

~~~
deelowe
uhh. Everything they do is arguably for profit. That's why Google is an LLC.
What do you expect them to do, put an "* this is done to earn us money"
disclaimer on everything?

~~~
bitskits
This is a little pedantic and doesn't change your point, but Google is not an
LLC. It's a C Corporation.

------
Matt3o12_
> Over the period of the following 6 months, text captchas appeared to be
> gradually “phased out”, with the image captcha now being the default type
> returned, as these captchas are harder for humans to solve despite being
> solvable by bots [3, 4].

This is so true. When I first got an image captcha, it took me 3 minutes to
learn how to solve it. I had JavaScript disabled so I did not realize I had to
do 3 pages to complete the challenge (after two, I reloaded the page and got a
new route thanks to Tor, since I thought I had messed up). Also I'm not sure
what to click when there is a partial match, is that still a match? And don't
even get me started with solving signs (in my first challenge I had to tell
apart different kinds of signs that were in the image). Is a no trespassing
sign still a traffic sign? What about if it is a sign but I don't know which
kind of sign (due to lacking domain knowledge, especially when they're foreign
signs)? Another problem is that they require you to understand the language
the challenge is in. You can still solve a text captcha if you don't speak any
English.

I wish there was a way to force text captchas.

------
bpires
Quite interesting how the authors even mention that this strategy is very
economically viable.

From the paper: 'Assuming a selling price of $2 per 1,000 solved captchas, our
token harvesting attack could accrue $104 - $110 daily, per host (i.e., IP
address). By leveraging proxy services and running multiple attacks in
parallel, this amount could be significantly higher for a single machine.'

~~~
dc2
Makes me wonder if they got paid for reporting it to Google.

------
sotrueee
You could also just pay a service that uses human workers in third world
countries. It's a little over a tenth of a cent per captcha.

~~~
amelius
Or you could set up a pr0n site that shows the material only after the user
has completed the captcha. This trick has been done before.

~~~
dspillett
I've never seen a documented case of such tricks actually being used, and I've
seen calculations that suggest the cost/benefit outcome is no better than just
paying the poor to do the menial work.

The last such analysis I paid attention to was some years ago so the situation
may have changed, but I suspect the hassles of running a porn site and CAPTCHA
proxy still aren't worth it:

* obtaining content sufficient to attract interest

* paying for bandwidth & other resources

* writing the authentication system

* then maintaining it (every time the CAPTCHA service(s) change their process you potentially need to make and test changes to your code)

* and you need to work around rate limits (depending on the CAPTCHA design it may not be possible to make the relevant requests client-side, so if the service has rate limits you'll have to route through something that sufficiently randomises your source address).

* providing support

* dealing with bad press

~~~
jamessb
The closest thing to a documented case I've seen is a report of people gaming
the Time Person of the Year poll so that the top person was moot, and the
first letters of the candidates spelt out "Marblecake. Also the game.":

 _By understanding how reCAPTCHA worked – the team was able to double their
productivity (since they usually only had to enter one word instead of two).
To further optimize their voting they created a poll front-end that allowed
you to enter votes quickly while giving you an update of the poll status (and
since it is a 4chan kind of crowd, they also provided the option to stream
some porn just to keep you company while you are subverting one of the largest
media companies in the world)._

[https://musicmachinery.com/2009/04/27/moot-wins-time-inc-
los...](https://musicmachinery.com/2009/04/27/moot-wins-time-inc-loses/)

However, this is slightly different as people were deliberately solving
CAPTCHAs (and watching porn) rather than wanting to watch porn and also
incidentally solving CAPTCHAs that they had no direct interest in.

------
mirekrusin
I guess we're all waiting for a browser extension to fill in captchas for us.
They are sometimes hard to get right, so it would be a great usability
improvement.

------
throwaway7767
I have a bad feeling CloudFlare is going to break the internet for Tor users
with impossible-to-solve CAPTCHAs again after this...

~~~
Mahn
Not that it can become much worse than it already is, though. The current
reCAPTCHAs are already "fuck this shit"-inducing on Tor.

~~~
throwaway7767
It very much can. Only a few months ago they were simply impossible. I tried
to solve 30 of them while accessing a site and was still unable to. The
current set are merely extremely annoying and time-consuming, but they can
consistently be solved (at least by English speakers).

~~~
appenin
They can be consistently solved, the only problem is the endless repetition
and "multiple answers needed" nonsense, which only seems to apply to Tor
users.

------
JimmyM
I may be being overly-cautious, but does anyone have a summary of this that
isn't a PDF from blackhat.com?

edit: Thanks for both responses!

~~~
liviu-
TL;DR from
[https://www.reddit.com/r/netsec/comments/4dvifg/im_not_a_hum...](https://www.reddit.com/r/netsec/comments/4dvifg/im_not_a_human_breaking_the_google_recaptcha/d1uppxn)
(though you may be overly-cautious as blackhat.com seems to be the webpage of
a computer security conference [1]):

"""

Live attack. To obtain an exact measurement of our attack’s accuracy, we run
our automated captcha-breaker against reCaptcha. We employ the Clarifai
service as it shows the best result among other services.

Labelled dataset. We created a labelled dataset to exploit the image
repetition. We manually labelled 3,000 images collected from challenges, and
assigned each image a tag describing the content. We selected the appropriate
tags from our hint list. We used pHash for the comparison, as it is very
efficient, and allows our system to compare all the images from a challenge to
our dataset in 3.3 seconds. We ran our captcha-breaking system against 2,235
captchas, and obtained a 70.78% accuracy. The higher accuracy compared to the
simulated experiments is, at least partially, attributed to the image
repetition; the history module located 1,515 sample images and 385 candidate
images in our labelled dataset.

Average run time. Our attack is very efficient, with an average duration of
19.2 seconds per challenge. The most time consuming phase is running GRIS,
as it searches for all the images in Google and processes the
results, including the extraction of links that point to higher resolution
versions of the images.

"""

[1]
[https://en.wikipedia.org/wiki/Black_Hat_Briefings](https://en.wikipedia.org/wiki/Black_Hat_Briefings)
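
For anyone curious what the image-repetition lookup in that summary might look like, here is a rough sketch. It uses a simple average hash rather than the DCT-based pHash the authors mention, assumes images have already been shrunk to 8x8 grayscale grids, and the tags, the distance threshold, and the function names are all hypothetical.

```python
def average_hash(pixels):
    """Hash an 8x8 grayscale grid: one bit per pixel, 1 if brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(a, b):
    """Number of bit positions where two hashes differ."""
    return sum(x != y for x, y in zip(a, b))

def lookup(image, labelled, max_distance=5):
    """Return the tag of the closest labelled hash, or None if none is close enough."""
    h = average_hash(image)
    best_tag, best_dist = None, max_distance + 1
    for known_hash, tag in labelled:
        d = hamming(h, known_hash)
        if d < best_dist:
            best_tag, best_dist = tag, d
    return best_tag

# Two made-up "previously seen" images with hand-assigned tags.
gradient = [[r * 8 + c for c in range(8)] for r in range(8)]
checker = [[255 * ((r + c) % 2) for c in range(8)] for r in range(8)]
labelled = [(average_hash(gradient), "mountain"),
            (average_hash(checker), "street sign")]

# A slightly brighter copy of a known image still matches its tag.
brighter = [[p + 3 for p in row] for row in gradient]
print(lookup(brighter, labelled))  # mountain

# An image unlike anything in the dataset falls through to None.
split = [[0 if c < 4 else 255 for c in range(8)] for r in range(8)]
print(lookup(split, labelled))  # None
```

A miss like the second case is where a slower fallback such as reverse image search or an image-tagging service would have to take over.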

------
eveningcoffee
I must admit that I have not been able to pass any of the text reCAPTCHA
challenges lately when I have wanted to access the more sensitive content.
Either I am getting them wrong or my submits get ignored.

------
dylanpyle
One of the more interesting references linked here was
[https://github.com/neuroradiology/InsideReCaptcha](https://github.com/neuroradiology/InsideReCaptcha)
— apparently the reCAPTCHA JavaScript snippet uses an encrypted bytecode
format running on a custom-built VM, which one intrepid person has decompiled.

------
peternicky
Anyone else annoyed with the grammar errors in the paper?

~~~
unpixer
"We automately searches Google for certain terms, followes links from the
results, watches video on Youtube, searches on Google Maps, visites popular
websites that contains Google plus plugins and widgets."

It's like reading a security paper authored by Gollum.

------
mpeg
I worked on a little piece on recaptcha a while ago[0].

I'm glad to see research around how this captcha method is actually as
breakable as any, but the real problem continues to be that there is no
separation between the data Google collects for advertising, and the data
Google collects from the captcha service.

Their privacy policy says they can link the captcha to your other online
identity and use it to target you. It's especially cringeworthy to see lots of
sites of dubious legality implement reCAPTCHA (like thepiratebay).

[0]: [http://www.businessinsider.com/google-no-captcha-adtruth-
pri...](http://www.businessinsider.com/google-no-captcha-adtruth-privacy-
research-2015-2)?

------
chinathrow
I don't get why, when and how Google uses reCAPTCHA within their own tools.
E.g. within the Webmaster Tools, I can submit up to 500 URLs for manual
fetch/render and subsequent index submission. So the rate limit is already
there and reasonable.

However, after 4 submitted URLs, I get a reCAPTCHA. From then on for every
URL, I have to complete it with additional visual quizzes.

~~~
janesvilleseo
Another way they get triggered is when people use browser/desktop based rank
checkers.

There are also plugins some SEOs use to pull lots of requests. These tools are
quite old now and not very useful, but people still use them.

~~~
at-fates-hands
Google recently killed off its page rank toolbar - which most people in SEO
used to measure their success with the sites they worked on.

Google has confirmed it is removing Toolbar PageRank:
[http://searchengineland.com/google-has-confirmed-they-are-
re...](http://searchengineland.com/google-has-confirmed-they-are-removing-
toolbar-pagerank-244230)

------
chadscira
I have used services like 2captcha.com to get solving costs down to $0.5-$1
per 1000 solved captchas.

Google's reCAPTCHA is hardly an effective solution... Also, if Google wanted
to, they could just automatically verify people without them clicking that
checkbox, because at the end of the day they already know whether they are
going to auto-verify you or make you pass a test.

~~~
byuu
Someone should really make a web browser plugin that automatically solves
CloudFlare, Google and Facebook captchas for people using ::gasp:: VPNs. I
would happily pay $1 for every 1000 captchas served at me to be solved for me,
even if it took the page a bit longer to load for me still. As long as the
latency wasn't too much higher than it'd take me to solve it myself.

Just like with DRM, captchas only seem to punish legitimate users and perhaps
_very_ small-time bad actors.

