
Reverse-engineering the new “captchaless” ReCaptcha system - InsideReCaptcha
https://github.com/ReCaptchaReverser/InsideReCaptcha
======
chatmasta
When I was a fifteen-year-old "internet marketer" I was able to create 60,000
Google accounts from residential IPs, despite Google's CAPTCHA system. That's
when I knew, as a fifteen-year-old kid, that Google is not an invincible company.

The trick was that Google's backend registration logic did not validate the
referrer of the signup form, or check that the submitting IP address matched the
one that downloaded the CAPTCHA. So I cloned the form on my own server and loaded
it in an invisible iframe with all fields filled except the CAPTCHA. Then I
served the CAPTCHA to the user, who solved it and clicked "submit". The
entire hidden form in the iframe was then submitted, registering a Google account
with the visitor's IP.
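
The missing check described above (tying a CAPTCHA to the IP that downloaded it) could look roughly like the sketch below. This is only an illustration of the idea, not Google's actual logic; `SECRET_KEY`, `issue_challenge_tag` and `verify_submission` are hypothetical names, and a real deployment would also need expiry and replay protection.

```python
import hashlib
import hmac

# Hypothetical server-side secret; a real service would load this from config.
SECRET_KEY = b"example-secret"


def issue_challenge_tag(challenge_id, client_ip):
    """Return an HMAC tag binding a CAPTCHA challenge to the IP that fetched it."""
    message = f"{challenge_id}|{client_ip}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def verify_submission(challenge_id, tag, submitting_ip):
    """Accept the form only if the submitting IP is the one bound into the tag."""
    expected = issue_challenge_tag(challenge_id, submitting_ip)
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected, tag)
```

With a check like this, a CAPTCHA fetched from one address and solved from another would fail verification, which is the gap the comment describes.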

I was surprised that worked, because it was a huge vulnerability. Nowadays I
would go for the bug bounty, but back then I tried to sell it for $5k --
unfortunately nobody believed I could do it and I couldn't prove it without
revealing the method, so I ended up unable to take advantage of the code.

I did have about 60k accounts from random page views on blogspot blogs, but I
never did anything with them. Perhaps that was the only time my inability to
finish projects ironically saved me from some trouble. :)

~~~
timboslice
I remember seeing you post something about this back in the day (maybe on WF?)

good times :-P

~~~
neo_optimus
What is (was?) WF?

~~~
us0r
WickedFire

------
InsideReCaptcha
I received an email from Google requesting the following:

> The code you reversed is used to protect many sites’ registration process
> including Google and many others. We are concerned that having your code and
> analysis publicly available will make it easier to build registration
> automation tools which will result in a surge of spam in all the services
> protected by this code and will affect negatively many Internet users.

> This is why we kindly ask you to temporarily remove it from GitHub so your
> work won’t be used for a malicious purpose which we believe was never your
> intended goal.

As I wasn't aware before publishing my code that _botguard_ was also used for
this purpose (separately from ReCaptcha, in Gmail and other services), I have
removed the GitHub repository _for now_. I apologize to the honest security
enthusiasts who didn't get to read the article, but I don't want to cause harm.

Google also invited me to visit their offices to discuss my work.

~~~
the8472
Considering that this was done using information also available to malicious
parties, that it is now "out there" even if you take it down yourself, and that
it is just a security-by-obscurity scheme, it does leave a bad aftertaste in my
mouth.

Google is essentially trying to run code on a user's computer but doesn't want
anyone to know what it's running there while it doesn't stop the "bad guys"
from doing their own analysis without publishing it. I'm not saying that
they're trying to do anything evil, but it just strikes all the wrong notes
for me when they try to suppress information on a system they should have
known is wide-open to analysis.

------
spacefight
So in other words, it's a tracking script also serving as a captcha.

"Google servers will receive and process, at least, the following information:

    
    
        Plug-ins
        User-agent
        Screen resolution
        Execution time, timezone
        Number of click/keyboard/touch actions in the <iframe> of the captcha
        It tests the behavior of many browser-specific functions and CSS rules
        It checks the rendering of canvas elements
        Likely cookies server-side (it's executed on the www.google.com domain)
        And likely other stuff...

"

At least the old captcha was a simpler image.
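
To make the tracking concern concrete: attributes like the ones quoted above can be collapsed into a single stable identifier. The sketch below is an assumption about how such a combination might work in general, not a description of what Google's script does; the attribute names and sample values are made up.

```python
import hashlib
import json


def fingerprint(attributes):
    """Hash a dict of browser attributes into a stable hex identifier."""
    # sort_keys makes the serialization independent of key insertion order.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical values, loosely mirroring the quoted list.
sample = {
    "plugins": ["pdf-viewer"],
    "user_agent": "ExampleBrowser/1.0",
    "screen": [1920, 1080],
    "timezone_offset_minutes": -60,
    "canvas_hash": "abc123",
}
```

Calling `fingerprint(sample)` twice yields the same value, while changing any single attribute yields a different one, which is what makes the result usable as an identifier without cookies.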

~~~
crxgames
Some of this data is readily available in the user-agent string. The rest
does require some JS execution and sending the results back, though. Check this
out:
[https://valve.github.io/fingerprintjs/](https://valve.github.io/fingerprintjs/)

~~~
TheLoneWolfling
My browser fingerprint is... nonexistent, because I don't allow arbitrary JS
execution without consent.

And my UA string is randomized (unless I override it for a particular site),
so that doesn't do anything either.

I wish someone could come up with a plugin that allowed a "safe" subset of JS
to run without consent, though.

~~~
notatoad
congratulations, you're this guy:
[http://www.xkcd.com/1105/](http://www.xkcd.com/1105/)

until there's a significant number of people doing the same thing, you're
simply "that guy in <insert geoip lookup city here> with the randomized UA",
and you're infinitely more fingerprintable than just about anybody else on the
internet. go to EFF's panopticlick to see how unique your fingerprint is.
using an iPad with up-to-date software gets me the same fingerprint as about
18000 other people in my geoip region, i haven't been able to do better than
that yet.

~~~
TheLoneWolfling
You misunderstand what I mean by "randomized".

It's a weighted random selection from a list of most common browser UAs,
weighted by frequency of that UA.
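
A minimal sketch of that kind of weighted selection, assuming a made-up table of UA strings and usage shares (`UA_SHARES` and `pick_user_agent` are hypothetical names, and the strings are placeholders rather than real browser UAs):

```python
import random

# Placeholder UA strings mapped to assumed usage shares (in percent).
UA_SHARES = {
    "ExampleBrowser/3.0": 60,
    "ExampleBrowser/2.0": 30,
    "ExampleBrowser/1.0": 10,
}


def pick_user_agent(rng=random):
    """Pick one UA string, with probability proportional to its share."""
    agents = list(UA_SHARES)
    weights = list(UA_SHARES.values())
    # random.choices performs weighted sampling with replacement.
    return rng.choices(agents, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which is handy for testing.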

So they can try to fingerprint all they want - all it'll do is clog up their
database with useless entries.

~~~
notatoad
if your other browser properties don't match the UA though, you're still
showing up as a unique fingerprint. You'll be the guy with an IE8 UA sending
an accept:image/webp header, or the guy with a Safari UA who's following link
prefetching instructions that are only valid in chrome, or something else that
makes you unique.

or my personal favourite: sending the do-not-track header, something that only
a small number of people send that makes you much easier to fingerprint.

~~~
TheLoneWolfling
The combination of UA and accept headers needs to be changed in sync. Good
point - any other things like that that should be watched out for?

And DNT is currently at around 8%, so, although it does leak some
information, it doesn't leak an absurd amount (~3.6 bits). (That's using data
from here [1], which is FF-only. If you have a better source of data for this,
please let me know.)

[1] [https://dnt-dashboard.mozilla.org/](https://dnt-dashboard.mozilla.org/)
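
For reference, the ~3.6 bits figure is the surprisal of an 8% event, i.e. -log2(0.08). A small sketch of that arithmetic (`surprisal_bits` is just an illustrative helper name):

```python
import math


def surprisal_bits(probability):
    """Information revealed, in bits, by observing an event of this probability."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


dnt_on = surprisal_bits(0.08)   # roughly 3.64 bits for the 8% sending DNT
dnt_off = surprisal_bits(0.92)  # roughly 0.12 bits for everyone else
```

The asymmetry is the point of the objection: sending a rare header reveals far more than not sending it.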

~~~
anglebracket
Any number of things can out you as a fake. Whether or not the request's
Accept-Encoding has sdch can help you figure out if something's Chrome.
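
A rough sketch of that check, assuming (as the comment does) that sdch is only advertised by Chrome. The function names are hypothetical and the UA test is a deliberately naive substring match:

```python
def accepted_encodings(header_value):
    """Return the content codings an Accept-Encoding header lists with q > 0."""
    encodings = set()
    for item in header_value.split(","):
        parts = [p.strip() for p in item.split(";")]
        coding = parts[0].lower()
        if not coding:
            continue
        q = 1.0  # a coding without an explicit q-value defaults to 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        if q > 0:
            encodings.add(coding)
    return encodings


def sdch_mismatch(user_agent, accept_encoding):
    """Flag a request that advertises sdch while not claiming to be Chrome."""
    advertises_sdch = "sdch" in accepted_encodings(accept_encoding)
    return advertises_sdch and "Chrome" not in user_agent
```

A request with a Firefox UA and `Accept-Encoding: gzip, deflate, sdch` would be flagged, while the same header with a Chrome UA would not.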

You can also abuse parsing quirks to figure out which rendering engine's being
used, or just try to use request-generating features that shouldn't be present
in whatever browser you're saying you are (<svg>, <video>, styling on
engine-specific pseudo-elements, etc.)

Here's an example[1] using just HTML+CSS that will request a different image
depending on whether you use a webkit or gecko derivative. If you use neither,
no image will be requested. Someone who says they're Chrome but requests
Firefox's image is immediately outed as a liar.

Same thing given something like
`<img src="jar:http://example.com/ewwww_jar_uri!/baz">`.
Gecko will make a request to
[http://example.com/ewwww_jar_uri](http://example.com/ewwww_jar_uri) while
other browsers won't since they don't support the jar URI.

I believe Mario Heiderich also posted some stuff using webkit's styleable
scrollbars that could be used for fingerprinting screen sizes and how large
certain elements are when rendered.

The list goes on, but my point is that fingerprinting at the rendering /
layout engine level is trivial, so you're better off being legitimately
ordinary if you're worried about fingerprinting.

[1]: [http://codepen.io/anon/pen/YPwMmY](http://codepen.io/anon/pen/YPwMmY)

------
majke
Oh, it's the botguard we know from the gmail thread:

[https://news.ycombinator.com/item?id=8275970#up_8277858](https://news.ycombinator.com/item?id=8275970#up_8277858)

Great analysis!

------
homakov
At the end of the day only google.com cookies make sense; everything else can
be routinely faked, just like the author says: "Programmatically bypass the
captcha by simply executing a rendering engine and automating movements of the
mouse." What was Google trying to hide? They seem to have put a lot of effort
into it.

~~~
spacefight
They are browser/canvas fingerprinting you and me at a very large scale.

~~~
homakov
But what does it have to do with fighting bots? Bots can look like browsers.

~~~
lojack
It's enough information to come up with a unique fingerprint without using
cookies.

With that fingerprint they can track your habits across multiple domains. Bots
can look like browsers, but they can't necessarily browse the internet the
same way a human can.

~~~
homakov
What habits? There's nothing else google can see, they only track movements
_inside_ of recaptcha iframe. They have 0 information about how you browse
web.

~~~
patio11
_They have 0 information about how you browse web._

You have chosen a very interesting value of zero.

They have all your interaction with every Google property, tied to your
account when you're logged in and semi-persistent pseudonymous identities when
you are not. They have the clickstream data between every google property
(most notably, Search) and the rest of your web experience, which can (fairly
easily) reveal many websites you visit. They have ga.js or AdSense tracking
code running on double-digit percentages of all pages on the Internet.

If asked to, Google could provide you with as accurate a record of my flights
between Japan and the US as either nation's customs agency could, simply by
looking at a time series of IP addresses. Their data got radically more
accurate a few years ago when I started using Google Maps with the location
permission turned on.

For added giggles, Google is a SQL join away from associating my
extraordinarily-well-established-but-weakly-verified Internet identity with
unique identifiers like e.g. my social security number. That would probably
make someone in the Borg hesitate for a few minutes, but clearly they're OK
with saying "At scale, we know the huge class of people which happens to
include Patrick -- who we know intimately but prefer to avoid acknowledging
that fact in social settings -- is identifiable by a vector of features, and a
machine can very quickly cluster a random Internet user with Patrick versus
with Spammy McSpamsalot. We can thereby organize data about this person to
serve their needs better, for example by giving them access to resources only
for trusted people, like captcha-free whatevers. Bonus: this is one more
reason why it is fun and convenient to invite Google into even more areas of
their life! It's a win-win!"

~~~
homakov
I probably didn't make it clear - the recaptcha iframe provides 0 useful
information about how you browse the web. Of course other google services such
as analytics and chrome collect more than enough about your habits. But I don't
believe in "mouse tracking" inside of a 200x400 iframe; that's nonsense.

------
bobcostas55
The new system is incredibly cumbersome when you have to type the captcha. It
went from 2 key presses to 5 (plus actually typing the text of course).
Obviously if you just need to tick it everything is fine, but for me the tick
works ~5% of the time.

~~~
the8472
There is a <noscript> version of recaptcha which more or less displays the old
version:
[https://developers.google.com/recaptcha/docs/faq#no_js](https://developers.google.com/recaptcha/docs/faq#no_js)

One could probably write a greasemonkey script that replaces the new captcha
with the noscript version.

------
taha_jahangir
From original readme:

> It turned out this new ReCaptcha system is heavily obfuscated, as Google
> implemented a whole VM in JavaScript with a specific bytecode language.

------
kyrre
Saw this on /g/. Great work op.

------
guan
Is there a place where I can try the new ReCaptcha? The linked Google post
only has screenshots.

~~~
InsideReCaptcha
It was activated a few hours ago everywhere on 4chan.

The Google post also gives the examples of Snapchat
([https://support.snapchat.com/login2?next=/](https://support.snapchat.com/login2?next=/)),
WordPress
([https://wordpress.org/support/register.php](https://wordpress.org/support/register.php))
and Humble Bundle
([https://www.humblebundle.com/](https://www.humblebundle.com/)).

------
chetangole
The code is still available in many places on GitHub; many people cloned it to
their repos.

------
warcode
Does the obfuscation serve any real purpose in improving the security?

~~~
mox1
It's similar to when malware authors pack and encode their stuff.

At the end of the day, it does not stop a determined individual from reverse
engineering the code (and then publishing the technique).

BUT it does make it more difficult to understand, work with, etc. It simply
reduces the number of individuals who have the skills necessary to follow
everything through.

It will deter the less skilled people from attempting to "hack" the system....

~~~
zamalek
> It will deter the less skilled people from attempting to "hack" the
> system....

... especially if slight changes are made. Google could change the smallest
thing and run the lot back through their mangler and the attacker would have
to go through the process of unobfuscating it all over again.

~~~
coldcode
Creating an ever increasing war of mangler vs demangler.

~~~
yincrash
Which is exactly the case with regular captchas except now this is less
annoying for real human users.

------
opless
Aaaand it's down.

