Reverse-engineering the new “captchaless” ReCaptcha system (github.com/recaptchareverser)
216 points by InsideReCaptcha on Dec 9, 2014 | 50 comments

When I was a fifteen-year-old "internet marketer" I was able to create 60,000 Google accounts from residential IPs, despite Google's CAPTCHA system. As a fifteen-year-old kid, that's when I knew Google is not an invincible company.

The trick was that Google's backend registration logic did not validate the referrer of the signup form, or that the submitting IP address matched the one that downloaded the CAPTCHA. So I cloned the form on my own server and loaded it in an invisible iframe with all fields filled except the CAPTCHA. Then I served the CAPTCHA to the user, who solved it and clicked "submit". Then the entire hidden form in the iframe submitted and registered a Google account with the visitor's IP.

I was surprised that worked, because it was a huge vulnerability. Nowadays I would go for the bug bounty, but back then I tried to sell it for $5k -- unfortunately nobody believed I could do it and I couldn't prove it without revealing the method, so I ended up unable to take advantage of the code.

I did have about 60k accounts from random page views on blogspot blogs, but I never did anything with them. Perhaps that was the only time my inability to finish projects ironically saved me from some trouble. :)

I remember seeing you post something about this back in the day (maybe on WF?)

good times :-P

What is (was?) WF?


You probably wouldn't have gotten into much trouble: just a C&D, and you'd probably have been asked to fork over some money from your profits and sign a contract with Google saying you'll never do it again (and have "your" Google account disabled). ;)

At least that's what happened with me and my friend when we mined about 1/10th of Facebook's users, crowdsourced a bunch of info from people, and made it public and searchable by name. (We had a pseudonymous, no-login messaging feature where people would literally share information that would make you question what privacy is when n-1 people are divulging everything about you.) If I weren't a US citizen, I'd probably do it better now that I know more of what I'm up against. Maybe the landscape might be more open in the US in the future.

People moan on and on about Facebook and companies like it as if they are all-powerful (they can seem that way if you play by their rules and submit to their jurisdiction for pragmatic reasons), but by challenging those assumptions, leveraging user behaviors, and arbitraging legalities, one can probably crack some pretty big holes...

heh, you could've made some distributed service using Google App Engine, and taken full advantage of the free tier ;)

hmmmmm a litecoin miner perhaps?

I received an email from Google requesting the following:

> The code you reversed is used to protect many sites’ registration process including Google and many others. We are concerned that having your code and analysis publicly available will make it easier to build registration automation tools which will result in a surge of spam in all the services protected by this code and will affect negatively many Internet users.

> This is why we kindly ask you to temporarily remove it from GitHub so your work won’t be used for a malicious purpose which we believe was never your intended goal.

As I wasn't aware before publishing my code that the botguard was also used for this purpose (separately from ReCaptcha, in Gmail and other services), I removed the GitHub repository for now. My apologies to the honest security enthusiasts who didn't read the article, but I don't want to cause harm.

Google also invited me to visit their offices to discuss my work.

Considering that this was done using information also available to malicious parties, that it is now "out there" even if you take it down yourself, and that it is just a security-by-obscurity scheme, it does leave a bad aftertaste in my mouth.

Google is essentially trying to run code on a user's computer but doesn't want anyone to know what it's running there while it doesn't stop the "bad guys" from doing their own analysis without publishing it. I'm not saying that they're trying to do anything evil, but it just strikes all the wrong notes for me when they try to suppress information on a system they should have known is wide-open to analysis.

That's a bit strange, usually Google will just make changes that invalidate your analyses (and they can definitely make changes rather quickly.) In other words, by reacting in this way they're basically saying "we rely on security through obscurity". I was almost expecting a "you remove it, or we'll remove your site from our search results." The fact that it's a tracking script might have something to do with it...

Ouch! I was curious about looking at the implementation, now I'm even more curious. Does anybody have a mirror?

Edit: got one, just using GitHub's search.

Edit2: had to patch the decompiler to have it run on Python 2.7.8, so that it understands that long is an int.

I had forked you, but after reading this I decided to remove the git repository too.

I think reCaptcha is overused by services that should be publicly available for automation, especially in Brazil. I think this is a bad use case for reCaptcha or any other captcha system.

But I also understand that the majority of reCaptcha users are fighting against spam, and the public description of their JavaScript engine could really hurt the Internet.

Anyway... congrats on your work!

You should maybe tell them to remove their own "google cache" version as well...

Great example of "talk softly, carry a big stick." Thanks Teddy Roosevelt.

So in other words, it's a tracking script also serving as a captcha.

"Google servers will receive and process, at least, the following information:

    Screen resolution
    Execution time, timezone
    Number of click/keyboard/touch actions in the <iframe> of the captcha
    It tests the behavior of many browser-specific functions and CSS rules
    It checks the rendering of canvas elements
    Likely cookies server-side (it's executed on the www.google.com domain)
    And likely other stuff..."

At least the old captcha was a simpler image.

Some of this data is readily available in the user-agent string. The rest does require some JS execution and sending the results back though. Check this out: https://valve.github.io/fingerprintjs/

My browser fingerprint is... nonexistent, because I don't allow arbitrary JS execution without consent.

And my UA string is randomized (unless I override it for a particular site), so that doesn't do anything either.

I wish someone could come up with a plugin that allowed a "safe" subset of JS to run without consent, though.

congratulations, you're this guy: http://www.xkcd.com/1105/

until there's a significant number of people doing the same thing, you're simply "that guy in <insert geoip lookup city here> with the randomized UA", and you're infinitely more fingerprintable than just about anybody else on the internet. go to EFF's panopticlick to see how unique your fingerprint is. using an iPad with up-to-date software gets me the same fingerprint as about 18000 other people in my geoip region, i haven't been able to do better than that yet.

You misunderstand what I mean by "randomized".

It's a weighted random selection from a list of most common browser UAs, weighted by frequency of that UA.

So they can try to fingerprint all they want - all it'll do is clog up their database with useless entries.
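
For anyone wondering what "weighted by frequency" looks like in practice, here is a minimal sketch. The UA strings and the shares in `UA_SHARES` are made-up placeholders, not real usage data, and `pick_user_agent` is a hypothetical helper, not the commenter's actual setup.

```python
import random

# Hypothetical UA strings with invented usage shares (assumption: a real
# list would be built from some browser-usage dataset).
UA_SHARES = {
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36": 0.5,
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0": 0.3,
    "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko": 0.2,
}


def pick_user_agent(shares, rng=random):
    """Pick one UA string with probability proportional to its share."""
    agents = list(shares)
    weights = [shares[agent] for agent in agents]
    # random.choices performs weighted sampling with replacement.
    return rng.choices(agents, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(pick_user_agent(UA_SHARES))
```

Note this only picks the UA string; it does nothing about the other headers or browser behaviour.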

if your other browser properties don't match the UA though, you're still showing up as a unique fingerprint. You'll be the guy with an IE8 UA sending an accept:image/webp header, or the guy with a Safari UA who's following link prefetching instructions that are only valid in chrome, or something else that makes you unique.

or my personal favourite: sending the do-not-track header, something that only a small number of people send that makes you much easier to fingerprint.

The combination of UA and accept headers needs to be changed in sync. Good point - any other things like that that should be watched out for?

And DNT is currently at around 8%, so, although it does leak some information, it doesn't leak an absurd amount (~3.6 bits). (That's using data from here [1], which is FF-only. If you have a better source of data for this, please let me know.)

[1] https://dnt-dashboard.mozilla.org/
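
The ~3.6 bits figure is just the surprisal of a trait shared by 8% of users, i.e. -log2(0.08). A quick sketch to check it (the function name `surprisal_bits` is mine, and the 8% is the figure quoted above, not something I measured):

```python
import math


def surprisal_bits(fraction):
    """Bits of identifying information revealed by a trait that
    `fraction` of the population shares."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return -math.log2(fraction)


# 8% of users sending DNT, per the figure quoted in the comment.
dnt_bits = surprisal_bits(0.08)
print(round(dnt_bits, 2))  # 3.64
```

A trait shared by everyone (fraction 1.0) leaks 0 bits, which is the "be legitimately ordinary" argument in numbers.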

Any number of things can out you as a fake. Whether or not the request's Accept-Encoding includes sdch can help you figure out if something's Chrome.

You can also abuse parsing quirks to figure out which rendering engine's being used, or just try to use request-generating features that shouldn't be present in whatever browser you're saying you are (<svg>, <video>, styling on engine-specific pseudo-elements, etc.)

Here's an example[1] using just HTML+CSS that will request a different image depending on whether you use a webkit or gecko derivative. If you use neither, no image will be requested. Someone who says they're Chrome but requests Firefox's image is immediately outed as a liar.

Same thing given something like `<img src="jar:http://example.com/ewwww_jar_uri!/baz">`. Gecko will make a request to http://example.com/ewwww_jar_uri while other browsers won't since they don't support the jar URI.

I believe Mario Heiderich also posted some stuff using webkit's styleable scrollbars that could be used for fingerprinting screen sizes and how large certain elements are when rendered.

The list goes on, but my point is that fingerprinting at the rendering / layout engine level is trivial, so you're better off being legitimately ordinary if you're worried about fingerprinting.

[1]: http://codepen.io/anon/pen/YPwMmY
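
To make the header side of this concrete, here is a rough server-side sketch that flags two of the mismatches mentioned in this thread: sdch in Accept-Encoding from something that doesn't claim to be Chrome, and image/webp in Accept from something claiming to be IE. The function names and the substring-based UA sniffing are simplifications of my own; real detection would look at far more signals.

```python
def ua_family(user_agent):
    """Very rough UA classification by substring (illustrative only)."""
    ua = user_agent.lower()
    if "msie" in ua or "trident" in ua:
        return "ie"
    if "chrome" in ua:  # checked before Safari: Chrome UAs also say "Safari"
        return "chrome"
    if "firefox" in ua:
        return "firefox"
    if "safari" in ua:
        return "safari"
    return "unknown"


def inconsistencies(headers):
    """Return a list of reasons the headers contradict the claimed UA."""
    family = ua_family(headers.get("User-Agent", ""))
    problems = []
    # Rule taken from the thread: sdch is treated as a Chrome-only signal.
    if "sdch" in headers.get("Accept-Encoding", "").lower() and family != "chrome":
        problems.append("advertises sdch but does not claim to be Chrome")
    # Rule taken from the thread: an IE UA should not accept WebP.
    if "image/webp" in headers.get("Accept", "").lower() and family == "ie":
        problems.append("claims to be IE but accepts image/webp")
    return problems


if __name__ == "__main__":
    fake_ie8 = {
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)",
        "Accept": "image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
    }
    for problem in inconsistencies(fake_ie8):
        print(problem)
```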

You are doomed, this does not work.

Are your headers in the correct order for the given UA? Correct capitalisation for the given UA? Correct accept for the given UA? Correct white space around or between values for the given UA?

It is far better to appear to be the same as everyone else if you want to be anonymous (i.e. to browse on an iPad) than it is to do anything to try and not be tracked.

Anonymity today is to be invisible within the crowd, not to stand out as you are the only sheep that is shorn.

In Bulgaria, we had a joke about our communist president Todor Zhivkov, who was supposedly hiding from the fascists in the forests. So, the joke was that he was hiding... but nobody was looking for him, because he hadn't done anything to get their attention. Anyway, trying to "hide" is "security by obscurity". It's like your defense being hiding your SSN from your computer, when other weaker systems can already be exploited and your SSN could be stolen from them. Whatever you have to hide, it's already kept somewhere else in most cases. Instead, find a defense strategy that does not depend on obscurity - this is the only defense.

What do you use to do that? Thanks.

Oh, it's the botguard we know from the gmail thread:


Great analysis!

At the end of the day only google.com cookies make sense, everything else can be routinely faked, just as the author says: "Programmatically bypass the captcha by simply executing a rendering engine and automating movements of the mouse." What was Google trying to hide? They seem to have put a lot of effort into it.

They are browser/canvas fingerprinting you and me at a very large scale.

But what does that have to do with fighting bots? Bots can look like browsers.

It's enough information to come up with a unique fingerprint without using cookies.

With that fingerprint they can track your habits across multiple domains. Bots can look like browsers, but they can't necessarily browse the internet the same way a human can.
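
As a rough illustration of "a fingerprint without cookies": hash a canonical form of whatever attributes the script collects, and the same browser yields the same ID on every site. This is only a sketch; the attribute names and values below are invented, and nothing here is known to be what Google's script actually does.

```python
import hashlib
import json


def fingerprint(attributes):
    """Derive a stable ID from collected browser attributes."""
    # Sort keys so the same attributes always serialize identically.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical values for a few of the items listed upthread.
    browser = {
        "screen": "1920x1080",
        "timezone": "UTC+1",
        "canvas_hash": "3f2a9c",
    }
    print(fingerprint(browser))
```

Any single attribute changing gives a different hash, so a real tracker would presumably need fuzzier matching than this.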

I'm surprised nobody else has seen this. Google is removing the cookies because that's how advertisers track users.

No cookies? No way of tracking.

Or if you want to see it like this, they are taking something standard and turning it into something that's under their control.

What habits? There's nothing else Google can see, they only track movements inside of the recaptcha iframe. They have 0 information about how you browse the web.

> They have 0 information about how you browse the web.

You have chosen a very interesting value of zero.

They have all your interaction with every Google property, tied to your account when you're logged in and semi-persistent pseudonymous identities when you are not. They have the clickstream data between every google property (most notably, Search) and the rest of your web experience, which can (fairly easily) reveal many websites you visit. They have ga.js or AdSense tracking code running on double-digit percentages of all pages on the Internet.

If asked to, Google could provide you with as accurate a record of my flights between Japan and the US as either nation's customs agency could, simply by looking at a time series of IP addresses. Their data got radically more accurate a few years ago when I started using Google Maps with the location permission turned on.

For added giggles, Google is a SQL join away from associating my extraordinarily-well-established-but-weakly-verified Internet identity with unique identifiers like e.g. my social security number. That would probably make someone in the Borg hesitate for a few minutes, but clearly they're OK with saying "At scale, we know the huge class of people which happens to include Patrick -- who we know intimately but prefer to avoid acknowledging that fact in social settings -- is identifiable by a vector of features, and a machine can very quickly cluster a random Internet user with Patrick versus with Spammy McSpamsalot. We can thereby organize data about this person to serve their needs better, for example by giving them access to resources only for trusted people, like captcha-free whatevers. Bonus: this is one more reason why it is fun and convenient to invite Google into even more areas of their life! It's a win-win!"

I probably didn't make it clear: the recaptcha iframe provides 0 useful information about how you browse the web. Of course other Google services such as Analytics and Chrome collect more than enough about your habits. But I don't believe in "mouse tracking" inside of a 200x400 iframe; that's nonsense.

The new system is incredibly cumbersome when you have to type the captcha. It went from 2 key presses to 5 (plus actually typing the text of course). Obviously if you just need to tick it everything is fine, but for me the tick works ~5% of the time.

There is a <noscript> version of recaptcha which more or less displays the old version: https://developers.google.com/recaptcha/docs/faq#no_js

One could probably write a greasemonkey script that replaces the new captcha with the noscript version.

From the original readme:

> It turned out this new ReCaptcha system is heavily obfuscated, as Google implemented a whole VM in JavaScript with a specific bytecode language.

Saw this on /g/. Great work op.

Is there a place where I can try the new ReCaptcha? The linked Google post only has screenshots.

It was activated a few hours ago everywhere on 4chan.

The Google post also gives the examples of Snapchat (https://support.snapchat.com/login2?next=/), WordPress (https://wordpress.org/support/register.php) and Humble Bundle (https://www.humblebundle.com/).

The code is still available in many places on GitHub; many people cloned it to their repos.

Does the obfuscation serve any real purpose in improving the security?

It's similar to when malware authors pack and encode their stuff.

At the end of the day, it does not stop a determined individual from reverse engineering the code (and then publishing the technique).

BUT it does make it more difficult to understand, work with, etc. It simply reduces the number of individuals who have the skills necessary to follow everything through.

It will deter the less skilled people from attempting to "hack" the system....

> It will deter the less skilled people from attempting to "hack" the system....

... especially if slight changes are made. Google could change the smallest thing and run the lot back through their mangler and the attacker would have to go through the process of unobfuscating it all over again.

Creating an ever increasing war of mangler vs demangler.

Which is exactly the case with regular captchas except now this is less annoying for real human users.

Your username is the slang name for Carling Black Label beer in South Africa :)

Exactly, it's good beer :)

Aaaand it's down.
