Hacking and defeating Google's reCAPTCHA with a 99% accuracy (dc949.org)
224 points by ahmadss on June 14, 2012 | 80 comments



One exciting thing about this: the entire model of reCaptcha (at least the text ones; I assume the audio ones are similar) is to make people do useful work when solving captchas by having them complete tasks that they consider too hard for computers to do well (in the text reCaptcha case, OCR). If someone writes software that can defeat the captcha, it does mean the security model is broken, but it also means the state of OCR technology (or audio recognition or whatever) has been advanced, and the digitization of books that had previously required human intervention can now be accomplished by automated means. In other words, spammers are incidentally creating the tools to expand the scope of digital human knowledge. Win-win, really.


Unfortunately, this attack does nothing to advance the state of the art in OCR (or audio recognition). It's basically the same story as every other CAPTCHA attack to date: take advantage of some accidental statistical regularity in the generation function. As soon as this kind of flaw is discovered, it only takes a few hours for the generation code to be patched in such a way that completely prevents this sort of attack from working.


So either the code is easy to patch, or we DO advance. Win/Win?


Not really... Even if the code is difficult to patch, speech/audio recognition doesn't advance much when an attacker figures out how to remove the (non-random) noise added by a machine over the sound file. Actual speech recognition relies on the ability to filter out background noise - which is a lot more complex/random - added by surroundings, not a machine.

It's very difficult to generate noise algorithmically that a) humans can filter out and b) can't be removed by some other algorithm. As a result, audio captchas are a huge vulnerability and the weakest link in almost any captcha system, yet accessibility law means you can't simply get rid of them.

Hypotheticals aside, the code was easy to patch - note the footnote: > In the hours before our presentation/release, Google pushed a new version of reCAPTCHA which fully nerfs our attack.


Could one take real recorded noise and add that rather than noise generated via algorithm? Wouldn't that force attackers to solve a real problem (removing background noise from a speech sample)?


It's not really solving the "real" problem... If I'm just mashing two audio files together, that's going to be different from someone talking in the middle of a train platform, and there will likely be algorithmically determinable differences between the artificially generated words and the naturally generated noise.

All of this aside, removing background noise is not a huge issue anymore. We have pretty decent noise-cancellation technology. Speech recognition - the other big component - has advanced a lot in recent times and is actually pretty good, although not for every company/product.

Even if it would help, you'd have to record an incredible amount of noise in the first place, seeing as you're getting millions of hits a day; with a small sample set, the attackers will just figure out the solutions to that sample set and be done.

I'm not saying it's impossible, but I am saying it's probably not worth it at this point. Captchas (in their traditional forms) don't make sense as a long-term strategy anyways.


Yeah, you would think they could record thousands of hours of real world noise then randomly use sections of it on each audio captcha.


If the attacker manages to obtain all the random noise, they could index every window in the noise in a k-d-tree and perform an efficient nearest neighbour search for the exact background from the CAPTCHA audio, and then simply subtract the background, giving perfect segmentation in O(log(N)) asymptotic average time complexity for N windows (at 64kHz and 2000 hours of audio, N=460800000, log N = 19.95).
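As a rough sketch of that lookup (purely illustrative: the window length and file names are made up, and a corpus this size would not actually fit in a single in-memory tree):

```python
# Sketch of the parent's idea: index fixed-length windows of the leaked noise
# corpus in a k-d tree, find the nearest window for each chunk of the CAPTCHA
# audio, and subtract it. Window size and file names are hypothetical.
import numpy as np
from scipy.io import wavfile
from scipy.spatial import cKDTree

WIN = 512  # samples per window (made up for illustration)

def windows(signal, win=WIN):
    n = len(signal) // win
    return signal[:n * win].astype(np.float32).reshape(n, win)

_, noise = wavfile.read("leaked_noise_corpus.wav")  # hypothetical file
_, captcha = wavfile.read("captcha_challenge.wav")  # hypothetical file

tree = cKDTree(windows(noise))            # O(N log N) build over N windows

cleaned = []
for w in windows(captcha):
    _, idx = tree.query(w)                # O(log N) average per lookup
    cleaned.append(w - tree.data[idx])    # subtract the matched background
cleaned = np.concatenate(cleaned)
```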


That kind of Win-Win is also a Lose-Lose. It's just a glass-half-empty thing.


> If someone writes software that can defeat the captcha, it does mean the security model is broken, but it also means the state of OCR technology (or audio recognition or whatever) has been advanced, and the digitization of books that had previously required human intervention can now be accomplished by automated means.

No it doesn't. reCaptcha only checks one of two words it displays (the other one being what OCRs can't handle themselves), so naturally you only need to crack one and input garbage as the other, thus actually making the world a worse place.


Probably not substantially worse. You would need to be able to tell which was which, since they're not consistently ordered and the "good" one is deliberately obfuscated/smudged/whatever, and, since recaptcha depends on multiple users agreeing on the right answer, you and other attackers would need to be consistent in your garbage in order for it to make it into the canonical book transcript.


For ReCAPTCHA, you only have to type the known word almost-correctly. You can just type anything for the word they don't already know. Even if someone had tech that could pass visual ReCAPTCHA reliably, it wouldn't need to be capable of doing useful work.


Are the audio versions sourced from recordings that need transcription as the visual versions are sourced from scanned documents? I assumed that was just to improve access.

If not, then it isn't useful work, really.


Looking at how they hacked it, we can safely deduce the audio version is only there to improve access.

It's useless for transcription, because it 'validates' phonetically. This is part of the attack: whether the audio is "wagon" or "van", the 'word' "wagn" validates. Same for "Spoon" and "Teaspoon", which can both be validated by entering "Tspoon" (since "Ts" can be said "Ss" as in "Tsunami" (Sunami) and "T S" as in T-Rex).


That's a shame. I would imagine that a productively similar audio task could be devised around humans helping to transcribe hard-to-auto-transcribe audio recordings, and that it could use a similar strategy to the visual one: a known-good clip and an ambiguous one, requiring the user to transcribe both.


Well, I'm not a linguist or anything like that, but I'm not sure it would work so well, if it was useful at all.

1/ Homophony. Knowing how to write a word you hear often needs context. Giving the user two whole sentences to listen to is too long and takes too much time. 'Right, but no need for a whole sentence, a few words suffice.' Right, but if the computer knows where to cut those sentences, it can also transcribe them itself.

2/ You have to assume good spelling from the user.

3/ Is it useful? I mean, 100M recaptchas are done every day; what fraction of those are audio recaptchas? 0.0005%? Less? Transcribing two sentences every week makes no sense. Keep in mind that homophony plus bad spelling are two factors that hugely increase the number of times the same 'unknown clip' has to go through 'human validation' before we can assume, with a certain level of confidence, that it has been transcribed.

Funny 4/: take a look at the [cc] button on some YouTube videos: on-the-fly transcription. Thanks, Google. :) Oh, and also on-the-fly translation of the on-the-fly transcription, btw. Google even said they were working on voice-to-voice translation for Google Voice: an English speaker calls a Chinese speaker, the English voice is transcribed, then translated, then synthesized, and the same the other way. :)


After so much work, gotta love the footnote here: "Note: In the hours before our presentation/release, Google pushed a new version of reCAPTCHA which fully nerfs our attack."


This isn't a huge problem for security people. Indeed, at most academic security conferences, by the time you present your work it has usually already been broken and/or countered.

The goal of good research (and one thing that differentiates researchers from criminals) is to present a proof of concept and to advance the state of security. The fact that security is a perpetual arms race is incidental.

Well, at least that's what security researchers tell themselves anyway to avoid going mad. :)


What makes it funny is that the bulk of the article isn't "here's how we did it and what we learned," but "here's what you need to do to get our code running on Ubuntu." Then you get to the footnote: "PS: It's pointless to get our code running on Ubuntu."

I spent the whole article wondering why they were so interested not only in presenting a proof of concept, but in getting as many people as possible actively breaking captchas. Then I got to the end and switched to wondering whether the whole thing is an elaborate prank.


Yes, I thought this particular link was perhaps not the best way to present the work, mainly because the interesting part is actually this: "We accomplished this with a combination of Machine Learning, hashing methods, keyspace reduction tactics, and taking advantage of an overall limited number of captchas. Specifically, Stiltwalker goes head to head against reCAPTCHA'S audio captcha system and defeats all but a sliver of it's challenges."

On the other hand, it looks like they provide a corpus (http://www.dc949.org/projects/stiltwalker/stiltwalker-corpus...) [1.5 GB!] that you can still use to run the program.


Yes, I almost missed that. It should be highlighted more clearly, not buried in a footnote, since it undermines the impact of the research...


The research is just as solid as it ever was; the patch just minimizes its immediate, real-world effects.


I'd be willing to wager that Google has many different versions of ReCaptcha sitting on ice, to be shifted into production if a flaw in the current system is found. I've noticed some different image captchas floating around too.


Still, the important takeaway for webapp developers is never to rely on just one form of protection. Add timestamp checks, honeypots, etc., in addition to the captcha, and use them appropriately for your application.
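A minimal sketch of the honeypot-plus-timestamp idea (the field name and five-second threshold are made up; a real app would also sign the timestamp):

```python
# Minimal sketch of layering checks before the captcha: a hidden "honeypot"
# field that humans never fill in, plus a minimum time between rendering the
# form and submitting it. Field name and the 5-second threshold are made up.
import time

MIN_SECONDS_TO_FILL = 5  # humans rarely finish a form faster than this

def looks_like_a_bot(form_fields, rendered_at, now=None):
    now = now if now is not None else time.time()
    if form_fields.get("website", ""):           # honeypot field, hidden via CSS
        return True
    if now - rendered_at < MIN_SECONDS_TO_FILL:  # submitted implausibly fast
        return True
    return False                                 # still run the captcha etc.

# Usage: stash time.time() (ideally signed) in the form when rendering it,
# then call looks_like_a_bot(submitted_fields, rendered_at) on submission.
```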


Alas, I run a handful of websites, and nearly nothing works beyond stopping the most egregious automated bots. Every new protection technique seems to get broken fairly fast. The best results so far have come from combining basic WAF techniques (poisoned form fields renamed to look like regular fields, randomized names for the standard form fields, one-time hash values in required fields, etc.) with some form of captcha.

However, every day five or six make it through. Observing their patterns and timing, and the fact that they make zero mistakes while interacting very slowly with the site, I can only presume that humans are directly involved as well, either as mechanical turks or simply manually posting the spam.

Notably, they follow normal, guided flows through the site, not like robots at all: they hit pages that would obviously be interesting to a human (say, ones linked by a certain type of image content versus another) while avoiding the least popular pages, and they never hit URLs that are hidden via CSS, etc.


Yup; human spammers are awful. Labor is so cheap now that it's financially viable to pay people to spam sites in some cases. The Internet is a sad place indeed.


This may be very interesting to crack, but who is responsible for Google making their CAPTCHA almost impossible for humans to decipher now? I seriously have to click 5 times before even seeing anything resembling letters I can parse.


I didn't go in depth on the article/method, but from what I've read it takes advantage of the audio function, not the images.


This simply means computers are getting really good at this game. And that Google, with all its power, hasn't found a better alternative to Captchas yet.

I find that pretty worrying for the future of the internet. An internet without working captchas will probably be full of bots and spam.


Naive Bayes spam classifiers are fast enough that I think they could be used as drop-in replacements, and they'd be well-trained too considering Google's existing success with spam in email. And Naive Bayes isn't the only solution, there are plenty of other heuristics. Even something as dumb as "first post from this user/IP/'identity' and full of links?" would catch a lot of common spam.
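As a minimal sketch of the Naive Bayes idea with scikit-learn (the tiny training set is made up purely for illustration; a real deployment would train on a large labelled corpus):

```python
# Minimal Naive Bayes text classifier for spam filtering (scikit-learn).
# The training examples below are made up for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "cheap pills buy now http://spam.example",   # spam
    "work from home, earn $$$ fast",             # spam
    "interesting writeup, thanks for sharing",   # ham
    "I disagree with the second paragraph",      # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["buy cheap pills now", "nice analysis of the attack"]))
```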


I really wish HN required a comment with a downvote on accounts with more than 100 karma (where one can reasonably assume the user isn't a newbie leaving empty comments like "I agree"). I'd ask that person to think it through: why doesn't Gmail require a captcha every time you want to send an email? What about other mail providers? What about your own native client? "An internet without working captchas will probably be full of bots and spam." Is your inbox full of bots and spam? Mine isn't. Even my spam box has fewer than 800 messages over a month; before one of the big botnets that was generating most of the world's spam was taken down a few years ago, I still had fewer than 4000 over a month.


Is this an argument that future web services may depend more heavily on identity/reputation services?


You only have to type one word correctly, and they make sure one word is readable, so with some practice you can easily recognise which word is the key word.


With practice and decent eyes. Once your eyes go downhill, it is amazing how much changes.

Most software developers are under 40 and therefore have little to no appreciation for what happens to people's eyes after they are past 40.


He isn't speaking of reCAPTCHA but of Google's nearly impossible-to-read captchas, which take way too much time to decipher.

If you'd like to see it you can usually get it by putting in a bad password to a gmail account too many times (though I don't know if that has other consequences).

Edit: Here is an image example http://www.techian.com/wp-content/uploads/captcha.png


I wonder what they do for non-Latin locales?


Strange they don't use reCAPTCHA for this.


Oh those, yeah they are very hard to decipher.


For me, the key word is usually the unreadable one. I've never had success typing only the readable one.


After reading that previously on HN, I tried it and sadly had to type both words.


Agreed. They are unbearable. But name a better alternative!

Anything I can think of adding to the CAPTCHA, such as a half-line of letters above and below, or extra noise, would probably make cracking them programmatically even easier.
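As a toy illustration of that kind of clutter (Pillow-based; every size and count here is arbitrary, and real captcha generators are far more involved):

```python
# Toy illustration of adding clutter (a strike-through line and speckle noise)
# to a text image, in the spirit of the additions discussed above.
import random
from PIL import Image, ImageDraw, ImageFont

img = Image.new("L", (200, 70), color=255)        # white canvas
draw = ImageDraw.Draw(img)
draw.text((30, 25), "example", fill=0, font=ImageFont.load_default())

# a wobbly line through the text
draw.line([(0, random.randint(20, 50)), (200, random.randint(20, 50))],
          fill=0, width=2)

# speckle noise
for _ in range(300):
    draw.point((random.randrange(200), random.randrange(70)), fill=0)

img.save("cluttered_captcha.png")
```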


Google's captcha system is horrid. I've mentioned this to people on the accessibility team, but to no avail. They used to have a wheelchair icon next to the bloody scrambled text. I taught a computer class to seniors and it was painful watching them deal with the account sign-up process (also, I thought it was insulting to ask a mobile senior to click on the wheelchair icon ... to the designer ... FU!). Clicking on the wheelchair would give audio that barely made any sense to me. The whole process was stupid.

Like many others, I can barely get through their captcha service. I'm actually happy people circumvented it. Maybe someone will think it through this time around.



OK. Show me a CAPTCHA that is easy for humans to read, and very difficult for computers to read.


It shouldn't be about being able to "read". Ideally it should be something that only a human could "solve". Maybe a stupid example, but I can't come up with a better one (and if I could, I would be a lot wealthier by now ;) ): show a landscape picture where the forest is violet and ask the user to identify what's odd about it. Another example: show a face with two noses and ask them to identify the odd item. Things like that. Easy for humans, impossible for computers. The problem is that these tasks need to be generated by humans, as I cannot think of any (irreversible) way to do this automatically. Maybe Mechanical Turk to the rescue?


> Like many others, I can barely get through their captcha service. I'm actually happy people circumvented it. Maybe someone will think it through this time around.

Anytime you're ready, we're listening.


Well, chances are it'll get even harder now, not easier, since they'll need to add further complexity to differentiate humans from bots.


I also thought the wheelchair icon was insulting. A person in a wheelchair has problems walking, not (necessarily) with their vision. How about "Help" in text? Less confusion and possible anger.


Apparently the wheelchair symbol is actually an ISO standard.

http://en.wikipedia.org/wiki/International_Symbol_of_Access


But it's a symbol for mobility - not for all disabilities.

There are symbols for visual impairment, but they're not international.

(http://commons.wikimedia.org/wiki/File:Pictograms-nps-access...)

(http://3.bp.blogspot.com/-HfEx4Y_O_Gs/Tf0huVPZBXI/AAAAAAAAAC...)


Since the point of the audio version is not to be hit with lawsuits under the ADA - perhaps it should just be a little icon of a lawyer?


I imagine that whole sites could be designed using only your proposed lawyer icon, possibly with some additional icon representing political correctness.


What's the international symbol for a lawyer?

https://www.google.com/search?q=parasite+icon


Here's the Ars Technica article, which does a much better job of explaining the system:

http://arstechnica.com/security/2012/05/google-recaptcha-bro...


Agreed, thanks for posting the Ars link.

The YouTube video in the OP does a great job of explaining the attack and the thinking behind it, and even though it's an hour long, it's worth the watch.


Video tutorials are awful if they're the only option. I read very quickly, and videos are typically paced at the lowest "average" comprehension level, so they're a painfully inefficient way to relay technical information.

Plain text wins again, at least for one-way technical communication.


I tend to agree with you, but FYI you sound pretty arrogant when you say it that way.


How about when I say I hate videos of what should be written text because years of exposure to gunfire in the Army has impaired my hearing?

I will never understand why people think that the crappy audio channel of a youtube video of some guy "umming" and gulping through a poorly prepared speech, all recorded on a $5 microphone, is a suitable way of passing information.


I actually tried hacking reCaptcha via audio and the Google speech-to-text API a few days ago. It didn't work, unfortunately. It really frustrates me at times when I have to refresh reCaptcha 10 times to actually be able to read the damn thing!!


What problems did you run into when trying to use the speech API? It's quite an interesting avenue for bypassing these in the future as speech recognition gets better. Also highly entertaining when you use Google's own service against them.


Exactly my thought about using their own service! The thing is, the background noise made the API fail to recognise the actual words (I think, since you can't really "debug" it), but if you could reduce the noise it might be possible. I wanted to give that a go but haven't had a chance yet. I just don't know of any ways to automatically reduce noise.


I suspect you haven't gotten around to it, but did you study the noise at all? According to Wikipedia <http://en.wikipedia.org/wiki/Voice_frequency>, 85-255 Hz is the usual range of the fundamental frequency of human speech. You could try a band-pass filter with that pass band, and probably a pretty high roll-off (this will cut off sounds like S and X, but I suspect voice recognition software is often capable of dealing with those, given how often they fail to be represented well in discrete audio).

Audacity can do this, and there are almost certainly automatable tools out there. Worth a shot, anyway.
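A sketch of that filter scripted with SciPy instead of Audacity (the file name is a placeholder and the 85-255 Hz band is the parent's suggestion; note it keeps the fundamental but drops the higher formants that carry most intelligibility):

```python
# Band-pass filter sketch using SciPy rather than Audacity. "captcha.wav" is
# a placeholder; the 85-255 Hz pass band comes from the comment above.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt

rate, audio = wavfile.read("captcha.wav")
audio = audio.astype(np.float64)
if audio.ndim > 1:                            # mix down stereo to mono
    audio = audio.mean(axis=1)

nyquist = rate / 2.0
low, high = 85.0 / nyquist, 255.0 / nyquist
b, a = butter(4, [low, high], btype="band")   # 4th-order Butterworth band-pass

filtered = filtfilt(b, a, audio)              # zero-phase filtering
wavfile.write("captcha_filtered.wav", rate, filtered.astype(np.int16))
```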


There must be a reason why they are confident current algorithms will have a hard time. The article says it works by playing speech backwards, so any basic filter will also significantly distort the speech part, since I assume the "noise" will occupy similar bands. Subtractive techniques probably won't work either, because the noise will likely be highly varying.

But perhaps if one could learn two sparse representations, incorporating phase information, for backwards and forwards speech using matrix factorization, it might work as a method for removing recaptcha noise. This idea is so simple, though, that I assume someone has tried it and it doesn't work well.


I found Audacity when I was playing around, but I didn't get round to actually trying it. Didn't know about the frequency range, so thanks!!!


The Coursera Machine Learning course references an algorithm that can separate speech from noise with a fairly high degree of accuracy. May be worth looking into. Noise removal tools like Audacity's tend to introduce "watery" or "bubbly" effects when pushed hard.


PS: Here's how you send the audio to the API; the MP3 you can download from reCaptcha needs converting to FLAC first, which can be done with ffmpeg.

http://pastebin.com/LQ30iWKD
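The pastebin has the exact request; as a rough sketch of the same pipeline (the speech endpoint below is a placeholder, since the API used was unofficial and changed over time), the conversion step looks like this:

```python
# Rough sketch of the pipeline described above: convert the downloaded MP3 to
# FLAC with ffmpeg, then POST it to a speech-recognition endpoint. The URL
# below is a placeholder, not the real endpoint; see the pastebin for the
# actual request that was used.
import subprocess
import urllib.request

subprocess.check_call(["ffmpeg", "-y", "-i", "audio.mp3", "-ar", "16000", "audio.flac"])

SPEECH_URL = "https://speech.example/recognize"  # placeholder endpoint

with open("audio.flac", "rb") as f:
    req = urllib.request.Request(
        SPEECH_URL,
        data=f.read(),
        headers={"Content-Type": "audio/x-flac; rate=16000"},
    )
print(urllib.request.urlopen(req).read())
```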


Awesome, that's nice and straightforward. If you could improve the audio quality/remove noise, then it could make for an entertaining web service. You could go so far as making a browser plugin (for more irony; in Chrome) which would auto submit the audio file and enter the text for people.


Indeed, that was my initial plan, after getting frustrated at ThemeForest.net

Seriously, who adds reCaptcha to a login form?!


In systems that are less secured than Google's, the audio captcha seems trivial to break... I think I've seen one on court sites that reads a combination of numbers from 1 to 9 with some variance in the vocal speed. I'm not an audio engineer, but that seems fairly trivial to crack (though maybe their visual captcha would be easier... I dunno, not an expert in OCR either).

It's a good lesson in a form of social engineering. Sites have to provide this alternative access for the visually impaired, yet I bet the resources and creativity put behind it are not at the same level as those put into the captcha used by 99% of the userbase. Furthermore, the most important client, your boss, is unlikely to be blind, which eliminates that extra critical layer of oversight.


> Note: In the hours before our presentation/release, Google pushed a new version of reCAPTCHA which fully nerfs our attack.

I take it that "fully nerfs" means this defeat of recaptcha is no longer useful?


Pretty sure I don't have 99% accuracy at solving reCAPTCHAs. Perhaps it's become a CAPITCHA, Completely Automated Public Inverse Turing test to tell Computers and Humans Apart...


What are we going to do once all CAPTCHAS are completely broken?


CAPTCHAs are ultimately already completely broken - they don't really slow anyone down except ordinary users, because spammers are using human labour to solve CAPTCHAs at the rate of one or two dollars per thousand solved.


Their testing and high success rates on the public Google reCaptcha test probably tipped off a Google employee internally.


Captchas are very annoying. It is surprising that they have lasted this long. What would be the best alternatives?


Obscurity. If you're not a significant target, you can get by with simpler methods.

Deciphering obfuscated text is one of very few verifiable tasks that humans can do which computers cannot.

Alternatively, you can look at other methods of identity verification. If you ask for $5 to create an account, you'll have few spam problems.


Now here is some serious hacker's news! Was getting nagged by the 'what NewEgg can/can't do' posts of late.


A company that relies on a bot that accesses others' resources to make money, and at the same time relies on reCAPTCHA to stop other bots from accessing its own resources.

It may be easy to do today, but, going forward, how do we determine which bots are "good" and which ones are "bad"?

Clearly, simply being a "bot" does not imply "bad" intent. If it did then we should all be blocking search engine bots. Yet this is what reCAPTCHA does: it blocks not based on intent, but based on the characteristic of being a "bot".


It's easy. One bot reads a text file and only accesses the resources outlined in the file. The other maliciously tries to eat up resources without restraint. Good - Bad.


Like I said, it's relatively easy today. But how about going forward?

There will be lots more bots and lots more usages. Things might not be so simple. For example, some sites might exclude all search engine bots except a chosen few despite the fact they all honor robots.txt and behave essentially the same.

(BTW, if by "text file" you mean robots.txt, isn't that an exclusion list? You seem to be saying it's an inclusion list.)
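For reference, robots.txt is indeed exclusion-based: everything is allowed unless a Disallow rule covers it, and honouring it is entirely voluntary. A minimal example:

```
User-agent: *
Disallow: /private/
Disallow: /search

User-agent: EvilBot
Disallow: /
```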



