Hacking and defeating Google's reCAPTCHA with a 99% accuracy (dc949.org)
224 points by ahmadss on June 14, 2012 | 80 comments



One exciting thing about this: the entire model of reCaptcha (at least the text ones; I assume the audio ones are similar) is to make people do useful work when solving captchas by having them complete tasks that they consider too hard for computers to do well (in the text reCaptcha case, OCR). If someone writes software that can defeat the captcha, it does mean the security model is broken, but it also means the state of OCR technology (or audio recognition or whatever) has been advanced, and the digitization of books that had previously required human intervention can now be accomplished by automated means. In other words, spammers are incidentally creating the tools to expand the scope of digital human knowledge. Win-win, really.


Unfortunately, this attack does nothing to advance the state of the art in OCR (or audio recognition). It's basically the same story as every other CAPTCHA attack to date: take advantage of some accidental statistical regularity in the generation function. As soon as this kind of flaw is discovered, it only takes a few hours for the generation code to be patched in such a way that completely prevents this sort of attack from working.


So either the code is easy to patch, or we DO advance. Win/Win?


Not really... Even if the code is difficult to patch, speech/audio recognition doesn't advance much when an attacker figures out how to remove the (non-random) noise added by a machine over the sound file. Actual speech recognition relies on the ability to filter out background noise - which is a lot more complex/random - added by surroundings, not a machine.

It's very difficult to generate noise algorithmically that a) humans can filter out and b) can't be removed by some other algorithm. As a result, audio captchas are a huge vulnerability and the weakest link in almost any captcha system, yet accessibility law means you can't simply get rid of them.

Hypotheticals aside, the code was easy to patch - note the footnote: > In the hours before our presentation/release, Google pushed a new version of reCAPTCHA which fully nerfs our attack.


Could one take real recorded noise and add that rather than noise generated via algorithm? Wouldn't that force attackers to solve a real problem (removing background noise from a speech sample)?


It's not really solving the "real" problem... If I'm just mashing two audio files together, that's going to be different from someone talking in the middle of a train platform, and there will likely be algorithmically determinable differences between the artificially generated words and the naturally generated noise.

All of this aside, removing background noise is not a huge issue anymore. We have pretty decent noise-cancellation technology. Speech recognition - the other big component - has advanced a lot in recent times and is actually pretty good, although not for every company/product.

Even if it would help, you'd have to record an incredible amount of noise in the first place, seeing as you're getting millions of hits a day; with a small sample set, the attackers will just figure out the solutions to that sample set and be done.

I'm not saying it's impossible, but I am saying it's probably not worth it at this point. Captchas (in their traditional forms) don't make sense as a long-term strategy anyways.


Yeah, you would think they could record thousands of hours of real world noise then randomly use sections of it on each audio captcha.


If the attacker manages to obtain all the random noise, they could index every window in the noise in a k-d-tree and perform an efficient nearest neighbour search for the exact background from the CAPTCHA audio, and then simply subtract the background, giving perfect segmentation in O(log(N)) asymptotic average time complexity for N windows (at 64kHz and 2000 hours of audio, N=460800000, log N = 19.95).
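As a rough sketch of that lookup (purely illustrative: the window length and file names are made up, and a corpus this size would not actually fit in a single in-memory tree):

```python
# Sketch of the parent's idea: index fixed-length windows of the leaked noise
# corpus in a k-d tree, find the nearest window for each chunk of the CAPTCHA
# audio, and subtract it. Window size and file names are hypothetical.
import numpy as np
from scipy.io import wavfile
from scipy.spatial import cKDTree

WIN = 512  # samples per window (made up for illustration)

def windows(signal, win=WIN):
    n = len(signal) // win
    return signal[:n * win].astype(np.float32).reshape(n, win)

_, noise = wavfile.read("leaked_noise_corpus.wav")  # hypothetical file
_, captcha = wavfile.read("captcha_challenge.wav")  # hypothetical file

tree = cKDTree(windows(noise))            # O(N log N) build over N windows

cleaned = []
for w in windows(captcha):
    _, idx = tree.query(w)                # O(log N) average per lookup
    cleaned.append(w - tree.data[idx])    # subtract the matched background
cleaned = np.concatenate(cleaned)
```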


That kind of Win-Win is also a Lose-Lose. It's just a glass-half-empty thing.


> If someone writes software that can defeat the captcha, it does mean the security model is broken, but it also means the state of OCR technology (or audio recognition or whatever) has been advanced, and the digitization of books that had previously required human intervention can now be accomplished by automated means.

No it doesn't. reCaptcha only checks one of two words it displays (the other one being what OCRs can't handle themselves), so naturally you only need to crack one and input garbage as the other, thus actually making the world a worse place.


Probably not substantially worse. You would need to be able to tell which was which, since they're not consistently ordered and the "good" one is deliberately obfuscated/smudged/whatever, and, since recaptcha depends on multiple users agreeing on the right answer, you and other attackers would need to be consistent in your garbage in order for it to make it into the canonical book transcript.


For ReCAPTCHA, you only have to type the known word almost-correctly. You can just type anything for the word they don't already know. Even if someone had tech that could pass visual ReCAPTCHA reliably, it wouldn't need to be capable of doing useful work.


Are the audio versions sourced from recordings that need transcription as the visual versions are sourced from scanned documents? I assumed that was just to improve access.

If not, then it isn't useful work, really.


Looking at how they hacked it, we can safely deduce the audio version is only there to improve access.

It's useless for transcription, because it 'validates' phonetically. This is part of the attack: whether the audio is "wagon" or "van", the 'word' "wagn" validates. Same for "Spoon" and "Teaspoon", which can both be validated by entering "Tspoon" (since "Ts" can be said "Ss" as in "Tsunami" (Sunami) and "T S" as in T-Rex).


That's a shame. I would imagine that a productively similar audio task could be devised around humans helping to transcribe hard-to-auto-transcribe audio recordings, and that it could use a similar strategy to the visual one: a known-good clip and an ambiguous one, requiring the user to transcribe both.


Well, I'm not a linguist or anything like that, but I'm not sure it would work so well, if it was useful at all.

1/ Homophony. Knowing how to write a word you hear often needs context. Giving the user two whole sentences to listen to is too long and takes too much time. 'Right, but no need for a whole sentence, a few words suffice.' Right, but if the computer knows where to cut those sentences, it can also transcribe them itself.

2/ You have to assume good spelling from the user.

3/ Is it useful? I mean, 100M recaptchas are done every day; what fraction of those are audio recaptchas? 0.0005%? Less? Transcribing two sentences every week makes no sense. Keep in mind that homophony plus bad spelling are two factors that hugely increase the number of times the same 'unknown clip' has to go through 'human validation' before we can assume, with a certain level of confidence, that it has been transcribed.

Funny 4/: take a look at the [cc] button on some YouTube videos: on-the-fly transcription. Thanks, Google. :) Oh, and also on-the-fly translation of the on-the-fly transcription, btw. Google even said they were working on voice-to-voice translation for Google Voice: an English speaker calls a Chinese speaker, the English voice is transcribed, then translated, then synthesized, and the same the other way. :)


After so much work, gotta love the footnote here: "Note: In the hours before our presentation/release, Google pushed a new version of reCAPTCHA which fully nerfs our attack."


This isn't a huge problem for security people. Indeed, at most academic security conferences, by the time you present your work it has usually already been broken and/or countered.

The goal of good research (and one thing that differentiates researchers from criminals) is to present a proof of concept and to advance the state of security. The fact that security is a perpetual arms race is incidental.

Well, at least that's what security researchers tell themselves anyway to avoid going mad. :)


What makes it funny is that the bulk of the article isn't "here's how we did it and what we learned," but "here's what you need to do to get our code running on Ubuntu." Then you get to the footnote: "PS: It's pointless to get our code running on Ubuntu."

I spent the whole article wondering why they were so interested not only in presenting a proof of concept, but in getting as many people as possible actively breaking captchas. Then I got to the end and switched to wondering whether the whole thing is an elaborate prank.


Yes, I thought this particular link was perhaps not the best way to present the work, mainly because the interesting part is actually this: "We accomplished this with a combination of Machine Learning, hashing methods, keyspace reduction tactics, and taking advantage of an overall limited number of captchas. Specifically, Stiltwalker goes head to head against reCAPTCHA'S audio captcha system and defeats all but a sliver of it's challenges."

On the other hand, it looks like they provide a corpus (http://www.dc949.org/projects/stiltwalker/stiltwalker-corpus...) [1.5 GB!] that you can still use to run the program.


Yes, I almost missed that. It should be highlighted more clearly, not buried in a footnote, since it undermines the impact of the research...


The research is just as solid as it ever was; the patch just minimizes its immediate, real-world effects.


I'd be willing to wager that Google has many different versions of ReCaptcha sitting on ice, to be shifted into production if a flaw in the current system is found. I've noticed some different image captchas floating around too.


Still, the important takeaway for webapp developers is never to rely on just one form of protection. Add timestamp checks, honeypots, etc., in addition to the captcha, and use them appropriately for your application.
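A minimal sketch of the honeypot-plus-timestamp idea (the field name and five-second threshold are made up; a real app would also sign the timestamp):

```python
# Minimal sketch of layering checks before the captcha: a hidden "honeypot"
# field that humans never fill in, plus a minimum time between rendering the
# form and submitting it. Field name and the 5-second threshold are made up.
import time

MIN_SECONDS_TO_FILL = 5  # humans rarely finish a form faster than this

def looks_like_a_bot(form_fields, rendered_at, now=None):
    now = now if now is not None else time.time()
    if form_fields.get("website", ""):           # honeypot field, hidden via CSS
        return True
    if now - rendered_at < MIN_SECONDS_TO_FILL:  # submitted implausibly fast
        return True
    return False                                 # still run the captcha etc.

# Usage: stash time.time() (ideally signed) in the form when rendering it,
# then call looks_like_a_bot(submitted_fields, rendered_at) on submission.
```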


Alas, I run a handful of websites, and nearly nothing works beyond stopping the most egregious automated bots. Every new protection technique seems to get broken fairly fast. The best results so far have come from combining basic WAF techniques (poisoned form fields renamed to look like regular fields, randomized names for the standard form fields, one-time hash values in required fields, etc.) with some form of captcha.

However, every day five or six make it through. Observing their patterns and timing, and the fact that they make zero mistakes while interacting very slowly with the site, I can only presume that humans are directly involved as well, either as mechanical turks or simply manually posting the spam.

Notably, they follow normal, guided flows through the site, not like robots at all: they hit pages that would obviously be interesting to a human (say, ones linked by a certain type of image content versus another) while avoiding the least popular pages, and they never hit URLs that are hidden via CSS, etc.


Yup; human spammers are awful. Labor is so cheap now that it's financially viable to pay people to spam sites in some cases. The Internet is a sad place indeed.


This may be very interesting to crack, but who is responsible for Google making their CAPTCHA almost impossible for humans to decipher now? I seriously have to click 5 times before even seeing anything resembling letters I can parse.


I didn't go in depth on the article/method, but from what I've read it takes advantage of the audio function, not the images.


This simply means computers are getting really good at this game. And that Google, with all its power, hasn't found a better alternative to Captchas yet.

I find that pretty worrying for the future of the internet. An internet without working captchas will probably be full of bots and spam.


Naive Bayes spam classifiers are fast enough that I think they could be used as drop-in replacements, and they'd be well-trained too considering Google's existing success with spam in email. And Naive Bayes isn't the only solution, there are plenty of other heuristics. Even something as dumb as "first post from this user/IP/'identity' and full of links?" would catch a lot of common spam.
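As a minimal sketch of the Naive Bayes idea with scikit-learn (the tiny training set is made up purely for illustration; a real deployment would train on a large labelled corpus):

```python
# Minimal Naive Bayes text classifier for spam filtering (scikit-learn).
# The training examples below are made up for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "cheap pills buy now http://spam.example",   # spam
    "work from home, earn $$$ fast",             # spam
    "interesting writeup, thanks for sharing",   # ham
    "I disagree with the second paragraph",      # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["buy cheap pills now", "nice analysis of the attack"]))
```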


I really wish HN required a comment with a downvote on accounts with more than 100 karma (where one can reasonably assume the user isn't a newbie leaving empty comments like "I agree"). I'd ask that person to think it through: why doesn't Gmail require a captcha every time you want to send an email? What about other mail providers? What about your own native client? "An internet without working captchas will probably be full of bots and spam." Is your inbox full of bots and spam? Mine isn't. Even my spam box has fewer than 800 messages over a month; before one of the big botnets that was generating most of the world's spam was taken down a few years ago, I still had fewer than 4000 over a month.


Is this an argument that future web services may depend more heavily on identity/reputation services?


You only have to type one word correctly, and they make sure one word is readable, so with some practice you can easily recognise which word is the key word.


With practice and decent eyes. Once your eyes go downhill, it is amazing how much changes.

Most software developers are under 40 and therefore have little to no appreciation for what happens to people's eyes after they are past 40.


He isn't speaking of reCAPTCHA but of Google's nearly impossible-to-read captchas, which take way too much time to decipher.

If you'd like to see it you can usually get it by putting in a bad password to a gmail account too many times (though I don't know if that has other consequences).

Edit: Here is an image example http://www.techian.com/wp-content/uploads/captcha.png


I wonder what they do for non-Latin locales?


Strange they don't use reCAPTCHA for this.


Oh those, yeah they are very hard to decipher.


For me, the key word is usually the unreadable one. I've never had success typing only the readable one.


After reading that previously on HN, I tried it and sadly had to type both words.


Agreed. They are unbearable. But name a better alternative!

Anything I can think of adding to the CAPTCHA, such as a half-line of letters above and below, or extra noise, would probably make cracking them programmatically even easier.
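As a toy illustration of that kind of clutter (Pillow-based; every size and count here is arbitrary, and real captcha generators are far more involved):

```python
# Toy illustration of adding clutter (a strike-through line and speckle noise)
# to a text image, in the spirit of the additions discussed above.
import random
from PIL import Image, ImageDraw, ImageFont

img = Image.new("L", (200, 70), color=255)        # white canvas
draw = ImageDraw.Draw(img)
draw.text((30, 25), "example", fill=0, font=ImageFont.load_default())

# a wobbly line through the text
draw.line([(0, random.randint(20, 50)), (200, random.randint(20, 50))],
          fill=0, width=2)

# speckle noise
for _ in range(300):
    draw.point((random.randrange(200), random.randrange(70)), fill=0)

img.save("cluttered_captcha.png")
```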


Google's captcha system is horrid. I've mentioned this to people on the accessibility team, but to no avail. They used to have a wheelchair icon next to the bloody scrambled text. I taught a computer class to seniors and it was painful watching them deal with the account sign-up process (also, I thought it was insulting to ask a mobile senior to click on the wheelchair icon ... to the designer ... FU!). Clicking on the wheelchair would give audio that barely made any sense to me. The whole process was stupid.

Like many others, I can barely get through their captcha service. I'm actually happy people circumvented it. Maybe someone will think it through this time around.



OK. Show me a CAPTCHA that is easy for humans to read, and very difficult for computers to read.


It shouldn't be about being able to "read". Ideally it should be something that only a human could "solve". Maybe a stupid example, but I can't come up with a better one (and if I could, I would be a lot wealthier by now ;) ): show a landscape picture where the forest is violet and ask the user to identify what's odd about it. Another example: show a face with two noses and ask them to identify the odd item. Things like that. Easy for humans, impossible for computers. The problem is that these tasks need to be generated by humans, as I cannot think of any (irreversible) way to do this automatically. Maybe Mechanical Turk to the rescue?


> Like many others, I can barely get through their captcha service. I'm actually happy people circumvented it. Maybe someone will think it through this time around.

Anytime you're ready, we're listening.


Well, chances are it'll get even harder now, not easier, since they'll need to add further complexity to differentiate humans from bots.


I also thought the wheelchair icon was insulting. A person in a wheelchair has problems walking, not (necessarily) with their vision. How about "Help" in text? Less confusion and possible anger.


Apparently the wheelchair symbol is actually an ISO standard.

http://en.wikipedia.org/wiki/International_Symbol_of_Access


But it's a symbol for mobility - not for all disabilities.

There are symbols for visual impairment, but they're not international.

(http://commons.wikimedia.org/wiki/File:Pictograms-nps-access...)

(http://3.bp.blogspot.com/-HfEx4Y_O_Gs/Tf0huVPZBXI/AAAAAAAAAC...)


Since the point of the audio version is not to be hit with lawsuits under the ADA - perhaps it should just be a little icon of a lawyer?


I imagine that whole sites could be designed using only your proposed lawyer icon, possibly with some additional icon representing political correctness.


What's the international symbol for a lawyer?

https://www.google.com/search?q=parasite+icon


Here's the Ars Technica article, which does a much better job of explaining the system:

http://arstechnica.com/security/2012/05/google-recaptcha-bro...


Agreed, thanks for posting the Ars link.

The YouTube video in the OP does a great job of explaining the attack and the thinking behind it, and even though it's an hour long, it's worth the watch.


Video tutorials are awful if they're the only option. I read very quickly, and videos are typically paced at the lowest "average" comprehension level, so they're a painfully inefficient way to relay technical information.

Plain text wins again, at least for one-way technical communication.


I tend to agree with you, but FYI you sound pretty arrogant when you say it that way.


How about when I say I hate videos of what should be written text because years of exposure to gunfire in the Army has impaired my hearing?

I will never understand why people think that the crappy audio channel of a youtube video of some guy "umming" and gulping through a poorly prepared speech, all recorded on a $5 microphone, is a suitable way of passing information.


I actually tried hacking reCaptcha via audio and the Google speech-to-text API a few days ago. It didn't work, unfortunately. It really frustrates me at times when I have to refresh reCaptcha 10 times to actually be able to read the damn thing!!


What problems did you run into when trying to use the speech API? It's quite an interesting avenue for bypassing these in the future as speech recognition gets better. Also highly entertaining when you use Google's own service against them.


Exactly my thought about using their own service! The thing is, the background noise made the API fail to recognise the actual words (I think, since you can't really "debug" it), but if you could reduce the noise it might be possible. I wanted to give that a go but haven't had a chance yet. I just don't know of any ways to automatically reduce noise.


I suspect you haven't gotten around to it, but did you study the noise at all? According to Wikipedia <http://en.wikipedia.org/wiki/Voice_frequency>, 85-255 Hz is the usual range of the fundamental frequency of human speech. You could try a band-pass filter with that pass band, and probably a pretty high roll-off (this will cut off sounds like S and X, but I suspect voice recognition software is often capable of dealing with those, given how often they fail to be represented well in discrete audio).

Audacity can do this, and there are almost certainly automatable tools out there. Worth a shot, anyway.
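A sketch of that filter scripted with SciPy instead of Audacity (the file name is a placeholder and the 85-255 Hz band is the parent's suggestion; note it keeps the fundamental but drops the higher formants that carry most intelligibility):

```python
# Band-pass filter sketch using SciPy rather than Audacity. "captcha.wav" is
# a placeholder; the 85-255 Hz pass band comes from the comment above.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt

rate, audio = wavfile.read("captcha.wav")
audio = audio.astype(np.float64)
if audio.ndim > 1:                            # mix down stereo to mono
    audio = audio.mean(axis=1)

nyquist = rate / 2.0
low, high = 85.0 / nyquist, 255.0 / nyquist
b, a = butter(4, [low, high], btype="band")   # 4th-order Butterworth band-pass

filtered = filtfilt(b, a, audio)              # zero-phase filtering
wavfile.write("captcha_filtered.wav", rate, filtered.astype(np.int16))
```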


There must be a reason why they are confident current algorithms will have a hard time. The article says it works by playing speech backwards, so any basic filter will also significantly distort the speech part, since I assume the "noise" will occupy similar bands. Subtractive techniques probably won't work either, because the noise will likely be highly varying.

But perhaps if one could learn two sparse representations, incorporating phase information, for backwards and forwards speech using matrix factorization, it might work as a method for removing recaptcha noise. This idea is so simple, though, that I assume someone has tried it and it doesn't work well.


I found Audacity when I was playing around, but I didn't get round to actually trying it. Didn't know about the frequency range, so thanks!!!


The Coursera Machine Learning course references an algorithm that can separate speech from noise with a fairly high degree of accuracy. May be worth looking into. Noise removal tools like Audacity's tend to introduce "watery" or "bubbly" effects when pushed hard.


PS: Here's how you send the audio to the API; the MP3 you can download from reCaptcha needs converting to FLAC first, which can be done with ffmpeg.

http://pastebin.com/LQ30iWKD
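The pastebin has the exact request; as a rough sketch of the same pipeline (the speech endpoint below is a placeholder, since the API used was unofficial and changed over time), the conversion step looks like this:

```python
# Rough sketch of the pipeline described above: convert the downloaded MP3 to
# FLAC with ffmpeg, then POST it to a speech-recognition endpoint. The URL
# below is a placeholder, not the real endpoint; see the pastebin for the
# actual request that was used.
import subprocess
import urllib.request

subprocess.check_call(["ffmpeg", "-y", "-i", "audio.mp3", "-ar", "16000", "audio.flac"])

SPEECH_URL = "https://speech.example/recognize"  # placeholder endpoint

with open("audio.flac", "rb") as f:
    req = urllib.request.Request(
        SPEECH_URL,
        data=f.read(),
        headers={"Content-Type": "audio/x-flac; rate=16000"},
    )
print(urllib.request.urlopen(req).read())
```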


Awesome, that's nice and straightforward. If you could improve the audio quality/remove noise, then it could make for an entertaining web service. You could go so far as making a browser plugin (for more irony; in Chrome) which would auto submit the audio file and enter the text for people.


Indeed, that was my initial plan, after getting frustrated at ThemeForest.net

Seriously, who adds reCaptcha to a login form?!


In systems that are less secured than Google's, the audio captcha seems trivial to break... I think I've seen one on court sites that reads a combination of numbers from 1 to 9 with some variance in the vocal speed. I'm not an audio engineer, but that seems fairly trivial to crack (though maybe their visual captcha would be easier... I dunno, not an expert in OCR either).

It's a good lesson in a form of social engineering. Sites have to provide this alternative access for the visually impaired, yet I bet the resources and creativity put behind it are not at the same level as those put into the captcha used by 99% of the userbase. Furthermore, the most important client, your boss, is unlikely to be blind, which eliminates that extra critical layer of oversight.


> Note: In the hours before our presentation/release, Google pushed a new version of reCAPTCHA which fully nerfs our attack.

I take it that "fully nerfs" means this defeat of recaptcha is no longer useful?


Pretty sure I don't have 99% accuracy at solving reCAPTCHAs. Perhaps it's become a CAPITCHA, Completely Automated Public Inverse Turing test to tell Computers and Humans Apart...


What are we going to do once all CAPTCHAS are completely broken?


CAPTCHAs are ultimately already completely broken - they don't really slow anyone down except ordinary users, because spammers are using human labour to solve CAPTCHAs at the rate of one or two dollars per thousand solved.


Their testing and high success rates on the public Google reCaptcha test probably tipped off a Google employee internally.


Captchas are very annoying. It is surprising that they have lasted this long. What would be the best alternatives?


Obscurity. If you're not a significant target, you can get by with simpler methods.

Deciphering obfuscated text is one of very few verifiable tasks that humans can do which computers cannot.

Alternatively, you can look at other methods of identity verification. If you ask for $5 to create an account, you'll have few spam problems.


Now here is some serious hacker's news! Was getting nagged by the 'what NewEgg can/can't do' posts of late.


A company that relies on a bot that accesses others' resources to make money, and at the same time relies on reCAPTCHA to stop other bots from accessing its own resources.

It may be easy to do today, but, going forward, how do we determine which bots are "good" and which ones are "bad"?

Clearly, simply being a "bot" does not imply "bad" intent. If it did then we should all be blocking search engine bots. Yet this is what reCAPTCHA does: it blocks not based on intent, but based on the characteristic of being a "bot".


It's easy. One bot reads a text file and only accesses the resources outlined in the file. The other maliciously tries to eat up resources without restraint. Good - Bad.


Like I said, it's relatively easy today. But how about going forward?

There will be lots more bots and lots more usages. Things might not be so simple. For example, some sites might exclude all search engine bots except a chosen few despite the fact they all honor robots.txt and behave essentially the same.

(BTW, if by "text file" you mean robots.txt, isn't that an exclusion list? You seem to be saying it's an inclusion list.)
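For reference, robots.txt is indeed exclusion-based: everything is allowed unless a Disallow rule covers it, and honouring it is entirely voluntary. A minimal example:

```
User-agent: *
Disallow: /private/
Disallow: /search

User-agent: EvilBot
Disallow: /
```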



