

Breaking CAPTCHA with automated humans - troyhunt
http://www.troyhunt.com/2012/01/breaking-captcha-with-automated-humans.html

======
user24
$0.001 per 27 seconds gives an hourly wage of 13.33 cents.

This system has to be using enforced labour, no-one will work for that little,
even in the poorest countries. And that 13 cents figure is assuming that 100%
of revenue goes to the individual worker.

Taking the best case from the reported average of 10-15 seconds, that means an
operator can solve 360 captchas per hour; an hourly revenue per worker of 36
cents.

I suspect this makes use of an 'economy' similar to the one in the reports of
Chinese prisoners being forced to farm gold in WoW[1].

If you have just ten 'workers' on 12 hour shifts, that's $43 per day, $301 per
week, $15,652 per year. The numbers start to stack up fairly well in bulk -
assuming all 1.5 million Chinese prisoners[2] were able to sustain 12 hour
shifts 365 days a year and maintain an average of 10 seconds per solve (and
that there was a constant supply of paying customers), that's over 2 billion
dollars annual revenue. Obviously that number's way off, but probably not by
an order of magnitude.

[1]
[http://www.pcworld.com/article/228716/chinese_prisoners_alle...](http://www.pcworld.com/article/228716/chinese_prisoners_allegedly_forced_to_play_world_of_warcraft.html)

[2] As of 2006:
[http://www.kcl.ac.uk/depsta/law/research/icps/downloads/worl...](http://www.kcl.ac.uk/depsta/law/research/icps/downloads/world-prison-pop-seventh.pdf) [PDF]
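
For anyone who wants to rerun the arithmetic, here is a minimal sketch. It assumes a flat fee of $0.001 per solved CAPTCHA, workers who never sit idle, and 100% of revenue counted per worker; the function names are made up for illustration.

```python
# Back-of-the-envelope revenue figures, assuming a flat fee of $0.001 per
# solved CAPTCHA and workers who are never waiting for work.
FEE_PER_SOLVE = 0.001  # dollars (assumed flat rate)


def hourly_revenue(seconds_per_solve: float, fee: float = FEE_PER_SOLVE) -> float:
    """Dollars earned per worker-hour at a given time per solve."""
    return 3600 / seconds_per_solve * fee


def annual_revenue(workers: int, hours_per_day: float, seconds_per_solve: float) -> float:
    """Dollars per year for a workforce working all 365 days."""
    return workers * hours_per_day * 365 * hourly_revenue(seconds_per_solve)


print(f"27 s per solve: {hourly_revenue(27) * 100:.2f} cents/hour")
print(f"10 s per solve: {hourly_revenue(10) * 100:.2f} cents/hour")
print(f"10 workers, 12 h shifts: ${annual_revenue(10, 12, 10):,.0f}/year")
print(f"1.5M workers, 12 h shifts: ${annual_revenue(1_500_000, 12, 10):,.0f}/year")
```

A straight 365-day year gives about $15,768 for ten workers, slightly above the figure you get by rounding the daily and weekly totals and multiplying by 52.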

~~~
biot
I'd be surprised if it takes more than 5 seconds for a human to solve each.
Keep in mind that 27 seconds was the time in the queue from submit to solve.
24 seconds of that might have been spent in the backlog and 3 seconds being
viewed and solved by a human.

Additionally, it supposes that all images being solved are the kind that a
human must do. Tons of CAPTCHA images being output by naive generators can be
solved automatically. I'd be surprised if every image submitted isn't going
through an automated solver first with only the < 95% confidence ones being
sent to a human.

Actually, you could do a hybrid model where every image is auto-solved and a
human can either click to confirm that the auto-solver got it right or
manually type a new response. That should get your time per solve down to a
second or two each.
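
A rough sketch of that routing idea, purely illustrative: `auto_solve` and `human_review` are hypothetical callables supplied by the caller (no real solver is implied), and the 0.95 threshold is just the figure mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

CONFIDENCE_THRESHOLD = 0.95  # assumed cut-off, taken from the "95%" above


@dataclass
class Routing:
    text: str          # answer returned to the client
    used_human: bool   # True if a human saw the image at all
    human_typed: bool  # True if the human had to type a correction


def route(image: bytes,
          auto_solve: Callable[[bytes], Tuple[str, float]],
          human_review: Callable[[bytes, str], str]) -> Routing:
    """Return the automatic guess if confident enough, else ask a human."""
    guess, confidence = auto_solve(image)
    if confidence >= CONFIDENCE_THRESHOLD:
        return Routing(guess, used_human=False, human_typed=False)
    # Hybrid step: the reviewer sees the guess and either returns it
    # unchanged (a one-click confirm) or returns a corrected string.
    answer = human_review(image, guess)
    return Routing(answer, used_human=True, human_typed=(answer != guess))


# Usage with stand-in callables: a low-confidence guess that the human confirms.
result = route(b"fake-image-bytes",
               auto_solve=lambda img: ("x7kq", 0.60),
               human_review=lambda img, guess: guess)
print(result)
```

The point of tracking `human_typed` separately is that a confirm costs a click while a correction costs typing, which is where the claimed second-or-two saving would come from.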

~~~
user24
> I'd be surprised if every image submitted isn't going through an automated
> solver first with only the < 95% confidence ones being sent to a human.

Yes, that's how I'd do it too, but I don't know if they're that sophisticated
yet (they'll get there though).

Even with 1 second per solve, that's only $3.60 per hour. A massive
improvement over 13c, but still certainly not worth anyone in my country's
time.

~~~
dzhiurgis
600 bucks per months is a great salary even in some european countries.

------
TheSeb
One thing that the author has clearly misunderstood is the following:

"I must admit, I do feel a bit sorry for the folks sitting there endlessly
solving a never ending stream of CAPTCHAs; frankly, just one drives me a bit
nuts! But what must have been even worse – and I need to take some blame here
– is that while testing I kept submitting the same CAPTCHA over and over and
over again. I can picture the poor operator sitting there thinking “WTF is
this guy doing already?!” Then again, maybe they made some quick bucks because
recognising the same pattern time and again becomes more efficient."

For the majority of the captcha solving services out there the same captcha is
just solved ONCE. When a request is submitted to the API the image data is
typically read from the image file and encoded in Base64 or in a similar
format. The Base64 string representing the image is then sent to the captcha
API.

The API will use the Base64 string as a key and check if it exists in a
database/key-value store. If it exists, return the value for the Base64
string, if it doesn't exist, send it to one of the available workers. When the
captcha is solved by a worker it is saved in the database. If someone would
send in the same captcha next time (thus resulting in an identical
Base64-encoded string representation of the image), the solved captcha text
would just be sent back without any human interference.

I talked to one guy running one of these services a while ago and they set it
up the following way:

1\. Requests are sent using JSON to their API.

2\. The Base64-encoded string representation of the image is checked against
Redis to see if it has already been solved. If it has, return it to the client
requesting it. Award the person who originally solved it 50% of the fee that
the client is paying for it (thus enabling these captcha solvers to earn some
kind of passive income).

3\. If it hasn't been solved, run it through OCR-software. Is the result
reliable enough? Save it in redis and return it to the client.

4\. Is the OCR result not reliable enough? Send it off to a human worker. The
worker solves the captcha, the captcha text is saved in Redis as a value
associated with the key (the original Base64-representation of the image).
Award the worker 75% of the fee the client paid. Return the solved captcha
text to the client.

5\. Deal with invalid captchas (remove entry from Redis, subtract from due
commission to worker etc.)

6\. Rinse and repeat. Scale with more servers + more workers.

Pretty cool stuff :D
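
The steps above can be sketched roughly as follows. This is not their code: a plain dict stands in for Redis, `ocr` and `human` are hypothetical callables, and the 0.95 OCR threshold is my assumption (the description only says "reliable enough"). Only the 50% and 75% shares come from the description.

```python
import base64
from typing import Callable, Dict, Optional, Tuple

CACHE_HIT_SHARE = 0.50    # share paid to the original solver on a repeat (step 2)
FRESH_SOLVE_SHARE = 0.75  # share paid to the worker on a new solve (step 4)
OCR_THRESHOLD = 0.95      # assumed value for "reliable enough" (step 3)


class SolveService:
    def __init__(self,
                 ocr: Callable[[bytes], Tuple[str, float]],
                 human: Callable[[bytes], Tuple[str, str]]):
        # key -> (solved text, worker id or None); a dict stands in for Redis
        self.store: Dict[str, Tuple[str, Optional[str]]] = {}
        self.earnings: Dict[str, float] = {}
        self.ocr = ocr      # image bytes -> (text, confidence)
        self.human = human  # image bytes -> (text, worker id)

    def _credit(self, worker: Optional[str], amount: float) -> None:
        if worker is not None:
            self.earnings[worker] = self.earnings.get(worker, 0.0) + amount

    def solve(self, image: bytes, fee: float) -> str:
        key = base64.b64encode(image).decode("ascii")
        if key in self.store:                    # step 2: already solved
            text, worker = self.store[key]
            self._credit(worker, fee * CACHE_HIT_SHARE)
            return text
        text, confidence = self.ocr(image)       # step 3: try OCR first
        if confidence >= OCR_THRESHOLD:
            self.store[key] = (text, None)
            return text
        text, worker = self.human(image)         # step 4: human worker
        self.store[key] = (text, worker)
        self._credit(worker, fee * FRESH_SOLVE_SHARE)
        return text

    def report_invalid(self, image: bytes, fee: float) -> None:
        # Step 5: drop the cached answer and claw back the fresh-solve commission.
        key = base64.b64encode(image).decode("ascii")
        entry = self.store.pop(key, None)
        if entry is not None:
            self._credit(entry[1], -fee * FRESH_SOLVE_SHARE)
```

Using the whole Base64 string as the key mirrors the description; a real store would more likely key on a fixed-length digest of the bytes, but that is a guess on my part.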

~~~
user24
But the captcha is randomly distorted afresh each time, so each captcha image
will never have been seen before. You'll never get an exact hash match unless
you're looking at an extremely naive implementation.
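
To make the point concrete, a tiny illustration with made-up byte strings standing in for two renderings of the same challenge: a single differing byte yields a different Base64 key, so an exact-match lookup misses.

```python
import base64

# Two hypothetical image payloads that differ in exactly one byte.
original = bytes([10, 20, 30, 40])
redistorted = bytes([10, 20, 31, 40])

key_a = base64.b64encode(original)
key_b = base64.b64encode(redistorted)

print(key_a == key_b)                             # False: lookup would miss
print(key_a == base64.b64encode(bytes(original))) # True: only identical bytes hit
```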

~~~
troyhunt
During my testing, I literally had a saved bitmap on my local machine which I
reissued many, many times. Obviously this meant it was an exact match on each
submission.

~~~
user24
but that's obviously an edge case rather than something you'd optimise for in
the system.

------
ghshephard
The ideal approach to running one of these services is to front a site that
actually uses Captchas, ideally in significant volume, and then just _pass
through_ the captchas being sent by others needing them to be solved.

Say you run a site that has 1000 users/minute authenticating; that means you
actually have 1000 captchas/minute being solved, so you may as well sell those
users' (unwittingly performed) time.

~~~
samwillis
I have heard of people doing exactly this with pornography websites...

~~~
dspillett
I've never seen a real example of this (having asked many times when people
have mentioned it); it is one of those ideas that on first thought everyone
thinks must be real.

The problem with that scheme is that you need to create or curate enough
content to make it work and somehow make the people who might want that
content aware that you can give it to them, and that is a _lot_ more hassle
and expensive than paying some impoverished 3rd world farmer to parse the
captchas.

~~~
pavel_lishin
> I've never seen a real example of this

Are you sure?

~~~
dspillett
Can you name one real documented example of it?

If not then I stand by my assertion that it does not exist as the time and
hassle (and potentially expense) of creating/curating the content required is
simply not worth the effort when automated code can do much of the work and
slave labour in "3rd world" countries can do the rest cheaply enough.

------
samps
Leaving aside the ethical issues, has anyone considered building a "white-hat"
CAPTCHA cracker tool using one of these services? Personally, I hate solving
CAPTCHAs to sign up for stuff, and I wish there were a browser extension
button I could hit that would do it for me. I'd gladly pay a few cents and
wait several seconds (browsing in another tab) if it meant I didn't have to
try to decipher some of the crazier CAPTCHAs out there.

~~~
user24
But who'd be doing the solve? If you have a machine to do it then you have a
product (but only until the captcha improves, after all that's the whole
point). If it's humans, you're either pricing yourself out of the market or
using slave labour.

------
drucken
The interesting thing about these human-based attempts to defeat individual
CAPTCHA-"protected" services is that if they are actually reCAPTCHA or
variants, they perform a strong _socially useful_ function!

This is because reCAPTCHA solving is linked to the digitization of paper
documents.

There was recently a TEDTalk by the inventor of the CAPTCHA about this topic.
It is very informative and entertaining. I highly recommend it:
<http://www.youtube.com/watch?v=-Ht4qiDRZE8>

------
lelele
Next submission: "Breaking CAPTCHA breakers with automated humans". That is,
having humans filter comments submitted through a CAPTCHA. I'd bet they would
work even faster than CAPTCHA breakers: typing CAPTCHAs is time-consuming (at
least for me, and often I can't read what I should type).

Further next submission: "CAPTCHA-breakers-breakers team up with CAPTCHA-
breakers to blackmail sites."

------
mikeknoop
> But what must have been even worse – and I need to take some blame here – is
> that while testing I kept submitting the same CAPTCHA over and over and over
> again. I can picture the poor operator sitting there thinking “WTF is this
> guy doing already?!”

Perhaps this is why the author saw such a high 94% success rate?

~~~
troyhunt
Nope, that was only when testing the CAPTCHA cracker console before I'd
scripted the ability to pull a live one from the registration page. The bulk
cracking was done with fresh loads and the logs show all unique CAPTCHAs.

------
dzhiurgis
Back in 2004, when RuneScape introduced captcha to combat SCAR and AutoRune
automatic trainers, someone figured out the very same system. I remember solving
hundreds of captchas and never getting any credits, which were usually traded
for CC's, shell and paypal accounts. owww, RuneScape + cheats = worst drugs
ever.

------
pax
Of course, their (kolotibablo.com) registration page also features a captcha,
spam seems to be annoying them too :)

------
samarudge
Now to write a script that repeatedly submits CAPTCHAs with the text "terrible
job", "boring", "pointless life".

~~~
foxylad
Or in case they are working against their will, "Return HELP if you're being
held prisoner in a captcha factory", "My location is..." and "Please tell
Amnesty International".

~~~
asomiv
This is actually a pretty good idea. The captcha should include some kind of
system for reporting slavery or other kinds of work-related abuse, whereby the
reporter will be rewarded if the report results in the slavedriver being
successfully prosecuted. As the number of employees becomes bigger, the
likelihood that at least one of the employees will rat out his employer
increases exponentially.
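
A quick way to put numbers on that, under the simplifying assumption that each employee independently reports with the same small probability `p` (a hypothetical parameter, not a measured one): the chance that nobody reports is `(1 - p) ** n`, which shrinks exponentially, so the chance of at least one report climbs toward 1.

```python
def p_at_least_one_report(n: int, p: float) -> float:
    """Probability that at least one of n independent employees reports."""
    return 1.0 - (1.0 - p) ** n


# Assumed per-employee reporting probability of 1%.
for n in (10, 100, 1000):
    print(n, round(p_at_least_one_report(n, 0.01), 4))
```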

~~~
pavel_lishin
Is it worth the risk to them? Their employer can grep through logs for hotkeys
trivially, and it's not like typing in "help I am in a basement in Beijing" is
going to send a rescue squad in 20 minutes.

~~~
asomiv
You can provide a virtual keyboard the way E-Gold does it in order to fight
key loggers.

------
zuppy
Some years ago I read about one of the best ways to solve captchas: you need
to set up a porn site and require visitors to solve them for you, in order to
"view" the content. They have no idea what they're doing. Free labour :)

------
uptown
Is it just me, or has CAPTCHA gotten considerably more difficult to read
lately? It takes me at least 3-4 reloads before I get something I can both
read and type with the keyboard in front of me.

~~~
ryandvm
Are you a robot?

~~~
uptown
A crazy cab driver called me an unfeeling robot one day, so I suppose it's
possible.

------
stuaxo
I'm sure that on file sharing sites when you solve captchas this is exactly
what is happening ...

