
TextCaptcha: Simple Textual Captcha Challenges - miki123211
http://textcaptcha.com/
======
plibither8
My only concern with text captchas, apart from how well they stop bots, is that
they require you to read and understand the question and to know the answer,
and to do all of that in the language the captcha is served in.

This might sound like a nonexistent issue, but let's take the example of a
country like India with a huge number of internet users. There are many who
use the Internet without being completely literate in English or even their
mother tongue. Use cases like checking exam results, train tickets etc. require
captchas to be filled in, and the normal deformed-letter ones do well. I doubt
text captchas will be accessible to them.

But then again, user base matters a lot too and that should be taken into
account always!

~~~
dillonmckay
I was looking through the examples, and for a large percentage of the users I
deal with in the US, these questions would be too difficult.

I had a user struggle to find the exclamation symbol on the keyboard.

------
FreeHugs
There is certainly some benefit to this, but the barrier for bots is low. All
it takes is a custom bot that can be written in minutes.

It seems to use a fixed set of question types, all of which a simple
script can solve.

Example:

    What is Mark's name?

Writing a script that extracts the name from this string is trivial. And the
other question types are similarly easy.
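
To make that concrete, here is a minimal Python sketch for the one question quoted above. The regular expression is a guess at the question's shape, not something TextCaptcha documents, and `solve_name_question` is a made-up helper name.

```python
import re

# Hypothetical pattern for the single question type quoted above.
NAME_QUESTION = re.compile(r"^What is (\w+)'s name\?$")


def solve_name_question(question):
    """Return the name embedded in the question, or None for any other shape."""
    match = NAME_QUESTION.match(question.strip())
    return match.group(1) if match else None


print(solve_name_question("What is Mark's name?"))  # Mark
```

A real solver would need one such pattern per question type, which is exactly why a fixed set of types is a weakness.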

The next step would be to allow the site owner to put in domain-specific
questions. For example, a chemistry website might ask "What is the formula of
$molecule?" and then provide ['Water'=>'H2O','Salt'=>'NaCl'...] as the
question data. Then the site would be protected from a general textcaptcha
solver, and bots would have to be customized for each domain.
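
A rough Python sketch of that idea, assuming the site owner supplies the lookup table. Only the two entries from the example are included, and `make_challenge` and `check_answer` are hypothetical names, not part of any TextCaptcha API.

```python
import random

# Site-specific question pool; the two entries come from the example above,
# a real site would supply many more.
FORMULAS = {"Water": "H2O", "Salt": "NaCl"}


def make_challenge(rng=random):
    """Pick a molecule and return (question, expected_answer)."""
    molecule = rng.choice(sorted(FORMULAS))
    return f"What is the formula of {molecule}?", FORMULAS[molecule]


def check_answer(expected, response):
    """Compare case-insensitively, ignoring surrounding whitespace."""
    return response.strip().lower() == expected.lower()


question, expected = make_challenge()
print(question)
```

The expected answer stays on the server (for example in the session), so nothing about it is sent to the client.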

~~~
meehow
I like the idea of domain specific questions, but I also think that questions
are the biggest added value of this API. If you have your own list, then you
can implement your own "captcha" pretty easily.

------
EmilStenstrom
The value here is not that they are unsolvable; with enough effort you
could probably write something that solves these too. The thing is, are you
willing to go through that effort for something that only a few sites use?

This is in contrast to ReCaptcha, which is used by millions of sites, so using
crowdsourcing and systems to make them easy to use is worth it there.

For a long time I used the fixed captcha, "What is 1+1?", but written in my
local language, and managed to filter out all the spam...

~~~
type0
A custom captcha is not very hard to implement, but for some reason very few
sites actually use one. Is it that we don't believe in our ability to write a
few questions with honeypots, or are we just overconfident that the Google
service is the only one people could use?

------
nils-m-holm
I like the idea because, according to ReCaptcha, I am officially a bot! And it
works! Whenever I see one of those, I just close the tab and go away.

------
tyingq
I found the Adafruit "what size is this resistor" captcha pretty clever:
[https://blog.adafruit.com/2010/04/21/resisty-resistor-captch...](https://blog.adafruit.com/2010/04/21/resisty-resistor-captcha-solve-the-resistor-values-to-post-a-comment/)

Not suggesting it's broadly useful, but cool nonetheless.

~~~
0az
I'm personally a fan of Lichess's "mate in 1" captchas.

~~~
tyingq
I'm crap at chess, but there's this:
[http://tetration.xyz/ChessboardFenTensorflowJs/](http://tetration.xyz/ChessboardFenTensorflowJs/)

------
qot
> The answers are provided as MD5 checksums of the lower-cased answers which
> allows you to compare a users response with the answers without explicitally
> [sic] knowing the answer yourself.

Why bother? All the hashes I tried were able to be instantly cracked online by
CrackStation.

If the hashes are kept server-side, then why not use plaintext?
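
For reference, the server-side check the quoted documentation describes could look roughly like this in Python. Stripping whitespace before hashing is an assumption, and the accepted answers are a made-up example rather than real API output.

```python
import hashlib


def answer_hash(text):
    """MD5 hex digest of the lower-cased answer, as the quoted docs describe."""
    return hashlib.md5(text.strip().lower().encode("utf-8")).hexdigest()


def is_correct(response, accepted_hashes):
    """True if the user's response hashes to one of the accepted digests."""
    return answer_hash(response) in accepted_hashes


# Hypothetical challenge whose accepted answers are "four" and "4".
accepted = {answer_hash("four"), answer_hash("4")}
print(is_correct("Four", accepted))  # True
print(is_correct("five", accepted))  # False
```

Nothing in this check depends on the answers being hashed; a plain set of lower-cased strings would behave the same way on the server.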

~~~
kristopolous
Not only that, it gives the false impression that people can do it all client
side... If you're sending the answers over the wire, they can simply be sent
back to you without any real computational effort. No clever text processing
required.

Keeping it in plaintext would have made it more obvious that doing this is dumb.
Heck, two separate files would have even been better. Also, you can do
something like Damerau-Levenshtein [1] on plaintext so you wouldn't have to be
a stickler for exact matches.

[1]
[https://en.m.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_...](https://en.m.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)
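
A sketch of what fuzzy matching on plaintext answers might look like in Python. It implements the optimal string alignment variant (the restricted form of Damerau-Levenshtein), and the length-based tolerance in `is_close_enough` is an arbitrary choice for illustration.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions and swaps of adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "na" typed instead of "an".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def is_close_enough(response, answer):
    """Allow one typo for longer answers, none for short ones like "4"."""
    tolerance = 1 if len(answer) >= 5 else 0
    return osa_distance(response.strip().lower(), answer.lower()) <= tolerance


print(is_close_enough("Ornage", "orange"))  # True
print(is_close_enough("5", "4"))            # False
```

The tolerance has to shrink for short answers; otherwise "5" would be accepted for "4".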

------
kccqzy
Are new questions being continually added to the database of challenges? If
not, one can quite simply request many challenges, have a human solve them,
and remember the answers.
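
In Python, the "remember the answers" part is little more than a dictionary. This is a sketch under the assumption that question texts repeat verbatim (up to case and spacing); `QuestionBank` is a made-up name.

```python
class QuestionBank:
    """Remembers answers to questions a human has already solved."""

    def __init__(self):
        self._answers = {}

    @staticmethod
    def _key(question):
        # Normalize case and whitespace so trivial variations still match.
        return " ".join(question.lower().split())

    def lookup(self, question):
        """Return the stored answer, or None if the question is new."""
        return self._answers.get(self._key(question))

    def learn(self, question, answer):
        """Store an answer supplied by a human."""
        self._answers[self._key(question)] = answer


bank = QuestionBank()
bank.learn("What is Mark's name?", "mark")
print(bank.lookup("what is  Mark's name?"))  # mark
print(bank.lookup("What is 1+1?"))           # None
```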

If you generate the questions based on fixed rules, it is only a matter of
time before an algorithm or a human figures out all the necessary patterns.

I personally feel like these text puzzles aren't as energy-draining to solve
as reCAPTCHA, so a human could reasonably solve many hundreds in an hour of
free time or so. That doesn't even include having a TTS engine dictate the
question to Google Assistant.

~~~
kabacha
You leave this implication that modern "hard-to-crack" captchas are somehow
different, when in reality reCaptcha and co. are cracked to death and you
can literally buy 5000 captcha solves for less than 5 USD.

Captchas in general are a bit of an illusion really. The best captcha you can
have is a home-made one, as the attacker has to go through actual manual effort
rather than enabling a --solve-recaptcha flag on the bot script.

~~~
peteretep
I feel like there are orders of magnitude differences in cost and effort
between “completely free because you downloaded a library” and “$5 per 5k and
there’s a few HTTP requests in there too”

~~~
kabacha
Is there though? Whom are you protecting yourself from? People who write curl
scripts or serious attackers? Surely if someone wants to commit to an attack,
$5 is a completely negligible amount of money, right?

All you "protect" yourself from is casual script users and script kiddies,
which really can be solved by IP rate limiting. If someone has access to
thousands of IPs they can probably afford to drop $5 to solve the captchas
too, right?

~~~
throwawaymath
Having written code to bypass captchas, up to and including Google ReCaptcha:
yes, there is a difference. A large difference.

$5 _per request_ is not a negligible amount of money. In practice it doesn't
cost anywhere near that amount to call a MechanicalTurk API which will solve
ReCaptcha for you. But it's still significant for any nontrivial number of
requests, such as in the use case of scraping.

You should adjust your priors here. You're focused on the narrow case where a
win condition is achieved by spending _n_ dollars to solve a single instance
of ReCaptcha. People who use ReCaptcha are (in my professional experience)
overwhelmingly more focused on requiring ReCaptcha to be solved for every
individual request of a given type.

I have been in the position you speak of, where I had a revolving set of IP
addresses, requesting servers and user agents, and $5 per request would have
immediately shut my operation down. As it was, the actual ~$0.15 per request
to solve ReCaptcha was sufficiently significant that I couldn't curate enough
data for what I needed, despite having all the other resources you mention.

~~~
yorwba
The cost mentioned upthread was $5 for 5000 requests, or 0.1 cent per request.
Would that have allowed you to collect the data you wanted?
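
Spelling out the arithmetic with the two prices quoted in this thread (the 100,000-solve volume is only an illustrative number):

```python
# Figures quoted in this thread, in US dollars.
bulk_price = 5.00      # price for one batch of solves
bulk_solves = 5000     # solves in that batch
observed_price = 0.15  # per-solve price reported upthread

per_solve = bulk_price / bulk_solves
print(f"{per_solve:.3f} USD per solve")  # 0.001 USD, i.e. 0.1 cent


def total_cost(requests, price_per_solve):
    """Cost of solving one captcha per request at a flat per-solve price."""
    return requests * price_per_solve


# Illustrative volume only; the thread does not state how many requests were needed.
for label, price in (("bulk", per_solve), ("observed", observed_price)):
    print(f"{label}: {total_cost(100_000, price):,.0f} USD for 100,000 solves")
```

The two quoted prices differ by a factor of 150, which is why the same scraping job can look trivial at one price and unaffordable at the other.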

~~~
throwawaymath
Possibly, I no longer do that work. But that was never the price I saw for the
service. Cheapest I ever saw was still 10 times that, and requests frequently
had to be resent due to spotty completion.

------
cryptoquick
A neat way to improve on this would be to get a series of 3-5 challenges,
especially something like math and logic puzzles that build on each
other. That makes it a multi-dimensional problem that is difficult to cache or
precompute.

The answer that gets hashed is just a string that concatenates the answers, and
the challenges are always mixed in different orders. One possible example:

1\. Does Red combine with Yellow to make Green or Orange?

2\. If the answer to the last question was reversed, what letter would it
start with?

3\. If you added that letter to the end of these words, which word would be
most edible? Mac, Nam, Pi, Snak

answers: orange,e,pi

Of course, you'd want to design the interface to support multiple choice
selection.

I would also recommend against using MD5, since, even if the hash weren't
known to the end user, which should be sufficient even in this case, the
attack MD5 is most known for is the fact that it's trivial to generate text
that could match any given hash, regardless of what was used to originally
make it. It seems like a potential attack vector somehow, depending on the
case, and it's not terribly harder to just use one of the many tried-and-true,
known-not-broken cryptographic hashing algorithms. SHA-256 would be adequate.
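
A minimal Python sketch of the chained check, assuming the answers are normalized, joined with commas in question order, and hashed with SHA-256. The separator and the `chain_hash` helper are illustrative choices, not part of TextCaptcha.

```python
import hashlib


def chain_hash(answers):
    """SHA-256 over the normalized answers joined with commas, in question order."""
    joined = ",".join(answer.strip().lower() for answer in answers)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Server side: hash of the expected chain from the example above.
expected = chain_hash(["orange", "e", "pi"])

# A response only passes if every answer in the chain is right.
print(chain_hash(["Orange", "E", "Pi"]) == expected)  # True
print(chain_hash(["green", "n", "pi"]) == expected)   # False
```

Chaining multiplies the number of combinations an attacker has to try, but each individual answer still comes from a small set, so the hash alone does not make guessing expensive.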

I'm not 100% certain of all the logic behind this, and there are always cases I'm
not rigorous enough to consider, but I'd be interested in seeing how others might
improve this approach in similar ways.

~~~
kchamplewski
> it's trivial to generate text that could match any given hash

Source on this?

Wikipedia states what I have heard before which is that MD5 collision attacks
are pretty trivial now, but carrying out a preimage attack as you describe
remains theoretical at this time.

[https://en.wikipedia.org/wiki/MD5](https://en.wikipedia.org/wiki/MD5)

~~~
DownGoat
There is a way to defeat this that is much simpler: in the examples on the
page, 5 out of 7 have the answer in the question. Just compute the MD5 sum of
every word/combination of words in the question and you would find the answer
to many of the questions. This, together with a targeted dictionary, would
probably give you a very high success rate for little cost. MD5/SHA-family
hashes are inexpensive to compute; you can do billions of them in a second. If
you can't find the answer, then just request a new challenge until you find
one you can answer.
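
Roughly, in Python. This sketch only tries contiguous runs of up to two words, which is one reading of "combination of words", and the sample question is made up for illustration rather than taken from the API.

```python
import hashlib
import re


def md5_hex(text):
    """MD5 hex digest of the lower-cased text."""
    return hashlib.md5(text.lower().encode("utf-8")).hexdigest()


def guess_from_question(question, answer_hashes, max_words=2):
    """Hash every run of up to max_words consecutive words from the question
    and return the first one that matches a published answer hash."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if md5_hex(candidate) in answer_hashes:
                return candidate
    return None


# Made-up challenge where the answer appears in the question text.
hashes = {md5_hex("red")}
print(guess_from_question("Which of sock, library, cake or red is a colour?", hashes))  # red
```

When this returns `None`, the bot can fall back to a dictionary or simply request another challenge.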

------
miki123211
Some interesting "behind the scenes" insights about this post.

I posted this a couple of months ago. Just before Christmas, I received an
email from the mods asking me to repost, as they thought the story was
interesting. Initially after reposting, it didn't get much traction, but when
I look at it now, it actually has upvotes. However, the number of upvotes it
has is much bigger than the amount of karma I received. I wonder how
correlated these facts are.

------
jpalomaki
As recognized, captchas are not effective in keeping determined people out.
Nowadays the method to put some price tag for access seems to be to use SMS
verification.

However not everybody is very determined. There are for example bots that are
using the comment forms to send spam. The expected payout for each posted
message must be really, really low.

This could be useful for example in the comment form plugins created for
various content management systems.

------
dessant
TextCaptcha can be a viable alternative to reCAPTCHA depending on the type of
service you're protecting, and it doesn't come with the privacy drawbacks of
Google's catch-all privacy policy. The personal data collected by the
reCAPTCHA service is not clearly distinguished in their privacy policy [1],
and there are no tools to exercise control over the data gathered by
reCAPTCHA.

W3C has put together a comprehensive document about captcha types and their
application:
[https://www.w3.org/TR/turingtest/](https://www.w3.org/TR/turingtest/)

[1]
[https://policies.google.com/privacy?hl=en](https://policies.google.com/privacy?hl=en)

------
tux3
This seems like exactly the sort of questions a big language model a la GPT-2
would be able to answer.

I'm curious how TextCaptcha would fare in terms of complexity compared to the
other language understanding benchmarks.

~~~
n_ary
Honestly, someone willing to spend effort & time & resources on a big language
model will have enough resources to find other ways to access whatever they
are hoping for, similar to how they can buy ReCaptcha solving services. So, in a
sense, if that is your threat model, you need multiple traps; neither of these
will suffice alone. :)

------
meehow
This will work well as long as it's a niche captcha and creating an automated
solver isn't worth the effort. Hardcoding something in JavaScript that will be
injected into all form submits is even simpler and doesn't require anything from
your users. It will work as long as bots don't execute JS and nobody is
willing to spend some time on modifying the bot.

------
vortico
How many of these logic questions are there? After refreshing
[http://api.textcaptcha.com/myemail@example.com.json](http://api.textcaptcha.com/myemail@example.com.json)
about a hundred times, I'd estimate there are about 50 questions. Someone
could write a script that solves each particular question manually.

------
ajnin
I created something similar for my personal website. I don't get spam
comments, but I think that's only because I'm too small to be worth the effort
for spammers. It would be pretty easy to pattern-match the few different
question types and find the right answer using a dictionary.

------
meehow
I just ran some statistics, and "four" is the correct answer to 8% of questions.
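
One way such a figure could be measured from collected challenges, sketched in Python. It assumes each challenge is a dict with the question under `"q"` and a list of MD5 answer hashes under `"a"`; the sample data is made up and is not real API output.

```python
import hashlib


def md5_hex(text):
    """MD5 hex digest of the lower-cased text."""
    return hashlib.md5(text.lower().encode("utf-8")).hexdigest()


def share_of_answer(challenges, answer):
    """Fraction of challenges whose accepted hashes include md5(answer)."""
    if not challenges:
        return 0.0
    target = md5_hex(answer)
    hits = sum(1 for challenge in challenges if target in challenge["a"])
    return hits / len(challenges)


# Made-up sample in the assumed format.
sample = [
    {"q": "What is 2 + 2?", "a": [md5_hex("4"), md5_hex("four")]},
    {"q": "What is Mark's name?", "a": [md5_hex("mark")]},
    {"q": "How many letters are in 'bird'?", "a": [md5_hex("4"), md5_hex("four")]},
    {"q": "What is 1 + 1?", "a": [md5_hex("2"), md5_hex("two")]},
]
print(share_of_answer(sample, "four"))  # 0.5
```

A bot that always answers with the most common value would pass that share of challenges without reading the question at all.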

------
shakna
> The answers are the MD5 hashes of correct lower cased answers: you should be
> able to check responses from real users you challenge with the question
> against these checksums.

That's a somewhat concerning choice of hash function. MD5's collision
resistance is broken. It has been well known to be broken for more than a decade.

There are references to PHP on the page. 10 years ago [0] this message appeared
in PHP's manual:

> The well known hash functions MD5 and SHA1 should be avoided in new
> applications. Collission attacks against MD5 are well documented in the
> cryptographics literature and have already been demonstrated in practice.
> Therefore, MD5 is no longer secure for certain applications.

[0]
[https://www.php.net/manual/en/function.hash.php#94104](https://www.php.net/manual/en/function.hash.php#94104)

~~~
phyzome
Collision resistance is irrelevant if there's already a hash value specified.
What matters here is being able to find preimages for a given output, and even
if MD5 had a practical preimage attack, it would be too expensive to use just
for cracking captchas. :-)

PHP is a red herring; this would apply for any language.

~~~
shakna
> PHP is a red herring; this would apply for any language.

I used the warning from the PHP manual, because the site creator should be
familiar with it.

You'll find similar warnings against MD5 everywhere.

------
dangerface
I like the use of MD5 hashes to make it stateless. You could also test on the
client side before sending the form; it's so annoying for the full form to be
rejected because I couldn't guess the captcha.

~~~
ryanianian
This is also an attack vector that the article mentions. If you have logic on
the client-side that can tell if the captcha is correct it lets an attacker
brute-force it. (Basically as far as I see you can't send the hashes to the
client without opening up this vector.)

~~~
dangerface
True, I thought about scrypt and salts to slow down a brute force but for a 5
letter captcha I guess the search space would be too small no matter what you
did.
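
A small Python illustration of that point, assuming the salt and the scrypt digest are both sent to the client and the answers are ordinary words. The scrypt parameters and the five-word candidate list are only for demonstration.

```python
import hashlib
import os


def slow_hash(answer, salt):
    """scrypt digest of the lower-cased answer; parameters are illustrative."""
    return hashlib.scrypt(
        answer.lower().encode("utf-8"), salt=salt, n=2**12, r=8, p=1, dklen=32
    )


def brute_force(target, salt, candidates):
    """Try each candidate until one produces the target digest."""
    for candidate in candidates:
        if slow_hash(candidate, salt) == target:
            return candidate
    return None


# What the client would receive: a random salt and the digest of the answer.
salt = os.urandom(16)
target = slow_hash("four", salt)

# The attacker only needs a short list of plausible answers.
word_list = ["one", "two", "three", "four", "five"]
print(brute_force(target, salt, word_list))  # four
```

The salt stops precomputed tables and scrypt makes each guess slower, but with only a handful of plausible answers the total work stays tiny.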

------
Fragoel2
How about keeping the same idea of a different question every time but mixing
in some images? Like:

"How much is one + (image depicting a 6)?"

~~~
dsr_
It wouldn't be a text captcha, then.

It would certainly lose utility for visually-impaired users.

OCR is pretty well advanced by now; the image captchas that used to rely on it
now use extremely distorted images.

~~~
app4soft
Try solving this one using OCR next:

> _How much is one
> +[https://i.imgur.com/VbyiURo.png](https://i.imgur.com/VbyiURo.png) _

~~~
dsr_
Apparently I'm a robot. I can't add one + owl + owl and type in an answer I
expect to have recognized.

------
terminaljunkid
Aren't there any kinds of problems that can be automatically generated but can't
be solved reliably by machines?

~~~
jobigoud
Winograd Schema Challenges would be a step further for these text captchas.
They ask you to resolve an ambiguous pronoun in a statement.

Examples:

\- The man couldn't lift his son because he was so weak. Who was weak?

\- The firemen arrived after the police because they were coming from so far
away. Who came from far away?

\- The drain is clogged with hair. It has to be [cleaned/removed]. What has to
be [cleaned/removed]?

They are crafted in such a way that there are two variants of the challenge
every time. Ex: The man couldn't lift his son because he was so heavy. Who was
heavy?

I don't know if they are hard to generate automatically. They do require
common sense to solve and it's still an open problem.

[https://en.wikipedia.org/wiki/Winograd_Schema_Challenge](https://en.wikipedia.org/wiki/Winograd_Schema_Challenge)

------
rienbdj
Interesting idea. The fact that I have never encountered one of these in use
suggests that they don't work well enough.

~~~
atulvi
This is true for any new product

~~~
dylz
I've seen stuff like this in use for domain specific knowledge for forums for
years - either there is a near infinite question pool or you're reduced down
to something that's broken easily given a site-specific bot.

~~~
hombre_fatal
Over 10 years ago, off the shelf forum spam software, Xrumer, would build a
database of these questions for you when it detected the field on a /register
page. Since basically every website has a small, finite pool of questions
(usually just one), you'd just sit down, answer them all, and then click
"resume".

