Hi phikai, I built a privacy friendly alternative to ReCaptcha called FriendlyCa...

barnabask · on Oct 28, 2020

Man this needs more attention, cool project. I see you tried to submit to HN a couple of times and didn't get traction, that's too bad. Don't give up!

aeyes · on Oct 28, 2020

Is the demo somehow tweaked to be less hard?

On my machine it doesn't take any time to solve it and I see no signs of CPU usage. Even trying a couple of times in incognito mode and watching CPU immediately after loading the page for the first time.

On many sites creating a profile takes a few seconds. Loading one of my CPU cores for another 5 seconds wouldn't really bother me if I wanted to create massive amounts of profiles/posts. I could still do over 100 per minute on a standard desktop PC.

protoduction · on Oct 28, 2020

The default difficulty is set to a level that makes sense on websites that have a varied audience (which includes some ancient browsers on old devices).

The solver runs in WebAssembly and is really really fast (~4M hashes per second) - but not every browser supports WASM yet (around 0.3% empirically). The JS fallback is around 10 times slower (more in 5+ year old browsers) - for those users you want at least a decent solve time too.

For Gitlab's audience the difficulty can probably be increased a lot - it all depends on the website and usecase. I'm sure the JS fallback's performance can be improved (it involves a lot of operations on 64bit ints that need to be represented as two numbers in JS), happy to accept PRs [1] :)

[1]: https://github.com/FriendlyCaptcha/friendly-pow/blob/master/...
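To illustrate why the JS fallback is slower: Blake2b operates on 64-bit words, and a JS number cannot hold a 64-bit integer exactly, so each word has to be carried as two 32-bit halves. The sketch below emulates that in Python for a single wrapping addition. It is an illustration only, not the friendly-pow code; the names `add64`, `split64` and `join64` are made up for this example.

```python
MASK32 = 0xFFFFFFFF


def split64(x):
    """Split a 64-bit integer into (high, low) 32-bit halves."""
    return (x >> 32) & MASK32, x & MASK32


def join64(hi, lo):
    """Recombine (high, low) 32-bit halves into one 64-bit integer."""
    return (hi << 32) | lo


def add64(a_hi, a_lo, b_hi, b_lo):
    """Add two 64-bit values given as 32-bit halves, wrapping at 2**64."""
    lo = a_lo + b_lo
    carry = lo >> 32  # 1 if the low halves overflowed 32 bits, else 0
    hi = (a_hi + b_hi + carry) & MASK32
    return hi, lo & MASK32
```

Every 64-bit addition, rotation and XOR in the hash needs this kind of bookkeeping when only 32-bit integer operations are available, which is one plausible source of the slowdown.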

thinkloop · on Oct 28, 2020

What are your thoughts on performing a quick initial test on each client to measure their performance then tailoring the puzzle to be difficult enough for each?

unilynx · on Oct 28, 2020

Once the spammer figures out what you're doing, he'll just throttle the CPU for the duration of the quick test.

Depending on how smart the test is, just having Date.now() return values offset by -12000, -11000, -10000 for the first few calls might even do it.
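A minimal sketch of that clock trick, in Python rather than browser JS, assuming the benchmark simply subtracts two clock readings. `measure_ms`, `skewed_clock` and `fake_clock` are hypothetical names, and the fake clock stands in for Date.now() so the numbers are deterministic.

```python
def measure_ms(clock, work):
    """Naive benchmark: elapsed time is the difference of two clock reads."""
    start = clock()
    work()
    return clock() - start


def skewed_clock(real_clock, offsets):
    """Wrap a clock so its first calls are shifted by the given offsets (ms)."""
    pending = list(offsets)

    def clock():
        shift = pending.pop(0) if pending else 0
        return real_clock() + shift

    return clock


def fake_clock(times):
    """Deterministic stand-in for a real clock: returns the given readings."""
    readings = iter(times)
    return lambda: next(readings)


# The work really takes 50 ms in both runs.
honest = measure_ms(fake_clock([1000, 1050]), lambda: None)
spoofed = measure_ms(skewed_clock(fake_clock([1000, 1050]), [-12000]), lambda: None)
```

Here `honest` is 50 and `spoofed` is 12050: shifting only the first reading back by 12 seconds makes the client look 12 seconds slower, so a test that trusts the client's clock would hand out an easier puzzle.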

sytse · on Oct 28, 2020

That looks cool! Can someone create an issue to add support for this to GitLab? And maybe we can consider switching GitLab.com to this as well.

robotmay · on Oct 28, 2020

I'm personally interested in this too so I've created one :D https://gitlab.com/gitlab-org/gitlab/-/issues/273480

sytse · on Oct 28, 2020

Thanks for creating this! I think adding support for this in GitLab is a no-brainer. After that we can consider enabling it for GitLab.com

birdsbirdsbirds · on Oct 28, 2020

Hopefully you are successful, but how can you scale? If it takes 5 seconds on a desktop, then a server can solve 500.000 captchas per month. At $5 per month, a spammer can still send 1.000 messages for a cent.
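The arithmetic behind those figures, as a short worked example. It assumes a 30-day month and one core solving puzzles back to back; the 5-second solve time and $5 server price are the numbers from the comment above.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds in a 30-day month
solve_seconds = 5                      # solve time quoted above
server_cost_usd = 5.0                  # monthly server price quoted above

solves_per_month = SECONDS_PER_MONTH // solve_seconds  # 518,400
cost_per_thousand_usd = server_cost_usd / solves_per_month * 1000

print(solves_per_month)                 # 518400
print(round(cost_per_thousand_usd, 4))  # 0.0096
```

That is roughly 500,000 solves per month and just under one cent per thousand solves.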

protoduction · on Oct 28, 2020

It's not enabled yet in production - but the main mechanism is by increasing the difficulty as more requests are made from an IP in a certain timeframe (it's basically rate limiting at that point). Think: every 3rd request in a minute doubles the difficulty with some cooldown period.

With that the cost (and complexity) of an attack can hopefully be in the same ballpark as (or higher than) ReCaptcha - without your end user having to label cars or send data to Google.

But in the end a determined spammer will get through any captcha cheaply (for reference: ReCaptcha solves are sold by the thousands for $1) - we just hope we can do better than ReCAPTCHA, especially UX-wise.
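The escalation rule mentioned above ("every 3rd request in a minute doubles the difficulty with some cooldown period") could be sketched roughly as follows. This is not the production mechanism, which the comment says is not enabled yet. The class name, the sliding 60-second window used as the cooldown, and the cap on doublings are all assumptions made for the example.

```python
from collections import defaultdict, deque


class DifficultyEscalator:
    """Doubles puzzle difficulty for every N requests an IP makes in a window."""

    def __init__(self, base_difficulty=1, window_seconds=60,
                 requests_per_doubling=3, max_doublings=8):
        self.base_difficulty = base_difficulty
        self.window_seconds = window_seconds
        self.requests_per_doubling = requests_per_doubling
        self.max_doublings = max_doublings  # assumed upper limit on difficulty
        self._recent = defaultdict(deque)   # ip -> timestamps inside the window

    def difficulty_for(self, ip, now):
        """Record a request at time `now` (seconds) and return its difficulty."""
        recent = self._recent[ip]
        # Cooldown: requests older than the window no longer count.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        doublings = min(len(recent) // self.requests_per_doubling,
                        self.max_doublings)
        return self.base_difficulty * 2 ** doublings


escalator = DifficultyEscalator()
burst = [escalator.difficulty_for("203.0.113.7", t) for t in range(6)]
later = escalator.difficulty_for("203.0.113.7", 100)
```

With these defaults `burst` is `[1, 1, 2, 2, 2, 4]` and `later` drops back to 1 once the earlier requests have aged out of the window.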

leonidasv · on Oct 29, 2020

I love this concept of proof-of-work captchas, but there's a growing number of tools and ways to bypass IP blocks via IP rotation[1], especially after the explosion of IaaS providers. How do you intend to tackle this?

[1] Some examples: https://rhinosecuritylabs.com/aws/bypassing-ip-based-blockin... https://oxylabs.io/products/real-time-crawler https://github.com/alex-miller-0/Tor_Crawler https://www.scrapinghub.com/crawlera/

njitram · on Oct 29, 2020

There are free and paid lists of all IP addresses from datacenters, like https://udger.com/resources/datacenter-list. They probably exist specifically to prevent this, so maybe that's an option here.

coder543 · on Oct 28, 2020

The obvious follow-up question is how IPv6 impacts this, because I think it's supposed to be easy for someone to get their hands on a decent chunk of IPv6 addresses.

Maybe the difficulty could scale as a property of how similar the IP address is to previously seen addresses... so the addresses in the same /64 block would be very closely related, for example. (I think that's how IPv6 works... but definitely something I haven't researched lately, so I could just sound very confused)

protoduction · on Oct 28, 2020

I don't have all the answers yet, but indeed rate limiting a larger block (at least /64), or even at multiple prefix sizes with different weighting makes sense.
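One way to picture rate limiting at multiple prefix sizes, using Python's standard `ipaddress` module. The choice of /64 and /48, the function names, and the plain counter are assumptions for illustration; any weighting between prefix sizes is left out.

```python
import ipaddress
from collections import Counter


def prefix_keys(address, prefix_lengths=(64, 48)):
    """Return the rate-limit keys for an address.

    IPv4 addresses are keyed as-is; IPv6 addresses are keyed by each
    enclosing prefix, so neighbours in the same block share a key.
    """
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        return [str(ip)]
    return [
        str(ipaddress.ip_network(f"{ip}/{length}", strict=False))
        for length in prefix_lengths
    ]


seen = Counter()


def record(address):
    """Count a request against every key and return the current counts."""
    keys = prefix_keys(address)
    seen.update(keys)
    return {key: seen[key] for key in keys}


print(prefix_keys("2001:db8:1:2:3:4:5:6"))
# ['2001:db8:1:2::/64', '2001:db8:1::/48']
```

Two addresses that differ only in the last 64 bits map to the same /64 key, while an address from a neighbouring /64 still shares the /48 key.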

zahllos · on Oct 29, 2020

So the way this is supposed to work is that providers hand out /48s and each site should be allocated a /64. In practice if you for example rent a VPS, you'll be handed a /64 for it by your service provider from their /48.

I would personally treat any /64 as the same. Depending on your local network setup the second half of the address could be anything and could change frequently. You might also get multiple addresses. Whereas getting a new /64, or /48, requires slightly more effort.

Of course there's a risk you'll block a /64 and that takes out some whole company or whatever, but I've seen that happen to corporate proxies that got flagged as a source of spam as well so this is not an easy problem even without the 2^128 address space.

ognarb · on Oct 29, 2020

Your website mentions that FriendlyCaptcha is open source, but looking at the license in the repository, it is a custom license that can't be defined as open source. Can you change the wording to "source available"?

_qcti · on Oct 28, 2020

Love to see this. ReCaptcha is nothing short of a menace. I'll take a shot at this for my next project

laughinghan · on Oct 28, 2020

There doesn't appear to be any discussion on your website or on GitHub about why, to be blunt, this is even a good idea in the first place.

A classic 2004 paper, "Proof-of-Work" Proves Not to Work [0], explained that the fundamental problem with proof-of-work bot filters is that attackers will always be able to solve the cryptographic puzzle faster than legitimate users. A touch of security-through-obscurity can help at the margins, but you chose Blake2b, which is used by cryptocurrencies like Zcash, Siacoin, and Nano [1], and as a result there are optimized GPU algorithms (first Google result [2]) and FPGA designs (one of the top Google results [3]). Have you run the numbers on any of those?

The closest to any discussion of these numbers that I saw was a mention on your website that it may take up to 20s on mobile; for comparison, the much-hated image CAPTCHA takes about 6-12s on average for native English speakers, and 7-14s for non-native speakers [4].

In another comment you bring up the idea of starting with a lower difficulty, and increasing it with repeated requests from the same IP address (IPv4, I assume). Unfortunately, access to unique IPv4 addresses is highly correlated with access to more compute power: laptops and desktops in developed countries are most likely to be in a household with a unique IPv4 address, whereas mobile devices on 4G internet and households in developing countries are more likely to be behind Carrier-Grade NAT [5], where thousands or millions [6] of hosts share a pool of a handful or dozens of IPv4 addresses. (The exact same concern applies to IPv6 /64 prefixes.)

This means that mobile devices will face a "double-jeopardy": your service will present them with higher proof-of-work difficulties because the same IPv4 address is shared by more people, and at the same time, the mobile device solves the proof-of-work slower for the same difficulty than a desktop.

Do you have documented anywhere on your website or GitHub how you address these concerns?

[0]: https://www.cl.cam.ac.uk/~rnc1/proofwork.pdf

[1]: https://en.bitcoinwiki.org/wiki/Blake2b

[2]: https://github.com/zhq1/sgminer-blake2b

[3]: https://xilinx.github.io/Vitis_Libraries/security/2020.1/gui...

[4]: http://theory.stanford.edu/people/jcm/papers/captcha-study-o...

[5]: https://en.wikipedia.org/wiki/Carrier-grade_NAT

[6]: Yes, millions. RFC 6598 reserved a /10 for them, which is 4 million unique IPv4 addresses: https://tools.ietf.org/html/rfc6598
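The size of that block can be checked with Python's standard `ipaddress` module. RFC 6598 reserves 100.64.0.0/10 as shared address space, and a /10 leaves 32 - 10 = 22 host bits.

```python
import ipaddress

# Shared address space reserved by RFC 6598 for carrier-grade NAT.
cgnat = ipaddress.ip_network("100.64.0.0/10")

print(cgnat.num_addresses)  # 4194304, i.e. 2 ** 22
```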

coder543 · on Oct 28, 2020

I'm not associated with the project in any way, but your well researched comment did miss at least one important factoid.

This comment:

> The closest to any discussion of these numbers that I saw was a mention that it may take up to 20s on mobile; for comparison, the much-hated image CAPTCHA takes about 6-12s on average for native English speakers, and 7-14s for non-native speakers.

Missed this quote from the website:

> As soon as the user starts filling the form it starts getting solved

> By the time the user is ready to submit, the puzzle is probably already solved.

The time spent solving reCAPTCHA is active user involvement. The time being spent on Friendly Captcha is passive and can overlap with time being spent filling out a form.

"up to 20 seconds" was also seemingly presented as a worst-case scenario. Most users' devices would presumably be faster than that, but I don't know how the author researched that conclusion on how performance scales. Friendly Captcha does report back some information on how long it is taking users to solve the captcha, and it looks like website owners could use that to adjust the difficulty based on the needs of their specific audience and how tolerant they are of untargeted spam.

The stuff you point out about Blake2b seems entirely legitimate, and I wonder if an Argon variant would be more appropriate to avoid specialized hardware being quite so problematic.

Personally, I really like the idea of Friendly Captcha. Certainly, there are problems with any captcha implementation. People can rant for many, many paragraphs about websites that use reCAPTCHA... I'm not surprised to see someone ripping apart a different captcha system. The ideal solution would be for spammers to just stop being so obnoxious... but good luck with that plan.

laughinghan · on Oct 28, 2020

> The time being spent on Friendly Captcha is passive and can overlap with time being spent filling out a form.

Great point!

> I wonder if an Argon variant would be more appropriate

The creators of Argon2 actually also created a memory-hard proof-of-work function they call MTP (for "Merkle Tree Proof", which is a terrible name, totally un-Googleable; I always have to search for the title of their paper, "Egalitarian Computing"): https://arxiv.org/pdf/1606.03588.pdf

A bug bounty for it was sponsored by Zcoin, which is nice. Zcoin is actually considering moving away from it, but mainly because the proof size of 200kb is prohibitive, which is less of a concern for a captcha system: https://forum.zcoin.io/t/should-we-change-pow-algorithm/477

> I'm not surprised to see someone ripping apart a different captcha system

I really don't mean to rip it apart. I just wanted to see some discussion, any discussion, of the well-known flaws with the idea and what ideas OP has to address them.

protoduction · on Oct 28, 2020

It is also important to note that the 6-12 seconds and 7-14 seconds reported in the paper are for the garbled text CAPTCHAs, not for image labeling tasks (fire hydrants, cars, etc).

protoduction · on Oct 28, 2020

I'll try to provide my thoughts on each of the issues you've mentioned, let me know if there's something I missed.

On using blake2b: I chose blake2b as I was looking to use a hash function that is small in implementation, readily available and already optimized. With WebAssembly the solver can achieve (close to native) speeds and be at least an order of magnitude or two closer to optimized GPU algorithms.

As for specialized hardware: image tasks (and even more so audio tasks, which must be present for accessibility reasons) have the same issue, since they can be solved by GPU algorithms (i.e. machine learning, in which even a low percentage success rate would already be enough). If you search on GitHub you will find there are more ML captcha cracking repos than captcha implementations - they are probably even easier to get started with than adapting GPU miner code.

Image/Audio Captcha vs ML is an arms race that can be beaten for split seconds of compute (even on CPU) or cheap human labeling: it's just as broken. FriendlyCaptcha optimizes for the end user (privacy + effort + accessibility) by not engaging in the arms race - I think it makes a better trade-off. Like the sibling comment pointed out, the captcha solving can happen entirely in the background so that hopefully it doesn't even make the user wait.

As for rate limiting/difficulty adjustment: it's not perfect and it could lead to problems if you share the IP with a spammer (and let's be realistic: even with a million users on one IP there won't be tens of users signing up to some forum per minute). Normal captchas also have problems here, though: users from these locales already get presented with much more difficult+frequent recaptcha tasks (I also doubt they are localized: American sidewalks are harder to label if you've never seen one in real life). Setting a reasonable upper limit to difficulty may be good enough here.

On not using blake2b: I have considered mutating the hashing algorithm every day randomly to make writing an optimized solver for it all that more difficult - but that would mean one could no longer self-serve the JS+WASM and be done with it. I won't rule it out for FriendlyCaptcha v2 if this does ever become a real problem.

Swapping out the hash function should be easy (the puzzles are versioned to allow for this). If you have a different function in mind and someone implements it in AssemblyScript (so we also have a JS fallback) then we can definitely consider it.

laughinghan · on Oct 31, 2020

Thanks for your detailed response.

I've seen all the projects claiming to have broken ReCAPTCHA—often using Google's own ML services, hilariously—but it's unclear to me how broken image/audio CAPTCHAs are in practice (and the number of GitHub repos doesn't seem like a good measure to me). If they really are completely broken, then why are they still so widely used? If they really are completely broken by ML, how do human CAPTCHA-solving services stay in business?

> FriendlyCaptcha optimizes for the end user (privacy + effort + accessibility)

Good point. I am concerned though that burning CPU cycles on the proof-of-work uses battery life if the end-user is on mobile, without their getting any choice in the matter. What if, given an informed choice, they would have preferred an image CAPTCHA? (On the other hand, that could use more cellular data. Might be good to run the numbers on this too.)

> even with a million users on one IP there won't be tens of users signing up to some forum per minute

I think this is a bad choice of threat model. "Some forum" would likely be better off with simpler measures, like a hidden honeypot textbox: https://dev.to/felipperegazio/how-to-create-a-simple-honeypo...
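For reference, the server side of a hidden honeypot textbox can be as small as the sketch below. The field name `website` and the function name are hypothetical; the idea is only that humans never see the field, so any value in it suggests an automated submission.

```python
def looks_like_bot(form, honeypot_field="website"):
    """Return True when the hidden honeypot field was filled in."""
    return bool(form.get(honeypot_field, "").strip())


print(looks_like_bot({"name": "Ada", "website": ""}))              # False
print(looks_like_bot({"name": "x", "website": "http://spam.example"}))  # True
```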

NorwegianDude · on Oct 28, 2020

Cool project, but I do find it quite ironic that it's named friendly captcha when it's not a captcha.

Eldt · on Oct 28, 2020

How would you define "CAPTCHA"?

perryizgr8 · on Oct 29, 2020

The original expansion was "Completely Automated Public Turing test to tell Computers and Humans Apart".

jimmydorry · on Oct 29, 2020

CAPTCHA: a computer program or system intended to distinguish human from machine input, typically as a way of thwarting spam and automated extraction of data from websites

I would say this Oxford Languages dictionary definition is close enough.

webphineas · on Oct 28, 2020

Really nice! Finally someone is using the blockchain technology in a meaningful way!

laughinghan · on Oct 28, 2020

This doesn't use a blockchain, it uses a Hashcash-style proof-of-work function (an idea that predates Bitcoin by more than a decade): https://en.wikipedia.org/wiki/Hashcash
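A minimal sketch of what a Hashcash-style puzzle looks like, using Blake2b from Python's standard `hashlib`. The puzzle format here (challenge bytes plus an 8-byte nonce, difficulty counted in leading zero bits) is an assumption for illustration and is not Friendly Captcha's actual puzzle format.

```python
import hashlib


def leading_zero_bits(digest):
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def puzzle_hash(challenge, nonce):
    """Blake2b over the challenge followed by the nonce as 8 bytes."""
    return hashlib.blake2b(challenge + nonce.to_bytes(8, "little")).digest()


def solve(challenge, difficulty_bits):
    """Brute-force the smallest nonce whose hash has enough leading zeros."""
    nonce = 0
    while leading_zero_bits(puzzle_hash(challenge, nonce)) < difficulty_bits:
        nonce += 1
    return nonce


def verify(challenge, nonce, difficulty_bits):
    """Checking a solution costs one hash, however hard it was to find."""
    return leading_zero_bits(puzzle_hash(challenge, nonce)) >= difficulty_bits


nonce = solve(b"example-challenge", 12)
print(verify(b"example-challenge", nonce, 12))  # True
```

Solving takes about 2^12 hashes on average at this difficulty, and each extra bit doubles the expected work, while verification stays at a single hash. No chain of blocks or shared ledger is involved.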

redbergy · on Oct 28, 2020

Awesome work, I will be giving this a try in my next project

remram · on Oct 28, 2020

> up to 20 seconds on old smartphones

That sounds like a very battery-unfriendly idea.

protoduction · on Oct 28, 2020

It's not perfect, but maxing a single core for 20 seconds on an older smartphone is a necessary evil for this kind of captcha.

The alternative: loading a third party script and multiple images (~2MB) to label for ReCAPTCHA and spending time performing the task also takes some battery (and mental) power.