You probably don’t need ReCAPTCHA (kevv.net)
450 points by ve55 6 days ago | 234 comments





"Many developers vastly over-estimate the likelihood of customized spam."

I run hundreds of small, random, low-traffic, low-priority sites. Without some form of form control, they ALL get hit with customized spam and random other crap. I don't have decent experience with many things in life, but I can say this is one topic I have YEARS of experience with. I've never over-estimated the amount of spam, of any type, that a form can get after being on the web for just a few months. Doesn't matter how big the sites are or what they do.

I'm not saying ReCAPTCHA is the only thing out there or even the best, but having an open form is just asking for trouble.


Have an input element that can't be seen. If it has something in it, ignore the submit. Works for all my sites so far.
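A minimal server-side sketch of that idea, assuming a hypothetical hidden field named "website" (real users never see it, so it should always arrive empty; naive bots fill every field):

```python
# Honeypot check: the form includes an input that is hidden from humans
# via CSS; any submission that fills it in is assumed to be a bot.
def is_probably_bot(form: dict) -> bool:
    # "website" is a hypothetical honeypot field name.
    return bool(form.get("website", "").strip())

# A human leaves the hidden field empty; a naive bot fills it.
assert not is_probably_bot({"email": "a@b.com", "website": ""})
assert is_probably_bot({"email": "a@b.com", "website": "http://spam.example"})
```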

Doesn’t work as soon as you’re big enough to target.

The company I work for makes a SaaS forum product, and while we do have multiple spam prevention methods (Akismet, StopForumSpam, a honeypot, a hidden input), there’s enough stuff out there targeting our platform that a ReCAPTCHA on the registration form is needed.

We haven’t needed it on any other forms yet, though. After registration it’s all handled by the other methods and various moderation tools.


Please just don't use the bouncing ball that Dropbox made me use once. It was the first time my lack of athleticism prevented me from signing in.

I’ve not heard of this one before. Wouldn’t it cause serious problems for accessibility?

Almost certainly.

What is that? This comment is the first Google result for dropbox bouncing ball.

I tried finding it myself, and I've only seen it once. I was logging into Dropbox via Chrome on my Android phone, and I must have done something to trigger captcha spambot hell. I just tried triggering it again and was not able to do so.

The ball had an animal in it and I was asked to bounce the ball, causing it to rotate. I had to bounce the ball with just enough force to get it to land so the animal was positioned upright. After several failed attempts, I gave up.


Did you try randomizing the 'name' and 'ids' of the inputs? (including the invisible one)

I was preparing a response here, but many of the other commenters have covered it.

I recently spent time ensuring our Auth pages’ HTML could be easily cached outside of our application servers. They were a common target of DDOS attacks because we were generating a unique nonce for CSRF protection.

Randomizing form field names does not defeat a targeted attacker (and we have definitely been a target), prevents HTML caching, and will prevent auto filling fields by browsers and password managers.

Additionally it will be terrible from a usability and accessibility standpoint.

It’s trivial to target a form field by the text/label around it so those would need to be randomized as well.


> Randomizing form field names […] will prevent auto filling fields by browsers and password managers.

I wholly agree that this would not help, but for the sake of completeness, I want to point out that <input autocomplete=""> [0] is designed to solve this, by decoupling input field names from their intent.

But Chrome is playing dumb about it [1]. And of course, the spambots will just adapt to parse the autocomplete info…

[0]: https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes...

[1]: https://www.reddit.com/r/programming/comments/ar1qj1/chromiu...


>Additionally it will be terrible from a usability and accessibility standpoint.

ReCAPTCHA is by definition terrible from a usability and accessibility standpoint too; it just adds all the privacy problems on top.


> and will prevent auto filling fields by browsers and password managers.

I would MUCH prefer the recaptcha over this!


Your browser probably isn’t smart enough to autofill the ‘comment’ part of a guestbook or the ‘body’ of an email.

I really don't know how well that will work against a dedicated attacker.

I am much more confident in ReCAPTCHA stopping bots than in any roll-your-own solution.

I don't want to hope that an alternative is good enough for my needs. I want the best when it comes to protecting my site.

Any alternative needs a proven track record and support to make me consider replacing ReCAPTCHA.


> I am much more confident in ReCAPTCHA stopping bots than in any roll-your-own solution.

I'm much more afraid of ReCaptcha blocking bonafide users. It's a harmful obstacle that punishes legitimate users for not sharing as much data as possible with Google.

Even if you really need a captcha, there are better solutions out there.


Google really, really does not like people who use Firefox and/or a VPN.

> I am much more confident in ReCAPTCHA stopping bots than in any roll-your-own solution.

I am as well. We enabled Recaptcha on one site and had spam signups drop by 99%. Unfortunately, regular signups also dropped by 20% because people give up when they hit Recaptcha and don't absolutely, seriously need what it's protecting. To us, joining the arms race against the spammers (which, so far, we've easily won) was much more profitable than turning away legitimate customers.


> I really don't know how well that will work against a dedicated attacker.

>>> You probably don’t need ReCAPTCHA

Probably being the keyword, because you probably aren't a big enough site for a dedicated attacker. Or for a dedicated attacker to be an issue.

And really, let's s/attacker/bot/g. Not every bot is a problem. Not every bot is an attacker, i.e. someone doing something malicious.


For $20 you can solve a few million ReCAPTCHAs using Buster and a paid-for STT engine. At the moment Buster works about 95% of the time, so you'd see significant amounts of spam even with ReCAPTCHA.

I'd love to pay $20 for a firefox extension that makes this problem go away. I use lots of privacy extensions on Firefox, and those Captchas are annoying as hell. Tor is even worse.

Can you please provide a few ready-to-use links?


Just install Buster; it's free on the Mozilla add-on site. You can set it to an STT provider other than Google, which I recommend since they seem to detect use of their own STT engine now.

You can pay for Azure and other STT engines to solve it for you, and the results are usually a bit better.


When you consider what “the best” means, please include the value of not feeding your users into Google’s gaping maw.

Some never cared, or stopped caring altogether because of the average user's apathy. Not that I agree with it, but I can see why someone would ignore that con in favor of the pros.

How much does ReCAPTCHA's aggressively targeting non-Chrome and/or privacy-enabled browsers and making completing captchas exceedingly difficult factor in your decision?

Do you want your site "protected" from those users, too?


Most of the time when I encounter ReCAPTCHA I don't even bother filling it out - it's a huge pain in the ass, and apparently I look scary, because I always have to jump through way too many hoops before I'm allowed into the crappy walled garden it's probably protecting. I can't be the only one who feels this way; that is something you should consider when picking a captcha solution as well.

So you force your users to consent to sharing all of their data with Google? That’ll teach ‘em.

What's an alternative that works at scale, though? It's easy to say "this is bad for these reasons, don't use it" while ignoring that there aren't really better options once you get targeted.

I've used a bunch of randomized questions with single-word answers (case insensitive and typo tolerant) and hidden fields for years now.

You can use common knowledge or simple ambiguity of language. You can use simple arithmetic, written in properly obfuscated HTML and randomly generated on each page load. You can use a custom question about the content of the article (which helps encourage informed answers).

On a small blog of mine, just one question with one answer on the contact form has prevented all spam for over 5 years now, although it would be trivial to exploit in a targeted attack.

Targeted attacks are rare unless your captcha protects a juicy target that is worth a targeted attack at some point.
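A sketch of the randomized question/answer approach described above; the question pool and field handling here are hypothetical, and answers are matched case-insensitively as the commenter describes (typo tolerance is omitted for brevity):

```python
import random

# Hypothetical pool of (question, accepted answers) pairs.
QUESTIONS = [
    ("What color is the sky on a clear day?", {"blue"}),
    ("How many legs does a cat have?", {"4", "four"}),
]

def pick_question():
    """Choose a random question to render into the form."""
    return random.choice(QUESTIONS)

def check_answer(question: str, answer: str) -> bool:
    accepted = dict(QUESTIONS)[question]
    return answer.strip().lower() in accepted

q, answers = pick_question()
assert q in dict(QUESTIONS)
assert check_answer("How many legs does a cat have?", " Four ")
assert not check_answer("How many legs does a cat have?", "five")
```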


Yeah but to be fair he did ask for alternatives in case you are targeted. It happened at work here too, someone with a grudge and a botnet waged a multi-month targeted campaign, and reCAPTCHA was the only thing that helped.

Are there alternatives in situations like this?


To clarify, I do think that this post gives good alternatives, because most spam is not targeted. However, you must do something like this if you're a big site, or a small site that pissed someone off.

The reasonable thing to do would be to initially create challenges with multiple levels/difficulties so you can quickly change the mechanism when you are really targeted.

For my personal blog I managed to be spam free with a simple question/answer pair for 5 years. Took me a minute to implement and leaves my user data where it belongs.


"all their data" is a bit much, isn't it? ReCAPTCHA gives Google exactly one datum, namely the user's visit to the one page it is on.

And I would even hazard a guess that the TOS specify that Google will not retain/link that information, considering that's how Analytics is run.


I am fairly certain that ReCAPTCHA does many things behind the scenes. It is probably using WebGL and many other browser features to "fingerprint" your browser, OS, graphics card, sound card, etc. This is as simple as, for example, drawing some polygons in the background and then reading back the frame buffer, because different graphics cards/drivers may produce slightly different output. It can then store that fingerprint to show you fewer ReCAPTCHAs in the future if you successfully pass the first one. It can also link that fingerprint with every other website that uses Google Analytics, and now they have your full browsing history. The TOS may specify they are not _sharing_ that information, but they can do whatever they want internally to fully mine that data.

Well, on top of that you train the image recognition algorithms of a tech giant. So for them it is a win-win strategy: user data and free labour.

ReCAPTCHA basically looks up your Google account, checks your browsing history, and checks whether your IP looks "spammy" to determine if you are a bot. The actual challenge is just a data mining operation and isn't meant to actually prove you are human: if it has already determined you are not, it won't let you through even if you do 10 challenges correctly.

Theoretically it might use those signals when you are logged in, but ReCAPTCHA also works when you are not logged into Google. So, not really.

Just because login isn't required doesn't mean it's not recorded.

No, that's how it used to be. Now with ReCAPTCHA v3 they recommend you load it on all your pages, not just the forms you are trying to protect, so they can predict friend vs. foe more accurately.

Or rather keep Google tracking cookies alive forever and updated.

So how does one block this?

Firefox and uMatrix[0], and then never go to those sites again, because you won't be able to use them anyway. Whether or not you want to contact the owner of the site and tell them what's up is up to you.

[0]: https://addons.mozilla.org/en-US/firefox/addon/umatrix/


It's trivial to detect element visibility, this just doesn't work in bigger sites.

You're right that it doesn't work, but it is not trivial at all to detect visibility; there are millions of ways to hide an element using CSS. For example, a rare one (without using "opacity", "display" or "visibility") is: transform: scale(0.00001);

The only way I see this being useful is if you do this for one or more elements as well as encrypt the name of every input element, and also randomize the layout enough that they can't easily use CSS selectors or regular expressions to find the relevant inputs by page location.

I can and have defeated forms that tried to do all of those things very easily in the past.

Keep in mind that if you randomize across a few variations (i.e. 4-5 page layouts), that's easily discerned if you pull the page source down 20-30 times, do a complex diff, scrub out obviously random strings, and check the total number of unique variations you're seeing.

That may seem like a lot of work, but consider that if you don't do it all at once, but instead roll out small change after small change, the person or people bypassing it aren't weighing the cost of doing everything required to bypass it against finding another open mail form, only the cost of bypassing the one new fix you put in place. Also, they might think it's fun doing so...

And on the site dev's side, they can just choose to outsource it to a CAPTCHA (not that there aren't services to easily bypass CAPTCHAs at scale at sub-cent per CAPTCHA rates, see https://anti-captcha.com/).

Note: To forestall any assumptions, I wasn't doing any spamming or helping spamming in any way.


You're not trying to make your site absolutely bot-proof. Someone deliberately targeting your site can figure out any such measures. (You want legitimate users to do so.) You're just trying to throw in enough friction that most common drive-by scripts won't succeed.

It's a "don't have to outrun the bear" situation, make yourself just difficult enough that some easier target gets snagged instead.


> It's a "don't have to outrun the bear" situation

If everyone else is incorporating recaptcha, they're all running faster than you. Even with bypass services, cheap is not the same as free, especially at the scale spam runs at. I imagine a mail form that obviously doesn't incorporate a CAPTCHA is going to garner some attention. It might work for weeks or months if it's not being paid attention to, so that's probably worth them spending a few minutes looking at.


I’ve used a simple English question with a five-letter word as the answer successfully for 8 years on a contact form now. The text isn’t obfuscated, and the answer is always the same.

This is as primitive as it gets. I didn’t get a single spam mail in all that time.

The idea is not to outrun your competition, it is to become a special target that would demand special work to successfully get into. Bots are dumb as long as the humans behind them don’t give them a hint how to deal with your site.

And if you’re really that valuable of a target, you can step it up a notch or even switch to google’s data collecting solution.


I mostly agree... I used to work for a classic car website, and in that case, we dealt with a LOT of comment spam, and scams that were out there. A lot of it is actually individual people, doing actual work to get past. We also did see a lot of custom bots, etc. It took a few different approaches and even recaptcha wasn't always the best option, but it did help with most of the non-scam traffic.

> Even with bypass services, cheap is not the same as free, especially at the scale spam runs at.

Spam doesn't scale on a small site. Say you can absolutely fill a small site with spam comments to the point that 99% of comments are spam. Very few people visit the site (it's small after all). Fewer still read the comments. Virtually none of those will click on the (usually obvious) spam links. And still fewer will buy, making you money. If you spend 2 hours customizing your spam script to circumvent anti-spam measures on a small site, you might as well flip burgers at McDonald's, you'll make significantly more money.

Spam works at scale only when you're not customizing. I'm involved with quite a few small to medium and a few larger sites (the largest getting around 4m PI/month) and though we use WP we get virtually no spam because of trivial deviations. We get an immense amount of attempts though. The little we do get is obviously manual spam: in the correct language, with content targeted to the individual page/post content (beyond "very interesting article, I wrote about the same" one-size-fits-all).


The spam I see is trying to add little bits of pagerank all over the place.

What about the wikipedia solution for this: rel="nofollow" ?

The spammers don't care; they spam anyway.

That also scales only if you don't customize. A low-value backlink that will usually be removed in the near future isn't worth an hour or two of a developer's time.

do you eliminate all spam that way? i get signup SEO spam even with recaptcha & stopforumspam.

No it does not. Some creeps through the registration system, which is why we have additional trust measures afterwards.

On my own site I see 1 or 2 spam posts a week although I get the feeling it’s real people doing the registration. They sign up, make 1 comment, get reported very quickly, then banned.

We haven’t had to make our signup/registration system that strong in of itself though, because most of our largest clients end up using some SSO method exclusively and will have their own prevention methods.


It will also ignore people who use the browser's autofill function. I realized this after receiving a dozen complaints.

You can randomly generate the "name" of the fields and autofill will never fill them. Another option to disable autocomplete is to leave them without a "name" and handle the submit using JavaScript.
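A server-side sketch of the randomized-name idea, assuming the token-to-name mapping is kept in the user's server-side session (all names here are hypothetical):

```python
import secrets

def randomize_fields(real_names):
    """Map a fresh random token to each real field name.

    The tokens are rendered as the form's name="" attributes; the mapping
    would be stored server-side (e.g. in the session) for this request.
    """
    return {f"f_{secrets.token_hex(8)}": name for name in real_names}

def decode_submission(mapping, form):
    """Translate the randomized names in a submission back to real ones."""
    return {mapping[k]: v for k, v in form.items() if k in mapping}

m = randomize_fields(["email", "message"])
token_for_email = next(k for k, v in m.items() if v == "email")
decoded = decode_submission(m, {token_for_email: "a@b.com"})
assert decoded == {"email": "a@b.com"}
```

Note the downside discussed elsewhere in the thread: browsers and password managers can no longer autofill these fields.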

Disabling autocomplete is user-hostile and additionally should be considered a security flaw. It makes it harder to use password managers.

You can disable a single input field from autocomplete

`autocomplete="off"`


'Autocomplete' doesn't work in Firefox (since version 38), Google Chrome (since 34), or Internet Explorer (since version 11).

Huge pain


This totally works in latest Chrome/FF/Safari

https://codesandbox.io/s/static-jkvzs

https://jkvzs.codesandbox.io/

The other trick is to add a random string/number in front of the name attribute, e.g. name="348349_name". This prevents autofill. Interestingly, 1Password and LastPass are smart enough to infer that it's a name or email field.

For the honeypot, random number + word makes it ignored by autofill/1password


My mistake,

I looked into it again, and it seems Chrome enabled it again in Chrome 68: https://stackoverflow.com/questions/25823448/ng-form-and-aut...

Firefox had it disabled too, but then enabled it again: https://developer.mozilla.org/en-US/docs/Web/Security/Securi...

And IE is just a cluster f.

My point being, autocomplete="off" is not a valid solution, as it can break at an update's notice, and the code hacks, while they may work, are a pain to deal with.


There are people who don't run Javascript.

This field will also be available to all screen reader users. It's important to add an adequate label ("Please do not fill out this field" or similar).

Yeah, these honeypots work quite well; however, the last time I looked into it (circa 2012) I read that they are not ideal for visually impaired users with screen readers.

I forget everything that you need to do, but I believe I read that there's a happy medium you can strike such that a script would still input data but it wouldn't be shown to a screen reader.

I did this years ago, not expecting it to work - but there hasn't been a single piece of form spam since!

I'm sure this won't work for everyone, but if you're small, I highly recommend giving it a go.


We used to do this; it does not seem to work at all these days anymore.

Now your site is terrible for blind people.

That is smart

I've got some web sites that use javascript to submit a form to an API gateway endpoint which then uses SES to send me an email. I get absolutely zero spam from any of them. I think the vast majority of spammers use non-javascript tools and that recaptcha is unnecessary for most low-profile sites if they'd just use javascript to POST the data instead.

I used 5 randomly chosen English questions with easy answers my readers would know — this prevented literally 100% of the unwanted spam, while it was easy on the users.

If you want to filter comments, you could even make the questions reflect the content of the article, filtering out uninformed TL;DR-type comments and giving users the feeling that you value informed opinions.


I once used a form with a random math question: 7+2 = __

It prevented 100% of bot spam for years. Granted, I was never a big enough target to make anyone rewrite their bot, but that's the same for most of us. I'd never use a captcha so long as something trivial like that works 100%.
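A sketch of such a random math challenge; this assumes the expected answer is kept server-side (e.g. in the session) while only the question text is rendered into the form:

```python
import random

def make_challenge():
    """Generate a trivial arithmetic question and its expected answer."""
    a, b = random.randint(1, 9), random.randint(1, 9)
    return f"{a}+{b} = __", a + b

def check(expected: int, submitted: str) -> bool:
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        return False  # non-numeric junk from a bot

question, expected = make_challenge()
assert check(expected, str(expected))
assert not check(expected, "banana")
```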


This (or something very similar) is a core feature of phpBB:

https://www.phpbb.com/support/docs/en/3.2/kb/article/how-to-...

It works very well.


I have next to zero experience in this, but completely agree. Literally 'my first website', which I made when I was a kid, had a very basic guestbook that I put together in PHP (actually it may even have been Perl) and MySQL, and it got hit with random spam. Everything was completely custom code I had made, not some package with a common vulnerability - so there was, and must still be, stuff crawling the web looking for any kind of form to push junk into. In this case it was clearly trying to do some spam 'SEO'/keyword stuffing, because the page would display the posted text (I was smart enough to filter out HTML though)...

I must lack imagination! I cannot figure out what the ROI is if one codes stuff to randomly submit crap to forms.

A forum I post on has zero restrictions on sign up. No recaptchas, nothing. Couple of times a month we see some (usually Cyrillic) bot post some thread with a link that (probably) no one clicks and we all have a laugh and tell it to post more (it never does).

It's a small site, footfall in the 10s, so maybe that's the reason.


We've blocked Cyrillic characters on a contact form, and we went from hundreds of spam messages every day to less than one a month.

Most websites probably don't. If you're one of those people, congratulations! Stick a honeypot input into your form and call it a day.

However, if you're working on anything with a non-trivial amount of traffic, you'll get hit with some customized spam.

I've been dealing with these spammers, and if you do nothing, your forum will be filled with Korean ads. We implement Akismet, StopForumSpam, Project Honeypot, and ReCAPTCHA, with the latter being the most effective (sadly). I'm pretty sure some of these spam agencies have customized tooling to handle NodeBB (they're using websockets to submit the posts, instead of HTTP POST).

Outside of these strategies, the most effective by far is reputation restrictions. Post queues if you're new or don't have enough upvotes, etc. However it does require manual effort, of course.

Would definitely love an alternative.


> We implement Akismet, StopForumSpam, Project Honeypot, and ReCAPTCHA

Did you try some of the techniques from the article? Like hidden form fields, simple JavaScript checks, or a simple captcha?


I used to run a forum hosting service. After a while you try everything. Honeypot fields, incorrect field names, etc. You will notice things like CSRF tokens generated by one IP and used by another. When stuff fails, they send 100 real people and record what worked, and it fixes their script. It’s all pretty automated. IP reputation can be helpful for some players, but most snowshoe.

Spam attempts would grow exponentially. So every time we cut it down by 90% via one of these tricks, it only gave us a bit of time.

None of this stops the determined troll though.. they can have all day to manually add offensive content. Shadow banning was good for this (1999) and group shadow banning was the best (bifurcated forum posts so all the banned people saw each other, but no one else did). Ah, memories. So good.


> all the banned people saw each other

What did happen with that strategy?


In my experience, the biggest issue I run into is targeted botnet brute force attacks.

In cases like these, someone loads up a huge botnet, a downloaded list of hacked usernames and passwords, and tries every single combination hoping to find a reused username/password combination.

In these cases, it is almost always extremely targeted. Log correlation has helped quite a bit, but it is still very painful since they alternate IPs with every request.

Automatically adding a blanket ReCAPTCHA on all login pages during a distributed brute force attempt is one of the few things that actually stops an attacker like this with minimal negative consequences.

I'm sure it is frustrating to users, but I think service disruption from what is effectively a DDoS is a worse user experience.


It doesn't have to be botnets; proxy lists are enough to enable that attack for almost anyone. Porn sites have had that problem for decades, and their solution was captchas, which led to OCR on the attacker's side, which in turn resulted in stuff like ReCAPTCHA. Since you are being attacked by one guy making a focused effort, the other solutions mentioned just cause him more effort in the initial setup without actually stopping him. It sucks, but it still beats the alternatives.

I've never dealt with anything seriously distributed. Are they running full browsers or simple scripts? Do they execute JS, do they load images? Do they perform other actions on the site, or will they show up with their initial request be a login request? Do they simulate keyboard and mouse events?

Curious about this too

and you can easily count the number of failed attempts from a particular IP, and just show a captcha for those over X failures, rather than on every login. Normal users don't fail _that_ many times, and so are none the wiser.

As the commenter said, they rotate IPs. It is not that easy. I've also been on the other side of a sophisticated attack like this. The really savvy adversaries do the following, at least:

1. Rotate through several thousand to several hundred thousand noncontiguous, geographically distributed, residential IP addresses,

2. Associate each IP address with a single user agent and suite of cookies,

3. Associate each IP address with a particular target username,

4. Only attempt a few incorrect logins at a time, and a somewhat random (albeit realistic) number at that, within a given time interval,

5. Use random, apparently human delays between successive requests,

6. Issue requests using extremely high fidelity simulacra of web browsers, customized to the sequence and structure of HTTP requests on the website.

When the stakes are high this is the kind of opposition you'll get. Bank account takeover, social media account takeover, ticket scalping, automated sneaker buying, financial research, market research, etc.

Recaptcha introduces unpleasant user friction, but it usually works well. To invert a popular turn of phrase, it makes stopping simple attackers easy and hard attackers possible. The most sophisticated attackers will still lease reputable Google accounts and mechanical turk time to bypass Recaptcha challenges, but it will be expensive for them.

Technical sophistication is only one dimension of this game. The other is making adversaries spend more money than they can gain from being successful.


> automated sneaker buying

???

Please ELI5. I mean, why are sales bad, even if automated? Are they using stolen cards?


Most likely related to high-demand, limited run sneaker "drops", which people then resell on the secondary market. Sneaker-scalpers, if you will. It's a problem because it prevents legit buyers from getting in on the sale.

Why don't they just sell more of them?

Or as ALittleLight says, auction them?

Edit: OK, I know, limited editions. Like numbered and signed prints. But it's arguable that people who want them the most will get them. Even if it's just for resale. Doesn't seem like the seller's responsibility.


> Doesn't seem like the seller's responsibility.

Sounds to me like the seller doesn't want the scalper selling outside the official channels, imho. It might dilute the brand as well.


Exclusivity agreements are anti-competitive tbqh

The real problem is supply. Popular tickets are scalped because there's only so many tickets. Then unpopular tickets are scalped because it was so easy to scalp the popular ones.

There's only so many sneakers that can be made: making more chews up the supply chain for something which isn't _truly_ being consumed.


Why do "official channels" matter? They're just sneakers.

And why brand dilution? Scalpers sell at a premium, not discounted.


Seems like a market problem. Why not auction the sneakers?

To be clear, I don’t have a horse in this particular race. I’m neither condemning nor condoning the market dynamics of sneaker arbitrage here. It’s just an example I’m very familiar with because I used to write scrapers and I’ve been offered silly amounts of money to make them for sneaker trading groups. Not as much as hedge funds will pay for writing crawlers for market research, but still more than you’d probably expect just so they can flip Supreme shirts and Yeezys faster than competitors. It’s ridiculous, but this is the world we live in. The point is simply that when the stakes are high (particularly when there is money to be made), stopping adversaries will be really, really difficult.

Back to the point at hand, I don’t like recaptcha in principle. But given my view from both sides of the table, it’s one of very few things that consistently works for sophisticated adversaries. It’s about as close to a silver bullet as they come, with the additional upside that it’s the absolute easiest thing to implement - in both an absolute sense and relative to the return. And once you have, most of what you can implement beyond recaptcha has diminishing returns in comparison.

All of that being said, I would be inclined to agree that most websites and apps don’t need recaptcha, simply because most of them aren’t worthwhile targets for the types of attacks recaptcha is singularly effective against.


Hedge fund paying for scrapers, is that a common thing these days? What kind of prices do they pay if I may ask?

Extremely common. Most hedge funds buy what's called "alternative data" from vendors who aggregate it, like 7Park. The data is collected by providers who collect it from location telemetry, web scraping, satellite imagery, etc. Scraping from web applications is one of the more common forms.

The more successful quant funds will often build out internal research teams to do this. For example, both Two Sigma and Millennium have (not so well advertised) research teams devoted to this kind of data collection internally.


The rapper Post Malone released a custom collaboration with Crocs and the shoe sold out in 24 hours, reselling on eBay for several hundred dollars.

https://www.digitalmusicnews.com/2018/11/02/post-malone-croc...


> Rotate through several thousand to several hundred thousand noncontiguous, geographically distributed, residential IP addresses

How are they getting residential IP addresses, compromised PCs?



How the hell do they get people to open up their home computer to be used this way?

They encourage developers to include their SDK:

> Monetize your mobile app or game with our SDK, without showing intrusive ads or requiring annoying subscriptions and in app purchases.

https://luminati.io/sdk

They approached nmap of all people:

> Hi,

> My name is Lior and I'd like to offer you a new way to make money off your software. The Luminati SDK provides your users the option to use your software for free by contributing to the Luminati proxy network.

> We will pay you $3,000 USD a month for every 100K daily active users.

> No collection of users' data, no disruption of user experience.

> I'd like to schedule a 15 minute call to let you know how we can start. Are you available tomorrow at 12:30pm your local time?

> Best regards,

> Lior

https://seclists.org/nmap-dev/2018/q1/27


$3000 USD for 100k users? If you can get five pennies from each you're better off than with this crap.

The equation changes if you actually care about your users' experience. Or if you don't and include both Luminati and ads.

It's pretty hard to get five pennies from each of your users

Users of free mobile apps (mostly games) are offered the option of allowing use of their devices as proxies as an alternative to being interrupted by ads.

https://luminati.io/faq


Excellent, a new botnet to exploit. ;)

They pay developers to include the Luminati "SDK" (their euphemism for malware) in popular apps.

Sounds like the answer is to increase the response size for failed login requests. At $12.50/GB, if you blow up your response to a megabyte, they'll spend about a cent per try - close to the rate they'd need to pay to have recaptchas solved by humans.

Most criminals are not using services like Luminati - they are using actual botnets made up of compromised computers. In that case, their bandwidth costs are far cheaper than yours.

Then can't they run-out-of-money DDoS you fairly easily? Since you'd pay for the outgoing bandwidth and at Google Cloud and AWS that's expensive.

I don't know how expensive it is with Google/AWS, but I'm paying about $1.50 per TB at my non-cloud-host (vs their $12500/TB), so if it comes down to it, they need to outspend me by multiple magnitudes. Sucks, but still cheaper than losing customers due to hyper-annoying Recaptchas, and I doubt that somebody is willing to stomach $12500 cost to make me suffer $1.50 ... I'm sure there would be more efficient attacks ;)

Other commenters have basically answered already, but to be clear Luminati is not the only provider, just the most infamous. It’s very easy to find others of greater or lesser reliability. Search “residential IPs proxy” and you’ll find many vendors.

But yes, the whole cottage industry is sketchy. Almost all providers are leasing users’ computer with outright malware or shady TOS. The savvy play is to release a free game, app or even SDK which will then opportunistically route requests from the control server through the user’s device.

Recaptcha solving APIs are frequently bundled with the more reliable and premium services of this kind. They introduce a lot of latency since there’s a real mechanical turk across the world solving it for you, but they basically work.


>The most sophisticated attackers will still lease reputable Google accounts and mechanical turk time to bypass Recaptcha challenges, but it will be expensive for them.

You can also just pay people to solve recaptchas all day.


That's what mechanical turk is.

Thanks a lot for the correction. I blame the lack of coffee...

However, that still contributes to security. If you make the expense of bypassing it greater than the worth of accessing the site, then you have stopped them.

Sounds like big operations like that should have been putting people into jail

If only.

You need to update your intel to 2019, where botnets cost peanuts and I can trivially rotate a tiny botnet of 100k IP addresses against your "solution", each IP address making a couple of attempts. Modern script kiddie software has had that built in for over a decade.

Never had your /login form attacked with {uname,pass} combos? This is exactly what the traffic looks like.


ReCAPTCHA has crossed into the domain of cattle-corralling users and thus should be considered harmful. If the system decides it doesn't like you (most likely because you're "too anonymous," but you don't really know) you will be presented with slower-loading images to click and more click-all-the-things rounds. To pretend this is about slowing down bots is disingenuous at best. On top of that, usage of ReCAPTCHA perpetuates the very problems I just described.

Why isn't there a solid alternative offering yet?


There should be an open source captcha solution where all the labeled images can be used to develop a model available freely to the public.

Y'know, this would be a great project for Mozilla, if they have the resources for it. They're already doing that crowdsourced voice training data thing.

I bet CloudFlare has the bandwidth for it.

And yet they use reCaptcha. I hit it all the time accessing sites when I was in France.

They also have great IP address reputation data

As if IP addresses are not shared. IPv6 isn't widespread enough.

and you don’t think that will be quickly used to make a tool to break the captchas?

No. You cycle through new types of challenges that the model can't solve yet.

For example, one week of stop-sign recognition, another week of pedestrian labeling, etc.


Building those models sounds like a lot of continuous and boring work, and is probably best done by a company. Unless you find some crowd-sourced way of building those models.

Exactly, see the other comment in response to mine. If Mozilla can do this for speech recognition, then we can do this for image recognition/transcription tasks.

>Why isn't there a solid alternative offering yet?

The latest version of recaptcha doesn't even prompt users. It loads on the front-end and uses a scoring system. It's likely you've used it but didn't even know because it's invisible.

It's the older implementations that have the slow loading images.


On Google Chrome, with adblock on, without my Google account signed in, in a new incognito tab with no extensions, I have the experience of it being invisible. When I go back to the same site on Firefox, logged onto my Google account, no adblock on, no privacy options on, I have to identify dozens of photos.

As far as I can tell, it just checks to see if your browser is Google Chrome to give you your score.


Again, that's still not the latest version of recaptcha, which never prompts users for photos.

If you use ad blocking and privacy extensions or live in a "suspicious country", it's still slow loading and multiple pages of images.

Not really. That's only if the developers code it so that a low score triggers the image captcha (Recaptcha v2).

Recaptcha v3 by itself NEVER shows anything to the user, and it's entirely up to the application how to deal with users who are likely to be bots (which also includes users with anti-fingerprinting measures).


Kicking """questionable""" users off the site entirely is not a "solid alternative".

This is just splitting recaptcha into two pieces and giving you the first half. Okay, fine, but it's the second half that was causing all the problems!


Because when you don't annoy 'too anonymous' users with obnoxious captchas, your service gets flooded with spam, which annoys every single other user.

> To pretend this is about slowing down bots is disingenuous as best.

I'm not sure you have a good understanding of what happens to internet services when they don't throttle spam. They become completely unusable.


ReCAPTCHA doesn't prevent the kind of spam that makes internet services completely unusable (i.e. DDoS botnets); it prevents form submission spam.

To be fair, spamming a user-submitted content site (for example, a blog with comments) is just as bad as a DDoS; either one makes the site unusable.

Most forum or blog comment spam I've seen was trivial to detect, even without access to the server. Isn't written in the right language? Very likely spam. Contains a link and little text? Very likely spam.

If you have access to the server's information, it gets even better. Origin makes it much easier to identify likely spam, previous interactions with the site and their speed ("hits the page and 1s later submits a comment") provide more info.

Sure, all of that can be worked around, but that makes it more complicated and increases the cost for the attacker. If they spend money on faking actual user interaction with your blog, routing all requests through a residential IP in your country, and so on, they are likely spending as much or more than they would on a recaptcha solving service.


I'd much rather have a service where I can submit a comment to check if it's meaningful or likely spam, than to force my users to waste time and share data with Google that they don't want to.

If clearly spam, block it, if clearly okay, allow it. If unsure, leave it for a moderator. It might even train users to write better comments if badly written ones need to wait for moderation.


As if that's not a complete privacy nightmare. I'm sure either the NSA, or an ad-tech agency would love to get a firehose feed of this sort of data, while providing you with a YES-SPAM/NO-SPAM service.

Also, anyone who mentions <something the service wants to censor> can be, of course, blackholed as spam. ;)


As long as it's a public forum, I don't see the issue. It's going to be public anyway, and they could easily scrape it if they want to.

The submission itself should be done by the site, so the user remains anonymous.


There's a lot of information that can improve spam classification, that is not publicly queryable.

Any spam detection engine that asks for that information will outperform ones that don't.


I think we have to take a step back, and consider why we want to separate humans from computers in the first place.

Humans can do a lot of bad things that computers can do. Think of armies of low-wage people in Asia, that are paid to click on ads, spread spam, or write reviews.

And also consider that computers can actually do good things, for example, allowing humans to automate their work on certain websites, or providing better accessibility for certain users.

Therefore, instead of introducing CAPTCHAs, why not focus on the actual threats. If you want to protect against spam, then build a spam filter. If you want to prevent bots from bulk-downloading your data, then build a rate-limiter, etc.


But that's something captchas are used for. Prevent fake signups.

It doesn't do that, though; humans also create fake accounts. It does make mass creation of fake accounts impractical.

I solved that in the past by actually charging for my service. I think the internet would benefit from having more paid content and less ads-driven stuff.

One thing that captchas do protect from is brute force attacks on user passwords. Although there are other possibilities (like making the connection slow after a number of attempts).


This is why you need a multi-layered solution to fight spam. Captchas reduce the amount of fake accounts, which can then be taken care of by the additional layers.

It's easier to fight against a bigger opponent (botnets, etc) if you mitigate their superiority in number first.


It does do that. It literally does. It removes the cheapest method of creating fake accounts, thereby reducing the problem hugely.

Literally none of those alternative methods listed worked on my moderate traffic wiki. Recaptcha (and before it went away, identify the dogs or cats from Microsoft) is literally the only solution that stopped us from getting spammed. I wonder how much experience the author of this article really has in this domain.

Recaptcha has saved the internet as far as I'm concerned.


I'm almost sure that the traditional (ha!) wiki model where everyone is free to edit was never meant to function on the modern internet anyway. A verified and/or interested user model seems to work, though (the OP mentions this possibility under the heading "Community-specific questions"; too bad the second example image is too ambiguous even for anime fans). I've seen it be quite effective on, for example, the Esolang wiki [1].

[1] https://esolangs.org/wiki/Special:CreateAccount "Which number does this Befunge code output: [...]"


For my small wiki, refusing all submissions with external links eliminated virtually all spam. Yes, it's drastic, but in my case external links were not essential. I still use ReCaptcha to cut down on spam account signups.

We do that as well, outside of a small whitelist of allowed external links. Turns out it's not enough and we still need recaptcha for a few reasons.

We have a small set of anonymous edits every day, which go through recaptcha to be allowed.


I fought an interesting implementation of customized spam on a web app registration form a few months back. Suddenly, every 2-3 seconds, we would get a sign up from a random email address @qq.com (it really clogged our sign up Slack channel). I didn’t want to go full CAPTCHA so I dropped in a simple honeypot and the spam stopped for a good 4-5 hours. Then it picked right back up like normal. I then implemented a randomized honeypot, e.g. <input name=“lastname478482”> and again, it ceased immediately and picked up a few hours later. Finally I just blocked all submissions with emails @qq.com and it stopped completely and hasn’t returned months later.

Sometimes I wonder if there was a real person on the other end writing code to combat the code I was writing at the same time, and finally gave up on the 3rd iteration.
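The randomized honeypot described above might be sketched like this in Python (the HMAC derivation and the `lastname` field-name scheme are my assumptions for illustration, not the commenter's actual code):

```python
import hashlib
import hmac

SECRET = b"server-side secret"  # assumption: stored securely, rotated periodically

def honeypot_field_name(session_id: str) -> str:
    """Derive a per-session honeypot field name, e.g. 'lastname478482'."""
    digest = hmac.new(SECRET, session_id.encode(), hashlib.sha256).hexdigest()
    return f"lastname{int(digest[:6], 16) % 1_000_000}"

def is_spam(form: dict, session_id: str) -> bool:
    """Reject any submission that filled the hidden honeypot field."""
    trap = honeypot_field_name(session_id)
    return bool(form.get(trap, "").strip())
```

Deriving the name from the session keeps it stable across a page reload for a real user, while a bot replaying a previously scraped form posts to a field name that no longer matches.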


There is someone else on the other end. I had one of them previously get on the forums I worked on and complain about how quick I was to block them. I wrote an entire system for learning the patterns and automatically blocking them as they occurred. Eventually I determined that the vast majority of the IP addresses originated from Bangladesh, so I just banned the entire country from accessing the web sites. That was the only solution, unfortunately for the legitimate users there, that made the spam stop for good.

I have to ask, why do people do this? What do they get out of it?

I assume SEO targeting and associating phrases with various products. For example, people on Reddit will name and shame by dropping people's names accused of crimes next to the thing they are accused of in a complete sentence. This is an attempt to promote the relevance of those two facts in Google Search to each other.

Doesn't the nofollow attribute [1] work to remove that incentive? IIRC that was the intended purpose of nofollow?

[1] https://en.wikipedia.org/wiki/Nofollow


The automated spam tends to be for financial gain somehow, either by spamming links hoping someone will click them or for SEO. The non-automated spam is usually people who find enjoyment in being annoying.

I recommend reading the book "Spam Nation" by Brian Krebs(sp?). He goes into the different motives behind spam, the big actors, and how people get paid for creating spam.

I have the same experience: exclusively Bangladeshi and Indian IPs signing up to enter SEO spam for small businesses in the US (insurance, hair salons, lawyers, house renovation, etc.). Stopforumspam has most of them in their lists.

"ReCAPTCHA relies extensively on user fingerprinting, putting emphasis on the question of "Which human is this user?" rather than the ordinary "Is this user human?". "

Classic example of collecting more information than what is needed.


Depends. The traditional techniques that automatically establish humanity without determining identity are more and more vulnerable to AI, so the only way to keep CAPTCHAs effective is to integrate identity. From that perspective, ReCAPTCHA isn’t collecting more than “needed”. On the other hand, the cutting-edge cryptographic technique used by Privacy Pass does supposedly preserve anonymity by making it impossible for CloudFlare to link “who solved the CAPTCHA” to “who wants to access X site”, but it still involves information being collected in some form.

> The traditional techniques that automatically establish humanity without determining identity are more and more vulnerable to AI, so the only way to keep CAPTCHAs effective is to integrate identity.

Is it? What's stopping AI from developing an identity in the eyes of Google? An AI that might behave exactly like Google's ideal user. Searches random stuff on their search, looks at and clicks their ads, logs in to various Google services, and when it encounters a ReCaptcha, it clicks the "I'm not a robot" checkbox.

At some point, caring about privacy might turn out to be the distinguishing feature of humans.


The blog post suggests many websites are using ReCAPTCHA when it is not truly needed. In those cases, given the focus of ReCAPTCHA on identity, this is a case of collecting more information than is needed. According to the author, ReCAPTCHA is not needed to defeat "uncustomised spam". To put it another way, while there may be nothing wrong with ReCAPTCHA itself (e.g. the way it is designed), there could be something wrong with the way it is being used, perhaps on a massive scale.

Combined with https://dmv.ca.gov (which logs into google, and uses recaptcha extensively) it becomes a serious question.

I've noticed a number of major cryptocurrency exchanges using this slide puzzle as a form of captcha [1]. I don't exactly know how it works, but I assume the jittery nature of a human sliding a mouse is enough to discern the bots from the nots. Are there any major downsides to this form of captcha that I may be missing? [1] https://www.geetest.com/en/

EDIT: I now see the article does actually mention this, though I still do wonder how far the fingerprinting goes.


Requiring users to solve ReCaptcha after the fact may be more sensible than the usual use case, where ReCaptcha is required before any interaction.

Let the user create an account, let the user create a first comment/post, save it somewhere but hide it, then require ReCaptcha to make the post visible and activate the account.

The issue with requiring ReCaptcha first is that the website owner will never know whether a legitimate user was turned away or whether it was spam.


And how does the timing change things? You want to look at first comments/posts to determine whether it was spam or not?

After the post is saved (but hidden until the user solves ReCaptcha), you can look at the database and analyse what's happening. Did ReCaptcha turn away an interesting post by a legit user? What is the ratio of spam versus legit users? Etc.

If you turn away both spammers and annoyed users with ReCaptcha before any interaction, you will never know how many were legit, and you won't be able to manually accept interesting legitimate content written by a user who is annoyed by ReCaptcha.


I had this same thought in 2013 and created this library which uses a few simple tests to protect from comment spam. It definitely could be improved/tuned (and oof this code I wrote is baaaad), but for many sites it is good enough. https://github.com/mccarthy/phpFormProtect

I used to use a script to combat uncustomized spam. A hidden input would get populated with my birthday in hex and then checked on the server. Every JS based client would be able to use the forms and we went back to math captcha for noscript.

Literally thousands of spam comments per day would stop instantly on deployment. We put this on over 400 sites and never had an issue of customized spam.

Definitely agree with recaptcha not being necessary. However, it's probably needed for popular sites which you're more likely to use. Do we see more recaptcha because of that?


Here is an idea I thought of for a captcha. Render your webpage and form and include a hidden "password" field. Use a JavaScript hashing algorithm to hash the password in the client browser (preferably a very slow one that uses a lot of CPU). When the form is submitted, compare the hash the client calculated with a pre-calculated hash on the server. If they don't match, reject the form. You can pre-generate a list of password/hash combinations to avoid slowing down your servers. Could it work?
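A minimal sketch of this idea in Python (the real client-side half would run in JS; PBKDF2 stands in for "a very slow hash", and the iteration count and salt are tunable assumptions):

```python
import hashlib
import secrets

ITERATIONS = 200_000   # tune so clients spend a noticeable amount of CPU
SALT = b"pow-captcha"  # assumption: fixed and published alongside the page

def make_challenge():
    """Server: pre-generate a (challenge, expected_hash) pair ahead of time."""
    challenge = secrets.token_hex(16)
    expected = hashlib.pbkdf2_hmac("sha256", challenge.encode(), SALT, ITERATIONS)
    return challenge, expected.hex()

def solve_challenge(challenge: str) -> str:
    """Client: burn CPU computing the same slow hash (in reality, in JS)."""
    return hashlib.pbkdf2_hmac("sha256", challenge.encode(), SALT, ITERATIONS).hex()

def verify(submitted_hash: str, expected_hash: str) -> bool:
    """Server: constant-time comparison against the pre-computed hash."""
    return secrets.compare_digest(submitted_hash, expected_hash)
```

As the replies point out, this only raises the attacker's CPU cost; it can't stop a determined one, since the hashing code is necessarily public.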

It would certainly help in some cases, and act as a delay in some others. It pretty much wouldn't stop a targeted attack, since attackers quite probably don't care how much CPU each request uses, and you literally have to publish the code that generates the hash, which makes it easy to reverse engineer.

It's not reverse-engineerable; it's basically "proof of work". All participants know the algorithm; you just make attackers spend CPU time to use your site, which slows them down or costs money.

It could work... If you want to set minimal system requirements to visit your website.

It will also annoy users of password managers with auto-filling capabilities. "password" is normally used for actual passwords.

Besides, nothing stops the attacker from replacing your code with a faster implementation.


There are password hashing algorithms out there (like bcrypt) that specifically take a long time to compute using the fastest method that we can think of.

I shouldn't have named it "password". The idea was all the form fields are hidden and the process is transparent to the user.

This sounds similar to Hashcash, the proof-of-work email spam prevention scheme.

https://en.m.wikipedia.org/wiki/Hashcash


I've developed an alternative to captcha for blog comment/forum spam a while ago.

There are many methods that are easily dismissed with "but not all spam is like that" and "someone could work around it easily":

• blocking of links (if you don't need them, or in fields that are not for them).

• blocking of obviously spammy keywords (or bayesian filter)

• invisible fields and syntax to trip up dumb implementations

• requiring JS, properly-functioning cookies

• blocking of IP ranges that belong to VPS providers and a couple of 3rd world telecoms that allow spam

Each of them is surprisingly effective and combined they block 99.8%.

You really shouldn't flatter yourself thinking that a spammer will even look at your page. They have literally millions of sites to spam, and they couldn't care less if they get yours or not. A bot will find a 1000 other sites to spam quicker than it takes a human to click "View Source".

Most spam is done by amateurs who take shitty off-the-shelf spam software, seed it with a target list copied off some forum, and run it on a couple of spam-friendly or incompetent VPS hosts. By volume, this is the vast majority and it's very easy to block.


> The comment form of my blog is protected by what I refer to as “naive captcha”, where the captcha term is the same every single time. This has to be the most ineffective captcha of all time, and yet it stops 99.9% of comment spam.

This is what we did on one of our sites. 5 minutes to implement it using a few lines of code. Same result, couldn't be happier.


“It’s worth noting how much easier it is to successfully solve ReCAPTCHAs when the user is logged into their Google account”. Well, to me it makes absolute sense, as Google knows that the logged-in user is a human. This article is just following the current "all Google is bad" trend.

Right, someone can't register a google account and make their spambots bypass CAPTCHAs with it. Just knowing that a human was present once doesn't mean they are always human. Google probably lets you pass easily once or twice in a certain amount of time but keeps track and starts giving you more difficult solves as time goes on. But then, someone might register a thousand google accounts and rotate through them to allow cool off time, so I'm not sure how this is handled. It is not a given that rewarding people with a google account is a good thing.

However, when the california DMV website will log you into google just by browsing and integrates recaptcha to do online transactions such as making an appointment, it gets real.

Microsoft’s implementation is the worst. I sometimes have a hard time deciphering the captcha. Why do they need that in an iOS app? Are robots emulating people from an iPhone?

The iPhone app probably communicates to some servers over an API of some kind - there's no reason someone malicious couldn't pretend to be an iphone and communicate over the same API

and no reason someone malicious couldn't click-farm iPhones either.

https://gizmodo.com/thai-click-fraud-farm-busted-using-wall-...


Mechanical Turk style processing never ceases to amaze me.

Actually, now that I think about it, Microsoft apps could only require a captcha if the username trying to log in doesn’t match the user’s previous iCloud user token.

I’m thinking about how Overcast uses a token linked to the user’s iCloud account and doesn’t require a username and password if you only use iOS devices. You can optionally add a username and password to access the web client.


> Are robots emulating people from an iPhone?

Many of the more sophisticated ones prefer emulating mobile application requests to web requests, so yes.


User agent is quickly changed.

Doesn't iPhone have some sort of device attestation?

Can any iOS developers chime in? I know there has to be some type of server side validation to validate previous in app purchases. Could something similar tie a logged in iCloud user to an account?

In the browser?

The parent post was about using captcha in the app.

> Uncustomized spambots are also so unintelligent that they do not correctly answer simple questions such as “What is 2+3?”, or “what color is this website?”.

I found that some uncustomized spambots are able to solve simple math challenges like this one.

And the color question is bad for accessibility.

Many of the suggested alternatives are also terrible for accessibility, because they require solving a visual puzzle, without an audio alternative.


> And the color question is bad for accessibility.

That's a valid concern, but not a reason to use ReCaptcha, because ReCaptcha is worse for accessibility.


Websites should use ReCAPTCHA, but set a cookie that includes sufficient (pseudo) random numbers. Keep a copy on your server.

As long as a browser provides a cookie that is present on your server, there is no need for another ReCAPTCHA.

If the user misbehaves, e.g. too many wrong guesses of password(s), remove the cookie from your server.

You can even email the user the cookie as a url e.g. example.com/?cookie=12345678901234567
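A minimal sketch of this trust-cookie scheme (the in-memory dict stands in for a persistent server-side store, and the 30-day TTL is an assumption):

```python
import secrets
import time

# assumption: in production this lives in a persistent store, not a dict
trusted_tokens: dict[str, float] = {}

TTL = 30 * 24 * 3600  # trust a solved ReCAPTCHA for 30 days

def issue_trust_cookie() -> str:
    """Call once the user passes ReCAPTCHA; set the result as a cookie."""
    token = secrets.token_urlsafe(32)
    trusted_tokens[token] = time.time() + TTL
    return token

def is_trusted(token: str) -> bool:
    """Skip ReCAPTCHA for browsers presenting a live, server-known token."""
    expiry = trusted_tokens.get(token)
    return expiry is not None and expiry > time.time()

def revoke(token: str) -> None:
    """E.g. after too many wrong password guesses, as described above."""
    trusted_tokens.pop(token, None)
```

Keeping the copy server-side is what makes revocation work: deleting the record invalidates the cookie no matter what the browser still holds.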


We did everything to avoid using CAPTCHA because spammers on our platform could defeat CAPTCHA anyway. Oftentimes, they're real, actual humans.

Hidden inputs, honeypots, etc. None of them worked long-term.

Our solution: focus on the content.

All content now runs through the following filter, which takes care of 99.9% of spam:

1) Auto-approve any and all content immediately, unless:

a) The content contains any HTML (URLs included). If it does, send it to a moderation queue.

a1) If the submitter has previously submitted content that was approved, auto-approve their content, even if it contains HTML.
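That flow might be sketched like this (the regex and the set of previously approved submitters are illustrative assumptions):

```python
import re

# assumption: persisted set of user IDs with at least one approved post
approved_submitters: set[str] = set()

def route_content(user_id: str, content: str) -> str:
    """Return 'approve' or 'moderate' following the flow described above."""
    has_html_or_url = bool(re.search(r"<[a-zA-Z][^>]*>|https?://", content))
    if not has_html_or_url:
        return "approve"
    # Content contains HTML or a link: trust repeat submitters, queue the rest
    if user_id in approved_submitters:
        return "approve"
    return "moderate"

def on_moderator_approved(user_id: str) -> None:
    """Once a moderator approves a post, future HTML from this user sails through."""
    approved_submitters.add(user_id)
```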


One thing you can do to slow down bots is make it computationally expensive to sign up. Utilize a JavaScript-based proof of work library. Normal users can breeze right through, but somebody trying to create thousands of accounts will find their operation grinding to a halt.

Coin miners would be an option; however, ad blockers would block them and antiviruses consider them malware.

I understand when Recaptcha gets used before registration forms, but why, oh why, does Discord do it before any login?

If I know my email address and password I should be able to login without being logged in to Google as well ...


On the plus side you've likely become fairly adept at identifying street signs, light posts, bicycles, storefronts, bridges and buses in order to access Discord, especially if you were trying to access via Tor to gain a little privacy.

Can't speak for OP, but personally I simply stopped using Discord.

Ditto. Don't forget to close your account as it will not self-destruct.

To stop bots trying lists of username/password pairs.

Aren't hidden form elements a major issue for accessibility?

Admittedly, so is ReCaptcha, so the trade-off may be necessary as much as it sucks. But, it's probably at least worth a mention?


They are definitely an issue for accessibility. I make sure to put "Hey! Don't put anything in this field!" as a placeholder.

That sounds like something bots could easily adapt to if the practice become widespread.

It's already widespread. But, it requires customization to overcome for many spammers. Depending on the popularity of your site, your threat model may require you to do more. But it is still a useful tool.

Though bots may use it, it seems the perfect use case for aria-hidden.

Biggest pain with ReCAPTCHA is since I've switched to Brave. Sometimes it's 4 to 5 ReCAPTCHA forms coming up until the system is satisfied I'm a user.

I wish Hacker News didn't use it. It's a PITA logging in! I don't see it when I'm in the U.S. but when I'm overseas...

From a French IP that usually gets flagged by everybody as potentially a robot, I see exactly 0 JS on the HN login form.

Are you sure it is HN that uses ReCAPTCHA?


I've seen multiple comments suggesting HN uses ReCAPTCHA, but I have never encountered it myself, and I even have Javascript disabled and login through 'anonymous' IPs such as tor, so I'm unsure what these users could be doing that is 'worse' to trigger ReCAPTCHAs.

If most users don't even know that ReCAPTCHA is used, that's a good sign that it is being used as little as possible, though.


I wonder if it's from CloudFlare and not HN directly.

It is from Cloudflare. They love reCaptcha. It’s ok to give free user data to Google if it means less work for them.

Same here. I only log in through Tor and I'd never have imagined HN uses ReCAPTCHA if it wasn't for people suggesting that.

Try creating a new account through Tor, there is a good chance you'll see it then.

I've seen it before. In fact, I'd double checked last time I brought it up and it was there on /login in an incognito tab.

Either HN has changed or they now conditionally load it. For example, I encountered it every time I used Tor on HN, though I haven't hit /login in months.


I only login through Tor and I never encounter it.

Perhaps it's Cloudflare, and not HN directly.

I have seen it sometimes.


I don't tend to have too many issues with ReCAPTCHA except when I'm travelling out of the country - then it makes my life really hellish. Given everything Google knows about my travel arrangements you would think they might be able to connect the dots.

In my experience, the vast majority of spam bots don’t run JavaScript, so simply setting a hidden input to a specific value on key down and checking for said value server side has prevented 99%.

Seems to me this would interfere with the accessibility of the website for those that use screen readers.


i had to give up after the 10 or 12 anti google paragraphs. after reading that, this website certainly doesn’t need recaptcha because i’m going away!

there were a couple of feints as if he were about to get to the content, then ha! back to diatribe.



