Hacker News | proszkinasenne2's comments

Imperva, whose core product is built around blocking bots, publishing statistics on how crowded platforms are with bots. ¯\_(ツ)_/¯


In Chrome/Chromium there is a WebRTC Network Limiter [1] extension that lets you set the "Use only my default public IP address" policy, which renders the method I presented ineffective.

[1] https://chrome.google.com/webstore/detail/webrtc-network-lim...


Google abandoned that extension in 2016, which is why the last option (for disable_non_proxied_udp) is greyed out.


It's both a security and a privacy issue. The Whonix wiki explains the latter in more detail: https://www.whonix.org/wiki/Data_Collection_Techniques#:~:te...


If you use Chrome-exclusive links, please at least also link to the closest standard section [0] and preferably, mention the Chrome-linked text directly.

That said, they don't say anything about security. I obviously forgot about fingerprinting, but I still don't see any security issues?

[0] https://www.whonix.org/wiki/Data_Collection_Techniques#Finge...


Sure, it can be! Also, as some people have already pointed out, this is often a gray area where people go beyond violating ToS. Some good examples are privacy violations (scraping personal data), credential stuffing, etc.

Recently, there has been a boom of "anti-bot" services. These are essentially SaaS businesses that "protect" websites from being scraped by automated software. As soon as you onboard the first customer who wants to extract data from a bot-protected website, you are going to run into an endless waterfall of stupid troubles. Your bots will be blocked, consume excessive amounts of data, and kill your CPU/GPU performance.

I recently shared some highlights on how to bypass these on HN [1], but it is sadly only the tip of the iceberg. On the other hand, since the post was featured on HN, I have been contacted by more than 50 companies and individuals whose business model is based solely on data extraction/automated scraping. These are (in my opinion) successful companies, and two of them are part of YC.

[1] https://news.ycombinator.com/item?id=29060272


Wasn't there a ruling that web scraping was legal now?


The LinkedIn case is still up in the air, I think - https://news.bloomberglaw.com/us-law-week/supreme-court-scra...


Thanks for this!


- Trust Token API to verify whether you are a bot or not
- Federated Learning of Cohorts (FLoC)

https://www.google.com/amp/s/blog.google/products/ads-commer...

There are also plenty of device fingerprinting techniques that Google (might) use.

https://github.com/niespodd/browser-fingerprinting


Anything based on Chromium is vulnerable to all specialised fingerprinting techniques, such as this one: https://niespodd.github.io/persistent-tracking-shader-cache/ and many others that I listed here: https://github.com/niespodd/browser-fingerprinting

Some parts of Chromium seem to be intentionally exposing fingerprinting surfaces and, because it's changing quickly with new features and add-ons, keeping up with patches like Bromite does is an incredibly challenging task.


I thought about it too, but when you consider the cost of running headless Puppeteer (let's say on AWS) and the cost of a good proxy that is charged per GB, it's often as expensive as (if not more expensive than) some of these SaaSes. This is especially the case for websites with heavyweight JS/CSS/image assets.
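To put rough numbers on that comparison, here is a minimal back-of-the-envelope sketch. Every figure in it (page count, page weight, proxy price per GB, instance price, throughput) is a made-up placeholder, not a quote from any real provider, and the function name is hypothetical; plug in your own numbers.

```python
def monthly_self_hosted_cost(pages, mb_per_page, proxy_usd_per_gb,
                             compute_usd_per_hour, pages_per_hour):
    """Rough monthly cost of running your own headless browsers
    behind a proxy that is billed per GB of traffic."""
    # Traffic through the proxy: heavy JS/CSS/image assets inflate this.
    traffic_gb = pages * mb_per_page / 1024
    proxy_cost = traffic_gb * proxy_usd_per_gb
    # Compute: hours of browser time needed to render all pages.
    compute_cost = pages / pages_per_hour * compute_usd_per_hour
    return proxy_cost + compute_cost


# All values below are illustrative assumptions only.
cost = monthly_self_hosted_cost(
    pages=100_000,             # pages scraped per month
    mb_per_page=4,             # average transfer per page load
    proxy_usd_per_gb=10.0,     # hypothetical proxy price
    compute_usd_per_hour=0.10, # hypothetical instance price
    pages_per_hour=500,        # hypothetical throughput of one instance
)
print(f"Estimated self-hosted cost: ${cost:,.2f} per month")
```

With placeholder inputs like these, the per-GB proxy bill dominates the compute bill by a wide margin, which is why asset-heavy sites are the worst case.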


That's true when it's a one-time job: pull the data and disappear. I also see how this is the case for most freelancers on Fiverr or Freelancer. This is the tool they know, so they use it. However, I imagine there are a number of companies that rely strongly on continuous data scraping, for example for price comparison, and I've seen one using Puppeteer heavily.


@jjgreen I am genuinely interested in what the existing solutions are and how people deal with the problem. This is why it's "Ask HN". If there are none and someone would be interested in using our tool, why create two topics?


The question reads like you only ask to promote your solution. It's better to split it into the genuine question and later a Show HN.


Agreed. I have changed the text, so that it is less confusing. Thank you for pointing this out!


@qiyuxuan96 Hi there. I am developing a cross-browser extension development/deployment SaaS. Would you be interested in hearing more?

We make life easier with things like:
- versioning,
- packaging extensions for different extension galleries,
- collecting payments,
- gathering analytics (extension views, installations, etc.).

Ping me at niespodd@gmx.ch if interested.

