Scrape like the big boys (incolumitas.com)
374 points by incolumitas 79 days ago | 189 comments

I used to lead Sys Eng for a FTSE 100 company. Our data was valuable but only for a short amount of time. We were constantly scraped, which cost us in hosting etc. We even saw competitors use our figures (good ones used them to offset their prices, bad ones just used them straight). As the article suggests, we couldn't block mobile operator IPs; some had over 100k customers behind them. Forcing the users to log in did little as the scrapers just created accounts. We had a few approaches that minimised the scraping:

Rate Limiting by login,

Limiting data to known workflows ...

But our most fruitful effort was when we removed limits and started giving "bad" data. By bad I mean alter the price up or down by a small percentage. This hit them in the pocket but, again, wasn't a silver bullet. If the customer made a transaction on the altered figure we informed them and took it at the correct price.

It's a cool problem to tackle but it is just an arms race.
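A minimal sketch of the altered-price idea described above, assuming a hypothetical `flagged` signal that marks suspected scraper accounts. The function names, the 2% bound and the hash-based offset are illustrative assumptions, not what the commenter's company actually ran.

```python
import hashlib
from decimal import Decimal, ROUND_HALF_UP


def perturbed_price(price: Decimal, account_id: str, sku: str,
                    max_pct: Decimal = Decimal("2")) -> Decimal:
    """Shift price by at most +/- max_pct percent, stable per (account, sku)."""
    digest = hashlib.sha256(f"{account_id}:{sku}".encode()).digest()
    # Map the first 4 bytes of the hash to a fraction in [-1, 1].
    raw = int.from_bytes(digest[:4], "big") / 0xFFFFFFFF
    fraction = Decimal(str(raw)) * 2 - 1
    shifted = price * (1 + fraction * max_pct / 100)
    return shifted.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def price_for(price: Decimal, account_id: str, sku: str, flagged: bool) -> Decimal:
    """Serve the real price to normal accounts, the altered one to flagged ones."""
    return perturbed_price(price, account_id, sku) if flagged else price
```

Deriving the offset from the account and SKU keeps the altered figure the same on every request, so a scraper cannot average it away by polling repeatedly. The checkout path would still need to confirm the correct price, as the comment describes.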

I know a guy at Nike that had to deal with a similar problem. As I recall, they basically gave in -- instead of trying to fight the scrapers, they built them an API so they'd quit trashing the performance of the retail site with all the scraping.

Yes. That's exactly what everyone should do.

Well, not EXACTLY. What everyone should do is just use WebSub/PuSH. No need to invent your own thing and hope that bots learn how to use it properly.

Agreed. What I mean is people need to stop fighting these pointless battles.

If data is your competitive advantage or product, then what? Accept that your market no longer exists and that there's no way to stop theft?

You're going to need to explain how scraping publicly available information on a website is theft.

If information is your competitive advantage maybe you shouldn't have it on a publicly accessible website, and should instead stick it behind an API with pay tiers and a very clear license regarding what you may do with it as an end user.

Note, a simple sign up being required to view a website makes it not publicly available information any longer and you can cover usage, again, in a license.

Then you have a whole bunch of legal avenues you can use to protect your work. Assuming you can afford it that is.

How practical is this really though? Like, imagine you're a newspaper. Unless you're the FT or Wall Street Journal or something like that, nobody is making an account to read an article. They'll just go somewhere else.

> You're going to need to explain how scraping publicly available information on a website is theft.

Seriously? Do I need to explain why a song doesn’t enter the public domain when it is played on the radio?

Do I need to explain that copyright is practically unenforceable in the 21st century? Data is trivially copied and there's nothing you can do to fight that, no amount of laws will ever make it non-trivial again. Even if you successfully sue somebody for this, it won't stop them.

At some point people are gonna have to accept this.

>"Do I need to explain that copyright is practically unenforceable in the 21st century?"

This sentence added nothing substantive to your comment, and made it rude; could you please be a bit more polite in the future?


For what it's worth, I didn't think it was rude. And I do think it was a substantive aspect of his response. It's a valid perspective even if I personally think it's somewhat orthogonal and incomplete.

I was replying to a similar sentence, but it is true that in the end it did nothing but escalate the situation. I apologize and yes, I will try to be more polite.

I was responding to a question of whether it was theft, not whether such theft is morally grey or unstoppable in practice.

Yours is a somewhat orthogonal point and one I don’t entirely disagree with.

Well if you want to stick to the hard facts then it's even simpler: copyright infringement is not theft - those things are covered by entirely separate laws.

What are copyright strikes, then?

Not even a thing in my country. Likely easy to avoid in others.

OP was talking about price lists. IANAL but AFAIK you can't copyright a list of prices.

That may be true. A list of prices might not be copyrightable in your particular jurisdiction. However I was only responding to the raw assertion made without any such qualification.

On the opposite end of the spectrum might be a photographer's website containing a gallery of their sample work. The fact that the gallery is openly published doesn't represent a relinquishing of copyright over those images.

No but those are substantively different situations such that this exact thing is being argued in the highest courts of the US. It's not quite the cut and clear case you seem to believe it to be.

I'm not American. I'm interested to read about these cases. Can you point me towards some relevant material, or at least cite the case name(s)?

Ah yup, certainly. So this is the one I was mainly thinking of -- https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

Interesting. From what I can gather, the material in HiQ Labs v. LinkedIn was not claimed to be under copyright and that the argument being fought over in court was with respect to mechanically subverting access to a competitor.

You appear to have claimed that rights over material are broadly relinquished if it's published in public:

> If information is your competitive advantage maybe you shouldn't have it on a publicly accessible website

And your distinction was further clarified when you argued that placing barriers to access fundamentally changes the equation:

> Note, a simple sign up being required to view a website makes it not publicly available information any longer and you can cover usage, again, in a license.

Perhaps you meant to speak only of material which is not subject to copyright? In which case I think your argument does track.

Mmm I’ll go with that. To be honest, I saw that LinkedIn had initially made a complaint under the DMCA as well (which HiQ then got an injunction for), and given how the case played out I was uncertain to what extent the case was signaling that you may be waiving certain rights by making some content publicly available with no gating like a sign-up.

No, but there is a legit philosophical argument about theft when it comes to copyright. There are two ways to look at theft: acquiring something you didn't earn vs. someone losing something they did earn. Generally, we tend to focus on the latter. From that perspective, "copying" is really not "theft", and arguably "copyright" does more net societal harm than any benefit it provides.

It is copyright information, no? So technically it is intellectual property theft if the scraping use is for commercial purposes.

Not all information falls under copyright.

If you build a database of touristic places and display in your website, the information is not protected by copyright.

In Europe they have laws covering _sui generis database rights_, but they are from another era and unenforceable nowadays.

No? If you place information publicly on a website it's pretty much fair game, no copyright violation, especially regarding user-generated information. That's my take, but legally it's a gray area and it's still going back and forth in the courts (at least in the US). For a while, before a decision was vacated by the Supreme Court, scraping publicly available information on a site was legally protected, which seemed in line with my thoughts on it.

If we are to live in a mutually prosperous society, how is the labor and therefore well-being of the content creator improved by a web scraper? Does this precedent not injure future opportunities for exercising one’s life to making website data available for others to scrape?

By my understanding any website with a copyright disclaimer warrants their data as exclusively their own and are granting permission for other web users to generate it, ie people are not entitled to share their web data with anyone. So if they are, and we agree that it’s good that they do, and continue to create information for others to know, how do we avoid the implicit harm in extracting data with nothing being given in return but possibly harming the internet’s experience for everyone accessing the same information?

I'm actually cool assuming there is implicit harm and no benefit, but by that logic we need to tear google down too. I'm cool making that trade but it has to be done equally.

If you can't make that trade then you've weighed the value provided by an organization like google to be more valuable than the copyright of these content creators and I want other players who may want to be able to challenge google to have the same protections and access google does to have a chance at providing the same value.

Well actually, isn’t Google improving the value of the content ergo property itself by improving its accessibility? I was inferring a one-way street with the accumulated data that can lead to server crashes - which I don’t believe Google’s web crawl does at all (in fact that would be counter-productive).

I'd say most crawlers are looking to provide enriched value for content at their end use. Google is just an aggregator (the biggest by far) but other aggregators are looking to provide similar value.

So then an aggregator is different than a scraping service with respect to the value given to the rest of humanity? In that, in principle one adds value to the content creation and the other deducts through potential harmful interference with its reciprocity?

I'd say many aggregators do offer value to the rest of humanity but I imagine there are probably some exceptions and also not all scraper services offer no value it's just different value to different people.

Some scraping services make their money by offering scraping services to companies for specific information and you could argue they provide value to other businesses that way, but not to the broader "rest of humanity".

So I'm not sure it's as simple as just "aggregator" good "scraping service" bad as value provided takes on many different forms, and that's what makes this difficult.

I guess it may come down to your take on what you think of middlemen, because they are all effectively middlemen in the data economy.

Edit: I was rereading your comment, in respect directly to the value added to the content, then yes maybe it is more clear that aggregators are in principle different because they do add that value where scraping services that sell the data do not offer any enrichment to the content creator. I personally think protecting content aggregators that republish the data to create visibility or other value for the content creator to the extent that they're not worried about being sued for that is probably a worthwhile thing to happen because of the net benefit to our ability to find information/content.

The real Jedi move

Especially if you charge for it, which would save them money, because they wouldn't have to redo their code every time you changed your website.

I think there's an opportunity for a new JS framework to have something like randomly generated dom that will always display the page and elements the same to a human but constantly break paths for computers.

Like displaying a table with semantic elements, then divs, then using an iframe with css grid and floating values over the top.

This almost seems like a problem for AI to solve.

Even if your DOM is obfuscated, the rendered page remains vulnerable to OCR. Obfuscate the rendered pixels and you’ll annoy your humans and eventually find that the scrapers’ OCR is superhuman.

Still, maybe AI comes into it. Maybe poisoning the data is the right way to do it conditioned on ML-juiced anomaly detection.

pdf and print newspaper is still a massive pain in the ass to OCR accurately

To some extent those already exist and I get annoyed by them when they cause 1Password to be useless on their login page. But it probably would help with algorithmic scraping.

This is already common. It's mildly annoying for scrapers but generally a waste of time since you can usually still orient yourself based on the content of the nodes.

This would have huge accessibility issues, breaking screen readers and the like.

We already have react-native-web (<3), so we have that covered.

> It's a cool problem to tackle but it is just an arms race.

Plus, it's one you're going to lose. I was once asked at an All-Hands why we don't defend ourselves against bots even more vigorously.

My answer was: "Because I don't know how to build a publicly available website that I could not scrape myself if I really wanted to."

> But our most fruitful effort was when we removed limits and started giving "bad" data. By bad I mean alter the price up or down by a small percentage. ... If the customer made a transaction on the altered figure we informed them and took it at the correct price.

Is that legal? It would be a big blow to trust if I was the customer, but that's without knowing what you were selling and in what market.

It’s legal if it’s in the contract. Standard for contracts to allow for mistakes and confirmations of prices

It's not a mistake if you do it deliberately!

Yes (not saying it's a mistake), but a confirmation step can be in the contract; no law says you only get one chance to display the price.

I love the honey pot approach. Put tons of valued hrefs on the page that are invisible (CSS) that the scraper would find. Then just rate limit that IP address and randomize the data coming back. Profit.

I think this falls into the "arms race" trap, though. If you can make an href invisible via CSS, then the scraper can certainly be written to understand CSS, and thus filter out the invisible hrefs.
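For illustration, the server side of the honeypot idea could look roughly like the sketch below. `HoneypotTracker`, the `/p/...` path scheme and the inline `display:none` style are all hypothetical choices, and as the reply above points out, a scraper that evaluates CSS can skip such links.

```python
import secrets


class HoneypotTracker:
    """Hands out hidden trap links and flags any client IP that follows one."""

    def __init__(self) -> None:
        self.trap_paths: set[str] = set()
        self.flagged_ips: set[str] = set()

    def new_trap_link(self) -> str:
        # A random path that no human-visible element ever links to.
        path = f"/p/{secrets.token_hex(8)}"
        self.trap_paths.add(path)
        return f'<a href="{path}" style="display:none" rel="nofollow">more</a>'

    def record_request(self, ip: str, path: str) -> bool:
        """Return True if this IP has ever requested a trap path."""
        if path in self.trap_paths:
            self.flagged_ips.add(ip)
        return ip in self.flagged_ips
```

A request handler would call `record_request` on every hit and, for flagged IPs, switch to rate limiting or randomized data as suggested in the thread.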

I scrape government sites a lot as they don't provide APIs. For mobile proxies, I use the proxidize dongles and mobinet.io (free, with Android devices). As stated in the article, with CGNAT it's basically impossible to block them: in my case, half the country would no longer be able to access the sites (if you place them in several locations and use one carrier each there).

Wow, this is super interesting:



I feel like I'm getting a glimpse into the dark underbelly of the web.

Is it just one ip per dongle at a time? Or can you have multiple ips on the same device.

Just one IP at a time but you can change every 5 Min or more if you like

Damn I need like 100 at a time and that could get expensive

For a particularly hard-to-scrape website, using some kind of bot protection that I just couldn't reliably get past (if anybody wants to know what that was exactly, I'll go and check it), I now have a small Intel NUC running Firefox that listens to a local server and uses Tampermonkey to perform commands. Works like a charm and I can actually see what it's doing and where it's going wrong. (Though it's not scalable, of course.)

We use it for data-entry on a government website. A human would average around 10 minutes of clicking and typing, where the bot takes maybe 10 seconds. Last year we did 12000 entries. Good bot.

I'm curious what bot protection it was? It couldn't have been trying too hard unless you were employing multiple anti-fingerprinting techniques, I'm assuming you used firefox's built in anti-fingerprinting?

You can use chromium/chrome/cdp and turn headless off and see the same thing.

Where I was working we stopped caring about IPs, browsers etc. because it was just a race. What we did was analyze click behaviour and act on that. When we recognized a bot we went on to serve it a fake page. That cut costs a little because those were static pages. In general it took them a long time to discover the pattern and it was way more manageable for us.

We did the same and the bot developers wrote bots that acted like humans. It didn't take them very long to find out.

It's easy to detect headless Chrome, so scraping with it is not really how the "big" boys do it :D The only scrapers/bots that are really hard to detect are the ones running and controlling a real browser and not Chromium. I do a lot of research against antibot systems, sometimes on Friday nights. If you spend each one in the pub it doesn't mean you're normal.

puppeteer-extra and undetected-chromedriver beg to differ :)

Not really, I did test it (and use it for some cases), but there are still sites that detect it. I can, and so can anyone who checks the WebGL renderer name, though that can be countered by faking the driver name, but that's just one of many ways :) It's an ongoing fight. If you don't move your mouse, or you type faster than 95% of my portal users, I can detect you with a JS script written in under 1 minute.
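The "no mouse movement, or faster than 95% of users" rule mentioned above could be evaluated server-side along these lines. This is only a sketch: the commenter describes a JS script, while the function names, the characters-per-second metric and the way samples are collected here are assumptions.

```python
import statistics


def typing_speed_cutoff(human_speeds: list[float]) -> float:
    """95th percentile of typing speeds (chars/second) seen from real users."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(human_speeds, n=20)[-1]


def looks_automated(chars_per_second: float, mouse_move_events: int,
                    cutoff: float) -> bool:
    """Flag sessions with no mouse movement or implausibly fast typing."""
    return mouse_move_events == 0 or chars_per_second > cutoff
```

By construction roughly 5% of genuine users exceed the cutoff, so a check like this is better used as one signal among several than as a hard block.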


We are seeing a lot of bot traffic too but chose to accept it as reality. We are aware that if thousands of bots create unpredictable cost surges, there is something wrong with our product; it should not create such heavy loads on our servers in the first place to fulfil its mission.

I believe the future will make us more free by using more bot / AI technology since who wants to spend their whole day in front of a computer and research information if machines can do the job just fine?

The author says proxies are expensive and then proceeds to spend a shitton of money buying all that hardware.

4G proxies are just soo much better than so called "residential" or straight datacenter proxies. It makes sense to create your own 4G proxy farm if you conduct business in that area.

With only 10 dongles and 10 data plans, you can have a lot of IP addresses that are extremely hard to block. It's a one-time investment, paying proxy providers is a fixed cost.

Where do you get 4G dongles that don't suck nowadays?

We tried to get some, but all of the ones we could get were various levels of broken or unsupported.

That was not the author's main argument against proxies, that was just an additional point. You ignored the primary argument in your judgment.

>>Because I could not fully trust the other customers with whom I shared the proxy bandwidth. What if I share proxy servers with criminals that do more malicious stuff than the somewhat innocent SERP scraping?

Can they not call out a secondary point?

Sure but nitpicking does not lead to productive discussions.

If you want to avoid bot detection, learn how bot detection works. A lot of commercial "webapp firewalls" and the like actually have minimum requirements before they flag certain traffic as a botnet; stay below those limits and you can keep hammering away. Sometimes those limits are quite high.

In the past we've had the most success defeating bots by just finding stupid tricks to use against them. Identify the traffic, identify anything that is correlated with the botnet traffic, and throw a monkey wrench at it. They're only using one User Agent? Return fake results. 90% of the botnet traffic is coming from one network source (country/region/etc)? Cause "random" network delays and timeouts. They still won't quit? During attacks, redirect to captchas for specific pages. During active attacks this is enough to take them out for days to weeks while they figure it out and work around it.

Having spent a week battling a particularly inconsiderate scraping attempt, I’m quite unsurprised by the juvenile tone and fairly glib approach to the ethics of bots/scraping presented by the piece.

For the site I work for, about 20-30% of our monthly hosting costs go towards servicing bot/scraping traffic. We’ve generally priced this into the cost of doing business, as we’ve prioritised making our site as freely accessible as possible.

But after this week, where some amateur did real damage to us with a ham-fisted attempt to scrape too much too quickly, we’re forced to degrade the experience for ALL users by introducing captchas and other techniques we’d really rather not.

Right with you there.

I had a particularly bad time not so long ago, when a customer's site - a shop - was brought to its knees because someone, probably a competitor, hired some scraper-company of some sort to scrape every product and price.

The scraper would systematically go through every single product page.

And by scraper, I mean - 100's of them. All at the same time, using the old trick of 1 scraper requesting 3 or 4 product pages at a time then pausing for a while.

They used umpteen different IP address blocks from all over the globe - but mainly using OVH vps IP address blocks from France.

Now, maybe if they'd just thrown, say, 5 or 10 of the scraper "units" at the site, no one would have noticed in amongst Googlebot (which they wanted to use anyway because they are using Google Shopping to try to bring in more sales).

But no. This shower of arseholes threw 100's of scraper "tasks" at the site. They got greedy.

Now, the site was robust enough to handle this load - barely - which was massive, however, having to do that /and/ also handle normal day-to-day traffic? Nah. The bastards got greedy and like you I spent a few days unfucking the damage they were causing.

Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

> Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

Not everybody in this space is out to destroy your site. Some of us actively try to put as little load on your site as possible. My scraper puts less load on sites than I do when I browse them normally, I've measured it. Really sucks when we get lumped together with the other abusers and blocked.

Exactly, some of us use scrapers because while we can't go full Richard Stallman, we also don't want to visually sift through ridiculous UI just to look at some basic data/text.

> we also don't want to visually sift through ridiculous UI just to look at some basic data/text


First scraper I ever built was for my school portal. Absolutely atrocious user interface. It got to the point that I seriously hated that site so I built a script to log into it and download my information. I just wanted to see my grades without suffering.

In a past life, we were consulting with a startup that offered a subscription data service. They were very sensitive about scrapers, especially on the time limited try-before-you-buy accounts, which competitors were abusing.

At their request, we built a method to flag accounts for data poisoning. Once flagged, those accounts would start getting plausible-ish looking garbage data.

It was pretty effective. One competitor went offline for a few days about a week after that started, and had a more limited offering when they came back up.

That's a good way of going about dealing with this kind of abuse indeed. Wish I'd thought of doing that at the time, but due to the nature of this shop you didn't need a user account to browse the products/prices.

I'm now making an entirely new shop for them - I shall bear this in mind. Thanks for that!

Yea. Detect them and mess with them is the only approach that seems to work for a lot of abusive activity. Banning doesn’t work because they will just start over from scratch. The only thing you can really do is make them think you haven’t “caught” them yet and during that stretch make sure their time is wasted.

It sucks when this happens, but it's easily avoidable by using a caching frontend of some sort.

My favorite is Varnish,[0] which I have used with great success for _many_ web sites throughout the years. Even a web site that served 10+ million requests per day ran from a single web server for a long time a decade-ish ago.

[0] https://varnish-cache.org/
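To make the caching-frontend idea concrete, here is a toy time-to-live cache. It is not Varnish and not its configuration language; `TTLCache`, `get_or_render` and the 60-second default are assumptions chosen only to show why repeated hits on the same product page stop reaching the backend.

```python
import time
from typing import Callable


class TTLCache:
    """Serve a stored response until it is older than ttl_seconds."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}  # key -> (expires_at, body)

    def get_or_render(self, key: str, render: Callable[[], str]) -> str:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh copy: the backend is not touched
        body = render()  # expensive page generation happens only here
        self._store[key] = (now + self.ttl, body)
        return body
```

With this in front, a hundred scrapers walking the same catalogue cost at most one backend render per page per TTL window.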

If your site is so poorly written it can't handle a few hundred computers trying to do something as simple as loading your product pages then sorry, but that's on you. The information is on the public web and scrapers are as entitled to access it as any web browser.

> Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

Wait till you find out what half of Google's business is based on (spoiler - scraping).

I really don't think scraping itself is an issue 90% of the time. It's the behavior of the out of control scrapers that are the problem. A well behaved scraper should barely be noticeable, if at all.

At least google's scraping does result in your website being discoverable by users. So you get something out of it. That's not to say that sometimes Google is missing or stealing data they scrape. But at least there is some benefit. Many other scrapers are merely taking the data to compete.

I strongly feel that if a human can get to it manually, we have to accept that either it will be botted or humans will be paid to do it by hand (They call these people "analysts" or "market researchers").

I might argue that what google actually uses their scraped data for is their search engine - which is private. They simply allow us access to specially crafted queries, which they can and do manipulate (for many reasons, some good some bad).

The only thing I'd say meets that definition would be like Common Crawl.

Exactly. I am surprised that the 'devs' can't figure out a way to block only annoying/excessive scrapers. Most likely they are just lazy and then just put 3rd party 'solution' and job done. Pay me.

>Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

if a scraper is effectively DDoSing you, call it what it is -- a denial of service attack.

i've found from experience that most scraping attempts originate against host-sites that are generally user-hostile; no APIs to use, JS tricks to bother user browsing, or groups that profit from first-mover advantage and thus try to obscure data.

So, if your sites are commonly the victim of scrapers that are harvesting publicly available data i've found that it's more useful to ask myself what alternatives I could provide those that feel the need to scrape.

As for a 'lack of ethics' on how publicly available data is wrangled -- well, i'll just say that I feel that it remains the responsibility of the administrator rather than being something to push the blame onto clients for. There are plenty of technical avenues to pursue before appealing to morals and ethics for help.

This and the post you are replying to both sound like sabotage by a competitor rather than legit data collecting.

Bots are one of those things that are easy to build and hard to get right, and there's really no way of preparing for the chaotic reality of real web pages other than fixing the problems as they show up. Weird and unexpected interactions are going to happen. Crawling the real web involves navigating a fractal of unexpected, undocumented and non-standard corner cases. Nobody gets that right on the first try. Because of that I do think we need to be a bit patient with bots.

At the same time, even as someone who runs a web crawler, I have zero qualms about blocking misbehaving bots.

I kinda feel like rate limiting your request to individual domains and IP addresses is an easy thing that goes a long way towards getting it right.
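A per-host rate limit of the kind suggested above can be sketched as follows. `PerHostThrottle` and the one-second default interval are hypothetical; the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable
from urllib.parse import urlsplit


class PerHostThrottle:
    """Enforce a minimum delay between requests to the same hostname."""

    def __init__(self, min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: dict[str, float] = {}  # host -> earliest next request

    def wait(self, url: str) -> float:
        """Block until url's host may be hit again; return the seconds waited."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay
```

A crawler would call `throttle.wait(url)` immediately before each request. Keying on the resolved IP address instead of the hostname is a possible variation when many domains share one server.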

There are still snags with that.

Stuff like redirect resolution is very easy to overlook. You may think you're fetching 1 URL per second, but if you are using the wrong tool and you're on a server that has you bouncing around like in a pinball machine and takes you through a dozen redirects for every request, the reality may be closer to 10 requests per second.

On top of that, sometimes the same server has multiple domains. Sometimes the same IP-address serves a large number of servers (maybe it's a CDN).
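The redirect problem described above is easier to see when redirects are followed by hand and every hop is counted as a request. In this sketch `fetch_once` is a hypothetical callable standing in for whatever HTTP client is used with automatic redirects disabled; it returns a status code and the `Location` value, if any.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}
FetchOnce = Callable[[str], tuple[int, Optional[str]]]


def fetch_following_redirects(url: str, fetch_once: FetchOnce,
                              max_hops: int = 5) -> tuple[str, int, int]:
    """Follow redirects manually; return (final_url, status, requests_made)."""
    requests_made = 0
    for _ in range(max_hops + 1):
        status, location = fetch_once(url)
        requests_made += 1  # every hop is a real request against the server
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # Location may be relative
            continue
        return url, status, requests_made
    raise RuntimeError(f"gave up after {requests_made} requests (redirect chain too long)")
```

Calling a rate limiter inside the loop, before each `fetch_once`, is what keeps "1 URL per second" from silently turning into many requests per second.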

If you build your site in a way that multiplies each request 10x, well then that's what you get. Don't do that and you won't have issue with requests. Or handle those requests properly. There are solutions to that. You know how many requests your local google CDN gets? They know how to manage load.

Most pages have at least a http->https redirect, many contain a lot of old links to http content.

Usually it's error pages that really drive the large redirect chains. They often have a vibe of like some forgotten stopgap put in place to help with some migration to a version of the site that is no longer in existence.

Of course you don't know it's an error page until you reach the end of the redirect chain.

If an amateur can do that to your service by scraping, imagine what someone can do if they actually intend to do you harm. With cloud pricing models someone could find a little misconfiguration or oversight and put you in the hole in operating costs. Anti-abuse is a necessary design when your service is exposed to the internet.

Not saying that doesn't suck - it does, it's why many ideas don't work in practice as an online service.

I'm right there with you. I'm the lead engineer for an automotive SaaS provider (with ~6000 customers and ~4 billion requests per month) and we recently started moving all our services to Cloudflare's WAF to take advantage of their bot protection. We were getting scrapes from botnets in the 100000+ per minute range that were affecting performance.

We chose to switch to the JS challenge screen as it requires no human interaction. We now block 75% (estimated to the best of our knowledge) of bot traffic but some customers are livid over the challenge screen.

I'm really surprised that the JS challenges helped so much, given that there are open source libraries for bypassing them (e.g. cloudscraper[0]).

[0]: https://github.com/venomous/cloudscraper

If someone wanted to get past it they probably could. We've had a few sources of traffic that we've had to straight up block (as opposed to challenge) because of this exact issue. So far it's been a "good enough" solution that blocks enough of the bot traffic to be effective.

What were they scraping, if I can ask? Was it targeted or just wget -r style?

It was a hybrid of low-effort vulnerability scanning and targeted inventory scraping. Many dealerships in the automotive space will pay gray-hat third parties to scrape and compile data on their competitors.

The irony for us as a provider is that it's one of our customers (party A) paying a third party to scrape data from another one of our customers (party B) which in turn affects the performance of party A's site. We've started blocking these third parties and directing them to paid APIs that we offer.

And how do you get your 'inventory data'? Aren't you scraping (or using scraped data) yourself? Oh the irony :)

No, we're a contracted provider for these customers. They ingest their data into our network through APIs or CSVs.

Makes little sense - customers upload data to you and they don't want any data back? Really?

It's not them who want it back, it's their competitors who want it.

I get why someone else scrapes it. But why do customers upload data in the first place? Aren't they interested in getting some OTHER data from you and that OTHER data may as well be scraped?

Why do you think those bots were scraping your data in the first place?

Why not create an API endpoint and charge a modest fee for that data? You’ll make money instead of spending it.

Do you honestly believe all site scraper people/companies are ethical enough to go to whoever pays /them/ to scrape data from a competitor's site and say "oh they offer an API to access this data let's pay for that", instead of "why pay for that data when we can scrape it right off their site"?

Also, not all types of company will provide API endpoints. It all depends on the type of site - for example, an online shop might not wish to provide easily accessible data on offered products and prices, to their competitors who may wish to undercut them. Why would an online shop do that?

I run a large scraper farm against several large sites. They're not online shops, and we don't compete with them. But they do have hundreds of thousands of data points that we use to provide reports and analytics for our clients, who also do not compete with the sites.

I absolutely would pay for an API that provides that data. I'd be willing to pay 10x more than the cost of maintaining and running the scrapers.

But the sites being scraped have no interest in that.

Have you tried approaching those sites and asking them to provide an API, pointing out that it would be easier for both of you in the long run? Or are you just assuming they wouldn't do it.

Because right now the bots - which comprise probably 2/3 of my traffic - are causing me huge headaches, and I sure wish that the people doing it would tell me what the heck they want.

Yes, we have. And no, they are not interested.

Building and maintaining the scraper is not the cost they would use to measure it internally. It’s the cost to build the API and support it, plus perhaps any perverse incentive it creates where even more data flows out to competitors.

For all intents and purposes, this isn't competitive data for them. There aren't really competitors in the space anyway, the barrier to entry is ridiculous. In fact, by law, operators in the industry are required to share this particular data with each other and industry regulators. But they don't share it with outside parties in the aggregate form we need it in. Hence, the scraping.

Building an API is 5 times easier than building routes for your public webpages, which are basically an 'API' as well.

And the cost of being scraped.

Well, you don't need an api, just a CSV file with a catalog.
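The "CSV file with a catalog" idea can be as small as a periodic export job. A minimal sketch in Python, assuming a hypothetical catalog of product dicts with `sku`, `name`, and `price` fields (the thread does not specify any schema); publishing the resulting file to a static host is left out.

```python
import csv
import io

# Hypothetical catalog fields; nothing in the thread defines a schema.
FIELDS = ["sku", "name", "price"]


def export_catalog(products):
    """Render an iterable of product dicts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for product in products:
        # Copy only the published fields so internal columns never leak out.
        writer.writerow({field: product[field] for field in FIELDS})
    return buf.getvalue()
```

Regenerating this file on a schedule and serving it as a static asset means scraper traffic never touches the application servers.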

The scraping company WILL use the API/CSV file... they will probably also still charge their customer for scraping, so it's a win-win :D

You can think of it this way, the prices and product data are publicly visible already on the website, there are no real secrets, none of it is password protected.

You can be principled and insist on blocking bots and spend a lot of time and money on tools, people, and ultimately hosting because the bots will always win; or you can offer the data for free/minimal fee and serve it with almost zero cost and cache it so you can do that with a micro sized server.

You can always lie about some of the prices if you want, but you will just encourage bots again.

Ethics are nice, but let's be honest, very lacking. Sometimes it's better to be pragmatic.

> You can think of it this way, the prices and product data are publicly visible already on the website, there are no real secrets, none of it is password protected.

There's the problem right there. The prices and product data are publicly visible - because there is a target audience of /humans/ for whom the site is designed and intended to be used by. The site is not there to cater for a competitor's scrapers.

I don't care how much people couch their unethical behaviour in "the data is publicly available", the basic fact is most if not all websites exist for human eyeballs to look at them. They do not exist for arseholes to DOS them by inundating them with scrapers.

From my perspective, the problem is that the data that is offered isn't really "for humans". The data is for convincing the humans to buy/pay or worse, browse and watch ads as a result.

But overall, information is one of those goods that has intrinsic properties like no other. It can be copied, infinitely. And we haven't yet figured out the dynamics of how to reason about it, so it feels like we're pretending they're physical goods.

Edit. Side note. I'd go further and say that some of the data is even worse, it's "offered" with the real intention being to confuse the users into performing non-optimally in the market. Look at Amazon/Ebay/AliExpress/Google listings for evidence of that. Just Google - Google is an ML and scraping powerhouse, and the best they can muster is to be spammed with fake websites and duplicate/confusing listings.

You hit the nail on the head. It's hard to have sympathy for site operators complaining about scraping, where almost every site does its best[0] to make using it a time consuming, potentially risky and overall annoying ordeal. Not to mention, information asymmetry is anathema to a well-functioning market, and yet no. 1 reason for fighting bots given in the whole thread here is a desire to maintain that information asymmetry.

And that's also the dirty secret behind the "attention economy": its whole point is to make things as inefficient as possible, because if you're making money on people's attention, you need to first steal it (by distracting them from what they're trying to achieve), and then either direct it towards your goals (vs. those of the users), or stretch it out to maximize their exposure to advertising.


[0] - Sometimes unintentionally. Unfortunately, the overall zeitgeist of UX design is heavily influenced by bad players, so default advice in the industry is often already intrinsically user-hostile.

> Not to mention, information asymmetry is anathema to a well-functioning market, and yet no. 1 reason for fighting bots given in the whole thread here is a desire to maintain that information asymmetry.

This is exactly right.

> the basic fact is most if not all websites exist for human eyeballs to look at them.

There's a whole ethical subthread here of websites trying to make the experience for those humans miserable, and taking away the agency necessary to protect oneself from that. A browser is a user agent. So is a screen reader. So is a script one writes to not deal with bullshit fluff, when all one wants is a simple table of products, features and prices.

I agree 100%, but it is a fact of life, and sometimes it's better to just minimize the fuss and focus on the things that matter.

Your argument is perfectly valid and applies to offline activities as well (what stops a competitor from walking through the aisles of a Walmart or Costco?), but this is a battle that can't be won, there are too many parasitic actors. It is human nature.

Understanding your competitor's pricing is not "parasitic", it's research. Every company I've ever worked for that sells something online scrapes their competitors in some way (whether with bots or with interns).

I would say it's the opposite of parasitic. It's essential to having a well-functioning free market.

> (what stops a competitor from walking through the aisles of a Walmart or Costco?)

That's a significant portion of Nielsen's business model.

Let's not encourage these unethical people to even think of using human eyeballs and manual data entry for their scraping instead of bots. That sounds pretty darn unethical.

> Why would an online shop do that?

Because otherwise the HTML will become the API.

Ethical - of course not. Practical.

Valuable public data is going to be scraped - this is inevitable. Even paywalled or signup protected valuable data is going to be scraped.

Why not sell valuable data for a reasonable price, then?

My point was more that we can accept, and live with, scrapers, but we expect some minimal level of consideration if you’re going to abuse our very expensively gathered dataset. Sending us 10x daily traffic so you can scrape quicker than the fair usage policy of our API allows is just… poor etiquette? Unkind? Not really sure how to phrase it. I’m exhausted after multiple 18-hour days trying to keep our website online for the public.

As a programmer who just sometimes wants to check if a given item is available in store, I would like to be able to use an API for that. But if it is not available, one has to scrape.

>where some amateur did real damage to us

If an amateur can do damage to you, then I have some bad news for you...

This is nonsense. It's always easier to destroy than to build/maintain. If you've got any real advice, by all means...

> If an amateur can do damage to you, then I have some bad news for you...

I believe the point wasn't surprise that damage occurred at all, but frustration that damage can occur just out of laziness/ignorance rather than malice.

Indeed, that was precisely their point, and "bad news for you" is disingenuous as there are many techniques used by incompetent, or just downright unethical and greedy scraper companies which, no matter how robust the target is, can still give it a major headache.

I've witnessed a site being basically DOS'ed due to particularly greedy and aggressive mass scraping attempts.

Precisely this, thank you.

To be clear, the “damage” they did was to our bottom line. Most sites don’t capacity plan for random cliff walls of 2-10x traffic (clearly we should!). We’re scalable enough to handle that traffic after a period, but a) it caused intermittent periods of low availability (costing us money because we didn’t generate income the way we normally do) and b) cost us money from scaling all our services up.

It’s just selfish. If you’re going to take the product of other people’s work in a manner they don’t consent to, at least do it in a way that doesn’t cost them twice over.

Considering the demand for your content, why haven’t you created and provided an API? Maybe you could monetize?

I wrote a scraper a couple of years ago to get a single data point from a website where my client was already a paying customer. This website had an API, which they were also paying for, but the API didn't cover that data point, so at the time they had one of their admin people populating that missing piece of data manually, which was taking them around ten minutes a day.

I asked them if my customer could pay to access this data point via their API and they quoted 3600 EUR/month! Enter the scraper...

We do offer an API - the scrapers are trying to circumvent using that, presumably.

Why do you think they are trying to circumvent it?

Does your API provide all the information that can be found on the site, or are they scraping because the API is incomplete?

We've once had to scrape Amazon product pages because they have a lot of API endpoints, but those didn't contain the data we needed.

This is the number one reason to scrape websites. It's always nice when there's an API with documentation and rate limiting rules you can follow. Sometimes the data I need just isn't there, though. Then I open up their site and find a huge amount of private API endpoints that do exactly what I want. Then I open up a ticket about it and it gets 200 replies but they ignore it for years. It's fucking stupid and it's really no wonder people scrape their site.

Why would Amazon wish to provide you with easy to access data on their products and prices when you could either be a competitor wishing to undercut those prices, or be a scraper company hired by such a competitor?

In what universe is providing such a straightforward way of helping a competitor considered sane business practice?

Most sellers who are on the Amazon platform give Amazon that information and a lot more, knowing full well Amazon will use their sales data to launch an Amazon Basics competitor.

It is a sane business approach when you are a pragmatic business that knows the limits that constrain it.

Either the content company is going to build a simple API (could be just a static CSV file hosted on S3 or whatever) with useful information, or it will try to monetize/hide this information and force scrapers to use the website.

A bot is always going to win unless you are willing to impose a lot of friction on users too. In the era of deepfakes and fairly robust AI tooling, the difference between bot action and human action is not all that much.

If you are going to be aggressive with captchas, IP blocks and other fingerprinting, users who get flagged as false positives or get annoyed will leave.

When the cost of losing those users is more than the cost of allowing access to scrapers, you would absolutely set up the API.

Man your comment is hilarious because in fact Amazon DOES provide an API for exactly that

And yet...

> We've once had to scrape Amazon product pages because they have a lot of API endpoints, but those didn't contain the data we needed.

...only a couple of comments up.

You don't know what data they needed. Maybe they needed reviews or product descriptions. The API doesn't cover everything but it does cover the exact use case I was replying to.

Because they will get the data regardless of what you do and if you don't make an API it will cost you more due to overhead.

Markets are competitive and efficient when all parties have full information. If Amazon doesn't want its prices to be known and finds ways to successfully prevent them from being scraped, in some sense the state should force it to disclose them via API (or something equivalent).

In the end, they still get the data, just in a much less desirable way for both you and the customer.

Is it not viable to put the majority of your data behind a login, so the bots only get a very limited snapshot while legitimate users get it through a free login?

I’m asking this because I’m going through a very similar situation and would love to see other opinions around this.

You are defining legitimate users as those that have a valid session cookie? Good luck

Maybe the API terms/cost are prohibitive? I'm sure there's some equilibrium where they would rather pay you than go through the trouble of scraping.

Maybe docs or infra are unbearable

What is your site may I ask?

Just curious about the difference in value from using your API and web scraping as there is a cost to web scraping as well.

If you make your scraper well, and it counterfeits being a real user believably, you end up with a solution that can be tweaked as needed to handle whatever traps people put in to try to defeat your scrapers.

If you make your api client well, you don't have the problems of a scraper - but if the api owner decides to change rules for api and you can't do what your business is based on being able to do (think of api owner as Twitter) then you need to make a scraper.

Wait, why wouldn't you have rate limiting on your API? Providers like Cloudflare offer this although I guess you could roll your own too since our industry loves to reinvent the wheel.
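For the "roll your own" route, one common technique is a token bucket per client. A minimal in-process sketch in Python follows; the class name, the `key` argument (an API key or IP address, say), and the numbers are all illustrative assumptions, not anything described in the thread or how Cloudflare implements it.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second per client, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.buckets = {}   # client key -> (tokens remaining, time of last update)

    def allow(self, key):
        """Return True if the request from `key` may proceed, False if it is throttled."""
        now = self.clock()
        tokens, last = self.buckets.get(key, (self.capacity, now))
        # Refill in proportion to elapsed time, never exceeding capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

A handler would call something like `limiter.allow(api_key)` and answer with HTTP 429 when it returns False. This version keeps state in one process only; a multi-server deployment would need a shared store.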

Like everyone and their brother has a web spider. And some of them are VERY badly designed. We block them when they use too many resources, although we'd rather just let them be.

Can't speak for the op but we have APIs and move the ones scraping and reselling our content to APIs. The majority are just a worthless suck on resources though.

Doing a bit of low-stakes monitoring of webpages lately. It started (as I'm assuming it often does) with right-clicking a network request in Chrome and selecting "copy as curl"

Then graduated to JavaScript for surrounding logic e.g. data transformation

I had assumed I'd quickly give up and move to a headless browser, BUT I can't bring myself to move away from tiny CPU utilization of curl.

Throwing together a "plugin" probably takes me less than 20 minutes normally.

I'll probably have a look at using prowl to ping my phone.

And if I get more serious I'll look at auto authenticate options on npm. But I'm not sure if the overhead of maintaining a bunch of spoofy requests will be worth it.
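The copy-a-request-and-poll workflow described above can be sketched in a few lines. This is a Python stand-in for the curl plus JavaScript setup, not the commenter's actual code; the function names and the `User-Agent` string are hypothetical.

```python
import hashlib
import urllib.request


def fetch(url, timeout=10):
    """Plain GET, roughly what a bare `curl URL` does."""
    request = urllib.request.Request(url, headers={"User-Agent": "page-monitor-sketch"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def fingerprint(body):
    """Hash the response bytes so only a short digest has to be stored between runs."""
    return hashlib.sha256(body).hexdigest()


def check(url, previous, fetcher=fetch):
    """Return (changed, new_fingerprint); `previous` is None on the first run."""
    current = fingerprint(fetcher(url))
    return (previous is not None and current != previous), current
```

Run from cron with the last fingerprint stored in a file, this stays at curl-level CPU cost. Pages that embed timestamps or tokens would need the relevant fragment extracted before hashing, otherwise every run reports a change.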

Basic question, how does one profit from scraping data and what kinda data?

Taking a stab at answering it: you scrape the data and build a business around selling it. Stock prices? But that's boring, plus how many others are doing it? I bet a lot.

1. Be job site. 2. Have employees that cost money call facilities and get job listings. 3. Establish relationships with facilities to list jobs. 4. Buy job listings from 3rd parties. 5. List them for free hoping to make margin. 6. Scraper steals all jobs, lags site, and gets value of hard work for free.

ahh thanks

These are scraping artificially limited releases of clothes/shoes. You buy a shoe at $100 and immediately sell it at $1000.

Artificial scarcity - every week you release a "limited edition item", but if you do the math, it's not limited edition at all if you integrate over a year.

Anything you might look up or keep track of online that helps your business is probably being scraped by someone who is using it themselves or selling access to the curated data set.

Prices (are yours high, low compared to competition?), reviews, locations of physical stores, search result placement (where does your widget show up when someone searches "widget" on your site?), just to name a few use cases.

Here's a project that's been in the news recently that relies heavily on scraped data. http://www.thebillionpricesproject.com/

Thanks for the share. Great stuff.

I used to scrape websites to generate content for higher SERPs.

Ended up going into the adult industry lols. (https://javfilms.net)

Neat! I've run across your site organically :P

I've always wondered, and since you're right here... how do sites like this make money?

It looks like you're probably crawling all the JAV vendors, finding free clips of today's releases, embedding them in your own site to draw traffic, and making money with affiliate links to buy the full content?

Am I missing anything? It seems hard to believe you'd get enough affiliate signups to make it worthwhile.

I can imagine your site as being a few hours a year of script maintenance and a money printer, or a 40hr/week SEO job with 1000s of similar sites across the adult industry.

I'd love to know anything you're willing to share about how the business works.

Not the same kind of scraping, but does anyone have thoughts/resources/best practices for doing link previews (like Twitter/iMessage/Facebook)?

You shouldn’t really need to do any scraping tricks to get that, because it’s data the websites (usually) want to give to bots. Or are people getting bot block screens from Cloudflare et al. for that basic action these days?

It should be a matter of a simple GET request to fetch plain HTML and parse the OpenGraph meta tags out of that. There are many open source libraries to do that for you depending on your language.

If bot blocks really are a problem, a SaaS solution like Microlink could probably do it for you.
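The "fetch plain HTML and parse the OpenGraph meta tags" step needs nothing beyond a standard library. A minimal sketch in Python, assuming the HTML has already been fetched; the class and function names are made up for illustration, and a real preview service would likely want fallbacks such as `<title>` when the tags are missing.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first occurrence of each property, minus the "og:" prefix.
            self.tags.setdefault(prop[3:], content)


def parse_open_graph(html):
    """Return a dict such as {"title": ..., "image": ...} for the given HTML text."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags
```

The `title`, `description`, and `image` keys of the returned dict are usually enough to render a preview card.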

Bot blocks are definitely an issue for certain sites, I've implemented it that way currently.

Microlink is a good tip, thanks!

Could you share your code for AWS Lambda and Puppeteer? It’s definitely interesting for other websites.

You can put some wasm crypto mining code and at least profit from bots. :D

wow! That was an interesting read.

A little pet-peeve I have is when an obscure(ish) acronym is used and never defined. Is SERP a well-known acronym? Perhaps this is a niche blog and I'm not the intended audience.

Yes; a SERP is a Google search result page. It's the most important acronym in SEO.

I don’t remember ever hearing it, and I’ve been in the industry for some time.

Huh. I've never been in the industry, but noticed "SERP" at least 15, maybe 20, years ago and have remembered it since.

(If I were writing something to be published, though, I would write "search-engine results page" instead of "SERP".)

You've been in the SEO industry for some time and never heard SERP?

The web programming industry, not the SEO industry

There is actually very little overlap between SEO and web development. Web developers should know some basic technical SEO, but you'd be surprised how much knowledge is in the SEO industry that doesn't overlap at all with development. So, it's not too surprising you don't know what SERP is. Web devs might think SEO just means fast load times, proper markup, meta tags, etc., but that's only the very surface.

You can't be serious.

The word SERP feels like a bit of a shibboleth for SEO-people. They seem to take it for granted, the rest of the world just looks puzzled when they hear it.

I had to look it up.

SERP: Search Engine Results Page

Unintroduced acronyms should always be avoided.

Depends on the audience-acronym pair. I don't think HTTP needs an introduction in a technical article, OTOH (on the other hand ;) ) a general newspaper should probably expand HTTP but not WWW.

Perhaps a stroll through your own comment history (or the comments of any other HN (hacker news) user) would illuminate a lot of places where acronyms are used without introductions. TBH (to be honest) though, I'm not sure if every one of those should always have one or sometimes not.

As English speakers we often take for granted acronyms such as DB or even USA. For foreigners these can also be inscrutable.

On HN I'm used to SE meaning Software Engineer so I came up with "Software Engineer Ranting Board" before asking Google to give me the SERP that would provide me with the true meaning of SERP.

An all-too-common occurrence in HN comments as well.

Not the OP, but I thought it was well known.

That said, I do a lot of SEO work.

Still, it should be best practice to define any acronym or initialism the first time you use it.
