Web scraping is legal, US appeals court reaffirms (techcrunch.com)
1040 points by spenvo on April 18, 2022 | 237 comments



"On LinkedIn, our members trust us with their information, which is why we prohibit unauthorized scraping on our platform."

This is an unpersuasive argument because it ignores all the computer users who are not "members". Whether or not "members" trust LinkedIn should have no bearing on whether other computer users, who may or may not be "members", can retrieve others' public information.

What's more, this statement does not decry so-called scraping as such, only "unauthorised" scraping. Who provides the "authorisation"? Surely not the LinkedIn members.

It is presumptuous, if not ridiculous, for "tech" companies to claim computer users "trust" them. Most of these companies receive no feedback from the majority of their "members". Tech companies generally have no "customer service" for the members they target with data collection.

Further, there is an absence of meaningful choice. It is like saying people "trust" credit bureaus with their information. History shows these data collection intermediaries could not be trusted and that is why Americans have the Fair Credit Reporting Act.


Why do we even bother to give any credence at all to company PR-provided explanations like this?

They are spin doctors lying through their teeth, because literally everyone is. The culture of business writing/speaking has become so far disconnected from physical reality that it is amazing that anyone accepts these explanations at all, ever.

Like seriously, these guys have proffering casually milquetoast explanations for anything down to a science. If there is one thing you can guarantee, it's that the provided explanation simply isn't true.


> Why do we even bother to give any credence at all to company PR-provided explanations like this?

It's propaganda and it would be great if it didn't work, but it works at least partially. Someone who doesn't know anything about the subject will hear it and repeat it, someone will hear the repetition and give it credibility because it doesn't come directly from a company. I have heard the most stupid opinions from people that go directly against their interests. You don't need fancy algorithms to manipulate society, just perseverance.


People should bother a second to think who benefits from their unexamined beliefs.

To be honest, a lot of people seem to want to kiss ass for the “prestigious” companies, too. You can see this phenomenon on all Apple threads, for example. They either don’t want their status symbols tainted, or they subconsciously believe that flattery will net them something.


Not all companies abuse PR to such a degree. Newegg took on patent trolls and I enjoyed hearing about that. It's just that controversial topics make the news, and if LinkedIn wants to make themselves look like idiots then you can be sure TechCrunch is going to report on that as the drama unfolds.


And that's commendable. Also very rare among large companies.


As an aside, I visited Newegg last night for the first time in quite a while. It is a depressing remnant of its former self. Totally changed.


I'm wondering why there is a statement from LinkedIn, but nothing from hiQ Labs. I don't mind giving companies an opportunity to defend themselves when an article is written about them, but I'd expect the author of the article to get counter-statements from other parties.


> The culture of business writing/speaking has become so far disconnected from physical reality that it is amazing that anyone accepts these explanations at all, ever.

The execs, lawyers, judges, public relations crews, and others who deal with the language day in and day out must become inured to it. Joe Public might never read an EULA, and might be shocked (if not already pessimistically resigned) to understand what companies actually do with his data, but an arbitrator who hears 6 complaints a day has been sitting in the boiling pot for a long time.


You overestimate the average intelligence of a human.


Relatedly: the court found that LinkedIn was selling a very similar dataset to the one they were attempting to cut off, so the claim that they were protecting customers' privacy was a stretch.


From the opinion:

"Finally, LinkedIn's asserted private business interests (protecting its members' data and the investment made in developing its platform, and enforcing its User Agreement's prohibitions on automated scraping) are relatively weak. ... Further, there is evidence that LinkedIn has itself developed a data analytics tool similar to hiQ's products, undermining LinkedIn's claim that it has its members' privacy interests in mind."


Haha what a joke. Do you have a link for that?


I think he's talking about the fact that with a premium membership you get access to turnover rate stats, etc.


Ah okay, thank you. In my book that still makes them hypocrites while their lawyers claim to protect customer privacy by not exposing such data. They should drop their claims and focus on improving the product.


I can't state how much I dislike companies or lawyers who doggedly pursue such selfish policies.

> This is an unpersuasive argument because it ignores all the computer users who are not "members".

It's not just unpersuasive, it's disingenuous. LinkedIn wants to reap the benefits of having a public website while excluding competitors from their definition of "public".

If LinkedIn wants a membership-only website, they can privatize it like Facebook.

As a data scientist, I won't use LinkedIn after seeing them pursue this. They need to learn what public and private mean, and that it isn't the job of courts to punish people or businesses for accessing publicly available information. LinkedIn can set up its own boundaries as many other services already do.


The principles in the PR are just a smokescreen.

LinkedIn is being completely self-serving: taking whatever they can get, and feeding bullshit (sorry, PR) to confuse the issue.


"On LinkedIn, our locked in members information is our sole competitive advantage, which is why we prohibit unauthorized scraping on our platform."


You might also like this statement LinkedIn provided to The Register:

"We're disappointed, but this was a preliminary ruling and the case is far from over," a company spokesperson said. "We will continue to fight to protect our members' ability to control the information they make available on LinkedIn."1

1. http://www.theregister.com/2022/04/19/scraping_public_data_l...

Members cannot control the agreements that Microsoft/LinkedIn enters into with member information as the bargaining chip. There are generally no limits on how Microsoft/LinkedIn can use the information, either internally or externally.


Great point about non-members/members


> On LinkedIn, our members trust us with their information

I still get occasional emails despite having deleted my account years ago, after having a profile with them for a short time.

I've heard plenty of other dark pattern anecdotes about LinkedIn, and can assure them I would trust them as far as I could throw them.


They really should use the word "entrust" rather than "trust". It would be more accurate. But the transfer of information from "members" to LinkedIn still does not provide an explanation for LinkedIn's efforts to stop HiQ from accessing it. This case points to the problem with mediating access to public information (e.g., user generated content) as a business. The further LinkedIn chooses to go with legal proceedings, the more they risk the courts upending this "business model" by exposing its inherent flaws.


s/bearning/bearing/


You can't stop me from keeping the car I borrowed. Your bank can't physically force you to pay back a loan. I can probably outrun the staff at that restaurant. Does that mean the laws preventing such behaviour should be abolished, or that the (implicit) agreement to pay for the food I order should not be binding?

LinkedIn presumably tells its users how they are using the data, at least if they follow the law. Shouldn't people be allowed to consider those terms to be acceptable without it meaning they lose all protection?

"But it's impossible to protect your data against all re-use", you'll say, "someone may remember it". That is not just obviously true, it is the only reason one might want the protection of the law.

The idea that things which can easily be prevented by technical means do not need to be prohibited is the same logic behind the common, absurd argument that "it's your own fault when you are raped at night, in a park", or that it's your own fault someone stole your car when you accidentally didn't lock it: laws aren't so much about preventing you from being raped. In the absence of the law, you'd just never leave your (fortified) apartment. Laws are about allowing you to go outside, to not spend your money on steel-reinforced doors, to leave your convertible parked with the top down.

"Might is right" just leads to a pointless arms race: social networks waste money on protecting against scrapers. They'll hide everything behind logins or paywalls. Scrapers will waste money on overcoming those protections.

If the scrapers win, some people will decide not to do something they would have otherwise done. In other words: they have become a bit less free. The social network becomes less useful and might shut down. The scrapers find themselves with nothing to scrape. Congratulations, everyone lost.


In this case, and in the ruling, the data being scraped was available without login.

The ruling specifically says that scraping data that is publicly accessible (which I presume means without login) is OK.


I'm really not sure what your argument is here.

LinkedIn serves pages containing their users' info, and users are made aware of this. These pages require no authentication, presumably because they're a marketing tool. People built scrapers to obtain that public info. LinkedIn said that was illegal, the court says it isn't.

I don't see who loses here, other than LinkedIn and other sites that want the benefits of listing information without the downsides.


Don't conflate the digital world with the real world. If you really want to go there: the way I see it, LinkedIn is more like a public-square announcement board. You can write down the details of what is there and sift through it on your own, or pay them to give you a summary based on what you are looking for.

Edit: Would it be illegal to photograph the board, OCR it, and provide a summary for half the price of others? There is such a thing as "wrong/stupid laws".


PLEASE DRINK VERIFICATION CAN TO CONTINUE SCRAPING.


"Web scraping is legal" seems to be an overbroad interpretation - the Ninth Circuit is merely reaffirming a preliminary injunction in light of a recent and separate ruling by the Supreme Court. The opinion [1] only weighs the merits hiQ's request for injunctive relief, and doesn't say anything about the meatier topic of web scraping legality.

I think the real action will be the ruling of the district court.

[1] https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/1...


Upon more digging [1], the district court trial is currently scheduled for Feb 27 2023. It'll take a while!

[1] https://storage.courtlistener.com/recap/gov.uscourts.cand.31...


First Thought: That headline seems wrong. People are probably gonna get burned just reading that.

Second Thought: I should really read the opinion before I make any comments.

Clicked through, skimmed around, looked for a link to the slip opinion ... couldn't find it.

Thank you, sir.


While I have sympathy for what the scrapers are trying to do in many cases, it bothers me that this doesn't seem to address what happens when badly-behaved scrapers cause, in effect, a DOS on the site.

For the family of sites I'm responsible for, bot traffic comprises a majority of traffic - that is, to a first approximation, the lion's share of our operational costs are from needing to scale to handle the huge amount of bot traffic. Even when it's not as big as a DOS, it doesn't seem right to me that I can't tell people they're not welcome to cause this additional system load.

Or even if there were some standardized way that we could provide a dumb API, just giving them raw data, so we don't need to incur the additional expense of rendering the creature comforts on the page that are designed to make our users happier but that the bots won't notice.


I've told this story before, but it was fun, so I'm sharing it again:

I'll skip the details, but a previous employer dealt with a large, then-new .mil website. Our customers would log into the site to check on the status of their invoices, and each page load would take approximately 1 minute. Seriously. It took about 10 minutes to log in and get to the list of invoices available to be checked, then another minute to look at one of them, then another minute to get out of it and back into the list, and so on.

My job was to write a scraper for that website. It ran all night to fetch data into our DB, and then our website could show the same information to our customers in a matter of milliseconds (or all at once if they wanted one big aggregate report). Our customers loved this. The .mil website's developer hated it, and blamed all sorts of their tech problems on us, although:

- While optimizing, I figured out how to skip lots of intermediate page loads and go directly to the invoices we wanted to see.

- We ran our scraper at night so that it wouldn't interfere with their site during the day.

- Because each of our customers had to check each one of their invoices every day if they wanted to get paid, and we were doing it more efficiently, our total load on their site was lower than the total load of our customers would be.

Their site kept crashing, and we were their scapegoat. It was great fun when they blamed us in a public meeting, and we responded that we'd actually disabled our crawler for the past week, so the problem was still on their end.

Eventually, they threatened to cut off all our access to the site. We helpfully pointed out that their brand new site wasn't ADA compliant, and we had vision-impaired customers who weren't able to use it. We offered to allow our customers to run the same reports from our website, for free, at no cost to the .mil agency, so that they wouldn't have to rebuild their website from the ground up. They saw it our way and begrudgingly allowed us to keep scraping.


This sounds like exactly what a 'data ownership' law would solve. Allow the user, via some official OAuth flow with their service providers, to authorize even a competitor to access their account, so the competitor can bear the burden of interfacing with the API to port their new user's data over; but it should be a one-time-every-year thing, so that the law doesn't force companies to scale their service to handle bots like the main OP is experiencing.


I have worked with .mil customers who paid us to scrape and index their website because they didn't have a better way to access their official, public documents.


This is not .mil specific: I've been told of a case where an airline first legally attacked a flight search engine (Skyscanner) for scraping, and then told them to continue when they realized that their own search engine couldn't handle all the traffic, and even if it could, it was more expensive per query than routing via Skyscanner.


Michael Lewis' podcast had an episode recently where the Athena Health people related a (self-promotional) anecdote that, after they had essentially reverse-engineered the insurers' medical billing systems and were marketing it as software to providers, a major insurance company called them up and asked to license information about their own billing system because their internal systems were too complicated to understand.


Yep. Have seen similar things.


Me too but for a private company

In reality it was probably more like org sub group A wanted to leverage org sub group B’s data but they didn’t cooperate


Amazing story :) Though I am left wondering if there are ever any circumstances where minorities don't get used as leverage somehow


Yeah, that was unfortunate. We had precious few Federal-strength levers at our disposal, though, and sometimes you have to go with what's available.


I have sympathy for your operational issues and costs, but isn't this kind of complaint the same as a shopping mall/center complaining of people who go in, check some info and go out without buying?

I understand that bots have leverage and automation, but so do you, to reach a larger audience. Should we continue to benefit from one side of the leverage while complaining about the other side?


It's more like a mall complaining that while they're trying to serve 1000 customers, someone has gone and dumped 10000000 roombas throughout the stores which are going around scanning all the price tags.


No. When I say that bots exceed the amount of real traffic, I'm including people "window shopping" on the good side.

My complaint is more like, somebody wants to know the prices of all our products, and that we have roughly X products (where X is a very large number). They get X friends to all go into the store almost simultaneously, each writing down the price of the particular product they've been assigned to research. When they do this, there's scant space left in the store for even the browsing kind of customers to walk in. (of course I exaggerate a bit, but that's the idea)


I’m sympathetic to the complaints about “rude” scraping behavior but there’s an easy solution. Rather than make people consume boatloads of resources they don’t want (individual page views, images, scripts, etc.) just build good interoperability tools that give the people what they want. In the physical example above that would be a product catalog that’s easily replicated with a CSV product listing or an API.
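
To make that concrete, here's a rough sketch of the kind of thing I mean, using Flask as an arbitrary example framework; the endpoint, field names, and data source are all made up:

    # Rough sketch: expose the product catalog as one flat CSV so scrapers
    # don't have to crawl rendered pages. Flask and the field names are
    # placeholders, not anything specific to the sites discussed here.
    import csv
    import io

    from flask import Flask, Response

    app = Flask(__name__)

    # Placeholder data; in practice this would query the product database.
    PRODUCTS = [
        {"sku": "A-100", "name": "Widget", "list_price": "9.99"},
        {"sku": "B-200", "name": "Gadget", "list_price": "24.50"},
    ]

    @app.route("/catalog.csv")
    def catalog_csv():
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=["sku", "name", "list_price"])
        writer.writeheader()
        writer.writerows(PRODUCTS)
        # One cheap, cacheable response instead of thousands of page loads.
        return Response(buf.getvalue(), mimetype="text/csv")

One static-ish endpoint like this can be cached aggressively, which is the whole point.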


You don't know why any random scraper is scraping you, and thus you don't know what API to build that will keep them from scraping. Also, it's likely easier for them to continue scraping than to write a bunch of code to integrate with your API, so there's no incentive for them to do so either.


Just advertise the API in the headers. Or better yet, set the buttons/links to only be accessible via a .usetheapi-dammit selector. Lastly, provide an API and a "developers.whatever.com" domain to report issues with the API, get API keys, and pay for more requests. It should be pretty easy to set up, especially if there's an internal API already available behind the frontend. I'd venture a dev team could devote 20% to a few sprints and have an MVP thing up and running.


For the 2nd part: I have done scraping, and would always opt for an API, if the price is reasonable, over paying nosebleed amounts for residential proxies.


I think lots of website owners know exactly where the value in their content exists. Whether or not they want to share that in a convenient way, especially to competitors etc is another story.

That said, if scraping is inevitable, it's an immensely wasteful effort for both the scraper and the content owner, and often avoidable.


Writing a scraper for a webpage is typically far more development effort than writing an API wrapper


Yes, but the scraper in this context is already built. A bird in hand and all that.


In this case, yes, obviously. But as far as "it's likely easier for them to contribute scraping than write a bunch of code to integrate with your API", that presupposes no existing integration.


Yes, exactly. Nobody is standing up and saying "we're the ones doing this, and here's what we wish you'd put in an API".

Also, I'm a big Jenson Button fan.


No, because those are people going to the mall. Not robots 100x the quantity of real people.


Unlike the mall a website can scale up to serve more users at relatively low cost. Also those robots may bring more people to the website. Potentially a lot more even.


Reading your comment, my impression is that this is either an exaggeration or a very unusual type of site, if bots make up the majority of traffic to the point that scrapers are anywhere near the primary load factor.

Would someone let me know if I’m just plain wrong in this assumption? I’ve run many types of sites and scrapers have never been anywhere close to the main source of traffic or even particularly noticeable compared to regular users.

Even considering a very commonly scraped site like LinkedIn or Craigslist - for any site of any magnitude like this public pages are going to be cached so additional scrapers are going to have negligible impact. And a rate limit is probably one line of config.

I’m not saying you are necessarily wrong, but I can’t imagine a scenario that you’re describing and would love to hear of one.


Bots are an incredibly large source of traffic on the non-profit academic cultural heritage site I work on. It gets very little human traffic compared to a successful for-profit site.

But the bots on my site -- at least the obvious ones that lead me to say they are a large source of traffic -- are all well-behaved, with good clear user-agents, and they respect robots.txt, so I could keep them out if I wanted.

I haven't wanted because, why? I have modified the robots.txt to keep the bots out of some mindless loops trying every combination of search criteria to access a combinatorial expansion of every possible search results page. That was doing neither of us any good, was exceeding the capacity of our papertrail plan (which is what brought it to our attention) -- and every actual data page is available in a sitemap that is available to them if they want it, they don't need to tree-search every possible search results page!

In some cases I've done extra work to change URL patterns so I could keep them out of such useless things with a robots.txt more easily, without banning them altogether. Because... why not? The more exposure the better, all our info is public. We like our pretty good organic Google SEO, and while I don't think anyone else is seriously competing with google, I don't want to privilege google and block them out either.


As another example, I used to work on a site that was roughly hotel stays. A regular person might search where to stay in small set of areas, dates and usually the same number of people.

Bots would routinely try to scrape pricing for every combination of {property, arrival_date, departure_date, num_guests} in the next several years. The load to serve this would have been vastly higher than real customers, but our frontend was mostly pretty good at filtering them out.

We also served some legitimate partners that wanted basically the same thing via an API... and the load was in fact enormous. But at least then it was a real partner with some kind of business case that would ultimately benefit us, and we could make some attempt to be smart about what they asked for.


It's a B2B ecommerce site. Our annual revenue from the site would put us on the list of top 100 ecommerce sites [1] (we're not listed because ecommerce isn't the only business we do). With that much potential revenue to steal from us, perhaps the stakes are higher.

As described elsewhere, rate limiting doesn't work. The bots come from hundreds to thousands of separate IPs simultaneously, cooperating in a distributed fashion. Any one of them is within reasonable behavioral ranges.

Also, caching, even through a CDN doesn't help. As a B2B site, all our pricing is custom as negotiated with each customer. (What's ironic is that this means that the pricing data that the bots are scraping isn't even representative - it only shows what we offer walkup, non-contract customers.) And because the pricing is dynamic, it also means that the scraping to get these prices is one of the more computationally expensive activities they could do.

To be fair, there is some low-hanging fruit in blocking many of them. Like, it's easy to detect those that are flooding from a single address, or sending SQL injection attacks, or just plain coming from Russia. I assume those are just the script kiddies and stuff. The problem is that it still leaves a whole lot of bad actors once these are skimmed off the top.

[1] https://en.wikipedia.org/wiki/List_of_largest_Internet_compa...


If the queries are expensive because of custom negotiated prices and these bots are scraping the walkup prices, can you not just shard out the walkup prices and cache them?

Being on that list puts the company's revenue at over $1 billion USD. At a certain point it becomes cheaper and easier to fix the system to handle the load.
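
For instance, something as simple as caching only the walk-up path could take most of that load off; a sketch, where get_contract_price and compute_walkup_price stand in for whatever the real pricing code is, and the 15-minute TTL is arbitrary:

    # Sketch: cache only walk-up (non-contract) prices, since that's the path
    # shared by anonymous visitors and bots. Contract pricing stays dynamic.
    import time

    _walkup_cache = {}      # sku -> (price, cached_at)
    WALKUP_TTL = 15 * 60    # arbitrary: 15 minutes

    def price_for(sku, customer_id=None):
        if customer_id is not None:
            # Per-customer contract pricing, computed fresh every time.
            return get_contract_price(sku, customer_id)
        now = time.time()
        hit = _walkup_cache.get(sku)
        if hit and now - hit[1] < WALKUP_TTL:
            return hit[0]
        price = compute_walkup_price(sku)   # the expensive calculation
        _walkup_cache[sku] = (price, now)
        return price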


Indeed - this is one of the strategies we're considering.


> As a B2B site, all our pricing is custom as negotiated with each customer ... the pricing is dynamic

So your company is deliberately trying to frustrate the market, and doesn't like the result of third parties attempting to help market efficiency? It seems like this is the exact kind of scraping that we generally want more of! I'm sorry about your personal technical predicament, but it doesn't sound like your perspective is really coming from the moral high ground here.


> So your company is deliberately trying to frustrate the market, and doesn't like the result of third parties attempting to help market efficiency?

No. First, we as a middleman reseller MUST provide custom prices, at least to a certain degree. Consider that it's typical for manufacturers to offer different prices to, e.g., schools. This is reflected by offering to us (the middleman) a lower cost, which we pass on to applicable customers. Further, the costs and prices vary similarly from one country to another. Less obviously, many manufacturers (e.g., Microsoft, Adobe, HP) offer licensing programs that entitle those enrolled to purchase their products at a lower cost. So if nothing else, the business terms of the manufacturers whose products we sell necessitate a certain degree of custom pricing.

Second, it seems strange to characterize as "frustrating the market" what we're doing when we cooperate with customers who want to structure their expenses in different ways - say, getting a better deal on expensive products that can be classified as "capital expenses" while allowing us to recover some of that revenue by charging them somewhat more for the products that they'd classify as operational expenses.


You're just describing a cooperative effort to obfuscate pricing and frustrate a market. So sure, your company could be blameless and the manufacturers are solely responsible for undermining price signals. I've still described the overall dynamic that your company is participating in. It's effectively based around closed world assumptions of information control, and so it's not surprising that it conflicts with open world ethos like web scraping.

> it seems strange to characterize as "frustrating the market" what we're doing when we cooperate with customers who want to structure their expenses in different ways

I'm characterizing the overall dynamic of keeping market price discovery from working as effectively. How you may be helping customers in other ways is irrelevant.


Holy moly doesn’t that sound more like tax evasion or fraudulent accounting than financial planning?

They’re trying to pay less tax by convincing your company to put a different price on products they buy based on their tax strategy.

It sounds illegal.


The majority of tax and other civil laws are basically full of things that are illegal/problematic if you do them individually, but if you can find someone else to cooperate with then it becomes fine.


This makes a lot of sense to me.

What do you think about their other assertion that the search page is getting a gigantic number of hits that a/ cannot be cached and b/ cannot be rate limited because they're using a botnet?


I'm guessing the bots are hitting the search page because it contains the most amount of information per hit, and that the caching problems are exactly due to these dynamically generated prices or other such nonsense. After all, the fundamental goal of scraping is to straightforwardly enumerate the entire dataset.

The scale of the botnet sounds like an awfully determined and entrenched adversary, likely arising because this company has been frustrating the market for quite some time. A good faith API wouldn't make the bots change behavior tomorrow, but they certainly would if there were breaking page format changes containing a comment linking to the API.


Thanks for the explanation!

The thing I still don't understand is why (edit: server, not CDN) caching doesn't work - you have to identify customers somehow, and provide everyone else a cached response at the server level. For that matter, rate limit non-customers also.


The pages getting most of the bot action are search and product details.

Search results obviously can't be cached, as it's completely ad hoc.

Product details can't be cached either, or more precisely, there are parts of each product page that can't be cached because

* different customers have different products in the catalog

* different customers have different prices for a given product

* different products have customer-specific aliases

* there's a huge number of products (low millions) and many thousands of distinct catalogs (many customers have effectively identical catalogs, and we've already got logic that collapses those in the backend)

* prices are also based on costs from upstream suppliers, which are themselves changing dynamically.

Putting all this together, the number of times a given [product, customer] tuple will be requested within a reasonable cache TTL isn't much greater than 1. The exception is walk-up pricing for non-contract users, and we've been talking about how we might optimize that particular case.


Ahhhhh, search results makes a whole lot more sense! Thank you. Search can't be cached and the people who want to use your search functionality as a high availability API endpoint use different IP addresses to get around rate limiting.

The low millions of products also makes some sense I suppose, but it's hard to imagine why this doesn't simply require a login for the customer to see the products if they're unique to each customer.

On the other hand, I suspect the price this company is paying to mitigate scrapers is akin to a drop of water in the ocean, no? As a percent of the development budget it might seem high and therefore seem big to the developer, but I suspect the CEO of the company doesn't even know that scrapers are scraping the site. Maybe I'm wrong.

Thanks again for the multiple explanations in any case, it opened my eyes to a way scrapers could be problematic that I hadn't thought about.


Good explanation, thank you.

I would think that artificially slowing down search results can discourage some of the bots. Humans don't care much if a search finishes in 5 seconds and not 2, AFAIK.

Especially on backends where each request is relatively cheap operations-wise (especially when each request is a green thread, like in Erlang/Elixir), I think you can score a win against the bots.

Have you attempted something like this?


This is really interesting but they’re using a network of bots already - even if you put a spinner that makes them wait a couple seconds the scrapers would just make more parallel requests no?


Yes, they absolutely will, but that's the strength of certain runtimes: green threads (i.e. HTTP request/response sessions in this case) cost almost nothing so you can hold onto 5-10 million of them on a VPS with 16-32 GB RAM, easily.

I haven't had to defend against extensive bot scraping operations -- only against simpler ones -- but I've utilized such a practice in my admittedly much more limited experience, and was actually successful. Not that the bots gave up but their authors realized they can't accelerate the process of scraping data so they dialed down their instances, likely to save money from their own hosting bills. Win-win.

Apologies, I don't mean to lecture you, just sharing a small piece of experience. Granted, that's very specific to the backend tech, but what the heck, maybe you'll find the tidbit valuable.


If you've got a site with a lot of pages, bot traffic can get pretty big. Things like a shopping site with a large number of products, a travel site with pages for hotels and things to do, something to do with movies or tv shows and actors, basically anything with a large catalog will drive a lot of bot traffic.

It's been forever since I worked at Yahoo Travel, but bot traffic was significant then; I'd guess roughly 5-10% of the traffic was declared bots, but Yandex and Baidu weren't aggressive crawlers yet, so I wouldn't be terribly surprised if a site with a large catalog that wasn't top 3 with humans would have a majority of traffic as bots. For the most part, we didn't have availability issues as a result of bot traffic, but every once in a while, a bot would really ramp up traffic and cause issues, and we would have to carefully design our list interfaces to avoid bots crawling through a lot of different views of the same list (while also trying to make sure they saw everything in the list). Humans may very well want to have all the narrowing options, but it's not really helpful to expose hotels near Las Vegas starting with the letter M that don't have pools to Google.


I appreciate the response but I’m still perplexed. It’s not about the percent of traffic if that traffic is cached. And rate limiting also prevents any problems. It just doesn’t seem plausible that scrapers are going to DDoS a site per the original comment. I suppose you’d get bad traffic reports and other problems like log noise, but claiming it to be a general form of DDoS really does sound like hyperbole.


> a very unique type of site if bots make up the majority of traffic

Pretty much Twitter and the majority of such websites.


Do you really believe bots make up a significant amount of Twitter’s operating cost? Like I said they’re just accessing cached tweets and are rate limited. How can the bot usage possibly be more than a small part of twitter’s operating cost?


Bandwidth isn't free.


I didn’t say it is free, I said that the bandwidth for bots is negligible compared to that of regular users.


Negligible isn’t free either.


I'm sympathetic to this. I built a search engine for my senior project and my half-baked scraper ended up taking down Duke Law's site during their registration period. Ended up getting a not-so-kindly-worded email from them, but honestly this wasn't an especially hard problem to solve. All of my traffic was coming from the cluster that was on my university's subnet; it wouldn't have been that hard for them to add IP address timeouts when my crawler started scraping thousands of pages a second on their site. Not to victim blame, this was totally my fault, but I was a bit surprised that they hadn't experienced this before with how much automated scraping goes on.


I’m honestly more interested in bot detection than anything else at this point.

It seems like it should be perfectly legal to detect and then hold the connection open for a long period of time without giving a useful response. Or even send highly compressed gzip responses designed to fill their drives.

Legal or not, I can’t see any good reason that we can’t make it painful.
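
A tarpit for flagged requests is one way to do the "hold the connection open" part; a rough sketch, where is_probably_bot() stands in for whatever detection you already have and the numbers are arbitrary:

    # Sketch: for requests flagged as bots, drip a useless response very
    # slowly to tie up the client. is_probably_bot() is a placeholder for
    # real detection logic.
    import time

    from flask import Flask, Response, request

    app = Flask(__name__)

    def is_probably_bot(req):
        # Placeholder heuristic; real detection would be much smarter.
        return "python-requests" in req.headers.get("User-Agent", "").lower()

    @app.route("/products/<sku>")
    def product(sku):
        if is_probably_bot(request):
            def drip():
                for _ in range(600):    # ~10 minutes of next to nothing
                    yield b" "          # one byte at a time
                    time.sleep(1)
            return Response(drip(), mimetype="text/html")
        return f"Real product page for {sku}"

The catch is that with synchronous workers a tarpit ties up your own threads too, so it's better done at the edge or on an async server.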


Make it painful if they abuse the site.

We all benefit from open data. Polite scrapers are just fine and a natural part of the web ecosystem.

Google has been scraping the web all day every day for decades now.


The court just ruled that scraping on its own isn't a violation of the CFAA. Meaning it doesn't count as the crime of "accessing a protected computer without authorization or exceeding authorized access and obtaining information".

However, presumably all the other provisions of the CFAA still apply, so if your scraping damages the functioning of an internet service then you still would have committed the crime of "damaging a protected computer by intentional access". Negligently damaging a protected computer is punishable by 1 year in prison on the first offense. Recklessly damaging a protected computer is punishable by 1-5 years on the first offense. And intentionally damaging a protected computer is punishable by 1-10 years for the first offense. These penalties can go up to 20 years for repeated offenses.


As someone that has been on the other end, I can tell you devs don’t want to use selenium or inspect requests to reverse engineer your UI and wish there were more clean APIs.

Have you tried making your UI more challenging to scrape and adding a simple API that requires free registration?


So much this!

I work in e-commerce and (needless to say) we scrape a lot of websites. Due to our growth and the increase in scrapers we require, I've been writing a proposal to a higher-up to talk to our biggest competitors about all setting up a public API that batches the data into a smaller number of requests.

It would save everyone quite some traffic and effort.


That's what rate-limiting is for. Don't be so aggressive with it that you start hitting the faster visitors, however, or they may soon go somewhere else (has happened to me a few times).


Rate limiting isn't an effective defense for us.

First, as a B2B site, many of our users from a given customer (and with huge customers, that can be many) come through the same proxy server, effectively presenting to us as the same IP.

Second, years back the bots became much more sophisticated than a single, or even relatively finite, set of IPs. Today they work across AWS, Azure, GCP, and other cloud services. So the IPs that they're assigned today will be different tomorrow. Worse, the IPs that they're assigned today may well be used by a real customer tomorrow.


If your users are logged in you can rate limit by user instead of by IP. This mostly solves this problem. Generally what I do is for logged in users I rate limit by user, then for not-logged-in users I rate limit aggressively by IP. If they hit the limit the message lets them know that they can get around it by logging in. Of course this depends on user accounts having some sort of cost to create. I've never actually implemented it but considered having only users who have made at least one purchase bypass the IP limit or otherwise get a bigger rate limit.
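
A minimal sketch of that tiering, with in-memory counters purely for illustration (a real setup would use Redis or the web server's own limiter, and the window/limit numbers are arbitrary):

    # Sketch of tiered rate limiting: logged-in users are limited per account,
    # anonymous users are limited (more aggressively) per IP.
    import time
    from collections import defaultdict

    from flask import Flask, abort, request, session

    app = Flask(__name__)
    app.secret_key = "change-me"            # placeholder

    WINDOW = 60                             # seconds
    LIMITS = {"user": 600, "ip": 60}        # example requests per window

    hits = defaultdict(list)                # key -> recent request timestamps

    @app.before_request
    def rate_limit():
        user_id = session.get("user_id")
        key = ("user", user_id) if user_id else ("ip", request.remote_addr)
        now = time.time()
        recent = [t for t in hits[key] if now - t < WINDOW]
        recent.append(now)
        hits[key] = recent
        if len(recent) > LIMITS[key[0]]:
            abort(429, "Rate limit exceeded; log in for a higher limit.")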


Have you tried including the recaptcha v3 library and looking at the distribution of scores? -- https://developers.google.com/recaptcha/docs/v3 -- "reCAPTCHA v3 returns a score for each request without user friction"

It obviously depends on how motivated the scrapers are (i.e. whether their headless browsers are actually headless, and/or doing everything they can to not appear headless, whether Google has caught on to their latest tricks etc. etc.) but it would at least be interesting to look at the score distribution and then see whether you can cut off or slow down the < 0.3 scoring requests (or redirect them to your API docs)
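
Checking the score is just one server-side call per token; roughly like this (the secret key and the 0.3 cutoff are placeholders):

    # Sketch: verify a reCAPTCHA v3 token server-side and read the score.
    import requests

    RECAPTCHA_SECRET = "your-secret-key"    # placeholder
    VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"

    def recaptcha_score(token, remote_ip):
        resp = requests.post(
            VERIFY_URL,
            data={"secret": RECAPTCHA_SECRET, "response": token,
                  "remoteip": remote_ip},
            timeout=5,
        )
        data = resp.json()
        # 'score' runs from 0.0 (likely bot) to 1.0 (likely human).
        return data.get("score", 0.0) if data.get("success") else 0.0

A sensible first step is just logging every score to study the distribution before you start slowing down or redirecting anything below ~0.3.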


For web scraping specifically, I’ve developed key parts of commercial systems to automatically bypass reCAPTCHA, Arkose Labs (Fun Captcha), etc.

If someone dedicated themselves to it, there’s a lot more that these solutions could be doing to distinguish between humans and bots, but it requires true specialized talent and larger expenses.

Also, for a handful of the companies which make the most popular captcha solutions, I don’t think the incentives align properly to fully segregate human and bot traffic at this time.

I think we're still very much picking at the lowest-hanging fruit, both for anti-bot countermeasures and anti-anti-bot counter-countermeasures.

Personally I believe this will finally accelerate once AIs can play computer games via a camera, keyboard, and mouse, and when successors to GPT-3 / PaLM can participate well in niche discussion forums like HackerNews or the Discord server for Rust.

Until then it’s mainly a cost filter or confidence modification. As long as enough bots are blocked so that the ones which remain are technically competent enough to not stress the servers, most companies don’t care. And as long as the businesses deploying reCAPTCHA are reasonably confident that most of the views they get are humans (even if that belief is false), Google doesn’t have a strong incentive to improve the system.

Reddit doesn’t seem to care much either. As long as the bots which participate are “good enough”, it drives engagement metrics and increases revenue.


Scrapers can pay a commercial service to Mechanical Turk their way through reCAPTCHA. It makes a meaningful difference to scraping costs at scale, but sometimes it's still profitable.


I'd pay for a service to do this for me as an ordinary end user, so I never have to solve a captcha myself again.


You would still have to wait for each captcha to be solved, which might be more frustrating than doing it yourself.


It sounds great, until you have Chinese customers. That’s when you’ll figure out Recaptcha just doesn’t really work in China, and have to begrudgingly ditch it altogether…


Do you know if there's a way to rate limit logged-in users differently than visitors of a site?


Rate limiting can be a double-edged sword; you can be better off giving a scraper the highest bandwidth so they are gone sooner. Otherwise, something like making a zip or other sort of compilation of the site available may be an option.

Just what kind of scraper you have is a concern:

does the scraper just want a bunch of stock images;

or does the scraper have FOMO on web trinkets;

or does the scraper want to mirror/impersonate your site?

The last option is the most concerning, because then either:

the scraper is mirroring because your site is cool and a local UI/UX is wanted;

or the scraper is phishing, smishing, or otherwise duping your users.


Yeah, good points to consider. I think the sites that would be scraped the most would be those where the data is regularly and reliably up-to-date, and there's a large volume of it at that; so not just one scraper but many different parties may, on a daily or weekly basis, try to scrape every page.

I feel that ruling should have the caveat that if a fairly priced API for getting the publicly listed data is offered (say, no more than 5% above the CPU/bandwidth/etc. cost of the scraping behaviour), then the scrapers must legally use it; ideally also a rule that, at minimum, there be a delay if they are republishing that data without your permission, so at least you, as the platform/source/reason for the data being up-to-date, aren't harmed too. Otherwise it may kill the source platform over time if regular visitors start going to the competitor publishing the data.


Absolutely, you just have to check the session cookie.


nginx can be set up to do that using the session cookie.
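
Something along these lines, assuming the session cookie is literally named "sessionid" (the zone size and rates are placeholders, and the map / limit_req_zone directives belong in the http context):

    # Sketch: rate limit by session cookie when present, otherwise by IP.
    map $cookie_sessionid $rl_key {
        ""      $binary_remote_addr;    # anonymous: key on client IP
        default $cookie_sessionid;      # logged in: key on session cookie
    }

    limit_req_zone $rl_key zone=per_client:10m rate=10r/s;

    server {
        location / {
            limit_req zone=per_client burst=20 nodelay;
            proxy_pass http://app_backend;
        }
    }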


The problem with many sites (and LinkedIn in particular) is that they whitelist a bunch of specific websites, presumably based on the business interests, but disallow everyone else in their robots.txt. You should either allow all scrapers that respect certain load requirements or allow none. Anything that Google is allowed to see and include in their search results should be fair game.

Here's the end of LinkedIn's robots.txt:

    User-agent: *
    Disallow: /

    # Notice: If you would like to crawl LinkedIn,
    # please email whitelist-crawl@linkedin.com to apply
    # for white listing.


And this is what the HiQ case hinged on. LinkedIn were essentially selectively applying the computer fraud and abuse act based on their business interests - that was never going to sit well with judges.


Btw, LinkedIn does have an API for things like Sales Navigator. You get into it through some weird partnership program (SNAP), and it starts at (I think) $1,500/year per user. Still pretty cheap though; I think you'd get the value out of that quite quickly for a >300-person company.


> Even when it's not as big as a DOS, it doesn't seem right to me that I can't tell people they're not welcome to cause this additional system load.

You can tell them. You just can't prosecute them if they don't obey.


> While I have sympathy for what the scrapers are trying to do in many cases, it bothers me that this doesn't seem to address what happens when badly-behaved scrapers cause, in effect, a DOS on the site.

Like when Aaron Swartz spent months hammering JSTOR, causing it to become so slow it was almost unusable, and despite knowing that he was causing widespread problems (including the eventual banning of MIT's entire IP range) actually worked to add additional laptops and improve his scraping speed... all the while going out of his way to subvert MIT's netops group trying to figure out where he was on the network.

JSTOR, by the way, is a non-profit that provides aggregate access to their cataloged archive of journals, for schools and libraries to access journals they would otherwise never be able to afford. In many cases, free access.


The effect on JSTOR's revenue would have been negligible.

I'm surprised to see someone so cold and unfeeling about Aaron Swartz. Especially considering the massive injustice with regards to application of the law and sentencing.

> Federal prosecutors, led by Carmen Ortiz, later charged him with two counts of wire fraud and eleven violations of the Computer Fraud and Abuse Act, carrying a cumulative maximum penalty of $1 million in fines, 35 years in prison, asset forfeiture, restitution, and supervised release.


You probably can, at the protocol level, with JSON-LD or other rich-data packages that generate XML or standardized JSON endpoints. I did this for an open data portal, and this is something most G7 governments do with their federal open data portals using off-the-shelf packages (which are worth researching a bit first, obviously), particularly in the Python and Flask world. We were still getting hammered by China at our Taiwanese-language subdomain, but that was a different concern.


I don't know what kind of data you serve up, but perhaps you could serve low-quality or inaccurate content from addresses that are guessed from your API. I.e., endpoints not reachable in the normal functioning of your web app should return plausible junk. A mixture of accurate and inaccurate data becomes worthless for bots, and worthless data is not worth scraping. Just an idea!


But don't you already have countermeasures to deter DoS attacks or malicious human users (what if someone pays or convinces people to open your site and press F5 repeatedly)?

If not, you should, and the badly-behaved scrapers are actually a good wake-up call.


What have you done to protect the site? Most automation libraries are detectable (puppeteer, even with extra-stealth, selenium, playwright...).

The only library that I know of that is more or less undetectable is used by just a few hundred people...


Colour me interested in this library! :)


Curious, what library is that?


When the original ruling in favor of HiQ came out, it still allowed for LinkedIn to block certain kinds of malicious scraping. LinkedIn had been specifically blocking HiQ, and was ordered to stop doing that.


Implement TLS fingerprinting on your server. People can still fake that if they are determined, but it should cut the abuse way down.


No, nor can we just do it by IP. The bots are MUCH more sophisticated than that. More often than not, it's a cooperating distributed net of hundreds of bots, coming from multiple AWS, Azure, and GCP addresses. So they can pop up anywhere, and that IP could wind up being a real customer next week. And they're only recognizable as a botnet with sophisticated logic looking at the gestalt of web logs.

We do use a 3rd party service to help with this - but that on its own is imposing a 5- to 6-digit annual expense on our business.


> Our annual revenue from the site would put us on the list of top 100 ecommerce sites

and you're sweating a 5- to 6- digit annual expense?

> all our pricing is custom as negotiated with each customer.

> there's a huge number of products (low millions) and many thousands of distinct catalogs

Surely the business model where every customer has individually negotiated pricing costs a whole lot to implement; further, it gives each customer plenty of incentive to attempt to learn what other customers are paying for the same products. Given the tiny costs of fighting bots in comparison, your complaints in these threads seem pretty ridiculous.


> More often than not, it's a cooperating distributed net of hundreds of bots, coming from multiple AWS, Azure, and GCP addresses.

those are only the low-effort/cheap ones; the more advanced scraping makes use of residential proxies (people's pwned home routers, or where they've installed shady VPN software on their PC that turns them into a proxy) to appear to come from legitimate residential last-mile broadband netblocks belonging to Comcast, Verizon, etc.

google "residential proxies for sale" for the tip of an iceberg of a bunch of shady grey market shit.


There's a lot of metadata available for IPs, and that metadata can be used to aggregate clusters of IPs, and that in turn can be datamined for trending activity, which can be used to sift out abusive activity from normal browsing.

If you're dropping 6 figs annually on this and it's still frustrating, I'd be interested in talking with you. I built an abuse prediction system out of this approach for a small company a few years back, it worked well and it'd be cool to revisit the problem.
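
For a rough flavor of the approach (nothing like a full prediction system): roll client IPs up to their network owner (ASN) and look for networks whose request volume is wildly out of proportion. This sketch assumes a local copy of MaxMind's GeoLite2-ASN database, an iterable of (ip, path) log records, and an arbitrary 50x-the-median threshold:

    # Sketch: aggregate request counts per ASN and flag outlier networks.
    from collections import Counter

    import geoip2.database   # pip install geoip2
    import geoip2.errors

    def asns_by_volume(records, mmdb_path="GeoLite2-ASN.mmdb"):
        counts = Counter()
        with geoip2.database.Reader(mmdb_path) as reader:
            for ip, _path in records:
                try:
                    asn = reader.asn(ip)
                    counts[(asn.autonomous_system_number,
                            asn.autonomous_system_organization)] += 1
                except geoip2.errors.AddressNotFoundError:
                    counts[("unknown", "unknown")] += 1
        return counts

    def suspicious_asns(counts, factor=50):
        # Arbitrary heuristic: flag ASNs sending 50x the median ASN's traffic.
        volumes = sorted(counts.values())
        median = volumes[len(volumes) // 2] if volumes else 0
        return {k: v for k, v in counts.items() if median and v > factor * median}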


Have you considered setting up an API to allow the bots to get what they want without hammering your front-end servers?


Yes. And if I could get the perpetrators to raise their hands so I could work out an API for them, it would be the path of least resistance. But they take great pains to be anonymous, although I know from circumstantial evidence that at least a good chunk of it is various competitors (or services acting on behalf of competitors) scraping price data.

IANAL, but I also wonder if, given that I'd be designing something specifically for competitors to query our prices in order to adjust their own prices, this would constitute some form of illegal collusion.


What seems to actually work is to identify the bots and instead of giving up your hand by blocking them, to quietly poison the data. Critically, it needs to be subtle enough that it's not immediately obvious the data is manipulated. It should look like a plausible response, only with some random changes.
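
A sketch of the "plausible but wrong" idea for price data, where client_is_suspect comes from whatever bot detection you trust and the jitter range is arbitrary:

    # Sketch: serve suspected bots prices that look plausible but are quietly
    # jittered, instead of blocking them outright.
    import random

    def displayed_price(true_price, client_is_suspect):
        if not client_is_suspect:
            return true_price
        # Plausible-looking noise: +/- up to ~7%, rounded like a real price.
        jitter = random.uniform(-0.07, 0.07)
        return round(true_price * (1 + jitter), 2)

The hard part is keeping the changes small enough that the scraper's own sanity checks never notice.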


What makes you think they would use it?


It's in their interest. I've scraped a lot, and it's not easy to build a reliable process on. Why parse a human interface when there's an application interface available?


TLS fingerprinting is one of the ways minority browsers and OS setups get unfairly excluded. I have an intense hatred of Cloudflare for popularising that. Yes, there are ways around it, but I still don't think I should have to fight to use the user-agent I want.


I don't want to say tough cookies, but if the OP's characterization isn't hyperbole ("the lion's share of our operational costs are from needing to scale to handle the huge amount of bot traffic"), then it can be a situation where you have to choose between 1) cutting off a huge chunk of bots, upsetting a tiny percent of users, and improving the service for everyone else, or 2) simply not providing the service at all due to costs.


I don't think it's likely to cause issues if implemented properly. Realistically you can't really build a list of "good" TLS fingerprints because there are a lot of different browser/device combinations, so in my experience most sites usually just block "bad" ones known to belong to popular request libraries and such.


Seems like you could sue the scraper for that, then? If they cause you damages by their unapproved actions, you have a tort claim.


Yes, I think working to accommodate the non-humans along with the humans is the right approach here.

Scrapers have a limited range of IPs, so rate-limiting them and stalling (or dropping) request responses is one way to deal with the DoS scenario.

For my sites, I have placed the majority behind HTTP Basic Auth...


You realistically can't. There are services like [0][1] that mean any IP could be a scraper node.

[0] https://brightdata.com/proxy-types/residential-proxies [1] https://oxylabs.io/products/residential-proxy-pool


> How does Bright Data acquire its residential IPs?

> Bright Data has built a unique consumer IP model by which all involved parties are fairly compensated for their voluntary participation. App owners install a unique Software Development Kit (SDK) to their applications and receive monthly remuneration based on the number of users who opt-in. App users can voluntarily opt-in and are compensated through an ad-free user experience or enjoy an upgraded version of the app they are using for free. These consumers or ‘peers’ serve as the basis of our network and can opt-out at any time. This model has brought into existence an unrivaled, first of its kind, ethically sound, and compliant network of real consumers.

I don't know how they can say with a straight face that this is 'ethically sound'. They have, essentially, created a botnet, but apparently because it's "AdTech" and the user "opts-in" (read: they click on random buttons until they hit one that makes the banner/ad go away) it's suddenly not malware.


NordVPN (Tesonet) has another business doing the same thing. They sell the IP addresses/bandwidth of their NordVPN customers to anyone who needs bulk mobile or residential IP addresses. That's right, installing their VPN software adds your IP address to a pool that NordVPN then resells. Xfinity/Comcast sort of pioneered this with their wifi routers that automatically expose an isolated wifi network called 'xfinity' (IIRC) whether you agree or not.


The Comcast access points do, at least, have the saving grace that they're on a separate network segment from the customer's hardware, and don't share an IP address or bandwidth/traffic limit with the customer.

Tesonet and other similar services (e.g. Luminati) don't have that. As far as anyone -- including web services, the ISP, or law enforcement -- are concerned, their traffic is the subscriber's traffic.


> They sell the IP addresses/bandwidth of their NordVPN customers to anyone who needs bulk mobile or residential IP addresses

I would be interested in a reference for this if you have one.


As others have said: (A) there are plenty of countermeasures you can take, but also (B) you are frustrated that you are providing something free to the public and then annoyed that the "wrong" customers are using your product and costing you money. I'm sorry, but this is a failure of your business model.

If we were to analogize this to a non-internet example: (1) A company throws a free concert/event and believes they will make money by alcohol sales. (2) A bunch of sober/non-drinking folks attend the concert but only drink water (3) Company blames the concert attendees for "taking advantage" of them when they really just had poor company policies and a bad business model.

Put things behind authentication and authorization. Add a paywall. Implement DDOS and detection and banning approaches for scrapers. Etc etc.

But don't make something public and then get mad at THE PUBLIC for using it. Behind that machine is a person, who happens to be a member of the public.


Alternatively it could be seen that your juice company offers free samples. Then somebody abuses free and takes gallons home with them to bottle and sell as their own.

That’s what it feels like when someone is scraping your network to bootstrap a competitor.


Again, what you call abuse of free samples, someone else calls a savvy strategy tailored to your poorly crafted business plan. Have ways to limit the free samples, or else it's your fault...


There are certain classes of websites where the proposed solutions aren’t a great fit. For example, a shopping site hiding their catalog behind paywalls or authentication would raise barriers to entry such that a lot of genuine customers would be lost. I don’t think the business model is in general to be blamed here and it’s ok to acknowledge the unfortunate overhead and costs added by site usage patterns (e.g. scraping) that are counter to the expectation.


Have you considered using a cache service like cloudflare?


You could ban their IPs?


IP bans are equivalent to residential door locks. They’re only deterring the most trivial attacks.

In school I needed to scrape a few hundred thousand pages of a proteomics database website. For some reason you had to view each entry one at a time. There was IP throttling which banned you if you made requests too quickly. But slowing the script to 1 request per second would have taken days to scrape the site. So I paid <$5 for a list of 500 proxy servers and distributed it, completing the task in under half an hour.
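Roughly, the distribution step looked like the sketch below (the database URL, proxy file, and ID range are placeholders, and the real script of course also parsed each page; error handling and retries omitted):

    import concurrent.futures
    import os
    import requests

    # Hypothetical target and proxy list -- substitute your own.
    BASE_URL = "https://proteome-db.example.org/entry/{}"
    PROXIES = [line.strip() for line in open("proxies.txt") if line.strip()]
    ENTRY_IDS = range(1, 300_001)

    os.makedirs("entries", exist_ok=True)

    def fetch(job):
        entry_id, proxy = job
        # Route each request through a different proxy so no single IP trips the throttle.
        resp = requests.get(
            BASE_URL.format(entry_id),
            proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
            timeout=30,
        )
        resp.raise_for_status()
        return entry_id, resp.text

    # Round-robin the entry IDs over the proxy pool, many requests in flight at once.
    jobs = [(eid, PROXIES[i % len(PROXIES)]) for i, eid in enumerate(ENTRY_IDS)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=200) as pool:
        for entry_id, html in pool.map(fetch, jobs):
            with open(f"entries/{entry_id}.html", "w") as f:
                f.write(html)

Rotating the pool like this keeps any single IP well under the throttle, which is exactly why per-IP bans alone don't hold up.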


I agree it’s not perfect. It’s also significantly better than nothing.


It's also completely insufficient despite being better. There are so many scraper services that it's just a matter of paying a small amount of money to make use of tens of thousands of IPs spread across ISPs and countries.


Can you share where you got such a nice deal on 500 proxies? TIA


Using proxies to hide your identity to get around a denial of access seems to get awfully close to violating the Computer Fraud and Abuse Act (in the USA, at least).

I’m surprised your school was okay with it.


Don't worry, I don't live in the USA. Thanks for your concern though.


Have you considered serving a proof-of-work challenge to clients accessing your website? Minimal cost on legit users, but large costs on large-scale web-scraping operations, and it doesn't matter if they split up their efforts across a bunch of IP addresses - they're still going to have to do those computations.

https://en.wikipedia.org/wiki/Hashcash
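To make it concrete, a hashcash-style challenge is only a few lines. Here's a minimal sketch of the idea, not the exact Hashcash spec (the difficulty and the string format are made up):

    import hashlib
    import os

    DIFFICULTY_BITS = 20  # ~1M hash attempts on average per solve; tune to taste

    def make_challenge() -> str:
        # Server sends a random nonce with each page/form load.
        return os.urandom(16).hex()

    def solve(challenge: str) -> int:
        # Client brute-forces a counter until the hash has enough leading zero bits.
        target = 1 << (256 - DIFFICULTY_BITS)
        counter = 0
        while True:
            digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return counter
            counter += 1

    def verify(challenge: str, counter: int) -> bool:
        # Server checks the submitted solution with a single hash.
        digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
        return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY_BITS))

    token = make_challenge()
    proof = solve(token)        # cheap for one page view, expensive at scraper scale
    assert verify(token, proof)

The asymmetry is the whole point: the server verifies with one hash, while each client request costs on the order of 2^20 hashes -- negligible for a person loading a page, but it adds up fast across a million-page scrape.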


No thanks, as a user I would stay far away from such websites. This is akin to crypto miners. I don't need them to drive up my electricity costs and also contribute to global warming in the process. It's not worth the cost.


This is completely absurd - anti-spam PoW is not remotely comparable to crypto miners, and the electricity cost will be so far below the noise floor of owning a computer in the first place that you will literally not notice (and neither will the environment), unless website owners are completely insane and set up multi-second challenges (which, they won't).

And, it's absolutely worth the cost - as a website owner, you get to impose costs on botting operations with minimal penalties for normal users and minimal environmental impact. Bots work because the costs of renting an AWS server and scraping websites (or sending spam, whatever) are extremely tiny - adding PoW challenges to everything that could be spammed suddenly massively changes the cost of running those spam operations, and would result in noticeably less spam if deployed widely.

In fact, the net "environmental impact" would be negative, as botters start to shut down operations due to greatly increased operational costs.


You do PoW every time you send an email.


If most of your traffic is bots, is the site even worth running?

This really is akin to the question, “Should others be allowed to take my photo or try to talk to me in public?”

Of course the answer should be yes, the internet is the digital equivalent of a public space. If you make it accessible, anyone should be able to consume.

If you don’t want it scraped add auth!


Do you also believe someone running a drone to follow and photograph you, personally, wherever you go in public would be fair and legal?


Every time I hear news like this, I remember this local news in Japan [0].

An old bureaucratic company developed a system for a local library. It was very slow, so a user started scraping it and developed his own library system. A few weeks later, he was arrested.

The system was so shit that it became unusable with only hundreds of requests per hour. The local government filed a complaint, and the police arrested him for a "dangerous cyber attack."

[0] In Japanese: https://www.wikiwand.com/ja/%E5%B2%A1%E5%B4%8E%E5%B8%82%E7%A...


Isn't that literally a DOS attack?


"Hundreds per hour" is hardly a denial of service.

If we assume the very high end of "hundreds", 900, it's still 15 requests per minute. Your system is a disgrace if it can't handle that kind of traffic. This incident was in 2010 so even that is not an excuse.


Both things can be true: the system probably was a disgrace, but also the scraper did, by definition, deny service - so it was a kind of accidental Denial of Service attack. Hopefully, given the intention and so on, in a good legal system this wouldn't be treated the same as a malicious DoS; but also the scraper shouldn't be used, as it denies service to other potential clients.


If the request load is high enough to bring down the system, it is literally a Denial Of Service.

Website admins don't owe you anything. If their website sucks, it sucks. That doesn't give you the right to take down their system.


I'm not sure that "hundreds of requests per hour" meets the criteria for a DDOS. Assuming 999 requests per hour, that's about one request every four seconds, which the vast majority of websites should be able to handle with no trouble at all.


DoS is different from DDoS. A DoS attack is any kind of attack in which the attacker seeks not to gain access to a system, but to simply deny others access. This can include sending large payloads, maintaining idle connections, causing crashes, logging out other users etc.

A DDoS is a particular kind of DoS attack - one in which the attack is conducted using a distributed set of machines, usually helping to (a) hide the attack from monitoring tools and (b) make it hard to block using IPs. Obviously this was not a DDoS, but could have looked like a plain DoS.

The definition of a DoS attack depends on the system being attacked; there is no hard and fast limit to how much traffic a system is supposed to handle. For a badly written system like the one described here, the actions of the scraper were, from the library's point of view, indistinguishable from those of a malicious actor. Hopefully, since the intent was not malicious in this case, things may have been cleared up - though of course it's also possible that, unfortunately, they haven't.


> A DoS attack is any kind of attack in which the attacker seeks not to gain access to a system, but to simply deny others access.

By your own definition, it was not a DoS attack, because the subject was not seeking to deny others access. And without the intent to do harm, one may not genuinely identify the incident as an attack. Requiring intent is an important part of your definition, as it protects innocent users of a buggy system from being classified as attackers whenever any downtime incident occurs.


Sure, but that's something that needs to be investigated by a court of law, it's not something you can automatically know as a system operator.

The ideal thing in this case may have been for the police to investigate, to conclude that the DoS was not done with malicious intent; but also to instruct the scraper that any further attempts to scrape the system in this way will be considered malicious.


I understand that some websites are poorly implemented and can't handle hundreds of requests. IANAL, but AFAIK whether it's a DOS attack or not partially depends on the scraper's intent.

In this case, after volunteer engineers investigated the system, it turned out that it created a DB connection for each request, kept it open for 10 minutes, and didn't reuse it or close it. So the number of DB connections quickly reached its limit and the system couldn't handle subsequent requests. Similar problems happened with other library systems developed by the same company.

The prosecutors dropped the charge citing his scraping was reasonable, and he didn't have any malicious intent.
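To make the failure mode concrete, here's a minimal sketch of the leak versus the fix, assuming a Postgres backend via psycopg2 (the actual system's stack isn't documented, so treat the names and limits as hypothetical):

    import psycopg2
    from psycopg2 import pool

    DSN = "dbname=library user=catalog"  # hypothetical connection string

    # Anti-pattern: a fresh connection per request that is never closed or reused.
    # With a cap of, say, 100 connections and a 10-minute idle lifetime, a few
    # hundred requests per hour is enough to exhaust the server's connection slots.
    def handle_request_buggy(query, params):
        conn = psycopg2.connect(DSN)
        cur = conn.cursor()
        cur.execute(query, params)
        return cur.fetchall()  # conn is leaked here

    # Fix: share a small pool and always return connections to it.
    POOL = pool.SimpleConnectionPool(minconn=1, maxconn=10, dsn=DSN)

    def handle_request(query, params):
        conn = POOL.getconn()
        try:
            with conn.cursor() as cur:
                cur.execute(query, params)
                return cur.fetchall()
        finally:
            POOL.putconn(conn)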


My website crashes as soon as it is visited. Don't visit it, or you will be arrested for DoSing my site.


Scraping is legal, but the legal framework requires scraping to abide by three rules:

1. The page is not behind a login. If you have to create an account to access the content, then you have to agree to abide by the terms of service which may ban scraping.

2. What you scrape is still protected by copyright. If you scrape a clip from Star Wars from a web page, you still can't redistribute it without a license.

3. Your activity may not impede access to the site in any way, i.e. no DDoSing. If there is a robots.txt file, you are supposed to abide by it, though the courts do not literally say that. You must rate-meter your requests, scrape during off-peak hours, and so on (see the sketch below). It's a gray area if the robots.txt bans every page and you end up having to ignore it.
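A sketch of what abiding by rule 3 usually looks like in practice, using Python's standard robots.txt parser (the site, user agent string, and one-second default delay are placeholders):

    import time
    import urllib.robotparser
    import requests

    SITE = "https://example.com"             # hypothetical target
    USER_AGENT = "my-research-scraper/1.0"   # identify yourself honestly

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{SITE}/robots.txt")
    rp.read()

    def polite_get(path, default_delay=1.0):
        url = f"{SITE}{path}"
        if not rp.can_fetch(USER_AGENT, url):
            return None                                    # respect robots.txt
        delay = rp.crawl_delay(USER_AGENT) or default_delay
        time.sleep(delay)                                  # rate-meter your requests
        return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)

    page = polite_get("/public/profile/123")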


The Supreme Court decision referenced in this ruling involved a case where the information accessed was protected by a login and accessing the information violated the terms of service. The court ruled that is not a violation of the CFAA. I fail to see why it would therefore be necessarily required that the page not be behind a login for scraping to be legal. There may be other issues regarding the type of information, or repeatedly creating accounts to skirt account terminations, which affect whether it is legal, but simply violating terms of service for accessing a page behind a login doesn't appear to make it illegal by itself.


This case was about users' public profiles. LinkedIn makes users' profiles accessible to the Googlebot and other public visitors without the need to log in. There was an additional wrinkle in that HiQ actually had a commercial agreement with LinkedIn for some time that explicitly allowed them to scrape. They were arguing that canceling this agreement was anticompetitive and would destroy their entire business. Because of that pre-existing agreement, they likely didn't dwell on the distinction between public and private, as they had already set the precedent of signing an agreement that authorized them in the past, so they can't now say they never needed one in the first place.


I'm not sure why you are referring back to the specifics of the case since your comment was about what you put forward as a general framework required for scraping to be legal and that was what my comment was regarding.


Since when are terms of service legally binding?


Legally binding =/= enforceable with penalties or illegal to break (excepting revocation of service). Breaking TOS is not a crime [0]

[0]: https://www.eff.org/deeplinks/2018/01/ninth-circuit-doubles-...


They never weren’t. After all, they’re contracts that you agree to when you check the box. Some clauses may be unenforceable, but voiding a clause does not necessarily void the whole contract.



I am not versed in law, especially not in US law, but this case seems to be very specific that scraping is not a violation of the CFAA. I do support this interpretation.

However, the case of scraping I personally find more problematic is the use of personal data I provide to one party, which is then used by scrapers without my knowledge or permission. I truly wonder which way we are better off on that issue as a society. Independent of the current law, should anything that is accessible be essentially free-for-all, or should there be limitations on what you are allowed to do? Cases highlighted in the article: facial recognition by third parties on social media profiles, Facebook scraping for personal data, search engines, journalists, or archives. (Not all need to have the same answer to the question "do we want this".) Besides that, the point I care slightly less about is the idea that allowing scraping with very lax limits leads to even more closed-off systems.


"I gave my data to Linkedin and now scrapers are reading it from the public web". Be mad at Linkedin before you are mad at the scraper.


I may edit and delete my information from LinkedIn, but I have no idea who has persisted that data beyond there.

There is such a thing as scraping responsibly and irresponsibly. Both kinds happen.


This seems to have various angles to it.

First, one question is whether the intent of the original owner of the data is important. When I put data on LinkedIn (or Facebook, my private website, Hacker News, or my employer's website) I might have an opinion on who gets to do what with my data (see also GDPR discussions). Should I blame LinkedIn (or Meta/myself/my employer) for doing what I expected them to do, or should I blame those that do what I don't want them to do? Should I just be blamed directly because I even want to make a distinction between the two? If I didn't want my data out there, I could just not provide it (or not participate in/surf the web at all, if we extend the idea to more general data collection).

Secondly, it touches on the idea that LinkedIn should not make the data publicly available (i.e., without authentication), and we end up with a less open system. Is that better? Is it what we want? Maybe there are also other ways that I am not aware of just now. (Competing purely on value added is probably futile for data aggregators.)


Your intent as the original owner of the data is important! You have to explicitly give Linkedin the right to display your data. It's in their Terms of Service. If Linkedin does something with your data that is outside the ToS, then that is on them, but if they do something within the ToS that you don't like then maybe you should not have provided them with your data.

As for whether the data should be public, that's a decision we each have to make.


serious question (ianal): If I write down some information, at what point does that information have copyright protection? Do I have to claim it with Ⓒ?


You might want to check out Feist v. Rural Telephone Service Co. (1991), and also look up the Berne Convention (on copyright), which the U.S. joined in 1989.

If by "information" you mean mere facts without creativity in selection or arrangement, those are generally not protectable by copyright in the United States, although possibly in some other countries. Copyright generally protects works of authorship, and nothing else. No creativity no copyright.


So for this case involving LI, I get my name and employer name isn't protected but there is a section where you describe your role. That is extremely creative, right? There are infinite ways similar work could be described. It isn't just a "fact". So they can't use that information it would seem.


Not the creative part, no. Someone could presumably use the non-creative stuff as far as copyright is concerned, although other means (notably shrink wrap licensing) are often attempted to restrict that.

I used to regularly see books at the university book store that were wrapped in literal shrink wrap licenses for that reason, which of course makes browsing difficult. And then there was a major case on the subject ProCD v. Zeidenberg (7th Cir 1996), where such a license was found valid and enforceable.


Never. "Mere listings" of data (like the phone book) are not copyrightable.

But also, anything you write which is copyrightable is copyrighted immediately. You can register the work w/ the copyright office for some extra perks, but it's not strictly necessary.


In Europe, the GDPR would protect that type of data, in principle. That is, even if you provided your profile picture to LinkedIn and made it public, this does not entitle a third party to copy that picture and use it for other purposes (though a third-party LinkedIn client would still be OK). It's still your data, even if you agreed to share it with LinkedIn.


Consider that scrapers may be far less interested in you as the individual than they are regarding your input into the aggregated data points of you and those like you.


The scraper in this case, HiQ, had a product that tracked employee profile changes to predict employee churn.

So they were specifically interested in you personally, not the aggregate.


That's the same argument for all major internet tracking cookies. I don't think that's going to convince this site's userbase.


This headline is horribly misleading. The Ninth Circuit said that web scraping is not a violation of the CFAA, but that other causes of action, such as trespass to chattels, breach of contract, misappropriation, may still apply. Just super irresponsible writing.

Still, a very good day for scraping.


This is great news. A win for the Internet Archive and other archivists.


IANAL but it's not immediately obvious to me that this ruling covers bulk scraping and republishing untransformed. I'm genuinely curious about this personally. I presumably can't just grab anything I feel like off the web, curate it, and sell it.


Why not? Isn't that how Google became one of the largest companies in the world?


Google became huge with the public by providing a way to search for data; and they became huge as a business by using this mediation to sell targeted ads.

It's only after they had become truly huge that they started actually serving data directly (news snippets, song lyrics, factual answers, maps, reviews etc).


I dunno what the distinction is... They clearly were scraping all data; in terms of display, they extracted pieces of publicly available websites to use as descriptions, titles, and images. They displayed cached versions of websites. So the amount of content you take from the larger body makes the distinction? But as you get larger those rules no longer apply? It doesn't make sense to me.


>So the amount of content you take of the larger body makes the distinction?

Of course. The amount of the original work is one of the key criteria on whether something is fair use or not. And there have been debates over whether Google crosses that line or not.

But no one except the most ardent anti-copyright maximalists would argue that, so long as something is available on a public web page, anyone can just reproduce it verbatim and distribute it on their own site. Scrapers do a lot of this anyway, but doing anything about it is a whack-a-mole game, so publications often don't bother.


Next up, we just need all public court records to be freely available to be scraped, and not $3 per page

https://patentlyo.com/patent/2020/08/court-pacer-should.html


Really the problem is that PACER has been turned into a cash cow for the federal court system, with fees and profits growing despite costs being virtually nil.

But yeah, the irony of the federal court system legalizing screen scraping, something PACER contractually prohibits.

Court records should be public record, not encumbered in terms of use/access/reproduction/copyright.


Now I wonder whether “retrieving” your own OAuth token from an app to make REST calls that extract your own data from cloud services is legal. It seems to fall under the same guideline, that exceeding authorization is not unauthorized access, so even though it's usually against the terms of service it doesn't violate CFAA?
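Mechanically it's nothing fancy -- you're just replaying the bearer token the app already holds. Something like this sketch, where the endpoint and token are placeholders for whatever the service actually uses:

    import requests

    # Token captured from the app's own traffic (e.g. via a local proxy) -- hypothetical.
    ACCESS_TOKEN = "eyJ..."
    API_BASE = "https://api.example-cloud.com/v1"   # hypothetical endpoint

    resp = requests.get(
        f"{API_BASE}/me/export",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    )
    resp.raise_for_status()
    my_data = resp.json()   # your own data, fetched with your own credentials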


I had no idea this was even being discussed. I'm glad they are reasonable on this. Wish they had been as reasonable on breaking encryption/DRM schemes.


If I have to stake out a binary position here, I’m pro scraping. But I really wish we could find a way to be more nuanced here. The scraper in question is looking at public LinkedIn profiles so that it can snitch to employers about which employees might be looking for new jobs. That’s not at all the same as archival; it’s using my data to harm me.


It's a public page. Your employer could just as well check your page themselves. It may be a tragedy of efficiency, but it's not like the scraper is grabbing hidden data. The issue is something else. Maybe it's the fact that your current employer would punish you for looking for a new job. Or maybe LinkedIn's public "looking for a job" status is not sustainable in its current form.


Yes, and if I have a public Twitter account it’s perfectly possible for someone to flood me with spam messages. That doesn’t mean we should do nothing to prevent it. As I said elsewhere, we should strive to make it possible for people to exist in public digital spaces without worrying about bad actors.


Someone can manually spam you, and I don't think that should be allowed. That is a separate topic and discussion. Unless you are arguing that your employer should not be allowed to check your LinkedIn status.


I’m just using it as an example of a case in which a public profile doesn’t automatically mean anything goes. I had hoped to generate discussion about how to throw out some of the bathwater without throwing out the baby too, but I guess no one is really interested.


Weev was charged, and eventually convicted, based merely on scraping from AT&T [0]. When the conviction was vacated, it was only on venue/jurisdiction grounds, not on the basis of the scraping being legal. Seems there's precedent that merely scraping this information is felonious behavior.

https://www.eff.org/deeplinks/2013/07/weevs-case-flawed-begi...


Yes, but it’s specifically using data you published to harm you. Compare Blind, which is engineered to not be attributable in this way.


I understand that I published it. That doesn’t mean I should accept that hostile parties will use it against me.

This is kinda like telling someone who is being harassed on social media that they’re consenting to it by having a public account. We should strive to make our digital personae safe from bad actors, not throw our hands up and say “if you put yourself out there, you have no recourse”.


While I don't condemn such snitching, I'm not sure how the responsibility doesn't fall on the social media user for publicly advertising their search or "professionally friending" prospective employers. Neither action is necessary.


Is it so difficult to imagine that we might write laws to prevent corporations from vacuuming up our data and selling it to be used against us?

Again, I am pro-scraping. I see the benefits. But I don’t think this is a situation in which we have to accept all or nothing.


For what it's worth, LinkedIn was incredibly easy to scrape back in the day, wrt profile/email correlation. I can't buy any aggressive stance they may have against 'scrapers'.

2 options.

Their LinkedIn IDs are base 12, and would redirect you if you simply enumerated them.

You could also upload your 'contacts', 200-300 at a time and it'd leak profile IDs (Twitter and Facebook mitigated this ~5 years ago). I still have a @pornhub or some such "contact" that I can't delete from testing this.


These anti-scraping corporate activists need to get with the times and allow access to their data, legitimately, through an API. Third parties will scrape and sell the data regardless, so why not just cut them out and even charge individuals to legitimately use the API? API keys could be tied to an individual, and at least LinkedIn would know who was conducting what action.

Make it easier to get the data through an API than having to scrape it.
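Something as simple as keyed, logged access would give them the attribution they say they want. A rough sketch (Flask here just for illustration; the key store and endpoint are made up):

    import logging
    from flask import Flask, request, abort, jsonify

    logging.basicConfig(level=logging.INFO)
    app = Flask(__name__)

    # Hypothetical key store: each key maps to an accountable individual.
    API_KEYS = {"k-123abc": "alice@example.com"}

    @app.route("/v1/profiles/<profile_id>")
    def get_profile(profile_id):
        owner = API_KEYS.get(request.headers.get("X-Api-Key"))
        if owner is None:
            abort(401)
        # Audit trail: who fetched what (the logger adds the timestamp).
        logging.info("profile %s fetched by %s", profile_id, owner)
        return jsonify({"id": profile_id})  # real data lookup omitted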


If the data is publicly available then it should be legal to scrape it and use it. LinkedIn can decide, if they want, to put all the data behind login. Problem solved.


I agree. To me this is like people complaining that someone is taking a photograph of their front yard... If you don't want something visible, don't display it to the world.


An interesting departure, considering weev was convicted on merely scraping [0] AT&T. Although his charge was vacated, it was on the venue/jurisdiction, not that scraping was found to be legal.

[0] https://www.eff.org/deeplinks/2013/07/weevs-case-flawed-begi...


Simple answer: require users to log in to view information, then apply captchas and a series of bans with increasing durations when they act like a scraper rather than a human. LinkedIn already does the behavior detection, captchas, and bans.
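The escalating-ban part is a few lines of bookkeeping; a sketch, with arbitrary thresholds and durations:

    import time
    from collections import defaultdict

    BAN_STEPS = [60, 3600, 86400, 30 * 86400]   # 1 min, 1 h, 1 day, 30 days
    strikes = defaultdict(int)                  # user_id -> number of prior bans
    banned_until = {}                           # user_id -> unix timestamp

    def record_scraper_like_behavior(user_id):
        # Called by whatever heuristic flags non-human access patterns.
        step = min(strikes[user_id], len(BAN_STEPS) - 1)
        banned_until[user_id] = time.time() + BAN_STEPS[step]
        strikes[user_id] += 1

    def is_banned(user_id):
        return time.time() < banned_until.get(user_id, 0)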


Does the ruling make it illegal to block scrapers?


Doubt it. And it should never be illegal to attempt to block scrapers. Many sites do try.


I'm (obviously) not a lawyer.

Does "publicly accessible" also apply to websites that are totally free, but require registration (which is open to anyone)?

And even though this is in the US, do cases like these set precedents for other western nations?

asking for a friend ...


AFAIK anything that binds you to a ToS (e.g. registration) is off-limits for legal scraping in the USA, as the USA can literally put you in prison for violating _terms of service_, which is pretty crazy. Other countries are much better about this, though you should always consult a lawyer if something you want to scrape is behind a login page.


Excellent. And the only ruling that makes any sense.

What's important here though is to confirm that this also can't somehow be waived by "Terms of Service". The law needs to clearly say that web scraping is legal and that sites can't just circumvent the law by agreement with users.

Even better if it could also clearly state that it includes any and all digital content and is not limited to (say) hypertext. I.e. if I as a user can reach content in any way, then I can also scrape/save/archive/whatever using the tech I want. E.g. I can save my youtube videos, spotify songs and so on.


Wondered how this works WRT copyright (since the article did not contain the word). Here's Kent State's (short) IPR advice [https://libguides.library.kent.edu/data-management/copyright]. It says "Data are considered 'facts' under U.S. law. They are not copyrightable.... Creative arrangement, annotation, or selection of data can be protected by copyright."


Wasn't it already the case (https://en.wikipedia.org/wiki/Web_scraping#United_States)?

That's why I usually recommend using US-based web scraping tools and services (such as https://webscraping.ai).

In other countries, it still may be a grey area.


Last week I got a call from some random recruiter. I was wondering how they'd gotten my telephone number, because I can't recall putting my phone number anywhere publicly. Apparently they scrape profiles from LinkedIn, then put those through some Google Chrome extension (should've asked which one), and it spills out phone numbers!

Contemplating if I should switch phone number now.


Lots of people here seem to believe that scraping should be legal because it is possible.

The flaw in that argument is that a law is never needed, by definition, to prevent things that can easily be avoided by technical means.

The laws against burglary protect your property and health but, more importantly, they allow us all to live in peace without investing vast amounts of resources into physical protection.


Great news for the real estate sector.

I actually built an open source project for scraping real estate websites several years ago: https://github.com/RealEstateWebTools/property_web_scraper

Might be time to go back and update it ;)


Here is the opinion for anyone who is interested

https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/1...


It has kinda always baffled me why some people think web scraping wouldn't/shouldn't be legal. I mean, the second you put something out on the internet, you should expect it to be public and, as a consequence of that, to be scraped.


Well, I can think of at least a few possible reasons. Not claiming those are right at all, but I do think they are the reasons why some people think/thought that scraping could be illegal.

1. It's hard to distinguish scrapers from DoS attackers, which we do expect to be illegal.

2. Depending on the use of the data after scraping, and the nature of the data being scraped, you may expect copyright laws and things like the GDPR to prevent any possible use of the data, so the very collection of it may be suspect.

3. Even for public data, a site may have Terms of Service that don't permit scraping, and some may expect to be able to enforce these.


I mean, it's public data... you scrape it with your eyes whenever you view a web page, and your computer scrapes it in the browser. Saying web scraping should be illegal is like saying taking photos of buildings should be illegal.


Taking a photo of a copyrighted photo (and distributing it) is illegal. You're just saying that anything that's technically possible should be legal, since there is no need to prohibit the technically impossible.


Can anyone tell me what this means for sites that proxy and display content from other sites? For example, if this were applied to a RSS reader (something like Google Reader), how would this ruling affect that service?


This makes me feel less like an information pirate. Not cool.


Where is the lobbying? Something seems wrong ...


Is it legal though to login to a website and then scrape data (that would not be accessible if I was just browsing as a guest)?


Plaid seems to think so


Interesting, as I've seen a few search engine startups that seem to scrape other search engines' results, depending on your definition of scraping. My definition would be a user agent that doesn't uniquely identify itself and isn't using an authorised API.


Technically, doing a copy-paste and crawling are the same thing. If copy/pasting something is legal, then crawling it should be legal too. If not, vice versa. In any case, it ends up being about copyright-related laws, I think.


So glad I caught this post. I need to write more scrapers.


Ok now do private data


So I don't have the right to not be scraped? That's like sending radio waves and making me pay a license for a radio I don't use. Same with spam: I should have to give you a token before you can send me emails, if I maybe want to look at your stuff.



