Web scraping for me, but not for thee (ericgoldman.org)
582 points by mhb on Aug 25, 2023 | 152 comments



Hmmm, I'm a bit confused on something. The HiQ vs LinkedIn case, to my knowledge, went through the following stages:

- LinkedIn sues HiQ, Ninth Circuit sides with HiQ

- LinkedIn pushes to Supreme Court, Supreme Court vacates citing Van Buren

- Ninth Circuit re-reviews and affirms their decision

- LinkedIn moves to get the injunction preventing them from blocking HiQ dissolved, which is granted

- A mixed judgement is finally issued in Nov 2022 ultimately resulting in a private settlement

Where exactly does this leave things? I feel like everyone loves to cite this case but never goes into the finer details.

Reading a summary of the mixed judgement from Nov 2022, it looks like maybe the issue came from HiQ using people to log in and thus the ToS came into play...? If I'm reading correctly, it looks like the court did eventually side with LinkedIn in stating that HiQ violated the ToS.

https://www.natlawreview.com/article/court-finds-hiq-breache...

Edit: Formatting.


Actually, as I re-read this, this is how it should go:

-hiQ sues LinkedIn for injunctive relief in the ND Cal., wins on its CFAA claim.

-LinkedIn appeals to 9th Circuit, which sides with hiQ on CFAA claim

-hiQ loses its antitrust claims at the motion to dismiss stage

(somewhere in here hiQ goes out of business, but rich benefactor keeps paying its legal bills)

-LinkedIn continues with breach of contract and other claims, wins at motion to dismiss

-LinkedIn appeals to the Supremes, who vacate and remand back to the 9th Circuit after Van Buren

-9th Circuit sides with hiQ a 2nd time on the CFAA claim

-injunction is dissolved

-hiQ suffers near-total defeat at summary judgment

-hiQ waves the white flag, agrees to a permanent injunction conceding nearly all of LinkedIn's demands, and pays LinkedIn $500k


So does that mean the conclusion is that scraping is bad? Or did the 9th circuit establish that scraping is OK and hiQ suffered defeat on some other basis?


Clarification: it's the breach of terms of service that meets this bar for breach of contract in this case


If this is the case, I guess what I'm really wondering is: does the existence of the EULA cause this, or was it HiQ having "turks" sign in to do it - thus accepting the EULA?


I think that's exactly it: they had automation logging in, which meant they accepted the ToS, then violated it. The govt has sided with business in these cases (it seems)


It seems to me they were defeated on the terms of some other contract they had with LI, not the scraping?


Thank you, that's certainly more in-depth than my understanding was. Very helpful to read.


Who's the rich benefactor?


Not a mixed judgment in Nov. 22. It was a massive defeat for hiQ Labs. Read the permanent injunction issued by the court.


Interesting. You appear to be a lawyer or in that realm, so I'm curious your take on it - though I also understand if you don't want to publicly make statements or anything.

i.e. is the common take that people have of "scraping is legal after HiQ vs LinkedIn" just completely wrong?

Edit: oh, I didn't realize you wrote quite a bit here: https://blog.ericgoldman.org/archives/2022/12/hello-youve-be...


> Read the permanent injunction issued by the court.

Happen to have a link?

The question that matters is whether this establishes any precedent.



This case never went to trial but it could have. The Court denied LinkedIn's motion for summary judgment as to hiQ's waiver and estoppel defenses to LinkedIn's breach of contract claims.

Calling this Order on the parties' motions for summary judgment "precedent" would be a mistake. Nor is the Consent Judgment and Permanent Injunction "precedent". The Ninth Circuit decision is precedent.

People in this thread are stating that hiQ was "defeated". Of course. However if "defeat" means a party settling, paying a large sum and agreeing to refrain from certain conduct in the future, then Google and Facebook have been "defeated" many times.

Having "web scraping" remain a "gray area" by limiting the number of final decisions and thereby the amount of precedent might be beneficial to so-called "tech" companies. Putting aside hiQ's predicament, if more of these cases went to trial instead of settling, then we might have some clarity.

We should be thanking whomever funded hiQ's litigation costs. Getting the Ninth Circuit decision was something every web user can be thankful for.


Thanks.

So you can't be extradited to the states or go to jail there as it doesn't violate the CFAA (as the supreme court sent it back)?

I guess $500k and no lawyer fees sounds like it isn't punitive, given it's, I assume, a decently sizeable company?

I'm wondering if we're looking at another MPAA/RIAA situation where they threaten 6 figure sums at individuals.

I lived through that, but this time it's not some 32kbps mp3s of metallica, it's just the entire future of human thought and power itself.

You never really know how a common law justice system is going to act.


I don't. I just have a .pdf. Email me at Kieran(at)McCarthyLG(dot)com if you want a copy.


What is the legal precedent of a mixed ruling? I was unaware such a thing was even possible.


> Mark Lemley observed this happening nearly 20 years ago, in his prescient, seminal article, “Terms of Use.”: The problem is that the shift from property law to contract law takes the job of defining the Web site owner’s rights out of the hands of the law and into the hands of the site owner.

With "contracts" of adhesion proliferating, and how impossible it has become to exist in the modern world without acceding to them (something as simple as buying a new SSD involves agreeing to one), this problem is getting worse by the day.

The law is becoming increasingly irrelevant, and more and more we are ruled by one-sided "contracts" from giant companies that are in a position to push them on us.


Well said. This reminds me of my own thoughts:

There are two ways of thinking about what a webpage is:

1) A web page is a billboard

2) A web page is a pamphlet

If a webpage is a billboard, then it is morally wrong for me to paint over those sections of the billboard that I do not like (i.e., using an ad-blocker). This viewpoint is held by those who own webpages (because they want control over it) and by those who cannot change what a webpage looks like (common users).

If a webpage is a pamphlet, then I'm free to cut it up and re-arrange it however I want. Naturally, those with knowledge to cut and re-arrange are more likely to take this view. This viewpoint is more technically correct, a webpage is just a few bits of information handed to me, and to the extent I control my own computer I can cut up those bits and view them however I want.

It's fair to say that Amazon.com contains Amazon's webpage, and that Amazon owns that web page. And yet, I've never once viewed Amazon.com without using an electronic device owned by myself or another non-Amazon entity. Amazon.com doesn't exist on a billboard, it requires the use of electronic devices owned by other people. What rights do the owners of those electronic devices have? Any? At what point do the pixels on my screen become your protected space?


For the case of the billboard, what happens if you viewed them thru glasses (let's suppose it's google glass), and you use software to blot out the parts of the billboard you don't like?

You're not painting over the billboard, just blocking the light reflected off it for yourself. It does not affect others.


To play devil's advocate, what about selling that device? What if it's integrated into an existing smartglasses platform?

What if you only block your competitor's ads? What if you replace them with your own? (..does Brave do this?)

What if you block all ads (yours, and your competitors), but in so doing, exploit the inertia of a consumer who is a.) already using your product, and b.) actively sheltered from exposure to the market?

ohh, i got it: just don't block ads for FAANG /s


Are you trying to make a point? None of this is any worse than ads in the first place. Why would it be ok to influence customers using ads but not by blocking ads?


Antitrust, but I see your point; we ought to put pro-consumerism first.


What device can you buy today with integrated ad blocking enabled by default?

Ad blockers having paid partnerships can be a problem, but something like uBlock Origin doesn't have them.


I'm gunna plug AdNauseam. I like it better than uBO or the ilk, because it clicks every ad it blocks. It completely breaks marketing platforms for me, i never get relevant ads when i do see them. According to marketer data, i want to see ads for everything. It also makes some websearch hilarious - i search for "microwave frequency QAM" and it's all ads for kenmore and GE countertop cookers. I avoid search engines that give crap data, because they're trying to get me to buy stuff, i just want search results, thanks.

It also keeps a handy "cost to advertisers" metric you can view. I'm over $40,000 at this point.


> If a webpage is a billboard, then it is morally wrong for me to paint over those sections of the billboard that I do not like (i.e., using an ad-blocker). This viewpoint is held by those who own webpages (because they want control over it) and by those who cannot change what a webpage looks like (common users).

Why is it morally wrong to paint over billboards?

But even if we accept that unfounded premise, the equivalent of ad blocking would be to hold something in front of your eyes blocking the billboard, since an ad blocker does not prevent other visitors from seeing the ads but paint does.


Sounds simple to me: Are you modifying the web page on the server or on the client? The former should be illegal, the latter legal.


> With "contracts" of adhesion proliferating, and how impossible it has become to exist in the modern world without acceding to them (something as simple as buying a new SSD involves agreeing to one), this problem is getting worse by the day.

The craziest example of this is how all these contracts are appearing in the physical world as well. There are stores that actually have a sign indicating that entering the store constitutes acceptance of contract terms (with a QR code that you presumably can scan with your phone to read the contract). I've also seen public parks with the same thing basically indicating that entry binds you to a legal agreement to not sue the park/follow posted rules/etc.


There’s a dead reply to your post saying that this occurs because of the insanely litigious nature of the USA. I think it’s worth highlighting — business/property owners are essentially trying to use contract law to route around the fact that the US legal system is broken with regards to civil litigation and throwing out bogus cases. For example, having a private pool in your own back yard can make you liable for someone else’s child breaking in and injuring themselves in your pool because you not having enough barriers to stop them means you allowed the access.


> the US legal system is broken with regards to civil litigation

And the problem with that has a lot to do with corporations. For instance, if you are a pedestrian and get hit by a car and end up in the hospital, in a lot of places in the USA your health insurance will not cover you at all -- you have to sue the driver and get compensated from their auto insurance. The logical approach would be for your health insurance to cover you and then recover the costs from the appropriate parties.

It is the same with ridiculous lawsuits like the aunt who sued her sister because the nephew jumped on her and threw out her back. In order to recoup medical costs she had to sue her sister since the sister had homeowner's insurance.

You can't entirely blame the legal system when the corporations are using it to perpetuate the problem for their own gains at the expense of everyone.


It's not either/or - the legal/legislative system is jointly at fault, for letting corporations run amok with generating endless complexity in the form of "terms" that nobody reads, understands, or actually agrees to. The big print of "health insurance" is that it covers medical expenses. From a basic legal perspective it should be impossible for any fine print to walk that back. From a consumer protection perspective, insurances sold to consumers should have to cover complete scopes that fulfill public policy goals.


I sometimes feel like insurance in general is one of the greater "hidden" evils in this world. They prey on your worries and fears, and then they try to do anything possible to weasel out of paying out when a fear becomes a reality.

Don't get me wrong, I think there is some good in it (mostly around risk calculations), but the entire industry feels scummy.


Note: if you think a dead reply is relevant, click on its timestamp and then on "vouch" to make it alive again.


And the litigiousness is downstream of having freakish medical expenses and no universal safety net. An accident can incur costs you could work your whole life to pay off, so of course there's a complex adversarial social system built around those consequences.


Fortunately this is mostly done for defense. A company isn't going to sue you for violating a 12 page contract you walked by, but they might ask you to leave and not come back. They probably won't mention the contract until you sue them for illegal discrimination, at which point they will refer to the contract and argue they were only enforcing their own posted rules and not illegally discriminating.


Public parks saying that is kind of strange because the city or town could already set the rules by enacting an ordinance. Presumably they could also delegate that authority to the parks department. I suppose Parks might be doing it because the city council or mayor isn’t enacting the ordinances they want.


It is partly due to the crazy litigation in the US and the rewards people get for some coffee they spill on themselves just because the coffee was "too hot". There are two sides to this story and it is not only the business' owners who are the culprits here.


Look up that case. The victim suffered third-degree burns and required skin grafts. The myth that it’s a ridiculous, over-litigious case is just that: a myth. The coffee was absolutely too hot.


And it was caused by a profit-seeking behavior - an assumption that most, if not all people, would wait to drink the coffee until after they were out of the vehicle, which could be up to X minutes; as such they wouldn't want cold or lukewarm coffee. If most people equated mcdonalds coffee with "bad lukewarm coffee", less people would go to mcdonalds for anything, including coffee. I forget how hot they served the coffee, but it was up around 95C/200F. I know that you have to specifically request "extra hot" at places like SB and CB&TL, otherwise you get coffee hot enough to still be "hot coffee" after adding creamer, i venture 130F or so. AFAIK McDs doesn't offer "extra hot" coffee. Once bitten, and all.

I don't know about the changes made due to this lawsuit, but i reckon they changed their cups to keep ~120F coffee at that temperature longer.

edit: aside: I just realized why i prefer Fahrenheit even though i understand Centigrade. Stuff over 100F is "hotter than i am" It's intuitive in a way that merely remembering 37 is body temp (my dad was born in '37 so that's why i remember) and "anything over 40 is hot" doesn't really roll off the tongue, even though it's roughly as accurate as my ">100F is hot"


What's needed to counter these is for customers to have their own contract of adhesion that simply says if the company is to accept them as a customer, then the company's own contract is null-and-void. This would be backed by a legal team in something like a customer's union or insurance that people would subscribe to for a monthly fee. This contract would be as enforceable (or not) as the company's, leveling the playing field. It would no longer matter what they put in their fine print since you wouldn't need to read it.

If a company doesn't accept the customer's contract or won't let you bypass their own, you walk away -- no sale. Other companies will get your business.


> you walk away -- no sale

which begs the question - what if you need the company's services, and they have no competitors (within reach of you at least)?

The fact of the matter is, the company has a higher bargaining position than the customer.


Yes, that's a great idea! Just hold on while I pay my attorney on retainer $10,000 to draw up and review that contract for enforceability and loopholes. I'm also depositing $50,000 in my "rainy day" savings account to pay another team of attorneys to litigate the contract each time it gets contested or broken. Thanks for the advice!

/s


Contractual law in the modern era regularly and persistently undermines private property rights. Mandatory arbitration clauses make it worse.


The perceived hypocrisy sort of goes away when you stop thinking about it as a collaboration or a community of equals, and instead think of it as a competition, which is what it is. You would not say of a football team "oh, it's okay for you to try to score a goal on me, but if I try to score a goal on you, suddenly you're blocking the ball?!"

Naturally, they're going to say "web scraping uses resources, stop it!" but then keep web scraping in the background.

To be clear, it's bad behavior, it's just not hypocritical behavior, as it's completely in keeping with what amoral corporations locked in constant battle would be expected to do: maximize benefits to themselves while minimizing benefits to others.


That's a very interesting comparison (thanks!), but I'm not sure if it's the correct framing. Making scraping technically difficult would be equivalent to trying to score a goal (so still not great, for the rest of the world, but probably not hypocritical).

Trying to prevent certain classes of behaviours via legal means is more like trying to prevent certain types of play, by appealing to the referee, while still doing them yourself. Clearly, this often does happen in sports, but _is_ generally seen as hypocritical.


For quite some time I've felt that "sports analogies" are overwhelmingly the BEST way to frame most microeconomic disputes. Much better than the far inferior Darwin-esque "Survival of the fittest" metaphors that imply some natural order to certain types of greed and bad behavior.

There's NOTHING natural about our economic systems. They're all COMPLETELY made up, let's treat them that way.

(and yes, here it is about 'lobbying the ref')


Sport in general is a cultural phenomenon and it seems that all cultural phenomena share a lot of similarities.

Genetics however is not only a useful model, it's hard science. You can experimentally find out whether some characteristic is e.g. Mendelian (I'd doubt the greed is, as normally defined).

It got me thinking that, to cross the two domains, there is also the meta-concept of cultural viruses ("memes") to which Dawkins applied the Darwinian model. Definitely not hard science, but they kind of counter your point that "there's NOTHING natural about our economic systems".


I mean, it's true that you might see phenomena that echo natural systems or things, but I suppose what I'm getting at is: unlike natural systems, the "rules" can be changed.

Honestly, what helped me a lot is: Economic systems are more like video games than "nature." Sure, video games can look like nature, but also are extremely malleable.


> Naturally, they're going to say "web scraping uses resources, stop it!"

that's the expected cost of publishing something to the public internet. People are going to access it. No one has a right to complain when people access something that was put there for the public to see. Scrapers can be dicks about it too, they can get lazy and endlessly hammer away at some server or repeatedly pull down the same content because they messed up, but we don't need litigation for that. If something rises to the level of DoS that's already covered under existing laws.

> it's completely in keeping with what amoral corporations locked in constant battle would be expected to do: maximize benefits to themselves while minimizing benefits to others.

Maybe we need to rethink giving some of these corporations the privilege of corporate personhood if they are just going to make things worse for everyone else while only enriching themselves. We don't need to allow parasites and pillagers to take whatever they want at our expense.


> Scrapers can be dicks about it too

It's not always about individual bad actors. You can have lots of small players causing problems. I wonder how many python developers there are right now trying to make their own offline copy of stackoverflow.com.

Wikipedia has a great defence against this. They ask you not to scrape, and at the same time, provide torrents of the data (https://meta.wikimedia.org/wiki/Data_dump_torrents)
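To make that concrete, here's a minimal sketch of the dump-instead-of-scrape approach in Python. It assumes the conventional "latest" English Wikipedia articles dump name on dumps.wikimedia.org; check the dump listing (or use the torrents linked above) before relying on the exact filename.

    # Fetch a published dump once instead of crawling millions of pages.
    # The URL is the conventional location of the latest English Wikipedia
    # articles dump at the time of writing; verify it before use.
    import shutil
    import urllib.request

    DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
                "enwiki-latest-pages-articles.xml.bz2")

    with urllib.request.urlopen(DUMP_URL) as resp, \
            open("enwiki-latest-pages-articles.xml.bz2", "wb") as out:
        shutil.copyfileobj(resp, out)  # stream to disk; the file is tens of GB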


Stack Exchange also provides one: https://archive.org/details/stackexchange

There was a hiccup around June but that seems resolved now: https://meta.stackexchange.com/questions/389922/june-2023-da...


yeah i thought this was the case. There's a couple of other sites that - i hesitate to state this completely factual phrase - leave dumps for the public to pick up.


> that's the expected cost of publishing something to the public internet. People are going to access it. No one has a right to complain when people access something that was put there for the public to see.

What is your opinion on spam email?


Spam is a problem, but a very different one. Spam is often malicious/harmful. Accessing a publicly accessible website is usually benign even if the content being accessed gets saved to disk.

Some people don't publicly publish their email address, they instead selectively give it out only to those they want to get email from, but their address gets leaked/sold and abused. Ultimately people who do publicly publish a contact address (email or even a physical mailing address) are basically on the hook for deciding what to do with whatever people send them.

The spam situation got out of hand pretty fast though. The only thing that kept email spam from reaching the level of a DoS attack were blacklists and server-side filtering, and even with those things (plus client-side filters) spam is still a huge problem today. Spam is just a much bigger problem than web scraping. Even the junkmail the mailman delivers to my door has an environmental cost that's much worse than the "harm" of a web scraper's http GET requests.

We have many alternative ways to contact each other online that aren't as vulnerable to spam, but for all of its shortcomings email continues to be widely used because at the end of the day people think giving strangers the ability to reach out to them uninvited is valuable. Anyone can set up a whitelist and trash everything that comes into their mailbox unless it's from an approved sender, but almost nobody does because they want to be more reachable than that.


That's like comparing web scraping with DDoS


There are some factors that might lead to this line of thought (a sketch of a crawler that sidesteps both follows below):

- it seems a lot of web developers / product managers either do not know or do not care about robots.txt

- some web applications are so badly optimized that they cannot handle more than 1 hit per second at a sustained rate, which admittedly has worked fine so far. But crawlers are persistent, so even normal crawling activity can cause a denial of service for regular users.
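Neither problem requires new law to avoid on the crawler side. A minimal sketch of a polite crawler, assuming a hypothetical target site and a self-imposed budget of one request per second (the base URL, user-agent string, and paths below are illustrative, not anyone's real endpoints):

    # Minimal "polite crawler" sketch: honor robots.txt and rate-limit
    # to at most one request per second. BASE, USER_AGENT and the paths
    # are hypothetical placeholders.
    import time
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urljoin

    BASE = "https://example.com"          # hypothetical site
    USER_AGENT = "my-hobby-crawler/0.1"   # identify yourself honestly
    DELAY_SECONDS = 1.0                   # sustained rate: <= 1 hit per second

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(BASE, "/robots.txt"))
    rp.read()

    def polite_get(path):
        url = urljoin(BASE, path)
        if not rp.can_fetch(USER_AGENT, url):
            return None  # robots.txt disallows this path for our agent; skip it
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        time.sleep(DELAY_SECONDS)  # back off so we never hammer the server
        return body

    for page in ["/", "/about", "/posts/1"]:
        html = polite_get(page)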


I hold the same view as the parent poster on public data online and my opinion on spam e-mail is that it's a consequence of naivete bordering on faulty design. It should have been set up with strong authentication (proof that this e-mail is from whom it says it is) and explicit consent (you can only message me if I allow you to message me). The latter could be as simple as rate-limiting e-mail from addresses which you have not explicitly allowed, or to which you have not previously sent letters, to one per week or so, with all such e-mails automatically going to the spam folder to be purged in a month.


Hypocrisy doesn't require one to believe what they say and utter it in good faith but fall short of those ideals in practice. Equivocating about football teams doesn't change that one is trying to impose standards on another without holding oneself equally to them. It is still hypocrisy, even if they do it amorally in bad faith. Especially then. What matters is what policy you espouse; you don't get a pass for not really believing what you say. The implication is that hypocrites are acting in bad faith.


The problem with that sort of "that's what amoral corporations do" reasoning is that corporations are permitted to exist because of the idea that they do contribute to the net public good. Once that idea is out the window, then there's no reason for society not to treat corporations as the hungry Lovecraftian nightmares they are and obliterate them with fire and steamship.


> football team

In football, the rules have been extensively tuned to promote a fair fight.

Perhaps we should do a bit more of that sort of thing in corporate law.


I think the difference is that defeating the other team is the point of sports, whereas at least ostensibly the law is supposed to provide a set of coherent rules for businesses to compete against each other. Trademarks are defined according to certain legal rules, and if you have one, this is how they provide you with a limited monopoly in a certain context. Allowing businesses to define property law through contracts lets people define the rules however they want. And that leads to irrational results.


> You would not say of a football team

You also would not let one football team buy up 60% of the other football teams in the world and merge them into one MEGA-TEAM.

But apparently we still cannot summon the willpower to enforce simple and straightforward 100+-year-old antitrust laws (so instead we make new ones and just won't enforce those either).


> You also would not let one football team buy up 60% of the other football teams in the world and merge them into one MEGA-TEAM.

Why not? If that mega-team wants to destroy their revenue stream (People want to watch competitive football, as opposed to curb-stomps), they are free to do that.


And you believe that MEGA-TEAM won't get bailed out? Because otherwise the bloated corpse of MEGA-TEAM will smother football as we know it?


What do you mean bailed-out?

The whole point of competitive sports is for there to be, well, meaningful competition. All the money in that industry only exists if there's meaningful chances for either team to win. Winning too hard is a detriment to the winner.

That is incredibly different from capitalism, where the whole point is for one of the participants to win the competition, to the detriment of the rest of us, which is why society comes up with elaborate rules for preventing that.


As the article states, the issue is with courts, not companies. We need a state to pass a law, similar to the weapons laws of other states, that guarantees a right to scrape. Then all the scraping companies set up shop in that state. If a service doesn't want their data scraped, they need to make sure it doesn't get sent into that state. Ideally a large enough state that companies wouldn't want to block.


Agreed that legal clarity is important - especially for smaller players. I've built a significant hobby site that relies fairly heavily on scraping (grocery price comparison site). I believe what I'm doing is morally okay, and also that big players wouldn't run into any issues, but when it's just me (or even if it was a small company) the legal 'grey area' makes it a much bigger risk.


I like what you’re saying, but how do you provide for the existence of evil or simply incompetent scrapers who drag the system down?


There are clearly more than two teams in these issues. It is not a game, and it is not football.

It is a public policy issue that also outpaces "competition", which is merely a subject change.


Football is metaphorical in this case.


The problem, as this article points out, is that democratically elected courts should not be choosing winners in a capitalistic competition.


Only commenting on how we should expect corporations to act, or more accurately why we should not be surprised at their behavior.


>Let’s look at what Microsoft is doing right now, as an example. In the last couple of weeks, Microsoft updated its general terms of use to prohibit scraping, harvesting, or similar extraction methods of its AI services. Also in the last couple of weeks, Microsoft affiliate OpenAI released a product called GPTbot, which is designed to scrape the entire internet. And while they don’t admit this publicly, OpenAI has almost certainly already scraped the entire non-authwalled-Internet and used it as training data for GPT-3, ChatGPT, and GPT-4. Nonetheless, without any obvious hints of irony, OpenAI’s own terms of use prohibits scraping.

I don't understand why this demonstrates hypocrisy. There is a big difference between crawling the publicly accessible web (which legitimate search engines do all the time) and scraping an authenticated web application or API.


The hypocrisy stems from:

1. OpenAI (etc.) scrape the public web to train and build their models.

2. They use these models to sell subscriptions (make money). None of this goes to the creators of the data used to train the models.

3. They deny others the ability to do what they have done themselves.

If you compare to, say, search engines scraping the public web:

1. Search engines scrape the public web to create their search indexes.

2. They use these indexes to provide search results and sell ads around them. Importantly, these search results direct people to the websites that they have scraped (much of the time), offering opportunity for them to make money.


If you publish something on the open web, you're making that content available to anyone who can send a GET request or use the Internet Archive. On the other hand, if you're making a private application where users create an account and agree to a terms of service, you're allowed to define what kinds of requests are or aren't legitimate. I still don't see any hypocrisy here because the two use cases are fundamentally different.


> If you publish something on the open web, you're making that content available to anyone who can send a GET request or use the Internet Archive.

What about "by continue using this site you agree to this TOS"?

What if you don't have rights for published content?

What if you make your content free and open source, but don't want a big greedy corporation to use your content to train AI?

What about author rights? If I publish my painting does that mean that any corp can sell t-shirt with this content, because "content available to anyone who can send a GET request or use the Internet Archive"?


>What about "by continue using this site you agree to this TOS"?

Terms of service determine when a provider will refuse to service a user's requests. If the website responded to the request with the requested content, TOS is a moot point.

All of the other examples are covered under copyright law. Whether or not copyright has been violated depends on whether the training of AI models falls under fair use. That remains to be decided in the courts, but I think there is a plausible argument that an AI model counts as "transformative use" and wouldn't be a violation of copyright.


Or that while Microsoft is an investor in openai, they do not control openai


If the SDGs taught us anything, it is that companies are responsible for scrutinizing the companies they invest in.

Of course this is not on the level of child labor, nor environmental pollution.


I see two issues. Web scraping is clearly a business model problem, and that’s partly due to scale.

If you give away your content for free and expect ads to sustain you, that will start failing once others get the value out of your content without seeing the ads. Examples are ad blockers, answers embedded in Google results, Stack Overflow clones, and things like ChatGPT.

If ads weren’t your business model you wouldn’t be using revenue from it.

The other issue is scale, and I don’t know how to address it.

It’s easy for someone (say the government) to have a friendly policy and say “you can dig in a park”, thinking it’s useful to campers and such.

But when someone shows up with a professional strip mining crew, things are different.

If you run a site providing quality information for free, making money off book sales or professional services or such can be a good living. Even if answers end up in the Google answer box, more complicated stuff or analysis still requires a visit to read and people can start following you from there.

But if ChatGPT or whatever can “read” your stuff and give out 80% of the value without anyone even knowing it came from you, you’re screwed. Your business model no longer works. Any kind of “give away good information” business model fails. Same issue artists are now seeing.

And I don’t know how you fix that without some kind of ban. But unless every country everywhere enforces one… you have to work with the lowest common denominator and lock all your content up. No web search. No Google answers. No chat GPT. “Please don’t scrape me” in robots.txt won’t work.


It's interesting, because it's essentially the same exact discussion as traditional copyrights (e.g. for books). The only difference is that book authors are generally not giving away their books for free on their personal website. Copyrights are the attempt to protect the business model of authors who want to sell copies of something that are otherwise extremely easy and cheap to copy. Attempts to legally limit web scraping are an attempt to protect the business model of creators who want to give away for free copies of things that are easy and cheap to copy, but only if we come directly to the creator to get our free copy.


There's a nuanced difference between this new wave of AI scraping and the old "copying a site" scraping, and it's not a copyright issue.

The original problem with copyright was that the website owner's content could be duplicated elsewhere, and thus violate copyright (as well as suck away the web traffic and presumably lower revenue).

The new AI issue is not that the content is duplicated elsewhere, but that the knowledge contained in the content is "learnt", and used to produce a different work (totally copyright free - in the truest sense, as it is original). An example would be a recipe website. The site owner could've painstakingly collected recipes from the literature, and cataloged, labelled it, etc, making searching and such easy. But the recipes themselves are not copyrightable, only the expression of the recipe.

So given this info, the AI scraper now has a large labelled dataset to learn from, and to generate new recipes. These new recipes do not violate _any_ existing copyright, as they are entirely original in expression.

I say, as an AI advocate, that the old business model of recipe hosting is destroyed by this new AI, and trying to preserve it by legal means is just fighting against the tide. After all, the world doesn't have a unified jurisdiction, and the internet is world wide, so any would-be violators could just as easily move to a different country to operate.


You’re right. That’s why scraping must be unlimited and legal for all. Any information accessible from the internet should be legal to refine. That includes us using GPT services to train our own models, scraping anything that’s publicly accessible. Our only defense is competing services that refine the data even more than any general LLM. The solution is almost never regulation but competition. Fair competition.


> You’re right. That’s why scraping must be unlimited and legal for all.

Unlimited scraping makes some privacy regulations moot, such as the right to erasure (the ability to delete personal data from a platform).


I don't think that's true. "Right to erasure" still works just as well as it always has, but you might need to ask the folks who have scraped and are re-sharing your information to also delete your personal data. That's not an unreasonable thing to have happen, nor is it an unreasonable thing to expect.

Let's suppose an embarrassing image of Person X is shared on Facebook and Person X uses their right to erasure with Facebook to delete their profile. Facebook has no control over the folks who may have downloaded or screenshotted that photo and turned it into subsequent memes. Likewise, if someone straight up scrapes and re-shares, that's not Facebook's responsibility.

What I don't want to see happen is for:

1. Facebook to make it somehow impossible for anyone to ever copy or screenshot that or any photo, preventing anyone from ever doing anything with photos on Facebook without Facebook's explicit permission. This would seem to be quite the loss of user agency for very little society wide benefit (also, how would they do this?)

2. Facebook to somehow "control" that photo so closely that Facebook is able to remotely revoke folk's copies and screenshots of said photo in the spirit of "abiding by a persons right to erasure"; that'd be a huge overreach, but seems like the only other way to approach this (though "how" is also an open question).

Even asserting that "unlimited scraping makes some privacy regulations moot" seems like an implication that we can only have privacy laws by going towards situation #1, and that doesn't seem accurate given that folks can use existing privacy laws to remove content from any distributor (as long as they're compliant).


Not exactly. You can request a site to erase all the data it has on you, but not that they erase the memories of everyone who has seen this data. How is this any different?


Your tone implies you're serious, but I struggle to believe anyone could possibly equate persisting digital media with recalling a memory.

In case you really need an example to elucidate, consider reproducing an image. A scraper can quite literally accomplish that, trivially; a great artist would still be limited in multiple facets of the recreation, such that even one with the best memory and hand would find themselves far short of pixel-perfect.


I wonder how we would regard a person who could reliably perform such a feat whenever he pleased. Would we sterilize him, lest he give rise to a bunch of cute little privacy-invading monsters?


If the feat you mean is to perfectly recall disparaging information they see about people on web sites, we already have people with quite good memories. Irrelevance usually keeps them from bringing up the details of strangers' lives on a regular basis. If the juicy details are about friends or acquaintances, well, it's very easy to destroy one's social position - at least, with non-toxic people - by endlessly and tiresomely discussing other people's misfortunes or mistakes.


How many people who have seen that data are acting as a service to share it, at scale?


How many of them saved it and then reuploaded it elsewhere? Sorry, but talking about protecting the privacy of people who upload things for anyone to see just seems silly to me.


scale


So at which scale does the copying of data lower privacy, such that humans looking at it and potentially screenshotting it doesn't, but automated processes copying it does?


A fuzzy boundary doesn't make the two sides equivalent.


No, but since we are talking about laws, it is important to define the point beyond which a kind of behavior becomes unacceptable, or at least some set of criteria to determine when a specific instance is beyond that point.


You're making an ideological argument but not confronting any of the business problems raised in the other comment.


> If you give away your content for free and expect ads to sustain you, that will start failing once others get the value out of your content without seeing the ads

I don't think a paywall would fix this. One paid account is all a scraper needs. It couldn't really even be rate-limited if it's just "reading" articles as they become available. After the data is acquired it can be dispensed. If directly posting it violates copyright, then obscuring it behind AI will do the trick just fine.


But it stops being trivial. Now to scrape websites en masse you have to automate signing up for them, probably paying for it.

And unlike now to sign up you have to agree to a very enforceable EULA.

So instead of going to court with “FunAI read my public website and is making money off it, which I don’t think should be fair use”, you have “FunAI violated a contract they signed and committed fraud by lying on signup”.

Seems to me that’s much easier.

There will always be people who get the content for free somehow. You don’t have to stop 100%. Even stopping 95% would be a lot better than the current 0%.


If you think about it, if free lending libraries and web search indices did not exist, and you tried to create them today, you would get sued into oblivion.


The primary grounds on which these cases rest is some nebulous understanding of contractual agreement.

I have two thoughts:

- EULAs aren’t written for companies to sign.

- I think EULAs are garbage anyway. They’re completely one sided and in most cases probably illegal or wouldn’t hold up in court if anyone actually had the resources to fight one.

Imo, the burden of ensuring someone has read and understands a EULA should be on the company who creates it, and EULAs should not be enforceable unless the company can prove the person understood the EULA entirely before accessing the site. EULAs are not a business agreement. They’re some kind of corporate pseudo-law companies try to attach to the usage of a product. But what other product in the world comes with a big list of rules dictating how you can use it (or be sued)?

So how does this all come back to this “company vs company scraping”? If you put it on the web, and you don’t have REAL copyright on the content (that is, you didn’t make it yourself), you have no right to protect it from “theft.”

PS yes, I know John Deere doesn’t let its customers work on its tractors but that’s some bullshit too.


These online agreements are often enforceable, even when companies have lots of resources to defend themselves.


I found the linked Register.com vs Verio case [1] very interesting, because the courts actually made a more nuanced decision on contracts of adhesion than typically understood, I believe:

In the case, Verio was calling Register's API for a purpose that Register disallowed. However, they only supplied the "contract" text that declared the restrictions after the call was made (as part of the API response I think).

The court did in fact agree that this was too late: If you can only find out the terms and conditions for an API call by making that API call, then this is a "shrink-wrap agreement" and the terms are void.

The restriction the court made on this was that it only applied to the first time they called the API: Verio has employees who can be expected to have common sense. So after having called the API for the first time, they had an opportunity to read the text and become aware of the restrictions. This meant that for all the subsequent API calls they made, Verio's staff was aware that they were doing something that Register explicitly forbade, but did so anyway - hence the court ruled that as "breach of contract".

The important point here was that the courts never abandoned the principle that an individual must be aware of a contract's conditions before entering into it - the case was just shooting down a situation where a party was pretending to be ignorant of the terms when they really weren't.

[1] https://en.m.wikipedia.org/wiki/Register.com_v._Verio


Good example from the Allen Institute discussed last week https://news.ycombinator.com/item?id=37181415

They "released" a dataset scraped from public domain stuff under a license that restricts how people can use it


> But the content that they’re trying to protect isn’t theirs -- it belongs to their users.

Kinda. Yes, Facebook says that content belongs to users (otherwise they'd have a harder time explaining they are not liable when it's illegal), but users also agree to give Facebook a “non-exclusive, transferable, sub-licensable, royalty-free, worldwide license to use any IP content that you post on or in connection with Facebook.”

For example, if a user deleted their* content, Facebook can still use it and show to their friends. That's why it's "kinda".


That doesn't change who the content belongs to. It just gives some rights to FB. And, in fact, without something like "perpetual" and/or "irrevocable" in there, it doesn't imply that they could keep using it after you deleted it (or that you couldn't revoke a grant of rights.)


A license is not ownership. Anyway that part of the article is just context - none of what you describe constitutes the legal basis for the suits or rulings discussed in it - it’s just explaining why property law isn’t being used.


Did you read the posted sign? “no walking on the road outside my property”


> For example, if a user deleted their* content, Facebook can still use it and show to their friends. That's why it's "kinda".

I don't think that is correct. If you asked Facebook to remove your data from the platform, it would be a GDPR (and probably CCPA, etc.) violation for Facebook not to delete your data within 1 month.


I am, in general, strongly of the opinion that legal blocks to web scraping are ridiculous and a Bad Thing.

That said, ever since I added some invisible unicode characters into my name on LinkedIn, the fraction of spam I receive that addresses me with a bunch of `?`s where I put the invisible characters is rather astounding. Probably half the garbage email I get has clearly scraped my name from LinkedIn.

All that to say: if you're someone who feels affronted that you aren't allowed to scrape LinkedIn, and you try to garner my sympathy about the situation, please accept my invitation to get stuffed.
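For anyone curious, the trick generalizes: give each site a distinct invisible marker and you can tell which one leaked your profile. A minimal sketch, with hypothetical site names, using zero-width characters (naive scrapers either preserve them or mangle them into `?`s, as described above):

    # "Canary name" sketch: embed a per-site zero-width marker in a display
    # name, then look for that marker in incoming spam to see where the
    # scraped copy came from. Site names here are hypothetical.
    ZW_SPACE, ZW_NON_JOINER = "\u200b", "\u200c"

    MARKERS = {
        "linkedin": ZW_SPACE + ZW_SPACE,
        "github":   ZW_SPACE + ZW_NON_JOINER,
    }

    def tag_name(name, site):
        first, rest = name.split(" ", 1)
        return f"{first}{MARKERS[site]} {rest}"   # invisible to human readers

    def leak_source(text):
        for site, marker in MARKERS.items():
            if marker in text:
                return site
        return None

    profile_name = tag_name("Jane Doe", "linkedin")
    print(leak_source("Hi Jane\u200b\u200b, we have an exclusive offer for you"))  # -> linkedin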


I couldn't disagree more re: legal blocks are ridiculous, and this is a very pro-privacy decision.

I shared info on LinkedIn for use by myself and other linkedin members, and for linkedin to use in their various products. You can't be pro-privacy and simultaneously believe that my putting info on LI has granted the entire world license to do with it what they will. Including running research programs or ML stuff or whatever hiq's business was, selling it to spammers, and using it in products by other companies that I've never heard of.


> You can't be pro-privacy and simultaneously believe ...

You gave information to LI, trusting that they wouldn't disclose it to people you wouldn't want to have it. Meanwhile, untrustworthy third parties were free to view that very same information. Blame LI for being casual with data about you. Blame yourself for trusting LI. Don't blame others for reading what they were free to read.


That's simply wrong -- this went well beyond what HiQ could read.

HiQ's claim is that LI shouldn't be allowed to prevent HiQ from reading data that I had no intention or interest in sharing with HiQ, simply because LI makes it available to other LI users. It's wildly anti-privacy to claim that because I shared data with LI, and LI put some piece of it on the website, that HiQ has a right to read it and use it for arbitrary things.


Just my opinion. Web scraping isn't important for AI or AGI anymore. It's important only for the facts, like news. But not for the logic. With current datasets it should be possible to move forward, beyond GPT-4. Blocking access now is like investing in aluminum was, when it was more expensive than gold. And suddenly it's not.

When we get facts separated from the model it will be enough to have just one article about, say, landing on Mars for the model to be able to actively use this new knowledge. In this case 'AI providers' will need to cooperate with one news agency and buy rights for only a few classic books. Judging by how fast everything is moving this will happen within a couple of years if not a few months.


The first company that came to mind from reading the title was Google.


It seems like the reason the shit is hitting the fan at this particular point in time is because of the recent AI and LLM race.

If I were to guess, it probably originated in an edgy, hyperbolic statement like “the new AI empires will be built on the free data from the old ones”, uttered by someone scary and important, perhaps Altman.

Through the rumor mill all the feudal lords in the data-kingdom, like Steve Huffman, Musk etc are freaking out and are fortifying their defenses against this new perceived existential threat, sacrificing whatever minuscule openness remained in the process.

To me, it’s obvious that a new copyright interpretation is needed for commercial AI applications, which would alleviate the gold rush panic. However, the regulatory bodies (ironically, eroded by the same corporations), don’t have any teeth left to do anything about it, so it’s a Wild West, and they all know it. The legal defense against scraping has nothing to do with principles, even less so these days, it’s all just about abusing the legal system for dominance games.

Personally, I think the threat is overblown. Yes, AI will force eg Google to improve their shit ranking of recipe sites, but AI can’t provide up-to-date stuff people care about, so Google, Facebook, Twitter etc will keep providing services just like today.


I think it was already the case that a number of social media sites were cycling from the "Offer free bait" phase to the "Harvest the walled garden" phase.

Twitter barely works on Tor. Reddit barely works if you're logged out. Reddit is pushing their app really hard and obviously going to drop old.reddit one day when their internal API finally breaks it off.

If you make revenue from advertisements, having a web page that can be interpreted by any machine besides your own code is wasting money on freeloaders. Be they 3rd-party clients, Tor lurkers, or scrapers for search engines and AI.


Whenever anyone I know sends me a reddit link in a group chat, I relink them to old.reddit. The day they drop old.reddit is the day I start dropping those friends. I did put the app on my phone for a minute, since the "new" reddit web experience is so miserable, but the app was even more intolerable.


is there any more up-to-date corpus of reddit comments out there? like this 2015 one

https://archive.org/details/2015_reddit_comments_corpus

(oh there is a 2TB torrent from 2005-06 to 2022-12)


other countries don't care about copyright interpretations. next you'll suggest global government. the internet always was a wild west and i do not particularly enjoy contemplating a scenario in which this ceases to be


Whatever things are morphing into, it isn’t a “Wild West”.

Unless you think of feudalism or indentured servitude as Wild or Western.

It’s a place of a few royal palaces, with walls & moats, surrounded by the masses who submit the value of their creative & social efforts through little slots, for the right to receive in kind from others, but filtered & diluted with whatever the royalty wants you to see and hear.

The evolution of aggregation sites is not on a path that enhances or maintains individual freedoms or independence.

Amazon charging suppliers a fee for NOT using its services ought to be an Onion article. [0]

[0] https://www.theverge.com/2023/8/16/23834653/amazon-seller-fe...


>a decision Bloomberg says Amazon made to satisfy regulators

Regulators are out of control, yes. But that does not change the fundamental nature of the internet. Especially as it pertains to something which is purely data, copyrighted material. Amazon is not stopping filipino and bulgarian kids from downloading mission impossible any time soon. Nor is it stopping people from bulgaria from paying for AI services in the Philippines which are trained on American data.


Amazon, Facebook, etc. don’t want to kill off anyone or their economic presence.

Any more than a Duke or Earl wants to kill off peasants or stop them from playing and singing with their families in the evening.

The lords just want to extract most of the value that they create, most of the value that they spend, and let them see what the lord wants them to see, an increasing percentage of the time.


All low-grade conspiracy theories hyperfocus on motivation. If motivation alone could move the world, some serial killer in some prison in the desert would have destroyed the Earth with his mind powers long ago.


The problem is that it is global government, if your local police assist American corporations in seizing your assets for setting up a server far away in another country that downloads public social media profiles. It's just not a global government that anyone whose net worth is < $1B has any influence upon.


there is no conceivable global government that would care at all about a random guy on an island with no money


Most of the important countries (except China) do care about copyright, patents and such. Though they were mostly implementing western laws through both influence and threats.


It doesn't matter who the important countries are militarily if Argentina and Kazakhstan are the ones which have no enforced rules on training AI on foreign data.


The Web Environment Integrity (WEI) API also plays into FAANGX's capability to monopolize scraping and therefore create competing AI systems.


With regards to the hiQ case (IANAL, but I followed this case closely): the 9th circuit asserted that the CFAA (Computer Fraud & Abuse Act), which is a _criminal_ statute, does not apply to scraping websites. On the other hand, if you create a user account and scrape while logged in, that is a matter of contract/civil law, and the website can sue you to stop breaching the terms of service (plus maybe damages).

The takeaway for those of us in the trenches who feel we have as much a right as Google or Bing to crawl websites, is that we are not subject to arrest and imprisonment, just tort law. The ruling has apparently caused the major social networks to switch from criminalizing web scraping to putting most content behind a login and threatening to sue small companies out of existence.


It's interesting seeing the reactions from other websites/orgs after OpenAI publicly announced GPTBot. Tons of people blocking GPTBot outright (made a small page that tracks this: https://wayde.gg/websites-blocking-openai)
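A minimal sketch of how such a tracker might check a site, assuming only that the site publishes a standard robots.txt and that OpenAI's crawler identifies itself with the "GPTBot" user-agent token (the domains below are placeholders):

    # Check whether a site's robots.txt disallows the GPTBot user agent
    # from fetching the site root. Domains are illustrative placeholders.
    import urllib.robotparser

    def blocks_gptbot(domain):
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"https://{domain}/robots.txt")
        rp.read()
        return not rp.can_fetch("GPTBot", f"https://{domain}/")

    for domain in ["example.com", "example.org"]:
        status = "blocks" if blocks_gptbot(domain) else "allows"
        print(f"{domain} {status} GPTBot")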


Is there any legal issue with a spider trap designed to poison LLMs?


I wonder if blocking gptbot is a good signal that a website has non LLM generated content on it, and is therefore good training data...


An illustrator specifically wrote[1] that this is why they won't be tagging their social media posts with #HumanMade, #NoAI and similar, as it's a signal there's unadulterated training data.

[1] https://www.davidrevoy.com/article977/artificial-inteligence...


Sweet memories of Facebook, which was spamming all the contacts to invite them to join Facebook.


Same as LinkedIn. In fact that was LinkedIn's actual growth hack. Now they sell books talking about other things.


I remember when LinkedIn had an anti-pattern of displaying a check list of your gmail contacts and asking which of them you want to connect with. These were all unchecked by default. But under the fold - that is, you needed to scroll down to see them - were all the rest of your contacts, only these were all checked. If you just went with what was displayed without scrolling down and unchecking the rest, you would unintentionally blast your entire contact list with contact requests. That was everyone in your contact list, whether appropriate or not: your dog groomer, people you were in a lawsuit with, whoever. I never did figure out how they were able to get our email contact lists. I vaguely remember reading about web-scraping shenanigans. I for one never knowingly gave them access.


FTFA: “All the world’s knowledge is available for the taking on the Internet”

No. This is not true. Quit saying this. A small fraction of the world’s knowledge is on the Internet, much less scrapable.


Perhaps it's legal to scrape in some countries outside the USA?


A good question. Japan came out and declared that copyright does not apply to training AI.

There must be a good chunk of the world that doesn't have any laws forbidding it. This isn't under the jurisdiction of WIPO or anything like that, it's just a completely insane evolution of anglo common law


> Courts need to realize that if you allow private companies to invent intellectual property rights through online contracts of adhesion, courts will be at the mercy of private decision-makers on questions that should be questions of public interest.

Courts feel very fine doing the bidding of private decision-makers on what should be questions of public interest, thank you very much. Reminds me of this joke:

“Wake up, sir, wake up, you have shat yourself!” — “I’m not sleeping.”


Fair use is how cos like Google got started in the first place. AI is similarly dependent on sucking up vast quantities of information, with a light touch of course, and using that to create the NEXT Google.

For that matter, who will build the next LLM if anyone like Elon Musk can slap a terms-of-use change on an account because he wants to stop fair use? This is anti-innovation and means that challenging all these monopolies, Google in search and Twitter in social media, will be all the harder.

This needs to get to SCOTUS fast for clarification as it's evil, wrong and will kill innovation.


Can I just vent? This stuff burns me up.

If we can't be clear on web scraping, do we really have any hope for clearer rules on more complicated things like AI or privacy or copyright?


Legal discussion aside, as a person working in security, it's pretty annoying to have to block certain web scrapers / automation while allowing a certain few.


How is the situation, legally and ethically, if you use scraped data as text embeddings for a commercial product?


According to the District Court's October 27, 2022 Order, hiQ's litigation was being funded by a third party.

Who was this "litigation funder"?^1

Perhaps the funder wanted to settle, even though hiQ could have prevailed at trial on the CFAA claims. Arguably, hiQ's goal was not to establish new CFAA precedent, it was to compete with LinkedIn. It wanted a Court to order LinkedIn to stop blocking its access to public data.

This took too long. hiQ ran out of money. (Or its funding source(s) ran out of patience.)

If hiQ had the funding to go the distance, then it's impossible to predict what would have happened.

It's sad that the top comment in this thread believes LinkedIn sued hiQ to stop it from scraping. That's not what happened. hiQ's "business" was going to fail because LinkedIn was effectively blocking its access, even when hiQ was using proxies and mechanical turks. hiQ filed for a declaratory judgment against LinkedIn: hiQ sued LinkedIn.

Here is the Order for anyone who cares to read it.

https://ia600100.us.archive.org/29/items/gov.uscourts.cand.3...

1. According to the Order, the funder is identified in some correspondence filed as an exhibit. It is not clear from the Order whether names in the correspondence were redacted, i.e., it's not clear whether it's public or private information. I'm assuming the former.


> it's interesting that the top comment in the thread believes LinkedIn sued hiQ to stop it from scraping.

At time of writing, I believe that top comment is mine - and I'll be clear that I wasn't saying that I think that's why LinkedIn sued HiQ.

In fact I don't particularly care why they did it, I'm just interested in what precedents the case has set and whether the typical comments thrown around regarding the case have any actual merit.

There's a comment replying to mine that has a better breakdown of it all that I'd trust over my own anyway. ;P


"In fact I don't particularly care why they did it, I'm just interested in what precedents the case has set and whether the typical comments thrown around regarding the case have any merit."

What are the "typical comments thrown around"? Perhaps you could provide some references.


I know, I linked to that blog over in my comment chain already. :)


I am not certain who it is. And even if I were, I would not dox someone who wanted to keep their identity private.


> Some of the biggest companies on earth—including Meta and Microsoft—take aggressive, litigious approaches to prohibiting web scraping on their own properties, while taking liberal approaches to scraping data on other companies’ properties.

Umm, no; author needs to study the word "hypocrisy" more deeply than a cursory glance in the dictionary.

Doing something to others, while defending against the same thing, is not hypocrisy.

For instance, a soccer player isn't a hypocrite for defending against the ball going into his net, while trying to put it into the other team's net.

A soldier on the war front isn't a hypocrite for shooting, while also taking cover and dodging bullets.

These subjects are not hypocrites because they are not acting in one way, while preaching that they, or others, ought to be acting in a different way.

Microsoft would be hypocrites if they published an official statement such that nobody who engages in web scraping has the right to defend their own site against web scraping, because that would not resemble their actual behavior and position which could be inferred from their behavior. (Is there such a statement somewhere?)

For hypocrisy to take place, you have to actually preach that you and others should behave in a certain way, and then not actually behave in exactly that way. If you only act, and don't preach, you cannot be a hypocrite.

Moreover, your team's net is not the same object as another team's net. If a soccer player loudly professes "it is morally wrong for anyone to kick the ball into our net", but then kicks the ball into the other team's net, that is not hypocrisy. His statement references only his own net; he didn't proclaim that it's wrong to kick any ball into any net whatsoever.


Scraping other sites while prohibiting it on your own is "do what I say, not what I do" behavior, which I think is a fair, consensus understanding of what it means to be hypocritical.


To convict Microsoft or Meta of hypocrisy, you need to find some official statement from the company which says that web scraping is wrong.

Hypocrisy is the professing of a moral position to which one does not conform.

If you don't preach that web scraping is wrong, and engage in it, that doesn't mean you can't defend your site against it.

Everyone shits; it is not ipso facto hypocrisy not to want it on your lawn.

Hypocrisy could be involved in the way you articulate your wish not to have it on your lawn.



