
Web scraping case fails under Supreme Court's Dastar doctrine - neoflexycurrent
http://blog.internetcases.com/2018/11/19/web-scraping-case-fails-under-dastar/
======
btilly
Here is an attempted translation from legalese.

Company B scraped listings off of Company A's site, and published their own
site based off of that data. Then A sued on three grounds:

1\. Company B's notices falsely claimed copyright to the listings.

2\. Company B ignored company A's copyright notice.

3\. Violation of the Lanham Act, which prohibits misrepresenting where the
goods you sell come from. (This is the interesting one.)

The court ruled as follows.

General copyright notices at the bottom of a web page claim copyright over the
site as a whole and not over all of the data that may appear on the site.
Therefore 1 fails because company B's notice is not claiming copyright over
the listings. And 2 fails because A's notice was not specific enough to claim
copyright.

As for 3, the fact that there is no physical product means that the precedent
the Supreme Court set in the Dastar case applies - the Lanham Act only
applies to physical products.

I'm sure that I didn't get it quite right, but that version may be more
readable than the original article.

~~~
Novashi
So if you have data that you can legitimately claim copyright to mixed with
data that you can’t, how would you proceed?

~~~
Retric
Data and factual information, such as rainfall amounts, are not protected by
copyright.

This includes a lot of things, like prices, that seem to be creative
endeavors, which makes it somewhat confusing.

~~~
baroffoos
I thought that collections of facts are still under copyright, so things like
Google Maps data are copyrighted even if the things in them are facts. You can
create an identical copy as long as you verify and collect the facts yourself,
and if a mistake on Google Maps is found on another map then it's obvious it
was copied.

~~~
fjsolwmv
Nope. Facts are not copyrightable. You can copy all of Google's factual map
data, but you can't copy their artsy rendering of it in a picture.

~~~
baroffoos
Then why is copying data from Google Maps absolutely banned for OpenStreetMap
editors? Also, if Google includes some fake data then that's not a fact and
possibly a creative work.

~~~
yellowbkpk
OSM editors are prevented from using Google Maps because OSM prefers that we
have explicit permission to use data sources. Since we don't have explicit
permission to use Google Maps, we can't use it.

Separately, Google Maps has terms of use that prevent reuse of Google Maps
data. You agree to those terms when visiting Google Maps or using the Google
Maps API.

You wouldn't be breaking copyright law when copying from Google Maps to OSM;
you'd be breaking the terms of use contract with Google and the community
norms and expectations in OSM.

~~~
codedokode
> You agree to those terms when visiting Google Maps or using Google Maps API.

Those terms are not even presented on the screen when visiting Google Maps.
And even if they were, I didn't sign anything or agree to anything. It is
ridiculous if those terms are legally binding. Because then I will make a site
and make you pay $1 for every page viewed.

~~~
db48x
Except that Google has a credible threat: they can delete your account,
including your email, your Android applications, your Youtube videos, etc.

~~~
__david__
That's only a threat if you use Google for those things.

------
burtonator
Scraping public web content is a really confusing situation.

My company, Datastreamer, has been in business for ten years indexing public
web content (news + blogs). We focus primarily on "live" content. Content that
publishes often.

The main challenge we've always had is that just because the content CAN be
indexed doesn't necessarily mean you MAY index it.

A recent situation was around Craigslist vs 3taps:

[https://en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc](https://en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc).

Basically the issue doesn't revolve around WHO has copyright but who has copy
_access_ to the content.

So if you create an account on Acme.com... You still own the copyright to the
content you post but Acme controls access. Not only that but the ToS that you
sign gives them rights to your content including bulk sales.

This means that Acme can monetize the content that YOU create while actively
preventing people from indexing it, even though that may be your intention.

This means that in 2018 a company like Google COULD NOT get started because
websites would just not allow them to access your content.

I believe that when most people post public content on the Internet they
intend it to be _public_, including being accessible to other search engines,
crawlers, etc.

Now we're in a horrible situation where just a few companies essentially own
the Internet.

This is why Google can't index Facebook content or Twitter content even though
it's public - they can't access it.

~~~
meritt
> Google can't index Facebook content or Twitter content even though it's
> public

They index plenty of their content [1][2]. What they don't index is content
not explicitly marked as "Public", e.g. Facebook posts with visibility
settings or protected tweets. FB, Twitter, and now LinkedIn have plenty of
content that's not publicly accessible (that is, content which requires a
logged-in account that explicitly agreed to the TOS/EULA), but they have tons
of publicly accessible content, too. The latter is fair game.

That said, the Craigslist lawsuit is still bewildering to me. That content is
explicitly public, does not require a login, and the only agreements are the
automatic, unenforceable browsewrap ones. The LinkedIn v. HiQ case is very
similar to Craigslist v. 3Taps; however, the decisions are in opposite
directions.

[1]
[https://www.google.com/search?q=site%3Afacebook.com](https://www.google.com/search?q=site%3Afacebook.com)

[2]
[https://www.google.com/search?q=site%3Atwitter.com](https://www.google.com/search?q=site%3Atwitter.com)

~~~
fjsolwmv
What's bewildering?

Published does not mean uncopyrightable. A radio station broadcasts a song;
that doesn't remove the song's copyright protection. Receiving a copy is not
making a copy.

3Taps settled out of court; there was no decision.

~~~
7j
I'm not sure the radio station analogy holds. Radio stations can't claim
copyright to the song.

------
rossmachinery
My company, Alan Ross Machinery, is the plaintiff in this case. Happy to
answer questions to the extent legal counsel will permit it. I can tell you
the case remains active, stay tuned...

~~~
pavel_lishin
It seems like y'all are sort of an ebay/craigslist for industrial machinery,
right?

What benefit does scraping your site give someone else? If someone posts your
listings on their site, how do they connect would-be buyers with the sellers,
anyway?

~~~
rossmachinery
Without products to list eBay has no business. So, there's a strong incentive
for companies who want to be in the space to acquire listings through a
variety of means, including some means they ought not to.

~~~
donaltroddyn
I quickly browsed your site, and you don't seem to publish contact details for
sellers, so back to Pavel's question: What does the site that scraped your
listing do if someone wants to buy a scraped product? Send them to your site?

~~~
rossmachinery
They seemed to do several things, but click through to our site was not one of
them. They capture(ed) visitor data much of which I assume they retain for
their purposes.

~~~
donaltroddyn
Interesting - thanks for your answer.

------
curiousgal
This brings up a question that I've always had. There are plenty of companies
that offer e-commerce "insights" by scraping all merchants' products and
prices and then sell that data to a particular merchant. Is that legal?

~~~
stoic_heimdall
Also interested in the answer to this question.

------
gammateam
Misleading title: this is a trial court case in the lowest federal court,
the Northern District of Illinois specifically. Nobody cares about trial court
rulings, though this one is mildly informative if we want to discuss the
rationale anyway.

------
AznHisoka
Whatever happened to the HiQ Labs case? Did HiQ Labs ultimately win, and can
they continue scraping LinkedIn?

~~~
manurandon
Yes, they won

~~~
perpetualpatzer
Any idea where I can read the opinion? I know they won a preliminary
injunction, which LinkedIn challenged in the 9th Circuit in March '18, but I
hadn't heard anything since then.

~~~
comex
There is none; the appeal is still open. You can check the status of the case
on PACER – at least if you don't mind paying a few cents for every request due
to the judiciary being stuck in the 90's:

[https://ecf.ca9.uscourts.gov/n/beam/servlet/TransportRoom?se...](https://ecf.ca9.uscourts.gov/n/beam/servlet/TransportRoom?servlet=CaseSearch.jsp)

(The case number is 17-16783.)

There was oral argument on March 15, and the 9th Circuit posts video
recordings of all hearings on YouTube:

[https://www.youtube.com/watch?v=tvLdJujOp8k](https://www.youtube.com/watch?v=tvLdJujOp8k)

(I haven't watched it yet, so I don't know how it went.)

Since then, the only filings have been a few citations of supplemental
authorities, the last one in June. If I'm not mistaken, the case is just
waiting for the judge to write an opinion. According to the 9th Circuit's FAQ:

> 18\. How long does it take from the time of argument to the time of
> decision?

> The Court has no time limit, but most cases are decided within 3 months to a
> year.

...It's currently been 8 months since the date of the argument, so hopefully
that won't take too much longer.

~~~
AznHisoka
That's what I thought... Most people assumed the verdict earlier this year was
an automatic win, and they keep citing it. But it wasn't the final verdict,
which is what I'm interested in. Hope it comes out soon.

------
sam0x17
So now if they add a specific copyright notice on the page that was getting
scraped, the court might come back later and rule differently if scraping
continues? Or am I misunderstanding.

~~~
elliekelly
I think the issue was more that the plaintiff claimed a copyright on each
_page_ but the defendant had copied the _photographs_ and _descriptions_.
Having only read this one decision it sounds like the Judge had dismissed the
case once without prejudice to allow the plaintiff to restate their claim to
include copyright violations for the specific material that was copied (the
photographs and descriptions) but the plaintiff failed to do so. The opinion
then says that a photograph merely appearing on a site doesn't mean the
website claims ownership of the photo. Since the plaintiff made no claims of
ownership over the copied material they can't sustain a claim of copyright
infringement.

------
echelon
Is this a strong ruling that establishes precedent and is unlikely to be
overturned? (Forgive me for not having a great understanding of the legal
world.)

Here are a couple of cases that I'm especially interested in:

1\. Does this mean that it is legal to scrape "database"-type websites for
statistics and provide them on your own website? Could one use this ruling and
copy all of IMDB's film data? Repackage that data into a Creative Commons
website? Or a better set of tools for casting agents?

2\. What about social or community-curated websites? Could you mirror all
Reddit comments (which used to be Creative Commons anyway) to a more
dev-friendly site? Don't force a mobile app down people's throats? Make it
ad-free, donation-supported like Wikipedia?

3\. What about big media? Could you bootstrap a new video site by scraping all
existing (or popular) YouTube videos? Provide a means for owners to "claim
ownership" of their account on the new site? Then market it as YouTube "but
grown up" (18+)?

4\. Kind of getting off-topic, but could you build a new music service by
temporarily ignoring copyright, copying pirated music, then pivot to something
that does collect money (via ads or subscriptions) for the music rights
holders? I'm thinking Spotify, but built for music connoisseurs. Rich APIs
with tagging, smart playlists, etc.

How likely would any of these be to avoid lawsuits until they're big enough to
hire a legal team? Are certain behaviors less legal than others?

I'd really appreciate feedback on this. (Thanks in advance!)

~~~
weinzierl
> Could one use this ruling and copy all of IMDB's film data?

IMDb forbids scrapers in their conditions[1] but you can freely download their
datasets[2].

Similar to what you described: omdbapi[3] is a third party API for the free
IMDb data.

[1] [https://www.imdb.com/conditions](https://www.imdb.com/conditions)

[2] [https://datasets.imdbws.com/](https://datasets.imdbws.com/)

[3] [http://www.omdbapi.com/](http://www.omdbapi.com/)

~~~
hummingurban
It's not legally binding and not the law. The TOS can forbid all they want.

[https://www.eff.org/deeplinks/2018/04/dc-court-accessing-pub...](https://www.eff.org/deeplinks/2018/04/dc-court-accessing-public-information-not-computer-crime)

suggests there is little to no recourse for IMDb and the like. Craigslist was
able to win their case against 3taps by arguing that the scraping was putting
a load on their servers (typical Craig Newmark bullshit) and that 3taps
continued scraping even after the IP ban, which is a violation of the Computer
Fraud and Abuse Act or something like that. That is a draconian response, the
likes of which was used against that guy who killed himself after he got
caught scraping academic journals.

------
dotdi
Can somebody translate?

~~~
randomerr
The plaintiff had a bad copyright notice. The copyright notice needs to be
specific about the data and not cover 'everything under the sun' on the
website. This would be like someone trying to copyright the word 'crayon' on
his website. Crayon in most countries means 'writing device'\*. So you can't
copyright the word 'crayon' on your website. But you could copyright
'Crayola Crayons' since that is specific.

In this case the copyright was 'This is our website. We don't have anything
specific about our data.' So having no specific claims to sales or to sales
data puts everything in the public domain as long as the scraper uses
the data for 'informational purposes' and does not make a copyright claim to
the data itself.

\*FYI: a more specific definition for crayon is 'a writing device where the
writing material wears off to leave a meaningful mark.' Examples are: crayons
(duh), pencils, charcoal sticks, etc. Since pens and markers have ink or
pigment reserves and do not wear down they are not considered crayons.

~~~
rz2k
Though you can't copyright "Crayola Crayon", you could trademark the name.

However, you could copyright an article describing the merits of a specific
type of writing instrument, regardless of who owns the trademark.

------
kylnew
How might this apply to content on a personal website? What I'm gathering from
the article/discussion is that a copyright message in my footer may not be
enough to copyright all my works. Therefore, I should generate a small
copyright notice at the end of each blog post as well that more explicitly
claims copyright over the article. Do I understand that correctly or can
someone clarify? (Thanks in advance)

~~~
baroffoos
I have noticed that automated bots always scrape and repost my blog posts on a
variety of websites. I don't really care, but it's something to keep in mind.
No copyright notice will stop a script that can't understand it.

~~~
kylnew
You’re right but should it come down to a lawsuit your butt is covered. You’re
probably never going to go after someone unless they reach financial success
with it anyway. Also, what about disincentivizing anyone who might know how to
exploit content not properly copyrighted online?

It’s probably a lot like the piracy problem — Don’t spend a lot of time
fighting it because those people don’t pay, but secure the legal rights to
your works well enough to fight anyone with more sinister plans than making
just a copy or two.

------
nprateem
I've always wondered whether my famous "number defence" would work in
copyright cases. It argues that no digital file can be copyrighted. It goes
like this:

Is the number 1 copyrightable? No, because it exists and was "discovered".

Is the number 2 copyrightable? No, because it exists and was "discovered".

Is it reasonable to assume therefore that this very large number X, that
represents this disputed file, cannot in fact be copyrighted because it
already existed and was in fact just "discovered"?

If it's argued that there was some effort to "discover" this number, then I'll
write a generator that produces each number between -1 million and +1 million
and claim "copyright" over each of them. After all, regardless of the process
for arriving at a particular mp4 of a movie (people just dancing round on a
set, running the output through various software, etc.), the final output is
simply a number. Just because someone went to a lot of effort shouldn't
prevent me from going through some effort to write files containing each
number between -1 million and +1 million and charging royalties for anyone to
use them.

In fact, this raises the interesting point that anything that can be
represented as a number already fundamentally exists in the range 0 to
+infinity. It's just down to us to discover them. Think about that for a
moment: Somewhere out there in the range 0-infinity is a good Star Wars
Episode 7 just waiting to be discovered.

We could therefore write algorithms that search all the numbers in some search
space (e.g. whatever 0-3mb is in decimal), scan them to see if they're valid
files, e.g. mp3s and then run them through AI to see how they compare to known
music, etc.
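
To put rough numbers on that search space: a file is just a byte string, and a
byte string maps to an integer and back. The Python sketch below shows one such
mapping and estimates how many decimal digits the number for a 3 MB file needs.
The helper names and the 0x01 marker byte are illustrative choices, not
anything standard, and "3 MB" is assumed to mean 3,000,000 bytes.

```python
import math


def file_bytes_to_number(data: bytes) -> int:
    # Prefix a 0x01 marker byte so that leading zero bytes in the file
    # are not lost when the bytes are read as one big integer.
    return int.from_bytes(b"\x01" + data, "big")


def number_to_file_bytes(n: int) -> bytes:
    # Inverse of file_bytes_to_number: render the integer as bytes,
    # then drop the 0x01 marker byte again.
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:]


def decimal_digits_for_size(num_bytes: int) -> int:
    # Number of decimal digits in 2**(8 * num_bytes), via logarithms,
    # so the huge number itself never has to be built.
    return math.floor(8 * num_bytes * math.log10(2)) + 1


if __name__ == "__main__":
    sample = b"\x00\x00hello"
    n = file_bytes_to_number(sample)
    print(n, number_to_file_bytes(n) == sample)
    print(decimal_digits_for_size(3_000_000))
```

Under those assumptions the numbers for 3 MB files run to roughly 7.2 million
decimal digits, which gives a sense of how large the space to be scanned is.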

Thus, a new tech "Big Random" is born... I thank you :-)

~~~
SwellJoe
This is an absurd argument. Software (and written work) is _not_ generated by
random number generators...even though it theoretically could be.

If you can randomly generate works that have market value (copyright is
intended to encourage creation of works with value, not merely a sequence of
random words or bits), then we can talk about whether they're individually
copyrightable (maybe the generator itself is the only copyrightable work in
the picture, I dunno).

But, no one is going to take an argument seriously that because the number 1
(or the letter "a") cannot be copyrighted then no sequence of numbers or
letters can be copyrighted.

~~~
nprateem
I can generate works with market value. In pseudocode:

    i = 0
    while true {
      # write the number i out as the contents of the file "<i>.bin"
      write_to_file("%s.bin" % i, i)
      i++
    }

Of course the effort is in discarding the dross.
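
A runnable, bounded take on that loop might look like the following Python
sketch. The function name `write_numbered_files`, the `limit` cap, and the
big-endian byte encoding are assumptions made for illustration; the pseudocode
above doesn't specify any of them.

```python
import itertools
import tempfile
from pathlib import Path


def write_numbered_files(directory: Path, limit: int) -> list:
    """Write the integers 0..limit-1 to files named "<i>.bin"."""
    paths = []
    for i in itertools.islice(itertools.count(), limit):
        # Use the minimal number of bytes for i (at least one byte for 0).
        size = max(1, (i.bit_length() + 7) // 8)
        path = directory / f"{i}.bin"
        path.write_bytes(i.to_bytes(size, "big"))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Write into a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        written = write_numbered_files(Path(tmp), 1000)
        print(len(written), "files written")
```

Note that this minimal encoding never produces a file that starts with a zero
byte, so it doesn't actually reach every possible file.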

How are we to determine "market value"? And let's not forget, the copyright
crowd don't want to claim rights over just 1 specific number, but _any_ number
that happens to render (when run through a movie player) anything that
resembles e.g. Star Wars. I don't think this argument is as absurd as it first
appears...

~~~
SwellJoe
"And let's not forget, the copyright crowd don't want to claim rights over
just 1 specific number, but any number that happens to render (when run
through a movie player) anything that resembles e.g. Star Wars. I don't think
this argument is as absurd as it first appears..."

It makes it _more_ absurd, as it indicates clearly that it's not a random
sequence of numbers that is being protected by copyright, but a specific,
recognizable, creative work. If I watch an mpeg of Star Wars and then watch a
laser disc version of Star Wars, I will recognize them as the same work (well,
ignoring Lucas' retconning nonsense and CGI shitshow). Most humans would,
including most judges and juries in a copyright case.

"The copyright crowd" includes the leadership (and presumably the populace) of
nearly every developed nation on earth, so you're arguing against a pretty
solid majority (which is fine, I have some unpopular ideas that I hold pretty
strongly...the majority seems to love and support war and killing more than I
ever will).

Anyway, I'm all for reasonable copyright terms (and the US has absurd and
abusive copyright law that punishes and inhibits new creators at the behest of
old corporations and billionaires), but it's just not a sound basis to argue
that because one can randomly generate any creative work given infinite
monkeys and infinite time that no creative work should be able to be
copyrighted. It says that there is no difference between a creative work, like
Star Wars, and a random series of bits, because Star Wars can be recorded as a
(non-random) series of bits.

------
edoo
I made good money back in the day scraping Fortune 100 companies for Fortune
500 companies. Everyone does it for market intelligence. It is rarely made
public though.

------
anigbrowl
I hope this leads to improvements in scraping software. I find it perplexing
there aren't more tools to help strip away the cruft from structured data.
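
As a rough illustration of what such a tool has to do, here is a small sketch
using Python's standard `html.parser` module to pull table rows out of a page
and ignore everything around them. The class name, the helper function, and
the sample markup are all made up for the example.

```python
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell (ads, scripts, layout) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    parser = TableRowExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = """
<div class="ad">Buy now!</div>
<table>
  <tr><th>Machine</th><th>Price</th></tr>
  <tr><td><b>Lathe</b> <span class="badge">new</span></td><td> $1,200 </td></tr>
</table>
<script>track()</script>
"""

if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

Real pages are messier than this (nested tables, data rendered by scripts), so
treat it as a starting point only.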

------
prirun
They should have sued on trespassing grounds. That's what eBay did to prevent
scrapers. (eBay v. Bidder's Edge)

------
wrycoder
Prior law asserts raw data can't be copyrighted. The particular display of the
data can be.

------
gymshoes
Can anyone please ELI5?

