

Ask HN: Databases, Scraping, and Copyright Question - DanielBMarkham

Hope you guys can settle an argument I am having with another HNer. I'm going to make up a hypothetical question because I don't want to discuss the real one in detail.

Let's say I have a site that is all about politics and stocks. I list the stock and the board of directors, and site members can submit their political opinions about how they think the corporate officers are acting. In addition, I write a program to go out and find the Facebook, Twitter, etc. accounts for these officers and I list them with the stocks.

Not wanting to do a lot of data entry, I find a database on an FTP site somewhere that has stocks and the corporate officers that are related to them. It is part of some huge system dedicated to doing something completely different, like a brokerage system or a system for investing. Although they provide the data openly, on each dump there is a notice about how you can't use the information for commercial purposes (but you are free to use it for personal purposes and/or download, copy, and give it to friends). Their price is $50K. For that amount of money, you _might_ be able to use _part_ of the data, depending on what your actual needs are (translated: we want to know how much we can get out of you, so we're not committing to anything).

Do I have to physically type in each stock and the directors? And then update it all the time? When there are multiple sources of publicly available information that give me the same thing? That sounds crazy.

I was speaking with another HNer just now. He said if I downloaded the database tables I needed directly that was bad. But if I crawled a site generated from the database, using the static information located on each stock's page, then that was different. I say that matters of fact -- which people are officers for which company -- cannot be copyrighted. It would be different if I were pulling down big hunks of the database and reusing/re-purposing them, but I'm only asking a simple question about information that is very widely distributed.

So who is more right here? If you were trying out an idea for a website, what would you do?
======
Vivtek
Phonebooks can't be copyrighted, and databases can't be copyrighted (exactly),
but if their terms of service state that usage is restricted, and they're the
kind of company I would expect them to be, you can expect a lawyergram as soon
as they find out. Unless your pockets are deep, this will be a Bad Thing.

Unfortunately, the _exact same thing_ applies to scraping their site for that
data. If you scrape somebody _else's_ site generated from a paid use of that
database, then ... maybe. Maybe. It's an awful risk.

It would be better if you could figure out a way to scrape other publicly
available data sources - even using that database as a guide.

Don't get me wrong; there are really three questions here: the moral, the
legal, and the practical. Morally, I consider scraping a site to be OK because
it's really available to the public; just grabbing the database from a site
that says, "don't do this" - not so much. Practically, I just went over that -
it's a risk. Legally? I'm not a lawyer, but I suspect that the legality is
more or less cognate with my moral feelings above - but a lawyer might be able
to tell you.

If _I_ were trying out an _idea_, though, which was your original question,
I'd download the damn database and get it running. Then I'd try to find a more
plausible data source, but that is a task that could be assigned a couple of
months and be somewhat lower priority.

~~~
pedalpete
Actually my understanding is that your first sentence is a fallacy.

Each individual entry in a phonebook cannot be copyrighted, as that is public
information.

However, the collection as a whole can have a copyright.

It's like not being able to copyright a music note, but being able to
copyright how those pieces of information (notes) are put together into a
song.

~~~
Vivtek
The Copyright Act specifically applies to "original works of authorship", and
in 1991 the Supreme Court decided that phonebooks _specifically_ are not
original works of authorship, although individual design elements in the book
itself may be.

<http://www.ivanhoffman.com/database.html> is a good overview.

However, there is what's called a "compilation copyright", which is covered
rather nicely at <http://www.bitlaw.com/copyright/database.html> - the theory
there is that the selection of data is original authorship. This is still
based on the same Supreme Court case (Feist), where the Court stated that if
the underlying data (phone numbers and names) is uncopyrightable (you can't
copyright facts), then if that data is "selected, coordinated, or arranged in
such a way that the resulting work as a whole constitutes an original work of
authorship" the database may be copyrighted.

But the basic point of Feist is that phonebooks in particular are not subject
to copyright. That's why I led off with that example, and it took me one
Google search (on "database copyright", and these links were #3 and #4) to
confirm my vague memory.

I'd actually say at this point that the database in the original post is
probably _not_ copyrightable. Which is all fine and good, because it won't
stop a company from suing you, and may the richest man win. So it changes
nothing from the practical standpoint - copyrightable or not, the original
poster would be better off finding independent sources of data.

~~~
TeHCrAzY
Similarly, in Australia, phone books were also recently ruled not to be
covered by copyright (I'm not a legal person; I think a judge made a ruling
in a case).

------
KingOfB
My understanding is that most "public" data is not copyrightable, so scraping
it is fine. For instance, scientific data is public, but access to databases
can be charged for. Macy's phone number is public information, but Yellow
Pages can charge for access to it. In these cases, scraping is fine unless
they have terms of service that state otherwise. Sites often have terms of
service that you wouldn't notice unless you cared to write a scraper for
them. Google Local, for instance, has terms of service that make it clear
it's not for scraping.

I have scraped a number of sites to 'repurpose' in other ways. If you're
betting the farm on it, talk to a copyright lawyer. If there are lots of sites
with the same data, I don't think anyone will notice or care.

