

Harvard puts metadata for 12M library items into the public domain - vgnet
http://www.hyperorg.com/blogger/2012/04/24/2b2keverythingismiscbig-data-for-books-harvard-puts-metadata-for-12m-library-items-into-the-public-domain/

======
guelo
"The records consist of information describing works—including creator, title,
publisher, date, language, and subject headings—as well as other descriptors
usually invisible to end users, such as the equalization system used in a
recording. "

I'm having a hard time thinking of what could be done with this data besides a
library catalog.

~~~
dfc
What is your point?

~~~
guelo
What is this data good for?

~~~
dfc
The first thing that comes to mind is that I should never have to type more
than a couple of characters into a citation manager when adding a book.

Outside of the numerous uses for citations, reading lists, etc., I imagine that
this is a very interesting dataset for researchers in publishing and library
sciences. It is also a great resource for anyone developing library-related
software.
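
As a rough illustration of the citation-manager point, here is a minimal sketch of prefix completion over a list of titles. It assumes the titles have already been extracted from the records into plain strings; the sample titles and the function names `build_index` and `complete` are hypothetical, not part of Harvard's API or dataset.

```python
from bisect import bisect_left


def build_index(titles):
    """Return (casefolded title, original title) pairs in sorted order."""
    return sorted((t.casefold(), t) for t in titles)


def complete(index, prefix, limit=5):
    """Return up to `limit` titles that start with `prefix`, ignoring case."""
    key = prefix.casefold()
    # A one-element tuple sorts before any pair sharing the same first item,
    # so this finds the first entry whose folded title is >= the prefix.
    start = bisect_left(index, (key,))
    matches = []
    for folded, title in index[start:]:
        if not folded.startswith(key):
            break  # sorted order: no later entry can match either
        matches.append(title)
        if len(matches) == limit:
            break
    return matches


# Hypothetical sample titles standing in for the real catalog.
TITLES = ["Moby-Dick", "Middlemarch", "Mrs Dalloway", "Ulysses", "Moll Flanders"]
INDEX = build_index(TITLES)

if __name__ == "__main__":
    print(complete(INDEX, "mo"))
```

With 12M records the sorted list would still fit in memory on most machines, though a real tool would probably use a database index or a search service instead.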

------
tar
Why not link directly to the official press release:
[http://isites.harvard.edu/icb/icb.do?keyword=k77982&page...](http://isites.harvard.edu/icb/icb.do?keyword=k77982&pageid=icb.page498373)

------
gvozd
It's about time they did this. Harvard's was about the only major library that
didn't allow Z39.50 access to their full MARC21 records. As a private
individual with a large rare and antiquarian book collection, I welcome the
news, since I've found that Harvard sometimes has the only other copy of a
book I'm cataloging. A few other libraries require you to jump through some
hoops to get to the data (British National Library, for example), but Harvard
was shutting everyone other than faculty and alumni out.

------
cbsmith
I hate that the article title says "Big data for Books".

Here's a hint on how you can get a sense of whether you are dealing with "Big
Data": _IF I CAN FIT IT ON A THUMB DRIVE, IT ISN'T BIG DATA_.

~~~
sanxiyn
The thing is, bibliography cannot be "big data" by that definition. 12
million items is nearly 10% of the Google Books catalog.

[http://booksearch.blogspot.com/2010/08/books-of-world-stand-...](http://booksearch.blogspot.com/2010/08/books-of-world-stand-up-and-be-counted.html)

------
Jun8
The direct links for API access and download (3.16GB) are given in the DPLA Dev
Blog: <http://blogs.law.harvard.edu/dplatechdev/>

------
dfc
It seems that BitTorrent would be the logical choice for distributing the
dataset. I wonder if this is an oversight or if they are not expecting many
people to download the dataset...

~~~
chimeracoder
Almost certainly the former, but this was my first thought as well. I don't
understand why more distributors don't at least provide this as an option; it
saves them the bandwidth costs...

~~~
mdaniel
Burnbit.com provides a service that will turn any open URL into a BitTorrent
seed.

------
kveykva
I'm not sure who would need to actually fill out the submission form. But
wouldn't this: <http://aws.amazon.com/publicdatasets/#3> be convenient for
working with a data set like this?

------
sparknlaunch12
Universities are doing some pretty cool stuff with data. Every tech uni is now
getting their students to work on social media data analysis. More exciting
than entity relationship diagrams...

------
ryan-guest
It's going to be interesting to see what people build and/or what analysis
they do with this data.

------
esonderegger
>Finally, note that Harvard asks that you respect community norms, including
attributing the source of the metadata as appropriate.

That's not what "public domain" means. If they wanted attribution, there are
licenses for that. "Public domain" means that it belongs to all of us now. In
that case, attribution is meaningless.

~~~
slapshot
It is possible to live life at a higher standard than simply the minimum legal
standard. In many communities, norms of attribution go well beyond what is
required by the copyright law.

For example, in academic writing, it is still considered plagiarism to copy
from public domain texts without attribution, even if it is not copyright
infringement to do so. Accordingly, academic writers attribute their quotes of
public domain texts, even though there's no legal need to do so.

Here, Harvard is asking nicely that people attribute the data. In academic
contexts, that's perfectly normal and reasonable. Anyway, it's just a request,
feel free to ignore it -- you're right that you are legally free to do
whatever you'd like.

