
AggData: Datasets created from scraping the web - adamhowell
http://aggdata.com/
======
weirdwes
Before the App Store hit and iPhone web applications were all the rage, I
started working on a restaurant locator. Oddity Software was a company I came
across that provides datasets like the ones from AggData, though I'm not sure
if it's from scraping the web. I figured I'd give it a mention in case people
came here searching for additional resources. They definitely have more in the
way of free lists (<http://www.odditysoftware.com/free_lists.html>), though I
can't personally vouch for accuracy or timely updates as I haven't used them.

Listable (<http://www.listable.org>) is another list-type service, though its
lists are much less complex and are user-created.

I'll be adding AggData to my bookmarks, though. I could see myself using at
least one of their "FreeData" lists in the future and possibly some of their
paid ones.

------
aggdata
Wow, this discussion is way deeper than we have ever gotten into at AggData.
In fact, "frig", I think we may need to hire you. :) We have been very
particular in the type of data we collect for some of these very reasons, and
we feel that the location data was enough in the public domain to protect us
from infringement allegations. We don't currently have much in place to pursue
those trying to resell our data, and it hasn't really been a problem yet. As
mentioned, I think it doesn't make much economic sense.

A couple of other quick responses: yes, we know our search is kind of lacking
now, and we're working to fix it. Also, we have major plans of offering bulk
data and specific regional data; we're currently just working on expanding our
library, though.

Thank you, everyone, for your insight! -Chris Hathaway, AggData LLC

(and seriously, frig, send us a message on our contact page, I have more
questions for you)

------
toppy
Hey, AggData guys, why not change your business model and sell your data in
bulk? Wouldn't it be nice to use it that way?

    # Hypothetical API: no such module exists; this is a wish, not real code.
    from aggdata.dealership_locations import cadillac

    print("Cadillac Dealers in NY:")
    for loc in cadillac:
        if loc.city == 'New York':
            print(loc.address, loc.phonenumber)
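
Since the `aggdata` module above is hypothetical, here is a rough sketch of the same lookup against a plain CSV download, using only Python's standard library. The column names (`city`, `address`, `phone`) and the sample rows are made up for illustration; they are assumptions, not AggData's actual schema.

```python
import csv
import io

# Hypothetical sample standing in for a purchased dealer-location file.
SAMPLE_CSV = """name,address,city,state,phone
Example Cadillac A,1 Main St,New York,NY,555-0100
Example Cadillac B,2 Elm St,Albany,NY,555-0101
Example Cadillac C,3 Oak Ave,New York,NY,555-0102
"""


def dealers_in_city(csv_file, city):
    """Return (address, phone) pairs for rows whose city column matches."""
    reader = csv.DictReader(csv_file)
    return [(row["address"], row["phone"]) for row in reader if row["city"] == city]


if __name__ == "__main__":
    print("Cadillac Dealers in NY:")
    for address, phone in dealers_in_city(io.StringIO(SAMPLE_CSV), "New York"):
        print(address, phone)
```

With a real file you would pass `open("dealers.csv", newline="")` instead of the `io.StringIO` wrapper; the file name is likewise just a placeholder.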

------
timmaah
Their "FreeData" sets could use some attention.

I realize it is _free_ , but if you are going to have it as an example of what
you do, have it correct and up to date.

The headers for the congress data are completely off, and the data is not
current: Franken, Kennedy, etc.

------
vijayr
If you are interested in data, here are some sites to get it from:

<http://theinfo.org/get/data>

<http://infochimps.org>

[http://developer.amazonwebservices.com/connect/kbcategory.js...](http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=243)

<http://ckan.net>

EDIT: Comprehensive list here:
[http://www.datawrangling.com/some-datasets-available-on-the-...](http://www.datawrangling.com/some-datasets-available-on-the-web)

------
utnick
I have looked into starting a business like this before. There are quite a few
of them, and I do like scraping.

But don't you have to break a lot of 'terms of use' agreements to scrape this
data? Could you get in legal trouble for that?

~~~
cschneid
Maybe, but there's a decent legal argument that it's not copyrightable data
(facts of where stuff is for example), and that it's publicly posted ("find a
store near you" links).

I suppose it gets more and more fuzzy if you move into non-fact data, and you
open yourself up more to lawsuits.

~~~
lonestar
By the same legal argument, couldn't anyone buy one of AggData's datasets and
then publish it for free on the internet?

~~~
frig
Yes + no.

NO: The kicker is that if you did that they'd sue you for breach of contract.

YES: once you passed it off to third parties (in violation of contract) it's
not clear how strong of a remedy AggData actually has.

Some of the notions + folk wisdom about copyrightability of facts are based on
assumptions that increasingly don't hold.

EG: you can't really copyright the factual contents of the phonebook; if I
want to compete with the existing phonebook publisher by retyping the phone
book they don't have a copyright claim against me (provided I only transcribe
the facts, and organize the information in an obvious or mechanical fashion,
like alphabetical ordering).

If you were instead to compete with an existing phone book company by
literally _xeroxing_ their product they could probably take you to trial (on a
theory that the underlying _facts_ aren't under copyright but the specific
page layouts and so on are; there's also the issue of the ads you'd be
xeroxing but let's not muddle things overmuch).

If this has already happened and been litigated I've never heard of it.

What something like AggData is doing shows some of the conceptual limits of
the existing framework:

\- existing physical instantiations of abstract "databases" (collections of
fact)

\-- (1) couldn't be economically "xeroxed" (EG: if you do it cheaply it is
visibly inferior-looking; if you do a very high fidelity reproduction it's
about as costly as just re-doing it from scratch)

\-- (2) had enough "wiggle room" in how they might be represented in a human-
friendly medium such that:

\--- (A) on the one hand there's the possibility of a viable copyright claim
against a "xeroxer" (under the theory that the page layout is under copyright
even if the facts themselves are not)

\--- (B) on the other hand _allowing_ for the possibility of "retypers" to
actually take advantage of the not-copyrighted status of the underlying facts
and actually produce a different product (b/c it is _possible_ to reproduce
the same facts with a format sufficiently-different from the source you drew
them from)

\- but as the "database" becomes increasingly digital

\-- (1) "xeroxing" is very economical (far more so than "retyping")

\-- (2) there's increasingly less meaningful "wiggle room" as to how the facts
might be represented in a "database"; changes-of-format don't do much, but
once the data is shorn of its human-friendly formatting all the useful ways of
storing it are essentially isomorphic, meaning (2.B) above is increasingly
unlikely (it may no longer be possible to "clone" the abstract data without
being too close to the exact format of the source for legal protection).

If someone's actually seen these issues played out or "settled" I'd love to
learn more about it.

~~~
cschneid
To summarize for my own learning, you can't copyright facts themselves, but
you can copyright the representation of facts (page layout, etc).

The question then is if a CSV file counts as "page layout". How customized
does something have to be, before it is a creative work? I doubt CSV goes far
enough. I wonder if the ordering and presence of columns in the CSV is enough?

I believe I heard a story about a dictionary suing based on the page numbers
being copyrightable, but I can't verify that via google. If that is in fact
true, then perhaps a CSV is a creative work protected under copyright.

Boy is copyright broken when faced with computers....

~~~
frig
Yeah, basically accurate so far as I know (disclaimer: for a while I worked @ a
publisher of reference works; we had rules-of-thumb explanations for what we
could get away with but I'm not an expert).

The real story behind the story you might've heard would've been more like a
page-layout type defense with the fact that the contents of each page (by #)
were identical between two competing works; throw in the usual misreporting
and there you go.

If you track it down please do share.

The CSV thing is really where it gets tricky: generally "trivial" obfuscations
or rearrangements to get around existing laws don't fare well in court (and
sadly even "trivial" isn't really well-defined), so if a CSV "counts" then
rearranging the columns (or randomly sorting the rows, etc.) won't help you.
But it's not obvious that a CSV doesn't count either.
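
To make the "rearranging won't help" point concrete, here is a small Python sketch, with made-up sample data, showing that a CSV with reordered columns and shuffled rows reduces to exactly the same set of facts as the original. The `canonical_facts` helper is a hypothetical name for illustration, and this is only one way to compare such files; it says nothing about how a court would actually treat them.

```python
import csv
import io


def canonical_facts(csv_text):
    """Reduce a CSV to an order-independent set of facts.

    Each row becomes a frozenset of (column, value) pairs, so neither
    column order nor row order affects the result. Exact duplicate rows
    collapse into one entry.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {frozenset(row.items()) for row in reader}


# Hypothetical "source" file.
ORIGINAL = """name,city,phone
Store A,Austin,555-0100
Store B,Boston,555-0101
"""

# The same facts with columns reordered and rows shuffled.
REARRANGED = """phone,name,city
555-0101,Store B,Boston
555-0100,Store A,Austin
"""

if __name__ == "__main__":
    # Prints True: the two files differ as text but carry identical facts.
    print(canonical_facts(ORIGINAL) == canonical_facts(REARRANGED))
```

Once the human-friendly layout is gone, any rearrangement like this is trivially reversible, which is why it is hard to call it a different "representation".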

Basically the model is supposed to be:

\- you see printed pages containing representations of facts

\- you "learn" the facts (eg: "copy" them to your brain)

\- you produce printed pages containing representations of facts

...and of course there may be _incidental_ and/or _unavoidable_ resemblances
between the two representations, but insofar as what you "copied" was just the
facts, it was fair game.

It's sort-of tacitly assumed that if you did make an exact copy it'd be pretty
obvious and that if you went through the above process what you did would look
_different enough_ to be pretty obvious also (unless you deliberately cloned
something the hard way, which is stupid enough to be rare).

With straight-digital "database" dumps (like the CSV) you have a situation in
which if you went through the full process you'd create something that's
pretty much indistinguishable from what you'd get if you just hit ctrl-d; this
pretty much breaks the rules of thumb / intuition behind the rules around
"facts".

~~~
cschneid
I think I found it, it was a law book.

[http://lists.essential.org/1995/info-policy-notes/msg00019.h...](http://lists.essential.org/1995/info-policy-notes/msg00019.html)

~~~
frig
YUP, and it's not quite what I guessed but the situation is pretty similar:
non-copyrightable content (legal opinions), successful use of copyright claims
over the arrangement.

Incidentally a good historical anecdote; in addition to many aspects in-and-
around copyright I tend to think the thinking around public records laws is
_woefully_ antiquated (to our general detriment).

Thanks for finding that.

------
mitko
Only "Locations" kinds of data? And the search is awful: it couldn't find
anything for McDonalds, for example
(<http://aggdata.com/search/node/McDonalds>)

I was hoping to use it as a possible alternative to
<http://archive.ics.uci.edu/ml/> for ML data sets but now I am kind of
disappointed.

~~~
jrwoodruff
I just used the search to find 'mcdonald' and it came right up with the
McDonalds data.

~~~
timmaah
He forgot the '

searching for mcdonald's works fine as well.

