
We have an API (Please give us a database) - Hagelin
http://nat.org/blog/2010/03/we-have-an-api/
======
epi0Bauqu
Wow, I wish all API owners would read this. I'm always going around asking for
dumps. I need them for speed & scalability reasons. It's even OK with me if
your dump is in a weird format--I'll deal with it, but I'd like a dump.

~~~
jrockway
_It's even OK with me if your dump is in a weird format_

You would love the financial industry. Standards? Hahahahahaha.

~~~
icey
In 2003 I was working with a national bank on a project, and we needed to get
data dumps from them. They told us they would be happy to provide the data in
any format we requested.

"Great!" we thought, and asked them to send us the data in XML since we
already had a process for handling it that way.

There was a moment of silence as they muted their side of the conference and
when they came back, the IT manager who was heading up this project on their
side goes "okay but... _what is XML_?".

At that point I realized that bank IT is where government IT workers go when
government work is too demanding.

~~~
jrockway
_At that point I realized that bank IT is where government IT workers go when
government work is too demanding._

My experience is different. The only non-OSS programmers I've ever met with a
clue work at banks.

------
kungfooey
With historical data like this you would think that providing it as a download
would also be in the best interests of the provider. It would reduce the
number of total requests against the API. If the download were offered as a
torrent, I could see a pretty significant savings in bandwidth costs.

It's also possible that these companies (particularly the old-school types who
make the decisions) prefer to keep the illusion of a monopoly on these data.

~~~
wizard_2
I think you hit the nail on the head. It doesn't matter that people can create
their own databases; the NYT can still revoke access and stop anyone from
using their API. Oddly, it smells of the ideology behind DRM: maintaining the
illusion of control.

------
dan_sim
The main reason APIs have become so popular is that they're now easy to build.
Rails and most other frameworks have built-in support for them.

The other day, I thought about providing a SQL dump for a project, but since
few people actually do it, there's not much information available. Do I
"reformat" the database? Is providing dumps a security risk? Do I keep the
current IDs or use new ones?

On the other hand, this article is maybe the kick in the butt that makes me
want to answer these questions.
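One answer to the ID question is to remap internal IDs to a fresh sequence and drop sensitive columns when producing the dump. A minimal sketch, using a hypothetical `items` table with an invented `secret` column standing in for whatever shouldn't leave the building:

```python
import sqlite3

# Hypothetical live database: internal ids and a sensitive column.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, secret TEXT)")
src.executemany("INSERT INTO items (id, title, secret) VALUES (?, ?, ?)",
                [(101, "first", "s1"), (205, "second", "s2")])

# The dump database: same shape, minus the sensitive column.
dump = sqlite3.connect(":memory:")
dump.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")

# Remap internal ids to a clean 1..N sequence so the dump doesn't
# reveal row counts, gaps, or deletion history.
for new_id, (old_id, title) in enumerate(
        src.execute("SELECT id, title FROM items ORDER BY id"), start=1):
    dump.execute("INSERT INTO items (id, title) VALUES (?, ?)", (new_id, title))
dump.commit()
```

The remapping table (`old_id` -> `new_id`) can be kept privately if dumps need to stay consistent across releases.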

------
idm
Well, a database dump is fine if you have some new metadata to manage it.
Basically, a versioned schema ... Imagine that you end up wanting to
synchronize your DB dump with someone else's DB dump. The thing that's nice
about the API is that you can be loosely coupled with it. Having access to all
of the data is both a blessing and a curse, because you can't escape a
stronger coupling.

Or, if it's couchdb, then each object has its own revision. The parent blog
post is anticipated by one of the major use cases of couchdb in the "millions
of couches" vision. Having a nice RESTful API is great for some use cases, but
consider that you're on a high-latency connection. Here, it would be much
better to use a local, embedded database instead of the REST interface. Take a
look at couchdb - this is already possible.

------
joblessjunkie
Adding an API to a website is relatively simple -- you just add new URLs and
views to your existing site, leveraging the HTTP request infrastructure that's
already in place.

A database dump is an entirely different beast and requires an entirely new
process. Labor has to go into deciding the dump format, filtering the data,
and generating the dumps (or designing a system to generate them
automatically), and file hosting needs to be allocated. Whenever the schema
changes, the process has to be revised.

I'm not saying these things are hard, just that they don't fit smoothly with
what's already in place for most websites, so it becomes an additional
expense. Most companies aren't flush with cash and resources for every feature
under the sun.
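The schema-change problem can be softened by deriving the dump from the live schema instead of hard-coding it. A sketch, assuming SQLite and a hypothetical `articles` table; the idea is that new columns appear in the dump automatically:

```python
import csv
import io
import sqlite3

def dump_all_tables(conn):
    """Write every user table to CSV, reading column names from the
    live schema so a schema change doesn't silently break the dump."""
    out = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        cur = conn.execute(f"SELECT * FROM {table}")
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow([c[0] for c in cur.description])  # header row
        writer.writerows(cur)
        out[table] = buf.getvalue()
    return out

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, title TEXT)")
conn.execute("INSERT INTO articles VALUES (1, 'We have an API')")
dumps = dump_all_tables(conn)
```

This doesn't solve filtering or hosting, but it removes one of the recurring maintenance costs.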

------
trin_
yes! you could even charge a small amount of money for that! most of the time
i'd gladly pay a small amount instead of waiting several days for the
scrape/api to finish!

or you could scrape it and sell it on this yc datamarketplace thing ... was it
datamarketplace.com?

~~~
dotBen
"yes! you could even charge a small amount of money for that! most of the time
i'd gladly pay a small amount instead of waiting several days for the
scrape/api to finish!"

And you'd be surprised at the number of sites that will happily sell you a
dump of the data. It's just such a niche thing that site owners will rarely
advertise it on their site.

As geeks we tend to assume that if something is not mentioned on a website as
being for sale or available, then it must not be available. This is not true.

Always ask!

(but be prepared to pay a commercial rate license!)

------
jimbokun
This is an issue we constantly deal with in the PSLC DataShop.

<https://pslcdatashop.web.cmu.edu/>

(PSLC = Pittsburgh Science of Learning Center)

We are basically a repository for click data from computer tutors. Learning
researchers access the data to run analyses, test hypotheses, etc. to better
understand how people learn. Some analyses can be run in the DataShop, and we
provide tab delimited downloads of the data. We have also recently added a
REST API to the data.

There are privacy issues with the data, so we cannot just dump the entire
database and make it available to everyone. I'm wondering, though, if a
limited SQL dump might be more useful to some of our users than the tab
delimited files. If you were to imagine yourself wanting access to our data,
what would you ask for?

We would also like to find ways to encourage people to share more of their
learning data and analyses through the DataShop. We are planning to allow
researchers to post analyses through our REST API. Beyond that, what do you
think would make it more convenient for you to share your data through the
DataShop (again, imagining yourself to be a learning researcher)?

One last note, some of our data is being used for this year's KDDCup.

<https://pslcdatashop.web.cmu.edu/KDDCup/>

If you're interested in machine learning challenges, registration starts April
1.

------
IgorPartola
This is exactly why I provide a nightly dump of the data for one of my pet
projects (<http://igorpartola.com/projects/disciddb>). For one, it makes it
easy for me to make off-site backups. For another, it ensures longevity of my
project. If something happens to me, it's easy for someone to grab the source
code (it's OSS) and the data dump and run their own server.
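A nightly dump job along these lines can be very small. A sketch using SQLite's online backup API (a production setup would more likely wrap `pg_dump` or `mysqldump` in a cron job; the `discids` table here is invented for illustration):

```python
import sqlite3

# Stand-in for the live application database.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE discids (id INTEGER PRIMARY KEY, disc TEXT)")
live.execute("INSERT INTO discids (disc) VALUES ('abc123')")
live.commit()

# In a real job this would be a dated file on disk, e.g. dump-2010-03-28.db.
snapshot = sqlite3.connect(":memory:")

# backup() takes a consistent copy even while the site keeps writing.
live.backup(snapshot)
```

The same snapshot file then serves double duty as the public dump and the off-site backup.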

------
ananthrk
A very well-reasoned comment from the article itself!

[http://nat.org/blog/2010/03/we-have-an-api/comment-page-1/#c...](http://nat.org/blog/2010/03/we-have-an-api/comment-page-1/#comment-6503)

------
seldo
A dataset is great if you want to analyze a large dataset (government data in
particular lends itself to this). However, an API allows you to build an
application around _data that doesn't exist yet_ : not last year's best seller
list, but next week's.

If all you had was a way to download the full dataset, the first thing you'd
ask for is "a way to download only the stuff that's happened since yesterday".
And that's an API.
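The point that "dump plus delta" quietly becomes an API can be made concrete. A minimal sketch with an invented in-memory dataset: one function serves both the full dump and the incremental feed, differing only in a `since` filter:

```python
from datetime import datetime

# Hypothetical dataset: records with a "published" timestamp.
records = [
    {"title": "old item", "published": datetime(2010, 3, 1)},
    {"title": "new item", "published": datetime(2010, 3, 28)},
]

def export(since=None):
    """Full dump when since is None; otherwise only newer records --
    the point at which a 'dataset download' becomes an API call."""
    if since is None:
        return records
    return [r for r in records if r["published"] > since]
```

A caller fetching `export()` once and then `export(since=last_fetch)` on a schedule is, in effect, consuming an API.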

~~~
Pheter
"So, keep calling for online APIs. But ask for downloadable datasets too."

It is not a case of one or the other. APIs are useful, but so are datasets.
Why not provide both?

------
dotBen
The other answer here is not to offer a SQL dump but to offer a query language
that lets the developer interrogate the data in other ways.

Most API sets are based on returning rows to the user, but what if they want
to make COUNT()- or UNION()-type calls?

This is why Yahoo!'s YQL and Facebook's FQL are so great.

The answer is to offer a query language on large data sets.
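One way to offer such a query language without handing out write access is to expose SQL through a read-only authorizer. A sketch using SQLite's authorizer hook, with an invented `posts` table (YQL and FQL are far more elaborate, but the gatekeeping idea is similar):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER, author TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(1, "icey"), (2, "jrockway"), (3, "icey")])

def allow_read_only(action, *args):
    # Permit SELECTs, column reads, and function calls like COUNT();
    # deny everything else (INSERT, UPDATE, DROP, ...).
    if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ,
                  sqlite3.SQLITE_FUNCTION):
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY

conn.set_authorizer(allow_read_only)

# An aggregate the usual row-oriented API can't express in one call:
counts = conn.execute(
    "SELECT author, COUNT(*) FROM posts GROUP BY author ORDER BY author"
).fetchall()
```

Any write attempt now fails at prepare time with "not authorized", so arbitrary user queries stay confined to reads.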

------
cmelbye
Twitter needs to do this, or at least something like its own version of
Facebook's FQL (which is a really outstanding way to get data, and even to
offload some of the compilation of the data).

------
aw3c2
Jamendo does this: <http://developer.jamendo.com/en/wiki/NewDatabaseDumps>

------
riobard
Does Hacker News have an API? Didn't find an official one :|

~~~
jrockway
It's like JSON, except it wraps all objects in <table>s.

