

Get your act together, Data.gov - cjoh
http://sunlightlabs.com/blog/2009/get-your-act-together-datagov/

======
wsprague
The problem with data.gov and sites like it is that they are built on faulty
premises about data:

1\. Fiction: Data doesn't require much work to make it usable, so we can
just upload whatever we have and it will be useful to somebody. Fact: the big
usable datasets (Census, IPUMS, NLSY, all the private marketing datasets)
have armies of people cleaning and integrating them. It costs money, it takes
time, and it is easy to screw up.

2\. Fiction: Links are worth something. Fact: links are worthless.

3\. Fiction: XML adds value. Fact: ASCII tab-delimited files in consistent
formats add value, while XML SUBTRACTS value.

4\. Fiction: a good dataset is easy to use. Fact: even a good dataset (google
IPUMS for an example) takes a lot of work to learn to manipulate, presuming
one can use some sort of statistical programming language in the first place.

5\. Fiction: simple summaries of common data are useful. Fact: everybody
has already done the simple summaries. (This is just a bonus item, and doesn't
apply to data.gov, but does apply to faulty thinking about data in general.)

6\. Fiction: Federated data is just fine. Fact: Data that is curated, cleaned,
and integrated into one big monolithic package is FAR better, because an
analyst can then learn the conventions and names in one place, and
parallel categories are more likely to align.

7\. Fiction: Good data is easy for a layperson to use. Fact: good data still
requires a lot of skill. Well, maybe in nations with decent public schools a
layperson can do something with data, but not in the US.
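The claim in point 3 can be made concrete with a quick sketch (all records and figures are invented for illustration): the same two rows read from tab-delimited ASCII with one standard-library call, versus rebuilt by hand from equivalent XML.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The same toy records (hypothetical county population figures,
# invented for illustration) in both formats.
tsv = (
    "county\tyear\tpopulation\n"
    "Multnomah\t2008\t717880\n"
    "Washington\t2008\t519542\n"
)
xml = (
    "<rows>"
    "<row><county>Multnomah</county><year>2008</year><population>717880</population></row>"
    "<row><county>Washington</county><year>2008</year><population>519542</population></row>"
    "</rows>"
)

# Tab-delimited: one call, rows come back as dicts keyed by the header.
tsv_rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

# XML: walk the tree and rebuild the same dicts by hand.
xml_rows = [{child.tag: child.text for child in row} for row in ET.fromstring(xml)]

assert tsv_rows == xml_rows  # identical content, very different effort
```

The two parses yield identical content; the tab-delimited path is shorter, and the file itself is grep-able and diff-able.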

What I WOULD like is the following (taken from another post, now deleted):

An ideal data.gov would have a lot of staff who put together a few integrated
and curated datasets from the agencies. These would be hierarchies of data in
a few formats (shp, txt, raster, SQL text dump, and ...?), along with well
written codebooks and narrative READMEs. They would be distributed using git
or subversion. The staff would have the expertise to make such nice data
packages for you and me, and they would have the political oomph to demand
that the agencies release the data to them. The staff would also give classes
on how to use the data with some open-source statistical packages to do
useful work. Good examples of curated data that I know of are IPUMS and the
Portland Metro's RLIS (both google-able).
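The package shape described above might look something like the following sketch (all names, paths, and figures are hypothetical): a narrative README, a codebook, and a tab-delimited data file, with the codebook machine-readable enough that an analyst can use it to type the columns.

```python
import csv
import os
import tempfile

# Hypothetical layout for one curated package, invented for illustration:
#
#   population-2008/
#     README.txt          narrative: sources, caveats, contact
#     CODEBOOK.txt        one line per column: name <tab> type <tab> description
#     data/counties.txt   tab-delimited ascii, header row, one record per line
#
# Build the skeleton in a temp dir, then read it back the way an analyst would.
root = os.path.join(tempfile.mkdtemp(), "population-2008")
os.makedirs(os.path.join(root, "data"))
with open(os.path.join(root, "CODEBOOK.txt"), "w") as f:
    f.write("county\tstr\tcounty name\n"
            "year\tint\tcensus year\n"
            "population\tint\tresident persons\n")
with open(os.path.join(root, "data", "counties.txt"), "w") as f:
    f.write("county\tyear\tpopulation\nMultnomah\t2008\t717880\n")

# The codebook is machine-readable too: use it to type the columns.
with open(os.path.join(root, "CODEBOOK.txt")) as f:
    types = {name: {"str": str, "int": int}[t]
             for name, t, _ in csv.reader(f, delimiter="\t")}
with open(os.path.join(root, "data", "counties.txt")) as f:
    rows = [{k: types[k](v) for k, v in row.items()}
            for row in csv.DictReader(f, delimiter="\t")]

assert rows[0] == {"county": "Multnomah", "year": 2008, "population": 717880}
```

The point of the convention is that the codebook, not tribal knowledge, carries the column semantics; a whole hierarchy of such packages could then be versioned and distributed with git or subversion as described.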

~~~
jfager
I don't understand what you're getting at with this list.

1\. Yes, datasets need to be cleaned. But you need to have the dataset before
you can clean it, and different people will want to clean it in different
ways. Get it up there first, and keep the political debates confined to the
gathering methods. Griping about raw datasets only gives them an excuse to
keep delaying putting anything out (in other words, this critique is actively
harming the movement, please stop making it).

2\. I don't understand what you mean by this. If a link points to a high-
quality dataset that's otherwise hard to find, then it's very valuable.

3\. Not all data is expressible in tab-delimited ascii tables. I'd like my SEC
filings in well-structured XML, for instance.

4\. This is a strawman. Nobody serious has ever said a good data set is easy
to use and understand.

5\. Ironically, this is the one point you make that I agree with, and then you
claim it doesn't apply to data.gov. I think this is actually the worst thing
about data.gov right now, that they think they're giving us anything when they
post their little summaries. Give us the raw data, please.

6\. Isn't this just restating a combination of #1 and #3? Yes, big clean
monolithic data sets are nice, but the priority is getting access to the data
in the first place.

7\. You're restating #4, which was a strawman.
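The counterpoint in #3 is easy to illustrate with a toy fragment (the filing, names, and numbers are all invented): one filer with a variable-length list of transactions. That nesting has no single natural flat tab-delimited table; you either duplicate the filer fields on every row or split the data across linked files.

```python
import xml.etree.ElementTree as ET

# A toy, invented filing fragment: one filer, a variable-length
# list of transactions. Hierarchical, so XML fits it naturally.
doc = """
<filing>
  <filer><name>Example Corp</name><cik>0000000000</cik></filer>
  <transactions>
    <transaction><date>2009-05-01</date><shares>100</shares></transaction>
    <transaction><date>2009-05-02</date><shares>250</shares></transaction>
  </transactions>
</filing>
"""

root = ET.fromstring(doc)
filer = root.find("filer/name").text
shares = [int(t.find("shares").text)
          for t in root.findall("transactions/transaction")]

assert filer == "Example Corp"
assert shares == [100, 250]
```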

~~~
wsprague
OK, we disagree. Except that #4 IS sort of redundant, though I want to make
the point that data is almost impossible for a layperson to use, and still
really hard for a practiced analyst.

~~~
jfager
I actually meant it when I said I didn't understand what you were getting at.
I initially read it as you saying that there shouldn't be a data.gov at all
(because raw data's useless, curated data's expensive and difficult, and
simplified data summaries are likely to be misinterpreted by lay people), but
that can't be right, so I'm really curious what you were actually trying to
say. What would an ideal data.gov look like, to you?

~~~
wsprague
I moved my reply to my top comment and screwed up the reply tree here. So this
is for the comment that follows this one:

data.gov is fundamentally flawed, and won't be anything but annoying until it
is reworked into something along the lines of what I suggest. Or so I think...

