

It shouldn't take 64 lines of code to do something really simple - john_horton
http://onlinelabor.blogspot.com/2011/10/all-public-government-data-should-be.html

======
georgemcbay
I don't doubt this guy's assertion that the data could be formatted better,
but if I were working on something like this I'd just be glad that I could
access the data at all.

And 64 lines of code... bfd? That's much better than the bad old days of
custom binary data formats, where you had to write a thousand-line data loader
in C, worrying about issues like LSB vs. MSB byte order, whether the number
storage properly follows IEEE 754, etc.
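(For the curious, a minimal sketch of the byte-order bookkeeping those old
loaders had to get right; the record layout here is invented for
illustration:)

```python
import struct

# Hypothetical 12-byte record: a 4-byte int id followed by an 8-byte
# IEEE 754 double, written big-endian (MSB first) by the producer.
record = struct.pack(">id", 42, 3.14)

# Reading it back with the wrong byte order silently corrupts the data:
wrong_id, _ = struct.unpack("<id", record)
right_id, value = struct.unpack(">id", record)

assert right_id == 42
assert wrong_id != 42  # 42's bytes reinterpreted little-endian
```

Get the `>` vs `<` prefix wrong and nothing crashes; you just quietly load
garbage.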

I guess I'm just...old, because my first response to 'I had to write 64 lines
of Python code to get this data into shape' is to shake my head and think
"Kids these days!".

~~~
repiret
I wish I could parse everything I've been asked to parse with 64 lines of
Python.

FWIW, I find it ironic that the author is complaining about whitespace-implied
structure in his source data, yet uses Python.

~~~
john_horton
My point was not that this was absolutely hard, but rather that's way too hard
for the totally mundane thing I was trying to accomplish---namely getting
public data from a US-funded statistical agency into a form suitable for
further statistical analysis (which is the whole reason they make this data
public).

Also, I don't see how it's ironic that I used Python---it seems irrelevant.
Whitespace in Python has a well-defined, commonly understood purpose; using
tabs or spaces to indicate hierarchical data relationships is not at all
standard, presumably because it creates messy dependencies across rows of
data.
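(A toy sketch of what those messy dependencies look like; the labels are
invented, not the actual BLS layout. A parser for indentation-implied
hierarchy has to carry parent state from row to row, so each row's meaning
depends on every row above it:)

```python
# Toy example: each row's depth is implied by leading spaces, so parsing
# must track the current ancestors across rows.
rows = [
    "Total",
    "  Men",
    "    Employed",
    "  Women",
]

def parse(rows):
    """Attach each row to its parent via an indentation stack."""
    stack = []  # (depth, label) of the current ancestors
    out = []
    for row in rows:
        depth = len(row) - len(row.lstrip(" "))
        label = row.strip()
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parent = stack[-1][1] if stack else None
        out.append((label, parent))
        stack.append((depth, label))
    return out

print(parse(rows))
# Each tuple is (label, parent); drop or reorder a single row and the
# parentage of everything below it can silently change.
```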

~~~
lurker19
"Messy dependencies across rows" is exactly how Python whitespace works. That
data file exactly Pythonic indentation in the main content section.

~~~
john_horton
The reason indentation-implied structure doesn't make sense for datasets is
that in data analysis you are constantly subsetting and sorting data at the
row level---something you generally don't do to the lines of a computer
program.
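(A toy illustration of that point, with invented labels: sort
indentation-encoded rows and the structure they encoded is gone.)

```python
# Indentation encodes each row's parent, so any ordinary row-level
# operation that reorders or drops rows scrambles the tree.
rows = [
    "Total",
    "  Men",
    "    Employed",
    "  Women",
]

# Sorting is a perfectly normal thing to do to a dataset...
sorted_rows = sorted(rows, key=str.strip)

# ...but now "Employed" no longer sits under "Men"; the hierarchy that
# existed only in the row ordering is unrecoverable.
print(sorted_rows)
```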

------
mattdeboard
FWIW the greater Department of Labor (of which the BLS is a subordinate unit)
has a pretty okay web API with several "SDKs" (most of which are API wrappers
for particular languages). Weirdly, there's no Python wrapper yet, but I
wrote one, as well as another for Clojure:

<https://github.com/mattdeboard/python-usdol>
<https://github.com/mattdeboard/clj-usdol>

edit: A couple of the BLS's databases are available through the DOL API:

<http://developer.dol.gov/BLS_Numbers-DATASET.htm>

"This Dataset contains historic data (last 10 years) for the most common
economic indicators. More information and details about the data provided can
be found at <http://bls.gov>."

<http://developer.dol.gov/DOL-BLS2010-DATASET.htm>

"The Bureau of Labor Statistics produces occupational employment and wage
estimates for over 450 industry classifications at the national level."

That last one is actually a pretty interesting dataset. I pulled it to tinker
with at work last week.

------
woodgears
Given the crazy complicated XML schemas people come up with, I'd say you are
lucky it's just plain text, and you can parse it without needing a library.

------
jrockway
Ironically, at work we buy a bunch of BLS datasets from a third-party vendor.
Their format is even worse: an opaque binary database and a win32 DLL that
reads it. 1000 lines of Haskell later, it sort of works...

(This is mostly because they chose to represent dates as integers, and they
have four possible date types: yearly, quarterly, monthly, and daily. 1901, in
their exciting world, could mean "year 1901" or it could mean "month January
1919". That's nice work, Lou.)

~~~
lurker19
Sounds like Excel.

Where do you work that uses Haskell in the office?

~~~
jrockway
I work for BofA, but I'm pretty sure I'm the only one who uses Haskell.

I used Haskell because I hate Windows, and Haskell lets me do all my work in a
maximized Emacs session without having to install anything other than Emacs
and the Haskell Platform. "When life gives you Windows, don't make lemonade.
Use Haskell."

------
jackfoxy
Yup, I worked on this very data problem (bad data sources, especially
government data) for a long time, but gave up as I didn't see it as the road
to riches. Toward the end I hit upon an AI solution I'm sure would work, but
I never finished it; it's still on the shelf.

------
zmanji
Isn't this the problem that data.gov is trying to solve?

~~~
john_horton
I'm guessing that's the goal, but it's only going to be as good as the inputs
provided by the different agencies.

For example, if you look for the time series I was after on data.gov, the
"Download CSV" button brings you to...the BLS page, with all the problems I
discussed.

<http://explore.data.gov/Labor-Force-Employment-and-Earnings/American-Time-Use-Survey/jgti-eqds>

------
lurker19
Try Data Wrangler <http://vis.stanford.edu/wrangler/>

------
snorkel
I don't understand the headline; the Python code in the article is not so
terrible considering the input.

~~~
smokinn
His point is precisely that the input is awful.

~~~
snorkel
Yes, the article says the input is awful, but this headline implies the code
is awful.

~~~
hmottestad
+1

I was thinking something along the lines of: look at this stupid person who
needs 64 lines of code to do a simple insert into a binary tree.

------
hmottestad
Semantic web :) Solves everything, not kidding.

Although in a few years, when there are a million new standards, I guess I
can keep my job :)

~~~
hmottestad
Why am I downvoted? The Semantic Web does indeed solve this and many more
problems, and I'm the only one to mention it. The Semantic Web has standards
for data, standards for semantic mark-up, and the ability to connect multiple
datasets together. It also allows for standardisation of vocabulary, sharing,
and reuse.

