

Package Data Like Software - palewire
https://source.opennews.org/en-US/articles/pluggable-data/

======
otakucode
Whew, reading the first few paragraphs after seeing the title started to scare
me. I was afraid they were going to advocate locking data up inside of a
proprietary app and only releasing that to the public in place of releasing
the raw data!

I ran into this years ago with the IMDB dataset. It appears to be formatted
such that it aggressively resists sane parsing. Of course, I expected to want
to update the data and whatnot, so I built code to download the data files or
updates, parse them, and put them into a Sane Format (in my book, only CSV and
JSON qualify right now). Then I wrote a simple tool to take any generic JSON
and create tables from it and insert all the data. This always seemed like the
right thing to do. Just hacking the file into a usable format and plunging
ahead with analysis seemed like a bad option to me, but I take it from this
article that it's the common approach?
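A minimal sketch of the kind of generic JSON-to-tables tool described above, assuming flat records and sqlite3 (the commenter's actual tool isn't shown, so the function name and schema strategy here are made up):

```python
import json
import sqlite3

def load_json_to_table(db_path, table, records):
    """Create a table covering every key seen in a list of flat JSON
    objects, then insert all the rows."""
    conn = sqlite3.connect(db_path)
    # Union of all keys, so records with missing fields still fit.
    columns = sorted({key for rec in records for key in rec})
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(rec.get(c) for c in columns) for rec in records],
    )
    conn.commit()
    return conn

conn = load_json_to_table(":memory:", "movies", json.loads(
    '[{"title": "Heat", "year": 1995}, {"title": "Ronin", "year": 1998}]'
))
```

A real version would infer column types and handle nested objects; this only covers the flat case.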

It may just be an artifact of the kinds of systems I've worked on (bank, govt)
but I'm not comfortable unless 'deployment' consists of executing 1 script
which can take a system from absolute barebones (no DB schema, no existing
tables, no prearranged libraries, nothing) to production-ready. What if you
have a catastrophe and your backups are hosed? What if you want to spin off a
new environment for testing? The idea that deployment depends on pre-existing
state with an assumed history, or that after the system deploys someone has to
grab some scripts out of their home directory and remember to apply them (and
in the right order) before things can get going, just terrifies me. What if
that employee gets a brain tumor? I suppose
it doesn't matter quite as much if your system being down for 5 minutes
doesn't result in a report on the national news and impact hundreds of
millions of people, but still... don't most people have a personal investment
in knowing their system isn't just an array of spinning plates with a chasm of
chaos awaiting an earthquake?
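One common way to get that one-script, barebones-to-production property is an idempotent migration runner that records what it has already applied, so ordering lives in the repo instead of in someone's head. A hedged sketch using sqlite3 (the ledger table and file layout are invented for illustration):

```python
import sqlite3
from pathlib import Path

def bootstrap(db_path, migrations_dir):
    """Take a database from absolute barebones to the current schema in
    one call: apply every *.sql migration exactly once, in filename
    order, recording what ran so reruns are no-ops."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS applied (name TEXT PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT name FROM applied")}
    for script in sorted(Path(migrations_dir).glob("*.sql")):
        if script.name not in done:
            conn.executescript(script.read_text())
            conn.execute("INSERT INTO applied VALUES (?)", (script.name,))
    conn.commit()
    conn.close()
```

Running it twice is safe, which is exactly the property you want after a catastrophe or when spinning up a fresh environment.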

------
Blahah
Beautiful idea, not dissimilar from dat [0] (if you haven't already, you guys
should talk).

I find the Django relationship to be an odd choice - the vast majority of
people working with data are not using Django. Why pair the two?

[0]: dat-data.com

~~~
palewire
I love dat and find its broad promise very appealing, though I'm not sure
exactly how to pull off all the blocking and tackling our project needs within
its framework.

In addition to downloading the daily file from the state, the file has to be
unzipped and transformed before it's ready for loading. Any advice you have
would be appreciated.
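The download/unzip/transform pipeline described above could be sketched roughly like this; the state export's real URL and layout aren't given in the thread, so the tab-delimited format and key normalization here are assumptions:

```python
import csv
import io
import urllib.request
import zipfile

def fetch(url):
    """Download the day's zipped export as raw bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def transform(zip_bytes):
    """Unzip the export and rewrite each tab-delimited file inside as a
    list of dicts with normalized keys, ready for a loader."""
    archive = zipfile.ZipFile(io.BytesIO(zip_bytes))
    rows = []
    for name in archive.namelist():
        with archive.open(name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
            for row in csv.DictReader(text, delimiter="\t"):
                rows.append({k.strip().lower(): (v or "").strip()
                             for k, v in row.items()})
    return rows
```

Splitting fetch from transform keeps the transform step testable without hitting the network.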

------
jboggan
A lot of the nitty-gritty data munging and processing often gets discarded
after a project or never included in the project repo in a meaningful way. I
like Drake [0] because we used it a lot at Factual and it really made data
generation and formatting very repeatable and easy.

I really think the packaging system the author is going for would be best
built on top of Drake or a similar workflow management program. Instead of
following their laundry list of configuration steps one could manage that
automatically with a source-controlled workflow. Drake has the advantage that
non-linear and async workflows are pretty easy to build, maintain, and update.

What I would love to see is a data package manager that downloads the raw data
and processing workflow, updates any software packages needed to run the
workflow, and then spits out the data in the form you need it, whether
CSV/TSV/JSON/etc. I don't know much about dat yet, but it looks like it would
be a good end-point for serving the data as well.

[0]: https://github.com/Factual/drake
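The final "spits out the data in the form you need it" step of such a data package manager might look like this minimal sketch (the function name and supported formats are hypothetical, not from any existing tool):

```python
import csv
import io
import json

def emit(records, fmt):
    """Serialize processed records in whichever output format the
    caller asked for: 'json', 'csv', or 'tsv'."""
    if fmt == "json":
        return json.dumps(records)
    delimiter = {"csv": ",", "tsv": "\t"}[fmt]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=sorted(records[0]),
                            delimiter=delimiter)
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()
```

The interesting parts (fetching the workflow, pinning its software dependencies) are omitted; this only shows the output contract.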

~~~
palewire
Hey, thanks for your thoughts. I like the idea of a data-specific packaging
platform. In our case, we're using the Django and Python system because it's
one we're familiar with, but I'm open to considering other options and would
love to learn more about them.

------
mshron
Love the idea!

I would ask for a little more separation of concerns. One package for raw but
cleaned data with a collection of schemas, and a second for loading arbitrary
data + schemas into Django (and probably accomplishing all of the extra
administrative steps provided in the example).

That way if I want to add other schemas for a non-Django use in the same
package (say if I care more about analysis than clicky-interfaces) or not use
Django, I can still use a package manager for the same data.

~~~
palewire
I agree that something more generic and less Django-specific has a lot of
appeal; I'm just not sure I know exactly how to pull it off.

Our experiment in the post is based on our previous packaging experience,
which is largely limited to pip and setup.py.

~~~
scottlocklin
You haven't lived until you've been confronted with a Python package failure.

------
palewire
A humble suggestion from your friends at the California Civic Data Coalition

