

Dat, open-source software, seeks to restart the open data revolution - gordon_freeman
http://www.wired.com/2014/08/dat/

======
danieldk
Can anyone explain how this is different from iRods, which is already in
production at many organizations, abstracts away the underlying data storage
systems, provides policy enforcement, authentication [1], and trigger rules?

[http://irods.org/](http://irods.org/)

Moreover, iRods is written in C++, which can be an advantage over Node.js at
various levels. First, because it is easier to provide interoperability
with other languages. Second, because many data centers are very conservative
(you often see CentOS/RHEL 3/4/5, or even SUSE Linux Enterprise Server), and
will not be happy to install the relatively bleeding-edge Node.js stack.

[1] In practice, a lot of scientific data is provided for non-commercial use
only. This is often a necessity, because the data was originally provided by a
commercial entity that doesn't want to hand the same data to competitors.
E.g. in NLP, a lot of treebanks are based on newspapers. They can often be
redistributed freely for non-commercial purposes, but not for commercial
purposes.

~~~
pbnjay
AFAICT, from my limited experience with irods and from reading through the dat
docs, irods is mainly a data distribution mechanism, whereas dat seems to be a
generic ETL framework (data extraction and munging).

------
rpedela
Based on a quick look at the documentation, it looks like it can read, write,
and store data. Is the functionality for versioning, diffs, different storage
backends, etc. there yet? I know it takes time to build these things; I am just
wondering how far along the project is.

I also have a concern about the fact that it is written in Node. JavaScript
stores all numbers as 64-bit floats, so large integers silently lose
precision, and I am curious how the project handles this.
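For illustration, the precision cliff is easy to show in a few lines. (BigInt
is used here as one workaround; it is built into modern Node and postdates
this thread, which is why bignum libraries came up at the time.)

```javascript
// All JavaScript Numbers are IEEE-754 doubles, so integers above
// 2^53 - 1 (Number.MAX_SAFE_INTEGER) silently lose precision.
const big = 9007199254740993;             // 2^53 + 1, not exactly representable
console.log(big === 9007199254740992);    // true: the literal rounded down

// Modern Node ships BigInt for arbitrary-precision integers:
const exact = 9007199254740993n;
console.log(exact === 9007199254740992n); // false: BigInt keeps the value exact
```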

~~~
sh1mmer
Node handles arbitrary-sized numbers easily. This is achieved either by using
Node's own Buffer type or TypedArrays, or via a binding such as BigNum, which
provides an interface to the bignum functions in Node's OpenSSL binding.

~~~
rpedela
> Node easily does arbitrary sized numbers.

No it doesn't. Having to use buffers is not easy and the BigNum OpenSSL stuff
is slow and limited (only integers). I have personally had a hell of a time
supporting PostgreSQL's numeric type in a Node web server. Can it be done with
Node? Sure, but it is not easy or fast.

If dat were just for moving buffers around, it would probably be okay, but it
wants to be the place for data transformations as well, which is what
concerns me.

------
terhechte
Here's a good explanation of what the software tries to achieve, from one of
the files in the GitHub repo. Much more informative than the Wired article:

Here's a concrete example: A police department in a city hosts an Excel
spreadsheet on their web server called Crime-2013.xls. It contains all of the
reported crime so far this year and gets updated every night at midnight with
all of the new crimes that were reported each day.

Say you wanted to write a web application that showed all of the crime on a
map. To download the new data every night you'd have to write a custom program
that downloads the .xls file every night at midnight and imports it into your
application's MySQL database.

To get the fresh data imported, you can simply delete your entire local crime
database and re-import all rows from the new .xls file, a process known as a
'kill and fill'.
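As a rough sketch, kill and fill amounts to two steps. (The `crimes` table,
the `db.query` interface, and the row shape are all hypothetical names for
illustration, not anything dat prescribes.)

```javascript
// Hypothetical nightly "kill and fill": wipe the local table, then
// re-import every row from the freshly downloaded spreadsheet.
function killAndFill(db, rows) {
  db.query('DELETE FROM crimes');              // kill: drop all local rows
  for (const row of rows) {
    db.query('INSERT INTO crimes SET ?', row); // fill: re-import each row
  }
}
```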

But the kill and fill method isn't very robust, for a variety of messy
reasons. For instance, what if you cleaned up some of the rows in the crime
data in your local database after importing it last time? Your edits would get
lost.

Another option is a manual merge, where you try to import each and every row
of the incoming Excel file one at a time. If the data in the row already
exists in the database, skip it. If the row already exists but the incoming
data is a new version, overwrite that row. If the row doesn't exist yet, make
a whole new row in the database.

The manual merge can be tricky to implement. In your import script you will
have to write the logic for how to check if an incoming row already exists in
your database. Does the Excel file have its own Crime IDs that you can use to
look up existing records, or are you searching for the existing record by some
other method? Do you assume that the incoming row should completely overwrite
the existing row, or do you try to do a full row merge?
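Sketched in code, the merge decision looks roughly like this (field names
such as `crimeId` and `updatedAt` are hypothetical; a real import rarely maps
this cleanly, which is exactly the point of the passage above):

```javascript
// Hypothetical manual-merge (upsert) logic for one incoming row.
// `db` is any Map-like store keyed by crime ID; `updatedAt` stands in
// for whatever versioning information the source actually provides.
function mergeRow(db, incoming) {
  const existing = db.get(incoming.crimeId);
  if (!existing) {
    db.set(incoming.crimeId, incoming);   // new row: insert it
  } else if (incoming.updatedAt > existing.updatedAt) {
    db.set(incoming.crimeId, incoming);   // newer version: overwrite
  }
  // otherwise the stored row is already current: skip it
}
```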

At this point the import script is probably a few dozen lines and is very
specific to both the police department's data as well as your application's
database. If you decide to switch from MySQL to PostgreSQL in the future you
will have to revisit this script and rewrite major parts of it.

If you have to do things like clean up formatting errors in the police data,
re-project geographic coordinates, or change the data in other ways, there is
no straightforward way to share those changes publicly. The best case scenario
is that you put your import script on GitHub and name it something like 'City-
Police-Crime-MySQL-Import' so that other developers that want to consume the
crime data in your city won't have to go through all the work that you just
went through.

Sadly, this workflow is the state of the art. Open data tools are at a level
comparable to source code management before version control.

[https://github.com/maxogden/dat/blob/master/docs/what-is-dat.md](https://github.com/maxogden/dat/blob/master/docs/what-is-dat.md)

------
gordon_freeman
Dat could let city governments create data visualizations for dynamically
changing real-time data from various entities without worrying about what
format the original data is in. This could really help cities become more
efficient. One application: a dashboard with real-time analytics of their
library systems, fire departments, crime zones vs. police stations, etc.,
even though all these entities keep their raw data in different formats and
even different database management systems.

------
fiatjaf
I like the idea behind Dat, but I totally hate its authors because they said,
somewhere on their page, some time ago, that their preference is for "academic
research data" (or something like that).

Why did they need to say that? I don't want a tool that has a preference for
something so stupid as academia.

But I'll probably forget this and start loving Dat if it manages to enable
this "open data revolution".

~~~
rwl
According to the article, the focus on scientific data is a product of funding
from the Sloan Foundation:

"Although Ogden's background is in city government, the Dat team is now
squarely focused on the needs of scientists. That's largely because of the
Sloan Foundation's focus. 'I don't come from a scientific background and
wasn't even thinking about science data,' he says. 'But they convinced me that
I should.' He explains that scientists have to deal with many of the same
issues with formats and tracking changes that city governments do. Using Dat,
Ogden says, much of this complexity could be abstracted away, at least for
some users of the data."

I don't think this is a reason for hating the authors or the project. Academic
scientists face a lot of the same problems as users of open data, and if the
Sloan Foundation wants to pay to solve those problems for science, the project
moves forward more quickly, and people using open data in other ways still
benefit.

~~~
fiatjaf
This is good to hear. My hatred has gone. Thank you!

------
dang
We changed the title to a sentence from the article since the thing about free
food is misleading.

------
mlvljr
Dat open-source again! ;)

~~~
mlvljr
Oh, come on, was there no one else who initially misread the title?? :)

------
random28345
> Let’s say your city releases a list of all trees planted on its public
> property. It would be a godsend—at least in theory. You could filter the
> data into a list of all the fruit and nut trees in the city, transfer it
> into an online database, and create a smartphone app that helps anyone find
> free food.

So assuming that an individual fruit tree produces 20,000 calories of edible
fruit annually, and there are a couple dozen fruit trees in a typical American
city, we will have spent a hundred man hours in app development and testing to
turn half a million potential calories into a few thousand, as we inflict the
tragedy of the commons on these public resources and encourage people to pick
the immature fruit before someone else with the app does.

That idea is so stupid, by next week I expect to see 8 startups with a
combined valuation of 80 million dollars all attempting to monetize the 24
fruit trees on public property in Mountain View by selling ads targeting
"urban nomads" (aka homeless), or by paying homeless to gather unripe fruit
for each other in whatever litecoin or ripple clone is in vogue that week.

~~~
tlrobinson
I don't think I could have written a more perfect parody of Hacker News
comments that ignore the point of the article but nitpick one small thing in
an attempt to demonstrate the commenter's intelligence.

