Dat, open-source software, seeks to restart the open data revolution (wired.com)
109 points by gordon_freeman on Aug 20, 2014 | 18 comments

Can anyone explain how this is different from iRODS, which is already in production at many organizations, abstracts away the underlying data storage systems, and provides policy enforcement, authentication [1], and trigger rules?


Moreover, iRODS is written in C++, which can be an advantage over Node.js at various levels. First, it is easier to provide interoperability with other languages. Second, many data centers are very conservative (you often see CentOS/RHEL 3/4/5, or even SUSE Linux Enterprise Server) and will not be happy to install the relatively bleeding-edge Node.js stack.

[1] In practice, a lot of scientific data is provided for non-commercial use. This is often a necessity, because the data was originally provided by a commercial entity that doesn't want to provide the same data to competitors. E.g. in NLP, a lot of treebanks are based on newspapers. They can often be redistributed freely for non-commercial purposes, but not for commercial purposes.

AFAICT, from my limited experience with iRODS and from reading through the dat docs, iRODS is mainly a data distribution mechanism, whereas dat seems to be a generic ETL framework (data extraction and munging).

iRODS is more of a distributed filesystem (thus presumably has more in common with GlusterFS, Ceph, RiakCS, MogileFS, etc).

Dat is basically a generic, pluggable database replication tool.

Based on a quick look at the documentation, it looks like it can read, write, and store data. Is the functionality for versioning, diffs, different storage backends, etc there yet? I know it takes time to build these things, I am just wondering how far along the project is.

I also have a concern about the fact that it is written in Node. It is well known that JavaScript doesn't understand large numbers, so I am curious how the project is handling this.

Node easily does arbitrary sized numbers. This is achieved using Node's own Buffer type, TypedArrays, or a binding such as BigNum, which provides an interface to the number functions in Node's OpenSSL binding.

> Node easily does arbitrary sized numbers.

No it doesn't. Having to use buffers is not easy and the BigNum OpenSSL stuff is slow and limited (only integers). I have personally had a hell of a time supporting PostgreSQL's numeric type in a Node web server. Can it be done with Node? Sure, but it is not easy or fast.
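For anyone who wants to see the limit being argued about, here is a minimal sketch in Python rather than Node. It relies on the fact that JavaScript's Number is an IEEE 754 double, the same format as Python's float, so the float results below mirror what a plain JavaScript number would do. The names `survives_double` and `big_id` are made up for illustration, and `Decimal` only stands in for the exact arithmetic that a type like PostgreSQL's numeric expects; none of this is taken from dat.

```python
from decimal import Decimal

# Same value as JavaScript's Number.MAX_SAFE_INTEGER: the largest integer n
# such that both n and n + 1 are exactly representable as a 64-bit double.
MAX_SAFE_INTEGER = 2**53 - 1


def survives_double(n: int) -> bool:
    """Return True if the integer n round-trips through a 64-bit double unchanged."""
    return int(float(n)) == n


# Hypothetical 64-bit record id, just past the safe range.
big_id = 2**53 + 1

safe_ok = survives_double(MAX_SAFE_INTEGER)  # still exact
big_ok = survives_double(big_id)             # rounded to a neighbouring value

# Binary doubles cannot represent 0.1 or 0.2 exactly, so the sum is not exactly 0.3.
float_sum = 0.1 + 0.2

# Exact decimal arithmetic, which needs a dedicated type instead of a double.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(safe_ok, big_ok, float_sum == 0.3, decimal_sum == Decimal("0.3"))
```

Whether this matters for dat depends on whether values pass through plain JavaScript numbers during a transformation or stay in buffers or strings end to end.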

If dat were just for moving buffers around it would probably be okay, but it also wants to be the place for data transformations, which is what concerns me.

Here's a good explanation of what the software tries to achieve, from one of the files in the GitHub repo. Much more informative than the Wired article:

Here's a concrete example: A police department in a city hosts an Excel spreadsheet on their web server called Crime-2013.xls. It contains all of the reported crime so far this year and gets updated every night at midnight with all of the new crimes that were reported each day.

Say you wanted to write a web application that showed all of the crime on a map. To download the new data every night you'd have to write a custom program that downloads the .xls file every night at midnight and imports it into your application's MySQL database.

To get the fresh data imported you can simply delete your entire local crime database and re-import all rows from the new .xls file, a process known as a 'kill and fill'.

But the kill and fill method isn't very robust, for a variety of messy reasons. For instance, what if you cleaned up some of the rows in the crime data in your local database after importing it last time? Your edits would get lost.
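To make the lost-edit problem concrete, here is a rough sketch of a kill and fill import. It is an illustration only: it uses Python's built-in sqlite3 with an in-memory database as a stand-in for the MySQL database in the example, and the table name `crimes`, the column names, the `C-1` style ids, and the function `kill_and_fill` are all assumptions, not anything from the police data or from dat.

```python
import sqlite3


def kill_and_fill(conn, rows):
    """Delete every local row, then re-import the complete feed."""
    conn.execute("DELETE FROM crimes")
    conn.executemany(
        "INSERT INTO crimes (crime_id, description) VALUES (?, ?)", rows
    )
    conn.commit()


# In-memory SQLite database standing in for the application's MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crimes (crime_id TEXT PRIMARY KEY, description TEXT)")

# Night 1: the feed contains a typo in one description.
feed = [("C-1", "burglery"), ("C-2", "theft")]
kill_and_fill(conn, feed)

# A local cleanup fixes the typo in our own copy only.
conn.execute("UPDATE crimes SET description = 'burglary' WHERE crime_id = 'C-1'")
conn.commit()

# Night 2: the feed gains a row but still carries the typo upstream.
feed.append(("C-3", "vandalism"))
kill_and_fill(conn, feed)

# Read the table back as {crime_id: description}.
after = dict(conn.execute("SELECT crime_id, description FROM crimes"))
print(after["C-1"])  # the local fix has been overwritten by the re-import
```

The second import brings in the new row correctly, but the cleaned-up description is gone because nothing records that the local copy had diverged from the source.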

Another option is a manual merge, where you try and import each and every row of the incoming Excel file one at a time. If the data in the row already exists in the database, skip it. If the row already exists but the incoming data is a new version, overwrite that row. If the row doesn't exist yet, make a whole new row in the database.

The manual merge can be tricky to implement. In your import script you will have to write the logic for how to check if an incoming row already exists in your database. Does the Excel file have its own Crime IDs that you can use to look up existing records, or are you searching for the existing record by other method? Do you assume that the incoming row should completely overwrite the existing row, or do you try to do a full row merge?
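A minimal sketch of that manual merge, assuming the feed does carry a usable id column. Everything named here is hypothetical: `manual_merge`, the `crime_id` key, the `overwrite` flag, and the sample rows. A plain dict stands in for the database table so the three branches (skip, update, insert) stay visible.

```python
def manual_merge(local, incoming, key="crime_id", overwrite=True):
    """Merge incoming rows into `local` (a dict keyed by `key`) and count each action."""
    counts = {"skipped": 0, "updated": 0, "inserted": 0}
    for row in incoming:
        row_id = row[key]
        existing = local.get(row_id)
        if existing is None:
            # The row doesn't exist yet: make a new one.
            local[row_id] = dict(row)
            counts["inserted"] += 1
        elif existing == row:
            # Identical data already present: skip it.
            counts["skipped"] += 1
        elif overwrite:
            # Policy 1: the incoming row completely replaces the existing row.
            local[row_id] = dict(row)
            counts["updated"] += 1
        else:
            # Policy 2: field-level merge. Incoming fields win,
            # but fields that exist only locally are kept.
            local[row_id] = {**existing, **row}
            counts["updated"] += 1
    return counts


local = {
    "C-1": {"crime_id": "C-1", "type": "theft", "note": "cleaned locally"},
    "C-2": {"crime_id": "C-2", "type": "burglary"},
}
incoming = [
    {"crime_id": "C-1", "type": "robbery"},    # changed upstream
    {"crime_id": "C-2", "type": "burglary"},   # unchanged
    {"crime_id": "C-3", "type": "vandalism"},  # new
]

counts = manual_merge(local, incoming, overwrite=False)
print(counts)
print(local["C-1"])
```

With `overwrite=False` the local-only `note` field on C-1 survives while `type` is updated; with `overwrite=True` the note would be dropped. That single flag is the "overwrite or full row merge" decision from the paragraph above, and the right answer depends on the data.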

At this point the import script is probably a few dozen lines and is very specific to both the police department's data as well as your application's database. If you decide to switch from MySQL to PostgreSQL in the future you will have to revisit this script and re-write major parts of it.

If you have to do things like clean up formatting errors in the Police data, re-project geographic coordinates, or change the data in other ways there is no straightforward way to share those changes publicly. The best case scenario is that you put your import script on GitHub and name it something like 'City-Police-Crime-MySQL-Import' so that other developers that want to consume the crime data in your city won't have to go through all the work that you just went through.

Sadly, this workflow is the state of the art. Open data tools are at a level comparable to source code management before version control.


Dat could be useful for letting city governments create data visualizations from dynamically changing, real-time data from various entities without worrying about what format the original data is in. This could really empower cities to become more efficient. For example, they might create a dashboard showing real-time analytics of their library systems, fire departments, crime zones vs. police stations, etc., even though all these entities keep their raw data in different formats and even in different database management systems.

I like the idea behind Dat, but I totally hate its authors because they said, somewhere on their page, some time ago, that their preference is for "academic research data" (or something like that).

Why did they need to say that? I don't want a tool that has a preference for something so stupid as academia.

But I'll probably forget this and start loving Dat if it manages to enable this "open data revolution".

According to the article, the focus on scientific data is a product of funding from the Sloan Foundation:

"Although Ogden's background is in city government, the Dat team is now squarely focused on the needs of scientists. That's largely because of the Sloan Foundation's focus. 'I don't come from a scientific background and wasn't even thinking about science data,' he says. 'But they convinced me that I should.' He explains that scientists have to deal with many of the same issues with formats and tracking changes that city governments do. Using Dat, Ogden says, much of this complexity could be abstracted away, at least for some users of the data."

I don't think this is a reason for hating the authors or the project. Academic scientists face a lot of the same problems as users of open data, and if the Sloan Foundation wants to pay to solve those problems for science, the project moves forward more quickly, and people using open data in other ways still benefit.

This is good to hear. My hatred has gone. Thank you!

We changed the title to a sentence from the article since the thing about free food is misleading.

Dat open-source again! ;)

Oh, come on, was there no one else who initially misread the title?? :)

> Let’s say your city releases a list of all trees planted on its public property. It would be a godsend—at least in theory. You could filter the data into a list of all the fruit and nut trees in the city, transfer it into an online database, and create a smartphone app that helps anyone find free food.

So assuming that an individual fruit tree produces 20,000 calories of edible fruit annually, and there are a couple dozen fruit trees in a typical American city, we will have spent a hundred man hours in app development and testing to turn half a million potential calories into a few thousand, as we inflict the tragedy of the commons on these public resources and encourage people to pick the immature fruit before someone else with the app does.

That idea is so stupid, by next week I expect to see 8 startups with a combined valuation of 80 million dollars all attempting to monetize the 24 fruit trees on public property in Mountain View by selling ads targeting "urban nomads" (aka homeless), or by paying homeless to gather unripe fruit for each other in whatever litecoin or ripple clone is in vogue that week.

I don't think I could have written a more perfect parody of Hacker News comments that ignore the point of the article but nitpick one small thing in an attempt to demonstrate the commenter's intelligence.

Having previously lived in Phoenix for several years, I can tell you that the amount of citrus on public property that falls to the ground and rots is huge. Even at high density apartment complexes, people don't pick near capacity (although this might have something to do with heavy use of herbicides on manicured lawns). There's no tragedy of the commons going on there (and I have to wonder if you've ever tasted a green orange)...

On the other hand, there are pretty sound reasons for planting edibles on public spaces instead of merely ornamental plants.

Come on, don't be so rude, it's totally uncalled for. And you are missing the perspective. Not all cities are typical American, okay? There may be cities in warmer climate zones with tens of thousands of fruit trees.
