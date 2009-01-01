What I found so far: Most tools that invent a new language, or try to cram complex processes into less suitable syntactic environments, are not loved too much.
A few people like XSLT, most seem to dislike it, although it has a nice functional core hidden under a syntax that seems to come from a time when the answer to everything was XML. There are big data orchestration frameworks that use XML as a configuration language, which can be OK if you have clear processing steps.
Every time a tool invents a DSL for data processing, I grab my list of ugly real world use cases and most of the tools fail soon, if not immediately. That's a pity.
Programming languages can be effective as they are, and with the exceptions that unclean data brings, you want to have a programming language at your disposal anyway.
I'll give dgsh a try. The tool reuse approach and the UNIX spirit seem nice. But my initial impression of the "C code metrics" example from the site is mixed: it reminds me of awk, about which one of the authors said that it's a beautiful language, but if your programs get longer than a hundred lines, you might want to switch to something else.
Two libraries which have a great grip on the plumbing aspect of data processing systems are airflow and luigi. They are Python libraries, and with them you have a concise syntax and basically all Python libraries plus non-Python tools with a command line interface at your fingertips.
I am curious: what kind of process orchestration tools do people use and recommend?
In this case the task resource http://converge.aster.is/0.5.0/resources/task/ might help, as it allows you to create a directed graph using any kind of interpreter (for example, Python or Ruby) instead of having to use the DSL.
Me too, for better or for worse.
As for the issues, there are many. Just quickly a few:
* Data provider has an FTP server, most files are automatically generated, some are hand-named (with inconsistencies). How do you handle (without a lot of effort) a list of exceptions along with the regular files?
* Data provider has a good strict XML schema, but the relevant information for a single item is spread across three files, inside a tar archive. Since there are 500k files inside the archive, you'd rather not extract it, but process it on the fly.
* Data provider chooses layout that saves every item in a single XML file, inside 2-3 levels of directories. There are 20M of them. Unzipping the archive alone takes more than a day with default system settings and the usual tools. How do you process these things fast?
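
For the tar case above, here is a minimal sketch of what "process it on the fly" could look like in Python. It is only an illustration under assumptions: the naming scheme `<item id>.<part>.xml`, the part names in `PARTS`, and the demo data are all made up, and a real archive will need its own rule for deciding which three files belong together.

```python
import io
import tarfile
import xml.etree.ElementTree as ET

# Hypothetical naming scheme: "<item id>.<part>.xml", three parts per item.
PARTS = {"meta", "body", "rights"}


def iter_items(fileobj):
    """Yield (item_id, {part: root element}) once all parts of an item were seen.

    Mode "r|*" reads the archive front to back with transparent decompression,
    so nothing is extracted to disk. Incomplete items are never yielded.
    """
    pending = {}
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            pieces = member.name.rsplit("/", 1)[-1].split(".")
            if len(pieces) != 3 or pieces[1] not in PARTS or pieces[2] != "xml":
                continue
            item_id, part, _ = pieces
            # In stream mode a member has to be read before moving on.
            root = ET.fromstring(archive.extractfile(member).read())
            parts = pending.setdefault(item_id, {})
            parts[part] = root
            if set(parts) == PARTS:
                yield item_id, pending.pop(item_id)


def build_demo_archive():
    """Build a tiny in-memory tar.gz with made-up data (item "b" is incomplete)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, body in [
            ("dump/a.meta.xml", b"<meta><title>Alpha</title></meta>"),
            ("dump/b.meta.xml", b"<meta><title>Beta</title></meta>"),
            ("dump/a.body.xml", b"<body>text a</body>"),
            ("dump/a.rights.xml", b"<rights>open</rights>"),
            ("dump/b.body.xml", b"<body>text b</body>"),
        ]:
            info = tarfile.TarInfo(name)
            info.size = len(body)
            archive.addfile(info, io.BytesIO(body))
    buffer.seek(0)
    return buffer


items = dict(iter_items(build_demo_archive()))
for item_id, parts in items.items():
    print(item_id, parts["meta"].findtext("title"), sorted(parts))
```

The `pending` dict only stays small if the parts of one item sit close together in the archive; if they are scattered, you would need a different strategy (for example, an index pass first).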
There are more subtle issues as well:
* U+FFFD (the replacement character) regularly occurs in natural language strings. Can you correct these strings?
* File has .csv ending, looks like CSV at first glance, but all the standard RFC compliant parsers choke on it.
* XML file with elements that have RTF tags embedded in them. You need to parse the RTF in the elements, because there is relevant information there that you need to add to the transformed version.
* Date issues. Inconsistent formats and almost-valid dates.
* Combine data, coming from an API with data fetched from ten different servers to produce a transformed version with a legacy command line application (that might be slow, so you have to split your data first and parallelize the work, combine it and make sure it's complete).
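
To make the date issue a bit more concrete, here is a rough sketch of one way to deal with inconsistent formats and almost-valid dates: try a fixed list of formats in order and set aside everything that does not parse, instead of guessing. The format list and the sample values are assumptions for illustration, not taken from any real data set.

```python
from datetime import date, datetime

# Hypothetical list of formats seen in the data; order decides ambiguous cases.
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y", "%Y-%m", "%Y")


def parse_date(raw):
    """Return a date for the first matching format, or None if nothing fits."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


# Made-up sample values, including two "almost-valid" dates.
samples = ["2009-01-31", " 31.01.2009 ", "2009-02-30", "2009", "2009-00-00"]

parsed, rejected = [], []
for value in samples:
    result = parse_date(value)
    if result is None:
        rejected.append(value)  # keep for manual review or a repair rule
    else:
        parsed.append(result)

print(parsed)
print(rejected)
```

Note that partial dates such as a bare year silently become the first day of that year here; if the precision matters, you would want to store it alongside the value.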
I am thinking about a longer article or even a short book about these kinds of data handling and quality questions and what ways there are to address them. Would you read a book like this, and what topic would be the most pressing or relevant?
For unzipping something that crazy, I'm interested in your solution - I think I'd have to write a custom zip library and use a RAMdisk or similar.
As for the zip: We simply "unzip -p" and stream process it carefully (with a custom program reading XML and transforming it). Cuts processing time from hours (extracting the zip and creating all directories, then visiting each file) to minutes (read from a single file).
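
Not our actual program, but the same idea can be sketched in pure Python. Instead of piping `unzip -p` into a separate process, this uses the standard `zipfile` module to read each member straight from the archive, so no directories or files are created. The archive layout and the `id` attribute are made up for the example. Unlike a pipe, `zipfile` needs a seekable file, because it reads the central directory first.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_zip_xml(source):
    """Yield (member name, root element) without writing any member to disk."""
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.endswith(".xml"):
                continue
            # archive.open() returns a file-like object over the member's data.
            with archive.open(info) as member:
                yield info.filename, ET.parse(member).getroot()


def build_demo_zip():
    """Build a tiny in-memory zip with made-up nested paths."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("00/01/item-1.xml", "<item id='1'/>")
        archive.writestr("00/01/notes.txt", "not xml")
        archive.writestr("00/02/item-2.xml", "<item id='2'/>")
    buffer.seek(0)
    return buffer


ids = [root.get("id") for _, root in iter_zip_xml(build_demo_zip())]
print(ids)
```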
Here is a 2009 paper, "Composing and executing parallel data-flow graphs with shell pipes", which is also a bash extension. (I'm impressed with anyone who successfully enhances bash's source code.)
Although it has a completely different model and is, I think, more suitable for "big data".
In this paper we extend the concept of shell pipes to incorporate forks, joins, cycles, and key-value aggregation.
I have a printout of this paper, but unfortunately it doesn't appear to be online :-(
