

Ask HN: ETL - packaged tools or roll your own? - bsg75

Having spent a not insignificant time dealing with peculiarities of SSIS and Kettle, I sometimes question if packaged ETL / data integration environments bring more costs than benefits (in small to mid-size projects).<p>Although they do offer a path to automation, logging, and (sometimes) parallel execution, ETL runtimes and design tools can be harder to debug compared to a combination of plan SQL and a scripting language.<p>Given a choice between the two approaches, why have you chosen an ETL package over a "plain" programmatic one?
======
mcantelon
I tried to grok Kettle, but at some point it seemed easier to just program a
solution and this has worked well for what we're doing.

I think if you're doing something simple that you're going to have to repeat a
lot, and possibly train others to do, then rolling your own makes sense. If
you're dealing with a lot of variance in terms of the format, volume, and
schema of the data you're transforming then a generalized ETL tool might make
more sense.

We're mostly transforming CSV, to facilitate migrations to our product (Qubit
Toolkit: an open source information management used predominantly by
archivists and written in PHP), so I've been developing a CSV-specific utility
class for our needs with the idea of possibly making it into a generalized CSV
transformation tool (where non-programmers can use something like XML or YAML
to define transformations):

<https://github.com/mcantelon/qubit-csv-transform>

------
tzm
I prefer rolling my own progmattic solution. In fact, I'm working on a JS-
based data platform that democratizes the writing of transformation routines
and web service integration for developers who love data, yet loath legacy ETL
systems.

From my view, mobile developers are needing data manipulation and publishing
tools that tightly fit within their programming environment and workflow. So
I'm building it, including real-time sync and visualization.

I actually need to hire... like stat. Seed funded here in Mountain View.
Partnered with the leading mobile platform company for sales and distribution.
Raising an A round together. Currently just a single founder that loves data
hacking and design.

------
tocomment
Thats a really great question. I've wondered the same thing myself. I think a
really simple ETL tool would make an excellent open source project.

If you want to do one I'd be up for helping.

~~~
bsg75
Unfortunately I am too busy fixing the current one to write a new one, but if
I had a wish list for a concise one:

1) Jobs and transforms stored as in a human readable and editable format - NOT
XML

2) A design tool that has as much focus placed on it as the runtime engine
(but a good enough #1 limits the need for a feature laden #2)

3) A runtime not Java based

The last is not so much a dig on Java, but it seems of all the OSS tools I
use, the Java ones are the most problematic. Horrible UIs, complicated
configs, and exception reporting (or stack dumps) that are completely non-
descriptive of the actual problem.

------
flavien_bessede
I always loved ETL softwares, they are flexible and offer a broad range of
features. But if you think about it, you can do most of it yourself,
especially logging and automation. The only huge advantage over in-house ETL
solutions is parallel execution which is really hard to code properly. Even
thought ETL softwares are amazing, they clearly do not justify their price,
and you will always love your own code better than a tool that may or may not
have all the features that you need. Roll your own.

~~~
bsg75
I have not had the fortune (?) to use one in the expensive category, and to
date, not too impressed with the inexpensive / free ones.

It is quite possible that experiences are affected by quality.

------
Ingaz
Use both.

"Execute OS command" task is the most powerful component in all ETLs. And
every ETL-tool must have command-line utility to run it from scripts.

We have:

    
    
      a) ETL-packages that only coordinating a lot of small scripts
    
      b) And scripts that starts a lot of ETL-packages

