
Ask HN: Is SQL a primary tool of choice for doing ETL pipelines in 2019? - sosilkj
Is SQL considered a primary tool of choice for ETL pipelines in 2019? Benefits/drawbacks?
======
specialist
I've used and created numerous ETL stacks. So learn from my mistakes.

First, move the code, not the data. Batch processing is for mainframes. I
know, I know, this has been impossible to realize anywhere I've ever worked.

Second, less is more. If command line tools work, use 'em.

Avoid IDLs, maps, schemas, visual programming, workflow engines, event
sourcing, blah blah blah. It's all useless abstraction, measured in levels of
indirection and stack-trace depth. It's all wicked hard to debug. It's all
abandoned, unmaintained obfuscation layers.

Data processing (ETL) is just cutting and pasting strings. Input, processing,
output. Sometimes with sanity checks. Sometimes with transformations, like
munging date fields or mapping terms ("yes" to "true"). Very rarely with
accumulators (aggregators) where you need some local persistent state.
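
For example, a minimal sketch of that shape in Python (file paths and field
names are hypothetical): read input, run a sanity check, munge a date field,
map "yes"/"no" to "true"/"false", write output.

    # Minimal sketch of "input, processing, output" with a sanity check.
    # File paths and field names are hypothetical.
    import csv
    from datetime import datetime

    def transform(row):
        # Munge a date field: "03/21/2019" -> "2019-03-21"
        row["signup_date"] = datetime.strptime(
            row["signup_date"], "%m/%d/%Y").strftime("%Y-%m-%d")
        # Map terms: "yes"/"no" -> "true"/"false"
        row["active"] = {"yes": "true", "no": "false"}.get(
            row["active"].lower(), row["active"])
        return row

    def sanity_check(row):
        # Reject rows missing a primary key.
        return bool(row.get("id"))

    with open("input.csv", newline="") as src, \
         open("output.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if sanity_check(row):
                writer.writerow(transform(row))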

Third, and this is pretty rare, use better APIs for data extraction. It's all
just scraping. Don't overthink it. I wish I could show the world the APIs I
created for HL7 (healthcare) data. For 2.x, I created "fluent" (method
chaining) data wrappers (like a DOM) which could not blow up (they used Null
Objects to prevent null pointer exceptions). For 3.x, I used a path-query
thingie to drill down into those stupid XML files. This was CODE, not
mappings, so it was practically a REPL, meaning fast to code, fast to debug.
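
I don't have that code to hand anymore, but a rough sketch of the 2.x idea in
Python: method chaining over parsed segments, with a Null Object so a missing
segment or field never throws. The addressing here is simplified, not the
original wrappers.

    # Rough sketch of a fluent, null-safe wrapper over an HL7 v2.x message.
    # Not the original code; the segment/field addressing is simplified.
    class NullNode:
        """Null Object: drilling further returns itself; value() is empty."""
        def field(self, *_): return self
        def component(self, *_): return self
        def value(self): return ""

    NULL = NullNode()

    class Node:
        def __init__(self, parts):
            self._parts = parts
        def field(self, i):
            # Fields are 1-based after the segment name; missing -> Null Object.
            try:
                return Node(self._parts[i].split("^"))
            except IndexError:
                return NULL
        def component(self, i):
            try:
                return Node([self._parts[i - 1]])
            except IndexError:
                return NULL
        def value(self):
            return self._parts[0] if self._parts else ""

    class Message:
        def __init__(self, raw):
            self._segments = [line.split("|") for line in raw.strip().splitlines()]
        def segment(self, name):
            for seg in self._segments:
                if seg and seg[0] == name:
                    return Node(seg)
            return NULL

    msg = Message("MSH|^~\\&|...\nPID|1||12345^^^HOSP^MR||DOE^JOHN")
    print(msg.segment("PID").field(3).component(1).value())  # "12345"
    print(msg.segment("ZZZ").field(9).component(9).value())  # "" (no exception)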

Fourth, you control the execution. Be more like Postfix/Qmail, where each task
has its own executable. Be less like J2EE or BizTalk, where you ask the runtime
to control the lifecycle of your code.

Good luck.

~~~
jklein11
I'm really interested in learning more about the work you have done with HL7
v2/v3. I have never really thought of HL7v2 messages as being document
oriented. Does the data wrapper allow you to access PID 3(1) to get the first
external ID (full CX) and PID 3(1).1 to get just the ID?

~~~
specialist
Yes.

Even better, for 2.x I had an "HL7 interface" compiler that would transform
the client (partner's) specification into verbose wrappers. The resulting
"fluent" code read just like the spec.

Alas, data quality and consistency were always a challenge. Data from multiple
sources would come through the same feed. So often our scraping (data
extraction) code would extract PID (for example) from three common places,
from most specific to most general, then use null coalescing to choose the
best match.
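
A sketch of that null-coalescing pattern (the candidate field locations here
are made up for illustration):

    # Pick the best patient ID from several candidate locations,
    # most specific first. Field names are hypothetical.
    record = {"pid_3_1": "", "pid_2_1": "67890", "alt_patient_id": "999"}

    def first_nonblank(*candidates):
        """Null-coalescing over strings: first non-empty candidate, else ''."""
        return next((c for c in candidates if c), "")

    patient_id = first_nonblank(
        record.get("pid_3_1", ""),          # most specific location
        record.get("pid_2_1", ""),
        record.get("alt_patient_id", ""),   # most general fallback
    )
    print(patient_id)  # "67890"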

As hinted before, 3.x & XML were handled the same in spirit, but with very
different code, using query expressions instead of method chaining.

I'd be happy to share whatever I have.

Here's a temp (24 hours) email address if you wish to PM me:
vwebaemg@sharklasers.com

------
davismwfl
There is no one-size-fits-all tool for good ETL processes, IMO.

We just built a pretty involved ETL process over the past couple of months
that had two key components, a true batch ETL process and then a real-time
synchronization between two Postgres databases on different servers. In our
case, real-time is basically less than 2 seconds for this application so
nothing insane, we routinely see it working in about 300-500ms.

To be fair, we could have solved it all using just SQL, but we found that even
for the batch ETL process it was better to use SQL only for the data CRUD and
leave the business logic to the application code we wrote (it is in Node.js).
This let us take advantage of some caching, which reduced the database server
load dramatically during processing; ID lookups and other known values were
the main things we cached. It also let us keep our business rules in code,
which is where we have them for data validation etc., so we could reuse
existing code sections to maintain consistency and not try to replicate some
pretty sophisticated validations and transforms in SQL.
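
The pipeline itself is in Node.js, but a Python sketch of the caching idea
(table, column, and connection names are hypothetical) looks roughly like
this: each distinct lookup key hits the database once per batch run instead of
once per row.

    # Cache known ID lookups so the transform step doesn't hit the DB per row.
    # Table, column, and connection details are hypothetical.
    from functools import lru_cache
    import psycopg2

    conn = psycopg2.connect("dbname=warehouse")

    @lru_cache(maxsize=None)
    def customer_id_for(external_ref):
        with conn.cursor() as cur:
            cur.execute("SELECT id FROM customers WHERE external_ref = %s",
                        (external_ref,))
            row = cur.fetchone()
            return row[0] if row else None

    def transform(row):
        row["customer_id"] = customer_id_for(row["customer_ref"])
        return row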

I have also worked on projects in the past where the entire process was done
using SQL and it is doable but I'd say it was less than ideal and definitely
can cause significant overhead on the DB servers which is unnecessary most of
the time.

I had one project where the entire ETL process was in MSSQL with stored procs
and SSIS, spread across multiple servers. That system worked really well
overall (integrity was super good and reasonably maintainable for SQL people),
but again strained the DB servers a lot and was fairly slow. We moved most of
that ETL process to C# at the time given it was a MS house, and the difference
was dramatic. Load on the DB servers dropped and ETL that would take hours to
run on SSIS with stored procs etc took minutes with C# and the utilization of
memcached. In fact on that one we even wrote extensions for SQL Server using
.NET and had SSIS call our external code which used the caching engine to
reduce DB lookups, amongst a bunch of other things. It was a pretty damn cool
solution, but it definitely wasn't simple; lots of smart people worked on that
project to make it super reliable and fast.

------
dalailambda
SQL has definitely become the de facto tool for a lot of data processing. This
model of working is generally referred to as ELT, as opposed to ETL.

For small/medium scale environments Fivetran/Stitch with Snowflake/BigQuery
using getdbt.com for modelling is an insanely productive way to build an
analytics stack. I consider this the default way of building a new data stack
unless there's a very good reason not to.

For larger scales Facebook has Presto, Google has Dremel/Procella/others, and
a lot of data processing is done using SQL as opposed to writing code.

The only downside really is that it tends to be fairly focussed on batch
pipelines (which are fine for 95% of workloads). But even that is becoming
less of an issue with Beam/Spark so you can use SQL for both batch and
streaming.
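
As a sketch of that in PySpark, the same SQL text can drive both a batch job
and a streaming job (paths, sources, and column names here are hypothetical):

    # The same SQL aggregation over a batch view and a streaming view.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-batch-and-streaming").getOrCreate()

    # Batch: a parquet table registered as a view and queried with SQL.
    spark.read.parquet("/data/orders/").createOrReplaceTempView("orders")
    daily = spark.sql(
        "SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date"
    )
    daily.write.mode("overwrite").parquet("/data/marts/daily_revenue/")

    # Streaming: SQL over a streaming view yields a streaming DataFrame.
    spark.readStream.format("rate").load().createOrReplaceTempView("events")
    counts = spark.sql("SELECT COUNT(*) AS n FROM events")
    counts.writeStream.outputMode("complete").format("console").start()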

Source: Solution Architect at an analytics consultancy.

------
jstrebel
Coming from a Big Data (Hadoop / AWS) background, I would say that SQL is a
good choice for analyzing data once it is in the data warehouse, but it is not
needed to get the data there. Most of the ETL code in this environment is
based on Apache Spark 2.x, with either Scala or Python as the programming language.
On a Hortonworks HDP cluster, you would use PySpark directly; on AWS, you
would use AWS Glue which allows you to set up and schedule the PySpark code.
Although PySpark does not stop you from using SQL, there is a preference to
write the transformations in terms of native Spark functions. So, to answer
the question: no, SQL is not a primary tool of choice for implementing ETL
pipelines in 2019 - PySpark is.
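
For example, a typical transformation written with native Spark functions
rather than SQL text (column names and paths are hypothetical) looks like
this:

    # Hypothetical PySpark transformation using native functions instead of SQL.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("etl-native-functions").getOrCreate()

    orders = spark.read.json("/raw/orders/")

    cleaned = (
        orders
        .withColumn("order_date", F.to_date("order_ts"))
        .withColumn("active", F.when(F.lower("active_flag") == "yes", True)
                               .otherwise(False))
        .filter(F.col("customer_id").isNotNull())
        .groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"))
    )

    cleaned.write.mode("overwrite").parquet("/warehouse/daily_revenue/")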

------
IpV8
Generally yes. You can use gui tools like Matillion or AlteryX to orchestrate,
or you can just run SQL as commands or view materializations. With the advent
of natively scalable SQL based databases such as Snowflake, SQL is clawing
back territory from the 'Big Data' databases rapidly. It is way more flexible
to throw a ton of hardware at SQL as opposed to pre-conforming your data for
specific use cases with non-relational databases. Obviously if you're talking
real-time or sensor data, SQL may not be your best strategy, but your question
is really broad.

------
fjp
I've been doing ETL-heavy work as (async) Python microservices.

The most important thing is to separate your infrastructure layer from your
business logic, and have the infrastructure layer satisfy interfaces that are
defined in the business logic layer.

That way, when my datastore switches from FTP to S3 or MySQL to Postgres, I
can swap out the datastore client, make sure the new one satisfies the
interface, and the rest of the application works just the same.
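
A rough sketch of that split (the names here are illustrative, not my actual
code): the business-logic layer defines the interface, and each infrastructure
adapter satisfies it.

    # Business logic defines the interface; infrastructure adapters satisfy it.
    # Names are illustrative only.
    from typing import Iterable, Protocol

    class RecordStore(Protocol):
        """Lives in the business-logic layer; no storage details leak in."""
        def fetch_batch(self, since: str) -> Iterable[dict]: ...
        def write_batch(self, records: Iterable[dict]) -> None: ...

    def run_pipeline(source: RecordStore, sink: RecordStore, since: str) -> None:
        # Only talks to the interface, so swapping FTP for S3 (or MySQL for
        # Postgres) means swapping the adapter, not this code.
        for record in source.fetch_batch(since):
            record["loaded"] = True
            sink.write_batch([record])

    class S3Store:
        """Infrastructure-layer adapter; satisfies RecordStore structurally."""
        def fetch_batch(self, since: str) -> Iterable[dict]:
            return []  # boto3 calls would live here
        def write_batch(self, records: Iterable[dict]) -> None:
            pass  # and here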

The microservice's database to track and schedule runs is in Postgres, with
SELECT FOR UPDATE providing an easy way to treat a Postgres table as a work
queue.
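
A sketch of that claim step (table and column names are hypothetical; SKIP
LOCKED is an addition so concurrent workers skip rows another worker has
already claimed):

    # Treat a Postgres table as a work queue: claim one pending run.
    import psycopg2

    conn = psycopg2.connect("dbname=etl_runs")

    def claim_next_run():
        with conn:  # commit on success, rollback on error
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT id FROM runs
                    WHERE status = 'pending'
                    ORDER BY scheduled_at
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED
                """)
                row = cur.fetchone()
                if row is None:
                    return None
                cur.execute("UPDATE runs SET status = 'running' WHERE id = %s",
                            (row[0],))
                return row[0]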

~~~
playing_colours
Your words reminded me that 10-15 years ago ORMs were all the hype, and one of
their advertised benefits was that you are abstracted from the underlying
database engine and can seamlessly switch between them.

A bit later, as usual, the hype died down and there were a lot of critical
articles about ORMs, saying that, well, it is not actually that easy to switch
database engines in a real project even with ORMs, and moreover, ORMs hide
powerful database-specific features. And finally: how often do you actually
jump between database engines? What a strange benefit after all.

I am in no way criticizing your approach, and I clearly understand the
rationale and benefits. It was just funny that your words reminded me how our
industry progresses in an infinite spiral, repeating itself on the next
level. :)

~~~
fjp
I know what you mean! In my approach, the ORM would be in the infrastructure
layer - I'm assuming I will have to switch it out, if I'm using one at all.

The times where I have had to migrate from one implementation of an interface
to another are where I have learned the most, usually by realizing "I let the
structure of this database get reflected too much in the business logic".

------
thenanyu
I work at Mode, an analytics tool company. We get questions about data
engineering from our customers all the time, and the solution we recommend 90%
of the time is DBT

[https://www.getdbt.com/](https://www.getdbt.com/)

If you're used to software development, DBT gives you a very similar workflow
for managing the T and L portions of the job.

------
eb0la
If you are doing a serious ETL job, you'll need to handle rejects - I mean, a
way to identify which rows of data cannot be handled gracefully by your ETL.

Doing this in plain SQL or Spark is complicated.

This is usually a job for commercial ETL tools like Informatica, Talend,
DataStage...
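
A hand-rolled sketch of that reject handling (file and column names are
hypothetical), just to show the shape the commercial tools give you out of the
box: rows that fail validation go to a reject file with a reason, the rest
flow on.

    # Failed rows go to rejects.csv with a reason; good rows go to loaded.csv.
    import csv

    def validate(row):
        if not row.get("id"):
            return "missing id"
        if not row.get("amount", "").replace(".", "", 1).isdigit():
            return "non-numeric amount"
        return None

    with open("input.csv", newline="") as src, \
         open("loaded.csv", "w", newline="") as ok, \
         open("rejects.csv", "w", newline="") as bad:
        reader = csv.DictReader(src)
        ok_writer = csv.DictWriter(ok, fieldnames=reader.fieldnames)
        bad_writer = csv.DictWriter(bad,
                                    fieldnames=reader.fieldnames + ["reject_reason"])
        ok_writer.writeheader()
        bad_writer.writeheader()
        for row in reader:
            reason = validate(row)
            if reason:
                bad_writer.writerow({**row, "reject_reason": reason})
            else:
                ok_writer.writerow(row)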

------
tracker1
Personally, I'm a fan of scripts for this type of work... Node, Python, Perl,
etc. That's just me though. I know a lot of people that live and breathe SSIS
packages and similar.

It really depends on where you want the data and how you want to use it.

------
vivan
As with most questions about stack choice: it depends.

------
llampx
It depends on your sources and whether you want to do real-time or not.

