
Ask HN: Is ETL (data integration in batch processing mode) really dead? - srigan
I have recently come across this presentation: https://www.infoq.com/presentations/etl-streams?utm_source=infoq&utm_medium=popular_widget&utm_campaign=popular_content_list&utm_content=homepage
Should every data integration or data processing pipeline be built on a stream processing architecture, even when there is no need for it from day zero? The argument I hear for doing so is that in the future we might need real-time processing. I would like to hear what others think.
======
BjoernKW
Far from it. CSV is by far the most common data exchange format for ERP,
CRM, and business systems in general. EDI is another. Good luck
communicating with SAP ERP or NetSuite without good old-fashioned SOAP.
Judging from the documentation, none of these seems to be supported by
Confluent.

SOAP and CSV are not sexy. They have plenty of shortcomings. However, those
are the formats that are used in the real world today (and for some time to
come).

Stream processing is a very useful design pattern but like any design pattern
it should be used carefully and only where appropriate (see: Microservices).

If I were to build a new complex ERP from the ground up I'd be remiss not to
use something like Kafka or Confluent for data processing.

If I want to communicate with legacy systems though that's an entirely
different matter. The same applies when targeting SMBs. You'd have a hard time
explaining to small business owners why they suddenly need a newfangled stream
processing architecture while their old "Export CSV and load that into Excel"
process worked just fine.

------
bsg75
Not all data arrives at a rate where streaming approaches are necessary.
Sensor, clickstream, or IoT data, perhaps, but for things like purchases,
signups, or other "daily" activities, batch processing is suitable and
less complex to build.

I would wager that most data is not of a streaming nature, but as the ability
to process live pipelines is relatively new, it gets more attention.
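
To make the "less complex to build" point concrete, here is a minimal
sketch of such a daily job in Python. The file layout and the "plan"
column are hypothetical, just for illustration:

    # Minimal daily batch job: aggregate yesterday's signups per plan.
    # File names and the "plan" column are made up for illustration.
    import csv
    from collections import Counter
    from datetime import date, timedelta

    yesterday = date.today() - timedelta(days=1)
    infile = f"signups-{yesterday.isoformat()}.csv"
    outfile = f"signups-by-plan-{yesterday.isoformat()}.csv"

    counts = Counter()
    with open(infile, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["plan"]] += 1

    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["plan", "signups"])
        for plan, n in sorted(counts.items()):
            writer.writerow([plan, n])

Run it from cron and you have a working pipeline; a streaming equivalent
would typically involve a broker, consumers, and offset management before
it produced its first row.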

------
njd
It is possible, but there are a lot of caveats.

For example, how do you detect when a source's semantics change? Such a
change will break any cleansing or transforms done in the stream platform.
Until it gets fixed, data may be missing, wrong, or worse (e.g. corrupted)
and propagated downstream.

When data is cleansed and transformed early, there is no way to go back to
the raw data unless you carry it forward too.
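
One way to keep that escape hatch open is to carry the raw payload in
every record alongside the transformed fields. A rough sketch in Python
(the envelope layout, field names, and transform are invented, not any
particular platform's API):

    # Sketch: wrap each event in an envelope that preserves the raw
    # payload, so consumers can re-derive fields after a bad transform.
    # All names here are illustrative.
    import json
    from datetime import datetime, timezone

    def transform(raw: dict) -> dict:
        # Hypothetical cleansing step; this is what breaks when
        # source semantics drift.
        return {"user_id": int(raw["uid"]),
                "amount_cents": round(float(raw["amt"]) * 100)}

    def envelope(raw_bytes: bytes) -> dict:
        raw = json.loads(raw_bytes)
        try:
            clean = transform(raw)
            status = "ok"
        except (KeyError, ValueError, TypeError):
            clean, status = None, "unrecognized"  # flag it, don't drop it
        return {
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "raw": raw,      # always carried forward
            "clean": clean,  # None until the transform is fixed
            "status": status,
        }

Once the transform is fixed, the preserved "raw" field lets you replay
and backfill "clean" without going back to the source.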

Consider these sorts of questions for your use case.

~~~
srigan
Isn't this a problem with ETL-based solutions too? Could you explain it
with an example?

~~~
njd
Yes, especially with early cleansing and transformation. When source
semantics are dynamic, try to build recognizers rather than expecting
sources to obey some semantic agreement. They won't. Cast what you can
recognize into an intermediate shape; I like object, property, value
(i.e. a triple) with metadata.

Don't cleanse or transform the source data. Let the data be the data.
Cleansed and transformed data fall into the category of assertions.
Assertions can be made by humans and by software; keep metadata about who
made them. Allow your applications and your analysts to overlay their own
semantic meaning on the triples. Naturally, a consensus understanding of
source semantics is desirable, but you don't want to prevent analysts from
using the raw or asserted data as they see fit.

You still need software that analyzes the triples for bad data and asserts
that the data is suspect; otherwise your downstream programs and analysts
will have to make those assertions themselves. Given such a process,
applications and analysts can write queries that ignore suspect data,
undesirable assertions, and so on, if they so choose.
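
A rough sketch of that shape in Python, with all class and field names
invented for illustration:

    # Triples preserve the raw value; interpretations accumulate
    # alongside them as assertions. Names are invented for illustration.
    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass(frozen=True)
    class Triple:
        obj: str       # e.g. "order/1234"
        prop: str      # e.g. "total"
        value: str     # raw value, exactly as the source sent it
        source: str    # provenance metadata
        seen_at: str

    @dataclass(frozen=True)
    class Assertion:
        triple: Triple
        kind: str      # e.g. "parsed_amount", "suspect"
        value: object  # the interpreted value, or a reason
        made_by: str   # human or software agent

    def recognize(t: Triple) -> Assertion:
        # A recognizer: interpret the raw value rather than trusting
        # the source to obey a semantic agreement.
        try:
            cents = round(float(t.value) * 100)
            return Assertion(t, "parsed_amount", cents, "recognizer-v1")
        except ValueError:
            return Assertion(t, "suspect", f"unparseable: {t.value!r}",
                             "recognizer-v1")

    now = datetime.now(timezone.utc).isoformat()
    triples = [Triple("order/1", "total", "19.99", "erp-export", now),
               Triple("order/2", "total", "N/A", "erp-export", now)]
    assertions = [recognize(t) for t in triples]
    good = [a.value for a in assertions if a.kind == "parsed_amount"]
    print(good)  # [1999]; the "N/A" row is flagged suspect, not dropped

The raw triples survive unchanged while queries pick which assertions to
trust.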

------
data36
Of course one of the original authors of Kafka will say that ETL is dead
and streams alone are the future... But that's not the reality in many
cases.

------
atsaloli
I implemented an ETL pipeline for one of my clients just this quarter; it
runs nightly and gives them the data they need in the format they want
(Web UI + CSV exports). Just _having_ this available at all is a huge win.
Having the data be "fresher" would be nice, but that's a small marginal
win compared to the original one.

