
Is it bad for prod systems to integrate via a data lake? - kyllo
Help me either win or avoid an argument:

I work at a major tech company. We have a massive data lake for analytics / data science and essentially all of our systems send logs to it. A pattern that I'm seeing pop up sometimes is teams building systems that read other systems' data from the data lake.

This screams antipattern to me. The data lake introduces significant latency and many of the log streams in it have little to no SLA guarantees. If you read another team's logs from the lake without their knowledge, they could move, delete, or make breaking changes to their logs and not know that they've broken you until you complain. I think it's a lazy way for teams to avoid calling each other's APIs or even talking to each other to discuss integrations. I want to tell them they're making a very big mistake by doing this.

Is there a good argument for applications/services reading each other's logs from a data lake instead of integrating through APIs? Is it a valid pattern? Maybe if the data needs expensive processing before the client can use it and latency isn't important? Or just because it's easier/faster than a proper service integration?
======
giaour
Late to the party, but someone at the office shared
[https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43146.pdf](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43146.pdf),
which reminded me of this question. You're right that
the system consuming data from the lake is taking on debt by building a system
around what is by design an unstable signal. The consumer is also attaching
technical debt (in the form of visibility debt) to the producer, since the data
fed into the lake now has an undeclared consumer that may break if the format
or cadence changes in ways that seem innocuous to the data originator. At the
major tech company where I work, this would result in someone on the team
responsible for the data originator getting paged.

I would recommend reading the paper linked to above and sharing it with teams
that are taking the approach you describe.

~~~
kyllo
Thanks for the link, this is a good formal description of the one main problem
I'm seeing with this type of architecture (undeclared consumers) as well as
several others I hadn't thought of.

------
scoobydoobydoo
My company uses a common data lake for various teams and organizations to
share their datasets. We enforce schemas, and owners are responsible for the
correctness and quality of their data, although consumers often still build
tests in their downstream pipelines to validate datasets coming from the
lake.
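
For a concrete sense of what those downstream checks tend to look like, here's a minimal sketch. The dataset, path, column names, and thresholds are all hypothetical, and it assumes Parquet files readable via pandas/pyarrow; the point is just that the consumer codifies its own assumptions about the producer's data:

```python
# Minimal sketch of a downstream validation check a consumer might run
# before trusting a dataset pulled from the lake. Paths, column names,
# and thresholds are hypothetical; adjust to your own layout.
import pandas as pd

def validate_orders_snapshot(path: str) -> list[str]:
    """Return a list of problems found in the snapshot; empty means it looks sane."""
    df = pd.read_parquet(path)
    problems = []

    # Schema check: the columns this pipeline depends on must still exist.
    expected = {"order_id", "customer_id", "amount", "event_time"}
    missing = expected - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems  # no point checking further

    # Completeness check: an empty or suspiciously small partition usually
    # means the producer's job failed or is late, not that there was no data.
    if len(df) < 1_000:
        problems.append(f"only {len(df)} rows; expected at least 1,000")

    # Basic quality checks on the fields actually used downstream.
    if df["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if (df["amount"] < 0).any():
        problems.append("negative amounts present")

    return problems
```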

It has its problems. How do I know a data set I am consuming is complete or
correct? As a consumer, how do I even know what to test for? Although
producers are technically on the hook for the quality of their data, it is not
clear which data sets in the data lake have what level of support from the
owning team. You probably want to get in writing from the producer that it’s
safe to consume or take a dependency on their data.

Personally, I don't want to vend a service to provide data sets, especially
when they're hundreds of gigabytes or terabytes in size (for analytical use
cases). I don't want to support all that infrastructure and have engineers
supporting the service logic itself, so a common abstraction makes sense,
at least on paper.

One of my recent projects involved using this data lake as an integration
point. With our particular data lake implementation, the exact time at which a
dataset becomes available for querying after it is written is non-deterministic
for consumers, and that resulted in a lot of friction.

What is the size of the logs you want to consume? Are you able to query
specific log lines by wildcard or key? How fast do you need the data once it
has been produced? Are you willing to support your own infrastructure (hosts,
configuration, patching, etc.)? If the data lake is already a common, managed
service, is the cost of supporting your own service still worth it?

Does your data lake enforce schemas on write? If so, how easy is it to change
schemas later on, and how does that impact consumers? How do you identify or
track consumers of the data, whether you’re using a data lake or a service? Is
there support for versioning?

------
crsn
IMHO, it’s an antipattern in almost every way. Being “common” doesn’t make it
“smart”. The preprocessing argument is invalid for a lake, and the upfront
“integration” time savings are cannibalized later by maintenance and risk
overhead vs. proper service integrations. And everything you mentioned aside,
there’s security and DPP risk. Lakes should not be used for collaboration
between systems - that’s what lakeshore data marts are for, at worst, or real
service-to-service APIs, ideally. (We have a good word for this in German:
Datensparsamkeit. See also
[https://martinfowler.com/bliki/Datensparsamkeit.html](https://martinfowler.com/bliki/Datensparsamkeit.html)).
The people and services regularly using the lake should be data scientists and
analytics folks, probably not prod services.

~~~
kyllo
Security and privacy risk are also good things to consider, thanks. Our data
lake does have processes and protections in place for this (GDPR tagging and
deleting, retention policies, security groups for access control) but just
because my streams are compliant doesn't mean downstream consumers are.

------
timwis
> You should use a data lake for analytic purposes, not for collaboration
> between operational systems. When operational systems collaborate they
> should do this through services designed for the purpose, such as RESTful
> HTTP calls, or asynchronous messaging. The lake is too complex to trawl for
> operational communication. It may be that analysis of the lake can lead to
> new operational communication routes, but these should be built directly
> rather than through the lake.

Source:
[https://martinfowler.com/bliki/DataLake.html](https://martinfowler.com/bliki/DataLake.html)

------
segmondy
I know I'm late to this discussion, but you're right on the money. It's slow,
you have no idea how often the data is refreshed, there are no SLAs, and the
schema could change. It really should be for analytics, not transactions. It's the
lazy way and folks will pay for it down the line. We got away from shared DB
architecture and this is nothing more than a backdoor to the same old way of
doing things but in a worse way.

------
mateo411
How much data are we talking about? If the data is small enough, then an API
is better. If you need to read a lot of data (GBs per hour), then you probably
want to read it through the data lake.

However, it's still possible to define schemas, version, and set SLAs on the
datasets in the data lake. If you don't do that, then you are going to have to
fix things whenever the upstream dataset changes. This is even better if the
upstream dataset is produced by a team that has no idea there are other people
consuming their datasets. Welcome to the Wild West of Data
Engineering.
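
As a rough illustration of what "define schemas and version them" can look like on the producer side, here's a minimal sketch assuming Parquet and pyarrow. The dataset name, fields, and helper are all hypothetical; the idea is that the schema lives in code, carries a version, and writes that don't match it fail loudly on the producer's side rather than in some downstream consumer:

```python
# Hypothetical sketch: an explicit, versioned schema for a lake dataset,
# enforced by the producer before anything lands in the lake.
import pyarrow as pa
import pyarrow.parquet as pq

ORDERS_SCHEMA_V2 = pa.schema(
    [
        pa.field("order_id", pa.string(), nullable=False),
        pa.field("customer_id", pa.string(), nullable=False),
        pa.field("amount", pa.float64(), nullable=False),
        pa.field("event_time", pa.timestamp("us", tz="UTC"), nullable=False),
    ],
    metadata={"dataset": "orders", "schema_version": "2"},
)

def write_orders(table: pa.Table, path: str) -> None:
    # Cast to the declared schema; incompatible data raises here, so the
    # producer finds out about a breaking change before consumers do.
    table = table.cast(ORDERS_SCHEMA_V2)
    pq.write_table(table, path)
```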

~~~
kyllo
Yeah, my first two questions were: "Why don't you get the data from the
upstream team's API? Have you discussed this design with them?"

It also happened to me once before: a production system took a dependency on
one of my streams in the data lake without my knowledge and with no SLA, then
when I made a change to it they came out of nowhere, blowing me up with e-mails
and meeting requests. I told them to go pound sand.

------
gmmeyer
Integrating through a database or other data source is a time-tested method of
linking services together. For services on critical paths it might not be the
best approach, but the pattern you've described here sounds pretty normal and
standard. What exactly do you want to do to solve this problem?
What's a "proper service integration?"

------
huac
the most effective way I've seen is to have a specific data eng team which
builds core tables, drawing on data from different teams. they work with
producer teams to ensure data quality and completeness, and work with consumer
teams to understand what they need from the data.

but this depends on what your use cases are (e.g. online vs offline).

