
It’s Okay to Store Data in Apache Kafka (2017) - ooooak
https://www.confluent.io/blog/okay-store-data-apache-kafka/
======
vore
I think what the article largely glosses over is data migrations: what do you
do if you need to change the format of your data? If you retain logs in Kafka
indefinitely as the source of truth for your data, then when you need to migrate
materialized data to a new format, you'll also need to either 1) support all
the previous forms of materialized data so that operations from the log are
guaranteed to be safely replayable against it, or 2) keep only one form of
materialized data and hope you have enough test coverage to ensure that
some unexpectedly old data doesn't silently corrupt it.

Event sourcing is useful, but using it as the source-of-truth data store
itself, instead of e.g. an occasional journalling mechanism, seems pretty
fraught.

~~~
Joeri
You use Avro and the Confluent Schema Registry so that you can do compatible
schema upgrades (automatically upgrading messages when reading). If you need
to make a breaking change, you can detect the Avro schema from the message and
have code in the reader convert one to the other. It’s not that much of a
problem in practice.
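
A minimal sketch of what a compatible upgrade looks like, assuming the fastavro
library (the Schema Registry normally handles looking up the writer schema from
the ID embedded in each message); the record and field names are made up:

    import io
    import fastavro

    # v1 writer schema: what old messages on the topic were encoded with.
    writer_schema = fastavro.parse_schema({
        "type": "record", "name": "UserEvent",
        "fields": [{"name": "user_id", "type": "string"}],
    })

    # v2 reader schema: adds a field with a default, a backward-compatible change.
    reader_schema = fastavro.parse_schema({
        "type": "record", "name": "UserEvent",
        "fields": [
            {"name": "user_id", "type": "string"},
            {"name": "country", "type": "string", "default": "unknown"},
        ],
    })

    # Encode a record under the old schema, as an old producer would have.
    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, writer_schema, {"user_id": "42"})
    buf.seek(0)

    # Decode with the new reader schema: Avro resolution fills in the default.
    print(fastavro.schemaless_reader(buf, writer_schema, reader_schema))
    # {'user_id': '42', 'country': 'unknown'}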

~~~
vore
This is a problem we've actually encountered in practice: you need to preserve
the conversion code for as long as the history exists, which is permanent
technical debt. If you get your domain model wrong at the start, it will haunt
you forever.

~~~
parasubvert
This is no different from evolving an RDBMS with ActiveRecord or Flyway. It’s
not technical debt (there is nothing to repay or maintain).

~~~
dllthomas
If you write "read it to the latest version" code for every version, you need
to write that for every past version - potentially a high burden. Writing
"convert it to the next version" instead is much less work for humans, and
probably usually the best call, but less efficient when you actually need to
read that old data. There are hybrid approaches to consider, too.
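
As a rough illustration of the "convert it to the next version" approach
(the event shapes and version numbers here are hypothetical):

    def v1_to_v2(event):
        # rename "name" -> "full_name"
        out = dict(event, version=2, full_name=event["name"])
        del out["name"]
        return out

    def v2_to_v3(event):
        # new optional field gets a default
        return dict(event, version=3, country=event.get("country", "unknown"))

    UPGRADERS = {1: v1_to_v2, 2: v2_to_v3}
    LATEST = 3

    def upgrade(event):
        # Each step only knows about the adjacent version; old events are
        # walked forward one hop at a time on the read path.
        while event.get("version", 1) < LATEST:
            event = UPGRADERS[event.get("version", 1)](event)
        return event

    print(upgrade({"version": 1, "name": "Ada"}))
    # {'version': 3, 'full_name': 'Ada', 'country': 'unknown'}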

Wherever you are on that spectrum, there is code. Exercising it may slow down
your test suite. It might need to be touched when you update lint rules (at
least to add annotations to shut off the new rules).

Code nearly always represents _some_ amount of debt. Code to deal with
history, more so.

~~~
parasubvert
We have a slight disagreement on what it means to accrue technical debt.

When a schema changes at all, _someone_ has to migrate it and the data,
unless the design is that all incompatible data is dropped. Either that falls
on the codebase itself or the dev team is punting to the user. Punting to the
user just shifts the burden around.

If the R&D team shipped a shitty / incomplete schema to get the software out
the door that they need to change later, then yes, that is technical debt -
something they’ll eventually need to repay.

If requirements evolve over time and thus the schema needs to, that is not
necessarily technical debt in the usual sense, which usually implies a
temporary technical compromise for the purposes of expediency / getting
something out the door.

I suppose I agree there is a trade-off here and a penalty - after many years
the migrations get slow to apply etc., and you could say that checkpointing the
schema every few versions and preventing upgrades from any prior point is a
way of cleaning up the “migration debt”.

But people like to suggest that there’s some other way, i.e. the OP saying
“If you get your domain model wrong at the start, it will haunt you
forever.”... I have never seen any system get the domain model perfectly
right at the start!

~~~
dllthomas
That may be. I was mostly trying to address the claim that "[i]t's not
technical debt [because] there is nothing to [...] maintain."

Financially speaking, it's more like an annuity than a credit card.

------
epistasis
Kreps has a gift for writing; this is so clear, well organized, and far more
fun to read than the topic has any right to be. Hopefully he'll retire after
Confluent and finally start writing novels.

~~~
sitkack
I came to say something similar, but targeted at the quoted paragraph below.

> If Kafka isn’t going to be a universal format for queries, what is it? I
> think it makes more sense to think of your datacenter as a giant database,
> in that database Kafka is the commit log, and these various storage systems
> are kinds of derived indexes or views. It’s true that a log like Kafka can
> be a fantastic primitive for building a database, but the queries are still
> served out of indexes that were built for the appropriate access pattern.

A wonderful insight!
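
A toy way to see the "derived views" framing: the same ordered log gets folded
into two different in-memory "indexes", each shaped for its own access pattern
(the events here are made up):

    log = [
        {"offset": 0, "user": "ada", "action": "login"},
        {"offset": 1, "user": "ada", "action": "purchase", "amount": 30},
        {"offset": 2, "user": "bob", "action": "purchase", "amount": 5},
    ]

    # Derived view 1: latest state per user (what a compacted KV store would hold).
    latest_by_user = {}
    # Derived view 2: running totals (what an analytics store would hold).
    spend_by_user = {}

    for event in log:  # replaying the commit log
        latest_by_user[event["user"]] = event
        if event["action"] == "purchase":
            spend_by_user[event["user"]] = (
                spend_by_user.get(event["user"], 0) + event["amount"]
            )

    print(latest_by_user["ada"]["action"])  # 'purchase'
    print(spend_by_user)                    # {'ada': 30, 'bob': 5}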

------
mvitorino
IMO using Kafka for long-term storage is not the greatest idea. It is
expensive to keep CPU and RAM constantly on top of data that is going to be
cold most of the time. There is no DML, which means mistakes are expensive
(from an engineering point of view). And while the whole event sourcing
paradigm can work quite well in narrow domains with teams fully aware of the
implications of what they are doing, in practice, in large orgs, it is hard to
scale (from a people perspective).

~~~
tempodox
> hard to scale (from a people perspective).

This is always going to be true. Either your system is built in a way that
even the most ignorant of users cannot do permanent damage, or you design it
to be understood and run by a small group of people who fully comprehend what
they're doing. Of course neither option is easy, but the alternative is a
prolific generator of daily trouble.

------
ryanthedev
Well, maybe for non-critical data. Multi-regional Kafka clustering is not easy.
There are much better and cheaper data storage options that can provide
eventual consistency.

~~~
je42
What tech would you use to store critical append-only data, if not Kafka?

~~~
rockostrich
Kafka topics compacted to some sort of disk storage? It really depends on what
you need to do with it. All of our Kafka topics have a TTL on messages that's
less than or equal to 7 days. If the data is critical, then that topic has an
archiver job that writes the messages to files in HDFS.

Postgres, Spanner, etc. come to mind as options if you need to store critical
append-only data and also have it be queryable in real-time.
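
A rough sketch of that kind of archiver job, assuming the kafka-python client;
the topic name is a placeholder and a local JSON-lines file stands in for HDFS:

    import json
    from datetime import datetime, timezone
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "critical-events",                  # hypothetical topic with a short TTL
        bootstrap_servers="localhost:9092",
        group_id="critical-events-archiver",
        auto_offset_reset="earliest",
        enable_auto_commit=False,           # only advance after the write succeeds
    )

    for msg in consumer:
        day = datetime.fromtimestamp(msg.timestamp / 1000, tz=timezone.utc).date()
        record = {"partition": msg.partition, "offset": msg.offset,
                  "value": msg.value.decode("utf-8")}   # assuming UTF-8 payloads
        with open(f"archive-{day}.jsonl", "a") as f:    # HDFS in the real job
            f.write(json.dumps(record) + "\n")
        consumer.commit()                   # durable copy exists, safe to move on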

~~~
tr4r3x
Postgres is a bit of a different use case, isn't it? Kafka is mainly about
scalability; it has nice partitioning out of the box. And I would not say that
it's unreliable, because you always replicate data with it.

------
KaiserPro
You _can_ do it, but you shouldn't.

Treating your message bus as an infinite storage system is going to give you a
bad time.

~~~
marton78
This is quite an unsubstantiated claim in response to a detailed article. Care
to elaborate?

~~~
gopalv
Not the original comment, but I have run into a specific set of regulatory
scenarios where the immutability of any event log is a big problem.

The data retention policy for users who leave the platform is set to 30 days
by GDPR, which requires the data to be deleted and expunged by the storage
system in ways in which normal users cannot recover it - or, put another way,
the data needs to go "offline".

This is not actually "all data older than 30 days is thrown away", but rather
"all data for users who have deleted their accounts needs to be deleted, going
back to their account creation, 30 days after they say forget-me".

Kafka (or Pulsar or Pravega) or any of the other immutable commit log
implementations make forgetting a small slice of data from a large set a
complicated and nearly impossible task to accomplish.

You can accomplish some part of this with log compaction, assuming you have a
definite primary key for all updates (i.e. the log needs to be partitioned on a
key to do compaction along that key). If there's a way I could declare a
primary key as device+event_timestamp+metric but delete by a user_id column,
let me know and I'll be happy to find out how to do it.
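
For the case where the key does line up, compaction-based deletion is just a
tombstone; a sketch with the kafka-python client (topic and key are
illustrative):

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # On a compacted topic keyed by user_id, a null-value "tombstone" eventually
    # removes every retained record for that key.
    producer.send("user-profiles", key=b"user-1234", value=None)
    producer.flush()

    # If the topic is instead keyed by device+event_timestamp+metric, there is
    # no single key to tombstone for "forget user-1234", which is the problem above.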

However in its original form, Kafka is still very useful.

Being able to hold data in Kafka for those periods of time is extremely
valuable and naturally lets the system lose part of its state, since it can
replay itself back into the same state from a saved checkpoint.

If you store 7+ days of Kafka data and flush the newly arrived data-set into a
persistent, but mutable columnar store every day & maintain the
partition/offsets on commit, then you can recover from a complete loss of the
mutable store's in-memory data by replaying the log from where you left off.
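
A sketch of that recovery path, assuming kafka-python; the checkpointed offsets
would be saved atomically with the columnar store's last flush, and everything
here is illustrative:

    from kafka import KafkaConsumer, TopicPartition

    # partition -> next offset to read, saved with the last successful flush
    checkpoint = {0: 1_050_000, 1: 998_200}

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                             enable_auto_commit=False)
    partitions = [TopicPartition("metrics", p) for p in checkpoint]
    consumer.assign(partitions)
    for tp in partitions:
        consumer.seek(tp, checkpoint[tp.partition])  # resume where the flush left off

    state = {}
    for msg in consumer:
        state[msg.key] = msg.value   # stand-in for re-applying events to the store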

The row-major nature of its storage still hurts though if you plan to do all
your analysis off it directly, because you'll burn through the disk bandwidth
for no good reason.

~~~
cfontes
> forgetting a small slice of data from a large set a complicated and nearly
> impossible task to accomplish.

We use Kafka as our storage for almost everything, and we managed to solve
this by encrypting all user data that is relevant to GDPR and throwing away the
key when asked for a removal.

If a user asks to be forgotten, we commit an empty privacy key for this user
and compact the privacykeys topic, and that's it: no service will be able to
decrypt the data anymore.

So far it has been a good solution and it was easy to implement on all our
services.
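
A minimal sketch of that crypto-shredding approach, using the Python
cryptography package; the in-memory dict stands in for the compacted
privacy-keys topic and all names are illustrative:

    from cryptography.fernet import Fernet

    privacy_keys = {}   # stands in for a compacted topic keyed by user_id

    def encrypt_for_user(user_id, payload):
        key = privacy_keys.setdefault(user_id, Fernet.generate_key())
        return Fernet(key).encrypt(payload)

    def forget_user(user_id):
        # Dropping (tombstoning) the key makes every retained message for this
        # user permanently undecryptable once the old key is compacted away.
        privacy_keys.pop(user_id, None)

    token = encrypt_for_user("user-1234", b'{"email": "a@example.com"}')
    forget_user("user-1234")
    # The ciphertext can sit in Kafka forever; without the key it is just noise.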

~~~
ec109685
This is a great idea. If you couple this practice with storing the user data
encryption keys behind additional layers of security, you’ll decrease the risk
of someone being able to extract all data if they get access to your Kafka.
Spotify talks about this here:
http://labs.spotify.com/2018/09/18/scalable-user-privacy/

------
zbentley
This article seems to propose log compaction as the answer to the question of
size (i.e. how much historical data is going to have to be kept around and how
much is that going to cost). However, log compaction is not well suited to
many use cases: storing partial updates or diffs in the log, storing many
trillions of tiny entries (as keys), multiple messages on the log
corresponding to related (contingent) updates, and so on.

Those are tractable but hard to solve; log compaction is not a silver bullet
and unless you think really hard about how your data changes over time, you
may end up storing more of it than you expect if you use the log as an eternal
source of truth--compaction or not.
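
A toy example of the partial-update problem (shapes made up): keeping only the
latest record per key, which is all compaction does, silently drops earlier
diffs.

    updates = [
        ("user-1", {"email": "a@example.com"}),   # diff 1
        ("user-1", {"plan": "pro"}),              # diff 2: no email field
    ]

    # What log compaction retains: the last value per key.
    compacted = dict(updates)
    print(compacted["user-1"])   # {'plan': 'pro'} -- the email diff is gone

    # Rebuilding state correctly needs every diff, so none can be compacted away.
    state = {}
    for _, diff in updates:
        state.update(diff)
    print(state)                 # {'email': 'a@example.com', 'plan': 'pro'}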

