
Kafka, GDPR and Event Sourcing - delebe
http://danlebrero.com/2018/04/11/kafka-gdpr-event-sourcing/
======
WA
> _A suggestion from Michiel Rook’s blog is that maybe is enough to remove the
> data from the projections /read models, and there is no need to touch the
> data in the event store._

No. This is very easy. The right to be forgotten means: if I’m done with your
web service and I want my account deleted, you have to delete everything you
have on me within a reasonable timeframe, except what other laws require you
to keep (such as receipts of my purchases for 6-10 years).

If someone can replay that log step by step (and I think this is the idea of
Event Sourcing), and my data shows up in there for no particular reason, it’s
illegal.

There is no need for you to archive data if I don’t want you to create that
archive, or if I want you to delete that archive.

Also, if I haven’t used your web service for X years, you should send a
friendly reminder that my account and all data will be erased soon. If I don’t
reactivate, you should delete all information that could help you or anyone
else identify me as a person.

Edit: arguing with the purpose of “archiving” is exactly the fucked-up
weaseling companies do to circumvent the law. Take responsibility for your
users’ data and delete everything after a while, or on request.

~~~
_betty_
Makes me wonder about blockchain data that can't be deleted.

~~~
PeterisP
That's quite simple - as a controller holding my data, you're responsible for
safeguarding the data and fulfilling the requirements, so if putting that data
on a public blockchain makes it impossible to do your duty, then you're not
allowed to put that data on a public blockchain.

If _I_ put my data there, that's my problem, but you're not allowed to do
that.

~~~
LoSboccacc
> If I put my data there, that's my problem, but you're not allowed to do
> that.

Is it really? That's a part I did not really understand about the regulation,
and I can't find anything about voluntarily relinquishing control of personal
data.

I.e.: I leave my personal name and email address on a public forum. Google
later adds that post to its index, or another random company like the Web
Archive or archive.is scrapes it.

What are my rights? What are their obligations? Is the forum owner liable for
my action if I didn't explicitly agree for my data to be shared by him with
all unforeseeable future scrapers?

> as a controller holding my data, you're responsible for safeguarding the
> data

This confuses me even more. Say I'm running an analytics system. I track users
through a cookie. Tracking cookies are, for some reason, personal information.
If a user deletes the cookie from their browser and then asks me to fulfill my
obligations to erase their data from my system, how do I identify them? Who's
liable then?

~~~
WA
> _I leave my personal name and email address on a public forum. Google later
> add that post to it 's index, or another random company like Web Archive or
> archive.is scrapes it._

I wish there was an official guide on how to run a forum properly, because all
forums suffer from the same problems.

What I figured out so far:

You as a forum administrator must delete that personal data on request. This
could mean digging through all of a user's posts and deleting all PII,
although you don’t have to delete posts that contain no PII.

You as a forum administrator must get your privacy policy right and possibly
make it harder for third parties to index PII. This depends on your intentions
and whether or not people know that your forum is public.

Say your forum is about car parts. You could set a subforum that asks members
to introduce themselves (where PII is most likely) to _noindex_, or hide it
from the public.

This way, you’ve put in reasonable effort to protect your users and your
obligation is done. Indexing by third parties is now out of your control.

But say you run a medical forum where people post health data (considered
super sensitive) and are expected to post a lot of PII; then you might have to
set the entire board to "members only".

Although I’m not sure about any of this, and to some extent most forums
provide value by being visible to visitors and indexed by Google. Quite sad.

~~~
PeterisP
IMHO this is not talked about much because that's not a new GDPR issue; GDPR
introduces a bunch of new things about e.g. consent of processing data, but
the "right to be forgotten" and requests to delete my PII that someone else
posted to your forum is pretty much unchanged, it was a thing in EU
legislation for quite some time already (might be a full decade?) so all the
"here's what to do now" articles don't touch this.

------
arpa
Okay, this approach is really cool, but... just hear me out: what if you don't
store personally identifiable data in the event store? What if you only store
references/IDs that point to services which can resolve those IDs to data, and
that data need not be immutable. In the event of an erasure request you
shred/anonymize that data only, without touching Kafka. I mean, it's pretty
obvious to me, to the point I feel like I'm missing some huge point and
grossly misunderstanding the whole problem.
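A minimal sketch of this idea, with all names invented for illustration: the log carries only opaque user IDs, PII lives in a separate mutable store, and an erasure request shreds only that store while replay keeps working.

```python
# Illustration only: the event log holds opaque IDs; PII lives in a
# separate, mutable store. All names here are hypothetical.

event_log = []   # append-only; never modified, never rewritten
pii_store = {}   # mutable lookup: user_id -> personal data

def sign_up(user_id, name, email):
    pii_store[user_id] = {"name": name, "email": email}
    event_log.append({"type": "UserSignedUp", "user_id": user_id})

def erase(user_id):
    # Erasure request: shred the PII, leave the log untouched.
    pii_store.pop(user_id, None)

def replay():
    # Projections resolve IDs at read time; erased users resolve to nothing.
    names = []
    for event in event_log:
        if event["type"] == "UserSignedUp":
            pii = pii_store.get(event["user_id"])
            if pii is not None:
                names.append(pii["name"])
    return names

sign_up("u42", "Alice", "alice@example.com")
sign_up("u43", "Bob", "bob@example.com")
erase("u42")   # "right to be forgotten" without touching the event log
```

After the erasure, replaying the full log yields a projection containing only Bob, even though Alice's sign-up event is still in the log.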

~~~
dnomad
I think that's exactly what any real system would do. This sort of widespread
event encryption and forgetting is crazy. It is a very bad idea to delete
events from the immutable event log: there is no telling how downstream
consumers will react, but it probably won't be good. It's much simpler to hold
only references to a user (a UserID), and then if you need to forget a user
you simply burn whatever is behind that UserID.

This is actually how the GDPR is designed: it's not about user data it's about
personal data, that is data that can be associated or related back to an
"identifiable person." The problem is _not_ that you need to delete all the
user's data across all your systems, the problem is that you need to break any
associations that would allow you to _identify_ a natural person who has asked
to be "forgotten." The funny thing is that while some people make a lot of
noise about GDPR being some huge "burden," the reality is that any architect
worth her salt should've been designing systems like this from the very start
rather than letting personal data be replicated en masse from one system to
another. It is basic normalization of data. All the GDPR is doing, like most
regulation, is requiring that businesses follow best practices and not cut
corners that might harm end users. The great thing in the long run is that
this fixes a huge problem with the internet. Today users are reluctant to sign
up for a service precisely because they don't want to surrender their data:
once it's handed over, it's gone forever. I suspect users will be more _open_
to trying new services if they can be assured that it's possible to "un-sign
up", that is, to be "forgotten" by a service.

~~~
Terr_
> if you need to forget that a user you simply burn whatever is behind that
> UserID [...] you need to break any associations that would allow you to
> identify a natural person

The problem is that as you collect more "impersonal" data, the probability
that your collective data can still be used to identify someone approaches
100%.

"We don't know their name, but Deleted User 4510 was friends with every single
member of the John Doe family except for John Doe himself..."

~~~
joosters
Surely the implication is that when you _'burn whatever is behind that
UserID'_ you would delete the records of who DeletedUser4510 was friends with
(and who was friends with DeletedUser4510).

If all you do is rename someone to DeletedUser4510, you pretty obviously
haven't deleted all the data you hold on them.

~~~
throwawayReply
Right, and what if those connections are burned into an immutable data source
through the connection event in your event sourcing?

That is what this article is about, and methods to deal with that.

------
dpwm
I'm not a Kafka user. However, I am a long-time keen implementer of event
sourcing and log-derived data stores. Many of these involve per-user or per-
individual sqlite databases.

I cannot understand why Kafka is causing people problems. I am surprised that
this seems to be accepted as a problem with event sourcing in general.

The most obvious solution I came up with for the GDPR right to be forgotten
was to have per-user logs. User gone? Delete their log.

Then you just need to handle how you share the projected data from the
individual event logs. Messages are unlikely to be spread over multiple log
events and can be just sourced directly. It's clear who owns that data: it's
the user who wrote the message as they can at any time delete the only copy.

Aggregation is more difficult but can be done if you structure the things you
are aggregating such that you only need a few lookups per user. Again you can
cache references to certain events that can be efficiently looked up using a
(user_id, event_id) pair. There's nothing to stop you storing the latest
projection in a separate table of the per-user database file.
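A toy sketch of the per-user-log idea, with one SQLite file per user so that forgetting a user is a single file deletion. The file layout and schema are invented for this sketch.

```python
# Illustration: one SQLite event log per user. Paths and schema are
# made up; a real system would need durability and locking decisions.
import json
import os
import sqlite3
import tempfile

DATA_DIR = tempfile.mkdtemp()

def log_path(user_id):
    return os.path.join(DATA_DIR, f"{user_id}.db")

def append_event(user_id, event):
    conn = sqlite3.connect(log_path(user_id))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (seq INTEGER PRIMARY KEY, body TEXT)"
    )
    conn.execute("INSERT INTO events (body) VALUES (?)", (json.dumps(event),))
    conn.commit()
    conn.close()

def forget_user(user_id):
    # The user's whole event history is one file: deleting it deletes them.
    path = log_path(user_id)
    if os.path.exists(path):
        os.remove(path)

append_event("u1", {"type": "MessagePosted", "text": "hello"})
append_event("u2", {"type": "MessagePosted", "text": "hi"})
forget_user("u1")
```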

Is there some technical reason why so many Kafka users seem to be having such
difficulty with this?

~~~
noir_lord
What (if any) is your favourite book on event sourcing?

I've seen tech meetup lectures (one particularly fascinating one was by a guy
who worked for the NHS) but I'd like to understand it better than I do.

~~~
dpwm
To me the biggest thing that made event sourcing click was when I was working
on an internal app that needed full auditing. With event sourcing the audit
log is the authoritative data source.

Learning that there was a name for this and that others had worked it out
better than I had was very handy. I didn't find a need for a book because I
found reading Martin Fowler's article[0] on it to give me a good enough
understanding.

I've found that it's a less well-defined pattern than some implementations
make it seem. There are many trade-offs to be made. At one extreme you can
treat events as physical events that can be easily understood. Let's say we
have a user that wants to add a phone number. Well, we could have an
AddPhoneNumber event. But this leads to a RemovePhoneNumber event as well.

But you can also take the other extreme and say that each event is a set of
collected information -- perhaps a different element of the root of a json
object -- and that you can just diff to see what changed and only need to look
at the latest such event.

I've found there's a middle ground that doesn't require too much thinking:
related data goes together if it's not going to change very often. So I will
generally put all contact information in together as an UpdateContacts event.
This way we don't need to implement all the list operations for phone numbers
and can skip looking for such events after the latest has been found. You also
still have something simple enough to work out what changed and that is
unlikely to fill your storage.
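A rough sketch of that middle ground, with invented event shapes: the current contact info is simply the latest UpdateContacts event, and "what changed" falls out of a dict diff.

```python
# Hypothetical event shapes: all contact details travel together in one
# UpdateContacts event, so the latest one is the current state.
events = [
    {"type": "UpdateContacts",
     "contacts": {"phone": "555-0100", "email": "a@example.com"}},
    {"type": "UpdateContacts",
     "contacts": {"phone": "555-0199", "email": "a@example.com"}},
]

def current_contacts(log):
    # No list operations to fold: stop at the latest UpdateContacts.
    for event in reversed(log):
        if event["type"] == "UpdateContacts":
            return event["contacts"]
    return {}

def changed(old, new):
    # Working out what changed is a simple dict diff.
    return {k: v for k, v in new.items() if old.get(k) != v}
```

Here `current_contacts(events)` returns the second snapshot, and `changed` on the two snapshots reports only the phone number.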

If you are implementing event sourcing, I would like to point out one thing
that I didn't see written anywhere: set a reasonable upper limit on per-user
storage. Because you will find somebody who, even not maliciously, just keeps
changing things back and forth and it's better to stop them adding events than
to stop all users by running out of disk space.

[0]
[https://martinfowler.com/eaaDev/EventSourcing.html](https://martinfowler.com/eaaDev/EventSourcing.html)
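One possible shape for that per-user cap (the limit value is arbitrary): refuse new events once a user's log reaches the threshold, so one noisy user can't exhaust disk for everyone.

```python
# Sketch of a per-user event quota; the limit and storage are stand-ins.
MAX_EVENTS_PER_USER = 1000  # illustrative threshold

logs = {}  # user_id -> list of events (stand-in for real storage)

def append_event(user_id, event):
    log = logs.setdefault(user_id, [])
    if len(log) >= MAX_EVENTS_PER_USER:
        # Stop this one user before they run the whole store out of disk.
        raise RuntimeError(f"event quota exceeded for {user_id}")
    log.append(event)
```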

~~~
noir_lord
Thank you for the thoughtful response.

I was hoping for a good book of the "here be dragons" kind.

Often books on an architecture gloss over the rough bits and then you find
them yourself when you already committed to it.

I've become incredibly conservative when it comes to application architecture
the older I get.

~~~
karmajunkie
If you're going to dive into CQRS/ES, I'd recommend:

* Enterprise Integration Patterns (basically an entire book about messaging architectures) [1]
* Vaughn Vernon's books and online writing [2]
* Domain-Driven Design by Eric Evans [3]
* and most of what Greg Young, Udi Dahan, and that constellation of folks have done online (lots of talks and blog articles)

Depending on your platform of choice, there may be others worth reading. For
my 2¢, the dragons are mostly in the design phase, not the implementation
phase. The mechanics of ES are pretty straightforward—there are a few things
to look out for, like detection of dropped messages, but they're primarily the
risks you see with any distributed system, and you have a collection of
tradeoffs to weigh against each other.

In design, however, your boundaries become _very_ important, because you have
to live with them for a long time and evolving them takes planning. If you
create highly coupled bounded contexts, you're in for a lot of pain over the
years you maintain a system. However, if you do a pretty good job with them,
there's a lot of benefits completely aside from ES.

[1] [https://www.amazon.com/Enterprise-Integration-Patterns-
Desig...](https://www.amazon.com/Enterprise-Integration-Patterns-Designing-
Deploying/dp/0321200683/ref=sr_1_1?s=books&ie=UTF8&qid=1523639836&sr=1-1&keywords=enterprise+integration+patterns)

[2] [https://vaughnvernon.co](https://vaughnvernon.co)

[3] [https://www.amazon.com/Domain-Driven-Design-Tackling-
Complex...](https://www.amazon.com/Domain-Driven-Design-Tackling-Complexity-
Software/dp/0321125215/ref=sr_1_2?ie=UTF8&qid=1523639703&sr=8-2&keywords=domain+driven+design)

------
kaspm
As someone who is debating how to handle the right to erasure, this is very
interesting. I've also been struggling with how to automate erasure within the
3rd-party SaaS tools that we use.

At last count we have 34 SaaS products, of which something like half contain
our customers' PII.

Does the regulation state that we must guarantee the right to erasure, or that
we must make a reasonable effort to erase customer data on request?

Are people generally automating this fractal process or manually deleting from
systems that only offer a manual process (such as Google Analytics)?

~~~
robin_reala
You’re not allowed to store PII in Google Analytics already as per the terms
of service:

 _You will not and will not assist or permit any third party to, pass
information to Google that Google could use or recognise as personally
identifiable information._

[https://www.google.com/analytics/terms/us.html](https://www.google.com/analytics/terms/us.html)
, section 7.

~~~
jonasb
Even if you don't send personal data to Google Analytics, they store personal
data automatically. At least Google Analytics for Firebase stores the
following identifiers: mobile ad IDs, IDFVs/Android IDs, Instance IDs, and
Analytics App Instance IDs.
[https://firebase.google.com/support/privacy/](https://firebase.google.com/support/privacy/)

~~~
joking
But those can change, and if you don't store personal information you can't
relate them to a specific individual.

------
CookWithMe
> The main concern with this approach is that the event store is no longer
> immutable

I think what happens when you delete all events of a (deleted) aggregate root
(such as a customer who requested to be forgotten) can be interpreted more
charitably, in a way that the event store can still be called immutable.

If you look at a functional programming language, you cannot force a data
structure to be removed from memory. However, that obviously doesn't mean that
your program consumes an infinite amount of memory: if a data structure is no
longer referenced, it will be removed from memory by the GC. Your program
itself didn't mutate the state of the data structure, so from that point of
view everything is still immutable.

Now, let's apply the same principle to an event store: A deleted aggregate
root (the to-be-forgotten customer) should have been removed from all
projections (as required per GDPR). If you replay the events, it shouldn't
matter to the final state of a read projection whether it processed the events
belonging to this aggregate root, or not.

Therefore, one could interpret that removing the events of a deleted aggregate
root in a GC-like fashion leaves the event store immutable, in the sense that
my program(s) can't mutate the state (themselves), and their output doesn't
change.
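The invariant being described can be checked with a toy projection (data invented for illustration): if the projection already skips deleted aggregates, garbage-collecting their events out of the log is unobservable in the replayed output.

```python
# Toy projection over invented events. Because the projection skips the
# forgotten aggregate anyway, removing its events from the log in a
# GC-like pass changes nothing observable.
events = [
    {"aggregate": "customer-1", "amount": 10},
    {"aggregate": "customer-2", "amount": 5},
    {"aggregate": "customer-1", "amount": 7},
]
deleted = {"customer-1"}  # aggregates whose owners asked to be forgotten

def project(log):
    totals = {}
    for e in log:
        if e["aggregate"] in deleted:
            continue  # GDPR: forgotten customers never reach projections
        totals[e["aggregate"]] = totals.get(e["aggregate"], 0) + e["amount"]
    return totals

# "Garbage collect" the deleted aggregate's events out of the log.
gc_log = [e for e in events if e["aggregate"] not in deleted]
assert project(events) == project(gc_log)  # the GC is unobservable
```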

------
noway421
Going forward, no one would consider Kafka a suitable system for any kind of
production use. Immutable persistent data already caused issues for HIPAA
compliance, but now it is virtually illegal. Using event sourcing with only
foreign keys is way too hard to enforce engineering-discipline-wise, and is
just not worth the risk.

Back to SQL/NoSql/Memory stores and their mutability.

------
sudhirj
The better way seems to be to make sure all events carry only references to
users, and to move user data off the event sourcing system. That way the
records that something happened to user no. 42 can remain immutable, but if
user no. 42 leaves, their account can be tombstoned and all their personal
data effectively deleted.

------
tlrobinson
I wonder how soon we’ll see a company accidentally forget everyones’ data due
to a poorly architected/implemented solution to “right to be forgotten”.

~~~
synotna
That's a big win over the everyday occurrence of companies leaking everyone's
data.

~~~
imtringued
It can also mean bankruptcy because they also deleted the data from their
backups...

~~~
krageon
This is vastly preferable to accidentally leaking customer information. If
this is the price we have to pay for having good guarantees that we can be
forgotten, I think it is a worthwhile tradeoff.

~~~
aeorgnoieang
Can you really not imagine _even a single case_ where this isn't true?

------
noir_lord
The forget the encryption key approach is clever.
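A bare-bones illustration of that approach: the SHA-256 XOR keystream below is a toy stand-in for a real symmetric cipher (e.g. AES-256-GCM) and must not be used for actual security; the point is only the mechanics of deleting the per-user key.

```python
# Crypto-shredding sketch. WARNING: toy cipher, for illustration only.
import hashlib
import secrets

keys = {}  # user_id -> symmetric key; deleting an entry "forgets" the user

def _keystream(key, n):
    # Derive n pseudo-random bytes from the key (stand-in for a real cipher).
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt_for(user_id, plaintext):
    key = keys.setdefault(user_id, secrets.token_bytes(32))
    return bytes(a ^ b for a, b in zip(plaintext, _keystream(key, len(plaintext))))

def decrypt_for(user_id, ciphertext):
    key = keys.get(user_id)
    if key is None:
        return None  # key shredded: the stored payload is permanently opaque
    return bytes(a ^ b for a, b in zip(ciphertext, _keystream(key, len(ciphertext))))

blob = encrypt_for("u42", b"alice@example.com")  # lives in the immutable log
assert decrypt_for("u42", blob) == b"alice@example.com"
del keys["u42"]  # the "forget" operation: only the key store mutates
assert decrypt_for("u42", blob) is None
```

The log itself is never rewritten; only the small, mutable key store changes when a user is forgotten.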

~~~
jordibunster
I used to think so too (that's what iPhones and the like do on "erase", btw)
but doesn't that just push the problem into the future, when computers can
more easily crack today's encryption?

~~~
xoa
> _the future, when computers can more easily crack today's encryption_

No. Don't confuse symmetric with asymmetric (current public key) encryption.
They aren't subject to the same potential attacks. Even with a theoretical
fully scalable general-purpose quantum computer, the best quantum attack
against a symmetric cipher is brute forcing with Grover's Algorithm, which
provides a quadratic rather than exponential speed-up. I.e., an n-bit key
could be attacked in around 2^(n/2) operations. This is trivially countered by
doubling the key length: a 256-bit key would still take 2^128 operations,
which is effectively impossible, and a 512-bit key would take 2^256. There is
no foreseeable technology that would be able to brute force that, so for AES
and the like with at least 256-bit keys it can be reasonably assumed that
destroying the key means the data is lost (anything legacy running on 128-bit
keys is worth watching out for though, since 2^64 is potentially tractable).

Present asymmetric crypto systems can theoretically [1] be attacked with
Shor's Algorithm, which may be what you're kind of thinking of if you've heard
about "today's encryption getting cracked" in the general media or scifi. And
that would in fact be a big deal, it covers how most data is moved around in
communications and the Internet at present. But QC isn't magic, and it doesn't
simply break everything. FDE and the like that use only symmetric crypto are
safe.

1: "Theoretically" because that's _if_ (big if) an ideal quantum computer that
could be scaled to a sufficient number of qubits is created.

~~~
codetrotter
> There is no future with any foreseen technology that would be able to brute
> force that, so when it comes to AES and the like using at least 256-bit keys
> it can be reasonably assumed that destroying the key means the data is lost
> (anything legacy running off 128-bit is reasonable to watch out for though,
> 2^64 is potentially tractable).

But we can’t know for sure that AES or any other encryption algorithm doesn’t
have some as-of-yet-unknown fatal flaw that would make it breakable in some
way, not necessarily even having anything to do with quantum computers?

~~~
gizzlon
Of course not. But now you're getting philosophical :) Can you _really_ know
anything?

~~~
codetrotter
Ha ha, yeah, that is true. My point though was that this is important to keep
in mind if we decide to use throwing away the encryption key as our way of
protecting the data.

------
tofflos
I don't really see what encryption gets me over deleting the events. The end
result is the same - the event log is unusable for replaying this particular
projection. Why take on the extra complexity of adding encryption?
Immutability is just a means to an end; it has no value on its own.

------
PeterStuer
If your messages are not guaranteed to contain info about at most a single
individual, wouldn't you need an encryption key to cover tuples of persons? If
so, would the existence of a key for a given tuple be a potential privacy
sensitive piece of info that would need to be masked?

------
skywhopper
As ever, it turns out that however brilliant your design spec, data structure,
or algorithm is, it breaks as soon as it comes in contact with the realities
of human nature, business processes, legal requirements, and the universe
itself. This is an insight huge swaths of tech culture fail to grasp, and it
explains the massive problems of crypto currency fraud, Facebook privacy
gaffes, and the continuing delusion that self driving cars and AI assistants
are just over the horizon. If only the world would match up to my assumptions,
then my code can save the world! But it never will. You have to build that
mismatch into your plans or you’ll eventually fail.

------
kbenson
Missed opportunity. Should have named the post "Kafka, GDPR and Kafka".

~~~
nbevans
Indeed. The post seems to assume that all event sourced systems cannot
tolerate the deletion of events on a per aggregate root basis.

~~~
spuz
Could you explain a bit more? Why does using Kafka imply that you cannot
delete events on a per aggregate root basis?

------
zilchers
Interesting article, but there’s this architect type out there I’ve been
encountering who are, like, Kafka maximalists. In this case the overall model
is OK, but the insistence on putting the keys into Kafka instead of a normal
data store, and then relying on hot access and the Kafka Streams state stores
for potentially millions of keys, seems misguided. It’s OK to use relational
DBs when they make sense. It almost feels like with Kafka we’re going through
the whole NoSQL-or-die thing all over again, just so in a few years we realize
Postgres isn’t that bad.

------
yitzchok
Backups should also be encrypted using the same technique, so that you can
just delete the key and access to the data is no longer possible.

------
kimi
Frankly, I'd rather go for removing from projections.

