
Deleting data distributed throughout a microservice architecture - rrampage
https://blog.twitter.com/engineering/en_us/topics/infrastructure/2020/deleting-data-distributed-throughout-your-microservices-architecture.html
======
hinkley
> First, you’ll need to find the data that needs to be deleted.

Microservices do not get you out of having to have an information
architecture. They add more friction if you don't have one, but it's entirely
possible to have an unspoken/undocumented information architecture that mostly
works.

If you don't have a System of Record for data, you for sure aren't going to be
able to find it. Similar problem with no Source of Truth. For some business
models you will have both and they will be separate (especially with 3rd party
data).

You still have the problem of logs, but at least the problem is tractable.
Without any of this it's just chaos and who knows where the data went or
really even where it came from?

~~~
gravypod
It's also possible to automate this completely. Flow.io has a great talk where
the CTO walks through everything they do from an engineering perspective. Just
annotate data in your API spec language as PII. This also lets you define
policies that can be verified automatically. If some service handles PII, you
can enforce that some other service can never talk to it, for example.
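
A minimal sketch of the idea in Python (the dataclass-metadata annotation and all names here are illustrative, not Flow.io's actual tooling, which works at the API-spec level):

```python
from dataclasses import dataclass, field, fields

# Tag fields as PII in the schema definition, then derive
# access/deletion policies from those tags instead of maintaining
# a separate, drift-prone list of sensitive fields.

@dataclass
class UserProfile:
    user_id: str
    email: str = field(metadata={"pii": True})
    shipping_address: str = field(metadata={"pii": True})
    plan_tier: str = "free"

def pii_fields(model) -> list[str]:
    """Names of the fields annotated as PII."""
    return [f.name for f in fields(model) if f.metadata.get("pii")]

def check_access(consumer: str, model, allowed_pii_consumers: set[str]) -> bool:
    """Policy: only whitelisted services may read models containing PII."""
    if pii_fields(model) and consumer not in allowed_pii_consumers:
        return False
    return True

allowed = {"billing-service"}
print(pii_fields(UserProfile))                                   # ['email', 'shipping_address']
print(check_access("analytics-service", UserProfile, allowed))   # False
print(check_access("billing-service", UserProfile, allowed))     # True
```

The same tags can drive the deletion problem from the article: a deletion job can enumerate every annotated field across services instead of hunting for PII by hand.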

~~~
616c
Link? I would love to see this, as excuses not to do so come up often at work,
sigh.

~~~
mcintyre1994
I'm pretty sure they're referring to this talk:
[https://www.youtube.com/watch?v=j6ow-UemzBc](https://www.youtube.com/watch?v=j6ow-UemzBc)

------
ijcd
My favorite approach is encrypting all of a user’s data (everywhere) and just
deleting the key from a central store instead of actually erasing.
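
This pattern is often called crypto-shredding. A toy sketch in Python (the XOR-with-hash "cipher" below is a stand-in for illustration only; a real implementation would use an AEAD cipher such as AES-GCM, and the key store would be a hardened service, not a dict):

```python
import hashlib
import secrets

# Every record for a user is encrypted with a per-user key held in one
# central key store. "Deleting" the user means deleting that one key;
# all ciphertext scattered across services becomes unreadable.

key_store: dict[str, bytes] = {}  # central store: user_id -> key

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Deterministic keystream derived from key+nonce (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt_for_user(user_id: str, plaintext: bytes) -> bytes:
    key = key_store.setdefault(user_id, secrets.token_bytes(32))
    nonce = secrets.token_bytes(16)
    ct = bytes(a ^ b for a, b in zip(plaintext, _keystream(key, nonce, len(plaintext))))
    return nonce + ct

def decrypt_for_user(user_id: str, blob: bytes) -> bytes:
    key = key_store[user_id]  # raises KeyError once the user is "deleted"
    nonce, ct = blob[:16], blob[16:]
    return bytes(a ^ b for a, b in zip(ct, _keystream(key, nonce, len(ct))))

def forget_user(user_id: str) -> None:
    del key_store[user_id]  # one delete; every copy everywhere is now noise
```

The appeal is that the hard distributed problem (find and erase every copy, including backups) collapses into a single local delete in the key store.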

~~~
FridgeSeal
What's the problem with actually just deleting this data when you're done with
it though?

~~~
vmateixeira
Because of FKs in a relational model. As an example, deleting a user/account
might turn into a task of going through every reference to it, and the
references to its references, etc.

This is actually the reason some companies do not delete users/accounts [0].

[0]
[https://news.ycombinator.com/item?id=23005060](https://news.ycombinator.com/item?id=23005060)
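
A small sketch of that fan-out, using Python's built-in sqlite3 (table names are illustrative). With `ON DELETE CASCADE` the database walks the references-to-references chain for you; without it, application code has to traverse every table by hand:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled

conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(id) ON DELETE CASCADE)""")
conn.execute("""CREATE TABLE order_items (
    id INTEGER PRIMARY KEY,
    order_id INTEGER REFERENCES orders(id) ON DELETE CASCADE)""")

conn.execute("INSERT INTO users VALUES (1)")
conn.execute("INSERT INTO orders VALUES (10, 1)")
conn.execute("INSERT INTO order_items VALUES (100, 10)")

# One delete cascades through orders to order_items.
conn.execute("DELETE FROM users WHERE id = 1")
print(conn.execute("SELECT COUNT(*) FROM order_items").fetchone()[0])  # 0
```

Of course this only helps within one database; once the references cross service boundaries, there is no FK graph to cascade along, which is exactly the problem the article is about.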

------
valera_rozuvan
One should really consider DevSecOps when dealing with a large infrastructure
involving dozens or hundreds of micro services. Security as Code by design
should be implemented as soon as possible. In my current gig - the following 3
points are key:

- 24x7 Proactive Security Monitoring

- Shared Threat Intelligence

- Compliance Operations

I would really advise anyone interested in this subject to check out the
DevSecOps manifesto [1].

Also - if you are running a k8s cluster for all your micro-services, there are
several guides online as to best practices (see [2] for example).

----------

[1] [https://www.devsecops.org/](https://www.devsecops.org/)

[2] [https://dev.to/petermbenjamin/kubernetes-security-best-pract...](https://dev.to/petermbenjamin/kubernetes-security-best-practices-hlk)

------
BrentOzar
This is relevant for enterprises with legacy systems, too, like shops that
have multiple interfaces that extract, transform, and load data across point-
of-sale systems, warehouse fulfillment systems, and data warehouses.

------
golover721
This gives me nightmares about having to retrofit GDPR requirements into a
complicated system with many data stores and applications. There's not only
the difficulty of ensuring data is deleted, but also of tracking data lineage
so you can delete derived data too. Fun times!

~~~
Hokusai
> This gives me nightmares about having to retrofit GDPR requirements into a
> complicated system with many data stores and applications.

This is true when the domain is not well defined, which is the case for many
legacy systems. Usually these systems also have problems with de-normalized
data, where there are several copies of the same entity across the system.
Copies get out of sync.

I do not think that nowadays this is as common as it used to be. When I was a
kid, I saw many systems where you had "customer" data that was replicated in
the two or three applications that were using it. Then, maybe at night, some
task would "make sure" that all the data was in sync, or it was kept in sync
in real time with triggers in the database. Applications that were not part of
this synchronization would pop up, some fields would exist in one application
but not in another, etc. Badly defined domain objects and identifiers would
sometimes make fields too small to fit the original data, put them in the
wrong format (text vs. numeric), or miss unique keys and duplicate rows.

Q: "Can you send a mail (stamp-based mail) to everybody that works the
Christmas shift?" A: "No. That data is in the scheduling system. We can only
send mail from the Human Resources system, as it is the only one that stores
the addresses."

I hope new generations of developers do not find themselves in these
situations, but they will need to maintain the many legacy systems that still
live way beyond what anyone expected.

------
Ram_Lakshmanan
This will help us [https://bitmovin.com/finding-memory-leaks-java-p2/](https://bitmovin.com/finding-memory-leaks-java-p2/)

------
crimsonalucard
My company recently decided to switch over to microservices. It was a long and
arduous process, but we chose not to compromise.

Basically, for the greatest amount of modularity, we divided all 400 functions
in our monolithic application into 400 individual servers, because obviously
functions aren't modularizing everything enough. You really need to put more
and more wrappers around all of your functions. First put a framework around
your function, then put an HTTP API layer around it, then wrap a server app
around it, then put an entire container around it and boom! More wrappers ==
Less technical debt. To illustrate how this works, see the example below:

    
    
       wrapper(
         wrapper(
           wrapper(
             wrapper(
               f(x)
             )
           )
         )
       )
    

See that? Obviously, for every additional wrapper you add around your original
function, your technical debt becomes less. This is why it makes sense not
just to use functions to modularize your code, but to wrap all your functions
in containers and then put those containers in containers.

Now all 400 of our engineers each as an individual manages one entire function
within one entire container. It's amazing, they don't have to think about two
things anymore, they can just concentrate on one thing.

While I'm not sure what has improved yet, everything feels better. Our company
is following industry trends and buzzwords. Technical debt actually went up
but that's just our fault for building it wrong.

Some engineer asked me why couldn't we just load balance our original monolith
and scale it horizontally. I fired that engineer.

Another engineer came to me and told me that for some function:

    
    
       func some_func(a: int, b: int) -> int {
          return a + b
       }
    

It's probably better to do 1. instead of 2.

    
    
       1. Call the function: some_func(2,3)
    
       2. Make a request: request.get("http://www.pointlessapi.com/api/morecomplexity/someextratechnicaldebt/randomletters/asdfskeidk/some_func?a=2&b=3") 
       and parse the json:
    
       {
         "metadata": {
             "date": "1/1/2020",
             "request_id": "123323432",
             "other_pointless_crap": ...,
             "more_useless_info": ...,
             "function_name (why?)": "some_func"
         },
         "actual_data": 5
       }
    

I fired that engineer too. Obviously 2 is better than 1. Also, why am I using
JSON? Not enough wrappers! You have to wrap that HTTP call in additional
wrappers like GraphQL or gRPC!!! (See wrapper logic above.)

Have you guys heard of a new trend called Sololithic architecture? Basically
the new philosophy states that all 400 of our microservices should be placed
in 400 containers and run on a cluster of 399 computers under Kubernetes!

I may not understand where technical debt comes from, and I also may not
understand how all these architectures will fix the problem of technical debt
forever... but I know that industry trends and buzzwords are more intelligent
than me, and monoliths are bad bad bad and obviously the source of all
technical debt! Just cut everything into little pieces and technical debt
becomes ZERO.

Right? The technical definition of technical debt is not enough cutting of
your logic into tiny pieces and not enough wrappers around your modules, so
all you need to do is cut everything up, put wrappers around it, and problem
solved! Makes sense!

~~~
nell
You used words like "buzzwords", which makes you sound facetious. I know you
aren't, but others could get the wrong idea. This is a very useful piece of
advice that should be taught in CS 101.

