
Adopting Microservices at Netflix: Lessons for Architectural Design - davidkellis
http://nginx.com/blog/microservices-at-netflix-architectural-best-practices
======
davidw
My comment, slightly edited, from the previous posting of this, at
[https://news.ycombinator.com/item?id=9106813](https://news.ycombinator.com/item?id=9106813)

> One kind of coupling that people tend to overlook as they transition to a
> microservices architecture is database coupling, where all services talk to
> the same database and updating a service means changing the schema. You need
> to split the database up and denormalize it.

That sounds like a decision you wouldn't want to take lightly; the kind of
thing you might do once your company is already big. I wouldn't want to start
out that way, though; it sounds like a recipe for a mess.

~~~
anonymousDan
Yes, I just came in here to write a comment along those lines. I mean, surely
you're opening a whole can of worms in terms of consistency, etc. I get the
feeling Netflix happens not to have use cases where strong consistency is a
requirement. I'd be interested to get more detail about how they went about
the transition - even just pointers to more on the metadata management tools
they use.

~~~
davidkellis
The consistency problem is an open question in my mind. I definitely don't
like the idea of having some data synchronization tool to fix inconsistent
data across services. I wonder what the best practice is for maintaining data
consistency across services.

Does anyone know?

~~~
shanemhansen
Ideally you don't have to sync the data because one service owns that data.
Other services request that data via API. In a RESTful world those API
requests are cacheable.
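
A minimal sketch of that, assuming a hypothetical People service built with
Flask: one service owns the records and serves them over a cacheable REST API.

    from flask import Flask, jsonify

    app = Flask(__name__)

    # The People service is the sole owner of person records.
    PEOPLE = {1: {"id": 1, "name": "Ada"}}

    @app.route("/people/<int:person_id>")
    def get_person(person_id):
        person = PEOPLE.get(person_id)
        if person is None:
            return jsonify(error="not found"), 404
        resp = jsonify(person)
        # Consumers may reuse this response for 60 seconds instead of
        # re-fetching, which also softens coupling to our availability.
        resp.headers["Cache-Control"] = "max-age=60"
        return resp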

~~~
davidkellis
But what about the situation where you have an entity service that owns the
data for one piece of the domain, for example a People service, and then other
services, like the Address service and the Billing service, reference a
particular person. In that scenario, I can imagine the Address service and the
Billing service would have a foreign key referencing a person in the People
service. Then, what happens if the Person gets deleted? In that case, we've
got a consistency problem, even though each service owns its data.

Is the best practice to not use entity services?

~~~
shanemhansen
So the problem you've identified is real. I used to have some bootleg footage
of private Amazon tech talks where the speaker emphasized that in distributed
systems it is generally a terrible idea to have transactions span entities.

I think you basically have to learn to live in an eventually consistent world.
In the case of people being deleted, I would imagine that the user service
exposes a pub/sub interface where the address and billing services subscribe
to "delete" events.

~~~
threeseed
You don't HAVE to live in an eventually consistent world. If you use something
like ZeroMQ or REST, you can "notify" other services of a "person deleted"
event in a synchronous manner.

~~~
Xorlev
That assumes the network is always good and services are up. Welcome back to
eventual consistency (or none at all).

~~~
threeseed
If the network is bad then your monolithic app wouldn't work either.

The problem of services being up/down has been solved with service discovery,
e.g. Consul, etcd, ZooKeeper.

~~~
saryant
That has nothing to do with the fact that if your systems are distributed, you
will have eventual consistency.

If System A needs to tell System B about an event in order for A and B to
remain consistent, but B is down, you've got eventual consistency, because B
can't become consistent with A until it's back up and has performed whatever
recovery is necessary to process that event. Service discovery does nothing to
solve that problem.
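
Concretely, "eventual" tends to mean a durable outbox plus retries on A's
side; a sketch in Python, with all names illustrative:

    import time

    # Undelivered events sit in a durable outbox on A's side.
    outbox = [{"type": "person.deleted", "person_id": 7}]

    def send_to_b(event):
        # Stand-in for the HTTP call to System B; raises while B is down.
        raise ConnectionError("B is unreachable")

    def drain_outbox(max_attempts=3):
        while outbox:
            event = outbox[0]
            for attempt in range(max_attempts):
                try:
                    send_to_b(event)
                    outbox.pop(0)  # delivered: B has caught up with A
                    break
                except ConnectionError:
                    time.sleep(2 ** attempt)  # back off, then retry
            else:
                # B is still down; a later run will retry. Consistency
                # arrives eventually rather than immediately.
                return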

------
exacube
This seems like it's just taking the idea of decoupling your services and
talking to them via APIs a little further by saying... decouple them to an
even smaller granularity?

This has always been the generally accepted way to scale out software
services. Is there a novel idea being discussed here, or just that they've
been doing this at Netflix?

~~~
brown9-2
Yes, a lot of organizations already design their backends in this way without
necessarily thinking to use a brand new term to describe it.

 _Cockcroft defines a microservices architecture as a service-oriented
architecture composed of loosely coupled elements that have bounded contexts._

To me this just reads as "service-oriented architecture in a sensible way".

No one sets out to intentionally build an SOA with tightly coupled components
and poor boundaries.

------
jfoutz
If your app depends on lots of them, it's only going to run as fast as the
slowest dependency. A 1% chance of poor performance isn't too bad, but the
joint distribution across 20 microservices, each with a 1% chance, gets pretty
ugly. In the normal case everything is great, but the failure modes of each
service become a much bigger deal.

It's a great architecture, but fan out of dependencies is a real risk.
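
Back-of-the-envelope, assuming independent failures:

    # If each of 20 services independently has a 1% chance of a slow or
    # failed response, a request touching all of them sees at least one
    # problem with probability:
    p_bad = 1 - 0.99 ** 20
    print(round(p_bad, 3))  # 0.182 -- roughly one request in five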

~~~
danudey
This is solved because you can scale each service differently. Too many DB
queries from your messaging service? Upgrade the DB, add another read slave.
Too much CPU load on your image processing service? Add more image processing
nodes.

Breaking your system out into multiple dependencies means you can scale not
just your infrastructure as a whole, but individual parts of it based on
demand, bottlenecks, usage, etc.

Netflix has talked in the past about how, _because_ their systems are broken
apart, they don't have to deal with these issues. Rating service having
problems? Don't show user ratings. Search service offline for updates? Disable
search. If Netflix were one giant (Rails? Django? Node?) app, it would be very
difficult to cut out poorly-performing parts temporarily.
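
A sketch of that degradation pattern, assuming a hypothetical Ratings
endpoint: time-box the optional call and fall back to "no ratings" rather
than failing the whole page.

    import requests

    def get_ratings(title_id):
        try:
            resp = requests.get(
                f"http://ratings.internal/titles/{title_id}",  # hypothetical
                timeout=0.2,  # fail fast rather than stall the whole page
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            return None  # degrade: render the page without ratings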

~~~
pmahoney
> Rating service having problems? Don't show user ratings. Search service
> offline for updates? Disable search

As an example of a (probably?) bad way to organize services, I worked on a
project that had factored a role-based access control system into its own
service. Every single web request hit this service, which made it a single
point of failure, performance critical, impossible to temporarily disable,
etc.

~~~
shanemhansen
One alternative to centralized role servers is to use client certificates.
I've used x509 certs for this purpose. They are pretty hairy, but so is
rolling your own authentication/authorization/token system.
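
The server side of x509 client-cert auth in Python's ssl module looks roughly
like this (file paths are illustrative):

    import ssl

    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.load_cert_chain("server.crt", "server.key")  # this service's identity
    context.load_verify_locations("internal-ca.pem")     # CA that signs clients
    context.verify_mode = ssl.CERT_REQUIRED              # reject cert-less peers
    # context.wrap_socket(sock, server_side=True) then only completes the
    # handshake for clients presenting a cert signed by the internal CA.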

~~~
reubenbond
Another alternative is JSON Web Tokens. They offer many of the benefits of
client certificates while avoiding many of the hardships.
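
A minimal sketch with the PyJWT library: the auth service signs a token once
and downstream services verify it locally, so no per-request call to a central
role service is needed. (The shared secret here is the simplest possible
setup; asymmetric keys are more realistic across teams.)

    import jwt  # pip install PyJWT

    SECRET = "shared-signing-key"  # illustrative only

    # Issued once by the auth service:
    token = jwt.encode({"sub": "user-7", "roles": ["billing:read"]},
                       SECRET, algorithm="HS256")

    # Verified locally by any downstream service:
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    assert "billing:read" in claims["roles"]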

------
polynomeal
My team recently added a microservice to support our fairly monolithic backend
service. The big challenge we found was that it takes a lot of effort to stand
up a new (micro)service. We needed to create a system of alarms (instead of
relying on existing catch-all defaults). We needed its own test environment,
and we needed to find ways to send traffic to pre-prod. We needed to figure
out how to bootstrap the new service into the company's infrastructure. We
needed to think about how the dependent service would authenticate against the
new service. All of that on top of the core feature work.

All these things are good. You want isolated, focused test environments. You
want tightly defined alarms. However, we underestimated how long creating a
new service would take. We ended up pushing features out when they were ready
but before the operational work was complete. Unsurprisingly, we saw exactly
the issues we had wanted to protect against.

Better microservice frameworks that match the company's infrastructure would
be helpful. Make building microservices cheap by building tools to speed up
the process.

~~~
cam-
Front-end teams tend not to like microservices; there is too much overhead in
getting too little data. As an example, we integrate with one microservice
where we get back a boolean and a date. We have the overhead of an HTTP call
and all the error handling that goes with it for two pieces of data which
would be better aggregated into another service. We story-point an integration
with a new service as an 8, but adding a new field (or two) to an existing API
data structure is a 1.
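
One common mitigation is a backend-for-frontend endpoint that does the fan-out
server-side, so the front end makes one call; a sketch with hypothetical
internal URLs and field names:

    import requests

    def account_summary(user_id):
        # Two hypothetical microservices, each returning a tiny payload.
        flags = requests.get(
            f"http://flags.internal/users/{user_id}", timeout=1).json()
        billing = requests.get(
            f"http://billing.internal/users/{user_id}", timeout=1).json()
        # One response for the front end instead of two round trips.
        return {
            "is_active": flags["is_active"],        # the lone boolean
            "next_bill_date": billing["due_date"],  # the lone date
        }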

I hope microservices are not just a new fashion in software and are still
useful ten years from now.

~~~
doktorn
It is by no means a new fashion. In an interview from 2006, Werner Vogels (CTO
& VP of Amazon) talks about it.
[http://queue.acm.org/detail.cfm?id=1142065](http://queue.acm.org/detail.cfm?id=1142065)

------
nawitus
A few questions:

a) How do you prevent technical debt? It seems to be more difficult due to
APIs which shouldn't have breaking changes. In theory you could always version
up the APIs and serve both versions, or just add a new API for a breaking
change, but these solutions seem awkward.
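
For (a), the versioned-API option usually just means parallel routes served
side by side until consumers migrate; awkward, but workable. A minimal sketch
with Flask (hypothetical endpoints):

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/v1/people/<int:pid>")
    def person_v1(pid):
        # Old shape, kept alive until consumers migrate.
        return jsonify(id=pid, name="Ada Lovelace")

    @app.route("/v2/people/<int:pid>")
    def person_v2(pid):
        # Breaking change served under a new version.
        return jsonify(id=pid, name={"first": "Ada", "last": "Lovelace"})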

b) How do you start developing multiple microservices at the same time? I
would expect APIs to change a lot in the beginning, which would mean that
updating one microservice would break another. Perhaps that is acceptable
before the first "stable release" of a microservice.

~~~
jacques_chester
> _How do you start developing multiple microservices at the same time?_

Same as any other project: Develop from the outside in.

In practice, trying to develop in the "optimal order" leads to speculative
development that will be wasted.

~~~
mercurial
Nothing prevents you from having multiple microservices sharing the same
codebase.

That said, the "it's just like the web" model doesn't sound fantastic to me.
It sounds like your app now depends on contracts which are only enforced by
good practices, not by something strongly typed you can check at compile time,
unless you use something like protocol buffers to generate the boilerplate.
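
A minimal sketch of the same idea without protobufs, assuming the services
share a codebase: a shared typed message definition makes producer and
consumer agree at import time rather than by convention (protobuf-generated
classes play the same role across separate codebases):

    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class PersonDeleted:
        person_id: int

    def encode(msg: PersonDeleted) -> bytes:
        return json.dumps(asdict(msg)).encode()

    def decode(raw: bytes) -> PersonDeleted:
        # Raises TypeError if the payload doesn't match the contract.
        return PersonDeleted(**json.loads(raw))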

~~~
jacques_chester
This is where the test-driven world, which in my experience is strongest on
the dynamic language side of programming, has come back around full circle.

In microservices, _everything_ is dynamically typed.

There is no single binary produced by a single compiler performing whole-
program checks of consistency. Even tools like protobufs don't help when code
bases drift, or someone introduces a foreign tool, or someone upgrades
versions and introduces a subtle mismatch, or someone doesn't know you call
their service and shuts it down ...

Turns out that driving from tests, and starting those tests from the outermost
consumer, is a fairly well-proven way of coping with such conditions.

~~~
mercurial
> There is no single binary produced by a single compiler performing whole-
> program checks of consistency. Even tools like protobufs don't help when
> code bases drift, or someone introduces a foreign tool, or someone upgrades
> versions and introduces a subtle mismatch, or someone doesn't know you call
> their service and shuts it down ...

Static typing is not a panacea, but a large codebase plus dynamic typing
everywhere sounds like a recipe for disaster, no matter the amount of testing.

> Turns out that driving from tests, and starting those tests from the
> outermost consumer, is a fairly well-proved way of coping with such
> conditions.

You need tests no matter what. However, static typing means a much greater
confidence in your codebase.

~~~
jacques_chester
As soon as you distribute your system, you have dynamic typing, whether you
like it or not.

At runtime you are inspecting incoming messages and then routing them to code.
It doesn't matter what language the code is written in, it will need to route
and validate the messages at runtime.

The type system cannot provide compile-time assurances of behaviour, because
it cannot create a single consistent binary which enforces the guarantees.

Your only remaining tool is to drive code from tests and only from tests.
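
A language-agnostic sketch of that runtime routing and validation (the message
shapes and handler names are illustrative):

    import json

    HANDLERS = {}

    def handles(msg_type):
        def register(fn):
            HANDLERS[msg_type] = fn
            return fn
        return register

    @handles("person.deleted")
    def on_person_deleted(body):
        if not isinstance(body.get("person_id"), int):
            raise ValueError("person_id must be an int")
        # ... do the actual work ...

    def dispatch(raw_bytes):
        msg = json.loads(raw_bytes)              # may fail: malformed JSON
        handler = HANDLERS.get(msg.get("type"))  # may fail: unknown type
        if handler is None:
            raise ValueError("no handler for %r" % msg.get("type"))
        handler(msg.get("body", {}))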

~~~
mercurial
> As soon as you distribute your system, you have dynamic typing, whether you
> like it or not.

You have serialization/deserialization issues. You can still type your
messages.

> At runtime you are inspecting incoming messages and then routing them to
> code. It doesn't matter what language the code is written in, it will need
> to route and validate the messages at runtime.

Of course.

> The type system cannot provide compile-time assurances of behaviour, because
> it cannot create a single consistent binary which enforces the guarantees.

If you make the assumption that you deploy up-to-date binaries, then knowing
at compile time that your producer and consumer use the same data structure
for the messages they exchange would give me much better confidence than "it
looks like the API conforms to what's written on the wiki".

~~~
jacques_chester
> _You can still type your messages._

You can hope that they respect the type. For a robust distributed system, you
will have to check everything at runtime.

> _If you make the assumption that you deploy up-to-date binaries, then
> knowing at compile time that your producer and consumer use the same data
> structure for the messages they exchange would give me much better
> confidence than "it looks like the API conforms to what's written on the
> wiki"._

My reading is that we agree that running code is the only source of truth, we
disagree on what guarantees distribution deprives us of.

~~~
mercurial
If you cannot ensure that your consumer receives messages following a certain
schema, even though you enforce it statically in your codebase, you also
cannot ensure that your running code passes your tests.

~~~
jacques_chester
Which is why I start from integration testing of the whole system, with
frenemy tests for any foreign services that I must rely on.

You're right that tests don't make Byzantine failures go away. But neither do
static types. My point remains that distribution turns every system into an
analogue of dynamic-language programming, and the emphasis on tool support
changes along with it.
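
For illustration, a hedged sketch of such a consumer-side contract test,
pinning only the fields this service actually relies on (the recorded
response shape is illustrative):

    # Recorded from the real People service; the shape is illustrative.
    RECORDED_PEOPLE_RESPONSE = {"id": 7, "name": "ada", "active": True}

    def display_name(person):
        # Consumer-side code under test.
        return person["name"].title()

    def test_people_service_contract():
        person = RECORDED_PEOPLE_RESPONSE
        # Contract drift fails here in CI, not in production.
        assert isinstance(person["id"], int)
        assert isinstance(person["name"], str)
        assert display_name(person) == "Ada"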

------
pbrettb
Building and rebuilding and maintaining applications is hard enough. Now we
have whitepapers from a company that wants to sell us on a new paradigm which
-- cough, cough -- they just happen to have software to support.

