
What I Wish I Had Known Before Scaling Uber [video] - kiyanwang
https://www.youtube.com/watch?v=kb-m2fasdDY
======
iamleppert
It amazes me they have 1,700 services. It would be hilarious satire to
actually see the description of each. And the debugging scenarios he listed
make it sound like they have very poor engineers working for them. Who on
earth lets an application get to prod that iterates through a list of items
and makes a request for each thing?

When did we lose our heads and think such an architecture is sane? The UNIX
philosophy is do one thing and do it well, but that doesn't mean be foolish
about the size of said thing. Doing one thing means solving a problem, and
limiting the scope of said problem so as to have a cap on cognitive overhead,
not adding another notch to your "I have a service" belt.

We don't see the ls command divided into 50 separate commands and git
repos.....

~~~
willejs
Saying something like '...they have very poor engineers working for them.' is
pretty unfounded, and naive. It's easy to say that having 1700 services is
overkill from our point of view, but we don't know the complete architecture,
the problems being solved, or the environment they operate in.

One thing to possibly consider is that once you have set up tooling for a
platform (logging, request tracing, alerting, orchestration, common libraries,
standards and deployment), deploying and operating new services becomes
relatively straightforward.

That said, 1700 is a lot, which makes me intrigued to see inside Uber.

~~~
iamleppert
This wasn't a derivation from the repo count, it was in response to the debug
part of the talk where he was talking about fan out. Some app was iterating
through a list of things and for each thing was making a network request,
instead of figuring out what it needed up front. It's either complete laziness
or incompetence, probably the latter. I assume with any company that has that
many services their culture is partly to blame, too.

It's interesting because this kind of technical debt will eventually open the
door to competitors once Uber is under more financial pressure...i.e. they
have burned through all the investor money.

~~~
barrkel
I agree that it is utterly stupid. However, it is almost certainly a political
/ interpersonal / interteam issue. Very probably, the person making the calls
couldn't or didn't convince the service they were calling that a bulk access
mechanism was required.

It probably doesn't help that canonical REST as usually applied to APIs
doesn't really have a single well-known pattern for this.
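
To make the fan-out anti-pattern concrete, here is a minimal Python sketch (the endpoints and `fetch_one`/`fetch_many` names are hypothetical stand-ins for real HTTP calls, not anything from the talk):

```python
# Fan-out anti-pattern vs. a bulk call. fetch_one / fetch_many stand in
# for HTTP calls to a hypothetical downstream item service.

def fetch_one(item_id):
    # imagine: requests.get(f"https://items.internal/items/{item_id}")
    return {"id": item_id}

def fetch_many(item_ids):
    # imagine: requests.post("https://items.internal/items:batch", json=item_ids)
    return [{"id": i} for i in item_ids]

def fan_out(item_ids):
    # Anti-pattern: N network round trips, N chances to time out or fail.
    return [fetch_one(i) for i in item_ids]

def bulk(item_ids):
    # One round trip, but the downstream team must expose a bulk endpoint.
    return fetch_many(item_ids)
```

The two produce the same result; the difference is N round trips versus one, which is exactly the negotiation the parent comment says tends to fail between teams.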

------
tiborsaas
"Uber is most reliable over the weekends when engineers don't change it" :)

~~~
kenrose
We see this pattern at PagerDuty across the majority of our customers. There
is a definite lull in alert volume over the weekend that picks up first thing
Monday morning.

It's led to my personal conclusion that most production issues are caused by
people, not errant hardware or systems.

~~~
asimuvPR
I've come to question releasing often as a result.

~~~
jon-wood
I can see the argument that if releasing causes things to break then don't
release so frequently, but in practice the end result of that is lots of
things breaking at once and having to unpick everything. Debugging is much
easier if you're debugging a single change fresh in your mind.

~~~
asimuvPR
You are right. Making small changes that can be isolated for debugging
purposes is a good approach. What I mean is that we should always question
"best" practices and how we apply them to our development process. These days
there is a tendency to drink the kool-aid (guilty of this as well). DevOps is
something that we are still learning and developing as a profession.

~~~
dozzie
> DevOps is something that we are still learning and developing as a
> profession.

Not really. DevOps is simply a modernish label for system administration,
which we have had for decades already.

~~~
the_other
Isn't that just "Ops"?

~~~
dozzie
My point exactly.

~~~
aurelianito
Yes, but now we actually try to automate in a systematic way, as opposed to a
single admin hacking some custom shell scripts. That's what the "dev" in
devops means. And that's new; at least it was not a common activity in the
20th century.

~~~
dozzie
If by "systematic way" you mean a group of programmers clueless about system
administration hacking together some custom Ansible scripts, then I wouldn't
call it progress.

Sysadmins have had tools to automate their work for a long, long time
(cfengine, bcfg2, even Puppet and Chef predate the DevOps hype). DevOps didn't
bring anything new to the table.

~~~
aurelianito
It's not about doing it right or wrong. It's about automation being reused
across projects, and the fact that a lot of things that used to require a
sysadmin have now been automated away. Whether developers implement it
correctly or poorly is a different issue.

It is also not about hype (or not). I do agree that the name came after the
practice began, but it is the name that we have.

~~~
dozzie
Being a sysadmin was always about automation. DevOps brought nothing to the
table about that. Neither tools nor paradigm. And the name is just another one
for "administering systems", if we keep what you seem to mean by DevOps.

It's still no progress: hacky scripts written by sysadmins, or similarly hacky
scripts written by programmers. A systematic way of automating tasks was
available to sysadmins, and used by them, for a long time. DevOps brought
nothing new to the table.

~~~
asimuvPR
I've always had the impression that a sysadmin (a position I deeply respect)
was more comprehensive than DevOps. DevOps mostly focuses on automating the
infrastructure of software development. A sysadmin does that and more.

------
paukiatwee
Just to confirm: 1000 microservices in this case means 1000 different apps
(e.g. different docker images) running simultaneously, not 1000 microservice
instances (e.g. docker containers)?

If it is 1000 microservices as in different apps, then they must have at least
2000 running apps (at least 2 instances per app for HA).

Maybe Uber only has 200 "active" microservice apps running at the same time,
where each microservice has N running instances.

I just can't imagine running 1000 different microservices (i.e. different
docker images, not docker instances) at the same time.

~~~
mranney
Our number is closer to 1,700 now, but yes this means 1,700 distinct
applications. Each application has many instances, some have thousands of
instances.

~~~
mtrpcic
Could you give some information as to what the breakdown of functionality is
for those services? I can't fathom 1700 different and unique pieces of
functionality that would need to be their own services.

~~~
lnanek2
I work at a microservice company. An example microservice is our geoservice
which simply takes a lat/lon and tells you what service region the user is in
(e.g. New York, San Francisco, etc.). You can see how dozens of these
services might be needed when handling a single request coming in from the
front end or mobile apps. The service may eventually gain another related
function or two as we work on tearing down our monolith (internally referred
to as the tumor because we do anything to keep it from growing...).
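
A toy Python sketch of what such a geoservice might do internally — map a lat/lon to a named service region. The region boxes below are made-up approximations for illustration; a real implementation would use PostGIS or a spatial index rather than bounding boxes:

```python
# name: (min_lat, max_lat, min_lon, max_lon) -- rough, illustrative boxes only
REGIONS = {
    "san_francisco": (37.6, 37.9, -122.6, -122.3),
    "new_york": (40.5, 40.95, -74.3, -73.6),
}

def region_for(lat, lon):
    """Return the service region containing (lat, lon), or None."""
    for name, (lat0, lat1, lon0, lon1) in REGIONS.items():
        if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
            return name
    return None
```

The point is how small the service's whole job is: one lookup, called by many other services on nearly every request.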

~~~
Scarblac
Why would you make that a separate service when e.g. one query on a PostGIS
table can do that?

~~~
user5994461
One query on a table is the exact same thing as a HTTP GET call on a service.

~~~
Scarblac
Yes, but instead of making it a whole new service, you are probably already
using a database and can use that service for this functionality as well.

But since asking the question I've realized that if your application already
needs a huge amount of servers because it simply gets that much traffic, then
putting something like this in its own docker instance is probably the
simplest way (it might even use postgres inside it), if those boundaries
change now and then.

But most companies aren't near that scale.

~~~
user5994461
You're both missing the point here. Both things are conceptually equivalent:

\- Select(db, somekey, someparameters) [return some db object]

\- http_get_query(http://service.com/somekey/someparameters) [return some JSON]

They are both external (micro)services:

\- they both need the target system to be available.

\- they both may fail in weird and unexpected ways.

\- they both need to handle failure gracefully.

Their usage has different properties:

\- A database call needs a permanent connection pool to the database, usually
requiring a db user and password.

\- An HTTP call is just call-and-forget. It's a lot easier to use, in any
application, at any time.

------
eternalban
I think these "alarming trends" are highlighting that operational complexity
is an easier pill to take than conceptual complexity, for most workers in the
field.

Microservices address that gap.

And in the process the field is transformed from one of software developers to
software operators. This generation is witnessing the transfer of the IT crew
from the "back office" to the "boiler room".

~~~
majormajor
Do you think it's that people would rather deal with operational complexity,
or just that it's easier to _not think about_ operational complexity early on
("running locally, with low volume, everything works together great!")?

Personally, I vastly prefer debugging and building on non-microservice
architectures than on something split willy-nilly into dozens of services
(because most implementations I've seen don't do microservices with clean
conceptual boundaries - it's more political/organizational divisions that
determine what lands where, not architectural concerns).

~~~
threeseed
You seem to be missing the point though. Most developers working on
microservices projects don't have to deal with the operational complexity.
They just need to work on their microservice and don't have to know what is
going on elsewhere.

Personally I much prefer microservices for debugging. You can quickly identify
which one is the problem then test the APIs in isolation pretty quickly. Sure
beats having to wait 20 minutes for a monolithic app to build.

~~~
majormajor
So what do these people do when they get data back from one of their
dependency services, and it looks wrong? Or a bug is reported that somewhere
in the chain, something is being dropped on the floor? You say you can quickly
identify which one is the problem, but if that's a chain that spans 5 teams,
how does that actually work in practice? (My experience is that it doesn't.)

That's the sort of thing I was including in operational complexity, not just
the "are the packets making it through" stuff.

~~~
dahauns
Exactly. Testing an API in isolation doesn't help you at all when the bug
arises because of, say, subtle inconsistencies between API implementations of
interacting services on a long chain.

This sounds a lot like the code coverage fallacy. (to which I usually answer
"call me when you have 100% data coverage").

------
fitzwatermellow
Link to Slides here:

[https://gotocon.com/dl/goto-chicago-2016/slides/MattRanney_WhatIWishIHadKnownBeforeScalingUberTo1000Services.pdf](https://gotocon.com/dl/goto-chicago-2016/slides/MattRanney_WhatIWishIHadKnownBeforeScalingUberTo1000Services.pdf)

Any speculation as to why Uber doesn't just want to use something like Netflix
Eureka / Hystrix instead?

~~~
mranney
The Netflix stack has a lot of assumptions about Java, and most of Uber is
written in languages other than Java.

------
Roritharr
I'm not a big green-IT guy, but always pushing your systems close to their
load maxima and then backing off the test traffic as real traffic comes in
feels like an enormous waste of electricity.

~~~
taeric
Almost as much as the overhead of splitting everything to a thousand
microservices...

------
dorianm
I was told by an Uber engineer that because of their hyper-hyper growth, their
tech is basically the biggest mess possible.

That's also what makes it an interesting place to work at, and what helped
them achieve this growth.

Personally I think this should be made into a global app with no geo-fencing
(i.e. available everywhere, basically).

------
StreamBright
Absolutely amazing to watch. I think most of the big companies (Amazon,
Google) already have solutions for these issues like: limited number of
languages, SLAs between services, detailed performance metrics and the ability
to trace microservices.

------
nichochar
Honestly I don't think the problem is microservices. I mean everything he
brings up is true, but it's more of a "how you do microservices" issue.

I used to work in startups, and overall was impressed with velocity. Then I
joined a big valley tech company, and now I understand.

It's because they hire smart and ambitious people, but give them a tiny
vertical to work on. On a personal level, you WANT to build something, so you
force it.

I think you solve this by designing your stack and then hiring meticulously
with rules (like templates for each microservice), instead of HIRE ALL THE
ENGINEERS and then figuring it out (which is quite obviously Uber's problem).

------
mstade
I quite enjoyed watching this. My takeaway isn't so much that this is a
critique or endorsement of microservices, but rather just a series of
observations. Not even lessons learned in many cases, just honest pragmatic
observations. I like how he doesn't seem to judge very much; he obviously has
his own opinion on several of these topics, but seems to let that more or less
slide in order to not get in the way of the observations.

Good talk, will likely watch again.

------
buzzdenver
Did I miss what WIWIK stands for?

~~~
jjbiotech
What I wish I knew

~~~
mawburn
I can't be the only one who keeps reading it as: Whitest Kids U Know.

------
inthewoods
I found this video so super interesting and yet frustrating for completely
personal reasons: the company I work for used to sell a tracing product that
was specifically designed for the distributed tracing problem and handled all
of the issues he highlighted - trace sampling, cross-language/framework
support built in, etc. It was/is based on the same tech as Zipkin but is
production ready. Sadly, he and his team must have spent a huge amount of time
rolling their own rather than ever learning about our product. Now, it still
might not have been a good match, but man, the problems he mentions were right
in the sweet spot of what our product did really, really well.

~~~
jasonshen
This is why sales and marketing are so, so important and yet
overlooked/undervalued. You might have a great product that's perfect for your
customer, but you still have to convince them to use it and pay for it.

~~~
inthewoods
True, and marketing to developers is very challenging: a lot of noise to rise
above.

------
agentultra
What I Wish Small Startups Had Known Before Implementing A Microservices
Architecture:

Know your data. Are you serving ~1000 requests per second peak and have room
to grow? You're not going to gain much efficiency by introducing engineering
complexity, latency, and failure modes.

Best case scenario, your business performs better than expected... does that
mean you have a theoretical upper bound of 100k rps? Still not going to gain
much.

There are so many well-known strategies for coping with scale that I think the
main take-away here for non-Uber companies is to start up-front with some
performance characteristics to design for. Set the upper bound on your
response times to X ms, over-fetch data in order to keep the bound on API
queries to 1-2 requests, etc.

 _Know your data and the program will reveal itself_ is the rule of thumb I
use.

~~~
threeseed
I have no idea why people keep thinking microservices are all about
scalability. It's almost like they've never worked with them before.

Microservices are all about taking a big problem and breaking it down into
smaller components, defining the contracts between the components (which is
what an API is), testing the components in isolation, and most importantly
deploying and running the components independently.

It's the fact that you can make a change to, say, the ShippingService, and
provided that you maintain the same API contract with the rest of your app,
you can deploy it as frequently as you wish, knowing full well that you won't
break anything else.
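
A sketch of what "maintain the same API contract" can look like in practice: a consumer-side check pinning the shape of a hypothetical ShippingService response (the field names here are invented for illustration). The service can be redeployed freely as long as this keeps passing:

```python
# Fields the consumer depends on; hypothetical for this example.
EXPECTED_FIELDS = {"order_id", "carrier", "eta_days"}

def check_contract(response):
    """Fail loudly if the response is missing fields the consumer relies on."""
    missing = EXPECTED_FIELDS - response.keys()
    if missing:
        raise AssertionError(f"contract broken, missing: {sorted(missing)}")
    return True

# In a real setup this response would come from a staging deployment:
check_contract({"order_id": "o1", "carrier": "ups", "eta_days": 2})
```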

It also aligns better with the trend towards smaller container based
deployment methods.

~~~
yummyfajitas
You don't need to build a distributed system for that. Just build the
ShippingServiceLibrary, let others import the jar file (or whatever) and
maintain a stable interface.

I wrote about this a couple of years ago in more detail, so I'll just
reference that:
[https://www.chrisstucchio.com/blog/2014/microservices_for_th...](https://www.chrisstucchio.com/blog/2014/microservices_for_the_grumpy_neckbeard.html)

HN discussion:
[https://news.ycombinator.com/item?id=8227721](https://news.ycombinator.com/item?id=8227721)

~~~
threeseed
The point is that unless you use JVM hot reloading (not recommended in
production), you will need to take down your whole app to upgrade that JAR.
Now what if your microservice was something minor, like a banned-word filter?
Is adding a swear word worthy of a potential outage?

~~~
rer
Is hot reloading that big of a problem?

~~~
betenoire
not for everyone

------
admiralhack_
Great video. I went in expecting it to cover mostly the technical side of
things. Instead Matt gave a great overview of the team / engineering
organization dynamics to watch out for when adopting microservices. (I
particularly liked that he pointed out how developers may write new code /
services instead of dealing with team politics.)

------
gb123
Hey, Matt Ranney, I used your node-pcap library to learn how to parse PCAP :)
Did not know you worked for Uber, thanks!

------
petetnt
Really enjoyed this talk. Our services don't quite (yet :)) run at that scale,
but many of the issues mentioned have already cropped up at some point. It's
also good to have (more) validation of some choices we have made in the past,
are currently making, or are thinking about making in the short-term future.

------
amelius
I didn't see the video. But given that a cab service has a natural sharding
point (i.e., per city), I don't get why scaling is such an issue.

~~~
kowdermeister
Cab service on a planetary scale :) They have 2000 engineers, 800
microservices and 8000 git repositories. Does it still seem trivial to you?

EDIT: to downvoters: why shoot the messenger? :)

~~~
esrauch
I'm curious how a company ends up with 10 git repositories per microservice,
and 4 repositories per engineer?

~~~
sah2ed
That's just a manifestation of Conway's Law [1].

[1]
[https://en.wikipedia.org/wiki/Conway's_law](https://en.wikipedia.org/wiki/Conway's_law)

~~~
taeric
Taken in a healthy organization, Conway's Law is not a bad thing. It is really
just more of an observation. So, applied to this example, it sounds like the
company is a confusing mess of people trying to figure out who they need to
coordinate with to make something happen.

~~~
marcosdumay
> It is really just more of an observation.

Or maybe some instruction into how to assemble the organization.

~~~
taeric
True. I was using it as an argument to adjust the system architecture, as
well. My argument was it didn't matter really which changed to get things into
alignment, but that having them be different was a bit of a concern.

------
gtrubetskoy
I think the world of service architecture is roughly divided in two camps: (1)
people who still naively think that Rest/JSON is cool and schemas and
databases should be flexible and "NoSQL" is nice and (2) people who (having
gone through pains of (1)) realized that strong schemas, things like Thrift,
Protobufs, Avro are a good thing, as is SQL and relational databases, because
rigid is good. (Camp 1 is more likely to be using high-level dynamic languages
like Python and Ruby, and camp 2 will be more on the strongly typed side, e.g.
C/C++, Go, Java).
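
A minimal illustration of the "rigid is good" camp in Python: validating a record against an explicit schema catches bad data at the boundary instead of in the database. (Thrift/Protobuf/Avro do this with generated code; this hand-rolled check and the `RIDE_SCHEMA` fields are just a sketch.)

```python
# Hypothetical schema for an incoming ride record.
RIDE_SCHEMA = {"rider_id": int, "lat": float, "lon": float}

def validate(record, schema):
    """Reject records with missing fields or wrong types at the boundary."""
    for field, typ in schema.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], typ):
            raise TypeError(f"{field} must be {typ.__name__}")
    return record

validate({"rider_id": 1, "lat": 37.77, "lon": -122.42}, RIDE_SCHEMA)  # ok
# validate({"rider_id": "1", "lat": 0.0, "lon": 0.0}, RIDE_SCHEMA) raises
```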

~~~
adekok
> people who still naively think that Rest/JSON is cool and schemas and
> databases should be flexible and "NoSQL" is nice and

Yes, and no.

Yes: REST / JSON is nice. I've used them widely as a kind of cross-platform
compatibility layer. i.e. instead of exposing SQL or something similar, the
API is all REST / JSON. That lets everyone use tools they're familiar with,
without learning about implementation details.

The REST / JSON system ends up being a thin shim layer over the underlying
database. Which is usually SQL.

No: databases should NOT be flexible, and "NoSQL" has a very limited place.

SQL databases should be conservative in what they accept. Once you've inserted
crap into the DB, it's hard to fix it.

"NoSQL" solutions are great for situations where you don't care about the
data. Using NoSQL as a fast cache means you (mostly) have disk persistence
when you need to reboot the server or application. If the data gets lost, you
don't care, it's just a cache.

~~~
fail2fail2ban
> SQL databases should be conservative in what they accept. Once you've
> inserted crap into the DB, it's hard to fix it.

You can make your schema very light and accepting, almost like NoSQL, which is
how you get into the situation you described; the solution is to use a
stricter schema. That, and it helps to hire a full-time data
engineer/administrator.

> Using NoSQL as a fast cache

I'd rather use caching technology, specifically designed for caching, like
Redis or Varnish or Squid.

------
andrewvijay
Highly recommended video. Lots of the stuff he spoke about is very relatable,
like having many repos, storing configs in a separate repo, politics by
people, having a tracking system.

~~~
babyrainbow
Sorry, don't think so. I went through the slides and didn't find anything
really interesting.

~~~
softawre
You went through the slides but didn't watch the video? If so, you can't
really make the argument that it isn't interesting.

